NOTE

Communicated by Norman Draper

Mixture Models Based on Neural Network Averaging

Walter W. Focke
[email protected]
Institute of Applied Materials, Department of Chemical Engineering, University of Pretoria, Pretoria 0001, South Africa
A modified version of the single hidden-layer perceptron architecture is proposed for modeling mixtures. A particularly flexible mixture model is obtained by implementing the Box-Cox transformation as the transfer function. In this case, the network response can be expressed in closed form as a weighted power mean. The quadratic Scheffé K-polynomial and the exponential Wilson equation turn out to be special forms of this general mixture model. Advantages of the proposed network architecture are that binary data sets suffice for "training" and that it is readily extended to incorporate additional mixture components while retaining all previously determined weights.

1 Introduction

Complex mixtures arise in diverse situations. They abound in nature as rocks, ores, coal, seawater, crude oil, natural gas, and the atmosphere. Human activity also leads to the creation of mixtures. This may be deliberate and purposeful, as in the alloying of metals to create special alloys (e.g., stainless steel) or in the formulation of a chemical product (e.g., paint). It may also be unintended (e.g., through side reactions occurring during chemical conversions).

Occasionally it may be necessary to express the properties of a mixture in terms of its composition. Global models—those that are able to correlate mixture behavior over the entire factor space—are desirable. Predictive mechanistic theories are rare; thus, empirical mixture models are the norm. This implies the need for experimental data to fix adjustable parameters and also for statistical tests to validate the model and parameter choices. The selection of the functional form is crucial: statistical tests are valid only under the assumption that the postulated model is also valid (Bates & Watts, 1988). The empirical model should satisfy the following requirements:

- The mathematical form should be sufficiently flexible to correlate the underlying information without unnecessary restrictions.
- It should be consistent with available physical theory, especially in some limits of simplification.
Neural Computation 18, 1–9 (2006)  © 2005 Massachusetts Institute of Technology
- The parameters should be easy to interpret and estimate with common multivariate estimation techniques.
- It should be parameter parsimonious. Ideally, the coefficients should be obtainable from pure component properties or, at most, from binary mixture information (Prausnitz, Lichtenthaler, & de Azevedo, 1999). The model should then be predictive for the general multivariate case.
Let R_+ denote the set of all positive real numbers and R_+^n its n-fold product. A mixture property is a measured scalar output y ∈ R_+ that depends solely on the n-tuple x = (x_1, x_2, ..., x_n)^T ∈ R_+^n that expresses mixture composition in terms of dimensionless fractional units (e.g., mol fraction). This implies the restrictions x_i ≥ 0 and the simplex constraint:

x_1 + x_2 + ... + x_n = 1.   (1.1)
Cornell (2002) provides a comprehensive review of experimental design for mixtures. The recommended model forms for y = y(x) are the Scheffé K-polynomials (Scheffé, 1958; Draper & Pukelsheim, 1998). They offer advantages such as homogeneity of regression terms and reduced ill conditioning of the information matrix (Prescott, Dean, Draper, & Lewis, 2002).

A common conjecture in chemical engineering is that only binary interactions between species need to be considered in mixtures (Hamad, 1998; Prausnitz et al., 1999). This is a strong assumption but is accepted here for the simplicity and convenience it introduces. If it were true, experimental data from the constituent binary subsystems would suffice to predict the multicomponent behavior. This disqualifies the higher-order Scheffé polynomials from consideration: they include ternary and higher-interaction parameters that cannot necessarily be determined from binary data alone.

The use of classic mixture models is now well established (Cornell, 2002). Application of neural networks to mixture correlation and prediction is a recent trend (Rowe & Colbourn, 2003). This communication considers simple global mixture models inspired by neural network architecture.

2 A Neural Network Model for Mixture Properties

Figure 1 shows the proposed neural network for modeling mixture properties. It is a modified version of the popular single hidden-layer perceptron (SHLP) architecture (Hagan, Demuth, & Beale, 1996). The following specifications apply:
- Each of the n components in the mixture is associated with a specific input node–hidden layer neuron pair. This feature reflects the notion that the mixture components form a set of orthogonal basis vectors. Adding or removing such a node-neuron pair is equivalent to adding or removing this component from the mixture.
Figure 1: Neural network for mixture property estimation.
- The hidden neurons apply an arbitrary but strictly monotonic transfer function to the sum of weighted inputs x_i.
- The output neuron employs the inputs x_i as variable weights and applies the inverse of the transform used by the hidden neurons. This feature embodies the desire that the output should reflect a weighted average of the pure component and binary mixture properties only. It also ensures the dimensional homogeneity of the output response: physical properties are dimensional quantities expressed in terms of characteristic units, which requires that arguments to nonlinear functions such as logarithmic, exponential, or trigonometric functions be dimensionless numbers.
Conceptually, the output of this modified neural network can be interpreted as a generalized quasi-arithmetic mean of the hidden-layer neuron summations u_i (Bullen, 2003):

y = M_n(u, x) := T^{-1} \left( \sum_{i=1}^{n} x_i T(u_i) \right).   (2.1)

This result is unaffected when the transform T in equation 2.1 is replaced by an arbitrary linear combination αT + β, provided α ≠ 0 (Bullen, 2003).
The u_i in equation 2.1 are distinct positive functions defined on the weights a_{ij} ∈ R_+ as follows:

u_i = \sum_{j=1}^{n} a_{ij} x_j.   (2.2)
The model has a total of up to n² adjustable parameters. Of these, the n coefficients a_{ii} are determined by pure component behavior, whereas the n(n − 1) coefficients a_{ij} quantify the nonideal behavior of the corresponding binary mixtures. Thus, in theory at least, binary data sets suffice for "training" the network to predict multicomponent behavior. Furthermore, adding more components does not affect the weights of the existing network. In practice, the quality of predictions depends on the choice of transfer function and on whether the postulate that "binary interactions suffice" is well founded. The full mathematical form of equation 2.1, found by substituting equation 2.2 into 2.1, is

y = T^{-1} \left( \sum_{i=1}^{n} x_i T \left( \sum_{j=1}^{n} a_{ij} x_j \right) \right).   (2.3)
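As a concrete illustration of equations 2.1 to 2.3, the network response can be computed in a few lines. The sketch below is illustrative Python with my own function names and hypothetical weights; it is not code from this note.

```python
import math

def network_output(x, a, T, T_inv):
    """Mixture response of eq. 2.3: y = T^{-1}( sum_i x_i T( sum_j a_ij x_j ) )."""
    n = len(x)
    u = [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]  # eq. 2.2
    return T_inv(sum(x[i] * T(u[i]) for i in range(n)))            # eq. 2.1

# Hypothetical two-component weights; row a[i] holds a_i1 ... a_in.
a = [[2.0, 3.0],
     [1.0, 4.0]]
x = [0.3, 0.7]  # mole fractions, summing to 1

y_linear = network_output(x, a, lambda u: u, lambda t: t)  # T(u) = u
y_geom = network_output(x, a, math.log, math.exp)          # T(u) = ln(u)
```

With the identity transfer the call reproduces the quadratic form of equation 2.5 below, and with the logarithm it gives the weighted geometric mean of equation 2.7.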
When a_{ij} = a ∀ i, j, the nature of the transfer function is immaterial, as the network output is just y = a. With a_{ij} = a_{ii} ∀ i, j, it returns the weighted quasi-arithmetic mean of the pure component property values a_{ii}:

y = T^{-1} \left( \sum_{i=1}^{n} x_i T(a_{ii}) \right).   (2.4)
In effect, equation 2.4 states that the transform of the dependent variable, T(y), varies linearly with composition.

The simplest possible transfer function is the linear transformation T(u) = u. Implementing it in equation 2.1, for the mixture neural network of Figure 1, yields

y = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j = \sum_{i=1}^{n} a_{ii} x_i^2 + \sum_{i=1}^{n} \sum_{j>i} (a_{ij} + a_{ji}) x_i x_j.   (2.5)
Equation 2.5 corresponds exactly to the second-degree Scheffé K-polynomial (Draper & Pukelsheim, 1998). It is conventional to correct for the overparameterization of this model by setting a_{ij} = a_{ji} (Cornell, 2002; Scheffé, 1958). When a_{ij} + a_{ji} = a_{ii} + a_{jj}, the output defined by equation 2.5 is linear in the mole fractions:

y = \sum_{i=1}^{n} a_{ii} x_i.   (2.6)
Employing the logarithmic transformation T(u) = ln(u) instead yields the output

y = \prod_{i=1}^{n} \left( \sum_{j=1}^{n} a_{ij} x_j \right)^{x_i}.   (2.7)
This is the weighted geometric mean mixture model, an exponential form of the semitheoretical Wilson (1964) model used for the excess Gibbs energy of mixtures.

The Box-Cox transformation (Box & Cox, 1964) is usually applied to reduce the heteroskedasticity of the residuals and to bring them closer to a normal distribution. It is defined by

T(u) = (u^r − 1)/r   for r ≠ 0,   (2.8a)

T(u) → ln(u)   for r → 0.   (2.8b)
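The transform pair of equations 2.8a and 2.8b is straightforward to implement. Here is a brief Python sketch (function names are mine, not part of the note), with the r → 0 branch handled explicitly:

```python
import math

def box_cox(u, r):
    """Box-Cox transform (eq. 2.8): (u^r - 1)/r for r != 0, ln(u) in the limit r -> 0."""
    return math.log(u) if r == 0 else (u ** r - 1.0) / r

def box_cox_inv(t, r):
    """Inverse of the Box-Cox transform, recovering u from t = box_cox(u, r)."""
    return math.exp(t) if r == 0 else (r * t + 1.0) ** (1.0 / r)
```

Because the transform is strictly monotonic for u > 0, it qualifies as a transfer function for the network above, and small |r| approaches the logarithmic case continuously.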
When equations 2.8a and 2.8b define the transfer function, the neural network output resembles the generalized power mean constructed by Ku, Ku, and Zhang (1999):

y = k_n^{[r]}(u, x) := \lim_{s \to r^+} \left( \sum_{i=1}^{n} x_i [u_i(x)]^s \right)^{1/s},   (2.9)
where r ∈ R and, as before, the u_i are defined by equation 2.2. Equation 2.9 provides a flexible functional framework that includes both models described above: setting the exponent r = 1 yields the Scheffé quadratic K-polynomial, whereas r = 0 recovers the exponential Wilson model. With a_{ij} = a_{jj} ∀ i, j, equation 2.9 also reduces to the linear form

y = \sum_{i=1}^{n} a_{ii} x_i.   (2.10)
When a_{ij} = a_{ii} ∀ i, j, equation 2.9 simplifies to a special case of equation 2.4:

y^r = \sum_{i=1}^{n} x_i a_{ii}^r.   (2.11)
The model of equation 2.9 may appear to have too many adjustable parameters. Strategies exist to reduce their number (Darroch & Waller, 1985). For example, when a_{ij} = b_{ii} ∀ j ≠ i, equation 2.2 simplifies to

u_i = a_{ii} x_i + b_{ii}(1 − x_i).   (2.12)

This leads to a drastic reduction of the number of adjustable parameters, from n² to a total of just 2n (excluding the parameter r).

3 Model Consistency

Means, such as the generalized power means, are defined by an infinite sequence of continuous and strictly monotonic real functions k_1^{[r]}(u_1, x_1 = 1) = a_{11}; k_2^{[r]}(u_1, u_2, x_1, x_2); ...; k_n^{[r]}(u_1, u_2, ..., u_n, x_1, x_2, ..., x_n); ... associated with a characteristic set of axioms (Bullen, 2003). Inspection reveals that this model satisfies the following elementary consistency requirements (Hamad, 1998):
- Parameter values do not change when more components are added to the mixture.
- The mixture property y reduces to the pure component value when any mole fraction approaches unity, that is, k_n^{[r]}(u, e_k) = a_{kk}, where e_k = (0, ..., 0, x_k = 1, 0, ..., 0)^T is the k-th orthonormal basis vector of R_+^n.
- The relation for an n-component mixture reduces to the corresponding (n − 1)-component form in the limit of infinite dilution of one of the components: \lim_{x_n \to 0} k_n^{[r]}(u, x) = k_{n-1}^{[r]}(u, x).
The following axioms are relevant with regard to mixture-model consistency:

Symmetry: k_n^{[r]}(u, x) is not changed when the u and x are permuted simultaneously. This follows from the commutative law of addition. Symmetry implies that predicted property values are independent of the way in which component indices are assigned.

Reflexivity: k_n^{[r]}(a, a, ..., a, x_1, x_2, ..., x_n) = a. Note that u_i = a when a_{ij} = a ∀ j.
Decomposability: According to Michelsen and Kistenmacher (1990), a consistent model is also invariant with respect to dividing one component into two or more identical subcomponents. The appendix shows that the current model conforms to this requirement.

Homogeneity: k_n^{[r]}(u, x) is homogeneous of order one, that is, for all λ:

k_n^{[r]}(λu_1, λu_2, ..., λu_n, x_1, ..., x_n) = λ k_n^{[r]}(u_1, u_2, ..., u_n, x_1, ..., x_n).   (3.1)

The homogeneity property ensures that the model is dimensionally homogeneous. The u_i are linear combinations of the a_{ij}, j = 1, 2, ..., n. Therefore, the model is also homogeneous of degree one in the parameters a_{ij}. According to Euler's theorem on homogeneous functions of degree one, it follows that

y = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} \frac{\partial y}{\partial a_{ij}}.   (3.2)
The relative condition number of y with respect to the parameter a_{ij} is defined as (Higham, 2002)

C_R(a_{ij}) = \frac{a_{ij}}{y} \frac{\partial y}{\partial a_{ij}}.   (3.3)
This number quantifies the sensitivity of a function with respect to small changes in a parameter value a_{ij} as follows. If |C_R(a_{ij})| ≪ 1, the function is very well conditioned; when |C_R(a_{ij})| ≈ 1, it is well conditioned; but if |C_R(a_{ij})| ≫ 1, it is considered to be ill conditioned. Combining Higham's (2002) definition, equation 3.3, with Euler's result, equation 3.2, reveals that the relative condition numbers sum to unity:

\sum_{i=1}^{n} \sum_{j=1}^{n} C_R(a_{ij}) = 1.   (3.4)
The generalized weighted power mean is a monotonic increasing function of the a ij . Therefore, all C R (a ij ) ≥ 0, and it immediately follows from equation 3.4 that this model is intrinsically well conditioned with respect to all its adjustable parameters.
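This claim can be checked numerically. The following Python sketch estimates the C_R(a_{ij}) of equation 3.3 by central finite differences and verifies equation 3.4; the weights, exponent, and step size are illustrative choices of mine.

```python
def power_mean_mixture(x, a, r):
    """Generalized weighted power mean of eq. 2.9 (assumed r != 0 here)."""
    n = len(x)
    u = [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(x[i] * u[i] ** r for i in range(n)) ** (1.0 / r)

def condition_numbers(x, a, r, h=1e-6):
    """C_R(a_ij) = (a_ij / y) * dy/da_ij, estimated by central differences (eq. 3.3)."""
    n = len(x)
    y = power_mean_mixture(x, a, r)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            a[i][j] += h
            y_plus = power_mean_mixture(x, a, r)
            a[i][j] -= 2.0 * h
            y_minus = power_mean_mixture(x, a, r)
            a[i][j] += h  # restore the original weight
            C[i][j] = a[i][j] * (y_plus - y_minus) / (2.0 * h) / y
    return C

C = condition_numbers([0.3, 0.7], [[2.0, 3.0], [1.0, 4.0]], r=2.0)
total = sum(sum(row) for row in C)  # eq. 3.4: the entries should sum to 1
```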
4 Conclusion

Neural network computing is often equated with a black-box modeling approach. Critics also point out that the values of the weights in the network per se have no physical meaning. Thus, it is hard to account for, and properly validate, results obtained with a neural network. In this study, quite the opposite was achieved: neural network architecture analysis led to conceptual insight. It revealed an underlying unity between the empirical quadratic Scheffé polynomials and the semitheoretical Wilson models. Both models are special cases of the more general model defined by equation 2.9.

Appendix: Decomposability of the Generalized Weighted Power Mean

Assume that components n − 1 and n are in fact identical, with a_{(n−1),k} = a_{n,k} ∀ k and therefore also u_{n−1}(x) = u_n(x) ∀ x. This is justified by symmetry. To show that the model is invariant with respect to dividing one component into two or more identical subcomponents, it is sufficient to prove that

k_n^{[r]}(u_1, ..., u_{n−1}, u_n, x_1, ..., x'_{n−1}, x'_n) = k_{n−1}^{[r]}(u_1, ..., u_{n−1}, x_1, ..., x_{n−1}).

Here, x_{n−1} = x'_{n−1} + x'_n, with x'_{n−1} and x'_n highlighting the mole fractions associated with the two identical components:

u_i = a_{i1} x_1 + a_{i2} x_2 + ... + a_{i(n−1)} x'_{n−1} + a_{in} x'_n
    = a_{i1} x_1 + a_{i2} x_2 + ... + a_{i(n−1)} (x'_{n−1} + x'_n)
    = a_{i1} x_1 + a_{i2} x_2 + ... + a_{i(n−1)} x_{n−1}.

From this it also follows that u_{n−1} = u_n and thus that

x'_{n−1} u_{n−1}^r + x'_n u_n^r = (x'_{n−1} + x'_n) u_{n−1}^r = x_{n−1} u_{n−1}^r.

Now

k_n^{[r]}(u, x) = (x_1 u_1^r + x_2 u_2^r + ... + x'_{n−1} u_{n−1}^r + x'_n u_n^r)^{1/r}
               = (x_1 u_1^r + x_2 u_2^r + ... + x_{n−1} u_{n−1}^r)^{1/r}
               = k_{n−1}^{[r]}(u, x).

This completes the proof.
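A numerical spot-check of this result is easy: split one component into two identical subcomponents by duplicating its row and column in the weight matrix and dividing its mole fraction. The Python sketch below uses hypothetical weights of my own choosing.

```python
def power_mean_mixture(x, a, r):
    """Generalized weighted power mean of eq. 2.9 (r != 0)."""
    n = len(x)
    u = [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(x[i] * u[i] ** r for i in range(n)) ** (1.0 / r)

# Hypothetical two-component system.
a2 = [[2.0, 3.0],
      [1.0, 4.0]]
x2 = [0.4, 0.6]

# Component 2 split into two identical subcomponents:
# its row and column are duplicated, and its mole fraction is divided up.
a3 = [[2.0, 3.0, 3.0],
      [1.0, 4.0, 4.0],
      [1.0, 4.0, 4.0]]
x3 = [0.4, 0.25, 0.35]  # 0.25 + 0.35 = 0.6

y2 = power_mean_mixture(x2, a2, 1.5)
y3 = power_mean_mixture(x3, a3, 1.5)
assert abs(y2 - y3) < 1e-12  # decomposability: the split leaves y unchanged
```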
Acknowledgments

I gratefully acknowledge financial support for this research from the THRIP program of the Department of Trade and Industry and the National Research Foundation of South Africa as well as Xyris Technology.

References

Bates, D. M., & Watts, D. G. (1988). Nonlinear regression analysis and its applications. New York: Wiley.
Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. J. Roy. Statist. Soc. B, 26, 211–243; discussion, 244–252.
Bullen, P. S. (2003). Handbook of means and their inequalities. Dordrecht: Kluwer.
Cornell, J. A. (2002). Experiments with mixtures (3rd ed.). New York: Wiley.
Darroch, J. N., & Waller, J. (1985). Additivity and interaction in three-component experiments with mixtures. Biometrika, 72, 153–163.
Draper, N. R., & Pukelsheim, F. (1998). Mixture models based on homogeneous polynomials. J. Statist. Plann. Inference, 71, 303–311.
Hagan, M. T., Demuth, H. B., & Beale, M. (1996). Neural network design. Boston: PWS Publishing.
Hamad, E. Z. (1998). Exact limits of mixture properties and excess thermodynamic functions. Fluid Phase Equilibria, 142, 163–184.
Higham, N. J. (2002). Accuracy and stability of numerical algorithms (2nd ed.). Philadelphia: SIAM.
Ku, H. T., Ku, M. C., & Zhang, X. M. (1999). Generalized power means and interpolating inequalities. Proc. Amer. Math. Soc., 127, 145–154.
Michelsen, M. L., & Kistenmacher, H. (1990). On composition-dependent interaction coefficients. Fluid Phase Equilibria, 58, 229–230.
Prausnitz, J. M., Lichtenthaler, R. N., & de Azevedo, E. G. (1999). Molecular thermodynamics of fluid-phase equilibria. Upper Saddle River, NJ: Prentice Hall.
Prescott, P., Dean, A., Draper, N., & Lewis, S. (2002). Mixture experiments: Ill-conditioning and quadratic model specification. Technometrics, 44, 260–268.
Rowe, R. C., & Colbourn, E. A. (2003). Neural computing in product formulation. Chem. Educator, 8, 1–8.
Scheffé, H. (1958). Experiments with mixtures. J. Roy. Statist. Soc. B, 20, 344–360.
Wilson, G. M. (1964). Vapor-liquid equilibrium. XI. A new expression for the excess free energy of mixing. J. Am. Chem. Soc., 86, 127–130.
Received March 23, 2005; accepted May 20, 2005.
LETTER

Communicated by Maxim Bazhenov

Sensory Memory for Odors Is Encoded in Spontaneous Correlated Activity Between Olfactory Glomeruli

Roberto F. Galán
[email protected]
Institute for Theoretical Biology, Humboldt-Universität zu Berlin, 10115 Berlin, Germany, and Department of Biological Sciences and Center for the Neural Basis of Cognition, Mellon Institute, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.

Marcel Weidert
[email protected]

Randolf Menzel
[email protected]
Institute for Neurobiology, Freie Universität Berlin, 14195 Berlin, Germany

Andreas V. M. Herz
[email protected]
Institute for Theoretical Biology, Humboldt-Universität zu Berlin, 10115 Berlin, Germany

C. Giovanni Galizia
[email protected]
Department of Entomology, University of California, Riverside, CA 92521, U.S.A.
Sensory memory is a short-lived persistence of a sensory stimulus in the nervous system, such as iconic memory in the visual system. However, little is known about the mechanisms underlying olfactory sensory memory. We have therefore analyzed the effect of odor stimuli on the first odor-processing network in the honeybee brain, the antennal lobe, which corresponds to the vertebrate olfactory bulb. We stained output neurons with a calcium-sensitive dye and measured across-glomerular patterns of spontaneous activity before and after a stimulus. Such a single-odor presentation changed the relative timing of spontaneous activity across glomeruli in accordance with Hebb’s theory of learning. Moreover, during the first few minutes after odor presentation, correlations between the spontaneous activity fluctuations suffice to reconstruct the stimulus. As spontaneous activity is ubiquitous in the brain, modifiable fluctuations could provide an ideal substrate for Hebbian reverberations and sensory memory in other neural systems.
Neural Computation 18, 10–25 (2006)  © 2005 Massachusetts Institute of Technology
1 Introduction

Animals are able to evaluate a sensory stimulus for a certain time after stimulus offset, as in trace conditioning or delay learning (Clark, Manns, & Squire, 2002; Grossmann, 1971). Such sensory memories have been extensively investigated in the visual system, for example, for afterimages (i.e., visual persistence) or iconic memory, and for the acoustic system (Crowder, 2003). They imply that a neural representation of the stimulus remains active after the stimulus, for example, during the time interval between a presented cue and a task to be performed or an association to be made. "Delayed matching and nonmatching to sample" paradigms also rely on a temporary storage of sensory information, generally referred to as working memory (Del Giudice, Fusi, & Mattia, 2003). Such tasks have also been successfully solved by the honeybee, Apis mellifera, and prove that it possesses an exquisite sensory and working memory (Giurfa, Zhang, Jenett, Menzel, & Srinivasan, 2001; Grossmann, 1971; Menzel, 2001).

Analyzing the neural basis of working memory in vertebrates, cortical neurons have been found that elevate their discharge rate during a delay period (Fuster & Alexander, 1971). These findings suggest a straightforward realization of Hebbian "reverberations" (Hebb, 1949) in that persistently active delay cells provide the memory trace and thus allow the animal to compare sequentially presented stimuli (Amit & Mongillo, 2003).

In all of these studies, however, the investigation of sensory memory was embedded in a more complex paradigm in order to allow for a behavioral readout. Therefore, it cannot be excluded that the physiological traces of sensory memory contained a context-dependent or task-dependent component that is difficult to isolate. Taking a purely physiological approach and using the honeybee as an experimental animal, we asked whether a stimulus alone could modify brain activity in a way that would suggest a sensory memory.
Odor stimuli are particularly salient for honeybees (Menzel & Müller, 1996). We therefore sought an initial memory trace following a nonreinforced odor stimulus. We find that the relative timing of spontaneous activity events in the antennal lobe is modified after a passive olfactory experience. This change follows the Hebbian covariance rule: pairs of coactivated or coinhibited glomeruli increased their correlation; glomeruli with an opposite response sign decreased their correlation (Sejnowski, 1977). Unlike the implications of Hebb's rules in developmental studies ("fire together, wire together"), the effect observed here was short-lived, with a decay time of a few minutes. We therefore suggest that this form of short-term Hebbian plasticity in the honeybee antennal lobe serves as an olfactory sensory memory.

2 Materials and Methods

2.1 Imaging. Bees were prepared as described elsewhere (Sachse & Galizia, 2002). Briefly, forager bees (Apis mellifera carnica) were collected
from the hive, cooled for anesthesia, and restrained in Plexiglas stages. A window was cut into the cuticle to expose the brain. Then the calcium-sensitive tracer dye FURA-dextran (MW 10,000; Molecular Probes) was injected into the tract of PN axons that leads from the antennal lobe (AL) to the mushroom bodies (lateral antenno-cerebralis-tract, lACT). After 3 to 5 hours, the bees were tested for successful staining of PNs using a fluorescent microscope: successful staining was evident when the PN cell bodies were brightly fluorescent under 380 nm light exposure. We then recorded the calcium activity in the antennal lobe for 4 minutes, at a rate of six image pairs (340 nm/380 nm) per second. (Some animals were recorded for longer periods.) After 2 minutes (at half time), we presented an odor for 4 seconds, using a computer-controlled olfactometer (Galizia, Joerges, Küttner, Faber, & Menzel, 1997). Odors used differed between animals and were 1-hexanol, 1-octanol, limonene, and a mixture of limonene and linalool (all from Sigma-Aldrich). We excluded the 4 seconds before, during, and after stimulus presentation (a total of 12 seconds) from the analysis to ensure that no direct stimulus response would contaminate the analysis. Great care was taken not to expose the animal to any test odor before the experiment, and no animal was investigated twice. Recordings were done using a Till-Photonics monochromator and CCD-based imaging system (Olympus BX50WI microscope, Olympus 20W dip objective, NA 0.5; dichroic 410, LP 440 nm). Nine animals were examined. The experimental protocol complied with German law governing animal care.

2.2 Data Preanalysis. We calculated the ratio between the 340 nm and the 380 nm images for each measured point in time. In the FURA-dextran staining, individual glomeruli were clearly visible. We could thus identify the glomeruli on the basis of the morphological atlas of the honeybee AL (Galizia, McIlwrath, & Menzel, 1999).
For each AL, we identified between 8 and 15 glomeruli (mean = 11.6; SD = 2.1). Time traces for each glomerulus were calculated by spatially averaging the pixel values belonging to that glomerulus. For each AL, this procedure yielded two sets of matrices (before and after stimulation) consisting of between 8 and 15 glomeruli at 696 and 672 points in time, respectively (6 images per second for 2 minutes, minus 4 and 8 seconds, respectively).

2.3 Mathematical Analysis. The measured traces were high-pass filtered (cut-off frequency 0.025 Hz) to remove effects caused by a differential offset in different glomeruli, thus generating zero-mean time series. (Traces in Figures 1B and 1C are not filtered.) Odor responses were described as vectors whose components measure the maximum glomerular activity elicited by the specific odor. Glomeruli were categorized as activated or inhibited by an odor by visual inspection of the response trace. Pair-wise correlations were calculated as the correlation coefficient of their spontaneous activity. Lagged correlations were normalized to the correlation coefficient for zero
lag and were calculated for relative shifts at 1 second intervals between −5 seconds and +5 seconds. All lags gave qualitatively the same results; Figure 3B shows the data for a lag of 3 seconds. The similarity between odor response and spontaneous activity was measured as their scalar product at every measured point in time. Correlation matrices across glomeruli were calculated for different time windows, taking the entire stretch before the stimulus, after the stimulus, or within four 1 minute intervals. We calculated the difference matrix as the difference between the correlation after and before stimulus. Significance for each element of the differential correlation matrix was calculated by bootstrapping the original data (2000 replications, α = 0.05). A principal component analysis was performed on the correlation matrices and the difference matrix, and the first principal components (PCs) were extracted. The significances between glomerular activity patterns and the first PC of the spontaneous activity matrix were assessed by Kendall's nonparametric correlation coefficient r (Press, Teukolsky, Vetterling, & Flannery, 1992). Distributions of similarities across time and animals met normality conditions and were tested with an ANOVA.

3 Results

Odors evoke combinatorial patterns of activity across olfactory glomeruli in the first brain area that processes olfactory input, the insect antennal lobe (AL) or the vertebrate olfactory bulb (Hildebrand & Shepherd, 1997; Korsching, 2002). These patterns can be visualized using optical imaging techniques (Galizia & Menzel, 2001; Sachse & Galizia, 2002). In honeybees, olfactory projection neurons (PNs) show a high degree of spontaneous activity, that is, activity in the absence of an olfactory stimulus (Abel, Rybak, & Menzel, 2001; Müller, Abel, Brandt, Zöckler, & Menzel, 2002), which can also be found in other species (Hansson & Christensen, 1999).
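As a rough sketch of the pair-wise correlation test described in section 2.3 (this is not the authors' actual code; the trace format, replication count, and resampling scheme here are simplified assumptions of mine), the correlation change of one glomerulus pair can be bootstrapped like this:

```python
import random

def corr(a, b):
    """Pearson correlation coefficient of two equal-length traces."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    ca = [v - ma for v in a]
    cb = [v - mb for v in b]
    num = sum(p * q for p, q in zip(ca, cb))
    den = (sum(p * p for p in ca) * sum(q * q for q in cb)) ** 0.5
    return num / den

def corr_change_ci(before, after, n_boot=2000, alpha=0.05):
    """Bootstrap CI for corr(after) - corr(before) of one glomerulus pair.

    `before` and `after` are (trace_a, trace_b) tuples; time points are
    resampled with replacement, keeping the pairing of the two traces.
    """
    (b1, b2), (a1, a2) = before, after
    observed = corr(a1, a2) - corr(b1, b2)
    reps = []
    for _ in range(n_boot):
        ib = [random.randrange(len(b1)) for _ in b1]
        ia = [random.randrange(len(a1)) for _ in a1]
        reps.append(corr([a1[k] for k in ia], [a2[k] for k in ia])
                    - corr([b1[k] for k in ib], [b2[k] for k in ib]))
    reps.sort()
    lo = reps[int(n_boot * alpha / 2)]
    hi = reps[int(n_boot * (1 - alpha / 2)) - 1]
    return observed, (lo, hi), not (lo <= 0.0 <= hi)
```

A pair would be flagged as significantly changed when the (1 − α) interval for the difference excludes zero; the study applied such a bootstrap test element-wise to the full differential correlation matrix.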
We have not investigated the driving force of this spontaneous background activity, but it appears likely to be controlled by a recurrent inhibitory network within the antennal lobe, since spontaneous activity is blocked by GABA and increased by the chloride channel blocker picrotoxin, and it may also rely on background activity from sensory neurons (Sachse & Galizia, 2005). In many neurons, there is a direct relationship between the intracellular calcium concentration and the neuron's activity (Haag & Borst, 2000; Ikegaya et al., 2004). PNs that show a high occurrence of action-potential bursts in electrophysiological recordings also show a continuous fluctuation of intracellular calcium levels at the same temporal scale (Galizia & Kimmerle, 2004). Therefore, it is possible to use calcium imaging to study bursts of spontaneous activity in olfactory PNs (Sachse & Galizia, 2002). To measure the spatiotemporal extent of such spontaneous activity patterns, we applied the calcium-sensitive dye FURA-dextran to the axon tract leaving the AL and obtained a backfill of PNs within the AL. Optical recording of the PNs confirmed that their intracellular calcium
concentration was constantly changing; the spontaneous activity in the AL did not consist of longer periods of sustained activity but of short, sporadic activity bouts (see Figures 1A to 1D; see also additional data on the Web). These spontaneous events were glomerular in the sense that the spatial extent of individual activity spots always corresponded to a glomerulus in the morphological view. Their amplitude varied across both glomeruli and in time.

Stimulating the antennae with an odor led to characteristic odor-evoked activity patterns. The glomeruli activated in these patterns corresponded to those that were expected for the tested odors from previous studies (Galizia, Sachse, Rappert, & Menzel, 1999; Sachse & Galizia, 2002). The magnitude of intracellular calcium increases in an odor response was up to an order of magnitude higher than those of typical spontaneous activity fluctuations (see Figure 1B). After stimulation, odor-evoked calcium levels decayed back to baseline within a few seconds. This is consistent with measurements from olfactory receptor neurons, which in most cases stop firing at the end of an olfactory stimulus (de Bruyne, Foster, & Carlson, 2001; Getz & Akers, 1994), and supports the notion that the phenomena reported below are intrinsic to the AL and do not reflect an odor-specific structure in the input from receptors. This finding suggests that if there is a sensory memory within the AL, it cannot rely on persistent activity, that is, increased mean discharge rates after stimulation, as observed in other systems (Amit & Mongillo, 2003; Fuster & Alexander, 1971).

Figure 1: Unrewarded odor stimuli affect the spontaneous activity in the honeybee antennal lobe (AL), as illustrated by data from one bee. (A, left) Fluorescent image of the AL with projection neurons (PNs) stained with Rhodamin-dextran. (A, right) Sequence of spontaneous activity frames before, during (red frame), and after odor exposure (odor: 1-octanol).
Note the similarity of the activity pattern at 80 seconds and 110 seconds with the odor response. Activity is color coded with equal scaling in all images. (For the entire measurement, go to http://galizia.ucr.edu/spontActivity.) (B) Raw traces for three identified glomeruli (red: T1-17; green: T1-33; black: T1-62) over a 240 second stretch. Stimulus lasted for 4 seconds, starting at t = 0, as indicated by the bar. After the stimulus, activity fluctuations repeatedly co-occur. (C) Close-up view of two stretches in B. Open triangles indicate some occurrences where either glomerulus 17 (red) or 33 (green) were independently active. Closed triangles are those where both were active at the same time. Not all such instances are marked. (D) Projection of the spontaneous activity across all identified glomeruli before, during, and after odor presentation onto the activity during odor presentation itself. Filled triangles above the trace indicate all instances where the similarity measure is greater than 0.75 (dotted line). Such events are more prevalent after stimulus presentation, where the trace fluctuates more strongly rather than staying for longer periods in the high-similarity regime. The activity trace was high-pass filtered.
However, the relative timing of activity events in some glomeruli appeared to be altered. For instance, in Figure 1B, simultaneous fluctuations of glomerulus 17 and 33 are seen more frequently after the odor stimulus than before. We found that the glomeruli that increased their coactivity were those that had been activated by the odor stimulus. As shown in Figure 1C, coactive events between such glomeruli could also occur before stimulation (see filled triangles) and as independent activity bouts (empty triangles). After stimulation, coactive events were more frequent (filled triangles), and independent bouts were rare (none in the example here). Sensory memory in the AL may thus change the relative timing of spontaneous activity across
R. F. Galán, M. Weidert, R. Menzel, A. V. M. Herz, & C. G. Galizia
AL glomeruli. To extend this to entire across-glomeruli patterns, we determined the similarity (Kendall's correlation) between odor response and spontaneous activity at every measured point in time (see Figure 1D). The resulting trace varies between 1 (perfect match, as at stimulus onset) and −1. Because the activity traces were high-pass filtered before the projection, the values right before stimulus onset lead to the projection having value −1. Instances where the similarity exceeds a threshold of 0.75 are marked by filled triangles. Clearly, after stimulation, spontaneous events resembling the odor-evoked response pattern became more prevalent: the spontaneous activity is on average more strongly attracted to the odor-evoked pattern.

We then investigated which specific properties of the spontaneous activity changed after stimulation. While overall activity increased in some individuals and decreased in others, across animals the standard deviation of the spontaneous activity was not affected by odor exposure (p = 0.426, Wilcoxon paired sample; see Figure 2A). This implies that the overall amplitude of the fluctuations did not increase; across animals, there was no increase in baseline activity. However, the total (summed) duration of spontaneous events that mimic the odor-evoked pattern (i.e., exceeding the 0.75 correlation threshold; see also Figure 1D) clearly differed after odor exposure (p < 0.05, Wilcoxon paired sample) and increased more than twofold (see Figure 2B). There was no change in the amplitude of each frequency component of the spontaneous activity, as demonstrated by approximately equal power spectra before and after stimulation (see Figure 2C), and no change in overall spontaneous activity, as seen when comparing the histograms of activity occurrences before and after stimulation (see Figure 2D).
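The projection analysis can be sketched in a few lines (an illustrative reconstruction with synthetic data, not the authors' code; the tie-free Kendall tau-a, the frame layout, and all variable names are our assumptions):

```python
import numpy as np

def kendall_tau(a, b):
    """Kendall rank correlation (tau-a; assumes no ties)."""
    n = len(a)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign((a[i] - a[j]) * (b[i] - b[j]))
    return 2.0 * s / (n * (n - 1))

def attraction(frames, odor_pattern, threshold=0.75):
    """Similarity of each across-glomeruli frame to the odor-evoked
    pattern, and the fraction of time spent above the threshold."""
    sims = np.array([kendall_tau(f, odor_pattern) for f in frames])
    return sims, float(np.mean(sims > threshold))

rng = np.random.default_rng(0)
n_glom = 13
odor = rng.normal(size=n_glom)           # odor-evoked pattern (synthetic)
frames = rng.normal(size=(200, n_glom))  # spontaneous activity frames
frames[::20] += 6.0 * odor               # occasional pattern "reactivations"
sims, frac = attraction(frames, odor)
```

A rank correlation is a natural choice here because it is insensitive to any monotone rescaling of the glomerular signals.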
Thus, the short-term odor memory reported here is encoded only in the correlated spontaneous fluctuations rather than in their amplitude or their mean value.

Which glomeruli change their relative activity timing? To address this question, we sorted all glomeruli pairs into three categories based on their response to the presented odor: pairs where the tested odor led to increased intracellular calcium concentration in both glomeruli, pairs where one partner increased and the other decreased its calcium level, and pairs where at least one of the two did not respond to the odor. We then analyzed their correlation before and after odor exposure (see Figure 3A). We found that coactive pairs increased their correlation, pairs with opposing sign decreased their correlation, and nonactive pairs remained unchanged. Thus, the correlation changes followed a Hebbian rule: pairs of glomeruli where both were excited or both inhibited by the stimulus increased their spontaneous coherence after stimulation; pairs of glomeruli where one was excited and the other inhibited decreased their correlation. To test whether an unspecific change in spontaneous activity might explain the changed correlations, we shifted the activity traces against each other, both before and after the odor stimulus, and recalculated the pair-wise correlations on these traces (see Figure 3B). After relative shifting, a correlation purely caused by increased
[Figure 2 panels: (A) s.d. after stim. versus s.d. before stim. (p = 0.426); (B) attraction after stim. (%) versus attraction before stim. (%) (p = 0.047); (C) power spectrum versus frequency (Hz), before and after stim.; (D) counts versus ΔF/F, before and after stim.]
Figure 2: Statistical analysis of spontaneous AL activity before and after stimulation. (A) Standard deviations of the total AL activity fluctuations for each animal before odor exposure plotted against the same after odor exposure. No systematic change is observed (p = 0.426, Wilcoxon paired-sample test). (B) Odor-specific attraction, defined as the total (summed) duration of spontaneous events that closely resemble the odor-evoked activity pattern (r > 0.75; see also Figure 1D), given as a percentage of the total recording time after odor exposure plotted against the same before odor exposure. After stimulation, attraction increased significantly across animals (p < 0.05, Wilcoxon paired-sample test) and was on average more than twice as large as before stimulation. (C) Power spectrum of the spontaneous activity before (thin line) and after (thick line) odor stimulation, averaged over all glomeruli in all bees. (D) Histogram of spontaneous activity (ΔF/F values in high-pass-filtered traces) before and after stimulation across glomeruli. The distribution does not change after the stimulus, indicating that there is no overall increase in spontaneous activity. Both distributions are slightly supergaussian (black lines represent fits to gaussian distributions).
[Figure 3 panels: (A) pair-wise correlation, original traces, before versus after, for pairs where both glomeruli were inhibited or both excited, where one was inhibited and the other excited, and where at least one did not respond (** marks significant changes); (B) the same for shifted traces.]
Figure 3: (A) Pair-wise correlation between olfactory glomeruli. As predicted by the Hebbian learning rule, pairs of glomeruli that were both excited or both inhibited by the odor significantly increased their correlation (p < 0.05, Wilcoxon paired-sample test); pairs of glomeruli where one was excited and the other inhibited significantly decreased their correlation (p < 0.05); pairs of glomeruli where at least one did not respond to the odor did not significantly change their correlation (p > 0.05). (B) Pair-wise lagged correlation between olfactory glomeruli. By shifting the activity time series of the glomeruli with respect to each other (lag = 3 seconds), the correlations between glomeruli were generally reduced, and the increased correlation of coactive glomeruli was abolished. The decrease of correlation in the other groups is a consequence of the statistical properties of these traces (see the text). A similar picture was obtained for any time lag larger than 1 second.
activity should remain visible, while a correlation caused by specific timing of co-occurring events should decrease or disappear. We found that all correlation increases were lost or reduced, showing that the observed effect is due to precisely timed coactivity rather than to an increased baseline activity. It should be noted that there is a small background correlation within the antennal lobe across all glomeruli, as evident in Figure 3A. Shifting the data reduces this background correlation (compare the values in Figures 3A and 3B); this effect also applies to the traces after odor delivery, leading to the significant values in Figure 3B for the non-co-excited glomeruli, where the decrease in correlation due to the stimulus and that due to shifting add up to a significant effect.

We next calculated the correlation matrix of glomerular activity before and after stimulus presentation (left panels of Figures 4A and 4C) by computing the pair-wise correlation between the glomerular activity time courses. We derived the correlation changes by subtracting the two matrices (left panel of Figure 4D). In the example shown, glomeruli 17, 28, 33, 36, and 52 increased their pair-wise correlation; they were coactive more often than before odor exposure. These glomeruli are those that responded most strongly to the odor (see Figure 4B). Pairs of glomeruli that were both inhibited during stimulation tended to increase their correlation, too, as shown by the pairs 23-49 and 29-37 in the left panel of Figure 4D. In contrast, most pairs where one glomerulus was excited by the odor and the other was inhibited decreased their correlation. This phenomenon is clearly apparent for pairs 17-37, 17-49, and 23-52. Thus, it seems that the pair-wise correlation of glomerular activation patterns well after stimulus offset resembled the odor-evoked response patterns. To test this key hypothesis, we performed a principal component analysis (PCA).
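The matrix analysis can be sketched as follows (an illustrative reconstruction with synthetic data; the bootstrap test that zeroes nonsignificant entries is omitted, and all names, parameters, and the planted pattern are our assumptions):

```python
import numpy as np

def corr_matrix(traces):
    """Pair-wise correlations between glomerular time courses.
    traces: array of shape (n_glomeruli, n_samples)."""
    return np.corrcoef(traces)

def shifted_corr_matrix(traces, lag):
    """Control: shift each glomerulus by a different multiple of `lag`
    so that precisely timed coactivity is destroyed, then re-correlate."""
    shifted = np.array([np.roll(tr, i * lag) for i, tr in enumerate(traces)])
    return np.corrcoef(shifted)

def first_pc(c):
    """Eigenvector of c whose eigenvalue has the largest magnitude."""
    evals, evecs = np.linalg.eigh(c)
    return evecs[:, np.argmax(np.abs(evals))]

rng = np.random.default_rng(0)
n_glom, t = 13, 2000
odor = rng.normal(size=n_glom)                   # odor-evoked pattern
noise = rng.normal(size=(n_glom, t))
common = rng.normal(size=t)                      # shared coactivity events
before = noise                                   # uncorrelated spontaneous activity
after = noise + 2.0 * odor[:, None] * common     # odor-patterned coactivity
diff = corr_matrix(after) - corr_matrix(before)  # change in pair-wise correlation
pc = first_pc(corr_matrix(after))
match = abs(np.corrcoef(pc, odor)[0, 1])         # first PC versus odor pattern
```

On such data the first PC of the post-stimulus correlation matrix tracks the planted odor pattern, while shifting the traces against each other wipes the correlations out.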
PCs are the eigenvectors of a correlation matrix. In particular, the first PC corresponds to the eigenvector whose eigenvalue has the largest magnitude. In our case, it represents the dominant pattern of spontaneous activity in the sense that its average projection onto the spontaneous activity is maximal (Beckerman, 1995). We performed a PCA of the correlation matrices before and after stimulation, as well as for the difference of both. We then quantified the correspondence between the first principal component (right panels in Figures 4A, 4C, and 4D) and the odor-evoked pattern (see Figure 4B, right side only). There was no significant relationship before odor stimulation (r = 0.31, p = 0.143). After odor stimulation, the correspondence was highly significant (r = 0.79, p < 0.001). The same holds for the first PC of the difference matrix (r = 0.74, p < 0.001). This finding was confirmed across animals: the correlation matrices derived from the spontaneous activity after stimulation with an odor clearly reflected the pattern that was elicited by the odor, with mean correlation values between first PC and odor-evoked response of 0.52 after 1 minute and 0.39 after 2 minutes, as compared to 0.22 and 0.19 for 2 and 1 minutes before stimulation, respectively. These
differences were highly significant (two-way ANOVA, p < 0.001, with no significant difference between animals, p = 0.11; see Figure 4E). In addition to this across-animal analysis, we asked for each animal whether the first PC corresponds to the odor response. One minute after stimulation, this was the case in six of nine animals; after 2 minutes, their number had dropped to three. This indicates that the memory trace encoded in the correlated activity fluctuations decays on a timescale of a few minutes. As a control, we also tested the prestimulus condition and found no resemblance between the first PC and the odor response in any animal, as expected.

4 Discussion

As shown by our results, a single odor exposure without any predictive or associative value can lead to transient changes of the correlations between spontaneously active glomeruli that can last for more than 1 minute. Most notably, the pattern of activity that corresponds to the experienced odor is repeatedly "reactivated" during this period and thus constitutes a type of reverberation that is rather distinct from persistent activity (Amit & Mongillo, 2003; Fuster & Alexander, 1971). Hebb may have foreseen this
Figure 4: Spontaneous activity after odor stimulation represents an olfactory memory. Data in A–D are from one animal. (A) Matrix of pair-wise correlations between glomerular activity before the stimulus is given (left). The diagonal elements of the matrix equal unity by definition and are depicted in black. Components of the first principal component (PC) of this matrix (right). The absence of a significant correlation with the odor response is given within the frame. (B) Glomerular activity pattern elicited by the odor (2-octanol). Glomeruli are arranged according to their activity strength. This sequence of glomeruli is kept throughout A–D. The left panel is deliberately left empty to ease comparison with the other panels of the figure. (C) As A, but after odor stimulation. The components of the first PC clearly differ from those before the stimulus (A) and resemble the response pattern (B). The significance of the correlation with the odor response is given within the frame. (D) As in A, but for the differences (after minus before) of pair-wise correlations. Nonsignificant entries of the matrix (by bootstrap analysis; see the text) have been set to zero and are shown in white; the diagonal elements equal zero by definition and are depicted in gray. As in C, there is a statistically significant correlation with the odor response pattern (B). (E) Population data. Box plot of Kendall's correlation between the first eigenvector and the odor response calculated from correlation matrices 2 minutes and 1 minute before and after stimulus delivery. There is a highly significant increase in the correlation. Numbers above the box plots indicate how many animals had a significant correlation between the first PC and the odor response. In agreement with our other results, this correlation was not significant for those bees for which the attraction did not increase after odor presentation (see Figure 2B).
[Figure 4 panels: (A) correlation matrix and first-PC components before the stimulus, over glomeruli 23, 37, 49, 29, 62, 42, 38, 39, 28, 52, 33, 36, 17 (r = 0.31, p = 0.1431); (B) odor response per glomerulus; (C) after the stimulus (r = 0.79, p = 0.0002); (D) after minus before (r = 0.74, p = 0.0004); (E) box plots of the first-PC/odor-response correlation at −2, −1, +1, and +2 minutes, with significant-animal counts 0/9, 0/9, 6/9, and 3/9.]
possibility when he discussed the "persistence or repetition of a reverberatory activity" (Hebb, 1949, p. 62).

Statistical properties of spontaneous activity patterns have been investigated in other systems, notably the mammalian cortex (Ikegaya et al., 2004). In this structure, the correlation between units is related to the behavioral state of the animal (Vaadia et al., 1995). A possible mechanism for creating specific firing patterns between neurons is provided by the concept of synfire chains (Abeles, 1991; Durstewitz, Seamans, & Sejnowski, 2000). In this study, we have investigated the relationship not of individual spikes but of entire activity bursts, which are reflected in the calcium increases at the temporal resolution of our measurements. Therefore, a direct comparison should be approached with caution.

Which neurons mediate the observed correlation changes? Since the changes occur between glomeruli, neurons connecting glomeruli are likely candidates. In the honeybee, there are up to 4000 neurons local to the AL (Flanagan & Mercer, 1989; Fonta, Sun, & Masson, 1993; Galizia & Menzel, 2000). These neurons are not a uniform group: they have different transmitters (GABA, histamine) and different morphologies (heterogeneous, homogeneous), and they innervate different groups of glomeruli. Further work should elucidate which subpopulation of LNs accounts for stimulus-specific modifications of the glomerular fluctuations. Clearly, however, the changes follow a Hebbian correlation rule: glomeruli that are coactivated during a stimulus probably increase a (putative) reciprocal excitatory connection and/or decrease an inhibitory connection, so that their co-occurrence in a spontaneous activity event becomes more likely than before stimulation (see Figure 3). However, since the network does not consist of pair-wise connections only, this description is certainly too simplistic.
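The Hebbian correlation rule can be illustrated with a toy update (a hypothetical sketch in the spirit of covariance learning, Sejnowski, 1977; the rule, the learning rate, and the example pattern are our assumptions, not a model of the AL circuit):

```python
import numpy as np

def hebbian_update(W, pattern, rate=0.1):
    """Toy Hebbian rule: the coupling change is proportional to the
    product of the two glomerular responses, with no self-coupling."""
    dW = rate * np.outer(pattern, pattern)
    np.fill_diagonal(dW, 0.0)
    return W + dW

# Hypothetical stimulus pattern: glomeruli 0 and 1 excited,
# glomerulus 2 inhibited, glomerulus 3 silent.
odor = np.array([1.0, 1.0, -1.0, 0.0])
W = hebbian_update(np.zeros((4, 4)), odor)
# Coactive pairs strengthen, opposite-sign pairs weaken,
# and pairs involving the silent glomerulus are unchanged.
```

This reproduces the three categories in Figure 3: increased, decreased, and unchanged pair-wise coupling.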
More work is needed to identify the mechanisms that may account for these findings. The stimulus may induce short-term changes of synaptic or membrane properties, which are known to influence spontaneous activity patterns (Kenet, Bibitchkov, Tsodyks, Grinvald, & Arieli, 2003; Tsodyks, Kenet, Grinvald, & Arieli, 1999). Stimulus-dependent modifications may, however, also be purely dynamic in the sense of "a memory trace that is wholly a function of a pattern of neural activity, independent of any structural change" (Hebb, 1949). The observed multiglomerular activity fluctuations are readily interpreted if we visualize the AL as a network with odor-specific attractors and a high level of spontaneous activity (Galán, Sachse, Galizia, & Herz, 2004). If an odor is presented, the basin of attraction corresponding to this stimulus is enlarged and biases the network fluctuations toward the odor-specific pattern. This may lead to the network enhancing the representation of that odor relative to others, as seen in Figure 2C. Such a short-term memory effect has been observed in locusts (Stopfer & Laurent, 1999). In that study, the coherence of PN activity increased when an odor stimulus was repeated. If the sensory memory of the previous (but same) stimulus was
still active, it would cause a reduced threshold for that pattern and thus facilitate a more coherent response, that is, more strongly synchronized PN activity, as reported (Stopfer & Laurent, 1999). In the rat olfactory bulb, exposure to an odor slightly modifies the response profile of mitral cells (Fletcher & Wilson, 2003), a finding that might indicate that changes similar to those observed here occur in the interglomerular neural network of mammals. It remains unclear, though, under what conditions, if any, these changes are read out by other brain areas, that is, whether spontaneous activity bouts can be coincident and cause spurious "remembrances" of the experienced odor, or whether these bouts are "perceived" as odor whiffs by the animal. By briefly changing the network activity, they might also play a role in classical conditioning of odors with appetitive rewards (Menzel & Müller, 1996). Let us also note that even if an external observer can retrieve the odor from the spontaneous activity, this does not prove that the animal actually uses this information. Nevertheless, the observed correlation changes are a robust and predictable phenomenon in the AL; by itself, this unexpected finding invites further investigation.

In conclusion, we have revealed traces of sensory memory in vivo and have demonstrated that a single odor stimulus can modify the spontaneous activity of olfactory glomeruli. As traditional paradigms investigating Hebbian reverberations have focused exclusively on persistent activity after stimulation, not on correlated activity fluctuations, future investigations along the lines of our study may reveal previously overlooked memory traces in many other neural systems.

Acknowledgments

The work of M. W. and R. M. was supported by the Deutsche Forschungsgemeinschaft (SFB 515).
References

Abel, R., Rybak, J., & Menzel, R. (2001). Structure and response patterns of olfactory interneurons in the honeybee, Apis mellifera. J. Comp. Neurol., 437(3), 363–383.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Amit, D. J., & Mongillo, G. (2003). Selective delay activity in the cortex: Phenomena and interpretation. Cereb. Cortex, 13(11), 1139–1150.
Beckerman, M. (1995). Adaptive cooperative systems. New York: Wiley.
Clark, R. E., Manns, J. R., & Squire, L. R. (2002). Classical conditioning, awareness, and brain systems. Trends Cogn. Sci., 6(12), 524–531.
Crowder, R. G. (2003). Sensory memory. In J. H. Byrne (Ed.), Learning and memory (2nd ed., pp. 607–609). New York: Macmillan.
de Bruyne, M., Foster, K., & Carlson, J. R. (2001). Odor coding in the Drosophila antenna. Neuron, 30(2), 537–552.
Del Giudice, P., Fusi, S., & Mattia, M. (2003). Modelling the formation of working memory with networks of integrate-and-fire neurons connected by plastic synapses. J. Physiol. Paris, 97(4–6), 659–681.
Durstewitz, D., Seamans, J. K., & Sejnowski, T. J. (2000). Neurocomputational models of working memory. Nat. Neurosci., 3(Suppl.), 1184–1191.
Flanagan, D., & Mercer, A. R. (1989). Morphology and response characteristics of neurones in the deutocerebrum of the brain in the honeybee Apis mellifera. J. Comp. Physiol. (A), 164, 483–494.
Fletcher, M. L., & Wilson, D. A. (2003). Olfactory bulb mitral-tufted cell plasticity: Odorant-specific tuning reflects previous odorant exposure. J. Neurosci., 23(17), 6946–6955.
Fonta, C., Sun, X. J., & Masson, C. (1993). Morphology and spatial distribution of bee antennal lobe interneurones responsive to odours. Chemical Senses, 18, 101–119.
Fuster, J. M., & Alexander, G. E. (1971). Neuron activity related to short-term memory. Science, 173(997), 652–654.
Galán, R. F., Sachse, S., Galizia, C. G., & Herz, A. V. (2004). Odor-driven attractor dynamics in the antennal lobe allow for simple and rapid olfactory pattern classification. Neural Comput., 16(5), 999–1012.
Galizia, C. G., Joerges, J., Küttner, A., Faber, T., & Menzel, R. (1997). A semi-in-vivo preparation for optical recording of the insect brain. J. Neurosci. Methods, 76(1), 61–69.
Galizia, C. G., & Kimmerle, B. (2004). Physiological and morphological characterization of honeybee olfactory neurons combining electrophysiology, calcium imaging and confocal microscopy. J. Comp. Physiol. A, 190(1), 21–38.
Galizia, C. G., McIlwrath, S. L., & Menzel, R. (1999). A digital three-dimensional atlas of the honeybee antennal lobe based on optical sections acquired by confocal microscopy. Cell Tissue Res., 295(3), 383–394.
Galizia, C. G., & Menzel, R. (2000). Odour perception in honeybees: Coding information in glomerular patterns. Curr. Opin. Neurobiol., 10(4), 504–510.
Galizia, C. G., & Menzel, R. (2001). The role of glomeruli in the neural representation of odours: Results from optical recording studies. J. Insect Physiol., 47(2), 115–130.
Galizia, C. G., Sachse, S., Rappert, A., & Menzel, R. (1999). The glomerular code for odor representation is species specific in the honeybee Apis mellifera. Nat. Neurosci., 2(5), 473–478.
Getz, W. M., & Akers, R. P. (1994). Honeybee olfactory sensilla behave as integrated processing units. Behav. Neural Biol., 61(2), 191–195.
Giurfa, M., Zhang, S., Jenett, A., Menzel, R., & Srinivasan, M. V. (2001). The concepts of "sameness" and "difference" in an insect. Nature, 410(6831), 930–933.
Grossmann, K. E. (1971). Belohnungsverzögerung beim Erlernen einer Farbe an einer künstlichen Futterstelle durch Honigbienen. Z. Tierpsychol., 29, 28–41.
Haag, J., & Borst, A. (2000). Spatial distribution and characteristics of voltage-gated calcium signals within visual interneurons. J. Neurophysiol., 83(2), 1039–1051.
Hansson, B. S., & Christensen, T. A. (1999). Functional characteristics of the antennal lobe. In B. S. Hansson (Ed.), Insect olfaction (pp. 125–161). Heidelberg: Springer.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Hildebrand, J. G., & Shepherd, G. M. (1997). Mechanisms of olfactory discrimination: Converging evidence for common principles across phyla. Annu. Rev. Neurosci., 20, 595–631.
Ikegaya, Y., Aaron, G., Cossart, R., Aronov, D., Lampl, I., Ferster, D., & Yuste, R. (2004). Synfire chains and cortical songs: Temporal modules of cortical activity. Science, 304(5670), 559–564.
Kenet, T., Bibitchkov, D., Tsodyks, M., Grinvald, A., & Arieli, A. (2003). Spontaneously emerging cortical representations of visual attributes. Nature, 425(6961), 954–956.
Korsching, S. (2002). Olfactory maps and odor images. Curr. Opin. Neurobiol., 12(4), 387–392.
Menzel, R. (2001). Searching for the memory trace in a mini-brain, the honeybee. Learn. Mem., 8(2), 53–62.
Menzel, R., & Müller, U. (1996). Learning and memory in honeybees: From behavior to neural substrates. Annu. Rev. Neurosci., 19, 379–404.
Müller, D., Abel, R., Brandt, R., Zöckler, M., & Menzel, R. (2002). Differential parallel processing of olfactory information in the honeybee, Apis mellifera L. J. Comp. Physiol. A, 188(5), 359–370.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Sachse, S., & Galizia, C. G. (2002). Role of inhibition for temporal and spatial odor representation in olfactory output neurons: A calcium imaging study. J. Neurophysiol., 87(2), 1106–1117.
Sachse, S., & Galizia, C. G. (2005). Topography and dynamics of the olfactory system. In S. Grillner (Ed.), Microcircuits: The interface between neurons and global brain function. Cambridge, MA: MIT Press.
Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biol., 4(4), 303–321.
Stopfer, M., & Laurent, G. (1999). Short-term memory in olfactory network dynamics. Nature, 402(6762), 664–668.
Tsodyks, M., Kenet, T., Grinvald, A., & Arieli, A. (1999). Linking spontaneous activity of single cortical neurons and the underlying functional architecture. Science, 286(5446), 1943–1946.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373(6514), 515–518.
Received January 19, 2005; accepted May 18, 2005.
LETTER
Communicated by Fred Rieke
The Optimal Synapse for Sparse, Binary Signals in the Rod Pathway

Paul T. Clark
[email protected]
Mark C. W. van Rossum
[email protected]
Institute for Adaptive and Neural Computation, School of Informatics, Edinburgh, EH1 2QL, U.K.
The sparsity of photons at very low light levels necessitates a nonlinear synaptic transfer function between the rod photoreceptors and the rod-bipolar cells. We examine different ways to characterize the performance of the pathway: the error rate, two variants of the mutual information, and the signal-to-noise ratio. Simulation of the pathway shows that these approaches yield substantially different performance at very low light levels and that maximizing the signal-to-noise ratio yields the best performance when judged from simulated images. The results are compared to recent data.
1 Introduction

In this letter, we study early visual processing at very low light levels. At these so-called scotopic light levels, the photon capture rate per rod photoreceptor is on the order of one per minute. The rod cells can detect single photons (Baylor, Lamb, & Yau, 1979; Baylor, Nunn, & Schnapf, 1984; Schneeweis & Schnapf, 1995). Photon capture by the rod can lead to a response in the ganglion cells (Barlow, Levick, & Yoon, 1971; Mastronarde, 1983b). The rod is the most common cell type in the retina; there are 20 times more rods than cones (Sterling & Demb, 2004). The large number of rods serves to detect as many photons as possible, whereas the spatial resolution of the rod pathway is low. The scotopic rod pathway therefore has a large convergence. The rod-bipolar cell collects the signal from some 10–100 rods (Dacheux & Raviola, 1986; Grünert, Martin, & Wässle, 1994; Tsukamoto, Morigiwa, Ueda, & Sterling, 2001), while each rod connects to only two bipolar cells. Even when no photon is absorbed, the rod response is corrupted with continuous noise. This poses a potential problem for the pathway: a sharp thresholding function before summing the rod responses is required to maintain the single photon response. Without the nonlinearity, the single
Neural Computation 18, 26–44 (2006)
© 2005 Massachusetts Institute of Technology
The Optimal Synapse
responses would drown in the noise (Baylor et al., 1984). An earlier biophysical model showed how such a nonlinearity can be implemented and demonstrated that the nonlinearity is indeed necessary to obtain the observed performance (van Rossum & Smith, 1998). Recently, the transfer function of the synapse was measured, providing direct evidence for the existence of such a nonlinearity and confirming its synaptic mechanism (Field & Rieke, 2002b; Sampath & Rieke, 2004; Berntson, Smith, & Taylor, 2004).

The performance of the pathway depends critically on the synaptic transfer function and its threshold. This raises the question of how the threshold should be set from first principles. Interestingly, this question has in general no straightforward answer (Basseville, 1989). In this study, we investigate ways to set the optimal synaptic transfer function for sparse, binary signals. Counterintuitively, we show that different performance criteria lead to different optimal thresholds. Simulation of the pathway suggests that these different threshold settings greatly influence the signal in the bipolar cell. We first introduce our description of the pathway and then analyze different performance measures in the case of a sharp binary threshold. Next, we extend to the more general case of smooth transfer functions, for which we rely on simulations. Finally, we discuss our predictions. We are not aware of any other studies comparing performance measures for sparse, binary detection, neither in the bipolar pathway nor in a general case.

2 Rod and Rod-Bipolar Pathway

2.1 Model for the Rods. The layout of the modeled rod-bipolar pathway is shown in Figure 1. At the lowest light levels, a rod photoreceptor might
[Figure 1 schematic: dim light produces a sparse binary photon flux (time bins t, t+1, t+2, t+3); the rods add gaussian noise; nonlinear synapses (parameters: threshold, slope) feed a summation Σ onto the rod-bipolar.]
Figure 1: Diagram of our model of the rod-bipolar pathway. A dim light source causes a very sparse flux of photons (modeled in discrete time steps). The photons are detected by the rods. Intrinsic gaussian rod noise corrupts the response. After a nonlinear synapse, the rod-bipolar sums the rod responses. The question is how the synapse’s threshold and slope should best be set to minimize signal loss.
detect a photon only every few minutes. Not every photon is absorbed and detected, but for simplicity we assume that each photon a rod receives is detected and leads to a response. This effectively yields an extra scale factor in the light level, the quantum efficiency, which is estimated between 3% and 50% (Baylor et al., 1979; Field, Sampath, & Rieke, 2005). The number of photons absorbed by the rod follows a Poisson distribution. The full dynamics of the response and the noise are not taken into account: we discretize time into bins with the duration of the pathway's integration time. The rod integration time is some 100–200 ms (Baylor et al., 1984; Walraven, Enroth-Cugell, Hood, MacLeod, & Schnapf, 1990). With the light level ρ, we denote the probability that a rod receives a photon per time bin. Because the power spectra of signal and noise are similar, it is unlikely that temporal integration by the synapse can strongly reduce the noise in favor of the signal (van Rossum & Smith, 1998). However, it is important to note that, precise data lacking, synaptic filtering could lower the noise somewhat; in addition, bandpass filtering could increase the temporal information (Bialek & Owen, 1990; Armstrong-Gold & Rieke, 2003).

At the low light levels we consider here, ρ ≪ 1 and the number of absorbed photons n is small: mostly zero and sometimes one. Thus, a rod can essentially have two responses: it either detects a photon or does not. The probability that a particular rod detects two photons is negligible (of order ρ²), as is the probability that two out of N rods detect a photon simultaneously, of order ρ²N(N − 1). The task of the bipolar cell is therefore to discriminate between the case that none of the rods absorbed a photon and the case that one rod absorbed a photon. Importantly, the rod response is noisy, and its voltage distribution can be fitted by a gaussian with a standard deviation that increases with the number of photons absorbed.
The probability distribution for a response amplitude x from a rod is (Field & Rieke, 2002a, 2002b)

P(x) = Σ_{n=0}^∞ (ρ^n e^{−ρ}/n!) G(n x̄, σ_D² + n σ_A²)
     = Σ_{n=0}^∞ (ρ^n e^{−ρ}/n!) · 1/√(2π(σ_D² + n σ_A²)) · exp[−(x − n x̄)²/(2(σ_D² + n σ_A²))],   (2.1)
where G denotes the gaussian distribution. Without loss of generality, the mean response to a single event, x̄, is normalized to 1. The empirical values for the noise in mouse rods are σ_D = 0.27 and σ_A = 0.33 (Field & Rieke, 2002b). These values for σ are only approximate values for the noise seen by the bipolar. It should be noted that the signal as seen by the bipolar can be noisier than this, because stochastic vesicle release can corrupt the signal further (Rao, Buchsbaum, & Sterling, 1994; van Rossum & Smith, 1998);
The Optimal Synapse
there are no precise estimates on this. On the other hand, synaptic filtering might reduce the noise, as stated above. Finally, the rod signal is corrupted by thermally driven spontaneous isomerization of rhodopsin (Baylor et al., 1984). This rate is about 10^-3 events per rod per integration time. These events introduce extra errors because they are indistinguishable from real photon captures and therefore cannot be filtered out. They are thought to be a major contribution to the so-called dark light (Copenhagen, Donner, & Reuter, 1987; Sterling, Freed, & Smith, 1988).

2.2 The Bipolar Cell.

The rod provides input to both OFF and ON bipolar cells. The OFF-bipolar pathway does not seem to be tuned for low scotopic vision (Soucy, Wang, Nirenberg, Nathans, & Meister, 1998; Völgyi, Deans, Paul, & Bloomfield, 2004); hence, we ignore it here. The rod ON-bipolar cell pools the signal from some 10 to 100 rods. However, as the rod signal is noisy, the single-photon signal would be lost in the noise if the bipolar were to sum the rod signals linearly. The reason is that the noise is pooled from all rods; thus, the standard deviation of the noise in the bipolar scales as √N, whereas only one rod carries the signal. It has therefore been noted that it is essential to threshold the rod signals before they are summed by the bipolar (Baylor et al., 1984). In a modeling study, it was proposed that this threshold is implemented using a second-messenger cascade synapse; such a threshold mechanism yielded performance consistent with the physiological and psychophysical data (van Rossum & Smith, 1998). For now, we assume that the synaptic transfer function g(x) is a sharp step function with a threshold θ, so that g(x) = 0 if x < θ and g(x) = 1 otherwise. The threshold θ is the adjustable parameter. This simple transfer function is easy to study, and a binary function might seem to fit the binary input signal best.
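As a concrete illustration, the rod response model of equation 2.1 and the sharp synaptic transfer function can be sketched in a few lines of Python (a minimal sketch; the function names and the truncation of the Poisson sum at n_max are our own choices, not part of the original model):

```python
import math
import random

def rod_response_pdf(x, rho, sigma_d=0.27, sigma_a=0.33, n_max=5):
    """Mixture density of equation 2.1: a Poisson photon count n and a
    gaussian response with mean n and variance sigma_d^2 + n*sigma_a^2.
    The infinite sum is truncated at n_max, which is ample for rho << 1."""
    total = 0.0
    for n in range(n_max + 1):
        weight = rho**n * math.exp(-rho) / math.factorial(n)
        var = sigma_d**2 + n * sigma_a**2
        total += weight * math.exp(-(x - n)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return total

def sample_rod_response(rho, sigma_d=0.27, sigma_a=0.33, rng=random):
    """Draw one rod response: Poisson photon count, then gaussian noise."""
    # inverse-transform Poisson sampling (adequate for small rho)
    u, n, p = rng.random(), 0, math.exp(-rho)
    c = p
    while u > c:
        n += 1
        p *= rho / n
        c += p
    return rng.gauss(n, math.sqrt(sigma_d**2 + n * sigma_a**2))

def g_sharp(x, theta):
    """Sharp synaptic transfer function: g(x) = 0 if x < theta, else 1."""
    return 1.0 if x >= theta else 0.0
```

For ρ ≪ 1, the density is dominated by the zero-photon gaussian around 0, with a small bump near the single-photon response at 1.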
This second statement will turn out not to be fully correct, as shown below, where other synapse models are discussed. Consider first the case that just one rod is connected to the bipolar cell. We introduce the false-positive rate α (no photon, but one is erroneously detected) and the false-negative rate β (a photon was received but not detected in the bipolar). For one rod, the n = 0 and n = 1 terms in equation 2.1 yield

α = (1/2) [1 − erf(θ / (√2 σ_D))]
β = (1/2) [1 + erf((θ − 1) / √(2(σ_D² + σ_A²)))].

In case N rods are connected to the bipolar cell, the bipolar cell is assumed to sum the thresholded rod responses, that is, y = Σ_{i=1}^N g(x_i). After some combinatorics, one finds that the probability for the absorption of k
photons and a bipolar response j equals

P(j, k) = C(N, k) (1 − ρ)^{N−k} ρ^k × Σ_{l=0}^{k} C(N−k, j−l) C(k, l) α^{j−l} (1 − α)^{N+l−j−k} β^{k−l} (1 − β)^l,
with the convention that C(i, j) = 0 if j < 0. In the limit of small ρ and small ρN, again only two errors are important. First, none of the rods received a photon, but the output is unequal to zero. This probability is written as α_N = 1 − (1 − α)^N and can be interpreted as the generalized false-positive rate. Second, one of the rods received a photon, but the bipolar output is zero. This is written as β_N = β(1 − α)^{N−1}. The false-positive and false-negative rates characterize the pathway as a function of the threshold level θ. The threshold can roughly be deduced from ganglion cell data, which showed that the false-negative rate is about 50% (Mastronarde, 1983a; van Rossum & Smith, 1998). This corresponds to a threshold setting of θ ≈ 1. In this letter, we examine the more fundamental question of how the optimal value for the threshold follows from the performance measure imposed.

3 Performance Measures

3.1 Threshold from Minimizing the Detection Errors.

The problem of how to set the threshold can be analyzed with signal detection theory (Green & Swets, 1966; Van Trees, 1968). The setting of the threshold determines the trade-off between false positives and false negatives. A natural choice is to weigh both errors equally; in this case, the error equals the mean square error between input and output, or the Hamming distance. With one rod connected to the bipolar, the total error rate, denoted ER, is

ER(θ) = (1 − ρ) α(θ) + ρ β(θ),

where the first term is the false-positive rate and the second term the false-negative rate. Now the threshold θ can be varied so that the ER is minimal. If multiple rods converge onto the bipolar, a simple counting argument gives for the error rate

ER = (1 − ρN) α_N + ρN β_N,

where we ignored terms of order ρ². In Figure 2A, the total error is plotted as a function of the threshold level when 10 rods are converging. For low thresholds, the false-positive rate is very high, and the error rate is close to 1.
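The error-rate criterion is easy to evaluate numerically. The sketch below (assuming the mouse rod noise values quoted earlier; the function names are ours) computes α(θ), β(θ), their N-rod generalizations, and ER(θ), so that the optimal threshold can be found by a simple scan:

```python
from math import erf, sqrt

SIGMA_D, SIGMA_A = 0.27, 0.33  # mouse rod noise (Field & Rieke, 2002b)

def alpha(theta):
    """False-positive rate of one rod: P(x > theta | no photon)."""
    return 0.5 * (1 - erf(theta / (sqrt(2) * SIGMA_D)))

def beta(theta):
    """False-negative rate of one rod: P(x < theta | one photon)."""
    return 0.5 * (1 + erf((theta - 1) / sqrt(2 * (SIGMA_D**2 + SIGMA_A**2))))

def alpha_n(theta, n_rods):
    """Generalized false-positive rate: any of N silent rods crosses threshold."""
    return 1 - (1 - alpha(theta))**n_rods

def beta_n(theta, n_rods):
    """Generalized false-negative rate: the photon rod is missed, the rest stay silent."""
    return beta(theta) * (1 - alpha(theta))**(n_rods - 1)

def error_rate(theta, rho, n_rods):
    """ER = (1 - rho*N) * alpha_N + rho*N * beta_N (terms of order rho^2 ignored)."""
    return (1 - rho * n_rods) * alpha_n(theta, n_rods) + rho * n_rods * beta_n(theta, n_rods)

# scan thresholds between 0 and 3 for rho = 1e-3 and 10 rods
best_theta = min((i * 0.01 for i in range(301)), key=lambda t: error_rate(t, 1e-3, 10))
```

For these parameters, the scan yields an optimal threshold close to 1, consistent with the value deduced from ganglion cell data.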
For very high thresholds, the error rate is quite low: ER = ρN. Here the high threshold eliminates all photon events; the output is completely dark, which is not far from the truth but not very useful. For intermediate
Figure 2: The behavior of the three performance measures of the bipolar pathway as a function of the threshold level. (A) The total error rate, which counts false positives and false negatives. The minimum in the error rate corresponds to best performance. The dashed line indicates the much worse performance when the synapses are linear and thresholding is done after the summation. (B) The signal-to-noise ratio for a contrast discrimination task. The thin dashed line indicates the performance when thresholding is done after the summation. (C) The mutual information between the light level and the bipolar output as a function of the threshold. The dashed line indicates the mutual information between the rod signal and the bipolar signal (y-scale divided by 20 to aid visualization; on the same scale, the dashed line would be much larger than the solid line). Parameters for all panels: 10 rods; light level 0.001 photons/rod/time step; rod noise according to Field & Rieke (2002b).
thresholds, the error rate has a minimum at which false-positive and false-negative rates are traded off. As the light level is lowered, the optimal threshold increases and can become larger than 1. This gives the somewhat counterintuitive result that if the signal is very sparse, a high threshold is beneficial, even though it causes a large fraction of the events to be missed. To show the benefit of the thresholding synapse, we also show the error rate when the synapses are linear and the signal is thresholded after
summing (see Figure 2A, dashed line). This error rate has no minimum, and performance is worse in this case.

3.2 Threshold from Bayesian Inference.

The same threshold value also follows from probabilistic Bayesian inference. The probability that the rod absorbed a photon (k = 1) rather than not (k = 0), given a response y, is

g(y) = P(k=1|y) / [P(k=0|y) + P(k=1|y)]
     = [1 + P(y|k=0) P(k=0) / (P(y|k=1) P(k=1))]^{-1}
     = [1 + ((1 − ρ)/ρ) · ((1 − ρ_SP) G(0, σ_D²) + ρ_SP G(1, σ_D² + σ_A²)) / G(1, σ_D² + σ_A²)]^{-1},   (3.1)
where, for completeness, we introduced the spontaneous isomerization rate ρ_SP, measured in events per rod per time step; it mimics a photon event (see below). Under the simplification that σ_A = 0 and ρ_SP = 0, the probability that the rod absorbed a photon given the response is the well-known logistic function (MacKay, 2003),

g(y) = 1 / (1 + exp[−(y − θ)/κ]),   (3.2)

with parameters θ = 1/2 − σ_D² ln[ρ/(1 − ρ)] and κ = σ_D². If this probability is 50% or higher, a photon event was most likely, and the output is set to one; otherwise, there was likely no photon, and the output is set to zero. This threshold setting corresponds to the point where the rod probability distributions for the one-photon and no-photon signals intersect. The inference interpretation is equivalent to minimizing the number of errors as in the previous section and thus yields exactly the same optimal threshold. When spontaneous events are taken into account, the threshold is approximated by

θ = 1/2 − σ_D² ln(ρ − ρ_SP).   (3.3)
This has no solution for ρ < ρ_SP; the intuition is that any response was more likely a spontaneous event than a real photon. When the assumption σ_A = 0 is dropped, g(y) is no longer a monotonic function. Instead, a transition occurs at a negative value of y, which makes g(y) = 1 also for negative
y. However, the probability of these small values of y is negligible. Furthermore, numerically the relevant upper threshold is virtually identical to the case σ_A = 0 (1.17 versus 1.19 when ρ = 10^-4).

3.3 Threshold from Signal-to-Noise Ratio.

Another performance measure of the pathway is the following: the discrimination should be clearest when the signal-to-noise ratio in the bipolar is maximal (Field & Rieke, 2002b). Therefore, the synapse should be tuned to maximize the signal-to-noise ratio. Here we maximize the signal-to-noise ratio in a contrast discrimination task in which a dark patch has to be distinguished from a brighter one. For a given light level, the response distribution of the bipolar cell is Q(y; ρ) = [1 − q] δ(y) + q δ(y − 1), where q(ρ) = α_N + ρN(1 − α_N − β_N) is the average bipolar output, consisting of both correct and false responses. The variance of this distribution is q(1 − q). The signal-to-noise ratio is

SNR(ρ_1, ρ_2) = 2 [q(ρ_1) − q(ρ_2)]² / { q(ρ_1)[1 − q(ρ_1)] + q(ρ_2)[1 − q(ρ_2)] }.
The values of ρ_1 and ρ_2 are set as follows. When the discrimination is hardest, even the discrimination between dark and the highest light level is difficult. Therefore, we examine the case that ρ_1 = 0 and ρ_2 = 2ρ (the factor 2 ensures that the mean light level is ρ). However, we also considered the discrimination between two almost equal light levels, SNR(ρ − δρ, ρ + δρ) with δρ ≪ ρ. In practice, we found that this gave almost identical thresholds. In Figure 2B, the SNR is plotted for thresholding synapses and for linear synapses followed by thresholding after summing. As the figure shows, thresholding clearly improves the SNR.

3.4 Threshold from Information Theory.

The detection problem and the need for a threshold in the synapse can also be studied using information theory. In general, the mutual information between an input variable x and an output y is

I_M = ∫ dx P(x) ∫ dy P(y|x) [log₂ P(y|x) − log₂ P(y)]

(Cover & Thomas, 1991). We first calculate the mutual information between the light intensity and the bipolar signal as a function of the threshold. As above, we consider an input distribution with just two light intensities, 0 and 2ρ, each with probability 1/2. P(y, x) has four terms: P(0, 0) = (1/2)[1 − q(0)], P(1, 0) = (1/2) q(0), P(0, 2ρ) = (1/2)[1 − q(2ρ)], and P(1, 2ρ) = (1/2) q(2ρ). The mutual information therefore becomes a sum over four terms,

I_MRHO = (1/2)[1 − q(0)] { log₂ [1 − q(0)] − log₂ [1 − (1/2) q(0) − (1/2) q(2ρ)] }
       + (1/2) q(0) { log₂ q(0) − log₂ [(1/2) q(0) + (1/2) q(2ρ)] }
       + (1/2)[1 − q(2ρ)] { log₂ [1 − q(2ρ)] − log₂ [1 − (1/2) q(0) − (1/2) q(2ρ)] }
       + (1/2) q(2ρ) { log₂ q(2ρ) − log₂ [(1/2) q(0) + (1/2) q(2ρ)] }.

This measure, labeled IMRHO, is our third performance criterion for setting the threshold. The mutual information has a maximum as a function of the threshold level (see Figure 2C, solid line). Like the other criteria, the mutual information deteriorates when the signal is thresholded only after the rod signals have been summed (not shown). For sharp thresholds, the IMRHO is very similar to the SNR; in the case that the discrimination is done between ρ and ρ + δρ with small δρ, one can show by expansion in δρ that they are identical.

Above, the information between light level and bipolar was used. Alternatively, one can optimize the mutual information between the actual photon signal and the bipolar signal; the photon signal is given by a Poisson process dependent on the light level. After all, one can argue that the threshold should care only about the photons that are actually absorbed by the rod. We term this criterion IMROD. If just a single rod is connected to the bipolar, x describes the photon signal and y the bipolar output; both x and y take values zero and one only. This does not mean the noise in the rod is ignored: it is captured in α and β. P(y, x) now has the terms P(0, 0) = (1 − ρ)(1 − α), P(1, 0) = (1 − ρ)α, P(0, 1) = ρβ, and P(1, 1) = ρ(1 − β). This yields

I_MROD = (1 − ρ) α { log₂ α − log₂ [α + ρ(1 − α − β)] }
       + (1 − ρ)(1 − α) { log₂ (1 − α) − log₂ [1 − α − ρ(1 − α − β)] }
       + ρ β { log₂ β − log₂ [1 − α − ρ(1 − α − β)] }
       + ρ (1 − β) { log₂ (1 − β) − log₂ [α + ρ(1 − α − β)] }.

When instead of one rod, N rods converge onto the bipolar cell, α_N and β_N should be used, and ρ should be replaced by ρN. This second mutual information measure reaches much higher values. This is understandable because, unlike IMRHO, it lacks the Poisson process that links the light level to actual photons; in the Poisson process, a lot of information is lost.
Because the threshold setting does not affect the transformation of light level into absorbed photons, one could expect that both information measures have a similar dependence on the threshold. But this variant predicts consistently a lower threshold value (see Figure 2C, dashed curve).
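The three criteria of this section can be compared directly in code. The sketch below uses our own helper names, the rod noise values of Field & Rieke (2002b), and rewrites the four-term IMRHO sum in its equivalent binary-entropy form; it computes q(ρ), the SNR, IMRHO, and IMROD, and locates the optimal threshold of each by a scan:

```python
from math import erf, sqrt, log2

SIGMA_D, SIGMA_A = 0.27, 0.33

def alpha(theta):
    return 0.5 * (1 - erf(theta / (sqrt(2) * SIGMA_D)))

def beta(theta):
    return 0.5 * (1 + erf((theta - 1) / sqrt(2 * (SIGMA_D**2 + SIGMA_A**2))))

def alpha_n(theta, n):
    return 1 - (1 - alpha(theta))**n

def beta_n(theta, n):
    return beta(theta) * (1 - alpha(theta))**(n - 1)

def q(rho, theta, n):
    """Mean bipolar output: false positives plus detected photons."""
    return alpha_n(theta, n) + rho * n * (1 - alpha_n(theta, n) - beta_n(theta, n))

def snr(theta, rho, n):
    """SNR for discriminating dark (rho_1 = 0) from 2*rho."""
    q1, q2 = q(0.0, theta, n), q(2 * rho, theta, n)
    den = q1 * (1 - q1) + q2 * (1 - q2)
    return 2 * (q1 - q2)**2 / den if den > 0 else 0.0

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p <= 0 or p >= 1 else -p * log2(p) - (1 - p) * log2(1 - p)

def imrho(theta, rho, n):
    """Mutual information between light level (0 or 2*rho, each prob. 1/2)
    and binary bipolar output; entropy form of the four-term sum."""
    q1, q2 = q(0.0, theta, n), q(2 * rho, theta, n)
    return h2(0.5 * (q1 + q2)) - 0.5 * (h2(q1) + h2(q2))

def imrod(theta, rho, n):
    """Mutual information between photon signal and bipolar output,
    with alpha_N, beta_N and rho replaced by rho*N."""
    a, b, r = alpha_n(theta, n), beta_n(theta, n), rho * n
    joint = {(0, 0): (1 - r) * (1 - a), (1, 0): (1 - r) * a,
             (0, 1): r * b, (1, 1): r * (1 - b)}
    px = {0: 1 - r, 1: r}
    py = {y: joint[(y, 0)] + joint[(y, 1)] for y in (0, 1)}
    return sum(p * log2(p / (px[x] * py[y]))
               for (y, x), p in joint.items() if p > 0)

thetas = [i * 0.02 for i in range(1, 151)]
t_imrod = max(thetas, key=lambda t: imrod(t, 1e-4, 10))
t_snr = max(thetas, key=lambda t: snr(t, 1e-4, 10))
```

Consistent with the text, IMROD prefers a lower threshold than the SNR (and the nearly identical IMRHO) at this low light level.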
4 Optimal Threshold Levels

We have seen that the different performance criteria can yield different optimal threshold values. To gain a better understanding, we examined the optimal threshold as stimulus parameters are varied. The first observation is that for the binary transfer function, the SNR and IMRHO predict very similar thresholds. Figure 3A shows the optimal threshold for all criteria as the light level is varied. For high light levels, all approaches yield an optimal threshold close to 0.5 (although the approximations are expected to break down when ρN ≈ 1). In practice, the minimal light level is limited by the dark light to some 10^-3 events/rod/integration time, although behavioral responses can persist at even lower light levels. At these light levels, the different approaches are still quite similar. To expose the differences more clearly, we purposefully neglected the spontaneous events and considered unrealistically low light levels. The threshold according to the SNR and ER is roughly linear in the log of the light level. The threshold value from IMROD is lower than for the SNR or ER. The intuition is that the mutual information approach prefers lower threshold values because a low threshold yields a richer output distribution, although this increases the error rate. Next, we examined the dependence on the number of rods converging on the bipolar cell. The threshold values depend only weakly on the number of rods (see Figure 3B); with an increasing number of rods, the thresholds come closer together. Finally, the thresholds depend on the noise in the rods (see Figure 3C): the lower the noise, the smaller the threshold. This is easily understood in the zero-noise limit, where the discrimination is easy and a threshold of 1/2 would be best according to all criteria. For high noise, the ER threshold is proportional to σ² (as shown above), whereas the IMROD threshold increases linearly with σ.
Interestingly, for high noise, the optimal SNR threshold decreases after an initial increase. As stated above, rod responses contain spontaneous rhodopsin isomerization events, which have not been included so far. Effectively, these introduce additional false positives: the false-positive rate becomes α_SP = (1 − ρ_SP)α + ρ_SP(1 − β), where ρ_SP is the spontaneous isomerization rate measured in events per rod per time step. These events affect the various performance criteria differently. The ER predicts a higher threshold when the spontaneous events are included (see Figure 3D). In fact, the optimum threshold diverges when the mean light level approaches the spontaneous event level (see also equation 3.3). For light levels less than the spontaneous rate, there is no optimal threshold; the curve in Figure 2A has no minimum. Indeed, the fewest errors are then made when the output is always zero. In contrast, the other measures have a finite optimal threshold for light levels lower than the spontaneous rate. For the SNR, this is easily understood: in the presence of a spontaneous rate, a discrimination task has to discriminate between ρ_1 + ρ_SP and ρ_2 + ρ_SP. Hence, the optimal
Figure 3: Dependence of the optimal threshold according to different performance criteria on various parameters. Dashed line: the number-of-errors criterion (ER); dotted line: the signal-to-noise ratio (SNR) and the mutual information between light level and bipolar response (IMRHO) (overlapping); solid line: the mutual information between rod and bipolar response (IMROD). (A) Optimal thresholds as a function of the light level. Ten rods; noise as in Field & Rieke (2002b). (B) Optimal threshold value as a function of the number of rods. Light level of 10^-4 events per rod; noise as in Field & Rieke (2002b). (C) Dependence of the threshold value on the noise level in the rod. In this simulation, a simplified noise model was used in which the rod noise was independent of photon absorption (i.e., σ_A = 0). 10 rods; ρ = 10^-4. (D) Effect of spontaneous events on the optimal threshold. The spontaneous event rate was 10^-3 events per rod per time step; other parameters as in A. Notably, the threshold based on the number of errors diverges when the light level is less than the spontaneous rate.
threshold shifts as if the light level were higher, namely ρ + ρ_SP rather than ρ. We tested whether the precise value of the time bin is important. In particular, the mutual information and its optimal threshold could depend
on the resolution of the sampling. We doubled the duration of the time bin. This manipulation doubles ρ but also leads to different values of α and β. For light levels below 0.01 events per second, there is no noticeable difference in the threshold for either ER or SNR. Only for the mutual information IMROD did we see a slightly higher threshold (1.02, compared to 0.99 for the original time bin; ρ = 10^-4, N = 10).

5 Simulated Rod-Bipolar Pathway

A priori it is not obvious which performance criterion should be used to set the synaptic threshold; all the presented methods seem valid. To tackle this question, we simulated how the threshold setting would change the output of the bipolar system, as it is likely that the different thresholds would lead to different visual percepts. These simulations allow us to examine the bipolar pathway as a function of the threshold level. The simulations consist of the following steps:

1. An image was split into rectangular regions, each corresponding to the receptive field of a single bipolar cell. The gray scale of each pixel was extracted and multiplied by the mean light level to obtain the pixel's light level.
2. A Poisson process with a rate given by the light level determined whether a rod absorbed a photon.
3. Gaussian noise was added to the rod response.
4. The nonlinear transfer function was applied to the rod response to mimic the synapse.
5. The transformed rod responses were summed in the bipolar.

We repeated this procedure for each bipolar cell, and the final output picture was averaged over many trials. This averaging mimics the pooling by the amacrine cells. Finally, we applied histogram equalization to the output picture using image processing software. This smooth, monotonic transformation improved visibility and gave the images a similar appearance despite very different mean output levels; without it, the images easily saturate or become very dark.
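The five simulation steps can be sketched end to end. The sketch below is a toy version under our own naming; it replaces the image and the histogram equalization with a short list of per-pixel light levels and raw trial averaging, and uses an arbitrary bright-pixel rate for illustration:

```python
import math
import random

def sample_poisson(rate, rng):
    """Inverse-transform Poisson sampling; adequate for the small rates used here."""
    u, n, p = rng.random(), 0, math.exp(-rate)
    c = p
    while u > c:
        n += 1
        p *= rate / n
        c += p
    return n

def bipolar_response(light_level, rng, n_rods=10, theta=1.0,
                     sigma_d=0.27, sigma_a=0.33):
    """One trial of steps 2-5 for a single bipolar cell."""
    y = 0
    for _ in range(n_rods):
        n = sample_poisson(light_level, rng)                      # step 2: absorption
        x = rng.gauss(n, math.sqrt(sigma_d**2 + n * sigma_a**2))  # step 3: rod noise
        y += 1 if x >= theta else 0                               # step 4: threshold
    return y                                                      # step 5: sum

def render(pixel_light_levels, trials=2000, seed=0, **kwargs):
    """Step 1 plus trial averaging: one bipolar per pixel, averaged over trials
    (the averaging stands in for amacrine pooling)."""
    rng = random.Random(seed)
    return [sum(bipolar_response(level, rng, **kwargs) for _ in range(trials)) / trials
            for level in pixel_light_levels]

# a two-pixel "image": one dark pixel and one brighter pixel
image = render([0.0, 0.01], trials=4000, seed=1)
```

Even at these sparse light levels, the brighter pixel produces a larger trial-averaged bipolar output, which is what makes the rendered images interpretable.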
In the retina, such transformations can be performed by the circuitry of the amacrine cells and further downstream. It is important to note that this simulated pathway is only an approximation of the real one: the number of trials used here is much higher than the number of bipolars connected to the amacrine cell, and the bipolar-to-amacrine synapse might also contain a threshold, as the signals there are still quite sparse. These effects could change the results. Unfortunately, they are hard to study given our limited knowledge of processing by these circuits at the lowest light levels.
Figure 4: Simulated rod-bipolar image processing. (A) Original image. (B) Simulated images with a synaptic threshold level of 1.03 (optimal for maximizing the mutual information, IMROD), 1.33 (SNR and IMRHO), and 1.38 (ER). The threshold according to the SNR gives a better-quality image than the threshold according to IMROD. Average over 50,000 samples; mean light level 10^-5; 10 rods; noise according to Field & Rieke (2002b). (C) Same as in B except that the rod noise is higher. Now minimizing the error rate (rightmost image) does not lead to a clear picture, whereas the IMROD criterion performs decently for these parameters. The threshold settings were θ = 1.12 (IMROD), 1.66 (SNR), and 2.78 (ER). In combination with A, the SNR and IMRHO consistently yield the clearest image. σ_D = 0.5; σ_A = 0; ρ = 10^-4; 10 rods; average over 50,000 samples.
We applied the simulation to an input image with a high- and a low-contrast letter and a somewhat natural scene (see Figures 4A and 4B). The light level was deliberately chosen very low to emphasize the differences between the criteria. The low threshold level predicted by IMROD leads to a high false-positive rate. As a result, the image is not very clear, and
low-contrast boundaries are hard to see. Setting the threshold according to the SNR (and the similar IMRHO) yields a clearer picture. The ER also yields a good output in this case, which is expected, as the predicted thresholds are very close. In the above situation, the SNR and the ER criteria predict a very similar threshold level (see also Figure 3A). To further distinguish between the thresholds predicted by the SNR and the ER criteria, we simulated a situation with high rod noise, σ_D = 0.5, σ_A = 0. In this case, the threshold according to the SNR is lower (see Figure 3C). Now the ER method performs worse, but the SNR still yields a good image (see Figure 4C). Combined, these results indicate that for the sharp-threshold synapse, maximizing the SNR, or the almost identical IMRHO, consistently gives the best images in our simulation.

6 Performance with a Sigmoidal Synaptic Transfer Function

So far, a hard threshold function has been imposed. We wondered whether a smoother threshold function would yield different results. We have not tried to derive the best possible transfer function; instead, a variable slope parameter κ was added to the transfer function to make it a logistic function, g(x) = [1 + exp(−(x − θ)/κ)]^{-1}. In the limit κ → 0, the sharp threshold is recovered. When the transfer function is the smooth logistic function, the output of the bipolar becomes continuously valued. Both the SNR and the mutual information measures are easily calculated numerically when the transfer function is soft; this simply requires a discretization of the bipolar output into sufficiently small bins. The total number of errors, ER, is slightly ambiguous in this case. We define it as follows: when no photon was absorbed but the bipolar voltage was larger than 1/2, a false positive was counted, whereas in the opposite case, a false negative was counted. We optimized the synapse by varying the slope of the transfer function in addition to the threshold value.
Analytical treatment becomes intractable in this case, so we rely on numerical evaluation. The results are shown in Figure 5. The ER depends only very weakly on the smoothness (not shown), but the SNR and both mutual information measures improve when a smooth rather than a sharp transfer function is used. This can be seen by comparing the values at κ = 0 (the hard threshold) with those at nonzero κ. On the other hand, the transfer function should not be made too smooth, κ ≳ 1; otherwise, the performance decreases. In this limit of large κ, the synaptic transfer function mimics a linear one, which, as shown above, has poor performance. The optimal value for κ is close to σ², as expected from equation 3.2. The optimal value for the threshold θ, which now describes the transfer function's midpoint, increases for smoother transfer functions. Both IMRHO and IMROD have a broad plateau with a very shallow maximum.
Figure 5: Performance criteria as a function of both the threshold θ and the inverse slope κ of the synaptic transfer function. (A, C) IMROD and IMRHO increase with a smoother synapse and have broad maxima. (B) The SNR has a sharp maximum and decreases as the transfer function is made much smoother (higher inverse slope). Parameters as in Figure 4A.
Figure 6: Simulated images when the synaptic transfer function is smooth. The optimal synapse settings were IMROD: θ = 1.17, κ = 0.14; SNR: θ = 1.37, κ = 0.06; IMRHO: θ = 1.36, κ = 0.11; and ER: θ = 1.38, κ = 0. The other parameters as in Figure 4A.
The SNR has a sharper profile, with a maximum at rather low κ; its optimal threshold shifts upward by about 0.05. The values at the optima are given in Figure 6. Although the smoother synaptic transfer function increases the performance, inspection of the simulated images shows no obvious improvement (see Figure 6). This is not surprising, as the increase is only some 20%, which is too small to yield significantly improved images. Likewise, the IMROD criterion still produces an image with many false positives. Another effect of the smooth transfer function is that the SNR and IMRHO no longer predict the same optimal threshold; however, this difference is too small to be visible in the images (compare SNR to IMRHO in Figure 6).

7 Discussion

We have studied the signal transfer in the first synapse of the visual system at low light levels. The presence of noise in the rods necessitates a strong nonlinearity in the synapse, as otherwise the continuously present noise
from the other rods would swamp the signal from a rod receiving a photon. We considered a variety of performance criteria that could be used to tune the synapse. At higher light levels, when the signal is not extremely sparse, the predicted thresholds are similar, but at low light levels, the optimal threshold is sensitive to the choice of criterion. There is no principled a priori choice among the criteria (Basseville, 1989), and our results show that their predictions for the threshold are quite different at the lowest light levels. Which threshold, then, does the pathway use, and which threshold setting is the best? One possibility is to use the results from the simulated pathway, although these should be interpreted with care given the uncertainties in the circuitry. In the images, the signal-to-noise ratio (SNR) and the mutual information between light level and output (IMRHO) consistently yield the best performance. The error rate (ER), equivalent to Bayesian inference, is a reasonable criterion when the rod noise is close to the measurements of Field & Rieke (2002b), but it predicts too high a threshold when the rod noise is larger. Interestingly, for good performance of the synapse, the mutual information needs to be calculated between the light level and the output (IMRHO), not between the photon signal and the output (IMROD); otherwise, the predicted threshold is too low, and the pathway can have very poor performance. Although this is not in conflict with information theory, it is a somewhat unexpected effect. We have also explicitly included the effects of spontaneous events and rod noise on the optimal synapse parameters. Finally, we considered a smooth transfer function and found that it is slightly better than a sharp one, but the difference is small. An interesting question is how biology tunes and adapts the synapse according to the light level; we discussed some candidate mechanisms earlier (van Rossum & Smith, 1998).
This remains an outstanding issue both experimentally and theoretically, especially as the biology would need to optimize a quite noisy cost function. In the experiments, the threshold did not seem to adapt to the light level of the flash (Field & Rieke, 2002b), but the nonlinearity became weaker at higher background levels, increasing the response to flashes (Sampath & Rieke, 2004).

7.1 Comparison to Earlier Work and Data.

We can compare our results to the data. Given the desired performance criterion, the current study predicts which threshold to expect at a given light level and noise level. Experimentally, however, the threshold and the other parameters (noise and convergence ratio) are hard to access. In the experiments, the transfer function and its threshold were not measured directly but were inferred from the dependence of the mean and variance of the bipolar flash responses at higher light levels of about ρ ≈ 1. In Field & Rieke (2002b), the experimental data were described well when the transfer function was assumed to be a linear function with a step, that is, g(x) = x/[1 + exp(−(x − θ)/κ)]. The stimulus was a flash (about 1 Hz)
at a flash intensity of 10^-4 Rh*; the mean light level was therefore some 10^-5 events per integration time. (We think it is more likely that the threshold adapts to the mean light level, not to the flash strength.) The experimentally observed threshold level was 1.3. The optimal threshold for these parameters according to the SNR is 1.37, with an inverse slope of 0.06 (see Figure 5). The observed inverse slope in the data, κ = 0.1, was close to the prediction from maximizing the SNR. As was already noted in Field & Rieke (2002b), the SNR gives a good prediction of the threshold. However, caution is required: vesicle release noise could effectively increase the rod noise, necessitating a higher threshold (see Figure 3C), whereas temporal filtering might reduce the noise. It is also not clear how the presumably omnipresent spontaneous events are consistent with these findings; including them would reduce the optimal threshold level (see Figure 3D). In Berntson et al. (2004), however, the synaptic transfer was found to saturate when more than one photon was absorbed, as was assumed in this study. The reason for the discrepancy between the two experiments is not clear. For the current study, the difference in the actual shape of the transfer function is negligible, as the chance of multiple photon absorptions in the rod is very small in the considered regime. However, it is likely that the different assumptions about the transfer function change the estimate for the threshold. This second experimental study found a threshold of 0.85 (Berntson et al., 2004), which seems to fit this study better when realistic spontaneous event rates are included. Although in principle this study gives explicit predictions for the synaptic transfer function and its dependence on light level, the experiments and noise measurements are not sensitive enough to decide which performance criterion the biological synapse follows.
7.2 Relevance to Other Systems. This study seems to deal with a particular circuit and circumstances: the bipolar pathway at very low light levels. In order for the threshold to have a beneficial effect, the following conditions should hold: (1) the signal should be sparse (i.e., only one or a few inputs out of many are simultaneously active), (2) the signal should be discrete, and (3) all inputs should carry noise in the absence of a signal. However, the problem of detecting a sparse binary signal amid gaussian noise could be of much more general relevance. Consider, for instance, a view-invariant face cell in the cortex receiving many inputs, each of them active only when the face is seen from a particular angle. Given the ongoing spontaneous activity in neurons, such a system could need thresholding similar to that of the rod-bipolar pathway, in particular when the receptive field of the invariant cell is much wider than the tuning of the cells providing inputs. It is not clear how far this analogy holds. Nonlinearities in the synaptic transfer, such as synaptic facilitation (Varela et al., 1997), seem too weak to provide the required nonlinearities. Another possibility is that by using
The Optimal Synapse
population coding, the system spreads the signal out over many inputs, relieving this problem.

Acknowledgments

We thank Alexander Heimel, Fred Rieke, Chris Williams, and Robert Smith for insightful comments. Efstathios Politis helped in the initial phase of this project. The method of simulating low-light-level images was inspired by work of Andrew Hsu.
References

Armstrong-Gold, C. E., & Rieke, F. (2003). Bandpass filtering at the rod to second-order cell synapse in salamander (Ambystoma tigrinum) retina. J. Neurosci., 23, 2796–2806.
Barlow, H. B., Levick, W. R., & Yoon, M. (1971). Responses to single quanta of light in retinal ganglion cells of the cat. Vision Research Supplement, 3, 87–101.
Basseville, M. (1989). Distance measures for signal processing and pattern recognition. Signal Processing, 18, 349–369.
Baylor, D. A., Lamb, T., & Yau, K. W. (1979). Responses of retinal rods to single photons. J. Physiol., 288, 237–253.
Baylor, D. A., Nunn, B. J., & Schnapf, J. L. (1984). The photocurrent, noise and spectral sensitivity of rods of the monkey Macaca fascicularis. J. Physiol., 357, 575–607.
Berntson, A., Smith, R. G., & Taylor, W. R. (2004). Transmission of single photon signals through a binary synapse in the mammalian retina. Vis. Neurosci., 21, 693–702.
Bialek, W., & Owen, W. G. (1990). Temporal filtering in retinal bipolar cells: Elements of an optimal computation? Biophys. J., 58, 1227–1233.
Copenhagen, D. R., Donner, K., & Reuter, T. (1987). Ganglion cell performance at absolute threshold in toad retina: Effect of dark events in rods. J. Physiol., 393, 667–680.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Dacheux, R. F., & Raviola, E. (1986). The rod pathway in the rabbit retina: A depolarizing bipolar and amacrine cell. J. Neurosci., 6, 331–345.
Field, G. D., & Rieke, F. (2002a). Mechanisms regulating variability of the single photon response of mammalian rod photoreceptors. Neuron, 35, 733–747.
Field, G. D., & Rieke, F. (2002b). Nonlinear signal transfer from mouse rods to bipolar cells and implications for visual sensitivity. Neuron, 34, 773–785.
Field, G. D., Sampath, A. P., & Rieke, F. (2005). Retinal processing near absolute threshold: From behavior to mechanism. Ann. Rev. Physiol., 67, 491–514.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Grünert, U., Martin, P. R., & Wässle, H. (1994). Immunocytochemical analysis of bipolar cells in the macaque monkey retina. Journal of Comparative Neurology, 348, 607–627.
Mackay, D. J. C. (2003). Information theory, inference and learning algorithms. Cambridge: Cambridge University Press.
Mastronarde, D. N. (1983a). Correlated firing of cat retinal ganglion cells: I. Spontaneously active inputs to X- and Y-cells. J. Neurophysiol., 49, 303–324.
Mastronarde, D. N. (1983b). Correlated firing of cat retinal ganglion cells: II. Responses of X- and Y-cells to single quantal events. J. Neurophysiol., 49, 325–349.
Rao, R., Buchsbaum, G., & Sterling, P. (1994). Rate of quantal transmitter release at the mammalian rod synapse. Biophys. J., 67, 57–64.
Sampath, A. P., & Rieke, F. (2004). Selective transmission of single photon responses by saturation at the rod-to-rod bipolar synapse. Neuron, 41, 431–443.
Schneeweis, D. M., & Schnapf, J. L. (1995). Photovoltage in rods and cones in the macaque retina. Science, 268, 1053–1056.
Soucy, E., Wang, Y., Nirenberg, S., Nathans, J., & Meister, M. (1998). A novel signaling pathway from rod photoreceptors to ganglion cells in mammalian retina. Neuron, 21, 481–493.
Sterling, P., & Demb, J. B. (2004). Retina. In G. M. Shepherd (Ed.), Synaptic organization of the brain. New York: Oxford University Press.
Sterling, P., Freed, M., & Smith, R. G. (1988). Architecture of rod and cone circuits to the on-beta ganglion cell. J. Neurosci., 8, 623–642.
Tsukamoto, Y., Morigiwa, K., Ueda, K., & Sterling, P. (2001). Microcircuits for night vision in the mouse retina. J. Neurosci., 21, 8616–8623.
van Rossum, M. C. W., & Smith, R. G. (1998). Noise removal at the rod synapse of mammalian retina. Vis. Neurosci., 15, 809–821.
Van Trees, H. L. (1968). Detection, estimation, and modulation theory I. New York: Wiley.
Varela, J. A., Sen, K., Gibson, J., Fost, J., Abbott, L. F., & Nelson, S. (1997). A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. J. Neurosci., 17, 7926–7940.
Völgyi, B., Deans, M. R., Paul, D. L., & Bloomfield, S. A. (2004). Convergence and segregation of the multiple rod pathways in mammalian retina. J. Neurosci., 24, 11182–11192.
Walraven, J., Enroth-Cugell, C., Hood, D. C., MacLeod, D. I. A., & Schnapf, J. L. (1990). The control of visual sensitivity. In L. Spillmann & J. S. Werner (Eds.), Visual perception: The neurophysiological foundations (pp. 53–101). San Diego, CA: Academic Press.
Received September 20, 2004; accepted April 26, 2005.
LETTER
Communicated by Emilio Salinas
Simultaneous Rate-Synchrony Codes in Populations of Spiking Neurons Naoki Masuda
[email protected] Laboratory for Mathematical Neuroscience, RIKEN Brain Science Institute, Wako, Japan, and ERATO Aihara Complexity Modelling Project, Japan Science and Technology Agency, Tokyo, Japan
Firing rates and synchronous firing are often simultaneously relevant signals, and they independently or cooperatively represent external sensory inputs, cognitive events, and environmental situations such as body position. However, how rates and synchrony comodulate and which aspects of inputs are effectively encoded, particularly in the presence of dynamical inputs, are unanswered questions. We examine theoretically how mixed information in dynamic mean input and noise input is represented by dynamic population firing rates and synchrony. In a subthreshold regime, amplitudes of spatially uncorrelated noise are encoded up to a fairly high input frequency, but this requires both rate and synchrony output channels. In a suprathreshold regime, means and common noise amplitudes can be simultaneously and separately encoded by rates and synchrony, respectively, but the input frequency for which this is possible has a lower limit.

1 Introduction

Both synchrony and firing rates seem to play important roles in sensory, cognitive, and motor behavior, although the precise role of synchrony is still unclear. Synchrony levels and firing rates can be dynamically and simultaneously modulated by different signals. For example, monkeys during motor tasks are suggested to encode behavioral events and cognitive events with firing rates and synchrony, respectively (Riehle, Grün, Diesmann, & Aertsen, 1997). Rates and synchrony respectively encode task-related signals and expectation during visual tasks (de Oliveira, Thiele, & Hoffmann, 1997). Firing rates and oscillatory synchrony represent the identity (difference) and the category (overlap) of odor stimulus patterns in zebrafish olfactory systems (Friedrich, Habermann, & Laurent, 2004).
If a more general temporal code with precise spike timing is taken into account, rates and spike time may simultaneously represent stimulus identity and external time (Berry, Warland, & Meister, 1997) or movement speed and body position (Huxter, Burgess, & O’Keefe, 2003). Theoretically, such simultaneous
Neural Computation 18, 45–59 (2006)
© 2005 Massachusetts Institute of Technology
codes have not been sufficiently studied because many factors interact with synchrony and firing rates. For example, synchronous inputs raise firing rates (Shadlen & Newsome, 1998; Burkitt & Clark, 2001; Salinas & Sejnowski, 2001; Moreno, de la Rocha, Renart, & Parga, 2002; Tiesinga & Sejnowski, 2004), and increased firing rates can decrease synchrony (Brunel, 2000; Burkitt & Clark, 2001). Furthermore, a theory must link dynamic inputs and outputs to these experimental configurations. Another complication is input modalities. The nature of effective inputs is not trivial. In static regimes, overall levels of balanced excitation and inhibition, which can be considered the input noise amplitude, determine firing rates, particularly in the subthreshold regime (Shadlen & Newsome, 1994, 1998). Experimental (Chance, Abbott, & Reyes, 2002) and theoretical (Tiesinga, José, & Sejnowski, 2000; Burkitt, Meffin, & Grayden, 2003; Kuhn, Aertsen, & Rotter, 2004) studies of gain modulation also support the coding of noise amplitudes as firing rates, although specific input-output relations change considerably if conductance inputs are considered. Other firing properties such as the coefficient of variation have also been examined in detail in this subthreshold noise-driven regime (Shadlen & Newsome, 1998; Rudolph & Destexhe, 2003; Kuhn et al., 2004). When dynamic inputs are considered, firing rates represent noise variance up to a high cutoff frequency (Lindner & Schimansky-Geier, 2001; Silberberg, Bethge, Markram, Pawelzik, & Tsodyks, 2004) and abrupt changes in mean inputs (Herrmann & Gerstner, 2001; Moreno et al., 2002), as well as conventional deterministic slow inputs (Knight, 1972). The cited articles focus on single postsynaptic neurons. Populations were assumed to consist of multiple neurons whose incident noise is independent for each neuron, which is unrealistic (Shadlen & Newsome, 1994, 1998; Salinas & Sejnowski, 2001).
Neurons generally share inputs from upstream neurons because of divergent connectivity. Fluctuation of local field potentials is another major source of shared noise inputs (Huxter et al., 2003). Such common inputs usually limit the amount of information in population firing rates (Shadlen & Newsome, 1994, 1998; Litvak, Sompolinsky, Segev, & Abeles, 2003), whereas they induce synchronous firing (Mainen & Sejnowski, 1995; Reyes, 2003). Although these are established results, the interactions of common noise with spatially independent noise and mean inputs and the consequences of dynamic inputs are poorly understood. In this letter, we examine how neural populations encode dynamic inputs in dynamic patterns of firing rates and synchrony. We consider dynamic inputs comprising independently changing biases, noise different for each neuron, and common noise. Relevant codes are shown to vary according to the baseline input bias and the input frequency. In section 2, to illuminate the influence of the bias and the input frequency, we treat a theoretically tractable case in which common noise inputs are absent. Section 3 treats common noise inputs, and the information that dynamical synchrony and firing rates carry about the input is determined numerically.
Simultaneous Rate-Synchrony Codes in Spiking Neurons
The relevance of the results to experiments and other theories is discussed in section 4.

2 Coding in the Absence of Common Noise Inputs

We assume an uncoupled population of n leaky integrate-and-fire (LIF) neurons with spatially uniform dynamic inputs. The dynamics of the ith neuron is described by

τ dVi(t)/dt = −Vi(t) + µi(t),  (2.1)
where τ, Vi(t), and µi(t) are the membrane time constant, the membrane potential, and the external input, respectively. The neuron fires when Vi reaches the threshold θ, and then Vi is instantaneously reset to the resting potential Vr. We use LIF neurons instead of more realistic conductance-based models to facilitate mathematical analysis. Coupling between neurons is neglected, but its effect can be readily understood, as discussed in section 4. We start with the combination of a deterministic common mean input denoted by µ(t) and an independent noise source that is different for each neuron and whose amplitude is modulated according to σu(t) (Silberberg et al., 2004). Consequently, we write

µi(t) = µ0 + µ(t) + σu(t) √τ ηi(t),  (2.2)
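Equations 2.1 and 2.2 can be simulated directly with an Euler-Maruyama scheme. The sketch below is a minimal, illustrative implementation (the function and parameter names are ours, not the article's); it uses the article's values τ = 10 ms, θ = 1, and Vr = 0.

```python
import math
import random

def simulate_lif(n=50, t_max=0.5, dt=1e-4, tau=0.01, theta=1.0, v_r=0.0,
                 mu0=1.1, mu=lambda t: 0.0, sigma_u=lambda t: 0.03, seed=0):
    """Euler-Maruyama integration of equations 2.1-2.2 for n uncoupled LIF
    neurons; returns a list of (spike time, neuron index) pairs."""
    rng = random.Random(seed)
    v = [v_r] * n
    spikes = []
    for k in range(int(t_max / dt)):
        t = k * dt
        drive = mu0 + mu(t)
        for i in range(n):
            # the white-noise term sigma_u*sqrt(tau)*eta contributes a gaussian
            # increment of standard deviation sigma_u * sqrt(dt/tau) per step
            v[i] += (-v[i] + drive) * dt / tau \
                    + sigma_u(t) * math.sqrt(dt / tau) * rng.gauss(0.0, 1.0)
            if v[i] >= theta:           # threshold crossing: spike and reset
                spikes.append((t, i))
                v[i] = v_r
    return spikes
```

For a suprathreshold bias (µ0 = 1.1) the deterministic interspike interval is τ ln(µ0/(µ0 − θ)) ≈ 24 ms, so the simulated population rate should sit near 42 Hz; for a subthreshold bias with the noise switched off, no spikes occur at all.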
where µ0 is the bias defined so that the temporal average of µ(t) becomes zero. Gaussian white noise, which is independent for each neuron, is denoted by ηi(t). Suprathreshold inputs (µ0 > θ) allow neurons to fire even without noise, and subthreshold inputs (µ0 < θ) must occur with noise for neurons to fire. Let us first ask what is encoded in the asynchronous regime by the population firing rate, which is denoted by ν(t) and normalized by the number of neurons. A Fokker-Planck analysis, in which the probability density of the membrane potentials is derived analytically, reveals how the bias affects the input characteristics encoded in firing rates. Analysis of synchrony and common noise requires numerics and is covered in section 3. The Fokker-Planck equation for the probability density of membrane potentials, denoted by P(V, t), is written as

∂P(V, t)/∂t = (σu²(t)/2τ) ∂²P(V, t)/∂V² + ∂/∂V [ ((V − µ0 − µ(t))/τ) P(V, t) ]
            = −∂S(V, t)/∂V,  (2.3)

where

S(V, t) = −(σu²(t)/2τ) ∂P(V, t)/∂V − ((V − µ0 − µ(t))/τ) P(V, t)  (2.4)

is the probability current. Using the boundary condition P(θ, t) = 0, ν(t) is given by (Brunel, 2000; Lindner & Schimansky-Geier, 2001; Moreno et al., 2002; Silberberg et al., 2004)

ν(t) = S(θ, t) = −(σu²(t)/2τ) ∂P(θ, t)/∂V.  (2.5)
Equation 2.5 underlies the claim in Lindner and Schimansky-Geier (2001) and Silberberg et al. (2004) that σu(t) but not µ(t) is coded in instantaneous firing rates. However, whether ν(t) or its delayed version represents µ(t) or σu(t) depends appreciably on µ0. We deal with the transient response to examine the µ0 effect. Let us denote the stationary density and the stationary firing rate for constant inputs (µ(t) = 0 and σu(t) = σ̄u) by PS(V) ≡ lim_{t→∞} P(V, t) and ν0, respectively. We combine ∂PS/∂t = 0, equations 2.3 and 2.5, and an additional boundary condition caused by the resetting of neurons, a discontinuity of S(V, t) at V = Vr by the amount ν(t). Then we obtain (Brunel, 2000)

PS(V) = (2ν0τ/σ̄u) exp(−(V − µ0)²/σ̄u²) ∫_{(V−µ0)/σ̄u}^{(θ−µ0)/σ̄u} Θ(u − (Vr − µ0)/σ̄u) e^{u²} du,  (2.6)

where Θ is the Heaviside function, and the stationary firing rate ν0 is

ν0 = [ 2τ ∫_{(Vr−µ0)/σ̄u}^{(θ−µ0)/σ̄u} du e^{u²} ∫_{−∞}^{u} e^{−x²} dx ]^{−1} = [ √π τ ∫_{(Vr−µ0)/σ̄u}^{(θ−µ0)/σ̄u} du e^{u²} (1 + erf(u)) ]^{−1}.  (2.7)
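Equation 2.7 can be evaluated numerically. The sketch below (our own helper, not part of the article) uses the identity ∫_{−∞}^{u} e^{−x²} dx = (√π/2) erfc(−u) and a simple trapezoidal rule; for large negative u it switches to the leading asymptotic term of e^{u²} erfc(−u) so that exp(u²) never overflows.

```python
import math

def nu0(mu0, sigma, theta=1.0, v_r=0.0, tau=0.01, n_steps=4000):
    """Stationary LIF firing rate from equation 2.7 (the Siegert formula)."""
    lo = (v_r - mu0) / sigma
    hi = (theta - mu0) / sigma

    def f(u):
        # integrand e^{u^2} (1 + erf(u)) = e^{u^2} erfc(-u)
        if u < -6.0:
            # leading asymptotic term 1/(sqrt(pi)|u|), avoids exp overflow
            return 1.0 / (math.sqrt(math.pi) * (-u))
        return math.exp(u * u) * math.erfc(-u)

    du = (hi - lo) / n_steps
    s = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * du) for i in range(1, n_steps))
    return 1.0 / (tau * math.sqrt(math.pi) * s * du)
```

In the low-noise suprathreshold limit this reduces to the deterministic rate 1/[τ ln(µ0/(µ0 − θ))], while in the subthreshold regime the rate grows with the noise amplitude, as the discussion below uses.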
According to equation 2.3, P(V, t) changes on the timescale of τ when the inputs are nonstationary. For a transient period not much longer than τ, the first-order approximation yields

∂P(V, t)/∂t ≅ (σu²(t)/2τ) ∂²PS/∂V² + ∂/∂V [ ((V − µ0 − µ(t))/τ) PS ]
           = ((σu²(t) − σ̄u²)/(τ σ̄u²)) ∂/∂V [ (−V + µ0) PS ] − (µ(t)/τ) ∂PS/∂V,  (2.8)
Figure 1: The membrane potential distribution of LIF neurons in the stationary state when (A) µ0 = 0.9 and (B) µ0 = 1.1. The plots are of numerical simulations of n = 300 neurons (steps), a theoretical prediction based on equation 2.5 (solid lines), and a theoretical prediction under the continuous boundary conditions (dotted lines).
where we have used ∂PS/∂t = 0 and equation 2.3. Slow responses in ν(t) can be estimated by

dν(t)/dt = −(1/2τ) (dσu²(t)/dt) ∂P(θ, t)/∂V − (σu²(t)/2τ) ∂/∂t [∂P(θ, t)/∂V].  (2.9)
The first term of equation 2.9 is a consequence of the instantaneous change in ν(t) induced by the change in σu (t). Regarding the second term, we cannot exchange the derivative with respect to V and that with respect to t. This is because equation 2.8 holds only for V < θ . At V = θ , equation 2.3 is singular because of the boundary conditions. The probability current S(θ, t) is actually determined by adjusting the amount of discontinuity of S(V, t) at V = Vr so that P(θ, t) is always pinned at 0 (Brunel, 2000). Although this technicality stems from the specific assumptions (especially P(θ, t) = 0), different boundary conditions or different spiking neuron models do not qualitatively change the situation. For example, we could permit Vi to jump from V = Vr back to V = θ as a noise effect because the hard thresholding and discontinuous resetting of the LIF neuron are just approximations to real neurons. Then P(θ, t) could be nonzero. In Figures 1A (µ0 = 0.9) and 1B (µ0 = 1.1), we compare the stationary membrane potential distributions obtained numerically (steps), by equation 2.5 (solid lines), and by the modified theory outlined above (dotted lines). Figure 1 indicates that this type of modification has little effect on P(V, t) and ν(t). Consequently, we proceed with the original formalization
to evaluate ∂P(θ − V, t)/∂t with V ≪ θ. Because P(θ, t) = 0, an increase in P(θ − V, t) with respect to t means a more negative slope of P at V = θ, or a larger firing rate. We substitute PS(θ − V) ≅ PS(θ) into equation 2.8 to obtain

∂P(θ − V, t)/∂t ≅ (2ν0(θ − µ0)/σ̄u⁴) (σu²(t) − σ̄u²) + (2ν0/σ̄u²) µ(t).  (2.10)
The second term guarantees that ν(t) encodes µ(t) with some delay because of the single-neuron dynamics (Knight, 1972; Brunel, 2000; Lindner & Schimansky-Geier, 2001; Gerstner & Kistler, 2002). When µ0 > θ, the quantity represented by equation 2.10 decreases with σu(t) to attenuate the instantaneous response of ν(t) to σu(t) (see equation 2.5), reproducing the high-pass nature of noise-coded signals (Herrmann & Gerstner, 2001; Moreno et al., 2002). A larger bias magnifies the influence of µ(t). However, when µ0 < θ, the delayed response does not counteract the instantaneous response of ν(t) to σu(t). Consequently, σu(t) with either low or high frequency is primarily encoded in ν(t). This finding is consistent with quasistatic arguments. If the inputs are sufficiently slow relative to τ, ν(t) is given by equation 2.7, with µ0 and σ̄u replaced by µ0 + µ(t) and σu(t), respectively. Since e^{u²} ∫_{−∞}^{u} e^{−x²} dx increases monotonically with u, the integral in equation 2.7 decreases with µ(t) for any µ0 and with σu(t) for µ0 + µ(t) < θ. In these cases, ν0 duly represents µ(t) or σu(t). However, when µ0 > θ, the range of integration, [(Vr − µ0 − µ(t))/σu(t), (θ − µ0 − µ(t))/σu(t)], shrinks as σu(t) increases, whereas the magnitude of the integrand increases with σu(t). The trade-off between these two factors determines ν0, which implies that the firing rate depends more weakly on σu(t) than in the subthreshold case. To be more quantitative, we numerically simulate n = 300 uncoupled LIF neurons with τ = 10 ms, θ = 1, and Vr = 0. We update µ(t) and σu(t) with period T (T = 1 ms was used in Silberberg et al., 2004, to generate fast inputs). The amplitude of µ(t) and that of σu(t) are chosen from uniform distributions on [−0.075, 0.075] and [0.030, 0.055], respectively.
We fix these dynamical ranges and also that of common noise, which is incorporated in section 3, because stronger input modulation obviously leads to their better representation by the outputs. We determine ν(t) by the normalized number of spikes in each bin of width T ms corresponding to the input switching. Then the cross-correlation functions between ν(t) and the inputs, corr (µ(t), ν(t)) and corr (σu (t), ν(t)), are used as performance measures. Figure 2A shows the dependence of corr (µ(t), ν(t)) on µ0 and T. As this theory and those of others (Knight, 1972; Lindner & Schimansky-Geier, 2001; Gerstner & Kistler, 2002; Silberberg et al., 2004) predict, corr (µ(t), ν(t)) increases with T and the neural ensemble works as a low-pass filter for µ(t). With T fixed, corr (µ(t), ν(t)) increases with µ0 in accordance with
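The simulation protocol just described can be reproduced in miniature. The sketch below (reduced to n = 30 neurons and 300 input bins so that it runs quickly; all names are ours) drives LIF neurons with a piecewise-constant µ(t) redrawn every T = 5 ms and reports the zero-lag cross-correlation corr(µ(t), ν(t)).

```python
import math
import random

def corr(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def rate_coding_experiment(mu0=1.1, n=30, n_bins=300, T=5e-3, dt=1e-4,
                           tau=0.01, theta=1.0, v_r=0.0, sigma=0.04, seed=1):
    """corr(mu(t), nu(t)) for a piecewise-constant mean input drawn from
    [-0.075, 0.075] every T seconds (noise amplitude held fixed)."""
    rng = random.Random(seed)
    mus = [rng.uniform(-0.075, 0.075) for _ in range(n_bins)]
    v = [v_r] * n
    counts = [0] * n_bins
    for b in range(n_bins):
        for _ in range(int(T / dt)):
            for i in range(n):
                v[i] += (-v[i] + mu0 + mus[b]) * dt / tau \
                        + sigma * math.sqrt(dt / tau) * rng.gauss(0.0, 1.0)
                if v[i] >= theta:
                    counts[b] += 1
                    v[i] = v_r
    return corr(mus, [c / n for c in counts])
```

For a suprathreshold bias this correlation is clearly positive (compare Figure 2A); shortening T or lowering µ0 below θ weakens it, as described in the text.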
Figure 2: Dependence of (A) corr(µ(t), ν(t)) and (B) corr(σu(t), ν(t)) on µ0 and T in the absence of common noise inputs. The results for T = 0.3 ms (thickest lines), 1 ms, 2 ms, 5 ms, and 30 ms (thinnest lines) are shown.
equation 2.10. Figure 2B shows that corr(σu(t), ν(t)) decreases with µ0 except when T is very small. This result is predicted by equations 2.7 and 2.10. It agrees with the prediction of enhanced rate coding by using subthreshold stochastic resonance (Lindner & Schimansky-Geier, 2001), and it extends the observation that the subthreshold regime yields higher gains for static noise inputs (Tiesinga et al., 2000; Chance et al., 2002; Moreno et al., 2002; Burkitt et al., 2003; Kuhn et al., 2004). For an even smaller µ0, a sufficiently large σu(t) that would drive up Vi does not generally last long enough to make neurons fire, and rate coding of σu(t) deteriorates because neurons rarely fire. Rate coding is optimized at a certain µ0 for fast σu(t). The optimal bias is at a slightly subthreshold level. This may be related to the fact that membrane potentials often hover around this level (Shadlen & Newsome, 1994, 1998). An optimal T exists for a range of given µ0, implying a bandpass property for σu(t). The high-pass nature is expected from the theory, whereas very fast σu(t) (T = 0.5 and 1 ms in Figure 2B) cannot be captured by firing rates because n is finite (Brunel, 2000; Lindner & Schimansky-Geier, 2001).

3 Coding in the Presence of Common Noise Inputs

We next apply inputs with common noise represented by

µi(t) = µ0 + µ(t) + σu(t) √τ ηi(t) + σc(t) √τ η(t),  (3.1)
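The input of equation 3.1 can be generated per time step by sharing one noise term across neurons. The sketch below (our own illustration) draws inputs for two neurons and lets one check that the common-noise term induces the expected inter-neuron input correlation σc²/(σc² + σu²).

```python
import math
import random

def make_inputs(n_samples=20000, mu0=1.0, sigma_u=0.02, sigma_c=0.05,
                tau=0.01, seed=2):
    """Per-step samples of equation 3.1 (constant amplitudes, mu(t) = 0) for
    two neurons sharing the common-noise term sigma_c * sqrt(tau) * eta(t)."""
    rng = random.Random(seed)
    a, b = [], []
    for _ in range(n_samples):
        common = sigma_c * math.sqrt(tau) * rng.gauss(0.0, 1.0)  # shared eta(t)
        a.append(mu0 + common + sigma_u * math.sqrt(tau) * rng.gauss(0.0, 1.0))
        b.append(mu0 + common + sigma_u * math.sqrt(tau) * rng.gauss(0.0, 1.0))
    return a, b

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)
```

With σu = 0.02 and σc = 0.05 the shared fraction of the input variance is 0.05²/(0.05² + 0.02²) ≈ 0.86, and the sample correlation comes out close to that; with σc = 0 the inputs are uncorrelated.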
where σc (t) is the dynamical signal carried by the common noise and η(t) is a gaussian white noise. We renew σc (t) every T ms to a random value
from the uniform distribution on [0, 0.025], as is done for µ(t) and σu(t). The dynamic range of σu(t) and that of σc(t) are assumed to be the same to compare their effects on dynamic outputs. In addition to firing rates, transient synchrony, which is typically reinforced by common inputs, might be functionally relevant (de Oliveira et al., 1997; Riehle et al., 1997; Steinmetz et al., 2000; Fries, Reynolds, Rorie, & Desimone, 2001; Salinas & Sejnowski, 2001; Friedrich et al., 2004). Because both common noise and transient synchrony are difficult to treat mathematically, we resort to numerical simulations. We measure synchrony with two spike-based methods. The first statistic uses a prescribed temporal precision T′ (< T) ms. We subdivide each bin of width T ms into T/T′ sub-bins and count the number of spikes in each sub-bin. Counts are denoted by N1, N2, ..., N_{T/T′}. Since stronger synchrony results in a more rugged distribution of the spike counts, the degree of dynamical synchrony syn(t) (iT ≤ t < (i + 1)T, i ∈ Z) is defined to be the normalized standard deviation of {Ni; 1 ≤ i ≤ T/T′}, namely,

syn(t) = √( (T′/T) Σ_{i=1}^{T/T′} [ Ni − (T′/T) Σ_{j=1}^{T/T′} Nj ]² ) / ( (T′/T) Σ_{i=1}^{T/T′} Ni ).  (3.2)
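The synchrony statistic of equation 3.2 reduces to the coefficient of variation of the sub-bin spike counts. A minimal sketch (the helper name is ours):

```python
import math

def syn(counts):
    """Equation 3.2: standard deviation of the sub-bin spike counts
    N_1, ..., N_{T/T'} within one T-ms bin, normalized by their mean."""
    m = len(counts)                 # number of sub-bins, T / T'
    mean = sum(counts) / m
    var = sum((c - mean) ** 2 for c in counts) / m
    return math.sqrt(var) / mean
```

Evenly spread counts give syn = 0, while the same number of spikes concentrated in one sub-bin gives a large value, e.g. syn([25, 0, 0, 0, 0]) = 2.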
In calculating the correlation functions between syn(t) and the inputs, we discard the bins without spikes. We set T′ = 1 ms because synchrony on this timescale is biologically relevant (Mainen & Sejnowski, 1995; Diesmann, Gewaltig, & Aertsen, 1999; Steinmetz et al., 2000). In experiments, we often do not know when the inputs change. Binning is impossible because we know neither T nor T′. With this severe condition in mind, a dynamic synchrony measure CVp(t) based on lumped spike trains (Tiesinga & Sejnowski, 2004) is also calculated. The minimum distance between spike times of different neurons becomes small during synchrony. This idea is quantified by creating a spike train with spike times {ti : i ∈ Z} by merging all the spikes from n neurons. To define CVp(t) as the instantaneous coefficient of variation of {ti : i ∈ Z}, we clip a + 1 consecutive spike times {ti : 0 ≤ i ≤ a} from {ti : i ∈ Z} so that t is closer to t_{a/2} than to any other ti. Then

CVp(t) ≡ (1/√n) √( (1/a) Σ_{i=1}^{a} (ti − ti−1)² − [ (1/a) Σ_{i=1}^{a} (ti − ti−1) ]² ) / ( (1/a) Σ_{i=1}^{a} (ti − ti−1) ),  (3.3)
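Equation 3.3 can likewise be computed from a + 1 consecutive merged spike times. The sketch below (our own helper) and the toy check after it illustrate the limiting values quoted next in the text: near 1 for tight synchrony and near 1/√n for asynchronous firing.

```python
import math
import random

def cvp(times, n):
    """Equation 3.3: coefficient of variation of the a interspike intervals
    of the merged spike train `times` (a + 1 times), scaled by 1/sqrt(n)."""
    isi = [t1 - t0 for t0, t1 in zip(times, times[1:])]
    a = len(isi)
    mean = sum(isi) / a
    var = sum(x * x for x in isi) / a - mean ** 2
    return math.sqrt(max(var, 0.0)) / (mean * math.sqrt(n))
```

For n = 5 perfectly synchronized neurons firing once per unit time, the merged intervals are mostly zeros and cvp ≈ 0.89; merging independent Poisson-like trains gives a markedly smaller value, roughly 1/√5 ≈ 0.45.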
where 1/√n is the normalization factor (Tiesinga & Sejnowski, 2004). Perfect synchrony leads to CVp(t) = 1, whereas asynchrony yields CVp(t) = 1/√n. The choice of a is arbitrary, and we set a = 40. Although ν(t) encodes σc(t) and σu(t) in a similar way, a major difference is that the upper cutoff frequency of the input is lower for σc(t) because induced synchrony effectively decreases the number of neurons. With this in mind, we start with T = 5 ms, which prescribes an easy scheme for ν(t) to encode µ(t) and σu(t) under proper µ0 and sufficient asynchrony, as revealed in Figure 2. Figure 3A indicates that in the presence of considerable interference by common noise, ν(t) favors µ(t) (squares) or σu(t) (circles), depending on µ0. Results are similar to those in Figure 2A. Although corr(σc(t), ν(t)) also decreases with µ0 (triangles), this relation is weak regardless of µ0. However, σc(t) is represented with high fidelity by dynamical synchrony, as shown in Figure 3B, which displays corr(σc(t), syn(t)) (thick line with triangles) and corr(σc(t), CVp(t)) (thin line with triangles). Synchrony induced by common noise thus extends to the case of mixed dynamical inputs. Only in far subthreshold regimes are synchronous firing rates too low for syn(t) or CVp(t) to represent σc(t). In suprathreshold regimes, simultaneously applied µ(t) and σc(t) are more or less independently encoded in dynamical firing rates and dynamical synchrony. Dynamical synchrony is anticorrelated more strongly with σu(t) than with µ(t) because noise directly desynchronizes neurons. When rate coding of σu(t) is relevant with small µ0, synchrony also carries the same information about σu(t). That is, σu(t) occupies two output channels: firing rates and synchrony. Although µ(t) is also anticorrelated with synchrony (Brunel, 2000; Burkitt & Clark, 2001), this effect is much smaller, especially for suprathreshold µ0, where rate coding of µ(t) is efficient.
We note that syn(t) (thick lines in Figure 3B and also in Figures 3D and 3F, as explained below) is more strongly correlated with the inputs than CVp(t) (thin lines) is, more often than the other way round. This is because the timing of input changes is available only for the calculation of syn(t). However, CVp(t) behaves consistently like syn(t), indicating that the results are not sensitive to the synchrony measure. Figure 2 indicates that the timescales of µ(t) and σu(t) affect the efficiency of rate coding. How does this extend to the cases in which σc(t) and dynamical synchrony are involved? As shown in Figures 3C and 3E for T = 2 and 15 ms, σc(t) does not influence ν(t) regardless of T or µ0, and the relevant mode of rate coding as a function of µ0 and T is similar to that shown in Figure 2, in which σc(t) is absent. In the suprathreshold regime, dynamical synchrony represents low-passed σc(t), as shown by the triangles in Figures 3D (T = 2 ms) and 3F (T = 15 ms). Then µ(t) and σc(t) can be separately encoded in the rates and the synchrony, but they are low-pass filtered. In the subthreshold regime, dynamical synchrony and firing rates represent σu(t) up to a relatively high frequency (the circles in Figures 3D and 3F). In this situation, σu(t) can be encoded up to a high frequency,
Figure 3: (A, C, E) Cross-correlations between ν(t) and each of the three inputs (squares: µ(t); circles: σu(t); triangles: σc(t)). (B, D, F) Cross-correlations between the degrees of dynamical synchrony (thick lines: syn(t); thin lines: CVp(t)) and the three inputs. We set (A, B) T = 5 ms, (C, D) T = 2 ms, and (E, F) T = 15 ms.
Figure 4: (A, C) Cross-correlations between ν(t) and the inputs, and (B, D) cross-correlations between the degrees of dynamical synchrony and the inputs. The strength of the uncorrelated noise lies in (A, B) [0.010, 0.035] and (C, D) [0.055, 0.080]. We set T = 5 ms. The corresponding results for the noise strength [0.030, 0.055] are shown in Figures 3A and 3B.
although it occupies both rate and synchrony channels. In short, there is a trade-off between the number of manageable inputs and the quality of the conveyed information about each input. We also examine effects of background noise. Since its baseline level is biologically difficult to estimate, we try several test values. In Figure 3, σu (t) has an amplitude taken from [0.030, 0.055] and can be regarded as a summation of a dynamic σu (t) with amplitude 0.025 and background noise whose level is statically equal to 0.030. With T = 5 ms, as in Figures 3A and 3B, Figure 4 shows the coding results when the amplitude of σu (t) has the same dynamic range (= 0.025) as in Figure 3 but with different static levels. The
amplitude of σu(t) falls in [0.010, 0.035] ([0.055, 0.080]) for Figures 4A and 4B (4C and 4D). A large background noise depresses synchronous firing and the coding of σc(t) by synchrony, particularly in the suprathreshold regime (see Figure 4D). Figure 4C shows that rate coding does not degrade as much. For a small background noise, σc(t) is coded with fidelity in the dynamic synchrony level (see Figure 4B). In the subthreshold regime, rate coding of σu(t) degrades to a large extent as background noise decreases, and synchrony is preferred. In both subthreshold and suprathreshold regimes, the relative contributions of firing rates and synchrony to input encoding depend on the level of background noise.

4 Discussion

We have examined how firing rates and synchrony can simultaneously encode dynamic inputs of different modalities. With a small bias, the independent noise signal σu(t) is encoded up to a high frequency by occupying both rate and synchrony channels. With a large bias, firing rates and synchrony represent the mean signal µ(t) and the common noise signal σc(t) separately. Although the use of two channels is efficient in this case, the inputs can be coded only up to lower cutoff frequencies. Our results for σu(t) extend a variety of work on gain modulation of noise-coded signals (Tiesinga et al., 2000; Chance et al., 2002; Burkitt et al., 2003; Kuhn et al., 2004) and on coding of balanced excitation-inhibition inputs (Shadlen & Newsome, 1998) to the dynamical setup. We also have extended the results of other investigations of dynamical noise inputs (Lindner & Schimansky-Geier, 2001; Silberberg et al., 2004) to network situations. It is well known that σc(t) induces synchrony and limits the precision of rate coding (Shadlen & Newsome, 1998; Salinas & Sejnowski, 2001; Masuda & Aihara, 2003; Litvak et al., 2003; Reyes, 2003). We have shown that dynamical synchrony actually represents σc(t) and that σc(t) does not interfere with rate coding of the other inputs.
Tiesinga and Sejnowski (2004) also noted that dynamical firing rates and synchrony may measure different entities. They used interneuron networks with spatially heterogeneous inputs to examine the transition between the rate regime and the synchrony regime. We have presented another mechanism, with a systematic evaluation of the effect of different input modalities. Real neural networks abound in feedback and heterogeneity. Stronger recurrent connectivity (Brunel, 2000; Burkitt & Clark, 2001; Gerstner & Kistler, 2002; Masuda & Aihara, 2003) and greater homogeneity (Brunel, 2000; Gerstner & Kistler, 2002; Masuda & Aihara, 2003) make synchrony more likely. According to Figure 4, more synchrony (asynchrony) means that synchrony (the firing rate) carries more information about dynamic inputs, regardless of the bias level. We expect that feedback and homogeneity also control the balance between the rate code and the synchrony code (Masuda & Aihara, 2003). However, feedback, which could be modeled as a part of common
Simultaneous Rate-Synchrony Codes in Spiking Neurons
noise, does not exactly correspond to external input. Rather, feedback spikes may underlie more combinatorial or memory-related codes such as the synfire chain (Abeles, 1991; Diesmann et al., 1999; Litvak et al., 2003; Reyes, 2003). If synchrony is dynamically modulated by σc(t) in a neural population, the convergent nature of coupling (Shadlen & Newsome, 1998; Salinas & Sejnowski, 2001) makes dynamical synchronous outputs serve as σc(t) for a downstream population. The repetition of this process in a layered manner defines an extended synfire chain, so that the time-dependent σc(t) can propagate through layers. At the same time, ν(t) on the output side of a population can be µ(t) on the input side of another population. This cascade produces a chain of rate code (van Rossum, Turrigiano, & Nelson, 2002; Litvak et al., 2003; Masuda & Aihara, 2003). A novel point is that these two types of chains can be multiplexed. With static inputs, simultaneous propagation of a firing rate and a synchrony level through feedforward networks was analyzed in the context of stable propagation of synfire chains (Diesmann et al., 1999; Gerstner & Kistler, 2002). Our results extend that work to a dynamic framework and to input encoding. This scheme contrasts with the situation in which the rate code and the synchrony code propagate alternately, not simultaneously, in a feedforward manner (Masuda & Aihara, 2002, 2003; van Rossum et al., 2002). In experimental situations in which both firing rates and spike timing are expected to be simultaneously functional (Riehle et al., 1997; Huxter et al., 2003; Friedrich et al., 2004), firing rates and synchrony may express the mean input and the common noise input, respectively. Our theory predicts that this multiplexing scheme cannot handle fast inputs.
Multiplexing may be used to represent sensory, behavioral, or cognitive signals that do not require high temporal resolution, such as static odor information (Friedrich et al., 2004) or physical location and speed (Huxter et al., 2003). We are uncertain whether the behavioral and cognitive events discussed in Riehle et al. (1997) are relatively fast. Dynamical firing rates and synchrony are negatively correlated in some behavioral tasks. For example, synchrony is high during stimulus expectation periods, whereas asynchrony accompanied by increased firing rates emerges at stimulus onset (de Oliveira et al., 1997). This observation may be understood if independent noise signals, rather than mean inputs or common noise, increase as the stimulus is turned on. We speculate that even relatively rapid changes in stimuli can be processed in this situation.
Acknowledgments We thank H. Nakahara, M. Okada, D. Nozaki, B. Doiron, K. Aihara, Y. Tsubo, and S. Amari for helpful discussions. This work is supported by the Special Postdoctoral Researchers Program of RIKEN.
N. Masuda
References
Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Berry, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. USA, 94, 5411–5416.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8, 183–208.
Burkitt, A. N., & Clark, G. M. (2001). Synchronization of the neural response to noisy periodic synaptic input. Neural Comput., 13, 2639–2672.
Burkitt, A. N., Meffin, H., & Grayden, D. B. (2003). Study of neuronal gain in a conductance-based leaky integrate-and-fire neuron model with balanced excitatory and inhibitory synaptic input. Biol. Cybern., 89, 119–125.
Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
de Oliveira, S. C., Thiele, A., & Hoffmann, K.-P. (1997). Synchronization of neuronal activity during stimulus expectation in a direction discrimination task. J. Neurosci., 17(23), 9248–9260.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Friedrich, R. W., Habermann, C. J., & Laurent, G. (2004). Multiplexing using synchrony in the zebrafish olfactory bulb. Nat. Neurosci., 7(8), 862–871.
Fries, P., Reynolds, J. H., Rorie, A. E., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291, 1560–1563.
Gerstner, W., & Kistler, W. M. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Herrmann, A., & Gerstner, W. (2001). Noise and the PSTH response to current transients: I. General theory and application to integrate-and-fire neuron. J. Comput. Neurosci., 11, 135–151.
Huxter, J., Burgess, N., & O'Keefe, J. (2003). Independent rate and temporal coding in hippocampal pyramidal cells. Nature, 425, 828–832.
Knight, B. W. (1972).
Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59, 734–766.
Kuhn, A., Aertsen, A., & Rotter, S. (2004). Neuronal integration of synaptic input in the fluctuation-driven regime. J. Neurosci., 24(10), 2345–2356.
Lindner, B., & Schimansky-Geier, L. (2001). Transmission of noise coded versus additive signals through a neuronal ensemble. Phys. Rev. Lett., 86(14), 2934–2937.
Litvak, V., Sompolinsky, H., Segev, I., & Abeles, M. (2003). On the transmission of rate code in long feedforward networks with excitatory-inhibitory balance. J. Neurosci., 23(7), 3006–3015.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Masuda, N., & Aihara, K. (2002). Bridging rate coding and temporal spike coding by effect of noise. Phys. Rev. Lett., 88(24), 248101.
Masuda, N., & Aihara, K. (2003). Duality of rate coding and temporal spike coding in multilayered feedforward networks. Neural Comput., 15, 103–125.
Moreno, R., de la Rocha, J., Renart, A., & Parga, N. (2002). Response of spiking neurons to correlated inputs. Phys. Rev. Lett., 89(28), 288101.
Reyes, A. D. (2003). Synchrony-dependent propagation of firing rate in iteratively constructed networks in vitro. Nat. Neurosci., 6(6), 593–599.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differently involved in motor cortical function. Science, 278, 1950–1953.
Rudolph, M., & Destexhe, A. (2003). The discharge variability of neocortical neurons during high-conductance states. Neuroscience, 119, 855–873.
Salinas, E., & Sejnowski, T. J. (2001). Correlated neuronal activity and the flow of neural information. Nat. Rev. Neurosci., 2, 539–550.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18(10), 3870–3896.
Silberberg, G., Bethge, M., Markram, H., Pawelzik, K., & Tsodyks, M. (2004). Dynamics of population rate codes in ensembles of neocortical neurons. J. Neurophysiol., 91, 704–709.
Steinmetz, P. N., Roy, A., Fitzgerald, P. J., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190.
Tiesinga, P. H. E., José, J. V., & Sejnowski, T. J. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Phys. Rev. E, 62, 8413–8419.
Tiesinga, P. H. E., & Sejnowski, T. J. (2004). Rapid temporal modulation of synchrony by competition in cortical interneuron networks. Neural Comput., 16, 251–275.
van Rossum, M. C. W., Turrigiano, G. G., & Nelson, S. B. (2002). Fast propagation of firing rates through layered networks of noisy neurons. J. Neurosci., 22(5), 1956–1966.
Received October 20, 2004; accepted June 1, 2005.
LETTER
Communicated by Daniel Amit
Spontaneous Dynamics of Asymmetric Random Recurrent Spiking Neural Networks
Hédi Soula
[email protected]
Guillaume Beslon
[email protected] Artificial Life and Behavior, PRISMA, National Institute of Applied Science, Lyon, France
Olivier Mazet
[email protected] Mathematic Lab, Camille Jordan Institute, National Institute of Applied Science, Lyon, France
In this letter, we study the effect of a unique initial stimulation on random recurrent networks of leaky integrate-and-fire neurons. Indeed, given a stochastic connectivity, this so-called spontaneous mode exhibits various nontrivial dynamics. This study is based on a mathematical formalism that allows us to examine the variability of the subsequent dynamics according to the parameters of the weight distribution. Under the independence hypothesis (as in the case of very large networks), we are able to compute the average number of neurons that fire at a given time, that is, the spiking activity. In accordance with numerical simulations, we prove that this spiking activity reaches a steady state. We characterize this steady state and explore the transients.
Neural Computation 18, 60–79 (2006)
© 2005 Massachusetts Institute of Technology

1 Introduction
Many neurobiological problems require understanding the behavior of large, recurrent spiking neural networks. Indeed, it is assumed that these observable behaviors result from the collective dynamics of interacting neurons. The question then becomes: given the connectivity of the network and the properties of a single neuron, what kinds of dynamics are possible? In the case of homogeneous nets (the same connectivity throughout the network), some authors have found sufficient conditions for phase synchronization (locking) or stability (Chow, 1998; Gerstner, 2001). Coombes (1999) calculated Lyapunov exponents for a given symmetric connectivity map and showed that some neurons were "chaotic" (the highest exponent was positive). In very general cases (Golomb, 1994; van Vreeswijk & Sompolinsky, 1996; Meyer & van Vreeswijk, 2002), it has been shown that the dynamics can display a broad variety of behaviors. In the case of integrate-and-fire (I&F) neurons, Amit and Brunel (1997a, 1997b) used consistency techniques on nets of irregularly firing neurons, which allowed them to derive a self-sustaining criterion. Using Fokker-Planck diffusion, the same kind of method was applied to linear I&F neurons in Mongillo and Amit (2001), Fusi and Mattia (1999), and Mattia and del Guidice (2000), to stochastic network dynamics with noisy input current (del Guidice & Mattia, 2003), and to sparse weight connectivity (Brunel, 2000). However, stochastic recurrent spiking neural networks are rarely studied in their spontaneous functioning. Indeed, in most cases, the dynamics is driven by an external current, whether meaningful or noisy. Without this external current, the resulting behavior is often considered trivial. However, our experimental results show that large, random recurrent networks do exhibit nontrivial functioning modes. Depending on a coupling parameter between neurons (in our case, the variance of the weight distribution), the network can follow a wide spectrum of spontaneous behaviors, from trivial neural death (the initial stimulation produces no further spiking activity) to an extreme locking mode (some neurons fire all the time, while others never do). In the intermediate states, the average spiking activity grows with the variance. Thus, we basically follow the same ideas as Amit and Brunel (1997a) and Fusi and Mattia (1999) and try to predict these behaviors for large, random networks. In this case, we need to make an independence hypothesis and use mean field techniques. Note that this so-called mean field hypothesis has been rigorously proven in a different neuronal network model (Moynot & Samuelides, 2002).
More precisely, in our case, the connectivity weights follow an independent and identically distributed law, and the neurons' firing activities are assumed to be independent. After introducing the spiking neural model, we propose a mathematical formalism that allows us to determine (with some approximations) the probability law of the spiking activity. Since no hypothesis other than independence is used, a re-injection of the dynamics is needed, which leads, as one might expect, to a heavy use of recursive equations. Although not intuitive, these equations provide solid ground on which many conclusions can rigorously be drawn. Conveniently, the solutions of these equations behave as expected: the average spiking activity (and, as a consequence, the average frequency) reaches a steady state very quickly. Moreover, this steady state depends only on the parameters of the weight distribution. To keep the arguments simple, we detail the process for a weight matrix following a centered normal law; extensions to a nonzero mean and a sparse connectivity are proposed afterward. All of these results agree closely with data from simulated neural networks.
2 The Neural Model
The following series of equations describes the discrete I&F model we use throughout this letter (Tuckwell, 1988). Our network consists of N all-to-all coupled neurons. Each time a given neuron fires, a synaptic pulse is transmitted to all the other neurons. Firing occurs whenever the neuron potential V crosses the threshold θ from below; just after firing, the potential of the neuron is reset to 0. Between a reset and a spike, the dynamics of the potential is given by the following (discrete) temporal equation:
$$V_i(t+1) = \gamma V_i(t) + \sum_{j=1}^{N} \sum_{n>0} W_{ij}\,\delta\bigl(t - T_j^n\bigr). \tag{2.1}$$
The first part of the right-hand side of the equation describes the leak current; $\gamma$ is the leak rate ($0 \le \gamma \le 1$). A value of 0 for $\gamma$ indicates that the neuron has no short-term memory; $\gamma = 1$ describes a linear integrator. The $W_{ij}$ are the synaptic influences (weights), and $\delta(x) = 1$ whenever $x = 0$ and 0 otherwise (Kronecker symbol). The $T_j^n$ are the firing times of neuron $j$, multiples of the sample discretization time. The firing times are formally defined for all neurons $i$ by $V_i(T_i^n) \ge \theta$, the $n$th firing date being given recursively as

$$T_i^n = \inf\bigl\{\, t \;\big|\; t > T_i^{n-1},\; V_i(t) \ge \theta \,\bigr\}. \tag{2.2}$$
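For concreteness, the dynamics of equations 2.1 and 2.2 can be simulated directly. The sketch below is a minimal Python illustration (not the authors' simulation code; all parameter values are hypothetical): it draws an all-to-all weight matrix from a centered normal law, applies the leak-and-reset rule, and records the fraction of firing neurons at each step.

```python
import math
import random

def simulate(N=200, T=50, phi=4.0, gamma=0.5, theta=1.0, x0=0.1, seed=0):
    """Discrete I&F network of eq. 2.1; W[i][j] is the weight from neuron j to neuron i."""
    rng = random.Random(seed)
    sigma = phi / math.sqrt(N)                    # coupling factor phi = sigma * sqrt(N)
    W = [[rng.gauss(0.0, sigma) for _ in range(N)] for _ in range(N)]
    V = [0.0] * N
    fired = [i < int(x0 * N) for i in range(N)]   # initial stimulation: x0*N neurons spike
    activity = []                                 # X_t / N, fraction of firing neurons
    for _ in range(T):
        charge = [sum(W[i][j] for j in range(N) if fired[j]) for i in range(N)]
        V = [gamma * v + c for v, c in zip(V, charge)]
        fired = [v > theta for v in V]            # threshold crossing
        V = [0.0 if f else v for f, v in zip(fired, V)]   # reset to 0 after a spike
        activity.append(sum(fired) / N)
    return activity

act = simulate()
```

Varying phi and gamma here mimics the spontaneous regimes discussed in the later sections, from neural death to self-sustained activity.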
We set $T_i^0 = -\infty$. Moreover, once it has fired, the neuron's potential is reset to zero. Thus, when computing $V_i(T_i^n + 1)$, we set $V_i(T_i^n) = 0$ in equation 2.1. Finally, in order to simplify, we restrict ourselves to a synaptic weight distribution that follows a centered normal law $N(0, \sigma^2)$, and we let $\phi = \sigma\sqrt{N}$ be the coupling factor.

3 General Study
In this section we give a very general formulation of the distribution of the spiking activity, defined as $X_t$, the number of firing neurons at time step $t$ for a given realization. The basic idea consists of partitioning the spiking activity according to the instantaneous period of the neurons. Hence, we write

$$X_t = X_t^{(1)} + \cdots + X_t^{(t-1)},$$
where $X_t^{(k)}$ is the number of neurons that have fired at $t$ and at $t-k$, but not in between. If we suppose that the starting potential of all neurons is 0 and that only $X_0$ neurons were excited in order to make them fire, we have $V_i(1) = \sum_{j=1}^{X_0} W_{ij}$. Thus, using equation 2.2, we get

$$X_1 = \sum_{i=1}^{N} \chi_{\{V_i(1)>\theta\}},$$
where $\chi_{\{V_i(1)>\theta\}} = 1$ whenever $V_i(1) > \theta$ and 0 otherwise. Furthermore, we have

$$X_2 = \sum_{i=1}^{X_1} \chi_{\{V_i(2)>\theta\}} + \sum_{i=1}^{N} \chi_{\{V_i(1)<\theta,\,V_i(2)>\theta\}}.$$
Indeed, the firing neurons at time step 2 comprise those that have fired twice (at $t = 1$ and $t = 2$), that is, $X_2^{(1)} = \sum_{i=1}^{X_1} \chi_{\{V_i(2)>\theta\}}$ (since the reset potential is 0), to which we add those that did not fire at $t = 1$: $X_2^{(2)} = \sum_{i=1}^{N} \chi_{\{V_i(1)<\theta,\,V_i(2)>\theta\}}$. Thus, for general $t$ and taking the initial step into account, we have recursively

$$X_t = \sum_{u=1}^{t} \sum_{i=1}^{\hat X_{t-u}} \chi_{\{V_i(t-u+1)<\theta,\,\ldots,\,V_i(t-1)<\theta,\,V_i(t)>\theta\}}, \tag{3.1}$$
where $\hat X_k = X_k$ for $k \neq 0$ and $\hat X_0 = N$. The quantity $\sum_{i=1}^{\hat X_{t-u}} \chi_{\{V_i(t-u+1)<\theta,\ldots,V_i(t-1)<\theta,\,V_i(t)>\theta\}}$ is a sum of random Bernoulli variables whose number of terms (the $X_k$'s) is itself a random variable. So if we now assume that the neurons' dynamics are independent, we can calculate the expectation of $X_t$ using the first Wald identity (Wald, 1945; a simple proof is given in appendix A):

$$E(X_t) = \sum_{k=0}^{t-1} E(\hat X_k)\,P(k,t), \tag{3.2}$$

setting $P(k,t) = E\bigl(\chi_{\{V_i(k+1)<\theta,\ldots,V_i(t-1)<\theta,\,V_i(t)>\theta\}} \mid V_i(k) = 0\bigr)$. The $P(k,t)$ are expectations of Bernoulli distributions, which leads to

$$\mathrm{Var}\bigl(\chi_{\{V_i(k+1)<\theta,\ldots,V_i(t-1)<\theta,\,V_i(t)>\theta\}}\bigr) = P(k,t)\bigl(1 - P(k,t)\bigr).$$
We are now able to retrieve the variance of $X_t$ (second Wald identity):

$$\mathrm{Var}(X_t) = \sum_{k=0}^{t-1}\Bigl(E(\hat X_k)\,P(k,t)\bigl(1 - P(k,t)\bigr) + \mathrm{Var}(\hat X_k)\,P(k,t)^2\Bigr). \tag{3.3}$$
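Both Wald identities can be checked numerically on a toy random sum. The sketch below (a generic illustration with an arbitrary distribution for the number of terms, not tied to the network itself) sums a random number M of i.i.d. Bernoulli variables and compares the empirical mean and variance against $E(M)p$ and $E(M)p(1-p) + \mathrm{Var}(M)p^2$.

```python
import random

rng = random.Random(1)
p = 0.3                                        # Bernoulli success probability
trials = 200_000
samples = []
for _ in range(trials):
    M = rng.randint(0, 10)                     # random number of terms, uniform on {0,...,10}
    samples.append(sum(1 for _ in range(M) if rng.random() < p))

mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials

E_M, Var_M = 5.0, 10.0                         # moments of the uniform law on {0,...,10}
wald_mean = E_M * p                            # first Wald identity:  E(S) = E(M) p
wald_var = E_M * p * (1 - p) + Var_M * p * p   # second Wald identity
```

Both empirical moments should match the identities up to sampling error.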
More generally, the moment-generating function $G_X(s) = E(s^X)$ can be computed recursively:

$$G_{X_t}(s) = \prod_{k=0}^{t-1} G_{\hat X_k}\bigl(P(k,t)\,s + 1 - P(k,t)\bigr), \tag{3.4}$$

since $E\bigl(s^{\chi_{\{V_i(k+1)<\theta,\ldots,V_i(t-1)<\theta,\,V_i(t)>\theta\}}}\bigr) = P(k,t)\,s + 1 - P(k,t)$. Equation 3.2 could have been found immediately and intuitively. Nevertheless, we obtain a more general result providing all the other moments via equation 3.4.

4 Average Number Calculation
Equations 3.2 to 3.4 are useful when we can estimate the $P(k,t)$ coefficients. We recall that $P(k,t) = E\bigl(\chi_{\{V_i(k+1)<\theta,\ldots,V_i(t-1)<\theta,\,V_i(t)>\theta\}} \mid V_i(k) = 0\bigr)$. Since we are in spontaneous mode, the only input of a neuron consists of the weights from neurons that fired at the previous time step. Thus, its potential is a random sum (with a random number $X_t$ of terms) of independent and identically distributed normal laws (the weights). However, this random sum is not, in general, a normal law itself. Nevertheless, as is proven in appendix B (see equation B.1), when the number $N$ of neurons is large enough and for a general class of distributions of the random variable $X_t$, we can write, for a neuron whose potential satisfies $V_i(t) = 0$,

$$V_i(t+1) \sim N\bigl(0,\, E(X_t)\,\sigma^2\bigr) \;\Longrightarrow\; P\bigl(V_i(t+1) > \theta\bigr) = \frac{1}{\sqrt{2\pi}} \int_{\theta/(\sqrt{E(X_t)}\,\sigma)}^{\infty} e^{-x^2/2}\,dx. \tag{4.1}$$
From now on, we suppose that $\gamma \neq 0$. We then claim that the neurons whose potential vanishes at a time step $t$ are exactly those that fired at time $t$ (except for $t = 0$, when this is true for all neurons). Taking into account a nonzero leak, we need to make another independence assumption concerning the previous charges. Indeed, the charge
received by a neuron between $t$ and $t+k$ comes from $\gamma^k X_t + \ldots + \gamma X_{t+k-1} + X_{t+k}$ "neurons." But in order to proceed further, we assume that these charges are independent. Note that when $\gamma$ is equal to zero, this extra hypothesis is not needed. It leads to

$$P(k,t) = \left[\prod_{m=1}^{t-k-1}\left(1 - P\Bigl(\sum_{j=1}^{l(k,m)} W_{ij} > \theta\Bigr)\right)\right] P\left(\sum_{j=1}^{l(k,t-k)} W_{ij} > \theta\right), \tag{4.2}$$

where we set

$$l(k,m) = \sum_{i=k}^{m+k-1} \gamma^{m+k-i-1}\, E(X_i).$$
In order to simplify the notation, we put $x_t = \dfrac{E(X_t)}{N}$ and

$$p_\phi(y) = \frac{1}{\sqrt{2\pi}} \int_{\theta/(\sqrt{y}\,\phi)}^{\infty} e^{-x^2/2}\,dx.$$

We recall that we set $\phi = \sigma\sqrt{N}$. So equation 4.2 becomes

$$P(k,t) = p_\phi\!\left(\sum_{i=k}^{t-1} \gamma^{t-i-1} x_i\right) \prod_{m=k+1}^{t-1}\left[1 - p_\phi\!\left(\sum_{i=k}^{m-1} \gamma^{m-i-1} x_i\right)\right]. \tag{4.3}$$
Using the same notation, we get the recursive computation of $x_t$:

$$x_{t+1} = \sum_{k=0}^{t} \hat x_k\; p_\phi\!\left(\sum_{i=k}^{t} \gamma^{t-i} x_i\right) \prod_{m=k+1}^{t}\left[1 - p_\phi\!\left(\sum_{i=k}^{m-1} \gamma^{m-i-1} x_i\right)\right], \tag{4.4}$$
where $\hat x_k = x_k$ for all $k > 0$ and $\hat x_0 = 1$. Moreover, we can compute the $P(k,t)$ recursively. Indeed, let $u_k^t = \sum_{i=k}^{t} \gamma^{t-i} x_i$; then $u_k^{t+1} = \gamma u_k^t + x_{t+1}$, and

$$P(k,t+1) = p_\phi\bigl(u_k^t\bigr) \prod_{m=k+1}^{t}\Bigl[1 - p_\phi\bigl(u_k^{m-1}\bigr)\Bigr] = p_\phi\bigl(u_k^t\bigr)\left(\frac{1}{p_\phi\bigl(u_k^{t-1}\bigr)} - 1\right) P(k,t), \tag{4.5}$$

with $P(t,t+1) = p_\phi(x_t)$ and $u_t^t = x_t$; $P(k,t) = 0$ whenever $t \le k$. Equations 4.4 and 4.5 are our main result. We can see that there is no longer any reference to the number of neurons.
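Equation 4.4 translates directly into a short recursion. The sketch below (illustrative parameter values, not the authors' code) iterates the expected spiking activity $x_t$, writing $p_\phi$ through the complementary error function, $p_\phi(y) = \tfrac{1}{2}\,\mathrm{erfc}\bigl(\theta/(\sqrt{2y}\,\phi)\bigr)$.

```python
import math

def p_phi(y, phi=4.0, theta=1.0):
    """p_phi(y) = (1/sqrt(2*pi)) * integral from theta/(sqrt(y)*phi) to +inf of exp(-x^2/2)."""
    if y <= 0.0:
        return 0.0
    return 0.5 * math.erfc(theta / (math.sqrt(2.0 * y) * phi))

def spiking_activity(T=30, gamma=0.5, phi=4.0, theta=1.0, x0=0.1):
    """Iterate eq. 4.4 for the expected fraction of firing neurons x_t."""
    x = [x0]
    xhat = lambda k: 1.0 if k == 0 else x[k]   # x^_0 = 1, x^_k = x_k otherwise
    for t in range(T):
        total = 0.0
        for k in range(t + 1):
            # charge accumulated since the last reset at step k
            charge = sum(gamma ** (t - i) * x[i] for i in range(k, t + 1))
            # probability of having stayed below threshold between k and t
            prod = 1.0
            for m in range(k + 1, t + 1):
                prod *= 1.0 - p_phi(sum(gamma ** (m - i - 1) * x[i] for i in range(k, m)),
                                    phi, theta)
            total += xhat(k) * p_phi(charge, phi, theta) * prod
        x.append(total)
    return x

x = spiking_activity()
```

With these (hypothetical) parameters the sequence settles quickly, consistent with the steady state established in section 5.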
In the case $\gamma = 0$, we can deduce the result independently of the above equations. Indeed, since the potential is then reset to zero at each time step, we can directly write the probability of the spiking of one neuron according to the number of neurons that fired previously. Thus, with the same notation and hypotheses,

$$x_t = p_\phi(x_{t-1}). \tag{4.6}$$

This is a special case of equation 4.4 when $\gamma \to 0$. (See appendix C for a detailed proof.)

5 Analysis
First, note that equation 3.2 can be viewed as an integral over past charges of the form
$$x(t) = \int_{-\infty}^{t} x(\hat t\,)\,P(\hat t, t)\,d\hat t. \tag{5.1}$$
This is exactly Gerstner's formula (Gerstner, 2000) for the spiking activity, defined (in our notation) as

$$x_t = \lim_{N\to\infty} \frac{1}{N} \sum_{i=1}^{N} \sum_{n\ge 0} \delta\bigl(t - T_i^n\bigr).$$
Moreover, the computation of equations 4.4 and 4.5 is obtained by partitioning the firing neurons according to their instantaneous period, which amounts to counting neurons according to their local field. Indeed, $P(k,t)$ is the proportion of neurons that experience the same charge from $t-k$ to $t$. This technique can be traced back to Gerstner and van Hemmen (1992), where it is used with a symmetric weight matrix and a discrete formalism and leads to similar recursive equations. The independence-of-charges hypothesis leads us to treat the local field of a particular neuron as a noise "redrawn" at each time step; this neuron's potential is then tested against the threshold. In effect, we transform noise in the input (diffusive noise) into noise in the threshold (escape noise). This mapping is studied in detail in Plesser and Gerstner (2000).
5.1 General Study. Now that the $P(k,t)$ are defined analytically, we are able to estimate the evolution of $x_t$. Indeed, it follows from the definition of the $P(k,t)$ that, for large enough $t$,

$$\sum_{k=0}^{t} P(k,t) = 1. \tag{5.2}$$
It follows that $x_t$ is bounded. Moreover, for large enough $t$, $x_t$ is monotonic. Therefore, $x_t$ converges toward a limit $x_\infty$ when $t \to \infty$: the average spiking activity is stationary. Note that $x_\infty = 0$ (neural death) is an obvious (and trivial) solution. For high enough $\phi$, there is another one, bounded by 1/2. In this case, the $P(k,t)$ are close to a geometric distribution of parameter $p_\phi(x_\infty)$. By the definition of $P(k,t)$, this leads to a geometric distribution of the instantaneous period $P$, that is,

$$P(P = k) = P(t-k,\, t). \tag{5.3}$$

For example, $P(t-1, t) = p_\phi(x_\infty)$ is the probability that a neuron that fired at $t-1$ fires again at $t$ (the proportion of neurons firing at the maximum rate). This enables us to define a network frequency $f$ as

$$f = \bigl(E(P)\bigr)^{-1} = p_\phi(x_\infty). \tag{5.4}$$
Let us define the network average frequency $\bar F(t)$ of a network over a period $T$ at a given time $t$ by

$$\bar F(t) = \frac{1}{N}\,\frac{1}{T} \sum_{i=1}^{N} \sum_{k=0}^{T} \delta_i(t-k), \tag{5.5}$$

where $\delta_i(t) = 1$ if neuron $i$ has fired at time $t$ and $\delta_i(t) = 0$ otherwise (in other words, $\delta_i(t) = \sum_{n\ge 0} \delta(t - T_i^n)$). Switching the sums gives

$$\bar F(t) = \frac{1}{T} \sum_{k=0}^{T} \frac{1}{N} \sum_{i=1}^{N} \delta_i(t-k) = \frac{1}{T} \sum_{k=0}^{T} \frac{X_{t-k}}{N} \tag{5.6}$$

for this realization of the distribution. Taking the expectation leads to

$$\bar f(t) = E\bigl(\bar F(t)\bigr) = \frac{1}{T} \sum_{k=0}^{T} x_{t-k}. \tag{5.7}$$
It means that the average frequency (over a time window $T$) is the average of the spiking activity (over a period $T$). Thus, when $t \to \infty$, it leads to

$$\bar f(t) = x_\infty. \tag{5.8}$$

Due to discrete timing, we generally do not have $f = \bar f$. Instead, we have $f < \bar f$, which gives

$$p_\phi(x_\infty) \le x_\infty. \tag{5.9}$$
The inequality becomes an equality if $\gamma = 0$. This is the case we now study.

5.2 Simple Case. If we consider the case $\gamma = 0$, we recall that $x_t = p_\phi(x_{t-1})$, so that

$$x_\infty = p_\phi(x_\infty) = \frac{1}{\sqrt{2\pi}} \int_{\theta/(\sqrt{x_\infty}\,\phi)}^{\infty} e^{-x^2/2}\,dx.$$

This consistency equation can be approximated (Amit & Brunel, 1997a). More precisely, a solution $x_\infty \neq 0$ exists when $p_\phi(x)$ crosses the line $y = x$, and it is stable if and only if $p_\phi'(x_\infty) < 1$ (here, $p_\phi'$ is positive for all positive numbers). If there exists $x$ such that $p_\phi(x) > x$, then $p_\phi(x) = x$ has exactly two nonzero solutions. The first (lower) one is an unstable fixed point, and the other is a stable fixed point. So if $x_0$ is above the lower fixed point, the average number of neurons converges toward $x_\infty$; otherwise, it converges to 0 (i.e., neural death). We can derive a sufficient condition for the convergence to zero (see appendix D for details):

$$\phi < \left(\frac{2e}{3}\right)^{3/4} \pi^{1/4}\,\theta \;\approx\; 2.08\,\theta. \tag{5.10}$$
5.3 Previous Charges Independence. In the case $\gamma \neq 0$, we now need to suppose the independence of charges. However, in the general case, we allowed the potential to take strongly negative values. This is not only biologically unrealistic; it dramatically impedes the independence hypothesis. Indeed, some neurons with very low potential will never fire, no matter what happens. In order to take this into account, we make a (biologically plausible) assumption: the potential is not allowed to decrease below a minimal value $v_{min}$ (a reflecting barrier). This leads us to reconsider the charge function:

$$P(k,t) = E\bigl(\chi_{\{V_i(t-k+1)<\theta,\,\ldots,\,V_i(t-1)<\theta,\,V_i(t)>\theta\}}\bigr).$$
We recall that, under the hypothesis of independence and since the weights follow a normal law, this leads to

$$P(k,t) = \hat p_\phi\!\left(\sum_{i=k}^{t-1} \gamma^{t-i-1} x_i\right) \prod_{m=k+1}^{t-1}\left[1 - \hat p_\phi\!\left(\sum_{i=k}^{m-1} \gamma^{m-i-1} x_i\right)\right], \tag{5.11}$$
where $\hat p_\phi$ is the new probability function we need to compute. Let us assume that a neuron $i$ has received a charge $C$ previously and is subject to a charge $x_t$ at time $t$. The probability that the total charge exceeds the threshold must be split into two cases: either the potential produced by $C$ is below $v_{min}$, or it is not. In the first case (below $v_{min}$), the charge $x_t$ acts on a potential $v_{min}$, with probability $p = P(\xi < v_{min})$, where $\xi \sim N(0, C\sigma^2)$. In the other case, the effective charge is $C + x_t$ (with probability $1-p$). The resulting probability is

$$P\bigl(V_i(t) > \theta\bigr) = P(\xi > \theta), \qquad \xi \sim p\,N(v_{min},\, x_t\sigma^2) + (1-p)\,N\bigl(0,\, (C + x_t)\sigma^2\bigr) = N\bigl(p\,v_{min},\, ((1-p)C + x_t)\sigma^2\bigr).$$

We can find the $P(k,t)$ recursively, noting that the probability $p$ depends on the previous $x_t$. To simplify, we suppose that $v_{min} = 0$ (i.e., the neuron cannot take negative values). In this case, whatever the charge is, the probability $p$ is always 1/2. So the last equation becomes

$$P\bigl(V_i(t) > \theta\bigr) = P(\xi > \theta), \qquad \xi \sim N\!\left(0,\, \Bigl(\frac{C}{2} + x_t\Bigr)\sigma^2\right). \tag{5.12}$$
It acts as if the decay rate was divided by two. Thus, inserting this equation into equation 4.4 leads to
$$x_{t+1} = \sum_{m=0}^{t} \hat x_m\; p_\phi\!\left(\sum_{i=0}^{t-m} \Bigl(\frac{\gamma}{2}\Bigr)^{i} x_{t-i}\right) \prod_{j=0}^{t-m-1}\left[1 - p_\phi\!\left(\sum_{k=0}^{j} \Bigl(\frac{\gamma}{2}\Bigr)^{k} x_{j-k+m}\right)\right], \tag{5.13}$$

where $\hat x_m = x_m$ for all $m > 0$, and $\hat x_0 = 1$.
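Equation 5.13, which underlies the comparisons of section 6, amounts to the recursion of equation 4.4 with the leak $\gamma$ replaced by the effective leak $\gamma/2$. A self-contained sketch (illustrative parameter values, not the authors' code):

```python
import math

def p_phi(y, phi=4.0, theta=1.0):
    # p_phi(y) = (1/sqrt(2*pi)) * integral from theta/(sqrt(y)*phi) to +inf of exp(-x^2/2)
    return 0.0 if y <= 0.0 else 0.5 * math.erfc(theta / (math.sqrt(2.0 * y) * phi))

def activity_vmin0(T=40, gamma=0.9, phi=4.0, x0=0.1):
    """Iterate eq. 5.13: eq. 4.4 with the effective leak gamma/2 (reflecting barrier at 0)."""
    g = gamma / 2.0
    x = [x0]
    for t in range(T):
        total = 0.0
        for m in range(t + 1):
            xhat = 1.0 if m == 0 else x[m]
            # charge accumulated since the last reset at step m
            fire = p_phi(sum(g ** i * x[t - i] for i in range(t - m + 1)), phi)
            # probability of having stayed silent between m and t
            silent = 1.0
            for j in range(t - m):
                silent *= 1.0 - p_phi(sum(g ** k * x[j - k + m] for k in range(j + 1)), phi)
            total += xhat * fire * silent
        x.append(total)
    return x

xs = activity_vmin0()
```

Under these assumptions the iteration reaches its steady state within a few time steps, in line with the transients reported in the next section.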
5.4 Variance Evolution. The computation of the $P(k,t)$ enables us to compute another moment of the distribution. We recall (second Wald identity)

$$\mathrm{Var}(X_t) = \sum_{k=0}^{t-1}\Bigl(E(\hat X_k)\,P(k,t)\bigl(1 - P(k,t)\bigr) + \mathrm{Var}(\hat X_k)\,P(k,t)^2\Bigr).$$

This equation can be rewritten as

$$\mathrm{Var}(X_t) - E(X_t) = \sum_{k=0}^{t-1} P(k,t)^2\,\bigl(\mathrm{Var}(\hat X_k) - E(\hat X_k)\bigr). \tag{5.14}$$
Since the expectation converges, the same reasoning shows that the variance converges to a stationary value.

5.5 Extending the Class of the Weight Matrix Distribution. For simplicity, we used a centered normal law. We can now extend the class of weight distributions available to compute the $P(k,t)$. It is easy to insert a nonzero mean into the distribution. Let us assume that the weights follow a normal law $N(\mu/N,\, \phi/\sqrt{N})$. Provided that all the hypotheses remain valid, we can insert this mean into the definition of $p_\phi$, which becomes $p_{\phi,\mu}$, defined as

$$p_{\phi,\mu}(y) = \frac{1}{\sqrt{2\pi}} \int_{(\theta - \mu y)/(\sqrt{y}\,\phi)}^{\infty} e^{-x^2/2}\,dx.$$
It now remains to replace the previous $p_\phi$ with this new one. We can extend this result to a more biological model using two separate populations of neurons: excitatory and inhibitory. Let us suppose that we have $N$ neurons, with $N_i$ inhibitory and $N_e$ excitatory neurons, so that $N = N_e + N_i$. As before, the whole network is fully connected, and the weight distribution for each population follows a gaussian distribution: $N(\mu_i, \sigma_i)$ for the inhibitory population and $N(\mu_e, \sigma_e)$ for the excitatory one. In this case, a given excitatory neuron projects toward both populations, and the draws are independent; the same holds for an inhibitory neuron. Assuming that $\mu_i < 0$ and $\mu_e > 0$, we can compute $X_{t+1}$ using

$$P\bigl(V_i(t+1) > \theta\bigr) = \frac{1}{\sqrt{2\pi}} \int_{\frac{\theta - \mu_i X_t^i - \mu_e X_t^e}{\sqrt{\sigma_i^2 X_t^i + \sigma_e^2 X_t^e}}}^{\infty} e^{-x^2/2}\,dx.$$

We reuse the same equations with these two variables, since we have $X_{t+1}^e = \bigl(1 - \frac{N_i}{N}\bigr) X_{t+1}$ and $X_{t+1}^i = X_{t+1} - X_{t+1}^e$. Indeed, the whole charge created by
the spiking activity is spread among all neurons. We can extend this to more than two populations; the results follow the same pattern as in the one-population case. Finally, we can introduce a sparse weight matrix (a matrix with zero coefficients), computed as follows: a weight $w$ has a probability $p$ of being zero and a probability $1-p$ of following a normal law $N(\mu/N,\, \phi/\sqrt{N})$. As above, this leads to a new $\hat p_{\phi,\mu}$ function. When calculating the charge, it previously came from $X$ neurons, leading to a sum of $X$ normal laws; in the sparse case, this reduces to $(1-p)X$ neurons. So our new function becomes $\hat p_{\phi,\mu}(y) = p_{\phi,\mu}\bigl((1-p)y\bigr)$. It remains to insert this new function into equation 5.13.

6 Results and Comparison
In order to compute the $P(k,t)$, we need a seemingly false hypothesis: the independence of charges. The total charge for a given neuron is calculated as if the weight matrix were redrawn at each time step. In other words, the stochastic input of one neuron (given by the others) is treated as a noisy threshold. As will be shown in the results, this is a rather good approximation. It also supposes, for the self-sustaining mode, that the system does not die. This problem appears throughout, leading either to a slight overestimation of the spiking activity (when the probability of dying is weak) or to a complete failure (for intermediate coupling factors). We conducted extensive numerical simulations to confront our formulas. For each set of parameters, 1000 random networks of 1000 I&F neurons were used. We used the same threshold value ($\theta = 1.0$) and, in accordance with equation 5.13, we set $v_{min} = 0$. We tested various $\gamma$, $\phi$, $x_0$, $\mu$, and $p$. All results were consistent with the theoretical computations. We obtain striking accuracy in describing the temporal evolution of the averaged spiking activity, both qualitatively and quantitatively.
The steady state is reached quickly (within a few time steps), and the transients are strikingly well predicted by our equations. When $\gamma \neq 0$, the prediction slightly overestimates the activity; when $\gamma = 0.0$, it is accurate. Sample results are displayed in Figures 1 and 2 for $\mu = 0.0$ (centered weight matrix) and $p = 0.0$ (no sparsity) and various values of $\phi$ and $\gamma$. The figures also show that when $\gamma = 0.0$ and $\phi = 2.5$ (in Figure 1), and also when $\gamma = 1.0$ and $\phi = 1.5$, the prediction completely fails. These are not isolated points. In fact, for every $\gamma$, there is an interval of $\phi$ (once $p$ and $\mu$ are chosen) where the spiking activity shows no regularity. On the boundary between death and self-sustaining activity, both independence hypotheses fail.
Figure 1: Transients. Comparison between theoretical computations and experimental data, for two leak values (γ ∈ {0.0, 0.5}). The spiking activity increases with the coupling factor (φ ∈ {1.5, 2.5, 3.5, 4.5, 5.5}). Theoretical results are displayed as solid curves; experimental data points are circles. The curves are paired and increase with φ: the lowest pair (experimental and theoretical) corresponds to φ = 1.5, the highest to φ = 5.5. Note that for γ = 0.0 and φ = 2.5, the theoretical computation predicts a markedly higher value than the experimental result. Note also the slight overestimation of the spiking activity as soon as γ ≠ 0. The starting fraction of spiking neurons is x0 = 0.10 for each simulation. Parameters: N = 1000, θ = 1, p = µ = 0.
For the variance prediction, results are less accurate. As expected, the higher the moment, the harder the prediction; more neurons would probably have been needed to obtain an accurate prediction. Nevertheless, when φ is high enough, we are able to describe precisely the evolution of the variance. It converges as quickly as the expectation to a limit, and this limit decreases with the coupling factor. Since the variance computation is based on the expectation, we did not expect it to work around intermediate values of φ, where the expectation prediction already fails completely. Figure 3 displays typical variance predictions for two values of φ and a leak of γ = 0.9 (the other parameters were θ = 1, p = 0, and µ = 0.0).
Spontaneous Dynamics of Random Recurrent Networks
Figure 2: Transients (continued). More comparisons between theoretical computations and experimental data. Results for two higher leak values (γ ∈ {0.9, 1.0}). The spiking activity increases with the coupling factor (φ ∈ {1.5, 2.5, 3.5, 4.5, 5.5}). Theoretical results are displayed with a plain curve; experimental data points are circles. The curves are paired and increase with φ (see Figure 1). Note that the slight overestimation of the spiking activity increases with γ. Note also that when γ = 1.0 and φ = 1.5, theory predicts death, while the experimental values do not converge to zero. The starting fraction of spiking neurons is x₀ = 0.10 for each simulation. Parameters: N = 1000, θ = 1, p = µ = 0.
7 Discussion

The independence hypothesis can be a powerful way to approach random networks of spiking neurons. Mean field techniques and local field partitions are commonly used to deal with them. However, to the best of our knowledge, no other study has allowed a theoretical derivation of all the moments of the spiking activity distribution. Using this framework, we are able to describe the behavior of large spiking neural networks in spontaneous functioning. The initial stimulation corresponds to a synchronous spiking of a fraction (x₀) of the network. This shows that a spontaneous regime can be self-sufficient: networks can afford discontinuous inputs without losing their internal activity.
Figure 3: Variance. The variance of the spiking activity is displayed for two values of the coupling factor (φ ∈ {4.5, 5.5}). For clarity, the variance was scaled by √N and displayed with the corresponding predicted expectation of the spiking activity (dotted curves). As in previous figures, circles correspond to experimental data points and the plain curve to theoretical computation. Note that the variance converges as quickly as the expectation. Note also that the higher the coupling factor, the better the prediction. Parameters: N = 1000, θ = 1, γ = 0.9, p = µ = 0.
More precisely, we proved that the coupling factor (φ = σ√N) characterizes the average spiking activity. For instance, whatever the initial stimulation (provided it is strong enough), the network's average spiking activity reaches the same steady state. Since neural death is also a possible steady state, we exhibited, in dynamical systems terminology, a bifurcation depending on the value of φ (Brunel & Hakim, 1999). The spiking activity grows with the coupling factor, while the variability of the spiking activity distribution decreases. Moreover, for a high value of φ, it seems that the self-sustaining activity is maintained by neurons that fire at the maximum rate. Averaging says nothing about any particular neuron. Experimental data show that a very high value of φ is needed to obtain periodic neurons. However, the number of periodic neurons grows with the coupling factor, leading to an extreme locking: at the limit φ = ∞, all neurons are periodic (with period either 1 or ∞), and those that fire do so synchronously (that is, all the time). However, around the bifurcation, independence is doomed to fail and, consequently, so is the prediction. Indeed, the coupling is too weak to allow regularities.
Since our method allows us to derive all the moments of the distribution, a way is open to obtain a full description of the spiking activity distribution. We are also able to provide the stochastic repartition of the neurons' instantaneous period. This information is a starting point for studying the effect on spiking activity of a learning algorithm based on spiking delays.

Appendix A: Wald's Identity

Let us define, for s ∈ [0, 1], the moment generating function G_X(s) of a random variable X by G_X(s) = E(s^X). It is easy to see that the nth factorial moment of X is given by the nth derivative of G_X evaluated at 1:

$$G_X'(1) = E(X), \quad G_X''(1) = E(X(X-1)), \quad \dots, \quad G_X^{(n)}(1) = E\big(X(X-1)\cdots(X-n+1)\big).$$

Given (X_i)_{i∈N} a sequence of independent and identically distributed (i.i.d.) random variables, and N an integer-valued random variable independent of the X_i, let us define Y = Σ_{i=1}^N X_i. We then have

$$G_Y(s) = E\left(s^{\sum_{i=1}^{N} X_i}\right) = \sum_{n=0}^{+\infty} P(N=n)\, E\left(s^{\sum_{i=1}^{n} X_i}\right) = \sum_{n=0}^{+\infty} P(N=n)\, E\left(s^{X_1}\right)^n = G_N\big(G_{X_1}(s)\big).$$

By differentiating the above equality once and then twice, we respectively obtain the first and second Wald identities:

$$E(Y) = E(N)E(X_1), \qquad \operatorname{Var}(Y) = E(N)\operatorname{Var}(X_1) + \operatorname{Var}(N)E(X_1)^2.$$

Appendix B: Random Sum of Random Variables

Let us first prove a general result, which will have an application for the neuronal potential, written as a random sum of i.i.d. random variables.
Let f be a function such that lim_{k→∞} f(k) = α ∈ R, and let (p_k^N)_{(k,N)∈N²} be a sequence satisfying ∀k ∈ N, lim_{N→∞} p_k^N = 0 and Σ_{k=1}^N p_k^N = 1. Define

$$g(N) = \sum_{k=1}^{N} p_k^N f(k),$$

and let us prove that

$$\lim_{N\to\infty} g(N) = \alpha. \tag{B.1}$$

For all ε > 0, there exists N₀ ∈ N such that ∀k > N₀, |f(k) − α| < ε/2. So we can write

$$|g(N)-\alpha| \le \left|\sum_{k=1}^{N_0} p_k^N \big(f(k)-\alpha\big)\right| + \left|\sum_{k=N_0+1}^{N} p_k^N \big(f(k)-\alpha\big)\right| \le \left|\sum_{k=1}^{N_0} p_k^N \big(f(k)-\alpha\big)\right| + \frac{\varepsilon}{2}.$$

N₀ being fixed, it remains, to complete the proof, to find a rank N₁ such that ∀N > N₁,

$$\left|\sum_{k=1}^{N_0} p_k^N \big(f(k)-\alpha\big)\right| < \frac{\varepsilon}{2}.$$

Application. Let X^(N) be a sequence of random variables on [1, N] such that

$$\lim_{N\to\infty} E\big(X^{(N)}\big) = \infty \quad \text{and} \quad \forall k,\ \lim_{N\to\infty} P\big(X^{(N)} = k\big) = 0,$$

which is satisfied, for instance, when X^(N) ∼ N(N/2, σ²) or when X^(N) is uniform on [1, N]. In this case, if we set p_k^(N) = P(X^(N) = k), we can write g(N) = E(f(X^(N))), and equation B.1 yields

$$\lim_{N\to\infty} E\big(f(X^{(N)})\big) - f\big(E(X^{(N)})\big) = 0. \tag{B.2}$$

Equation 4.1 derives from the case

$$f(k) = \frac{1}{\sqrt{2\pi}} \int_{\frac{1}{\sqrt{k}\,\sigma}}^{+\infty} e^{-\frac{x^2}{2}}\, dx.$$
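As a numerical illustration of equation B.2 (a sketch, assuming the uniform case X^(N) ∼ U[1, N] and, arbitrarily, σ = 1 in the f above): both E(f(X^(N))) and f(E(X^(N))) tend to 1/2, and their gap shrinks as N grows.

```python
import math

def f(k, sigma=1.0):
    """f(k) = 1 - Phi(1 / (sqrt(k) * sigma)), the case behind equation 4.1."""
    return 0.5 * math.erfc(1.0 / (math.sqrt(k) * sigma) / math.sqrt(2.0))

def gap(n):
    """|E(f(X)) - f(E(X))| for X uniform on {1, ..., n}, computed exactly."""
    e_of_f = sum(f(k) for k in range(1, n + 1)) / n
    f_of_e = f((n + 1) / 2.0)  # E(X) = (n + 1) / 2
    return abs(e_of_f - f_of_e)

print(gap(100), gap(10_000))  # the gap shrinks as N grows
```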
Appendix C: Simple Case

We prove here that substituting γ = 0 into equation 4.4 gives the equation x_t = p_φ(x_{t−1}).
It yields

$$x_{t+1} = \sum_{m=0}^{t} \hat{x}_m\, p_\varphi(x_t) \prod_{j=m}^{t-1} \big(1 - p_\varphi(x_j)\big).$$

For t = 0, it gives x₁ = p_φ(x₀). Using the recurrence hypothesis ∀m = 1 … t, x_m = p_φ(x_{m−1}), we find that

$$x_{t+1} = \sum_{m=0}^{t} p_\varphi(x_{m-1})\, p_\varphi(x_t) \prod_{j=m}^{t-1} \big(1 - p_\varphi(x_j)\big) = p_\varphi(x_t) \sum_{m=0}^{t} p_\varphi(x_{m-1}) \prod_{j=m}^{t-1} \big(1 - p_\varphi(x_j)\big) = p_\varphi(x_t)\, v_{t-1}.$$

It is now enough to note that

$$v_{t-1} = p_\varphi(x_{t-1}) + \big(1 - p_\varphi(x_{t-1})\big) \sum_{m=0}^{t-1} p_\varphi(x_{m-1}) \prod_{j=m}^{t-2} \big(1 - p_\varphi(x_j)\big) = p_\varphi(x_{t-1}) + \big(1 - p_\varphi(x_{t-1})\big)\, v_{t-2}.$$

Since v₀ = (1 − p_φ(x₀)) + p_φ(x₀) = 1, we have v_t = 1 for all t. We thus obtain equation 4.6 by recurrence.

Appendix D: Sufficient Condition for Neural Death

Let us go back to

$$p_\varphi(y) = \frac{1}{\sqrt{2\pi}} \int_{\frac{\theta}{\sqrt{y}\,\varphi}}^{\infty} e^{-\frac{x^2}{2}}\, dx,$$

for y ∈ [0, 1]. Taking the derivative with respect to y, we have

$$p_\varphi'(y) = \frac{\theta}{2\sqrt{2\pi}\,\varphi\, y^{3/2}}\, e^{-\frac{\theta^2}{2 y \varphi^2}}.$$

We see immediately that p_φ′(y) ≥ 0, so p_φ is increasing. Stable nonzero fixed points can appear only where the graph of p_φ crosses the line y = x from above. A sufficient condition for neural death is therefore that ∀y, p_φ′(y) < 1.
Let z = 1/y, τ = θ/(√2 φ), and g(z) = p_φ′(1/z). Then

$$g(z) = \frac{\tau z^{3/2}}{2\sqrt{\pi}}\, e^{-\tau^2 z}.$$

Taking the derivative of g(z) yields

$$g'(z) = \frac{\tau z^{1/2}}{2\sqrt{\pi}}\, e^{-\tau^2 z} \left(\frac{3}{2} - \tau^2 z\right).$$

So g′(z) has the same sign as 3/(2τ²) − z, and since g(0) = g(+∞) = 0, the maximum of g(z) is obtained at z = 3/(2τ²). But z = 1/y, so z ∈ [1, +∞[. It leads to

$$\max_{y\in[0,1]} p_\varphi'(y) = \max_{z\in[1,+\infty[} g(z) = \max\left\{ g\!\left(\frac{3}{2\tau^2}\right),\, g(1) \right\}.$$

So a sufficient condition for zero to be the only fixed point becomes

$$g\!\left(\frac{3}{2\tau^2}\right) = \frac{\tau \left(\frac{3}{2\tau^2}\right)^{3/2}}{2\sqrt{\pi}}\, e^{-\frac{3}{2}} = \left(\frac{3}{2e}\right)^{3/2} \frac{1}{2\sqrt{\pi}\,\tau^2} < 1.$$

Since τ = θ/(√2 φ), we get

$$\varphi < \left(\frac{2e}{3}\right)^{3/4} \pi^{1/4}\, \theta.$$
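Numerically (a sketch assuming θ = 1), the bound evaluates to φ < (2e/3)^{3/4} π^{1/4} ≈ 2.08, and a grid maximization of p_φ′ over y ∈ (0, 1] confirms that the slope stays below 1 underneath the bound:

```python
import math

def dp_phi(y, phi, theta=1.0):
    """p_phi'(y) = theta / (2 sqrt(2 pi) phi y^{3/2}) * exp(-theta^2 / (2 y phi^2))."""
    return (theta / (2.0 * math.sqrt(2.0 * math.pi) * phi * y ** 1.5)
            * math.exp(-theta ** 2 / (2.0 * y * phi ** 2)))

def max_slope(phi, theta=1.0, grid=50_000):
    """Grid-search maximum of p_phi' over y in (0, 1]."""
    return max(dp_phi((i + 1) / grid, phi, theta) for i in range(grid))

# Sufficient-condition threshold from Appendix D, with theta = 1:
phi_star = (2.0 * math.e / 3.0) ** 0.75 * math.pi ** 0.25
print(round(phi_star, 2))    # ~2.08

print(max_slope(2.0) < 1.0)  # below the bound: slope < 1 everywhere, only fixed point is zero
print(max_slope(3.0) > 1.0)  # well above the bound: slope exceeds 1 somewhere
```

Note that the condition is only sufficient: a slope below 1 guarantees death, but exceeding it does not by itself prove a nonzero fixed point.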
Acknowledgments

We thank Nicolas Brunel and Wulfram Gerstner for their stimulating discussions and helpful comments on this work. We are also grateful to the anonymous referees for their comments on the letter, which helped to improve it substantially. We acknowledge the financial support of the French ACI (Computational and Integrative Neuroscience; Neuroscience Intégrative et Computationnelle).

References

Amit, D., & Brunel, N. (1997a). Dynamics of recurrent spiking neurons before and following learning. Network: Comput. Neural Syst., 8, 373–404.
Amit, D., & Brunel, N. (1997b). Model of global spontaneous activity and local structured delay activity during learning periods in the cerebral cortex. Cerebral Cortex, 7, 237–252.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. Journal of Computational Neuroscience, 8, 183–208.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, 11, 1621–1671.
Chow, C. (1998). Phase-locking in weakly heterogeneous neural networks. Physica D, 118(3–4), 343–370.
Coombes, S. (1999). Chaos in integrate-and-fire dynamical systems. In Proc. of Stochastic and Chaotic Dynamics in the Lakes. American Institute of Physics Conference Proceedings.
Del Giudice, P., & Mattia, M. (2003). Stochastic dynamics of spiking neurons. In E. Korutcheva & R. Cuerno (Eds.), Advances in condensed matter and statistical mechanics. Hauppauge, NY: Nova Science.
Fusi, S., & Mattia, M. (1999). Collective behavior of networks with linear (VLSI) integrate-and-fire neurons. Neural Computation, 11, 633–652.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states and locking. Neural Computation, 12, 43–89.
Gerstner, W. (2001). Populations of spiking neurons. In W. Maass & C. Bishop (Eds.), Pulsed neural networks. Cambridge, MA: MIT Press.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Golomb, D. (1994). Clustering in globally coupled inhibitory neurons. Physica D, 72, 259–282.
Mattia, M., & Del Giudice, P. (2002). Population dynamics of interacting spiking neurons. Physical Review E, 66(5), 051917.
Meyer, C., & van Vreeswijk, C. (2002). Temporal correlations in stochastic networks of spiking neurons. Neural Computation, 14(2), 369–404.
Mongillo, G., & Amit, D. (2001). Oscillations and irregular emission in networks of linear spiking neurons. Journal of Computational Neuroscience, 11, 249–261.
Moynot, O., & Samuelides, M. (2002). Large deviations and mean-field theory for asymmetric random recurrent neural networks. Probability Theory and Related Fields, 123(1), 41–75.
Plesser, H. E., & Gerstner, W. (2000). Noise in integrate-and-fire models: From stochastic input to escape rates. Neural Computation, 12, 367–384.
Tuckwell, H. (1988). Introduction to theoretical neurobiology, Vol. 2: Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
Wald, A. (1945). Sequential tests of statistical hypotheses. Ann. Math. Stat., 16, 117–186.
Received October 11, 2004; accepted May 31, 2005.
LETTER
Communicated by Garrett Stanley
Bayesian Population Decoding of Motor Cortical Activity Using a Kalman Filter

Wei Wu
[email protected]
Yun Gao
[email protected] Division of Applied Mathematics, Brown University, Providence, RI 02912, U.S.A.
Elie Bienenstock
[email protected] Division of Applied Mathematics and Department of Neuroscience, Brown University, Providence, RI 02912, U.S.A.
John P. Donoghue
[email protected]
Michael J. Black
[email protected] Department of Computer Science, Brown University, Providence, RI 02912, U.S.A.
Effective neural motor prostheses require a method for decoding neural activity representing desired movement. In particular, the accurate reconstruction of a continuous motion signal is necessary for the control of devices such as computer cursors, robots, or a patient's own paralyzed limbs. For such applications, we developed a real-time system that uses Bayesian inference techniques to estimate hand motion from the firing rates of multiple neurons. In this study, we used recordings that were previously made in the arm area of primary motor cortex in awake behaving monkeys using a chronically implanted multielectrode microarray. Bayesian inference involves computing the posterior probability of the hand motion conditioned on a sequence of observed firing rates; this is formulated in terms of the product of a likelihood and a prior. The likelihood term models the probability of firing rates given a particular hand motion. We found that a linear gaussian model could be used to approximate this likelihood and could be readily learned from a small amount of training data. The prior term defines a probabilistic model of hand kinematics and was also taken to be a linear gaussian model. Decoding was performed using a Kalman filter, which gives an efficient recursive method for Bayesian inference when the likelihood and prior are linear and gaussian. In off-line experiments, the Kalman filter reconstructions of hand trajectory were more accurate than previously reported results. The resulting decoding algorithm provides a principled probabilistic model of motor-cortical coding, decodes hand motion in real time, provides an estimate of uncertainty, and is straightforward to implement. Additionally, the formulation unifies and extends previous models of neural coding while providing insights into the motor-cortical code.

Neural Computation 18, 80–118 (2006)
© 2005 Massachusetts Institute of Technology

1 Introduction

Recent research on developing neural motor prostheses has demonstrated the feasibility of direct neural control of computer cursor motion and other devices using implanted electrodes in nonhuman primates (Wessberg et al., 2000; Serruya, Hatsopoulos, Paninski, Fellows, & Donoghue, 2002; Taylor, Helms Tillery, & Schwartz, 2002; Carmena et al., 2003). These results are enabled by a variety of mathematical decoding methods that produce an estimate of the subject's state (e.g., hand position) from a sequence of measurements (e.g., the firing rates of a population of cells). A number of algorithms have been proposed for decoding extracellularly recorded neural firing activity in the arm area of primary motor cortex (MI) to perform off-line reconstruction of hand motion or online control of cursors or robotic devices (see Schwartz, Taylor, & Helms Tillery, 2001; Serruya, Hatsopoulos, Fellows, Paninski, & Donoghue, 2003, for a brief overview). Here we pose this problem as one of Bayesian inference in which the goal is to estimate the a posteriori probability of hand kinematics conditioned on an observed sequence of firing rates. The Bayesian approach formulates this posterior probability as the product of a likelihood term and an a priori probability. The likelihood term models the probability of the firing rates given the current hand motion and can be learned from training data.
The prior combines a model of how the hand moves over time with an estimate of the kinematics at the previous time instant. The Kalman filter (Gelb, 1974; Kalman, 1960) provides an efficient recursive algorithm to optimally estimate the posterior probability when the likelihood and prior models are linear and gaussian. Previous methods for decoding MI activity include the population-vector algorithm (Georgopoulos, Schwartz, & Kettner, 1986; Moran & Schwartz, 1999a, 1999b; Schwartz & Moran, 1999; Taylor et al., 2002), linear filtering (Paninski, Fellows, Hatsopoulos, & Donoghue, 2004; Sanchez et al., 2003; Serruya et al., 2002; Wessberg et al., 2000), and artificial neural networks (Wessberg et al., 2000). Each of these methods can be viewed as a direct method that attempts to estimate the hand kinematics x as a function of the neural firing z, that is, x = f_1(z).
In contrast, most models of neural coding can be viewed as generative models where the neural activity is a function of a behavior or stimulus x, that is, z = f_2(x) + noise, where the noise might be assumed to be Poisson. The Bayesian approach presented here provides a clear and rigorous way of taking generative models of neural encoding and exploiting them to perform decoding. Previous authors have explored the relationship between firing rates in MI and various aspects of hand motion (position, direction, speed, velocity, or acceleration) (Kettner, Schwartz, & Georgopoulos, 1988; Georgopoulos et al., 1986; Moran & Schwartz, 1999b; Flament & Hore, 1988). While previous studies have typically viewed these behavioral variables in isolation, we found that decoding performance is improved when the encoding model simultaneously takes into account all these variables. This suggests the need for a richer model of neural coding than is typically considered. Additionally, our results demonstrate the importance of modeling correlated noise in the firing rates of multiple cells: decoding performance drops significantly when cells are assumed to be conditionally independent. Much of the prior work on motor cortical decoding has focused on relatively constrained "center-out" motions rather than the continuous motions considered here (Paninski et al., 2004). To cope with continuous motion, we adopt a simple prior model of hand motion that, in contrast to previous decoding methods, explicitly models the temporal evolution of the hand kinematics. Bayesian methods have been exploited previously to infer the 2D location of a rat from hippocampal place-cell activity (Brown, Frank, Tang, Quirk, & Wilson, 1998; Twum-Danso & Brockett, 2001; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998).
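The Bayesian recursion behind such decoders can be illustrated with a minimal scalar Kalman filter: the standard textbook predict/update cycle, shown here with illustrative constants a, q, h, r. This is a sketch of the general technique, not the multivariate filter fitted in this letter.

```python
import random

random.seed(0)

# Scalar linear gaussian model (illustrative constants):
#   state:       x_k = a * x_{k-1} + w,  w ~ N(0, q)   (kinematic prior)
#   observation: z_k = h * x_k + v,      v ~ N(0, r)   (firing-rate likelihood)
a, q = 1.0, 0.01
h, r = 2.0, 0.25

def kalman_step(mean, var, z):
    """One predict/update cycle of the scalar Kalman filter."""
    mean_p = a * mean                        # predict with the prior model
    var_p = a * a * var + q
    gain = var_p * h / (h * h * var_p + r)   # update with the observation
    mean_u = mean_p + gain * (z - h * mean_p)
    var_u = (1.0 - gain * h) * var_p
    return mean_u, var_u

# Track a drifting state from noisy observations.
x, mean, var = 0.0, 0.0, 1.0
sq_errs = []
for _ in range(500):
    x = a * x + random.gauss(0.0, q ** 0.5)
    z = h * x + random.gauss(0.0, r ** 0.5)
    mean, var = kalman_step(mean, var, z)
    sq_errs.append((mean - x) ** 2)

mse = sum(sq_errs[100:]) / len(sq_errs[100:])
print(mse)  # well below the raw single-observation error r / h^2 = 0.0625
```

The filter also returns var, the posterior variance, which is the explicit uncertainty estimate highlighted as an advantage of this approach.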
The application of Bayesian decoding to motor cortical data was proposed in Gao, Black, Bienenstock, Shoham, and Donoghue (2002), with various Kalman filter formulations being studied recently (Sanchez, Erdogmus, Principe, Wessberg, & Nicolelis, 2002; Wu et al., 2002, 2003). Some of the results in this letter were previously reported in Wu et al. (2003). The Kalman filter has a number of desirable properties for motor cortical decoding. The inclusion of prior information about the system state enables an efficient recursive formulation of the decoding algorithm and effectively smooths noisy estimates in a mathematically principled way; this is particularly important for decoding the complex, natural hand motions required for neural motor prostheses. We reconstructed hand trajectories from prerecorded neural firing rates and found that the Kalman filter method was more accurate than previous approaches while also being computationally efficient. The detailed application of the method provides insight into neural coding in MI and is useful for examining optimal time lags between
spiking activity and hand movement, the accuracy of different models of hand kinematics, the contributions of different neural population sizes, and the effect of temporal bin sizes for estimating neuronal firing rate. We describe the implementation and structure of a Kalman filter method that is computationally efficient to learn, requires little training data, provides real-time decoding, is applicable to complex natural motions, and thus seems well suited to prosthetic applications. The structure of this letter is as follows. Section 2 summarizes the experimental paradigm used previously to obtain the neural recordings. This section also introduces the linear gaussian model of motor cortical activity, our Bayesian framework, the underlying statistical assumptions, the Kalman filter decoding algorithm, and an approach for estimating optimal time lags. Section 3 describes the decoding results as well as experimental results related to various modeling choices that are important for accurate decoding. It also compares the Kalman filter with previous decoding methods (the linear filter and population vector methods). Section 4 discusses related work, and section 5 offers conclusions. The appendix provides the mathematical details of the Kalman filter algorithm, a comparison with the Wiener filter, and a few additional algorithmic and modeling details with associated experimental results.

2 Materials and Methods

2.1 Experimental Methods. The neural data used here were prerecorded and have been described elsewhere (Paninski et al., 2004; Serruya et al., 2002). Briefly, after initial task training, two macaque monkeys were implanted with silicon microelectrode arrays containing 100 platinized-tip probes (Cyberkinetics Inc., Foxboro, MA). Details of the array and recording protocols are described elsewhere (Maynard, Nordhausen, & Normann, 1997; Maynard et al., 1999).
The devices were implanted in the arm area of primary motor cortex (MI) (see Donoghue, Sanes, Hatsopoulos, & Gaal, 1998, for details). All procedures were in accordance with protocols approved by Brown University Institutional Animal Care and Use Committee. Signals were amplified and digitized using commercial hardware (Plexon Inc., Dallas TX). As is common practice in the literature, waveforms crossing experimenter-determined thresholds were further processed to detect action potentials; the details differed for each of the two behavioral tasks described below. Action potentials were then counted within fixed time windows (bins), and the firing rate (number of spikes per unit time) within each bin was computed for each neuron. All encoding and decoding analysis was performed using these discrete approximations to the firing rate. The behavioral paradigms for the two tasks below are described in Paninski et al. (2004) and Serruya et al. (2002). In each task, the monkey viewed a computer monitor while continuously moving a manipulandum on a 30 cm × 30 cm tablet (with approximately a 25 cm × 25 cm workspace)
84
W. Wu, Y. Gao, E. Bienenstock, J. Donoghue, and M. Black
that was parallel to the floor. The position of the manipulandum (hand position) controlled the 2D motion of a feedback cursor on the monitor. The hand position and neural activity were recorded simultaneously. In addition to hand position, we computed derivatives of the position using finite differences; these approximated velocity, acceleration, and higher-order hand kinematics. For example, given a time series of positions (x_k, y_k) at times t_k, we approximated the velocity (v_{x,k}, v_{y,k}) as ((x_k − x_{k−1})/Δt, (y_k − y_{k−1})/Δt), where Δt is the time step length. It is well known that this approach results in noisy estimates of the derivatives, with the effects of noise being more pronounced in the higher derivatives. Consequently, we also experimented with using splines to smoothly interpolate the position data. Differentiating these splines produced less noisy derivatives but did not significantly improve decoding performance. It also increased the complexity of the method and complicated real-time decoding. Consequently, in all further analysis we used the simple, discrete derivative approximations to represent the hand kinematics.

Pursuit Tracking Task. The pursuit tracking task was described in detail in Paninski et al. (2004). Briefly, a target dot moved slowly on the monitor, and the behavioral task required moving the feedback cursor with the manipulandum so that it tracked the target within a given distance range. On each trial, the target motion followed a unique random walk from a different random starting location. Each trial ended when the dot was out of the tracking range and lasted at most 10 seconds, with the majority of trials lasting approximately 8 or 9 seconds. Short trials, in which the monkey was unable to track the target for more than 5 seconds, were judged unsuccessful and were not considered. Our subsequent analysis was based on the remaining 182 trials.
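The finite-difference construction described above can be sketched as follows (the position samples are illustrative values only; Δt = 50 ms as in the pursuit tracking task):

```python
DT = 0.05  # 50 ms bin width (pursuit tracking task)

def derivatives(xs, dt=DT):
    """First differences (x_k - x_{k-1}) / dt; applying the rule twice gives acceleration."""
    return [(b - a) / dt for a, b in zip(xs, xs[1:])]

# Hand x-positions (meters) sampled every 50 ms -- illustrative values only.
xs = [0.10, 0.11, 0.13, 0.16, 0.20]
vx = derivatives(xs)   # velocities, one fewer sample than positions
ax = derivatives(vx)   # accelerations; differencing amplifies measurement noise
print(vx)  # approximately [0.2, 0.4, 0.6, 0.8], up to float rounding
print(ax)
```

Because each differencing step divides by a small Δt, any measurement noise on the positions is amplified at each level, which is why the letter notes that the effect is more pronounced in higher derivatives.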
The 2D histograms of the hand position, velocity, and acceleration over all trials are shown in Figure 1 (first row). Note that these distributions are significantly different from those obtained during simple, stylized, movements found in center-out reaching tasks (Taylor et al., 2002). While the hand motions were more “general” than those in stereotyped reaching tasks, the motion was still constrained to follow a particular path. After thresholding, detected waveforms were analyzed off-line using commercial software (Plexon Inc., Dallas TX). Twenty-five well-isolated individual units (neurons) were detected (Serruya et al., 2003), and the firing rate for each unit was computed in nonoverlapping 50 ms time bins. The hand kinematics were subsampled to match the 50 ms time intervals. Pinball Task. The pinball task (Serruya et al., 2002) was designed to test a direct neural control task and differed from the pursuit tracking task in that the target did not move continuously but rather appeared in random locations on the monitor. The monkey was required to move the feedback dot with the manipulandum to “hit” the target (within a
Figure 1: Normalized, log 2D histograms of kinematics for both the pursuit tracking task (first row) and the pinball task (second row). These plots show the log prior probability of the position, velocity, and acceleration of the hand.
prespecified distance). When the target was acquired, it disappeared and then reappeared in a new random location. Each time the target appeared, the monkey moved to hit the new location. The motions made by the monkey on the 2D plane were less constrained than in the pursuit tracking task since the exact motion used to reach the targets was under the control of the subject. From this experiment, we obtained two sets of trials: one was approximately 3.5 minutes in length and was used as training data to learn our encoding model; the other was approximately 1 minute in length and was used as test data for decoding. The 2D histograms of position, velocity, and acceleration of all data are shown in Figure 1 (second row). Note that the distribution over position is more uniform than in the pursuit tracking task. Also note that the magnitude of the velocity and acceleration of the hand was significantly higher in this task. The recording of hand motion and the approximation of the temporal derivatives (velocity, acceleration, and so forth) were exactly the same as in the pursuit tracking task. The waveforms, however, were sorted online into units using manually set thresholds without a separate off-line sorting process. All the waveforms crossing the thresholds were treated as action
potentials from a single unit, potentially resulting in multiunit data. The firing rate was computed for each such unit in 70 ms time bins, and there were 42 such units for this task. The firing rates of the cells were significantly higher than in the pursuit tracking task (likely due to the faster hand motion). Quantitatively, the average hand speeds were approximately 14.67 cm/s (pinball) and 2.88 cm/s (pursuit tracking), while the average firing rates over all cells were approximately 30 spikes/s (pinball) and 10 spikes/s (pursuit tracking).

2.2 Statistical Methods. Our focus here is on a probabilistic, Bayesian approach for inferring hand movement from the firing rates of neurons in MI. We want our approach to (1) have a sound probabilistic foundation, (2) explicitly model noise in the data, (3) indicate the uncertainty in estimates of hand position, (4) make the statistical assumptions about the data explicit, (5) require a minimal amount of "training" data, (6) provide online estimates of hand position with short delay (less than 200 ms), and (7) provide insight into the neural coding of movement. To that end, we developed a Kalman filtering method (Gelb, 1974; Welch & Bishop, 2001) that provides a rigorous and well-understood framework that addresses these issues. This approach provides a control-theoretic model for the encoding of hand movement in motor cortex and for inferring, or decoding, this movement from the firing rates of a population of neurons. The approach generalizes and extends previous coding models.

2.2.1 Modeling Neural Coding of Hand Kinematics. Georgopoulos, Kalaska, Caminiti, and Massey (1982) observed that the firing rates of cells in MI are approximated by a cosine "tuning function." In particular, the firing rate z_k of a cell at some time t_k is related to movement direction θ_k by

$$z_k = h_0 + h_p \cos(\theta_k - \theta_p), \tag{2.1}$$

where θ_p is the cell's "preferred" direction (i.e., the direction of maximal response) and h_0 and h_p are scalar constants. Equivalently, this can be expressed as

$$z_k = h_0 + h_1 \sin(\theta_k) + h_2 \cos(\theta_k), \tag{2.2}$$

where h_1, h_2 are scalar parameters that can be fit to training data. The above model has formed the foundation for much of the analysis of motor cortical encoding of motion. For general prosthetic applications, however, this model is insufficient, as it relates neural activity only to movement direction. Since it does not capture the full kinematics of hand motion, decoding using this approach has focused on center-out tasks, where movement direction is the key component of the motion.
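The equivalence of equations 2.1 and 2.2 follows from cos(θ − θ_p) = cos θ_p cos θ + sin θ_p sin θ, so h_1 = h_p sin θ_p and h_2 = h_p cos θ_p; conversely, θ_p = atan2(h_1, h_2) and h_p = sqrt(h_1² + h_2²). A sketch recovering a preferred direction from noise-free synthetic rates by ordinary least squares (the tuning parameters below are illustrative, not fitted values from this letter):

```python
import math

# Ground-truth tuning (illustrative): z = h0 + hp * cos(theta - theta_p)
h0, hp, theta_p = 10.0, 5.0, 1.0

thetas = [2.0 * math.pi * i / 16 for i in range(16)]
rates = [h0 + hp * math.cos(t - theta_p) for t in thetas]

def fit(thetas, rates):
    """Least-squares fit of z = b0 + b1*sin(theta) + b2*cos(theta) (normal equations)."""
    cols = [[1.0] * len(thetas),
            [math.sin(t) for t in thetas],
            [math.cos(t) for t in thetas]]
    n = len(thetas)
    ata = [[sum(cols[i][k] * cols[j][k] for k in range(n)) for j in range(3)] for i in range(3)]
    atz = [sum(cols[i][k] * rates[k] for k in range(n)) for i in range(3)]
    # Gaussian elimination on the 3x3 system, then back substitution.
    for i in range(3):
        for j in range(i + 1, 3):
            f = ata[j][i] / ata[i][i]
            ata[j] = [a - f * b for a, b in zip(ata[j], ata[i])]
            atz[j] -= f * atz[i]
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        b[i] = (atz[i] - sum(ata[i][j] * b[j] for j in range(i + 1, 3))) / ata[i][i]
    return b

b0, b1, b2 = fit(thetas, rates)
theta_hat = math.atan2(b1, b2)   # recovered preferred direction
hp_hat = math.hypot(b1, b2)      # recovered modulation depth
print(round(b0, 3), round(theta_hat, 3), round(hp_hat, 3))  # -> 10.0 1.0 5.0
```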
Moran and Schwartz (1999b) extended the above model to include the speed ||v|| of the hand,

$$z_k = h_0 + \|v\| \big(h_x \sin(\theta_k) + h_y \cos(\theta_k)\big). \tag{2.3}$$

This is equivalent to modeling the firing rate as a linear function of velocity in the x and y coordinates,

$$z_k = h_0 + h_x v_{x,k} + h_y v_{y,k}, \tag{2.4}$$

where v_{x,k} and v_{y,k} represent the hand velocity in the x and y directions, respectively, at time t_k. A similar linear model was proposed to relate firing rate and hand position (Kettner et al., 1988),

$$z_k = f_0 + f_x x_k + f_y y_k, \tag{2.5}$$
where x_k and y_k represent hand position and f_0, f_x, f_y are the linear coefficients that are fit to training data. We found that linear models of position and velocity (equations 2.5 and 2.4, respectively) provided reasonable approximations to our data. In the pursuit tracking data, 23 of the 25 cells are appropriately characterized by equation 2.5 and 25 by equation 2.4 (multiple regression, F test, p < 0.05). In the pinball data, these numbers are 39 and 41 of the 42 cells (multiple regression, F test, p < 0.05). Some examples of linear fits are shown in Figure 2 for illustration. In addition to position and velocity, the firing rate has been shown to be related to hand acceleration (Flament & Hore, 1988). Based on this previous work, we assume (for now) that the population activity is related to the position, velocity, and acceleration of the hand (this will be verified and extended later). We then define the system state to be a six-dimensional vector x_k = [x, y, v_x, v_y, a_x, a_y]_k^T representing the x-position, y-position, x-velocity, y-velocity, x-acceleration, and y-acceleration of the hand at time t_k = kΔt, where Δt = 70 ms (pinball task) or 50 ms (pursuit tracking task) in our experiments. Other models of the hand kinematics are possible and are explored in the experimental results. Below, we show how to estimate this system's state in a principled way that extends previous work, incorporates an explicit noise model, and models correlations between the activity of different cells.

2.2.2 Bayesian Formulation. The above models all linearly relate components of the system state to the firing rates of individual neurons. Generalizing this idea, let observations z_k ∈ R^C represent a C × 1 vector containing the firing rates at time t_k for C observed neurons within a Δt time interval.
W. Wu, Y. Gao, E. Bienenstock, J. Donoghue, and M. Black

[Figure 2 appears here: empirical tuning panels for position (cells 6, 19, 24, 26) and velocity (cells 2, 3, 16, 31), each paired with its linear fit.]
Figure 2: Empirical tuning functions and their linear approximation. The top two rows show data and models from the pursuit tracking task, while the bottom two rows illustrate the pinball data. For each task, the first row shows empirical firing rates, and the second row shows the linear fit to these data. The empirical plots show the mean firing rate as a function of position (left two columns) or velocity (right two columns). The darkest blue areas correspond to kinematics that were never observed. The color coding (blue through red) represents the mean firing rate observed for a discrete region in the parameter space. The linear fit to these data is computed during the training of the Kalman filter model. The linear fit is seen to provide a crude but reasonable approximation to the raw data.
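Empirical tuning maps like those in Figure 2 amount to averaging a cell's firing rate over a discretized kinematic space. A minimal sketch (our illustration; the function name and bin count are assumptions, not the authors' code):

```python
import numpy as np

def tuning_map(kin, rates, n_bins=10):
    """Mean firing rate over a discretized 2D kinematic space.

    kin: (T, 2) positions or velocities; rates: (T,) firing rates of a
    single cell. Returns an (n_bins, n_bins) grid; NaN marks kinematic
    bins that were never observed."""
    edges = [np.linspace(kin[:, d].min(), kin[:, d].max(), n_bins + 1)
             for d in range(2)]
    counts, _, _ = np.histogram2d(kin[:, 0], kin[:, 1], bins=edges)
    totals, _, _ = np.histogram2d(kin[:, 0], kin[:, 1], bins=edges,
                                  weights=rates)
    with np.errstate(invalid="ignore"):
        return totals / counts   # 0/0 -> NaN for unvisited bins
```

The linear fits in the second row of each panel come from the observation model learned during Kalman training, not from these binned means.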
Let Z_k = [z_1, z_2, . . . , z_k]^T represent the history of measurements up to time bin k. We pose the problem of estimating the system state as one of Bayesian inference. Let p(x_k | Z_k) be the a posteriori probability of the system state conditioned on the measurements. We make two Markov assumptions:
p(xk |xk−1 , xk−2 , . . . , x1 ) = p(xk |xk−1 )
(2.6)
Motor Cortical Decoding Using a Kalman Filter
and p(z_k | x_k, Z_{k-1}) = p(z_k | x_k).
(2.7)
These state that, given the hand kinematics at time k − 1, the hand kinematics at time k is conditionally independent of the previous hand motions and that, conditioned on the current state, the firing rates are independent of the firing rates at previous time instants. Then, using Bayes' theorem and simple algebra, we can write the posterior probability as

p(x_k | Z_k) = κ p(z_k | x_k) ∫ p(x_k | x_{k-1}) p(x_{k-1} | Z_{k-1}) dx_{k-1},
(2.8)
where p(z_k | x_k) is called the likelihood of the state, p(x_k | x_{k-1}) is a temporal prior that models how the state evolves from one time instant to the next, and p(x_{k-1} | Z_{k-1}) is simply the posterior probability at the previous time instant. The term κ is a normalizing term, independent of x_k, which ensures that the posterior integrates to 1. Decoding then involves estimating the posterior probability, p(x_k | Z_k), at each time instant, which we can do recursively using equation 2.8. Having a representation of the full posterior has advantages over methods that estimate x_k with no representation of uncertainty. Given the posterior, one can compute an estimate of x_k in a variety of standard ways. For example, one can compute the expected value of x_k, E[x_k | Z_k], or the value that maximizes p(x_k | Z_k), resulting in a maximum a posteriori (MAP) estimate. In the general case, the posterior probability, p(x_k | Z_k), can be an arbitrary, nongaussian, multimodal distribution (Gao et al., 2002; Cisek & Kalaska, 2002), and the integral in equation 2.8 can be numerically approximated using Monte Carlo integration techniques (Brockwell, Rojas, & Kass, 2004; Gao et al., 2002; Shoham, 2001). To achieve a real-time decoding algorithm, we assume that the likelihood and the prior are both gaussian, and this leads to a closed-form recursive solution for estimating the posterior (which is also gaussian); this is known as the Kalman filter (Kalman, 1960; Gelb, 1974; Welch & Bishop, 2001). To fully specify the decoding model, we must define the likelihood, the temporal prior, and the algorithm for estimating the posterior. These are described below.

Likelihood Model. The likelihood relates the hand kinematics to the neural firing rates. We begin by defining a generative model for the activity of the population as z_k = H_k x_k + q_k,
(2.9)
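Under the gaussian assumptions above, the recursion of equation 2.8 has the familiar closed-form Kalman predict/update step. A minimal sketch (our illustration, not the authors' implementation; it uses the time-invariant A, W, H, Q adopted later, in section 2.2.3):

```python
import numpy as np

def kalman_step(x, P, z, A, W, H, Q):
    """One recursion of equation 2.8 under the gaussian assumptions:
    propagate the posterior N(x, P) through the temporal prior, then
    fold in the likelihood of the new firing-rate vector z."""
    # Predict with the system model x_{k+1} = A x_k + w_k.
    x_pred = A @ x
    P_pred = A @ P @ A.T + W
    # Update with the measurement model z_k = H x_k + q_k.
    S = H @ P_pred @ H.T + Q                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

Each call returns the gaussian posterior's mean and covariance; because the posterior is gaussian, the mean is both its expected value and its MAP estimate of the hand state.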
where k = 1, 2, . . . , M, M is the number of time steps in the trial, and H_k ∈ R^{C×6} is a matrix that linearly relates the six-dimensional hand state to the firing rates. We assume that the noise in the observations is zero mean and normally distributed with covariance Q_k, that is, q_k ∼ N(0, Q_k), Q_k ∈ R^{C×C}. Note that the assumption of zero-mean gaussian noise in equation 2.9 is applicable only to centralized (zero-mean) firing rates. The raw data are neither zero mean nor truly gaussian. Consequently, before processing the firing-rate data, they were square-root transformed (Maynard et al., 1999; Moran & Schwartz, 1999b) to make them better modeled by a gaussian. This preprocessing step is not absolutely necessary and improves decoding performance only slightly; nevertheless, we will assume square-root-transformed firing rates in the remainder of the article. We then centralized the firing data as well as the hand kinematics so that both had zero mean; this was done by computing the mean firing and mean kinematics in the training set and subtracting this from the data in both the training and test sets. Previous work showed that the correlation in MI neurons is important for the encoding of movement parameters (Fetz, Toyama, & Smith, 1991; Hatsopoulos, Ojakangas, Paninski, & Donoghue, 1998). Consequently, we take Q_k to be a full covariance matrix to model correlated noise in the firing rates. Specifically, noise here results from errors in the predictions from the generative model. This formulation of an explicit generative model with an explicit noise term generalizes previous work. In our experiments, we found that the full system state and full covariance matrix produced the most accurate decoding results. The previous models above can be viewed as using a reduced system state and a diagonal covariance matrix. These simplified models resulted in degraded performance. This will be quantitatively verified later for both tasks.

Temporal Prior.
By modeling how the system state is expected to evolve over time, the temporal prior can reduce the effects of noise and smooth the estimates in a mathematically principled way. We take the simple approach of assuming that the state propagates in time according to a linear gaussian model, xk+1 = Ak xk + wk ,
(2.10)
where A_k ∈ R^{6×6} is the coefficient matrix, and the noise term w_k ∼ N(0, W_k), W_k ∈ R^{6×6}. As in the case of the measurement model, W_k is taken to be a full covariance matrix. Equation 2.10 states that the hand kinematics (position, velocity, and acceleration) at time k + 1 is linearly related to the state at time k. This is similar to the system model used in Brown et al. (1998) for estimating a rat's position from the firing rates of hippocampal place cells. In contrast, here we also model higher-order hand kinematics (velocity and
acceleration). While more complex, nonlinear, dynamical models could also be exploited, our focus here is on the likelihood term, which represents our model of the neural code.

2.2.3 Learning and Decoding with the Kalman Filter.

In practice, A_k, H_k, W_k, Q_k may vary with time t_k; however, we make the common simplifying assumption that they are constant (independent of k). Thus, we can estimate all the constant parameters A, H, W, Q from training data by maximizing the joint probability p(X_M, Z_M), where both the hand kinematics X_M = [x_1, x_2, . . . , x_M]^T and the firing rates Z_M are known for the M time instants in the training set. Given our first-order Markov assumptions, the joint probability distribution over states X_M and firing rates Z_M is p(X_M, Z_M) =
p(x_1) ∏_{k=2}^{M} p(x_k | x_{k-1}) ∏_{k=1}^{M} p(z_k | x_k).
This is simply the product of the prior hand state p(x_1) at the first time instant, the probability of each new state conditioned on the previous one, and the likelihood at each time instant. Given training data, we maximize this probability with respect to A, H, W, Q, as described in the appendix. Decoding then involves reconstructing (or inferring) the system state at each time instant given the prior estimate of the state and the new measurements of neural firing. Since the measurement and system equations, 2.9 and 2.10, are both linear and gaussian, decoding can be performed using the Kalman filter (Kalman, 1960). The details of the algorithm for recursive Bayesian inference are provided in the appendix. The appendix also outlines the relationships between the Kalman filter, traditional linear filtering methods, and the Wiener filter.

2.3 Estimating the Optimal Lag

The physical relationship between neural firing and arm movement implies the existence of a time lag between them (Moran & Schwartz, 1999b; Paninski et al., 2004). If an "optimal lag" can be found, it should improve the model's encoding ability and the accuracy of the decoding. Introducing a time lag in the model means that to estimate the state x_k at time t_k, we should consider measurements, z_{k−i}, at some previous (or future) instant in time t_{k−i} for some integer i.

2.3.1 Uniform Lag.

The simplest assumption is that all cells exhibit the same lag. Finding the optimal lag then involves fitting the Kalman model for a variety of integer values i and choosing the one that gives the best encoding or decoding performance. Since we can easily bound the range of possible lags, this search is straightforward to perform.
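Returning to the parameter estimation of section 2.2.3: maximizing the factored joint probability over A, H, W, Q reduces to standard closed-form least-squares solutions (the article defers the derivation to its appendix). A sketch, assuming mean-centered training arrays (our illustration, not the authors' code):

```python
import numpy as np

def fit_kalman(X, Z):
    """Least-squares estimates of A, W, H, Q from training data.

    X: (M, d) hand states; Z: (M, C) firing rates, both mean-centered.
    These are the standard closed-form maximizers for the factored
    linear gaussian model."""
    M = X.shape[0]
    X1, X2 = X[:-1].T, X[1:].T                      # consecutive state pairs
    A = X2 @ X1.T @ np.linalg.inv(X1 @ X1.T)        # system matrix
    W = (X2 - A @ X1) @ (X2 - A @ X1).T / (M - 1)   # system noise covariance
    Xt, Zt = X.T, Z.T
    H = Zt @ Xt.T @ np.linalg.inv(Xt @ Xt.T)        # observation matrix
    Q = (Zt - H @ Xt) @ (Zt - H @ Xt).T / M         # full noise covariance
    return A, W, H, Q
```

Note that Q comes out as a full C × C matrix, which is exactly the correlated-noise model argued for above.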
In our experiments, the data were binned into 70 or 50 ms time bins. We found that a uniform lag of two or three time bins, corresponding to 140 or 150 ms, produced the most accurate decoding results. This result is similar to that of Moran and Schwartz (1999b), where 145 ms was chosen as the average time lag between firing activity and hand velocity.

2.3.2 Nonuniform Lag.

We found, however, that the best decoding results were obtained by allowing each cell to have its own lag (within a predefined range of possible lags consistent with known lags for cells in MI). To deal with different lags for each cell, we extend our notation: let l_i ∈ {0, 1, . . . , L} be the lag of the ith cell, i = 1, 2, . . . , C (C is the total number of cells). Let {l_j} = {l_1, . . . , l_C} be some set of lag times for the C units. As suggested by the optimal 140 or 150 ms uniform time lag for both tasks, we took the maximum lag to be L = 4 in the pinball task (corresponding to 0 ms ≤ lag ≤ 280 ms) and L = 5 in the pursuit tracking task (corresponding to 0 ms ≤ lag ≤ 250 ms). The kth firing-rate vector is z_k = (z_{1,k}, z_{2,k}, . . . , z_{C,k}), in which each z_{i,k} is the firing rate of cell i at time step k − l_i. For each possible assignment, {l_j}, of lags to cells, one could train the Kalman filter model. The Kalman filtering algorithm generates the error covariance matrix P_k (for k large enough, it is approximately constant). Letting mse({l_j}) be the sum of the first two components on the main diagonal of P_k (equivalent to the mean-squared error of position), our goal is to find the optimal assignment of lags to cells {l_j}, that is,
{l_j}* = argmin_{l_i ∈ {0, . . . , L}, i = 1, 2, . . . , C} mse({l_j}).
A brute force search of all possible combinations of lags would require estimating the Kalman model for L^C possibilities. This is impractical, so we instead devise a fast, randomized, greedy search approach, which is able to obtain very stable approximations to the desired time lags. The algorithm is sketched in Table 1. The algorithm works by optimizing the lag time for cells individually; that is, it solves for the lag of one cell while holding the lags of all the other cells fixed. The algorithm then cycles through all the cells in a random order, updating their lags in this way. This process is repeated multiple times, and we found that it converges quickly in practice (after three or four passes through the cells). This algorithm requires that the Kalman filtering algorithm be applied to the training data only L · R · C times. In our experiments, we took the number of iterations, R, to be five. In the pinball experiment, for example, the number of cells, C, was 42, resulting in a computational cost that was much lower than the L^C operations required for exhaustive search.
Table 1: Greedy Algorithm for Estimating Individual Lag Times for Each Cell.

1. Randomly choose an initial lag l_i ∈ {0, 1, . . . , L} for each cell i = 1, 2, . . . , C.
2. For iteration = 1 to R:
      % Randomize the order in which cells are considered,
      % to minimize effects of dependencies between cells.
      For C iterations:
         Select a cell index c at random from {1, . . . , C} without replacement.
         Hold all other lags constant (i.e., l_m, m ≠ c) and update l_c by minimizing:
            l_c = argmin_{l_c ∈ {0, 1, . . . , L}} mse({l_1, . . . , l_c, . . . , l_C}).
3. Return the final set of values {l_i}.
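A compact version of Table 1, with the expensive train-and-evaluate step abstracted into an `mse` callable (an assumption for illustration; in the article this step would train the Kalman filter and read off the position entries of P_k):

```python
import random

def greedy_lags(C, L, R, mse, seed=0):
    """Randomized greedy search of Table 1.

    C cells, per-cell lags in {0, ..., L}, R passes. `mse` maps a
    tuple of per-cell lags to decoding error."""
    rng = random.Random(seed)
    lags = [rng.randint(0, L) for _ in range(C)]        # step 1
    for _ in range(R):                                  # step 2
        order = list(range(C))
        rng.shuffle(order)                              # random cell order
        for c in order:                                 # one cell at a time
            lags[c] = min(range(L + 1),
                          key=lambda l: mse(tuple(lags[:c] + [l] + lags[c + 1:])))
    return lags                                         # step 3
```

With a separable error surface the search recovers the per-cell optimum in a single pass; on real data several passes are needed because the cells interact, which is why the text reports convergence after three or four passes.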
3 Results

The Kalman filter is well known, and its implementation is quite standard. Our focus here is on the development of this method for neural prosthetic applications. This application to neural decoding requires a number of modeling choices, which are explored below in detail. Optimizing these modeling choices may provide insight into neural coding in MI. Below, we report results for the two different continuous hand-motion tasks described in section 2 on two different animals. The accuracy of the reconstructions is reported using two different criteria: the correlation coefficient (CC, describing the similarity) and the mean squared error (MSE, describing the Euclidean distance) between the reconstructed and true hand trajectories. Assume (x̂_t, ŷ_t) is the estimate for the true position (x_t, y_t), t = 1, . . . , T. Then MSE and CC are defined as follows: MSE =
(1/T) ∑_{t=1}^{T} ((x_t − x̂_t)² + (y_t − ŷ_t)²),
and

CC = ( ∑_t (x_t − x̄)(x̂_t − x̂̄) / √( ∑_t (x_t − x̄)² ∑_t (x̂_t − x̂̄)² ), ∑_t (y_t − ȳ)(ŷ_t − ŷ̄) / √( ∑_t (y_t − ȳ)² ∑_t (ŷ_t − ŷ̄)² ) ),
where x̄ and ȳ represent the means of the x and y positions, respectively, and x̂̄ and ŷ̄ the corresponding means of the estimates. Note that while we computed the full kinematic state vector, we characterized the error solely in terms of position, since position accuracy is the key criterion for many neural control tasks. We note also that the MSE is more meaningful for prosthetic applications, where the subject needs precise control of cursor position; we observed decoding results with relatively
high correlation coefficients that were sometimes far from the true 2D hand trajectory. We first explore the basic Kalman model and its decoding performance for the two different movement tasks. We then analyze its behavior with respect to a number of modeling choices. Finally, we compare it with other techniques (the population vector and linear filter).

3.1 Decoding

Pinball Task. To be practical, we must be able to train the model (i.e., estimate A, H, W, Q) using a small amount of data. Experimentally we found that approximately 2.5 minutes of training data sufficed for accurate reconstruction (this is similar to the result for fixed linear filters reported in Serruya et al., 2002). The exact amount needed will increase as the number of parameters in the model (neurons and kinematic variables) increases. We explore the effect of varying the amount of training data below. After learning the Kalman model as described in section 2 and the appendix, we evaluate it by reconstructing test trajectories off-line using approximately 1 minute of recorded neural data not present in the training set. At the beginning of a test trial, we made the assumption that hand kinematics were unknown, and we let the predicted initial condition be equal to the average of the kinematics in the training data. The Kalman filter was then applied to reconstruct the hand kinematics. Some examples of the reconstructed hand trajectory are shown in Figure 3A; Figure 4 shows the reconstruction of each component of the state variable (position, velocity, and acceleration in x and y) for 1/3 of the test data. Note that since the ground-truth velocity and acceleration curves were computed from the position data with simple finite differences, the ground truth for these variables is quite noisy.

Pursuit Tracking Task.
The data from the pursuit tracking task consisted of 182 short trials, from which we randomly selected 130 trials (approximately 17 minutes) as training data and took the remaining 52 trials (approximately 6.5 minutes) as test data. In contrast to the pinball task, each test trial here was of short duration, and consequently the choice of initial starting state for the hand kinematics had a large impact on the accuracy of the results. Whereas in the pinball task we let the starting kinematics be equal to the mean state in the training data, here we took it to be the true initial condition. Starting from this state, the Kalman filter algorithm was applied to reconstruct the system state as before. Figure 3B shows some examples of reconstructed hand trajectories. Numerical results are presented below for various lags.

3.2 Real-Time Performance

The encoding and decoding software was written in Matlab (Mathworks Inc., Natick, MA), and the experiments were
[Figure 3 appears here: panel A shows reconstructed pinball-task trajectories in x-y position (cm); panel B shows reconstructions for pursuit tracking trials 15, 89, 99, and 164.]
Figure 3: (A) Pinball task. Reconstructed trajectories (portions of 1 min of test data; each plot shows 50 time instants, 3.5 s): true target trajectory (dashed) and reconstruction using the Kalman filter (solid). (B) Pursuit tracking task. Reconstruction of hand position on a few test trials: true hand trajectory (dashed) and reconstruction using the Kalman filter (solid).
run on a computer with a Pentium III 866 MHz processor. For the pursuit tracking task, with 17 min of training data, the Kalman model took 1.70 s to learn, while the pinball task, with approximately 3.5 min of training data, took 0.14 s to train. Decoding was performed at a rate of 0.15 ms per 50 ms time bin for the pursuit tracking task and 1.54 ms per 70 ms time bin for the pinball task. This decoding cost is insignificant relative to these bin sizes. Note that decoding for the pinball data took longer because the kinematic
[Figure 4 appears here: reconstructed x and y position, velocity, and acceleration traces over 20 s of test data.]
Figure 4: Pinball task. Reconstruction of each component of the system state variable: true target motion (dashed) and reconstruction using the Kalman filter (solid). 20 s from a 1 min test sequence are shown. (From Wu et al., 2003.)
model had more parameters (six rather than four, as discussed below) and there were more neurons recorded for this task (42 versus 25).

3.3 Optimal Lag

Pinball Task. It is convenient to choose time lags corresponding to multiples of the bin size, although one could always re-bin the data and compute a finer discretization of the time lag. We used uniform time lags of 0, 70, 140, 210, and 280 ms for the training and testing of the Kalman filter (see Table 2). From Table 2 (similar to Table 1 in Wu et al., 2003), we see that the optimal uniform lag was approximately two time bins (or 140 ms). Note that at each time step, the firing rate was the number of spikes over the previous 70 ms bin; therefore, binning the spikes introduces a lag even at what we are calling 0 lag in the binned data. The same lag property also exists in the pursuit tracking task (with bin size 50 ms). In general, we observed that individual neurons do not all have the same optimal lag, though, in practice, the range of possible lags is bounded (0 ≤ i ≤ 4 or 0 ms ≤ lag ≤ 280 ms). To simplify our data analysis, we assume that the optimal lag for all cells is no more than four time bins (280 ms) and
Table 2: Pinball Task: Reconstruction Results for the Recursive Kalman Filter with Varied Time Lags.

Method                         CC (x, y)       MSE (cm²)
Kalman (0 ms lag)              (0.78, 0.91)    7.01
Kalman (70 ms lag)             (0.80, 0.93)    6.25
Kalman (140 ms lag)            (0.82, 0.93)    5.87
Kalman (210 ms lag)            (0.81, 0.89)    6.80
Kalman (280 ms lag)            (0.76, 0.82)    8.81
Kalman (nonuniform, init 1)    (0.84, 0.93)    4.75
Kalman (nonuniform, init 2)    (0.84, 0.93)    4.77
Notes: The upper portion shows uniform lags for all cells, and the bottom portion shows nonuniform lags. In the case of uniform time lags, we found that 140 ms provided the best description of the relationship between firing pattern and hand movement. Optimizing the individual lag for each cell, however, produced relatively more accurate decoding in terms of mean-squared error. The nonuniform time lag is obtained by a greedy search process with a randomized initial starting point. The results for two different initializations (init 1 and init 2) are shown.
exploit the greedy algorithm described in Table 1 to approximate the optimal lag for each neuron. The algorithm requires an initial guess of the lag for each cell. We experimented with multiple random initial conditions and found that the greedy search algorithm produced lags that were almost the same in all cases. The results for two experiments are shown in Figure 5A. These two suboptimal time lag solutions gave similar decoding results (see Table 2). These results suggest that a nonuniform time lag is superior to a uniform one for modeling the relationships between firing patterns and hand movement. In the interest of simplicity, however, for the remainder of the letter, except where noted, we choose a fixed uniform time lag of 140 ms with which we evaluate all other aspects of the Kalman model.

Pursuit Tracking Task. For the pursuit tracking task, the decoding accuracy for different uniform lag times is shown in Table 4 (column labeled "Pos, Vel, Accel"); the accuracy is reported in terms of the average MSE and CC over 52 test trials. We observed that 150 ms was approximately the optimal uniform lag, which is consistent with the 140 ms lag found for the pinball task. (Recall that data for the pursuit tracking task were binned into 50 ms time bins, in contrast to the 70 ms bins in the pinball task.) Here we found less variation in decoding results with respect to time lag. Consequently, the optimal nonuniform lag (see Figure 5B) showed no improvement over the best uniform lag. We hypothesize that this is due to the much slower hand motions present in this task; the hand is moving slowly enough that the difference in kinematic state between various lags
[Figure 5 appears here: per-cell lag steps for the pinball (A) and pursuit tracking (B) experiments.]
Figure 5: (A) Pinball experiment. Optimization of individual lag times within a bound of 0 ms ≤ lag ≤ 280 ms. The vertical axis shows the lag in terms of the number of time bins. Each unit on the horizontal axis corresponds to one of the 42 cells. The left plot shows two random initial lags: (1) dashed line with circles; (2) solid line with stars. The right plot shows final estimated lags from the two initial conditions. Note that for most cells, these two solutions are the same. This suggests that the greedy search approach, with random initial starting lags, provides reasonably stable results. (B) Pursuit tracking experiment. Optimization of individual lag times using the greedy algorithm. The range of possible lag steps is bounded 0 ≤ i ≤ 5, or 0 ms ≤ lag ≤ 250 ms.
is less important. For the remainder of the letter, except where noted, we choose a fixed uniform time lag of 150 ms for the pursuit tracking task.

3.4 Kinematic Model

We assumed above (by default) that position, velocity, and acceleration of the hand were related to neural firing rates. Other kinematic models with more or fewer terms could be employed. To understand the relationship between the order of the kinematic model and neural activity, we tested a variety of models, including position alone (zeroth order); position and velocity (first order); position, velocity, and acceleration (second order); and so on up to fifth order. Note that velocity and acceleration correspond to first and second derivatives of position, respectively. The next three higher-order derivatives are referred to as jerk, snap, and crackle.

Pinball Task. We trained and evaluated each of these models. To avoid overfitting, we learned each Kalman model using the training data set and then evaluated the decoding accuracy on the test data set. The decoding results in Table 3 show that adding higher-order terms increases the decoding accuracy, but with diminishing returns. As the number of terms in the model increases, so does the need for training data and the risk of overfitting. The optimal choice of kinematic model is an example of a model selection problem. The Bayesian information criterion (BIC) (Rissanen, 1989)
Table 3: Pinball Task: Decoding and BIC Results with Respect to the Order of the Kinematic Model.

Number of Orders    CC (x, y)       MSE (cm²)    α + β
0                   (0.72, 0.87)    7.72         −80834
1                   (0.82, 0.91)    6.31         −84049
2                   (0.82, 0.93)    5.87         −84411
3                   (0.82, 0.93)    5.72         −84458
4                   (0.82, 0.93)    5.61         −84307
5                   (0.82, 0.93)    5.60         −84081
Notes: Order 0 corresponds to a system state containing just hand position. Order 1 combines position with the first derivative of position (i.e., velocity). Order 2 uses position, velocity, and acceleration. Order 3 adds the third derivative (jerk), while orders 4 and 5 add the additional derivatives (snap and crackle, respectively). Experiments showed that the higher-order system models result in more accurate decoding but with diminishing returns. The boldface entries are optimal in terms of a BIC as well as for practical purposes.
provides one approach to deal with the general case. Assume the hand kinematics X_M = [x_1, x_2, . . . , x_M]^T and the firing rates Z_M = [z_1, z_2, . . . , z_M]^T are known for the M time instants in the training set. Let

α = −max_{A,W,H,Q} log p(X_M, Z_M)

and

β = (number of parameters / 2) · log M.
Then from information theory, α is the number of bits to describe the data, and β is the number of bits to describe the model. BIC searches for the minimal number of bits to describe the model and the data: the kinematic model that has the smallest α + β. Table 3 shows that (position, velocity, acceleration, jerk) was the optimal kinematic model. In practice, we have found a second-order model to be sufficient and to provide a good tradeoff between accuracy and model complexity. This model is used in the experiments below (except where noted).

Pursuit Tracking Task. Table 4 shows the decoding accuracy for different kinematic models as well as different lags. For this data set, we found that velocity is crucial, while the acceleration and higher-order terms result in overfitting of the model and a reduction of accuracy. One hypothesis for the difference between the pursuit tracking and pinball tasks with respect to the optimal kinematic model has to do with the particular type of motion present. In the pursuit tracking task, the range of accelerations was much smaller than in the pinball task. In particular, the magnitude of the acceleration was much smaller relative to the magnitude of the velocities (see Figure 1). We observed increased firing rates as velocity and acceleration increased. The slow motions in the pursuit tracking task and the resulting low firing rates suggest that the effects of acceleration cannot be distinguished
Table 4: Pursuit Tracking Task: Reconstruction Results for the Recursive Kalman Filter with Varying Time Lag and Different Kinematic Models.

            Position                    Position, Velocity          Pos, Vel, Accel
Lag         CC (x, y)     MSE (cm²)    CC (x, y)     MSE (cm²)     CC (x, y)     MSE (cm²)
0 ms        (0.43, 0.39)  7.64         (0.79, 0.68)  6.16          (0.78, 0.65)  6.22
50 ms       (0.46, 0.40)  7.56         (0.79, 0.68)  6.08          (0.78, 0.66)  6.17
100 ms      (0.49, 0.40)  7.46         (0.79, 0.68)  6.08          (0.79, 0.65)  6.17
150 ms      (0.51, 0.41)  7.38         (0.79, 0.68)  5.99          (0.79, 0.64)  6.15
200 ms      (0.53, 0.41)  7.34         (0.79, 0.67)  5.96          (0.78, 0.64)  6.18
250 ms      (0.54, 0.40)  7.33         (0.78, 0.67)  6.02          (0.78, 0.64)  6.16
Nonuniform  (0.53, 0.41)  7.14         (0.79, 0.68)  6.00          (0.79, 0.65)  6.19
Notes: A uniform time lag of approximately 150 ms was roughly optimal, as was a kinematic model containing just position and velocity (see the boldface entry). Here there was no improvement when using a nonuniform lag.
from noise. In all experiments below involving the pursuit tracking task, except where noted, we used a kinematic model containing only position and velocity.

3.5 Conditional Dependence of the Firing Rates

In section 2, we suggested that the firing rates of the neurons in MI are not independent conditioned on the hand kinematics and emphasized the importance of learning a full covariance matrix Q. Conditional independence would imply that the likelihood could be written in terms of the probability of the firing rates, z_{i,k}, of the individual neurons, i = 1, . . . , C: p(z_k | x_k) =
∏_{i=1}^{C} p(z_{i,k} | x_k) = ∏_{i=1}^{C} (1 / (√(2π) q_i)) exp(−(z_{i,k} − H_i x_k)² / (2 q_i²)),
where q_i² is the observation variance for cell i and H_i is the ith row of H. This model corresponds to having a diagonal covariance matrix Q with the q_i² along the diagonal. Learning the parameters of the Kalman model proceeds as before but with the restriction that Q is diagonal. Table 5 shows the decoding results obtained under this conditional independence assumption. We found that the full covariance matrix provides a better probabilistic model of the neural data and results in more accurate decoding. The importance of conditional dependence of the firing rates can be demonstrated using a t-test (Larsen & Marx, 2001). For each pair of cells, we obtained the p-value under the null hypothesis that the firing rates for this pair were conditionally independent. We found that over all the pairs, 51% (pinball) and 40% (pursuit tracking) of p-values were less than 0.05; therefore, the null hypotheses of the independence of the corresponding
Table 5: Decoding Accuracy for Full and Diagonal Covariance Matrix for Both Pinball and Pursuit Tracking Tasks.

                       Pinball Task                 Pursuit Tracking Task
Method                 CC (x, y)     MSE (cm²)     CC (x, y)     MSE (cm²)
Full covariance        (0.82, 0.93)  5.87          (0.79, 0.68)  5.99
Diagonal covariance    (0.82, 0.93)  6.91          (0.79, 0.67)  6.35
pairs should be rejected. This alternative analysis also suggests the importance of representing a full covariance matrix Q. Note that estimating the full covariance matrix may be problematic for large populations of neurons. As the size of Q increases, more training data are needed to avoid overfitting. To cope with this problem, one can use principal component analysis (PCA) to reduce the dimensionality of the observation data. In the reduced-dimensional PCA space, the directions are decorrelated, but since the firing rates are not gaussian, PCA does not make the directions conditionally independent. Consequently, it is still advantageous to fit a full covariance matrix; the advantage is that it can be fit to lower-dimensional data.

3.6 Number of Neurons

We explored the effect of varying the population size on decoding accuracy and, not surprisingly, found that larger populations result in more accurate decoding. For each N = 1, . . . , C (C = 42 in pinball; C = 25 in pursuit tracking), we randomly selected subsets of N neurons from our population. We then fit the Kalman model to the training data and reconstructed the test data. The random subsets were chosen 100 times, and the decoding results were averaged (see Figure 6A). We found that for both data sets, the MSE and CC improved with increasing population size. Current recording technology is now able to produce 100 or more simultaneous recordings. The results here suggest that these larger populations may lead to decreases in MSE, which should make neural prosthetic control more accurate.

3.7 Amount of Training Data

We also explored the effect of the quantity of training data on decoding accuracy: larger amounts of training data resulted in more accurate decoding. Using the same test data, we selected different amounts of partial training data to learn the Kalman model. The decoding results are shown in Figure 6B.
We found that in the pinball task, approximately 2.5 minutes of training data was sufficient; in the pursuit tracking task, we used approximately 10 minutes of training data, although more data improved the results slightly.
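The random-subset protocol described in section 3.6 can be sketched as follows. Here `fit` and `decode` are hypothetical placeholders standing in for the Kalman training and decoding routines, and the array layout (time steps as rows, cells as columns) is an assumption of this sketch.

```python
import numpy as np

def subset_mse_curve(Z_train, X_train, Z_test, X_test, fit, decode,
                     sizes, repeats=100, seed=0):
    """For each population size N, draw `repeats` random subsets of cells,
    train on the training data, decode the test data, and average the MSE."""
    rng = np.random.default_rng(seed)
    C = Z_train.shape[1]            # total number of recorded cells
    curve = []
    for N in sizes:
        errs = []
        for _ in range(repeats):
            cells = rng.choice(C, size=N, replace=False)   # random subset
            model = fit(Z_train[:, cells], X_train)
            X_hat = decode(Z_test[:, cells], model)
            errs.append(np.mean(np.sum((X_hat - X_test) ** 2, axis=1)))
        curve.append(np.mean(errs))
    return curve
```

Plotting `curve` against `sizes` reproduces the kind of accuracy-versus-population-size summary shown in Figure 6A.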
W. Wu, Y. Gao, E. Bienenstock, J. Donoghue, and M. Black
Figure 6: (A) Decoding accuracy with respect to number of neurons. The upper plot is the average of the MSE over 100 random subsets; the lower plot is the corresponding correlation coefficient for x (circle) and y (square). The left plot is for the pinball task and the right for the pursuit tracking task. Note that even with a single neuron, the value of the CC can be deceptively high, while the MSE is large, as expected. (B) Decoding accuracy with respect to the amount of training data in both pinball and pursuit tracking tasks. The layout of the four plots corresponds to that in A. The basic trend is that the larger the training data set, the higher the decoding accuracy, but the improvement appears to diminish as the amount of training data increases.
Modeling Summary. A Bayesian framework such as the Kalman filter requires us to define a likelihood model and a prior model. The prior model here is given by the linear gaussian system model, and we do not explore various generalizations of that model. Rather, the experiments above pertain to the likelihood and how it is formulated. While the standard Kalman framework constrains us to a linear gaussian model, there are still many modeling choices necessary to match this framework to motor cortical activity.

Based on the above experiments, a number of general observations can be made. We observed that a full covariance matrix is critical for decoding accuracy. Time lags of roughly 150 ms improved results, and while there was some benefit to nonuniform lag times, there was much to be gained from a simple uniform lag. The order of the kinematic model was dependent
Motor Cortical Decoding Using a Kalman Filter
on the application motion, but for general, unconstrained, and continuous motions, it appeared that higher-order models (up to fourth order) were beneficial. Not surprisingly, we also found that accuracy increases with the number of cells used. The appendix also contains an analysis of the effect of varying the bin size. We found that larger time bins improved decoding accuracy, but the optimal size depended on the task, with fast motions requiring smaller bins than slow motions. In summary, the best decoding performance for the two tasks was obtained as follows:

Pinball: CC = (0.84, 0.93), MSE = 4.55: 140 ms time bin, nonuniform lag, second-order model (pos, vel, accel)

Pursuit tracking: CC = (0.81, 0.70), MSE = 4.66: 300 ms time bin, 150 ms uniform lag or nonuniform lag, first-order model (pos, vel)

3.8 Comparison with Previous Decoding Methods. A variety of other methods have been proposed for performing motor cortical decoding. Below, we compare the Kalman filter to the most popular methods and then suggest further extensions of the Bayesian decoding framework.

3.8.1 Population Vector Method. The population vector approach (Georgopoulos et al., 1982, 1986) treats MI cells as having a "preferred" movement direction and models the hand movement direction as a weighted vector average of these preferred directions, where the weight is proportional to the firing rate of the cell. The population vector algorithm can be expressed in the following equation (Moran & Schwartz, 1999b):
$$ PV_k = \sum_{i=1}^{C} \frac{z_{i,k} - \bar{z}_i}{z_{\max_i} - \bar{z}_i} \, \hat{B}_i, \qquad (3.1) $$
where $PV_k$ is the population vector at time $t_k = k\Delta t$, $k = 1, \ldots, M$, $z_{i,k}$ is the firing rate of cell i at time $t_k$, $\bar{z}_i$ and $z_{\max_i}$ are the average and maximum firing rates of cell i over all time steps, and $\hat{B}_i$ is the 2D unit-length vector pointing in the preferred direction of cell i, $i = 1, \ldots, C$. This formulation models only movement direction and not hand position and hence represents only a subset of the kinematic information modeled by the Kalman filter. To recover hand position, one integrates the direction estimates over time and then normalizes the result to obtain an estimate of hand position (Schwartz, 1993). The population vector method has been used in the decoding of various hand movement tasks, including center-out reaching (Moran & Schwartz, 1999b), sinusoid tracing (Schwartz, 1993), spiral tracing (Moran & Schwartz, 1999a), and figure-eight tracing (Schwartz & Moran, 1999). Recently, this
Figure 7: Preferred directions of all cells in pinball (left) and pursuit tracking (right) tasks. The distributions of directions are not uniform over [0, 2π ].
approach has been applied to real-time neural control of 3D cursor (center-out) movement (Taylor et al., 2002).

In both the pinball and pursuit tracking tasks, we learned the parameters of the population vector model from training data; the preferred directions are shown in Figure 7. We then estimated the hand position in test data using the population vector method; the decoding results are shown in Table 6. We found that the population vector method does not accurately reconstruct the hand trajectory for these complex motions (particularly the more natural motions of the pinball task).

The lack of accuracy may be due to at least two reasons. First, the approach assumes that the population uniformly samples the range of movement directions, yet for the small sample sizes available with current multielectrode recording technology, this may be difficult to satisfy (Gribble & Scott, 2002). For the data sets we considered, the movement directions were not uniformly distributed (see Figure 7). One of the key advantages of the full covariance matrix in the Kalman filter is that it accounts for the nonuniform distribution of encoding properties across the population. If two cells encode the same information, the errors in the likelihood term are correlated, and the covariance matrix appropriately accounts for their conditional dependence; the linear filter method below does something similar. This weighting of the data is critical for sound inference. Second, the integration of direction estimates to compute position information results in the compounding of errors over time. Thus, the population vector method may be appropriate for simple stylized motions or motions of short duration, yet to model general, continuous, and complex motions, a more powerful model (richer kinematics and a probabilistic formulation) is preferable.
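Equation 3.1 and the subsequent integration step can be sketched as follows; the array layout and the simple cumulative-sum integration are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def population_vector_decode(Z, B, dt, x0=np.zeros(2)):
    """Z: (M, C) firing rates; B: (C, 2) unit preferred directions.
    Returns the integrated position trajectory, shape (M, 2)."""
    z_bar = Z.mean(axis=0)          # average rate of each cell
    z_max = Z.max(axis=0)           # maximum rate of each cell
    # Normalized rate of each cell at each time step (equation 3.1 weights);
    # assumes z_max > z_bar for every cell (i.e., no constant-rate cells).
    w = (Z - z_bar) / (z_max - z_bar)
    PV = w @ B                      # population vector at each step, (M, 2)
    # Hand position is recovered by integrating the direction estimates.
    return x0 + dt * np.cumsum(PV, axis=0)
```

Note how the position estimate accumulates every error in the direction estimates, which is the compounding-of-errors problem discussed above.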
Table 6: Decoding Results with Different Methods for the Pinball and the Pursuit Tracking Tasks.
Pinball Task
Method                                      CC (x, y)      MSE (cm²)
Population vector                           (0.26, 0.21)   75.0
Linear filter (N = 14)                      (0.79, 0.93)   6.48
Kalman (Δt = 140 ms, nonuniform lag)        (0.84, 0.93)   4.55

Pursuit Tracking Task
Method                                      CC (x, y)      MSE (cm²)
Population vector                           (0.57, 0.43)   13.2
Linear filter (N = 30)                      (0.73, 0.67)   4.74
Kalman (Δt = 300 ms, 150 ms uniform lag)    (0.81, 0.70)   4.66
Note: Boldface type indicates the best decoding results.
3.8.2 Linear Filter Method. Linear filters reconstruct hand position as a linear combination of the firing rates over some fixed time period (Carmena et al., 2003; Serruya et al., 2002; Warland, Reinagel, & Meister, 1997; Wessberg et al., 2000), that is,

$$ x_k = a + \sum_{i=1}^{C} \sum_{j=0}^{N} z_{i,k-j} \, f_{i,j}, \qquad (3.2) $$
where $x_k$ is the x-position (or, equivalently, the y-position) at time $t_k = k\Delta t$, $k = 1, \ldots, M$, $a$ is a constant offset, $z_{i,k-j}$ is the firing rate of neuron i at time $t_{k-j}$, and $f_{i,j}$ are the filter coefficients. The coefficients can be learned from training data using simple least-squares regression. The method has been used for off-line decoding of MI activity (Paninski et al., 2004) and for real-time neural control (Carmena et al., 2003; Serruya et al., 2002; Wessberg et al., 2000).

The parameter N specifies the number of time bins used to estimate the hand position. Empirically, N is chosen so that the total data used are between 500 ms and 2 s. In Serruya et al. (2003), the authors stated that N = 30 is the optimal choice because "shorter filters did not perform as well and that longer filters did not provide much additional information." A larger N results in a large computational cost in estimating the filters (by inverting a large matrix), which becomes prohibitive when the number of cells is large. It also requires more training data to avoid overfitting. Consequently, we took N = 30 to be the maximum number of time bins. To make the linear filter effective and compare it to the Kalman filter, we selected the N that gave the best decoding results, where we allowed N to range over {1, 2, ..., 30} time bins. For the pinball task, N = 14 (approximately 1 sec) gave the best decoding accuracy, while in the pursuit tracking
Figure 8: Pinball task. Reconstruction of position using the linear filter: true target trajectory (dashed) and reconstruction using the linear filter (solid). (From Wu et al., 2003.)
task, N = 30 (approximately 1.5 sec) was optimal (although there is little difference in the results for N ≥ 27). Approximately 1 sec (pinball) and 1.5 sec (pursuit tracking) of firing-rate information preceding the hand kinematics is also used in the method described in Serruya et al. (2002, 2003).

Note that since the linear filter uses data over a long time window, it does not benefit from the use of time-lagged data. Note also that it does not explicitly reconstruct velocity or acceleration. One could compute velocity with the linear filter (Carmena et al., 2003), but to estimate position, one would need a way to combine linear filter estimates for position and velocity. The Kalman filter automatically does this within an optimal Bayesian framework.

The linear filter reconstruction of position for the pinball data is shown in Figure 8. The results are visually similar to the Kalman results (compare Figure 4), yet Table 6 shows that the Kalman filter gives a more accurate reconstruction (higher CC and lower MSE). Figure 9 shows the linear filter reconstruction for the same four trials of the pursuit tracking task shown in Figure 3B.

The linear filter is a discrete Wiener filter (Gelb, 1974) in the sense that at each time instant, the hand position is a linear function of the firing rates of all the cells over the past N bins; the appendix explores this relationship formally. The linear filter approach lacks an explicit generative model of neural activity and hence provides little insight into neural coding (apart from suggesting a linear relationship between hand position and firing rate). Both the linear filter and the population vector lack an explicit prior model for how the system state evolves. This results in noisy estimates of the state and necessitates the use of large time windows for the linear filter. In practice, post hoc smoothing is often used to reduce the noise in the linear filter estimates, yet this introduces undesirable lags in the reconstruction.
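A minimal sketch of the linear filter of equation 3.2, fit by least-squares regression as described above. The design-matrix construction and function names are assumptions of this sketch.

```python
import numpy as np

def fit_linear_filter(Z, x, N):
    """Z: (M, C) firing rates; x: (M,) positions; N: history length in bins.
    Returns the coefficients f_{i,j} (flattened) followed by the offset a."""
    M, C = Z.shape
    rows = []
    for k in range(N, M):
        # Concatenate rates z_k, z_{k-1}, ..., z_{k-N} plus a constant term.
        rows.append(np.concatenate([Z[k - N:k + 1][::-1].ravel(), [1.0]]))
    D = np.asarray(rows)
    coef, *_ = np.linalg.lstsq(D, x[N:], rcond=None)
    return coef

def predict_linear_filter(Z, coef, N):
    """Apply a fitted filter; returns estimates for time steps N..M-1."""
    M = Z.shape[0]
    D = np.asarray([np.concatenate([Z[k - N:k + 1][::-1].ravel(), [1.0]])
                    for k in range(N, M)])
    return D @ coef
```

The design matrix D has C(N+1)+1 columns, which illustrates why large N makes the regression (and the matrix inversion inside it) expensive and data-hungry.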
The Kalman filter uses many fewer data at each time instant than the linear filter, but has an explicit temporal model that incorporates prior estimates in a recursive fashion. In particular, the probabilistic model (first-order
[Figure 9 panels: trials 15, 89, 99, and 164; axes are x-position (cm) versus y-position (cm).]
Figure 9: Pursuit tracking task. Reconstruction using the linear filter: true hand trajectory (dashed) and reconstruction using the linear filter (solid).
Markov) underlying the Kalman filter means that only a single bin of data is used at each time instant, and these bins should never overlap in time.

3.8.3 Artificial Neural Networks. Artificial neural networks (ANNs) have also been used for neural decoding (Ghazanfar, Stambaugh, & Nicolelis, 2000; Sanchez et al., 2003) and for real-time prediction of hand trajectories from neural ensembles in MI (Wessberg et al., 2000). ANNs can perform many types of statistical learning, yet they do not provide explicit generative or probabilistic models that are open to inspection. The models can, however, be analyzed to varying degrees to try to tease out what they encode (Sanchez et al., 2003). Wide variability in the specific implementation and training of these methods makes it impractical to quantitatively compare our results with previous ANN implementations.

4 Discussion

Our focus in this letter has been the development of a decoding algorithm appropriate for neural prosthetic applications. The Kalman filter is
computationally efficient (real time), requires little training data, and provides more accurate off-line decoding than previous methods such as population vectors or linear filtering. While the linear gaussian model of neural coding is an approximation, it provides a reasonable trade-off between computational efficiency and accuracy. Its successful use in decoding may be thought of as resulting from its achieving a good bias/variance trade-off (Geman, Bienenstock, & Doursat, 1992); equivalently, it provides a reasonable solution to the model selection problem, given the complexity of the task and the size of the training data sets at hand.

More powerful nonlinear and nongaussian likelihood models can be constructed (Brockwell et al., 2004; Gao et al., 2002; Gao, Black, Bienenstock, Wu, & Donoghue, 2003) and used for Bayesian decoding; the decoding task, however, becomes more difficult. For example, the likelihood can be formulated using generalized linear models (Gao et al., 2003), generalized additive models (Gao et al., 2003), mixture models (Wu et al., 2004a), and fully nonparametric models (Gao et al., 2002). In Gao et al. (2002) we introduced particle filtering to solve the general Bayesian decoding task in MI with nonlinear, nongaussian likelihoods (see also Brockwell et al., 2004; Shoham, 2001). The particle filtering method is more general than the Kalman filter proposed here. With sufficient computational resources, linear gaussian decoding using a particle filter approaches the accuracy of the Kalman filter. Additionally, more complex likelihood models can give higher accuracy than the linear gaussian model (Gao et al., 2003). Current implementations of particle filtering, however, do not run in real time and hence are not yet appropriate for neural prosthetic applications.
The method proposed here assumes that firing rates can be approximated by a linear model and that action potentials from multiple cells recorded by a single electrode can be cleanly separated (sorted). In practice, however, automated spike sorting methods may incorrectly classify multiple waveforms as being produced by the same cell. This can reduce decoding accuracy since it violates the model assumptions. An extension of the Kalman filter reformulates the likelihood as a probabilistic mixture of linear gaussian models (Wu et al., 2004a) in order to cope, to some extent, with errors of this type. An efficient algorithm known as the switching Kalman filter (Murphy, 1998) can be used for decoding with this mixture model. Wood, Fellows, Donoghue, and Black (2004) suggest that sorting may not be a significant problem for neural prosthesis applications and that good decoding accuracy can be obtained using a Kalman filter with unsorted, multiunit data.

Our formulation also assumes that motor cortex codes movement in terms of firing rates. In the Bayesian formulation, this assumption too can be relaxed, and a point process model, taking into account spike timing, can be used for decoding (Brown et al., 1998; Eden, Frank, Barbieri, Solo, & Brown, 2004). The resulting decoding algorithm is more complex, however, than the simple Kalman filter presented here.
Here we focused on the likelihood (measurement model) as derived from studies of neural coding. To evaluate different choices for the likelihood, we used a simple linear gaussian temporal prior (system model). More complex, nonlinear models or higher-order Markov assumptions might lead to more accurate models of arm motion and more accurate decoding results.

The linear filter and Kalman filter both produce reasonable decodings of neural activity. Depending on the specifics of the model, the Kalman filter can be viewed as a linear filter (Wiener filter) in which all measurements back to the initial time step are taken into account, yet the contribution of measurements decays exponentially with time. The Kalman filter, however, provides a number of benefits. It is easier to train (less computationally intensive) and, more important, formulates the problem as one of Bayesian inference. Since assumptions about the data and the noise are explicitly spelled out, the Kalman model provides more insight into the encoding model that is used. Moreover, by making the assumptions explicit, it suggests avenues for improving both modeling and decoding by relaxing some of these assumptions.

We have studied off-line decoding accuracy rather than online control. It is reasonable to expect that algorithms that provide better off-line decoding will provide better online accuracy as well. One might also posit that better control algorithms may make the training of paralyzed human subjects easier. Results of online control studies, however, suggest that even algorithms such as the population vector method, with its poor off-line accuracy, can be used to control devices (Taylor et al., 2002). It may be the case that the brain adapts to a particular control algorithm and that improving the algorithm produces little gain in accuracy. Such a hypothesis remains to be tested.

Research on decoding for neural control has focused on cells in motor cortex.
The rationale underlying this choice is that it will be easier for subjects to learn to control physical devices when this area of cortex is used, since it already represents information about motion. Even in paralyzed subjects, imagined arm motions produce activity in MI, suggesting its appropriateness for human neural prostheses (Kennedy & Bakay, 1998). Other brain areas, however, may provide useful control signals for motor prosthetic applications. The applicability of the Kalman filter outside MI remains to be tested.

The results reported here involved only two animals, and each animal was trained on, and performed, a slightly different task. More studies will be required to generalize these results to a broader range of tasks and conditions.

5 Conclusions

We have described a discrete linear Kalman filter that is appropriate for the neural control of 2D cursor motion. We showed that the model could be easily learned using a few minutes of training data and provided real-time
estimates of hand position given the firing rates of a population of cells in primary motor cortex. The estimated trajectories were more accurate than those produced by the population vector and linear filtering methods most commonly used in the literature.

Analysis of the method was performed using two data sets involving complex, continuous hand motions. Neural recordings from two different monkeys were obtained from chronically implanted microelectrode arrays. The experiments suggested that a linear gaussian model provided an approximate model relating the firing rates with continuous hand movement. Moreover, the model enabled the use of the Kalman filter to perform real-time recursive Bayesian decoding.

The Kalman filter proposed here provides a rigorous probabilistic approach with a well-understood theory. By making the assumptions explicit and providing an estimate of uncertainty, the Kalman filter offers significant advantages over previous methods. Unlike previous methods, this model estimates a full kinematic state vector (position, velocity, and higher-order terms). It also provides an estimate of the uncertainty in the result in terms of an error-covariance matrix.

Various experiments explored the use of this decoding model to choose the kinematic variables that gave the best encoding and decoding performance. In contrast to the relatively constrained pursuit tracking task, we showed that for the more natural 2D motions of the pinball task, incorporating acceleration and higher-order terms into the model improved the accuracy of the decoding. The experiments suggest a number of general conclusions, which remain to be verified in more subjects and in online experiments:
1. A time lag of 140 to 150 ms between firing activity and hand kinematics improves decoding results.

2. As the number of neurons in the population increases, one observes a corresponding increase in decoding accuracy. While even small populations produce reasonable results, the ability, in the not-too-distant future, to record from much larger populations is likely to increase accuracy even further.

3. A few minutes of training data are sufficient for learning the linear gaussian model. The number of data required may depend on the task.

4. With the gaussian model, a full covariance matrix usefully captures the conditional dependence between the firing rates of different neurons. Decoding accuracy is improved by using a full covariance matrix as compared to a diagonal matrix.

5. Firing rates in larger bin sizes, which are better approximated by a gaussian model, improve the decoding accuracy up to a task-specific limit.
Currently, we are working to evaluate the performance of the Kalman filter for the closed-loop neural control of cursor motion in tasks such as those described here. Our recent work demonstrates the feasibility of the Kalman filter in online neural control experiments (Wu, Shaikhouni, Donoghue, & Black, 2004b). More experiments with more animals, however, are needed to confirm those observations. Our future work will focus on the application of automated spike sorting methods that provide an estimate of uncertainty in the rate signal. This uncertainty can be naturally incorporated into the Kalman model by allowing an adaptive measurement covariance matrix $Q_k$. Additionally, one may explore alternative measurement noise models, nonlinear system models, and nonlinear decoding methods.

Appendix: Algorithmic and Experimental Details

A.1 Learning the Linear Gaussian Model. Training the Kalman model requires that we learn the matrices A, H, W, Q from example data. We seek the coefficient matrices and covariance matrices that maximize the joint probability $p(X_M, Z_M)$, that is,

$$ \operatorname*{argmax}_{A,W,H,Q} p(X_M, Z_M) = \operatorname*{argmax}_{A,W,H,Q} p(X_M)\, p(Z_M \mid X_M) = \left\{ \operatorname*{argmax}_{A,W} p(X_M),\ \operatorname*{argmax}_{H,Q} p(Z_M \mid X_M) \right\}. $$

Using the linear gaussian properties of $p(X_M)$ and $p(Z_M \mid X_M)$, the above maximization has closed-form solutions:

$$ A = \left( \sum_{k=2}^{M} x_k x_{k-1}^T \right) \left( \sum_{k=2}^{M} x_{k-1} x_{k-1}^T \right)^{-1}, $$

$$ W = \frac{1}{M-1} \left( \sum_{k=2}^{M} x_k x_k^T - A \sum_{k=2}^{M} x_{k-1} x_k^T \right), $$

$$ H = \left( \sum_{k=1}^{M} z_k x_k^T \right) \left( \sum_{k=1}^{M} x_k x_k^T \right)^{-1}, $$

$$ Q = \frac{1}{M} \left( \sum_{k=1}^{M} z_k z_k^T - H \sum_{k=1}^{M} x_k z_k^T \right). $$
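The closed-form solutions of A.1 can be written compactly in matrix form. This sketch assumes the states x_k and firing rates z_k are stacked as rows of arrays X and Z, which is a layout convention of the sketch, not the authors' code.

```python
import numpy as np

def learn_kalman_params(X, Z):
    """X: (M, n) kinematic states as rows; Z: (M, m) firing rates as rows.
    Returns the ML estimates of A, W, H, Q for the linear gaussian model."""
    M = X.shape[0]
    X1, X2 = X[:-1], X[1:]                        # x_{k-1} and x_k for k = 2..M
    # A = (sum x_k x_{k-1}^T)(sum x_{k-1} x_{k-1}^T)^{-1}
    A = (X2.T @ X1) @ np.linalg.inv(X1.T @ X1)
    # W = (sum x_k x_k^T - A sum x_{k-1} x_k^T) / (M - 1)
    W = (X2.T @ X2 - A @ (X1.T @ X2)) / (M - 1)
    # H = (sum z_k x_k^T)(sum x_k x_k^T)^{-1}
    H = (Z.T @ X) @ np.linalg.inv(X.T @ X)
    # Q = (sum z_k z_k^T - H sum x_k z_k^T) / M
    Q = (Z.T @ Z - H @ (X.T @ Z)) / M
    return A, W, H, Q
```

Because every quantity is a sum of outer products, training reduces to a few matrix multiplications and two small matrix inversions, which is why only minutes of training data are needed.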
A.2 Decoding (the Kalman Filter Algorithm). The assumptions of linear gaussian models for the system and measurement equations allow us to exploit the Kalman filter algorithm for recursive Bayesian inference. The theory is well developed (Kalman, 1960) and the algorithm is summarized here. The algorithm comprises two steps:
Step 1: Time Update Equations. At each time $t_k$, we use the system model to take the a posteriori state estimate, $\hat{x}_{k-1}$, from the previous time $t_{k-1}$, and predict it forward to time $t_k$:

$$ \hat{x}_k^- = A \hat{x}_{k-1}. \qquad (A.1) $$

Recall that the uncertainty in the system model is expressed in the covariance matrix W, and this uncertainty must be incorporated into our a priori estimate of the posterior covariance:

$$ P_k^- = A P_{k-1} A^T + W, \qquad (A.2) $$

where $P_k^-$ is the a priori error covariance matrix at time $t_k$.

Step 2: Measurement Update Equations. Using the prediction $\hat{x}_k^-$ and firing rate $z_k$, we update the state estimate using the measurement and compute the posterior error covariance matrix:

$$ \hat{x}_k = \hat{x}_k^- + K_k (z_k - H \hat{x}_k^-), \qquad (A.3) $$

$$ P_k = (I - K_k H) P_k^-, \qquad (A.4) $$

where $P_k$ represents the state error covariance after taking into account the neural data and $K_k$ is the Kalman gain matrix, given by

$$ K_k = P_k^- H^T (H P_k^- H^T + Q)^{-1}. \qquad (A.5) $$
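One recursive step of the filter (equations A.1-A.5) can be sketched as follows; the function signature is an assumption of this sketch.

```python
import numpy as np

def kalman_step(x_prev, P_prev, z, A, W, H, Q):
    """One recursive update: state estimate x_prev and error covariance
    P_prev from time k-1, firing-rate observation z at time k."""
    # Time update (A.1, A.2): predict the state and its error covariance.
    x_pred = A @ x_prev
    P_pred = A @ P_prev @ A.T + W
    # Measurement update (A.3-A.5): gain, corrected state, posterior covariance.
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + Q)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x_prev)) - K @ H) @ P_pred
    return x_new, P_new
```

Decoding a trial is then a simple loop over the binned firing rates, carrying (x, P) forward from bin to bin, which is what makes the filter run in real time.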
This $K_k$ produces a state estimate that minimizes the mean-squared error of the reconstruction (see Gelb, 1974, for details).

A.3 Comparison of the Kalman Filter and the Wiener Filter. Simple linear filtering methods that directly reconstruct the system state from a history of firing rates are commonly used for decoding. These methods compute the state as a linear combination of previous firing rates. In particular, the Wiener filter is an optimal linear filter that uses all previous firing data. The simplified Kalman filter developed here can be viewed as an efficient, recursive version of the Wiener filter in which the modeling assumptions (A, H, W, Q) are made explicit. We can obtain several basic properties of the Kalman filter by studying its relationship with the Wiener filter. From equations A.1 and A.3, we have that

$$ \hat{x}_k = \hat{x}_k^- + K_k (z_k - H \hat{x}_k^-) = (I - K_k H) \hat{x}_k^- + K_k z_k = (I - K_k H) A \hat{x}_{k-1} + K_k z_k. $$
Note that equations A.2, A.4, and A.5 are independent of the new measurements of firing rates. If A, H, W, Q are constant over time, the propagation of $P_k^-$, $P_k$, $K_k$ by equations A.2, A.4, and A.5 converges to steady-state solutions (Kalman & Bucy, 1961). Let $K = \lim_{k \to \infty} K_k$ and $M = (I - KH)A$. For k large enough (i.e., k > 20 in practice),

$$ \hat{x}_k \approx M \hat{x}_{k-1} + K z_k \approx M^2 \hat{x}_{k-2} + M K z_{k-1} + K z_k \approx \cdots \approx M^{k-1} \hat{x}_1 + \sum_{j=0}^{k-2} M^j K z_{k-j}, $$
where $M^j$ is the jth power of the matrix M (e.g., $M^2 = M \cdot M$). The above equation shows that in the Kalman framework, the estimate at time step k is a linear function of the firing rates at all time instants from time $t_2$ to the present. This corresponds to the Wiener filter (Gelb, 1974), but the advantage of the Kalman implementation is that it computes the state estimate recursively, resulting in an efficient algorithm. Note also that the coefficients of all firing rates decay exponentially with respect to the current time step. This shows three basic properties of the Kalman filter: (1) it estimates the state at each time step using all the previous and present measurements (firing rates); (2) the weights of the firing rates decay exponentially (those far from the present time have a weak effect on the state estimate); and (3) for k >> 1, the state estimate is approximately independent of the initial state. This last point means that the choice of the initial state is relatively unimportant.

A.4 Effect of Bin Size on Decoding Accuracy. We studied the effect of varying the bin size on decoding accuracy and found that accuracy was improved by increasing the bin size beyond the 70 ms and 50 ms bins used in the analysis above. Table 7 and Figure 10A summarize the results. For the pinball task, we varied the bin size, Δt, in multiples of 70 ms up to 700 ms. For the pursuit tracking task, we considered bins ranging up to 700 ms in 50 ms increments. In all cases, we used nonoverlapping time bins, as the use of overlapping bins results in a severe violation of the assumption of conditional independence underlying the Kalman framework (see equation 2.7). With overlapping bins, data are "reused," and the resulting estimates are no longer statistically valid. For each test condition (bin size), the Kalman model was trained, and hand kinematics were calculated every Δt ms. Table 7 shows that larger bins resulted in better decoding accuracy, up to a limit.
One reason for this
Table 7: Decoding Results for Varying Bin Sizes.

Pinball Task
Δt (ms)    CC (x, y)      MSE (cm²)
70         (0.82, 0.93)   5.87
140        (0.83, 0.93)   5.29
210        (0.81, 0.92)   5.45
280        (0.82, 0.93)   5.16
350        (0.78, 0.92)   5.95
420        (0.78, 0.92)   5.81
490        (0.75, 0.89)   7.25

Pursuit Tracking Task
Δt (ms)    CC (x, y)      MSE (cm²)
50         (0.79, 0.68)   5.99
100        (0.80, 0.68)   5.63
150        (0.81, 0.68)   5.27
200        (0.81, 0.69)   5.03
250        (0.81, 0.69)   4.89
300        (0.81, 0.70)   4.66
400        (0.79, 0.70)   4.59
500        (0.79, 0.71)   4.64
Notes: The other model parameters are as described above: pinball task: uniform lag = 140 ms, kinematic model = (pos, vel, accel); pursuit tracking task: uniform lag = 150 ms, kinematic model = (pos, vel). Boldface type indicates the best decoding results.
may be that as the bin size grows, the variation in the firing rate is better approximated by the gaussian model used in the Kalman filter. In the case of the slow motions of the pursuit tracking task, larger bin sizes were appropriate. For the fast motions of the pinball task, bin sizes beyond approximately 280 ms resulted in a loss of accuracy. This suggests
Figure 10: (A) Decoding accuracy (MSE) as a function of bin size for the pinball task (left) and the pursuit tracking task (right). (B) Example reconstruction with varying bin size. (left) Pinball task: 70 ms (solid), 280 ms (tightly dashed), and 490 ms (loosely dashed). (right) Pursuit tracking task: 50 ms (solid), 300 ms (tightly dashed) and 500 ms (loosely dashed).
that while larger bin sizes can increase accuracy, the ultimate size is limited and is related to the speed of motion. Additionally, increased bin size had a negative effect on the detail of the recovered trajectories (see Figure 10B). As bin size increases, the frequency of state estimates decreases, resulting in a coarser approximation to the underlying trajectory.

In general, larger bin sizes (up to some limit) produce more accurate results but at the cost of introducing a delay in estimating the system state. The constraints of a particular application will determine the appropriate bin size. Note that if a uniform lag of, for example, 140 ms is used, we can exploit measurement data binned into 140 ms time bins without introducing any delay (or output lag) in the estimate of the system state relative to the natural hand motion. For real-time prosthesis applications, this system delay should be less than 200 ms, which suggests that the overall bin size minus the uniform lag time should be less than 200 ms. For the pinball task, with a 140 ms lag, this would mean a maximum bin size of approximately 280 ms. For the pursuit tracking task, with a 150 ms lag, a maximum bin size of 250 to 300 ms would be appropriate. While this increases accuracy, it comes at the cost of a "jerkier" reconstruction.

Acknowledgments

This work was supported in part by the DARPA BioInfoMicro Program, the NIH NINDS Neural Prosthesis Program and Grant #NS25074, and the NSF ITR Program award #0113679. We thank D. Mumford, E. Brown, M. Serruya, A. Shaikhouni, J. Dushanova, C. Vargas-Irwin, L. Lennox, D. Morris, D. Grollman, and M. Fellows for their assistance. J.P.D. is a cofounder and shareholder in Cyberkinetics, Inc., a neurotechnology company that is developing neural prosthetic devices.

References

Brockwell, A. E., Rojas, A. L., & Kass, R. E. (2004). Recursive Bayesian decoding of motor cortical signals by particle filtering. Journal of Neurophysiology, 91, 1899–1907.
W. Wu, Y. Gao, E. Bienenstock, J. Donoghue, and M. Black
Received July 8, 2004; accepted May 3, 2005.
LETTER
Communicated by Garrison Cottrell
Facial Attractiveness: Beauty and the Machine

Yael Eisenthal
[email protected] School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel
Gideon Dror
[email protected] Department of Computer Science, Academic College of Tel-Aviv-Yaffo, Tel-Aviv 64044, Israel
Eytan Ruppin
[email protected] School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel
This work presents a novel study of the notion of facial attractiveness in a machine learning context. To this end, we collected human beauty ratings for data sets of facial images and used various techniques for learning the attractiveness of a face. The trained predictor achieves a significant correlation of 0.65 with the average human ratings. The results clearly show that facial beauty is a universal concept that a machine can learn. Analysis of the accuracy of the beauty predictor as a function of the size of the training data indicates that a machine producing human-like attractiveness ratings could be obtained given a moderately larger data set.

1 Introduction

In this work, we explore the notion of facial attractiveness through the application of machine learning techniques. We construct a machine that learns from facial images and their respective attractiveness ratings to produce human-like evaluations of facial attractiveness. Our work is based on the underlying theory that there are objective regularities in facial attractiveness to be analyzed and learned. We first briefly describe the psychophysics of facial attractiveness and its evolutionary origins. We then review previous work on the computational analysis of beauty, attesting to the novelty of our work.

Neural Computation 18, 119–142 (2006)
© 2005 Massachusetts Institute of Technology

1.1 The Psychophysics of Beauty

1.1.1 Beauty and the Beholder. The subject of visual processing of human faces has received attention from philosophers and scientists, such as
Aristotle and Darwin, for centuries. Within this framework, the study of human facial attractiveness has had a significant part: "Beauty is a universal part of human experience, it provokes pleasure, rivets attention, and impels actions that help ensure survival of our genes" (Etcoff, 1999). Various experiments have empirically shown the influence of physical attractiveness on our lives, as both individuals and as part of a society; its impact is evident in the amounts of money spent on plastic surgery and cosmetics each year. Yet the face of beauty, something we can recognize in an instant, is still difficult to formulate. This outstanding question regarding the constituents of beauty has led to a large body of ongoing research by scientists in the biological, cognitive, and computational sciences.

For centuries, the common notion in this research has been that beauty is in the eye of the beholder—that individual attraction is not predictable beyond our knowledge of a person's particular culture, historical era, or personal history. However, more recent work suggests that the constituents of beauty are neither arbitrary nor culture bound. Several rating studies by Perrett et al. and other researchers have demonstrated high cross-cultural agreement in attractiveness ratings of faces of different ethnicities (Cunningham, Roberts, Wu, Barbee, & Druen, 1995; Jones, 1996; Perrett, May, & Yoshikawa, 1994; Perrett et al., 1998). This high congruence over ethnicity, social class, age, and sex has led to the belief that perception of facial attractiveness is data driven—that the properties of a particular set of facial features are the same irrespective of the perceiver. If different people can agree on which faces are attractive and which are not when judging faces of varying ethnic background, then this suggests that people everywhere are using similar criteria in their judgments.
This belief is strengthened by the consistent relations, demonstrated in experimental studies, between attractiveness and various facial features, with both male and female raters. Cunningham (1986) and Cunningham, Barbee, & Pike (1990) showed a strong correlation between beauty and specific features, which were categorized as neonate (features such as a small nose and high forehead), mature (e.g., prominent cheekbones), and expressive (e.g., arched eyebrows). They concluded that beauty is not an inexplicable quality that lies only in the eye of the beholder.

A second line of evidence in favor of a biological rather than an arbitrary cultural basis of physical attractiveness judgments comes from studies of infant preferences for face types. Langlois et al. (1987) showed pairs of female faces (previously rated for attractiveness by adults) to infants only a few months old. The infants preferred to look at the more attractive face of the pair, indicating that adult-like preferences are present even at 2 months of age. Slater et al. (1998) demonstrated the same preference in newborns. The babies looked longer at the attractive faces, regardless of the gender, race, or age of the face.

The owner-vs.-observer hypothesis was further studied in various experiments. Zaidel explored the question of whether beauty is in the perceptual
space of the observer or a stable characteristic of the face (Chen, German, & Zaidel, 1997). Results showed that facial attractiveness is more dependent on the physiognomy of the face than on a perceptual process in the observer, for both male and female observers.

1.1.2 Evolutionary Origins. Since Darwin, biologists have studied natural beauty's meaning in terms of the evolved signal content of striking phenotypic features. Evolutionary scientists claim that the perception of facial features may be governed by circuits shaped by natural selection in the human brain. Aesthetic judgments of faces are not capricious but instead reflect evolutionary functional assessments and valuations of potential mates (Thornhill & Gangestad, 1993). These "Darwinian" approaches are based on the premise that attractive faces are a biological "ornament" that signals valuable information; attractive faces advertise a "health certificate," indicating a person's "value" as a mate (Thornhill & Gangestad, 1999). Advantageous biological characteristics are probably revealed in certain facial traits, which are unconsciously interpreted as attractive in the observer's brain. Facial attributes like good skin quality, bone structure, and symmetry, for example, are associated with good health and therefore contribute to attractiveness. Thus, human beauty standards reflect our evolutionarily distant and recent past and emphasize the role of health assessment in mate choice or, as phrased by anthropologist Donald Symons (1995), "Beauty may be in the adaptations of the beholder."

Research has concentrated on a number of characteristics of faces that may honestly advertise health and viability.
Langlois and others have demonstrated a preference for average faces: composite faces, a result of digital blending and averaging of faces, were shown to be more attractive than most of the faces used to create them (Grammer & Thornhill, 1994; Langlois & Roggman, 1990; Langlois, Roggman, & Musselman, 1994; O'Toole, Price, Vetter, Bartlett, & Blanz, 1999). Evolutionary biology holds that in any given population, extreme characteristics tend to fall away in favor of average ones; therefore, the ability to form an average-mate template would have conveyed a singular survival advantage (Symons, 1979; Thornhill & Gangestad, 1993).

The averageness hypothesis, however, has been widely debated. Average composite faces tend to have smooth skin and be symmetric; these factors, rather than averageness per se, may lead to the high attractiveness attributed to average faces (Alley & Cunningham, 1991). Both skin texture (Fink, Grammer, & Thornhill, 2001) and facial bilateral symmetry (Grammer & Thornhill, 1994; Mealey, Bridgstock, & Townsend, 1999; Perrett et al., 1999) have been shown to have a positive effect on facial attractiveness ratings. The averageness hypothesis has also received only mixed empirical support. Later studies found that although averageness is certainly attractive, it can be improved on. Composites of beautiful people were rated more appealing than those made from the larger, random population (Perrett et al.,
1994). Also, exaggeration of the ways in which the prettiest female composite differed from the average female composite resulted in a more attractive face (O'Toole, Deffenbacher, et al., 1998; Perrett et al., 1994, 1998); these turned out to be sexually dimorphic traits, such as a small chin, full lips, high cheekbones, a narrow nose, and a generally small face. These sex-typical, estrogen-dependent characteristics in females may indicate youth and fertility and are thus considered attractive (Perrett et al., 1998; Symons, 1979, 1995; Thornhill & Gangestad, 1999).

1.2 Computational Beauty Analysis

The previous section clearly indicates the existence of an objective basis underlying the notion of facial attractiveness. Yet the relative contributions of the aforementioned characteristics to facial attractiveness, and their interactions with other facial beauty determinants, are still unknown. Different studies have examined the relationship between subjective judgments of faces and their objective regularity. Morphing software has been used to create average and symmetrized faces (Langlois & Roggman, 1990; Perrett et al., 1994, 1999), as well as attractive and unattractive prototypes (http://www.beautycheck.de), in order to analyze their characteristics. Other approaches have addressed the question within the study of the relation between aesthetics and complexity, which is based on the notion that simplicity lies at the heart of all scientific theories (Occam's razor). Schmidhuber (1998) created an attractive female face from a fractal geometry based on rotated squares and powers of two. Exploring the question from a different direction, Johnston (Johnston & Franklin, 1993) produced an attractive female face using a genetic algorithm, which evolves a "most beautiful" face according to interactive user selections.
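Such an interactive evolution can be caricatured by a minimal genetic algorithm in which a rating function stands in for the human observer. The sketch below is illustrative only: the function names, parameters, and the toy bit-string "genome" are hypothetical, not taken from Johnston's system.

```python
import random

def evolve(rate, genome_len=8, pop_size=20, generations=30, seed=0):
    """Toy interactive-style genetic algorithm: `rate` plays the role of
    the observer scoring each candidate. The top half of the population
    survives each generation; offspring are produced by one-point
    crossover plus occasional mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=rate, reverse=True)
        parents = pop[: pop_size // 2]     # elitist selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len)
            child = a[:cut] + b[cut:]      # one-point crossover
            if rng.random() < 0.1:         # occasional mutation
                child[rng.randrange(genome_len)] ^= 1
            children.append(child)
        pop = parents + children
    return max(pop, key=rate)

# Stand-in "rater" that simply prefers candidates with more 1s.
best = evolve(rate=sum)
```

In Johnston's setting the score comes from interactive user selections rather than a fixed function, and the genome encodes facial features rather than bits.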
This algorithm mimics, in an oversimplified manner, the way humans (consciously or unconsciously) select the features they find most attractive. Measuring the features of the resulting face showed it to have "feminized" features. This study and others, which have shown attractiveness and femininity to be nearly equivalent for female faces (O'Toole et al., 1998), have been the basis for a commercial project, which uses these sex-dependent features to determine the sex of an image and predict its attractiveness (http://www.intelligent-earth.com).

1.3 This Work

Previous computational studies of human facial attractiveness have mainly involved averaging and morphing of digital images and geometric modeling to construct attractive faces. In general, the computer techniques used include delineation, transformation, prototyping, and other image processing techniques, most requiring fiducial points on the face. In this work, rather than attempt to morph or construct an attractive face, we explore the notion of facial attractiveness through the application of machine learning techniques. Using only the images themselves,
we try to learn and analyze the mapping from two-dimensional facial images to their attractiveness scores, as determined by human raters. The cross-cultural consistency in attractiveness ratings demonstrated in many previous studies has led to the common notion that there is an objective basis to be analyzed and learned.

The remainder of this letter is organized as follows. Section 2 presents the data used in our analyses (both images and ratings), and section 3 describes the representations we chose to work with. Section 4 describes our experiments with learning facial attractiveness, presenting prediction results and analyses. Finally, section 5 consists of a discussion of the work presented and general conclusions. Additional details are provided in the appendix.

2 The Data

2.1 Image Data Sets

To reduce the effects of age, gender, skin color, facial expression, and other irrelevant factors, subject choice was confined to young Caucasian females in frontal view with neutral expression, without accessories or obscuring items (e.g., jewelry). Furthermore, to get a good representation of the notion of beauty, the data set was required to encompass both extremes of facial beauty: very attractive as well as very unattractive faces. We obtained two data sets that met the above criteria, both of relatively small size, with 92 images each.

Data set 1 contains 92 young Caucasian (American) females in frontal view with neutral expressions, face and hair comprising the entirety of the picture. The images all have identical lighting conditions and nearly identical orientation, in excellent resolution, with no obscuring or distracting features, such as jewelry or glasses. The pictures were originally taken by Japanese photographer Akira Gomi. The images were received with attractiveness ratings.

Data set 2 contains 92 Caucasian (Israeli) females, aged approximately 18, in frontal view, face and hair comprising the entirety of the picture.
Most of the images have neutral expressions, but in order to keep the data set reasonably large, smiling images in which the mouth was relatively closed were also used. The images all have identical lighting conditions and nearly identical orientation. This data set required some image preprocessing and is of slightly lower quality; the images contain some distracting features, such as jewelry.

The distributions of the raw images in the two data sets were found to be too different for combining the sets, and therefore all our experiments were conducted on each data set separately. Data set 1, which contains high-quality pictures of females in the preferred age range, with no distracting or obscuring items, was the main data set used. Data set 2, which is of slightly lower quality, containing images of younger women with some distracting features (jewelry, smiles), was used for exploring cross-cultural
consistency in attractiveness judgment and in its main determinants. Both data sets were converted to grayscale to lower the dimension of the data and simplify the computational task.

2.2 Image Ratings

2.2.1 Rating Collection. Data set 1 was received with ratings, but to check the consistency of ratings across cultures, we collected new ratings for both data sets. To facilitate both the rating procedure and the collection of the ratings, we created an interactive HTML-based application that all our raters used. This provided a simple rating procedure in which all participants received the same instructions and used the same rating process. The raters were asked to first scan through the entire data set (in grayscale) to obtain a general notion of the relative attractiveness of the images, and only then to proceed to the actual rating stage. They were instructed to use the entire attractiveness scale and to consider only facial attractiveness in their evaluation. In the rating stage, the images were shown in random order to eliminate order effects, each on a separate page. A rater could look at a picture for as long as he or she liked and then score it. The raters were free to return to pictures they had already seen and adjust their ratings.

Images in data set 1 were rated by 28 observers (15 male, 13 female), most in their twenties. For data set 2, 18 ratings were collected from 10 male and 8 female raters of similar age. Each facial image was rated on a discrete integer scale from 1 (very unattractive) to 7 (very attractive). The final attractiveness rating of a facial image was the mean of its ratings across all raters.

2.2.2 Rating Analysis. In order to verify the adequacy and consistency of the collected ratings, we examined the following properties:

• Consistency of ratings. The raters were randomly divided into two groups. We calculated the mean ratings of each group and checked consistency between the two mean ratings.
This procedure was repeated numerous times and consistently showed a correlation of 0.9 to 0.95 between the average ratings of the two groups for data set 1 and a correlation of 0.88 to 0.92 for data set 2. The mean ratings of the groups were also very similar for both data sets, and a t-test confirmed that the rating means for the two groups were not statistically different.

• Clustering of raters. The theory underlying the project is that individuals rate facial attractiveness according to similar, universal standards. Therefore, our assumption was that all ratings are from the same distribution. Indeed, clustering of raters produced no apparent grouping. Specifically, a chi-square test that compared the distribution of ratings of male versus female raters showed no statistically significant differences between these two groups. In addition, the correlation between the average female ratings and average male ratings was very high: 0.92 for data set 1 and 0.88 for data
set 2. The means of the female and male ratings were also very similar, and a t-test confirmed that the means of the two groups were not statistically different. The results show no effect of observer gender.

An analysis of the original ratings for data set 1 (collected from Austrian raters) versus the new ratings (collected from Israeli raters) shows a high similarity in the images rated as most and least attractive. A correlation of 0.82 was found between the two sets of ratings. These findings strongly reinforce previous reports of high cross-cultural agreement in attractiveness rating.

3 Face Representation

Numerous studies of various face image processing tasks (e.g., face recognition and detection) have experimented with different ways to specify the physical information in human faces. The approaches tried have demonstrated the importance of a broad range of shape and image-intensity facial cues (Bruce & Langton, 1994; Burton, Bruce, & Dench, 1993; Valentine & Bruce, 1986). The most frequently encountered distinction regarding the information in faces is a qualitative one between feature-based and configurational information, that is, discrete, local, featural information versus the spatial interrelationships of facial features. Studies suggest that humans perceive faces holistically and not as individual facial features (Baenninger, 1994; Haig, 1984; Young, Hellawell, & Hay, 1989), yet experiments with both representations have demonstrated the importance of features in discriminative tasks (Bruce & Young, 1986; Moghaddam & Pentland, 1994). This is a particularly reasonable assumption for beauty judgment tasks, given the correlation found between features and attractiveness ratings. Our work uses both kinds of representations.
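As an illustrative sketch of the two kinds of representation (with assumed array shapes and hypothetical landmark names, not the authors' code), a pixel image and a small feature vector might be constructed as follows:

```python
import numpy as np

def pixel_image(gray_image):
    """Configurational representation: concatenate the rows of a
    grayscale image into one long vector of pixel values."""
    return np.asarray(gray_image, dtype=float).ravel()

def feature_vector(landmarks):
    """Featural representation (a tiny illustrative subset of the 37
    measurements). `landmarks` maps hypothetical feature-point names to
    (x, y) pixel coordinates; each raw distance is normalized by the
    inter-pupil distance, which serves as the length scale."""
    p = {name: np.asarray(xy, dtype=float) for name, xy in landmarks.items()}
    pupil_dist = np.linalg.norm(p["right_pupil"] - p["left_pupil"])
    return np.array([
        np.linalg.norm(p["mouth_right"] - p["mouth_left"]) / pupil_dist,  # mouth width
        np.linalg.norm(p["nose_tip"] - p["chin"]) / pupil_dist,           # lower-face length
    ])
```

In the paper, several nongeometric entries (average hair color, a symmetry indicator, a skin-smoothness estimate) are appended to the geometric measurements to complete the feature vector.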
In the configurational representation, a face is represented with the raw grayscale pixel values, in which all relevant factors, such as texture, shading, pigmentation, and shape, are implicitly coded (though difficult to extract). A face is represented by a vector of pixel values created by concatenating the rows or columns of its image. The pixel-based representation of a face will be referred to as its pixel image.

The featural representation is motivated by arguments tying beauty to ideal proportions of facial features such as distance between eyes, width of lips, size of eyes, and distance between the lower lip and the chin. This representation is based on the manual measurement of 37 facial feature distances and ratios that reflect the geometry of the face (e.g., distance between eyes, mouth length and width). The facial feature points according to which these distances were defined are displayed in Figure 1. (The full list of feature measurements is given in the appendix, along with their calculation method.)

Figure 1: Feature landmarks used for feature-based representation.

All raw distance measurements, which are in units of pixels, were normalized by the distance between pupils, which serves
as a robust and accurate length scale. To these purely geometric features we added several nongeometric ones: average hair color, an indicator of facial symmetry, and an estimate of skin smoothness. The feature-based measurement representation of a face will be referred to as its feature vector.

4 Learning Attractiveness

We now present our experiments with learning facial attractiveness, using the facial images and their respective human ratings. The learners were trained with the pixel representation and with the feature representation separately.

4.1 Dimension Reduction

The pixel images are of an extremely high dimension, on the order of 100,000 (equal to the image resolution). Given the high dimensionality and redundancy of the visual data, the pixel images underwent dimension reduction with principal component analysis (PCA). PCA has been shown to relate reliably to human performance on various face image processing tasks, such as face recognition (O'Toole, Abdi, Deffenbacher, & Valentin, 1993; Turk & Pentland, 1991) and race and sex classification (O'Toole, Deffenbacher, Abdi, & Bartlett, 1991), and to be semantically relevant. The eigenvectors pertaining to large eigenvalues have been shown to code general information, such as orientation and categorical assessment, which has high variance and is common to all faces in the set (O'Toole et al., 1993; O'Toole, Vetter, Troje, & Bülthoff, 1997; Valentin & Abdi, 1996). Those corresponding to the smaller eigenvalues code smaller,
more individual variation (Hancock, Burton, & Bruce, 1996; O'Toole et al., 1998). PCA was also performed on the feature-based measurements in order to decorrelate the variables in this representation. This is important since strong correlations, stemming, for example, from left-right symmetry, were observed in the data.

4.1.1 Image Alignment. For PCA to extract meaningful information from the pixel images, the images need to be aligned, typically by rotating, scaling, and translating, to bring the eyes to the same location in all the images. To produce sharper eigenfaces, we aligned the images according to a second point as well: the vertical location of the center of the mouth, a technique known to work well for facial expression recognition (Padgett & Cottrell, 1997). This nonrigid transformation, however, involved changing the face height-to-width ratio. To take this change into account, the vertical scaling factor of each face was added to its low-dimensional representation.

As the input data in our case are face images and the eigenvectors are of the same dimension as the input, the eigenvectors are interpretable as faces and are often referred to as eigenfaces (Turk & Pentland, 1991). The improvement in sharpness of the eigenfaces from the main data set as a result of the alignment can be seen in Figure 2. Each eigenface deviates from uniform gray where there is variation in the face set. The left column consists of two eigenfaces extracted by PCA from unaligned images; face contour and features are blurry. The middle column shows eigenfaces from images aligned only according to the eyes; the eyes are indeed more sharply defined, but other features are still blurred. The right column shows eigenfaces from PCA on images aligned by both the eyes and the vertical location of the mouth; all salient features are much more sharply defined.

4.1.2 Eigenfaces. PCA was performed on the input vectors from both representations, separately.
Examples of eigenvectors extracted from the pixel images from the main data set can be seen in Figure 3. The eigenfaces in the top row are those pertaining to the highest eigenvalues, the middle row shows eigenfaces corresponding to intermediate eigenvalues, and the bottom row presents those pertaining to the smallest eigenvalues. As expected, the eigenfaces in the top row seem to code more global information, such as hair and face shape, while the eigenvectors in the bottom row code much more fine, detailed feature information. Each eigenface is obviously not interpretable as a simple single feature (as is often the case with a smaller data set), yet it is clearly seen in the top row eigenfaces that the directions of highest variance are hair and face contour. This is not surprising, as the most prominent differences between the images are in hair color and shape, which also causes large differences in face shape (due to partial occlusion by hair). Smaller variance can also be seen in other features, mainly eyebrows and eyes.
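Eigenface extraction itself reduces to a singular value decomposition of the centered image matrix. A toy numpy sketch (random stand-in data; the real images have on the order of 100,000 pixels):

```python
import numpy as np

# Minimal eigenface extraction via SVD, assuming a matrix of flattened,
# aligned face images (one row per face). Sizes are toy values.
rng = np.random.default_rng(0)
faces = rng.standard_normal((92, 400))       # 92 faces, 400 "pixels"

mean_face = faces.mean(axis=0)
X = faces - mean_face                        # center the data
U, S, Vt = np.linalg.svd(X, full_matrices=False)

eigenfaces = Vt                              # rows are eigenfaces
eigenvalues = S**2 / (len(faces) - 1)        # variance along each eigenface
projections = X @ eigenfaces.T               # low-dimensional coordinates
```

The rows of `Vt` are ordered by decreasing eigenvalue, matching the global-to-fine progression discussed above.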
Y. Eisenthal, G. Dror, and E. Ruppin
Figure 2: (Left) Eigenfaces from unaligned images. (Middle) Eigenfaces from images aligned only by eyes. (Right) Eigenfaces from images aligned by both eyes and mouth.
4.2 Feature Selection. The eigenfaces are the features representing the face set; they can be combined in a certain weighting to represent a specific face. A low-dimensional representation using only the first eigenvectors minimizes the squared error between the face representation and the original image and is sufficient for accurate face recognition (Turk & Pentland, 1991). However, omitting the dimensions pertaining to the smaller eigenvalues decreases the perceptual quality of the face (O’Toole et al., 1993, 1998). Consequently, we anticipated that using the first m eigenfaces would not produce accurate results in our attractiveness evaluation task. Indeed, these experiments resulted in poor facial attractiveness predictions. We therefore selected the eigenfaces most important to our task by sorting them according to their relevance to attractiveness ratings. This relevance was estimated by calculating the correlation of the eigenvector projections with the human ratings across the various images. Interestingly, in the pixel representation, the features found most correlated with the attractiveness ratings were those pertaining to intermediate and smaller eigenvalues. Figure 4 shows the eigenfaces; the top row displays those pertaining to the highest eigenvalues, and the bottom row presents the eigenfaces with projections most correlated with human ratings. While the former show mostly general
Figure 3: Eigenfaces from largest to smallest eigenvalues (top to bottom).
Figure 4: Eigenfaces pertaining to highest eigenvalues (top row) and highest correlations with ratings (bottom row).
features of hair and face contour, the latter also clearly show the lips, the nose tip, and eye size and shape to be important features. The same method was used for feature selection in the feature-based representation. The feature measurements were sorted according to their correlation with the attractiveness ratings. It should be noted that despite its success, using correlation as a relevance measure is problematic, as it assumes the relation between the feature and the ratings to be monotonic. Yet experiments with other ranking criteria that do not make this assumption, such as chi square and mutual information, produced somewhat inferior results. 4.3 Attractiveness Prediction. The original data vectors were projected onto the top m eigenvectors from the feature selection stage (where m is a parameter on which we performed optimization) to produce a lowdimensional representation of the data as input to the learners in the prediction stage. 4.3.1 Classification into Two Attractiveness Classes. Although the ultimate goal of this work was to produce and analyze a facial beauty predictor using regression methods, we begin with a simpler task, on which there is even higher consistency between raters. To this end, we recast the problem of predicting facial attractiveness into a classification problem: discerning “attractive” faces (the class comprising the highest 25% rated images) from “unattractive” faces (the class of lowest 25% rated images). The main classifiers used were standard K-nearest neighbors (KNN) (Mitchell, 1997) and support vector machines (SVM) (Vapnik, 1995). The best results obtained are shown in Table 1, which displays the percentage of correctly classified images. Classification using the KNN classifier was good; correct classifications of 75% to 85% of the images were achieved. Classification rates with SVM were slightly poorer, though for the most part in the same percentage range. 
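The correlation-based ranking of eigenvector projections described in section 4.2 can be sketched as follows; the data are synthetic, with one deliberately informative column:

```python
import numpy as np

def rank_by_rating_correlation(projections, ratings, m):
    """Sort projection dimensions by |Pearson correlation| with the human
    ratings and keep the top m (a sketch of the selection described above)."""
    corrs = np.array([np.corrcoef(projections[:, j], ratings)[0, 1]
                      for j in range(projections.shape[1])])
    order = np.argsort(-np.abs(corrs))
    return order[:m], corrs

rng = np.random.default_rng(1)
proj = rng.standard_normal((92, 10))
ratings = 2.0 * proj[:, 7] + 0.1 * rng.standard_normal(92)  # column 7 is informative
top, corrs = rank_by_rating_correlation(proj, ratings, 3)
```

As noted in the text, ranking by correlation assumes a monotonic feature-rating relation; criteria such as mutual information drop that assumption.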
Both classifiers performed better with the feature vectors than with the pixel images; this is particularly true for SVM. Best SVM results were achieved using a linear kernel. In general, classification (particularly with KNN) was good for both data sets and ratings, with success percentages slightly lower for the main data set.
Table 1: Percentage of Correctly Classified Images.

                        Data Set 1    Data Set 2
Pixel Images      KNN      75%           77%
                  SVM      68%           73%
Feature Vectors   KNN      77%           86%
                  SVM      76%           84%
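A minimal sketch of the classification setup: the two classes are built from the rating quartiles, and a plain k-nearest-neighbor vote (standing in for the KNN and SVM classifiers actually used) assigns a test face to one of them:

```python
import numpy as np

def quartile_classes(ratings):
    """Indices of the 'attractive' (top 25%) and 'unattractive'
    (bottom 25%) images, as in the two-class task above."""
    lo, hi = np.quantile(ratings, [0.25, 0.75])
    return np.where(ratings >= hi)[0], np.where(ratings <= lo)[0]

def knn_classify(train_X, train_y, x, k=3):
    """Plain k-nearest-neighbor majority vote with Euclidean distances."""
    d = np.linalg.norm(train_X - x, axis=1)
    votes = train_y[np.argsort(d)[:k]]
    return int(round(votes.mean()))

ratings = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
attractive, unattractive = quartile_classes(ratings)
```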
KNN does not use specific features, but rather averages over all dimensions, and therefore does not give any insight into which features are important for attractiveness rating. In order to learn what the important features are, we used a C4.5 decision tree (Quinlan, 1986, 1993) for classification using feature vectors without preprocessing by PCA. In most cases, the results did not surpass those of the KNN classifier, but the decision tree did give some insight into which features are “important” for classification. The features found most informative were those pertaining to size of the lower part of the face (jaw length, chin length), smoothness of skin, lip fullness, and eye size. These findings are all consistent with previous psychophysics studies. 4.3.2 The Learners for the Regression Task. Following the success of the classification task, we proceeded to the regression task of rating prediction. The predictors used for predicting facial beauty itself were again KNN and SVM. For this task, however, both predictors were used in their regression version, mapping each facial image to a real number that represents its beauty. We also used linear regression, which served as a baseline for the other methods. Targets used were the average human ratings of each image. The output of the KNN predictor for a test image was computed as the weighted average of the targets of the image’s k nearest neighbors, where the weight of a neighbor is the inverse of its Euclidean distance from the test image. That is, let $v_1, \ldots, v_k$ be the set of $k$ nearest neighbors of test image $v$ with targets $y_1, \ldots, y_k$, and let $d_1, \ldots, d_k$ be their respective Euclidean distances from $v$. The predicted beauty $y$ for the test image $v$ is then

$$y = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i},$$
where $w_i = (d_i + \delta)^{-1}$ is the weight of neighbor $v_i$ and where $\delta$ is a smoothing parameter. On all subsequent uses of KNN, we set $\delta = 1$. KNN was run with k values ranging from 1 to 45. As a predictor for our task, KNN suffers from a couple of drawbacks. First, it performs averaging, and therefore its predicted ratings had very low variance, and all extremely high or low ratings were evened out and often not reached. In addition, it uses a Euclidean distance metric, which need not reflect the true metric for evaluation of face similarity. Therefore, we also studied an SVM regressor as an attractiveness predictor, a learner that does not use a simple distance metric and does not perform averaging in its prediction. The SVM method, in its regression version, was used with several kernels: linear, polynomials of degree 2 and 3, and gaussian with different values of $\gamma$, where $\log_2 \gamma \in \{-6, -4, -2, 0\}$. $\gamma$ is related to the width parameter
Figure 5: Prediction results obtained with pixel images (A) and feature-based representation (B). Performance is measured by the correlation between the predicted ratings and the human ratings.
$\sigma$ by $\gamma = 1/(2\sigma^2)$. We performed a grid search over the values of the slack parameter c and the width of the regression tube w such that $\log_{10} c \in \{-3, -2, -1, 0, 1\}$ and $w \in \{0.1, 0.4, 0.7, 1.0\}$. In all runs, we used a soft margin SVM implemented in SVMlight (Joachims, 1999). Due to the relatively small sample sizes, we evaluated the performance of the predictors using cross validation; predictions were made using leave-n-out, with n = 1 for KNN and linear regression and n = 5 for SVM. 4.3.3 Results of Facial Attractiveness Prediction. Predicted ratings were evaluated according to their correlation with the human ratings, using the Pearson correlation. Figure 5 depicts the best results of the attractiveness predictors on the main data set. Figure 5A shows the best correlations achieved with the pixel-based representation, and Figure 5B shows the best results for the feature-based representation. Prediction results for the pixel images show a peak near m = 25 features, where the maximum correlation achieved with KNN is approximately 0.45. The feature-based representation shows a maximum value of nearly 0.6 at m = 15 features, where the highest correlation is achieved with both SVM and linear regression. The highest SVM results in both representations were reached with a linear kernel. Results obtained on the second data set were similar. The normalized MSE of the best predicted ratings is 0.6 to 0.65 (versus a normalized MSE of 1 for the “trivial predictor,” which constantly predicts the mean rating). KNN performance was poor—significantly worse than that of the other regressors in the feature-based representation. These results imply that the Euclidean distance metric is probably not a good estimate of face similarity for this task. It is interesting to note that the simple linear regressor performed as well as or better than the KNN predictor. However, this effect may be attributed to our feature selection method,
ranking features by the absolute value of their correlations with the target, which is optimal for linear regression. 4.3.4 Significance of Results. All predictors performed better with the feature-based representation than with the pixel images (in accordance with the results of the classification task). Using the feature vectors enabled a maximum correlation of nearly 0.6 versus a correlation of 0.45 with the pixel images. To check the significance of this score, we produced an empirical distribution of feature-based prediction scores with random ratings. The entire preprocessing, feature selection, hyperparameter selection, and prediction process was run 100 times, each time with a different set of randomly generated ratings, sampled from a normal distribution with mean and variance identical to those of the human ratings. For each run, the score taken was the highest correlation of predicted ratings with the original (random) ratings. The average correlation achieved with random ratings was 0.28, and the maximum correlation was 0.47. Figure 6A depicts the histogram of these correlations. Using a QQ-plot, we verified that the empirical distribution of observed correlations is approximately normal; this is shown in Figure 6B. Using the normal approximation, the correlation obtained by our feature-based predictor is significant at a level of α = 0.001. The numbers and figures presented are for the KNN predictor. Correlations achieved with linear regression have a different mean and standard deviation but a similar z-value. The distribution of these correlations was also verified to be approximately normal, and the correlation achieved by our linear regressor was significant at the same level of α = 0.001. This test was not run for the SVM predictor due to computational limitations.
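The normal-approximation significance test described above reduces to a z-score of the observed correlation against the mean and standard deviation of the null scores. A sketch, with a synthetic stand-in for the 100 random-rating runs:

```python
import math
import numpy as np

def normal_tail_p(observed, null_scores):
    """z-score and one-sided p-value of an observed correlation under a
    normal approximation to the null (random-rating) score distribution."""
    mu = np.mean(null_scores)
    sigma = np.std(null_scores, ddof=1)
    z = (observed - mu) / sigma
    return z, 0.5 * math.erfc(z / math.sqrt(2.0))

rng = np.random.default_rng(2)
null = rng.normal(0.28, 0.05, size=100)  # stand-in for the 100 random runs
z, p = normal_tail_p(0.60, null)
```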
Figure 6: Correlations achieved with random ratings. (A) Histogram of correlations. (B) QQ-plot of correlations versus standard normally distributed data. Correlations were obtained with the KNN predictor.
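Returning to the predictor defined in section 4.3.2, the distance-weighted KNN prediction with smoothing parameter δ = 1 can be sketched as:

```python
import numpy as np

def knn_regress(train_X, train_y, x, k, delta=1.0):
    """Distance-weighted KNN rating prediction: each of the k nearest
    neighbors contributes with weight 1 / (distance + delta)."""
    d = np.linalg.norm(train_X - x, axis=1)
    nn = np.argsort(d)[:k]
    w = 1.0 / (d[nn] + delta)
    return float(np.sum(w * train_y[nn]) / np.sum(w))
```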
Figure 7: Correlation of the weighted machine ratings with human ratings versus the weighting parameter, α.
4.3.5 Hybrid Predictor. Ratings predicted with the two representations were very different; the best ratings achieved using each representation had a correlation of only 0.3 to 0.35. The relatively low correlation between the feature-based and pixel-based predictions suggests that results might be improved by using the information learned from both representations. Therefore, an optimal weighted average of the best feature-based and pixel-based ratings was calculated. We produced a hybrid machine that generates the target rating $y_{hybrid} = \alpha y_{feature} + (1 - \alpha) y_{pixel}$, where $y_{feature}$ is the rating of the feature-based predictor, $y_{pixel}$ is the prediction of the pixel-based machine, and $0 \leq \alpha \leq 1$. Figure 7 shows the correlation between the hybrid machine ratings and the human ratings as a function of the weights tried (weights shown are those of the feature-based ratings, $\alpha$). The hybrid predictor was constructed using the best feature-based and pixel-based ratings obtained with linear regression. As evident from the graph, the best weighted ratings achieve a correlation of 0.65 with the human ratings. The hybrid predictor with the optimal value of $\alpha = 0.65$ improves prediction results by nearly 10% over those achieved with a single representation. Its normalized MSE is 0.57, lower than that of the individual rating sets. These weighted ratings have the highest correlation and lowest normalized MSE with respect to the human scores. Therefore, in subsequent analysis, we use these weighted ratings as the best machine-predicted ratings, unless stated otherwise. 4.3.6 Evaluation of Predicted Rating Ranking. An additional analysis was performed to evaluate the relative image ranking induced by the best
Figure 8: Probability of error in the predicted relative order of two images as a function of the absolute difference in their original human ratings.
machine predictions. Figure 8 shows the probability of error in the predicted relative ordering of two images as a function of the absolute difference d in their original, human ratings. The differences were binned into 16 bins. The probability of error for each value of d was computed over all pairs of images with an absolute difference of d in their human ratings. As evident from the graph, the probability decreases almost linearly as the absolute difference in the original ratings grows. 4.3.7 The Learning Curve of Facial Attractiveness. For further evaluation of the prediction machine, an additional experiment was run in which we examined the learning curve of our predictor. We produced this curve by iteratively running the predictor for a growing data set size in the following manner. The number of input images, n, was incremented from 5 to the entire 92 images. For every n, the predictor was run 10 times, each time with n different, randomly selected images (for n = 92, all images were used in a single run). Testing was performed on the subsets of n images only, using leave-one-out, and results were evaluated according to the correlation of the predicted ratings with the human ratings of these images. Figure 9 shows the results for the KNN predictor trained using the feature representation with k = 16 and m = 7 features. The correlations shown in the plot are the average over the 10 runs. The figure clearly shows that the performance of the predictor improves with the increase in the number of images. The slope of the graph is positive for every n ≥ 50. Similar behavior was observed with other parameters and learners. This tendency is less distinct in the corresponding graph for the pixel images.
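The learning-curve experiment of section 4.3.7 can be sketched with a toy 1-NN predictor on synthetic data: random subsets of growing size, leave-one-out evaluation, and correlations averaged over repeated runs.

```python
import numpy as np

def learning_curve(X, y, sizes, runs=10, seed=0):
    """Average leave-one-out correlation of a 1-NN predictor over
    random subsets of growing size (a toy stand-in for the experiment)."""
    rng = np.random.default_rng(seed)
    scores = []
    for n in sizes:
        run_scores = []
        for _ in range(runs):
            idx = rng.choice(len(X), size=n, replace=False)
            Xs, ys = X[idx], y[idx]
            preds = []
            for i in range(n):
                d = np.linalg.norm(Xs - Xs[i], axis=1)
                d[i] = np.inf            # leave image i out
                preds.append(ys[np.argmin(d)])
            run_scores.append(np.corrcoef(preds, ys)[0, 1])
        scores.append(float(np.mean(run_scores)))
    return scores

rng = np.random.default_rng(4)
X = rng.uniform(size=(200, 2))
y = X[:, 0]                              # smooth synthetic "rating"
scores = learning_curve(X, y, sizes=[5, 150])
```

On such data the score rises with the subset size, mirroring the positive slope reported for the real predictor.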
Figure 9: Accuracy of prediction as a function of the training set size.
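The hybrid weighting of section 4.3.5 is a one-dimensional search over α; a sketch with synthetic predictor outputs:

```python
import numpy as np

def best_hybrid(y_feature, y_pixel, y_human, alphas=np.linspace(0.0, 1.0, 21)):
    """Grid search for the weight alpha maximizing the correlation of
    alpha*y_feature + (1 - alpha)*y_pixel with the human ratings."""
    scored = [(np.corrcoef(a * y_feature + (1 - a) * y_pixel, y_human)[0, 1], a)
              for a in alphas]
    return max(scored)                   # (best correlation, best alpha)

rng = np.random.default_rng(3)
y_h = rng.standard_normal(92)            # synthetic stand-in for human ratings
y_f = y_h + 0.8 * rng.standard_normal(92)
y_p = y_h + 1.2 * rng.standard_normal(92)
corr, alpha = best_hybrid(y_f, y_p, y_h)
```

Because α = 0 and α = 1 are on the grid, the hybrid score can never fall below that of either individual predictor.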
5 Discussion This letter presents a predictor of facial attractiveness, trained with female facial images and their respective average human ratings. Images were represented as both raw pixel data and measurements of key facial features. Prediction was carried out using KNN, SVM, and linear regression, and the best predicted ratings achieved a correlation of 0.65 with the human ratings. We consistently found that the measured facial features were more informative for attractiveness prediction on all tasks tried. In addition to learning facial attractiveness, we examined some characteristics found correlated with facial attractiveness in previous experiments. In particular, we ran our predictor on the “average” face, that is, the mathematical average of the faces in the data set. This face received only an average rating, showing no support for the averageness hypothesis in our setting. This supports previous experiments that argued against the averageness hypothesis (as described in section 1.1.2). The high attractiveness of composite faces may be attributable to their smooth skin and symmetry and not to the averageness itself, explaining the fact that the mathematical average of the faces was not found to be very attractive. Given the high dimensionality and redundancy of visual data, the task of learning facial attractiveness is undoubtedly a difficult one. We tried additional preprocessing, feature selection, and learning methods, such as Wrapper (Kohavi & John, 1996), Isomap (Tenenbaum, de Silva, & Langford, 2000), and kernel PCA (Schölkopf, Smola, & Müller, 1999), but these all produced poorer results. The nonlinear feature extraction methods probably
failed due to an insufficient number of training examples, as they require a dense sampling of the underlying manifold. Nevertheless, our predictor achieved significant correlations with the human ratings. However, we believe our success was limited by a number of hindering factors. The most significant obstacle in our project is likely the relatively small size of the data sets available to us. This limitation can be appreciated by examining Figure 9, which presents a plot of prediction performance versus the size of the data set. The figure clearly shows improvement in the predictor’s performance as the number of images increases. The slope of the graph is still positive with the 92 images used and does not asymptotically level off, implying that there is considerable room for improvement by using a larger, but still realistically conceivable, data set. Another likely limiting factor is insufficient data representation. While the feature-based representation produced better results than the pixel images, it is nonetheless incomplete; it includes only Euclidean distance-based measurements and lacks fine shape and texture information. The relatively lower results with the pixel images show that this representation is not informative enough. In conclusion, this work, novel in its application of computational learning methods for the analysis of facial attractiveness, has produced promising results. Significant correlations with human ratings were achieved despite the difficulty of the task and several hindering factors. The results clearly show that facial beauty is a universal concept that a machine can learn. There are sufficient grounds to believe that future work with a moderately larger data set may lead to an “attractiveness machine” producing humanlike evaluations of facial attractiveness. Appendix: Features Used by the Feature-Based Predictors A.1 Feature-Based Representation.
Following is a list of the measurements comprising the feature-based representation: 1. Face length 2. Face width—at eye level 3. Face width—at mouth level 4. Distance between pupils 5. Ratio between 2 and 3 6. Ratio between 1 and 2 7. Ratio between 1 and 3 8. Ratio between 4 and 2 9. Right eyebrow thickness (above pupil)
10. Left eyebrow thickness (above pupil) 11. Right eyebrow arch—height difference between highest point and inner edge 12. Left eyebrow arch—height difference between highest point and inner edge 13. Right eye height 14. Left eye height 15. Right eye width 16. Left eye width 17. Right eye size = height * width 18. Left eye size = height *width 19. Distance between inner edges of eyes 20. Nose width at nostrils 21. Nose length 22. Nose size = width * length 23. Cheekbone width = (2–3) 24. Ratio between 23 and 2 25. Thickness of middle of top lip 26. Thickness of right side of top lip 27. Thickness of left side of top lip 28. Average thickness of top lip 29. Thickness of lower lip 30. Thickness of both lips 31. Length of lips 32. Chin length—from bottom of face to bottom of lower lip 33. Right jaw length—from bottom of face to right bottom face edge 34. Left jaw length—from bottom of face to left bottom face edge 35. Forehead height—from nose top to top of face 36. Ratio of (distance from nostrils to eyebrow top) to (distance from face bottom to nostrils) 37. Ratio of (distance from nostrils to face top) to (distance from face bottom to nostrils)
38. Symmetry indicator (description follows) 39. Skin smoothness indicator (description follows) 40. Hair color indicator (description follows) A.2 Symmetry Indicator. A vertical symmetry axis was set between the eyes of each image, and two rectangular, identically sized windows, surrounding only mouth and eyes, were extracted from opposite sides of the axis. The symmetry measure of the image was calculated as $\frac{1}{N} \sum_i (X_i - Y_i)^2$, where $N$ is the total number of pixels in each window, $X_i$ is the value of pixel $i$ in the right window, and $Y_i$ is the value of the corresponding pixel in the left window. The value of the indicator grows with the asymmetry in a face. This indicator is indeed a measure of the symmetry in the facial features, as the images are all consistent in lighting and orientation. A.3 Skin Smoothness Indicator. The “smoothness” of a face was evaluated by applying a Canny edge detection operator to a window from the cheek/forehead area; a window representative of the skin texture was selected for each image. The skin smoothness indicator was the average value of the output of this operation, and its value decreases monotonically with the smoothness of a face. A.4 Hair Color Indicator. A window representing the average hair color was extracted from each image. The indicator was calculated as the average value of the window, thus increasing with lighter hair. Acknowledgments We thank Bernhard Fink and the Ludwig-Boltzmann Institute for Urban Ethology at the Institute for Anthropology, University of Vienna, Austria, for one of the facial data sets used in this research. References Alley, T. R., & Cunningham, M. R. (1991). Averaged faces are attractive, but very attractive faces are not average. Psychological Science, 2, 123–125. Baenninger, M. (1994). The development of face recognition: Featural or configurational processing? Journal of Experimental Child Psychology, 57, 377–396. Bruce, V., & Langton, S. (1994).
The use of pigmentation and shading information in recognizing the sex and identities of faces. Perception, 23, 803–822. Bruce, V., & Young, A. W. (1986). Understanding face recognition. British Journal of Psychology, 77, 305–327. Burton, A. M., Bruce, V., & Dench, N. (1993). What’s the difference between men and women? Evidence from facial measurement. Perception, 22(2), 153–176.
Chen, A. C., German, C., & Zaidel, D. W. (1997). Brain asymmetry and facial attractiveness: Facial beauty is not simply in the eye of the beholder. Neuropsychologia, 35(4), 471–476. Cunningham, M. R. (1986). Measuring the physical in physical attractiveness: Quasi experiments on the sociobiology of female facial beauty. Journal of Personality and Social Psychology, 50(5), 925–935. Cunningham, M. R., Barbee, A. P., & Pike, C. L. (1990). What do women want? Facial metric assessment of multiple motives in the perception of male physical attractiveness. Journal of Personality and Social Psychology, 59, 61–72. Cunningham, M. R., Roberts, A. R., Wu, C. H., Barbee, A. P., & Druen, P. B. (1995). Their ideas of beauty are, on the whole, the same as ours: Consistency and variability in the cross-cultural perception of female attractiveness. Journal of Personality and Social Psychology, 68, 261–279. Etcoff, N. (1999). Survival of the prettiest: The science of beauty. New York: Anchor Books. Fink, B., Grammer, K., & Thornhill, R. (2001). Human (Homo sapiens) facial attractiveness in relation to skin texture and color. Journal of Comparative Psychology, 115(1), 92–99. Grammer, K., & Thornhill, R. (1994). Human facial attractiveness and sexual selection: The role of symmetry and averageness. Journal of Comparative Psychology, 108(3), 233–242. Haig, N. D. (1984). The effect of feature displacement on face recognition. Perception, 13, 505–512. Hancock, P. J. B., Burton, A. M., & Bruce, V. (1996). Face processing: Human perception and PCA. Memory and Cognition, 24, 26–40. Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press. Johnston, V. S., & Franklin, M. (1993). Is beauty in the eye of the beholder? Ethology and Sociobiology, 14, 183–199. Jones, D. (1996). Physical attractiveness and the theory of sexual selection: Results from five populations.
Ann Arbor: University of Michigan Museum. Kohavi, R., & John, G. H. (1996). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2), 273–324. Langlois, J. H., & Roggman, L. A. (1990). Attractive faces are only average. Psychological Science, 1, 115–121. Langlois, J. H., Roggman, L. A., Casey, R. J., Ritter, J. M., Rieser-Danner, L. A., & Jenkins, V. Y. (1987). Infant preferences for attractive faces: Rudiments of a stereotype? Developmental Psychology, 23, 363–369. Langlois, J. H., Roggman, L. A., & Musselman, L. (1994). What is average and what is not average about attractive faces? Psychological Science, 5, 214–220. Mealey, L., Bridgstock, R., & Townsend, G. C. (1999). Symmetry and perceived facial attractiveness: A monozygotic co-twin comparison. Journal of Personality and Social Psychology, 76(1), 151–158. Mitchell, T. (1997). Machine learning. New York: McGraw-Hill. Moghaddam, B., & Pentland, A. (1994). Face recognition using view-based and modular eigenspaces. In R. J. Mammone & J. D. Murley, Jr. (Eds.), Automatic systems for the identification and inspection of humans, Proc. SPIE, 2257.
O’Toole, A., Abdi, H., Deffenbacher, K. A., & Valentin, D. (1993). Low-dimensional representation of faces in higher dimensions of the face space. Journal of the Optical Society of America A, 10(3), 405–411. O’Toole, A. J., Deffenbacher, K. A., Valentin, D., McKee, K., Huff, D., & Abdi, H. (1998). The perception of face gender: The role of stimulus structure in recognition and classification. Memory and Cognition, 26, 146–160. O’Toole, A. J., Deffenbacher, K. A., Abdi, H., & Bartlett, J. A. (1991). Simulating the “other-race effect” as a problem in perceptual learning. Connection Science: Journal of Neural Computing, Artificial Intelligence, and Cognitive Research, 3, 163–178. O’Toole, A. J., Price, T., Vetter, T., Bartlett, J. C., & Blanz, V. (1999). 3D shape and 2D surface textures of human faces: The role of “averages” in attractiveness and age. Image and Vision Computing, 18, 9–19. O’Toole, A. J., Vetter, T., Troje, N. F., & Bülthoff, H. H. (1997). Sex classification is better with three-dimensional head structure than with image intensity information. Perception, 26, 75–84. Padgett, C., & Cottrell, G. (1997). Representing face images for emotion classification. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press. Perrett, D. I., Burt, D. M., Penton-Voak, I. S., Lee, K. J., Rowland, D. A., & Edwards, R. (1999). Symmetry and human facial attractiveness. Evolution and Human Behavior, 20, 295–307. Perrett, D. I., Lee, K. J., Penton-Voak, I., Rowland, D. A., Yoshikawa, S., Burt, D. M., Henzi, S. P., Castles, D. L., & Akamatsu, S. (1998). Effects of sexual dimorphism on facial attractiveness. Nature, 394, 826–827. Perrett, D. I., May, K. A., & Yoshikawa, S. (1994). Facial shape and judgments of female attractiveness. Nature, 368, 239–242. Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106. Quinlan, J. R. (1993). C4.5: Programs for machine learning.
San Francisco: Morgan Kaufmann. Schmidhuber, J. (1998). Facial beauty and fractal geometry. Available online at: http://www.idsia.ch/~juergen/locoface/newlocoface.html. Schölkopf, B., Smola, A., & Müller, K. R. (1999). Kernel principal component analysis. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press. Slater, A., Von der Schulenberg, C., Brown, E., Badenoch, M., Butterworth, G., Parsons, S., & Samuels, C. (1998). Newborn infants prefer attractive faces. Infant Behavior and Development, 21, 345–354. Symons, D. (1979). The evolution of human sexuality. New York: Oxford University Press. Symons, D. (1995). Beauty is in the adaptations of the beholder: The evolutionary psychology of human female sexual attractiveness. In P. R. Abramson & S. D. Pinkerton (Eds.), Sexual nature, sexual culture (pp. 80–118). Chicago: University of Chicago Press. Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. Thornhill, R., & Gangestad, S. W. (1993). Human facial beauty: Averageness, symmetry and parasite resistance. Human Nature, 4(3), 237–269.
Thornhill, R., & Gangestad, S. W. (1999). Facial attractiveness. Trends in Cognitive Sciences, 3, 452–460. Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86. Valentin, D., & Abdi, H. (1996). Can a linear autoassociator recognize faces from new orientations? Journal of the Optical Society of America, series A, 13, 717–724. Valentine, T., & Bruce, V. (1986). The effects of race, inversion and encoding activity upon face recognition. Acta Psychologica, 61, 259–273. Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag. Young, A. W., Hellawell, D., & Hay, D. C. (1989). Configurational information in face perception. Perception, 16, 747–759.
Received July 30, 2004; accepted May 15, 2005.
LETTER
Communicated by Peter Hancock
Classification of Faces in Man and Machine Arnulf B. A. Graf ∗
[email protected]
Felix A. Wichmann
[email protected]
Heinrich H. Bülthoff
[email protected]
Bernhard Schölkopf
[email protected] Max Planck Institute for Biological Cybernetics, D-72076 Tübingen, Germany
We attempt to shed light on the algorithms humans use to classify images of human faces according to their gender. For this, a novel methodology combining human psychophysics and machine learning is introduced. We proceed as follows. First, we apply principal component analysis (PCA) on the pixel information of the face stimuli. We then obtain a data set composed of these PCA eigenvectors combined with the subjects’ gender estimates of the corresponding stimuli. Second, we model the gender classification process on this data set using a separating hyperplane (SH) between both classes. This SH is computed using algorithms from machine learning: the support vector machine (SVM), the relevance vector machine, the prototype classifier, and the K-means classifier. The classification behavior of humans and machines is then analyzed in three steps. First, the classification errors of humans and machines are compared for the various classifiers, and we also assess how well machines can recreate the subjects’ internal decision boundary by studying the training errors of the machines. Second, we study the correlations between the rank-order of the subjects’ responses to each stimulus—the gender estimate with its reaction time and confidence rating—and the rank-order of the distance of these stimuli to the SH. Finally, we attempt to compare the metric of the representations used by humans and machines for classification by relating the subjects’ gender estimate of each stimulus and the distance of this stimulus to the SH. While we show that the classification error alone is not a sufficient selection criterion between the different algorithms humans might use to classify face stimuli, the distance of these stimuli to the SH is shown to capture essentials of the internal decision
∗ Present address: Center for Neural Science, New York University, New York, NY, USA.

Neural Computation 18, 143–165 (2006)
© 2005 Massachusetts Institute of Technology
space of humans. Furthermore, algorithms such as the prototype classifier using stimuli in the center of the classes are shown to be less adapted to model human classification behavior than algorithms such as the SVM based on stimuli close to the boundary between the classes.
1 Introduction

Bringing together theoretical modeling and behavioral data is arguably one of the main challenges when studying the “computational brain” (Churchland & Sejnowski, 1992). The aim of this letter is to obtain a better understanding of the algorithms responsible for the classification of visual stimuli by humans. For this, we combine machine learning and psychophysical techniques to gain insights into the algorithms human subjects use during visual classification of images of human faces according to their gender. In this “machine-learning-psychophysics” approach, we substitute a complex system that is very hard to analyze—the human brain—with a reasonably complex system—a learning machine (Vapnik, 1998). The latter is complex enough to capture some essentials of the human behavior but is still amenable to close analysis (Poggio, Rifkin, Mukherjee, & Niyogi, 2004). The research presented in this article is focused on a novel methodology that bridges the gap between human psychophysics and machine learning by extracting quantitative information from a (high-level) human behavioral experiment. The past decade has seen important technological advances in neuroscience from a microscopic scale (e.g., multiunit recordings) to a macroscopic scale (e.g., functional magnetic resonance imaging), yielding novel insights into visual processing. However, on an algorithmic level, the methods and understanding of brain processes involved in visual recognition are still limited, although numerous attempts have been made since this problem was pointed out by Marr (1982). Recently various computational models for visual recognition have been proposed. For instance, a network of Gabor wavelet filters was used to describe the processing of visual information (Mel, 1997). Independent component analysis was combined with a nearest-neighbor classifier to model face recognition (Bartlett, Movellan, & Sejnowski, 2002).
The computations done by the human visual system for facial expression recognition were described using Gabor wavelets, principal component analysis, and artificial neural networks (Dailey, Cottrell, Padgett, & Adolphs, 2002). Object recognition and classification was also modeled using a hierarchical model composed of a network of nonlinear units combined using a maximum operation (Riesenhuber & Poggio, 1999, 2002). While each of these methods is successful for its own task, they illustrate the divergence of the approaches used to understand human category learning as pointed out, for example, in the overview by Ashby and Ell (2001). In this letter, we propose a novel
Classification of Faces in Man and Machine
method combining machine learning and human psychophysics to shed light on the algorithms humans use to classify visual stimuli. Our framework allows us to compare directly the classification behavior of different algorithms to that of humans. While the results obtained in this letter have no claim to be biologically inspired or to explain a specific function of the visual system (see, e.g., Rolls & Deco, 2002, for an overview of such computational methods), we instead ask the following questions: Can we generate testable hypotheses about the algorithms humans use to classify visual inputs? Can we find a classifier whose behavior reflects human classification behavior significantly better than others? Current high-level vision research, with its intrinsically complex stimuli, is hampered by a lack of methods to answer such questions at the algorithmic level. The method presented here has the potential to contribute to overcoming this obstacle. An initial attempt using machine learning to help understand the algorithms humans use to classify the gender of faces was presented by Graf and Wichmann (2004). This letter extends that work. In section 2 we present a psychophysical gender classification experiment of images of human faces and analyze the subjects’ responses—the gender estimate with its reaction time and confidence rating. Section 3 introduces several algorithms from machine learning that will be used to model the classification behavior of humans. Our analysis of the classification behavior of humans proceeds in three steps. First, the classification performance of humans and machines is compared in section 4, and the findings are related to those described in the literature. Second, we correlate in section 5 the rank-order of the subjects’ responses to each stimulus with the rank-order of the distance of this stimulus to the separating hyperplane (SH) of the machine. 
The success of these studies encourages us to perform the third step in section 6: a metric comparison of the representations used by humans and machines for classification, using the subjects’ gender estimate of each stimulus and the corresponding distance to the SH of the machine. Section 7 summarizes our results and discusses their implications.

2 Human Classification

In a human psychophysical classification experiment, 55 human subjects were asked to classify a random gender-balanced subset of 152 out of 200 realistic human faces according to their gender. The stimuli were presented sequentially once to each subject. The temporal envelope of stimulus presentation was a modified Hanning window (a raised cosine function with a raising time of 500 ms and a plateau time of 1000 ms, for a total presentation time of 2000 ms per face). After the presentation of each stimulus, a blank screen with mean luminance was shown to the subjects for 1000 ms before the presentation of the following stimulus. We recorded the subjects’ estimated gender (female or male) together with the reaction time (RT) and
a confidence rating (CR) on a scale from 1 (unsure) to 3 (sure). No feedback on the correctness of the subjects’ answers was provided. Subjects were asked to classify the faces as fast as possible to obtain perceptual, rather than cognitive, judgments. Most of the time they responded well before the presentation of the stimulus had ended (mean reaction time over all stimuli and subjects was approximately 900 ms). A training phase of 8 faces (4 male and 4 female faces) preceded the actual classification experiment in order to acquaint the subjects with the stimuli and the experimental procedure. Subjects viewed the screen binocularly with their head stabilized by a headrest. All subjects had normal or corrected-to-normal vision and were paid for their participation. Most of them were students from the University of Tübingen, and all of them were naive to the purpose of the experiment. Each stimulus was an 8-bit grayscale frontal view of a Caucasian face with a nominal size of 256 × 256 pixels. All faces were centered on the display, had the same pixel-surface area and the same mean intensity, and they came from a processed version of the MPI face database¹ (Blanz & Vetter, 1999). The details of the image processing are described in Graf and Wichmann (2002). The stimuli were presented against the mean luminance (50 cd/m²) of a linearized Clinton Monoray CRT driven by a Cambridge Research Systems VSG 2/5 display controller. Neither the presentation of a male nor of a female face changed the mean luminance of the screen. The subjects’ gender estimates were analyzed using signal detection theory (Wickens, 2002). We assume that on the decision axis, the internal class representations are corrupted by gaussian distributed noise with the same unit variance but different means. We define correct response probabilities for male (+) and female (−) stimuli as P+ = P(ŷ = 1|y = 1) and P− = P(ŷ = −1|y = −1), where ŷ is the estimated class and y the true class of the stimulus.
The discriminability of both classes can then be computed as d′ = Z(P+) + Z(P−), where Z = Φ⁻¹ and Φ is the cumulative normal distribution with zero mean and unit variance. Averaged across all subjects, we obtain a high discriminability, d′ = 2.63 ± 0.57, suggesting that the classification task is comparatively easy for the subjects, albeit not trivial (no ceiling effect). Furthermore, the subjects exhibit a pronounced male bias in the responses, defined as log(β) = ½(Z²(P+) − Z²(P−)) = 1.49 ± 1.15, indicating that more females are classified as males than males as females. In Figure 1 we show the relation between the average across all subjects of the subjects’ responses for each stimulus, each point in these plots representing one stimulus. We can first see that for P(ŷ = +1|x) ≈ 1, all the stimuli are male and that for P(ŷ = +1|x) ≈ 0, all the stimuli are female. Second, we can observe the male bias already mentioned: a higher density of responses near P(ŷ = +1|x) ≈ 1. Furthermore, there are female stimuli for which P(ŷ = +1|x) > ½, but no male stimuli for which P(ŷ = +1|x) < ½.¹
To be found online at http://faces.kyb.tuebingen.mpg.de.
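The signal detection computations above can be sketched as follows. This is our illustration, not the authors' code; the function name `dprime_and_bias` and the example probabilities are ours:

```python
# Sketch: discriminability d' and bias log(beta) from correct-response
# probabilities, assuming equal-variance gaussian signal detection theory.
from scipy.stats import norm

def dprime_and_bias(p_plus, p_minus):
    """p_plus = P(yhat=1|y=1), p_minus = P(yhat=-1|y=-1)."""
    z = norm.ppf  # Z = inverse of the cumulative normal distribution
    d = z(p_plus) + z(p_minus)                      # d' = Z(P+) + Z(P-)
    log_beta = 0.5 * (z(p_plus) ** 2 - z(p_minus) ** 2)
    return d, log_beta

# hypothetical hit rates for illustration only
d, log_beta = dprime_and_bias(0.95, 0.80)
```

A positive `log_beta`, as obtained by the authors, indicates the male bias described in the text.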
Figure 1: Relation between the subjects’ responses—the probability P( yˆ = +1|x) to answer male, the reaction time RT, and the confidence rating CR— on a stimulus-by-stimulus basis (responses averaged across subjects).
Clearly the threshold for male-female discrimination depends on the male bias and is located in [½, 1]. Third, we notice that for stimuli with a high probability to belong to either class (P(ŷ = +1|x) = 0 or 1), the corresponding RTs are short and the CRs are high. In other words, when the subjects make a correct gender estimate, they answer fast, and they are confident of their response. For the stimuli where the subjects have difficulty choosing a class (P(ŷ = +1|x) ≈ 0.5), they take longer to respond (long RT) and are unsure of their response (low CR). Subjects thus have a rather good knowledge of the correctness of their gender estimate.

3 Machine Classification

To model the subjects’ classification behavior using machine learning, we first need to preprocess the stimuli to reduce their “apparent” dimensionality. We use principal component analysis (PCA; Duda, Hart, & Stork, 2001), a widely used linear preprocessor from unsupervised machine learning, to preprocess the data. PCA is an eigenvalue decomposition of the covariance matrix associated with the data matrix D = BE along the directions of largest variance, where the columns of the basis matrix B are constrained to be orthonormal and the rows of the encoding matrix E are orthogonal. The rows of B are termed eigenfaces according to one of the first studies to apply PCA to human faces (Sirovich & Kirby, 1987). PCA has also been successfully applied to model face perception and classification in a large number of studies, from psychophysics (O’Toole, Abdi, Deffenbacher, & Valentin, 1993; Valentin, Abdi, Edelman, & O’Toole, 1997; O’Toole, Deffenbacher, Valentin, McKee, Huff, & Abdi, 1998; O’Toole, Vetter, & Blanz, 1999; Furl,
Phillips, & O’Toole, 2002), to artificial recognition systems (Turk & Pentland, 1991; Golomb, Lawrence, & Sejnowski, 1991; Gray, Lawrence, Golomb, & Sejnowski, 1995; O’Toole, Phillips, Cheng, Ross, & Wild, 2000; Bartlett et al., 2002) and facial expression modeling (Calder, Burton, Miller, Young, & Akamatsu, 2001). Like all previous studies, we apply PCA to the vectors obtained when reshaping the intensity matrix of the pixels of each face into a single 256² × 1 vector. We keep the full space of the data, that is, the 200 nonzero components of the PCA decomposition of the data, and obtain a PCA-encoding data matrix E of size 200 × 200, where each row is the encoding corresponding to a face stimulus. By construction, these encodings are already centered. Subsequently these encodings are also normalized since this has been shown to be quite effective in real-world applications for some classifiers (Graf, Smola, & Borer, 2003). Since we consider the full encoding space of dimension 200, the choice of PCA as a preprocessor is of little consequence, and the face stimuli can be reconstructed perfectly from these encodings. In this letter, we consider two types of stimulus data sets for each subject: the true and the subject data sets. The patterns in both data sets are represented by their (centered and normalized) PCA encodings. The true data set contains the p = 152 encodings xi ∈ R200, i = 1, . . . , p of the stimuli seen by the subject, combined with the true labels yi = ±1 of these stimuli—their true gender as given by the MPI face database. The subject data set is composed of the same encodings xi, combined this time with the labels ŷi of the stimuli as estimated by the subject in the psychophysical classification experiment. This data set represents what we assume to be the subject’s internal representation of the face space. Altogether we thus have 55 true and subject data sets. We use methods from supervised machine learning to model classification.
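The preprocessing pipeline just described can be sketched as follows. This is our illustration, not the authors' code; random 64 × 64 images stand in for the actual 256 × 256 face stimuli:

```python
# Sketch: reshape faces to vectors, PCA via SVD, then center and
# unit-normalize each encoding (one row per face).
import numpy as np

rng = np.random.default_rng(0)
faces = rng.random((200, 64, 64))      # toy stand-in for the 200 face images
D = faces.reshape(200, -1)             # one pixel vector per face
D = D - D.mean(axis=0)                 # center the data
U, s, Vt = np.linalg.svd(D, full_matrices=False)
E = U * s                              # PCA encodings, shape (200, 200)
E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize each row
```

The encodings `E` are centered by construction (each principal-component score averages to zero over the centered data), mirroring the paper's remark.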
The classifiers are applied to the true and the subject data sets and thus classify in the PCA space of dimension 200. We consider classifiers that are linear: they classify using a separating hyperplane (SH) defined by its normal vector w and offset b. Furthermore, these classifiers can all be expressed in dual form: the normal vector is a linear combination of the patterns of the data set, w = ∑i αi xi. Since we cannot investigate all such classifiers in an exhaustive manner, we consider the most representative member of each one of four families of classification principles: the support vector machine, the relevance vector machine, the prototype classifier, and the K-means classifier. Figure 2 shows these classifiers applied on a two-dimensional toy data set. These classifiers are presented and discussed in further detail below. The support vector machine SVM (Vapnik, 2000; Schölkopf & Smola, 2002) is a state-of-the-art maximum margin classification algorithm rooted in statistical learning theory. SVMs classify by maximizing the margin separating both classes while minimizing the classification errors. This tradeoff between maximum margin and misclassifications is controlled by a
Figure 2: Classification of a two-dimensional toy data set using the classifiers considered in this study. The dark lines indicate the SHs.
parameter C set by cross-validation.² The optimal dual-space parameter α maximizes the following expression,

∑i αi − (1/2) ∑ij yi yj αi αj ⟨xi|xj⟩, subject to ∑i αi yi = 0 and 0 ≤ αi ≤ C,

where ⟨·|·⟩ stands for the inner (or scalar) product between two vectors. The offset is computed as b = ⟨yi − ⟨w|xi⟩⟩i|0<αi<C, that is, as an average over the patterns with 0 < αi < C.
² Cross-validation is used to assess in an unbiased manner the classification error of an algorithm on a given data set. An N-fold cross-validation scheme separates the data set into N subsets where N − 1 are used for training and the remaining one is used for testing. The average over all N possibilities is then an estimate of the classification error of the classifier on the considered data set. When estimating optimal parameters, the parameter yielding the minimal classification error is chosen.
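A minimal sketch of training such a linear soft-margin SVM, using scikit-learn rather than the authors' implementation (the toy data are ours):

```python
# Sketch: linear SVM in dual form; w and b are recovered from the fit.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# two gaussian classes in 2D standing in for the PCA encodings
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
w = svm.coef_[0]            # w = sum_i alpha_i y_i x_i (dual expansion)
b = svm.intercept_[0]
pred = np.sign(X @ w + b)   # decision rule sign(<w|x> + b)
```

The dual coefficients stored in `svm.dual_coef_` equal αi yi and respect the box constraint 0 ≤ αi ≤ C of the dual problem above.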
Sparsity of α is introduced using a gaussian distribution for P(α|β). Learning then amounts to maximizing with respect to β the following conditional probability:

P(y|X, β) = ∫ P(y|X, α) P(α|β) dα.

The value of β maximizing the above probability is then used to compute α using P(α|β), and thus also w and b. Since this integral cannot be solved analytically, the Laplace approximation (local approximation of the integrand by a gaussian) is used for solution, yielding an iterative update scheme for β. Some classifiers used in neuroscience, cognitive science, and psychology are variants of the mean-of-class prototype classifier Prot (Reed, 1972; Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976). Its popularity may be due in part to its intuitiveness (representing the class by its mean tendency) as well as to its simplicity: it classifies according to the nearest mean-of-class prototype. In its simplest form, all dimensions are weighted equally, but variants exist where the weight of each dimension is inversely proportional to the class variance along that dimension. As we cannot estimate the class variance along all 200 dimensions from only 200 stimuli, we chose to implement the simplest prototype classifier with equal weights along all dimensions, where the prototypes are defined as

p± = ∑i xi (yi ± 1) / ∑i (yi ± 1).

The weight vector and the offset are then computed respectively as

w = p+ − p−  and  b = (‖p−‖² − ‖p+‖²) / 2.
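The prototype classifier follows directly from these formulas; a sketch in our own code, with toy data standing in for the encodings:

```python
# Sketch: mean-of-class prototype classifier (Prot).
import numpy as np

def fit_prototype(X, y):
    """X: (n, d) patterns; y: labels in {-1, +1}. Returns (w, b)."""
    p_plus = X[y == 1].mean(axis=0)    # prototype of the positive class
    p_minus = X[y == -1].mean(axis=0)  # prototype of the negative class
    w = p_plus - p_minus
    b = (np.dot(p_minus, p_minus) - np.dot(p_plus, p_plus)) / 2.0
    return w, b

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (50, 4)), rng.normal(1, 1, (50, 4))])
y = np.array([-1] * 50 + [1] * 50)
w, b = fit_prototype(X, y)
pred = np.sign(X @ w + b)  # equivalent to assigning the nearest prototype
```

Expanding ‖x − p+‖² < ‖x − p−‖² shows that nearest-prototype assignment is exactly the linear rule sign(⟨w|x⟩ + b) with this w and b.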
The expression for w can be rewritten as a linear combination of the patterns xi, and thus Prot can be viewed as a classifier in dual form. Note that due to the homogeneity of the faces in the MPI face database (Graf & Wichmann, 2002), this classifier is likely to be close to the “best” possible prototype classifier. The popularity of prototype classification has led to several variants. For instance, the general context model (Palmeri, 2001; Nosofsky, 1991) is a classifier where instead of computing ‖x − p±‖ as for the prototype classifier, the quantity ∑i|yi=±1 ‖x − xi‖ is used for classification. Moreover, the Fisher linear discriminant classifier FLD (Fisher, 1936) is a whitened variant of the prototype classifier. Indeed, the FLD weight vector can be written as w = Sw⁻¹(p+ − p−), where Sw = S+ + S− and S± = ∑i|yi=±1 |xi − p±⟩⟨xi − p±| is the within-class covariance matrix of the positive and negative data, respectively (Duda et al., 2001), the notation |·⟩⟨·| standing for the outer product of two vectors. Consequently, if we disregard the constant offset b, we can write the decision
function as ⟨w|x⟩ = ⟨Sw⁻¹(p+ − p−)|x⟩ = ⟨Sw^(−1/2)(p+ − p−)|Sw^(−1/2)x⟩, which is a prototype classifier using the prototypes p± after whitening the space with Sw^(−1/2). Finally, we may mention that FLD is prone to overfitting when considering fewer patterns p than dimensions n, which is the case for us: p = 152 ≤ n = 200. This makes FLD not suited as a classifier for our studies. An extension of prototype classification is to consider for each class multiple “prototypes” computed, for instance, using the K-means clustering algorithm (Duda et al., 2001). By combining these prototypes with a nearest-neighbor classifier, we obtain the K-means classifier Kmean. The number of means K is assumed to be the same for both classes, and its value is determined using cross-validation. The SH obtained here is piecewise linear, and Kmean represents the family of piecewise linear SH algorithms. Every portion of the SH of Kmean is computed using the Prot algorithm, which makes Kmean a classifier in dual form “by parts.” The extension of the prototype algorithm to a multiprototype one has been suggested by Edelman (1995) in the context of his “chorus of prototypes” approach, which cannot be directly applied to our study. Our Kmean classifier is close in spirit, however.

4 Classification Errors of Man and Machine

First we assess the classification errors of humans and machines using cross-validation, a method involving multiple training and testing sets, which allows us to estimate the generalization ability of the classifiers. Second, we show that for the particular task we chose, training on the entire data set using a single training and testing set does not lead to overfitting since the classification errors obtained with and without cross-validation are not significantly different. Finally, we study the training error of the classifiers, which is a measure of how well the classifiers can recreate the subjects’ internal decision boundary for faces.
For humans, the classification error on the true data set is simply obtained by considering the mean and standard deviation over all 55 human subjects of the individual mean classification error computed by comparing the true gender of a stimulus with its estimate. The classification error on the subject data set cannot be computed directly since the subject’s labels are not known beforehand. To obtain this error, we use a method derived from cross-validation where for each stimulus shown to a particular subject, we compute the mean error the other subjects made on this stimulus by defining as an error when the other subjects responded differently than the considered subject did. The classification error on the subject data set is thus computed by treating each subject’s responses in turn as being “correct” and calculating the classification error of all the other subjects by this standard. In other words, we compare the subjects’ gender responses on common stimuli and determine the mean consistency between subjects. We then
Figure 3: (Top) Classification error of humans and machines on both data sets assessed using cross-validation (multiple training and testing sets). (Bottom) Training and classification errors of the machines on both data sets computed without cross-validation (single training and testing set).
compute the mean and standard deviation of this error over all the stimuli presented to that subject. For machines, the mean and standard deviation of the classification error is obtained, for both the true and the subject data sets, using a single five-fold cross-validation on the classification error for the RVM and Prot and a double five-fold cross-validation to determine also the optimal values of C for the SVM and K for Kmean. The mean and standard error over all 55 subjects of the mean and standard deviation of the above “individual” classification errors are computed for both data sets and are shown in the top row of Figure 3. When considering the classification error of humans, we notice that the standard error is smaller for the subject data set than for the true one. This is due to our method of assessing the classification error on the subject data set: it is computed using the consistency between each subject’s responses and the other subjects’ responses on the same set of stimuli. As the subjects’ responses tend to agree—a stimulus whose gender is difficult to assess by one subject is likely to be difficult to classify also by the other subjects—the average gender response over all subjects will also be similar. Hence the
standard error of the subjects’ responses will be small. However, on the true data set, we do not have this “average of average” consistency effect, which is why the standard error of the classification error is larger on the true than on the subject data set. On the true data set, while the classification errors are not significantly different for humans versus the SVM and for humans versus the RVM (the error bars overlap), humans significantly outperform Prot and Kmean. On the subject data set, however, all the machines perform on average worse than humans, at least given our method of assessing the human classification error on the subject data set. This suggests that at least on the subject data set, humans and machines may be using different image features for classification. In all considered cases, the classification error on the subject data set is higher than on the true data set, which suggests that the subjects’ labels make classification more difficult. This may be due to the inherent variability (noise or jitter) in the subjects’ labeling. Prot and Kmean perform much worse than humans on both data sets, suggesting that either humans do not use Prot and Kmean for classification, or they do not use the PCA representation, or none of these. The above results can be compared to those obtained by Graf and Wichmann (2004) where instead of applying PCA directly on the pixel information, PCA was applied to a representation of the faces that uses correspondences between the images such as texture and shape maps (e.g., a nose is mapped to a nose). Although the conclusions were similar, the classification errors of machines are higher in this study, which suggests that a representation using correspondences, that is, an additional amount of information, makes classification an easier task for the machines. There have also been numerous attempts to compare the classification performance of humans and machines in the context of gender classification. 
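The consistency-based error on the subject data set (each subject's responses treated in turn as "correct", with all other subjects scored against them) can be sketched as follows; the response matrix here is a simulated stand-in, not the experimental data:

```python
# Sketch: between-subject consistency as the classification error
# on the subject data set.
import numpy as np

rng = np.random.default_rng(3)
# responses[s, i] = subject s's gender estimate (+1 or -1) for stimulus i
responses = np.where(rng.random((5, 10)) < 0.8, 1, -1)

errors = []
for s in range(responses.shape[0]):
    others = np.delete(responses, s, axis=0)
    # fraction of the other subjects' responses disagreeing with subject s
    errors.append((others != responses[s]).mean())
subject_error = np.mean(errors)
```

In the experiment only the stimuli a pair of subjects saw in common would enter this comparison; the toy matrix above assumes full overlap for simplicity.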
Most of them used artificial neural network (ANN) classifiers applied to a PCA representation of the image intensity information. The so-called holons, computed from the PCA representation, were used by Cottrell and Metcalfe (1991) as inputs to an ANN in the EMPATH recognition system to predict the identity, the emotion, and the gender of the face stimuli. This system was shown to classify gender perfectly and to outperform humans for assessing emotion. In Golomb et al. (1991), ANNs were shown to classify gender better than humans, although not by much, using the so-called SEXNET architecture. Contrary to the above findings, in our case the SVMs, although a principled version of ANNs, do not perform significantly better than humans, which may be due to the fact that we use linear SVMs. Other studies (Gray et al., 1995) using face stimuli at different resolutions but without the PCA stage indicate that the gender classification problem seems to be linearly separable since a simple perceptron yielded results similar to a multilayer ANN. We have obtained a similar result: some linear classification algorithms are good models for gender classification, as evidenced by their relatively low classification errors.
A cross-validation scheme involving multiple training and testing sets is useful to assess the generalization ability of a classifier by giving an estimate of its classification error on a given data set. However, training on the entire data set may not always lead to overfitting. In particular, if we can show that the classification error assessed using cross-validation is not significantly different from the one obtained by training on the entire data set and testing on a separate testing set, we can then assert that the classifiers are not overfitting, even if trained on the entire data set. Moreover, we then also have a gain in interpretability of the classification process: training on the entire data set yields a single SH, while cross-validation amounts to using multiple SHs in a piecewise linear manner. In the case of the SVM and Kmean, in order to determine the optimal values of the parameters C and K, respectively, we still need to perform a single 10-fold cross-validation on the classification error. However, the classifiers are still trained on the whole data set using these optimal values, and therefore each classifier has a single SH. We then compute for each subject the mean and standard deviation of the following errors for the various classification algorithms:
- The training error on the true and on the subject data set
- The classification error on the true data set computed using the unseen stimuli with their true labels
- The classification error on the subject data set determined using the unseen stimuli with, as labels, the sign of the mean of the other subjects’ responses for each of these unseen stimuli
The unseen stimuli are the remaining 48 stimuli out of the 200 that have not been seen by the considered subject. These training and testing errors are then averaged, and the standard error is computed over all subjects. The resulting values are shown in the bottom row of Figure 3. We compare the generalization ability of the classifiers when trained once on the entire data set (no cross-validation) or multiple times on parts of it as done for cross-validation by comparing, respectively, the classification errors of the bottom and top rows of Figure 3. Although the classification errors are slightly lower without cross-validation, which may be due to overfitting, these errors do not differ significantly. Moreover, although the errors themselves are slightly changed, their relation among each other is unchanged. Therefore, for the considered task, we do not need to use cross-validation. Most important, the training errors on both data sets are a measure of how well the classifiers can recreate the subjects’ internal decision boundary for face representation. While the SVM and the RVM perform quite well at this task, Prot and Kmean are rather poor candidates. As for the classification errors, the machines have on average more difficulty learning the subject data set than the true one.
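The single-SH procedure (parameter selection by 10-fold cross-validation, then one final training run on all seen stimuli, tested on the unseen ones) can be sketched as follows, with scikit-learn and toy data standing in for the actual stimuli:

```python
# Sketch: pick C by 10-fold cross-validation, refit once on all "seen"
# stimuli (yielding a single SH), then compute training and test errors.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 1, (100, 5)), rng.normal(1, 1, (100, 5))])
y = np.array([-1] * 100 + [1] * 100)
idx = rng.permutation(200)                        # 200 stimuli, shuffled
X_seen, y_seen = X[idx[:152]], y[idx[:152]]       # 152 "seen" stimuli
X_unseen, y_unseen = X[idx[152:]], y[idx[152:]]   # 48 "unseen" stimuli

grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1.0, 10.0]}, cv=10)
grid.fit(X_seen, y_seen)    # refits the best C on all seen stimuli: one SH
train_error = 1.0 - grid.score(X_seen, y_seen)
test_error = 1.0 - grid.score(X_unseen, y_unseen)
```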
Comparing the classification errors of humans and machines mainly describes the input-output mapping of the human brain and of the machine. This shows only what is available in a black-box approach, and, as we may guess, this is not enough to make strong claims about the algorithms that humans actually use to classify visual stimuli. To infer these algorithms from our machine-learning-psychophysics approach, we have to take a closer look at the inner workings of the classification behavior of humans and machines.

5 Rank-Order Relation Between Man and Machine

In this section we investigate the classification behavior of humans using machine learning. For this we study, on a stimulus-by-stimulus basis, the correlations between the average of the subjects’ responses—the subjects’ classification error, the corresponding reaction time (RT), and confidence rating (CR)—for a stimulus x and the average response of the machine, represented by the distance,

δ(x) = (⟨w|x⟩ + b) / ‖w‖,
of this stimulus to the SH of the machine in the PCA space, the averages being computed across all 55 subjects. The metric used to compute the above distance is the common and simple Euclidean 2-norm. The distance δ reflects how the learning machine structured the face space of the subjects. To link machine learning and human classification, we make the following conjecture: the closer a stimulus is to the SH (the smaller |δ|), the harder the classification should be (more errors by the subjects, longer RTs, and lower CRs). The rank-order of both the responses of humans (classification error, RT, and CR) and machines (|δ|) is considered so as to avoid having to specify the precise metric of how to relate humans and machines. If this approach is successful, we can then consider the full metric information given by the responses of humans and machines (see section 6). Since the training errors on the true and the subject data sets do not differ significantly (see the bottom row of Figure 3), we may consider only the SHs obtained using the subject data set. Moreover, only these SHs reflect what we hypothesize to be the internal face representation of the subjects. Hence, for each subject, a “personal” SH is computed using the labels yˆi estimated by this subject. The distance δ between the SH and each stimulus presented to this subject is then computed for each classification algorithm. In the case of Kmean, this distance is computed using the piece of hyperplane constructed using the “prototype” of each class nearest to the considered stimulus. We then assess the correlation between the average classification behavior of humans and machines on a stimulus-by-stimulus basis. For this, we compute, for each stimulus and classifier, the relation
A. Graf, F. Wichmann, H. Bülthoff, and B. Schölkopf
[Figure 4 panel legends — Spearman rank correlations between machine |δ| (computed on the subject data set) and the subjects' responses. Subject error: SVM r = −0.78 ± 0.01, RVM r = −0.71 ± 0.01, Kmean r = −0.53 ± 0.02, Prot r = −0.13 ± 0.02. Subject RT: SVM r = −0.79 ± 0.01, RVM r = −0.72 ± 0.01, Kmean r = −0.57 ± 0.02, Prot r = −0.24 ± 0.02. Subject CR: SVM r = 0.81 ± 0.01, RVM r = 0.73 ± 0.01, Kmean r = 0.55 ± 0.02, Prot r = 0.20 ± 0.02.]
Figure 4: Rank-order analysis on a stimulus-by-stimulus basis between the responses of humans (the classification error, the corresponding RT and CR) and machines (|δ| computed on the subject data set). Both axes range from 1 to 200. In the plots of the top row, the horizontal aggregations are stimuli that have been perfectly classified by the subjects, which translates in a tied-rank analysis into a horizontal line with an offset.
between the absolute value |δ| of the average across all subjects of the distance of that stimulus to the SH and the mean response of the subjects for that stimulus. To assess this correlation, we perform a nonparametric rank-order correlation analysis using the tied-rank of the subject’s response and of |δ| across the set of stimuli by computing Spearman’s rank correlation coefficient r . The mean value of r and its standard deviation are obtained using a bootstrap method by averaging over 1000 random poolings of 90% of the 200 stimuli. Figure 4 shows these rank-order correlation plots relating
Classification of Faces in Man and Machine
humans and machines, each of the 200 scatter points representing one face stimulus. Considering the relation between the rank-order of the subjects' responses and |δ| of the machine, we notice that stimuli far from the SH (high |δ|) are classified more accurately (low subject error), faster (short subject RT), and with higher confidence (high subject CR) than stimuli close to the SH. These rather intuitive trends are present for all classifiers, albeit to different degrees, and illustrate that |δ| may indeed be a good measure to bridge the gap between human psychophysics and machine learning. Given a classifier, if the man-machine correlations are high for one of the subjects' responses, they can be expected to be high also for the other responses, since the subjects' responses are related, as already pointed out in section 2. These rank-order correlations also allow us to get a first hint at the algorithms humans may use to classify visual stimuli. The SVM shows the highest man-machine correlations for all responses. Moreover, it has the lowest training error on the subject data set (see the bottom row of Figure 3). In other words, the SVM can almost perfectly recreate the subjects' internal decision space, and it also gives the best man-machine correlations. The SVM is thus a good candidate for modeling visual gender classification in humans algorithmically. Although the RVM has a slightly higher training error on the subject data set, its good man-machine correlations make it also a good candidate for this enterprise. The prototype classifier shows the lowest man-machine correlations for all responses. Under the assumptions of this study (in particular, no nonlinear preprocessing), a mechanism akin to prototype learning seems to be a poor model of human classification behavior. A piecewise extension of Prot such as Kmean also shows low man-machine correlations and is not nearly as good as the SVM or the RVM.
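The rank-order analysis above can be sketched numerically. The following is our own minimal illustration, not the authors' code: it computes the signed distance δ of (hypothetical) PCA-space stimuli to a linear decision hyperplane, then Spearman's r between |δ| and a simulated subject response via tied ranks, with the bootstrap over random subsamples of 90% of the stimuli. All data and parameters here are synthetic.

```python
import numpy as np

def delta(X, w, b):
    """Signed Euclidean distance delta(x) = (w.x + b) / ||w|| of each
    stimulus (row of X) to the separating hyperplane w.x + b = 0."""
    return (X @ np.asarray(w, float) + b) / np.linalg.norm(w)

def tied_rank(a):
    """Ranks 1..n, with tied values replaced by their mean rank."""
    a = np.asarray(a, float)
    ranks = np.empty(len(a))
    ranks[np.argsort(a)] = np.arange(1, len(a) + 1)
    for v in np.unique(a):
        ranks[a == v] = ranks[a == v].mean()
    return ranks

def spearman(x, y):
    """Spearman's r: Pearson correlation of the tied ranks."""
    return np.corrcoef(tied_rank(x), tied_rank(y))[0, 1]

def bootstrap_spearman(x, y, n_boot=1000, frac=0.9, seed=0):
    """Mean and s.d. of r over random subsamples of frac of the stimuli."""
    rng = np.random.default_rng(seed)
    n = int(frac * len(x))
    rs = [spearman(x[idx], y[idx])
          for idx in (rng.choice(len(x), n, replace=False)
                      for _ in range(n_boot))]
    return float(np.mean(rs)), float(np.std(rs))

# Hypothetical example: a response that decreases monotonically with |delta|
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))              # 200 stimuli in a 5-D PCA space
w, b = rng.normal(size=5), 0.1
d = np.abs(delta(X, w, b))
error = 1.0 / (1.0 + d) + 0.01 * rng.normal(size=200)  # simulated errors
r_mean, r_std = bootstrap_spearman(d, error, n_boot=200)
```

Applying this computation per classifier and per response (error, RT, CR) yields correlation estimates of the kind reported in Figure 4; here the simulated response gives a strongly negative r, as for the subject error.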
It is thus unlikely that humans use this type of piecewise linear decision function. However, we cannot draw any definite conclusions for Prot and Kmean since both classifiers have rather high training errors on the subject data set. Comparing these results to those reported in Graf and Wichmann (2004), we notice that the man-machine correlations are higher in the present study. We may then conclude that using the correspondence information between the face images, although reducing the classification errors as mentioned in section 4, decreases the man-machine correlations. The latter may hint at the fact that a texture-shape correspondence representation may not be used by humans to encode visual information. It further emphasizes that classification performance per se and man-machine correlations are not equivalent measures. At first sight, the result that the correspondence information reduces the man-machine correlations may contradict the one obtained by Hancock, Bruce and Burton (1998), where it is shown that applying PCA on the texture and shape information separately increases the man-machine correlations for face recognition. The setting of the two studies is, however, different: while we focus here on gender classification,
the study by Hancock et al. (1998) mainly considers face recognition. The task performed by humans and machines is thus quite different between the two studies. While correspondence information should improve face recognition by better relating the face stimuli and removing artifacts, it may at the same time also degrade some gender-specific cues that are necessary for gender classification. Furthermore, it is difficult to compare both studies directly because of the difference in the implementation of the preprocessing stage: in the study by Graf and Wichmann (2004), PCA is applied to the concatenation of the texture and shape vectors, while in the study by Hancock et al. (1998), PCA is applied to the texture and shape vectors separately. Our results can also be related to those of Ashby, Boynton, and Lee (1994), where the reaction time RT for the classification of low-level stimuli is shown to decrease with the distance of the stimuli to the "categorization decision bound" (the SH in this study). The RT is also shown to be independent of the distance of the stimuli to the prototypes of each class. In this study, we corroborate those findings and also extend them in three ways. First, we consider the gender estimate and the confidence rating corresponding to the reaction time and find that these responses are related. Second, we investigate different algorithms rooted in machine learning to compute this SH. Third, in the next section, we gain insights into the actual metric humans use for visual gender classification. From the above rank-order correlation studies, we conclude that our data are orderly and that there is structure in the data that the machines can uncover. Even when removing the metric information present in the responses of humans and machines by computing their tied-rank, some distinct trends can be seen in the data, trends that allow us to compare humans and machines.
In the next section, we proceed to a more quantitative analysis of the classification algorithms humans use for gender classification by removing the absolute value and the rank-order operations. We thus assess directly the metric of the internal decision space in humans using machine learning.
6 Metric Relation Between Man and Machine The success of the above rank-order analysis suggests to us that the distance δ of the stimuli to the SH of a classifier is a meaningful measure to compare the classification behavior of humans and machines. Moreover, δ seems to capture more information about man-machine comparisons than the classification error. In this section, we proceed with a metric analysis by relating on a stimulus-by-stimulus basis the probability that a stimulus is classified as male to the distance of this stimulus to the SH for each classifier, this time, however, without taking the rank-order of both quantities and by considering δ instead of |δ|.
The subjects' gender responses ŷ are used to define the mean probability P(ŷ = +1|x) that a stimulus x is classified as male across all 55 subjects. This probability has the characteristic of a smooth psychometric function: it is near 0 for stimuli classified predominantly as female, increases to 1/2 for stimuli where the classification is more difficult, and approaches 1 for stimuli classified mainly as male. This situation is typical of virtually all psychophysical tasks, where human performance is a smooth, monotonic function of task difficulty. If any of the machines has captured more than just the input-output (classification error) mapping of the human subjects, and has instead captured some aspects of the human internal representation for gender classification, then the distance of a face to the SH should reflect the human classification difficulty. Thus, a regression of a monotonic function against the responses of machines δ on the x-axis and the responses of humans P(ŷ = +1|x) on the y-axis should yield a good fit: an averaged psychometric function. We fit that subject-averaged psychometric function to the responses of humans and machines using a constrained maximum-likelihood method (Wichmann & Hill, 2001). The goodness of fit is assessed using the variance explained σ_exp, which compares the amount of information captured from the data by the fitted function to the amount captured by a horizontal fit through the data. A high value of σ_exp indicates a good fit, whereas a low value indicates a poor fit; σ_exp ranges from 1.0 (perfect fit) to 0.0 (no explanatory gain over a horizontal line, that is, no relation between the variables). We fit either a clipped linear, a Weibull, or a logistic function to the data, selecting for each classifier the one that maximizes σ_exp. The plots of Figure 5 relate on a stimulus-by-stimulus basis P(ŷ = +1|x) to δ, which is rescaled to [0, 1] and computed, as in section 5, using the subject data set.
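The idea behind σ_exp can be illustrated with a crude stand-in for the fitting procedure. The sketch below (our own illustration, not the constrained maximum-likelihood fit of Wichmann and Hill, 2001) fits a logistic psychometric function by grid-search least squares and compares its residual error to that of a horizontal line through the mean; all data are synthetic.

```python
import numpy as np

def logistic(d, a, s):
    """Logistic psychometric function of the (rescaled) distance d."""
    return 1.0 / (1.0 + np.exp(-(d - a) / s))

def fit_logistic(d, p):
    """Grid-search least-squares fit; returns (SSE, (a, s))."""
    best = (np.inf, (None, None))
    for a in np.linspace(0.0, 1.0, 41):
        for s in np.linspace(0.02, 0.5, 25):
            sse = float(np.sum((p - logistic(d, a, s)) ** 2))
            if sse < best[0]:
                best = (sse, (a, s))
    return best

def variance_explained(d, p):
    """sigma_exp = 1 - SSE(fit) / SSE(horizontal line through the mean):
    1.0 is a perfect fit, 0.0 no gain over a constant."""
    sse_fit, _ = fit_logistic(d, p)
    sse_flat = float(np.sum((p - p.mean()) ** 2))
    return 1.0 - sse_fit / sse_flat

# Synthetic data: P(yhat = +1 | x) varies smoothly with the distance
d_vals = np.linspace(0.0, 1.0, 50)
probs = logistic(d_vals, 0.5, 0.1)
sigma_exp = variance_explained(d_vals, probs)
```

For data that really follow a smooth sigmoid of δ, as in this synthetic example, σ_exp is close to 1, which is the situation the SVM approaches in Figure 5.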
Since we use a linear preprocessor (PCA), a linear classifier, and the Euclidean norm for the computation of δ, we may expect the linear fit to be the best type of fit. For the SVM, we indeed find that the best-fitting function is a clipped linear regression, whereas for the other classifiers, we require nonlinear sigmoidal functions. The SVM, which has a low training error on the subject data set and also exhibits the highest man-machine correlations in the rank-order analysis, provides here the best fit (highest value of σ_exp) and is also the only one of the studied classifiers that allows a linear interpolation between the responses of humans and machines. The SVM thus again creates the gender classification space for faces closest to that of humans. The RVM has a lower quality of fit (lower value of σ_exp), and the interpolation function follows a Weibull function. It thus seems less appropriate for our purpose, although it is still a possible candidate. Consistent with the previous rank-order results, the prototype classifier exhibits the least structure in the data and consequently also the poorest goodness of fit σ_exp. Its piecewise linear extension Kmean shows slightly more structure but is still far worse than the SVM or the RVM. Similarly to the previous findings of this study, it thus seems unlikely that
[Figure 5 panels — best-fitting function and goodness of fit: SVM: clipped linear, σ_exp = 0.91; RVM: Weibull, σ_exp = 0.86; Kmean: logistic, σ_exp = 0.75; Prot: Weibull, σ_exp = 0.32. Each panel plots P(ŷ = +1|x) against δ on the subject data, both axes ranging from 0 to 1.]
Figure 5: Metric analysis on a stimulus-by-stimulus basis between the response of humans P( yˆ = +1|x) and machines δ (computed on the subject data set and rescaled to [0, 1]).
humans use algorithms based on the concept of prototype to classify the gender of faces.
7 Conclusions Estimating the human internal metric representation of objects and categories is one of the central problems in cognitive psychology, and many previous investigations exist using, for instance, the geometrical relations between objects in feature spaces (Edelman, 1999) or the aftereffects induced in humans by face stimuli (Leopold, O’Toole, Vetter, & Blanz, 2001). There have also been previous attempts to study human and machine classification behavior by comparing, for instance, the generalization ability of humans and machines using the so-called other-race effect (Furl et al., 2002). In this article, we introduced a unified algorithmic approach based on machine learning techniques to gain insights into both the internal decision space of humans and the classification behavior of
humans in the context of a gender classification task of images of human faces. Our research introduces a novel methodology and tests it by applying it to visual gender classification. Understanding the algorithms humans use in classification tasks shows what computations a more biologically realistic model should perform. We hope that our results provide further guidance for the construction of neural population models that could explain human classification behavior on a more microscopic scale (Dayan & Abbott, 2001; Gerstner & Kistler, 2002). However, before dealing with these microscopic aspects, a better knowledge of the macroscopic classification behavior is necessary. In this letter, we hope to have given a framework for studying human classification of visual stimuli quantitatively. The main aspect of our letter is that we study the subjects' internal decision space for face stimuli. First, the input-output characteristics of human and machine gender classification were compared using the classification error as a measure. Second, the classification behavior of humans and machines was related by comparing the rank-orders of the responses of humans (the subjects' classification error with the corresponding reaction time and confidence rating) and machines (the distance of the stimuli to the SH of the machines). First trends in the data were obtained from these rank-order studies: stimuli far from the SH are classified more accurately, faster, and with higher confidence than stimuli closer to the SH. In other words, the distance of stimuli to a hyperplane separating both classes is demonstrated to be a useful measure to compare humans and machines. Third, we considered the full metric information contained in the responses of humans and machines and studied the subjects' internal decision space for gender classification of images of faces.
From this, we concluded that combining a linear preprocessor (PCA) with a linear classifier in a Euclidean metric space gave exceedingly good fits for the SVM: the distance of a face to the SH was an almost perfect predictor of the human classification performance averaged across all our subjects. In contrast, the prototype classifier behaved in the least human-like manner. This finding supports the arguments against the concept of prototype outlined by Földiák (1998). Here we show that more sophisticated algorithms such as the SVM better capture the human internal face space, at least given our gender classification task. Both the rank-order and the metric studies on the subjects' internal decision space for faces gave similar results: the SVM, and to some extent the RVM, are the best candidates to model the classification algorithms in humans, while the prototype classifier as well as its piecewise linear extension Kmean seem to be least adapted for this task. A classification algorithm using the centers of the classes, such as the prototype classifier, thus seems less suited to model human classification behavior than a classifier maximizing the margin between the classes, such as the SVM. In other words, when making decisions about the gender of faces, humans may rely more on androgynous faces that are difficult to classify (such as
the support vectors, that is, stimuli lying on or in the margin stripe) rather than on the prototypical faces that are easy to classify. We have focused here on the classification algorithms using a single preprocessor, PCA. This allowed us to study in depth the algorithmic models for gender classification that humans use. However, the preprocessing stage cannot be ignored in a complete model of visual gender classification. While such a model would be beyond the scope of this letter, our future studies derived from Graf (2004) will include the use of other preprocessors, such as independent component analysis, nonnegative matrix factorization, or Gabor wavelet filters. Acknowledgments We thank C. Wallraven, M. Giese, A. Kohn, M. Jazayeri, J. A. Movshon, and I. Bülthoff for helpful comments and suggestions. A. B. A. G. was supported by a grant from the European Union (IST 2000-29375 COGVIS). In addition, part of this work was supported by the German Research Council (DFG) grant Wi-2103 awarded to F. A. W.
References
Ashby, F., Boynton, G., & Lee, W. (1994). Categorization response time with multidimensional stimuli. Perception and Psychophysics, 55(1), 11–27.
Ashby, F., & Ell, S. (2001). The neurobiology of human category learning. Trends in Cognitive Sciences, 5(5), 204–210.
Bartlett, M., Movellan, J., & Sejnowski, T. (2002). Face recognition by independent component analysis. IEEE Transactions on Neural Networks, 13(6), 1450–1464.
Blanz, V., & Vetter, T. (1999). A morphable model for the synthesis of 3D faces. In Siggraph99 (pp. 187–194). New York: ACM Press.
Calder, A., Burton, A., Miller, P., Young, A., & Akamatsu, S. (2001). A principal component analysis of facial expressions. Vision Research, 41, 1179–1208.
Churchland, P., & Sejnowski, T. (1992). The computational brain. Cambridge, MA: MIT Press.
Cottrell, G., & Metcalfe, J. (1991). EMPATH: Face, emotion, and gender recognition using holons. In D. Touretzky & R. Lippmann (Eds.), Advances in neural information processing systems, 3 (pp. 564–571). San Mateo, CA: Morgan Kaufmann.
Dailey, M., Cottrell, G., Padgett, C., & Adolphs, R. (2002). EMPATH: A neural network that categorizes facial expressions. Journal of Cognitive Neuroscience, 14(8), 1158–1173.
Dayan, P., & Abbott, L. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Duda, R., Hart, P., & Stork, D. (2001). Pattern classification (2nd ed.). New York: Wiley.
Edelman, S. (1995). Representation, similarity, and the chorus of prototypes. Minds and Machines, 5, 45–68.
Edelman, S. (1999). Representation and recognition in vision. Cambridge, MA: MIT Press.
Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188.
Földiák, P. (1998). What is wrong with prototypes. Behavioral and Brain Sciences, 21(4), 471–472.
Freund, Y., & Schapire, R. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In P. M. B. Vitányi (Ed.), Second European Conference on Computational Learning Theory (pp. 23–37). New York: Springer.
Furl, N., Phillips, P., & O'Toole, A. (2002). Face recognition algorithms and the other-race effect: Computational mechanisms for a developmental contact hypothesis. Cognitive Science, 26, 797–815.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models: Single neurons, populations, plasticity. Cambridge: Cambridge University Press.
Golomb, B., Lawrence, D., & Sejnowski, T. (1991). SEXNET: A neural network identifies sex from human faces. In D. Touretzky & R. Lippmann (Eds.), Advances in neural information processing systems, 3 (pp. 572–577). San Mateo, CA: Morgan Kaufmann.
Graepel, T., Herbrich, R., & Williamson, R. (2001). From margin to sparsity. In T. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 210–216). Cambridge, MA: MIT Press.
Graf, A. (2004). Classification and feature extraction in man and machine. Unpublished doctoral dissertation, Max Planck Institute for Biological Cybernetics.
Graf, A., Smola, A., & Borer, S. (2003). Classification in a normalized feature space using support vector machines. IEEE Transactions on Neural Networks, 14(3), 597–605.
Graf, A., & Wichmann, F. (2002). Gender classification of human faces. In H. H. Bülthoff, S.-W. Lee, T. A. Poggio, & C. Wallraven (Eds.), Biologically Motivated Computer Vision, LNCS 2525 (pp. 491–501). New York: Springer.
Graf, A., & Wichmann, F. (2004). Insights from machine learning applied to human visual classification. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16 (pp. 905–912). Cambridge, MA: MIT Press.
Gray, M., Lawrence, D., Golomb, B., & Sejnowski, T. (1995). A perceptron reveals the face of sex. Neural Computation, 7(6), 1160–1164.
Hancock, P., Bruce, V., & Burton, A. (1998). A comparison of two computer-based face recognition systems with human perceptions of faces. Vision Research, 38, 2277–2288.
Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
LeCun, Y., Bottou, L., Orr, G., & Müller, K.-R. (1998). Efficient backprop. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade, LNCS 1524. New York: Springer-Verlag.
Leopold, D., O'Toole, A., Vetter, T., & Blanz, V. (2001). Prototype-referenced shape encoding revealed by high-level aftereffects. Nature Neuroscience, 4(1), 89–94.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. New York: Freeman.
Mel, B. (1997). SEEMORE: Combining color, shape and texture histogramming in a neurally inspired approach to visual object recognition. Neural Computation, 9(4), 777–804.
Nosofsky, R. (1991). Tests of an exemplar model for relating perceptual classification and recognition memory. Journal of Experimental Psychology: Human Perception and Performance, 17(1), 3–27.
O'Toole, A., Abdi, H., Deffenbacher, K., & Valentin, D. (1993). Low-dimensional representation of faces in higher dimensions of the face space. Journal of the Optical Society of America A, 10(3), 405–411.
O'Toole, A., Deffenbacher, K., Valentin, D., McKee, K., Huff, D., & Abdi, H. (1998). The perception of face gender: The role of stimulus structure in recognition and classification. Memory and Cognition, 26, 146–160.
O'Toole, A., Phillips, P., Cheng, Y., Ross, B., & Wild, H. (2000). Face recognition algorithms as models of human face processing. In Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition. Piscataway, NJ: IEEE Computer Society Press.
O'Toole, A., Vetter, T., & Blanz, V. (1999). Three-dimensional shape and two-dimensional surface reflectance contributions to face recognition: An application of three-dimensional morphing. Vision Research, 39, 3145–3155.
Palmeri, T. (2001). The time course of perceptual categorization. In U. Hahn & M. Ramscar (Eds.), Similarity and categorization. New York: Oxford University Press.
Poggio, T., Rifkin, R., Mukherjee, S., & Niyogi, P. (2004). General conditions for predictivity in learning theory. Nature, 428, 419–422.
Reed, S. (1972). Pattern recognition and categorization. Cognitive Psychology, 3, 382–407.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025.
Riesenhuber, M., & Poggio, T. (2002). Neural mechanisms of object recognition. Current Opinion in Neurobiology, 12, 162–168.
Rolls, E., & Deco, G. (2002). Computational neuroscience of vision. New York: Oxford University Press.
Rosch, E., Mervis, C., Gray, W., Johnson, D., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382–439.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408.
Schapire, R., Freund, Y., Bartlett, P., & Lee, W. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5), 1651–1686.
Schölkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Sirovich, L., & Kirby, M. (1987). Low-dimensional procedure for the characterization of human faces. Journal of the Optical Society of America A, 4(3), 519–524.
Tipping, M. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.
Valentin, D., Abdi, H., Edelman, B., & O'Toole, A. (1997). Principal component and neural network analyses of face images: What can be generalized in gender classification? Journal of Mathematical Psychology, 41, 398–413.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V. (2000). The nature of statistical learning theory (2nd ed.). New York: Springer.
Wichmann, F., & Hill, N. (2001). The psychometric function: I. Fitting, sampling and goodness-of-fit. Perception and Psychophysics, 63(8), 1293–1313.
Wickens, T. (2002). Elementary signal detection theory. New York: Oxford University Press.
Williams, C., & Barber, D. (1998). Bayesian classification with gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1342–1351.
Received December 27, 2004; accepted June 1, 2005.
LETTER
Communicated by Shun-ichi Amari
Exploring Latent Structure of Mixture ICA Models by the Minimum β-Divergence Method Md. Nurul Haque Mollah
[email protected] Department of Statistical Science, Graduate University for Advanced Studies, Minato-ku, Tokyo 106-8569, Japan
Mihoko Minami
[email protected]
Shinto Eguchi
[email protected] Institute of Statistical Mathematics and the Graduate University for Advanced Studies, Minato-ku, Tokyo 106-8569, Japan
Independent component analysis (ICA) attempts to extract original independent signals (source components) that are linearly mixed in a basic framework. This letter discusses a learning algorithm for the separation of different source classes in which the observed data follow a mixture of several ICA models, where each model is described by a linear combination of independent and nongaussian sources. The proposed method is based on a sequential application of the minimum β-divergence method to separate all source classes sequentially. The proposed method searches for the recovering matrix of each class on the basis of a rule of sequential change of the shifting parameter. If the initial choice of the shifting parameter vector is close to the mean of a data class, then all of the hidden sources belonging to that class are recovered properly, with independent and nongaussian structure, treating the data in other classes as outliers. The value of the tuning parameter β is a key to the performance of the proposed method. A cross-validation technique is proposed as an adaptive selection procedure for the tuning parameter β for this algorithm, together with applications for both real and synthetic data analysis. 1 Introduction Blind source separation (BSS) by independent component analysis (ICA; Hyvärinen, Karhunen, & Oja, 2001) has been applied in solving various signal processing problems, including speech enhancement, telecommunications, and medical signal processing. ICA attempts to recover the original sources that have independent and nongaussian structure from observable linearly mixed data. In the classical ICA model, all source signal vectors belong Neural Computation 18, 166–190 (2006)
© 2005 Massachusetts Institute of Technology
to only one source class S, and all mixed signal vectors belong to the same class in the entire data space D. However, in practice, these source vectors may originate from several source classes, and the corresponding mixed signal vectors belong to several classes in the data space. In this case, the performance of classical ICA may not be so good. Therefore, Lee, Lewicki, and Sejnowski (2000) proposed ICA mixture models by modeling the observed data as a mixture of several mutually exclusive classes, each described by linear combinations of independent, nongaussian densities. However, one problem encountered when applying this method is that the number of classes c should be known in advance, which is difficult in practice (McLachlan & Peel, 2000). We assume that source vectors come from c source classes {S_1, S_2, . . . , S_c} and that the corresponding mixed signal vectors belong to c different data classes {D_1, D_2, . . . , D_c} in the entire data space D, where the number c is unknown. In addition, we assume that the data class D_k occurs in the entire data space D due to the source vectors that originate from the source class S_k, (k = 1, 2, . . . , c). In other words, source class S_k is hidden as the data class D_k in the entire data space D. In practice, the occurrence order of a mixed signal vector in the entire data space D from a source class is unknown. However, we can assume that an unobservable mixed signal vector z_{jk} ∈ D_k = {z_{jk} ; j = 1, 2, . . . , n_k}, (k = 1, 2, . . . , c; Σ_{k=1}^{c} n_k = n) follows an ICA model as

z_{jk} = A_k s_{jk} + b_k ,    (1.1)
where A_k is an m × m nonsingular mixing matrix, b_k is the bias vector, and s_{jk} ∈ S_k = {s_{jk} ; j = 1, 2, . . . , n_k}, (k = 1, 2, . . . , c) is the jth random vector in the source class k with zero mean vector, the components of which are assumed to be independent and nongaussian. However, in a practical situation, an observable mixed signal vector x_t ∈ D = {x_t ; t = 1, 2, . . . , n} is obtained as one vector of ∪_{k=1}^{c} D_k = {z_{jk} ; j = 1, 2, . . . , n_k , k = 1, 2, . . . , c; Σ_{k=1}^{c} n_k = n} such that D = ∪_{k=1}^{c} D_k. If the permutation of {z_{11}, z_{12}, . . . , z_{jk}, . . . , z_{n_c c}} into {x_1, x_2, . . . , x_n} is purely random, then equation 1.1 reduces to the ICA mixture models. In the ICA mixture models, the observed data in each class are considered to be a linear combination of independent and nongaussian sources. (See Lee & Lewicki, 2000; Lee et al., 2000; Lee, 2001, for a detailed discussion.) When the data in each class are modeled as multivariate gaussian, the model is known as a gaussian mixture model. One problem with the existing method is that it cannot recover hidden classes properly when c is unknown or misspecified. In this situation, our proposed method is a sequential application of the minimum β-divergence method (cf. Minami & Eguchi, 2002) to extract all hidden classes sequentially based on a rule of step-by-step change of the shifting parameter.
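The generative model of equation 1.1 and the ideal recovery can be illustrated with synthetic data. The sketch below is our own illustration, with hypothetical mixing matrices and Laplacian (nongaussian) sources: it draws two ICA classes z = A_k s + b_k, shuffles them into one data set D, and checks that the exact recovering pair W_k = A_k^{-1}, µ_k = A_k^{-1} b_k maps class-k points back to their sources.

```python
import numpy as np

rng = np.random.default_rng(7)

def make_ica_mixture(n_per_class, mixers, biases):
    """Mixture of ICA models: class k draws z = A_k s + b_k with
    independent, nongaussian (here Laplacian) source components s."""
    data, labels, sources = [], [], []
    for k, (A, b) in enumerate(zip(mixers, biases)):
        S = rng.laplace(size=(n_per_class, A.shape[0]))
        data.append(S @ A.T + b)
        labels.append(np.full(n_per_class, k))
        sources.append(S)
    X, y, S = np.vstack(data), np.concatenate(labels), np.vstack(sources)
    perm = rng.permutation(len(y))       # random occurrence order in D
    return X[perm], y[perm], S[perm]

A1 = np.array([[2.0, 0.5], [0.3, 1.0]])  # hypothetical mixing matrices
A2 = np.array([[1.0, -0.4], [0.6, 1.5]])
b1, b2 = np.zeros(2), np.array([10.0, 10.0])
X, y, S = make_ica_mixture(100, [A1, A2], [b1, b2])

# Ideal recovery of the first class (k = 0 here): y_t = W x_t - mu
W = np.linalg.inv(A1)
mu = W @ b1
Y = X[y == 0] @ W.T - mu                 # recovers the class-0 sources
```

In practice W_k and µ_k are of course unknown and are estimated by the minimum β-divergence method described below; this sketch only verifies the algebra of the model.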
Later, we propose a stopping rule for repeated application of the minimum β-divergence method based on the cumulative weight. In order to recover the kth hidden class, we estimate a recovering matrix W_k for A_k^{-1} and a shifting parameter µ_k for A_k^{-1} b_k, based on the minimum β-divergence method, initializing µ_k by a vector x_0 ∈ D_k, which transforms the mixed signal vector x_t ∈ D into a new signal vector y_t, (t = 1, 2, . . . , n) by

y_t = W_k x_t − µ_k ,    (1.2)

such that y_t ∈ Ŝ_k = {ŝ_{jk} ; j = 1, 2, . . . , n_k} if x_t ∈ D_k, and y_t ∈ D* otherwise,
where ŝ_{jk} is the estimate of the source vector s_{jk} = A_k^{-1}(z_{jk} − b_k). Here, Ŝ_k is the kth recovered class, whose component vectors are classified from the data class D_k, and D* is the set corresponding to the unclassified data points. If W_k is properly obtained, then ŝ_{jk} is equal to s_{jk}, except for an arbitrary scaling of each signal component and the permutation of the indices. An appropriate value of the tuning parameter β is a key to the proposed method. Therefore, an adaptive selection procedure is proposed for the tuning parameter β. Section 2 reviews the minimum β-divergence method, and section 3 describes the proposed method for exploring the hidden class. In section 4, we discuss a selection method for the tuning parameter β. Finally, section 5 presents numerical examples, and section 6 presents the conclusions of this study. 2 Minimum β-Divergence Method Several estimators for ICA can be considered as being derived through the framework of maximum likelihood estimation, with various choices for density functions. In other words, the estimators are the minimizers of the Kullback-Leibler (K-L) divergence between the empirical distribution and a certain form of density function, for example, Jutten and Hérault's (1991) heuristic approach, entropy maximization (Bell and Sejnowski, 1995), minimization of cross-cumulants (Cardoso & Souloumiac, 1993), approximation of mutual information by Gram-Charlier expansion, the natural gradient approach (Amari, Cichocki, & Yang, 1996), and others. Amari and Cardoso (1997) showed that the estimation functions of this type of estimator are unbiased provided that the means of the original signals are zeros. However, this type of estimator is not robust to outliers. Minami and Eguchi (2002) proposed a robust BSS method by minimizing β-divergence (Eguchi & Kano, 2001). This method is referred to as the minimum β-divergence method, and the corresponding estimator is referred to as the minimum
Exploring Latent Structure for ICA by Beta-Divergence
169
β-divergence estimator. Next, we review the basic formulation of the minimum β-divergence method. The β-divergence between two probability density functions $p(x)$ and $q(x)$ is defined as

$$D_\beta(p, q) = \int \left\{ \frac{1}{\beta}\left[p^\beta(x) - q^\beta(x)\right] p(x) - \frac{1}{\beta + 1}\left[p^{\beta+1}(x) - q^{\beta+1}(x)\right] \right\} dx, \quad \text{for } \beta > 0,$$

which is nonnegative and equal to zero if and only if $p(x) = q(x)$ (cf. Minami & Eguchi, 2002). We note that the β-divergence reduces to the K-L divergence as $\beta \longrightarrow 0$, that is,

$$\lim_{\beta \downarrow 0} D_\beta(p, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx = D_{\mathrm{KL}}(p, q).$$
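As a quick numerical check of these two properties (nonnegativity, vanishing iff $p = q$, and the K-L limit), the definition above can be evaluated on a grid. This is an illustrative sketch only; the grid, the two gaussian densities, and the simple Riemann-sum integration are our own choices, not part of the paper.

```python
import numpy as np

def beta_divergence(p, q, x, beta):
    """D_beta(p, q) by Riemann-sum integration of the defining integrand
    on a uniform grid x; p and q are density values on that grid."""
    integrand = (p**beta - q**beta) * p / beta \
        - (p**(beta + 1) - q**(beta + 1)) / (beta + 1)
    return integrand.sum() * (x[1] - x[0])

def kl_divergence(p, q, x):
    """K-L divergence on the same grid."""
    return (p * np.log(p / q)).sum() * (x[1] - x[0])

# two gaussian densities as a concrete example
x = np.linspace(-10.0, 10.0, 20001)
gauss = lambda t, m, s: np.exp(-(t - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
p, q = gauss(x, 0.0, 1.0), gauss(x, 1.0, 1.5)

d_half = beta_divergence(p, q, x, beta=0.5)     # nonnegative
d_small = beta_divergence(p, q, x, beta=1e-4)   # approaches the K-L divergence
kl = kl_divergence(p, q, x)
```

For β = 0.5 the value is strictly positive and vanishes when the two densities coincide, and for β near zero it agrees closely with the K-L divergence.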
Next, suppose that an observed signal vector $x$ is a linear transformation of a vector $s$ whose components are independent of each other. Then there exist a matrix $W$ and a shifting parameter vector $\mu$ such that the components of $y = Wx - \mu$ are independent of each other. Thus, the joint density of $y$ can be expressed as the product of marginal density functions $q_1, q_2, \ldots, q_m$ by

$$q(y) = \prod_{i=1}^m q_i(y_i),$$

and the joint density function of $x$ can be expressed as

$$r(x, W, \mu) = |\det(W)| \prod_{i=1}^m q_i(w_i x - \mu_i), \tag{2.1}$$

where $w_i$ is the $i$th row vector of $W$ and $\mu_i$ is the $i$th component of $\mu$. The β-divergence between the density of a recovered signal vector and the product of the marginal densities (if they were known) would attain the minimum value of zero if and only if the recovered signals are independent of each other. The minimum β-divergence method is an estimating procedure based on the empirical β-divergence $\hat D_\beta(\tilde r, r_0(\cdot, W, \mu))$ between the empirical distribution $\tilde r$ of $x$ and $r_0(x, W, \mu)$, rather than the unknown density expressed by equation 2.1, where

$$r_0(x, W, \mu) = |\det(W)| \prod_{i=1}^m p_i(w_i x - \mu_i). \tag{2.2}$$
170
M. Mollah, M. Minami, and S. Eguchi
Here, $p_i$ is a specific density form rather than the unknown density $q_i$: for example, $p_i(z) = c_1 \exp(-c_2 z^4)$ for subgaussian signals and $p_i(z) = c_2/\cosh(z)$ for supergaussian signals. Moreover, the switching scheme of the extended infomax ICA (Lee, Girolami, & Sejnowski, 1999) between subgaussian and supergaussian densities can be adopted if the nongaussianities of the source signals are unknown. The minimum β-divergence method finds the minimizer of the empirical β-divergence $\hat D_\beta(\tilde r, r_0(\cdot, W, \mu))$. This minimization is equivalent to maximizing the following quasi β-likelihood function:

$$L_\beta(W, \mu) = \frac{1}{n} \sum_{t=1}^n l_\beta(x_t; W, \mu), \tag{2.3}$$

where

$$l_\beta(x; W, \mu) = \begin{cases} \log r_0(x, W, \mu), & \text{for } \beta = 0, \\[4pt] \dfrac{1}{\beta}\, r_0^\beta(x, W, \mu) - b_\beta(W) - \dfrac{1-\beta}{\beta}, & \text{for } 0 < \beta < 1, \end{cases} \tag{2.4}$$

and

$$b_\beta(W) = \frac{1}{\beta+1} \int r_0^{\beta+1}(x, W, \mu)\, dx = \frac{|\det(W)|^\beta}{\beta+1} \prod_{i=1}^m \int p_i^{\beta+1}(z_i)\, dz_i.$$
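The quasi β-likelihood can be made concrete in one dimension. The sketch below assumes the reading of equations 2.3 and 2.4 given above, namely $l_\beta = \beta^{-1} r_0^\beta - b_\beta(W) - (1-\beta)/\beta$, together with the normalized supergaussian model $p(z) = 1/(\pi \cosh z)$; the grid-based integral for $b_\beta$ and all parameter values are illustrative choices of ours.

```python
import numpy as np

def p(z):
    """Normalized supergaussian model density, 1/(pi cosh z)."""
    return 1.0 / (np.pi * np.cosh(z))

def b_beta(w, beta, grid):
    """b_beta(W) = |det W|^beta / (beta+1) * int p^{beta+1}(z) dz  (m = 1)."""
    dz = grid[1] - grid[0]
    return abs(w)**beta / (beta + 1) * (p(grid)**(beta + 1)).sum() * dz

def l_beta(x, w, mu, beta, grid):
    """Quasi beta-log-likelihood of equation 2.4 in the scalar case."""
    r0 = abs(w) * p(w * x - mu)
    if beta == 0.0:
        return np.log(r0)
    return r0**beta / beta - b_beta(w, beta, grid) - (1.0 - beta) / beta

grid = np.linspace(-40.0, 40.0, 200001)
x = np.array([-1.0, 0.3, 2.5])
near_zero = l_beta(x, 1.2, 0.1, 1e-6, grid)   # beta close to 0
log_lik = l_beta(x, 1.2, 0.1, 0.0, grid)      # exact log-likelihood branch
```

With this constant, $l_\beta \to \log r_0$ as $\beta \downarrow 0$, so the β = 0 branch is the ordinary log-likelihood.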
The estimating functions (derivatives of $l_\beta(x; W, \mu)$) are given by

$$F_1(x, W, \mu) = r_0^\beta(x, W, \mu)\left[I_m - h(Wx - \mu)(Wx)^T\right] W^{-T} - \beta\, b_\beta(W)\, W^{-T}, \tag{2.5}$$

$$F_2(x, W, \mu) = r_0^\beta(x, W, \mu)\, h(Wx - \mu), \tag{2.6}$$

and the estimating equations are as follows:

$$\frac{1}{n} \sum_{t=1}^n r_0^\beta(x_t, W, \mu)\left(I_m - h(Wx_t - \mu)(Wx_t)^T\right) W^{-T} - \beta\, b_\beta(W)\, W^{-T} = O, \tag{2.7}$$

$$\frac{1}{n} \sum_{t=1}^n r_0^\beta(x_t, W, \mu)\, h(Wx_t - \mu) = 0, \tag{2.8}$$
Exploring Latent Structure for ICA by Beta-Divergence
171
where $h(y) = (h_1(y_1), \ldots, h_m(y_m))^T$ and

$$h_i(y_i) = -\frac{d}{dy_i} \log p_i(y_i) = -\frac{p_i'(y_i)}{p_i(y_i)}.$$
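For the two model densities used later in section 5, $p_i(z) \propto \exp(-z^4/4)$ (subgaussian) and $p_i(z) \propto 1/\cosh(z)$ (supergaussian), this score function has the familiar closed forms $h(y) = y^3$ and $h(y) = \tanh(y)$. The finite-difference check below is our own sketch confirming those forms.

```python
import numpy as np

h_sub = lambda y: y**3          # -d/dy log exp(-y^4/4) = y^3
h_super = lambda y: np.tanh(y)  # -d/dy log(1/cosh y)   = tanh(y)

def h_numeric(log_p, y, eps=1e-6):
    """Central finite-difference approximation of -d log p / dy."""
    return -(log_p(y + eps) - log_p(y - eps)) / (2 * eps)

y = np.linspace(-2.0, 2.0, 9)
check_sub = h_numeric(lambda z: -z**4 / 4, y)
check_super = h_numeric(lambda z: -np.log(np.cosh(z)), y)
```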
In the estimating function, the multiplicative term

$$r_0^\beta(x, W, \mu) \propto \prod_{i=1}^m p_i^\beta(w_i x - \mu_i) \tag{2.9}$$

can be considered a weight function that assigns a weight to each data point. The weight of each possible outlier is reduced to approximately zero by this function, which is the key to robust BSS by the minimum β-divergence method. Since the β-divergence with β = 0 is equivalent to the K-L divergence, the minimum β-divergence estimator with β = 0 is equivalent to the estimator derived from the K-L divergence with explicitly included shift parameters. The minimum β-divergence estimator is locally consistent, as is the method derived from the K-L divergence (Minami & Eguchi, 2002).

3 New Proposal for Exploring Latent Structure

Lee et al. (2000) proposed a method for extracting all hidden classes simultaneously from a mixture of ICA models using the maximum likelihood method. In this section, we propose an iterative algorithm for the same purpose based on sequential application of the minimum β-divergence method. The proposed method explores the recovering matrix of each class on the basis of the initial condition of the shifting parameter µ. If the initial value of the shifting parameter is close to the mean of the $k$th class, then the estimates for the recovering matrix $W_k$ and the shifting parameter $\mu_k$ can be obtained for this class by treating the data in the other classes as outliers. Thus, we can estimate $\{(W_k, \mu_k);\ k = 1, 2, \ldots, c\}$, recovering all hidden classes sequentially, by repeated application of the minimum β-divergence method with a rule for the step-by-step change of the shifting parameter µ. In order to create such a rule, let us consider the weight function φ,

$$\phi(x, W, \mu) = \prod_{i=1}^m p_i^\beta(w_i x - \mu_i). \tag{3.1}$$
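The robustness mechanism behind this weight is easy to see numerically: with the subgaussian model $p_i(z) = \exp(-z^4/4)$, a point near the current shift µ keeps a weight near 1, while a point from a distant class is suppressed to essentially zero. The specific $W$, µ, β, and test points below are illustrative choices of ours.

```python
import numpy as np

def phi(x, W, mu, beta):
    """Weight function (3.1) with p_i(z) = exp(-z^4/4)."""
    z = W @ x - mu
    return float(np.prod(np.exp(-z**4 / 4)**beta))

W, mu, beta = np.eye(2), np.zeros(2), 0.5
w_inlier = phi(np.array([0.3, -0.5]), W, mu, beta)   # point near mu
w_outlier = phi(np.array([4.0, 4.0]), W, mu, beta)   # point from a distant class
```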
172
M. Mollah, M. Minami, and S. Eguchi
Next, we describe the sequential estimating procedure for $\{(W_k, \mu_k);\ k = 1, 2, \ldots, c\}$ based on a rule for the step-by-step change of the shifting parameter µ.

Step 1. Set the initial value $\hat W_0$ for the recovering matrix $W$ to the identity matrix, and set the initial value $\hat\mu_0$ for µ to any one vector $x_t \in D$. Find the estimates for $W$ and µ by the minimum β-divergence method using these initial values. Let the obtained estimates be denoted as $\hat W^{(1)}$ and $\hat\mu^{(1)}$, respectively.

Step k. Suppose that $(k-1)$ pairs of estimates,

$$\left(\hat W^{(1)}, \hat\mu^{(1)}\right), \left(\hat W^{(2)}, \hat\mu^{(2)}\right), \ldots, \left(\hat W^{(k-1)}, \hat\mu^{(k-1)}\right),$$

have been obtained sequentially in steps 1 to $(k-1)$. Set the initial value $\hat W_0$ for the recovering matrix $W$ to the identity matrix. As the initial value for the shifting parameter µ, use the minimizer of the cumulative weight:

$$\hat\mu_0 = \operatorname*{argmin}_{x_t \in D} \sum_{j=1}^{k-1} \phi\left(x_t; \hat W^{(j)}, \hat\mu^{(j)}\right). \tag{3.2}$$

Find the estimates for $W$ and µ by the minimum β-divergence method using these initial values, and denote them as $\hat W^{(k)}$ and $\hat\mu^{(k)}$, respectively.

Accordingly, the desired estimates are

$$\left(\hat W^{(1)}, \hat\mu^{(1)}\right), \left(\hat W^{(2)}, \hat\mu^{(2)}\right), \ldots, \left(\hat W^{(c)}, \hat\mu^{(c)}\right).$$

In order to recover the hidden classes from the observed data, we transform the observed component vector $x_t$ into a new component vector $y_t$ by

$$y_t = \hat W^{(k)} x_t - \hat\mu^{(k)}, \quad t = 1, 2, \ldots, n;\ k = 1, 2, \ldots, c. \tag{3.3}$$

If $\hat W^{(k)}$ and $\hat\mu^{(k)}$ are the estimates for $A_{(k)}^{-1}$ and $A_{(k)}^{-1} b_{(k)}$, respectively, then the class of data points $D^{(k)} = \{x_t \in D : \phi(x_t; \hat W^{(k)}, \hat\mu^{(k)}) \ge \alpha_k\}$ is classified into source class $(k)$. Note that the weight of each unclassified data point is close to zero. In order to separate the recovered signals, we choose the value of $\alpha_k$ heuristically as

$$\alpha_k = \min_{x_t \in D} \phi\left(x_t; \hat W^{(k)}, \hat\mu^{(k)}\right) + \gamma \left[\max_{x_t \in D} \phi\left(x_t; \hat W^{(k)}, \hat\mu^{(k)}\right) - \min_{x_t \in D} \phi\left(x_t; \hat W^{(k)}, \hat\mu^{(k)}\right)\right] \tag{3.4}$$
Exploring Latent Structure for ICA by Beta-Divergence
173
with $0.01 \le \gamma \le 0.05$. Alternatively, one can choose $\alpha_k$ based on percentiles of the densities. The cumulative weighting plot represents the weights of both the classified and the unclassified data points. Thus, the classification procedure can be continued until the remaining unclassified data points are transferred to classified data points, by monitoring the cumulative weighting plot and the value of the termination index $\mathrm{TI} = |J|/n \le 1$ after each step, where $|J|$ is the number of elements in the set

$$J = \left\{ t : \sum_{k=1}^c \phi\left(x_t; \hat W^{(k)}, \hat\mu^{(k)}\right) \ge \alpha \right\}, \tag{3.5}$$

and we choose the value of α heuristically as

$$\alpha = \min_{x_t \in D} \sum_{k=1}^c \phi\left(x_t; \hat W^{(k)}, \hat\mu^{(k)}\right) + \gamma \left[\max_{x_t \in D} \sum_{k=1}^c \phi\left(x_t; \hat W^{(k)}, \hat\mu^{(k)}\right) - \min_{x_t \in D} \sum_{k=1}^c \phi\left(x_t; \hat W^{(k)}, \hat\mu^{(k)}\right)\right].$$
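The bookkeeping of this sequential procedure — rule 3.2 for re-initializing µ, the threshold rule 3.4, and the termination index — can be sketched on toy data. Because a full minimum β-divergence optimizer is beyond a short example, the `fit` below only iterates a weighted-mean update of µ with $W$ fixed at the identity; that stand-in, and all constants, are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(X, mu, beta=0.5):
    """Weight (3.1) with W = I and p_i(z) = exp(-z^4/4)."""
    Z = X - mu
    return np.exp(-beta * (Z**4 / 4).sum(axis=1))

def fit(X, mu0, iters=20):
    """Stand-in for the minimum beta-divergence fit: a fixed-point
    weighted-mean update of the shift parameter only."""
    mu = mu0
    for _ in range(iters):
        w = phi(X, mu)
        mu = (w[:, None] * X).sum(axis=0) / w.sum()
    return mu

# two well-separated uniform (subgaussian) classes
X = np.vstack([rng.uniform(-1, 1, (200, 2)),
               rng.uniform(-1, 1, (200, 2)) + 8.0])

gamma, classes, cum_w = 0.03, [], np.zeros(len(X))
mu0 = X[0]                                      # step 1: any x_t in D
for step in range(2):
    mu = fit(X, mu0)
    w = phi(X, mu)
    alpha = w.min() + gamma * (w.max() - w.min())   # threshold rule (3.4)
    classes.append(np.where(w >= alpha)[0])
    cum_w += w
    mu0 = X[np.argmin(cum_w)]                   # rule (3.2): least-explained point
TI = np.mean(cum_w >= cum_w.min() + gamma * (cum_w.max() - cum_w.min()))
```

Each pass captures one class, the argmin of the cumulative weight jumps to the still-unexplained class, and the termination index ends up near 1.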
The value $\mathrm{TI} = a \le 1$ indicates that $100a\%$ of the observed data points have been classified into distinct source classes and that the remaining $100(1-a)\%$ remain unclassified as outliers. The classification procedure is terminated when the value of the termination index TI exceeds a certain value; in our simulation study, we terminated the procedure when TI exceeded 0.90. In the following section, we introduce an adaptive selection procedure for the tuning parameter β.

4 Selection Procedure for β

The tuning parameter β is a key to the performance of the proposed method. Minami and Eguchi (2003) proposed an adaptive selection procedure for β, and their procedure is basically followed here. In order to find an appropriate β, we evaluate the estimates obtained with various values of β. There are four aspects involved in evaluating the estimates:

1. Measure for evaluation
2. Generalization scheme
3. Scaling of estimates for the recovering matrix
4. How to decide β

4.1 Measure for Evaluation. We would like to recover a hidden class from the entire data space using the minimum β-divergence method based
on the initial condition of the shifting parameter µ, considering the other classes as outliers. Therefore, the measure used for evaluation should give a good evaluation when a hidden class is recovered, but should not impose an excessive penalty for the existence of outliers. The K-L divergence between the distribution of $x$ and the pseudo model, equation 2.2, or equivalently, the pseudo log likelihood, does not satisfy this condition. We therefore use the β-divergence with a fixed value $\beta_0$ of β as a measure for evaluating estimators for hidden class separation, and define the measure for evaluating the minimum β-divergence estimators $\hat W_\beta$ and $\hat\mu_\beta$ as

$$D_{\beta_0}(\beta) = E\left[\hat D_{\beta_0}\left(\tilde r, r_0\left(\cdot, \hat\Lambda_{\beta,\beta_0} \hat W_\beta, \hat\mu_{\beta,\beta_0}\right)\right)\right], \tag{4.1}$$

where $\hat\Lambda_{\beta,\beta_0}$ and $\hat\mu_{\beta,\beta_0}$ are explained later, $\tilde r$ is the empirical distribution of $x$, $r_0$ is defined in equation 2.2, and the notation $E$ denotes the expectation with respect to the underlying distribution of the data:

$$\hat D_{\beta_0}\left(\tilde r, r_0\left(\cdot, \hat\Lambda_{\beta,\beta_0} \hat W_\beta, \hat\mu_{\beta,\beta_0}\right)\right) = \text{Const.} - \frac{1}{n} \sum_{x \in D} l_{\beta_0}\left(x; \hat\Lambda_{\beta,\beta_0} \hat W_\beta, \hat\mu_{\beta,\beta_0}\right).$$

Here, $l_{\beta_0}(x; \hat\Lambda_{\beta,\beta_0} \hat W_\beta, \hat\mu_{\beta,\beta_0})$ is defined as in equation 2.4, with $\beta = \beta_0$, $W = \hat\Lambda_{\beta,\beta_0} \hat W_\beta$, and $\mu = \hat\mu_{\beta,\beta_0}$.

4.2 Generalization Scheme. The measure $D_{\beta_0}(\beta)$ is a measure of the generalization performance of an estimator, which is related to the prediction capability for independent test data. If we use the same data set to evaluate $D_{\beta_0}(\beta)$ as that used to estimate a recovering matrix, then $D_{\beta_0}(\beta)$ will be underestimated. In a data-rich situation, the best approach is to divide the data set into a small number of subsets and use one of these subsets for estimation and another for evaluation. In other situations, a simple and widely used method of sample reuse is the K-fold cross-validation (CV) method (Hastie, Tibshirani, & Friedman, 2001), which uses some of the available data to find the estimate and different data to test it. For the current problem, we employ the K-fold CV method as a generalization scheme. We split the data into K approximately equal-sized and similarly distributed sections. For the kth section, we find the estimate using the other K − 1 parts of the data and calculate the β0-divergence for the kth section of the data. We then combine the calculated β0-divergence values to obtain the CV estimate.

4.3 Scaling of Estimates for the Recovering Matrix. For the BSS problem, the scaling and shifting of the recovered signals, as well as the scaling of a recovering matrix, are arbitrary because scaling and shifting do not
Table 1: K-Fold Cross-Validation Procedure.

Split the data set $D$ into $K$ parts, $P(1), \cdots, P(K)$. Let $P^{-k} = \{x \mid x \notin P(k)\}$.
For $k = 1, \cdots, K$:
• Estimate $W$ and µ by maximizing $L_\beta(W, \mu)$ using $P^{-k}$:
$$\left(\hat W_\beta, \hat\mu\right) = \operatorname*{argmax}_{W,\, \mu} \sum_{x \in P^{-k}} l_\beta(x; W, \mu).$$
• Estimate $\Lambda_{\beta,\beta_0}$ and $\mu_{\beta,\beta_0}$ by maximizing $L_{\beta_0}(\Lambda \hat W_\beta, \mu)$ using $P^{-k}$:
$$\left(\hat\Lambda_{\beta,\beta_0}, \hat\mu_{\beta,\beta_0}\right) = \operatorname*{argmax}_{\Lambda,\, \mu} \sum_{x \in P^{-k}} l_{\beta_0}\left(x; \Lambda \hat W_\beta, \mu\right).$$
• Compute $\mathrm{CV}(k)$ using $P(k)$:
$$\mathrm{CV}(k) = -\sum_{x \in P(k)} l_{\beta_0}\left(x; \hat\Lambda_{\beta,\beta_0} \hat W_\beta, \hat\mu_{\beta,\beta_0}\right).$$
End
Then, $\hat D_{\beta_0}(\beta) = \frac{1}{n} \sum_{k=1}^K \mathrm{CV}(k)$.
affect independence. However, β-divergences differ with the scaling. That is, for any $\mu_1$ and $\mu_2$,

$$D_{\beta_0}\left(\tilde r, r_0(\cdot, \Lambda W, \mu_1)\right) \ne D_{\beta_0}\left(\tilde r, r_0(\cdot, W, \mu_2)\right) \tag{4.2}$$

in general, where $\Lambda = \operatorname{diag}(\lambda_1, \cdots, \lambda_m)$, unless $\Lambda = I_m$. The scaling and shifting condition for the minimum β-divergence method differs with the value of β. In order to properly evaluate the minimum β-divergence estimates, we need to rescale and shift the estimates under a common condition. For this purpose, we use the scaling and shifting condition for the minimum β-divergence estimator with $\beta = \beta_0$. That is, we rescale the minimum β-divergence estimate $\hat W_\beta$ by the diagonal matrix $\hat\Lambda_{\beta,\beta_0}$ and use the shift parameter $\hat\mu_{\beta,\beta_0}$ for evaluation, where $\hat\Lambda_{\beta,\beta_0}$ and $\hat\mu_{\beta,\beta_0}$ minimize $D_{\beta_0}(\tilde r, r_0(\cdot, \Lambda \hat W_\beta, \mu))$ among diagonal matrices $\Lambda$ and vectors $\mu$. Table 1 summarizes the procedure used to find the K-fold CV estimate $\hat D_{\beta_0}(\beta)$.

4.4 How to Decide β. As a measure of the variation of $\hat D_{\beta_0}(\beta)$, we compute

$$\mathrm{SD}_{\beta_0}(\beta) = \text{the standard error of } \frac{1}{|P(k)|}\, \mathrm{CV}(k),$$
where $|P(k)|$ denotes the number of elements in the kth part of the data, $P(k)$. Plots of $\hat D_{\beta_0}(\beta)$ against β, together with the auxiliary boundary curves $\hat D_{\beta_0}(\beta) \pm \mathrm{SD}_{\beta_0}(\beta)$, help in judging an optimum β, which we denote by $\beta_{\mathrm{opt}}$. Often we have to use the upper auxiliary boundary curve (UABC) together with the curve of $\hat D_{\beta_0}(\beta)$ in order to choose $\beta_{\mathrm{opt}}$. We choose as $\beta_{\mathrm{opt}}$ the smallest β whose evaluated value $\hat D_{\beta_0}(\beta)$ is not larger than the value of the UABC corresponding to the smallest value of $\hat D_{\beta_0}(\beta)$. This rule, known as the one-standard-error rule (Hastie et al., 2001), has no theoretical justification. If the curve of $\hat D_{\beta_0}(\beta)$ is flat over a wide range of β, then there might be only one class with no outliers, and $\beta_{\mathrm{opt}} = 0$. When there is more than one data class, or outliers exist in the entire data space, the typical shapes of the curves of $\hat D_{\beta_0}(\beta)$ that enable us to choose an appropriate value of β are elbow and dipper shapes. If the curve does not have these shapes, we increase the value of $\beta_0$; if these shapes do not appear for any $\beta_0$, then there might be only one class with no outliers, and $\beta_{\mathrm{opt}} = 0$ (Minami & Eguchi, 2003).

5 Numerical Examples

We investigated the performance of the proposed procedure for recovering the hidden classes of mixture ICA models using both synthetic and real data sets. For the simulations, we generated the following data sets by formula 1.1 using different mixing matrices $A_k$ and bias vectors $b_k$:

Data set 1: Two-dimensional, two-class mixtures (see Figure 1a) generated with uniform (subgaussian) independent sources; 200 samples were drawn from each class to make 400 samples in total.

Data set 2: Two-dimensional, two-class mixtures (see Figure 2a) generated with Laplace (supergaussian) independent sources; 200 samples were drawn from each class to make 400 samples in total.

Data set 3: Five-dimensional, two-class mixture generated with uniform independent sources; 200 samples were generated from each class to make 400 samples in total. Plots of pairs of the observed signals are shown in Figure 3 (top).

Data set 4: Two-dimensional, six-class mixtures generated with uniform independent sources; 200 samples were drawn from each class. We also added 20 two-dimensional random vectors (*) from a gaussian class, arranged as sample points 1001 to 1020, to make 1220 sample points in total. Figure 5a shows the observed values.

Data set 5: Two-dimensional, two-class mixtures shown in Figures 7d and 7e. One class is the mixture of a sinusoid signal (see Figure 7a) and gaussian noise (see Figure 7c, the first half), and the other class is the mixture of a sawtooth signal (see Figure 7b) and gaussian noise (see Figure 7c, the last half).
Data set 6: Two-dimensional, two-class mixtures of voices and music noise (see Figures 8c and 8d). The sample size is 100,000 in total; the first and third quarters of the samples are the mixture of the voice of person 1 and background music noise, and the second and last quarters are the mixture of the voice of person 2 and background music noise.

In the following simulation study, we used $p_i(z) = \exp(-z^4/4)$ for subgaussian signals and $p_i(z) = 1/\cosh(z)$ for supergaussian signals in estimation. For convenience of presentation, the samples in data sets 1 to 5 were ordered by class. However, we did not use any information on the sample order in estimation, so the estimation results would be the same even if the samples were randomly ordered. We used a quasi-Newton method with the BFGS update (Nocedal, 1992) for the minimum β-divergence method and the other optimization problems.

5.1 Simulation with Randomly Generated Synthetic Data. Data sets 1, 2, and 3 are randomly generated synthetic data sets, each containing two hidden classes. Figures 1, 2, 3, and 4 depict the observed signals and the estimation results for these data sets. In the plots of both the observed and the recovered signals, one class is represented by the symbol "." and the other by the symbol "×." For the selection of β, we computed $\hat D_{\beta_0}(\beta)$ for β varying from 0 to 1 in steps of 0.1, with $\beta_0 = 0.3$, 0.6, and 0.9, using the tenfold CV algorithm given in Table 1. In the plots of $\hat D_{\beta_0}(\beta)$, the asterisks (*) are $\hat D_{\beta_0}(\beta)$, and the smallest value is indicated by a circle around the asterisk; the dotted lines are $\hat D_{\beta_0}(\beta) \pm \mathrm{SD}_{\beta_0}(\beta)$. For data set 1 (uniform independent sources), we used $\beta_0 = 0.3$ for the selection of β at step 1 because the plot of $\hat D_{\beta_0}(\beta)$ with $\beta_0 = 0.3$ (see Figure 1d) shows an elbow shape. By the one-standard-error rule, we chose β = 0.3. $\hat D_{\beta_0}(\beta)$ with $\beta_0 = 0.6$ and 0.9 (see Figures 1e and 1f) had the same property as with $\beta_0 = 0.3$, and these also suggested β = 0.3.
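The one-standard-error rule applied here is easy to state in code: take the smallest β whose CV score does not exceed the upper auxiliary boundary value at the minimizer. The β grid and the CV/standard-error numbers below are synthetic, invented only to exercise the rule.

```python
import numpy as np

def one_standard_error_rule(betas, D, SD):
    """Smallest beta whose CV score is within one standard error
    of the smallest CV score (the UABC value at the minimizer)."""
    best = np.argmin(D)
    ok = D <= D[best] + SD[best]
    return betas[np.argmax(ok)]   # argmax of a boolean array -> first True

betas = np.round(np.arange(0.0, 1.01, 0.1), 1)
D = np.array([.020, .019, .013, .0125, .0124, .0126,
              .0124, .0125, .0127, .0126, .0128])   # elbow-shaped, synthetic
SD = np.full_like(D, .0004)
beta_opt = one_standard_error_rule(betas, D, SD)
```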
At step 2, the plot of $\hat D_{\beta_0}(\beta)$ with $\beta_0 = 0.3$ (see Figure 1g) is flat for small β and has a sudden increase at a certain point, so it cannot be used for the selection of β. Therefore, we used $\beta_0 = 0.6$ (see Figure 1h) for the selection of β, for the same reason as in step 1, and chose β = 0.6. $\hat D_{\beta_0}(\beta)$ with $\beta_0 = 0.9$ (see Figure 1i) had the same shape as that with $\beta_0 = 0.6$, and it also suggested β = 0.6. Figures 1b and 1c show the signals recovered by the estimates $(\hat W^{(1)}, \hat\mu^{(1)})$ obtained at step 1 and $(\hat W^{(2)}, \hat\mu^{(2)})$ obtained at step 2, respectively. We observe that one hidden class is properly recovered by $(\hat W^{(1)}, \hat\mu^{(1)})$ and the other by $(\hat W^{(2)}, \hat\mu^{(2)})$. Figures 1j and 1k show the weight of each data point corresponding to the estimates $(\hat W^{(1)}, \hat\mu^{(1)})$ and $(\hat W^{(2)}, \hat\mu^{(2)})$, respectively. One can see that at each step of the estimation of $W$ and µ, one class of data was used and the other class was totally ignored by the weight function, equation 3.1. The value of the termination index (TI) was 0.99 when the sequential recovering procedure was terminated. The arrows in Figure 1a represent the estimated mixing matrices $\hat A_k$ and bias vectors $\hat b_k$ found by the algorithm, and these
Figure 1: For data set 1. (a) Observed values. (b, c) Recovered values by $(\hat W^{(1)}, \hat\mu^{(1)})$ and $(\hat W^{(2)}, \hat\mu^{(2)})$, respectively. (d–f, g–i) $\hat D_{\beta_0}(\beta)$ with $\beta_0 = 0.3$, 0.6, and 0.9 at steps 1 and 2, respectively. (j, k) Weight for each sample point at steps 1 and 2, respectively.
parameters matched the parameters used to generate the data for each class. The arrows in Figures 1b and 1c represent the centers of the recovered classes.

For data set 2 (Laplace independent sources), we used $\beta_0 = 0.6$ for the selection of β at step 1 because the plot of $\hat D_{\beta_0}(\beta)$ with $\beta_0 = 0.6$ (see Figure 2e) shows an elbow shape, while that with $\beta_0 = 0.3$ (see Figure 2d) is flat for small β and has a sudden increase around 0.4, so it cannot be used for the selection of β. By the one-standard-error rule, we chose β = 0.5 from $\hat D_{\beta_0}(\beta)$ with $\beta_0 = 0.6$. $\hat D_{\beta_0}(\beta)$ with $\beta_0 = 0.9$ had the same shape as that with $\beta_0 = 0.6$, and it also suggested β = 0.5. At step 2, we again used $\beta_0 = 0.6$ (or $\beta_0 = 0.9$) for the selection of β, for the same reason (see Figures 2h and 2i), and chose β = 0.6. Figures 2b and 2c show the signals recovered by the estimates $(\hat W^{(1)}, \hat\mu^{(1)})$ obtained at step 1 and $(\hat W^{(2)}, \hat\mu^{(2)})$ obtained at step 2, respectively. We see that one hidden class is properly recovered by the estimate $(\hat W^{(1)}, \hat\mu^{(1)})$ and the other by $(\hat W^{(2)}, \hat\mu^{(2)})$. Figures 2j and 2k display the weight of each data point corresponding to the estimates $(\hat W^{(1)}, \hat\mu^{(1)})$ and $(\hat W^{(2)}, \hat\mu^{(2)})$, respectively. Again, at each step of the estimation of $W$ and µ, one class of data was used and the other class was totally ignored by the weight function, equation 3.1. The value of TI was 0.92 when the sequential recovering procedure was terminated.

To investigate the performance of the proposed procedure for high-dimensional data, we analyzed the five-dimensional subgaussian (uniform) data of data set 3, which consists of two classes. In the projections of the observed data onto two-dimensional coordinates, the two classes overlap, as shown in Figure 3 (top). For the estimation of the recovering matrix at step 1, we chose β = 0.2 because all of the $\hat D_{\beta_0}(\beta)$ with $\beta_0 = 0.3$, 0.6, and 0.9 (see Figures 3a to 3c) have an elbow shape, and β = 0.2 is suggested by all of them. At step 2, we chose β = 0.2 as in the previous step, using Figures 3d to 3f. Figure 4 (top and middle) shows the values recovered by the estimates $(\hat W^{(1)}, \hat\mu^{(1)})$ obtained at step 1 and $(\hat W^{(2)}, \hat\mu^{(2)})$ obtained at step 2, respectively. We observe that one hidden class is properly recovered by $(\hat W^{(1)}, \hat\mu^{(1)})$ and the other by $(\hat W^{(2)}, \hat\mu^{(2)})$. Figures 4a and 4b (bottom) show the weight for each data point at steps 1 and 2, respectively. As in the previous examples, at each step one class of data was used for estimation and the other class was totally ignored by the weight function. The value of TI was 0.99 when the sequential recovering procedure was terminated.

To demonstrate the validity of the proposed method for mixtures of several classes, we considered the two-dimensional, seven-class mixture of synthetic data shown in Figure 5a. The original independent sources are uniform random numbers in six classes; the remaining class consists of 20 two-dimensional gaussian random numbers (*). The data classes corresponding to the uniform random numbers are represented by the symbols ∇, o, ×, +, and ".". For this data set, we used $\beta_0 = 1.2$ for the selection of the values of the tuning parameter β. Figures 5j to 5p depict the values of $\hat D_{\beta_0}(\beta)$ with $\beta_0 = 1.2$ for steps 1 to 7, respectively. We chose β = 0.8 for steps 1 to
Figure 2: For data set 2. (a) Observed values. (b, c) Recovered values by $(\hat W^{(1)}, \hat\mu^{(1)})$ and $(\hat W^{(2)}, \hat\mu^{(2)})$, respectively. (d–f, g–i) $\hat D_{\beta_0}(\beta)$ with $\beta_0 = 0.3$, 0.6, and 0.9 at steps 1 and 2, respectively. (j, k) Weight for each sample point at steps 1 and 2, respectively.
Figure 3: For data set 3. (Top) Observed values. (a–c, d–f) $\hat D_{\beta_0}(\beta)$ with $\beta_0 = 0.3$, 0.6, and 0.9 at steps 1 and 2, respectively.
6 and β = 0.9 for step 7, based on the one-standard-error rule. Figures 5b to 5h show the plots of the classes recovered by the estimated recovering matrices and shifting vectors at steps 1 to 7, respectively. Figures 5b to 5e, 5g, and 5h show that each estimated recovering matrix at steps 1, 2, 3, 4, 6, and 7 recovers independent sources for one of the subgaussian classes, while Figure 5f shows that the estimated recovering matrix at step 5 does not recover independent sources for any class, since the shifting vector was initialized in the gaussian class (see Figure 6e) at this step and ICA cannot separate gaussian signals. Figure 5i shows the estimated structure of data set 4. The origins of the arrows are located at $\hat\mu^{(i)}$, and the directions are the columns of
Figure 4: For data set 3. (Top) Recovered values by $(\hat W^{(1)}, \hat\mu^{(1)})$. (Middle) Recovered values by $(\hat W^{(2)}, \hat\mu^{(2)})$. (Bottom) (a, b) Weight for each sample point at steps 1 and 2, respectively.
$\hat W_{(i)}^{-1}$. The structure of the ICA mixture data was properly estimated. Figures 6a to 6g show the weight for each data point corresponding to the estimates $(\hat W^{(i)}, \hat\mu^{(i)})$, $i = 1, \ldots, 7$, and Figures 6h to 6n show the cumulative weight after each step from 1 to 7, respectively. Figure 6n shows that
Figure 5: For data set 4. (a) Observed values. (b–h) Recovered values by $(\hat W^{(i)}, \hat\mu^{(i)})$ for $i = 1, \cdots, 7$, respectively. (i) Independent directions. (j–p) $\hat D_{\beta_0}(\beta)$ with $\beta_0 = 1.2$ at steps 1 to 7, respectively.
Figure 6: For data set 4. (a–g) Weight for each sample point at steps 1 to 7, respectively. (h–n) Cumulative weight at steps 1 to 7, respectively.
the cumulative weights corresponding to the gaussian data points are negligible in comparison with those corresponding to the nongaussian data points. Therefore, the data belonging to the gaussian class were treated as outliers at each step by the proposed algorithm. As in the previous examples, at each step one class of data was used for estimation and the other classes were totally ignored by the weight function. The value of TI was 0.97 when the sequential recovering procedure was terminated.

5.2 Simulation with Artificial and Natural Signals. Data sets 5 and 6 were generated with artificial and natural signals, respectively. Both were used to investigate the performance of the proposed procedure for automatic context switching in the BSS problem, which was first introduced by Lee et al. (2000). There are two hidden classes in data set 5. One class is a mixture of sinusoid signals (see Figure 7a) and gaussian noise (see Figure 7c, the first half), and the other class is a mixture of sawtooth signals (see Figure 7b) and gaussian noise (see Figure 7c, the last half). Although there are three different source signals, at any given moment only two source signals were linearly mixed, and two mixed signals were observed (see Figures 7d and 7e). We chose β = 0.45 and 0.5 by K-fold CV, with the same procedure as in the previous examples, for steps 1 and 2, respectively. Figures 7f and 7g, and 7h and 7i, show the signals recovered by the estimated recovering matrices at steps 1 and 2, respectively. The sinusoid signals were recovered by the estimated recovering matrix at step 1, and the sawtooth signals by that at step 2. The value of TI was 0.92 when the sequential recovering procedure was terminated.

For data set 6, let us imagine that two students are talking to each other while listening to music in the background, with two microphones placed somewhere in the room to record the conversation. The conversation alternates, so that person 1 talks while person 2 listens, then person 1 listens to person 2, and so on. In this case, the voice of person 1 is mixed with the background music signal by a mixing matrix $A_1$ and bias vector $b_1$, whereas the voice of person 2 is mixed with the background music signal by a mixing matrix $A_2$ and bias vector $b_2$. Although there are three different source signals, only two source signals are mixed in the observed data at any given moment. The original signals and the observed mixed signals are shown in Figures 8a and 8b, and 8c and 8d, respectively. We chose β = 0.45 and 0.4 by K-fold CV, with the same procedure as in the previous examples, for steps 1 and 2, respectively. Figures 9a and 9b show the signals recovered by the estimate obtained at step 1, and Figures 9c and 9d show those recovered by the estimate obtained at step 2. Figures 9e and 9f show the weights of the sample points for the estimation at steps 1 and 2, respectively. The voice of person 1 was recovered by the estimate obtained at step 1; the weights of the mixed signals containing the voice of person 2 were almost zero for this estimation. By the estimate obtained at step 2, the voice of person 2 was recovered, and the
M. Mollah, M. Minami, and S. Eguchi
Figure 7: For data set 5. (a–c) Original signals, where (a) sinusoid signals, (b) sawtooth signals, and (c) gaussian noises. (d–e) Mixed signals. (f–g) Recovered signals by (Ŵ(1), µ̂(1)). (h–i) Recovered signals by (Ŵ(2), µ̂(2)). (j–k) Weight for each sample point at steps 1 and 2, respectively.
Exploring Latent Structure for ICA by Beta-Divergence
Figure 8: For data set 6. (a) Voice signals, where rectangular boxes represent the voice of person 1, and the rest are the voice of person 2. (b) Music signal. (c) Mixed signal 1. (d) Mixed signal 2.
weights for the mixed signals of the voice of person 1 were almost zero in this estimation. Figures 9g and 9h show the recovered voice conversation and background music noise, respectively; the scale of the recovered signals has been changed so that they can be compared directly with the original voice conversation and music noise. The value of TI was 0.91 when the sequential recovering procedure was terminated.
6 Conclusions This article proposed a one-by-one hidden class separation algorithm based on the minimum β-divergence method for ICA mixture models. The proposed procedure searches the recovering matrix of each class on the basis of the initial conditions of the shifting parameter. If the initial value of the shifting parameter vector µ belongs to a data class, then the minimum β-divergence estimator finds the estimates of the recovering matrix and shifting parameter for this class. In order to obtain estimates of the recovering matrix and the shifting parameter for other data classes, the initial value of the shifting parameter is changed according to the observed vector having the minimum cumulative weight. Using the proposed method, all hidden classes can be explored sequentially from the entire
Figure 9: For data set 6. (a–b) Recovered signals by (Ŵ(1), µ̂(1)). (c–d) Recovered signals by (Ŵ(2), µ̂(2)). (e–f) Weight for each sample point at steps 1 and 2, respectively. (g–h) Recovered voice conversation and background music noise, respectively.
data space. We suggested a termination index for the proposed method based on the cumulative weight. On the basis of our simulation results, the value of the TI should be greater than 0.90 to terminate the classification procedure.
The value of the tuning parameter β is key to the performance of the proposed method. We used the adaptive selection procedure for β proposed by Minami and Eguchi (2003). The β-divergence Dβ0(·) with fixed β0 was used as the measure for evaluating candidate values of β; Dβ0(β) for different values of β was estimated by K-fold cross-validation. This procedure is summarized in Table 1. In our simulation, we used fixed density functions for estimation by the minimum β-divergence method. However, the method can be modified with the same switching scheme between sub- and supergaussian distributions employed by the extended infomax algorithm (Lee et al., 1999). The main purpose of the proposed method is similar to that of the conventional ICA mixture models proposed by Lee et al. (2000). The procedure they proposed finds the estimates for all mixing matrices and shifting parameters simultaneously, whereas the method proposed here finds the estimate for each recovering matrix and shifting parameter sequentially. The proposed algorithm always converged within 80 iterations for the estimation of a recovering matrix, whereas the mixture ICA algorithm of Lee et al. converges after 300 to 500 iterations for the simultaneous estimation of all recovering matrices. However, for the recovery of all hidden classes, the computational cost may be similar for both methods because the proposed method requires cross-validation. The procedure proposed by Lee et al. (2000) may be simpler than the method proposed here when the number of hidden classes, c, is known. However, if the number of hidden classes is unknown or misspecified, their method fails to find good estimates. We applied their procedure to data set 4, described in section 4, with c = 3, 7 (the exact number), and 12. Their procedure successfully estimated the mixing matrices when c = 7 but failed to find good estimates when c = 3 or 12. Unsupervised classification might be one of the most important applications of the mixture ICA model.
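The cross-validated selection of β can be sketched in miniature. The following is a hedged illustration of our own, not the authors' implementation: it selects β for a robust univariate location estimate (a minimum β-divergence fit under a unit-variance gaussian model) by K-fold cross-validation, scoring candidates with the empirical β0-divergence loss for a fixed β0. All function names are ours.

```python
import numpy as np

def beta_mean(x, beta, iters=50):
    # Fixed-point iteration for the minimum beta-divergence location
    # estimate under a unit-variance gaussian model: points far from mu
    # are exponentially down-weighted, which gives robustness to outliers.
    mu = np.median(x)
    for _ in range(iters):
        w = np.exp(-beta * (x - mu) ** 2 / 2.0)
        mu = np.sum(w * x) / np.sum(w)
    return mu

def select_beta(x, betas, beta0=0.1, k=5, seed=0):
    # K-fold CV: fit mu on the training folds for each candidate beta,
    # then score on the held-out fold with the empirical beta0-divergence
    # loss (up to constants, -mean exp(-beta0 * (x - mu)^2 / 2)).
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    losses = []
    for b in betas:
        loss = 0.0
        for fold in folds:
            train = np.setdiff1d(idx, fold)
            mu = beta_mean(x[train], b)
            loss -= np.mean(np.exp(-beta0 * (x[fold] - mu) ** 2 / 2.0))
        losses.append(loss / k)
    return betas[int(np.argmin(losses))]

# 10% of the data is an outlying (second) class; a positive beta ignores it,
# while beta = 0 (plain maximum likelihood) is pulled toward the outliers.
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0.0, 1.0, 450), rng.normal(10.0, 1.0, 50)])
best = select_beta(x, [0.0, 0.1, 0.3, 0.5])
```

The robust evaluation loss is what makes the selection meaningful here: plain held-out squared error would reward fitting the outliers.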
In section 3, we proposed a sequential classification procedure carried out at the same time as the sequential extraction of hidden classes. However, once all estimates of the hidden class structures are obtained, one may use the Bayes rule for simultaneous classification of observations. The scaling of sources needed to compute class probabilities can be obtained by the method described in section 3.3 as (Ŵβ0, µ̂β0), that is, the scaling when β0 = 0. When the classes do not overlap much, the sequential classification method and the Bayes rule will give similar results. The case in which classes overlap heavily remains difficult for the proposed method as well as for the model-based classification by Lee et al. (2000).
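Bayes-rule classification of the kind mentioned above can be illustrated with a deliberately simple toy of our own (two one-dimensional gaussian classes with known densities and priors; this is not the letter's ICA mixture setup): each point is assigned to the class with the largest posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical 1-D classes with known means, scales, and priors.
mus = np.array([0.0, 5.0])
sigmas = np.array([1.0, 1.0])
priors = np.array([0.5, 0.5])
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])
labels = np.repeat([0, 1], 200)

# Bayes rule: argmax over classes of log p(x | class) + log prior.
logpost = (-0.5 * ((x[:, None] - mus) / sigmas) ** 2
           - np.log(sigmas) + np.log(priors))
pred = np.argmax(logpost, axis=1)
accuracy = np.mean(pred == labels)
```

With well-separated classes, as assumed here, this agrees with the sequential assignment; heavy overlap is where both approaches degrade.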
Acknowledgments We thank the anonymous reviewers for their detailed comments and questions that improved the quality of the presentation of this letter.
References

Amari, S., & Cardoso, J. F. (1997). Blind source separation—Semi-parametric statistical approach. IEEE Trans. on Signal Processing, 45, 2692–2700.
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind source separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Cardoso, J. F., & Souloumiac, A. (1993). Blind beamforming for non-gaussian signals. IEE Proceedings-F, 140, 362–370.
Eguchi, S., & Kano, Y. (2001). Robustifying maximum likelihood estimation (Research Memorandum 802). Tokyo: Institute of Statistical Mathematics.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York: Springer.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Jutten, C., & Hérault, J. (1991). Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10.
Lee, T.-W. (2001). Independent component analysis: Theory and applications. Norwell, MA: Kluwer.
Lee, T.-W., Girolami, M., & Sejnowski, T. J. (1999). Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neural Computation, 11, 417–441.
Lee, T.-W., & Lewicki, M. S. (2000). The generalized gaussian mixture model using ICA. In Proc. International Workshop on Independent Component Analysis (ICA'00) (pp. 239–244). Helsinki.
Lee, T.-W., Lewicki, M. S., & Sejnowski, T. J. (2000). ICA mixture models for unsupervised classification of non-gaussian classes and automatic context switching in blind signal separation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22, 1078–1089.
McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: Wiley.
Minami, M., & Eguchi, S. (2002). Robust blind source separation by β-divergence. Neural Computation, 14, 1859–1886.
Minami, M., & Eguchi, S. (2003). Adaptive selection for minimum β-divergence method. In Proceedings of the ICA-2003 Conference. Nara, Japan.
Nocedal, J. (1992). Theory of algorithms for unconstrained optimization. Cambridge: Cambridge University Press.
Received August 19, 2004; accepted May 19, 2005.
LETTER
Communicated by Aapo Hyvärinen
An Adaptive Method for Subband Decomposition ICA Kun Zhang
[email protected]
Lai-Wan Chan
[email protected] Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong
Subband decomposition ICA (SDICA), an extension of ICA, assumes that each source is represented as the sum of some independent subcomponents and dependent subcomponents, which have different frequency bands. In this article, we first investigate the feasibility of separating the SDICA mixture in an adaptive manner. Second, we develop an adaptive method for SDICA, namely band-selective ICA (BS-ICA), which finds the mixing matrix and the estimate of the source-independent subcomponents. This method is based on the minimization of the mutual information between outputs. Some practical issues are discussed. For better applicability, a scheme to avoid the high-dimensional score function difference is given. Third, we investigate one form of the overcomplete ICA problems with sources having specific frequency characteristics, which BS-ICA can also be used to solve. Experimental results illustrate the success of the proposed method for solving both SDICA and the overcomplete ICA problems.

1 Introduction

Independent component analysis (ICA) is a statistical and computational technique to extract independent signals given only observed data that are assumed to be mixtures of some independent sources (Hyvärinen, Karhunen, & Oja, 2001). In many situations, the hidden factors that underlie the observations are statistically independent, so they can be revealed by ICA. ICA is the most widely used method for blind source separation (BSS). In particular, ICA has been successfully applied to speech separation and enhancement, biomedical signal (e.g., EEG, MEG, fMRI) analysis and processing (Vigário, 1997; Vigário, Jousmäki, Hari, & Oja, 1998; Makeig, Bell, Jung, & Sejnowski, 1996), wireless communications (Ristaniemi & Joutsensalo, 1999), and financial time series (Kiviluoto & Oja, 1998). In addition, as a representation technique for the observed data, it is useful for feature extraction and image processing (Olshausen & Field, 1996; Hyvärinen, Hoyer, & Oja, 2001).
Neural Computation 18, 191–223 (2006)
© 2005 Massachusetts Institute of Technology
K. Zhang and L.-W. Chan
However, the independence property of sources may not hold in some real-world situations, especially in biomedical signal processing and image processing, and therefore standard ICA cannot give the expected results. Some extended data models have been developed to relax the independence assumption of the standard ICA model, such as multidimensional ICA (Cardoso, 1998), independent subspace analysis (Hyvärinen & Hoyer, 2000), topographic ICA (Hyvärinen, Hoyer et al., 2001), and tree-dependent component analysis (Bach & Jordan, 2003). In this letter, we focus on the subband decomposition ICA (SDICA) model (Cichocki & Georgiev, 2003; Tanaka & Cichocki, 2004; Cichocki & Zurada, 2004). This model considerably relaxes the assumption of mutual independence between the original sources by assuming that the wide-band source signals are generally dependent but that some narrow-band subcomponents of the sources are independent. In general, to solve the SDICA problem, we first apply a filter to the observations that passes the frequency bands of the independent subcomponents and then apply standard ICA algorithms to the filtered signals. In Cichocki et al. (2003) and Cichocki and Georgiev (2003), the subband is selected by a priori knowledge or by comparing some simple statistical measures (such as the ℓp-norm with p = 1 or 0.5, or the kurtosis) of the subband signals. These selection methods seem arbitrary if we do not have prior information on the subband of the independent subcomponents. In Tanaka and Cichocki (2004) and Cichocki and Zurada (2004), an additional assumption that at least two groups of subcomponents are statistically independent is incorporated, so that the true mixing matrix can be recovered. This assumption may help solve some practical BSS problems. However, it does not necessarily hold, and even with this assumption, determining the optimal subband division remains a problem.
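The subband-comparison idea can be made concrete with a small illustration of our own (not taken from the cited works): a signal whose low band is a shared sinusoid and whose high band is supergaussian noise is split with a crude moving-average filter, and the kurtosis of each subband is compared to flag the candidate independent subband.

```python
import numpy as np

def excess_kurtosis(v):
    # Sample excess kurtosis: 0 for gaussian, positive for supergaussian.
    v = v - v.mean()
    return np.mean(v**4) / np.mean(v**2) ** 2 - 3.0

rng = np.random.default_rng(0)
t = np.arange(4000)
low_band = np.sin(2 * np.pi * t / 400)   # dependent, low-frequency part
high_band = rng.laplace(size=t.size)     # independent, supergaussian part
signal = low_band + high_band

# Crude subband split: moving average as low-pass, residual as high-pass.
w = 51
lp = np.convolve(signal, np.ones(w) / w, mode="same")
hp = signal - lp

# The strongly supergaussian subband (hp) would be selected here; the
# sinusoid-dominated subband (lp) has negative excess kurtosis.
```

This only illustrates the statistic being compared; as the text notes, such selection is arbitrary without prior knowledge of where the independent subcomponents live.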
Moreover, these methods cannot recover the source-independent subcomponents. In some cases, the source-independent subcomponents are of interest and the dependent subcomponents should be removed. For example, the dependent subcomponents may come from the power supply, which yields a sinusoidal interference (Tanaka & Cichocki, 2004). In this letter, we discuss the validity of the existing methods for SDICA and the conditions for separability of the SDICA model. We then propose an adaptive method for SDICA, called band-selective ICA (BS-ICA). This method can automatically select the frequency band in which the subcomponents of the original sources are most independent, and consequently the mixing matrix and a filtered version of the source-independent subcomponents can be estimated. To do so, we apply a linear filter to each observation, followed by a linear demixing stage. The parameters of the filter and the demixing matrix are adjusted by minimizing the mutual information between the outputs. By incorporating a penalty term, prior knowledge on the independent subcomponents can be taken into account.
The overcomplete model is also considered in this letter. In this model, the number of the original independent sources is greater than that of the observations, so standard ICA algorithms cannot cope with this problem. We are concerned with the overcomplete ICA problems in which there exists a subset of sources such that each source in this subset has some frequency band outside the frequency bands of the sources not in this subset. The relationship between such overcomplete ICA problems and SDICA is addressed. Based on this relationship, BS-ICA can also be exploited to solve this kind of overcomplete ICA problem.

The letter is organized as follows. In section 2, the SDICA model is introduced. Section 3 discusses the separability of the SDICA model and details BS-ICA, the adaptive method for solving the SDICA problem. Some practical issues are also discussed. In section 4, by investigating the relationship between SDICA and the overcomplete ICA model, we propose to use the BS-ICA method to solve one form of the overcomplete ICA problems. We then solve some SDICA and overcomplete ICA problems with our method and discuss the experimental results in section 5. Section 6 draws the conclusion.

2 SDICA Model

The linear instantaneous ICA is the basic ICA model. In this model, we have some observable variables x = [x1, x2, . . . , xm]^T. They are assumed to be linear mixtures of some mutually statistically independent variables s = [s1, s2, . . . , sn]^T,

x = As = Σ_{i=1}^{n} ai si,    (2.1)

where the m × n matrix A = [a1, a2, . . . , an] is called the mixing matrix, and it is assumed that at most one of the sources si is gaussian distributed. In the ICA model, A and s are both unknown. By using some linear transform, the task of ICA is to find the signals y = [y1, y2, . . . , yn]^T,

y = Wx,    (2.2)

which are as mutually independent as possible such that they provide an estimate of s. The n × m matrix W is called the demixing matrix. If the original sources s are exactly recovered up to the scaling and permutation indeterminacies, then WA, the product of W and A, should be a generalized permutation matrix, which can be denoted by PD, where P is a permutation matrix and D is a nonsingular diagonal matrix. According to the relation between the source number n and the sensor number m, the ICA problem is divided into three cases: complete ICA (m = n), undercomplete ICA (m > n), and overcomplete ICA (m < n). For
the first two cases, under the additional assumption that the mixing matrix A is of full column rank, the ICA problem can be solved in a variety of ways (see, e.g., Comon, 1994; Cardoso & Souloumiac, 1993; Bell & Sejnowski, 1995; Amari, Cichocki, & Yang, 1996; Hyvärinen & Oja, 1997; Ye, Zhu, & Zhang, 2004). In general, overcomplete ICA is much harder to solve. In this model, even if the mixing matrix A has been estimated, the sources cannot be recovered exactly, and in practice, some additional information, such as the probability densities of the sources, is needed for solving the problem. The overcomplete ICA model is discussed in section 4.

A powerful extension and generalization of the basic ICA model is subband decomposition ICA (SDICA). It considerably relaxes the independence assumption. The key idea in this model is the assumption that the wide-band source signals can be dependent; however, only some of their narrow-band subcomponents are independent (Cichocki & Georgiev, 2003). In other words, the sources si are not necessarily independent, but each can be represented as the sum of several subcomponents as

si(t) = si,1(t) + si,2(t) + · · · + si,L(t),    (2.3)

where si,k, k = 1, . . . , L, are narrow-band subcomponents, and the subcomponents are mutually independent for only a certain set of k; more precisely, in this article, we assume that the subcomponents with k in this set are spatially independent stochastic sequences.1 The observations are still generated from the sources si according to equation 2.1. Here we assume that the number of sources is equal to that of the observations and that the observations are zero mean. Similar to ICA, the goal of SDICA is to estimate the mixing matrix, the original sources, and the source-independent subcomponents if possible. SDICA can be performed by the structure in Figure 1A. As the first step, we apply a filter h(t) to filter out the dependent subcomponents of the sources. Suppose h(t) exactly allows one independent subcomponent, say the kth subcomponent si,k(t), to pass through. Let s^(k)(t) = [s1,k(t), . . . , sn,k(t)]^T. Then the filtered observations satisfy h(t) ∗ x(t) = A[h(t) ∗ s(t)] = As^(k)(t). Therefore, in the second step, we just need to apply an ICA algorithm to the filtered observations, and we can obtain the demixing matrix W associated with the mixing matrix A. In fact, the idea of temporal filtering has been proposed in the ICA or BSS literature in many scenarios. For example, as a preprocessing step for ICA,
1 Following Liu and Luo (1998), we use the following terminology throughout the article. Let s1 (t) and s2 (t) be two stochastic sequences. Let Si denote the collection of the sequence si (t) over the time, that is, Si = {si (t)|∀t}. Then the stochastic sequences s1 (t) and s2 (t) are said to be spatially independent if any two finite subsets of S1 and S2 are statistically independent.
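As an aside on the recovery condition stated in section 2 — WA should equal PD for a permutation matrix P and a nonsingular diagonal D — this is easy to check numerically. A minimal sketch of our own (the helper name is not from the letter):

```python
import numpy as np

def is_generalized_permutation(m, tol=1e-8):
    # m = P D iff every row and every column contains exactly one entry
    # whose magnitude exceeds the tolerance.
    mask = np.abs(np.asarray(m)) > tol
    return bool(np.all(mask.sum(axis=0) == 1) and np.all(mask.sum(axis=1) == 1))

P = np.array([[0.0, 1.0], [1.0, 0.0]])   # permutation matrix
D = np.diag([2.0, -3.0])                 # nonsingular diagonal matrix
good = is_generalized_permutation(P @ D)                              # True
bad = is_generalized_permutation(np.array([[1.0, 0.5], [0.0, 1.0]]))  # False
```

In practice one applies such a check (or a continuous variant like the Amari performance index) to the estimated WA to judge separation quality.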
Figure 1: (A) The structure to perform SDICA. h(t) is a linear time-invariant filter, and W is the demixing matrix for z(t). Here we assume m = n. (B) The linear instantaneous ICA demixing procedure is shown for comparison.
temporal filtering is used to reduce the effect of noise (with a low-pass filter) or to increase the independence of source signals (with a high-pass filter) (Hyvärinen, Karhunen, et al., 2001; Cichocki & Amari, 2003). In particular, Hyvärinen (1998) proposed applying traditional ICA to the innovation process of the mixture, which is defined as the error of the best prediction and is usually obtained by temporally filtering the mixture. The rationale for using innovation processes is that they are usually more independent from each other and more nongaussian than the original processes. In Cichocki, Rutkowski, and Siwek (2002), assuming the sources have different autocorrelation functions (or, equivalently, different spectral shapes), one can perform blind source extraction by minimizing the error between the extracted signal and its value predicted by an adaptive filter. Moreover, for extracting narrow-band signals in the presence of noise, a bandpass filter with fixed or adjustable center frequency and bandwidth can be exploited instead (Gharieb & Cichocki, 2003; Cichocki & Belouchrani, 2001). In the existing methods for SDICA, the frequency subband in which the source subcomponents are independent is determined either by some a priori information (Cichocki & Georgiev, 2003) or by exploiting the stronger assumption that at least two of the subcomponents are statistically independent (Tanaka & Cichocki, 2004; Cichocki & Zurada, 2004). In practice, ICA is mainly used for BSS, so exact a priori information on h(t) is usually unavailable. The assumption that at least two of the subcomponents are statistically independent is not necessarily true, and the design of the optimal filter h(t) for good performance remains a problem.
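The innovation-process idea can be sketched as follows; the AR order and the signal model are our own choices for illustration, not Hyvärinen's setup. The prediction error of a least-squares linear predictor is both smaller in variance and visibly more nongaussian than the raw AR-filtered signal, which is exactly why it is a better ICA input:

```python
import numpy as np

def innovation(x, p=5):
    # Least-squares fit of an order-p linear predictor; the residual
    # (prediction error) is the estimated innovation process.
    cols = np.column_stack([x[p - k - 1 : len(x) - k - 1] for k in range(p)])
    y = x[p:]
    a, *_ = np.linalg.lstsq(cols, y, rcond=None)
    return y - cols @ a

def excess_kurtosis(v):
    v = v - v.mean()
    return np.mean(v**4) / np.mean(v**2) ** 2 - 3.0

# AR(1) signal driven by supergaussian (laplace) noise: the raw signal is
# close to gaussian by averaging, while its innovation recovers the driver.
rng = np.random.default_rng(0)
e = rng.laplace(size=6000)
x = np.zeros_like(e)
for t in range(1, len(e)):
    x[t] = 0.9 * x[t - 1] + e[t]
innov = innovation(x)
```

Here the predictor is fit by ordinary least squares for simplicity; an adaptive filter, as in the cited extraction methods, serves the same role online.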
It is therefore very useful to develop a method that adaptively estimates the optimal filter h(t), such that the source-independent subcomponents pass through it and the dependent subcomponents are attenuated, so that both the mixing matrix A and the source-independent subcomponents can be recovered. Band-selective ICA (BS-ICA) is such a method.

3 Band-Selective ICA

In this section we first investigate the feasibility of solving the SDICA problem without a priori knowledge of the frequency localizations of the
independent subcomponents. Next, we develop the BS-ICA method, which can adjust the filter h(t) and the matrix W such that the learned h(t) allows only independent subcomponents to pass through and W is the demixing matrix associated with A.

3.1 Separability of SDICA. In section 2 we briefly reviewed the existing methods for SDICA. Now we provide the theoretical foundations that sustain these methods. Furthermore, we discuss why it is possible to adaptively estimate the SDICA separation system {h(t), W} without a priori knowledge of h(t).

Lemma 1. Let s1(t), s2(t), . . . , sn(t) be some spatially independent stochastic sequences. Then h(t) ∗ s1(t), h(t) ∗ s2(t), . . . , h(t) ∗ sn(t) are also independent for any nonzero linear filter h(t).

The proof is straightforward. In fact, h(t) ∗ si(t) is a linear mixture of si(t) at different time indices with constant weights. And s1(t − τ1), s2(t − τ2), . . . , sn(t − τn) are always independent for any values of τi, so h(t) ∗ s1(t), h(t) ∗ s2(t), . . . , h(t) ∗ sn(t) are independent.

Lemma 2. Let si(t), i = 1, . . . , n, be some spatially independent stochastic sequences, and let s(t) = [s1(t), s2(t), . . . , sn(t)]^T. Suppose we have the ICA model x = As. Then we have the ICA model x̃ = As̃, where x̃(t) and s̃(t) are two vectors of stochastic sequences generated by applying any nonzero filter h(t) on x(t) and s(t), respectively.

Let T be the length of the sequences si(t). Denote by X the matrix that contains the observations x(1), . . . , x(T) as its columns, and similarly for S, S̃, and X̃. Since X = AS, and temporal filtering of X corresponds to multiplying X from the right by a Toeplitz matrix containing the elements of h(t) as its rows, we have X̃ = AS̃ (Hyvärinen, Karhunen, et al., 2001, p. 264). Moreover, the components of s̃(t) are mutually independent according to lemma 1. Therefore we have the ICA model x̃ = As̃.
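The Toeplitz-matrix view of temporal filtering invoked in this argument can be verified directly. A small sketch of our own (the row-reversal convention matches numpy's convolution):

```python
import numpy as np

rng = np.random.default_rng(1)
h = np.array([1.0, -0.5, 0.25])      # filter taps h(t)
X = rng.standard_normal((2, 8))      # two observed sequences as rows

# Build the (valid-length) Toeplitz matrix whose rows hold the reversed taps.
L, T = len(h), X.shape[1]
Toep = np.zeros((T - L + 1, T))
for i in range(T - L + 1):
    Toep[i, i : i + L] = h[::-1]

# Right-multiplying by Toep^T filters each row of X, so if X = A S then
# the filtered observations are X Toep^T = A (S Toep^T), i.e. X-tilde = A S-tilde.
filtered = X @ Toep.T
direct = np.array([np.convolve(row, h, mode="valid") for row in X])
```

Since the Toeplitz multiplication acts on the time axis and A acts on the channel axis, the two operations commute, which is the whole content of lemma 2.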
Without loss of generality, in the SDICA model we can merge all the independent subcomponents and all the dependent subcomponents, respectively. Consequently we can rewrite equation 2.3 as

si(t) = si,I(t) + si,D(t),    (3.1)

where si,I(t) denotes the independent subcomponent and si,D(t) denotes the dependent subcomponent. Let sI(t) = [s1,I(t), . . . , sn,I(t)]^T and sD(t) = [s1,D(t), . . . , sn,D(t)]^T. Correspondingly, we can decompose the observation as x(t) = As(t) = AsI(t) + AsD(t) = xI(t) + xD(t), where xI(t) = [x1,I(t), . . . , xn,I(t)]^T and xD(t) = [x1,D(t), . . . , xn,D(t)]^T. Now we make the following assumptions:
Assumption 1. For i = 1, 2, . . . , n, at most one of the filtered independent subcomponents h(t) ∗ si,I(t) is gaussian distributed for the filters h(t) that can filter out all the dependent subcomponents si,D(t).2

Assumption 2. Components of xD(t) will not become independent after filtering with h(t) and linear transform with W.3

These assumptions are not restrictive in the general case. Under these assumptions, we have the following proposition:

Proposition 1. Let x(t) be the observations in the SDICA model. Under assumptions 1 and 2, the outputs of the SDICA separation system in Figure 1A, yi(t), are spatially independent stochastic sequences if and only if the filter h(t) filters out the dependent subcomponents si,D(t) and WA is a generalized permutation matrix.

Proof: According to Figure 1A, the SDICA separation procedure is

y(t) = W · A · [h(t) ∗ s1,I(t), . . . , h(t) ∗ sn,I(t)]^T + W · A · [h(t) ∗ s1,D(t), . . . , h(t) ∗ sn,D(t)]^T = yI(t) + yD(t),    (3.2)

where y(t) = [y1(t), . . . , yn(t)]^T, yI(t) = [y1,I(t), . . . , yn,I(t)]^T, and yD(t) = [y1,D(t), . . . , yn,D(t)]^T. From equation 3.2, we can see that each yi(t) has two subcomponents, yi,I(t) and yi,D(t), which are in different frequency bands. Suppose that the yi(t) are spatially independent stochastic sequences. According to lemma 1, all subband subcomponents of yi(t), including yi,D(t), are also independent for all i, which contradicts assumption 2. Therefore, yi,D(t) must vanish. Furthermore, when h(t) filters out si,D(t), the second term in equation 3.2 vanishes, and equation 3.2 becomes the demixing procedure of the basic ICA model according to lemma 2. According to the Darmois-Skitovich theorem (Kagan, Linnik, & Rao, 1973) or theorem 11 in Comon (1994), WA must be a generalized permutation matrix to make yi(t) mutually independent. Conversely, if h(t) filters out si,D(t) and WA is a generalized permutation matrix, then yi(t) are a permuted and filtered version of the spatially
2 This assumption is analogous to the nongaussianity assumption in the basic ICA model. Here, if si,I(t) is independent and identically distributed (i.i.d.), h(t) ∗ si,I(t) is always nongaussian given that si,I(t) is nongaussian, according to Cramér's lemma (Cramér, 1962). However, here si,I(t) may not be i.i.d., so we need to make the assumption on the distribution of h(t) ∗ si,I(t) directly rather than on si,I(t).
3 When this assumption is violated, we may have some more local optima. See section 4 for this case.
independent stochastic sequences si,I(t), i = 1, . . . , n. Clearly the yi(t) are spatially independent stochastic sequences.

The assumption that independent signals have different frequency representations has been exploited to develop BSS methods (Yilmaz & Rickard, 2004; Rickard, Balan, & Rosca, 2003; Belouchrani & Amin, 1998). Here, since si,I(t) and sj,D(t) have different frequency localizations, one may consider that they are affected by different factors. We further make the following assumption:

Assumption 3. The independent subcomponent si,I(t) and the dependent subcomponent sj,D(t) are spatially independent stochastic sequences for all pairs of i and j.

In order to present proposition 2 as an extension of proposition 1, we give the following general lemma:

Lemma 3. Let the zero-mean random variables U1, U2, V1, and V2 satisfy that Ui is independent from Vj and that the random vectors U = [U1, U2]^T and V = [V1, V2]^T admit linear structures (for the definition of linear structures, see Kagan et al., 1973):

U1 = a1 t1 + a2 t2 + · · · + aN tN,    V1 = c1 r1 + c2 r2 + · · · + cM rM,
U2 = b1 t1 + b2 t2 + · · · + bN tN,    V2 = d1 r1 + d2 r2 + · · · + dM rM,    (3.3)

where all of the ti and rj are mutually independent, and ai, bi, ci, and di are constants. Assume that at most one of the sets {ti, i = 1, . . . , N} and {ri, i = 1, . . . , M} has gaussian elements. Then U1 + V1 and U2 + V2 are independent if and only if U1, U2 are independent and V1, V2 are independent.

Proof: It is trivial to show that if U1, U2 are independent and V1, V2 are independent, then U1 + V1 and U2 + V2 are independent. Conversely, suppose that U1 + V1 and U2 + V2 are independent. Since at most one of the sets {ti, i = 1, . . . , N} and {ri, i = 1, . . . , M} has gaussian components, there are two possible cases to consider:

Case 1: None of the ti and ri is gaussian. Note that U1 + V1 and U2 + V2 are linear mixtures of the nongaussian independent variables ti and ri. According to the Darmois-Skitovich theorem, we have ai bi = cj dj = 0 for all i and j. Consequently, from equation 3.3, we can see that U1, U2 are independent and V1, V2 are independent.

Case 2: One of {ti, i = 1, . . . , N} and {ri, i = 1, . . . , M} has gaussian components. Without loss of generality, we assume {ti} has k gaussian components, which are t1, . . . , tk. Let U1G = a1 t1 + · · · + ak tk, and
U2G = b1 t1 + · · · + bk tk. According to the Darmois-Skitovich theorem, we have ai bi = cj dj = 0 for i > k and all j. Consequently, V1 and V2 are independent, and U1 − U1G is independent from U2 − U2G. We also have 0 = E{(U1 + V1)(U2 + V2)} = E{U1G U2G}, and thus U1G and U2G are independent, as they are jointly gaussian. Therefore, U1 and U2 are also independent.

With the help of the Darmois-Skitovich theorem, it is easy to extend this lemma to the case where U and V are high-dimensional. In fact, when U and V are high-dimensional and admit linear structures, under the condition that for all i at most one of the ti and ri is gaussian, one can see that the components of U + V are mutually independent if and only if the components of U and the components of V are both mutually independent. And in this case, the mutual independence between the components of U + V is equivalent to their pairwise independence.

The following assumptions are made on the dependent subcomponents:

Assumption 4. sD(t) admits a linear structure, that is, sD(t) = Br(t), where B is a constant matrix and the vector r(t) = [r1(t), . . . , rM(t)]^T consists of M spatially independent sequences; M > n, and M can be arbitrarily large.4

Assumption 5. For n = 2, at most one of the sets {h(t) ∗ si,I(t), i = 1, . . . , n} and {h(t) ∗ ri(t), i = 1, . . . , M} has gaussian components. If n > 2, at most one of h(t) ∗ si,I(t), i = 1, . . . , n, and h(t) ∗ ri(t), i = 1, . . . , M, is gaussian.

In assumption 4, the matrix B can be represented as [In | 0n×(M−n)] · B̄, where In denotes the n-dimensional identity matrix, 0n×(M−n) the n × (M − n) zero matrix, and B̄ an M-dimensional square matrix. This means that sD(t) can be regarded as the projection of a linear transform of the M-dimensional independent random vector r(t) onto the n-dimensional space. The larger M is, the more vectors sD(t) can be represented by Br(t).
4 If M ≤ n, the components of sD(t) may become independent after a linear transform, and assumption 2 is violated. In this case, the SDICA problem is actually a special case of overcomplete ICA, which is detailed in section 4.

We can now present the following proposition, which is more applicable than proposition 1:

Proposition 2. Let x(t) be the observations in the SDICA model. Under assumptions 1 to 5, the outputs of the SDICA separation system, yi(t), are instantaneously independent if and only if the filter h(t) filters out the dependent subcomponents si,D(t) and WA is a generalized permutation matrix.

Proof: Since si,I(t) and sj,D(t) are assumed to be spatially independent stochastic sequences for all pairs of i and j, yi,I(t) and yj,D(t), defined
K. Zhang and L.-W. Chan
in equation 3.2, are always independent for all pairs of $i$ and $j$ according to lemma 1. Suppose the $y_i(t)$ are independent. Under the assumptions of this proposition, the components of $y_I(t)$, and those of $y_D(t)$, must be mutually independent according to lemma 3 and its extension to the high-dimensional case. Therefore, $y_{i,D}(t)$ must vanish, as they are assumed to be always dependent. Equation 3.2 then becomes the demixing procedure of the basic ICA model, and the proposition follows.

Assumptions 1 to 5 are generally not very restrictive, and it is important to emphasize that proposition 2 may hold even when some of the assumptions are violated. According to proposition 2, we can use the SDICA separation system in Figure 1A to separate the observations generated by the SDICA model; the filter $h(t)$ and the demixing matrix $W$ in the SDICA system can be obtained by making the $y_i(t)$ mutually independent.

3.2 The Learning Rule for BS-ICA. The SDICA system can be considered a special case of the blind separation system for convolutive mixtures. The approach based on information maximization provides simple and efficient algorithms for blind separation of convolutive mixtures, but it has the side effect of temporally whitening the output (Bell & Sejnowski, 1995; Torkkola, 1996). This temporal whitening must be avoided in SDICA; hence, the information-maximization principle should not be adopted here. Mutual information is a natural and canonical measure of statistical dependence, and the mutual-information-minimization approach has been applied to blind separation of convolutive mixtures in Babaie-Zadeh, Jutten, and Nayebi (2001b). We can use mutual information to measure the dependence between the $y_i$ and estimate $h(t)$ and $W$ by minimizing the mutual information between the $y_i$. In information theory, the mutual information between $n$ random variables $y_1, \ldots, y_n$ is defined as

$$I(y_1, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) - H(y), \qquad (3.4)$$
where $y = (y_1, \ldots, y_n)^T$ and $H(\cdot)$ denotes the (differential) entropy. $I(y_1, \ldots, y_n)$ is always nonnegative and is zero if and only if the $y_i$ are mutually independent. We can then derive the learning rules for $h(t)$ and $W$ in Figure 1A based on the minimization of the mutual information between the $y_i$.

3.2.1 Adjusting W in the Instantaneous Stage. Since the filtering stage and the instantaneous stage of the SDICA separation system are in a cascade structure, $h(t)$ is not involved explicitly in the learning rule for $W$, and the instantaneous stage simply aims at minimizing the mutual information
An Adaptive Method for Subband Decomposition ICA
given $z_i$ as input. The learning rule for $W$ is then the same as that in the basic ICA model (Bell & Sejnowski, 1995; Cardoso, 1997),

$$\frac{\partial I(y)}{\partial W} = -[W^T]^{-1} - E\{\psi_y(y) z^T\}, \qquad (3.5)$$
where $z = [z_1, \ldots, z_n]^T$, $z_i(t) = h(t) * x_i(t)$, and $\psi_y(y) = [\psi_{y_1}(y_1), \ldots, \psi_{y_n}(y_n)]^T$ is called the marginal score function (MSF) in Babaie-Zadeh et al. (2001b). $\psi_{y_i}(u)$ is the score function of the random variable $y_i$, defined as

$$\psi_{y_i}(u) = (\log p_{y_i}(u))' = \frac{p'_{y_i}(u)}{p_{y_i}(u)}. \qquad (3.6)$$
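As an aside, the mutual information of equation 3.4 can be checked numerically in the jointly gaussian case, where all entropies have closed forms. A minimal sketch (the function name and the gaussian restriction are ours, purely for illustration):

```python
import numpy as np

def gaussian_mutual_information(cov):
    """I(y_1, ..., y_n) of equation 3.4 for jointly gaussian y with
    covariance `cov`: the (2*pi*e) terms in H(y_i) and H(y) cancel,
    leaving 0.5 * (sum_i log cov_ii - log det cov)."""
    cov = np.asarray(cov, dtype=float)
    sign, logdet = np.linalg.slogdet(cov)
    assert sign > 0, "covariance must be positive definite"
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

print(gaussian_mutual_information(np.eye(3)))           # 0.0: independent
print(gaussian_mutual_information([[1, 0.8],
                                   [0.8, 1]]))          # positive: dependent
```

As expected, the value is zero exactly when the covariance is diagonal, that is, when the components are independent.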
Multiplying the right-hand side of equation 3.5 by $W^T W$, we get the natural gradient method (Amari et al., 1996; Cardoso & Laheld, 1996):

$$\Delta W \propto (I + E\{\psi_y(y) y^T\}) W. \qquad (3.7)$$
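One iteration of this update can be sketched in a few lines of numpy. Here the marginal score function is replaced by the fixed heuristic $\psi(u) = -\tanh(u)$ (a common choice for super-gaussian outputs, not estimated from data as in the article), and the diagonal is normalized for unit output variance as equation 3.8, introduced next, prescribes; the function name and learning rate are our own:

```python
import numpy as np

def natural_gradient_step(W, z, lr=0.1):
    """One natural-gradient update of the instantaneous stage.
    W: (n, n) demixing matrix; z: (n, T) filtered observations z(t).
    Off-diagonal terms follow equation 3.7; diagonal entries are replaced
    by 1 - E{y_i^2} so the outputs converge to unit variance (equation 3.8).
    psi(u) = -tanh(u) is an assumed marginal score function."""
    y = W @ z                                    # outputs, shape (n, T)
    T = y.shape[1]
    G = (-np.tanh(y)) @ y.T / T                  # E{psi(y) y^T}
    np.fill_diagonal(G, 1.0 - np.mean(y ** 2, axis=1))
    return W + lr * G @ W, y
```

Iterating this step on the filtered observations drives the update matrix toward zero, that is, the outputs toward unit-variance, mutually uncorrelated-score components.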
In order to make the $y_i$ of unit variance when the algorithm converges, we replace the entries on the diagonal of $I + E\{\psi_y(y) y^T\}$ by $1 - E(y_i^2)$. In this way, the above learning rule is modified as follows:

$$\Delta W \propto \left(I - \mathrm{diag}\{E(y_1^2), \ldots, E(y_n^2)\} + E\{\psi_y(y) y^T\} - \mathrm{diag}\{E[\psi_y(y) y^T]\}\right) W, \qquad (3.8)$$

where $\mathrm{diag}\{E(y_1^2), \ldots, E(y_n^2)\}$ denotes the diagonal matrix with $E(y_1^2), \ldots, E(y_n^2)$ as its diagonal entries, and $\mathrm{diag}\{E[\psi_y(y) y^T]\}$ denotes the diagonal matrix whose diagonal entries are those on the diagonal of $E\{\psi_y(y) y^T\}$.

3.2.2 Adjusting h(t) in the Filtering Stage. Let $h(t)$ be a causal finite impulse response (FIR) filter, $h(t) = [h_0, h_1, \ldots, h_L]$. We have
$$\frac{\partial H(y_i)}{\partial h_k} = -E\left\{\frac{p'_{y_i}(y_i)}{p_{y_i}(y_i)} \cdot \frac{\partial y_i}{\partial h_k}\right\} = -E\left\{\psi_{y_i}(y_i) \cdot \sum_{p=1}^{n} \frac{\partial y_i}{\partial z_p} \cdot \frac{\partial z_p}{\partial h_k}\right\} = -E_t\left\{\psi_{y_i}(t) \cdot \sum_{p=1}^{n} w_{i,p}\, x_p(t - k)\right\}, \qquad (3.9)$$
where $w_{i,p}$ is the $(i, p)$th entry of the matrix $W$. And

$$\frac{\partial H(y)}{\partial h_k} = -E\left\{\frac{\partial \log p_y(y)}{\partial h_k}\right\} = -E\left\{\sum_{i=1}^{n} \frac{\partial \log p_y(y)}{\partial y_i} \cdot \frac{\partial y_i}{\partial h_k}\right\} = -\sum_{i=1}^{n} E\left\{\frac{\partial \log p_y(y(t))}{\partial y_i(t)} \cdot \sum_{p=1}^{n} w_{i,p}\, x_p(t - k)\right\} = -E_t\left\{\varphi_y^T(t) \cdot W \cdot x(t - k)\right\}, \qquad (3.10)$$
where $\varphi_y(y) = [\varphi_1(y), \ldots, \varphi_n(y)]^T$ is called the joint score function (JSF) in Babaie-Zadeh et al. (2001b), and its $i$th element is

$$\varphi_i(y) = \frac{\partial \log p_y(y)}{\partial y_i} = \frac{\frac{\partial}{\partial y_i} p_y(y)}{p_y(y)}. \qquad (3.11)$$
Combining equations 3.9 and 3.10 gives

$$\frac{\partial I(y)}{\partial h_k} = \sum_{i=1}^{n} \frac{\partial H(y_i)}{\partial h_k} - \frac{\partial H(y)}{\partial h_k} = -E_t\left\{\psi_y^T(t) \cdot W \cdot x(t - k)\right\} + E_t\left\{\varphi_y^T(t) \cdot W \cdot x(t - k)\right\} = E_t\left\{\beta_y^T(t) \cdot W \cdot x(t - k)\right\}, \qquad (3.12)$$
where $\beta_y(y) = \varphi_y(y) - \psi_y(y)$ is defined as the score function difference (SFD) in Babaie-Zadeh et al. (2001b). The SFD is an independence criterion; it vanishes if and only if the $y_i$ are mutually independent. The elements of $h(t)$ can now be adjusted according to equation 3.12 with the gradient-descent method. Since the SFD estimation cannot be avoided in updating $h_k$, we can alternatively adopt the SFD-based algorithm for adjusting $W$ (Samadi, Babaie-Zadeh, Jutten, & Nayebi, 2004):

$$\Delta W \propto -E\{\beta_y(y) y^T\} W. \qquad (3.13)$$
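Given an SFD estimate, the gradient of equation 3.12 reduces to a correlation between $W^T \beta_y(t)$ and the delayed observations. A sketch (the SFD itself must come from an external estimator, such as Pham's method, which is not implemented here; the function name is ours):

```python
import numpy as np

def filter_gradient(beta, W, x, L):
    """dI/dh_k = E_t{ beta_y(t)^T W x(t-k) } for k = 0..L (equation 3.12).
    beta: (n, T) score function difference at the outputs;
    W: (n, n) demixing matrix; x: (n, T) observations."""
    n, T = x.shape
    u = W.T @ beta                       # u_p(t) = sum_i w_ip * beta_i(t)
    grad = np.empty(L + 1)
    for k in range(L + 1):
        # pair beta_y(t) with x(t - k), averaging over the valid range of t
        grad[k] = np.mean(np.sum(u[:, k:] * x[:, :T - k], axis=0))
    return grad
```

Each tap $h_k$ would then be moved a small step against this gradient, with $h_0$ held at 1 as described below.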
Equation 3.8 (or equation 3.13) and equation 3.12 are the learning rules of BS-ICA. When the instantaneous stage and the filter stage both converge, the matrix W is the demixing matrix associated with the mixing matrix A. The original sources can be estimated as Wx. Moreover, the outputs yi form the estimate of a filtered version of the independent subcomponents si,I . As in separating convolutive mixtures (Babaie-Zadeh et al., 2001b) and separating convolutive postnonlinear mixtures (Babaie-Zadeh, Jutten, &
Nayebi, 2001a), the algorithm (see equations 3.12 and 3.13) involves the SFD $\beta_y(y)$. The SFD estimation problem has been addressed in several articles (e.g., Taleb & Jutten, 1997, and Pham, 2003). In our experiments, we adopt Pham's method, because it is fast and comparatively accurate. The SFD estimation requires a large number of samples and is difficult to perform when the dimension of the output is greater than two. In section 3.4, we propose a scheme to avoid high-dimensional SFD.

3.3 Practical Considerations. In Figure 1A, if we exchange a scalar factor between the filter $h(t)$ and the matrix $W$, the output $y$ does not change. This scaling indeterminacy would harm the convergence of our algorithm. Therefore, we set $h_0$, the first element of $h(t)$, to 1 to eliminate this indeterminacy. In applying BS-ICA, we should not neglect the limited capacity of the filter $h(t)$. If the proportion of the dependent subcomponents is extremely large, it is hard to eliminate their effect with a filter of limited capacity, and our method may converge to an incorrect target.

3.3.1 Order of h(t) and Parameter Initialization. In practice, the performance of SDICA is affected by the order of $h(t)$. Note that there is no ideal digital filter: there are always finite transition bands between the stop-bands and pass-bands. Furthermore, a digital filter can never completely attenuate the signal in the stop-bands, nor can it let the signal in the pass-bands pass through unscathed. Instead, the signal in the stop-bands is attenuated by a finite gain factor; hence, our method may fail to filter out the dependent subcomponents when their energy is extremely high. We should also consider the effect of the length of $h(t)$. If the length, $L + 1$, is too short, the resolution of the filter is poor, and the width of the transition band increases.
A long filter, on the other hand, results in a heavy computational load and may make the algorithm less robust. In our three experiments, the filter length is set to 17, 11, and 21, respectively. In BS-ICA, there are many parameters to be tuned. Also, due to the limited accuracy of the SFD estimation, there may be local optima, especially when the data dimension is high. In practice, when the proportion of the dependent subcomponents is not that large, we found it very useful to initialize the demixing matrix $W$ with that obtained by traditional ICA algorithms (e.g., the FastICA algorithm). This initialization scheme can improve the convergence speed and may help to avoid some local optima.

3.3.2 Penalty Term with Prior Knowledge on h(t). Notice that the decomposition of each source $s_i(t)$ into the independent subcomponent $s_{i,I}(t)$ and the dependent subcomponent $s_{i,D}(t)$ (see equation 3.1) is not unique. In other words, if $h(t)$ filters out not only the dependent subcomponent $s_{i,D}(t)$ but also part of the independent subcomponent $s_{i,I}(t)$, the independence between the outputs $y_i$ can still be achieved. This means that we may recover
only part of the independent subcomponent with BS-ICA. This would not affect the solution when we aim only to estimate the true mixing matrix $A$. However, if we also aim to recover the whole independent subcomponents, we can tackle this problem by incorporating an additional penalty term in the objective function. We can take into account the prior information on the frequency localization of the independent subcomponent. For instance, if we know in advance that the frequency of the independent subcomponent of interest is around the radian frequency $\Omega_0$, we can modify the objective function by incorporating the frequency response magnitude of $h(t)$ at $\Omega_0$,

$$J_1 = I(y_1, \ldots, y_n) - \lambda |H(j\Omega_0)|^2, \qquad (3.14)$$
where $\lambda$ is a small enough positive number. Since

$$H(j\Omega_0) = \sum_{t=0}^{L} h(t) e^{-j\Omega_0 t},$$

we have

$$|H(j\Omega_0)|^2 = H(j\Omega_0) \cdot H^*(j\Omega_0) = \sum_{t=0}^{L} h(t) e^{-j\Omega_0 t} \cdot \sum_{k=0}^{L} h(k) e^{j\Omega_0 k} = \sum_{t=0}^{L} \sum_{k=0}^{L} h(t) h(k) \cos[(t - k)\Omega_0],$$

and then,

$$\frac{\partial |H(j\Omega_0)|^2}{\partial h(k)} = 2 \sum_{t=0}^{L} h(t) \cos[(k - t)\Omega_0].$$

The gradient of $J_1$ with respect to $h_k$ is

$$\frac{\partial J_1}{\partial h_k} = E_t\left\{\beta_y^T(t) \cdot W \cdot x(t - k)\right\} - 2\lambda \sum_{t=0}^{L} h_t \cos[(k - t)\Omega_0]. \qquad (3.15)$$
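The penalty term and its gradient are cheap to compute. A sketch (the function name is ours), with the cosine-sum gradient checkable against the direct complex evaluation of $|H(j\Omega_0)|^2$:

```python
import numpy as np

def penalty_and_gradient(h, omega0):
    """|H(j*Omega_0)|^2 for FIR taps h = [h_0, ..., h_L], plus its gradient
    d|H|^2 / dh_k = 2 * sum_t h_t * cos[(k - t) * Omega_0], the term that
    enters equation 3.15 scaled by lambda."""
    h = np.asarray(h, dtype=float)
    t = np.arange(len(h))
    mag2 = np.abs(np.sum(h * np.exp(-1j * omega0 * t))) ** 2
    grad = 2.0 * np.array([np.sum(h * np.cos((k - t) * omega0))
                           for k in range(len(h))])
    return mag2, grad
```

The complex evaluation and the double cosine sum agree term by term, which makes this a convenient self-check when wiring the penalty into the gradient update.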
$h_k$ can then be adjusted with the gradient-descent method. $h_0$ is still set to 1 during the update process.

3.3.3 To Minimize the Distortion and Improve the Robustness. The outputs $y_i$ are the estimate of (a filtered version of) the independent subcomponents.
In order to make the $y_i$ as close as possible to the original independent subcomponents, the pass-band of $h(t)$ should be as wide as possible, provided that $h(t)$ filters out the dependent subcomponents. This can be achieved by using equation 3.15 as the learning rule, with the value of $\Omega_0$ no longer fixed; each time it is randomly chosen between 0 and the Nyquist frequency $\pi$. With this scheme, our method becomes more robust. If necessary, after convergence of BS-ICA, we can further modify the learned $h(t)$ such that it has a flat spectrum in the pass-bands while its stop-bands remain unchanged. In this way, the distortion is further reduced.
3.4 To Avoid High-Dimensional SFD. Due to the curse of dimensionality, it is difficult to estimate the probability density function (pdf) in high-dimensional spaces; hence, the SFD estimation is an obstacle when applying our algorithm to the high-dimensional case, and it is useful to find a way to avoid the SFD in the learning rule. In general, pairwise independence is weaker than mutual independence. However, according to the Darmois-Skitovich theorem, for linear instantaneous ICA, pairwise independence of the outputs $y_i$ is equivalent to their mutual independence (Comon, 1994). In this case, we can achieve mutual independence between the $y_i$ heuristically by minimizing the sum of the pairwise mutual information:
$$J = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} I(y_i, y_j). \qquad (3.16)$$
This function is always nonnegative and is zero if and only if the $y_i$ are pairwise independent. Under the assumptions made in proposition 2, the mutual independence between the outputs of the SDICA system, $y_i(t)$, is equivalent to their pairwise independence according to the extension of lemma 3 to the high-dimensional case. Consequently, in SDICA, we can minimize the objective function, equation 3.16, to make the $y_i$ mutually independent. In this way, the SFD estimation is always performed in the two-dimensional space, as explained below. By deriving the gradient of equation 3.16 with respect to $h_k$, we get the update rule for $h(t)$,
$$\frac{\partial J}{\partial h_k} = E\left\{\gamma_y^T(t) \cdot W \cdot x(t - k)\right\}, \qquad (3.17)$$
where $\gamma_y(t) = [\gamma_1(t), \ldots, \gamma_n(t)]^T$ is defined as the pairwise score function difference (PSFD), and

$$\gamma_i = \sum_{j=1, j \neq i}^{n} \beta_{y_i}(y_i, y_j) = \sum_{j=1, j \neq i}^{n} \varphi_{y_i}(y_i, y_j) - (n - 1)\psi_{y_i}(y_i).$$

Note that $\varphi_{y_i}(y_i, y_j)$ is the score function of the joint pdf of $(y_i, y_j)$ with respect to $y_i$, that is, $\varphi_{y_i}(y_i, y_j) = \frac{\partial \log p_{y_i, y_j}(y_i, y_j)}{\partial y_i}$, and $\psi_{y_i}(y_i)$ is the score function of $y_i$. (For a proof of the rule, refer to the appendix.) In order to obtain $\gamma_y$, we need to estimate $\beta_{y_i}(y_i, y_j)$ for all $i \neq j$. This means that the SFD estimation is always performed in the two-dimensional space, regardless of the original data dimension. For simplicity, in the instantaneous stage, we still use $I(y_1, \ldots, y_n)$ as the objective function, and the learning rule for $W$ is still equation 3.5. From the definition of the PSFD $\gamma_y$, we can see that the SFD of each pair of $y_i$ is needed to construct the $n$-dimensional PSFD $\gamma_y(y)$. Therefore, for estimating $\gamma_y(y)$, we need to estimate a set of two-dimensional SFDs with
$C_n^2 = \frac{n(n-1)}{2}$ elements. Clearly, the complexity of the PSFD-based algorithm (see equation 3.17) is a quadratic function of $n$ in each iteration. This scheme can also be applied to the algorithms for separating linear instantaneous or convolutive mixtures with the SFD involved (Babaie-Zadeh et al., 2001b). Consequently, these algorithms become more applicable for high-dimensional data.

4 For Overcomplete ICA Problems with Sources Having Specific Frequency Characteristics

For overcomplete ICA, the mixing system (see equation 2.1) is not invertible. In fact, due to the information lost in the mixing process caused by the reduction of the data dimension, even if we know the mixing matrix $A$, we cannot recover all the independent sources exactly. Generally, solving the overcomplete ICA problem involves two processes: estimation of the mixing matrix and estimation of the original sources with the help of their assumed prior probability densities and the estimated mixing matrix (Olshausen & Field, 1997; Lewicki & Sejnowski, 2000). The method proposed by Hyvärinen, Cristescu, and Oja (1999) is based on the FastICA algorithm (see Hyvärinen & Oja, 1997), combined with the concept of quasi-orthogonality, for producing more independent components than observations; this idea has been extended from a maximum likelihood viewpoint in Hyvärinen and Inki (2002). In addition, the work in Amari (1999) focuses only on the natural gradient learning algorithm to estimate the mixing matrix $A$ and does not treat the problem of source recovery. In these methods, some assumed prior distributions, which are usually sparse,
are imposed on the sources; accordingly, these methods may not work in some scenarios. Here we consider the overcomplete ICA problems from a different point of view: we exploit the information of the frequency spectra of the sources.
4.1 Relation to SDICA. SDICA is closely related to the overcomplete ICA problem. Here we are concerned with one form of the overcomplete ICA problem, in which there exists a subset of $k$ sources ($2 \le k \le m$), forming the vector $s^{(1)}$, such that each source in $s^{(1)}$ has some frequency band outside the frequency bands of the sources not in $s^{(1)}$. Denote the vector consisting of the sources not in $s^{(1)}$ by $s^{(2)}$. Partition the mixing matrix $A$ into $A^{(1)}$ and $A^{(2)}$ according to $s^{(1)}$ and $s^{(2)}$, the two disjoint subsets of sources. Then we have
$$x = As = \left[A^{(1)} \; A^{(2)}\right] \cdot \begin{bmatrix} s^{(1)} \\ s^{(2)} \end{bmatrix} = A^{(1)} s^{(1)} + A^{(2)} s^{(2)}. \qquad (4.1)$$
If $A^{(1)}$ is square and nonsingular, we can further get $x = A^{(1)}(s^{(1)} + A^{(1)-1} A^{(2)} s^{(2)})$. Generally $A^{(1)-1} A^{(2)}$ is not a generalized permutation matrix; otherwise, some independent sources would be merged together, and the overcomplete ICA would become the ordinary ICA problem (Cao & Liu, 1996). Consequently, the elements of $A^{(1)-1} A^{(2)} s^{(2)}$ are dependent. The elements of $s^{(1)} + A^{(1)-1} A^{(2)} s^{(2)}$ can then be regarded as sources following the assumption in SDICA (see equation 2.3). If $A^{(1)}$ is not square, we consider the case where the rank of $A^{(1)}$ is $k$. There exists a $k \times k$ submatrix of $A^{(1)}$, denoted by $\tilde{A}^{(1)}$, which consists of some rows of $A^{(1)}$, such that $\tilde{A}^{(1)}$ is of full rank. Denote the vector of the observations corresponding to $\tilde{A}^{(1)}$ by $\tilde{x}$, and the corresponding submatrix of $A^{(2)}$ by $\tilde{A}^{(2)}$. We have $\tilde{x} = \tilde{A}^{(1)}(s^{(1)} + \tilde{A}^{(1)-1} \tilde{A}^{(2)} s^{(2)})$. The elements of $s^{(1)} + \tilde{A}^{(1)-1} \tilde{A}^{(2)} s^{(2)}$ can still be considered as sources following the assumption in SDICA. Consequently, the problem of overcomplete ICA becomes a special case of the SDICA problem.
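This rewriting is pure linear algebra and can be checked numerically in a toy setting (all dimensions and values below are arbitrary illustrations, not the article's):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 2                                   # sources kept in s^(1)
A1 = np.array([[0.7, 1.2],
               [0.5, 1.9]])             # square, nonsingular A^(1)
A2 = rng.normal(size=(k, 3))            # A^(2), three extra sources
s1 = rng.normal(size=(k, 100))          # s^(1)
s2 = rng.normal(size=(3, 100))          # s^(2)

x = A1 @ s1 + A2 @ s2                   # equation 4.1
# SDICA form: x = A^(1) (s^(1) + A^(1)^-1 A^(2) s^(2)), i.e. "sources"
# whose dependent parts are the entries of A^(1)^-1 A^(2) s^(2)
x_sdica = A1 @ (s1 + np.linalg.solve(A1, A2 @ s2))
print(np.allclose(x, x_sdica))          # True
```

The same check applies in the non-square case after restricting to the rows that make the submatrix invertible.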
4.2 To Solve One Form of the Overcomplete ICA Problems. If we use the SDICA system to separate the overcomplete mixture x discussed above, we have
$$y(t) = W A^{(1)} \begin{bmatrix} h(t) * s_1^{(1)}(t) \\ \vdots \\ h(t) * s_k^{(1)}(t) \end{bmatrix} + W A^{(2)} \begin{bmatrix} h(t) * s_1^{(2)}(t) \\ \vdots \\ h(t) * s_{n-k}^{(2)}(t) \end{bmatrix}. \qquad (4.2)$$
In the general case, $WA^{(1)}$ and $WA^{(2)}$ will not both be generalized permutation matrices; otherwise the overcomplete ICA problem degenerates to the ordinary ICA problem (Cao & Liu, 1996). In this case, as a consequence of lemma 3, there are two ways to achieve independence between the $y_i$ in equation 4.2: we can either use $h(t)$ to filter out $s^{(2)}$ and set $WA^{(1)}$ to a generalized permutation matrix, or filter out $s^{(1)}$ and let $WA^{(2)}$ be a generalized permutation matrix. Therefore, we can use the BS-ICA method to separate overcomplete mixtures of this kind. For estimating the $k$ sources in $s^{(1)}$, we can confine the demixing matrix $W$ to be a $k \times m$ matrix. The outputs $y_i$ form an estimate of a filtered version of these sources, and the learned $W$ is the associated demixing matrix. When $k = m/2 = n$ and the frequency bands of the components of $s^{(1)}$ and those of $s^{(2)}$ do not overlap, the algorithm theoretically has at least two local minima: one recovers $s^{(1)}$ and the other recovers $s^{(2)}$. In practice, due to the limited capacity of the filter, one of these two minima may be hard to attain, especially when the corresponding sources contribute very little to the observations. But this local minimum can be obtained in another simple way. When we achieve one local minimum, for example, the components of $s^{(1)}$ are recovered and completely pass through $h(t)$ (if necessary, this can be achieved by incorporating a penalty term; see section 3.3.3), the filter $h^{-1}(t)$ attenuates $s^{(1)}$ and allows $s^{(2)}$ to pass through (see footnote 5). Applying a linear instantaneous ICA algorithm to $h^{-1}(t) * x(t)$, we can estimate the demixing matrix associated with $A^{(2)}$ and the filtered version of the components of $s^{(2)}$. We will discuss this issue with the help of the experiment. Noisy ICA is a special case of overcomplete ICA. Usually the noise is assumed to be white, that is, it has a flat frequency spectrum.
The independent source signals usually concentrate on a narrower frequency band. Our method may produce two different outcomes for such noisy ICA problems: $h(t)$ suppresses the noise and lets the independent sources pass, or $h(t)$ attenuates the independent sources and allows the noise to pass through. The former case is what we generally want to achieve. In the latter case, we can further apply $h^{-1}(t)$ to the observations to attenuate the noise and obtain the signals.

5 Experiments

To assess the quality of the demixing matrix $W$ for separating observations generated by the mixing matrix $A$, we use the Amari performance index
Footnote 5: For a causal finite impulse response (FIR) filter, its inverse always exists. If it is a minimum-phase filter, it has a causal inverse. Otherwise, the inverse is a noncausal filter.
$P_{err}$ (Amari et al., 1996; Cichocki & Amari, 2003),

$$P_{err} = \frac{1}{n(n-1)} \sum_{i=1}^{n} \left( \sum_{j=1}^{n} \frac{|p_{ij}|}{\max_k |p_{ik}|} - 1 + \sum_{j=1}^{n} \frac{|p_{ji}|}{\max_k |p_{ki}|} - 1 \right), \qquad (5.1)$$

where $p_{ij} = [WA]_{ij}$. The smaller $P_{err}$ is, the closer the product of $W$ and $A$ is to a generalized permutation matrix. Note that the proposed BS-ICA method can not only estimate the mixing matrix $A$ and the original sources, but also the independent subcomponents. As in blind separation of convolutive mixtures, the BS-ICA method for SDICA produces outputs that are an estimate of the filtered version of the independent subcomponents instead of the original ones. Therefore, for measuring the separation quality, we also use the output signal-to-noise ratio (SNR), defined as (assuming there is no permutation indeterminacy)
$$\mathrm{SNR}_i = 10 \log_{10} \frac{E\left\{y_i^2\right\}}{E\left\{y_i^2 \big|_{s_{i,I}=0}\right\}}, \qquad (5.2)$$
where $y_i|_{s_{i,I}=0}$ stands for the $i$th output when the corresponding input subcomponent $s_{i,I}$ is zero (Babaie-Zadeh et al., 2004). A high SNR means that the contribution to this output from the other sources (including the dependent subcomponents) is small.

5.1 Experiment 1: SDICA with Artificially Generated Data. In the first experiment, we use artificially generated signals to test the performance of the BS-ICA method for solving the SDICA problem. The four independent subcomponents are an amplitude-modulated signal, a sign signal, a high-frequency noise signal, and a speech signal. Each signal has 1000 samples. Each original source $s_i$ contains one of the above independent signals, together with a sinusoid wave with a particular frequency but a different phase for each source, which is the dependent subcomponent. Figure 2 shows these signals as well as their frequency characteristics (the magnitude of their Fourier transforms). The experimental setting is similar to the first experiment in Tanaka and Cichocki (2004). Any nonsingular matrix could be used as the mixing matrix. In this experiment, the mixing matrix is
$$A = \begin{bmatrix} 0.7 & 0.7 & 1.0 & 1.2 \\ 1.0 & 1.4 & 1.0 & 0.3 \\ 1.2 & 0.3 & 0.7 & 1.0 \\ 0.4 & 1.1 & 1.2 & 0.6 \end{bmatrix}.$$
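The two performance measures of equations 5.1 and 5.2 are straightforward to implement; a sketch (function names are ours):

```python
import numpy as np

def amari_index(P):
    """Amari performance index of equation 5.1 for P = W @ A; zero iff
    P is a generalized permutation matrix."""
    P = np.abs(np.asarray(P, dtype=float))
    n = P.shape[0]
    rows = np.sum(P / P.max(axis=1, keepdims=True)) - n
    cols = np.sum(P / P.max(axis=0, keepdims=True)) - n
    return (rows + cols) / (n * (n - 1))

def output_snr_db(y, y_without):
    """Output SNR of equation 5.2: y is the ith output, y_without is the
    same output recomputed with the input subcomponent s_{i,I} set to zero."""
    return 10 * np.log10(np.mean(np.square(y)) / np.mean(np.square(y_without)))
```

Applied to the converged $WA$ reported later in this section, `amari_index` reproduces $P_{err} \approx 0.0278$.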
Figure 2: The source-independent subcomponents (top four) and the waveform of the dependent subcomponents (bottom one), as well as their frequency characteristics, represented by the magnitude of their Fourier transforms. Only 400 samples are plotted for illustration. The sources $s_i = s_{i,I} + s_{i,D}$, and the $s_{i,D}$ are sinusoid waves with the same frequency but different phases.
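The data generation behind figure 2 can be reproduced in spirit with a short script. A sketch of SDICA data generation (the waveform details, amplitudes, phases, and dependent-subcomponent frequency below are our stand-ins, not the article's exact signals):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
t = np.arange(T)

# Independent subcomponents: rough stand-ins for the four signals of figure 2.
s_I = np.vstack([
    np.sin(0.02 * t) * np.sin(0.9 * t),   # amplitude-modulated signal
    np.sign(np.sin(0.3 * t)),             # sign signal
    rng.normal(size=T),                   # stand-in for the noise signal
    rng.laplace(size=T),                  # stand-in for the speech signal
])

# Dependent subcomponents: one sinusoid frequency, source-specific phases.
phases = np.array([0.0, 0.7, 1.9, 2.6])
s_D = np.sin(0.47 * t[None, :] + phases[:, None])

s = s_I + s_D                             # sources s_i = s_{i,I} + s_{i,D}
A = np.array([[0.7, 0.7, 1.0, 1.2],
              [1.0, 1.4, 1.0, 0.3],
              [1.2, 0.3, 0.7, 1.0],
              [0.4, 1.1, 1.2, 0.6]])      # the experiment-1 mixing matrix
x = A @ s                                 # observations
```

Because the dependent subcomponents share a single frequency, an ideal $h(t)$ only needs a notch around that frequency to remove them from every observation.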
Figure 3: The observations (left) and their frequency characteristics (right) in experiment 1.
The observations xi , as well as their frequency characteristics, are shown in Figure 3. Since some source-independent subcomponents in this experiment have disjoint frequency representations, the method proposed in Tanaka and Cichocki (2004) may fail.
Here the dimension of the observations is four. Since it is very difficult to estimate the SFD in the four-dimensional space, we use the scheme described in section 3.4 to avoid the high-dimensional SFD. Consequently, we need to estimate only the two-dimensional SFD, which is estimated using Pham's method (Pham, 2003). The length of the causal filter $h(t)$ is 17. $h_0$ is set to 1, and all the other elements of $h(t)$ are initialized as 0. $W$ is initialized as the identity matrix. The learning rate for the filtering stage is 0.1, and that for the instantaneous stage is 0.15. Equation 3.7 is adopted for adjusting $W$. At convergence, the product of $W$ and $A$ is
$$WA = \begin{bmatrix} -0.0003 & 0.0287 & 0.0027 & 0.6611 \\ -0.0049 & 0.9249 & 0.0009 & 0.0016 \\ 1.8876 & -0.0037 & 0.0113 & 0.0108 \\ -0.0545 & -0.0134 & 0.6420 & 0.0113 \end{bmatrix}.$$

The Amari performance index is $P_{err} = 0.0278$, from which we can see that $WA$ is almost a generalized permutation matrix. This means that $W$ is indeed the demixing matrix with respect to $A$, and the original sources $s_i$ can be estimated as $Wx$ with good performance. We repeat this experiment with an additional penalty term to achieve little distortion, with $\lambda = 0.01$, as discussed in section 3.3.3, and the experimental result is almost the same. For comparison, we also apply the FastICA algorithm (Hyvärinen & Oja, 1997) (with the tanh nonlinearity and in the symmetric estimation mode) to separate the observations $x$ directly. The resulting Amari performance index is $P_{err} = 0.139$, which indicates poor performance in recovering the mixing matrix $A$ due to the effect of the dependent subcomponents. Figure 4 shows the output SNRs (with respect to the source-independent subcomponents depicted in Figure 2) versus iterations, and the waveforms of the outputs $y_i$. The SNRs are very high, meaning that the dependent subcomponents have been filtered out by the filter $h(t)$ and the outputs are a good estimate of (a filtered version of) the independent subcomponents. This can be verified by examining the frequency response magnitude of $h(t)$ (see Figure 5). By comparing the frequency representation of $z_1(t)$ (note that $z_1(t) = h(t) * x_1(t)$) in Figure 5C with that of $x_1(t)$ in Figure 3 (top), we can see that the effect of the dependent subcomponents around $\Omega = 0.47$ rad is almost eliminated. From Figure 5B, we can see that the frequency response magnitude of $h(t)$ varies greatly in the pass-bands. This results in distortion in the estimated independent subcomponents. As discussed in section 3.3.3, for less distortion of the recovered independent subcomponents, we design a causal filter $h_1(t)$, which has the same stop-band as $h(t)$ and a nearly constant magnitude in the pass-bands.
In this experiment, the frequency response magnitude of $h_1(t)$ is indicated by the dotted line in Figure 5B. The result of applying $h_1(t)$ to the recovered sources $Wx(t)$ is shown in Figure 6.
Figure 4: (A) The output SNRs with respect to the source-independent subcomponents. (B) The outputs yi .
Figure 5: (A) The learned filter h(t). (B) Its frequency response magnitude (solid line). (C) The magnitude of the Fourier transform of z1 (t) = h(t) ∗ x1 (t). The dotted line in B shows the frequency response magnitude of h 1 (t), which has the same stop-band as h(t), but the magnitude in the pass-bands is almost a constant.
We can see that now the independent subcomponents are recovered with less distortion compared to Figure 4B. We repeat the above experiment for 20 runs, and in each run the mixing matrix A is randomly chosen and is guaranteed to be nonsingular. In each
Figure 6: The recovered independent subcomponents by applying h 1 (t) to the recovered sources Wx(t).
run, we run the FastICA algorithm, BS-ICA with W initialized by FastICA, and BS-ICA with W initialized with the identity matrix. We find that BS-ICA with W initialized by FastICA always converges quickly (in 500 iterations), and generally BS-ICA with W initialized with the identity matrix needs at least 1000 iterations to converge. The resulting Amari performance index of these three methods is shown in Figure 7. In all runs, BS-ICA with W initialized by FastICA converges to the desired target. And in three runs (runs 2, 8, and 14 in Figure 7), BS-ICA with W initialized with the identity matrix converges to a wrong target. There are two main reasons for this phenomenon. First, the second and first independent subcomponent signals have a very narrow band, and their frequency peaks are very close to that of the dependent subcomponent, as seen in Figure 2. Hence, they are very likely to be attenuated and distorted greatly. For the three runs where Perr is very high, we find that all columns of WA, except the second one, have only one dominant element. In other words, the poor performance in these three runs is caused by the second source-independent subcomponent signal. Second, the proportion of the dependent subcomponent is quite large, and it is hard to be filtered out. For these three runs, BS-ICA with W initialized with the identity matrix also converges to the desired target if we reduce the amplitude of the dependent subcomponent by half. 5.2 Experiment 2: Separating Images of Human Faces. Since the face images, represented as one-dimensional signals, are apparently dependent,
Figure 7: The Amari performance index of FastICA, BS-ICA with W initialized by FastICA, and BS-ICA with W initialized with the identity matrix, for 20 runs. In each run, the mixing matrix A is randomly chosen.
it is very hard to separate them with the ICA technique (Hyv¨arinen, 1998). Hyv¨arinen (1998) successfully separated four images of human faces by applying traditional ICA on the innovation processes. Here we use BS-ICA to do such a task. The four original images of human faces are the same as in Hyv¨arinen (1998), as shown in Figure 8A.6 We mix them with a random mixing matrix. Figure 8B shows the mixtures of the original images for illustration. The Amari performance index of separating the mixtures with FastICA is 0.592, and that of the natural gradient method (see equation 3.7, with the score function adaptively estimated from data) is 0.289.7 Figure 8C shows the separation result by the natural gradient method. Clearly the result produced by traditional ICA is poor. We repeat BS-ICA with W initialized with FastICA and BS-ICA with W initialized with the identity matrix for 20 runs. In each run, the mixing matrix is randomly chosen. The length of h is 11. The learning rate in the filtering stage is 0.04, and that in the linear demixing stage is 0.08. Since it is very computation demanding and memory demanding to do SFD estimation on a large number of samples, in each iteration we process only 6 Many thanks to Aapo Hyv¨ arinen for providing us the images and granting permission to use them. 7 Here, FastICA and the natural gradient method (with the score function estimated from data) give different results. The main reason is that the original images are highly correlated, as seen from the correlation matrix computed in Hyv¨arinen (1998), while the outputs of FastICA are always uncorrelated.
Figure 8: Separating mixtures of images of human faces. (A) Original images. (B) Some mixtures of the original images. (C) Separation result by traditional ICA (natural gradient method with the score function estimated from data). (D) Separation result by BS-ICA.
3000 samples. We find that no matter which method is used to initialize W, the Amari performance index obtained by BS-ICA is always between 0.0397 and 0.0473. In other words, the original images are successfully separated with good performance. Figure 8D shows the separation result of one run. The learned h(t), as well as its frequency-response magnitude, is given in
216
K. Zhang and L.-W. Chan
Figure 9: (A) The learned h(t) in separating mixtures of images. (B) Its frequency-response magnitude (solid line). For comparison, the dotted line shows the frequency-response magnitude of the filter h_AR(t), which is obtained by fitting a tenth-order autoregressive model to the observed images.
Figure 9. We can see that h(t) attenuates not only the low-frequency part but also the high-frequency part of the observations. For comparison, we repeat the experiment in Hyvärinen (1998). We use a tenth-order autoregressive model to estimate the innovation processes from the observations. The frequency-response magnitude of the filter producing the innovation processes from the observations, denoted by h_AR(t), is shown by the dotted line in Figure 9B. After that, we estimate the mixing matrix by applying traditional ICA to the innovation processes. FastICA gives the Amari performance index 0.132, and the natural gradient method with the score function estimated from data gives 0.061. So the original images are also recovered successfully by exploiting innovation processes. Comparing the performance indices, one can see that BS-ICA gives a better result.

5.3 Experiment 3: Overcomplete ICA. In this experiment, we test the usefulness of our method for the overcomplete ICA problem. The four independent sources are an amplitude-modulated signal (s_1), a sign signal (s_2), a high-frequency noise signal (s_3), and a sinusoid (s_4), which are the first three independent subcomponents and the dependent subcomponent in the first experiment (see Figure 2). The mixing matrix is
$$A = \begin{pmatrix} 0.7 & 0.7 & 1.5 & 1 \\ 1 & 1.4 & 1.5 & 0.4 \end{pmatrix}.$$
Note that each pair of columns of A is linearly independent. The two observations, together with their frequency characteristics, are shown in Figure 10. Since only two observations are available, each time the BS-ICA method can recover only two sources and the corresponding columns of the mixing matrix. From
Figure 10: The two observations (left) and their frequency characteristics (right) in experiment 3.
Figure 11: (A) The filter h(t). (B) Its frequency-response magnitude. (C) The magnitude of the Fourier transform of z_1(t) = h(t) ∗ x_1(t).
Figure 2, we can see that they all have different frequency characteristics and that s_3 has a wide frequency band. To recover two of the four sources, the other two must be filtered out by h(t). For good frequency resolution, the length of h(t) should not be too small. Here we set the length of h(t) to 21. h_0 is set to 1, and all the other elements of h(t) are initialized to 0. W is initialized as the identity matrix. The learning rates for the filter stage and the instantaneous stage are both 0.15. After about 1500 iterations, the algorithm converges. The filter h(t), its frequency-response magnitude, and the frequency characteristics of z_1(t) are shown in Figure 11. From Figure 11B and Figure 2, we can see that h(t) has
Figure 12: Two outputs produced by the BS-ICA method for the overcomplete ICA problem. They are estimates of filtered versions of some of the original sources. The SNR of y_1 with respect to s_4 is 20.1 dB, and that of y_2 with respect to s_2 is 18.2 dB.
significantly attenuated the amplitude-modulated signal (s_1) and the high-frequency noise signal (s_3). The filtered versions of s_4 and s_2 are recovered with SNR values of 20.1 dB and 18.2 dB, as shown in Figure 12. The product of W and A is
$$WA = \begin{pmatrix} 0.0305 & -0.0051 & 0.1225 & 0.1451 \\ 0.1350 & 0.2265 & 0.1803 & -0.0018 \end{pmatrix}.$$
As s_1 and s_3 are filtered out by h(t), we neglect the first and third columns of this matrix and get the Amari performance index P_err = 0.040. So the mixing matrix associated with s_2 and s_4 is successfully recovered. Note that different sources may be recovered if we use different initialization values for h(t) and W. We have now obtained the estimate of two sources, with h(t) filtering out the other two. To recover the remaining two sources, we could run our algorithm again with h^{-1}(t) as a good initialization value for the filter. If the frequency representations of the recovered sources and the remaining ones do not overlap (or overlap only slightly), we can simply apply h^{-1}(t) to let the remaining sources pass through and to attenuate the recovered ones. By applying a linear instantaneous ICA algorithm to the signals h^{-1}(t) ∗ x_i(t), we can obtain the estimate of the remaining sources without running BS-ICA again. This simple scheme is adopted in the following experiment. The linear instantaneous ICA algorithm chosen is FastICA with the tanh nonlinearity in the symmetric estimation mode. The product of the demixing matrix obtained
Figure 13: Outputs produced by applying FastICA on h^{-1}(t) ∗ x(t). The SNR of y_1 with respect to s_1 is 12.6 dB, and that of y_2 with respect to s_3 is 18.1 dB.
by FastICA and the mixing matrix A is
$$WA = \begin{pmatrix} -0.2167 & -0.5029 & -0.0044 & 0.0060 \\ 0.4263 & 0.2153 & -0.3237 & -0.5298 \end{pmatrix}.$$
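The Amari performance index used to score products such as WA can be sketched as follows. This is one widely used variant of the index; its normalization may differ from the authors', so its values need not match the reported P_err figures.

```python
import numpy as np

def amari_index(P):
    """One common variant of the Amari performance index: 0 exactly when
    P is a scaled permutation matrix, larger as P departs from one.
    (The normalization used by the authors may differ.)"""
    P = np.abs(np.asarray(P, dtype=float))
    n = P.shape[0]
    rows = (P.sum(axis=1) / P.max(axis=1) - 1.0).sum()
    cols = (P.sum(axis=0) / P.max(axis=0) - 1.0).sum()
    return (rows + cols) / (2.0 * n * (n - 1))

# A scaled permutation scores 0; keeping only the columns of WA that
# correspond to the recovered sources gives a square matrix to score.
assert amari_index([[0.0, 2.0], [-3.0, 0.0]]) == 0.0
print(amari_index([[1.0, 0.1], [0.2, 1.0]]))
```

A near-diagonal matrix such as the last example scores close to 0, consistent with the small P_err values reported above.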
Neglecting its second and fourth columns, the Amari performance index is P_err = 0.040, which indicates that the columns of the mixing matrix associated with s_1 and s_3 are successfully recovered. Figure 13 shows the output signals, with SNR values of 12.6 dB and 18.1 dB, respectively.

6 Conclusion

In this article, we considered the problem of subband decomposition ICA. We investigated the feasibility of adaptively separating mixtures generated by the subband decomposition ICA model. Based on the minimization of the mutual information between outputs, we developed an adaptive algorithm for subband decomposition ICA, called band-selective ICA. The advantage of this algorithm is that it automatically selects the frequency bands in which source subcomponents are most independent and attenuates the dependent subcomponents. Practical issues in implementing our method were considered, and some techniques were suggested to improve the performance of the algorithm. We also discussed the relationship between subband decomposition ICA and overcomplete ICA. By taking into account the information of the frequency bands of the sources, our algorithm can be exploited to solve one form of the overcomplete ICA problem, in which sources have specific frequency-localization characteristics. Experimental results have been given to illustrate the performance of our method for subband decomposition ICA as well as overcomplete ICA.
Appendix: Proof of the Rule, Equation 3.17

According to equation 3.16, we have
$$J = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} I(y_i, y_j) = \sum_{i=1}^{n} \left[ (n-1) H(y_i) - \frac{1}{2} \sum_{j=1,\, j\ne i}^{n} H(y_i, y_j) \right]$$
and
$$\frac{\partial H(y_i, y_j)}{\partial h_k} = -E\left[ \frac{\partial \log p_{y_i,y_j}(y_i, y_j)}{\partial h_k} \right] = -E\left[ \frac{\partial \log p_{y_i,y_j}(y_i, y_j)}{\partial y_i} \cdot \frac{\partial y_i}{\partial h_k} + \frac{\partial \log p_{y_i,y_j}(y_i, y_j)}{\partial y_j} \cdot \frac{\partial y_j}{\partial h_k} \right]$$
$$= -E\left[ \varphi_{y_i}(y_i(t), y_j(t)) \sum_{p=1}^{n} w_{i,p}\, x_p(t-k) + \varphi_{y_j}(y_i(t), y_j(t)) \sum_{p=1}^{n} w_{j,p}\, x_p(t-k) \right].$$
Also taking into account equation 3.9, we have
$$\frac{\partial J}{\partial h_k} = \sum_{i=1}^{n} \left[ (n-1) \frac{\partial H(y_i)}{\partial h_k} - \frac{1}{2} \sum_{j=1,\, j\ne i}^{n} \frac{\partial H(y_i, y_j)}{\partial h_k} \right]$$
$$= \sum_{i=1}^{n} E\left\{ \left[ \sum_{j=1,\, j\ne i}^{n} \varphi_{y_i}(y_i(t), y_j(t)) - (n-1)\, \psi_{y_i}(y_i(t)) \right] \cdot \sum_{p=1}^{n} w_{i,p}\, x_p(t-k) \right\}$$
(the factor 1/2 disappears because, after relabeling i and j, the two symmetric φ terms of each unordered pair contribute equally)
$$= E\left[ \gamma_y^T(t) \cdot W \cdot x(t-k) \right].$$
This is exactly equation 3.17.
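As a minimal numerical sketch (not the authors' implementation), one stochastic-gradient step on the filter taps implied by equation 3.17 might look as follows; the score-function-difference signal gamma, which the full algorithm estimates from data, is here simply supplied by the caller:

```python
import numpy as np

def update_filter(h, W, x, gamma, mu=0.05):
    """One stochastic-gradient step on the filter taps h, using
    dJ/dh_k = E[gamma_y(t)^T . W . x(t - k)]   (equation 3.17).
    x and gamma have shape (n_sources, T); gamma stands in for the
    score-function-difference signal."""
    h = np.asarray(h, dtype=float).copy()
    T = x.shape[1]
    Wx = W @ x
    for k in range(len(h)):
        # sample average of gamma(t)^T (W x(t - k)) over the valid t range
        grad_k = np.mean(np.sum(gamma[:, k:T] * Wx[:, :T - k], axis=0))
        h[k] -= mu * grad_k
    return h

# Deterministic toy check: constant signals make the gradient explicit
# (gamma^T W x = 2 at every t, so each tap moves by -mu * 2).
h_new = update_filter(np.zeros(2), np.array([[2.0]]),
                      np.ones((1, 4)), np.ones((1, 4)), mu=0.1)
print(h_new)   # [-0.2 -0.2]
```

In the actual algorithm this step alternates with the update of the demixing matrix W and with re-estimation of the score functions.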
Acknowledgments

This work was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China. We are very grateful to the anonymous referees for their helpful comments and suggestions. We also thank Deniz Erdogmus for helpful discussions.

References

Amari, S. (1999). Natural gradient for over- and under-complete bases in ICA. Neural Computation, 11, 1875–1883.
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Babaie-Zadeh, M., Jutten, C., & Nayebi, K. (2001a). Blind separating convolutive post-nonlinear mixtures. In Proc. ICA2001 (pp. 138–143). San Diego, CA.
Babaie-Zadeh, M., Jutten, C., & Nayebi, K. (2001b). Separating convolutive mixtures by mutual information minimization. In Proc. IWANN (pp. 834–842). New York: Springer.
Babaie-Zadeh, M., Jutten, C., & Nayebi, K. (2004). A minimization-projection (MP) approach for blind separating convolutive mixtures. In ICASSP 2004. Montreal, Canada.
Bach, F. R., & Jordan, M. I. (2003). Beyond independent components: Trees and clusters. Journal of Machine Learning Research, 4, 1205–1233.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Belouchrani, A., & Amin, M. G. (1998). Blind source separation based on time-frequency signal representations. IEEE Transactions on Signal Processing, 46, 2888–2897.
Cao, X.-R., & Liu, R.-W. (1996). General approach to blind source separation. IEEE Transactions on Signal Processing, 44, 562–571.
Cardoso, J.-F. (1997). Infomax and maximum likelihood for source separation. IEEE Letters on Signal Processing, 4, 112–114.
Cardoso, J.-F. (1998). Multidimensional independent component analysis. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP'98). Seattle, WA.
Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44, 3017–3030.
Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non-gaussian signals. IEE Proceedings-F, 140, 362–370.
Cichocki, A., & Amari, S. (2003). Adaptive blind signal and image processing: Learning algorithms and applications (rev. ed.). New York: Wiley.
Cichocki, A., Amari, S., Siwek, K., Tanaka, T., et al. (2003). ICALAB toolboxes for signal and image processing. Available online at http://www.bsp.brain.riken.jp/ICALAB/.
Cichocki, A., & Belouchrani, A. (2001). Source separation of temporally correlated sources using bank of band-pass filters. In Proceedings of 3rd International Conference on Independent Component Analysis and Blind Signal Separation (ICA2001) (pp. 173–178). San Diego, CA.
Cichocki, A., & Georgiev, P. (2003). Blind source separation algorithms with matrix constraints. IEICE Transactions on Information and Systems, Special Session on Independent Component Analysis and Blind Source Separation, E86-A(1), 522–531.
Cichocki, A., Rutkowski, T. M., & Siwek, K. (2002). Blind signal extraction of signals with specified frequency band. In Neural Networks for Signal Processing XII: Proceedings of the 2002 IEEE Signal Processing Society Workshop (pp. 515–524). Piscataway, NJ: IEEE.
Cichocki, A., & Zurada, J. M. (2004). Blind signal separation and extraction: Recent trends, future perspectives, and applications. In ICAISC 2004 (pp. 30–37). New York: Springer.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Cramér, H. (1962). Random variables and probability distributions (2nd ed.). Cambridge: Cambridge University Press.
Gharieb, R. R., & Cichocki, A. (2003). Second-order statistics based blind source separation using a bank of subband filters. Digital Signal Processing, 13, 252–274.
Hyvärinen, A. (1998). Independent component analysis for time-dependent stochastic processes. In Proc. Int. Conf. on Artificial Neural Networks (ICANN'98) (pp. 541–546). Skövde, Sweden.
Hyvärinen, A., Cristescu, R., & Oja, E. (1999). A fast algorithm for estimating overcomplete ICA bases for image windows. In Proc. Int. Joint Conf. on Neural Networks (pp. 894–899). Washington, DC.
Hyvärinen, A., & Hoyer, P. O. (2000). Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12, 1705–1720.
Hyvärinen, A., Hoyer, P. O., & Oja, E. (2001). Image denoising by sparse code shrinkage. In S. Haykin & B. Kosko (Eds.), Intelligent signal processing. Piscataway, NJ: IEEE Press.
Hyvärinen, A., & Inki, M. (2002). Estimating overcomplete independent component bases for image windows. Journal of Mathematical Imaging and Vision, 17, 139–152.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483–1492.
Kagan, A. M., Linnik, J. V., & Rao, C. R. (1973). Characterization problems in mathematical statistics. New York: Wiley.
Kiviluoto, K., & Oja, E. (1998). Independent component analysis for parallel financial time series. In Proc. ICONIP'98 (pp. 895–898). Tokyo, Japan.
Lewicki, M., & Sejnowski, T. J. (2000). Learning overcomplete representations. Neural Computation, 12, 337–365.
Liu, R. W., & Luo, H. (1998). Direct blind separation of independent non-gaussian signals with dynamic channels. In Proc. Fifth IEEE Workshop on Cellular Neural Networks and Their Applications (pp. 34–38). London, England.
Makeig, S., Bell, A., Jung, T.-P., & Sejnowski, T. J. (1996). Independent component analysis of electroencephalographic data. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 145–151). Cambridge, MA: MIT Press.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.
Pham, D. T. (2003). Fast algorithm for estimating mutual information, entropies and score functions. In Proceedings of ICA 2003. Nara, Japan.
Rickard, S., Balan, R., & Rosca, J. (2003). Blind source separation based on space-time-frequency diversity. In 4th International Symposium on Independent Component Analysis and Blind Source Separation (ICA2003). Nara, Japan.
Ristaniemi, T., & Joutsensalo, J. (1999). On the performance of blind source separation in CDMA downlink. In Proc. Int. Workshop on Independent Component Analysis and Signal Separation (ICA'99) (pp. 437–441). Aussois, France.
Samadi, S., Babaie-Zadeh, M., Jutten, C., & Nayebi, K. (2004). Blind source separation by adaptive estimation of score function difference. In Proc. ICA 2004 (pp. 9–17). New York: Springer.
Taleb, A., & Jutten, C. (1997). Entropy optimization: Application to blind source separation. In ICANN (pp. 529–534). New York: Springer.
Tanaka, T., & Cichocki, A. (2004). Subband decomposition independent component analysis and new performance criteria. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP'04) (pp. 541–544). Piscataway, NJ: IEEE.
Torkkola, K. (1996). Blind separation of convolved sources based on information maximization. In Proc. IEEE Workshop on Neural Networks for Signal Processing (pp. 423–432). Kyoto, Japan.
Vigário, R. (1997). Extraction of ocular artifacts from EEG using independent component analysis. Electroencephalography and Clinical Neurophysiology, 103, 395–404.
Vigário, R., Jousmäki, V., Hämäläinen, M., Hari, R., & Oja, E. (1998). Independent component analysis for identification of artifacts in magnetoencephalographic recordings. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 229–235). Cambridge, MA: MIT Press.
Ye, J.-M., Zhu, X.-L., & Zhang, X.-D. (2004). Adaptive blind separation with an unknown number of sources. Neural Computation, 16, 1641–1660.
Yilmaz, O., & Rickard, S. (2004). Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing, 52, 1830–1847.
Received February 7, 2005; accepted June 1, 2005.
LETTER
Communicated by Gabriel Huerta
On Consistency of Bayesian Inference with Mixtures of Logistic Regression

Yang Ge
[email protected]
Wenxin Jiang
[email protected] Department of Statistics, Northwestern University, Evanston, IL 60208, U.S.A.
This is a theoretical study of the consistency properties of Bayesian inference using mixtures of logistic regression models. When standard logistic regression models are combined in a mixtures-of-experts setup, a flexible model is formed to model the relationship between a binary (yes-no) response y and a vector of predictors x. Bayesian inference conditional on the observed data can then be used for regression and classification. This letter gives conditions on choosing the number of experts (i.e., number of mixing components) k or choosing a prior distribution for k, so that Bayesian inference is consistent, in the sense of often approximating the underlying true relationship between y and x. The resulting classification rule is also consistent, in the sense of having near-optimal performance in classification. We show these desirable consistency properties with a nonstochastic k growing slowly with the sample size n of the observed data, or with a random k that takes large values with nonzero but small probabilities. 1 Introduction Mixtures of experts (ME; Jacobs, Jordan, Nowlan, & Hinton, 1991) and hierarchical mixtures of experts (HME; Jordan & Jacobs, 1994) are popular techniques for regression and classification, and have attracted attention in the areas of neural networks and statistics. ME and HME are a variety of neural networks that have an interpretation of probabilistic mixture, in contrast to the usual neural nets, which are based on linear combinations. With mixture, instead of with linear combinations, simple models are combined in ME and HME for improved predictive capability. This structure of the probabilistic mixture allows the use of convenient computing algorithms such as the expectation maximization (EM) algorithm (Jordan & Xu, 1995) and the Gibbs sampler (Peng, Jacobs, & Tanner, 1996). The ME and HME are flexible constructions that can allow various models or experts to be mixed. 
For binary classification, simple and standard classifiers such as logistic regression models can be combined to model the

Neural Computation 18, 224–243 (2006)
© 2005 Massachusetts Institute of Technology
Mixtures of Logistic Regression
225
relationship between a binary response y ∈ {0, 1} and a predictor vector x. Such combined models can approximate arbitrary smooth relations between y and x in the sense of Jiang and Tanner (1999a). Peng, Jacobs, and Tanner (1996) apply mixtures of binary and multinomial logistic regression for pattern recognition. They found that a Bayesian approach based on simulating the posterior distribution gives better performance than a frequentist approach based on likelihood. Recently, Wood, Kohn, Jiang, and Tanner (2005) studied binary regression where probit-transformed spline models with different smoothing parameters are mixed, and Markov chain Monte Carlo methods are used to simulate the posterior distributions for both the model parameters and the number of mixing components. Their extensive simulations and empirical studies demonstrate excellent performance of the Bayesian approach and local adaptivity of the mixing paradigm. These empirical successes have motivated us to study the theoretical reasons behind the question: Why does the Bayesian procedure work well in such mixture models of binary regression? The purpose of this letter is to study the consistency properties of Bayesian inference for mixtures of binary logistic regression. Will inferential results based on the posterior distribution be reliable? In a Bayesian approach, the posterior distribution will propose various relationships between x and y, based on some observed data. We will investigate the conditions under which the proposed relationships are consistent or often close to the underlying true relationship. This will also imply that the resulting plug-in classification rule has near-optimal performance. There are several senses of consistency. The precise formulation of these problems is given in section 2. 
We assume that the true model possesses some unknown smooth mean function E(y|x), which can be outside the proposed ME family, and that the observed data $(y_i, x_i)_{i=1}^n$ are n independent and identical copies of (y, x). In section 3 we study the consistency problem for a sequence of ME models, where the number of experts (or mixture components) K = k_n increases with the sample size n. Such a construct allows a large number of experts eventually and enables good functional approximation (Jiang & Tanner, 1999a). We will show that the following condition on k_n leads to consistency: k_n increases to infinity at a rate slower than n^a for some a ∈ (0, 1), as the sample size n increases. Later in section 3, we consider the case when the number of experts K is considered to be random and follows a prior distribution. We will show that the critical conditions for consistency involve the prior on K: the prior is supported on all large values of K and has a sufficiently thin tail. Our work parallels that of Lee (2000), who studies similar properties (without classification consistency) for ordinary neural networks (NN) based on linear combinations. Lee's method involves truncating the space of all proposed models into a limited part and an unlimited part, and shows that (1) the unlimited part has very small prior probability satisfying some
226
Y. Ge and W. Jiang
tail condition; (2) the limited part is not too complicated, in the sense of satisfying an entropy condition; and (3) the prior is chosen to have not too small a probability mass around the true model, which is an approximation condition. Condition 3 typically involves some approximation results, since the prior-proposed models have to be able to get as near as possible to the true model; otherwise, the prior mass would be zero over some neighborhood of the true model. We implement these conditions for ME. However, we note that there is a fundamental difficulty resulting from the mechanism of approximation with ME. In the known mechanism (Jiang & Tanner, 1999a), to approximate a true relation arbitrarily well, ME with many experts crowded together and with large parameter values is used: the parameter values of the ME model (in the components that describe the changing of the mixing weights) increase with K. In a Bayesian approach, very small prior probability is typically given to such ME configurations; their large parameter values lie in the tail of the prior. Yet we would like to show that the resulting posterior probability of such configurations is not too small, since these configurations are close to the true relation. To handle this difficulty, we characterize how large the ME parameter values need to be for good approximation: values of order ln(K) are sufficient, which are not too far in the tail of the prior distribution. Such a result is established by embedding ME with K∗ (< K) experts as a subset of ME with K experts. (See lemma 5 and its proof.) When we consider the situation with random K, we face another difficulty: the usual priors on K, such as geometric or Poisson, cannot satisfy both conditions 1 and 2. If the truncation occurs at a K that is too large, the limited part of the proposed model space may become too complicated to satisfy the entropy condition.
If the truncation occurs at a K that is too small, the tails may be too thick to satisfy the tail condition. Such a dilemma was not discussed in Lee (2000), who did not consider the entropy condition for the case of random K. To handle this situation, we introduce a contraction sequence for the number of experts, K = k(i), which grows to infinity as i increases but grows more slowly than i. The prior probability, for example, $\lambda_i = (0.5)^i$ for the geometric, is then put on the index i. Since k(i) can stay unchanged over several values of i, this contraction sequence effectively groups the geometric probabilities together at each value of the number of experts and produces a thinner tail. We show that using suitably contracted geometric or Poisson priors on K, all conditions hold to produce consistency.

2 Notation and Definitions

2.1 Models. We first define the single-layer ME models in which logistic regression models are mixed.
The binary response variable is y ∈ {0, 1}, and x is an s-dimensional predictor. As in Jiang and Tanner (1999a, 1999b), we let the space of the predictor x be $[0, 1]^s = \otimes_{q=1}^{s} [0, 1]$ and let x have a uniform distribution on it. This is a convenient starting point, and the results can be easily adapted to the case when x has a positive density and is supported on a compact set. This convenient formulation results in several simplified relations. The joint density p(y, x) is the same as the conditional density p(y|x), which is completely determined by the conditional probability of a positive response P(y = 1|x), which is equal to the conditional mean or regression function µ(x) = E(y|x), which is alternatively formulated in a transformed version h(x) = log{µ(x)/(1 − µ(x))}, called the log-odds. We consider the family of smooth relations between y and x defined in Jiang and Tanner (1999a, 1999b): the family of joint densities p(y, x) such that the log-odds h(x) has continuous derivatives up to the second order, all bounded above by a constant, uniformly over x. Such a nonparametric family can be approximated by mixtures of logistic regressions (Jiang & Tanner, 1999a, 1999b). Define the family of mixtures of k logistic regression models as the set of joint densities f(y, x|θ) whose conditional densities have the form
$$f(y|x, \theta) = \sum_{j=1}^{k} g_j H_j, \quad \text{where } g_j = \frac{e^{u_j + v_j^T x}}{\sum_{l=1}^{k} e^{u_l + v_l^T x}}, \quad H_j = \left( \frac{e^{\alpha_j + \beta_j^T x}}{1 + e^{\alpha_j + \beta_j^T x}} \right)^{y} \left( \frac{1}{1 + e^{\alpha_j + \beta_j^T x}} \right)^{1-y}.$$
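As a numerical sketch of the density just defined (a softmax gate over k logistic experts; the parameter values below are arbitrary illustrations, not anything estimated in the letter):

```python
import numpy as np

def me_conditional(y, x, alpha, beta, u, v):
    """f(y | x, theta) for a mixture of k logistic-regression experts:
    gating weights g_j = softmax(u_j + v_j^T x); expert j models
    P(y = 1 | x) = logistic(alpha_j + beta_j^T x).
    Shapes: alpha, u -> (k,); beta, v -> (k, s); u[0] = v[0] = 0
    enforce the identifiability restriction in the text."""
    gate = u + v @ x
    g = np.exp(gate - gate.max())
    g /= g.sum()                                    # softmax gating weights
    p1 = 1.0 / (1.0 + np.exp(-(alpha + beta @ x)))  # expert P(y = 1 | x)
    return float(g @ (p1 if y == 1 else 1.0 - p1))

# k = 2 experts, s = 2 predictors; arbitrary illustrative parameters
alpha = np.array([0.0, 1.0]); beta = np.array([[1.0, -1.0], [0.5, 0.5]])
u = np.array([0.0, 0.3]);     v = np.array([[0.0, 0.0], [1.0, -2.0]])
x0 = np.array([0.2, 0.7])
p = me_conditional(1, x0, alpha, beta, u, v)
print(round(p + me_conditional(0, x0, alpha, beta, u, v), 12))  # 1.0
```

The two values sum to 1 because the gating weights form a probability vector and each expert is a proper Bernoulli model.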
The α, β, u, v's are parameters of the model. Except that we restrict u_1 = 0 and v_1 = 0 for the sake of identifiability, we allow all components of the parameter vectors to vary in (−∞, ∞). We denote by θ the combined vector of parameters $\theta = (\alpha_1, \beta_1^T, \ldots, \alpha_k, \beta_k^T, u_2, v_2^T, \ldots, u_k, v_k^T)^T \in \mathbb{R}^{\dim(\theta)}$, where dim(θ) = (s + 1)(2k − 1).

2.2 Bayesian Inference. The observed data set is (Y_1, X_1), ..., (Y_n, X_n), which we denote simply as (Y_i, X_i)^n. Here n is the sample size. We assume (Y_i, X_i)^n to be an independent and identically distributed (i.i.d.) sample of an unknown density f_0 from the nonparametric family of smooth relations. The mixture of logistic regression approach involves estimating the nonparametric f_0 using parametric relations f from the family of mixtures of k logistic regression models. We now describe Bayesian inference for uncovering f_0 based on (Y_i, X_i)^n. One first puts a prior distribution π_n proposing densities f from the k-mixture family (through the corresponding parameters θ). Conditional on the observed data, this prior produces a posterior distribution of f (through the corresponding θ):
$$\pi_n(d\theta \mid (Y_i, X_i)^n) = \frac{\prod_{i=1}^{n} f(Y_i, X_i \mid \theta)\, \pi_n(d\theta)}{\int \prod_{i=1}^{n} f(Y_i, X_i \mid \theta)\, \pi_n(d\theta)}.$$
Then the predictive density, which is the Bayes estimate of f_0, is given by $\hat f_n(\cdot) = \int f(\cdot \mid \theta)\, \pi_n(d\theta \mid (Y_i, X_i)^n)$.
Let $\mu_0(x) = E_{f_0}[Y \mid X = x] = \sum_{y=0,1} y f_0(y|x)$ be the true regression function; then $\hat\mu_n(x) = E_{\hat f_n}[Y \mid X = x]$ is the estimated regression function. For now, we will let k = k_n be nonstochastic and possibly depend on the sample size n, which explains the dependence of the prior on n. Later we will also consider the case when k = K is itself regarded as a random component of the parameter; a prior randomly decides to use an f from the k-mixture family with probability P(K = k), k = 1, 2, 3, .... The prior densities on the θ-components are assumed to be independent normal with zero mean and common positive variance σ². (The results can be easily generalized to cases with different means and variances.)

2.3 Consistency. We first define consistency in regression function estimation, which we will call R-consistency.

Definition 1 (R-Consistency). $\hat\mu_n$ is asymptotically consistent for $\mu_0$ if $\int (\hat\mu_n(x) - \mu_0(x))^2\, dx \stackrel{P}{\to} 0$ as $n \to \infty$.

Here and below, convergence in probability of the form $q\{(Y_i, X_i)^n\} \stackrel{P}{\to} q_0$, for any quantity dependent on the observed data, means $\lim_{n\to\infty} P_{(Y_i,X_i)^n}\left[\, |q\{(Y_i, X_i)^n\} - q_0| \le \epsilon \,\right] = 1$ for all $\epsilon > 0$, where (Y_i, X_i)^n is an i.i.d. random sample from the true density f_0. This definition describes a desirable property: the estimated regression function $\hat\mu_n$ is often (with $P_{(Y_i,X_i)^n}$ tending to one) close (in the $L_2$ sense) to the true $\mu_0$ for large n. Now we define consistency in terms of the density function, which we will term D-consistency. First, for any ε > 0, define a Hellinger ε-neighborhood by $A_\epsilon = \{f : D_H(f, f_0) \le \epsilon\}$, where $D_H(f, f_0) = \sqrt{\int (\sqrt{f} - \sqrt{f_0})^2\, dx\, dy}$ is the Hellinger distance.
Definition 2 (D-Consistency). Suppose (Y_i, X_i)^n is an i.i.d. random sample from density f_0. The posterior is asymptotically consistent for f_0 over Hellinger neighborhoods if, for any ε > 0,
$$\Pr(A_\epsilon \mid (Y_i, X_i)^n) \stackrel{P}{\to} 1 \quad \text{as } n \to \infty.$$
That is, the posterior probability of any Hellinger neighborhood of f 0 converges to 1 in probability. This definition describes a desirable property for the posterior-proposed joint density f to be often close to the true f 0 , for large n.
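As a quick numerical illustration of the Hellinger distance used here (univariate gaussian densities on a grid, a stand-in for the joint densities in the text):

```python
import numpy as np

def hellinger(p, q, dx):
    """Approximate D_H(p, q) = sqrt(int (sqrt(p) - sqrt(q))^2) for
    densities tabulated on a uniform grid with spacing dx."""
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx)

xs = np.linspace(-8.0, 8.0, 4001)
dx = xs[1] - xs[0]
gauss = lambda m: np.exp(-0.5 * (xs - m) ** 2) / np.sqrt(2.0 * np.pi)

print(hellinger(gauss(0.0), gauss(0.0), dx))   # 0.0
# For N(0,1) vs N(m,1) the exact value is sqrt(2 * (1 - exp(-m**2 / 8))),
# approaching sqrt(2) as the overlap vanishes.
print(hellinger(gauss(0.0), gauss(5.0), dx))
```

Identical densities give distance 0, and well-separated ones approach the maximum value of sqrt(2), matching the neighborhoods A_ε above.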
Now we define consistency in classification, which we will call C-consistency. Here we consider the use of the plug-in classification rule $\hat C_n(x) = I[\hat\mu_n(x) > 1/2]$ in predicting Y. We are interested in how the misclassification error $E_{(Y_i,X_i)^n} P\{\hat C_n(X) \ne Y \mid (Y_i, X_i)^n\}$ approaches the minimal error $P\{C_o(X) \ne Y\} = \inf_{C:\, \mathrm{Dom}(X) \to \{0,1\}} P\{C(X) \ne Y\}$, where $C_o(x) = I[\mu_0(x) > 1/2]$ is the ideal Bayes rule based on the (unknown) true mean function $\mu_0$.

Definition 3 (C-Consistency). Let $\hat B_n : \mathrm{Dom}(X) \to \{0, 1\}$ be a classification rule computable from the observed data (Y_i, X_i)^n. If $\lim_{n\to\infty} E_{(Y_i,X_i)^n} P\{\hat B_n(X) \ne Y \mid (Y_i, X_i)^n\} = P\{C_o(X) \ne Y\}$, then $\hat B_n$ is called a consistent classification rule.

It is straightforward to show that the three consistency concepts are related in our situation with binary data, where $\hat\mu_n$ and $\mu_0$ are bounded in [0, 1]:

Proposition 1 (Relations among the three consistencies). D-Consistency ⟹ R-Consistency ⟹ C-Consistency.

Proof. The first relation is due to lemma 4. The second relation is due to corollary 6.2 of Devroye, Györfi, and Lugosi (1996).

We will first establish D-consistency; R- and C-consistency then follow naturally.

3 Results and Conditions

We first consider the case when the number of experts K = k_n is a nonstochastic sequence depending on the sample size n.

Theorem 1 (Nonstochastic K). Let the prior for the parameters, π_n(dθ), be independent normal distributions with mean zero and fixed variance σ² for each parameter in the model. Let k_n be the number of experts in the model, such that

i. $\lim_{n\to\infty} k_n = \infty$, and
ii. $k_n \le n^a$ for all sufficiently large n, for some 0 < a < 1.

Then we have the following results:

a. The posterior distribution of the densities is D-consistent for f_0; that is, $\Pr(\{f : D_H(f, f_0) \le \epsilon\} \mid (Y_i, X_i)^n) \stackrel{P}{\to} 1$ as n → ∞, for all ε > 0.
b. The estimated regression function $\hat\mu_n$ is R-consistent for $\mu_0$; that is, $\int (\hat\mu_n - \mu_0)^2\, dx \stackrel{P}{\to} 0$ as n → ∞.
c. The plug-in classification rule $\hat{C}_n(x) = I[\hat{\mu}_n(x) > 1/2]$ is C-consistent for the Bayes rule $C_o(x) = I[\mu_0(x) > 1/2]$; that is, $\lim_{n\to\infty} E_{(Y_i,X_i)^n} P\{\hat{C}_n(X) \neq Y \mid (Y_i,X_i)^n\} = P\{C_o(X) \neq Y\}$.

Now we consider the case where the number of experts $K$ is a random parameter. We will consider the possibility that $K = k(I)$ is constructed from a more basic random index $I$, which, for example, can follow the geometric or the Poisson distribution. The function $k(\cdot)$ will be called a contraction function. The reason to introduce the contraction is that the sufficient condition we propose on $K$ requires a very thin tail probability; common distributions such as the geometric or Poisson can be used only after a tail-thinning contraction.

The densities $f(y, x \mid k, \theta)$ are now indexed by both the parameter vector $\theta$ and the number of experts $k$. The prior is $\pi(k, d\theta) = \tilde{\lambda}_k\, \pi(d\theta \mid k)$, where $\tilde{\lambda}_k = P[k(I) = k]$ and $\pi(d\theta \mid k)$ is again chosen to be the product of independent $N(0, \sigma^2)$ distributions on all components of $\theta$. The posterior distribution is then

$\pi(k, d\theta \mid (Y_i,X_i)^n) = \prod_{i=1}^n f(Y_i,X_i \mid k,\theta)\,\pi(k,d\theta) \Big/ \sum_{j=1}^\infty \int \prod_{i=1}^n f(Y_i,X_i \mid j,\theta)\,\pi(j,d\theta).$

The predictive density, which is the Bayes estimate of $f_0$, is then given by $\hat{f}_n(\cdot) = \sum_{k=1}^\infty \int f(\cdot \mid k,\theta)\,\pi(k, d\theta \mid (Y_i,X_i)^n)$. The corresponding estimated regression function is $\hat{\mu}_n(x) = \sum_{y=0,1} y \hat{f}_n(y \mid x)$, and the plug-in classification rule is $\hat{C}_n(x) = I[\hat{\mu}_n(x) > 1/2]$.

Theorem 2 (Random K). Suppose the priors $\pi(d\theta \mid k)$ conditional on the number of experts are independent normal with mean 0 and fixed variance $\sigma^2$. Suppose the prior on the number of experts $k(I)$ satisfies the following conditions:

iii. $P[k(I) = k] > 0$ for all sufficiently large $k$.
iv. The tail probabilities decrease at a faster-than-geometric rate; that is, there exists $q > 1$ such that, fixing any $r > 0$, $P[k(I) \ge k] \le \exp(-k^q r)$ for all sufficiently large $k$.

Then we have the following results:

d. The posterior distribution of the densities is D-consistent for $f_0$; that is, $\Pr(\{f : D_H(f, f_0) \le \epsilon\} \mid (Y_i,X_i)^n) \xrightarrow{P} 1$ as $n \to \infty$, for all $\epsilon > 0$.

e. The estimated regression function $\hat{\mu}_n$ is R-consistent for $\mu_0$; that is, $\int (\hat{\mu}_n - \mu_0)^2\,dx \xrightarrow{P} 0$ as $n \to \infty$.

f. The plug-in classification rule $\hat{C}_n(x) = I[\hat{\mu}_n(x) > 1/2]$ is C-consistent for the Bayes rule $C_o(x) = I[\mu_0(x) > 1/2]$; that is, $\lim_{n\to\infty} E_{(Y_i,X_i)^n} P\{\hat{C}_n(X) \neq Y \mid (Y_i,X_i)^n\} = P\{C_o(X) \neq Y\}$.
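For intuition, C-consistency of a plug-in rule can be watched numerically: estimate the regression function from binary data, threshold it at 1/2, and compare the resulting test error with the Bayes error. The sketch below is illustrative Python, not part of the letter; the choice of $\mu_0$ and the crude histogram ("regressogram") estimator are our own assumptions, standing in for the Bayesian ME estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def mu0(x):
    # Illustrative "true" regression function mu0(x) = P(Y = 1 | X = x)
    # on [0, 1]; its Bayes rule I[mu0(x) > 1/2] is simply I[x > 1/2].
    return 0.25 + 0.5 * x

def plug_in_test_error(n, bins=20, n_test=100_000):
    # Histogram estimate of mu0 from n binary observations, followed by
    # the plug-in rule C_hat(x) = I[mu_hat(x) > 1/2], scored on test data.
    x = rng.uniform(0, 1, n)
    y = rng.binomial(1, mu0(x))
    idx = np.minimum((x * bins).astype(int), bins - 1)
    mu_hat = np.array([y[idx == b].mean() if np.any(idx == b) else 0.5
                       for b in range(bins)])
    xt = rng.uniform(0, 1, n_test)
    yt = rng.binomial(1, mu0(xt))
    it = np.minimum((xt * bins).astype(int), bins - 1)
    return np.mean((mu_hat[it] > 0.5).astype(int) != yt)

# Minimal (Bayes) error here: E min(mu0(X), 1 - mu0(X)) = 0.375.
bayes_error = 0.375
err = plug_in_test_error(50_000)
```

With enough data, the plug-in test error settles just above 0.375, which is what C-consistency predicts for this simple $\mu_0$.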
Mixtures of Logistic Regression
The super-geometrically thin tail condition iv cannot be directly satisfied if the number of experts follows a common distribution such as the geometric or Poisson. However, if one applies a contraction $k(\cdot)$ to a geometric or Poisson random variable, where $k(\cdot)$ grows very slowly, the probability of a large contracted $k(I)$ can be made sufficiently small.

Remark 1. For example, consider contractions of the form $k(I) = \lfloor \chi(I) \rfloor + 1$, where $\lfloor u \rfloor$ denotes the integer part of $u$ and $\chi(I)$ is a strictly and slowly increasing function. It is easy to confirm that when $I$ is a geometric random variable, taking $\chi(I) = I^{1/[q(1+\delta)]}$ (for some $\delta > 0$ and $q > 1$) makes $k(I)$ satisfy condition iv, using the identity $P(I \ge B) = P(I > 0)^B$. When $I$ is a Poisson random variable, taking $\chi(I) = \{\ln(I+1)\}^{1/[q(1+\delta)]}$ (for some $\delta > 0$ and $q > 1$) makes $k(I)$ satisfy condition iv, after applying Chebyshev's inequality to obtain $P(I \ge B) \le EI/B$.

Below we first give the proofs of the main theorems. The lemmas used will be stated and proved later.

3.1 Proof of Theorem 1. The proof involves splitting the space of all $k_n$-expert densities into a limited part $F_n$ and an unlimited part $F_n^c$ and applying proposition 2 below. Let $F_n$ be the set of ME models with each parameter bounded by $C_n$ in absolute value, that is, $|u_j| \le C_n$, $|v_{jh}| \le C_n$, $|\alpha_j| \le C_n$, $|\beta_{jh}| \le C_n$, $j = 1, \dots, k_n$, $h = 1, \dots, s$, where $C_n$ grows with $n$ such that $n^{1/2+\eta} \le C_n \le \exp(n^{b-a})$ for some $\eta > 0$ and $0 < a < b < 1$ ($a$ is the same $a$ as in the bound on $k_n$).
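Returning briefly to remark 1: the tail-thinning contraction can be sanity-checked numerically. The following illustrative Python sketch (the values of $\rho$, $q$, and $\delta$ are our own choices, not from the letter) verifies that the contracted geometric tail falls below the condition iv bound $\exp(-k^q r)$ with $r = 1$:

```python
import math

def tail_contracted_geometric(k, rho=0.5, q=2.0, delta=0.5):
    """P[k(I) >= k] for the contraction k(I) = floor(I**(1/(q*(1+delta)))) + 1,
    where I is geometric on {0, 1, 2, ...} with P(I >= B) = rho**B.

    k(I) >= k  iff  I**(1/(q*(1+delta))) >= k - 1
               iff  I >= (k - 1)**(q*(1+delta)).
    """
    B = math.ceil((k - 1) ** (q * (1 + delta)))
    return rho ** B

# Condition iv with r = 1: the contracted tail drops below exp(-k**q)
# once k is moderately large (here from k = 4 on, for these parameters).
for k in range(4, 11):
    assert tail_contracted_geometric(k, q=2.0) <= math.exp(-float(k) ** 2.0)
```

The uncontracted geometric tail $\rho^k$ decays only geometrically and would fail this check for every $r$, which is exactly why the contraction is needed.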
Proposition 2 (Lee, 2000, theorem 2). Suppose the following conditions hold:

Tail condition 1. There exist $r > 0$ and $N_1$ such that $\pi_n(F_n^c) < \exp(-nr)$ for all $n \ge N_1$.

Entropy condition 2. There exists a constant $c > 0$ such that for all $\varepsilon > 0$, $\int_0^\varepsilon \sqrt{H_{[\,]}(u)}\,du \le c\sqrt{n}\,\varepsilon^2$ for all sufficiently large $n$.

Approximation condition 3. For all $\gamma, \nu > 0$, there exists an $N_2$ such that $\pi_n(KL_\gamma) \ge \exp(-n\nu)$ for all $n \ge N_2$.

Then the posterior is asymptotically consistent for $f_0$ over Hellinger neighborhoods,
that is, for any $\epsilon > 0$, $\Pr(\{f : D_H(f, f_0) \le \epsilon\} \mid (Y_i,X_i)^n) \xrightarrow{P} 1$ as $n \to \infty$.

Here, for any $\gamma > 0$, define a Kullback-Leibler $\gamma$-neighborhood by $KL_\gamma = \{f : D_K(f, f_0) \le \gamma\}$, where $D_K(f, f_0) = \int f_0 \ln(f_0/f)\,dx\,dy$ is the Kullback-Leibler divergence.

This proposition was proved in Lee (2000), theorem 2, where the entropy condition was used but not stated explicitly. Here $H_{[\,]}(\cdot)$ is the Hellinger bracketing entropy defined in the following steps, where the family of functions is taken to be $F^* = \{\sqrt{f} : f \in F_n\}$, the set of square roots of densities from $F_n$, and the metric is the $L_2$-norm, so that $\|\sqrt{f} - \sqrt{g}\|_2 = D_H(f, g)$, the Hellinger distance, for any two densities $f$ and $g$.

Definition 4 (Brackets and bracketing entropy).

i. For any two functions $l$ and $u$, define the bracket $[l, u]$ as the set of all functions $f$ such that $l \le f \le u$.
ii. Let $\|\cdot\|$ be a metric. Define an $\varepsilon$-bracket as a bracket with $\|u - l\| \le \varepsilon$.
iii. Define the bracketing number of a set of functions $F^*$ as the minimum number of $\varepsilon$-brackets needed to cover $F^*$, and denote it by $N_{[\,]}(\varepsilon, F^*, \|\cdot\|)$.
iv. The bracketing entropy, $H_{[\,]}(\cdot) = \ln N_{[\,]}(\cdot, F^*, \|\cdot\|)$, is the natural logarithm of the bracketing number.

Now we prove theorem 1. Lemma 1 guarantees the tail condition, lemma 2 the entropy condition, and lemma 3 the approximation condition. Therefore, we have D-consistency by proposition 2. The R- and C-consistencies follow from proposition 1.

3.2 Proof of Theorem 2. Let $G_m$ be a restricted set of mixtures-of-$m$-experts models whose parameter components are all bounded by $C_n m$ in absolute value, with $n^{1/2+\eta} \le C_n \le \exp(n^{b-a})$ for some $\eta > 0$ and $0 < a < b < 1$, where $a = 1/q$. Such restricted sets $G_m$ are nested by proposition 3. We let $F_n = \cup_{k=1}^{k_n} G_k$, where $k_n = \lfloor (cn)^a \rfloor$, $a = 1/q \in (0,1)$, and $c \in (0,1]$.

Tail condition 1. Note that $\pi(F_n^c) = 1 - \pi(F_n) \le \sum_{k=k_n+1}^\infty \tilde{\lambda}_k + \sum_{k=1}^{k_n} \tilde{\lambda}_k\, \pi(\|\theta\|_\infty > C_n k \mid k)$, where $\|\theta\|_\infty$ is the maximum absolute value of the components of $\theta$.
For all sufficiently large $n$ and all $r > 0$, the tail probability $\sum_{k=k_n+1}^\infty \tilde{\lambda}_k$ is less than $e^{-nr}/2$ due to condition iv. All tail probabilities $\pi(\|\theta\|_\infty > C_n k \mid k)$ are also less than $e^{-nr}/2$, due to the choice of $C_n$ and the normality of $\pi(d\theta \mid k)$ (using Mill's ratio for normal tail probabilities).
Therefore, $\pi(F_n^c) \le e^{-nr}$ for all sufficiently large $n$ and all $r > 0$, establishing tail condition 1.

Entropy condition 2. Note that $F_n = \cup_{k=1}^{k_n} G_k = G_{k_n}$, since the sets of density functions represented by $G_k$ increase with $k$ (see proposition 3). The entropy condition can then be computed for $G_{k_n}$, where the bound on the parameter values is now $C_n k_n$ instead of the previous bound $C_n$. Repeating the proof of the entropy condition as before shows that the condition still holds.

Approximation condition 3. Fix any $\gamma > 0$. Then $\pi(KL_\gamma) = \sum_{k=1}^\infty P(K = k)\,\pi(KL_\gamma \mid k) \ge P(K = k_n)\,\pi(KL_\gamma \mid k_n) > 0$, since $P(K = k_n)$ is positive (guaranteed by condition iii of theorem 2) and $\pi(KL_\gamma \mid k_n) > e^{-n\nu}$, fixing any $\nu > 0$, for all large enough $n$, which was proved for nonstochastic $k_n$ above. (Here $k_n = \lfloor (cn)^a \rfloor$ is less than $n^a$ and increases to $\infty$.) Therefore, $\pi(KL_\gamma) \ge e^{-n\nu}$ for all sufficiently large $n$, fixing any $\nu > 0$, since the left-hand side is positive and does not depend on $n$. This establishes approximation condition 3.

All conditions of proposition 2 hold, so D-consistency holds, which further implies R- and C-consistency.

4 Lemmas Used for Proving the Theorems

In the first three lemmas, the number of experts $k_n$ satisfies conditions i and ii of theorem 1. The prior $\pi_n$ for the parameters makes each parameter an independent normal with mean 0 and fixed variance $\sigma^2$. The dimension of the parameters, $\dim(\theta)$, will be denoted $d_n = (s+1)(2k_n - 1)$ both here and in the proofs below.

Lemma 1 (for the tail condition). There exists a constant $r > 0$ such that $\pi_n(F_n^c) < \exp(-nr)$ for all sufficiently large $n$. Here $F_n$ is the limited part of the $k_n$-experts family defined in section 3.1.

Lemma 2 (for the entropy condition). Consider the family of square-root densities $F^* = \{\sqrt{f} : f \in F_n\}$ defined in section 3.1. Then the following relations hold for the Hellinger bracketing entropy $H_{[\,]}(\cdot)$ of $F^*$:

a. $H_{[\,]}(u) \le \ln\left[\left(\frac{4C_n^2 d_n}{u}\right)^{d_n}\right]$.

b. There exists a constant $c > 0$ such that for all $\varepsilon > 0$, $\int_0^\varepsilon \sqrt{H_{[\,]}(u)}\,du \le c\sqrt{n}\,\varepsilon^2$ for all sufficiently large $n$.
Lemma 3 (for the approximation condition). For all $\gamma, \nu > 0$, there exists an $N_2$ such that $\pi_n(KL_\gamma) \ge \exp(-n\nu)$ for all $n \ge N_2$. Here $KL_\gamma$ is the Kullback-Leibler neighborhood defined in section 3.1.

The following lemma holds whether or not the number of experts is random. Using the notation of sections 2.2 and 2.3, we have:

Lemma 4 (regression function versus density function).

a. $\int (\hat{\mu}_n - \mu_0)^2\,dx \le 4 D_H^2(\hat{f}_n, f_0)$.
b. $D_H^2(\hat{f}_n, f_0) \le \epsilon^2 + 4\pi_n[\{f : D_H(f, f_0) > \epsilon\} \mid (Y_i,X_i)^n]$ for all $\epsilon > 0$.

The next proposition is used to form the nested sequence of restricted models in section 3.2 for proving consistency with a random number of experts.

Proposition 3 (Nesting). Let $G_m$ be the restricted set of $m$-expert models $\{f : |\theta_l| < Cm,\ \forall\, 1 \le l \le \dim(\theta)\}$ for some $C \ge 1$ not depending on $m$. If $m' \ge m$, then $G_m \subseteq G_{m'}$. Here the $m$-expert family is as defined in section 2.1.

The proofs of these results are contained in the appendix.

5 Conclusions

Our work shows that Bayesian inference based on mixtures of logistic regressions can be a reliable tool for estimating the regression function and the joint density, as well as for binary classification. We expect that analogous properties may be studied in multiway classification, where multinomial logistic regression models are mixed. This, together with Bayesian inference based on mixtures of generalized linear models (such as mixtures of Poisson and gamma regressions), forms a natural topic for future research.

So far, we have focused on classification rules of the form $\hat{C}_n(x) = I[\hat{\mu}_n(x) > 1/2]$. However, as a referee points out, the concept of C-consistency can also be extended to rules of the form $\hat{C}_n(x) = I[\hat{\mu}_n(x) > r]$ for some $r \in (0,1)$, which may be useful in situations with asymmetric costs, as when misclassifying $Y = 1$ as 0 costs more than misclassifying $Y = 0$ as 1.

A long-standing question in ME theory is the selection of the number of experts (or mixing components) $K$.
The current work provides insight from the viewpoint of Bayesian inference, either by choosing a nonstochastic sequence $K = k_n$ depending on the sample size $n$ or by treating $K$ as random and placing a suitable prior on it. The latter approach is especially interesting, since it can generate a posterior distribution on $K$ conditional on the observed data: $\pi(K \mid \text{data}) = \int_\theta \pi(K, d\theta \mid \text{data})$. This method of inference on $K$
is in some sense robust and protective against model misspecification; it does not need to assume a true model with a fixed number of experts $k_0$. The true model is a nonparametric one with an arbitrary smooth relation in the family defined in section 2.1. In general, there is no "true number of experts" $K$. What $\pi(K \mid \text{data})$ proposes are "good $K$'s" rather than a "true $K$": the $K$'s for good approximating models from the ME family.

It may also be interesting to consider a random $K$ with a prior distribution of finite support that increases with $n$. This in some sense combines the approaches of the two theorems. The motivation is that we would like the number of experts $K$ to be random in order to search over a range of values; on the other hand, we would like $K$ to be not too large, in order to reduce computation. (A large $K$ would correspond to a high-dimensional parametric model.) There are several possibilities leading to consistent Bayesian inference. One can use a truncated prior $P[K = k] = P[k(I) = k]\, I[k \le B_n] / P[k(I) \le B_n]$, $k = 1, 2, \dots$. Here $k(I)$ satisfies conditions iii and iv of theorem 2 and can be, for example, the contracted Poisson or contracted geometric random variable described in remark 1. The truncation bound can be taken to be, for example, $B_n = \lfloor 2(cn)^{1/q} \rfloor + 1$, where $q > 1$ is the same as in condition iv and $c \in (0,1]$. One can also use a uniform prior, such as $P[K = k] = \lfloor (cn)^a \rfloor^{-1} I[k \le (cn)^a]$, $k = 1, 2, \dots$, for some $a \in (0,1)$ and $c \in (0,1]$. Both can easily be shown to lead to consistent Bayesian inference by adapting the proof of theorem 2.

Appendix: Secondary Propositions and Proofs

Denote $f = f(y \mid x; k, \theta)$ for a mixture-of-$k$-experts (conditional) density. Then the following two propositions hold for any $(k, \theta)$ and any $(y, x) \in \{0,1\} \times [0,1]^s$; they will be useful later.

Proposition 4. $f \le 1$.

Proposition 5. $\left|\frac{\partial \ln f}{\partial \theta_l}\right| \le 1$, where $\theta_l$ is the $l$th element of $\theta$, for each $l = 1, \dots, \dim(\theta)$.
The following lemma will be used in proving lemma 3.

Lemma 5. Let $f$ be the mixture-of-experts model with parameters $(\theta_1, \dots, \theta_{d_n})$ and $\tilde{f}$ be another mixture-of-experts model with parameters $(\tilde{\theta}_1, \dots, \tilde{\theta}_{d_n})$. Suppose the numbers of experts of $f$ and $\tilde{f}$ are both $k_n$, where $k_n$ grows to infinity with $n$ and $k_n \le n^a$ for some $0 < a < 1$, for all large enough $n$. Define a $\delta$-neighborhood of $f$:

$M_\delta^n(f) = \{\tilde{f} : |\theta_i - \tilde{\theta}_i| \le \delta,\ i = 1, 2, \dots, d_n\}.$
Then the following holds for any $\gamma > 0$: given any $f_0$ in the "smooth nonparametric" family defined in section 2.1, for all sufficiently large $n$ there exist $\delta$ and $f$ such that $M_\delta^n(f) \subseteq KL_\gamma$ (i.e., for any $\tilde{f} \in M_\delta^n(f)$, we have $D_K(\tilde{f}, f_0) \le \gamma$), where $\delta = \frac{\gamma}{4(s+1)n^a}$ and $f$ is a $k_n$-expert density with parameter components satisfying $\max_{k=1}^{d_n} |\theta_k| \le c(\gamma) + \ln k_n$, for some constant $c(\gamma)$ depending on $\gamma$ but not on $n$.

Proof of Proposition 4. $f = \sum_{j=1}^k g_j H_j \le \sup_j H_j \cdot \sum_{j=1}^k g_j = \sup_j H_j \le 1$, since

$H_j = \left(\frac{e^{\alpha_j + \beta_j^T x}}{1 + e^{\alpha_j + \beta_j^T x}}\right)^{y}\left(\frac{1}{1 + e^{\alpha_j + \beta_j^T x}}\right)^{1-y} \le 1.$
Proof of Proposition 5. Note that for each $l = 1, \dots, \dim(\theta)$,

$\left|\frac{\partial \ln f}{\partial \theta_l}\right| = \left|\frac{\partial}{\partial \theta_l} \ln \sum_{j=1}^k g_j H_j\right| = \left|\frac{\sum_j \left[\frac{\partial}{\partial \theta_l} \ln(g_j H_j)\right] (g_j H_j)}{\sum_j g_j H_j}\right| \le \sup_j \left|\frac{\partial \ln(g_j H_j)}{\partial \theta_l}\right|.$

Since for each $j = 1, \dots, k$,

$\ln(g_j H_j) = (u_j + v_j^T x) - \ln \sum_{j'} e^{u_{j'} + v_{j'}^T x} + y(\alpha_j + \beta_j^T x) - \ln\left(1 + e^{\alpha_j + \beta_j^T x}\right),$

it is easy to show that $\left|\frac{\partial \ln(g_j H_j)}{\partial \theta_l}\right| \le \max\{|x_1|, \dots, |x_s|, 1\} = 1$. So $\left|\frac{\partial \ln f}{\partial \theta_l}\right| \le 1$.
Proof of Lemma 1.

$\pi_n(F_n^c) = \Pr(\text{at least one } |\theta_l| > C_n,\ l = 1, \dots, d_n) \le \sum_{l=1}^{d_n} \Pr(|\theta_l| > C_n) = 2\sum_{l=1}^{d_n} \Pr\left(\frac{\theta_l}{\sigma} > \frac{C_n}{\sigma}\right)$

$\le \frac{2 d_n \sigma}{C_n}\,\frac{1}{\sqrt{2\pi}}\,\exp\left(-\frac{C_n^2}{2\sigma^2}\right)$ (by Mill's ratio)

$\le \exp\left(-\frac{n^{1+2\eta}}{2\sigma^2} + \ln\left[(s+1)(2n^a - 1)\,\frac{2\sigma}{\sqrt{2\pi}}\right]\right)$ (noting $k_n \le n^a$ and $C_n \ge n^{1/2+\eta} \ge 1$)

$\le \exp(-nr)$ for any $r > 0$, for all sufficiently large $n$, since $\eta > 0$.
Proof of Lemma 2a. Write $f_t = f(y, x \mid k, t)$ to simplify notation while showing the dependence on the parameter vector $t$, and let $\|t\|_\infty = \sup_{j=1}^{\dim(t)} |t_j|$ denote the $L_\infty$ norm of $t$. Then

$|f_t - f_s| = \left|\sum_{l=1}^{d_n} \frac{\partial f_\theta}{\partial \theta_l}\,(t_l - s_l)\right|$ ($\theta$ is an intermediate point between $t$ and $s$)

$= \left|\sum_{l=1}^{d_n} (t_l - s_l)\left(\frac{\partial}{\partial \theta_l}\ln f_\theta\right) f_\theta\right| \le \sum_{l=1}^{d_n} \sup_l |t_l - s_l| \cdot \frac{1}{2}\left|\frac{\partial \ln f_\theta}{\partial \theta_l}\right| \cdot |f_\theta| \le \frac{1}{2}\, d_n \|t - s\|_\infty$ (by propositions 4 and 5).

Since $C_n$ grows with $n$ such that $n^{1/2+\eta} \le C_n \le \exp(n^{b-a})$, we have $C_n \ge 1$. Then $|f_t - f_s| \le \|t - s\|_\infty \cdot \frac{C_n d_n}{2}$. Let $F(x, y) = C_n d_n / 2$. By theorem 3 and equation 15 of Lee (2000),

$N_{[\,]}(2\varepsilon\|F\|_2,\ F^*,\ \|\cdot\|_2) \le N(\varepsilon, F_n, L_\infty) \le \left(\frac{C_n + 1}{\varepsilon}\right)^{d_n}.$

Here $N(\varepsilon, F_n, \|\cdot\|)$ is the covering number, that is, the minimal number of balls of radius $\varepsilon$ required to cover the set $F_n$ under the specified metric. Now, $2\varepsilon\|F\|_2 = 2\varepsilon\left(\int \sum_{y=0,1} (C_n d_n/2)^2\,dx\right)^{1/2} = \sqrt{2}\,\varepsilon\, C_n d_n$; replacing $2\varepsilon\|F\|_2$ by $\varepsilon$ gives

$N_{[\,]}(\varepsilon, F^*, \|\cdot\|_2) \le \left(\frac{C_n + 1}{\varepsilon/(\sqrt{2}\, C_n d_n)}\right)^{d_n} = \left(\frac{\sqrt{2}\, C_n (C_n + 1) d_n}{\varepsilon}\right)^{d_n} \le \left(\frac{4 C_n^2 d_n}{\varepsilon}\right)^{d_n}.$

Therefore, $H_{[\,]}(u) = \ln N_{[\,]}(u, F^*, \|\cdot\|_2) \le \ln\left[\left(\frac{4 C_n^2 d_n}{u}\right)^{d_n}\right]$.

Proof of Lemma 2b. By the result of lemma 2a,

$\int_0^\varepsilon \sqrt{H_{[\,]}(u)}\,du \le \int_0^\varepsilon \sqrt{d_n \ln\left(\frac{4 C_n^2 d_n}{u}\right)}\,du = 4 C_n^2 d_n \sqrt{\frac{d_n}{2}} \int_{v(\varepsilon)}^{\infty} v^2 e^{-v^2/2}\,dv,$

where $v(u) = \sqrt{2\ln(4 C_n^2 d_n / u)}$ and we substituted $u = 4 C_n^2 d_n\, e^{-v^2/2}$, $du = -4 C_n^2 d_n\, v e^{-v^2/2}\,dv$. Integrating by parts,

$= 4 C_n^2 d_n \sqrt{\frac{d_n}{2}}\left[v(\varepsilon)\, e^{-v(\varepsilon)^2/2} + \int_{v(\varepsilon)}^{\infty} e^{-v^2/2}\,dv\right]$
Using $e^{-v(\varepsilon)^2/2} = \varepsilon/(4 C_n^2 d_n)$ and Mill's ratio for the normal tail, $\int_{v}^{\infty} e^{-t^2/2}\,dt \le e^{-v^2/2}/v$, this is

$\le \varepsilon\sqrt{\frac{d_n}{2}}\left[\sqrt{2\ln\frac{4 C_n^2 d_n}{\varepsilon}} + \frac{1}{\sqrt{2\ln(4 C_n^2 d_n/\varepsilon)}}\right] \le 2\varepsilon\sqrt{d_n\left[\ln C_n^2 + \ln(4 d_n) - \ln\varepsilon\right]}$

for all large enough $n$. Noting that $d_n = (s+1)(2k_n - 1) \le 2(s+1)n^a$ and $C_n \le \exp(n^{b-a})$, we have

l.h.s. $\le 2\varepsilon\sqrt{2(s+1)n^a\left[2n^{b-a} + \ln 8(s+1) + \ln n^a - \ln\varepsilon\right]}$.

Since $0 < a < b < 1$, there exists $t$ such that $0 < a < t < b < 1$ and $b - a < 1 - t$, so

$\frac{1}{\sqrt{n}}\int_0^\varepsilon \sqrt{H_{[\,]}(u)}\,du \le 2\varepsilon\sqrt{2(s+1)n^a n^{-t}}\,\sqrt{n^{-(1-t)}\left[2n^{b-a} + \ln 8(s+1) + \ln n^a - \ln\varepsilon\right]} \to 0 \quad \text{as } n \to \infty.$

So there exists $c > 0$ such that for all $\varepsilon > 0$, $\int_0^\varepsilon \sqrt{H_{[\,]}(u)}\,du \le c\sqrt{n}\,\varepsilon^2$ for all sufficiently large $n$.
Proof of Lemma 3. We use the neighborhood $M_\delta = M_\delta^n(f)$ of lemma 5 to prove the result. By lemma 5, $M_\delta \subseteq KL_\gamma$ for all sufficiently large $n$. Then

$\pi_n(KL_\gamma) \ge \pi_n(M_\delta) = \Pr\left(\bigcap_{l=1}^{d_n}\left\{|\theta_l - \tilde{\theta}_l| \le \delta\right\}\right) = \prod_{l=1}^{d_n} \int_{\theta_l - \delta}^{\theta_l + \delta} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{u^2}{2\sigma^2}\right) du$

$\ge \prod_{l=1}^{d_n} \frac{2\delta}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(|\theta_l| + \delta)^2}{2\sigma^2}\right) \ge \prod_{l=1}^{d_n} \frac{2\delta}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(c(\gamma) + \ln k_n + \delta)^2}{2\sigma^2}\right)$

$= \exp\left(-d_n\left[\frac{(c(\gamma) + \ln k_n + \delta)^2}{2\sigma^2} + \ln\frac{\sqrt{2\pi\sigma^2}}{2\delta}\right]\right)$

$\ge \exp\left(-n^a \cdot 2(s+1) \cdot \frac{(2\ln n^a)^2}{2\sigma^2}\right)$ for all large enough $n$ (using $k_n \le n^a$, $d_n \le 2(s+1)n^a$, and $\delta = \frac{\gamma}{4(s+1)n^a}$)

$\ge \exp(-n\nu)$ for all large enough $n$, fixing any $\nu > 0$, since $a \in (0,1)$.

Proof of Lemma 4a.
$\int (\hat{\mu}_n - \mu_0)^2\,dx = \int \left(\int y(\hat{f}_n - f_0)\,dy\right)^2 dx = \int \left(\int y\left(\sqrt{\hat{f}_n} + \sqrt{f_0}\right)\left(\sqrt{\hat{f}_n} - \sqrt{f_0}\right) dy\right)^2 dx$

$\le \int \left[\int y^2\left(\sqrt{\hat{f}_n} + \sqrt{f_0}\right)^2 dy\right]\left[\int \left(\sqrt{\hat{f}_n} - \sqrt{f_0}\right)^2 dy\right] dx$ (by the Cauchy-Schwarz inequality),

since

$\int y^2\left(\sqrt{\hat{f}_n} + \sqrt{f_0}\right)^2 dy = \int y^2\left(\hat{f}_n + f_0 + 2\sqrt{\hat{f}_n f_0}\right) dy \le 2\int y^2(\hat{f}_n + f_0)\,dy = 2\sum_{y=0}^{1} y^2(\hat{f}_n + f_0) \le 4$ (by proposition 4).

Then $\int (\hat{\mu}_n - \mu_0)^2\,dx \le 4\int\int \left(\sqrt{\hat{f}_n} - \sqrt{f_0}\right)^2 dy\,dx = 4 D_H^2(\hat{f}_n, f_0)$.

Proof of Lemma 4b. Write $\hat{f}_n = \int f\,\pi_n(d\theta \mid (Y_i,X_i)^n) = E_{\theta\mid\cdot}\, f$, and let $A_\epsilon = \{f : D_H(f, f_0) \le \epsilon\}$ as in section 2.3. Then

$D_H^2(\hat{f}_n, f_0) = \int \left(\sqrt{E_{\theta\mid\cdot} f} - \sqrt{f_0}\right)^2 dy\,dx = \int \left(f_0 + E_{\theta\mid\cdot} f - 2\sqrt{f_0\, E_{\theta\mid\cdot} f}\right) dy\,dx = 2 - 2\int \sqrt{f_0\, E_{\theta\mid\cdot} f}\,dy\,dx$
$\le 2 - 2\int E_{\theta\mid\cdot} \sqrt{f f_0}\,dy\,dx$ (by Jensen's inequality)

$= E_{\theta\mid\cdot}\left[2 - 2\int \sqrt{f f_0}\,dy\,dx\right]$ (by Fubini's theorem)

$= E_{\theta\mid\cdot} \int \left(f + f_0 - 2\sqrt{f f_0}\right) dy\,dx = E_{\theta\mid\cdot} \int \left(\sqrt{f} - \sqrt{f_0}\right)^2 dy\,dx = E_{\theta\mid\cdot}\, D_H^2(f, f_0)$

$= \int D_H^2(f, f_0)\,\pi_n(d\theta \mid (Y_i,X_i)^n)$

$= \int_{A_\epsilon} D_H^2(f, f_0)\,\pi_n(d\theta \mid (Y_i,X_i)^n) + \int_{A_\epsilon^c} D_H^2(f, f_0)\,\pi_n(d\theta \mid (Y_i,X_i)^n)$

$\le \epsilon^2 + \int_{A_\epsilon^c} \int \left(\sqrt{f} - \sqrt{f_0}\right)^2 dy\,dx\;\pi_n(d\theta \mid (Y_i,X_i)^n)$

$\le \epsilon^2 + 2\int_{A_\epsilon^c} \int (f + f_0)\,dy\,dx\;\pi_n(d\theta \mid (Y_i,X_i)^n)$

$= \epsilon^2 + 4\int_{A_\epsilon^c} \pi_n(d\theta \mid (Y_i,X_i)^n) = \epsilon^2 + 4\pi_n(\{f : D_H(f, f_0) > \epsilon\} \mid (Y_i,X_i)^n).$
It is easy to see that lemmas 4a and 4b also hold for the case with a random number of experts $k$, by augmenting the integration over $d\theta$ with a sum over $k$.

Proof of Lemma 5. Jiang and Tanner (1999b, theorem 2) show that $\sup_{f_0} \inf_{f} D_K(f, f_0) \le \frac{c}{k^{4/s}}$ for some $c > 0$ independent of $k$, for each $k = 1, 2, 3, \dots$, where the supremum is over the smooth nonparametric family, the infimum is over $k$-expert models, and $s = \dim(x)$. Therefore, given any $\gamma > 0$, for any true model $f_0$ in the family, there exists a $k^*$-experts model $f^*$, with $k^*$ large enough, such that

$D_K(f^*, f_0) \le \frac{c}{(k^*)^{4/s}} + \frac{\gamma}{4} < \frac{\gamma}{2}.$
This $k^*$-experts density $f^*$ can be written as a $k_n$-experts density $f$ (with $k_n > k^*$ for all large enough $n$) if we let

$u_j = u_j^*, \quad j = 1, \dots, k^* - 1; \qquad u_j = u_{k^*}^* - \ln(k_n - k^* + 1), \quad j = k^*, \dots, k_n;$

and

$(\alpha_j, \beta_j, v_j) = (\alpha_j^*, \beta_j^*, v_j^*), \quad j = 1, \dots, k^* - 1; \qquad (\alpha_j, \beta_j, v_j) = (\alpha_{k^*}^*, \beta_{k^*}^*, v_{k^*}^*), \quad j = k^*, \dots, k_n.$
Here the $u$'s, $v$'s, $\alpha$'s, and $\beta$'s are components of the parameter $\theta$ for the density $f$; the $u^*$'s, $v^*$'s, $\alpha^*$'s, and $\beta^*$'s are components of the parameter $\theta^*$ for the density $f^*$. This parameterization for embedding is explained in the proof of proposition 3. It implies that there also exists a $k_n$-experts model $f$ such that $D_K(f, f_0) = D_K(f^*, f_0) < \gamma/2$ for all sufficiently large $n$. Let $\theta^*$ and $\theta$ denote the vectors of parameters of the $k^*$-experts and $k_n$-experts models, respectively. From the above parameter settings, we have

$\|\theta\|_\infty \le \max\{\|\theta^*\|_\infty,\ |u_{k^*}^*| + \ln(k_n - k^* + 1)\} \le c(\gamma) + \ln k_n$

for some constant $c(\gamma)$ possibly dependent on $\gamma$.

Now consider any $k_n$-expert model $\tilde{f} \in M_\delta^n(f)$. Note that

$D_K(\tilde{f}, f_0) = \int f_0 \ln\frac{f_0}{\tilde{f}} = \int f_0 \ln\left(\frac{f_0}{f}\cdot\frac{f}{\tilde{f}}\right) = \int f_0 \ln\frac{f_0}{f} + \int f_0 \ln\frac{f}{\tilde{f}}.$

The first term is $\int f_0 \ln\frac{f_0}{f} = D_K(f, f_0) < \frac{\gamma}{2}$ for all sufficiently large $n$. For the second term,

$\left|\ln\frac{f}{\tilde{f}}\right| = |\ln f - \ln\tilde{f}| \le \sum_{l=1}^{d_n} |\theta_l - \tilde{\theta}_l|\left|\frac{\partial \ln f_u}{\partial u_l}\right|$ ($u$ is an intermediate point between $\theta$ and $\tilde{\theta}$)

$\le \delta \sum_{l=1}^{d_n}\left|\frac{\partial \ln f_u}{\partial u_l}\right|$ (since $\tilde{f} \in M_\delta^n(f)$)
$\le d_n\delta$ (by proposition 5)

$\le \gamma/2$ (noting that $\delta = \frac{\gamma}{4(s+1)n^a} < \frac{\gamma}{2d_n}$).

Then $\int f_0 \ln\frac{f}{\tilde{f}} \le \gamma/2$. Therefore, $D_K(\tilde{f}, f_0) \le \gamma$, and hence $M_\delta^n(f) \subseteq KL_\gamma$.
Proof of Proposition 3. We need to show that every $f \in G_m$ also belongs to $G_{m'}$. For $f \in G_m$,

$f(y, x) = \sum_{j=1}^{m-1} g_j H_j + \frac{e^{u_m + v_m^T x}}{\sum_{l=1}^{m} e^{u_l + v_l^T x}}\, H_m$

$= \sum_{j=1}^{m-1} g_j H_j + \frac{\sum_{l=m}^{m'} \frac{1}{m'-m+1}\, e^{u_m + v_m^T x}}{\sum_{l=1}^{m-1} e^{u_l + v_l^T x} + \sum_{l=m}^{m'} \frac{1}{m'-m+1}\, e^{u_m + v_m^T x}}\, H_m$

$= \sum_{j=1}^{m-1} g_j H_j + \frac{\sum_{l=m}^{m'} e^{\tilde{u}_l + \tilde{v}_l^T x}}{\sum_{l=1}^{m'} e^{\tilde{u}_l + \tilde{v}_l^T x}}\, H_m.$

This is an $m'$-experts model ($m' \ge m$) with parameters

$\tilde{u}_j = u_j, \quad j = 1, \dots, m-1; \qquad \tilde{u}_j = u_m - \ln(m' - m + 1), \quad j = m, \dots, m';$

and

$(\tilde{\alpha}_j, \tilde{\beta}_j, \tilde{v}_j) = (\alpha_j, \beta_j, v_j), \quad j = 1, \dots, m-1; \qquad (\tilde{\alpha}_j, \tilde{\beta}_j, \tilde{v}_j) = (\alpha_m, \beta_m, v_m), \quad j = m, \dots, m'.$

Note that for $j = m, \dots, m'$,

$|\tilde{u}_j| = |u_m - \ln(m' - m + 1)| \le |u_m| + \ln(m' - m + 1)$ (since $m' - m + 1 \ge 1$)

$\le Cm + (m' - m) \le Cm + C(m' - m)$ (since $C \ge 1$)

$= Cm'.$

The other parameters are bounded in absolute value by $Cm$ and hence by $Cm'$. Therefore, $f \in G_{m'}$, proving $G_m \subseteq G_{m'}$.

Acknowledgments

We thank the two referees for useful suggestions on improving the letter.
References

Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York: Springer.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.
Jiang, W., & Tanner, M. A. (1999a). On the approximation rate of hierarchical mixtures-of-experts for generalized linear models. Neural Computation, 11, 1183–1198.
Jiang, W., & Tanner, M. A. (1999b). Hierarchical mixtures-of-experts for exponential family regression models: Approximation and maximum likelihood estimation. Annals of Statistics, 27, 987–1011.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Jordan, M. I., & Xu, L. (1995). Convergence results for the EM approach to mixtures-of-experts architectures. Neural Networks, 8, 1409–1431.
Lee, H. K. H. (2000). Consistency of posterior distributions for neural networks. Neural Networks, 13, 629–642.
Peng, F., Jacobs, R. A., & Tanner, M. A. (1996). Bayesian inference in mixtures-of-experts and hierarchical mixtures-of-experts models with an application to speech recognition. Journal of the American Statistical Association, 91, 953–960.
Wood, S. A., Kohn, R., Jiang, W., & Tanner, M. A. (2005). Spatially adaptive nonparametric binary regression using a mixture of probits (Tech. Rep.). Evanston, IL: Department of Statistics, Northwestern University. Available online at: http://newton.stats.northwestern.edu/~jiang/report/binary.probit.pdf.
Received December 20, 2004; accepted May 3, 2005.
ARTICLE
Communicated by Peter Thomas
Polychronization: Computation with Spikes Eugene M. Izhikevich
[email protected] The Neurosciences Institute, 10640 John Jay Hopkins Drive, San Diego, CA 92121, U.S.A.
We present a minimal spiking network that can polychronize, that is, exhibit reproducible time-locked but not synchronous firing patterns with millisecond precision, as in synfire braids. The network consists of cortical spiking neurons with axonal conduction delays and spike-timing-dependent plasticity (STDP); a ready-to-use MATLAB code is included. It exhibits sleeplike oscillations, gamma (40 Hz) rhythms, conversion of firing rates to spike timings, and other interesting regimes. Due to the interplay between the delays and STDP, the spiking neurons spontaneously self-organize into groups and generate patterns of stereotypical polychronous activity. To our surprise, the number of coexisting polychronous groups far exceeds the number of neurons in the network, resulting in an unprecedented memory capacity of the system. We speculate on the significance of polychrony to the theory of neuronal group selection (TNGS, neural Darwinism), cognitive neural computations, binding and gamma rhythm, mechanisms of attention, and consciousness as "attention to memories."

1 Introduction

The classical point of view that neurons transmit information exclusively via modulations of their mean firing rates (Shadlen & Newsome, 1998; Mazurek & Shadlen, 2002; Litvak, Sompolinsky, Segev, & Abeles, 2003) seems to be at odds with the growing empirical evidence that neurons can generate spike-timing patterns with millisecond temporal precision in vivo (Lindsey, Morris, Shannon, & Gerstein, 1997; Prut et al., 1998; Villa, Tetko, Hyland, & Najem, 1999; Chang, Morris, Shannon, & Lindsey, 2000; Tetko & Villa, 2001) and in vitro (Mao, Hamzei-Sichani, Aronov, Froemke, & Yuste, 2001; Ikegaya et al., 2004).
The patterns can be found in the firing sequences of single neurons (Strehler & Lestienne, 1986; de Ruyter van Steveninck, Lewen, Strong, Köberle, & Bialek, 1997; Reinagel & Reid, 2002; Bryant & Segundo, 1976; Mainen & Sejnowski, 1995) or in the relative timing of spikes of multiple neurons (Prut et al., 1998; Chang et al., 2000) forming a functional neuronal group (Edelman, 1987, 1993). Activation of such a neuronal group can be triggered by stimuli or behavioral events (Villa et al., 1999; Riehle,

Neural Computation 18, 245–282 (2006)
© 2005 Massachusetts Institute of Technology
E. Izhikevich
Grün, Diesmann, & Aertsen, 1997). These findings have been widely used to support the hypothesis of temporal coding in the brain (Buzsaki, Llinas, Singer, Berthoz, & Christen, 1994; Abeles, 1991, 2002; Diesmann, Gewaltig, & Aertsen, 1999; Bienenstock, 1995; Miller, 1996a, 1996b; see also the special issue of Neuron (September 1999) on the binding problem). In addition to the growing empirical evidence of precise spike-timing dynamics in the brain, there is growing theoretical interest in the artificial neural network community in spike timing as an additional variable in the brain's information processing. See the special issue of IEEE TNN on pulse-coupled neural networks (May 1999), the special issue of Neural Networks on spiking neurons (July 2001), and the special issue of IEEE TNN on temporal coding (July 2004).

1.1 Spikes. When considering spiking neurons, most researchers emphasize synchrony of firing. Indeed, it is widely believed that if two or more neurons have a common postsynaptic target and fire synchronously, then their spikes arrive at the target at the same time, thereby evoking potent postsynaptic responses. If the neurons fire asynchronously (i.e., randomly), their spikes arrive at the postsynaptic target at different times, evoking only weak or no responses. An implicit assumption here is that the axonal conduction delays are negligible or equal.

1.2 Delays. Careful measurements of axonal conduction delays in the mammalian neocortex (Swadlow, 1985, 1988, 1992) showed that they can be as small as 0.1 ms and as large as 44 ms, depending on the type and location of the neurons. A typical distribution of axonal propagation delays between different pairs of cortical neurons (depicted in Figure 1A) is broad, spanning two orders of magnitude. Nevertheless, the propagation delay between any individual pair of neurons is precise and reproducible with submillisecond precision (see Figure 1B; Swadlow, 1985, 1994).
Why would the brain maintain different delays with such precision if spike timing were not important? The majority of computational neuroscientists discard delays as a nuisance that only complicates modeling. From a mathematical point of view, a system with delays is not finite- but infinite-dimensional,1 which poses some mathematical and simulation difficulties. In this letter, we argue that an infinite dimensionality of spiking networks with axonal delays is not a nuisance but an immense advantage that results
1 For example, the simplest delay equation $\dot{x}(t) = -x(t - 1)$, $x \in \mathbb{R}$, has infinite dimension because to solve it, we need to specify the initial condition on the entire interval $[-1, 0]$, not just at the point $t = 0$. Delayed dynamical systems can exhibit astonishingly rich and complex dynamics (e.g., see Foss & Milton, 2000); however, the mathematical theory of such equations is still in its infancy (Wiener & Hale, 1992; Bellen & Zennaro, 2003).
[Figure 1 appears here. Panel C tabulates experimentally measured axonal conduction delays (ms) by connection type (including LGN to layer 6, cortico-cortical, cortico-(ipsi)cortical, cortico-collicular, and VB to layer 4 connections), animal (cat, rabbit, mouse), and reference (Ferster & Lindstrom, 1983; Swadlow, 1985, 1994; Salami, Itami, Tsumoto, & Kimura, 2003).]
Figure 1: (A) Distribution of experimentally measured conduction delays of cortical axons running through the corpus callosum (antidromic stimulation, modified from Figure 3A of Swadlow, 1985). (B) Superposition of two voltage traces recorded in vivo shows the submillisecond precision of axonal conduction delays in the same pair of neurons (modified from Figure 4 of Swadlow, 1994). (C) Summary of experimental evidence of axonal conduction delays in different neurons and species.
in an unprecedented information capacity. In particular, there are stable firing patterns that are not possible without the delays.

1.3 Polychronization. To illustrate the main idea, consider neuron a in Figure 2A receiving inputs from neurons b, c, and d with different
Figure 2: (A) Synaptic connections from neurons b, c, and d to neurons a and e have different axonal conduction delays. (B–D) Firings of neurons are denoted by the vertical bars. Each arrow points to the spike arrival time to the postsynaptic neuron. (B) Synchronous firing is not effective in eliciting a potent postsynaptic response since the spikes arrive at the postsynaptic neurons at different times. (C) The spiking pattern with neuron d firing at 0 ms, neuron c firing at 4 ms, and neuron b firing at 8 ms is optimal to excite neuron a because the spikes arrive at a simultaneously. (D) The reverse order of firing is optimal to excite neuron e.
conduction delays. Synchronous firing, as in Figure 2B, is not effective at exciting neuron a, because the spikes arrive at this neuron at different times. To maximize the postsynaptic response in neuron a, the presynaptic neurons should fire with the temporal pattern determined by the delays and depicted in Figure 2C so that the spikes arrive at neuron a simultaneously. A different spike-timing pattern, as in Figure 2D, excites neuron e. We see that depending on the order and the precise timing of firing, the same three neurons can evoke a spike in either neuron a or neuron e, or possibly in some other neuron not shown in the figure. Notice how the conduction delays make this possible. If b, c, and d are sensory neurons driven by an external input, then the simple circuit in Figure 2 can recognize and classify simple spatiotemporal patterns (Hopfield, 1995; Seth, Mckinstry, Edelman, & Krichmar, 2004a).
Indeed, sensory input as in Figure 2C results in (d,c,b,a) firing as a group with spike-timing pattern (0, 4, 8, 10) ms. Sensory input as in Figure 2D results in a different set of neurons: (b,c,d,e), firing as a group with a different spike-timing pattern, namely, (0, 3, 7, 9) ms. Since the firings of these neurons are not synchronous but time-locked to each other, we refer to such groups as polychronous, where poly means many and chronous means time or clock in Greek. Polychrony should be distinguished from asynchrony, since the latter does not imply a reproducible time-locking pattern, but usually describes noisy, random, nonsynchronous events. It is also different from the notion of clustering, partial synchrony (Hoppensteadt & Izhikevich, 1997), or polysynchrony (Stewart, Golubitsky, & Pivato, 2003), in which some neurons oscillate synchronously while others do not. 1.4 Networks. To explore the issue of spike timing in networks with conduction delays, we simulated an anatomically realistic model consisting of 100,000 cortical spiking neurons having receptors with AMPA, NMDA, GABAA , and GABAB kinetics and long-term and short-term synaptic plasticity (Izhikevich, Gally, & Edelman, 2004). We found that the network contained large polychronous groups, illustrated in Figure 12, capable of recognizing and classifying quite complicated spatiotemporal patterns. The existence of such groups, requiring finely tuned synaptic weights and matching (or converging) conduction delays, might seem unlikely in randomly connected networks with distributions of conduction delays. However, spike-timing-dependent plasticity (STDP) can select matching conduction delays and spontaneously organize neurons into such groups, a phenomenon anticipated by M. Abeles (personal communication with his students), Bienenstock (1995) and Gerstner, Kempter, van Hemmen, & Wagner (1996) (see also Changeux & Danchin, 1976, and Edelman, 1987). 
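The arithmetic behind Figure 2 is simply arrival time = firing time + conduction delay. The following illustrative Python sketch (the delay values are inferred from the firing and arrival times quoted in the text, since the figure itself is not reproduced here) checks that pattern C makes the spikes coincide on neuron a while pattern D makes them coincide on neuron e:

```python
# Conduction delays (ms) from the presynaptic neurons b, c, d to the two
# targets a and e; values inferred from the text's firing/arrival times.
delays = {
    ('b', 'a'): 1, ('c', 'a'): 5, ('d', 'a'): 9,
    ('b', 'e'): 8, ('c', 'e'): 5, ('d', 'e'): 1,
}

def arrivals(firing_times, target):
    """Spike arrival times at `target` for a presynaptic firing pattern."""
    return {pre: t + delays[(pre, target)]
            for pre, t in firing_times.items()}

pattern_C = {'d': 0, 'c': 4, 'b': 8}   # Figure 2C
pattern_D = {'b': 0, 'c': 3, 'd': 7}   # Figure 2D

# Pattern C converges on neuron a (all arrivals at 9 ms), but not on e;
# pattern D does the reverse, converging on neuron e at 8 ms.
assert len(set(arrivals(pattern_C, 'a').values())) == 1
assert len(set(arrivals(pattern_C, 'e').values())) > 1
assert len(set(arrivals(pattern_D, 'e').values())) == 1
assert len(set(arrivals(pattern_D, 'a').values())) > 1
```

The same three presynaptic neurons thus select different readout neurons purely through spike timing, which is the essence of a polychronous group.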
An unexpected result is that the number of coexisting polychronous groups could be far greater than the number of neurons in the network, sometimes even greater than the number of synapses. That is, each neuron was part of many groups, firing with one group at one time and with another group at another time. This is the main result of this letter. In retrospect, it is not surprising, since the networks we consider have delays and hence are infinite-dimensional from a purely mathematical point of view. In this letter we present a minimal model that captures the essence of this phenomenon. It consists of a sparse network of 1000 randomly connected spiking neurons with STDP and conduction delays, thereby representing a cortical column or hypercolumn. The MATLAB code of the model, spnet, and its technical description are given in the appendix. In section 2, we demonstrate that despite its simplicity, the model exhibits cortical-like dynamics, including oscillations in the delta (4 Hz) frequency range, 40 Hz gamma oscillations, and a balance of excitation and inhibition. In section 3, we describe polychronous groups in detail. Our definition differs from the
E. Izhikevich
one used by Izhikevich et al. (2004), who relied on the existence of so-called anchor neurons and hence could not have more groups than neurons (that is, most of the groups in that study went undetected). We also illustrate how polychronous groups contribute to cognitive information processing going beyond the Hopfield-Grossberg or liquid state machine paradigms. In sections 4 and 5, we discuss some open problems and present some speculations on the significance of this finding in modeling binding, attention, and primary (perceptual) consciousness.

2 Dynamics

The model neural network, described in the appendix, preserves important ratios found in the mammalian cortex (Braitenberg & Schuz, 1991). It consists of 1000 randomly connected excitatory (80%) and inhibitory (20%) neurons. The network is sparse, with a 0.1 probability of connection between any two neurons. The behavior of each neuron is described by the simple spiking model (Izhikevich, 2003), which can reproduce 20 of the most fundamental neurocomputational features of biological neurons (summarized in Figure 3). Despite its versatility, the model can be implemented efficiently (Izhikevich, 2004). Since we cannot simulate an infinite-dimensional system on a finite-dimensional digital computer, we replace the network with a finite-dimensional approximation having a time resolution of 1 ms.

2.1 Plasticity. Synaptic connections among neurons have fixed conduction delays, which are random integers between 1 ms and 20 ms. Thus, the delays in the model are not as dramatic as those observed experimentally in Figure 1. Excitatory synaptic weights evolve according to the STDP rule illustrated in Figure 4 (Song, Miller, & Abbott, 2000). The magnitude of change of a synaptic weight between a pre- and a postsynaptic neuron depends on the timing of spikes: if the presynaptic spike arrives at the postsynaptic neuron before the postsynaptic neuron fires (that is, it could have caused the firing), the synapse is potentiated.
Its weight is increased according to the positive part of the STDP curve in Figure 4 but is not allowed to grow beyond a cut-off value, which is a parameter in the model. In this simulation, we use the value 10 mV, which means that two presynaptic spikes are enough to fire a given postsynaptic cell. If the presynaptic spike arrives at the postsynaptic neuron after it fired, that is, it brings the news late, the synapse is depressed. Its weight is decreased according to the negative part of the STDP curve. Thus, what matters is not the timing of spiking per se but the exact timing of the arrival of presynaptic spikes at postsynaptic targets.

2.2 Rhythms. Initially, all synaptic connections have equal weights, and the network is allowed to settle down for 24 hours of model time (which takes 6 hours on a 1 GHz PC) so that some synapses are potentiated and
[Figure 3 panels: (A) tonic spiking, (B) phasic spiking, (C) tonic bursting, (D) phasic bursting, (E) mixed mode, (F) spike frequency adaptation, (G) Class 1 excitable, (H) Class 2 excitable, (I) spike latency, (J) subthreshold oscillations, (K) resonator, (L) integrator, (M) rebound spike, (N) rebound burst, (O) threshold variability, (P) bistability, (Q) depolarizing after-potential (DAP), (R) accommodation, (S) inhibition-induced spiking, (T) inhibition-induced bursting; input dc-current, 20 ms scale bars]
Figure 3: Summary of the neurocomputational properties of biological spiking neurons (Izhikevich, 2004). Shown are simulations of the same model (see equations A.1 and A.2) with different choices of parameters; this model is used in this study. Each horizontal bar denotes a 20 ms time interval. The MATLAB file generating the figure and containing all the parameters, together with reproduction permission, is freely available online at www.izhikevich.com.
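The simple spiking model referred to above is specified by equations A.1 and A.2 in the appendix. As a rough, minimal sketch (in Python rather than the paper's MATLAB, using the commonly quoted regular-spiking parameters a = 0.02, b = 0.2, c = -65, d = 8 from Izhikevich, 2003), one 1 ms Euler step looks like this:

```python
def izhikevich_step(v, u, I, a=0.02, b=0.2, c=-65.0, d=8.0):
    """One 1 ms Euler step of v' = 0.04v^2 + 5v + 140 - u + I,
    u' = a(bv - u), with reset v <- c, u <- u + d after a spike
    (v reaching 30 mV)."""
    fired = v >= 30.0
    if fired:
        v, u = c, u + d
    v += 0.04 * v * v + 5.0 * v + 140.0 - u + I
    u += a * (b * v - u)
    return v, u, fired

# Tonic dc input produces repetitive (tonic) spiking, as in panel A.
v, u = -65.0, 0.2 * -65.0
spike_times = []
for t in range(1000):            # 1 s of model time at 1 ms resolution
    v, u, fired = izhikevich_step(v, u, I=10.0)
    if fired:
        spike_times.append(t)
```

Different choices of (a, b, c, d) reproduce the other panels; the exact parameter sets are in the downloadable MATLAB file mentioned in the caption.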
Figure 4: STDP rule (spike-timing-dependent plasticity, or Hebbian temporally asymmetric synaptic plasticity): The weight of a synaptic connection from a pre- to a postsynaptic neuron is increased if the postsynaptic neuron fired after the presynaptic spike, that is, the interspike interval t > 0. The magnitude of change decreases as A+ e^(-t/τ+). The reverse order results in a decrease of the synaptic weight with magnitude A- e^(t/τ-). Parameters used: τ+ = τ- = 20 ms, A+ = 0.1, and A- = 0.12.
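Using the parameters given in the caption (τ+ = τ- = 20 ms, A+ = 0.1, A- = 0.12) and the 10 mV weight cut-off mentioned in the text, the rule can be sketched as follows. This is a minimal Python sketch, not the paper's actual implementation (the MATLAB spnet code in the appendix); a zero interval is treated as depression here by assumption.

```python
import math

TAU_PLUS = TAU_MINUS = 20.0   # ms (Figure 4)
A_PLUS, A_MINUS = 0.1, 0.12   # potentiation / depression amplitudes (Figure 4)
W_MAX = 10.0                  # weight cut-off ("10 mV" in the text)

def stdp_dw(interval_ms):
    """Weight change for a given pre-arrival-to-post-spike interval:
    positive interval (pre arrives before post fires) -> potentiation,
    negative interval (pre arrives after post fired) -> depression."""
    if interval_ms > 0:
        return A_PLUS * math.exp(-interval_ms / TAU_PLUS)
    return -A_MINUS * math.exp(interval_ms / TAU_MINUS)

def apply_stdp(w, interval_ms):
    """Update a weight, clipped to [0, W_MAX]."""
    return min(W_MAX, max(0.0, w + stdp_dw(interval_ms)))
```

Note that the argument is the arrival-to-firing interval, not the difference of the two neurons' firing times; as the text stresses, it is the arrival of the presynaptic spike at the target that matters.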
others are depressed. At the beginning of this settling period, the network exhibits high-amplitude rhythmic activity in the delta frequency range, around 4 Hz (see Figure 5, top). This rhythm resembles one of the four fundamental types of brain waves, sometimes called deep sleep waves because they occur during dreamless states of sleep, during infancy, and in some brain disorders. Of course, the mechanism generating this rhythm in the model is probably different from the one in mammals, since the model does not have a thalamus. As the synaptic connections evolve according to STDP, the delta oscillations disappear, and the spiking activity of the neurons becomes more Poissonian and uncorrelated. After a while, gamma frequency rhythms in the range 30 to 70 Hz appear, as one can see in Figure 5 (bottom). The mechanism generating these rhythms is often called PING (pyramidal-interneuron network gamma; Whittington, Traub, Kopell, Ermentrout, & Buhl, 2000): strong firings of pyramidal neurons excite enough inhibitory interneurons, leading to transient reciprocal inhibition that temporarily shuts down the activity. These kinds of oscillations, implicated in cognitive tasks in humans and other animals, play an important role in the activation of polychronous groups, as we describe in the next section.

2.3 Balance of Excitation and Inhibition. There are fewer inhibitory neurons in the network, but their firing rate is proportionally higher, as one can see in Figure 5. As a result, the network converges to a state with
[Figure 5 panels: spike rasters (neuron number vs. time in ms) at sec = 1, showing a delta rhythm (2-4 Hz); at sec = 100; and at sec = 3600, showing a gamma rhythm (30-100 Hz), with excitatory and inhibitory neurons indicated]
Figure 5: Rhythmic activity of the spiking model is evident from the spike raster. As synaptic weights evolve according to STDP, the initial delta frequency oscillations (top, sec = 1) disappear, giving way first to relatively uncorrelated Poissonian activity (middle, sec = 100) and then to gamma frequency oscillations (bottom, sec = 3600).
[Figure 6 panels: network activity (gamma rhythm), the embedded spike-timing pattern, and the corresponding connectivity; neuron number vs. time (ms)]
Figure 6: Activation of a polychronous group: Spiking activity of the entire network (top) contains a pattern (middle) that is generated more often than predicted by chance. The pattern occurs because the connectivity between the neurons has matching axonal conduction delays (bottom). See Figures 2 and 7 for more detail. Dots denote actual spikes generated by the model; circles denote the predicted timing of spikes based on the anatomical connectivity and the delays among the neurons. The simulation time step is 1 ms. Red (black) dots denote firings of excitatory (inhibitory) neurons.
an approximate balance of excitation and inhibition (Shadlen & Newsome, 1994; Shadlen & Movshon, 1999; van Vreeswijk & Sompolinsky, 1996; Amit & Brunel, 1997), so that each excitatory neuron fires in a Poissonian manner with a rate fluctuating between 2 and 7 Hz. Even during the episodes of gamma oscillation, such as the one in Figure 5, the spiking activity of excitatory neurons is not synchronized; the neurons skip most of the gamma cycles and fire just a few spikes per second (see the open circles in Figure 6, middle). Twofold changes of some of the parameters, such as the maximal synaptic weight, the amount of depression in STDP, or the thalamic input, produce only transient changes in network dynamics. Neurons adjust their synaptic weights, balance the excitation and inhibition, and return to the mean firing
rate between 2 and 7 Hz. Thus, the network maintains a certain homeostatic state despite intrinsic or extrinsic perturbations.

3 Polychronous Spiking

Although the spiking of excitatory neurons looks random and uncorrelated, certain persistent spike-timing patterns emerge and reoccur with millisecond precision. An example of such a spike-timing pattern is presented in Figure 6. Although no apparent order can be seen in the network activity at the top of the figure, except for a pronounced gamma oscillation, the pattern denoted by circles in the middle of the figure repeats itself a few times per hour with ±1 ms spike jitter. Statistical considerations standard in the synfire chain literature (not presented here; see, e.g., Prut et al., 1998) suggest that such repetitions are highly unlikely to occur by chance. There must be some underlying connectivity that generates the pattern. Indeed, considering the connections between the neurons, depicted in Figure 6 (bottom), we can see that the neurons are organized into a group, referred to here as polychronous (i.e., multiple timings), which forces the neurons to fire with the pattern.
3.1 Definition. Our definition of polychronous groups is based on the anatomy of the network, that is, on the connectivity between neurons. Let us freeze the simulation and consider the strongest connections among neurons, paying special attention to the conduction delays. In Figure 7 the conduction delays from neurons (125, 275, 490) to neuron 1 are such that when the neurons fire with the timing pattern (0, 3, 7) ms, their spikes arrive at neuron 1 at the same time, thereby making neuron 1 fire at 13 ms. The
Figure 7: Example of a polychronous group: Firing of neurons (125, 275, 490) with the timing pattern (0, 3, 7) ms results in spikes arriving simultaneously at neuron 1, then at neurons 172, 695, and 380. This multitiming (polychronous) activity propagates farther along the network and terminates at neuron 510.
conduction delays from neurons 125 and 1 to neuron 172 are such that when the neurons fire with the pattern (3, 13) ms, their spikes arrive at neuron 172 simultaneously, thereby making it fire at 21 ms. Since we know all delays in the model, we can continue this procedure spike by spike and untangle the entire group. It consists of 15 neurons, some of them inhibitory; the group ends at neuron 510. The group in Figure 7 is defined on a purely anatomical basis; there is no firing involved yet. The figure tells only that there is a subgraph in the connectivity matrix of the network so that if neurons (125, 275, 490) fire with the indicated pattern, the stereotypical activity propagates through the subgraph. Thus, the group in the figure is only an anatomical prediction of a possible stereotypical firing pattern. One may never see the pattern, for example, if the first three neurons never fire. Whenever the neurons in the figure do fire with the spike-timing pattern determined by the connectivity and delays, we say that the group is activated and the corresponding neurons polychronize. Typically, firing of the first few neurons with the right timing is enough to activate most of the group, as it happens in Figure 6. Notice how activation of the group is locked to the gamma oscillation; that is, the first three neurons fire at the first gamma cycle, their spikes travel 10 to 20 ms and arrive at the next four neurons in the next gamma cycle, and so on, resulting in precise stereotypical activity. We stress here that 1 ms spike-timing precision is the consequence of our definition of the group. Of course, neurons in Figure 6 fire many other spiking patterns with large jitter. We ignore those patterns by calling them noise. 3.2 Emergence of Groups. Considering various triplets, such as neurons (125, 275, 490) in Figure 7 (left), firing with various patterns, we can reveal all polychronous groups emanating from the triplets. 
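The spike-by-spike "untangling" procedure can be expressed as a small event-driven search over the anatomy. The sketch below is an illustrative reconstruction, not the paper's code: the connection table, delays, and the two-coincident-spike firing condition are supplied by hand, with the Figure 7 numbers used only as an example.

```python
from collections import defaultdict

def untangle_group(connections, trigger, horizon=100, coincidence=2):
    """Predict the stereotypical firing pattern implied by the anatomy.

    connections: dict pre -> list of (post, delay_ms) for strong synapses
    trigger:     dict neuron -> firing time (ms), e.g. {125: 0, 275: 3, 490: 7}
    A neuron fires when `coincidence` spikes arrive in the same 1 ms bin.
    """
    arrivals = defaultdict(set)        # (post, arrival time) -> set of pres
    fired = dict(trigger)              # neuron -> firing time (ms)
    for pre, t in trigger.items():
        for post, delay in connections.get(pre, []):
            arrivals[(post, t + delay)].add(pre)
    for t in range(horizon):           # process arrival bins in time order
        for (post, at), pres in list(arrivals.items()):
            if at == t and len(pres) >= coincidence and post not in fired:
                fired[post] = t        # new firing adds further arrivals
                for nxt, delay in connections.get(post, []):
                    arrivals[(nxt, t + delay)].add(post)
    return fired

# Hypothetical delays consistent with the text: spikes from (125, 275, 490)
# converge on neuron 1 at 13 ms; neurons 275 and 1 then converge on 172.
connections = {125: [(1, 13)], 275: [(1, 10), (172, 18)],
               490: [(1, 6)], 1: [(172, 8)]}
group = untangle_group(connections, {125: 0, 275: 3, 490: 7})
```

Run on the full connectivity matrix, the same procedure would continue spike by spike until the activity dies out, which is exactly how the 15-neuron group ending at neuron 510 is untangled.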
In the network of 1000 neurons presented in the appendix, we find over 5000 such groups. The groups did not exist at the beginning of the simulation but appeared as a result of STDP acting on random spiking (Izhikevich et al., 2004). STDP potentiates some synapses corresponding to connections with converging (matching) delays and depresses (prunes) other synapses. The plasticity takes an initially unstructured network and selects firing patterns that are consistent with the underlying anatomy, thereby creating many strongly connected subgraphs corresponding to polychronous groups. Since STDP is always “ON” in the network, groups constantly appear and disappear; their total number fluctuates between 5000 and 6000. We found a core of 471 groups that appeared and survived the entire duration of the 24-hour simulation. The groups have different sizes, lengths, and time spans, as we summarize in Figure 8. (A few examples are depicted in Figure 12.) Since an average group consists of 25 neurons, an average neuron is a member of
Figure 8: Characteristics of polychronous groups (total number is 5269 in a network of 1000 neurons). (Top left) An example of a polychronous group. (Top right) Distribution of group sizes, that is, the number of neurons that form each group. The example group has size 15. (Bottom left) Distribution of groups’ time spans—the time difference between the firing of the first and the last neuron in the group. The example group has time span 56 ms. (Bottom right) Distribution of the longest paths in the groups. The example group has a path of length 5.
131 different groups. Because different groups activate at different times, a neuron can fire with one group at one time and with another group at another time.

3.3 Groups Share Neurons. Quite often, different polychronous groups can share more than one neuron. Two such cases are illustrated in Figure 9. Neurons (8, 103, 351, 609, 711, 883) belong to two different groups in the upper half of the figure. However, there is no ambiguity because their firing order is different; the neurons fire with one spike-timing pattern at one time (when the first group is activated), with the other pattern at some other time (when the second group is activated), and with no pattern most of the time. The lower half of the figure depicts two groups having eight neurons in common and firing with different spike-timing patterns. In addition, neurons 838 and 983 fire twice during activation of the second group. Again, there is no ambiguity here because each polychronous group is defined not only by its constituent neurons but also by their precise spiking times. As an extreme illustration of this property, consider a toy fully connected network of five neurons in Figure 10A. In principle, such a network can exhibit 5! = 120 different firing patterns if we require that each neuron fires only
[Figure 9 annotations (neuron lists and firing patterns, in ms): (103, 711, 609, 8, 883, 351) firing at (0, 1, 12, 20, 25, 34); (609, 711, 103, 8, 883, 351) firing at (0, 1, 2, 20, 25, 60); (77, 779, 416, 474, 785, 838, 983, 877) firing at (0, 0, 2, 14, 15, 16, 20, 30); (77, 416, 779, 785, 474, 838, 983, 877, 838, 983) firing at (0, 4, 5, 15, 18, 20, 21, 31, 43, 66); time axis 0-100 ms]
Figure 9: Two examples of pairs of groups consisting of essentially the same neurons but firing with different spike-timing patterns; see also Figure 10. Neurons of interest are marked with numbers. The list of neurons and their firing times are provided for each group in the upper-right corners.
once and we distinguish the patterns only on the basis of the order of firing of the neurons. If we allow for multiple firings, then the number of patterns explodes combinatorially. However, the connectivity among the neurons imposes a severe restriction on the possible sustained and reproducible firing patterns, essentially excluding most of them. Random delays in the network would result in one, and sometimes two, polychronous groups.
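The combinatorics can be made concrete: with every neuron firing exactly once, the order-based patterns are simply the permutations of the neurons, and allowing even a second spike per neuron multiplies the count enormously. The multi-spike figure below is an illustrative multiset-permutation count, not a quantity from the paper.

```python
import math
from itertools import permutations

# Five neurons, each firing exactly once: one pattern per ordering.
patterns = list(permutations(range(5)))
n_single = len(patterns)                  # 5! = 120

# Two spikes per neuron: orderings of 10 spike events, with the two
# spikes of each neuron interchangeable -> 10! / (2!)^5 sequences.
n_double = math.factorial(10) // 2 ** 5
```

The anatomy then prunes this combinatorial space down to the one or two sustained patterns that random delays happen to support.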
[Figure 10 panels: (A) the five-neuron toy network with its labeled conduction delays (2-7 ms); (B) spike rasters of its 14 polychronous groups, two of them marked cyclic; neuron number vs. time (ms)]
Figure 10: (A) A toy network of five neurons with axonal conduction delays and strong synapses (two simultaneous spikes are enough to fire any neuron). (B) The delays in the network are optimized so that it has 14 polychronous groups, including two cyclic groups that exhibit reverberating activity with a time (phase) shift.
The delays in Figure 10A are not random; they were constructed to maximize the number of polychronous groups. Although there are only five neurons, the network has the 14 polychronous groups shown in Figure 10B. Adding a sixth neuron triples the number of groups, so that there are more groups than synapses in the network. Extrapolating from toy examples like this one, it would not be surprising if a network of 10^11 neurons (the size of the human brain) had more groups than there are particles in the universe.
Figure 11: Distribution of frequencies of activation of groups (# per hour) in the simulated and surrogate (inverted time) spike trains. Each group is found using anatomical data (connectivity and delays) and then used as a template to scan through the spike train. The group is said to activate when more than 50% of its neurons polychronize, that is, fire with the prescribed spike-timing pattern with ±1 ms jitter, as in Figure 6. Surrogate data emphasize the statistical significance of these events.
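The template-matching scan described in the caption can be sketched as follows. This is an illustrative reconstruction; the data structures are assumptions, not the paper's actual analysis code.

```python
def half_activated(template, spikes, offset, jitter=1, fraction=0.5):
    """True if at least `fraction` of the template's neurons fire within
    +/- jitter ms of their prescribed times, starting at `offset`.

    template: neuron -> prescribed firing time relative to group onset (ms)
    spikes:   set of (neuron, time) pairs from the recorded raster
    """
    hits = sum(
        1 for neuron, rel_t in template.items()
        if any((neuron, offset + rel_t + dt) in spikes
               for dt in range(-jitter, jitter + 1))
    )
    return hits >= fraction * len(template)

# Hypothetical template and raster: neuron 490 fires 1 ms late (within
# jitter) and neuron 1 is missing, so 3 of 4 neurons polychronize.
template = {125: 0, 275: 3, 490: 7, 1: 13}
raster = {(125, 500), (275, 503), (490, 508), (7, 600)}
```

Sliding `offset` over a 24-hour raster and counting the offsets at which this returns true gives the activation frequencies plotted in the histogram.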
3.4 Activation of Groups. Our definition of a polychronous group relies on the anatomy of the network, not on its dynamics. (Of course, the former and the latter are dependent via STDP.) We say that a group is half-activated when at least 50% of its constituent excitatory neurons polychronize (i.e., fire according to the prescribed spike-timing pattern with ±1 ms spike jitter). For example, the group in Figure 6 is 63% activated because 12 of its 19 excitatory neurons polychronized. Once all the groups are found using the anatomical data (connectivity and delays), we use each group as a template, scan the spiking data recorded during a 24-hour period, and count how many times the group is half-activated. We apply this procedure only to those groups (471 in total) that persist during the 24-hour period. In Figure 11 we plot the distribution histogram of the averaged frequency of half-activation of polychronous groups. The mean activation frequency is 7 times per hour, that is, once every 8 minutes, implying that there is a spontaneous activation of some group roughly every second (8 × 60/471 ≈ 1 sec). Since an average neuron is a member of 131 different groups, 131 × 7 = 917 of its spikes per hour are part of the activation of some group, which is less than 4% of the total number of spikes (up to 25,000) fired during the hour. Thus, the majority of the spikes are noise, and only a tiny fraction is involved in polychrony. The only way to tell which is which is to consider these spikes in relation to the spikes of the other neurons. To test the significance of our finding, we use surrogate data obtained from the spike data by inverting the time. Such a procedure does not change the mean firing rates, interspike histograms, magnitude of
cross-correlations, and other meaningful statistics of the spike trains. In particular, this approach is free from the criticism (Oram, Wiener, Lestienne, & Richmond, 1999; Baker & Lemon, 2000) that precise firing sequences appear exclusively by chance in spike rasters with covarying firing rates. Activation frequency of (noninverted) groups in the surrogate (inverted) spike raster, depicted as black bars in Figure 11, is much lower, indicating that group activations are statistically significant events. We emphasize that our method of analysis of spike trains is drastically different from the one used to search for synfire chains in in vivo data. We do not search for patterns in spike data; we know what the patterns are (using the connectivity matrix and delays); we just scan the spikes and count the occurrences of each pattern. Apparently such an approach is feasible only in models. 3.5 Representations. What is the significance of polychronous groups? We hypothesize that polychronous groups could represent memories and experience. In the simulation above, no coherent external input to the system was present. As a result, random groups emerge; that is, the network generates random memories not related to any previous experience. However, coherent external stimulation builds certain groups that represent this stimulation in the sense that the groups are activated when the stimulation is present. Different stimuli build different groups even when the same neurons are stimulated, as we illustrate in Figure 12. Every second during a 20-minute period, we stimulate 40 neurons, 1, 21, 41, 61, . . . , 781, either with the pattern (1, 2, . . . , 40) ms or with the inverse pattern (40, . . . , 2, 1) ms, as we show in the top of Figure 12. Initially, no groups starting with stimulated neurons existed (we did not explore whether the stimulation activated any of the existing groups consisting of other neurons). However, after 20 minutes of simulation 25 new groups emerged. 
Fifteen of them correspond to the first stimulus; they can be activated when the network is stimulated with the first pattern. The other 10 correspond to the second stimulus; that is, they can be activated when the network is stimulated with the second pattern. Thus, the groups represent the memory of the two input patterns, and their activation occurs when the network “sees” the corresponding patterns. In Figure 13 we depict the time evolution of the largest group corresponding to the first pattern in Figure 12. Notice how the group annexes neurons, probably at the expense of the other groups in the network. Further simulation shows that the initial portion of the group is relatively stable, but its tail expands and shrinks in an unpredictable manner. Finally, not all groups corresponding to a pattern activate when the network is stimulated. Because the groups share some neurons and have excitatory and inhibitory interconnections, they are in a constant state of competition and cooperation. As a result, each presentation of a stimulus activates only two to three groups (15%) in a random manner.
Figure 12: Persistent stimulation of the network with two spatiotemporal patterns (top) results in the emergence of polychronous groups that represent the patterns; the first few neurons in each group are the ones being stimulated, and the rest of the group activates (polychronizes) whenever the patterns are present.

[Figure 13 panels: snapshots of the group at 2, 3, 4, 5, 6, 7, 9, 10, and 20 min; scale bar 20 ms]
Figure 13: Time evolution (growth) of the last (largest) polychronous group in Figure 12 corresponding to stimulation pattern 1.
3.6 Rate to Spike-Timing Conversion. Neurons in the model use a spike-timing code to interact and form groups. However, the external input from sensory organs, such as retinal cells and hair cells in the cochlea, arrives as a rate code, that is, encoded in the mean firing frequency of spiking. How can the network convert rates to precise spike timings? It is easy to see how rate-to-spike-timing conversion could occur at the onset of stimulation. As the input volley arrives, the neurons getting stronger excitation fire first, and the neurons getting weaker excitation fire later or not at all. This mechanism relies on the fact that there is a clear onset
Figure 14: Rate code to spike-timing code conversion by a spiking network with fast inhibition. (A) The firing rate input (rate code) induces phasic inhibition in the network. While the excitatory neurons recover from the inhibition, the ones that get the strongest input fire first, and the ones getting the weakest input fire last (or not at all). (B) The strength of the rate code input determines the degree of hyperpolarization of the excitatory neurons. Those inhibited less fire sooner. This mechanism has the desired logarithmic scaling that makes the spike-timing code insensitive to the strength of the input (see the text for details). Open circles: excitatory neurons; filled circles: populations of inhibitory neurons.
of stimulation, for example, after a visual saccade. What if stimulation is tonic, without a clear beginning or end? We hypothesize that intrinsic rhythmicity generates internal “saccades” that can parse the rate input into spike timing, and we discuss three possible mechanisms by which this could be accomplished. First, the intrinsic rhythmic activity can chop tonic input into “mean-firing-rate” packets. Since each packet has a well-defined onset, it can be converted into spike timings according to the mechanism described above: the neuron receiving the strongest input fires first, and so forth. This mechanism is similar, but not equivalent, to the mechanism proposed by Hopfield (1995). The other two mechanisms rely on the action of inhibitory interneurons. In one case, depicted in Figure 14A, inhibitory neurons, being faster, fire first and inhibit the excitatory neurons, resulting in a long inhibitory postsynaptic potential (IPSP). The rate at which excitatory neurons recover from the IPSP depends on their intrinsic properties and the strength of the overall external input. The stronger the input, the sooner the neuron fires after the IPSP. Again, the neuron receiving the strongest input fires first, and the neuron receiving the weakest input fires last.
In the third mechanism, depicted in Figure 14B, inputs with different firing rates produce IPSPs of different amplitudes in the excitatory neurons downstream. As the excitatory neurons recover from the inhibition, the neurons are ready to fire spikes (due to some other tonic stimulation) at different moments: the neuron inhibited less fires first, and the neuron inhibited more fires last or not at all. Since the recovery from inhibition is nearly exponential, the system is relatively insensitive to the input scaling. That is, a stronger input results in firing with the same spike-timing pattern but with an earlier onset. Similarly, a weaker input does not change the spike-timing pattern but only delays its onset. Let us illustrate this point using two linear dimensionless equations, v'i = −vi, i = 1, 2, that model the recovery of the membrane potential from the inhibition vi(0) = −Ii, where each Ii denotes the amplitude (peak) of the IPSP. The recovery is exponential, vi(t) = −Ii e^(−t), so the time at which each membrane reaches a certain threshold value, say, v = −1, is ti = log Ii. If we scale the input by any factor (e.g., k Ii), we translate the threshold moment by a constant, because log(k Ii) = log k + log Ii, where log k is the same for both neurons. Thus, regardless of the scaling of the input, the time difference log(k I1) − log(k I2) = log I1 − log I2 is invariant. Thus, in contrast to Hopfield (1995), we do not need to postulate that the input is already somehow converted to the logarithmic scale. Synchronized inhibitory spiking implements the logarithmic conversion and makes the spike-timing response relatively insensitive to the input scaling.

3.7 Stimulus-Triggered Averages. Notice that synchronized inhibitory activity occurs during gamma frequency oscillations (see Figure 5). Thus, the network constantly converts rate code to spike-timing code (and back) via the gamma rhythm.
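The logarithmic-conversion argument above can be checked numerically (a sketch of the derivation, with arbitrary illustrative amplitudes I1, I2 and scaling factor k):

```python
import math

def threshold_time(I):
    """Time at which v(t) = -I * exp(-t) recovers to the threshold
    v = -1, i.e. t = log(I); requires I >= 1."""
    return math.log(I)

I1, I2, k = 8.0, 2.0, 5.0
# Scaling both inputs by k shifts each crossing time by log(k) ...
shift = threshold_time(k * I1) - threshold_time(I1)
# ... so the time difference (the spike-timing pattern) is unchanged.
invariant = (threshold_time(I1) - threshold_time(I2),
             threshold_time(k * I1) - threshold_time(k * I2))
```

Here `shift` equals log 5 and both entries of `invariant` equal log 8 - log 2 = log 4, regardless of k.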
Each presentation of a rate code stimulus activates an appropriate polychronous group or groups that represent the stimulus. This activation is locked to the phase of the gamma cycle but not to the onset of the stimulus. We explain this point in Figure 15, which illustrates the results of a typical in vivo experiment in which a visual, auditory, or tactile stimulus is presented to an animal (we cannot simulate this with the present network because, among many other things, we do not model the thalamus, the structure that gates inputs to the cortex). Suppose that we record from neuron A, belonging to a polychronous group representing the stimulus. Since the input comes at a random phase of the internal gamma cycle, the activation of the group occurs at random moments, resulting in a typical smeared stimulus-triggered histogram. Such histograms have been interpreted by many biologists as “the evidence” of the absence of precise spike-timing patterns in the brain, since the only reliable effect that the stimulus evokes is an increased probability of firing of neuron A (i.e., an increase in its mean firing rate). Even recording from two or more neurons belonging to different groups would result in broad histograms and weak correlations among the neurons, because the groups rarely activate together, and when they do,
Figure 15: Noisy, unreliable response of neuron A is due to the unreliable activation of the group representing the stimulus. The activation is locked to the intrinsic gamma cycle, not to the onset of the stimulation, resulting in the smeared stimulus-triggered histogram. One needs to record from two or more neurons belonging to the same group to see the precise spike-timing patterns in the network (the group and the associated gamma rhythm are drawn by hand).
they may activate at different cycles of the gamma rhythm. We see that “noisy,” unreliable responses of individual neurons to stimuli are the result of noisy and unreliable activations of polychronous groups. Recordings of two or more neurons belonging to the same group are needed to see the precise patterns (relative to the gamma rhythm).

4 Discussion

Simulating a simple spiking network (the MATLAB code is in the appendix), we discovered a number of interesting phenomena. The most striking one is the emergence of polychronous groups: strongly interconnected groups of neurons having matching conduction delays and capable of firing stereotypical time-locked spikes with millisecond precision. Thus, such groups can be seen not only in anatomically detailed cortical models (Izhikevich et al., 2004) but also in simple spiking networks. Changing some of the parameters of the model twofold changes the number of groups that can be supported by the network but does not eliminate them completely. The self-organization of neurons into polychronous groups is a robust phenomenon that occurs despite the experimentalist's efforts to prevent it.
E. Izhikevich
Our model is minimal; it consists of spiking neurons, axonal conduction delays, and STDP. All are well-established properties of the real brain. We hypothesize that unless the brain has an unknown mechanism that specifically prevents polychronization, real neurons in the mammalian cortex must also self-organize into such groups. In fact, all the evidence of reproducible spike-timing patterns (Abeles, 1991, 2002; Lindsey et al., 1997; Prut et al., 1998; Villa et al., 1999; Chang et al., 2000; Tetko & Villa, 2001; Mao et al., 2001; Ikegaya et al., 2004; Riehle et al., 1997; Beggs & Plenz, 2003, 2004; Volman, Baruchi, & Ben-Jacob, 2005) can be read as evidence of the existence and activation of polychronous groups.

4.1 How Is It Different from Synfire Chains?

The notion of a synfire chain (Abeles, 1991; Bienenstock, 1995; Diesmann et al., 1999; Ikegaya et al., 2004) is probably the most beautiful theoretical idea in neuroscience. Synfire chains describe pools of neurons firing synchronously, not polychronously. Synfire activity relies on synaptic connections having equal delays or no delays at all. Though easy to implement, networks without delays are finite-dimensional and do not have dynamics rich enough to support persistent polychronous spiking. Indeed, in the context of synfire activity, the groups in Figure 9 could not be distinguished, and the network of five neurons in Figure 10 would have only one synfire chain showing reverberating activity (provided that all the delays are equal and sufficiently long). Bienenstock (1995) referred to polychronous activity as a synfire braid. Synfire chain research concerns the stability of synfire activity. Instead, we employ here population thinking (Edelman, 1987). Although many polychronous groups are short-lived, a huge number of them are constantly appearing. And although their activation is not reliable, there is a spontaneous activation every second in a network of 1000 neurons.
Thus, the system is robust not in terms of individual groups but in terms of populations of groups.

4.2 How Is It Different from Hopfield-Grossberg Attractor Networks?

Polychronous groups are not attractors from the dynamical systems point of view (Hopfield, 1982; Grossberg, 1988). When activated, they result in stereotypical but transient activity that typically lasts three to four gamma cycles (less than 100 ms; see Figure 8). Once the stimulation is removed, the network does not return to a "default" state but continues to be spontaneously active.

4.3 How Is It Different from Feedforward Networks?

The anatomy of the spiking networks that we consider is not feedforward but reentrant (Edelman, 1987). Thus, the network does not "wait" for a stimulus to come but exhibits autonomous activity. Stimulation only perturbs the intrinsic activity, as happens in the mammalian brain. As a result, the network does not have a rigid stimulus-response function. The same stimulus can elicit
quite different responses because it activates a different (random) subset of polychronous groups representing the stimulus. Thus, the network operates in a highly degenerate regime (Edelman & Gally, 2001).

4.4 How Is It Different from Liquid-State Machines?

The dynamics of a network implementing the liquid-state-machine paradigm (Maass, Natschlaeger, & Markram, 2002) is purely stimulus driven. Such a network does not have short-term memory, and it cannot place an input in the context of previous inputs. The simple model presented here implements some aspects of liquid-state computing (e.g., it could be the liquid), but its response is not purely stimulus driven; it depends on the current state of the network, which in turn depends on short-term and long-term experience and previous stimuli. This could be an advantage or a drawback, depending on the task that needs to be solved. Let us discuss some interesting open problems and implementation issues that are worth exploring further:
- Finding groups: Our algorithm for finding polychronous groups considers various triplets firing with various spiking patterns and determines the groups that are initiated by the patterns. Because of the combinatorial explosion, it is extremely inefficient. In addition, we probably miss many groups that do not start with three neurons.
- Training: Our training strategy is the simplest and probably the least effective one: choose a set of "sensory" neurons, stimulate them with different spike-timing sequences, and let STDP form or select/reinforce appropriate groups. It is not clear whether this strategy is effective when many stimuli need to be learned.
- Incomplete activation: When a group is activated, whether in response to a particular stimulation or spontaneously, it rarely activates entirely. Typically, neurons at the beginning of the group polychronize, that is, fire with the precise spike-timing pattern imposed by the group connectivity, but the precision fades away as the activation propagates along the group. As a result, the connectivity in the tail of the group does not stabilize, so the group as a whole changes.
- Stability: Because of continuous plasticity, groups appear, grow (see Figure 13), live for a certain period of time, and then may suddenly disappear (Izhikevich et al., 2004). Thus, spontaneous activity of the network leads to a slow degradation of the memory, and it is not clear how to prevent this.
- Sleep states: The network can switch randomly between different states. Some of them correspond to "vigilance" with gamma oscillations, and others resemble "sleep" states, similar to the one in Figure 5
(top). It is not clear whether such switching should be prevented or whether it provides certain advantages for the homeostasis of connections.
- Optimizing performance: Exploring the model, we encountered a regime in which the number of polychronous groups was greater than the number of synapses in the network. However, the network was prone to epileptic seizures, which eventually led to uncontrolled, completely synchronized activity. More effort is required to fine-tune the parameters of the model to optimize the performance of the network without inducing paroxysmal behavior.
- Context dependence: Propagation delays are assumed to be constant in the present simulation. In vivo studies have shown that axonal conduction velocity has submillisecond precision, but it also depends on the prior activity of the neuron during the last 100 ms; hence, it can change with time in a context-dependent manner (Swadlow, 1974; Swadlow & Waxman, 1975; Swadlow, Kocsis, & Waxman, 1980). Thus, a polychronous group may exist and be activated at one time but temporarily disappear at another because of the previous activity of its constituent neurons.
- Synaptic scaling: We assumed here that the maximal cutoff synaptic value is 10 mV, which is slightly more than half of the threshold value of the pyramidal neuron in the model. Since the average neuron in the network has 100 presynaptic sources, this implies that spikes from 2% of the presynaptic sources are enough to make it fire. It is interesting, but computationally impossible at present, to estimate the number of different polychronous groups when each neuron has, say, 400 presynaptic sources, each with a maximal value of 2.5 mV. In this case, each group would be "wider," since at least eight neurons (the same 2%) are needed to fire any given postsynaptic cell.
- Network scaling: We simulated a network of 10³ neurons and found 10⁴ polychronous groups. How does the number of groups scale with the number of neurons?
In particular, how many polychronous groups are there in a network of 10¹¹ neurons, each having 10⁴ synapses? This is a fundamental question related to the information capacity of the human brain.
- Closing the loop: An important biological observation is that organisms are part of the environment. The philosophy at the Neurosciences Institute (the author's host institute) is "the brain is embodied, the body is embedded." Thus, to understand and simulate the brain, we need to give the neural network a body and put it into a real or virtual environment (Krichmar & Edelman, 2002). In this case, the network becomes part of a closed loop: the environment stimulates "sensory" neurons via sensory organs. Firings of these neurons combined with
the current state of the network (i.e., the context) activate appropriate polychronous groups, which excite "motor" neurons and produce appropriate movements in the environment (i.e., the response). The movements change the sensory input and close the causality loop.
- Reward and reinforcement learning: Some stimuli bring a reward (not modeled here) and activate the value system (Krichmar & Edelman, 2002). The value system strengthens recently active polychronous groups, that is, the groups that led to the reward. This increases the probability that the same stimuli in the future will activate the same groups and thereby bring more reward. Thus, in addition to passively learning input stimuli, the system can actively explore those stimuli that bring the reward.
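The synaptic-scaling arithmetic in the list above can be checked in a few lines. A back-of-envelope sketch; the helper function is ours, and the ~20 mV effective threshold (twice the 10 mV weight cutoff) is our reading of the text:

```python
import math

# Hypothetical helper (not in the paper): how many simultaneous presynaptic
# spikes are needed to push a cell over threshold, and what fraction of its
# inputs that is.  The ~20 mV effective threshold is assumed from the text
# ("10 mV ... slightly more than half of the threshold value").
def fraction_needed(threshold_mv, max_weight_mv, n_presynaptic):
    spikes = math.ceil(threshold_mv / max_weight_mv)
    return spikes, spikes / n_presynaptic

print(fraction_needed(20.0, 10.0, 100))  # current model: (2, 0.02) -> 2%
print(fraction_needed(20.0, 2.5, 400))   # hypothetical wider groups: (8, 0.02)
```

Either way the same 2% of inputs must coincide, but with 400 weak sources each group must recruit more neurons per firing, which is why such groups would be "wider."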
5 Cognitive Computations

Let us discuss possible directions of this research and its connection to studies of neural computation, attention, and consciousness. This section is highly speculative; it is motivated by, but not entirely based on, the simulations described above.

5.1 Synchrony: Good or Bad?

Much research on the dynamics of spiking and oscillatory networks is devoted to determining the conditions that ensure synchronization of the network activity. Many researchers (including this author until a few years ago) are under the erroneous assumption that synchronization is something good and desirable. What kind of information processing could possibly go on in a network of synchronously firing neurons? Probably none, since the entire network acts as a single neuron. Here we treat synchronization (or polychronization) of all neurons in the network as an undesirable property that should be avoided. In fact, synchronization (or polychronization) should be so rare and so difficult to occur by chance that when it does happen, even transiently in a small subset of the network, it signifies something important, something meaningful: a stimulus is recognized, two or more features are bound, attention is paid. All these cognitive events are related to the activation of polychronous groups, as we discuss in this section.

5.2 Novel Model of Neural Computation.

Most artificial neural network research concerns supervised or unsupervised training of neural nets, which consists in building a mapping from a given set of inputs to a given set of outputs. For example, the connection weights of the Hopfield-Grossberg model (Hopfield, 1982; Grossberg, 1988) are modified so that the given input patterns become attractors of the network. In these approaches, the network is "instructed" what to do.
In this article we take a different approach: instead of the instructionist approach, we employ a selectionist one, known as the theory of neuronal group selection (TNGS) and neural Darwinism (Edelman, 1987). There are two types of selection constantly going on in the spiking network:
- Selection on the neuronal level: STDP selects subgraphs with matching conduction delays in an initially unstructured network, resulting in the formation of a large number of groups, each capable of polychronizing, that is, of generating a reproducible spike-timing pattern with millisecond precision. The number of coexisting polychronous groups, called the repertoire of the network, is potentially unlimited.
- Selection on the group level: Polychronous groups are representations of possible inputs to the network, so that each input selects groups from the repertoire. That is, every time the input is presented to the network, a polychronous group (or groups) whose spike-timing pattern resonates with the input is activated (i.e., the neurons constituting the group polychronize).
Using the analogy with the immune system, where we have antibodies for practically all possible antigens, even those that do not exist on earth, we can take our point of view to the extreme and say that the network "has memories of all past and future events," with the past events corresponding to certain groups with assigned representations and the future events corresponding to the large, amorphous, potentially unlimited cloud of available groups with no representation. Learning a new input consists of selecting and reinforcing an appropriate group (or groups) that resonates with the input. Assigning the representation (meaning) to the group consists of potentiating the weak connections that link this group with other groups coactive at the same time, that is, putting the group in the context of the other groups that already have representations (see Figure 16). In this sense, each polychronous group represents its stimulus and the context. In addition, persistent stimuli may create new groups, as we show in section 3. In any case, the input constantly shapes the landscape of the groups present in the network, selecting and amplifying some groups and suppressing and destroying others. The major result of this article is that spiking networks with delays have more groups than neurons. Thus, the system has potentially enormous memory capacity and will never run out of groups, which could explain how a network of a mere 10¹¹ neurons (the size of the human neocortex) could have such a diversity of behavior. Of course, we need to learn how to use this extraordinary property in models.

5.3 Binding and Gamma Rhythm.

Binding is discussed in detail by Bienenstock (1995) in the context of synfire activity (see also the special
Figure 16: Due to the potentially unlimited number of coexisting polychronous groups, the system “has memories of all past and future events,” denoted by shaded and empty figures, respectively. (A) Past events are represented by the groups with assigned representations; they activate in response to specific inputs. Connections between the groups provide the context. (B) Experiencing a new event consists of selecting a group out of the amorphous set of “available” groups that resonates with the input. The context is provided by the potentiated connections between the group and recently active groups.
issue of NEURON (September 1999) on the binding problem). Bienenstock's major idea is that dynamic binding of various features of a stimulus corresponds to the synchronization of synfire waves propagating along distinct chains. The synchronization is induced by weak reentrant synaptic coupling between these chains (see also Seth et al., 2004b). This idea is equally applicable to polychronous activity. In Figure 17 we illustrate what could happen when different groups representing different features of a stimulus are activated asynchronously (left) or time-locked (right). In the former case, no specific temporal relationship would exist between firings of neurons belonging to different groups, except that the firings would be correlated (they are all triggered by the common input). The dotted lines in Figure 17 (right) are the reentrant connections between groups that establish the context for each group. These connections would coordinate activations of the groups and would be responsible for the time locking (polychronization) in Figure 17 (right). In essence, the four groups in the figure would act as a single meta-group whose reproducible
Figure 17: Time-locked activation of groups representing various features of a stimulus results in binding of the features and increased gamma rhythm. Each group contributes small gamma oscillation to the network gamma. (Left) The oscillations average out during the asynchronous activation. (Right) The oscillations add up during the time-locked activation. Dotted lines: weak reentrant connections between the groups that synchronize (or polychronize) their activation (the groups and the associated gamma rhythm are drawn by hand).
spike-timing pattern represents all features of the stimulus bound together into a whole. Each group has a gamma signature, indicated by dashed boxes in Figure 17 (top left) and discussed in section 3 (see Figure 6). Activation of such a group produces a small oscillation of the local field potential (LFP) at the gamma frequency. When groups activate asynchronously, their LFPs have random phases and cancel each other. When groups activate polychronously during binding, their LFPs add up, resulting in a noticeable network gamma rhythm and increased synchrony (Singer & Gray, 1995).

5.4 Modeling Attention.

The small size of the system does not allow us to explore other cognitive functions of spiking networks. In September 2005, the author simulated a detailed thalamo-cortical system having 10¹¹
Figure 18: Stimuli A and B are both represented by pairs of polychronous groups with overlapping neurons. Selective attention to representation A (both groups representing A are active) does not inhibit neurons involved in representation B. Because the neurons are shared, they just fire with the spike-timing pattern corresponding to A.
spiking neurons (i.e., the size of the human brain), six-layer cortical microcircuitry, and specific, nonspecific, and reticular thalamic nuclei. One second of simulation took more than one month on a cluster of 27 3-GHz processors. In a large-scale network, there could be many groups (more than the 15 depicted in Figure 12) that represent any particular input stimulus. The stimulus alone could activate only a small subset of the groups. However, weak reentrant connections among the groups may trigger a regenerative process leading to explosive activation of many other groups representing the stimulus, resulting in its perception (and possibly an increased gamma rhythm). These groups take up most of the neurons in the network, so that only relatively few neurons are available for the activation of any other group not related to the stimulus. We might say that the stimulus is the focus of attention. If two or more stimuli are present, then activation of the groups representing one stimulus essentially precludes the other stimuli from being attended. Remarkably, the groups corresponding to the unattended stimuli are not inhibited. The neurons constituting those groups fire, but with a different spike-timing pattern (see Figure 18). We hypothesize that this mutual exclusion is related to the phenomenon of selective attention.
In our view, attention is not a command that comes from a "higher" or "executive" center and tells the network which input to attend to. Instead, we view attention as an emergent property of the simultaneous, regenerative activation (via positive feedback) of a large subset of groups representing a stimulus, thereby impeding the activation of other groups corresponding to other stimuli. Multiple stimuli compete for the focus of attention, and the winner is determined by many factors, mostly the context.

5.5 Consciousness as Attention to Memory.

When no stimulation is present, there is spontaneous activation of polychronous groups, as in Figure 11. We hypothesize that if the size of the network exceeds a certain threshold, a random activation of a few groups representing a previously seen stimulus may activate other groups representing the same stimulus, so that the total number of activated groups is comparable to the number activated when the stimulus is actually present. Not only would such an event exclude all the other groups not related to the stimulus from being activated, but from the network's point of view, it would be similar to the event when the stimulus is actually present and is the focus of attention. One can say that the network "thinks" of the stimulus, that is, it pays attention to the memory of the stimulus. Such "thinking" resembles "experiencing" the stimulus. A sequence of spontaneous activations corresponding to one stimulus, then another, and so on may be related to the stream of primary (perceptual or sensory) consciousness (Edelman, 2004), which can be found in many nonhuman animals. Of course, it does not explain the higher-order (conceptual) consciousness of humans.

Appendix: The Model

The MATLAB code simulating the network activity is in Figure 19. The upper half of the program initializes the network, and it takes approximately 30 sec on a 1 GHz Pentium PC.
The lower half of the program executes the model, and it takes 5 seconds to simulate 1 second of network activity. The actual time may vary depending on the firing rate of the neurons. The MATLAB code and an equivalent, 20-times-faster C++ code are also available on the author's Web page. Let us describe the details of the model.

A.1 Anatomy.

The network consists of N = 1000 neurons, the first Ne = 800 of excitatory RS type and the remaining Ni = 200 of inhibitory FS type (Izhikevich, 2003). The ratio of excitatory to inhibitory cells is 4 to 1, as in the mammalian neocortex. Each excitatory neuron is connected to M = 100 random neurons, so that the probability of connection is M/N = 0.1, again as in the neocortex. Each inhibitory neuron is connected to M = 100 excitatory neurons only. The indices of postsynaptic targets are in the N×M matrix post. The corresponding synaptic weights are in the N×M matrix s. Inhibitory weights are not plastic, whereas excitatory weights evolve according to the STDP
% spnet.m: Spiking network with axonal conduction delays and STDP
% Created by Eugene M. Izhikevich.  February 3, 2004
M=100;                 % number of synapses per neuron
D=20;                  % maximal conduction delay
% excitatory neurons  % inhibitory neurons  % total number
Ne=800;                Ni=200;              N=Ne+Ni;
a=[0.02*ones(Ne,1); 0.1*ones(Ni,1)];
d=[   8*ones(Ne,1);   2*ones(Ni,1)];
sm=10;                 % maximal synaptic strength
post=ceil([N*rand(Ne,M);Ne*rand(Ni,M)]);
s=[6*ones(Ne,M);-5*ones(Ni,M)];         % synaptic weights
sd=zeros(N,M);                          % their derivatives
for i=1:N
  if i<=Ne
    for j=1:D
      delays{i,j}=M/D*(j-1)+(1:M/D);
    end;
  else
    delays{i,1}=1:M;
  end;
  pre{i}=find(post==i&s>0);             % pre excitatory neurons
  aux{i}=N*(D-1-ceil(ceil(pre{i}/N)/(M/D)))+1+mod(pre{i}-1,N);
end;
STDP = zeros(N,1001+D);
v = -65*ones(N,1);                      % initial values
u = 0.2.*v;                             % initial values
firings=[-D 0];                         % spike timings

for sec=1:60*60*24                      % simulation of 1 day
  for t=1:1000                          % simulation of 1 sec
    I=zeros(N,1);
    I(ceil(N*rand))=20;                 % random thalamic input
    fired = find(v>=30);                % indices of fired neurons
    v(fired)=-65;
    u(fired)=u(fired)+d(fired);
    STDP(fired,t+D)=0.1;
    for k=1:length(fired)
      sd(pre{fired(k)})=sd(pre{fired(k)})+STDP(N*t+aux{fired(k)});
    end;
    firings=[firings;t*ones(length(fired),1),fired];
    k=size(firings,1);
    while firings(k,1)>t-D
      del=delays{firings(k,2),t-firings(k,1)+1};
      ind = post(firings(k,2),del);
      I(ind)=I(ind)+s(firings(k,2), del)';
      sd(firings(k,2),del)=sd(firings(k,2),del)-1.2*STDP(ind,t+D)';
      k=k-1;
    end;
    v=v+0.5*((0.04*v+5).*v+140-u+I);    % for numerical
    v=v+0.5*((0.04*v+5).*v+140-u+I);    % stability time
    u=u+a.*(0.2*v-u);                   % step is 0.5 ms
    STDP(:,t+D+1)=0.95*STDP(:,t+D);     % tau = 20 ms
  end;
  plot(firings(:,1),firings(:,2),'.');
  axis([0 1000 0 N]); drawnow;
  STDP(:,1:D+1)=STDP(:,1001:1001+D);
  ind = find(firings(:,1) > 1001-D);
  firings=[-D 0;firings(ind,1)-1000,firings(ind,2)];
  s(1:Ne,:)=max(0,min(sm,0.01+s(1:Ne,:)+sd(1:Ne,:)));
  sd=0.9*sd;
end;
Figure 19: MATLAB code of the spiking network with axonal conduction delays and spike-timing-dependent plasticity (STDP). It is available on the author’s Web page: www.izhikevich.com.
rule discussed in the next section. Their derivatives are in the N×M matrix sd, though only the Ne×M block of the matrix is used. Each synaptic connection has a fixed integer conduction delay between 1 ms and D = 20 ms, where D is a parameter (M/D must be an integer in the model). We do not model modifiable delays (Huning, Glunder, & Palm, 1998; Eurich, Pawelzik, Ernst, Cowan, & Milton, 1999) or transmission failures (Senn, Schneider, & Ruff, 2002). The list of all synaptic connections from neuron i having delay j is in the cell array delays{i,j}. Our MATLAB implementation assigns a 1 ms delay to all inhibitory connections and a 1 to D ms delay to all excitatory connections. Although the anatomy of the model is random, reflecting the connectivity within a cortical minicolumn, one can implement an arbitrarily sophisticated anatomy by specifying the matrices post and delays. The details of the anatomy do not matter in the rest of the MATLAB code and do not slow the simulation. Once the matrices post and delays are specified, the program initializes the cell arrays pre and aux. The former contains the indices of all excitatory neurons presynaptic to a given neuron, and the latter is an auxiliary table of indices needed to speed up the STDP implementation.

A.2 Spiking Neurons.

Each neuron in the network is described by the simple spiking model (Izhikevich, 2003)

    v' = 0.04v² + 5v + 140 − u + I        (A.1)
    u' = a(bv − u)                        (A.2)

with the auxiliary after-spike resetting

    if v ≥ +30 mV, then v ← c, u ← u + d.        (A.3)
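Equations A.1 to A.3 are straightforward to port to other languages. The following is a minimal single-neuron sketch in Python (ours, not part of the paper's code), using the same two 0.5 ms Euler steps per millisecond for v as the MATLAB program in Figure 19:

```python
# Single RS neuron, equations A.1-A.3, Euler integration with two 0.5 ms
# sub-steps for v and one 1 ms step for u, mirroring the MATLAB loop.
def simulate_izhikevich(I_ext, T_ms, a=0.02, b=0.2, c=-65.0, d=8.0):
    """Return the spike times (in ms) for a constant input current I_ext."""
    v = -65.0          # membrane potential (mV)
    u = b * v          # recovery variable
    spikes = []
    for t in range(T_ms):
        if v >= 30.0:  # spike apex reached: after-spike resetting (A.3)
            v = c
            u += d
            spikes.append(t)
        for _ in range(2):                                  # (A.1)
            v += 0.5 * (0.04 * v * v + 5.0 * v + 140.0 - u + I_ext)
        u += a * (b * v - u)                                # (A.2)
    return spikes

print(len(simulate_izhikevich(10.0, 1000)))  # spike count for 1 s of tonic input
print(simulate_izhikevich(0.0, 1000))        # no input: the neuron stays silent
```

With the RS parameters and a sustained suprathreshold current, the sketch produces tonic firing with the initial spike-frequency adaptation characteristic of regular spiking cells; with zero input it settles to rest near −70 mV.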
Here the variable v represents the membrane potential of the neuron, and u represents a membrane recovery variable, which accounts for the activation of K+ ionic currents and the inactivation of Na+ ionic currents and provides negative feedback to v. After the spike reaches its apex at +30 mV, which is not to be confused with the firing threshold, the membrane voltage and the recovery variable are reset according to equation A.3. Depending on the values of the parameters, the model can exhibit the firing patterns of all known types of cortical neurons (Izhikevich, 2003). It can also reproduce all of the 20 most fundamental neurocomputational properties of biological neurons summarized in Figure 3 (see Izhikevich, 2004). We use (b, c) = (0.2, −65) for all neurons in the network. For excitatory neurons, we use the values (a, d) = (0.02, 8), corresponding to cortical pyramidal neurons exhibiting regular spiking (RS) firing patterns. For inhibitory neurons, we use the values (a, d) = (0.1, 2), corresponding to cortical
interneurons exhibiting fast spiking (FS) firing patterns. Better values of the parameters corresponding to different types of cortical neurons, as well as an explanation of the model, can be found in Izhikevich (2006). The variable I in the model combines two kinds of input to the neuron: (1) random thalamic input and (2) spiking input from the other neurons. This is implemented via the N-dimensional vector I.

A.3 Spike-Timing-Dependent Plasticity.

The synaptic connections in the model are modified according to the spike-timing-dependent plasticity (STDP) rule (Song et al., 2000). We use the simplest and most effective implementation of this rule, depicted in Figure 4. If a spike from an excitatory presynaptic neuron arrives at a postsynaptic neuron (possibly making the postsynaptic neuron fire), the synapse is potentiated (strengthened). In contrast, if the spike arrives right after the postsynaptic neuron fired, the synapse is depressed (weakened). If pre- and postsynaptic neurons fire uncorrelated Poissonian spike trains, there are moments when the weight of the synaptic connection is potentiated and moments when it is depressed. We chose the parameters of the STDP curve so that depression is stronger than potentiation and the synaptic weight slowly goes to zero. Indeed, such a connection is not needed and should be eliminated. In contrast, if the presynaptic neuron often fires before the postsynaptic one, the synaptic connection slowly potentiates. Indeed, such a connection causes the postsynaptic spikes and should be strengthened. In this way, STDP strengthens causal interactions in the network. The magnitude of potentiation or depression depends on the time interval between the spikes. Each time a neuron fires, its variable STDP is reset to 0.1. Every millisecond, STDP is multiplied by 0.95, so that it decays to zero as 0.1e^(−t/20 ms), according to the parameters in Figure 4.
This decaying function determines the magnitude of potentiation or depression. For each fired neuron, we consider all of its presynaptic neurons and determine the timings of the last excitatory spikes that arrived from these neurons. Since these spikes made the neuron fire, the synaptic weights are potentiated according to the value of STDP at the presynaptic neuron, adjusted for the conduction delay. This corresponds to the positive part of the STDP curve in Figure 4. Notice that the largest increase occurs for the spikes that arrived right before the neuron fired, that is, the spikes that actually caused the postsynaptic spike. In addition, when an excitatory spike arrives at a postsynaptic neuron, we depress the synapse according to the value of STDP at the postsynaptic neuron. This corresponds to the negative part of the STDP curve in Figure 4. Indeed, such a spike arrived after the postsynaptic neuron fired, and hence the synapse between the neurons should be weakened. (The same synapse will be potentiated when the postsynaptic neuron fires.)
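The trace bookkeeping described above can be sketched as follows. This is a simplified, standalone Python version (ours): the actual program stores one trace per neuron in the matrix STDP, accumulates changes into the derivatives sd, and adjusts pairing times for conduction delays.

```python
import math

TAU_MS = 20.0  # decay time constant; multiplying by 0.95 per ms is
               # approximately exp(-1/20) per ms

def trace(t_since_spike_ms):
    """Value of a neuron's STDP trace t ms after its spike (reset to 0.1)."""
    return 0.1 * math.exp(-t_since_spike_ms / TAU_MS)

def weight_change(dt_ms):
    """Contribution of one pre/post spike pairing to the derivative sd.

    dt_ms >= 0: the presynaptic spike arrived dt_ms before the postsynaptic
    spike, so the synapse is potentiated by the presynaptic trace.
    dt_ms < 0: it arrived |dt_ms| ms after, so the synapse is depressed by
    1.2 times the postsynaptic trace (depression outweighs potentiation).
    """
    if dt_ms >= 0:
        return trace(dt_ms)
    return -1.2 * trace(-dt_ms)

print(weight_change(5.0))   # small positive change (potentiation)
print(weight_change(-5.0))  # larger negative change (depression)
```

Because depression is scaled by 1.2, an uncorrelated pre/post pair drifts toward zero weight, which is exactly the pruning behavior described above.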
Instead of changing the synaptic weights directly, we change their derivatives sd and then update the weights once a second according to the rule s ← s + 0.01 + sd and sd ← 0.9sd, where 0.01 describes the activity-independent increase of synaptic weight needed to potentiate synapses coming to silent neurons (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998; Desai, Cudmore, Nelson, & Turrigiano, 2002). Thus, the synaptic change is not instantaneous but slow, taking many seconds to develop. We manually keep the weights in the range between 0 and sm, where sm is a parameter of the model, typically less than 10 (mV).

Acknowledgments

Anil K. Seth and Bruno van Swinderen read the manuscript and made a number of useful suggestions. Gerald M. Edelman, Bernie J. Baars, Anil K. Seth, and Bruno van Swinderen motivated my interest in the studies of consciousness. The concept of consciousness as attention to memories was developed in conversations with Bruno van Swinderen. This research was supported by the Neurosciences Research Foundation.

References

Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M. (2002). Synfire chains. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 1143–1146). Cambridge, MA: MIT Press.
Amit, D. J., & Brunel, N. (1997). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cereb. Cortex, 7, 237–252.
Baker, S. N., & Lemon, R. N. (2000). Precise spatiotemporal repeating patterns in monkey primary and supplementary motor areas occur at chance levels. J. Neurophysiol., 84, 1770–1780.
Beggs, J. M., & Plenz, D. (2003). Neuronal avalanches in neocortical circuits. J. Neuroscience, 23, 11167–11177.
Beggs, J. M., & Plenz, D. (2004). Neuronal avalanches are diverse and precise activity patterns that are stable for many hours in cortical slice cultures. J. Neuroscience, 24, 5216–5229.
Bellen, A., & Zennaro, M. (2003). Numerical methods for delay differential equations. Oxford: Clarendon Press.
Bienenstock, E. (1995). A model of neocortex. Network: Comput. Neural Syst., 6, 179–224.
Braitenberg, V., & Schuz, A. (1991). Anatomy of the cortex: Statistics and geometry. Berlin: Springer-Verlag.
Bryant, H., & Segundo, J. (1976). Spike initiation by transmembrane current: A white-noise analysis. J. Physiol. (Lond.), 260, 279–314.
Buzsaki, G., Llinas, R., Singer, W., Berthoz, A., & Christen, Y. (Eds.). (1994). Temporal coding in the brain. New York: Springer-Verlag.
Polychronization
279
Chang, E. Y., Morris, K. F., Shannon, R., & Lindsey, B. G. (2000). Repeated sequences of interspike intervals in baroresponsive respiratory related neuronal assemblies of the cat brain stem. J. Neurophysiol., 84, 1136–1148.
Changeux, J. P., & Danchin, A. (1976). Selective stabilization of developing synapses as a mechanism for the recall and recognition. Cognition, 33, 25–62.
de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
Desai, N. S., Cudmore, R. H., Nelson, S. B., & Turrigiano, G. G. (2002). Critical periods for experience-dependent synaptic scaling in visual cortex. Nature Neuroscience, 5, 783–789.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Edelman, G. M. (1987). Neural Darwinism: The theory of neuronal group selection. New York: Basic Books.
Edelman, G. M. (1993). Neural Darwinism: Selection and reentrant signaling in higher brain function. Neuron, 10, 115–125.
Edelman, G. M. (2004). Wider than the sky: The phenomenal gift of consciousness. New Haven, CT: Yale University Press.
Edelman, G. M., & Gally, J. (2001). Degeneracy and complexity in biological systems. PNAS, 98, 13763–13768.
Eurich, C., Pawelzik, K., Ernst, U., Cowan, J., & Milton, J. (1999). Dynamics of self-organized delay adaptation. Phys. Rev. Lett., 82, 1594–1597.
Ferster, D., & Lindstrom, S. (1983). An intracellular analysis of geniculocortical connectivity in area 17 of the cat. Journal of Physiology, 342, 181–215.
Foss, J., & Milton, J. (2000). Multistability in recurrent neural loops arising from delay. J. Neurophysiol., 84, 975–985.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1, 17–61.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. PNAS, 79, 2554–2558.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer-Verlag.
Huning, H., Glunder, H., & Palm, G. (1998). Synaptic delay learning in pulse-coupled neurons. Neural Computation, 10, 555–565.
Ikegaya, Y., Aaron, G., Cossart, R., Aronov, D., Lampl, I., Ferster, D., & Yuste, R. (2004). Synfire chains and cortical songs: Temporal modules of cortical activity. Science, 304, 559–564.
Izhikevich, E. M. (2003). Simple model of spiking neurons. IEEE Transactions on Neural Networks, 14, 1569–1572.
Izhikevich, E. M. (2004). Which model to use for cortical spiking neurons? IEEE Transactions on Neural Networks, 15, 1063–1070.
Izhikevich, E. M. (2006). Dynamical systems in neuroscience: The geometry of excitability and bursting. Cambridge, MA: MIT Press.
Izhikevich, E. M., Gally, J. A., & Edelman, G. M. (2004). Spike-timing dynamics of neuronal groups. Cerebral Cortex, 14, 933–944.
Krichmar, J. L., & Edelman, G. M. (2002). Machine psychology: Autonomous behavior, perceptual categorization and conditioning in a brain-based device. Cerebral Cortex, 12, 818–830.
Lindsey, B. G., Morris, K. F., Shannon, R., & Gerstein, G. L. (1997). Repeated patterns of distributed synchrony in neuronal assemblies. J. Neurophysiol., 78, 1714–1719.
Litvak, V., Sompolinsky, H., Segev, I., & Abeles, M. (2003). On the transmission of rate code in long feed-forward networks with excitatory-inhibitory balance. J. Neurosci., 23, 3006–3015.
Maass, W., Natschlaeger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14, 2531–2560.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Mao, B.-Q., Hamzei-Sichani, F., Aronov, D., Froemke, R. C., & Yuste, R. (2001). Dynamics of spontaneous activity in neocortical slices. Neuron, 32, 883–898.
Mazurek, M. E., & Shadlen, M. N. (2002). Limits to the temporal fidelity of cortical spike rate signals. Nat. Neurosci., 5, 463–471.
Miller, R. (1996a). Neural assemblies and laminar interactions in the cerebral cortex. Biol. Cybern., 75(3), 253–261.
Miller, R. (1996b). Cortico-thalamic interplay and the security of operation of neural assemblies and temporal chains in the cerebral cortex. Biol. Cybern., 75(3), 263–275.
Oram, M. W., Wiener, M. C., Lestienne, R., & Richmond, B. J. (1999). Stochastic nature of precisely timed spike patterns in visual system neuronal responses. J. Neurophysiol., 81, 3021–3033.
Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Slovin, H., & Abeles, M. (1998). Spatiotemporal structure of cortical activity: Properties and behavioral relevance. J. Neurophysiol., 79, 2857–2874.
Reinagel, P., & Reid, R. C. (2002). Precise firing events are conserved across neurons. J. Neurosci., 22, 6837–6841.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Salami, M., Itami, C., Tsumoto, T., & Kimura, F. (2003). Change of conduction velocity by regional myelination yields constant latency irrespective of distance between thalamus and cortex. PNAS, 100, 6174–6179.
Senn, W., Schneider, M., & Ruf, B. (2002). Activity-dependent development of axonal and dendritic delays, or, why synaptic transmission should be unreliable. Neural Computation, 14, 583–619.
Seth, A. K., McKinstry, J. L., Edelman, G. M., & Krichmar, J. L. (2004a). Spatiotemporal processing of whisker input supports texture discrimination by a brain-based device. In S. Schaal, A. Billard, S. Vijayakumar, J. Hallam, & J.-A. Meyer (Eds.), From animals to animats 8: Proceedings of the Eighth International Conference on the Simulation of Adaptive Behavior. Cambridge, MA: MIT Press.
Seth, A. K., McKinstry, J. L., Edelman, G. M., & Krichmar, J. L. (2004b). Visual binding through reentrant connectivity and dynamic synchronization in a brain-based device. Cerebral Cortex, 14, 1185–1199.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation and information coding. J. Neurosci., 18, 3870–3896.
Shadlen, M. N., & Movshon, J. A. (1999). Synchrony unbound: A critical evaluation of the temporal binding hypothesis. Neuron, 24, 67–77.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annual Review of Neuroscience, 18, 555–586.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neurosci., 3, 919–926.
Stewart, I., Golubitsky, M., & Pivato, M. (2003). Symmetry groupoids and patterns of synchrony in coupled cell networks. SIAM J. Appl. Dynam. Sys., 2, 606–646.
Strehler, B. L., & Lestienne, R. (1986). Evidence on precise time-coded symbols and memory of patterns in monkey cortical neuronal spike trains. PNAS, 83, 9812–9816.
Swadlow, H. A. (1974). Systematic variations in the conduction velocity of slowly conducting axons in the rabbit corpus callosum. Experimental Neurology, 43, 445–451.
Swadlow, H. A. (1985). Physiological properties of individual cerebral axons studied in vivo for as long as one year. J. Neurophysiology, 54, 1346–1362.
Swadlow, H. A. (1988). Efferent neurons and suspected interneurons in binocular visual cortex of the awake rabbit: Receptive fields and binocular properties. J. Neurophysiol., 88, 1162–1187.
Swadlow, H. A. (1992). Monitoring the excitability of neocortical efferent neurons to direct activation by extracellular current pulses. J. Neurophysiol., 68, 605–619.
Swadlow, H. A. (1994). Efferent neurons and suspected interneurons in motor cortex of the awake rabbit: Axonal properties, sensory receptive fields, and subthreshold synaptic inputs. J. Neurophysiology, 71, 437–453.
Swadlow, H. A., Kocsis, J. D., & Waxman, S. G. (1980). Modulation of impulse conduction along the axonal tree. Ann. Rev. Biophys. Bioeng., 9, 143–179.
Swadlow, H. A., & Waxman, S. G. (1975). Observations on impulse conduction along central axons. PNAS, 72, 5156–5159.
Tetko, I. V., & Villa, A. E. P. (2001). A pattern grouping algorithm for analysis of spatiotemporal patterns in neuronal spike trains. 2: Application to simultaneous single unit recordings. Journal of Neuroscience Methods, 105, 15–24.
Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
Villa, A. E., Tetko, I. V., Hyland, B., & Najem, A. (1999). Spatiotemporal activity patterns of rat cortical neurons predict responses in a conditioned task. Proc. Natl. Acad. Sci. USA, 96, 1106–1111.
Volman, V., Baruchi, I., & Ben-Jacob, E. (2005). Manifestation of function-follow-form in cultured neuronal networks. Physical Biology, 2, 98–110.
Whittington, M. A., Traub, R. D., Kopell, N., Ermentrout, B., & Buhl, E. H. (2000). Inhibition-based rhythms: Experimental and mathematical observations on network dynamics. Int. J. Psychophysiol., 38, 315–336.
Wiener, J., & Hale, J. K. (1992). Ordinary and delay differential equations. New York: Wiley.
Received January 31, 2005; accepted June 14, 2005.
LETTER
Communicated by Peter Dayan
Making Working Memory Work: A Computational Model of Learning in the Prefrontal Cortex and Basal Ganglia Randall C. O’Reilly
[email protected]
Michael J. Frank
[email protected] Department of Psychology, University of Colorado Boulder, Boulder, CO 80309, U.S.A.
Neural Computation 18, 283–328 (2006)
© 2005 Massachusetts Institute of Technology

The prefrontal cortex has long been thought to subserve both working memory (the holding of information online for processing) and executive functions (deciding how to manipulate working memory and perform processing). Although many computational models of working memory have been developed, the mechanistic basis of executive function remains elusive, often amounting to a homunculus. This article presents an attempt to deconstruct this homunculus through powerful learning mechanisms that allow a computational model of the prefrontal cortex to control both itself and other brain areas in a strategic, task-appropriate manner. These learning mechanisms are based on subcortical structures in the midbrain, basal ganglia, and amygdala, which together form an actor-critic architecture. The critic system learns which prefrontal representations are task relevant and trains the actor, which in turn provides a dynamic gating mechanism for controlling working memory updating. Computationally, the learning mechanism is designed to simultaneously solve the temporal and structural credit assignment problems. The model's performance compares favorably with standard backpropagation-based temporal learning mechanisms on the challenging 1-2-AX working memory task and other benchmark working memory tasks.

1 Introduction

This letter presents a computational model of working memory based on the prefrontal cortex and basal ganglia (the PBWM model). The model represents a convergence of two logically separable but synergistic goals: understanding the complex interactions between the basal ganglia (BG) and prefrontal cortex (PFC) in working memory function and developing a computationally powerful model of working memory that can learn to perform complex temporally extended tasks. Such tasks require learning which information to maintain over time (and what to forget) and how to
assign credit or blame to events based on their temporally delayed consequences. The model shows how the prefrontal cortex and basal ganglia can interact to solve these problems by implementing a flexible working memory system with an adaptive gating mechanism. This mechanism can switch between rapid updating of new information into working memory and robust maintenance of existing information already being maintained (Hochreiter & Schmidhuber, 1997; O'Reilly, Braver, & Cohen, 1999; Braver & Cohen, 2000; Cohen, Braver, & O'Reilly, 1996; O'Reilly & Munakata, 2000). It is trained in the model using a version of reinforcement learning mechanisms that are widely thought to be supported by the basal ganglia (e.g., Sutton, 1988; Sutton & Barto, 1998; Schultz et al., 1995; Houk, Adams, & Barto, 1995; Schultz, Dayan, & Montague, 1997; Suri, Bargas, & Arbib, 2001; Contreras-Vidal & Schultz, 1999; Joel, Niv, & Ruppin, 2002).

At the biological level of analysis, the PBWM model builds on existing work describing the division of labor between prefrontal cortex and basal ganglia (Frank, Loughry, & O'Reilly, 2001; Frank, 2005). In this prior work, we demonstrated that the basal ganglia can perform dynamic gating via the modulatory mechanism of disinhibition, allowing only task-relevant information to be maintained in PFC and preventing distracting information from interfering with task demands. The mechanisms for supporting such functions are analogous to the basal ganglia's role in modulating more primitive frontal systems (e.g., facilitating adaptive motor responses while suppressing others; Mink, 1996). However, to date, no model has attempted to address the more difficult question of how the BG "knows" what information is task relevant (which was hard-wired in prior models).
The present model learns this dynamic gating functionality in an adaptive manner via reinforcement learning mechanisms thought to depend on the dopaminergic system and associated areas (e.g., nucleus accumbens, basal-lateral amygdala, midbrain dopamine nuclei). In addition, the prefrontal cortex representations themselves learn using both Hebbian and error-driven learning mechanisms as incorporated into the Leabra model of cortical learning, which combines a number of well-accepted mechanisms into one coherent framework (O'Reilly, 1998; O'Reilly & Munakata, 2000).

At the computational level, the model is most closely related to the long short-term memory (LSTM) model (Hochreiter & Schmidhuber, 1997; Gers, Schmidhuber, & Cummins, 2000), which uses error backpropagation to train dynamic gating signals. The impressive learning ability of the LSTM model compared to other approaches to temporal learning that lack dynamic gating argues for the importance of this kind of mechanism. However, it is somewhat difficult to see how LSTM itself could actually be implemented in the brain. The PBWM model shows how similarly powerful levels of computational learning performance can be achieved using more biologically based mechanisms. This model has direct implications for understanding executive dysfunction in neurological disorders such as attention deficit–hyperactivity disorder (ADHD) and Parkinson's disease, which involve the
interaction between dopamine, basal ganglia, and prefrontal cortex (Frank, Seeberger, & O'Reilly, 2004; Frank, 2005). After presenting the PBWM model and its computational, biological, and cognitive bases, we compare its performance with that of several other standard temporal learning models including LSTM, a simple recurrent network (SRN; Elman, 1990; Jordan, 1986), and real-time recurrent backpropagation learning (RBP; Robinson & Fallside, 1987; Schmidhuber, 1992; Williams & Zipser, 1992).

2 Working Memory Functional Demands and Adaptive Gating

The need for an adaptive gating mechanism can be motivated by the 1-2-AX task (see Figure 1; Frank et al., 2001), which is a complex working memory task involving both goals and subgoals and is used as a test case later in the article. Number and letter stimuli (1, 2, A, X, B, Y) appear one at a time in sequence, and the participant is asked to detect one of two target sequences, depending on whether he or she last saw a 1 or a 2 (which thus serve as "task" stimuli). In the 1 task, the target is A followed by X, and for 2, it is B-Y. Thus, the task demand stimuli define an outer loop of active maintenance (maintenance of task demands) within which there can be a number of inner loops of active maintenance for the A-X level sequences. This task imposes three critical functional demands on the working memory system:

Rapid updating: As each stimulus comes in, it must be rapidly encoded in working memory.
Figure 1: The 1-2-AX task. Stimuli are presented one at a time in a sequence. The participant responds by pressing the right key (R) to the target sequence; otherwise, a left key (L) is pressed. If the subject last saw a 1, then the target sequence is an A followed by an X. If a 2 was last seen, then the target is a B followed by a Y. Distractor stimuli (e.g., 3, C, Z) may be presented at any point and are to be ignored. The maintenance of the task stimuli (1 or 2) constitutes a temporal outer loop around multiple inner-loop memory updates required to detect the target sequence.
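The response rule just described is simple enough to state as executable code. The sketch below is my own minimal encoding, not the paper's implementation; in particular, the function name and the choice to let distractors leave the maintained cue untouched are assumptions consistent with the caption's note that distractors are to be ignored:

```python
def one_two_ax_responses(stimuli):
    """Correct L/R response for each stimulus of a 1-2-AX sequence.

    Press R when the current stimulus completes the target sequence
    (A-X in the 1 task, B-Y in the 2 task); otherwise press L.
    Digits 1/2 switch the task (outer loop); A/B start an inner loop;
    distractors (e.g., 3, C, Z) are ignored and leave the state intact.
    """
    task, cue, out = None, None, []
    for s in stimuli:
        response = "L"
        if s in "12":          # task stimulus: reset the outer loop
            task, cue = s, None
        elif s in "AB":        # cue: start of an inner loop
            cue = s
        elif s in "XY":        # probe: may complete the target sequence
            if (task, cue, s) in {("1", "A", "X"), ("2", "B", "Y")}:
                response = "R"
            cue = None
        out.append(response)
    return out
```

For example, the sequence 1, A, X yields L, L, R, since the X completes the A-X target in the 1 task.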
Figure 2: Illustration of active gating. When the gate is open, sensory input can rapidly update working memory (e.g., encoding the cue item A in the 1-2-AX task), but when it is closed, it cannot, thereby preventing other distracting information (e.g., distractor C) from interfering with the maintenance of previously stored information.
Robust maintenance: The task demand stimuli (1 or 2) in the outer loop must be maintained in the face of interference from ongoing processing of inner-loop stimuli and irrelevant distractors.

Selective updating: Only some elements of working memory should be updated at any given time, while others are maintained. For example, in the inner loop, A's and X's should be updated while the task demand stimulus (1 or 2) is maintained.

The first two of these functional demands (rapid updating and robust maintenance) are directly in conflict with each other when viewed in terms of standard neural processing mechanisms, and thus motivate the need for a dynamic gating mechanism to switch between these modes of operation (see Figure 2; Cohen et al., 1996; Braver & Cohen, 2000; O'Reilly et al., 1999; O'Reilly & Munakata, 2000; Frank et al., 2001). When the gate is open, working memory is updated by incoming stimulus information; when it is closed, currently active working memory representations are robustly maintained.

2.1 Dynamic Gating via Basal Ganglia Disinhibition

One of the central postulates of the PBWM model is that the basal ganglia provide a selective dynamic gating mechanism for information maintained via sustained activation in the PFC (see Figure 3). As reviewed in Frank et al. (2001), this idea is consistent with a wide range of data and other computational models that have been developed largely in the domain of motor control, but also in working memory (Wickens, 1993; Houk & Wise, 1995; Wickens, Kotter, & Alexander, 1995; Dominey, Arbib, & Joseph, 1995; Berns & Sejnowski, 1995, 1998; Jackson & Houghton, 1995; Beiser & Houk, 1998; Kropotov &
Figure 3: The basal ganglia are interconnected with frontal cortex through a series of parallel loops, each of the form shown. Working backward from the thalamus, which is bidirectionally excitatory with frontal cortex, the SNr (substantia nigra pars reticulata) is tonically active and inhibiting this excitatory circuit. When direct pathway "Go" neurons in dorsal striatum fire, they inhibit the SNr, and thus disinhibit frontal cortex, producing a gating-like modulation that we argue triggers the update of working memory representations in prefrontal cortex. The indirect pathway "NoGo" neurons of dorsal striatum counteract this effect by inhibiting the inhibitory GPe (globus pallidus, external segment).
Etlinger, 1999; Amos, 2000; Nakahara, Doya, & Hikosaka, 2001). Specifically, in the motor domain, various authors suggest that the BG are specialized to selectively facilitate adaptive motor actions, while suppressing others (Mink, 1996). This same functionality may hold for more advanced tasks, in which the “action” to facilitate is the updating of prefrontal working memory representations (Frank et al., 2001; Frank, 2005). To support robust active maintenance in PFC, our model takes advantage of intrinsic bistability of PFC neurons, in addition to recurrent excitatory connections (Fellous, Wang, & Lisman, 1998; Wang, 1999; Durstewitz, Kelc, & Gunturkun, 1999; Durstewitz, Seamans, & Sejnowski, 2000a). Here we present a summary of our previously developed framework (Frank et al., 2001) for how the BG achieves gating:
• Rapid updating occurs when direct pathway spiny "Go" neurons in the dorsal striatum fire. Go firing directly inhibits the substantia nigra pars reticulata (SNr) and releases its tonic inhibition of the thalamus. This thalamic disinhibition enables, but does not directly cause (i.e., gates), a loop of excitation into the PFC. The effect of this excitation in the model is to toggle the state of bistable currents in the PFC neurons. Striatal Go neurons in the direct pathway are in competition (in the SNr, if not the striatum; Mink, 1996; Wickens, 1993) with "NoGo" neurons in the indirect pathway that effectively produce more inhibition of thalamic neurons and therefore prevent gating.

• Robust maintenance occurs via intrinsic PFC mechanisms (bistability, recurrence) in the absence of Go updating signals. This is supported by the NoGo indirect pathway firing to prevent updating of extraneous information during maintenance.

• Selective updating occurs because there are parallel loops of connectivity through different areas of the basal ganglia and frontal cortex (Alexander, DeLong, & Strick, 1986; Graybiel & Kimura, 1995; Middleton & Strick, 2000). We refer to the separately updatable components of the PFC/BG system as stripes, in reference to relatively isolated groups of interconnected neurons in PFC (Levitt, Lewis, Yoshioka, & Lund, 1993; Pucak, Levitt, Lund, & Lewis, 1996). We previously estimated that the human frontal cortex could support roughly 20,000 such stripes (Frank et al., 2001).
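The stripe-based scheme above can be caricatured in a few lines of code: each stripe updates only when its own Go signal fires and otherwise maintains its contents. This is a toy sketch of the gating logic only, not the paper's neural implementation; the class name and the two-stripe example are mine:

```python
class GatedMemory:
    """Toy sketch of stripe-wise gating: each stripe is one independently
    updatable slot of working memory (a "stripe" in the PFC/BG sense)."""

    def __init__(self, n_stripes):
        self.stripes = [None] * n_stripes

    def step(self, stimulus, go):
        # go[k] plays the role of striatal Go firing in stripe k's loop:
        # True disinhibits the thalamocortical circuit and lets the
        # stimulus be gated in (rapid updating); False leaves the stripe
        # untouched (robust maintenance, NoGo-dominated).
        for k, fire_go in enumerate(go):
            if fire_go:
                self.stripes[k] = stimulus
        return list(self.stripes)

mem = GatedMemory(2)
mem.step("1", [True, False])   # task stimulus gated into stripe 0
mem.step("A", [False, True])   # inner-loop cue into stripe 1; stripe 0 maintains
mem.step("C", [False, False])  # distractor: no Go fires, nothing updates
```

Because the Go signals are per stripe, the same step can update one slot while another is robustly maintained, which is exactly the selective-updating demand of the 1-2-AX task.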
3 Learning When to Gate in the Basal Ganglia

Figure 4 provides a summary of how basal ganglia gating can solve the 1-2-AX task. This figure also illustrates that the learning problem in the basal ganglia amounts to learning when to fire a Go versus NoGo signal in a given stripe based on the current sensory input and maintained PFC activations. Without such a learning mechanism, our model would require some kind of intelligent homunculus to control gating. Thus, the development of this learning mechanism is a key step in banishing the homunculus from the domain of working memory models (cf. the "central executive" of Baddeley's, 1986, model). There are two fundamental problems that must be solved by the learning mechanism:

Temporal credit assignment: The benefits of having encoded a given piece of information into prefrontal working memory are typically available only later in time (e.g., encoding the 1 task demand helps later only when confronted with an A-X sequence). Thus, the problem is to know which prior events were critical for subsequent good (or bad) performance.

Structural credit assignment: The network must decide which PFC stripes should encode which different pieces of information at a given time. When successful performance occurs, it must reinforce those stripes that
Figure 4: Illustration of how the basal ganglia gating of different PFC stripes can solve the 1-2-AX task (light color = active; dark = not active). (a) The 1 task is gated into an anterior PFC stripe because a corresponding striatal stripe fired Go. (b) The distractor C fails to fire striatal Go neurons, so it will not be maintained; however, it does elicit transient PFC activity. Note that the 1 persists because of gating-induced robust maintenance. (c) The A is gated in. (d) A right key press motor action is activated (using the same BG-mediated disinhibition mechanism) based on X input plus maintained PFC context.
actually contributed to this success. This form of credit assignment is what neural network models are typically very good at doing, but clearly this form of structural credit assignment interacts with the temporal credit assignment problem, making it more complex. The PBWM model uses a reinforcement-learning algorithm called PVLV (in reference to its Pavlovian learning mechanisms; O’Reilly, Frank, Hazy, & Watz, 2005) to solve the temporal credit assignment problem. The simulated dopaminergic (DA) output of this PVLV system modulates Go versus NoGo firing activity in a stripe-wise manner in BG-PFC circuits to facilitate structural credit assignment. Each of these is described in detail below. The model (see Figure 5) has an actor-critic structure (Sutton & Barto, 1998), where the critic is the PVLV system that controls the firing of simulated midbrain DA neurons and trains both itself and the actor. The actor is the basal ganglia gating system, composed of the Go and NoGo pathways in the dorsal striatum and their associated projections through BG output structures to the thalamus, and then back up to the PFC. The DA signals computed
Figure 5: Overall architecture of the PBWM model. Sensory inputs are mapped to motor outputs via posterior cortical (“hidden”) layers, as in a standard neural network model. The PFC contextualizes this mapping by representing relevant prior information and goals. The basal ganglia (BG) update the PFC representations via dynamic gating, and the PVLV system drives dopaminergic (DA) modulation of the BG so it can learn when to update. The BG/PVLV system constitutes an actor-critic architecture, where the BG performs updating actions and the PVLV system “critiques” the potential reward value of these actions, with the resulting modulation shaping future actions to be more rewarding.
by PVLV drive both performance and learning effects via opposite effects on Go and NoGo neurons (Frank, 2005). Specifically, DA is excitatory onto the Go neurons via D1 receptors and inhibitory onto NoGo neurons via D2 receptors (Gerfen, 2000; Hernandez-Lopez et al., 2000). Thus, positive DA bursts (above tonic level firing) tend to increase Go firing and decrease NoGo firing, while dips in DA firing (below tonic levels) have the opposite effect. The change in activation state as a result of this DA modulation can then drive learning in an appropriate way, as detailed below and in Frank (2005).

3.1 Temporal Credit Assignment: The PVLV Algorithm

The firing patterns of midbrain dopamine (DA) neurons (ventral tegmental area, VTA, and substantia nigra pars compacta, SNc; both strongly innervated by the basal ganglia) exhibit the properties necessary to solve the temporal credit assignment problem because they appear to learn to fire for stimuli that predict subsequent rewards (e.g., Schultz, Apicella, & Ljungberg, 1993; Schultz, 1998). This property is illustrated in schematic form in Figure 6a for a simple Pavlovian conditioning paradigm, where a conditioned stimulus (CS, e.g., a tone) predicts a subsequent unconditioned stimulus (US, i.e., a reward). Figure 6b shows how this predictive DA firing can reinforce BG Go firing to maintain a stimulus, when such maintenance leads to subsequent reward.
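The opposing D1/D2 effects of phasic DA on Go and NoGo firing described above amount to a sign difference in how deviations from tonic firing perturb the two pathways. A minimal sketch follows; the linear form, the clipping at zero, and the gain value are illustrative assumptions, not the model's actual equations:

```python
def da_modulate(go_act, nogo_act, da, gain=0.5):
    """Perturb Go/NoGo activations by a phasic DA signal.

    da is expressed relative to tonic baseline: da > 0 is a burst,
    da < 0 a dip. Bursts excite Go units (D1 receptors) and inhibit
    NoGo units (D2 receptors); dips do the opposite. Activations are
    clipped at zero. Linear form and gain are illustrative only.
    """
    go = max(0.0, go_act + gain * da)      # D1: DA is excitatory on Go
    nogo = max(0.0, nogo_act - gain * da)  # D2: DA is inhibitory on NoGo
    return go, nogo
```

With this sign convention, a burst shifts the balance toward Go (updating) and a dip toward NoGo (maintenance), which is the performance effect the text describes; the induced activation changes are then what learning acts on.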
Figure 6: (a) Schematic of dopamine (DA) neural firing for a conditioned stimulus (CS, e.g., a tone) that reliably predicts a subsequent unconditioned stimulus (US, i.e., a reward, r). Initially, DA fires at the point of reward, but then over repeated trials learns to fire at the onset of the stimulus. (b) This DA firing pattern can solve the temporal credit assignment problem for PFC active maintenance. Here, the PFC maintains the transient input stimulus (initially by chance), leading to reward. As the DA system learns, it begins to fire DA bursts at stimulus onset, by virtue of PFC "bridging the gap" (in place of a sustained input). DA firing at stimulus onset reinforces the firing of basal ganglia Go neurons, which drive updating in PFC.
Specifically, the DA firing can move from the time of a reward to the onset of a stimulus that, if maintained in the PFC, leads to this subsequent reward. Because this DA firing occurs when the stimulus comes on, it is well timed to facilitate the storage of this stimulus in PFC. In the model, this occurs by reinforcing the connections between the stimulus and the Go gating neurons in the striatum, which then cause updating of PFC to maintain the stimulus. Note that other models have leveraged this same logic, but have the DA firing itself cause updating of working memory via direct DA projections to PFC (O'Reilly et al., 1999; Braver & Cohen, 2000; Cohen et al., 1996; O'Reilly & Munakata, 2000; Rougier & O'Reilly, 2002; O'Reilly, Noelle, Braver, & Cohen, 2002). The disadvantage of this global DA signal is that it
would update the entire PFC every time, making it difficult to perform tasks like the 1-2-AX task, which require maintenance of some representations while updating others. The apparently predictive nature of the DA firing has almost universally been explained in terms of the temporal differences (TD) reinforcement learning mechanism (Sutton, 1988; Sutton & Barto, 1998; Schultz et al., 1995; Houk et al., 1995; Montague, Dayan, & Sejnowski, 1996; Suri et al., 2001; Contreras-Vidal & Schultz, 1999; Joel et al., 2002). The earlier DA gating models cited above and an earlier version of the PBWM model (O'Reilly & Frank, 2003) also used this TD mechanism to capture the essential properties of DA firing in the BG. However, considerable subsequent exploration and analysis of these models has led us to develop a non-TD-based account of these DA firing patterns, one that abandons the prediction framework on which TD is based (O'Reilly et al., 2005). In brief, TD learning depends on sequential chaining of predictions from one time step to the next, and any weak link (i.e., unpredictable event) can break this chain. In many of the tasks faced by our models (e.g., the 1-2-AX task), the sequence of stimulus states is almost completely unpredictable, and this significantly disrupts the TD chaining mechanism, as shown in O'Reilly et al. (2005). Instead of relying on prediction as the engine of learning, we have developed a fundamentally associative "Pavlovian" learning mechanism called PVLV, which consists of two systems: primary value (PV) and learned value (LV) (O'Reilly et al., 2005; see Figure 7). The PV system is just the

Figure 7: PVLV learning mechanism. (a) Structure of PVLV. The PV (primary value) system learns about primary rewards and contains two subsystems: the excitatory (PVe) drives excitatory DA bursts from primary rewards (US = unconditioned stimulus), and the inhibitory (PVi) learns to cancel these bursts (using timing or other reliable signals). Anatomically, the PVe corresponds to the lateral hypothalamus (LHA), which has excitatory projections to the midbrain DA nuclei and responds to primary rewards. The PVi corresponds to the striosome-patch neurons in the ventral striatum (V. Str.), which have direct inhibitory projections onto the DA system, and learn to fire at the time of expected rewards. The LV (learned value) system learns to fire for conditioned stimuli (CS) that are reliably associated with reward. The excitatory component (LVe) drives DA bursting and corresponds to the central nucleus of the amygdala (CNA), which has excitatory DA projections and learns to respond to CSs. The inhibitory component (LVi) is just like the PVi, except it inhibits CS-associated bursts. (b) Application to the simple conditioning paradigm depicted in the previous figure, where the PVi learns (based on the PVe reward value at each time step) to cancel the DA burst at the time of reward, while the LVe learns a positive CS association (only at the time of reward) and drives DA bursts at CS onset. The phasic nature of CS firing, despite a sustained CS input, requires a novelty detection mechanism of some form; we suggest a synaptic depression mechanism as having beneficial computational properties.
Making Working Memory Work
Rescorla-Wagner/delta-rule learning algorithm (Rescorla & Wagner, 1972; Widrow & Hoff, 1960), trained by the primary reward value $r^t$ (i.e., the US) at each time step $t$ (where time steps correspond to discrete events in the environment, such as the presentation of a CS or US). For simplicity, consider a single linear unit that computes an expected reward value $\hat{V}^t_{pv}$ based on weights $w^t_i$ coming from sensory and other inputs $x^t_i$ (e.g., including timing signals from the cerebellum):

$$\hat{V}^t_{pv} = \sum_i x^t_i w^t_i \quad (3.1)$$

(our actual value representation uses a distributed representation, as described in the appendix). The error in this expected reward value relative to the actual reward present at time $t$ represents the PV system's contribution to the overall DA signal:

$$\delta^t_{pv} = r^t - \hat{V}^t_{pv}. \quad (3.2)$$
Note that all of these terms are in the current time step, whereas the similar equation in TD involves terms across different adjacent time steps. This delta value then trains the weights into the PV reward expectation,

$$\Delta w^t_i = \epsilon \, \delta^t_{pv} x^t_i, \quad (3.3)$$

where $\Delta w^t_i$ is the change in weight value and $0 < \epsilon < 1$ is a learning rate. As the system learns to expect primary rewards based on sensory and other inputs, the delta value decreases. This can account for the cancellation of the dopamine burst at the time of reward, as observed in the neural recording data (see Figure 7b). When a conditioned stimulus is activated in advance of a primary reward, the PV system is actually trained to not expect reward at this time, because it is always trained by the current primary reward value, which is zero in this case. Therefore, we need an additional mechanism to account for the anticipatory DA bursting at CS onset, which in turn is critical for training up the BG gating system (see Figure 6). This is the learned value (LV) system, which is trained only when primary rewards are either present or expected by the PV and is free to fire at other times without adapting its weights. Therefore, the LV is protected from having to learn that no primary reward is actually present at CS onset, because it is not trained at that time. In other words, the LV system is free to signal reward associations for stimuli even at times when no primary reward is actually expected. This results in the anticipatory dopamine spiking at CS onset (see Figure 7b), without requiring an unbroken chain of predictive events between stimulus onset and subsequent reward, as in TD. Thus, this anticipatory dopamine spiking
by the LV system is really just signaling a reward association, not a reward prediction. As detailed in O'Reilly et al. (2005), this PV/LV division provides a good mapping onto the biology of the DA system (see Figure 7a). Excitatory projections from the lateral hypothalamus (LHA) and central nucleus of the amygdala (CNA) are known to drive DA bursts in response to primary rewards (LHA) and conditioned stimuli (CNA) (e.g., Cardinal, Parkinson, Hall, & Everitt, 2002). Thus, we consider LHA to represent $r$, which we also label as PVe to denote the excitatory component of the primary value system. The CNA corresponds to the excitatory component of the LV system described above (LVe), which learns to drive DA bursts in response to conditioned stimuli. The primary reward system $\hat{V}_{pv}$ that cancels DA firing at reward delivery is associated with the striosome/patch neurons in the ventral striatum, which have direct inhibitory projections into the DA system (e.g., Joel & Weiner, 2000), and learn to fire at the time of expected primary rewards (e.g., Schultz, Apicella, Scarnati, & Ljungberg, 1992). We refer to this as the inhibitory part of the primary value system, PVi. For symmetry and important functional reasons described later, we also include a similar inhibitory component to the LV system, LVi, which is also associated with the same ventral striatum neurons, but slowly learns to cancel DA bursts associated with CS onset. (For full details on PVLV, see O'Reilly et al., 2005, and the equations in the appendix.)

3.2 Structural Credit Assignment. The PVLV mechanism just described provides a solution to the temporal credit assignment problem, and we use the overall PVLV $\delta$ value to simulate midbrain (VTA, SNc) dopamine neuron firing rates (deviations from baseline).
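The PV/LV update rules (equations 3.1-3.3, plus the conditional LV training rule) can be sketched in a few lines of Python. This is a deliberately scalarized toy, not the published implementation, which uses distributed value representations and the parameters given in the appendix; all function and variable names here are ours.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pv_update(w_pv, x, r, lrate=0.1):
    """One PV (primary value) step, per equations 3.1-3.3:
    V_pv = sum_i x_i w_i (3.1); delta_pv = r - V_pv (3.2);
    dw_i = lrate * delta_pv * x_i (3.3). Returns the PV dopamine signal."""
    delta_pv = r - dot(x, w_pv)
    for i in range(len(w_pv)):
        w_pv[i] += lrate * delta_pv * x[i]
    return delta_pv

def lv_update(w_lv, x, r, pv_expects_reward, lrate=0.1):
    """LV (learned value) step: trained toward r only when primary reward is
    present or expected by the PV; at all other times (e.g., CS onset) it is
    free to drive DA bursts without its weights adapting."""
    v_lv = dot(x, w_lv)
    if r != 0.0 or pv_expects_reward:
        for i in range(len(w_lv)):
            w_lv[i] += lrate * (r - v_lv) * x[i]
    return v_lv

# Toy conditioning: a sustained CS is followed by a US (r = 1) one step later.
cs, us = [1.0, 0.0], [0.0, 1.0]
w_pv, w_lv = [0.0, 0.0], [0.0, 0.0]
for _ in range(200):
    lv_update(w_lv, cs, 0.0, pv_expects_reward=False)  # CS onset: LV untrained
    lv_update(w_lv, cs, 1.0, pv_expects_reward=True)   # reward time: LV learns CS
    pv_update(w_pv, us, 1.0)                           # PV cancels reward burst
```

Running the toy conditioning loop reproduces the qualitative pattern in Figure 7b: the PV delta (the reward-time DA burst) shrinks toward zero as rewards become expected, while the LV acquires a positive CS association that can drive anticipatory DA at CS onset.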
To provide a solution to the structural credit assignment problem, the global PVLV DA signal can be modulated by the Go versus NoGo firing of the different PFC/BG stripes, so that each stripe gets a differentiated DA signal that reflects its contribution to the overall reward signal. Specifically, we hypothesize that the SNc provides a more stripe-specific DA signal by virtue of inhibitory projections from the SNr to the SNc (e.g., Joel & Weiner, 2000). As noted above, these SNr neurons are tonically active and are inhibited by the firing of Go neurons in the striatum. Thus, to the extent that a stripe fires a strong Go signal, it will disinhibit the SNc DA projection to itself, while those that are firing NoGo will remain inhibited and not receive DA signals. We suggest that this inhibitory projection from SNr to SNc produces a shunting property that negates the synaptic inputs that produce bursts and dips, while preserving the intrinsically generated tonic DA firing levels. Mathematically, this results in a multiplicative relationship, such that the degree of Go firing multiplies the magnitude of the DA signal it receives (see the appendix for details). It remains to be determined whether the SNc projections support stripe-specific topography (see Haber, Fudge, & McFarland, 2000, for data
suggestive of some level of topography), but it is important to emphasize that the proposed mechanism involves only a modulation in the amplitude of phasic DA changes in a given stripe and not qualitatively different firing patterns from different SNc neurons. Thus, very careful quantitative parallel DA recording studies across multiple stripes would be required to test this idea. Furthermore, it is possible that this modulation could be achieved through other mechanisms operating in the synaptic terminals regulating DA release (Joel & Weiner, 2000), in addition to or instead of overall firing rates of SNc neurons. What is clear from the results presented below is that the networks are significantly impaired at learning without this credit assignment mechanism, so we feel it is likely to be implemented in the brain in some manner.

3.3 Dynamics of Updating and Learning. In addition to solving the temporal and structural credit assignment problems, the PBWM model depends critically on the temporal dynamics of activation updating to solve the following functional demands:
- Within one stimulus-response time step, the PFC must provide a stable context representation reflecting ongoing goals or prior stimulus context, and it must also be able to update to reflect appropriate changes in context for subsequent processing. Therefore, the system must be able to process the current input and make an appropriate response before the PFC is allowed to update. This offset updating of context representations is also critical for the SRN network, as discussed later.
- In standard Leabra, there are two phases of activation updating: a minus phase, where a stimulus is processed to produce a response, followed by a plus phase, where any feedback (when available) is presented, allowing the network to correct its response next time. Both of these phases must occur with a stable PFC context representation for the feedback to be able to drive learning appropriately. Furthermore, the BG Go/NoGo firing that decides whether to update the current PFC representations must also be appropriately contextualized by these stable PFC context representations. Therefore, in PBWM, we add a third update phase where PFC representations update, based on BG Go/NoGo firing that was computed in the plus phase (with the prior PFC context active). Biologically, this would occur in a more continuous fashion, but with appropriate delays such that PFC updating occurs after motor responding.
- The PVLV system must learn about the value of maintaining a given PFC representation at the time an output response is made and rewarded (or not). This reward learning is based on adapting synaptic weights from PFC representations active at the time of reward, not based on any transient sensory inputs that initially activated those PFC representations, which could have been many time steps earlier (and long since gone). After BG Go firing updates PFC representations (during the third phase of settling), the PVLV critic can then evaluate the value of the new PFC state to provide a training signal to Go/NoGo units in the striatum. This training signal is directly contingent on striatal actions: Did the update result in a "good" (as determined by PVLV associations) PFC state? If good (DA burst), then increase the likelihood of Go firing next time. If bad (DA dip), then decrease the Go firing likelihood and increase NoGo firing. This occurs via direct DA modulation of the Go/NoGo neurons in the third phase, where bursts increase Go and decrease NoGo activations and dips have the opposite effect (Frank, 2005). Thus, the Go/NoGo units learn using the delta rule over their states in the second and third phases of settling, where the third phase reflects the DA modulation from the PVLV evaluation of the new PFC state.
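The stripe-specific DA modulation of section 3.2 and the third-phase Go/NoGo learning just described can be caricatured in a few lines. The scalar activations, the simple additive burst effect with clipping, and the learning-rate value here are our illustrative assumptions, not the actual Leabra equations (see the appendix).

```python
def stripe_da(global_delta, go_act):
    """Stripe-specific DA (section 3.2): SNr Go firing disinhibits the SNc
    projection to its own stripe, so the global PVLV delta is multiplicatively
    gated by how strongly that stripe fired Go (0 = NoGo, 1 = full Go)."""
    return global_delta * go_act

def go_nogo_learn(go2, nogo2, da, lrate=0.2):
    """Third-phase learning sketch: a DA burst (da > 0) increases Go and
    decreases NoGo activations in phase 3 (a dip does the opposite), and the
    units learn via the delta rule over their phase-2 vs. phase-3 states."""
    clip = lambda a: min(1.0, max(0.0, a))
    go3, nogo3 = clip(go2 + da), clip(nogo2 - da)
    return lrate * (go3 - go2), lrate * (nogo3 - nogo2)

# A stripe that drove the update receives the burst; a NoGo stripe gets none.
burst = 0.5
full = stripe_da(burst, go_act=1.0)   # updating stripe: full learning signal
none = stripe_da(burst, go_act=0.0)   # non-updating stripe: no learning signal
```

A burst thus strengthens Go firing in exactly the stripes whose gating produced the "good" PFC state, which is the structural credit assignment the model requires.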
To summarize, the temporal credit assignment “time travel” of perceived value, from the point of reward back to the critical stimuli that must be maintained, must be based strictly on PFC states and not sensory inputs. But this creates a catch-22 because these PFC states reflect inputs only after updating has occurred (O’Reilly & Munakata, 2000), so the system cannot know that it would be good to update PFC to represent current inputs until it has already done so. This is solved in PBWM by having one system (PVLV) for solving the temporal credit assignment problem (based on PFC states) and a different one (striatum) for deciding when to update PFC (based on current sensory inputs and prior PFC context). The PVLV system then evaluates the striatal updating actions after updating has occurred. This amounts to trial-and-error learning, with the PVLV system providing immediate feedback for striatal gating actions (and this feedback is in turn based on prior learning by the PVLV system, taking place at the time of primary rewards). The system, like most reinforcement learning systems, requires sufficient exploration of different gating actions to find those that are useful. The essential logic of these dynamics in the PBWM model is illustrated in Figure 8 in the context of a simple “store ignore recall” (SIR) working memory task (which is also simulated, as described later). There are two additional functional features of the PBWM model: (1) a mechanism to ensure that striatal units are not stuck in NoGo mode (which would prevent them from ever learning) and to introduce some random exploratory firing, and (2) a contrast-enhancement effect of dopamine modulation on the Go/NoGo units that selectively modulates those units that were actually active relative to those that were not. The details of these mechanisms are described in the appendix, and their overall contributions to learning, along
[Figure 8 diagram: the three task states (state 1: successful Recall; state 2: storing S; state 3: ignoring I) depicted across the PFC, Input, Striatum (g/n), and DA layers, with Δw marking key weight changes; panel annotations summarize that Go firing that stores S is reinforced by DA, while synaptic depression stops DA firing for the maintained S on ignore trials.]
Figure 8: Phase-based sequence of operations in the PBWM model for three input states of a simple Store, Ignore, Recall task. The task is to store the S stimulus, maintain it over a sequence of I (ignore) stimuli, and then recall the S when an R is input. Four key layers in the model are represented in simple form: PFC, sensory Input, Striatum (with Go = g and NoGo = n units), and overall DA firing (as controlled by PVLV). The three phases per trial (−, +, ++ = PFC update) are shown as a sequence of states for the same layer (i.e., there is only one PFC layer that represents one thing at a time). Δw indicates key weight changes, and the font size for striatal g and n units indicates effects of DA modulation. "Syn dep" indicates synaptic depression into the DA system (LV) that prevents sustained firing to the PFC S representation. In state 1, the network had previously stored the S (through random Go firing) and is now correctly recalling it on an R trial. The unexpected reward delivered in the plus phase produces a DA burst, and the LV part of PVLV (not shown) learns to associate the state of the PFC with reward. State 2 shows the consequence of this learning, where, some trials later, an S input is active and the PFC is maintaining some other information (X). Based on existing weights, the S input triggers the striatal Go neurons to fire in the plus phase, causing PFC to update to represent the S. During this update phase, the LV system recognizes this S (in the PFC) as rewarding, causing a DA burst, which increases firing of Go units, and results in increased weights from S inputs to striatal Go units. In state 3, the Go units (by existing weights) do not fire for the subsequent ignore (I) input, so the S continues to be maintained. The maintained S in PFC does not continue to drive a DA burst due to synaptic depression, so there is no DA-driven learning. If a Go were to fire for the I input, the resulting I representation in PFC would likely trigger a small negative DA burst, discouraging such firing again. The same logic holds for negative feedback by causing nonreward associations for maintenance of useless information.
with the contributions of all the separable components of the system, are evaluated after the basic simulation results are presented.

3.4 Model Implementation Details. The implemented PBWM model, shown in Figure 9 (with four stripes), uses the Leabra framework, described
Figure 9: Implemented model as applied to the 1-2-AX task. There are four stripes in this model as indicated by the groups of units within the PFC and Striatum (and the four units in the SNc and SNrThal layers). PVe represents primary reward (r or US), which drives learning of the primary value inhibition (PVi) part of PVLV, which cancels primary reward DA bursts. The learned value (LV) part of PVLV has two opposing excitatory and inhibitory components, which also differ in learning rate (LVe = fast learning rate, excitatory on DA bursts; LVi = slow learning rate, inhibitory on DA bursts). All of these reward-value layers encode their values as coarse-coded distributed representations. VTA and SNc compute the DA values from these PVLV layers, and SNc projects this modulation to the Striatum. Go and NoGo units alternate (from bottom left to upper right) in the Striatum. The SNrThal layer computes Go-NoGo in the corresponding stripe and mediates competition using kWTA dynamics. The resulting activity drives updating of PFC maintenance currents. PFC provides context for Input/Hidden/Output mapping areas, which represent posterior cortex.
in detail in the appendix (O'Reilly, 1998, 2001; O'Reilly & Munakata, 2000). Leabra uses point neurons with excitatory, inhibitory, and leak conductances contributing to an integrated membrane potential, which is then thresholded and transformed via an x/(x + 1) sigmoidal function to produce a rate code output communicated to other units (discrete spiking can also be used, but produces noisier results). Each layer uses a k-winners-take-all (kWTA) function that computes an inhibitory conductance that keeps roughly the k most active units above firing threshold and keeps the rest below threshold. Units learn according to a combination of Hebbian and error-driven learning, with the latter computed using the generalized recirculation algorithm (GeneRec; O'Reilly, 1996), which computes backpropagation derivatives using two phases of activation settling, as mentioned earlier. The cortical layers in the model use standard Leabra parameters and functionality, while the basal ganglia systems require some additional mechanisms to implement the DA modulation of Go/NoGo units, and toggling of PFC maintenance currents from Go firing, as detailed in the appendix. In some of the models, we have simplified the PFC representations so that they directly reflect the input stimuli in a one-to-one fashion, which simply allows us to transparently interpret the contents of PFC at any given point. However, these PFC representations can also be trained with random initial weights, as explored below. The ability of the PFC to develop its own representations is a critical advance over the SRN model, for example, as explored in other related work (Rougier, Noelle, Braver, Cohen, & O'Reilly, 2005).
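The kWTA operation just described can be illustrated with a minimal sketch. The real Leabra version computes the inhibitory conductance from each unit's equilibrium membrane-potential equations (see the appendix); this toy simply places a shared inhibition level between the k-th and (k+1)-th strongest net inputs:

```python
def kwta(net_inputs, k):
    """k-winners-take-all sketch: choose a single layer-wide inhibition level
    between the k-th and (k+1)-th strongest net inputs, then pass each unit's
    above-inhibition excitation through an x/(x + 1) rate-code function."""
    ranked = sorted(net_inputs, reverse=True)
    inhib = 0.5 * (ranked[k - 1] + ranked[k]) if k < len(net_inputs) else 0.0
    acts = []
    for net in net_inputs:
        x = net - inhib
        acts.append(x / (x + 1.0) if x > 0.0 else 0.0)
    return acts

# Roughly k units survive: here the two strongest inputs (1.5 and 1.1) stay active.
acts = kwta([0.2, 1.5, 0.9, 1.1, 0.1], k=2)
```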
4 Simulation Tests

We conducted simulation comparisons between the PBWM model and a set of backpropagation-based networks on three different working memory tasks: (1) the 1-2-AX task as described earlier, (2) a two-store version of the Store-Ignore-Recall (SIR) task (O'Reilly & Munakata, 2000), where two different items need to be separately maintained, and (3) a sequence memory task modeled after the phonological loop (O'Reilly & Soto, 2002). These tasks provide a diverse basis for evaluating these models. The backpropagation-based comparison networks were:
- A simple recurrent network (SRN; Elman, 1990; Jordan, 1986) with cross-entropy output error, no momentum, an error tolerance of .1 (output err < .1 counts as 0), and a hysteresis term in updating the context layers of .5 ($c_j(t) = .5 h_j(t-1) + .5 c_j(t-1)$, where $c_j$ is the context unit for hidden unit activation $h_j$). Learning rate (lrate), hysteresis, and hidden unit size were searched for optimal values across this and the RBP networks (within plausible ranges, using round numbers, e.g., lrates of .05, .1, .2, and .5; hysteresis of 0, .1, .2, .3, .5, and .7; hidden units of 25, 36, 49, and 100). For the 1-2-AX task, optimal performance was with 100 hidden units, hysteresis of .5, and lrate of .1. For the SIR-2 task, 49 hidden units were used due to the extreme length of training required, and a lrate of .01 was required to learn at all. For the phonological loop task, 196 hidden units and a lrate of .005 performed best.
- A real-time recurrent backpropagation learning network (RBP; Robinson & Fallside, 1987; Schmidhuber, 1992; Williams & Zipser, 1992), with the same basic parameters as the SRN, a time constant for integrating activations and backpropagated errors of 1, and the gap between backpropagations and the backprop time window searched in the set of 6, 8, 10, and 16 time steps. Two time steps were required for activation to propagate from the input to the output, so the effective backpropagation time window across discrete input events in the sequence is half of the actual time window (e.g., 16 time steps = 8 events, which represents two or more outer-loop sequences). Best performance was achieved with the longest time window (16).
- A long short-term memory (LSTM) model (Hochreiter & Schmidhuber, 1997) with forget gates as specified in Gers (2000), with the same basic backpropagation parameters as the other networks, and four memory cells.
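For concreteness, the SRN context-layer update with hysteresis (the $c_j(t)$ equation in the first item above) is just an exponential blend of the previous hidden and context states; this fragment assumes the .5/.5 weighting generalizes to (1 − hysteresis)/hysteresis:

```python
def update_context(context, hidden, hysteresis=0.5):
    """SRN context update: c_j(t) = (1 - hysteresis) * h_j(t-1)
    + hysteresis * c_j(t-1); with hysteresis = .5 this is the .5/.5 mix
    quoted above, and larger values widen temporal integration."""
    return [(1.0 - hysteresis) * h + hysteresis * c
            for h, c in zip(hidden, context)]

c = [0.0, 0.0]
for h in ([1.0, 0.0], [0.0, 1.0]):   # two successive hidden-layer states
    c = update_context(c, h)
# c now blends both time steps: [0.25, 0.5]
```

The hysteresis parameter thus directly controls how far back the context reaches, which is why values of .5 or more were needed for the 1-2-AX task.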
4.1 The 1-2-AX Task. The task was trained as in Figure 1, with the length of the inner-loop sequences randomly varied from one to four (i.e., one to four pairs of A-X, B-Y, and so on, stimuli). Specifically, each sequence of stimuli was generated by first randomly picking a 1 or 2, and then looping for one to four times over the following inner-loop generation routine. Half of the time (randomly selected), a possible target sequence (if 1, then A-X; if 2, then B-Y) was generated. The other half of the time, a random sequence composed of an A, B, or C, followed by an X, Y, or Z, was randomly generated. Thus, possible targets (A-X, B-Y) represent at least 50% of trials, but actual targets (A-X in the 1 task, B-Y in the 2 task) appear only 25% of the time on average. The correct output was the L unit, except on the target sequences (1-A-X or 2-B-Y), where it was an R. The PBWM network received a reward if it produced the correct output (and received the correct output on the output layer in the plus phase of each trial), while the backpropagation networks learned from the error signal computed relative to this correct output. One epoch of training consisted of 25 outer-loop sequences, and the training criterion was 0 errors across two epochs in a row (one epoch can sometimes contain only a few targets, making a lucky 0 possible). For parameter searching results, training was stopped after 10,000 epochs for the backpropagation models if the network had failed to learn by this point and was scored as a failure to learn. For statistics, 20 different networks of each type were run.
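The sequence-generation recipe above can be sketched as follows (a hypothetical reimplementation for illustration; the function names are ours):

```python
import random

def gen_outer_sequence(rng):
    """One outer-loop 1-2-AX sequence: pick a 1 or 2 task cue, then one to
    four inner-loop pairs; half the pairs (at random) are potential targets
    (A-X after 1, B-Y after 2), the rest are random {A,B,C} x {X,Y,Z} pairs."""
    task = rng.choice(["1", "2"])
    seq = [task]
    for _ in range(rng.randint(1, 4)):
        if rng.random() < 0.5:
            seq += ["A", "X"] if task == "1" else ["B", "Y"]
        else:
            seq += [rng.choice("ABC"), rng.choice("XYZ")]
    return seq

def target_outputs(seq):
    """Correct outputs: R on the second element of a target pair
    (1-A-X or 2-B-Y), L on every other stimulus."""
    task = seq[0]
    outs = []
    for i, s in enumerate(seq):
        hit = (i >= 2 and s in "XY" and
               ((task == "1" and seq[i - 1] == "A" and s == "X") or
                (task == "2" and seq[i - 1] == "B" and s == "Y")))
        outs.append("R" if hit else "L")
    return outs
```

Note that a randomly generated A-X pair in the 1 task still counts as a target, which is why actual targets appear on roughly a quarter of pairs even though only half are generated as potential targets.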
Figure 10: Training time to reach criterion (0 errors in two successive epochs of 25 outer-loop sequences) on the 1-2-AX task for the PBWM model and three backpropagation-based comparison algorithms. LSTM = long short-term memory model. RBP = recurrent backpropagation (real-time recurrent learning). SRN = simple recurrent network.
The basic results for number of epochs required to reach the criterion training level are shown in Figure 10. These results show that the PBWM model learns the task at roughly the same speed as the comparison backpropagation networks, with the SRN taking significantly longer. However, the main point is not in comparing the quantitative rates of learning (it is possible that despite a systematic search for the best parameters, other parameters could be found to make the comparison networks perform better). Rather, these results simply demonstrate that the biologically based PBWM model is in the same league as existing powerful computational learning mechanisms. Furthermore, the exploration of parameters for the backpropagation networks demonstrates that the 1-2-AX task represents a challenging working memory task, requiring large numbers of hidden units and long temporal integration parameters for successful learning. For example, the SRN network required 100 hidden units and a .5 hysteresis parameter to learn reliably (hysteresis determines the window of temporal integration of the context units) (see Table 1). For the RBP network, the number of hidden units and the time window for backpropagation exhibited similar results (see Table 2). Specifically, time windows of fewer than eight time steps resulted in failures to learn, and the best results (in terms of average learning time) were achieved with the most hidden units and the longest backpropagation time window.
Table 1: Effects of Various Parameters on Learning Performance in the SRN.

Hidden-layer sizes for SRN (lrate = .1, hysteresis = .5):

    Hiddens          25      36      49      100
    Success rate     4%      26%     86%     100%
    Average epochs   5367    6350    5079    2994

Hysteresis for SRN (100 hiddens, lrate = .1):

    Hysteresis       .1      .2      .3      .5      .7
    Success rate     0%      0%      38%     100%    98%
    Average epochs   NA      NA      6913    2994    3044

Learning rates for SRN (100 hiddens, hysteresis = .5):

    lrate            .05     .1      .2
    Success rate     100%    100%    96%
    Average epochs   3390    2994    3308

Notes: Success rate = percentage of networks (out of 50) that learned to criterion (0 errors for two epochs in a row) within 10,000 epochs. Average epochs = average number of epochs to reach criterion for successful networks. The optimal performance is with 100 hidden units, learning rate .1, and hysteresis .5. Sufficiently large values for the hidden units and hysteresis parameters are critical for successful learning, indicating the strong working memory demands of this task.
Table 2: Effects of Various Parameters on Learning Performance in the RBP Network.

Time window for RBP (lrate = .1, 100 hiddens):

    Window           6       8       10      16
    Success rate     6%      96%     96%     96%
    Average epochs   1389    625     424     353

Hidden-layer size for RBP (lrate = .1, window = 16):

    Hiddens          25      36      49      100
    Success rate     96%     100%    96%     96%
    Average epochs   831     650     687     353

Notes: The optimal performance is with 100 hidden units, time window = 16. As with the SRN, the relatively large size of the network and long time windows required indicate the strong working memory demands of the task.
4.2 The SIR-2 Task. The PBWM and comparison backpropagation algorithms were also tested on a somewhat more abstract task (which has not been tested in humans), which represents perhaps the simplest, most direct form of working memory demands. In this store ignore recall (SIR) task (see Table 3), the network must store an arbitrary input pattern for a recall test that occurs after a variable number of intervening ignore trials (O’Reilly & Munakata, 2000). Stimuli are presented during the ignore trials and must be identified (output) by the network but do not need to be maintained. Tasks with this same basic structure were the focus of the original Hochreiter and Schmidhuber (1997) work on the LSTM algorithm, where
Table 3: Example Sequence of Trials in the SIR-2 Task, Showing What Is Input, What Should Be Maintained in Each of Two "Stores," and the Target Output.

    Trial   Input   Maint-1   Maint-2   Output
    1       I-D     –         –         D
    2       S1-A    A         –         A
    3       I-B     A         –         B
    4       S2-C    A         C         C
    5       I-A     A         C         A
    6       I-E     A         C         E
    7       R1      A         C         A
    8       I-A     –         C         A
    9       I-C     –         C         C
    10      S1-D    D         C         D
    11      I-E     D         C         E
    12      R1      D         C         D
    13      I-B     –         C         B
    14      R2      –         C         C

Notes: I = Ignore unit active, S1/S2 = Store 1/2 unit active, R1/R2 = Recall 1/2 unit active. The functional meaning of these "task control" inputs must be discovered by the network. Two versions were run. In the shared representations version, one set of five stimulus inputs was used to encode A–E, regardless of which control input was present. In the dedicated representations version, there were different stimulus representations for each of the three categories of stimulus inputs (S1, S2, and I), for a total of 15 stimulus input units. The shared representations version proved impossible for nongated networks to learn.
they demonstrated that the dynamic gating mechanism was able to gate in the to-be-stored stimulus, maintain it in the face of an essentially arbitrary number of intervening trials by having the gate turned off, and then recall the maintained stimulus. The SIR-2 version of this task adds the need to independently update and maintain two different stimulus memories, instead of just one, which should provide a better test of selective updating. We explored two versions of this task—one that had a single set of shared stimulus representations (A-E) and another with dedicated stimulus representations for each of the three different types of task control inputs (S1, S2, I). In the dedicated representations version, the stimulus inputs conveyed directly their functional role and made the control inputs somewhat redundant (e.g., the I-A stimulus unit should always be ignored, while the S1-A stimulus should always be stored in the first stimulus store). In contrast, a stimulus in the shared representation version is ambiguous; sometimes an A should be ignored, sometimes stored in S1, and other times stored in S2, depending on the concomitant control input. This difference in stimulus ambiguity made a big difference for the nongating networks, as discussed below. The networks had 20 input units (separate A–E stimuli
for each of three different types of control inputs (S1, S2, I) = 15 units, and the 5 control units: S1,S2,I,R1,R2). On each trial, a control input and corresponding stimulus were randomly selected with uniform probability, which means that S1 and S2 maintenance ended up being randomly interleaved with each other. Thus, the network was required to develop a truly independent form of updating and maintenance for these two items. As Figure 11a shows, three out of four algorithms succeeded in learning the dedicated stimulus items version of the task within roughly comparable numbers of epochs, while the SRN model had a very difficult time, taking on average 40,090 epochs. We suspect that this difficulty may reflect the limitations of the one time step of error backpropagation available for this network, making it difficult for it to span the longer delays that often occurred (Hochreiter & Schmidhuber, 1997). Interestingly, the shared stimulus representations version of the task (see Figure 11b) clearly divided the gating networks from the nongated ones (indeed, the nongated networks—RBP and SRN—were completely unable to achieve a more stringent criterion of four zero-error epochs in a row, whereas both PBWM and LSTM reliably reached this level). This may be due to the fact that there is no way to establish a fixed set of weights between an input stimulus and a working memory representation in this task version. The appropriate memory representation to maintain a given stimulus must be determined entirely by the control input. In other words, the control input must act as a gate on the fate of the stimulus input, much as the gate input on a transistor determines the processing of the other input. More generally, dynamic gating enables a form of dynamic variable binding, as illustrated in Figure 12 for this SIR-2 task. 
The two PFC stripes in this example act as variable "slots" that can hold any of the stimulus inputs; which slot a given input gets "bound" to is determined by the gating system as driven by the control input (S1 or S2). This ability to dynamically route a stimulus to different memory locations is very difficult to achieve without a dynamic gating system, as our results indicate. Nevertheless, it is essential to emphasize that despite this additional flexibility provided by the adaptive gating mechanism, the PBWM network is by no means a fully general-purpose variable binding system. The PFC representations must still learn to encode the stimulus inputs, and other parts of the network must learn to respond appropriately to these PFC representations. Therefore, unlike a traditional symbolic computer, it is not possible to store any arbitrary piece of information in a given PFC stripe. Figure 13 provides important confirmation that the PVLV learning mechanism is doing what we expect it to in this task, as represented in earlier discussion of the SIR task (e.g., see Figure 8). Specifically, we expect that the system will generate large positive DA bursts for Store events and not for Ignore events. This is because the Store signal should be positively associated with correct performance (and thus reward), while the Ignore signal should not be. This is exactly what is observed.
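For reference, the ideal input-output mapping that the SIR-2 networks must discover (see Table 3) can be written as a tiny state machine. This is a specification of the task itself, not of any of the networks:

```python
def sir2_step(state, control, stimulus=None):
    """One SIR-2 trial. state maps 'S1'/'S2' to a maintained item (or None).
    control is one of 'I', 'S1', 'S2', 'R1', 'R2'; stimulus is the A-E item
    (absent on recall trials). Returns the correct output for the trial."""
    if control == "I":
        return stimulus                  # identify the item, do not store it
    if control in ("S1", "S2"):
        state[control] = stimulus        # gate the item into the named store
        return stimulus
    store = "S1" if control == "R1" else "S2"
    out = state[store]                   # recall the maintained item...
    state[store] = None                  # ...and clear the store (see Table 3)
    return out

# Trials 1-8 of Table 3:
state = {"S1": None, "S2": None}
trials = [("I", "D"), ("S1", "A"), ("I", "B"), ("S2", "C"),
          ("I", "A"), ("I", "E"), ("R1", None), ("I", "A")]
outputs = [sir2_step(state, c, s) for c, s in trials]
# outputs == ['D', 'A', 'B', 'C', 'A', 'E', 'A', 'A'], as in the Output column
```

In the shared representations version, the `control` argument alone determines the routing of an ambiguous stimulus, which is exactly the gate-like dependence that defeats the nongated networks.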
R. O’Reilly and M. Frank
Figure 11: Training time to reach criterion (0 errors in 2 consecutive epochs of 100 trials each) on the SIR-2 task for the PBWM model and three backpropagation-based comparison algorithms, for (a) dedicated stimulus items (stimulus set = 5 items, A–E) and (b) shared stimulus items (stimulus set = 2 items, A–B). LSTM = long short-term memory model. RBP = recurrent backpropagation (real-time recurrent learning). SRN = simple recurrent network. The SRN does significantly worse in both cases (note the logarithmic scale), and with shared items, the nongated networks suffer considerably relative to the gated ones, most likely because of the variable binding functionality that a gating mechanism provides, as illustrated in Figure 12.
4.3 The Phonological Loop Sequential Recall Task. The final simulation test involves a simplified model of the phonological loop, based on earlier work (O’Reilly & Soto, 2002). The phonological loop is a working
Making Working Memory Work
Figure 12: Gating can achieve a form of dynamic variable binding, as illustrated in the SIR-2 task. The store command (S1 or S2) can drive gating signals in different stripes in the BG, causing the input stimulus item (A,B, . . .) to be stored in the associated PFC stripe. Thus, the same input item can be encoded in a different neural “variable slot” depending on other inputs. Nevertheless, these neural stripes are not fully general like traditional symbolic variables; they must learn to encode the input items, and other areas must learn to decode these representations.
memory system that can actively maintain a short chunk of phonological (verbal) information (e.g., Baddeley, 1986; Baddeley, Gathercole, & Papagno, 1998; Burgess & Hitch, 1999; Emerson & Miyake, 2003). In essence, the task of this model is to encode and replay a sequence of “phoneme” inputs, much as in the classic psychological task of short-term serial recall. Thus, it provides a simple example of sequencing, which has often been linked with basal ganglia and prefrontal cortex function (e.g., Berns & Sejnowski, 1998; Dominey et al., 1995; Nakahara et al., 2001). As we demonstrated in our earlier model (O’Reilly & Soto, 2002), an adaptively gated working memory architecture provides a particularly efficient and systematic way of encoding phonological sequences. Because phonemes are a small closed class of items, each independently updatable PFC stripe can learn to encode this basic vocabulary. The gating mechanism can then dynamically gate incoming phonemes into stripes that implicitly represent the serial order information. For example, a given stripe might always encode the fifth phoneme in a sequence, regardless of which phoneme it was. The virtue of this system is that it provides a particularly efficient basis for generalization to novel phoneme sequences: as long as each stripe can encode any of the possible phonemes and gating is based on serial position and not phoneme identity, the system will generalize perfectly to novel sequences (O’Reilly & Soto, 2002). As noted above, this is an example of variable binding, where the stripes are variable-like slots for a given position, and the gating “binds” a given input to its associated slot.
Figure 13: (a) Average simulated DA values in the PBWM model for different event types over training. Within the first 50 epochs, the model learns strong, positive DA values for both types of storage events (Store), which reinforces gating for these events. In contrast, low DA values are generated for Ignore and Recall events. (b) Average LVe values, representing the learned value (i.e., CS associations with reward value) of various event types. As the model learns to perform well, it accurately perceives the reward at Recall events. This generalizes to the Store events, but the Ignore events are not reliably associated with reward, and thus remain at low levels.
Our earlier model was developed in advance of the PBWM learning mechanisms and used a hand-coded gating mechanism to demonstrate the power of the underlying representational scheme. In contrast, we trained the present networks from random initial weights to learn this task. Each
training sequence consisted of an encoding phase, where the current sequence of phonemes was presented in order, followed by a retrieval phase where the network had to output the phonemes in the order they were encoded. Sequences were of length 3, and only 10 simulated phonemes were used, represented locally as one out of 10 units active. Sequence order information was provided to the network in the form of an explicit “time” input, which counted up from 1 to 3 during both encoding and retrieval. Also, encoding versus retrieval phase was explicitly signaled by two units in the input. An example input sequence would be: E-1-‘B,’ E-2-‘A,’ E-3-‘G,’ R-1, R-2, R-3, where E/R is the encoding/recall flag, the next digit specifies the sequence position (“time”), and the third is the phoneme (not present in the input during retrieval). There are 1000 possible sequences (10^3), and the networks were trained on a randomly selected subset of 300 of these, with another nonoverlapping sample of 300 used for generalization testing at the end of training. Both of the gated networks (PBWM and LSTM) had six stripes or memory cells instead of four, given that three items had to be maintained at a time, and the networks benefit from having extra stripes to explore different gating strategies in parallel. The PFC representations in the PBWM model were subject to learning (unlike previous simulations, where they were simply a copy of the input, for analytical simplicity) and had 42 units per stripe, as in the O’Reilly and Soto (2002) model, and there were 100 hidden units. There were 24 units per memory cell in the LSTM model (note that computation increases as a power of 2 per memory cell unit in LSTM, setting a relatively low upper limit on the number of such cells). Figure 14 shows the training and testing results. Both gated models (PBWM, LSTM) learned more rapidly than the nongated backpropagation-based networks (RBP, SRN).
Furthermore, the RBP network was unable to learn unless we presented the entire set of training sequences in a fixed order (other networks had randomly ordered presentation of training sequences). This was true regardless of the RBP window size (even when it was exactly the length of a sequence). Also, the SRN could not learn with only 100 hidden units, so 196 were used. For both the RBP and SRN networks, a lower learning rate of .005 was required to achieve stable convergence. In short, this was a difficult task for these networks to learn. Perhaps the most interesting results are the generalization test results shown in Figure 14b. As was demonstrated in the O’Reilly and Soto (2002) model, gating affords considerable advantages in the generalization to novel sequences compared to the RBP and SRN networks. It is clear that the SRN network in particular simply “memorizes” the training sequences, whereas the gated networks (PBWM, LSTM) develop a very systematic solution where each working memory stripe or slot learns to encode a different element in the sequence. This is a good example of the advantages of the variable-binding kind of behavior supported by adaptive gating, as discussed earlier.
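The sequence and trial structure described above can be sketched as follows. This is our own illustrative encoding, not the authors' code; the phoneme labels and tuple format are assumptions:

```python
import itertools
import random

# Sketch of the phonological loop trials: sequences of 3 "phonemes" drawn
# from 10, unrolled into encode (E) and recall (R) steps with an explicit
# time input, e.g. E-1-'B', E-2-'A', E-3-'G', R-1, R-2, R-3.
PHONEMES = "ABCDEFGHIJ"   # 10 simulated phonemes (labels assumed)
SEQ_LEN = 3

def make_trial(seq):
    """Unroll one phoneme sequence into (flag, time, phoneme) steps."""
    encode = [("E", t + 1, p) for t, p in enumerate(seq)]
    recall = [("R", t + 1, None) for t in range(len(seq))]  # no phoneme input
    targets = list(seq)   # required outputs during the recall steps
    return encode + recall, targets

# 10^3 = 1000 possible sequences; draw disjoint train/test sets of 300 each
all_seqs = ["".join(s) for s in itertools.product(PHONEMES, repeat=SEQ_LEN)]
rng = random.Random(0)
sample = rng.sample(all_seqs, 600)
train, test = sample[:300], sample[300:]
```

Because train and test sets are disjoint, any success on the test set reflects generalization to genuinely novel sequences, which is the comparison reported in Figure 14b.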
Figure 14: (a) Learning rates for the different algorithms on the phonological loop task, replicating previous general patterns (criterion is one epoch of 0 error). (b) Generalization performance (testing on 300 novel, untrained sequences), showing that the gating networks (PBWM and LSTM) exhibit substantially better generalization, due to their ability to dynamically gate items into active memory “slots” based on their order of presentation.
4.4 Tests of Algorithm Components. Having demonstrated that the PBWM model can successfully learn a range of different challenging working memory tasks, we now test the role of specific subcomponents of the algorithm to demonstrate their contribution to the overall performance. Table 4 shows the results of eliminating various portions of the model in
Table 4: Results of Various Tests for the Importance of Various Separable Parts of the PBWM Algorithm, Shown as a Percentage of Trained Networks That Met Criterion (Success Rate).

                                        Success Rate (%)
Manipulation                       1-2-AX   SIR-2   Loop
No Hebbian                             95     100    100
No DA contrast enhancement             80      95     90
No Random Go exploration                0      95    100
No LVi (slow LV baseline)              15      90     30
No SNrThal DA Mod, DA = 1.0            15       5      0
No SNrThal DA Mod, DA = 0.5            70      20      0
No SNrThal DA Mod, DA = 0.2            80      30     20
No SNrThal DA Mod, DA = 0.1            55      40     20
No DA modulation at all                 0       0      0
Notes: With the possible exception of Hebbian learning, all of the components clearly play an important role in overall learning, for the reasons described in the text as the algorithm was introduced. The No SNrThal DA Mod cases eliminate stripe-wise structural credit assignment; controls for overall levels of DA modulation are shown. The final No DA Modulation at all condition completely eliminates the influence of the PVLV DA system on Striatum Go/NoGo units, clearly indicating that PVLV (i.e., learning) is key.
terms of percentage of networks successfully learning to criterion. This shows that each separable component of the algorithm plays an important role, with the possible exception of Hebbian learning (which was present only in the “posterior cortical” (Hidden/Output) portion of the network). Different models appear to be differentially sensitive to these manipulations, but all are affected relative to the 100% performance of the full model. For the “No SNrThal DA Mod” manipulation, which eliminates structural credit assignment via the stripe-wise modulation of DA by the SNrThal layer, we also tried reducing the overall strength of the DA modulation of the striatum Go/NoGo units, with the idea that the SNrThal modulation also tends to reduce DA levels overall. Therefore, we wanted to make sure any impairment was not just a result of a change in overall DA levels; a significant impairment remains even with this manipulation. 5 Discussion The PBWM model presented here demonstrates powerful learning abilities on demonstrably complex and difficult working memory tasks. We have also tested it informally on a wider range of tasks, with similarly good results. This may be the first time that a biologically based mechanism for controlling working memory has been demonstrated to compare favorably with the learning abilities of more abstract and biologically implausible
backpropagation-based temporal learning mechanisms. Other existing simulations of learning in the basal ganglia tend to focus on relatively simple sequencing tasks that do not require complex working memory maintenance and updating and do not require learning of when information should and should not be stored in working memory. Nevertheless, the central ideas behind the PBWM model are consistent with a number of these existing models (Schultz et al., 1995; Houk et al., 1995; Schultz et al., 1997; Suri et al., 2001; Contreras-Vidal & Schultz, 1999; Joel et al., 2002), thereby demonstrating that an emerging consensus view of basal ganglia learning mechanisms can be applied to more complex cognitive functions. The central functional properties of the PBWM model can be summarized by comparison with the widely used SRN backpropagation network, which is arguably the simplest form of a gated working memory model. The gating aspect of the SRN becomes more obvious when the network has to settle over multiple update cycles for each input event (as in an interactive network or to measure reaction times from a feedforward network). In this case, it is clear that the context layer must be held constant and be protected from updating during these cycles of updating (settling), and then it must be rapidly updated at the end of settling (see Figure 15). Although the SRN achieves this alternating maintenance and updating by fiat, in a biological network it would almost certainly require some kind of gating mechanism. Once one recognizes the gating mechanism hidden in the SRN, it is natural to consider generalizing such a mechanism to achieve a more powerful, flexible type of gating. 
This is exactly what the PBWM model provides, by adding the following degrees of freedom to the gating signal: (1) gating is dynamic, such that information can be maintained over a variable number of trials instead of automatically gating every trial; (2) the context representations are learned, instead of simply being copies of the hidden layer, allowing them to develop in ways that reflect the unique demands of working memory representations (e.g., Rougier & O’Reilly, 2002; Rougier et al., 2005); (3) there can be multiple context layers (i.e., stripes), each with its own set of representations and gating signals. Although some researchers have used a spectrum of hysteresis variables to achieve some of this additional flexibility within the SRN, it should be clear that the PBWM model affords considerably more flexibility in the maintenance and updating of working memory information. Moreover, the similar good performance of PBWM and LSTM models across a range of complex tasks clearly demonstrates the advantages of dynamic gating systems for working memory function. Furthermore, the PBWM model is biologically plausible. Indeed, the general functions of each of its components were motivated by a large base of literature spanning multiple levels of analysis, including cellular, systems, and psychological data. As such, the PBWM model can be used to explore possible roles of the individual neural systems involved by perturbing parameters
Figure 15: The simple recurrent network (SRN) as a gating network. When processing of each input event requires multiple cycles of settling, the context layer must be held constant over these cycles (i.e., its gate is closed, panel a ). After processing an event, the gate is opened to allow updating of the context (copying of hidden activities to the context, panel b). This new context is then protected from updating during the processing of the next event (panel c). In comparison, the PBWM model allows more flexible, dynamic control of the gating signal (instead of automatic gating each time step), with multiple context layers (stripes) that can each learn their own representations (instead of being a simple copy).
to simulate development, aging, pharmacological manipulations, and neurological dysfunction. For example, we think the model can explicitly test the implications of striatal dopamine dysfunction in producing cognitive deficits in conditions such as Parkinson’s disease and ADHD (e.g., Frank et al., 2004; Frank, 2005). Further, recent extensions to the framework have yielded insights into possible divisions of labor between the basal ganglia and orbitofrontal cortex in reinforcement learning and decision making (Frank & Claus, 2005). Although the PBWM model was designed to include many central aspects of the biology of the PFC/BG system, it also goes beyond what is currently known and omits many biological details of the real system. Therefore, considerable further experimental work is necessary to test the specific predictions and neural hypotheses behind the model, and further elaboration and revision of the model will undoubtedly be necessary. Because the PBWM model represents a level of modeling intermediate between detailed biological models and powerful, abstract cognitive and computational models, it has the potential to build important bridges between these disparate levels of analysis. For example, the abstract ACT-R cognitive architecture has recently been mapped onto biological substrates including the BG and PFC (Anderson et al., 2004; Anderson & Lebiere, 1998), with the specific role ascribed to the BG sharing some central aspects of its role in the PBWM model. On the other end of the spectrum,
biologically based models have traditionally been incapable of simulating complex cognitive functions such as problem solving and abstract reasoning, which make extensive use of dynamic working memory updating and maintenance mechanisms to exhibit controlled processing over a time scale from seconds to minutes. The PBWM model should in principle allow models of these phenomena to be developed and their behavior compared with more abstract models, such as those developed in ACT-R. One of the major challenges to this model is accounting for the extreme flexibility of the human cognitive apparatus. Instead of requiring hundreds of trials of training on problems like the 1-2-AX task, people can perform this task almost immediately based on verbal task instructions. Our current model is more appropriate for understanding how agents can learn which information to hold in mind via trial and error, as would be required if monkeys were to perform the task.1 Understanding the human capacity for generativity may be the greatest challenge facing our field, so we certainly do not claim to have solved it. Nevertheless, we do think that the mechanisms of the PBWM model, and in particular its ability to exhibit limited variable-binding functionality, are critical steps along the way. It may be that over the 13 or so years it takes to fully develop a functional PFC, people have developed a systematic and flexible set of representations that support dynamic reconfiguration of input-output mappings according to maintained PFC representations. Thus, these PFC “variables” can be activated by task instructions and support novel task performance without extensive training. This and many other important problems, including questions about the biological substrates of the PBWM model, remain to be addressed in future research. 
Appendix: Implementational Details The model was implemented using the Leabra framework, which is described in detail in O’Reilly and Munakata (2000) and O’Reilly (2001), and summarized here. See Table 5 for a listing of parameter values, nearly all of which are at their default settings. These same parameters and equations have been used to simulate over 40 different models in O’Reilly and Munakata (2000) and a number of other research models. Thus, the model can be viewed as an instantiation of a systematic modeling framework using standardized mechanisms, instead of constructing new mechanisms for each model. (The model can be obtained by e-mailing the first author at
[email protected].)

1 In practice, monkeys would likely require an extensive shaping procedure to learn the relatively complex 1-2-AX hierarchical structure piece by piece. However, we argue that much of the advantage of shaping may have to do with the motivational state of the organism: it enables substantial levels of success early on, keeping the animal motivated. The model currently has no such motivational constraints and thus does not need shaping.
Table 5: Parameters for the Simulation.

Parameter       Value       Parameter       Value
E_l             0.15        g_l             0.10
E_i             0.15        g_i             1.0
E_e             1.00        g_e             1.0
V_rest          0.15        Θ               0.25
τ               .02         γ               600
k in/out        1           k hidden        7
k PFC           4           k striatum      7
k PVLV          1           ε               .01
k_hebb          .01         ε to PFC        .001*
k_hebb to PFC   .001*
Notes: See the equations in the text for explanations of parameters. All are standard default parameter values except for those with an *. The slower learning rate of PFC connections produced better results and is consistent with a variety of converging evidence, suggesting that the PFC learns more slowly than the rest of cortex (Morton & Munakata, 2002).
A.1 Pseudocode. The pseudocode for Leabra is given here, showing exactly how the pieces of the algorithm described in more detail in the subsequent sections fit together.

Outer loop: Iterate over events (trials) within an epoch. For each event:

1. Iterate over minus (−), plus (+), and update (++) phases of settling for each event.
   (a) At start of settling:
       i. For non-PFC/BG units, initialize state variables (e.g., activation, V_m).
       ii. Apply external patterns (clamp input in minus, input and output, external reward based on minus-phase outputs).
   (b) During each cycle of settling, for all nonclamped units:
       i. Compute excitatory netinput (g_e(t) or η_j; equation A.2) (equation 24 for SNr/Thal units).
       ii. For Striatum Go/NoGo units in ++ phase, compute additional excitatory and inhibitory currents based on DA inputs from SNc (equation A.20).
       iii. Compute kWTA inhibition for each layer, based on g_i^Θ (equation A.6):
            A. Sort units into two groups based on g_i^Θ: top k and remaining k + 1 to n.
            B. If basic, find kth and k + 1th highest; if average based, compute average of 1 → k and k + 1 → n.
            C. Set inhibitory conductance g_i from g_k^Θ and g_{k+1}^Θ (equation A.5).
       iv. Compute point-neuron activation combining excitatory input and inhibition (equation A.1).
   (c) After settling, for all units:
       i. Record final settling activations by phase (y_j^-, y_j^+, y_j^{++}).
       ii. At end of + and ++ phases, toggle PFC maintenance currents for stripes with SNr/Thal act > threshold (.1).
2. After these phases, update the weights (based on linear current weight values):
   (a) For all non-BG connections, compute error-driven weight changes (equation A.8) with soft weight bounding (equation A.9), Hebbian weight changes from plus-phase activations (equation A.7), and overall net weight change as weighted sum of error-driven and Hebbian (equation A.10).
   (b) For PV units, weight changes are given by the delta rule computed as the difference between the plus-phase external reward value and minus-phase expected rewards (equation A.11).
   (c) For LV units, only change weights (using equation A.13) if PV expectation > θ_pv or external reward/punishment is actually delivered.
   (d) For Striatum units, the weight change is the delta rule on DA-modulated second-plus-phase activations minus unmodulated plus-phase acts (equation A.19).
   (e) Increment the weights according to net weight change.

A.2 Point Neuron Activation Function. Leabra uses a point neuron activation function that models the electrophysiological properties of real neurons, while simplifying their geometry to a single point. The membrane potential V_m is updated as a function of ionic conductances g with reversal (driving) potentials E as follows:

    \Delta V_m(t) = \tau \sum_c g_c(t) \bar{g}_c (E_c - V_m(t)),    (A.1)

with three channels (c) corresponding to e excitatory input, l leak current, and i inhibitory input. Following electrophysiological convention, the overall conductance is decomposed into a time-varying component g_c(t) computed as a function of the dynamic state of the network, and a constant \bar{g}_c that controls the relative influence of the different conductances. The excitatory net input/conductance g_e(t) or η_j is computed as the proportion of open excitatory channels as a function of sending activations times the weight values:

    \eta_j = g_e(t) = \langle x_i w_{ij} \rangle = \frac{1}{n} \sum_i x_i w_{ij}.    (A.2)
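As a concrete reading of equations A.1 and A.2, here is a minimal Python sketch using the default conductances and reversal potentials from Table 5. This is an illustration only, not the simulator's code (the simulator operates on whole layers; here a single unit is updated):

```python
# Sketch of the point-neuron net input (equation A.2) and the Euler update
# of the membrane potential (equation A.1), single unit, Table 5 defaults.

def net_input(acts, weights):
    """eta_j = (1/n) * sum_i x_i * w_ij (equation A.2)."""
    n = len(acts)
    return sum(x * w for x, w in zip(acts, weights)) / n

def update_vm(vm, g_e, g_i, dt=0.02, g_l=0.10,
              gbar_e=1.0, gbar_i=1.0, gbar_l=1.0,
              E_e=1.00, E_i=0.15, E_l=0.15):
    """One step of equation A.1; the leak channel's time-varying part is
    constant, so g_l here plays the role of its fixed conductance."""
    dvm = (g_e * gbar_e * (E_e - vm) +
           g_i * gbar_i * (E_i - vm) +
           g_l * gbar_l * (E_l - vm))
    return vm + dt * dvm
```

With only the leak channel active, the potential stays at rest (E_l = V_rest = 0.15); with excitatory input it relaxes toward a weighted balance of E_e and E_l.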
The inhibitory conductance is computed via the kWTA function described in the next section, and leak is a constant. Activation communicated to other cells (y_j) is a thresholded (Θ) sigmoidal function of the membrane potential with gain parameter γ:

    y_j(t) = \frac{1}{1 + \frac{1}{\gamma [V_m(t) - \Theta]_+}},    (A.3)
where [x]_+ is a threshold function that returns 0 if x < 0 and x if x > 0. Note that if it returns 0, we assume y_j(t) = 0, to avoid dividing by 0. To produce a less discontinuous deterministic function with a softer threshold, the function is convolved with a gaussian noise kernel (µ = 0, σ = .005), which reflects the intrinsic processing noise of biological neurons,

    y^*_j(x) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\sigma} e^{-z^2/(2\sigma^2)}\, y_j(z - x)\, dz,    (A.4)
where x represents the [V_m(t) − Θ]_+ value, and y^*_j(x) is the noise-convolved activation for that value. In the simulation, this function is implemented using a numerical lookup table.

A.3 k-Winners-Take-All Inhibition. Leabra uses a kWTA (k-Winners-Take-All) function to achieve inhibitory competition among units within a layer (area). The kWTA function computes a uniform level of inhibitory current g_i for all units in the layer, such that the k + 1th most excited unit within a layer is generally below its firing threshold, while the kth is typically above threshold,

    g_i = g^\Theta_{k+1} + q (g^\Theta_k - g^\Theta_{k+1}),    (A.5)
where 0 < q < 1 (.25 default used here) is a parameter for setting the inhibition between the upper bound of g^Θ_k and the lower bound of g^Θ_{k+1}. These boundary inhibition values are computed as a function of the level of inhibition necessary to keep a unit right at threshold,

    g^\Theta_i = \frac{g^*_e \bar{g}_e (E_e - \Theta) + g_l \bar{g}_l (E_l - \Theta)}{\Theta - E_i},    (A.6)

where g^*_e is the excitatory net input without the bias weight contribution. This allows the bias weights to override the kWTA constraint. In the basic version of the kWTA function, which is relatively rigid about the kWTA constraint and is therefore used for output layers, g^Θ_k and g^Θ_{k+1} are set to the threshold inhibition value for the kth and k + 1th most excited units, respectively. In the average-based kWTA version, g^Θ_k is the average
g^Θ_i value for the top k most excited units, and g^Θ_{k+1} is the average of g^Θ_i for the remaining n − k units. This version allows more flexibility in the actual number of units active depending on the nature of the activation distribution in the layer.
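The average-based version of equations A.5 and A.6 can be sketched as follows. This is a simplified illustration, not the simulator's code; bias-weight handling is omitted and Table 5 defaults are assumed:

```python
# Sketch of average-based kWTA: compute each unit's threshold inhibition
# g_i^Theta (equation A.6), then place the uniform layer inhibition between
# the average of the top k and the average of the rest (equation A.5).

def g_i_theta(g_e_star, theta=0.25, g_l=0.10, gbar_e=1.0, gbar_l=1.0,
              E_e=1.00, E_i=0.15, E_l=0.15):
    """Inhibition that would hold a unit exactly at threshold (equation A.6)."""
    return (g_e_star * gbar_e * (E_e - theta) +
            g_l * gbar_l * (E_l - theta)) / (theta - E_i)

def kwta_avg(net_inputs, k, q=0.25):
    """Uniform layer inhibition g_i for the average-based kWTA version."""
    gs = sorted((g_i_theta(g) for g in net_inputs), reverse=True)
    top, rest = gs[:k], gs[k:]
    g_k = sum(top) / len(top)      # average over the top k units
    g_k1 = sum(rest) / len(rest)   # average over the remaining n - k units
    return g_k1 + q * (g_k - g_k1)
```

Because the result sits q of the way up from the average of the losers toward the average of the winners, roughly k units end up above threshold, with some flexibility depending on the activation distribution.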
A.4 Hebbian and Error-Driven Learning. For learning, Leabra uses a combination of error-driven and Hebbian learning. The error-driven component is the symmetric midpoint version of the GeneRec algorithm (O’Reilly, 1996), which is functionally equivalent to the deterministic Boltzmann machine and contrastive Hebbian learning (CHL). The network settles in two phases—an expectation (minus) phase, where the network’s actual output is produced, and an outcome (plus) phase, where the target output is experienced—and then computes a simple difference of a pre- and postsynaptic activation product across these two phases. For Hebbian learning, Leabra uses essentially the same learning rule used in competitive learning or mixtures-of-gaussians, which can be seen as a variant of the Oja normalization (Oja, 1982). The error-driven and Hebbian learning components are combined additively at each connection to produce a net weight change.

The equation for the Hebbian weight change is

    \Delta_{hebb} w_{ij} = x^+_i y^+_j - y^+_j w_{ij} = y^+_j (x^+_i - w_{ij}),    (A.7)

and for error-driven learning using CHL,

    \Delta_{err} w_{ij} = (x^+_i y^+_j) - (x^-_i y^-_j),    (A.8)

which is subject to a soft weight bounding to keep within the 0–1 range:

    \Delta_{sberr} w_{ij} = [\Delta_{err}]_+ (1 - w_{ij}) + [\Delta_{err}]_- w_{ij}.    (A.9)

The two terms are then combined additively with a normalized mixing constant k_{hebb}:

    \Delta w_{ij} = \epsilon [k_{hebb} (\Delta_{hebb}) + (1 - k_{hebb}) (\Delta_{sberr})].    (A.10)
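The combination of equations A.7 to A.10 for a single connection can be sketched as follows (an illustrative simplification; the simulator applies this over whole weight matrices, and the learning rate ε and mixing constant k_hebb take the Table 5 defaults):

```python
# Sketch of the net Leabra weight change (equations A.7-A.10) for one
# connection, given minus/plus-phase sending (x) and receiving (y) activations.

def leabra_dwt(x_minus, x_plus, y_minus, y_plus, w, lrate=0.01, k_hebb=0.01):
    """Mix the CHL error-driven term with the Hebbian term additively."""
    d_hebb = y_plus * (x_plus - w)                 # equation A.7
    d_err = x_plus * y_plus - x_minus * y_minus    # equation A.8 (CHL)
    # soft weight bounding (equation A.9): positive changes scaled by (1 - w),
    # negative changes scaled by w, keeping w within [0, 1]
    d_sberr = d_err * (1.0 - w) if d_err > 0 else d_err * w
    return lrate * (k_hebb * d_hebb + (1.0 - k_hebb) * d_sberr)  # equation A.10
```

Note how the soft bounding makes changes self-limiting: a weight near 1 barely increases further, and a weight near 0 barely decreases.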
A.5 PVLV Equations. See O’Reilly et al. (2005) for further details on the PVLV system. We assume that time is discretized into steps that correspond to environmental events (e.g., the presentation of a CS or US). All of the following equations operate on variables that are a function of the current time step t. We omit the t in the notation because it would be redundant. PVLV is composed of two systems, PV (primary value) and LV (learned value), each of which in turn is composed of two subsystems (excitatory and inhibitory). Thus, there are four main value representation layers in
PVLV (PVe, PVi, LVe, LVi), which then drive the dopamine (DA) layers (VTA/SNc). A.5.1 Value Representations. The PVLV value layers use standard Leabra activation and kWTA dynamics as described above, with the following modifications. They have a three-unit distributed representation of the scalar values they encode, where the units have preferred values of (0, .5, 1). The overall value represented by the layer is the weighted average of the unit’s activation times its preferred value, and this decoded average is displayed visually in the first unit in the layer. The activation function of these units is a “noisy” linear function (i.e., without the x/(x + 1) nonlinearity, to produce a linear value representation, but still convolved with gaussian noise to soften the threshold, as for the standard units, equation A.4), with gain γ = 220, noise variance σ = .01, and a lower threshold Θ = .17. The k for kWTA (average based) is 1, and the q value is .9 (instead of the default of .6). These values were obtained by optimizing the match for value represented with varying frequencies of 0-1 reinforcement (e.g., the value should be close to .4 when the layer is trained with 40% 1 values and 60% 0 values). Note that having different units for different values, instead of the typical use of a single unit with linear activations, allows much more complex mappings to be learned. For example, units representing high values can have completely different patterns of weights than those encoding low values, whereas a single unit is constrained by virtue of having one set of weights to have a monotonic mapping onto scalar values. A.5.2 Learning Rules. The PVe layer does not learn and is always just clamped to reflect any received reward value (r). By default, we use a value of 0 to reflect negative feedback, .5 for no feedback, and 1 for positive feedback (the scale is arbitrary).
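The decoding of the three-unit value code described above can be sketched as follows (an illustration of the scheme, not the simulator's code):

```python
# Sketch of decoding a scalar value from the three-unit distributed
# representation with preferred values (0, .5, 1): the decoded value is the
# activation-weighted average of the preferred values.

PREFERRED = (0.0, 0.5, 1.0)

def decode_value(acts):
    """Activation-weighted average of the unit preferred values."""
    total = sum(acts)
    return sum(a * v for a, v in zip(acts, PREFERRED)) / total
```

For example, activating only the third unit decodes to 1.0, while equal activation of the first two units decodes to .25; the same scalar can thus be represented by different activity patterns with different learned weights.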
The PVi layer units (y_j) are trained at every point in time to produce an expectation for the amount of reward that will be received at that time. In the minus phase of a given trial, the units settle to a distributed value representation based on sensory inputs. This results in unit activations y^-_j and an overall weighted average value across these units denoted PV_i. In the plus phase, the unit activations (y^+_j) are clamped to represent the actual reward r (a.k.a. PV_e). The weights (w_ij) into each PVi unit from sending units with plus-phase activations x^+_i are updated using the delta rule between the two phases of PVi unit activation states:

    \Delta w_{ij} = \epsilon (y^+_j - y^-_j) x^+_i.    (A.11)
This is equivalent to saying that the US/reward drives a pattern of activation over the PVi units, which then learn to activate this pattern based on sensory inputs.
The LVe and LVi layers learn in much the same way as the PVi layer (see equation A.11), except that the PV system filters the training of the LV values, such that they learn only from actual reward outcomes (or when reward is expected by the PV system but is not delivered), and not when no rewards are present or expected. This condition is

    PV_{filter} = PV_i < \theta_{min} \vee PV_e < \theta_{min} \vee PV_i > \theta_{max} \vee PV_e > \theta_{max},    (A.12)

    \Delta w_i = \begin{cases} \epsilon (y^+_j - y^-_j) x^+_i & \text{if } PV_{filter} \\ 0 & \text{otherwise,} \end{cases}    (A.13)
where θ_min is a lower threshold (.2 by default), below which negative feedback is indicated, and θ_max is an upper threshold (.8), above which positive feedback is indicated (otherwise, no feedback is indicated). Biologically, this filtering requires that the LV systems be driven directly by primary rewards (which is reasonable and required by the basic learning rule anyway) and that they learn from DA dips driven by high PVi expectations of reward that are not met. The only difference between the LVe and LVi systems is the learning rate ε, which is .05 for LVe and .001 for LVi. Thus, the inhibitory LVi system serves as a slowly integrating inhibitory cancellation mechanism for the rapidly adapting excitatory LVe system.

The four PV and LV distributed value representations drive the dopamine layer (VTA/SNc) activations in terms of the difference between the excitatory and inhibitory terms for each. Thus, there is a PV delta and an LV delta:

    \delta_{pv} = PV_e - PV_i    (A.14)

    \delta_{lv} = LV_e - LV_i.    (A.15)
With the differences in learning rate between LVe (fast) and LVi (slow), the LV delta signal reflects recent deviations from expectations and not the raw expectations themselves, just as the PV delta reflects deviations from expectations about primary reward values. This is essential for learning to converge and stabilize when the network has mastered the task (as the results presented in the article show). We also impose a minimum value on the LVi term of .1, so that there is always some expectation. This ensures that low LVe learned values result in negative deltas. These two delta signals need to be combined to provide an overall DA delta value, as reflected in the firing of the VTA and SNc units. One sensible way of doing so is to have the PV system dominate at the time of primary
rewards, while the LV system dominates otherwise, using the same PV-based filtering as holds in the LV learning rule (see equation A.13):

    \delta = \begin{cases} \delta_{pv} & \text{if } PV_{filter} \\ \delta_{lv} & \text{otherwise.} \end{cases}    (A.16)
It turns out that a slight variation of this, in which the LV always contributes, works slightly better and is what is used in this article:

    \delta = \delta_{lv} + \begin{cases} \delta_{pv} & \text{if } PV_{filter} \\ 0 & \text{otherwise.} \end{cases}    (A.17)
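Putting equations A.12 through A.17 together, the DA delta computation can be sketched as follows (plain Python; the function names are illustrative, while the thresholds and the minimum LVi value are the defaults quoted in the text):

```python
THETA_MIN, THETA_MAX = 0.2, 0.8  # feedback thresholds from the text

def pv_filter(pv_i, pv_e):
    # Equation A.12: true when reward is present or expected
    return (pv_i < THETA_MIN or pv_e < THETA_MIN
            or pv_i > THETA_MAX or pv_e > THETA_MAX)

def dopamine_delta(pv_e, pv_i, lv_e, lv_i):
    # Overall DA delta per equations A.14, A.15, and A.17
    lv_i = max(lv_i, 0.1)           # minimum LVi expectation from the text
    d_pv = pv_e - pv_i              # equation A.14
    d_lv = lv_e - lv_i              # equation A.15
    # Equation A.17: LV always contributes; PV only when the filter is on
    return d_lv + (d_pv if pv_filter(pv_i, pv_e) else 0.0)
```

For example, with a delivered reward PVe = 1.0 and expectation PVi = .5, the filter is on and both deltas contribute; with no reward present or expected, only the LV delta remains.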
A.5.3 Synaptic Depression of LV Weights. The weights into the LV units are subject to synaptic depression, which makes them sensitive to changes in stimulus inputs, and not to static, persistent activations (Abbott, Varela, Sen, & Nelson, 1997). Each incoming weight has an effective weight value w_i^* that is subject to depression and recovery changes as follows:

    \Delta w_i^* = R (w_i - w_i^*) - D x_i w_i,    (A.18)
where R is the recovery parameter, D is the depression parameter, and w_i is the asymptotic weight value. For simplicity, we compute these changes at the end of every trial instead of in an online manner, using R = 1 and D = 1, which produces discrete one-trial depression and recovery.

A.6 Special Basal Ganglia Mechanisms

A.6.1 Striatal Learning Function. Each stripe (group of units) in the Striatum layer is divided into Go versus NoGo units in an alternating fashion. The DA input from the SNc modulates these unit activations in the update phase by providing extra excitatory current to Go units and extra inhibitory current to NoGo units in proportion to the positive magnitude of the DA signal, and vice versa for negative DA magnitude. This reflects the opposing influences of DA on these neurons (Frank, 2005; Gerfen, 2000). This update-phase DA signal reflects the PVLV system's evaluation of the PFC updates produced by gating signals in the plus phase (see Figure 8). Learning on weights into the Go/NoGo units is based on the activation delta between the update (++) and plus phases:

    \Delta w_i = \epsilon x_i (y^{++} - y^+).    (A.19)
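The trial-level updates of equations A.18 and A.19 can be sketched as follows (the function names and the explicit learning-rate argument are illustrative assumptions):

```python
def depress_and_recover(w, w_eff, x, R=1.0, D=1.0):
    """One-trial effective-weight change (equation A.18, sketch).

    With R = D = 1 (as in the article), an active input (x = 1) fully
    depresses the effective weight w_eff in one trial, and an inactive
    input lets it fully recover toward the asymptotic weight w.
    """
    return w_eff + R * (w - w_eff) - D * x * w

def striatal_dw(x, y_update, y_plus, lrate=1.0):
    # Go/NoGo weight change from the update-vs-plus phase delta (A.19)
    return lrate * x * (y_update - y_plus)
```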
To reflect the finding that DA modulation has a contrast-enhancing function in the striatum (Frank, 2005; Nicola, Surmeier, & Malenka, 2000;
Hernandez-Lopez, Bargas, Surmeier, Reyes, & Galarraga, 1997) and to produce more of a credit assignment effect in learning, the DA modulation is partially a function of the previous plus-phase activation state:

    g_e = \gamma [da]^+ y^+ + (1 - \gamma)[da]^+,    (A.20)

where 0 < γ < 1 controls the degree of contrast enhancement (.5 is used in all simulations), [da]^+ is the positive magnitude of the DA signal (0 if negative), y^+ is the plus-phase unit activation, and g_e is the extra excitatory current produced by the DA (for Go units). A similar equation is used for extra inhibition (g_i) from negative DA ([da]^-) for Go units, and vice versa for NoGo units.

A.6.2 SNrThal Units. The SNrThal units provide a simplified version of the SNr/GPe/Thalamus layers. They receive a net input that reflects the normalized Go–NoGo activations in the corresponding Striatum stripe:
    \eta_j = \left[ \frac{Go - NoGo}{Go + NoGo} \right]^+    (A.21)
(where [\,]^+ indicates that only the positive part is taken; when there is more NoGo than Go, the net input is 0). This net input then drives standard Leabra point-neuron activation dynamics, with kWTA inhibitory competition dynamics that cause stripes to compete to update the PFC. This dynamic is consistent with the notion that competition and selection take place primarily in the smaller GP/SNr areas, and not much in the much larger striatum (e.g., Mink, 1996; Jaeger, Kita, & Wilson, 1994). The resulting SNrThal activation then provides the gating update signal to the PFC: if the corresponding SNrThal unit is active (above a minimum threshold; .1), then active maintenance currents in the PFC are toggled. This SNrThal activation also multiplies the per stripe DA signal from the SNc:

    \delta_j = snr_j \, \delta,    (A.22)
where snr_j is the SNrThal unit's activation for stripe j, and δ is the global DA signal, equation A.16.

A.6.3 Random Go Firing. The PBWM system learns only after Go firing, so if it never fires Go, it can never learn to improve performance. One simple solution is to induce Go firing if a Go has not fired after some threshold number of trials. However, this threshold would have to be either task specific or set very high, because it would effectively limit the maximum maintenance duration of the PFC (because by updating
PFC, the Go firing results in loss of currently maintained information). Therefore, we have adopted a somewhat more sophisticated mechanism that keeps track of the average DA value present when each stripe fires a Go:

    \overline{da}_k = \overline{da}_k + \epsilon (da_k - \overline{da}_k).    (A.23)
If this value is < 0 and a stripe has not fired Go within 10 trials, a random Go firing is triggered with some probability (.1). We also compare the relative per stripe DA averages: if the per stripe DA average is low but above zero, and one stripe's \overline{da}_k is .05 below the average of that of the other stripes,

    \text{if } (\overline{da}_k < .1) \text{ and } (\overline{da}_k - \overline{da} < -.05): \text{Go},    (A.24)
a random Go is triggered, again with some probability (.1). Finally, we also fire random Go in all stripes with some very low baseline probability (.0001) to encourage exploration. When a random Go fires, we set the SNrThal unit activation to be above Go threshold, and we apply a positive DA signal to the corresponding striatal stripe, so that it has an opportunity to learn to fire for this input pattern on its own in the future.

A.6.4 PFC Maintenance. PFC active maintenance is supported in part by excitatory ionic conductances that are toggled by Go firing from the SNrThal layers. This is implemented with an extra excitatory ion channel in the basic Vm update equation, A.1. This channel has a conductance value of .5 when active. (See Frank et al., 2001, for further discussion of this kind of maintenance mechanism, which has been proposed by several researchers—e.g., Lewis & O'Donnell, 2000; Fellous et al., 1998; Wang, 1999; Dilmore, Gutkin, & Ermentrout, 1999; Gorelova & Yang, 2000; Durstewitz, Seamans, & Sejnowski, 2000b.) The first opportunity to toggle PFC maintenance occurs at the end of the first plus phase and then again at the end of the second plus phase (third phase of settling). Thus, a complete update can be triggered by two Go's in a row, and it is almost always the case that if a Go fires the first time, it will fire the next, because Striatum firing is primarily driven by sensory inputs, which remain constant.
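The random-Go logic of equations A.23 and A.24 can be sketched as follows (the function names and the running-average rate are illustrative assumptions; the probabilities and thresholds are the defaults quoted in the text):

```python
import random

def update_da_avg(da_avg, da, rate=0.1):
    # Running average of DA at Go firing (equation A.23); rate is assumed
    return da_avg + rate * (da - da_avg)

def maybe_random_go(da_avg, trials_since_go, other_da_avg,
                    p_stuck=0.1, p_compare=0.1, p_base=0.0001):
    """Random-Go conditions described in section A.6.3 (sketch)."""
    # Negative average DA and no Go within 10 trials
    if da_avg < 0 and trials_since_go >= 10 and random.random() < p_stuck:
        return True
    # Low-but-positive average DA, .05 below the other stripes' average (A.24)
    if 0 < da_avg < 0.1 and da_avg - other_da_avg < -0.05 \
            and random.random() < p_compare:
        return True
    # Baseline exploration probability
    return random.random() < p_base
```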
Acknowledgments

This work was supported by ONR grants N00014-00-1-0246 and N00014-03-1-0428 and NIH grants MH64445 and MH069597. Thanks to Todd Braver, Jon Cohen, Peter Dayan, David Jilk, David Noelle, Nicolas Rougier, Tom
Hazy, Daniel Cer, and members of the CCN Lab for feedback and discussion on this work.
References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220. Alexander, G. E., DeLong, M. R., & Strick, P. L. (1986). Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annual Review of Neuroscience, 9, 357–381. Amos, A. (2000). A computational model of information processing in the frontal cortex and basal ganglia. Journal of Cognitive Neuroscience, 12, 505–519. Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., & Qin, Y. (2004). An integrated theory of the mind. Psychological Review, 111(4), 1036–1060. Anderson, J. R., & Lebiere, C. (1998). The atomic components of thought. Mahwah, NJ: Erlbaum. Baddeley, A. D. (1986). Working memory. New York: Oxford University Press. Baddeley, A., Gathercole, S., & Papagno, C. (1998). The phonological loop as a language learning device. Psychological Review, 105, 158. Beiser, D. G., & Houk, J. C. (1998). Model of cortical-basal ganglionic processing: Encoding the serial order of sensory events. Journal of Neurophysiology, 79, 3168–3188. Berns, G. S., & Sejnowski, T. J. (1995). How the basal ganglia make decisions. In A. Damasio, H. Damasio, & Y. Christen (Eds.), Neurobiology of decision-making (pp. 101–113). Berlin: Springer-Verlag. Berns, G. S., & Sejnowski, T. J. (1998). A computational model of how the basal ganglia produces sequences. Journal of Cognitive Neuroscience, 10, 108–121. Braver, T. S., & Cohen, J. D. (2000). On the control of control: The role of dopamine in regulating prefrontal function and working memory. In S. Monsell & J. Driver (Eds.), Control of cognitive processes: Attention and performance XVIII (pp. 713–737). Cambridge, MA: MIT Press. Burgess, N., & Hitch, G. J. (1999). Memory for serial order: A network model of the phonological loop and its timing. Psychological Review, 106, 551–581. Cardinal, R. N., Parkinson, J. A., Hall, J., & Everitt, B. J. (2002).
Emotion and motivation: The role of the amygdala, ventral striatum, and prefrontal cortex. Neuroscience and Biobehavioral Reviews, 26, 321–352. Cohen, J. D., Braver, T. S., & O’Reilly, R. C. (1996). A computational approach to prefrontal cortex, cognitive control, and schizophrenia: Recent developments and current challenges. Philosophical Transactions of the Royal Society (London) B, 351, 1515–1527. Contreras-Vidal, J. L., & Schultz, W. (1999). A predictive reinforcement model of dopamine neurons for learning approach behavior. Journal of Comparative Neuroscience, 6, 191–214.
Dilmore, J. G., Gutkin, B. G., & Ermentrout, G. B. (1999). Effects of dopaminergic modulation of persistent sodium currents on the excitability of prefrontal cortical neurons: A computational study. Neurocomputing, 26, 104–116. Dominey, P., Arbib, M., & Joseph, J.-P. (1995). A model of corticostriatal plasticity for learning oculomotor associations and sequences. Journal of Cognitive Neuroscience, 7, 311–336. Durstewitz, D., Kelc, M., & Gunturkun, O. (1999). A neurocomputational theory of the dopaminergic modulation of working memory functions. Journal of Neuroscience, 19, 2807. Durstewitz, D., Seamans, J. K., & Sejnowski, T. J. (2000a). Dopamine-mediated stabilization of delay-period activity in a network model of prefrontal cortex. Journal of Neurophysiology, 83, 1733–1750. Durstewitz, D., Seamans, J. K., & Sejnowski, T. J. (2000b). Neurocomputational models of working memory. Nature Neuroscience, 3 (Suppl.), 1184–1191. Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211. Emerson, M. J., & Miyake, A. (2003). The role of inner speech in task switching: A dual-task investigation. Journal of Memory and Language, 48, 148–168. Fellous, J. M., Wang, X. J., & Lisman, J. E. (1998). A role for NMDA-receptor channels in working memory. Nature Neuroscience, 1, 273–275. Frank, M. J. (2005). Dynamic dopamine modulation in the basal ganglia: A neurocomputational account of cognitive deficits in medicated and non-medicated Parkinsonism. Journal of Cognitive Neuroscience, 17, 51–72. Frank, M. J., & Claus, E. D. (2005). Anatomy of a decision: Striato-orbitofrontal interactions in reinforcement learning, decision making and reversal. Manuscript submitted for publication. Frank, M. J., Loughry, B., & O’Reilly, R. C. (2001). Interactions between the frontal cortex and basal ganglia in working memory: A computational model. Cognitive, Affective, and Behavioral Neuroscience, 1, 137–160. Frank, M. J., Seeberger, L., & O’Reilly, R. C. (2004). 
By carrot or by stick: Cognitive reinforcement learning in Parkinsonism. Science, 306, 1940–1943. Gerfen, C. R. (2000). Molecular effects of dopamine on striatal projection pathways. Trends in Neurosciences, 23, S64–S70. Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12, 2451–2471. Gorelova, N. A., & Yang, C. R. (2000). Dopamine D1/D5 receptor activation modulates a persistent sodium current in rat’s prefrontal cortical neurons in vitro. Journal of Neurophysiology, 84, 75. Graybiel, A. M., & Kimura, M. (1995). Adaptive neural networks in the basal ganglia. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 103–116). Cambridge, MA: MIT Press. Haber, S. N., Fudge, J. L., & McFarland, N. R. (2000). Striatonigrostriatal pathways in primates form an ascending spiral from the shell to the dorsolateral striatum. Journal of Neuroscience, 20, 2369–2382. Hernandez-Lopez, S., Bargas, J., Surmeier, D. J., Reyes, A., & Galarraga, E. (1997). D1 receptor activation enhances evoked discharge in neostriatal medium spiny neurons by modulating an L-type Ca2+ conductance. Journal of Neuroscience, 17, 3334–3342.
Hernandez-Lopez, S., Tkatch, T., Perez-Garci, E., Galarraga, E., Bargas, J., Hamm, H., & Surmeier, D. J. (2000). D2 dopamine receptors in striatal medium spiny neurons reduce L-type Ca2+ currents and excitability via a novel PLCβ1-IP3-calcineurin signaling cascade. Journal of Neuroscience, 20, 8987–8995. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780. Houk, J. C., Adams, J. L., & Barto, A. G. (1995). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 233–248). Cambridge, MA: MIT Press. Houk, J. C., & Wise, S. P. (1995). Distributed modular architectures linking basal ganglia, cerebellum, and cerebral cortex: Their role in planning and controlling action. Cerebral Cortex, 5, 95–110. Jackson, S., & Houghton, G. (1995). Sensorimotor selection and the basal ganglia: A neural network model. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 337–368). Cambridge, MA: MIT Press. Jaeger, D., Kita, H., & Wilson, C. J. (1994). Surround inhibition among projection neurons is weak or nonexistent in the rat neostriatum. Journal of Neurophysiology, 72, 2555–2558. Joel, D., Niv, Y., & Ruppin, E. (2002). Actor-critic models of the basal ganglia: New anatomical and computational perspectives. Neural Networks, 15, 535–547. Joel, D., & Weiner, I. (2000). The connections of the dopaminergic system with the striatum in rats and primates: An analysis with respect to the functional and compartmental organization of the striatum. Neuroscience, 96, 451. Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the 8th Conference of the Cognitive Science Society (pp. 531–546). Hillsdale, NJ: Erlbaum. Kropotov, J. D., & Etlinger, S. C. (1999).
Selection of actions in the basal ganglia-thalamocortical circuits: Review and model. International Journal of Psychophysiology, 31, 197–217. Levitt, J. B., Lewis, D. A., Yoshioka, T., & Lund, J. S. (1993). Topography of pyramidal neuron intrinsic connections in macaque monkey prefrontal cortex (areas 9 and 46). Journal of Comparative Neurology, 338, 360–376. Lewis, B. L., & O'Donnell, P. (2000). Ventral tegmental area afferents to the prefrontal cortex maintain membrane potential "up" states in pyramidal neurons via D1 dopamine receptors. Cerebral Cortex, 10, 1168–1175. Middleton, F. A., & Strick, P. L. (2000). Basal ganglia and cerebellar loops: Motor and cognitive circuits. Brain Research Reviews, 31, 236–250. Mink, J. W. (1996). The basal ganglia: Focused selection and inhibition of competing motor programs. Progress in Neurobiology, 50, 381–425. Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16, 1936–1947. Morton, J. B., & Munakata, Y. (2002). Active versus latent representations: A neural network model of perseveration and dissociation in early childhood. Developmental Psychobiology, 40, 255–265.
Nakahara, H., Doya, K., & Hikosaka, O. (2001). Parallel cortico-basal ganglia mechanisms for acquisition and execution of visuomotor sequences—a computational approach. Journal of Cognitive Neuroscience, 13, 626–647. Nicola, S. M., Surmeier, J., & Malenka, R. C. (2000). Dopaminergic modulation of neuronal excitability in the striatum and nucleus accumbens. Annual Review of Neuroscience, 23, 185–215. Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267–273. O'Reilly, R. C. (1996). Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation, 8(5), 895–938. O'Reilly, R. C. (1998). Six principles for biologically-based computational models of cortical cognition. Trends in Cognitive Sciences, 2(11), 455–462. O'Reilly, R. C. (2001). Generalization in interactive networks: The benefits of inhibitory competition and Hebbian learning. Neural Computation, 13, 1199–1242. O'Reilly, R. C., Braver, T. S., & Cohen, J. D. (1999). A biologically based computational model of working memory. In A. Miyake & P. Shah (Eds.), Models of working memory: Mechanisms of active maintenance and executive control (pp. 375–411). Cambridge: Cambridge University Press. O'Reilly, R. C., & Frank, M. J. (2003). Making working memory work: A computational model of learning in the frontal cortex and basal ganglia (ICS Tech. Rep. 03-03, revised 8/04). University of Colorado Institute of Cognitive Science. O'Reilly, R. C., Frank, M. J., Hazy, T. E., & Watz, B. (2005). Rewards are timeless: The primary value and learned value (PVLV) Pavlovian learning algorithm. Manuscript submitted for publication. O'Reilly, R. C., & Munakata, Y. (2000). Computational explorations in cognitive neuroscience: Understanding the mind by simulating the brain. Cambridge, MA: MIT Press. O'Reilly, R. C., Noelle, D., Braver, T. S., & Cohen, J. D. (2002).
Prefrontal cortex and dynamic categorization tasks: Representational organization and neuromodulatory control. Cerebral Cortex, 12, 246–257. O’Reilly, R. C., & Soto, R. (2002). A model of the phonological loop: Generalization and binding. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press. Pucak, M. L., Levitt, J. B., Lund, J. S., & Lewis, D. A. (1996). Patterns of intrinsic and associational circuitry in monkey prefrontal cortex. Journal of Comparative Neurology, 376, 614–630. Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variation in the effectiveness of reinforcement and non-reinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II: Theory and research (pp. 64–99). New York: Appleton-Century-Crofts. Robinson, A. J., & Fallside, F. (1987). The utility driven dynamic error propagation network (Tech. Rep. CUED/F-INFENG/TR.1). Cambridge: Cambridge University Engineering Department. Rougier, N. P., Noelle, D., Braver, T. S., Cohen, J. D., & O’Reilly, R. C. (2005). Prefrontal cortex and the flexibility of cognitive control: Rules without symbols. Proceedings of the National Academy of Sciences, 102(20), 7338–7343.
Rougier, N. P., & O'Reilly, R. C. (2002). Learning representations in a gated prefrontal cortex model of dynamic task switching. Cognitive Science, 26, 503–520. Schmidhuber, J. (1992). Learning unambiguous reduced sequence descriptions. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 291–298). San Mateo, CA: Morgan Kaufmann. Schultz, W. (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80, 1. Schultz, W., Apicella, P., & Ljungberg, T. (1993). Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. Journal of Neuroscience, 13, 900–913. Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593. Schultz, W., Romo, R., Ljungberg, T., Mirenowicz, J., Hollerman, J. R., & Dickinson, A. (1995). Reward-related signals carried by dopamine neurons. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 233–248). Cambridge, MA: MIT Press. Schultz, W., Apicella, P., Scarnati, D., & Ljungberg, T. (1992). Neuronal activity in monkey ventral striatum related to the expectation of reward. Journal of Neuroscience, 12, 4595–4610. Suri, R. E., Bargas, J., & Arbib, M. A. (2001). Modeling functions of striatal dopamine modulation in learning and planning. Neuroscience, 103, 65–85. Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9–44. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press. Wang, X.-J. (1999). Synaptic basis of cortical persistent activity: The importance of NMDA receptors to working memory. Journal of Neuroscience, 19, 9587. Wickens, J. (1993). A theory of the striatum. Oxford: Pergamon Press. Wickens, J. R., Kotter, R., & Alexander, M. E. (1995).
Effects of local connectivity on striatal function: Simulation and analysis of a model. Synapse, 20, 281– 298. Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. In Institute of Radio Engineers, Western Electronic Show and Convention, Convention Record, Part 4 (pp. 96–104). New York: Institute of Radio Engineers. Williams, R. J., & Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Backpropagation: Theory, architectures and applications. Hillsdale, NJ: Erlbaum.
Received August 9, 2004; accepted June 14, 2005.
LETTER
Communicated by Mikhail Lebedev
Identification of Multiple-Input Systems with Highly Coupled Inputs: Application to EMG Prediction from Multiple Intracortical Electrodes David T. Westwick
[email protected] Department of Electrical and Computer Engineering, University of Calgary, Calgary, Alberta, T2N 1N4, Canada.
Eric A. Pohlmeyer
[email protected] Department of Biomedical Engineering, Northwestern University, Evanston, IL 60208, U.S.A.
Sara A. Solla
[email protected] Department of Physiology, Northwestern Medical School, Chicago, IL, 60611, and Department of Physics and Astronomy, Northwestern University, Evanston, IL 60208, U.S.A.
Lee E. Miller
[email protected] Department of Physiology, Northwestern Medical School, Chicago, IL 60611, U.S.A.
Eric J. Perreault
[email protected] Department of Physical Medicine and Rehabilitation, Northwestern University Medical School, Chicago, IL 60611, U.S.A.
Neural Computation 18, 329–355 (2006)
© 2005 Massachusetts Institute of Technology

A robust identification algorithm has been developed for linear, time-invariant, multiple-input single-output systems, with an emphasis on how this algorithm can be used to estimate the dynamic relationship between a set of neural recordings and related physiological signals. The identification algorithm provides a decomposition of the system output such that each component is uniquely attributable to a specific input signal, and then reduces the complexity of the estimation problem by discarding those input signals that are deemed to be insignificant. Numerical difficulties due to limited input bandwidth and correlations among the inputs are addressed using a robust estimation technique based on singular value decomposition. The algorithm has been evaluated on both simulated and experimental data. The latter involved estimating the
relationship between up to 40 simultaneously recorded motor cortical signals and peripheral electromyograms (EMGs) from four upper limb muscles in a freely moving primate. The algorithm performed well in both cases: it provided reliable estimates of the system output and significantly reduced the number of inputs needed for output prediction. For example, although physiological recordings from up to 40 different neuronal signals were available, the input selection algorithm reduced this to 10 neuronal signals that made significant contributions to the recorded EMGs.

1 Introduction

Recent advances in microelectrode array technology have made it possible to record from multiple neurons simultaneously (Maynard, Nordhausen, & Normann, 1997; Williams, Rennaker, & Kipke, 1999; Nicolelis et al., 2003). This capability may enhance our understanding of communication within the central nervous system (CNS) and may allow the development of brain-machine interfaces (BMIs) that provide enhanced communication and control for individuals with significant neurological disorders (Donoghue, 2002; Mussa-Ivaldi & Miller, 2003). However, the potential of these recording devices is limited by the current methods available for processing multichannel recordings. The purpose of this study is to develop robust and efficient algorithms for determining the dynamic relationship between a set of neural recordings and a continuous time signal related to those recordings. Much of the research incorporating multielectrode arrays has focused on using intracortical recordings to predict kinematic and dynamic features of hand motion in freely moving primates and on using these predictions as a basis for developing cortical BMIs. To date, a number of linear and nonlinear algorithms have been used to generate the map between cortical activity and specific movement variables (Serruya, Hatsopoulos, Paninski, Fellows, & Donoghue, 2002; Taylor, Tillery, & Schwartz, 2002; Carmena et al., 2003).
In general, linear models have been found to perform nearly as well as nonlinear models for the prediction of continuous movement variables (Wessberg et al., 2000; Gao, Black, Bienenstock, Wu, & Donoghue, 2003). However, nonlinear models can have advantages for predicting the rest periods between movements or the peak velocities during the fastest movements in a continuous movement sequence (Kim et al., 2003). As might be expected, performance of either type of model generally improves with increasing numbers of neurons. Prediction accuracy for various movement-related signals has been shown to increase with increasing numbers of neurons for as many as 250 simultaneously recorded cortical neurons (Carmena et al., 2003). However, Sanchez et al. (2004) recently demonstrated that prediction accuracy can be improved further by an appropriate selection of inputs. Two main sources of error can arise when a large number of neurons are used as inputs to the system identification process. The first is that
correlations among neurons can result in a numerically ill-conditioned estimation problem. The second is that using too many inputs can lead to accurate fits of the data employed in the estimation process but poor generalization to new data sets. Although additional neural recordings can provide novel information, not all of this information may be relevant to the task or process under study. Therefore, techniques that reduce the dimensionality of the input signals could reduce the computational complexity of the system identification problem and possibly increase prediction accuracy. One approach to this problem has been to use principal component analysis (PCA) to generate a set of orthogonal inputs that span the space defined by the original data (Chapin, Moxon, Markowitz, & Nicolelis, 1999; Isaacs, Weber, & Schwartz, 2000; Wu et al., 2003). Such techniques can reduce the dimensionality of the input signal space when there are correlations between the input signals. This reduction can enhance the robustness of the identification process, but it does not reduce the number of neural signals that need to be recorded, since each principal component is a linear combination of all available input signals. An alternative approach is to select the “most relevant” set of inputs (Sanchez et al., 2004). Once this selection is complete, spike sorting and subsequent signal processing can be restricted to the retained inputs. Input signals with restricted bandwidths also can lead to a numerically unstable identification problem and poor generalization. This problem manifests itself as the need to invert an ill-conditioned input correlation matrix, a problem that can be alleviated by using standard singular value decomposition (SVD) techniques to check for numerical instabilities (Paninski, Fellows, Hatsopoulos, & Donoghue, 2004) during the inversion process. 
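An SVD-based check and truncated pseudo-inverse of the kind referred to here can be sketched as follows (NumPy; the tolerance and the function name are assumed, not taken from this article):

```python
import numpy as np

def svd_solve(A, b, tol=1e-6):
    """Solve A h = b via a truncated-SVD pseudo-inverse (sketch).

    Singular values below tol * max(s) are discarded, which guards the
    inversion against the ill-conditioning that correlated or
    band-limited inputs produce.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > tol * s[0]  # numpy returns singular values in descending order
    return Vt[keep].T @ ((U[:, keep].T @ b) / s[keep])
```

For a rank-deficient system, this returns the minimum-norm solution instead of amplifying noise along the near-null directions.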
When such instabilities exist, robust estimates of the impulse response functions (IRFs) between the inputs and the output still may be obtained by using an SVD-based matrix pseudo-inverse (Westwick & Kearney, 1997b). The goal of this study is to develop robust tools for processing information obtained from large numbers of neural recordings and for determining the linear relationship between these recordings and a related physiological signal of interest. Specifically, we have developed an algorithm for selecting an optimal set of input signals, based on their unique contributions to the system output, and for developing robust predictors of the output from this subset of inputs. The performance of these novel methods is demonstrated on both simulated and experimental data; the latter consist of intracortical recordings from the primary motor cortex of a freely moving primate together with EMG data taken from several arm muscles.

2 Analytical Methods

Consider a multiple-input single-output (MISO) system, represented by a bank of N linear finite impulse response (FIR) filters with memory length
M. Let x_k(t), for k = 1, 2, ..., N, be measurements of the N input signals at time t, and let z(t) be the measured output. Then,

    z(t) = \sum_{k=1}^{N} \sum_{\tau=0}^{M-1} h_k(\tau)\, x_k(t - \tau) + w(t),    (2.1)
where w(t) accounts for both noise in the output measurements and the effects of any additional, unmeasured inputs to the system. It is assumed to be zero mean and uncorrelated with all the measured inputs x_k(t), k = 1, ..., N. The objective is to estimate the NM filter weights h_k(τ), for τ = 0, ..., M − 1 and k = 1, ..., N, from input-output measurements x_k(t), z(t), for t = 1, ..., T. Given sufficient data, ideally T ≫ NM, this can be accomplished by rewriting equation 2.1 as a matrix equation,

    z = Xh + w,    (2.2)
where z and w are T-element vectors containing z(t) and w(t), respectively. The IRFs h_k(τ) are placed, in order, in the NM-element vector h,

    h = [h_1(0)\; h_1(1) \cdots h_1(M-1)\; h_2(0)\; h_2(1) \cdots h_N(M-1)]^T.    (2.3)
Thus, X is the block-structured matrix

    X = [X_1, X_2, \ldots, X_N],    (2.4)
where the Xk are T × M matrices,
xk (1)
0
...
0
xk (2) xk (1) ... 0 Xk = . . .. .. .. .. . . . xk (T) xk (T − 1) . . . xk (T − M + 1) Since the noise is uncorrelated with the inputs, the minimum mean squared error estimate of h can be obtained using the normal equation (Golub & Van Loan, 1989), hˆ = (XT X)−1 XT z.
(2.5)
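As a sketch of equations 2.1 through 2.5, the following NumPy fragment builds the block-Toeplitz regressor X and solves the least-squares problem (np.linalg.lstsq is used in place of the explicit normal-equation inverse for numerical stability; the function names are illustrative):

```python
import numpy as np

def build_regressor(inputs, M):
    """Stack the Toeplitz blocks X_k of equation 2.4 (sketch).

    inputs: (N, T) array of input signals; M: filter memory length.
    Returns the T x (N*M) block matrix X.
    """
    N, T = inputs.shape
    X = np.zeros((T, N * M))
    for k in range(N):
        for tau in range(M):
            # Column k*M + tau holds x_k delayed by tau samples
            X[tau:, k * M + tau] = inputs[k, :T - tau]
    return X

def estimate_irfs(inputs, z, M):
    # Least-squares FIR estimates of equation 2.5, one row per input
    X = build_regressor(inputs, M)
    h, *_ = np.linalg.lstsq(X, z, rcond=None)
    return h.reshape(inputs.shape[0], M)
```

With noise-free data and well-conditioned inputs, the estimates recover the true filter weights exactly (up to machine precision).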
One disadvantage of the direct solution, equation 2.5, of the normal equation is that the matrices can become unacceptably large. For example, in
Identification of Multiple-Input Systems with Highly Coupled Inputs
the neural processing experiment described in section 4.2, the system had 40 inputs, each filter was represented using a 52-tap FIR filter, and the identification was performed using up to 18,000 data points. Thus, direct application of equation 2.5 would require multiplying an 18,000 by 2,080 element matrix with its transpose and then computing the inverse of the resulting 2,080 by 2,080 matrix.

2.1 Auto- and Cross-Correlations. Perreault, Kirsch, and Acosta (1999) developed an efficient solution to the MISO system identification problem based on auto- and cross-correlation functions instead of direct use of the data. They have shown that the input-output relationship can be rewritten in terms of auto- and cross-correlation matrices,

$$\begin{bmatrix} \phi_{x_1 z} \\ \vdots \\ \phi_{x_N z} \end{bmatrix} = \begin{bmatrix} \Phi_{x_1 x_1} & \cdots & \Phi_{x_1 x_N} \\ \vdots & \ddots & \vdots \\ \Phi_{x_N x_1} & \cdots & \Phi_{x_N x_N} \end{bmatrix} \begin{bmatrix} h_1 \\ \vdots \\ h_N \end{bmatrix}, \qquad (2.6)$$
where h_k is a vector whose elements are the samples of the IRF h_k(τ), φ_{x_k z} is an M element vector containing the cross-correlation φ_{x_k z}(τ), and Φ_{x_k x_ℓ} is an M × M Toeplitz structured matrix whose elements are the correlations Φ_{x_k x_ℓ}(i, j) = φ_{x_k x_ℓ}(i − j). In compact notation, equation 2.6 is written as

$$\phi_{Xz} = \Phi_{XX}\, h. \qquad (2.7)$$
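A minimal sketch of this correlation-based formulation, assuming NumPy (our function names). Here the correlation matrices are formed directly as (1/T)XᵀX and (1/T)Xᵀz, so the estimate coincides with the direct normal-equation solution; building them from correlation functions instead differs only by end effects of order M/T.

```python
# Sketch of the correlation-based normal equations (eq. 2.7), assuming NumPy.
import numpy as np

def correlation_fit(X, z):
    T = X.shape[0]
    Phi_XX = (X.T @ X) / T      # block matrix of auto-correlations
    phi_Xz = (X.T @ z) / T      # stacked cross-correlation vectors
    return np.linalg.solve(Phi_XX, phi_Xz)

# Usage: the correlation-based and direct estimates agree.
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 6))
z = X @ np.arange(1.0, 7.0) + 0.1 * rng.standard_normal(2000)
h_corr = correlation_fit(X, z)
h_direct = np.linalg.solve(X.T @ X, X.T @ z)
```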
Perreault et al. (1999) estimated the IRFs of the linear filters by solving equation 2.7 exactly through an inversion of the matrix Φ_XX,

$$\hat{h} = \Phi_{XX}^{-1}\, \phi_{Xz}, \qquad (2.8)$$
and noticed that the input correlation matrix Φ_XX could become ill conditioned if either the input signals are strongly coupled with one another or they are severely band-limited. The equivalence of the two algorithms becomes evident when noting that

$$\frac{1}{T} X^T X = \Phi_{XX} + O\!\left(\frac{M}{T}\right), \qquad (2.9)$$
$$\frac{1}{T} X^T z = \phi_{Xz}, \qquad (2.10)$$
where O(M/T) indicates an error with magnitude of the same order as M/T. Thus, if T ≫ M, the solutions ĥ obtained from equations 2.5 and 2.8
will be virtually identical. Note that the error term of order M/T appears in the quadratic factor, equation 2.9, but not in the linear factor, equation 2.10. Furthermore, the corrections needed to make equation 2.9 exact can be implemented using the method suggested by Korenberg (1988), although the effect of these corrections is negligible unless the model's memory length M is significant compared to the data length T.

2.2 Input Contributions to the Output. The neural input data may contain signals that are either unrelated to the output variable of interest or are highly correlated with one another. Both scenarios lead to increases in the variability of the estimated model and decreases in its ability to generalize. To address these problems, an algorithm was developed that locates and eliminates redundant or irrelevant inputs. The algorithm is a variation on the orthogonal least-squares technique (Chen, Cowan, & Grant, 1991), but it is based on a backward elimination approach rather than on forward regression (Miller, 1990). Thus, at each stage, this iterative algorithm computes the unique contribution that each input makes to the output and then eliminates the input that makes the smallest such contribution. To find the component in the output that can be attributed only to the input x_k(t), construct the matrices M_1 = [X_1, . . . , X_{k−1}, X_{k+1}, . . . , X_N] and M_2 = X_k, orthogonalize M_2, which contains delayed copies of the input x_k(t), against the remaining inputs, stored in M_1, and then project the output z(t) onto these orthogonal columns using the following QR factorization (Golub & Van Loan, 1989),
$$[M_1 \;\; M_2 \;\; z] = QR = [Q_1 \;\; Q_2 \;\; q_z] \begin{bmatrix} R_{11} & R_{12} & r_{1z} \\ 0 & R_{22} & r_{2z} \\ 0 & 0 & r_{3z} \end{bmatrix}, \qquad (2.11)$$
where Q^T Q is the NM × NM identity matrix and R is upper triangular by construction. These matrices are partitioned such that Q_1 and Q_2 have the same dimensions as M_1 and M_2, respectively. The dimensions of the blocks in R can be inferred from

$$M_1 = Q_1 R_{11},$$
$$M_2 = Q_1 R_{12} + Q_2 R_{22},$$
$$z = Q_1 r_{1z} + Q_2 r_{2z} + r_{3z}\, q_z. \qquad (2.12)$$
Since the columns of Q are orthogonal, the three terms on the right-hand side of equation 2.12 are orthogonal to each other. Thus, Q_2 r_{2z}, which is orthogonal to Q_1 and hence M_1, is the component in the output that can be attributed only to the input x_k(t). The mean squared value of this unique contribution, ŷ_k(t), is given by

$$\frac{1}{T}\sum_{t=1}^{T} \hat{y}_k^2(t) = \frac{1}{T}(Q_2 r_{2z})^T(Q_2 r_{2z}) = \frac{r_{2z}^T r_{2z}}{T}. \qquad (2.13)$$
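This unique-contribution computation can be sketched as follows, assuming NumPy. The implementation and names are ours: the orthogonalization is done by explicit projection rather than the partitioned QR factorization of equation 2.11, but it yields the same quantity as equation 2.13.

```python
import numpy as np

def unique_contribution(blocks, z, k):
    """Mean square of the part of z attributable only to input block k."""
    M1 = np.hstack([B for i, B in enumerate(blocks) if i != k])
    M2 = blocks[k]
    Q1, _ = np.linalg.qr(M1)
    M2_perp = M2 - Q1 @ (Q1.T @ M2)   # orthogonalize M2 against M1
    Q2, _ = np.linalg.qr(M2_perp)
    r2z = Q2.T @ z                    # projection of z onto Q2
    return float(r2z @ r2z) / len(z)  # eq. 2.13

# Usage: input 2 is nearly a copy of input 0, so inputs 0 and 2 each
# make almost no unique contribution; input 1 dominates.
rng = np.random.default_rng(2)
T = 1000
a = rng.standard_normal((T, 2))
b = rng.standard_normal((T, 2))
c = a + 0.05 * rng.standard_normal((T, 2))   # nearly redundant with a
z = a @ np.array([1.0, -0.5]) + b @ np.array([0.3, 0.7])
scores = [unique_contribution([a, b, c], z, k) for k in range(3)]
```

Note that the near-duplicated pair (inputs 0 and 2) both score low; backward elimination removes one of them, after which the survivor's unique contribution becomes substantial, which is why the process must be repeated after each deletion.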
The procedure described above can be used to identify the input that makes the smallest unique contribution to the output. The least significant input may then be removed from the pool of inputs and the process repeated. Note that if there are correlations between the inputs, the significance of the remaining inputs will change as a result of the deletion. Hence, it is necessary to repeat this process (N − 1) times to determine an optimal set of inputs to use in the identification process. Since the computational cost associated with the QR factorization in equation 2.11 is approximately 4T(MN)^2 flops (Golub & Van Loan, 1989) and since this factorization would have to be repeated once for each input, this scheme is clearly not practical. However, direct computation of the QR factorization is not necessary. To simplify the notation, consider estimating the contribution due to the last input x_N(t), so that the QR factorization in equation 2.11 involves the matrix [M_1 M_2 z] = [X z]. Squaring the right-hand side of equation 2.11 yields

$$(QR)^T QR = R^T R, \qquad (2.14)$$
while squaring the left-hand side gives

$$[X \;\; z]^T [X \;\; z] = \begin{bmatrix} X^T X & X^T z \\ z^T X & z^T z \end{bmatrix} \qquad (2.15)$$
$$= T \begin{bmatrix} \Phi_{XX} & \phi_{Xz} \\ \phi_{Xz}^T & \sigma_z^2 \end{bmatrix}, \qquad (2.16)$$
where the last equality follows from equations 2.9 and 2.10. Thus, the matrix R, and hence the mean-squared value of ŷ_N(t), can be obtained by computing the Cholesky factorization (Golub & Van Loan, 1989) of the matrix in
equation 2.16, which is constructed from the auto- and cross-correlation matrices. Note that the computational cost of the Cholesky factorization is approximately (1/3)(NM)^3 flops, independent of T. Clearly, the contribution due to any one of the inputs can be obtained by rearranging the blocks in Φ_XX and φ_Xz so that the input of interest appears in the bottom and right-most block of rows and columns in Φ_XX and in the bottom-most block of rows in φ_Xz.

2.3 Singular Value Decomposition. Although the input selection algorithm will reduce correlations between inputs, the linear regression may still be poorly conditioned. For example, input properties such as limited bandwidth, which will produce an ill-conditioned autocorrelation matrix, are not altered by the selection algorithm. This ill-conditioned regression problem can be solved robustly using the singular value decomposition (SVD) (Golub & Van Loan, 1989) of the regression matrix X,

$$X = USV^T, \qquad (2.17)$$
where U^T U = I, V^T V = I, and S = diag(σ_1, σ_2, . . . , σ_{NM}), with σ_1 ≥ σ_2 ≥ · · · ≥ σ_{NM} ≥ 0. Consider the estimate

$$\hat{h} = VS^{-1}U^T z \qquad (2.18)$$
$$= h + VS^{-1}U^T w \qquad (2.19)$$
$$= V(V^T h + S^{-1}U^T w). \qquad (2.20)$$
Aside from finite-precision errors, the solutions in equations 2.5 and 2.18 are identical. However, rewriting equation 2.18 as 2.19 provides insight into the effect that measurement noise will have on the final estimate. Thus, let

$$\eta = U^T w, \qquad (2.21)$$
$$\zeta = V^T h, \qquad (2.22)$$
and let η_k and ζ_k be the kth elements of the vectors η and ζ, respectively. The estimate ĥ can then be written as

$$\hat{h} = \sum_{k=1}^{NM} \left( \zeta_k + \frac{\eta_k}{\sigma_k} \right) v_k, \qquad (2.23)$$
where v_k is the kth column of the matrix V. The decomposition in equation 2.23 expands the vector of IRF estimates as a linear combination of an orthogonal basis formed by the right singular vectors. Each expansion
coefficient consists of two terms: ζ_k, the projection of the true system onto the kth right singular vector, and η_k/σ_k, the projection of the measurement noise onto the kth left singular vector. Note, however, that the kth noise term is scaled by 1/σ_k. Thus, small singular values can be expected to produce relatively large noise terms, and hence large errors in ĥ. The goal is to retain only those terms in equation 2.23 that are dominated by the signal component and to discard the rest (Westwick & Kearney, 1997b). One approach to the selection of significant terms is to reorder the singular values according to their contributions to the output. The model can then be built up term by term, including the most significant remaining term at each step. Once an acceptable level of model accuracy has been reached, the expansion can be halted. The model output is given by

$$\hat{y} = X\hat{h} = USV^T \hat{h}. \qquad (2.24)$$
Define the coefficient vector,

$$\gamma = SV^T \hat{h}, \qquad (2.25)$$
which contains the projection of the IRF estimate onto the right singular vectors, scaled by their associated singular values. The mean squared value of the model output is then

$$\frac{1}{T}\hat{y}^T\hat{y} = \frac{1}{T}\gamma^T\gamma. \qquad (2.26)$$
Thus, γ_k^2, the square of the kth element of the vector γ, represents the contribution made by the kth term in equation 2.23 to the mean square of the model output (Westwick & Kearney, 1997a). To sort the terms in order of decreasing significance, we need only to sort them in decreasing absolute value of the γ_k. Finally, we note that the SVD may be used in conjunction with the efficient correlation-based technique proposed by Perreault et al. (1999). Calculate the SVD of the input correlation matrix,

$$\Phi_{XX} = V\Lambda V^T, \quad \text{where} \quad \Lambda = \frac{1}{T}S^2. \qquad (2.27)$$
The initial estimate of the IRFs, based on equation 2.8, is then

$$\hat{h} = \sum_{k=1}^{NM} \frac{1}{\lambda_k}\, v_k v_k^T\, \phi_{Xz}. \qquad (2.28)$$
The terms in equation 2.28 can be sorted in decreasing order of contribution to the output; the mean squared value of the output contribution of each of these terms can be calculated using

$$\gamma_k^2 = \lambda_k \left( v_k^T \hat{h} \right)^2. \qquad (2.29)$$
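The sorted, truncated expansion of equations 2.23 through 2.29 can be sketched as follows, assuming NumPy (function and variable names are ours). Terms are ranked by their output contribution γ_k² and only the leading ones are retained, which suppresses the noise amplified by small singular values:

```python
import numpy as np

def truncated_svd_fit(X, z, n_terms):
    """Pseudo-inverse estimate keeping the n_terms largest-gamma^2 terms."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    coef = (U.T @ z) / s              # zeta_k + eta_k / sigma_k, cf. eq. 2.23
    gamma2 = (s * coef) ** 2          # per-term output power, cf. eq. 2.29
    keep = np.argsort(gamma2)[::-1][:n_terms]
    return Vt[keep].T @ coef[keep]

# Usage: an ill-conditioned regressor (nearly duplicated columns) with
# output noise; truncation discards the noise-dominated terms.
rng = np.random.default_rng(3)
T = 500
base = rng.standard_normal((T, 3))
X = np.hstack([base, base + 1e-4 * rng.standard_normal((T, 3))])
h_true = np.ones(6)
z = X @ h_true + 0.1 * rng.standard_normal(T)
h_trunc = truncated_svd_fit(X, z, 3)
h_full = np.linalg.lstsq(X, z, rcond=None)[0]
```

In this example the full-rank solution amplifies the output noise through the three tiny singular values, while the truncated estimate stays close to the true weights.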
The mean squared output is then plotted versus the number of singular vectors retained in the model. The point where the plot starts to level off determines how many singular vectors to include in the final model.

3 Experimental Methods

The algorithms described above have been evaluated using data from both computer simulations and physiological recordings. The rationale behind the use of artificial data is to test our algorithms on a system with known and modifiable properties; the application to physiological recordings then demonstrates the performance of these algorithms under more realistic conditions, where the system under study may not be well characterized a priori.

3.1 Generation of Artificial Data. The input selection and system identification algorithms were evaluated on a simulated linear MISO system with highly correlated and band-limited inputs. Both of these characteristics significantly complicate the identification process and are likely to be encountered when recording a large number of physiological signals. A schematic representation of the simulation process used to generate the artificial input-output data is shown in Figure 1A. The inputs were generated using K normally distributed independent white noise sources. Each of these sources was band-limited using a digital Butterworth filter with a randomly generated order, ranging from first to fourth, and a randomly generated cut-off frequency, ranging from 10% to 90% of the Nyquist rate. Unique filters were used for each input. The resulting set of K independent, band-limited sources was multiplied by a randomly generated K × N mixing matrix, resulting in N correlated signals. In all cases N was greater than K, to mimic the recording of a large number of physiological signals driven by a small number of independent sources. Our goal was to evaluate the system identification procedure both with and without the optimal selection algorithm.
Since this identification algorithm will not work if the inputs are fully statistically dependent, N independent gaussian white noise sources were added to the N dependent signals to ensure some degree of independence. A 10 dB signal-to-noise ratio was used to emphasize the coupling among the N generated inputs over their small degree of independence.
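The generator described above might be sketched as follows, assuming NumPy and SciPy are available; the function name and its defaults are ours, with orders and cutoffs drawn as described in the text.

```python
# Sketch of the artificial-input generator of section 3.1, assuming
# NumPy and SciPy.
import numpy as np
from scipy.signal import butter, lfilter

def make_correlated_inputs(K, N, T, snr_db=10.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    sources = np.empty((T, K))
    for k in range(K):                    # K independent band-limited sources
        order = int(rng.integers(1, 5))   # first- to fourth-order filters
        cutoff = rng.uniform(0.1, 0.9)    # 10%-90% of the Nyquist rate
        b, a = butter(order, cutoff)
        sources[:, k] = lfilter(b, a, rng.standard_normal(T))
    mixing = rng.standard_normal((K, N))  # random K x N mixing matrix
    inputs = sources @ mixing             # N correlated signals
    # add independent gaussian noise at the requested SNR so the inputs
    # are not fully statistically dependent
    gain = np.std(inputs, axis=0) / (10 ** (snr_db / 20))
    return inputs + gain * rng.standard_normal((T, N))

X_sim = make_correlated_inputs(K=10, N=20, T=2000, rng=np.random.default_rng(4))
```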
Figure 1: Schematic representation of the computer simulations. (A) Diagram of the process used to generate artificial data. The K independent sources are gaussian white noise. The SISO filters are digital Butterworth filters with randomly generated orders and cut-off frequencies. Specific details are provided in the text. (B) Typical input and output signals generated by this process; 250 sample points are shown.
The system output was generated from these inputs using a similar process. Each one of the N inputs was filtered by a unique digital Butterworth filter with a randomly generated order, ranging from first to fifth, and a randomly generated cutoff frequency, ranging from 10% to 80% of the Nyquist rate. These parameters were chosen so that on average, the bandwidth of the system inputs was greater than that of the system to be identified. The resulting signals were combined using a randomly generated N × 1 mixing matrix to produce a single output. Measurement noise was simulated by adding normally distributed white noise with 10 dB signal-to-noise ratio. Figure 1B shows typical input and output signals generated by this process. Monte Carlo simulations were used to evaluate algorithm performance. Each set of simulations used a fixed number of independent sources (K )
and corresponding inputs (N). In contrast, different noise sources, filter parameters, and mixing matrices were used for each trial in the set. Specific values for these stochastic model parameters were selected as described above. Unless specified, each Monte Carlo simulation set consisted of 100 trials, to obtain robust estimates of the mean and standard deviation for each quantity of interest.

3.2 Physiological Recordings. The algorithms also were evaluated using a set of physiological recordings obtained from a single behaving primate (Macaca mulatta) during execution of a button-pressing task. A total of 29 3-minute data files were collected during eight separate recording sessions. The recording sessions spanned a period of three months; it is assumed that the neural signals recorded from each electrode varied over the course of this period. Each trial began with the left hand on a touch pad at waist level. Following a 0.3 second touch pad hold time, one of four buttons in front of the subject was illuminated, instructing the subject to reach toward and press this randomly chosen illuminated button. The buttons were arranged on the top and bottom surfaces of a plastic box (see Figure 2A), thus requiring the subject to approach the button with the forearm either pronated or supinated, respectively. After a brief hold (∼200 ms), a tone indicated success; the subject was given a juice reward and could return its hand to the touch pad to initiate the next trial after a random intertrial interval. Figure 2B shows data, including associated neuronal discharge signals, electromyograms (EMGs) from four muscles, and a binary trace indicating the button press times for a series of five such trials. There were an average of 60 ± 8 button presses in each 3-minute trial, equally distributed across the four targets. The neuronal discharge signals were recorded from an array of 100 electrodes (Cyberkinetics, Inc.).
The array was chronically implanted under the dura in the arm area of the primary motor cortex. Leads from the array were routed to a connector implanted on the skull. Signals from the best 32 of these electrodes were sent to a DSP-based multineuron acquisition system (Plexon, Inc., Dallas, TX) for later analysis with the Plexon Off-line Sorter software. For the data presented here, the sorting algorithm was able to distinguish between 35 and 40 independent neural signals from the 32 electrode recordings. It was possible to classify approximately 15% of these signals as single neurons, based on stringent shape and minimum interspike interval criteria. The remaining signals were those for which action potentials probably were due to more than one neuron, but from which background noise and occasional artifacts were removed. The spike occurrences for these discriminated signals were then converted into a rate code, sampled at 100 Hz, for subsequent processing. EMG signals were recorded from surface electrodes placed above the anterior deltoid (AD), biceps (BI), triceps (TRI), and combined wrist and digit flexors (FF). The signals were sampled at 2000 Hz and subsequently
Figure 2: Experimental setup for physiological recordings. (A) In this reaching task, the monkey is required to press one of four lighted buttons located on the top and bottom surfaces of the target platform. (B) Typical data recorded during this task. The top traces correspond to the neuronal firing patterns recorded from the intracortical microelectrode array. The bottom traces show the simultaneously recorded electromyograms from four of the arm muscles involved in the task: anterior deltoid (AD), biceps (BI), triceps (TRI), and combined wrist and digit flexors (FF). The target trace identifies periods of button pressing.
rectified, low-pass-filtered (10 Hz), and resampled at 100 Hz. All animal-related procedures were approved by the institutional Animal Care and Use Committee at Northwestern University.

3.3 Statistical Analysis. To quantify model accuracy, we use r^2, the square of the correlation coefficient between the system and model outputs. Typically, experimental data are divided into two sets: estimation data and validation data. The model is identified from the estimation data and tested on the validation data. If the model is validated using the estimation data, the value of r^2 can be biased upward if the model fits some of the measurement noise (overfitting). Thus, the difference between the values of r^2 obtained from the estimation and validation data can be used to assess the degree of overfitting. The accuracy of the proposed input selection algorithm was compared to that of PCA and of two of the methods recently proposed by Sanchez et al. (2004). All of these alternative algorithms reduce the dimensionality of the input signal space. PCA was chosen for comparison, since it is a commonly used approach for improving the numerical stability of the identification process. PCA differs from the Sanchez algorithms in that it uses all inputs in the prediction process. In contrast, the methods proposed by Sanchez are designed to remove unnecessary neural signals. The first method (fSISO) ranks each input according to the amount of variance it can predict in the system output. Prediction is performed using a SISO linear filter and ignoring all other available inputs; correlations among inputs are not considered. The second method (fMISO) ranks the inputs according to the sum of the estimated linear filter magnitudes; the estimated filters are obtained using a nonparametric linear MISO identification. This was the most effective of the three algorithms evaluated by Sanchez et al. (2004).
As with the optimal selection algorithm, this ranking process must be repeated (N − 1) times for a system with N inputs. Each iteration involves eliminating the least significant input.

4 Results

4.1 System Identification from Artificial Data. In the case of artificial data, the optimal selection algorithm greatly reduced the number of inputs needed for an accurate prediction of the system output. The performance of our input selection algorithm was compared to that obtained if a subset of the input signals was randomly chosen for use in the system identification process. The value of r^2 calculated from the cross-validation data was used to compare the performance of these two selection processes. Figure 3 shows the average results from a set of 100 Monte Carlo simulations using 10 independent sources (K = 10) to produce 20 correlated inputs (N = 20). For these simulation parameters, more than 90% of the maximally obtainable output variance could be predicted using the three most significant inputs.
Figure 3: Model accuracy as a function of the number of inputs used in the identification process. Thin traces correspond to the value of r^2 for randomly selected inputs, and thick traces correspond to that for optimally selected inputs. Error bars indicate the standard deviation based on the results of 100 simulated trials. Simulation parameters: K = 10, N = 20, T = 2000 data points for each of the simulated data vectors. The estimated linear filters had lengths of M = 32 points.
In contrast, more than twice as many inputs were required to reach the same level of fitting accuracy when a random selection was used. Similar results were obtained for a wide range of simulation parameters, as examined by varying the number K of independent sources from 7 to 15 and the number N of input signals generated by these sources from 7 to 40. Although it reduced the fitting accuracy, the robust identification algorithm improved output predictions for cross-validation data. The prediction accuracy of the estimated linear system was evaluated for both the fitted data and for cross-validation data not used in the fitting process. These results, as functions of the number of singular values used in the pseudo-inverse, are shown in Figure 4. Simulation parameters were identical to those used for Figure 3, although half of the resulting data set was used for identification and the remaining half for cross-validation. As expected, r^2 for the fitted data increases monotonically with the number of singular values used in the identification process. However, there is a clear peak in the r^2 value for the cross-validation data, indicating the advantage of restricting the number of singular values when computing the pseudo-inverse for this
Figure 4: Prediction accuracy for the fitted (solid black trace) and cross-validation (dashed black trace) data as a function of the number of singular values used in the pseudo-inverse. Results are the average and standard deviation (gray bands) from 100 simulated data sets. Simulation parameters: K = 10, N = 20, 2000 data points for each of the simulated data vectors (1000 points were used for the estimation and 1000 for the cross-validation). The estimated linear filters had a length of M = 32 points.
data set. Such a result is a signature of overfitting, and the simulation parameters were chosen to illustrate this point. For the same set of simulation parameters, the discrepancy between the curves for the fitting and validation data diminishes as the number of data points used in the fitting process is increased. The combined use of the optimal input selection algorithm and the robust MISO identification yields accurate output predictions using only a small subset of the measured inputs. Figure 5 shows typical model predictions for the same system used to generate the data for Figures 3 and 4. Only 3 of the 20 available inputs were used to predict the output response. Figure 5A shows the results when optimal input selection was combined with the robust identification algorithm. Based on the results shown in Figure 4, the number of singular values needed to predict 90% of the output variance was used to compute the pseudo-inverse. Figure 5B shows typical results using three randomly chosen inputs and all singular values in the identification process. For this data set, r^2 increased by 0.1 when the optimal input selection algorithm was used in conjunction with the robust MISO
[Figure 5 here. Panel A: r^2 = 0.88; panel B: r^2 = 0.78; both panels plot signal amplitude versus data points.]
Figure 5: Actual (gray traces) and predicted (black traces) model outputs for simulated data using 3 of 20 available input signals. (A) Results when the optimal inputs are chosen and the robust identification algorithm is used. (B) Results when the inputs are randomly chosen and a pseudo-inverse is not used in the identification process. Simulation parameters: K = 10, N = 20, 4000 data points for each of the simulated data vectors (2000 points were used for the estimation and 2000 for the cross-validation). The estimated linear filters had a length of M = 32 points.
identification, as compared to the result obtained using the original identification algorithm. This increase corresponds to a nearly twofold reduction in the mean squared error. In a series of 100 Monte Carlo simulations, r^2 increased by 0.17 ± 0.11.

4.2 System Identification from Physiological Data. The algorithms presented in this article were designed to work with correlated multiple input data of restricted bandwidth. The experimental data collected from the microelectrode arrays exhibited both of these features. Figure 6A shows
Figure 6: Input data characteristics. (A) Power spectra for the cortical recordings for a typical 180 second reaching trial. Individual signals are shown in gray and the mean of all signals in black. (B) Interdependence of the cortical recordings. The gray trace shows the instantaneous spike rate recorded from a single electrode; the black trace shows the estimate of that recording based on the recordings from all other electrodes (39 signals). Both signals have been low-pass-filtered at 10 Hz using a fourth-order Butterworth filter. The prediction is based on a linear MISO system estimated from 180 seconds of data. All estimated filters had a length of 510 ms.
the power spectra for all cortical signals recorded during a typical trial; the average spectrum is shown in black. The spectrum for each signal was broad but had a dominant peak at approximately 0.2 Hz to 0.5 Hz, due to the time between subsequent reaching movements. The power outside this range was lower but remained significant because the signals contained relatively narrow bursts of activity. Correlation between channels was assessed by examining the accuracy with which a given neural signal could be predicted using all other available neural signals. A causal linear MISO system was used for this prediction. Across all 29 data sets, the squared correlation coefficient between any given input and its prediction from all of the other inputs was 0.33 ± 0.10 (mean ± SD). Furthermore, the best input prediction in each trial resulted in r^2 of 0.55 ± 0.03. Figure 6B shows a typical example of the dependence between recorded neural signals. The gray trace shows the spike rate of the neural signal recorded from a single electrode, and the black trace shows the prediction of that signal based on the measurements from all other electrodes. Both signals were low-pass-filtered at 10 Hz to accentuate the average spike rates. The r^2 value for these two signals prior to filtering was 0.32. These results demonstrate that the neural recordings obtained from the microelectrode array were not statistically independent and that there is significant overlap in the information contained in these recordings, especially at the lower frequencies more relevant to movement control. The optimal selection algorithm was effective at reducing the number of cortical signals necessary to predict the EMG activity for each of the four recorded muscles. Figure 7 shows the average r^2 value across all 29 data sets as a function of the number of inputs used to predict each of the recorded EMGs.
Filter lengths of 510 ms were used for prediction; lengths beyond this size did not improve prediction accuracy significantly. The average r^2 obtained when using the optimally selected inputs is compared to that obtained when using the three alternative strategies: PCA, fSISO, and fMISO. The thick straight lines above each set of curves indicate the regions where the selection algorithm proposed in this letter performed significantly better than the alternatives (p < 0.05). The optimal algorithm produced the maximal cross-validation prediction accuracy for each of the four muscles tested and performed significantly better than the alternative methodologies. These results were most dramatic in comparison to the fSISO and PCA algorithms. There also was a statistically significant improvement in comparison to the fMISO algorithm, but the magnitude of the difference between the performance of that approach and that of the optimal algorithm was small. Reducing the number of inputs also increased the accuracy of the EMG predictions for cross-validation data for 113 of the 116 available data sets (4 muscles × 29 trials). The improvement in r^2 across all trials relative to when all inputs were used ranged from 0 to 0.28, with an average of 0.05 ± 0.07. Because the number of optimal inputs varied between
Figure 7: Performance of the optimal selection algorithm on electrophysiological data. Each panel corresponds to a different muscle and shows the average (29 trials) accuracy of EMG predictions as a function of the number of input signals used in the identification process. All reported r^2 values are for cross-validation data not used in the selection or identification processes. Solid traces correspond to r^2 obtained using the optimal selection algorithm. The gray lines show the results of using the fMISO selection algorithm, the coarse dashed lines show the results of using the fSISO selection algorithm, and the fine dashed lines show the results of using a PCA. In the estimation process, 120 seconds of data were used, and 60 seconds were used for cross-validation. All estimated filters had a length of 510 ms. The thick lines above the traces indicate regions where the optimal selection algorithm was significantly better (p < 0.05) than the alternative approaches.
trials, this improvement is not evident in the average responses shown in Figure 7. Once the optimal set of inputs has been selected, the robust identification algorithm provided little performance enhancement with respect to techniques not based on a pseudo-inverse, as long as sufficient data were used in the identification process. Figure 8 provides results from a typical trial and shows r 2 for the fitted and cross-validation data, as a function of
Figure 8: EMG prediction accuracy for the fitted (solid traces) and cross-validation (dashed traces) data as a function of the number of singular values used in the pseudo-inverse. For estimation, 120 seconds of data were used, and 60 seconds for cross-validation. The estimated filters had a length of 510 ms.
the number of singular values used in the pseudo-inverse. This identification was performed on the 10 optimal inputs. Even when using these, less than 10% of the available singular values was needed to obtain 90% of the maximum achievable prediction accuracy. In contrast to the simulations presented in Figure 4, there is no clear peak in the cross-validation curve. This is due to the use of sufficiently long data records (2 minutes). The SVD algorithm had a more dramatic effect when fewer data were used in the identification process, although the peak value of r^2 was maximized by using at least 2 minutes of data for system identification. The optimal algorithm for the selection of inputs has made it possible to predict upper limb EMGs reliably. Figure 9 compares the recorded and predicted EMGs from a set of cross-validation data for each muscle. The 10 best neurons were used for prediction; a different set of neurons was used for each muscle. There is a close correspondence between the actual and predicted EMGs. For the cross-validation data in these examples, r^2 was between 0.60 and 0.73, typical values for trials without long rest periods between subsequent movements.
D. Westwick, E. Pohlmeyer, S. Solla, L. Miller, and E. Perreault
[Figure 9 appears here: actual (gray) and predicted (black) EMG traces for AD, BI, TRI, and FF; 5-second scale bar.]
Figure 9: Actual (gray traces) and predicted (black traces) EMGs using 10 of 41 available cortical recordings. All plots are for cross-validation data not used in the estimation process. Models were estimated from 120 seconds of data, leaving 60 seconds available for cross-validation. The estimated filters had a length of 510 ms.
5 Discussion In this letter, we have presented two novel algorithms to address numerical problems associated with the identification of MISO systems and have demonstrated the applicability of these algorithms to the processing of neural recordings from intracortical microelectrode arrays. These algorithms address the problems associated with identifying MISO systems with highly correlated inputs and inputs with restricted information content. The algorithms provide tools for selecting the optimal inputs to be used in the identification process and for generating robust estimates of the MISO system that relate these inputs to the observed output. Both algorithms were found to perform well for simulated and experimental data. 5.1 Selection of Inputs. With rapid advances in microelectrode technology, it becomes increasingly possible to obtain large numbers of simultaneous neural recordings. However, the computational burden associated with processing such large numbers of available inputs can be a challenge
in applications where efficient processing is essential. Furthermore, the likelihood of correlations among inputs increases as the number of recorded signals increases, as demonstrated in section 4. Such correlations can lead to numerical instabilities during the identification process (Perreault et al., 1999). Both issues can be addressed by eliminating inputs that do not provide unique information about the system output. Nonessential inputs may be uncorrelated with the output or may be highly correlated with other inputs. We have developed an algorithm for detecting such inputs and have demonstrated that it is effective. In addition to decreasing the computational time required for the remaining steps in the system identification process, such a pruning of neural inputs also has the potential to reduce the computational costs associated with preprocessing algorithms such as spike sorting and artifact removal. The reported advantages associated with reducing the number of inputs used in the identification process are not in contradiction to findings that prediction accuracy increases with increasing numbers of recorded signals (Carmena et al., 2003; Wessberg et al., 2000). Recently, Paninski et al. (2004) and Sanchez et al. (2004) demonstrated that the accuracy with which movement variables can be predicted from neural recordings depends strongly on which neurons are selected as model inputs. Our results demonstrate that this selection can be optimized by choosing neural signals based on the uniqueness of their contribution to the system output. Increasing the number of neural recordings increases the sample of neurons from which to draw the optimal set. Therefore, the potential of experimental techniques that allow such large-scale recordings is likely to be enhanced by optimally selecting a subset of the available recorded signals for use in the identification and prediction process. Similar selection algorithms were recently explored by Sanchez et al.
(2004), who also demonstrated that a subset of the available neural recordings could be used to predict kinematic variables associated with reaching. We were able to compare two of their algorithms with the one proposed in this article. Although our selection algorithm produced the best results, the fMISO algorithm proposed by Sanchez et al. (2004) performed nearly as well. The results of both studies emphasize the need for considering all neural inputs and their contribution to the system output during the selection process. 5.2 Robust SVD Estimation. Most system identification algorithms rely on the use of white or at least broadband stationary inputs to produce reliable, robust estimates of the system dynamics. However, it can be difficult to obtain broadband inputs during functional behaviors. Under realistic conditions, the input bandwidth may be limited, and the assumption of stationarity may be violated. Hence, it is necessary to develop and use system identification algorithms that produce robust estimates of the system dynamics under such conditions. Westwick and Kearney
(1997b) have developed a robust algorithm for identifying SISO systems using a pseudo-inversion of the input autocorrelation matrix (see equation 2.8). Here we have extended this algorithm to the identification of MISO systems. Our results (see Figure 4) demonstrate that the use of this algorithm can improve prediction accuracy for cross-validation data not used in the identification process. These improvements are greatest when the number of data used in the identification process is relatively small, indicating that the robust algorithm helps reduce the problem of overfitting. Smaller improvements are observed when the number of data used to identify the MISO system is increased (see Figure 8). For neural data similar to those used in this study, this algorithm is likely to be most beneficial in situations where only small data records can be collected or when it is necessary to characterize system behavior over short periods of time. 5.3 Linear Systems Identification. This study has been restricted to the use of linear system identification techniques. Although the transformation from neural activity to motor output presumably contains many significant nonlinearities, linear models of the net transformation from neural activity to EMGs work surprisingly well when populations of neurons are considered. Similar phenomena have been demonstrated by a number of groups that have compared prediction accuracies of both linear filters and nonlinear networks for decoding neural information from motor and visual systems (Warland, Reinagel, & Meister, 1997; Wessberg et al., 2000; Gao et al., 2003). Given the similarities in prediction accuracy, there are a number of advantages to using linear identification techniques, including the computational and conceptual simplicity of these approaches as well as the potential for meaningful interpretation of the estimated linear filters. 
Linear IRFs can provide useful characterizations of the transfer of information from the cortex to the motor system (e.g., bandwidth, delays). In contrast, it can be more difficult to obtain similar insights if the system is modeled as a nonlinear network. However, the potential advantages of nonlinear models may become more apparent when their ability to generalize is tested under a wider range of conditions. For example, in situations where there is a significant pause between movements, it may be advantageous to incorporate a static output nonlinearity to approximate the threshold response of the motoneuron pool. Additionally, the techniques presented here could be applied to nonlinear systems consisting of static nonlinearities connected in series with finite memory linear systems (Bussgang, 1952). However, the applicability of these techniques to more general nonlinear systems remains to be demonstrated. 5.4 Potential Applications. The algorithms presented in this letter could be useful in a range of applications where it is necessary to predict the output of a MISO system and the inputs to that system are highly
correlated, have limited information content, or can be recorded for only short periods of time. With respect to the processing of neural information, the algorithms could be used in the analysis of multidimensional signals from a variety of sources, including electromyograms, electroencephalograms, and intracortical recordings. One application for the input selection algorithm is as a mapping tool for determining which neural recordings are most relevant to any time-dependent process of interest. Examples include assessing the neural signals or anatomical substrates contributing to movement control, cognitive tasks, or visual processing. The selection algorithm also could be used in conjunction with the robust identification algorithm when it is necessary to predict the system output or generate a control signal from a set of recorded inputs. One such application that has received much recent attention is the development of BMIs, including those for the restoration of motor function via neuromuscular stimulation of paralyzed muscles (Lauer, Peckham, Kilgore, & Heetderks, 2000), the control of augmented communication aids for individuals with severe communication disorders (Kennedy, Bakay, Moore, Adams, & Goldwaithe, 2000), and the control of assistive devices for improved mobility and function (Carmena et al., 2003; Taylor, Tillery, & Schwartz, 2003). Success in each of these applications hinges on the availability of a multidimensional natural control signal, such as would result from intracortical microelectrode array recordings. The input selection algorithm could be used to identify the neural signals most relevant to each degree of control, and the robust identification algorithm could be used to estimate the system describing the dynamic relationship between those neural signals and the desired control signal. By using only the optimal inputs, it would be possible to decrease the computational time needed for identification and prediction.
This could offer significant improvements in real-time applications, especially those using adaptive algorithms. It should be noted, though, that changes in the neural signals available over time will require a reevaluation of which of the available signals are optimal for a given task. This evaluation could operate as a background process, updating the set of optimal signals as necessary.
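The input-selection idea discussed throughout section 5 can be illustrated with a greedy forward-selection sketch. This is a simplified stand-in with our own function names and ranking criterion (residual sum of squares of a linear fit), not the authors' exact algorithm or the fMISO method of Sanchez et al. (2004):

```python
import numpy as np

def select_inputs(X, y, n_keep):
    """Greedy forward selection: repeatedly add the input whose
    inclusion most reduces the residual sum of squares of a linear
    fit, so redundant (highly correlated) inputs are never chosen."""
    remaining = list(range(X.shape[1]))
    chosen = []
    for _ in range(n_keep):
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = chosen + [j]
            w = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
            rss = np.sum((y - X[:, cols] @ w) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        chosen.append(best_j)
        remaining.remove(best_j)
    return chosen
```

Because each candidate is scored by its marginal reduction in residual error given the inputs already chosen, an input that merely duplicates an already-selected signal contributes nothing and is passed over, which is the "unique contribution" property emphasized in the text.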
Acknowledgments This research was supported by NSERC grant RGP-238939 (D.T.W.), NSF grant IBN-0432171 (E.J.P.) and NIH grants NS36976 (L.E.M.) and 1 K25 HD044720-01 (E.J.P.). E.A.P. was supported by NSF through an IGERT fellowship in Dynamics of Complex Systems in Science and Engineering, grant DGE-9987577. S.A.S. acknowledges the hospitality of the Kavli Institute for Theoretical Physics at the University of California, Santa Barbara, and partial NSF support under grant PHY99-07949.
References Bussgang, J. J. (1952). Crosscorrelation functions of amplitude distorted gaussian signals. MIT Res. Lab. Elec. Tech. Rep., 216, 1–14. Carmena, J. M., Lebedev, M. A., Crist, R. E., O’Doherty, J. E., Santucci, D. M., Dimitrov, D., Patil, P. G., Henriquez, C. S., & Nicolelis, M. A. (2003). Learning to control a brain-machine interface for reaching and grasping by primates. PLoS Biol., 1(2), E42. Chapin, J. K., Moxon, K. A., Markowitz, R. S., & Nicolelis, M. A. (1999). Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. Nat. Neurosci., 2(7), 664–670. Chen, S., Cowan, C., & Grant, P. (1991). Orthogonal least squares learning algorithm for radial basis function networks. IEEE Trans. Neural Netw., 2, 302–309. Donoghue, J. P. (2002). Connecting cortex to machines: Recent advances in brain interfaces. Nat. Neurosci., 5 (Suppl.), 1085–1088. Gao, Y., Black, M. J., Bienenstock, E., Wu, W., & Donoghue, J. P. (2003). A quantitative comparison of linear and non-linear models of motor cortical activity for the encoding and decoding of arm motions. In 1st International IEEE/EMBS Conference on Neural Engineering (pp. 189–192). Los Alamitos, CA: IEEE. Golub, G., & Van Loan, C. (1989). Matrix computations (2nd ed.). Baltimore, MD: Johns Hopkins University Press. Isaacs, R. E., Weber, D. J., & Schwartz, A. B. (2000). Work toward real-time control of a cortical neural prothesis. IEEE Trans. Rehabil. Eng., 8(2), 196–198. Kennedy, P. R., Bakay, R. A., Moore, M. M., Adams, K., & Goldwaithe, J. (2000). Direct control of a computer from the human central nervous system. IEEE Trans. Rehabil. Eng., 8(2), 198–202. Kim, S. P., Sanchez, J. C., Erdogmus, D., Rao, Y. N., Wessberg, J., Principe, J. C., & Nicolelis, M. (2003). Divide-and-conquer approach for brain machine interfaces: Nonlinear mixture of competitive linear models. Neural Netw., 16(5–6), 865–871. Korenberg, M. (1988). 
Identifying nonlinear difference equation and functional expansion representations: The fast orthogonal algorithm. Ann. Biomed. Eng., 16, 123–142. Lauer, R. T., Peckham, P. H., Kilgore, K. L., & Heetderks, W. J. (2000). Applications of cortical signals to neuroprosthetic control: A critical review. IEEE Trans. Rehabil. Eng., 8(2), 205–208. Maynard, E. M., Nordhausen, C. T., & Normann, R. A. (1997). The Utah Intracortical Electrode Array: A recording structure for potential brain-computer interfaces. Electroencephalogr. Clin. Neurophysiol., 102(3), 228–239. Miller, A. (1990). Subset selection in regression. London: Chapman and Hall. Mussa-Ivaldi, F. A., & Miller, L. E. (2003). Brain-machine interfaces: Computational demands and clinical needs meet basic neuroscience. Trends Neurosci., 26(6), 329–334. Nicolelis, M. A., Dimitrov, D., Carmena, J. M., Crist, R., Lehew, G., Kralik, J. D., & Wise, S. P. (2003). Chronic, multisite, multielectrode recordings in macaque monkeys. Proc. Natl. Acad. Sci. USA, 100(19), 11041–11046.
Identification of Multiple-Input Systems with Highly Coupled Inputs
355
Paninski, L., Fellows, M. R., Hatsopoulos, N. G., & Donoghue, J. P. (2004). Spatiotemporal tuning of motor cortical neurons for hand position and velocity. J. Neurophysiol., 91(1), 515–532. Perreault, E., Kirsch, R., & Acosta, A. (1999). Multiple-input, multiple-output system identification for characterization of limb stiffness dynamics. Biol. Cybern., 80, 327–337. Sanchez, J., Carmena, J., Lebedev, M., Nicolelis, M., Harris, J., & Principe, J. (2004). Ascertaining the importance of neurons to develop better brain-machine interfaces. IEEE Trans. Biomed. Eng., 51(6), 943–953. Serruya, M. D., Hatsopoulos, N. G., Paninski, L., Fellows, M. R., & Donoghue, J. P. (2002). Instant neural control of a movement signal. Nature, 416(6877), 141–142. Taylor, D. M., Tillery, S. I., & Schwartz, A. B. (2002). Direct cortical control of 3D neuroprosthetic devices. Science, 296(5574), 1829–1832. Taylor, D. M., Tillery, S. I., & Schwartz, A. B. (2003). Information conveyed through brain-control: Cursor versus robot. IEEE Trans. Neural. Syst. Rehabil. Eng., 11(2), 195–199. Warland, D. K., Reinagel, P., & Meister, M. (1997). Decoding visual information from a population of retinal ganglion cells. J. Neurophysiol., 78(5), 2336–2350. Wessberg, J., Stambaugh, C. R., Kralik, J. D., Beck, P. D., Laubach, M., Chapin, J. K., Kim, J., Biggs, S. J., Srinivasan, M. A., & Nicolelis, M. A. (2000). Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature, 408(6810), 361–365. Westwick, D. T., & Kearney, R. E. (1997a). Generalized eigenvector algorithm for nonlinear system identification with non-white inputs. Ann. Biomed. Eng., 25(5), 802–814. Westwick, D. T., & Kearney, R. (1997b). Identification of physiological systems: A robust method for non-parametric impulse response estimation. Med. Biol. Eng. Comput., 35(2), 83–90. Williams, J. C., Rennaker, R. L., & Kipke, D. R. (1999). 
Long-term neural recording characteristics of wire microelectrode arrays implanted in cerebral cortex. Brain Res. Protoc., 4(3), 303–313. Wu, W., Black, M. J., Mumford, D., Gao, Y., Bienenstock, E., & Donoghue, J. (2003). A switching Kalman filter model for the motor cortical coding of hand motion. In 25th International IEEE/EMBS Conference. Los Alamitos, CA: IEEE.
Received August 4, 2004; accepted June 30, 2005.
LETTER
Communicated by Bard Ermentrout
Oscillatory Networks: Pattern Recognition Without a Superposition Catastrophe Thomas Burwick
[email protected]
Institut für Neuroinformatik, Ruhr-Universität Bochum, 44306 Bochum, Germany
Using an oscillatory network model that combines classical network models with phase dynamics, we demonstrate how the superposition catastrophe of pattern recognition may be avoided in the context of phase models. The model is designed to meet two requirements: on and off states should correspond, respectively, to high and low phase velocities, and patterns should be retrieved in coherent mode. Nonoverlapping patterns can be simultaneously active with mutually different phases. For overlapping patterns, competition can be used to reduce coherence to a subset of patterns. The model thereby solves the superposition problem. 1 Introduction For a network with two or more active patterns, where activity is solely defined by on and off states, a subsequent stage of information processing may not be able to read out single patterns. This so-called superposition catastrophe has long been recognized as a major challenge for neural network modeling (Rosenblatt, 1961). It is related to the binding problem (see Roskies, 1999; Müller, Elliott, Herrmann, & Mecklinger, 2001). The superposition and binding problems motivated von der Malsburg to propose grouping of neural units based on temporal correlation. This allows several nonoverlapping patterns to be active at the same time and still be separable due to different temporal properties (von der Malsburg, 1981, 1985). For example, they may synchronize with different phases for different patterns (von der Malsburg & Schneider, 1986). Subsequently, associative memory based on temporal coding has been implemented in oscillatory neural networks. Recognized patterns may then correspond to limit cycles. The first models that implemented temporal coding were based on Wilson-Cowan-like dynamics, where the oscillators were defined in terms of coupled excitatory and inhibitory units (von der Malsburg & Schneider, 1986; Baird, 1986; Freeman, Yao, & Burke, 1988; Li & Hopfield, 1989).
These approaches have been extended in Wang, Buhmann, and von der Malsburg (1990) and von der Malsburg and Buhmann (1992).
Neural Computation 18, 356–380 (2006) © 2005 Massachusetts Institute of Technology
Segmentation is a particular example of the superposition problem. Correspondingly, many applications are concerned with this task (e.g., Terman & Wang, 1995; Wang &
Terman, 1995, 1997) with applications such as medical image segmentation (Shareef, Wang, & Yagel, 1999). Oscillatory networks with associative memory were also studied by using a phase description for the oscillators. Whenever such a parameterization is possible, the analysis of oscillatory systems may be simplified significantly (see Kuramoto, 1984; Winfree, 2001). In this article, we study the avoidance of the superposition catastrophe in the context of phase models. Our discussion is based on a network model, where the real-valued activity uk of each unit k is complemented with a phase θk , k = 1, . . . , N. The phases θk are supposed to parameterize a temporal structure of the signals. We consider a generalization of classical neural network dynamics: duk 1 wkl (θ ) g(ul ) + Ik , = −uk + dt N
(1.1a)
dθk = 2π g(uk ) (1 + Sk (u, θ )) , dt
(1.1b)
wkl (θ ) = gkl (α + β cos (θk − θl )),
(1.2)
N
τu
l=1
τθ where
Sk (u, θ ) =
N β gkl sin (θl − θk ) g(ul ) . N
(1.3)
l=1
The activation function is g(x) = (1 + tanh (x)) /2, the τu , τθ > 0 are timescales, and the Ik are external inputs. The wkl (θ ) are phase-dependent weights, specified by the gkl and real-valued parameters α, β ≥ 0. In our examples, we use α > β. In accordance with the mentioned interpretation of the phases, the model is designed so that the limit 1(0) of g(uk ) describes an on(off) state, leading to high(low) frequencies of θk . This motivates the factor g(uk ) on the right-hand side of equation 1.1b. The Sk (u, θ ) will correct the phase velocity and imply synchronization. For on-states with g(uk ) → 1, the phase velocity will approach (2π/τθ ) (1 + Sk ), while for off-states, g(uk ) → 0, the phase velocity will vanish. (For more detailed discussion and motivation of wkl (θ ), Sk (u, θ ) see below.) Understanding a possible relevance of equations 1.1 to biology would be of interest. An interpretation of the units in terms of single biological neurons is not intended. Possibly an interpretation in terms of populations of neurons may be found. An exact interpretation of variables in biological terms, however, is outside the scope of this letter. For the purpose of this letter, it is sufficient to see the model as a formal framework that allows the implementation of the benefits that temporal coding should hold for pattern recognition.
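As a numerical illustration, equations 1.1 to 1.3 can be integrated with a simple Euler scheme. The following sketch is ours, not the author's code; the step size and parameter values are arbitrary choices:

```python
import numpy as np

def g(x):
    """Activation function g(x) = (1 + tanh(x)) / 2."""
    return 0.5 * (1.0 + np.tanh(x))

def euler_step(u, theta, gkl, I, alpha, beta, tau_u=1.0, tau_theta=1.0, dt=0.01):
    """One Euler step of equations 1.1, with weights 1.2 and coupling 1.3."""
    N = len(u)
    # eq. 1.2: phase-dependent weights w_kl(theta)
    w = gkl * (alpha + beta * np.cos(theta[:, None] - theta[None, :]))
    # eq. 1.3: S_k(u, theta), the synchronizing correction
    S = (beta / N) * (gkl * np.sin(theta[None, :] - theta[:, None])
                      * g(u)[None, :]).sum(axis=1)
    # eq. 1.1a and 1.1b
    du = (-u + (w * g(u)[None, :]).sum(axis=1) / N + I) / tau_u
    dtheta = (2.0 * np.pi / tau_theta) * g(u) * (1.0 + S)
    return u + dt * du, theta + dt * dtheta
```

A single step already exhibits the intended on/off behavior: a unit with large positive u advances its phase at roughly 2π/τθ, while a unit with large negative u has g(u) ≈ 0 and an almost frozen phase.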
Specifying an oscillatory network model requires two choices to be made. First, a model for the single oscillators without coupling has to be chosen. Second, the coupling between the oscillators has to be specified. A common choice for the oscillator dynamics is the Stuart-Landau model. This assigns a complex-valued dynamics to the oscillator and may be derived from a general ordinary differential equation near the Hopf bifurcation (see Kuramoto, 1984). A natural choice for the couplings is linear in the complex-valued normal form coordinate (Kuramoto, 1975). The complete system may then be interpreted as a generalized version of the discrete complex-valued Ginzburg-Landau reaction-diffusion system. In the following, we refer to it as the generalized Ginzburg-Landau (GGL) model. Most approaches toward associative memory in the context of models with phases and amplitudes use the GGL model with vanishing shear (see Hoppensteadt & Izhikevitch, 2003, for a short review). Using pure phase models as an adiabatic approximation, associative memory has been implemented by identifying the coupling strengths with Hebbian weights (Abbott, 1990; Sompolinsky, Golomb, & Kleinfeld, 1990, 1991; Baldi & Meir, 1990; Kuramoto, Aoyagi, Nishikawa, Chawanya, & Okuda, 1992; Sompolinsky & Tsodyks, 1992; Kuzmina & Surina, 1994; Kuzmina, Manykin, & Surina, 1995). The corresponding step in the context of phase and amplitude dynamics was taken by using a complexified analog of the Cohen-Grossberg-Hopfield function (Takeda & Kishigami, 1992; Chakravaraty & Gosh, 1994; Aoyagi, 1995; Hoppensteadt & Izhikevitch, 1996). This approach imitated the method that was used for discrete dynamics (Noest, 1988a, 1988b). Reviews of complex-valued neural networks and their application to pattern recognition may also be found in Hirose (2003). The GGL model is the simplest model for coupling Stuart-Landau oscillators, but not the only possible one.
Alternative models have been studied with and without regard to associative memory. For example, Tass and Haken proposed three different models and used one to model synchronization of neural activity in the visual cortex (Tass, 1993; Tass and Haken, 1996a, 1996b). In the context of associative memory, the Stuart-Landau oscillator of the GGL model was replaced with an oscillator model that has on and off states (Aoyagi, 1995). A phase-locked loop model was introduced to allow implementations with well-developed circuit technology (Hoppensteadt & Izhikevitch, 2000). Another model was proposed that allowed a smooth transition between fixed-point mode and oscillatory mode by appropriately changing a parameter (Chakravarathy & Gosh, 1996). We will also find that the model of equations 1.1 is not of the GGL type. In section 2, we review associative memory based on the GGL model with vanishing shear and identical natural frequencies. We will argue that the phase dynamics of this model is not suited to solve the superposition problem by giving different phases to different patterns and the background. In the context of weakly coupled GGL models, a solution has been proposed that relates the pooling of neural oscillators to frequency
gaps resulting from nonidentical natural frequencies (frequency modulation, FM; see Hoppensteadt & Izhikevitch, 1997, sec. 5.4.2). In section 3, we take a different approach of modifying the underlying system and starting from equations 1.1 instead. This will allow the use of phase gaps to prevent the superposition catastrophe. We give a detailed comparison of equations 1.1 and the GGL model and also relate our approach to the FM proposal. In section 4, we present simple examples. In section 5, we conclude with a summary and outlook. 2 The Generalized Ginzburg-Landau Model We now review and discuss associative memory based on the GGL model. Notice, however, that we do not discuss the storage of complex-valued patterns. The changes may be included without difficulty by going to complex-valued Hermitean weights (Noest, 1988a, 1988b; Takeda & Kishigami, 1992; Chakravarathy & Gosh, 1994; Aoyagi, 1995). We present the models in a form that will allow a convenient comparison with equations 1.1 in section 3. 2.1 The GGL Model as Gradient System. The discrete complex Ginzburg-Landau (GGL) model may be expressed in terms of complex coordinates (Kuramoto, 1975; see also Hoppensteadt & Izhikevitch, 1997, sec. 10.1–3):

GGL: \frac{dz_k}{dt} = \tilde{I}_k + i\omega_k z_k - (\sigma_k + i\eta_k)\, z_k |z_k|^2 + \frac{\beta}{N}\sum_{l=1}^{N} g_{kl} z_l, \qquad (2.1)
where Ĩk, ωk, σk, ηk, and the weights g_{kl} are real-valued parameters, k = 1, . . . , N. Associative memory has been introduced for the parameter choices ηk = 0, corresponding to vanishing shear, and ωk = Ω (see Hoppensteadt & Izhikevitch, 1997, sec. 10.4, and the short review in Hoppensteadt & Izhikevitch, 2003). The latter choice allows us to set ωk = 0 by going to the comoving frame, θk → θk + Ωt. We may also set σk = 1. With z_k = V_k exp(iθ_k), \bar{z}_k = V_k exp(−iθ_k), where V, θ are real, equation 2.1 is then equivalent to

GGL: \frac{dV_k}{dt} = \tilde{I}_k V_k - V_k^3 + \frac{1}{N}\sum_{l=1}^{N} \tilde{w}_{kl}(\theta)\, V_l, \qquad (2.2a)

GGL: \frac{d\theta_k}{dt} = \frac{1}{V_k}\, \tilde{S}_k(V, \theta), \qquad (2.2b)
where \tilde{w}_{kl} is w_{kl} with α = 0, as given by equation 1.2, and \tilde{S}_k is obtained from S_k by replacing g(u_k) → V_k. The g_{kl} sin(θ_l − θ_k) V_l terms in \tilde{S}_k(V, θ) tend to synchronize (desynchronize) unit k with units l whenever g_{kl} > 0 (g_{kl} < 0). Without couplings, g_{kl} = 0, it is easily seen that for Ĩk < 0, the origin V_k = 0 is a stable fixed point; at Ĩk = 0, the unit k experiences a Hopf bifurcation, generating a stable limit cycle with radius \sqrt{\tilde{I}_k} for Ĩk > 0. Introducing the Lyapunov function

\tilde{L}(V, \theta) = -\frac{1}{2N}\sum_{k,l} \tilde{w}_{kl}(\theta)\, V_k V_l + P(V) \qquad (2.3)

with

P(V) = \sum_k \left( -\frac{\tilde{I}_k}{2} V_k^2 + \frac{1}{4} V_k^4 \right) \qquad (2.4)

allows us to express the dynamics of equations 2.2 as a gradient system (Noest, 1988a, 1988b; Takeda & Kishigami, 1992; Chakravarathy & Gosh, 1994; Aoyagi, 1995; Hoppensteadt & Izhikevitch, 1996). Assuming that the g_{kl} are symmetric, one obtains

GGL: \frac{dV_k}{dt} = -\frac{\partial}{\partial V_k} \tilde{L}(V, \theta), \qquad (2.5a)

GGL: \frac{d\theta_k}{dt} = -\frac{1}{V_k^2} \frac{\partial}{\partial \theta_k} \tilde{L}(V, \theta). \qquad (2.5b)
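The gradient property of equations 2.5 can be checked numerically. The sketch below is our own code (function names ours), assuming symmetric g_kl, σ_k = 1, and the α = 0 weights of equation 2.2; it integrates equations 2.2 with an Euler step and evaluates the Lyapunov function of equations 2.3 and 2.4:

```python
import numpy as np

def lyapunov(V, theta, gkl, beta, I_tilde):
    """L-tilde of equations 2.3-2.4, using the alpha = 0 weights."""
    N = len(V)
    w = beta * gkl * np.cos(theta[:, None] - theta[None, :])
    coupling = -(w * np.outer(V, V)).sum() / (2.0 * N)
    potential = np.sum(-0.5 * I_tilde * V**2 + 0.25 * V**4)
    return coupling + potential

def ggl_step(V, theta, gkl, beta, I_tilde, dt=1e-3):
    """One Euler step of the GGL amplitude/phase dynamics, equations 2.2."""
    N = len(V)
    w = beta * gkl * np.cos(theta[:, None] - theta[None, :])
    S = (beta / N) * (gkl * np.sin(theta[None, :] - theta[:, None])
                      * V[None, :]).sum(axis=1)
    dV = I_tilde * V - V**3 + (w @ V) / N
    dtheta = S / V
    return V + dt * dV, theta + dt * dtheta
```

Running `ggl_step` repeatedly from a random initial condition and monitoring `lyapunov` shows the value decreasing along trajectories, consistent with the gradient structure (provided no amplitude approaches zero, where the 1/V_k factor diverges).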
The minima of \tilde{L}(V, θ) should then correspond to storage of patterns. In section 2.2, we specify the storage of patterns and relate \tilde{L}(V, θ) to a coherence measure for the patterns. 2.2 Memory, Competition, and Coherence. We consider P patterns with components ξ_k^p ∈ {0, 1}, k = 1, . . . , N. Using these patterns, we specify the weights g_{kl} of equations 1.2 and 1.3 as

g_{kl} = d_{kl} + \underbrace{\frac{1}{P}\sum_p \xi_k^p \xi_l^p}_{\equiv\, h_{kl}(\xi)} + \lambda \underbrace{\frac{-1}{P^2}\sum_{p \neq q} \xi_k^p \xi_l^q}_{\equiv\, c_{kl}(\xi)}. \qquad (2.6)
The couplings d correspond to a background geometry, the excitatory part h corresponds to Hebbian memory, and the inhibitory part c establishes
competition between the patterns with λ ≥ 0. Analogously, any inhibitory part of d establishes competition between units. The complex coordinates of equation 2.1 may be used to define a coherence measure C_p for each pattern p:

Z_p(V, \theta) = \frac{1}{N_p}\sum_k \xi_k^p z_k = C_p \exp(i\Phi_p), \qquad (2.7)

where Z_p, C_p, Φ_p depend on V, θ and |Z_p| = C_p, 0 ≤ C_p ≤ 1. These correspond to the Kuramoto coherence measure restricted to the active parts of the stored patterns (Kuramoto, 1984; see Strogatz, 2000, for a review of the Kuramoto model and references to recent results). We call Φ_p the phase of pattern p and C_p its coherence. N_p denotes the number of nonzero components of pattern p. Using equation 2.6 with vanishing background geometry, that is, d = 0, equation 2.3 may then be written as

\tilde{L} = -\frac{\beta}{2N}\sum_{k,l} g_{kl}\cos(\theta_k - \theta_l)\, V_k V_l + P
 = -\frac{\beta}{2N}\,\mathrm{Re}\sum_{k,l} g_{kl}\exp\big(i(\theta_k - \theta_l)\big)\, V_k V_l + P
 = -\frac{\beta N}{2}\left( \frac{1}{P}\sum_p \rho_p^2 C_p^2 - \frac{\lambda}{P^2}\sum_{p \neq q} \rho_p \rho_q C_p C_q \cos(\Phi_p - \Phi_q) \right) + P, \qquad (2.8)

where ρ_p = N_p/N denotes the density of pattern p. Since \tilde{L} could be given this form, equations 2.5 imply that, without competition, that is, λ = 0, the β-dependent terms of equations 2.2 introduce a tendency to maximize the coherence of all patterns. With λ > 0, this tendency is accompanied by a competition between the patterns, so that any pattern p with significant coherence C_p will try to suppress the coherence C_q of any other pattern q or will arrange for a phase difference between p and q so that \tilde{L} is minimized. For example, assume P = 2 and both patterns are nonoverlapping. The first sum in the last line of equation 2.8 will imply C_1, C_2 → 1, while the second sum introduces a phase difference (Φ_2 − Φ_1) → π/2 mod π in order to minimize \tilde{L}. In fact, such a dynamics is exactly what we observe in example 3 in section 4.4.1. This tendency toward coherence of a pattern and decoherence between nonoverlapping patterns bears a resemblance to the results obtained for the Wang-Terman model (Terman & Wang, 1995). The Wang-Terman model, however, is based on coupled relaxation oscillators and patterns were not stored via Hebbian connections. Instead, a specific two-dimensional
background geometry was assumed for the purpose of image segmentation. Using the two-dimensional background geometry makes the Wang-Terman model rather comparable to lattice models such as the ones studied in Sakaguchi, Shinomoto, and Kuramoto (1987, 1988). The latter, however, were not applied to image segmentation, do not use inhibition, and do not have on and off states. Moreover, differences between the synchronization properties of relaxation and nonrelaxation oscillators have been described in Somers and Kopell (1993, 1995) and Izhikevitch (2000). 2.3 Phase Dynamics with Vanishing Amplitudes. The coupling in equation 2.2b has a remarkable effect on the synchronization properties (see also Hoppensteadt & Izhikevitch, 1997). It contains the terms

GGL: \frac{1}{V_k}\sum_{l=1}^{N} g_{kl}\sin(\theta_l - \theta_k)\, V_l \qquad (2.9)
on the right-hand side of equation 2.2b. Due to the factor 1/V_k, the units get more susceptible to phase differences as V_k → 0. Moreover, a state V_k approaching the off-state is synchronized more strongly with on-states V_l, not with other off-states (we assume g_{kl} > 0). Thus, there is a strong tendency toward global synchrony (infinitely strong for off-states, since 1/V_k → ∞ as V_k → 0), where on- and off-states all acquire the same phase. In section 3.2 we will assume V_k = g(u_k) in order to compare equations 1.1 with the GGL model. The tendency toward global coherence, resulting from equation 2.9, is then not compatible with our interpretation of the V_k (see section 1). We intend to identify on and off states of units, respectively, with high and low frequencies dθ_k/dt. Therefore, the phase of a unit that is close to an off-state should not be strongly driven toward synchronization with on-state phases. When returning to equations 1.1a and 1.1b in the next section, we will therefore find that our model differs from the associative memory GGL model not only by introducing the frequency-driving activity dθ_k/dt = (2π/τ_θ) g(u_k) + . . . , but also by accompanying this change with an alternative phase coupling that allows off-state units to have phases that differ from on-state phases. This feature is necessary for using phase differences to avoid the superposition problem. 2.4 Weak Couplings and Temporal Coding. The central topic of this article is the superposition problem and its possible solution based on grouping of neural oscillators due to temporal properties. Therefore, it should be compared to an earlier approach that will now be reviewed in the context of weakly coupled systems.
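The pattern weights of equation 2.6 (with d = 0) and the coherence measure of equation 2.7 can be sketched in a few lines; this is illustrative NumPy code with function names of our own choosing:

```python
import numpy as np

def pattern_weights(xi, lam):
    """Weights g_kl of equation 2.6 with no background geometry (d = 0).
    xi is a (P, N) array of binary patterns; lam is the competition strength."""
    P = xi.shape[0]
    h = xi.T @ xi / P                                   # Hebbian part h_kl
    s = xi.sum(axis=0)                                  # per-unit pattern counts
    c = -(np.outer(s, s) - xi.T @ xi) / P**2            # competition part c_kl
    return h + lam * c

def pattern_coherence(xi_p, V, theta):
    """Coherence C_p and phase Phi_p from the order parameter Z_p, eq. 2.7."""
    z = V * np.exp(1j * theta)                          # z_k = V_k exp(i theta_k)
    Zp = (xi_p * z).sum() / xi_p.sum()                  # N_p = number of active units
    return np.abs(Zp), np.angle(Zp)
```

The competition part uses the identity Σ_{p≠q} ξ_k^p ξ_l^q = (Σ_p ξ_k^p)(Σ_q ξ_l^q) − Σ_p ξ_k^p ξ_l^p to avoid an explicit double loop. Active units sharing one phase give C_p = 1 (for unit amplitudes); active units split between opposite phases give C_p = 0.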
Oscillatory Networks
For the case of weak couplings, the amplitude dynamics in equations 2.2, Ĩ_k > 0, k = 1, . . . , N, may be adiabatically eliminated (see Kuramoto, 1984). This will leave the phase dynamics

  dθ_k/dt = ω_k + (β/N) Σ_{l=1}^{N} g̃_kl sin(θ_l − θ_k),    (2.10)

where 0 < β ≪ 1 and g̃_kl is obtained from the adiabatic values V_k → V_k^∞, V_l → V_l^∞, leading to g̃_kl = (V_l^∞/V_k^∞) g_kl. Weak couplings will also imply V_k^∞ > 0, so that g̃_kl < ∞. In contrast, oscillator death with V_k^∞ = 0 (as well as self-ignition, Ĩ_k < 0) may occur for stronger couplings (see Aronson, Ermentrout, & Kopell, 1990). In equation 2.10, we also reestablished nonidentical frequencies ω_k. In equation 2.2b, these were set to Ω and eliminated by going to the comoving frame, θ_k → θ_k + Ωt. Nonidentical frequencies have been used to propose neural grouping based on different frequencies (see Hoppensteadt & Izhikevitch, 1997, sec. 5.4.2). This frequency modulation (FM) proposal uses a number of basic frequencies Ω_γ, γ = 1, . . . , G, and assumes that

  ω_k ∈ {Ω_1, . . . , Ω_G},    (2.11)
with G ≤ N. Two units k and l will then belong to the same group (ensemble, pool) if their natural frequencies are the same, Ω(k) = Ω(l), up to terms of order β. This approach gets particularly interesting when higher-order phase couplings are included in equation 2.10. The equality condition is then generalized to a resonance condition (Hoppensteadt & Izhikevitch, 1997, sec. 9.4.3). In section 3.4, we relate the FM proposal to our approach.

3 Phase Extension of Classical Networks

In this section we return to the model of equations 1.1. The system is compared with the GGL model, pattern recognition capabilities are discussed, and temporal coding is compared with the FM proposal mentioned at the end of the previous section.

3.1 The Model. The system of equations 1.1 reads

  τ_u du_k/dt = −u_k + (1/N) Σ_{l=1}^{N} w_kl(θ) g(u_l) + I_k,    (1.1a)

  τ_θ dθ_k/dt = 2π g(u_k) (1 + S_k(u, θ)),    (1.1b)
T. Burwick
where

  w_kl(θ) = g_kl (α + β cos(θ_k − θ_l)),    (1.2)

  S_k(u, θ) = (β/N) Σ_{l=1}^{N} g_kl sin(θ_l − θ_k) g(u_l).    (1.3)
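For concreteness, equations 1.1a and 1.1b, together with the weights w_kl(θ) and the coupling S_k(u, θ) just defined, can be integrated with a simple Euler scheme. The following sketch is ours, not the article's; it assumes the logistic activation g(u) = 1/(1 + e^(−2u)), which is consistent with the inverse g⁻¹ given later in equation 3.4:

```python
import math

def g(u):
    # logistic activation; its inverse is g^{-1}(V) = (1/2) ln(V/(1-V))
    return 1.0 / (1.0 + math.exp(-2.0 * u))

def step(u, theta, gkl, I, alpha, beta, tau_u, tau_theta, dt):
    """One Euler step of equations 1.1a and 1.1b with
    w_kl(theta) = gkl[k][l] * (alpha + beta * cos(theta_k - theta_l)) and
    S_k = (beta/N) * sum_l gkl[k][l] * sin(theta_l - theta_k) * g(u_l)."""
    N = len(u)
    V = [g(x) for x in u]
    u_new, theta_new = [], []
    for k in range(N):
        drive = sum(gkl[k][l] * (alpha + beta * math.cos(theta[k] - theta[l]))
                    * V[l] for l in range(N)) / N
        Sk = (beta / N) * sum(gkl[k][l] * math.sin(theta[l] - theta[k]) * V[l]
                              for l in range(N))
        u_new.append(u[k] + dt * (-u[k] + drive + I[k]) / tau_u)
        theta_new.append(theta[k] + dt * 2.0 * math.pi * V[k] * (1.0 + Sk)
                         / tau_theta)
    return u_new, theta_new
```

With β = 0, the phase update reduces to the uncoupled oscillations dθ_k/dt = (2π/τ_θ) g(u_k) noted in the text.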
The weights w_kl agree with the w̃_kl of the GGL model, except for the parameter α that was introduced to give the weights also a classical part. The term classical refers to phase independence (see the discussion in von der Malsburg, 1999). This difference was motivated by introducing in equations 1.1 a phase coupling that does not replace but complements classical network dynamics. In fact, for the examples of section 4, we always use α > β ≥ 0. The S_k(u, θ) are obtained from the S_k(V, θ) by replacing V_l → g(u_l). With β > 0, the sin(θ_l − θ_k) in S_k introduce a tendency to synchronize (desynchronize) the units k and l, assuming g_kl > 0 (g_kl < 0), and the cos(θ_k − θ_l) in w_kl strengthens (weakens) the couplings if the units are in phase (out of phase). These features agree with those of equations 2.2. Notice, however, the different strengths of the phase couplings, given by g(u_k) in equations 1.1 and by 1/V_k in equations 2.2, which lead to the different behavior of units that are close to off-states. We will return to a comparison between equations 1.1 and equations 2.2 in section 3.2. With β = 0, the weights are constant, w_kl(θ) = α g_kl, and equation 1.1a decouples from the phase dynamics, thereby reducing to a classical model, while the phase dynamics reduces to uncoupled oscillations, dθ_k/dt = (2π/τ_θ) g(u_k). In this sense, the complete dynamics of equations 1.1a and 1.1b is an extension of the classical dynamics.

3.2 Gradient Dynamics, Frequency-Driving Activity, and Comparison with GGL Models. In the following, we identify V_k ≡ |z_k| = g(u_k). Interpreting the activity g(u_k) as an amplitude allows a comparison of equations 1.1 with 2.2. Notice that we do not restrict the V_k to be small; the range of the V_k is 0 < V_k < 1, due to the definition of g. As |u_k| → ∞, the V_k run into saturation. The comparison of equations 1.1 with 2.2 is most transparent when relating equations 1.1 to gradient dynamics, in analogy to section 2.1.
Using the weights w_kl(θ) of equation 1.2, a natural generalization of the Cohen-Grossberg-Hopfield function (Cohen & Grossberg, 1983; Hopfield, 1984) for amplitudes and phases reads

  L(V, θ) = −(1/2N) Σ_{k,l} w_kl(θ) V_k V_l + P(V),    (3.1)
where the potential is given by

  P(V) = Σ_k ∫^{V_k} ( g⁻¹(x) − I_k ) dx    (3.2)

       = Σ_k [ (1/2) ( V_k ln V_k + (1 − V_k) ln(1 − V_k) ) − I_k V_k ].    (3.3)

For deriving equation 3.3, the u_k may be obtained from the V_k through

  u_k = g⁻¹(V_k) = (1/2) ln( V_k / (1 − V_k) ).    (3.4)
We applied the integration formula ∫ ln x dx = x ln x − x and set the integration constant in equation 3.3 to zero. We may now relate equations 1.1a and 1.1b to L, just as equations 2.2 were related to L̃. Using dg(u_k)/du_k = 2V_k(1 − V_k), equations 1.1 may be written as

  τ_u dV_k/dt = 2 V_k (1 − V_k) [ −(1/2) ln( V_k / (1 − V_k) ) + (1/N) Σ_{l=1}^{N} w_kl(θ) V_l + I_k ]    (3.5a)

             = 2 V_k (1 − V_k) ( −∂L(V, θ)/∂V_k ),    (3.5b)

  τ_θ dθ_k/dt = 2π V_k ( 1 + S_k(V, θ) )    (3.5c)

             = 2π V_k − 2π ∂L(V, θ)/∂θ_k.    (3.5d)
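The gradient identity in equation 3.5d, −∂L/∂θ_k = V_k S_k(V, θ), can be verified with central finite differences. The sketch below is our check, not the article's; it assumes a symmetric g_kl (as required for the identity), and the parameter values are arbitrary:

```python
import math

def L(V, theta, gkl, I, alpha, beta):
    """Equation 3.1 with the potential of equation 3.3."""
    N = len(V)
    coupling = sum(gkl[k][l] * (alpha + beta * math.cos(theta[k] - theta[l]))
                   * V[k] * V[l] for k in range(N) for l in range(N)) / (2.0 * N)
    P = sum(0.5 * (V[k] * math.log(V[k]) + (1.0 - V[k]) * math.log(1.0 - V[k]))
            - I[k] * V[k] for k in range(N))
    return -coupling + P

def S(k, V, theta, gkl, beta):
    """Coupling S_k of equations 1.1, with g(u_l) written as V_l."""
    N = len(V)
    return (beta / N) * sum(gkl[k][l] * math.sin(theta[l] - theta[k]) * V[l]
                            for l in range(N))

# Central-difference check of equation 3.5d: -dL/dtheta_k == V_k * S_k.
gkl = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.2], [0.5, 0.2, 0.0]]  # symmetric weights
V, theta, I = [0.3, 0.6, 0.8], [0.1, 1.2, 2.5], [0.1, 0.2, 0.3]
alpha, beta, eps, k = 1.0, 0.7, 1e-6, 1
tp = theta[:]; tp[k] += eps
tm = theta[:]; tm[k] -= eps
num = -(L(V, tp, gkl, I, alpha, beta) - L(V, tm, gkl, I, alpha, beta)) / (2 * eps)
print(num, V[k] * S(k, V, theta, gkl, beta))  # the two values agree
```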
The system splits into a gradient part and the frequency-driving activity term 2π V_k. For the remainder of this section, we compare the system of equations 3.5 with the GGL model of equations 2.2 and 2.5. The amplitude dynamics of equation 3.5a is similar to equation 2.2a. The ln term in equation 3.5a is a higher-order analog of the Ĩ_k V_k − V_k³ terms in equation 2.2a. The additional factors V_k (1 − V_k) correspond to the saturation as V_k → 0, 1. The main difference is in the phase dynamics. Notice first that the GGL model could be formulated as the pure gradient system of equations 2.5a and 2.5b only by assuming vanishing shear, that is, η_k = 0 in equation 2.1. Therefore, the term 2π V_k in equation 3.5d may be compared to the shear term of the GGL model, which is usually set to zero when pattern recognition is implemented (see Hoppensteadt & Izhikevitch, 1997, sec. 10.4). In
contrast, we consider the nonvanishing shear-like term to be essential for our interpretation of the oscillatory units. Setting the 2π g(u_k) + . . . term on the right-hand side of equation 1.1b to zero would destroy the relation between the activity u_k and the frequency dθ_k/dt of the signals. This realizes the requirement that on and off states should correspond, respectively, to high and low frequencies of the phases. Moreover, in order to establish a consistent picture, the effect of the S_k(u, θ) should not be growing as V_k → 0. In section 2.3, we mentioned that the coupling S_k/V_k in equation 2.2b leads to global synchrony, establishing a strong tendency toward equal phases of on- and off-states. Such a coupling would be in conflict with low frequencies as V_k → 0. Therefore, an obvious difference between equations 2.2 and equations 3.5 is that the latter uses the coupling V_k S_k instead. Thereby, synchronization between units k and l, assuming g_kl > 0, is enforced only for on-states; that is, the unit k gets less susceptible to phase differences as V_k → 0. In terms of the gradient system, the difference arises due to the factor 1/V_k² in equation 2.5b that is absent in equation 3.5d. Therefore, with regard to the V_k, the coupling in equations 3.5a to 3.5d is of higher order than the couplings of the GGL model.

3.3 Pattern Recognition Capabilities. Patterns stored according to equation 2.6 are related to minima of L. Correspondingly, one could expect that the gradient dynamics in equations 3.5a to 3.5d tends to retrieve stored patterns by approaching the minima of L. However, due to the term 2π V_k in equation 3.5d, the system of equations 1.1 is not a pure gradient system. We find

  τ_u dL(V, θ)/dt = − Σ_k [ 2 V_k (1 − V_k) ( ∂L/∂V_k )² + 2π (τ_u/τ_θ) ( ∂L/∂θ_k ) ( ∂L/∂θ_k − V_k ) ].    (3.6)
In contrast to GGL models with vanishing shear, the right-hand side includes terms that may be nonnegative. These may imply an increase of the L(V, θ) values. Let us write the corresponding terms in the form

  Σ_k V_k ∂L/∂θ_k = (β/N) Σ_{k,l} sin(θ_l − θ_k) ((V_l − V_k)/2) V_k g_kl V_l.    (3.7)
In the following, we will discuss the effect of these terms on the pattern recognition capability of the system.
The terms of equation 3.7 result from the first term 2π V_k + . . . on the right-hand side of equation 1.1b. Why could this term eventually cause an increase of L(V, θ)? Say, θ_k = θ_l + δθ (mod 2π) with 0 < δθ < π. On one hand, the sign of sin(θ_l − θ_k) is such that the θ_l tend to be sped up and the θ_k tend to be slowed down in order to reach synchrony (assuming g_kl > 0 and α > β ≥ 0). With V_k − V_l < 0, this dynamics is supported by the frequency-driving activity terms. The corresponding terms in equation 3.7 will be negative and will imply a tendency of L(V, θ) to decrease. On the other hand, for V_k − V_l > 0, the activity term may outvote the sin(θ_l − θ_k) tendency: θ_k could remain faster than θ_l. This corresponds to the positive terms of equation 3.7, which may cause L(V, θ) to increase. As a result, however, θ_k will approach θ_l from "the other side"; that is, the situation θ_k = θ_l − δθ (mod 2π) is reached. Should V_k − V_l > 0 still be the case, the terms of equation 3.7 will then become negative and will imply a tendency of L(V, θ) to decrease again. Obviously, without additional inquiry, we may not judge whether the foregoing increase will be cancelled by the following decrease of L(V, θ). Therefore, we now look more closely at the V dynamics. Assuming an adiabatic approximation, τ_u ≪ τ_θ, we find that the only term that may become negative inside the bracket of equation 3.6 is suppressed by (τ_u/τ_θ). Moreover, due to their (fast) dynamics, the V_k will have approached their on or off states (then ∂L/∂V_k → 0) when the terms of order τ_u/τ_θ get relevant. Assuming sufficiently large values of α in equation 1.2, we may arrange the set of indices k so that V_k = 1 − ε_k, k = 1, . . . , M, and V_k = ε_k, k = M + 1, . . . , N, where 0 < ε_k ≪ 1, for some M ≤ N. Then

  Σ_k V_k ∂L/∂θ_k → (β/N) Σ_{k,l=1}^{M} sin(θ_k − θ_l) g_kl + O(ε) = O(ε),    (3.8)
due to the antisymmetry of sin(θ_k − θ_l) g_kl. Combining these aspects, we find that the frequency-driving activity terms only imply a (τ_u/τ_θ) O(ε) correction to dL/dt, and we may expect that these terms will not significantly affect the pattern recognition capabilities. This expectation will be confirmed by the examples in section 4.

3.4 Comparison with the Frequency Modulation Approach to Temporal Coding. Before giving examples for the pattern recognition behavior, we want to relate equations 1.1 to the frequency modulation (FM) approach described in section 2.4. Remember the weak coupling limit in equation 2.10:

  dθ_k/dt = Ω(k) + (β/N) Σ_{l=1}^{N} g̃_kl sin(θ_l − θ_k).

Equation 1.1b:

  dθ_k/dt = (2π/τ_θ) g(u_k) + (2πβ/(N τ_θ)) Σ_{l=1}^{N} g(u_k) g_kl g(u_l) sin(θ_l − θ_k).    (3.9)
Here, we have added a comparison with equation 1.1b. We may relate the FM proposal to equations 1.1a and 1.1b by assuming two basic frequencies: Ω_1 = 0 (off-states) and Ω_2 = 2π/τ_θ (on-states). The difference is that for the FM approach, the Ω_γ have been external parameters, while in equations 1.1a and 1.1b, they are subject to the dynamics of equation 1.1a. Whether the system approaches Ω_1 or Ω_2 depends on external inputs and initial values. Notice also that in the context of the complete dynamics of equations 1.1a and 1.1b, the couplings are also subject to a dynamics. Due to the term g(u_k) g_kl g(u_l), a synchronization is not enforced if one or both of the units k, l approach an off-state. In the context of brain dynamics, it has been speculated that the external character of natural frequencies would actually turn dynamical in a more complete setting: "It is reasonable to speculate that the brain has mechanisms to regulate the natural frequencies of its neurons so that the neurons can be entrained into different pools at different times simply by adjusting the Ω's" (Hoppensteadt & Izhikevitch, 1997, sec. 5.4.2). In section 1, we mentioned that here we do not aim at a biological interpretation. Nevertheless, it is obvious that equations 1.1 provide an extension of equation 3.9, where the analog of the natural frequencies may result from a dynamical process. Notice, however, a difference between the FM approach and the direction we take. The FM approach uses only frequency gaps for separating the neural pools, labeled by Ω_1, . . . , Ω_G. The analogy with equations 1.1 goes only to separating on-states (Ω_2) and off-states (Ω_1). Among the on-states, we continue with a separation based on phase gaps. In particular, for overlapping patterns, competitive couplings are used to separate their phases. Using competition as the separating mechanism has been proposed already in von der Malsburg (1981).
With regard to biological interpretation, it has been suggested that temporal coding based on frequency gaps and temporal coding based on phase gaps may be complementary (Hoppensteadt & Izhikevitch, 1999), the former being valid in the weak-coupling regime, while the latter may be more suitable for strong couplings (Izhikevitch, 1999). Notice in this respect that the model of equations 1.1 is indeed applicable with strong couplings, in particular due to the saturation properties of the activation function.

4 Examples

In this section, we illustrate the dynamics of equations 1.1 in the context of simple examples. We continue to use the short form V_k = g(u_k). In section 4.1, we comment on input choices. In section 4.2, we specify the parameters and present the patterns that we use for the examples. In section 4.3, we consider the weights without competition, that is, λ = 0 in equation 2.6. We apply external input to the network and demonstrate how
coherent pattern recognition retrieves the stored patterns. As expected, the frequency-driving activity 2π V_k separates the phases of active units from the background. We compare coherent pattern recognition to the classical dynamics. This will help in understanding the limitations of the classical approach and in appreciating the additional features that arise due to the phase couplings, most notably the avoidance of the superposition problem. In section 4.4, we include competition between the patterns, that is, λ > 0, and study the resulting effects.

4.1 Pattern Retrieval and Input Choices. Choosing the activation function g of equations 1.1 so that on and off states of a unit k correspond, respectively, to the limits 1 and 0 of V_k introduces a subtlety regarding the inputs I. This may be understood when writing equations 1.1 in terms of h(x) = tanh(x) = 2g(x) − 1 instead. For this discussion, it is sufficient to consider β = 0; then

  τ_u du_k/dt = −u_k + (α/2N) Σ_{l=1}^{N} g_kl h(u_l) + J_k,    (4.1)

where J is related to I through

  I_k = J_k − (α/2N) Σ_{l=1}^{N} g_kl.    (4.2)
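Equation 4.2 is a one-line computation. A small helper (our illustration; the function name is ours):

```python
def I_from_J(J, gkl, alpha):
    """Equation 4.2: I_k = J_k - (alpha / 2N) * sum_l g_kl."""
    N = len(J)
    return [J[k] - alpha / (2.0 * N) * sum(gkl[k]) for k in range(N)]

# With this choice, vanishing inputs correspond to J = 0 rather than I = 0:
print(I_from_J([0.0, 0.0], [[0.0, 1.0], [1.0, 0.0]], 4.0))
```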
The subtlety is related to the interpretation of vanishing inputs. Vanishing inputs should be identified not with vanishing I but with J = 0. The related I value is then given by the second term in equation 4.2. Only for such a value of I will u = 0 imply du/dt = 0, so that vanishing inputs correspond to a neutral or undecided state at the origin of u-space. In this section, we specify the inputs in terms of J.

4.2 A Simple Network. We now want to illustrate the dynamics of equations 1.1 by discussing simple examples. We choose N = 6, P = 3, and store the patterns

  ξ¹ = (1, 1, 0, 0, 0, 0)ᵀ,  ξ² = (0, 0, 1, 1, 0, 0)ᵀ,  ξ³ = (0, 0, 0, 1, 1, 1)ᵀ,    (4.3)
[Figure 1 appears here.]

Figure 1: The stored patterns according to equation 4.3, drawn together with the resulting Hebbian connections h_kl of equation 4.4 and the numbering of the units. The filled and empty circles correspond, respectively, to on and off states.
with N1 = N2 = 2, N3 = 3. This form allows us to distinguish two cases: overlapping and nonoverlapping patterns. Pattern 1 does not overlap with patterns 2 and 3, while patterns 2 and 3 overlap at unit 4. The patterns are illustrated in Figure 1. The Hebbian weights h and the inhibitory competition terms c are given by equation 2.6:
  h_kl = (1/P) ·
      [ 1 1 0 0 0 0 ]
      [ 1 1 0 0 0 0 ]
      [ 0 0 1 1 0 0 ]
      [ 0 0 1 2 1 1 ]
      [ 0 0 0 1 1 1 ]
      [ 0 0 0 1 1 1 ],

  c_kl = −(1/P²) ·
      [ 0 0 1 2 1 1 ]
      [ 0 0 1 2 1 1 ]
      [ 1 1 0 1 1 1 ]
      [ 2 2 1 2 1 1 ]
      [ 1 1 1 1 0 0 ]
      [ 1 1 1 1 0 0 ].    (4.4)
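The matrices of equation 4.4 follow directly from the stored patterns. The sketch below is ours; it assumes the rules h_kl = (1/P) Σ_p ξ_k^p ξ_l^p and c_kl = −(1/P²) Σ_{p≠q} ξ_k^p ξ_l^q, which reproduce the entries above (the exact form of equation 2.6 is given earlier in the article):

```python
def hebbian(xi):
    """Assumed Hebbian rule: h_kl = (1/P) * sum_p xi_k^p * xi_l^p."""
    P, N = len(xi), len(xi[0])
    return [[sum(xi[p][k] * xi[p][l] for p in range(P)) / P
             for l in range(N)] for k in range(N)]

def competition(xi):
    """Assumed competition rule: c_kl = -(1/P**2) * sum_{p != q} xi_k^p * xi_l^q."""
    P, N = len(xi), len(xi[0])
    return [[-sum(xi[p][k] * xi[q][l]
                  for p in range(P) for q in range(P) if p != q) / P**2
             for l in range(N)] for k in range(N)]

# The three stored patterns of equation 4.3:
xi = [[1, 1, 0, 0, 0, 0],
      [0, 0, 1, 1, 0, 0],
      [0, 0, 0, 1, 1, 1]]
h, c = hebbian(xi), competition(xi)
print(h[3][3], c[0][3])  # the overlap at unit 4 gives h_44 = 2/P
```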
We set d = 0 in equation 2.6 since we are not interested in background effects. In the following, the global Hebbian parameter α is set to α = 2PN. This is large enough to establish attractors of the system that realize storage of the patterns. On-states correspond to V_k → 1, while off-states correspond to V_k → 0. The timescales are set to τ_u = τ_θ/4 = τ. This is a mild form of the τ_u ≪ τ_θ scenario underlying the adiabatic approximation in section 3.3. Simulations are performed using Euler's method with time step dt = 0.02τ. The initial values, random values for the θ_k and small random values for the u_k (so that V_k ≈ 1/2), are the same for all examples in this section. The examples are distinguished by different values for β, λ, and the inputs I_k (expressed in terms of J_k; see section 4.1).

4.3 Coherent Pattern Recognition. We begin with two examples for the response to external inputs. In section 4.1 we argued that vanishing inputs
should be identified with J_k = 0, where J is related to I by equation 4.2. Therefore, excitatory input should be identified with J_k > 0 and inhibitory input with J_k < 0. Both examples compare coherent pattern recognition with the classical case. The classical case corresponds to β = 0. For the phase coding scenario, we choose β = 0.1α. In this section, we do not include competition—λ = 0.

4.3.1 Example 1. First, we set J_1 = −E, J_6 = +E, where E = 10, while all other inputs vanish. The resulting dynamics is displayed in Figure 2. We find that the inhibitory input at unit 1 suppresses pattern 1, while the excitatory input at unit 6 excites pattern 3. Moreover, due to its overlap with pattern 3, pattern 2 is also excited. This is true for the classical and the phase coding cases. It is the behavior that should be expected. If phase coding is included, β = 0.1α, we find that the active units synchronize after a few multiples of the timescale. The example confirms that we succeeded in constructing a pattern recognition mechanism that retrieves patterns in a coherent mode. The frequency-driving activity term in equation 1.1b is essential in separating the active from the nonactive units. The frequencies of active units approach 2π/τ_θ, while the frequencies of nonactive units are frozen toward zero. The implications for the superposition problem are illustrated with the next example.

4.3.2 Example 2. The superposition problem may now be demonstrated by choosing both inputs to be positive, J_1 = J_6 = +E. Then pattern 1 is also excited (see Figure 3). In the classical case, β = 0, the superposition of active patterns no longer allows us to distinguish the single patterns. For example, it is no longer possible to distinguish pattern 1 since all units are active. Consequently, there is a problem for information processing along the succeeding stages. This is the superposition catastrophe that plagues the classical approach (Rosenblatt, 1961; von der Malsburg, 1981).
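The examples are easy to reproduce with a compact Euler simulation. The sketch below is ours, not the article's; it uses the logistic g(u) = 1/(1 + e^(−2u)) (consistent with equation 3.4), the Hebbian weights of equation 4.4 with λ = 0, and seeded random initial values that differ from the article's. Shown for the example 1 inputs J_1 = −E, J_6 = +E, it recovers the expected activities:

```python
import math, random

def g(u):
    # logistic activation, consistent with g^{-1} in equation 3.4
    return 1.0 / (1.0 + math.exp(-2.0 * u))

# Patterns of equation 4.3 and Hebbian weights h_kl of equation 4.4 (lambda = 0).
xi = [[1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0], [0, 0, 0, 1, 1, 1]]
P, N = len(xi), len(xi[0])
gkl = [[sum(xi[p][k] * xi[p][l] for p in range(P)) / P for l in range(N)]
       for k in range(N)]

alpha = 2 * P * N            # global Hebbian parameter of section 4.2
beta = 0.1 * alpha           # phase coupling of examples 1 and 2
tau_u, tau_theta, dt = 1.0, 4.0, 0.02

E = 10.0
J = [-E, 0.0, 0.0, 0.0, 0.0, E]                              # example 1 inputs
I = [J[k] - alpha / (2 * N) * sum(gkl[k]) for k in range(N)]  # equation 4.2

random.seed(0)
u = [0.1 * (random.random() - 0.5) for _ in range(N)]         # V_k near 1/2
theta = [2 * math.pi * random.random() for _ in range(N)]

for _ in range(750):   # integrate equations 1.1 to t = 15 tau with Euler's method
    V = [g(x) for x in u]
    du = [(-u[k] + sum(gkl[k][l] * (alpha + beta * math.cos(theta[k] - theta[l]))
                       * V[l] for l in range(N)) / N + I[k]) / tau_u
          for k in range(N)]
    dth = [2.0 * math.pi * V[k] * (1.0 + (beta / N)
           * sum(gkl[k][l] * math.sin(theta[l] - theta[k]) * V[l]
                 for l in range(N))) / tau_theta for k in range(N)]
    u = [u[k] + dt * du[k] for k in range(N)]
    theta = [theta[k] + dt * dth[k] for k in range(N)]

V = [g(x) for x in u]
print([round(v, 3) for v in V])   # units 1-2 near 0, units 3-6 near 1
```

Flipping the sign of J_1 gives the example 2 inputs, for which all six units switch on.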
Comparing the classical situation to the case β = 0.1α makes it obvious why the inclusion of phase coding helps to avoid the superposition problem. Now the units carry not only information about being on or off, related to high or low frequencies, but also information about the phases. We find that units 1 and 2 of pattern 1 synchronize separately from units 3 to 6 of patterns 2 and 3. As a result, in the output, pattern 1 may be distinguished from the rest due to its different phase. Whenever the succeeding stage of information processing is sensitive to this phase difference, it will be able to identify pattern 1 despite the fact that the other units are also on. 4.4 Competition Between Patterns. While the superposition problem is a severe drawback in the application of classical networks, another problem is the mixed states that correspond to a common activation of overlapping patterns. This arises when using the Hebbian weights without additional coupling of the units. It is only a problem, of course, if we aim at getting a
[Figure 2 appears here: per-unit phase θ_k/2π, activity V_k, and pattern coherence C_p versus time t [τ], for panels (A) β = 0 and (B) β = 0.1α, with inputs J_1 = −E, J_6 = +E.]

Figure 2: Example 1. The lines connecting the units indicate the Hebbian connections h_kl of equation 4.4 (see Figure 1). The filled and empty circles correspond, respectively, to on and off states. We compare the classical case (β = 0) with phase coding (β > 0). (A) β = 0, the classical case. The inhibitory input J_1 suppresses pattern 1. The excitatory input J_6 activates patterns 3 and 2 through unit 4. (B) β = 0.1α, phase coding. Coherent activity is indicated by circles with parallel stripes. The connected active units synchronize within a few multiples of the timescale. The corresponding patterns reach maximal coherence. (Whenever mod 2π is applied to θ, the values are connected for the sake of better visualization.)
[Figure 3 appears here: per-unit phase θ_k/2π, activity V_k, and pattern coherence C_p versus time t [τ], for panels (A) β = 0 and (B) β = 0.1α, with inputs J_1 = J_6 = +E.]
Figure 3: Example 2. Both external inputs are now excitatory. As a consequence, all units are activated. Again, we compare the classical case (β = 0) with that of phase coding (β > 0). (A) β = 0, the classical case. The superposition problem arises. The activation carries no information about single patterns. (B) β = 0.1α, phase coding. Now the units of the two connected components synchronize separately. Different orientations of the stripes correspond to different phases. The superposition problem is avoided. Pattern 1 is distinguishable from the rest. Separating patterns 2 and 3 also requires competition (see example 3).
unique pattern as the output of the network. Then the situation of examples 1 and 2 is not satisfying since patterns 2 and 3 are activated simultaneously due to being connected through Hebbian connections. This is why we now include additional inhibitory weights c kl that introduce competition between the patterns—that is, we use nonvanishing λ in equation 2.6. 4.4.1 Example 3. We set λ = 1.2P. No external inputs are applied. To make the influence of phase coding obvious, we now use a larger value for β by choosing β = 0.8α for the nonclassical case. The resulting dynamics is displayed in Figure 4. It may be understood as follows. Let us first discuss the classical situation. When analyzing our random initial values for uk , they are found to be slightly positive, so that the Vk are slightly above 1/2 and the activities tend to run toward the on-states. This, however, invokes the competition process, and for the classical scenario, β = 0, the units of pattern 3 succeed in suppressing all the other activities, so that finally units 4 to 6 of pattern 3 approach the on-state attractor, while the other units move toward the off-state attractor. The flat lines in the phase diagram correspond to switched-off units 1 to 3. The dominance of pattern 3 is mainly due to N3 > N2,1 . With phase coding, the situation is different and allows an interesting observation. We find that although again pattern 3 dominates pattern 2, we now end up with a still active pattern 1. The reason for this may be understood when realizing that the two winning patterns are out of phase. In fact, the phase difference approaches π. Thus, pattern 1 escaped the competition by desynchronizing with pattern 3 so that the coupling w(θ ) and thus the competition is weakened. As a result, the system ends up in a situation similar to Figure 3, but now the superposition problem between the overlapping patterns has been solved due to the competition: unit 3 and thereby pattern 2 is suppressed. 
Two patterns may therefore be simultaneously active and still be separable. This is another example of avoiding the superposition problem. Notice that an analogous behavior occurs when global inhibition is used—that is, competition between units instead of competition between patterns. A possible example is obtained from λ = 0 and dkl = −1 for all k, l in equation 2.6. Again, in the classical case, pattern 3 is winning, while in the presence of phase coding, patterns 1 and 3 are active with different phases, as illustrated for the case of Figure 4. 5 Summary and Outlook In this letter, we used the oscillatory network model of equations 1.1 to study a solution to the superposition catastrophe in the context of phase models. The system of equations 1.1 was obtained by extending classical neural network models to include phase dynamics. It was designed to meet two basic requirements: on and off states should correspond, respectively,
[Figure 4 appears here: per-unit phase θ_k/2π, activity V_k, and pattern coherence C_p versus time t [τ], for panels (A) β = 0 and (B) β = 0.8α, with no external input.]

Figure 4: Example 3. The network now also includes inhibitory weights λ c_kl, λ = 1.2P. These introduce competition between the patterns. The network is now completely connected, as illustrated by the additional broken lines. No external input is applied. (A) β = 0, the classical case. Pattern 3 is the winning pattern (it is larger than the others). (B) β = 0.8α, phase coding. Again pattern 3 wins over pattern 2. Pattern 1, however, is now able to survive. Phase coding allows pattern 1 to escape the competition by desynchronizing with pattern 3. This is illustrated by circles with stripes of orthogonal orientation.
to high and low frequencies of the phases, and patterns should be retrieved in a coherent (synchronized) mode. Identifying the activity g(u_k) of equations 1.1 with an oscillation amplitude V_k allows us to compare the model with equations 2.2, a generalized version of the discrete complex Ginzburg-Landau (GGL) model with vanishing shear and identical natural frequencies. This model describes Stuart-Landau oscillators (i.e., oscillators close to a Hopf bifurcation) coupled to lowest order, a model that is frequently used for implementing associative memory (see Hoppensteadt & Izhikevitch, 2003). With V_k = g(u_k), the GGL model does not obey the first of the two requirements: its phase dynamics leads to different results as oscillatory units k approach their off-state, V_k → 0. We also compared our approach with an alternative proposal for grouping oscillatory units based on temporal properties (see Hoppensteadt & Izhikevitch, 1997, sec. 5.4.2). This frequency modulation (FM) proposal is founded on choosing different natural frequencies for the oscillatory units, resulting in the grouping of units with nearly identical natural frequencies. With regard to a separation of on- and off-states, equations 1.1 may be seen as a dynamical extension of this approach. However, with regard to a grouping among on-states, the FM approach would be based on frequency gaps, while the model of equations 1.1 uses phase gaps. Correspondingly, we mentioned that (regarding biological systems) a complementarity has been suggested (Izhikevitch, 1999), according to which temporal coding with frequency gaps arises for weakly coupled systems, while temporal coding based on phase gaps may be more suitable for strong couplings. In the context of equations 1.1, the solution of the superposition problem is straightforward. It is in accordance with the original proposals of temporal coding (von der Malsburg, 1981, 1985; see also von der Malsburg, 2003).
The frequency-driving activity term 2π g(u_k) in equation 1.1b separates the active units from the background, so that active units, V_k → 1, approach the frequency (2π/τ_θ)(1 + S_k), while nonactive units, V_k → 0, are frozen to zero frequency. Among the active units, the phases provide an additional labeling of the activities. In the case of coherent pattern recognition, due to synchronizing phase couplings, each component may be identified with a single phase. Different patterns may then have the same frequency and may still be distinguished by the phase differences between the patterns. In order to avoid mixed states of overlapping patterns, a competition between the patterns may be introduced. This was achieved by introducing appropriate inhibitory weights in addition to the excitatory Hebbian weights. The coherence of the patterns may then be reduced to the winning subset. In this letter, we did not specify mechanisms by which successive stages of information processing can read out the phases of different components. Future work should approach this issue so that a more complete picture arises. Moreover, taking the limit N → ∞ is essential for obtaining phase
transitions between ordered and disordered states (see Kuramoto, 1984; Strogatz, 2000). Studying the proposed and related models in this limit should be of particular interest. Higher-order phase couplings, and possibly related phenomena such as clustering, should also be of interest (see Tass, 1999). In the context of the FM proposal, these led to resonance conditions for the natural frequencies. An early version of the Wang-Terman model was studied for patterns that were overlapping at one unit, resembling our example in section 4 (Wang et al., 1990). Presentation of overlapping patterns as input led to a common activation of these patterns: the units at the overlap participated in each of the overlapping patterns (see Wang et al., Figure 3). This differs from our example 3, where competition led to a winner among the overlapping patterns. If a common activation of overlapping patterns is desirable, different approaches might work for the phase models. For example, the phase model could be extended to the coupling of higher-order modes. Overlapping units could then participate in a higher-frequency mode that may simultaneously synchronize with the lower-frequency nonoverlapping parts of the patterns via resonances. This would still allow the patterns to desynchronize in the nonoverlapping parts. Modifying the phase model to establish such a feature, however, is beyond the scope of this article. We presented simple examples to illuminate the dynamical content of equations 1.1. Evidently, for the approach to gain importance as a pattern recognition method, it should be demonstrated that it stands the test of more advanced applications. In the case of locally coupled relaxation oscillators, this step was taken by Wang and Terman, who studied segmentation of real images. They observed that a straightforward application of relaxation oscillator networks led to an unsuitable segmentation of the real image, with many tiny fragments, a problem they called fragmentation.
This problem was solved by introducing a lateral potential for each oscillator (Wang & Terman, 1997; see also Shareef et al., 1999). It will be of interest to study how the phase model should be applied to real images. In case fragmentation arises, it would be particularly interesting to study whether some hierarchical processing could help to combine the fragments, possibly in analogy to hierarchical processing supported by higher-level cortical areas. Given the relative simplicity of the phase models in comparison with coupled relaxation oscillator models, one may hope that necessary steps to include high-level processing could be more easily analyzed and implemented.
Acknowledgment It is a pleasure to thank Christoph von der Malsburg for inspiring and helpful discussions.
378
T. Burwick
References Abbott, L. F. (1990). A network of oscillators. J. Phys. A: Math. Gen., 23, 3835–3859. Aoyagi, T. (1995). Networks of neural oscillators for retrieving phase information. Physical Review Letters, 74(20), 4075–4078. Aronson, D., Ermentrout, G., & Kopell, N. (1990). Amplitude response of coupled oscillators. Physica D, 41, 403–449. Baird, B. (1986). Nonlinear dynamics of pattern formation and pattern recognition in the rabbit olfactory bulb. Physica D, 22, 242–252. Baldi, P., & Meir, R. (1990). Computing with arrays of coupled oscillators: An application to preattentive texture discrimination. Neural Computation, 2, 458–471. Chakravarthy, S. V., & Ghosh, J. (1994). A neural network–based associative memory for storing complex-valued patterns. In Proc. IEEE Int. Conf. Syst. Man Cybern. (pp. 2213–2218). Piscataway, NJ: IEEE. Chakravarthy, S. V., & Ghosh, J. (1996). A complex-valued associative memory for storing patterns as oscillatory states. Biological Cybernetics, 75, 229–238. Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13, 815–826. Freeman, W. J., Yao, Y., & Burke, B. (1988). Central pattern generating and recognizing in olfactory bulb: A correlation learning rule. Neural Networks, 1, 277–288. Hirose, A. (Ed.). (2003). Complex-valued neural networks. Singapore: World Scientific. Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences of the U.S.A., 81, 3088–3092. Hoppensteadt, F. C., & Izhikevitch, E. M. (1996). Synaptic organization and dynamical properties of weakly connected neural oscillators: II. Learning phase information. Biological Cybernetics, 75, 129–135. Hoppensteadt, F. C., & Izhikevitch, E. M. (1997). Weakly connected neural networks. Berlin: Springer-Verlag.
Hoppensteadt, F. C., & Izhikevitch, E. M. (1999). Thalamo-cortical interactions modeled by weakly connected oscillators: Could the brain use FM radio principles? BioSystems, 48, 85–94. Hoppensteadt, F. C., & Izhikevitch, E. M. (2000). Pattern recognition via synchronization in phase-locked loop neural networks. IEEE Transactions on Neural Networks, 11, 734–738. Hoppensteadt, F. C., & Izhikevitch, E. M. (2003). Canonical neural models. In M. Arbib (Ed.), Brain theory and neural networks (2nd ed., pp. 181–186). Cambridge, MA: MIT Press. Izhikevitch, E. M. (1999). Weakly connected quasi-periodic oscillators, FM interactions, and multiplexing in the brain. BioSystems, 48, 85–94. Izhikevitch, E. M. (2000). Phase equations for relaxation oscillators. SIAM J. Appl. Math., 60, 1789–1805. Kuramoto, Y. (1975). Self-entrainment of a population of coupled non-linear oscillators. In H. Araki (Ed.), International Symposium on Mathematical Problems in Theoretical Physics (pp. 420–422). Berlin: Springer-Verlag.
Oscillatory Networks
379
Kuramoto, Y. (1984). Chemical oscillations, waves, and turbulence. Berlin: Springer-Verlag. Kuramoto, Y., Aoyagi, T., Nishikawa, I., Chawanya, T., & Okuda, K. (1992). Neural network model carrying phase information with application to collective dynamics. Progr. Theor. Phys., 87, 1119–1126. Kuzmina, M., Manykin, E., & Surina, I. (1995). Associative memory oscillatory networks with Hebbian and pseudo-inverse matrices of connections. In Proceedings of the Third European Congress on Intelligent Techniques and Soft Computing (EUFIT'95), Aachen, Germany, August 28–31 (pp. 392–395). Bellingham, WA: SPIE International Society for Optical Engineering. Kuzmina, M., & Surina, I. (1994). Macrodynamical approach for oscillatory networks. In Proceedings of SPIE: Optical Neural Networks (Vol. 2430, pp. 229–235). Aachen: ELITE Foundation. Li, Z., & Hopfield, J. J. (1989). Modeling the olfactory bulb and its neural oscillatory processings. Biological Cybernetics, 61, 379–392. Müller, H. J., Elliott, M. A., Herrmann, C. S., & Mecklinger, A. (Eds.). (2001). Visual Cognition, 8 [Special issue]. Noest, A. J. (1988a). Associative memory in sparse phasor neural networks. Europhysics Letters, 6, 469–474. Noest, A. J. (1988b). Discrete-state phasor neural networks. Physical Review A, 38, 2196–2199. Rosenblatt, F. (1961). Principles of neurodynamics: Perception and the theory of brain mechanism. Washington, DC: Spartan Books. Roskies, A. L. (Ed.). (1999). Neuron, 24 [Special topic]. Sakagushi, H., Shinomoto, S., & Kuramoto, Y. (1987). Local and global self-entrainments in oscillator lattices. Progr. Theor. Phys., 77, 1005–1010. Sakagushi, H., Shinomoto, S., & Kuramoto, Y. (1988). Mutual entrainment in oscillator lattices with nonvariational type interaction. Progr. Theor. Phys., 79, 1069–1079. Shareef, N., Wang, D., & Yagel, R. (1999). Segmentation of medical images using LEGION. IEEE Transactions on Medical Imaging, 18, 74–91. Somers, D., & Kopell, N. (1993).
Rapid synchronization through fast threshold modulation. Biological Cybernetics, 68, 393–407. Somers, D., & Kopell, N. (1995). Waves and synchrony in networks of oscillators of relaxation and non-relaxation type. Physica D, 88, 1–14. Sompolinsky, H., Golomb, D., & Kleinfeld, D. (1990). Global processing of visual stimuli in a neural network of coupled oscillators. Proc. Natl. Acad. Sci. USA, 87, 7200–7204. Sompolinsky, H., Golomb, D., & Kleinfeld, D. (1991). Cooperative dynamics in visual processing. Physical Review A, 43, 6990–7011. Sompolinsky, H., & Tsodyks, M. (1992). Processing of sensory information by a network of coupled oscillators. International Journal of Neural Systems, 3 (Suppl.), 51–56. Strogatz, S. H. (2000). From Kuramoto to Crawford: Exploring the onset of synchronization in populations of coupled oscillators. Physica D, 143, 1–20. Takeda, M., & Kishigami, T. (1992). Complex neural fields with a Hopfield-like energy function and an analogy to optical fields generated in phase-conjugate resonators. J. Opt. Soc. Am. A, 9, 2182–2192.
Tass, P. A. (1993). Synchronisierte Oszillationen im visuellen Cortex—ein synergetisches Modell. Unpublished doctoral dissertation, Institut für Theoretische Physik und Synergetik der Universität Stuttgart. Tass, P. A. (1999). Phase resetting in medicine and biology. Berlin: Springer-Verlag. Tass, P., & Haken, H. (1996a). Synchronization in networks of limit cycle oscillators. Z. Phys. B, 100, 303–320. Tass, P., & Haken, H. (1996b). Synchronized oscillations in the visual cortex—a synergetic model. Biological Cybernetics, 74, 31–39. Terman, D., & Wang, D. (1995). Global competition and local cooperation in a network of neural oscillators. Physica D, 81, 148–176. von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. 81-2). Max-Planck Institute for Biophysical Chemistry. von der Malsburg, C. (1985). Nervous structures with dynamical links. Ber. Bunsenges. Phys. Chem., 89, 703–710. von der Malsburg, C. (1999). The what and why of binding: The modeler's perspective. Neuron, 24, 95–104. von der Malsburg, C. (2003). Dynamic link architecture. In M. Arbib (Ed.), Brain theory and neural networks (2nd ed., pp. 365–368). Cambridge, MA: MIT Press. von der Malsburg, C., & Buhmann, J. (1992). Sensory segmentation with coupled neural oscillators. Biological Cybernetics, 67, 233–242. von der Malsburg, C., & Schneider, W. (1986). A neural cocktail-party processor. Biological Cybernetics, 54, 29–40. Wang, D., Buhmann, J., & von der Malsburg, C. (1990). Pattern segmentation in associative memory. Neural Computation, 2, 94–106. Wang, D., & Terman, D. (1995). Locally excitatory globally inhibitory oscillator networks. IEEE Transactions on Neural Networks, 6, 283–286. Wang, D., & Terman, D. (1997). Image segmentation based on oscillatory correlation. Neural Computation, 9, 805–836. Winfree, A. T. (2001). The geometry of biological time (2nd ed.). Berlin: Springer-Verlag.
Received November 10, 2004; accepted June 30, 2005.
LETTER
Communicated by Bruno Olshausen
Topographic Product Models Applied to Natural Scene Statistics
Simon Osindero
[email protected] Department of Computer Science, University of Toronto, Toronto, Ontario, M5S 3G4, Canada
Max Welling
[email protected] Department of Computer Science, University of California Irvine, Irvine, CA 92697-3425, U.S.A.
Geoffrey E. Hinton
[email protected] Canadian Institute for Advanced Research and Department of Computer Science, University of Toronto, Toronto, Ontario, M5S 3G4, Canada
We present an energy-based model that uses a product of generalized Student-t distributions to capture the statistical structure in data sets. This model is inspired by and particularly applicable to “natural” data sets such as images. We begin by providing the mathematical framework, where we discuss complete and overcomplete models and provide algorithms for training these models from data. Using patches of natural scenes, we demonstrate that our approach represents a viable alternative to independent component analysis as an interpretive model of biological visual systems. Although the two approaches are similar in flavor, there are also important differences, particularly when the representations are overcomplete. By constraining the interactions within our model, we are also able to study the topographic organization of Gabor-like receptive fields that our model learns. Finally, we discuss the relation of our new approach to previous work—in particular, gaussian scale mixture models and variants of independent components analysis. 1 Introduction This letter presents a general family of energy-based models that we refer to as product of Student-t (PoT) models. They are particularly well suited to modeling statistical structure in data for which linear projections are expected to result in sparse marginal distributions. Many kinds of data might be expected to have such structure, and in particular natural data
Neural Computation 18, 381–414 (2006)
© 2005 Massachusetts Institute of Technology
382
S. Osindero, M. Welling, and G. Hinton
sets such as digitized images or sounds seem to be well described in this way. The goals of this letter are twofold. First, we present the general mathematical formulation of PoT models and describe learning algorithms for them. We hope that this part of the article will be useful in introducing a new method to the community’s tool kit for machine learning and density estimation. Second, we focus on applying PoTs to capturing the statistical structure of natural scenes. This is motivated from both a density estimation perspective and from the perspective of providing insight into information processing within the visual pathways of the brain. PoT models were touched on briefly in Teh, Welling, Osindero, & Hinton (2003), and in this letter, we present the basic formulation in more detail, provide hierarchical and topographic extensions, and give an efficient learning algorithm employing auxiliary variables and Gibbs sampling. We also provide a discussion of the PoT model in relation to similar existing techniques. We suggest that the PoT model could be considered a viable alternative to the more familiar technique of independent component analysis (ICA) when constructing density models, performing feature extraction, or building interpretive computational models of biological visual systems. As we shall demonstrate, we are able to reproduce many of the successes of ICA, yielding results that are comparable but with some interesting and significant differences. Similarly, extensions of our basic model can be related to some of the hierarchical forms of ICA that have been proposed, as well as to gaussian scale mixtures. Again there are interesting differences in formulation. An example of a potential advantage in our approach is that the learned representations can be inferred directly from the input, without the need for any iterative settling even in hierarchical or highly overcomplete models. The letter is organized as follows. 
Section 2 describes the mathematical form of the basic PoT model along with extensions to hierarchical and topographic versions. Section 3 describes how to learn within the PoE framework using the contrastive divergence (CD) algorithm (Hinton, 2002) (with the appendix providing the background material for running the necessary Markov chain Monte Carlo sampling). In section 4 we present results of our model when applied to natural images. We are able to recreate the success of such ICA-based models as, for example, Bell and Sejnowski (1995, 1997), Olshausen and Field (1996, 1997), Hoyer and Hyvarinen (2000), Hyvarinen, Hoyer, & Inki (2001), and Hyvarinen and Hoyer (2001). Our model provides computationally motivated accounts for the form of simple cell and complex cell receptive fields, as well as for the basic layout of cortical topographic maps for location, orientation, spatial frequency, and spatial phase. Additionally, we are easily able to produce such results in an overcomplete setting.
Topographic Product Models Applied to Natural Scene Statistics
383
In section 5 we analyze in more detail the relationships between our PoT model, ICA models and their extensions, and gaussian scale mixtures. Finally, in section 6, we summarize our work. 2 Products of Student-t Models We will begin with a brief overview of product of expert models (Hinton, 2002), before presenting the basic product of Student-t model (Welling, Hinton, & Osindero, 2002). Then we move on to discuss hierarchical topographic extensions. 2.1 Product of Expert Models. Product of expert models (PoEs) were introduced in Hinton (2002) as an alternative method of combining expert models into one joint model. In contrast to mixture of expert models, where individual models are combined additively, PoEs combine expert opinions multiplicatively as follows (see also Heskes, 1998):

$$P_{\mathrm{PoE}}(\mathbf{x}\,|\,\theta) = \frac{1}{Z(\theta)} \prod_{i=1}^{M} p_i(\mathbf{x}\,|\,\theta_i), \qquad (2.1)$$
where Z(θ ) is the global normalization constant and pi (·) are the individual expert models. Mixture models employ a divide-and-conquer strategy, with different experts being used to model different subsets of the training data. In product models, many experts cooperate to explain each input vector, and different experts specialize in different parts of the input vector or in different types of latent structure. If a scene contains n different objects that are processed in parallel, a mixture model needs a number of components exponential in n because each component of the mixture must model a combination of objects. A product model, by contrast, requires only a number of components linear in n because many different experts can be used at the same time. Another benefit of product models is their ability to model sharp boundaries. In mixture models, the distribution represented by the whole mixture must be vaguer than the distribution represented by a typical component of the mixture. In product models, the product distribution is typically much sharper than the distributions of the individual experts1 , which is a major advantage for high-dimensional data (Hinton, 2002; Welling, Zemel, & Hinton, 2002). 1 When multiplying together n equal-variance gaussians, for example, the variance is reduced by a factor of n. It is also possible to make the entropy of the product distribution higher than the entropy of the individual experts by multiplying together two very heavytailed distributions whose modes are in very different places.
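The sharpening claim in footnote 1 is easy to check numerically. The sketch below (not from the article; the grid and number of experts are arbitrary choices) multiplies n equal-variance gaussian experts and measures the variance of the product distribution:

```python
import numpy as np

# Product of n equal-variance gaussian "experts": the unnormalized product
# exp(-x^2/2)^n = exp(-n*x^2/2) is itself gaussian, with variance 1/n.
n = 4
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]
log_expert = -0.5 * x**2          # one standard gaussian expert, unnormalized
p = np.exp(n * log_expert)        # experts combine by adding log-densities
p /= p.sum() * dx                 # normalize on the grid
var = (x**2 * p).sum() * dx       # variance of the product distribution
# var ≈ 1/n = 0.25, i.e. the product is sharper than any single expert
```

The product is sharper than any individual expert, in contrast to a mixture, whose variance can never be smaller than the spread of its components.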
Learning PoE models has been difficult in the past, mainly due to the presence of the partition function Z(θ). However, contrastive divergence learning (Hinton, 2002) (see section 3.2) has opened the way to apply these models to large-scale applications. PoE models are related to many other models that have been proposed in the past. In particular, log-linear models² have a similar flavor but are more limited in their parameterization,

$$P_{\mathrm{LogLin}}(\mathbf{x}\,|\,\lambda) = \frac{1}{Z(\lambda)} \prod_{i=1}^{M} \exp\left(\lambda_i f_i(\mathbf{x})\right), \qquad (2.2)$$
where exp[λ_i f_i(·)] takes the role of an unnormalized expert. A binary product of experts model was introduced under the name harmonium in Smolensky (1986). A learning algorithm based on projection pursuit was proposed in Freund and Haussler (1992). In addition to binary models (Hinton, 2002), the gaussian case has been studied (Williams, Agakov, & Felderhof, 2001; Marks & Movellan, 2001; Williams & Agakov, 2002; Welling, Agakov, & Williams, 2003). 2.2 Product of Student-t Models. The basic model we study here is a form of PoE suggested by Hinton and Teh (2001), where the experts are given by generalized Student-t distributions:

$$\mathbf{y} = J\mathbf{x}, \qquad (2.3)$$
$$p_i(y_i\,|\,\alpha_i) \propto \frac{1}{\left(1 + \frac{1}{2} y_i^2\right)^{\alpha_i}}. \qquad (2.4)$$
The variables yi are the responses to linearly filtered input vectors and can be thought of as latent variables that are deterministically related to the observables, x. Through this deterministic relationship, equation 2.4 defines a probability density on the observables. The filters, { Ji }, are learned from the training data (typically images) by maximizing or approximately maximizing the log likelihood. Note that due to the presence of the J parameters, this product of Student-t (PoT) model is not log linear. However, it is possible to introduce auxiliary variables, u, such that the joint distribution P(x, u) is log linear3 and the marginal distribution P(x) reduces to that of the original
2 Otherwise known as exponential family models, maximum entropy models, and additive models—for example, see Zhu, Wu, & Mumford (1998). 3 Note that it is log linear in the parameters θ_{ijk} = J_{ij} J_{ik} and α_i, with features u_i x_j x_k and log u_i.
PoT distribution:

$$P_{\mathrm{PoT}}(\mathbf{x}) = \int_0^{\infty} d\mathbf{u}\; P(\mathbf{x}, \mathbf{u}), \qquad (2.5)$$
$$P(\mathbf{x}, \mathbf{u}) \propto \exp\left[-\sum_{i=1}^{M} \left( u_i \left(1 + \tfrac{1}{2}(J_i\mathbf{x})^2\right) + (1 - \alpha_i)\log u_i \right)\right], \qquad (2.6)$$
where J_i denotes the row vector corresponding to the ith row of the filter matrix J. An intuition for this form of reparameterization with auxiliary variables can be gained by considering that a one-dimensional t-distribution can be written as a continuous mixture of gaussians, with a gamma distribution controlling mixing proportions on components with different precisions, that is,

$$\frac{\Gamma\!\left(\alpha+\frac{1}{2}\right)}{\sqrt{2\pi}\,\Gamma(\alpha)} \left(1 + \frac{1}{2}\tau^2\right)^{-\left(\alpha+\frac{1}{2}\right)} = \int d\lambda\; \underbrace{\frac{1}{\Gamma(\alpha)}\,\lambda^{\alpha-1} e^{-\lambda}}_{\text{Gamma}}\; \underbrace{\frac{\sqrt{\lambda}}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\tau^2\lambda}}_{\text{Gaussian}}. \qquad (2.7)$$
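Equation 2.7 can be verified numerically; the following sketch (not from the article; the values of α and τ are arbitrary) compares the closed form on the left with a quadrature over the gamma mixing density on the right:

```python
import numpy as np
from math import gamma, sqrt, pi

alpha, tau = 1.5, 1.3

# Left side of eq. 2.7: generalized Student-t form
lhs = gamma(alpha + 0.5) / (sqrt(2 * pi) * gamma(alpha)) \
      * (1 + 0.5 * tau**2) ** -(alpha + 0.5)

# Right side: continuous mixture of gaussians with gamma-distributed precision
lam = np.linspace(1e-6, 60.0, 200001)
dlam = lam[1] - lam[0]
gamma_pdf = lam ** (alpha - 1) * np.exp(-lam) / gamma(alpha)
gauss = np.sqrt(lam / (2 * pi)) * np.exp(-0.5 * tau**2 * lam)
rhs = (gamma_pdf * gauss).sum() * dlam
# lhs and rhs agree to several decimal places
```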
The advantage of this reformulation using auxiliary variables is that it supports an efficient, fast-mixing Gibbs sampler, which is in turn beneficial for contrastive divergence learning. The Gibbs chain samples alternately from P(u|x) and P(x|u), given by

$$P(\mathbf{u}\,|\,\mathbf{x}) = \prod_{i=1}^{M} \mathcal{G}_{u_i}\!\left(\alpha_i;\; 1 + \tfrac{1}{2}(J_i\mathbf{x})^2\right), \qquad (2.8)$$
$$P(\mathbf{x}\,|\,\mathbf{u}) = \mathcal{N}_{\mathbf{x}}\!\left(0;\; (J^{T} V J)^{-1}\right), \qquad V = \mathrm{Diag}[\mathbf{u}], \qquad (2.9)$$
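A minimal numpy sketch of this Gibbs chain (not from the article; J and α are arbitrary placeholders, and the complete case M = D is assumed so that the precision matrix is invertible):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4                                   # complete case: M = D filters
J = rng.standard_normal((D, D))         # hypothetical filter matrix
alpha = np.full(D, 2.0)                 # expert "sparseness" parameters

def gibbs_step(x):
    # u | x: independent gammas, shape alpha_i, rate 1 + (J_i x)^2 / 2  (eq. 2.8)
    rate = 1.0 + 0.5 * (J @ x) ** 2
    u = rng.gamma(shape=alpha, scale=1.0 / rate)
    # x | u: zero-mean gaussian with precision J^T diag(u) J           (eq. 2.9)
    prec = J.T @ (u[:, None] * J)
    L = np.linalg.cholesky(prec)
    # solving L^T x = z yields x with covariance prec^{-1}
    x = np.linalg.solve(L.T, rng.standard_normal(D))
    return x, u

x = rng.standard_normal(D)
for _ in range(500):
    x, u = gibbs_step(x)                # alternate the two conditionals
```

Both conditionals can be sampled exactly and in parallel across components, which is what makes the chain fast mixing.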
where G denotes a gamma distribution and N a normal distribution. From equation 2.9, we see that the variables u can be interpreted as precision variables in the transformed space y = Jx. In terms of graphical models, the representation that best fits the PoT model with auxiliary variables is that of a two-layer bipartite undirected graphical model. Figure 1A schematically illustrates the Markov random field (MRF) over u and x; Figure 1B shows the role of the deterministic filter outputs in this scheme. A natural way to interpret the differences between directed models (and in particular ICA models) and PoE models was provided in Hinton and Teh (2001) and Teh et al. (2003). Whereas directed models intuitively have a top-down interpretation (e.g., samples can be obtained by ancestral sampling starting at the top layer units), PoE models (or more generally energy-based models) have a more natural bottom-up interpretation. The probability of an input vector is proportional to exp(−E(x)), where the energy E(x) is
Figure 1: (A) Standard PoT model as an undirected graph or Markov random field (MRF) involving observables x and auxiliary variables u. (B) Standard PoT MRF redrawn to show the role of deterministic filter outputs y = Jx. (C) Hierarchical PoT MRF drawn to show both sets of deterministic variables, y and z = W(y)^2, as well as auxiliary variables u.
computed bottom-up starting at the input layer (e.g., E(y) = E(Jx)). We may thus interpret the PoE model as modeling a collection of soft constraints, parameterized through deterministic mappings from the input layer to the top layer (possibly parameterized as a neural network) and where the energy serves to penalize inputs that do not satisfy these constraints (e.g., are different from zero). The costs contributed by the violated constraints are added to compute the global energy, which is equivalent to multiplying the distributions of the individual experts to compute the product distribution (since P(x) ∝ ∏_i p_i(x) ∝ exp(−∑_i E_i(x))). For a PoT, we have a two-layer model where the constraint violations are penalized using the energy function (see equation 2.6)

$$E(\mathbf{x}) = \sum_{i=1}^{M} \alpha_i \log\left(1 + \tfrac{1}{2}(J_i\mathbf{x})^2\right). \qquad (2.10)$$
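As a small sketch (not from the article; the parameter values are placeholders), the energy of equation 2.10 and its sub-quadratic growth for large violations can be checked as follows:

```python
import numpy as np

def pot_energy(x, J, alpha):
    # Energy of the basic PoT model, eq. 2.10; J and alpha are model parameters
    y = J @ x
    return np.sum(alpha * np.log1p(0.5 * y**2))

# The log penalty grows far more slowly than a quadratic one for a large
# violation y, which is what produces sparse distributions of y-values.
y = 10.0
log_pen = np.log1p(0.5 * y**2)      # ≈ 3.93
quad_pen = 0.5 * y**2               # 50.0
```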
We note that the shape of this energy function implies that relative to a quadratic penalty, small violations are penalized more strongly while large violations are penalized less strongly. This results in “sparse” distributions of violations (y-values) with many very small violations and occasional large ones. In the case of an equal number of observables, {xi }, and latent variables, {yi } (the so-called complete representation), the PoT model is formally equivalent to square, noiseless ICA (Bell & Sejnowski, 1995) with Student-t priors. However, in the overcomplete setting (more latent variables than
observables), product of experts models are essentially different from overcomplete ICA models (Lewicki & Sejnowski, 2000). The main difference is that the PoT maintains a deterministic relationship between latent variables and observables through y = Jx, and consequently not all values of y are allowed. This results in important marginal dependencies between the y-variables. In contrast, in overcomplete ICA, the hidden y-variables are marginally independent by assumption and have a stochastic relationship with the x-variables. (For details, we refer to Teh et al., 2003.) For undercomplete models (fewer latent variables than observables), there is again a discrepancy between PoT models and ICA models. In this case, the reason can be traced back to the way noise is added to the models in order to force them to assign nonzero probability everywhere in input space. In contrast to undercomplete ICA models, where noise is added in all directions of input space, undercomplete PoT models have noise added only in the directions orthogonal to the subspace spanned by the filter matrix J. (More details can be found in Welling, Zemel, & Hinton, 2003, 2004.) 2.3 Hierarchical PoT (HPoT) Models. We now consider modifications to the basic PoT by introducing extra interactions between the activities of filter outputs, yi , and altering the energy function for the model. These modifications were motivated by observations of the behavior of independent components of natural data and inspired by similarities between our model and (hierarchical) ICA. Since the new model essentially involves adding a new layer to the standard PoT, we refer to it as a hierarchical PoT (HPoT). As we will show in section 4, when trained on a large collection of natural image patches, the linear components { Ji } behave similarly to the learned basis functions in ICA and grow to resemble the well-known Gabor-like receptive fields of simple cells found in the visual cortex (Bell & Sejnowski, 1997). 
These filters, like wavelet transforms, are known to decorrelate input images very effectively. However, it has been observed that higher-order dependencies remain between the filter outputs {y_i}. In particular, there are important dependencies between the activities or energies y_i^2 (or more generally |y_i|^β, β > 0) of the filter outputs. This phenomenon can be neatly demonstrated through the use of bow-tie plots, in which the conditional histogram of one filter output is plotted given the output value of a different filter (e.g., see Simoncelli, 1997). The bow-tie shape of the plots implies that the first-order dependencies have been removed by the linear filters {J_i} (since the conditional mean vanishes everywhere), but that higher-order dependencies still remain; specifically, the variance of one filter output can be predicted from the activity of neighboring filter outputs. In our modified PoT, the interactions between filter outputs will be implemented by first squaring the filter outputs and subsequently introducing an extra layer of units, denoted by z. These units will be used to capture the dependencies between these squared filter outputs, z = W(y)^2 = W(Jx)^2,
and this is illustrated in Figure 1C. (Note that in the previous expression and in what follows, the use of (·)^2 with a vector argument will imply a component-wise squaring operation.) The modified energy function is

$$E(\mathbf{x}) = \sum_{i=1}^{M} \alpha_i \log\left(1 + \tfrac{1}{2}\sum_{j=1}^{K} W_{ij}(J_j\mathbf{x})^2\right), \qquad W \ge 0, \qquad (2.11)$$
where the nonnegative parameters W_{ij} model the dependencies between the activities {y_i^2}.⁴ Note that the forward mapping from x through y to z is completely deterministic and can be interpreted as a bottom-up neural network. We can also view the modified PoT as modeling constraint violations, but this time in terms of z, with violations now penalized according to the energy in equation 2.11. As with the standard PoT model, there is a reformulation of the hierarchical PoT model in terms of auxiliary variables, u,

$$P(\mathbf{x}, \mathbf{u}) \propto \exp\left[-\sum_{i=1}^{M}\left( u_i\left(1 + \tfrac{1}{2}\sum_{j=1}^{K} W_{ij}(J_j\mathbf{x})^2\right) + (1-\alpha_i)\log u_i\right)\right], \qquad (2.12)$$
with conditional distributions

$$P(\mathbf{u}\,|\,\mathbf{x}) = \prod_{i=1}^{M} \mathcal{G}_{u_i}\!\left(\alpha_i;\; 1 + \tfrac{1}{2}\sum_{j=1}^{K} W_{ij}(J_j\mathbf{x})^2\right), \qquad (2.13)$$
$$P(\mathbf{x}\,|\,\mathbf{u}) = \mathcal{N}_{\mathbf{x}}\!\left(0;\; (J^{T} V J)^{-1}\right), \qquad V = \mathrm{Diag}[W^{T}\mathbf{u}]. \qquad (2.14)$$
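A sketch (not from the article; the sizes, J, and W are arbitrary placeholders) of the hierarchical energy in equation 2.11 together with the parameters of these two conditionals:

```python
import numpy as np

rng = np.random.default_rng(1)
D, K, M = 4, 6, 6                     # inputs, filters, top-layer units (hypothetical)
J = rng.standard_normal((K, D))       # first-layer filters (placeholder values)
W = rng.random((M, K))                # nonnegative second-layer weights
alpha = np.full(M, 1.5)

def hpot_energy(x):
    # Hierarchical PoT energy, eq. 2.11: pool squared filter outputs with W
    z = W @ (J @ x) ** 2
    return np.sum(alpha * np.log1p(0.5 * z))

def conditionals(x, u):
    # Parameters of the Gibbs conditionals, eqs. 2.13 and 2.14
    rate = 1.0 + 0.5 * W @ (J @ x) ** 2     # gamma rates for u | x
    prec = J.T @ ((W.T @ u)[:, None] * J)   # gaussian precision for x | u, V = Diag[W^T u]
    return rate, prec

x = rng.standard_normal(D)
u = rng.gamma(shape=alpha)                  # one draw of the auxiliary variables
E = hpot_energy(x)
rate, prec = conditionals(x, u)
```

Setting W to the identity recovers the basic PoT conditionals of equations 2.8 and 2.9.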
Again, we note that this auxiliary variable representation supports an efficient Gibbs sampling procedure where all auxiliary variables u are sampled in parallel given the inputs x using equation 2.13, and all input variables x are sampled jointly from a multivariate gaussian distribution according to equation 2.14. As we will discuss in section 3.2, this is an important ingredient in training HPoT models from data using contrastive divergence. Finally, in a somewhat speculative link to computational neuroscience, in the following discussions we will refer to units, y, in the first hidden layer as simple cells and units, z, in the second hidden layer as complex cells. For simplicity, we will assume the number of simple and complex cells to be
4 For now, we implicitly assume that the number of first hidden-layer units (i.e., filters) is greater than or equal to the number of input dimensions. Models with fewer filters than input dimensions need some extra care, as noted in section 2.3.1. The number of top-layer units can be arbitrary, but for concreteness, we will work with an equal number of first-layer and top-layer units.
equal. There are no obstacles to using unequal numbers, but this does not appear to lead to any qualitatively different behavior. 2.3.1 Undercomplete HPoT Models. The HPoT models, as defined in section 2.3, were implicitly assumed to be complete or overcomplete. We may also wish to consider undercomplete models. These models can be interesting in a variety of applications where one seeks to represent the data in a lower-dimensional yet informative space. Undercomplete models need a little extra care in their definition, since in the absence of a proper noise model, they are unnormalizable over input space. In Welling, Agakov, & Williams (2003) and Welling, Zemel, & Hinton (2003, 2004), a natural solution to this dilemma was proposed where a noise model is added in directions orthogonal to all of the filters {J}. We note that it is possible to generalize this procedure to HPoT models, but in the interests of parsimony, we omit detailed discussion of undercomplete models in this article. 2.4 Topographic PoT Models. The modifications described next were inspired by a similar proposal in Hyvarinen et al. (2001) named topographic ICA. By restricting the interactions between the first and second layers of a HPoT model, we are able to induce a topographic ordering on the learned features. Such ordering can be useful for a number of reasons; for example, it may help with data visualization by concentrating feature activities in local regions. This restriction can also help in acting as a regularizer for the density models that we learn. Furthermore, it makes it possible to compare the topographic organization of features in our model (and based on the statistical properties of the data) to the organization found within cortical topographic maps. We begin by choosing a topology on the space of filters. This is most conveniently done by simply considering the filters to be laid out on a grid and considering local neighborhoods with respect to this layout. 
In our experiments, we use a regular square grid and apply toroidal boundary conditions to avoid edge effects. The complex cells receive input from the simple cells in precisely the same way as in our HPoT model, z_i = ∑_j W_{ij}(J_j x)^2, but now W is fixed and we assume it is chosen such that it computes a local average from the grid of filter activities. The free parameters that remain to be learned using contrastive divergence are {α_i, J}. In the following, we explain why the filters {J_i} should be expected to organize themselves topographically when learned from data. As noted previously, there are important dependencies between the activities of wavelet coefficients of filtered images. In particular, the variance (but not the mean) of one coefficient can be predicted from the value of a neighboring coefficient. The topographic PoT model can be interpreted as
an attempt to model these dependencies through a Markov random field on the activities of the simple cells. However, we have predefined the connectivity pattern and have left the filters to be determined through learning. This is the opposite of the strategy used in, for instance, Portilla, Strela, Wainwright, & Simoncelli (2003), where the wavelet transform is fixed and the interactions between wavelet coefficients are modeled. One possible explanation for the emergent topography is that the model will make optimal use of these predefined interactions if it organizes its simple cells such that dependent cells are nearby in filter space and independent ones are distant.5 A complementary explanation is based on the interpretation of the model as capturing complex constraints in the data. The penalty function for violations is designed such that (relative to a squared penalty) large violations are relatively mildly penalized. However, since the complex cells represent the average input from simple cells, their values would be well described by a gaussian distribution if the corresponding simple cells were approximately independent. (This is a consequence of the central limit theorem for sums of independent random variables.) In order to avoid a mismatch between the distribution of complex cell outputs and the way they are penalized, the model ought to position simple cells that have correlated activities near each other. In doing so, the model can escape the central limit theorem because the simple cell outputs that are being pooled are no longer independent. Consequently, the pattern of violations that arises is a better match to the pattern of violations one would expect from the penalizing energy function. Another way to understand the pressure toward topography is to ask how an individual simple cell should be connected to the complex cells in order to minimize the total cost caused by the simple cell's outputs on real data.
If the simple cell is connected to complex cells that already receive inputs from the simple cell's neighbors in position and spatial frequency, the images that cause the simple cell to make a big contribution will typically be those in which the complex cells it excites are already active, so its additional contribution to the energy will be small because of the gentle slope in the heavy tails of the cost function. Hence, since complex cells locally pool simple cells, local similarity of filters is expected to emerge.

2.5 Further Extensions to the Basic PoT Model. The parameters {α_i} in the definition of the PoT model control the sparseness of the activities of the complex and simple cells. For large values of α, the PoT model resembles a gaussian distribution more and more closely, while for small values there is a very sharp peak at zero in the distribution that decays quickly into fat tails.

[5] This argument assumes that the shape of the filters remains essentially unchanged (i.e., Gabor-like) by the introduction of the complex cells into the model. Empirically, we see that this is indeed the case.
Topographic Product Models Applied to Natural Scene Statistics
Figure 2: Functions f(x) = 1/(1 + |x|^β) for β = 5, 2, and 1/2.
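The β-dependence of this factor is easy to check numerically. The sketch below is our own illustration, not the authors' code:

```python
# Sketch (our illustration, not from the paper): the unnormalized factor
# f(x) = 1/(1 + |x|^beta) from Figure 2. Smaller beta gives a sharper peak
# at zero and heavier tails.

def f(x: float, beta: float) -> float:
    return 1.0 / (1.0 + abs(x) ** beta)

# Near zero, small beta drops off faster (sharper peak) ...
print(f(0.1, 0.5), f(0.1, 5.0))
# ... while far from zero it retains more mass (fatter tails).
print(f(8.0, 0.5), f(8.0, 5.0))
```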
In the HPoT model, the complex cell activities z are the result of linearly combining the (squared) outputs of the simple cells, y = Jx. The squaring operation is a somewhat arbitrary choice (albeit a computationally convenient and empirically effective one), and we may wish to process the first-layer activities in other ways before combining them in the second layer. In particular, we might consider modifications of the form activity = |Jx|^β, with |·| denoting absolute values and β > 0. Such a model defines a density in y-space of the form

p_y(y) = \frac{1}{Z(W, \alpha)} \exp\left( -\sum_{i=1}^{M} \alpha_i \log\left( 1 + \frac{1}{2} \sum_{j=1}^{K} W_{ij} |y_j|^\beta \right) \right).  (2.15)
A plot of the unnormalized distribution f(x) = 1/(1 + |x|^β) is shown in Figure 2 for three settings of the parameter β. One can observe that for smaller values of β, the peak at zero becomes sharper and the tails become fatter. In section 3, we show that sampling, and hence learning, with contrastive divergence can be performed efficiently for any setting of β.

3 Learning in HPoT Models

In this section, we explain how to perform maximum likelihood learning of the parameters for the models introduced in the previous section. In the case of complete and undercomplete PoT models, we are able to analytically
compute gradients; however, in the general case of overcomplete or hierarchical PoTs, we are required to employ an approximation scheme, and the preferred method in this article will be contrastive divergence (CD) (Hinton, 2002). Since CD learning is based on Markov chain Monte Carlo (MCMC) sampling, the appendix provides a discussion of sampling procedures for the various models we have introduced.

3.1 Maximum Likelihood Learning in HPoT Models. To learn the parameters θ = (J, W, α) (and β for the extended models), we will maximize the log likelihood of the model,

\theta_{ML} = \arg\max_\theta L = \arg\max_\theta \frac{1}{N} \sum_{n=1}^{N} \log p_x(x_n; \theta).  (3.1)
For models that have the Boltzmann form, p(x) = \frac{1}{Z} \exp[-E(x; \theta)], we can compute the following gradient,

\frac{\partial L}{\partial \theta} = E\left[ \frac{\partial E(x; \theta)}{\partial \theta} \right]_p - \frac{1}{N} \sum_{n=1}^{N} \frac{\partial E(x_n; \theta)}{\partial \theta},  (3.2)
where E[·]_p denotes expectation with respect to the model's distribution over x (this term comes from the derivatives of the log partition function, Z). For the parameters (J, W, α) in the PoT, we obtain the following derivative functions:

\frac{\partial E(x; \theta)}{\partial J_{jk}} = \sum_i \frac{\alpha_i W_{ij} (Jx)_j x_k}{1 + \frac{1}{2} \sum_{j'} W_{ij'} (Jx)_{j'}^2}  (3.3)

\frac{\partial E(x; \theta)}{\partial W_{ij}} = \frac{\frac{1}{2} \alpha_i (Jx)_j^2}{1 + \frac{1}{2} \sum_{j'} W_{ij'} (Jx)_{j'}^2}  (3.4)

\frac{\partial E(x; \theta)}{\partial \alpha_i} = \log\left( 1 + \frac{1}{2} \sum_j W_{ij} (Jx)_j^2 \right).  (3.5)
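A minimal numerical check of these derivatives (our sketch with made-up shapes, not the authors' code) uses the PoT energy E(x; θ) = Σ_i α_i log(1 + ½ Σ_j W_ij (Jx)_j²) and verifies equation 3.5 by central finite differences:

```python
import numpy as np

# PoT energy E(x) = sum_i alpha_i * log(1 + 0.5 * sum_j W_ij * (Jx)_j^2).
def pot_energy(x, J, W, alpha):
    y2 = (J @ x) ** 2                      # squared simple-cell outputs
    return float(alpha @ np.log1p(0.5 * (W @ y2)))

rng = np.random.default_rng(0)
J = rng.normal(size=(4, 3))                # 4 simple cells, 3 input dims
W = rng.uniform(size=(2, 4))               # 2 complex cells pool 4 simple cells
alpha = rng.uniform(1.0, 2.0, size=2)
x = rng.normal(size=3)

# Equation 3.5: dE/dalpha_i = log(1 + 0.5 * sum_j W_ij (Jx)_j^2).
analytic = np.log1p(0.5 * (W @ (J @ x) ** 2))

eps = 1e-6
for i in range(2):
    d = np.zeros(2)
    d[i] = eps
    numeric = (pot_energy(x, J, W, alpha + d)
               - pot_energy(x, J, W, alpha - d)) / (2 * eps)
    assert abs(numeric - analytic[i]) < 1e-6   # E is linear in alpha
```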
Once we have computed the gradients of the log likelihood, we can maximize it using any gradient-based optimization algorithm. Elegant as the gradients in equation 3.2 may seem, in the general case they are intractable to compute. The reason is the expectation in the first term of equation 3.2 over the model distribution. One may choose to approximate this average by running an MCMC chain to equilibrium with p(x; θ ) as its invariant distribution. However, there are (at least) two reasons that this might not be a good idea: (1) the Markov chain has to be run to equilibrium
for every gradient step of learning, and (2) we need a lot of samples to reduce the variance in the estimates. Hence, for the general case, we propose using the contrastive divergence learning paradigm, which is discussed next.

3.2 Training HPoT Models with Contrastive Divergence. For complete and undercomplete HPoT models, we can derive the exact gradient of the log likelihood with respect to the parameters J. In the complete case, these gradients turn out to be of the same form as the update rules proposed in Bell and Sejnowski (1995). However, the gradients for the parameters W and α are much harder to compute.[6] Furthermore, in overcomplete settings, the exact gradients with respect to all parameters are computationally intractable. We now describe an approximate learning paradigm to train the parameters in cases where evaluation of the exact gradients is intractable. Recall that the bottleneck in computing these gradients is the first term in equation 3.2. An approximation to these expectations can be obtained by running an MCMC sampler with p(x; J, W, α) as its invariant distribution and computing Monte Carlo estimates of the averages. As mentioned in section 3.1, this is a very inefficient procedure because it needs to be repeated for every step of learning, and a fairly large number of samples may be needed to reduce the variance in the estimates.[7] Contrastive divergence (Hinton, 2002) replaces the MCMC samples in these Monte Carlo estimates with samples from brief MCMC runs, which are initialized at the data cases. The intuition is that if the current model is not a good fit for the data, the MCMC particles will swiftly and consistently move away from the data cases. But if the data population represents a fair sample from the model distribution, then the average energy will not change when we initialize our Markov chains at the data cases and run them forward.
In general, initializing the Markov chains at the data and running them only briefly introduces bias but greatly reduces both variance and computational cost. Algorithm 1 summarizes the steps in this learning procedure:

Algorithm 1: Contrastive Divergence Learning

1. Compute the gradient of the energy with respect to the parameters, θ, and average over the data cases x_n.
2. Run MCMC samplers for k steps, starting at every data vector x_n, keeping only the last sample s_{n,k} of each chain.

[6] Although we can obtain exact derivatives for α in the special case where W is restricted to be the identity matrix.
[7] An additional complication is that it is hard to assess when the Markov chain has converged to the equilibrium distribution.
3. Compute the gradient of the energy with respect to the parameters, θ, and average over the samples s_{n,k}.
4. Update the parameters using

\Delta\theta = \frac{\eta}{N} \left( \sum_{\text{samples } s_{n,k}} \frac{\partial E(s_{n,k})}{\partial \theta} - \sum_{\text{data } x_n} \frac{\partial E(x_n)}{\partial \theta} \right),  (3.6)

where η is the learning rate and N the number of samples in each minibatch. For further details on contrastive divergence learning, we refer to the literature (Hinton, 2002; Teh et al., 2003; Yuille, 2004; Carreira-Perpinan & Hinton, 2005).

For highly overcomplete models, it often happens that some of the J_i filters (rows of J) decay to zero. To prevent this from happening, we constrain the L2-norm of these filters to be one: Σ_j J_ij² = 1 ∀i. Also, constraining the norm of the rows of the W matrix was helpful during learning. We choose to constrain the L1-norm to unity, Σ_j W_ij = 1 ∀i, which makes sense because W_ij ≥ 0.

We note that the objective function is not convex, and so the existence of poor local minima could be a concern. The stochastic nature of our gradient descent procedure may provide some protection against being trapped in shallow minima, although it has the concomitant price of being slower than noise-free gradient descent. We also note that the intractability of the partition function makes it difficult to obtain straightforward objective measures of model performance, since log probabilities can be computed only up to an unknown additive constant. This is not so much of a problem when one is using a trained model for, say, feature extraction, statistical image processing, or classification, but it does make explicit comparison with other models rather hard. (For example, there is no straightforward way to compare the densities provided by our overcomplete HPoT models with those from overcomplete ICA-style models.)

4 Experiments on Natural Images

There are several reasons to believe that the HPoT should be an effective model for capturing and representing the statistical structure in natural images; indeed, much of its form was inspired by the dependencies that have been observed in natural images. We have applied our model to small patches taken from digitized natural images. The motivation for this is several-fold.
First, it provides a useful test of the behavior of our model on a data set that we believe to contain sparse structure (and therefore to be well suited to our framework). Second, it allows us to compare our work with that from other authors and similar models, namely ICA. Third, it allows us to use our model framework
as a tool for interpreting results from neurobiology. Our method can complement existing approaches and also allows one to suggest alternative interpretations and descriptions of neural information processing. Section 4.2 presents results from complete and overcomplete single-layer PoTs trained on natural images. Our results are qualitatively similar to those obtained using ICA. In section 4.3 we demonstrate the higher-order features learned in our hierarchical PoT model, and in section 4.4 we present results from topographically constrained hierarchical PoTs. The findings in these two sections are qualitatively similar to the work by Hyvarinen et al. (2001); however, our underlying statistical model is different and allows us to deal more easily with overcomplete, hierarchical topographic representations.

4.1 Data Sets and Preprocessing. We performed experiments using standard sets of digitized natural images available on the World Wide Web from Aapo Hyvarinen[8] and Hans van Hateren.[9] The results obtained from the two different data sets were not significantly different, and for simplicity, all results reported here are from the van Hateren data set. To produce training data of a manageable size, small, square patches were extracted from randomly chosen locations in the images. As is common for unsupervised learning, these patches were filtered according to computationally well-justified versions of the sort of whitening transformations performed by the retina and lateral geniculate nucleus (LGN) (Atick & Redlich, 1992). First, we applied a log transformation to the raw pixel intensities. This procedure somewhat captures the contrast transfer function of the retina. It is not critical, but for consistency with past work, we incorporated it for the results presented here.
The extracted patches were subsequently normalized such that the mean intensity for a given pixel across the data set was zero, and also so that the mean intensity within each patch was zero, effectively removing the DC component from each input. The patches were then whitened, usually in conjunction with dimensionality reduction. This is a standard technique in many ICA approaches and speeds up learning without having much impact on the final results obtained.

4.2 Single-Layer PoT Models. Figure 3 illustrates results from our basic approach and shows for comparison results obtained using ICA. The data consisted of 150,000 patches of size 18 × 18 that were reduced to vectors of dimension 256 by projection onto the leading 256 eigenvectors of the data covariance matrix and then whitened to give unit variance along each axis.

4.2.1 Complete Models. We first present the results of our basic approach in a complete setting and display a comparison of the filters learned using
[8] http://www.cis.hut.fi/projects/ica/data/images/.
[9] http://hlab.phys.rug.nl/imlib/index.html.
Figure 3: Learned filters shown in the raw data space. Each small square represents a filter vector, plotted as an image. The gray scale of each filter display has been (symmetrically) scaled to saturate at the maximum absolute weight value. (A) Random subset of filters learned in a complete PoT model. (B) Random subset of filters learned in a complete ICA model. (C) Random subset of filters learned in a 1.7× overcomplete PoT model. (D) Random subset of filters learned in a 2.4× overcomplete PoT model.
our method with a set obtained from an equivalent ICA model learned using direct gradient ascent in the likelihood. We trained both models (learning just J, and keeping α fixed[10] at 1.5) for 200 passes through the entire data set of 150,000 patches. The PoT was trained using one-step contrastive
[10] This is the minimum value of α that allows us to have a well-behaved density model in the complete case. As α gets smaller than this, the tails of the distribution get heavier and heavier, and the variance and, eventually, the mean are no longer well defined.
divergence as outlined in section 3.2, and the ICA model was trained using the exact gradient of the log likelihood (as in Bell & Sejnowski, 1995, for instance). As expected, at the end of learning, the two procedures delivered very similar results, exemplars of which are given in Figures 3A and 3B. Furthermore, both sets of filters bear a strong resemblance to the types of simple cell receptive fields found in V1.

4.2.2 Overcomplete Models. We next consider our model in an overcomplete setting; this is no longer equivalent to any ICA model. In the PoT, overcomplete representations are simple generalizations of the complete case, and unlike causal generative approaches, the features are conditionally independent since they are given just by a deterministic mapping. To facilitate learning in the overcomplete setting, we have found it beneficial to make two modifications to the basic setup. First, we set α_i = α ∀i and make α a free parameter to be learned from the data. The learned value of α is typically less than 1.5 and gets smaller as we increase the degree of overcompleteness.[11] One intuitive way of understanding why this might be expected is the following. Decreasing α reduces the energy cost for violating the constraints specified by each individual feature; however, this is counterbalanced by the fact that in the overcomplete setting, we expect an input to violate more of the constraints at any given time. If α remains constant as more features are added, the mass in the tails may no longer be sufficient to model the distribution well. The second modification that we make is to constrain the L2-norm of the filters to l, making l another free parameter to be learned. If this modification is not made, there is a tendency for some of the filters to become very small during learning. Once this has happened, it is difficult for them to grow again since the magnitude of the gradient depends on the filter output, which in turn depends on the filter length.
The first manipulation simply extends the power of the model, but one could argue that the second manipulation is something of a fudge: if we have sufficient data, a good model, and a good algorithm, it should be unnecessary to restrict ourselves in this way. There are several counterarguments to this, the principal ones being: (1) we might be interested, from a biological point of view, in representational schemes in which the representational units all receive comparable amounts of input; (2) we can view it as approximate posterior inference under a prior belief that in an effective model, all the units should play a roughly equal part in defining the density and forming the representation. We note that a similar manipulation is also applied by most practitioners dealing with overcomplete ICA models (e.g., Olshausen & Field, 1996).
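Pulling together equation 3.3, the update of equation 3.6, and the row-norm constraints discussed above, a single training step might be sketched as follows. This is our own minimal illustration with made-up shapes and hyperparameters; in particular, the crude one-step noise "sampler" is a stand-in for the MCMC procedures of the appendix, not the authors' sampler:

```python
import numpy as np

rng = np.random.default_rng(1)

def dE_dJ(X, J, W, alpha):
    """Average of equation 3.3 over the rows (data cases) of X."""
    G = np.zeros_like(J)
    for x in X:
        y = J @ x
        denom = 1.0 + 0.5 * (W @ y**2)       # one entry per complex cell i
        coeff = (alpha / denom) @ W           # sum_i alpha_i W_ij / denom_i
        G += np.outer(coeff * y, x)           # [j, k] = coeff_j * y_j * x_k
    return G / len(X)

def cd1_step(X, J, W, alpha, eta=0.01, noise=0.1):
    # Steps 1-2: crude one-step "sampler" started at the data (a stand-in
    # for the real MCMC sampler).
    S = X + noise * rng.normal(size=X.shape)
    # Steps 3-4: parameter update, equation 3.6 specialized to J.
    J = J + eta * (dE_dJ(S, J, W, alpha) - dE_dJ(X, J, W, alpha))
    # Project rows of J back to unit L2 norm, as described above.
    J /= np.linalg.norm(J, axis=1, keepdims=True)
    return J

X = rng.normal(size=(32, 3))                  # toy minibatch
J = cd1_step(X, rng.normal(size=(4, 3)), rng.uniform(size=(2, 4)),
             np.full(2, 1.5))
print(np.linalg.norm(J, axis=1))              # each row has unit norm
```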
[11] Note that in an overcomplete setting, depending on the direction of the filters, α may be less than 1.5 and still yield a normalizable distribution overall.
In Figures 3C and 3D, we show example filters typical of those learned in overcomplete simulations. As in the complete case, we note that the majority of learned filters qualitatively match the linear receptive fields of simple cells found in V1. Like V1 spatial receptive fields, most (although not all) of the learned filters are well fit by Gabor functions. We analyzed in more detail the properties of filter sets produced by different models by fitting a Gabor function to each filter (using a least-squares procedure) and then looking at the population properties in terms of Gabor parameters.[12] Figure 4 shows the distribution of parameters obtained by fitting Gabor functions to complete and overcomplete filters. For reference, similar plots for linear spatial receptive fields measured in vivo are given in Ringach (2002) and van Hateren and van der Schaaf (1998). The plots are all reasonable qualitative matches to those for the "real" V1 receptive fields shown, for instance, in Ringach (2002). They also help to indicate the effects of representational overcompleteness. With increasing overcompleteness, the coverage in the spaces of location, spatial frequency, and orientation becomes denser and more uniform, while at the same time, the distribution of receptive field shapes remains unchanged. Further, the more overcomplete models give better coverage in lower spatial frequencies that are not directly represented in complete models. Ringach (2002) reports that the distribution of shapes from ICA or sparse coding can be a poor fit to the data from real cells, the main problem being that there are too few cells near the origin of the plot, which corresponds roughly to cells with smaller aspect ratios and small numbers of cycles in their receptive fields. The results that we present here appear to be a slightly better fit. (One source of the differences might be Ringach's choice of ICA prior.)
A large proportion of our fitted receptive fields are in the vicinity of the macaque results, although as we become more overcomplete, we see a spread farther away from the origin. In summary, our results from these single-layer PoT models can account for many of the properties of simple cell linear spatial receptive fields in V1.

[12] Approximately 5 to 10% of the filters failed to localize well in orientation or location—usually appearing somewhat like noise or checkerboard patterns—and were not well described by a Gabor function. These were detected during the parametric fitting process and were eliminated from our subsequent population analyses. It is unclear exactly what role these filters play in defining densities within our model.

4.3 Hierarchical PoT Models. We now present results from the hierarchical extension of the basic PoT model. In principle, we are able to learn both sets of weights—the top-level connections W and the lower-level connections J—simultaneously. However, effective learning in this full system has proved difficult when starting from random initial conditions. The results we present in this section were obtained by initializing W to the
Figure 4: A summary of the distribution of some parameters derived by fitting Gabor functions to receptive fields of three models with different degrees of overcompleteness in the representation size. The left-most column (A–E) is a complete representation, the middle column is 1.7× overcomplete, and the right-most column is 2.4× overcomplete. (A) Each dot represents the center location, in pixel coordinates within a patch, of a fitted Gabor. (B) Scatter plots showing the joint distribution of orientation (azimuthally) and spatial frequency in cycles per pixel (radially). (C) Histograms of Gabor fit phase (mapped to range 0◦ –90◦ since we ignore the envelope sign). (D) Histograms of the aspect ratio of the Gabor envelope (length/width). (E) A plot of “normalized width” versus “normalized length” (cf. Ringach, 2002). (F) For comparison, we include data from real macaque experiments (Ringach, 2002).
Figure 5: Each panel in this figure illustrates the theme represented by a different top-level unit. The filters in each row are arranged in descending order, from left to right, of the strength Wi j with which they connect to the particular top-layer unit.
identity matrix and first learning J, before subsequently releasing the W weights and then letting the system learn freely. This is therefore equivalent to initially training a single-layer PoT and then subsequently introducing a second layer. When models are trained in this way, the form of the first-layer filters remains essentially unchanged from the Gabor receptive fields shown previously. Moreover, we see interesting structure being learned in the W weights, as illustrated by Figure 5. The figure is organized to display the filters connected most strongly to a top-layer unit. There is a strong organization by what might be termed themes based on location, orientation, and spatial frequency. An intuition for this grouping behavior is as follows. There will be correlations between the squared outputs of some pairs of filters, and by having them feed into the same top-level unit, the model is able to capture this regularity. For most input images, all members of the group will have small combined activity, but for a few images, they will have significant combined activity. This is exactly what the energy function favors, as opposed to a grouping of very different filters that would lead to a rather gaussian distribution of activity in the top layer. Interestingly, these themes lead to responses in the top layer (if we examine the outputs z_i = W_i (Jx)^2) that resemble complex cell receptive fields. It can be difficult to accurately describe the response of nonlinear units in a network, and we choose a simplification in which we consider the response of the top-layer units to test stimuli that are gratings or Gabor patches. The test stimuli were created by finding the grating or Gabor stimulus that was most effective at driving a unit and then perturbing various parameters about this maximum. Representative results from such a characterization are shown in Figure 6.
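The phase invariance of such top-layer responses can be illustrated with a toy energy model (our own construction, not the trained network): a single complex cell that pools the squared outputs of a quadrature pair of Gabor-like filters responds almost identically to a grating at any phase:

```python
import numpy as np

# Toy illustration (ours, not the trained model): one top-layer unit pooling
# a quadrature pair of Gabor-like filters via z = W (Jx)^2 is nearly
# invariant to the phase of a grating stimulus.

n = 64
t = np.linspace(-np.pi, np.pi, n)
envelope = np.exp(-t**2)
J = np.stack([envelope * np.cos(4 * t),       # even-phase "simple cell"
              envelope * np.sin(4 * t)])      # odd-phase quadrature partner
W = np.ones((1, 2))                            # one complex cell pools both

def complex_response(phase):
    x = np.cos(4 * t + phase)                  # grating at a given phase
    return (W @ (J @ x) ** 2).item()           # z = W (Jx)^2

resp = [complex_response(p) for p in np.linspace(0, np.pi, 8)]
print(max(resp) / min(resp))                   # close to 1: phase invariant
```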
In comparison to the first-layer units, the top-layer units are considerably more invariant to phase and somewhat more invariant to position, while the sharpness of tuning to orientation and spatial frequency remains roughly unchanged. These results typify the properties that we see when we consider the responses of the second layer in our hierarchical model and are a striking match to the response properties of complex cells.
Figure 6: (A) Tuning curves for simple cells (i.e., first-layer units). (B) Tuning curves for complex cells (i.e., second-layer units). The tuning curves for phase, orientation, and spatial frequency were obtained by probing responses using grating stimuli; the curve for location was obtained by probing using a localized Gabor patch stimulus. The optimal stimulus was estimated for each unit, and then one parameter (phase, location, orientation, or spatial frequency) was varied, and the changes in responses were recorded. The response for each unit was normalized such that the maximum output was 1, before combining the data over the population. The solid line shows the population average (median of 441 units in a 1.7× overcomplete model), and the lower and upper dotted lines show the 10% and 90% centiles, respectively. We use the same style of display as Hyvarinen et al. (2001).
4.4 Topographic Hierarchical PoT Models. We next consider the topographically constrained form of the hierarchical PoT that we proposed in an attempt to induce spatial organization on the representations learned. The W weights are fixed and define local, overlapping neighborhoods on a square grid with toroidal boundary conditions. The J weights are free to learn, and the model is trained as usual. Representative results from such a simulation are given in Figure 7. The inputs were patches of size 25 × 25, whitened and dimensionality reduced to vectors of size 256; the representation is 1.7× overcomplete. By simple inspection of the filters in Figure 7A, we see that there is strong local continuity in the receptive field properties of orientation, spatial frequency, and location, with little continuity of spatial phase. With notable similarity to experimentally observed cortical topography, we see pinwheel singularities in the orientation map and a low-frequency cluster in the spatial frequency map, which seems to be somewhat aligned with one of the pinwheels. While the map of location (retinotopy) shows good local structure, there is poor global structure. We suggest that this may be due to the relatively small scale of the model and the use of toroidal boundary conditions (which eliminated the need to deal with edge effects).
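A fixed W of this kind might be built as follows. This is our sketch of the construction described above; the 3 × 3 neighborhood and the L1-normalized rows follow the text, and the rest is an assumption:

```python
import numpy as np

# Sketch (our reading of the construction above): a fixed topographic W
# whose rows are overlapping 3x3 neighborhoods on a square grid of
# top-level units, with toroidal (wrap-around) boundary conditions.

def toroidal_neighborhood_W(side, radius=1):
    n = side * side
    W = np.zeros((n, n))
    for i in range(n):
        r, c = divmod(i, side)
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                # wrap row/column indices around the torus
                j = ((r + dr) % side) * side + (c + dc) % side
                W[i, j] = 1.0
    return W / W.sum(axis=1, keepdims=True)    # L1-normalize each row

W = toroidal_neighborhood_W(16)                 # 256 units on a 16x16 torus
print(W.shape, int((W[0] > 0).sum()))           # each row touches 9 units
```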
Figure 7: An example of a filter map. (The gray scale is saturating in each cell independently.) This model was trained on 25 × 25 patches that had been whitened and dimensionality reduced to 256 dimensions, and the representation layer is 1.7 × overcomplete in terms of the inputs. The neighborhood size was a 3 × 3 square (i.e., eight nearest neighbors). We see a topographically ordered array of learned filters with local continuity of orientation, spatial frequency, and location. The local variations in phase seem to be random. Considering the map for orientation, we see evidence for pinwheels. In the map for spatial frequency, there is a distinct low-frequency cluster.
5 Relation to Earlier Work 5.1 Gaussian Scale Mixtures. We can consider the complete version of our model as a gaussian scale mixture (GSM; Andrews & Mallows, 1974;
Wainwright & Simoncelli, 2000; Wainwright, Simoncelli, & Willsky, 2000) with a particular (complicated) form of scaling function.[13] The basic form for a GSM density on a variable, g, can be given as follows (Wainwright & Simoncelli, 2000):

p_{GSM}(g) = \int_{-\infty}^{\infty} \frac{1}{(2\pi)^{N/2} |cQ|^{1/2}} \exp\left( -\frac{g^T (cQ)^{-1} g}{2} \right) \phi_c(c) \, dc,  (5.1)
where c is a nonnegative scalar variate and Q is a positive definite covariance matrix. This is the distribution that results if we draw c from φ_c(c) and a variable v from a multidimensional gaussian N_V(0, Q) and then take g = √c v. Wainwright et al. (2000) discuss a more sophisticated model in which the distributions of coefficients in a wavelet decomposition for images are described by a GSM that has a separate scaling variable, c_i, for each coefficient. The c_i have a Markov dependency structure based on the multiresolution tree that underlies the wavelet decomposition.

In the complete setting, where the y variables are in linear one-to-one correspondence with the input variables, x, we can interpret the distribution p(y) as a gaussian scale mixture. To see this, we first rewrite p(y, u) = p(y|u)p(u), where the conditional p(y|u) = \prod_j N_{y_j}[0, (\sum_i W_{ij} u_i)^{-1}] is gaussian (see equation 2.14). The distribution p(u) needs to be computed by marginalizing p(x, u) in equation 2.12 over x, resulting in

p(u) = \frac{1}{Z_u} \prod_i e^{-u_i} u_i^{\alpha_i - 1} \prod_k \left( \sum_j W_{jk} u_j \right)^{-1/2},  (5.2)
where the partition function Z_u ensures normalization. We see that the marginal distribution of each y_i is a gaussian scale mixture in which the scaling variate for y_i is given by c_i(u) = (\sum_j W_{ji} u_j)^{-1}. The neighborhoods defined by W in our model play an analogous role to the tree-structured cascade process in Wainwright et al. (2000) and determine the correlations between the different scaling coefficients. However, a notable difference in this respect is that the GSM model assumes a fixed tree structure for the dependencies, whereas our model is more flexible in that the interactions through the W parameters can be learned. The overcomplete version of our PoT is not so easily interpreted as a GSM because the {y_i} are no longer independent given u, nor is the distribution over x a simple GSM, due to the way in which u is incorporated into the
[13] In simple terms, a GSM density is one that can be written as a (possibly infinite) mixture of gaussians that differ only in the scale of their covariance structure. A wide range of distributions can be expressed in this manner.
covariance matrix (see equation 2.9). However, much of the flavor of a GSM remains. 5.2 Relationship to tICA. In this section we show that in the complete case, the topographic PoT model is isomorphic to the model optimized (but not the one initially proposed) by Hyvarinen et al. (2001) in their work on topographic ICA (tICA). These authors define an ICA generative model in which the components or sources are not completely independent but have a dependency that is defined with relation to some topology, such as a toroidal grid—components close to one another in this topology have greater codependence than those that are distantly separated. Their generative model is shown schematically in Figure 8. The first layer takes a linear combination of variance-generating variables, t, and then passes them through some nonlinearity, φ(·), to give positive scaling variates, σ . These are then used to set the variance of the sources, s, and conditioned on these scaling variates, the components in the second layer
Figure 8: Graphical model for topographic ICA (Hyvarinen et al., 2001). First, the variance-generating variables, t_i, are generated independently from their prior. They are then linearly mixed through the matrix H, before being nonlinearly transformed using the function φ(·) to give the variances, σ_i = φ(H_i^T t), for each of the sources, i. Values for these sources, s_i, are then generated from independent zero-mean distributions with variances σ_i, before being linearly mixed through matrix A to give the observables x_i.
are independent. These sources are then linearly mixed to give the observables, x. The joint density for (s, t) is given by

p(s, t) = \prod_i p_{s_i}\!\left( \frac{s_i}{\phi(H_i^T t)} \right) \frac{p_{t_i}(t_i)}{\phi(H_i^T t)},  (5.3)
and the log likelihood of the data given the parameters is

L(B) = \sum_{\text{data } x} \log \int \prod_i p_{s_i}\!\left( \frac{B_i^T x}{\phi(H_i^T t)} \right) \frac{p_{t_i}(t_i)}{\phi(H_i^T t)} \, |\det B| \, dt,  (5.4)
where B = A^{-1}. As noted in their article, this likelihood is intractable to compute because of the integral over possible states of t. This prompts the authors to derive an approach that makes various simplifications and approximations to give a lower bound on the likelihood. First, they restrict the form of the base density for s to be gaussian,14 both t and H are constrained to be nonnegative, and φ(·) is taken to be (·)^{-1/2}. This yields the following expression for the marginal density of s,

p(s) = \int \frac{1}{(2\pi)^{d/2}} \exp\!\left(-\frac{1}{2} \sum_k t_k \sum_i H_{ik} s_i^2\right) \prod_i p_{t_i}(t_i) \sqrt{H_i^T t} \, dt.    (5.5)
This expression is then simplified by the approximation

H_i^T t \approx H_{ii} t_i.    (5.6)
While this approximation may not always be a good one, it is a strict lower bound on the true quantity and thus allows a lower bound on the likelihood as well. Their final approximate likelihood objective, L(B), is then given by

L(B) = \sum_{\text{data}} \left[\sum_{j=1}^d G\!\left(\sum_{i=1}^d H_{ij} \left(B_i^T x\right)^2\right) + \log|\det(B)|\right],    (5.7)
14 Their model can therefore be considered a type of GSM, although the authors do not comment on this.
where the form of the scalar function G is given by

G(\tau) = \log \int \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{1}{2} t \tau\right) p_t(t) \sqrt{H_{ii} t} \, dt.    (5.8)
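As a numerical sketch, the approximate objective of equations 5.7 and 5.8 can be evaluated directly for a square model. The default G used here is the choice that this section maps onto the PoT, G(τ) = log(1 + τ/2); the function name and toy dimensions are ours:

```python
import numpy as np

def tica_objective(B, H, X, G=lambda tau: np.log(1.0 + 0.5 * tau)):
    """Approximate tICA log likelihood (equation 5.7).
    B: (d, d) filter matrix; H: (d, d) nonnegative topography matrix;
    X: (n, d) array of data vectors; G: the scalar function of eq. 5.8,
    here defaulting to the PoT-style log(1 + tau/2)."""
    S2 = (X @ B.T) ** 2   # (B_i^T x)^2 for each datum and each filter i
    tau = S2 @ H          # sum_i H_{ij} (B_i^T x)^2 for each unit j
    return np.sum(G(tau)) + X.shape[0] * np.log(abs(np.linalg.det(B)))
```

This makes explicit that, once G is fixed, the objective depends on the data only through the squared filter outputs pooled by H.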
The results obtained by Hyvärinen and Hoyer (2001) and Hyvärinen et al. (2001) are very similar to those presented here in section 4. These authors also noted the similarity between elements of their model and the response properties of simple and complex cells in V1. Interestingly, the optimization problem that they actually solve (i.e., maximization of equation 5.7), rather than the one they originally propose, can be mapped directly onto the optimization problem for a square, topographic PoT model if we take B ≡ J_PoT, H ≡ W_PoT, and G(τ) = log(1 + τ/2). More generally, we can construct an equivalent, square energy-based model whose likelihood optimization corresponds exactly to the optimization of their approximate objective function. In this sense, we feel that our perspective has some advantages. First, we have a more accurate picture of what model we actually (try to) optimize. Second, we are able to move more easily to overcomplete representations. If Hyvärinen et al. (2001) were to make their model overcomplete, there would no longer be a deterministic relationship between their sources s and x. This additional complication would make the already difficult problems of inference and learning significantly harder. Third, in the HPoT framework, we are able to learn the top-level weights W in a principled way using the techniques discussed in section 3.2, whereas current tICA approaches have treated only fixed local connectivity.

5.3 Relationship to Other ICA Extensions. Karklin and Lewicki (2003, 2005) also propose a hierarchical extension to ICA that involves a second hidden layer of marginally independent, sparsely active units. Their model is of the general form proposed in Hyvärinen et al. (2001) but uses a different functional dependency between the first and second hidden layers from that employed in the topographic ICA model that Hyvärinen et al. (2001) fully develop.
In the generative pass from Karklin and Lewicki’s model, linear combinations of the second-layer activities are fed through an exponential function to specify scaling or variance parameters for the first hidden layer. Conditioned on these variances, the units in the first hidden layer are independent and behave like the hidden variables in a standard ICA model. This model can be described by the graph in Figure 8, where the transfer function φ(·) is given by an exponential. Using the notation of this figure, the relevant distributions are

p(t_i) = \frac{q_i \exp(-|t_i|^{q_i})}{2\,\Gamma(q_i^{-1})}    (5.9)
\sigma_j = c\,e^{[Ht]_j}    (5.10)

p(s_j \mid \sigma_j) = \frac{q_j}{2 \sigma_j \Gamma(q_j^{-1})} \exp\!\left(-\left|\frac{s_j}{\sigma_j}\right|^{q_j}\right)    (5.11)

x_k = [As]_k.    (5.12)
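Ancestral sampling from equations 5.9 to 5.12 is straightforward. The sketch below draws the generalized gaussian variates of equations 5.9 and 5.11 via the standard construction |t| = Gamma(1/q)^{1/q}; that construction is an implementation choice of ours, not taken from the article:

```python
import numpy as np

def sample_karklin_lewicki(A, H, q_t, q_s, c=1.0, rng=None):
    """One draw from the Karklin-Lewicki generative model
    (equations 5.9-5.12). A generalized gaussian with density
    proportional to exp(-|t|^q) is sampled as a signed Gamma(1/q)^(1/q)."""
    rng = np.random.default_rng() if rng is None else rng
    d = H.shape[1]
    mag_t = rng.gamma(1.0 / q_t, size=d) ** (1.0 / q_t)
    t = mag_t * rng.choice([-1.0, 1.0], size=d)          # equation 5.9
    sigma = c * np.exp(H @ t)                            # equation 5.10
    mag_s = rng.gamma(1.0 / q_s, size=d) ** (1.0 / q_s)
    s = sigma * mag_s * rng.choice([-1.0, 1.0], size=d)  # equation 5.11
    return A @ s                                         # equation 5.12
```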
The authors have so far considered only complete models, and in this case, as with tICA, the first layer of hidden variables is deterministically related to the observables.15 To link this model to our energy-based PoT framework, we first consider the following change of variables:

B = A^{-1}    (5.13)
K = H^{-1}    (5.14)
\nu_j = \sigma_j^{-1}.    (5.15)
Then, considering the q variables to be fixed, we can write the energy function of their model as

E(x, \nu) = \sum_i \left|\sum_k K_{ik} \log(c \nu_k)\right|^{q_i} + \sum_j \left(\log \nu_j + |\nu_j|^{q_j} \, |[Bx]_j|^{q_j}\right).    (5.16)
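A direct transcription of the energy in equation 5.16, with the q exponents held fixed as in the text; the helper name and default c are ours:

```python
import numpy as np

def kl_energy(x, nu, B, K, q, c=1.0):
    """Energy E(x, nu) of equation 5.16 for the reparameterized
    Karklin-Lewicki model. q is a fixed vector of exponents."""
    t_term = np.sum(np.abs(K @ np.log(c * nu)) ** q)
    Bx = B @ x
    s_term = np.sum(np.log(nu) + np.abs(nu) ** q * np.abs(Bx) ** q)
    return t_term + s_term
```

The Boltzmann distribution over (x, nu) built from this energy then recovers the joints and marginals of the original causal specification.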
When we take the Boltzmann distribution with the energies defined in equation 5.16, we recover the joints and marginals specified by Karklin and Lewicki (2003, 2005). While the overall models are different, there are some similarities between this formulation and the auxiliary variable formulation of extended HPoT models (i.e., equation 2.12 with generalized exponent β from section 2.5). Viewed from an energy-based perspective, they both have the property that an energy “penalty” is applied to (a magnitude function of) a linear filtering of the data. The scale of this energy penalty is given by a supplementary set of random variables that are themselves subject to an additional energy function. As with standard ICA, in overcomplete extensions of this model, the similarities to an energy-based perspective would be further reduced. We note as an aside that it might be interesting to consider the energy-based overcomplete extension of Karklin and Lewicki’s model, in addition to the standard causal overcomplete extension. In the overcomplete version of the
15 Furthermore, as well as focusing their attention on the complete case, the authors assume the first-level weights are fixed to a set of filters obtained using regular ICA.
causal model, inference would likely be much more difficult because of posterior dependencies both within and between the two hidden layers. For the overcomplete energy-based model, the necessary energy function appears not to be amenable to efficient Gibbs sampling, but parameters could still be learned using contrastive divergence and Monte Carlo methods such as hybrid Monte Carlo.

5.4 Representational Differences Between Causal Models and Energy-Based Models. As well as specifying different probabilistic models, overcomplete energy-based models (EBMs) such as the PoT differ from overcomplete causal models in the types of representation they (implicitly) entail. This has interesting consequences when we consider the population codes suggested by the two types of model. We focus on the representation in the first layer (simple cells), although similar arguments might be made for deeper layers as well. In an overcomplete causal model, many configurations of the sources are compatible with a configuration of the input.16 For a given input, a posterior distribution is induced over the sources in which the inferred values for different sources are conditionally dependent. As a result, even for models that are linear in the generative direction, the formation of a posterior representation in overcomplete causal models is essentially nonlinear, and, moreover, it is nonlocal due to the lack of conditional independence. This implies that unlike EBMs, inference in overcomplete causal models is typically iterative, often intractable, and therefore time-consuming. Also, although we can specify the basis functions associated with a unit, it is much harder to specify any kind of feedforward receptive field in causal models.
The issue of how such a posterior distribution could be encoded in a representation remains open; a common postulate (made on the grounds of efficient coding) is that a maximum a posteriori (MAP) representation should be used, but we note that even computing the MAP value is usually iterative and slow. Conversely, in overcomplete EBMs with deterministic hidden units such as we have presented in this article, the mapping from inputs to representations remains simple and noniterative and requires only local information. In Figure 9 we use a somewhat absurd example to schematically illustrate a salient consequence of this difference between EBMs and causal models that have sparse priors. The array of vectors in Figure 9A should be understood to be either a subset of the basis functions in an overcomplete causal model or a subset of the filters in overcomplete PoT model. In Figure 9B, we show four example input images. These have been chosen to be identical to four of the vectors shown in Figure 9A. The left-hand column of Figure 9C shows the representation responses of the units in an EBM-style model for
16 In fact, strictly speaking, there is a subspace of compatible source configurations.
Figure 9: Representational differences between overcomplete causal models and overcomplete deterministic EBMs. (A) The 11 vectors in this panel should be considered as the vectors associated with a subset of representational units in either an overcomplete EBM or an overcomplete causal model. In the EBM, they would be the feedforward filter vectors; in the causal model, they would be basis functions. (B) Probe stimuli. These images exactly match the vectors associated with units 4, 5, 6, and 1. (C) The left-hand column shows the normalized responses in an EBM model of the 11 units, assuming they are filters. The right-hand column shows the normalized responses from the units, assuming that they are basis functions in a causal model with a sparse prior and that we have formed a representation by taking the MAP configuration for the source units.
these four inputs; the right-hand column shows the MAP responses from an overcomplete causal model with a sparse source prior. This is admittedly an extreme case, but it provides a good illustration of the point we wish to make. More generally, although representations in an overcomplete PoT are sparse, there is also some redundancy; the PoT population response is typically less sparse (Willmore & Tolhurst, 2001) than that of a causal model with an equivalent prior.
Interpreting the two models as a description of neural coding, one might expect the EBM representation to be more robust to the influences of neural noise as compared with the representation suggested from a causal approach. Furthermore, the EBM style representation is shiftable; it has the property that for small changes in the input, there are small changes in the representation. This property would not necessarily hold for a highly overcomplete causal model. Such a discontinuous representation might make subsequent computations difficult and nonrobust, and it also seems somewhat at odds with the neurobiological data; however, proper comparison is difficult since there is no real account of dynamic stimuli or spiking in either model. At present, it remains unclear which type of model, causal or energy based, provides the more appropriate description of coding in the visual system, especially since there are many aspects that neither approach captures.
6 Summary

We have presented a hierarchical energy-based density model that we suggest is generally applicable to data sets that have a sparse structure or can be well characterized by constraints that are often well satisfied but occasionally violated by a large amount. By applying our model to natural scene images, we are able to provide an interpretational account for many aspects of receptive field and topographic map structure within primary visual cortex, while also developing sensible high-dimensional population codes. Deterministic features (i.e., the first- and second-layer filter outputs) within our model play a key role in defining the density of a given image patch, and we are able to draw a close connection between these features and the responses of simple cells and complex cells in V1. Furthermore, by constraining our model to interact locally, we are able to provide some computational motivation for the forms of the cortical maps for retinotopy, phase, spatial frequency, and orientation. While our model is closely related to some previous work, most prominently Hyvärinen et al. (2001), it bestows a different interpretation on the learned features, is different in its formulation, and describes rather different statistical relations in the overcomplete case. We present our model both as a general alternative tool to ICA for describing sparse data distributions and as an alternative interpretive account for some of the neurobiological observations from the mammalian visual system. Finally, we suggest that the models outlined here could be used as a starting point for image processing applications such as denoising or deblurring and that they might also be adapted to time-series data such as natural audio sequences.
Appendix: Sampling in HPoT Models

A.1 Complete Models. We start our discussion with sampling in complete HPoT models. In this case, there is a simple invertible relationship between x and y, implying that we may focus on sampling y and subsequently transforming these samples back to x-space through x = J^{-1} y. Unfortunately, unless W is diagonal, all y variables are coupled through W, which makes it difficult to devise an exact sampling procedure. Hence, we resort to Gibbs sampling using equation 2.13, where we replace y_j = J_j x to acquire a sample u|y. To obtain a sample y|u, we convert equation 2.9 into

P(y|u) = \mathcal{N}_y\!\left[y;\, 0,\, \text{Diag}[W^T u]^{-1}\right].    (A.1)
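A sketch of the alternating Gibbs chain for a complete model follows. Equation 2.13 is not reproduced in this excerpt, so the gamma form assumed below for P(u|y), with shape α_i and rate 1 + ½[W y²]_i, is the usual PoT choice and should be treated as an assumption rather than a transcription:

```python
import numpy as np

def gibbs_hpot(J, W, alpha, n_sweeps=100, rng=None):
    """Alternating Gibbs sampler for a complete HPoT model (section A.1).
    Assumes u | y is gamma (the standard PoT conditional); y | u follows
    equation A.1. Returns a sample of x = J^{-1} y."""
    rng = np.random.default_rng() if rng is None else rng
    m = W.shape[1]
    y = rng.standard_normal(m)
    for _ in range(n_sweeps):
        rate = 1.0 + 0.5 * (W @ y**2)        # assumed gamma rate per u unit
        u = rng.gamma(shape=alpha, scale=1.0 / rate)
        var = 1.0 / (W.T @ u)                 # Diag[W^T u]^{-1}, equation A.1
        y = rng.normal(0.0, np.sqrt(var))     # factorized, sampled in parallel
    return np.linalg.solve(J, y)              # x = J^{-1} y
```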
We iterate this process (alternately sampling u ∼ P(u|y) and y ∼ P(y|u)) until the Gibbs sampler has converged. Note that both P(u|y) and P(y|u) are factorized distributions, implying that both the u and y variables can be sampled in parallel.

A.2 Overcomplete Models. In the overcomplete case, we are no longer allowed to first sample the y variables and subsequently transform them into x-space. The reason is that the deterministic relation y = Jx means that when there are more y variables than x variables, some y configurations are not allowed (i.e., they are not in the range of the mapping x → Jx with x ∈ R). If we sample y, all these samples (with probability one) will have some components in these forbidden dimensions, and it is unclear how to transform them correctly into x-space. An approximation is obtained by projecting the y-samples into x-space using \tilde{x} = J^{\#} y. We have often used this approximation in our experiments and have obtained good results, but we note that its accuracy is expected to decrease as we increase the degree of overcompleteness. A more expensive but correct sampling procedure for the overcomplete case is to use a Gibbs chain in the variables u and x (instead of u and y) by using equations 2.13 and 2.14 directly. In order to sample x|u, we need to compute a Cholesky factorization of the inverse-covariance matrix of the gaussian distribution P(x|u),

R^T R = J^T V J, \qquad V = \text{Diag}[W^T u].    (A.2)
The samples x|u are now obtained by first sampling from a multivariate standard normal distribution, n ∼ \mathcal{N}_n[n; 0, I], and subsequently setting x = R^{-1} n. The reason this procedure is expensive is that R depends on u, which changes at each iteration. Hence, the expensive Cholesky factorization and inverse have to be computed at each iteration of Gibbs sampling.
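The x|u step of the overcomplete chain can be sketched as follows, writing the precision of x as J^T Diag[W^T u] J (the ordering that makes the dimensions work out); the helper name is ours:

```python
import numpy as np

def sample_x_given_u(J, W, u, rng=None):
    """Draw x ~ P(x|u) for an overcomplete HPoT model (section A.2).
    The precision of the gaussian P(x|u) is J^T V J with V = Diag[W^T u];
    we factor it as R^T R and set x = R^{-1} n for standard normal n."""
    rng = np.random.default_rng() if rng is None else rng
    V = np.diag(W.T @ u)
    # np.linalg.cholesky returns lower-triangular L with L L^T = A,
    # so R = L^T gives the upper-triangular factor with R^T R = A.
    R = np.linalg.cholesky(J.T @ V @ J).T
    n = rng.standard_normal(J.shape[1])
    return np.linalg.solve(R, n)   # x = R^{-1} n
```

This factorization must be redone for every new u, which is exactly the cost noted in the text.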
A.3 Extended PoT Models. The sampling procedures for the complete and undercomplete extended models discussed in section 2.5 are very similar, apart from the fact that the conditional distribution P(y|u) is now given by

P_{\text{ext}}(y|u) \propto \prod_{i=1}^{M} \exp\!\left(-\frac{1}{2} V_{ii} |y_i|^{\beta}\right), \qquad V = \text{Diag}[W^T u].    (A.3)
Efficient sampling procedures exist for this generalized gaussian-Laplace probability distribution. In the overcomplete case, it has proved more difficult to devise an efficient Gibbs chain (the Cholesky factorization is no longer applicable), but the approximate projection method using the pseudo-inverse, J^{\#}, still seems to work well.

Acknowledgments

We thank Peter Dayan and Yee Whye Teh for important intellectual contributions to this work and many other researchers for helpful discussions. The work was funded by the Gatsby Charitable Foundation, the Wellcome Trust, NSERC, CFI, and OIT. G.E.H. holds a Canada Research Chair.

References

Andrews, D., & Mallows, C. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society, 36, 99–102.
Atick, J. J., & Redlich, A. N. (1992). What does the retina know about natural scenes? Neural Computation, 4(2), 196–210.
Bell, A. J., & Sejnowski, T. J. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Bell, A. J., & Sejnowski, T. J. (1997). The “independent components” of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Carreira-Perpinan, M., & Hinton, G. (2005). On contrastive divergence learning. In R. Cowell & Z. Ghahramani (Eds.), Proceedings of the Society for Artificial Intelligence and Statistics. Barbados. Available online at http://www.gatsby.ucl.ac.uk/aistats.
Freund, Y., & Haussler, D. (1992). Unsupervised learning of distributions of binary vectors using 2-layer networks. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 4 (pp. 912–919). San Francisco: Morgan Kaufmann.
Heskes, T. (1998). Selecting weighting factors in logarithmic opinion pools. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
Hinton, G., & Teh, Y. (2001). Discovering multiple constraints that are frequently approximately satisfied. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (pp. 227–234). San Francisco: Morgan Kaufmann.
Hoyer, P. O., & Hyvärinen, A. (2000). Independent component analysis applied to feature extraction from colour and stereo images. Network: Computation in Neural Systems, 11(3), 191–210.
Hyvärinen, A., & Hoyer, P. O. (2001). A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41(18), 2413–2423.
Hyvärinen, A., Hoyer, P. O., & Inki, M. (2001). Topographic independent component analysis. Neural Computation, 13(7), 1527–1558.
Karklin, Y., & Lewicki, M. S. (2003). Learning higher-order structures in natural images. Network: Computation in Neural Systems, 14, 483–499.
Karklin, Y., & Lewicki, M. S. (2005). A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural Computation, 17, 397–423.
Lewicki, M., & Sejnowski, T. (2000). Learning overcomplete representations. Neural Computation, 12, 337–365.
Marks, T. K., & Movellan, J. R. (2001). Diffusion networks, products of experts, and factor analysis (Tech. Rep. UCSD MPLab TR 2001.02). San Diego: University of California, San Diego.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–610.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23), 3311–3325.
Portilla, J., Strela, V., Wainwright, M., & Simoncelli, E. P. (2003). Image denoising using scale mixtures of gaussians in the wavelet domain. IEEE Trans. Image Processing, 12(11), 1338–1351.
Ringach, D. L. (2002). Spatial structure and symmetry of simple-cell receptive fields in macaque primary visual cortex. J. Neurophysiol., 88(1), 455–463.
Simoncelli, E. (1997). Statistical models for images: Compression, restoration and synthesis. In Proceedings of the 31st Asilomar Conference on Signals, Systems and Computers. Pacific Grove, CA: IEEE Computer Society.
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1: Foundations. New York: McGraw-Hill.
Teh, Y., Welling, M., Osindero, S., & Hinton, G. (2003). Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research [Special Issue], 4, 1235–1260.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. R. Soc. Lond. B Biol. Sci., 265(1394), 359–366.
Wainwright, M. J., & Simoncelli, E. P. (2000). Scale mixtures of gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 855–861). Cambridge, MA: MIT Press.
Wainwright, M. J., Simoncelli, E. P., & Willsky, A. S. (2000). Random cascades of gaussian scale mixtures and their use in modeling natural images with application to denoising. In Proceedings of the 7th International Conference on Image Processing. Vancouver, BC: IEEE Computer Society.
Welling, M., Agakov, F., & Williams, C. (2003). Extreme components analysis. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Welling, M., Hinton, G., & Osindero, S. (2002). Learning sparse topographic representations with products of student-t distributions. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Welling, M., Zemel, R., & Hinton, G. (2002). Self-supervised boosting. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Welling, M., Zemel, R., & Hinton, G. (2003). A tractable probabilistic model for projection pursuit. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. San Francisco: Morgan Kaufmann.
Welling, M., Zemel, R., & Hinton, G. (2004). Probabilistic sequential independent components analysis. IEEE Transactions on Neural Networks [Special Issue], 15, 838–849.
Williams, C. K. I., & Agakov, F. (2002). An analysis of contrastive divergence learning in gaussian Boltzmann machines (Tech. Rep. EDI-INF-RR-0120). Edinburgh: School of Informatics, University of Edinburgh.
Williams, C., Agakov, F., & Felderhof, S. (2001). Products of gaussians. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Willmore, B., & Tolhurst, D. J. (2001). Characterizing the sparseness of neural codes. Network: Computation in Neural Systems, 12(3), 255–270.
Yuille, A. (2004). A comment on contrastive divergence (Tech. Rep.). Los Angeles: Department of Statistics and Psychology, UCLA.
Zhu, S. C., Wu, Y. N., & Mumford, D. (1998). Filters, random fields and maximum entropy (FRAME): Towards a unified theory for texture modeling. International Journal of Computer Vision, 27(2), 107–126.
Received December 20, 2004; accepted August 1, 2005.
LETTER
Communicated by Peter Földiák
A Simple Hebbian/Anti-Hebbian Network Learns the Sparse, Independent Components of Natural Images

Michael S. Falconbridge
[email protected] School of Psychology, University of Western Australia, Nedlands WA 6009, Australia
Robert L. Stamps
[email protected] School of Physics, University of Western Australia, Nedlands WA 6009, Australia
David R. Badcock
[email protected] School of Psychology, University of Western Australia, Nedlands WA 6009, Australia
Slightly modified versions of an early Hebbian/anti-Hebbian neural network are shown to be capable of extracting the sparse, independent linear components of a prefiltered natural image set. An explanation for this capability in terms of a coupling between two hypothetical networks is presented. The simple networks presented here provide alternative, biologically plausible mechanisms for sparse, factorial coding in early primate vision.

1 Introduction

Retinal photoreceptor activity is typically high in awake, behaving primates. Yet based on data describing average energy consumption by human cortex and the energetic cost of neural activity, Lennie (2003) recently calculated that the fraction of active cortical neurons is very low. The upper bound proposed by Lennie is 1%, even in so-called active cortical areas (those areas that light up during fMRI, for example). Single-cell recordings in V1—the first visual cortical area—reveal that these neurons tend to have very low activity most of the time and high activity only rarely (although more often than a gaussian activity distribution would predict) (Baddeley et al., 1997). Thus, very early in the cortical stages of processing, a sparse representation of the dense sensory input is achieved. How is sparse coding implemented in the primate brain? In a landmark paper, Olshausen and Field (1996) produced a successful model of V1 simple cells based on the assumptions of independence of simple cell responses, sparseness, and low information loss. Other modelers showed that the Gabor-like receptive fields produced by implementing

Neural Computation 18, 415–429 (2006)
© 2005 Massachusetts Institute of Technology
M. Falconbridge, R. Stamps, and D. Badcock
the Olshausen model also result from applying independent component analysis (ICA) with the assumption of leptokurtic (highly peaked at zero) data sources (Hurri, Hyvärinen, & Oja, 1996; Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998; van Hateren & Ruderman, 1998).1 It was shown that the two methods of ICA and what might be termed SCA (sparse component analysis) are actually equivalent when it comes to modeling perception of natural scenes (e.g., Bell & Sejnowski, 1997). In this article, we show that two slightly modified versions of an early Hebbian/anti-Hebbian sparse coding algorithm (Földiák, 1990) are also capable of extracting the sparse, independent components of natural images. The first version is, in all respects, the same as the original Földiák algorithm except that the binary constraint on the outputs is removed; the second is a more biologically plausible version. In the discussion, a mathematical explanation for why the algorithms achieve ICA and SCA is presented.

2 Földiák’s Network

Földiák’s original SCA algorithm combines three simple, biologically inspired learning rules. Hebbian learning occurs on the feedforward connections denoted w_ij. Anti-Hebbian learning occurs on lateral connections u_ij between the output units. A sensitivity term s_i assigned to each output unit automatically adjusts to encourage a certain low level of activity among the outputs. Although no overall objective function has been presented for the algorithm, an intuitive explanation for the three respective learning rules is that they (1) ensure that the feedforward connections capture the interesting (high-variance-producing) features in the data set, (2) discourage correlations between outputs, which forces them to represent different features, and (3) encourage sparse, distributed activity among the output units. The first Földiák algorithm presented here may be described in a stepwise fashion as follows.
The architecture of the network used to implement the algorithm is depicted in Figure 1.

1. The prefiltered image x^q is presented to the network. For both versions of the network, prefiltering was performed according to the process described in Olshausen and Field (1997), which approximates the effect of processing by retinal and lateral geniculate nucleus (LGN) neural circuits (see Olshausen & Field, 1997, and Atick & Redlich, 1992, where a similar function is compared with the effect of retina/LGN processing). Natural images were taken from http://www.cis.hut.fi/projects/ica/data/images.
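A common implementation of the Olshausen and Field (1997) prefilter combines whitening with a low-pass cutoff via the frequency-domain gain f · exp(−(f/f₀)⁴); the cutoff value and function name below are assumptions of ours, not taken from this article:

```python
import numpy as np

def prefilter(img, f0=200.0):
    """Whitening/low-pass prefilter in the style of Olshausen & Field
    (1997): multiply the image spectrum by f * exp(-(f/f0)^4), where f
    is radial frequency in cycles per image. Constants are assumed."""
    n = img.shape[0]
    fx = np.fft.fftfreq(n) * n                     # frequencies in cycles/image
    F = np.sqrt(fx[:, None] ** 2 + fx[None, :] ** 2)
    gain = F * np.exp(-(F / f0) ** 4)              # zero gain at DC
    return np.real(np.fft.ifft2(np.fft.fft2(img) * gain))
```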
1 The Gabor function referred to is a two-dimensional version—an oriented, two-dimensional gaussian envelope modulated by a sinusoidal function along one (typically the short) axis.
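The two-dimensional Gabor function described in the footnote can be generated as follows; the parameterization (envelope widths, phase convention) is ours:

```python
import numpy as np

def gabor2d(size, freq, theta, phase=0.0, sigma_x=None, sigma_y=None):
    """Oriented 2D Gabor: a gaussian envelope modulated by a sinusoid
    along one axis. freq is in cycles per image; theta is the
    orientation in radians. Default envelope widths are assumptions."""
    sigma_x = sigma_x if sigma_x is not None else size / 6.0
    sigma_y = sigma_y if sigma_y is not None else size / 3.0
    ax = np.arange(size) - (size - 1) / 2.0
    X, Y = np.meshgrid(ax, ax)
    xr = X * np.cos(theta) + Y * np.sin(theta)     # rotated coordinates
    yr = -X * np.sin(theta) + Y * np.cos(theta)
    env = np.exp(-0.5 * ((xr / sigma_x) ** 2 + (yr / sigma_y) ** 2))
    return env * np.cos(2 * np.pi * freq * xr / size + phase)
```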
Hebbian/Anti-Hebbian Network Performs Sparse Coding and ICA
Figure 1: Representation of the architecture of the Foldiak network. The network is actually single layered but has been separated into two parts to demonstrate that it consists of a linear, feedforward subnetwork that feeds into a recurrent inhibition layer. Hebbian learning occurs on the feedforward connections and anti-Hebbian learning on the lateral connections. Each σ is a weighted sum of its inputs. yi is calculated as in equation 2.2.
2. The responses of filters (rows of a matrix W, denoted w_i) are calculated:

\sigma^q = W x^q, \quad \text{or} \quad \sigma_i^q = \sum_j w_{ij} x_j^q.    (2.1)
3. The outputs are calculated:

y_i^q = f\!\left(\sigma_i^q + \sum_j u_{ij} y_j^q + s_i\right).    (2.2)
f is the activation function. u_ij = u_ji represents the connection strength between output units i and j, which is never greater than zero for any (i, j) and is zero when i = j. s_i is a sensitivity term associated with output unit i. Note that the solution to equation 2.2 cannot be calculated in one step. The output values may be acquired by finding the equilibrium solution to the differential (with respect to iteration number) equation:

\dot{y}_i^q = f\!\left(\sigma_i^q + \sum_j u_{ij} y_j^q + s_i\right) - y_i^q.    (2.3)
4. The weights and sensitivity terms are updated according to the following rules. The Hebbian rule for feedforward weights is

\Delta w_{ij} = \eta_1 y_i (x_j - y_i w_{ij}).    (2.4)

The anti-Hebbian rule for lateral weights is

\Delta u_{ij} = -\eta_2 (y_i y_j - \alpha^2) \quad (\text{if } i = j \text{ or } u_{ij} > 0, \text{ then } u_{ij} := 0).    (2.5)
And the sensitivity update rule is

\Delta s_i = \eta_3 (\alpha - y_i),    (2.6)
where the η_m are small, positive numbers representing the learning rates for the various parameters, and α is the desired average activity for all output units. Alternatively, the variables may be updated using average values after the presentation of a batch of, say, 100 images. This is, in practice, an advantageous approach, as it leads to smoother changes in the net.

5. q is iterated, and steps 1 to 5 are repeated.

Note that all equations shown here are exactly the same as those of the original Földiák (1990) algorithm except in two ways: (1) the last term of the Δw_ij rule has been changed from −η_1 y_i w_ij, which effectively fixes |w_i| when y_i is binary, to −η_1 y_i y_i w_ij, which is more effective for doing the same thing when y_i is continuous; and (2) the equation forcing outputs greater than 0.5 to go to one and those less than 0.5 to go to zero has been removed.

3 Simulations and Results

3.1 The First Network. In line with the original Földiák network, this network uses a sigmoidal activation function,

f(x) = \frac{1}{1 + e^{-\beta x}},    (3.1)
where β is a “steepness” parameter. The function constrains outputs to lie between zero and one. In keeping with the original Földiák network, inputs were also constrained to be positive. This was achieved by simultaneously presenting (within one vector) a rectified version of the original prefiltered image and a rectified negative version of the image. This represents filtering through on and off channels in the retina/LGN system (Schiller, 1992). Having positive inputs and positive outputs produces positive feedforward connections via the Hebbian learning rule. (See Hoyer, 2003, for additional arguments supporting positive nets.) Figure 2 shows a set of 144 image components learned by the first network. They were calculated by taking the first half of each row2 of W (representing connections from on units) and subtracting the second half (representing connections from off units). This reverses the rectification step

2 Note that the rows of W were referred to earlier as filters through which images are passed to get values for σ. This is so, but the learning rule for w_ij ensures that these rows come to represent components of the image data set. This is typical for ICA and SCA algorithms. Filter and component are used interchangeably to refer to a row of W.
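Steps 2 to 4 of the algorithm, with the sigmoid of equation 3.1, can be sketched as a single training step. The fixed-step settling of equation 2.3 and the default learning rates (taken from the parameter values reported for Figure 2) are our implementation choices:

```python
import numpy as np

def foldiak_step(x, W, U, s, alpha=0.005, beta=200.0,
                 etas=(3.0, 2.4, 1.5), n_settle=50, dt=0.1):
    """One training step of the continuous Foldiak network
    (equations 2.1-2.6 with activation 3.1). Settling of equation 2.3
    uses simple fixed-step iteration, an assumed solver choice."""
    f = lambda a: 1.0 / (1.0 + np.exp(-beta * a))
    sigma = W @ x                                         # equation 2.1
    y = f(sigma + s)
    for _ in range(n_settle):                             # equation 2.3
        y += dt * (f(sigma + U @ y + s) - y)
    e1, e2, e3 = etas
    W += e1 * np.outer(y, x) - e1 * (y**2)[:, None] * W   # equation 2.4
    U -= e2 * (np.outer(y, y) - alpha**2)                 # equation 2.5
    U = np.minimum(U, 0.0)                                # u_ij never positive
    np.fill_diagonal(U, 0.0)                              # u_ii = 0
    s += e3 * (alpha - y)                                 # equation 2.6
    return W, U, s, y
```

Starting U at zero and symmetric keeps u_ij = u_ji under rule 2.5, since the update is symmetric in i and j.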
Hebbian/Anti-Hebbian Network Performs Sparse Coding and ICA
Figure 2: An example set of 144 components produced by the continuous Foldiak network.
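The sigmoidal activation of equation 3.1 and the on/off input encoding described above can be sketched as follows (a minimal illustration; the array values are placeholders, not the simulation settings):

```python
import numpy as np

def sigmoid(x, beta=200.0):
    """Equation 3.1: constrains outputs to lie between zero and one;
    beta is the "steepness" parameter."""
    return 1.0 / (1.0 + np.exp(-beta * np.asarray(x, dtype=float)))

def on_off_channels(image):
    """Encode a signed, prefiltered image as one all-positive vector:
    a rectified (on) copy concatenated with a rectified negative (off) copy."""
    flat = np.asarray(image, dtype=float).ravel()
    return np.concatenate([np.maximum(flat, 0.0), np.maximum(-flat, 0.0)])

image = np.array([[0.5, -0.2], [-1.0, 0.3]])
x = on_off_channels(image)        # all entries are nonnegative
n = image.size
recovered = x[:n] - x[n:]         # subtracting off from on undoes the rectification
```

Subtracting the off half from the on half is the same operation used to project the learned connection strengths back into the original image space for Figure 2.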
used to produce positive inputs, which projects the connection strengths back into the original prefiltered image space. The set of components in the figure was produced using the parameters α = .005, β = 200, η_2 = 2.4, and η_3 = 1.5. η_1 was varied from 3 (10^6 iterations) to 1.5 (400,000 iterations) to .725 (10^6 iterations). This represents a gradual decrease in the plasticity of the feedforward connections, although a similar result can be obtained by maintaining η_1 = 3 for 10^6 iterations. There is a tendency toward Gabor-like functions. This is consistent with the general form of the components found by existing ICA and SCA algorithms. There are obvious differences, though, between some of the components in Figure 2 and other published components. There is a subset of components that appear to contain a small number of Gabor-like features (e.g., the component in row 3, column 10). Note, though, that during learning of natural scenes, the multiple-Gabor components tend to fluctuate, whereas the single-Gabor components remain stable, even when η_1 is large. This suggests that the single Gabor-like functions represent optimal solutions. In order to analyze the properties of the components, 2D Gabor functions were fit to each component. Fits that produced relatively high chi-square values were not included in the analysis (5 multiple-Gabor components
M. Falconbridge, R. Stamps, and D. Badcock
Figure 3: The distribution of the 139 Gabor components in frequency space (.) compared with that for Olshausen and Field’s (1997) components (o).
out of the total 144 components were lost). The distribution of peak spatial frequencies and orientations for the fits is depicted in Figure 3. In the same figure, the corresponding distribution for the Olshausen and Field (1997) components is shown for comparison. The distribution for their components is slightly more spread, and the peak is at a higher frequency. This is more obvious in Figure 4, where orientation information in the data has been removed. The histograms for the two component sets are directly comparable, as each bin represents the same frequency in terms of cycles per image. They are compared in the diagram to a histogram of peak spatial frequency sensitivities for a set of 53 macaque simple cells (De Valois, Albrecht, & Thorell, 1982). In making the comparison, we chose to align the upper cutoff frequency of the cell data with the theoretical (Nyquist) upper limit for the component data. This approximately equates one pixel with a single photoreceptor. Using this point of reference, the peak of the Földiák network's distribution lies closer to the simple cell peak than does the peak for the Olshausen data.

3.2 A More Biologically Plausible Network. For the more plausible network, the activation function was chosen to more closely reflect biological data. The function of choice was a one-sided hyperbolic ratio function:

f(c) = \begin{cases} \dfrac{c^{\beta}}{c^{\beta} + \gamma^{\beta}} & \text{if } c \geq 0; \\ 0 & \text{if } c < 0. \end{cases} \qquad (3.2)
Figure 4: Distribution of peak spatial frequencies for the modified Földiák (MF) network components, the Olshausen and Field (1997) components (O&F), and a sample of 53 simple cells (De Valois et al., 1982) (DeV). Horizontal axis: log of peak spatial frequency (normalized to upper cutoff); vertical axis: normalized "cell" count.
This function was shown by Albrecht and Hamilton (1982) to describe well the response of simple cells to increasing stimulus contrast and has since been used extensively to fit cell response data (Gottschalk, 2002). A common value of β for simple cell data is 2.0 (Gottschalk, 2002); this was used in our simulations. γ was varied for each simulation to allow exponential-like y histograms, in keeping with the response distributions of V1 cells to natural images (Baddeley et al., 1997; Gallant & Vinje, 2000). The positivity constraint on the inputs was removed so that the filter response distributions were approximately symmetrical around zero. With this constraint removed, negative inputs come to represent positive inputs from "off" LGN cells and positive inputs represent inputs from "on" cells. Essentially, the model is equivalent to a positive one. In either case, the conditions for an output unit response, and thus the occurrence of learning, include the matching of both input pattern and sign to the pattern and sign (or, equivalently, the on-off specificity) of the feedforward connections. The removal of the positivity constraint was necessary to allow the model to produce exponential output distributions. If inputs were constrained to be positive, the output distributions tended to be highly peaked at zero and slightly peaked at one, and to represent little to no activity in between.
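Equation 3.2 can be written down directly; β = 2 follows the text, while the value of γ used here is only illustrative:

```python
import numpy as np

def hyperbolic_ratio(c, beta=2.0, gamma=0.3):
    """Equation 3.2: one-sided hyperbolic ratio function. gamma is the
    semi-saturation constant, since f(gamma) = 0.5; negative inputs give 0."""
    c = np.asarray(c, dtype=float)
    out = np.zeros_like(c)
    pos = c > 0
    out[pos] = c[pos] ** beta / (c[pos] ** beta + gamma ** beta)
    return out

contrasts = np.array([-0.5, 0.0, 0.3, 0.6, 1.0])
responses = hyperbolic_ratio(contrasts)   # saturates toward 1 at high contrast
```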
Figure 5: The receptive fields of a set of output units. These are calculated by mapping the responses of units to a point source applied to each point in the input array.
Figure 5 depicts an example set of receptive field profiles for the output units. The receptive fields were mapped out by recording the responses of output units to a single white pixel (the point source) placed at each point in the input array. The receptive fields depicted were the result of a simulation where α = 0.01 and γ = 0.3. The learning rate for the feedforward connections varied from 1.25 to 0.63 (representing a decrease in plasticity over time), for the lateral connections it remained at 0.1, and for the sensitivity term it varied from 0.2 to 0.1 (representing increased stability of units in the face of fluctuating input levels) over a total of 10^6 image presentations. Note that a value of .01 for α means that the approximate percentage of active outputs at any one time is equal to the upper limit of 1% predicted by Lennie (2003) for active cortical areas. The distribution of receptive fields in frequency space is similar to the distribution for the sigmoidal model's components depicted in Figure 3 in that there is an even spread across orientations and a similar concentration in the midfrequency range. Thus, the relation of the distribution to that of De Valois et al.'s cells is similar.
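The point-source mapping used for Figure 5 can be sketched as follows. This is a simplified illustration: lateral connections are omitted, and the weight matrix, sensitivity terms, and sizes are random placeholders rather than trained values.

```python
import numpy as np

def sigmoid(x, beta=200.0):
    return 1.0 / (1.0 + np.exp(-beta * x))

def map_receptive_fields(W, s, side, beta=200.0):
    """Present a single white pixel at every input position and record each
    output unit's response (feedforward part only)."""
    n_units, n_inputs = W.shape
    rfs = np.zeros((n_units, n_inputs))
    for j in range(n_inputs):
        probe = np.zeros(n_inputs)
        probe[j] = 1.0                       # the point source
        rfs[:, j] = sigmoid(W @ probe + s, beta)
    return rfs.reshape(n_units, side, side)

rng = np.random.default_rng(6)
W = 0.1 * rng.standard_normal((16, 64))      # placeholder feedforward weights
s = -0.5 * np.ones(16)                       # placeholder sensitivity terms
rfs = map_receptive_fields(W, s, side=8)
```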
In an attempt to define the contrast response function of our output units, the responses of some randomly selected units to images (including images of the units' best-fitting Gabors) of varying contrast were measured. In line with physiological data, the response to increasing contrast is very similar to the response functions imposed on these units and is thus well modeled by the hyperbolic ratio function. In summary, compared with the first network, the second model has a more plausible choice of response function for output units and more plausible output activity distributions (i.e., resembling more closely distributions of activity in real visual systems). The level of activity among output units is similar to that determined by Lennie, and the contrast response functions of output units resemble those of real simple cells. The receptive fields as measured using a point source quantitatively resemble simple cell receptive fields measured in the same way.

4 Discussion

4.1 How Do the Algorithms Work? For the purpose of a comparison with the Olshausen generative model (see Olshausen & Field, 1997), assume there is a set of variables for the Földiák net that can be compared to the vector a, which represents a set of coefficients for the components of an image x. For the Földiák generative model, these coefficients are assumed to be positive (no subtraction of components is allowed when generating an image); the reason for this assumption is explained later. Since the pattern of feedforward connections to any particular output unit, for both the Földiák and the Olshausen net, represents an input component (rather than an optimal filter for that component), the generative model uses these connection strengths to generate images. The generative equation is thus

x = W^{T} a \quad \text{or} \quad x_j = \sum_{k} w_{kj} a_k. \qquad (4.1)
The Földiák net produces a variable σ that is calculated thus:

\sigma_i = \sum_{j} w_{ij} x_j. \qquad (4.2)
Combining equations 4.1 and 4.2 produces

\sigma_i = \sum_{j} w_{ij} \sum_{k} w_{kj} a_k = a_i \sum_{j} w_{ij}^2 + \sum_{k \neq i} a_k \sum_{j} w_{ij} w_{kj}.
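The decomposition obtained by combining equations 4.1 and 4.2 can be checked numerically with random placeholders for W and a:

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_pixels = 5, 12
W = rng.standard_normal((n_units, n_pixels))   # rows w_i are the components
a = rng.standard_normal(n_units)               # hypothetical coefficients

x = W.T @ a        # equation 4.1: generate an image from the components
sigma = W @ x      # equation 4.2: feedforward responses

# a_i |w_i|^2 plus cross-talk from all other components (the derived identity).
G = W @ W.T                                    # Gram matrix, G[i, k] = w_i . w_k
rhs = a * np.diag(G) + (G - np.diag(np.diag(G))) @ a
```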
Now, σ_i is used as input to a fully interconnected network that produces outputs y. Incorporating the previous relationship into equation 2.2 produces

y_i = f\left( a_i |w_i|^2 + \sum_{k \neq i} \sum_{j} w_{ij} w_{kj} a_k + \sum_{k \neq i} u_{ik} y_k + s_i \right), \qquad (4.3)
where |w_i|^2 = \sum_j w_{ij}^2. We can use this equation to determine the statistical relationship between the hypothetical coefficient a_i and the network output y_i. This is done using the fact that the probability distribution for the variables a is related to the distribution for y by

P(a_i) = P(y_i) \left| \frac{dy_i}{da_i} \right|. \qquad (4.4)
Because of the steep sigmoidal-like shape of f (large steepness parameters were used for both the sigmoidal and hyperbolic ratio functions), the derivative dy_i/da_i is sharply peaked when the argument of f is equal to zero. Assuming P(y_i) is nonzero in this vicinity (substantiated later), P(a_i) is a narrow gaussian-like distribution that peaks where

a_i |w_i|^2 + \sum_{k \neq i} \sum_{j} w_{ij} w_{kj} a_k = -s_i - \sum_{k \neq i} u_{ik} y_k. \qquad (4.5)
The activation function thus effectively produces a low-noise mapping between the outputs of two hypothetical networks: the first represented by the left-hand side and the second by the right-hand side of equation 4.5. The properties of the second net are examined next. The learning rule for s_i (see equation 2.6) attempts to make ȳ_i = α. If it can achieve this, s_i itself goes to zero (fluctuations about zero may occur in order to keep ȳ_i at the desired level). If not zero, s_i tends to be negative when α is small because, in general, y_i tends to be larger than α. The learning rule for u_ik (see equation 2.5) tries to push correlations between y_i and y_k to the value of α^2. By achieving this, u_ik itself goes to zero. If not zero, u_ik is always negative. If α is appropriately chosen (i.e., if the level of correlations and the level of average activity it represents is attainable by the second net), then the right-hand side of equation 4.5 will tend to be positive but will approach zero in time. The implications for the first, generative net are as follows. As |w_i| is essentially fixed by the Földiák w_ij rule, the average of a_i (assumed positive) is pushed toward zero. In general, ā_i cannot be zero if the generative model is going to produce actual images,
so the right-hand-side net going to zero also means that the rows of W are forced to become orthogonal, as this minimizes the sum over j in the second left-hand-side term of equation 4.5. An effective cost function may be produced for the linear generative net based on these observations, that is,

C = \left\langle \sum_{i} (x_i - \hat{x}_i)^2 + \lambda_1 \sum_{i} S_1(a_i) + \lambda_2 \sum_{ij} S_2(|w_i \cdot w_j|) \right\rangle_{q},
with |w_i|^2 fixed, where λ_1 and λ_2 are constants and S_1 and S_2 are "sparseness" functions like that used in Olshausen's net to penalize large values of the argument. The first term encourages accurate image reconstruction, the second encourages small values for the component coefficients, and the third encourages the rows of W to point in orthogonal directions. This cost function represents a summary of all of the forces present in the hypothetical Földiák generative network. The first term embodies the generative model assumption that reconstructed images are to be as close to the originals as possible. The second and third terms represent the effect of coupling the generative model to the network represented by the right-hand side of equation 4.5. The actual forms of the functions S_1 and S_2 are not obvious from the coupling. Any function that is smooth, increases monotonically when the argument is greater than zero, and is flat close to zero should approximate the effect of the coupling in equation 4.5. Flattening near zero reflects the fact that the right-hand side of equation 4.5 approaches zero quickly when it is large and more slowly as it gets closer to zero. It is also in line with the notion that when either a_i or |w_i · w_j| is zero, there should be no forces on them (to make them either larger or negative), as the aim is to make them as small as possible and neither a_i nor |w_i · w_j| is allowed to be negative. An appropriate choice for S_1 and S_2 might be S(x) = x^2. The cost function is the same as the Olshausen cost function except that it has an extra term that encourages the image components to be orthogonal. This term does not oppose the actions of the others in any way; rather, it helps ensure that the components represent unique aspects of the image data set. This explains why our learned components and those of the Olshausen net have similar properties.
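With the quadratic choice S(x) = x^2 suggested above, the proposed cost function can be evaluated directly. This is a sketch: the λ values are arbitrary, the reconstruction follows the linear generative model x̂ = W^T a, and the |w_i|^2 terms are excluded from the overlap penalty, as the text holds them fixed.

```python
import numpy as np

def generative_cost(X, A, W, lam1=0.1, lam2=0.1):
    """Summed over images (rows of X): reconstruction error, a quadratic
    sparseness penalty S(a_i) = a_i**2, and a quadratic penalty on the
    overlaps w_i . w_j between distinct components."""
    X_hat = A @ W                          # row q of X_hat is W^T a for image q
    recon = np.sum((X - X_hat) ** 2)
    sparse = lam1 * np.sum(A ** 2)
    G = W @ W.T                            # pairwise overlaps w_i . w_j
    off_diag = G - np.diag(np.diag(G))     # the |w_i|^2 terms are held fixed
    ortho = lam2 * np.sum(off_diag ** 2)
    return recon + sparse + ortho

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 9))
A = rng.standard_normal((20, 4))           # coefficients for 20 images
X = A @ W                                  # a perfectly reconstructable image set
```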
Note that the above cost function does not by itself lead to all of the learning rules present in the Foldiak algorithm, as the lateral connections and sensitivity terms do not appear in the function. They have to be introduced into the model via the coupling we have described in order to get the right type of rules.
4.1.1 Choosing α. The parameter α must be chosen not only to allow the right-hand side of equation 4.5 to approach zero; it must also allow the Földiák network to produce appropriate outputs y. The update rule for w_ij is the Oja (1982) update rule, so the final weight matrix minimizes the Oja cost function (see footnote 3), which, for multiple Oja component analyzers, is

C = \left\langle \sum_{i,j} (x_i - \hat{x}_{ij})^2 \right\rangle_{q}, \quad \text{where} \quad \hat{x}_{ij} = w_{ji} y_j. \qquad (4.6)
According to this equation each output unit produces its own “reconstructed” image xˆ i = wi yi , which is compared to the original x. Averaged over all input images and all reconstructed images produced for each input image, the aim is for the reconstructed images to look as much like the corresponding input images as possible in a least-squares sense. The rows of W thus learn to represent significant components of the image set. The aim of SCA is to use each component as rarely as possible while keeping an average like the one in the Oja cost function low. The aim of ICA is to keep component coefficients as independent from one another as possible. Both of these constraints translate to making α as small as possible, as doing so makes y¯ i small for all i and minimizes correlations between the outputs. Competing with this is the fact that α must also be large enough to allow the right-hand side of equation 4.5 to approach zero. 4.1.2 Supplementary Notes. Note that only correlations between the outputs are removed by the ui j rule, but higher-order dependencies than this are discouraged among the linear component coefficients a. This is because Hebbian learning on the feedforward connections between the inputs x and the nonlinear outputs y constitutes nonlinear Hebbian learning, where statistical relationships of higher order than correlations are absorbed into the rows of W (Karhunen & Joutsensalo, 1994; Sudjianto & Hassoun, 1995). It was stated earlier that a is assumed to be positive. This holds because y is always positive, which means that by the Hebbian learning rule, the components that emerge are those that are added to make the images— not subtracted. It follows then that the coefficients a i are positive. Also, it was required that P(yi ) is nonzero around the points where the relation in equation 4.5 holds. The left-hand side is simply σi , so this equates to
Footnote 3: The cost function is not minimized with respect to the outputs y, as these are constrained in certain ways. If the constraints (including the nonlinear activation function) were removed, then each row of W would point to the principal component of the data set (assuming there was one after prefiltering).
requiring

P\left( \sum_{j} u_{ij} y_j = -s_i - \sigma_i \right) > 0. \qquad (4.7)
Note the following: y_j is always positive or zero, u_ij is always negative or zero, s_i tends to be negative or zero, and the distribution P(σ_i) sits in the region σ > 0 for a positive network and is centered on zero and relatively spread for a mixed-sign network. It is very likely, then, that relationship 4.7 holds.

4.2 Possible Biological Mechanism. The attraction of a nonlinear Hebbian/anti-Hebbian network like that presented here is its biological plausibility. The occurrence of Hebbian-like mechanisms leading to long-term changes (long-term potentiation, LTP) in the brains of a range of animals is fairly well substantiated (see Brown, Kairiss, & Keenan, 1990, for a review). Its occurrence in perceptual areas of the brain is often assumed (e.g., Rauschecker, 1991). Katz and Shatz (1996) cite evidence for the occurrence of LTP in the visual cortex of a range of animals at various ages. A biologically plausible mechanism for the anti-Hebbian learning is as follows. Neighboring pyramidal cells responsible for passing information from and to other brain areas or layers are connected via inhibitory interneurons. Such neurons constitute 20% of the neural population in cortex (Thomson & Deuchars, 1997) and are known to connect cells whose receptive fields lie near one another in visual space (Blakemore, Carpenter, & Georgeson, 1970; Budd & Kisvarday, 2001). Anti-Hebbian learning translates in this scheme to Hebbian learning on the inhibitory interneuron/pyramidal synapses. Evidence for a cellular origin of adaptation such as that seen in our models, via the use of an adapting sensitivity term, can be found in Carandini (2000). This is in line with psychophysical and brain imaging results showing that adaptation is specific to the cells activated by the repeated stimuli (e.g., Boynton & Finney, 2003).
5 Conclusion

Two slightly modified versions of Földiák's (1990) Hebbian/anti-Hebbian network have been used to extract the sparse, independent linear components of a prefiltered natural image set. The second version in particular has a number of aspects in common with the primate visual system, including receptive fields qualitatively similar to those of simple cells, response functions and activity distributions qualitatively similar to those of visual cells, output units that adapt to the prevailing magnitude of the input signal,
and lateral inhibitory connections like those known to exist in the primate visual system. The algorithms work because the nonlinear activation function links a hypothetical linear generative model to certain self-adjusting components of the network. Changes in the network produce changes in the generative model. An effective cost function for the generative network was suggested. It is essentially the same as the Olshausen and Field (1997) cost function with an additional term to encourage orthogonal components. A possible biological mechanism based on the models uses inhibitory interneurons and Hebbian learning to achieve the proposed anti-Hebbian learning.

Acknowledgments

We thank the reviewers for their helpful comments on earlier drafts of the article. This work was supported by ARC grants A00000836 and DP0346084 (D.R.B.) and an Australian Postgraduate Award (M.F.).

References

Albrecht, D. G., & Hamilton, D. B. (1982). Striate cortex of monkey and cat: Contrast response function. Journal of Neurophysiology, 41, 217–237.
Atick, J. J., & Redlich, A. N. (1992). What does the retina know about natural scenes? Neural Computation, 4, 196–210.
Baddeley, R., Abbott, L. F., Booth, M. C. A., Sengpiel, F., Freeman, T., Wakeman, E., & Rolls, E. T. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proceedings of the Royal Society of London B, 264, 1775–1783.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Blakemore, C., Carpenter, R. H. S., & Georgeson, M. A. (1970). Lateral inhibition between orientation detectors in the human visual system. Nature, 228, 37–39.
Boynton, G. M., & Finney, E. M. (2003). Orientation-specific adaptation in human visual cortex. Journal of Neuroscience, 23(25), 8781–8787.
Brown, T. H., Kairiss, E. W., & Keenan, C. L. (1990). Hebbian synapses: Biophysical mechanisms and algorithms. Annu. Rev. Neurosci., 13, 475–511.
Budd, J. M. L., & Kisvarday, Z. F. (2001). Local lateral connectivity of inhibitory clutch cells in layer 4 of cat visual cortex (area 17). Experimental Brain Research, 140, 245–250.
Carandini, M. (2000). Visual cortex: Fatigue and adaptation. Current Biology, 10(16), R1–R3.
De Valois, R. L., Albrecht, D. G., & Thorell, L. G. (1982). Spatial frequency selectivity of cells in macaque visual cortex. Vision Research, 22, 545–559.
Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biol. Cybern., 64, 165–170.
Gallant, J. L., & Vinje, W. E. (2000). Sparse coding and decorrelation in primary visual cortex during natural vision. Science, 287(5456), 1273.
Gottschalk, A. (2002). Derivation of the visual contrast response function by maximizing information rate. Neural Computation, 14, 527–542.
Hoyer, P. O. (2003). Modelling receptive fields with non-negative sparse coding. In E. D. Schutter (Ed.), Computational neuroscience: Trends in research 2003. Amsterdam: Elsevier.
Hurri, J., Hyvärinen, A., & Oja, E. (1996). Image feature extraction using independent component analysis. In Proceedings of the IEEE Nordic Signal Processing Symposium (NORSIG) '96. Piscataway, NJ: IEEE.
Karhunen, J., & Joutsensalo, J. (1994). Representation and separation of signals using nonlinear PCA type learning. Neural Networks, 7(1), 113–127.
Katz, L. C., & Shatz, C. J. (1996). Synaptic activity and the construction of cortical circuits. Science, 274, 1133–1138.
Lennie, P. (2003). The cost of cortical computation. Current Biology, 13, 493–497.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biol., 15, 267–273.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23), 3311–3325.
Rauschecker, J. P. (1991). Mechanisms of visual plasticity. Physiological Reviews, 71(2), 587–615.
Schiller, P. H. (1992). The ON and OFF channels of the visual system. Trends in Neurosciences, 15(3), 86–92.
Sudjianto, A., & Hassoun, M. H. (1995). Statistical basis of nonlinear Hebbian learning and application to clustering. Neural Networks, 8(5), 707–715.
Thomson, A. M., & Deuchars, J. (1997). Synaptic interactions in neocortical local circuits: Dual intracellular recordings in vivo. Cerebral Cortex, 7, 510–522.
van Hateren, J. H., & Ruderman, D. L. (1998). Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex. Proceedings of the Royal Society of London B, 265, 2315–2320.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London B, 265, 359–366.
Received January 30, 2004; accepted June 9, 2005.
LETTER
Communicated by Steven Nowlan
Differential Log Likelihood for Evaluating and Learning Gaussian Mixtures Marc M. Van Hulle
[email protected] K. U. Leuven, Laboratorium voor Neuro- en Psychofysiologie, B-3000 Leuven, Belgium
We introduce a new unbiased metric for assessing the quality of density estimation based on gaussian mixtures, called the differential log likelihood. As an application, we determine the optimal smoothness and the optimal number of kernels in gaussian mixtures. Furthermore, we suggest a learning strategy for gaussian mixture density estimation and compare its performance with log likelihood maximization for a wide range of real-world data sets.

1 Introduction

A standard procedure in density estimation is to minimize the negative log likelihood given the sample S = \{v_i \mid i = 1, \ldots, N\}, v_i = [v_{i1}, \ldots, v_{id}] \in V \subseteq \mathbb{R}^d, taken from the density p(v) (Redner & Walker, 1984; Bishop, 1995), often in combination with the expectation-maximization (EM) approach (Dempster, Laird, & Rubin, 1977):

F = -\log L = -\frac{1}{N}\sum_{i=1}^{N} \log \tilde{p}(v_i \mid \Theta), \qquad (1.1)
with p̃(.) the density estimate and Θ the parameter vector of the model estimate. When N → ∞, we obtain the expected log likelihood:

E[-\log L] = -\int_V p(v) \log \tilde{p}(v \mid \Theta)\, dv. \qquad (1.2)
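The sample negative log likelihood of equation 1.1 can be computed directly for, say, a homoscedastic, equal-weight gaussian mixture (a sketch; the data and parameter values are arbitrary):

```python
import numpy as np

def neg_log_likelihood(V, centers, sigma):
    """Equation 1.1, F = -(1/N) sum_i log p_tilde(v_i | Theta), for an
    equal-weight gaussian mixture with a common kernel width sigma."""
    N, d = V.shape
    # Squared distance of every point to every kernel center.
    d2 = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    p = np.exp(-d2 / (2.0 * sigma ** 2)).mean(axis=1) / norm
    return -np.mean(np.log(p))

rng = np.random.default_rng(3)
V = rng.standard_normal((500, 2))                      # the sample S
centers = V[rng.choice(500, size=20, replace=False)]   # kernels on data points
F = neg_log_likelihood(V, centers, sigma=0.5)
```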
Neural Computation 18, 430–445 (2006) © 2005 Massachusetts Institute of Technology

For a given sample S, the (sample) log likelihood cannot be used for judging the goodness of fit of a model or for model selection, since it is biased as an estimator of the expected log likelihood. The bias appears in practice because the same sample is used for estimating both the model parameters and the expected log likelihood. Akaike (1973) was the first to suggest an approximation of this bias, which he included in his information metric to estimate the expected log likelihood. This has led to several improved
information metrics (for a review, see Stoica & Selén, 2004), including a bootstrap procedure (Ishiguro, Sakamoto, & Kitagawa, 1997). In this letter, we consider the log likelihood to be biased as an estimator of the theoretically optimal log likelihood (cf. the Kullback-Leibler divergence). We estimate this bias by the difference between the negative log likelihood and the entropy of the density estimate. We call the result the differential log likelihood, in analogy with the differential entropy (thus, also with respect to some reference; see Cover & Thomas, 1991). For the entropy, we will use kernel density estimates ("plug-in" estimates; Ahmad & Lin, 1976), but other estimates could be used as well. The differential log likelihood should fluctuate around zero when a good model is achieved and diverge away from it otherwise. We show under what condition the differential log likelihood is an unbiased estimator for finite data sets and when this is approximately the case. In terms of applications, we determine the optimal kernel smoothness and the optimal number of kernels in kernel-based density estimation of both synthetic and real-world examples, and we derive a strategy for differential log likelihood learning, which we apply to density estimation of a wide selection of real-world data sets.

2 Differential Log Likelihood

The expected log likelihood is biased by the differential ("joint") entropy (Shannon, 1948) of the function to be estimated. When subtracting this bias, we obtain the Kullback-Leibler distance or divergence (KL) (Kullback, 1959; for a review, see Soofi, 2000):

KL(p \| \tilde{p}) = -\int_V p(v) \log \frac{\tilde{p}(v \mid \Theta)}{p(v)}\, dv = -\int_V p(v) \log \tilde{p}(v \mid \Theta)\, dv + \int_V p(v) \log p(v)\, dv, \qquad (2.1)
that is, a nonnegative quantity that will equal zero only when the true and estimated densities are equal, so that it can be regarded as an (average) error function, weighting the discrepancy between the true and estimated densities logarithmically. Given a sample S, the entropy can be estimated from the density estimate p̃ (plug-in estimates; Ahmad & Lin, 1976), from sample spacings in the one-dimensional case (Hall, 1984), or from the nearest-neighbor method in the d-dimensional case (Kozachenko & Leonenko, 1987) (for a review, see Beirlant, Dudewicz, Györfi, & van der Meulen, 1997). An alternative approach to estimating entropy or KL is by means of the Edgeworth expansion (Comon, 1994; Van Hulle, 2005a), albeit that some reservations should be made about the accuracy of this polynomial density expansion (Friedman, 1987; Huber, 1985).
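For two univariate gaussians, equation 2.1 can be estimated by Monte Carlo and compared with the known closed form (an illustrative check; the parameter values are arbitrary):

```python
import numpy as np

def log_gauss(v, mu, sigma):
    return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - (v - mu) ** 2 / (2.0 * sigma ** 2)

rng = np.random.default_rng(4)
mu_p, s_p = 0.0, 1.0        # true density p
mu_q, s_q = 0.5, 1.5        # model density p_tilde
v = rng.normal(mu_p, s_p, size=200_000)

# KL(p || p_tilde) = E_p[log p(v) - log p_tilde(v)], estimated from samples of p.
kl_mc = np.mean(log_gauss(v, mu_p, s_p) - log_gauss(v, mu_q, s_q))

# Closed form for two univariate gaussians, for comparison.
kl_exact = np.log(s_q / s_p) + (s_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * s_q ** 2) - 0.5
```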
Consider the following reasoning. We start with the negative log likelihood and subtract from it a bias term, namely, the model estimate's differential entropy:

LL(p \| \tilde{p}) = -\int_V p(v) \log \tilde{p}(v \mid \Theta)\, dv + \int_V \tilde{p}(v \mid \Theta) \log \tilde{p}(v \mid \Theta)\, dv, \qquad (2.2)
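A plug-in estimate of equation 2.2 replaces the first integral with an average over the sample S and the second with an average over samples drawn from p̃, as in equation 3.1 later in the text. A sketch for a homoscedastic gaussian mixture (sizes and σ are illustrative):

```python
import numpy as np

def mixture_logpdf(V, centers, sigma):
    """Log density of an equal-weight gaussian mixture with common width sigma."""
    d = V.shape[1]
    d2 = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.log(np.exp(-d2 / (2.0 * sigma ** 2)).mean(axis=1) / norm)

def sample_mixture(centers, sigma, n, rng):
    idx = rng.integers(0, len(centers), size=n)
    return centers[idx] + sigma * rng.standard_normal((n, centers.shape[1]))

rng = np.random.default_rng(5)
data = rng.uniform(-1.0, 1.0, size=(1000, 2))      # sample from the true p
centers = rng.uniform(-1.0, 1.0, size=(100, 2))    # kernel positions
sigma = 0.15

# First term: negative log likelihood; second term: plug-in estimate
# of the model's negative differential entropy.
model_samples = sample_mixture(centers, sigma, 10_000, rng)
dll = (-np.mean(mixture_logpdf(data, centers, sigma))
       + np.mean(mixture_logpdf(model_samples, centers, sigma)))
```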
and call it the differential log likelihood. It is clear that when p(.) = p(.) ˜ everywhere, LL = 0. The advantage with respect to KL is that as many samples as desired can be generated from p˜ to estimate its differential entropy. There is an intimate connection with the information criteria used for model selection (Akaike, 1973; Ishiguro et al., 1997; Stoica & Sel´en, 2004). Consider the following expression: E v∈ p
1 log p(v|(v)) ˜ − E v ∈ p˜ log p(v ˜ |(v)) . N
(2.3)
The terms between the square brackets correspond to LL(p‖p̃), given the sample S, and form an unbiased estimate of the previous expression; if the expectation in the second term between the square brackets were instead taken for v′ ∈ p, we would obtain the classical bias correction term for the estimated log likelihood used in information criteria. The difference between the two formulations is that LL will equal zero in the optimal case, whereas the bias correction term is a nonnegative quantity. In fact, since LL = 0 in the optimal case, the estimated entropy term it contains could be regarded as a bias correction of the estimated log likelihood. In the general case, we have the following connection with KL:

LL(p \| \tilde{p}) = KL(p \| \tilde{p}) - H(\tilde{p}) + H(p). \qquad (2.4)
For the connection between distortion error (vector quantization), log likelihood, and Kullback-Leibler divergence in the case of gaussian mixtures, we refer to Van Hulle (2005b). It can be verified that LL is nonsymmetric: LL(p‖p̃) ≠ LL(p̃‖p). It is, however, in general different from KL since, for LL(p‖φ_p), with φ_p the d-dimensional multivariate gaussian density with the same mean and covariance matrix, we can write that

LL(p \| \phi_p) = KL(p \| \phi_p) - H(\phi_p) + H(p) = 0, \qquad (2.5)

since KL(p‖φ_p) = H(φ_p) − H(p) (i.e., the negentropy). Hence, LL(p‖φ_p) will be zero when the maximum likelihood gaussian approximation is achieved.
2.1 Gaussian Input, Gaussian Kernel. Assume that the true distribution is a d-dimensional gaussian, φ_p, with parameters the mean vector µ_p and covariance matrix Σ_p, and that we use a single gaussian kernel, φ_p̃, with parameters µ_p̃ and Σ_p̃, for modeling the true distribution. It can then be shown that

LL(\phi_p \| \phi_{\tilde{p}}) = \frac{1}{2}\,\mathrm{Tr}\!\left(\Sigma_{\tilde{p}}^{-1}\Sigma_p\right) + \frac{1}{2}\,(\mu_p - \mu_{\tilde{p}})^{\top}\Sigma_{\tilde{p}}^{-1}(\mu_p - \mu_{\tilde{p}}) - \frac{d}{2}, \qquad (2.6)
after some algebraic manipulations. Evidently, when we let the parameters µ_p̃ and Σ_p̃ of our gaussian kernel match those of the distribution, µ_p and Σ_p, we obtain LL(φ_p‖φ_p̃) = 0 (the validity is discussed in appendix A). When estimating LL from samples taken from the distributions φ_p and φ_p̃, we can show that LL is asymptotically χ² distributed (see appendix A) and that it is an asymptotically unbiased estimator when we take for µ_p̃ and Σ_p̃ the maximum likelihood estimates of µ_p and Σ_p.

2.2 General Input, Gaussian Kernel. Consider now the case of a general input density p(v) and a gaussian kernel φ(v). We can approximately write that (see appendix B)

LL(p \| \phi) \approx LL(\phi_p \| \phi), \qquad (2.7)

and know that when φ = φ_p, thus when the mean vectors and covariance matrices match, LL(p‖φ) = 0 (see equation 2.5). Hence, also for this case, the metric is unbiased. Similarly, we can compute the inverse relation (see appendix C):

LL(\phi \| p) \approx LL(\phi \| \phi_p). \qquad (2.8)
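Equation 2.6 and the connection of equation 2.4 can be cross-checked numerically for two gaussians, using the standard closed forms for gaussian KL and entropy (an illustrative two-dimensional example; the parameter values are arbitrary):

```python
import numpy as np

def ll_gauss(mu_p, S_p, mu_q, S_q):
    """Equation 2.6: LL(phi_p || phi_q) for two multivariate gaussians."""
    d = len(mu_p)
    Sq_inv = np.linalg.inv(S_q)
    dm = mu_p - mu_q
    return 0.5 * np.trace(Sq_inv @ S_p) + 0.5 * dm @ Sq_inv @ dm - 0.5 * d

def kl_gauss(mu_p, S_p, mu_q, S_q):
    """Closed-form Kullback-Leibler divergence between two gaussians."""
    d = len(mu_p)
    Sq_inv = np.linalg.inv(S_q)
    dm = mu_q - mu_p
    return 0.5 * (np.trace(Sq_inv @ S_p) + dm @ Sq_inv @ dm - d
                  + np.log(np.linalg.det(S_q) / np.linalg.det(S_p)))

def h_gauss(S):
    """Differential entropy of a gaussian with covariance S."""
    d = S.shape[0]
    return 0.5 * np.log((2.0 * np.pi * np.e) ** d * np.linalg.det(S))

mu_p, S_p = np.array([0.0, 0.0]), np.eye(2)
mu_q, S_q = np.array([1.0, -0.5]), np.array([[2.0, 0.3], [0.3, 1.0]])

ll_matched = ll_gauss(mu_p, S_p, mu_p, S_p)                         # should be zero
lhs = ll_gauss(mu_p, S_p, mu_q, S_q)                                # equation 2.6
rhs = kl_gauss(mu_p, S_p, mu_q, S_q) - h_gauss(S_q) + h_gauss(S_p)  # equation 2.4
```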
2.3 General Input, General Model. There is no generally applicable approximation for this case, but when the mean vectors and covariance matrices of p and p̃ are equal, we have that (see appendix D)

LL(p \| \tilde{p}) \approx \frac{1}{12}\sum_{i=1}^{d}\left(\kappa^{i,i,i} - \tilde{\kappa}^{i,i,i}\right)^2 + \frac{1}{3}\sum_{\substack{i,j=1 \\ i \neq j}}^{d}\left(\kappa^{i,i,j} - \tilde{\kappa}^{i,i,j}\right)^2 + \frac{1}{6}\sum_{\substack{i,j,k=1 \\ i<j<k}}^{d}\left(\kappa^{i,j,k} - \tilde{\kappa}^{i,j,k}\right)^2 + \frac{1}{12}\sum_{i=1}^{d}\left(\tilde{\kappa}^{i,i,i}\right)^2 + \frac{1}{3}\sum_{\substack{i,j=1 \\ i \neq j}}^{d}\left(\tilde{\kappa}^{i,i,j}\right)^2 + \frac{1}{6}\sum_{\substack{i,j,k=1 \\ i<j<k}}^{d}\left(\tilde{\kappa}^{i,j,k}\right)^2, \qquad (2.9)
M. Van Hulle
with κ^{i,j,k} the third cumulant over the input dimensions i, j, k. The approximation is probably biased, since we did not take into account the higher-order moments.

3 Optimal Smoothness, Kernel Selection

Consider N = 1000 data points that are randomly and uniformly distributed in [−1, 1]². We further take k = 100 data points from the same distribution and center on these points gaussian kernels with standard deviation σ (homoscedastic gaussian mixture). The idea is to approximate the uniform distribution with this gaussian mixture and to use ΔLL to optimize the smoothness σ. (We consider only one parameter to optimize for clarity but consider the more general case later.) We compute ΔLL(p‖p̃) discretely as follows,

    ΔLL(p‖p̃) = −(1/N) Σ_{i=1}^N log p̃(v_i ∈ p) + (1/N_B) Σ_{i=1}^{N_B} log p̃(v_i ∈ p̃),    (3.1)
with N_B = 100,000 Monte Carlo samples generated from p̃ (plug-in entropy estimation). We also compute ΔLL(p‖p̃) by numerical integration of equation 2.2 and the Kullback-Leibler distance by numerical integration of equation 2.1 (both over a regular 500 × 500 grid). Furthermore, we compute Akaike's information criterion (AIC) (Akaike, 1973). The optimal smoothness is expected to correspond to the minimum in KL. The result is shown in Figure 1A (plotted in σ-steps of 0.01). We observe that both ΔLL plots cross the zero line at almost exactly the point where KL is minimal. We also observe that the discrete version is a good approximation of the true ΔLL. The minimum in AIC also corresponds to the minimum in KL. Furthermore, we note that detecting the zero crossing in ΔLL is easier than locating the minimum in a shallow curve such as AIC's. Finally, we note that the Bayesian (BIC) and generalized information criteria (GIC) plots (Stoica & Selén, 2004) have the same minima as the AIC plot, since we keep the number of parameters and data points constant (results not shown).
We repeat the simulation for a two-dimensional gaussian with unit variance. We observe that the ΔLL plots cross the zero line nearly at the point where KL is minimal, but this is not the case for the AIC plot (vertical dotted line in Figure 1B, plotted in steps of 0.05). The same overestimated minima are also obtained for the BIC and GIC plots, again since we keep the number of parameters constant (and also the number of data points).
Next, we take a classic benchmark in the density estimation literature: the eruption lengths data set of the Old Faithful geyser in Yellowstone National Park, which consists of 222 observations (Silverman, 1992). We take k = 10 prototypes, the positions of which are initialized by sampling the data set and which are further optimized by the k-means clustering algorithm
Figure 1: (A) ΔLL(p‖p̃) and KL(p‖p̃) for a two-dimensional uniform distribution of gaussian kernels with standard deviation σ. Shown are KL obtained by numerical integration (thick stippled line), ΔLL by numerical integration (continuous line) and its discretized version, equation 3.1 (dashed line), and Akaike's information criterion (AIC; dotted line), which was rescaled for visualization purposes. The vertical stippled line indicates the minimum in the KL plot; the horizontal stippled line indicates the zero line. (B) The same as A but for a two-dimensional gaussian distribution. The vertical dotted line corresponds to the minimum in the AIC plot. (C) ΔLL (continuous line) and AIC (dashed line) for the Hepatitis data set. (D) ΔLL (continuous lines), AIC (dashed lines), and BIC (dotted lines) plots for a linear array of five unit-variance, two-dimensional gaussians and k gaussian kernels, both of which are separated by a distance of 0.1, 0.5, or 2 (increasing line thicknesses).
(10 runs). (Note that quadratic distortion minimization and likelihood maximization are equivalent in homoscedastic gaussian mixture density modeling; see Van Hulle, 2005b.) We then allocate gaussian kernels to these prototypes and optimize their radii σ as before. The optimal σ values are 0.12 and 0.14, for LL and AIC, respectively (optimized in steps of 0.01; results not shown). Similar results are obtained for the Iris and the Wisconsin Breast Cancer data sets (available online from the UCI Machine Learning Repository, http://www1.ics.uci.edu/∼mlearn/MLRepository.html).
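The smoothness-selection procedure of this section can be sketched in a few lines. The version below is a scaled-down variant of the uniform-distribution experiment (function names are mine; fewer Monte Carlo samples than the N_B = 100,000 used above):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_mixture(v, centers, sigma):
    """Log density of a homoscedastic, equiprobable gaussian mixture."""
    d = centers.shape[1]
    sq = ((v[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    lp = -sq / (2 * sigma**2) - d * np.log(np.sqrt(2 * np.pi) * sigma)
    m = lp.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(lp - m).mean(axis=1))

def delta_ll(data, centers, sigma, n_mc=20_000):
    """Discrete dLL of equation 3.1 (Monte Carlo plug-in entropy for the model term)."""
    idx = rng.integers(len(centers), size=n_mc)
    v_model = centers[idx] + sigma * rng.standard_normal((n_mc, centers.shape[1]))
    return (-log_mixture(data, centers, sigma).mean()
            + log_mixture(v_model, centers, sigma).mean())

data = rng.uniform(-1, 1, size=(1000, 2))
centers = rng.uniform(-1, 1, size=(100, 2))
for sigma in (0.05, 0.1, 0.2, 0.4):
    # Scan sigma; the zero crossing of dLL marks the optimal smoothness.
    print(sigma, delta_ll(data, centers, sigma))
```

The scan reproduces the qualitative behavior of Figure 1A: ΔLL decreases with σ and changes sign near the smoothness that best matches the data.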
The Iris data set consists of four-dimensional measurements on 150 Iris flowers, and the Breast Cancer data set consists of nine-dimensional measurements on 699 patients from two classes (benign and malignant). The optimal σ values for k = 10, for ΔLL and AIC, are, respectively, 0.22 and 0.23 for the Iris data set, and 1.35 and 1.4 for the Breast Cancer data set (results not shown). Furthermore, we apply the Hepatitis data set (available from the same Web site; the preprocessing is given in appendix E), which consists of 19-dimensional measurements on 155 patients from two classes (die and live). The results for k = 10 are shown in Figure 1C: for ΔLL we have a clear zero crossing at σ = 0.18; for AIC there is no clear minimum, and this is also the case for BIC and GIC (for the same reasons as above; results not shown).
Finally, we verify whether we can determine the number of kernels needed to estimate a given distribution. We consider a linear array of five two-dimensional, equiprobable gaussians with their centers separated by a distance β, starting from the origin (0, 0), and positioned along the x-axis. We consider a data set of N = 50 data points drawn from the mixture distribution. We take β = 0.1, 0.5, and 2. We center k = 1, 2, . . . gaussian kernels, with unit variance, starting from the origin (0, 0) in steps of β along the x-axis. (Hence, there can be a mismatch only in the number of kernels, not in their locations and spreads.) The result is shown in Figure 1D. The ΔLL curves decrease steeply from k = 1 and cross the zero line in the vicinity of k = 5, as desired (vertical stippled line). The AIC and BIC curves all have their minima at k = 4. When we increase the number of data points, for example, to N = 100, the AIC and BIC minima will shift to k = 5.

4 Differential Log-Likelihood Learning

The next step is to use ΔLL itself as a learning metric. Assume that we have k neurons with gaussian activation probabilities p̃_i(v|θ_i) with θ_i = [w_i, σ_i].
The idea is now to minimize the squared differential log likelihood, min(ΔLL²/2), for optimizing the w_i and σ_i, ∀i (squared so as to obtain a nonnegative quantity). Gradient descent yields the following learning equations:

    Δw_i = η_w (ΔLL/σ_i²) [ (1/N) Σ_v (p̃_i(v|θ_i)/p̃(v|Θ)) (v − w_i) − (1/N′) Σ_{v′} (p̃_i(v′|θ_i)/p̃(v′|Θ)) (v′ − w_i) ],    (4.1)

    Δσ_i = η_σ (ΔLL/σ_i³) [ (1/N) Σ_v (p̃_i(v|θ_i)/p̃(v|Θ)) ‖v − w_i‖² − (1/N′) Σ_{v′} (p̃_i(v′|θ_i)/p̃(v′|Θ)) ‖v′ − w_i‖² ],    (4.2)
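A minimal implementation sketch of these two update rules follows (the symbols are defined in the next paragraph; all function names are mine, and log-domain arithmetic is my own addition for numerical stability, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def log_kernel_probs(v, w, sigma):
    """log p~_i(v | theta_i) for every kernel i (rows of w) and every sample (rows of v)."""
    d = w.shape[1]
    sq = ((v[:, None, :] - w[None, :, :]) ** 2).sum(-1)          # (N, k)
    return -sq / (2 * sigma[None, :] ** 2) - d * np.log(np.sqrt(2 * np.pi) * sigma[None, :])

def log_mix_and_resp(v, w, sigma):
    """log p~(v | Theta) and the ratios p~_i(v | theta_i) / p~(v | Theta)."""
    lp = log_kernel_probs(v, w, sigma)
    m = lp.max(axis=1, keepdims=True)
    log_mix = m[:, 0] + np.log(np.exp(lp - m).mean(axis=1))
    return log_mix, np.exp(lp - log_mix[:, None])

def dll_step(v, w, sigma, eta_w=0.02, eta_s=0.02):
    """One gradient step of min(dLL^2/2) on the centers and radii (equations 4.1, 4.2)."""
    idx = rng.integers(len(w), size=len(v))
    vp = w[idx] + sigma[idx, None] * rng.standard_normal(v.shape)   # v' ~ p~(.|Theta)
    lm, r = log_mix_and_resp(v, w, sigma)
    lmp, rp = log_mix_and_resp(vp, w, sigma)
    dll = -lm.mean() + lmp.mean()                                   # discrete dLL
    dw = ((r[:, :, None] * (v[:, None, :] - w)).mean(0)
          - (rp[:, :, None] * (vp[:, None, :] - w)).mean(0)) / sigma[:, None] ** 2
    sq = ((v[:, None, :] - w) ** 2).sum(-1)
    sqp = ((vp[:, None, :] - w) ** 2).sum(-1)
    ds = ((r * sq).mean(0) - (rp * sqp).mean(0)) / sigma ** 3
    return w + eta_w * dll * dw, np.clip(sigma + eta_s * dll * ds, 0.05, None), dll

v = rng.standard_normal((200, 2))
w = v[rng.integers(len(v), size=5)].copy()
sigma = np.full(5, 0.5)
for _ in range(50):
    w, sigma, dll = dll_step(v, w, sigma)
```

Each step draws a fresh Monte Carlo sample v′ from the current model, so both terms of ΔLL and both sums in the updates are re-estimated per step; the common factor ΔLL scales the updates toward the zero crossing.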
with η_w and η_σ the learning rates (constants), v and v′ data points from the original sample and from the density estimate, respectively, N and N′ the sizes of the corresponding samples, and p̃(v|Θ) = (1/k) Σ_i p̃_i(v|θ_i), with Θ = [θ_i] and θ_i = [w_i, σ_i] (homogeneous, heteroscedastic mixture). Note that we have simplified the equation for Δσ_i, since (1/kN) Σ_v p̃_i(v|θ_i)/p̃(v|Θ) = (1/N) Σ_v p̃(i|v, θ_i) ≈ q̃_i = 1/k, the prior probability of kernel i, and (1/kN′) Σ_{v′} p̃_i(v′|θ_i)/p̃(v′|Θ) ≈ q̃_i, so that the corresponding terms approximately cancel when p̃(v|Θ) ≈ p(v). Note furthermore that we can easily extend our approach to nonhomogeneous mixtures by also considering gradient descent on the prior probability parameters.
In order to test the goodness of fit of the gaussian mixture, we apply rank and sign tests as follows. We estimate the negative log likelihood F by applying bootstrapping on the original distribution's sample S, considering N_B = 1000 subsamples of it with (2/3)N data points each. We test this estimate of F against the distribution of N_B differential entropy estimates H(p̃), each of which is computed by (Monte Carlo) generating N′ = N samples from the estimated distribution. Note that the sign test is more sensitive than the rank test (Montgomery & Runger, 1999).
We consider, in addition to the Old Faithful geyser data set, nine popular real-world data sets that are all available from the UCI Machine Learning Repository, so that a wide selection is obtained with dimensionalities d between 1 and 19 (the preprocessing is given in appendix E). We perform on all of them log-likelihood maximization (min(F)), using gradient descent on F (Bishop, 1995), and squared differential log-likelihood minimization (min(ΔLL²)), using equations 4.1 and 4.2, and determine the outcomes of

Table 1: Density Estimation Performance by Training k = 9 Gaussian Kernels Using the min(ΔLL²) and min(F) Learning Strategies, for Different Real-World Data Sets.

                                min(ΔLL²)                          min(F)
Data Set          N     d    F       ΔLL        Rank/Sign    F       ΔLL      Rank/Sign
Old Faithful      222   1    1.164   0.000963   y/y          1.064   0.0949   n/n
Iris              150   4    2.534   0.00252    y/y          2.315   −0.583   n/n
Hayes-Roth        132   5    12.73   −0.00410   y/y          7.728   −0.359   n/n
Glass Identific.  214   9    8.369   0.0231     y/n          4.794   2.670    n/n
Breast Cancer     699   9    19.69   −0.00409   y/y          15.67   2.431    n/n
Solar Flare       1389  10   32.65   −0.00433   y/n          2.504   −0.441   n/n
Contraceptive     1473  10   21.25   −0.00191   y/y          9.703   0.0740   y/n
Wine Recogn.      178   13   26.70   0.000117   y/y          21.38   0.328    y/n
Boston Housing    506   13   39.28   0.0199     y/n          38.33   2.519    n/n
Hepatitis         155   19   47.60   −0.0113    y/y          46.16   2.517    n/n

Note: N = number of data points; d = dimensionality; Contraceptive = Contraceptive Method Choice; y = yes; n = no.
the rank and sign tests. We use k = 9 gaussian kernels, the centers of which we initialize by taking random samples from the respective data sets; for the initial radii, we take 0.5. We use the same initial kernel centers and radii, η values, and number of training epochs for both methods so as to have a fair comparison. The results are summarized in Table 1. As expected with our method, the F values are higher, but the ΔLL values are much lower, and, more important, the rank and/or sign tests pass, which is much less often the case for F minimization (e.g., the sign test never passes). This clearly shows the advantage of our method, at least for the real-world examples considered here.

5 Conclusion

We have introduced a novel metric, called the differential log likelihood, for assessing the quality of density estimation based on gaussian mixtures. We have shown under what condition the differential log likelihood is unbiased and what its properties are. As an example application, we determined the optimal smoothness and the optimal number of kernels in gaussian mixtures. Furthermore, we suggested a learning strategy for gaussian mixture density estimation and compared its performance with log-likelihood maximization for a wide range of real-world data sets. One could consider other applications, such as gaussian factorial distribution approximation and monitoring topographic map formation. These are topics for future research.

Appendix A: ΔLL Distribution

Let P(v) and P̂(v̂) be two multivariate normal distributions with mean vectors µ and µ̂ and covariance matrices Σ and Σ̂, respectively:

    v ∼ N_d(µ, Σ),    (A.1)

    v̂ ∼ N_d(µ̂, Σ̂).    (A.2)
We wish to determine the distribution of the differential log likelihood ΔLL. Assume we compute ΔLL for N data points taken from the two distributions:

    N ΔLL = −log ( Π_i P̂(v_i) / Π_i P̂(v̂_i) )    (A.3)
          = (1/2) Σ_i (v_i − µ̂)ᵀ Σ̂⁻¹ (v_i − µ̂) − (1/2) Σ_i (v̂_i − µ̂)ᵀ Σ̂⁻¹ (v̂_i − µ̂)    (A.4)
          = (1/2) Tr[ Σ̂⁻¹ Σ_i (v_i − µ̂)(v_i − µ̂)ᵀ ] − (1/2) Σ_i (v̂_i − µ̂)ᵀ Σ̂⁻¹ (v̂_i − µ̂)    (A.5)
          = (1/2) Tr[ Σ̂⁻¹ Σ_i (v_i − v̄)(v_i − v̄)ᵀ ] + (N/2) (v̄ − µ̂)ᵀ Σ̂⁻¹ (v̄ − µ̂) − (1/2) Σ_i (v̂_i − µ̂)ᵀ Σ̂⁻¹ (v̂_i − µ̂)    (A.6)
          = ((N − 1)/2) Tr[ Σ̂⁻¹ C_N ] + (N/2) ((v̄ − µ) + (µ − µ̂))ᵀ Σ̂⁻¹ ((v̄ − µ) + (µ − µ̂)) − (1/2) Σ_i (v̂_i − µ̂)ᵀ Σ̂⁻¹ (v̂_i − µ̂),    (A.7–A.8)

with C_N = (1/(N − 1)) Σ_i (v_i − v̄)(v_i − v̄)ᵀ the N-sample covariance matrix and v̄ the sample mean. We can further write that

    N ΔLL = ((N − 1)/2) Tr[ Σ̂⁻¹ C_N ] + (N/2) (v̄ − µ)ᵀ Σ̂⁻¹ (v̄ − µ) + (N/2) (v̄ − µ)ᵀ Σ̂⁻¹ (µ − µ̂) + (N/2) (µ − µ̂)ᵀ Σ̂⁻¹ (v̄ − µ)
            + (N/2) (µ − µ̂)ᵀ Σ̂⁻¹ (µ − µ̂) − (1/2) Σ_i (v̂_i − µ̂)ᵀ Σ̂⁻¹ (v̂_i − µ̂).    (A.9)

We have that v̄ → µ as N → ∞: we know µ, and we can always increase the sample size generated from P through Monte Carlo sampling, v ∼ N_d(µ, Σ). The estimator v̄ is a consistent estimator of µ; thus, the second, third, and fourth terms in the latter equation converge to zero. Hence, we have that

    N ΔLL ≈ ((N − 1)/2) Tr[ Σ̂⁻¹ C_N ] + (N/2) (µ − µ̂)ᵀ Σ̂⁻¹ (µ − µ̂) − (1/2) Σ_i (v̂_i − µ̂)ᵀ Σ̂⁻¹ (v̂_i − µ̂)    (A.10)
          ≈ (N/2) Tr[ Σ̂⁻¹ Σ̃ ] + (N/2) (µ − µ̂)ᵀ Σ̂⁻¹ (µ − µ̂) − (1/2) Σ_i (v̂_i − µ̂)ᵀ Σ̂⁻¹ (v̂_i − µ̂),    (A.11–A.12)

since C_N = (N/(N − 1)) Σ̃, with Σ̃ the maximum likelihood estimator of Σ. We can see that the first term in the latter equation is N/2 times the trace of the product of two matrices, and thus a constant; the second term is also a constant; and the third term is, up to the factor N/2, χ² distributed with d degrees of freedom (since it is a Mahalanobis distance). Hence, ΔLL is χ² distributed. Finally, we can compute the bias:

    ⟨ΔLL⟩ = (1/2) Tr[ Σ̂⁻¹ Σ̃ ] + (1/2) (µ − µ̂)ᵀ Σ̂⁻¹ (µ − µ̂) − d/2,    (A.13)

since the expectation of χ²_d equals d, and verify that it corresponds to equation 2.6, with Σ = Σ̃ the maximum likelihood estimator of Σ. Hence, ΔLL is an asymptotically unbiased estimator when µ = µ̂ and Σ̂ = Σ̃, as desired.

A.1 Validity. Consider again the analytical expression of ΔLL(φ_p‖φ_p̃), equation 2.6:

    ΔLL(φ_p‖φ_p̃) = (1/2) Tr(Σ_p̃⁻¹ Σ_p) + (1/2) (µ_p − µ_p̃)ᵀ Σ_p̃⁻¹ (µ_p − µ_p̃) − d/2.    (A.14)

It is clear that when the mean vectors and the covariance matrices match (thus, when µ_p = µ_p̃ and Σ_p = Σ_p̃), we have that ΔLL = 0. But the inverse is not necessarily true (at least for d > 1) since, for example, for matching means, the second term is zero, and Σ_p̃⁻¹ can be constructed in such a way that the trace of its product with Σ_p equals d. However, in practice, we are always trying to match the gaussian kernel's mean and covariance matrix
to those of the available φ_p sample (cf. maximum likelihood estimates), and for this best normal estimate, ΔLL is asymptotically unbiased. We can prove that ΔLL = 0 only for µ_p = µ_p̃ and Σ_p = Σ_p̃ when also |Σ_p̃| = |Σ_p| (uniqueness). This condition follows from the connection with the Kullback-Leibler divergence, equation 2.4. The differential entropies H(φ_p̃) and H(φ_p) will be equal when log |Σ_p̃| = log |Σ_p|, that is, the remaining terms in the entropy difference, and thus when |Σ_p̃| = |Σ_p|. Since KL(φ_p‖φ_p̃) = 0 if and only if φ_p̃ = φ_p everywhere (Kullback, 1959), the mean vectors and covariance matrices must match.

Appendix B: General Input, Gaussian Kernel

We first perform an Edgeworth expansion of p about its best normal estimate φ_p (Barndorff-Nielsen & Cox, 1989) (also called a Gram-Charlier A series),

    p(v) ≈ φ_p(v) [ 1 + (1/3!) Σ_{i,j,k} κ^{i,j,k} h_{ijk}(v) + κ⁺ ],    (B.1)

with h_{ijk} the ijkth Hermite polynomial, and

    κ^{i,j,k} = (N² / ((N − 1)(N − 2))) κ_{ijk} / √(s_i² s_j² s_k²),

where N is the number of data points in the sample, κ_{ijk} the sample third central moment over input dimensions i, j, k, and s_i² the sample second central moment over input dimension i. The term κ⁺ represents the higher-order terms. For a general input density p(v) and a gaussian kernel φ(v), we have that

    ΔLL(p‖φ) = −∫_V p(v) log φ(v) dv + ∫_V φ(v) log φ(v) dv
             ≈ −∫_V φ_p(v)(1 + Z(v)) log φ(v) dv + ∫_V φ(v) log φ(v) dv
             = −∫_V φ_p(v) log φ(v) dv − ∫_V φ_p(v) Z(v) log φ(v) dv + ∫_V φ(v) log φ(v) dv
             = ΔLL(φ_p‖φ) − ∫_V φ_p(v) Z(v) log φ(v) dv
             = ΔLL(φ_p‖φ) − ∫_V φ_p(v) Z(v) log φ_p(v) dv
             ≈ ΔLL(φ_p‖φ) − ∫_V φ_p(v)(1 + Z(v)) log φ_p(v) dv + ∫_V φ_p(v) log φ_p(v) dv
             = ΔLL(φ_p‖φ) + ΔLL(p‖φ_p)
             = ΔLL(φ_p‖φ),    (B.2)

with Z(v) = (1/3!) Σ_{i,j,k} κ^{i,j,k} h_{ijk}(v), and where we have used, for the transition from the fourth to the fifth step, the fact that ∫_V φ_p(v) Z(v) log (φ(v)/φ_p(v)) dv = 0, by virtue of the properties of Hermite polynomials, and, from the seventh to the eighth step, equation 2.5.
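The standardized third cumulants κ^{i,j,k} that enter the expansion (and equation 2.9) can be estimated directly from a sample. A sketch under the definition above (the helper is my own, not from the paper):

```python
import numpy as np

def third_cumulants(x):
    """Standardized sample third cumulants kappa^{i,j,k} as defined after equation B.1."""
    n, d = x.shape
    c = x - x.mean(0)                                 # center the sample
    m3 = np.einsum('ni,nj,nk->ijk', c, c, c) / n      # third central moments
    s2 = (c ** 2).mean(0)                             # second central moments
    scale = np.sqrt(s2[:, None, None] * s2[None, :, None] * s2[None, None, :])
    return n**2 / ((n - 1) * (n - 2)) * m3 / scale

rng = np.random.default_rng(0)
g = rng.standard_normal((5000, 3))
print(np.abs(third_cumulants(g)).max())   # close to 0 for (symmetric) gaussian data
```

For a skewed input, such as exponentially distributed data, the diagonal entry κ^{1,1,1} approaches the skewness of the marginal (2 for the exponential distribution).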
Appendix C: General Input, Gaussian Kernel—Inverse Case

We first consider H(p) = H(φ_p) − KL(p‖φ_p), with the latter the negentropy:

    H(p) = H(φ_p) − ∫_V p(v) log (p(v)/φ_p(v)) dv
         ≈ H(φ_p) − ∫_V φ_p(v)(1 + Z(v)) log (1 + Z(v)) dv
         ≈ H(φ_p) − ∫_V φ_p(v)(Z(v) + 0.5 Z(v)²) dv
         = H(φ_p) − (1/12) [ Σ_{i=1}^d (κ^{i,i,i})² + 3 Σ_{i,j=1; i≠j}^d (κ^{i,i,j})² ] − (1/6) Σ_{i,j,k=1; i<j<k}^d (κ^{i,j,k})²,    (C.1)

using the Edgeworth expansion and using ∫_V φ_p(v) Z(v) dv = 0. We then have that

    ΔLL(φ‖p) = −∫_V φ(v) log p(v) dv + ∫_V p(v) log p(v) dv
             ≈ −∫_V φ(v) log [φ_p(v)(1 + Z(v))] dv − H(p)
             = −∫_V φ(v) log φ_p(v) dv − ∫_V φ(v)(Z(v) − 0.5 Z(v)²) dv − H(p)
             = ΔLL(φ‖φ_p) + H(φ_p) − H(p) − 0.5 ∫_V φ(v) Z(v)² dv
             ≈ ΔLL(φ‖φ_p),    (C.2)
Differential Log Likelihood
443
where we have used equation C.1 for the last transition.

Appendix D: General Input, General Model

When the mean vectors and covariance matrices of p and p̃ are equal, we have that

    ΔLL(p‖p̃) = −∫_V p(v) log p̃(v) dv + ∫_V p̃(v) log p̃(v) dv
             ≈ −∫_V p(v) log [φ_p̃(v)(1 + Z_p̃(v))] dv − H(p̃)
             = −∫_V p(v) log φ_p̃(v) dv − ∫_V φ_p(v)(1 + Z_p(v)) (Z_p̃(v) − 0.5 Z_p̃(v)²) dv − H(p̃)
             ≈ −∫_V p(v) log φ_p̃(v) dv − ∫_V φ_p(v) (Z_p̃(v) − 0.5 Z_p̃(v)² + Z_p(v) Z_p̃(v)) dv − H(p̃)
             ≈ ΔLL(p‖φ_p̃) + H(φ_p̃) − ∫_V φ_p(v) (Z_p̃(v) − 0.5 Z_p̃(v)² + Z_p(v) Z_p̃(v)) dv − H(p̃)
             = (1/12) [ Σ_{i=1}^d (κ^{i,i,i} − κ̃^{i,i,i})² + 3 Σ_{i,j=1; i≠j}^d (κ^{i,i,j} − κ̃^{i,i,j})² ] + (1/6) Σ_{i,j,k=1; i<j<k}^d (κ^{i,j,k} − κ̃^{i,j,k})² + H(φ_p̃) − H(p̃)
             ≈ (1/12) [ Σ_{i=1}^d (κ^{i,i,i} − κ̃^{i,i,i})² + 3 Σ_{i,j=1; i≠j}^d (κ^{i,i,j} − κ̃^{i,i,j})² ] + (1/6) Σ_{i,j,k=1; i<j<k}^d (κ^{i,j,k} − κ̃^{i,j,k})²
               + (1/12) [ Σ_{i=1}^d (κ̃^{i,i,i})² + 3 Σ_{i,j=1; i≠j}^d (κ̃^{i,i,j})² ] + (1/6) Σ_{i,j,k=1; i<j<k}^d (κ̃^{i,j,k})²,

where we have applied the Edgeworth expansion and used equation C.1 for the last transition.

Appendix E: Data Preprocessing

For some data sets, the ranges of the input dimensions (attributes) are quite different from one another. Since we are working with isotropic kernels, the attributes with extremal ranges are rescaled. This is done for the following attributes and data sets:
Hayes-Roth: The first attribute is divided by 10.
Contraceptive Method Choice: The first attribute is divided by 10.
Wine Recognition: The thirteenth attribute is divided by 100; the third, eighth, and eleventh attributes are divided by 10; the fifth attribute is multiplied by 10.
Boston Housing: The second, sixth, ninth, and eleventh attributes are divided by 10; the fourth attribute is multiplied by 10.
Hepatitis: The fifteenth, sixteenth, and eighteenth attributes are divided by 10.

Acknowledgments

I was supported by research grants received from the Belgian Fund for Scientific Research—Flanders (G.0248.03 and G.0234.04), the Interuniversity Attraction Poles Programme—Belgian Science Policy (IUAP P5/04), the Flemish Regional Ministry of Education (Belgium) (GOA 2000/11), and the European Commission (IST-2001-32114, NEST-2003-012963, and STREP-2002-016276).

References

Ahmad, I. A., & Lin, P. E. (1976). A nonparametric estimation of the entropy for absolutely continuous distributions. IEEE Trans. Information Theory, 22, 372–375.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), 2nd Int'l Symp. in Information Theory (pp. 267–281). Budapest: Akademiai Kiado.
Barndorff-Nielsen, O. E., & Cox, D. R. (1989). Inference and asymptotics. London: Chapman and Hall.
Beirlant, J., Dudewicz, E. J., Györfi, L., & van der Meulen, E. C. (1997). Nonparametric entropy estimation: An overview. Int. J. Math. and Statistical Sciences, 6, 17–39.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Comon, P. (1994). Independent component analysis—a new concept? Signal Process., 36(3), 287–314.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood for incomplete data via the EM algorithm. J. Royal Statistical Soc., B, 39, 1–38.
Friedman, J. (1987). Exploratory projection pursuit. J. American Statistical Association, 82(397), 249–266.
Hall, P. (1984). Limit theorems for sums of general functions of m-spacings. Math. Proc. Camb. Phil. Soc., 96, 517–532.
Huber, P. (1985). Projection pursuit. Annals of Statistics, 13(2), 435–475.
Ishiguro, M., Sakamoto, Y., & Kitagawa, G. (1997). Bootstrapping log likelihood and EIC, an extension of AIC. Annals of the Institute of Statistical Mathematics, 49, 411–434.
Kozachenko, L. F., & Leonenko, N. N. (1987). Sample estimate of entropy of a random vector. Problems of Information Transmission, 23(9), 95–101.
Kullback, S. (1959). Information theory and statistics. New York: Wiley.
Montgomery, D. C., & Runger, G. C. (1999). Applied statistics and probability for engineers. New York: Wiley.
Redner, R. A., & Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev., 26, 195–239.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Tech. J., 27, 379–423.
Silverman, B. W. (1992). Density estimation for statistics and data analysis. London: Chapman & Hall.
Soofi, E. S. (2000). Principal information theoretic approaches. Journal of the American Statistical Association, 95(452), 1349–1353.
Stoica, P., & Selén, Y. (2004). Model-order selection: A review of information criterion rules. IEEE Signal Processing Magazine, 21(4), 36–47.
Van Hulle, M. M. (2005a). Edgeworth approximation of multivariate differential entropy. Neural Computation, 17, 1903–1910.
Van Hulle, M. M. (2005b). Maximum likelihood topographic map formation. Neural Computation, 17(3), 503–513.
Received September 24, 2004; accepted June 22, 2005.
LETTER
Communicated by Helge Ritter
Magnification Control in Self-Organizing Maps and Neural Gas

Thomas Villmann
[email protected] Clinic for Psychotherapy, University of Leipzig, 04107 Leipzig, Germany
Jens Christian Claussen
[email protected] Institute of Theoretical Physics and Astrophysics, Christian-Albrecht University Kiel, 24098 Kiel, Germany
Neural Computation 18, 446–469 (2006)    © 2005 Massachusetts Institute of Technology

We consider different ways to control the magnification in self-organizing maps (SOM) and neural gas (NG). Starting from early approaches of magnification control in vector quantization, we then concentrate on different approaches for SOM and NG. We show that three structurally similar approaches can be applied to both algorithms: localized learning, concave-convex learning, and winner-relaxing learning. Thereby, the approach of concave-convex learning in SOM is extended to a more general description, whereas the concave-convex learning for NG is new. In general, the control mechanisms generate only slightly different behavior comparing both neural algorithms. However, we emphasize that the NG results are valid for any data dimension, whereas in the SOM case, the results hold only for the one-dimensional case.

1 Introduction

Vector quantization is an important task in data processing, pattern recognition, and control (Fritzke, 1993; Haykin, 1994; Linde, Buzo, & Gray, 1980; Ripley, 1996). A large number of different types have been discussed (for an overview, refer to Haykin, 1994; Kohonen, 1995; Duda & Hart, 1973). Neural maps are a popular type of neural vector quantizers that are commonly used in, for example, data visualization, feature extraction, principal component analysis, image processing, classification tasks, and acceleration of common vector quantization (de Bodt, Cottrell, Letremy, & Verleysen, 2004). Well-known approaches are the self-organizing map (SOM) (Kohonen, 1995), the neural gas (NG) (Martinetz, Berkovich, & Schulten, 1993), the elastic net (EN) (Durbin & Willshaw, 1987), and the generative topographic mapping (GTM) (Bishop, Svensén, & Williams, 1998).
In vector quantization, data vectors v ∈ R^d are represented by a few codebook or weight vectors w_i, where i is an arbitrary index. Several criteria exist to evaluate the quality of a vector quantizer. The most common
one is the squared reconstruction error. However, other quality criteria are also known, for instance, topographic quality for neighborhood-preserving mapping approaches (Bauer & Pawelzik, 1992; Bauer, Der, & Villmann, 1999), optimization of mutual information (Linsker, 1989), and other criteria (for an overview, see Haykin, 1994). Generally, a faithful representation of the data space by the codebooks is desired. This property is closely related to the so-called magnification, which describes the relation between data and weight vector density for a given model. The knowledge of the magnification of a map is essential for correct interpretation of its output (Hammer & Villmann, 2003). In addition, explicit magnification control is a desirable property of learning algorithms if, depending on the respective application, only sparsely covered regions of the data space have to be emphasized or, conversely, suppressed.
The magnification can be explicitly expressed for several vector quantization models. Usually, for these approaches, the magnification can be expressed by a power law between the codebook vector density ρ and the data density P. The respective exponent is called the magnification exponent or magnification factor. As explained in more detail below, the magnification is also related to other properties of the map, for example, the reconstruction error as well as the mutual information. Hence, controlling the magnification influences these properties too. In biologically motivated approaches, magnification can also be seen in the context of information representation in brains, for instance, in the senso-motoric cortex (Ritter, Martinetz, & Schulten, 1992).
Magnification and its control can be related to biological phenomena like the perceptual magnet effect, which refers to the fact that rarely occurring stimuli are differentiated with high precision, whereas frequent stimuli are distinguished only in a rough manner (Kuhl, 1991; Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992). It is a kind of attention-based learning with inverted magnification, that is, rarely occurring input samples are emphasized by an increased learning gain (Der & Herrmann, 1992; Herrmann, Bauer, & Der, 1994). This effect is also beneficial in technical systems. In remote-sensing image analysis, for instance, seldom found ground cover classes should be detected, whereas usual (frequent) classes with broad variance should be suppressed (Merényi & Jain, 2004; Villmann, Merényi, & Hammer, 2003). Another technical environment for magnification control is robotics, for accurate description of dangerous navigation states (Villmann & Heinze, 2000).
In this letter, we concentrate on a general framework for magnification control in SOM and NG. In this context, we briefly review the most important approaches. One approach for SOM is generalized, and afterward, it is transferred to NG. For this purpose, we first give the basic notations, followed in section 3 by a more detailed description of magnification and early approaches related to the topic of magnification control, including a unified approach for controlling strategies. The magnification control approaches of SOM are described according to the unified framework in section 4,
whereby one of them is significantly extended. The same procedure is applied to NG in section 5. Again, one of the control approaches presented in this section is new. A short discussion concludes the letter.

2 Basic Concepts and Notations in SOM and NG

In general, neural maps project data vectors v from a (possibly high-dimensional) data manifold D ⊆ R^d onto a set A of neurons i, which is formally written as Ψ_{D→A}: D → A. Each neuron i is associated with a pointer w_i ∈ R^d, all of which establish the set W = {w_i}_{i∈A}. The mapping description is a winner-take-all rule, that is, a stimulus vector v ∈ D is mapped onto that neuron s ∈ A with the pointer w_s being closest to the actually presented stimulus vector v,

    Ψ_{D→A}: v ↦ s(v) = argmin_{i∈A} ‖v − w_i‖.    (2.1)

The neuron s is called the winner neuron. The set R_i = {v ∈ D | Ψ_{D→A}(v) = i} is called the (masked) receptive field of the neuron i. The weight vectors are adapted during the learning process such that the data distribution is represented.
For further investigation, we describe SOM and NG, our focused neural maps, in more detail. During the adaptation process, a sequence of data points v ∈ D is presented to the map with respect to the data distribution P(D). Then the most proximate neuron s according to equation 2.1 is determined, and the pointer w_s, as well as all pointers w_i of neurons in the neighborhood of s, are shifted toward v, according to

    Δw_i = h(i, v, W)(v − w_i).    (2.2)
The property of "being in the neighborhood of s" is represented by a neighborhood function h(i, v, W). For the NG, the neighborhood function is defined as

    h_λ(i, v, W) = exp(−k_i(v, W)/λ),    (2.3)

where k_i(v, W) yields the number of pointers w_j for which the relation ‖v − w_j‖ < ‖v − w_i‖ is valid (Martinetz et al., 1993); in particular, we have h_λ(s, v, W) = 1.0. In the case of SOM, the set A of neurons has a topological structure, usually chosen as a hypercube or hexagonal lattice. Each neuron i has a fixed position r(i). The neighborhood function has the form

    h_σ(i, v, W) = exp(−‖r(i) − r(s(v))‖²_A / (2σ²)).    (2.4)
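The two neighborhood functions can be sketched in a few lines. The toy setup below uses my own variable names and a simple 1-D chain as the SOM lattice:

```python
import numpy as np

def ng_neighborhood(v, w, lam):
    """NG neighborhood h_lambda (eq. 2.3): based on the distance rank k_i of each pointer."""
    dist = np.linalg.norm(v - w, axis=1)
    ranks = np.argsort(np.argsort(dist))      # k_i = number of closer pointers
    return np.exp(-ranks / lam)

def som_neighborhood(v, w, positions, sigma):
    """SOM neighborhood h_sigma (eq. 2.4): based on lattice distance to the winner."""
    winner = np.argmin(np.linalg.norm(v - w, axis=1))
    lattice_dist = np.linalg.norm(positions - positions[winner], axis=1)
    return np.exp(-lattice_dist ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
w = rng.uniform(0, 1, size=(5, 2))               # five pointers in the unit square
positions = np.arange(5, dtype=float)[:, None]   # 1-D chain lattice for the SOM
v = np.array([0.5, 0.5])
print(ng_neighborhood(v, w, lam=1.0))            # the winner's entry equals 1.0 (rank 0)
print(som_neighborhood(v, w, positions, sigma=1.0))
```

Note the structural difference stated in the text: the NG ranks are computed in the input space, while the SOM distances are computed between fixed lattice positions in the output space.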
Magnification Control in Self-Organizing Maps and Neural Gas
449
In contrast to the NG, the neighborhood function of SOM is evaluated in the output space A according to its topological structure. This difference causes significantly different properties of both algorithms. For the SOM, there does not exist any energy function such that the adaptation rule follows a gradient descent (Erwin, Obermayer, & Schulten, 1992). Moreover, the convergence proofs are valid only for the one-dimensional setting (Cottrell, Fort, & Pages, 1998; Ritter et al., 1992). The introduction of an energy function leads to different dynamics, as in the EN (Durbin & Willshaw, 1987), or to a new winner determination rule (Heskes, 1999). The advantage of the SOM is the ordered topological structure of the neurons in A. In contrast, in the original NG, such an order is not given. One can extend the NG to the topology representing network (TRN) such that topological relations between neurons are installed during learning, although generally they do not achieve the simple structure of SOM lattices (Martinetz & Schulten, 1994). Finally, the important advantage of the NG is that the adaptation dynamic of the weight vectors follows a potential-minimizing dynamics (Martinetz et al., 1993).

3 Magnification and Magnification Control in Vector Quantization

3.1 Magnification in Vector Quantization. Usually, vector quantization aims to minimize the reconstruction error RE = Σ_i ∫_{R_i} ‖v − w_i‖² P(v) dv. However, other quality criteria are also known, for instance, topographic quality (Bauer & Pawelzik, 1992; Bauer et al., 1999). More generally, one can consider the generalized distortion error,

    E_γ = ∫_D ‖w_{s(v)} − v‖^γ P(v) dv.    (3.1)
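The generalized distortion error can be estimated by Monte Carlo sampling. A small sketch for a scalar quantizer (names and setup are mine, not from the letter):

```python
import numpy as np

def distortion_error(samples, w, gamma=2.0):
    """Monte Carlo estimate of E_gamma (eq. 3.1): mean winner distance to the power gamma."""
    dist = np.abs(samples[:, None] - w[None, :])   # scalar data: |v - w_i|
    return (dist.min(axis=1) ** gamma).mean()

rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=100_000)      # P(v): uniform on [0, 1]
w = np.linspace(0.05, 0.95, 10)                    # ten equally spaced codebook vectors
print(distortion_error(samples, w, gamma=2.0))     # ~ (0.1**2)/12 ~ 8.3e-4
```

For the uniform density, the equally spaced codebook at the cell midpoints is optimal, and E_2 equals the per-cell variance Δ²/12 with cell width Δ = 0.1, which the estimate reproduces.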
This error is closely related to other properties of the (neural) vector quantizer. One important property is the achieved weight vector density ρ(w) after learning, in relation to the data density P(D). Generally, for vector quantizers, one finds the relation

    ρ(w) ∝ P(w)^α    (3.2)

after the converged learning process (Zador, 1982). The exponent α is called the magnification exponent or magnification factor. The magnification is coupled with the generalized distortion error 3.1 by

    α = d / (d + γ),    (3.3)
Table 1: Magnification of Different Neural Maps and Vector Quantization Approaches.

Model            | Magnification                     | Reference
-----------------|-----------------------------------|------------------------------
Elastic net      | 1 + κ/(σ² P ρ̃)                    | Claussen and Schuster (2002)
SOM              | (1 + 12 M_2(σ)) / (3 + 18 M_2(σ)) | Dersch and Tavan (1995)
Linsker network  | 1                                 | Linsker (1989)
LBG              | d/(d + 2)                         | Zador (1982)
FSCL             | (3β + 1)/(3β + 3)                 | Galanopoulos and Ahalt (1996)
NG               | d/(d + 2)                         | Martinetz et al. (1993)

Note: For SOM, M_2(σ) denotes the second normalized moment of the neighborhood function depending on the neighborhood range σ.
where d is the intrinsic or Hausdorff dimension of the data.¹ Beginning with the pioneering work of Amari (1980), which investigated a resolution-density relation of map formation in a neural field model and extended the approach of Willshaw and von der Malsburg (1976), the magnification relation has been considered for several neural map and vector quantizer approaches, including the investigation of the relation between data and model density. Generally, different magnification factors are obtained for different vector quantization approaches. An overview of several important models with known magnification factors is given in Table 1. For the usual SOMs, mapping a one-dimensional input space onto a chain of neurons,

α_SOM = 2/3   (3.4)

holds in the limit 1 ≪ σ ≪ N (Ritter & Schulten, 1986). For small values of the neighborhood range σ, the neighborhood ceases to be of influence, and the magnification rate approaches the value α = 1/3 (Dersch & Tavan, 1995). The influence of different types of neighborhood function was studied in detail for SOMs in Dersch and Tavan (1995), which extends the early works

¹ Several approaches are known to estimate the Hausdorff dimension of data, often called intrinsic dimension. One of the best-known methods is the Grassberger-Procaccia analysis (GP) (Grassberger & Procaccia, 1983; Takens, 1985). For GP, there is a large number of investigations of statistical properties (e.g., Camastra & Vinciarelli, 2001; Eckmann & Ruelle, 1992; Liebert, 1991; Theiler, 1990). For a neural network approach to intrinsic dimension estimation (based on NG), also in comparison to GP, we refer to Bruske and Sommer (1998), Camastra and Vinciarelli (2001), Villmann, Hermann, and Geyer (2000), Villmann (2002), and Villmann et al. (2003).
Magnification Control in Self-Organizing Maps and Neural Gas
of Luttrell (1991) and Ritter (1991). The magnification depends on the second normalized moment M_2 of the neighborhood function, which itself is determined by the neighborhood range σ. van Hulle (2000) extensively discussed the influence of kernel approaches in SOMs. Results for magnification of discrete SOMs can be found in Ritter (1989) and Kohonen (1999). These latter problems and approaches will not be further addressed here. According to equations 3.3 and 3.1, the SOM minimizes the somewhat exotic E_{1/2} distortion error, whereas the NG minimizes the usual E_2 error. Further, we can observe interesting relations in the information-theoretic properties of the mapping: the information transfer realized by the mapping D → A is, in general, not independent of the magnification of the map (Zador, 1982). It has been derived that for a vector quantizer (or a neural map in our context) realizing an optimal information transfer, the relation α = 1 holds (Brause, 1992). A vector quantizer designed to achieve an optimal information transfer is the Linsker network (Linsker, 1989; see Table 1), or the optimal coding network approach proposed by Brause (1994).

3.2 Magnification Control in Vector Quantization: A General Framework. As pointed out in section 1, different application tasks may require different magnification properties of the vector quantizer; that is, the magnification should be controlled. Straightforwardly, magnification control means changing the value of the magnification factor α for a given vector quantizer by manipulation of the basic approach. Consequently, the question is, how can one affect the magnification factor to achieve an a priori chosen value? We address this topic in the following. First, we review results from the literature and put them into a general framework. The first approaches to influence the magnification of a vector quantizer are models of conscience learning, characterized by a modified winner determination.
The algorithm by DeSieno (1988) and the frequency-sensitive competitive learning (FSCL) (Ahalt, Krishnamurty, Chen, & Melton, 1990) belong to this algorithm class. Originally, these approaches were proposed for equalizing the winner probability of the neural units in SOM. However, as the neighborhood relation between neurons is not used in this approach, it is applicable to each vector quantizer based on winner-take-all learning. To achieve the announced goal, in the DeSieno model, a bias term B_i is inserted into the winner determination rule, equation 2.1, such that

D → A : v ↦ s(v) = argmin_{i∈A} (‖v − w_i‖ − B_i)   (3.5)

with the bias term B_i = γ(1/N − p_i), and p_i the actual winning probability of neuron i. The algorithm converges such that the winning probabilities of all neurons are equalized, which is related to a maximization of the entropy,
and, hence, the resulting magnification equals unity. However, an arbitrary magnification cannot be achieved. Moreover, as pointed out in van Hulle (2000), the algorithm shows unstable behavior. FSCL modifies the selection criterion for the best-matching unit by a fairness term F, which is a function of the winning frequency ω_i of the neurons. Again, the winner determination is modified:

D → A : v ↦ s(v) = argmin_{i∈A} (F(ω_i) ‖v − w_i‖).   (3.6)
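The two modified winner determination rules, equations 3.5 and 3.6, can be sketched as follows. The parameter values and the +1 offset keeping the fairness term positive before any wins are illustrative assumptions of ours, not taken from the original algorithms:

```python
import numpy as np

def desieno_winner(v, W, p, gamma=10.0):
    """Conscience winner rule, equation 3.5: bias B_i = gamma * (1/N - p_i)."""
    bias = gamma * (1.0 / len(W) - p)
    return int(np.argmin(np.linalg.norm(W - v, axis=1) - bias))

def fscl_winner(v, W, wins, beta=1.0):
    """FSCL winner rule, equations 3.6 and 3.7: fairness F = (win count)^beta."""
    F = (wins + 1.0) ** beta        # +1 keeps F positive before any wins
    return int(np.argmin(F * np.linalg.norm(W - v, axis=1)))

W = np.array([[0.0], [1.0]])
# in both rules, the rarely winning unit 1 takes the win from nearby unit 0
print(desieno_winner(np.array([0.1]), W, p=np.array([0.9, 0.1])))
print(fscl_winner(np.array([0.1]), W, wins=np.array([10.0, 0.0])))
```

Both rules leave the weight update itself untouched; only the competition is biased toward units that win too rarely.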
As mentioned above, originally it was defined to achieve an equiprobable quantization too. However, it was shown that this goal cannot be achieved by the original version (Galanopoulos & Ahalt, 1996; van Hulle, 2000). Yet for one-dimensional data, any given γ-norm error criterion, equation 3.1, can be minimized by a specific choice of the fairness function: if F(ω_i) is taken as

F(ω_i) = (ω_i)^β,   (3.7)

then, for the one-dimensional case, a magnification α_FSCL = (3β + 1)/(3β + 3) is achieved, being equivalent to γ = 2/(3β + 1) (Galanopoulos & Ahalt, 1996). The difficulties of transferring the one-dimensional result to higher dimensions are, however, as prohibitive as in SOM.

We now study control possibilities for achieving arbitrary magnification, focusing on SOM and NG by modification of the learning rule. We emphasize again that for SOM, the results hold only for the one-dimensional case, whereas for NG, the more general case of arbitrary dimensionality is valid. Thus, the following directions of modification of the general learning rule, equation 2.2,

Δw_i = ε h(i, v, W)(v − w_i),

can serve as a general framework:
- Localized learning: introduction of a multiplicative factor by a local learning rate ε_i
- Winner-relaxing learning: introduction of winner relaxing by adding a winner-enhancing (relaxing) term R
- Convex-concave learning: scaling of the learning shift by powers ξ in the factor (v − w_i)^ξ
These three directions serve as axes for a taxonomy in the following section. We focus on SOM and NG as popular neural vector quantizers. We explain, expand, and develop the respective methodologies of magnification
control for these models. The localized and the winner-relaxing learning for SOM and NG are briefly reviewed. In particular, localized learning for SOM was published in Bauer, Der, and Herrmann (1996), whereas winner-relaxing learning for both SOM and NG and localized learning in NG were previously developed by the authors (Claussen, 2003, 2005; Claussen & Villmann, 2003a; Villmann, 2000). The concave-convex learning for SOM is extended here to a more general approach compared to its origins (Zheng & Greenleaf, 1996). The concave-convex learning for NG is new too.

4 Controlling the Magnification in SOM

Within the general framework outlined in section 3.2, we now consider the three learning rule modifications for SOM.

4.1 Insertion of a Multiplicative Factor: Localized Learning. The first choice is to add a factor in the SOM learning rule. An established realization is localized learning, the biological motivation of which is the perceptual magnet effect (Bauer et al., 1996). For this purpose, an adaptive local learning step size ε_{s(v)} was introduced in equation 2.2 such that the new adaptation rule reads as

Δw_i = ε_{s(v)} h_σ(i, v, W)(v − w_i),   (4.1)
where s(v) is the best-matching neuron with respect to equation 2.1. The local learning rates ε_i = ε(w_i) depend on the stimulus density P at the position of their weight vectors w_i via

ε_i = ε_0 ⟨P(w_i)⟩^m,   (4.2)
where the brackets ⟨·⟩ denote the average in time. This approach leads to the new magnification law,

α_localSOM = α_SOM · (m + 1),   (4.3)
where m appears as an explicit control parameter (Bauer et al., 1996). Hence, an arbitrary predefined magnification can be achieved. In applications, one has to estimate the generally unknown data distribution P, which may lead to numerical instabilities of the control mechanism (van Hulle, 2000).

4.2 Winner-Relaxing SOM and Magnification Control. Recently, a new approach for magnification control of the SOM by a generalization (Claussen, 2003, 2005) of the winner-relaxing modification (Kohonen, 1991)
454
T. Villmann and J. Claussen
was derived, giving a control scheme that is independent of the shape of the data distribution (Claussen, 2005). We refer to this algorithm as WRSOM. In the original winner-relaxing SOM, an additional term occurs in learning for the winning neuron only, implementing a relaxing behavior. The relaxing force is a weighted sum of the differences between the weight vectors and the input according to their neighborhood relation. The relaxing term was introduced to obtain a learning dynamic for SOM according to an average reconstruction error, taking into account the effect of shifting Voronoi borders. A winner-relaxing term R(µ, κ) is added to the original learning rule:

Δw_i = ε h_σ(i, v, W)(v − w_i) + R(µ, κ),   (4.4)

with R(µ, κ) being

R(µ, κ) = ε(µ + κ)(v − w_i) δ_{is} − ε κ δ_{is} Σ_j h_σ(j, v, W)(v − w_j),   (4.5)
depending on weighting parameters µ and κ. For µ = 0 and κ = 1/2, the original winner-relaxing SOM is obtained (Kohonen, 1991). Surprisingly, it has been shown that the magnification is independent of µ (Claussen, 2003, 2005). Only the choice of κ contributes to the magnification:

α_WRSOM = 2/(κ + 3).   (4.6)
The stability range is |κ| ≤ 1, which restricts the accessible magnification range to 1/2 ≤ α_WRSOM ≤ 1. More detailed numerical simulations and stability analysis can be found in Claussen (2005). The advantage of winner-relaxing learning is that no estimate of the generally unknown data distribution has to be made, as required in the local learning approach above.

4.3 Convex-Concave Learning. The third structural possibility for control according to our framework is to apply convex-concave learning in the learning rule. This approach was introduced in Zheng and Greenleaf (1996). Here, we extend this approach to a more general variant. Originally, an exponent ξ was introduced in the general learning rule such that equation 2.2 now reads as

Δw_i = ε h_σ(i, v, W)(v − w_i)^ξ   (4.7)
Magnification Control in Self-Organizing Maps and Neural Gas
455
with the definition

(v − w_i)^ξ := (v − w_i) · ‖v − w_i‖^{ξ−1}.   (4.8)
Thereby, two possibilities are proposed: ξ = 1/κ with κ > 1, κ ∈ ℕ, and κ odd (convex learning), or one simply takes ξ > 1, ξ ∈ ℕ, and ξ odd (concave learning). This gives the magnification

α_concave/convexSOM = 2/(ξ + 2)   (4.9)
                    = α_SOM · 3/(ξ + 2),   (4.10)
which allows an explicit magnification control. Yet this approach allows only a rather rough control around ξ = 1: the neighboring allowed values are ξ = 1/3 and ξ = 3, corresponding to magnifications α_convex/concaveSOM = 6/7 and α_concave/convexSOM = 2/5, respectively. Therefore, greater flexibility would be of interest. For this purpose, we are seeking a generalization of both concave and convex learning. As a more general choice, we take ξ to be real, that is, ξ ∈ ℝ. If we do so, the same magnification law, equation 4.9, is obtained. The proof of the magnification law is given in appendix A. Obviously, the choices ξ = 1/κ and ξ = κ > 1, κ ∈ ℕ and κ odd, as made in Zheng and Greenleaf (1996), are special cases of the now general approach.

We considered the numerical behavior of this generalized convex-concave magnification control using a one-dimensional chain of 50 neurons. The data distribution was chosen in agreement with Bauer et al. (1996) as P(x) = sin(πx). The theoretical entropy maximum of the winning probabilities p_i of the neurons is −Σ_{i=1}^{N} p_i log(p_i) = log(N), giving the value 3.912 for N = 50. The results in dependence on ξ for different neighborhood ranges σ are depicted in Figure 1. According to the theoretical prediction, the output entropy is maximized for small ξ, and for large ξ, a magnification exponent of zero is reached, corresponding to an equidistant codebook without adaptation to the input distribution. For σ < 1, the turnover is shifted toward smaller values of ξ, and for ξ ≪ 1, σ ≪ 1, fluctuations increase. Further, as in the WRSOM, the advantage of convex-concave learning is that no estimate of the generally unknown data distribution has to be made, as was the case in localized learning.
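A minimal sketch of convex-concave learning: the signed power of definition 4.8 applied in the update of equation 4.7, together with the magnification law of equation 4.9. The helper names are ours:

```python
import numpy as np

def signed_power(u, xi):
    """(v - w)^xi per definition 4.8: direction times norm^(xi - 1)."""
    n = np.linalg.norm(u)
    return u * n ** (xi - 1.0) if n > 0 else u

def cc_som_step(W, v, h, xi, eps=0.1):
    """Convex-concave SOM step, equation 4.7, for real xi."""
    return W + eps * h[:, None] * np.array([signed_power(v - w, xi) for w in W])

def alpha_cc_som(xi):
    """Magnification law, equation 4.9."""
    return 2.0 / (xi + 2.0)
```

Here alpha_cc_som(1/3) = 6/7 and alpha_cc_som(3) = 2/5 reproduce the two values allowed by the original integer scheme; any real ξ in between is admitted by the generalization.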
Figure 1: Output entropy H for convex and concave learning (neighborhood ranges σ = 2.0, 1.0, 0.5). An input density of P(x) = sin(πx) was presented to a one-dimensional chain of N = 50 neurons after 10^6 learning steps of stochastic sequential updating, averaged over 10^5 inputs, and learning rate ε = 0.01, fixed.
5 Magnification Control in Neural Gas

In this section we transfer the ideas of magnification control in SOM to the NG, keeping in mind the advantage that the results then are valid for any dimension.

5.1 Multiplicative Factor: Localized Learning. The idea of localized learning is now applied to NG (Herrmann & Villmann, 1997). Hence, we have the localized learning rule

Δw_i = ε_{s(v)} h_λ(i, v, W)(v − w_i),   (5.1)

with s(v) again being the best-matching neuron with respect to equation 2.1 and the local learning rate ε_{s(v)} chosen as in equation 4.2. This approach gives a similar result as for SOM,

α_localNG = α_NG · (m + 1),   (5.2)
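A sketch of one localized NG step, equations 5.1 and 4.2 (the callable P stands in for the generally unknown data density, which in practice must itself be estimated), together with the entropy-maximizing m = 2/d implied by α_NG · (m + 1) = 1:

```python
import numpy as np

def local_ng_step(W, v, P, lam=1.5, eps0=0.1, m=2.0):
    """Localized NG step, equation 5.1, with eps_{s(v)} = eps0 * P(w_s)^m."""
    dist = np.linalg.norm(W - v, axis=1)
    k = np.argsort(np.argsort(dist))        # winning ranks k_i(v, W)
    s = int(np.argmin(dist))                # best-matching unit, equation 2.1
    eps = eps0 * P(W[s]) ** m               # local learning rate, equation 4.2
    return W + eps * np.exp(-k / lam)[:, None] * (v - W)

def optimal_m(d):
    """m maximizing entropy: alpha_NG * (m + 1) = 1 with alpha_NG = d/(d+2)."""
    return 2.0 / d
```

Note that optimal_m reproduces the arrows of Figure 2: m = 2, 1, and 2/3 for d = 1, 2, and 3.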
Figure 2: Local learning for NG: Plot of the entropy H for maps trained with different magnification control parameters m (d = 1, d = 2, d = 3). The arrows indicate the theoretical values of m (m = 2, m = 1, m = 2/3, resp.), which maximize the entropy of the map.
and, hence, allows a magnification control (Villmann, 2000). However, we have similar restrictions as for SOM: in actual applications, one has to estimate the generally unknown data distribution P. The numerical study shows that the approach can also be used to increase the mutual information of a map generated by an NG (Villmann, 2000). As for WRSOM, we use a standard setup as in Villmann (2000) of 50 neurons and 10^7 training steps with a probability density P(x_1, …, x_d) = ∏_i sin(πx_i), x ∈ [0, 1]^d, and with parameters λ = 1.5 fixed and ε decaying from 0.5 to 0.05. The entropy of the resulting map, computed for input dimensions of 1, 2, and 3, is plotted in Figure 2.

5.2 Winner-Relaxing NG. The winner-relaxing NG (WRNG) was first studied in Claussen and Villmann (2003a). According to the WRSOM approach, one adds a winner-relaxing term R(µ, κ) to the original learning rule:

Δw_i = ε h_λ(i, v, W)(v − w_i) + R(µ, κ),   (5.3)
with R(µ, κ) as in equation 4.5. The resulting WRNG magnification for small neighborhood values λ, with λ → 0 but not vanishing, is given by Claussen and Villmann (2005):

α_WRNG = (1/(1 − κ)) · d/(d + 2).   (5.4)
Thereby, the magnification exponent appears to be independent of an additional diagonal term for the winner (controlled by µ), the same as in WRSOM; again, µ = 0 is the usual setting. If the same stability borders |κ| = 1 of the WRSOM also apply here, one can expect to increase the NG exponent by positive values of κ or to lower the NG exponent by a factor 1/2 for κ = −1. However, one has to be cautious when transferring the λ → 0 result obtained above (which would require increasing the number of neurons as well) to a realistic situation, where the decrease of λ over time will be limited to a final finite value to avoid the stability problems found in Herrmann and Villmann (1997).

For a finite λ, the maximal coefficient h_λ that contributes to the averaged learning shift is given by the prefactor of the neuron ranked next to the winner, which is e^{−1/λ} (Claussen & Villmann, 2005). For the NG, the neighborhood is defined by the rank list. As the winner term of the NG is not present in the winner-relaxing term (for µ = 0), all terms share the factor e^{−1/λ} by h_λ(k) = e^{−1/λ} h_λ(k − 1), which indicates that in the discretized algorithm, κ has to be rescaled by e^{+1/λ} to agree with the continuum theory. The numerical investigation indicates that this prefactor applies for finite λ and number of neurons. The scaling of the position of the entropy maximum with input dimension is in agreement with theory, as is the prediction of the opposite sign of κ that has to be taken to increase mutual information.

Numerical studies show that winner-relaxing learning can also be used to increase the mutual information of an NG vector quantization. The entropy shows a dimension-dependent maximum approximately at κ = (2/(d + 2)) e^{1/λ} (see Figure 3). In any case, within a broad range around the optimal κ, the entropy is close to the maximum. The advantage of the method is that, like its SOM equivalent WRSOM, it is independent of an estimate of the unknown data distribution.
Further, again as in the WRSOM, the magnification of WRNG is independent, to first order, of the diagonal term controlled by µ. Numerical simulations have shown that the contribution in higher orders is marginal (Claussen & Villmann, 2003b). More pronounced is the influence of the diagonal term on stability. According to the larger prefactor, no stable behavior has been found for |µ| ≥ 1; therefore, µ = 0 is the recommended setting (Claussen & Villmann, 2005).
Figure 3: Winner-relaxing learning for NG: Plot of the entropy H curves for varying values of κ for one-, two-, and three-dimensional data. The entropy attains its maximum if the magnification equals unity (Zador, 1982). The arrows indicate the κ-values for the respective data dimensions.
5.3 Concave-Convex Learning. We now consider the third modification known from SOM, the concave-convex learning approach, but in its newly developed general variant,

Δw_i = ε h_λ(i, v, W)(v − w_i)^ξ,   (5.5)
with ξ ∈ ℝ and definition 4.8. It is proved in appendix B that the resulting magnification is

α_concave/convexNG = d/(ξ + 1 + d),   (5.6)
depending on the intrinsic data dimensionality d. This dependency is in agreement with the usual magnification law of NG, which is also related to the data dimension. The respective numerical simulations, with the parameter choice as before, are given in Figure 4. In contrast to concave-convex SOM, where α = 1 can be achieved for small ξ, here α is bounded by d/(d + 1); information-optimal learning is not possible in the case of low-dimensional data.
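Equation 5.6 can be wrapped in a one-line helper (ours). Note the two limits it exposes: ξ = 1 recovers the standard NG magnification d/(d + 2), while the convex limit ξ → 0 only reaches the bound d/(d + 1) < 1:

```python
def alpha_cc_ng(xi, d):
    """Concave-convex NG magnification, equation 5.6."""
    return d / (xi + 1.0 + d)

# xi = 1 recovers the plain NG law d/(d+2); the convex limit xi -> 0
# only reaches d/(d+1) < 1, so alpha = 1 stays out of reach.
```

This is the quantitative content of the statement above that information-optimal learning cannot be reached for low-dimensional data.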
Figure 4: Concave-convex learning for NG. Plot of the entropy H curves for varying values of ξ for one-, two-, and three-dimensional data. The entropy can be enhanced by convex learning in each case (dashed line: d = 1, with 10^8 learning steps).
6 Discussion

According to the given general framework, we studied three structurally different approaches for magnification control in SOM and NG. All methods are capable of controlling the magnification with more or less accuracy. Yet they differ in properties (e.g., stability range, density estimation). No approach yet shows a clear advantage. The choice of the optimal algorithm may depend on the particular problem and implementation constraints.

In particular, several problems occur in actual applications. First, in the SOM case, all results are valid only for the one-dimensional case, because all investigations are based on the usual convergence dynamic. However, the SOM dynamics is analytically treatable only in the one-dimensional setting and in higher-dimensional cases that factorize. Moving away from these special cases causes a systematic shift in magnification control, as numerically shown in Jain and Merényi (2004). In actual applications, a quantitative comparison with theory is quite limited due to several influences that are not easily tractable. First, the data density has to be estimated, which is generally difficult (Merényi & Jain, 2004); second, the intrinsic dimension has to be determined; and third, the measurement of the magnification from the density of weight vectors is rather coarse, especially in higher dimensions.
Table 2: Comparison of Magnification Control for the Different Control Approaches for SOM and NG (d = 1 for SOM).

Control approach         | SOM                                                      | NG
-------------------------|----------------------------------------------------------|---------------------------------------------
Local learning           | (m + 1) α_SOM (Bauer et al., 1996)                       | (m + 1) α_NG (Villmann, 2000)
Winner-relaxing learning | 3/(κ + 3) · α_SOM (Claussen, 2003, 2005)                 | 1/(1 − κ) · α_NG (Claussen & Villmann, 2005)
Concave-convex learning  | 3/(ξ + 2) · α_SOM (section 4.3; Zheng & Greenleaf, 1996) | (d + 2)/(d + ξ + 1) · α_NG (section 5.3)
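The formulas collected in Table 2 can be gathered into a single helper function (the names, the dictionary, and the interface are our own):

```python
ALPHA = {"SOM": lambda d=1: 2.0 / 3.0,      # one-dimensional SOM
         "NG": lambda d: d / (d + 2.0)}

def controlled_alpha(model, scheme, param, d=1):
    """Magnification factors of Table 2 (d = 1 for SOM).
    param is m, kappa, or xi, depending on the scheme."""
    a = ALPHA[model](d)
    if scheme == "local":
        return (param + 1.0) * a
    if scheme == "winner_relaxing":
        return a * (3.0 / (param + 3.0) if model == "SOM"
                    else 1.0 / (1.0 - param))
    if scheme == "concave_convex":
        return a * (3.0 / (param + 2.0) if model == "SOM"
                    else (d + 2.0) / (d + param + 1.0))
    raise ValueError(scheme)
```

For example, controlled_alpha("SOM", "local", 0.5) returns 1, reflecting that m = 1/2 makes the one-dimensional SOM information optimal.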
Only some special cases can be handled adequately. In particular, maximizing mutual information can be controlled easily by observation of the entropy of the winning probabilities of neurons, or by consideration of inverted magnification in case of available auxiliary class information, that is, labeled data (Merényi & Jain, 2004). Thus, actual applications have to be done carefully using some heuristics. Interestingly, successful applications of magnification control (by local learning) in satellite remote-sensing image analysis can be found in Merényi and Jain (2004), Villmann (1999), and Villmann et al. (2003).

Summarizing the above approaches of magnification control, we obtain the good news that the possibilities for magnification control known from SOM can be successfully transferred to NG learning in all three cases. The achieved theoretical magnifications are collected in Table 2. The interesting point is that the local learning approach, as well as concave-convex learning, yields structurally similar modification factors for the new magnification. However, a magnification of 1 is not reachable by concave-convex learning in the case of NG. In the case of the winner-relaxing approach, we have a remarkable difference: in contrast to the WRSOM, where the relaxing term has to be inverted (κ < 0) to increase the magnification exponent, for the NG, positive values of κ are required to increase the magnification factor.

Appendix A: Magnification Law of the Generalized Concave-Convex Learning for the Self-Organizing Map

In this appendix we prove the magnification law of the generalized concave-convex learning for SOM: the exponent in equation 4.7 is now allowed to be any ξ ∈ ℝ, keeping definition 4.8 in mind. Since the convergence proofs of SOM are valid only for the one-dimensional setting, we switch from the vector w to the scalar w and from the vector v to the scalar v.
In the continuum approach, we can replace the index of the neuron by its position or location r (Ritter et al., 1992). Further, the neighborhood function h_σ depends only on the difference of the location r to r_{s(v)}, the location of the winning neuron. Then, in the equilibrium of the learning rule, equation 4.7, we have

∫ h_σ(r − r_{s(v)}) (v − w(r))^ξ P(v) dv = 0.   (A.1)
We perform the usual approach of expanding the integrand in a Taylor series in powers of ς = s(v) − r and evaluating at r (Ritter & Schulten, 1986; Hertz, Krogh, & Palmer, 1991; Zheng & Greenleaf, 1996). This gives

v = w(r + ς),   (A.2)
h_σ(s(v) − r) becomes h_σ(ς) = h_σ(−ς), and

P(v) = P(w(r + ς)) ≈ P(w) + ς P′(w) w′(r).   (A.3)
Further, dv = dw(r + ς) = w′(r + ς) dς can be rewritten as

w′(r + ς) dς ≈ (w′ + ς w″) dς,   (A.4)
and for v − w(r) = w(r + ς) − w(r), we get

w(r + ς) − w(r) ≈ ς w′ + (1/2) ς² w″ = ς (w′ + (1/2) ς w″).   (A.5)
Because of (v − w(r))^ξ in equation A.1, we consider (w′ + (1/2) ς w″)^ξ:

(w′ + (1/2) ς w″)^ξ ≈ (w′)^ξ (1 + (ς ξ/2) · (w″/w′)).   (A.6)
Further, because of definition 4.8, the power ς^ξ has to be interpreted as

ς^ξ = ς · |ς|^{ξ−1},   (A.7)
which is an odd function in ς. Collecting now equations A.2 to A.7, we get in equation A.1

0 = ∫ h_σ(ς) · ς · |ς|^{ξ−1} · (w′)^ξ · (1 + (ξ/2) w″ (w′)^{−1} ς) × (P(w) + ς P′(w) w′(r)) (w′ + ς w″) dς.   (A.8)
Since ς · |ς|^{ξ−1} is odd, the term of lowest order in ς vanishes according to the rotational symmetry of h_σ(ς). Further, in our approximation, we ignore terms beyond ς². Hence, the above equation can be simplified to

0 = (w′)^ξ (P′(w)(w′)² + ((ξ + 2)/2) P(w) w″) ∫ h_σ(ς) · ς² · |ς|^{ξ−1} dς.   (A.9)
From there we get

ρ = dr/dw ∝ P^{2/(2+ξ)}   (A.10)

and, hence,

α_concave/convexSOM = 2/(2 + ξ),   (A.11)
which completes the proof.

Appendix B: Magnification Law of the Generalized Concave-Convex Learning for Neural Gas

For the derivation of the magnification for the generalized convex-concave learning in the case of magnification-controlled NG, we first make the usual continuum assumption (Ritter et al., 1992). The further treatment is in complete analogy to the derivation of the magnification in the usual NG (Martinetz et al., 1993). Let r be the difference vector,

r = v − w_i.   (B.1)

The winning rank k_i(v, W) in the neighborhood function h_λ(i, v, W) in equation 2.3 depends only on r; therefore, we introduce the new variable

x(r) = r̂ · k_i(r)^{1/d},   (B.2)
which can be assumed to increase monotonically with r = ‖r‖. We define the d × d Jacobian

J(x) = det(∂r_k/∂x_l).   (B.3)
Starting from the new learning rule,

Δw_i = ε h_λ(i, v, W)(v − w_i)^ξ,   (B.4)

we again consider the averaged change,

⟨Δw_i⟩ = ε ∫ P(v) h_λ(i, v, W)(v − w_i)^ξ dv.   (B.5)
If h_λ(i, v, W) in equation 2.3 rapidly decreases to zero with increasing r, we can replace the quantities r(x), J(x) by the first terms of their respective Taylor expansions around the point x = 0, neglecting higher derivatives. We obtain

x(r) = r (τ_d ρ(w_i))^{1/d} (1 + (r · ∂_r ρ(w_i))/(d · ρ(w_i)) + O(r²)),   (B.6)

which corresponds to

r(x) = (x / (τ_d ρ(w_i))^{1/d}) (1 − (τ_d ρ(w_i))^{−1/d} · (x · ∂_r ρ(w_i))/(d · ρ(w_i)) + O(x²))   (B.7)
with

τ_d = π^{d/2} / Γ(d/2 + 1)   (B.8)
as the volume of a d-dimensional unit sphere (Martinetz et al., 1993). We define ϕ = τ_d ρ(w_i). Further, we expand J(x) and obtain

J(x) = J(0) + x_k (∂J/∂x_k) + …   (B.9)
     = (1/ϕ) (1 − ϕ^{−1/d} (1 + 1/d) · x · (∂_r ρ/ρ)) + O(x²)   (B.10)

and, hence,

∂J/∂x |_{x=0} = −(1 + 1/d) ϕ^{−(1 + 1/d)} (∂_r ρ/ρ).   (B.11)
After collecting all replacements, equation B.5 becomes

⟨Δw_i⟩ = ε · ϕ^{−ξ/d} ∫_D dx h_λ(x) · x^ξ · (P + ϕ^{−1/d} · x · ∂_r P + …) · (1/ϕ − (1 + 1/d) ϕ^{−(1 + 1/d)} · x · (∂_r ρ/ρ) + …) · (1 − ϕ^{−1/d} · x · (∂_r ρ)/(d · ρ) + …)^ξ,   (B.12–B.14)
with the new integration variable x. We use the approximation

(1 − ϕ^{−1/d} · x · (∂_r ρ)/(d · ρ) + …)^ξ ≈ 1 − ξ ϕ^{−1/d} · x · (∂_r ρ)/(d · ρ) + …   (B.15)
and get

⟨Δw_i⟩ = ε · ϕ^{−ξ/d} ∫_D dx h_λ(x) · x^ξ · (P + ϕ^{−1/d} · x · ∂_r P + …) · (1/ϕ − (1 + 1/d) ϕ^{−(1 + 1/d)} · x · (∂_r ρ/ρ) + …) · (1 − ξ ϕ^{−1/d} · x · (∂_r ρ)/(d · ρ) + …).   (B.16–B.18)
In the equilibrium ⟨Δw_i⟩ = 0, we have

0 = ∫_D dx h_λ(x) · x^ξ · (P + ϕ^{−1/d} · x · ∂_r P + …) · (1/ϕ − (1 + 1/d) ϕ^{−(1 + 1/d)} · x · (∂_r ρ/ρ) + …) · (1 − ξ ϕ^{−1/d} · x · (∂_r ρ)/(d · ρ) + …).   (B.19–B.20)
Because of the rotational symmetry of h_λ, we can neglect odd power terms in x; the remaining terms are of even power order. Again, according to equation 4.8, we take x^ξ = x · |x|^{ξ−1}, and, hence, x^ξ itself acts as an odd term. Therefore, only terms containing x^{ξ+k} with odd k contribute. Finally,
considering the nonvanishing terms and neglecting higher-order terms, we find the relation

(∂_r P)/P(w_i) · d/(d + ξ + 1) = (∂_r ρ)/ρ,   (B.21)
which is the desired result, since it implies ρ ∝ P^{d/(d+ξ+1)}, that is, equation 5.6.

Acknowledgments

We are grateful to Barbara Hammer (University Clausthal, Germany) for intensive discussions and comments. We also thank an anonymous reviewer for a hint that led us to a more elegant proof in appendix A.
References

Ahalt, S. C., Krishnamurty, A. K., Chen, P., & Melton, D. E. (1990). Competitive learning algorithms for vector quantization. Neural Networks, 3, 277–290.
Amari, S.-I. (1980). Topographic organization of nerve fields. Bulletin of Mathematical Biology, 42, 339–364.
Bauer, H. U., Der, R., & Herrmann, M. (1996). Controlling the magnification factor of self-organizing feature maps. Neural Computation, 8, 757–771.
Bauer, H. U., Der, R., & Villmann, Th. (1999). Neural maps and topographic vector quantization. Neural Networks, 12, 659–676.
Bauer, H. U., & Pawelzik, K. R. (1992). Quantifying the neighborhood preservation of self-organizing feature maps. IEEE Trans. on Neural Networks, 3, 570–579.
Bishop, C. M., Svensén, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computation, 10, 215–234.
Brause, R. (1992). Optimal information distribution and performance in neighbourhood-conserving maps for robot control. Int. J. Computers and Artificial Intelligence, 11, 173–199.
Brause, R. W. (1994). An approximation network with maximal transinformation. In M. Marinaro & P. G. Morasso (Eds.), Proc. ICANN'94, International Conference on Artificial Neural Networks (Vol. 1, pp. 701–704). London: Springer.
Bruske, J., & Sommer, G. (1998). Intrinsic dimensionality estimation with optimally topology preserving maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 572–575.
Camastra, F., & Vinciarelli, A. (2001). Intrinsic dimension estimation of data: An approach based on Grassberger-Procaccia's algorithm. Neural Processing Letters, 14, 27–34.
Claussen, J. C. (2003). Winner-relaxing and winner-enhancing Kohonen maps: Maximal mutual information from enhancing the winner. Complexity, 8(4), 15–22.
Claussen, J. C. (2005). Winner-relaxing self-organizing maps. Neural Computation, 17, 997–1009.
Claussen, J. C., & Schuster, H. G. (2002). Asymptotic level density of the elastic net self-organizing feature map. In J. R. Dorronsoro (Ed.), Proc. International Conf. on Artificial Neural Networks (ICANN) (pp. 939–944). Berlin: Springer-Verlag.
Claussen, J. C., & Villmann, Th. (2003a). Magnification control in winner relaxing neural gas. In M. Verleysen (Ed.), Proc. of European Symposium on Artificial Neural Networks (ESANN'2003) (pp. 93–98). Brussels: d-side.
Claussen, J. C., & Villmann, Th. (2003b). Magnification control in neural gas by winner relaxing learning: Independence of a diagonal term. In O. Kaynak, E. Alpaydin, E. Oja, & L. Xu (Eds.), Proc. International Conference on Artificial Neural Networks (ICANN'2003) (pp. 58–61). Istanbul: Bogazici University.
Claussen, J. C., & Villmann, Th. (2005). Magnification control in winner-relaxing neural gas. Neurocomputing, 63, 125–137.
Cottrell, M., Fort, J. C., & Pages, G. (1998). Theoretical aspects of the SOM algorithm. Neurocomputing, 21, 119–138.
de Bodt, E., Cottrell, M., Letremy, P., & Verleysen, M. (2004). On the use of self-organizing maps to accelerate vector quantization. Neurocomputing, 17, 187–203.
Der, R., & Herrmann, M. (1992). Attention based partitioning. In M. Van der Meer (Ed.), Bericht des Status-Seminar des BMFT Neuroinformatik (pp. 441–446). Berlin: DLR.
Dersch, D., & Tavan, P. (1995). Asymptotic level density in topological feature maps. IEEE Trans. on Neural Networks, 6, 230–236.
DeSieno, D. (1988). Adding a conscience to competitive learning. In Proc. ICNN'88, International Conference on Neural Networks (pp. 117–124). Piscataway, NJ: IEEE Service Center.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326, 689–691.
Eckmann, J. P., & Ruelle, D. (1992). Fundamental limitations for estimating dimensions and Lyapunov exponents in dynamical systems. Physica D, 56, 185–187.
Erwin, E., Obermayer, K., & Schulten, K. (1992). Self-organizing maps: Ordering, convergence properties and energy functions. Biol. Cyb., 67, 47–55.
Fritzke, B. (1993). Vector quantization with a growing and splitting elastic net. In S. Gielen & B. Kappen (Eds.), Proc. ICANN'93, International Conference on Artificial Neural Networks (pp. 580–585). London: Springer.
Galanopoulos, A. S., & Ahalt, S. C. (1996). Codeword distribution for frequency sensitive competitive learning with one dimensional input data. IEEE Transactions on Neural Networks, 7, 752–756.
Grassberger, P., & Procaccia, I. (1983). Measuring the strangeness of strange attractors. Physica D, 9, 189–208.
Hammer, B., & Villmann, Th. (2003). Mathematical aspects of neural networks. In M. Verleysen (Ed.), Proc. of European Symposium on Artificial Neural Networks (ESANN'2003) (pp. 59–72). Brussels: d-side.
Haykin, S. (1994). Neural networks: A comprehensive foundation. New York: IEEE Press.
Herrmann, M., Bauer, H.-U., & Der, R. (1994). The "perceptual magnet" effect: A model based on self-organizing feature maps. In L. S. Smith & P. J. B. Hancock (Eds.), Neural computation and psychology (pp. 107–116). Berlin: Springer-Verlag.
468
T. Villmann and J. Claussen
Herrmann, M., & Villmann, Th. (1997). Vector quantization by optimal neural gas. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicoud (Eds.), Artificial Neural Networks—Proceedings of International Conference on Artificial Neural Networks (ICANN’97) Lausanne (pp. 625–630). Berlin: Springer-Verlag. Hertz, J. A., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley. Heskes, T. (1999). Energy functions for self-organizing maps. In E. Oja & S. Kaski (Eds.), Kohonen maps (pp. 303–316). Amsterdam: Elsevier. Jain, A., & Mer´enyi, E. (2004). Forbidden magnification? I. In M. Verleysen (Ed.), European symposium on artificial neural networks 2004 (pp. 51–56). Brussels: d-side. Kohonen, T. (1991). Self-organizing maps: Optimization approaches. In T. Kohonen, K. M¨akisara, O. Simula, & J. Kangas (Eds.), Artificial neural networks (Vol. 2, pp. 981–990). Amsterdam: North-Holland. Kohonen, T. (1995). Self-organizing maps (2nd ext. ed.). Berlin: Springer. Kohonen, T. (1999). Comparison of SOM point densities based on different criteria. Neural Computation, 11, 2081–2085. Kuhl, P. K. (1991). Human adults and human infants show a “perceptual magnet” effect for the prototypes of speech categories, monkeys do not. Perception and Psychophysics, 50, 93–107. Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., & Lindblom, B. (1992). Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255, 606–608. Liebert, W. (1991). Chaos und Herzdynamik. Frankfurt: Verlag Harri Deutsch. Linde, Y., Buzo, A., & Gray, R. M. (1980). An algorithm for vector quantizer design. IEEE Transactions on Communications, 28, 84–95. Linsker, R. (1989). How to generate maps by maximizing the mutual information between input and output signals. Neural Computation, 1, 402– 411. Luttrell, S. P. (1991). Code vector density in topographic mappings: Scalar case. IEEE Trans. on Neural Networks, 2, 427–436. 
Martinetz, Th. M., Berkovich, S. G., & Schulten, K. J. (1993). “Neural-gas” network for vector quantization and its application to time-series prediction. IEEE Trans. on Neural Networks, 4, 558–569. Martinetz, Th., & Schulten, K. (1994). Topology representing networks. Neural Networks, 7, 507–522. Mer´enyi, E., & Jain, A. (2004). Forbidden magnification? II. In M. Verleysen (Ed.), European Symposium on Artificial Neural Networks 2004 (pp. 57–62). Brussels: d-side. Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press. Ritter, H. (1989). Asymptotic level density for a class of vector quantization processes (Rep. A9). Espoo, Finland: Helsinki University of Technology, Laboratory of Computer and Information Science. Ritter, H. (1991). Asymptotic level density for a class of vector quantization processes. IEEE Trans. on Neural Networks, 2, 173–175. Ritter, H., & Schulten, K. (1986). On the stationary state of Kohonen’s self-organizing sensory mapping. Biol. Cyb., 54, 99–106.
Magnification Control in Self-Organizing Maps and Neural Gas
469
Ritter, H., Martinetz, Th., & Schulten, K. (1992). Neural computation and self-organizing maps: An introduction. Reading, MA: Addison-Wesley. Takens, F. (1985). On the numerical determination of the dimension of an attractor. In B. Braaksma, H. Broer, & F. Takens (Eds.), Dynamical systems and bifurcations (pp. 99–106). Berlin: Springer-Verlag. Theiler, J. (1990). Statistical precision of dimension estimators. Physical Review A, 41, 3038–3051. van Hulle, M. M. (2000). Faithful representations and topographic maps from distortionto information-based self-organization. Hoboken, N.J.: Wiley. Villmann, Th. (1999). Benefits and limits of the self-organizing map and its variants in the area of satellite remote sensoring processing. In Proc. of European Symposium on Artificial Neural Networks (ESANN’99) (pp. 111–116). Brussels: D facto. publications. Villmann, Th. (2000). Controlling strategies for the magnification factor in the neural gas network. Neural Network World, 10, 739–750. Villmann, Th. (2002). Neural maps for faithful data modelling in medicine—state of the art and exemplary applications. Neurocomputing, 48, 229–250. Villmann, Th., & Heinze, A. (2000). Application of magnification control for the neural gas network in a sensorimotor architecture for robot navigation. In H. M. ¨ Groß, K. Debes, & H.-J. Bohme (Eds.), Proceedings of Selbstorganisation Von Adap¨ tivem Verfahren (SOAVE’2000) Ilmenau (pp. 125–134). Dusseldorf: VDI-Verlag. Villmann, Th., Hermann, W., & Geyer, M. (2000). Variants of self-organizing maps for data mining and data visualization in medicine. Neural Network World, 10, 751–762. Villmann, Th., Mer´enyi, E., & Hammer, B. (2003). Neural maps in remote sensing image analysis. Neural Networks, 16, 389–403. Willshaw, D. J., & von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London, Series B, 194, 431–445. Zador, P. L. (1982). 
Asymptotic quantization error of continuous signals and the quantization dimension. IEEE Transaction on Information Theory, 28, 149–159. Zheng, Y., & Greenleaf, J. F. (1996). The effect of concave and convex weight adjustments on self-organizing maps. IEEE Transactions on Neural Networks, 7, 87–96.
Received June 15, 2004; accepted June 15, 2005.
LETTER
Communicated by Hui Wang
Enhancing Density-Based Data Reduction Using Entropy D. Huang
[email protected]
Tommy W. S. Chow
[email protected] Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong
Data reduction algorithms determine a small data subset from a given large data set. In this article, new types of data reduction criteria, based on the concept of entropy, are first presented. These criteria can evaluate the data reduction performance in a sophisticated and comprehensive way. As a result, new data reduction procedures are developed. Using the newly introduced criteria, the proposed data reduction scheme is shown to be efficient and effective. In addition, an outlier-filtering strategy, which is computationally insignificant, is developed. In some instances, this strategy can substantially improve the performance of supervised data analysis. The proposed procedures are compared with related techniques in two types of application: density estimation and classification. Extensive comparative results are included to corroborate the contributions of the proposed algorithms. 1 Introduction As computer technology grows at an unprecedented pace, the size of data sets increases to the extent that data analysis has become cumbersome. Theoretically, using more data samples generally leads to improved data analysis (Provost & Kolluri, 1999; Friedman, 1997). But directly mining a data set in the gigabytes is a formidable, or even impossible, task because of the computational burden. In order to handle this problem, data reduction techniques have been studied (Blum & Langley, 1997; Catlett, 1991; Hart, 1968; Mitra, Murthy, & Pal 2002; Wilson & Martinez, 2000). Data reduction algorithms have been designed to reduce a huge data set to a small representative, but informative, pattern set on which data analysis can be performed. The belief is that data reduction introduces no or only minimal negative effect on data analysis. The simplest methods of data reduction are to sample data patterns in a random or a stratified way. These can be easily implemented and have negligible computational burden, so they are widely used as evaluation baseline. 
They are, however, unable to guarantee stable performance because
the randomness adopted in these methods may cause a loss of useful information (Catlett, 1991). A number of more sophisticated data reduction techniques have also been developed. These techniques can be categorized into two main groups: classification-based methods and distribution-based methods (Mitra et al., 2002; Blum & Langley, 1997). The former identify the patterns that are informative for constructing a classification model; the latter determine the patterns so that the original data distribution is preserved as much as possible. Classification-based methods assume that not all patterns are equally important for certain classification learning algorithms. By rejecting the "useless" patterns, this type of method can improve the scalability and the final results of a classification process (Wilson & Martinez, 2000). The condensed nearest neighbor rule (CNN) (Hart, 1968) was one of the first classification-based methods developed. Subsequently, many k-nearest-neighbor (kNN) rule-based data reduction schemes, such as PNN (Gates, 1972) and RNN (Chang, 1974), were proposed (Bezdek & Kuncheva, 2001; Dasarathy, 1991). Other types of classifiers, for instance neural networks and decision trees, have also been explored for data reduction (Plutowski & White, 1993; Quinlan, 1983; Schapire, 1990). Generally, a classification-based data reduction method works with a specified classification model: the reduced data set is iteratively adjusted according to classification accuracy, and these steps repeat until the accuracy cannot be further improved. Because these methods may distort the original data distribution, their results do not support other classification models or other pattern recognition tasks. Also, classification-based methods are likely to fail by confusing outliers with genuinely useful samples, because outliers always have relatively high uncertainty (Roy & McCallum, 2001).
(For discussion on classification-based data reduction, see Provost & Kolluri, 1999; Blum & Langley, 1997; Wilson & Martinez, 2000.) The distribution-based data reduction method (Mitra et al., 2002; Gray, 1984; Kohonen, 2001; Astrahan, 1970) determines a representative pattern set, that is, the reduced data set, so that the data distribution is preserved as much as possible. This is a versatile approach that can support various pattern recognition models tackling various pattern recognition tasks. Vector quantization error (VQE) is a popular criterion employed by different distribution-based data reduction schemes (Gersho & Gray, 1992; Kohonen, 2001; Chow & Wu, 2004). VQE measures the distance between each pattern and its corresponding representative; a small VQE implies a good data reduction result. By optimizing VQE, a representative data set, which is a reduced data set and is referred to as a codebook in the context of clustering, can be obtained. A self-organizing map (SOM) (Kohonen, 2001; Chow & Wu, 2004) is a typical descent-based algorithm designed for minimizing VQE. In SOM, the learning parameters are problem dependent, and the determination of these parameters has never been a straightforward issue. Another type of distribution-based method uses density estimation
(Mitra et al., 2002; Astrahan, 1970). In this approach, the main difficulty lies in the strategy for estimating the underlying probability density. The maximum likelihood learning algorithm is employed for density estimation by Yang and Zwolinski (2001), although it may not be efficient with a data set with relatively complex data distribution. Alternatively, a much more efficient strategy has been employed to analyze probability density (Astrahan, 1970; Mitra et al., 2002). The rationale behind this strategy is simple. For a pattern x, the density of it must be inversely related to the distance between x and its kth nearest neighbor. Thus, Mitra et al. (2002) and Astrahan (1970) analyze the density of each pattern after calculating the distances between the patterns. Without involving a learning process, this strategy is rather efficient, but it still requires calculating the distances between all possible pairs of patterns. This is computationally and memory demanding when one is handling a large amount of patterns. The problem of outliers is also significant because they may degrade the performance of supervised pattern recognition (Han & Kamber, 2001). But the filtering of outliers is overlooked in most studies. In this article, we focus on density-based data reduction. New types of data reduction criteria are introduced, based on the concept of entropy. They are thus called representative entropy (RE)/weighted representative entropy (WRE). In our proposed RE and WRE, all the relationships between a data point and all its representatives are fully considered. This makes RE and WRE more sophisticated than VQE, which considers only a single relationship between a point and one of its representatives. This issue is elaborated in a later section. We describe the design of a new data reduction algorithm using RE or WRE. 
The proposed algorithm, which begins with a randomly selected data set R0 , includes two sequential stages: a forward searching stage and an RE/WRE–based step-wise searching stage. Compared with other density-based methods (Astrahan, 1970; Mitra et al., 2002), the proposed method exhibits two main advantages. First, it is computationally efficient because the exhaustive task of calculating the distances between all possible pattern pairs is avoided. More important, because of the characteristics of RE/WRE, the improved efficiency has not traded off the quality of data reduction. Second, we propose an outlier-filtering strategy that can be implemented in a simple but efficient way. This scheme is particularly useful in handling classification problems. In the next section, the concept of entropy is briefly presented. The proposed data reduction criteria are introduced and evaluated in section 3, the proposed data reduction method is detailed in section 4, and experimental results are presented and discussed in section 5. 2 Preliminaries on Entropy The concept of entropy, which originated with thermodynamics, has been extended to statistical mechanics and information theory (Cover & Thomas,
1991). For a discrete distribution modeled by X = {x_1, x_2, ..., x_N}, entropy measures the "information" conveyed by X. "Information" means the uncertainty, or the degree of surprise, associated with a particular value of X being drawn. Suppose that x is a value drawn from X and that the event x = x_k occurs with probability p_k; the probabilities sum to 1, that is, $\sum_{k=1}^{N} p_k = 1$. In the case of p_k = 1, there is no uncertainty or surprise for x = x_k. A lower value of p_k increases the uncertainty, or the "information," conveyed when it is known that x = x_k occurs. Thus, this "information" is generally measured by $I(x_k) = -\log(p_k)$. The "information" contained by the whole event set X is called entropy and is given by the expected value of $-\log(p_k)$, that is,
$$H(X) = E[I(x_k)] = -\sum_{i=1}^{N} p_i \log p_i. \qquad (2.1)$$
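As a quick illustration of equation 2.1, the following sketch (standard Python only; the helper name `entropy` is ours) computes H(X) for a discrete distribution and checks the two limiting cases discussed in the text:

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum_i p_i log(p_i), per equation 2.1.
    Natural log is used; terms with p_i = 0 contribute nothing,
    by the convention 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A uniform distribution over N = 4 outcomes maximizes the entropy: H = log(4).
uniform = [0.25, 0.25, 0.25, 0.25]
assert abs(entropy(uniform) - math.log(4)) < 1e-12

# A degenerate distribution (one p_k = 1) carries no uncertainty: H = 0.
assert entropy([1.0, 0.0, 0.0, 0.0]) == 0.0
```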
A large value of entropy H(X) indicates high uncertainty about X. When all the probabilities (i.e., p_k for all k) are equal, there is maximal uncertainty about which value of X is taken, and the entropy H(X) achieves its maximum, log(N). Conversely, when all the p_i except one are 0, there is no uncertainty about X, that is, H(X) = 0.

3 Data Reduction Criteria

3.1 Prior Work–Vector Quantization Error. Vector quantization, first developed in the signal processing community, is a technique that exploits the underlying data distribution to compress data (Gersho & Gray, 1992). This technique partitions a given data domain into a number of distinct regions and then generates a representative point in each region. Suppose that R = {r_1, r_2, ..., r_K} is a representative set, that is, a codebook, of X. The VQE is defined as

$$\mathrm{VQE}(R;X) = \frac{1}{N}\sum_{x\in X} \mathrm{VQE}(R;x) = \frac{1}{N}\sum_{x\in X} \min_{j} d(r_j, x)^2, \qquad (3.1)$$
where d(r_j, x) is the distance (dissimilarity) between a pattern x and a representative r_j. Intuitively, VQE measures how well R approximates X. To date, VQE has been the most popular objective function for data clustering and data reduction. Equation 3.1 shows that, for each pattern, VQE considers the relationship of that pattern with only one representative, the one closest to it; VQE explicitly ignores the relationships with all other representatives.
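A minimal sketch of equation 3.1 (the function name `vqe` is ours, not from the article) makes this point concrete: only the squared distance to the closest representative enters the sum.

```python
def vqe(reps, X):
    """Vector quantization error, equation 3.1:
    VQE(R; X) = (1/N) * sum over x in X of min_j d(r_j, x)^2."""
    total = 0.0
    for x in X:
        # Only the closest representative contributes; all others are ignored.
        total += min(sum((xi - ri) ** 2 for xi, ri in zip(x, r)) for r in reps)
    return total / len(X)

X = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
R = [(0.5, 0.0), (10.0, 10.0)]
# Squared distances to the nearest representative are 0.25, 0.25, and 0.0.
assert abs(vqe(R, X) - 0.5 / 3) < 1e-12
```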
3.2 Entropy-Based Data Reduction Criteria: Representative Entropy and Weighted Representative Entropy

3.2.1 Definitions of RE and WRE. Assume that R is the result of a data reduction process. The probability of x (x ∈ X) being represented by a representative r_j (r_j ∈ R) is p(r_j|x), and the probability of x being represented by the whole set of representatives is 1, that is, $\sum_{j=1}^{K} p(r_j|x) = 1$. Ideally, each pattern in X is close to one and only one representative. In terms of probability, it is expected that p(r_i|x) is zero for all i except one. The more uneven the distribution of the representation information of R over X, the better the representative set R. This motivates us to explore the concept of entropy for evaluating the quality of R. The proposed criterion, called the representative entropy for data reduction (RE), is defined by

$$\mathrm{RE}(R;x) = \frac{\sum_{j=1}^{K} -p(r_j|x)\log(p(r_j|x))}{\log(K)}.$$

For a pattern x, a smaller value of RE(R;x) indicates that x is more likely to be represented by only one object in R; that is, a small RE(R;x) indicates a good representation performance at x. For the whole set X, the representation entropy of R is

$$\mathrm{RE}(R;X) = \frac{1}{N}\sum_{x\in X} \mathrm{RE}(R;x) = \frac{1}{N}\sum_{x\in X} \frac{\sum_{j=1}^{K} -p(r_j|x)\log p(r_j|x)}{\log(K)}. \qquad (3.2)$$

The denominator log(K) is used for normalization, which limits RE to the range [0, 1] and makes it reasonable to compare representative sets of different sizes. RE can be rewritten as

$$\mathrm{RE}(R;X) = \frac{1}{\log(K)}\sum_{j=1}^{K} \underbrace{\frac{1}{N}\sum_{x\in X} -p(r_j|x)\log(p(r_j|x))}_{\mathrm{RE}(r_j,\,X)} = \frac{1}{\log(K)}\sum_{j=1}^{K} \mathrm{RE}(r_j, X). \qquad (3.3)$$
RE(r j , X) in equation 3.3 evaluates the representation ability of a single representative r j . Suppose that r j covers or represents part of the original data space Aj . According to equation 3.3, RE(r j , X) will achieve the minimum only when p(r j |x) = 0 or 1 for all x ∈ X; that is, Aj has no overlap with the areas covered by other representatives. Under this condition, r j can be
considered a respectable representative. As the overlap of A_j with other areas increases, the representational ability of r_j decreases, whereas the value of RE(r_j, X) increases. Clearly, the value of RE(r_j, X) reflects the representational ability of r_j; for instance, a low value of RE(r_j, X) corresponds to a good r_j. This characteristic plays an important role in the proposed data reduction scheme, which is detailed in section 4.2.

To evaluate r_j, RE considers A_j, the data area around r_j. It is reasonable to assume that the patterns lying in A_j make different contributions to this evaluation task. Consider x_1 and x_2 illustrated in Figure 1: x_1 must be more important than x_2 with respect to r_1, whereas x_2 is more influential than x_1 for r_3. RE(R;X) (see equation 3.3), however, does not incorporate this idea. Thus, RE(R;X) is modified with a weighting operation. The weighted RE (WRE) is defined as

$$\mathrm{WRE}(R;X) = \frac{1}{\log(K)}\sum_{j=1}^{K} \mathrm{WRE}(r_j, X) = \frac{1}{\log(K)}\frac{1}{N}\sum_{x\in X}\sum_{j=1}^{K} \underbrace{(-p(r_j|x)\log(p(r_j|x)))}_{A} \cdot \underbrace{p(x|r_j)}_{B}. \qquad (3.4)$$

Part A of WRE, inherited from RE, measures the distribution of representational information, while part B, measuring the relationship of a data pattern to a representative, is a weighting factor. With part B, when a representative is evaluated, patterns close to that representative have a greater effect than those far from it. In section 5, the effect of this weighting operation is evaluated.

3.2.2 Calculation of RE and WRE. To calculate RE or WRE, p(r_j|x) (j = 1, ..., K) must be known. Popular schemes for estimating these probabilities include the maximum likelihood algorithm (Yang & Zwolinski, 2001) and Bayesian-based algorithms (Duda, Hart, & Stork, 2001). However, these methods are not computationally simple and efficient enough for an iterative data reduction process, in which the probabilities must be estimated many times. We thus adopt a distance-based probability estimation method. A representative covers the L original patterns nearest to it. That is, for a representative (say, r_j), if x is one of the L patterns nearest to r_j, then p(x|r_j) > 0; otherwise, p(x|r_j) = 0. Based on this idea, we estimate p(x|r_j) by using
$$p(x|r_j) = \begin{cases} 1 - \dfrac{d(x, r_j)}{\mathrm{Radius}(r_j)}, & d(x, r_j) \le \mathrm{Radius}(r_j) \\ 0, & \text{otherwise,} \end{cases}$$
Figure 1: The relation between the representatives and the original data patterns. The original data and the representatives (i.e., the reduced data) are marked by "." and "*," respectively.
where Radius(r_j) is the distance of r_j to the (L + 1)th pattern nearest to it. Also, p(r_j) can be calculated as p(r_j) = 1/K (K is the size of R). According to the Bayes formula (Duda et al., 2001), p(r_j|x) is estimated by

$$p(r_j|x) = \frac{p(x|r_j)\,p(r_j)}{p(x)} = \frac{p(x|r_j)\,p(r_j)}{\sum_{j'=1}^{K} p(x|r_{j'})\,p(r_{j'})} = \frac{p(x|r_j)}{\sum_{j'=1}^{K} p(x|r_{j'})}.$$
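The distance-based estimates and the RE/WRE criteria of this section can be sketched as follows. This is a simplified, self-contained implementation under our own naming; edge cases such as duplicate patterns (zero radius) are not handled.

```python
import math

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def conditional_probs(reps, X, L):
    """p(x|r_j) = 1 - d(x, r_j)/Radius(r_j) inside the radius, else 0,
    where Radius(r_j) is the distance to the (L+1)th-nearest pattern."""
    p_x_given_r = []
    for r in reps:
        d = sorted(dist(x, r) for x in X)
        radius = d[L]  # (L+1)th nearest, at 0-based index L
        p_x_given_r.append([max(0.0, 1.0 - dist(x, r) / radius) for x in X])
    return p_x_given_r

def representative_probs(p_x_given_r, i):
    """Bayes step with p(r_j) = 1/K, which cancels:
    p(r_j|x_i) = p(x_i|r_j) / sum_j' p(x_i|r_j')."""
    col = [p[i] for p in p_x_given_r]
    s = sum(col)
    return [c / s if s > 0 else 0.0 for c in col]

def re_criterion(reps, X, L):
    """RE(R; X), equation 3.2, normalized by log(K)."""
    K, N = len(reps), len(X)
    p_x_given_r = conditional_probs(reps, X, L)
    total = 0.0
    for i in range(N):
        for p in representative_probs(p_x_given_r, i):
            if p > 0:
                total -= p * math.log(p)
    return total / (N * math.log(K))

def wre_criterion(reps, X, L):
    """WRE(R; X), equation 3.4: each entropy term is weighted by p(x|r_j)."""
    K, N = len(reps), len(X)
    p_x_given_r = conditional_probs(reps, X, L)
    total = 0.0
    for i in range(N):
        for j, p in enumerate(representative_probs(p_x_given_r, i)):
            if p > 0:
                total -= p * math.log(p) * p_x_given_r[j][i]
    return total / (N * math.log(K))
```

With two well-separated clusters and one representative per cluster, each pattern is covered by exactly one representative, so RE and WRE are (numerically) zero, as the definitions predict:

```python
X = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
R = [(0, 0), (5, 5)]
assert abs(re_criterion(R, X, 3)) < 1e-9
assert abs(wre_criterion(R, X, 3)) < 1e-9
```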
3.3 Comparison Between the Vector Quantization Error and RE/WRE. Referring to the definitions, the main advantage of RE (see equation 3.3) and WRE (see equation 3.4) over VQE (see equation 3.1) is evident. Let us first detail the goal of data reduction. As illustrated in Figure 1, suppose that d is the distance between a pattern and the representative closest to that pattern, and d′ denotes the distance of that pattern to any representative but the closest one. The goal of data reduction is to decrease all d and to increase all d′ at the same time. Explicitly considering only d, VQE can be minimized by reducing d. In contrast, both d and d′ are included in RE and WRE.

Below, VQE and RE/WRE are briefly evaluated in synthetic scenarios. A data set with 100 data points (say, A) was generated from the normal distribution

$$X \sim N\!\left([0, 0], \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right).$$

Then a representative set having 10 points (say, B) was drawn from A under the constraint that a point in B must be in the disc with the center [0, 0] and
the radius g. Basically, as g increases, this constraint becomes looser, and the likelihood that B represents A well increases accordingly. It is expected that the variation of a data reduction criterion roughly reflects this fact. In this study, g^2 was varied over the range [0.5, 4.5]. The statistical results over 500 independent trials are listed in Table 1a. They show that RE or WRE decreases as g increases, in accord with expectation. In contrast, VQE does not perform well: it changes in the direction opposite to the theoretical expectation.

These data reduction criteria were also compared on data generated from a uniform distribution. One hundred points of A were uniformly distributed in the square area from [−1, −1] to [1, 1]. The results obtained on this uniform distribution are presented in Table 1b. They show that RE or WRE correctly reflects the representation ability of B, whereas VQE does not.

Table 1: Evaluation of Data Reduction Criteria Under Synthetic Circumstances.

a. Results on the Data with a Normal Distribution.

g^2    Vector Quantization Error    RE               WRE
0.5    2.20 ± 1.98                  0.068 ± 0.004    0.075 ± 0.004
1.5    2.68 ± 1.91                  0.051 ± 0.006    0.058 ± 0.007
2.5    2.98 ± 1.95                  0.046 ± 0.007    0.051 ± 0.007
3.5    3.34 ± 2.41                  0.042 ± 0.006    0.047 ± 0.007
4.5    3.27 ± 2.19                  0.040 ± 0.007    0.045 ± 0.008

b. Results on the Data with a Uniform Distribution.

g^2    Vector Quantization Error    RE               WRE
0.5    0.91 ± 0.45                  0.054 ± 0.006    0.060 ± 0.007
0.8    1.01 ± 0.48                  0.044 ± 0.007    0.050 ± 0.007
1      1.13 ± 0.51                  0.042 ± 0.007    0.047 ± 0.008

4 Density-Based Data Reduction Method

4.1 Multiscale Method. The multiscale method (Mitra et al., 2002) is a typical density-based data reduction scheme. In this method, the density of a pattern is analyzed according to the distance of that pattern to its neighbors. All patterns are then ranked in order of density, and with this ranking list, the representatives are recursively determined. Given data X, this method can be briefly stated as follows:

Step 0. Determine the parameter k. This parameter is closely related to the size of the data region covered by a representative.

Step 1. Calculate the distance between all possible pattern pairs in X.
Step 2. Repeat the following operation until no pattern remains in X: for each pattern in X, determine its distance to its kth neighbor. According to these distance values, identify the densest pattern, say x_d, and mark it as a representative. Then draw the disc with center x_d and radius 2rad_d, where rad_d is the distance between x_d and its kth neighbor, and delete the patterns lying in this disc.

In this process, the computational and memory requirement of step 2 is O(N^2), where N is the size of X. This step becomes computationally expensive when a large data set is given.

4.2 RE/WRE–Based Data Reduction Method

4.2.1 Procedure. The proposed method has two sequential stages: a forward searching stage and an RE/WRE–based stepwise searching stage. At the beginning, a set of data points, R0, is randomly drawn from the given data set, and the representative set R is empty. Then the forward search is conducted on R0 to recursively place appropriate representatives into R. This process stops when R0 has been scanned through or when R has reached the desired number of representatives. Following the forward process is an RE/WRE–based stepwise process, in which a pattern is first identified as a representative from the area not yet covered well by R. Once R has reached the desired size, the "worst" representative (i.e., the one having the lowest representation ability) is deleted after each new representative is determined. The "worst" representative is identified according to RE(r_j, X) or WRE(r_j, X). Accordingly, the proposed data reduction method is called REDR or WREDR. For a given data set X containing N patterns, REDR or WREDR can be stated as follows:

Step 0. Randomly select a pattern set R0. Set the representative set R empty. Determine K, the desired size of R. Naturally, a representative will represent L (L = N/K) patterns of X.

Step 1.
This step repeats until R0 is scanned through or R contains K elements. In X, determine the top L patterns nearest to each r_j (r_j ∈ R0). Based on the sum of the distances of r_j to these patterns, the densest element of R0 (say, r_d) is identified and placed into R. The top L patterns nearest to r_d (including r_d itself) are excluded from the rest of the forward searching stage.

Step 2. In X, sort out the patterns having max_{r_j ∈ R} p_outer(x|r_j) > 0. Among these patterns, identify the one with min(max_{r_j ∈ R} p_outer(x|r_j)), and put it
into R. p_outer(x|r_j) is defined by

$$p_{\mathrm{outer}}(x|r_j) = \begin{cases} 1 - \dfrac{d(x, r_j)}{\mathrm{Rad}_{\mathrm{outer}}(r_j)}, & d(x, r_j) \le \mathrm{Rad}_{\mathrm{outer}}(r_j) \\ 0, & \text{otherwise,} \end{cases}$$
where Rad_outer(r_j) is the distance of r_j to the 2Lth pattern nearest to it.

Step 3. When R consists of K + 1 representatives, delete the worst representative, the one having the largest RE(r_j, X) or WRE(r_j, X).

Step 4. Calculate RE(R, X) or WRE(R, X) for the newly constructed R.

Step 5. Repeat steps 2 to 4 until RE(R, X) or WRE(R, X) cannot be reduced for five consecutive iterations.

Below, an example demonstrates the procedure of WREDR. In this example, WREDR identifies 6 representatives from 60 original patterns. Figure 2 illustrates certain intermediate results; the original patterns and representatives are marked by "." and "*", respectively, and the disc around a representative roughly illustrates the region covered by that representative. Figure 2a shows that 6 patterns are selected into R0 during initialization. The representative ability of these 6 patterns is poor because the area at the bottom right is uncovered, and the regions covered by r1 and r5 overlap each other. The forward searching process tackles the problem of overlapping: r1 is marked as a representative, and r5 is eliminated. Also, r6 is eliminated because it is redundant given r2 and r3. In this forward searching stage, qualified representatives are selected from R0, and redundant ones are deleted. However, the area uncovered by R0 (for instance, the bottom right in this example) has not yet been explored. This task is fulfilled in the following WRE-based stepwise searching process, in which the new representatives r5 and r6, illustrated in Figure 2b, are determined consecutively, r5 first and then r6. In this course, the bottom right of the given data space is gradually explored. After r5 and r6 are added, R has size 6, the desired value. Thus, from this point on, the WRE-based process substitutes the "worst" representative with a new one. In this example, r2 is determined as the "worst" and replaced by r_new.
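The forward searching stage (step 1) above can be sketched in simplified, self-contained form. The function name `forward_stage` and the density heuristic written out below are ours; the stepwise stage would then follow steps 2 to 5 analogously.

```python
import math

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def forward_stage(R0, X, K, L):
    """Step 1 sketch: repeatedly pick the densest candidate in R0 (smallest
    sum of distances to its L nearest remaining patterns) and exclude the
    patterns it covers from further forward searching."""
    R, remaining, candidates = [], list(X), list(R0)
    while candidates and len(R) < K:
        # Density proxy of a candidate r: sum of distances to its L nearest patterns.
        def density(r):
            return sum(sorted(dist(x, r) for x in remaining)[:L])
        r_d = min(candidates, key=density)
        R.append(r_d)
        candidates.remove(r_d)
        # Reject the L patterns nearest to r_d from the rest of this stage.
        covered = sorted(remaining, key=lambda x: dist(x, r_d))[:L]
        remaining = [x for x in remaining if x not in covered]
    return R

# Two clusters of four points each; R0 holds two candidates per cluster.
X = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5), (5.1, 5), (5, 5.1), (5.1, 5.1)]
R0 = [(0, 0), (0.1, 0), (5, 5), (5.1, 5)]
R = forward_stage(R0, X, K=2, L=4)
# One representative is kept per cluster; the redundant candidates are dropped.
assert len(R) == 2 and set(R) <= set(R0)
```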
As suggested in Figure 2c, the region covered by r2 has much overlap with the regions covered by other representatives; in this sense, deleting r2 is reasonable. Figure 2c also shows that substituting r2 with r_new improves the representative ability of R: more data patterns are covered, and the overlap between representatives is further reduced.

Figure 2: Demonstration of the proposed data reduction procedure. The original data patterns are marked with "." The representatives are described using "*."

4.2.2 Remarks. In the first stage of REDR or WREDR, the density of the patterns in R0 is analyzed, whereas the density of only one data pattern is required at each iteration in the second stage. Assume that there are k0
patterns selected into R0 and that the second stage runs n_i iterations. The computational complexity of REDR is then O(N(n_i + k_0)), where N is the size of X. Recall that the computational complexity of the multiscale method is O(N^2). Generally, k_0 is much less than N, and the proposed method converges rapidly, as suggested by our experimental results. Thus, we have N(n_i + k_0) ≪ N^2; that is, the proposed methods have a substantially lower computational requirement than the multiscale method. The proposed methods also have a significantly lower memory requirement than the multiscale method because they do not need to store a large number of distance values.

In the above process, the condition max_{r_j ∈ R} p_outer(x|r_j) > 0 in step 2 guarantees a stable data reduction process. The patterns with max_{r_j ∈ R} p_outer(x|r_j) = 0 are still uncovered by R. Without a priori knowledge of the density distribution of these patterns, determining a representative from them is not recommended. With the constraint max_{r_j ∈ R} p_outer(x|r_j) > 0, the newly determined representative must lie around the boundary of the area already covered by R. In this way, the proposed method can gradually and reliably explore the whole data space. This can be seen in the above example, in which r5 and r6 are marked as representatives in turn.

In a supervised task, the proposed methods are conducted in a stratified way. Given a value of L in step 0, which determines the ratio between the sizes of the reduced data and the original data, the pattern set of each class is reduced separately. The collection of these reduced data sets is the final data reduction result. Apart from this stratified approach, a new strategy is developed to filter out outliers. An outlier is an object that behaves in a different way from the others.
For instance, in a classification data set, an outlier generally exhibits a different class label from objects having similar attributes. Theoretically, outliers introduce noise into supervised learning algorithms and degrade their performance (Han & Kamber, 2001), so it is desirable to eliminate them. In our proposed outlier-filtering strategy, a representative candidate is checked before being added to R in step 1 or 2; only candidates that are not outliers are placed into R. To determine whether an object is an outlier, the area around that object is considered. Assume that A_o is the area around a representative candidate r_o. For any class c_k, the conditional probability p(A_o | c_k) can be calculated as

p(A_o | c_k) = (1/N) Σ_{x ∈ X} p(x | r_o) p(x | c_k) = (1/N) Σ_{x ∈ class c_k} p(x | r_o),
482
D. Huang and T. Chow
Table 2: Comparisons Between WREDR and WREDR-FO.

| Number of Outliers | WREDR       | WREDR-FO | No Data Reduction |
|--------------------|-------------|----------|-------------------|
| 10                 | 0.98 ± 0.02 | 1.00 ± 0 | 0.97              |
| 50                 | 0.89 ± 0.05 | 1.00 ± 0 | 0.90              |
| 80                 | 0.81 ± 0.05 | 0.99 ± 0 | 0.83              |
| 120                | 0.79 ± 0.07 | 0.99 ± 0 | 0.78              |

Note: These results highlight the contributions of the outlier-filtering strategy.
where N is the number of patterns in X. The class having the maximum p(A_o | c_k) is the dominant class of A_o. If the class of r_o is consistent with the dominant class of A_o, r_o is not an outlier; otherwise, r_o is an outlier and cannot be included in R. In the proposed method, the computational burden of p(A_o | c_k) is very small because p(x | r_o) has already been estimated during the calculation of RE/WRE. Below, the proposed algorithm together with this outlier-filtering strategy is called REDR-FO or WREDR-FO.

To highlight the benefits of this outlier-filtering strategy, WREDR-FO is compared to WREDR. A classification data set with two classes {0, 1} was generated from two normal distributions:

Class 1 (class = 0, 500 data points): X ~ N([1, 1], 0.3 I_2),
Class 2 (class = 1, 500 data points): X ~ N([−1, −1], 0.3 I_2).
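The dominant-class check described above can be sketched as follows, a minimal illustration under the assumption that p(x | r_o) has already been evaluated for every training pattern, as it is during the RE/WRE computation:

```python
import numpy as np

def is_outlier(candidate_label, p_x_given_ro, labels):
    """Return True if the candidate's label disagrees with the
    dominant class of its surrounding area A_o.

    p_x_given_ro : kernel values p(x | r_o) for all training patterns.
    labels       : class label of each training pattern.
    """
    N = len(labels)
    # p(A_o | c_k) = (1/N) * sum over x in class c_k of p(x | r_o)
    p_area = {c: p_x_given_ro[labels == c].sum() / N
              for c in np.unique(labels)}
    dominant = max(p_area, key=p_area.get)
    return bool(candidate_label != dominant)
```

A candidate is admitted into R only when `is_outlier` returns False.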
The thousand patterns of this data set were evenly split into two groups of equal size for training and testing. Certain patterns in the training data were randomly chosen and mislabeled to generate outliers. Classification accuracy is used to evaluate the quality of a reduced data set: high classification accuracy indicates a good reduced data set. In this section, the kNN rule with k = 1 is used as the evaluation classifier. WREDR and WREDR-FO are required to reduce the original 500 training patterns to 50. As the performance of these data reduction methods may be affected by different initializations, statistical results over 10 independent trials are presented. Table 2 lists the comparative results. As the number of outliers increases, the contribution of the outlier-filtering strategy becomes more significant. It is worth noting that, because of the proposed outlier-filtering strategy, WREDR-FO is able to improve the final classification performance.
5 Experimental Results

Here, reduction ratio denotes the ratio between the sizes of the reduced data and the given data. For a given training data set, a testing data set, and a reduction ratio, a data reduction method is first used to reduce the training data set. Then, based on the reduced data set, density estimation and classification models are built. The employed data reduction method is evaluated by the performance of these models on the testing data set. Throughout our investigation, the following strategies are adopted unless stated otherwise:

1. Each input variable is preprocessed to have zero mean and unit variance through a linear transformation.
2. In WREDR or REDR, the size of R0 is the same as that of R, the final data reduction result.
3. In the sampling-type schemes, SOM, and the proposed methods, performance is affected by initialization. Thus, in each case, these methods are independently run 10 times, and the statistical results over the 10 trials are presented here.
4. Unlike the other methods, Mitra's multiscale method (Mitra et al., 2002) delivers only one set of results in each case, so its results are by no means statistical ones. Also, an exact desired reduction ratio, 0.1 or 0.3, may not be obtainable: trials with different values of k may deliver only a close but not exact reduction ratio. Thus, for a given reduction ratio, we choose the trial in which the actual reduction ratio is closest to that value and present the results of that trial here.
5. Our investigations are conducted using Matlab 6.1 on a Sun Ultra Enterprise 10000 workstation with a 100 MHz clock frequency and 4 GB memory.

5.1 Density Estimation. The study presented in this section was conducted on synthetic data, for which the real density functions are known.
More important, a large number of testing patterns can be generated to guarantee the accuracy of the evaluation results. Five data reduction methods (the random sampling scheme, SOM, the density-based multiscale method, REDR, and WREDR) are compared from the perspectives of efficiency and effectiveness. The running time is recorded for the efficiency evaluation. Effectiveness is measured by the difference between the real density function g(x) and a density estimation function f(x) obtained from the reduced data. A small density difference indicates good reduced data.
In this study, g(x) is known, whereas f(x) is modeled with a Parzen window (Parzen, 1962). Given an M-dimensional pattern set Q = {q_1, q_2, q_3, ..., q_Nq}, the Parzen window estimator is

p(x) = (1/N_q) Σ_{i=1}^{N_q} p(x | q_i) = (1/N_q) Σ_{i=1}^{N_q} κ(x − q_i, h_i),    (5.1)

where κ(·) is the kernel function and h_i is the parameter that determines the width of the window. With proper selection of κ(·) and h_i, the Parzen window estimator can converge to the real probability density (Parzen, 1962). In this study, κ is a gaussian function. The M-dimensional gaussian function is

κ(x − q_i, h_i) = G(x − q_i, h_i) = (2π h_i^2)^{−M/2} exp(−(x − q_i)(x − q_i)^T / (2 h_i^2)).
For the pattern q_i, the window width h_i is determined as h_i = 2 d(q_i, q_j), where d(q_i, q_j) = √((q_i − q_j)(q_i − q_j)^T) is the Euclidean distance and q_j is the kth nearest pattern to q_i. We tried two settings, k = 2 and k = 3; the results indicate that the latter performs better, so we set k = 3.

The difference between g(x), the real density function, and f(x), the density estimation function built on reduced data using equation 5.1, is measured with two indices, the absolute distance D_ab and the Kullback-Leibler distance (divergence) D_KL, defined respectively as

D_ab(f(x), g(x)) = ∫_x |f(x) − g(x)| dx    and    D_KL(f(x), g(x)) = ∫_x f(x) log(f(x)/g(x)) dx.

The integrals in the above equations are calculated numerically. After a large set of patterns TX is evenly sampled in the given data space, D_ab and D_KL are approximated on TX as

D_ab(f(x), g(x)) ≈ Σ_{tx_i ∈ TX} |f(tx_i) − g(tx_i)| Δtx_i,
D_KL(f(x), g(x)) ≈ Σ_{tx_i ∈ TX} f(tx_i) log(f(tx_i)/g(tx_i)) Δtx_i,

where Δtx_i is the volume of the grid cell associated with tx_i.
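The estimator and the two error indices can be sketched as follows. This is a minimal sketch; the variable names are ours, and the grid cell volume Δtx_i is assumed constant over an even grid.

```python
import numpy as np

def gaussian_window(x, q, h):
    """M-dimensional isotropic gaussian kernel G(x - q, h)."""
    M = len(q)
    sq_dist = np.sum((x - q) ** 2)
    return np.exp(-sq_dist / (2 * h ** 2)) / (2 * np.pi * h ** 2) ** (M / 2)

def parzen_estimate(x, Q, H):
    """Parzen window density estimate (equation 5.1) from the
    reduced set Q with per-pattern window widths H."""
    return np.mean([gaussian_window(x, q, h) for q, h in zip(Q, H)])

def density_errors(f_vals, g_vals, cell_volume):
    """Numerical D_ab and D_KL over an evenly sampled grid TX."""
    f_vals, g_vals = np.asarray(f_vals), np.asarray(g_vals)
    d_ab = np.sum(np.abs(f_vals - g_vals)) * cell_volume
    mask = (f_vals > 0) & (g_vals > 0)   # avoid log(0) at empty cells
    d_kl = np.sum(f_vals[mask] * np.log(f_vals[mask] / g_vals[mask])) * cell_volume
    return d_ab, d_kl
```

Here `f_vals` and `g_vals` are the estimated and true densities evaluated at every grid pattern tx_i.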
Table 3: Data Sets Used in Density Estimation Application.

| Name  | Distribution of Data | TX |
|-------|----------------------|----|
| Data1 | 600 from N([0, 0], I_2) | 1681 patterns in [−3, −3] ~ [3, 3] |
| Data2 | 800 from N([0, 0], 0.5 I_2), 800 from N([1, 1], 0.3 I_2), 800 from N([−1, −1], 0.3 I_2) | 1681 patterns in [−3, −3] ~ [3, 3] |
| Data3 | 500 each from N([0, 0], 0.5 I_2), N([1, 1], 0.3 I_2), N([−1, −1], 0.3 I_2), N([−1, 1], 0.3 I_2), N([1, −1], 0.3 I_2) | 1681 patterns in [−3, −3] ~ [3, 3] |
| Data4 | 800 from N(0, 0.5), 800 from N(−0.5, 1), 800 from N(1, 0.3) | 1201 patterns in [−3] ~ [3] |
In order to guarantee a precise approximation, the range for sampling TX is chosen so that it covers virtually the whole data region where the probability is more than zero. That is, for a given probability function g(x), TX covers almost all areas where g(x) is not zero, in which case 1 = ∫_x g(x) dx ≈ Σ_{tx_i ∈ TX} g(tx_i) Δtx_i. The data sets used in this section are shown in Table 3. These data sets were all generated in low-dimensional domains because high dimensionality is well known to reduce the reliability of the Parzen window estimator.

First, the methods are compared in terms of D_ab and D_KL. The comparative results are presented in Figure 3, with a reduction ratio of 0.05, and Figure 4, with a reduction ratio of 0.1. The results on different examples and with different reduction ratios clearly lead to similar conclusions. From the perspective of the quality of the data reduction results, the density-based methods (the multiscale method, REDR, and WREDR) deliver similar performance, and they outperform SOM and the random sampling scheme. In Table 4, the data reduction methods are compared in terms of computational efficiency. Both REDR and WREDR are more efficient than the multiscale method because they avoid the exhaustive computation of pattern-pair distances.

To sum up, among the compared methods, REDR and WREDR are clearly the best: they deliver the best or nearly the best data reduction results with greater computational efficiency. Besides, the very small deviations illustrated in Figures 3 and 4 indicate that initialization has little effect on the performance of either REDR or WREDR. It can also be noted that WREDR outperforms REDR in most cases, which is clearly due to the weighting strategy of WRE.
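For the two-dimensional data sets in Table 3, the evaluation grid TX and the normalization check above can be reproduced as follows (a sketch: 41 points per axis gives the 1681 grid patterns over [−3, 3] × [−3, 3]):

```python
import numpy as np

# Even grid TX over [-3, 3]^2: 41 x 41 = 1681 patterns
axis = np.linspace(-3, 3, 41)
gx, gy = np.meshgrid(axis, axis)
TX = np.column_stack([gx.ravel(), gy.ravel()])
cell_volume = (axis[1] - axis[0]) ** 2   # constant cell volume (Delta tx_i)

# sanity check with g(x) = N([0, 0], I_2): sum of g(tx_i) * Delta tx_i is close to 1
g = np.exp(-np.sum(TX ** 2, axis=1) / 2) / (2 * np.pi)
print(TX.shape, np.sum(g) * cell_volume)
```

The small shortfall from 1 is the probability mass outside [−3, 3]^2, which is negligible for the distributions of Table 3.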
Furthermore, REDR and WREDR are compared through t-tests, in which the p-values reflect the significance of the difference between the results of REDR and
Figure 3: Comparisons of effectiveness in terms of D_ab and D_KL for reduction ratio = 0.05. (a) Results on Data1. (b) Results on Data2. (c) Results on Data3. (d) Results on Data4. In each panel, from left to right, the bars represent the results of the random sampling scheme, SOM, the multiscale method, REDR, and WREDR, respectively.
Figure 4: Comparisons of effectiveness in terms of D_ab and D_KL for a reduction ratio = 0.1. (a) Results on Data1. (b) Results on Data2. (c) Results on Data3. (d) Results on Data4. In each panel, from left to right, the bars represent the results of the random sampling scheme, SOM, the multiscale method, REDR, and WREDR, respectively.
Table 4: Comparisons in Terms of Running Time (in seconds).

| Name of Data Set | SOM | Multiscale Method | REDR | WREDR |
|------------------|-----|-------------------|------|-------|
| Reduction ratio = 0.05 | | | | |
| Data1 | 1  | 2   | 1   | 1   |
| Data2 | 15 | 271 | 59  | 64  |
| Data3 | 27 | 452 | 79  | 82  |
| Data4 | 7  | 177 | 21  | 24  |
| Reduction ratio = 0.1 | | | | |
| Data1 | 2  | 5   | 3   | 3   |
| Data2 | 19 | 364 | 125 | 109 |
| Data3 | 23 | 520 | 147 | 152 |
| Data4 | 13 | 209 | 26  | 30  |
Table 5: Comparisons Between WREDR and REDR in Terms of D_ab.

Reduction ratio = 0.05:

| Name of Data Set | Average of REDR | Average of WREDR | p-Value |
|------------------|-----------------|------------------|---------|
| Data1 | 0.235 | 0.235 | 0.61 |
| Data2 | 0.193 | 0.184 | 0.35 |
| Data3 | 0.142 | 0.138 | 0.16 |
| Data4 | 0.195 | 0.184 | 0.34 |

Reduction ratio = 0.1:

| Name of Data Set | Average of REDR | Average of WREDR | p-Value |
|------------------|-----------------|------------------|---------|
| Data1 | 0.161 | 0.161 | 0.89 |
| Data2 | 0.280 | 0.273 | 0.34 |
| Data3 | 0.128 | 0.130 | 0.76 |
| Data4 | 0.305 | 0.299 | 0.45 |
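The p-values of this comparison come from two-sample t-tests on the 10 independent trials of each method. A pooled two-sample t statistic can be sketched as below; the p-value would then be read from the t distribution with n_a + n_b − 2 degrees of freedom (e.g., via scipy.stats). The inputs are placeholders for per-trial D_ab values.

```python
import numpy as np

def two_sample_t(a, b):
    """Pooled two-sample t statistic for comparing the per-trial
    D_ab values of two data reduction methods."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / na + 1 / nb))
```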
WREDR. A small p-value indicates a significant difference. Table 5 presents the comparative results of REDR and WREDR. These results show that the advantage of WRE over RE becomes significant as the reduction ratio decreases. This advantage also basically increases as one moves from a simple distribution, such as Data1, to a relatively complex one, such as Data3.

D_KL and D_ab are a straightforward and accurate way to measure the representation ability of a reduced data set. Using them as references, we evaluate the reliability of the proposed criteria RE and WRE. The values of RE (WRE), D_KL, and D_ab in each iteration of the second stage of REDR (WREDR) are recorded, and the variations of D_KL and D_ab are compared with those of RE (WRE). Figures 5 and 6 illustrate typical results on two data sets. It can be seen that RE and WRE vary in a fashion similar to D_KL and D_ab. We can thus assert that RE and WRE are reliable measures of the representation ability of a reduced data set.

5.2 Classification. In this section, five data reduction methods are compared: the stratified sampling scheme, the supervised SOM, the multiscale method, WREDR, and WREDR-FO. The results of the RE-based method
Figure 5: Typical comparison of the variation of RE/WRE and the density errors (i.e., D_ab and D_KL). These values are obtained in the second stage of REDR/WREDR on Data3 with reduction ratio = 0.05. For clear illustration, the values of WRE are multiplied by 100. Both RE and WRE are shown to vary in a fashion similar to the density errors, verifying that RE and WRE are reliable for evaluating data reduction effectiveness.
are not presented here because they are similar to those of the WRE-based ones. The stratified sampling scheme and the supervised SOM treat a classification data set in the same way as WREDR and WREDR-FO: with a predetermined reduction ratio, the pattern subsets of the different classes are reduced separately, and the final data reduction result is the collection of the results on all the classes. The six data sets used in this section are described in Table 6. The synthetic data, detailed in section 4.2.2, contain 80 outliers. To evaluate a reduced data set, several popular classifiers are built on it, and the reduced set is evaluated according to the results of these classifiers on the testing data. A high classification result indicates a good reduced data set. The classifiers used are the kNN
Figure 6: Typical comparison of the variation of RE/WRE and the density errors (i.e., D_ab and D_KL). These results are obtained in the second stage of REDR/WREDR on Data3 with a reduction ratio = 0.05.
rule with k = 1 and the multilayer perceptron (MLP) (Haykin, 1999). The MLP is provided by the Netlab toolbox (http://www.ncrg.aston.ac.uk/netlab). Throughout this investigation, six hidden neurons are used, and the number of output neurons is set to the number of classes so that each class corresponds to one output neuron. Also, in each example, classification models are constructed on the entire training data set; the results of those models on the testing data are presented in Table 7.

Two reduction ratios, 0.1 and 0.3, are investigated in this study. Table 8 lists the comparative results. In the image segmentation example, the maximal reduction ratio of the multiscale method is about 0.17, which is much less than 0.3. Thus, in this example, there is no result of the
Table 6: Data Sets Used in Classification Application.

| Name of Data Set | Training Samples | Testing Samples | Features | Classes |
|------------------|------------------|-----------------|----------|---------|
| Synthetic Data | 500 | 500 | 2 | 2 |
| MUSK | 3000 | 3598 | 166 | 2 |
| Pima Indian Diabetes | 500 | 268 | 8 | 2 |
| Spambase | 2000 | 2601 | 58 | 2 |
| Statlog Image Segmentation | 4435 | 2000 | 36 | 6 |
| Forest Covertype (a) | 50,000 | 35,871 | 54 | 5 |

(a) The original Forest Covertype data set has seven classes and more than 580,000 patterns. Under our computer environment, it is very hard to tackle the whole data set; thus, the patterns belonging to class 1 and class 2 are omitted in our study.
Table 7: Classification Accuracy of the Models Built on the Training Data Sets.

| Name of Data Set | kNN | MLP |
|------------------|------|------|
| Synthetic Data | 0.83 | 0.99 |
| MUSK | 0.94 | 0.97 |
| Pima | 0.69 | 0.73 |
| Spambase | 0.90 | 0.92 |
| Image Segmentation | 0.89 | 0.85 |
| Forest Covertype | 0.95 | 0.85 |
multiscale method for a reduction ratio of 0.3. The presented results clearly indicate the advantage of WREDR-FO, which is due to the contributions of the criterion WRE and the outlier-filtering strategy. Also, referring to the results presented in Table 7, WREDR-FO has little effect in reducing classification accuracy. In the synthetic classification and Pima Indian Diabetes examples, WREDR-FO even enhances the final classification performance, in contrast to the general argument that a reduced data set corresponds to degraded classification results (Provost & Kolluri, 1999). Clearly, the proposed outlier-filtering strategy can compensate, to a certain degree, for the classification loss caused by data reduction. In addition, WREDR-FO provides more classification enhancement to kNN than to MLP, mainly because kNN may be more sensitive to noise than MLP (Khotanzad & Lu, 1990).

In Table 9, the different methods are compared in terms of running time. WREDR and WREDR-FO are shown to be much more efficient than the multiscale method. Also, the computational effort required by the proposed outlier-filtering strategy is insignificant, since WREDR-FO is almost as efficient as WREDR.
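The evaluation loop used throughout this section, scoring a reduced set R by the accuracy of the kNN rule with k = 1 on held-out data, can be sketched as:

```python
import numpy as np

def knn1_accuracy(R, R_labels, X_test, y_test):
    """Fraction of test patterns whose nearest representative in R
    carries the correct class label (kNN rule with k = 1)."""
    correct = 0
    for x, y in zip(X_test, y_test):
        dists = np.sum((R - x) ** 2, axis=1)   # squared Euclidean distances
        if R_labels[np.argmin(dists)] == y:
            correct += 1
    return correct / len(y_test)
```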
Table 8: Comparisons in Terms of Classification Accuracy.

Reduction ratio = 0.1 (each cell: kNN / MLP accuracy):

| Name of Data Set | Stratified Sampling | Supervised SOM | Multiscale Method | WREDR | WREDR-FO |
|---|---|---|---|---|---|
| Synthetic Data | 0.83 ± 0.07 / 0.93 ± 0.06 | 0.92 ± 0.07 / 0.89 ± 0.05 | 0.86 / 0.93 | 0.91 ± 0.00 / 0.98 ± 0.00 | 0.99 ± 0.02 / 1.00 ± 0.00 |
| Musk | 0.89 ± 0.02 / 0.90 ± 0.02 | 0.94 ± 0.01 / 0.85 ± 0.00 | 0.92 / 0.93 | 0.91 ± 0.01 / 0.92 ± 0.01 | 0.94 ± 0.01 / 0.93 ± 0.01 |
| Pima | 0.66 ± 0.03 / 0.68 ± 0.03 | 0.70 ± 0.02 / 0.71 ± 0.03 | 0.68 / 0.73 | 0.68 ± 0.02 / 0.71 ± 0.03 | 0.72 ± 0.02 / 0.75 ± 0.02 |
| Spambase | 0.82 ± 0.01 / 0.88 ± 0.02 | 0.83 ± 0.01 / 0.84 ± 0.01 | 0.83 / 0.88 | 0.84 ± 0.01 / 0.89 ± 0.01 | 0.86 ± 0.00 / 0.90 ± 0.01 |
| Image segmentation | 0.86 ± 0.01 / 0.81 ± 0.02 | 0.86 ± 0.01 / 0.81 ± 0.01 | 0.87 / 0.82 | 0.86 ± 0.01 / 0.81 ± 0.01 | 0.87 ± 0.01 / 0.84 ± 0.01 |
| Forest covertype | 0.80 ± 0.02 / 0.54 ± 0.02 | 0.82 ± 0.02 / 0.66 ± 0.02 | 0.88 / 0.78 | 0.90 ± 0.00 / 0.77 ± 0.00 | 0.91 ± 0.01 / 0.80 ± 0.00 |

Reduction ratio = 0.3:

| Name of Data Set | Stratified Sampling | Supervised SOM | Multiscale Method | WREDR | WREDR-FO |
|---|---|---|---|---|---|
| Synthetic Data | 0.82 ± 0.04 / 0.98 ± 0.03 | 0.91 ± 0.02 / 0.93 ± 0.01 | 0.80 / 0.98 | 0.91 ± 0.02 / 0.99 ± 0 | 0.98 ± 0.01 / 1.00 ± 0 |
| Musk | 0.92 ± 0.01 / 0.94 ± 0.02 | 0.94 ± 0.00 / 0.86 ± 0.01 | 0.93 / 0.94 | 0.93 ± 0.00 / 0.94 ± 0.02 | 0.94 ± 0.00 / 0.95 ± 0.01 |
| Pima | 0.67 ± 0.04 / 0.67 ± 0.04 | 0.70 ± 0.03 / 0.65 ± 0 | 0.67 / 0.68 | 0.69 ± 0.02 / 0.69 ± 0.02 | 0.72 ± 0.01 / 0.73 ± 0.02 |
| Spambase | 0.84 ± 0.01 / 0.89 ± 0.01 | 0.85 ± 0.01 / 0.84 ± 0.00 | 0.87 / 0.90 | 0.85 ± 0.01 / 0.90 ± 0.00 | 0.88 ± 0.01 / 0.91 ± 0.00 |
| Image segmentation | 0.86 ± 0.01 / 0.82 ± 0.01 | 0.87 ± 0.01 / 0.83 ± 0.01 | — / — | 0.88 ± 0.01 / 0.83 | 0.88 ± 0.01 / 0.84 ± 0.01 |
| Forest covertype | 0.84 ± 0.01 / 0.70 ± 0.00 | 0.85 ± 0.00 / 0.71 ± 0.00 | 0.92 / 0.79 | 0.92 ± 0.00 / 0.84 ± 0.00 | 0.93 ± 0.00 / 0.84 ± 0.00 |

Notes: Each cell lists the kNN and MLP accuracies as mean ± standard deviation over 10 trials; the multiscale method delivers a single result per case.
Table 9: Comparisons in Terms of Running Time (in seconds).

| Name of Data Set | Supervised SOM | Multiscale Method | WREDR | WREDR-FO |
|------------------|----------------|-------------------|-------|----------|
| Reduction ratio = 0.1 | | | | |
| Synthetic Data | 0.8 | 2.7 | 0.8 | 0.7 |
| Musk | 153 | 1.1 × 10^3 | 410 | 479 |
| Pima | 1.2 | 3.3 | 1.4 | 1.6 |
| Spambase | 15 | 285 | 35 | 35 |
| Image Segmentation | 43 | 1.7 × 10^3 | 99 | 107 |
| Forest covertype | 8.2 × 10^3 | 1.2 × 10^5 | 6.3 × 10^3 | 7.8 × 10^3 |
| Reduction ratio = 0.3 | | | | |
| Synthetic Data | 1.8 | 7.3 | 4.0 | 3.4 |
| Musk | 651 | 3.1 × 10^3 | 1.5 × 10^3 | 1.6 × 10^3 |
| Pima | 1.9 | 10.0 | 4.1 | 4.2 |
| Spambase | 29 | 760 | 131 | 133 |
| Image Segmentation | 73 | — | 414 | 459 |
| Forest covertype | 4.9 × 10^4 | 1.9 × 10^5 | 1.3 × 10^4 | 1.9 × 10^4 |
6 Conclusions

This article focuses on density-based data reduction schemes because this type of data reduction technique can be widely used for tackling data analysis tasks and building data analysis models. In conventional density-based methods, the probability density of each data point has to be estimated or analyzed, which makes these methods computationally expensive on huge data sets. To address this shortcoming, we propose novel entropy-based data reduction criteria and a data reduction process based on them. Compared with the existing density-based methods, our proposed methods exhibit higher efficiency and similar effectiveness. We also design an outlier-filtering strategy; this simple and efficient strategy is immensely useful for classification tasks. Finally, it is important to note that the experimental results indicate that the proposed methods are robust to initialization.

Acknowledgments

We thank the anonymous reviewers for their useful comments. The work described in this article is fully supported by a grant from City University of Hong Kong (project no. 7001701-570).

References

Astrahan, M. M. (1970). Speech analysis by clustering, or the hyperphoneme method (Stanford A I Project Memo). Palo Alto, CA: Stanford University.
Bezdek, J. C., & Kuncheva, L. I. (2001). Nearest prototype classifier designs: An experimental study. Int. J. Intell. Syst., 16(12), 1445–1473.
Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1–2), 245–271.
Catlett, J. (1991). Megainduction: Machine learning on very large databases. Unpublished doctoral dissertation, University of Sydney, Australia.
Chang, C. L. (1974). Finding prototypes for nearest neighbor classifiers. IEEE Trans. Computers, 23(11), 1179–1184.
Chow, T. W. S., & Wu, S. (2004). An online cellular probabilistic self-organizing map for static and dynamic data sets. IEEE Trans. on Circuits and Systems I, 51(4), 732–747.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Dasarathy, B. V. (1991). Nearest neighbor (NN) norms: NN pattern classification techniques. Los Alamitos, CA: IEEE Computer Society Press.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: Wiley.
Gates, G. W. (1972). The reduced nearest neighbor rule. IEEE Trans. on Inform. Theory, IT-18, 431–433.
Gersho, A., & Gray, R. M. (1992). Vector quantization and signal compression. Norwell, MA: Kluwer.
Gray, R. M. (1984). Vector quantization. IEEE ASSP Magazine, 1, 4–29.
Friedman, J. H. (1997). Data mining and statistics: What's the connection? Available online at http://www.salford-systems.com/doc/dm-stat.pdf.
Han, J. W., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann.
Hart, P. E. (1968). The condensed nearest neighbor rule. IEEE Trans. on Information Theory, 14, 515–516.
Haykin, S. (1999). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall.
Khotanzad, A., & Lu, J.-H. (1990). Classification of invariant image representations using a neural network. IEEE Transactions on Signal Processing, 38(6), 1028–1038.
Kohonen, T. (2001). Self-organizing maps. London: Springer.
Mitra, P., Murthy, C. A., & Pal, S. K. (2002). Density-based multiscale data condensation. IEEE Trans. on PAMI, 24(6), 734–747.
Parzen, E. (1962). On the estimation of a probability density function and mode. Ann. Math. Statist., 33, 1064–1076.
Provost, F., & Kolluri, V. (1999). A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 2, 131–169.
Plutowski, M., & White, H. (1993). Selecting concise training sets from clean data. IEEE Trans. Neural Networks, 4(2), 305–318.
Quinlan, R. (1983). Learning efficient classification procedures and their application to chess end games. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (pp. 463–482). Palo Alto, CA: Tioga.
Roy, N., & McCallum, A. (2001). Toward optimal active learning through sampling estimation of error reduction. In Proc. 18th International Conference on Machine Learning. San Francisco: Morgan Kaufmann. Available online at www.cs.umass.edu/~mccallum/papers/active-icm/101.ps.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5, 197–227. Scott, D. W. (1992). Multivariate density estimation: Theory, practice, and visualization. New York: Wiley. Wilson, D. R., & Martinez, T. R. (2000). Reduction techniques for instance-based learning algorithms. Machine Learning, 38, 257–286. Yang, Z. P., & Zwolinski, M. (2001). Mutual information theory for adaptive mixture models. IEEE Trans. on PAMI, 23(4), 396–403.
Received December 29, 2004; accepted June 27, 2005.
LETTER
Communicated by Heinrich Buelthoff
Receptive Field Structures for Recognition Benjamin J. Balas
[email protected]
Pawan Sinha
[email protected] Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02142, U.S.A.
Localized operators, like Gabor wavelets and difference-of-gaussian filters, are considered useful tools for image representation. This is due to their ability to form a sparse code that can serve as a basis set for high-fidelity reconstruction of natural images. However, for many visual tasks, the more appropriate criterion of representational efficacy is recognition rather than reconstruction. It is unclear whether simple local features provide the stability necessary to subserve robust recognition of complex objects. In this article, we search the space of two-lobed differential operators for those that constitute a good representational code under recognition and discrimination criteria. We find that a novel operator, which we call the dissociated dipole, displays useful properties in this regard. We describe simple computational experiments to assess the merits of such dipoles relative to the more traditional local operators. The results suggest that nonlocal operators constitute a vocabulary that is stable across a range of image transformations.

1 Introduction

Information theory has become a valuable tool for understanding the functional significance of neural response properties. In particular, the idea that a goal of early sensory processing may be to efficiently encode natural stimuli has generated a large body of work describing the function of the human visual system in terms of redundancy reduction and maximum-entropy responses (Attneave, 1954; Barlow, 1961; Atick, 1992; Field, 1994). In the compound eye of the fly, for example, the contrast response function of a particular class of interneuron approximates the distribution of contrast levels found in natural scenes (Laughlin, 1981). This is the most efficient encoding of contrast fluctuations, meaning that, from the point of view of information theory, these cells are optimally tuned to the statistics of their environment.
In the context of the primate visual system, it has been proposed that the receptive fields of various cells may have the form they do for similar reasons. Olshausen and Field (1996, 1997) and Bell
Neural Computation 18, 497–520 (2006)
© 2006 Massachusetts Institute of Technology
498
B. Balas and P. Sinha
and Sejnowski (1997) have demonstrated that the oriented edge-finding receptive fields found in early visual cortex (Hubel & Wiesel, 1959) may exist because they provide an encoding of natural scenes that maximizes information. Olshausen and Field were able to produce such filters by enforcing sparseness constraints on their encoding while ensuring that the representation allowed high-fidelity reconstruction of the original scene. Bell and Sejnowski enforced the statistical independence of the filters rather than working with an explicit sparseness criterion. These two approaches are actually equivalent, as demonstrated by Olshausen and Field. An aspect of Bell and Sejnowski's work that sets it apart, however, is their progression through constraints of different strength, such as principal component analysis (orthogonal basis), ZCA (zero-phase whitening filters), and finally independent component analysis (statistical independence). These different constraints lead to qualitatively different filters, such as checkerboard-like structures and center-surround functions, resembling the preferred stimuli of cells found in some parts of the visual pathway (V4 and the lateral geniculate nucleus (LGN), respectively).

The search for efficient codes has helped direct the efforts of researchers interested in explaining neural response properties in the visual system and has fostered the study of ecological constraints in natural scenes (Simoncelli & Olshausen, 2001). However, there are many other tasks that the visual system must accomplish, for which the goal may be quite different from high-fidelity input reconstruction. The task of recognizing complex objects is an important case in point. A priori, we cannot assume that the same computations that result in sparse coding would also support robust recognition.
Indeed, the resilience of human recognition performance to image degradations suggests that image measurements underlying recognition can survive significant reductions in reconstruction quality. Extracting measurements that are stable against ecologically relevant transformations of an object (lighting and pose, for example) is a constraint that might result in qualitatively different receptive field structures from the ones that support high-fidelity reconstruction. In this letter, we examine the nature of receptive fields that emerge under a recognition- rather than reconstruction-based criterion. We develop and illustrate our ideas primarily in the context of human faces, although we expect that similar analyses can be conducted with other object classes as well. In this analysis, we note the emergence of a novel receptive field structure that we call the dissociated dipole. These dipoles (or “sticks”) perform simple nonlocal luminance comparisons, allowing a region-based representation of image structure. We also compare the stability characteristics of various kinds of filters. These include model neurons with receptive field structures like those found by sparse coding constraints and sticks operators. Our goal is to eventually gain an understanding of how object representations that are
useful for recognition might be constructed from simple image measurements.
2 Experiment 1: Searching for Simple Features in the Domain of Faces

We begin by investigating what kinds of simple features can be used to discriminate among frontally viewed faces. The choice of a specific example class is primarily for ease of exposition. The ideas we develop are intended to be more generally applicable. (We substantiate this claim in experiment 2 when we describe computational experiments with arbitrary object classes.)

Computationally, there are many methods for performing the face discrimination task with relatively high accuracy, especially if the faces are already well normalized for position, pose, and scale. Using nothing more than the Euclidean distance between faces to do nearest-neighbor classification in pixel space, one can obtain reasonably good results (∼65% with a 40-person classification task using the ORL database, compiled by AT&T Laboratories, Cambridge, UK). Using eigenfaces, one can improve this score somewhat by removing the contribution of higher-order eigenvectors, effectively "denoising" the face space. Further adjustments can be made as well, including the explicit modeling of intra- and interpersonal differences (Moghaddam, Jebara, & Pentland, 2000) and the use of more complex classifiers.

On the other side of the spectrum from these global techniques are methods for rating facial similarity that rely on Gabor jets placed at fiducial points on a face (Wiskott, Fellous, Kruger, & von der Malsburg, 1997). These techniques use information at multiple spatial scales to produce a representation built up from local analyses; they are also quite successful. The overall performance of these systems depends on both the choice of representation and the back-end classification strategy. Since we focus exclusively on the former, our goal is not to produce a system for recognition that is superior to these approaches, but rather to explore the space of front-end feature choices.
In other words, we look within a specific set of image measurements, bilobed differential operators, to see what spatial analyses lead to the best invariance across images of the same person. For our purposes, a bilobed differential operator is a feature type in which weighted luminance is first calculated over two image regions, and the final output of the operator is the signed difference between those two average values. In general, these two image regions need not be connected. Some examples of these filters are shown in Figure 1. Conceptually, the design of our experiment is as follows. We exhaustively consider all possible bilobed differential operators (with the individual lobes modeled as rectangles for simplicity). We evaluate the discrimination performance of the corresponding measurements over a face database (discriminability refers to maximizing separation between individuals and minimizing distances within instances of the same person). By sorting the
B. Balas and P. Sinha
Figure 1: Examples of bilobed differential operators of the sort employed in experiment 1.
large space of all operators using the criterion of discriminability, we can determine which are likely to constitute a good vocabulary for recognition. We note that this approach differs substantially from efforts to find reliable features for face and object detection in cluttered backgrounds. For example, Ullman’s work on features of intermediate complexity (IC) (Ullman, Vidal-Naquet, & Sali, 2002) demonstrates a method for learning class-diagnostic image fragments using mutual information. These IC features are both very likely to be present in an image when the object is present and unlikely to appear in the image background by chance. Other feature learning studies have concentrated on developing generative models for object recognition (Fei-Fei, Fergus, & Perona, 2003; Fergus, Perona, & Zisserman, 2003; Fei-Fei, Fergus, & Perona, 2004) in which various appearance densities are estimated for diagnostic image fragments. This allows recognition of an object in a cluttered scene to proceed in a Bayesian manner. These studies are unquestionably valuable to our understanding of object recognition. Our goals in this study are slightly different, however. First, we are interested in discovering what features support invariance to a particular object rather than a particular object class. It is for this reason that we do not attempt to segment the objects under consideration from a cluttered background. We envision segmentation proceeding via parts-based representations such as those described above. Indeed, it has recently been shown that simple contrast relationships can be used to detect objects in cluttered backgrounds with good accuracy (Sinha, 2002) and that good segmentation results can be obtained once one has recognized an object at the class level (Borenstein & Ullman, 2002, 2004). 
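Concretely, the response of such an operator reduces to a signed difference of two rectangle means. A minimal sketch (the toy image and lobe coordinates are ours, purely for illustration):

```python
# Hypothetical sketch of a bilobed differential operator: luminance is
# averaged over two rectangular lobes, and the output is the signed
# difference of those means. The lobes need not be adjacent (nonlocal).

def rect_mean(img, top, left, height, width):
    """Mean luminance over one rectangular lobe of a row-major image."""
    total = 0.0
    for r in range(top, top + height):
        for c in range(left, left + width):
            total += img[r][c]
    return total / (height * width)

def bilobed_response(img, lobe_a, lobe_b):
    """Signed difference between the mean luminances of two lobes,
    each given as a (top, left, height, width) tuple."""
    return rect_mean(img, *lobe_a) - rect_mean(img, *lobe_b)

# Toy 4x4 image: bright left half (10), dark right half (2).
img = [[10, 10, 2, 2] for _ in range(4)]
```

Applied to the toy image with the left and right halves as lobes, `bilobed_response(img, (0, 0, 4, 2), (0, 2, 4, 2))` returns 8.0.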
While it may be possible to learn diagnostic features of an individual that could be used for segmentation purposes, we believe that it is also plausible to consider segmentation as a process that proceeds prior to individuation (subordinate-level classification). Second, rather than looking for complex object parts that support invariance, we commence by considering very simple features. This means that we are not likely to find globally optimal features for individuation. Instead, we aim to determine what structural properties of potentially
Receptive Field Structures for Recognition
low-level RFs contribute to recognition. In a sense, we are trying to understand what computations between the lowest and highest levels of visual processing lead to the impressive invariances for object transformations displayed by our visual system. Given that we are attempting to understand how recognition abilities are built up from low-level features, one might ask why we do not explicitly assume preprocessing by center-surround or wavelet filters. Indeed, others have pursued this line of thought (Edelman, 1993; Schneiderman & Kanade, 1998; Riesenhuber & Poggio, 1999), and such an analysis could help us understand how the outputs of early visual areas (such as the LGN and V1) serve as the basis for further computations that might support recognition. That said, we have chosen not to adopt this strategy, so that we can remain completely agnostic as to what basic computations are necessary first steps toward solving high-level problems. However, it is straightforward to extend this work to incorporate a front-end comprising simple filters. 2.1 Stimuli. We use faces drawn from the ORL database (Samaria and Harter, 1994) for this initial experiment. The images are all 112 × 92 pixels in size, and there are 10 unique images of each of the 40 individuals included in the database. We chose to work with 21 randomly chosen individuals in the database, using the first 5 images of each person. The faces are imaged against uniform backdrops. Therefore, the task in our experiment is not to segregate faces from a cluttered background, but rather to individuate them. 2.2 Preprocessing 2.2.1 Block Averaging. Relaxing locality constraints results in a very large number of allowable square differential operators in a particular image. To reduce the size of our search space, we first down-sample all of the images in our database to a much smaller size of 11 × 9 pixels. 
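The block-averaging reduction can be sketched as follows (a simplified version assuming image dimensions divide evenly by the block size; the actual 112 × 92 to 11 × 9 reduction needs fractional-block resampling, which we omit):

```python
# A minimal block-averaging down-sampler of the kind used to shrink the
# face images before the exhaustive search. The block size and toy image
# below are illustrative only.

def block_average(img, block):
    """Down-sample by averaging non-overlapping block x block tiles.
    Assumes image dimensions are exact multiples of the block size."""
    rows, cols = len(img), len(img[0])
    out = []
    for r in range(0, rows, block):
        row = []
        for c in range(0, cols, block):
            s = sum(img[r + i][c + j]
                    for i in range(block) for j in range(block))
            row.append(s / (block * block))
        out.append(row)
    return out

small = block_average([[1, 3, 5, 7],
                       [1, 3, 5, 7]], 2)  # 2x4 input -> 1x2 output
```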
Much of the information necessary for successful classification is present at this small size, as evidenced by the fact that the recognition performance of a simple nearest-neighbor classifier actually increases slightly (from 65% correct at full resolution to 70% using 8 × 8 pixel blocks) if we use these smaller images as input. 2.2.2 Constructing Difference Vectors. Our next step involves changing our recognition problem from a 21-class categorization task into a binary one. We do this by constructing difference vectors, which comprise two classes of intra- and interpersonal variation (Moghaddam et al., 2000). Briefly, we subtract one image from another, and if the two images used depicted the same individual, then that difference vector captures intrapersonal variation. If the two images were of different individuals, then that difference vector would be one that captured interpersonal variation. Given these two
sets, we look for spatial features that can distinguish between these two types of variation in facial appearance rather than attempting to find features that are always stable within each of 21 categories. To assemble the difference vectors used in this experiment, we took all unique pair-wise differences between images that depicted the same person (intrapersonal set) and used the first image of each individual to construct a set of pair-wise differences that matched our first set in size (interpersonal set). The faces used to construct these difference vectors were not precisely registered. We attempted to find features robust to the variations in facial position and view that arise in this data set. 2.2.3 Constructing Integral Images. Now that we have two sets of low-resolution difference vectors, we introduce one last preprocessing step designed to speed up the execution of our search. Since the differential operators we are analyzing have rectangular lobes, we construct integral images (Viola & Jones, 2001) from each of our difference vectors. Integral images allow the fast computation of rectangular image features, reducing the process to a series of look-ups. The value of each pixel in the integral image created from a given stimulus represents the sum of all pixels above and to the left of that pixel in the original picture. 2.3 Feature Ranking. In our 11 × 9 images, there are a total (n) of 2970 unique box features. Given that we are interested in all possible differential operators, there are approximately 4.5 million spatial features (n²/2) for us to consider. To decide which of these features were best for recognition, we used A′ as our measure of discriminability (Green & Swets, 1966). A′ is a nonparametric measure of discriminability calculated by finding the area underneath an observer’s ROC (receiver-operating-characteristic) curve. 
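The integral-image construction of section 2.2.3 (Viola & Jones, 2001) and the four-look-up rectangle sum it enables can be sketched as follows (the helper names are ours):

```python
# Each pixel of the integral image holds the sum of all input pixels above
# and to the left of it (inclusive), so any rectangle sum needs at most
# four look-ups regardless of rectangle size.

def integral_image(img):
    """Build the integral image of a row-major grayscale image."""
    rows, cols = len(img), len(img[0])
    ii = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        run = 0.0  # running sum along the current row
        for c in range(cols):
            run += img[r][c]
            ii[r][c] = run + (ii[r - 1][c] if r > 0 else 0.0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum over the inclusive rectangle [top..bottom] x [left..right]."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total
```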
This curve is determined by plotting the number of “hits” and “false alarms” a given observer obtains when using a particular numerical threshold to judge the presence or absence of a signal. In this experiment, we treat each differential operator as one observer. The signals we wish to detect are the intrapersonal difference vectors. The response of each operator (mean value of pixels under the white rectangle minus mean value of pixels under the black rectangle) was calculated on each difference vector, and then the labels associated with those vectors (intra- versus interpersonal variation) were sorted according to that numerical output. With the distribution of labeled difference vectors in hand for a particular feature, we could proceed to calculate the value of A′. We determined how many hits and false alarms there would be for a threshold placed at each possible location along the continuum of observed feature values. This allowed us to plot a discretized ROC curve for each feature. Calculating the area underneath this curve is straightforward, yielding the discriminability for that operator. A′ scores range from 0.5 to 1. A perfect separation of intra- and interpersonal difference vectors would lead to an
A′ score of 1, while a complete enmeshing of the two classes would lead to a score of 0.5. In one simulation, the absolute value of each feature was taken (rectified results), and in another the original responses were unaltered (unrectified results). In this way, we could establish how instances of each class were distributed with respect to each spatial feature, both with and without information concerning the direction of brightness differences. It is important to note at this stage that there is no reason to expect that any of the values we recover from our analysis of these spatial features will be particularly high. In boosting procedures, it is customary to use a cascade of relatively poor filters to construct a classifier capable of robust performance, meaning that even with a collection of bad features, one can obtain worthwhile results. In this experiment, we are interested only in the relative ranking of features, though it is possible that the set of features we obtain could be useful for recognition despite their poor abilities in isolation. We shall explicitly consider the utility of the features discovered here in a recognition paradigm presented in experiment 2.
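The A′ ranking described above can be sketched with a rank-based computation equivalent to sweeping a threshold along the observed feature values; tied values receive half credit. Since the text reports A′ between 0.5 and 1, we assume features that discriminate in the reverse direction are folded via max(A, 1 − A); that folding is our reading, not something the text states explicitly.

```python
# Sketch of the A' (area under the ROC curve) measure. `scores` are operator
# outputs, and labels mark intrapersonal ("signal", 1) versus interpersonal
# ("noise", 0) difference vectors.

def a_prime(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]  # intrapersonal
    neg = [s for s, l in zip(scores, labels) if l == 0]  # interpersonal
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties get half credit
    auc = wins / (len(pos) * len(neg))
    # Our assumption: fold reverse-direction discriminators into [0.5, 1].
    return max(auc, 1.0 - auc)
```

For perfectly separated classes, e.g. `a_prime([3, 2, 1, 0], [1, 1, 0, 0])`, the measure is 1.0; for fully enmeshed classes it is 0.5.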
2.4 Results 2.4.1 Differential Operators. The top-ranked differential operators recovered from our analysis of the space of possible two-lobed box filters are displayed in Figure 2. As we expected, the A′ measured for each individual feature is not particularly high, with the best operator in these two sets scoring approximately 0.71. There are four main classes of features that dominate the top 100 differential operators. First, features resembling center-surround structures appear in several top slots, in both the rectified and unrectified data. This is somewhat surprising, given that cells with this structure are most commonly associated with very early visual processing implicated in low-level tasks such as contrast enhancement, rather than higher-level tasks like recognition. Of course, the features we have recovered here are far larger in terms of their receptive field than typical center-surround filters used for early image processing, so perhaps these structures are useful for recognition if scaled up to larger sizes. The second type of feature that is very prevalent in the results is what we will call a dissociated dipole, or stick, operator, and appears primarily in the unrectified results. These features have a spatially disjoint structure, meaning that they execute brightness comparisons across widely separated parts of an image. Admittedly, the connection between these operators and the known physiology of the primate visual system is weak. To date, there have been no cells with this sort of dissociated receptive field structure found in the human visual pathway, although they may exist in the auditory and somatosensory processing streams (Young, 1984; Chapin, 1986).
Figure 2: The top 100 ranked features (“Top 100 features for ORL recognition task”) for discriminating between intra- and interpersonal difference vectors. Beneath the 10 × 10 array are representatives of the most common features discovered.
The final two features are elongated edge and line detectors, which dominate the results of the rectified operators. An elongated edge detector appears in the unrectified rankings as well, but other structurally similar features are found only in the next 100 ranked features. These structures resemble some of the receptive fields known to exist in striate cortex, as well as the wavelet-like operators that support sparse coding of natural scenes. We point out that multiple copies of these features appear throughout our rankings, which is to be expected. Small structural changes to these filters only slightly alter their A′ score, meaning that many of the top features have very similar forms. We do not attribute any particular importance to the fact that the nonlocal operators that perform best appear to be comparing values on the right edge of the image to values in the center, or to the tendency for elongated edge detectors to appear in the center of the image. It is only the generic structure of each operator that is important to us here. 2.4.2 Single Rectangle Features. We chose to examine differential operators in our initial analysis for several reasons. First, cells with both excitatory and inhibitory regions are found throughout the visual system. Second, by
Figure 3: Plots of A′ scores (vertical axis, roughly 0.58 to 0.74) across the 100 best features (horizontal axis, feature rank) from each family of operators: single versus double rectangle features, in both signed (unrectified) and unsigned (rectified) form.
taking the difference in luminance between one region and another, one is far less sensitive to uniform changes in illumination brought on by haze or bright lighting, for example. However, given that we are using a database of faces that is already relatively well controlled in terms of lighting and pose, it may be the case that even simpler features can support recognition. To examine this possibility, we conduct the same analysis described above for differential operators on the set of all single-rectangle box features in our images. We find that single-rectangle features are not as useful for discriminating between our two classes as are differential operators. The range of A′ values for the top 100 features from each category is plotted in Figure 3, where it is clear that both sets of differential operators provide better recognition performance than single box filters. Even in circumstances where many of the reasons to employ differential operators have been removed through clever database construction (say, by disallowing fluctuations in ambient illumination), we find that they still outperform simpler measurements.
2.5 Discussion. In our analysis of the best differential operators for face recognition, we have observed a new type of operator (the dissociated dipole) that offers an alternative form of processing by which within-class stability might be achieved for images of faces. An important question to consider is how this operator fits within the framework of previous computational models of recognition, as well as whether it has any relevance to human vision. The dissociated dipole is an instance of a higher-order image statistic, a binary measurement. The notion that such statistics might be useful for pattern recognition is not new; indeed, Julesz (1975) suggested that needle statistics could be useful for characterizing random dot textures. In the computer vision community, nonlocal comparisons are employed in integral geometry to characterize shapes (Novikoff, 1962). The possibility that nonlocal luminance comparisons may be useful for object and face recognition has not been thoroughly explored, however. Such an approach differs from traditional shape-based approaches to object recognition, in that it implicitly considers relationships between regions to be of paramount importance. Our recent results (Balas & Sinha, 2003) have demonstrated that such a nonlocal representation of faces provides for better recognition performance than a strictly local one. Furthermore, Kouh and Riesenhuber (2003) have found that to model the responses of V4 neurons to various gratings using a hierarchical model of recognition (Riesenhuber & Poggio, 1999), it is necessary to pool responses from spatially disjoint low-level neurons. Before proceeding, we wish to specify more precisely the relationship between local, nonlocal, and global image analysis. We consider local analyses those in which a contiguous set of pixels (either 4- or 8-connected) is represented in terms of a single output value. A global analysis is similar to this, save for the amount of the image under consideration. 
In the limit, a global image analysis uses all pixels in the image to construct the output value. A local analysis might use only some small percentage of image area. This distinction is not truly categorical. Rather, there is a spectrum between local and global image analysis. Likewise, a similar spectrum exists between local and nonlocal analysis. While a local analysis considers only a set of contiguous pixels, a nonlocal analysis breaks this condition of contiguity. In the extreme, one can imagine a highly nonlocal feature composed of two pixels located at opposite corners of an image. At the other extreme would be a highly local feature consisting of two neighboring pixels. Of course, there are many operators spanning these two possibilities that are neither purely local nor nonlocal. Moreover, if one measures local features (like Gabor filter outputs) at several nonoverlapping positions, is this a local or a nonlocal analysis? If one is merely concatenating the values of each local analysis into one feature vector, then this is not a truly nonlocal computation by our definition. If, however, the values of those local features are explicitly combined to produce one output value, then we would have arrived at a nonlocal analysis
Figure 4: A dipole measurement is parameterized in terms of the space constant σ of each lobe, the distance δ between the centers of each lobe, and the angle of orientation, θ .
of the image. Nonlocal analysis of this type has traditionally received less attention than local or global strategies of image processing. The reason nonlocal representations of brightness have not been studied in great detail may be due to the sheer number of generic binary statistics. In general, the trouble with appeals to higher-order statistics for recognition is that there is a vast space of possible measurements that are allowable with the introduction of new parameters (in our case, the distance between operator lobes). This combinatorial explosion makes it hard to determine which particular measurements are actually useful within the large range of possibilities. This is, of course, a serious problem in that the utility of any set of proposed measurements is dependent on the ability to separate helpful features from useless ones. We also note that there are several computational oddities associated with nonlocal operators. Suppose that we formulate a dissociated dipole as a difference-of-offset-gaussians operator (a model we present in full in the next experiment), allowing the distance between the two gaussians to be manipulated independent of either one’s spatial constant (see Figure 4). In so doing, we lose the ability to create steerable filters (Freeman & Adelson, 1991), meaning that to obtain dipoles at a range of orientations, we have no other option than to use a large number of operators. This is not impossible, but it lacks the elegance and efficiency of more traditional approaches by which multiscale representations can be created at any orientation through the use of a small number of basis functions. Another important difference between local and nonlocal computations is the distribution of operator outputs. Natural images are spatially redundant, meaning that the output of most local operators is near zero (Kersten,
1987). The result is a highly kurtotic distribution of filter outputs, indicating that a sparse representation of the image using those filters is expected. In many cases, this is highly desirable from both metabolic and computational viewpoints. As we increase the distance between the offset gaussians we use to model dissociated dipoles, the kurtosis of the distribution decreases significantly. This means that using these operators yields a coarse (or distributed) encoding of the image under consideration. This may not be unreasonable, especially given that distributed representations of complex objects may help increase robustness to image degradation. However, it is important to note that nonlocal computations depart from some conventional ideas about image representation in significant ways. Finally, given that we have discussed our findings in the context of discovering receptive field structures that are good for recognition rather than encoding, it is important to describe what differences we see between those two processes. The initial stages of any visual system have to perform transduction—transforming the input into a format amenable to further processing. Encoding is the process by which this re-representation of the visual input is accomplished. Recognition is the process by which labels that reflect aspects of image content are assigned to images. The constraints on encoding processes are twofold: the input signal should be represented both accurately and efficiently. Given the variety of visual tasks that must be accomplished with the same initial input, it makes sense that early visual stages would not be committed to optimizing any one of them. For that reason, we suggest that recognition operates on a signal that is initially encoded via localized edge-like operators, but may rely on different measurements extracted from that signal that prove more useful. 
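The sparseness claim above (highly kurtotic output distributions for local operators, flatter distributions as dipole separation grows) can be checked with a sample excess-kurtosis measure; this is a generic statistical helper, not code from the paper:

```python
# Excess kurtosis of a sample of filter outputs. A sharply peaked,
# heavy-tailed (sparse) response distribution gives a large positive value;
# a gaussian-like (distributed) code gives a value near zero.

def excess_kurtosis(xs):
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n  # fourth central moment
    return m4 / (m2 ** 2) - 3.0
```

A mostly-zero response set with rare large excursions (a sparse code) scores high; a two-point symmetric distribution scores the minimum of −2.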
In our next experiment, we directly address the question of whether the structures we have discovered in this analysis are useful for face and object classification. In this next analysis, we remove many of the simplifications necessary for an exhaustive search to be tractable in experiment 1. We also move beyond the domain of face recognition to include multiple object classes in our recognition task. 3 Experiment 2: Face and Object Recognition Using Local and Nonlocal Features In our first experiment, we noted the emergence of center-surround operators and nonlocal operators under a recognition criterion for frontally viewed faces. However, in our first experiment, many compromises were made in order to conduct an exhaustive search through the space of possible operators. First, our images were reduced to an extremely small size in order to limit the number of features we needed to consider. Though faces can be recognized at very low resolutions, it is also clear that there is interesting and useful structure at finer spatial scales. Second, we chose to work with difference images rather than the original faces. This allowed
Figure 5: Examples of stimuli used in experiment 2. (A) Training images of several individuals depicted in the ORL database. (B) Training images of objects depicted in the COIL database. Note that the COIL database contains multiple exemplars of some object classes (such as the cars in this figure), making within-class discrimination a necessary part of performing recognition well using this database.
us to transform a multicategory classification task into a binary task, but embodied the implicit assumption that a differencing operation occurs as part of the recognition process. Third, we point out that in any consideration of all possible bilobed features in an image, the number of nonlocal features will far exceed the number of local features. Greater numbers need not imply better performance, yet it is still possible that the abundance of useful nonlocal operators may be a function of set size. Finally, we note that in considering only face images, it is unclear whether the features we discovered are useful for general recognition purposes or specific to face matching. In this second experiment, we attempt to address these concerns through a recognition task that eliminates many of these difficulties. We employ high-resolution images of both faces and various complex objects in a classification task designed to test the efficacy of center-surround, local-oriented, and nonlocal features in an unbiased fashion. 3.1 Stimuli. For our face recognition experiment, we once again make use of the ORL database. In this case, all 40 individuals were used, with one image of each person serving as a training image. The images were not preprocessed in any way and remained at full resolution (112 × 92 pixels). To help determine if our findings hold up across a range of object categories, we also conduct this recognition experiment with images taken from the COIL database (see Figure 5; Nayar, Nene, & Murase, 1996; Nene, Nayar, & Murase, 1996). These images are 128 × 128 pixel images of 100 different objects, including toy cars, foods, pharmaceutical products, and many other diverse items. We selected these images for the wide range of surface and structural properties represented by the objects. Also, repeated exemplars
of a few object categories (such as cars) make both across-class and within-class recognition necessary. Each object is depicted rotated in depth from its original position in increments of 5 degrees. We chose the 0 degree images of each object as training images, and used the following 9 images as test images. The only preprocessing performed on these images was reducing them from full color to grayscale. 3.2 Procedure. To determine the relative performance of center-surround, local-oriented, and nonlocal features in an unbiased way, we model all of our features as generalized difference-of-gaussian operators. A generic bilobed operator in two-dimensional space can be modeled as follows:

$$\frac{1}{\sqrt{2\pi}\,|\Sigma_1|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_1)^{T}\Sigma_1^{-1}(x-\mu_1)} \;-\; \frac{1}{\sqrt{2\pi}\,|\Sigma_2|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_2)^{T}\Sigma_2^{-1}(x-\mu_2)}. \tag{3.1}$$

For all of our remaining experiments, we consider only operators with diagonal covariance matrices $\Sigma_1$ and $\Sigma_2$. Further, the diagonal elements of each matrix shall be equal, yielding isotropic gaussian lobes. For this simplified case, equation 3.1 can be expressed as

$$\frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{\|x-\mu_1\|^2}{2\sigma_1^2}} \;-\; \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{\|x-\mu_2\|^2}{2\sigma_2^2}}. \tag{3.2}$$

We also introduce a parameter $\delta$ to represent the separation between the two lobes. This is simply the Euclidean norm of the difference between the two means:

$$\delta = \|\mu_2 - \mu_1\|. \tag{3.3}$$
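Equations 3.1 to 3.3 can be sketched in the isotropic case as follows (the function and variable names are ours; the prefactor follows equation 3.2 as printed):

```python
import math

# Sketch of the isotropic difference-of-offset-gaussians operator of
# equation 3.2, together with the lobe separation delta of equation 3.3.

def dog_response(x, y, mu1, mu2, sigma1, sigma2):
    """Evaluate the bilobed operator at image location (x, y)."""
    def lobe(mu, sigma):
        d2 = (x - mu[0]) ** 2 + (y - mu[1]) ** 2
        return math.exp(-d2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    return lobe(mu1, sigma1) - lobe(mu2, sigma2)

def separation(mu1, mu2):
    """delta = ||mu2 - mu1|| (equation 3.3)."""
    return math.hypot(mu2[0] - mu1[0], mu2[1] - mu1[1])
```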
In order to build a center-surround operator, δ must be set to zero, and the spatial constants of the center and surround should be in a ratio of 1 to 1.6 to match the dimensions of receptive fields found in the human visual system (Marr, 1982). To create a local-oriented operator, we shall set σ₁ = σ₂ and set the distance δ to be equal to three times the value of the spatial constant. Finally, nonlocal operators can be created by allowing the distance δ to exceed the value 3σ (once again assuming equal spatial constants for the two lobes). Examples of all of these operators are displayed in Figure 6. Given this simple parameterization of our three feature types, we choose in this experiment to sample equal numbers of each kind of operator from the full set of possible features. In this way, we may represent each of our training images in terms of some small number of features drawn from a specific operator family and evaluate subsequent classification performance.
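The bank-sampling and nearest-neighbor procedure might be sketched as follows. The uniform random placement and orientation scheme is our assumption of one reasonable reading, and the feature extraction itself is abstracted away:

```python
import math
import random

# Sketch of sampling a bank of randomly positioned, randomly oriented
# bilobed operators from one family (delta fixed), plus the L2
# nearest-neighbor classification applied to the resulting feature vectors.

def sample_bank(img_shape, n_ops=50, delta=0.0, sigma=2.0, rng=None):
    """Draw random operator placements; each operator's second lobe sits
    delta pixels from the first along a random orientation."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    rows, cols = img_shape
    bank = []
    for _ in range(n_ops):
        cy, cx = rng.uniform(0, rows - 1), rng.uniform(0, cols - 1)
        theta = rng.uniform(0, 2 * math.pi)
        mu1 = (cy, cx)
        mu2 = (cy + delta * math.sin(theta), cx + delta * math.cos(theta))
        bank.append((mu1, mu2, sigma))
    return bank

def nearest_neighbor(test_vec, train_vecs):
    """Index of the training feature vector closest in L2 norm."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    return min(range(len(train_vecs)), key=lambda i: dist(test_vec, train_vecs[i]))
```

A bank sampled with `delta=0` corresponds to the center-surround family, `delta=3*sigma` to local oriented operators, and larger `delta` to the nonlocal families.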
Figure 6: Representative operators drawn from the four operator families considered in experiment 2. Top to bottom, we display examples of center-surround features, local oriented features, and two kinds of nonlocal features (δ = 6σ and δ = 9σ).
Four operator families were considered: center-surround features (δ = 0), local-oriented features (δ = 3σ ), and two kinds of nonlocal features (δ = 6σ and 9σ ). For each operator family, we constructed 40 banks of 50 randomly positioned and oriented operators each. Twenty of these feature banks contained operators with a spatial constant of 2 pixels, and the other 20 feature banks contained operators with a 4 pixel spatial constant. Each bank of operators was applied to the training images to generate a feature vector consisting of 50 values. The same operators were then applied to all test images, and the resulting feature vectors were classified using a nearest-neighbor metric (L2 norm). This procedure was carried out on both the ORL and the COIL databases. 3.3 Results. The number of images correctly identified for a given filter bank was calculated for each recognition trial, allowing us to compute an average level of classification performance from the 20 runs within each operator family and spatial scale (see Figure 7). We find in this task that once again, center-surround and nonlocal features offer the best recognition performance. This result holds at both spatial scales used in this task, as well as for both face recognition and multiclass object recognition. We also note the small variability in recognition performance around each operator’s mean value. Despite the random sampling of features used to constitute our operator banks, the resulting recognition performance remained very consistent. In both cases, we note that center-surround performance slightly exceeds that obtained using nonlocal operators. It is interesting to note, however, that a larger separation between the lobes of a nonlocal feature results in better recognition performance. This cannot continue indefinitely, of course, as longer and longer separations will lead to more limitations on where operators can be placed within the image. Increased accuracy with increased
Figure 7: Recognition performance (proportion correct, 0.5 to 1.0) for both faces (left) and objects (right) as a function of both the distance between operator lobes (0 to 8, in multiples of the spatial constant σ) and the spatial constant of the lobes (σ = 2 pixels versus σ = 4 pixels).
nonlocality does suggest that larger distances between lobes are more useful, however, and that it is not enough simply to deviate from locality. We note that the distinct dip in performance for local-oriented features is both consistent and puzzling. Why should it be the case that unoriented local features are good at recognition while oriented local features are poor? Center-surround operators analyze almost the same pixels as a local-oriented operator placed at the same location, so why should they be so different in terms of their recognition performance? Moreover, how is it that radically different operators like the dissociated dipole and the center-surround operator should perform so similarly? In our third and final experiment, we attempt to address these questions by breaking down the recognition problem into distinct parts so we can learn how these operator families function in classification tasks. Specifically, we point out that good recognition performance is made possible when an operator possesses two distinct properties. First, an operator must provide a stable response to images of objects with the same identity. Second, the operator must respond differently to images of objects with different identities. Neither condition is sufficient for recognition to proceed, but both are necessary. We hypothesize that though both centersurround operators and nonlocal operators provide useful information for recognition, they do so in different ways. In our last experiment, we assess both the stability and variability of each operator type to determine how good recognition results are achieved with different receptive field structures.
Receptive Field Structures for Recognition
4 Experiment 3: Feature Stability and Variability

In experiment 2, we determined that both center-surround and nonlocal operators outperform local-oriented features at recognition of faces and objects. In many ways, this is quite surprising. Center-surround features appear to share little with nonlocal operators as we have defined them, yet their recognition performance is quite similar. In this experiment, we break down the recognition process into components of stability and variability. To perform well at recognition, a particular operator must first be able to respond in much the same way to many different images of the same face. This is how we define stability, and one can think of it in terms of various identity-preserving transformations. Whether a face is smiling or not, lit from the side or not, a useful operator for recognition must not vary its response too widely. If this proves true, we may say that the feature is stable with respect to the transformation being considered.

We use this notion to formulate an operational definition of stability in terms of a set of image measurements and a particular face transformation. Let us first assume that we possess a set of image measurements in a filter bank, just as we did in experiment 2. This filter bank is applied to some initial image, which shall always depict a person in frontal view with a neutral expression. The value of each operator in our collection can be determined and stored in a one-dimensional vector, x. This same set of operators is then applied to a second image, depicting the same person as the original image but with some change of expression or pose. The values resulting from applying all operators to this new image are then stored in a second vector, y. The two vectors x and y may then be compared to see how drastic the changes in operator response were across the transformation from the first image to the second.
If by some luck our operators are perfectly invariant to the current transformation, plotting x versus y would produce a scatter plot in which all points would lie on the line y = x. Poor invariance would be reflected in a plot in which points are distributed randomly. For two vectors x and y (each of length n), we may use the value of the correlation coefficient (see equation 4.1) between them as our quantitative measure of feature stability:

r = \frac{n \sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[ n \sum x^2 - \left(\sum x\right)^2 \right] \left[ n \sum y^2 - \left(\sum y\right)^2 \right]}}.   (4.1)
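Equation 4.1 is the standard computational form of Pearson's correlation coefficient. As a minimal sketch (assuming NumPy; the operator-response vectors here are hypothetical stand-ins for the x and y above):

```python
import numpy as np

def stability(x, y):
    """Pearson's r, written term by term as in equation 4.1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.sum(x ** 2) - np.sum(x) ** 2) *
                  (n * np.sum(y ** 2) - np.sum(y) ** 2))
    return num / den

# Perfectly invariant operators: responses to the transformed image (y)
# equal the responses to the original (x), so all points lie on y = x
# and the coefficient equals 1 (up to rounding).
x = np.array([0.3, 1.7, 0.9, 2.4])
perfect = stability(x, x)
```

For a sanity check, the same value is obtained from `np.corrcoef(x, y)[0, 1]`.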
The second component of recognition is variability. It is not enough to be stable to transformations; one must also be diagnostic of identity. Imagine, for example, that one finds an image measurement that is perfectly stable across lighting, expression, and pose transformations. It may seem that this measurement is ideal for recognition, but let us also imagine that it turns out to be of the same value for every face considered. This provides no
B. Balas and P. Sinha
means of distinguishing one face from another, despite the measurement's remarkable invariance to transformations of a single face. What is needed is an ability to be stable within images of a single face, but vary broadly across images of many different faces. This last attribute we shall call variability, and we may quantify it for a particular measurement as the variance of its response across a population of faces. In this third experiment, we use these operational definitions of stability and variability to determine what properties center-surround and nonlocal operators possess that make them useful for recognition. We shall return once again to the domain of faces, as they provide a rich set of transformations to consider, both rigid and nonrigid alterations of the face in varying degree.

4.1 Stimuli. We use 16 faces (8 men, 8 women) from the Stirling face database for this experiment. The faces are grayscale images of individuals in a neutral, frontal pose accompanied by pictures of the same models smiling and speaking while facing forward, and also in a three-quarter pose with neutral expression. We call these transformations the SMILE, SPEECH, and VIEW transforms, respectively. The original images were 284 × 365 pixels, and the only preprocessing step applied was to crop out a 256 × 256 pixel region centered in the original image rectangle.

4.2 Procedure. All operators in these sets were built as difference-of-gaussian features, exactly as described in experiment 2. Also as before, center-surround, local-oriented, and two kinds of nonlocal features were evaluated. Because we would like to understand how both the separation of lobes and their individual spatial extent affect performance, two scales were employed for each kind of feature. Space constants of 4 pixels (fine scale) and 8 pixels (coarse scale) were used. In the case of center-surround features, the value of the space constant always refers to the size of the surround.
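The operator families described here can be sketched as differences of two gaussian lobes. The following illustration is not the authors' code; the grid size, lobe placements, and the surround-to-center ratio of the center-surround operator are assumptions made for the example:

```python
import numpy as np

def gaussian_lobe(size, center, sigma):
    """2D gaussian lobe with the given spatial constant, normalized to unit volume."""
    ys, xs = np.mgrid[0:size, 0:size]
    g = np.exp(-((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

def dog(size, pos, neg, sigma_pos, sigma_neg):
    """Difference-of-gaussian operator: excitatory lobe minus inhibitory lobe."""
    return gaussian_lobe(size, pos, sigma_pos) - gaussian_lobe(size, neg, sigma_neg)

size, sigma = 64, 4
c = (32, 32)
center_surround = dog(size, c, c, sigma, 2 * sigma)           # concentric lobes
local_oriented = dog(size, (30, 32), (34, 32), sigma, sigma)  # lobes ~1 sigma apart
dipole = dog(size, (20, 32), (44, 32), sigma, sigma)          # nonlocal: 6 sigma apart

# An operator's value for an image is the inner product of the two arrays;
# on a uniform image the balanced lobes cancel, giving a value near zero.
image = np.ones((size, size))
value = float(np.sum(dipole * image))
```

Because each lobe is normalized to unit volume, every operator integrates to zero, so its output measures local contrast rather than mean luminance.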
For each pair of images to be analyzed, we constructed 120 collections of 50 operators each. These feature banks were split into 10 center-surround, 10 local, and 20 nonlocal banks (10 banks each for separations of six and nine times the spatial constant of the lobes) at both scales mentioned above. Once a set of operators was constructed, we applied it to each neutral, frontal image in our data set to assemble the feature values for the starting image. The same operators were then applied to each of the three transformed images so that a value for Pearson's R could be calculated for that set of operators relative to each transformation. The average value of Pearson's R could then be taken across all 16 faces in our set. This process was repeated for all families and scales of operator banks to assess stability. To assess variability, operator banks were once again applied to the neutral, frontal images. This time, the variance in each operator's output was calculated across the population of 16 faces. The results were
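The assessment loop just described can be sketched as follows. Everything here is a stand-in (random "images" and a random linear operator bank replace the Stirling faces and the difference-of-gaussian banks), so only the bookkeeping, not the numbers, reflects the experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
n_faces, n_ops, n_pixels = 16, 50, 64 * 64

bank = rng.standard_normal((n_ops, n_pixels))       # one bank of 50 linear operators
neutral = rng.standard_normal((n_faces, n_pixels))  # neutral, frontal images
# A transformed image per face (e.g., SMILE): the neutral image plus a small change.
smile = neutral + 0.1 * rng.standard_normal((n_faces, n_pixels))

# Stability: Pearson's r between response vectors x and y, averaged over faces.
rs = [np.corrcoef(bank @ neutral[i], bank @ smile[i])[0, 1] for i in range(n_faces)]
mean_stability = float(np.mean(rs))

# Variability: variance of each operator's response across the 16 neutral faces.
responses = bank @ neutral.T                        # shape (n_ops, n_faces)
mean_variance = float(responses.var(axis=1).mean())
```

Repeating this for every bank, family, scale, and transformation yields the averages plotted in Figure 8 and tabulated in Table 1.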
Figure 8: The stability of each feature type (x-axis: lobe separation in multiples of the spatial constant, sigma; y-axis: correlation coefficient) as a function of both the spatial scale of the gaussian lobes (sigma = 4 pixels, sigma = 8 pixels) and various facial transformations (panels: Smile, Speech, and View transforms).
combined and expressed in terms of the mean variance of response and its standard deviation.

4.3 Results

4.3.1 Difference-of-Gaussian Features. Plots depicting the average values of the correlation coefficients (averaged again over all individuals) are presented in Figure 8. We present the measured stability of each kind of operator across three ecologically relevant transformations: SMILE (second image of individuals smiling), SPEECH (second image of individuals speaking), and VIEW (second image of individuals in three-quarters pose). These plots highlight several interesting characteristics of our operators. First, center-surround filters at both scales appear to perform quite well compared to the other features once again. As soon as we move the two gaussians apart to form oriented local operators, however, a sharp dip in stability occurs. This indicates that the two-lobed oriented edge detectors used here provide for comparatively poor stability across all three of the transformations we have examined. That said, as the distance between the lobes of our operators increases further, stability of response also increases. Nonlocality seems to increase stability across all three transformations, nearly reaching the level of center-surround stability at a coarse scale.

Stability, however, is not the only attribute required to perform recognition tasks well. As discussed earlier, a feature that is stable across face transformations is useful only if it is not also stable across images of different individuals. That is, a universal feature is not of any use for recognition
Table 1: Mean ± S.E. of Operator Variance Across Individuals.

                      σ = 4           σ = 8           σ = 16
Center-surround       122.5 ± 3.7     206.6 ± 6.2     311.3 ± 8.5
Local (s = 3)         242.0 ± 9.6     527.0 ± 15.0    986.9 ± 26.7
Nonlocal (s = 6)      378.8 ± 11.4    718.5 ± 17.7    1204.1 ± 29.9
Nonlocal (s = 9)      430.2 ± 11.0    795.4 ± 19.7    1271.7 ± 32.6
because it has no discriminative power. We present next the amount of variability in response for each family of operators (see Table 1). Center-surround operators appear to be the least variable across images of different individuals, while nonlocal operators appear to vary most. All feature types except for the center-surround filters increase in variability as their scale increases, which seems somewhat surprising, as one might expect more dramatic differences in individual appearance to be expressed at a finer scale. Nonetheless, we can see from the combination of these results and the stability results that center-surround and nonlocal operators achieve good recognition performance through different means. Center-surround operators are not so variable from person to person, but make up for it with an extremely stable response to individual faces despite significant transformations. In contrast, nonlocal operators lack the full stability of center-surround operators, but appear to make up for it by being much more variable in response across the population of faces. The local-oriented features rank poorly in terms of both their stability and variability characteristics, thus limiting their usefulness for recognition tasks.

4.4 Discussion. The results of our stability analysis of differential operators reveal two main findings. First, the same features that were discovered to perform the best discrimination between intra- and interpersonal difference vectors in experiment 1 (large center-surround filters and nonlocal operators) and to perform best in a simple recognition system for both faces and objects (experiment 2) also display the greatest combination of stability and variability when confronted with ecologically relevant face transforms. However, the limited stability of local-oriented operators suggests that they may not provide the most useful features for handling these image transforms.
5 Conclusion

We have noted the emergence of large center-surround and nonlocal operators as tools for performing object recognition using simple features and found that both of these operators provide good stability of response across a range of different transforms. These structures differ from receptive field forms known to support sparse encoding of natural scenes, yet
seem to provide a better means of discriminating between individual objects and providing stable responses to image transforms. This suggests that the constraints that govern information-theoretic approaches to image representation may not necessarily be useful for developing representations that can support the recognition of objects in images. In the specific context of faces, do large center-surround fields or nonlocal comparators, on their own, present a viable alternative to performing efficient face recognition? At present, the answer to this question is no. Complex (and truly global) features such as eigenface (Turk & Pentland, 1991) bases provide for higher levels of recognition performance than we expect to achieve using these far simpler features. We note, however, that the discovery of a useful vocabulary of low-level features may aid global recognition techniques like eigenface-based systems. One could easily compute PCA bases on nonlocal and center-surround measurements rather than pixels. The added stability of these operators may help significantly increase recognition performance.

The larger question at stake, however, does not only concern face recognition, despite its being our domain of choice for this study. Of greater interest than building a face recognition engine is learning how one might obtain stability to relevant image transforms given some set of simple measures. Little is known about how one moves from highly selective, small receptive fields in V1 to the large receptive fields in inferotemporal cortex that demonstrate impressive invariance to stimulus manipulations within a particular class. We have introduced here a particular measurement, the dissociated dipole, which represents one example of a very broad space of alternative computations by which limited amounts of invariance might be achieved. Our proposal of nonlocal operators draws support from several studies of human perception.
Indeed, past psychophysical studies of the long-range processing of pairs of lines suggest the existence of similarly structured “coincidence detectors,” which enact non-local comparisons of simple stimuli (Morgan & Regan, 1987; Kohly & Regan, 2000). Further work exploring nonlocal processing of orientation and contrast has more recently given rise to the idea of a “cerebral bus” shuttling information between distant points (Danilova & Mollon, 2003). These detectors could contribute to shape representation, as demonstrated by Burbeck’s idea of encoding shapes via medial “cores” built by integrating information across disparate “boundariness” detectors (Burbeck & Pizer, 1995). Our overarching goal in this work is to redirect the study of nonclassical receptive field structures toward examining the possibility that object recognition may be governed by computations outside the realm of traditional multiscale pyramids, and subject to different constraints from those that guide formulations of image representation based on information theory. The road from V1 to IT (and, computationally speaking, from Gabors and gaussian derivatives to eigenfaces) may contain many surprising image processing tools.
Even within the realm of dissociated dipoles, there are many parameters to explore. For example, the two lobes need not be isotropic or be of equal size and orientation. The lobes could easily take the form of gaussian derivatives rather than gaussians. Given that there are many more parameters that could be introduced to the simple DOG framework, it is possible that even better invariance could be achieved by introducing more degrees of structural freedom. The point is that expanding our consideration to nonlocal operators opens up a large space of possible filters, and systematic exploration of this space, while difficult, may be very rewarding.

Acknowledgments

This research was funded in part by the DARPA HumanID Program and the National Science Foundation ITR Program. B.J.B. is supported by an NDSEG fellowship. P.S. is supported by an Alfred P. Sloan Fellowship in neuroscience and a Merck Foundation Fellowship. We also thank Ted Adelson, Gadi Geiger, Mriganka Sur, Chris Moore, Erin Conwell, and David Cox for many helpful discussions.
References

Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network, 3, 213–251.
Attneave, F. (1954). Some informational aspects of visual perception. Psychol. Rev., 61, 183–193.
Balas, B. J., & Sinha, P. (2003). Dissociated dipoles: Image representation via nonlocal operators. Cambridge, MA: MIT Press.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication. Cambridge, MA: MIT Press.
Bell, A. J., & Sejnowski, T. J. (1997). The “independent components” of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Borenstein, E., & Ullman, S. (2002). Class-specific, top-down segmentation. In Proceedings of the European Conference on Computer Vision (pp. 109–124). Berlin: Springer-Verlag.
Borenstein, E., & Ullman, S. (2004). Learning to segment. In Proceedings of the European Conference on Computer Vision (pp. 315–328). Berlin: Springer-Verlag.
Burbeck, C. A., & Pizer, S. M. (1995). Object representation by cores: Identifying and representing primitive spatial regions. Vision Research, 35(13), 1917–1930.
Chapin, J. K. (1986). Laminar differences in sizes, shapes, and response profiles of cutaneous receptive fields in the rat SI cortex. Exp. Brain Research, 62(3), 549–559.
Danilova, M. V., & Mollon, J. D. (2003). Comparison at a distance. Perception, 32(4), 395–414.
Edelman, S. (1993). Representing 3-D objects by sets of activities of receptive fields. Biological Cybernetics, 70, 37–45.
Fei-Fei, L., Fergus, R., & Perona, P. (2003). A Bayesian approach to unsupervised one-shot learning of object categories. Paper presented at the International Conference on Computer Vision, Nice, France.
Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition.
Fergus, R., Perona, P., & Zisserman, A. (2003). Object class recognition by unsupervised scale-invariant learning. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Freeman, W. T., & Adelson, E. H. (1991). The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9), 891–906.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat’s striate cortex. Journal of Physiology, 148, 574–591.
Julesz, B. (1975). Experiments in the visual perception of texture. Scientific American, 232(4), 34–43.
Kersten, D. (1987). Predictability and redundancy of natural images. J. Opt. Soc. Am. A, 4(12), 2395–2400.
Kohly, R. P., & Regan, D. (2000). Coincidence detectors: Visual processing of a pair of lines and implications for shape discrimination. Vision Research, 40(17), 2291–2306.
Kouh, M., & Riesenhuber, M. (2003). Investigating shape representation in area V4 with HMAX: Orientation and grating selectivities (Rep. AIM-2003-021, CBCL-231). Cambridge, MA: MIT.
Laughlin, S. (1981). A simple coding procedure enhances a neuron’s information capacity. Z. Naturforsch, 36, 910–912.
Marr, D. (1982). Vision. New York: Freeman.
Moghaddam, B., Jebara, T., & Pentland, A. (2000). Bayesian face recognition.
Pattern Recognition, 33(11), 1771–1782.
Morgan, M. J., & Regan, D. (1987). Opponent model for line interval discrimination: Interval and Vernier performance compared. Vision Research, 27(1), 107–118.
Nayar, S. K., Nene, S. A., & Murase, H. (1996). Real-time 100 object recognition system. Paper presented at the ARPA Image Understanding Workshop, Palm Springs, FL.
Nene, S. A., Nayar, S. K., & Murase, H. (1996). Columbia Object Image Library (COIL-100). New York: Columbia University.
Novikoff, A. (1962). Integral geometry as a tool in pattern perception. In H. Foerster & G. Zopf (Eds.), Principles of self-organization. New York: Pergamon.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23), 3311–3325.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1019–1025.
Samaria, F., & Harter, A. (1994). Parametrisation of a stochastic model for human face identification. Paper presented at the Second IEEE Workshop on Applications of Computer Vision, Sarasota, FL.
Schneiderman, H., & Kanade, T. (1998). Probabilistic modeling of local appearance and spatial relationships for object recognition. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA.
Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24, 1193–1216.
Sinha, P. (2002). Qualitative representations for recognition. Lecture Notes in Computer Science, 2525, 249–262.
Turk, M. A., & Pentland, A. P. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.
Ullman, S., Vidal-Naquet, M., & Sali, E. (2002). Visual features of intermediate complexity and their use in classification. Nature Neuroscience, 5(7), 682–687.
Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, Kauai, HI.
Wiskott, L., Fellous, J.-M., Kruger, N., & von der Malsburg, C. (1997). Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 775–779.
Young, E. D. (1984). Response characteristics of neurons of the cochlear nucleus. In C. I. Berlin (Ed.), Hearing science: Recent advances. San Diego, CA: College Hill Press.
Received July 28, 2004; accepted August 9, 2005.
LETTER
Communicated by Sidney Lehky
A Neural Model of the Scintillating Grid Illusion: Disinhibition and Self-Inhibition in Early Vision

Yingwei Yu
[email protected]
Yoonsuck Choe
[email protected]
Department of Computer Science, Texas A&M University, College Station, Texas 77843-3112, U.S.A.
A stationary display of white discs positioned on intersecting gray bars on a dark background gives rise to a striking scintillating effect—the scintillating grid illusion. The spatial and temporal properties of the illusion are well known, but a neuronal-level explanation of the mechanism has not been fully investigated. Motivated by the neurophysiology of the Limulus retina, we propose disinhibition and self-inhibition as possible neural mechanisms that may give rise to the illusion. In this letter, a spatiotemporal model of the early visual pathway is derived that explicitly accounts for these two mechanisms. The model successfully predicted the change of strength in the illusion under various stimulus conditions, indicating that low-level mechanisms may well explain the scintillating effect in the illusion.

1 Introduction

The scintillating grid illusion consists of bright discs superimposed on intersections of orthogonal gray bars on a dark background (see Figure 1A; Schrauf, Lingelbach, & Wist, 1997). In this illusion, illusory dark spots are perceived as scintillating within the white discs. Several important properties of the illusion have been discovered and reported in recent years. (1) The discs that are closer to fixation show less scintillation (Schrauf et al., 1997), which might be due to the fact that receptive fields in the periphery are larger than those in the fovea. As shown in Figure 2, if the periphery of the scintillating grid is correspondingly scaled up, the scintillation effect is diminished. Note that the diminishing effect is not due to the polar arrangement alone, as can be seen in Figure 1B. (2) The illusion is greatly reduced or even abolished both with steady fixation and by reducing the contrast between the constituent grid elements (Schrauf et al., 1997). (3) As the speed of motion is increased (either efferent eye movement or afferent grid movement), the strength of scintillation decreases (Schrauf, Wist, & Ehrenstein, 2000).
Neural Computation 18, 521–544 (2006)   © 2006 Massachusetts Institute of Technology

(4) The presentation duration of the grid also plays a role
Y. Yu and Y. Choe
Figure 1: The scintillating grid illusion and its polar variation. (A) The original scintillating grid illusion is shown (redrawn from Schrauf et al., 1997). (B) A polar variation of the illusion. The scintillating effect is still strong in the polar arrangement (cf. Kitaoka, 2003).
in determining the strength of illusion. The strength first increases when the presentation time is less than about 220 ms, but it slowly decreases once the presentation duration is extended beyond that (Schrauf et al., 2000).

What kind of neural process may be responsible for such a dynamic illusion? The scintillating grid can be seen as a variation of the Hermann grid illusion where the brightness level of the intersecting bars is reduced (a representative example of simultaneous contrast; Gerrits & Vendrik, 1970). The illusory dark spots in the Hermann grid can be explained by a feedforward lateral inhibition mechanism, commonly modeled with difference of gaussian (DOG) filters (Spillmann, 1994). Thus, DOG filters may seem like a plausible mechanism contributing to the scintillating grid illusion. However, they are not sufficient to explain the complexities of the scintillating grid illusion, for the following reasons. (1) The DOG model cannot account for the change in the strength of scintillation over different brightness and contrast conditions (as shown in the experiments by Schrauf et al., 1997). (2) Furthermore, DOG cannot explain the basic scintillation effect, which has a temporal dimension to it. Thus, the feedforward lateral mechanism represented by DOG fails to fully explain the scintillating effect. Anatomical and physiological observations show that the center-surround property in early visual processing may not be strictly feedforward: the process involves a recurrent inhibition, which leads to disinhibition. Moreover, the process also includes self-inhibition, the inhibition of the cell itself. For example, Hartline et al. used Limulus (horseshoe crab) optical cells to demonstrate disinhibition and self-inhibition in the retina
A Neural Model of Scintillating Grid Illusion
Figure 2: A variation without the scintillating effect. The grids toward the periphery are significantly scaled up, which results in the abolishment of the scintillating effect when stared at in the middle (see Raninen & Rovamo, 1987, for a similar input scaling approach to alter perceptual effects). This is because the scintillating grid illusion highly depends on the size of the receptive fields. In the fovea, the receptive field size is small, and in the periphery, the receptive field size is relatively larger. (Note that Kitaoka, 2003, presented a similar plot, but there, the periphery was not significantly scaled up such that the scintillating effect was preserved.)
(Hartline & Ratliff, 1957). Disinhibition and self-inhibition have been discovered in mammals and other vertebrates as well. As for disinhibition, it has been found in the retina of cats (Li et al., 1992; Kolb & Nelson, 1993), tiger salamanders (Roska, Nemeth, & Werblin, 1998), and mice (Frech, Perez-Leon, Wassle, & Backus, 2001). For example, Kolb and Nelson (1993) have shown that the A2 amacrine cells in the cat retina contribute to lateral inhibition among ganglion cells, and they can play a role in disinhibition. With regard to self-inhibition, Hartveit (1999) found that depolarization of a rod bipolar cell in rat retina evokes a feedback response to the same cell, thus indicating that a mechanism similar to those in the Limulus may exist in mammalian vision. Also, Nirenberg and Meister (1997) have shown that transient and motion-sensitive responses in ganglion cells may be produced by self-inhibitory feedback of the amacrine cells in mouse retina (for similar results, see Berry, Brivanlou, Jordan, & Meister, 1999; Victor, 1987; Crevier & Meister, 1998). Computational models also suggested that self-inhibition
may exist in cells sensitive to light-dark contrast (Neumann, Pessoa, & Hanse, 1999). Disinhibition can effectively reduce the amount of inhibition where there is a large area of bright input, and self-inhibition can give rise to oscillations in the response over time. Thus, the combination of those two mechanisms, disinhibition and self-inhibition, may provide an explanation for the intriguing scintillating grid illusion. In this letter, we present a model based on disinhibition and self-inhibition in the early visual pathway to explain the scintillating grid and its various spatiotemporal properties reported by Schrauf et al. (1997, 2000). Our model is, to our knowledge, the first attempt at computationally modeling the scintillating grid illusion at a neuronal level. In the next section, we begin with a brief review of disinhibition and self-inhibition and introduce our model in detail. Next we present the main results, followed by discussion and conclusion.

2 Disinhibition and Self-Inhibition in Early Vision

In this section, we review the basic mechanism of disinhibition and self-inhibition in the Limulus optical cells. The study by Hartline and Ratliff (1957) was the first to show that lateral inhibition exists at the very first stage in visual processing: between optical cells in the Limulus. Furthermore, they showed that the lateral interaction has a nontrivial ripple effect. That is, the final response of a specific cell can be considered as the overall effect of the response from itself and from all other cells directly or indirectly connected to that cell. As a result, the net response of a cell can be enhanced or reduced due to such an inhibitory interaction depending on the surrounding context. For example, inhibition of an inhibitory neuron X will release the target of X from inhibition, thus allowing the latter to fire (or increase its firing rate).
This process is known as disinhibition, and it has been shown that such a mechanism may be more accurate than lateral inhibition alone in explaining subtle brightness-contrast effects (see, e.g., Yu, Yamauchi, & Choe, 2004). Self-inhibition is also found in Limulus optical cells (Ratliff, Hartline, & Miller, 1963). When a depolarizing step input is applied to the cell, a transient peak in firing rate can be observed, which is followed by a rapid decay to a steady rate. The self-inhibition effect is due to synaptic interactions, which produce a negative feedback into the cell itself. The self-inhibition mechanism is illustrated by the process in which cells go from an initial transient peak to a stable equilibrium plateau, which is a form of neural adaptation: each impulse discharge acts recurrently to delay the discharge of the next impulse within the same neuron (Hartline, Wager, & Ratliff, 1956; Hartline & Ratliff, 1957). Together with the feedback to neighboring cells, the feedback to oneself may be essential for explaining the evolution of the temporal dynamics observed in the scintillating grid illusion.
3 A Spatiotemporal Model of Disinhibition and Self-Inhibition

Hartline and his colleagues developed an early computational model of the response in the Limulus retina. The Hartline-Ratliff equation describing disinhibition in Limulus can be summarized as follows (Hartline & Ratliff, 1957, 1958; Stevens, 1964):

r_m = e_m - K_s r_m - \sum_{n \neq m} v_{mn} (r_n - t_{mn}),   (3.1)
where r_m is the final response of the mth ommatidium, K_s its self-inhibition rate, e_m its excitation level, v_{mn} the inhibitory weight from another ommatidium n, and t_{mn} its threshold. When equation 3.1 is used to calculate the evolution of responses in a network of cells, the effect of disinhibition arises. Brodie, Knight, and Ratliff (1978) extended this equation to derive a spatiotemporal filter, where the input was assumed to be a sinusoidal grating. The reason for limiting the input in such a way was to make tractable the explicit calculation of the responses. As a result, the derived model was applicable only to sinusoidal grating inputs. In addition, to model self-inhibition, which gives the temporal property, they replaced the constant self-inhibition rate in equation 3.1 with a time-dependent term (cf. K_s(t) in section 3.1). Their model accounts well for the responses in the Limulus retina, but it is limited to a single spatial frequency channel input. Because of this, their extended model cannot be applied to complex images containing a broader band of spatial frequencies, which is typically the case for visual illusions such as the scintillating grid illusion. In the following section, we extend the Hartline-Ratliff equation using a different strategy to derive a filter that can address these issues while remaining tractable.

3.1 A Simplified Model in One Dimension. Rearranging equation 3.1 by omitting the threshold, we have

(1 + K_s) r_m - \sum_{n \neq m} w_{mn} r_n = e_m,   (3.2)
where wmn is the strength of interaction (either excitatory or inhibitory) from cell m to n.Note that wmn extends the definition of vmn in equation 3.1 to allow excitatory as well as inhibitory contributions. To generalize the model to n inputs, the responses of n cells can be expressed in matrix form as (Yu et al., 2004), Kr = e,
(3.3)
where r is the output vector of size n, e the input vector of size n, and K the n × n weight matrix:

$$ \mathbf{K} = \begin{pmatrix} 1 + K_s(t) & -w(1) & \cdots & -w(n-1) \\ -w(1) & 1 + K_s(t) & \cdots & -w(n-2) \\ \vdots & \vdots & \ddots & \vdots \\ -w(n-1) & -w(n-2) & \cdots & 1 + K_s(t) \end{pmatrix}, \tag{3.4} $$
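To make the matrix formulation concrete, the following sketch builds K for a small one-dimensional network and solves equation 3.3 for the response vector. The lateral weight function here is a hypothetical stand-in (a weak inhibitory Gaussian falloff; the model's actual weight function is the time-dependent DOG introduced in section 3.2). Note the sign convention of equation 3.4: an inhibitory interaction corresponds to a negative w, so −w(d) enters the matrix as a positive coefficient.

```python
import numpy as np

def build_K(n, Ks, w):
    """Weight matrix of equation 3.4: 1 + Ks on the diagonal and
    -w(distance) everywhere else."""
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = 1.0 + Ks if i == j else -w(abs(i - j))
    return K

# Hypothetical lateral weight: inhibition (negative) decaying with distance.
w = lambda d: -0.1 * np.exp(-d**2 / 4.0)

n = 30
e = np.zeros(n)
e[10:20] = 1.0                 # a bright bar on a dark background
K = build_K(n, Ks=0.5, w=w)
r = np.linalg.solve(K, e)      # response vector from K r = e (equation 3.3)
```

Even with this toy weight function, the solution shows the expected lateral-inhibition signature: the response is stronger at the edges of the bright bar than at its center.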
where K_s(t) is the self-inhibition rate at time t, and w(·) is the connection weight, which is a function of the distance between the cells. (Note that unlike in our previous models (Yu et al., 2004; Yu & Choe, 2004a), the introduction of the time-varying term K_s(t) allows the model to have a temporal behavior.) For convenience of calculation, we assume that K_s(t) here approximately equals the self-inhibition rate of a single cell. The exact derivation of K_s(t) is as follows (Brodie et al., 1978):

$$ K_s(t) = \frac{y(t)}{r(t)}, \tag{3.5} $$
where y(t) is the amount of self-inhibition at time t, and r(t) is the response at time t for this cell. We know that the Laplace transform y(s) of y(t) has the following property:

$$ y(s) = r(s)\, T_s(s), \tag{3.6} $$

$$ T_s(s) = \frac{k}{1 + s\tau}, \tag{3.7} $$

where k is the maximum value K_s(t) can reach and τ the time constant. By assuming that the input e(t) is a step input for this cell, the Laplace transform of e(t) can be written as

$$ e(s) = \frac{I_0}{s}, \tag{3.8} $$
where I_0 is a constant representing the strength of the light stimulus. From the definition of y(t), we know that

$$ \frac{dr}{dt} = e(t) - y(t). \tag{3.9} $$
To solve for the response r(t), we can apply the Laplace transform and plug in e(s) and y(s):

$$ r(s) = \left( \frac{I_0}{s} - r(s)\,\frac{k}{1 + s\tau} \right) \frac{1}{s}. \tag{3.10} $$

Solving this equation, we get

$$ r(s) = \frac{I_0}{s}\,\frac{s\tau + 1}{\tau s^2 + s + k}. \tag{3.11} $$
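The algebra behind this step is brief: multiplying equation 3.10 through by s, collecting the r(s) terms, and clearing the compound fraction gives

```latex
s\,r(s) = \frac{I_0}{s} - r(s)\,\frac{k}{1+s\tau}
\;\Longrightarrow\;
r(s)\left(s + \frac{k}{1+s\tau}\right)
  = r(s)\,\frac{\tau s^{2} + s + k}{1+s\tau}
  = \frac{I_0}{s}
\;\Longrightarrow\;
r(s) = \frac{I_0}{s}\,\frac{s\tau + 1}{\tau s^{2} + s + k}.
```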
By substituting r(s) and T_s(s) in equation 3.6 with equations 3.11 and 3.7, we get

$$ y(s) = \frac{k I_0 (s\tau + 1)}{s\,(\tau s^2 + s + k)(1 + s\tau)}. \tag{3.12} $$
Then, by inverse Laplace transform, we can get y(t) and r(t). Finally, the exact expression for K_s(t) can be obtained by evaluating equation 3.5:

$$ K_s(t) = \frac{4k\tau - 1 + (1 - 4k)\,h(t)\cos(\omega t) - 2k\,h(t)\,\omega\sin(\omega t)}{4k\tau - 1 + (1 - 4k)\,h(t)\cos(\omega t) + (4k\tau - 2)\,h(t)\,\omega\sin(\omega t)}, \tag{3.13} $$

where

$$ h(t) = \exp\!\left( -\frac{1}{2\tau}\, t \right), \tag{3.14} $$

and

$$ \omega = \frac{\sqrt{4k\tau - 1}}{2\tau}. \tag{3.15} $$
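The closed form above can also be cross-checked without any Laplace machinery. In the time domain, equations 3.6, 3.7, and 3.9 are equivalent to the pair of differential equations dr/dt = e(t) − y(t) and τ dy/dt = k r(t) − y(t); integrating these under a unit step input (I_0 = 1, an illustrative choice) traces out K_s(t) = y(t)/r(t) directly. A minimal sketch using simple Euler stepping:

```python
import numpy as np

def self_inhibition_rate(k=3.0, tau=0.3, T=10.0, dt=1e-3):
    """Euler-integrate dr/dt = e(t) - y(t) and tau*dy/dt = k*r - y
    under a unit step input, returning K_s(t) = y(t)/r(t)."""
    steps = int(T / dt)
    r, y = 0.0, 0.0
    ks = np.zeros(steps)
    for i in range(steps):
        dr = 1.0 - y                  # e(t) = 1 for t >= 0 (unit step)
        dy = (k * r - y) / tau
        r += dt * dr
        y += dt * dy
        ks[i] = y / r if r > 1e-12 else 0.0
    return ks

ks = self_inhibition_rate()
```

With k = 3 and τ = 0.3, the curve starts near zero and settles at k, matching the behavior plotted in Figure 3.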
An intuitive way of understanding the above expression is to see it as a division of two convolutions,

$$ K_s(t) = \frac{e(t) * g(t)}{e(t) * f(t)}, \tag{3.16} $$

where g(t) and f(t) are impulse response functions derived from the above and * is the convolution operator (see the appendix for details). Figure 3 shows several curves plotting the self-inhibition rate under different parameter conditions. As discovered in Limulus by Hartline and Ratliff (1957, 1958), self-inhibition is strong (k = 3), while lateral contribution is weak
[Figure 3 plot: K_s(t) over time for (k, τ) = (3, 0.3), (3, 0.1), (2, 1.0), and (2, 0.5).]
Figure 3: Self-inhibition rate. The evolution of the self-inhibition rate K s (t) (y-axis) over time (x-axis) is shown for various parameter configurations (see equations 3.5 to 3.7). The parameter k defines the peak value of the curve, and τ defines how quickly the curve converges to a steady state. For all computational simulations in this article, the values k = 3 and τ = 0.3 were used.
(0.1 or less). Hartline and Ratliff (1957, 1958) experimentally determined these values, leaving τ as a free parameter. Now we have a model for the response of cells arranged in one dimension, but for realistic visual inputs, we need a 2D model. In the following section, we provide details about extending the 1D model above to 2D.

3.2 Extending the Model to Two Dimensions. The 1D model can easily be extended to two dimensions. We simply serialize the input and output matrices into vectors to fit the 1D model we have. The weight matrix K can then be defined to represent the weight K_ij from cell j to cell i based on their distance in 2D, following the DOG distribution (Marr & Hildreth, 1980):

$$ K_{ij} = \begin{cases} 1 + K_s(t) & \text{when } i = j \\ -w(|i, j|) & \text{when } i \neq j \end{cases}, \tag{3.17} $$

$$ w(x) = k_c \exp\!\left( -\frac{x^2}{\sigma_c^2} \right) - k_s \exp\!\left( -\frac{x^2}{\sigma_s^2} \right), \tag{3.18} $$
where |i, j| is the Euclidean distance between cells i and j; k_c and k_s are the scaling constants that determine the relative scale of the excitatory and inhibitory distributions, set to 1/(√(2π) σ_c) and 1/(√(2π) σ_s); and σ_c and σ_s their
widths. The σ values were indirectly specified as a fraction of the receptive field size ρ: σ_c = ρ/24 and σ_s = ρ/6. Finally, the response vector r can be derived from equation 3.3 as follows (Yu et al., 2004):

$$ \mathbf{r} = \mathbf{K}^{-1} \mathbf{e}, \tag{3.19} $$

and we can apply reverse serialization to get the vector r back into 2D matrix form. Figure 4 shows a single row of the weight matrix K, corresponding to the weight matrix (when reverse serialized) of a cell in the center of the 2D retina, at various time points. The plot shows that the cell in the center can be influenced by inputs from locations far away, outside its classical receptive field area. The initial state shown in Figure 4A looks bumpy (and somewhat random), but on closer inspection we can observe concentric rings of ridges, as in Figure 4B. (The apparent bumpiness along the ridges is due to the aliasing effect caused by the square boundary of the receptive field.) Another noticeable feature is that the spatial extent of excitation and inhibition shrinks over time (from Figure 4A to 4F). This may seem inconsistent with the notion of a persisting inhibitory surround, but in fact the spatiotemporal properties of on-off receptive fields show such a diminishing lateral influence over time (in retinal ganglion cells, Jacobs & Werblin, 1998; and also in the lateral geniculate nucleus, Cai, DeAngelis, & Freeman, 1997).

4 Methods

To match the behavior of the model to psychophysical data, we need to measure the degree of the illusory effect in the scintillating grid. Measuring the strength of the overall scintillation effect is difficult because it involves many potential factors, such as the change in the perceived brightness of the discs over time and the perceived number of scintillating dark spots. (In fact, in Schrauf et al., 1997, 2000, subjects were simply asked to report the strength of the illusion on a scale of 1 to 5 without any reference to these various factors.) For our computational experiments, one simple yet meaningful measure of the strength of scintillation can be derived based on the changes in the perceived contrast.
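The response computation of section 3.2, which underlies all the measurements described here, can be sketched as follows. The σ and scaling constants follow the text (σ_c = ρ/24, σ_s = ρ/6, k_c = 1/(√(2π) σ_c), k_s = 1/(√(2π) σ_s)); the 15 × 15 retina, the luminance values, and the fixed K_s snapshot are illustrative stand-ins for the 30 × 30 stimulus and time-varying K_s(t) used in the actual experiments.

```python
import numpy as np

def dog_weight(x, rho):
    """Difference-of-gaussians lateral weight, equation 3.18, with
    sigma_c = rho/24, sigma_s = rho/6 and scaling constants as in the text."""
    sc, ss = rho / 24.0, rho / 6.0
    kc = 1.0 / (np.sqrt(2.0 * np.pi) * sc)
    ks = 1.0 / (np.sqrt(2.0 * np.pi) * ss)
    return kc * np.exp(-x**2 / sc**2) - ks * np.exp(-x**2 / ss**2)

def response_2d(image, Ks, rho):
    """Serialize a 2D input, build K (equation 3.17), solve K r = e
    (equation 3.19), and reverse-serialize the result."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    K = -dog_weight(dist, rho)          # off-diagonal entries: -w(|i, j|)
    np.fill_diagonal(K, 1.0 + Ks)       # diagonal entries: 1 + K_s(t)
    r = np.linalg.solve(K, image.ravel().astype(float))
    return r.reshape(h, w)

# A toy 15 x 15 grid element: dark background, gray bars, bright disc.
img = np.full((15, 15), 10.0)
img[6:9, :] = 50.0
img[:, 6:9] = 50.0                      # gray bars crossing the image
img[6:9, 6:9] = 100.0                   # bright "disc" at the intersection
out = response_2d(img, Ks=0.05, rho=6.0)
```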
More specifically, we are interested in the change over time in the relative contrast of the disc versus the gray bars:

$$ S(t) = C(t) - C(0), \tag{4.1} $$

where S(t) is the perceived strength t time units from the last eye movement or from the time of initial presentation of the scintillating grid stimulus (time t in
[Figure 4 panels A to F: filter snapshots at t = 0.007, 0.031, 0.062, 0.124, 0.2452, and 0.895.]
Figure 4: Disinhibition filter at various time points. The filter (i.e., the connection weights) of the central optical cell shown at different time steps (k = 3, τ = 0.3, ρ = 20). The self-inhibition rate evolved over time as follows: (A) K_s(t) = 0.03, (B) K_s(t) = 0.15, (C) K_s(t) = 0.29, (D) K_s(t) = 0.54, (E) K_s(t) = 0.99, and (F) K_s(t) = 2.59. Initially, a large ripple extends over a long distance from the center (beyond the classical receptive field), but as time goes on, the long-range influence diminishes. In other words, the effective receptive field size is reduced over time due to the change in the self-inhibition rate. (Note that the visual field size shown above is 41 × 41, for better visualization, compared to 30 × 30 in the actual experiments.)
our model is on an arbitrary scale) and C(t) is the contrast between the disc and the gray bars in the center row of the response matrix:

$$ C(t) = \frac{R_{\text{disc}}(t) - R_{\text{min}}(t)}{R_{\text{bar}}(t) - R_{\text{min}}(t)}, \tag{4.2} $$
where R_disc(t) is the response at the center of the disc region, R_bar(t) the response at the center of either of the gray bar regions, and R_min(t) the minimum response in the output at time t. In other words, the perceived strength of illusion S(t) is defined as the relative disc-to-bar contrast at time t compared to its initial value at time 0. Using this measure, in the experiments below, we tested our model under various conditions, mirroring those in Schrauf et al. (1997, 2000). In all calculations, the effect of illusion was measured on an image consisting of a single isolated grid element of size 30 × 30 pixels. The disc at the center had a diameter of 8, and the bars had a width of 6. The model parameters k = 3 and τ = 0.3 were fixed throughout all experiments, and so was the luminance pattern, where the background was set to 10, the gray bar to 50, and the white disc to 100, unless stated otherwise. Depending on the experimental condition under consideration, the model parameters (receptive field size ρ) and/or the stimulus conditions (such as the duration of exposure to the stimulus and/or the brightness of different components of the grid) were varied. The units of the receptive field size, the width of the bar, and the diameter of the disc were all expressed in pixels on the receptor surface, where each pixel corresponds to one photoreceptor. The details of the variations are provided in section 5.

5 Experiments and Results

5.1 Experiment 1: Perceived Brightness as a Function of Receptive Field Size. In the scintillating grid illusion, the scintillating effect is most strongly present in the periphery of the visual field. As we stated earlier, this may be due to the fact that the receptive field size is larger in the periphery than in the fovea, thus matching the scale of the grid. If there is a mismatch between the scale of the grid and the receptive field size, the illusory dark spot will not appear.
For example, in Figure 2, the input is scaled up in the periphery, creating a mismatch between the peripheral receptive field size and the scale of the grid. As a result, the scintillating effect is abolished. Conversely, if the receptive field is reduced in size with no change to the input, the perceived scintillation will diminish (as happens in the center of gaze in the original scintillating grid; see Figure 1). To verify this point, we tested our model with different receptive field sizes while the input grid size was fixed. As shown in Figure 5A, small receptive fields result in almost no darkening effect in the white disc.
Figure 5: Response under various receptive field sizes. The response of our model to a single grid element in the scintillating grid is shown under various receptive field sizes at t = 0.01. (A-E) The responses of the model when the receptive field size was increased from ρ = 3 to 6, 9, 12, and 15. Initially, the disc in the center is bright (simulating the fovea), but as ρ increases, it becomes darker (simulating the periphery). (F) The relative brightness level of the central disc compared to the gray bar, C(t), is shown (see equation 4.2). The contrast decreases as ρ increases, indicating that the disc in the center becomes relatively darker than the gray bar region. The contrast drops abruptly until around ρ = 6 and then gradually decreases. (G) The normalized responses of the horizontal cross sections of A to E are shown. For normalization, the darkest part and the gray bar region of the horizontal cross section were scaled between 0.0 and 1.0. When ρ is small (= 3), the disc in the center is very bright (the plateau in the middle of the black ribbon), but it becomes dark relative to the gray bars as ρ increases (white ribbon).
As the receptive field size grows, the dark spot becomes more prominent (see Figures 5B to 5E). The cross sections of Figures 5A to 5E are shown in Figure 5G. Figure 5F shows the disc-to-bar contrast C(t) over different receptive field sizes ρ, where a sudden drop in contrast can be observed around ρ = 6. Note that at this point, C(t) is still above 1.0, suggesting that the disc in the center is brighter than the gray bars. However, C(t) is not an absolute measure of perceived brightness, since it does not take into account the vividly bright halo around the disc (already visible in Figure 5B). Thus, our interpretation that a low C(t) (close to 1.0 or below) signals a perceived dark spot may be reasonable.
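For reference, the contrast and strength measures of equations 4.1 and 4.2 are straightforward to compute from response matrices. The sketch below uses two hypothetical response snapshots in place of real model output; the disc and bar column indices are illustrative only.

```python
import numpy as np

def contrast(resp, disc_idx, bar_idx):
    """Disc-to-bar contrast C(t) of equation 4.2, computed on the
    center row of a response matrix."""
    row = resp[resp.shape[0] // 2]
    rmin = resp.min()
    return (row[disc_idx] - rmin) / (row[bar_idx] - rmin)

def scintillation_strength(responses, disc_idx, bar_idx):
    """S(t) = C(t) - C(0), equation 4.1, over a time-indexed sequence
    of response matrices."""
    c = [contrast(resp, disc_idx, bar_idx) for resp in responses]
    return np.array(c) - c[0]

# Two hypothetical snapshots: the disc response rises over time
# relative to the bar, as in a single scintillation.
t0 = np.zeros((5, 5)); t0[2, 0] = 1.0; t0[2, 2] = 1.2
t1 = np.zeros((5, 5)); t1[2, 0] = 1.0; t1[2, 2] = 2.0
S = scintillation_strength([t0, t1], disc_idx=2, bar_idx=0)
```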
In sum, these results may explain why there is no scintillating effect in Figure 2. In the original configuration, the peripheral receptive fields were large enough to give rise to the dark spot; in the new configuration, they are not, and thus no dark spot can be perceived.

5.2 Experiment 2: Perceived Brightness as a Function of Time. In this experiment, the response of the model at different time steps was measured.
[Figure 6 panels A to E: response snapshots at t = 0.01, 0.1, 0.8, 1.6, and 10.0; F: C(t) over time; G: normalized cross sections.]
Figure 6: Response at various time points. The response of the model to an isolated scintillating grid element is shown over time. The parameters used for this simulation were receptive field size ρ = 6 (representing the periphery), k = 3, and τ = 0.3. The plots demonstrate a single blinking effect of the white disc. (A) In the beginning, when the self-inhibition rate is small, the illusory dark spot can be seen in the central disc (K_s(t) = 0.0495). (B-E) As time goes on, the illusory dark spot disappears as the self-inhibition rate increases: K_s(t) = 0.4521, 2.4107, 3.2140, and 3, respectively. (F) The relative brightness level of the central disc compared to the gray bar, C(t), is shown (see equation 4.2). The results demonstrate an increase in the relative perceived brightness of the center disc as time progresses. (G) The normalized responses of the horizontal cross sections of A to E are shown. Normalization was done as described in Figure 5G. In the beginning (t = 0.01), the disc region in the middle is almost level with the flanking gray bar region (white ribbon near the bottom). However, as time goes on, the plateau in the middle rises, signifying that the disc in the center is perceived as becoming brighter.
In Figures 6A to 6E, five snapshots are shown. In the beginning, the dark spot can clearly be observed in the center of the disc, but as time goes on, it gradually becomes brighter. Figure 6F plots the relative brightness of the disc compared to the bars as a function of time, showing a rapid increase to a steady state. Such a transition from dark to bright corresponds to a single scintillation. (Note that the opposite effect, bright to dark, is achieved by the refreshing of the neurons via saccades.) Figure 6G shows the actual response level in a horizontal cross section of the response matrices shown in Figures 6A to 6E. Initially, the response to the disc area, shown as the sunken plateau in the middle, is relatively low compared to that to the gray bars, represented by the flanking areas (bottom trace, white ribbon). However, as time passes, the difference in response between the two areas dramatically increases (top trace, black ribbon). Again, the results show a clear transition from the perception of a dark spot to that of a bright disc.

5.3 Experiment 3: Strength of Scintillation as a Function of Luminance. The strength of the perceived illusion can be affected by changes in the luminance of the constituent parts of the scintillating grid, such as the gray bars, the discs, and the dark background (see Figures 7A and 7C; Schrauf et al., 1997). Figures 7B and 7D show the variation in response of our model under such stimulus conditions. Our results show a close similarity to the experimental results of Schrauf et al. (1997). As the luminance of the gray bar increases, the strength of illusion increases, but after reaching about 40% of the disc brightness, the strength gradually declines (see Figure 7B), consistent with the experimental results (see Figure 7A). Such a decrease is due to disinhibition and cannot be explained by DOG (Yu et al., 2004).
When the luminance of the disc was increased, the model (see Figure 7D, right) demonstrated an increase in the scintillating effect similar to that in the human experiment (see Figure 7C, right). When the disc has a luminance lower than that of the bar, a Hermann grid illusion occurs (Schrauf et al., 1997). Both the human data (see Figure 7C, left) and the model results (see Figure 7D, left) showed an increase in the Hermann grid effect as the disc became darker. Note that disinhibition plays an important role here, especially for the bar luminance experiments (see Figures 7A and 7B). In standard DOG, which lacks the recurrent inhibitory interaction, the illusory effect would monotonically increase with the luminance of the gray bars. However, with disinhibition, the increasing illusory effect reaches a critical point and then declines. (See section 6 for details.)

5.4 Experiment 4: Strength of Scintillation as a Function of Motion Speed and Presentation Duration. As we have seen above, the scintillating effect has both a spatial and a temporal component. Combining these two may give rise to a more complex effect. Schrauf et al. (2000) demonstrated that such an effect in fact exists. They conducted experiments under
Figure 7: Strength of scintillation under various luminance conditions. (A) Mean rated strength of scintillation in human experiments is shown as a function of bar luminance (Schrauf et al., 1997). (B) Scintillation effects in the model shown as a function of bar luminance. (C) Mean rated strength of scintillation in human experiments is plotted as a function of disc luminance (Schrauf et al., 1997). The plot shows results from two separate experiments: the Hermann grid on the left and the scintillating grid on the right. (D) The Hermann grid and scintillation effects in the model are plotted as functions of disc luminance. Under both conditions, the model results closely resemble those in human experiments. For B and D, the strength of the scintillation effect in the model was calculated as S = C(∞) − C(0), where C(∞) is the steady-state value of C(t) (see equation 4.1). The illusion strength for the Hermann grid portion in D was calculated as S = 1/C(∞) − 1. The reciprocal was used because in the Hermann grid the intersection is darker than the bars, whereas in the scintillating grid it is the other way around (the disc is brighter than the bars).
three conditions: (1) smooth pursuit movements executed across a stationary grid (efferent condition), (2) grid motion at an equivalent speed while the eyes are held stationary (afferent condition), and (3) brief exposure of a stationary grid while the eyes remained stationary. For conditions 1 and 2,
Figure 8: Strength of scintillation under varying speed and presentation duration. (A) Mean rated strength of the illusion as a function of the speed of stimulus movement (Schrauf et al., 2000). (B) Scintillation effect as a function of the speed of motion v in the model. The receptive field size was 6, and the strength of scintillation was calculated as S(v^-1) = C(v^-1) − C(0). (C) Mean rated strength of the illusion as a function of the duration of exposure (Schrauf et al., 2000). (D) Scintillation effect as a function of presentation duration t in the model. The receptive field size was 6, and the strength of scintillation was computed as S(t) = C(t) − C(0). In both cases (A-B and C-D), the curves show a very similar trend.
both afferent and efferent motion produced very similar results: the strength of scintillation gradually decreased as the speed of motion increased (see Figure 8A). For condition 3, the strength of illusion abruptly increased, peaked at around 200 ms, and then slowly decreased (see Figure 8C). We tested our model under these conditions to verify whether the temporal dynamics induced by self-inhibition can accurately account for the experimental results. First, we tested the model when either the input or the eye was moving (assuming that conditions 1 and 2 above are equivalent). In our experiments,
instead of directly moving the stimulus, we estimated the effect of motion in the following manner. Let v be the corresponding speed of motion, either afferent or efferent. From this, we can calculate the amount of time elapsed before the stimulus (or the eye) moves on to a new location. For a unit distance, the elapsed time t is simply the inverse of the motion speed v; thus, the effect of illusion can be calculated as S(v^-1). Figure 8B shows the results from our model, which closely reflect the experimental results in Figure 8A. Note that we used a single flash because we modeled the input as a single step function (see equation 3.8). The experiment may therefore have differed slightly from the real one, where grids come in and out of view, corresponding to multiple step inputs. However, most of the perceived effect could be accounted for under this simplifying assumption, as shown in the results, indicating that the perceived dynamics within a single grid element can play an important role in determining the overall effect under moving conditions. Next, we tested the effect of stimulus flash duration on our model's behavior. Figure 8D shows our model's prediction of the brightness as a function of the presentation duration. In this case, given a duration d, the strength of illusion can be calculated as S(d). The perceived strength initially increases abruptly up to around t = 1.5 and then slowly decreases until it reaches a steady level. Again, the computational results closely reflect those in human experiments (see Figure 8C). The initial increase might be due to the fact that the presentation time is within the time period required for one scintillation, and the slow decline may be due to no new scintillation being produced after the first cycle while the eyes were fixated, so that the overall perception of scintillating strength declines.
Again, this experiment is an approximation of the real condition, but we can reasonably assume that the effect at the end of the input flash duration is what is perceived. For example, consider turning off the stimulus at different points in time in Figure 6F (which in fact shows a close similarity to Figure 8D). Finally, note that because of the relationship between the speed of motion and elapsed time mentioned above, the data presented in Figures 8B and 8D are identical (i.e., from the same simulation), except for the appropriate transformation of the x-axis. This is why we see the small wiggle at the beginning (on the left) of Figure 8B, which corresponds to the peak near t = 1.5 and the slight decrease to a steady state in Figure 8D (on the right). In summary, our model based on disinhibition and self-inhibition was able to accurately replicate experimental data under various temporal conditions.

6 Discussion

The main contribution of this work was to provide, to our knowledge, the first neurophysiologically grounded computational model to replicate the scintillating grid illusion. We have demonstrated that disinhibition and
self-inhibition are sufficient mechanisms to explain a broad range of spatiotemporal phenomena observed in psychophysical experiments with the scintillating grid. The DOG filter failed to account for the change in the strength of scintillation because it incorporates neither the disinhibition mechanism nor the dynamics of self-inhibition. Disinhibition can effectively reduce the amount of inhibition when there is a large area of bright input (Yu et al., 2004). Therefore, a DOG filter without a disinhibition mechanism cannot explain why the dark illusory spots in the scintillating grid are perceived to be much darker than those in the Hermann grid. The reason is that the DOG filter predicts that the white bars in the Hermann grid should inhibit its intersection more strongly than the gray bars in the scintillating grid inhibit its disc. Thus, according to DOG, the intersection in the Hermann grid should appear darker than that in the scintillating grid, which is contrary to fact. However, with a disinhibition mechanism, since disinhibition is stronger in the Hermann grid than in the scintillating grid (because the bars are brighter in the Hermann grid, there is more disinhibition), the inhibition at the center of the Hermann grid is weaker than that in the scintillating grid. Thus, due to disinhibition, the center appears brighter (because of weaker inhibition) in the Hermann grid than in the scintillating grid. Regarding the issue of dynamics, the lack of a self-inhibition mechanism in the DOG filter causes it to fail to explain the temporal properties of the scintillation. There are certain issues with our model that may require further discussion. In our simulations, we used a step input with an abrupt stimulus onset. In a usual viewing condition, the scintillating grid as a whole is presented, and when the gaze moves around, the scintillating effect is generated.
All the while, the input is continuously present, without any discontinuous stimulus onset. Thus, the difference in the mode of stimulus presentation could be a potential issue. However, as Schrauf et al. (2000) observed, what causes the scintillation effect is not the saccadic eye movement per se, but the transient stimulation that the movement brings about. Thus, such a transient stimulation can be modeled as a step input, and the results of our model may well be an accurate reflection of the real phenomena. Another concern is about the way we measured the strength of the scintillation effect in the model. In our model, we were mostly concerned about the change in the perceived brightness of the disc over time (see equation 4.1), whereas in psychophysical experiments, other measures of the effect have been incorporated, such as the perceived number of scintillating dark spots (Schrauf et al., 2000). However, one observation is that the refresh rate of the stimulus depends on the number of saccades in a given amount of time. Considering that a single saccade triggers an abrupt stimulus onset, we can model multiple saccades as a series of step inputs in our simulations. Since our model perceives one scintillation per stimulus onset, the frequency of flickering reported in the model can be modulated exactly by changing the number of stimulus onsets in our simulations. A related issue is the use
of a single grid element (instead of a whole array) in our experiments. It may seem that the scintillation effect would require at least a small array (say, 2 × 2) of grid elements. However, as McAnany and Levine (2004) have shown, even a single grid element can elicit the scintillating effect quite robustly; thus, the stimulus condition in our simulations may be sufficient to model the target phenomenon. The model also makes a couple of interesting predictions (both brought to our attention by Rufin VanRullen). The first prediction is that a scintillating effect will occur only in an annular region of the visual field surrounding the fixation point, where the size of the receptive field matches that of the grid element. However, this does not seem to be the case under usual viewing conditions. Our explanation for this apparent shortcoming of the model is that the usual scintillating grid image is not large enough to extend beyond the outer boundary of the annular region. Our explanation can be tested in two ways: measure the strength of illusion with (1) a very large scintillating grid image in which the grid element size remains the same, or (2) the usual-sized image with a reduced grid element size (similar in manner to the input-scaling approach of Raninen & Rovamo, 1987). We expect that the annular region will become visible in both cases, with no scintillating effect observed beyond its outer boundary. The second prediction is that the scintillation should be synchronous, because the neurons responding to each grid element follow the same time course. Again, this is quite different from our perceived experience, which is more asynchronous. In our observation, the asynchrony is largely due to the random nature of eye movements. If that is true, the scintillating effect should become synchronous when eye movement is suppressed.
That is, if we fixate on one location of the scintillating grid while the stimulus is turned on and off periodically (alternatively, we can blink our eyes to simulate this), the illusory dark spots should all seem to appear at the same time, in a synchronous manner. (In fact, this seems to be the case in our preliminary experiments.) Then why is our experience asynchronous? The reason we perceive the scintillation as asynchronous may be that when we move our gaze from point X to point Y in a long saccade, first the region surrounding X, and then, later, the region surrounding Y scintillates. This gives the impression that the scintillating effect is asynchronous. In sum, the predictions of the model are expected to be consistent with experiments under conditions similar to those in the computational simulations. Further psychophysical experiments may have to be conducted to test the model predictions more rigorously. Besides the technical issues discussed above, more fundamental questions need to be addressed. Our model was largely motivated by the pioneering work of Hartline et al. in the late 1950s. However, the animal model they used was the Limulus, an invertebrate with compound eyes; thus, the applicability of our extended model to human visual phenomena may
be questionable. However, disinhibition and self-inhibition, the two main mechanisms in the Limulus, have also been discovered in mammals and other vertebrates. Mathematically, the recurrent inhibitory influence in the disinhibition mechanism and the self-inhibitory feedback are the same in both the Limulus and mammals. Therefore, our model based on the Limulus may generalize to human vision. Finally, an important question is whether our bottom-up model accounts for the full range of phenomena in the scintillating grid illusion. Why should the scintillating effect originate only from such a low level in the visual pathway? In fact, recent experiments have shown that part of the scintillating effect can arise from top-down, covert attention (VanRullen & Dong, 2003). The implication of VanRullen and Dong's study is that although the scintillation effect can originate in the retina, it can be modulated by later stages in the visual hierarchy. This is somewhat expected, because researchers have found that receptive field properties (which may include the size) can be effectively modulated by attention (for a review, see Gilbert, Ito, Kapadia, & Westheimer, 2000). It is unclear exactly how such a mechanism can affect brightness-contrast phenomena that depend on receptive field size at such a low level; this may require further investigation. Schrauf and Spillmann (2000) also pointed out a possible involvement of a later stage by studying the illusion in stereo depth. But as they admitted, the major component of the illusion may be retinal in origin. Regardless of these issues, modeling spatiotemporal properties at the retinal level may be worthwhile, serving as a firm initial stepping-stone on which a more complete theory can be constructed.
7 Conclusion

In this letter, we presented a neural model of the scintillating grid illusion based on disinhibition and self-inhibition in early vision. These two neurophysiologically inspired mechanisms were found to be sufficient to explain the multifaceted spatiotemporal properties of the modeled phenomena. We expect that our model can be extended to accommodate recent results that indicate a higher-level involvement in the illusion, such as that of attention.
Appendix: Derivation of Self-Inhibition Rate K_s(t)

The exact formula for K_s(t) can be derived as follows:

$$ K_s(t) = \frac{y(t)}{r(t)}, \qquad (A.1) $$
A Neural Model of Scintillating Grid Illusion
541
where y(t) is the amount of self-inhibition at time t and r(t) the response at time t for this cell. We know that the Laplace transform r(s) of r(t) has the following property:

$$ s\,r(s) = e(s) - r(s)\,T_s(s), \qquad (A.2) $$

$$ T_s(s) = \frac{k}{1 + \tau s}, \qquad (A.3) $$

where T_s(s) is a transfer function, k the maximum value K_s(t) can reach, τ the time constant, and e(s) the Laplace transform of the step input of this cell:

$$ e(s) = \frac{1}{s}. \qquad (A.4) $$

By rearranging equation A.2, we can solve for r(s) to obtain

$$ r(s) = e(s)\,\frac{1}{s + T_s}. \qquad (A.5) $$
Therefore, r(t) can be treated as the step input function e(t) convolved with an impulse response function,

$$ r(t) = e(t) * f(t), \qquad (A.6) $$

where * is the convolution operator, and

$$ f(t) = \mathcal{L}^{-1}\!\left[ \frac{1}{s + T_s} \right], \qquad (A.7) $$

where \mathcal{L}^{-1} is the inverse Laplace transform operator. Solving equation A.7, we get f(t) as a superposition of two exponential functions:

$$ f(t) = \frac{1}{C}\left[ C_1 \exp(C_2 t) + C_2 \exp(C_1 t) \right], \qquad (A.8) $$

where C = \sqrt{1 - 4\tau k}, C_1 = (C + 1)/2, and C_2 = (C - 1)/2. The function y(t) can also be obtained in a similar manner as shown above:

$$ y(s) = r(s)\,T_s(s). \qquad (A.9) $$
(A.9)
Figure 9: Impulse response functions f(t) and g(t).
By substituting r(s) with the right-hand side of equation A.5, we have

$$ y(s) = e(s)\,\frac{T_s}{s + T_s}. \qquad (A.10) $$

Therefore, y(t) can also be treated as the step input function e(t) convolved with an impulse response function g(t) in the time domain:

$$ y(t) = e(t) * g(t), \qquad (A.11) $$

where g(t) is a sine-modulated exponentially decaying function:

$$ g(t) = \mathcal{L}^{-1}\!\left[ \frac{T_s}{s + T_s} \right] = 6\sqrt{5}\,\exp(-5t)\,\sin(\sqrt{5}\,t). \qquad (A.12) $$

Hence, the final form of K_s(t) can be calculated as a division of two convolutions as follows:

$$ K_s(t) = \frac{e(t) * g(t)}{e(t) * f(t)}. \qquad (A.13) $$
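The division of convolutions in equation A.13 is easy to check numerically. The sketch below is illustrative only: the parameter values τ = 0.1 and k = 3 are assumptions chosen so that T_s(s) = k/(1 + τs) reproduces the printed g(t) of equation A.12; the values actually used in the simulations are not stated in this excerpt.

```python
import numpy as np

# Assumed parameters: tau = 0.1 and k = 3 make T_s(s) = k / (1 + tau*s)
# yield g(t) = 6*sqrt(5)*exp(-5t)*sin(sqrt(5)*t), matching equation A.12.
tau, k = 0.1, 3.0
dt = 1e-3
t = np.arange(0.0, 4.0, dt)

e = np.ones_like(t)  # unit step input, e(s) = 1/s (equation A.4)
g = 6.0 * np.sqrt(5.0) * np.exp(-5.0 * t) * np.sin(np.sqrt(5.0) * t)

# y(t) = (e * g)(t), equation A.11, by discrete convolution.
y = np.convolve(e, g)[: len(t)] * dt

# Equation A.2 in the time domain is r'(t) = e(t) - y(t) with r(0) = 0,
# since y(s) = r(s) T_s(s) (equation A.9); integrate it directly.
r = np.cumsum(e - y) * dt

K_s = y[1:] / r[1:]  # equation A.1 / A.13, skipping t = 0 where r = 0
print(K_s[-1])       # saturates near k
```

With these assumed values, K_s(t) rises from zero and saturates near k, consistent with the description of k as the maximum value K_s(t) can reach (by the final value theorem, y(t) → 1 and r(t) → 1/k).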
Figure 9 shows the impulse response functions f(t) and g(t). The exact formula in equation 3.13 follows from the above derivation.

Acknowledgments

We thank Takashi Yamauchi, Rufin VanRullen, and an anonymous reviewer for helpful discussions and Jyh-Charn Liu for his support. This research
was funded in part by the Texas Higher Education Coordinating Board ATP grant 000512-0217-2001 and by the National Institute of Mental Health Human Brain Project, grant 1R01-MH66991. A preliminary version of the material presented here appeared in Yu and Choe (2004b) as an abstract.

References

Berry II, M. J., Brivanlou, I. H., Jordan, T. A., & Meister, M. (1999). Anticipation of moving stimuli by the retina. Nature, 398, 334–338.
Brodie, S., Knight, B. W., & Ratliff, F. (1978). The spatiotemporal transfer function of the Limulus lateral eye. Journal of General Physiology, 72, 167–202.
Cai, D., DeAngelis, G. C., & Freeman, R. D. (1997). Spatiotemporal receptive field organization in the lateral geniculate nucleus of cats and kittens. Journal of Neurophysiology, 78, 1045–1061.
Crevier, D. W., & Meister, M. (1998). Synchronous period-doubling in flicker vision of salamander and man. Journal of Neurophysiology, 79, 1869–1878.
Frech, M. J., Perez-Leon, J., Wassle, H., & Backus, K. H. (2001). Characterization of the spontaneous synaptic activity of amacrine cells in the mouse retina. Journal of Neurophysiology, 86, 1632–1643.
Gerrits, H. J., & Vendrik, A. J. (1970). Simultaneous contrast, filling-in process and information processing in man’s visual system. Experimental Brain Research, 26, 411–430.
Gilbert, C., Ito, M., Kapadia, M., & Westheimer, G. (2000). Interactions between attention, context and learning in primary visual cortex. Vision Research, 40, 1217–1226.
Hartline, H. K., & Ratliff, F. (1957). Inhibitory interaction of receptor units in the eye of Limulus. Journal of General Physiology, 40, 357–376.
Hartline, H. K., & Ratliff, F. (1958). Spatial summation of inhibitory influences in the eye of Limulus, and the mutual interaction of receptor units. Journal of General Physiology, 41, 1049–1066.
Hartline, H. K., Wagner, H., & Ratliff, F. (1956). Inhibition in the eye of Limulus. Journal of General Physiology, 39, 651–673.
Hartveit, E. (1999).
Reciprocal synaptic interactions between rod bipolar cells and amacrine cells in the rat retina. Journal of Neurophysiology, 81, 2932–2936.
Jacobs, A. L., & Werblin, F. S. (1998). Spatiotemporal patterns at the retinal output. Journal of Neurophysiology, 80, 447–451.
Kitaoka, A. (2003). Trick eyes 2. Tokyo: Kanzen.
Kolb, H., & Nelson, R. (1993). Off-alpha and off-beta ganglion cells in the cat retina. Journal of Comparative Neurology, 329, 85–110.
Li, C. Y., Zhou, Y. X., Pei, X., Qiu, F. T., Tang, C. Q., & Xu, X. Z. (1992). Extensive disinhibitory region beyond the classical receptive field of cat retinal ganglion cells. Vision Research, 32, 219–228.
Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal Society of London B, 207, 187–217.
McAnany, J. J., & Levine, M. W. (2004). The blanking phenomenon: A novel form of visual disappearance. Vision Research, 44, 993–1001.
Neumann, H., Pessoa, L., & Hansen, T. (1999). Interaction of on and off pathways for visual contrast measurement. Biological Cybernetics, 81, 515–532.
Nirenberg, S., & Meister, M. (1997). The light response of retinal ganglion cells is truncated by a displaced amacrine circuit. Neuron, 18, 637–650.
Raninen, A., & Rovamo, J. (1987). Retinal ganglion-cell density and receptive-field size as determinants of photopic flicker sensitivity across the human visual field. Journal of the Optical Society of America A, 4, 1620–1626.
Ratliff, F., Hartline, H. K., & Miller, W. H. (1963). Spatial and temporal aspects of retinal inhibitory interaction. Journal of the Optical Society of America, 53, 110–120.
Roska, B., Nemeth, E., & Werblin, F. (1998). Response to change is facilitated by a three-neuron disinhibitory pathway in the tiger salamander retina. Journal of Neuroscience, 18, 3451–3459.
Schrauf, M., Lingelbach, B., & Wist, E. R. (1997). The scintillating grid illusion. Vision Research, 37, 1033–1038.
Schrauf, M., & Spillmann, L. (2000). The scintillating grid illusion in stereo depth. Vision Research, 40, 717–721.
Schrauf, M., Wist, E. R., & Ehrenstein, W. H. (2000). The scintillating grid illusion during smooth pursuit, stimulus motion, and brief exposure in humans. Neuroscience Letters, 284, 126–128.
Spillmann, L. (1994). The Hermann grid illusion: A tool for studying human perceptive field organization. Perception, 23, 691–708.
Stevens, C. F. (1964). A quantitative theory of neural interactions: Theoretical and experimental investigations. Unpublished doctoral dissertation, Rockefeller Institute.
VanRullen, R., & Dong, T. (2003). Attention and scintillation. Vision Research, 43, 2191–2196.
Victor, J. (1987). The dynamics of the cat retina X cell centre. Journal of Physiology, 386, 219–246.
Yu, Y., & Choe, Y. (2004a). Angular disinhibition effect in a modified Poggendorff illusion. In K. D. Forbus, D. Gentner, T.
Regier (Eds.), Proceedings of the 26th Annual Conference of the Cognitive Science Society (pp. 1500–1505). Mahwah, NJ: Erlbaum.
Yu, Y., & Choe, Y. (2004b). Explaining the scintillating grid illusion using disinhibition and self-inhibition in the early visual pathway. In Society for Neuroscience Abstracts. Program No. 301.10. Washington, DC: Society for Neuroscience.
Yu, Y., Yamauchi, T., & Choe, Y. (2004). Explaining low-level brightness-contrast illusions using disinhibition. In A. J. Ijspeert, M. Murata, & N. Wakamiya (Eds.), Biologically inspired approaches to advanced information technology (BioADIT 2004) (pp. 166–175). Berlin: Springer.
Received February 18, 2005; accepted September 8, 2005.
LETTER
Communicated by Jonathan Victor
A Comparison of Descriptive Models of a Single Spike Train by Information-Geometric Measure Hiroyuki Nakahara
[email protected].
Shun-ichi Amari
[email protected] Laboratory for Mathematical Neuroscience, RIKEN Brain Science Institute, Wako, Saitama, 351-0198 Japan
Barry J. Richmond
[email protected] Laboratory of Neuropsychology, National Institute of Mental Health, National Institutes of Health, Bethesda, MD 20892, U.S.A.
In examining spike trains, different models are used to describe their structure. The different models often seem quite similar, but because they are cast in different formalisms, it is often difficult to compare their predictions. Here we use the information-geometric measure, an orthogonal coordinate representation of point processes, to express different models of stochastic point processes in a common coordinate system. Within such a framework, it becomes straightforward to visualize higher-order correlations of different models and thereby assess the differences between models. We apply the information-geometric measure to compare two similar but not identical models of neuronal spike trains: the inhomogeneous Markov and the mixture of Poisson models. It is shown that they differ in the second- and higher-order interaction terms. In the mixture of Poisson model, the second- and higher-order interactions are of comparable magnitude within each order, whereas in the inhomogeneous Markov model, they have alternating signs over different orders. This provides guidance about what measurements would effectively separate the two models. As newer models are proposed, they also can be compared to these models using information geometry. 1 Introduction Over the past two decades, studies of the information-carrying properties of neuronal spike trains have intensified and become more sophisticated. Many earlier studies of neuronal spike trains concentrated mainly on using general methods to reduce the dimensionality of the description. Recently, however, specific models have been developed to incorporate findings from Neural Computation 18, 545–568 (2006)
C 2006 Massachusetts Institute of Technology
546
H. Nakahara, S. Amari, and B. Richmond
both experimental and theoretical biophysical data (Dean, 1981; Richmond & Optican, 1987; Abeles, 1991; Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Reid, Victor, & Shapley, 1992; Softky & Koch, 1993; Shadlen & Newsome, 1994; Victor & Purpura, 1997; Stevens & Zador, 1998; Oram, Wiener, Lestienne, & Richmond, 1999; Shinomoto, Sakai, & Funahashi, 1999; Meister & Berry, 1999; Baker & Lemon, 2000; Kass & Ventura, 2001; Reich, Mechler, & Victor, 2001; Brown, Barbieri, Ventura, Kass, & Frank, 2002; Wiener & Richmond, 2003; Beggs & Plenz, 2003; Fellous, Tiesinga, Thomas, & Sejnowski, 2004). These newer approaches guess at specific structures that give rise to the spike trains in experimental data. Because these models have specific structures, fitting these models is translated into estimating the parameters of the model rather than using general approaches to dimensionality reduction. There are several benefits to these more descriptive models. First, all of the approaches succinctly describe data. Second, the more principled models make their assumptions explicit; they declare which properties in the data are considered important. Third, parametric models have practical value for data analysis because the parameter values of a model can often be reasonably well estimated even with the limited number of samples that can be obtained in experiments. When considering different models, it is natural to ask which model is really good, or perhaps, more to the point, what we learn about the system from different models. In what ways are the models equivalent or different? If the differences can be seen explicitly, experiments can be designed to evaluate features that distinguish the models. A powerful approach for distinguishing them is to project them into a single coordinate frame, especially an orthogonal coordinate one. 
The information-geometric (IG) measure (Amari, 2001; Nakahara & Amari, 2002b) provides an orthogonal coordinate system for making such projections for models of point processes. Using the IG measure, we consider a probability space, where each point in the space corresponds to a probability distribution. Estimating the probability distribution of the spike train from experimental data corresponds to identifying the location and spread of a point in this space. In this context, different assumptions underlying different spike train models correspond to different search constraints (see Figure 1). Once different models are re-represented in a common coordinate system, one can compare the regions of the probability space that can be occupied by the different models, with the overlapping regions representing properties that the models have in common, and nonoverlapping regions representing properties that are unique to a particular model. Here we use the IG measure to compare two stochastic models of spike trains: the inhomogeneous Markov (IM) (Kass & Ventura, 2001) and the mixture of Poisson (MP) models (Wiener & Richmond, 2003). Experimentally recorded spike trains generally depart from Poisson statistics (see section 2). The variance-to-mean relation is seldom one, the interval distribution is often not exponential, and the count distribution in a period is
Spike Train by Information Geometry
547

Figure 1: Schematic drawing to show that each parametric model for spike train description is embedded as a subspace in a full probability space. Given a parametric model, estimating parameter values (or probability distribution) from experimental data corresponds to identifying a point in the subspace.
often not Poisson. Recently, both IM and MP models have been proposed to deal with these deviations. Both treat the spike trains as point processes, but they emphasize different properties. The intersection of the two models is an inhomogeneous Poisson process. The IM model emphasizes the non-Poisson nature of interval distributions, whereas the MP model emphasizes the non-Poisson nature of the spike count distribution (see Figure 2 and section 3). To compare the two models, we represent them in a common coordinate system, according to the methods of the IG measure (see section 4). The second- and higher-order statistics of the models can thus be characterized (see section 5) in a form suitable for comparison and then used to discriminate the models. 2 Preliminaries Consider an inhomogeneous Poisson process. For a spike train of a single neuron, consider a time period of N bins, where each bin is so short that it can
Figure 2: Schematic drawing to indicate how raw experimental data are converted to estimates of the parameter values of the two different models. The raw data, or raster data (top), can be converted to different formats: K_{i,j}, PSTH, and spike count histogram (middle). For the MIM model, K_{i,j} = a_{j-i} and PSTH data are used (parameters η_1, ..., η_N and a_1, ..., a_N), whereas for the MP model, the PSTH and the spike count histogram are used (parameters η_1, ..., η_N and {π_k, λ_k}, k = 1, ..., K). Only the MIM model case (not the IM model) is shown here for simplicity of presentation.
have at most a single spike. Each neuronal spike train is then represented by a binary random vector variable. Let X_N = (X_1, ..., X_N) be N binary random variables, and let p(x_N) = P[X_N = x_N], x_N = (x_1, ..., x_N), x_i = 0, 1, be its probability, where p(x_N) > 0 is assumed for all x_N. Each X_i indicates
a spike in the ith bin, by X_i = 1, or no spike, by X_i = 0. With this notation, the inhomogeneous Poisson process is given by

$$ p(x_N) = \prod_{i=1}^{N} \eta_i^{x_i} (1 - \eta_i)^{1 - x_i}, $$

where η_i = E[x_i]. The probability of a spike occurrence in a bin is independent of those of the other bins:

$$ p(x_N) = \prod_{i=1}^{N} p(x_i), \quad \text{where } p(x_i) = \eta_i^{x_i} (1 - \eta_i)^{1 - x_i}. $$
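For small N, the product form above can be verified by brute-force enumeration. The sketch below uses illustrative η_i values (assumptions, not from the text) and checks that the 2^N pattern probabilities sum to one and that the expected spike count equals Σ_i η_i:

```python
import itertools
import numpy as np

# Illustrative per-bin firing probabilities eta_i (assumed values).
eta = np.array([0.02, 0.05, 0.03, 0.04])
N = len(eta)

def p(x):
    """Inhomogeneous Poisson probability of the binary spike train x."""
    x = np.asarray(x)
    return float(np.prod(eta**x * (1.0 - eta)**(1 - x)))

patterns = list(itertools.product([0, 1], repeat=N))
total = sum(p(x) for x in patterns)                # normalization over 2^N patterns
mean_count = sum(p(x) * sum(x) for x in patterns)  # E[sum_i x_i]
print(total, mean_count)  # 1 and eta.sum() = 0.14, up to floating-point error
```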
Then (η_1, ..., η_N), or the peristimulus histogram (PSTH), obtained from experimental data is sufficient to estimate the parameter values of this model. Under this condition, the experimental data analysis is simple, and spike generation from this model is easy. This independence property leads to well-known facts: count statistics obey the Poisson distribution, and interval statistics obey the exponential distribution (the two facts are exact for the homogeneous Poisson process and asymptotically exact for the inhomogeneous one). Its simplicity makes the Poisson model a popular choice as a descriptive model. Experimental findings often suggest that the empirical probability distributions of spike counts and intervals are close to the Poisson and exponential distributions, respectively, but frequently they depart from these as well (Dean, 1981; Tolhurst, Movshon, & Dean, 1983; Gawne & Richmond, 1993; Gershon, Wiener, Latham, & Richmond, 1998; Lee, Port, Kruse, & Georgopoulos, 1998; Stevens & Zador, 1998; Maynard et al., 1999; Oram et al., 1999). These findings led to studies that considered a larger class of models, including the IM and MP models (see section 3). The Poisson process occupies only a small subspace of the original space of all probability distributions p(x_N). The number of all possible spike patterns is 2^N, since X_N ∈ {0, 1}^N. Therefore, each p(x_N) is given by 2^N probabilities p_{i_1...i_N} = Prob{X_1 = i_1, ..., X_N = i_N}, i_k = 0, 1, subject to Σ_{i_1,...,i_N} p_{i_1...i_N} = 1. The set of all possible probability distributions {p(x_N)} forms a (2^N − 1)-dimensional manifold S_N. To represent a point in S_N, that is, a probability distribution p(x_N), one simple coordinate system, called the P-coordinate system, is given by {p_{i_1...i_N}} above; each set of values of these 2^N probabilities corresponds to a specific probability distribution p(x_N).
Since {p_{i_1...i_N}} sums to one, the effective coordinate dimension is 2^N − 1 (instead of 2^N). For the IG measure, we use two other coordinate systems. The first is the θ-coordinate system explained in section 4, and the second is the
η-coordinate system, given by the expectation parameters

$$ \eta_i = E[x_i] = \mathrm{Prob}\{x_i = 1\}, \quad i = 1, \ldots, N, $$
$$ \eta_{ij} = E[x_i x_j] = \mathrm{Prob}\{x_i = x_j = 1\}, \quad i < j, $$
$$ \eta_{i_1 i_2 \ldots i_l} = E[x_{i_1} \cdots x_{i_l}] = \mathrm{Prob}\{x_{i_1} = x_{i_2} = \cdots = x_{i_l} = 1\}, \quad i_1 < i_2 < \cdots < i_l. $$

All η_i, η_{ij}, η_{ijk}, and so on, together have 2^N − 1 components; that is, η = (η_1, η_2, ..., η_N) = (η_i, η_{ij}, η_{ijk}, ..., η_{12...N}) has 2^N − 1 components, where we write η_1 = (η_i), η_2 = (η_{ij}), and so on, forming the η-coordinate system in S_N. Any probability distribution of X_N can be completely represented by P- or η-coordinates if and only if all of the coordinates are used. If a model of the probability distribution of X_N has fewer parameters than 2^N − 1 (this is usually the case), the probability space in which the model lies is restricted. Since the Poisson process uses η_1 as its coordinates, it lies in a much smaller subspace than the full space S_N. The components of the η-coordinates, η_1 = (η_i) = (η_1, η_2, ..., η_N), i = 1, ..., N, are familiar, since they correspond to the PSTH in experimental data analysis. They are taken to represent the time-course expectation of the neural firing, expressed as the probability of a spike at each time or as the firing frequency. Below, we freely exchange the PSTH and η_1 for simplicity. However, we note the difference between η_1 and the probability density of firing, since the PSTH is often regarded as the latter as well. The probability density is recalculated when the bin size changes, so that it is invariant under changes of the bin size, being a spike count per unit (infinitesimal) time. The firing frequency would become the probability density as the bin size approaches zero (or the time resolution approaches infinity). In data analysis, the firing frequency is then taken to be the empirically measured density. In contrast, each component of η_1, that is, η_i, is the probability of a spike in a bin, P[X_i = 1], not a probability density.
For η_i, it is assumed that each bin can contain at most one spike. In practical data analysis, η_i can thus be regarded as the density, as long as the bin size is small enough. In general, though, we must be aware of this difference and carefully translate between the two. How large to make the bins is an important question but beyond the scope of this study. Others may express concern about events that are not found to occur in the data, that is, when zero probability must be assigned to an event. In such cases, we can, in principle, reconstruct a probability model from which the events of zero probability are omitted, or impose some assumptions on those zero probabilities, which seems more useful in practice (see Nakahara & Amari, 2002b).
To simplify the notation, we sometimes write X_i = 0 and X_i = 1 as x_i^0 and x_i^1, respectively; for example, x_N = (X_1 = 0, X_2 = 0, X_3 = 1, ..., X_N = 1) = (x_1^0, x_2^0, x_3^1, ..., x_N^1). The notation p_{(ijk)}, and so on, is used to define

$$ p_{(i_1 i_2 \ldots i_k)} = P\!\left[ x_1^0, \ldots, x_{i_1}^1, \ldots, x_{i_2}^1, \ldots, x_{i_k}^1, \ldots, x_N^0 \right]. $$

We also use p_{(0)} = P[x_1^0, ..., x_N^0]. The cardinality of a spike train, that is, the number of spikes or the spike count, is

$$ Y \equiv |X_N| = \#\{X_i = 1\}. $$

Given a specific spike train x_N, n is reserved to indicate n = |x_N|, and s(1), s(2), ..., s(n) are used to denote the specific timings of spikes, that is, the set of indices having x_i = 1. For example, with this notation, we write

$$ x_N = \left( x_1^0, \ldots, x_{s(1)-1}^0, x_{s(1)}^1, x_{s(1)+1}^0, \ldots, x_{s(2)}^1, \ldots, x_{s(3)}^1, x_{s(3)+1}^0, \ldots, x_{s(n)}^1, x_{s(n)+1}^0, \ldots, x_N^0 \right). $$

3 Two Parametric Models for Single Spike Trains

Here we present the original formulations of the inhomogeneous Markov (IM) (Kass & Ventura, 2001) and the mixture of Poisson (MP) models (Wiener & Richmond, 2003).

3.1 Inhomogeneous Markov Model. The IM model was developed as a tractable class of spike interval probability distributions to account for observations that the spike firing over bins is not completely independent and thus departs from the Poisson process (Kass & Ventura, 2001; Ventura, Carta, Kass, Gettner, & Olson, 2002). The inhomogeneous Markov assumption is the key to the IM model and assumes the following equality: for any spike event x_{s(l)}^1 (l ≤ n),

$$ P\!\left[ x_{s(l)}^1 \,\middle|\, x_1, \ldots, x_{s(l)-1} \right] = P\!\left[ x_{s(l)}^1 \,\middle|\, x_{s(l-1)}^1, x_{s(l-1)+1}^0, \ldots, x_{s(l)-1}^0 \right]. $$

The probability of firing at a time t given its past, which is the left side of the equation, depends on only t and the time of the last spike. Denote the right-hand side by K̃_{s(l-1),s(l)}, as

$$ \tilde{K}_{s(l-1),s(l)} = P\!\left[ x_{s(l)}^1 \,\middle|\, x_{s(l-1)}^1, x_{s(l-1)+1}^0, \ldots, x_{s(l)-1}^0 \right], $$

where l = 2, ..., n (potentially n is up to N).
If K̃_{s(l-1),s(l)} = η_{s(l)} = P[x_{s(l)}^1], this is an inhomogeneous Poisson process. By explicitly including the parameters {K̃_{s(l-1),s(l)}}, the IM model enlarges the class of probabilities beyond the Poisson process. The original parameters of the IM model are given by {η_i, K̃_{i,j}} (i, j = 1, ..., N, i < j). After some calculations, we obtain the following:

Proposition 1. Given the original parameters {η_i, K̃_{i,j}} (i, j = 1, ..., N, i < j), the probability of any spike train under the IM model is given by

$$ P_{IM}(x_N) = \left[ \prod_{l=1}^{s(1)-1} (1 - \eta_l) \right] \eta_{s(1)} \left[ \prod_{l=2}^{n} \tilde{K}_{s(l-1),s(l)} \right] \prod_{l=1}^{n} \prod_{k=1}^{s(l+1)-s(l)-1} \left( 1 - \tilde{K}_{s(l),s(l)+k} \right), \qquad (3.1) $$

where (and hereafter) P_{IM} is used to denote a probability distribution of the IM model and the convention s(n + 1) = N + 1 is introduced without loss of generality. (See appendix A for the proof.)
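Equation 3.1 can be sanity-checked by enumeration: summed over all 2^N spike trains, it should give exactly 1. The sketch below uses hypothetical η_i and K̃_{i,j} values drawn at random (any values in (0, 1) work) and treats the train with no spikes as the bare prefix ∏_l (1 − η_l), consistent with the convention s(n + 1) = N + 1:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 5
# Hypothetical parameter values in (0, 1); any such values define a valid IM model.
eta = rng.uniform(0.05, 0.3, size=N + 1)           # eta_l, indexed 1..N (index 0 unused)
K_t = rng.uniform(0.05, 0.3, size=(N + 2, N + 2))  # K~_{i,j} for i < j

def p_im(x):
    """P_IM(x_N) of equation 3.1, with the convention s(n + 1) = N + 1."""
    s = [i for i in range(1, N + 1) if x[i - 1] == 1]  # spike times, 1-based
    if not s:  # no spikes at all: only the (1 - eta_l) prefix survives
        return float(np.prod(1.0 - eta[1:N + 1]))
    n = len(s)
    se = s + [N + 1]                                   # s(1), ..., s(n), s(n+1)
    prob = float(np.prod(1.0 - eta[1:s[0]])) * eta[s[0]]
    for l in range(1, n):                              # product of K~_{s(l-1), s(l)}
        prob *= K_t[s[l - 1], s[l]]
    for l in range(n):                                 # silent bins after each spike
        for kk in range(1, se[l + 1] - se[l]):
            prob *= 1.0 - K_t[se[l], se[l] + kk]
    return prob

total = sum(p_im(x) for x in itertools.product([0, 1], repeat=N))
print(total)  # a proper distribution sums to 1 over all 2^N trains
```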
Equation 3.1 states that the probability of any spike train p(x_N) depends only on η_i and K̃_{i,j} under the IM model. Thus, {η_i, K̃_{i,j}} (i, j = 1, ..., N, i < j) is one coordinate system for the IM model. Another coordinate system,

$$ \{\eta_i, K_{i,j}\} \quad (i, j = 1, \ldots, N,\ i < j), $$

can be introduced by defining K_{s(l-1),s(l)} such that K̃_{s(l-1),s(l)} = η_{s(l)} K_{s(l-1),s(l)} (provided that η_i > 0). The number of parameters, or the IM model dimensionality, is N + N(N−1)/2.

A subclass of the IM model, called the multiplicative inhomogeneous Markov (MIM) model, has also been proposed (Kass & Ventura, 2001). In addition to the IM assumption, they assumed another constraint on K_{i,j}, given by

$$ a_{j-i} \equiv K_{i,j}. \qquad (3.2) $$

Note that this assumption is not equivalent to assuming K̃_{i,j} = K̃_{i',j'} for j − i = j' − i'. The assumption further constrains the probability subspace available to the model (see Figure 2). The dimensionality of the MIM model is 2N − 1. The MIM model is easily expressed by substituting K̃_{i,j} = η_j a_{j-i} in equation 3.1.
3.2 Mixture of Poisson Model. The MP model was introduced to better account for the spike count statistics of neural spike trains over an interval of interest (Wiener & Richmond, 2003). In many neurophysiological recordings of single neurons, the spike count and its probability distribution are easy to obtain and arguably the most robust measure to estimate. The MP model fits the spike count distribution to a mixture of Poisson distributions instead of a single Poisson distribution. However, the mixture of Poisson distributions itself cannot determine spike train generation without further assumptions. In the original work (Wiener & Richmond, 2003), each trial was drawn from one of the Poisson distributions; the kth component Poisson process is chosen with probability π_k in each trial of the experiment, generating the spike train of that trial. Thus, the MP model enjoys the simplicity of the Poisson process for generating spike trains in each trial. Let us write each kth (inhomogeneous) Poisson process, P_k, as

$$ P_k[X_N = x_N] = \prod_{i=1}^{N} \eta_{i,k}^{x_i} (1 - \eta_{i,k})^{1 - x_i}, $$

where we define η_{i,k} = E_k[x_i], and E_k denotes expectation with respect to the probability P_k. The trial-by-trial mixture of the Poisson processes is given by

$$ P[X_N = x_N] = \sum_{k=1}^{K} \pi_k P_k[X_N = x_N], $$

where {π_k} are mixing probabilities with Σ_{k=1}^{K} π_k = 1. The corresponding spike count distribution is the mixture of Poisson distributions, given by P[Y = y] = Σ_{k=1}^{K} π_k P_k[Y = y]. Here the Poisson distribution of each kth component is given by

$$ P_k[Y = y] = \frac{\lambda_k^y}{y!} e^{-\lambda_k}, \quad \text{where } \lambda_k = \sum_{i=1}^{N} \eta_{i,k}. $$
An important issue is how to estimate the spike generation of each kth component, that is, {ηi,k } (i = 1, . . . , N). Consider first a single Poisson process for which we pretend to have only a single component in the above formulation. Can we recover {ηi,1 } from λ1 ? We can get η = λ1 /N if the process is homogeneous. If it is inhomogeneous, the solution is not unique: various sets of {ηi,1 } may match a value of λ1 . In practice, we get {ηi,1 } by looking at the PSTH from the same experimental data. If there is more than one
component, the PSTH tells us only the left-hand side of the equation below:

$$ \eta_i = E[X_i] = \sum_{k=1}^{K} \pi_k \eta_{i,k} \quad (i = 1, \ldots, N). $$

In this general case, to obtain {η_{i,k}} of each kth component, the approach taken by the MP model assumes that the overall shape of the PSTH is the same among all components (Wiener & Richmond, 2003). This assumption implies that for each k, there exists a constant α_k such that η_{i,k} = α_k η_i for any i = 1, ..., N. By taking the sum with respect to i, the value of α_k is given by

$$ \alpha_k = \frac{\lambda_k}{c_1}, \quad \text{where we define } c_1 \equiv \sum_{i=1}^{N} \eta_i. $$
The MP model, as a generative model of spike trains, is the trial-by-trial mixture of Poisson processes with this assumption, summarized as follows:

Proposition 2. Given original parameters {π_k, λ_k, η_i} (k = 1, ..., K; i = 1, ..., N), the probability distribution of any spike pattern under the MP model is given by

$$ P_{MP}(x_N) = \sum_{k=1}^{K} \pi_k P_k(x_N), $$

where (and hereafter) P_{MP} denotes the probability distribution of the MP model and P_k denotes the probability distribution of the kth component,

$$ P_k(x_N) = \prod_{i=1}^{N} \eta_{i,k}^{x_i} (1 - \eta_{i,k})^{1 - x_i}, \qquad (3.3) $$

where η_{i,k} is defined by

$$ \eta_{i,k} = \frac{\lambda_k}{c_1}\, \eta_i \quad (i = 1, \ldots, N). $$

Here, c_1 is a constraint of the model parameters given by

$$ c_1 = \sum_{i=1}^{N} \eta_i = \sum_{i=1}^{N} \sum_{k=1}^{K} \pi_k \eta_{i,k} = \sum_{k=1}^{K} \pi_k \lambda_k. $$

Another constraint of the model parameters is Σ_{k=1}^{K} π_k = 1.
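As with the IM model, proposition 2 is easy to verify by enumeration for small N. In the sketch below, all parameter values (π_k, λ_k, and the PSTH shape η_i) are hypothetical; the PSTH is rescaled so that the intrinsic constraint Σ_i η_i = c_1 holds:

```python
import itertools
import numpy as np

# Hypothetical parameter values for a two-component MP model.
pi = np.array([0.3, 0.7])              # mixing probabilities, sum_k pi_k = 1
lam = np.array([0.4, 1.2])             # lambda_k, mean spike count of each component
N = 4
eta = np.array([0.2, 0.4, 0.3, 0.1])   # PSTH shape; rescaled below to satisfy c1

c1 = float(pi @ lam)                   # intrinsic constraint: c1 = sum_k pi_k lambda_k
eta = eta / eta.sum() * c1             # enforce sum_i eta_i = c1
eta_ik = np.outer(eta, lam) / c1       # eta_{i,k} = (lambda_k / c1) * eta_i

def p_mp(x):
    """P_MP(x_N): trial-by-trial mixture of Bernoulli-bin Poisson components."""
    xc = np.asarray(x)[:, None]        # column vector, broadcast over components k
    comp = np.prod(eta_ik**xc * (1.0 - eta_ik)**(1 - xc), axis=0)
    return float(pi @ comp)

patterns = list(itertools.product([0, 1], repeat=N))
total = sum(p_mp(x) for x in patterns)
psth = np.array([sum(p_mp(x) * x[i] for x in patterns) for i in range(N)])
print(total)                   # normalization over all 2^N patterns
print(np.allclose(psth, eta))  # PSTH constraint: eta_i = sum_k pi_k eta_{i,k}
```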
Thus, the dimensionality of the MP model is 2K + N − 2. We note that the constraint on c_1 is intrinsic to, a part of, the MP model (see Nakahara, Amari, & Richmond, 2005).

4 Representation of the Two Models by Information-Geometric Measure

4.1 Information-Geometric Measure. Having established the two models, we rewrite them in the IG measure (Amari, 2001; Nakahara & Amari, 2002b). The usefulness of the IG measure was studied earlier for spike data analysis (Nakahara & Amari, 2002a, 2002b; Nakahara, Amari, Tatsuno, Kang, & Kobayashi, 2002; Amari, Nakahara, Wu, & Sakai, 2003) and for DNA microarray data (Nakahara, Nishimura, Inoue, Hori, & Amari, 2003). Although these studies emphasized neural population firing and interactions among neurons (or gene expressions), almost all of the earlier results can be directly applied in analyzing single-neuron spike trains, because the mathematical formulation is general in the sense that it can be applied to any binary random vector. Here we give a brief description of the IG measure. Let us first introduce the θ-coordinate system, defined by

$$ \log P[X_N = x_N] = \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j + \sum_{i<j<k} \theta_{ijk} x_i x_j x_k + \cdots + \theta_{1,\ldots,N}\, x_1 \cdots x_N - \psi, $$

where the indices of θ_{ijk}, and so on, satisfy i < j < k, and ψ is a normalization term, corresponding to −log p(x_1 = x_2 = ⋯ = x_N = 0). This log expansion is not an approximation; it is exact when x_N is a random binary vector, as here. All θ_i, θ_{ij}, θ_{ijk}, and so on, have a total of 2^N − 1 components, that is, θ = (θ_1, θ_2, ..., θ_N) = (θ_i, θ_{ij}, θ_{ijk}, ..., θ_{12...N}), and form the θ-coordinate system in S_N, also called θ-coordinates. The θ-coordinates can represent any probability distribution in S_N. It is straightforward to write any components of the θ-coordinates in relation to the P-coordinate system. Here we list the first few terms:

$$ \psi = -\log p_{(0)}, \quad \theta_i = \log \frac{p_{(i)}}{p_{(0)}}, \quad \theta_{ij} = \log \frac{p_{(ij)}\, p_{(0)}}{p_{(i)}\, p_{(j)}}, \quad \theta_{ijk} = \log \frac{p_{(ijk)}\, p_{(i)}\, p_{(j)}\, p_{(k)}}{p_{(0)}\, p_{(ij)}\, p_{(jk)}\, p_{(ik)}}. $$
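These relations are an inclusion-exclusion (Möbius) inversion over index subsets, which makes them easy to verify numerically for small N. The sketch below, with assumed η_i values, computes θ-coordinates from the P-coordinates of an independent Bernoulli-bin process and confirms that all second- and higher-order components vanish for an independent process:

```python
import itertools
import numpy as np

N = 3
rng = np.random.default_rng(1)
eta = rng.uniform(0.1, 0.4, size=N)   # assumed values; bins are independent

# P-coordinates: probability of every binary pattern x = (x1, x2, x3).
P = {x: float(np.prod(eta**np.array(x) * (1 - eta)**(1 - np.array(x))))
     for x in itertools.product([0, 1], repeat=N)}

def theta(A):
    """theta-coordinate of index set A by Moebius inversion over subsets:
    theta_A = sum over B subset of A of (-1)^(|A| - |B|) log p_(B),
    which reproduces the listed formulas (e.g. theta_ij for A = {i, j})."""
    val = 0.0
    for r in range(len(A) + 1):
        for B in itertools.combinations(A, r):
            x = tuple(1 if i in B else 0 for i in range(N))
            val += (-1.0) ** (len(A) - len(B)) * np.log(P[x])
    return val

# Independent bins: theta_i = log(eta_i / (1 - eta_i)), while theta_ij and
# theta_ijk vanish, i.e. the process sits at (theta_2, theta_3) = (0, 0).
first = np.array([theta((i,)) for i in range(N)])
print(np.allclose(first, np.log(eta / (1 - eta))))
print(abs(theta((0, 1))) < 1e-9, abs(theta((0, 1, 2))) < 1e-9)
```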
The general relation of both η- and θ -coordinates with respect to P-coordinates is given by theorem 1, set out in the appendix B, along with a brief remark on the background of the IG coordinates. The IG coordinates
allow the examination of different-order interactions of neural spike trains and the information conveyed by these different orders (Nakahara & Amari, 2002b). The mixed coordinates of different orders are particularly useful (Amari, 2001; Nakahara & Amari, 2002b); for example, the first order (First cut) mixed coordinates (η1 , θ 2 , . . . , θ N ) are useful for dissociating the mean firing rate η1 from all the second and higher-order interactions under the null hypothesis of any correlated firing, because all the components of θ 2 , . . . , θ N can be treated independently from η1 (in general, the components of η- and θ -coordinates can be treated independently because of their dual orthogonality; Amari, 2001; Nakahara & Amari, 2002b; see appendix B). The first-cut mixed coordinates are also most suitable for comparing IM and MP models, since both share the parameters η1 (see section 3), and hence the difference between them must be sought in other components in the full probability space. The first-cut mixed coordinates can characterize this difference by looking at the interaction terms θ 2 , . . . , θ N , separately from η1 , as will be done in a later section. Before applying the IG measure to these two models, we make two useful remarks. Remark: The Second-Order IG Model. The first remark concerns the notion of restricted probability space. Recall that the Poisson process is the intersection of the IM and MP models, which is given by p(x N ) = N xi 1−xi . This model corresponds to i=1 ηi (1 − ηi )
log p(xN) = Σ_i θi xi − ψ,  (4.1)
where θi = log[ηi/(1 − ηi)]. Therefore, we have (θi) ⇔ (ηi). In other words, the Poisson process lies in the subspace defined by (θ2, . . . , θN) = (0, . . . , 0). Under the maximum entropy principle, the Poisson process is the distribution that can be determined completely by the first-order statistics η1. The IM and MP models expand this restricted space differently. Consider the following model,
log p(xN) = Σ_i θi xi + Σ_{i<j} θij xi xj − ψ,

where θi = log[p(i)/p(0)] and θij = log[p(ij) p(0)/(p(i) p(j))]. Now, (θ1, θ2) becomes the coordinate system, and (η1, η2) is another coordinate system, that is, η-coordinates. The number of parameters is N + N(N − 1)/2. This model has been extensively studied in various fields, including spin glasses and the Boltzmann machine. Since (θ3, . . . , θN) = (0, . . . , 0), this model can be determined completely by
Spike Train by Information Geometry
557
its first- and second-order statistics (η1, η2) under the maximum entropy principle. We call this model the second-order IG model. Remark: Coordinate Transformation. The second remark concerns the different representations of the η- and θ-coordinates. There can be different representations of the two coordinates due to the degrees of freedom in an affine coordinate system (see Nakahara et al., 2005, for an illustrative example). Hereafter, when the distinction is needed, we refer to the original dual coordinates, θ = (θ1, θ2, . . . , θN) and η = (η1, η2, . . . , ηN), as the standard IG, or θ- and η-, coordinates.
4.2 IM Model. We represent the IM model by standard θ-coordinates. For convenience, we first introduce another representation of θ- and η-coordinates. Let us define, for i < j,

X̃_{i,j} ≡ Xi Xj ∏_{l=i+1}^{j−1} (1 − Xl).  (4.2)
X̃_{i,j} becomes 1 only when there are spikes at the ith and jth bins and no spikes between them. This is a natural quantity to be dealt with by the IM model. We have

η̃_{i,j} ≡ E[X̃_{i,j}] = ηi K̃_{i,j} ∏_{l=1}^{j−i−1} (1 − K̃_{i,i+l}).  (4.3)
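In code, equation 4.2 is a one-line predicate; the sketch below (0-indexed bins, hypothetical function name) makes the "spikes at i and j with none in between" reading explicit.

```python
def x_tilde(x, i, j):
    # X~_{i,j} of equation 4.2: 1 iff bins i and j both spike and
    # every bin strictly between them is silent (0-indexed sketch).
    assert i < j
    return x[i] * x[j] * int(all(x[l] == 0 for l in range(i + 1, j)))

spikes = [1, 0, 0, 1, 1]
```

For instance, the pair (0, 3) qualifies here, while (0, 4) does not because of the intervening spike at bin 3.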
Since {ηi, K̃_{i,j}} is a coordinate system of the IM model, equation 4.3 implies that {ηi, η̃_{i,j}} is also another coordinate system, which is another representation (but not the standard) of η-coordinates of the IM model. In correspondence to {η̃_{i,j}}, we introduce {θ̃_{i,j}} by

log PIM(xN) ≡ Σ_i θi Xi + Σ_{i<j} θ̃_{i,j} X̃_{i,j} − ψ.  (4.4)
We see that {θi, θ̃_{i,j}} is another representation of θ-coordinates of the IM model that corresponds to the η-coordinates, {ηi, η̃_{i,j}}. We have

θi = log[p(i)/p(0)],  θ̃_{i,j} = log[p(ij) p(0)/(p(i) p(j))].  (4.5)

The dimensionality of the coordinates is N + N(N − 1)/2 and equals that of the second-order model in the standard coordinates (the second-order IG
model). However, the restricted probability spaces of the IM and second-order models are different. To see how the IM model is embedded in full space, let us write the IM model in standard θ-coordinates. By expanding equation 4.4 with the definition of X̃_{i,j}, we get θij = θ̃_{i,j}, θijk = −θ̃_{i,k}, θijkl = θ̃_{i,l}, or in general,

θ_{i1 i2 ...ik} = (−1)^k θ̃_{i1,ik}  (k ≥ 2).  (4.6)

This equation indicates that the restricted probability space of the IM model imposes a specific alternating structure in higher-order interaction terms. It also indicates that the IM model is not distinguishable from the second-order IG model in terms of the second-order interaction, since we have θij = θ̃_{i,j}. The relation of {θi, θ̃_{i,j}} with the original parameters {ηi, K_{i,j}} is given by the following:

Theorem 2. The IM model is represented by the standard θ-coordinates in relation to its original parameters {ηi, K_{i,j}} (i < j; i, j = 1, . . . , N) as
θi = log[ηi/(1 − ηi)] + Σ_{l=i+1}^N log[1 + ηl (1 − K_{i,l})/(1 − ηl)],  (4.7)

θ_{i1 i2 ...ik} = (−1)^k θ̃_{i1,ik}  (k ≥ 2),  (4.8)

θ̃_{i,j} = log K_{i,j} − Σ_{l=j}^N log[1 + ηl (1 − K_{i,l})/(1 − ηl)].  (4.9)
See appendix C for the proof. First, note that the first-order component θi deviates from the Poisson process; assuming the Poisson process, we would estimate only {ηi} (or directly {θi}) and set θi = log[ηi/(1 − ηi)]. Second, θ̃_{i,j} becomes zero if all K_{i,j} are equal to one, which corresponds to cases in which the IM model is reduced to the Poisson process. Thus, in the above expression of θ̃_{i,j}, the difference between the IM model and the Poisson process lies with the terms {K_{i,j}}, especially the first term, log K_{i,j} (see below). We see that θ̃_{i,j} depends not only on K_{i,j} and ηj but also on K_{i,l} and ηl for l = j + 1, . . . , N. We now approximate θ̃_{i,j} to grasp its nature. First, ηi ≪ 1 holds in most data. Second, the original work (Kass & Ventura, 2001) suggests that K_{i,j} is roughly within a range of [0.4, 1.6], implying that the probability of a spike occurring is not strongly dependent on the time of the previous spike (refer to Kass and Ventura's article for a question of the relation to the refractory period). In any case, since this estimation is done with only one type of data, further examination is required before taking it as a general phenomenon.
Yet it seems unlikely that K_{i,j} takes a different order of magnitude. Then let us assume

ηl (1 − K_{i,l})/(1 − ηl) ≪ 1,  (4.10)
and also notice that in many situations, we have ηl ≪ K_{i,j}. In such cases, we have approximately

θ̃_{i,j} ≈ log K_{i,j} − Σ_{l=j}^N ηl (1 − K_{i,l})/(1 − ηl).  (4.11)
We observe that K_{i,j} is the dominant term in θ̃_{i,j} and that at the same time, the terms (1 − K_{i,l}) (l = j + 1, . . . , N) also contribute to θ̃_{i,j}. As long as the ηl do not differ much, for further simplification we put η̄j = [1/(N − j + 1)] Σ_{l=j}^N ηl, and then we have

θ̃_{i,j} ≈ log K_{i,j} − [η̄j/(1 − η̄j)] Σ_{l=j}^N (1 − K_{i,l}).  (4.12)
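The gap between the exact θ̃i,j of equation 4.9 and the log(1 + x) ≈ x step behind equation 4.11 can be probed numerically. The sketch below uses hypothetical helper names and illustrative parameter values (small, uniform ηi and a constant Ki,j), purely to show that the two expressions agree closely in the regime assumed in the text.

```python
import math

def theta_tilde_exact(i, j, eta, K):
    # Equation 4.9 (0-indexed sketch): log K_{i,j} minus the boundary sum.
    N = len(eta)
    s = sum(math.log(1 + eta[l] * (1 - K[(i, l)]) / (1 - eta[l]))
            for l in range(j, N))
    return math.log(K[(i, j)]) - s

def theta_tilde_approx(i, j, eta, K):
    # Equation 4.11: replace log(1 + x) by x for small x.
    N = len(eta)
    s = sum(eta[l] * (1 - K[(i, l)]) / (1 - eta[l]) for l in range(j, N))
    return math.log(K[(i, j)]) - s

eta = [0.02] * 6                     # low firing probability per bin
K = {(i, l): 0.8 for i in range(6) for l in range(i + 1, 6)}
exact = theta_tilde_exact(0, 2, eta, K)
approx = theta_tilde_approx(0, 2, eta, K)
```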
For the multiplicative IM (MIM) model, we can replace K_{i,j} by a_{j−i} in theorem 2 to obtain the exact expression. When using the above approximation, we get

θ̃_{i,j} ≈ log a_{j−i} − [η̄j/(1 − η̄j)] Σ_{l=j}^N (1 − a_{l−i}).  (4.13)
Note that even with the MIM model, θ̃_{i,j} cannot be represented by terms that use only a_{j−i}. In other words, even for a fixed k = j − i, θ̃_{i,j} takes different values, and the range of its values is determined by the summation in the second term, which goes from a_{j−i} to a_{N−i}. Thus, N, the number of bins, affects the value of θ̃_{i,j} because θ̃_{i,j} is defined with respect to a given period, whereas a_k is not. In this sense, the second term reflects a boundary effect (see Nakahara et al., 2005, for a figure illustrating these approximations).

4.3 MP Model. Let us first write the MP model in terms of log expansion, based on its original definition (see proposition 2) as

log PMP(xN) = log Σ_{k=1}^K πk Pk(xN).
To represent the MP model in the standard θ-coordinates using the relation to the original parameters, {πk, λk, ηi} (k = 1, . . . , K; i = 1, . . . , N), note that the probabilities of the MP model, denoted by p^MP, are represented by the original parameters (see proposition 2),

p^MP_(i1...il) = Σ_{k=1}^K πk p^k_(i1...il),  (4.14)
where for each kth component inhomogeneous Poisson process, p^k is given by

p^k_(0) = ∏_{j=1}^N (1 − η_{j,k}),  p^k_(i) = p^k_(0) η_{i,k}/(1 − η_{i,k}),  . . . ,  p^k_(i1···il) = p^k_(0) ∏_{j=1}^l η_{ij,k}/(1 − η_{ij,k}).
Also recall that we have η_{i,k} = (λk/c1) ηi, where c1 = Σ_{i=1}^N ηi. Then we have:
Theorem 3. The probabilities of the MP model are given by

p^MP_(i1···il) = [(∏_{j=1}^l η_{ij}) / c1^l] Σ_{k=1}^K πk λk^l p^k_(0) ∏_{j=1}^l [1 − (λk/c1) η_{ij}]^{−1},  (4.15)
where p^k indicates the probabilities due to the kth component and p^k_(0) is given by p^k_(0) = ∏_{j=1}^N (1 − η_{j,k}) = ∏_{j=1}^N [1 − (λk/c1) ηj]. By using theorems 1 and 3 together, any component of the standard θ-coordinates of the MP model can be represented by the original parameters of the MP model. For example, the first- and second-order components of the standard θ-coordinates are given by

θi = log[ηi/(1 − ηi)] + log { Σ_{k=1}^K πk p^k_(0) (λk/c1)(1 − ηi)[1 − (λk/c1) ηi]^{−1} / Σ_{k=1}^K πk p^k_(0) }  (4.16)
and

θij = log { [Σ_{k=1}^K πk p^k_(0)] [Σ_{k=1}^K πk λk^2 p^k_(0) [1 − (λk/c1) ηi]^{−1} [1 − (λk/c1) ηj]^{−1}] / ([Σ_{k=1}^K πk λk p^k_(0) [1 − (λk/c1) ηi]^{−1}] [Σ_{k=1}^K πk λk p^k_(0) [1 − (λk/c1) ηj]^{−1}]) }.  (4.17)
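As a numerical check, the MP probabilities can also be built by brute-force enumeration of a small mixture, and θij computed from them via its definition. The sketch below (hypothetical helper names, illustrative parameters) verifies both that mixing induces a nonzero pairwise interaction and that θij depends on the bin indices only through the magnitudes of the rates.

```python
import itertools
import math

def mp_probs(eta, pis, lams):
    # Mixture-of-Poissons probabilities by enumeration (sketch): component k
    # is an independent-bins process with eta_{i,k} = lam_k * eta_i / c1.
    N, c1 = len(eta), sum(eta)
    p = {}
    for x in itertools.product([0, 1], repeat=N):
        total = 0.0
        for pi_k, lam in zip(pis, lams):
            ek = [lam * e / c1 for e in eta]
            total += pi_k * math.prod(
                ek[i] ** x[i] * (1 - ek[i]) ** (1 - x[i]) for i in range(N))
        p[x] = total
    return p

def theta_ij(p, i, j, N):
    def q(on):
        return p[tuple(1 if l in on else 0 for l in range(N))]
    return math.log(q({i, j}) * q(set()) / (q({i}) * q({j})))

eta = [0.1, 0.2, 0.1]                    # bins 0 and 2 share the same rate
p = mp_probs(eta, pis=[0.5, 0.5], lams=[0.2, 0.6])
```

Because bins 0 and 2 carry the same rate, the pairs (0, 1) and (1, 2) yield the same θij, while the mixture over the two λk values makes that common value strictly positive.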
Higher-order interaction terms, such as θijk, θijkl (and so on), can be derived in a similar manner. In the above equations, the first-order term θi of the MP model deviates from the Poisson process, which would set θi = log[ηi/(1 − ηi)]. Then the MP model induces second-order interactions, as evident in the above expression. More generally, the MP model induces higher-order interactions, despite the fact that each component of the MP model (i.e., each Poisson process) itself does not have any interaction terms. Mathematically, this is because the summation terms in p^MP appear as a ratio that is not cancelled in computing the θ-coordinates. To put it another way, this is a general phenomenon. Although there is no interaction term (no second and higher) within a trial (e.g., a Poisson process within a trial), the MP model induces interaction terms because it mixes different Poisson processes over trials, so that it has interaction terms as a whole. For example, related observations are made in the context of cross-correlation analysis (Brody, 1999). Third, the term θ_{i1 i2 ...il} of the MP model is permutation free over its given indices, {i1, . . . , il}, or, equivalently, the value of θ_{i1 i2 ...il} depends on the choice of the indices only through the magnitudes of {η_{i1}, . . . , η_{il}}. Mathematically, this is clear because the term θ_{i1 i2 ...il} depends on the bin indices only through the term ∏_{j=1}^l [1 − (λk/c1) η_{ij}]^{−1}. With this property, for example, θij = θ_{i′j′} if (η_{i′}, η_{j′}) = (ηi, ηj) or = (ηj, ηi). Such a property does not exist in the IM model. Finally, we observe that the MP model tends to produce components of comparable magnitude in each order interaction when the order (i.e., l) is not too high. To see this, first note that in most cases, it is reasonable to assume η_{i,k} = (λk/c1) ηi ≪ 1 for any i, k. Then we have the approximations p^k_(0) ≈ exp(−(λk/c1) Σ_i ηi) = e^{−λk} and ∏_{j=1}^l [1 − (λk/c1) η_{ij}]^{−1} ≈ 1 + (λk/c1) Σ_{j=1}^l η_{ij}. With these approximations, the summation term of p^MP is

Σ_{k=1}^K πk λk^l p^k_(0) ∏_{j=1}^l [1 − (λk/c1) η_{ij}]^{−1} ≈ Σ_{k=1}^K πk λk^l e^{−λk} [1 + (λk/c1) Σ_{j=1}^l η_{ij}].  (4.18)
The lth-order term θ_{i1 i2 ...il} is represented by a ratio of these summation terms, and in each summation term, the bin indices appear only as Σ_{j=1}^l η_{ij}, as evident in the above approximation. Because ηi ≪ 1 can usually be assumed and the difference among the ηi is of secondary order, the magnitudes of same-order interactions are expected to be similar as long as the order l is not too large. This implies that in such cases, the parameters {πk, λk} become dominant factors in the interaction terms (see Nakahara et al., 2005, for a figure illustrating these approximations).
5 Discussion Different generative point process models are expected to give results with different high-order statistical moments. Since the information geometric (IG) measure provides a complete space for representing point processes, any point process model can be mapped into this space, allowing the subspaces occupied by different models to be located, described, and compared. The two models examined in this article, the inhomogeneous Markov (IM) and mixture of Poisson (MP) models, were constructed to account for the interval and count distributions, respectively. In their native forms, it is difficult to quantify the ways in which the high-order moments differ. However, the differences between the IM and MP models can be clearly seen by using the first-cut mixed coordinates, (η1; θ2, . . . , θN), and inspecting their second- and higher-order components, θ2, . . . , θN. The two models differ only in the second- and higher-order terms. For θij (or equivalently θ̃_{i,j}), the third- and higher-order interaction terms have a structure of alternating signs in the IM model. The components are permutation free and of comparable magnitude for the MP model, at least for the first few components (e.g., equations B.1 and B.2), whereas the components represented by successive IG terms may vary considerably for the MIM model (cf. equations 4.17 and 4.18). The results from this analysis make it possible to consider what data would be required to distinguish between these models. Comparison with the second-order IG model, log p(xN) = Σ_i θi xi + Σ_{i<j} θij xi xj − ψ (a "standard" second-order model because it is completely characterized by its first two moments under the maximum entropy principle), shows that the two models are different in {θij}. The MIM model is similar to letting {θij} be a specific sequence (a function of {a_k}, cf. equation 4.18), whereas the MP model tends to produce components of comparable magnitude, for which the approximation θij ≈ θ_{i′j′} holds.
Experimental data sets are often (perhaps even always) too small to compare higher-order interactions above about three-way; their estimates become less precise and would have potential estimation bias to be corrected. However, it might require fewer data to recognize that the {θij} are approximately equal or that they seem to follow the prediction of the {a_k} than to use the models as originally formulated. The two models might be distinguished by inspecting whether the second- and third-order interaction terms have alternating signs, that is, θ_{i1 i2 ...ik} = (−1)^k θ̃_{i1,ik} (k ≥ 2), so the third-order interaction is negative if the second-order interaction is positive, and vice versa. If there is a positive correlation of spikes between two bins (e.g., at the i1 and ik bins) and a positive correlation among the spikes for three bins (e.g., at the i1, ik, and il bins, where l is one of {2, . . . , k − 1}), the data cannot have come from a process described by the IM model. If the correlations are of opposite signs, then the data could have come from either model (in the MP model, the third-order interaction can be positive or negative, regardless
of the sign of the second-order interaction), and it would be necessary to examine higher-order interactions. There has been, to our knowledge, no experimental study suggesting alternating-sign interactions over the second- and third-order interactions. Previous studies found only positive (not negative) third-order interactions for single- and multiple-neuron spike trains, at least in a few cases (Abeles, Bergman, Margalit, & Vaadia, 1993; Riehle, Grün, Diesmann, & Aertsen, 1997; Prut et al., 1998; Oram et al., 1999; Lestienne & Tuckwell, 1998; Baker & Lemon, 2000). However, one needs to be cautious because it is possible that the investigators were mostly interested in finding positive interactions, whether second, third, or higher order. There are two caveats to consider when comparing the two models with data. First, other sources of experimental variability may add difficulties. For example, latency might vary across trials. This source of variability is not considered in the models, but it surely affects the interaction terms. Second, in fitting experimental data, the PSTH ({ηi}) was smoothed in both models, which reduces the effective dimensionality of the models. The MIM model had additional smoothing to estimate {η̃_{i,j}}, even further reducing the model dimensionality. Smoothing generally affects, or blurs, all interaction terms; if too severe, it will make it difficult to distinguish the two models. Either of these circumstances could mask real differences in the data. Despite the caveats, however, the results here suggest that experimental data might distinguish the models by comparing the two- and three-way relations among spikes. Finally, our approach is applicable to other descriptive models of single neuronal spike trains and can be combined with analysis of activity in neural populations (Nakahara & Amari, 2002b). Appendix A: Proof of Proposition 1 The IM assumption is given by, for any spike event x^1_{s(l)} (l ≤ n),
P[x^1_{s(l)} | x1, . . . , x_{s(l)−1}] = P[x^1_{s(l)} | x^1_{s(l−1)}, x^0_{s(l−1)+1}, . . . , x^0_{s(l)−1}] = K̃_{s(l−1),s(l)},

where l = 2, . . . , n (potentially n is up to N). Under the IM model, the probability of any spike train is written by two quantities: ηi and K̃_{i,j}. First, note that we have

P[XN = xN] = P[x^0_1, . . . , x^1_{s(1)}, . . . , x^0_{s(2)−1}] × P[x^1_{s(2)}, . . . , xN | x^0_1, . . . , x^1_{s(1)}, . . . , x^0_{s(2)−1}].

Second, under the IM assumption, we can write

P[x^1_{s(2)}, . . . , xN | x^0_1, . . . , x^1_{s(1)}, . . . , x^0_{s(2)−1}] = ∏_{l=2}^n q̃_{s(l),s(l+1)} K̃_{s(l−1),s(l)},  (A.1)
where we define

q̃_{s(l−1),s(l)} ≡ P[x^0_{s(l−1)+1}, x^0_{s(l−1)+2}, . . . , x^0_{s(l)−1} | x^1_{s(l−1)}]  (A.2)

for l = 2, . . . , n, with the convention s(n + 1) = N + 1. To see that {q̃_{i,j}} is determined solely by {K̃_{i,j}}, note that we have K̃_{i,j} = 1 − q̃_{i,j+1}/q̃_{i,j}. Given this identity, we have q̃_{i,j} = ∏_{l=2}^{j−i−1} (1 − K̃_{i,i+l}) q̃_{i,i+2} and q̃_{i,i+2} = 1 − K̃_{i,i+1}. Therefore we obtain

q̃_{i,j} ≡ P[x^0_{j−1}, x^0_{j−2}, . . . , x^0_{i+1} | x^1_i] = ∏_{l=1}^{j−i−1} (1 − K̃_{i,i+l}).  (A.3)
Third, by defining Pinit = P[x^0_1, . . . , x^1_{s(1)}, . . . , x^0_{s(2)−1}], we have, under the IM assumption,
Pinit = P[x^0_1, . . . , x^1_{s(1)}] P[x^0_{s(2)−1}, . . . , x^0_{s(1)+1} | x^1_{s(1)}] = P[x^0_1, . . . , x^1_{s(1)}] q̃_{s(1),s(2)}.

In the original work (Kass & Ventura, 2001), there was no explicit mention of how to treat the quantity P[x^0_1, . . . , x^1_{s(1)}], but with our understanding of its spirit and our preference for simplicity, we define this quantity as

P[x^0_1, . . . , x^1_{s(1)}] = [∏_{l=1}^{s(1)−1} (1 − ηl)] η_{s(1)}.

Then we obtain

Pinit = [∏_{l=1}^{s(1)−1} (1 − ηl)] η_{s(1)} q̃_{s(1),s(2)}.  (A.4)
Taken altogether, proposition 1 is proved.

Appendix B: Theorem 1

Theorem 1.

θ_{i1...il} = Σ_{m=0}^{l} (−1)^{l−m} Σ_{A∈Ωm(Xl*)} log p(A),  (B.1)

η_{i1...il} = Σ_{m=0}^{N−l} Σ_{A∈Ωm(X̄l*)} p({i1,i2,...,il}∪A),  (B.2)
where some conventional notation is introduced; Xl* indicates the specific l-tuple among XN, namely {X_{i1}, X_{i2}, . . . , X_{il}}. Ωm(Xl*) indicates the set of all possible m-tuples of Xl*. In equation B.1, given A ∈ Ωm(Xl*), A indicates each element of Ωm(Xl*), and the summation is taken over all the elements of Ωm(Xl*). The same convention applies to equation B.2, except that the summation is taken over Ωm(X̄l*), and X̄l* is defined such that XN = Xl* ∪ X̄l* and Xl* ∩ X̄l* = ∅. Given A = {j1, . . . , jm}, p(A) is used with the same notation as p(j1...jm) in equation B.1 and p({i1,i2,...,il}∪A) as p(i1 i2 ...il j1 ... jm) in equation B.2. The proof, which can be easily derived by using Rota's method, that is, the principle of inclusion-exclusion, is omitted (Hall, 1998). The IG measure, the η- and θ-coordinates together, effectively uses the dually flat structure of the η-coordinates and the θ-coordinates in SN (Amari, 2001; Nakahara & Amari, 2002b). This dual structure is the property of being e-flat and m-flat in more general terms (Amari & Nagaoka, 2000), which proved to underlie various useful properties of probability distributions, especially the exponential family of probability distributions. The IG measure makes explicit use of the dual structure, while it may be regarded as having roots, in a broad sense, in data analysis with log-linear models (Bishop, Fienberg, & Holland, 1975; Whittaker, 1990).

Appendix C: Proof of Theorem 2

It suffices to obtain θi and θ̃_{i,j} in order to get the values of the standard θ-coordinates of the IM model, since we have θ_{i1 i2 ...ik} = (−1)^k θ̃_{i1,ik} (k ≥ 2). To compute θi and θ̃_{i,j}, it suffices to obtain the expressions of p(0), p(i), and p(ij), due to equation 4.5. Using the notation q̃_{i,j} (see appendix A), we have p(0) = ∏_{i=1}^N (1 − ηi) and p(i) = ∏_{l=1}^{i−1} (1 − ηl) ηi q̃_{i,N+1}, p(ij) = ∏_{l=1}^{i−1} (1 − ηl) ηi q̃_{i,j} K̃_{i,j} q̃_{j,N+1}.
Therefore, by using the identity q̃_{i,j} = ∏_{l=1}^{j−i−1} (1 − K̃_{i,i+l}), we get

θi = log[ηi/(1 − ηi)] + Σ_{l=i+1}^N log[(1 − K̃_{i,l})/(1 − ηl)],

θ̃_{i,j} = log(K̃_{i,j}/ηj) − Σ_{l=j}^N log[(1 − K̃_{i,l})/(1 − ηl)].
By using K˜ i, j = η j K i, j , we get the expression in the theorem.
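Theorem 1's inclusion-exclusion form (equation B.1) is short to implement and can be checked against the explicit third-order expression θijk = log[p(ijk) p(i) p(j) p(k) / (p(0) p(ij) p(jk) p(ik))] given earlier in the text; the sketch below uses a hypothetical function name and an arbitrary valid probability table.

```python
import itertools
import math

def theta_general(p, idx, N):
    # Equation B.1 (sketch): inclusion-exclusion over all subsets A of idx.
    l = len(idx)
    total = 0.0
    for m in range(l + 1):
        for A in itertools.combinations(idx, m):
            x = tuple(1 if i in A else 0 for i in range(N))
            total += (-1) ** (l - m) * math.log(p[x])
    return total

# An arbitrary (valid) probability table on three bins.
N = 3
vals = [0.30, 0.10, 0.12, 0.05, 0.15, 0.08, 0.11, 0.09]
p = dict(zip(itertools.product([0, 1], repeat=N), vals))

def q(on):
    return p[tuple(1 if i in on else 0 for i in range(N))]

# Explicit third-order form from the text, for comparison.
explicit = math.log(q({0, 1, 2}) * q({0}) * q({1}) * q({2})
                    / (q(set()) * q({0, 1}) * q({1, 2}) * q({0, 2})))
```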
Acknowledgments H. N. is supported by Grant-in-Aid on Priority Areas (C) and the Fund from the US-Japan Brain Research Cooperative Program of MEXT, Japan. H. N. thanks M. Arisaka for technical assistance and O. Hikosaka for support. B. J. R. is supported by the IRP/NIMH/NIH/DHHS/USA. We acknowledge a JSPS Invitation Fellowship Program for covering B. J. R.’s expenses in Japan for some of this work and thank G. LaCamera and A. Lerchner for their comments on the draft.
References Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press. Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. Journal of Neurophysiology, 70, 1629–1638. Amari, S. (2001). Information geometry on hierarchical decomposition of stochastic interactions. IEEE Transaction on Information Theory, 47, 1701–1711. Amari, S., & Nagaoka, H. (2000). Methods of information geometry. New York: AMS and Oxford University Press. Amari, S., Nakahara, H., Wu, S., & Sakai, Y. (2003). Synfiring and higher-order interactions in neuron pool. Neural Computation, 15, 127–142. Baker, S. N., & Lemon, R. N. (2000). Precise spatiotemporal repeating patterns in monkey primary and supplementary motor areas occur at chance levels. Journal of Neurophysiology, 84, 1770–1780. Beggs, J. M., & Plenz, D. (2003). Neuronal avalanches in neocortical circuits. Journal of Neuroscience, 23(35), 11167–11177. Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252(5014), 1854–1857. Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis. Cambridge, MA: MIT Press. Brody, C. D. (1999). Correlations without synchrony. Neural Computation, 11(7), 1537– 1551. Brown, E. N., Barbieri, R., Ventura, V., Kass, R. E., & Frank, L. M. (2002). The timerescaling theorem and its application to neural spike train data analysis. Neural Computation, 14(2), 325–346. Dean, A. F. (1981). The variability of discharge of simple cells in the cat striate cortex. Experimental Brain Research, 44, 437–440. Fellous, J. M., Tiesinga, P. H., Thomas, P. J., & Sejnowski, T. J. (2004). Discovering spike patterns in neuronal responses. Journal of Neuroscience, 24(12), 2989– 3001. Gawne, T. J., & Richmond, B. J. (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? 
Journal of Neuroscience, 13(7), 2758– 2771.
Gershon, E. D., Wiener, M. C., Latham, P. E., & Richmond, B. J. (1998). Coding strategies in monkey V1 and inferior temporal cortices. Journal of Neurophysiology, 79, 1135–1144. Hall, M. Jr. (1998). Combinatorial theory (2nd ed.). New York: Wiley. Kass, R. E., & Ventura, V. (2001). A spike-train probability model. Neural Computation, 13(8), 1713–1720. Lee, D., Port, N. L., Kruse, W., & Georgopoulos, A. P. (1998). Variability and correlated noise in the discharge of neurons in motor and parietal areas of the primate cortex. Journal of Neuroscience, 18(3), 1161–1170. Lestienne, R., & Tuckwell, H. C. (1998). The significance of precisely replicating patterns in mammalian CNS spike trains. Neuroscience, 82, 315–336. Maynard, E. M., Hatsopoulos, N. G., Ojakangas, C. L., Acuna, B. D., Sanes, J. N., Normann, R. A., & Donoghue, J. P. (1999). Neuronal interactions improve cortical population coding of movement direction. Journal of Neuroscience, 19(18), 8083– 8093. Meister, M., & Berry, M. J. (1999). The neural code of the retina. Neuron, 22(3), 435–450. Nakahara, H., & Amari, S. (2002a). Information-geometric decomposition in spike analysis. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14, (pp. 253–260). Cambridge, MA: MIT Press. Nakahara, H., & Amari, S. (2002b). Information geometric measure for neural spikes. Neural Computation, 14, 2269–2316. Nakahara, H., Amari, S., & Richmond, B. J. (2005). A comparison of single spike train descriptive models by information geometric measure. (RIKEN BSI BSIS Tech. Rep. No. 05-1). Saitama: RIKEN. Nakahara, H., Amari, S., Tatsuno, M., Kang, S., & Kobayashi, K. (2002). Examples of applications of information geometric measure to neural data. (RIKEN BSI BSIS Tech. Rep. No. 02-1). Saitama: RIKEN. Nakahara, H., Nishimura, S., Inoue, M., Hori, G., & Amari, S. (2003). Gene interaction in DNA microarray data is decomposed by information geometric measure. 
Bioinformatics, 19, 1124–1131. Oram, M. W., Wiener, M. C., Lestienne, R., & Richmond, B. J. (1999). Stochastic nature of precisely timed spike patterns in visual system neuronal responses. Journal of Neurophysiology, 81(6), 3021–3033. Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Slovin, H., & Abeles, M. (1998). Spatiotemporal structure of cortical activity: Properties and behavioral relevance. Journal of Neurophysiology, 79, 2857–2874. Reich, D. S., Mechler, F., & Victor, J. D. (2001). Temporal coding of contrast in primary visual cortex: When, what, and why. Journal of Neurophysiology, 85(3), 1039–1050. Reid, R. C., Victor, J. D., & Shapley, R. M. (1992). Broadband temporal stimuli decrease the integration time of neurons in cat striate cortex. Visual Neuroscience, 9, 39–45. Richmond, B. J., & Optican, L. M. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. II. Quantification of response waveform. Journal of Neurophysiology, 57(1), 147–161. Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4(4), 569–579. Shinomoto, S., Sakai, Y., & Funahashi, S. (1999). The Ornstein-Uhlenbeck process does not reproduce spiking statistics of neurons in prefrontal cortex. Neural Computation, 11(4), 935–951. Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPS. Journal of Neuroscience, 13(1), 334–350. Stevens, C. F., & Zador, A. M. (1998). Input synchrony and the irregular firing of cortical neurons. Nature Neuroscience, 1(3), 210–217. Tolhurst, D. J., Movshon, J. A., & Dean, A. F. (1983). The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Research, 23(8), 775–785. Ventura, V., Carta, R., Kass, R. E., Gettner, S. N., & Olson, C. R. (2002). Statistical analysis of temporal evolution in single-neuron firing rates. Biostatistics, 3(1), 1–20. Victor, J. D., & Purpura, K. P. (1997). Metric-space analysis of spike trains: Theory, algorithms and application. Network, 8, 127–164. Whittaker, J. (1990). Graphical models in applied multivariate statistics. New York: Wiley. Wiener, M. C., & Richmond, B. J. (2003). Decoding spike trains instant by instant using order statistics and the mixture-of-Poissons model. Journal of Neuroscience, 23(6), 2394–2406.
Received October 22, 2004; accepted August 10, 2005.
LETTER
Communicated by Garrett Stanley
A Theoretical Analysis of the Influence of Fixational Instability on the Development of Thalamocortical Connectivity Antonino Casile
[email protected] Laboratory for Action Representation and Learning, Department of Cognitive Neurology, Hertie Institute for Clinical Brain Research, University Clinic, 72072 Tübingen, Germany
Michele Rucci
[email protected] Department of Cognitive and Neural Systems, Boston University, Boston, MA 02215, U.S.A.
Under natural viewing conditions, the physiological instability of visual fixation keeps the projection of the stimulus on the retina in constant motion. After eye opening, chronic exposure to a constantly moving retinal image might influence the experience-dependent refinement of cell response characteristics. The results of previous modeling studies have suggested a contribution of fixational instability to the Hebbian maturation of the receptive fields of V1 simple cells (Rucci, Edelman, & Wray, 2000; Rucci & Casile, 2004). This letter examines the origins of such a contribution. Using quasilinear models of lateral geniculate nucleus units and V1 simple cells, we derive analytical expressions for the second-order statistics of thalamocortical activity before and after eye opening. We show that in the presence of natural stimulation, fixational instability introduces a spatially uncorrelated signal in the retinal input, which strongly influences the structure of correlated activity in the model. This input signal produces a regime of thalamocortical activity similar to that present before eye opening and compatible with the Hebbian maturation of cortical receptive fields. 1 Introduction In the primary visual cortex (V1), distinct regions in the receptive fields of simple cells tend to receive afferents from either ON- or OFF-center neurons in the lateral geniculate nucleus (LGN) (Hubel & Wiesel, 1962; Reid & Alonso, 1995; Ferster, Chung, & Wheat, 1996). It is a long-standing proposal that this pattern of connectivity originates from a Hebbian stabilization of synchronously firing geniculate afferents onto common postsynaptic targets, which is initially driven by endogenous spontaneous activity and Neural Computation 18, 569–590 (2006)
© 2006 Massachusetts Institute of Technology
570
A. Casile and M. Rucci
later refined by visual experience (Stent, 1973; Changeux & Danchin, 1976; Miller, Erwin, & Kayser, 1999). This hypothesis is challenged by the substantially different structures of endogenous spontaneous activity and visually evoked responses. At the level of the retina, spontaneous activity appears to be correlated on a narrow spatial scale in the order of tens of arcmins (Mastronarde, 1983), whereas natural visual stimulation is known to be characterized by broad spatial correlations in the order of degrees of visual angle (Field, 1987; Burton & Moorhead, 1987; Ruderman, 1994). This difference raises the question as to how the same activity-dependent mechanism of synaptic plasticity could account for both the initial emergence and the later refinement of V1 receptive fields. A possible solution to this problem is represented by the fact that after eye opening, the statistics of neural activity depend not only on the visual scene but also on the observer’s behavior during the acquisition of visual information. Eye movements, in particular, with their direct impact on the sampling of visual information, may profoundly influence neural responses. Eye movements are a constant presence during natural viewing. In addition to saccades that relocate the direction of gaze every few hundred milliseconds, small fixational eye movements keep the eyes in constant motion even during the periods of fixation (Ratliff & Riggs, 1950; Yarbus, 1967; Steinman, Haddad, Skavenski, & Wyman, 1973; Ditchburn, 1980). Recent neurophysiological studies have shown that fixational eye movements strongly affect the responses of geniculate (Martinez-Conde, Macknik, & Hubel, 2002) and cortical neurons (Gur, Beylin, & Snodderly, 1997; Martinez-Conde et al., 2000; Leopold & Logothetis, 1998; Snodderly, Kagan, & Gur, 2001). 
Furthermore, experiments with kittens in which eye movements were prevented during the critical period have reported serious impairments in the maturation of characteristics of V1 neurons, such as orientation selectivity (Buisseret, Gary-Bobo, & Imbert, 1978; Gary-Bobo, Milleret, & Buisseret, 1986) and ocular dominance (Fiorentini, Maffei, & Bisti, 1979; Freeman & Bonds, 1979; Singer & Raushecker, 1982). In previous studies, we simulated the responses of lateral geniculate nucleus (LGN) and V1 neurons to analyze the second-order statistics of thalamic (Rucci, Edelman, & Wray, 2000) and thalamocortical activity (Rucci & Casile, 2004) before and after eye opening. Patterns of correlated activity were found to be consistent with a Hebbian maturation of simple cell receptive fields both in the presence of spontaneous activity and when images of natural scenes were scanned by eye movements, but not when the same images were examined in the absence of the retinal image motion produced by fixational instability. These results were highly robust. They were little affected by the precise characteristics of neuronal models and simulated oculomotor activity. In this letter, to better understand the possible influence of fixational instability on visual development, we used quasilinear models of LGN and V1 units to derive analytical expressions of the patterns of correlated
Fixational Instability and Thalamocortical Development
571
activity. We show that the similarity between the statistics of thalamocortical activity present in our model before and after eye opening originates from a decorrelation of the retinal input operated by fixational instability.

2 Measuring Correlated Activity in a Model of the LGN and V1

By definition, Hebbian synapses change their strengths proportionally to the levels of correlation between the responses of pre- and postsynaptic elements. In this article, instead of explicitly modeling synaptic changes, we estimated the emerging structure of thalamocortical connectivity by directly analyzing levels of correlation between geniculate and cortical cells. Since V1 neurons exhibit orientation-selective responses at the time of eye opening, the goal of this analysis was to examine the compatibility between a strictly Hebbian mechanism of synaptic plasticity and the preservation of preexisting patterns of connectivity. To determine whether a simple cell η would establish stronger connections with afferents from either ON- or OFF-center geniculate units at a location x within its receptive field, we evaluated the correlation difference map,

$$R_\eta(x) = \left\langle \eta_{x_\eta}(t)\left[\alpha^{ON}_{x_\alpha}(t) - \alpha^{OFF}_{x_\alpha}(t)\right]\right\rangle_{I,t},$$

where $\eta_{x_\eta}(t)$ is the activity of the cortical neuron, and $\alpha^{ON}_{x_\alpha}(t)$ and $\alpha^{OFF}_{x_\alpha}(t)$ are the responses of an ON- and an OFF-center geniculate cell with receptive fields centered at $x_\alpha$. The average is evaluated over time t and an ensemble of input images I. A positive value of $R_\eta(x)$ implies that the simple cell response is more strongly correlated with the response of an ON-center (rather than an OFF-center) geniculate unit with receptive field at relative separation $x = x_\eta - x_\alpha$. The opposite holds for a negative value of $R_\eta(x)$. $R_\eta$ can be seen as η's receptive field predicted by purely Hebbian synapses.
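Estimated from simulated response traces, the correlation difference map reduces to a simple average. The sketch below is an illustrative reconstruction; the array shapes and the function name are our own, not part of the original model:

```python
import numpy as np

def correlation_difference_map(eta, alpha_on, alpha_off):
    """Estimate R_eta at one retinotopic separation.

    eta, alpha_on, alpha_off: arrays of shape (n_images, n_timesteps)
    holding the responses of the cortical unit and of ON- and OFF-center
    geniculate units over an ensemble of input images.  The average runs
    over both the image ensemble (I) and time (t).
    """
    return float(np.mean(eta * (alpha_on - alpha_off)))
```

A positive value indicates stronger co-variation with the ON-center unit; a negative value, with the OFF-center unit.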
To preserve and refine the spatial organization of η's receptive field, $R_\eta$ needs to be positive at locations that correspond to ON subregions and negative at locations that correspond to OFF subregions.

2.1 Modeling Cell Responses. The responses of simple cells in V1 and of nonlagged ON- and OFF-center X cells in the LGN were modeled on the basis of the convolution between the visual input I(x, t) and the cell spatiotemporal kernel k(x, t). For both LGN and V1 units, we assumed a space-time separable kernel k(x, t) = s(x)h(t), where s(x) and h(t) represent the spatial and temporal components. Cell responses were obtained by rectifying the convolution output using a threshold $\theta$, that is, $\alpha(t) = k_\alpha(x, t) * I(x, t) - \theta$ if $k_\alpha(x, t) * I(x, t) > \theta$, and $\alpha(t) = 0$ otherwise.
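A minimal one-dimensional sketch of this response model, with a separable kernel and the rectification threshold exposed as a parameter `theta` (names and shapes are illustrative):

```python
import numpy as np

def unit_response(s_kernel, h_kernel, stimulus, theta=0.0):
    """Space-time separable linear drive k = s(x)h(t) applied to a stimulus
    of shape (n_x, n_t), followed by threshold rectification:
    output = drive - theta where drive > theta, and 0 elsewhere."""
    # spatial convolution at each time step, then temporal convolution at each location
    drive = np.apply_along_axis(
        lambda col: np.convolve(col, s_kernel, mode="same"), 0, stimulus)
    drive = np.apply_along_axis(
        lambda row: np.convolve(row, h_kernel, mode="same"), 1, drive)
    return np.where(drive > theta, drive - theta, 0.0)
```

With `theta = 0` this is half-wave rectification, the case used below for geniculate units; for simple cells the analysis takes the unrectified drive.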
572
A. Casile and M. Rucci
Spatial receptive fields of simple cells were modeled by means of Gabor filters,

$$s_\eta(x) = A_\eta \cos([\omega_\eta, 0] \cdot x + \phi)\, e^{-\frac{x^T R(\rho)^T \Sigma_\eta R(\rho)\, x}{2}},$$

where $A_\eta$ is the amplitude, $\Sigma_\eta = \mathrm{diag}(\sigma_{\eta x}^{-2}, \sigma_{\eta y}^{-2})$ is the inverse covariance matrix of the gaussian, $\omega_\eta$ and $\phi$ are the angular frequency and phase of the plane wave, and $R$ is a rotation matrix that introduces the angle $\rho$ between the major axis of the gaussian and the plane wave. Parameters were adjusted to model 10 simple cells following neurophysiological data from Jones and Palmer (1987a, Table 1). Spatial kernels of geniculate units were modeled as differences of gaussians,

$$s_\alpha(x) = A_{cnt}\, e^{-\frac{x^T x}{2\sigma_{cnt}^2}} - A_{srn}\, e^{-\frac{x^T x}{2\sigma_{srn}^2}},$$

where the subscripts indicate contributions from the receptive field center (cnt) and surround (srn). Kernel parameters followed neurophysiological data from Linsenmeier, Frishman, Jakiela, and Enroth-Cugell (1982) to model ON-center cells with receptive fields located between 5 and 15 degrees of visual eccentricity. At each angle of visual eccentricity, spatial receptive fields of modeled OFF-center cells were equal in magnitude and opposite in sign to those of ON-center units, i.e., $s_\alpha^{ON} = -s_\alpha^{OFF}$. Since many neurons in the LGN and V1 possess similar temporal dynamics (Alonso, Usrey, & Reid, 2001), for both cortical and geniculate units the temporal element h(t) was modeled as a difference of two gamma functions (DeAngelis, Ohzawa, & Freeman, 1993a; Cai, DeAngelis, & Freeman, 1997),

$$h_\alpha(t) = h_\eta(t) = k_1\, \Gamma(t; t_1, c_1, n_1) - k_2\, \Gamma(t; t_2, c_2, n_2),$$

where $\Gamma(t; t_0, c, n) = \frac{[c(t - t_0)]^n\, e^{-c(t - t_0)}}{n^n e^{-n}}$. Following data from Cai et al. (1997), temporal parameters were $t_1 = t_2 = 0$, $n_1 = n_2 = 2$, $k_1 = 1$, $k_2 = 0.6$, $c_1 = 60\,\mathrm{s}^{-1}$, $c_2 = 40\,\mathrm{s}^{-1}$. Previous studies in which cell responses were simulated during free viewing of natural images have shown that the second-order statistics of thalamocortical activity produced by this model are insensitive to the level of rectification (Rucci et al., 2000; Rucci & Casile, 2004). To probe into the origins of our previous simulation results, in this study we focused on the specific case of no rectification for simple cells and rectification with zero threshold for geniculate units. This assumption enables correlation
difference maps to be expressed as the product of linear geniculate and cortical responses,

$$R_\eta(x) = \left\langle \eta_{x_\eta}(t)\left[\alpha^{ON}_{x_\alpha}(t) - \alpha^{OFF}_{x_\alpha}(t)\right]\right\rangle_{I,t} = \left\langle \eta_{x_\eta}(t)\, \alpha_{x_\alpha}(t)\right\rangle_{I,t}, \tag{2.1}$$
where $\alpha(t) = \alpha^{ON}(t) - \alpha^{OFF}(t) = k_\alpha^{ON}(x, t) * I(x, t)$. While this choice of rectification parameters simplified the mathematical analysis of this letter, our previous simulation data ensure that results remain valid for a wide range of thresholds. In this letter, correlation difference maps were estimated on the basis of equation 2.1, without explicitly simulating cell responses. Examples of traces of neuronal activity can be found in Figure 6 of our previous study (Rucci & Casile, 2004).

3 Thalamocortical Activity Before Eye Opening

To establish a reference baseline, we first examined the structure of thalamocortical activity immediately before eye opening. Experimental evidence indicates that many of the response features of V1 cells are already present at the time of eye opening (Hubel & Wiesel, 1963; Blakemore & van Sluyters, 1975). Computational studies have shown that correlation-based mechanisms of synaptic plasticity are compatible with the emergence of simple cell receptive fields in the presence of endogenous spontaneous activity (Linsker, 1986; Miller, 1994; Miyashita & Tanaka, 1992). For simplicity, we restricted our analysis to the two-dimensional case of one spatial and one temporal dimension, by considering sections of the spatial receptive fields. The receptive fields of simple cells were sectioned along the axis orthogonal to the cell-preferred orientation. For LGN cells, we considered a section along a generic axis crossing the center of the receptive field. Results are, however, general and can be directly extended to the full 3D space-time case. In the presence of spontaneous retinal activity, levels of correlation between the responses of thalamic and cortical units can be estimated by means of linear system theory (Papoulis, 1984),

$$R_\eta(x) = \mathcal{F}^{-1}\{K_\eta(\omega_x, \omega_t)\, K_\alpha(\omega_x, \omega_t)\, C_{SA}(\omega_x, \omega_t)\}\big|_{t=0}, \tag{3.1}$$
where $\mathcal{F}^{-1}$ indicates the operation of inverse Fourier transform, $C_{SA}(\omega_x, \omega_t)$ is the power spectrum of spontaneous activity in the retina, and $K_\alpha(\omega_x, \omega_t)$ and $K_\eta(\omega_x, \omega_t)$ are the Fourier transforms of the LGN and V1 kernels. Under the model assumption of space-time separability of cell kernels, equation 3.1 gives

$$R_\eta(x) = T\, \mathcal{F}^{-1}\{S_\eta(\omega_x)\, S_\alpha(\omega_x)\, S_{SA}(\omega_x)\}, \tag{3.2}$$
where we also assumed space-time separability of the power spectrum of spontaneous retinal activity. $T$ is a multiplicative factor equal to $\int_{-\infty}^{\infty} H_{SA}(\omega_t)\, H_\eta(\omega_t)\, H_\alpha(\omega_t)\, d\omega_t$, and $S_{SA}(\omega_x)$, $H_{SA}(\omega_t)$, $S_\alpha(\omega_x)$, $H_\alpha(\omega_t)$, $S_\eta(\omega_x)$, and $H_\eta(\omega_t)$ are, respectively, the spatial and temporal components of the power spectrum of spontaneous retinal activity and of the Fourier transforms of the LGN and V1 kernels.

Data from Mastronarde (1983) show that retinal spontaneous activity is characterized by narrow spatial correlations. These data are accurately interpolated by gaussian functions. Least-squares interpolations of levels of correlation between ganglion cells at different separations produced gaussians with amplitude $A_{SA} = 13.9$ independent of the cell eccentricity and standard deviation $\sigma_{SA}$ that ranged from 0.18 degree at eccentricity 5 degrees to 0.35 degree at 25 degrees. Use in equation 3.2 of a gaussian approximation for retinal spontaneous activity gives, after some algebraic manipulations, an analytical expression for the structure of correlated activity,

$$R_\eta(x) \propto \hat{A}\, e^{-\frac{x^2}{2\hat\sigma^2}} \cos(\hat\omega x + \phi) + \tilde{A}\, e^{-\frac{x^2}{2\tilde\sigma^2}} \cos(\tilde\omega x + \phi) = \hat{R}_\eta(x) + \tilde{R}_\eta(x), \tag{3.3}$$

where the parameters are given by

$$\hat\sigma^2 = \sigma_\eta^2 + \sigma_{cnt}^2 + \sigma_{SA}^2, \qquad \tilde\sigma^2 = \sigma_\eta^2 + \sigma_{srn}^2 + \sigma_{SA}^2,$$

$$\hat\omega = \frac{\sigma_\eta^2}{\hat\sigma^2}\, \omega_\eta, \qquad \tilde\omega = \frac{\sigma_\eta^2}{\tilde\sigma^2}\, \omega_\eta,$$

$$\hat{A} = \frac{2\pi A_\eta A_{cnt} A_{SA}\, \sigma_{cnt} \sigma_\eta \sigma_{SA}}{\hat\sigma}\, e^{-\frac{\sigma_\eta^2 \omega_\eta (\omega_\eta - \hat\omega)}{2}}, \qquad \tilde{A} = \frac{2\pi A_\eta A_{srn} A_{SA}\, \sigma_{srn} \sigma_\eta \sigma_{SA}}{\tilde\sigma}\, e^{-\frac{\sigma_\eta^2 \omega_\eta (\omega_\eta - \tilde\omega)}{2}}. \tag{3.4}$$

Substitution of cell-receptive-field parameters in equation 3.4 yields $\hat{A} \gg \tilde{A}$ at all considered angles of visual eccentricity. Thus, the second term of equation 3.3 can be neglected, and correlation difference maps are described by Gabor functions:

$$R_\eta(x) \approx \hat{R}_\eta(x) = \hat{A}\, e^{-\frac{x^2}{2\hat\sigma^2}} \cos(\hat\omega x + \phi). \tag{3.5}$$
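The Gabor-like shape predicted by equation 3.5 can be checked numerically by evaluating the spatial part of equation 3.2 as a chain of convolutions between a Gabor section, a difference-of-gaussians kernel, and a gaussian correlation function. All parameter values below are illustrative, not the fitted ones:

```python
import numpy as np

x = np.linspace(-4.0, 4.0, 2049)                    # symmetric 1D grid (degrees)
dx = x[1] - x[0]

s_eta = np.cos(2 * np.pi * 1.5 * x) * np.exp(-x**2 / (2 * 0.4**2))  # V1 Gabor section, phi = 0
s_alpha = (np.exp(-x**2 / (2 * 0.1**2)) / 0.1
           - np.exp(-x**2 / (2 * 0.3**2)) / 0.3)                    # LGN center-surround
n_sa = np.exp(-x**2 / (2 * 0.2**2))                 # gaussian correlation of spontaneous activity

# spatial part of equation 3.2 as convolutions: R(x) = s_eta * s_alpha * N_SA
r_map = np.convolve(np.convolve(s_eta, s_alpha, mode="same"), n_sa, mode="same") * dx**2

# with phi = 0 every factor is even in x, so the resulting map is an even,
# Gabor-like profile whose envelope combines the three gaussian widths
```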
Since the spatial receptive fields of modeled V1 units are also represented by Gabor functions, the similarity between correlation difference maps and cortical receptive fields can be quantified by directly comparing the two parameters of the Gabor maps: $\hat\sigma$, the width of the gaussian, and $\hat\omega$, the spatial frequency of the plane wave. Figure 1 compares the correlation difference maps given by equation 3.5 to the receptive fields of modeled V1 units. Since the precise locations of the receptive fields of recorded cells were not reported by Jones and Palmer
Figure 1: Comparison between the spatial organization of simple cell receptive fields and the structure of thalamocortical activity for retinal inputs with various levels of spatial correlation. (a) Results for one of the 10 modeled simple cells in the case of spontaneous activity. The correlation difference maps ($R_\eta$) measured between the considered simple cell and arrays of geniculate units located around 5 and 15 degrees of visual eccentricity are compared to the profile of the receptive field (RF). (b) Comparison between the parameters of the Gabor functions that represented receptive fields and patterns of correlated activity. Dashed and solid curves show, respectively, the ratios $r_\omega = \hat\omega/\omega_\eta$ and $r_\sigma = \hat\sigma/\sigma_\eta$ evaluated in the presence of retinal spontaneous activity, white noise, and broad spatial correlations ($\sigma_{SA} = 1^\circ$) at the level of the retina. The closer these two ratios are to 1, the higher is the similarity between patterns of correlation and the spatial structure of simple cell receptive fields. Each curve represents average values over 10 modeled simple cells. Error bars represent standard deviations. The x-axis marks the angle of visual eccentricity of the receptive fields of geniculate units.
(1987a, 1987b), we estimated the patterns of correlation that each modeled V1 unit would establish with LGN cells located at various angles of visual eccentricity. Figure 1a shows an example for one of the modeled V1 units. The patterns of correlated activity measured at both 5 and 15 degrees of visual eccentricity closely resembled the receptive field profile of the cortical cell. The curves marked by solid triangles in Figure 1b show, respectively, the mean values of the two ratios $r_\omega = \hat\omega/\omega_\eta$ and $r_\sigma = \hat\sigma/\sigma_\eta$ evaluated over all 10 modeled V1 cells as a function of the visual eccentricity of geniculate units. Both ratios were close to 1 at all eccentricities, indicating a close matching between the patterns of correlated activity and the receptive fields of all simulated cells. The average values of the two indices of similarity were $\bar{r}_\sigma = 1.08 \pm 0.05$ and $\bar{r}_\omega = 0.86 \pm 0.07$, respectively. Thus, in the model, the structure of thalamocortical activity present immediately before eye opening matched the spatial organization of simple cell receptive fields. It is important to notice that the similarity between receptive fields and correlation difference maps shown in Figure 1 originated from the narrow
spatial correlations of spontaneous activity. Indeed, when no spatial correlation was present at the level of the retina, that is, when spontaneous activity was modeled as white noise ($C_{SA}(\omega_x, \omega_t) = 1$), correlation difference maps calculated from equation 3.1 were virtually identical to the simulated receptive fields. Mean ratios over all simulated cells and angles of visual eccentricity were $\bar{r}_\sigma = 1.03 \pm 0.02$ and $\bar{r}_\omega = 0.94 \pm 0.03$, indicating that correlation difference maps and cortical receptive fields were highly similar. This similarity did not occur in the presence of large input spatial correlations. For example, in the case of $\sigma_{SA} = 1^\circ$, the mean matching ratios were $\bar{r}_\sigma = 1.77 \pm 0.31$ and $\bar{r}_\omega = 0.35 \pm 0.14$. This analysis shows that the narrow correlations of spontaneous retinal activity were responsible for the compatibility between the structure of thalamocortical activity and the Hebbian maturation of cortical receptive fields observed in our previous modeling studies (Rucci et al., 2000; Rucci & Casile, 2004).

4 Thalamocortical Activity After Eye Opening

After eye opening, the assumption of narrow spatial correlations in the visual input is no longer valid. Luminance values in natural scenes are correlated over relatively large distances, as revealed by the power law profile of the power spectrum of natural images (Field, 1987; Burton & Moorhead, 1987; Ruderman, 1994). Figure 2 examines the impact of these broad input correlations on the structure of thalamocortical activity. Following the approach of section 3, correlation difference maps were given by

$$R_\eta(x) = C\, \mathcal{F}^{-1}\{S_\eta(\omega)\, S_\alpha(\omega)\, \mathcal{N}(\omega)\}, \tag{4.1}$$
where $\mathcal{N}(\omega)$ is the power spectrum of natural images and $C$ is a multiplicative factor equal to $H_\alpha(0) H_\eta(0)$. The power spectrum $\mathcal{N}(\omega)$ was estimated from a set of 15 natural images (van Hateren & van der Schaaf, 1998). Its radial mean was best interpolated by $\bar{\mathcal{N}}(\omega) \propto \omega^{-2.02}$, which is in agreement with previous measurements (Field, 1987; Ruderman, 1994). Similar to the results of our previous study (Rucci & Casile, 2004), patterns of correlated activity did not match the receptive fields of simple cells during static presentation of natural scenes. An example for one of the 10 modeled simple cells is shown in Figure 2a, which compares the profile of the cell receptive field to sections of the correlation difference maps measured at 5 and 15 degrees of visual eccentricity. The mismatch is particularly evident at the side lobes of the receptive field, where levels of correlation predicted stabilization of afferents from geniculate cells with the wrong polarity (ON- instead of OFF-center). Figure 2b shows average results obtained over the entire population of simulated simple cells. Since, in this case, an analytical expression of $R_\eta(x)$ was not available, correlation difference maps obtained by solving
Figure 2: Comparison between the spatial organization of simple cell receptive fields and the structure of thalamocortical activity in the case of static presentation of natural images. (a) Results for one of the 10 modeled simple cells. The two correlation difference maps ($R_\eta$) measured between the considered simple cell and arrays of geniculate units located around 5 and 15 degrees of visual eccentricity are compared to the profile of the simple cell receptive field (RF). (b) Average matching across the 10 modeled V1 units. Bars indicate the matching between correlation difference maps and cortical receptive fields evaluated both over the entire receptive field ($r_{RF}$) and only over the secondary lobes ($r_{SL}$) (see the text for details). The x-axis represents the angle of visual eccentricity of simulated geniculate units. Vertical lines represent the standard deviation.
numerically equation 4.1 were compared to receptive fields by means of the correlation coefficient $r_{RF}$. This index measures the similarity of two patterns. It varies between −1 and +1, with +1 indicating perfect matching and −1 perfect mirror symmetry. In addition to the mean correlation coefficient $r_{RF}$, a second, more specific correlation coefficient, $r_{SL}$, quantified the similarity between receptive fields and correlation difference maps only over the secondary lobes of cell receptive fields. At all considered eccentricities, a clear mismatch was present between correlation maps and receptive fields. Average correlation coefficients were $\bar{r}_{RF} = 0.65 \pm 0.09$ over the entire receptive fields and $\bar{r}_{SL} = -0.45 \pm 0.3$ over the secondary lobes. That is, contrary to the case of retinal spontaneous activity, the structure of correlated activity measured in the presence of the broad correlations of natural images was not compatible with a Hebbian refinement of the receptive fields of simple cells. The results of Figure 2 were obtained in the absence of eye movements. Under natural viewing conditions, however, the retinal image is always in motion, as small movements of the eyes, head, and body prevent maintenance of a steady direction of gaze. Results from previous computational studies have shown a strong influence of fixational instability on
the structure of correlated activity in models of the LGN and V1 (Rucci et al., 2000; Rucci & Casile, 2004). To examine the origins of this influence, in this article we model fixational instability by means of a two-dimensional ergodic process $\xi(t) = [\xi_x(t), \xi_y(t)]^T$. For simplicity, we assumed zero first-order moments ($\langle \xi_x(t) \rangle = 0$ and $\langle \xi_y(t) \rangle = 0$) and uncorrelated movements along the two axes ($R_{\xi_x \xi_y}(t) = 0$). By means of Taylor expansion, the luminance profile $I(\tilde{x})$ of a natural image in the neighborhood of a generic point $x$ can be approximated as $I(\tilde{x}) \approx I(x) + \nabla I(x)^T \cdot [\tilde{x} - x]$. Thus, if the average area covered by fixational instability is sufficiently small, the input to the retina during visual fixation can be approximated by

$$I_r(x, t) \approx I(x) + \nabla I(x)^T \cdot \xi(t) = I(x) + \frac{\partial I(x)}{\partial x}\, \xi_x(t) + \frac{\partial I(x)}{\partial y}\, \xi_y(t).$$
Using this approximation, we can estimate the responses of cortical and geniculate cells during visual fixation:

$$\eta_{x_0}(t) = k_\eta(\mu, \tau) * I_r(\mu, \tau)\big|_{(x_0, t)} \approx k_\eta(\mu, \tau) * I(\mu)\big|_{(x_0, t)} + k_\eta(\mu, \tau) * \frac{\partial I(\mu)}{\partial \mu_x}\, \xi_x(\tau)\Big|_{(x_0, t)} + k_\eta(\mu, \tau) * \frac{\partial I(\mu)}{\partial \mu_y}\, \xi_y(\tau)\Big|_{(x_0, t)} = \eta^S_{x_0}(t) + \eta^D_{x_0}(t)$$

$$\alpha_{x_1}(t) = k_\alpha(\mu, \tau) * I_r(\mu, \tau)\big|_{(x_1, t)} \approx k_\alpha(\mu, \tau) * I(\mu)\big|_{(x_1, t)} + k_\alpha(\mu, \tau) * \frac{\partial I(\mu)}{\partial \mu_x}\, \xi_x(\tau)\Big|_{(x_1, t)} + k_\alpha(\mu, \tau) * \frac{\partial I(\mu)}{\partial \mu_y}\, \xi_y(\tau)\Big|_{(x_1, t)} = \alpha^S_{x_1}(t) + \alpha^D_{x_1}(t),$$

where $x_0$ and $x_1$ are the locations of the receptive field centers and

$$\eta^S_{x_0}(t) = k_\eta(\mu, \tau) * I(\mu)\big|_{(x_0, t)}$$

$$\eta^D_{x_0}(t) = k_\eta(\mu, \tau) * \frac{\partial I(\mu)}{\partial \mu_x}\, \xi_x(\tau)\Big|_{(x_0, t)} + k_\eta(\mu, \tau) * \frac{\partial I(\mu)}{\partial \mu_y}\, \xi_y(\tau)\Big|_{(x_0, t)} = \eta^{D_x}_{x_0}(t) + \eta^{D_y}_{x_0}(t) \tag{4.2}$$

$$\alpha^S_{x_1}(t) = k_\alpha(\mu, \tau) * I(\mu)\big|_{(x_1, t)}$$

$$\alpha^D_{x_1}(t) = k_\alpha(\mu, \tau) * \frac{\partial I(\mu)}{\partial \mu_x}\, \xi_x(\tau)\Big|_{(x_1, t)} + k_\alpha(\mu, \tau) * \frac{\partial I(\mu)}{\partial \mu_y}\, \xi_y(\tau)\Big|_{(x_1, t)} = \alpha^{D_x}_{x_1}(t) + \alpha^{D_y}_{x_1}(t).$$
That is, cell responses can be decomposed into a static component with nonzero mean ($\eta^S$ and $\alpha^S$) and a zero-mean dynamic component introduced by fixational instability ($\eta^D$ and $\alpha^D$). $\eta^{D_x}$, $\alpha^{D_x}$, $\eta^{D_y}$, and $\alpha^{D_y}$ are the contributions to cell responses generated by the instability of visual fixation along the x- and y-axes. Given this decomposition, correlation difference maps can also be expressed as the sum of a static and a dynamic term:

$$R_\eta(x) = R^S_\eta(x) + R^D_\eta(x). \tag{4.3}$$
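The accuracy of this static-plus-dynamic split is easy to verify on a one-dimensional toy stimulus. Everything below (the smooth periodic profile, the gaussian receptive field section, the jitter amplitude) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2 * np.pi, 512, endpoint=False)
image = np.sin(3 * x) + 0.5 * np.cos(7 * x)              # smooth periodic luminance profile
grad = np.gradient(image, x)                             # spatial derivative of the stimulus

kernel = np.exp(-np.linspace(-1, 1, 65)**2 / (2 * 0.1**2))   # gaussian RF section

def filtered(signal):
    return np.convolve(signal, kernel, mode="same")

static = filtered(image)                                 # eta^S: independent of the jitter
xi = 0.005 * rng.standard_normal(50)                     # small zero-mean fixational jitter

worst = 0.0
for s in xi:
    shifted = np.interp(x + s, x, image, period=2 * np.pi)   # exactly jittered input
    exact = filtered(shifted)
    approx = static + s * filtered(grad)                 # eta^S + eta^D(t), as in equation 4.2
    worst = max(worst, np.max(np.abs(exact - approx)))
# for jitter much smaller than the stimulus scale, the first-order
# decomposition tracks the exact response closely
```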
Indeed, from our assumptions on the statistical moments of fixational instability, it follows that only three of the nine terms obtained by direct multiplication of the responses $\eta_{x_0}(t)$ and $\alpha_{x_1}(t)$ have nonzero means. The first of these terms is given by

$$\langle \eta^S_{x_0}(t)\, \alpha^S_{x_1}(t) \rangle_{\xi, I, t} = \big\langle \big(k_\eta(\mu, \tau) * I(\mu)\big|_{(x_0, t)}\big)\big(k_\alpha(\mu, \tau) * I(\mu)\big|_{(x_1, t)}\big)\big\rangle_{\xi, I, t} = \big(s_\eta(\mu) * s_\alpha(-\mu) * N(\mu)\big|_{x_1 - x_0}\big) \int_{-\infty}^{\infty} h_\alpha(\tau)\, d\tau \int_{-\infty}^{\infty} h_\eta(\tau)\, d\tau = C\, s_\eta(\mu) * s_\alpha(-\mu) * N(\mu)\big|_{x},$$
where $N(x)$ is the autocorrelation function of natural images. Since this term depends only on the static components of cell responses, it represents the correlation difference map that would be obtained in the absence of fixational instability (see equation 4.1). The second term is given by
$$\langle \eta^{D_x}_{x_0}(t)\, \alpha^{D_x}_{x_1}(t) \rangle_{\xi, I, t} = \Big\langle \Big(k_\eta(\mu, \tau) * \frac{\partial I(\mu)}{\partial \mu_x}\, \xi_x(\tau)\Big|_{(x_0, t)}\Big)\Big(k_\alpha(\mu, \tau) * \frac{\partial I(\mu)}{\partial \mu_x}\, \xi_x(\tau)\Big|_{(x_1, t)}\Big)\Big\rangle_{\xi, I, t} = \big(s_\eta(\mu) * s_\alpha(-\mu) * N_x(\mu)\big|_{x_1 - x_0}\big)\, h_\eta(\tau) * h_\alpha(-\tau) * R_{\xi_x \xi_x}(\tau)\big|_{\tau = 0} = D\, s_\eta(\mu) * s_\alpha(-\mu) * N_x(\mu)\big|_{x}, \tag{4.4}$$
Figure 3: Comparison between the power spectrum of natural images $\mathcal{N}(\omega)$ and the dynamic power spectrum $\mathcal{N}'(\omega) = \mathcal{N}_x(\omega) + \mathcal{N}_y(\omega)$ given by the sum of the power spectra of the x and y components of the gradient of natural images. The two curves are radial averages over a set of 15 natural images.
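The contrast shown in Figure 3 can be reproduced on synthetic data: build a random-phase image with a 1/|ω|² power spectrum and compare band averages of the spectra of the image and of its gradient. This is a sketch; the discrete `np.gradient` only approximates the |ω|² weighting at low spatial frequencies:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 256
fx = np.fft.fftfreq(n)[:, None]
fy = np.fft.fftfreq(n)[None, :]
f = np.hypot(fx, fy)
f[0, 0] = 1.0                                  # avoid division by zero at DC

# random-phase field with amplitude ~ 1/|w|, i.e. power spectrum ~ 1/|w|^2
field = np.fft.ifft2(np.exp(2j * np.pi * rng.random((n, n))) / f).real
static_spec = np.abs(np.fft.fft2(field))**2    # ~ |w|^-2, like natural images
gx, gy = np.gradient(field)
dynamic_spec = np.abs(np.fft.fft2(gx))**2 + np.abs(np.fft.fft2(gy))**2

def band_mean(spec, lo, hi):
    """Average spectral power over an annulus of spatial frequencies."""
    mask = (f >= lo) & (f < hi)
    return spec[mask].mean()

# the static spectrum falls steeply from one frequency band to the next,
# while the gradient ("dynamic") spectrum stays roughly flat
```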
where $N_x(\mu)$ is the autocorrelation function of the first component of the gradient of natural images (the derivative along the x-axis). $D$ is a constant equal to $\int_{-\infty}^{\infty} H_\eta(\omega_t)\, H_\alpha(\omega_t)\, \hat{R}_{\xi_x \xi_x}(\omega_t)\, d\omega_t$, where $\hat{R}_{\xi_x \xi_x}(\omega_t)$ indicates the Fourier transform of $R_{\xi_x \xi_x}(t)$. By using a similar procedure, we obtain the third term,

$$\langle \eta^{D_y}_{x_0}(t)\, \alpha^{D_y}_{x_1}(t) \rangle_{\xi, I, t} = D\, s_\eta(\mu) * s_\alpha(-\mu) * N_y(\mu)\big|_{x},$$

where $N_y(\mu)$ is the autocorrelation function of the second component of the gradient of natural images (the derivative along the y-axis). By adding these three terms and defining $N'(\mu) = N_x(\mu) + N_y(\mu)$, we obtain

$$R_\eta(x) = C\, s_\eta(\mu) * s_\alpha(-\mu) * N(\mu)\big|_{x} + D\, s_\eta(\mu) * s_\alpha(-\mu) * N'(\mu)\big|_{x} = R^S_\eta(x) + R^D_\eta(x), \tag{4.5}$$
which proves equation 4.3. Equation 4.5 shows that fixational instability adds a contribution $R^D_\eta(x)$ to the correlation map $R^S_\eta(x)$ obtained with presentation of the same stimuli in the absence of retinal image motion. Whereas in the absence of fixational instability, levels of correlation depend on the autocorrelation function of the stimulus $N(x)$ (or, equivalently, its power spectrum $\mathcal{N}(\omega)$), the term $R^D_\eta(x)$ introduced by the jittering of the eye depends on the autocorrelation function of the gradient of the stimulus, $N'(x)$ (or, equivalently, its power spectrum $\mathcal{N}'(\omega)$, the dynamic power spectrum). Figure 3 compares $\mathcal{N}(\omega)$ and $\mathcal{N}'(\omega)$ for the case of images of natural scenes. Whereas $\mathcal{N}(\omega)$ followed, as expected, a power law with exponent
Figure 4: Comparison between the spatial organization of simple cell receptive fields and patterns of correlated activity measured when images of natural scenes were examined in the presence of fixational instability (the term $R^D_\eta(x)$ in equation 4.5). The layout of the panels and the graphic notation are the same as in Figure 2.
approximately equal to −2, the dynamic power spectrum $\mathcal{N}'(\omega)$ was almost flat up to a cutoff frequency of about 10 cycles/deg; that is, it was uncorrelated. Thus, in the presence of natural images, fixational instability adds an input signal that discards spatial correlations. It should be observed that the whitening of the dynamic power spectrum is a direct consequence of the scale invariance of natural images and has a simple explanation in the frequency domain. Since the Fourier transforms of the two partial derivatives $\partial I(x)/\partial x$ and $\partial I(x)/\partial y$ are, respectively, proportional to $\omega_x \tilde{I}(\omega)$ and $\omega_y \tilde{I}(\omega)$, the two power spectra $\mathcal{N}_x(\omega)$ and $\mathcal{N}_y(\omega)$ are proportional to $\omega_x^2 \mathcal{N}(\omega)$ and $\omega_y^2 \mathcal{N}(\omega)$. Thus, $\mathcal{N}'(\omega) = \mathcal{N}_x(\omega) + \mathcal{N}_y(\omega) \propto |\omega|^2 \mathcal{N}(\omega)$. For images of natural scenes, $\mathcal{N}(\omega) \propto |\omega|^{-2}$ (Field, 1987), and the product $|\omega|^2 \mathcal{N}(\omega)$ produces a dynamic power spectrum $\mathcal{N}'(\omega)$ with uniform spectral density. In other words, our analysis shows that whereas the intensity values of natural images tend to be correlated over large distances, local changes in intensity around pairs of pixels are uncorrelated. Therefore, fixational instability represents an optimal decorrelation strategy for visual input with a power spectrum that declines as $|\omega|^{-2}$. We have already shown in Figure 2 that the patterns of correlated activity $R^S_\eta(x)$ measured with static presentation of images of natural scenes did not match the receptive fields of modeled simple cells. Figure 4 analyzes the contribution of fixational instability, the term $R^D_\eta(x)$ in equation 4.5, to the structure of correlated activity. In this case, correlation difference maps closely resembled the spatial organization of cortical receptive fields irrespective of the eccentricity of simulated geniculate units. The mean
582
A. Casile and M. Rucci
matching index was $\bar{r}_{RF} = 0.98 \pm 0.006$ over the entire receptive fields and $\bar{r}_{SL} = 0.92 \pm 0.06$ over the secondary lobes. That is, each simple cell established strong correlations with either ON- or OFF-center geniculate units only when the receptive fields of these units overlapped an ON or an OFF subregion. Similar to the case of spontaneous retinal activity, this pattern of correlated activity is compatible with a Hebbian refinement of simple cell receptive fields. To summarize, equation 4.5 shows that in the presence of the self-motion of the retinal image that occurs during natural viewing, the second-order statistics of thalamocortical activity depend on both the spatial configuration of the stimulus and how its retinal projection changes during visual fixation. The first component is represented in equation 4.5 by the term $R^S_\eta$, which depends on the power spectrum of the stimulus $\mathcal{N}(\omega)$. The latter component is represented by $R^D_\eta$, which is determined by the dynamic power spectrum $\mathcal{N}'(\omega)$, a spectrum that discards the broad spatial correlations of natural images. Of these two terms, only $R^D_\eta$ matches the spatial organization of simple cell receptive fields (compare Figure 4 with Figure 2). The overall structure of correlated activity is given by the weighted sum of the results of Figures 2 and 4. During fixational instability, the relative influence of $R^S_\eta$ and $R^D_\eta$ depends on two elements: (1) the powers of the two inputs $\mathcal{N}(\omega)$ and $\mathcal{N}'(\omega)$ and (2) neuronal sensitivity to both input signals. In natural images, most energy is concentrated at low spatial harmonics. Since $\mathcal{N}'(\omega)$ attenuates the low spatial frequencies of the stimulus, it tends to possess less power than $\mathcal{N}(\omega)$. For example, for the two power spectra in Figure 3, the ratio of dynamic to static power in the range 0 to 10 cycles/deg was only 0.08. That is, $\mathcal{N}(\omega)$ provided over 10 times more power than $\mathcal{N}'(\omega)$ within the main spatial range of sensitivity of geniculate cells.
However, in equation 4.5, $\mathcal{N}(\omega)$ and $\mathcal{N}'(\omega)$ are modulated by the multiplicative terms $C$ and $D$, which depend on the temporal characteristics of cell responses (both $C$ and $D$) and of fixational instability ($D$ only). Since geniculate neurons respond more strongly to changing stimuli than to stationary ones, $D$ tends to be higher than $C$. For example, a retinal image motion with gaussian temporal correlation (the term $R_{\xi_x \xi_x}$ in equation 4.4) characterized by a standard deviation of 30 ms and a mean amplitude of 10, values that are consistent with the instability of fixation of several species, produced a ratio $D/C \approx 950$. Thus, although $\mathcal{N}'(\omega)$ provided less power than $\mathcal{N}(\omega)$, the weighted ratio of the total power $(D \mathcal{N}'(\omega))/(C \mathcal{N}(\omega))$ in the range 0 to 10 cycles/deg was approximately 76. Since the term $R^D_\eta$ dominated the sum of equation 4.5, the matching between correlation difference maps and receptive fields increased from $\bar{r}_{RF} = 0.65 \pm 0.09$ and $\bar{r}_{SL} = -0.45 \pm 0.3$ (the values obtained with static presentation of natural images) to $\bar{r}_{RF} = 0.90 \pm 0.05$ and $\bar{r}_{SL} = 0.12 \pm 0.55$. That is, in the presence of fixational instability, the responses of simulated cortical units tended to be correlated with those of geniculate units with the correct polarity.
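The bookkeeping in this paragraph is simple arithmetic, worth making explicit: the dynamic term overcomes its raw power deficit through the temporal gain:

```python
# numbers quoted in the text for the 0-10 cycles/deg band
power_ratio = 0.08       # band power of the dynamic spectrum relative to the static one
temporal_gain = 950.0    # D / C for the quoted jitter statistics

weighted_ratio = temporal_gain * power_ratio   # ~76: R_D outweighs R_S in equation 4.5
```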
It is important to observe that several mechanisms might further enhance the impact of fixational instability on the refinement of thalamocortical connectivity. A first possibility is a rule of synaptic plasticity that depends on the covariance (and not the correlation) between the responses of pre- and postsynaptic elements (Sejnowski, 1977): $R_\eta(x) = \langle (\eta(t) - \bar\eta)(\alpha(t) - \bar\alpha) \rangle$. In the case in which mean activity levels are estimated over periods of fixation, $\bar\eta = \eta^S$ and $\bar\alpha = \alpha^S$, yielding $R_\eta(x) = R^D_\eta(x)$. Thus, the term $R^S_\eta(x)$ does not affect synaptic plasticity, and the structure of thalamocortical activity is compatible with the spatial organization of the receptive fields of simple cells. This is consistent with the results of our previous simulations in which we analyzed the statistics of geniculate activity during natural viewing (Rucci et al., 2000). A second mechanism that might enhance the influence of fixational instability is a nonlinear attenuation of the responses of simple cells to unchanging stimuli. Systematic deviations from linearity have been observed in the responses of simple cells. In particular, it has been shown that responses to stationary stimuli tend to decline faster and give lower steady-state levels of activity than would be expected from linear predictions (Tolhurst, Walker, Thompson, & Dean, 1980; DeAngelis et al., 1993a). This attenuation can be incorporated into our model by assuming that after an initial transitory period following the onset of visual fixation, a simple cell responds as $\eta(t) = (1 - \beta)\, \eta^S(t) + \eta^D(t)$, where the constant $\beta \in [0, 1]$ defines the degree of attenuation. With this modification, correlation difference maps are given by

$$R_\eta(x) = (1 - \beta)\, R^S_\eta(x) + R^D_\eta(x). \tag{4.6}$$
Figure 5 compares the receptive fields of simulated simple cells with the correlation difference maps estimated with various degrees of attenuation. It is clear by comparing these data to those of Figure 2 that even a partial attenuation of cortical responses to unchanging stimuli resulted in a substantial improvement in the degree of similarity between patterns of correlation and receptive fields. A 60% attenuation was sufficient to produce an almost perfect matching ($\bar{r}_{RF} = 0.97 \pm 0.02$ and $\bar{r}_{SL} = 0.50 \pm 0.44$). Thus, consistent with our previous simulations of thalamocortical activity (Rucci & Casile, 2004), in the presence of fixational instability, a nonlinear attenuation of simple cell responses leads to a regime of correlated activity that is compatible with a Hebbian refinement of the spatial organization of simple cell receptive fields.
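Equation 4.6 interpolates between the static-dominated and purely dynamic regimes. A one-line sketch (the array names are ours):

```python
import numpy as np

def correlation_map(r_static, r_dynamic, beta):
    """Equation 4.6: attenuating static cortical responses by a factor beta
    scales the static term of the correlation difference map by (1 - beta)."""
    return (1.0 - beta) * np.asarray(r_static) + np.asarray(r_dynamic)
```

Setting beta = 0 recovers the full map of equation 4.5, while beta = 1 leaves only the dynamic term, the same limit reached under a covariance rule.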
Figure 5: Effect of nonlinear attenuation of simple cell responses to unchanging stimuli. (a) Results for one of the 10 modeled simple cells. The correlation difference maps ($R_\eta$) estimated from equation 4.6 for three values of the attenuation factor $\beta$ are compared to the profile of the simple cell receptive field (RF). (b) Mean matching indices over all modeled V1 units as a function of the attenuation factor. Both correlation coefficients evaluated over the entire receptive field ($r_{RF}$) and over the secondary subregions ($r_{SL}$) are shown. Parameters of LGN units corresponded to an eccentricity of 10 degrees.
5 Conclusions

Many of the response characteristics of V1 neurons develop before eye opening and refine with exposure to pattern vision (Hubel & Wiesel, 1963; Blakemore & van Sluyters, 1975; Buisseret & Imbert, 1976; Pettigrew, 1974). After eye opening, small eye and body movements keep the retinal image in constant motion. The statistical analysis of this article, together with the results of our previous simulations (Rucci et al., 2000; Rucci & Casile, 2004), indicates that the physiological instability of visual fixation contributes to decorrelating cell responses to natural stimuli and establishing a regime of neural activity similar to that present before eye opening. Thus, at the time of eye opening, no sudden change occurs in the second-order statistics of thalamocortical activity, and the same correlation-based mechanism of synaptic plasticity can account for both the initial emergence and the later refinement of simple cell receptive fields. In this study, we have used independent models of LGN and V1 neurons to examine whether the structure of thalamocortical activity is compatible with a Hebbian maturation of the spatial organization of simple cell receptive fields. The results of our analysis are consistent with a substantial body of previous modeling work. Before eye opening, in the presence of spontaneous retinal activity, a modeled simple cell established strong correlations with ON- and OFF-center geniculate units only when the receptive fields of these units overlapped, respectively, the ON and OFF subregions
Fixational Instability and Thalamocortical Development
585
within its receptive field. This pattern of correlated activity is in agreement with the results of previous studies that modeled the activity-dependent development of cortical orientation selectivity (Linsker, 1986; Miller, 1994; Miyashita & Tanaka, 1992). After eye opening, the visual system is exposed to the broad spatial correlations of natural scenes. In the absence of retinal image motion, these input correlations would coactivate geniculate units with the same polarity (ON- or OFF-center) and with receptive fields at relatively large separations, a pattern of neural activity that is not compatible with a Hebbian refinement of simple cell receptive fields. During natural fixation, however, neurons receive input signals that vary in time as their receptive fields move with the eye (Gur & Snodderly, 1997). This study shows that in the presence of images of natural scenes, these input fluctuations lack spatial correlations. In the model, this spatially uncorrelated input signal strongly influenced neuronal responses and produced patterns of thalamocortical activity that were similar to those measured immediately before eye opening. Thus, our analysis shows that a direct scheme of Hebbian plasticity can be added to the category of activity-dependent mechanisms compatible with the maturation of cortical receptive fields in the presence of decorrelated natural visual input (Law & Cooper, 1994; Olshausen & Field, 1996). The fact that fixational instability might have such a strong effect on the development of cortical receptive fields should not come as a surprise. Consistent with the results of our analysis, several experimental studies have shown that prevention and manipulation of eye movements during the critical period disrupt the maturation of the response properties of cortical neurons (for a review, see Buisseret, 1995).
For example, no restoration of cortical orientation selectivity (Gary-Bobo et al., 1986; Buisseret et al., 1978) or ocular dominance (Freeman & Bonds, 1979; Singer & Rauschecker, 1982) is observed in dark-reared kittens exposed to visual stimulation with their eye movements prevented. In addition, neurophysiological results have shown that fixational eye movements strongly influence the responses of geniculate and cortical neurons (Gur et al., 1997; Leopold & Logothetis, 1998; Martinez-Conde et al., 2002). In the primary visual cortex of the monkey, bursts of spikes have been recorded following fixational saccades (Martinez-Conde, Macknik, & Hubel, 2000), and distinct neuronal populations have been found that selectively respond to the two main components of fixational eye movements, saccades and drifts (Snodderly et al., 2001). This study relied on two important assumptions. The first was the use of linear models to predict cell responses to visual stimuli. Linear spatiotemporal models enabled the derivation of analytical expressions for the levels of correlation in thalamocortical activity. A substantial body of evidence shows that LGN X cells act predominantly as linear filters. Responses to drifting gratings contain most power in the first harmonic (So & Shapley, 1981), and responses to both flashed and complex naturalistic stimuli are well captured by linear predictors (Stanley, Li, & Dan, 1999; Cai et al., 1997).
586
A. Casile and M. Rucci
Also, the responses of V1 simple cells contain a strong linear component (Jones & Palmer, 1987b; DeAngelis, Ohzawa, & Freeman, 1993b). However, for these neurons, important deviations from linearity have been reported. In particular, it has been observed that responses to stationary stimuli decline faster and settle on lower steady-state levels than would be expected from linear predictions (Tolhurst et al., 1980; DeAngelis et al., 1993a). We have shown that a nonlinear attenuation of cortical responses to unchanging stimuli enhances the influence of fixational instability on the structure of correlated activity. In the model, the broad correlations of natural scenes had little impact on the second-order statistics of thalamocortical activity in the presence of strong nonlinear attenuation. A second assumption concerned the way we modeled the self-motion of the retinal image. In this study, the physiological instability of visual fixation was modeled as a zero-mean stochastic process with uncorrelated components along the two Cartesian axes. These assumptions simplified our analysis and led to the elimination of several terms in equation 4.5. However, the results presented in this article do not critically depend on them. Simulations in which retinal image motion replicated the cat’s oculomotor behavior have produced patterns of correlated activity that are very similar to the theoretical predictions of this study (Rucci et al., 2000; Rucci & Casile, 2004). Furthermore, although a statistical analysis of the instability of visual fixation under natural viewing conditions has not been performed, the motion of the retinal image as subjectively experienced by a jitter after-effect (Murakami & Cavanagh, 1998) appears to be compatible, at least qualitatively, with our modeling assumptions. It is worth emphasizing that during natural viewing, other elements, in addition to eye movements, contribute to the instability of visual fixation. 
In particular, small movements of the head and body and imperfections in the vestibulo-ocular reflex (Skavenski, Hansen, Steinman, & Winterson, 1979) are known to amplify the self-motion of the retinal image. Our analysis aims to address the joint effect of all these movements. It can be shown analytically that the factor D, which in equation 4.5 modulates the impact of a moving retinal stimulus (the term $R_\eta^D(x)$), depends quadratically on the spatial extent of fixational instability. Therefore, within the limits of validity of the Taylor approximation of equation 4.2, the larger the amplitude of fixational instability, the stronger its influence on the structure of correlated activity. It should also be noted that while this article focuses on the examination of static images of natural scenes, our analysis applies to any jittering stimulus on the retina, regardless of the origin of the motion, whether self-generated or external. For example, the trembling of leaves on a tree exposed to the wind might produce a decorrelation of neural activity similar to that of fixational instability. Our results appear to contrast with a previous proposal according to which the spatial response characteristics of retinal and geniculate neurons are sufficient to decorrelate the spatial signals provided by images of natural
scenes (Atick & Redlich, 1992). According to this hypothesis, a neuronal sensitivity function that increases linearly with spatial frequency would counterbalance the power spectrum of natural images and produce a decorrelated pattern of neural activity. However, neurophysiological recordings have shown that in both the cat and the monkey, the frequency responses of cells in the retina and the LGN deviate significantly from linearity in the low spatial frequency range (So & Shapley, 1981; Linsenmeier et al., 1982; Derrington & Lennie, 1984; Croner & Kaplan, 1995). Such deviation is not compatible with Atick and Redlich's proposal and, in the absence of fixational instability, would lead to a regime of thalamocortical activity strongly influenced by the broad spatial correlations of natural images (see Figure 2). In contrast to this static decorrelation mechanism, the decorrelation of visual input produced by fixational instability does not depend on the spatial response properties of geniculate and cortical units. Thus, the proposed mechanism is highly robust with respect to individual neuronal differences in spatial contrast sensitivity functions. While in this study we have focused on the developmental consequences of chronic exposure to fixational instability, our results also have important implications concerning the way visual information is represented in the early visual system. A number of recent studies have suggested an important role for fixational instability in the neural encoding of visual stimuli (Ahissar & Arieli, 2001; Greschner, Bongard, Rujan, & Ammermüller, 2002; Snodderly et al., 2001). The results presented here suggest that fixational instability, by decreasing statistical dependencies between neural responses, might contribute to discarding broad input correlations, thus establishing efficient visual representations of natural visual scenes (Barlow, 1961; Attneave, 1954).
Further theoretical and experimental studies are needed to characterize and test this hypothesis.

Acknowledgments

We thank Matthias Franz, Alessandro Treves, and Martin Giese for many helpful comments on a preliminary version of this article. This work was supported by the Volkswagen Stiftung, the National Institutes of Health grant EY015732-01, and the National Science Foundation grants CCF-0432104 and CCF-0130851. Correspondence and requests for materials should be addressed to A.C.
References

Ahissar, E., & Arieli, A. (2001). Figuring space by time. Neuron, 32, 185–201. Alonso, J., Usrey, M., & Reid, R. C. (2001). Rules of connectivity between geniculate cells and simple cells in cat primary visual cortex. J. Neurosci., 21(11), 4002–4015.
Atick, J. J., & Redlich, A. N. (1992). What does the retina know about natural scenes? Neural Comput., 4, 196–210. Attneave, F. (1954). Some informational aspects of visual perception. Psychol. Rev., 61(3), 183–193. Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication. Cambridge, MA: MIT Press. Blakemore, C., & van Sluyters, R. C. (1975). Innate and environmental factors in the development of the kitten’s visual cortex. J. Physiol., 248, 663–716. Buisseret, P. (1995). Influence of extraocular muscle proprioception on vision. Physiol. Rev., 75(2), 323–338. Buisseret, P., Gary-Bobo, E., & Imbert, M. (1978). Ocular motility and recovery of orientational properties of visual cortical neurons in dark-reared kittens. Nature, 272, 816–817. Buisseret, P., & Imbert, M. (1976). Visual cortical cells: Their developmental properties in normal and dark-reared kittens. J. Physiol., 255, 511–525. Burton, G. J., & Moorhead, I. R. (1987). Color and spatial structure in natural scenes. Appl. Opt., 26, 157–170. Cai, D., DeAngelis, G. C., & Freeman, R. D. (1997). Spatiotemporal receptive field organization in the lateral geniculate nucleus of cats and kittens. J. Neurophysiol., 78, 1045–1061. Changeux, J. P., & Danchin, A. (1976). Selective stabilization of developing synapses as a mechanism for the specification of neuronal networks. Nature, 264, 705–712. Croner, L. J., & Kaplan, E. (1995). Receptive fields of P and M ganglion cells across the primate retina. Vision Res., 35(1), 7–24. DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1993a). Spatiotemporal organization of simple-cell receptive fields in the cat’s striate cortex. I. General characteristics and postnatal development. J. Neurophysiol., 69(4), 1091–1117. DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1993b). Spatiotemporal organization of simple-cell receptive fields in the cat’s striate cortex. II. 
Linearity of temporal and spatial summation. J. Neurophysiol., 69(4), 1118–1135. Derrington, A. M., & Lennie, P. (1984). Spatial and temporal contrast sensitivities of neurones in lateral geniculate nucleus of macaque. J. Physiol., 357, 219– 240. Ditchburn, R. (1980). The function of small saccades. Vision Res., 20, 271–272. Ferster, D., Chung, S., & Wheat, H. (1996). Orientation selectivity of thalamic input to simple cells of cat visual cortex. Nature, 380, 249–252. Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4, 2379–2394. Fiorentini, A., Maffei, L., & Bisti, S. (1979). Change of binocular properties of cortical cells in the central and paracentral visual field projections of monocularly paralyzed cats. Brain. Res., 171, 541–544. Freeman, R. D., & Bonds, A. B. (1979). Cortical plasticity in monocularly deprived immobilized kittens depends on eye movement. Science, 206, 1093–1095. Gary-Bobo, E., Milleret, C., & Buisseret, P. (1986). Role of eye movements in developmental processes of orientation selectivity in the kitten visual cortex. Vision Res., 26(4), 557–567.
Greschner, M., Bongard, M., Rujan, P., & Ammermüller, J. (2002). Retinal ganglion cell synchronization by fixational eye movements improves feature estimation. Nat. Neurosci., 5(4), 341–347. Gur, M., Beylin, A., & Snodderly, D. M. (1997). Response variability of neurons in primary visual cortex (V1) of alert monkeys. J. Neurosci., 17(8), 2914–2920. Gur, M., & Snodderly, D. M. (1997). Visual receptive fields of neurons in primary visual cortex (V1) move in space with the eye movements of fixation. Vision Res., 37(3), 257–265. Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture of the cat's visual cortex. J. Physiol., 160, 106–154. Hubel, D. H., & Wiesel, T. N. (1963). Receptive fields of cells in striate cortex of very young, visually inexperienced kittens. J. Neurophysiol., 26, 994–1002. Jones, J. P., & Palmer, L. A. (1987a). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol., 58(6), 1233–1258. Jones, J. P., & Palmer, L. A. (1987b). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. J. Neurophysiol., 58(6), 1187–1211. Law, C. C., & Cooper, L. N. (1994). Formation of receptive fields in realistic visual environments according to the Bienenstock, Cooper and Munro (BCM) theory. Proc. Natl. Acad. Sci. USA, 91, 7797–7801. Leopold, D. A., & Logothetis, N. K. (1998). Microsaccades differentially modulate neural activity in the striate and extrastriate visual cortex. Exp. Brain. Res., 123, 341–345. Linsenmeier, R. A., Frishman, L. J., Jakiela, H. G., & Enroth-Cugell, C. (1982). Receptive field properties of X and Y cells in the cat retina derived from contrast sensitivity measurements. Vision Res., 22, 1173–1183. Linsker, R. (1986). From basic network principles to neural architecture: Emergence of orientation-selective cells. Proc. Natl. Acad. Sci. USA, 83, 8390–8394. Martinez-Conde, S., Macknik, S. L., & Hubel, D.
H. (2000). Microsaccadic eye movements and firing of single cells in the macaque striate cortex. Nat. Neurosci., 3(3), 251–258. Martinez-Conde, S., Macknik, S. L., & Hubel, D. H. (2002). The function of bursts of spikes during visual fixation in the awake primate lateral geniculate nucleus and primary visual cortex. Proc. Natl. Acad. Sci. USA, 99(21), 13920–13925. Mastronarde, D. N. (1983). Correlated firing of cat retinal ganglion cells. I. Spontaneously active inputs to X and Y cells. J. Neurophysiol., 49, 303–323. Miller, K. D. (1994). A model of the development of simple cell receptive fields and the ordered arrangement of orientation columns through activity-dependent competition between ON- and OFF- center inputs. J. Neurosci., 14(1), 409– 441. Miller, K. D., Erwin, E., & Kayser, A. (1999). Is the development of orientation selectivity instructed by activity? J. Neurobiol., 41, 55–57. Miyashita, M., & Tanaka, S. (1992). A mathematical model for the self-organization of orientation columns in visual cortex. Neuroreport, 3, 69–72. Murakami, I., & Cavanagh, P. (1998). A jitter after-effect reveals motion-based stabilization of vision. Nature, 395, 798–801.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609. Papoulis, A. (1984). Probability, random variables and stochastic processes. New York: McGraw-Hill. Pettigrew, J. D. (1974). The effect of visual experience on the development of stimulus specificity by kitten cortical neurons. J. Physiol., 237, 49–74. Ratliff, F., & Riggs, L. A. (1950). Involuntary motions of the eye during monocular fixation. J. Exp. Psychol., 40, 687–701. Reid, R. C., & Alonso, J. M. (1995). Specificity of monosynaptic connections from thalamus to visual cortex. Nature, 378, 281–284. Rucci, M., & Casile, A. (2004). Decorrelation of neural activity during fixational instability: Possible implications for the refinement of V1 receptive fields. Vis. Neurosci., 21(5), 725–738. Rucci, M., Edelman, G. M., & Wray, J. (2000). Modeling LGN responses during free viewing: A possible role of microscopic eye movements in the refinement of cortical orientation selectivity. J. Neurosci., 20(12), 4708–4720. Ruderman, D. L. (1994). The statistics of natural images. Network, 5, 517–548. Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biol., 4(4), 303–321. Singer, W., & Rauschecker, J. (1982). Central-core control of developmental plasticity in the kitten visual cortex: II. Electrical activation of mesencephalic and diencephalic projections. Exp. Brain. Res., 47, 223–233. Skavenski, A. A., Hansen, R., Steinman, R. M., & Winterson, B. J. (1979). Quality of retinal image stabilization during small natural and artificial body rotations in man. Vision Res., 19, 365–375. Snodderly, D. M., Kagan, I., & Gur, M. (2001). Selective activation of visual cortex neurons by fixational eye movements: Implications for neural coding. Vis. Neurosci., 18, 259–277. So, Y. T., & Shapley, R. (1981).
Spatial tuning of cells in and around lateral geniculate nucleus of cat: X and Y relay cells and perigeniculate interneurons. J. Neurophysiol., 45(1), 107–120. Stanley, G. B., Li, F. F., & Dan, Y. (1999). Reconstruction of natural scenes from ensemble responses in the lateral geniculate nucleus. J. Neurosci., 19(18), 8036– 8042. Steinman, R. M., Haddad, G. M., Skavenski, A. A., & Wyman, D. (1973). Miniature eye movement. Science, 181(102), 810–819. Stent, G. S. (1973). A physiological mechanism for Hebb’s postulate of learning. Proc. Natl. Acad. Sci. USA, 70, 997–1001. Tolhurst, D., Walker, N., Thompson, I., & Dean, A. F. (1980). Non-linearities of temporal summation in neurones in area 17 of the cat. Exp. Brain. Res., 38, 431–435. van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. R. Soc. Lond. B. Biol. Sci., 265, 359–366. Yarbus, A. L. (1967). Eye movements and vision. New York: Plenum Press.
Received January 5, 2005; accepted July 19, 2005.
LETTER
Communicated by C. Lee Giles
Learning Beyond Finite Memory in Recurrent Networks of Spiking Neurons

Peter Tiňo
[email protected]
Ashely J. S. Mills
[email protected] School of Computer Science, University of Birmingham, Birmingham B15 2TT, U.K.
We investigate possibilities of inducing temporal structures without fading memory in recurrent networks of spiking neurons strictly operating in the pulse-coding regime. We extend the existing gradient-based algorithm for training feedforward spiking neuron networks, SpikeProp (Bohte, Kok, & La Poutré, 2002), to recurrent network topologies, so that temporal dependencies in the input stream are taken into account. It is shown that temporal structures with unbounded input memory specified by simple Moore machines (MM) can be induced by recurrent spiking neuron networks (RSNN). The networks are able to discover pulse-coded representations of abstract information processing states coding potentially unbounded histories of processed inputs. We show that it is often possible to extract from a trained RSNN the target MM by grouping together similar spike trains appearing in the recurrent layer. Even when the target MM was not perfectly induced in an RSNN, the extraction procedure was able to reveal weaknesses of the induced mechanism and the extent to which the target machine had been learned.

1 Introduction

A considerable amount of work has been devoted to studying computations on time series in a variety of connectionist models, most prominently in models with feedback delay connections between the neural units. Such models are commonly referred to as recurrent neural networks (RNNs). Feedback connections endow RNNs with a form of neural memory that makes them (theoretically) capable of processing time structures over arbitrarily long time spans. However, although RNNs are capable of simulating Turing machines (Siegelmann & Sontag, 1995), induction of nontrivial temporal structures beyond finite memory can be problematic (Bengio, Frasconi, & Simard, 1994). Finite state machines (FSMs) and automata constitute a simple yet well-established and easy-to-analyze framework for describing temporal structures that go beyond finite memory relationships.
In general, for a finite description of the string mapping realized by an FSM, one needs
Neural Computation 18, 591–613 (2006)
© 2006 Massachusetts Institute of Technology
P. Tiňo and A. Mills
592
a notion of an abstract information processing state that can encapsulate histories of processed strings of arbitrary finite length. Indeed, FSMs have been a popular benchmark in the recurrent network community, and there is a huge amount of literature dealing with empirical and theoretical aspects of learning FSMs and automata in RNNs (e.g., Cleeremans, Servan-Schreiber, & McClelland, 1989; Giles et al., 1992; Tiňo & Šajda, 1995; Frasconi, Gori, Maggini, & Soda, 1996; Omlin & Giles, 1996; Casey, 1996; Tiňo, Horne, Giles, & Collingwood, 1998). However, the RNNs under consideration have been based on traditional artificial neural network models that describe neural activity in terms of rates of spikes1 produced by individual neurons (rate coding). It remains controversial whether, when describing computations performed by a real biological system, one can abstract from the individual spikes and consider only macroscopic quantities, such as the number of spikes emitted by a single neuron (or a population of neurons) per time interval. Several models of spiking neurons, where the input and output information is coded in terms of exact timings of individual spikes (pulse coding), have been proposed (see, e.g., Gerstner, 1999). Learning algorithms for acyclic networks of such (biologically more plausible) artificial neurons have been developed and tested (Bohte, Kok, & La Poutré, 2002; Moore, 2002). Maass (1996) proved that networks of spiking neurons with feedback connections (recurrent spiking neuron networks, RSNNs) can simulate Turing machines. Yet virtually no systematic work has been reported on inducing deeper temporal structures in such networks. There are recent developments along this direction, however; for example, Natschläger and Maass (2002) investigated induction of finite memory machines (of depth 3) in feedforward spiking neuron networks.
A memory mechanism was implemented in a biologically realistic model of dynamic synapses (Maass & Markram, 2002) feeding a pool P of spiking neurons. The output was given by a spiking neuron converting space-rate coding in P into an output spike train. In robotics, Floreano and collaborators evolved controllers containing spiking neuron networks for vision-based mobile robots and adaptive indoor micro-flyers (Floreano & Mattiussi, 2001; Floreano, Zufferey, & Nicoud, 2005). In such studies, there is usually a leap in the coding strategy from emphasis on spike timings in individual neurons (pulse coding) to more space-rate-based population codings. Although most experimental research focuses on characterizations of potential information processing states using temporal statistics of rate properties in spike trains (e.g., Abeles et al., 1995; Martignon et al., 2000), there is some experimental evidence that in certain situations, the temporal information may be pulse-coded (Nádasdy, Hirase, Czurkó, Csicsvari, & Buzsáki, 1999; DeWeese & Zador, 2003).
1 Identical electrical pulses, also known as action potentials.
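For concreteness, a Moore machine of the kind used later as an induction target can be written down in a few lines. The parity machine below is a made-up illustration, not one of the paper's benchmark machines; its single bit of state summarizes an arbitrarily long input history, which is exactly what takes such tasks beyond finite (fading) memory:

```python
# A Moore machine: a finite state set, a transition function over input
# symbols, and an output symbol attached to each state. The current state
# encapsulates an arbitrarily long history of processed inputs.

class MooreMachine:
    def __init__(self, transitions, outputs, start):
        self.transitions = transitions  # (state, symbol) -> next state
        self.outputs = outputs          # state -> output symbol
        self.start = start

    def run(self, inputs):
        """Return the output emitted after each processed input symbol."""
        state, emitted = self.start, []
        for sym in inputs:
            state = self.transitions[(state, sym)]
            emitted.append(self.outputs[state])
        return emitted

# Made-up example: emit "b" iff the number of 1s seen so far is odd.
# No machine of fixed finite input-memory depth can compute this, since
# the output depends on the entire input history.
parity = MooreMachine(
    transitions={("even", "0"): "even", ("even", "1"): "odd",
                 ("odd", "0"): "odd", ("odd", "1"): "even"},
    outputs={"even": "a", "odd": "b"},
    start="even",
)
```

Inducing such a mapping in a network requires it to develop an internal analogue of the machine's state set, which is the sense in which the targets here go beyond finite memory.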
Learning Beyond Finite Memory with Spiking Neurons
593
A related development was marked by the introduction of so-called liquid state machines (LSM) by Maass, Natschläger, and Markram (2002). LSM is a metaphor for a new way of viewing real-time computation in recurrent neural circuits. Recurrent networks (possibly with spike-driven dynamics) serve as fixed, nonadaptable general-purpose temporal integrators. The only adaptive part (conforming to the specification of the temporal task at hand) consists of a relatively simple and trainable readout unit operating on the recurrent circuit. The circuit needs to be sufficiently complex that subtle information in the input stream is encoded in high-dimensional transient states of the circuit that are intelligible to the readout (output) unit. In other words, in the recurrent circuit, there is no learning to find temporal features useful for solving the task at hand. Instead, it is assumed that the hard-wired subsystem responsible for computing the features (information processing states) from the input stream is complex enough to serve as a general-purpose filter for a wide range of temporal tasks. The concept of LSM represents an exciting and fresh outlook on computations performed by neural circuits. It argues that there is an alternative to the more traditional attractor-based computations in neural networks. Stable internal states are not required for stable output. Provided certain conditions on the dynamics of the neural circuit and the readout unit are satisfied, LSMs have universal power for computations with fading memory.2 However, a recent study of the decoding properties of LSM on sequences of visual stimuli encoded in a temporal population code by Knüsel, Wyss, König, and Verschure (2004) shows that the LSM mechanism can lead to undesirable mixing of information across stimuli. When the stimuli are not strongly temporally correlated, a special stimulus-onset reset signal may be needed to improve decoding of temporal population codes for individual stimuli.
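The division of labor in an LSM (a fixed, nonadapted recurrent circuit plus a simple trainable readout) can be illustrated with a rate-based toy; the dimensions, scalings, and memory task below are made up, and a real LSM would use a spiking circuit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed, random recurrent circuit (the "liquid"): it is never trained.
N = 50                                   # circuit size (made-up)
W_in = rng.standard_normal((N, 1)) * 0.5
W = rng.standard_normal((N, N)) * (0.9 / np.sqrt(N))  # scaled for stability

def circuit_states(u):
    """Run the fixed recurrent circuit over an input sequence u."""
    x, states = np.zeros(N), []
    for u_t in u:
        x = np.tanh(W @ x + W_in[:, 0] * u_t)
        states.append(x.copy())
    return np.array(states)

# Only the linear readout is trained (here by least squares), conforming
# the general-purpose circuit to the task: recall the input two steps back.
u = rng.standard_normal(500)
X = circuit_states(u)
y = np.roll(u, 2)                        # target u(t - 2)
w_out, *_ = np.linalg.lstsq(X[10:], y[10:], rcond=None)  # skip washout
pred = X @ w_out
```

Note that this recall task is itself a fading-memory task; the point of the sketch is only the architecture, in which all task-specific learning is confined to the readout weights.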
In this study, we are concerned with possibilities of inducing deep temporal structures without fading memory in recurrent networks of spiking neurons. We strictly adhere to pulse coding; that is, all the input, output, and state information is coded in terms of spike trains on subsets of neurons. Using the words of Natschläger and Ruf (1998), "This paper is not about biology but about possibilities of computing with spiking neurons which are inspired by biology. . . . There is still not much known about possible learning in such systems. . . . A thorough understanding of such networks, which are rather simplified in comparison to real biological networks, is necessary for understanding possible mechanisms in biological systems." This letter is organized as follows. We introduce the recurrent spiking neuron network used in this study in section 2 and develop a training
2 Any time-invariant map with fading memory from functions of time u(t) to functions of time y(t) can be approximated by LSM to any degree of precision (Maass et al., 2002).
algorithm for such a network in section 3. The experiments in section 4 are followed by a discussion in section 5. Section 6 concludes by summarizing the key messages of this study.

2 Model

First, we briefly describe the formal model of spiking neurons, the spike response model (Gerstner, 1995), employed in this study (see also Bohte, 2003; Maass & Bishop, 2001). Spikes emitted by neuron $i$ are propagated to neuron $j$ through several synaptic channels $k = 1, 2, \ldots, m$, each of which has an associated synaptic efficacy (weight) $w_{ij}^k$ and an axonal delay $d_{ij}^k$. In each synaptic channel $k$, input spikes get delayed by $d_{ij}^k$ and transformed by a response function $\epsilon_{ij}^k$, which models the rate of neurotransmitter diffusion across the synaptic cleft. The response function can be either excitatory (contributing to the excitatory postsynaptic potential, EPSP) or inhibitory (contributing to the inhibitory postsynaptic potential, IPSP). Formally, denote the set of all (presynaptic) neurons emitting spikes to neuron $j$ by $\Gamma_j$. Let the last spike time of a presynaptic neuron $i \in \Gamma_j$ be $t_i^a$. The accumulated potential at time $t$ on the soma of unit $j$ is

$$x_j(t) = \sum_{i \in \Gamma_j} \sum_{k=1}^{m} w_{ij}^k \cdot \epsilon_{ij}^k\left(t - t_i^a - d_{ij}^k\right), \tag{2.1}$$
where the response function $\epsilon_{ij}^k$ is modeled as

$$\epsilon_{ij}^k(t) = \sigma_{ij}^k \cdot (t/\tau) \cdot \exp\left(1 - (t/\tau)\right) \cdot H(t). \tag{2.2}$$
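Equations 2.1 and 2.2 are straightforward to evaluate numerically. The sketch below (with made-up weights, delays, and spike times; the decay constant and threshold values are illustrative, not taken from the paper) computes the accumulated potential of a unit and its first threshold crossing:

```python
import numpy as np

TAU = 7.0     # membrane potential decay constant tau (illustrative value)
THETA = 1.0   # firing threshold (illustrative value)

def eps(t, sigma=1.0, tau=TAU):
    """Response function of eq. 2.2: sigma * (t/tau) * exp(1 - t/tau) * H(t)."""
    return sigma * (t / tau) * np.exp(1.0 - t / tau) * (t > 0)

def potential(t, presyn):
    """Accumulated potential x_j(t) of eq. 2.1.

    presyn lists one (w, d, sigma, t_a) tuple per synaptic channel:
    weight w_ij^k, delay d_ij^k, sign sigma_ij^k, and the last
    presynaptic firing time t_i^a.
    """
    return sum(w * eps(t - t_a - d, sigma) for (w, d, sigma, t_a) in presyn)

# Two excitatory channels and one inhibitory channel (made-up numbers).
channels = [(1.0, 1.0, 1.0, 0.0), (0.6, 3.0, 1.0, 0.0), (0.4, 2.0, -1.0, 0.0)]

# Neuron j fires at the first time the potential reaches THETA.
ts = np.arange(0.0, 30.0, 0.01)
x = np.array([potential(t, channels) for t in ts])
crossing = ts[np.argmax(x >= THETA)] if np.any(x >= THETA) else None
```

The response function peaks at $t = \tau$ with height $|\sigma_{ij}^k|$, so each weight directly scales its channel's maximal contribution to the potential.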
Here, $\sigma_{ij}^k$ is 1 if the synapse $k$ between neurons $i$ and $j$ transmits EPSP and $-1$ if it transmits IPSP; $\tau$ represents the membrane potential decay time constant (which describes the rate at which current leaks out of the postsynaptic neuron); $H(t)$ is the Heaviside step function, which is 1 for $t > 0$ and 0 otherwise. Neuron $j$ fires a spike (and depolarizes) when the accumulated potential $x_j(t)$ reaches a threshold $\vartheta$. In a feedforward spiking neuron network (FFSNN), the first neurons to fire a spike are the input units. Spatial spike patterns across input neurons code the information to be processed by the FFSNN; the spikes propagate to subsequent layers, finally resulting in a pattern of spike times across neurons in the output layer. The output spike times represent the response of the FFSNN to the current input. The input-to-output propagation of spikes through the FFSNN is confined to a simulation interval of length $\Upsilon$. All neurons
can fire at most once within the simulation interval.3 After the simulation interval has expired, the output spike pattern is read out and interpreted, and a new simulation interval is initialized by presenting a new input spike pattern in the input layer. Given a mechanism for temporal encoding and decoding of the input and output information, respectively, Bohte et al. (2002) have recently formulated a backpropagation-like supervised learning rule for training FFSNN, called SpikeProp. Synaptic efficacies on connections to the output unit $j$ are updated as follows:

$$\Delta w_{ij}^k = -\eta \cdot \epsilon_{ij}^k\left(t_j^a - t_i^a - d_{ij}^k\right) \cdot \delta^j, \tag{2.3}$$

where

$$\delta^j = \frac{t_j^d - t_j^a}{\sum_{i \in \Gamma_j} \sum_{k=1}^{m} w_{ij}^k \cdot \epsilon_{ij}^k\left(t_j^a - t_i^a - d_{ij}^k\right) \cdot \left(\frac{1}{t_j^a - t_i^a - d_{ij}^k} - \frac{1}{\tau}\right)} \tag{2.4}$$
and $\eta > 0$ is the learning rate. The numerator is the difference between the desired ($t_j^d$) and actual ($t_j^a$) firing times of the output neuron $j$ within the simulation interval. Synaptic efficacies on connections to the hidden unit $i$ are updated analogously:

$$\Delta w_{hi}^k = -\eta \cdot \epsilon_{hi}^k\left(t_i^a - t_h^a - d_{hi}^k\right) \cdot \delta^i, \tag{2.5}$$

where

$$\delta^i = \frac{\sum_{j \in \Gamma^i} \delta^j \sum_{k=1}^{m} w_{ij}^k \cdot \epsilon_{ij}^k\left(t_j^a - t_i^a - d_{ij}^k\right) \cdot \left(\frac{1}{t_j^a - t_i^a - d_{ij}^k} - \frac{1}{\tau}\right)}{\sum_{h \in \Gamma_i} \sum_{k=1}^{m} w_{hi}^k \cdot \epsilon_{hi}^k\left(t_i^a - t_h^a - d_{hi}^k\right) \cdot \left(\frac{1}{t_i^a - t_h^a - d_{hi}^k} - \frac{1}{\tau}\right)} \tag{2.6}$$

and $\Gamma^i$ denotes the set of all (postsynaptic) neurons to which neuron $i$ emits spikes. The numerator pulls in contributions from the layer succeeding the one for which the $\delta$'s are being calculated.4 Obviously, an FFSNN cannot properly deal with temporal structures in the input stream that go beyond finite memory. One possible solution is to turn the FFSNN into a recurrent spiking neuron network (RSNN) by extending the feedforward architecture with feedback connections. In analogy with
3 The period of neuron refractoriness (a neuron is unlikely to fire shortly after producing a spike) is not modeled; thus, to maintain biological plausibility, a neuron may fire only once within the simulation interval (see, e.g., Bohte et al., 2002). 4 When a neuron does not fire, its contributions are not incorporated into the calculation of δ’s for other neurons; neither is a δ calculated for it.
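The output-layer update of equations 2.3 and 2.4 can be sketched directly (illustrative Python; the synaptic parameters and firing times below are made up, and only the single-spike, output-layer case is shown):

```python
import numpy as np

TAU = 7.0   # membrane decay constant tau (illustrative)
ETA = 0.01  # learning rate eta (illustrative)

def eps(t, tau=TAU):
    """Excitatory response function of eq. 2.2 (sigma = 1)."""
    return (t / tau) * np.exp(1.0 - t / tau) * (t > 0)

def output_delta(t_d, t_a, presyn):
    """delta^j of eq. 2.4 for an output neuron j.

    presyn lists one (w, d, t_i) tuple per synaptic channel: weight
    w_ij^k, delay d_ij^k, and presynaptic firing time t_i^a. All terms
    are assumed to satisfy t_a - t_i - d > 0.
    """
    denom = sum(w * eps(t_a - t_i - d) * (1.0 / (t_a - t_i - d) - 1.0 / TAU)
                for (w, d, t_i) in presyn)
    return (t_d - t_a) / denom

def weight_updates(t_d, t_a, presyn):
    """Delta w_ij^k of eq. 2.3 for every channel feeding j."""
    dj = output_delta(t_d, t_a, presyn)
    return [-ETA * eps(t_a - t_i - d) * dj for (_, d, t_i) in presyn]

# Desired firing slightly earlier than actual: the updates strengthen
# the contributing synapses so that the threshold is reached sooner.
presyn = [(1.0, 1.0, 0.0), (0.5, 2.0, 0.0)]   # made-up (w, d, t_i) values
dws = weight_updates(t_d=4.5, t_a=5.0, presyn=presyn)
```

Because the potential crosses the threshold from below at the firing time, the denominator of equation 2.4 is positive in realistic configurations, which gives the updates the intuitive sign shown in the example.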
RNN, we select a hidden layer in FFSNN as the layer responsible for coding (through spike patterns) important information about the history of inputs seen so far (recurrent layer) and feed back its spiking patterns through the delay synaptic channels to an auxiliary layer at the input level, called the context layer. The input and context layers now collectively form a new extended input layer of the RSNN. The delay feedback connections temporally translate spike patterns in the recurrent layer by the delay constant , α(t) = t + .
(2.7)
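The factor (1/(·) − 1/τ) appearing in the update rules 2.4 to 2.6 is the logarithmic derivative of the spike response function used by SpikeProp, ε(t) = (t/τ) exp(1 − t/τ) for t > 0 (Bohte et al., 2002). A minimal numerical check of this identity, with τ = 3 as used later in the experiments:

```python
import math

def epsilon(t, tau=3.0):
    """SpikeProp spike-response function: eps(t) = (t/tau)*exp(1 - t/tau) for t > 0."""
    return (t / tau) * math.exp(1.0 - t / tau) if t > 0 else 0.0

def epsilon_prime(t, tau=3.0):
    """Analytic derivative: eps'(t) = eps(t) * (1/t - 1/tau), the factor in eqs. 2.4-2.6."""
    return epsilon(t, tau) * (1.0 / t - 1.0 / tau)

# Central-difference check that the (1/t - 1/tau) factor is indeed the derivative.
t, h = 5.0, 1e-6
numeric = (epsilon(t + h) - epsilon(t - h)) / (2 * h)
assert abs(numeric - epsilon_prime(t)) < 1e-6
```

Note that ε peaks at t = τ with ε(τ) = 1, so the derivative factor changes sign there, which is what makes the learning rule sensitive to which side of the peak a spike falls on.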
Such temporal translation can be achieved using networks of spiking neurons. Experimentation has shown that it is trivial to train an FFSNN with one input and one output to implement an arbitrary delay to high precision, so long as the desired delay does not exceed the temporal resolution at which the FFSNN operates. Thus, multiple copies of these trained networks can be used to delay the firing times of recurrent units in parallel.

Figure 1 shows a typical RSNN architecture used in our experiments. As in the general TIS (transformed input and state) memory model (Mozer, 1994), the network consists of five layers. Each input item is processed within a single simulation interval. The extended input layer (input and context layers, denoted by I and C, respectively) feeds the first auxiliary hidden layer H1, which in turn feeds the recurrent layer Q. Within the nth simulation interval, the spike timings of neurons in the input and context layers I and C are stored in the spatial spike train vectors i(n) and c(n), respectively. The spatial spike trains of the first hidden and recurrent layers are stored in vectors h1(n) and q(n), respectively. The role of the recurrent layer Q is twofold:

1. The spike train q(n) codes information about the history of n inputs seen so far. This information is passed to the next simulation interval through the delay FFSNN network5 α, c(n + 1) = α(q(n)). The delayed spatial spike train c(n + 1) appears in the context layer. Spike train (i(n + 1), c(n + 1)) of the extended input in simulation interval n + 1 consists of the history-coding spike train c(n + 1) (representing the previously seen n inputs) and a spatial spike train i(n + 1) coding the current, (n + 1)st, external input (input symbol).

2. The recurrent layer feeds the second auxiliary hidden layer H2, which finally feeds the output layer O.
Within the nth simulation interval, the spatial spike trains in the second hidden and output layers are stored in vectors h2 (n) and o(n), respectively.
5 The delay function α(t) is applied to each component of q(n).
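The data flow of one simulation interval, extended input to H1 to Q to H2 to O, with the delayed state fed back as context, can be sketched as follows. The `fire` function below is a crude placeholder (it merely shifts spike times by a fixed offset), not the paper's spike-response dynamics; the point is only how c(n + 1) = α(q(n)) of equation 2.7 threads state between intervals:

```python
# Sketch of the RSNN state-passing scheme; `fire` is an illustrative stand-in.
DELTA = 30.0  # feedback delay in ms, as in the experiments

def alpha(q):
    """Delay line (eq. 2.7): translate every spike time in q by DELTA."""
    return [t + DELTA for t in q]

def fire(inputs, offset):
    """Placeholder layer: earliest input spike plus a fixed propagation offset."""
    base = min(inputs)
    return [base + offset, base + offset + 1.0]

def rsnn_run(input_trains, c_start):
    """Process a sequence of input spike trains, threading state via the context."""
    c, outputs, states = c_start, [], []
    for i_n in input_trains:       # one simulation interval per input item
        h1 = fire(i_n + c, 7.0)    # extended input (I, C) feeds H1
        q = fire(h1, 4.0)          # H1 feeds the recurrent layer Q
        h2 = fire(q, 5.0)          # Q feeds H2
        o = fire(h2, 4.0)          # H2 feeds the output layer O
        states.append(q)
        outputs.append(o)
        c = alpha(q)               # delayed state becomes the next context
    return outputs, states
```

With inputs [0, 6] and [40, 46] and c_start = [0, 1], the recurrent spike trains come out as [11, 12] and [51, 52], qualitatively mirroring the 40 ms interval spacing shown in Figure 2.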
Learning Beyond Finite Memory with Spiking Neurons
Figure 1: Typical RSNN architecture used in our experiments. When processing the nth input item, the external input is encoded through spike train i(n) in layer I, the state information is computed as spike train q(n) in layer Q, and the output is represented by the spike train o(n) in layer O. Information about the inputs processed before seeing the nth input item is contained in context spike train c(n) in layer C. c(n) = α(q(n − 1)) is a state-encoding spike train from the previous time step, delayed by Δ. Hidden layers H1 and H2 (with spike trains h1(n) and h2(n), respectively) are auxiliary layers enhancing capabilities of the network to represent the state information and calculate the output.
Parameters, such as the length ϒ of the simulation interval, feedback delay Δ, and spike time encodings of input-output symbols, have to be carefully coordinated. We illustrate this by unfolding an RSNN through two simulation periods (corresponding to processing of two input items) in Figure 2. The input string is 10, and the corresponding desired output string is 01. The input layer I has five neurons. There is a single neuron in the output layer O. All the other layers have two neurons each. Index n indexes the input items (and simulation intervals). Input symbols 0 and 1 are encoded in the five input units through spike patterns i0 = [0, 6, 0, 6, 0] and i1 = [0, 6, 6, 0, 0], respectively (all times are shown in ms). Spike patterns in the single output neuron for output symbols 0 and 1 are o0 = [20] and o1 = [26], respectively. Approximately 5 ms elapses between the firing of
Figure 2: The first two steps in the operation of an RSNN. The input string is 10, and the corresponding desired output is the string 01. The input layer I has five neurons. There is a single neuron in the output layer O. All the other layers have two neurons each. Spike train vectors are shown in ms. Processing of the nth input symbol starts at tstart(n) = (n − 1) · 40 ms. The target (desired) output patterns t(n) are shown above the output layers. Context spike train c(2) is the state spike train q(1) from the previous time step, delayed by Δ = 30 ms. The initial context spike train, cstart, is imposed externally at the beginning of training.
neurons in each subsequent layer. Processing of the nth input symbol starts at tstart (n) = (n − 1) · ϒ,
(2.8)
where ϒ = 40 ms is the length of the simulation interval. Spike train i(n) representing the nth input symbol sn ∈ {0, 1} is calculated by shifting the
corresponding input spike pattern i_{s_n} by t_start(n), i(n) = i_{s_n} + t_start(n).
(2.9)
The same principle applies to calculation of the target (desired) spike patterns t(n) that would be observed at the output, if the network functioned correctly. If, after presentation of the nth input symbol s_n, the desired output symbol is σ_n ∈ {0, 1}, then t(n) is calculated as t(n) = o_{σ_n} + t_start(n).
(2.10)
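Equations 2.8 to 2.10, together with the symbol patterns given earlier, translate directly into code (the container names such as `I_SYM` are ours):

```python
UPSILON = 40                                           # simulation interval length (ms)
I_SYM = {'0': [0, 6, 0, 6, 0], '1': [0, 6, 6, 0, 0]}   # input symbol spike patterns
O_SYM = {'0': [20], '1': [26]}                         # output symbol spike patterns

def t_start(n):
    """Eq. 2.8: processing of the n-th symbol starts at (n - 1) * Upsilon."""
    return (n - 1) * UPSILON

def encode(input_string, output_string):
    """Eqs. 2.9-2.10: shift each symbol's pattern by t_start(n)."""
    ins, targets = [], []
    for n, (s, sigma) in enumerate(zip(input_string, output_string), start=1):
        ins.append([t + t_start(n) for t in I_SYM[s]])
        targets.append([t + t_start(n) for t in O_SYM[sigma]])
    return ins, targets
```

For the input string 10 and target string 01 this reproduces the spike trains of Figure 2, for example i(2) = [40, 46, 40, 46, 40] and t(2) = [66].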
The network is trained by minimizing the difference between the target and observed output spike patterns, t(n) and o(n), respectively. The training procedure is outlined in the next section. Context spike train c(2) is the state spike train q(1) from the previous time step, delayed by Δ = 30 ms. The initial context spike train, cstart, is imposed externally at the beginning of training.

3 Training: SpikePropThroughTime

We extended the SpikeProp algorithm (Bohte et al., 2002) for training FFSNN to recurrent models in the spirit of backpropagation through time for rate-based RNN (Werbos, 1989), that is, using the unfolding-in-time methodology. We call this learning algorithm SpikePropThroughTime. Given an input string of length n, n copies of the base RSNN are made, stacked on top of each other, and sequentially simulated, incrementing t_start by ϒ after each simulation interval. Expanding the base network through time via multiple copies simulates processing of the input stream by the base RSNN. Adaptation δ's (see equations 2.4 and 2.6) are calculated for each of the network copies. The synaptic efficacies (weights) in the base network are then updated using δ's calculated in each of the copies by adding up, for every weight, the n corresponding weight-update contributions of equations 2.3 and 2.5. Figure 3 shows expansion through time of a five-layer base RSNN on an input of length 2. The first copy is simulated with t_start(1) = ϒ · 0 = 0. As explained in the previous section, all firing times of the first copy are relative to 0. For copies n > 1, the external inputs and desired outputs are made relative to t_start(n) = (n − 1) · ϒ. In an FFSNN, when calculating the δ's for a hidden layer, the firing times from the preceding and succeeding layers are used. Special attention must be paid when calculating δ's of neurons in the recurrent layer Q. Context spike train c(n + 1) in copy (n + 1) is the delayed recurrent spike train q(n) from the nth copy.
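The per-copy accumulation of weight updates described above can be sketched as follows. The per-copy gradient terms (the ε · δ products of equations 2.3 and 2.5) are assumed to have been computed already; the dictionary representation of synapses is our illustration:

```python
def accumulate_updates(per_copy_grads, eta=0.01):
    """SpikePropThroughTime weight update: Delta w = -eta * sum over unfolded
    copies of (eps * delta). `per_copy_grads` is a list with one dict per copy,
    mapping a synapse key to its eps * delta contribution in that copy."""
    total = {}
    for grads in per_copy_grads:          # one entry per unfolded network copy
        for key, g in grads.items():
            total[key] = total.get(key, 0.0) + g
    return {key: -eta * g for key, g in total.items()}
```

A synapse active in several copies thus receives the sum of its contributions, scaled once by the learning rate.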
The relationship of firing times in c(n + 1) and h1 (n + 1)
Figure 3: Expansion through time of a five-layer base RSNN on an input of length 2.
contains the information that should be incorporated into the calculation of the δ's for recurrent units in copy n. The delay constant Δ is subtracted from the firing times h1(n + 1) of H1 and then, when calculating the δ's for recurrent units in copy n, these temporally translated firing times are used as if they were simply another hidden layer succeeding Q in copy n. Denoting by Γ_{2,n} the set of neurons in the second auxiliary hidden layer H2 of the nth copy and by Γ_{1,n+1} the set of neurons in the first auxiliary hidden layer H1 of the copy (n + 1), the δ of the ith recurrent unit in the nth copy is calculated as
\[
\delta_i \;=\;
\frac{\sum_{j \in \Gamma_{1,n+1}} \delta_j \sum_{k=1}^{m} w_{ij}^k \,\varepsilon_{ij}^k\big(t_j^a - \Delta - t_i^a - d_{ij}^k\big)\Big(\frac{1}{t_j^a - \Delta - t_i^a - d_{ij}^k} - \frac{1}{\tau}\Big)}{\sum_{h \in \Gamma_i} \sum_{k=1}^{m} w_{hi}^k \,\varepsilon_{hi}^k\big(t_i^a - t_h^a - d_{hi}^k\big)\Big(\frac{1}{t_i^a - t_h^a - d_{hi}^k} - \frac{1}{\tau}\Big)}
\;+\;
\frac{\sum_{j \in \Gamma_{2,n}} \delta_j \sum_{k=1}^{m} w_{ij}^k \,\varepsilon_{ij}^k\big(t_j^a - t_i^a - d_{ij}^k\big)\Big(\frac{1}{t_j^a - t_i^a - d_{ij}^k} - \frac{1}{\tau}\Big)}{\sum_{h \in \Gamma_i} \sum_{k=1}^{m} w_{hi}^k \,\varepsilon_{hi}^k\big(t_i^a - t_h^a - d_{hi}^k\big)\Big(\frac{1}{t_i^a - t_h^a - d_{hi}^k} - \frac{1}{\tau}\Big)}.
\]

(3.1)
4 Learning Beyond Finite Memory in RSNN: Inducing Moore Machines 4.1 Moore Machines. One of the simplest computational models that encapsulates the concept of unbounded input memory is the Moore machine (Hopcroft & Ullman, 1979). Formally, an (initial) Moore machine (MM) M is a six-tuple M = (U, V, S, β, γ , s0 ), where U and V are finite input and output alphabets, respectively; S is a finite set of states; s0 ∈ S is the initial state; β : S × U → S is the state transition function; and γ : S → V is the output function. Given an input string u = u1 u2 , . . . , un of symbols from U (ui ∈ U, i = 1, 2, . . . , n), the machine M acts as a transducer by responding with the output string v = M(u) = v1 v2 , . . . , vn , vi ∈ V, computed as follows. First, the machine is initialized with the initial state s0 . Then for all i = 1, 2, . . . , n, the new state is recursively determined, si = β(si−1 , ui ), and the machine emits the output symbol vi = γ (si ). Moore machines are conveniently represented as directed labeled graphs: nodes represent states, and arcs, labeled by symbols from U, represent state transitions initiated by the input symbols. An example of a simple MM is shown in Figure 4. 4.2 Encoding of Input and Output. In this study, we consider input alphabets of one or two symbols. Moreover, there is a special end-of-string symbol 2 initiating transitions to the initial state. In the experiments, the input layer I had five neurons. The input symbols 0, 1, and 2 are encoded in the five input units through spike patterns i0 = [0, 6, 0, 6, 0], i1 = [0, 6, 6, 0, 0], and i2 = [6, 0, 0, 6, 0], respectively. The firing times are
Figure 4: An example of a simple two-state Moore machine. Circles denote states, and arcs denote state transitions. The number in the upper left of each state labels it. The number in the lower right of each state specifies its output value. Processing of each input string starts in the initial state 0. Each arc is labeled with the input symbol initiating state transition specified by the arc. The special end-of-string reset symbol 2 initiates a transition from every state to the initial state (dashed arcs). As an example, the input sequence 00111001 is mapped by the machine to the output sequence 00101110.
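The transducer semantics of section 4.1 translates directly into code. The transition table below is reconstructed from the caption's example mapping, so it should be read as our interpretation of Figure 4:

```python
def moore_run(beta, gamma, s0, u):
    """Run an (initial) Moore machine as a transducer: for each input symbol,
    take the state transition and emit the output of the *new* state."""
    s, out = s0, []
    for sym in u:
        s = beta[(s, sym)]       # state transition beta: S x U -> S
        out.append(gamma[s])     # Moore output gamma depends on the state only
    return ''.join(out)

# The two-state machine of Figure 4 (transitions inferred from its caption;
# '2' is the end-of-string reset symbol leading back to the initial state).
beta = {(0, '0'): 0, (0, '1'): 1, (1, '0'): 1, (1, '1'): 0,
        (0, '2'): 0, (1, '2'): 0}
gamma = {0: '0', 1: '1'}
```

Running `moore_run(beta, gamma, 0, '00111001')` yields '00101110', matching the mapping stated in the caption.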
in ms. The last input neuron acts like a reference neuron, always firing at the beginning of any simulation interval. In all our experiments, we used binary output alphabet V = {0, 1}, and the output layer O of RSNN consisted of a single neuron. Spike patterns (in ms) in the output neuron for output symbols 0 and 1 are o0 = [20] and o1 = [26], respectively. 4.3 Generation of Training Examples and Learning. Given a target Moore machine M, a set of training examples is constructed by explicitly constructing input strings6 u over U and then determining the corresponding output string M(u) over V (by traversing edges of the graph of M, starting in the initial state, as prescribed by the input string u). The training set D consists of N couples of input-output strings, D = {(u1 , M(u1 )), (u2 , M(u2 )), . . . , (u N , M(u N ))}.
6 It is important to choose strings that exercise structures prominent in the target Moore machine; for example, if a Moore machine contains several discrete loops linked by transitions, then the training string should exercise these loops and transitions. Random walks over the target machine, even those that exercise every transition, were found to be less effective at inducing structures than an intuitive exercising of the main structural components.
We adopt the strategy of inducing the initial state in the recurrent network (as opposed to externally imposing it; see Forcada & Carrasco, 1995; Tiňo & Šajda, 1995). The context layer of the network is initialized with the fixed predefined context spike train c(1) = cstart only at the beginning of training. From the network's point of view, the training set is a couple (ũ, M(ũ)) consisting of the long concatenated input sequence, ũ = u_1 2 u_2 2 u_3 2 … 2 u_{N−1} 2 u_N 2, and the corresponding output sequence, M(ũ) = M(u_1) γ(s_0) M(u_2) γ(s_0) M(u_3) γ(s_0) … γ(s_0) M(u_{N−1}) γ(s_0) M(u_N) γ(s_0). Input symbol 2 is instrumental in inducing the start state by acting as an end-of-string reset symbol initiating a transition from every state of M to the initial state s_0. The network is trained using SpikePropThroughTime (section 3) to minimize the squared error between the desired output spike trains derived from M(ũ) and those observed when the RSNN is driven by the input ũ. The RSNN is unfolded, and SpikePropThroughTime is applied for each training pattern (u_i, M(u_i)), i = 1, 2, …, N. In rate-based neural networks, the weights are conventionally initialized to random elements from the interval [0, 1]. For general feedforward networks of spiking neurons, it appears that this is not applicable when using the SpikeProp learning rule; learning of simple nonlinear training sets appears to fail if the threshold Θ is not sufficiently high and the weights are not sufficiently scaled. The problem was first identified in Moore (2002).7 The firing threshold is Θ = 50, and the weights are initialized to random elements from the interval (0, 10).8 We use a dynamic learning rate strategy that detects oscillatory behavior and plateaus within the error space.
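One minimal realization of such a dynamic learning rate strategy is sketched below; the concrete detectors (sign flips of the error change for oscillation, near-zero changes for a plateau) and the coefficient values are our assumptions, not the exact rule used in the experiments:

```python
def adapt_learning_rate(errors, eta, osc_coef=0.5, plateau_coef=1.1, tol=1e-4):
    """Illustrative rule: shrink eta when the error oscillates, grow it on a
    plateau, based on the last three recorded error values."""
    if len(errors) < 3:
        return eta
    d1 = errors[-1] - errors[-2]
    d2 = errors[-2] - errors[-3]
    if d1 * d2 < 0:                        # error direction flipped: oscillation
        return eta * osc_coef              # oscillation-counter-coefficient < 1
    if abs(d1) < tol and abs(d2) < tol:    # error barely moving: plateau
        return eta * plateau_coef          # plateau-counter-coefficient > 1
    return eta
```

Applied once per epoch on the recorded error history, this keeps the rate small through rough regions of the error surface and accelerates crossing of flat ones.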
The action to take upon detecting oscillation or plateau is, respectively, to decrease the learning rate by multiplying by an oscillation-counter-coefficient (< 1), or increase the learning rate by multiplying by a plateau-counter-coefficient (> 1) (see, e.g., Lawrence, Giles, & Fong, 2000). 4.4 Extraction of Induced Structure. It has been extensively verified in the context of traditional rate-based RNN that it is often possible to
7 We used the initial weight setting, neuron threshold parameters, and the learning rate as suggested by this thesis.
8 Whatever weight setting is chosen, the weights must be initialized so that the neurons in subsequent layers are sufficiently excited by those in their previous layer that they fire; otherwise, the network would be unusable. There is no equivalent in traditional rate-based neural networks to the nonfiring of a neuron in this sense.
extract from RNN the target finite state machine M that the network was successfully trained and tested on. The extraction procedure operates on activation patterns appearing in the recurrent layer while processing inputs9 (e.g., Giles et al., 1992; Tiňo & Šajda, 1995; Frasconi et al., 1996; Omlin & Giles, 1996; for an extensive review and criticism, see Jacobsson, 2005). One possibility10 is to drive the network with sufficiently long input strings and record the recurrent layer activations in a set B. The set B is then clustered using a vector-quantization tool. The cluster indexes will become states of the extracted finite state machine. Next, the network is reset with repeated presentation of the end-of-string symbol 2 and then driven once more with sufficiently long input strings. At each time step n, one records:
- The index q̃(n) of the cluster containing the recurrent activation vector. The cluster index q̃(n) represents context cluster c̃(n + 1) = q̃(n) at time step n + 1. Given the current input symbol and context cluster c̃(n) = q̃(n − 1), the network transits (on the cluster level) to the next context cluster c̃(n + 1) = q̃(n).
- The network output associated with the next state c̃(n + 1) = q̃(n).11
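The recording step can be sketched as follows, assuming the k-means centroids have already been fitted to the normalized recurrent spike trains; all names are ours:

```python
def nearest(vec, centroids):
    """Index of the closest centroid (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, centroids[c])))

def extract_states(spike_trains, t_starts, outputs, centroids):
    """Assign each normalized recurrent spike train q(n) - t_start(n) to a
    cluster state and record the observed state-output associations."""
    state_output = {}
    states = []
    for q, t0, o in zip(spike_trains, t_starts, outputs):
        s = nearest([t - t0 for t in q], centroids)
        states.append(s)
        state_output.setdefault(s, set()).add(o)   # split cluster if |set| > 1
    return states, state_output
```

The resulting state sequence, paired with the driving input symbols, gives the cluster-level transitions that are then summarized as a finite state machine graph.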
Finally, the cluster transitions and output actions are summarized as a graph of a finite state machine that is minimized to its equivalent canonical form (Hopcroft & Ullman, 1979). The state information in RSNN is coded as spike trains in the recurrent layer Q. We studied whether, in analogy with rate-based RNN, the abstract information processing states can be discovered by detecting natural groupings of normalized spike trains12 (q(n) − tstart (n)) using a vector quantization tool (in our experiments, k-means clustering).13 We also applied the extraction procedure to RSNN that managed to induce the target Moore machine only partially, so that the extent to which the target machine has been induced and the weaknesses of the induced mechanism can be exposed and understood. 4.5 Experimental Results. The network had five neurons in layers I , C, H1 , Q, and H2 . Within each of those layers, one neuron was inhibitory; all the others were excitatory. Each connection between neurons had
9 With frozen weights.
10 Given the Moore machine setting and RNN architecture used in this letter.
11 If there are several output symbols associated with recurrent layer activations in the cluster q̃(n), the cluster q̃(n) is split so that the inconsistency is removed.
12 The spike times q(n) entering the quantization phase are made relative to the start time t_start(n) of the simulation interval.
13 The goal here is to assign identical cluster indexes to similar firing times and different indexes to dissimilar firing times. Although the spiking neuron clustering method of Natschläger and Ruf (1998) worked, k-means clustering is simpler (it requires fewer hyperparameters) and faster, so it was used in practice.
m = 16 synaptic channels, with delays d_ij^k = k, k = 1, 2, …, m, realizing axonal delays between 1 ms and 16 ms. The decay constant τ in the response functions ε_ij was set to τ = 3. The length ϒ of the simulation interval was set to 40 ms. The delay Δ was 30 ms. The inputs and desired outputs were coded into the spike trains as described in equations 2.8 to 2.10 and section 4.2. We used SpikePropThroughTime to train the RSNN. The training was error monitored, and training was stopped when it was clear that the network had learned the target (zero thresholded output error) with sufficient stability. In some cases, training was carried on until zero absolute error was achieved. The maximum number of training epochs (sweeps through the training set) was 10,000. First, we experimented with cyclic machines C_p of period p ≥ 2: U = {0}; V = {0, 1}; S = {0, 1, 2, …, p − 1}; s_0 = 0; for 0 ≤ i < p − 1, β(i, 0) = i + 1 and β(p − 1, 0) = 0; γ(0) = 0 and for 0 < i ≤ p − 1, γ(i) = 1. The RSNN perfectly learned machines C_p, 2 ≤ p < 5. After training, the respective RSNN emulated the operation of these MM perfectly and apparently indefinitely (no deviations from expected behavior were observed over test sets having length of the order 10^4). The training set had to be incrementally constructed by iteratively training with one presentation of the cycle, then two presentations, and so on. Note that given that the network can only observe the inputs, these MMs would require an unbounded input memory buffer. So no mechanism with vanishing (input) memory can implement string mappings represented by such MMs. Using the successful networks, we extracted unambiguously all the machines C_p of period 1 ≤ p < 5. The number of clusters in k-means clustering was set to 10. Second, we trained RSNN on a two-state machine M2 shown in Figure 4.
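The cyclic machines C_p are a direct transcription of the definition above (the dictionary representation is our choice):

```python
def cyclic_machine(p):
    """Construct the cyclic Moore machine C_p of section 4.5: a single input
    symbol '0', states 0..p-1 visited in a cycle, output 0 only in state 0."""
    beta = {(i, '0'): (i + 1) % p for i in range(p)}
    gamma = {i: ('0' if i == 0 else '1') for i in range(p)}
    return beta, gamma

def run(beta, gamma, s0, u):
    """Moore-machine transducer: emit the output of each newly entered state."""
    s, out = s0, []
    for sym in u:
        s = beta[(s, sym)]
        out.append(gamma[s])
    return ''.join(out)
```

For C_3, `run(beta, gamma, 0, '000000')` yields '110110': the output depends on the input position modulo p, so no fixed-depth input memory suffices to implement the mapping.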
Again, the RSNN perfectly learned the machine.14 As in the previous experiment, no mechanism with vanishing input memory can implement string mappings defined by this Moore machine. Using the successful networks, we extracted unambiguously the machine M2 (the number of clusters in k-means clustering was 10). In the third experiment, we investigate the information available in the context firing times in the case of partial induction. Consider the three-state machine M3 in Figure 5. The machine has two main fixed-input cycles in opposite directions. Training led to an error rate of ≈ 0.3 over test strings of length 10,000. The extracted induced machine M̃3 is shown in Figure 6. The cycle on input 1 in M3 has been successfully induced, but the cycle on input 0 has not. The oscillation between states 4 and 1 on strings {01}+ in M̃3 corresponds to the oscillation between states 1 and 2 in M3. Transitions
14 Repeated presentation of only five carefully selected training patterns of length 4 was sufficient for induction of this machine. No deviations from expected behavior were observed over test sets of length of the order 10^4.
Figure 5: A three-state target Moore machine M3 that was partially induced.
on the reset symbol 2 have also been induced correctly. Curiously, in M̃3, a cycle of length 4 (over the states 1, 3, 5, 4) has been induced on fixed input 0.

5 Discussion

We were able to train RSNN to mimic target MMs requiring unbounded input memory on only a relatively simple set of MMs. Compared with traditional rate-based RNN, two major problems are apparent when inducing structures like MM in RSNN: 1. There are two timescales the network operates on: (1) the shorter timescale of spike trains coding the input, output, and state information within a single simulation interval and (2) the longer timescale of sequences of simulation intervals, each representing a single input-to-output processing step. These timescales can be synchronized using spike oscillations as in Natschläger and Maass (2002). Long-term dynamics have to be induced based on the behavior of the target MM, but these dynamics are driven ultimately by the short-term dynamics of individual spikes. So in order to exhibit the desired
Figure 6: Extracted machine M̃3 from a partial induction of the Moore machine M3 shown in Figure 5.
long-term behavior, the network has to induce the appropriate short-term dynamics. In contrast, only the long-term dynamics need to be induced in the rate-based RNN. 2. Spiking neurons used in this letter produce a spike only when the accumulated potential x_j(t) reaches a threshold Θ. This leads to discontinuities in the error surface. Gradient-based methods for training feedforward networks of spiking neurons alleviate this problem by resorting to simplifying assumptions on spike patterns within a single simulation interval (see Bohte et al., 2002; Bohte, 2003). The situation is much more complicated in the case of RSNN. A small weight perturbation can prevent a recurrent neuron from firing in the shorter timescale of a simulation interval. That can have serious consequences for further long-timescale processing, especially if such a change in short-term behavior appears at the beginning of presentation of a long input string. The error surface becomes erratic, as evidenced in Figure 7. We took an RSNN trained to perfectly mimic the cycle 4 machine C4 (see section 4.5). We studied the influence of perturbing weights w* in the recurrent part of the RSNN (e.g., between layers I, C, H1, and Q) on the test error calculated on a long test string of length 1000. For each weight perturbation extent ρ, we randomly sampled 100 weight vectors w from the hyperball of radius ρ centered at w*. Shown are the mean and standard deviation values of the
Figure 7: Maximum radius of weight perturbation versus test error of RSNN trained to mimic the cycle 4 machine C4 . For each setting of weight perturbation extent ρ, we randomly sampled 100 weight vectors w from the hyperball of radius ρ centered at induced RSNN weights w∗ . Shown are the mean (solid line) and standard deviation (dashed line) values of the absolute output error per symbol.
absolute output error per symbol for 0 < ρ ≤ 3. Clearly, small perturbations of weights lead to large, abrupt changes in the test error. Obviously, gradient-based methods, like our SpikePropThroughTime, have problems in locating good minima on such error surfaces. We tried the following methods to find RSNN weights in the experiments in section 4.4, but without much success; the abrupt and erratic nature of the error surface makes it hard, even for evolutionary techniques, to locate a good minimum:
- Fast evolutionary strategies (FES) with (recommended) configuration (30,200)-ES,15 employing the Cauchy mutation function (see Yao & Liu, 1997; Yao, 1999).
- Extended Kalman filtering in the parameter space (Puskorius & Feldkamp, 1994).
- A recent powerful evolutionary method for optimization on real-valued domains (Rowe & Hidovic, 2004).

15 In each generation, 30 parents generate 200 offspring through recombination and mutation.
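The hyperball sampling used in the Figure 7 perturbation experiment can be done with the standard Gaussian-direction construction; the paper does not specify its sampler, so the method below is an assumption:

```python
import math
import random

def sample_hyperball(center, rho, rng=None):
    """Draw w uniformly from the ball of radius rho centered at `center`:
    a Gaussian vector gives a uniform direction, and scaling the radius by
    U**(1/d) makes the point uniform in volume."""
    rng = rng or random.Random(0)
    d = len(center)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    r = rho * rng.random() ** (1.0 / d)
    return [c + r * x / norm for c, x in zip(center, g)]
```

Repeating this 100 times per radius ρ and evaluating the test error at each sampled weight vector reproduces the experimental protocol behind Figure 7.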
We tried RSNNs with varying numbers of neurons in the hidden and recurrent layers. In general, the increased representational capacity of RSNNs with more neural units could not be used because of the problems with finding good weight settings due to the erratic nature of the error surface. Our SpikePropThroughTime algorithm can be extended by allowing for adjustable axonal delays d_ij^k, individual synaptic decay constants τ_ij^k, and firing thresholds Θ_j (Schrauwen & Van Campenhout, 2004). While this may lead to models of reduced size, the inherent problems with learning instabilities will not be eliminated. We note that the finite memory machines induced in feedforward spiking neuron networks (Natschläger & Maass, 2002; with dynamic synapses, Maass & Markram, 2002) were quite simple (of depth 3). The input memory depth is limited by the feedforward nature of such networks. As soon as one tries to increase the processing capabilities of spiking networks by introducing feedback connections while insisting on pulse coding, the induction process becomes complicated. Theoretically, it is possible to emulate in a stable manner any MM in a sufficiently rich RSNN. For example, given an MM M, one needs to first fix appropriate spike patterns representing abstract states of the machine M. Then an FFSNN N_S (playing the role of subnetwork of RSNN with layers C, I, H1, and Q) can be trained on the input-driven state transition structure of M. The target spike trains on the top layer of N_S are calculated by shifting the spike patterns representing states of M by an appropriate time delay Δ. Note that N_S can be trained with the traditional SpikeProp algorithm. While training, the target spike trains can be slightly perturbed to yield stable representations in N_S of state transitions in M. Trained N_S with added delay lines α (making spike trains at the top layer of N_S appear in the context layer of N_S with delay Δ) forms a recurrent part of the RSNN being constructed.
We need to stack another FFSNN NO on top of NS to associate states with outputs. Again, NO can be trained using SpikeProp to associate spike trains in the top layer of NS (representing abstract states of M) with the corresponding outputs. The target output spike trains need to be shifted by an appropriate simulation interval length ϒ. Because processing in the spiking neuron networks is driven purely by relative differences between spike times in individual neurons, the RSNN consisting of NS (endowed with delay lines α) and NO will emulate M. Indeed, following this procedure, we were able to construct an RSNN perfectly emulating, for example, machine M2 in Figure 4 in a stable manner. Moreover, one can envisage a procedure for RSNN, analogous to that developed for
rate-based RNN by Giles and Omlin (1993), that would enable direct insertion of (the whole or part of) a given finite state machine (FSM) into an RSNN. Such RSNN-based emulators of FSMs can be used as submodules in larger computational devices operating on spike trains. However, this article aims to study induction of deeper-than-finite-input-memory temporal structures in RSNN. FSMs offer a convenient platform for our study, as they represent a simple, well-established, and easy-to-analyze framework for describing temporal structures that go beyond finite memory relationships. Many previous studies of inducing deep input memory structures in rate-based RNNs were performed in this framework (Cleeremans et al., 1989; Giles et al., 1992; Tiňo & Šajda, 1995; Frasconi et al., 1996; Omlin & Giles, 1996; Casey, 1996; Tiňo et al., 1998). When training an RSNN on example string mappings drawn from some target MM, the structure of the target MM is assumed unknown to the learner. Hence, we cannot split training of the RSNN into two separate phases (training two FFSNNs N_S and N_O on state transitions and state-output associations, respectively). The RSNN is a dynamical system, and weight changes in the RSNN lead to complex bifurcation mechanisms, making it hard to induce more complex MM through a guided search in the weight space. It may be that in biological systems, long-term dependencies are represented using rate-based codings and/or a liquid state machine mechanism (Maass et al., 2002) with a complex but nonadaptable recurrent pulse-coded part.

6 Conclusion

We have investigated the possibilities of inducing temporal structures without fading memory in recurrent networks of spiking neurons operating strictly in the pulse-coding regime. All the input, output, and state information is coded in terms of spike trains on subsets of neurons. We briefly summarize key results of the letter: 1.
A pulse-coding strategy for processing temporal information in recurrent spiking neuron networks (RSNN) was introduced. We have extended, in the sense of backpropagation through time (Werbos, 1989), the gradient-based SpikeProp algorithm (Bohte et al., 2002) for training feedforward spiking neuron networks (FFSNN). The algorithm SpikePropThroughTime is able to account for temporal dependencies in the input stream when training RSNN. 2. We have shown that temporal structures with unbounded input memory specified by simple Moore machines can be induced by RSNN. The networks were able to discover pulse-coded representations of abstract information processing states coding potentially unbounded histories of processed inputs. However, the nature of pulse coding, in
the context of the training strategies tried here, appears not to allow induction beyond simple Moore machines. 3. In analogy with traditional rate-based RNN trained on finite state machines, it is often possible to extract from RSNN the target machines by grouping together similar spike trains appearing in the recurrent layer. Furthermore, extraction of finite state machines from RSNN that managed to induce the target Moore machine only partially can reveal weaknesses of the induced mechanism and the extent to which the target machine has been learned. Although, theoretically, RSNN operating on pulse coding can process any (computationally feasible) temporal structure of unbounded input memory, the induction of such structures through a guided search in the weight space is another matter. Weight changes in RSNN lead to complex bifurcation mechanisms, enormously complicating the training process.
References

Abeles, M., Bergman, H., Gat, I., Meilijson, I., Seidemann, E., Tishby, N., & Vaadia, E. (1995). Cortical activity flips among quasi-stationary states. Proc. Natl. Acad. Sci. USA, 92, 8616–8620.
Bengio, Y., Frasconi, P., & Simard, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
Bohte, S. M. (2003). Spiking neural networks. Unpublished doctoral dissertation, Centre for Mathematics and Computer Science, Amsterdam.
Bohte, S., Kok, J., & La Poutré, H. (2002). Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing, 48(1–4), 17–37.
Casey, M. P. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6), 1135–1178.
Cleeremans, A., Servan-Schreiber, D., & McClelland, J. L. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1(3), 372–381.
DeWeese, M. R., & Zador, A. M. (2003). Binary coding in auditory cortex. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 101–108). Cambridge, MA: MIT Press.
Floreano, D., & Mattiussi, C. (2001). Evolution of spiking neural controllers for autonomous vision-based robots. In T. Gomi (Ed.), Evolutionary robotics IV (pp. 38–61). Berlin: Springer-Verlag.
Floreano, D., Zufferey, J., & Nicoud, J. (2005). From wheels to wings with evolutionary spiking neurons. Artificial Life, 11(1–2), 121–138.
Forcada, M. L., & Carrasco, R. C. (1995). Learning the initial state of a second-order recurrent neural network during regular-language inference. Neural Computation, 7(5), 923–930.
P. Tiňo and A. Mills
Frasconi, P., Gori, M., Maggini, M., & Soda, G. (1996). Insertion of finite state automata in recurrent radial basis function networks. Machine Learning, 23, 5–32.
Gerstner, W. (1995). Time structure of activity in neural network models. Phys. Rev. E, 51, 738–758.
Gerstner, W. (1999). Spiking neurons. In W. Maass & C. Bishop (Eds.), Pulsed neural networks (pp. 3–54). Cambridge, MA: MIT Press.
Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., & Lee, Y. C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3), 393–405.
Giles, C. L., & Omlin, C. W. (1993). Insertion and refinement of production rules in recurrent neural networks. Connection Science, 5(3), 307–377.
Hopcroft, J., & Ullman, J. (1979). Introduction to automata theory, languages, and computation. Reading, MA: Addison-Wesley.
Jacobsson, H. (2005). Rule extraction from recurrent neural networks: A taxonomy and review. Neural Computation, 17(6), 1223–1263.
Knüsel, P., Wyss, R., König, P., & Verschure, P. F. M. J. (2004). Decoding a temporal population code. Neural Computation, 16(10), 2079–2100.
Lawrence, S., Giles, C. L., & Fong, S. (2000). Natural language grammatical inference with recurrent neural networks. IEEE Transactions on Knowledge and Data Engineering, 12(1), 126–140.
Maass, W. (1996). Lower bounds for the computational power of networks of spiking neurons. Neural Computation, 8(1), 1–40.
Maass, W., & Bishop, C. (Eds.). (2001). Pulsed neural networks. Cambridge, MA: MIT Press.
Maass, W., & Markram, H. (2002). Synapses as dynamic memory buffers. Neural Networks, 15(2), 155–161.
Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560.
Martignon, L., Deco, G., Laskey, K. B., Diamond, M., Freiwald, W., & Vaadia, E. (2000). Neural coding: Higher-order temporal patterns in the neurostatistics of cell assemblies. Neural Computation, 12(11), 2621–2653.
Moore, S. (2002). Back propagation in spiking neural networks. Unpublished master's thesis, University of Bath.
Mozer, M. C. (1994). Neural net architectures for temporal sequence processing. In A. Weigend & N. Gershenfeld (Eds.), Predicting the future and understanding the past (pp. 243–264). Reading, MA: Addison-Wesley.
Nádasdy, Z., Hirase, H., Czurkó, A., Csicsvari, J., & Buzsáki, G. (1999). Replay and time compression of recurring spike sequences in the hippocampus. Journal of Neuroscience, 19(21), 9497–9507.
Natschläger, T., & Maass, W. (2002). Spiking neurons and the induction of finite state machines. Theoretical Computer Science: Special Issue on Natural Computing, 287(1), 251–265.
Natschläger, T., & Ruf, B. (1998). Spatial and temporal pattern analysis via spiking neurons. Network: Computation in Neural Systems, 9(3), 319–332.
Omlin, C., & Giles, C. L. (1996). Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1), 41–51.
Puskorius, G., & Feldkamp, L. (1994). Neural control of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Trans. on Neural Networks, 5(2), 279–297.
Rowe, J., & Hidovic, D. (2004). An evolution strategy using a continuous version of the gray-code neighbourhood distribution. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2004) (pp. 725–736). Berlin: Springer-Verlag.
Schrauwen, B., & Van Campenhout, J. (2004). Extending SpikeProp. In Proceedings of the International Joint Conference on Neural Networks (pp. 471–476). Piscataway, NJ: IEEE.
Siegelmann, H., & Sontag, E. (1995). On the computational power of neural nets. Journal of Computer and System Sciences, 50(1), 132–150.
Tiňo, P., Horne, B. G., Giles, C. L., & Collingwood, P. C. (1998). Finite state machines and recurrent neural networks—automata and dynamical systems approaches. In J. E. Dayhoff & O. Omidvar (Eds.), Neural networks and pattern recognition (pp. 171–220). Orlando, FL: Academic Press.
Tiňo, P., & Šajda, J. (1995). Learning and extracting initial Mealy machines with a modular neural network model. Neural Computation, 7(4), 822–844.
Werbos, P. (1989). Generalization of backpropagation with applications to a recurrent gas market model. Neural Networks, 1(4), 339–356.
Yao, X. (1999). Evolving artificial neural networks. Proceedings of the IEEE, 87(9), 1423–1447.
Yao, X., & Liu, Y. (1997). Fast evolution strategies. Control and Cybernetics, 26(3), 467–496.
Received October 22, 2004; accepted July 12, 2005.
LETTER
Communicated by Daniel Amit
Effects of Fast Presynaptic Noise in Attractor Neural Networks J. M. Cortes
[email protected]
J. J. Torres
[email protected]
J. Marro
[email protected]
P. L. Garrido
[email protected] Institute Carlos I for Theoretical and Computational Physics and Department of Electromagnetism and Physics of Matter, University of Granada, 18071 Granada, Spain
H. J. Kappen
[email protected] Department of Biophysics, Radboud University of Nijmegen, 6525 EZ Nijmegen, The Netherlands
We study both analytically and numerically the effect of presynaptic noise on the transmission of information in attractor neural networks. The noise occurs on a very short timescale compared with that of the neuron dynamics, and it produces short-time synaptic depression. This is inspired by recent neurobiological findings showing that synaptic strength may either increase or decrease on a short timescale depending on presynaptic activity. We thus describe a mechanism by which fast presynaptic noise enhances the sensitivity of the neural network to an external stimulus. The reason is that, in general, presynaptic noise induces nonequilibrium behavior and, consequently, the space of fixed points is qualitatively modified in such a way that the system can easily escape from the attractor. As a result, the model shows, in addition to pattern recognition, class identification and categorization, which may be relevant to the understanding of some of the brain's complex tasks.

1 Introduction

There are multiple converging lines of evidence (Abbott & Regehr, 2004) that synapses determine the complex processing of information in the brain. An aspect of this statement is illustrated by attractor neural networks. These show that synapses can efficiently store patterns that are retrieved later with only partial information about them. In addition to this time effect,

Neural Computation 18, 614–633 (2006)
© 2006 Massachusetts Institute of Technology
artificial neural networks should contain some synaptic noise. That is, actual synapses exhibit short-time fluctuations, which seem to compete with other mechanisms during the transmission of information, not to cause unreliability but ultimately to determine a variety of computations (Allen & Stevens, 1994; Zador, 1998). In spite of some recent efforts, a full understanding of how the brain's complex processes depend on such fast synaptic variations is lacking (see Abbott & Regehr, 2004, for instance). A specific matter under discussion concerns the influence of short-time noise on the fixed points and other details of the retrieval processes in attractor neural networks (Bibitchkov, Herrmann, & Geisel, 2002). The observation that actual synapses endure short-time depression or facilitation is likely to be relevant in this context. That is, one may understand some observations by assuming that periods of elevated presynaptic activity may cause either a decrease or an increase in neurotransmitter release; consequently, the postsynaptic response will be either depressed or facilitated depending on presynaptic neural activity (Tsodyks, Pawelzik, & Markram, 1998; Thomson, Bannister, Mercer, & Morris, 2002; Abbott & Regehr, 2004). Motivated by these neurobiological findings, we report in this article on the effects of presynaptic depressing noise on the functionality of a neural circuit. We study in detail a network in which the neural activity evolves at random in time, regulated by a "temperature" parameter. In addition, the values assigned to the synaptic intensities by a learning (e.g., Hebb's) rule are constantly perturbed with microscopic fast noise. This perturbation involves a new parameter that allows a continuous transition from depression to normal operation. As a main result, this letter illustrates that, in general, the addition of fast synaptic noise induces a nonequilibrium condition.
That is, our systems cannot asymptotically reach equilibrium but tend to nonequilibrium steady states whose features depend, even qualitatively, on the dynamics (Marro & Dickman, 1999). This is interesting because, in practice, thermodynamic equilibrium is rare in nature. Instead, the simplest conditions one observes are characterized by a steady flux of energy or information, for instance. This makes the model mathematically involved; for example, there is no general framework such as the powerful (equilibrium) Gibbs theory, which applies only to systems with a single Kelvin temperature and a unique Hamiltonian. However, our system still admits analytical treatment for some choices of its parameters, and in other cases, we uncovered the more intricate model behavior by a series of computer simulations. We thus show that fast presynaptic depressing noise during external stimulation may induce the system to escape from the attractor: the stability of fixed-point solutions is dramatically modified. More specifically, we show that for certain versions of the system, the solution destabilizes in such a way that computational tasks such as class identification and categorization are favored. This is likely the first time such behavior has been reported in an artificial neural network as a consequence of biologically motivated stochastic behavior of synapses.
Similar instabilities have been reported to occur in monkeys (Abeles et al., 1995) and other animals (Miller & Schreiner, 2000), and they are believed to be a main feature in odor encoding (Laurent et al., 2001), for instance.

2 Definition of Model

Our interest is in a neural network in which a local stochastic dynamics is constantly influenced by (pre)synaptic noise. Consider a set of N binary neurons with configurations S ≡ {s_i = ±1; i = 1, . . . , N}.1 Any two neurons are connected by synapses of intensity:2
\[
w_{ij} = \bar w_{ij}\, x_j \qquad \forall i, j.
\tag{2.1}
\]
Here, w̄_ij is fixed, namely, determined in a previous learning process, and x_j is a stochastic variable. This generalizes the hypothesis in previous studies of attractor neural networks with noisy synapses (see, e.g., Sompolinsky, 1986; Garrido & Marro, 1991; Marro, Torres, & Garrido, 1999). Once W̄ ≡ {w̄_ij} is given, the state of the system at time t is defined by setting S and X ≡ {x_i}. These evolve with time, after the learning process that fixes W̄, via the familiar master equation:
\[
\frac{\partial P_t(S, X)}{\partial t} = -P_t(S, X) \sum_{S'} \sum_{X'} c[(S, X) \to (S', X')] + \sum_{S'} \sum_{X'} c[(S', X') \to (S, X)]\, P_t(S', X').
\tag{2.2}
\]
We further assume that the transition rate, or probability per unit time, of evolving from (S, X) to (S', X') is
\[
c[(S, X) \to (S', X')] = p\, c^X[S \to S']\, \delta(X - X') + (1 - p)\, c^S[X \to X']\, \delta_{S,S'}.
\tag{2.3}
\]
This choice (Garrido & Marro, 1994; Torres, Garrido, & Marro, 1997) amounts to considering competing mechanisms. That is, neurons (S) evolve stochastically in time under a noisy dynamics of synapses (X), the latter
1 Note that such binary neurons, although a crude simplification of nature, are known to capture the essentials of cooperative phenomena, which is the focus here. See, for instance, Abbott and Kepler (1990) and Pantic, Torres, Kappen, and Gielen (2002). 2 For simplicity, we are neglecting here postsynaptic dependence of the stochastic perturbation. There is some claim that plasticity might operate on rapid timescales on postsynaptic activity (see Pitler & Alger, 1992). However, including xij in equation 2.1 instead of x j would impede some of the algebra in sections 3 and 4.
Effects of Fast Presynaptic Noise in Attractor Neural Networks
617
evolving (1 − p)/p times faster than the former. Depending on the value of p, three main classes may be defined (Marro & Dickman, 1999):

1. For p ∈ (0, 1), both the synaptic fluctuation and the neuron activity occur on the same temporal scale. This case has already been preliminarily explored (Pantic et al., 2002; Cortes, Garrido, Marro, & Torres, 2004).

2. The limiting case p → 1. This corresponds to neurons evolving in the presence of a quenched synaptic configuration, that is, x_i is constant and independent of i. The Hopfield model (Amari, 1972; Hopfield, 1982) belongs to this class in the simple case that x_i = 1, ∀i.

3. The limiting case p → 0. The rest of this article is devoted to this class of systems.

Our interest in the latter case is a consequence of the following facts. First, there is adiabatic elimination of fast variables for p → 0, which decouples the two dynamics (Garrido & Marro, 1994; Gardiner, 2004). Therefore, an exact analytical treatment, though not the complete solution, is then feasible. To be more specific, for p → 0, the neurons evolve as in the presence of a steady distribution for X. If we write P(S, X) = P(X|S) P(S), where P(X|S) stands for the conditional probability of X given S, one obtains from equations 2.2 and 2.3, after rescaling time tp → t (technical details are worked out in Marro & Dickman, 1999, for instance), that
\[
\frac{\partial P_t(S)}{\partial t} = -P_t(S) \sum_{S'} \bar c\,[S \to S'] + \sum_{S'} \bar c\,[S' \to S]\, P_t(S').
\tag{2.4}
\]
Here,
\[
\bar c\,[S \to S'] \equiv \int dX\, P^{st}(X|S)\, c^X[S \to S'],
\tag{2.5}
\]
and P^{st}(X|S) is the stationary solution that satisfies
\[
P^{st}(X|S) = \frac{\int dX'\, c^S[X' \to X]\, P^{st}(X'|S)}{\int dX'\, c^S[X \to X']}.
\tag{2.6}
\]
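For noise taking finitely many values, the integral in equation 2.5 reduces to a weighted sum over noise realizations. The following sketch makes that averaging explicit; the two-point distribution, the field value, and the rate φ(u) = exp(−u/2) (one of the choices discussed in section 3) are illustrative assumptions rather than quantities from the letter:

```python
import numpy as np

def phi(u):
    # Elementary rate phi(u) = exp(-u/2); it satisfies the detailed-balance
    # property phi(u) = exp(-u) * phi(-u) assumed in the text.
    return np.exp(-u / 2.0)

def effective_rate(u_of_x, x_values, p_st):
    """Average the elementary rates over the stationary noise distribution,
    c_bar = sum_x P_st(x|S) * phi(u_X), the discrete analogue of eq. 2.5.
    `u_of_x` maps a noise realization x to the rate argument u_X."""
    return sum(p * phi(u_of_x(x)) for x, p in zip(x_values, p_st))

# Hypothetical two-point noise: x = -0.5 with probability 0.3, x = 1 with 0.7.
# For a spin s_i in a field h at temperature T, u_X = 2 * s_i * h * x / T.
s_i, h, T = 1, 0.8, 1.0
c_bar = effective_rate(lambda x: 2 * s_i * h * x / T,
                       x_values=[-0.5, 1.0], p_st=[0.3, 0.7])
print(c_bar)  # a single effective flip rate for neuron i
```

Because each elementary rate alone satisfies detailed balance but the weighted superposition in general does not, this single line of averaging is the source of the nonequilibrium character discussed below.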
This formalism allows modeling fast synaptic noise, which, within the appropriate context, will induce a sort of synaptic depression, as explained in detail in section 4. The superposition, equation 2.5, reflects the fact that activity is the result of competition among different elementary mechanisms. That is, different underlying dynamics, each associated with a different realization of the stochasticity X, compete and, in the limit p → 0, an effective rate results
from combining c^X[S → S'] with probability P^{st}(X|S) for varying X. Each of the elementary dynamics tends to drive the system to a well-defined equilibrium state. The competition will, however, impede equilibrium, and in general the system will asymptotically go toward a nonequilibrium steady state (Marro & Dickman, 1999). The question is whether such a competition between synaptic noise and neural activity, which induces nonequilibrium, is at the origin of some of the computational strategies in neurobiological systems. Our study seems to indicate that this is a sensible issue. As a matter of fact, we shall argue below that p → 0 may be realistic a priori for appropriate choices of P^{st}(X|S).

For simplicity, we shall be concerned in this article with sequential updating by means of single-neuron, or "spin-flip," dynamics. That is, the elementary dynamic step will simply consist of local inversions s_i → −s_i induced by a bath at temperature T. The elementary rate c^X[S → S^i] then reduces to a single-site rate that one may write as φ[u_X(S, i)]. Here, u_X(S, i) ≡ 2T^{-1} s_i h_i^X(S), where h_i^X(S) = Σ_{j≠i} w̄_ij x_j s_j is the net (pre)synaptic current arriving at (or local field acting on) the (postsynaptic) neuron i. The function φ(u) is arbitrary except that, for simplicity, we shall assume φ(u) = exp(−u) φ(−u), φ(0) = 1, and φ(∞) = 0 (Marro & Dickman, 1999). We report on the consequences of more complex dynamics in Cortes et al. (2005).

3 Effective Local Fields

Let us define a function H^{eff}(S) through the condition of detailed balance, namely,
\[
\frac{\bar c\,[S \to S^i]}{\bar c\,[S^i \to S]} = \exp\left\{ -\left[ H^{eff}(S^i) - H^{eff}(S) \right] T^{-1} \right\}.
\tag{3.1}
\]
Here, S^i stands for S after flipping at i, s_i → −s_i. We further define the effective local fields h_i^{eff}(S) by means of
\[
H^{eff}(S) = -\frac{1}{2} \sum_i h_i^{eff}(S)\, s_i.
\tag{3.2}
\]
Nothing guarantees that H^{eff}(S) and h_i^{eff}(S) have a simple expression and are therefore analytically useful. This is because the superposition 2.5, unlike its elements φ(u_X), does not satisfy detailed balance in general. In other words, our system has an essential nonequilibrium character that prevents one from using Gibbs's statistical mechanics, which requires a unique Hamiltonian. Instead, here there is one energy associated with each realization of X = {x_i}. This is in addition to the fact that the synaptic weights w̄_ij in equation 2.1 may not be symmetric.
For some choices of both the rate and the noise distribution P^{st}(X|S), the function H^{eff}(S) may be considered a true effective Hamiltonian (Garrido & Marro, 1989; Marro & Dickman, 1999). This means that H^{eff}(S) then generates the same nonequilibrium steady state as the stochastic time-evolution equation which defines the system (see equation 2.4), and that its coefficients have the proper symmetry of interactions. To be more explicit, assume that P^{st}(X|S) factorizes according to
\[
P^{st}(X|S) = \prod_j P(x_j|s_j)
\tag{3.3}
\]
and that one also has the factorization
\[
\bar c\,[S \to S^i] = \prod_{j \ne i} \int dx_j\, P(x_j|s_j)\, \varphi(2T^{-1} s_i \bar w_{ij} x_j s_j).
\tag{3.4}
\]
The former amounts to neglecting some global dependence of the factors on S = {s_i} (see below), and the latter restricts the possible choices for the rate function. Some familiar choices for this function that satisfy detailed balance are the one corresponding to the Metropolis algorithm, that is, φ(u) = min[1, exp(−u)]; the Glauber case, φ(u) = [1 + exp(u)]^{-1}; and φ(u) = exp(−u/2) (Marro & Dickman, 1999). The last fulfills φ(u + v) = φ(u) φ(v), which is required by equation 3.4.3 It then ensues after some algebra that
\[
h_i^{eff} = -T \sum_{j \ne i} \left( \alpha_{ij}^{+}\, s_j + \alpha_{ij}^{-} \right),
\tag{3.5}
\]
with
\[
\alpha_{ij}^{\pm} \equiv \frac{1}{4} \ln \frac{\bar c\,(\beta_{ij}; +)\; \bar c\,(\pm\beta_{ij}; -)}{\bar c\,(-\beta_{ij}; \mp)\; \bar c\,(\mp\beta_{ij}; \pm)},
\tag{3.6}
\]
where β_ij ≡ 2T^{-1} w̄_ij, and
\[
\bar c\,(\beta_{ij}; s_j) = \int dx_j\, P(x_j|s_j)\, \varphi(\beta_{ij} x_j).
\tag{3.7}
\]
3 In any case, the rate needs to be properly normalized. In computer simulations, it is customary to divide φ(u) by its maximum value. Therefore, the normalization happens to depend on temperature and the number of stored patterns. It follows that this normalization is irrelevant for the properties of the steady state; it just rescales the timescale.
This generalizes a case in the literature for random S-independent fluctuations (Garrido & Muñoz, 1993; Lacomba & Marro, 1994; Marro & Dickman, 1999). In that case, one has c̄(±κ; +) = c̄(±κ; −) and, consequently, α_ij^− = 0 ∀ i, j. However, we are concerned here with the case of S-dependent disorder, which results in a nonzero threshold, θ_i ≡ Σ_{j≠i} α_ij^− ≠ 0. In order to obtain a true effective Hamiltonian, the coefficients α_ij^± in equation 3.5 need to be symmetric. Once φ(u) is fixed, this depends on the choice for P(x_j|s_j), that is, on the fast noise details. This is studied in the next section. Meanwhile, we remark that the effective local fields h_i^{eff} defined above are very useful in practice. That is, they may be computed, at least numerically, for any rate and noise distribution. As long as φ(u + v) = φ(u) φ(v) and P^{st}(X|S) factorizes,4 an effective transition rate follows:
\[
\bar c\,[S \to S^i] = \exp\left( -s_i h_i^{eff} / T \right).
\tag{3.8}
\]
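As a minimal numerical illustration of such a computation, the sketch below evaluates c̄(β; s) and the coefficients α±_ij of equations 3.6 and 3.7 for a discrete noise distribution, with the exponential rate φ(u) = exp(−u/2); the particular distribution is a hypothetical stand-in, not one used in the letter:

```python
import numpy as np

def phi(u):
    # phi(u) = exp(-u/2) fulfills phi(u + v) = phi(u) * phi(v).
    return np.exp(-u / 2.0)

def c_bar(beta, s, noise):
    """Discrete analogue of equation 3.7:
    c_bar(beta; s) = sum_x P(x|s) * phi(beta * x)."""
    return sum(p * phi(beta * x) for x, p in noise[s])

def alpha_pm(beta, noise):
    """The coefficients alpha^+ and alpha^- of equation 3.6
    (upper and lower sign choices, respectively)."""
    a_plus = 0.25 * np.log(
        c_bar(beta, +1, noise) * c_bar(beta, -1, noise)
        / (c_bar(-beta, -1, noise) * c_bar(-beta, +1, noise)))
    a_minus = 0.25 * np.log(
        c_bar(beta, +1, noise) * c_bar(-beta, -1, noise)
        / (c_bar(-beta, +1, noise) * c_bar(beta, -1, noise)))
    return a_plus, a_minus

# Sanity check: for S-independent noise with x identically 1, the threshold
# term vanishes (alpha^- = 0) and alpha^+ = -beta/2, so h_i^eff recovers the
# bare Hopfield field, consistent with the remarks in the text.
noise = {+1: [(1.0, 1.0)], -1: [(1.0, 1.0)]}   # (value, probability) pairs
ap, am = alpha_pm(0.6, noise)
print(ap, am)  # -> approximately -0.3 and 0.0
```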
This effective rate may then be used in computer simulations, and it may also be substituted into the relevant equations. Consider, for instance, the overlaps, defined as the product of the current state with one of the stored patterns:
\[
m^{\nu}(S) \equiv \frac{1}{N} \sum_i s_i\, \xi_i^{\nu}.
\tag{3.9}
\]
Here, ξ^ν = {ξ_i^ν = ±1, i = 1, . . . , N} are M random patterns previously stored in the system, ν = 1, . . . , M. After using standard techniques (Hertz, Krogh, & Palmer, 1991; Marro & Dickman, 1999; see also Amit, Gutfreund, & Sompolinsky, 1987), it follows from equation 2.4 that
\[
\partial_t m^{\nu} = 2N^{-1} \sum_i \xi_i^{\nu} \left[ \sinh\!\left( h_i^{eff}/T \right) - s_i \cosh\!\left( h_i^{eff}/T \right) \right],
\tag{3.10}
\]
which is to be averaged over both thermal noise and pattern realizations. Alternatively, one might perhaps obtain dynamic equations of type 3.10 by using Fokker-Planck-like formalisms as, for instance, in Brunel and Hakim (1999).

4 The factorization here does not need to be in products P(x_j|s_j) as in equation 3.3. The same result (see equation 3.8) holds for the choice that we introduce in the next section, for instance.

4 Types of Synaptic Noise

The above discussion and, in particular, equations 3.5 and 3.6 suggest that the system's emergent properties will depend importantly on the details
of the synaptic noise X. We now work out the equations in section 3 for different hypotheses concerning the stationary distribution, equation 2.6. Consider first equation 3.3 with the following specific choice:
\[
P(x_j|s_j) = \frac{1 + s_j F_j}{2}\, \delta(x_j + \Phi) + \frac{1 - s_j F_j}{2}\, \delta(x_j - 1).
\tag{4.1}
\]
This corresponds to a simplification of the stochastic variable x_j. That is, for F_j = 1 ∀j, the noise modifies w̄_ij by a factor −Φ when the presynaptic neuron is firing, s_j = 1, while the learned synaptic intensity remains unchanged when the neuron is silent. In general, w_ij = −Φ w̄_ij with probability ½(1 + s_j F_j). Here, F_j stands for some information concerning the presynaptic site j, such as, for instance, a local threshold or F_j = M^{-1} Σ_ν ξ_j^ν. Our interest in case 4.1 is twofold: it corresponds to an exceptionally simple situation, and it reduces our model to two known cases. This becomes evident by looking at the resulting local fields:
\[
h_i^{eff} = \frac{1}{2} \sum_{j \ne i} \left[ (1 - \Phi)\, s_j - (1 + \Phi)\, F_j \right] \bar w_{ij}.
\tag{4.2}
\]
That is, exceptionally, symmetries here are such that the system is described by a true effective Hamiltonian. Furthermore, this corresponds to the Hopfield model, except for a rescaling of temperature and the emergence of a threshold θ_i ≡ Σ_j w̄_ij F_j (Hertz et al., 1991). It also follows that, concerning stationary properties, the resulting effective Hamiltonian, equation 3.2, reproduces the model in Bibitchkov et al. (2002). In fact, this would correspond in our notation to h_i^{eff} = ½ Σ_{j≠i} w̄_ij s_j x_j^∞, where x_j^∞ stands for the stationary solution of a certain dynamic equation for x_j. The conclusion is that (except perhaps concerning dynamics, which is something worth investigating) the fast noise according to equation 3.3 with equation 4.1 does not imply any surprising behavior. In any case, this choice of noise illustrates the utility of the effective field concept as defined above. Our interest here is in modeling noise consistent with the observation of short-time synaptic depression (Tsodyks et al., 1998; Pantic et al., 2002). In fact, equation 4.1 in some way mimics the fact that increasing the mean firing rate results in a decreasing synaptic weight. With the same motivation, a more intriguing behavior ensues by assuming, instead of equation 3.3, the factorization
\[
P^{st}(X|S) = \prod_j P(x_j|S)
\tag{4.3}
\]
with
\[
P(x_j|S) = \zeta(\vec m)\, \delta(x_j + \Phi) + \left[ 1 - \zeta(\vec m) \right] \delta(x_j - 1), \qquad \vec m = \vec m(S).
\tag{4.4}
\]
Here, m ≡ (m^1(S), . . . , m^M(S)) is the M-dimensional overlap vector, and ζ(m) stands for a function of m to be determined. The depression effect here depends on the overlap vector, which measures the net current arriving at postsynaptic neurons. The nonlocal choice, equations 4.3 and 4.4, thus introduces nontrivial correlations between synaptic noise and neural activity, which are not considered in equation 4.1. Note that we are therefore not modeling the synaptic depression dynamics in an explicit way as, for instance, in Tsodyks et al. (1998). Instead, equation 4.4 amounts to considering fast synaptic noise, which naturally depresses the strength of
the synapses after repeated activity, namely, for a high value of ζ(m). Several further comments on the significance of equations 4.3 and 4.4, which together with p → 0 constitute a main hypothesis here, are in order. We first mention that the system's time relaxation is typically orders of magnitude larger than the timescale of the various synaptic fluctuations reported to account for the observed high variability in the postsynaptic response of central neurons (Zador, 1998). On the other hand, these fluctuations seem to have different sources, such as, for instance, the stochasticity of the opening and closing of the vesicles (S. Hilfiker, private communication, April 2005), the stochasticity of the postsynaptic receptor, which has its own causes, variations of the glutamate concentration in the synaptic cleft, and differences in the potency released from different locations on the active zone of the synapses (Franks, Stevens, & Sejnowski, 2003). This is the complex situation that we try to capture by introducing the stochastic variable x in equation 2.1 and subsequent equations. It may be further noticed that the nature of this variable, which is microscopic here, differs from that in familiar phenomenological models. These often involve a mesoscopic variable, such as the mean fraction of neurotransmitter, which results in a deterministic situation, as in Tsodyks et al. (1998). The depression in our model rather naturally follows from the coupling between the synaptic noise and the neurons' dynamics via the overlap functions. The final result is also deterministic for p → 0 but only, as one should perhaps expect, on the timescale of the neurons. Finally, concerning also the realism of the model, it should be clear that we are restricting ourselves here to fully connected networks for simplicity.
However, we have studied similar systems with more realistic topologies, such as scale-free, small-world, and diluted networks (Torres, Muñoz, Marro, & Garrido, 2004), which suggests that one can generalize the present study in this sense. Our case (see equations 4.3 and 4.4) also reduces to the Hopfield model
but only in the limit Φ → −1, for any ζ(m). Otherwise, the competition results in rather complex behavior. In particular, the noise distribution
P^{st}(X|S) with equation 4.4 lacks the factorization property, which is required to have an effective Hamiltonian with proper symmetry. Nevertheless, we may still write
\[
\frac{\bar c\,[S \to S^i]}{\bar c\,[S^i \to S]} = \prod_{j \ne i} \frac{\int dx_j\, P(x_j|S)\, \varphi(s_i x_j s_j \beta_{ij})}{\int dx_j\, P(x_j|S^i)\, \varphi(-s_i x_j s_j \beta_{ij})}.
\tag{4.5}
\]
Then, using equation 4.4, we linearize around w̄_ij = 0, that is, β_ij = 0 for T > 0. This is a good approximation for the Hebbian learning rule (Hebb, 1949), w̄_ij = N^{-1} Σ_ν ξ_i^ν ξ_j^ν, which is the one we use hereafter, insofar as this rule stores only completely uncorrelated, random patterns. In fact, fluctuations in this case are of order √M/N for finite M (of order 1/√N for finite α), which tend to vanish for a sufficiently large system, for example, in the macroscopic (thermodynamic) limit N → ∞. The effective weights then follow:
\[
w_{ij}^{eff} = \left\{ 1 - \frac{1 + \Phi}{2} \left[ \zeta(\vec m) + \zeta(\vec m^{\,i}) \right] \right\} \bar w_{ij},
\tag{4.6}
\]
where m = m(S), m^i ≡ m(S^i) = m − 2 s_i ξ_i/N, and ξ_i = (ξ_i^1, ξ_i^2, . . . , ξ_i^M) is the binary M-dimensional stored pattern at site i. This shows how the noise modifies the synaptic intensities. The associated effective local fields are
\[
h_i^{eff} = \sum_{j \ne i} w_{ij}^{eff}\, s_j.
\tag{4.7}
\]
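In a simulation, equations 4.6 and 4.7 can be evaluated directly from the overlaps. The sketch below does so for the Hebbian weights, using the choice of ζ introduced in equation 4.8; the network size, the value of Φ, and the large-N replacement ζ(m^i) ≈ ζ(m) are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, Phi = 500, 2, -0.5
alpha = M / N

xi = rng.choice([-1, 1], size=(M, N))       # stored patterns xi[nu, i]
w_bar = (xi.T @ xi) / N                     # Hebbian rule, w_bar_ij
np.fill_diagonal(w_bar, 0.0)

def zeta(m):
    # zeta(m) = (1/(1 + alpha)) * sum_nu (m^nu)^2, equation 4.8.
    return np.dot(m, m) / (1.0 + alpha)

def effective_fields(S):
    """h_i^eff = sum_j w_ij^eff s_j, equation 4.7, with the noise-modified
    weights of equation 4.6; zeta(m^i) is replaced by zeta(m), which is
    accurate up to O(1/N) corrections."""
    m = xi @ S / N                          # overlap vector, equation 3.9
    factor = 1.0 - (1.0 + Phi) * zeta(m)
    return factor * (w_bar @ S)

S = xi[0].copy()                            # state set to the first pattern
h = effective_fields(S)
# For this moderate depression the fields still align with the pattern:
print(np.mean(np.sign(h) == xi[0]))
```

These fields can then be plugged into the effective rate, equation 3.8, for sequential spin-flip updates, which is how the Monte Carlo results reported in section 5 are described.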
The condition to obtain a true effective Hamiltonian from this, that is, proper symmetry of equation 4.6, is that m^i = m − 2 s_i ξ_i/N ≈ m. This is a good approximation in the thermodynamic limit, N → ∞. Otherwise, one may proceed with the dynamic equation 3.10 after substituting equation 4.7, even though there is then no true effective Hamiltonian. One may follow the same procedure for the Hopfield case with asymmetric synapses (Hertz et al., 1991), for instance. Further interest in the concept of local effective fields as defined in section 3 follows from the fact that one may use quantities such as equation 4.7 to simplify a computer simulation considerably, as we do below. To proceed further, we need to determine the probability ζ in equation 4.4. In order to model activity-dependent mechanisms acting on the synapses, ζ should be an increasing function of the net presynaptic current
or field. In fact, ζ(m) simply needs to depend on the overlaps, besides preserving the ±1 symmetry. A simple choice with these requirements is
\[
\zeta(\vec m) = \frac{1}{1 + \alpha} \sum_{\nu} \left[ m^{\nu}(S) \right]^2,
\tag{4.8}
\]
where α = M/N. We describe next the behavior that ensues from equations 4.6 to 4.8 as implied by the noise distribution, equation 4.4.

5 Noise-Induced Phase Transitions

Let us first study the retrieval process in a system with a single stored pattern, M = 1, when the neurons are acted on by the local fields, equation 4.7. One obtains from equations 3.8 to 3.10, after using the simplifying (mean-field) assumption ⟨s_i⟩ ≈ s_i, that the steady solution corresponds to the overlap
\[
m = \tanh\left\{ T^{-1} m \left[ 1 - m^2 (1 + \Phi) \right] \right\},
\tag{5.1}
\]
with m ≡ m^{ν=1}, which preserves the ±1 symmetry. Local stability of the solutions of this equation requires that
\[
|m| > m_c(T) = \frac{1}{\sqrt{3}} \left( \frac{T_c - T}{\Phi - \Phi_c} \right)^{1/2}.
\tag{5.2}
\]
The behavior of equation 5.1 is illustrated in Figure 1 for several values of Φ. This indicates a transition from a ferromagnetic-like phase (i.e., solutions m ≠ 0, with associative memory) to a paramagnetic-like phase, m = 0. The transition is continuous, or second order, only for Φ > Φ_c = −4/3, and it then occurs at a critical temperature T_c = 1. Figure 2 shows the tricritical point at (T_c, Φ_c) and the general dependence of the transition temperature on Φ. A discontinuous phase transition allows much better performance of the retrieval process than a continuous one. This is because the behavior is sharp just below the transition temperature in the former case. Consequently, the above indicates that our model performs better for large negative Φ, that is, Φ < −4/3. We also performed Monte Carlo simulations. These concern a network of N = 1600 neurons acted on by the local fields, equation 4.7, and evolving by sequential updating via the effective rate, equation 3.8. Except for some finite-size effects, Figure 1 shows good agreement between our simulations and the equations here; in fact, the computer simulations also correspond to a mean-field description, given that the fields 4.7 assume fully connected neurons.
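Curves such as those in Figure 1 can be reproduced by iterating equation 5.1 to a fixed point at each temperature. The sketch below does so; the iteration scheme and the initial condition near m = 1 are our own choices, not prescriptions from the letter:

```python
import numpy as np

def steady_overlap(T, Phi, m0=1.0, tol=1e-12, max_iter=200000):
    """Iterate m <- tanh{ T^-1 m [1 - m^2 (1 + Phi)] } (equation 5.1,
    single stored pattern, M = 1) until successive iterates converge.
    Starting near m = 1 selects the retrieval branch when it exists."""
    m = m0
    for _ in range(max_iter):
        m_new = np.tanh(m * (1.0 - m * m * (1.0 + Phi)) / T)
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m

# Phi = -1 reduces equation 5.1 to the Hopfield mean-field law m = tanh(m/T):
print(steady_overlap(T=0.5, Phi=-1.0))   # nonzero retrieval solution
print(steady_overlap(T=1.5, Phi=-1.0))   # paramagnetic solution, m -> 0
```

Sweeping T for several values of Φ and recording the converged m traces out curves of the kind shown in Figure 1.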
Effects of Fast Presynaptic Noise in Attractor Neural Networks
Figure 1: The steady overlap m(T), as predicted by equation 5.1, for different values of the noise parameter, namely, Φ = −2.0, −1.5, −1.0, −0.5, 0, 0.5, 1.0, 1.5, 2.0, from top to bottom, respectively. (Φ = −1 corresponds to the Hopfield case, as explained in the text.) The graphs depict second-order phase transitions (solid curves) and, for the most negative values of Φ, first-order phase transitions (the discontinuities in these cases are indicated by dashed lines). The symbols stand for Monte Carlo data corresponding to a network with N = 1600 neurons for Φ = −0.5 (filled squares) and −2.0 (filled circles).
6 Sensitivity to the Stimulus

As shown above, a noise distribution such as equation 4.4 may model activity-dependent processes reminiscent of short-time synaptic depression. In this section, we study the consequences of this type of fast noise on the retrieval dynamics under external stimulation. More specifically, our aim is to check the resulting sensitivity of the network to external inputs. A high degree of sensitivity will facilitate the response to changing stimuli. This is an important feature of neurobiological systems, which continuously adapt and quickly respond to varying stimuli from the environment. Consider first the case of one stored pattern, M = 1. A simple external input may be simulated by adding to each local field a driving term −δξ_i, ∀i, with 0 < δ ≪ 1 (Bibitchkov et al., 2002). A negative drive in this case of
J. Cortes, J. Torres, J. Marro, P. Garrido, and H. Kappen
Figure 2: Phase diagram depicting the transition temperature T_c as a function of Φ. The solid (dashed) curve corresponds to a second- (first-) order phase transition. The tricritical point is at (T_c, Φ_c) = (1, −4/3). F and P stand for the ferromagnetic-like and paramagnetic-like phases, respectively. The best retrieval properties of our model system occur close to the lower-left quarter of the graph.
a single pattern ensures that the network activity may go from the attractor, ξ, to the "antipattern," −ξ. It then follows the stationary overlap

m = \tanh[T^{-1} F(m, \Phi, \delta)], \qquad (6.1)

with

F(m, \Phi, \delta) \equiv m[1 - m^2 (1 + \Phi) - \delta]. \qquad (6.2)
Figure 3 shows this function for δ = 0 and varying Φ. This illustrates two types of behavior: (local) stability (F > 0) and instability (F < 0) of the attractor, which corresponds to m = 1. That is, the noise induces instability, resulting in this case in switching between the pattern and the antipattern. This is confirmed in Figure 4 by Monte Carlo simulations. The simulations correspond to a network of N = 3600 neurons with one stored pattern, M = 1. This evolves from different initial states, corresponding to different distances to the attractor, under an external stimulus −δξ¹
Figure 3: The function F as defined in equation 6.2 for δ = 0 and, from top to bottom, Φ = −2, −1, 0, 1, and 2. The solution of equation 6.1 becomes unstable, so that the activity will escape the attractor (m = 1), for F < 0, which occurs for Φ > 0 in this case.
for different values of δ. The two left graphs in Figure 4 show several independent time evolutions for the model with fast noise, namely, for Φ = 1; the two graphs to the right are for the Hopfield case lacking the noise (Φ = −1). These, and similar graphs one may obtain for other parameter values, clearly demonstrate how the network sensitivity to a simple external stimulus is qualitatively enhanced by adding presynaptic noise to the system. Figures 5 and 6 illustrate similar behavior in Monte Carlo simulations with several stored patterns. Figure 5 is for M = 3 correlated patterns with mutual overlaps |m^{ν,µ}| ≡ |(1/N) Σ_i ξ_i^ν ξ_i^µ| = 1/3 and |⟨ξ_i^ν⟩| = 1/3. More specifically, each pattern consists of three equal horizontal stripes, initially white (silent neurons), with one of them colored black (firing neurons) and located in a different position for each pattern. The system in this case begins with the first pattern as initial condition and, to avoid dependence on this choice, is allowed to relax for 3 × 10⁴ Monte Carlo steps (MCS). It is then perturbed by a drive −δξ^ν, where the stimulus ν changes (ν = 1, 2, 3, 1, . . .) every 6 × 10³ MCS. The top graph shows the network response in the Hopfield case. There is no visible structure in this signal in the absence of fast noise as long as δ ≪ 1. In fact, the depth of the basins of attraction is large enough in the Hopfield model to prevent any move for small δ, except
Figure 4: Time evolution of the overlap, as defined in equation 3.9, between the current state and the stored pattern in Monte Carlo simulations with 3600 neurons at T = 0.1. Each graph, for a given set of values (δ, Φ), shows different curves corresponding to evolutions starting with different initial states. The two top graphs are for δ = 0.3, with Φ = 1 (graphs A and C) and Φ = −1 (graphs B and D), the latter corresponding to the Hopfield case lacking the fast noise. This shows the important effect noise has on the network sensitivity to external stimuli. The two bottom graphs illustrate the same for a fixed initial distance from the attractor as one varies the external stimulation, namely, for δ = 0.1, 0.2, 0.3, 0.4, and 0.5 from top to bottom.
when approaching a critical point (T_c = 1), where fluctuations diverge. The bottom graph depicts a qualitatively different situation for Φ = 1. That is, adding fast noise in general destabilizes the fixed point for the interesting case of small δ far from criticality. Figure 6 confirms the above for uncorrelated patterns, that is, m^{ν,µ} ≈ δ^{ν,µ} and ⟨ξ_i^ν⟩ ≈ 0. This shows the response of the network in a similar simulation with 400 neurons at T = 0.1 for M = 3 random, orthogonal patterns. The initial condition is again ν = 1, and the stimulus is here +δξ^ν with ν changing every 1.5 × 10⁵ MCS. Thus, we conclude that the switching phenomenon is robust with respect to the type of pattern stored.

7 Conclusion

The set of equations 2.4 to 2.6 provides a general framework to model activity-dependent processes. Motivated by the behavior of neurobiological
Figure 5: Time evolution during a Monte Carlo simulation with N = 400 neurons, M = 3 correlated patterns (as defined in the text), and T = 0.1. The system in this case was allowed to relax to the steady state and then perturbed by the stimulus −δξ^ν, δ = 0.3, with ν = 1 for a short time interval, then with ν = 2, and so on. After suppressing the stimulus, the system is again allowed to relax. The graphs show as a function of time, from top to bottom: the number of the pattern used as the stimulus at each time interval; the resulting response of the network, measured as the overlap of the current state with pattern ν = 1, in the absence of noise, that is, the Hopfield case Φ = −1; and the same for the relevant noisy case Φ = 1.
systems, we adapted this to study the consequences of fast noise acting on the synapses of an attractor neural network with a finite number of stored patterns. We presented in this letter two different scenarios corresponding to noise distributions fulfilling equations 3.3 and 4.3, respectively. In particular, assuming a local dependence on activity as in equation 4.1, one obtains the local fields, equation 4.2, while a global dependence as in equation 4.4 leads to equation 4.7. Under certain assumptions, the system in the first of these cases is described by the effective Hamiltonian, equation 3.2. This reduces to a Hopfield system—the familiar attractor neural network without any synaptic noise—with rescaled temperature and a threshold. This was already studied for a gaussian distribution of thresholds (Hertz et al., 1991; Horn & Usher, 1989; Litinskii, 2002). Concerning stationary properties, this case is also similar to the one in Bibitchkov et al. (2002). A more intriguing
Figure 6: The same as in Figure 5 but for three stored patterns that are orthogonal (instead of correlated). The stimulus is +δξ^ν, δ = 0.1, with ν = ν(t), as indicated at the top. The time evolution of the overlap m^ν is drawn with a different color (black, dark gray, and light gray, respectively) for each value of ν to illustrate that the system keeps jumping between the patterns in this case.
behavior ensues when the noise depends on the total presynaptic current arriving at the postsynaptic neuron. We studied this case both analytically, by using a mean field hypothesis, and numerically, by a series of Monte Carlo simulations using single-neuron dynamics. The two approaches are fully consistent with and complement each other. Our model involves two main parameters. One is the temperature T, which controls the stochastic evolution of the network activity. The other parameter, Φ, controls the intensity of the depressing noise. By varying it, the system describes the route from normal operation to depression phenomena. A main result is that the presynaptic noise induces the occurrence of a tricritical point for certain values of these parameters, (T_c, Φ_c) = (1, −4/3). This separates (in the limit α → 0) first- from second-order phase transitions between a retrieval phase and a nonretrieval phase. The principal conclusion in this letter is that fast presynaptic noise may induce a nonequilibrium condition that results in an important intensification of the network sensitivity to external stimulation. We explicitly show that the noise may render the attractor, or fixed-point solution, of the retrieval process unstable, and the system then seeks another attractor. In particular, one observes switching from the stored pattern to the corresponding antipattern for M = 1 and switching between patterns for a larger number of
stored patterns, M. This behavior is most interesting because it improves the network's ability to detect changing stimuli from the environment. We observe the switching to be very sensitive to the forcing stimulus but rather independent of the network's initial state and of the thermal noise. It seems sensible to argue that besides recognition, the processes of class identification and categorization in nature might follow a similar strategy. That is, different attractors may correspond to different objects, and a dynamics conveniently perturbed by fast noise may keep visiting the attractors belonging to a class that is characterized by a certain degree of correlation among its elements (Cortes et al., 2005). In fact, a similar mechanism seems to be at the basis of early olfactory processing in insects (Laurent et al., 2001), and instabilities of the same sort have been described in the cortical activity of monkeys (Abeles et al., 1995) and in other cases (Miller & Schreiner, 2000). Finally, we mention that the above complex behavior seems confirmed by preliminary Monte Carlo simulations for a macroscopic number of stored patterns, that is, a finite loading parameter α = M/N ≠ 0. On the other hand, a mean field approximation (see below) shows that the storage capacity of the network is α_c = 0.138, as in the Hopfield case (Amit et al., 1987), for any Φ < 0, while it is always smaller for Φ > 0. This is in agreement with previous results concerning the effect of synaptic depression in Hopfield-like systems (Torres, Pantic, & Kappen, 2002; Bibitchkov et al., 2002). The fact that a positive value of Φ tends to make the basin shallower, thus destabilizing the attractor, may be understood by a simple (mean field) argument, which is confirmed by Monte Carlo simulations (Cortes et al., 2005). Assume that the stationary activity shows just one overlap of order unity.
This corresponds to the condensed pattern; the overlaps with the rest of the M − 1 stored patterns are of order 1/√N (noncondensed patterns) (Hertz et al., 1991). The resulting probability of change of the synaptic intensity, namely, [1/(1 + α)] Σ_{ν=1}^{M} (m^ν)², is of order unity, and the local fields, equation 4.7, follow as h_i^eff ∼ −Φ h_i^Hopfield. Therefore, the storage capacity, which is computed at T = 0, is the same as in the Hopfield case for any Φ < 0, and always lower otherwise.

Acknowledgments

We acknowledge financial support from MCyT–FEDER (project no. BFM2001-2841) and a Ramón y Cajal contract.
References

Abbott, L. F., & Kepler, T. B. (1990). Model neurons: From Hodgkin-Huxley to Hopfield. Lecture Notes in Physics, 368, 5–18.
Abbott, L. F., & Regehr, W. G. (2004). Synaptic computation. Nature, 431, 796–803.
Abeles, M., Bergman, H., Gat, I., Meilijson, I., Seidelman, E., Tishby, N., & Vaadia, E. (1995). Cortical activity flips among quasi-stationary states. Proc. Natl. Acad. Sci. USA, 92, 8616–8620.
Allen, C., & Stevens, C. F. (1994). An evaluation of causes for unreliability of synaptic transmission. Proc. Natl. Acad. Sci. USA, 91, 10380–10383.
Amari, S. (1972). Characteristics of random nets of analog neuron-like elements. IEEE Trans. Syst. Man Cybern., 2, 643–657.
Amit, D. J., Gutfreund, H., & Sompolinsky, H. (1987). Statistical mechanics of neural networks near saturation. Ann. Phys., 173, 30–67.
Bibitchkov, D., Herrmann, J. M., & Geisel, T. (2002). Pattern storage and processing in attractor networks with short-time synaptic dynamics. Network: Comput. Neural Syst., 13, 115–129.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Comp., 11, 1621–1671.
Cortes, J. M., Garrido, P. L., Kappen, H. J., Marro, J., Morillas, C., Navidad, D., & Torres, J. J. (2005). Algorithms for identification and categorization. In AIP Conf. Proc. 779 (pp. 178–184).
Cortes, J. M., Garrido, P. L., Marro, J., & Torres, J. J. (2004). Switching between memories in neural automata with synaptic noise. Neurocomputing, 58–60, 67–71.
Franks, K. M., Stevens, C. F., & Sejnowski, T. J. (2003). Independent sources of quantal variability at single glutamatergic synapses. J. Neurosci., 23(8), 3186–3195.
Gardiner, C. W. (2004). Handbook of stochastic methods: For physics, chemistry and the natural sciences. Berlin: Springer-Verlag.
Garrido, P. L., & Marro, J. (1989). Effective Hamiltonian description of nonequilibrium spin systems. Phys. Rev. Lett., 62, 1929–1932.
Garrido, P. L., & Marro, J. (1991). Nonequilibrium neural networks. Lecture Notes in Computer Science, 540, 25–32.
Garrido, P. L., & Marro, J. (1994). Kinetic lattice models of disorder. J. Stat. Phys., 74, 663–686.
Garrido, P. L., & Munoz, M. A. (1993). Nonequilibrium lattice models: A case with effective Hamiltonian in d dimensions. Phys. Rev. E, 48, R4153–R4155.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Horn, D., & Usher, M. (1989). Neural networks with dynamical thresholds. Phys. Rev. A, 40, 1036–1044.
Lacomba, A. I. L., & Marro, J. (1994). Ising systems with conflicting dynamics: Exact results for random interactions and fields. Europhys. Lett., 25, 169–174.
Laurent, G., Stopfer, M., Friedrich, R. W., Rabinovich, M. I., Volkovskii, A., & Abarbanel, H. D. I. (2001). Odor encoding as an active, dynamical process: Experiments, computation and theory. Annu. Rev. Neurosci., 24, 263–297.
Litinskii, L. B. (2002). Hopfield model with a dynamic threshold. Theoretical and Mathematical Physics, 130, 136–151.
Marro, J., & Dickman, R. (1999). Nonequilibrium phase transitions in lattice models. Cambridge: Cambridge University Press.
Marro, J., Torres, J. J., & Garrido, P. L. (1999). Neural network in which synaptic patterns fluctuate with time. J. Stat. Phys., 94(1–6), 837–858.
Miller, L. M., & Schreiner, C. E. (2000). Stimulus-based state control in the thalamocortical system. J. Neurosci., 20, 7011–7016.
Pantic, L., Torres, J. J., Kappen, H. J., & Gielen, S. C. A. M. (2002). Associative memory with dynamic synapses. Neural Comp., 14, 2903–2923.
Pitler, T., & Alger, B. E. (1992). Postsynaptic spike firing reduces synaptic GABA(A) responses in hippocampal pyramidal cells. J. Neurosci., 12, 4122–4132.
Sompolinsky, H. (1986). Neural networks with nonlinear synapses and a static noise. Phys. Rev. A, 34, 2571–2574.
Thomson, A. M., Bannister, A. P., Mercer, A., & Morris, O. T. (2002). Target and temporal pattern selection at neocortical synapses. Philos. Trans. R. Soc. Lond. B Biol. Sci., 357, 1781–1791.
Torres, J. J., Garrido, P. L., & Marro, J. (1997). Neural networks with fast time-variation of synapses. J. Phys. A: Math. Gen., 30, 7801–7816.
Torres, J. J., Munoz, M. A., Marro, J., & Garrido, P. L. (2004). Influence of topology on the performance of a neural network. Neurocomputing, 58–60, 229–234.
Torres, J. J., Pantic, L., & Kappen, H. J. (2002). Storage capacity of attractor neural networks with depressing synapses. Phys. Rev. E, 66, 061910.
Tsodyks, M. V., Pawelzik, K., & Markram, H. (1998). Neural networks with dynamic synapses. Neural Comp., 10, 821–835.
Zador, A. (1998). Impact of synaptic unreliability on the information transmitted by spiking neurons. J. Neurophysiol., 79, 1219–1229.
Received February 9, 2005; accepted July 29, 2005.
LETTER
Communicated by Paul Bressloff
Response Variability in Balanced Cortical Networks Alexander Lerchner∗
[email protected] Technical University of Denmark, 2800 Lyngby, Denmark
Cristina Ursta
[email protected] Niels Bohr Institut, 2100 Copenhagen Ø, Denmark
John Hertz
[email protected] Nordita, 2100 Copenhagen Ø, Denmark
Mandana Ahmadi
[email protected] Nordita, 2100 Copenhagen Ø, Denmark
Pauline Ruffiot
[email protected] Université Joseph Fourier, Grenoble, France
Søren Enemark
[email protected] Niels Bohr Institut, 2100 Copenhagen Ø, Denmark
We study the spike statistics of neurons in a network with dynamically balanced excitation and inhibition. Our model, intended to represent a generic cortical column, comprises randomly connected excitatory and inhibitory leaky integrate-and-fire neurons, driven by excitatory input from an external population. The high connectivity permits a mean field description in which synaptic currents can be treated as gaussian noise, the mean and autocorrelation function of which are calculated self-consistently from the firing statistics of single model neurons. Within this description, a wide range of Fano factors is possible. We find that the irregularity of spike trains is controlled mainly by the strength of the synapses relative to the difference between the firing threshold and the postfiring reset level of the membrane potential. For moderately strong synapses, we find spike statistics very similar to those observed in primary visual cortex.

∗ Current address: Laboratory of Neuropsychology, NIMH, NIH, Bethesda, MD 20893, USA

Neural Computation 18, 634–659 (2006)  © 2006 Massachusetts Institute of Technology
1 Introduction

The observed irregularity and relatively low rates of the firing of neocortical neurons suggest strongly that excitatory and inhibitory input are nearly balanced. Such a balance, in turn, finds an attractive explanation in the approximate, heuristic mean field description of Amit and Brunel (1997a, 1997b) and Brunel (2000). In this treatment, the balance does not have to be put in "by hand"; rather, it emerges self-consistently from the network dynamics. This success encourages us to study firing correlations and irregularity in models like theirs in greater detail. In particular, we would like to quantify the irregularity and identify the parameters of the network that control it. This is important because one cannot extract the signal in neuronal spike trains correctly without a good characterization of the noise. Indeed, an incorrect noise model can lead to spurious conclusions about the nature of the signal, as demonstrated by Oram, Wiener, Lestienne, and Richmond (1999).

Response variability has been studied for a long time in primary visual cortex (Heggelund & Albus, 1978; Dean, 1981; Tolhurst, Movshon, & Thompson, 1981; Tolhurst, Movshon, & Dean, 1983; Vogels, Spileers, & Orban, 1989; Snowden, Treue, & Andersen, 1992; Gur, Beylin, & Snodderly, 1997; Shadlen & Newsome, 1998; Gershon, Wiener, Latham, & Richmond, 1998; Kara, Reinagel, & Reid, 2000; Buracas, Zador, DeWeese, & Albright, 1998) and elsewhere (Lee, Port, Kruse, & Georgopoulos, 1998; Gershon et al., 1998; Kara et al., 2000; DeWeese, Wehr, & Zador, 2003). Most, though not all, of these studies found rather strong irregularity. As an example, we consider the findings of Gershon et al. (1998). In their experiments, monkeys were presented with flashed, stationary visual patterns for several hundred ms.
Repeated presentations of a given stimulus evoked varying numbers of spikes in different trials, though the mean number (as well as the peristimulus time histogram) varied systematically from stimulus to stimulus. The statistical objects of interest to us here are the distributions of single-trial spike counts for given fixed stimuli. Often one compares the data with a Poisson model of the spike trains, for which the count distribution is P(n) = m^n e^{−m}/n!. This distribution has the property that its mean ⟨n⟩ = m is equal to its variance ⟨δn²⟩ = ⟨(n − ⟨n⟩)²⟩. However, the experimental finding was that the measured distributions were quite generally wider than this: ⟨δn²⟩ > m. Furthermore, when data were collected for many stimuli, the variance of the spike count was fit well by a power law function of the mean count: ⟨δn²⟩ ∝ m^y, with y typically in the range 1.2 to 1.4, broadly consistent with the results of many of the other studies cited above.
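The Poisson benchmark is easy to check numerically. The sketch below is our own illustration, not from the paper; it uses Knuth's standard multiplication method to draw Poisson counts, verifies that the Fano factor (variance/mean) is close to 1, and contrasts this with the super-Poisson power law reported in the experiments.

```python
import math
import random

def poisson_sample(rate, rng):
    """Draw one Poisson-distributed count (Knuth's multiplication method)."""
    threshold = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(0)
m = 5.0
counts = [poisson_sample(m, rng) for _ in range(20000)]
mean = sum(counts) / len(counts)
var = sum((n - mean) ** 2 for n in counts) / len(counts)
print(mean, var, var / mean)  # Fano factor var/mean is ~1 for a Poisson process

# The super-Poisson data described above instead follow var ~ mean**y:
y = 1.3  # a value in the reported range 1.2-1.4
print(m ** y)  # for m > 1 the power law predicts a variance larger than the mean
```

For a mean count above 1, any exponent y > 1 gives a count variance exceeding the mean, which is exactly the sense in which the measured distributions are "wider" than Poisson.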
Some of this observed variance could have a simple explanation: the condition of the animal might have changed between trials, so the intrinsic rate at which the neuron fires might differ from trial to trial, as suggested by Tolhurst et al. (1981). But it is far from clear whether all the variance can be accounted for in this way. Moreover, there is no special reason to take a Poisson process as the null hypothesis, so we do not even really know how much variance we are trying to explain. In this article, we try to address the question of how much variability, or more generally, what firing correlations can be expected as a consequence of the intrinsic dynamics of cortical neuronal networks. The theories of Amit and Brunel (1997a, 1997b) and of van Vreeswijk and Sompolinsky (1996, 1998) do not permit a consistent study of firing correlations. The Amit–Brunel equations ignore firing correlations and variations in firing rate within neuronal populations; thus, they do not constitute a complete mean field theory. Although one can calculate the variability of the firing (Brunel, 2000), the calculation is not self-consistent. Van Vreeswijk and Sompolinsky use a binary neuron model with stochastic dynamics, which makes it difficult, if not impossible, to study temporal correlations that might occur in networks of spiking neurons. Therefore, in this article, we do a complete mean field theory for a network of leaky integrate-and-fire neurons, including, as self-consistently determined order parameters, both firing rates and autocorrelation functions. This kind of theory is needed whenever the connections in the network are random. A general formalism for doing this was introduced by Fulvi Mari (2000) and used for an all-excitatory network; here we employ it for a network with both excitatory and inhibitory neurons. A preliminary study of this approach for an all-inhibitory network was presented previously (Hertz, Richmond, & Nilsen, 2003).
2 Model and Methods

The model network, indicated schematically in Figure 1, consists of N₁ excitatory neurons and N₂ inhibitory ones. In this work, we use leaky integrate-and-fire neurons, though the methods could be carried over directly to networks of other kinds of model neurons, such as conductance-based ones. They are randomly interconnected by synapses, both within and between populations, with the mean number of connections from population b to population a equal to K_b, independent of a. In specific calculations, we have used K₁ from 400 to 6400, and we take K₂ = K₁/4. We scale the synaptic strengths in the way van Vreeswijk and Sompolinsky (1996, 1998) did, with each nonzero synapse from population b to population a having the value J_ab/√K_b. Thus, the mean value of a synapse J_ij^ab is

\overline{J_{ij}^{ab}} = \frac{\sqrt{K_b}\, J_{ab}}{N_b}, \qquad (2.1)
Figure 1: Structure of the Model Network.
and its variance is

\overline{(\delta J_{ij}^{ab})^2} = \frac{J_{ab}^2}{N_b}\left(1 - \frac{K_b}{N_b}\right). \qquad (2.2)
The parameters J_ab are taken to be of order 1, so the net input current to a neuron from the K_b neurons in population b connected to it is of order √K_b. With this scaling, the fluctuations in this current are of order 1. Similarly, we assume that the external input to any neuron is the sum of K₀ ≫ 1 contributions from individual neurons (in the lateral geniculate nucleus, if we are thinking about modeling V1), each of order 1/√K₀, so the net input is of order √K₀. In our calculations, we have used K₀ = K₁. We point out that this scaling is just for convenience in thinking about the problem. In the balanced asynchronous firing state, the large excitatory and inhibitory input currents nearly cancel, leaving a net input current of order 1. Thus, for this choice, both the net mean current and its typical fluctuations are of order 1, which is convenient for analysis. The physiologically relevant assumptions are only that excitatory and inhibitory inputs are separately much larger than their sum and that the latter is of the same order as its fluctuations.
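The scaling argument can be illustrated with a toy calculation of ours (not the paper's): draw K excitatory and K inhibitory Bernoulli inputs with synapses of size ±1/√K and equal rates, so the two O(√K) mean currents cancel while the fluctuations of their sum stay O(1). All numerical values below are hypothetical.

```python
import math
import random

rng = random.Random(1)
for K in (400, 1600, 6400):
    # K excitatory and K inhibitory inputs, each active with probability r
    # per time bin, with synapses +1/sqrt(K) and -1/sqrt(K).  Each population
    # alone contributes a mean current of order sqrt(K); the means cancel.
    r = 0.2
    trials = []
    for _ in range(2000):
        exc = sum(1.0 / math.sqrt(K) for _ in range(K) if rng.random() < r)
        inh = sum(-1.0 / math.sqrt(K) for _ in range(K) if rng.random() < r)
        trials.append(exc + inh)
    mean = sum(trials) / len(trials)
    var = sum((x - mean) ** 2 for x in trials) / len(trials)
    print(K, mean, var)  # mean ~ 0, variance ~ 2 r (1 - r), independent of K
```

The net variance stays near 2r(1 − r) = 0.32 for every K, which is the point of the √K scaling: the cancellation removes the large means but leaves order-1 fluctuations to drive firing.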
Our synapses are not modeled as conductances. Our synaptic strength simply defines the amplitude of the postsynaptic current pulse produced by a single presynaptic spike. The model is formally specified by the subthreshold equations of motion for the membrane potentials u_i^a (a = 1, 2, i = 1, . . . , N_a),

\frac{du_i^a}{dt} = -\frac{u_i^a}{\tau} + \sum_{b=0}^{2} \sum_{j=1}^{N_b} J_{ij}^{ab} S_j^b(t), \qquad (2.3)
together with the condition that when u_i^a reaches the threshold θ_a, the neuron spikes and the membrane potential is reset to a value u_r^a. The indices a or b = 0, 1, or 2 label populations: b = 0 refers to the (excitatory) population providing the external input, b = 1 refers to the excitatory population, and b = 2 to the inhibitory population. In equation 2.3, τ is the membrane time constant (taken the same for all neurons, for convenience), and S_j^b(t) = Σ_s δ(t − t_j^{sb}) is the spike train of neuron j in population b. We have ignored transmission delays, and we take the reset levels u_r^a equal to the rest value of the membrane potential, 0. In our calculations, the thresholds are given a gaussian distribution with a standard deviation equal to 10% of the mean. We fix the mean threshold θ_a = 1. Analogous variability in other single-cell parameters (such as membrane time constants) could also be included in the model, but for simplicity, we do not do so here. We assume that the neurons in the external input population (b = 0) fire as independent Poisson processes. However, the neurons in the network (b = 1, 2) are not in general Poissonian; it is their correlations that we want to find in this investigation.

2.1 Mean Field Theory: Stationary States. We describe the mean field theory and its computational implementation first for the case of stationary rates. We will assume N_a ≫ K_a ≫ 1 (large but extremely dilute connectivity). Any mean field theory has to start with an ansatz for the structure of its order parameters. In words, our ansatz is that neurons fire noisily: thus, they are characterized by their rates (which can vary across a neuronal population) and autocorrelation functions. We assume the latter to contain a delta function spike at equal times of strength equal to the rate (because they are spiking neurons) plus a continuous part at unequal times. In our dilute limit, the theory is simplified by the fact that there are no cross-correlations between neurons.
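The subthreshold dynamics of equation 2.3 can be integrated with a simple forward-Euler scheme. The sketch below is a minimal, hypothetical illustration of ours: the threshold θ = 1 and reset to 0 match the paper's choices, but the function name, time step, and input traces are our own. With a slightly subthreshold mean drive, spikes arise only from the current fluctuations, as in the balanced state.

```python
import random

def simulate_lif(spike_input, tau=20.0, theta=1.0, u_reset=0.0, dt=0.1):
    """Forward-Euler integration of du/dt = -u/tau + I(t) (cf. equation 2.3),
    with a spike recorded and u reset to u_reset whenever u >= theta.

    `spike_input` is the summed synaptic current at each time step."""
    u, spikes = 0.0, []
    for step, current in enumerate(spike_input):
        u += dt * (-u / tau + current)
        if u >= theta:
            spikes.append(step * dt)
            u = u_reset
    return spikes

# Noisy current fluctuating around a subthreshold mean (tau * 0.045 = 0.9 < 1):
# firing here is driven entirely by the fluctuations.
rng = random.Random(2)
drive = [0.045 + 0.4 * (rng.random() - 0.5) for _ in range(100000)]
print(len(simulate_lif(drive)))
```

With a constant current I, the membrane potential relaxes toward τI, so the neuron fires periodically when τI > θ and never fires when τI < θ; the fluctuation-driven regime in between is the one relevant for the balanced state.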
(Therefore, we generally drop the "auto" from "autocorrelation.") Under these assumptions, each of the three terms in the sum on b on the right-hand side of equation 2.3 can be treated as a gaussian random function with time-independent mean. (This can be proved formally using the generating-functional formalism of Fulvi Mari, 2000; it is a consequence of the fact that K_b ≫ 1 and of the independence of the J_ij^ab. Furthermore,
experiments (Destexhe, Rudolph, & Paré, 2003) show that a gaussian approximation is very good for real synaptic noise.) We write the contribution from population b as

I_i^{ab}(t) = \sum_{j=1}^{N_b} \left(\overline{J_{ij}^{ab}} + \delta J_{ij}^{ab}\right)\left(r_b + \delta r_j^b + \delta S_j^b(t)\right), \qquad (2.4)
where r_b is the mean rate in population b, δr_j^b is the deviation of the rate of neuron j from its population mean, and δS_j^b(t) = S_j^b(t) − r_j^b describes the fluctuations of the activity of neuron j from its temporal mean r_j^b. The mean (over both time and neurons in the receiving population a) comes from the product of the first terms:

\langle I_i^{ab}(t)\rangle = \sqrt{K_b}\, J_{ab}\, r_b. \qquad (2.5)
By ⟨· · ·⟩ we mean a time average or, equivalently, an average over "trials" (independent repetitions of the Poisson processes defining the input population neurons). We will generally use a bar over a quantity to indicate an average over the neuronal population or over the distribution of the J_ij^ab. (Note that these two kinds of averages are very different things.) The fluctuations around this mean are of two kinds. One is the neuron-to-neuron rate variations in population a, obtained from the time-independent terms in equation 2.4:

\delta I_i^{ab} = \sum_j \overline{J_{ij}^{ab}}\, \delta r_j^b + \sum_j \delta J_{ij}^{ab}\, r_j^b. \qquad (2.6)

Using equations 2.1 and 2.2, their variance reduces, for K_b ≪ N_b, to

\overline{(\delta I_i^{ab})^2} = \frac{J_{ab}^2}{N_b} \sum_j (r_j^b)^2 = J_{ab}^2\, \overline{(r_j^b)^2}. \qquad (2.7)
The second kind is the temporal fluctuations for single neurons, obtained from the terms in equation 2.4 involving δS_j^b(t). Their population-averaged correlation function is proportional to the average correlation function in population b:

\overline{\langle \delta I_i^{ab}(t)\, \delta I_i^{ab}(t')\rangle} = \frac{J_{ab}^2}{N_b} \sum_j \langle \delta S_j^b(t)\, \delta S_j^b(t')\rangle \equiv J_{ab}^2\, C_b(t - t'). \qquad (2.8)
Thus, we can write this contribution to the input current for a single neuron as

I_i^{ab}(t) = J_{ab}\left[\sqrt{K_b}\, r_b + \left(\overline{(r_j^b)^2}\right)^{1/2} x_i^{ab} + \xi_i^{ab}(t)\right], \qquad (2.9)

where x_i^ab is a unit-variance gaussian random number and

\langle \xi_i^{ab}(t)\, \xi_i^{ab}(t')\rangle = C_b(t - t'). \qquad (2.10)
The x_i^ab are time and trial independent, while the noise ξ_i^ab(t) varies both in time within a trial and randomly from trial to trial. Note that for this model, a correct and complete mean field theory has to include the rate variations, through \overline{(r_j^b)^2}, and the temporal firing correlations, given by C_b(t − t′), as well as the mean rates. In our treatment here, we will assume that the neurons in the external input population fire like Poisson processes, so I_i^{a0}(t) is white noise. However, the neurons providing the source of the recurrent currents are not generally Poissonian, so their correlations appear in the statistics of the noise term. The self-consistency equations of mean field theory are simply the conditions that the average output statistics of the neurons, r_a, \overline{(r_j^a)^2}, and C_a(t − t′), are the same as those used to generate the inputs for single neurons using integrate-and-fire neurons with synaptic input currents given by equation 2.9. In an equivalent formulation, the second term in equation 2.9 can be omitted if the noise terms ξ_i^ab(t) have correlations equal to the unsubtracted correlation function,

C_b^{tot}(t - t') = \frac{1}{N_b} \sum_j \langle S_j^b(t)\, S_j^b(t')\rangle. \qquad (2.11)
For |t − t'| → ∞, C_b^{\rm tot}(t − t') → \overline{(r_j^b)^2}, so ξ_i^{ab}(t) acquires a random static component of mean square value \overline{(r_j^b)^2}. In still another way to do it, one can use the average rate \bar r_b in place of its root mean square value in the second term on the right-hand side of equation 2.9 and employ noise with the correlation function

\tilde C_b(t - t') = \frac{1}{N_b} \sum_j \left\langle \left( S_j^b(t) - \bar r_b \right)\left( S_j^b(t') - \bar r_b \right) \right\rangle.   (2.12)
Response Variability in Balanced Cortical Networks
641
For |t − t'| → ∞,

\tilde C_b(t - t') \to \overline{(r_j^b)^2} - \bar r_b^2 \equiv \overline{(\delta r_j^b)^2}.   (2.13)
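The unsubtracted and subtracted correlation functions of equations 2.11 to 2.13 can be estimated directly from binned spike trains. The following minimal sketch is purely illustrative (it is not the authors' code; the raster shape, bin size, and the use of time averages in place of trial averages are assumptions made for the example):

```python
import numpy as np

def correlation_functions(spikes, max_lag):
    """Population-averaged correlation estimates from a binned spike raster.

    spikes : (n_neurons, n_bins) array of spike counts per time bin.
    Returns the mean rate, an unsubtracted estimate (cf. eq. 2.11), and a
    mean-subtracted estimate (cf. eq. 2.12); time averages stand in for
    the trial averages of the text.
    """
    n_neurons, n_bins = spikes.shape
    rbar = spikes.mean()                      # mean rate per bin
    c_tot = np.empty(max_lag + 1)
    c_sub = np.empty(max_lag + 1)
    for k in range(max_lag + 1):
        a = spikes[:, : n_bins - k]
        b = spikes[:, k:]
        c_tot[k] = (a * b).mean()             # (1/N) sum_j <S_j(t) S_j(t+k)>
        c_sub[k] = ((a - rbar) * (b - rbar)).mean()
    return rbar, c_tot, c_sub
```

For neurons firing independently at a common rate, c_tot approaches rbar² and c_sub approaches zero at nonzero lags, in line with the long-time limits quoted above; neuron-to-neuron rate variations would instead leave c_sub with the static offset of equation 2.13.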
There are now two static random parts of I_i^{ab}(t) in equation 2.9: one from the second term and one from the static component of the noise. Their sum is a gaussian random number with variance equal to \overline{(r_j^b)^2}. Thus, these three ways of generating the input currents are equivalent.

2.1.1 The Balance Condition. In a stationary, low-rate state, the mean membrane potential described by equation 2.3 has to be approximately stationary. If excitation dominates, we have du_i^a/dt ∝ √K_0, implying a firing rate of order √K_0 (or one limited only by the refractory period of the neuron). If inhibition dominates, the neuron will never fire. The only way to have a stationary state at a low rate (less than one spike per membrane time constant) is to have the excitation and inhibition nearly cancel. Then the mean membrane potential can lie a little below threshold, and the neuron can fire occasionally due to the input current fluctuations. This suggests the following heuristic theory, based on this approximate balance. Using equations 2.3 and 2.5, we have

\sum_{b=0}^{2} J_{ab} K_b \bar r_b = O(1),   (2.14)

or, up to corrections of O(1/\sqrt{K_0}),

\sum_{b=0}^{2} \hat J_{ab} \bar r_b = 0,   (2.15)
with \hat J_{ab} = J_{ab} \sqrt{K_b / K_0}. These are two linear equations in the two unknowns \bar r_a, a = 1, 2, with the solution

\bar r_a = -\sum_{b=1}^{2} [\hat J^{-1}]_{ab}\, J_{b0}\, \bar r_0,   (2.16)

where \hat J^{-1} is the inverse of the 2 × 2 matrix with elements \hat J_{ab}, a, b = 1, 2. If there is a stationary balanced state, the average rates of the excitatory and inhibitory populations are given by the solutions of equation 2.16. This argument, given by Amit and Brunel and by Sompolinsky and van Vreeswijk, depends only on the rates, not on the correlations.
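Numerically, equation 2.16 amounts to solving a 2 × 2 linear system. The sketch below is illustrative only; the coupling values in the usage example are invented for the demonstration and are not taken from the paper:

```python
import numpy as np

def balanced_rates(J_hat, J_ext, r0):
    """Mean rates in the balanced state (eq. 2.16): the O(sqrt(K)) part of
    the mean input must vanish, giving  r = -J_hat^{-1} J_ext r0.

    J_hat : 2x2 effective intracortical couplings J^hat_ab (a, b = 1, 2).
    J_ext : length-2 external couplings J_b0.
    r0    : external input rate.
    """
    rates = -np.linalg.solve(np.asarray(J_hat, float),
                             np.asarray(J_ext, float) * r0)
    if np.any(rates <= 0):
        raise ValueError("no balanced state with positive rates")
    return rates
```

With the invented couplings J_hat = [[0.5, -2], [1, -2]] and J_ext = [1, 0.5], an input rate r0 = 2 gives balanced rates (2.0, 1.5); the inhibitory entries must be strong enough for a positive solution to exist, which is the condition for a balanced state.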
2.1.2 Numerical Procedure. For integrate-and-fire neurons in a stationary state, the mean field theory can be reduced to a set of analytic equations if neuron-to-neuron rate variations are neglected and a white-noise (Poisson firing) approximation is made (Amit & Brunel, 1997a, 1997b; Brunel, 2000). However, in a complete mean field theory for our model, the randomness in the connectivity forces these features to be taken into account, and it is necessary to resort to numerical methods. Thus, we simulate single neurons driven by gaussian synaptic currents; collect their firing statistics to compute the rates \bar r_a, rate fluctuations \overline{(\delta r_j^a)^2}, and correlations C_a(t − t'); and then use these to generate improved input current statistics. The cycle is repeated until the input and output statistics are consistent. This algorithm was first used by Eisfeller and Opper (1992) to calculate the remanent magnetization of a mean field model for spin glasses.

Explicitly, we proceed as follows. We simulate single excitatory and inhibitory neurons over "trials" 100 integration time steps long. (We will call each time step a "millisecond." We have explored using smaller time steps and verified that there are no qualitative changes in the results.) We start from estimates of the rates given by the balance condition, which makes the net mean input current vanish. Then the sum over presynaptic populations of the O(√K_b) terms in equation 2.9 vanishes, leaving only the rate variation and noise terms. We then run 10,000 trials of single excitatory and inhibitory neurons, selecting on each trial random values of x_i^{ab} and ξ_i^{ab}(t). Since at this point we do not have any estimates of either the rate fluctuations \overline{(\delta r_j^b)^2} or the correlations C_b(t − t'), we use \bar r_b^2 in place of \overline{(r_j^b)^2} in equation 2.9 and use white noise for ξ_i^{ab}(t): C_b(t − t') → \bar r_b δ(t − t').
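A heavily simplified illustration of this kind of iteration (not the authors' code: a single population, the white-noise approximation only, no rate fluctuations or correlation functions, and invented parameter values) might look as follows:

```python
import numpy as np

def lif_rate(mu, sigma, steps=20000, dt=1.0, tau=10.0, theta=1.0, seed=0):
    """Firing rate (spikes per step) of a leaky integrate-and-fire neuron
    driven by Gaussian input of mean mu and standard deviation sigma."""
    rng = np.random.default_rng(seed)
    noise = sigma * rng.standard_normal(steps)
    u, n_spikes = 0.0, 0
    for x in noise:
        u += dt * (-u / tau + mu + x)
        if u >= theta:                 # threshold crossing: spike and reset
            u, n_spikes = 0.0, n_spikes + 1
    return n_spikes / steps

def self_consistent_rate(J=0.3, K=100, mu=0.02, r_init=0.02,
                         n_iter=20, relax=0.2):
    """White-noise version of the self-consistency loop: the input noise
    variance is built from the current rate estimate (sigma^2 ~ J^2 K r),
    the output rate is measured, and the estimate is updated with a small
    relaxation factor, mirroring the damped updates described in the text."""
    r = r_init
    for i in range(n_iter):
        sigma = J * np.sqrt(K * r)     # input noise std from the rate guess
        r_out = lif_rate(mu, sigma, seed=i)
        r += relax * (r_out - r)       # small correction toward the output
    return r
```

The damped update is the essential point: replacing the input statistics wholesale with the measured output statistics, as the text notes, produces large oscillations instead of convergence.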
The random choice of x_i^{ab} from trial to trial effectively samples across the neuronal populations, so we can then collect the statistics \bar r_a, \overline{(r_j^a)^2} (or, equivalently, \overline{(\delta r_j^a)^2}), and C_a(t − t') from these trials. These can be used to generate an improved estimate of the input noise statistics to be used in equation 2.9 in a second set of trials, which yields new spike statistics, and so on. This procedure is iterated until the input and output statistics agree. It may take up to several hundred iterations, depending on network parameters and on how the computation is organized. If one tries this procedure in its naive form, that is, using the output statistics directly to generate the input noise at the next step, it leads to big oscillations and does not converge. It is necessary to make small corrections (of relative order 1/√K_0) to the previous input noise statistics to guarantee convergence.

When one computes statistics from the trials in any iteration, the simplest procedure involves calculating not the average correlation function C_b(t − t') defined in equation 2.8 but, rather, \tilde C_b(t − t') (see equation 2.12). From it, we can proceed in two ways. In one (the first of the three schemes described above for organizing the noise), from its |t − t'| → ∞ limit we can obtain \overline{(\delta r_j^b)^2}, and thereby \overline{(r_j^b)^2} = \bar r_b^2 + \overline{(\delta r_j^b)^2} for use in equation 2.9. Subtracting this limiting value from \tilde C_b(t − t') gives us C_b(t − t') (which vanishes for
large |t − t'|) for use in generating the noise ξ_i^{ab}(t). We will call this the subtracted correlation method. Alternatively, as in the third of the schemes above, we can, at each step of our iterative procedure, generate noise directly with the correlations \tilde C_b(t − t') (which have a long-range time dependence) and use \bar r_b^2 in place of \overline{(r_j^b)^2} in equation 2.9. We call this the unsubtracted correlation method. We have verified that the two methods give the same results when carried out numerically, though the second one converges more slowly.

While the true rates in the stationary case are time independent and C_a(t, t') is a function only of t − t', the statistics collected over a finite set of noise-driven trials will not exactly have these stationarity properties. Therefore, we improve the statistics and impose time-translational invariance by averaging the measured \bar r_a(t) and \overline{(\delta r_j^a(t))^2} over t and averaging the measured values C_a(t, t') with fixed t − t'.

After the iterative procedure converges, so that we have a good estimate of the statistics of the input, we want to run many trials on a single neuron and compute its firing statistics. This means that the numbers x_i^{ab} (b = 0, 1, 2) should be held constant over these trials. In this case, it is necessary to subtract out the large t − t' limit of \tilde C_a(t − t') and use fixed x_i^{ab} (constant in time and across trials) to generate the input noise. (If we did it the other way, without the subtraction, we would effectively be assuming that x_i^{ab} changed randomly from trial to trial, which is not correct.) In our calculations we have used 10,000 trials to calculate these single-neuron firing statistics. We perform the subtraction of the long-time limit of \tilde C_a(t − t') at |t − t'| = 50, and we have checked that equation 2.12 is flat beyond this point in all the cases we have examined.
If we perform this kind of measurement separately for many values of the x_i^{ab}, we can see how the firing statistics vary across the population. Here, however, we will confine most of our attention to what we call the "average neuron": the one with the average value (0) of all three x_i^{ab}. In particular, we calculate the mean spike count in the 100 ms trials and its variance across trials. From this we can get the Fano factor F (the variance-to-mean ratio). We also compute the autocorrelation function, which offers a consistency check, since the Fano factor can also be obtained from

F = \frac{1}{\bar r} \int_{-\infty}^{\infty} C(\tau)\, d\tau.   (2.17)
(This formula is valid when the measurement period is much larger than the time over which C(τ) falls to zero.) We will study how these firing statistics vary as we change various parameters of the model: the input rate \bar r_0, the parameters that control the balance of excitation and inhibition, and the overall strength of the synapses.
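The spike-count route to F is direct to compute. The sketch below is illustrative only (the count distributions are invented for the example); it also shows how trial-to-trial rate fluctuations alone push F above 1:

```python
import numpy as np

def fano_factor(counts):
    """Variance-to-mean ratio of spike counts across trials."""
    counts = np.asarray(counts, dtype=float)
    return counts.var() / counts.mean()

rng = np.random.default_rng(1)

# Poisson spike counts: F is close to 1.
poisson_counts = rng.poisson(10.0, size=10000)

# Doubly stochastic counts: a rate that fluctuates from trial to trial
# (gamma distributed here, an invented example) inflates the variance,
# giving super-Poissonian statistics (F > 1).
rates = rng.gamma(shape=5.0, scale=2.0, size=10000)   # mean rate 10
bursty_counts = rng.poisson(rates)
```

For the gamma-Poisson mixture, the expected Fano factor is 1 + Var(rate)/mean(rate) = 1 + 20/10 = 3, illustrating how slow rate variations mimic the super-Poissonian regime discussed below.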
This will give us some generic understanding of what controls the degree of irregularity of the neuronal firing.

2.2 Nonstationary Case. When the input population is not firing at a constant rate, almost the same calculational procedure can be followed, except that one does not average the measured rates, their fluctuations, or the correlation function over time. To start, we get initial instantaneous rate estimates from the balance condition, assuming that the time-dependent average input currents do not vary too quickly. (This condition is not very stringent; van Vreeswijk and Sompolinsky showed that the stability eigenvalues are proportional to √K_0, so if they have the right sign, the convergence to the balanced state is very rapid.)

To carry out the iterative procedure that satisfies the self-consistency conditions of the theory, it is simplest to use the second of the two ways described above (the unsubtracted correlation method). In this case, we get equations for the noise input currents just like equation 2.9, except that the second term is omitted, the \bar r_b are t-dependent, and the correlation functions C_b^{\rm tot} depend on both t and t', not just their difference. The only tricky part is the subtraction of the long-time limit of the correlation function, which is not simply defined. We treat this problem in the following way. We examine the rate-normalized quantity

\hat C_a(t, t') = \frac{C_a^{\rm tot}(t, t')}{\bar r_a(t)\, \bar r_a(t')}.   (2.18)
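The normalization of equation 2.18 and the invariance property exploited below can be sketched numerically. The arrays here are synthetic, invented purely to illustrate the idea that correlations may inherit time-varying rates as overall factors:

```python
import numpy as np

def rate_normalized_corr(c_tot, rates):
    """Rate-normalized correlation (eq. 2.18): C_hat(t,t') = C_tot(t,t') / (r(t) r(t'))."""
    rates = np.asarray(rates, dtype=float)
    return np.asarray(c_tot, dtype=float) / np.outer(rates, rates)

# Synthetic check: if the correlations simply inherit the time-varying
# rates as overall factors, C_hat depends only on t - t'.
T = 40
t = np.arange(T)
rates = 1.0 + 0.5 * np.sin(2 * np.pi * t / T)          # slowly varying rate
envelope = np.exp(-np.abs(t[:, None] - t[None, :]) / 5.0)
c_tot = np.outer(rates, rates) * envelope
c_hat = rate_normalized_corr(c_tot, rates)
```

In this constructed case, every off-diagonal of c_hat is constant, so the subtraction of the long-time limit can be performed on c_hat as a function of t − t' alone.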
We find that this quantity is time-translation invariant (i.e., a function only of t − t') to a very good approximation, so we perform the subtraction of the long-time limit on it. Then multiplying the subtracted \hat C by \bar r_a(t) \bar r_a(t') gives a good approximation to the true (subtracted) correlation function C_a(t, t'). The meaning of this finding is, loosely speaking, that when the rates vary (slowly enough) in time, the correlation functions simply inherit these rates as overall factors without changing anything else about the problem. We will use this time-dependent formulation below to simulate experiments like those of Gershon et al. (1998), where the LGN input r_0(t) to visual cortical cells is time dependent because of the flashing on and off of the stimulus.

3 Results

The results presented in this article were obtained from simulations with parameters K_1 = 4444 excitatory inputs and K_2 = 1111 inhibitory inputs to each neuron. The average number of external (excitatory) inputs, K_0,
was chosen to be equal to K_2. All neurons have the same membrane time constant τ of 10 ms. To study the effect of various combinations of synaptic strengths, we use the following generic form to define the intracortical weights J_ab:

\begin{pmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{pmatrix} = \begin{pmatrix} \epsilon & -2g\epsilon \\ 1 & -2g \end{pmatrix}.   (3.1)
For the synaptic strengths from the external population, we use J_10 = 1 and J_20 = ε. With this notation, g determines the strength of inhibition relative to excitation within the network and ε the strength of intracortical excitation. Additionally, we scale the overall strength of the synapses with a multiplicative scaling factor denoted J_s, so that each synapse has an actual weight of J_s · J_ab, regardless of a and b.

Figure 2 summarizes how the firing statistics depend on the parameters g, ε, and J_s. The irregularity of spiking, as measured by the Fano factor, depends most sensitively on the overall scaling of the synaptic strength, J_s. The Fano factor increases systematically as J_s increases, and higher values of intracortical excitation also result in higher values of F. The same pattern holds for stronger intracortical inhibition, parameterized by g. In all of these cases, the mean firing rate remains virtually unchanged because of the dynamic balance of excitation and inhibition in the network, whereas the fluctuations increase with an increase in any of the synaptic weights.

Interspike interval (ISI) distributions are shown in Figure 3 for three different values of J_s, keeping ε and g fixed at 0.5 and 1, respectively. For a Poisson spike train, the Fano factor F = 1; F > 1 (which we term super-Poissonian) indicates a tendency for spikes to occur in clusters separated by accordingly longer empty intervals, and F < 1 (sub-Poissonian) indicates more regularity, reflected in a narrower ISI distribution. We have adjusted the input rate r_0 so that the output rate is the same in all three cases. The top panel of Figure 3 shows the ISI distribution of a super-Poissonian spike train, obtained for J_s = 1.42. Overlaid on the histogram of ISI counts is an exponential curve indicating a Poisson distribution with the same mean ISI length.
Compared with the Poisson distribution, the super-Poissonian spike train contains more short intervals, as seen in the peak at short lengths, and also more long intervals, causing a long tail. Necessarily, the interval count around the average ISI length is lower than that for a Poisson spike train. The ISI distribution in the middle panel of Figure 3 belongs to a spike train with a Fano factor close to one, obtained for J_s = 0.714. The overlaid exponential reveals one deviation from the measured ISI counts: while intervals of diminishing length are the most likely ones for a true Poisson process, our neuronal spike trains always show some refractoriness, reflected in a dip at
Figure 2: Fano factors as a function of overall synaptic strength J_s and intracortical excitation strength ε for three different inhibition factors: g = 1, 1.5, and 2, respectively. The increase of any of these parameters results in more irregular firing statistics as measured by the Fano factor.

Figure 3: Interspike interval distributions for fixed ε = 0.5 and g = 1, and three different values of overall synaptic strength J_s: 1.42 (super-Poissonian), 0.714 (Poissonian), and 0.357 (sub-Poissonian). Overlaid on each panel is the exponential fall-off of a true Poisson distribution with the same average rate, which is the same in all three cases.
the shortest intervals. (We have not used an explicit refractory period in our model. The dip seen here simply reflects the fact that it takes a little time for the membrane potential distribution to return to its steady-state form after reset.) Apart from this deviation, however, there is a close resemblance between the observed distribution and the "predicted" one. Finally, the lower panel of Figure 3 depicts a case with F < 1, with weaker synapses, leading to a stronger refractory effect and (since the rate is fixed) an accordingly narrower distribution around the average ISI length, as compared to the overlaid Poisson distribution. This distribution was obtained with weak synapses produced by a small scaling factor of J_s = 0.357.

As mentioned in the previous section, the Fano factor can also be obtained by integrating the spike train autocorrelation divided by the spike rate, equation 2.17. For a Poisson process, the autocorrelation vanishes at all lags different from zero. In contrast, F > 1 (the super-Poissonian case) implies a positive integral over nonzero lags, whereas in the sub-Poissonian case, there must be a negative area under the curve. Figure 4 shows examples of autocorrelations for all three cases. For the super-Poissonian case (dashed line), there is a "hill" of positive correlations for short intervals,
Figure 4: Three different spike train autocorrelations illustrating the relationship between the Fano factor F and the area under the curve. For F = 1 (Poissonian, solid line), the autocorrelation is an almost perfect delta function. F > 1 (super-Poissonian, dashed line) is reflected by a hill generating a positive area, and F < 1 (sub-Poissonian, dotted line) is accompanied by a valley of negative correlations. (See the text for more details.)

Figure 5: Spike count log(variance) versus log(mean) for three different values of overall synaptic strength J_s, varying the external input rate r_0. For J_s = 1.19 (super-Poissonian, triangles), the data look qualitatively like those from experiments. The other values of J_s are 0.714 (Poissonian, stars) and 0.357 (sub-Poissonian, crosses).
reflecting the tendency toward spike clustering. The sub-Poissonian autocorrelation (dotted line) shows a valley of negative correlations for short intervals, indicating well-separated spikes in a more regular spike train. The curve labeled Poisson (solid line) does have a small valley around zero lag, which reflects once more the refractoriness of neurons against firing at extremely short intervals, unlike a completely random Poisson process. (Actually, the measured F in this case is slightly greater than 1, implying that the integral of the very small positive tail for t > 2 ms slightly exceeds that of the more obvious negative short-time dip.)

Measurements on V1 neurons in awake monkeys (see, e.g., Gershon et al., 1998) suggest a linear relationship between the log variance and the log mean of stimulus-elicited spike counts. We find a similar dependence for neurons within our model network. Figure 5 shows results for three different values of J_s. In each case, five different values of the external input rate r_0 were tried, producing various mean spike counts and variances. The logarithm of the spike count variance is plotted as a function of the logarithm of the spike count mean, and a solid diagonal line indicates the identity, that is, a Fano factor of exactly 1. We see that for the largest value of J_s used here, the data look qualitatively like those from experiments, with Fano factors in the range of about 1.5 to 2.
Figure 6: Parameterization of the time-dependent input rate r_0(t). The input is modeled as the sum of three functions: (1) a stationary background rate (which is zero in this case); (2) a tonic part, which rises within the first 20 ms to a constant level A_0, where it stays for 60 ms, falling back to zero within the last 20 ms; and (3) an initial phasic part, which is nonzero only in the first 50 ms, rising to a maximum value of B_0.
3.1 Nonstationary Case. The results presented in the previous section were obtained with stationary inputs, while experimental data like those of Gershon et al. (1998) were collected from visual neurons subject to time-dependent inputs. Therefore, we performed calculations of the spike statistics in which the input population rate r_0 was time dependent. The modeled temporal shape of r_0(t) is depicted in Figure 6. It is the sum of three terms:

r_0(t) = R_0 + A(t) + B(t).   (3.2)

The first, R_0, is a constant, as in the preceding section. The second term, A(t), rises to a maximum over a 25 ms interval, remains constant for 50 ms, and then falls off to zero over the final 25 ms:

A(t) = \begin{cases} 0.5 A_0 (1 - \cos(4\pi t/T)) & 0 < t \le T/4 \\ A_0 & T/4 < t \le 3T/4 \\ 0.5 A_0 (1 - \cos(4\pi (T - t)/T)) & 3T/4 < t \le T \end{cases}   (3.3)
Figure 7: Spike count log(variance) versus log(mean) for time-varying external inputs with varying overall strength. The neuron in the simulated network (triangles) fires in a super-Poissonian regime, with an almost linear relationship for low spike rates between the log variance and the log mean, resembling closely data obtained from in vivo experiments. The diagonal solid line indicates the identity of variance and mean (Fano factor F = 1).
where T is the total simulation interval of 100 ms. The third term, B(t), rises to a maximum B_0 in the first 25 ms and then falls back to zero in the next 25 ms, remaining zero thereafter:

B(t) = \begin{cases} 0.5 B_0 (1 - \cos(4\pi t/T)) & 0 < t \le T/4 \\ 0.5 B_0 (1 - \cos(4\pi (T/2 - t)/T)) & T/4 < t \le T/2 \\ 0 & T/2 < t \le T \end{cases}   (3.4)
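The envelopes of equations 3.2 to 3.4 can be evaluated directly on the 100-step grid used in the simulations. The sketch below is illustrative; the default parameter values A_0 = 1.0 and B_0 = 0.5 are invented for the example (R_0 = 0.1 matches the background rate used in the text):

```python
import numpy as np

def input_rate(T=100, R0=0.1, A0=1.0, B0=0.5):
    """Time-dependent input rate r0(t) = R0 + A(t) + B(t) (eqs. 3.2-3.4),
    sampled at t = 1, ..., T."""
    t = np.arange(1, T + 1, dtype=float)
    # Tonic part: cosine rise, plateau, cosine fall (eq. 3.3).
    A = np.where(t <= T / 4, 0.5 * A0 * (1 - np.cos(4 * np.pi * t / T)),
        np.where(t <= 3 * T / 4, A0,
                 0.5 * A0 * (1 - np.cos(4 * np.pi * (T - t) / T))))
    # Phasic part: cosine rise and fall in the first half, then zero (eq. 3.4).
    B = np.where(t <= T / 4, 0.5 * B0 * (1 - np.cos(4 * np.pi * t / T)),
        np.where(t <= T / 2, 0.5 * B0 * (1 - np.cos(4 * np.pi * (T / 2 - t) / T)),
                 0.0))
    return R0 + A + B
```

At t = T/4 both envelopes peak (r_0 = R_0 + A_0 + B_0), between T/2 and 3T/4 only the tonic plateau remains, and at t = T the rate has returned to the background R_0, matching the shape sketched in Figure 6.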
Figure 7 shows the logarithm of the spike count variance plotted against the logarithm of the spike count mean for various nonstationary inputs characterized by different values of A_0 and B_0. The graph shows results for J_s = 0.95, ε = 0.5, g = 1, and a background rate of R_0 = 0.1. Table 1 lists the 16 combinations of the stimulus parameters A_0 and B_0, together with the resulting Fano factors F for the simulated neuron. The data look qualitatively like those obtained from in vivo experiments by Gershon et al. (1998) and are similar to the super-Poissonian case in Figure 5. The neuron fires consistently in a super-Poissonian regime, with Fano factors somewhat higher than 1 and an almost linear relationship between the
Table 1: Stimulus Parameters A_0 and B_0 for the Results Depicted in Figure 7 and the Resulting Fano Factors F.

A_0:  0.375  0.375  0.500  0.500  0.750  0.750  1.000  1.000
B_0:  0.125  0.375  0.125  0.375  0.250  0.750  0.250  0.750
F:    1.14   1.20   1.22   1.23   1.29   1.36   1.37   1.40

A_0:  1.500  1.500  2.000  2.000  3.000  3.000  4.000  4.000
B_0:  0.500  1.500  0.500  1.500  1.000  3.000  1.000  3.000
F:    1.48   1.50   1.55   1.53   1.57   1.41   1.43   1.34
log variance and the log mean for low spike counts. For higher spike counts, the curve bends toward lower Fano factors, just as for stationary inputs (see Figure 5). In both cases, this bend reflects the decrease in firing irregularity caused by the increasingly prominent role of refractoriness at shorter interspike intervals.

3.2 Comparison with Network Simulations. An extensive exploration of the validity of mean field theory is beyond the scope of this letter. However, we have performed simulations of networks constructed according to the model of section 2 and compared their firing irregularity with that obtained in mean field theory. Specifically, we tested two main results of our mean field analysis: that Fano factors increase systematically with synaptic strength and that there is an approximate power law between the mean spike count and the spike count variance, similar to experimental findings (see Figures 2 and 5, respectively).

Figure 8 shows measured Fano factors for a typical neuron in a network with K_0 = K_1 = 400, K_2 = 100, and N = 10,000, where we varied J_s in the same range as in Figure 2 (other parameters were g = 1 and ε = 0.5). The Fano factor increases systematically as J_s increases, lying in the quantitative range predicted by mean field theory. In addition (results not shown), we explored the lower and upper limits of Fano factors in our network. For J_s = 0.1, the average Fano factor of all neurons in the network was 0.034. Notwithstanding the very regular firing of all neurons in this network with very weak synapses, the overall activity remained asynchronous, as required in our mean field analysis. At the other extreme, for very strong synapses with J_s = 32, we observed an average Fano factor of 16.05, and individual neurons exhibited Fano factors of up to 30 and higher.
These results show that networks of integrate-and-fire neurons exhibit a wide range of Fano factors in their balanced state, depending on synaptic strength. In Figure 9, we show plots of the logarithm of the spike count variance as a function of the logarithm of the mean spike count for six individual
Figure 8: Fano factors as a function of overall synaptic strength J_s obtained from network simulations for a randomly chosen neuron (g = 1, ε = 0.5). A comparison with Figure 2 reveals that the mean field calculations correctly predict both the qualitative relationship between Fano factors and synaptic strength and the quantitative range of Fano factors for this range of J_s values.
neurons in the network. Analogous to Figure 5, results for three different values of J_s are shown (1.5, 1.1, and 0.714, indicated by triangles, stars, and crosses, respectively), each probed with five different strengths of external input. The neurons were chosen randomly from all 8000 excitatory neurons with nonzero firing rates. With the exception of neuron 4638 (in the lower middle panel), J_s = 1.5 resulted in super-Poissonian firing statistics, J_s = 0.714 in sub-Poissonian firing, and J_s = 1.1 in approximately Poissonian statistics. There is a strong qualitative resemblance between the network simulation results in Figure 9 and the mean field results in Figure 5, the latter showing the spike count statistics of the hypothetical "average neuron" defined above. Taken together, these results suggest that mean field theory provides a reliable way to estimate firing variability in balanced networks.

4 Discussion

Cortical neurons receive thousands of excitatory and inhibitory inputs, and despite the high number of inputs from nearby neurons with similar firing statistics and similar connectivity, their observed firing is very irregular (Heggelund & Albus, 1978; Dean, 1981; Tolhurst et al., 1981, 1983; Vogels, Spileers, & Orban, 1989; Snowden et al., 1992; Gur et al., 1997; Shadlen &
[Figure 9 appears here: six panels of spike count variance versus mean, one per neuron, labeled neuron numbers 4662, 1808, 4638, 4239, 2672, and 3388.]
Figure 9: Spike count statistics obtained from network simulations for six randomly chosen neurons. Spike count log(variance) versus log(mean) is plotted as in Figure 5, for various external input rates at three different values of synaptic strength (J_s = 1.5, 1.1, and 0.714, represented by triangles, stars, and crosses, respectively). There is a close qualitative resemblance to the results from the mean field calculations shown in Figure 5, where the spike statistics of a hypothetical neuron with average overall input are shown.
Newsome, 1998; Gershon et al., 1998; Kara et al., 2000; Buracas et al., 1998; Lee et al., 1998; DeWeese et al., 2003). Dynamically balanced excitation and inhibition through a simple feedback mechanism provide an explanation that naturally accounts for this phenomenon without requiring fine-tuning of the parameters (Amit & Brunel, 1997a, 1997b; Brunel, 2000; van Vreeswijk and Sompolinsky, 1996, 1998). Moreover, neurons in such model networks show an almost linear input-output relationship (input current versus firing frequency), as do neurons in the neocortex. Whenever one wants to understand a complex dynamical system, one asks whether there is an approximate theory, possibly exact in some interesting limit, that captures and affords insight into the observed properties. Here, the high connectivity of the cortical networks of interest suggests trying to obtain this insight from mean field theory, which becomes exact for an infinite, extensively connected system. In this article, we have formulated a
complete mean field description of the dynamically balanced asynchronous firing state in the dilute, high-connectivity limit N_a ≫ K_a ≫ 1. Because of the assumed random connection structure in the network, the mean field theory has to include autocorrelation functions and rate variations, as well as population-mean rates, as order parameters, as in spin glasses (Sompolinsky & Zippelius, 1982). We used this mean field theory to analyze firing correlations. We found that the relationship between the observed irregularity of firing (spike count variance) and the firing rate (spike count mean) of the neurons closely resembles data collected from in vivo experiments (see Figures 5 and 7).

To do this, we developed a complete mean field theory for a network of leaky integrate-and-fire neurons, in which both firing rates and correlation functions are determined self-consistently. Using an algorithm that allows us to find the solutions of the mean field equations numerically, we could elucidate how the strength of synapses within the network influences the expected firing statistics of cortical neurons in a systematic manner (see Figure 2). We have shown that the irregularity of firing, as measured by the Fano factor, increases with increasing synaptic strength. Nearly Poissonian statistics (with F ≈ 1) are observed for moderately strong synapses, but the transition from sub-Poissonian to super-Poissonian statistics is smooth, without a special role for F = 1. Higher irregularity in the spike counts is always accompanied by a tendency toward more "bursty" firing. (These bursts are a network effect, because the model contains only leaky integrate-and-fire neurons, which do not burst on their own.) This burstiness is best seen in the spike train autocorrelation function (see Figure 4), which acquires a hill of growing size and width around zero lag as the Fano factor increases.
The interdependence between firing irregularity and bursting can be understood with the help of the ISI distributions depicted in Figure 3: when the rate, and thus the average ISI, is kept constant, any higher count of shorter-than-average ISIs must be accompanied by an accordingly higher count of longer ISIs (indicating bursts), and vice versa. Thus, higher irregularity always goes hand in hand with a stronger tendency toward temporal clustering of spikes.

Why do stronger synapses lead to higher irregularity in firing? The size of the input current fluctuations in equation 2.9 is controlled by the J_ab, and so, therefore, are the corresponding membrane potential fluctuations. Thus, for example, the width of the steady-state membrane potential distribution is proportional to J_s. We next have to consider where this distribution is centered. Remembering that, according to the balance condition, the firing rate is independent of J_s, we see that the center of the distribution has to move farther away from threshold as J_s is increased in order to keep the rate fixed. Therefore, for very small J_s, almost the entire equilibrium membrane potential distribution will lie well above the postspike reset value, while for large J_s, it will be mostly below reset.
A. Lerchner, C. Ursta, J. Hertz, M. Ahmadi, P. Ruffiot, and S. Enemark
Immediately after a spike, the membrane potential distribution is a delta function centered at the reset (here 0). It then spreads, and its mean moves up or down toward its equilibrium value. This equilibration will take about a membrane time constant. If the equilibrium value is well above zero (the small-J_s case), the probability of reaching threshold will be suppressed during this time, implying a refractory dip in the ISI distribution and the correlation function and a tendency toward a Fano factor less than 1. In the large-J_s case, where the membrane potential is reset much closer to the threshold than to its eventual equilibrium value, the initial rapid spread (with the width growing proportional to J_s √t) leads to an enhanced probability of early spikes. At short times, this diffusive spread dominates the downward drift of the mean (which is only linear in t). Thus, there is extra weight in the ISI distribution and a positive correlation function at these short times, leading to a Fano factor greater than 1. Empirically, an approximate power-law relationship between the mean and variance of the spike count has frequently been observed for cortical neurons (see, e.g., Tolhurst et al., 1981; Vogels, Spileers, & Orban, 1989; Gershon et al., 1998; Lee et al., 1998). Our model shows the same qualitative feature (see Figures 5 and 6), though we have no argument that the relation should be an exact power law. However, this agreement suggests that the model captures at least part of the physics underlying the firing statistics. As already observed, not all of the variability in measured neuron responses has to be explained in the manner outlined above. Changing conditions during the run of a single experiment may introduce extra irregularity, caused by collecting statistics over trials with different mean firing rates. Our analysis shows why—and how much—irregularity can be expected due to intrinsic cortical dynamics.
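Checking the approximate power law variance ≈ A·mean^B on count data is a straightforward log-log regression; a minimal sketch with synthetic mean–variance pairs (the constants 1.5 and 1.2 are fabricated for illustration):

```python
import numpy as np

def power_law_fit(means, variances):
    """Least-squares fit of variance ≈ A * mean**B in log-log space."""
    slope, intercept = np.polyfit(np.log(means), np.log(variances), 1)
    return np.exp(intercept), slope          # A, B

# Synthetic counts obeying variance = 1.5 * mean**1.2 exactly:
means = np.linspace(1.0, 50.0, 20)
variances = 1.5 * means**1.2
A, B = power_law_fit(means, variances)
```

On real spike-count data the fit is only approximate, and the scatter of the residuals indicates how far the relation deviates from an exact power law.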
Other authors have also studied firing irregularity, in phase-oscillator models (Bressloff & Coombes, 2000; Bressloff, Bressloff, & Cowan, 2000) and in a ring model with inhomogeneous excitation and inhibition (Lin, Pawelzik, Ernst, & Sejnowski, 1998). Both groups found that their models could produce highly irregular firing. In our work, we have tried to make a systematic study of how the irregularity (quantified by the Fano factor) depends on system parameters for a fairly simple model appropriate for describing local (intracolumn) neocortical networks. We have used instantaneous synapses; that is, we have not included synaptic filtering of input spike trains in the calculations we have reported here. However, we have incorporated such filtering, with a simple exponential kernel, into our code and explored the effects of a nonzero synaptic current decay time τ_syn. We find that for small τ_syn/τ, Fano factors grow in proportion to this ratio,

F(τ_syn) = F(0) (1 + a τ_syn/τ),    (4.1)
Response Variability in Balanced Cortical Networks
with a = O(1) > 0. Since we are most interested in the limit τ_syn ≪ τ, we have not studied these corrections in detail. However, it should be noted that in a model where the synapses are modeled by conductances instead of current pulses (Lerchner, Ahmadi, & Hertz, 2004), the effective membrane time constant can become very small, so τ_syn can be considerably larger than it (Destexhe et al., 2003). In this case, the dynamics become rather different. Our formulation of the mean field theory is general enough to allow other straightforward extensions toward greater biological realism and more complicated network architectures. We have extended the model to include systematic structure in the connections, modeling an orientation hypercolumn in the primary visual cortex (Hertz & Sterner, 2003). Moreover, our algorithm for finding the mean field solutions is not restricted to networks of integrate-and-fire neurons. It can be applied to any kind of neuronal model. Furthermore, synaptic depression and facilitation can be incorporated by using synaptically filtered spike trains to compute the self-consistent solutions. As we remarked earlier, if one ignores correlations in the synaptic input and neuron-to-neuron rate variations (Amit & Brunel, 1997a, 1997b), analytic self-consistent equations for the population rates can be derived. From these, one can calculate the steady-state Fano factor analytically in closed form (Brunel, 2000). Such a calculation is obviously not self-consistent, although it can give qualitative information about firing irregularity. We have done some calculations, using our single-neuron simulation methods, but in which we impose the Amit-Brunel approximations by hand when generating the input noise. As one could anticipate, we find that this procedure systematically underestimates Fano factors in the super-Poissonian regime, by factors of up to 2 or so at the largest values of J_s studied here.
Since it is necessary to solve the full mean field theory numerically, one might ask: If it is necessary to resort to numerical solution anyway, why not just simulate the network directly? Our answer is that beyond the advantage of having to simulate only one neuron at a time, it is interesting to know what the predictions of mean field theory are. To the extent that they agree with network simulations, we can understand our findings in terms of single-neuron properties (albeit with self-consistent synaptic current statistics). Discrepancies would point to either finite-size or finite-concentration effects or more subtle correlation effects not included in the mean field ansatz. Identifying such effects, if they exist, would point the way toward future theoretical investigations, which could shed potentially useful light on the dynamics of these networks.
References

Amit, D., & Brunel, N. (1997a). Dynamics of a recurrent network of spiking neurons before and following learning. Network, 8, 373–404.
Amit, D., & Brunel, N. (1997b). Model of spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252.
Bressloff, P. C., Bressloff, N. W., & Cowan, J. D. (2000). Dynamical mechanism for sharp orientation tuning in an integrate-and-fire model of a cortical hypercolumn. Neural Comp., 12, 2473–2511.
Bressloff, P. C., & Coombes, S. (2000). Dynamics of strongly coupled neurons. Neural Comp., 12, 91–129.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8, 183–208.
Buracas, G. T., Zador, A. M., DeWeese, M. R., & Albright, T. D. (1998). Efficient discrimination of temporal patterns by motion-sensitive neurons in primate visual cortex. Neuron, 20, 959–969.
Dean, A. F. (1981). The variability of discharge of simple cells in the cat striate cortex. Exp. Brain Res., 44, 437–440.
Destexhe, A., Rudolph, M., & Paré, D. (2003). The high-conductance state of neocortical neurons in vivo. Nature Rev. Neurosci., 4, 739–761.
DeWeese, M. R., Wehr, M., & Zador, A. M. (2003). Binary spiking in auditory cortex. J. Neurosci., 23, 7940–7949.
Eisfeller, H., & Opper, M. (1992). New method for studying the dynamics of disordered spin systems without finite-size effects. Phys. Rev. Lett., 68, 2094–2097.
Fulvi Mari, C. (2000). Random networks of spiking neurons: Instability in the Xenopus tadpole moto-neuron pattern. Phys. Rev. Lett., 85, 210–213.
Gershon, E., Wiener, M. C., Latham, P. E., & Richmond, B. J. (1998). Coding strategies in monkey V1 and inferior temporal cortex. J. Neurophysiol., 79, 1135–1144.
Gur, M., Beylin, A., & Snodderly, D. M. (1997). Response variability of neurons in primary visual cortex (V1) of alert monkeys. J. Neurosci., 17, 2914–2920.
Heggelund, P., & Albus, K. (1978). Response variability and orientation discrimination of single cells in striate cortex of cat. Exp. Brain Res., 32, 197–211.
Hertz, J., Richmond, B., & Nilsen, K. (2003). Anomalous response variability in a balanced cortical network model. Neurocomputing, 52–54, 787–792.
Hertz, J., & Sterner, G. (2003). Mean field model of an orientation hypercolumn. Soc. for Neurosci. Abstract, no. 911.19.
Kara, P., Reinagel, P., & Reid, R. C. (2000). Low response variability in simultaneously recorded retinal, thalamic, and cortical neurons. Neuron, 27, 635–646.
Lee, D., Port, N. L., Kruse, W., & Georgopoulos, A. P. (1998). Variability and correlated noise in the discharge of neurons in motor and parietal areas of primate cortex. J. Neurosci., 18, 1161–1170.
Lerchner, A., Ahmadi, M., & Hertz, J. (2004). High conductance states in a mean field cortical network model. Neurocomputing, 58–60, 935–940.
Lin, J. K., Pawelzik, K., Ernst, U., & Sejnowski, T. J. (1998). Irregular synchronous activity in stochastically-coupled networks of integrate-and-fire neurons. Network, 9, 333–344.
Oram, M. W., Wiener, M. C., Lestienne, R., & Richmond, B. J. (1999). Stochastic nature of precisely-timed spike patterns in visual system neural responses. J. Neurophysiol., 81, 3021–3033.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896.
Snowden, R. J., Treue, S., & Andersen, R. A. (1992). The response of neurons in areas V1 and MT of the alert rhesus monkey to moving random dot patterns. Exp. Brain Res., 88, 389–400.
Sompolinsky, H., & Zippelius, A. (1982). Relaxational dynamics of the Edwards-Anderson model and the mean-field theory of spin glasses. Phys. Rev. B, 25, 6860–6875.
Tolhurst, D. J., Movshon, J. A., & Dean, A. F. (1983). The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Res., 23, 775–785.
Tolhurst, D. J., Movshon, J. A., & Thompson, I. D. (1981). The dependence of response amplitude and variance of cat visual cortical neurones on stimulus contrast. Exp. Brain Res., 41, 414–419.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comp., 10, 1321–1371.
Vogels, R., Spileers, W., & Orban, G. A. (1989). The response variability of striate cortical neurons in the behaving monkey. Exp. Brain Res., 77, 432–436.
Received February 18, 2004; accepted August 23, 2005.
LETTER
Communicated by Richard Zemel
The Costs of Ignoring High-Order Correlations in Populations of Model Neurons Melchi M. Michel
[email protected]
Robert A. Jacobs
[email protected] Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627, U.S.A.
Investigators debate the extent to which neural populations use pairwise and higher-order statistical dependencies among neural responses to represent information about a visual stimulus. To study this issue, three statistical decoders were used to extract the information in the responses of model neurons about the binocular disparities present in simulated pairs of left-eye and right-eye images: (1) the full joint probability decoder considered all possible statistical relations among neural responses as potentially important; (2) the dependence tree decoder also considered all possible relations as potentially important, but it approximated high-order statistical correlations using a computationally tractable procedure; and (3) the independent response decoder assumed that neural responses are statistically independent, meaning that all correlations should be zero and thus can be ignored. Simulation results indicate that high-order correlations among model neuron responses contain significant information about binocular disparities and that the amount of this high-order information increases rapidly as a function of neural population size. Furthermore, the results highlight the potential importance of the dependence tree decoder to neuroscientists as a powerful but still practical way of approximating high-order correlations among neural responses.

1 Introduction

The left and right eyes of human observers are offset from each other, and, thus, the visual images received by these eyes differ. For example, an object in the visual environment may project to one location in the left eye image but project to a different location in the right eye image. Differences in left eye and right eye images that arise in this manner are known as binocular disparities. Disparities are important because they are often among the most reliable cues to the relative depth of a surface or object in space.
Neural Computation 18, 660–682 (2006) © 2006 Massachusetts Institute of Technology
Observers with normal stereo vision are typically able to
make fine depth discriminations because they can resolve differences in horizontal disparities below 1 arc minute (Andrews, Glennerster, & Parker, 2001). How this is accomplished is a matter of current research. Neurophysiological and modeling studies have identified binocular simple and complex cells in primary visual cortex as a likely source of disparity information, and researchers have developed a computational model known as a binocular energy filter to characterize the responses of these cells to visual scenes viewed binocularly (DeAngelis, Ohzawa, & Freeman, 1991; Freeman & Ohzawa, 1990; Ohzawa, DeAngelis, & Freeman, 1990). Based on analyses of binocular energy filters, Qian (1994), Fleet, Wagner, and Heeger (1996), and others have argued, however, that the response of an individual simple or complex cell is ambiguous. In addition to uncertainty introduced by neural noise, ambiguities arise because a cell’s preferred disparity depends on the distribution of stimulus frequencies, a cell’s tuning response has multiple false peaks (i.e., the cell gives large responses to disparities that differ from its preferred disparity), and image features in a cell’s left eye and right eye receptive fields may influence a cell’s response even when the features do not arise from the same event in the visual world. These points suggest that in order to overcome the ambiguity of an individual neuron’s responses, the neural process responsible for estimating disparity must pool the responses of a large number of neurons. Researchers studying neural codes often use statistical techniques to interpret the activities of neural populations (Abbott & Dayan, 1999; Oram, Földiák, Perrett, & Sengpiel, 1998; Pouget, Dayan, & Zemel, 2003). A matter of current debate among these investigators is the relative importance of considering dependencies, or correlations, among cells in a population when decoding the information that the cells convey about a stimulus.
Correlations among neural responses have been investigated as a potentially important component of neural codes for over 30 years (Perkel & Bullock, 1969). Unfortunately, determining the importance of correlations is not straightforward. For methodological reasons, it is typically feasible only to experimentally measure pairwise or second-order correlations among neural responses, meaning that high-order correlations are not measured. Even if correlations are accurately measured, there is no guarantee that these correlations contain useful information: correlations can increase, decrease, or leave unchanged the total information in a neural population (Abbott & Dayan, 1999; Nirenberg & Latham, 2003; Seriès, Latham, & Pouget, 2004). To evaluate the importance of correlations, researchers have often compared the outputs of statistically efficient neural decoders, based on maximum likelihood or Bayesian statistical theory, that make different assumptions regarding correlations. Neural decoders are not models of neural mechanisms, but rather statistical procedures that help determine how much information neural responses contain about a stimulus by expressing this information as a probability distribution (Abbott & Dayan, 1999; Oram et al., 1998; Pouget et al., 2003). Statistically efficient neural decoders are
useful because they provide an upper bound on the amount of information about a stimulus contained in the activity of a neural ensemble. Researchers can evaluate the importance of correlations by comparing the value of this bound when it is computed by a neural decoder that makes use of correlations with the value of this bound when it is computed by a decoder that does not. Alternatively, researchers can compare the performances of neural decoders that use or do not use correlations on a stimulus-relevant task. Several recent studies have suggested that correlations among neurons play only a minor role in encoding stimulus information (e.g., Averbeck & Lee, 2003; Golledge et al., 2003; Nirenberg, Carcieri, Jacobs, & Latham, 2001; Panzeri, Schultz, Treves, & Rolls, 1999; Rolls, Franco, Aggelopoulos, & Reece, 2003), and that the independent responses of neurons carry more than 90% of the total information available in the population response (Averbeck & Lee, 2004). An important limitation of these studies is that they considered only pairwise or second-order correlations among neural responses and thus ignored high-order correlations either by assuming multivariate gaussian noise distributions (e.g., Averbeck & Lee, 2003) or by using a short-time scale approximation to the joint distribution of responses and stimuli (e.g., Panzeri et al., 1999; Rolls et al., 2003). These studies therefore did not fairly evaluate the information contained in the response of a neural population when correlations are considered versus when they are ignored. In a population of n neurons, there are on the order of n^p pth-order statistical interactions among neural response variables. In other words, computing high-order correlations is typically not computationally feasible with current computers.
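The combinatorial explosion behind this claim can be seen by counting the distinct p-neuron subsets, C(n, p) ≈ n^p/p!, directly (a quick illustration, not from the article):

```python
from math import comb

# Number of distinct pth-order interaction terms among n neurons,
# C(n, p), for a few illustrative population sizes and orders:
counts = {n: {p: comb(n, p) for p in (2, 3, 5)} for n in (10, 100, 1000)}
```

Already for n = 1000 neurons there are trillions of fifth-order terms, far too many to estimate from any realistic amount of data.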
This does not mean, of course, that the nervous system does not make use of high-order correlations or that researchers who fail to consider high-order correlations are justified in concluding that nearly all the information in a neural code is carried by the independent responses of the neurons comprising the population. What is needed is a computationally tractable method for estimating high-order statistics, even if this is done in only an approximate way. This letter addresses these issues through the use of computer simulations of model neurons, known as binocular energy filters, whose binocular sensitivities resemble those of simple and complex cells in primary visual cortex. The responses of the model neurons to binocular views of visual scenes of frontoparallel surfaces were computed. These responses were then decoded in order to measure how much information they carry about the binocular disparities in the left eye and right eye images. Three neural decoders were simulated. The first decoder, referred to as the full joint probability decoder (FJPD), did not make any assumptions regarding statistical correlations. Because it considered all possible combinations of neural responses, it is the gold standard to which all other decoders were compared. The second decoder, known as the dependence tree decoder (DTD), is similar to the FJPD in the sense that it regarded all correlations as potentially important. However, it used a computationally tractable method to estimate
high-order statistics, albeit in an approximate way (Chow & Liu, 1968; Meilă & Jordan, 2000). The final decoder, referred to as the independent response decoder (IRD), assumed that neural responses are statistically independent, meaning that all correlations should be zero and thus can be ignored. Via computer simulation, we measured the percentage of information that is lost in a population of disparity tuned cells when high-order correlations are approximated and when all correlations are ignored. We also examined the abilities of the DTD and IRD (and a decoder limited to second-order correlations) to correctly estimate the disparity of a frontoparallel surface. The results reveal several interesting findings. First, relative to the amount of information about disparity calculated by the FJPD, the amounts of information calculated by the IRD and DTD were proportionally smaller when more model neurons were used. In other words, the informational cost of ignoring correlations or of roughly approximating high-order correlations increased as a function of neural population size. This implies that there is a large amount of information about disparity conveyed by second-order and high-order correlations among model neuron responses. Second, the informational cost of ignoring all correlations (as in the IRD) rose as the number of neural response levels increased. For example, relative to the amount of information calculated by the FJPD, the amount of information calculated by the IRD was smaller when neuron responses were discretized to four levels (2 bits of information about each neural response) than when they were discretized to eight levels (3 bits of information about a neural response). This trend was less evident for the DTD.
Third, when used to estimate the disparity in a pair of left eye and right eye images, the DTD consistently outperformed the IRD, and the magnitude of its performance advantage increased rapidly as the neural population size increased and as the number of response levels increased. Because the DTD also outperformed a neural decoder based on a multivariate gaussian distribution, our data again indicate that high-order correlations among model neuron responses contain significant information about binocular disparities. These results have important implications for researchers studying neural codes. They suggest that earlier studies indicating that independent neural responses carry the vast majority of information conveyed by a neural population may be flawed because these studies limited their investigations to second-order correlations and thus did not examine high-order correlations. Furthermore, these results highlight the potential importance of the DTD to neuroscientists. This decoder uses a technique developed in the engineering literature (Chow & Liu, 1968; Meilă & Jordan, 2000), but seemingly unknown in the neuroscientific literature, to approximate high-order statistics. Significantly, it does so in a way that is computationally tractable—the calculation of the approximation requires only knowledge about pairs of neurons. This fact, in the context of the results summarized above, suggests that the DTD can replace the IRD as a better, but still practical, approximation to the information contained in a neural population.
2 Simulated Images

The simulated images were created in a manner similar to the method used by Lippert & Wagner (2002), with the difference that the texture elements used by those authors were random black and white dots, whereas the elements that we used were white noise (luminances were real-valued as in Tsai & Victor, 2003). Each image depicted a one-dimensional frontoparallel surface on which were painted dots whose luminance values were chosen from a uniform distribution to take values between 0 (dark) and 1 (light). A virtual observer who maintained fixation at a constant depth and horizontal position in the scene viewed the surface as its depth was varied among 15 possible depth values relative to the fixation point. One of these depth values was the depth of the fixation plane; of the remaining depths, 7 were located farther than the fixation point from the observer, and 7 were located nearer the observer. Each image of a scene extended over 5 degrees of visual angle and was divided into 186 pixels per degree. Because each pixel’s luminance value was chosen randomly from a uniform distribution, an image contained approximately equal power at all frequencies between 0 cycles per degree and 93 cycles per degree (the Nyquist frequency). For each stereo pair, the left image was generated first; then the right image was created by shifting the left image to the right by a particular number of pixels (this was done with periodic borders; e.g., pixel values that shifted past the right border were assigned to pixels near the left border). This shift varied between −7 and 7 pixels so that the shift was negative when the surface was nearer the observer, zero when the surface was located at the fixation plane, and positive when the surface was located beyond fixation.

3 Model Neurons

Model neurons were instances of binocular energy filters, which are computational models developed by Ohzawa et al. (1990).
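The stereo-pair construction of section 2 can be sketched as follows (function and variable names are ours; the periodic shift is implemented with `np.roll`):

```python
import numpy as np

def make_stereo_pair(shift_px, n_pixels=930, rng=None):
    """Left image: 1-D white noise with luminances uniform in [0, 1]
    (5 degrees at 186 pixels per degree = 930 pixels).  The right image
    is the left image shifted rightward by shift_px pixels with periodic
    borders; negative shifts correspond to surfaces nearer than fixation."""
    rng = rng or np.random.default_rng()
    left = rng.uniform(0.0, 1.0, n_pixels)
    right = np.roll(left, shift_px)   # wraps values past the right border
    return left, right

left, right = make_stereo_pair(shift_px=3, rng=np.random.default_rng(1))
```

With a positive shift, the tail of the left image wraps around to the left border of the right image, exactly as the periodic-border rule in the text requires.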
We used binocular energy filters because they provide a good approximation to the binocular sensitivities of simple and complex cells in primary visual cortex. The fidelity of the energy model with respect to the responses of binocular simple and complex cells has been demonstrated both in cat area 17 (Anzai, Ohzawa, & Freeman, 1997; Ohzawa et al., 1990; Ohzawa, DeAngelis, & Freeman, 1997) and in macaque V1 (Cumming & Parker, 1997; Perez, Castro, Justo, Bermudez, & Gonzalez, 2005; Prince, Pointon, Cumming, & Parker, 2002). Although modifications and extensions to the model have been proposed by different researchers (e.g., Fleet et al., 1996; Qian & Zhu, 1997; Read & Cumming, 2003; Tsai & Victor, 2003), the basic form of the energy model remains a widely accepted representation of simple and complex cell responses to binocular stimuli. A simple cell is modeled as comprising left eye and right eye receptive subfields. Each subfield is modeled as a
Gabor function, which is a sinusoid multiplied by a gaussian envelope. We used the phase-shift version of the binocular energy model, meaning that the retinal positions of the gaussian envelopes for the left eye and right eye Gabor functions are identical, though the sinusoidal components differ by a phase shift. Formally, the left (g_l) and right (g_r) simple cell subfields are expressed as the following Gabor functions:

g_l = (1/√(2πσ²)) e^(−x²/2σ²) sin(2πωx + φ),    (3.1)

g_r = (1/√(2πσ²)) e^(−x²/2σ²) sin(2πωx + φ + δφ),    (3.2)
where x is the distance to the center of the gaussian, the variance σ² specifies the width of the gaussian envelope, ω represents the frequency of the sinusoid, φ represents the base phase of the sinusoid, and δφ represents the phase shift between the sinusoids in the right and left subfields. The response of a simple cell is formed in two stages: first, the convolution of the left eye image with the left subunit Gabor is added to the convolution of the right eye image with the right subunit Gabor; next, this sum is rectified. The response of a complex cell is the sum of the squared outputs of two simple cells whose parameter values are identical except that one has a base phase of 0 and the other has a base phase of π/2.1 In our simulations, the gaussian envelopes for all neurons were centered at the same point in the visual scene. The parameter values that we used in our simulations were randomly sampled from the same distributions used by Lippert and Wagner (2002); these investigators picked distributions based on neurophysiological data regarding spatial frequency selectivities of neurons in macaque visual cortex. Preferred spatial frequencies were drawn from a log-normal distribution whose underlying normal distribution had a mean of 1.6 cycles per degree and a standard deviation of 0.7 cycle per degree. The range of these preferred frequencies was clipped at a ceiling value of 20 cycles per degree and a floor value of 0.4 cycle per degree. The simple cells’ receptive field sizes were sampled from a normal distribution with a mean of 0.5 period and a standard deviation of 0.25 period, with a floor value of 0.1 period. A cell’s preferred disparity, given by δφ/(2πω), was sampled from a normal distribution with a mean of 0 degrees of visual angle and a standard deviation of 0.5 degree. Figure 1 shows the normalized responses of a typical model complex cell to three different scenes, each using a different white noise pattern to cover the frontoparallel surface.
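Equations 3.1 and 3.2 and the simple/complex construction can be sketched as follows (a simplified illustration, not the authors' code: filters are applied by inner product at a single retinal position, the quadrature pair's linear binocular responses are squared directly, and the parameter values are arbitrary rather than sampled from the distributions described above):

```python
import numpy as np

def gabor(x, sigma, omega, phi):
    """Gabor subfield (eqs. 3.1, 3.2): gaussian-windowed sinusoid."""
    return (np.exp(-x**2 / (2.0 * sigma**2)) / np.sqrt(2.0 * np.pi * sigma**2)
            * np.sin(2.0 * np.pi * omega * x + phi))

def complex_cell_response(left_img, right_img, x, sigma, omega, dphi):
    """Binocular energy: sum over a quadrature pair (base phases 0, pi/2)
    of the squared binocular response (left plus right subfield outputs)."""
    energy = 0.0
    for phi in (0.0, np.pi / 2.0):
        g_l = gabor(x, sigma, omega, phi)
        g_r = gabor(x, sigma, omega, phi + dphi)
        energy += (left_img @ g_l + right_img @ g_r) ** 2
    return energy

# Arbitrary illustrative parameters and stimuli:
x = np.linspace(-2.5, 2.5, 930)        # 5 degrees at 186 pixels per degree
rng = np.random.default_rng(0)
left = rng.uniform(0.0, 1.0, 930)
right = np.roll(left, 3)               # a disparate view of the same noise
r = complex_cell_response(left, right, x, sigma=0.4, omega=1.6, dphi=0.3)
```

The energy output is nonnegative by construction, and sweeping `dphi` (or the stimulus shift) traces out a disparity tuning curve like the ones in Figure 1.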
Each of the lines in the figure represents the
1 Note that binocular energy filters are deterministic. The probability distributions we use have nonzero variances because the white noise visual stimuli are stochastic.
[Figure 1 here: normalized response (0 to 0.3) plotted against stimulus disparity (−0.15 to 0.15 degrees) for three white noise stimuli, labeled image 1, image 2, and image 3.]
Figure 1: Characteristic responses of an individual model neuron as a function of the disparity (in degrees of visual angle) of the presented surface. The three curves show the normalized responses of a single model binocular energy neuron to each of three sample surfaces presented along a range of disparities (from −0.2 to 0.2 degree). The vertical dotted line indicates the cell’s preferred disparity (−0.0417 degree). This figure illustrates the fact that an individual model neuron’s response depends on many factors and thus is an ambiguous indicator of stimulus disparity.
responses of the model neuron as the disparity of a surface was varied. The neuron responded differently to different surfaces, illustrating that a single neuron’s response is an ambiguous indicator of stimulus disparity. This finding motivates the importance of decoding the activity of a population of neurons rather than that of a single neuron (Fleet et al., 1996; Qian, 1994).

4 Neural Decoders

Neural decoders are statistical devices that estimate the distribution of a stimulus parameter based on neural responses. Three different decoders evaluated p(d | r), the distribution of disparity, denoted d, given the
responses of the model complex cells, denoted r. The decoders differ in their assumptions about the importance of correlations among neural responses.

4.1 Full Joint Probability Decoder. The FJPD is the simplest of the decoders used, but also has the highest storage cost since it requires representing the full joint distribution of disparity and complex cell responses p(d, r). This distribution has s·b^n states, where s is the number of possible binocular disparities, b is the number of bins or response levels (i.e., each complex cell response was discretized to one of b values), and n is the number of complex cells in the population. The conditional distribution of disparity was calculated as

p_full(d | r) = p(d, r) / p(r),    (4.1)
where the joint distribution p(d, r) and marginal distribution p(r ) were computed directly from the complex cell responses to the visual scenes (histograms giving the frequencies of each of the possible values of r and (d, r) were generated and then normalized; see below). The result of equation 4.1 represents the output of the FJPD. 4.2 Dependence Tree Decoder. The DTD makes use of a data structure and learning algorithm originally proposed in the engineering literature (Chow & Liu, 1968; see also Meil˘a & Jordan, 2000). It can be viewed as an instance of a graphical model or Bayesian network, a type of model that is currently popular in the artificial intelligence community (Neapolitan, 2004). The basic idea underlying Bayesian networks is that a joint distribution over a set of random variables can be represented by a graph in which nodes correspond to variables and directed edges between nodes correspond to statistical dependencies (e.g., an edge from node x1 to node x2 means that the distribution of variable x2 depends on the value of variable x1 ; as a matter of terminology, node x1 is referred to as the parent of x2 ). Dependence trees are Bayesian networks that are restricted in the following ways: (1) the graphical model must be a tree (i.e., ignoring the directions of edges, there are no loops in the graph, meaning that there is exactly one path between every pair of nodes); (2) there is one node that is the root of the tree—this node has no parents; and (3) all other nodes have exactly one parent. A dependence tree is a graphical representation of the following factorization of a joint distribution:
$$ p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \mathrm{pa}(i)), \qquad (4.2) $$
668
M. Michel and R. Jacobs
Figure 2: An example of a dependence tree. Each of the nodes r1, . . . , r7 represents a random variable, such as the response of a model neuron. The edges (depicted as arrows) represent the conditional dependencies between variables and are labeled with the conditional distribution of a child variable given its parent p(child | parent). According to this tree, the joint distribution of these variables is factorized as follows: p(r1, . . . , r7) = p(r1) p(r3 | r1) p(r6 | r1) p(r7 | r6) p(r4 | r6) p(r5 | r6) p(r2 | r4).
where p(x_1, . . . , x_n) is the joint distribution of variables x_1, . . . , x_n and p(x_i | pa(i)) is the conditional distribution of variable x_i given the value of its parent (if x_i is the root of the tree, then p(x_i | pa(i)) = p(x_i)). Figure 2 depicts an example of a dependence tree. Of course, not all joint distributions can be factorized in this way. In this case, the right-hand side of equation 4.2 gives an approximation to the joint distribution. How can good approximations be discovered?
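As a concrete illustration, the factorization in equation 4.2 can be evaluated directly from per-node conditional tables. The sketch below mirrors the tree in Figure 2 but uses made-up (randomly generated) conditional distributions; the variable names and table layout are our illustrative assumptions, not code from the paper.

```python
# Sketch: evaluating a dependence-tree factorization (equation 4.2).
# The tree mirrors Figure 2 (0-based indices r0..r6 stand for r1..r7);
# the conditional tables are random illustrative values, not data.
from itertools import product

import numpy as np

# parent[i] gives the parent of node i; the root (node 0) has parent -1.
parent = {0: -1, 2: 0, 5: 0, 6: 5, 3: 5, 4: 5, 1: 3}

rng = np.random.default_rng(0)
b = 3  # number of discretized response levels per variable

# p_root[x] = p(root = x); p_cond[i][x_pa, x_i] = p(node i = x_i | parent = x_pa)
p_root = rng.dirichlet(np.ones(b))
p_cond = {i: rng.dirichlet(np.ones(b), size=b) for i in parent if parent[i] != -1}

def tree_joint(x):
    """Probability of the full assignment x = (x0, ..., x6) under the tree."""
    prob = p_root[x[0]]
    for i, pa in parent.items():
        if pa != -1:
            prob *= p_cond[i][x[pa], x[i]]
    return prob

# The factorized probabilities sum to 1 over all b**7 joint assignments,
# confirming that the factorization defines a valid joint distribution.
total = sum(tree_joint(x) for x in product(range(b), repeat=7))
```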
Costs of Ignoring High-Order Correlations
669
Chow and Liu (1968) developed an algorithm for finding such approximations and proved that the resulting approximation maximizes the likelihood of the data over all tree distributions. In short, the algorithm has three steps: (1) compute all pairwise marginal distributions p(x_i, x_j), where x_i and x_j are a pair of random variables; (2) compute all pairwise mutual informations I_ij; and (3) compute the maximum-weight spanning tree using I_ij as the weight for the edge between nodes x_i and x_j. This spanning tree is the dependence tree.² Importantly for our purposes, the algorithm has quadratic time complexity in the number of random variables, linear space complexity in the number of random variables, and quadratic space complexity in the number of response levels. That is, discovering the dependence tree that approximates the joint distribution among a set of variables will often be computationally tractable. The dependence tree decoder computes a dependence tree to approximate the joint distribution of complex cell responses given a binocular disparity value.³ This approximation is denoted p_tree(r | d). Using Bayes' rule, the distribution of disparity given cell responses is given by

$$ p_{\text{tree}}(d \mid \mathbf{r}) = \frac{p_{\text{tree}}(\mathbf{r} \mid d)\, p(d)}{p(\mathbf{r})}, \qquad (4.3) $$
where p(d), the distribution of disparities, is a uniform distribution (i.e., all disparities are equally likely), and p(r), the distribution of cell responses, is computed by marginalizing p_tree(r | d) over d. Equation 4.3 is the output of the DTD.

4.3 Independent Response Decoder. Using Bayes' rule, we can rewrite the probability of a disparity d given a response r as

$$ p(d \mid \mathbf{r}) = \frac{p(\mathbf{r} \mid d)\, p(d)}{p(\mathbf{r})}, \qquad (4.4) $$
where p(d) is the prior distribution of binocular disparities and p(r) is a distribution over complex cell responses. Because all disparities were equally likely, we set p(d) to be a uniform distribution. Consequently,

$$ p(d \mid \mathbf{r}) = k\, p(\mathbf{r} \mid d), \qquad (4.5) $$
² The spanning tree is an undirected graph. Our simulations used an equivalent directed graph obtained by choosing an arbitrary node to serve as the root of the tree. The directionality of all edges follows from this choice (all edges point away from the root).
³ Our data structure can be regarded as a mixture of trees in which there is one mixture component (i.e., one dependence tree) for each possible disparity value (Meilă & Jordan, 2000).
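The three-step Chow-Liu procedure described above can be sketched in a few lines of code. The helper names and the toy data below are our own illustrative assumptions, not code from the paper; the spanning tree is built with Kruskal's algorithm over mutual-information edge weights.

```python
# Sketch of the Chow-Liu algorithm: (1) pairwise marginals,
# (2) pairwise mutual informations, (3) maximum-weight spanning tree.
from itertools import combinations

import numpy as np

def mutual_information(xi, xj, b):
    """Mutual information (in nats) between two discretized response
    vectors, each taking values in {0, ..., b-1}."""
    joint = np.zeros((b, b))
    for a, c in zip(xi, xj):
        joint[a, c] += 1
    joint /= joint.sum()                      # pairwise marginal p(x_i, x_j)
    pi, pj = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0                            # avoid log(0) on empty cells
    return float((joint[nz] * np.log(joint[nz] / np.outer(pi, pj)[nz])).sum())

def chow_liu_tree(samples, b):
    """samples: (num_samples, n) array of discretized responses.
    Returns the edge list of the maximum-weight spanning tree."""
    n = samples.shape[1]
    edges = sorted(
        ((mutual_information(samples[:, i], samples[:, j], b), i, j)
         for i, j in combinations(range(n), 2)),
        reverse=True)                         # heaviest edges first
    parent = list(range(n))                   # union-find for Kruskal
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                          # edge creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy check: x1 is a noisy copy of x0 while x2 is independent, so the
# tree should contain the edge (0, 1).
rng = np.random.default_rng(1)
x0 = rng.integers(0, 3, size=2000)
x1 = (x0 + (rng.random(2000) < 0.1)) % 3
x2 = rng.integers(0, 3, size=2000)
tree = chow_liu_tree(np.stack([x0, x1, x2], axis=1), b=3)
```

A spanning tree over n variables always has n - 1 edges, which is what makes the representation's space cost linear in the number of variables.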
670
M. Michel and R. Jacobs
where k is a normalization factor equal to p(d)/p(r). The distinguishing feature of the independent response decoder (IRD) is that it assumes that the complex cell responses are statistically independent given the binocular disparity. In other words, the conditional joint distribution of cell responses is equal to the product of the distributions for the individual cells, that is, p(r | d) = ∏_i p(r_i | d), where r_i is the response of the ith complex cell. Equation 4.5 can therefore be rewritten as

$$ p_{\text{ind}}(d \mid \mathbf{r}) = k \prod_i p(r_i \mid d). \qquad (4.6) $$
The distribution of disparity as computed by equation 4.6 is the output of the IRD. The conditional distributions for individual cells p(r_i | d) were approximated in our simulations by normalized histograms based on cell responses to visual scenes.

4.4 Response Histograms. Normalized relative frequency histograms were used in our simulations to approximate the distributions of cell responses. In these histograms, each cell's response was discretized to one of b bins or response levels. This discretization was based on a cell's maximum observed response value. Our procedure was similar to that used by Lippert and Wagner (2002), with one important difference. Because the probability of a response was a rapidly decreasing function of response magnitude, Lippert and Wagner created bins representing responses from zero to half of the maximum observed response value and grouped all responses greater than half-maximum into the final bin. This was necessary to avoid bins corresponding to response values that never (or rarely) occurred. To deal with this same problem, we created histograms whose bin values were a logarithmic function of cell response.⁴

5 Simulation Results

Two sets of simulations were conducted. The goal of the first set was to compute the informational costs of using the approximate distributions calculated by the IRD, p_ind(d | r), or the DTD, p_tree(d | r), instead of the exact distribution calculated by the FJPD, p_full(d | r). To quantify these costs, we used an information-theoretic measure, referred to as ΔI/I, introduced

⁴ Specifically, histograms were created as follows. A cell's responses were first linearly normalized by dividing each response by that cell's maximum response across all stimuli. Next, each normalized response was discretized into one of b bins where boundaries between bins were logarithmically spaced. To get probabilities of responses given a disparity, bin counts were appropriately normalized and then smoothed using a gaussian kernel whose standard deviation equaled one-quarter of a bin width. This was done to avoid probabilities equal to zero.
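A minimal sketch of the histogram construction of section 4.4 and the independent-response decoding of equation 4.6: responses are normalized, assigned to logarithmically spaced bins, smoothed with a gaussian kernel of one-quarter bin width, and then combined as a product of per-cell likelihoods. The function names and the particular bin-edge choices are our illustrative assumptions, not code from the paper.

```python
# Sketch: log-binned response histograms and IRD decoding (equation 4.6).
import numpy as np

def discretize(responses, b):
    """Normalize each cell's responses by its maximum, then assign each
    value to one of b logarithmically spaced bins (edges are assumptions)."""
    norm = responses / responses.max(axis=0, keepdims=True)
    edges = np.logspace(-3, 0, b)          # log-spaced upper bin boundaries
    return np.minimum(np.searchsorted(edges, norm), b - 1)

def smoothed_histograms(binned, disparities, b, n_disp, sigma=0.25):
    """p(r_i | d) for every cell i and disparity d: bin counts smoothed with
    a gaussian kernel (std = one-quarter bin width) to avoid zero probabilities."""
    n_cells = binned.shape[1]
    levels = np.arange(b)
    kernel = np.exp(-0.5 * ((levels[:, None] - levels[None, :]) / sigma) ** 2)
    hist = np.zeros((n_disp, n_cells, b))
    for d in range(n_disp):
        for i in range(n_cells):
            counts = np.bincount(binned[disparities == d, i], minlength=b)
            sm = kernel @ counts           # spread counts into nearby bins
            hist[d, i] = sm / sm.sum()
    return hist

def ird_map_estimate(r_binned, hist):
    """Maximum a posteriori disparity under the independence assumption
    (equation 4.6 with a uniform prior): argmax_d sum_i log p(r_i | d)."""
    n_disp, n_cells, _ = hist.shape
    logpost = np.array([
        sum(np.log(hist[d, i, r_binned[i]]) for i in range(n_cells))
        for d in range(n_disp)])
    return int(np.argmax(logpost))
```

Working in log probabilities avoids the numerical underflow that a literal product over many cells would cause.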
by Nirenberg et al. (2001). We chose this measure because, unlike other measures of information difference such as I_shuffled (Nirenberg & Latham, 2003; Panzeri et al., 2002) and I_synergy (Brenner, Strong, Koberle, Bialek, & de Ruyter van Steveninck, 2000), this measure is sensitive only to dependencies that are relevant for decoding (Nirenberg & Latham, 2003).⁵ In brief, ΔI/I can be characterized as follows. The numerator of this measure is the Kullback-Leibler distance between the exact distribution and an approximating distribution. This distance is normalized by the mutual information between a stimulus property (e.g., the disparity d) and the neural responses r based on the exact distribution. A small value of ΔI/I means that the decoding produced by an approximate distribution contains similar amounts of information about the stimulus property as the decoding produced by an exact distribution, whereas a large value means that the approximate decoding contains much less information than the exact decoding. Simulations were conducted with a variety of neural population sizes (denoted n) and numbers of bins or response levels (denoted b). Neural population sizes were kept small because of the computational costs of computing the exact distribution p_full(d | r). Note that the number of possible values of r equals b^n; for example, if n = 8 and b = 8, then r can take 16,777,216 possible values. Fortunately, in practice, r took a smaller number of values by a factor of about 100, allowing us to compute p_full(d | r) using fewer presentations of visual scenes than would otherwise be the case. We used the responses of model neurons to a collection of 3 × 10⁶ visual scenes in which the frontoparallel surface was located at all possible depths (15 possible depths × 200,000 scenes per depth) to compute each of the probability distributions p_full(d | r), p_tree(d | r), and p_ind(d | r).
This process was repeated six times for each combination of neural population size n and number of bins b. The repetitions differed in the parameter values (e.g., spatial frequencies, receptive field sizes) used by the model neurons. The results are illustrated in Figure 3. The horizontal axis represents the simulation condition (combination of n and b), and the vertical axis represents the measure ΔI/I. Dark bars give the value of this measure for the IRD, and light bars give the value for the DTD. The error bars indicate the standard errors of the means based on six repetitions of each condition. There are at least two interesting trends in the data. First, for both the IRD and DTD approximations, the information cost grows with the size of the neural population. In other words, the approximate distributions provided by these decoders become poorer relative to the exact distribution as the neural population grows in size. A two-way (decoder by population size) ANOVA across the b = 3 conditions confirmed that this effect is significant (F(2,35) = 22.15; p < 0.001), with no significant decoder
⁵ The best way to measure the distance between two distributions for the purposes of neural decoding is a topic of ongoing scientific discussion (e.g., Nirenberg & Latham, 2003; Schneidman, Bialek, & Berry, 2003).
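The ΔI/I measure just described can be sketched for fully discrete distributions. The array layout (joint[d, r] = p(d, r)) and the function name are our assumptions; the numerator averages the Kullback-Leibler distance between exact and approximate posteriors over p(r), following the characterization given above.

```python
# Sketch of the ΔI/I measure: KL distance between the exact posterior
# p(d | r) and an approximating posterior q(d | r), averaged over p(r),
# normalized by the mutual information I(d; r) under the exact joint.
# All entries are assumed strictly positive to keep the logs finite.
import numpy as np

def delta_i_over_i(joint, q_posterior):
    """joint: exact p(d, r) as a (n_disparities, n_response_states) array.
    q_posterior: approximate p(d | r) with the same shape."""
    p_r = joint.sum(axis=0)                 # marginal p(r)
    p_d = joint.sum(axis=1)                 # marginal p(d)
    p_posterior = joint / p_r               # exact posterior p(d | r)
    # Numerator: sum_r p(r) * KL( p(d|r) || q(d|r) ), in bits.
    delta_i = np.sum(joint * np.log2(p_posterior / q_posterior))
    # Denominator: mutual information I(d; r) under the exact joint.
    info = np.sum(joint * np.log2(joint / np.outer(p_d, p_r)))
    return delta_i / info
```

When the approximating posterior equals the exact posterior, ΔI/I is zero; any mismatch makes the numerator strictly positive.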
Figure 3: The informational cost ΔI/I of using the dependence tree decoder (DTD; light bars) or the independent response decoder (IRD; dark bars) as a function of population size (n) and the number of discretized response levels (b). Conditions shown: (n=4, b=3), (n=4, b=8), (n=8, b=3), (n=8, b=8), and (n=16, b=3). Error bars represent the standard errors of the means.
by population size interaction. This trend is not surprising given that the number of possible high-order correlations grows rapidly with the number of neurons in a population. This result has important implications. Many studies that have attempted to measure the information lost by assuming independence among neural responses have approximated the exact joint distribution with a distribution that takes into account only second-order dependencies (e.g., Abbott & Dayan, 1999; Averbeck & Lee, 2003; Golledge et al., 2003; Nirenberg et al., 2001; Panzeri et al., 1999; Rolls et al., 2003; Seriès et al., 2004). Our results suggest that the difference in relevant information between an approximation based on the assumption that responses are independent and an approximation based on second-order correlations may greatly underestimate the information difference that investigators actually care about: the difference between an approximation based on statistical independence and the exact distribution. If so, this may account for why previous investigators concluded that most of the useful information is in the independent responses of individual neurons. A
second trend in our data is an increase in information cost as the number of discrete response levels increases. This trend is unsurprising as we would expect the differences between exact and inexact distributions to increase as the resolution of neuron responses increases. A three-way ANOVA (decoder by population size by response levels) confirmed that this trend is significant (F (1,47) = 9.49; p < 0.01), along with a main effect for decoder type (F (1,47) = 4.35; p < 0.05) and a decoder by response levels interaction (F (1,47) = 5.05; p < 0.05), which indicate that the effect is significantly greater and more pronounced, respectively, for the IRD than the DTD. In summary, the results of the first set of simulations suggest that the cost of ignoring or approximating statistical dependencies becomes greater with larger populations and also may tend to increase with more neural response levels. A limitation of the first set of simulations is that the excessive computational cost of calculating the exact distribution pfull (d | r) prevented us from examining large population sizes. Therefore, a second set of simulations was conducted in which we evaluated the IRD and DTD with large populations using a performance measure that compared the disparity predicted by a decoder with the true disparity present in a pair of left eye and right eye images. The disparity predicted by a decoder was the disparity with the highest conditional probability (i.e., the disparity that maximized p(d | r), known as the maximum a posteriori estimate). The distributions pind (d | r) and ptree (d | r) generated by the IRD and DTD, respectively, were computed on the basis of 150,000 visual scenes in which the frontoparallel surface was located at all possible depths (15 possible depths × 10,000 scenes per depth). However, the performances of the decoders were measured using a different set of scenes. 
This set consisted of 1400 scenes in which the surface was located at the central seven depths (possible disparities ranged from −3 to 3 pixels × 200 scenes per disparity). The simulation results are illustrated in Figure 4. The horizontal axis indicates the simulation condition (neural population size n and number of response levels b), and the vertical axis indicates the root mean squared (RMS) error of the disparity estimate. Dark bars give the RMS error value for the IRD, and light bars give the value for the DTD. The error bars indicate the standard errors of the means based on six repetitions of each condition. A three-way ANOVA showed significant main effects for population size (F (2,107) = 9.83; p < 0.0001), for decoder (F (1,107) = 343.55; p < 0.0001), and for the number of discretized response levels (F (2,107) = 12.71; p < 0.0001), along with significant effects ( p < 0.0001) for all two-way interactions. Three primary trends can be gleaned from these combined effects. First, performance for the DTD improved as the population size increased. This was also found for the IRD in the b = 5 condition. This trend is unsurprising, as we would expect the amount of information to increase with the size of a neural population. Second, the performance of the DTD became
Figure 4: Root mean squared (RMS) error (in pixels) for the DTD (light bars) and IRD (dark bars) as a function of population size (n) and number of discretized response levels (b). Conditions shown: n = 16, 32, and 64 crossed with b = 5, 10, and 20. RMS errors were calculated by comparing the maximum a posteriori estimates of disparity given by the decoders with the true disparities over 1400 test trials (or novel visual scenes). Error bars indicate the standard errors of the means.
significantly better than that of the IRD with increases in population size, suggesting that the proportion of information about disparity contained in high-order correlations increases with population size compared with the proportion stored in the independent responses of model neurons. Third, the performance of the IRD decreased as the number of discretized response levels increased. In contrast, the performance of the DTD showed the opposite trend—for example, its performance improved slightly from the b = 5 to b = 10 conditions. This trend may seem surprising given that the number of parameters estimated by the DTD grows quadratically with b while the number of parameters estimated by the IRD grows only linearly. However, the DTD is capable of representing much richer distributions than the IRD. Increasing the number of discretized response levels, like increasing the
number of neurons in a population, increases the possible complexity of correlations. To the extent that information about a stimulus is contained in the possibly high-order response correlations of a neural population, we should expect that any decoder that takes into account these correlations will perform better than the IRD, which, by definition, discards all information in these correlations. Similar to the results of the first set of simulations, the results of the second set of simulations suggest that much of the information about disparity is carried by statistical dependencies among model neuron responses. These results do not, however, indicate whether the information carried by response dependencies is limited to second-order dependencies or whether higher-order dependencies also need to be considered. To examine this issue, we evaluated the performance of a decoder that was limited to second-order statistics; it approximated the distribution of neural responses given a disparity, p(r | d), with a multivariate gaussian distribution whose mean vector and covariance matrix were estimated using a maximum likelihood procedure. The performance of this decoder is not plotted because the decoder consistently generated a prediction of disparity equal to 0 pixels (the frontoparallel surface is at the depth of the fixation point) regardless of the true disparity in the left eye and right eye images. A decoder that was forced to use a diagonal covariance matrix produced the same behavior. The poor performances of these decoders are not surprising given the fact that the marginal distributions of an individual neuron’s response given a disparity, p(r i | d), are highly nongaussian. The horizontal axis of the graph in Figure 5 represents a normalized response of a typical model neuron, and the vertical axis represents the probability that the neuron will give each response. 
The light bars indicate the probability when the disparity in a pair of images equals the preferred disparity of the neuron, and the dark bars indicate the probability when the image disparity is different from the neuron’s preferred disparity. In both cases, the probability distributions are highly nongaussian; the distributions peak near a response of zero (the neuron most frequently gives a small response) and have relatively long tails (especially the distribution for when the image and preferred disparities are equal). This finding is consistent with earlier results, such as those reported by Lippert and Wagner (2002; see Figure 3). A possible objection to the simulations discussed so far is that the simulations used a very large number of training stimuli. In contrast, neuroscientists use much smaller data sets, and there is no guarantee that the results that we have found will also be found when using fewer data items. To address this issue, we conducted new simulations with a relatively small data set (100 training samples for each disparity). Figure 6 shows the results for the IRD and the DTD when population sizes were set to 64 neurons, and the number of response levels was set to either 5, 10, or 20. Again, the DTD consistently outperformed the IRD, and the trends described above for the large training set appear to hold for the small training set too.
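A sketch of the second-order decoder described above, which models p(r | d) as a multivariate gaussian whose mean vector and covariance matrix are the maximum-likelihood estimates from the training responses. The small ridge term (to keep covariances invertible) and all names are our illustrative assumptions, not code from the paper.

```python
# Sketch of a second-order (multivariate gaussian) decoder.
import numpy as np

def fit_gaussian_decoder(responses, disparities, n_disp, ridge=1e-6):
    """For each disparity d, estimate the ML mean and covariance of
    p(r | d); a small ridge term keeps the covariance invertible."""
    models = []
    for d in range(n_disp):
        x = responses[disparities == d]
        mu = x.mean(axis=0)
        cov = np.cov(x, rowvar=False, bias=True) + ridge * np.eye(x.shape[1])
        # Store the precision matrix and log-determinant for fast scoring.
        models.append((mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1]))
    return models

def gaussian_map_estimate(r, models):
    """MAP disparity under a uniform prior: argmax_d log N(r; mu_d, cov_d).
    Constant terms that do not depend on d are dropped."""
    scores = []
    for mu, prec, logdet in models:
        diff = r - mu
        scores.append(-0.5 * (diff @ prec @ diff + logdet))
    return int(np.argmax(scores))
```

Because such a decoder summarizes each conditional with only a mean and a covariance, it cannot represent the skewed, heavy-tailed conditionals shown in Figure 5, which is consistent with its poor performance reported above.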
Figure 5: Sample response histograms for a typical model neuron (probability of response versus normalized response, from 0 to 1). The black bars indicate the probability of a normalized response to an image pair with the cell's preferred disparity, and the white bars indicate the probability of a response to an image pair with an arbitrarily selected nonpreferred disparity. Note that cell responses are highly nongaussian; the probability distributions are skewed with a peak at very low responses and tails at higher response values. In general, as the selected disparity deviates from the preferred disparity, the mass of the response distribution becomes increasingly concentrated at zero.
A second possible objection to the simulations discussed above is that they used white-noise stimuli; frontoparallel surfaces were covered with dots whose luminance values were independently sampled from a uniform distribution ranging from 0 (dark) to 1 (light). We chose these stimuli for several reasons. White noise stimuli have simple properties that make them amenable to mathematical analyses. Consequently, they have played an important role in engineering, neuroscientific, and behavioral studies. In addition, for our current purposes, we are interested in how binocular disparities can be evaluated in the absence of form information. Furthermore, white noise stimuli do not contain correlations across spatial frequency bands, and thus their use should not introduce biases into our evaluations of the role of high-order correlations when decoding populations of model neurons. Despite the motivations for the use of white noise stimuli, natural visual stimuli contain very different properties. Images of natural scenes usually contain a great deal of form information and contain energy in
Figure 6: RMS error of the maximum a posteriori disparity estimate provided by the DTD (light bars) and IRD (dark bars) as a function of the number of discretized response levels (b) in the small training sample case. Conditions shown: (n=64, b=5), (n=64, b=10), and (n=64, b=20). These data were generated using a fixed population size (n = 64), and using only 100 training samples per disparity rather than the 10,000 training samples per disparity used to generate the data in Figure 4.
a large range of spatial frequency bands. Because of dependencies in the energies across frequency bands, we expect that high-order correlations in model neuron responses to natural stimuli should be important during neural decoding, as was found when using white noise stimuli. To partially evaluate this prediction, we repeated some of the preceding simulations using more "naturalistic" stimuli.⁶
⁶ Ideally, we would have liked to conduct simulations using left eye and right eye images of natural scenes. Unfortunately, this was not possible for a variety of reasons. Perhaps most important, there are no available databases, to our knowledge, of large numbers of left eye and right eye images of natural scenes taken by well-calibrated camera systems that include ground truth information (e.g., true disparity or depth at each point in the scene).
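Such a 1/f "noise texture" can be sketched as follows, assuming the Fourier-filtering procedure described in this section: white noise is transformed, its amplitude at each spatial frequency f is scaled by 1/f, and the result is inverse transformed and renormalized to the range [0, 1]. The function name and the choice to leave the DC component unscaled are our assumptions.

```python
# Sketch: generating a 1/f "naturalistic" noise texture by Fourier filtering.
import numpy as np

def pink_noise_texture(size, rng):
    noise = rng.uniform(0.0, 1.0, size=(size, size))   # white-noise luminances
    spectrum = np.fft.fft2(noise)
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    f = np.hypot(fy, fx)                               # radial spatial frequency
    f[0, 0] = 1.0                                      # leave DC unscaled (assumption)
    filtered = np.real(np.fft.ifft2(spectrum / f))     # amplitude scaled by 1/f
    # Renormalize luminance values into the range [0, 1].
    return (filtered - filtered.min()) / (filtered.max() - filtered.min())

texture = pink_noise_texture(64, np.random.default_rng(0))
```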
Figure 7: RMS error of the maximum a posteriori disparity estimate provided by the DTD (white bars) and IRD (gray bars) as a function of the number of discretized response levels (b), along with the performance of a multivariate gaussian fitted to the training data (MVG; black bar) when the training and test surfaces were painted with 1/f noise rather than white noise. Conditions shown: (n=64, b=5), (n=64, b=10), and (n=64, b=20). These data were generated using a fixed population size (n = 64) and using 10,000 training samples per disparity.
In these new simulations, we exploited the fact that the amplitude spectra of natural images fall as approximately 1/f (Burton & Moorhead, 1987; Field, 1987, 1994; Tolhurst, Tadmor, & Tang, 1992). We generated left eye and right eye images in the manner described above for the white-noise stimuli, with the exception that each image was a "noise texture" with a 1/f amplitude spectrum; the luminance values of the dots on a surface were independently sampled from a uniform distribution and then passed through a 1/f filter (i.e., the luminance values were Fourier transformed, the amplitude at each frequency f was multiplied by 1/f, and the result was inverse Fourier transformed; in addition, the images resulting from this process were normalized so that their luminance values fell in the range from 0 to 1). The graph in Figure 7 shows the results for the IRD and the DTD based on a population of 64 neurons. As was the case with white noise stimuli, the DTD consistently outperformed the IRD, though the performance of both decoders was markedly worse with the 1/f-noise stimuli. These results are
consistent with our earlier conclusions that high-order correlations among model neuron responses contain significant information about binocular disparities. 6 Summary Investigators debate the extent to which neural populations use pairwise and higher-order statistical dependencies among neural responses to represent information about a visual stimulus. To study this issue, we used three statistical decoders to extract the information in the responses of model neurons about the binocular disparities present in simulated pairs of left eye and right eye images. The full joint probability decoder (FJPD) considered all possible statistical relations among neural responses as potentially important. The dependence tree decoder (DTD) also considered all possible relations as potentially important, but it approximated high-order statistical correlations using a computationally tractable procedure. Finally, the independent response decoder (IRD) assumed that neural responses are statistically independent, meaning that all correlations should be zero and thus can be ignored. Two sets of simulations were performed. The first set examined the informational cost of ignoring all correlations or of approximating high-order correlations by comparing the IRD and DTD with the FJPD. The second set compared the performances of the IRD and DTD on a binocular disparity estimation task when neural population size and number of response levels were varied. The results indicate that high-order correlations among model neuron responses contain significant information about disparity and that the amount of this high-order information increases rapidly as a function of neural population size. In addition, the DTD consistently outperformed the IRD (and also a decoder based on a multivariate gaussian distribution) on the disparity estimation task, and its performance advantage increased with neural population size and the number of neural response levels. 
These results raise the possibility that previous researchers who ignored pairwise or high-order statistical dependencies among neuron responses, or who examined the importance of statistical dependencies in a way that limited their evaluation to pairwise dependencies, may not have been justified in doing so. Moreover, the results highlight the potential importance of the dependence tree decoder to neuroscientists as a powerful but still practical way of approximating high-order correlations among neural responses. Finally, the strengths and limitations of this work highlight important areas for future research. For example, future investigations will need to make use of databases of natural images, such as databases with many pairs of right eye and left eye images of natural scenes taken by well-calibrated camera systems, along with ground-truth information about each scene (e.g., depth or disparity information at every point in a scene). Such a database for the study of binocular vision in natural scenes does not
currently exist. In addition, future computational work will need to use more detailed neural models, such as models of populations of neurons that communicate via action potentials and models of individual neurons that include ion kinetics. We expect that the results reported here will generalize to these more realistic situations, but further work is needed to test this prediction.

Acknowledgments

We thank A. Pouget for encouraging us to study the contributions to neural computation of high-order statistical dependencies among neuron responses and thank F. Klam and A. Pouget for many helpful discussions on this topic. This work was supported by NIH research grant RO1-EY13149.

References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Computation, 11, 91–101.
Andrews, T. J., Glennerster, A., & Parker, A. J. (2001). Stereoacuity thresholds in the presence of a reference surface. Vision Research, 41, 3051–3061.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1997). Neural mechanisms underlying binocular fusion and stereopsis: Position vs. phase. Proceedings of the National Academy of Sciences, 94, 5438–5443.
Averbeck, B. B., & Lee, D. (2003). Neural noise and movement-related codes in the macaque supplementary motor area. Journal of Neuroscience, 23, 7630–7641.
Averbeck, B. B., & Lee, D. (2004). Coding and transmission of information by neural ensembles. Trends in Neurosciences, 27, 225–230.
Brenner, N., Strong, S. P., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. R. (2000). Synergy in a neural code. Neural Computation, 12, 1531–1552.
Burton, G. J., & Moorhead, I. R. (1987). Color and spatial structure in natural scenes. Applied Optics, 26, 157–170.
Chow, C. K., & Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14, 462–467.
Cumming, B. G., & Parker, A. J. (1997). Responses of primary visual cortical neurons to binocular disparity without depth perception. Nature, 389, 280–283.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1991). Depth is encoded in the visual cortex by a specialized receptive field structure. Nature, 352, 156–159.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Fleet, D. J., Wagner, H., & Heeger, D. J. (1996). Neural encoding of binocular disparity: Energy models, position shifts, and phase shifts. Vision Research, 36, 1839–1857.
Freeman, R. D., & Ohzawa, I. (1990). On the neurophysiological organization of binocular vision. Vision Research, 30, 1661–1676.
Golledge, H. D. R., Panzeri, S., Zheng, F., Pola, G., Scannell, J. W., Giannikopoulos, D. V., Mason, R. J., Tovée, M. J., & Young, M. P. (2003). Correlations, feature-binding and population coding in primary visual cortex. NeuroReport, 14, 1045–1050.
Lippert, J., & Wagner, H. (2002). Visual depth encoding in populations of neurons with localized receptive fields. Biological Cybernetics, 87, 249–261.
Meilă, M., & Jordan, M. I. (2000). Learning with mixtures of trees. Journal of Machine Learning Research, 1, 1–48.
Neapolitan, R. E. (2004). Learning Bayesian networks. Upper Saddle River, NJ: Prentice Hall.
Nirenberg, S., Carcieri, S. M., Jacobs, A. L., & Latham, P. E. (2001). Retinal ganglion cells act largely as independent encoders. Nature, 411, 698–701.
Nirenberg, S., & Latham, P. E. (2003). Decoding neuronal spike trains: How important are correlations? Proceedings of the National Academy of Sciences USA, 100, 7348–7353.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249, 1037–1041.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1997). Encoding of binocular disparity by complex cells in the cat's visual cortex. Journal of Neurophysiology, 77, 2879–2909.
Oram, M. W., Földiák, P., Perrett, D. I., & Sengpiel, F. (1998). The "ideal homunculus": Decoding neural population signals. Trends in Neuroscience, 29, 259–265.
Panzeri, S., Golledge, H. D. R., Zheng, F., Pola, G., Blanche, T. J., Tovée, M. J., & Young, M. P. (2002). The role of correlated firing and synchrony in coding information about single and separate objects in cat V1. Neurocomputing, 44–46, 579–584.
Panzeri, S., Schultz, S. R., Treves, A., & Rolls, E. T. (1999). Correlations and the encoding of information in the nervous system. Proceedings of the Royal Society of London Series B, 266, 1001–1012.
Perez, R., Castro, A. F., Justo, M. S., Bermudez, M. A., & Gonzalez, F. (2005). Retinal correspondence of monocular receptive fields in disparity-sensitive complex cells from area V1 in the awake monkey. Investigative Ophthalmology and Visual Science, 46, 1533–1539.
Perkel, D. H., & Bullock, T. H. (1969). Neural coding. In F. O. Schmitt, T. Melnechuk, G. C. Quarton, & G. Adelman (Eds.), Neurosciences research symposium summaries (pp. 405–527). Cambridge, MA: MIT Press.
Pouget, A., Dayan, P., & Zemel, R. S. (2003). Computation and inference with population codes. Annual Review of Neuroscience, 26, 381–410.
Prince, S. J. D., Pointon, A. D., Cumming, B. G., & Parker, A. J. (2002). Quantitative analysis of the responses of V1 neurons to horizontal disparity in dynamic random-dot stereograms. Journal of Neurophysiology, 87, 191–208.
Qian, N. (1994). Computing stereo disparity and motion with known binocular cell properties. Neural Computation, 6, 390–404.
Qian, N., & Zhu, Y. (1997). Physiological computation of binocular disparity. Vision Research, 37, 1811–1827.
682
M. Michel and R. Jacobs
Read, J. C. A., & Cumming, B. G. (2003). Testing quantitative models of binoculary disparity selectivity in primary visual cortex. Journal of Neurophysiology, 90, 2795– 2817. Rolls, E. T., Franco, L., Aggelopoulos, N. C., & Reece, S. (2003). An information theoretic approach to the contributions of the firing rates and the correlations between the firing of neurons. Journal of Neurophysiology, 89, 2810–2822. Seri`es, P., Latham, P. E., & Pouget, A. (2004). Tuning curve sharpening for orientation selectivity: Coding efficiency and the impact of correlations. Nature Neuroscience, 10, 1129–1135. Schneidman, E., Bialek, W., & Berry, M. J. (2003). Synergy, redundancy, and independence in population codes. Journal of Neuroscience, 23, 11539–11553. Tolhurst D. J., Tadmor, Y., & Tang, C. (1992). The amplitude spectra of natural images. Ophthalmic and Physiological Optics, 12, 229–232. Tsai, J. J., & Victor, J. D. (2003). Reading a population code: A multi-scale neural model for representing binocular disparity. Vision Research, 43, 445–466.
Received February 4, 2005; accepted July 20, 2005.
LETTER
Communicated by Michael Cohen
Dynamical Behaviors of Delayed Neural Network Systems with Discontinuous Activation Functions

Wenlian Lu
[email protected], [email protected]

Tianping Chen
[email protected]

Laboratory of Nonlinear Mathematics Science, Institute of Mathematics, Fudan University, Shanghai 200433, People's Republic of China
In this letter, without assuming the boundedness of the activation functions, we discuss the dynamics of a class of delayed neural networks with discontinuous activation functions. A relaxed set of sufficient conditions is derived, guaranteeing the existence, uniqueness, and global stability of the equilibrium point. Convergence behaviors for both state and output are discussed. The constraints imposed on the feedback matrix are independent of the delay parameter and can be validated by the linear matrix inequality technique. We also prove that the solution of delayed neural networks with discontinuous activation functions can be regarded as a limit of the solutions of delayed neural networks with high-slope continuous activation functions.
1 Introduction

It is well known that recurrently connected neural networks (RCNNs), proposed by Cohen and Grossberg (1983) and Hopfield (1984; Hopfield & Tank, 1986), have been extensively studied in both theory and applications. They have been successfully applied in signal processing, pattern recognition, and associative memories, especially in static image treatment. Such applications rely heavily on the dynamical behaviors of the neural networks. Therefore, analysis of dynamical behaviors is a necessary step for the practical design of neural networks.

In hardware implementation, time delays inevitably occur due to the finite switching speed of the amplifiers and the communication time. What is more, to process moving images, one must introduce time delays in the signals transmitted among the cells (see Civalleri, Gilli, & Pandolfi, 1993). Neural networks with time delay have much more complicated dynamics owing to the incorporation of delays. The model with delay is described as follows:

Neural Computation 18, 683–708 (2006) © 2006 Massachusetts Institute of Technology
$$\frac{dx_i(t)}{dt} = -d_i x_i(t) + \sum_{j=1}^{n} a_{ij}\, g_j(x_j(t)) + \sum_{j=1}^{n} b_{ij}\, g_j(x_j(t-\tau)) + I_i, \qquad i = 1, 2, \ldots, n, \tag{1.1}$$
where n is the number of units in the network, x_i(t) is the state of the ith unit at time t, and g_j(x_j(t)) denotes the output of the jth unit at time t. a_ij denotes the strength of the jth unit on the ith unit at time t, and b_ij denotes the strength of the jth unit on the ith unit at time t − τ. I_i denotes the external input to the ith unit, τ corresponds to the transmission delay and is a nonnegative constant, and d_i > 0 represents the rate with which the ith unit resets its potential to the resting state in isolation when disconnected from the network and the external input I_i. System 1.1 can be rewritten as

$$\frac{dx(t)}{dt} = -Dx(t) + Ag(x(t)) + Bg(x(t-\tau)) + I, \tag{1.2}$$
where x = (x_1, x_2, . . . , x_n)^T, g(x) = (g_1(x_1), g_2(x_2), . . . , g_n(x_n))^T, I = (I_1, I_2, . . . , I_n)^T, T denotes the transpose, D = diag{d_1, d_2, . . . , d_n}, A = {a_ij} is the feedback matrix, and B = {b_ij} is the delayed feedback matrix.

Some useful results on the stability analysis of delayed neural networks (DNNs) have already been obtained; readers can refer to Chen (2001), Zeng, Wang, and Liao (2003), Lu, Rong, and Chen (2003), Joy (2000), and many others. In particular, Lu et al. (2003) and Joy (2000) provided some effective criteria based on LMIs. The discussion in these articles rests on the assumption that the activation functions are continuous and even Lipschitz continuous.

As Forti and Nistri (2003) pointed out, a brief review of some common neural network models reveals that neural networks with discontinuous activation functions are important and frequently arise in practice. For example, consider the classical Hopfield neural networks with graded response neurons (see Hopfield, 1984). The standard assumption is that the activations are used in the high-gain limit, where they closely approach a discontinuous comparator function. As shown by Hopfield (1984; Hopfield & Tank, 1986), the high-gain hypothesis is crucial both to make negligible the contribution to the neural network energy function of the term depending on the neuron self-inhibitions and to favor binary output formation, for example, via a hard comparator function sign(s). A conceptually analogous model based on hard comparators is the class of discrete-time neural networks discussed by Harrer, Nossek, and Stelzl (1992). Another important example concerns the class of neural networks introduced by Kennedy and Chua (1988) to solve linear and nonlinear
programming problems. Those networks exploit constrained neurons with diode-like input-output activations. Again, in order to guarantee satisfaction of the constraints, the diodes are required to possess a very high slope in the conducting region; that is, they should approximate the discontinuous characteristic of an ideal diode (see Chua, Desoer, & Kuh, 1987). And when dealing with dynamical systems possessing high-slope nonlinear elements, it is often advantageous to model them with a system of differential equations with a discontinuous right-hand side, rather than studying the case where the slope is high but of finite value (see Utkin, 1977).

Forti and Nistri (2003) applied the concepts and results on differential equations with a discontinuous right-hand side introduced by Filippov (1967) to investigate the global convergence of neural networks with discontinuous neuron activations. Furthermore, they discussed various types of convergence behavior, such as convergence in finite time and convergence in measure, and obtained useful sufficient conditions for global convergence. However, the discontinuous activations were assumed to be bounded. Lu and Chen (2005) studied the global stability of a more general model, Cohen-Grossberg neural networks, without assuming the discontinuous activation functions to be bounded. In both articles, however, the models do not involve time delays.

In this letter, we introduce a new concept of solution for delayed neural networks with discontinuous activation functions. Without assuming the boundedness or the continuity of the neuron activations, we present sufficient conditions, based on linear matrix inequalities, for the global stability of neural networks with time delay, and we discuss their convergence. Moreover, we explore the importance of the concept of solution presented in this letter.
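To make the baseline model concrete before the preliminaries: system 1.2 can be integrated numerically once an activation is fixed. The sketch below uses forward Euler with the smooth activation g(s) = tanh(s); the matrices, delay, and step size are illustrative choices, not values from the letter.

```python
import numpy as np

def simulate(D, A, B, I_ext, g, phi, tau, T, h=0.01):
    """Forward-Euler integration of dx/dt = -D x + A g(x(t)) + B g(x(t-tau)) + I.

    phi gives the initial history on [-tau, 0]; the delayed state is read
    from a buffer of past iterates, tau/h steps back.
    """
    n_delay = int(round(tau / h))
    hist = [np.asarray(phi(-tau + k * h), dtype=float) for k in range(n_delay + 1)]
    for _ in range(int(round(T / h))):
        x_now, x_del = hist[-1], hist[-1 - n_delay]
        hist.append(x_now + h * (-D @ x_now + A @ g(x_now) + B @ g(x_del) + I_ext))
    return np.array(hist)

# illustrative two-unit network with a smooth activation
D = np.diag([1.0, 1.0])
A = np.array([[-3.0, 0.0], [0.0, -3.0]])
B = 0.5 * np.eye(2)
I_ext = np.array([0.2, -0.1])
traj = simulate(D, A, B, I_ext, np.tanh, lambda t: np.array([1.0, -1.0]), tau=1.0, T=20.0)
```

With these (stable) illustrative matrices the trajectory settles toward a unique equilibrium; with a discontinuous g, the same scheme needs the Filippov machinery developed in the sections that follow.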
2 Preliminaries

In this section, we present some definitions used in this letter.

Definition 1 (see Preliminaries in Forti & Nistri, 2003). Suppose E ⊂ R^n. Then x → F(x) is called a set-valued map from E to R^n if to each point x of the set E there corresponds a nonempty set F(x) ⊂ R^n. A set-valued map F with nonempty values is said to be upper semicontinuous at x_0 ∈ E if for any open set N containing F(x_0), there exists a neighborhood M of x_0 such that F(M) ⊂ N. F(x) is said to have a closed (convex, compact) image if for each x ∈ E, F(x) is closed (convex, compact). Graph(F(E)) = {(x, y) | x ∈ E and y ∈ F(x)}, where E is a subset of R^n. More details about set-valued maps can be found in Aubin and Frankowska (1990).
Definition 2 (class Ḡ of functions). Let g(x) = (g_1(x_1), g_2(x_2), . . . , g_n(x_n))^T. We call g(x) ∈ Ḡ if, for all i = 1, 2, . . . , n, g_i(·) satisfies:

1. g_i(·) is nondecreasing and continuous, except on a countable set of isolated points {ρ_k^i}, where the right and left limits g_i^+(ρ_k^i) and g_i^-(ρ_k^i) satisfy g_i^+(ρ_k^i) > g_i^-(ρ_k^i). Moreover, in every compact set of R, g_i(·) has only finitely many points of discontinuity.

2. Order the points of discontinuity {ρ_k^i : i = 1, . . . , n; k = . . . , −2, −1, 0, 1, 2, . . .} so that ρ_{k+1}^i > ρ_k^i for every i, and suppose there exist constants G_{i,k} > 0, i = 1, . . . , n; k = . . . , −2, −1, 0, 1, 2, . . . , such that

$$0 \le \frac{g_i(\xi) - g_i(\zeta)}{\xi - \zeta} \le G_{i,k} \qquad \text{for all } \xi \ne \zeta,\ \xi, \zeta \in (\rho_k^i, \rho_{k+1}^i).$$
Remark 1. Forti and Nistri (2003) assumed that the discontinuous activations satisfy the first condition of definition 2. We impose, in addition, local Lipschitz continuity on each interval that does not contain any point of discontinuity. Furthermore, we do not assume that the activation functions are bounded, which is required by Forti and Nistri (2003). Note that g_i(·) is left undefined at the points where it is discontinuous. The class Ḡ includes a number of neuron activations of interest for applications, for example, the standard hard comparator function sign(·):

$$\mathrm{sign}(s) = \begin{cases} 1, & s > 0, \\ -1, & s < 0. \end{cases} \tag{2.1}$$
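At a discontinuity, the natural set-valued reading of sign(·) is the interval between its left and right limits, which is what the Filippov machinery below formalizes. A minimal sketch (representing the interval as a (lo, hi) pair is our own convention):

```python
def K_sign(s):
    """Set-valued reading of the hard comparator sign(.) of equation 2.1:
    away from 0 it is single-valued; at the discontinuity s = 0 it is the
    whole interval [-1, 1] between the left and right limits."""
    if s > 0:
        return (1.0, 1.0)
    if s < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)
```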
It is clear that if g(·) ∈ Ḡ, the right-hand side of equation 1.2 is discontinuous. Therefore, we have to explain the meaning of a solution of the Cauchy problem associated with equation 1.2 before further investigation. Filippov (1967) developed a concept of solution for differential equations with a discontinuous right-hand side, which was used by Forti and Nistri (2003) and Lu and Chen (2005) to investigate the stability of neural networks with discontinuous activation functions. In the following, we apply this framework to delayed neural networks with discontinuous activation functions.

Now we introduce the concept of a Filippov solution. Consider the system

$$\frac{dx}{dt} = f(x), \tag{2.2}$$

where f(·) is not continuous.
Definition 3. A set-valued map is defined as

$$\phi(x) = \bigcap_{\delta > 0}\ \bigcap_{\mu(N) = 0} K\big[f\big(B(x, \delta) \setminus N\big)\big], \tag{2.3}$$

where K(E) is the closure of the convex hull of the set E, B(x, δ) = {y : ‖y − x‖ ≤ δ}, and µ(N) is the Lebesgue measure of the set N. A solution of the Cauchy problem for equation 2.2 with initial condition x(0) = x_0 is an absolutely continuous function x(t), t ∈ [0, T], which satisfies x(0) = x_0 and the differential inclusion

$$\frac{dx}{dt} \in \phi(x), \qquad \text{a.e. } t \in [0, T]. \tag{2.4}$$
The concept of a solution in the sense of Filippov is useful in engineering applications because the Filippov solution is a limit of solutions of ordinary differential equations (ODEs) with continuous right-hand sides. Thus, we can model a nearly discontinuous system and expect that the Filippov trajectory of the discontinuous system will be close to the real trajectories. This approach is important in many applications, such as variable structure control and nonsmooth analysis (see Utkin, 1977; Aubin & Cellina, 1984; Paden & Sastry, 1987). Moreover, Haddad (1981), Aubin (1991), and Aubin and Cellina (1984) studied functional differential inclusions with memory of the form

$$\frac{dx}{dt}(t) \in F(t, A(t)x), \tag{2.5}$$

where F : R × C([−τ, 0], R^n) → R^n is a given set-valued map and

$$[A(t)x](\theta) = x_t(\theta) = x(t + \theta). \tag{2.6}$$

Now denote K[g(x)] = (K[g_1(x_1)], K[g_2(x_2)], . . . , K[g_n(x_n)])^T, where K[g_i(x_i)] = [g_i^-(x_i), g_i^+(x_i)]. We extend the concept of Filippov solution to the delayed differential equation 1.2 as follows:

$$\frac{dx}{dt}(t) \in -Dx(t) + A\,K[g(x(t))] + B\,K[g(x(t-\tau))] + I, \qquad \text{for almost all } t.$$

Equivalently,

$$\frac{dx}{dt}(t) = -Dx(t) + A\alpha(t) + B\beta(t-\tau) + I,$$

where α(t) ∈ K[g(x(t))] and β(t) ∈ K[g(x(t))].
In the sequel, we assume α(t) = β(t). This is reasonable: the output of the system should be consistent over time. Therefore, we consider the following delayed system,

$$\frac{dx}{dt}(t) = -Dx(t) + A\alpha(t) + B\alpha(t-\tau) + I, \qquad \text{for almost all } t, \tag{2.7}$$

where the output α(t) is measurable and α(t) ∈ K[g(x(t))] for almost all t.
Definition 4. A solution of the Cauchy problem for the delayed system 2.7 with initial condition φ(θ) ∈ C([−τ, 0], R^n) is an absolutely continuous function x(t) on t ∈ [0, T] such that x(θ) = φ(θ) for θ ∈ [−τ, 0] and

$$\frac{dx}{dt} = -Dx(t) + A\alpha(t) + B\alpha(t-\tau) + I, \qquad \text{a.e. } t \in [0, T], \tag{2.8}$$

where α(t) is measurable and, for almost all t ∈ [0, T], α(t) ∈ K[g(x(t))].

Remark 2. Concerning the solution of ODEs or functional differential equations (FDEs) with a discontinuous right-hand side, there are various definitions, such as Euler solutions and generalized sampling solutions. Among them, the Carathéodory solution set and the weak solution set are the most widely studied. Liz and Pouso (2002) gave some general results on the existence of solutions of first-order discontinuous FDEs subject to nonlinear boundary conditions. The Carathéodory solution set is defined as follows (here, we compare these solution sets in the case without time delay). Consider the ODE

$$\frac{dx(t)}{dt} = f(x(t)), \qquad t \in [0, T], \tag{2.9}$$
with initial condition x(0) = x_0. An absolutely continuous function ξ(t) is said to be a Carathéodory solution if

$$\frac{d\xi(t)}{dt} = f(\xi(t)) \quad \text{a.e. } t \in [0, T], \qquad \xi(0) = x_0. \tag{2.10}$$
A function ζ(t) ∈ L^1([0, T]) is said to be a weak solution in L^1([0, T]) if for each p(t) ∈ C_0^∞([0, T]) there holds

$$\int_0^T \frac{dp(t)}{dt}\, \zeta(t)\, dt = -\int_0^T f(\zeta(t))\, p(t)\, dt. \tag{2.11}$$
It is clear that the Carathéodory solution set is a subset of the Filippov solution set when the right-hand side is discontinuous. On the other hand, the weak solution set might contain discontinuous solutions; but if we focus on the absolutely continuous weak solutions, this solution set is equivalent to the Carathéodory solution set, and both are subsets of the Filippov solution set. Spraker and Biles (1996) pointed out that in the one-dimensional case, the Carathéodory solution set equals the Filippov solution set if and only if {x : 0 ∈ φ(x)} = {x : f(x) = 0}, where φ(·) is defined as in definition 3. Otherwise, the two solution sets are different. For example, consider the one-dimensional ODE

$$\frac{dx(t)}{dt} = -x - q(x), \tag{2.12}$$

where

$$q(x) = \begin{cases} 1, & x > 0, \\ -1, & x < 0, \\ \tfrac{1}{2}, & x = 0. \end{cases} \tag{2.13}$$
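Numerically, the trouble with system 2.12 is easy to see: started at the origin, a naive forward-Euler iteration can never settle, because every step is pushed across the discontinuity; it chatters in a band whose width shrinks with the step size, around the Filippov solution x(t) = 0 discussed below. A sketch with illustrative step sizes:

```python
def q(x):
    # the discontinuous term of equations 2.12 and 2.13
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.5

def euler(h, steps, x0=0.0):
    """Naive forward-Euler iterates of dx/dt = -x - q(x)."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x + h * (-x - q(x)))
    return xs

# width of the chattering band around the Filippov solution x(t) = 0
band = max(abs(x) for x in euler(h=1e-3, steps=10_000))
```

The band width is of order h: refining the step shrinks the oscillation but never produces a classical solution, which is exactly why a Filippov-type notion of solution is needed.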
The initial condition is x(0) = 0. It is easy to see that {x : 0 ∈ φ(x)} = {0} and {x : f(x) = 0} = ∅. It can be seen that system 2.12 has neither a Carathéodory solution nor a weak absolutely continuous solution. On the other hand, equation 2.12 has the Filippov solution x(t) ≡ 0.

Definition 5 (equilibrium). x* is said to be an equilibrium of system 2.8 if there exists α* ∈ K[g(x*)] such that 0 = −Dx* + Aα* + Bα* + I.

Definition 6. If x* is an equilibrium of system 2.8, x* is said to be globally asymptotically stable if for any solution x(t) of equation 2.8 whose existence interval is [0, +∞), we have

$$\lim_{t \to \infty} x(t) = x^*.$$
Moreover, x(t) is said to be globally exponentially asymptotically stable if there exist constants ε > 0 and M > 0 such that ‖x(t) − x*‖ ≤ Me^{−εt}.

The letter is organized as follows. In section 3, we discuss the existence of the equilibrium point and of the solution for system 2.8. The stability of the equilibrium point and the convergence of the output of the delayed neural networks are studied in section 4. Some numerical examples are presented in section 5. We conclude the letter in section 6.

3 Existence of an Equilibrium Point and Solution

In this section, we prove that under some conditions, system 2.8 has an equilibrium point and a solution on the infinite time interval [0, ∞).

3.1 Existence of an Equilibrium Point. First, we investigate the existence of an equilibrium point. For this purpose, consider the differential inclusion

$$\frac{dy}{dt} \in -Dy(t) + T\,K[g(y(t))] + I, \tag{3.1}$$

where y(t) = (y_1(t), y_2(t), . . . , y_n(t))^T and D, K[g(·)], and I are the same as in system 2.8. We need the following result.

Theorem A (Lu & Chen, 2005, theorem 2). Suppose g ∈ Ḡ. If there exists a positive definite diagonal matrix P such that −PT − T^T P is positive definite, then there exists an equilibrium point of system 3.1; that is, there exist y* ∈ R^n and α* ∈ K[g(y*)] such that

$$0 = -Dy^* + T\alpha^* + I. \tag{3.2}$$
By theorem A, we can prove theorem 1.

Theorem 1. If there exist a positive definite diagonal matrix P = diag{P_1, P_2, . . . , P_n} and a positive definite symmetric matrix Q such that

$$\begin{pmatrix} -PA - A^T P - Q & -PB \\ -B^T P & Q \end{pmatrix} > 0, \tag{3.3}$$

then there exists an equilibrium point of system 2.8.
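Condition 3.3 is an ordinary linear matrix inequality and can be validated numerically, as the letter notes. The sketch below checks it, together with the consequence −P(A + B) − (A + B)^T P > 0 derived in the proof that follows, on illustrative matrices (not data from the letter):

```python
import numpy as np

# illustrative data: P positive diagonal, Q symmetric positive definite
A = np.array([[-3.0, 0.2], [0.1, -3.0]])
B = np.array([[0.4, 0.0], [0.1, 0.5]])
P = np.diag([1.0, 1.0])
Q = np.eye(2)

# block matrix of condition 3.3 (symmetric because P is diagonal)
Z = np.block([[-P @ A - A.T @ P - Q, -P @ B],
              [(-P @ B).T,           Q     ]])
lmi_holds = np.linalg.eigvalsh(Z).min() > 0

# consequence 3.4 used in the proof: -P(A+B) - (A+B)^T P > 0
M = -P @ (A + B) - (A + B).T @ P
consequence_holds = np.linalg.eigvalsh(M).min() > 0
```

Checking positive definiteness via the smallest eigenvalue of the symmetric block matrix mirrors how LMI solvers certify such conditions.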
Proof. By the Schur complement theorem (see Boyd, El Ghaoui, Feron, & Balakrishnan, 1994), inequality 3.3 is equivalent to −(PA + A^T P) > PBQ^{-1}B^T P + Q. By the inequality [Q^{-1/2}B^T P − Q^{1/2}]^T [Q^{-1/2}B^T P − Q^{1/2}] ≥ 0, we have PBQ^{-1}B^T P + Q ≥ PB + B^T P. Then inequality 3.3 implies

$$-P(A + B) - (A + B)^T P > 0. \tag{3.4}$$

From theorem A, there exist an equilibrium point x* ∈ R^n and α* ∈ K[g(x*)] such that

$$0 = -Dx^* + (A + B)\alpha^* + I, \tag{3.5}$$

which implies that x* is an equilibrium point of system 2.8. Theorem 1 is proved.

Suppose that x* = (x_1^*, x_2^*, . . . , x_n^*)^T is an equilibrium point of system 2.8; that is, there exists α* = (α_1^*, α_2^*, . . . , α_n^*)^T ∈ K[g(x*)] such that equation 3.5 is satisfied. Let u(t) = x(t) − x* be a translation of x(t), and γ(t) = α(t) − α* a translation of α(t). Then u(t) = (u_1(t), u_2(t), . . . , u_n(t))^T satisfies

$$\frac{du(t)}{dt} = -Du(t) + A\gamma(t) + B\gamma(t-\tau), \qquad \text{for almost all } t, \tag{3.6}$$
where γ(t) ∈ K[ḡ(u(t))] and ḡ_i(s) = g_i(s + x_i^*) − α_i^*, i = 1, 2, . . . , n. To simplify the notation, we still write g_i(s) for ḡ_i(s). Therefore, instead of equation 2.8, we will investigate

$$\frac{du(t)}{dt} = -Du(t) + A\gamma(t) + B\gamma(t-\tau), \qquad \text{for almost all } t, \tag{3.7}$$

where γ(t) ∈ K[g(u(t))], g(·) ∈ Ḡ, and 0 ∈ K[g_i(0)] for all i = 1, 2, . . . , n.

It can be seen that the dynamical behavior of equation 2.8 is equivalent to that of equation 3.7. Namely, if there exists a solution u(t) of equation 3.7, then x(t) = u(t) + x* must be a solution of equation 2.8; moreover, if all trajectories of equation 3.7 converge to the origin, then the equilibrium x* must be globally stable for system 2.8, in the sense of definition 6. Therefore, instead of equation 2.8, we will investigate the dynamics of system 3.7.

3.2 Viability. In this section, we investigate the viability of system 3.7, that is, the existence of at least one solution of system 3.7 on [0, +∞), which is the prerequisite for studying global stability. First, we give the following lemma on matrix inequalities.
Lemma 1. If P = diag{P_1, P_2, . . . , P_n} with P_i > 0 and Q is a positive definite symmetric matrix such that

$$Z = \begin{pmatrix} -PA - A^T P - Q & -PB \\ -B^T P & Q \end{pmatrix} > 0, \tag{3.8}$$

then the following two statements hold:

1. There are a small positive constant ε < min_i d_i, a positive diagonal matrix P̂ = diag{P̂_1, P̂_2, . . . , P̂_n}, and a positive definite symmetric matrix Q̂ such that

$$Z_1 = \begin{pmatrix} -2D + \varepsilon I & A & B \\ A^T & \hat{P}A + A^T\hat{P} + \hat{Q}e^{\varepsilon\tau} & \hat{P}B \\ B^T & B^T\hat{P} & -\hat{Q} \end{pmatrix} \le 0. \tag{3.9}$$

2. There are a small constant η > 0, a diagonal matrix P̃ = diag{P̃_1, P̃_2, . . . , P̃_n} with P̃_i > 0, and a positive definite symmetric matrix Q̃ such that

$$Z_2 = \begin{pmatrix} -2D & A & B \\ A^T & \tilde{P}A + A^T\tilde{P} + \tilde{Q} + \eta I & \tilde{P}B \\ B^T & B^T\tilde{P} & -\tilde{Q} \end{pmatrix} \le 0. \tag{3.10}$$
Proof. Let P̂ = αP and Q̂ = αQ, where P and Q are as defined in inequality 3.8, and ε and α are constants determined later. Then, for any x, y, z ∈ R^n, we have

$$\begin{aligned}
[x^T, y^T, z^T]\, Z_1 \begin{pmatrix} x \\ y \\ z \end{pmatrix}
&= -2x^T Dx + \varepsilon x^T x + 2x^T Ay + 2x^T Bz + \alpha y^T (PA + A^T P)y + \alpha y^T Qy \\
&\quad + 2\alpha y^T PBz - \alpha z^T Qz + \alpha(e^{\varepsilon\tau} - 1)\, y^T Qy \\
&= -2x^T Dx + \varepsilon x^T x + 2x^T Ay + 2x^T Bz - \alpha [y^T, z^T]\, Z \begin{pmatrix} y \\ z \end{pmatrix} + \alpha(e^{\varepsilon\tau} - 1)\, y^T Qy \\
&\le -x^T Dx + 2x^T Ay - \alpha y^T \big[\lambda I - (e^{\varepsilon\tau} - 1)Q\big]y - x^T(D - \varepsilon I)x + 2x^T Bz - \alpha\lambda z^T z \\
&= -\big[D^{1/2}x - D^{-1/2}Ay\big]^T \big[D^{1/2}x - D^{-1/2}Ay\big] + y^T A^T D^{-1} Ay - \alpha y^T \big[\lambda I - (e^{\varepsilon\tau} - 1)Q\big]y \\
&\quad - \big[(D - \varepsilon I)^{1/2}x - (D - \varepsilon I)^{-1/2}Bz\big]^T \big[(D - \varepsilon I)^{1/2}x - (D - \varepsilon I)^{-1/2}Bz\big] \\
&\quad + z^T B^T (D - \varepsilon I)^{-1} Bz - \alpha\lambda z^T z \le 0,
\end{aligned} \tag{3.11}$$

where λ = λ_min(Z) > 0. Pick ε satisfying ε < min_i{d_i} and e^{ετ} < λ/‖Q‖_2 + 1, and α satisfying

$$\alpha > \max\left\{ \frac{\|A^T D^{-1} A\|_2}{\lambda_{\min}\{\lambda I - (e^{\varepsilon\tau} - 1)Q\}},\ \frac{\|B^T (D - \varepsilon I)^{-1} B\|_2}{\lambda} \right\},$$

where ‖X‖_2 = λ_max(X^T X), and λ_max(Z) and λ_min(Z) denote the maximum and minimum eigenvalues of the square matrix Z, respectively. Then

$$[x^T, y^T, z^T]\, Z_1 \begin{pmatrix} x \\ y \\ z \end{pmatrix} \le 0$$

holds for any x, y, z ∈ R^n, which implies Z_1 ≤ 0. In a similar way, we can prove inequality 3.10.

To prove the existence of the solution of system 3.7, we will construct a sequence of functional differential systems and prove that their solutions converge to a solution of system 3.7. Specifically, consider the following Cauchy problem:

$$\begin{cases}
\dfrac{dx}{dt}(t) \in -Dx(t) + A\gamma(t) + B\gamma(t-\tau), & \text{a.e. } t \in [0, T], \\
\text{a measurable function } \gamma(t) \in K[g(x(t))], & \text{for almost all } t \in [0, T], \\
x(\theta) = \phi(\theta), & \theta \in [-\tau, 0].
\end{cases} \tag{3.12}$$
Let C = C(R^n, R^n), and define a family of functions Φ = {f(x) = (f_1(x_1), f_2(x_2), . . . , f_n(x_n))^T ∈ C} satisfying:

1. Every f_i(·) is nondecreasing, i = 1, 2, . . . , n.

2. Every f_i(·) is uniformly locally bounded; that is, for any compact set Z ⊂ R^n, there exists a constant M > 0, independent of f, such that |f_i(x_i)| ≤ M for all x ∈ Z and i = 1, . . . , n.

3. Every f_i(·) is locally Lipschitz continuous; that is, for any compact set Z ⊂ R^n, there exists λ > 0 such that for any ξ, ζ ∈ Z and i = 1, 2, . . . , n, we have |f_i(ξ) − f_i(ζ)| ≤ λ|ξ − ζ|.

4. f_i(0) = 0 for all i = 1, 2, . . . , n.

As pointed out in Hale (1977), if f ∈ Φ, then the system

$$\begin{cases} \dfrac{du_f}{dt}(t) = -Du_f(t) + Af(u_f(t)) + Bf(u_f(t-\tau)), \\ u_f(\theta) = \phi(\theta), \quad \theta \in [-\tau, 0], \end{cases} \tag{3.13}$$

has a unique solution u_f(t) = (u_{f1}(t), . . . , u_{fn}(t))^T on [0, +∞). Moreover, we can prove the following result.

Theorem 2. If the matrix inequality 3.3 is satisfied, then for any φ ∈ C([−τ, 0], R^n), there exists M = M(φ) > 0 such that

$$\|u_f(t)\| \le Me^{-\frac{\varepsilon}{2}t} \qquad \text{for all } t > 0 \text{ and } f \in \Phi, \tag{3.14}$$
where ε > 0 is a constant.

Proof. Let

$$V_2(t) = e^{\varepsilon t} u_f^T u_f + 2\sum_{i=1}^n e^{\varepsilon t}\, \hat{P}_i \int_0^{u_{fi}} f_i(\rho)\, d\rho + \int_{t-\tau}^t f^T(u_f(s))\, \hat{Q}\, f(u_f(s))\, e^{\varepsilon(s+\tau)}\, ds,$$

where ε, P̂, and Q̂ are those of the first statement of lemma 1. Differentiating V_2(t),

$$\begin{aligned}
\frac{dV_2(t)}{dt} &= \varepsilon e^{\varepsilon t} u_f(t)^T u_f(t) + 2e^{\varepsilon t} u_f^T \big[-Du_f + Af(u_f(t)) + Bf(u_f(t-\tau))\big] \\
&\quad + 2e^{\varepsilon t} f^T(u_f(t))\, \hat{P}\, \big[-Du_f(t) + Af(u_f(t)) + Bf(u_f(t-\tau))\big] \\
&\quad + 2\varepsilon e^{\varepsilon t} \sum_{i=1}^n \hat{P}_i \int_0^{u_{fi}} f_i(\rho)\, d\rho - e^{\varepsilon t} f^T(u_f(t-\tau))\, \hat{Q}\, f(u_f(t-\tau)) \\
&\quad + e^{\varepsilon(t+\tau)} f^T(u_f(t))\, \hat{Q}\, f(u_f(t)).
\end{aligned}$$

Pick ε < min_i d_i. Then

$$\varepsilon \int_0^{u_{fi}} f_i(\rho)\, d\rho \le \varepsilon u_{fi}(t) f_i(u_{fi}(t)) \le d_i u_{fi}(t) f_i(u_{fi}(t)). \tag{3.15}$$
By matrix inequality 3.9, we have

$$\frac{dV_2(t)}{dt} \le e^{\varepsilon t}\, \big[u_f^T(t),\ f^T(u_f(t)),\ f^T(u_f(t-\tau))\big]\, Z_1 \begin{pmatrix} u_f(t) \\ f(u_f(t)) \\ f(u_f(t-\tau)) \end{pmatrix} \le 0.$$

Therefore, u_f(t)^T u_f(t)\, e^{εt} ≤ V_2(t) ≤ V_2(0) ≤ M, where M is a constant independent of f ∈ Φ. Let M_1 = √M. We have

$$\|u_f(t)\| \le M_1 e^{-\frac{\varepsilon}{2} t} \qquad \text{for all } t > 0 \text{ and } f \in \Phi. \tag{3.16}$$
We now construct a sequence of systems with high-slope continuous activation functions and prove that their solutions converge to a solution of system 3.7. Let {ρ_{k,i}} be the set of discontinuity points of g_i(·). Pick strictly decreasing sequences {δ_{k,i,m}}_m with lim_{m→∞} δ_{k,i,m} = 0 such that I_{k_1,i,m} ∩ I_{k_2,i,n} = ∅ holds for any k_1 ≠ k_2 and m, n ∈ N, where I_{k,i,m} = [ρ_{k,i} − δ_{k,i,m}, ρ_{k,i} + δ_{k,i,m}]. Define the functions g^m(x) = (g_1^m(x_1), . . . , g_n^m(x_n))^T, m = 1, 2, . . . , as follows:

$$g_i^m(s) = \begin{cases}
g_i(s), & s \notin \bigcup_k I_{k,i,m}, \\[4pt]
\dfrac{g_i(\rho_{k,i}+\delta_{k,i,m}) - g_i(\rho_{k,i}-\delta_{k,i,m})}{2\delta_{k,i,m}}\,\big[s - (\rho_{k,i}+\delta_{k,i,m})\big] + g_i(\rho_{k,i}+\delta_{k,i,m}), & 0 \ne \rho_{k,i} \text{ and } s \in I_{k,i,m}, \\[4pt]
\dfrac{g_i(\rho_{k,i}+\delta_{k,i,m})}{\delta_{k,i,m}}\,\big[s - \rho_{k,i}\big], & 0 = \rho_{k,i} \text{ and } s \in [\rho_{k,i}, \rho_{k,i}+\delta_{k,i,m}], \\[4pt]
-\dfrac{g_i(\rho_{k,i}-\delta_{k,i,m})}{\delta_{k,i,m}}\,\big[s - \rho_{k,i}\big], & 0 = \rho_{k,i} \text{ and } s \in [\rho_{k,i}-\delta_{k,i,m}, \rho_{k,i}].
\end{cases} \tag{3.17}$$

It can be seen that every g^m, m = 1, 2, . . . , satisfies:

- g^m(x) ∈ Φ.
- For any compact set Z ⊂ R^n, lim_{m→∞} d(Graph(g^m(Z)), Graph(K[g(Z)])) = 0, where d(A, B) = sup_{x∈A} inf_{y∈B} ‖x − y‖.
- For every continuity point s of g_i(·), there exists m_0 ∈ N such that g_i^m(s) = g_i(s) for all m ≥ m_0, i = 1, 2, . . . , n.
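For the hard comparator sign(·), whose only discontinuity sits at ρ = 0, construction 3.17 collapses to a linear ramp through the origin clipped to [−1, 1]. A sketch with the illustrative choice δ_m = 1/m:

```python
def g_m(s, m):
    """Smoothed comparator from construction 3.17 for g = sign with its single
    discontinuity at rho = 0, using the illustrative half-width delta_m = 1/m:
    a ramp of slope m through the origin, clipped to [-1, 1]."""
    delta = 1.0 / m
    return max(-1.0, min(1.0, s / delta))
```

Away from the discontinuity, g_m agrees with sign(·) once m is large enough, matching the listed properties of the sequence.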
Let u^m(t) = (u_1^m(t), . . . , u_n^m(t))^T be the solution of the system

$$\begin{cases} \dfrac{du^m(t)}{dt} = -Du^m(t) + Ag^m(u^m(t)) + Bg^m(u^m(t-\tau)), \\ u^m(\theta) = \phi(\theta), \quad \theta \in [-\tau, 0]. \end{cases} \tag{3.18}$$
Next, we prove that system 3.7 (or, equivalently, system 2.8) has at least one solution.

Theorem 3 (viability theorem). If the matrix inequality 3.3 holds and g ∈ Ḡ, then system 3.7 has a solution u(t) = (u_1(t), . . . , u_n(t))^T for t ∈ [0, ∞).

Proof. By theorem 2, all the solutions {u^m(t)} of system 3.18 are uniformly bounded, which implies that {du^m(t)/dt} are also uniformly bounded. By the Arzelà-Ascoli lemma and the diagonal selection principle, we can select a subsequence of {u^m(t)} (still denoted by {u^m(t)}) such that u^m(t) converges uniformly to a continuous function u(t) on any compact subset of R. Because the derivatives of {u^m(t)} are uniformly bounded, for any fixed T > 0, u(t) is Lipschitz continuous on [0, T]. Therefore, du(t)/dt exists for almost all t and is bounded on [0, T]. For each p(t) ∈ C_0^∞([0, T], R^n) (noting that C_0^∞([0, T], R^n) is dense in the Banach space L^1([0, T], R^n)),
$$\int_0^T \left( \frac{du^m(t)}{dt} - \frac{du(t)}{dt} \right) p(t)\, dt = -\int_0^T \frac{dp(t)}{dt}\, \big(u^m(t) - u(t)\big)\, dt$$

holds, and {du^m(t)/dt} is uniformly bounded. Therefore, du^m(t)/dt converges weakly to du/dt in L^∞([0, T], R^n) ⊂ L^1([0, T], R^n). By Mazur's convexity theorem (see Yosida, 1978), we can find constants α_l^m ≥ 0 with Σ_{l=m}^∞ α_l^m = 1, only finitely many of them nonzero for each m, such that y^m(t) = Σ_{l=m}^∞ α_l^m u^l(t) satisfies y^m(θ) = φ(θ) for θ ∈ [−τ, 0], and

$$\lim_{m\to\infty} y^m(t) = u(t) \quad \text{uniformly on } [0, T], \tag{3.19}$$

$$\lim_{m\to\infty} \frac{dy^m(t)}{dt} = \frac{du(t)}{dt} \quad \text{for almost all } t \in [0, T]. \tag{3.20}$$

Let γ^m(s) = Σ_{l=m}^∞ α_l^m g^l(u^l(s)). Then

$$\frac{dy^m(t)}{dt} = -Dy^m(t) + A\gamma^m(t) + B\gamma^m(t-\tau). \tag{3.21}$$
Finally, we prove that there exists a measurable function γ(t) that is the limit of a subsequence of γ^m(t) and satisfies

$$\frac{du(t)}{dt} = -Du(t) + A\gamma(t) + B\gamma(t-\tau), \qquad \text{for almost all } t \in [0, T].$$
First, consider the time interval t ∈ [0, τ]. For s ∈ [−τ, 0], g^l(u^l(s)) = g^l(φ(s)). By the boundedness of φ(·), the uniform boundedness of g^m(·) on the image of φ, and the finiteness of the set of discontinuity points of g(·) on the image of φ, we can find a subsequence of g^m (still denoted by g^m) and a measurable function γ(s) on [−τ, 0] such that g^m(φ(s)) converges to γ(s) for all s ∈ [−τ, 0]. Therefore, lim_{m→∞} γ^m(s) = γ(s) for all s ∈ [−τ, 0]. For t ∈ [0, τ], we have t − τ ∈ [−τ, 0], and by equation 3.21 we can find a measurable function on [0, τ], still denoted by γ(t), such that

$$\gamma(t) = \lim_{m\to\infty} \gamma^m(t) = A^{-1} \lim_{m\to\infty} \left[ \frac{dy^m}{dt} + Dy^m(t) - B\gamma^m(t-\tau) \right] = A^{-1} \left[ \frac{du}{dt} + Du(t) - B\gamma(t-\tau) \right] \qquad \text{for almost all } t \in [0, \tau].$$

Similarly, we can construct a measurable function γ(t), t ∈ [0, T], such that

$$\frac{du}{dt} = -Du(t) + A\gamma(t) + B\gamma(t-\tau) \qquad \text{for almost all } t \in [0, T]. \tag{3.22}$$
We then prove that γ(t) ∈ K[g(u(t))]. Both y^m(t) and u^m(t) converge to u(t) uniformly, and K[g(·)] is an upper semicontinuous set-valued map. Therefore, for any ε > 0, there exists N > 0 such that for all m > N and t ∈ [0, T], we have g^m(u^m(t)) ∈ O(K[g(u(t))], ε), where O(K[g(u(t))], ε) = {x ∈ R^n : d(x, K[g(u(t))]) < ε}. Because K[g(·)] is convex and compact, γ^m(t) ∈ O(K[g(u(t))], ε) for t ∈ [0, T]. Letting m → ∞, we have γ(t) ∈ O(K[g(u(t))], ε) for t ∈ [0, T]. By the arbitrariness of ε, we conclude that

$$\gamma(t) \in K[g(u(t))], \qquad t \in [0, T]. \tag{3.23}$$

Because T is arbitrary, the solution u(t) can be extended to the infinite time interval [0, +∞). Theorem 3 is proved.

4 Global Asymptotic Stability

In this section, we study the global stability of system 3.7.
Theorem 4 (global exponential asymptotic stability). If the matrix inequality 3.3 holds and g(·) ∈ Ḡ, then for any solution u(t) on [0, ∞) of system 3.7, there exists M = M(φ) > 0 such that

$$\|u(t)\| \le Me^{-\frac{\varepsilon}{2}t} \qquad \text{for all } t > 0,$$

where ε is given by matrix inequality 3.9. Equivalently, for any solution x(t) on [0, ∞) of system 2.8, we have

$$\|x(t) - x^*\| \le Me^{-\frac{\varepsilon}{2}t} \qquad \text{for all } t > 0.$$
Proof. Let

$$V_3(t) = e^{\varepsilon t} u^T(t)u(t) + 2\sum_{i=1}^n e^{\varepsilon t}\, \hat{P}_i \int_0^{u_i(t)} g_i(\rho)\, d\rho + \int_{t-\tau}^t \gamma(s)^T \hat{Q}\, \gamma(s)\, e^{\varepsilon(s+\tau)}\, ds.$$

Notice that for p_i(s) = ∫_0^s g_i(ρ) dρ, we have ∂_c p_i(s) = {v ∈ R : g_i^-(s) ≤ v ≤ g_i^+(s)}.
Differentiating V_3(t) by the chain rule (for details, see Bacciotti, Conti, & Marcellini, 2000; Clarke, 1983; or Lu & Chen, 2005), we have

$$\begin{aligned}
\frac{dV_3(t)}{dt} &= \varepsilon e^{\varepsilon t} u(t)^T u(t) + 2e^{\varepsilon t} u^T \big[-Du + A\gamma(t) + B\gamma(t-\tau)\big] \\
&\quad + 2e^{\varepsilon t} \gamma(t)^T \hat{P}\, \big[-Du(t) + A\gamma(t) + B\gamma(t-\tau)\big] \\
&\quad + 2\varepsilon e^{\varepsilon t} \sum_{i=1}^n \hat{P}_i \int_0^{u_i} g_i(\rho)\, d\rho - e^{\varepsilon t} \gamma^T(t-\tau)\, \hat{Q}\, \gamma(t-\tau) + e^{\varepsilon(t+\tau)} \gamma^T(t)\, \hat{Q}\, \gamma(t).
\end{aligned}$$

Because ε < min_i d_i, we have

$$\varepsilon \int_0^{u_i} g_i(\rho)\, d\rho \le \varepsilon u_i(t)\gamma_i(t) \le d_i u_i(t)\gamma_i(t),$$

and

$$\frac{dV_3(t)}{dt} \le e^{\varepsilon t}\, \big[u^T(t),\ \gamma^T(t),\ \gamma^T(t-\tau)\big]\, Z_1 \begin{pmatrix} u(t) \\ \gamma(t) \\ \gamma(t-\tau) \end{pmatrix} \le 0. \tag{4.1}$$
Then, u(t)T u(t) ≤ V3 (0)e −εt and
ε
V3 (0)e − 2 t ε x(t) − x 2 ≤ V3 (0)e − 2 t u(t)2 ≤
hold. Remark 3. From theorem 4, the uniqueness of the equilibrium is also obtained. Corollary 1. If condition 3.3 holds and gi (·) is locally Lipschitz continuous, then there exist ε > 0 and x ∗ ∈ Rn such that for any solution x(t) on [0, ∞) of system 1.1, there exists M = M(φ) > 0 such that ε
x(t) − x ∗ ≤ Me − 2 t
f or all t > 0
where ε is given by the matrix inequality 3.9. If every xi∗ is a continuous point of the activation functions gi (·), i = 1, . . . , n. For the outputs, we have limt→∞ gi (xi (t)) = gi (xi∗ ). Instead, if for some i, xi∗ is a discontinuous point of the activation function gi (·), we can prove the outputs converge in measure (also see Forti & Nistri, 2003). Theorem 5 (convergence of output). If the matrix inequality 3.3 holds and g(·) ∈ ¯ then the output α(t) of system 2.7 converges to α in measure, that is, µ − G, limt→∞ α(t) = α Proof. Define V5 (t) = uT (t)u(t) + 2
n i=1
P˜ i
ui
gi (ρ) dρ +
0
t
˜ (s) ds, γ (s)T Qγ
t−τ
˜ Q, ˜ and are those in the matrix inequality 3.10 of lemma 1. where P, Differentiate V5 (t): d V5 (t) ˜ = 2uT (t)[−Du(t) + Aγ (t) + Bγ (t − τ )] + 2γ T (t) P[−Du(t) dt ˜ (t) − γ T (t − τ ) Qγ ˜ (t − τ ) + Aγ (t) + Bγ (t − τ )] + γ T (t) Qγ + γ (t)T γ (t) − γ (t)T γ (t)
700
W. Lu and T. Chen
$$= [u^T(t),\, \gamma^T(t),\, \gamma^T(t-\tau)]\; Z_2 \begin{bmatrix} u(t) \\ \gamma(t) \\ \gamma(t-\tau) \end{bmatrix} - \gamma^T(t) \gamma(t) \le -\gamma^T(t) \gamma(t). \tag{4.2}$$
Then
$$V_5(t) - V_5(0) \le -\int_0^{t} \gamma^T(s) \gamma(s)\, ds.$$
Since $V_5(t) \ge 0$ for all $t$,
$$\int_0^{\infty} \gamma^T(s) \gamma(s)\, ds \le V_5(0).$$
For any $\varepsilon_1 > 0$, let $E_{\varepsilon_1} = \{\, t \in [0, \infty) : \|\gamma(t)\| > \varepsilon_1 \,\}$. Then
$$V_5(0) \ge \int_0^{\infty} \gamma^T(s) \gamma(s)\, ds \ge \int_{E_{\varepsilon_1}} \gamma^T(s) \gamma(s)\, ds \ge \varepsilon_1^2\, \mu(E_{\varepsilon_1}).$$
Therefore, $\mu(E_{\varepsilon_1}) < \infty$. From proposition 2 in Forti and Nistri (2003), one can see that $\gamma(t)$ converges to zero in measure, that is, $\mu\text{-}\lim_{t\to\infty} \gamma(t) = 0$. Therefore, $\mu\text{-}\lim_{t\to\infty} \alpha(t) = \alpha^*$.

Remark 4. From the proof of theorem 5, one can see that the equilibrium of the output, $\alpha^*$, is also unique.

Similar to the concept of the Filippov solution for a system of ordinary differential equations with a discontinuous right-hand side, we propose a concept of solution for the delayed system 2.8. Suppose $g(\cdot) \in \bar{\mathcal{G}}$ is bounded. The solution of system 2.8 can be regarded as an approximation of the solutions of delayed neural networks with high-slope gain functions. From the proof of viability, one can see that any limit of the solutions of delayed neural networks with high-slope activation functions, which converge to the discontinuous activations $\alpha(t)$, is a solution of system 2.8. More precisely, we give the following result:

Theorem 6. Suppose $g(\cdot) \in \bar{\mathcal{G}}$ is bounded, and the function sequence $\{ g^m(x) = (g_1^m(x_1), g_2^m(x_2), \ldots, g_n^m(x_n))^T : m = 1, 2, \ldots \}$ satisfies:

1. $\{ g_i^m(\cdot) \}$ is nondecreasing for all $i = 1, 2, \ldots, n$.
2. $\{ g_i^m(\cdot) \}$ is locally Lipschitz for all $i = 1, 2, \ldots, n$.
3. For any compact set $Z \subset \mathbb{R}^n$,
$$\lim_{m\to\infty} d\left(\mathrm{Graph}(g^m(Z)),\, \mathrm{Graph}(K[g(Z)])\right) = 0.$$
Let $x_m(t)$ be the solution of the following system:
$$\frac{d x_m}{dt} = -D x_m(t) + A g^m(x_m(t)) + B g^m(x_m(t-\tau)) + I, \qquad x_m(\theta) = \phi(\theta), \quad \theta \in [-\tau, 0]. \tag{4.3}$$
Then there exists a subsequence $m_k$ such that for any $T > 0$, $x_{m_k}(t)$ converges uniformly to an absolutely continuous function $x(t)$ on $[0, T]$, which is a solution of system 2.8 on $[0, \infty)$. Moreover, any subsequence $m_k$ that converges uniformly to an absolutely continuous function $x(t)$ on every $[0, T]$ must converge to a solution of system 2.8 on $[0, \infty)$; and if the solution of system 2.8 is unique, then the sequence $x_m(t)$ itself converges uniformly to $x(t)$ on every $[0, T]$.

The proof is similar to that of theorem 3. Details are omitted.

Remark 5. There are close relationships and essential differences between this article and that of Forti and Nistri (2003). The dynamical behaviors of neural networks with discontinuous activations were first investigated in Forti and Nistri (2003). However, in that article, the authors assumed that the discontinuous neuron activations are bounded; we do not make that assumption here. Thus, the discussion of the existence of the equilibrium and of the solution of system 2.7 is much more difficult. Furthermore, the model in this article is a delayed neural network, and we introduce a new concept of solution for delayed neural networks with discontinuous activation functions. Obviously, if the delayed feedback matrix $B = 0$, the global stability conclusion that Forti and Nistri (2003) proposed can be regarded as a corollary.

5 Numerical Examples

In this section, we present several numerical examples to verify the theorems given in the previous sections. Consider a two-dimensional neural network with time delay,
$$\frac{d x(t)}{dt} = -D x(t) + A g(x(t)) + B g(x(t-\tau)) + I, \tag{5.1}$$
where x(t) = (x1 (t), x2 (t))T ∈ R2 denotes the state and α(t) = (α1 (t), α2 (t))T denotes the output satisfying α1 (t) ∈ K [g1 (x1 (t))], α2 (t) ∈ K [g2 (x2 (t))].
In examples 1, 2, and 3, we assume
$$D = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad A = \begin{pmatrix} -\frac{1}{4} & 2 \\ -10 & -\frac{1}{4} \end{pmatrix}, \qquad B = \begin{pmatrix} \frac{1}{5} & \frac{1}{10} \\ -\frac{1}{7} & \frac{1}{10} \end{pmatrix},$$
and $g_1(s) = g_2(s) = g(s) = s + \mathrm{sign}(s)$, $\tau = 1$. By the Matlab LMI Control Toolbox, we obtain
$$P = \begin{pmatrix} 5.8083 & 0 \\ 0 & 1.1796 \end{pmatrix}, \qquad Q = \begin{pmatrix} 1.1515 & 0.2456 \\ 0.2456 & 0.4056 \end{pmatrix},$$
such that
$$Z = \begin{pmatrix} -PA - A^T P & -PB \\ -B^T P & Q \end{pmatrix} > 0.$$
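These numbers are easy to re-verify. The following NumPy sketch (ours, not part of the letter) simply assembles $Z$ from the matrices above and tests positive definiteness through its eigenvalues:

```python
import numpy as np

# Data from the example (g(s) = s + sign(s), tau = 1).
A = np.array([[-0.25, 2.0], [-10.0, -0.25]])
B = np.array([[0.2, 0.1], [-1.0 / 7.0, 0.1]])
P = np.diag([5.8083, 1.1796])
Q = np.array([[1.1515, 0.2456], [0.2456, 0.4056]])

# Assemble the symmetric block matrix Z and test Z > 0.
Z = np.block([[-P @ A - A.T @ P, -P @ B],
              [-B.T @ P, Q]])
print(np.linalg.eigvalsh(Z).min() > 0)  # True: Z is positive definite
```

A Schur-complement check ($Q > 0$ and $-PA - A^T P - PB\,Q^{-1}B^T P > 0$) gives the same conclusion.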
By theorem 4, system 5.1 is globally exponentially stable for any input $I \in \mathbb{R}^2$.

Figure 1: Dynamical behaviors of the solution of the neural networks 5.2 (trajectories of $x_1(t)$ and $x_2(t)$ over $t \in [0, 20]$).
Figure 2: Dynamical behaviors of the outputs of the neural networks 5.2 (trajectories of $\alpha_1(t)$ and $\alpha_2(t)$ over $t \in [0, 20]$).
5.1 Example 1. Consider the following system:
$$\begin{aligned} \dot{x}_1 ={}& -x_1(t) - \tfrac{1}{4}\left[x_1(t) + \mathrm{sign}(x_1(t))\right] + 2\left[x_2(t) + \mathrm{sign}(x_2(t))\right] \\ &+ \tfrac{1}{5}\left[x_1(t-1) + \mathrm{sign}(x_1(t-1))\right] + \tfrac{1}{10}\left[x_2(t-1) + \mathrm{sign}(x_2(t-1))\right] + 6, \\ \dot{x}_2 ={}& -x_2(t) - 10\left[x_1(t) + \mathrm{sign}(x_1(t))\right] - \tfrac{1}{4}\left[x_2(t) + \mathrm{sign}(x_2(t))\right] \\ &- \tfrac{1}{7}\left[x_1(t-1) + \mathrm{sign}(x_1(t-1))\right] + \tfrac{1}{10}\left[x_2(t-1) + \mathrm{sign}(x_2(t-1))\right] + 10. \end{aligned} \tag{5.2}$$

Let $\phi_1(\theta) = -e^{10\theta}$, $\phi_2(\theta) = \sin(20\theta)$, $\theta \in [-1, 0]$, be the initial condition. By the previous analysis, system 5.2 is globally exponentially stable. The trajectories of $x_1(t)$ and $x_2(t)$ are shown in Figure 1, and the trajectories of the outputs, $\alpha_1(t)$ and $\alpha_2(t)$, are shown in Figure 2. The equilibrium of system 5.2 is
Figure 3: Dynamical behaviors of the solution of the neural networks 5.3 (trajectories of $x_1(t)$ and $x_2(t)$ over $t \in [0, 12]$).
$(0.1974, -1.7346)^T$. It is clear that $0.1974$ and $-1.7346$ are continuous points of the function $\rho + \mathrm{sign}(\rho)$. Thus, $\lim_{t\to\infty} \alpha(t) = (1.1974, -2.7346)^T$.
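The convergence just described can be reproduced with a naive forward-Euler integration of system 5.2 (a sketch of ours, not the authors' code; the step size and the horizon are arbitrary choices):

```python
import numpy as np

def g(x):                       # discontinuous activation g(s) = s + sign(s)
    return x + np.sign(x)

h, tau, T = 0.001, 1.0, 20.0    # step size, delay, horizon (our choices)
d = int(tau / h)                # number of steps spanning one delay
steps = int(T / h)

A = np.array([[-0.25, 2.0], [-10.0, -0.25]])
B = np.array([[0.2, 0.1], [-1.0 / 7.0, 0.1]])
I_ext = np.array([6.0, 10.0])

# History on [-tau, 0]: phi1(theta) = -exp(10*theta), phi2(theta) = sin(20*theta).
x = np.empty((d + 1 + steps, 2))
theta = np.linspace(-tau, 0.0, d + 1)
x[: d + 1, 0] = -np.exp(10.0 * theta)
x[: d + 1, 1] = np.sin(20.0 * theta)

for k in range(d, len(x) - 1):  # forward Euler with a delay buffer
    x[k + 1] = x[k] + h * (-x[k] + A @ g(x[k]) + B @ g(x[k - d]) + I_ext)

print(np.round(x[-1], 3))       # close to the equilibrium (0.1974, -1.7346)
```

The delay term is handled by indexing the trajectory buffer one delay length back, which is the standard trick for constant-delay equations.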
5.2 Example 2. In this example, we change the inputs and consider the following system,
$$\begin{aligned} \dot{x}_1(t) ={}& -x_1(t) - \tfrac{1}{4}\left[x_1(t) + \mathrm{sign}(x_1(t))\right] + 2\left[x_2(t) + \mathrm{sign}(x_2(t))\right] \\ &+ \tfrac{1}{5}\left[x_1(t-1) + \mathrm{sign}(x_1(t-1))\right] + \tfrac{1}{10}\left[x_2(t-1) + \mathrm{sign}(x_2(t-1))\right] + \tfrac{43}{20}, \\ \dot{x}_2(t) ={}& -x_2(t) - 10\left[x_1(t) + \mathrm{sign}(x_1(t))\right] - \tfrac{1}{4}\left[x_2(t) + \mathrm{sign}(x_2(t))\right] \\ &- \tfrac{1}{7}\left[x_1(t-1) + \mathrm{sign}(x_1(t-1))\right] + \tfrac{1}{10}\left[x_2(t-1) + \mathrm{sign}(x_2(t-1))\right] + \tfrac{1399}{140}, \end{aligned} \tag{5.3}$$
Figure 4: Dynamical behaviors of the outputs of the neural networks 5.3 (trajectories of $\alpha_1(t)$ and $\alpha_2(t)$ over $t \in [0, 12]$).
with the same initial condition as in example 1. The equilibrium of system 5.3 is $(0, 0)^T$, and the equilibrium of the outputs is $(1, -1)^T$. It can be seen that $0$ is a discontinuous point of the activation function $g(\rho) = \rho + \mathrm{sign}(\rho)$ and $K[g(0)] = [-1, 1]$. In this case, the solution trajectories $(x_1(t), x_2(t))^T$ converge to the equilibrium, as indicated by Figure 3. The outputs $(\alpha_1(t), \alpha_2(t))^T$ cannot converge in norm, but by theorem 5, they converge in measure, as indicated by Figure 4.

5.3 Example 3. We use this example to verify the validity of theorem 6. Consider the following system,
$$\begin{aligned} \dot{x}_1 &= -x_1(t) + \mathrm{sign}(x_1(t)) + \mathrm{sign}(x_2(t)) + \mathrm{sign}(x_1(t-1)) + \mathrm{sign}(x_2(t-1)), \\ \dot{x}_2 &= -x_2(t) + \mathrm{sign}(x_1(t)) + \mathrm{sign}(x_2(t)) + \mathrm{sign}(x_1(t-1)) + \mathrm{sign}(x_2(t-1)), \end{aligned} \tag{5.4}$$
and construct a sequence of systems as follows,
$$\dot{x}_{m,1} = -x_{m,1}(t) + \tanh(m x_{m,1}(t)) + \tanh(m x_{m,2}(t)) + \tanh(m x_{m,1}(t-1)) + \tanh(m x_{m,2}(t-1)),$$
Figure 5: Variation of $\mathrm{error}_m$ with respect to $m$ ($m$ from 0 to 500; $\mathrm{error}_m$ decreasing from roughly 0.14 toward 0).
$$\dot{x}_{m,2} = -x_{m,2}(t) + \tanh(m x_{m,1}(t)) + \tanh(m x_{m,2}(t)) + \tanh(m x_{m,1}(t-1)) + \tanh(m x_{m,2}(t-1)), \tag{5.5}$$
with the same initial condition $\phi_1(s) = -4$ and $\phi_2(s) = 10$ for $s \in [-1, 0]$. Pick 20 sample points $1 = t_1 < t_2 < \cdots < t_{20} = 3$, and define
$$\mathrm{error}_m = \max_k \| x(t_k) - x_m(t_k) \|.$$
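A rough version of this experiment can be scripted as follows (our sketch, not the authors' code; we use forward Euler, a coarser grid of m values than the figure, and the maximum absolute deviation at the sample points as the error):

```python
import numpy as np

h, tau = 0.001, 1.0
d, steps = int(tau / h), int(3.0 / h)      # integrate on [0, 3]

def simulate(act):
    """Euler integration of x' = -x + act(x1) + act(x2) + act(x1(t-1)) + act(x2(t-1))."""
    x = np.empty((d + 1 + steps, 2))
    x[: d + 1] = [-4.0, 10.0]              # constant history phi = (-4, 10) on [-1, 0]
    for k in range(d, len(x) - 1):
        s = act(x[k]).sum() + act(x[k - d]).sum()
        x[k + 1] = x[k] + h * (-x[k] + s)
    return x

ref = simulate(np.sign)                    # discontinuous system 5.4
idx = d + (np.linspace(1.0, 3.0, 20) / h).astype(int)   # 20 sample points in [1, 3]
for m in (5, 20, 100):
    xm = simulate(lambda z: np.tanh(m * z))              # high-slope system 5.5
    print(m, np.abs(ref[idx] - xm[idx]).max())           # error shrinks as m grows
```

As in the figure, the error is dominated by the region where the states are closest to the discontinuity at zero, so it decays rapidly with the slope m.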
Figure 5 indicates that $\mathrm{error}_m$ converges to zero as $m$ increases. Therefore, the solutions $\{x_m(t)\}$ of systems 5.5 with high-slope activation functions converge to the solution $x(t)$ of the delayed neural network with discontinuous activations.

6 Conclusion

In this letter, we considered the global stability of delayed neural networks with discontinuous activation functions, which might be unbounded. We extended the Filippov solution to the case of delayed neural networks. Under some conditions, the existence of an equilibrium and of an absolutely continuous solution on an infinite time interval was proved. Thus, the Lyapunov-type method
can be used to study stability. In this way, we obtained an LMI-based criterion for global convergence. If some components of the equilibrium are discontinuous points of the activation function, the output does not converge in norm; in this case, we proved that the output converges in measure. Furthermore, we pointed out that the solution of a delayed neural network with discontinuous activation functions can be regarded as a limit of the solution sequence of delayed systems with high-slope activation functions.

Acknowledgments

We are very grateful to the reviewers for their comments, which were helpful to us in revising the article. This work was supported by the National Science Foundation of China (grants 60374018 and 60574044) and by the Graduate Student Innovation Foundation of Fudan University.
References

Aubin, J. P. (1991). Viability theory. Boston: Birkhäuser.
Aubin, J. P., & Cellina, A. (1984). Differential inclusions. Berlin: Springer-Verlag.
Aubin, J. P., & Frankowska, H. (1990). Set-valued analysis. Boston: Birkhäuser.
Baciotti, A., Conti, R., & Marcellini, P. (2000). Discontinuous ordinary differential equations and stabilization. Tesi di Dottorato di Ricerca in Matematica, Università degli Studi di Firenze.
Boyd, S., Ghaoui, L. E., Feron, E., & Balakrishnan, V. (1994). Linear matrix inequalities in system and control theory. Philadelphia: SIAM.
Chen, T. P. (2001). Global exponential stability of delayed Hopfield neural networks. Neural Networks, 14, 977–980.
Chua, L. O., Desoer, C. A., & Kuh, E. S. (1987). Linear and nonlinear circuits. New York: McGraw-Hill.
Civalleri, P. P., Gilli, L. M., & Pandolfi, L. (1993). On stability of cellular neural networks with delay. IEEE Trans. Circuits Syst., 40, 157–164.
Clarke, F. H. (1983). Optimization and nonsmooth analysis. New York: Wiley.
Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. Syst. Man Cybern., 13, 815–826.
Filippov, A. F. (1967). Classical solution of differential equations with multivalued right-hand side. SIAM J. Control, 5, 609–621.
Forti, M., & Nistri, P. (2003). Global convergence of neural networks with discontinuous neuron activations. IEEE Trans. Circuits Syst.–I, 50, 1421–1435.
Haddad, G. (1981). Monotone viable trajectories for functional differential inclusions. J. Diff. Equ., 42, 1–24.
Hale, J. K. (1977). Theory of functional differential equations. New York: Springer-Verlag.
Harrer, H., Nossek, J. A., & Stelzl, R. (1992). An analog implementation of discrete-time neural networks. IEEE Trans. Neural Networks, 3, 466–476.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Hopfield, J. J., & Tank, D. W. (1986). Computing with neural circuits: A model. Science, 233, 625–633.
Joy, M. P. (2000). Results concerning the absolute stability of delayed neural networks. Neural Networks, 13, 613–616.
Kennedy, M. P., & Chua, L. O. (1988). Neural networks for nonlinear programming. IEEE Trans. Circuits Syst., 35, 554–562.
Liz, E., & Pouso, R. L. (2002). Existence theory for first order discontinuous functional differential equations. Proc. Amer. Math. Soc., 130, 3301–3311.
Lu, W. L., & Chen, T. P. (2005). Dynamical behaviors of Cohen-Grossberg neural networks with discontinuous activation functions. Neural Networks, 18, 231–242.
Lu, W. L., Rong, L. B., & Chen, T. P. (2003). Global convergence of delayed neural network systems. Int. J. Neural Networks, 13, 193–204.
Paden, B. E., & Sastry, S. S. (1987). Calculus for computing Filippov's differential inclusion with application to the variable structure control of robot manipulators. IEEE Trans. Circuits Syst., 34, 73–82.
Spraker, J. S., & Biles, D. C. (1996). A comparison of the Carathéodory and Filippov solution sets. J. Math. Anal. Appl., 198, 571–580.
Utkin, V. I. (1977). Variable structure systems with sliding modes. IEEE Trans. Automat. Contr., AC-22, 212–222.
Yosida, K. (1978). Functional analysis. New York: Springer-Verlag.
Zeng, Z., Wang, J., & Liao, X. (2003). Global exponential stability of a general class of recurrent neural networks with time-varying delays. IEEE Trans. Circuits Syst.–I, 50, 1353–1358.
Received September 24, 2004; accepted July 11, 2005.
LETTER
Communicated by Shun-ichi Amari
A Generalized Contrast Function and Stability Analysis for Overdetermined Blind Separation of Instantaneous Mixtures

Xiao-Long Zhu
[email protected]

Xian-Da Zhang
[email protected]
National Laboratory for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China

Ji-Min Ye
[email protected]
School of Science, Xidian University, Xi'an 710071, China
In this letter, the problem of blind separation of n independent sources from their m linear instantaneous mixtures is considered. First, a generalized contrast function is defined as a valuable extension of the existing classical and nonsymmetrical contrast functions. It is applicable to overdetermined blind separation (m > n) with an unknown number of sources, because not only independent components but also redundant ones are allowed in the outputs of a separation system. Second, a natural gradient learning algorithm developed primarily for the complete case (m = n) is shown to work as well with an n × m or m × m separating matrix, for each optimizes a certain mutual information contrast function. Finally, we present stability analysis for a newly proposed generalized orthogonal natural gradient algorithm (which can perform overdetermined blind separation when n is unknown), obtaining the expected result that its local stability conditions are slightly stricter than those of the conventional natural gradient algorithm using an invertible mixing matrix (m = n).

1 Introduction

The problem of blind source separation (BSS) has been studied intensively in recent years (Girolami, 1999; Hyvarinen, Karhunen, & Oja, 2001). In the noise-free instantaneous case, the available sensor vector $x_t = [x_1(t), \ldots, x_m(t)]^T$ and the unobservable source vector $s_t = [s_1(t), \ldots, s_n(t)]^T$ are related by
$$x_t = A s_t, \tag{1.1}$$

Neural Computation 18, 709–728 (2006) © 2006 Massachusetts Institute of Technology
where A is an unknown m × n mixing matrix. For simplicity, all variables in this article are supposed to be real-valued. The superscript T denotes transpose of a vector or matrix. The objective of BSS is to find a separating matrix B given just a sequence of observations {xt }, such that the output vector, yt = Bxt ,
(1.2)
provides accurate estimates of the n source signals. For this purpose, the following assumptions are usually made:

A1. The unknown mixing matrix A is of full column rank with n ≤ m.
A2. The source signals $s_1(t), \ldots, s_n(t)$ are statistically independent.
A3. Each source signal is a zero-mean and unit-power stationary process.
A4. At most one of the source signals is subject to the normal distribution.

The full-column-rank requirement of the mixing matrix guarantees that all source signals are theoretically recoverable (Cao & Liu, 1996; Li & Wang, 2002). More precisely, when the column rank of A is deficient, at most a subset of the source signals can be successfully extracted unless additional prior information about the mixing matrix or the source signals is available (Lewicki & Sejnowski, 2000). The independence assumption is a premise of the BSS problem, and it holds in many practical situations. The unit-power assumption comes from the amplitude indeterminacy inherent in the problem (Comon, 1994; Tong, Liu, Soon, & Huang, 1991); it incorporates the scale information of each source signal into the corresponding column vector of the mixing matrix. The last assumption is made because a linear mixture of several gaussian signals is still gaussian and thus cannot be factorized uniquely. The BSS problem is also solvable if the assumptions regarding the source signals (A2–A4) are replaced by the assumption that all the source signals are statistically nonstationary with distinct variances at two or more time instants (Douglas, 2002; Pham & Cardoso, 2001).

Numerous algorithms have been proposed for BSS (see, e.g., Amari & Cichocki, 1998; Cardoso, 1998; Girolami, 1999; Hyvarinen et al., 2001). Of particular note is the natural gradient algorithm (Amari, Cichocki, & Yang, 1996; Amari, 1998; Yang & Amari, 1997),
$$B_{t+1} = B_t + \eta_t \left[ I - \phi(y_t) y_t^T \right] B_t, \tag{1.3}$$
where ηt is a positive learning rate parameter and φ(yt ) = [ϕ1 (y1 (t)), . . . , ϕn (yn (t))]T is a nonlinear-transformed vector of yt . This
algorithm is computationally efficient with uniform convergence performance independent of the conditioning of the mixing matrix (Cardoso & Laheld, 1996). Additionally, algorithm 1.3 is developed primarily for complete BSS where both the mixing matrix and the separating matrix are invertible square matrices (m = n); nevertheless, it works effectively as well in the overdetermined case (m > n) by designating an n × m separating matrix $B_t$ (Zhang, Cichocki, & Amari, 1999; Zhu & Zhang, 2004). The extension of algorithm 1.3 to the overdetermined case requires that the source number n should be known a priori. When the source number is unknown, we can extract the source signals one by one using the sequential approaches (Delfosse & Loubaton, 1995; Hyvarinen & Oja, 1997; Thawonmas, Cichocki, & Amari, 1998). They work regardless of the source number, but the quality of the separated source signals would increasingly degrade due to the accumulated errors coming from the deflation process. Moreover, the subsequent signal extraction cannot proceed unless the current deflation process has converged, which is prohibitive in some real-time situations such as wireless communications. As to parallel algorithms that recover source signals simultaneously, the BSS problem with an unknown number of sources can be solved in two stages. The observations are first preprocessed such that the sensor vector is transformed to a white vector, and at the same time its dimensionality is decreased from m to n; then an n × n orthogonal matrix is determined to obtain source separation. This scheme suffers from poor separation for an ill-conditioned mixing matrix or weak source signals (Karhunen, Pajunen, & Oja, 1998). Surprisingly, algorithm 1.3 can be employed directly to learn an m × m nonsingular matrix.
Extensive experiments show that among m outputs at its convergence, n components are mutually independent, providing the rescaled source signals; each of the remaining m − n components is a copy of some independent component. Therefore, a postprocessing layer can be utilized to remove the redundant signals and estimate the source number. The above finding was reported earlier (Cichocki, Karhunen, Kasprzak, & Vigario, 1999), and later a theoretical justification was given from the viewpoint of contrast function (Ye, Zhu, & Zhang, 2004). As also verified by experiments, if the dimensionality of yt is greater than the source number, then the convergence of algorithm 1.3 is impermanent: it would diverge inevitably because the separation state is not a stationary point. To perform BSS with an unknown number of sources effectively, a generalized orthogonal natural gradient algorithm was proposed (Ye et al., 2004), but the (local) stability conditions remain to be given. The purpose of this article is to fill this gap. Additionally, a generalized contrast function is defined here, which includes the existing classical contrast function and the nonsymmetrical one as its two special cases. The natural gradient algorithm, 1.3, is able to perform (before divergence) the overdetermined BSS (m > n) with an m × m separating matrix in that it maximizes locally a generalized mutual information contrast function.
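For intuition, here is a minimal batch implementation of update 1.3 for the complete case m = n = 2 (our illustration, not from the letter; the mixing matrix, the cubic score $\phi(u) = u^3$, which suits the sub-Gaussian uniform sources used here, and the learning rate are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20000
s = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(2, T))  # unit-power sources
A = np.array([[1.0, 0.6], [0.5, 1.0]])                      # unknown mixing matrix
x = A @ s

B = np.eye(2)
eta = 0.05
for _ in range(2000):
    y = B @ x
    C = (y ** 3) @ y.T / T        # sample estimate of E[phi(y) y^T], phi(u) = u^3
    B += eta * (np.eye(2) - C) @ B

P = np.abs(B @ A)                  # composite matrix, ideally a generalized permutation
P /= P.max(axis=1, keepdims=True)
print(np.round(P, 2))              # close to a permutation pattern: one dominant entry per row
```

At convergence, $E[\phi(y_t)y_t^T] = I$, so $BA$ settles (up to ordering and scaling) at a generalized permutation matrix, which is exactly the separation state discussed in section 2.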
2 A Generalized Contrast Function

A contrast function (or contrast) plays a crucial role in optimization since its maximization gives solutions of certain problems. There are two types of contrasts for BSS in the literature: the classical contrast, first defined by Comon (1994), and the nonsymmetrical contrast introduced by Moreau and Thirion-Moreau (1999). The two contrasts address the case where the composite matrix C = BA is a square matrix, which implies that the separating matrix B is an n × m matrix and the source number n is known a priori.

In this section, we use the following notation. $\mathcal{C}_{n\times n}$ denotes the set of n × n nonsingular matrices, $\mathcal{D}_{n\times n}$ denotes the set of n × n invertible diagonal matrices, $\mathcal{G}_{n\times n}$ denotes the set of n × n generalized permutation matrices (which contain only one nonzero entry in each row and each column), $\mathcal{S}_n$ is the set of source vectors satisfying assumptions A2 and A3, and $\mathcal{Y}_n$ is the set of n-dimensional output vectors built from $y = Cs$, where $C \in \mathcal{C}_{n\times n}$ and $s \in \mathcal{S}_n$. Similarly, $\mathcal{C}_{q\times q}$, $\mathcal{D}_{q\times q}$, $\mathcal{G}_{q\times q}$, and $\mathcal{Y}_q$ are defined. $\emptyset$ represents the empty set. For convenience, we drop the time index t from the source vector $s_t$, the observation vector $x_t$, and the output vector $y_t$.

Definition 1: Classical Contrast (Comon, 1994; Moreau & Macchi, 1996). A contrast function on $\mathcal{Y}_n$ is a multivariate mapping J from the set $\mathcal{Y}_n$ to the real number set R, which satisfies the following three requirements:

R1. $\forall y \in \mathcal{Y}_n$, $\forall C \in \mathcal{G}_{n\times n}$, $J(Cy) = J(y)$.
R2. $\forall s \in \mathcal{S}_n$, $\forall C \in \mathcal{C}_{n\times n}$, $J(Cs) \le J(s)$.
R3. $\forall s \in \mathcal{S}_n$, $\forall C \in \mathcal{C}_{n\times n}$, $J(Cs) = J(s) \Leftrightarrow C \in \mathcal{G}_{n\times n}$.

Definition 2: Nonsymmetrical Contrast (Moreau & Thirion-Moreau, 1999). A contrast function on $\mathcal{Y}_n$ is a multivariate mapping J from the set $\mathcal{Y}_n$ to R, which satisfies the following three requirements:

Q1. $\forall y \in \mathcal{Y}_n$, $\forall D \in \mathcal{D}_{n\times n}$, $J(Dy) = J(y)$.
Q2. $\forall s \in \mathcal{S}_n$, $\forall C \in \mathcal{C}_{n\times n}$, $J(Cs) \le J(s)$.
Q3. $\forall s \in \mathcal{S}_n$, $\forall C \in \mathcal{C}_{n\times n}$, $\exists \mathcal{G}'_{n\times n} \subset \mathcal{G}_{n\times n}$ with $\mathcal{G}'_{n\times n} \ne \emptyset$, $J(Cs) = J(s) \Leftrightarrow C \in \mathcal{G}'_{n\times n}$.
From the above two definitions, it can be seen that a classical contrast is invariant under all separation states, and its global maximization is a necessary and sufficient condition for source separation. In contrast, a nonsymmetrical contrast is merely invariant to nonzero scaling transformation of its arguments. Although a separation state where the outputs consist of independent components does not necessarily correspond to a maximum, the maximization of a nonsymmetrical contrast does result in source separation.
In BSS, mostly used are the classical contrasts, including the maximum likelihood contrasts (Cardoso, 1998), the mutual information contrasts (Amari et al., 1996; Comon, 1994; Pham, 2002; Yang & Amari, 1997), the nonlinear principal component analysis contrasts (Karhunen et al., 1998; Oja, 1997; Zhu & Zhang, 2002), and the high-order statistics contrasts (Cardoso, 1999; Moreau & Macchi, 1996; Moreau & Thirion-Moreau, 1999). The negative mutual information of the outputs in equation 1.2 is given by (Amari et al., 1996; Comon, 1994; Yang & Amari, 1997)
$$J(y_1, \ldots, y_q; B) = H(y_1, \ldots, y_q) - \sum_{k=1}^{q} H(y_k), \tag{2.1}$$
where the differential entropy is defined as (Cover & Thomas, 1991)
$$H(y_1, \ldots, y_q) = -\int_{-\infty}^{\infty} p(y_1, \ldots, y_q) \ln p(y_1, \ldots, y_q)\, dy_1 \cdots dy_q, \tag{2.2}$$
in which $p(y_1, \ldots, y_q)$ is the joint probability density function (PDF) of $\{y_1, \ldots, y_q\}$, and $\ln(\cdot)$ is the natural logarithm operator. By the property of the differential entropy, $J(y_1, \ldots, y_q; B) \le 0$, with equality if and only if $p(y_1, \ldots, y_q) = \prod_{k=1}^{q} p(y_k)$. It is straightforward that equation 2.1 with $q = n$ is a classical contrast for BSS (Comon, 1994), meaning
$$J(y_1, \ldots, y_n; B) = 0 \quad \Leftrightarrow \quad BA \in \mathcal{G}_{n\times n}. \tag{2.3}$$
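As a quick sanity check of this sign property (our illustration, not from the letter): for a zero-mean Gaussian pair with correlation $\rho$, equation 2.1 has the closed form $J = \frac{1}{2}\ln(1-\rho^2)$, which is zero exactly when the two components are independent and strictly negative otherwise:

```python
import numpy as np

# For a zero-mean Gaussian pair with correlation rho:
#   H(y1, y2) = 0.5 * ln((2*pi*e)^2 * det(Sigma)),  H(yi) = 0.5 * ln(2*pi*e*var_i),
# so J = H(y1, y2) - H(y1) - H(y2) = 0.5 * ln(1 - rho**2).
def J_gauss(rho):
    return 0.5 * np.log(1.0 - rho ** 2)

print(J_gauss(0.0))   # 0.0: independent components
print(J_gauss(0.9))   # negative: dependence strictly decreases the contrast
```

The more correlated the pair, the more negative J becomes, in line with $J \le 0$ with equality only at independence.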
Since the outputs $(y_1, \ldots, y_q)$ are linear combinations of the n source signals, they cannot be mutually independent, and the global maximum of equation 2.1 cannot be reached if $q > n$. As mentioned previously, the task of BSS is to recover the source signals from the observations, so the dimensionality of y can be larger than the number of sources. An algorithm also achieves source separation if $y_1, \ldots, y_q$ ($q \ge n$) are made up of n independent components plus $q - n$ redundant components. That is, a sufficient condition for BSS is that all the source signals are recovered at least once, and the composite matrix takes the following form,
$$C = G \cdot [e_1, \ldots, e_n, c_{n+1}, \ldots, c_q]^T, \tag{2.4}$$
where $G \in \mathcal{G}_{q\times q}$, $[e_1, \ldots, e_n] = I$ is the $n \times n$ identity matrix, and $c_i$ ($i = n+1, \ldots, q$) is either a null vector or a column of I. To develop such algorithms, the conventional contrasts need to be modified to handle a rectangular matrix as well as a square matrix C.

Definition 3: Generalized Contrast. Let $\xi$ be a $q \times 1$ vector ($q \ge n$), n components of which are the original source signals and each of the rest $q - n$
components is a constant zero or some source signal. All $q! \cdot (n+1)^{q-n}$ possible versions of $\xi$ form the set $\Xi_q$. Denote $\bar{\Xi}_q = \{\varsigma \mid \varsigma = D\xi,\ \xi \in \Xi_q,\ D \in \mathcal{D}_{q\times q}\}$ and $\mathcal{Q}_{q\times q} = \{Q \mid Q\xi \in \bar{\Xi}_q,\ \xi \in \Xi_q\}$, and let $\mathcal{W}_{q\times q}$ be the set of $q \times q$ (singular or nonsingular) matrices. A contrast function on $\mathcal{Y}_q$ is a multivariate mapping J from the set $\mathcal{Y}_q$ to R, which satisfies the following three requirements:

T1. $\forall y \in \mathcal{Y}_q$, $\forall D \in \mathcal{D}_{q\times q}$, $J(Dy) = J(y)$.
T2. $\forall y \in \mathcal{Y}_q$, $\forall W \in \mathcal{W}_{q\times q}$, $\exists \delta > 0$, $J(y + \delta W y) \le J(y) \Rightarrow y \in \bar{\Xi}_q$.
T3. $\forall \xi \in \Xi_q$, $\forall W \in \mathcal{C}_{q\times q}$, $\exists \mathcal{Q}'_{q\times q} \subset \mathcal{Q}_{q\times q}$ with $\mathcal{Q}'_{q\times q} \ne \emptyset$, $J(W\xi) = J(\xi) \Leftrightarrow W \in \mathcal{Q}'_{q\times q}$.

Obviously, when a generalized contrast is maximized, the outputs provide all source signals up to ordering and scaling, which is common with the classical and nonsymmetrical contrasts. The main difference among the three definitions is that all local maxima of the generalized contrast correspond to source separation (where the q outputs are composed of n independent components and q − n redundant components), while only the global ones of the classical and nonsymmetrical contrasts do so (where no redundant component is allowed). Hence, the generalized contrast is particularly useful for the overdetermined BSS problem with an unknown number of sources. The classical and nonsymmetrical contrasts, as special instances of the generalized one, are simply appropriate for the case where the source number is known a priori.

Now, we turn back to equation 2.1. For a square mixing matrix A, the separating matrix B should be taken as an n × n invertible matrix, resulting in (Amari et al., 1996; Bell & Sejnowski, 1995; Comon, 1994; Yang & Amari, 1997)
$$H(y_1, \ldots, y_n) = H(x_1, \ldots, x_n) + \ln|\det(B)|, \tag{2.5}$$
where $|\det(B)|$ is the absolute value of the determinant of B. On the other hand, if the $m \times n$ matrix A is of full column rank with n in hand, then an $n \times m$ full-row-rank matrix B can be assigned to perform BSS. By the singular value decomposition (Golub & van Loan, 1996), there must be an $n \times n$ orthogonal matrix U, an $n \times n$ nonsingular diagonal matrix $\Sigma$, and an $m \times m$ orthogonal matrix V such that $B = U[\Sigma \;\vdots\; O]V^T$, where O denotes an $n \times (m-n)$ null matrix. Let $V_1$ be a submatrix made up of the first n columns of V and $V_2$ the remaining $m - n$ columns. The output vector then becomes $y = U\Sigma V_1^T x$. According to equation 2.2 and the property $\det(\Sigma) = \sqrt{\det(BB^T)}$, we obtain
$$H(y_1, \ldots, y_n) = H\!\left(V_1^T x\right) + \frac{1}{2} \ln\left|\det\!\left(BB^T\right)\right|. \tag{2.6}$$
Since $B = U\Sigma V_1^T$ does not depend on $V_2$, from the formula (Cover & Thomas, 1991)
$$H\!\left(V_1^T x\right) = H(x) - H\!\left(V_2^T x \,\middle|\, V_1^T x\right), \tag{2.7}$$
where $H(V_2^T x \mid V_1^T x)$ represents the differential entropy of $V_2^T x$ conditioned on $V_1^T x$, it can be deduced that $H(V_1^T x)$ is not a function of B (e.g., $H(V_1^T x) = H(x)$ when $V_2^T A = O$). To sum up, if the source number is known and equal to the number of outputs, the mutual information contrast, equation 2.1, subject to definition 1 can be simplified to (Zhu & Zhang, 2004)
$$J_1(y_1, \ldots, y_n; B) = \frac{1}{2} \ln \det\!\left(BB^T\right) - \sum_{k=1}^{n} H(y_k), \tag{2.8}$$
in which the trivial term H(x1 , . . . , xn ) in equation 2.5 or H(V1T x) in equation 2.6 is removed. The separating matrix permits two choices: Bt ∈ Rn×n or Bt ∈ Rn×m . We notice that another contrast is also useful in this case (Zhang et al., 1999),
$$J_2(y_1, \ldots, y_n; B) = \ln\left|\det\!\left(BE^T\right)\right| - \sum_{k=1}^{n} H(y_k), \tag{2.9}$$
where E is the identity element of the Lie group $Gl(n, m)$. Clearly, the classical contrasts 2.8 and 2.9 are equivalent, for $\ln|\det(BE^T)| = \frac{1}{2}\ln|\det(BB^T)|$. When the source number is unknown, neither the classical contrast nor the nonsymmetrical contrast works, and it is necessary to use the generalized contrast. Since the sensor number m is usually (if not always) in hand, the separating matrix B can be restricted to be an $m \times m$ nonsingular matrix. Thus, an $n \times m$ submatrix $B_1$ exists such that $B_1 A$ is invertible. Without loss of generality, we suppose that $B_1$ is composed of the first n rows of B and denote by $B_2$ the remaining submatrix. It can be shown that equation 2.1 with $q = m$ becomes (Ye et al., 2004)
$$J_3(y_1, \ldots, y_m; B) = -I(y_1, \ldots, y_n) - \sum_{k=1}^{m-n} H(y_{n+k}), \tag{2.10}$$
in which [y1 , . . . , yn ]T = B1 x, [yn+1 , . . . , ym ]T = B2 x, and I(y1 , . . . , yn ) denotes the mutual information of {y1 , . . . , yn }. Because a random variable
that is the sum of several independent components has an entropy not less than that of any individual component (Cover & Thomas, 1991), that is,
$$H(y_{n+k}) \ge \max_{j \in \{1, \ldots, n\}} H\!\left([B_2 A]_{n+k,\, j}\, s_j\right), \tag{2.11}$$
where $C_{pq}$ denotes the $(p, q)$th entry of matrix C, the second term (including the minus sign) on the right-hand side of equation 2.10 reaches a local maximum when each component $y_{n+k}$ is a copy of some source signal. Furthermore, the first term attains its global maximum of zero if and only if $y_1, \ldots, y_n$ are mutually independent. Therefore, we can infer that equation 2.10, or its equivalent version,
$$J_3(y_1, \ldots, y_m; B) = \ln|\det(B)| - \sum_{k=1}^{m} H(y_k), \tag{2.12}$$
works as a generalized contrast function for overdetermined BSS. Based on equation 2.12 and the natural gradient learning (Amari, 1998), we can readily get the following overdetermined BSS algorithm (Cichocki et al., 1999; Ye et al., 2004),
$$B_{t+1} = B_t + \eta_t \left[ I - F(y_t) y_t^T \right] B_t, \tag{2.13}$$
where $B_t \in \mathbb{R}^{m\times m}$, $F(y) = [f_1(y_1), \ldots, f_m(y_m)]^T$, and
$$f_i(y_i) = -\frac{1}{p_i(y_i)} \cdot \frac{d p_i(y_i)}{d y_i}, \qquad i = 1, \ldots, m, \tag{2.14}$$
are referred to as the score functions of BSS. Regarding equation 2.13, we make the following remarks: Remark 1. For the complete BSS problem, the sensor number is equal to the number of sources (m = n), and the natural gradient algorithm, 1.3, which was first proposed heuristically in Cichocki, Unbehauen, and Rummert (1994) and later theoretically justified in Amari (1998), has the same form as equation 2.13 except Bt ∈ Rn×n . Remark 2. In the overdetermined case (m > n), if the source number n is known, we can apply the relative gradient learning (Cardoso & Laheld, 1996) to the classical contrast, equation 2.8 (Zhu & Zhang, 2004), or apply the natural gradient learning to equation 2.9 (Zhang et al., 1999), obtaining finally a homologous algorithm to equation 2.13 but Bt ∈ Rn×m .
Remark 3. When the source number n is unknown, it is necessary to employ algorithm 2.13 with $B_t \in \mathbb{R}^{m\times m}$. Let $A^+ = (A^T A)^{-1} A^T$ be the pseudo-inverse of the mixing matrix A, $N(A^T)$ be an $m \times (m-n)$ matrix formed by the basis vector(s) of the null space of $A^T$, and $\bar{B}$ be an $(m-n) \times m$ matrix whose rows are made up of some row(s) of $A^+$. As long as the separating matrix $B_t$ updates in the equivalence class (Amari, Chen, & Cichocki, 2000; Ye et al., 2004), defined by
$$\mathcal{B} = \left\{ \tilde{B} \,\middle|\, \tilde{B} = G\left[(A^+)^T,\; N(A^T) + \bar{B}^T\right]^T,\ G \in \mathcal{G}_{m\times m} \right\}, \tag{2.15}$$
the m output components consist of n rescaled source signals and m − n redundant signals. In other words, $y_t = G\xi_t$ and some estimated source signals are replicated. From the viewpoint of contrast function, one local maximum of equation 2.12 is achieved.

Remark 4. For the ideal noiseless case, the m components of $y_t = B_t A s_t$, which are linear mixtures of the n independent source signals, cannot be mutually independent when m > n. Thus, the stationary condition
$$E\left[ I - F(y_t) y_t^T \right] = O$$
(2.16)
does not hold, and the natural gradient algorithm, 2.13, cannot stabilize in the separation state y_t = Gξ_t. As a result, algorithm 2.13 would inevitably diverge unless the learning rate η_t is sufficiently small. A different phenomenon arises in the noisy case x_t = A s_t + v_t, where v_t is a vector of additive noises. Let Ã be an m × (m − n) matrix such that Ā = [A, Ã] is an m × m invertible matrix. Then the sensor vector can be rewritten as x_t = Ā s̃_t + ṽ_t with s̃_t = [s_t^T, v̄_t^T]^T and ṽ_t = v_t − Ã v̄_t, where v̄_t is an (m − n)-dimensional noise vector. Provided that the components of v̄_t are independent of the source signals, this is a noisy but complete BSS model (in the specific case v_t = Ã v̄_t, it reduces to a noiseless one, x_t = Ā s̃_t); thereby, in the noisy case, the behavior of algorithm 2.13 is similar to that of the (noisy) complete natural gradient algorithm 1.3. It will collect the n unknown source signals (with reduced noise) and m − n noise signals after convergence (under the condition that the noise is not too large; usually its power is less than 3% of the source signal's; Cichocki, Sabala, Choi, Orsier, & Szupiluk, 1997). Furthermore, algorithm 2.13 no longer diverges, as condition 2.16 is fulfilled. The above analysis, as well as the efficiency of the framework, 2.13, using different forms of separating matrices (B_t ∈ R^{n×n}, B_t ∈ R^{n×m}, or B_t ∈ R^{m×m}), has been verified by extensive experiments (Amari et al., 1996; Cichocki et al., 1994, 1999; Girolami, 1999; Hyvarinen et al., 2001; Ye et al., 2004; Zhu & Zhang, 2004).
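To make the update concrete, here is a minimal NumPy sketch of one iteration of algorithm 2.13 with B_t ∈ R^{m×m} (our illustration; the tanh nonlinearity and fixed step size are stand-ins for the score functions and learning rate schedule):

```python
import numpy as np

def natural_gradient_step(B, x, eta=0.01, phi=np.tanh):
    """One step of B <- B + eta * [I - phi(y) y^T] B on a sensor sample x."""
    y = B @ x                                   # current output estimate
    m = B.shape[0]
    return B + eta * (np.eye(m) - np.outer(phi(y), y)) @ B
```

Replacing the identity matrix I by a diagonal Λ, or by the exponentially windowed matrix R discussed in section 3, yields the orthogonal and generalized variants of the algorithm.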
718
X.-L. Zhu, X.-D. Zhang, and J.-M. Ye
Remark 5. In algorithm 1.3 or 2.13, the identity matrix I determines the amplitude of the reconstructed signals, and it can be replaced by a nonsingular diagonal matrix Λ, yielding

B_{t+1} = B_t + η_t [Λ − F(y_t) y_t^T] B_t.        (2.17)

If Λ = diag{F(y_t) y_t^T} is the diagonal matrix composed of the on-diagonal elements of F(y_t) y_t^T, then equation 2.17 becomes an orthogonal natural gradient algorithm (Amari et al., 2000; Cichocki et al., 1997). With B_t ∈ R^{m×m} and m > n, this algorithm can make the redundant components decay to zero, and it is particularly useful when the magnitudes of the source signals are intermittent or changing rapidly over time (e.g., speech signals). One drawback of the orthogonal natural gradient algorithm is that the redundant signals, although very weak in scale, may still be mixtures of several source signals. Furthermore, the magnitudes of some reconstructed source signals may be very small as well; hence, it is sometimes difficult to discriminate them from the redundant ones.

3 Stability Analysis of a Generalized Natural Gradient Algorithm

The natural gradient algorithm, 2.13, would diverge in the noiseless case, while the orthogonal natural gradient algorithm, 2.17, may have trouble determining whether some weak outputs are recovered source signals or redundant ones. To overcome these problems, we have recently proposed a generalized orthogonal natural gradient algorithm (Ye et al., 2004),

B_{t+1} = B_t + η_t [R − φ(y_t) y_t^T] B_t,        (3.1)

in which

R = E{φ(y_t) y_t^T} |_{y_t = Gξ_t}        (3.2)

is no longer a diagonal matrix when m > n. In practice, R should start with the identity matrix and then take an exponentially windowed average of equation 3.2, that is,

R_t = I,                                     t ≤ T_0,
R_t = λ R_{t−1} + (1 − λ) φ(y_t) y_t^T,      t > T_0,        (3.3)
where 0 < λ < 1 is a forgetting factor and T0 is an empirical number associated with the convergence speed. Additionally, when the source number is known a priori or has been estimated, other choices of R are also possible
(they may control what the redundant components are or where they appear; refer to Ye et al., 2004, for more details). Notice that the PDFs of the source signals are usually inaccessible in the BSS problem; hence, another group of nonlinearities, called activation functions ϕ_i(y_i), is generally used instead of the score functions in equation 2.14. A pivotal question is: How much can the activation functions ϕ_i(y_i) depart from the score functions f_i(y_i) corresponding to the true distributions before a separating stationary point becomes unstable? The local stability conditions answer such a question (Amari, Chen, & Cichocki, 1997; Amari, 1998; Amari et al., 2000; Cardoso & Laheld, 1996; Cardoso, 2000; Ohata & Matsuoka, 2002; von Hoff, Lindgren, & Kaelin, 2000). To the best of our knowledge, however, none of the existing literature considers the general case where there are more outputs than source signals. It is worth mentioning that the generalized algorithm 3.1 can perform the overdetermined BSS with an unknown number of sources, and its validity has been confirmed by computer simulations, but the local stability conditions remain to be given. Here, we study this problem and formulate the following theorem. Again, the time index t is omitted for simplicity.

Theorem 1. When the activation functions and the zero-mean source signals satisfy the following conditions,

E{ϕ_i(s_i) s_j} = 0,   if i ≠ j        (3.4)

E{ϕ'_i(s_i) s_j s_k} = 0,   if j ≠ k        (3.5)
where ϕ'_i(s_i) denotes the first-order derivative of ϕ_i(s_i), and when the inequalities

E{ϕ'_i(s_i) s_j^2} E{ϕ'_j(s_j) s_i^2} > E{ϕ_i(s_i) s_i} E{ϕ_j(s_j) s_j}        (3.6)

E{ϕ'_i(s_i) s_j^2} > 0        (3.7)
hold for all i, j = 1, ..., n, then the separating matrix B ensuring y = BAs = Gξ is a stable equilibrium point of the generalized orthogonal natural gradient algorithm 3.1.

Proof. The local (asymptotic) stability of a stochastic BSS algorithm can be examined through the response of the composite matrix C = BA near equilibrium points. Multiplying both sides of equation 3.1 by A and taking the mathematical expectation, we have

E{C} := E{C} + η E{[R − φ(y) y^T] C}.        (3.8)
In what follows, the variables on the left-hand side of the notation “:=” have the time index t + 1 while those on the right-hand side have the time
index t. In addition, ϕ_i(s_i) and ϕ'_i(s_i) are abbreviated as ϕ_i and ϕ'_i, respectively. Suppose there is a small disturbance matrix Δ of C at a separating point; the local stability issue is then simplified to investigating whether and under what conditions E{Δ} converges to a null matrix. Without loss of generality, we assume n < m ≤ 2n and the output vector y = ξ with

ξ_i = s_i,       i = 1, ..., n;
ξ_i = s_{i−n},   i = n + 1, ..., m,        (3.9)

namely, the composite matrix is

C = Ĩ = [e_1, ..., e_n, e_1, ..., e_{m−n}]^T,        (3.10)
where e_i is the ith column vector of the n × n identity matrix. Taking the small disturbance Δ into account, the output vector becomes y = ξ + Δs, and the first-order Taylor expansion of φ(y) reads

φ(y) ≈ φ(ξ) + D_ϕ Δs,        (3.11)

in which D_ϕ = diag[ϕ'_1, ..., ϕ'_n, ϕ'_1, ..., ϕ'_{m−n}]. Substituting R = E{φ(ξ)ξ^T} along with the approximation 3.11 into equation 3.8 and neglecting the second-order infinitesimal of Δ, we obtain

E{Δ} := E{Δ} − η E{φ(ξ) s^T Δ̃^T + D_ϕ Δs s̃^T},        (3.12)

where Δ̃ = Ĩ^T Δ and s̃ = Ĩ^T ξ = [2s_1, ..., 2s_{m−n}, s_{m−n+1}, ..., s_n]^T. Since the disturbance Δ is uncorrelated with the source signals, the expression 3.12 under the assumptions 3.4 and 3.5 can be written into six different cases:

Case 1 (C1). 1 ≤ i ≤ m − n:
E{Δ_{i,i}} := [1 − η E{2ϕ'_i s_i^2 + ϕ_i s_i}] E{Δ_{i,i}} − η E{ϕ_i s_i} E{Δ_{n+i,i}}.

Case 2 (C2). m − n < i ≤ n:
E{Δ_{i,i}} := [1 − η E{ϕ'_i s_i^2 + ϕ_i s_i}] E{Δ_{i,i}}.

Case 3 (C3). 1 ≤ i ≤ n, 1 ≤ j ≤ m − n, i ≠ j:
E{Δ_{i,j}} := [1 − η E{2ϕ'_i s_j^2}] E{Δ_{i,j}} − η E{ϕ_i s_i} E{Δ_{j,i} + Δ_{n+j,i}}.
Case 4 (C4). 1 ≤ i ≤ n, m − n < j ≤ n, i ≠ j:
E{Δ_{i,j}} := [1 − η E{ϕ'_i s_j^2}] E{Δ_{i,j}} − η E{ϕ_i s_i} E{Δ_{j,i}}.

Case 5 (C5). n < i ≤ m, 1 ≤ j ≤ m − n:
E{Δ_{i,j}} := [1 − η E{2ϕ'_{i−n} s_j^2}] E{Δ_{i,j}} − η E{ϕ_{i−n} s_{i−n}} E{Δ_{j,i−n} + Δ_{n+j,i−n}}.

Case 6 (C6). n < i ≤ m, m − n < j ≤ n:
E{Δ_{i,j}} := [1 − η E{ϕ'_{i−n} s_j^2}] E{Δ_{i,j}} − η E{ϕ_{i−n} s_{i−n}} E{Δ_{j,i−n}}.

Now, we study the convergence of each entry of E{Δ}. The diagonal terms E{Δ_{i,i}}, m − n < i ≤ n, behave according to C2, and for stability, the coefficient [1 − η E{ϕ'_i s_i^2 + ϕ_i s_i}] must lie between 0 and 1. Assuming a sufficiently small positive learning rate η, the condition

E{ϕ'_i s_i^2 + ϕ_i s_i} > 0        (3.13)

should be met (m − n < i ≤ n). Note that the diagonal terms E{Δ_{i,i}}, 1 ≤ i ≤ m − n, rely not only on themselves but also on the off-diagonal terms E{Δ_{n+i,i}}; thus, they need to be considered in pairs. Based on C1 and C5, the vector V_0 = [Δ_{i,i}, Δ_{n+i,i}]^T evolves by

E{V_0} := E{V_0} − η Γ_0 E{V_0},        (3.14)

where Γ_0 is the following matrix:

Γ_0 = [ E{2ϕ'_i s_i^2 + ϕ_i s_i}   E{ϕ_i s_i};
        E{ϕ_i s_i}                 E{2ϕ'_i s_i^2 + ϕ_i s_i} ].

In order to make V_0 converge to a null vector, both of the eigenvalues of Γ_0 must be positive, which requires the inequalities

E{2ϕ'_i s_i^2 + ϕ_i s_i} > 0        (3.15)
E{2ϕ'_i s_i^2 + ϕ_i s_i} > |E{ϕ_i s_i}|        (3.16)

to hold for 1 ≤ i ≤ m − n. Concerning the off-diagonal terms E{Δ_{i,j}}, 1 ≤ i ≠ j ≤ m − n, they depend on E{Δ_{i,j}}, E{Δ_{j,i}}, E{Δ_{n+i,j}}, and E{Δ_{n+j,i}}, and hence the quadruples
in V_1 = [Δ_{i,j}, Δ_{j,i}, Δ_{n+i,j}, Δ_{n+j,i}]^T should be studied. Owing to C3 and C5, a matrix similar to Γ_0 in equation 3.14 can be obtained as

Γ_1 = [ E{2ϕ'_i s_j^2}   E{ϕ_i s_i}       0                E{ϕ_i s_i};
        E{ϕ_j s_j}       E{2ϕ'_j s_i^2}   E{ϕ_j s_j}       0;
        0                E{ϕ_i s_i}       E{2ϕ'_i s_j^2}   E{ϕ_i s_i};
        E{ϕ_j s_j}       0                E{ϕ_j s_j}       E{2ϕ'_j s_i^2} ],

which is a positive-definite matrix under the following conditions:

E{ϕ'_i s_j^2} > 0        (3.17)
E{ϕ'_i s_j^2} E{ϕ'_j s_i^2} > E{ϕ_i s_i} E{ϕ_j s_j},        (3.18)

where 1 ≤ i ≠ j ≤ m − n. Based on C4, C3, and C6, the triples in V_2 = [Δ_{i,j}, Δ_{j,i}, Δ_{n+i,j}]^T, 1 ≤ i ≤ m − n, m − n < j ≤ n, are associated with the matrix

Γ_2 = [ E{ϕ'_i s_j^2}   E{ϕ_i s_i}       0;
        E{ϕ_j s_j}      E{2ϕ'_j s_i^2}   E{ϕ_j s_j};
        0               E{ϕ_i s_i}       E{ϕ'_i s_j^2} ],

and the convergence is guaranteed by

E{ϕ'_i s_j^2} > 0        (3.19)
E{ϕ'_i s_j^2 + 2ϕ'_j s_i^2} > 0        (3.20)
E{ϕ'_i s_j^2} E{ϕ'_j s_i^2} > E{ϕ_i s_i} E{ϕ_j s_j}.        (3.21)
Analogously, using C3, C4, and C6, the triples in V_3 = [Δ_{i,j}, Δ_{j,i}, Δ_{n+j,i}]^T, m − n < i ≤ n, 1 ≤ j ≤ m − n, are related to the matrix

Γ_3 = [ E{2ϕ'_i s_j^2}   E{ϕ_i s_i}      E{ϕ_i s_i};
        E{ϕ_j s_j}       E{ϕ'_j s_i^2}   0;
        E{ϕ_j s_j}       0               E{ϕ'_j s_i^2} ],

and the conditions

E{ϕ'_j s_i^2} > 0        (3.22)
E{2ϕ'_i s_j^2 + ϕ'_j s_i^2} > 0        (3.23)
E{ϕ'_i s_j^2} E{ϕ'_j s_i^2} > E{ϕ_i s_i} E{ϕ_j s_j}.        (3.24)
When m − n < i ≠ j ≤ n, the pairs in V_4 = [Δ_{i,j}, Δ_{j,i}]^T behave according to C4, yielding

Γ_4 = [ E{ϕ'_i s_j^2}   E{ϕ_i s_i};
        E{ϕ_j s_j}      E{ϕ'_j s_i^2} ],

and for stability, the requirements below should be satisfied:

E{ϕ'_i s_j^2 + ϕ'_j s_i^2} > 0        (3.25)
E{ϕ'_i s_j^2} E{ϕ'_j s_i^2} > E{ϕ_i s_i} E{ϕ_j s_j}.        (3.26)
Following procedures similar to those used to determine the convergence of the off-diagonal terms E{Δ_{i,j}}, 1 ≤ i ≠ j ≤ m − n, the quadruples in V_5 = [Δ_{i,j}, Δ_{i−n,j}, Δ_{j,i−n}, Δ_{n+j,i−n}]^T, n < i ≤ m, 1 ≤ j ≤ m − n, produce the matrix

Γ_5 = [ E{2ϕ'_{i−n} s_j^2}   0                    E{ϕ_{i−n} s_{i−n}}   E{ϕ_{i−n} s_{i−n}};
        0                    E{2ϕ'_{i−n} s_j^2}   E{ϕ_{i−n} s_{i−n}}   E{ϕ_{i−n} s_{i−n}};
        E{ϕ_j s_j}           E{ϕ_j s_j}           E{2ϕ'_j s_{i−n}^2}   0;
        E{ϕ_j s_j}           E{ϕ_j s_j}           0                    E{2ϕ'_j s_{i−n}^2} ],

and they vanish if and only if

E{ϕ'_{i−n} s_j^2} > 0        (3.27)
E{ϕ'_j s_{i−n}^2} > 0        (3.28)
2 E{ϕ'_{i−n} s_j^2} E{ϕ'_j s_{i−n}^2} > E{ϕ_{i−n} s_{i−n}} E{ϕ_j s_j}.        (3.29)
Finally, the triples in V_6 = [Δ_{i,j}, Δ_{i−n,j}, Δ_{j,i−n}]^T, n < i ≤ m, m − n < j ≤ n, are associated with the matrix

Γ_6 = [ E{ϕ'_{i−n} s_j^2}   0                   E{ϕ_{i−n} s_{i−n}};
        0                   E{ϕ'_{i−n} s_j^2}   E{ϕ_{i−n} s_{i−n}};
        E{ϕ_j s_j}          E{ϕ_j s_j}          E{2ϕ'_j s_{i−n}^2} ],

and the conditions

E{ϕ'_{i−n} s_j^2} > 0        (3.30)
E{ϕ'_{i−n} s_j^2 + 2ϕ'_j s_{i−n}^2} > 0        (3.31)
E{ϕ'_{i−n} s_j^2} E{ϕ'_j s_{i−n}^2} > E{ϕ_{i−n} s_{i−n}} E{ϕ_j s_j}.        (3.32)
In conclusion, the conditions 3.13 and 3.15 to 3.32 can be summarized as follows:

Z1. 1 ≤ i ≤ m − n: E{2ϕ'_i s_i^2 + ϕ_i s_i} > |E{ϕ_i s_i}|, √2 E{ϕ'_i s_i^2} > |E{ϕ_i s_i}|.
Z2. m − n < i ≤ n: E{ϕ'_i s_i^2 + ϕ_i s_i} > 0.
Z3. 1 ≤ i ≤ m − n, 1 ≤ j ≤ n: E{ϕ'_i s_j^2} > 0.
Z4. 1 ≤ i ≤ m − n, m − n < j ≤ n: E{ϕ'_i s_j^2 + 2ϕ'_j s_i^2} > 0.
Z5. m − n < i ≠ j ≤ n: E{ϕ'_i s_j^2 + ϕ'_j s_i^2} > 0.
Z6. 1 ≤ i ≠ j ≤ n: E{ϕ'_i s_j^2} E{ϕ'_j s_i^2} > E{ϕ_i s_i} E{ϕ_j s_j}.

Obviously, inequalities 3.6 and 3.7 are sufficient conditions for Z1 to Z6. Moreover, the above analysis is based on the special output vector 3.9. Somewhat tedious but almost identical manipulations show that the same conclusion can be reached for m > 2n and for the general form of the output vector y = Gξ. This completes the proof of theorem 1.

Remark 6. Taking m = n in Z1 to Z6, we obtain the local stability conditions of the complete BSS algorithm, 1.3, given by

E{ϕ'_i s_i^2 + ϕ_i s_i} > 0,   i = 1, ..., n;
E{ϕ'_i s_j^2 + ϕ'_j s_i^2} > 0,   1 ≤ i ≠ j ≤ n;        (3.33)
E{ϕ'_i s_j^2} E{ϕ'_j s_i^2} > E{ϕ_i s_i} E{ϕ_j s_j},   1 ≤ i ≠ j ≤ n.

This result can be found, for example, in Amari (1998), Amari et al. (1997, 2000), Cardoso (2000), Ohata and Matsuoka (2002), and von Hoff et al. (2000). By comparison, it can be seen that the local stability conditions, equations 3.6 and 3.7, for the overdetermined BSS algorithm, 3.1, are somewhat stricter than those in equation 3.33 for the complete one, equation 1.3, which agrees with our conjecture.

Remark 7. The activation functions depend on the distributions of the source signals. Among all nonlinearities, the score functions 2.14 are shown to be the most efficient in that they always satisfy the local stability conditions, they work most robustly against outliers, and they can also obtain the best quality of the separated source signals (Hyvarinen et al., 2001; Mathis & Douglas, 2002; von Hoff et al., 2000).
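As a quick numerical sanity check (our illustration, not part of the original analysis), the key inequalities 3.6 and 3.7 can be verified by Monte Carlo for the common choice ϕ = tanh with super-gaussian (Laplace) sources:

```python
import numpy as np

rng = np.random.default_rng(0)
s_i = rng.laplace(size=200_000)              # unit-scale Laplace (super-gaussian)
s_j = rng.laplace(size=200_000)
phi = np.tanh
dphi = lambda u: 1.0 / np.cosh(u) ** 2       # phi'(u) = sech^2(u)

a = np.mean(dphi(s_i) * s_j**2)              # E{phi'_i(s_i) s_j^2}
b = np.mean(dphi(s_j) * s_i**2)              # E{phi'_j(s_j) s_i^2}
c = np.mean(phi(s_i) * s_i)                  # E{phi_i(s_i) s_i}
d = np.mean(phi(s_j) * s_j)                  # E{phi_j(s_j) s_j}

print(a > 0 and b > 0, a * b > c * d)        # conditions 3.7 and 3.6
```

For sub-gaussian sources the same nonlinearity can violate condition 3.6, which is why the choice of activation function must track the source distributions, as remark 7 notes.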
Remark 8. For two gaussian source signals s_i and s_j, it can be shown that

E{f'_i(s_i) s_j^2} E{f'_j(s_j) s_i^2} = E{f_i(s_i) s_i} E{f_j(s_j) s_j},        (3.34)

which contradicts the requirement, equation 3.6. Consequently, the assumption A4 that at most one source signal is gaussian is usually made in BSS.

Remark 9. In the proof of theorem 1, it is conditions 3.4 and 3.5, rather than the statistical independence of the source signals, that are used. Utilizing the independence assumption, we may get an expedient rule for the activation functions:

E{ϕ'_i(s_i)} > 0,   E{ϕ'_i(s_i) s_i^2 + ϕ_i(s_i) s_i} > 0,   E{ϕ'_i(s_i) s_i^2} > E{ϕ_i(s_i) s_i},        (3.35)
in which i = 1, ..., n. The conditions in equation 3.35 are stricter than those in theorem 1 and than those in equation 3.33. (See Amari et al., 1997; Cardoso, 2000; Girolami, 1999; Hyvarinen et al., 2001; and Mathis & Douglas, 2002, for some illustrative nonlinearities of ϕ_i.)

Remark 10. Theorem 1, as well as the existing stability analyses (Amari et al., 1997, 2000; Amari, 1998; Cardoso & Laheld, 1996; Cardoso, 2000; Ohata & Matsuoka, 2002; von Hoff et al., 2000), addresses only the noiseless case. When the observations are polluted by additive noises, a rigorous stability analysis for BSS seems very complicated, if not impossible. Here, we consider the special case mentioned in remark 4, where the additive noise vector is v_t = Ã v̄_t, such that the noisy observation vector x_t = A s_t + v_t reduces to a noiseless one, x_t = Ā s̃_t, with Ā = [A, Ã] and s̃_t = [s_t^T, v̄_t^T]^T. It is a complete BSS model, and by equation 3.33, the stability conditions depend not only on the distributions of the source signals in s_t but also on those of the noise signals in v̄_t. In other words, to analyze the stability of a noisy BSS algorithm, we must take into account the properties of the additive noises, which is impractical in many applications. Therefore, most algorithms for BSS assume that each additive noise is not too large (otherwise the algorithms would degrade severely or even fail to work at all; Cichocki et al., 1997), and the activation functions are generally selected according to equation 3.35. In any event, the stability issue of a noisy BSS algorithm deserves further study.

4 Conclusion

In this letter, we defined a generalized contrast function that takes the existing classical contrast and the nonsymmetrical contrast as its two special cases. It can handle the general blind separation problem where there are more
mixtures than sources and, at the same time, the source number is unknown. Then, by means of the classical and the generalized mutual information contrast functions, we justified the finding that the natural gradient algorithm can be employed to perform the complete or overdetermined BSS whether the source number n is known or not, simply by adjusting the dimension of the separating matrix (n × n, n × m, or m × m). Finally, we analyzed the local stability of the generalized orthogonal natural gradient algorithm (Ye et al., 2004) and obtained the expected result that the nonlinear activation functions of the overdetermined BSS algorithm are specified by somewhat stricter conditions than those of the complete BSS algorithm.

Acknowledgments

We are grateful to the anonymous reviewers and Terrence J. Sejnowski for their valuable comments and suggestions, which helped to greatly improve the quality and clarity of the presentation. We also thank the anonymous reviewers for bringing our attention to Cichocki et al. (1994, 1997). This work was supported by the Major Program of the National Natural Science Foundation of China under grant 60496311 and by the Chinese Postdoctoral Science Foundation under grant 2004035061.

References

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Amari, S., Chen, T. P., & Cichocki, A. (1997). Stability analysis of learning algorithms for blind source separation. Neural Networks, 10(8), 1345–1351.
Amari, S., Chen, T. P., & Cichocki, A. (2000). Nonholonomic orthogonal learning algorithms for blind source separation. Neural Computation, 12(6), 1463–1484.
Amari, S., & Cichocki, A. (1998). Adaptive blind signal processing: Neural network approaches. Proceedings IEEE, 86(10), 2026–2048.
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Bell, A. J., & Sejnowski, T. J. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Cao, X. R., & Liu, R. W. (1996). A general approach to blind source separation. IEEE Trans. on Signal Processing, 44(3), 562–571.
Cardoso, J. F. (1998). Blind source separation: Statistical principles. Proc. IEEE, 86(10), 2009–2025.
Cardoso, J. F. (1999). High-order contrasts for independent component analysis. Neural Computation, 11(1), 157–192.
Cardoso, J. F. (2000). On the stability of source separation algorithms. Journal of VLSI Signal Processing, 26(1), 7–14.
Cardoso, J. F., & Laheld, H. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12), 3017–3029.
Cichocki, A., Karhunen, J., Kasprzak, W., & Vigario, R. (1999). Neural networks for blind separation with unknown number of sources. Neurocomputing, 24(1), 55–93.
Cichocki, A., Sabala, I., Choi, S., Orsier, B., & Szupiluk, R. (1997). Self-adaptive independent component analysis for sub-gaussian and super-gaussian mixtures with an unknown number of sources and additive noise. International Symposium on Nonlinear Theory and Its Applications, 2, 731–734.
Cichocki, A., Unbehauen, R., & Rummert, E. (1994). Robust learning algorithm for blind separation of signals. Electronics Letters, 30(17), 1386–1387.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3), 287–314.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Delfosse, N., & Loubaton, P. (1995). Adaptive blind separation of independent sources: A deflation approach. Signal Processing, 45(1), 59–83.
Douglas, S. C. (2002). Simple algorithms for decorrelation-based blind source separation. IEEE Workshop on Neural Networks for Signal Processing, 12, 545–554.
Girolami, M. (1999). Self-organising neural networks: Independent component analysis and blind source separation. London: Springer-Verlag.
Golub, G. H., & van Loan, C. F. (1996). Matrix computations. Baltimore, MD: Johns Hopkins University Press.
Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Hyvarinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1483–1492.
Karhunen, J., Pajunen, P., & Oja, E. (1998). The nonlinear PCA criterion in blind source separation: Relations with other approaches. Neurocomputing, 22(1), 5–20.
Lewicki, M., & Sejnowski, T. J. (2000). Learning overcomplete representations. Neural Computation, 12(2), 337–365.
Li, Y. Q., & Wang, J. (2002). Sequential blind extraction of instantaneously mixed sources. IEEE Trans. on Signal Processing, 50(5), 997–1006.
Mathis, H., & Douglas, S. C. (2002). On the existence of universal nonlinearities for blind source separation. IEEE Trans. on Signal Processing, 50(5), 1007–1016.
Moreau, E., & Macchi, O. (1996). High-order contrasts for self-adaptive source separation. International Journal of Adaptive Control and Signal Processing, 10(1), 19–46.
Moreau, E., & Thirion-Moreau, N. (1999). Nonsymmetrical contrasts for source separation. IEEE Trans. on Signal Processing, 47(8), 2241–2252.
Ohata, M., & Matsuoka, K. (2002). Stability analysis of information-theoretic blind separation algorithms in the case where the sources are nonlinear processes. IEEE Trans. on Signal Processing, 50(1), 69–77.
Oja, E. (1997). The nonlinear PCA learning rule and signal separation: Mathematical analysis. Neurocomputing, 17(1), 25–46.
Pham, D. T. (2002). Mutual information approach to blind separation of stationary sources. IEEE Trans. on Information Theory, 48(7), 1935–1946.
Pham, D. T., & Cardoso, J. F. (2001). Blind separation of instantaneous mixtures of nonstationary sources. IEEE Trans. on Signal Processing, 49(9), 1837–1848.
Thawonmas, R., Cichocki, A., & Amari, S. (1998). A cascade neural network for blind extraction without spurious equilibria. IEICE Trans. on Fundamentals of Electronics, Communications, & Computer Science, E81-A(9), 1833–1846.
Tong, L., Liu, R., Soon, V. C., & Huang, Y. F. (1991). Indeterminacy and identifiability of blind identification. IEEE Trans. on Circuits and Systems, 38(5), 499–509.
von Hoff, T. P., Lindgren, A. G., & Kaelin, A. N. (2000). Transpose properties in the stability and performance of the classical adaptive algorithms for blind source separation and deconvolution. Signal Processing, 80(9), 1807–1822.
Yang, H. H., & Amari, S. (1997). Adaptive on-line learning algorithms for blind separation: Maximum entropy and minimum mutual information. Neural Computation, 9(5), 1457–1482.
Ye, J. M., Zhu, X. L., & Zhang, X. D. (2004). Adaptive blind separation with an unknown number of sources. Neural Computation, 16(8), 1641–1660.
Zhang, L. Q., Cichocki, A., & Amari, S. (1999). Natural gradient algorithm for blind separation of over-determined mixtures with additive noise. IEEE Signal Processing Letters, 6(11), 293–295.
Zhu, X. L., & Zhang, X. D. (2002). Adaptive RLS algorithm for blind source separation using a natural gradient. IEEE Signal Processing Letters, 9(12), 432–435.
Zhu, X. L., & Zhang, X. D. (2004). Overdetermined blind source separation based on singular value decomposition. Journal of Electronics and Information Technology, 26(3), 337–343. (in Chinese)
Received June 18, 2004; accepted August 10, 2005.
LETTER
Communicated by Nicholas Hatsopoulos
Connection and Coordination: The Interplay Between Architecture and Dynamics in Evolved Model Pattern Generators Sean Psujek
[email protected] Department of Biology, Case Western Reserve University, Cleveland, OH 44106, U.S.A.
Jeffrey Ames
[email protected] Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, U.S.A.
Randall D. Beer
[email protected] Department of Biology and Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, U.S.A.
We undertake a systematic study of the role of neural architecture in shaping the dynamics of evolved model pattern generators for a walking task. First, we consider the minimum number of connections necessary to achieve high performance on this task. Next, we identify architectural motifs associated with high fitness. We then examine how high-fitness architectures differ in their ability to evolve. Finally, we demonstrate the existence of distinct parameter subgroups in some architectures and show that these subgroups are characterized by differences in neuron excitabilities and connection signs.

Neural Computation 18, 729–747 (2006)
© 2006 Massachusetts Institute of Technology

1 Introduction

From molecules to cells to animals to ecosystems, biological systems are typically composed of large numbers of heterogeneous nonlinear dynamical elements densely interconnected in specific networks. Understanding such systems necessarily involves understanding not only the dynamics of their elements, but also their architecture of interconnection. Interest in the role of network architecture in complex systems has been steadily growing for several years, with work on a diverse range of systems, including genetic networks, metabolic networks, signaling networks, nervous systems, food webs, social networks, and the Internet (Watts & Strogatz, 1998; Jeong, Tombor, Albert, Oltvai, & Barabási, 2000; Strogatz, 2001; Guelzim, Bottani,
Bourgine, & Képès, 2002; Garlaschelli, Caldarelli, & Pietronero, 2003; Newman, 2003). Most recent research on complex networks has focused primarily on structural questions. For example, studies of a wide variety of naturally occurring networks have found that small-world structures are common (Watts & Strogatz, 1998). Structural questions have also been a major concern in neuroscience (van Essen, Anderson, & Felleman, 1992; Sporns, Tononi, & Edelman, 2000; Braitenberg, 2001). In addition, research on the dynamics of network growth has begun to provide insight into how observed network structures might arise. For example, preferential attachment of new links during network growth can produce scale-free network architectures (Barabási & Albert, 1999).

An equally important but less well-studied aspect of complex networks is how network architecture shapes the dynamics of the elements it interconnects. For example, some architectures lend robustness to perturbations of both parameters and topology, while others do not (Albert, Jeong, & Barabási, 2000; Stelling, Klamt, Bettenbrock, Schuster, & Gilles, 2002). Again, the influence of circuit architecture on neural activity has long been a major concern in neuroscience, especially in the invertebrate pattern generation community, where detailed cellular and synaptic data are sometimes available (Getting, 1989; Marder & Calabrese, 1996; Roberts, 1998). However, while a great deal of work has been done on nonlinear oscillators coupled in regular patterns (Pikovsky, Rosenblum, & Kurths, 2001), there has been very little research on nonlinear dynamical systems connected in irregular but nonrandom patterns. Yet, arguably, this is the case most relevant to biological systems.

In this article, we undertake a systematic study of the role of network architecture in shaping the dynamics of evolved model pattern-generation circuits for walking (Beer & Gallagher, 1992).
While simple, this walking task raises a number of interesting coordination issues and has been extensively analyzed (Chiel, Beer, & Gallagher, 1999; Beer, Chiel, & Gallagher, 1999), providing a solid foundation for detailed studies of the interplay between architecture and dynamics. We first consider the minimum number of connections necessary to achieve high performance on this task. Next, we identify architectural motifs that are associated with high fitness and study the impact of architecture on evolvability. Finally, we demonstrate the existence of distinct parameter subgroups in some architectures and show that these subgroups are characterized by differences in neuron excitabilities and connection signs. 2 Methods We examined the effect of architecture on the evolution of central pattern generators for walking in a simple legged body (Beer & Gallagher, 1992). The body consisted of a single leg with a joint actuated by two opposing swing
Connection and Coordination
731
“muscles” and a foot. When the foot was “down,” any torque produced by the muscles served to translate the body under Newtonian mechanics. When the foot was “up,” any torque produced by the muscles served to swing the leg relative to the body. Details of the body model and its analysis can be found in Beer et al. (1999). This leg was controlled by a continuous-time recurrent neural network (CTRNN): τi y˙ i = −yi +
N
wji σ (y j + θ j ) i = 1, . . . , N
j=1
where y_i is the state of the ith neuron, ẏ_i denotes the time rate of change of this state, τ_i is the neuron's membrane time constant, w_{ji} is the strength of the connection from the jth to the ith neuron, θ_i is a bias term, and σ(x) = 1/(1 + e^{−x}) is the standard logistic output function. We interpret a self-connection w_{ii} as a simple nonlinear active conductance rather than as a literal connection. We focus here on three-, four-, and five-neuron CTRNNs. Three of these neurons are always motor neurons that control the two opposing muscles of the leg (labeled BS for backward swing and FS for forward swing) and the foot (labeled FT), while any additional neurons are interneurons (labeled INTn) with no preassigned function.

A real-valued genetic algorithm was used to evolve CTRNN parameters. A population of 100 individuals was maintained, with each individual encoded as a vector of N^2 + 2N real numbers (N time constants, N biases, and N^2 connection weights). Elitist selection was used to preserve the best individual each generation, whereas the remaining children were generated by mutation of selected parents. Individuals were selected for mutation using a linear rank-based method, with the best individual producing an average of 1.1 offspring. A selected parent was mutated by adding to it a random displacement vector with uniformly distributed direction and normally distributed magnitude (Bäck, 1996). The mutation magnitude had zero mean and a variance of 0.5. Searches were run for 250 generations. Connection weights and biases were constrained to lie in the range ±16, while time constants were constrained to the range [0.5, 10].

The walking performance measure optimized by the genetic algorithm was the average forward velocity of the body. This average velocity was computed in two ways.
During evolution, truncated fitness was evaluated by integrating the model for 220 time units using the forward Euler integration method with a step size of 0.1 and then computing the average velocity (total forward distance covered in 220 time units divided by 220). During analysis, asymptotic fitness was evaluated by integrating the model for 1000 time units to skip transients and then computing its average velocity for one stepping period (with a fitness of 0 assigned to nonoscillatory circuits). Although asymptotic fitness more accurately describes the long-term
732
S. Psujek, J. Ames, and R. Beer
performance of a circuit, truncated fitness is much less expensive to compute during evolutionary searches. In both cases, the highest average velocity achievable is known to be 0.627 from a previous analysis of the optimal controller for this task and body model (Beer et al., 1999). The best truncated fitness that can be achieved by a nonoscillatory circuit (which takes only a single step) is also known to be 0.125.

We define an architecture to be a set of directed connections between neurons. Since we do not consider self-connections to be part of an architecture, there are N^2 − N possible interconnections in an N-neuron circuit, and thus 2^(N^2−N) possible architectures. However, not all of these architectures are unique. When more than one interneuron is present, all permutations of interneuron labels that leave an architecture invariant should be counted only once since, unlike motor neurons, the interneuron labels are arbitrary. Counting the number of unique N-neuron architectures is an instance of the partially labeled graph isomorphism problem, which can be solved using Pólya's enumeration theorem (Harary, 1972). We found that there are 64 distinct three-neuron architectures, 4096 distinct four-neuron architectures, and 528,384 distinct five-neuron architectures. Details of the Pólya theory calculations for the five-neuron case can be found in Ames (2003).

The studies described in this article are based on the results of evolutionary searches on a large sample of different three-, four-, and five-neuron architectures. In the three- and four-neuron cases, the samples were exhaustive. We ran all 64 three-neuron architectures (300 random seeds each) and all 4096 four-neuron architectures (200 random seeds each). Because the number of five-neuron architectures was so large, we ran only a sample of 5000 five-neuron architectures (100 random seeds each). Thus, 1,338,400 evolutionary searches were run to form our baseline data set.
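The architecture counts quoted above can be reproduced directly with Burnside's lemma (a cross-check we add for illustration): an architecture is a subset of the N^2 − N directed connections, counted up to permutations of the interneuron labels, so the number of distinct architectures is the sum, over those permutations, of 2 raised to the number of edge orbits, divided by the number of permutations.

```python
from itertools import permutations
from math import factorial

def count_architectures(n_neurons, n_interneurons):
    """Count directed-connection architectures (no self-connections) up to
    relabeling of the interneurons, via Burnside's lemma."""
    edges = [(i, j) for i in range(n_neurons) for j in range(n_neurons) if i != j]
    motor = n_neurons - n_interneurons
    total = 0
    for perm in permutations(range(motor, n_neurons)):
        p = list(range(motor)) + list(perm)   # identity on motor neurons
        seen, orbits = set(), 0
        for e in edges:                       # count orbits of the edge action
            if e in seen:
                continue
            orbits += 1
            cur = e
            while True:
                cur = (p[cur[0]], p[cur[1]])
                seen.add(cur)
                if cur == e:
                    break
        total += 2 ** orbits                  # architectures fixed by this relabeling
    return total // factorial(n_interneurons)

print(count_architectures(3, 0), count_architectures(4, 1), count_architectures(5, 2))
# → 64 4096 528384
```

With no interneuron symmetry (three and four neurons), every connection subset is distinct; for five neurons, swapping the two interneurons fixes 2^13 of the 2^20 patterns, giving (2^20 + 2^13)/2 = 528,384.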
A total of 850,900 additional experiments were run as described below to augment this baseline data set when necessary.
3 A Small Number of Connections Suffice

How many connections are required to achieve high performance on the walking task? In the absence of any architectural constraints, a common simplification is to use fully interconnected networks because they contain all possible architectures as subcircuits. However, the number of connections between N neurons has been observed to scale roughly linearly in mammals (Stevens, 1989), much slower than the O(N^2) scaling produced by full interconnection. Thus, our first task was to characterize the relationship between the number of connections and the best attainable fitness. There are N^2 − N = 6, 12, and 20 possible connections for three-, four-, and five-neuron circuits, respectively. A uniform sample of architectures
Connection and Coordination
733
Figure 1: Maximum asymptotic fitness obtained by all evolutionary searches in our baseline data set as a function of the number of connections by three-neuron (dashed), four-neuron (gray), and five-neuron (black) circuits.
leads to a nonuniform sample of number of connections because there are (N^2 − N choose C) architectures with C connections (0 ≤ C ≤ N^2 − N). Thus, most architectures of an N-neuron circuit have close to (N^2 − N)/2 connections. Because of the binomial distribution of architectures having a given number of connections, our sample of 5000 five-neuron architectures contained very few architectures with few or many connections. In order to compensate for this bias, we augmented our baseline data set with 732,300 additional five-neuron experiments that exhaustively covered the five-neuron architectures having 0 to 5 and 18 to 20 connections. Figure 1 plots the very best asymptotic fitness obtained for three- (dashed line), four- (gray line), and five-neuron (solid line) architectures as a function of number of connections. Regardless of the number of neurons, circuits with fewer than two connections have essentially zero fitness, while circuits with more than two connections have high fitness. The reason for this difference is that it takes at least two connections to link three motor neurons, and it takes at least three connections to form an oscillator involving all three motor neurons. Most interesting, although there is an increase in best fitness with larger numbers of connections in four- and five-neuron circuits, the additional benefit has saturated by about five connections. Thus, circuits far sparser than fully interconnected ones are sufficient to achieve high performance on the walking task.
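The binomial bias described here is easy to quantify for labeled five-neuron circuits (counts are over all 2^20 connection patterns, before the Pólya reduction); a short sketch:

```python
from math import comb

E = 20                                  # N^2 - N possible connections, N = 5
total = 2 ** E                          # all labeled connection patterns
counts = [comb(E, c) for c in range(E + 1)]

print(counts[10])                       # 184756: the bulk sits near E/2 = 10
sparse = sum(counts[:6])                # patterns with 0-5 connections
print(sparse, sparse / total)           # 21700, about 2% -- hence the extra
                                        # exhaustive runs on the sparse tail
```

The same calculation for the dense tail (18 to 20 connections) gives an even smaller fraction, which is why both tails had to be covered exhaustively rather than sampled.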
4 Architectural Motifs Predict Performance

Which architectures perform well on the walking task, and what particular connectivity features predict the best fitness that an architecture can achieve? There is growing evidence for recurring network motifs in biological networks, leading to the hope that general structural design principles may exist (Milo et al., 2002). In order to explore the existence of architectural motifs in our model and their correlation with fitness, we analyzed our three-neuron data in detail and then tested the patterns we found against our four- and five-neuron data. If we plot the best asymptotic fitness obtained over all runs of each three-neuron architecture (see Figure 2A), the data clearly fall into three distinct fitness groups, with wide gaps between them. This suggests that architecture strongly constrains the maximum achievable fitness of a circuit and that three separate classes of architectures may exist. Behaviorally, architectures from the low-fitness group (29/64) produced at most a single step. Architectures from the middle-fitness group (8/64) stepped rhythmically, but either the swing or stance phase of the motion was very slow. Closer inspection revealed that one of the swing motor neurons always adopted a fixed output, while the foot and the other swing motor neuron rhythmically oscillated. Interestingly, the constant outputs adopted by each swing motor neuron were consistently distinct in the best circuits in this group. When BS was the constant output motor neuron, the output it adopted was always around 0.7. In contrast, when FS was the constant output motor neuron, it adopted an output value around 0.3. Finally, architectures from the high-fitness group (27/64) exhibit fast rhythmic stepping. What architectural features characterize these three fitness groups? An example three-neuron architecture from each group is shown in the left column of Figure 3. Architectures in the low-fitness group lack feedback
Figure 2: Fitness classification of architectural motifs. In all cases, the horizontal axis represents an arbitrary architecture label. (A) The maximum asymptotic fitnesses obtained by all evolutionary searches with each of the 64 possible three-neuron architectures (black points) fall into three distinct fitness groups (indicated by gray rectangles). Architectures can independently be classified by their connectivity patterns (labeled class 1, class 2, and class 3). Note that architecture class strongly predicts fitness group in three-neuron circuits. (B) Maximum asymptotic fitness for all four-neuron baseline searches, with the data classified as class 1 (black points), class 2 (gray points), or class 3 (crosses) based on the connectivity pattern of each architecture. The dashed line indicates the maximum fitness obtainable given the strategy used by the best class 2 architectures. (C) Maximum asymptotic fitness for all five-neuron baseline searches classified by connectivity pattern.
Figure 3: Three-, four-, and five-neuron examples of the three architecture classes identified in Figure 2.
loops that link foot and swing motor neurons. Because they cannot achieve oscillatory activity involving both the foot and a swing motor neuron, these circuits are unable to produce rhythmic stepping. Architectures in the middle-fitness group possess feedback loops between the foot and one of the swing motor neurons, but these feedback loops do not drive the other swing motor neuron. Thus, these circuits can oscillate, but one direction of leg motion is always slowed by constant activity in the opposing
Table 1: Definition of the Three Architecture Classes.

CD(FT)  CD(BS)  CD(FS)  Class  Fitness
T       T       T       1      High
T       T       F       2      Medium
T       F       T       2      Medium
T       F       F       3      Low
F       T       T       3      Low
F       T       F       3      Low
F       F       T       3      Low
F       F       F       3      Low

Note: The predicate CycleDriven(m) has been abbreviated to CD(m). T = true; F = false.
swing motor neuron. Architectures in the high-fitness group contain feedback loops that either involve or drive all three motor neurons. This pattern of feedback allows these circuits to produce coordinated oscillations in all three motor neurons. These results suggest that neural circuits can be partitioned into distinct classes based solely on their connectivity and that these architecture classes might strongly predict the best obtainable fitness. In order to test the generality of these predictions in larger circuits, we must first state the definition of each architecture class precisely and in such a way that it can be applied to circuits with interneurons. Let the predicate CycleDriven(m) be true of a motor neuron m in a particular architecture if and only if m either participates in or is driven by a feedback loop in that architecture. Since we have three motor neurons, there are eight possibilities, which are classified according to the architectural patterns observed above (see Table 1). By definition, classes 1, 2, and 3 for three-neuron circuits are fully consistent with the high-, middle-, and low-fitness groups shown in Figure 2A, respectively. Examples of each of the three classes for four-neuron and five-neuron circuits are shown in Figure 3. We next tested the ability of this architecture classification to predict best asymptotic fitness in our four- and five-neuron data sets (see Figures 2B and 2C, respectively). In the four-neuron circuits, 2617/4096 architectures were class 1, 528/4096 were class 2, and 951/4096 were class 3. In the five-neuron circuits, 3991/5000 were class 1, 488/5000 were class 2, and 521/5000 were class 3. In both cases, class 1 (black points), class 2 (gray points), and class 3 (crosses) were strongly correlated with high, middle, and low fitness, respectively.
However, unlike in the three-neuron case, there was some fitness overlap between class 1 and class 2 architectures (12/4096 for four-neuron circuits and 37/5000 for five-neuron circuits) and a small amount of fitness overlap between class 2 and class 3 architectures in the five-neuron case (2/5000). We hypothesized that this overlap was caused by an insufficient number of searches for these architectures, so that these architectures had not yet
achieved their best attainable fitness. To test this hypothesis, we performed additional searches on all overlap architectures. As a control, we also ran the same number of additional searches for two class 2 architectures with comparable fitness for each class 1 overlap architecture and with two class 3 architectures for each class 2 overlap architecture. We ran 43,500 additional experiments in the four-neuron case and 75,100 additional experiments in the five-neuron case. After these additional experiments, only three class 1 overlap architectures remained in the four-neuron case (see Figure 4A), and only two class 1 overlap architectures and one class 2 overlap architecture remained in the five-neuron case (see Figure 4B). Interestingly, all remaining overlap architectures contained independent cycles, in which subgroups of motor neurons were driven by separate feedback loops (see Figure 4C). Even if the oscillations produced by independent cycles are individually appropriate for walking, they will not in general be properly coordinated unless their initial conditions are precisely set. However, this cannot be done stably unless some other source of coupling is present, such as shared sensory feedback or mechanical coupling of the relevant body degrees of freedom. Independent cycle architectures aside, the fitness separation between class 2 and class 3 architectures is quite large. However, the boundary between class 1 and class 2 architectures is very sharp, occurring at a fitness value of around 0.47. What defines this fitness boundary, and how can we calculate its exact value? As noted above, one swing neuron always has a constant output in class 2 architectures. By repeating the optimal fitness calculations described in appendix A of Beer et al. 
(1999) with the constraint that either BS or FS must be held constant, we obtain expressions for the optimal fitness achievable as a function of these constant values:

V∗(BS) = (55/2) [ (√85/4)(5π/3)/√(1 − BS) + √6/√BS ]^(−1),

V∗(FS) = 165 √FS √(1 − FS) / [ √85 (√6 √FS + 8 √(1 − FS)) / (15π) ].
These expressions can be maximized exactly, but it is sufficient for our purposes here to do so only numerically. We find that when BS is the constant motor neuron, the highest fitness is achieved at BS∗ ≈ 0.709. In contrast, when FS is held constant, the highest fitness is achieved at FS∗ ≈ 0.291. Note that these values correspond closely to the constant outputs observed in the best-evolved class 2 circuits. The maximum fitnesses are the same in both cases: V∗(FS∗) = V∗(BS∗) ≈ 0.473, which is very close to the observed boundary between class 1 and class 2 (dashed lines in Figure 2). This value serves as an upper bound for the best fitness achievable by a class 2 architecture.
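The CycleDriven predicate and the class assignments of Table 1 are straightforward to implement. A sketch (the graph encoding and function names are ours): a node is on a cycle if and only if it can reach itself, and a motor neuron is cycle-driven if it is on a cycle or reachable from one.

```python
def cycle_driven(adj, nodes):
    """Nodes that participate in or are driven by a feedback loop,
    for a digraph given as {node: set(successors)}."""
    def reachable(src):
        seen, stack = set(), [src]
        while stack:
            for w in adj.get(stack.pop(), ()):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen
    on_cycle = {v for v in nodes if v in reachable(v)}
    driven = set()
    for v in on_cycle:
        driven |= reachable(v)
    return on_cycle | driven

def classify(adj, motors=("FT", "BS", "FS")):
    """Architecture class per Table 1."""
    nodes = set(adj) | {w for ws in adj.values() for w in ws}
    cd = cycle_driven(adj, nodes)
    ft, bs, fs = (m in cd for m in motors)
    if ft and bs and fs:
        return 1          # all three motor neurons cycle-driven
    if ft and (bs or fs):
        return 2          # foot plus exactly one swing neuron
    return 3

# ring through all three motor neurons: class 1
ring = {"FT": {"BS"}, "BS": {"FS"}, "FS": {"FT"}}
print(classify(ring))                          # 1
print(classify({"FT": {"BS"}, "BS": {"FT"}}))  # 2: FT-BS loop only
print(classify({"FT": {"BS"}, "BS": {"FS"}}))  # 3: no feedback loop
```

The three example calls reproduce the three rows of behavior discussed in the text: a full motor-neuron ring is class 1, a loop omitting one swing neuron is class 2, and a loop-free chain is class 3.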
Figure 4: Investigation of the fitness overlaps between architecture classes in Figures 2B and 2C. (A) Data obtained from additional evolutionary searches with the class 1 overlap architectures from Figure 2B and class 2 architectures of comparable fitness. Note that only three class 1 overlap architectures remain. (B) Data obtained from additional evolutionary searches with the class 1 and class 2 overlap architectures from Figure 2C and, respectively, class 2 and class 3 architectures of comparable fitness. Note that only two class 1 and one class 2 overlap architectures remain. (C) Examples of overlap architectures from A and B. (Left) A class 1 four-neuron independent cycles architecture from A whose best fitness lies in the range characteristic of class 2 architectures. Note that the BS and FS motor neurons occur in separate cycles. (Middle) A class 1 five-neuron independent cycles architecture from B whose best fitness lies in the range characteristic of class 2. Note that BS and FS occur in separate cycles. (Right) A class 2 five-neuron independent cycles architecture from B whose best fitness lies in the range characteristic of class 3. Note that the foot motor neuron FT occurs in a cycle separate from both BS and FS.
5 Architecture Influences Evolvability

Are high-fitness circuits easier to evolve with some architectures than others? The best fitness obtained over a set of evolutionary searches provides a lower bound on the maximum locomotion performance that can be achieved with a given architecture. In contrast, the average of the best fitness obtained in each of a set of searches provides information about the difficulty of finding high-performance circuits with that architecture through evolutionary search. The lower this average is relative to the best fitness achievable with a given architecture, the less frequently evolutionary runs with that architecture attain high fitness, and thus the more difficult that architecture is to evolve. In order to examine the impact of architecture on evolvability, we examined scatter plots of best and average asymptotic fitness for all five-neuron circuit architectures that we evolved (see Figure 5A). Qualitatively identical results were obtained for the three- and four-neuron circuit architectures when using average or median fitness as a surrogate for evolvability. In this plot, the three architecture classes described in the previous section are apparent along the best fitness (horizontal) axis, but no such groupings exist along the average fitness (vertical) axis. Instead, for any given best fitness, there is a range of average fitnesses. This suggests that architectures with the same best achievable fitness can differ significantly in their evolvability. Interestingly, the spread of average fitnesses increases with best fitness, so that the largest range of evolvability occurs for the best architectures in each architecture class. We will focus on the class 1 architectures with the highest best fitness. In order to characterize these differences in evolvability, two subpopulations of five-neuron architectures whose best fitness was greater than 0.6 were studied.
The high-evolvability subgroup had average fitnesses that were greater than 0.38 (N = 39), while the low-evolvability subgroup had average fitnesses that were less than 0.1 (N = 34). These subgroups are indicated by light gray rectangles in Figure 5A. Using 10^6 random samples
Figure 5: An analysis of evolvability. (A) A scatter plot of average and best fitness for all five-neuron evolutionary searches in our baseline data set. Architectures of class 1, 2, and 3 are represented as black points, gray points, and crosses, respectively, as in Figure 2. The gray rectangles indicate two subgroups of high best fitness architectures with high- (upper rectangle) and low- (lower rectangle) evolvability. (B) Mean truncated fitness distributions for the high- (black) and low- (gray) evolvability subgroups from A based on 10^6 parameter samples for each architecture. Note that the vertical scale is logarithmic. (C) Fraction (mean ±SD) of parameter space samples whose truncated fitness is greater than 0.125 for low- (gray bars) and high- (white bars) evolvability subgroups of three-, four-, and five-neuron architectures.
from the parameter spaces of each architecture, we computed the mean truncated fitness distribution for each subgroup of architectures (see Figure 5B), with the high-evolvability subgroup distribution denoted by a black line and the low-evolvability subgroup distribution denoted by a gray line. Truncated rather than asymptotic fitness is the appropriate measure here because it is the one used during evolution. These fitness distributions exhibit several interesting features. The fraction of samples below a truncated fitness of 0.125 is several orders of magnitude larger than the fraction above 0.125. This reflects the extreme scarcity of high-fitness oscillatory behavior in the parameter spaces of architectures in both subgroups. Below 0.125, the distributions are nearly identical for the two subgroups, with strong peaks at 0 (no steps) and 0.125 (a single step). However, above 0.125, the fitness distributions of the two subgroups exhibit a clear difference. While both the low- and high-evolvability subgroups follow power law distributions within this range (with exponents of −2.23 and −3.28, respectively), a larger fraction of the parameter spaces of the high-evolvability architectures clearly have a fitness greater than 0.125. This suggests that the difference in evolvability between the two subgroups is due primarily to differences in the probability of finding an oscillatory circuit whose fitness is higher than that of a single stepper. Plots of the mean fraction of parameter space volume with truncated fitness greater than 0.125 (see Figure 5C) demonstrate that this conclusion holds not only for five-neuron circuits, but also for analogous subgroups of low- and high-evolvability three-neuron and four-neuron architectures. Using a two-sample Kolmogorov-Smirnov test, the four- and five-neuron differences are highly significant ( p < 0.00001), while the significance of the three-neuron difference is borderline ( p < 0.07). 
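Power law exponents such as those quoted above can be estimated from tail samples with the standard maximum-likelihood (Hill) estimator. The sketch below applies it to synthetic data; it illustrates the estimator, not the exact fitting procedure used in the study, which is not specified here.

```python
import math
import random

def powerlaw_mle(samples, xmin):
    """MLE of alpha for p(x) ~ x^(-alpha), x >= xmin:
    alpha = 1 + n / sum(ln(x_i / xmin))."""
    tail = [x for x in samples if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

# synthetic tail data drawn from p(x) ~ x^(-2.23) above the single-step
# fitness 0.125, via inverse-CDF sampling
random.seed(0)
alpha_true, xmin = 2.23, 0.125
data = [xmin * (1 - random.random()) ** (-1 / (alpha_true - 1))
        for _ in range(100_000)]

print(powerlaw_mle(data, xmin))  # recovers a value close to 2.23
```

On real fitness samples, xmin would be the 0.125 truncation point and the estimator would be applied separately to the high- and low-evolvability subgroups.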
Ultimately, we would like to relate the observed differences in evolvability to particular architectural features, as we did for best fitness in section 4. Although we found some evidence for correlations between evolvability and the presence of particular feedback loops (Ames, 2003), none of these correlations was particularly strong. The best predictor of evolvability that we found was the fraction of an architecture’s parameter space with fitness greater than 0.125.
6 Beyond Architecture

The results clearly demonstrate that circuit architecture plays a major role in determining both the maximum attainable performance and the evolvability of model pattern generators. Of course, architecture alone is insufficient to completely specify circuit dynamics. Ultimately, we would like to refine our architectural classification with quantitative information. Are there patterns in the best parameter sets discovered by multiple evolutionary searches with a given architecture?
To begin to explore this question, we studied one of the highest-performing three-neuron architectures: a class 1 architecture consisting of a counterclockwise ring of connections among the three motor neurons. We performed a principal component analysis of the parameter sets of all evolutionary runs with this architecture whose best truncated fitness exceeded 0.5 (90/300). The first two principal components are plotted in Figure 6. Clearly, the evolved parameter sets are neither identical nor randomly distributed. Instead, they fall into two distinct clusters. What network features underlie these parameter clusters? Computing the means of the circuit parameters for each cluster separately reveals that they correspond to distinct sign patterns. Circuits in the left cluster have three intrinsically active neurons arranged in an inhibitory ring oscillator. In contrast, circuits in the right cluster have one intrinsically active and two intrinsically inactive neurons arranged in a mixed excitatory/inhibitory ring oscillator. In addition, the sign of the self-weight of FT changes from negative to positive between the left and right clusters, and the self-weight sign of BS changes from positive to negative. This suggests that neuron excitabilities and connection signs may represent an intermediate level of analysis between connectivity and raw parameter values.

7 Discussion

Complex networks are ubiquitous in the biological world, and understanding the dynamics of such networks is arguably one of the most important theoretical obstacles to progress in many subdisciplines of biology. Most research on networks has focused on either structural questions that largely ignore node dynamics or network dynamics questions that assume a regular or random connection topology. However, realistic networks have both nontrivial node dynamics and specific but irregular connection topologies (Strogatz, 2001).
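The kind of principal component analysis used above can be sketched with a plain SVD. The synthetic two-cluster data below merely stands in for the evolved parameter sets (the dimensionality and cluster centers are illustrative, not the evolved values):

```python
import numpy as np

rng = np.random.default_rng(0)
# two synthetic clusters of "parameter vectors" in 12 dimensions
a = rng.normal(loc=-2.0, scale=0.5, size=(45, 12))
b = rng.normal(loc=+2.0, scale=0.5, size=(45, 12))
X = np.vstack([a, b])

Xc = X - X.mean(axis=0)                      # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                           # principal component scores
var_explained = S**2 / np.sum(S**2)

print(var_explained[:2])                     # PC1 dominates for separated clusters
```

Plotting the first two columns of `scores` against each other reproduces the kind of two-cluster picture shown in Figure 6; here the clusters separate cleanly along PC1.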
As a first step in this direction, we have systematically studied the impact of neural architecture on walking performance in a large population of evolved model pattern generators for walking. Specifically, we have shown that a small number of connections is sufficient to achieve high fitness on this task, characterized the correlation between architectural motifs and fitness, explored the impact of architecture on evolvability, and demonstrated the existence of parameter subgroups with distinct neuron excitabilities and connection signs. These results lay the essential groundwork for a more detailed analysis of the interplay between architecture and dynamics. We have explained the observed correlations between architecture and best fitness in terms of the structure of feedback loops in the circuits, while the relationship between architecture and evolvability was explained in terms of the fraction of an architecture’s parameter space that contains oscillatory dynamics whose fitness is greater than that obtainable by nonoscillatory circuits. However, several questions remain. How do different architectures differ in their
Figure 6: Two variants of the three-neuron architecture consisting of a counterclockwise ring. A principal components analysis of the parameters of evolved circuits whose best truncated fitness exceeds 0.5 reveals two subgroups corresponding to distinct neuron excitability and connection sign patterns. Here inhibitory connections are denoted by a filled circle, excitatory connections are denoted by a short line, and neurons are shaded according to whether they are intrinsically active (white) or inactive (black). These two principal components account for 87.6% of the variance (78.8% for PCA1 and 8.8% for PCA2).
dynamical operation? Which excitability and sign variants of a given architecture can achieve high fitness? What underlies the fitness differences between architectures within a class? What architectural properties produce the parameter space differences responsible for the observed differences in evolvability? Ultimately such questions can be answered only by detailed studies of particular circuits in our existing data set (Beer, 1995; Chiel et al., 1999; Beer et al., 1999). There has been a great deal of interest in the use of evolutionary algorithms to evolve not only neural parameters but also neural architecture (Angeline, Saunders, & Pollack, 1994; Yao, 1999; Stanley & Miikkulainen, 2002). However, this previous work provides little understanding as to why a particular architecture is chosen for a given problem, or how the structure of the space of architectures biases an evolutionary search. Our approach is complementary. While it is obviously impractical to evaluate all possible architectures on a given task, a systematic study such as ours can provide a foundation for the analysis of architectural evolution. Given our substantial data on the best architectures and their evolvability for the walking task, it could serve as an interesting benchmark for comparing different architecture evolution algorithms and analyzing their behavior. In fact, several such algorithms have already been applied to a multilegged version of exactly this walking task (Gruau, 1995; Kodjabachian & Meyer, 1998). While the questions explored in this article are general ones, the importance of the particular feedback structures we have described is obviously specific to our walking task. Likewise, our evolvability results depend on the structure of the fitness space of each architecture, which in turn depends on the particular neural and mechanical models we chose and the performance measure we used. 
Examining a wider variety of neural models and tasks will be necessary to identify any general principles that might exist. As we have done here, it will be important for such studies to examine large populations of circuits, so that trends can be identified. In addition, the development of more powerful mathematical tools for studying the interplay of architecture and dynamics is essential. One promising development along these lines is the recent application of symmetry groupoid methods to analyze the constraints that network topology imposes on network dynamics (Stewart, Golubitsky, & Pivato, 2003).

Acknowledgments

We thank Hillel Chiel and Chad Seys for their feedback on an earlier draft of this article. This research was supported in part by NSF grant EIA-0130773 and in part by an NSF IGERT fellowship.

References

Albert, R., Jeong, H., & Barabási, A.-L. (2000). Error and attack tolerance of complex networks. Nature, 406, 378–382.
Ames, J. C. (2003). Design methods for pattern generation circuits. Master's thesis, Case Western Reserve University.
Angeline, P. J., Saunders, G. M., & Pollack, J. B. (1994). An evolutionary algorithm that constructs recurrent neural networks. IEEE Trans. Neural Networks, 5, 54–65.
Bäck, T. (1996). Evolutionary algorithms in theory and practice. New York: Oxford University Press.
Barabási, A. L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286, 509–512.
Beer, R. D. (1995). On the dynamics of small continuous-time recurrent neural networks. Adaptive Behavior, 3, 469–509.
Beer, R. D., Chiel, H. J., & Gallagher, J. C. (1999). Evolution and analysis of model CPGs for walking II. General principles and individual variability. J. Computational Neuroscience, 7, 119–147.
Beer, R. D., & Gallagher, J. C. (1992). Evolving dynamical neural networks for adaptive behavior. Adaptive Behavior, 1, 91–122.
Braitenberg, V. (2001). Brain size and number of neurons: An exercise in synthetic neuroanatomy. J. Computational Neuroscience, 10, 71–77.
Chiel, H. J., Beer, R. D., & Gallagher, J. C. (1999). Evolution and analysis of model CPGs for walking I. Dynamical modules. J. Computational Neuroscience, 7, 99–118.
Garlaschelli, D., Caldarelli, G., & Pietronero, L. (2003). Universal scaling relations in food webs. Nature, 423, 165–168.
Getting, P. (1989). Emerging principles governing the operation of neural networks. Annual Review of Neuroscience, 12, 185–204.
Gruau, F. (1995). Automatic definition of modular neural networks. Adaptive Behavior, 3, 151–183.
Guelzim, N., Bottani, S., Bourgine, P., & Képès, F. (2002). Topological and causal structure of the yeast transcriptional regulatory network. Nature Genetics, 31, 60–63.
Harary, F. (1972). Graph theory. Reading, MA: Addison-Wesley.
Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N., & Barabási, A.-L. (2000). The large-scale organization of metabolic networks. Nature, 407, 651–654.
Kodjabachian, J., & Meyer, J.-A. (1998). Evolution and development of neural controllers for locomotion, gradient-following, and obstacle avoidance in artificial insects. IEEE Trans. Neural Networks, 9, 796–812.
Marder, E., & Calabrese, R. L. (1996). Principles of rhythmic motor pattern generation. Physiological Reviews, 76, 687–717.
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., & Alon, U. (2002). Network motifs: Simple building blocks of complex networks. Science, 298, 824–827.
Newman, M. E. J. (2003). The structure and function of complex networks. SIAM Review, 45, 167–256.
Pikovsky, A., Rosenblum, M., & Kurths, J. (2001). Synchronization: A universal concept in nonlinear sciences. Cambridge: Cambridge University Press.
Roberts, P. D. (1998). Classification of temporal patterns in dynamic biological networks. Neural Computation, 10, 1831–1846.
Sporns, O., Tononi, G., & Edelman, G. M. (2000). Theoretical neuroanatomy: Relating anatomical and functional connectivity in graphs and cortical connection matrices. Cerebral Cortex, 10, 127–141.
Stanley, K. O., & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10, 99–127.
Stelling, J., Klamt, S., Bettenbrock, K., Schuster, S., & Gilles, E. D. (2002). Metabolic network structure determines key aspects of functionality and regulation. Nature, 420, 190–193.
Stevens, C. F. (1989). How cortical interconnectedness varies with network size. Neural Computation, 1, 473–479.
Stewart, I., Golubitsky, M., & Pivato, M. (2003). Symmetry groupoids and patterns of synchrony in coupled cell networks. SIAM J. Applied Dynamical Systems, 2, 609–646.
Strogatz, S. H. (2001). Exploring complex networks. Nature, 410, 268–276.
van Essen, D. C., Anderson, C. H., & Felleman, D. J. (1992). Information processing in the primate visual system: An integrated systems perspective. Science, 255, 419–423.
Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of "small-world" networks. Nature, 393, 440–442.
Yao, X. (1999). Evolving artificial neural networks. Proc. of the IEEE, 87, 1423–1447.
Received August 24, 2004; accepted August 9, 2005.
NOTE
Communicated by Bernard Haasdonk
An Invariance Property of Predictors in Kernel-Induced Hypothesis Spaces

Nicola Ancona
[email protected] Institute of Intelligent Systems for Automation, C. N. R., Bari, Italy
Sebastiano Stramaglia
[email protected] TIRES, Center of Innovative Technologies for Signal Detection and Processing, University of Bari, Italy; Dipartimento Interateneo di Fisica, University of Bari, Italy; and Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Italy
Neural Computation 18, 749–759 (2006)  © 2006 Massachusetts Institute of Technology

We consider kernel-based learning methods for regression and analyze what happens to the risk minimizer when new variables, statistically independent of input and target variables, are added to the set of input variables. This problem arises, for example, in the detection of causality relations between two time series. We find that the risk minimizer remains unchanged if we constrain the risk minimization to hypothesis spaces induced by suitable kernel functions. We show that not all kernel-induced hypothesis spaces enjoy this property. We present sufficient conditions ensuring that the risk minimizer does not change and show that they hold for inhomogeneous polynomial and gaussian radial basis function kernels. We also provide examples of kernel-induced hypothesis spaces whose risk minimizer changes if independent variables are added as input.

1 Introduction

Recent advances in kernel-based learning algorithms have brought the field of machine learning closer to the goal of autonomy: providing learning systems that require as little intervention as possible on the part of a human user (Vapnik, 1998). Kernel algorithms work by embedding data into a Hilbert space and searching for linear relations in that space. The embedding is performed implicitly, by specifying the inner product between pairs of points. Kernel-based approaches are generally formulated as convex optimization problems, with a single minimum, and thus do not require heuristic choices of learning rates, start configuration, or other free parameters. The choice of the kernel and the corresponding feature space are central choices that generally must be made by a human user. While this provides opportunities to use prior knowledge about the problem at hand, in practice it is
N. Ancona and S. Stramaglia
difficult to find prior justification for the use of one kernel instead of another (Shawe-Taylor & Cristianini, 2004). The purpose of this work is to introduce a novel property enjoyed by some kernel-based learning machines, which is of particular relevance when a machine learning approach is developed to evaluate causality between two simultaneously acquired signals. In this article, we define a learning machine to be invariant with respect to independent variables (property IIV) if it does not change when statistically independent variables are added to the set of input variables. We show that the risk minimizer constrained to belong to suitable kernel-induced hypothesis spaces is IIV. This property holds true for hypothesis spaces induced by inhomogeneous polynomial and gaussian kernel functions. We discuss the case of quadratic loss function and provide sufficient conditions for a kernel machine to be IIV. We also present examples of kernels that induce spaces where the risk minimizer is not IIV, and they should not be used to measure causality.

2 Preliminaries

We focus on the problem of predicting the value of a random variable (r.v.) s ∈ R with a function f(x) of the r.v. vector x ∈ R^d. Given a loss function V and a set of functions called the hypothesis space H, the best predictor is sought in H as the minimizer f^* of the prediction error or generalization error or risk, defined as

R[f] = \int V(s, f(x)) p(x, s) dx ds,   (2.1)
where p(x, s) is the joint density function of x and s. Given another r.v. y ∈ R^q, let us add y to the input variables and define a new vector appending x and y: z = (x^T, y^T)^T. Let us also consider the predictor f^*(z) of s, based on the knowledge of the r.v. x and y, minimizing the risk

R′[f] = \int V(s, f(z)) p(z, s) dz ds.   (2.2)
If y is statistically independent of x and s, it is intuitive to require that f^*(x) and f^*(z) coincide and have the same risk. Indeed, in this case the y variables do not convey any information about the problem at hand. The property stated above is important when predictors are used to identify causal relations among simultaneously acquired signals, an important problem with applications in many fields ranging from economics to physiology (see, e.g., Ancona, Marinazzo, & Stramaglia, 2004). The major approach to this problem examines whether the prediction of one series can be improved by incorporating information from the other, as proposed by Granger (1969). In particular, if the prediction error of the first time series is reduced by including
An Invariance Property of Kernel Predictors
measurements from the second time series in the regression model, then the second time series is said to have a causal influence on the first time series. However, not all prediction schemes are suitable for evaluating causality between two time series; they should be invariant with regard to independent variables, so that, at least asymptotically, they are able to recognize variables with no causal relationship. In this work, we consider as predictor the function minimizing the risk, and we show that it does not always enjoy this property. In particular, we show that if we constrain the minimization of the risk to suitable hypothesis spaces, then the risk minimizer is IIV (stable under inclusion of independent variables). We limit our analysis to the case of the quadratic loss function V(s, f(x)) = (s − f(x))^2.

2.1 Unconstrained H. If we do not constrain the hypothesis space, then H is the space of measurable functions for which R is well defined. It is well known (Papoulis, 1985) that the minimizer of equation 2.1 is the regression function:

f^*(x) = \int s p(s|x) ds.

Note that if y is independent of x and s, then p(s|x) = p(s|x, y), and this implies

f^*(z) = \int s p(s|x, y) ds = \int s p(s|x) ds = f^*(x).
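This identity can be checked exactly on a small discrete example. The sketch below is ours, not the article's; the joint distribution p(x, s) and the marginal p(y) are arbitrary illustrative choices.

```python
import numpy as np

# Discrete sanity check: with x, s, y each taking values in {0, 1} and y drawn
# independently of (x, s), the regression function E[s | x] coincides with
# E[s | x, y], because p(s | x, y) = p(s | x).
p_xs = np.array([[0.1, 0.3],    # p(x, s): rows index x, columns index s
                 [0.4, 0.2]])
p_y = np.array([0.25, 0.75])    # marginal of the independent variable y

# Joint density p(x, y, s) = p(x, s) p(y), by independence of y.
p_xys = p_xs[:, None, :] * p_y[None, :, None]

s_vals = np.array([0.0, 1.0])
reg_x = (p_xs * s_vals).sum(axis=1) / p_xs.sum(axis=1)       # E[s | x]
reg_xy = (p_xys * s_vals).sum(axis=2) / p_xys.sum(axis=2)    # E[s | x, y]

assert np.allclose(reg_xy, reg_x[:, None])   # same prediction for every y
```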
Hence the regression function does not change if y is also used for predicting s; the regression function is stable under inclusion of independent variables.

2.2 Linear Hypothesis Spaces. Let us consider the case of linear hypothesis spaces:

H = { f | f(x) = w_x^T x, w_x ∈ R^d }.

Here, and in all the other hypothesis spaces that we consider in this article, we assume that the mean value of the predictor and the mean of s coincide:

E{s − w_x^T x} = 0,   (2.3)

where E{·} means the expectation value. This can be easily achieved by adding a constant component (equal to one) to the x vector. Equation 2.3 is
a sufficient condition for property IIV in the case of linear kernels. Indeed, let us consider the risk associated with an element of H:

R[w_x] = \int (s − w_x^T x)^2 p(x, s) dx ds.   (2.4)
The parameter vector w_x^*, minimizing the risk, is a solution of the following linear system:

E{x x^T} w_x = E{s x}.   (2.5)
Let us consider the hypothesis space of linear functions in the z = (x^T, y^T)^T variable:

H′ = { f | f(z) = w_z^T z, w_z ∈ R^{d+q} }.

Writing w_z = (w_x^T, w_y^T)^T with w_y ∈ R^q, let us consider the risk associated with an element of H′:

R′[w_z] = \int (s − w_x^T x − w_y^T y)^2 p(x, y, s) dx dy ds.   (2.6)
If y is independent of x and s, then equation 2.6 can be written, due to equation 2.3, as

R′[w_z] = R[w_x] + \int (w_y^T y)^2 p(y) dy.   (2.7)
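The extra term in equation 2.7 penalizes any weight placed on y. A quick least-squares simulation (ours; the coefficients and sample size are arbitrary illustrative choices) shows the fitted weight on an independent variable shrinking toward zero as the sample grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
s = 1.0 + 2.0 * x + 0.5 * rng.normal(size=n)   # target depends only on x
y = rng.normal(size=n)                          # independent distractor

# Design matrix with a constant column, as required by equation 2.3.
Z = np.column_stack([np.ones(n), x, y])
w, *_ = np.linalg.lstsq(Z, s, rcond=None)

# The fitted weight on y is O(1/sqrt(n)); the weights on the constant
# and on x recover the true values 1 and 2.
assert abs(w[2]) < 0.02
```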
It follows that the minimum of R′ corresponds to w_y = 0. In conclusion, if y is independent of x and s, the predictors f^*(x) = w_x^{*T} x and f^*(z) = w_z^{*T} z, which minimize the risks of equations 2.4 and 2.6, respectively, coincide (i.e., f^*(x) = f^*(x, y) for every x and every y). Moreover, the weights associated with the components of the y vector are identically null. So the risk minimizer in linear hypothesis spaces is an IIV predictor.

3 Nonlinear Hypothesis Spaces

Let us now consider nonlinear hypothesis spaces. An important class of nonlinear models is obtained by mapping the input space to a higher-dimensional feature space and finding a linear predictor in this new space. Let φ be a nonlinear mapping function that associates with x ∈ R^d the vector φ(x) = (φ_1(x), φ_2(x), . . . , φ_h(x))^T ∈ R^h, where φ_1, φ_2, . . . , φ_h are h fixed
real-valued functions. Let us consider linear predictors in the space spanned by the functions φ_i for i = 1, 2, . . . , h. The hypothesis space is then

H = { f | f(x) = w_x^T φ(x), w_x ∈ R^h }.

In this space, the best linear predictor of s is the function f^* ∈ H minimizing the risk

R[w_x] = \int (s − w_x^T φ(x))^2 p(x, s) dx ds.   (3.1)

Let us denote by w_x^* the minimizer of equation 3.1. We first restrict to the case of a single additional new feature: let y be a new real random variable, statistically independent of s and x, and denote by γ(z), with z = (x^T, y)^T, a generic new feature involving the y variable. For predicting the r.v. s, we use the linear model involving the new feature, f(z) = w_z^T φ′(z), where φ′(z) = (φ(x)^T, γ(z))^T and w_z = (w_x^T, v)^T has to be fixed, minimizing

R′[w_z] = \int (s − w_x^T φ(x) − v γ(x, y))^2 p(x, s) p(y) dx dy ds.   (3.2)
We would like to have v = 0 at the minimum of R′. Therefore, let us evaluate

∂R′/∂v |_0 = −2 \int γ(x, y) (s − w_x^{*T} φ(x)) p(x, s) p(y) dx dy ds,

where ∂/∂v|_0 means that the derivative is evaluated at v = 0 and w_x = w_x^*, where w_x^* minimizes the risk of equation 3.1. If ∂R′/∂v|_0 is not zero, then the predictor is changed after inclusion of feature γ. Therefore, ∂R′/∂v|_0 = 0 is the condition that must be satisfied by all the features involving y, to constitute an IIV (stable) predictor. It is easy to show that if γ does not depend on x, then this condition holds. More important, it holds if γ is the product of a function γ(y) of y alone and of a component φ_i of the feature vector φ(x):

γ(x, y) = γ(y) φ_i(x)   for some i ∈ {1, . . . , h}.   (3.3)

Indeed, in this case we have

∂R′/∂v |_0 = −2 ( \int γ(y) p(y) dy ) ( \int φ_i(x) (s − w_x^{*T} φ(x)) p(x, s) dx ds ) = 0,
because the second integral vanishes, as w_x^* minimizes the risk of equation 3.1 when only x variables are used to predict s. We observe that the second derivative,

∂²R′/∂v² |_0 = 2 \int γ(x, y)^2 p(x, s) p(y) dx dy ds,

is positive; (w_x^*, 0) remains a minimum after inclusion of the y variable. In conclusion, if the new feature γ involving y verifies equation 3.3, then the predictor f^*(z), which uses both x and y for predicting s, minimizing equation 3.2, and the predictor f^*(x) minimizing equation 3.1 coincide. This shows that the risk minimizer is unchanged after inclusion of y in the input variables. This preliminary result, which is used in the next section, may easily be seen to hold also for finite-dimensional vectorial y.

3.1 Kernel-Induced Hypothesis Spaces. In this section, we analyze whether our invariance property holds true in specific hypothesis spaces that are relevant for many learning schemes, such as support vector machines (Vapnik, 1998) and regularization networks (Evgeniou, Pontil, & Poggio, 2000), to cite just a few. In order to predict s, we map x in a higher-dimensional feature space H by using the mapping

φ(x) = (\sqrt{α_1} ψ_1(x), \sqrt{α_2} ψ_2(x), . . . , \sqrt{α_h} ψ_h(x), . . .),

where the α_i and ψ_i are the eigenvalues and eigenfunctions of an integral operator whose kernel K(x, x′) is a positive-definite symmetric function with the property K(x, x′) = φ(x)^T φ(x′) (see Mercer's theorem in Vapnik, 1998). Let us now consider in detail two important kernels.

3.2 Case K(x, x′) = (1 + x^T x′)^p. Let us consider the hypothesis space induced by this kernel,

H = { f | f(x) = w_x^T φ(x), w_x ∈ R^h },

where the components φ_i(x) of φ(x) are monomials, up to pth degree, which enjoy the following property: φ(x)^T φ(x′) = (1 + x^T x′)^p. Let f^*(x) be the minimizer of the risk in H. Moreover, let z = (x^T, y^T)^T, and consider the hypothesis space H′ induced by the mapping φ′(z) such that φ′(z)^T φ′(z′) = (1 + z^T z′)^p.
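This kernel-feature correspondence can be verified numerically. The sketch below is ours; it builds explicit feature maps for d = 2 and p = 2, the case worked out in the text, and checks that their inner products reproduce both kernels.

```python
import numpy as np

def phi(x):
    # Explicit feature map for K(x, x') = (1 + x^T x')^2 with x in R^2.
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x1, r2 * x2, r2 * x1 * x2, x1**2, x2**2])

def phi_prime(z):
    # Extended map for z = (x1, x2, y): the four extra features are each a
    # function of y alone times a component of phi(x), as in equation 3.3.
    x1, x2, y = z
    r2 = np.sqrt(2.0)
    return np.concatenate([phi((x1, x2)),
                           [r2 * y, r2 * x1 * y, r2 * x2 * y, y**2]])

rng = np.random.default_rng(1)
x, xp = rng.normal(size=2), rng.normal(size=2)
z, zp = rng.normal(size=3), rng.normal(size=3)

assert np.isclose(phi(x) @ phi(xp), (1 + x @ xp) ** 2)
assert np.isclose(phi_prime(z) @ phi_prime(zp), (1 + z @ zp) ** 2)
```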
Let f^*(z) be the minimizer of the risk in H′. If y is independent of x and s, then f^*(x) and f^*(z) coincide. In fact, the components of φ′(z) are all the monomials, in the variables x and y, up to the pth degree: it follows trivially that φ′(z) can be written as φ′(z) = (φ(x)^T, γ(z)^T)^T, where each component γ_i(z) of the vector γ(z) verifies equation 3.3; that is, it is given by the product of a component φ_j(x) of the vector φ(x) and of a function γ_i(y) of the variable y only: γ_i(z) = φ_j(x) γ_i(y). As an example, we show this property for the case of x = (x_1, x_2)^T, z = (x_1, x_2, y)^T, and p = 2. In this case, the mapping functions φ(x) and φ′(z) are

φ(x) = (1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2, x_1^2, x_2^2)^T,
φ′(z) = (1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2, x_1^2, x_2^2, \sqrt{2} y, \sqrt{2} x_1 y, \sqrt{2} x_2 y, y^2)^T,

where one can easily check that φ(x)^T φ(x′) = (1 + x^T x′)^2 and φ′(z)^T φ′(z′) = (1 + z^T z′)^2. In this case, the vector γ(z) is

γ(z) = (φ_1(x) \sqrt{2} y, φ_2(x) y, φ_3(x) y, φ_1(x) y^2)^T.

According to the argument already made, the risk minimizer in this hypothesis space satisfies the invariance property. Note that, remarkably, the risk minimizer in the hypothesis space induced by the homogeneous polynomial kernel K(x, x′) = (x^T x′)^p does not have the invariance property for a generic probability density, as one can easily check by working out explicitly the p = 2 case.

3.3 Translation-Invariant Kernels. In this section we present a formalism that generalizes our discussion to the case of hypothesis spaces whose features constitute an uncountable set. We show that the IIV property holds for linear predictors on feature spaces induced by translation-invariant kernels. In fact, let K(x, x′) = K(x − x′) be a positive-definite kernel function, with x, x′ ∈ R^d. Let K̃(ω_x) be the Fourier transform of K(x): K(x) ↔ K̃(ω_x). By the time-shifting property, we have that K(x − x′) ↔ K̃(ω_x) e^{−j ω_x^T x′}.
By definition of the inverse Fourier transform, neglecting constant factors, we know that (Girosi, 1998)

K(x − x′) = \int_{R^d} K̃(ω_x) e^{−j ω_x^T x′} e^{j ω_x^T x} dω_x.
Since K is positive definite, we can write

K(x − x′) = \int_{R^d} ( \sqrt{K̃(ω_x)} e^{j ω_x^T x} ) ( \sqrt{K̃(ω_x)} e^{j ω_x^T x′} )^* dω_x,

where ^* indicates the conjugate. Then we can write K(x, x′) = ⟨φ_x, φ_{x′}⟩, where

φ_x(ω_x) = \sqrt{K̃(ω_x)} e^{j ω_x^T x}   (3.4)
are the generalized eigenfunctions. Note that in this case, the mapping function φ_x associates a function to x; that is, φ_x maps the input vector x in a feature space with an infinite and uncountable number of features. Let us consider the hypothesis space induced by K:

H = { f | f(x) = ⟨w_x, φ_x⟩, w_x ∈ W_x },

where

⟨w_x, φ_x⟩ = \int_{R^d} w_x(ω_x) φ_x^*(ω_x) dω_x,   (3.5)
and W_x is the set of complex measurable functions for which equation 3.5 is well defined and real.¹ Note that w_x is now a complex function; it is no longer a vector. In this space, the best linear predictor is the function f̄ = ⟨w̄_x, φ_x⟩ in H minimizing the risk functional

R[w_x] = E{(s − ⟨w_x, φ_x⟩)^2}.

It is easy to show that the optimal function w̄_x is a solution of the following integral equation,

E{s e^{−j ω_x^T x}} = \int_{R^d} w_x(ξ_x) \sqrt{K̃(ξ_x)} Φ_x^*(ω_x + ξ_x) dξ_x,   (3.6)

where ξ_x is a dummy integration variable and Φ_x(ω_x) = E{e^{j ω_x^T x}} is the characteristic function² of the r.v. x (Papoulis, 1985). Let us indicate F̃(ω_x) = w_x(ω_x) \sqrt{K̃(ω_x)} and G̃(ω_x) = E{s e^{j ω_x^T x}}. Then equation 3.6 can be written as

G̃(ω_x) = F̃(ω_x) ⋆ Φ_x(ω_x),
¹ In particular, elements of W_x satisfy w_x(−ω_x) = w_x^*(ω_x).
² Φ_x(−ω_x) is the Fourier transform of the probability density p(x) of the r.v. x.
where ⋆ indicates cross-correlation between complex functions. In the spatial domain, this implies G(x) = F^*(x) p(−x). In conclusion, assuming that the density p(x) is strictly positive, the function w̄_x(ω_x) minimizing the risk is unique, and it is given by w̄_x(ω_x) = F[G^*(x)/p(−x)] / \sqrt{K̃(ω_x)}. Substituting this expression into equation 3.5 leads to

f̄(x) = \int s p(s|x) ds,
that is, the risk minimizer coincides with the regression function. In other words, the hypothesis space H, induced by K, is sufficiently large to contain the regression function. This proves that translation-invariant kernels are IIV. It is interesting to work out and explicitly prove the IIV property in the case of translation-invariant and separable kernels. As in the previous section, let y ∈ R^q be an r.v. vector independent of x and s, and use the vector z = (x^T, y^T)^T for predicting s. Now let us consider the following mapping function,

φ_z(ω_z) = \sqrt{K̃(ω_z)} e^{j ω_z^T z},   (3.7)
where ω_z = (ω_x^T, ω_y^T)^T and ⟨φ_z, φ_{z′}⟩ = K(z − z′). Let us consider the hypothesis space induced by K:

H′ = { f | f(z) = ⟨w_z, φ_z⟩, w_z ∈ W_z }.

The best linear predictor is the function f̄ = ⟨w̄_z, φ_z⟩ in H′ minimizing the risk functional

R′[w_z] = E{(s − ⟨w_z, φ_z⟩)^2},
where the optimal function w̄_z is the solution of the integral equation (see equation 3.6)

E{s e^{−j ω_z^T z}} = \int_{R^{d+q}} w_z(ξ_z) \sqrt{K̃(ξ_z)} Φ_z^*(ω_z + ξ_z) dξ_z,   (3.8)
where ω_z = (ω_x^T, ω_y^T)^T. Note that, x and y being independent, the characteristic function of z factorizes: Φ_z(ω_z) = Φ_x(ω_x) Φ_y(ω_y). If K(z) is separable,

K(z) = K(x) H(y),   (3.9)

then its Fourier transform takes the form K̃(ω_z) = K̃(ω_x) H̃(ω_y). Since E{s e^{−j ω_z^T z}} = E{s e^{−j ω_x^T x}} E{e^{−j ω_y^T y}}, equation 3.8 becomes

E{s e^{−j ω_x^T x}} Φ_y^*(ω_y) = \int_{R^{d+q}} w_z(ξ_z) \sqrt{K̃(ξ_x) H̃(ξ_y)} Φ_x^*(ω_x + ξ_x) Φ_y^*(ω_y + ξ_y) dξ_z.   (3.10)

The risk minimizer w̄_z solving equation 3.10 is

w̄_z(ω_x, ω_y) = w̄_x(ω_x) δ(ω_y) / \sqrt{H̃(0)}.   (3.11)
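The gaussian radial basis function kernel satisfies the separability condition 3.9. A short numerical check of this factorization (ours; the dimensions and the width are arbitrary):

```python
import numpy as np

# The gaussian RBF kernel in z = (x, y) factorizes into a gaussian in x - x'
# times a gaussian in y - y', i.e., K(z - z') = K(x - x') H(y - y').
def rbf(u, v, sigma=1.3):
    u, v = np.atleast_1d(u), np.atleast_1d(v)
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(2)
x, xp = rng.normal(size=3), rng.normal(size=3)   # d = 3
y, yp = rng.normal(size=2), rng.normal(size=2)   # q = 2
z, zp = np.concatenate([x, y]), np.concatenate([xp, yp])

assert np.isclose(rbf(z, zp), rbf(x, xp) * rbf(y, yp))
```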
This can be checked by substituting equation 3.11 in equation 3.10 and using equation 3.6. The structure of equation 3.11 guarantees that the predictor is unchanged under inclusion of the variables y. This is the case, in particular, for the gaussian radial basis function kernel. Finally, note that a property similar to equation 3.3 holds true in this hypothesis space too. In fact, as K is separable, equation 3.7 implies that

φ_z(ω_z) = φ_x(ω_x) γ_y(ω_y),   (3.12)

where γ_y(ω_y) = \sqrt{H̃(ω_y)} e^{j ω_y^T y} with the property ⟨γ_y, γ_{y′}⟩ = H(y − y′). Equation 3.12 may be seen as a continuum version of property 3.3.

4 Discussion

In this work, we consider, in the frame of kernel methods for regression, the following question: Does the risk minimizer change when statistically independent variables are added to the set of input variables? We show that this property is not guaranteed by all the hypothesis spaces. We outline sufficient conditions ensuring this property and show that it holds for inhomogeneous polynomial and gaussian radial basis function kernels. While
these results are relevant to construct machine learning approaches to study causality between time series, in our opinion they might also be important in the more general task of kernel selection. Our discussion concerns the risk minimizer; hence, it holds only in the asymptotic regime. The analysis of the practical implications of our results, when only a finite data set is available to train the learning machine, is a matter for further research. It is worth noting, however, that our results hold also for a finite set of data if the probability distribution is replaced by the empirical measure. Another interesting question is how this scenario changes when a regularization constraint is imposed on the risk minimizer (Poggio & Girosi, 1990) and loss functions different from the quadratic one are considered. Moreover, it would be interesting to analyze the connections between our results and classical problems of machine learning such as feature selection and sparse representation, that is, the determination of a solution with only a few nonvanishing components. If we look for the solution in overcomplete or redundant spaces of vectors or functions, where more than one representation exists, then it makes sense to impose a sparsity constraint on the solution. In the case considered here, the sparsity of w^* emerges as a consequence of the existence of independent input variables when a quadratic loss function is used.

Acknowledgments

We thank two anonymous reviewers whose comments were valuable in improving the presentation of this work.

References

Ancona, N., Marinazzo, D., & Stramaglia, S. (2004). Radial basis function approach to nonlinear Granger causality of time series. Physical Review E, 70, 56221–56227.
Evgeniou, T., Pontil, M., & Poggio, T. (2000). Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1), 1–50.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines.
Neural Computation, 10, 1455–1480.
Granger, C. W. J. (1969). Testing for causality and feedback. Econometrica, 37, 424–438.
Papoulis, A. (1985). Probability, random variables, and stochastic processes. New York: McGraw-Hill.
Poggio, T., & Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978–986.
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Received February 25, 2005; accepted August 10, 2005.
LETTER
Communicated by Emanuel Todorov
Modeling Sensorimotor Learning with Linear Dynamical Systems

Sen Cheng
[email protected]
Philip N. Sabes
[email protected] Sloan-Swartz Center for Theoretical Neurobiology, W. M. Keck Foundation Center for Integrative Neuroscience and Department of Physiology, University of California, San Francisco, CA 94143-0444, U.S.A.
Recent studies have employed simple linear dynamical systems to model trial-by-trial dynamics in various sensorimotor learning tasks. Here we explore the theoretical and practical considerations that arise when employing the general class of linear dynamical systems (LDS) as a model for sensorimotor learning. In this framework, the state of the system is a set of parameters that define the current sensorimotor transformation, the function that maps sensory inputs to motor outputs. The class of LDS models provides a first-order approximation for any Markovian (state-dependent) learning rule that specifies the changes in the sensorimotor transformation that result from sensory feedback on each movement. We show that modeling the trial-by-trial dynamics of learning provides a substantially enhanced picture of the process of adaptation compared to measurements of the steady state of adaptation derived from more traditional blocked-exposure experiments. Specifically, these models can be used to quantify sensory and performance biases, the extent to which learned changes in the sensorimotor transformation decay over time, and the portion of motor variability due to either learning or performance variability. We show that previous attempts to fit such models with linear regression have not generally yielded consistent parameter estimates. Instead, we present an expectation-maximization algorithm for fitting LDS models to experimental data and describe the difficulties inherent in estimating the parameters associated with feedback-driven learning. Finally, we demonstrate the application of these methods in a simple sensorimotor learning experiment: adaptation to shifted visual feedback during reaching.

1 Introduction

Sensorimotor learning is an adaptive change in motor behavior in response to sensory inputs. Here, we explore a dynamical systems approach to modeling sensorimotor learning. In this approach, the mapping from sensory
Neural Computation 18, 760–793 (2006)
inputs to motor outputs is described by a sensorimotor transformation (see Figure 1, top), which constitutes the state of a dynamical system. The evolution of this state is governed by the dynamics of the system (see Figure 1, bottom), which may depend on both exogenous sensory inputs and sensory feedback. The goal is to quantify how these sensory signals drive trial-bytrial changes in the state of the sensorimotor transformations underlying movement. To accomplish this goal, empirical data are fit with linear dynamical systems (LDS), a general, parametric class of dynamical systems. The approach is best illustrated with an example. Consider the case of prism adaptation of visually guided reaching, a well-studied form of sensorimotor learning in which shifted visual feedback drives rapid recalibration of visually guided reaching (von Helmholtz, 1867). Prism adaptation has almost always been studied in a blocked experimental design, with exposure to shifted visual feedback occurring over a block of reaching trials. Adaptation is then quantified by the after-effect, the change in the mean reach error across two blocks of no-feedback test reaches—one before and one after the exposure block (Held & Gottlieb, 1958). This experimental approach has had many successes, for example, identifying different components of adaptation (Hay & Pick, 1966; Welch, Choe, & Heinrich, 1974) and the experimental factors that influence the quality of adaptation (e.g., Redding & Wallace, 1990; Norris, Greger, Martin, & Thach, 2001; Baraduc & Wolpert, 2002). However, adaptation is a dynamical process, with behavioral and
[Figure 1 appears here as a two-panel diagram: a MOVEMENT box, in which the inputs w_t and the noise γ_t enter the sensorimotor transformation F_t to produce the movement y_t, and a SENSORIMOTOR LEARNING box, in which F_t and the learning input u_t determine F_{t+1} within the space of sensorimotor transformations.]

Figure 1: Sensorimotor learning modeled as a dynamic system in the space of sensorimotor transformations. For definitions of variables, see section 2.1.
neural changes, in both the observed behavior and the underlying patterns of neural activity, occurring on every trial. Our goal is to describe how the state of the system, which in this case could be modeled as the mean reach error, changes after each trial in response to error feedback (e.g., reach errors, perceived visual-proprioceptive misalignment) on that trial. As we will describe, a comparison of the performance before and after the training block is not sufficient to characterize this process. Only recently have investigations of sensorimotor learning from a dynamical systems perspective begun to appear (Thoroughman & Shadmehr, 2000; Scheidt, Dingwell, & Mussa-Ivaldi, 2001; Baddeley, Ingram, & Miall, 2003; Donchin, Francis, & Shadmehr, 2003). While these investigations have all made use of the LDS model class, they primarily focused on the application of these methods. A number of important algorithmic and statistical issues that arise when applying these methods remain unaddressed. Here we outline a general framework for modeling sensorimotor learning with LDS models, discuss the key analytical properties of these models, and address the statistical issues that arise when estimating model parameters from experimental data. We show how LDS can account for performance bias and the decay of learning over time, observed properties of adaptation that have not been included in previous studies. We show that the decay effect can be confounded with the effects of sensory feedback and that it can be difficult to separate these effects statistically. In contrast, the effects of exogenous inputs that are uncorrelated with the state of the sensorimotor transformation are much easier to characterize. We describe a novel resampling-based hypothesis test that can be used to identify the significance of such effects.
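To preview the estimation issue, the following sketch (ours; a scalar toy system, not the experiment reported in this letter) shows why naively regressing one noisy output on the previous one underestimates the retention parameter, which motivates the maximum-likelihood approach:

```python
import numpy as np

# Scalar toy LDS:  x_{t+1} = a x_t + eta_t,   y_t = x_t + gamma_t.
# Regressing y_{t+1} on y_t estimates a * var(x) / (var(x) + R), an
# attenuation caused by the output noise, so ordinary least squares is
# not a consistent estimator of the retention parameter a.
rng = np.random.default_rng(3)
a, Q, R, T = 0.9, 0.2, 0.5, 100_000

x = np.empty(T)
x[0] = 0.0
for t in range(T - 1):
    x[t + 1] = a * x[t] + np.sqrt(Q) * rng.normal()
y = x + np.sqrt(R) * rng.normal(size=T)

a_ols = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])
# a_ols comes out well below the true a = 0.9
assert a_ols < 0.8
```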
The estimation of the LDS model parameters requires an iterative, maximum-likelihood, system identification algorithm (Shumway & Stoffer, 1982; Ghahramani & Hinton, 1996), which we present in a slightly modified form. This iterative algorithm is necessary because, as we show, simple linear regression approaches are biased or inefficient, or both. The maximum-likelihood model can be used to quantify characteristics of the dynamics of sensorimotor learning and can make testable predictions for future experiments. Finally, we illustrate this framework with an application to a modern variant of the prism adaptation experiment.

2 A Linear Dynamical Systems Model for Sensorimotor Learning

2.1 General Formulation of the Model. Movement control can be described as a transformation of sensory signals into motor outputs. This transformation is generally a continuous-time stochastic process that includes both internal ("efference copy") and sensory feedback loops. We will use the term sensorimotor transformation to refer to the input-output mapping of this entire process, feedback loops and all. This definition is useful in the case of discrete movements and other situations where continuous
time can be discretized in a manner that permits a concise description of the feedback processes within each time step. We assume that at any given movement trial or discrete time step, indexed by t, the motor output can be quantified by a vector y_t. In general, this output might depend on both a vector of inputs w_t to the system and the "output noise" γ_t, the combined effect of sensory, motor, and processing variability. As shown in the upper box of Figure 1, the sensorimotor transformation can be formalized as a time-varying function of these two variables,

y_t = F_t(w_t, γ_t).   (2.1a)

We next define sensorimotor learning as a change in the transformation F_t in response to the sensorimotor experience of previous movements, as shown in the lower box of Figure 1. We let u_t represent the vector of sensorimotor variables at time step t that drive such learning. This vector might include exogenous inputs r_t and, since feedback typically plays a large role in learning, the motor outputs y_t. The input u_t might have all, some, or no components in common with the inputs w_t. Generally, learning after time step t can depend on the complete history of this variable, U_t ≡ {u_1, . . . , u_t}. Sensorimotor learning can then be modeled as a discrete-time dynamical system whose state is the current sensorimotor transformation, F_t, and whose state-update equation is the "learning rule" that specifies how F changes over time:

F_{t+1} = L({F_τ}_{τ=1}^t, U_t, η_t, t),   (2.1b)
where the noise term ηt includes sensory feedback noise as well as intrinsic variability in the mechanisms of learning and will be referred to as learning noise. Previous studies that have adopted a dynamical systems approach to studying sensorimotor learning have taken only the output noise into account (Thoroughman & Shadmehr, 2000; Donchin et al., 2003; Baddeley et al., 2003). However, it seems likely that variability exists in both the sensorimotor output and the process of sensorimotor learning. Attempts to fit empirical data with parametric models of learning that do not account for learning noise may yield incorrect results (see section 4.4 for an example). Aside from these practical concerns, it is also of intrinsic interest to quantify the relative contributions of the output and learning noise to performance variability. 2.2 The Class of LDS Models. The model class defined in equation 2.1 provides a general framework for thinking about sensorimotor learning, but it is too general to be of practical use. Here we outline a series of assumptions that lead us from the general formulation to the specific class
of LDS models, which can be a practical yet powerful tool for interpreting sensorimotor learning experiments:
- Stationarity: On the timescale that it takes to collect a typical data set, the learning rule L does not explicitly depend on time.

- Parameterization: F_t can be written in parametric form with a finite number of parameters, x_t ∈ R^m. This is not a serious restriction, as many finite basis function sets can describe large classes of functions. The parameter vector x_t now represents the state of the dynamical system at time t, and X_t is the history of these states. The full model, consisting of the learning rule L and sensorimotor transformation F, is now given by

  x_{t+1} = L(X_t, U_t, η_t),   (2.2a)
  y_t = F(x_t, w_t, γ_t).   (2.2b)

- Markov property: The evolution of the system depends on only the current state and inputs, not on the full history:

  x_{t+1} = L(x_t, u_t, η_t),   (2.3a)
  y_t = F(x_t, w_t, γ_t).   (2.3b)

  In other words, we assume online or incremental learning, as opposed to batch learning, a standard assumption for models of biological learning.

- Linear approximation and gaussian noise: The functions F and L can be linearized about some equilibrium values for the states (x_e), inputs (u_e and w_e), and outputs (y_e):

  x_{t+1} − x_e = A(x_t − x_e) + B(u_t − u_e) + η_t,   (2.4a)
  y_t − y_e = C(x_t − x_e) + D(w_t − w_e) + γ_t.   (2.4b)

  Thus, if the system were initially set up in equilibrium, the dynamics would be solely driven by random fluctuations about that equilibrium. The linear approximation is not unreasonable for many experimental contexts in which the magnitude of the experimental manipulation, that is, the inputs, is small, since in these cases the deviations from equilibrium of the state and the output are small. The combined effect of the constant equilibrium terms in equation 2.4 can be lumped into a single constant "bias" term for each equation:

  x_{t+1} = A x_t + B u_t + b_x + η_t,   (2.5a)
  y_t = C x_t + D w_t + b_y + γ_t.   (2.5b)
We will show in section 3.2 that it is possible to remove the effects of the bias terms b_x and b_y from the LDS. In anticipation of that result, we drop the bias terms in the following discussion. With the additional assumption that η_t and γ_t are zero-mean, gaussian white noise processes, we arrive at the LDS model class we use below:

x_{t+1} = A x_t + B u_t + η_t,   (2.6a)
y_t = C x_t + D w_t + γ_t,   (2.6b)

with

η_t ∼ N(0, Q),   γ_t ∼ N(0, R).   (2.6c)
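As a concrete illustration, here is a minimal simulation of the model class 2.6 (our sketch; the scalar dimensions and parameter values are illustrative only, not taken from the experiment in this letter):

```python
import numpy as np

def simulate_lds(A, B, C, D, Q, R, u, w, x0, rng):
    """Generate states x_t and outputs y_t of equation 2.6 for t = 0..T-1."""
    T, m = len(u), len(x0)
    x = np.empty((T, m))
    y = np.empty((T, C.shape[0]))
    x[0] = x0
    for t in range(T):
        y[t] = C @ x[t] + D @ w[t] + rng.multivariate_normal(np.zeros(C.shape[0]), R)
        if t < T - 1:
            x[t + 1] = A @ x[t] + B @ u[t] + rng.multivariate_normal(np.zeros(m), Q)
    return x, y

rng = np.random.default_rng(4)
A = np.array([[0.95]])   # retention: the learned state decays slowly toward 0
B = np.array([[0.10]])   # learning rate on the driving input u_t
C = np.array([[1.0]])
D = np.array([[0.0]])
Q = np.array([[1e-4]])   # learning-noise covariance
R = np.array([[1e-2]])   # output-noise covariance
T = 200
u = np.ones((T, 1))      # constant input, e.g., a fixed feedback shift
w = np.zeros((T, 1))

x, y = simulate_lds(A, B, C, D, Q, R, u, w, np.zeros(1), rng)
# With constant input, the state settles near the fixed point
# B u / (1 - A) = 0.10 / 0.05 = 2.0.
```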
In principle, signal-dependent motor noise (Clamann, 1969; Matthews, 1996; Harris & Wolpert, 1998; Todorov & Jordan, 2002) could be incorporated into this model by allowing the output variance R to vary with t. In practice, this would complicate parameter estimation. In the case where the data consist of a set of discrete movements with similar kinematics (e.g., repeated reaches with only modest variation in start and end locations), the modulation of R with t would be negligible. We will restrict our consideration to the case of stationary R. Among the LDS models defined in equation 2.6 there are distinct subclasses that are functionally equivalent. The parameters of two equivalent models are related to each other by a similarity transformation of the states x_t:

x_t → P x_t,
A → P A P^{−1},   C → C P^{−1},
B → P B,   D → D,
Q → P Q P^T,   R → R,   (2.7)

where P is an invertible matrix. This equivalence exists because the state cannot be directly observed, but must be inferred from the outputs y_t. An LDS from one subclass of equivalent models cannot be transformed into an LDS of another subclass via equation 2.7. In particular, a similarity transformation of an LDS with A = I always yields another LDS with A = I, since

(A = I) → (P A P^{−1} = I).   (2.8)
Therefore, it is restrictive to assume that A = I, that is, that there is no “decay” of input-driven state changes over time. The equivalence under similarity transformation can be useful if one wishes to place certain constraints on the LDS parameters. For instance, if
766
S. Cheng and P. Sabes
one wishes to identify state components that evolve independently in the absence of sensory inputs, then the matrix A has to be diagonal. In many cases,¹ this constraint can be met by performing unconstrained parameter estimation and then transforming the parameters with P = [v_1 · · · v_n]^{-1}, where the v_i are the eigenvectors of the estimate of A. The transformed matrix P A P^{-1} is a diagonal matrix composed of the eigenvalues of A. As another example, the relationship between the state and the output might be known, that is, C = C_0. If both C_0 and the estimated value of C are invertible, this constraint is achieved by transforming the estimated LDS with P = C_0^{-1} Ĉ.

2.3 Feedback in LDS Models.

In the LDS model of equation 2.6, learning is driven by an input vector u_t. In an experimental setting, the exact nature of this signal will depend on the details of the task and is likely to be unknown. In general, it can include sensory feedback of the previous movement as well as exogenous sensory inputs. When we consider the problem of parameter estimation in section 4, we will show that the parameters corresponding to these two components of the input have different statistical properties. Therefore, we will explicitly write the input vector as u_t^T = [r_t^T  y_t^T], where the vector r_t contains the exogenous input signals. We will similarly partition the input parameter, B = [G  H]. This yields the form of the LDS model that will be used in the subsequent discussion:

x_{t+1} = A x_t + [G  H] [r_t; y_t] + η_t,   (2.9a)
y_t = C x_t + D w_t + γ_t,                   (2.9b)

with

η_t ∼ N(0, Q),   γ_t ∼ N(0, R).              (2.9c)
¹ This transformation exists only if there are n linearly independent eigenvectors, where n is the dimensionality of the state.
As a second form of error-corrective learning, consider the case where learning is driven by the unexpected motor output, u_t = y_t − ŷ_t, where ŷ_t = C x_t + D w_t is the predicted motor output. This learning rule would be used if the goal of the learner were to accurately predict the output of the system given the inputs u_t and w_t, that is, to learn a “forward model” of the physical plant ŷ(u_t, w_t, x_t) (Jordan & Rumelhart, 1992). Writing this learning rule in the LDS form of equation 2.6a, we obtain

x_{t+1} = A x_t + B(y_t − C x_t − D w_t) + η_t
        = (A − BC) x_t + (−BD) w_t + B y_t + η_t.

Thus, this scheme can be expressed in the form of equation 2.9a, with r_t = w_t and parameters A′ = A − BC, G′ = −BD, and H′ = B.

Finally, learning could be driven by the predicted performance error, u_t = ŷ_t − y_t^* (e.g., Jordan & Rumelhart, 1992; Donchin et al., 2003). This scheme would be useful if the learner already had access to an accurate forward model. Using the predicted performance to estimate movement error would eliminate the effects that motor variability and feedback sensor noise have on learning. Also, in the context of continuous time systems, learning from the predicted performance error minimizes the effects of delays in the feedback loop. Putting this learning rule into the form of equation 2.6a, we obtain

x_{t+1} = A x_t + B(C x_t + D w_t − y_t^*) + η_t
        = (A + BC) x_t + [BD  −B] [w_t; y_t^*] + η_t.
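The algebraic rewrite of the forward-model learning rule can be checked numerically. The following sketch (an added illustration; all matrices are arbitrary random examples) verifies that the original update and its equation-2.9a form give identical results:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 2  # state and output dimensions (arbitrary for this check)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
C = rng.standard_normal((m, n))
D = rng.standard_normal((m, m))
x, w, y = rng.standard_normal(n), rng.standard_normal(m), rng.standard_normal(m)

# Forward-model learning rule: u_t = y_t - (C x_t + D w_t)
lhs = A @ x + B @ (y - C @ x - D @ w)
# Equivalent equation-2.9a form: A' = A - BC, G' = -BD, H' = B, r_t = w_t
rhs = (A - B @ C) @ x + (-B @ D) @ w + B @ y
assert np.allclose(lhs, rhs)
```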
Again, this scheme is consistent with equation 2.9a, with inputs r_t = [w_t; y_t^*], parameters A′ = A + BC and G′ = [BD  −B], and H′ = 0.

2.4 Example Applications.

LDS models can be applied to a wide range of sensorimotor learning tasks, but there are some restrictions. The true dynamics of learning must be consistent with the assumptions underlying the LDS framework, as discussed in section 2.2. Most notably, both the learning dynamics and the motor output have to be approximately linear within the range of states and inputs experienced. In addition, LDS models can be fit to experimental data only if the inputs u_t and outputs y_t are well defined and can be measured by the experimenter. Identifying inputs amounts to defining the error signals that could potentially drive learning. While the true inputs will typically not be known a priori, it is often the case that several candidate input signals are available. Hypothesis testing can then
be used to determine which signals contribute significantly to the dynamics of learning, as discussed in section 4.2. The outputs y_t must be causally related to the state of the sensorimotor system, since they function as a noisy readout of the state.

Several illustrative examples are described here. Consider first the case where t represents discrete movements. Two example tasks would be goal-directed reaching and hammering. A reasonable choice of state variable for these tasks would be the average positional error at the end of the movement. In this case, y_t would be the error on each trial. In the hammering task, one might also include task-relevant dynamic information such as the magnitude and direction of the impact force. In some circumstances, these end-point-specific variables might be affected too much by online feedback to serve as a readout of the sensorimotor state. In such a case, one may choose to focus on variables from earlier in the movement (e.g., Thoroughman & Shadmehr, 2000; Donchin et al., 2003). Indeed, a more complete description of the sensorimotor state might be obtained by augmenting the output vector with multiple kinematic or dynamic parameters of the movement and similarly increasing the dimensionality of the state. In the reaching task, for example, y_t could contain the position and velocity at several time points during the reach. Similarly, the output for the hammering task might contain snapshots of the kinematics of the hammer or the forces exerted on the hammer by the hand. If learning is thought to occur independently in different components of the movement, then state and output variables for each component should be handled by separate LDS models in order to reduce the overall model complexity.

Next, consider the example of gain adaptation in the vestibular ocular reflex (VOR). The VOR stabilizes gaze direction during head rotation.
The state of the VOR can be characterized by its “gain,” the ratio of the angular speed of the eye response over the speed of the rotation stimulus. When magnifying or minimizing lenses are used to alter the relationship between head rotation and visual motion, VOR gain adaptation is observed (Miles & Fuller, 1974). An LDS could be used to model this form of sensorimotor learning with the VOR gain as the state of the system. If the output is chosen to be the empirical ratio of eye and head velocity, then a linear equation relating output to state is reasonable. The input to the LDS could be the speed of visual motion or the ratio of that speed to the speed of head rotation. On average, such input would be zero if the VOR had perfect gain. A more elaborate model could include separate input, state, and output variables for movement about the horizontal and vertical axes. The dynamics of VOR adaptation could be modeled across discrete trials, consisting, for example, of several cycles of sinusoidal head rotations about some axis. The variables yt and ut could then be time averages over trial t of the respective variables. On the other hand, VOR gain adaptation is more accurately described as a continuous learning process. This view can also be incorporated into the LDS framework. Time is discretized into time steps indexed by t, and the variables yt and ut represent averages over each time step.
In the examples described so far, the input and output error signals are explicitly defined with respect to some task-relevant goal. It is important to note, however, that the movement goal need not be explicitly specified or directly measurable. There are many examples where sensorimotor learning occurs without an explicit task goal: when auditory feedback of vowel production is pitch-shifted, subjects alter their speech output to compensate for the shift (Houde & Jordan, 1998); when reaching movements are performed in a rotating room, subjects adapt to the Coriolis forces to produce nearly straight arm trajectories even without visual feedback (Lackner & Dizio, 1994); when the visually perceived curvature of reach trajectories is artificially shifted, subjects adapt their true arm trajectories to compensate for the apparent curvature (Wolpert, Ghahramani, & Jordan, 1995; Flanagan & Rao, 1995). What is common to these examples is an apparent implicit movement goal that subjects are trying to achieve. The LDS approach can still be applied in this common experimental scenario. In this case, a measure of the trial-by-trial deviation from a baseline (preexposure) movement can serve as a measure of the state of sensorimotor adaptation or as an error feedback signal. This approach has been applied successfully in the study of reach adaptation to force perturbations (Thoroughman & Shadmehr, 2000; Donchin et al., 2003).

3 Characteristics of LDS Models

We now describe how the LDS parameters relate to two important characteristics of sensorimotor learning: the steady-state behavior of the learner and the effects of performance bias.

3.1 Dynamics vs. Steady State.

Most experiments on sensorimotor learning have focused on the after-effect of learning, measured as the change in motor output following a block of repeated exposure trials. The LDS can be used to model such blocked-exposure designs. An LDS with constant exogenous inputs (r_t = r, w_t = w) will, after many trials, approach a steady state in which the state and output are constant except for fluctuations due to noise. The after-effect is essentially the expected value of the steady-state output,

y_∞ = lim_{t→∞} E(y_t) = C x_∞ + D w.   (3.1)

An expression for the steady state x_∞ = lim_{t→∞} E(x_t) is obtained by taking the expectation value and then the limit of equation 2.9a, yielding

x_∞ = A x_∞ + G r + H y_∞.   (3.2)
Combining equations 3.1 and 3.2, the steady state is given by the solution of

−(A + HC − I) x_∞ = G r + H D w.   (3.3)

A unique solution to equation 3.3 exists only if A + HC − I is invertible. One sufficient condition for this is asymptotic stability of the system, which means that the state converges to zero as t → ∞ in the absence of any inputs or noise. This follows from the fact that a system is asymptotically stable if and only if all of the eigenvalues of A + HC have magnitude less than unity (Kailath, 1980). When a unique solution exists, it is given by

x_∞ = −(A + HC − I)^{-1} (G r + H D w),   (3.4)

and the steady-state output is

y_∞ = −C (A + HC − I)^{-1} (G r + H D w) + D w.   (3.5)
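Equations 3.4 and 3.5 can be evaluated directly and checked against a long run of the noise-free dynamics. The following sketch uses arbitrary, asymptotically stable example parameters (not taken from the text):

```python
import numpy as np

# Arbitrary stable 2-D example (illustrative only)
A = np.array([[0.9, 0.0], [0.1, 0.8]])
G = np.array([[-0.2, 0.0], [0.0, -0.3]])
H = np.array([[-0.1, 0.0], [0.0, -0.15]])
C = np.eye(2)
D = np.zeros((2, 2))
r = np.array([1.0, -0.5])
w = np.zeros(2)

I = np.eye(2)
# Equations 3.4 and 3.5
x_inf = -np.linalg.solve(A + H @ C - I, G @ r + H @ D @ w)
y_inf = C @ x_inf + D @ w

# Compare with iterating the noise-free dynamics (equation 2.9 with noise set to zero)
x = np.zeros(2)
for _ in range(2000):
    y = C @ x + D @ w
    x = A @ x + G @ r + H @ y
assert np.allclose(x, x_inf, atol=1e-6)
```

All eigenvalues of A + HC here have magnitude below one, so the iteration converges to the unique steady state given by equation 3.4.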
This last expression can be broken down into the effective gains for the inputs r and w: the coefficients −C(A + HC − I)^{-1} G and −C(A + HC − I)^{-1} H D + D, respectively. While these two gains can be estimated from the steady-state output in a blocked-exposure experiment, they are insufficient to determine the full system dynamics. In fact, none of the parameters of the LDS model in equation 2.9 can be directly estimated from these two gains.

The difference between studying the dynamics and the steady state is best illustrated with examples. We consider the performance of two specific LDS models. For simplicity, we place several constraints on the LDS: C = I, meaning the state x represents the mean motor output; D = 0, meaning the exogenous input has no direct effect on the current movement; and G and H are invertible, meaning that no dimensions of the input r or feedback y_t are ignored during learning. In the first example, we consider the case where A = I. This is a system with no intrinsic decay of the learned state. From equation 3.5, we see that the steady-state output in this case converges to the value

y_∞ = −H^{-1} G r.   (3.6)
If this is a simple error-corrective learning algorithm as in the first example of section 2.3, then G = H and the output converges to a completely adaptive response in the steady state, y∞ = −r . In such a case, the steady-state output is independent of the value of H, and so experiments that measure only the steady-state performance will miss the dynamical effects due to the structure of H. Such effects are illustrated in Figure 2, which shows
Figure 2: Illustration of the difference between trial-by-trial state of adaptation (connected arrows) and steady state of adaptation (open circles) in a simple simulation of error-corrective learning. The data were simulated with no noise and with diagonal matrices G = H. The learning rate in the x-direction, H_{11}, was 40% smaller than in the y-direction, H_{22}. Four different input vectors r_t = −y_t^* were used, as shown in the inset in the corresponding shade of gray.
the simulated evolution of the state of the system when the learning rate along the x-direction (H_{11}) is 40% smaller than in the y-direction (H_{22}). Such spatial anisotropies in the dynamics of learning provide important clues about the underlying mechanisms of learning. For example, anisotropies could be used to identify differences in the learning and control of specific spatial components of movement (Pine, Krakauer, Gordon, & Ghez, 1996; Favilla, Hening, & Ghez, 1989; Krakauer, Pine, Ghilardi, & Ghez, 2000).

In the second example, we consider a learning rule that does not achieve complete adaptation, a more typical outcome in experimental settings. Unlike in the last example, we assume that the system is isotropic: A = aI, G = −gI, and H = −hI, where a, g, and h are positive scalars. In this case, the steady-state output is

y_∞ = −g / ((1 − a) + h) · r.   (3.7)
A system with a = 1 and g = h will exhibit complete adaptation: y∞ = −r . However, incomplete adaptation, |y∞ | < |r |, could be due to either state decay (a < 1) or a greater weighting of the feedback signal compared to the exogenous shift (h > g). Measurements of the steady state are not sufficient to distinguish between these two qualitatively different scenarios. These examples illustrate that knowing the steady-state output does not fully constrain the important features of the underlying mechanisms of adaptation.
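The indistinguishability of the two scenarios at steady state can be made concrete with equation 3.7. In the sketch below (arbitrary scalar values chosen for illustration), a system with no state decay but unbalanced weights and a system with state decay but balanced weights produce identical steady-state outputs:

```python
# Steady-state output of the isotropic system (equation 3.7): y_inf = -g/((1-a)+h) * r
def y_inf(a, g, h, r):
    return -g / ((1.0 - a) + h) * r

r = 1.0
# System 1: no state decay (a = 1), feedback weighted more than input (h > g)
y1 = y_inf(a=1.0, g=0.2, h=0.25, r=r)
# System 2: state decay (a < 1), balanced weights (g = h)
y2 = y_inf(a=0.95, g=0.2, h=0.2, r=r)
assert abs(y1 - y2) < 1e-12  # identical steady states, different dynamics
```

Both systems show the same incomplete adaptation (|y_∞| < |r|), so steady-state measurements alone cannot tell them apart; only the trial-by-trial dynamics can.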
3.2 Modeling Sensorimotor Bias.

The presence of systematic errors in human motor performance is well documented (e.g., Weber & Daroff, 1971; Soechting & Flanders, 1989; Gordon, Ghilardi, Cooper, & Ghez, 1994). Such sensorimotor biases could arise from a combination of sensory, motor, and processing errors. While bias has not been considered in previous studies using LDS models, failing to account for it can lead to poor estimates of model parameters. Here, we show how sensorimotor bias can be incorporated into the LDS model and how the effects of bias on parameter estimation can be avoided.

It might seem that the simplest way to account for sensorimotor bias would be to add a bias vector b_y to the output equation (equation 2.9):

x_{t+1} = A x_t + G r_t + H y_t + η_t,   (3.8a)
y_t = C x_t + D w_t + b_y + γ_t.         (3.8b)

However, in a stable feedback system, feedback-dependent learning will effectively adapt away the bias. This can be seen by examining the steady state of equation 3.8:

x_∞ = −(A + HC − I)^{-1} (G r + H D w + H b_y),
y_∞ = −C (A + HC − I)^{-1} (G r + H D w + H b_y) + D w + b_y.

Considering the case where A = I and C and H are invertible, the steady state compensates for the bias, x_∞ = −C^{-1}(H^{-1} G r + D w + b_y), and so the bias term vanishes entirely from the asymptotic output: y_∞ = −H^{-1} G r.

The simplest way to capture a stationary sensorimotor bias in an LDS model is to incorporate it into the learning equation. For reasons that will become clear below, we add a bias term −H b_x:

x_{t+1} = A x_t + G r_t + H y_t − H b_x + η_t,   (3.9a)
y_t = C x_t + D w_t + γ_t.                        (3.9b)

Now the bias carries through to the sensorimotor output:

x_∞ = −(A + HC − I)^{-1} (G r + H D w − H b_x),
y_∞ = −C (A + HC − I)^{-1} (G r + H D w − H b_x) + D w.

Again assuming that A = I and that C and H are invertible, the stationary output becomes y_∞ = −H^{-1} G r + b_x, which is the unbiased steady-state output plus the bias vector b_x.

As described in section 2, the sensorimotor biases defined above are closely related to the equilibrium terms x_e, y_e, and so on in equation 2.4. If
equation 3.9 were fit to experimental data, the bias term b_x would capture the combined effects of all the equilibrium terms in equation 2.4a. Similarly, a bias term b_y in the output equation would account for the equilibrium terms in equation 2.4b. However, adding these bias parameters to the model would increase the model complexity with little benefit, as it would be very difficult to interpret these composite terms. Here we show how bias can be removed from the data so that these model parameters can be safely ignored.

With T being the total number of trials or time steps, let x̄ = (1/(T−1)) Σ_{t=1}^{T−1} x_t, and let ū, w̄, and ȳ be defined accordingly. Averaging equation 2.5 over t, we get

(1/(T−1)) Σ_{t=1}^{T−1} x_{t+1} = A x̄ + B ū + b_x + (1/(T−1)) Σ_{t=1}^{T−1} η_t,
ȳ = C x̄ + D w̄ + b_y + (1/(T−1)) Σ_{t=1}^{T−1} γ_t.

With the approximations (1/(T−1)) Σ_{t=1}^{T−1} x_{t+1} ≈ x̄ + O(1/T), (1/(T−1)) Σ_{t=1}^{T−1} η_t ≈ 0 + O(1/√T), and (1/(T−1)) Σ_{t=1}^{T−1} γ_t ≈ 0 + O(1/√T), which are quite good for large T, we get

x̄ = A x̄ + B ū + b_x,
ȳ = C x̄ + D w̄ + b_y.

Subtracting these equations from equation 2.5 leads to

x_{t+1} − x̄ = A(x_t − x̄) + B(u_t − ū) + η_t,   (3.10)
y_t − ȳ = C(x_t − x̄) + D(w_t − w̄) + γ_t.       (3.11)
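The bias-removal argument can be checked numerically. The following sketch (added here, with arbitrary scalar parameters) simulates a noise-free biased system and confirms that the mean-subtracted data satisfy the bias-free dynamics of equation 3.10 up to an O(1/T) residual:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 10_000
a, b, b_x = 0.8, 0.5, 2.0  # scalar LDS with a bias term (arbitrary values)

u = rng.standard_normal(T)
x = np.zeros(T)
for t in range(T - 1):
    x[t + 1] = a * x[t] + b * u[t] + b_x  # noise-free for clarity

# Means over t = 1 .. T-1 (indices 0 .. T-2 here)
x_bar = x[:-1].mean()
u_bar = u[:-1].mean()

# Equation 3.10: mean-subtracted data should obey the bias-free dynamics
resid = (x[1:] - x_bar) - a * (x[:-1] - x_bar) - b * (u[:-1] - u_bar)
# The residual is O(1/T): the bias b_x has been removed by mean subtraction
assert np.max(np.abs(resid)) < 1e-2
```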
Therefore, the mean-subtracted values of the empirical input and output variables obey the same dynamics as the raw data, but without the bias terms. This justifies using equation 2.6 to model experimental data, as long as the inputs and outputs are understood to be the mean-subtracted values.

4 Parameter Estimation

Ultimately, LDS models of sensorimotor learning are useful only if they can be fit to experimental data. The process of selecting the LDS model that best accounts for a sequence of inputs and outputs is called system identification (Ljung, 1999). Here we take a maximum-likelihood approach to system identification. Given a sequence (or sequences) of input and output data,
we wish to determine the model parameters for equation 2.9 for which the data set has the highest likelihood; that is, we want the maximum likelihood estimates (MLE) of the model parameters. Since no analytical solution exists in this case, we employ the expectation-maximization (EM) algorithm to calculate the MLE numerically (Dempster, Laird, & Rubin, 1977). A review of the EM algorithm for LDS (Shumway & Stoffer, 1982; Ghahramani & Hinton, 1996) is presented in the appendix, with one algorithmic refinement. A Matlab implementation of this EM algorithm is freely available online at http://keck.ucsf.edu/∼sabes/software/. Here we discuss several issues that commonly arise when trying to fit LDS models that include sensory feedback.

4.1 Correlation Between Parameter Estimates.

Generally, identification of a system operating under closed loop (i.e., where the output is fed back to the learning rule) is more difficult than if the same system were operating in open loop (no feedback). This is partly because the closed loop makes the system less sensitive to external input (Ljung, 1999). In addition, and perhaps more important for our application, since the output directly affects the state, these two quantities tend to be correlated. This makes it difficult to distinguish their effects on learning, that is, to fit the parameters A and H. To determine the conditions that give rise to this difficulty, consider two hypothetical LDS models,

x_{t+1} = A x_t + G r_t + H y_t + η_t,     (4.1)
x_{t+1} = A′ x_t + G r_t + H′ y_t + η_t,   (4.2)

which are related to each other by A′ = A − δC and H′ = H + δ. These models differ in how much the current state affects the subsequent state directly (A) or through output feedback (H), and the difference is controlled by δ. However, the total effect of the current state is the same in both models: A + HC = A′ + H′C. Distinguishing these two models is thus equivalent to separating the effects of the current state and the feedback on learning. To determine when this distinction is possible, we rewrite the second LDS in terms of A and H:

x_{t+1} = (A − δC) x_t + G r_t + (H + δ) y_t + η_t
        = A x_t + G r_t + H y_t + δ(y_t − C x_t) + η_t
        = A x_t + G r_t + H y_t + δ(D w_t + γ_t) + η_t,   (4.3)

where the last step uses equation 2.9b. Comparing equations 4.1 and 4.3, we see that the two models are easily distinguished if the contribution of the δD w_t term is large. However, in many experimental settings, the
[Figure 3 panels: top row T = 400, bottom row T = 4000; columns √(R/Q) = 1/8, 1, and 8; each panel shows pairwise correlations among the estimates Â, Ĝ, Ĥ, Q̂, and R̂.]
Figure 3: Correlations between LDS parameter estimates across 1000 simulated data sets. Each panel corresponds to a particular value for T and √(R/Q). Simulations used an LDS with parameters A = 0.8, |Gr| = 0.5, H = −0.2, C = 1, D = 0, Q = 1, and zero-mean, white noise inputs r_t with unit variance.
exogenous input has little effect on the ongoing motor output (e.g., in the terminal feedback paradigm described in section 5), implying that D w_t is negligible. In this case, the main difference between these two models is the noise term δγ_t, and the ability to distinguish between the models depends on the relative magnitudes of the variances of the output and learning noise terms, R = Var(γ_t) and Q = Var(η_t). For |R| ≫ |Q|, separation will be relatively easy. In other cases, the signal due to δγ_t may be lost in the learning noise.

We next fit a one-dimensional LDS to simulated data in order to confirm the correlation between the estimates for A and H and to investigate other potential correlations among parameter estimates. In these simulations, the inputs r_t and the learning noise η_t were both zero-mean, gaussian white noise processes. Both variables had unit variance, determining the scale for the other parameters' values, which are listed in Figure 3. Three different values for √(R/Q) and two data set lengths T were used. Note that the input affects learning via the product G r_t, and so the standard deviation of this product quantifies the input effect. Here and below we use |Gr| to refer to this standard deviation. For each pair of values for √(R/Q) and T, we simulated 1000 data sets and estimated the LDS parameters with the EM algorithm. We then calculated the correlations between the various parameter estimates across simulated data sets (see Figure 3). With sufficiently large data sets the MLE are consistent and exhibit little variability
(see Figure 3, bottom row). If the data sets are relatively small, two different effects can be seen, depending on the relative magnitudes of R and Q. As predicted above, when R < Q, there is large variability in the estimates Â and Ĥ, and they are negatively correlated (see Figure 3, top-left panel). In separate simulations (data not shown), we confirmed that this correlation disappears when there are substantial feedthrough inputs, that is, when |Dw| is large. In contrast, when the output noise has large variance, R > Q, the estimate R̂ covaries with the other model parameters (see Figure 3, top-right panel). This effect has an intuitive explanation: when the estimated output variance is larger than the true value, even structured variability in the output is counted as noise. In other words, a large R̂ masks the effects of the other parameters, which are thus estimated to be closer to zero.

Next, we isolate and quantify in more detail the correlation between the estimates Â and Ĥ. We computed the MLE for simulated data under two different constraint conditions. In the first case, only Ĥ was fit to the data, while all other parameters were fixed at their true values (H unknown, A known). In the second case, the sum A + HC and all parameters other than A and H were fixed to their true values (A and H unknown, A + HC known). In this case, Â and Ĥ were constrained to be A − δC and H + δ, respectively, and the value of δ was fit to the data. This condition tests directly whether the maximum likelihood approach can distinguish the effects of the current state and the feedback on learning. Under both of these conditions, there is only a single free parameter, and so a simple line search algorithm was used to find the MLE of the parameter from simulated data. Data were simulated with the same parameters as in Figure 3 while three quantities were varied: √(R/Q), |Gr|, and T (see Figure 4). For a given set of parameter values, we simulated 1000 data sets and found the MLE Ĥ for each. The 95% confidence interval (CI) for Ĥ was computed as the symmetric interval around the true H containing 95% of the fit values Ĥ.

The results for the first constraint case (H unknown) are shown on the left side of Figure 4. Uncertainty in Ĥ is largely invariant to the magnitude of the output noise or the exogenous input signal, |Gr|, although there is a small interaction between the two at high input levels. For later comparison, we note that with T = 1000, we obtain a 95% CI of ±0.05. The results are very different when A + HC is fixed but A and H are unknown (right side of Figure 4). While the uncertainty in Ĥ is independent of the input magnitude, it is highly dependent on the magnitude of the output noise. As predicted, when R is small relative to Q, there is much greater uncertainty in Ĥ compared to the first constraint condition. For example, if √(R/Q) ≈ 2 (comparable to the empirical results shown in Figure 8C below), several thousand trials would be needed to reach a 95% CI of ±0.05. In order to match the precision of Ĥ obtained in the first constraint condition with T = 1000, √(R/Q) ≈ 8 is needed.
[Figure 4: four panels; y-axes show the half-width of the 95% CI of Ĥ; x-axes show √(R/Q) (panels A, C) and data set length (panels B, D).]
Figure 4: Uncertainty in the feedback parameter Ĥ in two different constraint conditions. All data were simulated with parameters A = 0.8, H = −0.2, C = 1, D = 0; Q sets the scale. (A) Variability of Ĥ as a function of the output noise magnitude, when all other parameters, in particular A, are known (T = 400 trials). Lines correspond to different values of the magnitude of the exogenous input signal. (B) Variability of Ĥ as a function of data set length, T, for |Gr| = 0 (no exogenous input). Lines correspond to different levels of output noise. (C, D) Variability of Ĥ when both H and A are unknown, but A + HC, as well as all other parameters, are known; otherwise as in panels A and B, respectively.
4.2 Hypothesis Testing.

One important goal in studying sensorimotor learning is to identify which sensory signal, or signals, drive learning. Consider the case of a k-dimensional vector of potential input signals, r_t. We wish to determine whether the ith component of this input has a significant effect on the evolution of the state, that is, whether G_i, the ith column of
G, is significantly different from zero. Specifically, we would like to test the hypothesis H1: G_i ≠ 0 against the null hypothesis H0: G_i = 0. Given the framework of maximum likelihood estimation, we could use a generic likelihood ratio test in order to assess the significance of the parameters G_i (Draper & Smith, 1998). However, the likelihood ratio test is valid only in the limit of large data sets. Given the complexity of these models, that limit may not be achieved in typical experimental settings. Instead, we propose the use of a permutation test, a simple class of nonparametric, resampling-based hypothesis tests (Good, 2000).

Consider again the null hypothesis H0: G_i = 0, which implies that r_t^i, the ith component of the exogenous input r_t, does not influence the evolution of the system. If H0 were true, then randomly permuting the values of r_t^i across t should not affect the quality of the fit; in any case we expect Ĝ_i to be near zero. This suggests the following permutation test. We randomly permute the ith component of the exogenous input, determine the MLE parameters of the LDS model from the permuted data, and compute a statistic representing the goodness of fit of the MLE: in our case, the log likelihood of the permuted data given the MLE, L_perm. This process is repeated many times. The null hypothesis is then rejected at the (1 − α) level if the fraction of L_perm that exceeds the value of L computed from the original data set is below α. Alternatively, the magnitude of G_i itself could be used as the test statistic, since Ĝ_i is expected to be near zero for the permuted data sets.

To evaluate the usefulness of the permutation test outlined above, we performed a power analysis on data sets simulated from the one-dimensional LDS described in Figure 5. For each combination of parameters, we simulated 100 data sets.
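A minimal sketch of the permutation test follows. To keep it self-contained, it uses a simplified surrogate in which the state is directly observed and the fit reduces to least squares (a full application would refit the LDS by EM and compare log likelihoods, as described above); all parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)

# Scalar LDS with the state directly observed, so the MLE reduces to
# ordinary least squares. Illustrative surrogate only.
T = 400
a, g, q = 0.8, -0.3, 0.05  # true parameters (arbitrary for this sketch)
r = rng.standard_normal(T)
x = np.zeros(T)
for t in range(T - 1):
    x[t + 1] = a * x[t] + g * r[t] + np.sqrt(q) * rng.standard_normal()

def fit_rss(x, r):
    """Least-squares fit of x_{t+1} = a x_t + g r_t; return the residual sum of squares."""
    X = np.column_stack([x[:-1], r[:-1]])
    coef, rss, *_ = np.linalg.lstsq(X, x[1:], rcond=None)
    return rss[0]

rss_orig = fit_rss(x, r)
n_perm = 200
rss_perm = np.array([fit_rss(x, rng.permutation(r)) for _ in range(n_perm)])
# p-value: fraction of permutations that fit at least as well as the real input
p = np.mean(rss_perm <= rss_orig)
assert p < 0.05  # the input r significantly drives learning in this simulation
```

Because the residual sum of squares is a monotone transform of the gaussian log likelihood, comparing RSS values here plays the same role as comparing L_perm to L in the text.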
For each of these data sets, we determined the significance of the scalar G using the permutation test described above with k = i = 1, α = 0.05, and 500 permutations of r_t. The fraction of data sets for which the null hypothesis was (correctly) rejected represents the power of the permutation test for those parameter values. The top panel of Figure 5 shows the power as a function of the input magnitude and trial length T, given √(R/Q) = 2. The lower panel shows the magnitude of the input required to obtain a power of 80%, as a function of √(R/Q) and T. Plots such as these should be used during experimental design. However, since neither G nor the output and learning noise magnitudes are typically known, heuristic values must be used. With √(R/Q) = 2 and |Gr|/√Q = 0.2, approximately 1600 trials are needed to obtain 80% power. Note, however, that the exogenous input signal is often under experimental control. In this case, the same power could be achieved with about 400 trials if the magnitude of that signal were doubled. In general, increasing the amplitude of the input signal will allow the experimenter to achieve any desired statistical power for this test. Practically, however, the size of a perturbative input signal is usually limited by several factors, including physical constraints or a requirement that subjects remain
Modeling Sensorimotor Learning with Linear Dynamical Systems
Figure 5: Power analysis of the permutation test for the significance of G. Simulation parameters: A = 0.8, H = −0.2, C = 1, and D = 0. Q sets the scale. (A) Statistical power when √(R/Q) = 2, as a function of |Gr|. (B) Input magnitude required to achieve 80% power, as a ratio of √Q. In both panels, α = 0.05 and line type indicates the data set length T (100 to 4000).
unaware of the manipulation. Therefore, large numbers of trials may often be required to achieve sufficient power. 4.3 Combining Multiple Data Sets. A practical approach to collecting larger data sets is to perform repeated experiments, either during multiple sessions with a single subject or with multiple subjects. In either case, accurate model fitting requires that the learning rule is stationary (i.e., constant LDS parameters) across sessions or subjects. There are two possible approaches to combining N data sets of length T. First, the data from different sessions can be concatenated to form a “super data set” of length NT. The
S. Cheng and P. Sabes
system identification procedure outlined in the appendix can be directly applied to the super data set with the caveat that the initial state x_{t=1} has to be reset at the beginning of each data set. A second approach can be taken when nominally identical experiments are repeated across sessions, that is, when the input sequences r_t and w_t are the same in each session. In this approach, the inputs and outputs are averaged across sessions, yielding a single average data set with inputs r̃_t and w̃_t and outputs ỹ_t. The dynamics underlying this average data set are obtained by averaging the learning and output equations for each t (see equation 2.9) across data sets:

x̃_{t+1} = A x̃_t + [G H] [r̃_t; ỹ_t] + η̃_t,   (4.4a)

ỹ_t = C x̃_t + D w̃_t + γ̃_t.   (4.4b)
Note that the only difference between this model and that describing the individual data sets is that the noise variances have been scaled: Var(η̃_t) = Q/N and Var(γ̃_t) = R/N. Therefore, the EM algorithm (see the appendix) can be directly applied to average data sets as well. Since both approaches to combining multiple data sets are valid, we ran simulated experiments to determine which approach produces better parameter estimates, that is, tighter confidence intervals. Simulations were performed with the model described in Figure 6, varying R and Q while maintaining a fixed ratio √(R/Q) = 2.
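The 1/N scaling of the noise variances under averaging can be checked numerically. Below is a minimal sketch (all values illustrative): N repetitions of the same underlying output sequence, each with independent output noise of variance R, are averaged, and the residual variance of the average comes out near R/N.

```python
import numpy as np

rng = np.random.default_rng(1)

T, N, R = 1000, 100, 1.0
signal = np.sin(np.linspace(0, 10, T))        # same underlying outputs each session

# N sessions with identical inputs but independent output noise gamma_t.
sessions = signal + rng.normal(scale=np.sqrt(R), size=(N, T))

y_avg = sessions.mean(axis=0)                 # the average data set
resid_var = np.var(y_avg - signal)            # empirical Var of averaged noise

print(resid_var)   # close to R / N = 0.01
```

The same argument applies to the learning noise, giving Var(η̃_t) = Q/N.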
Figure 6: 95% confidence intervals (half-width) for the MLE of the input parameter G, computed from 1000 simulated data sets and plotted against data set length for several values of Q (2 down to 1/8). All simulations were run with √(R/Q) = 2 and gaussian white noise inputs with zero mean and unit variance. A = 0.8, G = −0.3, H = −0.2, C = 1, and D = 0.
The preferred approach for combining data sets depends on which parameter one is trying to estimate. CIs for the exogenous input parameter Ĝ are shown in Figure 6. Variability depends strongly on the number of trials but even more so on the noise variances. For example, with 200 trials, R = 1, and Q = 1/2, the 95% CI is ±0.145. Multiplying the number of trials by 4 results in a 41% improvement in CI, yet dividing the noise variances by 4 yields a 76% improvement. Therefore, for estimating G, it is best to average the data sets. It should be noted, however, that the advantage is somewhat weaker for larger input variances (Figure 6 was generated with unit input variance; other data not shown). By contrast, the variability of the MLE for H depends only mildly on the noise variances (data not shown), if at all. Therefore, increasing the data set length by concatenation produces better estimates for H than decreasing the noise variances by averaging.

4.4 Linear Regression. As noted above, there is no analytic solution for the maximum likelihood estimators of the LDS parameters, so the EM algorithm is used (Shumway & Stoffer, 1982; Ghahramani & Hinton, 1996). While EM has many attractive properties (Dempster et al., 1977), it is a computationally expensive, iterative algorithm, and it can get caught in local optima. The computation time becomes particularly problematic when system identification is used in conjunction with resampling schemes for statistical analyses such as bootstrapping (Efron & Tibshirani, 1993) or the permutation tests described above. It is therefore tempting to circumvent the inefficiencies of the EM algorithm by converting the problem into one that can be solved analytically. Specifically, there have been attempts to fit a subclass of LDS models using linear regression (e.g., Donchin et al., 2003).
It is in fact possible to find a regression equation for the input parameters G, H, and D if we assume that A = I, a fairly strict constraint implying no state decay. There are then two possible approaches to transforming system identification into a linear regression problem. As we describe here, however, both approaches can lead to inefficient, or worse, inconsistent estimates. First we consider the subtraction approach. If A = I, the following expression for the trial-by-trial change in output can be derived from the LDS in equation 2.6 (i.e., the form without explicit feedback):

y_{t+1} − y_t = C B u_t + D(w_{t+1} − w_t) + γ_{t+1} − γ_t + C η_t.   (4.5)
This equation suggests that the trial-by-trial changes in yt can be regressed on inputs ut and wt in order to determine the parameters C B and D. Obtaining an estimate of B requires knowing C; however, that is not a significant constraint due to the existence of the similarity classes described in equation 2.7. One complication with this approach is that the noise terms in equation 4.5, γt+1 − γt + Cηt , are correlated across trials. Such colored noise can be accommodated in linear regression (Draper & Smith, 1998);
Figure 7: Bias in Ĥ using two different linear regression fits of simulated data, plotted against R/Q. The data sets were simulated with no exogenous inputs and the LDS model parameters A = 1, G = 0, H = −0.2, C = 1, and D = 0. Q sets the scale. The lower (black) data points represent the average Ĥ over 1000 simulated data sets using the subtraction approach. The upper (gray) data points are for the summation approach. The true value of H is shown as the dotted line. Error bars represent standard deviations.
however, the ratio of R and Q would have to be known in order to construct an appropriate covariance matrix for the noise. A more serious problem with the regression model in equation 4.5 arises in feedback systems, in which B u_t = G r_t + H y_t. In this case, the right-hand side of equation 4.5 contains the same term, y_t, as the left-hand side. This correlation between the independent variable y_t and the dependent variable y_{t+1} − y_t leads to a biased estimate for H. As an example, consider a sequence y_t that is generated by a pure white-noise process with no feedback dynamics. If the regression model of equation 4.5 were applied to this sequence, with C = 1 and no exogenous inputs, the expected value of the regression parameter would be Ĥ = −1, compared to the true value H = 0. Application of the regression model in equation 4.5 to simulated data confirms this pattern of bias in Ĥ (see Figure 7, black line). In the general multivariate case, Ĥ will be biased toward −I as long as the output noise is large compared to the learning noise. The bias described above arises from the fact that the term y_t occurs on both sides of the regression equation. This problem can be circumvented by deriving a simple expression for the output as a linear function of the initial state and all inputs up to, but not including, the current time step:
y_t = C x_1 + C Σ_{τ=1}^{t−1} (G r_τ + H y_τ + η_τ) + D w_t + γ_t.   (4.6)
This expression suggests the summation approach to system identification by linear regression. The set of equations 4.6 for each t can be combined into the standard matrix regression equation Y = Xβ + noise:

\[
\begin{pmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_t^T \\ \vdots \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & w_1^T \\
1 & r_1^T & y_1^T & w_2^T \\
\vdots & \vdots & \vdots & \vdots \\
1 & \sum_{\tau=1}^{t-1} r_\tau^T & \sum_{\tau=1}^{t-1} y_\tau^T & w_t^T \\
\vdots & \vdots & \vdots & \vdots
\end{pmatrix}
\begin{pmatrix} x_1^T C^T \\ G^T C^T \\ H^T C^T \\ D^T \end{pmatrix}
+
\begin{pmatrix} 0 \\ \eta_1^T \\ \vdots \\ \sum_{\tau=1}^{t-1} \eta_\tau^T \\ \vdots \end{pmatrix} C^T
+
\begin{pmatrix} \gamma_1^T \\ \gamma_2^T \\ \vdots \\ \gamma_t^T \\ \vdots \end{pmatrix}.
\tag{4.7}
\]
For a given value of C, regression would produce estimates for x_1, G, H, and D. One pitfall with this approach is that the variance of the noise terms grows linearly with the trial count t:

Var(C Σ_{τ=1}^{t−1} η_τ + γ_t) = (t − 1) C Q C^T + R.   (4.8)
This problem is negligible if |Q| ≪ |R|. Otherwise, as the trial count increases, the accumulated learning noise will dominate all other correlations and the estimated parameter values will approach zero. This effect can be seen for our simulated data sets in Figure 7, gray line. Of course, this problem could be addressed by modeling the full covariance matrix of the noise terms and including them in the regression (equation 4.8 gives the diagonal terms of the matrix). However, this requires knowing Q and R in advance. Also, the linear growth in variance means that later terms will be heavily discounted, forfeiting the benefit of large data sets.

5 Example: Reaching with Shifted Visual Feedback

Here we present an application of the LDS framework using a well-studied form of feedback error learning: reach adaptation due to shifted visual feedback. In this experiment, subjects made reaching movements in a virtual environment with artificially shifted visual feedback of the hand. The goal is to determine whether a dynamically changing feedback shift drives reach adaptation and, if so, what the dynamics of learning are. Subjects were seated with their unseen right arm resting on a table. At the beginning of each trial, the index fingertip was positioned at a fixed start location, a virtual visual target was displayed at location g_t, and after a short delay an audible go signal instructed subjects to reach to the visual target. The movement began with no visual feedback, but just before the
end of the movement (determined online from fingertip velocity), a white disk indicating the fingertip position appeared (“terminal feedback”). The feedback disk tracked the location of the fingertip, offset from the finger by a vector rt . The fingertip position at the end of the movement is represented by f t . Each subject performed 200 reach trials. The sequence of visual shifts rt was a random walk in two-dimensional space. The two components of each step, rt+1 − rt , were drawn independently from a zero-mean normal distribution with a standard deviation of 10 mm and with the step magnitude limited to 20 mm. In addition, each component of rt was limited to the ±60 mm range, with reflective boundaries. These limitations were chosen to ensure that subjects did not become aware of the visual shift. In order to model this learning process with an LDS, we need to define its inputs and outputs. Reach adaptation is traditionally quantified with the after-effect, that is, the reach error yt = f t − gt measured in a no-feedback test reach (Held & Gottlieb, 1958). In our case, the terminal feedback appeared sufficiently late to preclude feedback-driven movement corrections. Therefore, the error on each movement, yt , is a trial-by-trial measure of the state of adaptation. We will also define the state of the system xt to be the mean reach error at a given trial, that is, the average yt that would be measured across repeated reaches if learning were frozen at trial t. This definition is consistent with the output equation of the LDS model, equation 2.9b, if two constraints are placed on the LDS: D = 0 and C = I . The first constraint is valid because of the late onset of the visual feedback. The second constraint resolves the ambiguity of the remaining LDS parameters, as discussed in section 2.2. We will also assume that the input to the LDS is the visually perceived error, ut = yt + rt . 
Thus, we are modeling reach adaptation as error-corrective learning, with a target output y_t* = −r_t (see section 2.3). Note that with this input variable, B u_t = H y_t + G r_t if B = H = G. Using the EM algorithm, the sequence of visually perceived errors (inputs) and reach errors (outputs) from each subject was used to estimate the LDS parameters A, B, Q, and R. The parameter fits from four subjects' data are shown in Figure 8A. The decay parameter Â is nearly diagonal for all subjects, implying that the two components of the state do not mix and thus evolve independently. Also, the diagonal terms of Â are close to 1, which means there is little decay of the adaptive state from one trial to the next. The individual components of the input parameter B̂ are considerably more variable across subjects. A useful quantity to consider is the square root of the determinant of the estimated input matrix, √|det B̂|, which is the geometric mean of the magnitudes of the eigenvalues of B̂. This value, shown in Figure 8B, is a scalar measure of the incremental state correction due to the visually perceived error on each trial. To determine whether these responses are statistically significant, we performed the permutation test
Figure 8: (A) Estimated LDS parameters Â, B̂, Q̂, and R̂ for four subjects (Q̂ and R̂ in mm²). Labels on the x-axis indicate the components of each matrix. Each bar shading corresponds to a different subject (S1–S4). (B) Results of permutation test for the input parameter B for each subject. The square marks √|det B̂|, and the error bars show the 95% confidence interval for that value given H0: det B = 0, generated from 1000 permuted data sets. (C) Estimate of the ratio of output to learning noise standard deviation, (det R̂/det Q̂)^{1/4}.
described in section 4.2 on the value of √|det B̂|. The null hypothesis was H0: det B = 0.² Figure 8B shows that H0 can be rejected with 95% confidence for all four subjects, and so we conclude that the visually perceived error significantly contributes to the dynamics of reach adaptation. As discussed in section 4, the statistical properties of the MLE parameters depend to a large degree on the ratio of the learning and output noise terms. In the present example, the covariance matrices are two-dimensional, and so we quantify the magnitude of the noise terms by the square roots of their determinants, √(det Q̂) and √(det R̂), respectively. The ratio of standard deviations is thus (det R̂/det Q̂)^{1/4}. The experimental value of this ratio ranges from 1.2 to 3.7 with a mean of 2.5 (see Figure 8C). These novel findings suggest that learning noise might contribute almost as much to behavioral variability as motor performance noise.
² Since det B = 0 could result from any singular matrix B, even if B ≠ 0, this test is more stringent than testing B = 0. A singular matrix B implies that some components of the input do not affect the dynamics.
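The noise-magnitude summary used above is straightforward to compute from fitted 2×2 covariance matrices. A small sketch with illustrative matrices (not the subjects' fits):

```python
import numpy as np

def noise_std_ratio(Q_hat, R_hat):
    """Ratio of output to learning noise standard deviations,
    (det R / det Q)^(1/4), for 2x2 covariance estimates."""
    return (np.linalg.det(R_hat) / np.linalg.det(Q_hat)) ** 0.25

# Illustrative values only: output noise std is 4x the learning noise std
# in each direction, so the ratio should come out to 4.
Q_hat = np.array([[1.0, 0.0], [0.0, 1.0]])
R_hat = np.array([[16.0, 0.0], [0.0, 16.0]])

ratio = noise_std_ratio(Q_hat, R_hat)
print(ratio)   # 4.0
```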
The results presented in Figure 8 depend on the assumption that the visually perceived error u_t drives learning. This guess was based on the fact that u_t is visually accessible to the subject and is a direct measure of the subject's visually perceived task performance. However, several alternative inputs exist, even if we restrict consideration to the variables already discussed. Note again that the input term B u_t can be expanded to H y_t + G r_t. The hypothesis that the visually perceived error drives learning is thus equivalent to the claim that H = G. However, the true reach error y_t and the visual shift r_t could affect learning independently. These variables are not accessible from visual feedback alone, but they could be estimated from comparisons of visual and proprioceptive signals or the predictions of internal forward models. While these estimates might be noisier than those derived directly from vision, this sensory variability is included in the learning noise term, as discussed in section 2.1. Note that an incorrect choice of input variable u_t means that the LDS model cannot capture the true learning dynamics, and so the resulting estimate of learning noise should be high. Indeed, one explanation for the large Q in the model fits above is that an important input signal was missing from the model. The LDS framework itself can be used to address such issues. In this case, we can test the alternative H ≠ G against the null hypothesis H = G by asking whether a significantly better fit is obtained with two inputs, [u_t, r_t], compared to the single input u_t. The permutation test, with permuted r_t, would be used. We showed in section 4.1, however, that such comparisons require more than the 200 trials per subject collected here.

6 Discussion

Quantitative models, even very simple ones, can be extremely powerful tools for studying behavior.
They are often used to clarify difficult concepts, quantitatively test intuitive ideas, and rapidly test alternative hypotheses with virtual experiments. However, successful application of such models depends on understanding the properties of the model class being used. The class of LDS models is an increasingly popular tool for studying sensorimotor learning. We have therefore studied the properties of this model class and identified the key issues that arise in their application. We explored the steady-state behavior of LDS models and related that behavior to the traditional measures of learning in blocked-exposure experimental designs. These results demonstrate why the dynamical systems approach provides a clearer picture of the mechanisms of sensorimotor learning. We described the EM algorithm for system identification and discussed some of the details and difficulties involved in estimating model parameters from empirical data. Most important, in closed-loop systems, it is difficult to separate the effects of state decay (A) and feedback (H) on the dynamics
of learning. Note that this limitation is an example of a more general difficulty with all linear models. If any two variables in either the learning or output equations are correlated, then it will be difficult to independently estimate the coefficients of those variables. For example, if the exogenous learning signal rt and feedthrough input wt are correlated in a feedback system, then it is difficult to distinguish the exogenous and feedback learning parameters G and H. As a second example, if the exogenous input vector rt is autocorrelated with a timescale longer than that of learning, then the input and the state will be correlated across trials. In this case, it would be difficult to distinguish A and G. Such is likely to be the case in experiments that include blocks of constant input, giving a compelling argument for experimental designs with random manipulations. One attractive feature of LDS models is that they contain two sources of variability: an output or performance noise and a learning noise. Both sources of variability will contribute to the overall variance in any measure of performance. We know of no prior attempts to quantify these variables independently, despite the fact that they are conceptually quite different. In addition, the ratio of the two noise contributions has a large effect on the statistical properties of LDS model fits. We motivated the LDS model class as a linearization of the true dynamics about equilibrium values for the state and inputs. Linearization is justifiable for modeling a set of movements with similar kinematics (e.g., repeated reaches with small variations in start and end locations) and small driving signals. However, many experiments consist of a set of distinct trial types that are quite different from each other, for example, a task with K different reach targets. It is a straightforward extension of the LDS model presented here to include separate parameters and state variables for each of K trial types. 
In this case, the effect of the input variables (feedback and exogenous) on a given trial will be different for each of the K state variables (i.e., for the future performance of each trial type). The parameters G_ij and H_ij that describe these cross-condition effects (the effects of a type i trial on the jth state variable) are essentially measures of generalization across the trial types (Thoroughman & Shadmehr, 2000; Donchin et al., 2003). In addition, each trial type could be associated with a different learning noise variance Q_i and output noise variance R_i to account for signal-dependent noise. All of the practical issues raised in this letter apply, except that additional parameters (whose number goes as K²) will require more data for equivalent power. Finally, we note that if the output is known to be highly nonlinear, it is fairly straightforward to replace the linear output equation, equation 2.6b, with a known, nonlinear model of the state-dependent output, equation 2.2b. In that case, the Kalman smoother in the E-step of the EM algorithm would have to be replaced by the extended Kalman smoother and the closed-form solution of the M-step would likely have to be replaced with an iterative optimization routine.
Appendix: Maximum Likelihood Estimation

We take a maximum likelihood approach to system identification. The maximum likelihood estimator (MLE) for the LDS parameters is given by

Θ̂ = argmax_Θ log P(y_1, . . . , y_T | Θ; u_1, . . . , u_T, w_1, . . . , w_T),   (A.1)

where Θ ≡ {A, B, C, D, R, Q} is the complete set of parameters of the model in equation 2.6. In the following, we will suppress the explicit dependence on the inputs, u_1, . . . , u_T and w_1, . . . , w_T, and use the notation X_t ≡ {x_1, . . . , x_t} and Y_t ≡ {y_1, . . . , y_t}. Generally, equation A.1 cannot be solved analytically, and numerical optimization is needed. Here we discuss the application of the EM algorithm (Dempster et al., 1977) to system identification of the LDS model defined in equation 2.6. The EM algorithm is chosen for its attractive numerical and computational properties. In most practical cases, it is numerically stable; every iteration increases the log likelihood monotonically:

log P(Y_T | Θ̂_{i+1}) ≥ log P(Y_T | Θ̂_i),   (A.2)

where Θ̂_i is the parameter estimate in the ith iteration, and convergence to a stationary point of the log likelihood is guaranteed (Dempster et al., 1977). In addition, the two iterative steps of the EM algorithm are often easy to implement. The E-step consists of calculating the expected value of the complete log likelihood, log P(X_T, Y_T | Θ), as a function of Θ, given the current parameter estimate Θ̂_i:

Φ(Θ, Θ̂_i) = E(log P(X_T, Y_T | Θ))_{Y_T, Θ̂_i}.   (A.3)

In the M-step, the parameters that maximize the expected log likelihood are found:

Θ̂_{i+1} = argmax_Θ Φ(Θ, Θ̂_i).   (A.4)
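As a concrete preview of the E-step machinery developed below, here is a minimal scalar (1-D) sketch of the Kalman forward pass together with the innovations log likelihood. The helper name and test signal are illustrative, and the scalar form is a simplification of the multivariate equations:

```python
import numpy as np

def kalman_forward_1d(y, u, w, A, B, C, D, Q, R, pi, Pi):
    """Scalar Kalman forward pass with the innovations log likelihood.
    Returns the filtered state estimates and log P(Y_T | parameters)."""
    T = len(y)
    x_pred, V_pred = pi, Pi                    # one-step predictions, t = 1
    x_filt = np.empty(T)
    loglik = 0.0
    for t in range(T):
        delta = y[t] - C * x_pred - D * w[t]   # innovation
        S = C * V_pred * C + R                 # innovation variance
        K = V_pred * C / S                     # Kalman gain
        x_filt[t] = x_pred + K * delta         # filtered state
        V_filt = (1.0 - K * C) * V_pred        # filtered variance
        loglik += -0.5 * (np.log(2 * np.pi * S) + delta ** 2 / S)
        x_pred = A * x_filt[t] + B * u[t]      # predict the next trial
        V_pred = Q + A * V_filt * A
    return x_filt, loglik

# Sanity check: with near-zero output noise and C = 1, the filter tracks y.
rng = np.random.default_rng(3)
y = rng.normal(size=50)
u = np.zeros(50)
w = np.zeros(50)
x_filt, ll = kalman_forward_1d(y, u, w, A=0.8, B=0.0, C=1.0, D=0.0,
                               Q=1.0, R=1e-9, pi=0.0, Pi=1.0)
```

The backward smoothing pass and the M-step updates would build on these filtered quantities, as described next.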
The starting point in the formulation of the EM algorithm is the derivation of the complete likelihood function, which is generally straightforward if the likelihood factorizes. Thus, we begin by asking whether the likelihood of an LDS model factorizes, even when there are feedback loops (cf. equation 2.9). From the graphical model in Figure 9, it is evident that y_t and x_{t+1} are conditionally independent of all other previous states and inputs, given x_t. The joint probability of y_t and x_{t+1}, given x_t, factors as P(x_{t+1}, y_t | x_t) = P(x_{t+1} | y_t, x_t) P(y_t | x_t).
· · · → x_{t−1} → x_t → x_{t+1} → · · ·
           ↓        ↓        ↓
        y_{t−1}    y_t    y_{t+1}

Figure 9: Graphical model of the statistical relationship between the states and the outputs of the closed-loop system. The dependence on the deterministic inputs has been suppressed for clarity.
The complete likelihood function is thus

P(X_T, Y_T) = P(x_1) ∏_{t=1}^{T−1} P(x_{t+1} | y_t, x_t) ∏_{t=1}^{T} P(y_t | x_t).   (A.5)
This factorization means that for the purposes of this algorithm, we can regard the feedback as just another input variable. This view corresponds to the direct approach to closed-loop system identification (Ljung, 1999). The two steps of the EM algorithm for identifying the LDS model in equation 2.6, when B = D = 0 and C is known, were first reported by Shumway and Stoffer (1982). A more general version, which included estimation of C, was presented by Ghahramani and Hinton (1996). The EM algorithm for the general LDS of equation 2.6 is a straightforward extension, and we present it here without derivation.

A.1 E-Step. The E-step consists of calculating the following expectations and covariances:

x̂_t ≡ E(x_t | Y_T, Θ̂_i),   (A.6a)

V_t ≡ cov(x_t | Y_T, Θ̂_i),   (A.6b)

V_{t,τ} ≡ cov(x_t, x_τ | Y_T, Θ̂_i).   (A.6c)
These are computed by Kalman smoothing (Anderson & Moore, 1979), which consists of two passes through the sequence of trials. The forward pass is specified by

x_t^{t−1} = A x_{t−1}^{t−1} + B u_{t−1},   (A.7a)

V_t^{t−1} = Q + A V_{t−1}^{t−1} A^T,   (A.7b)

K_t = V_t^{t−1} C^T (C V_t^{t−1} C^T + R)^{−1},   (A.7c)

x_t^t = x_t^{t−1} + K_t (y_t − C x_t^{t−1} − D w_t),   (A.7d)
V_t^t = (I − K_t C) V_t^{t−1}.   (A.7e)
This pass is initialized with x_1^0 = π and V_1^0 = Π, where π is the estimate for the initial state and Π is its variance. If there are multiple data sets, all initial state estimates are set to π with variance Π. In fact, these parameters are included in Θ and will be estimated in the M-step. The backward pass is initialized with x_T^T and V_T^T from the last iteration of the forward pass and is given by

J_t = V_t^t A^T (V_{t+1}^t)^{−1},   (A.8a)

x_t^T = x_t^t + J_t (x_{t+1}^T − x_{t+1}^t),   (A.8b)

V_t^T = V_t^t + J_t (V_{t+1}^T − V_{t+1}^t) J_t^T.   (A.8c)
The only quantity that remains to be computed is the covariance V_{t+1,t}, for which we present a closed-form expression:

V_{t+1,t}^T = V_{t+1}^T J_t^T.   (A.8d)
It is simple to show that this expression is equivalent to the recursive equation given by Shumway and Stoffer (1982) and Ghahramani and Hinton (1996). With the previous estimates of the parameters, Θ̂_i, and the state estimates from the E-step, it is possible to compute the value of the incomplete log likelihood function using the innovations form (Shumway & Stoffer, 1982):
log P(Y_T | Θ̂_i) = −(mT/2) log(2π) − (1/2) Σ_{t=1}^{T} log |R_t| − (1/2) Σ_{t=1}^{T} δ_t^T R_t^{−1} δ_t,   (A.9)
where δ_t = y_t − C x_t^{t−1} − D w_t are the innovations, R_t = C V_t^{t−1} C^T + R their variances, and m is the dimensionality of the output vectors y_t.

A.2 M-Step. The quantities computed in the E-step are used in the M-step to determine the argmax of the expected complete log likelihood Φ(Θ, Θ̂_i). Using the definitions P_t ≡ x̂_t x̂_t^T + V_t and P_{t,τ} ≡ x̂_t x̂_τ^T + V_{t,τ}, the solution to the M-step is given by:

π = x̂_1,   (A.10a)

Π = V_1,   (A.10b)
\[
[A\ B] = \left[\textstyle\sum P_{t+1,t} \quad \sum \hat{x}_{t+1} u_t^T\right]
\begin{pmatrix} \sum P_t & \sum \hat{x}_t u_t^T \\ \sum u_t \hat{x}_t^T & \sum u_t u_t^T \end{pmatrix}^{-1},
\tag{A.10c}
\]

\[
[C\ D] = \left[\textstyle\sum y_t \hat{x}_t^T \quad \sum y_t w_t^T\right]
\begin{pmatrix} \sum P_t & \sum \hat{x}_t w_t^T \\ \sum w_t \hat{x}_t^T & \sum w_t w_t^T \end{pmatrix}^{-1},
\tag{A.10d}
\]
Q = (1/(T−1)) Σ_{t=1}^{T−1} (P_{t+1} − A P_{t,t+1} − B u_t x̂_{t+1}^T),   (A.10e)

R = (1/T) Σ_{t=1}^{T} (y_t − C x̂_t − D w_t) y_t^T,   (A.10f)
where Σ = Σ_{t=1}^{T−1} in equation A.10c and Σ = Σ_{t=1}^{T} in equation A.10d. The parameters A and B in equation A.10e are the current best estimates computed from equation A.10c, and C and D in equation A.10f are the solutions from equation A.10d. The above equations, except for equations A.10a and A.10b, generalize to multiple data sets. The sums are then understood to extend over all the data sets. For multiple data sets, equation A.10a is replaced by an average over the estimates of the initial state, x̂_1^(i), across the N data sets:
π = (1/N) Σ_{i=1}^{N} x̂_1^(i).   (A.11a)
Its variance includes the variances of the initial state estimates as well as variations of the initial state across the data sets:

Π = (1/N) Σ_{i=1}^{N} V_1^(i) + (1/N) Σ_{i=1}^{N} (x̂_1^(i) − π)(x̂_1^(i) − π)^T.   (A.11b)
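The multi-data-set initialization above is simple to implement. A scalar sketch (the vector case would replace the squared deviations with outer products; function name and values are illustrative):

```python
import numpy as np

def combine_initial_states(x1_hats, V1s):
    """Combine per-data-set initial-state estimates into the mean pi and
    its variance Pi (scalar version of equations A.11a and A.11b)."""
    x1_hats = np.asarray(x1_hats, dtype=float)
    V1s = np.asarray(V1s, dtype=float)
    pi = x1_hats.mean()                               # average initial state
    Pi = V1s.mean() + ((x1_hats - pi) ** 2).mean()    # within + between variance
    return pi, Pi

# Two data sets with initial-state estimates 1 and 3, each with variance 0.5.
pi, Pi = combine_initial_states([1.0, 3.0], [0.5, 0.5])
print(pi, Pi)   # 2.0 1.5
```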
Acknowledgments

This work was supported by the Swartz Foundation, the Alfred P. Sloan Foundation, the McKnight Endowment Fund for Neuroscience, the National Eye Institute (R01 EY015679), and HHMI Biomedical Research Support Program grant 5300246 to the UCSF School of Medicine.

References

Anderson, B. D. O., & Moore, J. B. (1979). Optimal filtering. Englewood Cliffs, NJ: Prentice Hall.
Baddeley, R. J., Ingram, H. A., & Miall, R. C. (2003). System identification applied to a visuomotor task: Near-optimal human performance in a noisy changing task. J. Neurosci., 23(7), 3066–3075.
Baraduc, P., & Wolpert, D. M. (2002). Adaptation to a visuomotor shift depends on the starting posture. J. Neurophysiol., 88(2), 973–981.
Clamann, H. P. (1969). Statistical analysis of motor unit firing patterns in a human skeletal muscle. Biophys. J., 9(10), 1233–1251.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society, Series B, 39, 1–38.
Donchin, O., Francis, J. T., & Shadmehr, R. (2003). Quantifying generalization from trial-by-trial behavior of adaptive systems that learn with basis functions: Theory and experiments in human motor control. J. Neurosci., 23(27), 9032–9045.
Draper, N. R., & Smith, H. (1998). Applied regression analysis (3rd ed.). New York: Wiley.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Favilla, M., Hening, W., & Ghez, C. (1989). Trajectory control in targeted force impulses. VI. Independent specification of response amplitude and direction. Exp. Brain Res., 75(2), 280–294.
Flanagan, J. R., & Rao, A. K. (1995). Trajectory adaptation to a nonlinear visuomotor transformation: Evidence of motion planning in visually perceived space. J. Neurophysiol., 74(5), 2174–2178.
Ghahramani, Z., & Hinton, G. E. (1996). Parameter estimation for linear dynamical systems (Tech. Rep. No. CRG-TR-96-2). Toronto: University of Toronto.
Good, P. I. (2000). Permutation tests: A practical guide to resampling methods for testing hypotheses (2nd ed.). New York: Springer-Verlag.
Gordon, J., Ghilardi, M. F., Cooper, S. E., & Ghez, C. (1994). Accuracy of planar reaching movements. II. Systematic extent errors resulting from inertial anisotropy. Exp. Brain Res., 99(1), 112–130.
Harris, C. M., & Wolpert, D. M. (1998). Signal-dependent noise determines motor planning. Nature, 394(6695), 780–784.
Hay, J. C., & Pick, H. L. (1966). Visual and proprioceptive adaptation to optical displacement of the visual stimulus. J. Exp. Psychol., 71(1), 150–158.
Held, R., & Gottlieb, N. (1958). Technique for studying adaptation to disarranged hand-eye coordination. Percept. Mot. Skills, 8, 83–86.
Houde, J. F., & Jordan, M. I. (1998). Sensorimotor adaptation in speech production. Science, 279(5354), 1213–1216.
Jordan, M. I., & Rumelhart, D. E. (1992). Forward models: Supervised learning with a distal teacher. Cognitive Science, 16(3), 307–354.
Kailath, T. (1980). Linear systems. Englewood Cliffs, NJ: Prentice Hall.
Krakauer, J. W., Pine, Z. M., Ghilardi, M. F., & Ghez, C. (2000). Learning of visuomotor transformations for vectorial planning of reaching trajectories. J. Neurosci., 20(23), 8916–8924.
Lackner, J. R., & Dizio, P. (1994). Rapid adaptation to Coriolis force perturbations of arm trajectory. J. Neurophysiol., 72(1), 299–313.
Ljung, L. (1999). System identification: Theory for the user (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Matthews, P. B. (1996). Relationship of firing intervals of human motor units to the trajectory of post-spike after-hyperpolarization and synaptic noise. J. Physiol., 492, 597–628.
Miles, F., & Fuller, J. (1974). Adaptive plasticity in the vestibulo-ocular responses of the rhesus monkey. Brain Res., 80(3), 512–516.
Norris, S., Greger, B., Martin, T., & Thach, W. (2001). Prism adaptation of reaching is dependent on the type of visual feedback of hand and target position. Brain Res., 905(1–2), 207–219.
Pine, Z. M., Krakauer, J. W., Gordon, J., & Ghez, C. (1996). Learning of scaling factors and reference axes for reaching movements. Neuroreport, 7(14), 2357–2361.
Redding, G., & Wallace, B. (1990). Effects on prism adaptation of duration and timing of visual feedback during pointing. J. Mot. Behav., 22(2), 209–224.
Scheidt, R., Dingwell, J., & Mussa-Ivaldi, F. (2001). Learning to move amid uncertainty. J. Neurophysiol., 86(2), 971–985.
Shumway, R. H., & Stoffer, D. S. (1982). An approach to time series smoothing and forecasting using the EM algorithm. J. Time Series Analysis, 3(4), 253–264.
Soechting, J. F., & Flanders, M. (1989). Errors in pointing are due to approximations in sensorimotor transformations. J. Neurophysiol., 62(2), 595–608.
Thoroughman, K. A., & Shadmehr, R. (2000). Learning of action through adaptive combination of motor primitives. Nature, 407, 742–747.
Todorov, E., & Jordan, M. (2002). Optimal feedback control as a theory of motor coordination. Nat. Neurosci., 5(11), 1226–1235.
von Helmholtz, H. (1867). Handbuch der physiologischen Optik. Leipzig: Leopold Voss.
Weber, R. B., & Daroff, R. B. (1971). The metrics of horizontal saccadic eye movements in normal humans. Vision Res., 11(9), 921–928.
Welch, R. B., Choe, C. S., & Heinrich, D. R. (1974). Evidence for a three-component model of prism adaptation. J. Exp. Psychol., 103(4), 700–705.
Wolpert, D. M., Ghahramani, Z., & Jordan, M. I. (1995). Are arm trajectories planned in kinematic or dynamic coordinates? An adaptation study. Exp. Brain Res., 103(3), 460–470.
LETTER
Communicated by Terrence Sejnowski
Changing Roles for Temporal Representation of Odorant During the Oscillatory Response of the Olfactory Bulb

Soyoun Kim
[email protected]
Benjamin H. Singer
[email protected]
Michal Zochowski
[email protected] Department of Physics and Biophysics Research Division, University of Michigan, Ann Arbor, MI 48109, U.S.A.
It has been hypothesized that the brain uses combinatorial as well as temporal coding strategies to represent stimulus properties. The mechanisms and properties of the temporal coding remain undetermined, although it has been postulated that oscillations can mediate formation of this type of code. Here we use a generic model of the vertebrate olfactory bulb to explore the possible role of oscillatory behavior in temporal coding. We show that three mechanisms—synaptic inhibition, slow self-inhibition, and input properties—mediate formation of a temporal sequence of simultaneous activations of glomerular modules associated with specific odorants within the oscillatory response. The sequence formed depends on the relative properties of odorant features and thus may mediate discrimination of odorants activating overlapping sets of glomeruli. We suggest that period-doubling transitions may be driven through excitatory feedback from a portion of the olfactory network acting as a coincidence modulator. Furthermore, we hypothesize that the period-doubling transition transforms the temporal code from a roster of odorant components to a signal of odorant identity and facilitates discrimination of individual odorants within mixtures.

1 Introduction

Odor encoding is a spatially distributed process resulting from activation of different cell populations. However, it has been postulated that in addition to combinatorial codes, the brain may use temporal coding strategies based on synchronization of cell activities within or between different functional networks (Laurent et al., 2001; Laurent, Wehr, & Davidowitz, 1996; Wehr & Laurent, 1996). While the basic properties of combinatorial coding have been relatively well established, the properties of temporal coding are largely undetermined. It has been hypothesized

Neural Computation 18, 794–816 (2006)
© 2006 Massachusetts Institute of Technology
Temporal Patterning During Odorant Processing
that synchronization, mediated through ubiquitously observed stimulus-evoked oscillations (Eckhorn, 1994; Gelperin & Tank, 1990; Laurent & Naraghi, 1994), can underlie binding between different stimulus features and thus promote formation of neural representations of processed stimuli (Eckhorn, 2000; Singer, 1999; von der Malsburg, 1995, 1999). Since the discovery of odorant-evoked oscillations in the olfactory system, it has been shown that the presentation of odorant evokes different types of oscillatory patterning across species (Adrian, 1950; Hughes & Mazurowski, 1962; Lam, Cohen, Wachowiak, & Zochowski, 2000; Laurent & Naraghi, 1994; MacLeod, Backer, & Laurent, 1998; Wehr & Laurent, 1996). The mechanisms and the properties of the observed oscillations have been studied extensively, but their role during information processing has not yet been established. However, it has been postulated that, as in the case of other sensory modalities, these oscillations mediate binding of different odorant features that are detected by the next processing stages in the brain (the piriform cortex or mushroom body in vertebrates or insects, respectively; MacLeod et al., 1998). This hypothesis was partially confirmed by showing that oscillations in the insect mediate behavioral discrimination of structurally similar odorants, linking oscillatory patterning with cognitive processes (Hosler, Buxton, & Smith, 2000; Stopfer, Bhagavan, Smith, & Laurent, 1997; Teyke & Gelperin, 1999). The olfactory system displays a high degree of structural homology among vertebrate phyla, both in the laminar structure, cell types, and connectivities found in the olfactory bulb (OB) itself and in the centripetal projections of the olfactory bulb to secondary structures (Allison, 1953; Crosby & Humphrey, 1939).
In the vertebrate OB, oscillations are primarily mediated by the interaction of the excitatory M/T cells, and inhibitory periglomerular (PG) and granule cells (Aroniadou-Anderjaska, Zhou, Priest, Ennis, & Shipley, 2000; Rall & Shepherd, 1968). It has been reported that odor stimuli elicit different frequency ranges of oscillations in the vertebrate (Bressler, 1984; Bressler & Freeman, 1980; Eeckman & Freeman, 1990; Gray & Skinner, 1988; Lam et al., 2000) as well as in mollusk (Inokuma, Inoue, Watanabe, & Kirino, 2002). Also a number of groups have addressed the function of the insect antennal lobe (AL) (homologue of OB in vertebrates) through mathematical models and have explored the generation of oscillations from the point of view of network dynamics and cellular biophysics (Bazhenov, Stopfer, Rabinovich, Abarbanel, et al., 2001; Bazhenov, Stopfer, Rabinovich, Huerta, et al., 2001; Brody & Hopfield, 2003; Hendin, Horn, & Tsodyks, 1998; Li & Hertz, 2000; Linster & Cleland, 2001; Olufsen, Whittington, Camperi, & Kopell, 2003); however, the role and the mechanism of these oscillations are still not clear. Our earlier experiments showed that the presentation of odorant evokes rostral (14 Hz) and caudal (14 and 7 Hz) oscillations in the turtle OB (Lam et al., 2000; Lam, Cohen, & Zochowski, 2003; see Figure 1). The oscillations were measured optically as a fractional change in fluorescence of
S. Kim, B. Singer, and M. Zochowski
[Figure 1 traces: caudal oscillation (top) and rostral oscillation (bottom); scale bars ΔF/F = 10⁻³ and 1000 ms; stimulus: 10% isoamyl acetate.]

Figure 1: Oscillatory response and formation of the temporal sequence due to odorant presentation. Two oscillations detected in the turtle OB are shown: the caudal oscillation (top) and the rostral oscillation (bottom). The two oscillations start nearly simultaneously and initially have the same frequency (14 Hz), with the caudal oscillation undergoing a period-doubling transition resulting in a 7 Hz oscillation.
voltage-sensitive dye. It is assumed that peaks of these oscillations correspond to coincident firing of large cell populations, while troughs represent a relatively quiescent state of population activity. The two oscillations appear simultaneously, with the caudal oscillation initially at 14 Hz; it eventually undergoes a period-doubling transition to 7 Hz. Period-doubling transitions have been observed in the mammalian OB, as well as odor-evoked changes in the relative power of oscillation in the γ (50–100 Hz) and β (15–40 Hz) bands of local field potential recordings (Martin, Gervais, Hugues, Messaoudi, & Ravel, 2004; Neville & Haberly, 2003; Ravel et al., 2003). These studies have suggested that the appearance of β activity is dependent on centrifugal feedback from extrabulbar areas and is a correlate of learned olfactory recognition. Here we use a network model incorporating basic properties of the vertebrate OB to capture experimentally observed properties of odor-evoked oscillations and to explore basic mechanisms underlying formation of the temporal code at the level of activity of individual glomerular modules within this oscillatory response. We show in our model that (1) in the absence of feedback from higher regions, spatial codes are temporally separated in the OB circuitry depending on the relative strength of the odorant components, and (2) the introduction of feedback from higher-order processing regions leads to synchronization that forms a unified temporal representation of the presented odorant, producing at the same time a period-doubling
transition. Thus, we suggest that at the time of the bifurcation, the role of the oscillation may be redefined from discrimination of odorant features, grouped by input strength, to discrimination of whole-odorant identity.

2 Methods

2.1 Description of the Model. The OB has a relatively simple layered cortical structure and is composed of excitatory mitral/tufted (M/T) and inhibitory periglomerular (PG) and granule cells. The axons of olfactory receptor neurons (ORNs) expressing limited receptor types (a single type in mammals) are sorted and converge onto specific glomeruli to form a chemotopic map of odorant in the OB (Bozza & Kauer, 1998; Malnic, Hirono, Sato, & Buck, 1999; Mombaerts et al., 1996; Ressler, Sullivan, & Buck, 1993). The glomeruli therefore represent specific types of receptors and are tuned to specific molecular features of the odorant. The glomeruli are relatively large spherical neuropils, where axons of sensory neurons synapse onto the dendrites of relatively few excitatory M/T cells and an extensive number of inhibitory PG cells (Bozza & Kauer, 1998; Mori, Nagao, & Yoshihara, 1999; Pinching & Powell, 1971; Shepherd, 1998). The populations of PG cells and M/T cells innervating the same glomeruli form a glomerular module (GM). M/T cells distribute their axons in secondary cortical structures and thus are the output cells of the OB. The most deeply positioned layer in the OB contains a very large pool of inhibitory granule cells and a substantial number of centrifugal fibers from various forebrain regions. Our model has a simplified structure resembling the basic vertebrate OB anatomy. The excitatory M/T and inhibitory PG cells are grouped into five GMs, each having 4 M/T cells and 20 PG cells (see Figure 2A). In addition, 100 granule cells are located outside the GMs in the OB. The PG cells receive inputs from ORNs, while the granule cells do not (Aungst et al., 2003; Mori et al., 1999).
The ratio of PG/granule cells to M/T cells differs between species (for example, 200:1 in the mouse OB; Mori et al., 1999). We tested ratios between 5:1 and 40:1 and could produce the same results by tuning connectivity parameters.

2.1.1 Model: Input. It has been established that an odorant activates several GMs, generating specific odor chemotopic maps, as every GM is associated with the input of a single class of receptor neurons (Mombaerts et al., 1996; Vosshall, Wong, & Axel, 2000). In this model, the odorant is simplified as a set of static input strengths received by a set of GMs, and the relative amplitude of these inputs (see Figure 2B) represents the initial signature of the presented odorants. Excitatory cells in the same GM receive the same strength of input, while inhibitory PG cells receive an input 50% smaller than excitatory cells in the same GM. The input builds up linearly to its maximum amplitude over a 200 ms interval after stimulus onset, mimicking the time course of calcium
Figure 2: A reduced model of olfactory processing in the turtle olfactory bulb. (A) Model of the OB: The model is divided into glomerular modules (GMs— dashed line). Every GM has an excitatory (M/T cells) population and inhibitory (PG cells) population. Outside of GMs, the inhibitory granule cells are located in the OB. Network interactions are denoted by arrows. Every GM has a different input strength corresponding to the combined activation received from specific olfactory receptor neurons (ORNs). A group of excitatory and inhibitory cells lies outside the OB network and receives converging input from M/T cells in pairs of GMs, acting as a coincidence modulator (see section 2). (B) Definition of the odorant. The odorant is defined as a combination of inputs to different GMs. (C) Time course of SSI (bottom) of excitatory cell population. SSI current is activated when an excitatory cell fires.
release from ORN terminals (Wachowiak, Cohen, & Zochowski, 2002), and remains constant thereafter. Since in this model we do not include dynamic features of the input to the OB, our network model does not show dynamical patterns that might be related to the input dynamics. Neither the differences in receptor activation timescales nor mechanisms such as presynaptic inhibition are considered in this model. Thus, our results do not reproduce the slowly modulated patterns of single neuronal activity that have been observed experimentally (Harrison & Scott, 1986; Laurent et al., 1996, 2001; MacLeod & Laurent, 1996).

2.1.2 Model: Connectivity Within a GM. It has been observed that excitatory neurons in the same GM are strongly coupled electrically (Schoppa & Westbrook, 2002) with each other as well as with PG cells (Aroniadou-Anderjaska et al., 2000; Ennis et al., 2001; Wachowiak & Cohen, 1999). Here, connectivities of different cell types are greatly simplified from the known synaptic organization of the OB. Excitatory cells innervating the same GM are fully connected with each other and coupled with 40% of PG inhibitory neurons within a GM. The PG cells form short-range connections with 50% of other PG cells within the same GM.

2.1.3 Model: Connectivity Between GMs. The granule and PG cells form short-range connections with other inhibitory neurons and excitatory cells in different GMs (Gracia-Llanes, Crespo, Blasco-Ibanez, Marques-Mari, & Martinez-Guijarro, 2003), while anatomical studies have failed to provide any evidence of monosynaptic excitatory connections between M/T cells innervating different glomeruli (Pinching & Powell, 1971). In this model, there is no monosynaptic pathway between excitatory neurons belonging to different GMs, while the M/T neurons project to a fraction of both granule and PG cells. The fraction depends on the distance between GMs. The distance between adjacent GMs is set to unity and increases linearly.
Distances between granule cells are calculated by assigning a subset of 20 granule cells to every position of every GM. The excitatory M/T cells in GMi are connected to a fraction of inhibitory PG cells in GMj and granule cells outside the GM. This fraction decreases as the distance between GMs (or between a GM and the subset of granule cells) increases:

Conn_{i,j} = Conn_0 / (|i − j| + 1),

where i, j correspond to the GM number of an individual neuron and Conn_0 is the connectivity when the distance |i − j| is zero; Conn_0 = 40% for PG (or granule)-M/T connections. The connectivity between the inhibitory cells scales similarly, with Conn_0 = 50% for PG (or granule)-PG (or granule) connections.
2.1.4 Model: Description of Individual Neurons in the OB. Individual neurons are defined by Hodgkin-Huxley-type equations, and network interactions are adopted with modifications from Kopell, Ermentrout, Whittington, & Traub (2000). The excitatory M/T neurons are known to be modulated through self-induced excitation and inhibition (Schoppa & Urban, 2003). Thus, the group of excitatory neurons has the added feature of an inhibitory slow after-hyperpolarization current, which is to mimic the slow self-inhibition (SSI) observed in the M/T cells (Jahr & Nicoll, 1980; Margrie, Sakmann, & Urban, 2001; Schoppa, Kinzie, Sahara, Segerson, & Westbrook, 1998; Schoppa & Urban, 2003). Although self-inhibition reflects the activation of interneurons by glutamate released from M/T cell dendrites, which in turn leads to GABA release back onto the M/T cell (Jahr & Nicoll, 1982; Mori & Takagi, 1978; Nowycky, Mori, & Shepherd, 1981; Schoppa et al., 1998), here for model simplicity we modeled the total effect of the interplay of SSI and self-excitation as a single current that produces relatively long-lasting hyperpolarization of the excitatory cells (see Figure 2C). Results of the model simulation were independent of the particular shape of this current. The excitatory neurons are defined by

C dV/dt = −g_l(V − V_l) − g_Na m³h(V − V_Na) − g_K n⁴(V − V_K) − g_SI(w − 0.025)(V − V_∞) + I_syn^e + I_appl^e.

The inhibitory neurons are defined by

C dV/dt = −g_l(V − V_l) − g_Na m³h(V − V_Na) − g_K n⁴(V − V_K) + I_syn^i + I_appl^i,

where I_syn^e = I_syn^ie + I_syn^ee + I_syn^ce and I_syn^i = I_syn^ei + I_syn^ii + I_syn^ci. The indices i, e, and c refer, respectively, to inhibitory, excitatory, and coincidence modulator cells. The first three currents represent leakage, sodium, and potassium currents, respectively, where g_l = 0.1 mS/cm², g_Na = 100 mS/cm², g_K = 80 mS/cm², V_l = −67 mV, V_Na = 50 mV, V_K = −100 mV, and
m' = [0.32(54 + V)/(1 − exp(−0.25(V + 54)))](1 − m) − [0.28(V + 27)/(exp(0.2(V + 27)) − 1)] m,

h' = 0.128 exp(−(50 + V)/18)(1 − h) − [4/(1 + exp(−0.2(V + 27)))] h,

n' = [0.032(V + 52)/(1 − exp(−0.2(V + 52)))](1 − n) − 0.5 exp(−(57 + V)/40) n,

w' = (w∞(V) − w)/τw(V),

with

w∞(V) = 1/(1 + exp(−0.1(V + 35))),
τw(V) = 400/(3.3 exp(0.05(V + 35)) + exp(−0.05(V + 35))),
V∞(V) = 150/(0.1 exp(−0.03(V + 70)) − 1).
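For concreteness, the SSI gating kinetics above translate directly into code; a minimal sketch (the function names are mine) that is convenient for checking the fixed point of w:

```python
import math

def w_inf(v: float) -> float:
    # steady-state activation of the SSI gating variable w
    return 1.0 / (1.0 + math.exp(-0.1 * (v + 35.0)))

def tau_w(v: float) -> float:
    # voltage-dependent time constant of w
    return 400.0 / (3.3 * math.exp(0.05 * (v + 35.0)) + math.exp(-0.05 * (v + 35.0)))

def dw_dt(v: float, w: float) -> float:
    # w' = (w_inf(V) - w) / tau_w(V): w relaxes toward w_inf at rate 1/tau_w
    return (w_inf(v) - w) / tau_w(v)
```

At V = −35 mV the gate is half-activated (w∞ = 0.5), τw(−35) = 400/4.3 ≈ 93, and w' vanishes exactly when w = w∞(V).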
The term g_SI(w − 0.025)(V − V_∞) in the definition of the excitatory neuron defines the SSI current (see Figure 2C), with g_SI = 0.7 mS/cm². The capacitance C is 1 µF/cm² for all cell populations.

I_appl^e and I_appl^i represent external currents from receptor neurons to excitatory and inhibitory cells, respectively. The strength of the external input (defining activation of particular GMs) to the excitatory cells is set between 1.4 µA/cm² and 5 µA/cm² (depending on the GM), whereas that for inhibitory cells is limited to 50% of the excitatory input of the same GMs. I_syn^jk denotes a synaptic current from j to k (j, k = e, i, c). For instance, I_syn^ci indicates a current from coincidence modulator (c) to inhibitory (i) cells. The synaptic currents for the excitatory neurons are given by

I_syn^ie + I_syn^ce = −g_ie Σ_j s_j^i(t)(V_j + 80) − g_ce Σ_j s_j^c(t) V_k.

Additionally, the M/T cells innervating the same GM are coupled:

I_syn^ee = −g_ee Σ_{j∈G} s_j^e(t) V_j.

For inhibitory neurons,

I_syn^ei + I_syn^ii + I_syn^ci = −g_ei Σ_j s_j^e(t − τ) V_j − g_ii Σ_j s_j^i(t)(V_j + 80) − g_ci Σ_j s_j^c(t) V_k,

where j is the index of the connected neuron from which the synaptic input originates, and V_k = V_j + 80 for inhibitory feedback (when c = I) or V_k = V_j for excitatory feedback (when c = E). Synaptic gating variables for excitatory and inhibitory cells are defined by

s_e' = 5(1 + tanh(V/4))(1 − s_e) − s_e/2,
s_i' = 2(1 + tanh(V/4))(1 − s_i) − s_i/15,
and s_c is defined by the same equation as either s_e or s_i, depending on whether feedback from the coincidence modulator is modeled as excitatory (c = E) or inhibitory (c = I). The average transmission delay between excitatory and inhibitory neurons is τ = 2 ms. The synaptic efficacies between interconnected neurons are defined to be g_ie = 0.03 mS/cm², g_ee = 0.04 mS/cm², g_ii = 0.005 mS/cm², g_ei = 0.03 mS/cm², g_ce = 0.4 mS/cm², and g_ci = 0.3 mS/cm². The behavior of the model was tested for a range of connectivity parameters and yielded essentially the same results. The initial values of the parameters were taken from Kopell et al. (2000). Additional tuning was performed only to obtain balanced firing rates of inhibitory and excitatory cells.

2.1.5 Model: Cortical Processing as Coincidence Modulator. The experimentally observed period-doubling transition in the caudal oscillation cannot be mediated solely by our simulated structure of the OB. However, it has been shown that a period-doubling transition could take place on a population level and might be mediated by excitatory-to-excitatory interaction (Ermentrout & Kopell, 1998; Kopell et al., 2000; Whittington, Stanford, Colling, Jefferys, & Traub, 1997). This transition results from mutual synchronization of excitatory cells firing at different oscillatory cycles. It has been well established that the M/T cells project to overlapping fields of termination (Wilson, 2001; Zou, Horowitz, Montmayeur, Snapper, & Buck, 2001), and the OB receives feedback from higher centers (Luskin & Price, 1983; Macrides, 1983; Martin et al., 2004; Neville & Haberly, 2003; Pinching & Powell, 1972; Scott, McBride, & Schneider, 1980). Thus, feedback from central areas may be extensively involved in OB information processing.
In this model, we investigate the hypothesis that feedback may mediate the experimentally observed period doubling in the caudal oscillation and that the transition is formed through the mutual synchronization of activated M/T cells. This is mediated via an additional layer of 10 excitatory and 10 inhibitory cells outside the OB structure (see Figure 2A). Excitatory (E) and inhibitory (I) cells in the coincidence modulator follow the equations of excitatory and inhibitory cells in the OB, except that the applied current I_appl^I = 0 and I_appl^E = −g_ec Σ_j s_j^e(t) V_j (where e and E refer to excitatory cells in the OB and the coincidence modulator, respectively) is the converging input from M/T cells in two sets of GMs, where g_ec = 0.03 mS/cm². The excitatory neurons in the coincidence modulator are described by
C dV/dt = −g_l(V − V_l) − g_Na m³h(V − V_Na) − g_K n⁴(V − V_K) − g_SI(w − 0.025)(V − V_∞) + I_syn^IE + I_syn^EE + I_appl^E.
The inhibitory neurons in the coincidence modulator are described by

C dV/dt = −g_l(V − V_l) − g_Na m³h(V − V_Na) − g_K n⁴(V − V_K) + I_syn^EI + I_syn^II.
The parameters defining the respective currents are the same as in the OB, and the connectivity within the coincidence modulator is 70%.

2.1.6 Model: Connectivity Between the Coincidence Modulator and the OB. We investigate two types of connectivity between the coincidence modulator and the OB: (1) connections from excitatory (E) cells in the coincidence modulator project directly to M/T and PG cells in the OB, and separately (2) connections from inhibitory (I) cells in the coincidence modulator project to M/T and PG cells in the OB. In the first case, M/T cells innervating the same GM are connected to two assigned E cells in the coincidence modulator, sending output to the E cells in the coincidence modulator and, in turn, getting feedback from them. Fifty percent of PG/granule cells get feedback from two randomly selected E cells in the coincidence modulator. In the second case, M/T and PG/granule cells get feedback in the same way as in the first case except that the feedback comes from I cells in the coincidence modulator rather than E cells. With tuning of synaptic weights, these two conditions generated essentially the same result, corresponding to either direct excitation of the OB in the first case or disinhibition of the OB in the second case.
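As a sketch of how these pieces advance in time, the synaptic gating ODEs from section 2.1.4 and the converging drive onto a coincidence modulator E cell from section 2.1.5 might be coded as follows (forward-Euler step; the function names and step size are mine, while the rate constants and g_ec = 0.03 mS/cm² are from the text):

```python
import math

def ds_e(v_pre: float, s_e: float) -> float:
    # excitatory gating: s_e' = 5(1 + tanh(V/4))(1 - s_e) - s_e/2
    return 5.0 * (1.0 + math.tanh(v_pre / 4.0)) * (1.0 - s_e) - s_e / 2.0

def ds_i(v_pre: float, s_i: float) -> float:
    # inhibitory gating: s_i' = 2(1 + tanh(V/4))(1 - s_i) - s_i/15
    return 2.0 * (1.0 + math.tanh(v_pre / 4.0)) * (1.0 - s_i) - s_i / 15.0

def euler_step(s: float, dsdt: float, dt: float = 0.05) -> float:
    # one forward-Euler update of a gating variable
    return s + dt * dsdt

def coincidence_drive(s_e_list, v_list, g_ec=0.03):
    # I_appl^E = -g_ec * sum_j s_e_j(t) V_j: converging M/T input onto an
    # E cell of the coincidence modulator
    return -g_ec * sum(s * v for s, v in zip(s_e_list, v_list))
```

When the presynaptic cell is depolarized (V near 0 mV), the gating variables are driven toward 1; at hyperpolarized potentials only the decay terms remain, with the inhibitory gate decaying more slowly.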
2.2 Analysis: Calculation of Angular Separation of Odor Representation. We calculate the angle between odor representations as a measure of odor discrimination. Each odorant is represented as an n-dimensional vector (n is the total number of GMs), and each dimension indicates the fraction of active (firing) excitatory neurons in each GM. Angular separation is defined as the arc cosine of the normalized inner product of the vectors representing the odorants in five dimensions:

θ = arccos[(P_1 · P_2)/(|P_1| |P_2|)],

where P_i is the vector representation of the ith odorant.

2.3 Analysis: Calculation of Temporal Vector Representation and Overlap. To compare the correlation of individual GMs in an odorant, we calculated the fraction of excitatory cells firing within each GM for 10 consecutive oscillatory cycles. Each GM is thus represented by a 10-dimensional vector. We then calculated the dot product between these vectors to measure the correlation between GMs. This value measures the similarity of the firing pattern among GMs.
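Both analysis measures reduce to plain vector arithmetic; a minimal sketch (the function names are mine):

```python
import math

def angular_separation(p1, p2):
    """theta = arccos((P1 . P2) / (|P1||P2|)), in degrees; p1 and p2 hold the
    fraction of active excitatory neurons per GM for each odorant."""
    dot = sum(a * b for a, b in zip(p1, p2))
    norms = math.sqrt(sum(a * a for a in p1)) * math.sqrt(sum(b * b for b in p2))
    # clamp guards against round-off pushing the cosine slightly outside [-1, 1]
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norms))))

def gm_overlap(v1, v2):
    """Dot product of two 10-cycle temporal activity vectors of two GMs."""
    return sum(a * b for a, b in zip(v1, v2))
```

Two odorants activating disjoint GM sets are maximally separated (θ = 90°), while identical activation patterns give θ = 0°.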
3 Results

3.1 Formation of Temporal Clustering Based on Input Strength. The odorant is defined as a set of activation amplitudes received by the GMs (see Figure 3D inset). When the network is presented with an odorant and the coincidence modulator is not present, two oscillatory patterns appear (see Figure 3A, two bottom traces), generated by the excitatory and inhibitory cell populations. The observed oscillatory patterning was robust to changes in parameter values. Excitatory cells in GMs that share similar input amplitudes (see Figure 3D inset) fire at the same cycles of the oscillation, whereas cells belonging to differentially activated GMs fire on alternating cycles, creating a segmented temporal sequence of cell populations activated on different cycles within the oscillatory response. The firings of all four excitatory cells innervating the same GM are synchronized due to their coupling. Three mechanisms contribute to the formation of this temporal pattern. Fast synaptic inhibition within and between the GMs is known to be responsible for the formation of oscillatory patterning in the OB by inhibiting the simultaneous activation of excitatory cell populations belonging to different glomeruli (MacLeod & Laurent, 1996; Stopfer et al., 1997). Elimination of synaptic inhibition in the network (see Figure 3C) abolishes population oscillatory patterning and the associated temporal code as the cells fire without temporal locking. The individual GM's activity remains similar but desynchronized with the other GMs. The second mechanism is based on the SSI current (see Figure 2C), which prohibits firing of excitatory cells on consecutive cycles of the population oscillation. This mechanism is responsible for the actual formation of the temporal sequence within the oscillatory response through selective inactivation of cell populations.
This is because cells that fired at a given cycle and received additional slow inhibition do not recover in time to fire at the next oscillatory cycle. The elimination of SSI does not disrupt oscillatory patterning; however, it abolishes the formation of a temporal sequence, as most of the excitatory cells fire at every cycle of the oscillation (see Figure 3B). The third mechanism is based on the patterning of the input received by the GMs (see Figure 3D inset). The differences and similarities in input strength to different GMs underlie the temporal binding and segmentation of different cell populations. If the GMs receive similar input, the excitatory neurons composing them will robustly fire on the same cycles of the oscillation. Excitatory neurons in GMs receiving different levels of input fire at different cycles. This is because the cells receiving higher input activate first and inhibit firing of the cells receiving weaker input through lateral inhibition.

3.2 Temporal Segmentation of a GM as a Function of Input. We tested how the temporal patterning of a single GM changes as a function of input amplitude in the context of a glomerular network. We assigned fixed
[Figure 3 panels: (A) SSI and synaptic inhibition, (B) synaptic inhibition only, (C) SSI only; each shows depolarization [mV] traces over time [ms] for GMs 1–5 plus total excitatory and inhibitory activity. (D) Overlap versus input strength (0.0–5.0 µA/cm²) for GM pairs 2&3 and 1&4, with an inset showing the input strengths when altering a single input.]
Figure 3: Response to the odorant presentation. Traces are averaged membrane potentials of the excitatory populations innervating the same GM (matched number), and averaged total membrane potential for both excitatory (black) and inhibitory (gray) cell populations. Each GM fires periodically due to the constant ORN input, with periodicity dependent on the input amplitude, and therefore different GMs desynchronize. The structure of an odorant is included in the inset of D. (A) Synaptic and SSI combined. Excitatory cell populations receiving inputs of similar strength fire on the same cycle. After a transient, excitatory populations fire only on alternate cycles. (B) Synaptic inhibition only. All excitatory cell populations fire at every oscillatory cycle, abolishing the temporal features of the firing pattern. (C) SSI only. The oscillatory response is abolished. Excitatory cell populations receiving similar input do not synchronize. Bottom: time course of the odorant delivery. (D) Segmentation of the temporal pattern depending on the relative input strength. The temporal pattern of the excitatory population in GM 5 is compared to the patterns of GMs 2 and 3 (I_appl^e = 2.4, 2.5 µA/cm², solid line) or GMs 1 and 4 (I_appl^e = 3.0, 2.95 µA/cm², dashed line). The structure of an odorant is included in the inset (I_appl^e = 3.0, 2.4, 2.5, 2.95, 2.3 µA/cm² for GMs 1–5).
input strengths to four GMs (GM 1, 4: I_appl^e = 3.0, 2.95 µA/cm² and GM 2, 3: I_appl^e = 2.4, 2.5 µA/cm²; see Figure 3D inset) and systematically varied the input of GM 5. We then calculated the overlap of the temporal vectors (see the methods in section 2.3) formed based on the activity of excitatory cells within a given GM (see Figure 3D). When the input to GM 5 is small (below 2.0 µA/cm²), the M/T cells of that GM do not activate. Above the threshold input, the excitatory neurons in GM 5 start to fire on a given cycle of the oscillation. With further increase of the input, the tested GM aligned itself with the activity of the GMs with lower input. When the input increases to 3.0 µA/cm², there is an abrupt change as the tested GM aligns itself with the activity of the GMs with higher input. With further increase of the input (above 3.8 µA/cm²), GM 5 becomes active on both oscillatory cycles, as it recovers relatively fast from the self-inhibition and lateral inhibition received from other GMs. Thus, the segmentation of glomerular activity is driven by the relative strengths of inputs to the glomeruli and shows robust temporal classification over a relatively wide range of parameter values. It is also important to note that the absolute values of the currents are not as important as the relative strength of input among the GMs.
3.3 Formation of Different Temporal Sequences to Presentation of Similar Odorants. Presentation of similar odorants (i.e., odorants activating the same GMs but with different input amplitudes; see Figure 4A) results in the formation of different temporal sequences of activation in different glomeruli (see Figure 4B). Thus, different neuronal assemblies of active excitatory cells are formed over time. The elimination of the SSI preserves the oscillatory response but abolishes the formation of the specific temporal sequential assemblies, making the presented odorants indistinguishable. Such a temporal code may not be critical for dissimilar odorants, since their differences in spatial representation may be large enough to separate the spatiotemporal codes of two dissimilar odorants, but for similar odorants,
Figure 4: Temporal segmentation and discrimination of two similar odorants. (A) Definitions of the similar odorants. The two odorants activate the same glomerular units (100% spatial overlap) but with different relative intensities. (B) Differences in temporal activation. Averaged membrane potentials (left: odorant 1; right: odorant 2) of excitatory populations (black) innervating different GMs, and of all inhibitory population (gray). (C) Odor discrimination by angular separation of odorant representations over time. The angle between odorant representations is computed as arc-cosine of the normalized inner product of the vectors representing each odorant in five dimensions. When SSI is present, temporal coding is maintained, with the angular separation bounded away from 0. With synaptic inhibition only, the separation of representations goes to 0.
[Figure 4 panels: (A) input definitions for odorants 1 and 2 (relative input strength, 0.0–3.0, for glomeruli 1–5); (B) depolarization [mV] traces over time [ms] for GMs 1–5 under synaptic and slow inhibition, for each odorant; (C) angular separation θ over time (0–400 ms), with SSI versus without SSI.]
808
S. Kim, B. Singer, and M. Zochowski
temporal segmentation may provide a mechanism of discrimination. To better illustrate the role of the SSI and temporal coding in distinguishing similar stimuli, we calculate the angular separation of the representations for two similar odorants, individually presented to the network (see Figure 4C and the methods in section 2.2). When the SSI is present, the angular separation evolves from zero and stays nonzero, with mean value θ = 74.17°, indicating distinct spatiotemporal representations of the two odorants. The angular separation remains almost zero when the SSI is abolished, indicating that the spatiotemporal representations of the two odorants are nearly the same.

3.4 Interaction of the Olfactory Bulb and a Coincidence Modulator. Introduction of feedback from cortical regions could provide a polysynaptic pathway between different GMs and modify the odor representation in the OB circuitry. To establish whether such a polysynaptic connection could mediate the period-doubling transition, we introduce an additional layer of excitatory and inhibitory neurons outside the OB structure, the coincidence modulator, which receives input from the M/T cells and provides feedback to the OB circuit (see section 2 and Figure 2A). We assume convergence of the OB output on the coincidence modulator excitatory neurons and test two types of feedback (see section 2): (1) excitatory feedback (see Figure 5A) and (2) inhibitory feedback (see Figure 5B) to the M/T and PG/granule cells. Providing an odor input to the OB network in the absence of feedback from the coincidence modulator results in oscillations of the same frequency in both the inhibitory and excitatory cell populations (see Figure 5). At time t = 500 ms, we activate the excitatory (or inhibitory) feedback connection between the OB and the coincidence modulator (the black arrows in Figures 5A and 5B, respectively).
The excitatory neurons synchronize, rapidly generating a period-doubling transition that results in a slow oscillation. At the same time, the frequency of the oscillation in the inhibitory population remains unchanged. Similar results were obtained with both types of feedback (see Figure 5). This is because direct excitatory feedback to M/T cells plays the same role in the OB network as disinhibition of the OB via inhibitory feedback to PG/granule cells. In these simulations, we found that synchronization within the coincidence modulator itself was required for the induction of period doubling in the OB through feedback.

4 Summary and Discussion

Here we present a reduced model of the vertebrate OB that reproduces basic dynamical patterns observed across phyla. The focus of this model is to propose a role of oscillatory patterning in odorant discrimination within a fixed set of GMs through the formation of combinatorial assemblies of GM
Figure 5: Feedback from a coincidence modulator can mediate the period-doubling transition. (A) Excitatory feedback to M/T and PG/granule cells. (B) Inhibitory feedback to M/T and PG/granule cells. The feedback is activated at 500 ms (black arrows with dotted line) in both cases. The two cases produce essentially the same result (period doubling). The transition to a slow oscillation happens abruptly, whereas there is no significant change in the oscillation generated by the inhibitory cells.
activity. We have linked the oscillations to possible underlying mechanisms of temporal coding formation and to possible functional roles in information processing. The results are presented here as an analysis of steady-state behavior in response to static input, but the illustrated principle of information processing may form a component of dynamic encoding schemes. We show that the interaction of three effects underlies the initial formation of temporal patterning within the oscillatory response in the absence of feedback from higher cortical regions: fast synaptic interactions responsible for the actual formation of the oscillatory patterning, the SSI responsible for alternating activation of different neuronal groups at different cycles, and the differential input strength to different GMs. Processing individual components of an odorant through asynchronous activity of individual GMs allows temporal clustering within a given set of GMs. Thus, coding of odorant features may depend not only on which GMs are activated but on when, and in combination with which other GMs, activation occurs. Our results indicate that this initial segmentation is mediated by the relative input strength at different GMs. We further show that although input strength to different GMs may vary continuously, segmentation of GMs into temporal assemblies occurs in a discrete manner. This temporal segmentation underlies the formation of discriminable representations of odorants that activate the same GMs but with different strengths. This differentiation occurs even in the absence of dynamic modulation of OB input by such mechanisms as differential time courses of receptor cell activation and presynaptic inhibition, which produce slow modulation in neural activity (Laurent et al., 1996) that could additionally aid odorant discrimination.
Abolition of temporal segmentation in the model leads to the formation of identical spatiotemporal representations of different odorants, despite their dissimilar input to the OB. SSI alone has already been shown to play an important role during discrimination of similar odorants (Bazhenov, Stopfer, Rabinovich, Abarbanel et al., 2001; Bazhenov, Stopfer, Rabinovich, Huerta et al., 2001). In that model, SSI is mostly responsible for slow modulation of the observed activity patterns. Here, however, we suggest how the interaction of three essential components is responsible for the initial formation of a temporal representation of odorant features through pattern segmentation. This dependence of discrimination on temporal patterning has been observed experimentally in insects (Hosler et al., 2000; Stopfer et al., 1997; Teyke & Gelperin, 1999). This further corroborates the hypothesis that the discrimination of similar odorants is mediated by differences in the temporal segmentation of individual features of the presented odorant (Stopfer, Jayaraman, & Laurent, 2003). Temporal segmentation of odorant features may serve as a substrate for the activation of different cell populations
at the next processing stage, which has been shown to be activated on a cycle-by-cycle basis (Perez-Orive et al., 2002). An experimental study in insects (Perez-Orive, Bazhenov, & Laurent, 2004) and theoretical studies (Abeles, 1982; Konig, Engel, & Singer, 1996) have proposed that neurons in higher processing centers act as coincidence detectors. We show that feedback from higher cortical regions, modeled here as a simple coincidence modulator network, can cause the experimentally observed period-doubling transition (Lam et al., 2000; Lam et al., 2003) and at the same time redefine the information content of the oscillation. In the absence of feedback, initial feature resolution is based on relative input strength among a set of GMs. In the presence of feedback, this is replaced by binding and synchronization of GM activity corresponding to all components of the odorant. We show that this mechanism is relatively robust and can be obtained using two different feedback mechanisms: direct excitation of M/T cells or disinhibition of appropriate GMs. In both cases, we assume a simple connectivity pattern: M/T cells converge pairwise onto cells in the coincidence modulator and receive pairwise feedback, while inhibitory cells in the OB are connected to coincidence modulator cells at random. We hypothesize that the synchrony of individual GMs in the presence of feedback may represent a transition from a component-wise peripheral representation based purely on odorant features to a representation of odor identity based on the interaction between the initial representation in the OB and central processing. Odor segmentation and discrimination in the presence of feedback from the cortex has been discussed by Li and Hertz (2000). They show that when central feedback is present, odorants can be distinguished by oscillatory patterning in the OB.
Here, however, we emphasize the mechanisms of initial segmentation by input properties that form the representation passed to higher processing centers. Initial temporal feature binding may be a necessary substrate for odor identity binding, which is induced by feedback from higher stages. This view is consistent with the hypothesis of two steps of odor recognition processing before and after involvement of odor memory in higher regions (Wilson & Stevenson, 2003). Although we propose here that the formation of temporally segmented assemblies of GMs may be an effective strategy for odor encoding, it is not the only strategy. Future work will illuminate the relationship of temporal segmentation and odor-evoked oscillations to evolving input patterns.
References

Abeles, M. (1982). Role of the cortical neuron: Integrator or coincidence detector? Isr. J. Med. Sci., 18(1), 83–92.
Adrian, E. D. (1950). The electrical activity of the mammalian olfactory bulb. Electroencephalogr. Clin. Neurophysiol., 2(4), 377–388. Allison, A. C. (1953). The morphology of the olfactory system in the vertebrates. Biological Reviews of the Cambridge Philosophical Society, 28(2), 195–244. Aroniadou-Anderjaska, V., Zhou, F. M., Priest, C. A., Ennis, M., & Shipley, M. T. (2000). Tonic and synaptically evoked presynaptic inhibition of sensory input to the rat olfactory bulb via GABA(b) heteroreceptors. J. Neurophysiol., 84(3), 1194– 1203. Aungst, J. L., Heyward, P. M., Puche, A. C., Karnup, S. V., Hayar, A., Szabo, G., & Shipley, M. T. (2003). Centre-surround inhibition among olfactory bulb glomeruli. Nature, 426(6967), 623–629. Bazhenov, M., Stopfer, M., Rabinovich, M., Abarbanel, H. D., Sejnowski, T. J., & Laurent, G. (2001). Model of cellular and network mechanisms for odor-evoked temporal patterning in the locust antennal lobe. Neuron, 30(2), 569–581. Bazhenov, M., Stopfer, M., Rabinovich, M., Huerta, R., Abarbanel, H. D., Sejnowski, T. J., & Laurent, G. (2001). Model of transient oscillatory synchronization in the locust antennal lobe. Neuron, 30(2), 553–567. Bozza, T. C., & Kauer, J. S. (1998). Odorant response properties of convergent olfactory receptor neurons. J. Neurosci., 18(12), 4560–4569. Bressler, S. L. (1984). Spatial organization of EEGs from olfactory bulb and cortex. Electroencephalogr. Clin. Neurophysiol., 57(3), 270–276. Bressler, S. L., & Freeman, W. J. (1980). Frequency analysis of olfactory system EEG in cat, rabbit, and rat. Electroencephalogr. Clin. Neurophysiol., 50(1–2), 19–24. Brody, C. D., & Hopfield, J. J. (2003). Simple networks for spike-timing-based computation, with application to olfactory processing. Neuron, 37(5), 843–852. Crosby, E., & Humphrey, T. (1939). 
Studies of the vertebrate telencephalon I: The nuclear configuration of the olfactory and accessory olfactory formations and of the nucleus olfactorius anterior of certain reptiles, birds and mammals. J. Comp. Neurol., 71, 121–213. Eckhorn, R. (1994). Oscillatory and non-oscillatory synchronizations in the visual cortex and their possible roles in associations of visual features. Prog. Brain Res., 102, 405–426. Eckhorn, R. (2000). Cortical synchronization suggests neural principles of visual feature grouping. Acta Neurobiol. Exp. (Wars), 60(2), 261–269. Eeckman, F. H., & Freeman, W. J. (1990). Correlations between unit firing and EEG in the rat olfactory system. Brain Res., 528(2), 238–244. Ennis, M., Zhou, F. M., Ciombor, K. J., Aroniadou-Anderjaska, V., Hayar, A., Borrelli, E., Zimmer, L. A., Margolis, F., & Shipley, M. T. (2001). Dopamine D2 receptor-mediated presynaptic inhibition of olfactory nerve terminals. J. Neurophysiol., 86(6), 2986–2997. Ermentrout, G. B., & Kopell, N. (1998). Fine structure of neural spiking and synchronization in the presence of conduction delays. Proc. Natl. Acad. Sci. USA, 95(3), 1259–1264. Gelperin, A., & Tank, D. W. (1990). Odour-modulated collective network oscillations of olfactory interneurons in a terrestrial mollusc. Nature, 345(6274), 437–440. Gracia-Llanes, F. J., Crespo, C., Blasco-Ibanez, J. M., Marques-Mari, A. I., & Martinez-Guijarro, F. J. (2003). Vip-containing deep short-axon cells of the olfactory bulb
innervate interneurons different from granule cells. Eur. J. Neurosci., 18(7), 1751–1763. Gray, C. M., & Skinner, J. E. (1988). Field potential response changes in the rabbit olfactory bulb accompany behavioral habituation during the repeated presentation of unreinforced odors. Exp. Brain Res., 73(1), 189–197. Harrison, T. A., & Scott, J. W. (1986). Olfactory bulb responses to odor stimulation: Analysis of response pattern and intensity relationships. J. Neurophysiol., 56(6), 1571–1589. Hendin, O., Horn, D., & Tsodyks, M. V. (1998). Associative memory and segmentation in an oscillatory neural model of the olfactory bulb. J. Comput. Neurosci., 5(2), 157–169. Hosler, J. S., Buxton, K. L., & Smith, B. H. (2000). Impairment of olfactory discrimination by blockade of GABA and nitric oxide activity in the honey bee antennal lobes. Behav. Neurosci., 114(3), 514–525. Hughes, J. R., & Mazurowski, J. A. (1962). Studies on the supracallosal mesial cortex of unanesthetized, conscious mammals. II. Monkey. B. Responses from the olfactory bulb. Electroencephalogr. Clin. Neurophysiol., 14, 635–645. Inokuma, Y., Inoue, T., Watanabe, S., & Kirino, Y. (2002). Two types of network oscillations and their odor responses in the primary olfactory center of a terrestrial mollusk. J. Neurophysiol., 87(6), 3160–3164. Jahr, C. E., & Nicoll, R. A. (1980). Dendrodendritic inhibition: Demonstration with intracellular recording. Science, 207(4438), 1473–1475. Jahr, C. E., & Nicoll, R. A. (1982). An intracellular analysis of dendrodendritic inhibition in the turtle in vitro olfactory bulb. J. Physiol., 326, 213–234. Konig, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. Trends Neurosci., 19(4), 130–137. Kopell, N., Ermentrout, G. B., Whittington, M. A., & Traub, R. D. (2000). Gamma rhythms and beta rhythms have different synchronization properties. Proc. Natl. Acad. Sci. USA, 97(4), 1867–1872. Lam, Y. W., Cohen, L. 
B., Wachowiak, M., & Zochowski, M. R. (2000). Odors elicit three different oscillations in the turtle olfactory bulb. J. Neurosci., 20(2), 749– 762. Lam, Y. W., Cohen, L. B., & Zochowski, M. R. (2003). Odorant specificity of three oscillations and the DC signal in the turtle olfactory bulb. Eur. J. Neurosci., 17(3), 436–446. Laurent, G., & Naraghi, M. (1994). Odorant-induced oscillations in the mushroom bodies of the locust. J. Neurosci., 14(5 Pt. 2), 2993–3004. Laurent, G., Stopfer, M., Friedrich, R. W., Rabinovich, M. I., Volkovskii, A., & Abarbanel, H. D. (2001). Odor encoding as an active, dynamical process: Experiments, computation, and theory. Annu. Rev. Neurosci., 24, 263–297. Laurent, G., Wehr, M., & Davidowitz, H. (1996). Temporal representations of odors in an olfactory network. J. Neurosci., 16(12), 3837–3847. Li, Z., & Hertz, J. (2000). Odour recognition and segmentation by a model olfactory bulb and cortex. Network, 11(1), 83–102. Linster, C., & Cleland, T. A. (2001). How spike synchronization among olfactory neurons can contribute to sensory discrimination. J. Comput. Neurosci., 10(2), 187– 193.
Luskin, M. B., & Price, J. L. (1983). The topographic organization of associational fibers of the olfactory system in the rat, including centrifugal fibers to the olfactory bulb. J. Comp. Neurol., 216(3), 264–291. MacLeod, K., Backer, A., & Laurent, G. (1998). Who reads temporal information contained across synchronized and oscillatory spike trains? Nature, 395(6703), 693–698. MacLeod, K., & Laurent, G. (1996). Distinct mechanisms for synchronization and temporal patterning of odor-encoding neural assemblies. Science, 274(5289), 976–979. Macrides, F. D. B. (1983). The olfactory bulb. In P. C. Emson (Ed.), Neuroanatomy (pp. 391–426). New York: Raven Press. Malnic, B., Hirono, J., Sato, T., & Buck, L. B. (1999). Combinatorial receptor codes for odors. Cell, 96(5), 713–723. Margrie, T. W., Sakmann, B., & Urban, N. N. (2001). Action potential propagation in mitral cell lateral dendrites is decremental and controls recurrent and lateral inhibition in the mammalian olfactory bulb. Proc. Natl. Acad. Sci. USA, 98(1), 319–324. Martin, C., Gervais, R., Hugues, E., Messaoudi, B., & Ravel, N. (2004). Learning modulation of odor-induced oscillatory responses in the rat olfactory bulb: A correlate of odor recognition? J. Neurosci., 24(2), 389–397. Mombaerts, P., Wang, F., Dulac, C., Chao, S. K., Nemes, A., Mendelsohn, M., Edmenson, J., & Axel, R. (1996). Visualizing an olfactory sensory map. Cell, 87(4), 675–686. Mori, K., Nagao, H., & Yoshihara, Y. (1999). The olfactory bulb: Coding and processing of odor molecule information. Science, 286(5440), 711–715. Mori, K., & Takagi, S. F. (1978). An intracellular study of dendrodendritic inhibitory synapses on mitral cells in the rabbit olfactory bulb. J. Physiol., 279, 569–588. Neville, K. R., & Haberly, L. B. (2003). Beta and gamma oscillations in the olfactory system of the urethane-anesthetized rat. J. Neurophysiol., 90(6), 3921–3930. Nowycky, M. C., Mori, K., & Shepherd, G. M. (1981). 
GABAergic mechanisms of dendrodendritic synapses in isolated turtle olfactory bulb. J. Neurophysiol., 46(3), 639–648. Olufsen, M. S., Whittington, M. A., Camperi, M., & Kopell, N. (2003). New roles for the gamma rhythm: Population tuning and preprocessing for the beta rhythm. J. Comput. Neurosci., 14(1), 33–54. Perez-Orive, J., Bazhenov, M., & Laurent, G. (2004). Intrinsic and circuit properties favor coincidence detection for decoding oscillatory input. J. Neurosci., 24(26), 6037–6047. Perez-Orive, J., Mazor, O., Turner, G. C., Cassenaer, S., Wilson, R. I., & Laurent, G. (2002). Oscillations and sparsening of odor representations in the mushroom body. Science, 297(5580), 359–365. Pinching, A. J., & Powell, T. P. (1971). The neuropil of the glomeruli of the olfactory bulb. J. Cell. Sci., 9(2), 347–377. Pinching, A. J., & Powell, T. P. (1972). The termination of centrifugal fibres in the glomerular layer of the olfactory bulb. J. Cell. Sci., 10(3), 621–635. Rall, W., & Shepherd, G. M. (1968). Theoretical reconstruction of field potentials and dendrodendritic synaptic interactions in olfactory bulb. J. Neurophysiol., 31(6), 884–915.
Ravel, N., Chabaud, P., Martin, C., Gaveau, V., Hugues, E., Tallon-Baudry, C., Bertrand, O., & Gervais, R. (2003). Olfactory learning modifies the expression of odour-induced oscillatory responses in the gamma (60–90 Hz) and beta (15–40 Hz) bands in the rat olfactory bulb. Eur. J. Neurosci., 17(2), 350– 358. Ressler, K. J., Sullivan, S. L., & Buck, L. B. (1993). A zonal organization of odorant receptor gene expression in the olfactory epithelium. Cell, 73(3), 597–609. Schoppa, N. E., Kinzie, J. M., Sahara, Y., Segerson, T. P., & Westbrook, G. L. (1998). Dendrodendritic inhibition in the olfactory bulb is driven by NMDA receptors. J. Neurosci., 18(17), 6790–6802. Schoppa, N. E., & Urban, N. N. (2003). Dendritic processing within olfactory bulb circuits. Trends Neurosci, 26(9), 501–506. Schoppa, N. E., & Westbrook, G. L. (2002). AMPA autoreceptors drive correlated spiking in olfactory bulb glomeruli. Nat. Neurosci., 5(11), 1194–1202. Scott, J. W., McBride, R. L., & Schneider, S. P. (1980). The organization of projections from the olfactory bulb to the piriform cortex and olfactory tubercle in the rat. J. Comp. Neurol., 194(3), 519–534. Shepherd, G. M. (1998). The synaptic organization of the brain (4th ed.). New York: Oxford University Press. Singer, W. (1999). Neuronal synchrony: A versatile code for the definition of relations? Neuron, 24(1), 49–65, 111–125. Stopfer, M., Bhagavan, S., Smith, B. H., & Laurent, G. (1997). Impaired odour discrimination on desynchronization of odour-encoding neural assemblies. Nature, 390(6655), 70–74. Stopfer, M., Jayaraman, V., & Laurent, G. (2003). Intensity versus identity coding in an olfactory system. Neuron, 39(6), 991–1004. Teyke, T., & Gelperin, A. (1999). Olfactory oscillations augment odor discrimination not odor identification by Limax CNS. Neuroreport, 10(5), 1061–1068. von der Malsburg, C. (1995). Binding in models of perception and brain function. Curr. Opin. Neurobiol., 5(4), 520–526. von der Malsburg, C. (1999). 
The what and why of binding: The modeler’s perspective. Neuron, 24(1), 95–104, 111–125. Vosshall, L. B., Wong, A. M., & Axel, R. (2000). An olfactory sensory map in the fly brain. Cell, 102(2), 147–159. Wachowiak, M., & Cohen, L. B. (1999). Presynaptic inhibition of primary olfactory afferents mediated by different mechanisms in lobster and turtle. J. Neurosci., 19(20), 8808–8817. Wachowiak, M., Cohen, L. B., & Zochowski, M. R. (2002). Distributed and concentration-invariant spatial representations of odorants by receptor neuron input to the turtle olfactory bulb. J. Neurophysiol., 87(2), 1035–1045. Wehr, M., & Laurent, G. (1996). Odour encoding by temporal sequences of firing in oscillating neural assemblies. Nature, 384(6605), 162–166. Whittington, M. A., Stanford, I. M., Colling, S. B., Jefferys, J. G., & Traub, R. D. (1997). Spatiotemporal patterns of gamma frequency oscillations tetanically induced in the rat hippocampal slice. J. Physiol., 502 (Pt. 3), 591–607. Wilson, D. A. (2001). Receptive fields in the rat piriform cortex. Chem. Senses, 26(5), 577–584.
Wilson, D. A., & Stevenson, R. J. (2003). Olfactory perceptual learning: The critical role of memory in odor discrimination. Neurosci. Biobehav. Rev., 27(4), 307–328. Zou, Z., Horowitz, L. F., Montmayeur, J. P., Snapper, S., & Buck, L. B. (2001). Genetic tracing reveals a stereotyped sensory map in the olfactory cortex. Nature, 414(6860), 173–179.
Received February 15, 2005; accepted September 29, 2005.
LETTER
Communicated by Bard Ermentrout
Computation of the Phase Response Curve: A Direct Numerical Approach

W. Govaerts
[email protected]

B. Sautois
[email protected]

Department of Applied Mathematics and Computer Science, Ghent University, B-9000 Ghent, Belgium
Neurons are often modeled by dynamical systems—parameterized systems of differential equations. A typical behavioral pattern of neurons is periodic spiking; this corresponds to the presence of stable limit cycles in the dynamical systems model. The phase resetting and phase response curves (PRCs) describe the reaction of the spiking neuron to an input pulse at each point of the cycle. We develop a new method for computing these curves as a by-product of the solution of the boundary value problem for the stable limit cycle. The method is mathematically equivalent to the adjoint method, but our implementation is computationally much faster and more robust than any existing method. In fact, it can compute PRCs even where the limit cycle can hardly be found by time integration, for example, because it is close to another stable limit cycle. In addition, we obtain the discretized phase response curve in a form that is ideally suited for most applications. We present several examples and provide the implementation in a freely available Matlab code.
Neural Computation 18, 817–847 (2006)
© 2006 Massachusetts Institute of Technology

1 Introduction

Dynamical systems are often used to model individual neurons or neural networks to study the general behavior of neurons and how neural networks respond to different kinds of inputs. In this field, an important concept is the phase resetting or response curve. When a neuron is firing a train of spikes (action potentials), a short input pulse can modify the ongoing spike train. Even if the incoming pulse is purely excitatory, depending on the model and the exact timing of the pulse, it can either speed up or delay the next action potential. This affects properties of networks such as synchronization and phase locking (Ermentrout, 1996; Hansel, Mato, & Meunier, 1995). Applications of the phase response curve (PRC) to the stochastic response dynamics of weakly coupled or uncoupled neural populations can be found
in Brown, Moehlis, and Holmes (2004). They also derive PRCs for cycles close to bifurcations such as homoclinic orbits and Hopf points. The PRC quantifies the linearized effect of an input pulse at a given point of the orbit on the following train of action potentials. If an input pulse does not affect the spike train, the period is unchanged, and the PRC at that point of the curve is zero. If the input pulse delays the next spike, the period increases, and the PRC is negative; if the pulse speeds up the next spike, the PRC is positive. The PRC can be used to compute the effect of any perturbing influence on the curve if the perturbation does not essentially change the dynamic pattern of the neuron; in particular, it should not move the state of the neuron into a domain of attraction different from that of the spiking orbit. In the case of coupling, PRCs can be used to compute the influence of weak coupling. In this letter, we present a new numerical way to compute the PRC of a spiking neuron. Mathematically it is an implementation of the adjoint method, introduced by Ermentrout and Kopell (1991), but our implementation, as part of the boundary value problem for the limit cycle of the stable orbit, is faster and more robust than any existing method; in fact, we get the PRC as a by-product of the computation of the limit cycle. Our method is particularly useful in the context of the numerical continuation of orbits with a variable parameter of the system. We describe the details of the method, its implementation in freely available software, and several examples in neural models. We provide tests to compare the speed and accuracy with the direct (simulation) method in the case of the Hodgkin-Huxley equations and with the classical adjoint method in two other classical neural models. Also, we provide extensive tests in cases where the periodic orbits are close to homoclinics and to saddle nodes of cycles. 
We also describe the computation of the response to a function. The input function f is given over the whole periodic orbit. This is more realistic in the case of coupled neurons. We further apply this functionality to a simple study of synchronization of phase models.
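The kind of phase-model synchronization study mentioned above can be illustrated with a toy example (illustrative Python, not the letter's MATLAB code): two identical unit-period oscillators interact only through a PRC, and the sinusoidal PRC shape and coupling strength are assumptions made for the demonstration.

```python
import math

def cycle_map(psi, prc):
    """One round of firings for two pulse-coupled unit-period oscillators.
    A has just fired (phase 0), B is at phase psi in (0, 1); whenever one
    oscillator fires, the other's phase is shifted by prc(phase).
    Assumes weak coupling, so the firing order is preserved."""
    a = (1.0 - psi) + prc(1.0 - psi)  # B fires first; A receives a kick
    b = (1.0 - a) + prc(1.0 - a)      # then A fires; B receives a kick
    return b % 1.0                     # new phase difference

# Illustrative delay-advance PRC for which synchrony is attracting
prc = lambda phi: -0.05 * math.sin(2 * math.pi * phi)

psi = 0.2
for _ in range(20):
    psi = cycle_map(psi, prc)
# psi shrinks toward 0: the two oscillators approach synchrony
```

With the opposite sign of the PRC, the same map drives the phase difference toward 0.5 (anti-phase locking) instead.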
2 Response or Resetting?

The terms phase response curve and phase resetting curve, both abbreviated to PRC, are used interchangeably in the literature. Since there seems to be some confusion, we start with the precise definitions that will be used in the rest of this article.

2.1 Peak-Based Phase Response Curve. This is a rather intuitive and biologically oriented definition that can easily be applied to experimental data.
The phase response curve is a curve that is defined over the time span of one cycle in the spike train of a firing neuron with no extra input. It starts at the peak of a spike and extends to the peak of the following spike. At each point of the cycle, it indicates the effect on the oncoming spikes of an input pulse to the neuron at that particular time in its cycle. If we define the period of the unperturbed cycle as $T_{old}$, and $T_{new}$ is the period when the input pulse $I$ is given at time $t$, the phase response curve is defined as

$$G(t, I) = \frac{T_{old} - T_{new}}{T_{old}}. \tag{2.1}$$
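Definition 2.1 is exactly what the simulation-based ("direct") method discussed in section 3 computes: perturb the neuron at time $t$, measure the new period, and form $G$. A minimal sketch for a leaky integrate-and-fire neuron (an illustrative model chosen for its simplicity, not one of the letter's examples):

```python
import math

def lif_spike_time(I=2.0, tau=1.0, dt=1e-4, pulse_t=None, pulse_amp=0.0):
    """First spike time of a leaky integrate-and-fire neuron
    dV/dt = (I - V)/tau with threshold V = 1, integrated by forward Euler.
    An optional voltage pulse of size pulse_amp is applied at time pulse_t."""
    V, t = 0.0, 0.0
    kicked = pulse_t is None
    while V < 1.0:
        if not kicked and t >= pulse_t:
            V += pulse_amp
            kicked = True
        V += dt * (I - V) / tau
        t += dt
    return t

T_exact = math.log(2.0)   # analytic period for I = 2, tau = 1
T_old = lif_spike_time()  # numerical unperturbed period

def prc_direct(t, amp=0.01):
    """Peak-based PRC of definition 2.1: G = (T_old - T_new) / T_old."""
    return (T_old - lif_spike_time(pulse_t=t, pulse_amp=amp)) / T_old
```

For this model an excitatory pulse always advances the next spike, so $G$ is positive and grows with the time at which the pulse arrives in the cycle.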
2.2 Phase Resetting Curve. This notion can be defined mathematically for any stable periodic orbit of a dynamical system. Let the period of the orbit be denoted as $T_{old}$, and suppose that a point of the orbit is chosen as the base point (in a neural model, this point typically corresponds to a peak in the spike train). The phase resetting curve is defined in the interval $[0, 1]$, in which the variable $\phi$ is called the phase and is defined as

$$\phi = \frac{t}{T_{old}} \pmod{1}, \tag{2.2}$$
with $t \in [0, T_{old}]$. Now suppose that an input pulse $I$ is given to the oscillator when it is at the point $x(t)$. Mathematically, the pulse $I$ could be any vector in the state space of the system. In the biological application, it is usually a vector in the direction of the voltage component of the state space, since that component is affected by the synaptic input from other neurons. Since the orbit starting from $x(t) + I$ will close in on the stable limit cycle, there is exactly one point $x(t_1)$, $t_1 \in [0, T_{old}]$, with the property that

$$d(y(t), z(t)) \to 0 \quad \text{for } t \to \infty \quad \text{if } \begin{cases} y(0) = x(t) + I \\ z(0) = x(t_1) \end{cases}. \tag{2.3}$$

Here $x(t_1)$ and $x(t) + I$ are said to be on the same isochron (cf. Guckenheimer, 1975). We then define

$$g(\phi, I) = \frac{t_1}{T_{old}}. \tag{2.4}$$
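For one model the map $g$ can be written down exactly: the "radial isochron clock", a standard textbook example (not one of the letter's test models) whose stable limit cycle is the unit circle traversed at constant angular velocity and whose isochrons are radial rays, so the asymptotic phase of a perturbed point is simply its polar angle. A sketch making definitions 2.2 to 2.4 concrete:

```python
import math

def g(phi, I):
    """Asymptotic phase (equation 2.4) after a pulse I along the
    x-direction, applied at phase phi on the unit-circle limit cycle of
    the radial isochron clock. Isochrons are radial rays, so the new
    phase is just the polar angle of the kicked point."""
    x = math.cos(2 * math.pi * phi) + I
    y = math.sin(2 * math.pi * phi)
    return (math.atan2(y, x) / (2 * math.pi)) % 1.0

# Anticipating section 2.3: the phase response is the shift g(phi, I) - phi
def prc(phi, I):
    return g(phi, I) - phi
```

For a weak pulse ($I < 1$), the phase is unchanged at $\phi = 0$ and $\phi = 0.5$, delayed (negative response) for $0 < \phi < 0.5$, and advanced for $0.5 < \phi < 1$.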
2.3 Universal Definition of the Phase Response Curve (PRC). The definition of the peak-based response curve in section 2.1 is satisfactory only if each period contains precisely one easily identifiable spike and transients can be ignored. In this case, the next spike will occur at time

$$T_{new} = (\phi + (1 - g(\phi, I))) T_{old}. \tag{2.5}$$

So according to equation 2.1,

$$PRC(\phi, I) = G(T\phi, I) = \frac{T_{old} - (\phi + (1 - g(\phi, I))) T_{old}}{T_{old}} = g(\phi, I) - \phi. \tag{2.6}$$

To avoid the difficulties related to transients, we redefine the phase response curve by

$$PRC(\phi, I) = g(\phi, I) - \phi \tag{2.7}$$
in general. This definition is mathematically unambiguous and reduces to the definition in section 2.1 in the case of no transients.

3 Survey of Methods

Currently, two classes of methods are often used to compute PRCs. The simplest methods are direct applications of definition 2.1: using simulations, one passes through the cycle repeatedly, each time giving an input pulse at a different time and measuring the delay of the spike. From this, a PRC curve can be computed. This was done by Guevara, Glass, Mackey, and Shrier (1983) and many others. This method has several advantages. It is simple and does not require any special software (only a good time integrator is needed). It can be used for arbitrarily large perturbations, even if the next spike would be delayed by more than the period of one cycle, in which case the term delay might be confusing. Depending on the required accuracy of the PRC, this method typically takes a few seconds. In applications where the PRC of one given limit cycle is desired, it is quite satisfactory. The other methods are based on the use of the equation adjoint to the dynamical system that describes the activity of the neuron. The mathematical basis of this approach goes back to Malkin (1949, 1956; see also Blechman, 1971, and Ermentrout, 1981). An easily accessible reference is Hoppensteadt and Izhikevich (1997, sec. 9.2). The idea to use this method for numerical implementation goes back to Ermentrout and Kopell (1991). The implementation described in Ermentrout (1996) and available through Ermentrout's software package
XPPAUT (Ermentrout, 2002) is based on the backward integration of the adjoint linear equation of the dynamical system. This method is mathematically better defined and more general than the simulation method because it does not use the notion of a spike. It is also more restricted because it does not measure the influence of a real finite perturbation but rather a linearization of this effect that becomes more accurate as the perturbation gets smaller. As Ermentrout (1996) noted, the accuracy of the method based on backward integration is limited for two reasons. First, the adjoint Jacobian of the dynamical system is not available analytically. It has to be approximated by finite differences from a previously computed, discretized form of the orbit. Second, the integration of the linear equations produces the usual numerical errors. The method that we propose is mathematically equivalent to the adjoint method; only the implementation is new. The main advantage is that it is much faster. Timing results are given in section 6.2. Although the increase in speed is impressive, this is not very relevant if only one or a few PRCs are needed. The main application of our method therefore lies in cases where a large number of PRCs is needed, for example, when the evolution of PRCs under a changing parameter is of interest. The sources of error from the existing method do not apply to the method that we propose. In fact, we can compute PRCs even where the stable periodic orbits are hard to find by direct time integration.

4 Used Software

Our continuation computations are based on several software packages. Mostly we used MATCONT and CL MATCONT (Dhooge, Govaerts, Kuznetsov, Mestrom, & Riet, 2003). CL MATCONT is a MATLAB package for the study of dynamical systems and their bifurcations; MATCONT (Dhooge, Govaerts, & Kuznetsov, 2003) is its GUI version. Both packages are freely available online at http://matcont.UGent.be/.
CL MATCONT and MATCONT are successor packages to AUTO (Doedel et al., 1997–2000) and CONTENT (Kuznetsov & Levitin, 1998), which are written in compiled languages (Fortran, C, C++). Recently we made some structural changes to CL MATCONT, included C code to speed it up, and added some functionality (Govaerts & Sautois, 2005a). It is this new version that we used to test our method for computing the PRC. This version can be downloaded from http://allserv.UGent.be/~bsautois. Currently the two versions are being merged into one.

5 Methods

5.1 The PRC as an Adjoint Problem. For background on dynamical systems theory we refer to the standard textbooks (e.g., Guckenheimer
822
W. Govaerts and B. Sautois
& Holmes, 1983, or Kuznetsov, 2004). In this section, we introduce the adjoint method to compute the phase response curve. Our exposition is self-contained and involves Floquet multipliers and the monodromy matrix; however, these are not used in the actual computations. Let a neural model be defined by a system of differential equations,
$$\dot{X} = f(X, \alpha), \tag{5.1}$$
where X ∈ Rn represents the state variables and α is a vector of parameters. We consider a solution to equation 5.1 that corresponds to the periodically spiking behavior of the neuron with period T. By rescaling time, we find a periodic solution x(t) with period 1, solution to the equations
$$\dot{x} - T f(x, \alpha) = 0, \qquad x(0) - x(1) = 0. \tag{5.2}$$
We consider the periodic function $A(t) = f_x(x(t))$ with period 1. To equation 5.2, we associate the fundamental matrix solution $\Phi(t)$ of the nonautonomous linear equation
$$\dot{\Phi}(t) - T A(t)\,\Phi(t) = 0, \qquad \Phi(0) = I_n. \tag{5.3}$$
It is also useful to consider the equation adjoint to equation 5.3:
$$\dot{\Psi}(t) + T A(t)^T\,\Psi(t) = 0, \qquad \Psi(0) = I_n. \tag{5.4}$$
From equations 5.3 and 5.4, it follows that the time derivative of $\Psi(t)^T \Phi(t)$ is identically zero. By the initial conditions, this implies
$$\Psi(t)^T \Phi(t) \equiv I_n. \tag{5.5}$$
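The identity 5.5 is easy to verify numerically. The sketch below (an illustrative smooth 1-periodic matrix $A(t)$, not taken from any neuron model, with $T = 1$) integrates equations 5.3 and 5.4 with a classical Runge-Kutta scheme and checks that $\Psi(t)^T\Phi(t)$ remains the identity:

```python
import math

def a_of(t):
    # arbitrary smooth 1-periodic 2x2 matrix (illustrative choice only)
    return [[math.sin(2*math.pi*t), 1.0], [-1.0, math.cos(2*math.pi*t)]]

def mat_mul(A, B):
    return [[sum(A[i][k]*B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def mat_axpy(M, K, s):
    # entrywise M + s*K
    return [[M[i][j] + s*K[i][j] for j in range(2)] for i in range(2)]

def rk4_mat(rhs, M, t, h):
    # classical Runge-Kutta step for a matrix ODE M' = rhs(t, M)
    k1 = rhs(t, M)
    k2 = rhs(t + h/2, mat_axpy(M, k1, h/2))
    k3 = rhs(t + h/2, mat_axpy(M, k2, h/2))
    k4 = rhs(t + h, mat_axpy(M, k3, h))
    return [[M[i][j] + h/6*(k1[i][j] + 2*k2[i][j] + 2*k3[i][j] + k4[i][j])
             for j in range(2)] for i in range(2)]

def fundamental_and_adjoint(T=1.0, steps=2000):
    """Integrate Phi' = T*A*Phi (eq. 5.3) and Psi' = -T*A^T*Psi (eq. 5.4) over [0, 1]."""
    Phi = [[1.0, 0.0], [0.0, 1.0]]
    Psi = [[1.0, 0.0], [0.0, 1.0]]
    h = 1.0/steps
    for k in range(steps):
        t = k*h
        Phi = rk4_mat(lambda s, M: mat_mul([[T*a for a in row] for row in a_of(s)], M),
                      Phi, t, h)
        Psi = rk4_mat(lambda s, M: mat_mul([[-T*a for a in row] for row in zip(*a_of(s))], M),
                      Psi, t, h)
    return Phi, Psi
```

For any such $A(t)$ the product $\Psi(1)^T\Phi(1)$ should equal $I_2$ up to integration error.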
By taking derivatives of equation 5.2, we find that the tangent vector $v(t) = \dot{x}(t)$ satisfies
$$\dot{v}(t) - T A(t)\, v(t) = 0, \qquad v(0) - v(1) = 0. \tag{5.6}$$
From this we infer
$$v(t) = \Phi(t)\, v(0). \tag{5.7}$$
The monodromy matrix $M(t) = \Phi(t+1)\Phi(t)^{-1}$ is the linearized return map of the dynamical system. It is easy to see that
$$\Phi(t+1) = M(t)\Phi(t) = \Phi(t)M(0). \tag{5.8}$$
Hence,
$$M(t) = \Phi(t)\, M(0)\, \Phi(t)^{-1}. \tag{5.9}$$
The eigenvalues of M(t) are called multipliers; the similarity, equation 5.9, implies that they are independent of t. M(t) always has a multiplier equal to +1. By equations 5.6 and 5.7, v(0) = v(1) is a right eigenvector of $M(0) = \Phi(1)$ for the eigenvalue 1. By equation 5.9, v(t) is a right eigenvector of M(t) for the eigenvalue 1 for all t. Let us assume that the limit cycle is stable, such that all multipliers different from 1 have modulus strictly less than 1. Then in particular, M(0) has a unique left eigenvector $v_l^0$ for the multiplier 1, for which $(v_l^0)^T v(0) = 1$. For all t, we define $v_l(t) = \Psi(t)\, v_l^0$. It is now easy to see that
$$\dot{v}_l(t) + T A(t)^T v_l(t) = 0, \tag{5.10}$$
$$v_l(0) - v_l(1) = 0. \tag{5.11}$$
Also, for all t, $v_l(t)$ is a left eigenvector of M(t) for the eigenvalue 1 and
$$v_l(t)^T v(t) = 1. \tag{5.12}$$
Let R(t) be the joint right (n − 1)-dimensional eigenspace of M(t) that corresponds to all multipliers different from 1; that is, R(t) is the space orthogonal to $v_l(t)$. Now, let I be a pulse given at time t. We can decompose this pulse uniquely as
$$I = I_v + I_r, \tag{5.13}$$
with
$$I_v = c\, v(t), \qquad I_r \in R(t), \tag{5.14}$$
where $c \in \mathbb{R}$.
Linearizing this perturbation of x(t), we find that the effect of $I_r$ will die out, while the effect of $I_v$ will be to move the system vector along the orbit. The amount of the change in time is equal to c as defined in equation 5.14. The linearized PRC, which for simplicity we just call the PRC, at t for pulse I is therefore (note that we have rescaled to period 1)
$$PRC(t, I) = c. \tag{5.15}$$
Since $v_l(t)$ is orthogonal to $I_r$, we get from equations 5.13 and 5.12 that
$$v_l(t)^T I = v_l(t)^T I_v = c, \tag{5.16}$$
so
$$PRC(t, I) = v_l(t)^T I. \tag{5.17}$$
The adjoint equation to equation 5.1, as used by Ermentrout (1996), is defined by the system
$$\dot{Z}(t) = -A(t)^T Z(t), \qquad \frac{1}{T}\int_0^T Z(t)^T \dot{X}(t)\, dt = 1, \tag{5.18}$$
and the periodicity condition Z(T) = Z(0). It is related to our solution by the equation
$$Z(t) = T\, v_l\!\left(\frac{t}{T}\right). \tag{5.19}$$
In the case of the unscaled system 5.1, if a pulse I is given at time t = Tφ, we find that
$$PRC(\phi, I) = v_l(\phi)^T I. \tag{5.20}$$
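As an illustration of equation 5.20 and of the backward-integration approach recalled in section 3, the sketch below computes $v_l$ for the toy planar oscillator $\dot{x} = (1-r)x - 2\pi y$, $\dot{y} = (1-r)y + 2\pi x$ ($T = 1$), for which $v_l(t) = (-\sin 2\pi t, \cos 2\pi t)/2\pi$ is known in closed form. The model, starting vector, and number of transient periods are illustrative choices:

```python
import math

def jac(t):
    # Jacobian of the toy oscillator on its unit limit cycle
    c, s = math.cos(2*math.pi*t), math.sin(2*math.pi*t)
    return [[-c*c, -c*s - 2*math.pi], [-c*s + 2*math.pi, -s*s]]

def adjoint_rhs(t, z):
    # z' = -A(t)^T z (the adjoint equation, T = 1)
    A = jac(t)
    return [-(A[0][0]*z[0] + A[1][0]*z[1]), -(A[0][1]*z[0] + A[1][1]*z[1])]

def rk4_step(f, u, t, h):
    k1 = f(t, u)
    k2 = f(t + h/2, [u[i] + h/2*k1[i] for i in range(2)])
    k3 = f(t + h/2, [u[i] + h/2*k2[i] for i in range(2)])
    k4 = f(t + h, [u[i] + h*k3[i] for i in range(2)])
    return [u[i] + h/6*(k1[i] + 2*k2[i] + 2*k3[i] + k4[i]) for i in range(2)]

def adjoint_prc(n_periods=15, steps=2000):
    """Backward integration of the adjoint equation: the component transverse
    to v_l decays backward in time, so after several periods only v_l survives."""
    h = 1.0/steps
    z = [1.0, 1.0]                       # arbitrary starting vector
    samples = None
    for p in range(n_periods):           # A(t) is 1-periodic, so restart each pass at t = 1
        samples = [list(z)]
        for m in range(steps):
            z = rk4_step(adjoint_rhs, z, 1.0 - m*h, -h)
            samples.append(list(z))
    samples.reverse()                    # samples[j] is now v_l at t = j*h, up to scale
    # normalize with v_l(0)^T v(0) = 1, where v(0) = x'(0) = (0, 2*pi)
    scale = 1.0/(2*math.pi*samples[0][1])
    return [[scale*a for a in zj] for zj in samples]
```

The recovered curve can be compared pointwise to the closed-form $v_l$; the first component is the voltage-like PRC of equation 5.21 below.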
If I is the unit vector along the first (voltage) component, we have
$$PRC(\phi, I) = (v_l(\phi))_1, \tag{5.21}$$
where $(\cdot)_1$ denotes the first component. This situation is so common in neural models that $(v_l(\phi))_1$ is sometimes also referred to as the phase response curve. Instead of giving an impulse I at a fixed time t = Tφ, we can also add a (small) vector function $g(\frac{t}{T})$, g continuous over [0, 1], to the right-hand side
of equation 5.1 to model the ongoing input from other neurons. The phase response (P R) to this ongoing input is then given by
$$PR(g) = \int_0^1 v_l(\phi)^T g(\phi)\, d\phi. \tag{5.22}$$
5.2 Implementation. Our implementation was done using CL MATCONT (see section 4). In this package, as in AUTO (Doedel et al., 1997–2000) and CONTENT (Kuznetsov & Levitin, 1998), the continuation of limit cycles is based on equation 5.2 and uses orthogonal collocation. We briefly summarize this technique before describing its extension to compute phase response curves. (For details on the orthogonal collocation method, see Ascher & Russell, 1981, and De Boor & Swartz, 1973.) The idea to use the discretized equation to solve the adjoint problem was used in a completely different context in Kuznetsov, Govaerts, Doedel, & Dhooge (in press).

Since a limit cycle is periodic, we need an additional equation to fix the phase. For this phase equation, an integral condition is used. The system is the following:
$$\dot{x}(t) - T f(x(t), \alpha) = 0, \qquad x(0) - x(1) = 0, \qquad \int_0^1 \dot{\tilde{x}}(t)^T x(t)\, dt = 0, \tag{5.23}$$
where $\tilde{x}$ is some initial guess for the solution, typically obtained from the previous continuation step. This system is referred to as the boundary value problem (BVP). To describe a continuous limit cycle numerically, it needs to be stored in a discretized form. First, the interval [0, 1] is subdivided into N intervals:
$$0 = \tau_0 < \tau_1 < \cdots < \tau_N = 1. \tag{5.24}$$
The grid points $\tau_0, \tau_1, \ldots, \tau_N$ form the coarse mesh. In each interval $[\tau_i, \tau_{i+1}]$, the limit cycle is approximated by a degree m polynomial, whose values are stored in equidistant mesh points on each interval, namely,
$$\tau_{i,j} = \tau_i + \frac{j}{m}(\tau_{i+1} - \tau_i) \qquad (j = 0, 1, \ldots, m). \tag{5.25}$$
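For concreteness, the coarse and fine meshes of equations 5.24 and 5.25 can be built in a few lines (a sketch with a uniform coarse mesh; nonuniform coarse meshes work the same way):

```python
def make_meshes(N, m):
    """Coarse mesh 0 = tau_0 < ... < tau_N = 1 (uniform here) and the fine mesh
    tau_{i,j} = tau_i + (j/m)*(tau_{i+1} - tau_i) of equation 5.25."""
    coarse = [i/N for i in range(N + 1)]
    fine = []
    for i in range(N):
        a, b = coarse[i], coarse[i + 1]
        # j = 0..m-1: the right endpoint tau_{i,m} coincides with tau_{i+1,0}
        fine.extend(a + j*(b - a)/m for j in range(m))
    fine.append(coarse[-1])
    return coarse, fine
```

Note the count: N intervals with m subintervals each give Nm + 1 distinct fine mesh points, matching the dimension (Nm + 1)n used below.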
These grid points form the fine mesh. In each interval $[\tau_i, \tau_{i+1}]$, we then require the polynomials to satisfy the BVP exactly at m collocation points. The best choices for these collocation points are the Gauss points $\zeta_{i,j}$, that is, the roots of the Legendre polynomial of degree m, relative to the interval $[\tau_i, \tau_{i+1}]$ (De Boor & Swartz, 1973).
For a given vector function η ∈ C 1 ([0, 1], Rn ), we consider two different discretizations and two weight forms:
- $\eta_M \in \mathbb{R}^{(Nm+1)n}$, the vector of the function values at the mesh points
- $\eta_C \in \mathbb{R}^{Nmn}$, the vector of the function values at the collocation points
- $\eta_{WL} \in \mathbb{R}^{(Nm+1)n}$, the vector of the function values at the mesh points, each multiplied with its coefficient for piecewise Lagrange quadrature
- $\eta_{WG} = \begin{bmatrix} \eta_{W_1} \\ \eta_{W_2} \end{bmatrix} \in \mathbb{R}^{(Nm+1)n}$, where $\eta_{W_1}$ is the vector of the function values at the collocation points multiplied by the Gauss-Legendre weights and the lengths of the corresponding mesh intervals, and $\eta_{W_2} = \eta(0)$.
To explain the use of the weight forms, we first consider a scalar function $f(t) \in C^0([0,1], \mathbb{R})$. Then the integral $\int_0^1 f(t)\,dt$ can be numerically approximated by appropriate linear combinations of function values. This can be done in several ways. (For background on quadrature methods, we refer to Deuflhard & Hohmann, 1991.) If the fine mesh points are used, then the best approximation has the form
$$\sum_{i=0}^{N-1}\sum_{j=0}^{m} l_{m,j}\, (f_M)_{i,j}\, (\tau_{i+1} - \tau_i) \tag{5.26}$$
$$= \sum_{i=0}^{N-1}\sum_{j=0}^{m-1} (f_{WL})_{i,j} + (f_{WL})_{N-1,m}. \tag{5.27}$$
In equation 5.26, the coefficients $l_{m,j}$ are the Lagrange quadrature coefficients and $(f_M)_{i,j} = f(\tau_{i,j})$; this equation is the exact integral if f(t) is a piecewise polynomial of degree m or less. Equation 5.27 is a reorganization of equation 5.26 and defines $f_{WL}$.

The integral $\int_0^1 f(t)g(t)\,dt$ ($f(t), g(t) \in C^0([0,1], \mathbb{R})$) is then approximated by the vector inner product $f_{WL}^T g_M$. For vector functions $f(t), g(t) \in C^0([0,1], \mathbb{R}^n)$, the integral $\int_0^1 f(t)^T g(t)\,dt$ is formally approximated by the same expression: $f_{WL}^T g_M$.

If the collocation points are used, then the best approximation has the form
$$\sum_{i=0}^{N-1}\sum_{j=1}^{m} \omega_{m,j}\, (f_C)_{i,j}\, (\tau_{i+1} - \tau_i) = \sum_{i=0}^{N-1}\sum_{j=1}^{m} (f_{W_1})_{i,j}, \tag{5.28}$$
where $(f_C)_{i,j} = f(\zeta_{i,j})$ and $\omega_{m,j}$ are the Gauss-Legendre quadrature coefficients. Formula 5.28 delivers the exact integral if f(t) is a piecewise polynomial of degree 2m − 1 or less.

The integral $\int_0^1 f(t)g(t)\,dt$ ($f(t), g(t) \in C^0([0,1], \mathbb{R})$) is approximated with Gauss-Legendre by $f_{W_1}^T g_C = f_{W_1}^T L_{C\times M}\, g_M$. For vector functions $f(t), g(t) \in C^0([0,1], \mathbb{R}^n)$, the integral $\int_0^1 f(t)^T g(t)\,dt$ is formally approximated by the same expression: $f_{W_1}^T g_C = f_{W_1}^T L_{C\times M}\, g_M$. Here we formally used the structured sparse matrix $L_{C\times M}$ that converts a vector $\eta_M$ of function values at the mesh points into the vector $\eta_C$ of its values at the collocation points:
$$\eta_C = L_{C\times M}\, \eta_M. \tag{5.29}$$
This matrix is usually not formed explicitly; its entries are fixed and given by the values of the Lagrange interpolation functions in the collocation points. In the Newton steps for the computation of the solution to equation 5.23, we solve matrix equations with the Jacobian matrix of the discretization of this equation:
$$\begin{pmatrix} (D - T A(t))_{C\times M} & (-f(x(t), \alpha))_C \\ (\delta_0 - \delta_1)_M^T & 0 \\ (\dot{\tilde{x}}(t))_{WL}^T & 0 \end{pmatrix}. \tag{5.30}$$
In equation 5.30, the matrix $(D - TA(t))_{C\times M}$ is the discretized form of the operator D − TA(t), where D is the differentiation operator. So we have $(D - TA(t))_{C\times M}\, \eta_M = (\dot{\eta}(t) - TA(t)\eta(t))_C$. We note that this is a large, sparse, and well-structured matrix. In AUTO (Doedel et al., 1997–2000) and CONTENT (Kuznetsov & Levitin, 1998), this structure is fully exploited; in CL MATCONT, only the sparsity is exploited by using the MATLAB sparse solvers. We note that the evaluation of $(D - TA(t))_{C\times M}$ takes most of the computing time in the numerical continuation of limit cycles. Furthermore, $(\delta_0 - \delta_1)_M^T$ is the discretization in the fine mesh points of the operator $\delta_0 - \delta_1$, where $\delta_0, \delta_1$ are the Dirac evaluation operators in 0 and 1, respectively. So the second block row in equation 5.30 is an $(n \times (Nm+1)n)$-matrix whose first $(n \times n)$-block is the identity matrix $I_n$ and whose last but one $(n \times n)$-block is $-I_n$; all other entries are zero. We note that in a continuation context, equation 5.30 is extended by an additional column, which contains $(-T f_\alpha(x(t), \alpha))_C$, where α is the free parameter, and by an additional row, which is added by the continuation algorithm.
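The Gauss-Legendre weighting that enters $\eta_{WG}$ and formula 5.28 above amounts to ordinary composite Gauss quadrature per mesh interval. A minimal sketch for m = 4 collocation points (the hard-coded values are the standard 4-point Gauss-Legendre nodes and weights on [−1, 1]; the mesh is illustrative):

```python
import math

# 4-point Gauss-Legendre nodes and weights on [-1, 1] (m = 4)
GAUSS_X = (-0.8611363115940526, -0.3399810435848563,
            0.3399810435848563,  0.8611363115940526)
GAUSS_W = (0.3478548451374538, 0.6521451548625461,
           0.6521451548625461, 0.3478548451374538)

def collocation_quadrature(f, mesh):
    """Composite Gauss-Legendre quadrature of f over the mesh intervals:
    exact for piecewise polynomials of degree <= 2m - 1 = 7 (cf. formula 5.28)."""
    total = 0.0
    for a, b in zip(mesh, mesh[1:]):
        mid, half = (a + b)/2, (b - a)/2
        # collocation points zeta_{i,j} and their weights, scaled to [a, b]
        total += half*sum(w*f(mid + half*x) for x, w in zip(GAUSS_X, GAUSS_W))
    return total
```

The weighted function values inside the inner sum are exactly the entries of $f_{W_1}$ for one mesh interval, so the inner product $f_{W_1}^T g_C$ of the text is this quadrature rule applied to the product fg.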
Now if the limit cycle is computed, we also want to compute $v_l(t)$, solution to equations 5.11 and 5.12. So $v_l(t)$ is defined up to scaling by the condition that it is a null vector of the operator $\phi_2 : C^1([0,1], \mathbb{R}^n) \to C^0([0,1], \mathbb{R}^n) \times \mathbb{R}^n$, where
$$\phi_2(\zeta) = \begin{pmatrix} \dot{\zeta} + T A^T \zeta \\ \zeta(0) - \zeta(1) \end{pmatrix}.$$
By routine computations, one proves that this is equivalent to
$$\begin{pmatrix} \zeta \\ \zeta(0) \end{pmatrix} \perp \phi_1\big(C^1([0,1], \mathbb{R}^n)\big),$$
where
$$\phi_1 : C^1([0,1], \mathbb{R}^n) \to C^0([0,1], \mathbb{R}^n) \times \mathbb{R}^n, \qquad \phi_1(\zeta) = \begin{pmatrix} \dot{\zeta} - T A \zeta \\ \zeta(0) - \zeta(1) \end{pmatrix}.$$
So $\begin{pmatrix} v_l(t) \\ v_l(0) \end{pmatrix}$ is orthogonal to the range of $\begin{pmatrix} D - TA(t) \\ \delta_0 - \delta_1 \end{pmatrix}$. Now, by equation 5.12 and the fact that $v(t) = \dot{x}(t)$, we can state that
$$\left[\, ((v_l)_{WG})^T \;\; 0 \,\right] \begin{pmatrix} (D - T A(t))_{C\times M} & (-f(x(t), \alpha))_C \\ (\delta_0 - \delta_1)_M^T & 0 \\ (\dot{\tilde{x}}(t))_{WL}^T & 0 \end{pmatrix} = \left[\, 0 \;\; -\frac{1}{T} \,\right]. \tag{5.31}$$
The matrix in equation 5.31 is freely available, since it is the same one as in equation 5.30. Obtaining (vl )WG from equation 5.31 is equivalent to solving a large, sparse system, a task done efficiently by MATLAB. The first part (vl )W1 of the obtained vector (vl )WG represents vl (t) but is evaluated in the collocation points and still weighted with the interval widths and the Gauss-Legendre weights. The second part (vl )W2 of the obtained vector (vl )WG is equal to vl (0). In many important cases (see, e.g.,
section 7), $(v_l)_{W_1}$ is precisely what we need because $v_l(t)$ will be used to compute integrals of the form
$$\langle v_l, g \rangle = \int_0^1 v_l^T g(t)\, dt,$$
where g(t) is a given vector function that is continuous in [0, 1]. This integral is then numerically approximated by the vector inner product,
$$(v_l)_{W_1}^T g_C = (v_l)_{W_1}^T L_{C\times M}\, g_M. \tag{5.32}$$
In some cases, we want to know $(v_l)_M$. Since we know the Gauss-Legendre weights and the interval widths, we can eliminate them explicitly from $(v_l)_{W_1}$ and obtain $(v_l)_C$. Now equation 5.29 converts values in mesh points to values in collocation points. In addition, we know that $v_l(0) = v_l(1) = (v_l)_{W_2}$. Combining these results, we get
$$\begin{pmatrix} (v_l)_C \\ v_l(0) \end{pmatrix} = \begin{pmatrix} L_{C\times M} \\ 0.5 \;\; 0 \;\cdots\; 0 \;\; 0.5 \end{pmatrix} (v_l)_M, \tag{5.33}$$
where the matrix $\begin{pmatrix} L_{C\times M} \\ 0.5\; 0 \cdots 0\; 0.5 \end{pmatrix}$ is sparse, square, and well conditioned, so equation 5.33 can be solved easily by MATLAB to get $(v_l)_M$. We note that in this case (and only in this case), it is necessary to form $L_{C\times M}$ explicitly. Now we have access to all elements needed to compute the PRC using equation 5.20.

6 Test Results

6.1 Correctness of Our Method

6.1.1 Comparison to Direct Method. Here we compare our method of computing the PRC to the direct method, which consists of giving input pulses at different points in the cycle and measuring the resulting changes in the cycle period (cf. section 3). As a test system, we choose the Hodgkin-Huxley system. Figure 1 shows the PRC for this model for I = 12, as computed using our new method. The period of the limit cycle is then 13.72 seconds. During a continuation experiment, this computation takes about 0.04 seconds. Figure 2 shows two PRCs for the same model and the same parameter values, which were computed in the direct way. The PRCs were computed for different pulse amplitudes and durations: pulse amplitudes are 10 and 20, and pulse durations are 0.05 and 0.15 seconds, for Figures 2A and 2B, respectively.
Figure 1: PRC of the Hodgkin-Huxley model at I = 12.
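The direct method of section 3 is easy to script. As an illustration (a toy setup, not the Hodgkin-Huxley experiment of the figures), the sketch below measures the PRC of the planar oscillator $\dot{x} = (1-r)x - 2\pi y$, $\dot{y} = (1-r)y + 2\pi x$, whose isochrons are exactly radial, so the measured spike shift can be checked against a closed-form answer even for a finite pulse; pulse size and step size are illustrative:

```python
import math

def clock(t, u):
    # toy oscillator: unit-circle limit cycle, period 1, spike at phase 0 (x > 0, y = 0)
    x, y = u
    r = math.hypot(x, y)
    return [(1 - r)*x - 2*math.pi*y, (1 - r)*y + 2*math.pi*x]

def rk4(f, u, t, h):
    k1 = f(t, u)
    k2 = f(t + h/2, [u[i] + h/2*k1[i] for i in range(2)])
    k3 = f(t + h/2, [u[i] + h/2*k2[i] for i in range(2)])
    k4 = f(t + h, [u[i] + h*k3[i] for i in range(2)])
    return [u[i] + h/6*(k1[i] + 2*k2[i] + 2*k3[i] + k4[i]) for i in range(2)]

def spike_time(u, t, nth=3, h=1e-3):
    """Time of the nth upward crossing of y = 0 with x > 0 after time t."""
    count = 0
    while True:
        un = rk4(clock, u, t, h)
        if u[1] < 0.0 <= un[1] and un[0] > 0.0:
            count += 1
            if count == nth:
                return t + h*(-u[1])/(un[1] - u[1])   # linear interpolation
        u, t = un, t + h

def direct_prc(phi, eps, h=1e-3):
    """Pulse of size eps in x at phase phi; returns the phase advance (period 1)."""
    u, t = [1.0, 0.0], 0.0
    for _ in range(round(phi/h)):      # integrate from the spike to the pulse time
        u = rk4(clock, u, t, h)
        t += h
    u[0] += eps                        # the pulse
    return spike_time([1.0, 0.0], 0.0) - spike_time(u, t)
```

Because the isochrons of this model are radial, the exact phase advance for a pulse ε in x at phase φ is $(\mathrm{atan2}(\sin 2\pi\phi, \cos 2\pi\phi + \varepsilon) - 2\pi\phi)/2\pi$ (taken modulo 1 to the nearest value), which the measured spike shift reproduces.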
Visually, it is clear that the shapes of the curves match. A rough computation shows that the matching is also quantitative. Indeed, the situation of Figure 2A corresponds to a resetting pulse of size 10 × 0.05 = 0.5 (millivolts). Dividing the maximal value of the computed PRC by 0.5, we obtain 0.014/0.5 = 0.028 (per millivolt). Similarly, for the situation of Figure 2B, we obtain 0.075/(20 × 0.15) = 0.025 (per millivolt). This closely matches the computed maximal value in Figure 1. This shows that our method is accurate and applicable not only for infinitesimally small input pulses but also for pulses of finite size and duration. The direct computation of these experimental PRCs, with the limited precision they have (as is obvious from the figures), took between 60 and 70 seconds each. Also, the smaller the inputs for an experimental computation, the higher the precision has to be, since the PRC amplitudes shrink, and thus noise due to imprecision increases in relative size. This is already evident from the difference between Figures 2A and 2B. And computation time increases with increasing precision. 6.1.2 Three Classical Situations. In this section we check that in some standard situations, our results correspond with those in the literature. In fact, our figures almost perfectly match the ones computed by Ermentrout (1996), except for the scaling factor (which is the cycle period). This should not come as a surprise since in the absence of computational errors (e.g., rounding, truncation), the methods used are equivalent. Here we present a
Figure 2: Experimentally obtained PRCs for the Hodgkin-Huxley model. (A) Pulse amplitude = 10, pulse duration = 0.05. (B) Pulse amplitude = 20, pulse duration = 0.15.
couple of the most widely known models. The equations and fixed parameter values for all models are listed in the appendix. The Hodgkin-Huxley model, as described in the appendix in section A.1, is known to exhibit a PRC of type II, which means a PRC with a positive and a negative region. So two coupled Hodgkin-Huxley neurons with excitatory connections can still slow down each other’s spike rate. Figure 1 shows the PRC for this model, for I = 12, where the limit cycle has a period of 13.72. The Morris-Lecar model is given in section A.2. It is known to have different types of behavior and phase response curves at different parameter values (Govaerts & Sautois, 2005b). Figure 3A shows the PRC at V3 = 4 and I = 45, where the cycle period is 62.38. We clearly see a negative and positive regime in the PRC. In Figure 3B, the PRC is shown at V3 = 15 and I = 39, where the period is 106.38; the PRC is practically nonnegative. Finally, we show a result for the Connor model (Connor, Walter, & McKown, 1977), a six-dimensional model, given in section A.3, which has a nonnegative PRC. Figure 4 depicts this PRC, for I = 8.5 and period 98.63. 6.2 Three Further Applications. In this section, we compute families of PRCs in cases where the shape of the PRC is known to have important consequences for networking behavior. The change of the shapes of the PRCs under parameter variation therefore is an interesting object of study. Moreover, this will allow us to obtain timing results for the computation of the PRCs in both absolute and relative terms. We further illustrate the robustness of the method by computing the PRC for a limit cycle that is hard to find by direct integration because of its small domain of attraction. 6.2.1 Morris-Lecar: Limit Cycles Close to Homoclinics. Brown et al. (2004) study the response dynamics of weakly coupled or uncoupled neural populations using the PRCs of periodic orbits near certain bifurcation points. 
In particular, they obtain the PRCs of the spiking periodic orbits near homoclinic to saddle node orbits (HSN; Brown et al. call this the SNIPER bifurcation) and homoclinic to hyperbolic saddle orbits (HHS; they call this the homoclinic bifurcation). They obtain standard forms for these PRCs, based on a normal form analysis of the homoclinic orbits. In numerical calculations that involve the Hindmarsh-Rose model in the first (HSN) case and the Morris-Lecar model in the second (HHS) case, the normal form predictions are largely confirmed. It turns out that the PRCs in the two cases look very different. On the other hand, it is well known that the transition from HSN to HHS orbits is generic when a curve of HSN or HHS orbits is computed. The generic transition point is a noncentral homoclinic to saddle node orbit (NCHSN). Moreover, this transition is not uncommon in neural models, and indeed we found it in the ubiquitous Morris-Lecar model (cf. Govaerts & Sautois, 2005b).
Figure 3: PRCs of the Morris-Lecar model. (A) PRC at V3 = 4 and I = 45. (B) PRC at V3 = 15 and I = 39.
Figure 4: PRC of the Connor model at I = 8.5.
It follows that the PRCs near HSN orbits can be transformed smoothly into the PRCs near HHS orbits. Analytically this calls for a normal form analysis of the NCHSN orbits. We have not performed this but used the situation as a testing ground for the computation of PRCs. We computed a branch of spiking orbits with fixed high period (this forces the branch to follow the homoclinic orbits closely) from the HHS regime into the HSN regime, computed the PRCs in a large number of points, and plotted the resulting PRCs in a single picture to get geometric insight into the way the PRCs are transformed from one into the other. In Figure 5A, part of the phase plane is shown for the Morris-Lecar model. Since the curves are very close together, we added a qualitative diagram (see Figure 5B). The pictures show the saddle node curve (thin line) and the curve of HHS orbits (dashed line), which coincide and form a curve of HSN orbits (thick line). The point of coincidence is encircled; this is the NCHSN point. Close to the curves of HHS and HSN orbits is the curve (dotted) of limit cycles with fixed period 237.665 along which we have computed PRCs. The continuation was done using 80 mesh intervals and 4 collocation points per interval. Figure 5C shows the resulting 100 computed PRCs. We started the limit cycle continuation from a cycle with V3 = 10 and I = 40.536. The corresponding PRC is the one that, in Figure 5C, has the left-most peak; it is also slightly darker in the picture than the following PRCs. The picture shows the smooth transition of consecutive PRCs. The gray PRCs are the
Figure 5: (A) Phase plot for the Morris-Lecar model. (B) Qualitative picture to clarify relative positions of the curves. Thin line = limit point curve; thick line = HSN curve; dashed line = HHS curve; circle = NCHSN point; dotted line = curve of limit cycles with period 237.6646. (C) PRCs for limit cycles along the limit cycle curve. Gray = PRCs for limit cycles close to HHS curves; black = PRCs for limit cycles close to HSN curves.
ones for limit cycles at values of V3 lower than the value at the NCHSN point (V3 = 12.4213). The dark PRCs correspond to limit cycles at higher V3 values. The shapes obtained for PRCs near HSN orbits and those near HHS orbits both confirm the results of Brown et al. (2004). Two significant facts stand out from the picture. First, the dark PRCs are (at least close to) strictly positive, while the gray PRCs have a distinct
negative region. Second, the PRCs closest to the V3 value of the NCHSN point are also the PRCs with the lowest peak, that is, the PRCs in the bottom of the “valley” formed by the consecutive PRCs for limit cycles further away from that particular value, in either direction. This suggests that the NCHSN orbit has a distinct influence on the shape of the PRCs of nearby limit cycles. This is not surprising but to our knowledge has never been investigated. The computation of the 100 PRCs along the limit cycle curve took a total time of 2.34 seconds: 0.0234 seconds per PRC. To compare this with standard methods, we note that Ermentrout (2002, section 9.5.3) states that to compute one PRC takes 1 or 2 seconds. This is certainly acceptable if one is interested in a single PRC but hardly acceptable if one is interested in the evolution of the PRC under a change of parameters. The full continuation run for these 100 limit cycles took 62.41 seconds. So our PRC computations took only about 3.75% of the total time needed for the experiment. If one used a PRC computing method that takes, for example, 2 seconds per PRC, this would cause the total run time of the program to increase up to 260 seconds, an increase of more than 300%. 6.2.2 Hodgkin-Huxley: Robustness. Our method is robust in the sense that it can compute PRCs for all limit cycles that can be found through continuation; there are no further sources of error as in the traditional implementation of the computation of the PRC (cf. Ermentrout, 1996). In fact we can compute PRCs even for limit cycles that are hard to find by any means other than numerical continuation, which can happen when their domain of attraction is not easily found. In the Hodgkin-Huxley model, there is typically a short interval in which a stable equilibrium and two stable limit cycles coexist. In our case, for the parameter values specified in section A.1, this happens between values I = 7.8588 and 7.933. These limit cycles are shown in Figure 6A. 
The smaller of the two stable limit cycles in the picture exists for only a short I-interval and has a very small attraction domain. Therefore, it is extremely hard to find by, for example, time integration. This implies that it is also not trivial to compute the PRC corresponding to that particular limit cycle. Our method, however, has no problem computing it. In Figure 6B, the PRCs are depicted that correspond to the limit cycles from Figure 6A. The shapes of the two PRCs are very different. Also note that the darker PRC was actually larger in amplitude but was rescaled to the same range as the gray PRC.
Figure 6: (A) Two stable limit cycles for the Hodgkin-Huxley model at parameter value I = 7.919. (B) Corresponding PRCs. The gray PRC corresponds to the gray limit cycle and the black PRC to the black limit cycle.
The big stable limit cycle loses its stability at an LPC that occurs at I = 6.276. Starting from I = 7.919, we computed 150 continuation steps, with 80 mesh intervals and 4 collocation points. The final computed limit cycle was at I = 6.377. The last 100 PRCs are shown in Figure 7A, where the palest one, with the biggest amplitude, is the PRC corresponding to the limit cycle closest to the LPC. The small, stable limit cycle loses its stability at an LPC that occurs at I = 7.8588. Starting from I = 7.919, we computed 100 continuation steps, with smaller step sizes than for the big limit cycle. The final computed limit cycle was at I = 7.8592. The corresponding 100 PRCs are shown in Figure 7B. Again the palest one is closest to the LPC.

During the continuations, the actual PRC computations took around 4 seconds total for 100 PRCs, that is, 0.04 second per PRC. The full continuation run took around 100 seconds, so the PRC computations took about 4% of the total run time. Again, the use of another method to compute the PRC, taking 2 seconds per limit cycle, would increase the time needed to about 300 seconds, an increase of 200%.

Both PRCs have positive and negative regions, but otherwise their shapes are very different. It is noteworthy that the shapes of the PRCs near the big LPC look similar to the PRCs predicted near the Bautin bifurcation by Brown et al. (2004). It is well known that branches of LPCs generically are born at Bautin bifurcation points. However, the shapes of the PRCs near the small LPC are very different. We note that both LPCs are far from Bautin bifurcation points. This is again a subject for further investigation.

7 Response to a Function

In section 5.1 we discussed the computation of spike train responses to any given input function g. We can now compute the right-hand side of equation 5.22 numerically. In fact, using equation 5.29, we obtain the high-order approximation
$$PR(g) = (v_l)_{W_1}^T (g(\phi))_C = (v_l)_{W_1}^T L_{C\times M}\, (g(\phi))_M. \tag{7.1}$$
Hence, the computation of the phase response to a function is reduced to the discretization of that function and the computation of a vector inner product. In this section, we briefly show how this is done in a classical and well-understood situation: the phase dynamics of two coupled Poincaré oscillators. The Poincaré oscillator has been used many times as a model of biological oscillations (e.g., in Glass & Mackey, 1988, and Glass, 2003). It is sufficient for present purposes to note that it has two state variables x, y and a stable limit cycle with period 1 given by x = cos 2πt, y = sin 2πt.
Figure 7: (A) PRCs of big Hodgkin-Huxley limit cycles approaching an LPC at I = 6.276. (B) PRCs of small Hodgkin-Huxley limit cycles approaching an LPC at I = 7.8588.
The dynamics of weakly coupled oscillators can be reduced to their phase dynamics (cf. Hansel, Mato, & Meunier, 1993, where further references can be found). We restrict to the simple setting discussed in Ermentrout (2002). Consider two identical, weakly coupled oscillators, with autonomous period T:
$$X_1' = F(X_1) + \varepsilon\, G_1(X_2, X_1), \qquad X_2' = F(X_2) + \varepsilon\, G_2(X_1, X_2), \tag{7.2}$$
with $G_1$ and $G_2$ two possibly different coupling functions, and ε a small positive number. Let $X_0(t)$ be an asymptotically stable periodic solution of $X' = F(X)$ with period T. Then, for ε sufficiently small,
$$X_j(t) = X_0(\theta_j) + O(\varepsilon) \qquad (j = 1, 2), \tag{7.3}$$
with
$$\theta_1' = 1 + \varepsilon H_1(\theta_2 - \theta_1), \qquad \theta_2' = 1 + \varepsilon H_2(\theta_1 - \theta_2),$$
where the $H_j$ are T-periodic functions given by
$$H_j(\psi) \equiv \frac{1}{T}\int_0^T Z(t)^T G_j[X_0(t+\psi), X_0(t)]\, dt, \tag{7.4}$$
where Z is the adjoint solution as defined in equation 5.18. Now consider two identical Poincaré oscillators. A natural choice for the coupling functions $G_j$ is $G_1(X_i, X_j) = G_2(X_i, X_j) = X_i - X_j$. It follows that
$$H_1(\phi) = H_2(\phi) = \int_0^1 v_l^T(t)\,[X_0(t+\phi) - X_0(t)]\, dt. \tag{7.5}$$
One finds easily that
$$(\theta_2 - \theta_1)' = 2\varepsilon \sin[2\pi(\theta_1 - \theta_2)] \int_0^1 v_l(t)^T \begin{pmatrix} -\sin(2\pi t) \\ \cos(2\pi t) \end{pmatrix} dt. \tag{7.6}$$
We set $\zeta = \theta_2 - \theta_1$. The constant function $\zeta = \frac12$ is a solution to equation 7.6; it corresponds to antiphase synchronization. On the other hand, if at the starting time $\zeta \neq \frac12$ (mod 1), then by elementary calculus, we obtain
$$\tan(\pi\zeta) = C \exp(-4\pi\varepsilon\alpha t) \tag{7.7}$$
for some constant $C \neq 0$, where
$$\alpha = \int_0^1 v_l(t)^T \begin{pmatrix} -\sin(2\pi t) \\ \cos(2\pi t) \end{pmatrix} dt. \tag{7.8}$$
This leads to the following conclusion about the behavior of ζ for $t \to \infty$:
$$\alpha > 0 \Rightarrow \zeta \to 0 \text{ or } 1, \qquad \alpha < 0 \Rightarrow \zeta \to \tfrac12, \qquad \alpha = 0 \Rightarrow \zeta \to \text{constant}. \tag{7.9}$$
In the first case, the two oscillators converge to in-phase synchronization; in the second case, they move toward antiphase synchronization; and in the third case, the oscillators remain in the out-of-phase synchronization they started with. Now α is nothing else than the phase response to a function and can be computed by equation 7.1. In this case, we find that α > 0, so two identical Poincaré oscillators coupled in both directions by the function $G(X_2, X_1) = X_2 - X_1$ always converge to in-phase synchronization, except when they start in perfect antiphase synchronization, in which case they will remain in that state. This is easily checked by numerical simulation (and can be shown analytically). Analogously, it is easy to compute α for other coupling functions and to check the result by simulations. For example, if $G_1(X_2, X_1) = G_2(X_2, X_1) = X_1 - X_2$, then α < 0, so the two Poincaré oscillators always converge to antiphase synchronization. A more interesting case is obtained if we set
$$X_1 = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad X_2 = \begin{pmatrix} x_3 \\ x_4 \end{pmatrix}, \quad G_1(X_2, X_1) = \begin{pmatrix} x_3 - x_1 \\ 0 \end{pmatrix}, \quad G_2(X_1, X_2) = \begin{pmatrix} x_1 - x_3 \\ 0 \end{pmatrix}.$$
In this case, we again find α > 0, and there is always in-phase synchronization, except for when we start in exact antiphase. When we set
$$G_1(X_2, X_1) = \begin{pmatrix} x_4 - x_2 \\ 0 \end{pmatrix}, \qquad G_2(X_1, X_2) = \begin{pmatrix} x_2 - x_4 \\ 0 \end{pmatrix},$$
then α is zero (up to truncation and rounding errors), so the oscillators keep their initial out-of-phase synchronization. This is confirmed by numerical simulations.

8 Conclusion

We have developed a new method for computing the phase response curve of a spiking neuron. The new method is mathematically equivalent to the adjoint method, but our implementation is faster and more robust than any existing method. It can be used as part of the continuation of the boundary value problem defining the limit cycle that describes the spiking neuron. In that case, the additional computational cost is small. Tests on several well-established neural models have shown that our method produces correct results for the PRCs very quickly. We have also extended the code with the possibility of computing the spike train response to any given function.

Appendix: Neural Models

A.1 Hodgkin-Huxley Model. The Hodgkin-Huxley model is defined by the following equations:
$$C\frac{dV}{dt} = I - g_{Na}\, m^3 h\, (V - V_{Na}) - g_K\, n^4 (V - V_K) - g_L (V - V_L)$$
$$\frac{dm}{dt} = \phi\,((1-m)\,\alpha_m - m\,\beta_m); \quad \frac{dh}{dt} = \phi\,((1-h)\,\alpha_h - h\,\beta_h); \quad \frac{dn}{dt} = \phi\,((1-n)\,\alpha_n - n\,\beta_n); \quad \phi = 3^{\frac{T-6.3}{10}} \tag{A.1}$$
Computation of the Phase Response Curve
ψ_αm = (25 − V)/10
α_m = ψ_αm / (exp[ψ_αm] − 1)
β_m = 4 exp[−V/18]
α_h = 0.07 exp[−V/20]
β_h = 1/(1 + exp[(30 − V)/10])
ψ_αn = (10 − V)/10
α_n = 0.1 ψ_αn / (exp[ψ_αn] − 1)
β_n = 0.125 exp[−V/80].

In this letter, the parameters C = 1, g_Na = 120, V_Na = 115, g_K = 36, V_K = −12, g_L = 0.3, V_L = 10.559, and T = 6.3 are fixed. I is varied according to the tests.

A.2 Morris-Lecar Model. The Morris-Lecar model is defined by the following equations:

C dV/dt = I_ext − g_L (V − V_L) − g_Ca M_∞ (V − V_Ca) − g_K N (V − V_K)
dN/dt = τ_N (N_∞ − N)                                            (A.2)
M_∞ = (1/2)(1 + tanh((V − V_1)/V_2))
N_∞ = (1/2)(1 + tanh((V − V_3)/V_4))
τ_N = φ cosh((V − V_3)/(2V_4))

In our tests, C = 5, g_L = 2, V_L = −60, g_Ca = 4, V_Ca = 120, g_K = 8, V_K = −80, φ = 1/15, V_1 = −1.2, V_2 = 18, and V_4 = 17.4 are fixed. I_ext and V_3 are varied according to the tests.
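The Hodgkin-Huxley equations A.1 with the listed parameters translate directly into code. A minimal Python sketch follows; the Euler integrator, its step size, the initial condition at the gating steady state, and the test currents are illustrative choices, not from the text:

```python
import math

# Hodgkin-Huxley model of equations A.1 with the fixed parameters listed above
C, gNa, VNa, gK, VK, gL, VL, T = 1.0, 120.0, 115.0, 36.0, -12.0, 0.3, 10.559, 6.3
phi = 3.0 ** ((T - 6.3) / 10.0)  # temperature factor, equal to 1 at T = 6.3

def _ratio(psi):
    # psi / (exp(psi) - 1), guarding the removable singularity at psi = 0
    return 1.0 if abs(psi) < 1e-9 else psi / (math.exp(psi) - 1.0)

def rates(V):
    a_m = _ratio((25.0 - V) / 10.0)
    b_m = 4.0 * math.exp(-V / 18.0)
    a_h = 0.07 * math.exp(-V / 20.0)
    b_h = 1.0 / (1.0 + math.exp((30.0 - V) / 10.0))
    a_n = 0.1 * _ratio((10.0 - V) / 10.0)
    b_n = 0.125 * math.exp(-V / 80.0)
    return a_m, b_m, a_h, b_h, a_n, b_n

def simulate(I, t_end=50.0, dt=0.01):
    # start at V = 0 with the gating variables at their steady-state values
    V = 0.0
    a_m, b_m, a_h, b_h, a_n, b_n = rates(V)
    m, h, n = a_m / (a_m + b_m), a_h / (a_h + b_h), a_n / (a_n + b_n)
    vmax = V
    for _ in range(int(t_end / dt)):
        a_m, b_m, a_h, b_h, a_n, b_n = rates(V)
        dV = (I - gNa * m**3 * h * (V - VNa) - gK * n**4 * (V - VK)
              - gL * (V - VL)) / C
        m += dt * phi * ((1.0 - m) * a_m - m * b_m)
        h += dt * phi * ((1.0 - h) * a_h - h * b_h)
        n += dt * phi * ((1.0 - n) * a_n - n * b_n)
        V += dt * dV
        vmax = max(vmax, V)
    return vmax  # largest membrane excursion seen
```

With a suprathreshold current such as I = 10 the model fires and the membrane excursion far exceeds the resting level, while with I = 0 it stays near rest; this matches the convention above in which the resting potential is at V = 0.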
A.3 Connor Model. The Connor model is defined by the following equations:

C dV/dt = I − g_L (V − E_L) − g_Na m^3 h (V − E_Na) − g_K n^4 (V − E_K) − g_A A^3 B (V − E_A)
dm/dt = (m_∞(V) − m)/τ_m(V)
dh/dt = (h_∞(V) − h)/τ_h(V)
dn/dt = (n_∞(V) − n)/τ_n(V)                                      (A.3)
dA/dt = (A_∞(V) − A)/τ_A(V)
dB/dt = (B_∞(V) − B)/τ_B(V)

with

α_m = 0.1(V + 29.7)/(1 − exp(−0.1(V + 29.7)))
β_m = 4 exp[−(V + 54.7)/18]
m_∞ = α_m/(α_m + β_m)
τ_m = (1/3.8) · 1/(α_m + β_m)
α_h = 0.07 exp[−0.05(V + 48)]
β_h = 1/(1 + exp[−0.1(V + 18)])
h_∞ = α_h/(α_h + β_h)
τ_h = (1/3.8) · 1/(α_h + β_h)
α_n = 0.01(V + 45.7)/(1 − exp[−0.1(V + 45.7)])
β_n = 0.125 exp(−0.0125(V + 55.7))
n_∞ = α_n/(α_n + β_n)
τ_n = (1/3.8) · 2/(α_n + β_n)
A_∞ = (0.0761 exp[(V + 94.22)/31.84]/(1 + exp[(V + 1.17)/28.93]))^{1/3}
τ_A = 0.3632 + 1.158/(1 + exp[(V + 55.96)/20.12])
B_∞ = 1/(1 + exp[(V + 53.3)/14.54])^4
τ_B = 1.24 + 2.678/(1 + exp[(V + 50)/16.027]).
In this article, the parameters C = 1, g_L = 0.3, E_L = −17, g_Na = 120, g_A = 47.7, E_Na = 55, g_K = 20, E_K = −72, and E_A = −75 are fixed. I is varied according to the tests.

Acknowledgments

B. S. thanks the Fund for Scientific Research Flanders (FWO) for funding the research reported in this article. Both authors thank the two referees for several critical remarks that led to substantial improvements in the article content and presentation.

References

Ascher, U., & Russell, R. D. (1981). Reformulation of boundary value problems in standard form. SIAM Rev., 23(2), 238–254.
Blechman, I. I. (1971). Synchronization of dynamical systems [in Russian: Sinchronizatzia dinamicheskich sistem]. Moscow: Nauka.
Brown, E., Moehlis, J., & Holmes, P. (2004). On the phase reduction and response dynamics of neural oscillator populations. Neural Comput., 16, 673–715.
Connor, J. A., Walter, D., & McKown, R. (1977). Modifications of the Hodgkin-Huxley axon suggested by experimental results from crustacean axons. Biophys. J., 18, 81–102.
De Boor, C., & Swartz, B. (1973). Collocation at Gaussian points. SIAM J. Numer. Anal., 10, 582–606.
Deuflhard, P., & Hohmann, A. (1991). Numerische Mathematik: Eine algorithmisch orientierte Einführung. New York: Walter de Gruyter.
Dhooge, A., Govaerts, W., & Kuznetsov, Yu. A. (2003). MATCONT: A MATLAB package for numerical bifurcation analysis of ODEs. ACM TOMS, 29(2), 141–164.
Dhooge, A., Govaerts, W., Kuznetsov, Yu. A., Mestrom, W., & Riet, A. M. (2003). A continuation toolbox in MATLAB. Available online at http://allserv.UGent.be/~ajdhooge/doc_cl_matcont.zip.
Doedel, E. J., Champneys, A. R., Fairgrieve, T. F., Kuznetsov, Yu. A., Sandstede, B., & Wang, X. J. (1997–2000). AUTO97: Continuation and bifurcation software for ordinary differential equations (with HomCont). Available online at http://indy.cs.concordia.ca/auto.
Ermentrout, G. B. (1981). n:m phase-locking of weakly coupled oscillators. J. Math. Biol., 12, 327–342.
Ermentrout, G. B. (1996). Type I membranes, phase resetting curves and synchrony. Neural Comput., 8(5), 979–1001.
Ermentrout, G. B. (2002). Simulating, analyzing, and animating dynamical systems: A guide to XPPAUT for researchers and students. Philadelphia: SIAM.
Ermentrout, G. B., & Kopell, N. (1991). Multiple pulse interactions and averaging in systems of coupled neural oscillators. J. Math. Biol., 29, 195–217.
Glass, L. (2003). Resetting and entraining biological rhythms. In A. Beuter, L. Glass, M. C. Mackey, & M. S. Titcombe (Eds.), Nonlinear dynamics in physiology and medicine. New York: Springer-Verlag.
Glass, L., & Mackey, M. C. (1988). From clocks to chaos: The rhythms of life. Princeton, NJ: Princeton University Press.
Govaerts, W., & Sautois, B. (2005a). Bifurcation software in MATLAB with applications in neuronal modeling. Comput. Meth. Prog. Bio., 77(2), 141–153.
Govaerts, W., & Sautois, B. (2005b). The onset and extinction of neural spiking: A numerical bifurcation approach. J. Comput. Neurosci., 18(3), 273–282.
Guckenheimer, J. (1975). Isochrons and phaseless sets. J. Math. Biol., 1, 259–273.
Guckenheimer, J., & Holmes, P. (1983). Nonlinear oscillations, dynamical systems and bifurcations of vector fields. New York: Springer.
Guevara, M. R., Glass, L., Mackey, M. C., & Shrier, A. (1983). Chaos in neurobiology. IEEE Trans. Syst. Man Cybern., 13(5), 790–798.
Hansel, D., Mato, G., & Meunier, C. (1993). Phase dynamics for weakly coupled Hodgkin-Huxley neurons. Europhys. Lett., 23(5), 367–372.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comput., 7, 307–337.
Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer-Verlag.
Kuznetsov, Yu. A. (2004). Elements of applied bifurcation theory (3rd ed.). New York: Springer-Verlag.
Kuznetsov, Yu. A., Govaerts, W., Doedel, E. J., & Dhooge, A. (in press). Numerical periodic normalization for codim 1 bifurcations of limit cycles. SIAM J. Numer. Anal.
Kuznetsov, Yu. A., & Levitin, V. V. (1998). CONTENT: Integrated environment for analysis of dynamical systems, version 1.5. Amsterdam: CWI. Available online at http://ftp.cwi.nl/CONTENT.
Malkin, I. G. (1949). Methods of Poincaré and Lyapunov in the theory of non-linear oscillations [in Russian: Metodi Puankare i Liapunova v teorii nelineinix kolebanii]. Moscow: Gostexizdat.
Malkin, I. G. (1956). Some problems in nonlinear oscillation theory [in Russian: Nekotorye zadachi teorii nelineinix kolebanii]. Moscow: Gostexizdat.
Received February 9, 2005; accepted June 27, 2005.
LETTER
Communicated by Richard Hahnloser
Multiperiodicity and Exponential Attractivity Evoked by Periodic External Inputs in Delayed Cellular Neural Networks

Zhigang Zeng
[email protected] School of Automation, Wuhan University of Technology, Wuhan, Hubei, 430070, China
Jun Wang
[email protected] Department of Automation and Computer-Aided Engineering, Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
We show that an n-neuron cellular neural network with time-varying delay can have 2^n periodic orbits located in saturation regions and that these periodic orbits are locally exponentially attractive. In addition, we give some conditions for ascertaining that periodic orbits are locally or globally exponentially attractive and for allowing them to be located in any designated region. As a special case of exponential periodicity, exponential stability of delayed cellular neural networks is also characterized. These conditions improve and extend the existing results in the literature. To illustrate and compare the results, simulation results are discussed in three numerical examples.

Neural Computation 18, 848–870 (2006)
© 2006 Massachusetts Institute of Technology

1 Introduction

Cellular neural networks (CNNs) and delayed cellular neural networks (DCNNs) are arrays of dynamical cells that are suitable for solving many complex computational problems. In recent years, both have been extensively studied and successfully applied to signal processing and to solving nonlinear algebraic equations. As dynamic systems with a special structure, CNNs and DCNNs have many interesting properties that deserve theoretical study. In general, there are two interesting nonlinear neurodynamic properties in CNNs and DCNNs: stability and periodic oscillations. The stability of a CNN or a DCNN at an equilibrium point means that for a given activation function and a constant input vector, an equilibrium of the network exists and any state in the neighborhood converges to the equilibrium. The stability of neuron activation states at an equilibrium is a prerequisite for most applications. Some neurodynamic systems have multiple (two) stable equilibria and may be stable at any equilibrium depending on the initial state, which is called
multistability (bistability). For stability, either an equilibrium or a set of equilibria is the attractor. Besides stability, an activation state may be periodically oscillatory around an orbit. In this case, the attractor is a limit set. Periodic oscillation in recurrent neural networks is an interesting dynamic behavior, as many biological and cognitive activities (e.g., heartbeat, respiration, mastication, locomotion, and memorization) require repetition. Persistent oscillation, such as limit cycles, represents a common feature of neural firing patterns produced by dynamic interplay between cellular and synaptic mechanisms. Stimulus-evoked oscillatory synchronization was observed in many biological neural systems, including the cerebral cortex of mammals and the brain of insects. It was also known that time delays can cause oscillations in neurodynamics (Gopalsamy & Leung, 1996; Belair, Campbell, & Driessche, 1996). In addition, periodic oscillations in recurrent neural networks have found many applications, such as associative memories (Nishikawa, Lai, & Hoppensteadt, 2004), pattern recognition (Wang, 1995; Chen, Wang, & Liu, 2000), machine learning (Ruiz, Owens, & Townley, 1998; Townley et al., 2000), and robot motion control (Jin & Zacksenhouse, 2003). The analysis of periodic oscillation of neural networks is more general than stability analysis since an equilibrium point can be viewed as a special case of oscillation with any arbitrary period. The stability of CNNs and DCNNs has been widely investigated (e.g., Chua & Roska, 1990, 1992; Civalleri, Gilli, & Pandolfi, 1993; Liao, Wu, & Yu, 1999; Roska, Wu, Balsi, & Chua, 1992; Roska, Wu, & Chua, 1993; Setti, Thiran, & Serpico, 1998; Takahashi, 2000; Zeng, Wang, & Liao, 2003). The existence of periodic orbits together with global exponential stability of CNNs and DCNNs is studied in Yi, Heng, and Vadakkepat (2002) and Wang and Zou (2004). 
Most existing studies (Berns, Moiola, & Chen, 1998; Jiang & Teng, 2004; Kanamaru & Sekine, 2004; Liao & Wang, 2003; Liu, Chen, Cao, & Huang, 2003; Wang & Zou, 2004) are based on the assumption that the equilibrium point of CNNs or DCNNs is globally stable or that the periodic orbit of CNNs or DCNNs is globally attractive; hence, CNNs or DCNNs have only one equilibrium point or one periodic orbit. However, in many applications, it is required that CNNs or DCNNs exhibit more than one stable equilibrium point (e.g., Yi, Tan, & Lee, 2003; Zeng, Wang, & Liao, 2004), or more than one exponentially attractive periodic orbit instead of a single globally stable equilibrium point. In this letter, we investigate the multiperiodicity and multistability of DCNNs. We show that an n-neuron DCNN can have 2^n periodic orbits that are locally exponentially attractive. Moreover, we present estimates of the attractive domains of such 2^n locally exponentially attractive periodic orbits. In addition, we give conditions for periodic orbits to be locally or globally exponentially attractive when the periodic orbits are located in a designated position. All of these conditions are easy to verify.
The remaining part of this letter consists of six sections. In section 2, relevant background information is given. The main results are stated in sections 3, 4, and 5. In section 6, three illustrative examples are provided with simulation results. Finally, concluding remarks are given in section 7.

2 Preliminaries

Consider the DCNN model governed by the following normalized dynamic equations:

dx_i(t)/dt = −x_i(t) + Σ_{j=1}^n a_ij f(x_j(t)) + Σ_{j=1}^n b_ij f(x_j(t − τ_ij(t))) + u_i(t),  i = 1, . . . , n,   (2.1)

where x = (x_1, . . . , x_n)^T ∈ ℝ^n is the state vector, A = (a_ij) and B = (b_ij) are connection weight matrices that are not assumed to be symmetric, u(t) = (u_1(t), . . . , u_n(t))^T ∈ ℝ^n is a periodic input vector with period ω (i.e., there exists a constant ω > 0 such that u_i(t + ω) = u_i(t) ∀t ≥ 0, ∀i ∈ {1, 2, . . . , n}), τ_ij(t) is the time-varying delay that satisfies 0 ≤ τ_ij(t) ≤ τ (τ is a constant), and f(·) is the piecewise-linear activation function defined by f(v) = (|v + 1| − |v − 1|)/2. In particular, when b_ij ≡ 0 (∀i, j = 1, 2, . . . , n), the DCNN degenerates into a CNN.

Let C([t_0 − τ, t_0], D) be the space of continuous functions mapping [t_0 − τ, t_0] into D ⊂ ℝ^n with the norm defined by ||φ||_{t_0} = max_{1≤i≤n} sup_{u∈[t_0−τ,t_0]} |φ_i(u)|, where φ(s) = (φ_1(s), φ_2(s), . . . , φ_n(s))^T. Denote ||x|| = max_{1≤i≤n} |x_i| as the vector norm of the vector x = (x_1, . . . , x_n)^T. For all φ, ϕ ∈ C([t_0 − τ, t_0], D), where φ(s) = (φ_1(s), . . . , φ_n(s))^T and ϕ(s) = (ϕ_1(s), . . . , ϕ_n(s))^T, denote ||φ, ϕ||_{t_0} = max_{1≤i≤n} sup_{t_0−τ≤s≤t_0} |φ_i(s) − ϕ_i(s)| as a measurement in C([t_0 − τ, t_0], D). The initial condition of DCNN model 2.1 is assumed to be φ(ϑ) = (φ_1(ϑ), φ_2(ϑ), . . . , φ_n(ϑ))^T, where φ(ϑ) ∈ C([t_0 − τ, t_0], ℝ^n). Denote x(t; t_0, φ) as the state of DCNN model 2.1 with initial condition (t_0, φ); that is, x(t; t_0, φ) is continuous, satisfies equation 2.1, and x(s; t_0, φ) = φ(s) for s ∈ [t_0 − τ, t_0].

Since ℝ = (−∞, +∞) = (−∞, −1) ∪ [−1, 1] ∪ (1, +∞), the space ℝ^n can be divided into 3^n subspaces, collected in

Ω = {Π_{i=1}^n (−∞, −1)^{δ_1^{(i)}} × [−1, 1]^{δ_2^{(i)}} × (1, +∞)^{δ_3^{(i)}} : (δ_1^{(i)}, δ_2^{(i)}, δ_3^{(i)}) = (1, 0, 0), (0, 1, 0), or (0, 0, 1), i = 1, . . . , n},   (2.2)

where an exponent of 1 keeps the corresponding interval in the product and an exponent of 0 omits it;
and the collection Ω of subspaces in equation 2.2 can be divided into three families:

Ω_1 = {[−1, 1]^n},
Ω_2 = {Π_{i=1}^n (−∞, −1)^{δ^{(i)}} × (1, +∞)^{1−δ^{(i)}} : δ^{(i)} = 1 or 0, i = 1, . . . , n},
Ω_3 = Ω − Ω_1 − Ω_2.

Hence, Ω_1 is composed of one region, Ω_2 is composed of 2^n regions, and Ω_3 is composed of 3^n − 2^n − 1 regions.

Definition 1. A periodic orbit x∗(t) is said to be a limit cycle of a DCNN if x∗(t) is an isolated periodic orbit of the DCNN; that is, there exists ω > 0 such that ∀t ≥ t_0, x∗(t + ω) = x∗(t), and there exists δ > 0 such that no point x ∈ ℝ^n with 0 < ||x, x∗(t)|| < δ for some t ≥ t_0 lies on any other periodic orbit of the DCNN.

Definition 2. A periodic orbit x∗(t) of a DCNN is said to be locally exponentially attractive in a region Λ ⊂ ℝ^n if there exist constants α > 0, β > 0 such that ∀t ≥ t_0,

||x(t; t_0, φ) − x∗(t)|| ≤ β ||φ||_{t_0} exp{−α(t − t_0)},

where x(t; t_0, φ) is the state of the DCNN with any initial condition (t_0, φ), φ(ϑ) ∈ C([t_0 − τ, t_0], Λ), and Λ is said to be a locally exponentially attractive set of the periodic orbit x∗(t). When Λ = ℝ^n, x∗(t) is said to be globally exponentially attractive. In particular, if x∗(t) is a fixed point x∗, then the DCNN is said to be globally exponentially stable.

Lemma 1 (Kosaku, 1978). Let D be a compact set in ℝ^n, and let H be a mapping on the complete metric space (C([t_0 − τ, t_0], D), ||·, ·||_{t_0}). If H(C([t_0 − τ, t_0], D)) ⊂ C([t_0 − τ, t_0], D) and there exists a constant α < 1 such that ∀φ, ϕ ∈ C([t_0 − τ, t_0], D), ||H(φ), H(ϕ)||_{t_0} ≤ α ||φ, ϕ||_{t_0}, then there exists φ∗ ∈ C([t_0 − τ, t_0], D) such that H(φ∗) = φ∗.

Consider the following coupled system:

dx(t)/dt = −x(t) + Ay(t) + By(t − τ(t)),   (2.3)
dy(t)/dt = h(t, y(t), y(t − τ(t))),   (2.4)

where x(t) ∈ ℝ^n, y(t) ∈ ℝ^m, A and B are n × m matrices, and h ∈ C(ℝ × ℝ^m × ℝ^m, ℝ^m), the space of continuous functions mapping ℝ × ℝ^m × ℝ^m into ℝ^m.
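Returning to the DCNN model 2.1 itself, its dynamics are straightforward to integrate numerically. The following is a minimal Euler sketch with a constant delay; the integration scheme, step size, and any concrete weights or inputs used with it are illustrative assumptions, not values from the text:

```python
def f(v):
    # piecewise-linear activation of equation 2.1
    return (abs(v + 1.0) - abs(v - 1.0)) / 2.0

def simulate_dcnn(A, B, u, tau, x0, dt=0.01, t_end=40.0):
    """Euler integration of equation 2.1 with constant delay tau.

    A, B: n x n weight matrices (lists of lists); u: function t -> input
    vector; x0: constant initial function on [-tau, 0]."""
    n = len(x0)
    d = int(round(tau / dt))                   # delay expressed in steps
    hist = [list(x0) for _ in range(d + 1)]    # hist[0] = x(t - tau), hist[-1] = x(t)
    traj = []
    for k in range(int(round(t_end / dt))):
        x, xd = hist[-1], hist[0]
        ut = u(k * dt)
        xn = [x[i] + dt * (-x[i]
                           + sum(A[i][j] * f(x[j]) for j in range(n))
                           + sum(B[i][j] * f(xd[j]) for j in range(n))
                           + ut[i])
              for i in range(n)]
        hist.pop(0)
        hist.append(xn)
        traj.append(xn)
    return traj
```

For instance, with dominant self-connections such as A = [[3, 0.1], [0.1, 3]], B = [[0.1, 0.1], [0.1, 0.1]], and inputs bounded by 0.5 (so that condition 3.1 of the next section holds), a trajectory started with all components above 1 stays above 1, and one started with all components below −1 stays below −1, illustrating coexisting attractors in different saturation regions.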
Lemma 2. If system 2.4 is globally exponentially stable, then system 2.3 is also globally exponentially stable.

Proof. By the variation-of-constants formula, the solution x(t) of equation 2.3 can be expressed as

x(t) = exp{−(t − t_0)} x(t_0) + ∫_{t_0}^{t} exp{−(t − s)} (Ay(s) + By(s − τ(s))) ds.

Since equation 2.4 is globally exponentially stable, there exist constants α, β > 0 such that |y(s)| ≤ β exp{−α(s − t_0)}. Hence, |Ay(s) + By(s − τ(s))| ≤ β̄ exp{−α(s − t_0)}, where β̄ = (||A|| + ||B|| exp{ατ}) β. Then, when α = 1, ∀t ≥ t_0, |x(t)| ≤ |x(t_0)| exp{−(t − t_0)} + β̄ (t − t_0) exp{−(t − t_0)}; when α ≠ 1, |x(t)| ≤ |x(t_0)| exp{−(t − t_0)} + β̄ (exp{−(t − t_0)} + exp{−α(t − t_0)})/|1 − α|; that is, equation 2.3 is also globally exponentially stable.

Throughout this article, we assume that N1 ∪ N2 ∪ N3 = {1, 2, . . . , n} and that N1 ∩ N2, N1 ∩ N3, and N2 ∩ N3 are empty. Denote D1 = {x ∈ ℝ^n | x_i ∈ (−∞, −1), i ∈ N1; x_i ∈ (1, ∞), i ∈ N2; x_i ∈ [−1, 1], i ∈ N3}. Note that D1 is one of the subspaces in equation 2.2. If N3 is empty, then denote D2 = {x ∈ ℝ^n | x_i ∈ (−∞, −1), i ∈ N1; x_i ∈ (1, ∞), i ∈ N2}.

3 Locally Exponentially Attractive Multiperiodicity in a Saturation Region

In this section, we show that an n-neuron delayed cellular neural network can have 2^n periodic orbits located in saturation regions and that these periodic orbits are locally exponentially attractive.

Theorem 1. If ∀i ∈ {1, 2, . . . , n}, ∀t ≥ t_0,

|u_i(t)| < a_ii − 1 − Σ_{j=1, j≠i}^n |a_ij| − Σ_{j=1}^n |b_ij|,   (3.1)

then DCNN (see equation 2.1) has 2^n locally exponentially attractive limit cycles.

Proof. If ∀s ∈ [t_0 − τ, t], x(s) ∈ D2, then from equation 2.1, ∀i = 1, 2, . . . , n,

dx_i(t)/dt = −x_i(t) − Σ_{j∈N1} (a_ij + b_ij) + Σ_{j∈N2} (a_ij + b_ij) + u_i(t).   (3.2)
When i ∈ N2 and x_i(t) = 1, from equations 3.1 and 3.2,

dx_i(t)/dt = −1 − Σ_{j∈N1} (a_ij + b_ij) + Σ_{j∈N2} (a_ij + b_ij) + u_i(t) > 0.   (3.3)

When i ∈ N1 and x_i(t) = −1, from equations 3.1 and 3.2,

dx_i(t)/dt = 1 − Σ_{j∈N1} (a_ij + b_ij) + Σ_{j∈N2} (a_ij + b_ij) + u_i(t) < 0.   (3.4)

Equations 3.3 and 3.4 imply that if ∀φ ∈ C([t_0 − τ, t_0], D2), then x(t; t_0, φ) will stay in D2, and D2 is an invariant set of DCNN (see equation 2.1). So ∀t ≥ t_0 − τ, x(t) ∈ D2. Hence, DCNN, equation 2.1, can be rewritten as equation 3.2. Let x(t; t_0, φ) and x(t; t_0, ϕ) be two states of DCNN, equation 2.1, with initial conditions (t_0, φ) and (t_0, ϕ), where φ, ϕ ∈ C([t_0 − τ, t_0], D2). From equations 2.1 and 3.2, ∀i ∈ {1, 2, . . . , n}, ∀t ≥ t_0,

d(x_i(t; t_0, φ) − x_i(t; t_0, ϕ))/dt = −(x_i(t; t_0, φ) − x_i(t; t_0, ϕ)).   (3.5)

Hence, ∀i = 1, 2, . . . , n, ∀t ≥ t_0,

|x_i(t; t_0, φ) − x_i(t; t_0, ϕ)| ≤ ||φ, ϕ||_{t_0} exp{−(t − t_0)}.   (3.6)

Define x_φ^{(t)}(θ) = x(t + θ; t_0, φ), θ ∈ [t_0 − τ, t_0]. Then from equations 3.3 and 3.4, ∀φ ∈ C([t_0 − τ, t_0], D2), x_φ^{(t)} ∈ C([t_0 − τ, t_0], D2). Define a mapping H : C([t_0 − τ, t_0], D2) → C([t_0 − τ, t_0], D2) by H(φ) = x_φ^{(ω)}. Then H(C([t_0 − τ, t_0], D2)) ⊂ C([t_0 − τ, t_0], D2), and H^m(φ) = x_φ^{(mω)}. We can choose a positive integer m such that exp{−(mω − τ)} ≤ α < 1. Hence, from equation 3.6,

||H^m(φ), H^m(ϕ)||_{t_0} ≤ max_{1≤i≤n} sup_{θ∈[t_0−τ,t_0]} |x_i(mω + θ; t_0, φ) − x_i(mω + θ; t_0, ϕ)|
                         ≤ ||φ, ϕ||_{t_0} exp{−(mω + t_0 − τ − t_0)} ≤ α ||φ, ϕ||_{t_0}.

Based on lemma 1, there exists a unique fixed point φ∗ ∈ C([t_0 − τ, t_0], D2) such that H^m(φ∗) = φ∗. In addition, H^m(H(φ∗)) = H(H^m(φ∗)) = H(φ∗). This shows that H(φ∗) is also a fixed point of H^m. Hence, by the uniqueness of the fixed
point of the mapping H^m, H(φ∗) = φ∗; that is, x_{φ∗}^{(ω)} = φ∗. Let x(t; t_0, φ∗) be a state of DCNN, equation 2.1, with initial condition (t_0, φ∗). Then from equation 2.1, ∀i = 1, 2, . . . , n, ∀t ≥ t_0,

dx_i(t; t_0, φ∗)/dt = −x_i(t; t_0, φ∗) − Σ_{j∈N1} (a_ij + b_ij) + Σ_{j∈N2} (a_ij + b_ij) + u_i(t).

Hence, ∀i = 1, 2, . . . , n, ∀t + ω ≥ t_0,

dx_i(t + ω; t_0, φ∗)/dt = −x_i(t + ω; t_0, φ∗) − Σ_{j∈N1} (a_ij + b_ij) + Σ_{j∈N2} (a_ij + b_ij) + u_i(t + ω)
                        = −x_i(t + ω; t_0, φ∗) − Σ_{j∈N1} (a_ij + b_ij) + Σ_{j∈N2} (a_ij + b_ij) + u_i(t).

This implies that x(t + ω; t_0, φ∗) is also a state of DCNN, equation 2.1, with initial condition (t_0, φ∗). x_{φ∗}^{(ω)} = φ∗ implies that ∀t ≥ t_0,

x(t + ω; t_0, φ∗) = x(t; t_0, x_{φ∗}^{(ω)}) = x(t; t_0, φ∗).
Hence, x(t; t_0, φ∗) is a periodic orbit of DCNN, equation 2.1, with period ω. From equation 3.5, it is easy to see that any state of DCNN, equation 2.1, with initial condition (t_0, φ) (φ ∈ C([t_0 − τ, t_0], D2)) converges to this periodic orbit exponentially as t → +∞. Hence, the isolated periodic orbit x(t; t_0, φ∗) located in D2 is locally exponentially attractive, and D2 is a locally exponentially attractive set of x(t; t_0, φ∗). Since there are 2^n regions of the form D2 (one for each choice of the sets N1 and N2), there exist 2^n isolated periodic orbits, and all 2^n of them are locally exponentially attractive.

When the periodic external input u(t) degenerates into a constant vector, we have the following corollary:

Corollary 1. If ∀i ∈ {1, 2, . . . , n}, ∀t ≥ t_0, u_i(t) ≡ u_i (constant) and

|u_i| < a_ii − 1 − Σ_{j=1, j≠i}^n |a_ij| − Σ_{j=1}^n |b_ij|,

then DCNN (see equation 2.1) has 2^n locally exponentially stable equilibrium points.
Proof. Since u_i(t) ≡ u_i (constant), for an arbitrary constant ν ∈ ℝ, u_i(t + ν) ≡ u_i ≡ u_i(t). According to theorem 1, DCNN, equation 2.1, has 2^n locally exponentially attractive limit cycles with period ν. The arbitrariness of the constant ν implies that such limit cycles are fixed points. Hence, DCNN, equation 2.1, has 2^n locally exponentially attractive equilibrium points.

Remark 1. In theorem 1 and corollary 1, it is necessary for a_ii to be dominant in the sense that a_ii > 1 + Σ_{j=1, j≠i}^n |a_ij| + Σ_{j=1}^n |b_ij|.

Remark 2. A main objective in designing associative memories is to store a large number of patterns as stable equilibria or limit cycles such that stored patterns can be retrieved when the initial probes contain sufficient information about the patterns. CNNs and DCNNs are also suitable for very large-scale integration (VLSI) implementations of associative memories. It is also expected that they can be applied to associative memories by storing patterns as periodic limit cycles. According to theorem 1 and corollary 1, the n-neuron DCNN model, equation 2.1, can store up to 2^n patterns in locally exponentially attractive limit cycles or equilibria, which can be retrieved when the input vector satisfies condition 3.1. This implies that the external stimuli also play a major role in encoding and decoding patterns in DCNN associative memories, in contrast with the zero input vector in bidirectional associative memories and the autoassociative memories based on the Hopfield network.
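Condition 3.1 is mechanical to check for a given network. A small sketch follows; the function name and the example matrices are hypothetical illustrations, not from the text:

```python
def theorem1_margin(A, B, u_bound):
    """Smallest slack in condition 3.1 over all rows.

    A, B are the n x n weight matrices of equation 2.1, and u_bound[i] is a
    bound on |u_i(t)|. A positive return value means condition 3.1 holds, so
    theorem 1 yields 2**n locally exponentially attractive limit cycles.
    """
    n = len(A)
    return min(
        A[i][i] - 1.0
        - sum(abs(A[i][j]) for j in range(n) if j != i)
        - sum(abs(B[i][j]) for j in range(n))
        - u_bound[i]
        for i in range(n)
    )
```

For example, with A = [[3, 0.1], [0.1, 3]], B = [[0.1, 0.1], [0.1, 0.1]], and inputs bounded by 0.5, the margin is 1.2, so the condition holds; shrinking the diagonal below the dominance threshold of remark 1 makes the margin negative.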
4 Locally Exponentially Attractive Periodicity in a Designated Region

As the limit cycles are stimulus driven (nonautonomous), some information can be encoded in the phases of the oscillating states x_i relative to the inputs u_i. Hence, it is necessary to find some conditions on the inputs u_i when the periodic orbit x(t) is to be located in a designated region. In this section, we give the conditions that allow a periodic orbit to be locally exponentially attractive and located in any designated region.
Theorem 2. If ∀t ≥ t_0,

u_i(t) < −1 + Σ_{j∈N1} (a_ij + b_ij) − Σ_{j∈N2} (a_ij + b_ij) − Σ_{j∈N3} (|a_ij| + |b_ij|),  i ∈ N1,   (4.1)

u_i(t) > 1 + Σ_{j∈N1} (a_ij + b_ij) − Σ_{j∈N2} (a_ij + b_ij) + Σ_{j∈N3} (|a_ij| + |b_ij|),  i ∈ N2,   (4.2)

u_i(t) < 1 − a_ii − Σ_{j∈N3, j≠i} |a_ij| − Σ_{j∈N3} |b_ij| + Σ_{j∈N1} (a_ij + b_ij) − Σ_{j∈N2} (a_ij + b_ij),  i ∈ N3,   (4.3)

u_i(t) > a_ii − 1 + Σ_{j∈N3, j≠i} |a_ij| + Σ_{j∈N3} |b_ij| + Σ_{j∈N1} (a_ij + b_ij) − Σ_{j∈N2} (a_ij + b_ij),  i ∈ N3,   (4.4)
(4.5)
j∈N3
When i ∈ N1 and xi (t) = −1, from equations 4.1 and 4.5, dxi (t) (a ij + b ij ) + (a ij + b ij ) ≤1 − dt j∈N1 j∈N2 + (|a ij | + |b ij |) + ui (t) < 0.
(4.6)
j∈N3
When i ∈ N2 and xi (t) = 1, from equations 4.2 and 4.5, dxi (t) ≥ −1 − (a ij + b ij ) − (|a ij | + |b ij |) dt j∈N1 j∈N3 (a ij + b ij ) + ui (t) > 0. +
(4.7)
j∈N2
When i ∈ N3 and xi (t) = 1, from equations 4.3 and 4.5, dxi (t) ≤ −1 − (a ij + b ij ) + (a ij + b ij ) + a ii dt j∈N1 j∈N2 + |a ij | + |b ij | + ui (t) < 0. j∈N3 , j=i
j∈N3
(4.8)
Multiperiodicity and Exponential Attractivity of CNNs
857
When i ∈ N3 and xi (t) = −1, from equations 4.4 and 4.5, dxi (t) (a ij + b ij ) + (a ij + b ij ) − a ii ≥ 1− dt j∈N1 j∈N2 − |a ij | − |b ij | + ui (t) > 0. j∈N3 , j=i
(4.9)
j∈N3
Equations 4.6 to 4.9 imply that if ∀s ∈ [t0 − τ, t0 ], φ(s) ∈ D1 , then x(t; t0 , φ) will keep in D1 , and D1 is an invariant set of DCNN (see equation 2.1). So ∀t ≥ t0 − τ, x(t) ∈ D1 . Hence, ∀t ≥ t0 , DCNN, equation 2.1, can be rewritten as dxi (t) = −xi (t) − (a ij + b ij ) + a ij x j (t) + b ij x j (t − τij (t)) dt j∈N1 j∈N3 j∈N3 (a ij + b ij ) + ui (t), i = 1, 2, . . . , n. +
(4.10)
j∈N2
Let x(t; t0 , φ) and x(t; t0 , ϕ) be two states of DCNN (equation 4.10) with initial conditions (t0 , φ) and (t0 , ϕ), where φ, ϕ ∈ C([t0 − τ, t0 ], D1 ). From equation 4.10, ∀i = 1, 2, . . . , n; ∀t ≥ t0 , d(xi (t; t0 , φ) − xi (t; t0 , ϕ)) = −(xi (t; t0 , φ) − xi (t; t0 , ϕ)) dt + (a ij (x j (t; t0 , φ) − x j (t; t0 , ϕ)) j∈N3
+b ij (x j (t − τij (t); t0 , φ) − x j (t − τij (t); t0 , ϕ))). (4.11) Let yi (t) = xi (t; t0 , φ) − xi (t; t0 , ϕ). Then from equation 4.11, for i = 1, . . . , n; ∀t ≥ t0 , dyi (t) = −yi (t) + a ij y j (t) + b ij y j (t − τi j (t)). dt j∈N j∈N 3
(4.12)
3
4.3 and 4.4, for i ∈ N3 , a ii + |b ii | + j∈N3 , j=i (|a ij | + |b ij |) + From equations | j∈N1 (a ij + b ij ) − j∈N2 (a ij + b ij ) − ui (t)| < 1. Hence, there exists ϑ > 0 such that (1 − a ii ) −
j∈N3 , j=i
|a ij | +
|b ij | exp{ϑτ } + ϑ ≥ 0,
i ∈ N3 .
j∈N3
(4.13)
858
Z. Zeng and J. Wang
Consider a subsystem of equation 4.12: dyi (t) = −yi (t) + a ij y j (t) + b ij y j (t − τi j (t)), t ≥ t0 , i ∈ N3 . dt j∈N j∈N 3
3
(4.14)
Denote || y|| ¯ t0 = maxt0 −τ ≤s≤t0 {||y(s)||}. Then for i ∈ N3 , ∀t ≥ t0 , |yi (t)| ≤ || y|| ¯ t0 exp{−ϑ(t − t0 )}. Otherwise, one of the following two cases holds: Case i. There exist t2 > t1 ≥ t0 , k ∈ N3 , sufficiently small ε1 > 0 such that yk (t1 ) − ¯ t0 exp{−ϑ(t2 − t0 )} = ε1 , and when s ∈ || y|| ¯ t0 exp{−ϑ(t1 − t0 )} = 0, yk (t2 ) − || y|| ¯ t0 exp{−ϑ(s − t0 )} ≤ ε1 , and [t0 − τ, t2 ], for all i ∈ N3 , |yi (s)| − || y|| dyk (t) ¯ t0 exp{−ϑ(t1 − t0 )} ≥ 0, |t=t1 + ϑ|| y|| dt dyk (t) |t=t2 + ϑ|| y|| ¯ t0 exp{−ϑ(t2 − t0 )} > 0. dt
(4.15)
Case ii. There exist t4 > t3 ≥ t0 , j ∈ N3 , sufficiently small ε2 > 0 such that ¯ t0 exp{−ϑ(t3 − t0 )} = 0, yk (t4 ) + || y|| ¯ t0 exp{−ϑ(t4 − t0 )} = −ε2 , and y j (t3 ) + || y|| ¯ t0 exp{−ϑ(s − t0 )} ≥ −ε2 , and when s ∈ [t0 − τ, t4 ], for all i ∈ N3 , |yi (s)| − || y|| dy j (t) dt dy j (t) dt
|t=t3 − ϑ|| y|| ¯ t0 exp{−ϑ(t3 − t0 )} ≤ 0, |t=t4 − ϑ|| y|| ¯ t0 exp{−ϑ(t4 − t0 )} < 0.
(4.16)
It follows from equations 4.13 and 4.14 that for k ∈ N3 , dyk (t) |t=t2 = −yk (t2 ) + (a k j y j (t2 ) + b k j y j (t2 − τk j (t2 ))) dt j∈N 3
|a k j | ≤ || y|| ¯ t0 exp{−ϑ(t2 − t0 )} − 1 + a kk + +
j∈N3 , j=k
|b k j | exp{ϑτ } + ϑ − ϑ|| y|| ¯ t0 exp{−ϑ(t2 − t0 )}
j∈N3
|a k j | + |b k j | +ε1 − 1 + a kk + j∈N3 , j=k
≤ −ϑ|| y|| ¯ t0 exp{−ϑ(t2 − t0 )}.
j∈N3
Multiperiodicity and Exponential Attractivity of CNNs
859
This contradicts equation 4.15. Similarly, it follows from equations 4.13 and 4.14 that dy j (t) dt
¯ t0 exp{−ϑ(t4 − t0 )}. |t=t4 ≥ ϑ|| y||
This contradicts equation 4.16. The two contradictions show that for i ∈ N3 , ∀t ≥ t0 , ¯ t0 exp{−ϑ(t − t0 )}. |yi (t)| ≤ || y|| Hence, according to lemma 2, there exists ϑ¯ > 0 such that ∀i = 1, 2, . . . , n, ∀t ≥ t0 , ¯ − t0 )}. |xi (t; t0 , φ) − xi (t; t0 , ϕ)| ≤ ||φ, ϕ||t0 exp{−ϑ(t
(4.17)
(t)
Define xφ (θ ) = x(t + θ ; t0 , φ), θ ∈ [t0 − τ, t0 ]. From equations 4.6 to 4.9, if (t) ¯ : C([t0 − φ ∈ C([t0 − τ, t0 ], D1 ), then xφ ∈ C([t0 − τ, t0 ], D1 ). Define a mapping H (ω) ¯ ¯ τ, t0 ], D1 ) → C([t0 − τ, t0 ], D1 ) by H(φ) = xφ , then H(C([t0 − τ, t0 ], D1 )) ⊂ (mω)
¯ m (φ) = xφ . C([t0 − τ, t0 ], D1 ), and H Similar to the proof of theorem 1, there exists a periodic orbit x(t; t0 , φ ∗ ) of DCNN, equation 2.1, with period ω such that ∀t ≥ t0 , x(t; t0 , φ ∗ ) ∈ D1 and all other states of DCNN, equation 2.1, with initial condition (t, φ) (φ ∈ C([t0 − τ, t0 ], D1 )) converge to this periodic orbit exponentially as t → +∞. Hence, the isolated periodic orbit x(t; t0 , φ ∗ ) located in D1 is locally exponentially attractive, and D1 is a locally exponentially attractive set of x(t; t0 , φ ∗ ). Remark 3. From equations 4.1 to 4.4, we can see that the input vector u(t) can control the locality of a limit cycle that represents a memory pattern in a designated region. Specifically, when condition 4.1 holds, the part in corresponding coordinate of the limit cycle is located in (1, +∞); when condition 4.2 holds, the part in corresponding coordinate of the limit cycle is located (−∞, −1); when conditions 4.3 and 4.4 hold, the part in corresponding coordinate of the limit cycle is located [−1, 1]. When N3 is empty, we have the following corollary: Corollary 2. Let N1 ∪ N2 = {1, 2, . . . , n}, and N1 ∩ N2 be empty. If ∀t ≥ t0 , ui (t) <
(a ij + b ij ) −
j∈N1
ui (t) >
j∈N1
(a ij + b ij ) − 1,
i ∈ N1 ,
(4.18)
i ∈ N2 ,
(4.19)
j∈N2
(a ij + b ij ) −
j∈N2
(a ij + b ij ) + 1,
860
Z. Zeng and J. Wang
then DCNN, equation 2.1, has exactly one limit cycle located in D2 , and such a limit cycle is locally exponentially attractive. Proof. Let N3 in theorem 2 be an empty set. According to theorem 2, corollary 2 holds. When the periodic external input u(t) degenerates into a constant, we have the following corollary. Corollary 3. If ∀t ≥ t0 , ui (t) ≡ ui (constant), and ui < −1 +
(a ij + b ij ) −
j∈N1
ui > 1 +
ui < 1 − a ii −
j∈N2
|a ij | −
j∈N3 , j=i
ui > a ii − 1 +
(a ij + b ij ) −
j∈N2
(a ij + b ij ) −
j∈N1
j∈N3 , j=i
(|a ij | + |b ij |),
j∈N3
|b ij | +
j∈N3
|a ij | +
(|a ij | + |b ij |),
i ∈ N1 ,
j∈N3
(a ij + b ij ) +
(a ij + b ij ) −
j∈N1
|b ij | +
j∈N3
i ∈ N2 , (a ij + b ij ),
i ∈ N3 ,
(a ij + b ij ),
i ∈ N3 ,
j∈N2
(a ij + b ij ) −
j∈N1
j∈N2
and ∀i ∈ {1, 2, . . . , n}, j ∈ N3, τ_ij(t) ≡ τ_ij (constant), then DCNN, equation 2.1, has only one equilibrium point located in D1, which is locally exponentially stable.

Proof. Since u_i(t) ≡ u_i (constant), for an arbitrary constant ν ∈ ℝ, u_i(t + ν) ≡ u_i ≡ u_i(t). According to theorem 2, DCNN, equation 2.1, has only one limit cycle located in D1, which is locally exponentially attractive in D1. The arbitrariness of the constant ν implies that such a limit cycle is a fixed point. Hence, DCNN, equation 2.1, has only one equilibrium point located in D1, which is locally exponentially stable.
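Like condition 3.1, the constant-input conditions of corollary 3 can be checked mechanically. A sketch follows; the function name, the index-set encoding, and the example values are hypothetical illustrations, not from the text:

```python
def corollary3_ok(A, B, u, N1, N2, N3):
    """Check the constant-input conditions of corollary 3.

    A, B: n x n weight matrices; u: constant input vector; N1, N2, N3:
    disjoint index sets covering range(n). Returns True when every
    inequality holds, placing the unique equilibrium in the region D1.
    """
    def s_ab(i, S):    # sum of a_ij + b_ij over j in S
        return sum(A[i][j] + B[i][j] for j in S)
    def s_abs(i, S):   # sum of |a_ij| + |b_ij| over j in S
        return sum(abs(A[i][j]) + abs(B[i][j]) for j in S)
    ok = True
    for i in N1:
        ok = ok and u[i] < -1 + s_ab(i, N1) - s_ab(i, N2) - s_abs(i, N3)
    for i in N2:
        ok = ok and u[i] > 1 + s_ab(i, N1) - s_ab(i, N2) + s_abs(i, N3)
    for i in N3:
        base = s_ab(i, N1) - s_ab(i, N2)
        hi = 1 - A[i][i] - sum(abs(A[i][j]) for j in N3 if j != i) \
             - sum(abs(B[i][j]) for j in N3) + base
        lo = A[i][i] - 1 + sum(abs(A[i][j]) for j in N3 if j != i) \
             + sum(abs(B[i][j]) for j in N3) + base
        ok = ok and lo < u[i] < hi
    return ok
```

For a 2-neuron network with N1 = {0}, N2 = {1}, and N3 empty, the check reduces to the two saturation-region inequalities of corollary 2 with constant inputs.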
5 Globally Exponentially Attractive Periodicity in a Designated Region In order to obtain optimal spatiotemporal coding in the periodic orbit and reduce computational time, it is desirable for a neural network to be globally exponentially attractive to periodic orbit in a designated region. In this section, we give some conditions that allow a periodic orbit to be globally exponentially attractive and to be located in any designated region. Theorem 3. If ∀t ≥ t0 , ui (t) < −1 −
n j=1
(|a ij | + |b ij |),
i ∈ N1 ,
(5.1)
Multiperiodicity and Exponential Attractivity of CNNs
u_i(t) > 1 + Σ_{j=1}^{n} (|a_ij| + |b_ij|),   i ∈ N2,   (5.2)

|u_i(t)| < 1 − a_ii − Σ_{j=1, j≠i}^{n} |a_ij| − Σ_{j=1}^{n} |b_ij|,   i ∈ N3,   (5.3)
and ∀i ∈ {1, 2, . . . , n}, j ∈ N3, τ_ij(t) = τ_ij(t + ω), then DCNN, equation 2.1, has a unique limit cycle located in D1, and such a limit cycle is globally exponentially attractive.

Proof. When i ∈ N1 and x_i(t) ≥ −1, from equations 2.1 and 5.1,

dx_i(t)/dt ≤ 1 + Σ_{j=1}^{n} (|a_ij| + |b_ij|) + u_i(t) < 0.   (5.4)

When i ∈ N2 and x_i(t) ≤ 1, from equations 2.1 and 5.2,

dx_i(t)/dt ≥ −1 − Σ_{j=1}^{n} (|a_ij| + |b_ij|) + u_i(t) > 0.   (5.5)

When i ∈ N3 and x_i(t) ≤ −1, from equations 2.1 and 5.3,

dx_i(t)/dt ≥ 1 − a_ii − Σ_{j=1, j≠i}^{n} |a_ij| − Σ_{j=1}^{n} |b_ij| + u_i(t) > 0.   (5.6)

When i ∈ N3 and x_i(t) ≥ 1, from equations 2.1 and 5.3,

dx_i(t)/dt ≤ −1 + a_ii + Σ_{j=1, j≠i}^{n} |a_ij| + Σ_{j=1}^{n} |b_ij| + u_i(t) < 0.   (5.7)

Equations 5.4 to 5.7 imply that x(t; t0, φ) will enter and remain in D1, where φ ∈ C([t0 − τ, t0], ℝ^n). So there exists T > 0 such that ∀t ≥ T, x(t) ∈ D1. Hence, ∀t ≥ T + τ, DCNN, equation 2.1, can be rewritten as

dx_i(t)/dt = −x_i(t) − Σ_{j∈N1} (a_ij + b_ij) + Σ_{j∈N3} a_ij x_j(t) + Σ_{j∈N3} b_ij x_j(t − τ_ij(t)) + Σ_{j∈N2} (a_ij + b_ij) + u_i(t),   i = 1, 2, . . . , n.
Similar to the proof of theorem 2, DCNN, equation 2.1, has a unique limit cycle located in D1, and such a limit cycle is globally exponentially attractive.

Remark 4. By comparison, we can see that if conditions 5.1 to 5.3 hold, then conditions 4.1 to 4.4 also hold, but not vice versa, as will be shown in examples 2 and 3. In other words, the conditions in theorem 3 are stronger than those in theorem 2. When N1 ∪ N2 is empty, we have the following corollary:

Corollary 4. If ∀i, j ∈ {1, 2, . . . , n}, τ_ij(t) = τ_ij(t + ω), and ∀t ≥ t0,

|u_i(t)| < 1 − a_ii − Σ_{j=1, j≠i}^{n} |a_ij| − Σ_{j=1}^{n} |b_ij|,
then the DCNN, equation 2.1, has a unique limit cycle located in [−1, 1]^n, which is globally exponentially attractive.

Proof. Choose N3 = {1, 2, . . . , n} in theorem 3. According to theorem 3, the corollary holds.

When N3 is empty, we have the following corollary:

Corollary 5. Let N3 be empty. If

u_i(t) < −1 − Σ_{j=1}^{n} (|a_ij| + |b_ij|),   i ∈ N1,   (5.8)

u_i(t) > 1 + Σ_{j=1}^{n} (|a_ij| + |b_ij|),   i ∈ N2,   (5.9)
then DCNN, equation 2.1, has a unique limit cycle located in D2. Moreover, such a limit cycle is globally exponentially attractive.

Proof. Since N3 is empty, according to theorem 3, corollary 5 holds.

Remark 5. Since −1 − Σ_{j=1}^{n} (|a_ij| + |b_ij|) ≤ −1 + Σ_{j∈N1} (a_ij + b_ij) − Σ_{j∈N2} (a_ij + b_ij), if condition 5.8 holds, then condition 4.18 also holds, but not vice versa. Similarly, if condition 5.9 holds, then condition 4.19 also holds, but not vice versa. This implies that the conditions in corollary 5 are stronger than those in corollary 2. In addition, corollary 5 shows that a DCNN has a globally exponentially attractive limit cycle if its periodic external stimulus is sufficiently strong.
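The comparison invoked in remark 5 is just the triangle inequality applied termwise; with N3 empty (so N1 ∪ N2 = {1, . . . , n}), in LaTeX:

```latex
% For j \in N_1:  a_{ij}+b_{ij} \ge -(|a_{ij}|+|b_{ij}|), and
% for j \in N_2:  -(a_{ij}+b_{ij}) \ge -(|a_{ij}|+|b_{ij}|), hence
-1-\sum_{j=1}^{n}\bigl(|a_{ij}|+|b_{ij}|\bigr)
\;\le\;
-1+\sum_{j\in N_1}\bigl(a_{ij}+b_{ij}\bigr)-\sum_{j\in N_2}\bigl(a_{ij}+b_{ij}\bigr).
```

So the lower bound required by condition 5.8 lies at or below the one in condition 4.18, which is why 5.8 implies 4.18 but not conversely.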
When the periodic external input u(t) degenerates into a constant vector, we have the following corollary:

Corollary 6. If ∀t ≥ t0, u_i(t) ≡ u_i (constant), and

u_i < −1 − Σ_{j=1}^{n} (|a_ij| + |b_ij|),   i ∈ N1,

u_i > 1 + Σ_{j=1}^{n} (|a_ij| + |b_ij|),   i ∈ N2,

|u_i| < 1 − a_ii − Σ_{j=1, j≠i}^{n} |a_ij| − Σ_{j=1}^{n} |b_ij|,   i ∈ N3,

and ∀i ∈ {1, 2, . . . , n}, j ∈ N3, τ_ij(t) ≡ τ_ij (constant), then DCNN, equation 2.1, has a unique equilibrium point located in D1, and it is globally exponentially stable at this equilibrium point.

Proof. Since u_i(t) ≡ u_i (constant), for an arbitrary constant ν ∈ ℝ, u_i(t + ν) ≡ u_i(t) ≡ u_i. According to theorem 3, DCNN, equation 2.1, has a unique limit cycle located in D1, which is globally exponentially attractive. The arbitrariness of the constant ν implies that such a limit cycle is a fixed point. Hence, DCNN, equation 2.1, has a unique equilibrium point located in D1, and such an equilibrium point is globally exponentially stable.
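As a quick numerical illustration of the constant-input case, the sketch below simulates a hypothetical two-neuron DCNN whose parameters satisfy the corollary 6 conditions with N3 = {1, 2} (all numerical values here are our own choices, not taken from the text): trajectories started from different constant histories collapse onto the single equilibrium (I − A − B)⁻¹u inside [−1, 1]².

```python
import numpy as np

# Hypothetical 2-neuron DCNN (parameters chosen by us) satisfying the
# corollary 6 conditions with N3 = {1, 2}:
# |u_i| < 1 - a_ii - sum_{j != i} |a_ij| - sum_j |b_ij| = 0.6.
A = np.array([[0.2, 0.1], [0.1, 0.2]])
B = np.array([[0.1, 0.0], [0.0, 0.1]])       # delayed feedback weights
u = np.array([0.3, -0.2])                    # constant external input
tau, dt, T = 1.0, 0.01, 60.0
d = round(tau / dt)                          # delay expressed in Euler steps

def f(x):                                    # piecewise-linear CNN output
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))

def run(x0):
    hist = [np.array(x0, dtype=float)] * (d + 1)   # constant initial history
    for _ in range(int(T / dt)):
        x, xd = hist[-1], hist[-d - 1]             # current and delayed state
        hist.append(x + dt * (-x + A @ f(x) + B @ f(xd) + u))
    return hist[-1]

xa = run([2.0, -2.0])
xb = run([-1.5, 1.5])
xstar = np.linalg.solve(np.eye(2) - A - B, u)  # equilibrium in |x_i| <= 1
```

Both runs end within numerical tolerance of xstar, consistent with a unique, globally exponentially stable equilibrium in [−1, 1]².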
6 Illustrative Examples

In this section, we give three numerical examples to illustrate the new results.

6.1 Example 1. Consider a CNN where A is a 3 × 3 feedback matrix with dominant diagonal entries 2, 2.4, and 2.4, and

u(t) = (0.5 sin(t), −0.6 cos(t), −0.2(sin(t) + cos(t)))ᵀ.
According to theorem 1, this CNN has 2³ = 8 limit cycles, which are locally exponentially attractive. Simulation results with 136 random initial states are depicted in Figures 1 to 3.
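The printed feedback matrix of example 1 did not survive in our copy, so the sketch below substitutes an assumed diagonally dominant 3 × 3 matrix of the same flavor (only u(t) is as printed). It checks the qualitative claim of theorem 1: each of the 2³ sign patterns of the saturation region traps a trajectory, giving eight locally attractive limit cycles.

```python
import itertools
import numpy as np

# A is an assumed stand-in (NOT the printed matrix of example 1):
# diagonally dominant, so every orthant of the saturation region is invariant.
A = np.array([[2.0, 0.2, 0.2],
              [0.2, 2.4, 0.4],
              [0.6, 0.2, 2.4]])

def u(t):                                    # periodic input as printed
    return np.array([0.5 * np.sin(t),
                     -0.6 * np.cos(t),
                     -0.2 * (np.sin(t) + np.cos(t))])

def f(x):                                    # piecewise-linear CNN output
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))

def run(x0, dt=0.01, T=30.0):                # forward Euler integration
    x, t = np.array(x0, dtype=float), 0.0
    for _ in range(int(T / dt)):
        x, t = x + dt * (-x + A @ f(x) + u(t)), t + dt
    return x

# Start once in each of the 8 saturation orthants and record where we end up.
finals = [run(2.0 * np.array(s))
          for s in itertools.product([-1.0, 1.0], repeat=3)]
patterns = {tuple(np.sign(xf).astype(int)) for xf in finals}
```

All eight sign patterns survive and every trajectory stays saturated (|x_i| > 1), i.e., one attractive periodic orbit per orthant.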
Figure 1: Transient behavior of x1 , x2 , x3 in Example 1.
Figure 2: Transient behavior of (x1 , x3 ) in Example 1.
Figure 3: Transient behavior of (x1 , x2 , x3 ) in Example 1.
6.2 Example 2. Consider a CNN, where

A = ( 2  0.2  0.2 ; 0.5  0.4  0.5 ; −1.4  0.4  0.8 ),   u(t) = (0.5 sin(t), 0.5 cos(t), 0.5(sin(t) + cos(t)))ᵀ.
Choose N1 = {1}, N2 = {3}, N3 = {2}. Since u1(t) < −1 + a11 − a13 − |a12|, a22 + |a21 − a23 − u2(t)| < 1, and u3(t) > 1 + a31 − a33 + |a32|, according to theorem 2, this CNN has a limit cycle located in D1 = {x | x1 < −1, |x2| ≤ 1, x3 > 1}, which is locally exponentially attractive in D1. Choose instead N1 = {3}, N2 = {1}, N3 = {2}. Since u1(t) > 1 + a13 − a11 + |a12|, a22 + |a21 − a23 − u2(t)| < 1, and u3(t) < −1 + a33 − a31 − |a32|, according to theorem 2, this CNN has a limit cycle located in D1′ = {x | x3 < −1, |x2| ≤ 1, x1 > 1}, which is locally exponentially attractive in D1′. However, since 0.5 sin(t) < −1 − (2 + 0.2 + 0.2) = −3.4 does not hold (i.e., condition 5.1 does not hold), this CNN does not satisfy the conditions in theorem 3. Simulation results with 136 random initial states are depicted in Figures 4 and 5.
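The first choice above can be probed directly. The sketch below uses our reading of the garbled printed matrix, A = (2, 0.2, 0.2; 0.5, 0.4, 0.5; −1.4, 0.4, 0.8) — treat these entries as an assumption, not as authoritative: a trajectory started inside D1 = {x | x1 < −1, |x2| ≤ 1, x3 > 1} never leaves it.

```python
import numpy as np

# Our reading of the (garbled) printed matrix of example 2 -- an assumption.
A = np.array([[ 2.0, 0.2, 0.2],
              [ 0.5, 0.4, 0.5],
              [-1.4, 0.4, 0.8]])

def u(t):                                    # periodic input as printed
    return 0.5 * np.array([np.sin(t), np.cos(t), np.sin(t) + np.cos(t)])

def f(x):                                    # piecewise-linear CNN output
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))

dt = 0.01
x, t = np.array([-2.0, 0.0, 2.0]), 0.0       # initial state inside D1
stayed_in_D1 = []
for _ in range(int(30.0 / dt)):              # forward Euler integration
    x, t = x + dt * (-x + A @ f(x) + u(t)), t + dt
    stayed_in_D1.append(x[0] < -1.0 and abs(x[1]) <= 1.0 and x[2] > 1.0)
```

Membership in D1 holds at every integration step, illustrating local exponential attractivity within D1.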
Figure 4: Transient behavior of x1 , x2 , x3 in Example 2.
Figure 5: Transient behavior of (x1 , x2 , x3 ) in Example 2.
Figure 6: Transient behavior of x1 , x2 , x3 in Example 3.
6.3 Example 3. Consider a CNN, where

A = ( 0.2  0.2  0.2 ; 0.2  −2  0.6 ; 0.2  0.2  0.2 ),   u(t) = (0.8 sin(t) − 2.6, 0.8 cos(t), 0.8(sin(t) + cos(t)) + 2.8)ᵀ.
Choose N1 = {1}, N2 = {3}, N3 = {2}. Since u1(t) < −1 − Σ_{j=1}^{3} |a_1j|, |u2(t)| < 1 − a22 − Σ_{j=1, j≠2}^{3} |a_2j|, and u3(t) > 1 + Σ_{j=1}^{3} |a_3j|, according to theorem 3, this CNN has a limit cycle located in D1 = {x | x1 < −1, |x2| ≤ 1, x3 > 1}, which is globally exponentially attractive. Since u1(t) < −1 + a11 − |a12| − a13; u2(t) < 1 − a22 + a21 − a23 and u2(t) > −1 + a22 + a21 − a23; and u3(t) > 1 + a31 + |a32| − a33, conditions 4.1 to 4.4 also hold. According to theorem 2, this CNN has a limit cycle located in D1, which is also locally exponentially attractive. However, since a11 > 0 and a33 > 0, the conditions in Yi et al. (2003) cannot be used to ascertain the complete stability of this CNN. Simulation results with 136 random initial states are depicted in Figures 6 and 7.
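Global attractivity can also be probed numerically. The sketch below uses our reading of the garbled printed matrix (all entries 0.2 except a22 = −2 and one 0.6 entry, whose placement at a23 is an assumption): two trajectories started far apart collapse onto the same orbit inside D1.

```python
import numpy as np

# Our reading of the (garbled) printed matrix of example 3; the position
# of the 0.6 entry is an assumption.
A = np.array([[0.2,  0.2, 0.2],
              [0.2, -2.0, 0.6],
              [0.2,  0.2, 0.2]])

def u(t):                                    # strong periodic input as printed
    return np.array([0.8 * np.sin(t) - 2.6,
                     0.8 * np.cos(t),
                     0.8 * (np.sin(t) + np.cos(t)) + 2.8])

def f(x):                                    # piecewise-linear CNN output
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))

def run(x0, dt=0.01, T=30.0):                # forward Euler integration
    x, t = np.array(x0, dtype=float), 0.0
    for _ in range(int(T / dt)):
        x, t = x + dt * (-x + A @ f(x) + u(t)), t + dt
    return x

xa = run([5.0, 5.0, 5.0])                    # two widely separated starts
xb = run([-5.0, -5.0, -5.0])
```

By t = 30 the two trajectories agree to within numerical tolerance and both sit in D1, as theorem 3 predicts.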
Figure 7: Transient behavior of (x1 , x2 , x3 ) in Example 3.
7 Concluding Remarks

Rhythmicity is one of the most striking manifestations of dynamic behavior in biological systems. CNNs and DCNNs, which have been shown to be capable of operating in a pacemaker or pattern-generator mode, are studied here as oscillatory mechanisms responding to periodic external stimuli. Information can be encoded in the oscillating activation states relative to the external inputs, and these relative phases change as a function of the chosen limit cycle. In this article, we show that the number of locally exponentially attractive periodic orbits located in the saturation regions of a DCNN grows exponentially with the number of neurons. Because neural information often must be encoded in a designated region, we also give conditions under which a globally exponentially attractive periodic orbit is located in any designated region. The theoretical results are supplemented by simulation results in three illustrative examples.
Acknowledgments This work was supported by the Hong Kong Research Grants Council under grant CUHK4165/03E, the Natural Science Foundation of China
under grant 60405002, and the China Postdoctoral Science Foundation under grant 2004035579.
References Belair, J., Campbell, S., & Driessche, P. (1996). Frustration, stability and delay-induced oscillation in a neural network model. SIAM J. Applied Math., 56, 245–255. Berns, D. W., Moiola, J. L., & Chen, G. (1998). Predicting period-doubling bifurcations and multiple oscillations in nonlinear time-delayed feedback systems. IEEE Trans. Circuits Syst. I, 45(7), 759–763. Chen, K., Wang, D. L., & Liu, X. (2000). Weight adaptation and oscillatory correlation for image segmentation. IEEE Trans. Neural Networks, 11, 1106–1123. Chua, L. O., & Roska, T. (1990). Stability of a class of nonreciprocal cellular neural networks. IEEE Trans. Circuits Syst., 37, 1520–1527. Chua, L. O., & Roska, T. (1992). Cellular neural networks with nonlinear and delaytype template elements and non-uniform grids. Int. J. Circuit Theory and Applicat., 20, 449–451. Civalleri, P. P., Gilli, M., & Pandolfi, L. (1993). On stability of cellular neural networks with delay. IEEE Trans. Circuits Syst. I, 40, 157–164. Gopalsamy, K., & Leung, I. (1996). Delay induced periodicity in a neural netlet of excitation and inhibition. Physica D, 89, 395–426. Jiang, H., & Teng, Z. (2004). Boundedness and stability for nonautonomous bidirectional associative neural networks with delay. IEEE Trans. Circuits Syst. II, 51(4), 174–180. Jin, H. L., & Zacksenhouse, M. (2003). Oscillatory neural networks for robotic yo-yo control. IEEE Trans. Neural Networks, 14(2), 317–325. Kanamaru, T., & Sekine, M. (2004). An analysis of globally connected active rotators with excitatory and inhibitory connections having different time constants using the nonlinear Fokker-Planck equations. IEEE Trans. Neural Networks, 15(5), 1009– 1017. Kosaku, Y. (1978). Functional analysis. Berlin: Springer-Verlag. Liao, X. X., & Wang, J. (2003). Algebraic criteria for global exponential stability of cellular neural networks with multiple time delays. IEEE Trans. Circuits and Syst. I, 50(2), 268–275. Liao, X., Wu, Z., & Yu, J. (1999). 
Stability switches and bifurcation analysis of a neural network with continuous delay. IEEE Trans. Systems, Man and Cybernetics, Part A, 29(6), 692–696. Liu, Z., Chen, A., Cao, J., & Huang, L. (2003). Existence and global exponential stability of periodic solution for BAM neural networks with periodic coefficients and time-varying delays. IEEE Trans. Circuits Syst. I, 50(9), 1162–1173. Nishikawa, T., Lai, Y. C., & Hoppensteadt, F. C. (2004). Capacity of oscillatory associative-memory networks with error-free retrieval. Physical Review Letters, 92(10), 108101. Roska, T., Wu, C. W., Balsi, M., & Chua, L. O. (1992). Stability and dynamics of delay-type general and cellular neural networks. IEEE Trans. Circuits Syst. I, 39, 487–490.
Roska, T., Wu, C. W., & Chua, L. O. (1993). Stability of cellular neural networks with dominant nonlinear and delay-type templates. IEEE Trans. Circuits Syst. I, 40, 270–272. Ruiz, A., Owens, D. H., & Townley, S. (1998). Existence, learning, and replication of periodic motion in recurrent neural networks. IEEE Trans. Neural Networks, 9(4), 651–661. Setti, G., Thiran, P., & Serpico, C. (1998). An approach to information propagation in 1-D cellular neural networks, part II: Global propagation. IEEE Trans. Circuits and Syst. I, 45(8), 790–811. Takahashi, N. (2000). A new sufficient condition for complete stability of cellular neural networks with delay. IEEE Trans. Circuits Syst. I, 47, 793–799. Townley, S., Ilchmann, A., Weiss, M. G., McClements, W., Ruiz, A. C., Owens, D. H., & Pratzel-Wolters, D. (2000). Existence and learning of oscillations in recurrent neural networks. IEEE Trans. Neural Networks, 11, 205–214. Wang, D. L. (1995). Emergent synchrony in locally coupled neural oscillators. IEEE Trans. Neural Networks, 6, 941–948. Wang, L., & Zou, X. (2004). Capacity of stable periodic solutions in discrete-time bidirectional associative memory neural networks. IEEE Trans. Circuits Syst. II, 51(6), 315–319. Yi, Z., Heng, P. A., & Vadakkepat, P. (2002). Absolute periodicity and absolute stability of delayed neural networks. IEEE Trans. Circuits Syst. I, 49(2), 256–261. Yi, Z., Tan, K. K., & Lee, T. H. (2003). Multistability analysis for recurrent neural networks with unsaturating piecewise linear transfer functions. Neural Computation, 15(3), 639–662. Zeng, Z., Wang, J., & Liao, X. X. (2003). Global exponential stability of a general class of recurrent neural networks with time-varying delays. IEEE Trans. Circuits and Syst. I, 50(10), 1353–1358. Zeng, Z., Wang, J., & Liao, X. X. (2004). Stability analysis of delayed cellular neural networks described using cloning templates. IEEE Trans. Circuits and Syst. I, 51(11), 2313–2324.
Received November 30, 2004; accepted June 28, 2005.
LETTER

Communicated by Ennio Mingolla

Smooth Gradient Representations as a Unifying Account of Chevreul's Illusion, Mach Bands, and a Variant of the Ehrenstein Disk

Matthias S. Keil
[email protected]
Instituto de Microelectrónica de Sevilla, Centro Nacional de Microelectrónica, E-41012 Seville, Spain
Recent evidence suggests that the primate visual system generates representations for object surfaces (where we consider representations for the surface attribute brightness). Object recognition can be expected to perform robustly if those representations are invariant despite environmental changes (e.g., in illumination). In real-world scenes, however, surfaces are often overlaid by luminance gradients, which we define as smooth variations in intensity. Luminance gradients encode highly variable information, which may represent surface properties (curvature), nonsurface properties (e.g., specular highlights, cast shadows, illumination inhomogeneities), or information about depth relationships (cast shadows, blur). We argue, on grounds of the unpredictable nature of luminance gradients, that the visual system should establish corresponding representations, in addition to surface representations. We accordingly present a neuronal architecture, the so-called gradient system, which clarifies how spatially accurate gradient representations can be obtained by relying only on high-resolution retinal responses. Although the gradient system was designed and optimized for segregating, and generating, representations of luminance gradients in real-world luminance images, it is capable of quantitatively predicting psychophysical data on both Mach bands and Chevreul's illusion. It furthermore accounts qualitatively for a modified Ehrenstein disk.

1 Introduction

Reflectance, or albedo, is a physical property of surface materials that measures how much of the incident light is reflected from a surface. In real-world environments, reflectance is often composed of a diffusive or Lambertian component (light is scattered isotropically in all directions) and a specular component (light is scattered anisotropically in a limited subset of directions).
Recent psychophysical data show that the visual system can suppress the specular component when judging the apparent color of a surface (Todd, Norman, & Mingolla, 2004).

Neural Computation 18, 871–903 (2006)
© 2006 Massachusetts Institute of Technology

M. Keil

Specifically, it was found that
achromatic color or lightness is almost entirely determined by the diffusive component of surface reflectance. On the other hand, there is now evidence that the visual system generates explicit representations for object surfaces (e.g., Komatsu, Murakami, & Kinoshita, 1996; Rossi, Rittenhouse, & Paradiso, 1996; MacEvoy, Kim, & Paradiso, 1998; Bakin, Nakayama, & Gilbert, 2000; Castelo-Branco, Goebel, Neuenschwander, & Singer, 2000; Kinoshita & Komatsu, 2001; Sasaki & Watanabe, 2004). The neuronal activity associated with these representations is thought to encode perceived surface attributes, such as color, motion, lightness, or depth (e.g., Paradiso & Nakayama, 1991; Rossi & Paradiso, 1996; Paradiso & Hahn, 1996; Davey, Maddess, & Srinivasan, 1998; Nishina, Okada, & Kawato, 2003; Sasaki & Watanabe, 2004). Given that lightness constancy is preserved in the presence of specular highlights, the visual system must somehow identify and discount the specular component of surface reflectance in the (neuronal) representations of lightness or brightness, respectively. Lightness constancy means that the perception of achromatic surface reflectance remains the same despite changes in illumination. Notice that changes in illumination may be caused not only by changes of the illumination source per se, such as changes in intensity or spectral composition, but also by, for example, self-motion of the organism, or the presence of different objects near the object under consideration (Bloj, Kersten, & Hurlbert, 1999). Lightness constancy implies that the activity of a corresponding neuronal representation remains approximately constant in spite of changes in illumination, an effect that is experimentally observed as early as in V1 (MacEvoy & Paradiso, 2001; Rossi & Paradiso, 1999; Rossi et al., 1996). Since specular highlights are viewpoint and illumination dependent, lightness constancy implies furthermore that highlights should be discounted from lightness representations.
As is obvious from introspection, however, highlights are not discounted perceptually, and one may argue accordingly that the different components of surface reflectance (diffusive and specular) are segregated and subsequently encoded by separated representations. This hypothesis is supported by two observations. First, the location of highlights on an object's surfaces varies with viewing direction. When trying to recover an object's 3D shape from disparity cues (“shape from stereo”), the left and the right retina see the highlights at different positions, and thus binocular disparity should provide contradicting information about “shape from stereo.” However, it was demonstrated that perceptual highlights actually enhance the appearance of stereoscopic depth instead of impairing it (Todd, Norman, Koenderink, & Kappers, 1997). Second, when an object is observed in motion (shape from motion), specular highlights do not stay at fixed positions on the object's surfaces. Rather, they are subjected to deformation. Even so, human observers show no apparent difficulty in interpreting them (Norman & Todd, 1994). Segregating the constant components of surface reflectance from its variable components and encoding these components by different representations would have the advantage that downstream mechanisms
Early Gradient Representations
for object recognition can select, at each instant, the most reliable cue for determining an object's 3D shape in order to ensure steadily robust performance for object recognition. Specular highlights are an example of a smooth gradation in surface luminance. Such gradations are often referred to as shading. Shading can encode important information about 3D surface structure (curvature; e.g., Todd & Mingolla, 1983; Mingolla & Todd, 1986; Ramachandran, 1988; Todd & Reichel, 1989; Knill & Kersten, 1991). For instance, a golf ball can be distinguished visually from a table tennis ball by the smooth intensity variations occurring at its dimples. However, shading is not an unambiguous cue for deriving an object's 3D shape, since shading(-like) effects may be generated by various sources, such as the spectral composition of illumination, local surface reflectance, local surface orientation, penumbras of cast shadows, attenuation of light with distance, interreflections among different surfaces, and observer position (Tittle & Todd, 1995; Todd, 2003; Todd et al., 2004). Shading effects represent, from a more general point of view, a subset of smooth luminance gradients. Here, we define luminance gradients as smooth variations in intensity,1 which includes optical effects such as focal blur (due to the limited depth of field of the retinal optics; see Elder & Zucker, 1998). Local image blur provides important information about the arrangement of objects in depth (depth from blur; e.g., Deschênes, Ziou, & Fuchs, 2004). In general, smooth gradients in luminance can be generated by spatial variations in reflectance of a surface that is homogeneously illuminated, or by gradual variations in illumination of a surface with constant reflectance (Paradiso, 2000).
Obviously the information conveyed by luminance gradients may subserve different purposes, such as recovering the 3D shape of objects, estimating surface reflectance, depth estimation from blur, or depth estimation from shadows (Kersten, Mamassian, & Knill, 1997; Mamassian, Knill, & Kersten, 1998). It is important to recognize that smooth gradients can encode surface properties (e.g., curvature) or not (e.g., shadows). This means that the visual system cannot always bind smooth gradients to surfaces in a static fashion or that at some point in the visual system, smooth gradients must be segregated from surface representations. Given that lightness constancy is already observed in V1 (MacEvoy & Paradiso, 2001; Rossi & Paradiso, 1999; Rossi et al., 1996) and smooth gradients are not lost to perception, we propose that they should be explicitly represented in the visual system, in parallel with, and as early as, corresponding surface representations. Therefore, smooth luminance gradients could be unspecifically suppressed in V1 surface representations, without losing corresponding information. In this way, gradients could be recruited or ignored, respectively,
1 This is to say that the relevant perceptual quantity considered here is the degree of smoothness of a luminance variation. We disregard the direction of such variations.
by downstream mechanisms for object recognition, dependent on whether they aid in disambiguating the recognition process. Although mechanisms were proposed for generating two-dimensional (e.g., Grossberg & Todorović, 1988; Pessoa & Ross, 2000; Neumann, Pessoa, & Hansen, 2001) or three-dimensional (e.g., Grossberg & Pessoa, 1998; Kelly & Grossberg, 2000; Grossberg & Howe, 2003) surface representations, theories addressing the processing and representation of luminance gradients in the visual system are scarce (an approach based on differential geometry was suggested by Ben-Shahar, Huggins, & Zucker, 2002; see section 4). Here we propose a corresponding two-dimensional (monocular) architecture, called a gradient system. The gradient system is thought to operate in parallel, and to interact, with mechanisms for generating surface representations (representations for both chromatic and achromatic surface reflectance). Put another way, we propose that representations of brightness gradients as generated by the gradient system are involved in lightness computations. Direct evidence for separated representations of gradients and surfaces comes from the observation that chromatic variations and spatially aligned luminance gradients occur in the first place due to variations in surface reflectance. In contrast, shadows, or shading, generate luminance variations, which are independent from chromatic variations. In line with these observations, recent evidence suggests that the visual system utilizes chromatic and achromatic information for different purposes (Kingdom, 2003), where achromatic information seems to generate the 3D shape for chromatic surfaces. Apart from the conceptual advantages of having gradient representations, the purpose of the gradient system is to clarify, by suggesting corresponding neuronal mechanisms, how such representations can be generated with high spatial acuity from foveal retinal responses.
The gradient system consists of two further subsystems. The first subsystem detects luminance gradients by using solely foveal retinal responses. The second subsystem generates representations of luminance gradients. Notice that luminance gradients may extend over distances that may exceed several times the receptive field size of foveal ganglion cells and V1 cells, respectively. Our proposed mechanism detects gradients indirectly, by attenuating high spatial frequency information, which is typically met at lines and points (even symmetric luminance features) or sharply bounded surface edges (odd symmetric features), respectively. In what follows, we designate these latter features nongradient features. Attenuation of the nongradient features results in sparse local measurements for gradient evidence in the first subsystem. These measurements are often noisy and fragmented. The second subsystem generates from these measurements smooth and dense gradient representations by means of a novel diffusion paradigm (clamped diffusion), which is proposed as an underlying neural substrate. Clamped diffusion shares some analogies with filling-in mechanisms, since both involve a locally controlled lateral
propagation of activity. Filling in was proposed as a mechanism for creating cortical surface representations (see above references), and is now supported both psychophysically (Gerrits, de Haan, & Vendrik, 1966; Gerrits & Vendrik, 1970; Paradiso & Nakayama, 1991; Rossi & Paradiso, 1996; Paradiso & Hahn, 1996; Davey et al., 1998; Rudd & Arrington, 2001) and neurophysiologically (Rossi et al., 1996; Rossi & Paradiso, 1999; but see Pessoa & Neumann, 1998; Pessoa, Thompson, & Noë, 1998). To represent luminance gradients within the filling-in framework, Grossberg and Mingolla (1987) demonstrated that boundary webs can build up in regions of continuous luminance gradients. Whereas nongradient features (e.g., lines and edges) usually constitute impermeable barriers for the filling-in process, boundary webs can lead to a partial blocking, thereby generating effects of smooth shading in perceived brightness (see also Pessoa, Mingolla, & Neumann, 1995). Notice, however, that by incorporating smooth gradients into surface representations, segregation of such gradients is only postponed. As exemplified above, lightness constancy and illumination-invariant object recognition, respectively, imply the suppression of those smooth gradients, which are not properties of object surfaces. In general, most models do not distinguish explicitly between surfaces (achromatic surface reflectance) and gradients (specular highlights, local blur), thus ignoring the diverse semantics of luminance gradients. This is to say that typically a mixed representation of both structures is created, which stands in contrast to having them separated as proposed here. Typical approaches generate such mixed representations by superimposing responses from bandpass filters of various sizes or scales (e.g., Blakeslee & McCourt, 1997, 1999, 2001; Sepp & Neumann, 1999; McArthur & Moulden, 1999), implying that gradient information is implicitly represented in filter responses.
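The flavor of such a locally controlled propagation process can be conveyed by a toy one-dimensional sketch (our own illustration; the actual clamped-diffusion equations appear in section 2): sparse "measurement" sites stay clamped to their values while ordinary diffusion fills the gaps in between with a smooth, dense profile.

```python
import numpy as np

# Toy 1-D "clamped diffusion": cells carrying gradient evidence are held
# fixed (clamped) while activity diffuses laterally; the steady state is a
# smooth, dense profile interpolating the sparse measurements.
n = 50
clamp_idx = np.array([10, 40])               # sites with gradient evidence
clamp_val = np.array([0.2, 0.8])             # measured activity there

v = np.zeros(n)
for _ in range(60000):
    lap = np.empty(n)
    lap[1:-1] = v[:-2] - 2.0 * v[1:-1] + v[2:]   # discrete Laplacian
    lap[0] = v[1] - v[0]                         # zero-flux boundaries
    lap[-1] = v[-2] - v[-1]
    v += 0.2 * lap                               # explicit diffusion step
    v[clamp_idx] = clamp_val                     # measurements stay clamped
```

At convergence the profile is harmonic between the clamps — it rises monotonically from 0.2 at cell 10 to 0.8 at cell 40 — and is flat outside them, i.e., a dense representation recovered from two sparse measurements.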
In a nutshell, the proposed framework is as follows. The visual system computes bottom-up estimates for brightness gradients and surface brightness in parallel. At this level, surface representations are assumed to be devoid of any activity corresponding to smooth luminance gradients. Although not considered in this article, the latter process may involve low-level interactions between representations for surfaces and gradients. Subsequently, gradient representations contribute to deriving reliable estimates for surface lightness. Lightness computations may involve feedback from downstream mechanisms involved in object recognition, hence establishing high-level interactions between surface and gradient representations. Our line of argumentation is similar to one adopted earlier (Ginsburg, 1975): if gradient representations do exist in the visual system, then it should be possible for the gradient system to predict certain perceptual data. This is indeed the case: the gradient system accounts for psychophysical data on Mach bands and Chevreul’s illusion and a variant of the Ehrenstein disk (with line inducers), despite not taking into account possible interactions
between representations for surfaces and gradients. Notice that these illusions are traditionally explained by quite different mechanisms. The universality and robustness of the proposed mechanisms are further demonstrated by successfully segregating luminance gradients with real-world images.

2 Formulation of the Gradient System

The gradient system is composed of two further subsystems. The first detects gradient evidence in a given luminance distribution L. The output of the first subsystem represents sparse activity maps for gradient evidence. From these sparse maps, dense representations for brightness gradients are subsequently recovered by the second subsystem. The model was designed under the assumption that all stages relax to equilibrium before changes in luminance occur.2 This is to say that we define L such that ∂L_ij(t)/∂t = 0, where (i, j) are spatial indices and t denotes time. The latter assumption is consistent with the notion that shading and blur are so-called pictorial depth cues, which are available with individual and static images (Todd, 2003). In all simulations, black was assigned the intensity L = 0 (minimum value) and white L = 1 (maximum value). Responses of all cells are presumed to represent average firing rates. Model parameters were optimized for obtaining good results for gradient segregation over a set of real-world images. The exact parameter values are not significant for the model's performance, as long as they stay within their given orders of magnitude. Our model is minimally complex in the sense that it cannot be reduced further: each equation and each parameter is crucial for arriving at our conclusions.

2.1 Retinal Stage. The retina is the first stage in processing luminance information. We assume that the cortex has access only to responses of the lateral geniculate nucleus (LGN) and treat the LGN as relaying retinal responses.
We also assume that any residual DC part in the retinal responses was discounted (see Neumann, 1994; Pessoa et al., 1995; Rossi & Paradiso, 1999). No additional information, such as large filter scales or an “extra luminance channel” (low-pass information) is used (Burt & Adelson, 1983; du Buf, 1992; Neumann, 1994; Pessoa et al., 1995; du Buf & Fischer, 1995). Ganglion cell responses to a luminance distribution L are computed under the assumption of space-time separability (Wandell, 1995; Rodieck, 1965), where we assumed a constant temporal term by definition. The receptive field structure of our ganglion cells approximates a Laplacian (Marr & Hildreth, 1980), with center activity C and surround activity S. Center width is 1 pixel; hence, Cij ≡ Lij . Surround activity is computed by convolving L
2 When simulating real-time perception, however, model stages can no longer be expected to reach a steady state.
Early Gradient Representations
877
with a 3 × 3 kernel with zero weight in the center, exp(−1)/η for the four nearest neighbors, and exp(−2)/η for the four next-nearest neighbors. The constant η is chosen such that the kernel integrates to one. Retinal responses are evaluated at the steady state of

$$\frac{dV_{ij}(t)}{dt} = g^{\mathrm{leak}}\left(V^{\mathrm{rest}} - V_{ij}\right) + E^{\mathrm{cent}} C_{ij} + E^{\mathrm{surr}} S_{ij} + g_{ij}^{(si)}\left(E^{si} - V_{ij}\right), \qquad (2.1)$$
where $g^{\mathrm{leak}} = 1$ denotes the leakage conductance of the cell membrane and $V^{\mathrm{rest}} = 0$ is the cell's resting potential. $g_{ij}^{(si)} \equiv [E^{\mathrm{cent}} C_{ij} + E^{\mathrm{surr}} S_{ij}]^+$ denotes self-inhibition with reversal potential $E^{si} = 0$, where $[\cdot]^+ \equiv \max(\cdot, 0)$. Self-inhibition implements the compressive, nonlinear response curve observed in biological X-type cells (Kaplan, Lee, & Shapley, 1990). From equation 2.1, we obtain two types of retinal responses, one selective for luminance increments (ON-type) and another selective for luminance decrements (OFF-type). ON-cell responses $x_{ij}^{\oplus} \equiv [V_{ij}]^+$ are obtained by setting $E^{\mathrm{cent}} = 1$ and $E^{\mathrm{surr}} = -1$, and OFF-cell responses $x_{ij}^{\ominus} \equiv [V_{ij}]^+$ by $E^{\mathrm{cent}} = -1$ and $E^{\mathrm{surr}} = 1$. Equation 2.1 implies that responses of associated ON and OFF cells are equal, or balanced, except for a sign reversal (parallel pathways).3 Although a given type of biological ON and OFF ganglion cells constitutes nearly parallel pathways with respect to impulse response kinetics (DeVries & Baylor, 1997; Benardete & Kaplan, 1999; Keat, Reinagel, Reid, & Meister, 2001; Chichilnisky & Kalmar, 2002), they do not with respect to other properties. For instance, ON cells fire spontaneously at a higher rate than OFF cells under photopic stimulation (Cleland, Levick, & Sanderson, 1973; Kaplan, Purpura, & Shapley, 1987; Troy & Robson, 1992; Passaglia, Enroth-Cugell, & Troy, 2001; Chichilnisky & Kalmar, 2002; Zaghloul, Boahen, & Demb, 2003). Model predictions for synthetic stimuli (such as a luminance ramp or a sine wave grating) are not influenced in a significant way by the choice of retinal model (e.g., the one proposed in Pessoa et al., 1995, or one with tonic ON activity). The use of different retinal models, however, leads to different results for real-world images. These results are worse when one channel, but not the other, has tonic baseline activity (not shown).
Due to the absence of tonic or baseline activity, equation 2.1 may be considered a compact formulation for describing the outcome of a retino-geniculate competition (Maffei & Fiorentini, 1972), which previously has been modeled by separated stages (Neumann, 1996).
3 Associated responses are responses that “belong together”; they were generated by exactly one luminance feature (such as an edge, a line, or a bar).
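The retinal stage above can be summarized in a short sketch (a minimal NumPy/SciPy illustration under the stated parameter values $g^{\mathrm{leak}} = 1$, $V^{\mathrm{rest}} = 0$, $E^{si} = 0$; the function names and the replicated image borders are our own assumptions, not part of the original model description):

```python
import numpy as np
from scipy.ndimage import convolve

def surround_kernel():
    # 3x3 surround: zero center, exp(-1)/eta for the four nearest
    # neighbors, exp(-2)/eta for the four next-nearest neighbors;
    # eta normalizes the kernel so that it integrates to one.
    k = np.array([[np.exp(-2), np.exp(-1), np.exp(-2)],
                  [np.exp(-1), 0.0,        np.exp(-1)],
                  [np.exp(-2), np.exp(-1), np.exp(-2)]])
    return k / k.sum()

def retina(L):
    """Steady-state ON/OFF responses of equation 2.1.

    With g_leak = 1, V_rest = 0, E_si = 0 and self-inhibition
    g_si = [E_cent*C + E_surr*S]^+, the steady state is
    V = (E_cent*C + E_surr*S) / (1 + g_si)."""
    C = L                                        # center width 1 pixel: C_ij = L_ij
    S = convolve(L, surround_kernel(), mode='nearest')

    def steady(E_cent, E_surr):
        I = E_cent * C + E_surr * S              # feedforward drive
        g_si = np.maximum(I, 0.0)                # self-inhibition conductance
        V = I / (1.0 + g_si)                     # solves 0 = -V + I - g_si*V
        return np.maximum(V, 0.0)                # rectified firing rate

    x_on = steady(+1.0, -1.0)                    # ON:  E_cent = 1,  E_surr = -1
    x_off = steady(-1.0, +1.0)                   # OFF: E_cent = -1, E_surr = 1
    return x_on, x_off
```

As the text requires, a uniform image yields no response (center and surround cancel), while a step edge drives both channels on opposite sides of the discontinuity.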
M. Keil
Figure 1: Two types of luminance gradients. (A) A sine wave grating is an example of a nonlinear gradient. Graphs (columns for a fixed row number) show luminance (large curves) and the corresponding retinal ON-responses (small, darker curves) and OFF-responses (small curves). (B) A ramp is an example of a linear gradient. At the "knee points," where the luminance ramp meets the bright and dark luminance plateaus, respectively, Mach bands are perceived (Mach, 1865). Mach bands are illusory overshoots in brightness (labeled "bright") and darkness ("dark") at the positions indicated by the arrows.
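The two stimulus classes of Figure 1 are straightforward to reproduce; the following sketch (helper names, image size, and ramp placement are our own illustrative assumptions) generates a sine wave grating and a luminance ramp with L ∈ [0, 1]:

```python
import numpy as np

def sine_grating(n=128, freq=0.03, contrast=1.0):
    """Nonlinear gradient: sine wave grating (freq in cycles per pixel)."""
    x = np.arange(n)
    row = 0.5 + 0.5 * contrast * np.sin(2.0 * np.pi * freq * x)
    return np.tile(row, (n, 1))

def luminance_ramp(n=128, ramp_width=10, lo=0.0, hi=1.0):
    """Linear gradient: dark plateau, linear ramp, bright plateau.
    Mach bands are perceived at the two knee points of this profile."""
    row = np.full(n, lo)
    start = (n - ramp_width) // 2
    row[start:start + ramp_width] = np.linspace(lo, hi, ramp_width)
    row[start + ramp_width:] = hi
    return np.tile(row, (n, 1))
```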
2.2 Subsystem I: Gradient Detection. Ganglion cell responses provide the input to the first stage, the detection subsystem. Responses of our ganglion cells are computed by using a Laplacian-like receptive field. Hence, lines or edges (sharply bounded luminance features) generate responses with activity peaks of ON and OFF cells occurring in close vicinity (see Neumann, 1994, for a mathematical treatment). In contrast, luminance gradients (e.g., blurred lines and edges) give rise to spatially more separated ON- and OFF-activity peaks. This leads to the notion that luminance gradients can be detected by suppressing ON- and OFF-responses occurring closely together, while at the same time enhancing more separated responses (see Figure 1). Of course, finding ON- and OFF-responses that are close together amounts to boundary detection. For our purposes, the specific mechanism for detecting boundaries is not relevant—it is the quality of the resulting boundary map that matters. Thus, in order to keep things simple, we detect ON and OFF responses occurring in close vicinity (= nongradients) by multiplicative interactions between retinal channels:

$$\frac{dg_{ij}^{1\circ}(t)}{dt} = -g^{\mathrm{leak}} g_{ij}^{1\circ} + x_{ij}^{\oplus}\left(1 + g_{ij}^{1\bullet}\right)\left(E^{ex} - g_{ij}^{1\circ}\right) + \nabla^2 g_{ij}^{1\circ}$$
$$\frac{dg_{ij}^{1\bullet}(t)}{dt} = -g^{\mathrm{leak}} g_{ij}^{1\bullet} + x_{ij}^{\ominus}\left(1 + g_{ij}^{1\circ}\right)\left(E^{ex} - g_{ij}^{1\bullet}\right) + \nabla^2 g_{ij}^{1\bullet}. \qquad (2.2)$$
The superscript "◦" denotes activity that encodes brightness, and "•" denotes darkness. Superscript numbers indicate the respective level of the model
hierarchy. $\nabla^2$ is the Laplacian, which models lateral voltage spread (e.g., Winfree, 1995; Lamb, 1976; Naka & Rushton, 1967). In neurophysiological terms, the latter equation approximates responses of orientationally pooled cortical complex cells (we understand cortical complex cells as being equivalently excited by even-symmetric [lines, points] and odd-symmetric [edges] luminance features; cf. Hubel & Wiesel, 1962, 1968). Notice that one could replace equation 2.2 with a more sophisticated boundary-processing stage, such as those proposed in Gove, Grossberg, and Mingolla (1995) and Mingolla, Ross, and Grossberg (1999), if one aimed to increase the biological realism of our model. We used adiabatic boundary conditions in our simulations. The leakage conductance $g^{\mathrm{leak}} = 0.35$ specifies the spatial extent of lateral voltage spread (cf. Benda, Bock, Rujan, & Ammermüller, 2001). Therefore, varying $g^{\mathrm{leak}}$ provides an easy means for adjusting the spatial frequency sensitivity of the detection circuit: lower values of $g^{\mathrm{leak}}$ make equation 2.2 respond to image features with a higher degree of blur (see Figure 7, bottom graph). Excitatory input (reversal potential $E^{ex} = 1$) into both diffusion layers is provided by two sources. The first source is the output of the retinal cells $x^{\oplus}$ and $x^{\ominus}$, respectively. Their activities propagate laterally in each respective diffusion layer. If the input is caused by nongradients (e.g., lines or edges), then in the end, activities in both layers spatially overlap. This situation is captured by multiplicative interaction between the diffusion layers, which provides the second excitatory input (terms $x_{ij}^{\oplus} g_{ij}^{1\bullet}$ and $x_{ij}^{\ominus} g_{ij}^{1\circ}$, respectively). Thus, multiplicative interaction leads to mutual amplification of activity in both layers at nongradient positions. Equations 2.2 were fixed-point-iterated 50 times at steady state.4 Finally, the activity (notice that $g^{1\circ}, g^{1\bullet} \geq 0\ \forall t$)

$$g_{ij}^{(2)} = g_{ij}^{1\circ} \times g_{ij}^{1\bullet} \qquad (2.3)$$
correlates with high-spatial-frequency features at position (i, j) (see Gabbiani, Krapp, Koch, & Laurent, 2002, for a biophysical mechanism implementing multiplication). Nongradient suppression is brought about by inhibitory activity $g^{in} \equiv \alpha \cdot g_{ij}^{(2)}$. Inhibitory activity interacts with retinal responses by means of gradient neurons:

$$\frac{dg_{ij}^{3\circ}(t)}{dt} = -g^{\mathrm{leak}} g_{ij}^{3\circ} + x_{ij}^{\oplus}\left(E^{ex} - g_{ij}^{3\circ}\right) + g_{ij}^{in}\left(E^{in} - g_{ij}^{3\circ}\right)$$
$$\frac{dg_{ij}^{3\bullet}(t)}{dt} = -g^{\mathrm{leak}} g_{ij}^{3\bullet} + x_{ij}^{\ominus}\left(E^{ex} - g_{ij}^{3\bullet}\right) + g_{ij}^{in}\left(E^{in} - g_{ij}^{3\bullet}\right). \qquad (2.4)$$
4 We compared steady-state fixed-point iteration with direct integration using Euler's method and an explicit fourth-order Runge-Kutta scheme (Δt = 0.75). We did not observe any differences in our results.
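The steady-state fixed-point iteration of equations 2.2 together with the product of equation 2.3 can be sketched as follows. This is a sketch under our reading of the equations: the Laplacian is taken as the standard 5-point stencil with adiabatic (replicated) borders, and each update solves its equation for the layer's own activity with the coupling term and the neighbor sum held at the previous iterate; function names are our own:

```python
import numpy as np

def neighbor_sum(g):
    """Sum over the four nearest neighbors, adiabatic (replicated) borders."""
    gp = np.pad(g, 1, mode='edge')
    return gp[:-2, 1:-1] + gp[2:, 1:-1] + gp[1:-1, :-2] + gp[1:-1, 2:]

def nongradients(x_on, x_off, g_leak=0.35, E_ex=1.0, n_iter=50):
    """Fixed-point iteration of equations 2.2 at steady state, then eq. 2.3.

    Each update rearranges
        0 = -g_leak*g + x*(1 + g_other)*(E_ex - g) + (neighbor_sum(g) - 4*g)
    for g."""
    g1_bright = np.zeros_like(x_on)
    g1_dark = np.zeros_like(x_off)
    for _ in range(n_iter):
        drive_b = x_on * (1.0 + g1_dark)      # x+ * (1 + g_dark)
        drive_d = x_off * (1.0 + g1_bright)   # x- * (1 + g_bright)
        g1_bright = (drive_b * E_ex + neighbor_sum(g1_bright)) \
            / (g_leak + drive_b + 4.0)
        g1_dark = (drive_d * E_ex + neighbor_sum(g1_dark)) \
            / (g_leak + drive_d + 4.0)
    return g1_bright * g1_dark                # eq. 2.3: g^(2)
```

Adjacent ON and OFF responses (an edge) yield overlapping diffusion layers and hence a nonzero product, which is exactly the nongradient signature described in the text.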
α = 35 is an inhibitory weight constant, chosen such that the last equation cannot produce responses to isolated lines and edges. Equation 2.4 is evaluated at steady state ($g^{\mathrm{leak}} = 0.75$, $E^{ex} = 1$, and $E^{in} = -1$). The output $\tilde{g}_{ij}^{3\circ}$ and $\tilde{g}_{ij}^{3\bullet}$ of the two parts of equation 2.4 is computed by means of an activity-dependent threshold $\Theta \equiv \beta \cdot \langle g^{(2)} \rangle$ (the operator $\langle \cdot \rangle$ computes the average value over all spatial indices):

$$\tilde{g}_{ij}^{3\circ} = \begin{cases} g_{ij}^{3\circ} & \text{if } g_{ij}^{3\circ} > \Theta \\ 0 & \text{otherwise.} \end{cases} \qquad (2.5)$$
$\tilde{g}_{ij}^{3\bullet}$ is computed in an analogous way. Activity-dependent thresholding can be considered an adaptational mechanism at the network level. Note that, as opposed to simple fixed-valued thresholds, it depends on the activation of all other neurons. In this respect it is similar to the so-called quenching threshold proposed by Grossberg (1983). Here, it reduces responses to spurious gradient features.5 β = 1.75 is a constant for elevating the threshold such that most of the spurious gradient features are suppressed. The robustness of gradient detection with real-world images is further improved by eliminating those spurious features that survived the previous stages. For this purpose, adjacent responses $\tilde{g}^{3\circ}$ and $\tilde{g}^{3\bullet}$ are subjected to mutual inhibition. Mutual inhibition is implemented via the operator A, which adds an activity (say, $a_{ij}$) at position (i, j) to its four nearest neighbors; that is, $A(a_{ij})$ implies $a_{i-1,j} := a_{i-1,j} + a_{ij}$, $a_{i,j-1} := a_{i,j-1} + a_{ij}$, $a_{i+1,j} := a_{i+1,j} + a_{ij}$, and $a_{i,j+1} := a_{i,j+1} + a_{ij}$, where the symbol := denotes "is replaced by." A is reduced to existing neighbors at boundary locations. Then

$$\tilde{g}_{ij}^{4\circ} = \left[\tilde{g}_{ij}^{3\circ} - A\left(\tilde{g}_{ij}^{3\bullet}\right)\right]^+, \qquad \tilde{g}_{ij}^{4\bullet} = \left[\tilde{g}_{ij}^{3\bullet} - A\left(\tilde{g}_{ij}^{3\circ}\right)\right]^+ \qquad (2.6)$$
defines the output of the first subsystem, where $\tilde{g}^{4\circ}$ represents gradient brightness and $\tilde{g}^{4\bullet}$ gradient darkness. Notice that this output by itself cannot yet give rise to gradient percepts, since, for example, $\tilde{g}^{4\circ}$ and $\tilde{g}^{4\bullet}$ responses to a luminance ramp do not correspond to human perception (see Figure 1). Although no responses of equation 2.1 and equation 2.6, respectively, are obtained along the ramp, humans nevertheless perceive a brightness gradient there. This observation leads to the proposal of a second subsystem, in which perceived gradient representations are generated by means of a novel diffusion paradigm (clamped diffusion). The responses of the individual stages are illustrated in Figure 2 for a real-world image at various levels of gaussian blur.
5 These are erroneously detected gradients that would survive for Θ = 0.
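Equations 2.5 and 2.6 amount to a global, activity-dependent threshold followed by a rectified neighbor subtraction. A sketch (the operator name A follows the text; the wrapper function name is our own assumption):

```python
import numpy as np

def A(a):
    """Operator A of eq. 2.6: add each activity to its four nearest
    neighbors (reduced to existing neighbors at the array border)."""
    out = a.copy()
    out[:-1, :] += a[1:, :]    # a_{i-1,j} += a_{i,j}
    out[1:, :] += a[:-1, :]    # a_{i+1,j} += a_{i,j}
    out[:, :-1] += a[:, 1:]    # a_{i,j-1} += a_{i,j}
    out[:, 1:] += a[:, :-1]    # a_{i,j+1} += a_{i,j}
    return out

def subsystem1_output(g3_bright, g3_dark, g2, beta=1.75):
    """Eq. 2.5 (activity-dependent threshold Theta = beta * <g2>)
    followed by eq. 2.6 (mutual inhibition between channels)."""
    theta = beta * g2.mean()
    t_b = np.where(g3_bright > theta, g3_bright, 0.0)
    t_d = np.where(g3_dark > theta, g3_dark, 0.0)
    g4_bright = np.maximum(t_b - A(t_d), 0.0)   # gradient brightness
    g4_dark = np.maximum(t_d - A(t_b), 0.0)     # gradient darkness
    return g4_bright, g4_dark
```

Spatially isolated responses pass through unchanged, whereas coincident brightness and darkness activity cancels, which is the intended suppression of surviving spurious features.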
Figure 2: Subsystem I stages. Visualization of individual stages for an unblurred image (top row) and gaussian-blurred images (σ = 1 and σ = 2 pixels, respectively). The column luminance shows the input (both L ∈ [0, 1]), retina shows the output of equation 2.1 ($x^{\oplus}$ [$x^{\ominus}$] activity is indicated by lighter [darker] gray), and nongradients shows the output of equation 2.3. Finally, gradient neurons shows the output of equation 2.4. Images were normalized individually to achieve better visualization. The number at each image denotes its respective maximum activity value.
2.3 Subsystem II: Gradient Generation. Luminance gradients can be subdivided into two classes, according to their retinal responses (see Figure 2). Responses of model ganglion cells are obtained for luminance regions with varying slope (nonlinear gradients), but the cells are unresponsive within regions of constant slope (linear gradients).6 Consequently, for a luminance ramp, we obtain nonzero ganglion cell responses only at the "knee points" where the luminance ramp meets the plateaus. In contrast, ganglion cell responses smoothly follow a sine wave–modulated luminance function. Ideally, we would like the spatial layout of gradient representations as generated by the model to be isomorphic with perceived brightness gradients.7 Thus,
6 In biological ganglion cells, this distinction may be relaxed. In fact, what matters is that (retinal) responses to linear and nonlinear gradients are different. 7 For image processing tasks, such isomorphism may be understood such that the input (stimulus) and the output (gradient representation) are equal. In general, however, such
we have to explicitly generate a gradient in perceived activity in the case of a linear luminance gradient, whereas perceived activity should match the retinal activity pattern for nonlinear gradients (perceived or perceptual activity denotes the activity that leads to the perception of luminance gradients). The generation and representation of both types of luminance gradients are accomplished by a single mechanism, the clamped diffusion equation:

$$\frac{dg_{ij}^{(5)}(t)}{dt} = -g^{\mathrm{leak}} g_{ij}^{(5)} + \gamma\, g_{ij}^{(2)}\left(E^{in} - g_{ij}^{(5)}\right) + \underbrace{\tilde{g}_{ij}^{4\circ} + x_{ij}^{\oplus}}_{\text{source}} - \underbrace{\left(\tilde{g}_{ij}^{4\bullet} + x_{ij}^{\ominus}\right)}_{\text{sink}} + \nabla^2 g_{ij}^{(5)}. \qquad (2.7)$$
Parameter values are $g^{\mathrm{leak}} = 0.0025$, $E^{in} = 0$, and $\gamma = 250$. Equation 2.7 was integrated through fixed-point iteration at steady state (see note 4). A brightness source (or, equivalently, a darkness sink) is defined as retinal ON-activity $x^{\oplus}$ enhanced by gradient brightness $\tilde{g}^{4\circ}$. Likewise, a darkness source (brightness sink) is defined by OFF-activity $x^{\ominus}$ enhanced by gradient darkness $\tilde{g}^{4\bullet}$. $g^{(5)}$ already expresses perceptual gradient activity: perceived darkness is represented by negative values of $g^{(5)}$ and perceived brightness by positive values. Note that no thresholding operation is applied. The resting potential of equation 2.7 is zero and is identified with the perceptual Eigengrau value (Gerrits et al., 1966; Gerrits & Vendrik, 1970; Gerrits, 1979; Knau & Spillmann, 1997). In a biophysically realistic scenario, equation 2.7 has to be replaced by two equations: one describing perceived brightness and one perceived darkness. Then, to compute perceptual activity, rectified darkness activity has to be subtracted from, and rectified brightness activity added to, a suitably chosen Eigengrau value. The single equation 2.7 is equivalent to the "brightness-minus-darkness" case for a zero Eigengrau value, given that both the equation for brightness and the equation for darkness make use of shunting inhibition (reversal potentials = 0) but not of subtractive inhibition (reversal potentials < 0). The anchoring process with an Eigengrau level is adopted as a simple means for visualizing the output of the gradient system, especially with real-world images. It is not supposed to represent a fully qualified mechanism for brightness anchoring (cf. Gilchrist et al., 1999). Rather, the gradient system could in principle help solve the lightness anchoring problem for surface representations, for example, according to the recently proposed blurred-highest-luminance-as-white (BHLAW) rule (Grossberg & Hong, 2003; Hong & Grossberg, 2004).
equivalence is not true for lightness perception, due to nonlinear information processing in the visual system (e.g., the presence of compressive functions).
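At steady state, equation 2.7 can likewise be solved by fixed-point iteration. A sketch with $E^{in} = 0$ (so shunting inhibition contributes $-\gamma g^{(2)} g^{(5)}$); the helper names and replicated borders are our assumptions:

```python
import numpy as np

def neighbor_sum(g):
    """Sum over the four nearest neighbors, replicated borders."""
    gp = np.pad(g, 1, mode='edge')
    return gp[:-2, 1:-1] + gp[2:, 1:-1] + gp[1:-1, :-2] + gp[1:-1, 2:]

def clamped_diffusion(x_on, x_off, g4_bright, g4_dark, g2,
                      g_leak=0.0025, gamma=250.0, n_iter=500):
    """Fixed-point iteration of the clamped diffusion equation 2.7.

    Rearranges 0 = -g_leak*g - gamma*g2*g + (src - snk) + lap(g)
    for g, with lap(g) = neighbor_sum(g) - 4*g and E_in = 0."""
    src = g4_bright + x_on        # brightness source (= darkness sink)
    snk = g4_dark + x_off         # darkness source (= brightness sink)
    g5 = np.zeros_like(x_on)
    for _ in range(n_iter):
        g5 = (src - snk + neighbor_sum(g5)) / (g_leak + gamma * g2 + 4.0)
    return g5                     # > 0: perceived brightness; < 0: darkness
```

The sources and sinks are re-imposed ("clamped") at every iteration, while diffusion carries an activity flux between them and shunting inhibition by nongradients ($\gamma g^{(2)}$) locally raises the effective leak.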
Equation 2.7 establishes monotonicity between intensity and perceived gradient activity by (1) the leakage conductance $g^{\mathrm{leak}}$, (2) shunting inhibition ($E^{in} = 0$) triggered by nongradient features (e.g., lines and edges), and (3) diffusion, which brings about an activity flux between clamped sources and sinks. Shunting inhibition by nongradient features can be conceived of as locally increasing the leakage conductance at positions of sharply bounded luminance features. In total, equation 2.7 achieves gradient segregation by two mechanisms: enhancement of gradient activity by gradient neurons (see equation 2.4) and simultaneous suppression of nongradient features by equation 2.3.

3 Model Predictions

Results obtained from the gradient system are evaluated in the context of brightness perception. As an approximation, we assume that representations for surfaces and brightness gradients are computed independently of each other (i.e., they do not interact). Also, any activity associated with smooth luminance gradients is assumed to be discounted from surface representations. Although only predictions from the gradient system are shown below, we assume that overall brightness is obtained by some linear combination of both representations, thereby neglecting that, for example, downstream mechanisms could suppress or enhance one or the other representation. Both representations are thought to be in spatial register (retinotopic). By definition, in all graphs here, negative values represent perceived darkness activity and positive values perceived brightness activity. In the images shown, perceived brightness (darkness) corresponds to brighter (darker) levels of gray.

3.1 Nonlinear and Linear Gradients. The evolution of $g^{(5)}$ for a nonlinear and a linear luminance gradient is juxtaposed in Figure 3, where a sine wave grating and a triangular-shaped luminance profile (or triangular grating), respectively, served as input.
For the latter stimulus, humans perceive Mach band–like overshoots in brightness and darkness. In fact, psychophysical data from Ross, Morrone, and Burr (1989) support the hypothesis that Mach bands and the Mach band–like overshoots derive from the same neuronal mechanism, given their similar detection thresholds. Our predictions are consistent with this finding: both types of Mach features correspond to clamped sources and sinks, respectively, which are visible while gradient representations for luminance ramps are generated. This means that while representations for linear gradients are generated, clamped sources and sinks remain visible for relatively longer than the actual gradient representation. The situation is different for nonlinear gradients: luminance and clamped sources and sinks have similar activity profiles, and hence corresponding gradient representations retain their spatial layout at each time step (see Figures 1 and 3). In other words, with nonlinear gradients, the dynamics of
Figure 3: State of equation 2.7 at different times. The images show the perceptual activity $g^{(5)}$, with numbers indicating elapsed time steps. The luminance images show the respective stimuli: a sine wave grating (nonlinear gradient) and a triangular-shaped luminance profile (linear gradient).
Figure 4: Perceptual activity for linear and nonlinear gradients. Images (insets; 128 × 128 pixels) show gradient representations (i.e., the state of equation 2.7 at t = 500 iterations), and graphs show the corresponding profile $g_{ij}^{(5)}$ for i = 64, j = 1, …, 128, along with normalized luminance (see legend). Stimuli were the same as in Figure 3.
clamped diffusion preserves the shape of retinal responses at gradient positions, and perceived activity patterns retain a constant shape (compare Figure 1 with Figure 4). In order to understand the relationship between input intensity levels and the magnitude of perceived gradient activity, we now consider the convergence behavior of equation 2.7 for both gradient types. In the top graph of Figure 5, a sine wave grating was used to examine the interaction of clamped sources and sinks in the presence of negligible inhibition by nongradients (such as lines or edges). For the two gratings with equal spatial frequency (0.03 cycles per pixel), the initial activity difference was preserved during the dynamics. However, doubling the spatial frequency of the full-contrast grating to 0.06 cycles per pixel leads to a lower level of convergent activity.
Figure 5: Evolution of equation 2.7 over time. (A) Evolution of $g^{(5)}$ for a sine wave grating of different contrasts and spatial frequencies at the bright-phase position (see inset; full contrast means that L is scaled to the range [0, 1]). (B) Evolution of $g^{(5)}$ at the position of the bright Mach band for different ramp widths (see inset).
This is because an increase in spatial frequency decreases the separation between clamped sources and sinks, which in turn causes an increased activity flux between them. The bottom graph of Figure 5 demonstrates, with a luminance ramp, the effect of shunting inhibition by nongradient features. Although decreasing the ramp width also leads to an increased activity flux between clamped sources and sinks, shunting inhibition by nongradient features (in this case, a luminance edge) is nevertheless the dominating effect in establishing the observed convergence behavior. Large ramp widths (10 pixels) do not trigger nongradient inhibition, and even with intermediate ramp widths (5 pixels) nongradient inhibition is negligible (cf. Figure 7). However, for small ramp widths (2 pixels), nongradient inhibition decreases final activity levels significantly. Summarizing: the final magnitude of perceptual activity $|g^{(5)}|$ depends monotonically on input intensity and on the initial amplitude of clamped sources and sinks, respectively, but this monotonic relationship is modulated by the strength of nongradient inhibition and by the spatial separation between clamped sources and sinks.

3.2 Linear Gradients and Mach Bands. After more than a century of investigation, there is still no agreement about the mechanisms underlying the generation of Mach bands (Pessoa, 1996; Ratliff, 1965). Within our theory, both Mach bands and the Mach band–like overshoots associated with a triangular-shaped luminance profile occur because clamped sources and sinks remain visible for relatively longer than the gradient representation that is about to form. Ross et al. (1989) measured Mach band strength as a function of the spatial frequency of a trapezoidal wave and found maximal perceived strength at some intermediate frequency (inverted-U behavior). Both narrow and wide ramps decreased the visibility of Mach bands, and
Figure 6: Threshold contrasts for seeing light (A) and dark (B) Mach bands according to Ross et al. (1989). The data for generating the above plots were extracted from Figure 5 in Ross et al. (1989). The t-value specifies the ratio of ramp width to period (varying from 0 for a square wave to 0.5 for a triangle wave). The ramp width was estimated from spatial frequencies by defining a maximum display size (2048 pixels) at the minimum spatial frequency (0.05 cycles per degree). For each spatial frequency and t-value, the original stimulus waveforms were generated with equation 1 in Ross et al., and the ramp width (in pixels) was measured. Notice, however, that the value for the maximum display size is arbitrary, implying that the above data are defined only up to a translation along the abscissa.
Mach bands were barely seen, or not visible at all, at luminance steps. From Figure 5 in Ross et al. (1989), the ramp widths of the original stimuli were estimated in order to allow comparison with the gradient system (details are given in the legend of Figure 6). Notice the tendency of the maxima to group around a specific ramp width, regardless of the ramp form used (as specified by the t-value). Figure 7 shows light Mach band strength as a function of ramp width. The gradient system predicted the inverted-U behavior, in agreement with the original data. However, the gradient system makes identical predictions for the light and the dark Mach bands, whereas the data of Ross et al. (1989) indicate lower threshold contrasts for dark Mach bands (see Figure 6). Nevertheless, although asymmetries seem to exist, studies diverge as to whether light bands are stronger than dark ones or vice versa (see Pessoa, 1996). In addition, the gradient system predicted a shift of the maximum perceived gradient activity toward smaller ramp widths as the ramp contrast was decreased (data not shown). The driving force in establishing the inverted-U behavior is the inhibition of high-spatial-frequency (i.e., nongradient) features together with an increased activity flux at narrow ramps. Wide ramps, on the other hand, generate smaller retinal responses at knee-point positions and consequently smaller perceptual activities. Varying the value of $g^{\mathrm{leak}}$ in equations 2.1 and 2.2 causes the maximum Mach band strength to shift (cf. Figure 7), thus providing an
Figure 7: Inverted-U behavior of the bright Mach band as predicted by the gradient system. These graphs should be compared with Figure 6. Curves in A (B) show the shift of the maximum Mach band strength for different values of $g^{\mathrm{leak}}$ in equation 2.1 (equation 2.2). These simulations involved t = 2000 iterations of equation 2.7 for each data point.
easy means for fitting the model output to psychophysical data. Parameters of the gradient system, however, were optimized for robust performance with real-world images. The gradient system also predicts psychophysical data addressing the interaction of Mach bands with adjacently placed stimuli (data not shown; see Keil, Cristóbal, & Neumann, in press).

3.3 Chevreul Illusion. A luminance staircase gives rise to the Chevreul illusion (Chevreul, 1890). The illusion is that brightness across the staircase seems inhomogeneous (except at the first and the last plateau), albeit luminance is constant at each plateau. Chevreul's illusion is traditionally explained by the contrast sensitivity function of the visual system. More recent explanations rely, for example, on spatially dissipative filling-in activity (Pessoa et al., 1995) or even-symmetric filter responses from coarse filter scales (e.g., Watt & Morgan, 1985; Kingdom & Moulden, 1992; Morrone, Burr, & Ross, 1994; du Buf & Fischer, 1995). Figure 8 illustrates the illusion for different numbers of plateaus. With a three-plateau configuration, human observers perceive the inhomogeneity as significantly weaker at the middle plateau. Despite the absence of physical luminance gradients, the gradient system nevertheless engenders representations (see Figure 9). Put another way, Chevreul's illusion is predicted as a consequence of "incorrect" gradient representations at the plateaus. One must not forget, however, that luminance staircases trigger surface representations at the same time. With a luminance staircase, perceptual activity is assumed to be significantly higher for surface representations than for the corresponding gradient representations. Hence, the surface representation constitutes the dominating perceptual event, which is overlaid by a relatively weak gradient
Figure 8: Chevreul illusion. Although luminance is homogeneous across each plateau, brightness is perceived inhomogeneously (with the exception of the first and the last plateau): overshoots in brightness are perceived at the left side of each plateau and overshoots in darkness at each right side. The spatial layout of the illusion appears to change as the number of plateaus increases, with a corresponding increase in the nonuniformity of brightness (128 × 128 pixels). Notice that this effect may be difficult to see due to the photographic reproduction process.
representation. Why does the gradient system predict an increase of the illusion with an increasing number of steps? In contrast to the inverted-U characteristic of Mach bands (see Figure 7), this increment is not an effect of increasing amplitude of perceptual activity (see the corresponding numbers indicating maximum activities in Figure 9). Rather, the profile plots in Figure 9 reveal that the peaks (see graphs) decrease, and the generated gradients get steeper, with an increasing number of plateaus (the peaks are created by clamped sources and sinks). With three plateaus, the perceptual gradient is shallow, and the peaks are relatively accentuated. This means that the generated activity gradient between the peaks has a relatively small slope at the middle plateau. With an increasing number of plateaus, however, the peaks get smaller and broader, while the gradients increase in slope. This means that gradient activity varies strongly across plateaus, which gives rise to stronger predicted illusions. Notice, however, that we are not aware of psychophysical studies that have investigated the strength of Chevreul's illusion as a function of the number of plateaus. Model ganglion cells are unresponsive to luminance gradients with constant slope. Hence, one can devise a "teeth"-shaped luminance profile that makes ganglion cells respond in the same way as to a luminance staircase (see Figure 10). This implies that the gradient system cannot distinguish between the two luminance profiles and generates equivalent representations for both of them. These situations can be distinguished only by taking into account the corresponding surface representations, where a brightness staircase is generated for the luminance staircase (brightness levels increase) and a flat brightness profile for the "teeth" profile.
These observations can furthermore be used to verify our claim that Chevreul’s illusion is caused by a representation of a linear luminance gradient: superimposing a teeth profile onto a luminance staircase should give the impression of an enhanced
Figure 9: Predictions for the Chevreul illusion. Predictions of the gradient system, for the luminance configurations shown in Figure 8, after 2000 iterations. (A) For better visualization, each image was normalized individually (numbers indicate maximum activity values of $g^{(5)}$). (B) Graphs show $g_{ij}^{(5)}$ for i = 64 and j = 1, …, 128 (black) and luminance (gray, individually renormalized). The illusion is predicted to be weaker at the first and the last plateau of each staircase, consistent with psychophysical data.
Figure 10: Two luminance distributions giving rise to the same gradient representations. A luminance staircase (A) and a “teeth” profile (B) generate the same retinal response patterns and consequently lead to identical gradient representations. However, the luminance staircase triggers surface representations with different brightness levels, whereas the teeth profile generates a surface representation with constant brightness. This difference causes gradient representations to be the dominating perceptual event in the latter case but not in the former one.
Figure 11: Enhancing Chevreul's illusion. A "teeth" profile (last image) is superimposed on a luminance staircase (first image). The mixture was computed as mixed image = (1 − η) × staircase + η × teeth, with η ∈ {0.25, 0.50} corresponding to the indicated numbers {50%, 100%}. Overlaying the teeth profile on the staircase seems to enhance the Chevreul illusion (compare the middle images with the right image of Figure 8). This experiment corroborates our claim that Chevreul's illusion is the consequence of a weakly perceived luminance gradient representation.
Chevreul effect. The reader may verify this by scrutinizing Figure 11, where a teeth profile and a luminance staircase were superimposed with varying relative amplitudes. Remarkably, the enhanced illusion appears very similar to the percept of a staircase with more plateaus (compare
Figure 12: Gradient representation for a line-based Ehrenstein disk. Luminance shows the input to the gradient system (128 × 128 pixels, 8 lines, inner disk radius 15 pixels), and gradient shows the corresponding gradient representation (1000 iterations). In order to illustrate the brightening effect of the disk, perceptual activity at positions around the lines was set to zero (i.e., the line inducers were deleted from the gradient representation).
Figure 11 with the right image of Figure 8).8 Notice that the teeth profile only enhances the illusion; it does not alter its spatial appearance. In other words, the enhanced illusion does not appear unnatural or distorted to us. Hence, this paradigm could be used in psychophysical experiments as a means for quantifying the strength of Chevreul's illusion.

3.4 Modified Ehrenstein Disk. The original Ehrenstein disk is an illusory brightening of a disk-shaped region enclosed by a set of radially arranged bars. Here we consider an Ehrenstein disk induced by lines, since the gradient system does not predict the bar-induced Ehrenstein disk. Figure 12 demonstrates that the Ehrenstein disk is generated even if the lines do not define an entire (or closed) circle. Although possibly difficult to see in Figure 12, the gradient system also predicts an illusory darkening at the inducer line ends, along with brightness buttons just outside both ends of the lines (Day & Jory, 1978; Kennedy, 1979). Nevertheless, the perceptual activity is comparatively small (order 10−5) at the disk positions where humans perceive an illusory brightening. Furthermore, the gradient system cannot explain other, similar illusions such as Kanizsa's square. This indicates that for this class of illusions, corresponding surface representations play the dominant role in perception. Notice that this result was not included

8 Although we conducted no sophisticated psychophysical study, 10 out of 10 persons we asked considered this to be the case (all participants were naive on the subject).
892
M. Keil
[Figure 13 panel labels: not blurred; blurred with σ = 1; blurred with σ = 2]
Figure 13: Gradient representations for images of Figure 2. The images show the perceptual activity g_ij^(5) at 2000 iterations, with maximum values indicated by numbers (for visualization, images were normalized individually). As is obvious from the maximum values, activity levels of gradient representations increase according to the degree of blur.
to explain the Ehrenstein disk, but rather to indicate that clamped diffusion is functionally more universal than originally thought. 3.5 Real-World Images. The robustness of our proposed mechanisms is demonstrated with real-world images. We believe that successful processing of real-world images is an important point, since the primate visual system evolved with such images; it also opens the door to potential applications. Figure 13 shows how gradient representations depend on the degree of gaussian blurring. Low-pass filtering an image by gaussian blurring transforms it entirely into a nonlinear luminance gradient, causing model ganglion cells to respond along gradients. Therefore, gradient representations of blurred images are similar to their originals, albeit slightly deblurred and with enhanced contrasts. Speed is another relevant issue: representations for nonlinear luminance gradients are readily available, since nonlinear gradients need not be “reconstructed” (i.e., explicitly generated) by clamped diffusion. The degree of local blur correlates with the magnitude of perceptual activity: the higher the degree of blurring, the higher the perceptual activity at corresponding positions in the gradient representation. Figure 14 shows gradient representations for two standard test images. This experiment nicely demonstrates how gradient representations are triggered in a localized fashion. Specifically, the left image reveals focal blur with the peppers in the upper left corner, as a consequence of objects being at remote positions. The specular highlights appearing on various peppers are also represented in the gradient image. Furthermore, gradient representations are engendered where penumbral blur (the transition region between a shadow and the illuminated region surrounding it) is present in the original image. The right image also contains focal blur (the
[Figure 14 panel labels: luminance and gradient]
Early Gradient Representations
Figure 14: Gradient representations for real-world images. The images show the perceptual activity g_ij^(5) at 2000 iterations for two standard real-world images (luminance; size 256 × 256 pixels). Notice that sharply bounded luminance features (nongradients) are attenuated, whereas out-of-focus image features (e.g., specular highlights on the peppers and the bars in the background, respectively) are enhanced.
bar in the background, the mirror, and the mirror image) and highlights (e.g., at the nose). Again, gradient representations are triggered for the corresponding image regions. The deblurring effect is especially prominent for the out-of-focus bar in the background, which appears better focused in the gradient representation. In both examples, no gradient representations are triggered in image regions without blurring. Similarly, surface texture (i.e., even-symmetric luminance features) impedes the generation of gradient representations. Notice that in both images, the specular highlights provide information about surface curvature, and downstream mechanisms for recovering 3D shape could readily retrieve the corresponding information from the gradient representations.
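The link between blur and smooth gradients described above can be illustrated numerically: gaussian blurring turns a sharp luminance step into a luminance ramp whose width grows with the blur scale σ, which is exactly the kind of structure the gradient system represents. The sketch below is purely illustrative (it uses none of the model's actual equations) and measures the ramp width as the number of samples strictly between the two plateau values.

```python
import math

def gaussian_kernel(sigma, radius=None):
    """Discrete, normalized gaussian kernel of the given standard deviation."""
    if radius is None:
        radius = int(3 * sigma)
    k = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def blur(signal, sigma):
    """1D gaussian blur with edge replication at the borders."""
    kern = gaussian_kernel(sigma)
    r = len(kern) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kern):
            idx = min(max(i + j - r, 0), len(signal) - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out

def transition_width(signal, lo=0.05, hi=0.95):
    """Number of samples whose value lies strictly between lo and hi."""
    return sum(1 for v in signal if lo < v < hi)

# A step edge: a sharp luminance border between a dark and a bright region.
step = [0.0] * 50 + [1.0] * 50

w1 = transition_width(blur(step, sigma=1.0))
w2 = transition_width(blur(step, sigma=2.0))
print(w1, w2)  # the ramp widens with sigma
```

The unblurred step has zero transition width; with increasing σ the ramp widens, consistent with the statement that low-pass filtering turns sharp features into nonlinear luminance gradients.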
4 Summary and Conclusions With this contribution, we propose a novel theory of the perception and representation of smooth gradations in luminance (here called luminance gradients) in the early stages of the primate visual system. The corresponding proposed mechanism is called a gradient system. Luminance gradients are, besides object surfaces, prevailing constituents of real-world scenes. Because luminance gradients may encode different types of information, they may accordingly be of different value for object recognition. For example, shading effects can reveal information about the curvature of object surfaces and thus may contribute to recovering an object’s 3D shape. On the other hand, if a soft shadow from one object is cast over a second object’s surface, then it should be ignored for recovering the second object’s shape, because it probably is not a surface attribute. In the latter situation, however, the cast shadow provides information about the objects’ arrangement in depth. Since it is a priori unknown whether the information that is locally encoded by a luminance gradient corresponds to surface properties, the visual system cannot by default bind smooth luminance gradients to object surfaces. But everyday perceptual experience suggests that information on luminance gradients is unlikely to be lost, implying that smooth luminance gradients are explicitly represented in the visual system in parallel with surface representations. With the coexistence of gradient representations, we propose that smooth luminance gradients are discounted from surface representations, consistent with the observation of lightness constancy in V1 (MacEvoy & Paradiso, 2001; Rossi & Paradiso, 1999; Rossi et al., 1996).
Separate representations for surfaces and gradients enable downstream (higher-level) mechanisms for object recognition to locally “decide” whether to use gradient information, according to its consistency with other processes for deriving shape information. The gradient system is proposed to interact with brightness and lightness perception, and it makes consistent predictions: it provides a novel account of Mach band formation and Chevreul’s illusion, and in both cases the available psychophysical data are successfully predicted. Mach bands are explained as the consequence of handling both linear and nonlinear gradients with a single mechanism for generating gradient representations (clamped diffusion). Clamped diffusion preserves the shape of nonlinear gradients but triggers an explicit generation process for building representations of linear gradients. In the latter case, clamped sources and sinks of activity are visible relatively longer during the dynamics of gradient formation (as illustrated by Figure 3): bright Mach bands correspond to clamped sources of activity and dark Mach bands to sinks of activity. Furthermore, the gradient system predicts that the Mach band–like overshoots in brightness and darkness, which are observed with a triangular luminance profile, are caused by
identical neural mechanisms, in agreement with psychophysical data (Ross et al., 1989). This is because a triangular profile is also a linear gradient. A luminance staircase, on the other hand, triggers an erroneous representation of a linear, sawtooth-shaped luminance gradient. Since this gradient is actually absent from the luminance staircase, it forms the underlying substrate of Chevreul’s illusion. We verified this prediction by superimposing a sawtooth-shaped luminance pattern onto a luminance staircase (see Figure 10). This causes Chevreul’s illusion to be enhanced but without perceptually distorting it (i.e., the artificially enhanced illusion is supposed to look like a different Chevreul’s illusion with a correspondingly greater number of plateaus; see note 8). Mixing a sawtooth-shaped luminance profile with a luminance staircase also provides a relatively simple means to quantify the strength of Chevreul’s illusion in psychophysical experiments: one can use it either to cancel the illusion or to match its strength. Notice that other models for brightness perception typically explain Chevreul’s illusion as a by-product of some nonlinearity interposed between the superposition of bandpass filter responses (e.g., du Buf & Fischer, 1995; Pessoa et al., 1995; McArthur & Moulden, 1999). To our knowledge, no other model for brightness perception has attempted to examine the dependence of the strength of Chevreul’s illusion on the number of plateaus with a fixed display size. Clamped diffusion is also successful in predicting the (also illusory) brightening effect of an Ehrenstein disk based on line inducers (see Figure 12). This result should be compared with other models for brightness perception that explain the Ehrenstein disk by neural grouping mechanisms (Heitger, von der Heydt, Peterhans, Rosenthaler, & Kübler, 1998; Gove et al., 1995).
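The clamped-diffusion mechanism invoked above can be sketched in one dimension. The toy version below is not the model's actual circuitry: activity at the two sides of a luminance step is clamped to a source and a sink value while the interior relaxes by discrete diffusion. The steady state is a linear ramp, and during the transient the clamped end points stand out longest, in line with the account of bright and dark Mach bands as clamped sources and sinks of activity.

```python
def clamped_diffusion(n=21, source=1.0, sink=-1.0, rate=0.25, steps=5000):
    """Toy 1D clamped diffusion: the end points are held fixed (a clamped
    source and sink), and interior cells relax by discrete diffusion,
    converging to a linear ramp between the clamped values."""
    u = [0.0] * n
    u[0], u[-1] = source, sink           # clamped source and sink of activity
    for _ in range(steps):
        new = u[:]
        for i in range(1, n - 1):        # update interior cells only
            new[i] = u[i] + rate * (u[i - 1] - 2.0 * u[i] + u[i + 1])
        u = new                           # clamped cells remain unchanged
    return u

u = clamped_diffusion()
# The steady state approximates the linear ramp between source and sink.
ramp = [1.0 - 2.0 * i / 20 for i in range(21)]
print(max(abs(a - b) for a, b in zip(u, ramp)))  # small residual
```

At convergence the interior is the linear interpolation of the clamped values, which is the sense in which clamped diffusion “explicitly generates” representations of linear gradients.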
Although we do not claim to account for brightness illusions like the Kanizsa triangle or the Ehrenstein disk by means of gradient representations, this result is a further demonstration that the gradient system makes consistent predictions and is computationally more universal than originally intended. This universality and its robustness are furthermore demonstrated with real-world images (see Figures 13 and 14), where gradient representations for specular highlights, cast shadows (i.e., penumbral blur), and optically blurred image features (e.g., due to arrangement in depth) are locally triggered. The formation of gradient representations is impeded at image locations with high-contrast texture, that is, even-symmetric contrast configurations (see the right images of Figure 14). One may argue that gradient representations as generated by the gradient system are not useful for the visual system, since they unspecifically lump together all kinds of smooth gradients. However, one has to take into account that the gradient system in its present form does not involve any interpretation of gradient structures. Rather, it clarifies how to segregate and recover smooth luminance gradients from retinal information by using model cells with small receptive fields (i.e., nonglobal information).
Thus, the situation is similar to the one faced with surface representations in V1, where neuronal responses are initially quite unspecific and only later indicate figure-ground relationships (Lamme, 1995; Lee, Mumford, Romero, & Lamme, 1998). Segregation of objects from background is thought to be mediated by feedback from extrastriate or higher visual areas. Is there any evidence for corresponding high-level interactions involving smooth gradients? Knill and Kersten (1991) showed that luminance gradients are interpreted differently depending on an object’s contour curvature. Similarly, Ramachandran (1988) provided a demonstration of how the form of an object’s outline changes the interpretation and perception of luminance gradients generated by illumination and surface curvature, respectively. Adopting a conservative point of view, these findings may be interpreted as evidence in favor of an explicit representation for smooth gradients associated with illumination (i.e., a specific representation), most likely in an extrastriate area. But where should one look for unspecific gradient representations in the brain? Because there are no specific data available to address the issue directly, one can only speculate by extrapolating data about surface representations. A widespread assumption is that surface representations are an element of midlevel vision. However, a recent fMRI study conducted by Sasaki and Watanabe (2004) suggests that V1 is possibly the only cortical area involved in generating surface representations. Nevertheless, their data indicated a different situation for contour processing, where the early visual areas V1, V2, V3/VP, and V4v seem to be involved. According to these recent data, we suggest that unspecific representations of smooth luminance gradients also reside in V1. In addition, the anatomical requirements necessary for the gradient system are met in V1.
First, the gradient system relies on neurons providing information about contours, such as simple cells or complex cells. Second, the gradient system processes all information at the highest spatial acuity, and V1 is the largest visual cortical area with the highest spatial resolution. Third, evidence exists that long-range horizontal connections are involved in lateral activity propagation (e.g., Bringuier, Chavane, Glaeser, & Frégnac, 1999), and specifically in isotropic filling-in processes for generating surface representations (e.g., Sasaki & Watanabe, 2004). Thus, long-range horizontal connections between cortical microcolumns could also serve as a substrate for implementing the clamped diffusion process. Properties of neurons that are involved in unspecific gradient representations can be derived from the model. For example, just like the filling-in process, gradient representations involve a lateral propagation of activity, and hence one should observe latency effects when stimulating such neurons with smooth variations in intensity. One should observe different response latencies for linear and nonlinear gradient stimuli (latencies associated with linear gradients are predicted to be longer). With
stimulation inside the receptive field, the depolarization of gradient neurons should increase with the degree of blur of the stimulus. In contrast, bars and lines should hyperpolarize such neurons where stimulus contours fall within their receptive field. In analogy to surface representations, the responses of gradient neurons may be modulated by attention, and initially unspecific responses may encode a particular type of feature in their later response phase (e.g., surface curvature, a cast shadow, or illumination gradients). Such modification of responses most likely requires feedback from extrastriate visual areas. It is also conceivable that more specific representations are found in extrastriate areas. However, since extrastriate areas usually have lower spatial resolution than V1 (Paradiso, 2002), these areas can be expected to interact with V1 to preserve spatial accuracy. Recently, Ben-Shahar et al. (2002) proposed a differential geometry–based approach for processing shading information. In their approach, shading is initially measured by oriented operators selective for low spatial frequencies. The resulting flow field is then compared with a generic flow field model (osculating flow field), which is locally parameterized over the curvature values of the originally measured flow field. A relaxation labeling algorithm (Hummel & Zucker, 1983) is used to update the original flow field according to the generic model. The latter process finally results in a consistent flow field. In order to increase stability, Ben-Shahar et al. incorporated edge information into the relaxation labeling algorithm. The use of edges can additionally be motivated by recognizing that edges in many cases correlate with material changes. Edge maps were computed using standard techniques: Canny’s edge detector (Canny, 1986) and the logical/linear detector (Iverson & Zucker, 1995). The approach of Ben-Shahar et al. (2002) compares to the gradient system as follows.
The gradient system does not make use of orientation-selective low-spatial-frequency operators for detecting shading. Whereas the model of Ben-Shahar et al. (2002) relies on such operators, the gradient system detects shading indirectly by suppressing features with high spatial frequencies. Nevertheless, boundary maps or edge maps (i.e., high-frequency information) subserve similar goals in both approaches: in the gradient system, they are indispensable for ensuring spatial accuracy of gradient representations, and in the relaxation labeling algorithm, they regulate the growth of flow structures. Since edges act in an inhibitory fashion in both approaches, both gradient representations and flow fields are rejected in textured regions of an image. A further difference between the two methods lies in their explanatory power. The gradient system focuses on predicting psychophysical data (Keil et al., 2005) and linking them to concrete neuronal mechanisms. On the other hand, the generic flow field model predicts anatomical data about long-range horizontal connections acting on perceptual variables such as shading (Ben-Shahar & Zucker, 2004), with relaxation labeling as an abstraction of the corresponding neuronal computations for processing the underlying flow
fields. The latter emphasizes the importance of the orientation component of horizontal connections. Conversely, the gradient system relies only on isotropic operators in its perceptual module (clamped diffusion). Hence, horizontal connections in that case may serve to implement an isotropic (clamped) diffusion process. Notice that the steady state of clamped diffusion corresponds in fact to a dynamic equilibrium, and that clamped diffusion establishes nothing but a representation of a shading flow field. As a consequence, the approach of Ben-Shahar et al. (2002) could in principle act on gradient representations as a subsequent stage in the processing hierarchy. We emphasize that the performance of the gradient system was optimized for real-world images; psychophysical data were used only to constrain the circuitry. The gradient system was not designed ad hoc for the explanation of psychophysical data. All employed mechanisms lie within the range of biophysical possibilities and do not contradict existing anatomical data. The gradient system is a minimally complex model, since all stages are necessary to arrive at our results.
Acknowledgments This work has been partially supported by the following grants: LOCUST IST 2001-38097, and AMOVIP INCO-DC 961646. I thank the two reviewers whose comments helped to significantly improve the first drafts of this article.
References Bakin, J., Nakayama, K., & Gilbert, C. (2000). Visual responses in monkey areas V1 and V2 to three-dimensional surface configurations. Journal of Neuroscience, 20(21), 8188–8198. Ben-Shahar, O., Huggins, P., & Zucker, S. (2002). On computing visual flows with boundaries: The case of shading and edges. In H.-H. Bülthoff, S.-W. Lee, T. Poggio, & C. Wallraven (Eds.), 2nd Workshop on Biologically Motivated Computer Vision (BMCV 2002) (Vol. LNCS 2525, pp. 189–198). Berlin: Springer-Verlag. Ben-Shahar, O., & Zucker, S. (2004). Geometrical computations explain projection patterns of long range horizontal connections in visual cortex. Neural Computation, 16(3), 445–476. Benardete, E., & Kaplan, E. (1999). The dynamics of primate M retinal ganglion cells. Visual Neuroscience, 16, 355–368. Benda, J., Bock, R., Rujan, P., & Ammermüller, J. (2001). Asymmetrical dynamics of voltage spread in retinal horizontal cell networks. Visual Neuroscience, 18(5), 835–848. Blakeslee, B., & McCourt, M. (1997). Similar mechanisms underlie simultaneous brightness contrast and grating induction. Vision Research, 37, 2849–2869.
Blakeslee, B., & McCourt, M. (1999). A multiscale spatial filtering account of the White effect, simultaneous brightness contrast and grating induction. Vision Research, 39, 4361–4377. Blakeslee, B., & McCourt, M. (2001). A multiscale spatial filtering account of the Wertheimer-Benary effect and the corrugated Mondrian. Vision Research, 41, 2487–2502. Bloj, M., Kersten, D., & Hurlbert, A. (1999). Perception of three-dimensional shape influences colour perception through mutual illumination. Nature, 402, 877–879. Bringuier, V., Chavane, F., Glaeser, L., & Frégnac, Y. (1999). Horizontal propagation of visual activity in the synaptic integration field of area 17 neurons. Science, 283, 695–699. Burt, P., & Adelson, E. (1983). The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4), 532–540. Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), 679–698. Castelo-Branco, M., Goebel, R., Neuenschwander, S., & Singer, W. (2000). Neural synchrony correlates with surface segregation rules. Nature, 405, 685–689. Chevreul, M. (1890). In C. T. Martel (Ed.), The principles of harmony and contrast of colours, and their applications to arts. London: Bell. Chichilnisky, E., & Kalmar, R. (2002). Functional asymmetries in ON and OFF ganglion cells of primate retina. Journal of Neuroscience, 22(7), 2737–2747. Cleland, B., Levick, W., & Sanderson, K. (1973). Properties of sustained and transient ganglion cells in the cat retina. Journal of Physiology (London), 228, 649–680. Davey, M., Maddess, T., & Srinivasan, M. (1998). The spatiotemporal properties of the Craik-O’Brien-Cornsweet effect are consistent with “filling-in.” Vision Research, 38, 2037–2046. Day, R., & Jory, M. (1978). Subjective contours, visual acuity, and line contrast. In J. A. J. Krauskopf & B. Wooten (Eds.), Visual psychophysics and physiology (pp. 331–349). New York: Academic Press.
Deschênes, F., Ziou, D., & Fuchs, P. (2004). A unified approach for a simultaneous and cooperative estimation of defocus blur and spatial shifts. Image and Vision Computing, 22, 35–57. DeVries, S., & Baylor, D. (1997). Mosaic arrangement of ganglion cell receptive fields in rabbit retina. Journal of Neurophysiology, 78(4), 2048–2060. du Buf, J. (1992). Lowpass channels and White’s effect. Perception, 21, A80. du Buf, J., & Fischer, S. (1995). Modeling brightness perception and syntactical image coding. Optical Engineering, 34(7), 1900–1911. Elder, J., & Zucker, S. (1998). Local scale control for edge detection and blur estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(7), 699–716. Gabbiani, F., Krapp, H., Koch, C., & Laurent, G. (2002). Multiplicative computation in a visual neuron sensitive to looming. Nature, 420, 320–324. Gerrits, H. (1979). Apparent movements induced by stroboscopic illumination of stabilized images. Experimental Brain Research, 34, 471–488. Gerrits, H., de Haan, B., & Vendrik, A. (1966). Experiments with retinal stabilized images: Relations between the observations and neural data. Vision Research, 6, 427–440.
Gerrits, H., & Vendrik, A. (1970). Simultaneous contrast, filling-in process and information processing in man’s visual system. Experimental Brain Research, 11, 411–430. Gilchrist, A., Kossyfidis, C., Bonato, F., Agostini, T., Cataliotti, J., Li, X., Spehar, B., Annan, V., & Economou, E. (1999). An anchoring theory of lightness perception. Psychological Review, 106(4), 795–834. Ginsburg, A. (1975). Is the illusory triangle physical or imaginary? Nature, 257, 219–220. Gove, A., Grossberg, S., & Mingolla, E. (1995). Brightness perception, illusory contours, and corticogeniculate feedback. Visual Neuroscience, 12, 1027–1052. Grossberg, S. (1983). The quantized geometry of visual space: The coherent computation of depth, form, and lightness. Behavioral and Brain Sciences, 6, 625–692. Grossberg, S., & Hong, S. (2003). Cortical dynamics of surface lightness anchoring, filling-in, and perception (abstract). Journal of Vision, 3(9), 415a. Grossberg, S., & Howe, P. (2003). A laminar cortical model of stereopsis and three-dimensional surface perception. Vision Research, 43, 801–829. Grossberg, S., & Mingolla, E. (1987). Neural dynamics of surface perception: Boundary webs, illuminants, and shape-from-shading. Computer Vision, Graphics, and Image Processing, 37, 116–165. Grossberg, S., & Pessoa, L. (1998). Texture segregation, surface representation, and figure-ground separation. Vision Research, 38, 2657–2684. Grossberg, S., & Todorović, D. (1988). Neural dynamics of 1-D and 2-D brightness perception: A unified model of classical and recent phenomena. Perception and Psychophysics, 43, 241–277. Heitger, F., von der Heydt, R., Peterhans, E., Rosenthaler, L., & Kübler, O. (1998). Simulation of neural contour mechanisms: Representing anomalous contours. Image and Vision Computing, 16, 407–421. Hong, S., & Grossberg, S. (2004). A neuromorphic model for achromatic and chromatic surface representation of natural images. Neural Networks, 17(5–6), 787–808. Hubel, D., & Wiesel, T.
(1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. Journal of Physiology, London, 160, 106–154. Hubel, D., & Wiesel, T. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology, London, 195, 214–243. Hummel, R., & Zucker, S. (1983). On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 267–287. Iverson, L., & Zucker, S. (1995). Logical/linear operators for image curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(10), 982–996. Kaplan, E., Lee, B., & Shapley, R. (1990). New views of primate retinal function. Progress in Retinal Research, 9, 273–336. Kaplan, E., Purpura, K., & Shapley, R. (1987). Contrast affects the transmission of visual information through the mammalian lateral geniculate nucleus. Journal of Physiology (London), 391, 267–288. Keat, J., Reinagel, P., Reid, R., & Meister, M. (2001). Predicting every spike: A model for the responses of visual neurons. Neuron, 30, 803–817. Keil, M., Cristóbal, G., & Neumann, H. (in revision). Gradient representation and perception in the early visual system—a novel account of Mach band formation. Vision Research.
Kelly, F., & Grossberg, S. (2000). Neural dynamics of 3-D surface perception: Figure-ground separation and lightness perception. Perception and Psychophysics, 62, 1596–1619. Kennedy, J. (1979). Subjective contours, contrast and assimilation. In C. Nodine & D. Fisher (Eds.), Perception and pictorial representation. New York: Praeger. Kersten, D., Mamassian, P., & Knill, D. (1997). Moving cast shadows induce apparent motion in depth. Perception, 26, 171–192. Kingdom, F. (2003). Color brings relief to human vision. Nature Neuroscience, 6(6), 641–644. Kingdom, F., & Moulden, B. (1992). A multi-channel approach to brightness coding. Vision Research, 32, 1565–1582. Kinoshita, M., & Komatsu, H. (2001). Neural representations of the luminance and brightness of a uniform surface in the macaque primary visual cortex. Journal of Neurophysiology, 86, 2559–2570. Knau, H., & Spillmann, L. (1997). Brightness fading during Ganzfeld adaptation. Journal of the Optical Society of America A, 14(6), 1213–1222. Knill, D., & Kersten, D. (1991). Apparent surface curvature affects lightness perception. Nature, 351, 228–230. Komatsu, H., Murakami, I., & Kinoshita, M. (1996). Surface representations in the visual system. Cognitive Brain Research, 5, 97–104. Lamb, T. (1976). Spatial properties of horizontal cell responses in the turtle retina. Journal of Physiology, 263, 239–255. Lamme, V. (1995). The neurophysiology of figure-ground segregation in primary visual cortex. Journal of Neuroscience, 15, 1605–1615. Lee, T., Mumford, D., Romero, R., & Lamme, V. (1998). The role of the primary visual cortex in higher level vision. Vision Research, 38, 2429–2454. MacEvoy, S., Kim, W., & Paradiso, M. (1998). Integration of surface information in the primary visual cortex. Nature Neuroscience, 1(7), 616–620. MacEvoy, S., & Paradiso, M. (2001). Lightness constancy in the primary visual cortex. Proceedings of the National Academy of Sciences USA, 98(15), 8827–8831. Mach, E. (1865).
Über die Wirkung der räumlichen Verteilung des Lichtreizes auf die Netzhaut, I. Sitzungsberichte der mathematisch-naturwissenschaftlichen Klasse der Kaiserlichen Akademie der Wissenschaften, 52, 303–322. Maffei, L., & Fiorentini, A. (1972). Retinogeniculate convergence and analysis of contrast. Journal of Neurophysiology, 35, 65–72. Mamassian, P., Knill, D., & Kersten, D. (1998). The perception of cast shadows. Trends in Cognitive Sciences, 2(8), 288–294. Marr, D., & Hildreth, E. (1980). Theory of edge detection. Proc. R. Soc. Lond. B, 207, 187–217. McArthur, J., & Moulden, B. (1999). A two-dimensional model of brightness perception based on spatial filtering consistent with retinal processing. Vision Research, 39, 1199–1219. Mingolla, E., Ross, W., & Grossberg, S. (1999). A neural network for enhancing boundaries and surfaces in synthetic aperture radar images. Neural Networks, 12, 499–511. Mingolla, E., & Todd, J. (1986). Perception of solid shape from shading. Biological Cybernetics, 53, 137–151.
Morrone, M., Burr, D., & Ross, J. (1994). Illusory brightness step in the Chevreul illusion. Vision Research, 12, 1567–1574. Naka, K.-I., & Rushton, W. (1967). The generation and spread of S-potentials in fish (cyprinidae). Journal of Physiology, 193, 437–461. Neumann, H. (1994). An outline of a neural architecture for unified visual contrast and brightness perception (Tech. Rep. No. CAS/CNS-94-003). Boston: Boston University Center for Adaptive Systems and Department of Cognitive and Neural Systems. Neumann, H. (1996). Mechanisms of neural architectures for visual contrast and brightness perception. Neural Networks, 9(6), 921–936. Neumann, H., Pessoa, L., & Hansen, T. (2001). Visual filling-in for computing perceptual surface properties. Biological Cybernetics, 85, 355–369. Nishina, S., Okada, M., & Kawato, M. (2003). Spatio-temporal dynamics of depth propagation on uniform region. Vision Research, 43, 2493–2503. Norman, J., & Todd, J. (1994). The perception of rigid motion in depth from the optical deformations of shadows and occlusion boundaries. Journal of Experimental Psychology: Human Perception and Performance, 20(2), 343–356. Paradiso, M. (2000). Visual neuroscience: Illuminating the dark corners. Current Biology, 10(1), R15–R18. Paradiso, M. (2002). Perceptual and neural correspondence in primary visual cortex. Current Opinion in Neurobiology, 12, 155–161. Paradiso, M., & Hahn, S. (1996). Filling-in percepts produced by luminance modulation. Vision Research, 36(17), 2657–2663. Paradiso, M., & Nakayama, K. (1991). Brightness perception and filling-in. Vision Research, 31(7/8), 1221–1236. Passaglia, C., Enroth-Cugell, C., & Troy, J. (2001). Effects of remote stimulation on the mean firing rate of cat retinal ganglion cells. Journal of Neuroscience, 21, 5794–5803. Pessoa, L. (1996). Mach-bands: How many models are possible? Recent experimental findings and modeling attempts. Vision Research, 36(19), 3205–3277. Pessoa, L., Mingolla, E., & Neumann, H. (1995). 
A contrast- and luminance-driven multiscale network model of brightness perception. Vision Research, 35(15), 2201–2223. Pessoa, L., & Neumann, H. (1998). Why does the brain fill-in? Trends in Cognitive Sciences, 2, 422–424. Pessoa, L., & Ross, W. (2000). Lightness from contrast: A selective integration model. Perception and Psychophysics, 62(6), 1160–1181. Pessoa, L., Thompson, E., & Noë, A. (1998). Finding out about filling-in: A guide to perceptual completion for visual science and the philosophy of perception. Behavioral and Brain Sciences, 21, 723–802. Ramachandran, V. (1988). Perception of shape from shading. Nature, 331, 163–166. Ratliff, F. (1965). Mach bands: Quantitative studies on neural networks in the retina. San Francisco: Holden Day. Rodieck, R. W. (1965). Quantitative analysis of cat retinal ganglion cell response to visual stimuli. Vision Research, 5, 583–601. Ross, J., Morrone, M., & Burr, D. (1989). The conditions under which Mach bands are visible. Vision Research, 29(6), 699–715. Rossi, A., & Paradiso, M. (1996). Temporal limits of brightness induction and mechanisms of brightness perception. Vision Research, 36(10), 1391–1398.
Rossi, A., & Paradiso, M. (1999). Neural correlates of perceived brightness in the retina, lateral geniculate nucleus, and striate cortex. Journal of Neuroscience, 19(14), 6145–6156. Rossi, A., Rittenhouse, C., & Paradiso, M. (1996). The representation of brightness in primary visual cortex. Science, 273, 1391–1398. Rudd, M., & Arrington, K. (2001). Darkness filling-in: A neural model of darkness induction. Vision Research, 41(27), 3649–3662. Sasaki, Y., & Watanabe, T. (2004). The primary visual cortex fills in color. Proceedings of the National Academy of Sciences USA, 101(52), 18251–18256. Sepp, W., & Neumann, H. (1999). A multi-resolution filling-in model for brightness perception. In ICANN99, Ninth International Conference on Artificial Neural Networks (Vol. 470, pp. 461–466). University of Edinburgh. Tittle, J., & Todd, J. (1995). Perception of three-dimensional structure. In M. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 715–718). Cambridge, MA: MIT Press. Todd, J. (2003). Perception of three-dimensional structure. In M. Arbib (Ed.), The handbook of brain theory and neural networks (2nd ed., pp. 868–871). Cambridge, MA: MIT Press. Todd, J., & Mingolla, E. (1983). Perception of surface curvature and direction of illumination from patterns of shading. Journal of Experimental Psychology: Human Perception and Performance, 9(4), 583–595. Todd, J., Norman, J., Koenderink, J., & Kappers, A. (1997). Effects of texture, illumination and surface reflectance on stereoscopic shape perception. Perception, 26, 806–822. Todd, J., Norman, J., & Mingolla, E. (2004). Lightness constancy in the presence of specular highlights. Psychological Science, 15(1), 33–39. Todd, J., & Reichel, F. (1989). Ordinal structure in the visual perception and cognition of smoothly curved surfaces. Psychological Review, 96(1), 643–657. Troy, J., & Robson, J. (1992). Steady discharges of X and Y retinal ganglion cells of cat under photopic illuminance.
Visual Neuroscience, 9, 535–553. Wandell, B. (1995). Foundations of vision. Sunderland, MA: Sinauer. Watt, R., & Morgan, M. (1985). A theory of the primitive spatial code in human vision. Vision Research, 25, 1661–1674. Winfree, A. (1995). Wave propagation in cardiac muscle and in nerve networks. In M. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 1054–1056). Cambridge, MA: MIT Press. Zaghloul, K., Boahen, K., & Demb, J. (2003). Different circuits for ON and OFF retinal ganglion cells cause different contrast sensitivities. Journal of Neurosciene, 23(7), 2645–2654.
LETTER
Communicated by Jean-Pierre Nadal
Memory Capacity for Sequences in a Recurrent Network with Biological Constraints Christian Leibold [email protected] Institute for Theoretical Biology, Humboldt-Universität zu Berlin, Germany, and Neuroscience Research Center, Charité, Medical Faculty of Berlin, Germany
Richard Kempter [email protected] Institute for Theoretical Biology, Humboldt-Universität zu Berlin; Bernstein Center for Computational Neuroscience, Berlin; and Neuroscience Research Center, Charité, Medical Faculty of Berlin, Germany
The CA3 region of the hippocampus is a recurrent neural network that is essential for the storage and replay of sequences of patterns that represent behavioral events. Here we present a theoretical framework to calculate a sparsely connected network's capacity to store such sequences. As in CA3, only a limited subset of neurons in the network is active at any one time, pattern retrieval is subject to error, and the resources for plasticity are limited. Our analysis combines an analytical mean field approach, stochastic dynamics, and cellular simulations of a time-discrete McCulloch-Pitts network with binary synapses. To maximize the number of sequences that can be stored in the network, we concurrently optimize the number of active neurons, that is, pattern size, and the firing threshold. We find that for one-step associations (i.e., minimal sequences), the optimal pattern size is inversely proportional to the mean connectivity c, whereas the optimal firing threshold is independent of the connectivity. If the number of synapses per neuron is fixed, the maximum number P of stored sequences in a sufficiently large, nonmodular network is independent of its number N of cells. On the other hand, if the number of synapses scales as the network size to the power of 3/2, the number of sequences P is proportional to N. In other words, sequential memory is scalable. Furthermore, we find that there is an optimal ratio r between silent and nonsilent synapses at which the storage capacity α = P/[c(1 + r)N] assumes a maximum. For long sequences, the capacity of sequential memory is about one order of magnitude below the capacity for minimal sequences but otherwise behaves similarly to the case of minimal sequences. In a biologically inspired scenario, the information content per synapse is far below theoretical optimality, suggesting that the brain trades off error tolerance against information content in encoding sequential memories. Neural Computation 18, 904–941 (2006)
© 2006 Massachusetts Institute of Technology
1 Introduction Recurrent neuronal networks are thought to serve as a physical basis for learning and memory. A fundamental strategy of memory organization in animals and humans is the storage of sequences of behavioral events. One of the brain regions of special importance for sequence learning is the hippocampus (Brun et al., 2002; Fortin, Agster, & Eichenbaum, 2002; Kesner, Gilbert, & Barua, 2002). The recurrent network in the CA3 region of hippocampus, in particular, is critically involved in the rapid acquisition of single-trial or one-shot, episodic-like memory (Nakazawa, McHugh, Wilson, & Tonegawa, 2004), that is, memory of the sequential ordering of events. It is generally assumed that the hippocampus can operate in at least two states (Lörincz & Buzsáki, 2000). One state, called theta, is dedicated to fast, or one-shot, learning; the other state, referred to as sharp-wave ripple, is dedicated to the replay of stored sequences. Experiments by Wilson and McNaughton (1994), Nádasdy, Hirase, Czurkó, Csicsvari, and Buzsáki (1999), and Lee and Wilson (2002) strongly corroborate the hypothesis that the hippocampus can replay sequences of previously experienced events. The sequences are assumed to be stored within the highly plastic synapses that recurrently connect the pyramidal cells of the CA3 region (Csicsvari, Hirase, Mamiya, & Buzsáki, 2000). In this letter, we tackle the problem of how many sequences can be stored in a recurrent neuronal network such that their replay can be triggered by an activation of adequate cue patterns. This question is fundamental to neural computation, and many classical papers calculate the storage capacity of pattern memories.
There, one can roughly distinguish between two major classes of network models: perceptron-like feedforward networks in which associations occur within one time step (Willshaw, Buneman, & Longuet-Higgins, 1969; Gardner, 1987; Nadal & Toulouse, 1990; Brunel, Nadal, & Toulouse, 1992) and recurrent networks that describe memories as attractors of a time-discrete dynamics (Little, 1974; Hopfield, 1982; Amit, Gutfreund, & Sompolinsky, 1987; Golomb, Rubin, & Sompolinsky, 1990). Also for networks that act as memory for sequences, capacities have been calculated in both the perceptron (Nadal, 1991) and the attractor case (Herz, Li, & van Hemmen, 1991). An important result is that the capacity of sequence memory in Hopfield-type networks is about twice as large as that of a static attractor network (Düring, Coolen, & Sherrington, 1998). Here, we describe sequence replay in a sparsely connected network by means of time-discrete dynamics, binary neurons, and binary synapses. Our model for sequential replay of activity patterns is different from attractor-type models (Sompolinsky & Kanter, 1986; Buhmann & Schulten, 1987; Amit, 1988). In fact, we completely dispense with fixed points of the network dynamics. Instead, we discuss transients that are far from equilibrium
(Levy, 1996; August & Levy, 1999; Jensen & Lisman, 2005). In the case of a sequence consisting of a single transition between two patterns (a minimal sequence), the mathematical structure we choose is similar to the one of Nadal (1991) for an autoassociative Willshaw network (Willshaw et al., 1969). For longer sequences, our analysis resembles that of synfire networks (Diesmann, Gewaltig, & Aertsen, 1999), although we take expectation values as late as possible (Nowotny & Huerta, 2003). Some of the previous approaches to memory capacity explicitly focus on questions of biological applicability. Golomb et al. (1990), for example, address the problem of low firing rates. Herrmann, Hertz, and Prügel-Bennett (1995) explore the biological plausibility of synfire chains. Other approaches assess the dependence of storage capacity on restrictions to connectivity (Gutfreund & Mézard, 1988; Deshpande & Dasgupta, 1991; Maravall, 1999) and on the distribution of synaptic states (Brunel, Hakim, Isope, Nadal, & Barbour, 2004). We propose a framework that allows us to discuss how a combination of several biological constraints affects the performance of neuronal networks that are operational in the brain. An important constraint that supports dynamical stability at a low level of activity is a low mean connectivity. Another one is imposed by limited resources for synaptic plasticity; that is, not every synapse that may combinatorially be possible can really be established. This constraint sets an upper bound to the maximum connectivity between two groups of neurons that are to be associated. Moreover, the number of synapses per neuron may be limited. Another important constraint for sequential memories is the length of replayed sequences, which interferes with dynamical properties of the network. Finally, the capacity of sequential memory is also influenced by the specific nature of a neuronal structure that reads out replayed patterns.
This influence is often neglected by assuming a perfect detector for the network states. Our approach explicitly takes into account that synapses are usually classified into activated and silent ones (Montgomery, Pavlidis, & Madison, 2001; Nusser et al., 1998). Activated synapses have a nonzero efficacy or weight and are essential for the network dynamics. Silent synapses, which do not contain postsynaptic AMPA receptors (Isaac, Nicoll, & Malenka, 1995), are assumed not to contribute to the network dynamics. Changing the state of synapses from the silent to the nonsilent state, and vice versa, acts as a resource for plasticity for the storage of sequences. Synaptic learning rules can set a fixed ratio between silent and nonsilent synapses, which gives rise to an additional constraint. In this letter, we calculate the capacity of sequential memory in a constrained recurrent network by means of a probabilistic theory as well as a mean field approach. Both theoretical models are compared to cellular simulations of networks of spiking units. We thereby describe the memory capacity for sequences as a function of five free parameters. The network size N, the mean connectivity c, and the ratio r between silent and nonsilent
synapses are three network parameters. In addition there are two replay parameters: the sequence length Q and the threshold γ of pattern detection. The number M of active neurons per pattern and the neuronal firing threshold θ are largely considered as dependent variables. It is shown how M and θ are to be optimized to allow replaying a maximum number of sequences. Scaling laws are then derived by using the optimal values for M and θ, both being functions of the five free parameters. 2 Model of a Recurrent Network for the Replay of Sequences In this section we specify notations to describe the dynamics and morphology of a recurrent network that allows for a replay of sequences of predefined activity patterns. A list of symbols used throughout this article can be found in appendix A. 2.1 Binary Synapses Connect Binary Neurons. Let us consider a network of N McCulloch-Pitts (McCulloch & Pitts, 1943) neurons that are described by binary variables x_k, 1 ≤ k ≤ N. At each discrete point in time t, neuron k can be either active, x_k(t) = 1, or inactive, x_k(t) = 0. The state of the network is then denoted by a binary N-vector x(t) = [x_1(t), . . . , x_N(t)]^T. A neuron k that is active at time t provides input to a neuron k′ at time t + 1 if there is an activated synaptic connection from k to k′. Neuron k′ fires at time t + 1 if its synaptic input crosses some firing threshold θ > 0. In order to specify a neuron's input, we classify synapses into activated and silent ones. All activated connections contribute equally to the synaptic input. Silent synapses have no influence on the dynamics. Therefore, the synaptic input of neuron k′ at time t + 1 equals the number of active neurons at time t that have an activated synapse to neuron k′. Silent synapses are assumed to act as a resource for plastic changes, although this article does not directly incorporate plasticity rules.
The total number cN² of activated synapses in the network defines a mean connectivity c > 0, which later will be interpreted as the probability of having an activated synapse connecting a particular neuron to another one. The connectivity through activated synapses in the network is described by the N × N binary matrix C, where C_{k′k} = 1 if there is an activated synapse from neuron k to neuron k′, and C_{k′k} = 0 if there is a silent synapse or no synapse at all. Similarly, the connectivity through silent synapses is denoted by c_s, and the total number of silent synapses in the network is c_s N². Then each neuron has on average (c + c_s)N morphological synapses, which in turn defines the morphological connectivity c_m = c + c_s. Experimental literature (Montgomery et al., 2001; Nusser et al., 1998) usually assesses the ratio c_s/c between the silent and nonsilent connectivities. For convenience, we introduce the abbreviation r = c_s/c. We note that the four connectivity parameters c, c_m, c_s, and r have two independent degrees of freedom.
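As a minimal sketch of this update rule (a hypothetical toy network, far smaller than the N = 10⁵ simulations reported later; the threshold value is arbitrary), one time step of the binary dynamics can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical); the letter's simulations use N = 10^5.
N, c, theta = 1000, 0.05, 3

# C[k2, k1] = 1 iff an activated synapse runs from neuron k1 to neuron k2.
C = (rng.random((N, N)) < c).astype(np.uint8)

def step(x, C, theta):
    """One discrete time step: a neuron fires at t+1 iff the number of its
    active presynaptic partners with activated synapses reaches theta."""
    inputs = C @ x.astype(np.int64)   # promote to avoid uint8 overflow
    return (inputs >= theta).astype(np.uint8)

x = np.zeros(N, dtype=np.uint8)
x[:100] = 1                           # a pattern with M = 100 active neurons
x_next = step(x, C, theta)
```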
2.2 Patterns and Sequences. A pattern or event is defined as a binary N-vector ξ where M elements of ξ are 1 and N − M elements are 0. The network is in the state ξ at time t if x(t) = ξ. An ordered series of events is called a sequence. A minimal sequence is defined as a series of two events, say, a cue pattern ξ^A preceding a target pattern ξ^B. The minimal sequence ξ^A → ξ^B is said to be stored in the network if initialization with the cue x(t) ≈ ξ^A at time t leads to the target x(t + 1) ≈ ξ^B one time step later. Typically, the network only approximately recalls or replays the events of a sequence (see section 4). Sequences of arbitrary length, denoted by Q ≥ 1, are obtained by concatenating minimal sequences of length Q = 1. In the next section, we specify how to set up the connectivity such that a recurrent network can act as a memory for sequences. 3 Embedding Sequences and Storage Capacity For a minimal sequence ξ^A → ξ^B to be stored in the network, one requires an above-average connectivity through activated synapses from the cells that are active in the cue ξ^A to those that are supposed to be active during the recall of the target ξ^B. In what follows, we assume that all morphological synapses from neurons active in the cue pattern to cells active in the target pattern are switched on and none of them is silent. Such a network can be constructed similarly to the one in Willshaw et al. (1969) (see also Nadal & Toulouse, 1990, and Buckingham & Willshaw, 1993). Let us therefore consider a randomly connected network—the probability of having a morphological synapse from one neuron to another one is c_m. Beginning with all synapses being in the silent state, one randomly defines pairs of patterns that are to be connected into minimal sequences. Then one converts those silent synapses into active ones that connect the M active neurons in a cue pattern to the M active neurons in the corresponding target pattern.
Imprinting of sequences stops when the overall connectivity through activated synapses reaches the value c; that is, the total number of activated synapses in the network attains a value of cN². 3.1 Capacity of Sequential Memory. Let us now address the question of how many sequences can be concurrently stored using the above algorithm for a given mean connectivity c and morphological connectivity c_m > c. In so doing, we define the capacity α of sequential memory as the maximum number P of minimal sequences that can be stored, normalized by the number c_m N = (1 + r)cN of morphological synapses per neuron,

\[
\alpha := \frac{P}{c_m N}. \tag{3.1}
\]
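The imprinting procedure described above can be sketched as follows; all parameter values are illustrative toys, and the loop stops at the connectivity criterion just stated before evaluating the capacity of equation 3.1:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy values; the letter's simulations are far larger.
N, M, c_m, c = 2000, 60, 0.10, 0.05

morph = rng.random((N, N)) < c_m        # morphological synapses, all silent at first
active = np.zeros((N, N), dtype=bool)   # activated (nonsilent) synapses

P = 0
n_active = 0
while n_active < c * N * N:             # stop at mean activated connectivity c
    cue = rng.choice(N, M, replace=False)
    target = rng.choice(N, M, replace=False)
    block = np.ix_(target, cue)          # synapses from cue cells onto target cells
    newly_on = morph[block] & ~active[block]
    active[block] |= newly_on            # switch silent morphological synapses on
    n_active += int(newly_on.sum())
    P += 1

alpha = P / (c_m * N)                    # capacity, equation 3.1
```

With these toy numbers, P comes out near the Willshaw-style estimate log(1 − c/c_m)/log(1 − M²/N²) ≈ 770 derived below.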
The number P of minimal sequences that can be stored is assessed by extending the classical derivation of Willshaw et al. (1969). Suppose that we
have two groups of M cells that should be linked into a minimal sequence. For each morphological synapse in the network, the probability that the presynaptic neuron is active in the cue pattern is M/N, and the probability that the postsynaptic neuron is active in the target pattern is also M/N. Then the probability that a synapse is not involved in this specific minimal sequence is 1 − M²/N². Given P stored minimal sequences, the probability that a synapse does not contribute to any of those sequences is [1 − M²/N²]^P, and therefore the probability of a synapse being in a nonsilent state is C = 1 − [1 − M²/N²]^P. For a mean connectivity c, on the other hand, the probability C also equals the ratio between the number cN² of activated synapses and the total number c_m N² of synapses in the network: C = c/c_m. Combining the two approaches, we can derive P for any given pair of connectivities c and c_m = c(1 + r) and find

\[
\alpha = \frac{\log(1 - c/c_m)}{c_m N \log(1 - M^2/N^2)}. \tag{3.2}
\]
Equation 3.2 is valid for all biologically reasonable choices of M, c, and c_m and can also account for nonorthogonal patterns, as in Willshaw et al. (1969). A somewhat simpler expression for α can be obtained in the case M/N ≪ 1. Independent of specific values of c and c_m, we can expand [1 − M²/N²]^P ≈ 1 − PM²/N² to end up with

\[
\alpha = \frac{cN}{c_m^2 M^2} \quad \text{for} \quad 1 \ll M \ll N. \tag{3.3}
\]
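As a numerical sanity check, equations 3.2 and 3.3 can be compared directly; the parameter values below are illustrative only and are chosen so that c/c_m is small, where the linearization behind equation 3.3 is accurate:

```python
from math import log

# Hypothetical parameter values, not taken from the letter.
N, M, c, r = 100_000, 1_600, 0.01, 9.0
c_m = c * (1 + r)

# Exact expression, equation 3.2.
alpha_exact = log(1 - c / c_m) / (c_m * N * log(1 - M**2 / N**2))
# Small-pattern approximation, equation 3.3.
alpha_approx = c * N / (c_m**2 * M**2)
```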
Equation 3.3 can also be interpreted through a different way of estimating the number P of minimal sequences that can be stored: P roughly equals the ratio between the total number cN² of activated synapses in the network and the number c_m M² of activated synapses that link two patterns: P = cN²/(c_m M²). This estimate, however, requires that different patterns are represented by different groups of neurons; there is no overlap between the patterns, which is an excellent approximation for sparsely coded patterns, M/N ≪ 1. Equations 3.2 and 3.3 for the capacity α of sequential memory, however, do not tell us whether embedded sequences can actually be replayed. In the next section, we therefore introduce a method to quantify sequence replay. 4 Replaying Sequences We consider a sequence as being stored in the network if and only if it can be replayed at a given quality. In order to be able to efficiently simulate replay in large networks, this section introduces a probabilistic Markovian
910
C. Leibold and R. Kempter
dynamics that approximates the deterministic cellular simulations well. Finally, we define a measure to quantify the quality of sequence replay. 4.1 Capacity and Dynamical Stability. Let us design a network and patterns such that the number of sequences that can be concurrently stored is as large as possible. From equations 3.2 and 3.3 we see that the capacity α is maximized if the pattern size M is as small as possible. However, M cannot be arbitrarily small, which will be illustrated below and explained in detail in section 5. Examples of how sequence replay depends on network parameters are illustrated by simulations of a network of N = 100,000 McCulloch-Pitts units at low connectivities c = c_s = 0.05. The choice r = c_s/c = 1 roughly resembles values experimentally obtained by Nusser et al. (1998), Montgomery et al. (2001), and Isope and Barbour (2002). Minimal sequences have been concatenated so that nonminimal sequences [ξ^0 → ξ^1 → · · · → ξ^Q] of length Q = 20 are obtained. In the simulations, the network is always initialized with the cue pattern ξ^0 at time step 0. The replay of nonminimal sequences at times t > 0 is then indicated through two order parameters: the number of correctly activated neurons (hits), m_t := x(t) · ξ^t, and the number of incorrectly activated neurons (false alarms), n_t := x(t) · (1 − ξ^t), where 1 = [1, . . . , 1]^T and the symbol '·' denotes the standard dot product. Figure 1 shows sequence replay in cellular simulations for two different pattern sizes (M = 800 and M = 1600). Sequence replay crucially depends on the value of the firing threshold θ. In general, if the threshold is too high, the network becomes silent after a few iterations. If the threshold is too low, the whole network becomes activated within a few time steps. Whether there exist values of θ at which a sequence can be fully replayed, however, also critically depends on the pattern size M.
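The two order parameters just defined can be read off directly from the network state; a minimal sketch with hypothetical toy sizes:

```python
import numpy as np

N, M = 1000, 50                 # toy sizes (hypothetical)
xi = np.zeros(N, dtype=np.uint8)
xi[:M] = 1                      # target pattern: the first M cells should fire

x = np.zeros(N, dtype=np.uint8)
x[:40] = 1                      # 40 of the M pattern cells are active ...
x[M:M + 5] = 1                  # ... plus 5 cells outside the pattern

m_t = int(x @ xi)               # hits
n_t = int(x @ (1 - xi))         # false alarms
```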
At a small pattern size of M = 800, there is no such firing threshold, whereas for a pattern size M = 1600, there is a whole range of thresholds that allow replaying the full sequence. So there is a conflict between the maximization of the capacity of the network, which requires M to be small, and the dynamical stability of replay, which becomes more robust for larger values of M (cf. section 7). In section 5, we will derive a lower bound for the pattern size below which replay is impossible, and we also determine the respective firing threshold. In connection with equation 3.2, these results enable us to calculate the maximum number of sequences that can be simultaneously stored in a recurrent network such that all stored sequences can be replayed. These calculations require a simultaneous optimization of pattern size M and threshold θ . A numerical treatment as shown in Figure 1, however, is infeasible for much larger networks. Therefore, we first introduce a numerically less costly approach. 4.2 Markovian Dynamics. Assessing the dynamics of large networks of neurons by means of computer simulations is mainly constrained by
Figure 1: Stability-capacity conflict. Sequence replay critically depends on both the firing threshold θ and the pattern size M. In all graphs, we show the fraction m_t/M of hits (disks) at time step t and the fraction n_t/(N − M) of false alarms (crosses) during the replay of a nonminimal sequence of length Q = 20. The network consists of N = 10⁵ McCulloch-Pitts neurons with a mean connectivity of activated synapses of c = 5%. The ratio of silent and activated synapses is r = 1. (A) For a pattern size M = 800, full replay is impossible. For high thresholds θ ≥ 64, the sequence dies out, whereas for low thresholds θ ≤ 63, the network activity explodes. (B) For a pattern size of M = 1600, sequence replay is possible for a broad range of thresholds θ between 114 and 133.
the amount of accessible memory. Simulations of a network of N = 10⁵ cells with a connectivity of about c = 5%, as the ones shown in Figure 1, require about 2 GB of computer memory. A doubling of the number of neurons would therefore result in 8 GB and is thus already close to the limit of today's conventional computing facilities. Networks with more than 10⁶ cells, which would need at least 200 GB, are impractical. It is therefore reasonable to follow a different approach for investigating scaling laws of sequential memory. To be able to simulate pattern replay in large networks, we reduce the dynamical degrees of freedom of the network to the two order parameters defined in the previous section: the number m_t of correctly activated neurons (hits) and the number n_t of incorrectly activated neurons (false alarms) at time t (see also Figure 2A). Furthermore, we take advantage of the fact that the network dynamics has only a one-step memory and thus
Figure 2: Pattern size and connectivity matrix. (A) At some time t, the network is assumed to be associated with a specific event ξ^t = ξ^A of size M. We therefore divide the network of N neurons into two parts. The first part consists of the M neurons that are supposed to be active if an event ξ^t is perfectly represented. The second part contains the N − M neurons that are supposed to be inactive. The quantities m_t (hits) and n_t (false alarms) denote the number of active neurons in the two groups at time t. (B) The number m_{t+1} of hits and the number n_{t+1} of false alarms with respect to pattern ξ^{t+1} = ξ^B at time t + 1 are determined by the state of the network at time t, x(t) = ξ^A, and the connectivity matrix C. The average number of synaptic connections between the four groups of cells is described by the reduced connectivity matrix (c_11 c_10; c_01 c_00), which is defined in section 4.2.1.
reduce the full network dynamics to a discrete probabilistic dynamics governed by a transition matrix T (Nowotny & Huerta, 2003; Gutfreund & Mézard, 1988). The transition matrix is defined as the conditional probability T(m_{t+1}, n_{t+1} | m_t, n_t) that a network state (m_{t+1}, n_{t+1}) follows the state (m_t, n_t). We note that due to this probabilistic interpretation, the dynamics of (m_t, n_t) is stochastic, although single units behave deterministically. More precisely, we derive a dynamics for the probability distribution of (m_t, n_t). How to interpret expectation values with respect to this distribution is specified next. 4.2.1 Reduced Connectivity Matrix. In the limit of a large pattern size M, the connectivities c and c_m can be interpreted as probabilities of having synaptic connections—in other words, the probability that in the embedded sequence ξ^A → ξ^B there is an activated synapse from a cell active in ξ^A to a cell active in ξ^B is c_m. This probabilistic interpretation can be formalized by means of a reduction of the binary connectivity matrix C to four mean connectivities (c_11 c_10; c_01 c_00), which are average values over all P minimal sequences stored (see also Figure 2B). First, we define the mean connectivity c_11 between neurons that
are supposed to be active in cue patterns and those that are supposed to be active in their corresponding targets,

\[
c_{11} = \frac{1}{P} \sum_{\{A,B\}} \frac{1}{M^2} \sum_{k,k'=1}^{N} \xi^A_k \, C_{k'k} \, \xi^B_{k'}. \tag{4.1}
\]
Here the sum over {A, B} is meant to be taken over the cue-target pairs of the P different minimal sequences. By construction (see section 3), c_11 is at its maximum c_m. Second, the connectivity c_10 describes activated synapses from cells that are active in cue patterns to cells that are supposed to be inactive in target patterns. Similarly, the mean connectivity c_01 describes activated synapses from neurons that are supposed to be inactive in the cue to those that should be active in the target pattern. Finally, c_00 denotes the mean connectivity between cells that are to be silent in both the cue and the target. The four connectivities are summarized in the reduced connectivity matrix (c_11 c_10; c_01 c_00) (see also Figure 2B). The interpretation of the mean connectivities as probabilities of having activated synaptic connections between two neurons can be considered as the assumption of binomial statistics. This assumption is a good approximation for Willshaw-type networks in the limit N ≫ M ≫ 1 (Buckingham & Willshaw, 1992). Cues and targets of minimal sequences are assumed to be linked as tightly as possible, which results in c_11 = c_m = c(1 + r). The remaining three entries of the reduced connectivity matrix follow from normalization conditions: since every active neuron in a target pattern, for example, ξ^B, receives, on average, cN activated synapses, and those synapses originate from two different groups of neurons in a cue pattern, for example, ξ^A, we have cN = c_11 M + c_01(N − M). Similarly, every inactive neuron in the target pattern receives, on average, cN = c_10 M + c_00(N − M) activated synapses. As a consequence of recurrence, every neuron of a cue pattern projects, on average, to cN postsynaptic neurons. From that we obtain two similar conditions with c_10 and c_01 interchanged and thus c_10 = c_01.
All entries of the reduced connectivity matrix can therefore be expressed in terms of the mean connectivity c, the ratio r of silent and nonsilent connectivities, the pattern size M, and the network size N,

\[
\begin{pmatrix} c_{11} & c_{10} \\ c_{01} & c_{00} \end{pmatrix}
= c \begin{pmatrix} 1+r & 1 - r\,\frac{M}{N-M} \\ 1 - r\,\frac{M}{N-M} & 1 + r\,\frac{M^2}{(N-M)^2} \end{pmatrix}. \tag{4.2}
\]
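A quick numerical check (with illustrative parameter values in the range of the simulations of section 4.1) confirms that the entries of equation 4.2 satisfy the normalization conditions stated above:

```python
# Illustrative values (hypothetical), matching the simulations' order of magnitude.
N, M, c, r = 100_000, 1_600, 0.05, 1.0

c11 = c * (1 + r)
c10 = c * (1 - r * M / (N - M))
c01 = c10                                  # recurrence gives c10 = c01
c00 = c * (1 + r * M**2 / (N - M)**2)

input_active = c11 * M + c01 * (N - M)     # mean input to a target-active cell
input_inactive = c10 * M + c00 * (N - M)   # mean input to a target-inactive cell
# both equal c*N
```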
The assumption of binomial statistics together with the reduced connectivity matrix enables us to specify the transition matrix T as it was defined at the beginning of section 4.2. Calculation of the capacity α for an arbitrary connectivity c_11, that is, c < c_11 < c_m, between cue and target patterns is somewhat more involved than
in the case of section 3, where patterns were connected with the maximum morphological connectivity c_11 = c_m. The scenario c < c_11 < c_m is outlined in appendix B. For 1 ≪ M ≪ N, however, equation 3.3 with c_m replaced by c_11 turns out to be an excellent approximation. 4.2.2 Transition Matrix. Due to statistical independence of the activation of different postsynaptic neurons, the transition matrix can be separated,

\[
T(m_{t+1}, n_{t+1} \mid m_t, n_t) = p(m_{t+1} \mid m_t, n_t)\, q(n_{t+1} \mid m_t, n_t), \tag{4.3}
\]

where p(m_{t+1} | m_t, n_t) is the probability that at time t + 1 a number of m_{t+1} cells are correctly activated, and q(n_{t+1} | m_t, n_t) is the probability of having n_{t+1} cells incorrectly active, given m_t and n_t. Defining the binomial probability

\[
b_{j,l}(x) = \binom{l}{j} x^j (1 - x)^{l-j}, \tag{4.4}
\]

with 0 ≤ x ≤ 1 and 0 ≤ j ≤ l, we obtain

\[
p(m_{t+1} \mid m_t, n_t) = b_{m_{t+1},M}(\rho_{m_t n_t}) \quad \text{and} \quad q(n_{t+1} \mid m_t, n_t) = b_{n_{t+1},N-M}(\lambda_{m_t n_t}), \tag{4.5}
\]

with ρ_{m_t n_t} and λ_{m_t n_t} denoting the probabilities of correct (ρ) and incorrect (λ) activation of a single cell, respectively. Both are specified by the reduced connectivity matrix and the firing threshold θ,

\[
\rho_{m_t n_t} = \sum_{j,k;\; j+k \ge \theta} b_{j,M}\!\left(\frac{m_t}{M}\, c_{11}\right) b_{k,N-M}\!\left(\frac{n_t}{N-M}\, c_{01}\right), \tag{4.6}
\]

\[
\lambda_{m_t n_t} = \sum_{j,k;\; j+k \ge \theta} b_{j,M}\!\left(\frac{m_t}{M}\, c_{10}\right) b_{k,N-M}\!\left(\frac{n_t}{N-M}\, c_{00}\right). \tag{4.7}
\]
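For toy sizes, the binomial probability of equation 4.4 and the sums of equations 4.6 and 4.7 can be evaluated by brute force; the double sum costs O(M(N − M)) and is meant only as an illustration, and all parameter values below are hypothetical:

```python
from math import comb

def b(j, l, x):
    """Binomial probability b_{j,l}(x), equation 4.4."""
    return comb(l, j) * x**j * (1 - x)**(l - j)

def rho_lambda(m, n, M, N, c11, c10, c01, c00, theta):
    """Probabilities of correct (rho) and incorrect (lambda) activation
    of a single cell, equations 4.6 and 4.7."""
    rho = lam = 0.0
    for j in range(M + 1):
        for k in range(N - M + 1):
            if j + k >= theta:
                rho += b(j, M, m / M * c11) * b(k, N - M, n / (N - M) * c01)
                lam += b(j, M, m / M * c10) * b(k, N - M, n / (N - M) * c00)
    return rho, lam

# Example: a perfect cue (m = M, n = 0) in a tiny network.
rho, lam = rho_lambda(m=20, n=0, M=20, N=60,
                      c11=0.5, c10=0.1, c01=0.1, c00=0.05, theta=5)
```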
Equations 4.6 and 4.7 can be understood as adding up the probabilities of all combinations of the number j of hits and the number k of false alarms that together cross the firing threshold θ. The transition matrix T gives rise to probability distributions Ω_t for the number m_t of hits and the number n_t of false alarms. To be able to compare the Markovian dynamics with the network dynamics obtained from cellular simulations (see Figure 1), we calculate the expectation values ⟨m_t⟩ and ⟨n_t⟩ of hits and false alarms with respect to the probability distribution Ω_t for t ≥ 1:

\[
\langle m_t \rangle = \sum_{\mu=1}^{M} \sum_{\nu=1}^{N-M} \mu\, \Omega_t(\mu, \nu \mid m_0, n_0), \tag{4.8}
\]

\[
\langle n_t \rangle = \sum_{\mu=1}^{M} \sum_{\nu=1}^{N-M} \nu\, \Omega_t(\mu, \nu \mid m_0, n_0), \tag{4.9}
\]

where

\[
\Omega_t(m_t, n_t \mid m_0, n_0) = \sum_{\{(m_1,n_1)\}} \cdots \sum_{\{(m_{t-1},n_{t-1})\}} \prod_{\tau=1}^{t} T(m_\tau, n_\tau \mid m_{\tau-1}, n_{\tau-1}) \tag{4.10}
\]

is the probability of having m_t hits and n_t false alarms at some time t, given that the network has been initialized with m_0 = M hits and n_0 = 0 false alarms at time zero. Equation 4.10 can be derived from the recursive formula Ω_t(·|·) = Σ_{(·)} T(·|·) Ω_{t−1}(·|·), and the sums in equation 4.10 are meant to be over all pairs (m_τ, n_τ) ∈ {0, . . . , M} ⊗ {0, . . . , N − M}, for 1 ≤ τ ≤ t − 1. An increase in numerical efficiency is gained from the fact that sums over binomial probabilities can be evaluated by means of the incomplete beta function (Press, Flannery, Teukolsky, & Vetterling, 1992). Moreover, numerical treatment of the Markovian dynamics can take advantage of the separability of T = p q (see equation 4.3). Still, for large N, computing and multiplying p and q in full is costly. We therefore reduced p and q to at most 5000 interpolation points, each of which is assigned the same portion of probability. The reduced vectors are then used to calculate an iteration step t → t + 1. Numerical results provided are thus estimates in the above sense and serve as approximations to the full Markov dynamics. Figure 3 shows a numerical evaluation of the Markovian dynamics for the same parameter regime as used for the cellular simulations in Figure 1. We observe a qualitative agreement between the two approaches but also small differences regarding the upper and lower bounds for the set of firing thresholds allowing stable sequence replay. A further comparison is postponed to section 5. 4.3 Quality of Replay and Detection Criterion. In the examples shown in Figures 1 and 3, the quality of sequence replay at a certain time step is obvious because we typically have to distinguish among only three scenarios: (1) all neurons are silent, (2) all neurons are active, and (3) a pattern is properly represented. If, however, the network exhibits intermediate states, one needs a quantitative measure of whether a particular sequence is actually replayed.
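End to end, this probabilistic machinery can be sketched for a tiny network (all values hypothetical and far below biological scale, with none of the interpolation tricks): the joint distribution of hits and false alarms is propagated for a few steps, then the expectations of equations 4.8 and 4.9 are read off:

```python
from math import comb

# Tiny hypothetical network; the letter's parameters are far larger.
M, N = 8, 24
c11, c10, c01, c00, theta = 0.6, 0.1, 0.1, 0.05, 3

def b(j, l, x):
    return comb(l, j) * x**j * (1 - x)**(l - j)

def activation_probs(m, n):
    """rho and lambda of equations 4.6 and 4.7 (brute force)."""
    rho = lam = 0.0
    for j in range(M + 1):
        for k in range(N - M + 1):
            if j + k >= theta:
                rho += b(j, M, m / M * c11) * b(k, N - M, n / (N - M) * c01)
                lam += b(j, M, m / M * c10) * b(k, N - M, n / (N - M) * c00)
    return rho, lam

# Distribution over (hits, false alarms); start from a perfect cue (eq. 4.10).
omega = {(M, 0): 1.0}
for _ in range(3):                                   # three replay steps
    nxt = {}
    for (m, n), pr in omega.items():
        rho, lam = activation_probs(m, n)
        for m1 in range(M + 1):
            pm = pr * b(m1, M, rho)                  # p of equation 4.5
            for n1 in range(N - M + 1):
                w = pm * b(n1, N - M, lam)           # times q of equation 4.5
                if w > 1e-12:                        # drop negligible mass
                    nxt[(m1, n1)] = nxt.get((m1, n1), 0.0) + w
    omega = nxt

mean_m = sum(m * p for (m, n), p in omega.items())   # <m_t>, equation 4.8
mean_n = sum(n * p for (m, n), p in omega.items())   # <n_t>, equation 4.9
```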
For this purpose, we specify the quality Γ at which single
Figure 3: Stability-capacity conflict for Markovian network dynamics. The expected fraction of hits mt/M (disks) and false alarms nt/(N − M) (crosses) are plotted as a function of time t after the network has been initialized with the cue pattern at t = 0. The parameters N = 10^5, c = 5%, r = 1, and Q = 20 are the same as in Figure 1. (A) For a pattern size M = 800, full replay is impossible. For high firing thresholds θ ≥ 64, the sequence dies out, whereas for low thresholds θ ≤ 63, the network activity explodes. This is identical to Figure 1, although the time courses of hits and false alarms differ slightly. (B) For M = 1600, sequence replay is possible for thresholds 112 ≤ θ ≤ 133, whereas in Figure 1, we have obtained 114 ≤ θ ≤ 133.
patterns ξt are represented by the actual network state xt. We consider Γ to be a function of the numbers mt and nt of hits and false alarms, respectively (see section 4.2). The quality function

Γ(mt, nt) := mt/M − nt/(N − M)    (4.11)

is chosen such that a perfect representation of a pattern is indicated by Γ = 1. Random activation of the network, on the other hand, yields |Γ| ≪ 1 in the generic scenario 1 ≪ M ≪ N. The quality function weights hits much more strongly than false alarms, similar to the so-called "normalized winner-take-all" recall as introduced by Graham and Willshaw (1997) or Maravall (1999). Equation 4.11 is physiologically inspired by a downstream neuron receiving excitation from hits and inhibition from false alarms.
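For concreteness, the quality function and detection criterion can be written down in a few lines (a minimal sketch; the function names are ours, and the random-activation counts are illustrative):

```python
def quality(m, n, M, N):
    """Quality of replay, equation 4.11: hit fraction minus false-alarm fraction."""
    return m / M - n / (N - M)

def detected(m, n, M, N, gamma):
    """Detection criterion of equation 4.12."""
    return quality(m, n, M, N) >= gamma

M, N = 1600, 100_000
# A perfect representation of a pattern yields quality 1.
g_perfect = quality(M, 0, M, N)
# Random activation of a fraction M/N of all units activates about
# M*(M/N) of the M "on" units and (N-M)*(M/N) of the N-M "off" units,
# so the hit and false-alarm fractions nearly cancel.
g_random = quality(26, 1574, M, N)
```

With γ = 0.7, the perfect state passes the criterion while the random state clearly fails it.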
Capacity for Sequences Under Biological Constraints
917
We say a pattern is replayed correctly at time t if the detection criterion

Γ(mt, nt) ≥ γ    (4.12)

is satisfied, where γ denotes the threshold of detection. A sequence of Q patterns is said to be replayed if the final target pattern in the Qth time step is correctly represented: Γ(mQ, nQ) ≥ γ. Here, we implicitly assume that all the patterns of a sequence are represented at least as properly as the last one. Similar to equation 4.12, we specify a detection criterion for sequence replay approximated by the Markovian dynamics,

⟨Γ⟩ := Γ(⟨mQ⟩, ⟨nQ⟩) ≥ ⟨γ⟩,
(4.13)
where the expectation values ⟨mQ⟩ and ⟨nQ⟩ are obtained from the Q-times iterated transition matrix T^Q for the initial condition (m0, n0) = (M, 0). The criteria 4.12 and 4.13 are obviously different. For 1 ≪ M ≪ N, however, they are almost equivalent with ⟨γ⟩ ≈ γ because the distribution of the quality measure is typically unimodal and sharply peaked, with variance below 1/(4M) + 1/[4(N − M)]. Moreover, we will see in the next section that the specific value of the detection threshold does not affect the scaling laws for sequential memory.

5 Scaling Laws for Minimal Sequences

The capacity α of sequential memory is proportional to M^−2 (see equation 3.3). In order to maximize α, one therefore seeks a minimal pattern size M at which the replay of sequences satisfies a given detection threshold γ. In this section, we assess this minimal pattern size for minimal sequences (Q = 1) and sparse patterns (1 ≪ M ≪ N). In particular, we explain why the minimal pattern size is independent of the network size N.
In the case 1 ≪ M ≪ N, the reduced connectivity matrix (c11, c10; c01, c00) in equation 4.2 can be approximated through c (1 + r, 1; 1, 1); neurons that are active in cue patterns are connected to neurons that should be active in target patterns with probability cm = c(1 + r). Otherwise, the connectivity is about c (see Figure 2).

5.1 Hits and False Alarms in Pattern Recall. At some time t, only those M neurons are supposed to be active that belong to the cue pattern ξA. We then require a particular minimal sequence ξA → ξB to be imprinted such that at time t + 1 event ξB is recalled. We have assumed that the number j of inputs to each of the M "on" neurons that should be active at time t + 1 is binomially distributed with probability distribution b_{j,M}(c + cs) (see equation 4.4). In the same way, the input distribution for the N − M "off"
Figure 4: Mean quality ⟨Γ⟩ of replay and threshold parameters κ+ and κ−. (A) Probability density of the number of synaptic inputs for "off" units, which are supposed to be inactive during the recall of a target pattern. The vertical dashed line indicates the firing threshold θ. The gray area represents the probability ⟨n⟩/(N − M) of having a false alarm. (B) Same as in A but for "on" units, which are supposed to be active. The gray area represents the probability ⟨m⟩/M of having a hit. (C) For 1 ≪ M ≪ N, the binomial distributions in A and B can be approximated by normal distributions. The probability of hits minus that of false alarms equals the gray area under the normal distribution between −κ− and κ+. From equation 5.3, we see that this area can also be interpreted as the mean quality ⟨Γ⟩ of replay. (D) Pattern size M as a function of κ+ for different replay qualities ⟨Γ⟩ at constant r = 1 and c = 0.01; see equations 5.1 and 5.4. The dashed line connects the minima of M.
cells that should be inactive at time t + 1 is b_{j,M}(c). As a result, a neuron that is supposed to be inactive receives, on average, input through cM activated synapses, with a standard deviation √(c(1 − c)M). To avoid unintended firing, we require a firing threshold θ that is somewhat larger than cM. The larger the threshold, the more noise-induced firing due to fluctuations in the number of synapses is suppressed. Let us take a threshold θ = cM + κ+√(c(1 − c)M), where κ+ is a threshold parameter that determines the number of incorrectly activated neurons (Brunel et al., 2004), called false alarms. For κ+ = 1, for example, we have ⟨nt+1⟩ ≈ 0.16 (N − M) false alarms (see Figure 4A). On the other hand, the threshold θ has to be small enough so that a neuron that is supposed to be active during event ξB is indeed activated. Each of these neurons receives, on average,
cmM inputs, with standard deviation √(cm(1 − cm)M). A recall of ξB is therefore achieved by a threshold that is somewhat smaller than cmM, that is, θ = cmM − κ−√(cm(1 − cm)M), where κ− is another threshold parameter that determines the number of correctly activated neurons, called hits. For κ− = 2, for example, we have ⟨mt+1⟩ ≈ 0.98 M hits (see Figure 4B). The firing threshold θ is assumed to be the same for all neurons. Hence, combining the above two conditions, we find cM + κ+√(cM(1 − c)) = cmM − κ−√(cmM(1 − cm)), which then leads to expressions for the pattern size

M = (1/c) {[κ+√(1 − c) + κ−√((1 + r)(1 − c(1 + r)))] / r}²    (5.1)

and the firing threshold

θ = cM + κ+√(c(1 − c)M).    (5.2)
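The algebra behind equations 5.1 and 5.2 can be checked numerically: at the pattern size M of equation 5.1, the "off"-unit and "on"-unit threshold conditions coincide, and for κ+ = 1 the binomial tail above θ gives the false-alarm level quoted in the text (a sketch with illustrative parameter values, not code from the paper):

```python
import math

def pattern_size(c, r, kp, km):
    """Equation 5.1: pattern size at which both threshold conditions meet."""
    cm = c * (1 + r)
    s = kp * math.sqrt(1 - c) + km * math.sqrt((1 + r) * (1 - cm))
    return (s / r) ** 2 / c

c, r, kp, km = 0.05, 1.0, 1.0, 2.0
cm = c * (1 + r)
M = pattern_size(c, r, kp, km)
theta_off = c * M + kp * math.sqrt(c * (1 - c) * M)    # equation 5.2
theta_on = cm * M - km * math.sqrt(cm * (1 - cm) * M)  # "on"-unit condition
# theta_off equals theta_on by construction of equation 5.1.

# False-alarm fraction for kappa_+ = 1: tail of Binomial(M, c) at or above theta.
Mi = round(M)
p_false = sum(math.comb(Mi, j) * c ** j * (1 - c) ** (Mi - j)
              for j in range(math.ceil(theta_off), Mi + 1))
```

Here p_false comes out somewhat above the gaussian value of 0.16, reflecting the skew of the binomial distribution at this moderate M.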
The pattern size M in equation 5.1 is independent of the network size N and scales like c^−1 for small values of c. Moreover, the firing threshold θ in equation 5.2 is independent of the network size N. For small mean connectivities c, the firing threshold θ is also independent of c. We emphasize that the validity of these scaling laws requires an almost perfect initialization of the cue pattern.

5.2 Optimal Pattern Size and Optimal Firing Threshold. We now argue that the firing threshold parameters κ+ and κ− in equation 5.1 can be chosen such that M is minimal and, hence, the storage capacity is maximal. As indicated by the gray areas of the binomial distributions in Figures 4A and 4B, κ+ and κ− determine the mean numbers of false alarms ⟨n⟩ and hits ⟨m⟩, respectively. For 1 ≪ M ≪ N, these binomial distributions are well approximated by gaussians, and we have ⟨n⟩/(N − M) = [1 − erf(κ+/√2)]/2 and ⟨m⟩/M = [1 + erf(κ−/√2)]/2, where the error function erf(x) := (2/√π) ∫_0^x dt exp(−t²) is the cumulative distribution of a gaussian. These approximations yield

⟨Γ⟩(m, n) = [erf(κ−/√2) + erf(κ+/√2)]/2,
(5.3)
which can be interpreted as the area under a normal distribution between −κ− and +κ+ (see Figure 4C). From equation 5.3, we see that for a given mean quality ⟨Γ⟩ of replay, the threshold parameters κ+ and κ− are not independent. More precisely,
for some given detection criterion γ = ⟨Γ⟩ and κ+ > √2 erf^−1(2γ − 1), equation 5.3 yields

κ− = √2 erf^−1[2γ − erf(κ+/√2)].    (5.4)
For fixed ⟨Γ⟩ = γ, one therefore can choose κ+ in equation 5.1 such that the pattern size M becomes minimal. At this minimal pattern size, the capacity α in equation 3.3 reaches its maximum, and encoding of events is as sparse as possible. Let us therefore call this minimum value of M the optimal pattern size Mopt for sequential memory. The dashed line in Figure 4D indicates that Mopt := min_{κ+} M is located at values κ+ ≈ 1. We also observe that the larger the detection threshold γ, the larger is Mopt. From equation 5.1, we find that for small connectivities c ≪ 1, as they occur in biological networks, the minimum pattern size Mopt can be phrased as
Mopt = (1/c) [M(r, γ) + O(c)],    (5.5)
where M(r, γ) is a function of r and γ that has to be obtained by numerical minimization. Here, the order function O(c^k) is defined through lim_{c→0} c^−k O(c^k) = const. for k > 0. At values r = 1 and γ = 0.7, for example, we have M = 6.1, corroborating the scaling law Mopt ∝ c^−1. For an optimal pattern size Mopt, we can find the optimal firing threshold θopt from equation 5.2. In first approximation, θopt is independent of the connectivity c and the network size N but depends on r and γ,

θopt = T(r, γ) + O(c).
(5.6)
For example, r = 1 and γ = 0.7 account for θopt ≈ 9.1. The dependencies of Mopt and θopt on the connectivity c are indicated in Figure 5 through solid lines. Both Mopt and θopt increase with increasing detection threshold γ. These mean field results match numerical simulations well: in cellular network simulations (open circles in Figure 5), Mopt and θopt were determined as the minimal M and the corresponding θ that account for replay at a fixed detection threshold γ = 0.5. The numerical evaluation of the Markovian network dynamics as defined in section 4.2 (filled symbols in Figure 5) confirms the analytical results for a wider range of c and γ.
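The function M(r, γ) of equation 5.5 can be obtained by a brute-force scan over κ+, using equation 5.4 to eliminate κ−. The following sketch (the bisection-based inverse error function and the grid spacing are our implementation choices, not from the paper) reproduces M(1, 0.7) ≈ 6.1 with the minimum near κ+ ≈ 1:

```python
import math

def erfinv(y):
    """Inverse of math.erf on (0, 1), by bisection (accurate enough here)."""
    lo, hi = 0.0, 6.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if math.erf(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def pattern_size(c, r, kp, km):
    """Equation 5.1."""
    cm = c * (1 + r)
    s = kp * math.sqrt(1 - c) + km * math.sqrt((1 + r) * (1 - cm))
    return (s / r) ** 2 / c

def optimal(c, r, gamma):
    """Minimize equation 5.1 over kappa_+, with kappa_- from equation 5.4."""
    sqrt2 = math.sqrt(2)
    kp = sqrt2 * erfinv(2 * gamma - 1) + 1e-3  # smallest admissible kappa_+
    best_M, best_kp = float("inf"), kp
    while kp < 4.0:
        km = sqrt2 * erfinv(2 * gamma - math.erf(kp / sqrt2))
        M = pattern_size(c, r, kp, km)
        if M < best_M:
            best_M, best_kp = M, kp
        kp += 0.005
    return best_M, best_kp

c = 1e-6  # near the limit c -> 0, so c * M_opt approaches the function M(r, gamma)
M_opt, kp_star = optimal(c, r=1.0, gamma=0.7)
curly_M = c * M_opt
```

The scan also reproduces the dashed line of Figure 4D: the minimum sits at κ+ slightly above 1.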
Figure 5: Optimal pattern size Mopt and optimal firing threshold θopt. Lines depict results from the mean field theory (equations 5.5 and 5.6). We also show numerical simulations (c.s.) of the network introduced in section 4.1 (empty circles, γ = 0.5) and Markovian dynamics defined in section 4.2 (filled symbols, γ = 0.5, 0.7, 0.8, 0.9). (A) For small connectivities c, the optimal pattern size Mopt scales like c^−1 and increases with increasing γ. (B) The optimal threshold θopt is almost independent of the connectivity c for c ≲ 10%, and θopt increases with increasing γ. Further parameters: sequence length Q = 1, network size N = 250,000, plasticity resources r = 1. For the Markovian dynamics, we used Brent's method (Press et al., 1992) to numerically find a firing threshold θ as a root of the implicit equation ⟨mQ⟩/M − ⟨nQ⟩/(N − M) = γ, which is the detection criterion. By subsequently reducing M, we end up with a minimal value Mopt for which the detection criterion ⟨Γ⟩ = γ can be fulfilled. The threshold root that is obtained at Mopt is called θopt.
The lower bound Mopt for the pattern size in equation 5.5 enables us to determine an upper bound for the capacity α of sequential memory. Combining equations 3.3 and 5.5, we find

α = cN (1 + r)² [1 + O(c²)] / M(r, γ)².    (5.7)
We note that α is linear in the connectivity c and the network size N, decreases with increasing γ, and has a nontrivial dependency on the plasticity resources r that will be evaluated below. This scaling law for minimal sequences can now be used to study the storage of sequences in biologically feasible networks that face certain constraints.

6 Constrained Sequence Capacity

Biological and computational networks generally face certain constraints. Those constraints can lead to limiting values and interdependencies of the network parameters c, N, and r. Some constraints and their implications for
the optimization of the capacity α of sequential memory in equation 5.7 are discussed in this section.

6.1 Limited Number of Synapses per Neuron. A biological neuron may have a limited number cN of synapses. If cN is constant, we find from equation 5.7 (for constant r and γ)

α = const.  and  P = const.

Increasing the capacity α therefore cannot be achieved by increasing N. Numerical results in Figure 6A (symbols) confirm this behavior for c ≪ 1.
Figure 6: Influence of constraints on the optimal pattern size Mopt and the capacity α of sequential memory. (A) Synapses-per-neuron constraint. For a fixed number cN of synapses per neuron, we find Mopt ∝ N and α = const. as N → ∞. The capacity α increases with increasing cN. Tilted solid lines connect symbols that refer to constant connectivities, for example, c = 0.01, 0.1, 0.25, 0.5. (B) Synapses-per-network constraint. For a fixed number cN² of synapses in the network, we find Mopt ∝ N² and α ∝ N^−1 as N → ∞. The capacity α increases with increasing cN². For both constraints, cN = const. and cN² = const., there is an optimal network size at which the capacity α reaches its maximum. For r = 1, this maximum occurs at c ≈ 0.5. A further increase in c is impossible since the morphological connectivity cm = c(1 + r) cannot exceed 1. Other parameters are Q = 1 and γ = 0.7. Dotted lines link symbols and are not obtained by mean field theory.
For r = 1, the capacity α reaches its maximum at c ≈ 0.5, where we have cm = c(1 + r) = 1, and the network can be considered an undiluted Willshaw one. For c → 0.5, the scaling law α = const. (solid line) underestimates the storage capacity because the O(c²) term in the mean field equation 5.7 has been neglected.
In biologically relevant networks, we typically have c ≪ 1, and thus, for cN = const., we face the scaling law P = α (1 + r) cN = const. Therefore, the number cN of synapses a single neuron can support fully determines the network's computational power for replaying sequences, in the sense that adding more neurons to the network does not increase α or P.
In the CA3 region of the rat hippocampus, for example, we have cmN ≈ 12,000 recurrent synapses at each pyramidal cell (see Urban, Henze, & Barrionuevo, 2001, for a review). The network size of CA3 is N ≈ 2.4 · 10^5 (Rapp & Gallagher, 1996). From these numbers, r = 1, and cN = const., we derive the connectivity c ≈ 0.025. A comparison of these numbers with Figure 6A leads to estimates for the minimal pattern size being on the order of Mopt ≈ 200 cells, a storage capacity of α ≈ 15 minimal sequences per synapse at a neuron, and P ≈ 1.8 · 10^5 minimal sequences per CA3 network. The saturation of α and P at about N = 10^5 for cN = 6,000 (see Figure 6A) may explain why the CA3 region has relatively few neurons (N ≲ 10^6 in humans) despite its seminal importance for episodic memory.

6.2 Limited Number of Synapses in the Network. Numerical simulations of artificial networks are constrained by the available computer memory, which limits the number cN² of activated synapses in the network. For cN² = const., we find from equation 5.7 (for constant r and γ)

α ∝ N^−1  and  P ∝ N^−2.
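Both constraint-induced scaling laws follow from the leading term of equation 5.7 and can be checked in a few lines (a sketch; M(1, 0.7) ≈ 6.1 is taken from section 5.2, and the constraint values are illustrative):

```python
def alpha(c, N, r=1.0, M_fun=6.1):
    """Leading term of equation 5.7: alpha = c N (1 + r)^2 / M(r, gamma)^2."""
    return c * N * (1 + r) ** 2 / M_fun ** 2

# Synapses-per-neuron constraint: c*N fixed, so alpha is independent of N.
cN = 10_000
a_neuron = [alpha(cN / N, N) for N in (10**5, 10**6, 10**7)]

# Synapses-per-network constraint: c*N^2 fixed, so alpha falls off as 1/N.
cN2 = 10**9
a_network = [alpha(cN2 / N**2, N) for N in (10**5, 10**6, 10**7)]
```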
Therefore, an increase in both α and P can be achieved only by reducing the network size N at the expense of increasing the connectivity c. Numerical results in Figure 6B confirm this behavior for c ≪ 1. The capacity α increases with increasing c and, for r = 1, assumes its maximum at the upper bound c = 0.5, where cm = 1. For c → 0.5, the scaling law α ∝ N^−1 (solid line) underestimates the storage capacity, similar to Figure 6A. We conclude that computer simulations of neural networks with constant cN² perform worse in storing sequences the more the connectivity resembles the biologically realistic scenario c ≪ 1.

6.3 On the Ratio of Silent and Activated Synapses. In the previous two sections, we have assumed a constant ratio r between the connectivity cs through silent synapses and the connectivity c through nonsilent synapses. The specific choice r = 1 was motivated by neurophysiological estimates from Nusser et al. (1998) and Montgomery et al. (2001). We now focus on
Figure 7: Dependence of sequence replay on the resources r of synaptic plasticity for a constant total number (c + cs)N of synapses per neuron. Mean field theory (solid lines) explains numerical results obtained from the Markovian dynamics (symbols) well as long as r < 10 and θopt ≳ 4. Below θopt ≈ 4, the discreteness of θopt limits the validity of the mean field theory. (A) The optimal pattern size Mopt (top) decreases with increasing r and saturates at values (c + cs)Mopt = 1 (symbols). As a result, the capacity α (bottom) increases with r until Mopt has reached its lower bound, and α exhibits a maximum. A further increase in r reduces c but leaves Mopt constant and thus leads to a decrease of α (see equations 3.2 and 3.3). (B) The optimal firing threshold θopt decreases with increasing r to its lower bound 1 (symbols). The weak dependence of θopt on cm = c + cs is indicated by the gray lines. Further parameters for A and B: N = 250,000, γ = 0.7, Q = 1.
how the storage capacity α depends on this ratio r, assuming that the total number cmN of morphological synapses per neuron is constant. Because of cm = c(1 + r), an increase in r increases cs but reduces c. We note that this constraint is equivalent to a fixed number cN of activated synapses per neuron for constant r, a scenario evaluated in section 6.1. For constant (c + cs)N, numerical results in Figure 7A (symbols) show that the capacity α exhibits a maximum as a function of r. The maximum capacity occurs at a pattern size at which (c + cs)Mopt = c11 Mopt = 1, that
is, the association from a cue pattern to an "on" neuron of a target pattern is supported by a single spike, on average. For larger r, the optimal firing threshold θopt remains at its minimum value of one (see Figure 7B). An increase of r beyond its optimum reduces c but leaves Mopt constant and thus leads to a decrease of α (see equations 3.2 and 3.3).
Values of r larger than 1 are thus beneficial for good memory performance, in our case the storage capacity α. Similarly, Brunel et al. (2004) find a high ratio r to be necessary for increasing the signal-to-noise ratio of readout from a perceptron-like structure. These findings raise the question why values of r found in some experiments (Nusser et al., 1998; Montgomery et al., 2001) are in the range r ≲ 1. We suppose that the specific value of r is due to the interplay between the recurrence in the hippocampus and the locality of the synaptic learning rule (see section 9). In contrast, Isope and Barbour (2002) report r ≈ 4 at the cerebellar parallel fibers, which, locally, is a feedforward system.

6.4 Scale-Invariance of Sequential Memory. Given the scaling law α ∝ cN of the storage capacity in equation 5.7, we can ask how the connectivity in a brain region should be set up in order to have scale-invariant sequential memory, which means P ∝ N. From P = α cm N ∝ c²N² (see equation 5.7), we then find c ∝ N^−1/2 or, equivalently, that the total number cN² of synapses in the network is proportional to N^3/2. Surprisingly, the latter result is in agreement with findings from Stevens (2001; personal communication, 2005) in visual cortex and other brain areas. Thus, an N^3/2 law for the number of synapses can generate a scalable architecture for associative memory.
To summarize this section, constraints have a seminal influence on the scaling laws of the capacity of sequential memory, and different constraints lead to fundamentally different strategies for optimizing the performance of networks for replaying sequences.
7 Nonminimal Sequences

In addition to constraints on intrinsic features of the network, like a small connectivity or a limited number of synapses, there are also constraints that may be imposed on a sequence memory device from outside, for example, a fixed detection threshold γ and a nonminimal sequence length Q.

7.1 Finite Sequences (Q > 1). To determine the capacity α for nonminimal sequences Q > 1 as a function of the network size N, we apply the
Markovian approximation as introduced in section 4.2. As in the case Q = 1, replay of sequences is initialized with an ideal representation of the cue pattern, (mt, nt) = (M, 0) for t = 0. Patterns that occur later in the sequence at t ≥ 1, however, are not represented in an ideal way; typically, there is a finite number of false alarms nt, and the number of hits mt is generally below M (see also Figure 3). The recall of patterns amid a sequence therefore depends on noisy cues. As a consequence, for Q > 1, we expect the dependence of the optimal pattern size Mopt and the optimal threshold θopt on the network size N to differ from the case Q = 1.
Assuming a constrained number of synapses per neuron (cN = const.), we nevertheless find that for Q > 1 the dependence of the optimal pattern size Mopt on N is almost linear for large N (see Figure 8A). Accordingly, the capacity α is nearly independent of N (see Figure 8B). Moreover, the optimal firing threshold θopt is almost constant for large N (see Figure 8C).
Figure 8: Optimal event size Mopt, capacity α, and optimal firing threshold θopt for nonminimal sequences in networks with a constrained number cmN = 10,000 of synapses per neuron. (A) The optimal pattern size Mopt increases by about half an order of magnitude between Q = 1 and Q = 4 (three bottom lines) and saturates for Q ≥ 8 (three topmost lines). (B) The capacity α is almost constant for large N and decreases with increasing Q. (C) The optimal firing threshold θopt reflects the dependencies of the optimal pattern size Mopt. Numerical results (symbols) are obtained for γ = 0.7 and r = 1. Graphs for Q → ∞ (lines) are obtained from the mean field equation 7.1.
These results for Q > 1 resemble the ones for Q = 1 shown in Figure 6, some of which are also indicated by disks in Figure 8. One reason for this correspondence is that patterns within a sequence are typically replayed at a high quality (see Figures 1 and 3). Figure 8B also shows that α is a decreasing function of Q. We note that α still refers to the number of minimal sequences. The maximum number of stored sequences of length Q is then the Qth fraction of P = α(1 + r) cN. Compared to Q = 1, the storage capacity α drops by about an order of magnitude for Q = 2, 4 and soon, at Q ≳ 8, arrives at a baseline value for Q → ∞ (solid line) that was obtained from a mean field approximation to be explained below. We thus conclude that nonminimal sequences impose no fundamental limit on the memory capacity for sequences. However, due to discrete time, our model cannot comprise the temporal dispersion of synchronous firing, which may limit the replay of long sequences in biological networks (Diesmann et al., 1999).

7.2 Infinite Sequences (Q → ∞). A beneficial consequence of the weak dependence of the capacity α on the sequence length for Q ≳ 8 is that sequential memory for large Q can be discussed more easily in the framework Q → ∞. Such a discussion requires finding the fixed-point distributions of the transition matrix T defined in equation 4.3. Assuming that the fixed-point distributions for hits m and false alarms n are unimodal, and given the case N ≫ 1, we can reduce the problem of finding fixed-point distributions of m and n to the much simpler problem of finding fixed points of the mean values ⟨m⟩ and ⟨n⟩. Let us therefore introduce the iterated map,
(⟨mt+1⟩, ⟨nt+1⟩) = ⟨T⟩ · (⟨mt⟩, ⟨nt⟩),    (7.1)
for the mean values of the order parameters. To specify the map ⟨T⟩ in accordance with the Markovian dynamics introduced in section 4.2, we define the mean synaptic inputs to "on" and "off" units,

µon = c11⟨m⟩ + c01⟨n⟩  and  µoff = c10⟨m⟩ + c00⟨n⟩,
respectively, as well as the variances,

σ²on = c11⟨m⟩(1 − c11⟨m⟩/M) + c01⟨n⟩[1 − c01⟨n⟩/(N − M)],
σ²off = c10⟨m⟩(1 − c10⟨m⟩/M) + c00⟨n⟩[1 − c00⟨n⟩/(N − M)],
which are determined by the reduced connectivity matrix (c11, c10; c01, c00) from equation 4.2. A gaussian approximation to binomial statistics then yields

⟨T⟩ · (⟨m⟩, ⟨n⟩) = (1/2) ( M [1 + erf((µon − θ)/√(2σ²on))],  (N − M) [1 + erf((µoff − θ)/√(2σ²off))] ).    (7.2)
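The map of equations 7.1 and 7.2 is straightforward to iterate numerically. The following sketch (variable names are ours; parameters as in Figure 3B, with θ = 120 inside the stable range 112 ≤ θ ≤ 133) converges to the nontrivial replay fixed point:

```python
import math

def mean_field_step(m, n, M, N, c, r, theta):
    """One iteration of equations 7.1 and 7.2 for mean hits and false alarms."""
    c11 = c * (1 + r)       # connectivity from cue-"on" to target-"on" units
    c10 = c01 = c00 = c     # remaining entries of the reduced connectivity matrix
    mu_on = c11 * m + c01 * n
    mu_off = c10 * m + c00 * n
    var_on = c11 * m * (1 - c11 * m / M) + c01 * n * (1 - c01 * n / (N - M))
    var_off = c10 * m * (1 - c10 * m / M) + c00 * n * (1 - c00 * n / (N - M))
    m_next = 0.5 * M * (1 + math.erf((mu_on - theta) / math.sqrt(2 * var_on)))
    n_next = 0.5 * (N - M) * (1 + math.erf((mu_off - theta) / math.sqrt(2 * var_off)))
    return m_next, n_next

N, c, r, M, theta = 100_000, 0.05, 1.0, 1600, 120
m, n = float(M), 0.0   # ideal representation of the cue pattern
for _ in range(50):
    m, n = mean_field_step(m, n, M, N, c, r, theta)
gamma_fp = m / M - n / (N - M)   # quality of replay at the fixed point
```

Lowering θ toward the lower stability border instead drives the iteration into the completely activated trivial fixed point, and raising it above the upper border extinguishes all activity.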
Numerical iteration of equation 7.1 results in the fixed points (⟨m⟩*, ⟨n⟩*) of the mean field dynamics and their basins of attraction (see Figure 9A). The iterated map has two trivial fixed points that are largely independent of the choice of firing threshold θ and pattern size M. These trivial fixed points represent complete activation of the network, on the one hand, and no activity at all, on the other hand. Shape and size of their basins of attraction (black and white areas in Figure 9A), however, are modulated by the specific values of M and θ. We also observe a third type of fixed point comprising a large number of hits and a small number of false alarms; numerics shows that we always find ⟨Γ⟩ ≈ 1 at this fixed point of infinite sequence replay. Its basin of attraction is plotted in gray and extends over a small interval of false alarm rates; note the logarithmic scale on the ordinates in Figure 9A.
In Figure 9A we see that the smaller the pattern size, the narrower is the range of thresholds allowing an infinite sequence replay. For a large enough pattern size, the range of possible thresholds is broad (see also Figures 1 and 3). The region in the (M, θ) space where infinite sequence replay can occur is summarized in Figure 9B. The wedge-shaped stability regions are not much affected by N but strongly depend on c.
The borders of such a stability region in Figure 9B can be described by upper and lower bounds for the thresholds, θupper and θlower, that can be approximated through linear functions of the pattern size M. The upper bound θupper is interpreted as an iso-⟨Γ⟩ line that separates the region of a completely deactivated state with fixed point ⟨m⟩* = ⟨n⟩* = 0 from the region of stable sequence replay where ⟨m⟩* = M and ⟨n⟩* ≪ M. From the first line of equation 7.1, we then obtain

θupper ≈ c11 M ⟨Γ⟩ − erf^−1(2⟨Γ⟩ − 1) O(√θupper).

Thus, for large M, the bound θupper is an almost linear function of the pattern size with a slope c11⟨Γ⟩ ≈ c11. Similarly, from the second line of equation 7.1, we obtain the boundary θlower between the region where ⟨m⟩* = M and ⟨n⟩* ≪ N and the region of a completely activated state ⟨m⟩*/M = ⟨n⟩*/(N − M) = 1,

θlower ≈ c10 M + c10 N (1 − ⟨Γ⟩) + erf^−1(2⟨Γ⟩ − 1) O(√θlower).
Figure 9: Fixed points of infinite sequence replay. (A) Basins of attraction of the mean field dynamics in equation 7.1 depend on pattern size M and firing threshold θ. The discrete dynamics of mean hit rates ⟨m⟩/M and mean false alarm rates ⟨n⟩/(N − M) exhibits two trivial fixed points. The first is a completely deactivated state, ⟨m⟩* = ⟨n⟩* = 0, with basins of attraction represented by a white area. The second fixed point represents maximal activation, ⟨m⟩*/M = ⟨n⟩*/(N − M) = 1, with basins of attraction painted black. For a few pairs of (M, θ), we also observe nontrivial fixed points (black dots) corresponding to sequence replay. Their basins of attraction are depicted by gray areas. Parameters (N = 10^5, c = 0.05, r = 1) are the same as in Figures 1 and 3. (B) Regions of stable sequence replay in the (M, θ) space are plotted in gray; connectivities are c = 0.05, 0.1, 0.2, and network sizes are N = 10^5, 10^6, 10^7 for r = 1. The slopes of the upper and lower borders of these stability regions approximately equal the connectivities c11 and c10, respectively.
The slope of θlower is about c10, which for r = 1 is about half the slope of θupper. These predicted slopes agree with numerical results in Figure 9B. The size of the region of infinite sequence replay is therefore proportional to c11 − c10 ∝ r. The larger the ratio r between silent and nonsilent synapses, the larger are the stability regions and, hence, the more robust is sequence replay. We emphasize that the above expressions for θupper and θlower are rough estimates that correspond to large pattern sizes M at which the distributions of synaptic inputs to "off" and "on" units do not overlap too much (see Figures 4A and 4B). Moreover, the optimal parameters Mopt and θopt at the tip of a stability region cannot be determined explicitly because we cannot assess the exact value of ⟨Γ⟩ analytically.
The mean field results in Figure 9A are largely consistent with the cellular simulations in Figure 1, but there are also important differences. Cellular simulations have been obtained for finite sequences Q = 20, whereas mean field results are valid for Q → ∞. Further discrepancies at the edges of the stability regions also occur because random fluctuations in cellular simulations can kick the network into complete activation or deactivation. The edges of the wedge-shaped regions in Figure 9B therefore describe the behavior of cellular networks only approximately.
To summarize, the higher the capacity, the less robust is sequence replay against variations of the parameters M and θ. The wedge-shaped structures of the stability regions in Figure 9B indicate that the maximal sequence capacity and, hence, minimal M go along with a critical dependence of stability on the firing threshold. In the limit M → Mopt, the network lives on the edge of dynamical (in)stability.

8 Information Content for N → ∞

The detection criterion we have proposed in section 4.3 permits a limited amount of errors. It is intuitively clear that these retrieval errors allow an increase of the storage capacity α as compared to an errorless case. However, the more errors occur, the more deteriorated is the representation of each of the patterns during replay. The common way of measuring the balance of these two opposing effects of retrieval errors is to calculate the information content I. The latter can be understood as the logarithm of the number of all possible ways of concurrently storing a number P of associations or, more precisely (Nadal & Toulouse, 1990),
P N M N−M . m+n m n
(8.1)
N M N−M Here, m+n /[ m ] is the number of patterns of size M that can be n represented in a network of size N, given the hits m and false alarms
Capacity for Sequences Under Biological Constraints
n. We note that the number P of associations between patterns depends on the performance of the readout device, and so does the information content. The information content is often calculated as a function of the so-called coding ratio f = M/N, which is interpreted as a firing rate. In biologically relevant networks, the firing rate is low (f → 0), while the networks are required to remain operable in the limit N → ∞. This asymptotic behavior is extensively discussed in the literature (e.g., Willshaw et al., 1969; Gardner, 1987; Golomb et al., 1990). In what follows, we show that in our framework we also have lim_{N→∞} f = 0. In this limit, we assess the information content I for Q = 1 and Q → ∞. From equation 8.1, we derive an approximation of I for f → 0, given that the number n of false alarms is considerably smaller than the pattern size M, as motivated in section 7.2. For a fixed fraction η := m/M ≤ 1 of hits, we can approximate I by evaluating equation 8.1 with n = 0. Then, applying Stirling's formula and introducing the mixing entropy s(x) = −x lg₂ x − (1 − x) lg₂(1 − x), we obtain

I/(c_m N²) = α f [ η |lg₂(ηf)| − s(η) ].   (8.2)
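The quality of this approximation can be checked numerically. The sketch below evaluates equation 8.1 directly (with n = 0 false alarms) via log-gamma functions and compares it with equation 8.2 rewritten per association, using α = P/(c_m N) and f = M/N so that I/P = M[η |lg₂(ηf)| − s(η)]. The parameter values are illustrative and not taken from the text.

```python
import math

def lg2_binom(n, k):
    # log2 of the binomial coefficient C(n, k), via lgamma to avoid huge integers
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(2)

def bits_exact(N, M, m):
    # bits per stored association, equation 8.1 evaluated with n = 0 false alarms
    return lg2_binom(N, m) - lg2_binom(M, m)

def bits_asymptotic(N, M, m):
    # bits per association from the f -> 0 approximation, equation 8.2
    f, eta = M / N, m / M
    s = -eta * math.log2(eta) - (1 - eta) * math.log2(1 - eta)  # mixing entropy
    return M * (eta * abs(math.log2(eta * f)) - s)

# illustrative sparse-coding values: coding ratio f = 1e-3, hit fraction eta = 0.9
N, M, m = 10**6, 1000, 900
print(bits_exact(N, M, m), bits_asymptotic(N, M, m))
```

For these values the two expressions agree to within roughly 15 percent, with the exact count slightly larger, as expected from the neglected subleading Stirling terms.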
From equation 3.3 we know that the storage capacity α scales like N/M², and thus I/N² ∝ |ln(M/N)|/M. As a corollary, this shows that minimizing M maximizes not only α but also I. In the case Q = 1, a combined optimization of θ and M leads to M_opt being independent of the network size N (see section 5). As a result, we obtain f ∝ 1/N. Accordingly, the information content per synapse I/(c_m N²) ∝ ln N increases with network size N. In order to also obtain the asymptotic behavior of I in the case of large sequence length, we have assessed the optimal pattern size M_opt for Q → ∞ as a function of network size N for a fixed connectivity c, that is, without any constraint (see Figure 10). The numerics reveal a sublogarithmic behavior, M_opt(N) ∝ (ln N)^0.82; the coding ratio f ∝ (ln N)^0.82/N also falls below every bound as N → ∞, that is, coding becomes arbitrarily sparse. Together with equation 3.3, the unconstrained storage capacity diverges like α ∝ N/(ln N)^1.64. From equation 8.2, we thus find the information content per synapse to increase sublogarithmically for N → ∞: I/N² ∝ (ln N)^0.18. The information content per synapse diverges as N → ∞, though very slowly. In fact, I/N² grows so slowly that in the range of biologically
C. Leibold and R. Kempter
[Figure 10 appears here: log-log plot of the optimal event size M_opt against the network size log10(N), for connectivities c = 0.1 and c = 0.2, with the fit ∝ (ln N)^0.82.]
Figure 10: The dependency of the optimal pattern size M_opt on the network size N is weak in the case of sequence length Q → ∞; note the logarithmic scale on the abscissa. The connectivity c is fixed, that is, no constraint is imposed (crosses: c = 0.1; circles: c = 0.2). The symbols represent optimal pattern sizes obtained from numerical solution of the fixed-point equation (m*, n*) = T · (m*, n*) (see equation 7.1). The solid line illustrates the asymptotic behavior M_opt ∝ (ln N)^0.82 found from linear regression.
reasonable network sizes 10³ < N < 10⁷, the information content per synapse varies only by a factor of (7/3)^0.18 ≈ 1.2. To summarize, the combined optimization of M and θ provides an efficient algorithm to set up a sequential memory network for a broad range of network sizes. However, combining the results from sections 6 and 7 for biologically relevant parameter regimes, one obtains information contents I = |lg₂ f| α f c_m N² that are far below the theoretical maximum c_m N² of one bit per synapse.
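Both the regression used in Figure 10 and the quoted variation factor can be retraced in a short sketch. The (N, M_opt) pairs below are synthetic, generated from the fitted law with 2 percent noise (the article's actual values come from the fixed-point equation and are not reproduced here); the constant A is a placeholder.

```python
import math, random

# Hypothetical (N, M_opt) pairs drawn from the fitted law M_opt = A*(ln N)^0.82
# with small multiplicative noise; A = 40 is an arbitrary illustrative constant.
random.seed(0)
A, true_exp = 40.0, 0.82
data = [(10 ** k, A * math.log(10 ** k) ** true_exp
         * (1 + 0.02 * random.uniform(-1, 1))) for k in range(3, 13)]

# Least-squares slope of log M_opt versus log log N recovers the exponent.
xs = [math.log(math.log(N)) for N, _ in data]
ys = [math.log(M) for _, M in data]
xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))

# Variation of the per-synapse information between N = 10^3 and N = 10^7:
# (ln 10^7 / ln 10^3)^0.18 = (7/3)^0.18, since ln 10^k = k ln 10.
factor = (7 / 3) ** 0.18
print(round(slope, 2), round(factor, 2))
```

The recovered slope lands near 0.82, and the factor comes out near 1.2, matching the values quoted in the text.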
9 Discussion

This article combines analytical and numerical methods to assess the capacity for storing sequences of activity patterns in a recurrent network of McCulloch-Pitts units. Results from mean field theory are validated through simulations of cellular networks and a probabilistic dynamical description. Our approach is new in that we concurrently optimize the pattern size M and the firing threshold θ in order to maximize the storage capacity α. Within this framework, we derive the capacity α as a function of five system parameters: network size N, mean connectivity c, synaptic plasticity resources r, sequence length Q, and detection threshold γ. The storage capacity of a network crucially depends on the criterion for pattern detection. One typically requires that the quality of replay of patterns exceed some detection threshold γ (see equations 4.12 and 4.13). Our retrieval criterion with γ < 1, which allows errors in the replay of patterns, is fundamentally different from the error-free criterion γ → 1 in
the classical Willshaw et al. (1969) network, where the storage capacity is subject to Gardner's bound (Gardner, 1987). In the original Willshaw model, as well as in our approach for minimal sequences (Q = 1), the network is initialized with a perfect representation of the cue pattern, m_0 = M and n_0 = 0. The Willshaw model, however, requires perfect retrieval of a target pattern in that the number of hits is maximal, m_1 = M, and there is less than one false alarm on average, n_1 < 1; furthermore, the firing threshold θ is set to the pattern size M. Then, binomial statistics yields the well-known logarithmic scaling laws for the optimal pattern size M_opt ∝ log N and the capacity α_Willshaw ∝ N/log² N (Willshaw et al., 1969; Gardner, 1987; see also equation 3.3). In terms of the coding ratio f = M_opt/N, they find α_Willshaw ∝ 1/(f |ln f|) for N → ∞. In contrast, in this article, we optimize both the firing threshold θ and the pattern size M, and we use a readout criterion that permits errors. Thus, the storage capacity α ∝ 1/f (see equation 5.7) diverges faster than α_Willshaw. An error-prone representation of patterns is in agreement with the situation in the brain, for example, in the hippocampal CA3 network. There, the recurrently connected pyramidal cells also have feedforward connections to the pyramidal cells in CA1 via highly plastic synapses. It is generally assumed (Hasselmo, 1999) that these synapses are adjusted by CA3 activity and local learning rules; that is, CA1 can learn replayed patterns. Readout in CA1 may therefore be successful even if the absolute number of false alarms in CA3 exceeds the number of hits. The detection criterion in equation 4.13 can be motivated by such downstream neurons that receive excitation from the correctly activated neurons and inhibition from the incorrectly activated ones (e.g., via a globally coupled network of interneurons).
For sequence length Q = 1, the concurrent optimization of M and θ leads to scaling laws for the replay of minimal sequences at biologically relevant connectivities c ≪ 1: the optimal pattern size is inversely proportional to the mean connectivity, M_opt ∝ c^−1, and the optimal firing threshold θ_opt is independent of c. Both θ_opt and M_opt are independent of the network size N. These dependencies finally lead to a capacity of sequential memory that scales like α ∝ cN (see equation 5.7). Moreover, the number of associations that can be stored scales like P ∝ c²N². A main conclusion from the scaling laws α ∝ cN and P ∝ c²N² is that for a constrained number of synapses per cell (synapses-per-neuron constraint, cN = const.), the capacity α and the number P are constant, that is, independent of the network size N (see Figures 6 and 8). This means that it is impossible to increase the computational power of the network by increasing N. One could argue, however, that taking two independent networks doubles P and therefore would account for a performance increase that is linear in N. The drawback of this strategy is that each pattern can then be connected to only half of the other patterns, namely those located in the same network module.
A technically relevant constraint (e.g., in a computer simulation) is a constant total number of synapses in the network (synapses-per-network constraint, cN² = const.). From the above scaling laws, we conclude that α and P necessarily decrease with increasing network size (see Figure 6B). One can also ask whether there is a scaling law for the connectivity that accounts for scale-invariant storage, that is, P ∝ N (see section 6.3). In so doing, we find scale invariance for c ∝ 1/√N. As a result, the total number of synapses is then proportional to N^{3/2}, which is in line with results by Stevens (2001; personal communication, 2005). For the synapses-per-neuron constraint, there is an optimal value for the ratio r between silent and nonsilent synapses. For generic parameter regimes, this optimal value is rather large (r ≈ 10; see Figure 7). However, α exhibits a broad maximum as a function of r, and therefore the exact value of r is not critical for sequential memory. If one considers the network connectivity to be determined by local Hebbian learning rules, such as spike-timing-dependent synaptic plasticity, ratios r that strongly deviate from 1 are implausible, since synaptic LTP at a specific pair of pre- and postsynaptic neurons can be compensated for locally only by LTD of another synapse at the very same pair of neurons (Gerstner, Kempter, van Hemmen, & Wagner, 1996; Bi & Poo, 1998; Kempter, Gerstner, & van Hemmen, 1999). One can thus argue that the functional benefit of a very large amount of plastic resources may no longer justify the expense of nonlocal signaling in synaptic plasticity. In short, ratios r ≈ 1 may be sufficient for excellent performance of sequential memory. This article also shows that for long sequences (e.g., Q > 8), memory capacity becomes virtually independent of Q (see Figure 8). For large Q, however, the optimal pattern size necessarily places the network close to dynamical instability (see Figure 9).
Yet from the point of view of maximizing storage capacity α, the strategy of avoiding dynamical instabilities by increasing the pattern size M is problematic, since α is proportional to M^−2 (see equation 3.3). In order to approach the maximal storage capacity without the danger of complete activation or silencing of the network, one might rather introduce an activity-dependent stabilization mechanism that provides negative feedback after a certain number of time steps. A readily available biological realization is a network of inhibitory interneurons (Bragin et al., 1995; Battaglia & Treves, 1998; Traub et al., 2000; Csicsvari, Jamieson, Wise, & Buzsaki, 2003). This, of course, may come at the cost of limiting the sequence length Q or reducing the detection threshold γ. Our results for large sequence lengths Q are not immediately applicable to synfire chains (Abeles, 1991; Herrmann et al., 1995; Diesmann et al., 1999). The chief difficulty in translating our model into a more realistic network with continuous dynamics is to preserve the temporal separation between distinct patterns. The functional constraint of minimal sequence lengths is thus more likely a constraint on the temporal precision of network dynamics than on counting statistics. We speculate that for biological networks, spike
desynchronization restricts the applicability of our results to small values of Q. The framework presented here is limited to orthogonal sequences; a particular pattern is not allowed to occur presynaptically in more than one minimal sequence. Nonorthogonal or loop-like sequential memories can be taken into account by, for example, generalizing the framework to neurons with more than one-step memory (Dehaene, Changeux, & Nadal, 1987; Guyon, Personnaz, Nadal, & Dreyfus, 1988), or by adding "internal patterns" that represent repetitions (Amit, 1988) or context (Levy, 1996). A possible neurophysiological application of our theory can be found in the hippocampus. During slow-wave sleep, low levels of the neuromodulator acetylcholine boost the impact of the excitatory feedback connections within CA3 (see Hasselmo, 1999, for a review). Slow-wave sleep goes along with a phenomenon called sharp-wave ripples, which are speculated to result from the replay of short sequences (Draguhn, Traub, Bibbig, & Schmitz, 2000; Csicsvari et al., 2000). A sharp-wave ripple burst is a pulselike event of the local field potential in CA3 that is accompanied by 200 Hz oscillations. The latter are supposed to be generated by CA3 pyramidal cells (Behrens, van den Boom, de Hoz, Friedman, & Heinemann, 2005) and may reflect sequence replay (Wilson & McNaughton, 1994; Nádasdy et al., 1999; Lee & Wilson, 2002) occurring in time slices of about 5 ms. The total duration of a ripple of about 40 ms limits the number of putative events in a sequence to fewer than about eight. The temporal extent of a sharp wave may be controlled by inhibition (Maier, Nimmrich, & Draguhn, 2003), which hints at dynamical stabilization of the network activity at a high level of storage capacity (see above). In Figure 8 we plotted the coding ratio f = M_opt/N and the storage capacity α as a function of network size for various sequence lengths.
If we apply these results to the situation in the hippocampal CA3 region of rats, with a sequence length of Q = 8, a network size of N = 240,000, a synapses-per-neuron constraint of c_m N = 10,000 synapses per cell, and plasticity resources r = 1, we find the optimal pattern size to be about 1500 cells. As a consequence, the storage capacity is about α = 1.2 minimal sequences per synapse at a cell, which corresponds to about 1600 full sequences of length 8 stored in the network. Interestingly, the firing threshold we obtain is 55, which is approximately the same as that assumed by Diesmann et al. (1999) for cortical synfire networks. To summarize, this letter provides a simple rule for choosing pattern size and threshold so as to optimize storage capacity under biologically realistic constraints such as low connectivity and similar amounts of silent and nonsilent synapses. From this, one can conclude that sequence completion in the recurrent network operates far below maximal information content. To put it more positively, information seems to be redundantly distributed over a large number of synapses, which is consistent with the picture that memories are stored in a way that is robust against synaptic noise and some variability of morphological plasticity.
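The arithmetic behind this estimate can be retraced directly from the definition α = P/(c_m N) in the symbol list. With the rounded value α = 1.2 the count comes out near 1500 full sequences, consistent with the "about 1600" quoted above; all inputs are taken from the text.

```python
# Back-of-envelope check of the CA3 estimate.
N = 240_000      # network size (CA3 pyramidal cells)
cmN = 10_000     # synapses-per-neuron constraint: c_m * N
alpha = 1.2      # capacity, minimal sequences per synapse at a cell (rounded)
Q = 8            # sequence length

P = alpha * cmN            # stored minimal sequences, from alpha = P / (c_m N)
full_sequences = P / Q     # full sequences of length Q
c_m = cmN / N              # implied morphological connectivity
print(P, full_sequences, round(c_m, 4))
```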
Appendix A: List of Symbols

Symbol: Meaning (Location of First Use)

t: discrete time (section 2.1)
x: binary network state vector (section 2.1)
θ: firing threshold (section 2.1)
c: mean connectivity of activated synapses (section 2.1)
c_s: mean connectivity of silent synapses (section 2.1)
c_m: mean morphological connectivity (section 2.1)
r = c_s/c: ratio between silent and active connectivity (section 2.1)
C = (C_nn′): connectivity matrix of activated synapses (section 2.1)
N: network size (section 2.1)
M: pattern size (section 2.2)
ξ: binary pattern vector (section 2.2)
Q: sequence length (section 2.2)
P: number of minimal sequences stored (section 3)
α = P/(c_m N): capacity of sequential memory (section 3)
m: number of hits (section 4.1)
n: number of false alarms (section 4.1)
c_11, c_10, c_01, c_00: entries of the reduced connectivity matrix (equation 4.1)
T: transition matrix (equation 4.3)
b: binomial probability (equation 4.4)
p: conditional probability of hits (equation 4.5)
q: conditional probability of false alarms (equation 4.5)
ρ: conditional probability of one hit (equation 4.6)
λ: conditional probability of one false alarm (equation 4.7)
…: quality of replay (equation 4.11)
γ, γ: detection thresholds (equations 4.12 and 4.13)
T̄: mean transition function (equation 7.2)
f = M/N: coding ratio (section 8)
I: information content (section 8)
Appendix B: Memory Capacity Revisited

Let us consider a naive network of size N that initially has no synapses at all. To imprint the first minimal sequence ξ^A → ξ^B in the network, we need M²c_11 functional synapses in order to link two groups of M neurons at connectivity c_11 (see Figure 11). Let us first discuss the simpler case c_11 = c_m. For the second sequence, ξ^C → ξ^D, fewer synapses are needed, because some cells in pattern ξ^C are already connected to cells in ξ^D owing to overlap with the first sequence. For random patterns, the probability that a neuron is active in a specific pattern is f = M/N, which is also called the coding ratio. As a result, the mean number of cells that are active in both of a given pair of patterns is Mf. Consequently, the Mf presynaptic cells that belong to both cue patterns ξ^A and ξ^C only have to be connected to the M(1 − f) postsynaptic neurons of ξ^D that do not overlap with ξ^B. The number of new synapses needed is Mf · c_11 · M(1 − f). In order to complete the second minimal sequence,
Figure 11: Consumption of synapses by successively storing minimal sequences. The first minimal sequence ξ^A → ξ^B consumes c_11 M² synapses. The patterns of a second minimal sequence ξ^C → ξ^D have some overlap f with ξ^A and ξ^B; there are Mf cells (gray) both pre- and postsynaptically that contribute to both the first and the second minimal sequences. The number of synapses consumed by ξ^C → ξ^D is reduced by a factor of (1 − f²); see the text.
we are left with connecting the remaining M(1 − f) presynaptic cells of ξ^C to all M postsynaptic cells of ξ^D. In summary, the second sequence consumes Mf · c_11 · M(1 − f) + M(1 − f) · c_11 · M = M²c_11(1 − f²) synapses. Similarly, the kth minimal sequence consumes M²c_11(1 − f²)^{k−1} synapses that have not yet been accounted for. Summing up all contributions until we reach the limit N²c of available nonsilent synapses yields a condition on the maximal number P of minimal sequences,
N²c = M²c_11 Σ_{k=1}^{P} (1 − f²)^{k−1} = M²c_11 f^{−2} [1 − (1 − f²)^P].   (B.1)
In the case c_11 < c_m, we need to take into account the probability c_11/c_m of having a morphological synapse from cue to target in the nonsilent state. The transformation f² → f²c_11/c_m suffices to generalize the result in equation B.1. Solving the generalized version of equation B.1 for P and normalizing the result by c_m N (recall α = P/(c_m N)), we find the capacity to be

α = log(1 − c/c_m) / [c_m N log(1 − f²c_11/c_m)],

which can be approximated for f ≪ 1 by

α = cN/(c_m c_11 M²).   (B.2)
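Equation B.2 can be sanity-checked against the exact expression; the approximation is tight when both f and c/c_m are small. The parameter values below are illustrative only, not taken from the article.

```python
import math

def alpha_exact(N, M, c, c_m, c11):
    # capacity from solving the generalized equation B.1 for P = alpha * c_m * N
    f = M / N
    return math.log(1 - c / c_m) / (c_m * N * math.log(1 - f**2 * c11 / c_m))

def alpha_approx(N, M, c, c_m, c11):
    # small-f approximation, equation B.2
    return c * N / (c_m * c11 * M**2)

params = dict(N=10**5, M=300, c=0.01, c_m=0.5, c11=0.5)
print(alpha_exact(**params), alpha_approx(**params))
```

For these values (f = 3e-3, c/c_m = 0.02) the two expressions agree to within about one percent.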
Equation B.2 is an extension of the results for the case c_11 = c_m = 1 originally obtained by Willshaw et al. (1969), Nadal and Toulouse (1990), and Nadal (1991). In the main part of this article, we discuss the scenario c_11 = c_m < 1; see equations 3.2 and 3.3.

Acknowledgments

We thank D. Schmitz for valuable discussions on the hippocampal circuitry, A. V. M. Herz for ongoing support, and R. Gütig, R. Schaette, M. Stemmler, K. Thurley, and L. Wiskott for discussions and comments on the manuscript. This research was supported by the Deutsche Forschungsgemeinschaft (Emmy Noether Programm: Ke 788/1-3, SFB 618) and the Bundesministerium für Bildung und Forschung (Bernstein Center for Computational Neuroscience).

References

Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Amit, D. J. (1988). Neural networks counting chimes. Proc. Natl. Acad. Sci. USA, 85, 2141–2145.
Amit, D. J., Gutfreund, H., & Sompolinsky, H. (1987). Information storage in a network with low levels of activity. Phys. Rev. A, 35, 2293–2303.
August, D. A., & Levy, W. B. (1999). Temporal sequence compression by an integrate-and-fire model of hippocampal area CA3. J. Comput. Neurosci., 6, 71–90.
Battaglia, F. P., & Treves, A. (1998). Stable and rapid recurrent processing in realistic autoassociative memories. Neural Comput., 10, 431–450.
Behrens, C. J., van den Boom, L. P., de Hoz, L., Friedman, A., & Heinemann, U. (2005). Induction of sharp wave-ripple complexes in vitro and reorganization of hippocampal networks. Nat. Neurosci., 8, 560–567.
Bi, G.-Q., & Poo, M.-M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bragin, A., Jando, G., Nadasdy, Z., Hetke, J., Wise, K., & Buzsaki, G. (1995). Gamma (40–100 Hz) oscillation in the hippocampus of the behaving rat. J. Neurosci., 15, 47–60.
Brun, V. H., Otnass, M. K., Molden, S., Steffenach, H. A., Witter, M. P., Moser, M. B., & Moser, E. I. (2002). Place cells and place recognition maintained by direct entorhinal-hippocampal circuitry. Science, 296, 2243–2246.
Brunel, N., Hakim, V., Isope, P., Nadal, J.-P., & Barbour, B. (2004). Optimal information storage and the distribution of synaptic weights: Perceptron versus Purkinje cell. Neuron, 43, 745–757.
Brunel, N., Nadal, J.-P., & Toulouse, G. (1992). Information capacity of a perceptron. J. Phys. A: Math. Gen., 25, 5017–5037.
Buckingham, J., & Willshaw, D. (1992). Performance characteristics of the associative net. Network, 3, 407–414.
Buckingham, J., & Willshaw, D. (1993). On setting unit thresholds in an incompletely connected associative net. Network, 4, 441–459.
Buhmann, J., & Schulten, K. (1987). Noise-driven temporal association in neural networks. Europhys. Lett., 4, 1205–1209.
Csicsvari, J., Hirase, H., Mamiya, A., & Buzsaki, G. (2000). Ensemble patterns of hippocampal CA3-CA1 neurons during sharp wave associated population events. Neuron, 28, 585–594.
Csicsvari, J., Jamieson, B., Wise, K. D., & Buzsaki, G. (2003). Mechanisms of gamma oscillations in the hippocampus of the behaving rat. Neuron, 37, 311–322.
Dehaene, S., Changeux, J.-P., & Nadal, J.-P. (1987). Neural networks that learn temporal sequences by selection. Proc. Natl. Acad. Sci. USA, 84, 2727–2731.
Deshpande, V., & Dasgupta, C. (1991). A neural network for storing individual patterns in limit cycles. J. Phys. A: Math. Gen., 24, 5105–5119.
Diesmann, M., Gewaltig, M. O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Draguhn, A., Traub, R. D., Bibbig, A., & Schmitz, D. (2000). Ripple (approximately 200-Hz) oscillations in temporal structures. J. Clin. Neurophysiol., 17, 361–376.
Düring, A., Coolen, A. C. C., & Sherrington, D. (1998). Phase diagram and storage capacity of sequence processing neural networks. J. Phys. A: Math. Gen., 31, 8607–8621.
Fortin, N. J., Agster, K. L., & Eichenbaum, H. B. (2002). Critical role of the hippocampus in memory for sequences of events. Nat. Neurosci., 5, 458–462.
Gardner, E. (1987). Maximum storage capacity in neural networks. Europhys. Lett., 4, 481–485.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Golomb, D., Rubin, N., & Sompolinsky, H. (1990). Willshaw model: Associative memory with sparse coding and low firing rates. Phys. Rev. A, 41, 1843–1854.
Graham, B., & Willshaw, D. (1997). Capacity and information efficiency of the associative net. Network: Comput. Neural Syst., 8, 35–54.
Gutfreund, H., & Mézard, M. (1988). Processing of temporal sequences in neural networks. Phys. Rev. Lett., 61, 235–238.
Guyon, I., Personnaz, L., Nadal, J.-P., & Dreyfus, G. (1988). Storage and retrieval of complex sequences in neural networks. Phys. Rev. A, 38, 6365–6372.
Hasselmo, M. E. (1999). Neuromodulation: Acetylcholine and memory consolidation. Trends Cogn. Sci., 3, 351–359.
Herrmann, M., Hertz, J. A., & Prügel-Bennett, A. (1995). Analysis of synfire chains. Network: Comput. Neural Syst., 6, 403–414.
Herz, A. V. M., Li, Z., & van Hemmen, J. L. (1991). Statistical mechanics of temporal association in neural networks with transmission delays. Phys. Rev. Lett., 66, 1370–1373.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Isaac, J. T., Nicoll, R. A., & Malenka, R. C. (1995). Evidence for silent synapses: Implications for the expression of LTP. Neuron, 15, 427–444.
Isope, P., & Barbour, B. (2002). Properties of unitary granule cell → Purkinje cell synapses in adult rat cerebellar slices. J. Neurosci., 22, 9668–9678.
Jensen, O., & Lisman, J. E. (2005). Hippocampal sequence-encoding driven by a cortical multi-item working memory buffer. Trends Neurosci., 28, 67–72.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59, 4498–4514.
Kesner, R. P., Gilbert, P. E., & Barua, L. A. (2002). The role of the hippocampus in memory for the temporal order of a sequence of odors. Behav. Neurosci., 116, 286–290.
Lee, A. K., & Wilson, M. A. (2002). Memory of sequential experience in the hippocampus during slow wave sleep. Neuron, 36, 1183–1194.
Levy, W. B. (1996). A sequence predicting CA3 is a flexible associator that learns and uses context to solve hippocampal-like tasks. Hippocampus, 6, 579–590.
Little, W. A. (1974). Existence of persistent states in the brain. Math. Biosci., 19, 101–120.
Lőrincz, A., & Buzsáki, G. (2000). Two-phase computational model training long-term memories in the entorhinal-hippocampal region. Ann. New York Acad. Sci., 911, 83–111.
Maier, N., Nimmrich, V., & Draguhn, A. (2003). Cellular and network mechanisms underlying spontaneous sharp wave-ripple complexes in mouse hippocampal slices. J. Physiol., 550, 873–887.
Maravall, M. (1999). Sparsification from dilute connectivity in a neural network model of memory. Network: Comput. Neural Syst., 10, 15–39.
McCulloch, W. S., & Pitts, W. (1943). Logical calculus of ideas immanent in nervous activity. Bull. of Math. Biophys., 5, 115–133.
Montgomery, J. M., Pavlidis, P., & Madison, D. V. (2001). Pair recordings reveal all-silent synaptic connections and the postsynaptic expression of long-term potentiation. Neuron, 29, 691–701.
Nadal, J.-P. (1991). Associative memory: On the (puzzling) sparse coding limit. J. Phys. A: Math. Gen., 24, 1093–1101.
Nadal, J.-P., & Toulouse, G. (1990). Information storage in sparsely-coded memory nets. Network, 1, 61–74.
Nádasdy, Z., Hirase, H., Czurkó, A., Csicsvari, J., & Buzsáki, G. (1999). Replay and time compression of recurring spike sequences in the hippocampus. J. Neurosci., 19, 9497–9507.
Nakazawa, K., McHugh, T. J., Wilson, M. A., & Tonegawa, S. (2004). NMDA receptors, place cells and hippocampal spatial memory. Nature Rev. Neurosci., 5, 361–372.
Nowotny, T., & Huerta, R. (2003). Explaining synchrony in feed-forward networks: Are McCulloch-Pitts neurons good enough? Biol. Cybern., 89, 237–241.
Nusser, Z., Lujan, R., Laube, G., Roberts, J. D. B., Molnar, E., & Somogyi, P. (1998). Cell type and pathway dependence of synaptic AMPA receptor number and variability in the hippocampus. Neuron, 21, 545–559.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Rapp, P. R., & Gallagher, M. (1996). Preserved neuron number in the hippocampus of aged rats with spatial learning deficits. Proc. Natl. Acad. Sci. USA, 93, 9926–9930.
Sompolinsky, H., & Kanter, I. (1986). Temporal associations in asymmetric neural networks. Phys. Rev. Lett., 57, 2861–2864.
Stevens, C. F. (2001). An evolutionary scaling law for the primate visual system and its basis in cortical function. Nature, 411, 193–195.
Traub, R. D., Bibbig, A., Fisahn, A., LeBeau, F. E. N., Whittington, M. A., & Buhl, E. H. (2000). A model of gamma-frequency network oscillations induced in the rat CA3 region by carbachol in vitro. Europ. J. Neurosci., 12, 4093–4106.
Urban, N. N., Henze, D. A., & Barrionuevo, G. (2001). Revisiting the role of the hippocampal mossy fiber synapse. Hippocampus, 11, 408–417.
Willshaw, D. J., Buneman, O. P., & Longuet-Higgins, H. C. (1969). Non-holographic associative memory. Nature, 222, 960–962.
Wilson, M. A., & McNaughton, B. L. (1994). Reactivation of hippocampal ensemble memories during sleep. Science, 265, 676–679.
Received May 10, 2005; accepted September 8, 2005.
LETTER
Communicated by Christopher Williams
Kernel Fisher Discriminants for Outlier Detection

Volker Roth
[email protected]
ETH Zurich, Institute of Computational Science, CH-8092 Zurich, Switzerland
The problem of detecting atypical objects or outliers is one of the classical topics in (robust) statistics. Recently, it has been proposed to address this problem by means of one-class SVM classifiers. The method presented in this letter bridges the gap between kernelized one-class classification and gaussian density estimation in the induced feature space. Having established the exact relation between the two concepts, it becomes possible to identify atypical objects by quantifying their deviations from the gaussian model. This model-based formalization of outliers overcomes the main conceptual shortcoming of most one-class approaches, which, in a strict sense, are unable to detect outliers, since the expected fraction of outliers has to be specified in advance. In order to overcome the inherent model selection problem of unsupervised kernel methods, a cross-validated likelihood criterion for selecting all free model parameters is applied. Experiments on detecting atypical objects in image databases effectively demonstrate the applicability of the proposed method in real-world scenarios.
Neural Computation 18, 942–960 (2006)    © 2006 Massachusetts Institute of Technology

1 Introduction

A one-class classifier attempts to find a separating boundary between a data set and the rest of the feature space. A natural application of such a classifier is estimating a contour line of the underlying data density for a certain quantile value. Such contour lines may be used to separate typical objects from atypical ones. Objects that look sufficiently atypical are often considered to be outliers, for which one rejects the hypothesis that they come from the same distribution as the majority of the objects. Thus, a useful application scenario would be to find a boundary that separates the jointly distributed objects from the outliers. Finding such a boundary defines a classification problem in which, however, usually only sufficiently many labeled samples from one class are available. In most practical problems, no labeled samples from the outlier class are available at all, and it is even unknown whether any outliers are present. Since the contour lines of the data density often have a complicated form, highly
nonlinear classification models are needed in such an outlier-detection scenario. Recently, it has been proposed to address this problem by exploiting the modeling power of kernel-based support vector machine (SVM) classifiers (see, e.g., Tax & Duin, 1999; Schölkopf, Williamson, Smola, & Shawe-Taylor, 2000). These one-class SVMs are able to infer highly nonlinear decision boundaries, although at the price of a severe model selection problem. The approach of directly estimating a boundary, as opposed to first estimating the whole density, follows one of the main ideas in learning theory, which states that one should avoid solving an intermediate problem that is too hard. While this line of reasoning seems appealing from a theoretical point of view, it leads to a severe problem in practical applications: when it comes to detecting outliers, the restriction to estimating only a boundary makes it impossible to derive a formal characterization of outliers without prior assumptions on the expected fraction of outliers or even on their distribution. In practice, however, any such prior assumptions can hardly be justified. The fundamental problem of the one-class approach lies in the fact that outlier detection is a (partially) unsupervised task that has been squeezed into a classification framework. The missing part of the information has been shifted to prior assumptions that require detailed knowledge about the data source. This letter aims at overcoming this problem by linking kernel-based one-class classifiers to gaussian density estimation in the induced feature space. Objects that have an unexpectedly high Mahalanobis distance to the sample mean are considered atypical objects, or outliers. A particular Mahalanobis distance is considered unexpected if it is very unlikely to observe an object that far away from the mean vector in a random sample of a certain size.
We formalize this concept in section 3 by way of fitting linear models in quantile-quantile plots. The main technical ingredient of our method is the one-class kernel Fisher discriminant classifier (OC-KFD), for which the relation to gaussian density estimation is shown. From the classification side, the OC-KFD-based model inherits both the modeling power of Mercer kernels and the simple complexity control mechanism of regularization techniques. Viewed as a function of the input space variables, the model acts as a nonparametric density estimator. The explicit relation to gaussian density estimation in the kernel-induced feature space, however, makes it possible to formalize the notion of an atypical object by observing deviations from the gaussian model. Like any other kernel-based algorithm, however, the OC-KFD model contains some free parameters that control the complexity of the inference mechanism, and it is clear that deviations from gaussianity will heavily depend on the actual choice of these model parameters. In order to characterize outliers, it is thus necessary to select a suitable model in advance. This model selection problem is overcome by using a likelihood-based cross-validation framework for inferring the free parameters.
V. Roth
2 Gaussian Density Estimation and One-Class LDA

Let X denote the n × d data matrix that contains the n input vectors xᵢ ∈ ℝᵈ as rows. It has been proposed to estimate a one-class decision boundary by separating the data set from the origin (Schölkopf et al., 2000), which effectively coincides with replicating all xᵢ with the opposite sign and separating X and −X. Typically, a ν-SVM classifier with a radial basis kernel function is used. The parameter ν upper-bounds the fraction of outliers in the data set and must be selected a priori. There are, however, no principled ways of choosing ν in a general (unsupervised) outlier-detection scenario. Such unsupervised scenarios are characterized by the lack of class labels that would assign the objects to either the typical class or the outlier class.

The method proposed here follows the same idea of separating the data from their negatively replicated counterparts. Instead of an SVM, however, a kernel Fisher discriminant (KFD) classifier is used (Mika, Rätsch, Weston, Schölkopf, & Müller, 1999; Roth & Steinhage, 2000). The latter has the advantage that it is closely related to gaussian density estimation in the induced feature space. By making this relation explicit, outliers can be identified without specifying the expected fraction of outliers in advance.

We start with a linear discriminant analysis (LDA) model and then introduce kernels. The intuition behind this relation to gaussian density estimation is that discriminant analysis assumes a gaussian class-conditional data density. Let Xₐ = (Xᵀ, −Xᵀ)ᵀ denote the augmented (2n × d) data matrix, which also contains the negative samples −xᵢ. Without loss of generality, we assume that the sample mean µ+ := (1/n) Σᵢ xᵢ ≠ 0, so that the sample means of the positive data and the negative data differ: µ+ ≠ µ−. Let us now assume that our data are realizations of a normally distributed random variable in d dimensions: X ∼ N_d(µ, Σ).
Denoting by X_c the centered data matrix, the estimator for Σ takes the form Σ̂ = (1/n) X_cᵀX_c =: W. The LDA solution β* maximizes the between-class scatter β*ᵀBβ* with B = µ+µ+ᵀ + µ−µ−ᵀ under the constraint on the within-class scatter β*ᵀWβ* = 1. Note that in our special case with Xₐ = (Xᵀ, −Xᵀ)ᵀ, the usual pooled within-class matrix W simply reduces to the above-defined W = (1/n) X_cᵀX_c. It is well known (see, e.g., Duda, Hart, & Stork, 2001) that the LDA solution (up to a scaling factor) can be found by minimizing a least-squares functional,

β̂ = arg min_β ‖yₐ − Xₐβ‖²,
(2.1)
where yₐ = (2, . . . , 2, −2, . . . , −2)ᵀ denotes a 2n-indicator vector for class membership in class + or −. In Hastie, Buja, and Tibshirani (1995), a slightly more general form of the problem is described, where the above functional is minimized under a constraint on β, which in the simplest case amounts to adding a term γβᵀβ to the functional. Such a ridge regression model assumes a penalized total covariance of the form Σ_T = (1/(2n)) · XₐᵀXₐ + γI = (1/n) · XᵀX + γI. Defining an n-vector of ones, y = (1, . . . , 1)ᵀ, the solution vector β̂ reads

β̂ = (XₐᵀXₐ + γI)⁻¹ Xₐᵀyₐ = (XᵀX + γI)⁻¹ Xᵀy.
(2.2)
An appropriate scaling factor is defined in terms of the quantity ˆ s 2 = (1/n) · y yˆ = (1/n) · y Xβ,
(2.3)
which leads us to the correctly scaled LDA vector β* = s⁻¹(1 − s²)^(−1/2) β̂ that fulfills the normalization condition β*ᵀWβ* = 1. One further derives from Hastie et al. (1995) that the mean vector of X, projected onto the one-dimensional LDA subspace, has the coordinate value m+ = s(1 − s²)^(−1/2), and that the Mahalanobis distance from a vector x to the sample mean µ+ is the sum of the squared Euclidean distance in the projected space and an orthogonal distance term:

D(x, µ+) = (β*ᵀx − m+)² + D⊥², with D⊥² = −(1 − s²)(β*ᵀx)² + xᵀΣ_T⁻¹x.
(2.4)
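As a concrete sketch, the linear model of equations 2.2 to 2.4 can be written in a few lines of numpy. This is a toy illustration with synthetic data and our own variable names, not the author's implementation:

```python
import numpy as np

# Toy sketch of the linear one-class LDA model (equations 2.2-2.4).
rng = np.random.default_rng(0)
n, d, gamma = 200, 5, 1e-2
X = rng.standard_normal((n, d)) + 1.0          # positive data with nonzero mean

y = np.ones(n)                                  # n-vector of ones
# Reduced ridge solution, equation 2.2.
beta_hat = np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ y)

s2 = (y @ (X @ beta_hat)) / n                   # scaling factor, equation 2.3
beta_star = beta_hat / (np.sqrt(s2) * np.sqrt(1.0 - s2))   # scaled LDA vector
m_plus = np.sqrt(s2) / np.sqrt(1.0 - s2)        # projected mean coordinate

Sigma_T = (X.T @ X) / n + gamma * np.eye(d)     # penalized total covariance

def mahalanobis(x):
    """Squared distance to the sample mean, split as in equation 2.4."""
    proj = beta_star @ x
    d_perp2 = x @ np.linalg.solve(Sigma_T, x) - (1.0 - s2) * proj ** 2
    return (proj - m_plus) ** 2 + d_perp2

D_lin = np.array([mahalanobis(x) for x in X])
```

The scaling factor s² always lies strictly between 0 and 1 here, because ŷ is a ridge fit of y and the corresponding hat matrix has eigenvalues in [0, 1).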
While in the standard LDA setting, all discriminative information is contained in the first term, we have to add the orthogonal term D⊥² to establish the link to density estimation. Note, however, that it is the term D⊥² that makes the density estimation model essentially different from OC classification: while the latter considers only distances in the direction of the projection vector β, the true density model also takes into account the distances in the orthogonal subspace. Since the assumption X ∼ N_d(µ, Σ) is very restrictive, we propose to relax it by assuming that we have found a suitable transformation of our input data φ : ℝᵈ → ℝᵖ, x → φ(x), such that the transformed data are gaussian in p dimensions. If the transformation is carried out implicitly by introducing a Mercer kernel k(xᵢ, xⱼ), we arrive at an equivalent problem in terms of the kernel matrix K, with entries K_ij = k(xᵢ, xⱼ), and the expansion coefficients α:

α̂ = (K + γI)⁻¹ y.
(2.5)
From Schölkopf et al. (1999) it follows that the mapped vectors can be represented in ℝⁿ as φ(x) = K^(−1/2) k(x), with k(x) = (k(x, x₁), . . . , k(x, xₙ))ᵀ.¹ Finally we derive the following form of the Mahalanobis distances, which again consists of the Euclidean distance in the classification subspace plus an orthogonal term,

D(x, µ+) = (α*ᵀk(x) − m+)² − (1 − s²)(α*ᵀk(x))² + n ξ(x),
(2.6)
where α* = s⁻¹(1 − s²)^(−1/2) α̂ and ξ(x) = kᵀ(x)(K + γI)⁻¹K⁻¹k(x). Equation 2.6 establishes the desired link between OC-KFD and gaussian density estimation, since for our outlier detection mechanism, only Mahalanobis distances are needed. While it seems to be rather complicated to estimate a density by the above procedure, the main benefit over directly estimating the mean and the covariance lies in the inherent complexity regulation properties of ridge regression. Such a complexity control mechanism is of particular importance in highly nonlinear kernel models. Moreover, for ridge regression models, it is possible to analytically calculate the effective degrees of freedom, a quantity that will be of particular interest when it comes to detecting outliers.
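A minimal numpy sketch of the kernelized computation (equations 2.5 and 2.6), assuming an RBF kernel and our own naming; a pseudo-inverse of K stands in for K⁻¹ for numerical stability:

```python
import numpy as np

# Toy sketch of the kernel Mahalanobis distance, equations 2.5 and 2.6.
rng = np.random.default_rng(1)
n, d, gamma, sigma = 150, 4, 1e-3, 10.0
X = rng.standard_normal((n, d))

def rbf(A, B):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma)

K = rbf(X, X)
y = np.ones(n)
alpha_hat = np.linalg.solve(K + gamma * np.eye(n), y)          # equation 2.5
s2 = (y @ (K @ alpha_hat)) / n                                 # kernel analogue of eq. 2.3
alpha_star = alpha_hat / (np.sqrt(s2) * np.sqrt(1.0 - s2))
m_plus = np.sqrt(s2) / np.sqrt(1.0 - s2)

# Precompute M = (K + gamma*I)^-1 K^-1 for the orthogonal term xi(x).
M = np.linalg.solve(K + gamma * np.eye(n), np.linalg.pinv(K))

def mahalanobis(x):
    k = rbf(x[None, :], X).ravel()      # k(x) = (k(x, x_1), ..., k(x, x_n))
    proj = alpha_star @ k
    return (proj - m_plus) ** 2 - (1.0 - s2) * proj ** 2 + n * (k @ M @ k)

D_kern = np.array([mahalanobis(x) for x in X])
```

Since K is positive semidefinite, s² is again guaranteed to lie in [0, 1), so the scaling of α* is well defined.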
3 Detecting Outliers

Let us assume that the model is completely specified: both the kernel function k(·, ·) and the regularization parameter γ are fixed. The central lemma that helps us to detect outliers can be found in most statistical textbooks:

Lemma 1. Let X be a gaussian random variable, X ∼ N_d(µ, Σ). Then Δ := (X − µ)ᵀΣ⁻¹(X − µ) follows a χ²-distribution with d degrees of freedom.

For the penalized regression models, it might be more appropriate to use the effective degrees of freedom df instead of d in the above lemma. In the case of one-class LDA with ridge penalties, we can easily estimate it as df = trace(X(XᵀX + γI)⁻¹Xᵀ) (Moody, 1992), which for a kernel model translates into

df = trace(K(K + γI)⁻¹).
(3.1)
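The trace formula 3.1 has an equivalent spectral reading in terms of the shrinkage factors λᵢ/(λᵢ + γ), which can be checked in a few lines (a toy sketch with our own names):

```python
import numpy as np

# Effective degrees of freedom of a kernel ridge model (equation 3.1),
# computed both from the trace formula and from the eigenvalues of K.
rng = np.random.default_rng(2)
n, gamma = 100, 1e-2
A = rng.standard_normal((n, n))
K = A @ A.T / n        # any symmetric PSD matrix serves as a stand-in kernel

df_trace = np.trace(K @ np.linalg.inv(K + gamma * np.eye(n)))
lam = np.linalg.eigvalsh(K)
df_eig = np.sum(lam / (lam + gamma))   # identical by the spectral decomposition
```

Each shrinkage factor lies in [0, 1), so df is always strictly smaller than n; the two computations agree up to numerical precision.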
1 With a slight abuse of notation, we denote both the map and its empirical counterpart by φ(x).
The intuitive interpretation of the quantity df is the following: denoting by V the matrix of eigenvectors of K and by {λᵢ}ᵢ₌₁ⁿ the corresponding eigenvalues, the fitted values ŷ read

ŷ = V diag{δᵢ} Vᵀ y, with δᵢ := λᵢ/(λᵢ + γ).

(3.2)
It follows that compared to the unpenalized case, where all eigenvectors vᵢ are constantly weighted by 1, the contribution of the ith eigenvector vᵢ is downweighted by a factor δᵢ/1 = δᵢ. If the ordered eigenvalues decrease rapidly, however, the values δᵢ are either close to zero or close to one, and df determines the number of terms that are essentially different from zero. A similar interpretation is possible for the orthogonal distance term in equation 2.6.

From lemma 1, we conclude that if the data are well described by a gaussian model in the kernel feature space, the observed Mahalanobis distances should look like a sample from a χ²-distribution with df degrees of freedom. A graphical way to test this hypothesis is to plot the observed quantiles against the theoretical χ² quantiles, which in the ideal case gives a straight line. Such a quantile-quantile plot is constructed as follows. Let Δ(i) denote the observed Mahalanobis distances ordered from lowest to highest, and pᵢ the cumulative proportion before each Δ(i), given by pᵢ = (i − 1/2)/n. Further, let zᵢ = F⁻¹(pᵢ) denote the theoretical quantile at position pᵢ, where F is the cumulative χ²-distribution function. The quantile-quantile plot is then obtained by plotting Δ(i) against zᵢ.

Deviations from linearity can be formalized by fitting a linear model on the observed quantiles and calculating confidence intervals around the fit. Observations falling outside the confidence interval are then treated as outliers. A potential problem of this approach is that the outliers themselves heavily influence the quantile-quantile fit. In order to overcome this problem, the use of robust fitting procedures has been proposed in the literature (see, e.g., Huber, 1981). In the experiments below we use an M-estimator with Huber loss function. For estimating confidence intervals around the fit, we use the standard formula (see, e.g., Fox, 1997; Kendall & Stuart, 1977),
σ(Δ(i)) = b · (f_χ²(zᵢ))⁻¹ √(pᵢ(1 − pᵢ)/n),
(3.3)
where f_χ² denotes the χ² density function, and which can be intuitively understood as the product of the slope b and the standard error of the quantiles. A 100(1 − ε)% envelope around the fit is then defined as Δ(i) ± z_{ε/2} σ(Δ(i)), where z_{ε/2} is the 1 − ε/2 quantile of the standard normal distribution.

The choice of the confidence level ε is somewhat arbitrary, and from a conceptual viewpoint, one might even argue that the problem of specifying
one free parameter (i.e., the expected fraction of outliers) has simply been transferred into the problem of specifying another one. In practice, however, selecting ε is a much more intuitive procedure than guessing the fraction of outliers. Whereas the latter requires problem-specific prior knowledge, which is hardly available in practice, the former depends on only the variance of a linear model fit. Thus, ε can be specified in a problem-independent way. Note that a 100(1 − ε)% envelope defines a relative confidence criterion. Since we identify all objects outside the envelope as outliers and remove them from the model (see algorithm 1), it might be plausible to set ε ← ε/n, which defines an absolute criterion.

As described above, we use a robust fitting procedure for the linear quantile fit. To further diminish the influence of the outliers, the iterative exclusion and refitting procedure presented in algorithm 1 has been shown to be very successful.

Algorithm 1: Iterative Outlier Detection (Given Estimated Mahalanobis Distances)
repeat
    Fit a robust line into the quantile plot.
    Compute the 100(1 − (ε/n))% envelope.
    Among the objects within the upper quartile range of Mahalanobis distances, remove the one with the highest positive deviation from the upper envelope.
until no further outliers are present.

The OC-KFD model detects outliers by measuring deviations from gaussianity. The reader should notice, however, that for kernel maps, which transform the input data into a higher-dimensional space, a severe modeling problem occurs: in a strict sense, the gaussian assumption in the feature space will always be violated, since the transformed data lie on a d-dimensional submanifold. For regularized kernel models, the effective dimension of the feature space (measured by df) can be much lower than the original feature space dimension.
If the chosen model parameters induce a feature space where df does not exceed the input space dimension d, the gaussian model might still provide a plausible data description. We conclude that the user should be alarmed if the chosen model has df ≫ d. In such a case, the proposed outlier detection mechanism might produce unreliable results, since one expects large deviations from gaussianity anyway. The experiments presented in section 5, on the other hand, demonstrate that if a model with df ≈ d is selected, the OC-KFD approach successfully overcomes the problem of specifying the fraction of outliers in advance, which seems to be inherent in the ν-SVMs.
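Algorithm 1 can be sketched as follows. This is a simplified toy version with our own naming: it fits an ordinary least-squares line where the letter uses a robust Huber M-estimator, and relies on scipy for the χ² quantiles:

```python
import numpy as np
from scipy import stats

def qq_outliers(D, df, eps=0.1):
    """Iteratively remove the object with the highest positive deviation
    from the upper envelope of a line fitted in the chi^2 q-q plot
    (a non-robust sketch of algorithm 1)."""
    idx = list(np.argsort(D))          # object indices, distances ascending
    outliers = []
    while len(idx) >= 8:
        n = len(idx)
        d_sorted = np.array([D[i] for i in idx])
        p = (np.arange(1, n + 1) - 0.5) / n          # p_i = (i - 1/2)/n
        z = stats.chi2.ppf(p, df)                    # theoretical quantiles z_i
        b, a = np.polyfit(z, d_sorted, 1)            # linear quantile fit
        # standard error of the quantiles, equation 3.3
        se = b * np.sqrt(p * (1 - p) / n) / stats.chi2.pdf(z, df)
        zq = stats.norm.ppf(1 - (eps / n) / 2)       # 100(1 - eps/n)% envelope
        upper = a + b * z + zq * se
        # candidates live in the upper quartile of the distances
        cand = [j for j in range(3 * n // 4, n) if d_sorted[j] > upper[j]]
        if not cand:
            break
        worst = max(cand, key=lambda j: d_sorted[j] - upper[j])
        outliers.append(idx.pop(worst))
    return outliers

# Toy usage: chi-squared distances with one planted extreme value.
rng = np.random.default_rng(0)
D = np.append(rng.chisquare(5, size=199), 1000.0)
out = qq_outliers(D, df=5)
```

With the planted value far above the envelope, the first iteration removes it, and the loop terminates once no object in the upper quartile exceeds the refitted envelope.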
4 Model Selection

In our model, the data are first mapped into some feature space in which a gaussian model is fitted. Mahalanobis distances to the mean of this gaussian are computed by evaluating equation 2.6. The feature space mapping is implicitly defined by the kernel function, for which we assume that it is parameterized by a kernel parameter σ. For selecting all free parameters in equation 2.6, we are thus left with the problem of selecting θ = (σ, γ)ᵀ. The idea is now to select θ by maximizing the cross-validated likelihood. From a theoretical viewpoint, the cross-validated (CV) likelihood framework is appealing, since in van der Laan, Dudoit, and Keles (2004), the CV likelihood selector has been shown to asymptotically perform as well as the optimal benchmark selector, which characterizes the best possible model (in terms of Kullback-Leibler divergence to the true distribution) contained in the parametric family. For kernels that map into a space with dimension p > n, however, two problems arise: (1) the subspace spanned by the mapped samples varies with different sample sizes, and (2) not the whole feature space is accessible for vectors in the input space. As a consequence, it is difficult to find a proper normalization of the gaussian density in the induced feature space. This problem can be easily avoided by considering the likelihood in the input space rather than in the feature space; that is, we are looking for a properly normalized density model p(x|·) in some bounded subset S ⊂ ℝᵈ such that the contour lines of p(x|·) and the gaussian model in the feature space have the same shape: p(xᵢ|·) = p(xⱼ|·) ⇔ p(φ(xᵢ)|·) = p(φ(xⱼ)|·).² Denoting by Xₙ = {xᵢ}ᵢ₌₁ⁿ a sample from p(x) from which the kernel matrix K is built, a natural input space model is
pₙ(x | Xₙ, θ) = Z⁻¹ exp(−(1/2) D(x; Xₙ, θ)), with Z = ∫_S exp(−(1/2) D(x; Xₙ, θ)) dx,

(4.1)
where D(x; Xₙ, θ) denotes the (parameterized) Mahalanobis distances, equation 2.6, of a gaussian model in the feature space. Note that this density model in the input space has the same functional form as our gaussian model in the feature space, except for the different normalization constant Z. Only the interpretation of the models is different: the input space model is viewed as a function in x, whereas the feature space model is treated as a function in φ(x). The former can be viewed as a nonparametric density estimator (note that for RBF kernels, the functional
In order to guarantee integrability, we assume that the input density has a bounded support. Since in practice we have to approximate the integral by sampling anyway, this assumption does not limit the applicability of the proposed method.
form of the Mahalanobis distances in the exponent of equation 4.1 is closely related to a Parzen-window estimator). The feature-space model, on the other hand, defines a parametric density. Having selected the maximum likelihood model in the input space, the parametric form of the corresponding feature space model is then used for detecting atypical objects.

Computing this constant Z in equation 4.1 requires us to solve a normalization integral over the d-dimensional space S. Since in general this integral is not analytically tractable for nonlinear kernel models, we propose to approximate Z by a Monte Carlo sampling method. In our experiments, for instance, the VEGAS algorithm (Lepage, 1980), which implements a mixed importance-stratified sampling approach, was a reasonable method for up to 15 input dimensions. The term reasonable here is not of a qualitative nature, but refers solely to the time needed for approximating the integral with a sufficient precision. For the ten-dimensional examples presented in the next section, for instance, the sampling takes approximately 1 minute on a standard PC. The choice of the subset S on which the numerical integration takes place is somewhat arbitrary, but choosing S to be the smallest hypercube including all training data has been shown to be a reasonable strategy in the experiments.

5 Experiments

5.1 Detecting Outliers in Face Databases. In a first experiment the performance of the proposed method is demonstrated for an outlier detection task in the field of face recognition. The Olivetti face database³ contains 10 different images of each of 40 distinct subjects, taken under different lighting conditions and at different facial expressions and facial details (glasses/no glasses). None of the subjects, however, wears sunglasses. All the images are taken against a homogeneous background with the subjects in an upright, frontal position.
In this experiment, we additionally corrupted the data set by including two images in which we artificially changed normal glasses to “sunglasses,” as can be seen in Figure 1. The goal is to demonstrate that the proposed method is able to identify these two atypical images without any problem-dependent prior assumptions on the number of outliers or on their distribution. In order to exclude illumination differences, the images are standardized by subtracting the mean intensity and normalizing to unit variance. Each of the 402 images is then represented as a ten-dimensional vector that contains the projections onto the leading 10 eigenfaces (eigenfaces are simply the eigenvectors of the images treated as pixel-wise vectorial objects). From these vectors, an RBF kernel of the form k(xᵢ, xⱼ) = exp(−‖xᵢ − xⱼ‖²/σ) is built. In order to guarantee the reproducibility of the results, both the data
³ See http://www.uk.research.att.com/facedatabase.html.
Figure 1: Original and corrupted images with in-painted “sunglasses.”
set and an R-script for computing the OC-KFD model can be downloaded from www.inf.ethz.ch/personal/vroth/OC-KFD/index.html. Whereas our visual impression suggests that the “sunglasses” images might be easy to detect, the automated identification of these outliers based on the eigenface representation is not trivial: due to the high heterogeneity of the database, these images do not exhibit extremal coordinate values in the directions of the leading principal components. For the first principal direction, for example, the coordinate values of the two outlier images are −0.34 and −0.52, whereas the values of all images range from −1.5 to +1.1. Another indicator of the difficulty of the problem is that a one-class SVM has severe problems identifying the sunglasses images, as will be shown.

In a first step of the proposed procedure, the free model parameters are selected by maximizing the cross-validated likelihood. A simple two-fold cross-validation scheme is used: the data set is randomly split into a training set and a test set of equal size, the model is built from the training set (including the numerical solution of the normalization integral), and finally the likelihood is evaluated on the test set. This procedure is repeated for different values of (σ, γ). In order to simplify the selection process, we kept γ = 10⁻⁴ fixed and varied only σ. Both the test likelihood and the corresponding model complexity measured in terms of the effective degrees of freedom (df) are plotted in Figure 2 as a function of the (natural) logarithm of σ. One can clearly identify both an overfitting and an underfitting regime, separated by a broad plateau of models with similarly high likelihood. The df-curve shows a similar plateau, indicating that all these models have comparable complexity. This observation suggests that the results should be rather insensitive to variations of σ over values contained in this plateau.
This suggestion is indeed confirmed by the results in Figures 3 and 4, where we compared the quantile-quantile plots for different parameter values (marked as I to IV in Figure 2). The plots for models II and III look very similar, and in both cases, two objects clearly fall outside a 100(1 − 0.1/n)%-envelope around the linear fit. Outside the plateau, the number of objects considered as outliers drastically increases in the overfitting regime (model I, σ too small), or decreases to zero in the underfitting regime (model IV, σ too large). The upper-right panel in Figure 3
Figure 2: Selecting the kernel width σ by cross-validated likelihood (solid line). The dashed line shows the corresponding effective degrees of freedom (df).
shows that the two outliers found by the maximum-likelihood model II are indeed the two sunglasses images. Furthermore, we observe that the quantile plot, after removing the two outliers by way of algorithm 1, appears to be gaussian-like (bottom right panel). Despite the potential problems of fitting gaussian models in kernel-induced feature spaces, this observation may be explained by the similar degrees of freedom in the input space and the kernel space: the plateau of the likelihood curve in Figure 2 corresponds to approximately 10 to 11 effective degrees of freedom. In spite of the fact that the width of the maximum-likelihood RBF kernel is relatively large (σ = 2250), the kernelized model is still different from a standard linear model. Repeating the model selection experiment with a linear kernel for different values of the regularization parameter γ , the highest test likelihood is found for a model with df = 6.5 degrees of freedom. A reliable identification of the two sunglass images, however, is not possible with this model: one image falls clearly within the envelope, and the other only slightly exceeds the upper envelope. In order to compare the results with standard approaches to outlier detection, a one-class ν-SVM with RBF kernel is trained on the same data set. The main practical problem with the ν-SVM is the lack of a plausible selection criterion for both the σ - and the ν-parameter. Taking into account the conceptual similarity of the SVM approach and the proposed OC-KFD method, we decided to use the maximum likelihood kernel emerging from the above model selection procedure (model II in Figure 2). The choice of the ν-parameter that upper-bounds the fraction of outliers turned out to
Figure 3: Quantile-quantile plots for different models. (Left column) Overfitting model I from Figure 2, initial qq-plot (top) and final plot after subsequently removing the outliers (bottom). (Right column) Optimal model II.
be even more problematic: for different ν-values, the SVM model identifies roughly ν · 402 outliers in the data set (cf. Figure 5). Note that in the ν-SVM model, the quantity (ν · 402), where 402 is the size of the data set, provides an upper bound for the number of outliers. The observed almost linear increase of the detected outliers means, however, that the SVM model does not provide us with a plausible characterization of outliers. We basically “see” as many outliers as we have specified in advance by choosing ν. Furthermore, it is interesting to see that the SVM model has problems identifying the two sunglasses images: for ν = 0.0102, the SVM detects two outliers, which, however, do not correspond to the desired sunglass images (see the right panel of Figure 5). To find the sunglasses images within the
Figure 4: Quantile-quantile plots. (Left) Slightly suboptimal model III. (Right) Underfitting model IV.

Figure 5: One-class SVM results. (Left) Number of detected outliers versus ν-parameter. The solid line defines the theoretical upper bound ν · 402. (Right) The two outliers identified with ν = 0.0102.
outlier set, we have to increase ν to 0.026, which “produces” nine outliers in total. One might argue that the observed correlation between the ν parameter and the number of identified outliers would be an artifact of using a kernel width that was selected for a different method (i.e., OC-KFD instead of ν-SVM). When the SVM experiments are repeated with both a 10 times smaller width (σ = 225) and a 10 times larger width (σ = 22,500), one observes the same almost linear dependency of the number of outliers on ν. The problem of identifying the sunglass images remains, too: in all tested cases, the most
Figure 6: USPS data set. Cross-validated likelihoods as a function of the kernel parameter σ. Each curve corresponds to a separate digit.
dominant outlier is not a sunglass image. This observation indicates that, at least in this example, the problems of the ν-SVM are rather independent of the used kernel.

5.2 Handwritten Digits from the USPS Database. In a second experiment, the proposed method is applied to the USPS database of handwritten digits. The data are divided into a training set and a test set, each consisting of 16 × 16 gray-value images of handwritten digits from postal codes. It is well known that the test data set contains many outliers. The problem of detecting these outlier images has been studied before in the context of one-class SVMs (see Schölkopf & Smola, 2002). Whereas for the face data set, we used the unsupervised eigenface method (i.e., PCA) to derive a low-dimensional data representation, in this case we are given class labels, which allow us to employ a supervised projection method, such as LDA. For 10 classes, LDA produces a data description in a 9-dimensional subspace. For actually computing the projection, a penalized LDA model (Hastie et al., 1995) was fitted to the training set. Given the trained LDA model, the test set vectors were projected onto the subspace spanned by the nine LDA vectors. To each of the classes, we then fitted an OC-KFD outlier detection model. The test-likelihood curves for the individual classes are depicted in Figure 6. For most of the classes, the likelihood attains a maximum around σ ≈ 1300. The classes 1 and 7 require a slightly more complex model with σ ≈ 350. The maximum likelihood models correspond to
Figure 7: Outlier detection for the digit 9. The iteration terminates after excluding two images (top left → bottom left → top right panel).
approximately 9 to 11 effective degrees of freedom in the kernel space, which is not too far from the input space dimensionality. Once the optimal kernel parameters are selected, the outliers are detected by iteratively excluding the object with the highest deviation from the upper envelope around the linear quantile fit (cf. algorithm 1). For the digit 9, the individual steps of this iteration are depicted in Figure 7. The iteration terminates after excluding two outlier images. All remaining typical images with high Mahalanobis distances fall within the envelope (top right panel).

For the combined set of outliers for all digits, Figure 8 shows the first 18 outlier images, ordered according to their deviations from the quantile fits. Many European-style 1s and “crossed” 7s are successfully identified as atypical objects in this collection of U.S. postal codes. Moreover, some almost unreadable digits are detected, like the “0,” which has the form of a horizontal bar (middle row), or the “9” in the bottom row.

In order to compare the results with a standard technique, a one-class ν-SVM was also trained on the data. As in the previous experiment, the width of the RBF kernel was set to the maximum-likelihood value identified by the above model selection procedure. For the digit 9, the dependency of the number of identified outliers on the ν parameter is depicted in Figure 9. The almost linear dependency again emphasizes the problem that the SVM approach does not provide us with a meaningful characterization of outliers. Rather, one “sees” (roughly) as many outliers as specified in advance by
Kernel Fisher Discriminants for Outlier Detection
957
1
1
1
1
7
1
0
1
1
1
0
0
1
7
7
1
9
1
Figure 8: The first 18 detected outliers in the U.S. Postal Service test data set, ordered according to decreasing deviation from the quantile fits. The caption below each image shows the label provided by the database.
number of detected outliers
45 40 35 30 25 20 15 10 5 0
0
0.05
0.1
0.15
0.2
0.25
ν− parameter Figure 9: One-class SVM results for the digit 9. (Left) Number of detected outliers as a function of the ν-parameter. The solid line defines the theoretical upper bound ν · 177. (Right) The two outliers identified with ν = 0.03.
choosing ν. When setting ν = 0.03, the SVM identifies two outliers, which equals the number of outliers identified in the OC-KFD experiment. The two outlier images are depicted in the right panel. Comparing the results with those of the OC-KFD approach, we see that both methods identified the almost unreadable 9 as the dominating outlier.
5.3 Some Implementation Details. Presumably the easiest way of implementing the model is to carry out an eigenvalue decomposition of K. Both the effective degrees of freedom df = Σᵢ λᵢ/(λᵢ + γ) and the Mahalanobis distances in equation 2.6 can then be derived easily from this decomposition. For practical use, consider the pseudocode presented in algorithm 2. A complete R-script for computing the OC-KFD model can be downloaded from www.inf.ethz.ch/personal/vroth/OC-KFD/index.html.

Algorithm 2: OC-KFD
l_max ← −∞
for θ on a specified grid do
    Split data into two subsets X_train and X_test of size n/2.
    Compute kernel matrix K(X_train, σ), its eigenvectors V, and eigenvalues {λᵢ}.
    Compute α̂ = V diag{1/(λᵢ + γ)} Vᵀ y.
    Compute normalization integral Z by Monte Carlo sampling (see equation 4.1).
    Compute Mahalanobis distances by equations 2.6 and 3.2, and evaluate the log likelihood on the test set: l(X_test | θ) = Σᵢ −(1/2) D(xᵢ ∈ X_test | X_train, θ) − (n/2) ln(Z).
    if l(X_test | θ) > l_max then
        l_max = l(X_test | θ), θ_opt = θ.
    end if
end for
Given θ_opt, compute K(X, σ_opt), V, {λᵢ}.
Compute Mahalanobis distances and df (see equations 2.6 and 3.2).
Detect outliers using algorithm 1.
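A compact Python sketch of the algorithm 2 pipeline follows. All names are ours; plain uniform Monte Carlo integration over the bounding hypercube replaces the VEGAS sampler, and the grid and toy data are arbitrary:

```python
import numpy as np

# Sketch of algorithm 2: OC-KFD model selection by cross-validated likelihood.

def rbf(A, B, sigma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma)

def fit_ockfd(Xtr, sigma, gamma):
    """Return a function computing Mahalanobis distances (equation 2.6)."""
    n = len(Xtr)
    K = rbf(Xtr, Xtr, sigma)
    y = np.ones(n)
    alpha = np.linalg.solve(K + gamma * np.eye(n), y)      # equation 2.5
    s2 = y @ (K @ alpha) / n
    a_star = alpha / np.sqrt(s2 * (1.0 - s2))
    m_plus = np.sqrt(s2 / (1.0 - s2))
    M = np.linalg.solve(K + gamma * np.eye(n), np.linalg.pinv(K))

    def dist(x):
        k = rbf(x[None, :], Xtr, sigma).ravel()
        proj = a_star @ k
        return (proj - m_plus) ** 2 - (1.0 - s2) * proj ** 2 + n * (k @ M @ k)

    return dist

def cv_log_likelihood(X, sigma, gamma, n_mc=2000, seed=0):
    """Fit on one random half, evaluate the normalized input-space
    log likelihood (equation 4.1) on the other half."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    half = len(X) // 2
    Xtr, Xte = X[perm[:half]], X[perm[half:]]
    dist = fit_ockfd(Xtr, sigma, gamma)
    lo, hi = X.min(0), X.max(0)                    # bounding hypercube S
    samples = rng.uniform(lo, hi, size=(n_mc, X.shape[1]))
    Z = np.prod(hi - lo) * np.mean([np.exp(-0.5 * dist(s)) for s in samples])
    return sum(-0.5 * dist(x) for x in Xte) - len(Xte) * np.log(Z)

# Grid search over the kernel width, keeping gamma fixed as in section 5.1.
X = np.random.default_rng(1).standard_normal((60, 2))
ll_best, sigma_opt = max((cv_log_likelihood(X, s, 1e-3), s) for s in (0.5, 2.0, 8.0))
```

The final outlier-detection step would then refit the model with sigma_opt on all data and pass the resulting Mahalanobis distances to the algorithm 1 procedure.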
Kernel Fisher Discriminants for Outlier Detection
classifier directly related to gaussian density estimation in the induced feature space. The model benefits from both the built-in classification method and the explicit parametric density model in the feature space: from the former, it inherits the simple complexity-regulation mechanism based on only two free parameters. Moreover, within the classification framework, it is possible to quantify the model complexity in terms of the effective degrees of freedom df. The gaussian density model, on the other hand, makes it possible to derive a formal description of atypical objects by way of hypothesis testing: Mahalanobis distances are expected to follow a χ2 distribution in df dimensions, and deviations from this distribution can be quantified by confidence intervals around a fitted line in a quantile-quantile plot. Since the density model is parameterized by both the kernel function and the regularization constant, it is necessary to select these free parameters before the outlier detection phase. This parameter selection is achieved by observing the cross-validated likelihood for different parameter values and choosing the parameters that maximize this quantity. The theoretical motivation for this selection process follows from van der Laan et al. (2004), where it has been shown that the cross-validation selector asymptotically performs as well as the so-called benchmark selector, which selects the best model contained in the model family. The experiments on detecting outliers in image databases demonstrate that the proposed method is able to detect atypical objects without problem-specific prior assumptions on the expected fraction of outliers. This property constitutes a significant practical advantage over the traditional ν-SVM approach. The latter does not provide a plausible characterization of outliers: one "detects" (roughly) as many outliers as one has specified in advance by choosing ν.
Prior knowledge about ν, on the other hand, will hardly be available in general outlier-detection scenarios. In particular, the presented experiments demonstrate that the whole processing pipeline, consisting of model selection by cross-validated likelihood, fitting linear quantile-quantile models, and detecting outliers by considering confidence intervals around the fit, works very well in practical applications with reasonably small input dimensions. For input dimensions above roughly 15, the numerical solution of the normalization integral becomes rather time-consuming when using the VEGAS algorithm. Evaluating the usefulness of more sophisticated sampling methods, such as Markov chain Monte Carlo, for this particular task will be the subject of future work.

Acknowledgments

I thank the referees who helped to improve this letter. Special thanks go to Tilman Lange, Mikio Braun, and Joachim M. Buhmann for helpful discussions and suggestions.
References

Duda, R., Hart, P., & Stork, D. (2001). Pattern classification. Hoboken, NJ: Wiley.
Fox, J. (1997). Applied regression, linear models, and related methods. Thousand Oaks, CA: Sage.
Hastie, T., Buja, A., & Tibshirani, R. (1995). Penalized discriminant analysis. Annals of Statistics, 23, 73–102.
Huber, P. (1981). Robust statistics. Hoboken, NJ: Wiley.
Kendall, M., & Stuart, A. (1977). The advanced theory of statistics (Vol. 1). New York: Macmillan.
Lepage, G. (1980). Vegas: An adaptive multidimensional integration program (Tech. Rep. CLNS-80/447). Ithaca, NY: Cornell University.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K.-R. (1999). Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, & S. Douglas (Eds.), Neural networks for signal processing IX (pp. 41–48). Piscataway, NJ: IEEE.
Moody, J. (1992). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 847–854). Cambridge, MA: MIT Press.
Roth, V., & Steinhage, V. (2000). Nonlinear discriminant analysis using kernel functions. In S. Solla, T. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 568–574). Cambridge, MA: MIT Press.
Schölkopf, B., Mika, S., Burges, C., Knirsch, P., Müller, K.-R., Rätsch, G., & Smola, A. (1999). Input space vs. feature space in kernel-based methods. IEEE Trans. Neural Networks, 10(5), 1000–1017.
Schölkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (2000). SV estimation of a distribution's support. In S. Solla, T. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 582–588). Cambridge, MA: MIT Press.
Tax, D., & Duin, R. (1999). Support vector data description. Pattern Recognition Letters, 20(11–13), 1191–1199.
van der Laan, M., Dudoit, S., & Keles, S. (2004). Asymptotic optimality of likelihood-based cross-validation. Statistical Applications in Genetics and Molecular Biology, 3(1), art. 4.
Received January 27, 2005; accepted August 25, 2005.
LETTER
Communicated by Mario Figueiredo
Feature Scaling for Kernel Fisher Discriminant Analysis Using Leave-One-Out Cross Validation Liefeng Bo [email protected]
Ling Wang [email protected]
Licheng Jiao [email protected] Institute of Intelligent Information Processing, Xidian University, Xi’an 710071, China
Kernel Fisher discriminant analysis (KFD) is a successful approach to classification. It is well known that the key challenge in KFD lies in the selection of free parameters such as kernel parameters and regularization parameters. Here we focus on the feature-scaling kernel, in which each feature is individually associated with a scaling factor. A novel algorithm, named FS-KFD, is developed to tune the scaling factors and regularization parameters for the feature-scaling kernel. The proposed algorithm is based on optimizing the smooth leave-one-out error via a gradient-descent method and has been demonstrated to be computationally feasible. FS-KFD is motivated by two fundamental facts: the leave-one-out error of KFD can be expressed in closed form, and the step function can be approximated by a sigmoid function. Empirical comparisons on artificial and benchmark data sets suggest that FS-KFD improves KFD in terms of classification accuracy.
1 Introduction

Fisher linear discriminant analysis (Fisher, 1936; Fukunaga, 1990) is a classical classifier whose fundamental idea is to maximize the between-class scatter while simultaneously minimizing the within-class scatter. In many applications, Fisher linear discriminant analysis has proved to be very powerful. However, for real-world problems, linear discriminant analysis alone is not good enough. Mika, Rätsch, and Weston (1999) and Mika (2002) introduced a class of nonlinear Fisher discriminant analyses using the kernel trick, named KFD. Extensive empirical comparisons have shown that KFD is comparable to other kernel-based classifiers, such as support vector machines (SVMs) (Vapnik, 1995, 1998) and least-squares support vector machines (LS-SVMs) (Gestel et al., 2002; Suykens & Vandewalle, 1999).

Neural Computation 18, 961–978 (2006)
© 2006 Massachusetts Institute of Technology
For kernel-based learning algorithms, the key challenge lies in the selection of kernel parameters and regularization parameters. Many researchers have identified this problem and tried to solve it. Weston et al. (2001) performed feature selection for SVMs by combining the feature-scaling technique with a leave-one-out error bound. Chapelle, Vapnik, Bousquet, and Mukherjee (2002) tuned multiple parameters for two-norm SVMs by minimizing the radius-margin bound or the span bound. Ong and Smola (2003) applied semidefinite programming to learn the kernel function via hyperkernels. Lanckriet, Cristianini, Bartlett, Ghaoui, and Jordan (2004) designed the kernel matrix directly by semidefinite programming. All of these algorithms have proved to be effective and have boosted the development of this field. We focus here on tuning the scaling factors of the feature-scaling kernel (Williams & Barber, 1998; Krishnapuram, Hartemink, Carin, & Figueiredo, 2004). Two of the most popular feature-scaling kernels are the polynomial kernel and the gaussian kernel, as given below:

$$K_\theta(x_i, x_j) = \left(1 + \sum_{k=1}^{d} \theta_k x_i^{(k)} x_j^{(k)}\right)^{r}, \tag{1.1}$$

$$K_\theta(x_i, x_j) = \exp\left(-\sum_{k=1}^{d} \theta_k \left(x_i^{(k)} - x_j^{(k)}\right)^2\right). \tag{1.2}$$
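As a concrete sketch, equations 1.1 and 1.2 can be implemented in a few lines of NumPy; the sample vectors, the θ values, and the exponent r = 2 below are made-up illustrations, not part of the article:

```python
import numpy as np

def poly_fs_kernel(xi, xj, theta, r=2):
    # Equation 1.1: (1 + sum_k theta_k x_i^(k) x_j^(k))^r
    return (1.0 + np.sum(theta * xi * xj)) ** r

def gauss_fs_kernel(xi, xj, theta):
    # Equation 1.2: exp(-sum_k theta_k (x_i^(k) - x_j^(k))^2)
    return np.exp(-np.sum(theta * (xi - xj) ** 2))

theta = np.array([1.0, 0.5, 0.0])  # a zero scaling factor switches feature 3 off
a = np.array([1.0, 2.0, 9.0])
b = np.array([1.0, 2.0, -4.0])
# Feature 3 differs wildly, but theta_3 = 0 makes the kernel ignore it:
print(gauss_fs_kernel(a, b, theta))  # 1.0
```

Setting a scaling factor to zero makes the corresponding feature invisible to the kernel, which is exactly the mechanism FS-KFD exploits to suppress irrelevant features.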
In a feature-scaling kernel, each feature has its own scaling factor. If some feature is insignificant or irrelevant for classification, the associated scaling factor will be set smaller; otherwise, it will be set larger. Cawley and Talbot (2003) gave a closed form of the leave-one-out error of KFD and demonstrated that it was superior to n-fold cross-validation error in terms of computational complexity. Motivated by this fact, we develop a novel algorithm, named FS-KFD, to tune multiple parameters for the feature-scaling kernel. FS-KFD is constructed in two steps: replacing the step function in the leave-one-out error with a sigmoid function and then optimizing the resulting smooth leave-one-out error via a gradient-descent algorithm. In FS-KFD, all the free parameters are analytically chosen, so the learning process is fully automatic. Extensive experimental comparisons show that FS-KFD improves the performance of KFD in the presence of many irrelevant features and obtains good classification accuracy. The remainder of the letter is organized as follows. In section 2, kernel Fisher discriminant analysis is briefly reviewed. The expressions for the smooth leave-one-out error and for its derivative are given in section 3. FS-KFD is extended to multiclass classification in section 4. In section 5, the experimental results are reported. The direction of future research is indicated in section 6.
2 Kernel Fisher Discriminant Analysis

For real-world problems, linear discriminant analysis alone is not enough. Mika et al. constructed linear discriminant analysis in the feature space induced by a Mercer kernel, thus implicitly yielding a nonlinear discriminant analysis in the input space. The resulting model is named KFD, in which two scatter matrices, the between-class scatter matrix and the within-class scatter matrix, are defined by

$$S_b^F = \left(m_1^F - m_2^F\right)\left(m_1^F - m_2^F\right)^T \quad\text{and}\quad S_w^F = \sum_{i=1}^{2}\sum_{j=1}^{l_i}\left(\Phi\left(x_j^i\right) - m_i^F\right)\left(\Phi\left(x_j^i\right) - m_i^F\right)^T,$$

where $\Phi$ is the feature map and the mean of the $i$th class is $m_i^F = \frac{1}{l_i}\sum_{j=1}^{l_i}\Phi(x_j^i)$. An optimal transformation $w$ is given by maximizing the between-class scatter while simultaneously minimizing the within-class scatter:

$$\max_{w}\ \frac{w^T S_b^F w}{w^T S_w^F w}. \tag{2.1}$$

In terms of reproducing kernel theory (Aronszajn, 1950), $w$ can be formulated as $w = \sum_{j=1}^{l}\alpha_j\Phi(x_j)$. With equation 2.1, we can calculate $\alpha$ by

$$\max_{\alpha}\ \frac{\alpha^T \bar{S}_b^F \alpha}{\alpha^T \bar{S}_w^F \alpha}, \tag{2.2}$$

where

$$\bar{S}_b^F = \left(\bar{m}_1^F - \bar{m}_2^F\right)\left(\bar{m}_1^F - \bar{m}_2^F\right)^T \quad\text{and}\quad \bar{S}_w^F = \sum_{i=1}^{2}\sum_{j=1}^{l_i}\left(\beta_j^i - \bar{m}_i^F\right)\left(\beta_j^i - \bar{m}_i^F\right)^T,$$

with $\beta_j^i = \left[K\left(x_1, x_j^i\right), \ldots, K\left(x_l, x_j^i\right)\right]^T$ and $\bar{m}_i^F = \frac{1}{l_i}\left[\sum_{j=1}^{l_i} K\left(x_1, x_j^i\right), \ldots, \sum_{j=1}^{l_i} K\left(x_l, x_j^i\right)\right]^T$. It can be seen that KFD is equivalent to finding the leading eigenvector of $\left(\bar{S}_w^F\right)^{-1}\bar{S}_b^F$. To improve numerical stability and generalization ability, we replace $\bar{S}_w^F$ with $\bar{S}_w^F + \lambda I$, where $\lambda$ is a regularization constant and $I$ is an identity matrix. For a new sample $x$, we can predict its label by

$$g(x) = \operatorname{sgn}\left(\left(w \cdot \Phi(x)\right) + b\right) = \operatorname{sgn}\left(\sum_{j=1}^{l}\alpha_j K\left(x_j, x\right) + b\right), \tag{2.3}$$

where $b = -\alpha^T\,\frac{l_1 \bar{m}_1^F + l_2 \bar{m}_2^F}{l}$.
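A minimal NumPy sketch of the training and prediction rules above; the helper names are invented, and the closed-form solve exploits the rank-one structure of the between-class scatter, so the leading eigenvector reduces to a single linear system:

```python
import numpy as np

def kfd_fit(K, y, lam=1e-3):
    """Minimal KFD sketch from a precomputed kernel matrix K and labels
    y in {+1, -1}: solves equation 2.2 with the regularized within-class
    scatter S_w + lam*I. Because S_b is rank one, the leading eigenvector
    of S_w^{-1} S_b is proportional to S_w^{-1}(m1 - m2)."""
    l = len(y)
    idx1, idx2 = np.where(y == 1)[0], np.where(y == -1)[0]
    m1 = K[:, idx1].mean(axis=1)            # class-1 mean vector
    m2 = K[:, idx2].mean(axis=1)            # class-2 mean vector
    Sw = np.zeros((l, l))
    for idx, m in ((idx1, m1), (idx2, m2)):
        B = K[:, idx] - m[:, None]          # centered columns beta_j^i
        Sw += B @ B.T
    Sw += lam * np.eye(l)                   # regularization
    alpha = np.linalg.solve(Sw, m1 - m2)
    b = -alpha @ (len(idx1) * m1 + len(idx2) * m2) / l
    return alpha, b

def kfd_predict(K_new, alpha, b):
    # Equation 2.3: rows of K_new hold K(x_j, x) for each new sample x.
    return np.sign(K_new @ alpha + b)
```

On two well-separated clusters with a gaussian kernel, this recovers the training labels exactly; the eigenvector route of the article gives the same direction up to scale.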
3 Optimization of the Smooth Leave-One-Out Cross-Validation Error

Let us denote the leave-one-out error by $\mathcal{L}(x_1, y_1, \ldots, x_l, y_l)$. It is well known that the leave-one-out error is almost an unbiased estimate of the expected generalization error.

Lemma 1 (Luntz & Brailovsky, 1969; Schölkopf & Smola, 2002).

$$E\,p_{error}^{\,l-1} = E\left[\frac{1}{l}\,\mathcal{L}(x_1, y_1, \ldots, x_l, y_l)\right],$$

where $p_{error}^{\,l-1}$ is the probability of test error for the model trained on samples of size $l-1$ and the expectations are taken over the random choice of the samples.
This lemma suggests that the leave-one-out error is a good estimate of the generalization error. However, leave-one-out cross validation is rarely adopted in medium- or large-scale applications due to its high computational cost: it requires running the training algorithm $l$ times. The training algorithms for kernel machines, such as KFD, typically require a computational cost of $O(l^3)$. In this case, the computational cost of the leave-one-out cross-validation procedure is $O(l^4)$, which quickly becomes intractable as the number of training samples increases. Fortunately, there exists a computationally efficient implementation of the leave-one-out cross-validation procedure in KFD, which incurs a computational cost of only $O(l^3)$. Xu, Zhang, and Li (2001) showed that KFD is equivalent to minimizing the following loss function,

$$f(\bar\alpha) = \bar\alpha^T\left(C^T C + \lambda U\right)\bar\alpha - 2\bar\alpha^T C^T y + y^T y, \tag{3.1}$$

where $\bar\alpha = \begin{bmatrix}\alpha \\ b\end{bmatrix}$, $C = [K \;\; \mathbf{1}]$, $U = \begin{bmatrix} I & 0 \\ 0^T & 0 \end{bmatrix}$, and $I$ denotes the unit matrix. Let $g_i(x)$ be the $i$th kernel Fisher classifier constructed from the data set excluding the $i$th training sample. Defining the residual error by $r_i = y_i - g_i(x_i)$ for the $i$th training sample, Cawley and Talbot (2003) demonstrated the following:

Lemma 2. $r = \left((I - H)y\right) \oslash \left(\mathbf{1} - D(H)\right)$, where $H = C\left(C^T C + \lambda U\right)^{-1}C^T$, $D(H)$ denotes the diagonal elements of $H$, and $\oslash$ denotes element-wise division.

A straightforward corollary of lemma 2 is that the leave-one-out error of KFD can be computed at a cost of $O(l^3)$. This indicates that it is feasible to apply leave-one-out model selection to a medium-size problem. In the following, we discuss the smooth leave-one-out error derived by replacing the step function with a sigmoid function. According to lemma 2, the leave-one-out error of KFD is given by

$$loo(\theta, \lambda) = \frac{1}{l}\sum_{i=1}^{l}\frac{1 - y_i\,\mathrm{sign}(y_i - r_i)}{2}, \tag{3.2}$$
where $\mathrm{sign}(a)$ is 1 if $a \ge 0$; otherwise, $\mathrm{sign}(a)$ is $-1$. From equation 3.2, we observe that there is a step function $\mathrm{sign}(\cdot)$ in $loo(\theta, \lambda)$, implying that it is not differentiable. In order to use a gradient-descent method to minimize this estimate, we approximate the step function by a sigmoid function,

$$\tanh(\gamma t) = \frac{\exp(\gamma t) - \exp(-\gamma t)}{\exp(\gamma t) + \exp(-\gamma t)}, \tag{3.3}$$

where we set $\gamma$ to be 10. Then the smooth leave-one-out error can be expressed as

$$loo(\theta, \lambda) = \frac{1}{l}\sum_{i=1}^{l}\frac{1 - y_i\tanh\left(\gamma(y_i - r_i)\right)}{2}. \tag{3.4}$$
Figure 1 shows the leave-one-out error and the smooth leave-one-out error on the Breast Cancer data set. It can be seen from Figure 1 that the smooth leave-one-out error successfully follows the trend of the leave-one-out error. Thus, we can expect that a small smooth leave-one-out error guarantees good generalization ability. According to the chain rule, the derivative of $loo(\theta, \lambda)$ is formulated as

$$\frac{\partial\,loo(\theta,\lambda)}{\partial\theta_k} = \frac{\partial\,loo(\theta,\lambda)}{\partial r^T}\,\frac{\partial r}{\partial\theta_k}. \tag{3.5}$$

It follows that we need only calculate $\partial\,loo(\theta,\lambda)/\partial r^T$ and $\partial r/\partial\theta_k$, respectively. With $\partial\tanh(t)/\partial t = \mathrm{sech}^2(t)$, we have

$$\frac{\partial\,loo(\theta,\lambda)}{\partial r^T} = \left(\frac{\gamma\,y \otimes \mathrm{sech}^2\left(\gamma(y - r)\right)}{2l}\right)^T, \tag{3.6}$$

where $\otimes$ denotes an element-wise product. The derivative of $r$ with respect to $\theta_k$ is given by

$$\frac{\partial r}{\partial\theta_k} = -\frac{\partial H}{\partial\theta_k}\,y \oslash \left(\mathbf{1} - D(H)\right) + \left((I - H)y\right) \oslash \left(\mathbf{1} - D(H)\right) \oslash \left(\mathbf{1} - D(H)\right) \otimes D\!\left(\frac{\partial H}{\partial\theta_k}\right). \tag{3.7}$$
Figure 1: (A) Variation of the leave-one-out error with log2(λ) on the Breast data set. (B) Variation of the smooth leave-one-out error with log2(λ) on the Breast data set.
The derivative of $H$ with respect to $\theta_k$ is given by

$$\frac{\partial H}{\partial\theta_k} = \frac{\partial C}{\partial\theta_k}\left(C^TC + \lambda U\right)^{-1}C^T + C\,\frac{\partial\left(C^TC + \lambda U\right)^{-1}}{\partial\theta_k}\,C^T + C\left(C^TC + \lambda U\right)^{-1}\frac{\partial C^T}{\partial\theta_k}. \tag{3.8}$$

Now let us focus on computing $\partial\left(C^TC + \lambda U\right)^{-1}/\partial\theta_k$. A good solution is based on the equality $T^{-1}T = I$ (Bengio, 2000). Differentiating both sides of the equation with respect to $\theta_k$ and then isolating $\partial T^{-1}/\partial\theta_k$, we have

$$\frac{\partial T^{-1}}{\partial\theta_k} = -T^{-1}\,\frac{\partial T}{\partial\theta_k}\,T^{-1}. \tag{3.9}$$

Substituting $C^TC + \lambda U$ for $T$, we have

$$\frac{\partial\left(C^TC + \lambda U\right)^{-1}}{\partial\theta_k} = -\left(C^TC + \lambda U\right)^{-1}\frac{\partial\left(C^TC + \lambda U\right)}{\partial\theta_k}\left(C^TC + \lambda U\right)^{-1} = -\left(C^TC + \lambda U\right)^{-1}\left(\frac{\partial C^T}{\partial\theta_k}C + C^T\frac{\partial C}{\partial\theta_k}\right)\left(C^TC + \lambda U\right)^{-1}. \tag{3.10}$$

Combining equations 3.5, 3.6, 3.7, 3.8, and 3.10, we can compute the derivative of the smooth leave-one-out error with respect to $\theta_k$. The derivative of $H$ with respect to $\lambda$ is given by

$$\frac{\partial H}{\partial\lambda} = -C\left(C^TC + \lambda U\right)^{-1}U\left(C^TC + \lambda U\right)^{-1}C^T. \tag{3.11}$$

So we can compute the derivative of $loo(\theta, \lambda)$ with respect to $\lambda$ in a similar manner. From the derivation, it can easily be verified that the computational complexity of FS-KFD is

$$\#(\text{Iterations}) \times \#(\text{free parameters}) \times l^3. \tag{3.12}$$
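Equation 3.9, and hence equation 3.10, is easy to confirm with a finite-difference check; the sketch below is a hypothetical numerical sanity test, not part of the algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
B = rng.normal(size=(4, 4))      # the direction dT/dtheta_k
T = A @ A.T + np.eye(4)          # a well-conditioned invertible T

# Analytic derivative from equation 3.9: dT^{-1} = -T^{-1} (dT) T^{-1}.
Tinv = np.linalg.inv(T)
analytic = -Tinv @ B @ Tinv

# Central finite difference of T^{-1} along the direction B.
eps = 1e-6
numeric = (np.linalg.inv(T + eps * B) - np.linalg.inv(T - eps * B)) / (2 * eps)
```

The two matrices agree to several digits, which is all the identity promises for a smooth perturbation of an invertible matrix.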
4 Extension to Multiclass Classification

In this section, we extend FS-KFD to multiclass classification using the one-against-all scheme, which has been independently devised by several researchers. Rifkin and Klautau (2004) carefully compared the one-against-all scheme with some other popular multiclass schemes and concluded that it is as accurate as any other scheme if the underlying binary classifiers are well-tuned, regularized classifiers. One-against-all reduces a $c$-class problem to $c$ binary problems. For the $s$th binary problem, all samples labeled $y_i = s$ are considered positive samples and the others negative samples. To predict a new sample, the $c$ classifiers are run, and the classifier that outputs the largest value is chosen. Let $g^{(s)}(x_i)$ denote the output of the $s$th binary classifier on a sample $x_i$. According to the one-against-all scheme, the predicted label for $x_i$ is

$$\hat y_i = \arg\max_{s\in\{1,\ldots,c\}} g^{(s)}(x_i). \tag{4.1}$$

Thus, the leave-one-out error of multiclass classification can be written as

$$mloo(\theta, \lambda) = \frac{1}{l}\sum_{i=1}^{l}\left[1 - \mathrm{equal}\left(y_i,\ \arg\max_s g^{(s)}(x_i)\right)\right], \tag{4.2}$$
where $\mathrm{equal}(a, b) = 1$ if $a = b$; otherwise $\mathrm{equal}(a, b) = 0$, and $y_i \in \{1, 2, \ldots, c\}$. It becomes intractable to approximate equation 4.2 by a sigmoid function due to the discontinuity of the inner function $\arg\max_s g^{(s)}(x_i)$. In the following, we consider an alternative strategy where an upper bound on the leave-one-out error of multiclass classification is optimized.

Theorem 1. Let $loo^{(s)}$ denote the leave-one-out error of the $s$th binary classifier. If the one-against-all scheme is used, the following inequality holds:

$$mloo \le \sum_{s=1}^{c} loo^{(s)}. \tag{4.3}$$
Proof. If all $c$ binary classifiers classify the sample $x_i$ correctly, we have

$$y_i^{(s)}\, g^{(s)}(x_i) > 0, \quad s = 1, \ldots, c, \tag{4.4}$$

where $y_i^{(s)} = 1$ if $y_i = s$; otherwise $y_i^{(s)} = -1$. Inequality 4.4 can be further simplified to

$$g^{(y_i)}(x_i) > 0 \quad\text{and}\quad g^{(s)}(x_i) < 0 \ \ \text{for } s \ne y_i. \tag{4.5}$$

Since only the output of the $y_i$th classifier is greater than zero, we have

$$\arg\max_s g^{(s)}(x_i) = y_i. \tag{4.6}$$

This means that if all $c$ binary classifiers classify the sample $x_i$ correctly, the final multiclass classifier also classifies $x_i$ correctly. The contrapositive is that if the multiclass classifier classifies $x_i$ incorrectly, there exists at least one binary classifier misclassifying $x_i$. This completes the proof of theorem 1.

This theorem allows us to control the leave-one-out error of multiclass classification by controlling the sum of the leave-one-out errors of all the binary classifiers. Three multiclass schemes can be derived by considering whether the kernel parameters and the regularization parameters are shared by all the binary classifiers. In the first scheme, all the binary classifiers share the kernel parameters and the regularization parameters (Hsu & Lin, 2002; Rifkin & Klautau,
2004). The sum of the smooth leave-one-out errors of the $c$ binary classifiers can be formulated as

$$sloo(\theta, \lambda) = \sum_{s=1}^{c} loo^{(s)}(\theta, \lambda). \tag{4.7}$$

$loo^{(s)}(\theta, \lambda)$ can be expanded into

$$loo^{(s)}(\theta, \lambda) = \frac{1}{l}\sum_{i=1}^{l}\frac{1 - y_i^{(s)}\tanh\left(\gamma\left(y_i^{(s)} - r_i^{(s)}\right)\right)}{2}, \tag{4.8}$$
where $r_i^{(s)}$ is the residual error on the $i$th sample for the $s$th binary problem. The derivative of $sloo(\theta, \lambda)$ with respect to $\theta_k$ is given by

$$\frac{\partial\,sloo(\theta,\lambda)}{\partial\theta_k} = \sum_{s=1}^{c}\frac{\partial\,loo^{(s)}(\theta,\lambda)}{\partial \left(r^{(s)}\right)^T}\,\frac{\partial r^{(s)}}{\partial\theta_k}, \tag{4.9}$$

where

$$\frac{\partial\,loo^{(s)}(\theta,\lambda)}{\partial \left(r^{(s)}\right)^T} = \left(\frac{\gamma\,y^{(s)} \otimes \mathrm{sech}^2\left(\gamma\left(y^{(s)} - r^{(s)}\right)\right)}{2l}\right)^T, \tag{4.10}$$

$$\frac{\partial r^{(s)}}{\partial\theta_k} = -\frac{\partial H}{\partial\theta_k}\,y^{(s)} \oslash \left(\mathbf{1} - D(H)\right) + \left((I - H)y^{(s)}\right) \oslash \left(\mathbf{1} - D(H)\right) \oslash \left(\mathbf{1} - D(H)\right) \otimes D\!\left(\frac{\partial H}{\partial\theta_k}\right). \tag{4.11}$$

Thus, we can compute the derivative of $sloo(\theta, \lambda)$ with respect to $\theta_k$ by combining equations 4.9, 4.10, and 4.11. The derivative of $sloo(\theta, \lambda)$ with respect to $\lambda$ can be computed in a similar manner. It is easily checked that the computational complexity of this multiclass scheme is the same as that of FS-KFD for binary classification, since all the binary classifiers share $H$. In the second scheme, only the kernel parameters are shared. As a result, the binary classifiers no longer share $H$ due to the difference among the regularization parameters. The computational complexity of this scheme becomes

$$c \times \#(\text{Iterations}) \times \#(\text{free parameters}) \times l^3. \tag{4.12}$$
In the third scheme, the kernel parameters and the regularization parameters are not shared. Therefore, we independently optimize the free parameters of each binary classifier. The computational complexity of this scheme is the same as that of the second one.
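In all three schemes, prediction follows the one-against-all rule of equation 4.1; a minimal sketch, where the function name and score matrix are made up:

```python
import numpy as np

def ova_predict(scores):
    """One-against-all prediction (equation 4.1): scores is an (n, c) array
    whose columns are the binary outputs g^(s)(x_i); the classifier with the
    largest output wins. Class labels are 1..c."""
    return np.argmax(scores, axis=1) + 1

scores = np.array([[ 0.9, -0.3, -0.8],
                   [-0.5,  0.2, -0.1],
                   [-0.7, -0.6, -0.2]])  # row 3: all outputs negative; the
                                         # least negative classifier still wins
print(ova_predict(scores))               # [1 2 3]
```

Note that the rule assigns a label even when every binary classifier rejects the sample, which is exactly why theorem 1 gives only an upper bound on the multiclass error.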
5 Performance Comparison

In order to demonstrate the effectiveness of FS-KFD, we compare its performance with those of KFD, SVMs, and k-nearest neighbors (KNN) (Lowe, 1995) on an artificial XOR problem, benchmark data sets from the UCI Machine Learning Repository (Blake & Merz, 1998), and a radar target recognition problem. All the algorithms were implemented in MATLAB 7.0, and all the experiments were run on a personal computer with a 2.4 GHz P4 processor, 2 GB memory, and the Windows XP operating system. Unless otherwise specified, the FS-KFD mentioned in the following uses the gaussian kernel. For FS-KFD, a gradient-descent method is used to search for the optimal values of the free parameters, so one needs to choose good optimization software; we recommend using an available optimization package to avoid numerical problems. Here we use the function fminunc in the optimization toolbox of MATLAB, which implements the BFGS quasi-Newton algorithm for medium-scale problems. The maximum number of iterations allowed is set to 50, the termination tolerance on the function value and the variables is set to 0.0001, and a cubic polynomial line search is used to find the optimal step size. To avoid adding positivity constraints to the optimization problem, we use the parameterization $\beta = (\log(\theta), \log(\lambda))$. The initial values of the scaling factors and regularization parameters are $\log(\theta) = \log\left(\frac{1}{d}\right)\mathbf{1}$ and $\log(\lambda) = 0$, respectively, where $d$ is the feature dimensionality. In general, choosing the optimal value for $\gamma$ is difficult. Throughout the article, $\gamma$ is set to 10. We have found that the same setting works well for various data sets. We can also try several different values for $\gamma$ and choose the one leading to the smallest leave-one-out error.
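The log-parameterization can be illustrated with a small unconstrained descent in β-space; plain numerical gradient descent stands in for MATLAB's fminunc here, and the quadratic objective is a made-up surrogate for the smooth leave-one-out error:

```python
import numpy as np

def objective(theta, lam):
    # Hypothetical stand-in for loo(theta, lam); its minimum sits at
    # theta = [1, 2] and lam = 0.5.
    return np.sum((theta - np.array([1.0, 2.0])) ** 2) + (lam - 0.5) ** 2

d = 2
beta = np.concatenate([np.log(np.full(d, 1.0 / d)), [0.0]])  # initial values
eps, step = 1e-6, 0.05

for _ in range(500):
    # Numerical gradient in beta-space; exponentiating keeps theta and lam
    # positive without explicit constraints.
    grad = np.zeros_like(beta)
    for k in range(len(beta)):
        bp, bm = beta.copy(), beta.copy()
        bp[k] += eps
        bm[k] -= eps
        grad[k] = (objective(np.exp(bp[:d]), np.exp(bp[d]))
                   - objective(np.exp(bm[:d]), np.exp(bm[d]))) / (2 * eps)
    beta -= step * grad

theta_opt, lam_opt = np.exp(beta[:d]), np.exp(beta[d])
```

Whatever values β takes during the search, exp(β) is positive, which is the whole point of optimizing in the log domain.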
5.1 Artificial XOR Problem. This experiment aims at validating the robustness of FS-KFD against the inclusion of irrelevant features. To this end, a variant of XOR is constructed, with each feature drawn from a uniform distribution on the interval $[-1, 1]$. Regardless of the feature dimensionality $d$, the output label for a given data point depends on only the first two features and is defined as

$$y = \begin{cases} +1 & \text{if } x_1 x_2 \ge 0 \\ -1 & \text{otherwise} \end{cases}, \qquad x_1, x_2 \in U(-1, +1). \tag{5.1}$$
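Equation 5.1's data-generating process as a short NumPy sketch; the function name and seeds are assumptions:

```python
import numpy as np

def make_xor(n, d, rng):
    """Equation 5.1: uniform features on [-1, 1]; the label depends only on
    the first two features, so the remaining d - 2 features are irrelevant."""
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    y = np.where(X[:, 0] * X[:, 1] >= 0, 1, -1)
    return X, y

rng = np.random.default_rng(3)
X_train, y_train = make_xor(200, 20, rng)   # 18 irrelevant features, as in Figure 2B
X_test, y_test = make_xor(5000, 20, rng)
```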
Figure 2: (A) Variation of the errors of FS-KFD and KFD with the dimensionality. (B) Scaling factors with the dimensionality d = 20.
This suggests that there exist d − 2 irrelevant features for the data with d features. The optimal decision function of this problem is nonlinear, and the highest recognition rate of linear classifiers is only 66.67%. FS-KFD and KFD are constructed on a training set with 200 samples and tested on an independent test set with 5000 samples. The results are averaged over 10 random realizations. To study how the errors of FS-KFD and KFD scale with the feature dimensionality, we sequentially increase the feature dimensionality from 2 to 20 at an interval of 2. The plots of the errors of the two algorithms as a function of the feature dimensionality are shown in Figure 2A. The scaling factors with the dimensionality d = 20 are shown in Figure 2B. From Figure 2A, we observe that FS-KFD is much more robust to the addition of irrelevant features than KFD. Furthermore, the
Table 1: Information on Benchmark Data Sets.

Problem     Training/Test   Class   Attribute
Breast      400/299         2       9
German      600/400         2       20
Liver       200/145         2       6
Diabetes    400/368         2       8
Vote        250/185         2       16
Glass       150/64          6       9
Yeast       100/108         5       79
Splice      500/1675        3       240
Segment     500/1810        7       18
Vehicle     500/346         4       18
feature selection ability of FS-KFD is clearly exhibited in Figure 2B. The scaling factors corresponding to the relevant features are significantly larger than those corresponding to the irrelevant features. The rapid performance degradation of KFD suggests that the feature-scaling technique is indeed necessary in the presence of many irrelevant features.

5.2 Benchmark Comparison. The purpose of this experiment is to compare FS-KFD with KFD, SVM, and KNN on a collection of benchmark data sets from the UCI Machine Learning Repository. These data sets have been extensively used in testing the performance of diverse kinds of learning algorithms. Information on these benchmark data sets is summarized in Table 1. The sizes of the training and test sets are shown in the second column of Table 1. For each training-test pair, the training samples are scaled to zero mean and unit variance, and the test samples are adjusted using the same linear transformation. The final errors, reported in Tables 2 and 3, are averaged over 10 random splits of the full data sets. Note that all model selection procedures are performed independently for each training-test pair, so that the standard error of the mean includes the variability due to the sensitivity of the model selection criterion to the partitioning of the data. The detailed experimental setups are summarized as follows:

1. For KFD, the leave-one-out error is used for model selection. We perform a grid search on the intervals log2(θ) = [−12, −10, . . . , 4] and log2(λ) = [−10, −9, . . . , 1]. Three possible multiclass schemes are considered: KFD with shared kernel parameters and regularization parameters, KFD with only shared kernel parameters, and KFD without shared free parameters.
Table 2: Mean and Variance of Test Errors Obtained by FS-KFD, KFD, Span Bound–Based SVM, and KNN.

Problem     FS-KFD(1)       KFD(1)          SVM(Span)       KNN
Breast      4.05 ± 0.71     4.11 ± 0.77     4.45 ± 0.76     3.85 ± 1.01
German      24.75 ± 1.88    23.35 ± 2.74    24.22 ± 2.19    27.35 ± 2.10
Diabetes    24.67 ± 1.75    23.45 ± 2.05    24.86 ± 1.59    26.68 ± 2.06
Liver       30.14 ± 5.34    29.72 ± 5.16    31.72 ± 5.26    39.66 ± 4.09
Vote        5.14 ± 1.40     5.62 ± 1.98     5.08 ± 1.86     7.08 ± 1.83
Glass       32.81 ± 9.63    33.28 ± 7.51    32.97 ± 6.93    31.87 ± 5.95
Splice      6.33 ± 1.27     6.90 ± 1.09     6.91 ± 0.61     10.32 ± 1.06
Yeast       5.83 ± 1.80     5.85 ± 2.02     6.67 ± 1.79     8.89 ± 2.23
Segment     4.59 ± 0.67     7.87 ± 0.80     6.57 ± 1.25     8.25 ± 1.03
Vehicle     17.72 ± 2.21    20.17 ± 2.00    17.05 ± 2.38    31.56 ± 1.83

Notes: FS-KFD(1) denotes FS-KFD with shared kernel parameters and regularization parameters. KFD(1) denotes KFD with shared kernel parameters and regularization parameters.
2. For SVM, the span bound (Vapnik & Chapelle, 2000) is used to optimize the kernel parameters and the regularization parameters. The initial setup is the same as in FS-KFD.

3. For KNN, the leave-one-out error is used to find the best number of neighbors k. We consider 50 different values from the interval [1, . . . , l − 1] (uniformly in logarithm) (Rätsch, 2001), where l is the size of the training set.

Two-tailed t-tests with a significance level of 0.05 are performed to determine whether there is a significant difference between FS-KFD and the other algorithms. The conclusions are summarized as follows. FS-KFD is significantly better than KFD on the Segment and Vehicle data sets. On the remaining data sets, FS-KFD and KFD achieve similar performance.

Table 3: Mean and Variance of Test Errors Obtained by FS-KFD and KFD.

Problem     FS-KFD(2)       FS-KFD(3)      KFD(2)         KFD(3)
Glass       34.53 ± 11.13   31.87 ± 8.31   33.44 ± 9.44   31.71 ± 9.58
Splice      6.16 ± 1.19     5.87 ± 1.00    6.95 ± 0.93    6.71 ± 0.93
Yeast       6.29 ± 2.61     6.67 ± 2.38    6.48 ± 2.58    7.59 ± 3.02
Segment     4.36 ± 0.66     4.61 ± 0.75    8.04 ± 0.98    7.62 ± 1.00
Vehicle     17.89 ± 2.07    18.58 ± 2.09   20.64 ± 2.13   20.40 ± 1.93

Notes: FS-KFD(2) and FS-KFD(3) denote FS-KFD with only shared kernel parameters and without shared free parameters, respectively. KFD(2) and KFD(3) denote KFD with only shared kernel parameters and without shared free parameters, respectively.
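The split-wise comparison protocol (a two-tailed t-test at the 0.05 level over 10 random splits) can be sketched with the pooled two-sample statistic; the critical value t(0.975, 18) ≈ 2.101 is the standard tabulated one, and the error arrays below are synthetic stand-ins for the Segment column of Table 2:

```python
import numpy as np

def two_tailed_t(errs_a, errs_b, t_crit=2.101):
    """Pooled two-sample t-test for two sets of 10 split errors (df = 18);
    t_crit is the tabulated value t(0.975, 18). Returns True when the
    difference is significant at the 0.05 level."""
    na, nb = len(errs_a), len(errs_b)
    sp2 = ((na - 1) * np.var(errs_a, ddof=1)
           + (nb - 1) * np.var(errs_b, ddof=1)) / (na + nb - 2)
    t = (np.mean(errs_a) - np.mean(errs_b)) / np.sqrt(sp2 * (1 / na + 1 / nb))
    return abs(t) > t_crit

# Synthetic stand-ins for the 10 Segment-split errors of FS-KFD(1) and KFD(1).
rng = np.random.default_rng(4)
fs_kfd = rng.normal(4.59, 0.67, size=10)
kfd = rng.normal(7.87, 0.80, size=10)
print(two_tailed_t(fs_kfd, kfd))
```

With a mean gap of more than three percentage points and standard deviations below one, the Segment difference is comfortably significant, in line with the conclusions drawn from Table 2.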
FS-KFD and span bound–based SVM obtain similar performance on all data sets except Segment. FS-KFD is much better than KNN on all data sets except Breast and Glass. Pairwise two-tailed t-tests with a significance level of 0.05 are performed to determine whether there is a significant difference among the three multiclass schemes of FS-KFD and KFD. The resulting p-values indicate that there is no significant difference among the three multiclass schemes. In general, the feature-scaling technique improves the generalization performance of KFD and leads to a natural feature selection when irrelevant features occur. For example, on the Segment data set, the four largest scaling factors are 13.98, 3.50, 2.72, and 2.26, whereas the other scaling factors are smaller than 0.5.

5.3 Radar Target Recognition. Radar target recognition refers to the detection and recognition of target signatures using high-resolution range profiles, in our case from inverse synthetic aperture radar. A radar image represents a spatial distribution of microwave reflectivity that is sufficient to characterize the illuminated target. Range resolution allows the sorting of reflected signals on the basis of range. When range-gating or time-delay sorting is used to interrogate the entire range extent of the target space, a one-dimensional image, called a range profile, is generated. Figure 3 shows an example of such a signature for three different planes: J-6, J-7, and B-52. Our task is to recognize the range profiles of the three plane models based on experimental data acquired in a microwave anechoic chamber. The dimensionality of the range profiles is 64. The full data set is split into 359 training samples and 719 test samples. The training samples consist of 103 one-dimensional images of J-6, 149 one-dimensional images of J-7, and 107 one-dimensional images of B-52.
The test samples consist of 206 one-dimensional images of J-6, 299 one-dimensional images of J-7, and 214 one-dimensional images of B-52. Experimental results for several classifiers are summarized in Table 4. It can be observed that FS-KFD is superior to the other classifiers in terms of classification accuracy on this data set.

6 Discussion

Our algorithm is not yet applicable to problems where the feature dimensionality is on the order of several hundred and the number of training samples on the order of several thousand, due to the high computational cost. This limitation can be overcome by integrating a feature preselection step into FS-KFD. An alternative way to break this limitation is to allow some associated features to share the same scaling factor. For example, in image recognition problems, it is reasonable for neighboring features to share the same scaling factor. Exploiting effective feature preselection and reasonable feature-sharing schemes is an interesting research direction.
Feature Scaling for Kernel Fisher Discriminant Analysis
Figure 3: (A) One-dimensional image of J-6. (B) One-dimensional image of J-7. (C) One-dimensional image of B-52.
It is well known that the kernel function plays an important role in KFD. Choosing different kernel functions may result in different performance. The determination of an appropriate kernel for a specific application is far from fully understood. Consequently, combining FS-KFD and kernel
L. Bo, L. Wang, and L. Jiao

Table 4: Number of Misclassifications of Several Classifiers on the Radar Target Recognition Problem.

Classifier                                    J-6/J-7/B-52
SVM (gaussian kernel)                         11
LS-SVM (gaussian kernel)                      11
RVM (gaussian kernel) (Tipping, 2001)         12
SPR (gaussian kernel) (Figueiredo, 2003)      12
KFD (gaussian kernel)                         13
FS-KFD (feature-scaling gaussian kernel)      7
construction trick to improve the performance of KFD in a specific application is of potential importance. One phenomenon worth mentioning is that the leave-one-out error resulting from the gradient-descent algorithm is smaller than the test error. The reason is that the leave-one-out error suffers from a large variance in small-sample cases. If some countermeasure is taken, such as regularization of the leave-one-out error, this problem can be overcome. This is a topic we will pursue in future research.

Acknowledgments

We thank the two reviewers for their helpful comments, which greatly improved the article, and Lin Shi for her help in proofreading the manuscript. This work was supported by the National Natural Science Foundation of China under grants 60372050 and 60133010 and by National 863 Project grant 2002AA135080.
References

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337–404.
Bengio, Y. (2000). Gradient-based optimization of hyperparameters. Neural Computation, 12, 1889–1900.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science. Available online at http://www.ics.uci.edu/∼mlearn/MLRepository.html.
Cawley, G. C., & Talbot, N. L. C. (2003). Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers. Pattern Recognition, 36, 2585–2592.
Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46, 131–159.
Figueiredo, M. A. T. (2003). Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 1150–1159.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
Fukunaga, K. (1990). Introduction to statistical pattern recognition (2nd ed.). Orlando, FL: Academic Press.
Gestel, T. V., Suykens, J., Lanckriet, G., Lambrechts, A., Moor, B. D., & Vandewalle, J. (2002). Bayesian framework for least squares support vector machine classifiers, gaussian processes and kernel Fisher discriminant analysis. Neural Computation, 15, 1115–1148.
Hsu, C. W., & Lin, C. J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13, 415–425.
Krishnapuram, B., Hartemink, A., Carin, L., & Figueiredo, M. (2004). A Bayesian approach to joint feature selection and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 1105–1111.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.
Lowe, D. (1995). Similarity metric learning for a variable-kernel classifier. Neural Computation, 7, 72–85.
Luntz, A., & Brailovsky, V. (1969). On estimation of characters obtained in statistical procedure of recognition (in Russian). Techicheskaya Kibernetica, 3.
Mika, S. (2002). Kernel Fisher discriminants. Unpublished doctoral dissertation, University of Technology, Berlin.
Mika, S., Rätsch, G., & Weston, J. (1999). Fisher discriminant analysis with kernels. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing (pp. 41–48). Piscataway, NJ: IEEE Press.
Ong, C. S., & Smola, A. J. (2003). Machine learning with hyperkernels. In Proceedings of the Twentieth International Conference on Machine Learning (pp. 568–575). Menlo Park, CA: AAAI Press.
Rätsch, G. (2001). Robust boosting via convex optimization. Unpublished doctoral dissertation, University of Potsdam, Potsdam, Germany.
Rifkin, R., & Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5, 101–141.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9, 293–300.
Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for support vector machines. Neural Computation, 12, 2013–2036.
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. (2001). Feature selection for SVMs. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 668–674). Cambridge, MA: MIT Press.
Williams, C. K. I., & Barber, D. (1998). Bayesian classification with gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1342–1351.
Xu, J., Zhang, X., & Li, Y. (2001). Kernel MSE algorithm: A unified framework for KFD, LS-SVM and KRR. In Proceedings of the International Joint Conference on Neural Networks (pp. 1486–1491). Piscataway, NJ: IEEE Press.
Received October 26, 2004; accepted August 4, 2005.
LETTER
Communicated by Todd Leen
Class-Incremental Generalized Discriminant Analysis

Wenming Zheng
wenming [email protected]
Research Center for Learning Science, Southeast University, Nanjing, Jiangsu 210096, China
Generalized discriminant analysis (GDA) is the nonlinear extension of classical linear discriminant analysis (LDA) via the kernel trick. Mathematically, GDA aims to solve a generalized eigenequation problem, which has always been implemented by means of singular value decomposition (SVD) in the previously proposed GDA algorithms. A major drawback of SVD, however, is the difficulty of designing an incremental solution for the eigenvalue problem. Moreover, there are still numerical problems in computing the eigenvalue decomposition of large matrices. In this article, we propose another algorithm for solving GDA in the small sample size case, which applies QR decomposition rather than SVD. A major contribution of the proposed algorithm is that it can incrementally update the discriminant vectors when new classes are inserted into the training set. The other major contribution of this article is the presentation of the modified kernel Gram-Schmidt (MKGS) orthogonalization algorithm for implementing the QR decomposition in the feature space, which is more numerically stable than the kernel Gram-Schmidt (KGS) algorithm. We conduct experiments on both simulated and real data to demonstrate the better performance of the proposed methods.

1 Introduction

Generalized discriminant analysis (GDA) was proposed by Baudat and Anouar (2000) as the nonlinear extension of classical linear discriminant analysis (LDA) (Duda & Hart, 1973) from the input space to a high-dimensional feature space via the kernel trick (Vapnik, 1995; Schölkopf, Smola, & Müller, 1998).
However, the GDA method often suffers from the so-called small sample size (SSS) problem (Chen, Liao, Ko, Lin, & Yu, 2000; Zheng, Zhao, & Zou, 2004a; Cevikalp & Wilkes, 2004; Cevikalp, Neamtu, Wilkes, & Barkana, 2005), in which the dimensionality of the feature space nonlinearly mapped from the input space is generally much larger than the number of training samples, so that the optimal discriminant vectors of GDA lie in the null space of the within-class scatter matrix (Zheng, Zhao, & Zou, 2004b). Mathematically, the standard approach to solving GDA is to find the eigenvalues λ and eigenvectors ω that solve the following generalized

Neural Computation 18, 979–1006 (2006)
© 2006 Massachusetts Institute of Technology
eigenequation (Baudat & Anouar, 2000; Zheng et al., 2004b): S_B ω = λ S_T ω, where S_B and S_T represent the between-class scatter matrix and the total scatter matrix in the feature space, respectively. However, in the small sample size case, the optimal discriminant vectors of GDA can be found in the null space of S_W (Zheng et al., 2004b), where S_W denotes the within-class scatter matrix. A simple and efficient way of solving GDA in this case is to find an orthonormal basis of the subspace S̄_T(0) ∩ S_W(0) (Zheng, Zhao, & Zou, 2005), where S_T(0) and S_W(0) represent the null spaces of S_T and S_W, respectively, and S̄_T(0) represents the complement of S_T(0). Although it is tractable to solve GDA by utilizing Mercer kernels, the common aspect of the previously proposed algorithms is the use of singular value decomposition (SVD) (Baudat & Anouar, 2000; Zheng et al., 2005; Yang, Frangi, Jin, & Yang, 2004; Liu, Wang, Li, & Tan, 2004). A major common drawback of these algorithms is the difficulty of designing an incremental solution for the eigenvalue problem. The other major drawback of these SVD-based algorithms is numerical instability: the eigenvalues determined by the eigenvalue decomposition approach may be very close to one another, which results in instability of the eigenvectors according to perturbation theory (Fukunaga, 1990). Recently, Xiong, Ye, Li, Cherkassky, and Janardan (2005) proposed a kernel discriminant analysis algorithm via QR decomposition (KDA/QR) to reduce the computational complexity of kernel discriminant analysis (KDA).
However, similar to the kernel direct discriminant analysis (KDDA) approach (Lu, Plataniotis, & Venetsanopoulos, 2003), this method finds the discriminant vectors of KDA by limiting attention to the range space of S_B, which may not yield the optimal discriminant vectors in terms of the Fisher discriminant criterion, in particular in the small sample size case (Zheng et al., 2005). The other drawback of KDA/QR is that it is only a batch method, which requires that all the training data be available before the discriminant vectors are computed (Ye et al., 2004). Thus, it is still time-consuming to update the discriminant vectors when new data items are inserted into the training set. In this article, we propose a computationally efficient and numerically stable algorithm for GDA in the small sample size case. The proposed method directly solves for the optimal discriminant vectors of GDA by applying only QR decomposition. More important, the proposed method introduces an incremental technique to update the discriminant vectors when new data items are inserted into the training set, which is very desirable for designing a dynamic recognition system. Moreover, this article also proposes a modified kernel Gram-Schmidt (MKGS) orthogonalization algorithm for implementing the QR decomposition in the feature space, which is much more numerically stable than the kernel Gram-Schmidt (KGS) orthogonalization algorithm proposed by Wolf and Shashua (2003).
In the next section, we review the KGS algorithm, and we propose the MKGS algorithm in section 3. In section 4, we propose the batch GDA algorithm and the class-incremental GDA algorithm, respectively, using the MKGS algorithm. In section 5, we present the feature extraction method for classification based on the proposed GDA algorithm. Section 6 is devoted to the experiments on both simulated and real data. The conclusion is given in the last section.

2 KGS Algorithm

Let A be a matrix with k columns α_1, …, α_k, where α_i ∈ R^n (i = 1, …, k). Let Φ(·) be a mapping that maps the elements of R^n into a high-dimensional Hilbert space F, that is, Φ: R^n → F. Let

A^Φ = [Φ(α_1), …, Φ(α_k)].    (2.1)

Suppose that β_1, …, β_k are the equivalent orthonormal vectors corresponding to the columns of A^Φ. Then β_i (i = 1, …, k) can be computed by using the following classical Gram-Schmidt (CGS) orthonormalization procedure (Björck, 1994):

1. β_1 = Φ(α_1).
2. Repeat for j = 2, …, k:
   β_j = Φ(α_j) − Σ_{i=1}^{j−1} (β_i^T Φ(α_j) / β_i^T β_i) β_i.
3. Repeat for j = 1, …, k: β_j = β_j / ‖β_j‖, where ‖·‖ stands for the Euclidean norm.

However, directly computing the orthonormal vectors β_i (i = 1, …, k) is an intractable task because the mapping function Φ is hard to evaluate explicitly. Wolf and Shashua (2003) proposed an indirect approach that implements the above orthonormalization procedure via the kernel trick (hereafter the KGS algorithm). More specifically, assume that k(x, y) is the reproducing kernel defined on the feature space F such that

k(x, y) = ⟨Φ(x), Φ(y)⟩ = (Φ(x))^T Φ(y),    (2.2)

where ⟨Φ(x), Φ(y)⟩ stands for the inner product of Φ(x) and Φ(y). Then, according to Wolf and Shashua (2003), the KGS algorithm can be summarized as follows:
KGS Algorithm (Wolf & Shashua, 2003). Let A^Φ be a matrix with columns Φ(α_1), …, Φ(α_k), where Φ(α_1), …, Φ(α_k) are k linearly independent vectors. Then the corresponding orthonormal vectors of the columns of A^Φ can be obtained using the following steps, where s_j and t_j (j = 1, …, k) are k-dimensional vectors, D is a k × k diagonal matrix, and e_j = (0, …, 1, …, 0)^T is a k-dimensional vector whose jth entry is 1.

1. Let s_1 = t_1 = e_1 and D_11 = k(α_1, α_1), where e_1 = (1, 0, …, 0)^T;
2. Repeat for j = 2, …, k:
   a. Compute s_j = ( t_{11} k(α_1, α_j) / D_{11}, …, Σ_{q=1}^{j−1} t_{q(j−1)} k(α_q, α_j) / D_{(j−1)(j−1)}, 1, 0, …, 0 )^T;
   b. Compute t_j = (−t_1, …, −t_{j−1}, e_j, 0, …, 0) s_j;
   c. Compute D_jj = Σ_{p,q=1}^{j} t_{pj} t_{qj} k(α_p, α_q);
3. R = D^{1/2} [s_1, …, s_k];
4. R^{−1} = [t_1, …, t_k] D^{−1/2}.

The columns of the matrix [β_1, …, β_k] = [Φ(α_1), …, Φ(α_k)][t_1, …, t_k] D^{−1/2} are the corresponding orthonormal vectors of the columns of A^Φ.

3 MKGS Algorithm

The KGS algorithm proposed by Wolf and Shashua (2003) is essentially the kernelized version of the CGS algorithm. Thus, many numerical properties of CGS carry over to KGS. However, the experimental results of Rice (1966) and the theoretical analysis of Björck (1967) indicated that the CGS procedure is very sensitive to round-off errors. In other words, if the matrix A^Φ is ill conditioned, the computed vectors β_1, …, β_k will soon lose their orthogonality, and reorthogonalization will be needed. Thus, it is very desirable to modify the KGS algorithm to obtain a numerically superior algorithm for orthogonalizing the columns of A^Φ.

It is notable that the modified Gram-Schmidt (MGS) orthogonalization procedure is numerically superior to CGS; more details can be found in Rice (1966) and Björck (1967). Thus, we adopt the MGS procedure to modify the KGS algorithm. In general, the MGS procedure comes in two versions: the row-oriented procedure and the column-oriented procedure (Björck, 1994). According to Björck (1994), the two procedures are numerically equivalent: the operations and rounding errors are the same, and both produce the same numerical results. The main difference is that the column-oriented procedure is more appropriate when the orthogonalized vectors are obtained sequentially. Based on the column-oriented procedure, the orthonormal vectors β_1, …, β_k can be computed as follows (for simplicity, we use the notation Φ(α_i)^(0) = Φ(α_i), i = 1, …, k):
1. β_1 = Φ(α_1)^(0);
2. Repeat for j = 2, …, k:
   a. Repeat for m = 1, …, j − 1:
      Φ(α_j)^(m) = Φ(α_j)^(m−1) − (β_m^T Φ(α_j)^(m−1) / β_m^T β_m) β_m;
   b. β_j = Φ(α_j)^(j−1);
3. Repeat for j = 1, …, k: β_j = β_j / ‖β_j‖.

Similar to the KGS algorithm, we implement the above procedure via the kernel trick; the result is hereafter called the MKGS algorithm.

MKGS Algorithm. Let A^Φ be a matrix with columns Φ(α_1), …, Φ(α_k), where Φ(α_1), …, Φ(α_k) are k linearly independent vectors. Then the corresponding orthonormal vectors of the columns of A^Φ can be obtained using the following steps, where s_j and t_j (j = 1, …, k) are k-dimensional vectors, D is a k × k diagonal matrix, Λ is a k × k matrix, and e_j = (0, …, 1, …, 0)^T is a k-dimensional vector whose jth entry is 1.

1. Let s_1 = t_1 = e_1, D_11 = k(α_1, α_1), Λ_{i1} = k(α_i, α_1) (i = 1, …, k);
2. Repeat for j = 2, …, k:
   a. t_j^(1) = e_j;
   b. Repeat for i = 1, …, j − 1:
      s_{ji} = Σ_{p=1}^{j} Λ_{pi} t_{pj}^(i) / D_{ii};
      t_j^(i+1) = t_j^(i) − s_{ji} t_i;
   c. t_j = t_j^(j);
   d. Repeat for p = 1, …, k:
      Λ_{pj} = Σ_{q=1}^{j} t_{qj} k(α_q, α_p);
   e. Compute D_jj = Σ_{p=1}^{j} Λ_{pj} t_{pj};
3. R = D^{1/2} [s_1, …, s_k], where s_i = [s_{i1}, s_{i2}, …, s_{i(i−1)}, 1, 0, …, 0]^T;
4. R^{−1} = [t_1, …, t_k] D^{−1/2}.

The columns of the matrix [β_1, …, β_k] = [Φ(α_1), …, Φ(α_k)][t_1, …, t_k] D^{−1/2} are the corresponding orthonormal vectors of the columns of A^Φ. By calculating the computational cost of each line of the above algorithm, we can easily see that the complexity of the MKGS algorithm is O(k^3).
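The recursion above can be made concrete with a short NumPy sketch. The version below is a simplification, not the paper's exact bookkeeping: it folds the s_j, t_j, and D quantities into a single coefficient matrix C, chosen so that the implicit vectors [Φ(α_1), …, Φ(α_k)] C are orthonormal. Everything is computed through the Gram (kernel) matrix alone, which is the essential point of the kernel trick:

```python
import numpy as np

def kernel_mgs(K):
    """Column-oriented modified Gram-Schmidt in a kernel-induced space.

    K[i, j] = k(alpha_i, alpha_j) is the Gram matrix of k linearly
    independent vectors Phi(alpha_1), ..., Phi(alpha_k).  Returns C with
    C.T @ K @ C = I, so the implicit vectors Phi(A) @ C are orthonormal.
    """
    k = K.shape[0]
    C = np.eye(k)
    for j in range(k):
        for i in range(j):
            # Inner product <beta_i, current alpha_j>, evaluated via K only.
            r = C[:, i] @ K @ C[:, j]
            C[:, j] -= r * C[:, i]
        # Norm of the j-th orthogonalized vector, again via K.
        C[:, j] /= np.sqrt(C[:, j] @ K @ C[:, j])
    return C

# Example: Gram matrix of 5 random, linearly independent vectors.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))
K = X.T @ X
C = kernel_mgs(K)
# C.T @ K @ C is (numerically) the 5 x 5 identity matrix.
```

The nested loop mirrors the column-oriented MGS recursion: each new column is repeatedly deflated against the already-orthonormalized columns before being normalized, which is where the numerical advantage over the CGS-style KGS update comes from.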
4 GDA/MKGS: GDA via MKGS

Let X = {x_i^j}_{i=1,…,c; j=1,…,N_i} be an n-dimensional training sample set with N elements, where c is the number of classes and N_i is the number of samples in the ith class. The between-class scatter matrix S_B, the within-class scatter matrix S_W, and the total scatter matrix S_T in the feature space are respectively defined as

S_B = Σ_{i=1}^{c} N_i (u_i^Φ − u^Φ)(u_i^Φ − u^Φ)^T    (4.1)

S_W = Σ_{i=1}^{c} Σ_{j=1}^{N_i} (Φ(x_i^j) − u_i^Φ)(Φ(x_i^j) − u_i^Φ)^T    (4.2)

S_T = Σ_{i=1}^{c} Σ_{j=1}^{N_i} (Φ(x_i^j) − u^Φ)(Φ(x_i^j) − u^Φ)^T,    (4.3)

where x^T denotes the transpose of x, Φ(x_i^j) is the jth sample of the ith class in F, u_i^Φ is the mean of the ith class samples, and u^Φ is the mean of all samples in F:

u_i^Φ = (1/N_i) Σ_{j=1}^{N_i} Φ(x_i^j),    u^Φ = (1/N) Σ_{i=1}^{c} Σ_{j=1}^{N_i} Φ(x_i^j).    (4.4)

4.1 Batch GDA/MKGS Algorithm. Let S_B(0) and S_W(0) denote the null spaces of S_B and S_W, and let S̄_B(0) and S̄_W(0) denote the complements of S_B(0) and S_W(0), respectively. Then, from the expressions of S_B, S_W, and S_T in equations 4.1, 4.2, and 4.3, we obtain

S̄_B(0) = span{u_i^Φ − u^Φ | i = 1, …, c}    (4.5)

S̄_W(0) = span{Φ(x_i^j) − u_i^Φ | i = 1, …, c; j = 1, …, N_i}    (4.6)

S̄_T(0) = span{Φ(x_i^j) − u^Φ | i = 1, …, c; j = 1, …, N_i}.    (4.7)

Note that Φ(x_i^j) − u^Φ = (Φ(x_i^j) − u_i^Φ) + (u_i^Φ − u^Φ). Thus, from equations 4.5, 4.6, and 4.7, we have

S̄_T(0) ⊆ span{Φ(x_i^j) − u_i^Φ, u_i^Φ − u^Φ | i = 1, …, c; j = 1, …, N_i}.    (4.8)
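For intuition, the three scatter matrices can be computed directly in the input space (taking Φ to be the identity map), and the standard decomposition S_T = S_B + S_W, which underlies the note after equation 4.7, can be checked numerically on made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Three classes of 2-D points (toy data, identity feature map).
classes = [rng.standard_normal((10, 2)) + m
           for m in ([0, 0], [3, 0], [0, 3])]

u = np.vstack(classes).mean(axis=0)          # grand mean (eq. 4.4)
# Between-class scatter (eq. 4.1).
S_B = sum(len(Xi) * np.outer(Xi.mean(0) - u, Xi.mean(0) - u)
          for Xi in classes)
# Within-class scatter (eq. 4.2).
S_W = sum(np.outer(x - Xi.mean(0), x - Xi.mean(0))
          for Xi in classes for x in Xi)
# Total scatter (eq. 4.3).
S_T = sum(np.outer(x - u, x - u) for Xi in classes for x in Xi)

# The total scatter decomposes exactly as S_T = S_B + S_W.
assert np.allclose(S_T, S_B + S_W)
```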
Moreover, we have the following two important theorems about S̄_B(0) and S̄_W(0):
Theorem 1. S̄_B(0) can be spanned by u_i^Φ − u_1^Φ (i = 2, …, c); that is, S̄_B(0) = span{u_i^Φ − u_1^Φ | i = 2, …, c}.

Theorem 2. S̄_W(0) can be spanned by Φ(x_i^j) − Φ(x_i^1) (i = 1, …, c; j = 2, …, N_i); that is, S̄_W(0) = span{Φ(x_i^j) − Φ(x_i^1) | i = 1, …, c; j = 2, …, N_i}.

The proofs of theorems 1 and 2 are given in appendixes A and B, respectively. Theorem 1 will be useful for designing an incremental algorithm that updates the basis of S̄_B(0) when new classes are inserted into the training set, since S̄_B(0) can be expressed as the span of the vectors u_i^Φ − u_1^Φ (i = 2, …, c), which do not depend on the ensemble mean u^Φ of the training set. Similarly, theorem 2 will be useful for designing an incremental algorithm that updates the basis of S̄_W(0) when new instances are inserted into the existing classes of the training set, since S̄_W(0) can be expressed as the span of the vectors Φ(x_i^j) − Φ(x_i^1) (i = 1, …, c; j = 2, …, N_i), which do not depend on the class means u_i^Φ (i = 1, …, c). In the next section, we will show that theorems 1 and 2 are crucial for designing the class-incremental GDA/MKGS algorithm. Now let

A^Φ = [A_1^Φ, …, A_c^Φ, A_{c+1}^Φ],    (4.9)

where the matrices A_i^Φ are respectively defined by

A_i^Φ = [Φ(x_i^2) − Φ(x_i^1), Φ(x_i^3) − Φ(x_i^1), …, Φ(x_i^{N_i}) − Φ(x_i^1)] (i = 1, …, c)    (4.10)

A_{c+1}^Φ = [u_2^Φ − u_1^Φ, u_3^Φ − u_1^Φ, …, u_c^Φ − u_1^Φ].    (4.11)

From theorems 1 and 2 and equations 4.9, 4.10, and 4.11, we obtain that S̄_W(0) and S̄_B(0) can be respectively spanned by the first N − c columns and the last c − 1 columns of the matrix A^Φ. Moreover, from equation 4.8, we obtain that S̄_T(0) lies in the span of the N − 1 columns of the matrix A^Φ. Without loss of generality, we assume that Φ(x_i^j) (i = 1, …, c; j = 1, …, N_i) are linearly independent. Then we have the following theorem regarding the rank of S̄_T(0):

Theorem 3. Suppose that Φ(x_i^j) (i = 1, …, c; j = 1, …, N_i) are linearly independent. Then the rank of S̄_T(0) is N − 1; that is, rank(S̄_T(0)) = N − 1.

The proof of theorem 3 is given in appendix C. Consider that S̄_T(0) lies in the span of the N − 1 columns of the matrix A^Φ. Thus, theorem 3 indicates that the N − 1 columns of the matrix A^Φ form a basis of S̄_T(0), where the first
N − c columns form a basis of S̄_W(0). Note that the optimal discriminant vectors of GDA lie in the subspace S̄_T(0) ∩ S_W(0) in the small sample size case (Zheng et al., 2004b). Thus, our goal is to obtain a basis of S̄_T(0) ∩ S_W(0). This can be achieved using the MKGS orthogonalization procedure.

Let A_{ip}^Φ denote the pth column of the matrix A_i^Φ. Then for i, j, m, n = 1, …, c and p = 1, …, N_i − 1 and q = 1, …, N_j − 1, we have

(A_{ip}^Φ)^T A_{jq}^Φ = (Φ(x_i^{p+1}) − Φ(x_i^1))^T (Φ(x_j^{q+1}) − Φ(x_j^1))
  = (Φ(x_i^{p+1}))^T Φ(x_j^{q+1}) − (Φ(x_i^{p+1}))^T Φ(x_j^1) − (Φ(x_i^1))^T Φ(x_j^{q+1}) + (Φ(x_i^1))^T Φ(x_j^1)
  = k(x_i^{p+1}, x_j^{q+1}) − k(x_i^{p+1}, x_j^1) − k(x_i^1, x_j^{q+1}) + k(x_i^1, x_j^1)    (4.12)

(A_{ip}^Φ)^T u_m^Φ = (1/N_m) Σ_{t=1}^{N_m} (Φ(x_i^{p+1}) − Φ(x_i^1))^T Φ(x_m^t)
  = (1/N_m) Σ_{t=1}^{N_m} [k(x_i^{p+1}, x_m^t) − k(x_i^1, x_m^t)]    (4.13)

(u_m^Φ)^T u_n^Φ = (1/(N_m N_n)) (Σ_{p=1}^{N_m} Φ(x_m^p))^T (Σ_{q=1}^{N_n} Φ(x_n^q)) = (1/(N_m N_n)) Σ_{p=1}^{N_m} Σ_{q=1}^{N_n} k(x_m^p, x_n^q).    (4.14)

Let K be an (N − 1) × (N − 1) matrix defined by

K = (K_{ij})_{i=1,…,c+1; j=1,…,c+1},    (4.15)

where

K_{ij} = (A_i^Φ)^T A_j^Φ.    (4.16)

Let (K_{ij})_{pq} denote the element in the pth row and qth column of the matrix K_{ij}. Then for i, j = 1, …, c, m, n = 1, …, c − 1, and p = 1, …, N_i − 1 and q = 1, …, N_j − 1, we have

(K_{ij})_{pq} = (A_{ip}^Φ)^T A_{jq}^Φ    (4.17)

(K_{(c+1)j})_{mq} = (K_{j(c+1)})_{qm} = (A_{jq}^Φ)^T (u_{m+1}^Φ − u_1^Φ) = (A_{jq}^Φ)^T u_{m+1}^Φ − (A_{jq}^Φ)^T u_1^Φ    (4.18)
(K_{(c+1)(c+1)})_{mn} = (A_{(c+1)m}^Φ)^T A_{(c+1)n}^Φ = (u_{m+1}^Φ − u_1^Φ)^T (u_{n+1}^Φ − u_1^Φ)
  = (u_{m+1}^Φ)^T u_{n+1}^Φ − (u_{m+1}^Φ)^T u_1^Φ − (u_1^Φ)^T u_{n+1}^Φ + (u_1^Φ)^T u_1^Φ.    (4.19)

According to equations 4.12 to 4.19, we can easily calculate the matrix K. Let K(i, j) denote the element in the ith row and jth column of K. According to the MKGS algorithm, we obtain the following batch GDA/MKGS algorithm, where s_j and t_j (j = 1, …, N − 1) are (N − 1)-dimensional vectors, D is an (N − 1) × (N − 1) diagonal matrix, and Λ is an (N − 1) × (N − 1) matrix:

Batch GDA/MKGS Algorithm
1. Let s_1 = t_1 = e_1, D_11 = K(1, 1), Λ_{i1} = K(i, 1) (i = 1, …, N − 1);
2. Repeat for j = 2, …, N − 1:
   a. t_j^(1) = e_j;
   b. Repeat for i = 1, …, j − 1:
      s = Σ_{p=1}^{j} Λ_{pi} t_{pj}^(i) / D_{ii};
      t_j^(i+1) = t_j^(i) − s t_i;
   c. t_j = t_j^(j);
   d. Repeat for p = 1, …, N − 1:
      Λ_{pj} = Σ_{q=1}^{j} t_{qj} K(q, p);
   e. Compute D_jj = Σ_{p=1}^{j} Λ_{pj} t_{pj};
3. [β_{N−c+1}, …, β_{N−1}] = A^Φ [t_{N−c+1}, …, t_{N−1}] (D_{ij})^{−1/2}_{i,j=N−c+1,…,N−1};
c. t j = t j ; d. Repeat for p = 1, . . . , N − 1 j pj = q =1 tq j K (q , p); j e. Compute D j j = p=1 pj tpj ; 3. [β N−c+1 , . . . , β N−1 ] = A [tN−c+1 , . . . , tN−1 ](Dij )i,−1/2 j=N−c+1,...,N−1 ; The c − 1 vectors β N−c+1 , . . . , β N−1 form an orthonormal basis of ST (0) ∩ SW (0), which are referred to as the discriminant vectors of GDA as for the case of the small sample size problem. By calculating the computational cost of the above algorithm, we obtain that the complexity of the batch GDA/MKGS algorithm is O(N3 ). 4.2 Class-Incremental GDA/MKGS Algorithm. This section aims to design an incremental algorithm for updating the discriminant vectors of GDA/MKGS when new data items are inserted into the training set. We consider two distinct cases of the inserted instances: (1) the instances belong to a new class, and (2) the instances belong to an existing class. 4.2.1 Insertion of a New Class. Recalling that we have c classes, let j {xc+1 | j = 1, . . . , Nc+1 } be the (c + 1)th class being inserted, where Nc+1 is
988
W. Zheng
the number of the new training samples. In this case, the expression in equation 4.9 can be rewritten as

Ã^Φ = [A_1^Φ, …, A_c^Φ, Ã_{c+1}^Φ, Ã_{c+2}^Φ],    (4.20)

where A_i^Φ (i = 1, …, c) are defined in equation 4.10, and Ã_{c+1}^Φ and Ã_{c+2}^Φ are respectively defined as follows:

Ã_{c+1}^Φ = [Φ(x_{c+1}^2) − Φ(x_{c+1}^1), Φ(x_{c+1}^3) − Φ(x_{c+1}^1), …, Φ(x_{c+1}^{N_{c+1}}) − Φ(x_{c+1}^1)]    (4.21)

Ã_{c+2}^Φ = [u_2^Φ − u_1^Φ, u_3^Φ − u_1^Φ, …, u_{c+1}^Φ − u_1^Φ].    (4.22)

The kernel matrix K in equation 4.15 is replaced by

K̃ = (Ã^Φ)^T Ã^Φ.    (4.23)

The elements of K̃ can be calculated by utilizing the kernel function. According to the batch GDA/MKGS algorithm, we have the following algorithm for updating the discriminant vectors when the (c + 1)th class is inserted into the training set, where K̃(i, j) denotes the element in the ith row and jth column of K̃, s̃_j and t̃_j (j = 1, …, N + N_{c+1} − 1) are (N + N_{c+1} − 1)-dimensional vectors, D̃ is an (N + N_{c+1} − 1) × (N + N_{c+1} − 1) diagonal matrix, and Λ̃ is an (N + N_{c+1} − 1) × (N + N_{c+1} − 1) matrix.

Class-Incremental GDA/MKGS Algorithm 1: Updating Discriminant Vectors with the Insertion of the (c + 1)th Class
1. Repeat for j = 1, …, N − c:
   a. Compute t̃_j = (t_j^T, 0, …, 0)^T;
   b. Compute D̃_jj = D_jj;
   c. Repeat for i = 1, …, N − c: Λ̃_{ij} = Λ_{ij};
   d. Repeat for i = N − c + 1, …, N + N_{c+1} − 1:
      Λ̃_{ij} = Σ_{q=1}^{N−c} t̃_{qj} K̃(q, i);
2. Repeat for j = N − c + 1, …, N + N_{c+1} − 1:
   a. t̃_j^(1) = e_j;
   b. Repeat for i = 1, …, j − 1:
      s = Σ_{p=1}^{j} Λ̃_{pi} t̃_{pj}^(i) / D̃_{ii};
      t̃_j^(i+1) = t̃_j^(i) − s t̃_i;
   c. t̃_j = t̃_j^(j);
   d. Repeat for p = 1, …, N + N_{c+1} − 1:
      Λ̃_{pj} = Σ_{q=1}^{j} t̃_{qj} K̃(q, p);
   e. Compute D̃_jj = Σ_{p=1}^{j} Λ̃_{pj} t̃_{pj};
3. [β_{N+N_{c+1}−c}, …, β_{N+N_{c+1}−1}] = Ã^Φ [t̃_{N+N_{c+1}−c}, …, t̃_{N+N_{c+1}−1}] (D̃_{ij})^{−1/2}_{i,j=N+N_{c+1}−c,…,N+N_{c+1}−1};

The c vectors β_{N+N_{c+1}−c}, …, β_{N+N_{c+1}−1} are the new discriminant vectors of GDA after the (c + 1)th class is inserted into the training set. By calculating the computational cost of the above algorithm, we obtain that the complexity of the class-incremental GDA/MKGS algorithm for updating the discriminant vectors with the insertion of the (c + 1)th class is O{(N_{c+1} + c)(N + N_{c+1})^2}.

4.2.2 Insertion of a New Instance from an Existing Class. Suppose that x is an instance being inserted into the ith (1 ≤ i ≤ c) class. For simplicity of notation, we denote x by x_i^{N_i+1}, since there are N_i samples in the ith class. Then the mean of the ith class, denoted by ũ_i^Φ, is expressed as

ũ_i^Φ = (1/(N_i + 1)) Σ_{j=1}^{N_i+1} Φ(x_i^j).    (4.24)
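In the input space (taking Φ as the identity for illustration), equation 4.24 is simply a running-mean update: the new class mean can be formed from the old mean and the inserted instance alone, without revisiting the stored samples. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(5)
X_i = rng.standard_normal((6, 3))        # the N_i existing samples of class i
u_i = X_i.mean(axis=0)                   # current class mean
x_new = rng.standard_normal(3)           # the inserted instance x_i^{N_i+1}

N_i = len(X_i)
# Incremental form of equation 4.24.
u_new = (N_i * u_i + x_new) / (N_i + 1)

# Agrees with recomputing the mean over all N_i + 1 samples.
assert np.allclose(u_new, np.vstack([X_i, x_new]).mean(axis=0))
```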
Without loss of generality, we assume that i > 1. Then the expression in equation 4.9 can be rewritten as

Ã^Φ = [A_1^Φ, …, A_c^Φ, Φ(x_i^{N_i+1}) − Φ(x_i^1), Ã_{c+1}^Φ],    (4.25)

where A_i^Φ (i = 1, …, c) are defined in equation 4.10, and Ã_{c+1}^Φ is defined as

Ã_{c+1}^Φ = [u_2^Φ − u_1^Φ, …, u_{i−1}^Φ − u_1^Φ, ũ_i^Φ − u_1^Φ, u_{i+1}^Φ − u_1^Φ, …, u_c^Φ − u_1^Φ].    (4.26)

The new kernel matrix, denoted by K̃, is expressed as

K̃ = (Ã^Φ)^T Ã^Φ.    (4.27)

According to the batch GDA/MKGS algorithm, we have the following incremental algorithm for updating the discriminant vectors with the insertion of an instance in the ith class, where K̃(i, j) denotes the element in the ith row and jth column of K̃, s̃_j and t̃_j (j = 1, …, N) are N-dimensional vectors, D̃ is an N × N diagonal matrix, and Λ̃ is an N × N matrix.
Class-Incremental GDA/MKGS Algorithm 2: Updating Discriminant Vectors with the Insertion of a New Instance in the ith (i ≤ c) Class
1. Repeat for j = 1, …, N − c:
   a. Compute t̃_j = (t_j^T, 0, …, 0)^T;
   b. Compute D̃_jj = D_jj;
   c. Repeat for i = 1, …, N − c: Λ̃_{ij} = Λ_{ij};
   d. Repeat for i = N − c + 1, …, N:
      Λ̃_{ij} = Σ_{q=1}^{N−c} t̃_{qj} K̃(q, i);
2. Repeat for j = N − c + 1, …, N:
   a. t̃_j^(1) = e_j;
   b. Repeat for i = 1, …, j − 1:
      s = Σ_{p=1}^{j} Λ̃_{pi} t̃_{pj}^(i) / D̃_{ii};
      t̃_j^(i+1) = t̃_j^(i) − s t̃_i;
   c. t̃_j = t̃_j^(j);
   d. Repeat for p = 1, …, N:
      Λ̃_{pj} = Σ_{q=1}^{j} t̃_{qj} K̃(q, p);
   e. Compute D̃_jj = Σ_{p=1}^{j} Λ̃_{pj} t̃_{pj};
3. [β_{N+2−c}, …, β_N] = Ã^Φ [t̃_{N+2−c}, …, t̃_N] (D̃_{ij})^{−1/2}_{i,j=N+2−c,…,N};

The c − 1 vectors β_{N+2−c}, …, β_N are the new discriminant vectors of GDA after the new instance of the ith class is inserted into the training set. By calculating the computational cost of the above algorithm, we obtain that the complexity of the class-incremental GDA/MKGS algorithm for updating the discriminant vectors with the insertion of a new instance is O(cN^2).

5 Feature Extraction for Classification

Let Φ(X) = [Φ(x_1^1) ··· Φ(x_1^{N_1}) ··· Φ(x_c^1) ··· Φ(x_c^{N_c})]. Then we have

Φ(x_i^j) = Φ(X) e_{k+j},    (5.1)

where k = Σ_{t=1}^{i−1} N_t. From equations 4.4, 4.10, and 5.1, we have

u_i^Φ = (1/N_i) Σ_{t=1}^{N_i} Φ(X) e_{k+t} = Φ(X) (1/N_i) Σ_{t=1}^{N_i} e_{k+t} = Φ(X) L_i    (5.2)

A_i^Φ = Φ(X) [e_{k+2} − e_{k+1}, e_{k+3} − e_{k+1}, …, e_{k+N_i} − e_{k+1}] (i = 1, …, c),    (5.3)
where L_i = (1/N_i) Σ_{t=1}^{N_i} e_{k+t}. From equations 4.11 and 5.2, we have

A_{c+1}^Φ = Φ(X) [L_2 − L_1, L_3 − L_1, …, L_c − L_1].    (5.4)

Combining equations 4.9, 5.3, and 5.4, we obtain

A^Φ = [A_1^Φ, …, A_c^Φ, A_{c+1}^Φ] = Φ(X) P,    (5.5)

where

P = [e_2 − e_1, …, e_{N_1} − e_1, e_{N_1+2} − e_{N_1+1}, …, e_N − e_{N−N_c+1}, L_2 − L_1, …, L_c − L_1].    (5.6)
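The matrix P of equation 5.6 depends only on the class sizes, so it can be assembled explicitly. A small NumPy sketch with hypothetical class sizes, checking in the input space that XP reproduces the difference columns of equations 4.10 and 4.11:

```python
import numpy as np

sizes = [3, 4, 2]                       # hypothetical class sizes N_i
N, c = sum(sizes), len(sizes)
starts = np.cumsum([0] + sizes[:-1])    # index of each class's first sample

E = np.eye(N)                           # columns are the vectors e_j
L = [E[:, s:s + n].mean(axis=1) for s, n in zip(starts, sizes)]

# Within-class columns e_{k+j} - e_{k+1}, followed by L_i - L_1 (eq. 5.6).
P = np.column_stack(
    [E[:, s + j] - E[:, s] for s, n in zip(starts, sizes)
     for j in range(1, n)]
    + [L[i] - L[0] for i in range(1, c)])

X = np.random.default_rng(2).standard_normal((5, N))   # samples as columns
A = X @ P
# First N - c columns are x_i^j - x_i^1 (eq. 4.10) ...
assert np.allclose(A[:, 0], X[:, 1] - X[:, 0])
# ... and the last c - 1 columns are class-mean differences (eq. 4.11).
means = [X[:, s:s + n].mean(axis=1) for s, n in zip(starts, sizes)]
assert np.allclose(A[:, -1], means[2] - means[0])
```

P has N rows and N − 1 columns, matching the column count of A^Φ established by theorem 3.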
Thus, according to the batch GDA/MKGS algorithm, the projection matrix of GDA can be expressed as

W_{GDA/MKGS} = [β_{N−c+1}, …, β_{N−1}] = Φ(X) P T,    (5.7)

where

T = [ (1/√D_{(N−c+1)(N−c+1)}) t_{N−c+1}, …, (1/√D_{(N−1)(N−1)}) t_{N−1} ].    (5.8)

The projection of a test point Φ(x_test) onto W_{GDA/MKGS} can be calculated by

y_test = W_{GDA/MKGS}^T Φ(x_test) = T^T P^T K_test,    (5.9)

where

K_test = [k(x_1^1, x_test), k(x_1^2, x_test), …, k(x_c^{N_c}, x_test)]^T.    (5.10)

Note that the discriminant vectors of GDA/MKGS lie in the subspace S_W(0). Thus, we have

W_{GDA/MKGS}^T S_W = 0.    (5.11)

From equations 4.2 and 5.11, we have

W_{GDA/MKGS}^T [Φ(x_1^1) − u_1^Φ, …, Φ(x_1^{N_1}) − u_1^Φ, …, Φ(x_c^{N_c}) − u_c^Φ] = 0.    (5.12)

From equation 5.12, we obtain that

W_{GDA/MKGS}^T Φ(x_i^j) = W_{GDA/MKGS}^T u_i^Φ (j = 1, …, N_i; i = 1, …, c).    (5.13)
Equation 5.13 means that the training data of each class are projected onto the same point in the projection space. Now let X_i denote the ith class data set in the feature space, and let y_i^j and ȳ_i represent the respective projections of Φ(x_i^j) and u_i^Φ onto the projection matrix W_{GDA/MKGS}; that is,

y_i^j = W_{GDA/MKGS}^T Φ(x_i^j),    ȳ_i = W_{GDA/MKGS}^T u_i^Φ.    (5.14)

From equations 5.13 and 5.14, we have

y_i^1 = y_i^2 = ··· = y_i^j = ··· = y_i^{N_i} = ȳ_i.    (5.15)

Based on the nearest-neighbor rule, we define the distance between the projections of the test point Φ(x_test) and the data set X_i as follows:

d_p(Φ(x_test), X_i) = min_j ‖y_test − y_i^j‖,  j = 1, …, N_i.    (5.16)

Combining equation 5.16 with 5.15, we obtain

d_p(Φ(x_test), X_i) = ‖y_test − y_i^1‖.    (5.17)

Therefore, based on the nearest-neighbor rule, the associated class index of the test point can be obtained as follows:

c* = arg min_i d_p(Φ(x_test), X_i) = arg min_i ‖y_test − y_i^1‖.    (5.18)
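Sections 4 and 5 can be combined into a compact end-to-end sketch. The version below is a simplification, not the paper's implementation: it runs a plain modified Gram-Schmidt loop on the Gram matrix P^T K P in place of the s_j/t_j/D recursion, on toy gaussian-kernel data. It checks equation 5.15 (all training samples of a class project to a single point) and then classifies a slightly perturbed training point by the nearest class projection, as in equation 5.18:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

rng = np.random.default_rng(3)
sizes = [4, 4, 4]                          # hypothetical class sizes
N, c = sum(sizes), len(sizes)
X = np.vstack([rng.standard_normal((n, 2)) + m
               for n, m in zip(sizes, [[0, 0], [6, 0], [0, 6]])])
starts = np.cumsum([0] + sizes[:-1])

# Coefficient matrix P (equation 5.6): difference columns, then mean columns.
E = np.eye(N)
L = [E[:, s:s + n].mean(axis=1) for s, n in zip(starts, sizes)]
P = np.column_stack(
    [E[:, s + j] - E[:, s] for s, n in zip(starts, sizes)
     for j in range(1, n)]
    + [L[i] - L[0] for i in range(1, c)])

Kx = gaussian_kernel(X, X)
G = P.T @ Kx @ P                           # Gram matrix of the columns of A^Phi

# Modified Gram-Schmidt in the kernel space: C.T @ G @ C = I.
C = np.eye(N - 1)
for j in range(N - 1):
    for i in range(j):
        C[:, j] -= (C[:, i] @ G @ C[:, j]) * C[:, i]
    C[:, j] /= np.sqrt(C[:, j] @ G @ C[:, j])

M = P @ C[:, -(c - 1):]                    # W = Phi(X) @ M (equation 5.7)
Y = Kx @ M                                 # projections of the training data

# Equation 5.15: every sample of a class projects to the same point.
for s, n in zip(starts, sizes):
    assert np.allclose(Y[s:s + n], Y[s], atol=1e-6)

# Equation 5.18: classify a perturbed class-2 sample by the nearest
# class projection.
x_test = X[starts[1]:starts[1] + 1] + 1e-3
y_test = gaussian_kernel(x_test, X) @ M
label = int(np.argmin([np.linalg.norm(y_test - Y[s]) for s in starts]))
```

Because the last c − 1 orthonormal directions are, by construction, orthogonal to the within-class difference columns, the collapse of each class to a single projection point falls out of the orthogonalization order alone.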
6 Experiments We will use simulated and real data, respectively, to demonstrate the efficiency of the proposed method in this section. In the first example, we use toy data to show that MKGS is numerically superior to KGS for implementing the QR decomposition. The second example aims to demonstrate the better performance of the class-incremental GDA/MKGS algorithm when used for a dynamic face recognition system. In the third example, we use a handwritten digital character recognition experiment to demonstrate the better performance of the proposed algorithm. All of these experiments are run on the platform of IBM personal computer with MATLAB software. The monomial kernel and the gaussian kernel are used in the experiments, which are respectively defined as follows:
- Monomial kernel: k(x, y) = (x^T y)^d, where d is the monomial degree
- Gaussian kernel: k(x, y) = exp(−||x − y||^2 / σ), where σ is the parameter of the gaussian kernel
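Both kernels are straightforward to evaluate; the following sketch (function names are our own) computes them on plain NumPy vectors:

```python
import numpy as np

def monomial_kernel(x, y, d):
    """k(x, y) = (x^T y)^d, with monomial degree d."""
    return float(np.dot(x, y)) ** d

def gaussian_kernel(x, y, sigma):
    """k(x, y) = exp(-||x - y||^2 / sigma)."""
    return float(np.exp(-np.sum((x - y) ** 2) / sigma))

x = np.array([1.0, 2.0])
y = np.array([2.0, 1.0])
print(monomial_kernel(x, y, 2))   # (1*2 + 2*1)^2 = 16.0
print(gaussian_kernel(x, x, 0.1)) # 1.0 when x == y
```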
Class-Incremental Generalized Discriminant Analysis
993
6.1 Toy Example. In this example, we use toy data to test the performance of MKGS and KGS for implementing the QR decomposition. The GDA algorithms via KGS (GDA/KGS) include the batch GDA/KGS algorithm and the class-incremental GDA/KGS algorithm for updating the discriminant vectors with the insertion of the (c + 1)th class; these two algorithms are given in appendixes D and E, respectively. Four clusters of artificial two-dimensional data are generated by the function y = x^2 + 0.5 + ε, where the x values are uniformly distributed in [−1, 1] and ε is a uniformly distributed random number on the interval [0, 0.5]. For each cluster, we generate 100 samples. The gaussian kernel with parameter σ = 0.1 is used throughout the experiment to calculate the kernel matrix. The experiment has three steps:

1. Choose the samples in the first two clusters as training data to compute the discriminant vector using the batch GDA/MKGS algorithm and the batch GDA/KGS algorithm, respectively. Then compute and display the projections of the test data onto the computed discriminant vector. Figures 1a and 2a display the respective results, showing the feature values (indicated by shade of gray) and contour lines of identical feature values.

2. Insert the third cluster into the training set, and then compute two discriminant vectors for discriminating the three clusters using the class-incremental GDA/MKGS algorithm and the class-incremental GDA/KGS algorithm, respectively. Figures 1b and 1c display the projections of the test data onto the two discriminant vectors of GDA/MKGS, while Figures 2b and 2c display the projections of the test data onto the two discriminant vectors of GDA/KGS.

3. Insert the fourth cluster into the training set, and then compute three discriminant vectors for discriminating the four clusters using the class-incremental GDA/MKGS algorithm and the class-incremental GDA/KGS algorithm, respectively.
Figures 1d through 1f display the projections of the test data onto the three discriminant vectors of GDA/MKGS, while Figures 2d through 2f display the projections of the test data onto the three discriminant vectors of GDA/KGS. Comparing the experimental results in Figures 1 and 2, we can see that the GDA/MKGS algorithm achieves much better performance than the GDA/KGS algorithm: the projections of the toy data are nicely separated in Figure 1, whereas they are not well separated in Figure 2.

6.2 Face Recognition. This example demonstrates the better performance of the class-incremental GDA/MKGS algorithm in terms of recognition accuracy and training time for a dynamic face recognition task. We use the AR face database (Martinez & Benavente, 1998) to perform
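The numerical gap between the classical and the modified Gram-Schmidt recurrences, which underlies the difference between KGS and MKGS, can be reproduced in ordinary Euclidean space with a standard ill-conditioned example. This sketch works on explicit vectors rather than in the kernel-induced feature space:

```python
import numpy as np

def gram_schmidt(A, modified=True):
    """QR by (modified or classical) Gram-Schmidt; returns Q with orthonormal columns."""
    A = A.astype(float).copy()
    n = A.shape[1]
    Q = np.zeros_like(A)
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            if modified:
                v -= (Q[:, i] @ v) * Q[:, i]        # project out the updated v
            else:
                v -= (Q[:, i] @ A[:, j]) * Q[:, i]  # project out the original column
        Q[:, j] = v / np.linalg.norm(v)
    return Q

# Nearly dependent columns: classical GS loses orthogonality, modified GS keeps it.
eps = 1e-8
A = np.array([[1, 1, 1], [eps, 0, 0], [0, eps, 0], [0, 0, eps]])
for modified in (False, True):
    Q = gram_schmidt(A, modified)
    err = np.linalg.norm(Q.T @ Q - np.eye(3))
    print("modified" if modified else "classical", err)
```

With eps = 1e-8, the classical variant produces columns that are far from orthogonal, while the modified variant keeps the orthogonality error at roughly the size of eps.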
Figure 1: Feature extraction by class-incremental GDA/MKGS. (a) Projections onto the discriminant vector computed from the first two clusters using the batch GDA/MKGS algorithm. (b, c) Projections onto the two discriminant vectors computed from the first three clusters using the class-incremental GDA/MKGS algorithm. (d–f) Projections onto the three discriminant vectors computed from the four clusters using the class-incremental GDA/MKGS algorithm.
this experiment. The AR face database consists of over 3000 facial images of 126 subjects. Each subject has 26 facial images recorded in two sessions separated by two weeks, each session consisting of 13 images. The original image size is 768 × 576 pixels, and each pixel is represented by 24 bits of RGB color values. Figure 3 shows the 26 images for one subject; images 1 to 13 were taken in the first session and images 14 to 26 in the second session. Among the 126 subjects, we randomly select 70 (50 males and 20 females) for this experiment. Similar to Cevikalp et al. (2005), we use only the nonoccluded images (those numbered 1 to 7 and 14 to 20). Before the experiment, all images are centered and cropped to 468 × 476 pixels and then down-sampled to 100 × 100 pixels.
Figure 2: Feature extraction by class-incremental GDA/KGS. (a) Projections onto the discriminant vector computed from the first two clusters using the batch GDA/KGS algorithm. (b, c) Projections onto the two discriminant vectors computed from the first three clusters using the class-incremental GDA/KGS algorithm. (d–f) Projections onto the three discriminant vectors computed from the four clusters using the class-incremental GDA/KGS algorithm.
We use the twofold cross-validation method (Fukunaga, 1990) to perform the experiment: divide all the images into two subsets, use one subset as the training set and the other as the test set, then swap the training and test sets and repeat the experiment. Because our experiment aims to demonstrate the better performance of the class-incremental GDA/MKGS algorithm for a dynamic face recognition task in terms of recognition accuracy and training time, we focus on the dynamic recognition procedure in which new subjects are inserted into the training set. More specifically, we design the two trials of the twofold cross-validation as follows. In the first trial, we choose 61 of the 70 subjects for the experiment; we use the seven images
Figure 3: Images of one subject in the AR face database.
numbered 1, 2, 3, 4, 14, 15, and 16 in each subject as training images and the other seven images numbered 5, 6, 7, 17, 18, 19, and 20 as test images. The discriminant vectors are computed using the batch GDA/MKGS algorithm, and the recognition rate on the test images is then calculated based on the nearest-neighbor classifier. After finishing the recognition, we choose one subject from the remaining subjects, insert its seven images numbered 1, 2, 3, 4, 14, 15, and 16 into the training set and the other seven images into the test set, update the discriminant vectors using the class-incremental GDA/MKGS algorithm, and recalculate the test recognition rate. This procedure is repeated until all 70 subjects are included in the training and test sets. In the second trial, we swap the training and test images for each subject; that is, for each subject we use the seven images numbered 1, 2, 3, 4, 14, 15, and 16 as test images and the other seven images numbered 5, 6, 7, 17, 18, 19, and 20 as training images, and then perform the same recognition procedure as in the first trial. For comparison, we conduct the same experiment using other commonly used face recognition methods, including the Eigenfaces method (Turk & Pentland, 1991), the Fisherfaces method (Belhumeur, Hespanha, & Kriegman, 1997), the LDA method via the MGS algorithm (LDA/MGS; Zheng, Zou, & Zhao, 2004), the KPCA method (Schölkopf et al., 1998), and the standard GDA method (Baudat & Anouar, 2000). The monomial kernel with degree d = 2 and the gaussian kernel with σ = 1e8 are used in the
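The insertion-and-reevaluation procedure described above follows a generic pattern: train once in batch mode on the initial classes, then alternate incremental updates and evaluations as classes arrive. A schematic sketch, where `batch_train`, `incremental_update`, and `evaluate` are placeholders for batch GDA/MKGS, class-incremental GDA/MKGS, and the nearest-neighbor recognition rate (not the paper's actual API):

```python
def run_protocol(classes, n_base, batch_train, incremental_update, evaluate):
    """Batch-train on the first n_base classes, then insert the remaining
    classes one at a time, updating the model and re-evaluating each time."""
    base, rest = classes[:n_base], classes[n_base:]
    model = batch_train(base)
    rates = [evaluate(model, base)]
    seen = list(base)
    for new_class in rest:
        model = incremental_update(model, new_class)  # reuses previous vectors
        seen.append(new_class)
        rates.append(evaluate(model, seen))
    return rates
```

In the AR experiment, this loop runs from 61 up to 70 classes, producing one recognition rate per class count.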
Figure 4: Average recognition rates of various systems with respect to the number of the training classes in the AR face recognition experiment, where the monomial kernel with degree d = 2 is used.
experiments. Figure 4 shows the average recognition rates over the two trials with respect to the number of training classes when the monomial kernel with degree d = 2 is used, and Figure 5 shows the average recognition rates when the gaussian kernel with σ = 1e8 is used. As we can see from Figures 4 and 5, the GDA/MKGS method achieves the best recognition rate among these commonly used face recognition methods. Moreover, to demonstrate the incremental technique of the proposed algorithm, we compare the average training time of the batch GDA/MKGS algorithm and the class-incremental GDA/MKGS algorithm for updating the discriminant vectors when a new class is inserted into the training set. Figure 6 shows the experimental results. From Figure 6 we can clearly see that the class-incremental GDA/MKGS approach requires much less training time than the batch GDA/MKGS approach.

6.3 Handwritten Digit Recognition. In this example, we conduct a handwritten digit recognition experiment to further demonstrate the recognition performance of the proposed algorithm.
Figure 5: Average recognition rates of various systems with respect to the number of the training classes in the AR face recognition experiment, where the gaussian kernel with parameter σ = 1e8 is used.
We use the handwritten digits database of the U.S. Postal Service (USPS), collected from mail envelopes in Buffalo, as the experimental data. The USPS database contains 7291 training points and 2007 test points of dimensionality 256 (Schölkopf et al., 1998). We choose the first 1500 training points as training data and all 2007 test points as test data. For comparison, this experiment is also conducted using the PCA method, the LDA method, the KPCA method, and the traditional GDA method, respectively, where the monomial kernel with a different degree in each trial is used to calculate the kernel matrix for the kernel-based algorithms. The nearest-neighbor classifier is used throughout for the classification task. Table 1 shows the best average recognition rates of the various systems. From Table 1, we can see that the GDA/MKGS method achieves the best recognition rate among the systems. Moreover, we also compare the average recognition rates of the kernel-based algorithms with respect to the choice of monomial degree. The experimental results are shown in Figure 7. It can be clearly seen from Figure 7 that the GDA/MKGS
Figure 6: Average training time (in seconds) of computing the discriminant vectors with respect to the number of training classes.

Table 1: Average Recognition Rates of Various Systems.

Method                         PCA     LDA     KPCA    GDA     GDA/MKGS
Average recognition rate (%)   92.28   86.10   92.63   93.17   93.82
algorithm significantly improves the performance of the traditional GDA algorithm.

7 Conclusion In this work, an efficient algorithm has been presented for computing the discriminant vectors of GDA in the case of the small sample size problem. By applying QR decomposition rather than SVD, the proposed algorithm is computationally fast compared to the traditional GDA algorithms. More important, the proposed algorithm introduces an effective technique for updating the discriminant vectors of GDA when new classes are inserted into
Figure 7: Average recognition rates of the three kernel-based algorithms (KPCA, GDA, GDA/MKGS) with respect to the degree of the monomial kernel.
the training set, which is very desirable for designing dynamic recognition systems. Moreover, we have proposed a modified KGS algorithm to replace the KGS algorithm proposed by Wolf and Shashua (2003) for implementing the QR decomposition in the feature space; this algorithm turns out to be much more numerically stable than the KGS algorithm. Experiments on both simulated and real data sets demonstrated the better performance of the class-incremental GDA/MKGS algorithm. On the simulated toy data, our experiment demonstrated the superiority of MKGS over KGS for implementing the QR decomposition. The face recognition experiments on the AR face database and the handwritten digit recognition experiments on the USPS database demonstrated the high recognition rates of the proposed GDA/MKGS algorithm in contrast to other commonly used recognition methods. Moreover, our face recognition experiments demonstrated the computational advantage of the class-incremental GDA/MKGS algorithm when new classes are dynamically inserted into the training set.
Appendix A: Proof of Theorem 1

Proof. Because u_i − u_1 = (u_i − u) − (u_1 − u), we get

span{u_i − u_1 | i = 2, ..., c} ⊆ span{u_i − u | i = 1, ..., c}.  (A.1)

Note that u − u_1 = (1/N) Σ_{i=1}^{c} N_i (u_i − u_1) = (1/N) Σ_{i=2}^{c} N_i (u_i − u_1). Thus, we get

span{u_i − u_1 | i = 2, ..., c} = span{u − u_1, u_i − u_1 | i = 2, ..., c}.  (A.2)

Because u_i − u = (u_i − u_1) − (u − u_1), we get

span{u_i − u | i = 1, ..., c} ⊆ span{u − u_1, u_i − u_1 | i = 2, ..., c}.  (A.3)

From equations A.2 and A.3, we obtain

span{u_i − u | i = 1, ..., c} ⊆ span{u_i − u_1 | i = 2, ..., c}.  (A.4)

From equations A.1 and A.4, we obtain

span{u_i − u | i = 1, ..., c} = span{u_i − u_1 | i = 2, ..., c}.  (A.5)

Combining equations 4.1 and A.5, we obtain S_B^Φ(0) = span{u_i − u_1 | i = 2, ..., c}.

Appendix B: Proof of Theorem 2

Proof. Note that

Φ(x_i^j) − Φ(x_i^1) = (Φ(x_i^j) − u_i) − (Φ(x_i^1) − u_i).  (B.1)
Thus, we obtain

span{Φ(x_i^j) − Φ(x_i^1) | j = 2, ..., N_i} ⊆ span{Φ(x_i^j) − u_i | j = 1, ..., N_i}.  (B.2)

Note that

Σ_{j=2}^{N_i} (Φ(x_i^j) − Φ(x_i^1)) = Σ_{j=1}^{N_i} (Φ(x_i^j) − Φ(x_i^1)) = N_i (u_i − Φ(x_i^1)).  (B.3)

Thus, we get

span{Φ(x_i^j) − Φ(x_i^1) | j = 2, ..., N_i} = span{u_i − Φ(x_i^1), Φ(x_i^j) − Φ(x_i^1) | j = 2, ..., N_i}.  (B.4)

Because Φ(x_i^j) − u_i = (Φ(x_i^j) − Φ(x_i^1)) − (u_i − Φ(x_i^1)), we get

span{Φ(x_i^j) − u_i | j = 1, ..., N_i} ⊆ span{u_i − Φ(x_i^1), Φ(x_i^j) − Φ(x_i^1) | j = 2, ..., N_i}.  (B.5)

From equations B.4 and B.5, we get

span{Φ(x_i^j) − u_i | j = 1, ..., N_i} ⊆ span{Φ(x_i^j) − Φ(x_i^1) | j = 2, ..., N_i}.  (B.6)

From equations B.2 and B.6, we obtain

span{Φ(x_i^j) − u_i | j = 1, ..., N_i} = span{Φ(x_i^j) − Φ(x_i^1) | j = 2, ..., N_i}.  (B.7)

From equation B.7, we obtain

span{Φ(x_i^j) − u_i | i = 1, ..., c; j = 1, ..., N_i} = span{Φ(x_i^j) − Φ(x_i^1) | i = 1, ..., c; j = 2, ..., N_i}.  (B.8)

From equations 4.2 and B.8, we obtain

S_W^Φ(0) = span{Φ(x_i^j) − Φ(x_i^1) | i = 1, ..., c; j = 2, ..., N_i}.

Appendix C: Proof of Theorem 3
Proof. Because the Φ(x_i^j) (i = 1, ..., c; j = 1, ..., N_i) are linearly independent, we have

rank[Φ(x_1^1), Φ(x_1^2), ..., Φ(x_c^{N_c})] = Σ_i N_i = N.  (C.1)

Moreover, we have

rank[Φ(x_1^1), Φ(x_1^2), ..., Φ(x_c^{N_c})] = rank[Φ(x_1^1), Φ(x_1^2) − Φ(x_1^1), ..., Φ(x_c^{N_c}) − Φ(x_1^1)].  (C.2)

From equations C.1 and C.2, we obtain

rank[Φ(x_1^1), Φ(x_1^2) − Φ(x_1^1), ..., Φ(x_c^{N_c}) − Φ(x_1^1)] = N.  (C.3)

Equation C.3 means that the N vectors Φ(x_1^1), Φ(x_1^2) − Φ(x_1^1), ..., Φ(x_c^{N_c}) − Φ(x_1^1) are linearly independent. Thus, we obtain

rank[Φ(x_1^2) − Φ(x_1^1), ..., Φ(x_c^{N_c}) − Φ(x_1^1)] = N − 1.  (C.4)

Moreover, we have

rank[Φ(x_1^1) − u, Φ(x_1^2) − u, ..., Φ(x_c^{N_c}) − u] = rank[Φ(x_1^1) − u, Φ(x_1^2) − Φ(x_1^1), ..., Φ(x_c^{N_c}) − Φ(x_1^1)] ≥ rank[Φ(x_1^2) − Φ(x_1^1), ..., Φ(x_c^{N_c}) − Φ(x_1^1)].  (C.5)

From equations C.4 and C.5, we obtain

rank[Φ(x_1^1) − u, Φ(x_1^2) − u, ..., Φ(x_c^{N_c}) − u] ≥ N − 1.  (C.6)

Note that Σ_{i=1}^{c} Σ_{j=1}^{N_i} (Φ(x_i^j) − u) = 0. Thus, we have

rank[Φ(x_1^1) − u, Φ(x_1^2) − u, ..., Φ(x_c^{N_c}) − u] = rank[Φ(x_1^2) − u, ..., Φ(x_c^{N_c}) − u] ≤ N − 1.  (C.7)

Combining equations C.6 and C.7, we obtain

rank[Φ(x_1^1) − u, Φ(x_1^2) − u, ..., Φ(x_c^{N_c}) − u] = N − 1.  (C.8)

Note that S_T^Φ(0) = span{Φ(x_i^j) − u | i = 1, ..., c; j = 1, ..., N_i}. Thus, from equation C.8, we obtain rank(S_T^Φ(0)) = N − 1.
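The rank argument of Theorem 3 can be checked numerically in a finite-dimensional stand-in for the feature space, with random vectors playing the role of the linearly independent Φ(x_i^j):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
X = rng.standard_normal((N, 10))   # N linearly independent vectors (rows)
centered = X - X.mean(axis=0)      # subtract the global mean u from each vector
print(np.linalg.matrix_rank(X))        # N
print(np.linalg.matrix_rank(centered)) # N - 1: centering costs exactly one rank
```

The centered vectors sum to zero, so their rank drops by exactly one, mirroring equation C.8.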
Appendix D: Batch GDA/KGS Algorithm

1. Let s_1 = t_1 = e_1, D_11 = K(1, 1);
2. Repeat for j = 2, ..., N − 1:
   a. Compute s_j = ( t_11 K(1, j)/D_11, ..., Σ_{q=1}^{j−1} t_{q(j−1)} K(q, j)/D_{(j−1)(j−1)}, 1, 0, ..., 0 )^T;
   b. Compute t_j = (−t_1, ..., −t_{j−1}, 1, 0, ..., 0) s_j;
   c. Compute D_jj = Σ_{p,q=1}^{j} t_{pj} t_{qj} K(p, q);
3. [β_{N−c+1}, ..., β_{N−1}] = A [t_{N−c+1}, ..., t_{N−1}] (D_ij)^{−1/2}_{i,j=N−c+1,...,N−1};

The c − 1 vectors β_{N−c+1}, ..., β_{N−1} are the discriminant vectors of GDA in the case of the small sample size problem.

Appendix E: Class-Incremental GDA/KGS Algorithm: Updating Discriminant Vectors with the Insertion of the (c + 1)th Class

1. Repeat for j = 1, ..., N − c:
   a. Compute t̃_j = (t_j^T, 0, ..., 0)^T;
   b. Compute D̃_jj = D_jj;
2. Repeat for j = N − c + 1, ..., N + N_{c+1} − 1:
   a. Compute s̃_j = ( t̃_11 K̃(1, j)/D̃_11, ..., Σ_{q=1}^{j−1} t̃_{q(j−1)} K̃(q, j)/D̃_{(j−1)(j−1)}, 1, 0, ..., 0 )^T;
   b. Compute t̃_j = (−t̃_1, ..., −t̃_{j−1}, 1, 0, ..., 0) s̃_j;
   c. Compute D̃_jj = Σ_{p,q=1}^{j} t̃_{pj} t̃_{qj} K̃(p, q);
3. [β_{N+N_{c+1}−c}, ..., β_{N+N_{c+1}−1}] = Ã [t̃_{N+N_{c+1}−c}, ..., t̃_{N+N_{c+1}−1}] (D̃_ij)^{−1/2}_{i,j=N+N_{c+1}−c,...,N+N_{c+1}−1};

The c vectors β_{N+N_{c+1}−c}, ..., β_{N+N_{c+1}−1} are the new discriminant vectors of GDA after the (c + 1)th class is inserted into the training set.

Acknowledgments

I thank the anonymous reviewers for their valuable comments and suggestions. I also thank Songcan Chen from the Department of Computer Science and Engineering, University of Aeronautics and Astronautics, China, for his kind discussions and valuable advice. This work was supported in part by the National Natural Science Foundation of China under grants 60503023 and 60473035, and in part by the Jiangsu Natural Science Foundation under grants BK2005407 and BK2005122.

References

Baudat, G., & Anouar, F. (2000). Generalized discriminant analysis using a kernel approach. Neural Computation, 12, 2385–2404.
Belhumeur, P. N., Hespanha, J. P., & Kriegman, D. J. (1997). Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 711–720.
Björck, Å. (1967). Solving linear least squares problems using Gram-Schmidt orthogonalization. BIT, 7, 1–21.
Björck, Å. (1994). Numerics of Gram-Schmidt orthogonalization. Linear Algebra and Its Applications, 197–198, 297–316.
Cevikalp, H., Neamtu, M., Wilkes, M., & Barkana, A. (2005). Discriminative common vectors for face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 4–13.
Cevikalp, H., & Wilkes, M. (2004). Face recognition by using discriminant common vectors. In Proceedings of the 17th International Conference on Pattern Recognition (pp. 326–329). Piscataway, NJ: IEEE Computer Society.
Chen, L. F., Liao, H. Y. M., Ko, M. T., Lin, J. C., & Yu, G. J. (2000). A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition, 33, 1713–1726.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Fukunaga, K. (1990). Introduction to statistical pattern recognition. Orlando, FL: Academic Press.
Liu, W., Wang, Y., Li, S. Z., & Tan, T. (2004). Null space-based kernel Fisher discriminant analysis for face recognition. In Proceedings of the Sixth International Conference on Automatic Face and Gesture Recognition (pp. 369–374). Piscataway, NJ: IEEE Computer Society.
Lu, L., Plataniotis, K., & Venetsanopoulos, A. (2003). Face recognition using kernel direct discriminant analysis algorithms. IEEE Transactions on Neural Networks, 14(1), 117–126.
Martinez, A. M., & Benavente, R. (1998). The AR face database (CVC Tech. Rep. No. 24). Barcelona, Spain: Computer Vision Center.
Rice, J. R. (1966). Experiments on Gram-Schmidt orthogonalization. Mathematics of Computation, 20, 325–328.
Schölkopf, B., Smola, A., & Müller, K. R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Wolf, L., & Shashua, A. (2003). Learning over sets using kernel principal angles. Journal of Machine Learning Research, 4, 913–931.
Xiong, T., Ye, J., Li, Q., Cherkassky, V., & Janardan, R. (2005). Efficient kernel discriminant analysis via QR decomposition. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1529–1536). Cambridge, MA: MIT Press.
Yang, J., Frangi, A. F., Jin, Z., & Yang, J.-Y. (2004). Essence of kernel Fisher discriminant: KPCA plus LDA. Pattern Recognition, 37, 2097–2100.
Ye, J., Li, Q., Xiong, H., Park, H., Janardan, R., & Kumar, V. (2004). IDR/QR: An incremental dimension reduction algorithm via QR decomposition. In Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 364–373). New York: ACM Press.
Zheng, W., Zhao, L., & Zou, C. (2004a). An efficient algorithm to solve the small sample size problem for LDA. Pattern Recognition, 37, 1077–1079.
Zheng, W., Zhao, L., & Zou, C. (2004b). A modified algorithm for generalized discriminant analysis. Neural Computation, 16, 1283–1297.
Zheng, W., Zhao, L., & Zou, C. (2005). Foley-Sammon optimal discriminant vectors using kernel approach. IEEE Transactions on Neural Networks, 16(1), 1–9.
Zheng, W., Zou, C., & Zhao, L. (2004). Real-time face recognition using Gram-Schmidt orthogonalization algorithm for LDA. In Proceedings of the 17th International Conference on Pattern Recognition (pp. 403–406). Piscataway, NJ: IEEE Computer Society.
Received January 10, 2005; accepted August 19, 2005.
ARTICLE
Communicated by Noboru Murata
Singularities Affect Dynamics of Learning in Neuromanifolds Shun-ichi Amari [email protected] RIKEN Brain Science Institute, Saitama, 351-0198, Japan
Hyeyoung Park [email protected] Kyungpook National University, Korea
Tomoko Ozeki [email protected] RIKEN Brain Science Institute, Saitama, 351-0198, Japan
The parameter spaces of hierarchical systems such as multilayer perceptrons include singularities due to the symmetry and degeneration of hidden units. A parameter space forms a geometrical manifold, called the neuromanifold in the case of neural networks. Such a model is identified with a statistical model, and a Riemannian metric is given by the Fisher information matrix. However, the matrix degenerates at singularities. Such a singular structure is ubiquitous not only in multilayer perceptrons but also in gaussian mixture probability densities, the ARMA time-series model, and many other cases. The standard statistical paradigm of the Cramér-Rao theorem does not hold, and the singularity gives rise to strange behaviors in parameter estimation, hypothesis testing, Bayesian inference, model selection, and, in particular, the dynamics of learning from examples. Prevailing theories so far have not paid much attention to the problem caused by singularity, relying only on ordinary statistical theories developed for regular (nonsingular) models. Only recently have researchers remarked on the effects of singularity, and theories are now being developed. This article gives an overview of the phenomena caused by the singularities of statistical manifolds related to multilayer perceptrons and gaussian mixtures. We demonstrate our recent results on these problems. Simple toy models are also used to show explicit solutions. We explain that the maximum likelihood estimator is no longer subject to the gaussian distribution even asymptotically, because the Fisher information matrix degenerates; that model selection criteria such as AIC, BIC, and MDL fail to hold in these models; that a smooth Bayesian prior becomes singular in such models; and that the trajectories of dynamics of learning are strongly affected by the singularity, causing plateaus or slow manifolds in the parameter space. The natural gradient method is shown to perform well because it takes the singular geometrical structure into account. The generalization error and the training error are studied in some examples.

Neural Computation 18, 1007–1065 (2006)  © 2006 Massachusetts Institute of Technology

1 Introduction

The multilayer perceptron is an adaptive nonlinear system that receives input signals and transforms them into adequate output signals. Learning takes place by modifying the connection weights and the thresholds of neurons. Since perceptrons are specified by a set of these parameters, we may regard the whole set of perceptrons as a high-dimensional space or manifold whose coordinate system is given by these modifiable parameters. We call this a neuromanifold.

Let us assume that the behavior of a perceptron is disturbed by noise, so that on receiving input signal x, a perceptron emits output y stochastically. This stochastic behavior is determined by the parameters. Let us also assume that a pair (x, y) of input signal x and corresponding answer y is given from the outside by a teacher. A number of examples are generated by an unknown probability distribution, p0(x, y) or the conditional probability p0(y|x), of the teacher. Let us denote the set of examples as (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). A perceptron learns from these examples to imitate the stochastic behavior of the teacher.

The behavior of a perceptron under noise is described by a conditional probability distribution p(y|x), which is the probability of output y given input x, so it can be regarded as a statistical model that includes a number of unknown parameters. From the statistical point of view, estimation of the parameters is carried out from examples generated by an unknown probability of the teacher network. Learning, especially online learning, is a type of estimation where the parameters are modified sequentially, using examples one by one. The parameters change by learning, forming a trajectory in the neuromanifold.
Therefore, we need to study the geometrical features of the neuromanifold to elucidate the behavior of learning. The neuromanifold of multilayer perceptrons is a special statistical model because it includes singular points, where the Fisher information matrix degenerates. This is due to the symmetry of hidden units, and the number of hidden units substantially decreases when two hidden neurons become identical. The identifiability of parameters is lost at such singular positions. Such a structure was described in the pioneering work of Brockett (1976) in the case of linear systems, and in the case of multilayer perceptrons by Chen, Lu, and Hecht-Nielsen (1993), Sussmann (1992), Kůrková and Kainen (1994), and Rüger and Ossen (1997). This type of structure is ubiquitous in many hierarchical models such as the model of probability densities given by gaussian mixtures, the ARMA time-series model, and the model of linear systems whose transfer functions are given by rational functions. The
Riemannian metric degenerates at such singular points, which are not isolated but form a continuum. In all of these models, when we summarize the parameters that give the same behavior (probability distribution), the set of behaviors is known to have a generic cone-type singularity embedded in a finite-dimensional, sometimes infinite-dimensional, regular manifold (Dacunha-Castelle & Gassiat, 1997). Many interesting problems arise in such singular models. Since the Fisher information matrix degenerates at singular points, its inverse does not exist. Therefore, the Cramér-Rao paradigm of classic statistical theory cannot be applied. The maximum likelihood estimator is no longer subject to the gaussian distribution even asymptotically, although the consistency of behavior holds. The criteria used for model selection (such as AIC, BIC, and MDL) are derived from the gaussianity of the maximum likelihood estimator, with the covariance given by the inverse of the Fisher information, so their validity is lost in such hierarchical models. The generalization error has so far been evaluated based on the Cramér-Rao paradigm, so we need a new theoretical method to attack the problem. This is related to the strange behavior of the log-likelihood ratio statistic in a singular model. These problems have not been fully explored by conventional statistics. When the true distribution lies at a regular point, the classical Cramér-Rao paradigm is still valid, provided it is sufficiently separated from a singular point. However, the dynamics of learning is a global problem that takes place throughout the entire neuromanifold. It has been shown that once parameters are attracted to singular points, they are very slow to move away from them. This is the plateau phenomenon ubiquitously observed in backpropagation learning. Therefore, even when the true point lies at a regular point, the singular structure strongly affects the dynamics of learning.
Although there is no unified theory, problems caused by the singular structure in hierarchical models have been noted by many researchers, and various new approaches have been proposed. This is a new area of research that is attracting much attention. Hagiwara, Toda, and Usui (1993) were the first to notice this problem. They used AIC (the Akaike information criterion) to determine the size of perceptrons to be used for learning and found that AIC did not work well. AIC is a criterion for selecting the model that minimizes the generalization error. However, it has long been reported that AIC does not give good model selection performance in the case of multilayer perceptrons. Hagiwara et al. found that this is because of the singular structure of such a hierarchical model and investigated ways to overcome this difficulty (Hagiwara, 2002a, 2002b; Hagiwara, Hayasaka, Toda, Usui, & Kuno, 2001; Hagiwara et al., 1993; Kitahara, Hayasaka, Toda, & Usui, 2000). To accelerate the dynamics of learning, Amari (1998) proposed the natural or Riemannian gradient method of learning, which takes into account the geometrical structure of the neuromanifold. Through this method, one can avoid the plateau phenomenon in learning (Amari, 1998; Amari &
Ozeki, 2001; Amari, Park, & Fukumizu, 2000; Park, Amari, & Fukumizu, 2000). However, the Fisher information matrix, or the Riemannian metric, degenerates in some places because of singularity, so we need to develop a new theory of the dynamics of learning (see Amari, 1967, for the dynamics of learning in regular multilayer perceptrons) to understand its behavior, in particular the effects of singularity in the ordinary backpropagation method and the natural gradient method. Work done in this direction includes that of Fukumizu and Amari (2000) as well as the statistical-mechanical approaches taken by Saad and Solla (1995), Rattray, Saad, and Amari (1998), and Rattray and Saad (1999). Fukumizu used a simple linear model and showed that the generalization error of multilayer perceptrons with singularities differs from that of the regular statistical model (Fukumizu, 1999). This problem is related to the analysis of the log-likelihood-ratio statistic of the perceptron model at a singularity (Fukumizu, 2003). The strange behavior of the log-likelihood statistic in the gaussian mixture has been noted since the time of Hotelling (1939) and Weyl (1939), but only recently has it become possible to derive its asymptotic behavior (Hartigan, 1985; Liu & Shao, 2003). Fukumizu (2003) extended the idea of the gaussian random field (Hartigan, 1985; Dacunha-Castelle & Gassiat, 1997) to make it applicable to multilayer perceptrons and formulated the asymptotics of the generalization error of multilayer perceptrons. Watanabe (2001a, 2001b, 2001c) was the first to study the effect of singularity in Bayesian inference.
He and his colleagues introduced algebraic geometry and algebraic analysis, using Hironaka's theorem of singularity resolution and Sato's formula in algebraic analysis, to evaluate the asymptotic performance of the Bayesian predictive distribution in various hierarchical singular models, and remarkable results have been derived (Watanabe, 2001a, 2001b, 2001c; Yamazaki & Watanabe, 2002, 2003; Watanabe & Amari, 2003). In this article, we give an overview of the strange behaviors of singular models studied so far. They include estimation, testing, Bayesian inference, model selection, and generalization and training errors. Special attention is paid to the dynamics of learning from the information-geometric point of view by summarizing our previous results (Amari & Ozeki, 2001; Amari, Park, & Ozeki, 2001, 2002; Amari, Ozeki, & Park, 2003). In particular, we show new results concerning the fast and slow submanifolds in learning of gaussian mixtures. The article is organized as follows. We give various examples of singular models in section 2. They include simple multilayer perceptrons with one hidden unit and with two hidden units, the gaussian mixture model of probability distributions, and a toy cone model that is used to give an exact analysis. The analysis of the gaussian mixture is newly presented here. In section 3, we explain the theory developed by Dacunha-Castelle and Gassiat (1997), which shows the generic cone structure of a singular model. This elucidates why strange behaviors emerge in a singular model.
Singularities Affect Dynamics of Learning in Neuromanifolds
1011
We then show that such models with singularity exhibit strange behaviors, differing from those of ordinary regular statistical models, in parameter estimation, the Bayesian predictive distribution, the dynamics of learning, and model selection in section 4. Section 5 is devoted to a detailed analysis of the dynamics of learning. We use simple models to show that slow and fast manifolds appear in the neighborhood of a singularity, which explains the plateau phenomenon in the ordinary gradient learning method. The natural gradient method is shown to resolve this difficulty. Section 6 deals with the generalization error and training error in simple singular models, where the gaussian random field (Hartigan, 1985; Dacunha-Castelle & Gassiat, 1997; Fukumizu, 2003) is used and special potential functions are introduced (Amari & Ozeki, 2001). Explicit formulas are given in the cases of the maximum likelihood estimator and the Bayesian estimator.

2 Singular Statistical Models and Their Geometrical Structure

The manifolds, or the parameter spaces, of many hierarchical models such as multilayer perceptrons inevitably include singularities. Typical examples are presented in this section.

2.1 Single Hidden Unit Perceptron. We begin with a trivial model of a single hidden unit perceptron, which receives input vector x = (x1, ···, xm) and emits scalar output y. The hidden unit calculates the weighted sum w · x = Σ_i wi xi of the input, where w = (w1, ···, wm) is the weight vector, and emits its nonlinear function ϕ(w · x) as the output, where ϕ(u) = tanh(u) is the activation function. The output unit is linear, so its output is vϕ(w · x), which is disturbed by additive gaussian noise ε. Hence, the final output is

y = vϕ(w · x) + ε.
(2.1)
The parameters to specify a perceptron are summarized into a single vector θ = (w, v). The average of y, given x, is E[y] = f (x, θ ) = vϕ(w · x),
(2.2)
where E denotes expectation, given x. Any point θ in the (m + 1)-dimensional parameter space M = {θ } specifies a perceptron and its average output function f (x, θ ). However, the parameters are redundant or unidentifiable in some cases. Since ϕ is an odd function, (w, v) and (−w, −v) give the same function, f (x, θ ) = f (x, −θ ).
(2.3)
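The redundancies just described can be verified directly. The following sketch (illustrative code; the function name is ours) evaluates the average output f(x, θ) = v tanh(w · x) and checks the sign-flip symmetry of equation 2.3 as well as the degeneracy at v = 0:

```python
import numpy as np

def f(x, w, v):
    """Average output of the single hidden unit perceptron: v * tanh(w . x)."""
    return v * np.tanh(np.dot(w, x))

rng = np.random.default_rng(0)
x = rng.normal(size=3)
w = np.array([0.5, -1.2, 0.3])
v = 0.8

# Sign-flip symmetry: (w, v) and (-w, -v) give the same function (equation 2.3).
assert np.isclose(f(x, w, v), f(x, -w, -v))

# Degeneracy on the critical set C = {vw = 0}: with v = 0 the output is 0
# whatever value w takes, so w is unidentifiable.
for _ in range(5):
    assert f(x, rng.normal(size=3), 0.0) == 0.0
```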
Figure 1: (Left) Parameter space M of a single hidden unit perceptron. (Right) Space M̃ of the average output functions of perceptrons with a singular structure.
Moreover, when v = 0 (or w = 0), f (x, θ ) = 0, whatever value w takes (v takes). Hence, the function f (x, θ ) is the same in the set C ⊂ M, C = {(v, w)| vw = 0}
(2.4)
(see Figure 1, left), which we call the critical set. Therefore, if we summarize the points that have the same average output function f(x, θ) into one, the parameter space shrinks such that the critical set C is reduced to a single point. The parameter space reduces to M̃, which consists of all the different average output functions of perceptrons. It consists of two components connected by a single point (see Figure 1, right). In other words, M̃ has a singularity. Note that M is the parameter space, each point of which corresponds to a perceptron. However, the behaviors of some perceptrons are the same even if their parameters are different. The reduced M̃ is the set of perceptron behaviors that correspond to the average output functions or the probability distributions specified thereby. The probability density function of the input-output pair (x, y) is given by

p(y, x, θ) = (1/√(2π)) q(x) exp{−(1/2)(y − f(x, θ))²},
(2.5)
where ε in equation 2.1 is subject to the standard gaussian distribution N(0, 1) and q(x) is the probability density function of input x. The Fisher information matrix is defined by

G(θ) = E[(∂l(y, x, θ)/∂θ)(∂l(y, x, θ)/∂θ)ᵀ],
(2.6)
where E denotes the expectation with respect to p(y, x, θ) and l(y, x, θ) = log p(y, x, θ) = −(1/2)(y − f(x, θ))² + log q(x) + const. The Fisher information matrix is a fundamental quantity in statistics. It is positive definite in a regular statistical model and plays the role of the Riemannian metric of the parameter space, as is shown by information geometry (Amari & Nagaoka, 2000). The Fisher information gives the average amount of information included in one pair (y, x) of data, which is used to estimate the parameter θ.

Cramér-Rao Theorem. Let θ̂ be an unbiased estimator from n examples in a regular statistical model. Then the error covariance of θ̂ satisfies

E[(θ̂ − θ)(θ̂ − θ)ᵀ] ≥ (1/n) G⁻¹(θ).
(2.7)
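The degeneracy of the Fisher information matrix on the critical set can also be seen numerically. The sketch below (illustrative, not from the paper) estimates G(θ) for the scalar-input model y = v tanh(wx) + ε with x ∼ N(0, 1): since E[ε²] = 1, G(θ) = E_x[g(x)g(x)ᵀ] with score direction g = (v(1 − tanh²(wx))x, tanh(wx)). At a regular point det G > 0, while on the critical set v = 0 the w-component of the score vanishes identically, so the inverse G⁻¹ appearing in equation 2.7 does not exist:

```python
import numpy as np

def fisher(w, v, n=100_000, seed=0):
    """Monte Carlo estimate of G(theta) for y = v*tanh(w*x) + eps, x, eps ~ N(0,1).

    The score is dl/dtheta = (y - f) * df/dtheta; averaging over eps (E[eps^2] = 1)
    leaves G = E_x[g(x) g(x)^T] with g = (v*(1 - tanh(w*x)^2)*x, tanh(w*x)).
    """
    x = np.random.default_rng(seed).normal(size=n)
    g_w = v * (1.0 - np.tanh(w * x) ** 2) * x   # score direction for w
    g_v = np.tanh(w * x)                        # score direction for v
    g = np.stack([g_w, g_v])
    return g @ g.T / n

G_regular = fisher(w=1.0, v=1.0)
G_singular = fisher(w=1.0, v=0.0)   # on the critical set C: vw = 0

assert np.linalg.det(G_regular) > 1e-3      # invertible: Cramer-Rao applies
assert np.linalg.det(G_singular) == 0.0     # degenerate: G^{-1} does not exist
```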
Moreover, the equality holds asymptotically (i.e., for large n) for the MLE (maximum likelihood estimator) θ̂, which is asymptotically subject to the gaussian distribution with mean θ and covariance matrix (1/n)G⁻¹(θ). However, this does not hold when v = 0 or w = 0, because

∂l(y, x, θ)/∂v = 0

(2.8)

or

∂l(y, x, θ)/∂w = 0.

(2.9)
When v = 0, p(y, x, θ) is kept constant even when w changes, and the same situation holds when w = 0. Hence, the Fisher information matrix, equation 2.6, degenerates, and its inverse G⁻¹(θ) does not exist in the set C : vw = 0. The Cramér-Rao theorem is no longer valid at the critical set C. The model is singular on C. This makes it difficult to analyze the performance of estimation and learning when the true distribution is in C or in the neighborhood of C.

2.2 Gaussian Mixture. The gaussian mixture is a statistical model of probability distributions that has long been known to include singularities (Hotelling, 1939; Weyl, 1939). Let us assume that the probability density of the real variable x is given by a mixture of k gaussian distributions as
p(x, θ) = Σ_{i=1}^k vi ψ(x − µi),
(2.10)
where ψ(x) = (1/√(2π)) exp{−x²/2}. The parameters are θ = (v1, ···, vk; µ1, ···, µk), with 0 ≤ vi ≤ 1 and Σ_i vi = 1. When we know the number k of components, all we have to do is to estimate the unknown parameters θ from observed data x1, x2, ···, xn. However, in the usual situation, k is unknown. To make the story simple, let us consider the case of k = 2. The model is given by

p(x, θ) = vψ(x − µ1) + (1 − v)ψ(x − µ2).
(2.11)
The parameter space M is three-dimensional, θ = (v, µ1 , µ2 ). If µ1 = µ2 holds, p(x, θ ) actually consists of one component, as is the case with k = 1. In this case, the distribution is the same whatever value v takes, so we cannot identify v. Moreover if v = 0 or v = 1, the distribution is the same whatever value µ1 or µ2 takes, so the parameters are unidentifiable. Hence, in the parameter space M = {θ }, some parameters are unidentifiable in the region C: C : v(1 − v) (µ1 − µ2 ) = 0.
(2.12)
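On the critical set 2.12 the two-component density literally collapses to a single gaussian, so distinct parameter values give the same distribution. A quick check (illustrative code, not from the paper):

```python
import numpy as np

def psi(x):
    """Standard gaussian density."""
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def mixture(x, v, mu1, mu2):
    """Two-component gaussian mixture, equation 2.11."""
    return v * psi(x - mu1) + (1 - v) * psi(x - mu2)

x = np.linspace(-5, 5, 201)

# On C with mu1 = mu2 = mu0: the density is psi(x - mu0) whatever v is.
assert np.allclose(mixture(x, 0.1, 1.0, 1.0), mixture(x, 0.9, 1.0, 1.0))

# On C with v = 0: the density does not depend on mu1 at all.
assert np.allclose(mixture(x, 0.0, -3.0, 1.0), mixture(x, 0.0, 7.0, 1.0))

# Off C the parameters do matter.
assert not np.allclose(mixture(x, 0.3, -1.0, 1.0), mixture(x, 0.7, -1.0, 1.0))
```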
This is depicted in the shaded area in Figure 2. We call this the critical set. In the critical set, the determinant of the Fisher information matrix becomes 0, and there is no inverse matrix. Let us look at the critical set C ⊂ M carefully. In C, any point on the three lines—µ1 = µ2 = µ0 , v = 0, µ2 = µ0 , and v = 1, µ1 = µ0 —(see Figure 2 left), represents the same gaussian distribution, ψ(x − µ0 ). If we regard the parameter points representing the same distribution as one and the same, these three lines shrink to one point.
Figure 2: Parameter space M of the gaussian mixture and its singular structure M̃.
Mathematically speaking, we obtain the residue class M̃ = M/≈ through the equivalence relation ≈. Then we get the space M̃ depicted in Figure 2 (right), which is the set of probability distributions (not the set M of parameters). This is the space where singularities accumulate on a line C̃, and the dimensionality is reduced on that line. The line corresponds to the critical set C. To analyze the nature of singularity, let us introduce new variables w and u: w is the center of gravity of the two peaks, or the mean value of the distribution, and u is the difference between the locations of the two peaks,

w = vµ1 + (1 − v)µ2
(2.13)
u = µ2 − µ1 .
(2.14)
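The variable w of equation 2.13 is exactly the mean of the mixture 2.11, which a quick sampling check confirms (illustrative code; the sampling scheme draws a component and then a gaussian):

```python
import numpy as np

rng = np.random.default_rng(0)
v, mu1, mu2 = 0.3, -1.0, 2.0
w = v * mu1 + (1 - v) * mu2          # equation 2.13: the mean of the mixture
u = mu2 - mu1                        # equation 2.14: separation of the peaks

# Draw from the mixture: pick a component, then sample that gaussian.
n = 200_000
component = rng.random(n) < v
samples = np.where(component, rng.normal(mu1, 1.0, n), rng.normal(mu2, 1.0, n))

assert abs(samples.mean() - w) < 0.02   # empirical mean matches w = 1.1
assert u == 3.0
```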
Estimation of the mean parameter w of the distribution is easy, because the Fisher information matrix is nondegenerate in this direction. The problem is estimation of u and v when uv(1 − v) is small, because the critical region C is given by uv(1 − v) = 0. To make discussions simpler, consider the case where w = 0 is known, and only u and v are unknown. The distribution is then written as

p(x, u, v) = vψ(x − (1 − v)u) + (1 − v)ψ(x + vu).
(2.15)
Let us consider only the region where u ≈ 0 and v(1 − v) > c holds for some constant c. By Taylor expansion of the above equation around u = 0, we get

p(x, u, v) ≈ ψ(x)[1 + (1/2)c2(v)H2(x)u² + (1/6)c3(v)H3(x)u³ + (1/24)c4(v)H4(x)u⁴ + (1/120)c5(v)H5(x)u⁵ + ···],

(2.16)

where ci(v) = v(1 − v)^i + (1 − v)(−v)^i, meaning that c2(v) = v(1 − v), c3(v) = v(1 − v)(1 − 2v), c4(v) = v(1 − v)(1 − 3v + 3v²), and
H2(x) = x² − 1, H3(x) = x³ − 3x, H4(x) = x⁴ − 6x² + 3 are the Hermite polynomials. We can embed this singular model locally in a regular model S as its singular subset. This is our new strategy for studying the singular structure: locally embedding it in an exponential family of distributions whose structure is well known and in which regular statistical analysis is possible. Let us consider a regular exponential family S specified by the regular parameters θ = (θ1, θ2, θ3),

p(x, θ) = ψ(x) exp{θ1 H2(x) + θ2 H3(x) + θ3 H4(x)},
(2.17)
where ψ(x) is a dominating measure. Let us denote by S the parameter space of θ. Now we calculate l(x, u, v) = log p(x, u, v) by taking the logarithm of equation 2.16 and performing Taylor expansion again,

l(x, u, v) = log ψ(x) + (u²/2)c2(v)H2(x) + (u³/6)c3(v)H3(x) + (u⁴/24){c4(v)H4(x) − 3c2(v)²H2(x)²}.

(2.18)

Hence, the parameter space M of gaussian mixtures is approximately embedded in S by

θ1 = (1/2)c2(v)u²,  (2.19)
θ2 = (1/6)c3(v)u³,  (2.20)
θ3 = (1/24){c4(v) − 3c2(v)²}u⁴.  (2.21)
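The accuracy of expansion 2.16, and hence of the embedding 2.19 to 2.21, can be probed numerically against the density 2.15 (illustrative sketch; it truncates the expansion at the u⁵ term):

```python
import numpy as np

def psi(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def c(i, v):
    """c_i(v) = v(1-v)^i + (1-v)(-v)^i."""
    return v * (1 - v)**i + (1 - v) * (-v)**i

def H(i, x):
    """Hermite polynomials H2..H5 (probabilists' convention, as in the text)."""
    return {2: x**2 - 1, 3: x**3 - 3*x, 4: x**4 - 6*x**2 + 3,
            5: x**5 - 10*x**3 + 15*x}[i]

def p_exact(x, u, v):
    """Equation 2.15: the mixture with its mean fixed at w = 0."""
    return v * psi(x - (1 - v) * u) + (1 - v) * psi(x + v * u)

def p_taylor(x, u, v):
    """Equation 2.16: expansion around u = 0, truncated at the u^5 term."""
    fact = {2: 2, 3: 6, 4: 24, 5: 120}
    series = 1 + sum(c(i, v) * H(i, x) * u**i / fact[i] for i in (2, 3, 4, 5))
    return psi(x) * series

x = np.linspace(-4, 4, 401)
err_small = np.max(np.abs(p_exact(x, 0.1, 0.3) - p_taylor(x, 0.1, 0.3)))
err_large = np.max(np.abs(p_exact(x, 3.0, 0.3) - p_taylor(x, 3.0, 0.3)))

assert err_small < 1e-6    # excellent for small u
assert err_large > 1e-2    # the expansion degrades when u is not small
```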
This embedding from (u, v) to θ is singular. We first consider how M is embedded in the two-dimensional space (θ1 , θ2 ) by equations 2.19 and 2.20 in Figure 3a, where θ3 is ignored. The shape near the singularity u ≈ 0 becomes clear by using (u, v)-coordinates of M and their map in S. Let us consider a line in M where v is constant. This line v = const is
Figure 3: Gaussian mixture distribution M embedded in S. (a) M embedded in the parameter space S, (θ1, θ2). (b) The picture where M cannot be embedded in S and sticks out into the higher dimension.
embedded in S as

θ1 = a1 u²,  θ2 = a2 u³,

where a1 and a2 are constants depending on v. This curve is not smooth but
is a cusp in S. The transformation from (u, v) to (θ1, θ2) is singular at u = 0, and the Jacobian determinant,

|J| = det [ ∂θ1/∂u  ∂θ2/∂u ; ∂θ1/∂v  ∂θ2/∂v ],

(2.22)
vanishes at u = 0. To elucidate the dynamics of learning near the singularity u = 0, θ3 is necessary, as we will show later. Note that the above approximation, which uses Taylor expansion in u, is not applicable in the case where u is not small but v is close to 0 or 1. Equation 2.16 is valid only in the case of small u. Taylor expansion is not suitable when v(1 − v) becomes small and u is large. Another consideration in terms of a gaussian random field is necessary, because an infinite-dimensional regular space is required to include the singular M. We have drawn the picture of M embedded in the three dimensions of S = {θ = (θ1, θ2, θ3)} by calculating the higher-order term of u⁴ in Figure 3b. If u becomes large or v approaches 0 or 1, the surface of the embedded M includes higher terms that cannot be represented in equation 2.17, and the embedded M̃ sticks out and is wrapped in higher dimensions. As v(1 − v) approaches 0, the points in M̃ shrink toward the origin as u → 0 but expand into infinite dimensions. This situation arises in the non-Donsker case (Dacunha-Castelle & Gassiat, 1997). Our primary interest is to know the influence of the singularity on the dynamics of learning. Because the natural gradient learning method takes the geometrical structure into account, it will not be greatly affected by the singularity. However, the effect of the singularity is not negligible when the true distribution is at, or near, the singularity C. In gradient descent learning algorithms, the critical set C works as a pseudo-attractor (a Milnor attractor) even when the true distribution is at a regular point far from it. The singular point resembles a black hole in the dynamics of learning, and it is necessary to investigate its influence on the dynamics, as we discuss in a later section. The Fisher information matrix degenerates in the critical set. We calculate it explicitly in a later section.

2.3 Multilayer Perceptron.
We consider here the multilayer perceptron with hidden units (Rosenblatt, 1961; Amari, 1967; Minsky & Papert, 1969; Rumelhart, Hinton, & Williams, 1986). It is a singular model where the output y is written as
y = Σ_{i=1}^k vi ϕ(wi · x) + ε.
(2.23)
Here, x is an input vector, ϕ(wi · x) is the output of the ith neuron of the hidden layer, and wi is its weight vector. The neuron of the output layer summarizes all the outputs of the hidden layer by taking the weighted sum with weights vi . Gaussian noise ε ∼ N(0, 1) is added at the end, so the output y is a random variable. Let us summarize all the modifiable parameters in one vector, θ = (w1 , · · · , w k ; v1 , · · · , vk ), and then the average output is given by
E[y] = f(x, θ) = Σ_{i=1}^k vi ϕ(wi · x).

(2.24)
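The redundancies of equation 2.24 can be checked directly; for instance, two hidden units sharing a weight vector contribute only through vi + vj, and a unit with zero weight vector contributes nothing (illustrative code, not from the paper):

```python
import numpy as np

def mlp(x, W, v):
    """Average output of the perceptron, equation 2.24: sum_i v_i * tanh(w_i . x)."""
    return np.tanh(W @ x) @ v

rng = np.random.default_rng(1)
x = rng.normal(size=4)
w = rng.normal(size=4)
W = np.stack([w, w])                     # two hidden units with w1 = w2

# Only v1 + v2 matters: (0.9, 0.1) and (0.4, 0.6) behave identically.
assert np.isclose(mlp(x, W, np.array([0.9, 0.1])),
                  mlp(x, W, np.array([0.4, 0.6])))

# A unit with w_i = 0 contributes nothing, whatever value v_i takes.
W2 = np.stack([w, np.zeros(4)])
assert np.isclose(mlp(x, W2, np.array([0.7, 123.0])),
                  mlp(x, W2, np.array([0.7, -5.0])))
```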
This model has a structure similar to the gaussian mixture distribution. When the parameter wi of the ith neuron is 0, this neuron is useless because ϕ(0) = 0, and vi may take any value. Moreover, when vi = 0, whatever value wi takes, this term is 0. On the other hand, if wi = wj (or wi = −wj), these two neurons emit the same output (or the negated output). Then vi ϕ(wi · x) + vj ϕ(wj · x) is the same, not depending on the particular values of vi and vj, provided vi + vj (or vi − vj) takes a fixed value. Therefore, we can identify their sum (or difference), but each of vi and vj remains unidentifiable. The neuromanifold M of perceptrons is a space whose admissible coordinate system is given by θ. The critical set C is defined by the join of the two subsets,

vi wi = 0,
wi = ±w j ,
(2.25)
in which the parameters are unidentifiable. In other words, there is a direction such that the behavior (the input-output relation) is the same even when the parameters change in this direction. Since the Fisher information is 0 along this direction, the Fisher information matrix degenerates, and its inverse diverges to infinity. In statistical study, the Cramér-Rao theorem guarantees that the error of estimation from a large number of data is given by the inverse of the Fisher information matrix in the regular case. However, because the inverse of the Fisher information matrix diverges, the classical theory is not applicable here. In geometrical terms, the Riemannian metric, which is determined by the Fisher information matrix, degenerates. Therefore, the distance becomes 0 along a certain direction. This is what the singular structure gives rise to.

Remark. We have absorbed the bias term in the weighted sum as

wi · x = Σ_j wij xj + wi0 x0 = w̃i · x̃ + wi0,
(2.26)
where x0 = 1. Because x0 is constant, this causes another nonidentifiability. When

w̃i = w̃j = 0,
(2.27)
if the other parameters satisfy

vi ϕ(wi0) + vj ϕ(wj0) = const.,
(2.28)
they give the same behavior. In this article, we ignore such a case for simplicity. There are no other types of nonidentifiability (see Sussmann, 1992; Kůrková & Kainen, 1994). We regard two perceptrons as being equivalent when their behaviors are the same even if their parameters are different. Let us shrink the manifold M by reducing the equivalent points of M to one point, giving the reduced M̃. In the mathematical sense, the reduced M̃ corresponds to the residue class of M under the equivalence. In this case, degeneration of the dimensionality occurs in the critical set of the neuromanifold. We showed this in the trivial case of one hidden neuron. We now show this through another simple example. Let us consider a perceptron with one hidden unit whose weight vector is normalized, ‖w‖ = 1, w = (cos θ, sin θ), which is represented by

f(x, v, θ) = vϕ(x1 cos θ + x2 sin θ + b),
(2.29)
where b is a fixed bias term. In this case, the parameter space is two-dimensional with coordinates θ = (v, θ). Consider the space S consisting of functions of the form f(x, θ) = θ3 ϕ(θ1 x1 + θ2 x2 + b). Then equation 2.29 is embedded in S by θ1 = cos θ, θ2 = sin θ, and θ3 = v. The embedded M̃ in S,

M̃ = {θ | θ1 = cos θ, θ2 = sin θ, θ3 = v},

is a cone depicted in Figure 4, and the apex is the singular point.

2.4 Cone Model. Amari and Ozeki (2001) analyzed a toy model, the cone model, to examine the exact behavior of the estimation error and the dynamics of learning. We introduce this model here for later use. Let us consider the statistical model described by a random variable x ∈ R^{d+2}, which is subject to a gaussian distribution with mean µ and covariance matrix I, where I is the identity matrix,

p(x; µ) = (1/(√(2π))^{d+2}) exp{−(1/2)‖x − µ‖²}.
(2.30)
Figure 4: Singular structure of the neuromanifold.
The parameter space S = {µ} is a (d + 2)-dimensional Euclidean space with a coordinate system µ. When the mean parameter µ is restricted on the surface of a cone in S, the family of the gaussian distributions is called the cone model M. We first consider the case with d = 1, that is, d + 2 = 3, in which the cone is given by µ1 = ξ, µ2 = ξ cos θ, µ3 = ξ sin θ
(2.31)
(see Figure 5), where (ξ, θ) are the parameters used to specify M. In the general case, the cone is parameterized by (ξ, ω),

M : µ = (ξ/√(1 + c²)) (1, cω)ᵀ = ξ a(ω),
(2.32)
where c is a constant that specifies the shape of the cone, ω is a vector on the d-dimensional unit sphere that specifies the directions of the cone, and ξ is the distance from the origin. In the case of d = 1, the sphere is a circle (see Figure 5), and the parameter ω is replaced by θ . The model M, which
Figure 5: Cone Model.
is given by the parameters (ξ, ω), is embedded in the (d + 2)-dimensional space S, and consists of two cones of d + 1 dimensions, R × S^d, one for ξ ≥ 0 and the other for ξ ≤ 0, which are connected at the apex ξ = 0. The apex ξ = 0 is the singularity. Amari et al. analyzed the behavior of the maximum likelihood estimator in the case where the true parameters are on the singularity (Amari et al., 2001, 2002, 2003). In this case, the true distribution p0(x) is given by x ∼ N(0, I_{d+2}). The simple multilayer perceptron with one hidden unit (Amari et al., 2001, 2002, 2003) has a similar cone structure. Through a transformation of parameters (β = ‖w‖, ξ = vβ, and ω = w/β ∈ S^{d−1}), we get

y = ξ ϕβ(ω · x) + ε,
(2.33)
where we put ϕβ(u) = (1/β)ϕ(βu). This is a cone in the space of (ξ, ω). In previous articles, we analyzed the behavior of learning when the true parameter is on the singularity (ξ = 0). In this article, we analyze the behavior of learning when the true parameter is not necessarily at the singularity and show that the singularity strongly affects the dynamics.

2.5 Other Models. There are many other statistical models with singular structures. Hierarchical models include such singular structures in many cases. The space of the ARMA time-series model and that of linear rational systems are good examples (Amari, 1987; Brockett, 1976), but little is known about the effects of singularity. The estimation of points of change,
which is called the Nile River problem, is also a well-known example of singularity. Let us consider another model, the model of population coding with multiple stimuli in a neural field (Amari, 1977), on which neurons are lined up continuously along a one-dimensional positional axis z. When a stimulus from the outside is applied at a specific place corresponding to z = µ, the neurons located around z = µ are excited. Excitation of the neural field—that is, the firing rate of a neuron at z—is written in this case as r (z) = vψ(z − µ) + ε(z),
(2.34)
where v is the strength of the stimulus, µ is its center, and ε(z) is a noise term dependent on z. Function ψ is unimodal and is called the tuning function. This model has been applied in population coding where the problem is to estimate µ from the neural response r (z). Its statistical analysis has been given in terms of the Fisher information in many reports (for example, Wu, Nakahara, & Amari, 2001; Wu, Amari, & Nakahara, 2002). When two stimuli are simultaneously given at locations µ1 and µ2 with intensities v and 1 − v, the response of the field is written as r (z; v, µ1 , µ2 ) = vψ (z − µ1 ) + (1 − v)ψ (z − µ2 ) + ε(z).
(2.35)
In this case, the same singular structure as that of the gaussian mixture appears in the parameter space θ = (v, µ1, µ2). The strange behavior of the maximum likelihood estimator is analyzed in Amari and Burnashev (2003) and Amari and Nakahara (2005).

3 Locally Conic Structure and Gaussian Random Field

The local structure of singular statistical models was studied by Dacunha-Castelle and Gassiat (1997) in a unified manner. The local structure of a regular statistical model is represented by the tangent space of the manifold of the statistical model, where the first-order asymptotic theory is well formulated. The concepts of affine connections and the related e- and m-curvatures are necessary for the higher-order asymptotic theory, as is shown by information geometry (Amari & Nagaoka, 2000). A singular statistical model does not have a tangent space at a singularity; instead, the tangent cone is useful for analyzing its local structure. We summarize the results of Dacunha-Castelle and Gassiat (1997) in this section intuitively, without mathematical rigor. The locally conic structure and the related gaussian random field play a fundamental role in analyzing the behaviors of the likelihood ratio statistics (Hartigan, 1985; Fukumizu, 2003) and also of the MLE and its generalization ability (Amari
& Ozeki, 2001). Another general framework for singular models comes from algebraic geometry, which we do not summarize here (see Watanabe, 2000a, 2000b, 2000c, and related papers).

3.1 Locally Conic Structure. All the examples in section 2 have a locally conic structure. For a singular statistical model M = {p(x, θ), θ ∈ R^k} in which the critical set C exists, Dacunha-Castelle and Gassiat (1997) introduced the following parameters in the neighborhood of C. Let us denote by p0(x) a probability density in C where the identifiability is lost. Given p(x, θ), let ξ be the Hellinger distance between p(x, θ) and C,

ξ = inf_{p0(x)∈C} ∫ (√p0(x) − √p(x, θ))² dx.
(3.1)
It is further assumed that p(x, θ) can be parameterized in a neighborhood of p0(x) ∈ C by (ξ, ω), where ω ∈ Ω is (k − 1)-dimensional. We thus have the new parameterization p(x, ξ, ω), where

lim_{ξ→0} p(x, ξ, ω) = p(x, 0, ω) = p0(x).
(3.2)
The critical set C is given by ξ = 0, where p(x, 0, ω) represents p0(x) ∈ C so that ω is not identifiable. The score function with respect to ξ is the directional derivative of the log likelihood and is denoted by

v(x, ω) = (d/dξ) log p(x, ξ, ω)
(3.3)
at ξ = 0. It depends on ω ∈ Ω. Since ξ is the Hellinger distance, we have

E[{v(x, ω)}²] = 1,
(3.4)
that is, the Fisher information in the ξ -direction is normalized to 1 at any ω. Now consider l(x, ξ, ω) = log p(x, ξ, ω)
(3.5)
in the function space of the random variable x. The model M is embedded in it, where the points in C are reduced to equivalent points, so that the dimension reduction takes place. Its image is the reduced set M̃. Let us fix a point p0(x) in C. In its neighborhood, the Taylor expansion gives

l(x, ξ, ω) = log p0(x) + ξv(x, ω),
(3.6)
so that when ξ is very small, the image of M forms a cone in the space of functions of x, whose apex is log p0(x) and whose edges are spanned by v(x, ω), ω ∈ Ω. This is called the tangent cone and is different from the tangent space of a regular statistical model.

3.2 MLE and Gaussian Random Field. Given n examples D = {x1, ···, xn} from a singular model M, the log likelihood is written as

L(D, ξ, ω) = Σ_{i=1}^n log p(xi, ξ, ω).

(3.7)
The MLE is the maximizer of L, but it is difficult to calculate because the derivatives of L with respect to ω are 0 at ξ = 0 in some directions,

∂L(D, 0, ω)/∂ω = ∂²L(D, 0, ω)/∂ω∂ω = ··· = 0.
(3.8)
We now fix ω and calculate the maximizer ξ̂(D, ω) of L. By expansion with respect to ξ, we have

L(D, ξ, ω) = L(D, 0, ω) + (∂L/∂ξ)ξ + (1/2)(∂²L/∂ξ²)ξ² + ···.
(3.9)
Since ξ̂ is small when the true distribution is p0(x), that is, ξ = 0, the maximizer is given from

−(∂²L/∂ξ²) ξ̂ = ∂L/∂ξ.
(3.10)
The second derivative term,

−(1/n) ∂²L/∂ξ² = −(1/n) Σ_{i=1}^n ∂² log p(xi, ξ, ω)/∂ξ²,

(3.11)

converges to the Fisher information in the direction ω, because of the law of large numbers. The first derivative term is

(1/√n) ∂L/∂ξ = (1/√n) Σ_{i=1}^n ∂l(xi, 0, ω)/∂ξ = (1/√n) Σ_{i=1}^n v(xi, ω) = Yn(ω),
(3.12)
which converges in law to the gaussian random variable Y(ω) because of the central limit theorem. For ω ≠ ω′, Y(ω) and Y(ω′) are correlated in general.
A family of gaussian random variables {Y(ω), ω ∈ Ω} forms a gaussian random field over Ω. The maximizer ξ̂ is given by

ξ̂(D, ω) = (1/√n) Yn(ω).
(3.13)
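The construction of Yn(ω) and of ξ̂ in equation 3.13 can be illustrated by simulation. The score family below is a toy choice of ours (not one from the paper), normalized so that E[v²] = 1 as in equation 3.4; the supremum of Yn(ω)² over ω, which reappears in sections 3.3 and 3.4, is visibly larger on average than a single χ²₁ draw:

```python
import numpy as np

# Toy locally conic family (our own illustrative choice, not from the paper):
# score v(x, omega) = cos(omega) H2(x)/sqrt(2) + sin(omega) H3(x)/sqrt(6),
# normalized so that E[v(x, omega)^2] = 1 under the true density N(0, 1).
def Yn(x, omega):
    """Y_n(omega) = (1/sqrt(n)) * sum_i v(x_i, omega), as in equation 3.12."""
    H2, H3 = x**2 - 1, x**3 - 3 * x
    v = np.cos(omega) * H2 / np.sqrt(2) + np.sin(omega) * H3 / np.sqrt(6)
    return v.sum() / np.sqrt(len(x))

rng = np.random.default_rng(0)
omegas = np.linspace(0, np.pi, 60)
sups = []
for _ in range(300):                     # the true distribution sits at the apex
    x = rng.normal(size=500)
    Y = np.array([Yn(x, om) for om in omegas])
    sups.append(np.max(Y**2))            # sup over omega of Yn(omega)^2

# For a fixed omega, Yn(omega)^2 is roughly chi^2_1 (mean 1); taking the sup
# over the correlated field inflates it (here toward chi^2_2, mean 2).
print(np.mean(sups))
```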
3.3 MLE. Let us substitute ξ̂ in equation 3.7, obtaining the partially maximized likelihood,

L̂(D, ω) = L(D, ξ̂(D, ω), ω) = Σ_i log p0(xi) + (1/2)Yn(ω)².

Hence, the MLE is given by the maximizer of the random field,

ω̂ = argmax_ω Yn(ω)².
(3.14)
It is difficult to calculate this and study the properties of the MLE in general.

3.4 Likelihood Ratio Statistics. The log likelihood ratio statistic,

λ = 2 Σ_i log [p(xi, θ̂)/p(xi, θ0)],
(3.15)
is used for testing the null hypothesis H0 : θ = θ0 against the alternative H1 : θ ≠ θ0, where θ̂ is the MLE. The statistic λ is asymptotically subject to the χ² distribution with k degrees of freedom in a regular model, and hence

E[λ] = k
(3.16)
asymptotically. However, this does not hold in a singular model. The log likelihood ratio statistic λ in a singular model is

λ = 2 sup_{ξ,ω} Σ_i log [p(xi, ξ, ω)/p0(xi)]  (3.17)
  = 2 sup_ω {L(D, ξ̂, ω) − L(D, 0, ω)}  (3.18)
  = sup_ω Yn(ω)².  (3.19)
Hence, it is given by the supremum of the gaussian random field. Hartigan (1985) suggested that λ ∼ log log n in the gaussian mixture model by extracting m = log n almost independent Y(ω1 ), · · · , Y(ωm ).
Fukumizu (2003) followed this idea to evaluate λ in the case of multilayer perceptrons and derived λ ∼ log n.

4 Singularity Causes Strange Behaviors in Estimation and Learning

In the framework of singular statistical models, we give a glimpse of the strange behaviors of estimation, testing, model selection, and online learning. We study three cases. In the first case, the true distribution, or the distribution that best approximates the true distribution, is exactly at the singularity. In this case, the parameters are not identifiable and the model is redundant (a smaller model suffices), but we can estimate its behavior (or the equivalence class) consistently. The gaussian random field plays a key role. In the second case, the true distribution is near the singularity. In the last case, the true distribution is at a regular point, and the classical theory can be applied locally. However, when studying the dynamics of learning, we need to take the influence of the singularity into account. The trajectories of learning cover the entire space, so it is a global problem in the entire space. The log likelihood ratio test and MLE are known to be asymptotically optimal in the regular model. The likelihood principle is the belief that statistical inference should be based on likelihood. In singular models, however, this is not always true, and their optimality is not guaranteed (Amari & Burnashev, 2003). The behavior of Bayesian estimation and estimation with a regularization term also shows a different aspect from the regular case. There are many interesting problems to be studied, such as learning and its dynamics.

4.1 Statistical Testing in the Neighborhood of the Critical Set. Statistical testing is a general method to judge from data whether the true distribution lies at the singularity. In the case of gaussian mixtures, we judge whether k = 1 or k = 2 through a statistical test.
We take equation 2.12 as the null hypothesis and perform testing against the alternative that this equation does not hold. In a general regular case, the log likelihood ratio statistic λ obeys the χ² distribution with degrees of freedom equal to the number of parameters when the number of data is large enough. However, when the model is singular, the log likelihood ratio statistic may not be subject to the χ² distribution and may diverge to infinity as the number n of observations grows. This was shown in the classical works of Weyl (1939) and Hotelling (1939). Only recently was a precise asymptotic evaluation of the log likelihood ratio statistic given for some singular models (Fukumizu, 2003; Hartigan, 1985; Liu & Shao, 2003). It is unfortunate that such tangled problems have usually been excluded as pathological cases and have not been well studied. Such cases are not pathological; they are ubiquitous in hierarchical engineering models. Let us consider the statistical test H0 : θ = θ0 against H1 : θ ≠ θ0. When the true point θ0 is a regular point—that is, it is not in the critical set—the
MLE is asymptotically subject to a gaussian distribution with mean θ0 and variance-covariance matrix G⁻¹(θ0)/n, where G(θ) is the Fisher information matrix. In such a case, the log likelihood ratio statistic is expanded in the Taylor series, giving

λ = n (θ̂ − θ0)ᵀ G(θ0) (θ̂ − θ0).
(4.1)
Hence, λ is asymptotically subject to the χ^2 distribution with degrees of freedom equal to the number k of parameters. Its expectation is

E[λ] = k.
(4.2)
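As a quick numerical illustration of equations 4.1 and 4.2 in the regular case, the sketch below uses a hypothetical k-dimensional gaussian-mean model N(θ, I_k) with θ_0 = 0 (so G = I and the MLE is the sample mean) and estimates E[λ] by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 200, 3, 2000

# Regular model N(theta, I_k) with true theta_0 = 0. Here G(theta) = I,
# so equation 4.1 reduces to lambda = n * ||theta_hat - theta_0||^2.
lams = []
for _ in range(trials):
    x = rng.standard_normal((n, k))             # n samples from N(0, I_k)
    theta_hat = x.mean(axis=0)                  # MLE of the mean
    lams.append(n * float(np.sum(theta_hat ** 2)))
lams = np.array(lams)

print(lams.mean())   # close to k = 3, as equation 4.2 predicts
```

The empirical mean of λ stays near k, consistent with the χ^2 limit; at a singular point this constant is replaced by a diverging factor c(n), as discussed next.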
However, when the true distribution θ_0 lies on the critical set, the situation changes. The Fisher information matrix degenerates and G^{-1} diverges, so the expansion is no longer valid. The expectation of the log likelihood ratio statistic is then asymptotically written as

E[λ] = c(n) k,
(4.3)
where the term c(n) takes various forms depending on the nature of the singularities. By evaluating the properties of the gaussian random field Y(ω), Fukumizu (2003) showed that

c(n) = log n
(4.4)
in the case of multilayer perceptrons under a certain condition. In the case of the gaussian mixture,

c(n) = log log n
(4.5)
holds (Hartigan, 1985; Liu & Shao, 2003).

4.2 Estimation, Training Error, and Generalization Error. When the true parameter is at the singularity (or close to it), the MLE is no longer subject to the gaussian distribution, even asymptotically. This causes strange behaviors of the training and generalization errors: the standard theory (Amari & Murata, 1993; Murata, Yoshizawa, & Amari, 1994) does not hold. This is discussed in more detail in a later section.

4.3 Bayesian Estimator. The Bayesian estimator is used in many cases where an adequate prior distribution π(θ) is assumed. When the prior distribution penalizes complex models, it plays a role equivalent to a regularization term. When a set of independent and identically distributed (i.i.d.) data
Singularities Affect Dynamics of Learning in Neuromanifolds
1029
D = {x_1, ..., x_n} generated by p(x; θ_0) is given, the posterior distribution of the parameters is written as

p(θ|D) = π(θ) p(D|θ) / p(D),
(4.6)
where

p(D) = ∫ p(D|θ) π(θ) dθ
(4.7)
is the marginal distribution of the data D. The maximum a posteriori (MAP) estimator is the parameter θ̂ that maximizes the posterior distribution. The Bayesian predictive distribution is used as the distribution of a new observation x given D; it is obtained by averaging p(x|θ) over the posterior distribution p(θ|D),

p(x|D) = ∫ p(x|θ) p(θ|D) dθ.
(4.8)
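For intuition, equations 4.6 to 4.8 can be evaluated numerically for a toy one-dimensional gaussian model. The sketch below assumes a standard gaussian prior and a unit-variance gaussian likelihood (both hypothetical choices made only for this illustration) and checks the grid-based predictive density against the conjugate closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
D = rng.normal(0.5, 1.0, size=n)            # i.i.d. data from p(x; theta_0), theta_0 = 0.5

theta = np.linspace(-5.0, 5.0, 4001)        # parameter grid
dth = theta[1] - theta[0]
log_lik = -0.5 * ((D[:, None] - theta[None, :]) ** 2).sum(axis=0)
post = np.exp(-0.5 * theta**2) * np.exp(log_lik - log_lik.max())
post /= post.sum() * dth                    # posterior p(theta | D), equation 4.6

x_new = 1.0                                 # predictive density at a new point, equation 4.8
lik_new = np.exp(-0.5 * (x_new - theta) ** 2) / np.sqrt(2 * np.pi)
p_pred = (lik_new * post).sum() * dth

# Conjugate closed form: the predictive is N(n*mean(D)/(n+1), 1 + 1/(n+1)).
mu_n, var = n * D.mean() / (n + 1), 1 + 1 / (n + 1)
p_exact = np.exp(-0.5 * (x_new - mu_n) ** 2 / var) / np.sqrt(2 * np.pi * var)
print(p_pred, p_exact)
```

The grid-based predictive density agrees with the closed form; the same recipe works for any prior, which is what matters in the singular discussion below.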
It is empirically known that the Bayesian predictive distribution or the MAP estimator behaves well for large-scale neural networks. In such a case, one uses a smooth prior π(θ) > 0 on the neuromanifold. Obviously, if π(θ_0) = ∞ at a specific point, the MAP estimator is attracted to that point, which biases the inference. When π(θ) > 0 is smooth, its influence decreases as n approaches ∞, and the MAP estimator approaches the MLE, which can be regarded as the MAP estimator under a uniform prior. However, a smooth prior on M is singular when viewed on the space M̃ of equivalence classes of the neuromanifold, because a singular point of M̃ corresponds to infinitely many equivalent parameters of M. Hence, the prior density is infinitely large at the singular points compared with regular points. This implies that a smooth Bayesian prior favors singular points (perceptrons with a smaller number of hidden units) by an infinitely large factor. The Bayesian method therefore works well in such a case to avoid overfitting: one may use a very large perceptron with a smooth Bayesian prior, and an adequate smaller model will be selected, although no theory exists that explains how to choose the prior. Bayesian estimation of singular models was studied by Watanabe (2001a, 2001b) and Yamazaki and Watanabe (2002, 2003) using algebraic geometry, in particular Hironaka's theory of singularity resolution and Sato's formula in the theory of algebraic analysis.

4.4 Model Selection. To obtain an adequate model, one should select a model from many alternatives based on the data. In the case of hierarchical models, one should determine the preferred model size, that is, the number
of hidden units. This is the problem of model selection. AIC, BIC, and MDL have been widely used as model selection criteria. NIC is a version of AIC applicable to the general cost function of a neural network (Murata et al., 1994). AIC (Akaike, 1974) is a criterion that aims to minimize the generalization error. The model that minimizes

AIC = 2 × training error + 2k/n
(4.9)

is selected according to this criterion. This is derived from asymptotic statistical analysis, where the MLE θ̂ is assumed to be asymptotically subject to the gaussian distribution with covariance matrix equal to the inverse of the Fisher information matrix divided by n. MDL (Rissanen, 1986) is a criterion that minimizes the description length of the observed data under a family of parametric models. It is given asymptotically by the minimizer of

MDL = training error + (log n / 2n) k.
(4.10)
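The two criteria differ only in how the penalty grows with n. A small sketch (the per-sample training errors of four nested models below are made up for illustration) shows that MDL's log n penalty can prefer a smaller model than AIC:

```python
import math

def aic(train_error, k, n):
    # equation 4.9: 2 * training error + 2k/n
    return 2 * train_error + 2 * k / n

def mdl(train_error, k, n):
    # equation 4.10: training error + (log n / 2n) * k
    return train_error + math.log(n) * k / (2 * n)

# Hypothetical training errors for nested models with k = 1..4 parameters.
n = 100
errors = [0.90, 0.62, 0.60, 0.595]
best_aic = min(range(1, 5), key=lambda k: aic(errors[k - 1], k, n))
best_mdl = min(range(1, 5), key=lambda k: mdl(errors[k - 1], k, n))
print(best_aic, best_mdl)
```

With these numbers AIC picks k = 3 while MDL picks k = 2: MDL charges log n / 2 ≈ 2.3 per parameter here against AIC's 2, so marginal improvements in fit are penalized more heavily.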
The Bayesian BIC (Schwarz, 1978) gives the same criterion as MDL. These criteria are also derived under the same assumption of gaussianity of the MLE. In the case of multilayer perceptrons, the neuromanifold with a smaller number of hidden units is included in that with a larger number, but the smaller one forms a critical set within the larger neuromanifold. Therefore, when the true distribution belongs to a smaller model, the MLE (or any other efficient estimator) is no longer subject to the gaussian distribution, even asymptotically. Model selection is needed precisely when the estimator is close to the critical set, but there the validity of AIC and MDL fails to hold. Akaho and Kappen (2000) noted this in the gaussian mixture model. One should evaluate the log likelihood ratio statistic more carefully in such a case (Amari, 2003). The situation is the same in other hierarchical models with singularities. Many comparisons of AIC and MDL by computer simulation have been reported: sometimes AIC works better, while MDL does better in other cases. Such confusing reports seem to result from the difference between regular and singular models and from differences in the nature of the singularities.

4.5 Dynamics of Learning. Let us consider online learning of multilayer perceptrons through the gradient descent method. Let us define the error by the square of the difference between the network output and the teacher's signal. When the noise term is gaussian, the square of the error is
equal to the negative of the log likelihood. The minimization of the error is then equivalent to the maximization of the likelihood, and the result of learning converges locally to the maximum likelihood estimator. The stochastic gradient descent method was proposed by Amari (1967) and was later named the backpropagation method (Rumelhart et al., 1986), while the natural gradient method (Amari, 1998; Amari et al., 2000; Park et al., 2000) takes the Riemannian structure of the parameter space into account. For an input-output example (x, y), the loss function, or the negative log likelihood, is given by

l(x, y; θ) = (1/2){y − f(x, θ)}^2.
(4.11)
Its expectation is given by averaging with respect to the true distribution p_0(x, y),

L(θ) = E_{p_0}[l(x, y; θ)].
(4.12)
The backpropagation and natural gradient learning algorithms (Amari, 1998) are written, respectively, as

θ_{t+1} = θ_t − η ∇l(x_t, y_t; θ_t),
(4.13)

θ_{t+1} = θ_t − η G^{-1}(θ_t) ∇l(x_t, y_t; θ_t),
(4.14)
when example (x_t, y_t) is given at time t = 1, 2, .... Here, η is a learning constant, ∇ is the gradient ∂/∂θ, and G is the Fisher information matrix. It is generally difficult to calculate G(θ), because the distribution q(x) of inputs is unknown; moreover, inverting G is costly. The adaptive natural gradient method estimates G^{-1}(θ_t) adaptively from the data (Amari et al., 2000; Park et al., 2000). It has been shown that the natural gradient method is locally equivalent to the Newton method, giving a Fisher-efficient estimator, whereas backpropagation is not Fisher efficient. The natural gradient method is capable of near-optimal performance (Rattray et al., 1998; Rattray & Saad, 1999). In the present formulation, the natural gradient is equivalent to an adaptive version of the Gauss-Newton method, but it differs from it, and is more powerful, for other cost functions (Park et al., 2000). The adaptive update of G_t^{-1} = G^{-1}(θ_t) is calculated online by

G_{t+1}^{-1} = (1 + τ)G_t^{-1} − τ [G_t^{-1} ∇l(x_t, y_t, θ̂_t)][G_t^{-1} ∇l(x_t, y_t, θ̂_t)]^T.
(4.15)
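The update of equation 4.15 can be exercised in isolation. In the sketch below, synthetic gradient vectors are drawn with a fixed covariance G_true that plays the role of the Fisher information E[∇l ∇l^T] (a hypothetical setup; the fixed point of the update is then the inverse of G_true):

```python
import numpy as np

def g_inv_update(G_inv, g, tau):
    # equation 4.15: adaptive online update of the inverse Fisher estimate
    v = G_inv @ g
    return (1 + tau) * G_inv - tau * np.outer(v, v)

rng = np.random.default_rng(3)
G_true = np.diag([2.0, 0.5])            # plays the role of E[grad l grad l^T]
L = np.linalg.cholesky(G_true)
G_inv, tau = np.eye(2), 0.002
avg, count = np.zeros((2, 2)), 0
for t in range(40000):
    g = L @ rng.standard_normal(2)      # synthetic gradient with covariance G_true
    G_inv = g_inv_update(G_inv, g, tau)
    if t >= 20000:                      # average after the transient
        avg += G_inv
        count += 1
avg /= count
print(avg)                              # approaches inv(G_true) = diag(0.5, 2)
```

The time-averaged estimate settles near the true inverse; the fluctuation scale grows with τ, which is the stability side of the trade-off on τ discussed next.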
The learning constant τ should not be too large, in order to guarantee the stability of the estimate of G^{-1}, but it should not be too small either, so that the estimate G_t = G(θ̂_t) can track the change of θ̂_t. Inoue, Park, and
Okada (2003) show that the ratio η/τ should be kept within an adequate range. Since examples are generated stochastically, the learning dynamics, equations 4.13 and 4.14, are stochastic difference equations. However, when η is small, the stochastic fluctuations average out. Hence, we investigate the behavior of the averaged learning equation, in which ∇l is replaced by its expectation,

∇L(θ) = E[∇l(x, y; θ)].
(4.16)
In continuous time, these become the differential equations

dθ/dt = −η ∇L(θ),
dθ/dt = −η G^{-1}(θ) ∇L(θ).
(4.17)
The solution draws a trajectory in the neuromanifold. The problem is how the trajectory is influenced by the singular structure. Kang et al. used a three-layer perceptron with binary weights and found that the parameters are attracted to the critical set that forms a singularity and are very slow to move away from it (Kang, Oh, Kwon, & Park, 1993). Saad and Solla (1995) and Riegler and Biehl (1995) analyzed the dynamics in a more general case and showed that such a phenomenon is universal. They argued that the slowness in backpropagation learning, or the plateau phenomenon, is caused by this singularity. The natural gradient learning method takes the geometrical structure into account. It reduces the influence of the singular structure, and the trajectory is not trapped in the plateaus. Rattray et al. analyzed the dynamics of natural gradient learning by means of statistical physics and showed that it is almost ideal (Rattray & Saad, 1999; Rattray et al., 1998). To examine the dynamics of learning in more detail, let us consider perceptrons consisting of two hidden units. The parameter space is M = {θ}, θ = (w_1, w_2, v_1, v_2), and let us consider the subset Q(w, v) specified by w and v,

Q(w, v) = {w_1 = w_2 = w, v_1 + v_2 = v},
(4.18)
which is included in the critical set C. All perceptrons in Q behave identically, corresponding to a perceptron with only one hidden unit whose weight vector is w and output weight is v, so that y = vϕ(w · x) + ε. Let the true parameters be θ_0 = (w_1, w_2, v_1, v_2) with w_1 ≠ w_2, so that two different hidden units are used.
Let θ̄ = (w̄, v̄) be the perceptron with only one hidden unit that best approximates the input-output function f(x, θ_0) of the true perceptron. Then all the perceptrons with two hidden units on the line Q(w̄, v̄),

w_1 = w_2 = w̄,   v_1 + v_2 = v̄,
(4.19)
correspond to the best approximation. Let us transform the two weights as

w = (1/2)(w_1 + w_2),   u = (1/2)(w_1 − w_2).
(4.20)
The derivative of L(θ) along the line Q is then 0, because all the perceptrons are equivalent along the line. The derivatives in the directions of changing w̄ and v̄ are also zero, because they are the best approximators. The derivative in the direction of u is again 0, because the perceptron having u is equivalent to that having −u, which is derived by interchanging the two hidden units. Hence, the line Q forms critical points of the cost function. This implies that it is very difficult to escape once the parameters are attracted to Q(w̄, v̄). Fukumizu and Amari (2000) calculated the Hessian of L on the line. When it is positive definite, the line is attractive. When it includes negative eigenvalues, the state eventually escapes in those directions. They showed that in some cases part of the line can be truly attractive, even though it is not a usual asymptotically stable equilibrium, while other parts have directions of escape (even though the derivative is 0 there). This is not a usual saddle point; it belongs to the special type called a Milnor attractor. In such a case, the perceptron is truly attracted to the line and stays inside Q(w̄, v̄), fluctuating around it because of random noise, until it finds a place from which it can escape. This explains the plateau phenomenon. The problem of plateaus cannot be resolved by simply increasing η, because even when the state leaves Q because of a large η, it may return to it again. To show why the natural gradient method works well, we need to evaluate its behavior in the neighborhood of the critical points. We can then prove that the natural gradient exerts a strong repulsion that drives the state out of the neighborhood of the critical set, so the plateau phenomenon disappears. Computer simulations confirm this observation. Inoue et al. (2003) investigated the trajectory of learning by using a committee machine perceptron with two hidden units.
They observed the following behavior of the trajectory: it approaches the singularity and stays near it for a while before escaping. More precisely, they observed that w_1 and w_2 first come close to each other, both approaching w̄, which is optimal in C, and then move away in different directions. Inoue et al. (2003) also studied the effectiveness of the adaptive natural gradient method and showed the importance of controlling the two learning constants η and τ.
Figure 6: Learning trajectory near the singularities.
What is the trajectory of learning when the true parameters are on the singularity? Park, Inoue, and Okada (2003) investigated this problem by using three-layer perceptrons with two hidden units. Once the trajectory reaches C in Figure 6, all points are equivalent and suboptimal. Where does the trajectory enter C? To answer this question, we need to examine the dynamics near C. Because the component of the flow entering C is extremely slow near C, the flow component parallel to C is relatively strong. Analysis of the dynamics makes it clear that the trajectory does not stop at any point of the line where w_1 = w_2 = w and v_1 + v_2 = v is constant with v_1 ≠ 0 and v_2 ≠ 0; instead, it approaches the point where either v_1 or v_2 is zero. This is an interesting observation. What is the trajectory of learning in the natural gradient method? Since the metric degenerates on C, its inverse diverges to infinity, while ∇L = 0 on C even when the true distribution is outside C. The quantity G^{-1}∇L is thus a product of 0 and ∞. However, if we evaluate G^{-1}∇L near C, we can see that an infinitely strong repulsive force acts in the direction of escape from C (Fukumizu & Amari, 2000). That is, the force pushing out of the plateau is strong, and the trajectory moves away without being trapped in the natural gradient method.

5 Dynamics of Learning: Slow and Fast Manifolds

In this section, we show in detail the effect of singularities on the dynamics of learning for three simple models: the one-dimensional cone, the simple
MLP, and the gaussian mixture. Note that the structure of the gaussian mixture is very similar to that of the multilayer perceptron. We calculate the average trajectories of the standard and natural gradient learning methods to show how the trajectories approach the optimal point. We show that a slow manifold emerges around the critical set: the state is quickly attracted to it by the fast dynamics and then escapes slowly along it toward the optimal point. This is a universal feature of the plateau phenomenon.

5.1 Cone Model. Here, we investigate the dynamics of learning in the cone model introduced in section 2. The parameter space M is two-dimensional with coordinates (ξ, θ). The cost function is defined as the negative log likelihood

l(x; ξ, θ) = (1/2){(x_1 − ξ)^2 + (x_2 − cξ cos θ)^2 + (x_3 − cξ sin θ)^2}.
(5.1)
For the average learning dynamics under the standard gradient learning method, we can easily obtain
ξ̇(t)|_SGD = η_t { x̄_1 + c(x̄_2 cos θ + x̄_3 sin θ) − (1 + c^2)ξ },
θ̇(t)|_SGD = −η_t cξ ( x̄_2 sin θ − x̄_3 cos θ ),
(5.2)
where x̄_i = E[x_i]. Similarly, the average dynamics of natural gradient learning can be obtained as

ξ̇(t)|_NGD = (η_t / (1 + c^2)) { x̄_1 + c(x̄_2 cos θ + x̄_3 sin θ) − (1 + c^2)ξ },
θ̇(t)|_NGD = −(η_t / (cξ)) ( x̄_2 sin θ − x̄_3 cos θ ),
(5.3)
where we have calculated the Fisher information matrix as

G(ξ, θ) = [ 1 + c^2, 0 ; 0, c^2 ξ^2 ].
(5.4)
To consider the effect of the singularity (i.e., ξ = 0) on the dynamics of learning, we define two submanifolds satisfying ξ̇ = 0 and θ̇ = 0, respectively. These are

M_ξ = {(θ, ξ) : x̄_1 + c(x̄_2 cos θ + x̄_3 sin θ) − (1 + c^2)ξ = 0}
(5.5)
Figure 7: Learning trajectories in the cone model. (Left) Standard gradient (the dotted line is the slow manifold). (Right) Natural gradient.
and

M_θ = {(θ, ξ) : x̄_2 sin θ − x̄_3 cos θ = 0}.
(5.6)
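The averaged flows of equations 5.2 and 5.3 are simple enough to integrate directly. The sketch below uses c = 1 and (x̄_1, x̄_2, x̄_3) = (1, 1, 0), matching the simulation settings used in this section; the Euler step size, tolerance, and initial condition (ξ, θ) = (1.5, 5π/6) are illustrative choices:

```python
import numpy as np

c, xb = 1.0, np.array([1.0, 1.0, 0.0])   # shape parameter and (x1_bar, x2_bar, x3_bar)

def flow(xi, th, natural):
    # common factors of the averaged flows, equations 5.2 and 5.3
    d_xi = xb[0] + c * (xb[1] * np.cos(th) + xb[2] * np.sin(th)) - (1 + c**2) * xi
    d_th = -(xb[1] * np.sin(th) - xb[2] * np.cos(th))
    if natural:   # premultiplied by G^{-1} = diag(1/(1+c^2), 1/(c^2 xi^2))
        return d_xi / (1 + c**2), d_th / (c * xi)
    return d_xi, c * xi * d_th

def steps_to_optimum(natural, dt=0.01, tol=0.05, max_steps=50000):
    xi, th = 1.5, 5 * np.pi / 6              # initial condition used for Figure 8
    for t in range(max_steps):
        if np.hypot(xi - 1.0, th) < tol:      # optimum is (xi*, theta*) = (1, 0)
            return t
        d_xi, d_th = flow(xi, th, natural)
        xi, th = xi + dt * d_xi, th + dt * d_th
    return None

t_sgd = steps_to_optimum(natural=False)
t_ngd = steps_to_optimum(natural=True)
print(t_sgd, t_ngd)   # the natural gradient reaches the optimum much sooner
```

The standard gradient first collapses ξ toward the slow manifold and crawls along it (the plateau), while the natural gradient's 1/ξ factor in θ̇ pushes the state through quickly.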
The intersection of M_ξ and M_θ is the equilibrium of the dynamics. From the standard gradient learning equation, we see that ξ̇ is of order O(1), whereas θ̇ is of order O(ξ). Therefore, in the neighborhood of the singularity, where ξ is small, ξ changes much faster than θ. The state is thus attracted toward M_ξ (the dotted line in Figure 7, left) by the fast dynamics. It then moves along the line ∂l/∂ξ = 0, that is, along M_ξ, which is the slow manifold. The dynamics becomes especially slow when ξ is small (slow dynamics). This explains the plateau phenomenon in learning curves. On the other hand, with the natural gradient learning equation 5.3, ξ̇ is of order 1 and θ̇ is of order ξ^{-1}, so no slow manifold appears. Moreover, the update term around the singularity is large, so a strong repulsive force acts away from the singularity. This explains why the plateau disappears in the natural gradient. In the computer simulations, we set c = 1. For the true parameters, we took ξ* = 1 and θ* = 0, so that (x̄_1, x̄_2, x̄_3) = (1, 1, 0). For the standard gradient and the natural gradient, we traced a number of trajectories with different initial values of ξ and θ. The trajectories in the parameter space are shown in Figure 7 using polar coordinates. Note that the center of the polar coordinates, (0, 0), is the singular point corresponding to the apex of the cone. In Figure 7 (left), for the standard gradient, we can see that the trajectories were attracted to the singular point or the slow manifold and then finally
Figure 8: Time evolution of expected loss in the cone model. (Solid line: standard gradient, dashed line: natural gradient).
were attracted to the optimal point. In contrast, for the natural gradient, Figure 7 (right), such attraction and retardation did not appear. Figure 8 shows the time evolution of the expected loss for the initial condition (ξ, θ) = (1.5, 5π/6). We can see a clear plateau in the standard gradient learning curve, whereas there is no plateau in the natural gradient learning curve.

5.2 Simple MLP. We also investigated the dynamics of learning of the simple MLP defined by equation 2.29, which is also discussed in section 2. The loss function, which is the squared error or the negative log likelihood, is given by

l(y, x; ξ, θ) = (1/2){y − ξϕ(x_1 cos θ + x_2 sin θ + b)}^2.
(5.7)
The mathematical analysis is similar, and the fast and slow manifolds are obtained from θ˙ = 0 and ξ˙ = 0, respectively. For our computer simulations, we set b = 0.5. For the true parameters, we took ξ ∗ = 1 and θ ∗ = 0. As for the cone model, we traced a number of
Figure 9: Learning trajectories in the simple MLP model. (Left) Standard gradient. (Right) Natural gradient.
trajectories with different initial values of ξ and θ. The trajectories in the parameter space are shown in Figure 9 using polar coordinates. Note that for this MLP model, the center of the polar coordinates is the singular point, which corresponds to the shrink point of Figure 4. In Figure 9 (left), for the standard gradient, we can see phenomena similar to those for the cone model: the trajectory was attracted to the plateau near the singular point and then slowly reached the optimal point. For the natural gradient, Figure 9 (right), we did not see that kind of attraction and retardation. Figure 10 shows the time evolution of the expected loss for the initial condition (ξ, θ) = (1.5, 5π/6). We can see a clear plateau in the curve of standard gradient learning, but none in that of natural gradient learning.

5.3 Gaussian Mixture. Next, we consider a more realistic model, the gaussian mixture; this analysis is original to this article. To investigate the dynamics of learning of the gaussian mixture model, we begin with the Taylor expansion, equation 2.16, keeping in mind that v(1 − v) > c. In this model, the singularity is at u = 0, so a Taylor expansion for small u is useful. The cost function, the negative of the log likelihood, is further expanded as

l(x; u, v) = −log ψ(x) − { (1/2)c_2(v)H_2(x)u^2 + (1/6)c_3(v)H_3(x)u^3 + (1/24)c_4(v)H_4(x)u^4 − (1/8)c_2(v)^2 H_2(x)^2 u^4 } + O(u^5).
(5.8)
Figure 10: Time evolutions of expected loss in simple MLP. (Solid line: standard gradient, dashed line: natural gradient).
Let (u*, v*) be the true parameters from which the learning data are generated. We can calculate the average learning dynamics around small u.

Lemma. When u is small, the gradient of l, evaluated at the true parameter (u*, v*), is given by

E*[∂_u l] = −c_2(v){u*^2 c_2(v*) − u^2 c_2(v)}u − (1/2)c_3(v){u*^3 c_3(v*) − u^3 c_3(v)}u^2 + O(u^3),
(5.9)

E*[∂_v l] = −(1/2)c_2'(v){u*^2 c_2(v*) − u^2 c_2(v)}u^2 − (1/6)c_3'(v){u*^3 c_3(v*) − u^3 c_3(v)}u^3 + O(u^4),
(5.10)

where E* denotes the expectation with respect to p(x; u*, v*) and c_i'(v) = dc_i(v)/dv. The proof is given in the appendix.
The averaged equation for the standard gradient is given by
( u̇(t), v̇(t) )|_SGD = −η ( E*[∂_u l], E*[∂_v l] ).
(5.11)
We put f_1(u, v) = u*^2 c_2(v*) − u^2 c_2(v) and f_2(u, v) = u*^3 c_3(v*) − u^3 c_3(v). The averaged learning equations are then

u̇(t)|_SGD = η { c_2(v) f_1(u, v) u + (1/2)c_3(v) f_2(u, v) u^2 } + O(u^3),
v̇(t)|_SGD = η { (1/2)c_2'(v) f_1(u, v) u^2 + (1/6)c_3'(v) f_2(u, v) u^3 } + O(u^4).
(5.12)
We now consider the trajectories of learning in two cases: u* = 0 (singularity) and u* ≠ 0 (regular).

Case I: u* = 0. By putting u* = 0 in equation 5.12 and ignoring higher-order terms, we have

u̇ = −η c_2(v)^2 u^3,
v̇ = −η (c_2(v)c_2'(v)/2) u^4.
(5.13)
From this, the trajectory of the dynamics satisfies

dv/du = v̇/u̇ = { (1 − 2v) / (2v(1 − v)) } u.
(5.14)
The equation can be integrated to give

u^2 = (1/4)(1 − 2v)^2 − (1/2) log(1 − 2v) + c,
(5.15)
where c is a constant. When u is small, v̇ is of O(u^4) and is much smaller than u̇, which is of order u^3. Hence dv/du ≈ 0, and the trajectories are almost parallel to the u-axis, as shown in Figure 11. In other words, u converges to 0 without significantly changing v.
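Equations 5.14 and 5.15 can be cross-checked numerically by integrating the trajectory equation and monitoring the conserved quantity (the starting point and step size below are arbitrary; the trajectory must keep v < 1/2 so that log(1 − 2v) is defined):

```python
import math

# First integral of equation 5.15 (rearranged so it should be constant
# along any solution of the trajectory equation 5.14).
def invariant(u, v):
    return u * u - (1 - 2 * v) ** 2 / 4 + math.log(1 - 2 * v) / 2

u, v = 0.5, 0.3
I0 = invariant(u, v)
du = -1e-4
while u > 0.1:
    v += du * (1 - 2 * v) * u / (2 * v * (1 - v))   # Euler step of equation 5.14
    u += du
print(abs(invariant(u, v) - I0))   # stays near 0: the quantity is conserved
```

The invariant drifts only by the Euler discretization error, confirming that equation 5.15 is a first integral of equation 5.14.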
Figure 11: Trajectories of learning in the gaussian mixture model.
Case II: u* ≠ 0. We again evaluate the dynamics for small u. The equations are

u̇ = η c_2(v) c_2(v*) u*^2 u,
(5.16)

v̇ = (η/2) c_2'(v) c_2(v*) u*^2 u^2,
(5.17)

and

dv/du = (u/2) c_2'(v)/c_2(v),
(5.18)

irrespective of (u*, v*). Incidentally, this is the same equation as 5.14; hence the trajectories are the same (see Figure 11), but the directions are opposite: in this case, the state escapes from u = 0 toward u*, that is, the directions in Figure 11 are reversed. The equilibrium is given by the intersection of the two manifolds,

M_F : f_2(u, v) = 0,   M_S : f_1(u, v) = 0.
Figure 12: Learning trajectories in the simple gaussian mixture model. (Left) Standard gradient. (Right) Natural gradient.
When u is small, the first term of f_1(u, v) dominates, and the state is quickly attracted to M_S. It then moves slowly within M_S to the intersection of M_S and M_F. Computer simulations confirm this observation (see Figure 12, left). We next studied the natural gradient method. Using the approximation of equation 5.8, we can also obtain an explicit form of the Fisher information matrix. This is given by

G(u, v) = [ 2c_2(v)^2 u^2 + (3/2)c_3(v)^2 u^4 ,  (1/2)c_3(v)u^3 + (1/2)c_3(v)(2c_2(v) + 1)u^5 ;
            (1/2)c_3(v)u^3 + (1/2)c_3(v)(2c_2(v) + 1)u^5 ,  (1/2)(c_2'(v))^2 u^4 + (1/6 − c_2(v))u^6 ].
(5.19)

For the natural gradient method, the dynamics of learning is
u̇(t)|_NGD = (η/u^3) { c̃_1 f_2(u, v) u + c̃_2 f_1(u, v) u^2 },
v̇(t)|_NGD = (η/u^3) { c̃_3 f_2(u, v) + c̃_4 f_1(u, v) u },
(5.20)
where the c̃_i are functions of v. The equilibrium is again the intersection of M_S and M_F, but the roles of M_S and M_F are reversed. The repulsion is strong when u is small, and no plateau appears. For our computer simulations, we set the true parameters to u* = 0.75 and v* = 0.7. Since the analytic expressions for the average dynamics given in equations 5.12 and 5.20 are approximations around small u, we cannot apply them to the whole trajectory. Therefore, we use the Monte Carlo
Figure 13: Time evolutions of the expected loss in the simple gaussian mixture model. Solid line: standard gradient, dashed line: natural gradient.
method to obtain the expectations of ∂l/∂u and ∂l/∂v. At each learning step, we generate 10^6 samples according to the true distribution and take the sample means. We traced a number of trajectories with different initial values of u and v. The trajectories in the parameter space are shown in Figure 12. In Figure 12 (left), we can see the line (the slow manifold) to which the parameters are first attracted: the state proceeds toward the line satisfying u*^2 c_2(v*) = u^2 c_2(v), shown as a dashed line in Figure 12 (left). For the natural gradient, Figure 12 (right), the update terms of u̇ and v̇ contain terms of order O(u^{-2}) and O(u^{-3}), respectively, which lead to much faster dynamics around the singularity. Figure 13 shows the time evolution of the expected loss for the initial condition (u, v) = (0.9, 0.6). We can see the slow convergence in the standard gradient learning curve, whereas this sort of retardation is not apparent in natural gradient learning.
6 Generalization and Training Errors When the True Distribution Is at a Singular Point

It is important for model selection to evaluate the generalization error relative to the training error. Since AIC and MDL are derived from the asymptotic gaussianity of the estimator, they cannot be applied to the singular case. In particular, for hierarchical models such as multilayer perceptrons and gaussian mixtures, smaller models are embedded in the larger models as critical sets. Therefore, we need to find new model selection criteria for the singular case. In the regular case, the MLE and the Bayes predictive estimator give asymptotically the same estimation performance; this is not guaranteed in the singular case. As a preliminary study, we analyzed the asymptotic behavior of the MLE and the Bayes predictive distribution using simple toy models when the true distribution lies at a singular point, that is, in a smaller model. The behavior of an estimator is evaluated by the relation between the expected generalization error and the expected training error. Let D = {x_1, ..., x_n} be the observed data, or D = {(x_1, y_1), ..., (x_n, y_n)} in the case of perceptrons. When we have an estimated probability density function p̂(x; D), the generalization error can be defined by the Kullback-Leibler divergence from the true probability density p_o to the estimated density,

KL[p_o(x) : p̂(x; D)] = E_{p_o}[ log { p_o(x) / p̂(x; D) } ].
(6.1)
In the case of perceptrons, the estimated density function is the conditional probability density p̂(y|x; D), and a similar formulation follows. For the evaluation, we take the expectation of the generalization error with respect to the data and call it the expected generalization error,

E_gen = E_D E_{p_o}[ log { p_o(x) / p̂(x; D) } ],
(6.2)
where E_D denotes the expectation with respect to the observed data D. Similarly, the training error of the estimated density function p̂(x; D) is defined by the sample average

(1/n) Σ_{i=1}^n log { p_o(x_i) / p̂(x_i; D) },
(6.3)
which is the expectation of log(p_o/p̂) with respect to the empirical distribution of the data D,

p_emp(x) = (1/n) Σ_{i=1}^n δ(x − x_i).
(6.4)
The expected training error is defined as

E_train = E_D[ (1/n) Σ_{i=1}^n log { p_o(x_i) / p̂(x_i; D) } ].
(6.5)
For the MLE, the estimated density function is p(x; θ̂), where θ̂ is the MLE. One can also relate the expected training error to the log likelihood ratio statistic λ of equation 3.17:

E_train = −(1/2n) E_D[λ].
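For contrast with the singular case studied in this section, the regular-case relation E_gen = −E_train ≈ ±k/2n can be reproduced by Monte Carlo. The sketch below uses a hypothetical k-dimensional gaussian-mean model, for which the KL divergence and the training error of equations 6.2 and 6.5 have closed forms:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, trials = 100, 4, 4000
gen, train = [], []
for _ in range(trials):
    x = rng.standard_normal((n, k))     # data from the true model N(theta_0 = 0, I_k)
    th = x.mean(axis=0)                 # MLE
    gen.append(0.5 * np.sum(th ** 2))   # eq. 6.2: KL(N(0,I) : N(th,I)) = ||th||^2 / 2
    # eq. 6.5 summand: log p_o(x_i) - log p_hat(x_i) for unit-covariance gaussians
    train.append(np.mean(0.5 * ((x - th) ** 2).sum(axis=1) - 0.5 * (x ** 2).sum(axis=1)))
e_gen = 2 * n * np.mean(gen)            # ~ +k in the regular case
e_train = 2 * n * np.mean(train)        # ~ -k: the antisymmetry E_gen = -E_train
print(e_gen, e_train)
```

Both scaled errors land near ±k, the regular-case baseline against which the singular cone and MLP results below should be compared.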
For Bayesian estimation, the estimated density function is the Bayes predictive distribution,

p̂_Bayes(x|D) = ∫ p(x|θ) p(θ|D) dθ.
(6.6)
6.1 Maximum Likelihood Estimator

6.1.1 Cone Model. For the cone model defined in equation 2.30, the log likelihood of the data D = {x_i}, i = 1, ..., n, is written as

L(D; ξ, ω) = −(1/2) Σ_{i=1}^n ||x_i − ξ a(ω)||^2.
(6.7)
The MLE is the maximizer of L(D; ξ, ω). However, ∂^k L/∂ω^k = 0 at ξ = 0 for any k, so we cannot analyze the behavior of the MLE by a Taylor expansion at ξ = 0. Therefore, we first fix ω and search for the ξ that maximizes L. The maximizer ξ̂ is given by

ξ̂(ω) = argmax_ξ L(D; ξ, ω) = (1/√n) Y_n(ω),
(6.8)

where

Y_n(ω) = a(ω) · x̃,   x̃ = (1/√n) Σ_{i=1}^n x_i.
(6.9)
Its limit Y(ω) is a zero-mean gaussian random field with covariance a(ω) · a(ω'), because x̃ ∼ N(0, I) when the true distribution is at ξ = 0. The MLE ω̂ is the maximizer of Y_n^2(ω),

ω̂ = argmax_ω Y_n^2(ω).
(6.10)
Using the MLE, we obtain the expected generalization and training errors in the following theorem.

Theorem 1. For the cone model, when the true distribution is at ξ = 0, the MLE satisfies

E_gen = E_D E_{p_o}[ log { p_o(x) / p(x|ξ̂, ω̂) } ] = (1/2n) E_D[ max_ω Y^2(ω) ],
(6.11)

E_train = E_D[ (1/n) Σ_{i=1}^n log { p_o(x_i) / p(x_i|ξ̂, ω̂) } ] = −(1/2n) E_D[ max_ω Y^2(ω) ].
(6.12)
A more detailed derivation is given in the appendix. In addition, we can obtain the explicit value of E_D[max_ω Y^2(ω)] in the limit of large d.

Corollary 1. When d is large, the MLE satisfies

E_gen ≈ { 1 + 2c √(2/π) √d + c^2(d + 1) } / { 2n(1 + c^2) } ≈ c^2 d / { 2n(1 + c^2) },
(6.13)

E_train ≈ −{ 1 + 2c √(2/π) √d + c^2(d + 1) } / { 2n(1 + c^2) } ≈ −c^2 d / { 2n(1 + c^2) }.
(6.14)
The proof is also given in the appendix. An interesting result is the antisymmetry between E_gen and E_train, that is, E_gen = −E_train, a relation also proved in the regular case (Amari & Murata, 1993). Note also that the generalization and training errors depend on the shape parameter c as well as on the dimension d; in the regular case, they depend only on d. As one can easily see, when c is small, the cone looks like a needle, and its behavior resembles that of a one-dimensional model. When c is large, the cone resembles two (d + 1)-dimensional hypersurfaces, so its behavior is like that of a (d + 1)-dimensional regular model. These observations are confirmed by equations 6.13 and 6.14.
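Corollary 1 can be checked by Monte Carlo under an assumed cone parameterization a(ω) = (1, cω)/√(1 + c^2) with ω a unit vector in R^{d+1} (equation 2.30 lies outside this section, so this form is an assumption). For that form, the maximizing ω aligns with the trailing coordinates of x̃, so max_ω Y^2(ω) = (|x̃_1| + c‖z‖)^2/(1 + c^2):

```python
import numpy as np

rng = np.random.default_rng(5)
d, c, trials = 50, 1.0, 20000
x1 = rng.standard_normal(trials)             # first coordinate of x_tilde
z = rng.standard_normal((trials, d + 1))     # remaining coordinates of x_tilde
# Closed-form maximum over omega for the assumed parameterization a(omega).
max_Y2 = (np.abs(x1) + c * np.linalg.norm(z, axis=1)) ** 2 / (1 + c**2)

mc = max_Y2.mean()                           # Monte Carlo estimate of E_D[max Y^2]
formula = (1 + 2 * c * np.sqrt(2 / np.pi) * np.sqrt(d) + c**2 * (d + 1)) / (1 + c**2)
print(mc, formula)                           # equals 2n * E_gen of equation 6.13
```

The Monte Carlo average and the large-d formula of equation 6.13 agree to within a few percent at d = 50, and the c-dependence (needle-like versus hypersurface-like) can be probed by varying c.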
6.1.2 Simple MLP with One Hidden Unit. For the simple MLP defined in equation 2.33, we can apply the same approach. The log likelihood of the data set D = {(x_i, y_i)}, i = 1, ..., n, is written as

L(D; ξ, ω) = −(1/2) Σ_{i=1}^n { y_i − ξ ϕ_β(ω · x_i) }^2.
(6.15)
Let us define two random variables depending on D and ω, 1 yi ϕβ (ω · x i ) , Yn (ω) = √ n i=1 n
1 2 ϕβ (ω · x i ). n
(6.16)
n
An (ω) =
(6.17)
i=1
Note that A_n(ω) converges to A(ω) = E_x[φ_β²(ω · x)] as n goes to infinity but is not normalized to 1 in the present case. Y(ω) defines a gaussian random field with mean 0 and covariance A(ω, ω′) = E_x[φ_β(ω · x) φ_β(ω′ · x)]. One should be careful that A_n(ω, β) of equation 6.17 approaches 0 as β → 0. This belongs to the non-Donsker class, and our theory does not hold in such a case. Using the MLE, we get the following theorem.

Theorem 2. For the simple MLP model, when the teacher perceptron is at ξ = 0, the MLE satisfies

\[ E_{gen} = E_D E_{p_0 q}\!\left[ \log \frac{p_0(y \mid x)}{p(y \mid x, \hat\xi, \hat\omega)} \right] = \frac{1}{2n} E_D\!\left[ \sup_\omega \frac{Y^2(\omega)}{A_n(\omega)} \right], \tag{6.18} \]

\[ E_{train} = \frac{1}{n} E_D\!\left[ \sum_{i=1}^{n} \log \frac{p_0(y_i \mid x_i)}{p(y_i \mid x_i, \hat\xi, \hat\omega)} \right] = -\frac{1}{2n} E_D\!\left[ \sup_\omega \frac{Y^2(\omega)}{A_n(\omega)} \right]. \tag{6.19} \]
The details of this derivation are given in the appendix. From the results, we can see a nice correspondence between the cone model and MLP. However, note that there is no sufficient statistic in the MLP case, while all the data are summarized in the sufficient statistic x˜ in the cone model. In addition, due to the nonlinearity of the hidden unit, we cannot easily determine the explicit relation through which the training and generalization errors depend on the dimension number of the parameters, which we have for the cone model in corollary 1.
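No closed form of sup_ω Y_n²(ω)/A_n(ω) seems available, but the expectation in theorem 2 can still be estimated by brute force. The sketch below is our added illustration, not the authors' method; it assumes a two-dimensional input, the transfer function φ_β(u) = tanh(βu) (the text leaves φ_β unspecified at this point), and a plain grid search over ω on the unit circle. It estimates 2n E_gen = E_D[sup_ω Y_n²(ω)/A_n(ω)] at the singular teacher ξ = 0.

```python
import numpy as np

def mle_ratio_sup(n=200, beta=1.0, n_angles=180, trials=500, seed=0):
    # Monte Carlo estimate of E_D[ sup_w Yn(w)^2 / An(w) ] for the simple
    # MLP y = xi * phi_beta(w . x) + noise at the singular teacher xi = 0.
    # phi_beta(u) = tanh(beta u) is an assumed choice of transfer function.
    rng = np.random.default_rng(seed)
    angles = np.linspace(0.0, np.pi, n_angles)      # w and -w give the same ratio
    W = np.stack([np.cos(angles), np.sin(angles)])  # shape (2, n_angles)
    sups = []
    for _ in range(trials):
        x = rng.standard_normal((n, 2))
        y = rng.standard_normal(n)                  # teacher output: pure noise
        phi = np.tanh(beta * (x @ W))               # shape (n, n_angles)
        Yn = y @ phi / np.sqrt(n)
        An = np.mean(phi**2, axis=0)
        sups.append(np.max(Yn**2 / An))
    return float(np.mean(sups))                     # estimates 2n * E_gen
```

The estimate exceeds 1, the value a regular one-parameter model would give; the excess comes from the sup over the unidentifiable direction ω, mirroring what corollary 1 quantifies for the cone model.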
6.2 Bayes Predictive Distribution

6.2.1 Cone Model. Different from the regular case, the asymptotic behavior of the Bayesian predictive distribution depends on the prior. Let us define the prior as π(ξ, ω). The probability density of the observed sample is then given by

\[ Z_n = p(D) = \int \pi(\xi, \omega) \prod_{i=1}^{n} p(x_i \mid \xi, \omega) \, d\xi \, d\omega. \tag{6.20} \]

When a new datum x_{n+1} is given, we can similarly obtain the joint probability density p(x_{n+1}, D) as

\[ Z_{n+1} = p(x_{n+1}, D) = \int \pi(\xi, \omega) \prod_{i=1}^{n+1} p(x_i \mid \xi, \omega) \, d\xi \, d\omega. \tag{6.21} \]
From the Bayes theorem, we can easily see that the Bayes predictive distribution is given by

\[ \hat{p}_{Bayes}(x \mid D) = \frac{Z_{n+1}}{Z_n}, \tag{6.22} \]
where x = x_{n+1}. When we assume a specific prior for the parameters ξ and ω, we can calculate Z_n explicitly. When π(ξ) = 1 and ω is uniform on S^d, we can obtain the Bayes predictive distribution and the generalization error explicitly, as in the following theorems.

Theorem 3. Under the uniform prior on ξ, the Bayes predictive distribution of the cone model is given by

\[ \hat{p}_{BAYES}(x \mid D) = \frac{1}{(\sqrt{2\pi})^{d+2}} \exp\!\left( -\frac{\|x\|^2}{2} \right) \left[ 1 + \frac{1}{\sqrt{n}} \nabla \log S_d^U(\tilde{x}) \cdot x + \frac{1}{2n} \mathrm{tr}\!\left\{ \frac{\nabla\nabla S_d^U}{S_d^U} H_2(x) \right\} \right], \tag{6.23} \]

where H_2(x) = x x^T − I and

\[ S_d^U(\tilde{x}) = \int \exp\!\left( \frac{1}{2} Y_n(\omega)^2 \right) d\omega, \qquad Y_n(\omega) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} a(\omega) \cdot x_i = a(\omega) \cdot \tilde{x}. \]
Theorem 4. Under the uniform prior on ξ, the generalization and training errors of the Bayes predictive distribution of the cone model are given by

\[ E_{gen} = E_D E_{p_0}\!\left[ \log \frac{p_0(x)}{p(x \mid D)} \right] = \frac{1}{2n} E_D\!\left[ \left\| \nabla \log S_d^U(\tilde{x}) \right\|^2 \right] = \frac{1}{2n}, \tag{6.24} \]

\[ E_{train} = \frac{1}{n} \sum_{i=1}^{n} E_D\!\left[ \log \frac{p_0(x_i)}{p(x_i \mid D)} \right] = E_{gen} - \frac{1}{n} E_D\!\left[ \nabla \log S_d^U(\tilde{x}) \cdot \tilde{x} \right]. \tag{6.25} \]

The details of this derivation are given in the appendix.

Remark. For any prior π(ξ, ω) that is positive and smooth, theorems 3 and 4 also hold asymptotically without any change.

The Jeffreys prior is given by the square root of the determinant of the Fisher information matrix,

\[ \pi(\xi, \omega) \propto \sqrt{|G(\xi, \omega)|}. \tag{6.26} \]
In this case, π(ξ) ∝ |ξ|^d and ω is uniformly distributed on S^d. The Jeffreys prior is not smooth and is 0 at ξ = 0. This is completely different from the regular case. We can conduct a similar analysis and obtain the following theorems.

Theorem 5. Under the Jeffreys prior, the Bayes predictive distribution of the cone model is given asymptotically by

\[ \hat{p}_{BAYES}(x \mid D) = \frac{1}{(\sqrt{2\pi})^{d+2}} \exp\!\left( -\frac{\|x\|^2}{2} \right) \left[ 1 + \frac{1}{\sqrt{n}} \nabla \log S_d^J(\tilde{x}) \cdot x + \frac{1}{2n} \mathrm{tr}\!\left\{ \frac{\nabla\nabla S_d^J}{S_d^J} H_2(x) \right\} \right], \tag{6.27} \]

where

\[ S_d^J(\tilde{x}) = \int I_d(Y_n(\omega)) \exp\!\left( \frac{1}{2} Y_n(\omega)^2 \right) d\omega, \tag{6.28} \]

\[ I_d(u) = \frac{1}{\sqrt{2\pi}} \int |z + u|^d \exp\!\left( -\frac{1}{2} z^2 \right) dz. \tag{6.29} \]
Theorem 6. Under the Jeffreys prior, the generalization and training errors of the Bayes predictive distribution of the cone model are given asymptotically by

\[ E_{gen} = \frac{1}{2n} E_D\!\left[ \left\| \nabla \log S_d^J(\tilde{x}) \right\|^2 \right] = \frac{d+1}{2n}, \tag{6.30} \]
\[ E_{train} = E_{gen} - \frac{1}{n} E_D\!\left[ \nabla \log S_d^J(\tilde{x}) \cdot \tilde{x} \right]. \tag{6.31} \]
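The d-independence of the uniform-prior generalization error (in contrast with the (d + 1)/(2n) of theorem 6) can be checked numerically from the representation of the predictive distribution through the ratio S_d^U(x̃_{n+1})/S_d^U(x̃) (equation A.15 in the appendix). The sketch below is our added illustration; it takes d = 1, the parametrization a(ω) = (1, c cos θ, c sin θ)/√(1 + c²), and a uniform angle grid on S¹ (the constant measure factor cancels in the ratio), and estimates E_gen = E[½ log((n + 1)/n) + log S(x̃_n) − log S(x̃_{n+1})] by Monte Carlo.

```python
import numpy as np

def bayes_gen_error(n=50, c=1.0, trials=30000, seed=0):
    # Monte Carlo check of the uniform-prior Bayes generalization error of
    # the d = 1 cone model, via equation A.15:
    #   E_gen = E[ (1/2) log((n+1)/n) + log S(x~_n) - log S(x~_{n+1}) ].
    # S(x~) = integral over S^1 of exp(Y(w)^2 / 2) dw is evaluated on a
    # uniform angle grid; the constant measure factor cancels in the ratio.
    rng = np.random.default_rng(seed)
    thetas = np.linspace(0.0, 2*np.pi, 128, endpoint=False)
    a = np.stack([np.ones_like(thetas),
                  c*np.cos(thetas),
                  c*np.sin(thetas)]) / np.sqrt(1 + c**2)   # a(w), shape (3, 128)

    def log_S(X):                       # X: (trials, 3) -> log S per row
        Y = X @ a
        return np.log(np.mean(np.exp(Y**2 / 2), axis=1))

    xt_n = rng.standard_normal((trials, 3))        # x~_n ~ N(0, I_3) exactly
    x_new = rng.standard_normal((trials, 3))       # one further sample from p_0
    xt_n1 = (np.sqrt(n)*xt_n + x_new) / np.sqrt(n + 1)
    vals = 0.5*np.log((n + 1)/n) + log_S(xt_n) - log_S(xt_n1)
    return float(np.mean(vals))
```

The returned value stays close to 1/(2n) and does not move when the shape parameter c is varied, in line with the constancy claimed for the uniform prior.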
For the proof, the same derivation process as in the uniform case can be applied, although it is fairly complicated. These results are rather surprising. Under the uniform prior, the generalization error is constant and does not depend on d, which is the complexity of the model. Hence, no overfitting occurs, however complex a model we use. This is completely different from the regular case. However, this striking result arises from the uniform prior on ξ. The uniform prior puts a strong emphasis on the singularity: there are infinitely many equivalent points at ξ = 0, so the prior density is infinitely large if we consider the space M̃ of probability distributions or behaviors. Hence, one should be very careful in choosing a prior when the model includes singularities. In the case of the Jeffreys prior, the generalization error increases in proportion to d, which is similar to the regular case. In addition, the antisymmetric duality between E_gen and E_train holds for neither the uniform prior nor the Jeffreys prior.

6.2.2 Simple MLP with One Hidden Unit. For the simple MLP model, we conducted a similar analysis for the uniform prior and the Jeffreys prior and obtained the following theorems.

Theorem 7. Under the uniform prior on ξ, the Bayes predictive distribution of the simple MLP model is given by

\[ \hat{p}_{BAYES}(y \mid x, D) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{1}{2} y^2 \right) \left[ 1 + \frac{y}{\sqrt{n}} \frac{\int \nabla Q_d^U(Y_n, \omega) \varphi_\beta(\omega \cdot x) \, d\omega}{P_d^U(Y_n)} + \frac{1}{2n} \frac{\int \nabla\nabla Q_d^U(Y_n, \omega) A_n(\omega) \, d\omega}{P_d^U(Y_n)} H_2(y) + O\!\left( \frac{1}{n^2} \right) \right], \tag{6.32} \]

where

\[ Q_d^U(Y_n, \omega) = \frac{1}{\sqrt{A(\omega)}} \exp\!\left( \frac{1}{2} \frac{Y_n(\omega)^2}{A(\omega)} \right), \tag{6.33} \]

\[ P_d^U(Y_n) = \int Q_d^U(Y_n, \omega) \, d\omega, \tag{6.34} \]

\[ Y_n(\omega) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} y_i \varphi_\beta(\omega \cdot x_i) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} y_i \varphi_i, \tag{6.35} \]

\[ A(\omega) = E_x\!\left[ \varphi_\beta^2(\omega \cdot x) \right]. \tag{6.36} \]
Theorem 8. Under the uniform prior on ξ, the generalization and training errors of the Bayes predictive distribution of the simple MLP model are given by

\[ E_{gen} = E_D E_{p_0 q}\!\left[ \log \frac{p_0(y \mid x)}{p(y \mid x, D)} \right] = \frac{1}{2n} E_D\!\left[ \frac{\int\!\!\int \nabla Q_d^U(Y_n, \omega) \nabla Q_d^U(Y_n, \omega') A(\omega, \omega') \, d\omega \, d\omega'}{\left( P_d^U(Y_n) \right)^2} \right] = \frac{1}{2n}, \tag{6.37} \]

\[ E_{train} = \frac{1}{n} \sum_{i=1}^{n} E_D\!\left[ \log \frac{p_0(y_i \mid x_i)}{p(y_i \mid x_i, D)} \right] = E_{gen} - \frac{1}{n} E_D\!\left[ \frac{\int \nabla Q_d^U(Y_n, \omega) Y_n(\omega) \, d\omega}{P_d^U(Y_n)} \right]. \tag{6.38} \]
Theorem 9. Under the Jeffreys prior on ξ, the Bayes predictive distribution of the simple MLP model is given by

\[ \hat{p}_{BAYES}(y \mid x, D) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{1}{2} y^2 \right) \left[ 1 + \frac{y}{\sqrt{n}} \frac{\int \nabla Q_d^J(Y_n, \omega) \varphi_\beta(\omega \cdot x) \, d\omega}{P_d^J(D)} + \frac{1}{2n} \frac{\int \nabla\nabla Q_d^J(Y_n, \omega) A_n(\omega) \, d\omega}{P_d^J(D)} H_2(y) + O\!\left( \frac{1}{n^2} \right) \right], \tag{6.39} \]

where

\[ Q_d^J(Y_n, \omega) = \frac{1}{\sqrt{A(\omega)}^{\,d+1}} I_d\!\left( \frac{Y_n(\omega)}{\sqrt{A(\omega)}} \right) \exp\!\left( \frac{1}{2} \frac{Y_n(\omega)^2}{A(\omega)} \right), \tag{6.40} \]

\[ P_d^J(D) = \int Q_d^J(Y_n, \omega) \, d\omega, \tag{6.41} \]

\[ I_d(u) = \frac{1}{\sqrt{2\pi}} \int |z + u|^d \exp\!\left( -\frac{1}{2} z^2 \right) dz. \tag{6.42} \]
Theorem 10. Under the Jeffreys prior on ξ, the generalization and training errors of the Bayes predictive distribution of the simple MLP model are given by

\[ E_{gen} = \frac{1}{2n} E_D\!\left[ \frac{\int\!\!\int \nabla Q_d^J(Y_n, \omega) \nabla Q_d^J(Y_n, \omega') A(\omega, \omega') \, d\omega \, d\omega'}{\left( P_d^J(D) \right)^2} \right] = \frac{d+1}{2n}, \tag{6.43} \]
\[ E_{train} = E_{gen} - \frac{1}{n} E_D\!\left[ \frac{\int \nabla Q_d^J(Y_n, \omega) Y_n(\omega) \, d\omega}{P_d^J(D)} \right]. \tag{6.44} \]
All of these results for the simple MLP correspond well with those for the cone model. For both the cone and MLP models, we can see that the generalization error is strongly dependent on the prior distribution of the parameters. This differs from the classic theory for the regular models.

7 Conclusion

It has long been known that some kinds of statistical model violate ordinary conditions such as the existence of a regular Fisher information matrix. Unfortunately, in classical statistical theories, singular models of this type have been regarded as pathological and have received little attention. However, to understand the behavior of hierarchical models such as multilayer perceptrons, singularity problems cannot be ignored. The singularity is closely related to basic problems regarding these models—such as the slow dynamics of learning in plateaus and a strange relation between generalization and training errors—and also the criteria of model selection. It is premature to give a general theory of estimation, testing, Bayesian inference, and learning dynamics for singular models. In this article, we have summarized our recent results regarding these problems using simple toy models. Although the results are preliminary, we believe that we have succeeded in elucidating various aspects of the singularity and have found some interesting paths to follow in future studies.

Appendix: Proofs of Theorems

Proof of Theorem 1.
The log likelihood of D is given by

\[ L(D, \xi, \omega) = -\frac{1}{2} \sum_{i=1}^{n} \left\| x_i - \xi a(\omega) \right\|^2. \tag{A.1} \]
From the definition of the generalization and training errors, we obtain

\[ E_{gen} = E_D E_{p_0}\!\left[ -\hat\xi(\hat\omega)\, a(\hat\omega) \cdot x + \frac{1}{2} \hat\xi^2(\hat\omega) \left( a(\hat\omega) \cdot a(\hat\omega) \right) \right] = E_D\!\left[ \frac{1}{2} \hat\xi^2(\hat\omega) \right] = \frac{1}{2n} E_D\!\left[ \max_\omega Y_n^2(\omega) \right]. \tag{A.2} \]
\[ E_{train} = E_D\!\left[ \frac{1}{n} \sum_{i=1}^{n} \left( -\hat\xi(\hat\omega)\, a(\hat\omega) \cdot x_i + \frac{1}{2} \hat\xi^2(\hat\omega) \left( a(\hat\omega) \cdot a(\hat\omega) \right) \right) \right] = E_D\!\left[ -\hat\xi(\hat\omega) \frac{1}{\sqrt{n}} a(\hat\omega) \cdot \tilde{x} + \frac{1}{2} \hat\xi^2(\hat\omega) \right] = E_D\!\left[ -\frac{1}{2} \hat\xi^2(\hat\omega) \right] = -\frac{1}{2n} E_D\!\left[ \max_\omega Y_n^2(\omega) \right]. \tag{A.3} \]
Proof of Corollary 1. In order to get the final results, we need to calculate max_ω Y_n²(ω). Let

\[ a(\omega) \cdot \tilde{x} = \frac{1}{\sqrt{1 + c^2}} \left( \tilde{x}_1 + c\, \omega \cdot \tilde{x}' \right), \]

where x̃′ = (x̃_2, . . . , x̃_{d+2})^T. Then

\[ Y_n^2(\omega) = \left( a(\omega) \cdot \tilde{x} \right)^2 = \frac{1}{1 + c^2} \left( \tilde{x}_1^2 + 2c \tilde{x}_1\, \omega \cdot \tilde{x}' + c^2 (\omega \cdot \tilde{x}')^2 \right) = \frac{1}{1 + c^2} \left( \tilde{x}_1^2 + 2c \tilde{x}_1 A\, e \cdot \omega + c^2 A^2 (e \cdot \omega)^2 \right), \]

where x̃′ = A e with ‖e‖ = 1. Then we obtain

\[ \hat\omega = \arg\max_\omega Y_n^2(\omega) = \mathrm{sgn}(\tilde{x}_1)\, e, \]

and

\[ \max_\omega Y_n^2(\omega) = \frac{1}{1 + c^2} \left( \tilde{x}_1^2 + 2c |\tilde{x}_1| A + c^2 A^2 \right). \]
From the fact that E_D[A²] = d + 1 and

\[ E_D[A] = E\!\left[ \sqrt{\tilde{x}_2^2 + \cdots + \tilde{x}_{d+2}^2} \right] = \frac{d!!}{(d-1)!!} \left( \sqrt{\frac{2}{\pi}} \right)^{(-1)^d} \approx \sqrt{d}, \tag{A.4} \]

where d!! = d(d − 2)(d − 4) · · · , we finally get
\[ E_D\!\left[ \max_\omega Y_n^2(\omega) \right] = \frac{1}{1 + c^2} \left( 1 + 2c\, E_D[|\tilde{x}_1|]\, E_D[A] + c^2 (d + 1) \right) \approx \frac{1 + 2c\sqrt{2/\pi}\,\sqrt{d} + c^2(d+1)}{1 + c^2} \approx \frac{c^2 d}{1 + c^2}. \tag{A.5} \]
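Equation A.4 is a standard fact about the chi distribution: A has d + 1 degrees of freedom, so its mean has both a gamma-function form and the double-factorial form used here. The following check, added for illustration and using only the standard library, compares the two forms and the √d approximation.

```python
import math

def chi_mean_gamma(d):
    # E[A] for A = sqrt(x2^2 + ... + x_{d+2}^2), a chi variable with d + 1
    # degrees of freedom: sqrt(2) * Gamma((d + 2)/2) / Gamma((d + 1)/2)
    return math.sqrt(2) * math.gamma((d + 2) / 2) / math.gamma((d + 1) / 2)

def chi_mean_dfact(d):
    # the double-factorial form of equation A.4:
    # (d!! / (d - 1)!!) * sqrt(2/pi)^((-1)^d)
    def dfact(k):
        return 1 if k <= 0 else k * dfact(k - 2)
    return (dfact(d) / dfact(d - 1)) * math.sqrt(2 / math.pi) ** ((-1) ** d)
```

The two forms agree exactly for every d, and chi_mean_gamma(100) is about 10.02, confirming E_D[A] ≈ √d for large d.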
Proof of Theorem 2. The log likelihood of D is given by

\[ L(D, \xi, \omega) = -\frac{1}{2} \sum_{i=1}^{n} \left\{ y_i - \xi \varphi_\beta(\omega \cdot x_i) \right\}^2 = -\frac{1}{2} \sum_{i=1}^{n} y_i^2 + \xi \sum_{i=1}^{n} y_i \varphi_\beta(\omega \cdot x_i) - \frac{1}{2} \xi^2 \sum_{i=1}^{n} \varphi_\beta^2(\omega \cdot x_i). \tag{A.6} \]
By using

\[ L(\hat\xi(\omega), \omega) = -\frac{1}{2} \sum_{i=1}^{n} y_i^2 + \frac{1}{2} \frac{Y_n^2(\omega)}{A_n(\omega)}, \tag{A.7} \]

\[ \hat\omega = \arg\max_\omega \frac{Y_n^2(\omega)}{A_n(\omega)}, \tag{A.8} \]

we obtain

\[ E_{gen} = E_D E_{y,x}\!\left[ -\hat\xi(\hat\omega)\, y \varphi_\beta(\hat\omega \cdot x) + \frac{1}{2} \hat\xi^2(\hat\omega) \varphi_\beta^2(\hat\omega \cdot x) \right] = E_D\!\left[ \frac{1}{2} \hat\xi^2(\hat\omega) A(\hat\omega) \right] = \frac{1}{2n} E_D\!\left[ \sup_\omega \frac{Y^2(\omega)}{A_n(\omega)} \right], \tag{A.9} \]

\[ E_{train} = E_D\!\left[ \frac{1}{n} \sum_{i=1}^{n} \left( -\hat\xi(\hat\omega)\, y_i \varphi_\beta(\hat\omega \cdot x_i) + \frac{1}{2} \hat\xi^2(\hat\omega) \varphi_\beta^2(\hat\omega \cdot x_i) \right) \right] = E_D\!\left[ -\hat\xi(\hat\omega) \frac{1}{\sqrt{n}} Y_n(\hat\omega) + \frac{1}{2} \hat\xi^2(\hat\omega) A_n(\hat\omega) \right] = E_D\!\left[ -\frac{1}{2} \hat\xi^2(\hat\omega) A_n(\hat\omega) \right] = -\frac{1}{2n} E_D\!\left[ \sup_\omega \frac{Y^2(\omega)}{A_n(\omega)} \right]. \tag{A.10} \]

Proof of Theorem 3. Let us define

\[ Z_n = p(D) = \int \pi(\xi, \omega) \prod_{i=1}^{n} p(x_i \mid \xi, \omega) \, d\xi \, d\omega, \tag{A.11} \]
\[ Z_{n+1} = p(x, D) = \int \pi(\xi, \omega) \prod_{i=1}^{n+1} p(x_i \mid \xi, \omega) \, d\xi \, d\omega. \tag{A.12} \]

Then the Bayesian predictive distribution can be written as

\[ \hat{p}_{BAYES}(x \mid D) = \frac{Z_{n+1}}{Z_n}. \tag{A.13} \]
Under the uniform prior, we can easily get

\[ Z_n = \frac{1}{(\sqrt{2\pi})^{n(d+2)}} \int \exp\!\left( -\frac{1}{2} \sum_i \|x_i\|^2 + \xi\, a(\omega) \cdot \sum_i x_i - \frac{n}{2} \xi^2 \right) d\xi \, d\omega = \frac{1}{(\sqrt{2\pi})^{n(d+2)}} \frac{\sqrt{2\pi}}{\sqrt{n}} \exp\!\left( -\frac{1}{2} \sum_i \|x_i\|^2 \right) \int \exp\!\left( \frac{Y_n(\omega)^2}{2} \right) d\omega. \tag{A.14} \]
1 pˆ BAYES (x|D) = √ ( 2π)d+2
$
n x2 SdU ( x˜ n+1 ) , exp − n+1 2 SdU ( x˜ )
where 1
%
x+ x˜ n+1 = √ n+1
&
xi ,
SdU ( x˜ )
=
1 2 exp Yn (ω) dω. 2
Using the approximation of the form,
x˜ n+1 ≈ x˜ +
x 1 x˜ √ − n 2n
and from equation A.15, we obtain
= x˜ + δx,
(A.15)
\[ \hat{p}_{BAYES}(x \mid D) = \frac{1}{(\sqrt{2\pi})^{d+2}} \exp\!\left( -\frac{\|x\|^2}{2} \right) \left( 1 - \frac{1}{2n} \right) \left[ 1 + \frac{\nabla S_d^U(\tilde{x})}{S_d^U(\tilde{x})} \cdot \delta x + \frac{1}{2} \mathrm{tr}\!\left\{ \frac{\nabla\nabla S_d^U(\tilde{x})}{S_d^U(\tilde{x})} \delta x\, \delta x^T \right\} \right] = \frac{1}{(\sqrt{2\pi})^{d+2}} \exp\!\left( -\frac{\|x\|^2}{2} \right) \left[ 1 + \frac{1}{\sqrt{n}} \frac{\nabla S_d^U(\tilde{x})}{S_d^U(\tilde{x})} \cdot x + \frac{1}{2n} \left( \mathrm{tr}\!\left\{ \frac{\nabla\nabla S_d^U}{S_d^U} x x^T \right\} - \frac{\nabla S_d^U}{S_d^U} \cdot \tilde{x} - 1 \right) \right]. \tag{A.16} \]

Using the fact that ∇S_d^U(x̃) · x̃ + S_d^U(x̃) = tr{∇∇S_d^U(x̃)}, we can obtain the final result.

Proof of Theorem 4. From equation A.16,

\[ E_{gen} = E_D E_{p_0}\!\left[ -\log\!\left( 1 + \frac{1}{\sqrt{n}} \nabla \log S_d^U(\tilde{x}) \cdot x + \frac{1}{2n} \mathrm{tr}\!\left\{ \frac{\nabla\nabla S_d^U}{S_d^U} H_2(x) \right\} \right) \right] = E_D E_{p_0}\!\left[ -\frac{1}{\sqrt{n}} \nabla \log S_d^U(\tilde{x}) \cdot x - \frac{1}{2n} \mathrm{tr}\!\left\{ \frac{\nabla\nabla S_d^U}{S_d^U} H_2(x) \right\} + \frac{1}{2n} \left( \nabla \log S_d^U(\tilde{x}) \cdot x \right)^2 \right] = \frac{1}{2n} E_D\!\left[ \left\| \nabla \log S_d^U(\tilde{x}) \right\|^2 \right]. \tag{A.17} \]

Similarly, for the training error, we get

\[ E_{train} = \frac{1}{n} \sum_{i=1}^{n} E_D\!\left[ -\log\!\left( 1 + \frac{1}{\sqrt{n}} \nabla \log S_d^U(\tilde{x}) \cdot x_i + \frac{1}{2n} \mathrm{tr}\!\left\{ \frac{\nabla\nabla S_d^U}{S_d^U} H_2(x_i) \right\} \right) \right] = -\frac{1}{n} E_D\!\left[ \nabla \log S_d^U(\tilde{x}) \cdot \tilde{x} \right] + \frac{1}{n} \sum_{i=1}^{n} E_D\!\left[ -\frac{1}{2n} \mathrm{tr}\!\left\{ \frac{\nabla\nabla S_d^U}{S_d^U} H_2(x_i) \right\} + \frac{1}{2n} \left( \nabla \log S_d^U(\tilde{x}) \cdot x_i \right)^2 \right]. \]
Here, we use the expansion

\[ \tilde{x} \approx \tilde{x}_i + \frac{1}{\sqrt{n}} x_i, \tag{A.18} \]

\[ f(\tilde{x}) \approx f(\tilde{x}_i) + \frac{1}{\sqrt{n}} \nabla f(\tilde{x}_i) \cdot x_i, \tag{A.19} \]

where x̃_i = (1/√n) Σ_{j≠i} x_j, and we finally get

\[ E_{train} = -\frac{1}{n} E_D\!\left[ \nabla \log S_d^U(\tilde{x}) \cdot \tilde{x} \right] + \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2n} E_D\!\left[ \left( \nabla \log S_d^U(\tilde{x}_i) \cdot x_i \right)^2 \right] = E_{gen} - \frac{1}{n} E_D\!\left[ \nabla \log S_d^U(\tilde{x}) \cdot \tilde{x} \right]. \tag{A.20} \]
On the other hand, since Y_n(ω) and Y_{n+1}(ω) have the same distribution, we easily get

\[ E_D E_{p_0}\!\left[ -\log \hat{p}_{BAYES}(x \mid D) \right] = H_0 + \frac{1}{2n}, \]

where H_0 is the entropy of the distribution p_0(x), so that E_gen = 1/(2n).
Proof of Theorem 5. Let us define

\[ Z_n^J = p(D) = \int |\xi|^d \prod_{i=1}^{n} p(x_i \mid \xi, \omega) \, d\xi \, d\omega, \tag{A.21} \]

\[ Z_{n+1}^J = p(x, D) = \int |\xi|^d \prod_{i=1}^{n+1} p(x_i \mid \xi, \omega) \, d\xi \, d\omega. \tag{A.22} \]

Then the Bayesian predictive distribution can be written as

\[ \hat{p}_{BAYES}(x \mid D) = \frac{Z_{n+1}^J}{Z_n^J}. \tag{A.23} \]

Under the Jeffreys prior, we get

\[ Z_n^J = \frac{1}{(\sqrt{2\pi})^{n(d+2)}} \int |\xi|^d \exp\!\left( -\frac{1}{2} \sum_i \|x_i\|^2 + \xi\, a(\omega) \cdot \sum_i x_i - \frac{n}{2} \xi^2 \right) d\xi \, d\omega = \frac{1}{(\sqrt{2\pi})^{n(d+2)}} \exp\!\left( -\frac{1}{2} \sum_i \|x_i\|^2 \right) \int\!\!\int |\xi|^d \exp\!\left( -\frac{n}{2} \xi^2 + \sqrt{n}\, (a(\omega) \cdot \tilde{x})\, \xi \right) d\xi \, d\omega \]
\[ = \frac{1}{(\sqrt{2\pi})^{n(d+2)}} \exp\!\left( -\frac{1}{2} \sum_i \|x_i\|^2 \right) \int\!\!\int |\xi|^d \exp\!\left( -\frac{n}{2} \left( \xi - \frac{a(\omega) \cdot \tilde{x}}{\sqrt{n}} \right)^2 \right) \exp\!\left( \frac{(a(\omega) \cdot \tilde{x})^2}{2} \right) d\xi \, d\omega = \frac{1}{(\sqrt{2\pi})^{n(d+2)}} \frac{1}{\sqrt{n}^{\,d+1}} \exp\!\left( -\frac{1}{2} \sum_i \|x_i\|^2 \right) \int\!\!\int |z + a(\omega) \cdot \tilde{x}|^d \exp\!\left( -\frac{z^2}{2} \right) \exp\!\left( \frac{(a(\omega) \cdot \tilde{x})^2}{2} \right) dz \, d\omega = \frac{1}{(\sqrt{2\pi})^{n(d+2)}} \frac{\sqrt{2\pi}}{\sqrt{n}^{\,d+1}} \exp\!\left( -\frac{1}{2} \sum_i \|x_i\|^2 \right) \int I_d(a(\omega) \cdot \tilde{x}) \exp\!\left( \frac{(a(\omega) \cdot \tilde{x})^2}{2} \right) d\omega. \]

Therefore, the predictive distribution is given from equation A.23 as

\[ \hat{p}_{BAYES}(x \mid D) = \frac{1}{(\sqrt{2\pi})^{d+2}} \left( \sqrt{\frac{n}{n+1}} \right)^{d+1} \exp\!\left( -\frac{\|x\|^2}{2} \right) \frac{S_d^J(\tilde{x}_{n+1})}{S_d^J(\tilde{x})}. \tag{A.24} \]
By again using the same expansion as for the uniform case, we obtain

\[ \hat{p}_{BAYES}(x \mid D) = \frac{1}{(\sqrt{2\pi})^{d+2}} \left( 1 - \frac{d+1}{2n} \right) \exp\!\left( -\frac{\|x\|^2}{2} \right) \left[ 1 + \frac{\nabla S_d^J(\tilde{x})}{S_d^J(\tilde{x})} \cdot \delta x + \frac{1}{2} \mathrm{tr}\!\left\{ \frac{\nabla\nabla S_d^J(\tilde{x})}{S_d^J(\tilde{x})} \delta x\, \delta x^T \right\} \right] = \frac{1}{(\sqrt{2\pi})^{d+2}} \exp\!\left( -\frac{\|x\|^2}{2} \right) \left[ 1 + \frac{1}{\sqrt{n}} \nabla \log S_d^J(\tilde{x}) \cdot x + \frac{1}{2n} \mathrm{tr}\!\left\{ \frac{\nabla\nabla S_d^J}{S_d^J} H_2(x) \right\} \right]. \tag{A.25} \]

Proof of Theorem 7. Let us define

\[ Z_n = \int \prod_{i=1}^{n} p(y_i \mid x_i, \xi, \omega) \, d\xi \, d\omega, \tag{A.26} \]

\[ Z_{n+1} = \int \prod_{i=1}^{n+1} p(y_i \mid x_i, \xi, \omega) \, d\xi \, d\omega. \tag{A.27} \]
Then the Bayesian predictive distribution can be written as

\[ \hat{p}_{BAYES}(y \mid x, D) = \frac{Z_{n+1}}{Z_n}. \tag{A.28} \]

Under the uniform prior, we can easily get

\[ Z_n = \frac{1}{(\sqrt{2\pi})^n} \int \exp\!\left( -\frac{1}{2} \sum_i y_i^2 + \xi \sum_i y_i \varphi_i - \frac{1}{2} \xi^2 \sum_i \varphi_i^2 \right) d\xi \, d\omega = \frac{1}{(\sqrt{2\pi})^n} \frac{\sqrt{2\pi}}{\sqrt{n}} \exp\!\left( -\frac{1}{2} \sum_i y_i^2 \right) \int \frac{1}{\sqrt{A_n(\omega)}} \exp\!\left( \frac{1}{2} \frac{Y_n(\omega)^2}{A_n(\omega)} \right) d\omega. \tag{A.29} \]
Therefore, the predictive distribution can be written as

\[ \hat{p}_{BAYES}(y \mid x, D) = \frac{1}{\sqrt{2\pi}} \sqrt{\frac{n}{n+1}} \exp\!\left( -\frac{1}{2} y^2 \right) \frac{P_d^U(D_{n+1})}{P_d^U(D_n)}, \tag{A.30} \]

where

\[ P_d^U(D_n) = \int \frac{1}{\sqrt{A_n(\omega)}} \exp\!\left( \frac{1}{2} \frac{Y_n(\omega)^2}{A_n(\omega)} \right) d\omega, \qquad A_n(\omega) = \frac{1}{n} \sum_{i=1}^{n} \varphi_\beta^2(\omega \cdot x_i). \]

Noting that A_n(ω) converges to A(ω) in the limit of large n, we substitute P_d^U(D_n) = P_d^U(Y_n). Using the approximations of the form

\[ Y_{n+1} \approx Y_n + \frac{y \varphi}{\sqrt{n}} - \frac{Y_n}{2n}, \tag{A.31} \]
\[ Q(Y_{n+1}, \omega) \approx Q(Y_n, \omega) + \frac{1}{\sqrt{n}} \nabla Q(Y_n, \omega)\, y \varphi + \frac{1}{2n} \left( \nabla\nabla Q(Y_n, \omega)\, y^2 \varphi^2 - \nabla Q(Y_n, \omega)\, Y_n \right), \tag{A.32} \]

\[ P(Y_{n+1}) \approx P(Y_n) + \frac{y}{\sqrt{n}} \int \nabla Q(Y_n, \omega) \varphi \, d\omega + \frac{1}{2n} \left( y^2 \int \nabla\nabla Q(Y_n, \omega) \varphi^2 \, d\omega - \int \nabla Q(Y_n, \omega) Y_n \, d\omega \right), \tag{A.33} \]

we obtain

\[ \hat{p}_{BAYES}(y \mid x, D) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{1}{2} y^2 \right) \left( 1 - \frac{1}{2n} \right) \frac{P_d^U(Y_{n+1})}{P_d^U(Y_n)} = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{1}{2} y^2 \right) \left[ 1 + \frac{y}{\sqrt{n}} \frac{\int \nabla Q(Y_n, \omega) \varphi \, d\omega}{P(Y_n)} + \frac{1}{2n} \left( y^2 \frac{\int \nabla\nabla Q(Y_n, \omega) \varphi^2 \, d\omega}{P(Y_n)} - \frac{\int \nabla Q(Y_n, \omega) Y_n \, d\omega}{P(Y_n)} - 1 \right) \right]. \tag{A.34} \]

By using the fact that

\[ \nabla\nabla Q(Y, \omega)\, A(\omega) = \nabla Q(Y, \omega)\, Y + Q(Y, \omega), \]

we can finally obtain the result.

Proof of Theorem 8. From equation 6.32 and the definition of the generalization error, we get

\[ E_{gen} = -E_D E_{y,x}\!\left[ \log\!\left( 1 + \frac{y}{\sqrt{n}} \frac{\int \nabla Q_d^U(Y_n, \omega) \varphi_\beta(\omega \cdot x) \, d\omega}{P_d^U(Y_n)} + \frac{1}{2n} \frac{\int \nabla\nabla Q_d^U(Y_n, \omega) A_n(\omega) \, d\omega}{P_d^U(Y_n)} H_2(y) \right) \right] \approx -E_D E_{y,x}\!\left[ \frac{y}{\sqrt{n}} \frac{\int \nabla Q_d^U(Y_n, \omega) \varphi_\beta(\omega \cdot x) \, d\omega}{P_d^U(Y_n)} + \frac{1}{2n} \frac{\int \nabla\nabla Q_d^U(Y_n, \omega) A_n(\omega) \, d\omega}{P_d^U(Y_n)} H_2(y) - \frac{y^2}{2n} \left( \frac{\int \nabla Q_d^U(Y_n, \omega) \varphi_\beta(\omega \cdot x) \, d\omega}{P_d^U(Y_n)} \right)^2 \right] = \frac{1}{2n} E_D\!\left[ \frac{\int\!\!\int \nabla Q_d^U(Y_n, \omega) \nabla Q_d^U(Y_n, \omega') A(\omega, \omega') \, d\omega \, d\omega'}{\left( P_d^U(Y_n) \right)^2} \right]. \tag{A.35} \]

Similarly, for the training error, we get

\[ E_{train} = -\frac{1}{n} \sum_{i=1}^{n} E_D\!\left[ \log\!\left( 1 + \frac{y_i}{\sqrt{n}} \frac{\int \nabla Q_d^U(Y_n, \omega) \varphi_\beta(\omega \cdot x_i) \, d\omega}{P_d^U(Y_n)} + \frac{1}{2n} \frac{\int \nabla\nabla Q_d^U(Y_n, \omega) A_n(\omega) \, d\omega}{P_d^U(Y_n)} H_2(y_i) \right) \right] = E_{gen} - \frac{1}{n} E_D\!\left[ \frac{\int \nabla Q_d^U(Y_n, \omega) Y_n(\omega) \, d\omega}{P_d^U(Y_n)} \right]. \tag{A.36} \]
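The identity ∇∇Q(Y, ω)A(ω) = ∇Q(Y, ω)Y + Q(Y, ω) invoked above follows by differentiating the explicit form of Q_d^U in equation 6.33 twice in Y; it is what turns the last bracket of equation A.34 into the H₂(y) form of theorem 7. A finite-difference check, added here as a sketch using only the standard library:

```python
import math

def Q(Y, A):
    # Q_d^U(Y, w) = exp(Y^2 / (2A)) / sqrt(A) for a fixed value A = A(w)
    return math.exp(Y**2 / (2*A)) / math.sqrt(A)

def check_identity(Y, A, h=1e-4):
    # residual of d^2Q/dY^2 * A - (Y * dQ/dY + Q); should be ~ 0
    d1 = (Q(Y + h, A) - Q(Y - h, A)) / (2*h)
    d2 = (Q(Y + h, A) - 2*Q(Y, A) + Q(Y - h, A)) / h**2
    return d2 * A - (Y * d1 + Q(Y, A))
```

The residual stays at finite-difference noise level for any Y and any A > 0, since ∂Q/∂Y = (Y/A)Q and ∂²Q/∂Y² = (1/A)Q + (Y²/A²)Q.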
On the other hand, from equation A.30, the generalization error is written as

\[ E_{gen} = \frac{1}{2} \log \frac{n+1}{n} + E_D E_{p_0 q}\!\left[ \log P_d^U(D_n) \right] - E_D E_{p_0 q}\!\left[ \log P_d^U(D_{n+1}) \right]. \]

From the fact that

\[ \lim_{n \to \infty} E_D E_{p_0 q}\!\left[ \log P_d^U(D_n) \right] < \infty, \]

we finally get E_gen = 1/(2n).

Proof of Theorem 9. Let us define

\[ Z_n = p(D) = \int |\xi|^d \prod_{i=1}^{n} p(y_i \mid x_i, \xi, \omega) \, d\xi \, d\omega, \tag{A.37} \]

\[ Z_{n+1} = p(y, x, D) = \int |\xi|^d \prod_{i=1}^{n+1} p(y_i \mid x_i, \xi, \omega) \, d\xi \, d\omega. \tag{A.38} \]

The Bayesian predictive distribution can then be written as

\[ \hat{p}_{BAYES}(y \mid x, D) = \frac{Z_{n+1}}{Z_n}. \tag{A.39} \]
Similar to the cone model, we get

\[ Z_n = \frac{1}{(\sqrt{2\pi})^n} \int\!\!\int |\xi|^d \exp\!\left( -\frac{1}{2} \sum_i y_i^2 + \xi \sum_i y_i \varphi_i - \frac{1}{2} \xi^2 \sum_i \varphi_i^2 \right) d\xi \, d\omega = \frac{1}{(\sqrt{2\pi})^n} \frac{\sqrt{2\pi}}{\sqrt{n}^{\,d+1}} \exp\!\left( -\frac{1}{2} \sum_i y_i^2 \right) \int \frac{1}{\sqrt{A_n(\omega)}^{\,d+1}} I_d\!\left( \frac{Y_n(\omega)}{\sqrt{A_n(\omega)}} \right) \exp\!\left( \frac{1}{2} \frac{Y_n(\omega)^2}{A_n(\omega)} \right) d\omega. \tag{A.40} \]

Therefore, the predictive distribution can be written as

\[ \hat{p}_{BAYES}(y \mid x, D) = \frac{1}{\sqrt{2\pi}} \left( \sqrt{\frac{n}{n+1}} \right)^{d+1} \exp\!\left( -\frac{1}{2} y^2 \right) \frac{P_d^J(D_{n+1})}{P_d^J(D_n)}. \tag{A.41} \]
By using the same approaches as for the uniform prior, we can easily obtain the final results.

Proof of Theorem 10. From equation A.41, the generalization error is written as

\[ E_{gen} = \frac{d+1}{2} \log \frac{n+1}{n} + E_D E_{p_0 q}\!\left[ \log P_d^J(D_n) \right] - E_D E_{p_0 q}\!\left[ \log P_d^J(D_{n+1}) \right]. \tag{A.42} \]

From the fact that

\[ \lim_{n \to \infty} E_D E_{p_0 q}\!\left[ \log P_d^J(D_n) \right] < \infty, \tag{A.43} \]

we finally get

\[ E_{gen} = \frac{d+1}{2n}. \tag{A.44} \]

For the proof, the same derivation process as for the uniform case can be applied.

References

Akaho, S., & Kappen, H. J. (2000). Nonmonotonic generalization bias of gaussian mixture models. Neural Computation, 12(6), 1411–1428.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automatic Control, AC-19, 716–723.
Amari, S. (1967). Theory of adaptive pattern classifiers. IEEE Trans., EC-16(3), 299–307.
Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27, 77–87.
Amari, S. (1987). Differential geometry of a parametric family of invertible linear systems—Riemannian metric, dual affine connections and divergence. Mathematical Systems Theory, 20, 53–82.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Amari, S. (2003). New consideration on criteria of model selection. In L. Rutkowski & J. Kacprzyk (Eds.), Neural networks and soft computing (Proceedings of the Sixth International Conference on Neural Networks and Soft Computing) (pp. 25–30). Heidelberg: Physica Verlag.
Amari, S., & Burnashev, M. V. (2003). On some singularities in parameter estimation problems. Problems of Information Transmission, 39, 352–372.
Amari, S., & Murata, N. (1993). Statistical theory of learning curves under entropic loss criterion. Neural Computation, 5, 140–154.
Amari, S., & Nagaoka, H. (2000). Information geometry. New York: AMS and Oxford University Press.
Amari, S., & Nakahara, H. (2005). Difficulty of singularity in population coding. Neural Computation, 17, 839–858.
Amari, S., & Ozeki, T. (2001). Differential and algebraic geometry of multilayer perceptrons. IEICE Trans., E84-A, 31–38.
Amari, S., Ozeki, T., & Park, H. (2003). Learning and inference in hierarchical models with singularities. Systems and Computers in Japan, 34, 34–42.
Amari, S., Park, H., & Fukumizu, K. (2000). Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12, 1399–1409.
Amari, S., Park, H., & Ozeki, T. (2001). Statistical inference in nonidentifiable and singular statistical models. J. of the Korean Statistical Society, 30(2), 179–192.
Amari, S., Park, H., & Ozeki, T. (2002). Geometrical singularities in the neuromanifold of multilayer perceptrons. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 343–350). Cambridge, MA: MIT Press.
Brockett, R. W. (1976). Some geometric questions in the theory of linear systems. IEEE Trans. on Automatic Control, 21, 449–455.
Chen, A. M., Lu, H., & Hecht-Nielsen, R. (1993). On the geometry of feedforward neural network error surfaces. Neural Computation, 5, 910–927.
Dacunha-Castelle, D., & Gassiat, É. (1997). Testing in locally conic models, and application to mixture models. Probability and Statistics, 1, 285–317.
Fukumizu, K. (1999). Generalization error of linear neural networks in unidentifiable cases. In O. Watanabe & T. Yokomori (Eds.), Algorithmic learning theory: Proceedings of the 10th International Conference on Algorithmic Learning Theory (ALT'99) (pp. 51–62). Berlin: Springer-Verlag.
Fukumizu, K. (2003). Likelihood ratio of unidentifiable models and multilayer neural networks. Annals of Statistics, 31(3), 833–851.
Fukumizu, K., & Amari, S. (2000). Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13, 317–327.
Hagiwara, K. (2002a). On the problem in model selection of neural network regression in overrealizable scenario. Neural Computation, 14, 1979–2002.
Hagiwara, K. (2002b). Regularization learning, early stopping and biased estimator. Neurocomputing, 48, 937–955.
Hagiwara, K., Hayasaka, T., Toda, N., Usui, S., & Kuno, K. (2001). Upper bound of the expected training error of neural network regression for a gaussian noise sequence. Neural Networks, 14, 1419–1429.
Hagiwara, K., Toda, N., & Usui, S. (1993). On the problem of applying AIC to determine the structure of a layered feed-forward neural network. Proceedings of IJCNN (Vol. 3, pp. 2263–2266). Nagoya, Japan.
Hartigan, J. A. (1985). A failure of likelihood asymptotics for normal mixtures. Proc. Berkeley Conf. in Honor of J. Neyman and J. Kiefer, 2, 807–810.
Hotelling, H. (1939). Tubes and spheres in n-spaces, and a class of statistical problems. Amer. J. Math., 61, 440–460.
Inoue, M., Park, H., & Okada, M. (2003). On-line learning theory of soft committee machines with correlated hidden units—Steepest gradient descent and natural gradient descent. J. Phys. Soc. Jpn., 72(4), 805–810.
Kang, K., Oh, J.-H., Kwon, S., & Park, Y. (1993). Generalization in a two-layer neural network. Phys. Rev. E, 48(6), 4805–4809.
Kitahara, M., Hayasaka, T., Toda, N., & Usui, S. (2000). On the statistical properties of least squares estimators of layered neural networks (in Japanese). IEICE Transactions, J86-D-II, 563–570.
Kůrková, V., & Kainen, P. C. (1994). Functionally equivalent feedforward neural networks. Neural Computation, 6, 543–558.
Liu, X., & Shao, Y. (2003). Asymptotics for likelihood ratio tests under loss of identifiability. Annals of Statistics, 31(3), 807–832.
Minsky, M., & Papert, S. (1969). Perceptrons. Cambridge, MA: MIT Press.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion—determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5(6), 865–872.
Park, H., Amari, S., & Fukumizu, K. (2000). Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks, 13, 755–764.
Park, H., Inoue, M., & Okada, M. (2003). On-line learning dynamics of multilayer perceptrons with unidentifiable parameters. J. Phys. A: Math. Gen., 36, 11753–11764.
Rattray, M., & Saad, D. (1999). Analysis of natural gradient descent for multilayer neural networks. Physical Review E, 59, 4523–4532.
Rattray, M., Saad, D., & Amari, S. (1998). Natural gradient descent for on-line learning. Physical Review Letters, 81, 5461–5464.
Riegler, P., & Biehl, M. (1995). On-line backpropagation in two-layered neural networks. J. Phys. A: Math. Gen., 28, L507–L513.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rosenblatt, F. (1961). Principles of neurodynamics. New York: Spartan.
Rüger, S. M., & Ossen, A. (1997). The metric of weight space. Neural Processing Letters, 5, 63–72.
Rumelhart, D., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error backpropagation. In D. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1). Cambridge, MA: MIT Press.
Saad, D., & Solla, S. A. (1995). On-line learning in soft committee machines. Phys. Rev. E, 52, 4225–4243.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Sussmann, H. J. (1992). Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks, 5, 589–593.
Watanabe, S. (2001a). Algebraic analysis for non-identifiable learning machines. Neural Computation, 13, 899–933.
Watanabe, S. (2001b). Algebraic geometrical methods for hierarchical learning machines. Neural Networks, 14(8), 1049–1060.
Watanabe, S. (2001c). Algebraic information geometry for learning machines with singularities. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 329–336). Cambridge, MA: MIT Press.
Watanabe, S., & Amari, S. (2003). Learning coefficients of layered models when the true distribution mismatches the singularities. Neural Computation, 15(5), 1013–1033.
Weyl, H. (1939). On the volume of tubes. Amer. J. Math., 61, 461–472.
Wu, S., Amari, S., & Nakahara, H. (2002). Population coding and decoding in a neural field: A computational study. Neural Computation, 14, 999–1026.
Wu, S., Nakahara, H., & Amari, S. (2001). Population coding with correlation and an unfaithful model. Neural Computation, 13, 775–797.
Yamazaki, K., & Watanabe, S. (2002). A probabilistic algorithm to calculate the learning curves of hierarchical learning machines with singularities. Trans. on IEICE, J85-D-II(3), 363–372.
Yamazaki, K., & Watanabe, S. (2003). Singularities in mixture models and upper bounds of stochastic complexity. Neural Networks, 16(7), 1029–1038.
Received June 28, 2004; accepted May 26, 2005.
LETTER
Communicated by Paul Tiesinga
How Noise Affects the Synchronization Properties of Recurrent Networks of Inhibitory Neurons

Nicolas Brunel
[email protected]

David Hansel
[email protected]
Laboratory of Neurophysics and Physiology, CNRS UMR 8119, Université Paris René Descartes, 75270 Paris Cedex 05, France
GABAergic interneurons play a major role in the emergence of various types of synchronous oscillatory patterns of activity in the central nervous system. Motivated by these experimental facts, modeling studies have investigated mechanisms for the emergence of coherent activity in networks of inhibitory neurons. However, most of these studies have focused either on the case in which the noise in the network is absent or weak or on the opposite situation in which it is strong. Hence, a full picture of how noise affects the dynamics of such systems is still lacking. The aim of this letter is to provide a more comprehensive understanding of the mechanisms by which the asynchronous states in large, fully connected networks of inhibitory neurons are destabilized as a function of the noise level. Three types of single-neuron models are considered: the leaky integrate-and-fire (LIF) model, the exponential integrate-and-fire (EIF) model, and conductance-based models involving sodium and potassium Hodgkin-Huxley (HH) currents. We show that in all models, the instabilities of the asynchronous state can be classified in two classes. The first one consists of clustering instabilities, which exist in a restricted range of noise. These instabilities lead to synchronous patterns in which the population of neurons is broken into clusters of synchronously firing neurons. The irregularity of the firing patterns of the neurons is weak. The second class of instabilities, termed oscillatory firing rate instabilities, exists at any value of noise. They lead to cluster states at low noise. As the noise is increased, the instability occurs at larger coupling, and the pattern of firing that emerges becomes more irregular. In the regime of high noise and strong coupling, these instabilities lead to stochastic oscillations in which neurons fire in an approximately Poisson way with a common instantaneous probability of firing that oscillates in time.
Neural Computation 18, 1066–1110 (2006)
© 2006 Massachusetts Institute of Technology
1 Introduction

Of the various patterns of synchronous oscillations that occur in the brain, episodes of activity during which local field potentials display fast oscillations with frequencies in the range 40 to 200 Hz have elicited particular interest. Such episodes have been recorded in vivo in several brain areas, in particular in the rat hippocampus (Buzsáki, Urioste, Hetke, & Wise, 1992; Bragin et al., 1995; Csicsvari, Hirase, Czurko, Mamiya, & Buzsáki, 1999a, 1999b; Siapas & Wilson, 1998; Hormuzdi et al., 2001). During these episodes, single-neuron firing rates are typically much lower than local field potential (LFP) frequencies (Csicsvari et al., 1999a), and single-neuron discharges appear very irregular. Although a detailed analysis of the single-cell firing statistics during these episodes is lacking, this irregularity is consistent with findings of high variability in interspike intervals of cortical neurons in various contexts (see, e.g., Softky & Koch, 1993; Compte et al., 2003), and the observation of large fluctuations in membrane potentials in intracellular recordings in cortex in vivo (see, e.g., Destexhe & Paré, 1999; Anderson, Lampl, Gillespie, & Ferster, 2000). Recent theoretical studies have shown that fast synchronous population oscillations in which single-cell firing is highly irregular emerge in networks of strongly interacting inhibitory neurons in the presence of high noise. Brunel and Hakim (1999) investigated analytically the emergence of such oscillations in networks of sparsely connected leaky integrate-and-fire neurons activated by an external noisy input. They showed that the frequency of the population oscillations increases rapidly when the synaptic delay decreases and that it can be much larger than the firing frequency of the neurons.
For instance, a network of neurons firing with an average rate of 10 Hz can oscillate at frequencies that can be on the order of 200 Hz when synaptic delays are on the order of 1 to 2 msec (Brunel & Hakim, 1999; Brunel & Wang, 2003). Tiesinga and Jose (2000) found similar collective states in numerical simulations of a fully connected network of inhibitory conductance-based neurons activated with a noisy external input. However, the frequency of the population oscillations in their model was smaller (in the range 20–80 Hz), and the variability of the spike trains was weaker than in the leaky integrate-and-fire (LIF) network for similar synaptic time constants and average firing rates of the neurons. Following the terminology of Tiesinga and Jose (2000), we will call this type of state a stochastic synchronous state. Stochastic synchrony occurs in the presence of strong noise, in contrast to so-called cluster states, which are found when noise and heterogeneities are weak. In the simplest cluster state, all neurons tend to spike together in a narrow window of time; they form one cluster. In such a state, the population oscillation frequency is close to the average frequency of the neurons (Abbott & van Vreeswijk, 1993; Tsodyks, Mit’kov, & Sompolinsky, ˜ & Kopell, 1993; Hansel, Mato, & Meunier, 1995; White, Chow, Soto-Trevino,
N. Brunel and D. Hansel
1998; Golomb & Hansel, 2000; Hansel & Mato, 2003). Clustering in which neurons are divided into two or more groups can also occur (Golomb, Hansel, Shraiman, & Sompolinsky, 1992; Golomb & Rinzel, 1994; Hansel et al., 1995; van Vreeswijk, 1996). Within each of these groups, neurons fire at a similar phase of the population oscillation, but different groups fire at different phases. Thus, the frequency of the population oscillations can be very different from the neuronal firing rate, as in stochastic synchrony. However, in contrast to stochastic synchrony, in cluster states the population frequency is always close to a multiple of the neuronal firing rate. Moreover, single-neuron activity in cluster states is much more regular than in stochastic synchronous states. In this letter, we examine the dynamics of networks of inhibitory neurons in the presence of noisy input. For simplicity, we consider a fully connected network; similarities and differences with more realistic, randomly connected networks will be mentioned in the discussion. We consider three classes of models: the LIF model (Lapicque, 1907; Tuckwell, 1988), the exponential integrate-and-fire (EIF) model (Fourcaud-Trocmé, Hansel, van Vreeswijk, & Brunel, 2003), and simple conductance-based (CB) models (Hodgkin & Huxley, 1952) with two active currents, a sodium current and a potassium current. In these models, we study the instabilities of the asynchronous state. In this state, the population-averaged firing rate is constant in the large-N limit, and correlations between neurons vanish in this limit. We investigate in particular how the instability responsible for stochastic synchrony relates to other types of instabilities of the asynchronous state. Section 2 is devoted to the LIF model.
We fully characterize the spectrum of instabilities of the asynchronous state and explore how these instabilities depend on the noise amplitude, the average firing rate of the neurons, and the synaptic time constants (latency, rise, and decay times). This can be done analytically, thanks to the simplicity of the LIF model and the simplified all-to-all architecture. The LIF neuron has the great advantage of analytical tractability, but it often exhibits nongeneric properties. For instance, the frequency-current relationship that characterizes the response of the neuron to a steady external current exhibits a logarithmic behavior near the current threshold, whereas generic type I neurons have a square-root behavior. LIF and standard Hodgkin-Huxley (HH) models may also display substantially different synchronization behaviors in the low-noise regime (Pfeuty, Golomb, Mato, & Hansel, 2003; Pfeuty, Mato, Golomb, & Hansel, 2005). Crucially, LIF neurons respond in a nongeneric way to fast oscillatory external inputs (Fourcaud-Trocmé et al., 2003). This motivates an investigation of synchronization properties in models with more realistic dynamics. In section 3, we combine analytical calculations with numerical simulations to study a network of EIF neurons. In this model, single-neuron dynamics depend on a voltage-activated current that triggers action potentials. This framework allows us to make predictions regarding the way sodium currents
Synchronization Properties of Inhibitory Networks
affect the emergence of fast oscillations. In section 4 we simulate several conductance-based network models to compare their behaviors with those of the LIF and the EIF. We conclude that although quantitative aspects of the phenomenology of stochastic synchrony depend on the details of the neuronal dynamics, the occurrence of this type of collective state is a generic feature of large neuronal networks of strongly coupled inhibitory neurons.

2 Networks of Leaky Integrate-and-Fire Neurons

2.1 The Model. We consider a fully connected network of N inhibitory LIF neurons. The membrane potential, Vi, of neuron i (i = 1, ..., N) evolves in the subthreshold range Vi(t) ≤ Vt, where Vt is the firing threshold, according to

$$\tau_m \dot{V}_i(t) = -V_i(t) + I_{\mathrm{rec}}(t) + I_{i,\mathrm{ext}}(t), \qquad (2.1)$$
where τm is the membrane time constant, Ii,ext(t) is an external input, and Irec(t) is the recurrent input due to the interactions between the neurons. When the voltage reaches the threshold Vt, a spike is emitted and the voltage is reset to Vr. The external input is modeled as a current,

$$I_{i,\mathrm{ext}}(t) = I_0 + \sigma \sqrt{\tau_m}\, \eta_i(t), \qquad (2.2)$$
where I0 is a constant DC input, σ measures the fluctuations around I0, and ηi(t) is a white noise that is uncorrelated from neuron to neuron: ⟨ηi(t)⟩ = 0, ⟨ηi(t)ηj(t′)⟩ = δij δ(t − t′). Since the network is fully connected, all neurons receive the same recurrent synaptic input. It is modeled as

$$I_{\mathrm{rec}}(t) = -\frac{J}{N} \sum_{j=1}^{N} \sum_{k} s\big(t - t_j^k\big), \qquad (2.3)$$
where s(t − t_j^k) denotes the postsynaptic current elicited by a presynaptic spike in neuron j occurring at time t_j^k, and J, which measures the strength of the synaptic interactions, has the dimension of a voltage. The first sum on the right-hand side of this equation represents a sum over synapses; the second sum is over spikes. For the function s(t), we take

$$s(t) = \begin{cases} 0 & t < \tau_l \\ A\left(e^{-(t-\tau_l)/\tau_d} - e^{-(t-\tau_l)/\tau_r}\right) & t \ge \tau_l, \end{cases} \qquad (2.4)$$
where τl is the latency, τr the rise time, and τd the decay time of the postsynaptic current, and A = τm/(τd − τr) is a normalization factor that ensures that the integral of a unitary postsynaptic current is independent of the synaptic time constants. With this normalization, ∫ s(t) dt = τm. Such kinetics are obtained from the set of differential equations,

$$\tau_d \dot{s}(t) = -s(t) + x(t) \qquad (2.5)$$

$$\tau_r \dot{x}(t) = -x(t) + \tau_m\, \delta(t - \tau_l), \qquad (2.6)$$
where x is an auxiliary variable. In the following, we assume that all the neurons are identical, and we take τm = 10 ms (McCormick, Connors, Lighthall, & Prince, 1985), Vt = 20 mV, and Vr = 14 mV (Troyer & Miller, 1997).

2.2 The Asynchronous State. The dynamics of the LIF network can be studied using a mean-field approach (Treves, 1993; Abbott & van Vreeswijk, 1993; Brunel & Hakim, 1999; Brunel, 2000). In the thermodynamic limit N → ∞, the recurrent input to the neurons can be written, up to corrections of order 1/√N, as

$$I_{\mathrm{rec}}(t) = -J s(t) \qquad (2.7)$$

$$\tau_d \dot{s} = -s + x \qquad (2.8)$$

$$\tau_r \dot{x} = -x + \tau_m\, \nu(t - \tau_l), \qquad (2.9)$$
where ν(t) is the instantaneous firing rate of the neurons at time t, averaged over the network. The dynamical state of the network can then be described by a probability density function (PDF) for the voltage, P(V, t), which satisfies the Fokker-Planck equation,

$$\tau_m \frac{\partial P}{\partial t} = \frac{\sigma^2}{2} \frac{\partial^2 P}{\partial V^2} + \frac{\partial}{\partial V}\big[(V - I_{\mathrm{rec}}(t) - I_0)P\big], \qquad (2.10)$$
with the boundary conditions

$$P(V_t, t) = 0 \qquad (2.11)$$

$$\frac{\partial P}{\partial V}(V_t, t) = -\frac{2\nu(t)\tau_m}{\sigma^2} \qquad (2.12)$$

$$\lim_{\epsilon \to 0}\big(P(V_r + \epsilon, t) - P(V_r - \epsilon, t)\big) = 0 \qquad (2.13)$$

$$\lim_{\epsilon \to 0}\left(\frac{\partial P}{\partial V}(V_r + \epsilon, t) - \frac{\partial P}{\partial V}(V_r - \epsilon, t)\right) = -\frac{2\nu(t)\tau_m}{\sigma^2}. \qquad (2.14)$$
Finally, P(V, t) must obey ∫ P(V, t) dV = 1 at all times t. In the stationary state of the network, the PDF of the membrane potential as well as the population average of the firing rate, ν(t) = ν0, are constant in time, and the neurons fire asynchronously. Integrating equation 2.10 together with the condition ∂P/∂t = 0 and the boundary conditions, equations 2.11 to 2.14, one can show that the stationary firing rate ν0 is given by (Ricciardi, 1977; Amit & Tsodyks, 1991; Amit & Brunel, 1997)

$$\frac{1}{\nu_0 \tau_m} = \sqrt{\pi} \int_{y_r}^{y_t} e^{x^2}\,[1 + \mathrm{erf}(x)]\, dx, \qquad (2.15)$$

where erf is the error function (Abramowitz & Stegun, 1970), and

$$y_t = \frac{V_t + J\nu_0\tau_m - I_0}{\sigma} \qquad (2.16)$$

$$y_r = \frac{V_r + J\nu_0\tau_m - I_0}{\sigma}. \qquad (2.17)$$
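Because yt and yr depend on ν0 through the mean recurrent input Jν0τm, equation 2.15 defines ν0 only implicitly. A minimal numerical sketch (parameter values are illustrative assumptions, not taken from the paper) solves this self-consistency with a bracketing root finder, using the identity e^{x²}[1 + erf(x)] = erfcx(−x) for numerical stability:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.special import erfcx  # erfcx(z) = exp(z^2) * erfc(z)

# Illustrative parameters (assumed): time in s, voltages in mV.
tau_m = 0.010            # membrane time constant
V_t, V_r = 20.0, 14.0    # threshold and reset, as in the text
I0, sigma, J = 21.0, 1.0, 1.0  # DC drive, noise level, coupling

def rhs_integral(nu0):
    """sqrt(pi) * integral_{y_r}^{y_t} exp(x^2)(1 + erf x) dx (eq. 2.15)."""
    y_t = (V_t + J * nu0 * tau_m - I0) / sigma
    y_r = (V_r + J * nu0 * tau_m - I0) / sigma
    # exp(x^2)(1 + erf x) = erfcx(-x) stays finite for large |x|
    val, _ = quad(lambda x: erfcx(-x), y_r, y_t)
    return np.sqrt(np.pi) * val

def residual(nu0):
    # Self-consistency: nu0 * tau_m * rhs = 1
    return nu0 * tau_m * rhs_integral(nu0) - 1.0

nu0 = brentq(residual, 1.0, 500.0)  # stationary firing rate in Hz
print(f"self-consistent rate: {nu0:.1f} Hz")
```

For this suprathreshold drive (I0 > Vt) and moderate noise, the self-consistent rate comes out a few tens of hertz, close to the deterministic rate 1/[τm ln((I0 − Vr)/(I0 − Vt))] corrected by the mean inhibitory feedback −Jν0τm.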
The coefficient of variation of the interspike interval distribution, which measures the irregularity of single-neuron firing, can also be computed in the asynchronous state. This yields (Brunel, 2000; Tuckwell, 1988)

$$\mathrm{CV}^2 = 2\pi(\nu_0\tau_m)^2 \int_{y_r}^{y_t} e^{x^2}\, dx \int_{-\infty}^{x} e^{y^2}\,[1 + \mathrm{erf}(y)]^2\, dy.$$
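The double integral above can be evaluated with the same erfcx trick used for the rate. A short sketch (the reduced bounds yt and yr are illustrative assumptions, corresponding to some choice of I0, σ, J, and ν0 through equations 2.16 and 2.17):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erfcx  # erfcx(z) = exp(z^2) * erfc(z)

# Assumed example values of the reduced threshold/reset variables:
y_t, y_r = -1.0, -7.0

# nu0 * tau_m from equation 2.15; exp(x^2)(1 + erf x) = erfcx(-x)
nu0_tau_m = 1.0 / (np.sqrt(np.pi) * quad(lambda x: erfcx(-x), y_r, y_t)[0])

def inner(x):
    """integral_{-inf}^{x} exp(y^2)(1 + erf y)^2 dy, written stably as
    exp(-y^2) * erfcx(-y)^2."""
    val, _ = quad(lambda y: np.exp(-y * y) * erfcx(-y) ** 2, -np.inf, x)
    return val

# CV^2 = 2 pi (nu0 tau_m)^2 * int_{y_r}^{y_t} e^{x^2} inner(x) dx
outer, _ = quad(lambda x: np.exp(x * x) * inner(x), y_r, y_t)
cv = np.sqrt(2.0 * np.pi) * nu0_tau_m * np.sqrt(outer)
print(f"nu0 * tau_m = {nu0_tau_m:.3f}, CV = {cv:.3f}")
```

For these strongly suprathreshold bounds the CV comes out well below 1, consistent with the drift-dominated regime of Figure 1.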
Figure 1 shows how the coefficient of variation (CV) increases with the noise level σ for several values of the average firing rate.

2.3 Stability Analysis of the Asynchronous State. The asynchronous state is stable if any small perturbation from it decays back to zero. To study the instabilities of the asynchronous state, one approach is to diagonalize the linear operator which describes the dynamics of small perturbations from the asynchronous state (see appendix A). A specific eigenmode, with eigenvalue λ, is stable (resp. unstable) if Re(λ) < 0 (resp. Re(λ) > 0). The frequency at which the mode oscillates is ω/(2π), where ω = Im(λ). The asynchronous state is stable if the real part of the eigenvalue is negative for all the eigenmodes. Here we present an alternative approach that directly provides the equation determining the critical manifolds in parameter space on which eigenmodes change stability (see also Brunel & Wang, 2003). The derivation proceeds in two steps. The first step is to compute the recurrent current, assuming that the population firing rate of the network is weakly modulated and oscillatory with a frequency
Figure 1: Coefficient of variation as a function of the noise level, σ , for Vt = 20 mV, Vr = 14 mV, τm = 10 ms, and several values of the average firing rate ν0 , indicated on the graph.
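The CV curves in Figure 1 can be spot-checked by direct Euler–Maruyama simulation of equations 2.1 and 2.2 for a single, uncoupled neuron (a sketch; the time step, duration, and seed are assumptions of ours, not the authors' procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
tau_m, V_t, V_r = 0.010, 20.0, 14.0  # s, mV, mV (values used in the text)
I0, sigma = 21.0, 1.0                # suprathreshold DC drive, noise (mV)
dt, T = 1e-4, 30.0                   # integration step and duration (s)

V, spikes = V_r, []
n_steps = int(T / dt)
# From eq. 2.1 with I_ext of eq. 2.2: dV = (I0 - V) dt/tau_m + sigma*sqrt(dt/tau_m)*xi
noise_scale = sigma * np.sqrt(dt / tau_m)
xi = rng.standard_normal(n_steps)
for i in range(n_steps):
    V += dt / tau_m * (I0 - V) + noise_scale * xi[i]
    if V >= V_t:           # threshold crossing: emit spike, reset
        spikes.append(i * dt)
        V = V_r

isi = np.diff(spikes)
rate = len(spikes) / T
cv = isi.std() / isi.mean()
print(f"rate = {rate:.1f} Hz, CV = {cv:.3f}")
```

The empirical rate and CV should land near the values given by equation 2.15 and the CV formula above for the same I0 and σ, up to discretization bias of order dt.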
ω: ν(t) = ν0 + ν1 exp(iωt), with ν1 ≪ 1 (for notational simplicity, we use complex numbers in the following linear analysis). At leading order, the recurrent current is also oscillatory at the same frequency, and its modulation can be written as

$$I_1 = \frac{-J\nu_1\tau_m\, e^{-i\omega\tau_l}}{(1 + i\omega\tau_d)(1 + i\omega\tau_r)} \equiv J\nu_1\tau_m\, A_S(\omega)\, e^{i\pi - i\Phi_S(\omega)}, \qquad (2.18)$$
where

$$A_S(\omega) = \frac{1}{\sqrt{(1 + \omega^2\tau_d^2)(1 + \omega^2\tau_r^2)}}, \qquad (2.19)$$

$$\Phi_S(\omega) = \omega\tau_l + \arctan(\omega\tau_r) + \arctan(\omega\tau_d). \qquad (2.20)$$
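Equations 2.19 and 2.20 are simply the modulus and phase lag of the synaptic filter appearing in equation 2.18, which can be checked numerically against a direct complex evaluation (synaptic constants as in Figure 2):

```python
import numpy as np

tau_l, tau_r, tau_d = 1e-3, 1e-3, 6e-3  # latency, rise, decay (s), as in Figure 2
f = np.linspace(1.0, 300.0, 1000)       # frequency grid (Hz)
w = 2.0 * np.pi * f

# Synaptic filter of eq. 2.18 (dropping the overall prefactor -J nu_1 tau_m):
H = np.exp(-1j * w * tau_l) / ((1.0 + 1j * w * tau_d) * (1.0 + 1j * w * tau_r))

# Amplitude and phase lag from equations 2.19 and 2.20:
A_S = 1.0 / np.sqrt((1.0 + (w * tau_d) ** 2) * (1.0 + (w * tau_r) ** 2))
Phi_S = w * tau_l + np.arctan(w * tau_r) + np.arctan(w * tau_d)

# H must equal A_S * exp(-i Phi_S); comparing complex values avoids
# the branch-cut ambiguity of np.angle once Phi_S exceeds pi.
assert np.allclose(H, A_S * np.exp(-1j * Phi_S))
print("synaptic filter identity verified on", len(f), "frequencies")
```

Note that A_S decreases monotonically with ω while Φ_S increases without bound (because of the latency term ωτl), which is the key to the graphical analysis of equation 2.24 below.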
The phase shift of the synaptic current with respect to the oscillatory presynaptic firing rate modulation is the sum of four contributions. Three of them, on the right-hand side of equation 2.20, depend on the latency, the rise time, and the decay time of the synapses and vary with ω. The fourth contribution, which does not depend on ω, is due to the factor exp(iπ) in equation 2.18. It results from the inhibitory nature of the synaptic interactions.
The second step is to compute the firing rate in response to an oscillatory input, equation A.3. It is given by (Brunel & Hakim, 1999; Brunel, Chance, Fourcaud, & Abbott, 2001)

$$\nu_1 = \frac{I_1 \nu_0}{\sigma(1 + i\omega\tau_m)}\, \frac{\frac{\partial U}{\partial y}(y_t, i\omega) - \frac{\partial U}{\partial y}(y_r, i\omega)}{U(y_t, i\omega) - U(y_r, i\omega)}, \qquad (2.21)$$
where yt and yr are given by equations 2.16 and 2.17, and the function U is defined in appendix A, equation A.5. The modulation ν1 can also be written as

$$\nu_1 = \frac{I_1 \nu_0}{\sigma}\, A_N(\omega)\, e^{-i\Phi_N(\omega)}, \qquad (2.22)$$
where (I1ν0/σ)A_N(ω) is the amplitude of the firing rate modulation and Φ_N(ω) is the phase shift of the firing rate with respect to the oscillatory input current. A negative (resp. positive) phase shift means that the modulation of the neuronal response is in advance of (resp. delayed with respect to) the modulation of the input. Since, in the network, the modulation of the firing rate and the modulation of the recurrent current must be consistent, combining equations 2.18 and 2.21 yields the self-consistency condition

$$1 = \frac{J\nu_0\tau_m}{\sigma}\, A_N(\omega)\, A_S(\omega)\, e^{i\pi - i\Phi_S(\omega) - i\Phi_N(\omega)}. \qquad (2.23)$$
This equation is a necessary and sufficient condition for the existence of self-sustained oscillations of the population firing rate with vanishingly small amplitude and frequency ω. It is therefore identical to the condition that an eigenmode of the linearized dynamics, oscillating at frequency ω, has marginal stability (i.e., the real part of its eigenvalue vanishes). Note that A_N depends on ν0, σ, Vt, and Vr, whereas A_S depends on the synaptic time constants τl, τr, and τd. The complex equation 2.23 is equivalent to two real equations. The first of these equations is

$$\Phi_S(\omega) + \Phi_N(\omega) = (2k + 1)\pi, \qquad k = 0, 1, 2, \ldots \qquad (2.24)$$
Equation 2.24 does not depend on the coupling strength J > 0. It determines the frequency spectrum of the eigenmodes with marginal stability. The second equation is

$$1 = \frac{J\nu_0\tau_m}{\sigma}\, A_N(\omega)\, A_S(\omega). \qquad (2.25)$$
It determines the coupling J_c(ω),

$$J_c(\omega) = \frac{\sigma}{\nu_0\tau_m\, A_N(\omega)\, A_S(\omega)}, \qquad (2.26)$$
at which a mode with frequency ω has marginal stability. In the following, we study the instability of the asynchronous state in the J-σ plane, at a fixed firing rate ν0 . When J and σ are varied, ν0 is kept fixed by a suitable adjustment of I0 . The effect of the firing rate ν0 on instabilities is then studied separately in section 2.4.4. For given ν0 and σ , one expects the asynchronous state to be stable when J is sufficiently small. The first eigenmode to lose stability when J increases determines the stability boundary of the asynchronous state. Therefore, this boundary is given by
$$\tilde{J}_c = \min_{\{\omega \,\mid\, \Phi_S(\omega) + \Phi_N(\omega) = (2k+1)\pi\}} \frac{\sigma}{\nu_0\tau_m\, A_N(\omega)\, A_S(\omega)}, \qquad (2.27)$$
where the minimum is computed over all the solutions to equation 2.24.

2.4 The Spectrum of Instabilities. Inspection of the qualitative properties of the synaptic and neuronal phase lags helps us understand how the instabilities of the asynchronous state occur and how they depend on the noise level (see also Fuhrmann, Markram, & Tsodyks, 2002; Brunel & Wang, 2003, for similar considerations).

2.4.1 The Solutions to Equation 2.24. The synaptic phase lag Φ_S(ω) is an increasing function of frequency. For low and high frequencies, it behaves like Φ_S(ω) ∼ ω(τl + τr + τd) and Φ_S(ω) ∼ π + ωτl, respectively. The function Φ_S(ω), which does not depend on the noise level, is plotted in Figure 2A. The neuronal phase lag, Φ_N(ω), depends markedly on the noise level. For low noise levels, Φ_N(ω) has a sawtooth profile, with peaks and very sharp variations at frequencies ω = 2πf_n, where, in the limit of zero noise, the f_n are integer multiples of the firing rate of the neurons, f_n = nν0, n = 1, 2, ... (see Figure 2B). Provided the latency is strictly positive, τl > 0, the function Φ_N + Φ_S goes to infinity with a sawtooth profile in the large-ω limit (see Figure 2C). Since the frequencies of the eigenmodes with marginal stability are solutions to equation 2.24, they can be determined graphically from the intersections of the graph of Φ_N + Φ_S with the horizontal lines at ordinates (2k + 1)π. For τl > 0, one can show that an odd number, 2p + 1, of such intersections exists for any k. For instance, for the parameters of Figure 2C and k = 0, there are 13 intersections for σ = 0.01 mV, 3 intersections for σ = 0.05 mV, and only 1 intersection for σ ≥ 0.1 mV. Of these 2p + 1 intersections, p + 1 occur at frequencies close to f_n, for values of n in a range that depends on the noise level and also on the synaptic
Figure 2: Interpretation of equation 2.24 in terms of synaptic and neuronal phase lags. (A) Synaptic phase lag for τl = 1 ms, τr = 1 ms, τd = 6 ms. (B) Neuronal phase lag for Vt = 20 mV, Vr = 14 mV, ν0 = 30 Hz, τm = 10 ms, and five noise levels: 0.01 mV (dot-dashed), 0.05 mV (dashed), 0.1 mV (thin solid), 1 mV (medium solid), and 10 mV (thick solid). Note the sharp variations at integer multiples of the firing rate ν0 (30, 60, 90, ... Hz) for low noise levels, which disappear as noise becomes stronger. (C) Total phase lag (sum of the synaptic and neuronal phase lags, for the same noise levels as in B). Solutions to equation 2.24 for a given noise level lie at the intersections of the curve representing the total phase lag and the horizontal dotted line at 180 degrees. Note the large number of intersections at low noise levels; these disappear as noise increases until a single intersection is left.
kinetics, as will be shown later. For instance, n = 3, . . . , 9 for σ = 0.01 mV, n = 4, 5 for σ = 0.05 mV, as shown in Figure 2C. The remaining p intersections are at intermediate frequencies nν0 < ω/(2π) < (n + 1)ν0 .
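This graphical construction is easy to reproduce numerically once a neuronal phase lag is supplied. The sketch below uses a deliberately crude stand-in for Φ_N (a single arctan with a hypothetical effective time constant, mimicking only the high-noise, monotonic regime of Figure 2B) rather than the exact U-function expression behind equation 2.21:

```python
import numpy as np
from scipy.optimize import brentq

tau_l, tau_r, tau_d = 1e-3, 1e-3, 6e-3  # synaptic constants as in Figure 2
tau_eff = 2e-3                          # hypothetical effective neuronal lag (assumed)

def phi_S(f):
    """Synaptic phase lag, equation 2.20, at frequency f in Hz."""
    w = 2.0 * np.pi * f
    return w * tau_l + np.arctan(w * tau_r) + np.arctan(w * tau_d)

def phi_N(f):
    """Toy high-noise neuronal phase lag (an assumption, bounded by pi/2);
    the exact Phi_N requires the U function of appendix A."""
    return np.arctan(2.0 * np.pi * f * tau_eff)

# k = 0 branch of equation 2.24: total phase lag equals pi.
# The total lag is monotonic here, so the intersection is unique.
f_star = brentq(lambda f: phi_S(f) + phi_N(f) - np.pi, 1.0, 1000.0)
print(f"marginal-mode frequency (toy model): {f_star:.0f} Hz")
```

With these numbers the root lands near 90 Hz, well above a 30 Hz single-neuron rate, which matches the single high-noise intersection described in the text.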
Figure 2B shows that the amplitude of the peaks in Φ_N(ω) decreases, and the peaks themselves become less sharp and broader, as the noise increases. As a result, pairs of intersections with the 180-degree horizontal line coalesce and disappear. For example, in Figure 2C, a pair of intersections (one near 60 Hz and the other at a frequency between 60 and 90 Hz) disappears when the noise is increased from σ = 0.01 mV (dot-dashed curve) to 0.05 mV (dashed curve). Eventually, for sufficiently large noise, all the intersections except one have disappeared, and the neuronal phase Φ_N becomes a monotonically increasing function of ω (see Figure 2C, for σ ≥ 0.1 mV). For the parameters of Figure 2C, this single intersection lies between 90 Hz (for σ = 10 mV) and 120 Hz (for σ = 0.1 mV), three to four times larger than the average firing rate of the neurons, ν0 = 30 Hz. In general, the value of ω at this intersection depends not only on ν0 but also on the synaptic time constants τl, τr, and τd, as will be discussed below. A similar picture holds for the intersections with the horizontal lines at (2k + 1)π with k ≥ 1, although they occur at much larger frequencies. For example, intersections with the line at 540 degrees (k = 1) occur around 1000 Hz for the parameters of Figure 2. Once the frequencies of the marginal modes have been obtained from equation 2.24, the corresponding critical couplings are determined by solving equation 2.26. The results can be represented in the J-σ plane.

2.4.2 The Instability Spectrum in the J-σ Plane. We start by describing the structure of the instability spectrum in the case where the synaptic time constants are τr = 1 ms, τd = 6 ms, and τl = 1 ms. What happens when these parameters change is briefly discussed in section 2.5. We first consider the case ν0 = 30 Hz. The lines on which eigenmodes are marginal are plotted in the σ-J plane in Figure 3A. The frequencies of the marginal modes are plotted as a function of noise in Figure 3B.
Figure 3A shows that there are several families of lines. Each family corresponds to the set of solutions to the phase equation 2.24 for a given k = 0, 1, 2, 3, ... (from low J to high J; for clarity, only the lines belonging to the k = 0, 1 families are shown) as σ varies. As discussed in the previous section, an odd number 2p + 1 of solutions to the phase condition exists for any k. These 2p + 1 solutions can be divided into p pairs of solutions that coalesce at some level of noise and one solution that exists at any noise level. We first discuss the p lines corresponding to the 2p solutions that disappear as the noise level increases. Each line is composed of an upper and a lower branch that approach each other as the noise level increases and subsequently meet at a critical value of the noise. The area enclosed by the line is the region in which the corresponding eigenmode is unstable (i.e., the region to the left of the curve). The frequency of the mode on the lower part of this line is very close to an integer multiple of the average neuronal firing rate, nν0, while on the upper part of the line, the frequency is between
Figure 3: (A) Lines on which eigenmodes become unstable, obtained from equations 2.24 and 2.25, in the plane J − σ , for the parameters of Figure 2. The asynchronous state is stable below the lowest line (region marked “asynchronous state stable”). Only lines corresponding to families of solutions at k = 0 and k = 1 (marked in the figure) are indicated. Each family is composed of individual branches labeled by integer values of n (indicated only for k = 0). (B) Frequency of marginal modes. The thick curve in A is the stability boundary of the asynchronous state. The thick curve in B is the frequency of the unstable mode on this boundary plotted against the noise.
nν0 and (n + 1)ν0 or (n − 1)ν0 (see Figure 3B). These two parts of the line meet each other for the noise level at which the eigenmode becomes stable for any value of the coupling. Hence, we can index all eigenmodes and the lines on which they have marginal stability in the σ − J plane by the integer n. This index is the number of clusters that emerge via the instability on the lower part of the line. If n = 1, one cluster emerges, and all the neurons in the network tend to fire simultaneously. For n ≥ 2, they tend to split into groups of neurons that fire in synchrony one after the other (see Figure 4B). Thus, the frequency of the population oscillations can be significantly larger than the average firing rate of the neurons, ν0 , if n is large. All these instabilities exist only in a limited range of noise amplitude. For example, in Figures 3A and 3B, the n = 2 curve exists for σ < 0.005 mV, n = 3 exists for σ < 0.04 mV, and so forth. This reflects the sensitivity of clustering to noise. As noise increases, neurons are less able to maintain their phase locking to the population oscillation. Clustering cannot emerge since neurons would skip more and more between clusters and spend less time bound to a specific cluster. The instabilities corresponding to these p eigenmodes are called clustering instabilities. An additional solution to the phase condition differs from the other solutions by the fact that it survives even for large noise. As noise increases, it generates a single-valued curve in the J − σ plane. On this line, the desynchronizing effect of the noise that would suppress the instability can be compensated for by increasing the coupling strength (see Figure 3A). The mode that has marginal stability on this line can also be indexed by an integer n since in the limit of weak noise, its frequency goes to one of the integer multiples of the firing rate f n (n = 4 in Figure 3). 
Hence, at low noise levels, this instability, like the clustering instabilities described above, leads to a cluster state in which the neurons fire in a regular manner. When the noise becomes strong, it leads to a state in which individual neurons fire in a highly irregular way while the population activity oscillates: a stochastic synchronous state. In this state, the neurons increase their firing probability together with the oscillatory population activity; that is, they display "rate oscillations" (see Figure 4D). To distinguish this instability from those described above, we term it an oscillatory rate instability.

2.4.3 Stability Boundary of the Asynchronous State. For each value of σ, the asynchronous state is stable for J < J̃c, where J̃c is given by equation 2.27. Typically, J̃c = J_c(ω1), where ω1 is the smallest solution of equation 2.24. This is because if ω1 < ω2 < ··· are solutions to equation 2.24, we have A_S(ω1) > A_S(ω2) > ···, and likewise A_N(ω1) > A_N(ω2) > ···. The bold lines in Figure 3A indicate the boundary of the region in which the asynchronous state is stable. When the noise is weak, the stability boundary coincides with one of the lines where a clustering instability occurs (n = 2 for σ < 0.005 mV, n = 3 for
Figure 4: Simulations of a network of 1000 LIF neurons. All four panels show spike trains of 20 selected neurons (rasters) and instantaneous population firing rate, computed in 1 ms bins. (A, B) Low coupling–low noise region. In A, the asynchronous state is stable (J = 1 mV, σ = 0.04 mV). In B, the noise is decreased (σ = 0.02 mV). The network now settles in a three-cluster state, as predicted by the analytical results. (C, D) Strong coupling–strong noise region. C : J = 100 mV, σ = 10 mV; the asynchronous state is stable. Decreasing σ to σ = 4 mV leads to a stochastic oscillatory state, as predicted by the analytical results. See appendix C for more details on numerical simulations.
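The simulations of Figure 4 can be sketched along the following lines. This is a simplified reimplementation under stated assumptions (N, dt, I0, the delay handling, and the initial conditions are ours, not taken from appendix C), exploiting the fact that in a fully connected network all neurons share the recurrent input of equations 2.7 to 2.9:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200                                 # network size (assumed; the paper uses 1000)
tau_m, V_t, V_r = 0.010, 20.0, 14.0     # s, mV, mV
tau_l, tau_r, tau_d = 1e-3, 1e-3, 6e-3  # synaptic latency, rise, decay (s)
J, sigma, I0 = 1.0, 0.04, 21.0          # mV; weak coupling-weak noise (cf. Fig. 4A)
dt, T = 1e-4, 2.0
n_steps = int(T / dt)
delay = int(round(tau_l / dt))

V = rng.uniform(V_r, V_t, size=N)       # random initial conditions (assumed)
s = x = 0.0
rate_buf = np.zeros(delay + 1)          # circular buffer approximating nu(t - tau_l)
spike_count = 0

for i in range(n_steps):
    # Shared recurrent input, equations 2.7-2.9 (read before overwriting:
    # this retrieves the population rate from delay+1 steps ago, i.e. ~tau_l)
    nu_delayed = rate_buf[i % (delay + 1)]
    x += dt / tau_r * (-x + tau_m * nu_delayed)
    s += dt / tau_d * (-s + x)
    I_rec = -J * s
    # Membrane dynamics, equations 2.1-2.2 (Euler-Maruyama)
    V += dt / tau_m * (-V + I_rec + I0) \
        + sigma * np.sqrt(dt / tau_m) * rng.standard_normal(N)
    fired = V >= V_t
    V[fired] = V_r
    rate_buf[i % (delay + 1)] = fired.sum() / (N * dt)  # instantaneous rate (1/s)
    spike_count += fired.sum()

mean_rate = spike_count / (N * T)
print(f"population-averaged rate: {mean_rate:.1f} Hz")
```

In steady state, s relaxes to τmν0, so the mean recurrent input is −Jν0τm, consistent with equations 2.16 and 2.17; binning the per-step spike counts reproduces raster and rate plots in the style of Figure 4.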
0.005 < σ < 0.04 mV), and the CV of the firing is smaller than 0.04. When the noise is sufficiently large (σ > 0.04 mV), the stability boundary coincides with the oscillatory rate instability. On this part of the boundary, the CV increases from 0.04 to about 1 for σ ∼ 5 mV (see Figure 1). The frequency of the marginal mode on the stability boundary of the asynchronous state is shown in Figure 3B. At low noise levels, the index of the marginal mode is n = 2, and the frequency is about 2ν0 = 60 Hz. It increases discontinuously at σ ≈ 0.005 mV to about 3ν0 = 90 Hz, as the index changes from n = 2 to n = 3. A second discontinuity occurs at σ ≈ 0.04 mV, where the marginal mode on the boundary becomes the oscillatory rate mode. The index n changes from 3 to 4, and the frequency jumps to 120 Hz. For further increases of σ, the frequency changes smoothly with the noise and remains significantly larger than ν0. The two regions of the stability boundary (clustering and rate oscillation) are characterized by different relationships between J̃c and σ. At low noise levels, in the clustering region, J̃c ∼ σ² (Abbott & van Vreeswijk, 1993), while at high noise levels, in the rate oscillation region, J̃c ∼ σ/√(ln σ) (see appendix B for details of the derivation).

2.4.4 How the Stability Boundary Depends on the Firing Rate ν0. The instabilities that occur on the asynchronous state stability boundary depend on ν0, as shown in Figure 5. For ν0 = 10 Hz, clustering instabilities with n = 7, 8, 9, 10, 11 occur at very low noise levels. For σ ≈ 10⁻⁴ mV, the instability becomes an oscillatory rate instability with n = 12. As ν0 increases, the asynchronous state loses stability at smaller J and σ. Intuitively, this is because the inhibitory feedback responsible for the destabilization of the asynchronous state increases with ν0.
Moreover, clustering instabilities occur over a larger range of noise, the number of emerging clusters becomes smaller, and the transition to the oscillatory rate instability moves toward larger noise levels. The last effect is a consequence of the fact that as the firing rate increases, the spike trains become more regular. For ν0 = 30 Hz, this transition occurs around σ = 0.04 mV, whereas for ν0 = 50 Hz, it is at about σ = 0.3 mV. For all these values of ν0, the CV is in the range 0.01 to 0.1 at this transition.

2.5 The Effect of Synaptic Kinetics. The synaptic time constants and the latency have a strong effect on the instability spectrum and on the stability boundary of the asynchronous state. Three qualitatively different situations can occur. First, when the latency and the rise time are both zero, the synaptic phase Φ_S(ω) is bounded by π/2. Since the neuronal phase is smaller than π/2 (see Figure 2), equation 2.24 has no solutions. Hence, the asynchronous state is stable for any J, σ. Second, when there is no latency (τl = 0) but the rise time is finite, the synaptic phase is bounded by π. Hence, solutions to equation 2.24 exist
Figure 5: Instabilities of the asynchronous state versus firing rate. (A) Stability boundary of the asynchronous state in the σ − J plane. (B) Frequency of the marginal mode on this boundary. In both panels, curves for three firing rates are shown: ν0 = 10 Hz (thin curves), 30 Hz (intermediate curves), 50 Hz (thick curves). Dashed lines indicate cluster state instabilities; the solid line indicates the oscillatory rate instability.
Figure 6: Influence of the shape of the inhibitory synaptic currents on the location of instabilities in the σ-J plane and on the frequencies of the eigenmodes with marginal stability. (A) Instantaneous synaptic currents (τr = τd = 0) and latency τl = 2 ms. The rate oscillation mode has index n = 8 (frequency about 180 Hz at σ = 10 mV). When noise decreases, there is a succession of transitions to cluster modes with lower n. (B) Synaptic currents with τr = 2 ms, τd = 6 ms, and no latency (τl = 0). The oscillatory rate instability has a large index n (n ∼ 25). The rate oscillations have a frequency of about 120 Hz for σ = 10 mV. The critical coupling J̃c varies in a nonmonotonic way as σ increases. Note the different ordinate scales in the top panels of A and B.
only for k = 0. An example of such a case is shown in Figure 6B. Note that in this case, the oscillatory rate instability has a very large index n, and there is a range of J in which the stability of the asynchronous state varies nonmonotonically as the noise level increases. The asynchronous state is
first unstable due to clustering instabilities, then becomes stable, becomes unstable again due to the oscillatory rate instability, and finally becomes stable as the noise increases. Third, with a finite latency, the synaptic phase is unbounded as ω increases. Hence, solutions to equation 2.24 exist for any k, leading to families of eigenmodes associated with each integer k, as in Figure 3, for any value (zero or nonzero) of the rise and decay times. For example, the families of lines on which eigenmodes have marginal stability in the σ-J plane, and the frequencies of the marginal modes, are shown in Figure 6A for k = 0 and k = 1. However, the region of stability of the asynchronous state is larger for nonzero decay time and/or rise time, as seen by comparing Figure 6A with Figure 3A (note that the scale of the ordinates is 10 times larger in Figure 3A than in Figure 6A). The number of clusters emerging on the asynchronous state stability boundary (i.e., the index n of the corresponding instability mode) in the weak noise region depends on the amplitude of the noise, but also on all the synaptic time constants (latency, rise time, decay time): the shorter the synaptic time constants, the larger the number of clusters. This is shown in Figure 7, where the number of clusters is plotted in the σ-α plane as a function of σ and of a scaling factor α applied to all three synaptic time constants (τl = α × 1 ms, τr = α × 1 ms, τd = α × 6 ms). The solid lines in this figure show the boundaries between regions in which the instability mode has frequency ∼ nν0, that is, in which the corresponding instability leads to n-cluster states (the number of clusters n is marked in each region). The number of clusters can vary from 1 for slow synaptic currents (for example, with τl = 3 ms, τr = 3 ms, τd = 18 ms) to infinity as synaptic currents become infinitely fast (van Vreeswijk, 1996).
These lines delineate open stripes in the σ − α plane; upon crossing such a line, a pair of solutions of equation 2.24 either appears or vanishes, one of which corresponds to the stability boundary of the asynchronous state. For example, for α = 1, the instability line on the stability boundary of the asynchronous state is the n = 3 line at σ = 0.01; as σ increases, the pair of solutions corresponding to n = 3 disappears, and the instability line becomes the n = 4 line. The dotted line separates the region where the number of solutions of equation 2.24 is strictly larger than one (to the left of the line) from the region where a single solution exists (to the right of the line); hence, a pair of solutions vanishes when this line is crossed from left to right. The difference between the dotted line and the solid lines is that on the dotted line, none of the solutions corresponds to the stability boundary of the asynchronous state. Taking again the case α = 1 as an example, crossing the dotted line marks the point where the n = 5 pair coalesces (see Figure 3). The set of lines that marks the boundary of the open, large-noise region (composed of alternating solid and dotted lines) can be defined as the boundary of the stochastic synchrony region. For the LIF neuron, this set of lines remains
N. Brunel and D. Hansel
Figure 7: The nature of the unstable modes on the stability boundary of the asynchronous state as a function of the noise and a global scaling factor α of the synaptic kinetics. The synaptic time constants are τr = α ms, τd = 6α ms, and τl = α ms. The solid lines are the boundaries between regions in which the instability mode has a frequency ∼ nν0, that is, in which the corresponding instability leads to n-cluster states, where n is indicated in each region. The dotted line separates a weak noise regime, where the number of solutions of equation 2.24 is strictly larger than 1, from a strong noise regime, where only one solution is left. In the weak noise regime, the instabilities are cluster-type instabilities; in the strong noise regime, the instability is the oscillatory rate instability.
confined to a range of σ between 0.05 and 0.2 mV for the range of α shown in the figure. 3 Networks of EIF Neurons 3.1 The Model. In the following, we consider a fully connected network of N inhibitory exponential integrate-and-fire neurons (EIF; Fourcaud-Trocmé et al., 2003) receiving noisy external inputs. In the EIF, the fixed threshold condition of the LIF neuron is replaced by a spike-generating
current that depends exponentially on voltage. In this model, the voltage of neuron i is described by

τm V̇i(t) = −Vi(t) + ΔT exp[(Vi(t) − VT)/ΔT] + Irec(t) + Ii,ext(t),  (3.1)
where the external and the recurrent currents are modeled as in the LIF network. The parameter VT is the highest voltage at which the membrane potential can be maintained by injecting a steady external input, and the slope factor ΔT measures the sharpness of the action potential generation. When the external current is large enough, the voltage diverges to infinity in finite time because of the exponential term on the right-hand side of equation 3.1. This divergence defines the time of the spike. At that time, the voltage is reset to a fixed voltage Vr, where it remains during an absolute refractory period τARP. Unless specified otherwise, the results presented below were obtained for the following set of parameters: τm = 10 ms, VT = 5.1 mV, Vr = −3 mV, τARP = 1.7 ms, and ΔT = 3.5 mV (Fourcaud-Trocmé et al., 2003). The EIF model is a good compromise between the simplified LIF neuron, which has an unrealistic spike generation mechanism, and more realistic HH-type models. It is simple enough that analytical techniques can be applied to study its dynamics. Moreover, simple HH models can be mapped onto EIF models, as shown in Fourcaud-Trocmé et al. (2003). In the following, we study the stability properties of the asynchronous state in this model and compare them to those of the LIF model, which can be thought of as an EIF neuron with infinitely sharp spike initiation (ΔT = 0). More generally, we investigate how the synchronization properties of the EIF network depend on the sharpness of the spike initiation. 3.2 Stability of the Asynchronous State. The approach described in section 2 can be applied to study the stability of the asynchronous state of the EIF network. Marginal modes are still determined by equations 2.24 to 2.27, but AN(ω) and ΦN(ω) now represent the amplitude and the phase shift of the instantaneous firing rate modulation of a single EIF neuron in response to an oscillatory input at frequency ω. Fourcaud-Trocmé et al.
(2003) computed these functions in the low- and high-frequency limits. Obtaining them analytically at arbitrary ω is a difficult task; hence, we compute them for various noise levels using numerical simulations. Examples are shown in Figure 8. Once ΦN(ω) and AN(ω) are known, we solve equations 2.24 to 2.26 to find the frequency of the unstable modes at the onset of the instabilities, together with the critical coupling strength at which these instabilities appear. The instability spectrum derived using this approach is plotted in the σ − J plane in Figure 9A for ν0 = 30 Hz, τl = 1 ms, τr = 1 ms, τd = 6 ms. In this figure, only the lines corresponding to k = 0 and n = 1, 2, 3 are shown. This instability spectrum bears some resemblance to the one obtained in the LIF
Figure 8: (A) Single-neuron phase shift for low noise (small circles, σ = 0.5 mV) and high noise (large circles, σ = 5 mV). Note that for low noise, the phase shift increases sharply when the frequency is close to integer multiples of the stationary frequency (here, 30 Hz), and decreases in between two successive integer values, while for high noise, the phase increases monotonically with frequency. (B) Total phase shifts (neuronal + synaptic, left-hand side of equation 2.24) for the same two noise levels as in A. The intersection of curves with the horizontal dotted line at 180 degrees gives the frequencies of oscillatory instabilities. Note that for low noise, these intersections are close to integer multiples of the firing rate (∼30, 60, 90 Hz) and intermediate frequencies (∼45, 75 Hz), while for high noise, there is a single intersection around 50 Hz that is unrelated to the single-cell firing frequency.
Figure 9: Oscillatory instabilities in the EIF model (circles and solid lines) and in simulations of the Wang-Buzsáki model (stars and dashed lines; see details in section 4). (A) Frequency of the oscillatory instabilities versus noise. (B) Critical lines on which eigenmodes become unstable in the σ − J plane. In both panels, cluster instabilities with n = 1, 2, 3, corresponding to the resonances at 30, 60, and 90 Hz (see Figure 8), are shown. Note that the frequency and the critical coupling strength on the stability boundary of the asynchronous state are very close in the two models.
model for the same parameters (see Figure 3A). However, there are several significant differences between the two figures. One is that in the LIF model, the marginal modes for k = 0 have indices n ≥ 3, whereas in the EIF model the first mode to become unstable has n = 1. Moreover, in the EIF network, the oscillatory rate instability for k = 0 has index n = 1, whereas it is n = 4 in the LIF network for the same parameters. As a consequence, the oscillations that emerge from the rate instability are slower in the EIF model than in the LIF model (40–70 Hz versus 90–120 Hz; compare Figure 9B with Figure 3B). Another difference is that the noise level at which all instabilities but the oscillatory rate instability have disappeared is ten times greater in the EIF than in the LIF. These differences can be understood by comparing the functions ΦN(ω) and ΦN(ω) + ΦS(ω) in the two models. These functions are plotted in Figure 8 for the EIF neuron (same parameters as in Figure 9A) for weak and strong noise. For low noise levels, the neuronal phase shift, ΦN(ω), displays sharp variations close to integer multiples of the average firing frequency ν0. This is similar to what happens in the LIF model. However, in contrast to the LIF model, where ΦN(ω) is close to 90 degrees at the peaks, here it is close to 180 degrees. Hence, for fixed synaptic time constants, solutions of equation 2.24 exist for lower values of n in the EIF model than in the LIF model. In particular, in the EIF model, solutions of equation 2.24 exist for k = 0, n = 1, and n = 2, while this is not the case in the LIF model. Similarly, one can understand why for high noise levels, the frequency of the oscillatory rate mode is typically smaller in the EIF than in the LIF: the function ΦN(ω) increases monotonically from 0 to 90 degrees in the EIF, whereas in the LIF it remains smaller than 45 degrees (Fourcaud-Trocmé et al., 2003; Geisler, Brunel, & Wang, 2005).
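As a concrete illustration of the model being analyzed, the single-neuron EIF dynamics of equation 3.1 can be integrated with the Euler-Maruyama scheme. The sketch below uses the parameters quoted in section 3.1; the DC input value and the cutoff voltage used to detect the spike-generating divergence are implementation choices, not values from the paper.

```python
import numpy as np

def simulate_eif(I0, sigma, t_max=1000.0, dt=0.01, seed=0,
                 tau_m=10.0, V_T=5.1, Delta_T=3.5, V_r=-3.0,
                 tau_ARP=1.7, V_cut=30.0):
    """Euler-Maruyama integration of a single EIF neuron (equation 3.1);
    voltages in mV, times in ms. V_cut is a numerical threshold used to
    detect the spike-generating divergence (an implementation choice)."""
    rng = np.random.default_rng(seed)
    V, refractory, spikes = V_r, 0.0, []
    for step in range(int(t_max / dt)):
        if refractory > 0.0:
            refractory -= dt        # hold at V_r during the refractory period
            continue
        drift = (-V + Delta_T * np.exp((V - V_T) / Delta_T) + I0) / tau_m
        V += drift * dt + sigma * np.sqrt(dt / tau_m) * rng.standard_normal()
        if V >= V_cut:              # divergence reached: register a spike
            spikes.append(step * dt)
            V, refractory = V_r, tau_ARP
    return np.array(spikes)

# I0 = 6 mV is a guess that gives firing in the tens of Hz at this noise
# level; it is not a value from the paper.
spikes = simulate_eif(I0=6.0, sigma=5.0)
print(f"mean rate over 1 s: {len(spikes)} Hz")
```

Tuning I0 so that the mean rate is near 30 Hz reproduces the operating point used throughout the network simulations discussed in the text.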
Finally, the sharp variations of ΦN(ω) at integer multiples of the frequency ν0 are more resistant to noise in the EIF neuron than in the LIF neuron; they disappear only at larger values of the noise. In fact, the clustering instabilities persist until the noise level is in the range of 1 to 2 mV, which is much larger than in the LIF model. The effects of changing the synaptic time constants or the spike sharpness parameter ΔT on the stability of the asynchronous state are depicted in Figure 10. This figure shows how the nature of the instability on the asynchronous state stability boundary depends on the noise level and on a global scaling factor, α, of the synaptic time constants. Three values of ΔT are considered: ΔT = 3.5 mV, ΔT = 1 mV, and ΔT = 0 (the LIF model, already shown in Figure 7, included for comparison). At low noise levels and fixed α, clustering instabilities occur with an index n that increases as ΔT decreases. This reflects the fact that the amplitude of the peaks of ΦN(ω) decreases with decreasing ΔT, so that solutions of equation 2.24 with small n disappear. This effect can be compensated for by increasing α, that is, by making the synapses slower. The n = 1 cluster instability for ΔT = 0 requires
Figure 10: The nature of the unstable modes on the stability boundary of the asynchronous state as a function of noise and a global scaling factor of the synaptic kinetics α. The results are displayed for three values of ΔT: ΔT = 0 (LIF model, thin lines and labels; see Figure 7); ΔT = 1 mV (intermediate lines and labels); ΔT = 3.5 mV (thick lines and labels). For clarity, the dotted line for ΔT = 3.5 mV is truncated for σ < 1 mV.
synapses three times slower than for ΔT = 3.5 mV. The cluster instabilities are more resistant to noise when ΔT increases (for the n = 1 instability, up to about 0.2 mV for ΔT = 0, 1 mV for ΔT = 1 mV, and 2 mV for ΔT = 3.5 mV), reflecting the fact that the peaks in ΦN(ω) are more resistant to noise for larger ΔT. Finally, the frequency of the firing rate oscillations that appear in the high-noise region also depends on ΔT: as ΔT increases, the frequency of these oscillations decreases. Frequencies in this regime range from 40 to 70 Hz for ΔT = 3.5 mV, from 60 to 90 Hz for ΔT = 1 mV, and from 90 to 120 Hz for ΔT = 0 (LIF model). 4 Network of Conductance-Based Neurons In this section, we study networks of conductance-based neurons in which the action potential dynamics involve sodium and potassium currents. Using numerical simulations, we characterize the instabilities by which
synchronous oscillations emerge in these models and compare the results with those presented above for the EIF and the LIF. In particular, we examine whether the finding that in the EIF the frequency of the rate oscillations increases with the sharpness of the spike initiation also holds in simple conductance-based models. 4.1 The Models. We consider single-compartment conductance-based models in which the membrane potential, V, of a neuron obeys the equation

C dV/dt = IL − Σion Iion + Irec(t) + Iext(t),  (4.1)
where C is the capacitance of the neuron, IL = −gL(V − VL) is the leak current, Σion Iion is the sum over all the voltage-dependent ionic currents, Iext is the external current, and Irec denotes the recurrent current received by the neuron. The voltage-dependent currents are an inactivating sodium current, INa = gNa m∞³ h (V − VNa), and a delayed rectifier potassium current, IK = gK n⁴ (V − VK). As in Wang and Buzsáki (1996), the activation of the sodium current is assumed to be instantaneous,

m∞(V) = αm(V) / [αm(V) + βm(V)],
while the kinetics of the gating variables h and n are given by (see, e.g., Hodgkin & Huxley, 1952)

dx/dt = αx(V)(1 − x) − βx(V) x,  (4.2)
with x = h, n. The functions αx and βx for the three models we consider are given in appendix D. For simplicity, we neglect the driving force in the synaptic interactions. Hence, the recurrent current has the form

Irec(t) = −(G/N) Σj=1..N Σk s(t − tjk),  (4.3)
where the function s(t) is given by equation 2.4 (in which τm = C/gL), and G, which has the dimension of a current density (mA/cm²), measures the strength of the synapses. The external input is also modeled as a noisy current,

Iext(t) = I0 + σ √(C gL) η(t),  (4.4)
where I0 is a constant DC input, η(t) is a white noise with zero mean and unit variance, and σ, which has the dimension of a voltage, measures the amplitude of the temporal fluctuations of the external input. 4.2 Characterization of the Instability of the Asynchronous State. To characterize the degree of synchrony in the network, we define the population-averaged membrane potential,

V̄(t) = (1/N) Σi=1..N Vi(t),  (4.5)
and the variance

σV² = ⟨[V̄(t)]²⟩t − [⟨V̄(t)⟩t]²  (4.6)
of its temporal fluctuations, where ⟨···⟩t denotes time averaging. After normalizing σV² by the population average of the variances of the single-cell membrane potentials, σVi² = ⟨[Vi(t)]²⟩t − [⟨Vi(t)⟩t]², one defines χ(N) (Hansel & Sompolinsky, 1992, 1996; Golomb & Rinzel, 1993, 1994; Ginzburg & Sompolinsky, 1994):

χ(N) = σV² / [(1/N) Σi=1..N σVi²],  (4.7)
which varies between 0 and 1. The central limit theorem implies that in the limit N → ∞, χ(N) behaves as

χ(N) = χ∞ + δχ/√N + O(1/N),  (4.8)
where χ∞ is the large N limit of χ(N) and δχ measures the finite size correction to χ at the leading order. For a given strength of the coupling,
the network is in the asynchronous state for sufficiently large noise. This means that χ∞ = 0. When the noise level decreases, the asynchronous state loses stability via a Hopf bifurcation, and synchrony appears: χ∞ > 0. Near the bifurcation, χ∞ behaves at the leading order as

χ∞ = 0  for σ > σc,  (4.9)
χ∞ = A(σc − σ)^γ  for σ < σc,  (4.10)
with an exponent γ = 1/2 (Kuramoto, 1984; van Vreeswijk, 1996). To locate the noise level at which the instability of the asynchronous state occurs for a given value of the synaptic coupling, we simulate networks of various sizes N. Comparing the values of χ(N) for the different N, we estimate δχ and χ∞ as functions of σ. The obtained values of χ∞ are then fitted according to equation 4.10. In a network of LIF neurons, for which the stability boundary of the asynchronous state can be computed analytically, we show in appendix C that this method gives an accurate prediction of the boundary for network sizes on the order of 1000. The results of this analysis for the Wang-Buzsáki network and two values of the synaptic coupling, G = 2 mA/cm² and G = 12 mA/cm², are plotted in Figure 11. For G = 2 mA/cm², δχ(σ) can be reliably estimated from simulations with N = 800 and N = 3200. For G = 12 mA/cm², the finite size effects are larger because the coupling is stronger, and a substantially better account of these effects requires simulations of larger networks (N = 1600 and N = 6400). This is shown in Figures 11B and 11C. Still, the estimates of σc obtained from the fits in these two figures are very close. Once the instability has been located, we compute the frequency of the population oscillations at the onset of the instability. To this end, we simulate the network at a noise level σ ≈ σc. A good estimate of the oscillation frequency is provided by the autocorrelation of the population average of the membrane potentials of the neurons. If the bifurcation at σc is supercritical, the frequency estimates on the two sides of the transition are similar. If the bifurcation is subcritical, they may differ significantly. In that case, the frequency of the unstable mode has to be determined in the vicinity of the transition, but on the side where the asynchronous state is still stable.
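The procedure just described (measure χ(N) at two sizes, remove the O(1/√N) correction of equation 4.8, extrapolate to χ∞, and fit the square-root law of equation 4.10) can be sketched as follows. The χ∞ values fed to the fit are synthetic, generated from σc = 7.65 mV and A = 0.5, roughly the fitted values of Figure 11C; they are not simulation output.

```python
import numpy as np

def chi(V):
    """Synchrony measure chi(N) of equation 4.7; V has shape (neurons, time)."""
    v_bar = V.mean(axis=0)                 # population average, equation 4.5
    var_pop = v_bar.var()                  # variance of its fluctuations, eq. 4.6
    var_single = V.var(axis=1).mean()      # population-averaged single-cell variance
    return var_pop / var_single

def chi_infinity(chi_small, n_small, chi_large, n_large):
    """Remove the O(1/sqrt(N)) finite-size correction of equation 4.8
    from measurements at two network sizes."""
    a_small, a_large = 1.0 / np.sqrt(n_small), 1.0 / np.sqrt(n_large)
    delta_chi = (chi_small - chi_large) / (a_small - a_large)
    return chi_small - delta_chi * a_small

def fit_transition(sigmas, chi_inf):
    """Fit chi_inf = A (sigma_c - sigma)^(1/2) (equation 4.10) below the
    transition by linearizing: chi_inf^2 is linear in sigma."""
    slope, intercept = np.polyfit(sigmas, np.asarray(chi_inf) ** 2, 1)
    return -intercept / slope, np.sqrt(-slope)   # sigma_c, A

# Synthetic chi_inf values generated from sigma_c = 7.65 mV, A = 0.5.
sigmas = np.array([6.5, 7.0, 7.5])
sigma_c, A = fit_transition(sigmas, 0.5 * np.sqrt(7.65 - sigmas))
print(f"sigma_c = {sigma_c:.2f} mV, A = {A:.2f}")   # prints sigma_c = 7.65 mV, A = 0.50
```

Squaring χ∞ turns the square-root law into a straight line, so the critical noise σc is simply the zero crossing of the linear fit.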
In fact, in the simulations reported below, the bifurcations were found to be supercritical. We checked on several examples that the frequency estimates thus obtained were only weakly sensitive to the size of the network for N > 800. Hence, in the results displayed below, the population frequency was estimated from simulations of networks with N = 800. The traces of two neurons, the membrane potential averaged over the whole network, and its autocorrelation in the vicinity of the onset of the asynchronous state instability are shown in Figure 12 for G = 12 mA/cm². The oscillations present in the population activity are not clearly reflected in the spiking activity at the single-cell level, which is highly
Figure 11: The measure of synchrony χ as a function of the noise in the Wang-Buzsáki model for two values of the coupling. In each panel, the finite size correction δχ is estimated by comparing the simulation results (circles and crosses) for two sizes of the network. Subtracting the finite size correction leads to estimates of χ∞ (squares) as a function of σ, which are fitted using equation 4.10 (solid lines). (A) G = 2 mA/cm², crosses: N = 800; circles: N = 3200. Fit: σc = 1.61 mV; A = 0.84 mV−1/2. (B) G = 12 mA/cm², crosses: N = 800; circles: N = 3200. Fit: σc = 7.68 mV; A = 0.48 mV−1/2. (C) G = 12 mA/cm², crosses: N = 1600; circles: N = 6400. Fit: σc = 7.62 mV; A = 0.5 mV−1/2. In all these simulations, τl = 1 ms, τr = 1 ms, τd = 6 ms. The DC part of the external input, I0, was tuned to obtain an average firing rate of the neurons of 30 ± 0.5 Hz near the onset of synchrony.
irregular. In contrast, the population average of the membrane potentials (or the population activity, not shown) and its autocorrelation reveal the existence of population oscillations at a frequency of about 70 Hz, compared with an average firing rate of the neurons of 29.5 Hz.
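The frequency readout used here, the dominant peak of the autocorrelation of the population-average voltage (cf. Figure 12C), can be sketched on a synthetic trace. The 8–25 ms lag window is an implementation choice matched to the roughly 40–125 Hz range relevant in this article; the 66 Hz signal below is fabricated for illustration, not simulation data.

```python
import numpy as np

def oscillation_frequency(v_bar, dt_ms, lag_min_ms=8.0, lag_max_ms=25.0):
    """Estimate the population oscillation frequency (Hz) from the dominant
    autocorrelation peak of the population-average voltage. The lag window
    (8-25 ms, i.e., roughly 40-125 Hz) is an implementation choice."""
    x = v_bar - v_bar.mean()
    ac = np.correlate(x, x, mode="full")[x.size - 1:]   # lags >= 0
    lo, hi = int(lag_min_ms / dt_ms), int(lag_max_ms / dt_ms)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return 1000.0 / (lag * dt_ms)

# Synthetic noisy 66 Hz oscillation standing in for the trace of Figure 12B
# (fabricated for illustration; not simulation data).
rng = np.random.default_rng(1)
dt = 0.5                                   # ms per sample
t = np.arange(4000) * dt                   # 2 s trace
v_bar = np.sin(2 * np.pi * 66.0 * t / 1000.0) + 0.2 * rng.standard_normal(t.size)
print(f"estimated frequency: {oscillation_frequency(v_bar, dt):.1f} Hz")
```

Restricting the search to a lag window excludes both the zero-lag peak and higher multiples of the period, so the estimate picks out the first dominant oscillation peak.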
Figure 12: The Wang-Buzsáki network near the onset of instability of the asynchronous state. N = 800, G = 12 mA/cm², σ = 7.75 mV. Synaptic time constants and delay as in Figure 11. The average firing rate of the neurons is 29.5 Hz. The coefficient of variation of the interspike interval distribution is 0.72. (A) Membrane potential of two neurons in the network. The noise masks the fast oscillations of the subthreshold membrane potential. (B) The fast oscillations of the membrane potential are revealed by averaging over many neurons (here, over all the neurons in the network). (C) Autocorrelation of the population-average membrane potential (averaged over 2 s). The frequency of the oscillations is 66 Hz.
4.3 The Stability Boundary of the Asynchronous State. We performed simulations for the three models described in appendix D. These models differ in the sharpness of the activation function of their sodium current. In each model, we varied the synaptic coupling strength and looked for the critical noise level, σc, at which the asynchronous state becomes unstable. The external input was adjusted together with G so that for σ near σc, the average firing rate of the neurons was 30 ± 0.5 Hz. Once the transition was located, the frequency of the population oscillations emerging at this transition was estimated as explained above. The results obtained for the Wang-Buzsáki model are summarized in Figure 9. The agreement with the predictions from the EIF model (see
Figure 13: Sodium activation curves and firing properties of the conductance-based models. In the three top panels, solid line: Wang-Buzsáki model; dashed line: model 1; dotted line: model 2. (A) Activation functions m∞(V). (B) Current-frequency (f-I) curves of the three models. (C) Action potentials of the three models. (D) Voltage traces in response to a constant current step. From left to right: model 1, Wang-Buzsáki model, and model 2.
section 3) is remarkably good, with more discrepancy at small coupling and therefore small σ. This suggests that for high noise and coupling strength, the instability is mainly determined by the synaptic time constants and delays and by the properties of the sodium current responsible for spike initiation. In the EIF model, the frequency of the unstable mode at instability onset depends on the parameter ΔT, which characterizes the sharpness of the spike initiation driven by the sodium current. In fact, we found in section 3 that the frequency increases when ΔT decreases. We also showed that the index n of the rate oscillatory instability increases when ΔT decreases. To test whether the spike initiation sharpness has similar effects in the conductance-based models, we simulated networks of neurons that differ from the WB model only in the function m∞. The activation functions of these models are given in appendix D and plotted in Figure 13. The slopes at half-height are 2.01 × 10⁻² mV⁻¹ (dashed line, model 1) and 3.44 × 10⁻² mV⁻¹ (dotted line, model 2), compared with 2.64 × 10⁻² mV⁻¹ in the WB model (solid line; see Figure 13A). The change in the activation curve has a substantial effect on the threshold of the f-I curve (see Figure 13B) and on the shape of the spikes (see Figure 13C).
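The slope at half-height quoted above can be extracted numerically from any activation curve. The Boltzmann form of m∞ below is a stand-in (the actual rate functions of the three models are in appendix D, not reproduced here); its slope parameter k = 9.5 mV is chosen only so that the half-height slope, which equals 1/(4k) for a Boltzmann function, comes out near the Wang-Buzsáki value of 2.64 × 10⁻² mV⁻¹.

```python
import numpy as np

def m_inf(V, V_half=-30.0, k=9.5):
    """Boltzmann activation curve; V_half and k are illustrative values,
    not the appendix D rate functions of the paper."""
    return 1.0 / (1.0 + np.exp(-(V - V_half) / k))

def slope_at_half_height(m, V_lo=-100.0, V_hi=50.0, n=300001):
    """Numerical slope dm/dV at the voltage where m crosses 1/2."""
    V = np.linspace(V_lo, V_hi, n)
    y = m(V)
    i = int(np.argmin(np.abs(y - 0.5)))
    return (y[i + 1] - y[i - 1]) / (V[i + 1] - V[i - 1])  # central difference

s = slope_at_half_height(m_inf)
print(f"slope at half-height: {s:.4f} mV^-1")   # prints 0.0263
```

Because the half-height slope is the single number the comparison between the three models turns on, it is a convenient scalar summary of spike initiation sharpness.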
Figure 14: (A) Stability boundary of the asynchronous state in the σ − G plane. (B) Frequency of the population oscillations near synchrony onset as a function of the noise. In both panels, solid line: Wang-Buzsáki model; dashed line: model 1; dotted line: model 2.
The stability boundary of the asynchronous state and the frequency of the population oscillations on this boundary are plotted for the three models in the σ − G plane in Figure 14. In all the models, for sufficiently weak noise or sufficiently weak coupling, the frequency is close to the average firing rate of the neurons, ν0 = 30 Hz. This indicates that in this limit, the index of the instability mode is n = 1 (with k = 0) for the three models. The frequency
Figure 15: Clustering in model 2 for G = 3 mA/cm² and σ = 1.2 mV. Synaptic time constants: τr = 1 ms, τd = 6 ms, τl = 1 ms. Size of the network: N = 1600. The pattern of synchrony corresponds to a smeared two-cluster state. The two neurons shown (two upper traces) fire in alternate periods in which they are nearly in phase and nearly in antiphase. Stars indicate spikes in antiphase. The maxima of the population-average voltage coincide with the action potential of at least one of the two neurons (lower traces).
increases with the noise. In the Wang-Buzsáki model and in model 1, the increase is smooth. Hence, the index of the instability remains n = 1 at large coupling, and the destabilization of the asynchronous state always occurs via the n = 1 rate oscillatory instability. At σ ≈ 3 mV (G ≈ 5 mA/cm²), the frequencies of the oscillations in the Wang-Buzsáki model and in model 1 begin to differ. The frequency of the oscillation is smaller in model 1 than in the Wang-Buzsáki model. This is consistent with our analysis of the EIF model, which predicts that in the high-noise regime, the frequency of the rate oscillatory mode should decrease as the spike initiation becomes less sharp (larger ΔT). In model 2, the population frequency jumps to a value close to 60 Hz at σ ≈ 1 mV (G ≈ 2.5 mA/cm²). Beyond this value, it increases smoothly with σ. This indicates that the index of the instability changes from n = 1 to n = 2 as the coupling (or, equivalently, the noise) increases and that the index of the rate oscillatory instability is n = 2. Just after the change in n, the instability leads to a two-cluster state. This is confirmed in Figure 15, where the membrane potentials of two neurons are plotted together with the population average of the membrane potential for G = 3 mA/cm². However, because of the local noise, neurons do not belong to the same cluster all the time but rather switch between the two clusters. Still, on average, at any time each cluster comprises half of the neurons in the network (not shown). This behavior and the fact that in the high-noise
Figure 16: The effect of synaptic kinetics on the pattern of synchrony near the onset of instability of the asynchronous state in model 2. The coupling strength is G = 1 mA/cm². The size of the network is N = 1600. The average frequency of the neurons is about 30 Hz. (Left) The control case (α = 1): the synaptic rise and decay times are τr = 1 ms and τd = 6 ms, and the synaptic latency is τl = 1 ms. The spikes of the two neurons are well synchronized, and both fire at almost every cycle of the oscillations of the population-averaged voltage. Noise: σ = 0.66 mV. (Right) Fast synapses (α = 0.5): τr = 0.5 ms, τd = 3 ms, τl = 0.5 ms. The pattern of synchrony corresponds to a smeared two-cluster state. The two neurons fire in alternate periods in which they are nearly in phase and nearly in antiphase (spikes indicated by a star). The maxima of the population-average voltage coincide with the action potential of at least one of the two neurons. Noise: σ = 0.57 mV.
regime the population oscillations are faster in model 2 than in the Wang-Buzsáki model are in line with the conclusions of our analysis of the LIF and EIF networks. Finally, we found that in the conductance-based models, fast synapses and short delays favor clustering, as predicted by Figure 10. An example is shown in Figure 16, where the voltage traces of two neurons are plotted together with the average membrane potential for two sets of synaptic time constants and delays. In the control condition (α = 1), the two neurons always tend to spike in synchrony, and their spikes coincide most of the time with the maximum of the population oscillation. For synapses and delays two times faster (α = 0.5), the two neurons alternate between periods of nearly in-phase and nearly antiphase spiking, while the spikes of the two neurons still coincide in general with the maxima of the oscillations of the population-average voltage. 5 Discussion Our study sheds new light on the instabilities of the asynchronous state in networks of inhibitory neurons in the presence of noisy external input. Previous
studies investigated the synchronization properties of inhibitory neurons in fully connected networks at zero noise or in the weak noise regime (Abbott & van Vreeswijk, 1993; Treves, 1993; van Vreeswijk, 1996; Gerstner, van Hemmen, & Cowan, 1996; Wang & Buzsáki, 1996; White et al., 1998; Neltner, Hansel, Mato, & Meunier, 2000), in sparsely connected networks in the absence of noise (Wang & Buzsáki, 1996; Golomb & Hansel, 2000), or in sparsely connected networks in the strong noise–strong coupling region (Brunel & Hakim, 1999; Tiesinga & José, 2000; Brunel & Wang, 2003). These studies found qualitatively distinct types of instabilities of the asynchronous state in the weak and strong noise regimes. This article shows how the two types of instabilities are related when the noise level is varied and investigates for the first time stochastic synchrony in the simpler framework of fully connected networks. 5.1 The Two Types of Eigenmodes in Large Neuronal Networks of Inhibitory Neurons. Our main result is the existence of two types of eigenmodes of the linearized dynamics around the asynchronous state that differ in how noise affects their stability. One type of mode can be unstable only if the noise is sufficiently small. Such an instability occurs when the neurons resonantly lock to the oscillatory synaptic input. When the noise is too strong, resonant locking is destroyed and these modes are stable. We termed these eigenmodes clustering modes because, when destabilized, they lead to clustering. Eigenmodes of the second type can be destabilized at weak noise and weak coupling, leading to clustering, but also at strong noise provided the coupling is sufficiently strong, leading to a coherent modulation of the firing probability of the neurons. In this regime, the network displays stochastic synchrony. We termed the eigenmodes undergoing this instability oscillatory rate modes.
Which of the clustering eigenmodes or the oscillatory rate modes becomes unstable first as the coupling strength increases depends on the noise level, the synaptic kinetics, and the intrinsic properties of the neurons. As a general rule, however, clustering eigenmodes are the first to become unstable for low noise levels and fast synapses. At sufficiently large noise, the oscillatory rate mode is the only remaining unstable mode. The transition between these two regimes of synchrony and the frequency of the stochastic oscillations depend on the synaptic and single-cell properties, as briefly discussed below. In particular, if the synapses are sufficiently fast, the frequency of the emerging oscillations can be significantly higher than the firing rate of the individual neurons (Brunel & Hakim, 1999; Brunel & Wang, 2003). Previous studies have shown that in networks of integrate-and-fire inhibitory neurons, a discrete spectrum of eigenmodes exists at zero noise (Abbott & van Vreeswijk, 1993; Treves, 1993; van Vreeswijk, 1996; Hansel & Mato, 2003). Abbott and van Vreeswijk (1993) showed that such eigenmodes become stable at very low values of noise, in a model with
phase noise and no latency. However, the existence of specific eigenmodes that display instabilities robust to noise has not been reported in those studies. 5.2 Beyond the Instability Line of the Asynchronous State. We also used numerical simulations to explore the dynamics of both LIF and Wang-Buzsáki networks in the synchronous regime beyond the stability boundary of the asynchronous state. A detailed description of the dynamics of the various synchronized states of the LIF networks is presented in appendix C. Using numerical simulations, we showed that the bifurcation leading to the stochastic synchronous state is supercritical, consistent with the results of Brunel and Hakim (1999). We also found multistability between different types of cluster states in the low-noise, low-coupling region, with cluster states disappearing one after the other as the noise level increases. 5.3 On the Role of Intrinsic Properties in Stochastic Synchrony. Theoretical studies have shown that the intrinsic properties of neurons have a strong influence on the stability of the asynchronous state in large neuronal networks at weak noise (Hansel et al., 1995; van Vreeswijk & Hansel, 2001; Ermentrout, Pascal, & Gutkin, 2001; Pfeuty et al., 2003, 2005). A key concept for grasping this influence is the phase response function, which characterizes the way a tonically firing neuron responds to small perturbations (Kuramoto, 1984; Ermentrout & Kopell, 1986; Hansel, Mato, & Meunier, 1993; Rinzel & Ermentrout, 1988). The shape of the phase response function depends on the intrinsic properties of the neurons. In conductance-based models, it is determined by the sodium, calcium, and potassium currents involved in the neuronal dynamics (Hansel et al., 1993, 1995; van Vreeswijk, Abbott, & Ermentrout, 1994; Ermentrout et al., 2001). Hence, the single-neuron dynamics can be related to the network dynamics.
This idea has been applied in the framework of simple integrate-and-fire models as well as in conductance-based models (for reviews, see Golomb, Hansel, & Mato, 2001; Mato, 2005). In the strong noise–strong coupling region, the phase response function is no longer relevant, and other approaches, such as the one used in this article, are required.

5.3.1 The Effect of Spike Initiation and Repolarization. Besides the synaptic time constants, we showed that the sharpness of the spike initiation is an important parameter that affects the quantitative features of stochastic synchronous oscillations at their onset (see also Geisler et al., 2005). In the EIF model, the frequency of the stochastic oscillations and the transition between the clustering and rate oscillation regions are strongly affected by the parameter ΔT. The sharpness of spike initiation greatly influences the noise amplitude at which the transition between the two modes of synchrony occurs. For LIF neurons (ΔT = 0), this transition occurs at very
Synchronization Properties of Inhibitory Networks
1101
low noise levels (below 0.1 mV). When the parameter ΔT increases, the transition moves to higher noise levels. For instance, it is larger than 1 mV when ΔT = 3.5 mV. Similarly, in the conductance-based model, we found a significant influence of the shape of the function m∞ on the frequency of the stochastic synchronous oscillations and the noise level required for their appearance. In contrast to the role of the spike initiation dynamics, our work suggests that the membrane potential repolarization dynamics following an action potential are much less critical for stochastic synchronous oscillations. In fact, for the quantitative features of the stochastic synchronous oscillation instability in the Wang-Buzsáki and the EIF models to be similar, it is sufficient to choose the parameter ΔT to match the spike initiation dynamics of the EIF model to those of the Wang-Buzsáki model.

5.3.2 The Effect of Subthreshold Active Currents on Stochastic Oscillations. We have focused on inhibitory networks of integrate-and-fire neurons and of specific HH-type neurons. In particular, the panoply of ionic currents in the conductance-based model we studied is limited to the standard fast sodium current and the delayed-rectifier potassium current. Synchronization properties of neurons with additional types of ionic currents, including those significantly activated in the subthreshold range, remain to be explored in the high-noise regime. However, we believe that the stochastic synchronous state should be robust to such additional currents, because they do not substantially affect the discharge modulation of neurons in response to oscillatory inputs at high frequency (Fourcaud-Trocmé et al., 2003).

5.4 The Effect of Temporal Correlations in the External Noisy Input. In this article, we have considered white noise. Colored noise modifies the phase lag of the firing-rate response of LIF neurons at high frequencies.
In this limit, the phase lag is 0 degrees for colored noise, compared to 45 degrees in the case of white noise (Brunel et al., 2001). One consequence is a larger population frequency in the stochastic synchronous state (Brunel & Wang, 2003). Note that in the particular case of synaptic currents with no latency, such a change in the properties of the noise can lead to a drastic change in the synchronization properties of the network. Indeed, with white noise, the network displays the stochastic synchronous instability in the high-noise regime, whereas for sufficiently colored noise, no such instability can be found, because in this case the sum of the synaptic and neuronal phase lags is bounded by 180 degrees. In the case of EIF and CB neurons, differences in the neuronal firing rate response between white and colored noise are much less important (Fourcaud-Trocmé et al., 2003), and therefore the synchronization properties of networks of such neurons should depend only weakly on the nature of the noise.
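The white-versus-colored distinction can be made concrete: colored noise is commonly modeled as an Ornstein-Uhlenbeck process, that is, white noise low-pass filtered with a finite correlation time. A minimal sketch (all parameter values illustrative):

```python
import numpy as np

def ou_noise(tau_n, sigma, dt, n_steps, rng):
    """Ornstein-Uhlenbeck process: white noise filtered with correlation
    time tau_n; its stationary distribution has variance sigma**2.
    tau_n -> 0 recovers (discretized) white noise."""
    eta = np.empty(n_steps)
    eta[0] = 0.0
    for i in range(1, n_steps):
        eta[i] = (eta[i - 1] * (1.0 - dt / tau_n)
                  + sigma * np.sqrt(2.0 * dt / tau_n) * rng.standard_normal())
    return eta

rng = np.random.default_rng(0)
eta = ou_noise(tau_n=5.0, sigma=1.0, dt=0.01, n_steps=400_000, rng=rng)
# After a transient, the empirical variance approaches sigma**2 = 1.
assert abs(np.var(eta[50_000:]) - 1.0) < 0.25
```

Feeding such a process (instead of white noise) into the neurons is what shifts the high-frequency phase lag of the rate response and hence the population frequency.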
For simplicity, we have also considered current-based rather than conductance-based inputs. Conductance-based inputs are expected to yield qualitatively similar results, though the frequency of the stochastic oscillation is known to increase as the input conductance of the neurons increases (Geisler et al., 2005).

5.5 Perspectives. In the models studied here, all the neurons have the same intrinsic properties and the connectivity is all-to-all. The addition of heterogeneities in cellular properties and in the external inputs would contribute to stabilizing the modes leading to clustering instabilities (Neltner et al., 2000; Hansel & Mato, 2001, 2003). If the heterogeneities are too strong, these modes are stable for any value of the coupling. This fragility of cluster states to heterogeneity is due to the fact that the appearance of such states depends on neuronal resonances at integer multiples of the firing rate; therefore, any heterogeneity leading to pronounced cell-to-cell variability of firing rates will destroy such states. In contrast, we conjecture that the instability of the rate oscillatory mode is very robust to heterogeneities, although the coupling strength at which it occurs is likely to depend on the level of heterogeneities. This robustness would be due to the fact that the synchrony in this regime no longer depends on resonant peaks at integer multiples of the single-cell firing rate. More work is required to confirm this conjecture. Brunel and Hakim (1999) found stochastic synchrony in a network of N LIF neurons, connected at random with an average of C ≫ 1 synapses per neuron, in the limit N → ∞ with C/N ≪ 1. Their analytical approach and the one used in this article are similar in spirit. However, an important difference is that when the connectivity is random, an additional noise term contributes to the synaptic inputs.
The study of the stability of the asynchronous state then requires knowing how oscillations in the variance of the inputs to a neuron affect the single-cell firing rate. This analysis can be performed when synaptic interactions are modeled as a delta function, and presynaptic neurons fire approximately as Poisson processes, as in Brunel and Hakim (1999). Unfortunately, this analysis does not generalize easily to situations in which a finite rise and decay time are present (Fourcaud & Brunel, 2002) and/or neurons fire in a significantly non-Poissonian fashion. Still, we expect that if the connectivity is very large, these fluctuations will not destroy the oscillatory rate instability. In contrast, if the connectivity is too sparse, the spatial fluctuations in the synaptic inputs that increase with the synaptic strength can prevent the oscillatory rate instability from occurring (see, e.g., Golomb & Hansel, 2000). This will happen if the connectivity C is smaller than some critical number that depends on the synaptic kinetics, the average firing rate of the neurons, and their intrinsic properties. The exact conditions for existence of the oscillatory rate instability in such sparse networks should also be clarified in future work.
Appendix A: Linear Stability Analysis of the Asynchronous State in LIF Networks

The asynchronous state described by equations 2.15 to 2.17 is stable if any small perturbation from it decays back to zero. In the mean field framework, a perturbation of the asynchronous state can be described by its effect on the average firing rate, on the PDF, and on the recurrent current, as follows:

ν(t) = ν0 + εν1(t),   (A.1)

P(V, t) = P0(V) + εP1(V, t),   (A.2)

Irec(t) = Jτm ν0 + εI1(t),   (A.3)

where ε ≪ 1, and ν1(t) and P1(V, t) are finite. Inserting these expressions in equations 2.9, 2.10, and 2.14, keeping only the leading order in ε, and looking for solutions P1, ν1, I1 ∝ exp(λt), with λ a complex number, yields

1 = −[Jν0 τm exp(−λτl) / (σ(1 + λτd)(1 + λτr)(1 + λτm))] × [∂U/∂y(yt, λ) − ∂U/∂y(yr, λ)] / [U(yt, λ) − U(yr, λ)],   (A.4)
where yt and yr are given in equations 2.16 and 2.17, and the function U(y, λ) is given in terms of combinations of hypergeometric functions (Abramowitz & Stegun, 1970):

U(y, λ) = [e^{y²} / Γ((1 + λτm)/2)] M((1 − λτm)/2, 1/2, −y²) + [2y e^{y²} / Γ(λτm/2)] M(1 − λτm/2, 3/2, −y²).   (A.5)
The asynchronous state is stable if, for all the solutions λ to this equation, Re(λ) < 0. Conversely, the existence of at least one solution with Re(λ) > 0 signals that the asynchronous state is unstable. Thus, the onset of instability of the asynchronous state in parameter space is determined by setting λ = iω in equation A.4. At this onset, a Hopf bifurcation occurs, and ω is the frequency of the oscillation of the network activity in the corresponding unstable mode. In cases where the Hopf bifurcation is supercritical, ω is also the frequency of the synchronous oscillations that emerge at the instability onset.
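The instability onset can be located numerically from equation A.4. The sketch below evaluates U (equation A.5) with the mpmath library (assumed available; mpmath.hyp1f1 is Kummer's function M, and mpmath.rgamma is the reciprocal gamma function, which is entire). A useful consistency check: at λ = 0 the second term of A.5 vanishes because rgamma(0) = 0, and M(1/2, 1/2, −y²) = e^{−y²}, so U(y, 0) = 1/√π for every y.

```python
import mpmath as mp

def U(y, lam, tau_m):
    """Equation A.5, written with the reciprocal gamma function rgamma
    so that the poles of Gamma become harmless zeros."""
    z = -y**2
    return (mp.exp(y**2) * mp.rgamma((1 + lam * tau_m) / 2)
            * mp.hyp1f1((1 - lam * tau_m) / 2, mp.mpf(1) / 2, z)
            + 2 * y * mp.exp(y**2) * mp.rgamma(lam * tau_m / 2)
            * mp.hyp1f1(1 - lam * tau_m / 2, mp.mpf(3) / 2, z))

def dU_dy(y, lam, tau_m):
    # Numerical derivative with respect to y at fixed lam.
    return mp.diff(lambda x: U(x, lam, tau_m), y)

# Consistency check: U(y, 0) = 1/sqrt(pi) independently of y.
assert abs(U(mp.mpf("0.7"), 0, 10.0) - 1 / mp.sqrt(mp.pi)) < 1e-10

# Bracketed factor of equation A.4 at a trial value lam = i*omega
# (yt, yr, tau_m chosen arbitrarily for illustration):
yt, yr, lam = mp.mpf(2), mp.mpf(-1), mp.mpc(0, 0.5)
ratio = ((dU_dy(yt, lam, 10.0) - dU_dy(yr, lam, 10.0))
         / (U(yt, lam, 10.0) - U(yr, lam, 10.0)))
```

Scanning ω along the imaginary axis and solving equation A.4 for the coupling then traces out an instability line of the kind shown in the phase diagrams.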
Appendix B: Scaling of the Critical Coupling Strength with Noise for Large Noise

When σ goes to infinity, an expansion of equation 2.15 yields

1/(ν0 τm) = [(Vt − Vr)/σ] Φ(yr) + [(Vt − Vr)²/(2σ²)] Φ′(yr) + · · ·,   (B.1)

where Φ(x) = √π exp(x²)(1 + erf(x)). To keep a finite rate ν0 as σ goes to ∞, yr must go to −∞ as yr ∼ −ln(σ). In addition,

[∂U/∂y(yt, λ) − ∂U/∂y(yr, λ)] / [U(yt, λ) − U(yr, λ)] ∼ |yr|

in the limit yr → −∞. Equation 2.23 then implies that the critical coupling strength goes as

Jc ∼ σ/ln(σ)
for large σ . Appendix C: Numerical Simulations of LIF Networks Simulations of LIF networks were performed at various levels of J and σ close to the predicted stability boundary of the asynchronous state for different network sizes (N = 1000, 2000, 4000). The methods are as described in section 4.2. We show in Figure 17 the resulting phase diagram for ν0 = 30 Hz, τl = 1 ms, τr = 1 ms, and τd = 6 ms. At low-coupling levels (J < 2 mV), the asynchronous state destabilizes through a subcritical bifurcation. There is a small parameter range where the asynchronous state coexists with the n = 3 cluster state. At low-noise levels, at least three types of cluster states coexist (with 1, 2, 3 clusters). The state that is selected by the network depends on initial conditions. These states destabilize successively through discontinuous transitions (for J = 1 mV, n = 1 destabilizes at σ = 0.16 mV, n = 2 destabilizes at 0.25 mV, n = 3 at 0.32 mV). Above a coupling level of about 2 mV, the asynchronous state destabilizes through a supercritical Hopf bifurcation to the stochastic synchronous state. The stochastic synchronous state coexists with the other cluster states in some range of noise amplitudes. As the coupling strength increases, the stochastic
[Figure 17 appears here: phase-diagram panels of the LIF network plotting total synaptic strength (mV) against noise amplitude (mV), with insets showing the synchrony measure χ and the CV as functions of noise.]
Figure 17: Beyond the linear stability analysis: phase diagram of the network of LIF neurons (ν0 = 30 Hz, τl = 1 ms, τr = 1 ms, τd = 6 ms) obtained with numerical simulations. Solid lines: instability lines obtained by analytical calculations (same as in Figure 3; for the sake of clarity, only the lines corresponding to n = 3 and n = 4 are shown). Filled triangles indicate the limit of stability of the asynchronous state, as obtained by numerical simulations (see the text for details). Other filled symbols: limits of stability of the n = 1 (circles), n = 2 (squares), and n = 3 (diamonds) cluster states. Open symbols: limits of stability of the firing rate oscillation (diamonds), n = 3 (squares), and n = 2 (circles) states. The insets show how the synchrony level χ and the CV (in both types of panels, triangles: asynchronous state and firing rate oscillation; circles: 1 cluster; squares: 2 clusters; diamonds: 3 clusters) vary in these various states as a function of noise at various coupling strengths (range indicated by dotted lines).
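The synchrony level χ and the CV shown in the insets can be computed from simulation output as follows. This is a sketch assuming the standard definitions (χ as the square root of the ratio between the temporal variance of the population-averaged signal and the mean single-neuron variance; CV as the coefficient of variation of the interspike intervals); the function names are ours.

```python
import numpy as np

def chi(V):
    """Synchrony measure from an (n_neurons, n_times) array of traces:
    chi^2 = Var_t[population average] / mean_i Var_t[V_i].
    chi -> 1 for full synchrony; chi -> 0 (as 1/sqrt(N)) for asynchrony."""
    var_pop = np.var(V.mean(axis=0))
    var_ind = np.var(V, axis=1).mean()
    return np.sqrt(var_pop / var_ind)

def cv(spike_times):
    """Coefficient of variation of one neuron's interspike intervals:
    CV = 0 for a perfectly periodic train, CV ~ 1 for Poisson-like firing."""
    isi = np.diff(np.sort(np.asarray(spike_times)))
    return isi.std() / isi.mean()

rng = np.random.default_rng(1)
common = rng.standard_normal(2000)
V_sync = np.tile(common, (50, 1))            # 50 identical neurons
V_async = rng.standard_normal((50, 2000))    # 50 independent neurons
assert chi(V_sync) > 0.99
assert chi(V_async) < 0.3
assert cv(np.arange(0.0, 10.0, 0.1)) < 1e-10
```

Because χ decreases like 1/√N in the asynchronous state, it is a convenient order parameter for diagnosing the states appearing in the figure.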
synchronous state merges successively with the n = 3 cluster state (around 8 mV), the n = 2 cluster state (above 12 mV), and finally the n = 1 cluster state (above 30 mV). Thus, at high coupling strengths (see top panel for a coupling strength of 100 mV), there is a single synchronous state that varies continuously from the fully synchronized state at σ = 0 mV (population frequency equal to firing rate) to the stochastic synchronous state close to the bifurcation at σ = 5 mV (population frequency about 90 Hz, much larger than single-cell firing rate, CV close to 1). The phase diagram
shown in Figure 17 is representative of networks with short synaptic time constants, when the population firing rate in the high-noise region is larger than the single-cell firing rate, though the specifics of which cluster states are obtained depend on parameters. On the other hand, for large synaptic time constants, when the population frequency at high noise is of the same order as or smaller than the single-cell firing rate and the destabilization of the asynchronous state occurs exclusively on the n = 1 instability line, the phase diagram simplifies drastically: the asynchronous state destabilizes at any J through a supercritical bifurcation, a scenario similar to that at high coupling strengths in Figure 17.

Appendix D: The Conductance-Based Models

In all three conductance-based models dealt with in this work, the activation and inactivation functions of the potassium and sodium currents are as in the model of Wang and Buzsáki (1996):

αm(V) = 0.1(V + 35) / (1 + e^{−(V+35)/10}),   (D.1)

βm(V) = 4 e^{−(V+60)/18},   (D.2)

αn(V) = 0.03 (V + 34) / (1 − e^{−(V+34)/10}),   (D.3)

βn(V) = 0.375 e^{−(V+44)/80},   (D.4)

αh(V) = 0.21 e^{−(V+58)/20},   (D.5)

βh(V) = 3 / (1 + e^{−(V+28)/10}).   (D.6)
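Equations D.1 to D.6 translate directly into code. The sketch below defines the rate functions as printed (voltages in mV) and the steady-state activation m∞(V) = αm(V)/(αm(V) + βm(V)), whose sharpness is what distinguishes models 1 and 2 from the Wang-Buzsáki model; the function names are ours.

```python
import numpy as np

# Rate functions of the Wang-Buzsaki model, equations D.1 to D.6 (V in mV).
def alpha_m(V): return 0.1 * (V + 35.0) / (1.0 + np.exp(-(V + 35.0) / 10.0))
def beta_m(V):  return 4.0 * np.exp(-(V + 60.0) / 18.0)
def alpha_n(V): return 0.03 * (V + 34.0) / (1.0 - np.exp(-(V + 34.0) / 10.0))
def beta_n(V):  return 0.375 * np.exp(-(V + 44.0) / 80.0)
def alpha_h(V): return 0.21 * np.exp(-(V + 58.0) / 20.0)
def beta_h(V):  return 3.0 / (1.0 + np.exp(-(V + 28.0) / 10.0))

def m_inf(V):
    """Steady-state sodium activation: the sigmoid of V whose slope at its
    inflection point is varied across the three models."""
    return alpha_m(V) / (alpha_m(V) + beta_m(V))

V = np.linspace(-30.0, 20.0, 101)
m = m_inf(V)
assert np.all(np.diff(m) > 0)   # m_inf rises monotonically toward 1
assert m[-1] > 0.95             # nearly fully activated at depolarized V
```

Models 1 and 2 are obtained by swapping in their own alpha_m and beta_m (equations D.7 to D.10) while keeping the other rate functions unchanged.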
In order to study how the activation of the sodium current affects the instability of the asynchronous state and the frequency of the population oscillations at synchrony onset, we also considered two other models that differ from the Wang-Buzsáki model in the sharpness of the sigmoidal function m∞(V) (see also Figure 13). In model 1:

αm(V) = 0.1(V + 30) / (1 + e^{−(V+30)/10}),   (D.7)

βm(V) = 4 e^{−(V+55)/12}.   (D.8)
Hence, the slope of the activation curve at the inflexion point of the sigmoid is smaller in model 1 than in the Wang-Buzs´aki model (see also Figure 13A, dashed line).
In model 2:

αm(V) = 0.1(V + 35) / (1 + e^{−(V+35)/20}),   (D.9)

βm(V) = 4 e^{−(V+47.4)/18}.   (D.10)
In this model the activation curve is sharper than in the Wang-Buzsáki model (see also Figure 13A, dotted line). In all three models, gNa = 35 mS/cm2, VNa = 55 mV, VK = −90 mV, gL = 0.1 mS/cm2, VL = −65 mV, and C = 1 µF/cm2. In particular, the passive membrane time constant of the neuron is τm = C/gL = 10 msec, as in Wang and Buzsáki (1996).

Acknowledgments

We thank Alex Roxin and Carl van Vreeswijk for careful reading of the manuscript.

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in a network of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Abramowitz, M., & Stegun, I. A. (1970). Tables of mathematical functions. New York: Dover.
Amit, D. J., & Brunel, N. (1997). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252.
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural network retrieving at low spike rates I: Substrate–spikes, rates and neuronal gain. Network, 2, 259–274.
Anderson, J. S., Lampl, I., Gillespie, D. C., & Ferster, D. (2000). The contribution of noise to contrast invariance of orientation tuning in cat visual cortex. Science, 290, 1968–1972.
Bragin, A., Jando, G., Nadasdy, Z., Hetke, J., Wise, K., & Buzsáki, G. (1995). Gamma (40–100 Hz) oscillation in the hippocampus of the behaving rat. J. Neurosci., 15, 47–60.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8, 183–208.
Brunel, N., Chance, F., Fourcaud, N., & Abbott, L. (2001). Effects of synaptic noise and filtering on the frequency response of spiking neurons. Phys. Rev. Lett., 86, 2186–2189.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Comp., 11, 1621–1671.
Brunel, N., & Wang, X.-J. (2003). What determines the frequency of fast network oscillations with irregular neural discharges? J. Neurophysiol., 90, 415–430.
Buzsáki, G., Urioste, R., Hetke, J., & Wise, K. (1992). High frequency network oscillation in the hippocampus. Science, 256, 1025–1027.
Compte, A., Constantinidis, C., Tegnér, J., Raghavachari, S., Chafee, M., Goldman-Rakic, P. S., & Wang, X.-J. (2003). Temporally irregular mnemonic persistent activity in prefrontal neurons of monkeys during a delayed response task. J. Neurophysiol., 90, 3441–3454.
Csicsvari, J., Hirase, H., Czurko, A., Mamiya, A., & Buzsáki, G. (1999a). Fast network oscillations in the hippocampal CA1 region of the behaving rat. J. Neurosci., 19, RC20.
Csicsvari, J., Hirase, H., Czurko, A., Mamiya, A., & Buzsáki, G. (1999b). Oscillatory coupling of hippocampal pyramidal cells and interneurons in the behaving rat. J. Neurosci., 19, 274–287.
Destexhe, A., & Paré, D. (1999). Impact of network activity on the integrative properties of neocortical pyramidal neurons in vivo. J. Neurophysiol., 81, 1531–1547.
Ermentrout, G. B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., 46, 233–253.
Ermentrout, B., Pascal, M., & Gutkin, B. (2001). The effects of spike frequency adaptation and negative feedback on the synchronization of neural oscillators. Neural Comput., 13, 1285–1310.
Fourcaud, N., & Brunel, N. (2002). Dynamics of firing probability of noisy integrate-and-fire neurons. Neural Computation, 14, 2057–2110.
Fourcaud-Trocmé, N., Hansel, D., van Vreeswijk, C., & Brunel, N. (2003). How spike generation mechanisms determine the neuronal response to fluctuating inputs. J. Neurosci., 23, 11628–11640.
Fuhrmann, G., Markram, H., & Tsodyks, M. (2002). Spike frequency adaptation and neocortical rhythms. J. Neurophysiol., 88, 761–770.
Geisler, C., Brunel, N., & Wang, X.-J. (2005). Contributions of intrinsic membrane dynamics to fast network oscillations with irregular neuronal discharges. J. Neurophysiol., 94, 4344–4361.
Gerstner, W., van Hemmen, L., & Cowan, J. (1996). What matters in neuronal locking? Neural Computation, 8, 1653–1676.
Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neural networks. Phys. Rev. E, 50, 3171–3191.
Golomb, D., & Hansel, D. (2000). The number of synaptic inputs and the synchrony of large sparse neuronal networks. Neural Computation, 12, 1095–1139.
Golomb, D., Hansel, D., & Mato, G. (2001). Theory of synchrony of neuronal activity. In S. Gielen & F. Moss (Eds.), Handbook of biological physics. Dordrecht: Elsevier.
Golomb, D., Hansel, D., Shraiman, D., & Sompolinsky, H. (1992). Clustering in globally coupled phase oscillators. Phys. Rev. A, 45, 3516–3530.
Golomb, D., & Rinzel, J. (1993). Dynamics of globally coupled inhibitory neurons with heterogeneity. Phys. Rev. E, 48, 4810–4814.
Golomb, D., & Rinzel, J. (1994). Clustering in globally coupled inhibitory neurons. Physica D, 72, 259–282.
Hansel, D., & Mato, G. (2001). Existence and stability of persistent states in large neuronal networks. Phys. Rev. Lett., 86, 4175–4178.
Hansel, D., & Mato, G. (2003). Asynchronous states and the emergence of synchrony in large networks of interacting excitatory and inhibitory neurons. Neural Comp., 15, 1–56.
Hansel, D., Mato, G., & Meunier, C. (1993). Phase dynamics for weakly coupled Hodgkin-Huxley neurons. Europhys. Lett., 23, 367–370.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Computation, 7, 307–337.
Hansel, D., & Sompolinsky, H. (1992). Synchronization and computation in a chaotic neural network. Phys. Rev. Lett., 68, 718–721.
Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Computational Neurosci., 3, 7–34.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conductance and excitation in nerve. J. Physiol., 117, 500–544.
Hormuzdi, S. G., Pais, I., LeBeau, F. E., Towers, S. K., Rozov, A., Buhl, E. H., Whittington, M. A., & Monyer, H. (2001). Impaired electrical signaling disrupts gamma frequency oscillations in connexin 36-deficient mice. Neuron, 31, 487–495.
Kuramoto, Y. (1984). Chemical oscillations, waves and turbulence. New York: Springer-Verlag.
Lapicque, L. (1907). Recherches quantitatives sur l'excitabilité électrique des nerfs traitée comme une polarisation. J. Physiol. Pathol. Gen., 9, 620–635.
Mato, G. (2005). Theory of neural synchrony. In C. Chow, B. Gutkin, D. Hansel, C. Meunier, & J. Dalibard (Eds.), Methods and models in neurophysics. Dordrecht: Elsevier.
McCormick, D., Connors, B., Lighthall, J., & Prince, D. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons in the neocortex. J. Neurophysiol., 54, 782–806.
Neltner, L., Hansel, D., Mato, G., & Meunier, C. (2000). Synchrony in heterogeneous neural networks. Neural Comp., 12, 1607–1641.
Pfeuty, B., Golomb, D., Mato, G., & Hansel, D. (2003). Electrical synapses and synchrony: The role of intrinsic currents. J. Neurosci., 23, 6280–6294.
Pfeuty, B., Mato, G., Golomb, D., & Hansel, D. (2005). The combined effects of inhibitory and electrical synapses in synchrony. Neural Comp., 17, 633–670.
Ricciardi, L. M. (1977). Diffusion processes and related topics on biology. Berlin: Springer-Verlag.
Rinzel, J., & Ermentrout, G. B. (1998). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 251–291). Cambridge, MA: MIT Press.
Siapas, A. G., & Wilson, M. A. (1998). Coordinated interactions between hippocampal ripples and cortical spindles during slow-wave sleep. Neuron, 21, 1123–1128.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Tiesinga, P. H., & Jose, J. V. (2000). Robust gamma oscillations in networks of inhibitory hippocampal interneurons. Network, 11, 1–23.
Treves, A. (1993). Mean-field analysis of neuronal spike dynamics. Network, 4, 259–284.
Troyer, T. W., & Miller, K. D. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Computation, 9, 971–983.
Tsodyks, M. V., Mit'kov, I., & Sompolinsky, H. (1993). Pattern of synchrony in inhomogeneous networks of oscillators with pulse interactions. Phys. Rev. Lett., 71, 1280–1283.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54, 5522–5537.
van Vreeswijk, C., Abbott, L., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci., 1, 313–321.
van Vreeswijk, C., & Hansel, D. (2001). Patterns of synchrony in neural networks with spike adaptation. Neural Computation, 13, 959–992.
Wang, X.-J., & Buzsáki, G. (1996). Gamma oscillation by synaptic inhibition in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413.
White, J. A., Chow, C. C., Soto-Treviño, C., & Kopell, N. (1998). Synchronization and oscillatory dynamics in heterogeneous, mutually inhibited neurons. J. Comput. Neurosci., 5, 5–16.
Received May 19, 2005; accepted August 23, 2005.
LETTER
Communicated by Bard Ermentrout
Analysis of Synchronization Between Two Modules of Pulse Neural Networks with Excitatory and Inhibitory Connections Takashi Kanamaru [email protected] Department of Basic Engineering in Global Environment, Faculty of Engineering, Kogakuin University, 2665-1 Nakano, Hachioji, Tokyo 192-0015, Japan
To study the synchronized oscillations among distant neurons in the visual cortex, we analyzed the synchronization between two modules of pulse neural networks using the phase response function. It was found that the intermodule connections from excitatory to excitatory ensembles tend to stabilize the antiphase synchronization and that the intermodule connections from excitatory to inhibitory ensembles tend to stabilize the in-phase synchronization. It was also found that the intermodule synchronization was more noticeable when the inner-module synchronization was weak.

1 Introduction

The average behavior of neurons often shows synchronized oscillations in many areas of the brain—for example, in the visual cortex (Gray & Singer, 1989), the hippocampus (Buzsáki, Horváth, Urioste, Hetke, & Wise, 1992; Bragin et al., 1995), the auditory neocortex (Traub, Bibbig, LeBeau, Cunningham, & Whittington, 2005), and the entorhinal cortex (Cunningham, Davies, Buhl, Kopell, & Whittington, 2003), and they have attracted considerable attention in the past 20 years. Synchronized oscillations with gamma frequency (20–70 Hz) among nearby neurons with overlapping receptive fields have been observed in the visual cortex. Moreover, when correlated visual stimulations were presented, synchronized oscillations were observed even among distant neurons that had nonoverlapping receptive fields and were separated by 7 mm. Based on such observations, it was proposed that the correlations among neuronal activities might be related to the binding of visual information (for reviews, see Gray, 1994). Several mechanisms likely contribute to the generation of such synchronized oscillations in the visual cortex. First, the lateral geniculate nucleus (LGN) often provides oscillating inputs to the visual cortex. However, the range of projections from the LGN cannot explain the cortical synchronization among distant neurons.
Therefore, the synchronized oscillations in the visual cortex are thought to be generated by an intracortical mechanism, not by oscillating inputs from the LGN (Gray & Singer, 1989). However, it
Neural Computation 18, 1111–1131 (2006)
© 2006 Massachusetts Institute of Technology
1112
T. Kanamaru
is unknown whether the oscillations are caused by the properties of single neurons or by intracortical network interactions. As for the theory that the oscillations are caused by the properties of single neurons, it was reported that chattering cells in the visual cortex show periodic bursts of gamma frequency, which might be related to the generation of oscillatory responses (Gray & McCormick, 1996). On the other hand, physiological evidence supports the theory that the oscillations are generated by intracortical network interactions (Jagadeesh, Gray, & Ferster, 1992; Gray, 1994). In the hippocampus, it was reported that the network that contains inhibitory neurons contributes to the generation of oscillations (Buzsáki et al., 1992; Whittington, Traub, & Jefferys, 1995; Fisahn, Pike, Buhl, & Paulsen, 1998). Concerning the generation of synchronized oscillations in the neuronal network, we have been studying pulse neural networks composed of excitatory and inhibitory neurons. In previous studies, the dynamics of a single network module were analyzed using the Fokker-Planck equation, and various synchronized firings were found depending on the values of the parameters (Kanamaru & Sekine, 2004, 2005). Such synchronized firings might be related to the synchronized oscillations among nearby neurons. In the study reported here, in order to elucidate the mechanism of synchronized oscillations among distant neurons, we analyzed the synchronization between two modules of networks, in which each module was composed of excitatory neurons and inhibitory neurons. Ermentrout and Kopell (1998) analyzed a similar network of two modules, each of which contained an excitatory cell (E-cell) and an inhibitory cell (I-cell). The E-cell and I-cell each represented populations of neurons, and their dynamics obeyed the equations for the spiking neuron model. Therefore, the neurons in each population were assumed to be perfectly synchronized.
However, when the neurons in each module are not perfectly synchronized but are partially synchronized (van Vreeswijk, 1996), their analysis cannot hold because each neuron in a module receives many pulses from other neurons in that module and from neurons in the other module. In our model, perfect synchronization is not realized because of noise; therefore, probabilistic representations are required to describe the dynamics of each module. In section 2, the definition of a module of pulse neural network is given, and its dynamics are analyzed using the Fokker-Planck equation. Some examples of synchronized firings are presented. In section 3, a system with two modules of networks is introduced, and the intermodule synchronization is analyzed using the phase response function (Kuramoto, 1984; Ermentrout & Kopell, 1991; Ermentrout, 1996; Ermentrout, Pascal, & Gutkin, 2001; Nomura, Fukai, & Aoyagi, 2003). As a result, it was found that the intermodule connections from excitatory to excitatory ensembles tend to stabilize the antiphase synchronization, and the intermodule connections from excitatory to inhibitory ensembles tend to stabilize the in-phase synchronization. Moreover, it was found that the intermodule synchronization
Synchronization Between Two Modules of Pulse Neural Networks
1113
is more noticeable when the inner-module synchronization is weak. The final section provides a discussion and conclusions.

2 One-Module System

In this section, we consider a module of a pulse neural network composed of excitatory neurons with internal states θE^(i) (i = 1, 2, . . . , NE) and inhibitory neurons with internal states θI^(i) (i = 1, 2, . . . , NI) that are written as

θ̇E^(i) = (1 − cos θE^(i)) + (1 + cos θE^(i)) × (rE + ξE^(i)(t) + gEE IE(t) − gEI II(t)),   (2.1)

θ̇I^(i) = (1 − cos θI^(i)) + (1 + cos θI^(i)) × (rI + ξI^(i)(t) + gIE IE(t) − gII II(t)),   (2.2)

IX(t) = (1/NX) Σ_{i=1}^{NX} δ(θX^(i) − π),   (2.3)

⟨ξX^(i)(t) ξY^(j)(t′)⟩ = D δXY δij δ(t − t′),   (2.4)

where X, Y = E or I, gXY is the connection strength from ensemble Y to ensemble X, rE and rI are system parameters, and δXY and δij are Kronecker's deltas. IX(t) is the synaptic input from ensemble X, and ξX^(i)(t) is the noise in the ith neuron of ensemble X. In the following, we call this network with excitatory and inhibitory ensembles the one-module system. The dynamics of this one-module system are nearly identical to the dynamics of the pulse-coupled active rotators analyzed by Kanamaru and Sekine (2005). Therefore, we will briefly describe it here. Note that the model of neurons with θ̇ = (1 − cos θ) + (1 + cos θ)r is the canonical model of class 1 neurons (Ermentrout & Kopell, 1986; Ermentrout, 1996), and arbitrary class 1 neurons near their bifurcation points can be transformed into the canonical model. The canonical model was previously extended to the network of weakly connected class 1 neurons (Hoppensteadt & Izhikevich, 1997; Izhikevich, 1999), and the system governed by equations 2.1, 2.2, and 2.3 has the form of this canonical model of weakly connected class 1 neurons. Thus, networks of arbitrary, weakly and globally connected class 1 neurons can be transformed into the above form with an appropriate change of variables. Here we restrict the parameters so that the system parameters rE and rI and the noise intensity D are uniform in the network. Moreover, the restrictions gEE = gII ≡ gint = 4 and gEI = gIE ≡ gext are placed on the connection strengths for simplicity.
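Equations 2.1 to 2.4 can be integrated with a straightforward Euler-Maruyama scheme. In the sketch below, the delta functions in IX(t) are approximated by counting threshold crossings per time step, with a factor 1/2 because θ̇ = 2 at θ = π (cf. equation 2.6); all parameter values are illustrative rather than taken from the figures.

```python
import numpy as np

def simulate(NE=200, NI=200, rE=-0.025, rI=-0.05, g_int=4.0, g_ext=3.0,
             D=0.005, dt=0.01, n_steps=2000, seed=0):
    """Euler-Maruyama integration of the one-module system (eqs. 2.1-2.4).
    A neuron fires when theta crosses pi; I_X(t) is approximated by the
    fraction of ensemble X crossing pi in the step, divided by 2*dt."""
    rng = np.random.default_rng(seed)
    thE = rng.uniform(-np.pi, 0.0, NE)
    thI = rng.uniform(-np.pi, 0.0, NI)
    IE = II = 0.0
    spike_counts = []
    for _ in range(n_steps):
        xiE = rng.standard_normal(NE) * np.sqrt(D / dt)   # discretized white noise
        xiI = rng.standard_normal(NI) * np.sqrt(D / dt)
        dE = (1 - np.cos(thE)) + (1 + np.cos(thE)) * (rE + xiE + g_int * IE - g_ext * II)
        dI = (1 - np.cos(thI)) + (1 + np.cos(thI)) * (rI + xiI + g_ext * IE - g_int * II)
        thE_new, thI_new = thE + dt * dE, thI + dt * dI
        firedE, firedI = thE_new >= np.pi, thI_new >= np.pi
        IE, II = firedE.mean() / (2 * dt), firedI.mean() / (2 * dt)
        spike_counts.append(int(firedE.sum()))
        thE = np.where(firedE, thE_new - 2 * np.pi, thE_new)   # wrap past the spike
        thI = np.where(firedI, thI_new - 2 * np.pi, thI_new)
    return np.array(spike_counts)

spike_counts = simulate()
assert spike_counts.shape == (2000,) and np.all(spike_counts >= 0)
```

Dividing spike_counts by NE·dt gives an estimate of the instantaneous firing rate of the excitatory ensemble in such a simulation.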
T. Kanamaru
In the absence of the noise ξ_X^{(i)}(t) and the synaptic input I_X(t), a single neuron shows self-oscillation for r_X > 0. For r_X < 0, this neuron becomes an excitable system with a stable equilibrium written as

$$\theta_0 = -\arccos\left(\frac{1 + r_X}{1 - r_X}\right), \qquad (2.5)$$

in which θ_0 is close to zero for r_X ∼ 0. We define the firing time of the neuron as the time at which θ_X^{(i)} exceeds π, because π is far from θ_0 (∼ 0). Note that the relation θ̇_E^{(i)} = θ̇_I^{(i)} = 2 > 0 holds at θ = π independent of the synaptic input I_X(t) and the noise ξ_X^{(i)}(t); therefore, the firing of the neuron can be defined naturally. In the following, we use parameter values with r_X < 0, and we consider the dynamics of networks of excitable neurons. Note that the synaptic input I_X(t) from ensemble X can be rewritten as

$$I_X(t) = \frac{1}{2 N_X} \sum_{i=1}^{N_X} \sum_j \delta\bigl(t - t_j^{(i)}\bigr), \qquad (2.6)$$
where t_j^{(i)} is the jth firing time of the ith neuron in ensemble X. In the limit of N_E, N_I → ∞, the average behavior of the neurons in the system can be analyzed with the Fokker-Planck equations, which describe the development of the probability density of the system over time, as shown in appendix A. Notably, asynchronous firings and synchronized firings of neurons in the network correspond to a stationary solution and a time-varying solution of the Fokker-Planck equations, respectively. With the Fokker-Planck equations, a bifurcation set can be obtained numerically by the method shown in appendix B, and the bifurcation set for the parameters r_E = −0.025 and r_I = −0.05 is shown in Figure 1. Generally, synchronized firings of neurons are observed when the chosen values of the noise intensity D and the connection strength g_ext lie in the area enclosed by the SNL (saddle-node-on-limit-cycle) and Hopf bifurcation lines (see Figure 1). For more detailed information about the bifurcation, see Kanamaru and Sekine (2005). Typical synchronized firings of neurons in a one-module system are shown in Figure 2. The change in the probability flux J_E (defined in appendix A) at θ_E = π over time, for various values of D and g_ext/g_int, is shown in Figures 2A, 2C, and 2E. Note that the probability flux J_E can be interpreted as the instantaneous firing rate of the excitatory ensemble. The raster plots of the firing times of the excitatory neurons in the system with N_E = N_I = 1000 are shown in Figures 2B, 2D, and 2F. As shown in Figures 2A and 2B, in cases where D and g_ext/g_int are near the saddle-node-on-limit-cycle bifurcation, the synchronized firings of neurons have strong correlations and long periods because the system stays a long time in the
Synchronization Between Two Modules of Pulse Neural Networks

[Figure 1: Bifurcation set in the (D, g_ext) plane for r_E = −0.025 and r_I = −0.05. SN, saddle node; SNL, saddle-node-on-limit cycle; DLC, double limit cycle.]
area where the original saddle and node existed. As shown in Figures 2C and 2D, in cases where D and g_ext/g_int are near the Hopf bifurcation, the synchronized firings of neurons have weak correlations and high frequencies because a limit cycle that corresponds to these synchronized firings is created around the stable equilibrium with large probability fluxes. The synchronized firings of neurons shown in Figures 2E and 2F are weakly synchronized periodic firings (Kanamaru & Sekine, 2004), in which only a small percentage of the neurons fire in each period. Such firings are realized when the peak value of the probability flux J_E is very small, as shown in Figure 2E, and each neuron receives subthreshold periodic inputs. We assume that these weakly synchronized periodic firings may be related to the physiologically observed synchronized firings because the degree of synchronization of the latter is also weak (Gray & Singer, 1989; Buzsáki et al., 1992; Fisahn et al., 1998). However, in the physiological environment, the properties of single neurons are not uniform and the structures of the networks are more complex. Therefore, more detailed theoretical analyses are required to validate the presence of neurons with weakly synchronized periodic firings in physiological environments.

3 Two-Module System

In this section, to study the mechanism of the synchronized oscillations among distant neurons, we consider the two-module system in which the
[Figure 2: Synchronized firings of neurons in the one-module system in the case where r_E = −0.025 and r_I = −0.05. (A, C, E) Change in the probability flux J_E over time at θ_E = π. (B, D, F) Raster plots of the firing times of the excitatory neurons in the system with N_E = N_I = 1000. (A, B) Synchronized firings where D and g_ext/g_int are near the saddle-node-on-limit-cycle bifurcation (D = 0.005, g_ext/g_int = 0.5, g_int = 4). (C, D) Synchronized firings where D and g_ext/g_int are near the Hopf bifurcation (D = 0.02, g_ext/g_int = 0.5, g_int = 4). (E, F) Weakly synchronized periodic firings (D = 0.005, g_ext/g_int = 1.5, g_int = 4).]
internal states of the neurons are defined as

$$\dot\theta_{E_k}^{(i)} = (1 - \cos\theta_{E_k}^{(i)}) + (1 + \cos\theta_{E_k}^{(i)}) \times \bigl(r_{E_k} + \xi_{E_k}^{(i)}(t) + g_{E_k E_k} I_{E_k}(t) - g_{E_k I_k} I_{I_k}(t) + \epsilon_{E_k E_l} I_{E_l}(t) - \epsilon_{E_k I_l} I_{I_l}(t)\bigr), \qquad (3.1)$$

$$\dot\theta_{I_k}^{(i)} = (1 - \cos\theta_{I_k}^{(i)}) + (1 + \cos\theta_{I_k}^{(i)}) \times \bigl(r_{I_k} + \xi_{I_k}^{(i)}(t) + g_{I_k E_k} I_{E_k}(t) - g_{I_k I_k} I_{I_k}(t) + \epsilon_{I_k E_l} I_{E_l}(t) - \epsilon_{I_k I_l} I_{I_l}(t)\bigr), \qquad (3.2)$$

$$l \equiv 3 - k, \qquad (3.3)$$
where k = 1, 2 indexes the first and second modules, respectively. For simplicity, we set the inner-module connection strengths as g_{X_k Y_k} = g_XY and the intermodule connection strengths as ε_{X_k Y_l} ≡ ε_XY (k ≠ l). Moreover, we assume that the intermodule connection strengths are very weak, namely, ε_XY ≪ 1, and that the intermodule connections originate only from the excitatory ensembles, namely, ε_EI = ε_II = 0, because the intercolumnar long-range connections in the cortex are excitatory (Gilbert & Wiesel, 1983; Ts'o, Gilbert, & Wiesel, 1986). A similar network of two modules, each of which contains an excitatory cell (E-cell) and an inhibitory cell (I-cell), was previously analyzed by Ermentrout and Kopell (1998). The E-cell and I-cell each represented populations of neurons, and their dynamics obeyed the equations of a spiking neuron model. The neurons in each population were assumed to have perfectly synchronized firings. However, as shown in Figure 2, our inner-module neurons do not show perfectly synchronized firings; therefore, the average behavior of the neurons in each module cannot be represented by that of a single neuron. Instead, we use the probabilistic representation of the Fokker-Planck equation to describe the dynamics of each module. In the limit of N_{E_k}, N_{I_k} → ∞, the dynamics of the probability density of each module are governed by the Fokker-Planck equation shown in appendix A, and the Fourier coefficients of the probability densities follow the ordinary differential equation ẋ = f(x), which is defined in appendix B. When each module shows inner-module synchronized firings, the vector x moves on a limit cycle x = x_0(t). In a system of two modules with weak intermodule connections ε_XY, the two limit cycles are connected weakly, and such a system can be analyzed using the phase response function (Kuramoto, 1984; Ermentrout & Kopell, 1991; Ermentrout, 1996; Ermentrout, Pascal, & Gutkin, 2001; Nomura, Fukai, & Aoyagi, 2003), as summarized in appendix C.
Using this method, we can transform the weakly connected ordinary differential equations ẋ = f(x) of the Fourier coefficients into the averaged phase equations C.5 and C.6, and we can analyze the stationary phase differences using equation C.10. The dependence of the stationary phase differences on the ratio γ, which is defined as

$$\epsilon_{EE} : \epsilon_{IE} = 1 - \gamma : \gamma, \qquad (3.4)$$
in cases with different values of D and g_ext/g_int is shown in Figures 3, 4, and 5. In the following, in-phase and antiphase synchronization are defined as the stationary solutions with phase difference ∆φ = 0 and ∆φ = 0.5, respectively. In all cases, it was found that the connections from excitatory to excitatory ensembles (ε_EE) tended to stabilize the antiphase synchronization, while the connections from excitatory to inhibitory ensembles (ε_IE) tended to stabilize the in-phase synchronization. However, when the inner-module synchronizations were strong (see Figure 3), the antiphase synchronization remained stable even when ε_IE was large, and therefore intermodule synchronization was harder to attain than in other cases. On the other hand, when the inner-module synchronizations were weak (see Figure 5), the in-phase synchronization was stable over a wide range of γ, and therefore intermodule synchronization was easily attained. As shown above, ε_EE and ε_IE contribute to the intermodule synchronization in different ways because their phase responses have different properties. Note that the phase response describes the change in frequency at phase φ in response to small perturbations, as shown in appendix C. The phase response function Z(t) is a vector function whose components represent the effects of inputs on the Fourier components of the Fokker-Planck equation. To make the phase response easier to understand, the phase responses Γ_δE and Γ_δI, obtained upon injection of a delta function input into the excitatory or inhibitory ensemble, are calculated with equation C.7, and the results for three sets of parameters are shown in Figure 6. Generally, the two phase responses have opposite signs in all three cases; therefore, it can be concluded that the connections to the excitatory and inhibitory ensembles have opposite synchronization properties.
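The opposite roles of the two connection types can be illustrated with a toy calculation. The two coupling functions below are simple sinusoids chosen only so that their odd parts have the opposite signs described in the text; they are not the Γ_δE and Γ_δI computed from the Fokker-Planck modes.

```python
import numpy as np

# Toy coupling functions on a phase normalized by the period (phi in [0, 1)).
# These sinusoids are illustrative stand-ins with opposite-signed odd parts;
# they are NOT the Gamma_dE / Gamma_dI of Figure 6.
gamma_EE = lambda phi: np.sin(2.0 * np.pi * phi)    # E -> E: antiphase-stabilizing
gamma_IE = lambda phi: -np.sin(2.0 * np.pi * phi)   # E -> I: in-phase-stabilizing

def gamma_odd(phi, g):
    """Odd part Gamma(phi) - Gamma(-phi), the right-hand side of C.9/C.10."""
    return g(phi) - g(-phi)

def stable_in_phase(gamma_ratio):
    """In-phase (difference 0) is stable iff Gamma_odd has negative slope at 0."""
    g = lambda phi: (1.0 - gamma_ratio) * gamma_EE(phi) + gamma_ratio * gamma_IE(phi)
    h = 1e-5
    return (gamma_odd(h, g) - gamma_odd(-h, g)) / (2.0 * h) < 0.0

print([stable_in_phase(g) for g in (0.1, 0.5, 0.9)])
```

With these stand-ins, the in-phase solution becomes stable once the E → I share γ of the mixture exceeds 0.5, mirroring the qualitative trend in Figures 3 to 5.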
Moreover, when the system is close to the saddle-node-on-limit-cycle bifurcation point (see Figures 3, 6A, and 6B), the phase response of the inhibitory ensemble is much smaller than that of the excitatory ensemble. Thus, the in-phase synchronization is hard to attain in Figure 3. Although the phase responses in Figures 6D and 6F have similar forms, the amplitude of J_I is smaller than that of J_E in Figure 6C. Thus, the effect of the inhibitory ensemble is weak when the system is close to the Hopf bifurcation point and its firing rates are high (see Figures 4, 6C, and 6D), and the in-phase synchronization is harder to attain than for the weakly synchronized periodic firings in Figure 5. Next, let us consider a system with a transmission delay d between the two modules. Such a system can be analyzed with the equation

$$\Gamma_d(\phi_\alpha - \phi_{\alpha'}) = \frac{1}{T} \int_0^T Z(t + \phi_\alpha) \cdot p(t + \phi_\alpha, t + \phi_{\alpha'} - d)\, dt, \qquad (3.5)$$

which is obtained by incorporating the delay d into Γ(φ_α − φ_{α'}) in equation C.7 (Hansel, Mato, & Meunier, 1995). The areas where the in-phase or antiphase synchronization is stable are obtained numerically, and their
[Figure 3: (A) Dependence of the stationary phase differences ∆φ between the two modules of a two-module system on the connection ratio γ in the case where r_E = −0.025, r_I = −0.05, D = 0.005, g_ext/g_int = 0.5, and g_int = 4. The phase variable is normalized with the period T. The solid and dotted lines denote the stable and unstable phase differences, respectively. (B) Change in J_E over time in the case where γ = 0.5. The solid and dotted lines denote modules 1 and 2, respectively. (C) Raster plot of the firing times of excitatory neurons in a two-module system where N_E = N_I = 1000. In B and C, the intermodule connection strengths were set at ε_EE = ε_IE = 0.025.]

[Figure 4: Dependence of the stationary phase differences between the two modules of a two-module system on the connection ratio γ in the case where D = 0.02, g_ext/g_int = 0.5, and g_int = 4. The explanations are the same as those in Figure 3 except for the values of the parameters.]

[Figure 5: Dependence of the stationary phase differences between the two modules of a two-module system on the connection ratio γ in the case where D = 0.005, g_ext/g_int = 1.5, and g_int = 4. The explanations are the same as those in Figure 3 except for the values of the parameters.]

[Figure 6: The phase responses Γ_δE and Γ_δI on injection of the delta function into the excitatory or inhibitory ensemble, respectively. (A, C, E) Change in J_E and J_I over time during a single period. (B, D, F) Phase responses. The parameters are (A, B) D = 0.005, g_ext/g_int = 0.5; (C, D) D = 0.02, g_ext/g_int = 0.5; (E, F) D = 0.005, g_ext/g_int = 1.5.]
dependence on the connection ratio γ and the delay d is shown in Figure 7, where the delay is normalized by the period T. It is observed that ε_IE stabilizes the in-phase synchronization when d is small, and ε_EE stabilizes the in-phase synchronization when d is large. Let us consider the physiologically valid values of d for a gamma oscillation of 40 Hz
[Figure 7: Areas where the in-phase or antiphase synchronization is stable in the (γ, d) plane, for (A) D = 0.005, g_ext/g_int = 0.5; (B) D = 0.02, g_ext/g_int = 0.5; (C) D = 0.005, g_ext/g_int = 1.5. The solid and dotted lines are the boundaries of the stable regions of the in-phase and antiphase synchronization, respectively. The delay d is normalized with the period T. The in-phase synchronization is stable in the areas labeled I, and the antiphase synchronization is stable in the areas labeled A. The other stable phase differences are omitted for simplicity.]
(T = 25 ms). The major components of the delay in signal transmission between two neurons are the transmission delay along the axon and the synaptic delay in transmitting the signal across the synaptic cleft (Nicholls, Martin, Wallace, & Fuchs, 2001). Because the conduction velocity along a myelinated axon is 1 to 100 m per second, the conduction delay between two neurons separated by 7 mm is estimated to be 0.07 to 7 ms. The synaptic delay is known to be about 1 to 2 ms. Thus, we roughly estimate that d < 10 ms, which gives the relationship d/T < 0.4. Under this condition, ε_IE stabilizes the in-phase synchronization, as shown in Figure 7. Moreover, for d/T < 0.4, the area with stable in-phase synchronization was widest for the weakly synchronized periodic firings (see Figure 7C).
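The estimate in this paragraph is simple arithmetic and can be checked directly:

```python
# Worst-case delay: 7 mm of myelinated axon at the slowest quoted conduction
# velocity (1 m/s) plus the upper synaptic delay (2 ms), against T = 25 ms.
T = 25.0                                  # gamma period in ms (40 Hz)
conduction_max = 7e-3 / 1.0 * 1e3         # 7 mm at 1 m/s, converted to ms
synaptic_max = 2.0                        # ms
d_max = conduction_max + synaptic_max     # consistent with d < 10 ms
print("d_max =", d_max, "ms, d_max/T =", d_max / T)
```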
4 Discussion and Conclusions

To study the mechanism through which synchronized oscillations occur in the brain, we analyzed the synchronization properties of class 1 pulse neural networks. In the one-module system, which was composed of excitatory and inhibitory neurons, various synchronized firings were observed depending on the connection strengths and the noise intensity, and they might be related to the gamma-frequency synchronized oscillations among nearby neurons in the visual cortex. Note that such synchronized firings can be observed only when the excitatory and inhibitory neurons are connected with each other (see the area with g_ext = 0 in Figure 1). In other words, the synchronized firings observed in our model were generated by the interactions between the excitatory ensemble and the inhibitory ensemble in the network. On the other hand, it is known that self-oscillating neurons in a network consisting of only excitatory (or only inhibitory) neurons can synchronize with each other (Mirollo & Strogatz, 1990). This difference might arise because our network is composed of excitable, but not self-oscillating, neurons. To elucidate the mechanism by which synchronized oscillations occur among distant neurons, we analyzed the synchronization between two modules of networks using the phase response function. A similar network of two modules, each of which showed perfect synchronization, was previously analyzed by Ermentrout and Kopell (1998). However, as shown in Figure 2, the neurons in our module do not show perfect synchronization; therefore, a probabilistic representation with the Fokker-Planck equation was required to describe the dynamics of each module. As a result, it was found that the intermodule connections from excitatory to excitatory (E → E) ensembles tended to stabilize the antiphase synchronization, while the intermodule connections from excitatory to inhibitory (E → I) ensembles tended to stabilize the in-phase synchronization.
Moreover, it was found that intermodule synchronization was more easily attained when the inner-module synchronizations were weak. Our finding that the E → E intermodule connections stabilize the antiphase synchronization is analogous to the previous result that a pair of excitatory neurons with slow connections has a stable antiphase solution (Hansel et al., 1995; van Vreeswijk, 1996; Sato & Shiino, 2002). Moreover, our finding that the E → I intermodule connections tend to stabilize the in-phase synchronization is similar to the previous result that the E → I and I → E connections stabilize the in-phase synchronization despite the existence of a delay (Ermentrout & Kopell, 1998). However, the mechanism of synchronization in their model differs from that in our model. In the model of Ermentrout and Kopell (1998), the timing of the pulses played an important role in synchronization because their network contained only four neurons: two E-cells and two I-cells. They stated that a pair of pulses (a doublet) of
the I-cell was important in the process of synchronization. In our network, however, there are many neurons, and each neuron receives many pulses from other neurons (see Figures 2B, 2D, and 2F). Therefore, the timing of individual pulses is less important in our model than in theirs. Nevertheless, similar results on the roles of the E → I connections were obtained. Moreover, in our model, it was found that the degree of synchronization within one module affects the properties of the intermodule synchronization. In summary, in our model, the oscillations in a neuronal ensemble were generated by a local network composed of excitatory and inhibitory neurons, and their synchronization was realized by the long-range connections from excitatory to inhibitory ensembles. We modeled the average dynamics of a module using a probabilistic representation with the Fokker-Planck equation. In physiological environments, the properties of single neurons are not uniform and the networks are more complex; therefore, a probabilistic representation may be crucial for understanding their dynamics. However, in our research, we assumed that the properties of the neurons and the structure of the module were uniform. Therefore, more detailed analyses are required. Moreover, we confirmed that the analysis with the phase response function is applicable to a stochastic system whose average dynamics obey the Fokker-Planck equation. It is known that the phase response function can be calculated from physiological data (Reyes & Fetz, 1993a, 1993b; Jones, Mulloney, Kaper, & Kopell, 2003); therefore, our method might widen the application of the phase response function in theoretical and experimental fields.
Appendix A: The Fokker-Planck Equation for the One-Module System

To analyze the dynamics of the one-module system, we use the Fokker-Planck equations (Kuramoto, 1984; Gerstner & Kistler, 2002), which are written as

$$\frac{\partial n_E}{\partial t} = -\frac{\partial}{\partial \theta_E}(A_E n_E) + \frac{D}{2} \frac{\partial}{\partial \theta_E}\left(B_E \frac{\partial}{\partial \theta_E}(B_E n_E)\right), \qquad (A.1)$$

$$\frac{\partial n_I}{\partial t} = -\frac{\partial}{\partial \theta_I}(A_I n_I) + \frac{D}{2} \frac{\partial}{\partial \theta_I}\left(B_I \frac{\partial}{\partial \theta_I}(B_I n_I)\right), \qquad (A.2)$$

$$A_E(\theta_E, t) = (1 - \cos\theta_E) + (1 + \cos\theta_E) \times (r_E + g_{EE} I_E(t) - g_{EI} I_I(t)), \qquad (A.3)$$

$$A_I(\theta_I, t) = (1 - \cos\theta_I) + (1 + \cos\theta_I) \times (r_I + g_{IE} I_E(t) - g_{II} I_I(t)), \qquad (A.4)$$

$$B_E(\theta_E, t) = 1 + \cos\theta_E, \qquad (A.5)$$

$$B_I(\theta_I, t) = 1 + \cos\theta_I, \qquad (A.6)$$
for the normalized number densities of the excitatory and inhibitory neurons, in which

$$n_E(\theta_E, t) \equiv \frac{1}{N_E} \sum_{i=1}^{N_E} \delta\bigl(\theta_E - \theta_E^{(i)}\bigr), \qquad (A.7)$$

$$n_I(\theta_I, t) \equiv \frac{1}{N_I} \sum_{i=1}^{N_I} \delta\bigl(\theta_I - \theta_I^{(i)}\bigr), \qquad (A.8)$$
in the limit of N_E, N_I → ∞. The probability flux for each ensemble is defined as

$$J_E(\theta_E, t) = A_E n_E - \frac{D}{2} B_E \frac{\partial}{\partial \theta_E}(B_E n_E), \qquad (A.9)$$

$$J_I(\theta_I, t) = A_I n_I - \frac{D}{2} B_I \frac{\partial}{\partial \theta_I}(B_I n_I), \qquad (A.10)$$

respectively. In the limit of N_X → ∞, I_X(t) in equation 2.6 follows an equation that is written as

$$I_X(t) = \frac{1}{2} J_X(t) \qquad (A.11)$$

$$= n_X(\pi, t), \qquad (A.12)$$
where J_X(t) ≡ J_X(π, t) is the probability flux at θ_X = π. By integrating the Fokker-Planck equations A.1 and A.2 with equation A.12, the dynamics of the network governed by equations 2.1 and 2.2 can be analyzed.

Appendix B: Numerical Integration of the Fokker-Planck Equations

In this section, we provide a method for the numerical integration of the Fokker-Planck equations A.1 and A.2. Because the normalized number densities given by equations A.7 and A.8 are 2π-periodic functions of θ_E
and θ_I, respectively, they can be expanded as

$$n_E(\theta_E, t) = \frac{1}{2\pi} + \sum_{k=1}^{\infty} \bigl(a_k^E(t) \cos(k\theta_E) + b_k^E(t) \sin(k\theta_E)\bigr), \qquad (B.1)$$

$$n_I(\theta_I, t) = \frac{1}{2\pi} + \sum_{k=1}^{\infty} \bigl(a_k^I(t) \cos(k\theta_I) + b_k^I(t) \sin(k\theta_I)\bigr), \qquad (B.2)$$
and, by substituting them, equations A.1 and A.2 are transformed into an ordinary differential equation ẋ = f(x), where x ≡ (a_1^E, b_1^E, a_1^I, b_1^I, a_2^E, b_2^E, a_2^I, b_2^I, ...)^t,

$$\frac{da_k^{(X)}}{dt} = -(r_X + K_X + 1)\, k\, b_k^{(X)} - \frac{k}{2}(r_X + K_X - 1)\bigl(b_{k-1}^{(X)} + b_{k+1}^{(X)}\bigr) - \frac{Dk}{8}\, g\bigl(a_k^{(X)}\bigr), \qquad (B.3)$$

$$\frac{db_k^{(X)}}{dt} = (r_X + K_X + 1)\, k\, a_k^{(X)} + \frac{k}{2}(r_X + K_X - 1)\bigl(a_{k-1}^{(X)} + a_{k+1}^{(X)}\bigr) - \frac{Dk}{8}\, g\bigl(b_k^{(X)}\bigr), \qquad (B.4)$$

$$g(x_k) = (k - 1)x_{k-2} + 2(2k - 1)x_{k-1} + 6k\, x_k + 2(2k + 1)x_{k+1} + (k + 1)x_{k+2}, \qquad (B.5)$$

$$K_X \equiv g_{XE} I_E - g_{XI} I_I, \qquad (B.6)$$

$$a_0^{(X)} \equiv \frac{1}{\pi}, \qquad (B.7)$$

$$b_0^{(X)} \equiv 0, \qquad (B.8)$$
and X = E or I. By integrating this ordinary differential equation numerically, the time series of the probability fluxes J_E and J_I are obtained. For the numerical calculations, each Fourier series is truncated at the first 40 or 60 terms. The bifurcation lines of the Hopf bifurcation and the saddle-node bifurcation in Figure 1 were obtained as follows. First, a stationary solution x_s was obtained numerically by the Newton method (Press, Flannery, Teukolsky, & Vetterling, 1988), and the eigenvalues of the Jacobian matrix Df(x_s),
which had been obtained numerically using the QR algorithm (Press et al., 1988), were examined to find the bifurcation lines. The bifurcation lines of the global bifurcations, such as the homoclinic bifurcation and the double limit-cycle bifurcation, were obtained by observing the long-time behavior of the solutions of ẋ = f(x).

Appendix C: Analysis with the Phase Response Function

In this section, we summarize the method for analyzing the dynamics of two weakly coupled oscillators. Let us consider a dynamical system ẋ = f(x) that has a stable limit cycle with period T as its solution, written as x = x_0(t) (x_0(t) = x_0(t + T)), and then introduce a weak perturbation p(x, x′) from the other module x′. The dynamics of the module are then governed by the differential equation

$$\dot x = f(x) + p(x, x'), \qquad (C.1)$$

and it can be reduced to

$$\dot\phi = 1 + Z(\phi) \cdot p(x_0, x_0'), \qquad (C.2)$$
where φ = t mod T and Z(φ) is the phase response function that describes the change in frequency at phase φ in response to small perturbations (Kuramoto, 1984; Ermentrout & Kopell, 1991; Ermentrout, 1996; Ermentrout et al., 2001; Nomura et al., 2003). Z(φ) can be obtained numerically using the method shown by Ermentrout (1996). First, let us consider the linear differential equation

$$\dot Z = -Df(x_0(t))^t \cdot Z(t), \qquad (C.3)$$

and integrate it backward in time with random initial conditions. After Z(t) converges to a periodic orbit, the normalization

$$\frac{1}{T} \int_0^T Z(t) \cdot \dot x_0(t)\, dt = 1 \qquad (C.4)$$
is performed, and Z(t) is obtained. Let us denote the phases of the two modules as φ_1 and φ_2, respectively. After averaging, the two phases obey

$$\dot\phi_1 = 1 + \Gamma(\phi_1 - \phi_2), \qquad (C.5)$$

$$\dot\phi_2 = 1 + \Gamma(\phi_2 - \phi_1), \qquad (C.6)$$
$$\Gamma(\phi_\alpha - \phi_{\alpha'}) = \frac{1}{T} \int_0^T Z(t + \phi_\alpha) \cdot p(t + \phi_\alpha, t + \phi_{\alpha'})\, dt, \qquad (C.7)$$

$$p(t + \phi_\alpha, t + \phi_{\alpha'}) = p\bigl(x_0(t + \phi_\alpha), x_0(t + \phi_{\alpha'})\bigr). \qquad (C.8)$$

Using Γ(φ), the phase difference ∆φ ≡ φ_1 − φ_2 of the two phases obeys

$$\Delta\dot\phi = \Gamma(\Delta\phi) - \Gamma(-\Delta\phi) \qquad (C.9)$$

$$\equiv \Gamma_{\mathrm{odd}}(\Delta\phi). \qquad (C.10)$$
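The procedure of equations C.3, C.4, and C.7 can be sketched numerically. The code below applies it to a toy Stuart-Landau oscillator whose limit cycle and phase response are known in closed form, standing in for the Fourier-mode system ẋ = f(x) of appendix B; the diffusive coupling p is likewise an assumed example, not the intermodule coupling of section 3.

```python
import numpy as np

# Toy oscillator standing in for x' = f(x): a Stuart-Landau system whose limit
# cycle is the unit circle traversed with period T = 1. Both this choice and
# the diffusive coupling below are illustrative assumptions.
omega = 2.0 * np.pi

def f(v):
    x, y = v
    r2 = x * x + y * y
    return np.array([x - omega * y - r2 * x, omega * x + y - r2 * y])

def jac(v):
    x, y = v
    return np.array([[1.0 - 3.0 * x * x - y * y, -omega - 2.0 * x * y],
                     [omega - 2.0 * x * y, 1.0 - x * x - 3.0 * y * y]])

N, T = 2000, 1.0
dt = T / N
ts = np.arange(N) * dt
x0 = np.stack([np.cos(omega * ts), np.sin(omega * ts)], axis=1)  # x_0(t)

# Equation C.3: integrate Z' = -Df(x_0(t))^t Z backward in time; the transient
# component decays backward, leaving the periodic solution.
Z = np.zeros((N, 2))
z = np.array([0.3, 0.7])                 # arbitrary initial condition
for _ in range(5):                       # several periods of backward Euler
    for i in range(N - 1, -1, -1):
        z = z + dt * (jac(x0[i]).T @ z)  # stepping in -t flips the sign
        Z[i] = z

# Equation C.4: normalize so that (1/T) * integral Z . dx_0/dt dt = 1.
dx0 = np.array([f(v) for v in x0])
Z /= np.mean(np.sum(Z * dx0, axis=1))

# Equation C.7 for a weak diffusive coupling p(x, x') = (x'_1 - x_1, 0).
def gamma(shift):                        # shift = phase difference in samples
    p = np.zeros_like(Z)
    p[:, 0] = np.roll(x0[:, 0], shift) - x0[:, 0]
    return np.mean(np.sum(Z * p, axis=1))

gamma_odd = lambda s: gamma(s) - gamma(-s)
print("slope of Gamma_odd near 0:", (gamma_odd(10) - gamma_odd(-10)) / 20.0)
```

A negative slope of Γ_odd at zero phase difference means the in-phase state is stable for this coupling. For this particular oscillator the adjoint solution is known analytically, Z(t) = (−sin ωt, cos ωt)/ω, which the backward integration reproduces.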
We can obtain the stable phase difference ∆φ, which satisfies Γ_odd(∆φ) = 0 and Γ′_odd(∆φ) < 0.

Acknowledgments

This research was partially supported by a Grant-in-Aid for Encouragement of Young Scientists (B) (No. 17700226) from the Ministry of Education, Culture, Sports, Science, and Technology, Japan.

References

Bragin, A., Jandó, G., Nádasdy, Z., Hetke, J., Wise, K., & Buzsáki, G. (1995). Gamma (40–100 Hz) oscillation in the hippocampus of the behaving rat. J. Neurosci., 15, 47–60.
Buzsáki, G., Horváth, Z., Urioste, R., Hetke, J., & Wise, K. (1992). High-frequency network oscillation in the hippocampus. Science, 256, 1025–1027.
Cunningham, M. O., Davies, C. H., Buhl, E. H., Kopell, N., & Whittington, M. A. (2003). Gamma oscillations induced by kainate receptor activation in the entorhinal cortex in vitro. J. Neurosci., 23, 9761–9769.
Ermentrout, B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comput., 8, 979–1001.
Ermentrout, G. B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., 46, 233–253.
Ermentrout, G. B., & Kopell, N. (1991). Multiple pulse interactions and averaging in systems of coupled neural oscillators. J. Math. Biol., 29, 195–217.
Ermentrout, G. B., & Kopell, N. (1998). Fine structure of neural spiking and synchronization in the presence of conduction delays. Proc. Natl. Acad. Sci. USA, 95, 1259–1264.
Ermentrout, B., Pascal, M., & Gutkin, B. (2001). The effects of spike frequency adaptation and negative feedback on the synchronization of neural oscillators. Neural Comput., 13, 1285–1310.
Fisahn, A., Pike, F. G., Buhl, E. H., & Paulsen, O. (1998). Cholinergic induction of network oscillations at 40 Hz in the hippocampus in vitro. Nature, 394, 186–189.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Gilbert, C. D., & Wiesel, T. N. (1983). Clustered intrinsic connections in cat visual cortex. J. Neurosci., 3, 1116–1133.
Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. J. Comput. Neurosci., 1, 11–38.
Gray, C. M., & McCormick, D. A. (1996). Chattering cells: Superficial pyramidal neurons contributing to the generation of synchronous oscillations in the visual cortex. Science, 274, 109–113.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. USA, 86, 1698–1702.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comput., 7, 307–337.
Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer.
Izhikevich, E. M. (1999). Class 1 neural excitability, conventional synapses, weakly connected networks, and mathematical foundations of pulse-coupled models. IEEE Trans. Neural Networks, 10, 499–507.
Jagadeesh, B., Gray, C. M., & Ferster, D. (1992). Visually evoked oscillations of membrane potential in cells of cat visual cortex. Science, 257, 552–554.
Jones, S. R., Mulloney, B., Kaper, T. J., & Kopell, N. (2003). Coordination of cellular pattern-generating circuits that control limb movements: The sources of stable differences in intersegmental phases. J. Neurosci., 23, 3457–3468.
Kanamaru, T., & Sekine, M. (2004). An analysis of globally connected active rotators with excitatory and inhibitory connections having different time constants using the nonlinear Fokker-Planck equations. IEEE Trans. Neural Networks, 15, 1009–1017.
Kanamaru, T., & Sekine, M. (2005). Synchronized firings in the networks of class 1 excitable neurons with excitatory and inhibitory connections and their dependences on the forms of interactions. Neural Comput., 17, 1315–1338.
Kuramoto, Y. (1984). Chemical oscillations, waves, and turbulence. Berlin: Springer.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50, 1645–1662.
Nicholls, J. G., Martin, A. R., Wallace, B. G., & Fuchs, P. A. (2001). From neuron to brain. Sunderland, MA: Sinauer.
Nomura, M., Fukai, T., & Aoyagi, T. (2003). Synchrony of fast-spiking interneurons interconnected by GABAergic and electrical synapses. Neural Comput., 15, 2179–2198.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1988). Numerical recipes in C. Cambridge: Cambridge University Press.
Reyes, A. D., & Fetz, E. E. (1993a). Two modes of interspike interval shortening by brief transient depolarizations in cat neocortical neurons. J. Neurophysiol., 69, 1661–1672.
Reyes, A. D., & Fetz, E. E. (1993b). Effects of transient depolarizing potentials on the firing rate of cat neocortical neurons. J. Neurophysiol., 69, 1673–1683.
Sato, Y. D., & Shiino, M. (2002). Spiking neuron models with excitatory or inhibitory synaptic couplings and synchronization phenomena. Phys. Rev. E, 66, 041903.
Traub, R. D., Bibbig, A., LeBeau, F. E. N., Cunningham, M. O., & Whittington, M. A. (2005). Persistent gamma oscillations in superficial layers of rat auditory neocortex: Experiment and model. J. Physiol., 562, 3–8.
Ts'o, D. Y., Gilbert, C. D., & Wiesel, T. N. (1986). Relationships between horizontal interactions and functional architecture in cat striate cortex as revealed by cross-correlation analysis. J. Neurosci., 6, 1160–1170.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54, 5522–5537.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.
Received June 1, 2005; accepted September 14, 2005.
LETTER
Communicated by Gregor Schoener
A Sensorimotor Map: Modulating Lateral Interactions for Anticipation and Planning

Marc Toussaint
[email protected]
School of Informatics, University of Edinburgh, Edinburgh, Scotland, U.K.
Experimental studies of reasoning and planned behavior have provided evidence that nervous systems use internal models to perform predictive motor control, imagery, inference, and planning. Classical (model-free) reinforcement learning approaches omit such a model; standard sensorimotor models account for forward and backward functions of sensorimotor dependencies but do not provide a proper neural representation on which to realize planning. We propose a sensorimotor map to represent such an internal model. The map learns a state representation similar to self-organizing maps but is inherently coupled to sensor and motor signals. Motor activations modulate the lateral connection strengths and thereby induce anticipatory shifts of the activity peak on the sensorimotor map. This mechanism encodes a model of the change of stimuli depending on the current motor activities. The activation dynamics on the map are derived from neural field models. An additional dynamic process on the sensorimotor map (derived from dynamic programming) realizes planning and emits corresponding goal-directed motor sequences, for instance, to navigate through a maze.

1 Introduction

Köhler's (1917) studies with monkeys were one of the first systematic investigations into the capability of planned behavior in animals. In one of his classic experiments, monkeys had to reach for a banana mounted below the ceiling. After many attempts in vain, one of the monkeys eventually exhibited the behavior that Köhler found so fascinating: the monkey retreated and sat quietly in a corner for minutes, staring at the banana and at some point also staring at a nearby table. It started to saccade several times between the banana and the table while still sitting quietly. Then it suddenly rushed up, grabbed the table, pulled it below the banana, mounted it, and jumped to grab the banana.
Reading these experiment scripts today, one realizes how little we know about the neural processes in the monkey's brain when Köhler read in its face the effort to reason about sequential behaviors to reach a goal. Classical (model-free) reinforcement learning approaches explicitly omit internal models (Sutton & Barto, 1998; see also Majors & Richards, 1997). More recent studies in the cognitive sciences converge to the postulate that nervous systems use internal models to perform predictive motor control, imagery, and planning in a way that involves a simulation of actions and their perceptual implications (Grush, 2004). Based on experiments with humans, who were asked to imagine the way from a starting position in a maze to a goal position, Hesslow (2002) formulates three assumptions that underlie a simulation theory of cognitive functions: (1) behavior can be simulated by activating motor structures, as during an overt action, but suppressing its execution; (2) perception can be simulated by internal activation of sensory cortex, as during normal perception of external stimuli; and (3) both overt (executed) and covert (suppressed) actions can elicit perceptual simulation of their normal consequences. The evidence in favor of internal models and the hypotheses developed in cognitive science raise the challenge to propose concrete models of how neural systems are capable of these processes. Such systems must be able to anticipate the sensorial implications of motor activities, but they also must account for planned, goal-oriented behavior. The sensorimotor map we propose in this letter provides mechanisms to self-organize a representation of sensorimotor data that encodes the dependencies between motor activity and predictable changes of stimuli (see also Toussaint, 2004).

Neural Computation 18, 1132–1155 (2006) © 2006 Massachusetts Institute of Technology
The self-organization process largely adopts the classical approaches to self-organizing neural stimulus representations (von der Malsburg, 1973; Willshaw & von der Malsburg, 1976; Kohonen, 1995) and their extensions with respect to growing representations (Carpenter, Grossberg, Markuzon, Reynolds, & Rosen, 1992; Fritzke, 1995; Bednar, Kelkar, & Miikkulainen, 2002) and temporal dependencies (Bishop, Hinton, & Strachan, 1997; Euliano & Principe, 1999; Somervuo, 1999; Wiemer, 2003; Varsta, 2002; Klemm & Alstrom, 2002). However, unlike previous self-organizing maps, our model couples sensor and motor signals in a joint representational layer. The activation dynamics on the sensorimotor map are adopted from dynamic field models of a homogeneous, laterally connected neural layer (Amari, 1977). In the language of neural fields, the anticipation of a new stimulus corresponds to a shift of the activity peak, which is induced by a modulation of the lateral connection strengths. A key ingredient of our model is that the modulation depends on the current motor activities. A motor representation is coupled to the neural field by modulating the lateral connectivity instead of connecting directly to the neural units. By this mechanism, different motor activities lead to different shifts of the peak. The coupling encodes all the information necessary for anticipating a stimulus change depending on the motor activations and also for planning goal-directed motor sequences. On the sensorimotor map, an additional dynamic process similar to spreading activation dynamics (Bagchi, Biswas, & Kawamura, 2000) accounts for planning. The same coupling to the motor
representation allows the system to emit motor excitations that execute the plan. The next section briefly recalls the relevant aspects of standard neural field dynamics. Section 3 gives an overview of the considered architecture. The sensorimotor map, and how it couples to sensor and motor representations, is introduced in section 4. Section 5 shows how the topology and parameters of the sensorimotor map can be learned online from data gathered during sensorimotor exploration. A demonstration of anticipation with the sensorimotor map is given in section 6, while section 7 introduces and demonstrates planning. In section 8, we briefly address possible extensions of the basic model before discussing related work in more detail in section 9. A discussion concludes.

2 Neural Field Dynamics

Amari (1977) investigated a spatially homogeneous neural field as an approximation of a dense layer of interconnected neurons. His main interest was in a theory of the dynamics of activity pattern formation on such substrates. The lateral connectivity is assumed to induce local excitation and widespread inhibition, as typically described with a Mexican hat–type interaction kernel. The most elementary interesting stable solution to such a dynamic system is the single peak solution (also called activity bump or packet), where the activity is localized and stabilized around a center while the widespread inhibition emitted from the peak inhibits any spontaneous activation in the neighborhood. This simple solution has some important functional properties: if the peak is induced by a stimulus, it stabilizes its representation against noise; it may even stabilize the representation when the stimulus vanishes or is temporally occluded; it fuses two nearby stimuli while implementing a competition between distal stimuli; and it exhibits some delay to shift the peak to a new position when the stimulus switches (hysteresis).
These properties make the model appealing for sensory processing and decision making, as well as motor control, where the dynamics effectively allow the system to filter noisy signals, decide among conflicting signals, and stabilize such decisions (Erlhagen & Schöner, 2002). Consequently, neural fields also find application in motor control and robotics problems (Schöner & Dose, 1992; Schöner, Dose, & Engels, 1995; Iossifidis & Steinhage, 2001; Dahm, Bruckhoff, & Joublin, 1998; Bergener et al., 1999). We introduce here a discrete implementation of such a neural field, following Erlhagen and Schöner (2002). In this implementation, the activation m_i of a unit i (denoted by m to anticipate the meaning of motor activations) is governed by the dynamics

    τ_m ṁ_i = −m_i + h_m + A_i + Σ_j w_ij φ(m_j) + ξ,  ξ ∼ N(0, ρ_m).    (2.1)
Here, τ_m is the timescale of the dynamics, h_m the resting level, A_i some feedforward input to unit i, ξ a gaussian noise term with variance ρ_m, and φ(m) a sigmoid. We choose

    φ(m) = m̂ = { 0 if m < 0;  m if 0 ≤ m ≤ 1;  1 if m > 1 }    (2.2)
as a simple parameterless, piecewise-linear sigmoid. The crucial term in these dynamics is the interaction strength w_ij between units i and j. In spatially homogeneous neural fields, this strength is usually assumed to depend only on the distance between the locations r_i and r_j of the two neurons: for short distances the interaction is excitatory, while for longer distances it is inhibitory,

    w_ij = w_E exp(−(r_i − r_j)² / (2σ_E²)) − w_I.    (2.3)
The parameters here are the strengths of excitation (w_E) and inhibition (w_I), and the width σ_E of the excitatory range. We generally omit indicating the time dependence of dynamic variables except when we need to refer to the time steps of the Euler integration m_i^(t) = m_i^(t−1) + ṁ_i^(t) that we use to simulate the dynamics.

3 Overview of the Sensorimotor Architecture

Figure 1 displays the sensorimotor architecture that we will use in the experiments. The architecture is composed of three layers. The bottom layer is an arbitrary sensor representation. In the experiments, the representation will comprise either 2 units for the x- and y-coordinates of a limb or 40 units encoding range sensor data from a maze. The top layer is the motor representation, which we choose to be a one-dimensional cyclic neural field. Different units in the field will encode different bearing directions of movements. The dynamics of these units are exactly as given in equation 2.1; the "distance" |r_i − r_j| between two units that determines the excitatory kernel in equation 2.3 is taken as the minimal distance on the circle, measured by how many units lie between j and i. The central layer is the sensorimotor map governed by equation 4.1 given below. The key architectural feature is that the motor units project to lateral connections (ij) between two sensorimotor units j and i by multiplicatively modulating the signal transmission of that lateral connection. In contrast, sensor units project directly to sensorimotor units, as is typical for self-organizing maps.
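The field dynamics of equations 2.1 through 2.3 can be illustrated with a short simulation sketch. The parameter values below are illustrative (chosen in the spirit of Table 1), and the noise seed and input placement are arbitrary:

```python
import numpy as np

# Sketch of equations 2.1-2.3 with explicit Euler steps: a field of N units
# with local excitation and global inhibition forms a single activity peak
# around a localized input. Parameter values are illustrative.

N = 40
tau_m, h_m, rho_m = 5.0, -1.0, 0.01   # timescale, resting level, noise variance
w_E, sigma_E, w_I = 1.0, 2.0, 0.6     # excitation strength/range, inhibition

r = np.arange(N, dtype=float)          # unit positions r_i
# Mexican hat-type interaction kernel, equation 2.3
W = w_E * np.exp(-(r[:, None] - r[None, :]) ** 2 / (2 * sigma_E ** 2)) - w_I

def phi(m):
    """Parameterless piecewise-linear sigmoid, equation 2.2."""
    return np.clip(m, 0.0, 1.0)

rng = np.random.default_rng(0)

def euler_step(m, A, dt=1.0):
    """One Euler step m^(t) = m^(t-1) + dt * m_dot of equation 2.1."""
    xi = rng.normal(0.0, np.sqrt(rho_m), N)
    m_dot = (-m + h_m + A + W @ phi(m) + xi) / tau_m
    return m + dt * m_dot

m = np.full(N, h_m)                    # start at the resting level
A = np.zeros(N); A[20] = 2.0           # strong localized feedforward input
for _ in range(200):
    m = euler_step(m, A)

peak = int(np.argmax(m))               # activity localizes at the input site
assert peak == 20
```

With the input removed, the widespread inhibition in W keeps the rest of the field below threshold, which is the single-peak regime described above.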
1136
M. Toussaint
Figure 1: Schema of the considered architecture. (A) The bottom layer is a sensor representation, projecting to units of the sensorimotor map via gaussian kernels. The top layer is a motor representation that projects to lateral connections (i j) between sensorimotor units j and i. (B) This coupling induces a multiplicative modulation of the lateral interactions in the sensorimotor map, which depends on the current motor activations.
4 Modulating the Lateral Interactions

The core of the architecture is the sensorimotor map. Its activation dynamics are very similar to those of neural fields and read

    τ_x ẋ_i = −x_i + h_x + S_i + η Σ_j [M_ij w_ij − w_I] φ(x_j) + ξ,  ξ ∼ N(0, ρ_x).    (4.1)
As for the neural field, the first term −x_i induces an exponential relaxation of the dynamics, the second term h_x is the resting level, and the third term S_i is a feedforward input from the sensor representation to unit i. We assume that the sensorial input is given as an (unnormalized) gaussian kernel,

    S_i = exp(−(s_i − s)² / (2σ_S²)),    (4.2)
that compares the input weight vector s_i (or codebook vector) of the unit i with the current sensor activations s. The fourth term describes the lateral interactions between units in the sensorimotor map. The lateral topology is not necessarily homogeneous but should reflect the topology of the state space and possible state transitions; it is given by the lateral weights w_ij. In this article, we assume that w_ij = 0 if there exists no connection and w_ij = 1 if there exists one (see Toussaint, 2004, and section 8 for a version where w_ij is continuous and learned with a temporal Hebb rule). The parameter w_I specifies the global inhibition. The crucial difference from a standard neural field is the modulation M_ij of the lateral interactions. This modulation is how motor signals couple into the sensorimotor map. More precisely, we assume that

    M_ij = ⟨m_ij, m̂⟩,    (4.3)
which is the scalar product of the weight vector m_ij and the current motor activations m̂. Thus, lateral interactions are modulated multiplicatively depending on the current motor activation. The weight vector m_ij, which is associated with every lateral connection (ij), can be thought of as the codebook vector of that connection. In a sense, lateral connections "respond" to certain motor activations. Due to the multiplicative coupling, a lateral connection contributes to lateral interaction only when the current motor activity "matches" the weight vector of this connection. Biologically plausible implementations of such modulation are, for example, pre- or postsynaptic inhibition of the signal transmission. In the case of presynaptic inhibition (Rose & Scott, 2003), synapses attach directly to the presynaptic terminal of other synapses, thereby modulating their transmission. In the case of postsynaptic inhibition (shunting inhibition), inhibitory synapses attach to branches of the dendritic tree near the soma, thereby modulating the transmission of the dendritic input accumulated at this dendritic branch (Abbott, 1991). Generally, modulation is a fundamental principle in biological neural systems (Phillips & Singer, 1997). The modulation may also be regarded as a special variant of sigma-pi neural networks (Mel, 1990; Mel & Koch, 1990).

5 Learning the Sensorimotor Map

The self-organization and learning of the sensorimotor map combine standard techniques from self-organizing maps (von der Malsburg, 1973; Willshaw & von der Malsburg, 1976; Kohonen, 1995) and their extensions with respect to growing representations (Carpenter et al., 1992; Fritzke, 1995) and the learning of temporal dependencies in lateral connections (Bishop et al., 1997; Wiemer, 2003). The free variables that need to be adapted
are (1) the number of units in the map and their lateral connectivity and (2) the weight vectors s_i and m_ij coupling to the sensor and motor layers, respectively. Except for the adaptation of the motor coupling m_ij, all the adaptation mechanisms are standard, and we keep their description brief.

The topology. There already exist numerous techniques for the self-organization of representational maps, mostly based on the early work on self-organizing maps (von der Malsburg, 1973; Willshaw & von der Malsburg, 1976; Kohonen, 1995) or vector quantization techniques (Gersho & Gray, 1991). We prefer not to predetermine the state space topology but to learn it, and hence adopt the technique of growing neural gas (Fritzke, 1995) to self-organize the lateral connectivity and that of fuzzy ARTMAPs (Carpenter et al., 1992) to account for the insertion of new units when the representation needs to be expanded. We detect novelty when the difference between the current stimulus s and the best-matching weight vector s_i becomes too large. We make this criterion more robust against noise by using a low-pass filter (leaky integrator) of this representation error. More precisely, if i* is the unit with the best match, i* = argmax_i S_i, we integrate the error measure e_i* via τ_e ė_i* = −e_i* + (1 − S_i*). Note that S_i* = 1 ⟺ s_i* = s. Whenever this error measure exceeds a threshold ν ∈ [0, 1] termed vigilance, e_i* > ν, we generate a new unit j and reset the error measures, e_i* ← 0, e_j ← 0. Exactly as for growing neural gas, we add a new lateral connection between i* and j* = argmax_{i≠i*} S_i if they were not already connected. To organize the deletion of lateral connections, we associate an "age" a_ij with every connection, which is increased at every time step by an amount M_ij φ(x_j) and is reset to zero when i and j are the best- and second-best-matching units. If a connection's age exceeds a threshold a_max, the connection is deleted.

The sensor and motor coupling.
Standard self-organizing maps adapt the input weight vectors s_i of a unit i in a Hebbian way such that s_i converges to the average stimulus for which i is the best-matching unit. To avoid introducing additional learning parameters and to make the convergence more robust, we realize this with a weighted averaging,

    s_i^(T) = (1 / Σ_{t'=1}^T α_i^(t')) Σ_{t=1}^T α_i^(t) s^(t),    (5.1)

where α_i^(t) ∈ {0, 1} determines whether i is the best-matching unit at time t. The averaging can efficiently be realized incrementally without additional parameters. We follow the same approach to adapt the motor coupling,

    m_ij^(T) = (1 / Σ_{t'=1}^T α_ij^(t')) Σ_{t=1}^T α_ij^(t) m̂^(t).    (5.2)
Here, the averaging weight α_ij^(t) ∈ {0, 1} is chosen such that m_ij learns the average motor signals that lead to an increasing postsynaptic and a decreasing presynaptic activity. In that way, m_ij learns which motor signals contribute, on average, to a transition from the stimulus s_j to a stimulus s_i. The simplest realization of this rule is

    α_ij^(t) = { 1 if ẋ_i > 0 and ẋ_j < 0;  0 else }.    (5.3)
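The incremental realization of the averages in equations 5.1 through 5.3 can be sketched as follows. The `Unit` and `Connection` classes and their bookkeeping counters are hypothetical minimal scaffolding, not the paper's implementation:

```python
import numpy as np

# Sketch of the incremental weighted averaging of equations 5.1-5.3.
# Hypothetical minimal bookkeeping: each unit keeps a count n and a sensor
# codebook vector s_i; each connection keeps a count n and a motor vector m_ij.

class Unit:
    def __init__(self, s):
        self.s = np.array(s, dtype=float)   # sensor codebook vector s_i
        self.n = 1                          # updates seen so far

    def update(self, stimulus):
        """Incremental form of equation 5.1: running mean of the stimuli for
        which this unit was the best-matching unit (alpha_i = 1)."""
        self.n += 1
        self.s += (stimulus - self.s) / self.n

class Connection:
    def __init__(self, dim_m):
        self.m = np.zeros(dim_m)            # motor codebook vector m_ij
        self.n = 0

    def update(self, m_hat, xdot_i, xdot_j):
        """Equations 5.2-5.3: average the motor activations observed while
        postsynaptic activity rises and presynaptic activity falls."""
        if xdot_i > 0 and xdot_j < 0:       # alpha_ij^(t) = 1
            self.n += 1
            self.m += (m_hat - self.m) / self.n

unit = Unit([0.0, 0.0])
unit.update(np.array([1.0, 0.0]))           # mean of initial vector and stimulus

conn = Connection(dim_m=4)
conn.update(np.array([1, 0, 0, 0.]), xdot_i=+0.2, xdot_j=-0.1)  # counted
conn.update(np.array([0, 1, 0, 0.]), xdot_i=-0.2, xdot_j=-0.1)  # ignored
```

The running-mean form `m += (m_hat - m) / n` is algebraically identical to the batch averages in equations 5.1 and 5.2, which is why no extra learning-rate parameter is needed.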
5.1 Experiments. All experiments will consider the problem of controlling a limb with position y ∈ [−1, 1]² in a two-dimensional plane or maze. In this experiment, the sensor representation is directly the 2D coordinate of the limb, that is, s = y (see section 8 for an example where the sensor representation is based on range measurements). The motor representation is given by 20 units, m̂ ∈ [0, 1]²⁰, which encode 20 different bearing directions ϕ_i ∈ {0°, 18°, . . . , 342°}. Activations of motor units directly lead to a limb movement with velocity ẏ according to the law

    (ẏ_1, ẏ_2) = Σ_{i=1}^{20} m̂_i (cos ϕ_i, sin ϕ_i).    (5.4)
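Equation 5.4 amounts to an activation-weighted sum of unit vectors; a minimal sketch (bearing values as defined above, border clipping not included):

```python
import math

# Sketch of the movement law, equation 5.4: 20 motor units encode the bearings
# 0, 18, ..., 342 degrees; the limb velocity is the activation-weighted sum of
# the corresponding unit vectors.

PHI = [math.radians(18 * i) for i in range(20)]   # bearings phi_i

def limb_velocity(m_hat):
    """Equation 5.4 (without the border/wall clipping)."""
    y1 = sum(m * math.cos(p) for m, p in zip(m_hat, PHI))
    y2 = sum(m * math.sin(p) for m, p in zip(m_hat, PHI))
    return y1, y2

# a single active unit moves the limb along that unit's bearing:
m_hat = [0.0] * 20
m_hat[5] = 1.0                                    # phi_5 = 90 degrees
v = limb_velocity(m_hat)
assert abs(v[0]) < 1e-12 and abs(v[1] - 1.0) < 1e-12
```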
At the borders or walls of a maze, this law is violated such that ẏ_1 or ẏ_2 is set to zero when otherwise the border or wall would be crossed. In the first experiment, the limb performs random movements that are induced by explicitly coupling a random signal A_i into the motor layer (see equation 2.1). A random signal is generated by randomly picking a motor unit i* and choosing A_i* = 1 while A_i = 0 for all i ≠ i*. The signal is not randomized at every time step; instead, at each time step the signal remains unchanged with probability .8, and a new i* is chosen with probability .2. These movements generate the data (the sequences of sensor and motor signals m^(t) and s^(t)) from which the sensorimotor map learns the dependencies between motor signals and stimulus changes. Our choice of parameters for the dynamics of the sensorimotor map and motor layer is shown in Table 1. Those for adaptation are τ_e = 10, ν = .2, and a_max = 300. During the learning phase, the lateral coupling (which will induce anticipation) is switched off (η = 0). Figure 2A displays the topology of the sensorimotor map that has been learned for the 2D plane after various numbers of time steps. In all displays, the units are positioned according to their sensor weight vectors s_i. Concerning the topology, we basically reproduce the standard behavior of growing neural gas: in the early phase, the map grows as more and more regions are
Table 1: Parameters.

                       τ     h     w_E   σ_E   w_I   ρ     η     σ_S
    Sensorimotor map   2     0     –     –     .5    .01   0     .05
    Motor layer        5     −1    1     2     .6    .01   –     –
explored. In the late phase, unnecessary connections are deleted, leading to a Voronoi-like graph. Figures 2B and 2C are two different illustrations of the learned motor weight vectors m_ij. To compute these diagrams, we first associate an angle θ_ij = ∠(s_j − s_i) with every connection in the sensorimotor map. These angles θ_ij correspond to the true geometrical direction of a transition from j to i. Figure 2B displays tuning diagrams for 10 different motor units. For a given motor unit k, we consider all connections (ij) and draw a line with orientation θ_ij and length (m_ij)_k. The diagrams exhibit that motor units that represent a certain bearing ϕ_k have larger weights to connections with similar bearing θ_ij. The tuning curve in Figure 2C displays the same data in another way: for every motor unit k and connection (ij), the weight (m_ij)_k is plotted against the difference θ_ij − ϕ_k. Finally, Figure 2D displays the learning curve with respect to an error measure for the weight vectors m_ij: as every motor unit k corresponds to a bearing ϕ_k, every activation pattern m̂ over motor units corresponds to an average bearing ϕ(m̂) (cf. equation 5.4). The weight vectors m_ij are such activation patterns and thus correspond to average bearings ϕ(m_ij). The error measure is the absolute difference between this bearing ϕ(m_ij) and the geometrical direction θ_ij, averaged over all connections (ij). The graph shows that this error measure does not fully converge to zero. Indeed, most of this error is accumulated at the border of the region for an obvious reason: according to the "physics" we defined, a motor command that would diagonally cross a border leads to a movement parallel to the border instead of a full stop. Thus, at the borders, a whole variety of motor commands exists that all lead to the same movement parallel to the border. Connections between two units parallel to a border thus learn an average of motor commands that also includes diagonal motor commands.
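The mapping from an activation pattern to its average bearing ϕ(m̂), and the angular error used above, can be sketched as follows (the helper names are hypothetical):

```python
import math

# Sketch of how an activation pattern over bearing-tuned motor units maps to
# an average bearing phi(m_hat) (cf. equation 5.4), and of the angular error
# between phi(m_ij) and the geometric connection direction theta_ij.

PHI = [math.radians(18 * i) for i in range(20)]   # unit bearings phi_k

def mean_bearing(m_hat):
    """Bearing of the activation-weighted sum of unit vectors."""
    c = sum(m * math.cos(p) for m, p in zip(m_hat, PHI))
    s = sum(m * math.sin(p) for m, p in zip(m_hat, PHI))
    return math.atan2(s, c)

def bearing_error(m_ij, theta_ij):
    """Absolute angular difference between phi(m_ij) and theta_ij,
    wrapped into [0, pi]."""
    d = abs(mean_bearing(m_ij) - theta_ij) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

# two equally active neighboring units average to the bearing between them:
m = [0.0] * 20
m[0] = m[1] = 1.0                                 # bearings 0 and 18 degrees
assert abs(math.degrees(mean_bearing(m)) - 9.0) < 1e-9
```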
6 Anticipation

The sensorimotor map as introduced so far is sufficient for short-term anticipations. When the sensorimotor space is explored as previously with random movements, and given the map as learned in the previous example,
[Figure 2A panels show the map at t = 1000, 2000, 10000, and 100000.]
Figure 2: (A) The topology of the sensorimotor representation learned for the 2D region at different times. (B) Tuning diagrams for 10 of the 20 motor units (we display only every second unit to save space): for a motor unit k, lines with length (m_ij)_k and orientation θ_ij are drawn. (C) The tuning curve of motor units for all motor units and lateral connections: the weight (m_ij)_k is plotted against the difference of orientation of the motor unit (ϕ_k) and the connection (θ_ij). (D) The learning curve of an error measure for the difference in bearing represented by m_ij and the geometrical direction θ_ij. Errors occur mostly at the borders. See section 5.1 for more details.
we may compare the actual current stimulus s to what the sensorimotor map currently represents,

    s̄ = (1 / Σ_i x_i) Σ_i x_i s_i.    (6.1)
We term the quantity s̄ the represented stimulus, which may in general differ from the true stimulus s; we term the difference Δs̄ = s̄ − s the representational shift. The approximate nature of the representation is one obvious source of representational shift: even when the lateral couplings are turned off (η = 0), there might be small shifts because the map is coarse-grained. In our case, most of such representation errors stem, again, from the borders. Since there exist no units to represent positions beyond a border and since the activations x_i typically have a gaussian-like shape over the units i, the represented stimulus s̄ for a stimulus s at the border will always have a slight inward shift of the order of σ. The results we give will omit this effect by discarding data from the border of the region. We collected data for three different strengths η ∈ {0, .2, .5} of lateral interaction. The two measures we discuss are the norm of the representational shift,

    RSN = |Δs̄|,    (6.2)

and the directional match RSD of the representational shift with the true change in stimulus Δs^(t) = s^(t+1) − s^(t) that occurs due to the motor activations,

    RSD = ⟨Δs̄, Δs⟩ / (|Δs̄| |Δs|) ∈ [−1, 1].    (6.3)
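For concreteness, equations 6.1 through 6.3 can be sketched on illustrative toy values (activations, codebook vectors, and stimulus values below are made up):

```python
import numpy as np

# Sketch of the represented stimulus (equation 6.1) and the shift measures
# RSN (equation 6.2) and RSD (equation 6.3), on made-up toy values.

def represented_stimulus(x, S):
    """Equation 6.1: activation-weighted mean of the codebook vectors.
    x: activations (n,), S: codebook vectors (n, d)."""
    return (x @ S) / x.sum()

def shift_measures(s_bar, s, ds):
    """RSN and RSD for the shift s_bar - s and the true stimulus change ds."""
    shift = s_bar - s
    rsn = np.linalg.norm(shift)
    rsd = shift @ ds / (np.linalg.norm(shift) * np.linalg.norm(ds))
    return rsn, rsd

x = np.array([0.2, 1.0, 0.2])                    # activations over three units
S = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]])  # their codebook vectors
s = np.array([0.05, 0.0])                        # true stimulus
ds = np.array([0.01, 0.0])                       # true change (moving right)
rsn, rsd = shift_measures(represented_stimulus(x, S), s, ds)
assert abs(rsd - 1.0) < 1e-9                     # shift aligned with movement
```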
The results are displayed in Figure 3. All numbers are the averages (and standard deviations) over 2205 data points taken when the limb moves, in random directions as described previously, in the central area of the plane. For η = 0, we find that the norm of the representational shift (RSN = .015 ± .009) is, as expected, very small when compared to the kernel width σ = .05. The shift direction is not correlated to the true stimulus change (RSD = .0054 ± .7). Thus, for η = 0, the internally represented stimulus s¯ is fully dominated by the true stimulus s, and small representational shifts stem from the approximate nature of the representation. For η = .2 we find significantly larger shifts, RSN = .075 ± .042. More important, though, we find a strong correlation in the direction of the representational shift and the true future change of the stimulus, RSD = .89 ± .27. For η = .5, both effects are even stronger: RSN = .20 ± .17 and
Figure 3: The norm RSN of the representational shift and the correlation measure RSD between representational shift and the current change of stimulus, for different strengths η of the lateral coupling in the sensorimotor map. With nonzero lateral coupling, the represented stimulus is shifted in the same direction as the current true stimulus change.
RSD = .93 ± .21. For any η, the norm of the true stimulus change is |Δs| = .036 ± .017. The results clearly show that the representational shift Δs̄ encodes an anticipation of the true change of stimulus; that is, the represented stimulus s̄ is an anticipation of a future stimulus that will be perceived depending on the current motor activations. The motor modulation of the lateral interactions is able to direct the representational shift toward the direction that corresponds to the motor signals. This effect is seen much better visually, by watching the recordings¹ of the activations on the sensorimotor map and the dynamics of the two positions that correspond to s̄ and s (see also Figure 4). For η = 0, both s̄ and s move very coherently, almost always overlapping; only at the borders is there a systematic inward shift. For η = .2, the activity peak of the field x is always slightly ahead of the true stimulus; the represented position s̄ always runs ahead of the true limb position s. When the motor activations change, s̄ sweeps in front of s toward the new movement bearing. For η = .5, the situation becomes more dramatic. The lateral interactions become dominant such that the field activations x actually run away from the true stimulus, traveling self-sustained in the direction of the current movement. This "wave" breaks down at the border of the sensorimotor map, and the activation peak is recreated at the current stimulus. Thus, the represented position s̄ travels quickly away from the true limb position s in the movement direction until it hits the border and restarts from s.
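The anticipatory shift can be reproduced in a minimal sketch of equations 4.1 through 4.3 on a hypothetical 1D chain of units (illustrative parameters in the η = .2 regime; the noise term is omitted). With a "rightward" motor activation, the represented stimulus of equation 6.1 ends up slightly ahead of the true stimulus:

```python
import numpy as np

# Minimal sketch of equations 4.1-4.3 on a hypothetical 1D chain of K units.
# Rightward lateral links carry a motor weight vector that "responds" to motor
# unit 1, leftward links to motor unit 0; noise omitted, parameters illustrative.

K = 21
tau_x, h_x, w_I, eta, sigma_S = 2.0, 0.0, 0.5, 0.2, 0.05
s_cb = np.linspace(-1.0, 1.0, K)                 # sensor codebook vectors s_i

w = np.zeros((K, K))                             # lateral weights w_ij
m_ij = np.zeros((K, K, 2))                       # motor weight vectors m_ij
for j in range(K - 1):
    w[j + 1, j] = 1.0; m_ij[j + 1, j] = (0.0, 1.0)   # j -> j+1: rightward
    w[j, j + 1] = 1.0; m_ij[j, j + 1] = (1.0, 0.0)   # j+1 -> j: leftward

def phi(x):                                      # sigmoid of equation 2.2
    return np.clip(x, 0.0, 1.0)

def step(x, s, m_hat, dt=1.0):
    S = np.exp(-(s_cb - s) ** 2 / (2 * sigma_S ** 2))  # equation 4.2
    M = m_ij @ m_hat                                   # equation 4.3
    lateral = (M * w - w_I) @ phi(x)                   # sum in equation 4.1
    return x + dt * (-x + h_x + S + eta * lateral) / tau_x

x = np.zeros(K)
for _ in range(200):
    x = step(x, s=0.0, m_hat=np.array([0.0, 1.0]))     # "move right"

s_bar = (phi(x) @ s_cb) / phi(x).sum()                 # equation 6.1 readout
assert s_bar > 0.005   # represented stimulus shifted ahead of the true stimulus
```

Because only the rightward connections "match" the current motor activation, the unit to the right of the activity peak receives extra lateral excitation, which is exactly the mechanism behind the directional shift measured by RSD.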
¹ Access and watch the recordings online at www.marc-toussaint.net/projects.
Figure 4: Anticipation of future stimuli. (A) The forward excitation S_i, which encodes the true current stimulus s. The gray shading indicates the value of S_i ∈ [0, 1]; for better visibility, edges (ij) are shaded with the average value (S_i + S_j)/2. The black arrow indicates the direction encoded by the current motor activations. (B) The activation field x_i on the sensorimotor map. It exhibits a significant shift in the direction of movement, thus encoding an anticipation of future stimuli depending on the current motor activations. See also note 1.
7 The Dynamics of Planning

To organize goal-oriented behavior, we assume that, in parallel to the activation dynamics of x, there exists a second dynamic process that can be motivated from classical approaches to reinforcement learning (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998). Recall the Bellman equation,

    V*_π(i) = Σ_a π(a|i) Σ_j P(j|i, a) [r(j) + γ V*_π(j)],    (7.1)
which is satisfied by the expectation V*(i) of the discounted future return R^(t) = Σ_{τ=1}^∞ γ^{τ−1} r^(t+τ) (for which R^(t) = r^(t+1) + γ R^(t+1)). Here, i is a state index, and γ is the discount factor. We presumed that the received rewards r^(t) actually depend only on the state and thus enter equation 7.1 only in terms of the reward function r(i) (we neglect here that rewards may directly depend on the action). Behavior is described by a stochastic policy π(a|i), the probability of executing an action a in state i. Given the property 7.1 of V*, it is straightforward to define a recursion algorithm for an approximation V of V* such that V converges to V*. This recursion algorithm is called value iteration (Sutton & Barto, 1998) and reads

    τ_v ΔV_π(i) = −V_π(i) + Σ_a π(a|i) Σ_j P(j|i, a) [r(j) + γ V_π(j)],    (7.2)
with a "reciprocal learning rate" or time constant τ_v. Note that equation 7.1 is the fixed-point equation of equation 7.2. Equation 7.2 provides an iterative scheme to compute the state-value function V based only on local information. The practical meaning of the state-value function is that it quantifies how desirable and promising it is to reach a state i, also accounting for future rewards to be expected. If rewards are given only at a single goal state, V has its maximum at this goal and is higher the more easily the goal can be reached from a given state. Thus, if the current state is i, it is a simple and efficient rule of behavior to choose an action a that will lead to the neighbor state j with maximal V(j) (the greedy policy). In that sense, V(i) provides a smooth gradient toward desirable goals. Note, though, that direct value iteration presumes that the state and action spaces are known and finite and that the current state and the world model P(j|i, a) are known. In transferring these classical ideas to our model, we assume that the system is given a goal stimulus g; that is, it is given the command to reach a state that corresponds to perceiving the stimulus g. Just as ordinary stimuli induce an input S_i to the field activations x_i, we let the goal stimulus induce a reward excitation,

    R_i = (1/Z) exp(−(s_i − g)² / (2σ_R²)),    (7.3)
for each unit i, where Z is chosen such that Σ_i R_i = 1. Besides the activations x_i, we introduce an additional field over the sensorimotor map, the value field v_i, which is analogous to the state-value function V(i). Its dynamics are

    τ_v v̇_i = −v_i + R_i + γ max_j (w_ji v_j),    (7.4)
and are well comparable to equation 7.2. One difference is that v_i estimates the "current-plus-future" reward r^(t) + γ R^(t) rather than the future reward only. In the notation above, this corresponds to the value iteration τ_v ΔV_π(i) = −V_π(i) + r(i) + Σ_a π(a|i) Σ_j P(j|i, a) γ V_π(j). As is commonly done for value iteration, we assumed π to be the greedy policy. More precisely, we considered only that action (i.e., that connection (ji)) that leads to the neighbor state j with maximal value w_ji v_j. In effect, the summations over a as well as over j can be replaced by a maximization over j.
Finally, we replaced the probability factor P(j|i, a) by w_ji. In practice, the value field will relax quickly to its fixed point v_i* = R_i + γ max_j (w_ji v_j*) and stay there if the goal does not change. The quasi-stationary value field v_i together with the current (nonstationary) activations x_i allows the system to generate motor excitations that lead toward the goal. More precisely, the gradient v_j − v_i in the value field indicates how desirable the motor activations m_ji are when the current "state" is i. Goal-directed motor excitations can thus be generated as a weighted average of the motor activations m_ji that have been learned for the connections,

    A = (1/Z) Σ_{i,j} x_i w_ji (v_j − v_i) m_ji,    (7.5)
where Z is chosen to normalize |A| = 1. These excitations enter the motor activation dynamics, equation 2.1. Hence, signals flow between the sensorimotor map and the motor system in both directions. In the anticipation process, signals flow from the motor layer to the sensorimotor map: motor signals activate the corresponding connections and cause lateral, predictive excitations. In the action selection process, signals are emitted from the sensorimotor map back to the motor layer to induce the motor excitations that should turn predictions into reality.

7.1 Experiments. To demonstrate the planning capabilities of the sensorimotor map, we consider a 2D maze. In the first phase, a sensorimotor map is learned that represents the specific maze environment, using random explorations as described in section 5. Figure 5A displays the topology of the learned sensorimotor map after 100,000 iterations, now with a kernel width σ_S = .01. In the planning phase, a goal stimulus is applied that corresponds to the position indicated by a triangle in Figure 5B. This goal stimulus induces reward excitations R_i on units that match the goal stimulus closely. The value field dynamics, equation 7.4, quickly relax to their fixed point, which is displayed in Figure 5C. The parameters we used are τ_v = 5, γ = .9, and σ_R = σ_S/4. As expected, the value field activations are high for units representing the proximity of the goal location and decay smoothly along the connectivity of the sensorimotor map. Note that this value field is not a decaying function of the Euclidean distance to the goal but approximately a decaying function of the topological distance to the goal, that is, the shortest path length with respect to the learned topology. Figure 5B illustrates a trial where the limb is initially located in the upper-right corner of the maze. The activation field x_i represents this current location.
Together with the gradient of the value field at the current location (see equation 7.5), motor excitations are induced that let the limb
A Sensorimotor Map
1147
Figure 5: Experiments with a maze. (A) The topology of the sensorimotor map learned. (B) The activation field x_i on the sensorimotor map at the start of the trajectory. (C) The attractor state of the value field on the sensorimotor map, spreading from the goal location in the lower right.
move toward the bottom left. As the limb moves, the sensorimotor activities x_i follow its current position, and new motor excitations are induced continuously, which leads to a movement trajectory ascending the gradient of the value field until the goal is reached. In the experiment, once the goal is reached, we switch the goal to a new random location, inducing new reward excitations R_i. The value dynamics, equation 7.4, respond quickly to this change and relax to a new fixed point, providing the gradient to the new goal. A standard quality measure for planning techniques is the required time to goal. Figure 6 displays the time intervals between switching the goals, which are the times needed to reach the new goal position from the previous goal position. First, we see that the goal is always reached in finite time, indicating that planning is always successful. Further, the graph compares the time to reach the goal with the length of the shortest path. This shortest path length was computed from a coarse (block-wise) discretization of the maze with dynamic programming. The clear correlation between the time to reach the goal and the shortest path length shows that the movement of the limb indeed follows a planned shortest-path trajectory from the initial position to the goal. Another indicator for successful action selection is whether the current movement is in the direction of the value gradient. Figure 7A displays the bearing of the local value gradient, ∠(Σ_{i,j} x_i (v_j − v_i)(s_j − s_i)), and the bearing of the current movement, ∠ẏ, for the first 300 time steps of the experiment. We observe a clear correlation between both bearings, though with a systematic time delay. This time delay is approximately six time steps, as can be read from Figure 7B, and corresponds to the timescale of the motor dynamics, τ_m = 5. (See note 1 to access more recordings of planning experiments.)
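The shortest-path reference used in Figure 6 can be reproduced by any dynamic-programming-style search on the block-wise grid. The text does not specify the exact algorithm, so the breadth-first sketch below, on a hypothetical 3×3 maze, is only a minimal stand-in:

```python
from collections import deque

def shortest_path_length(grid, start, goal):
    """Breadth-first search on a coarse block-wise maze discretization
    (0 = free, 1 = wall); returns the number of steps on a shortest
    path, or None if the goal is unreachable."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return dist[(r, c)]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in dist):
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return None

# Toy maze: a wall forces a detour from the top-left to the bottom-left cell.
maze = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
length = shortest_path_length(maze, (0, 0), (2, 0))
```

Comparing such a reference length against the measured time to goal gives exactly the kind of correlation plot shown in Figure 6.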
1148
M. Toussaint
Figure 6: Times needed to move from a random start position to a random goal position in the maze when the sensorimotor map plans and controls the movement. The time is plotted against the length of the shortest path connection, which was computed from a coarse (block-wise) discretization of the maze with dynamic programming.
8 Extensions

We tried to keep the model introduced so far simple and focused on the key mechanisms. This basic model can be extended in many straightforward ways to realize other desired functionalities. For instance, the path generated by the sensorimotor map in the previous example (see Figure 5B) clearly hits the walls very often. This should be no surprise, since no mechanism of obstacle avoidance is implicit in the model so far. But it is easy to apply a standard obstacle avoidance technique: given distance signals d_i ∈ [0, 1] from 20 range sensors (in the 20 different bearings ϕ_i) around the limb, an inhibition (e.g., proportional to (1 − d_i)^3) can be directly coupled into the motor activations m_i. Figure 8A displays a trajectory generated with this obstacle avoidance switched on. Perhaps more important is the question of whether the local lateral weights w_ij should be learned instead of fixed to 1 if a connection exists and 0 otherwise. In Toussaint (2004) we presented a learning scheme for these weights based on the temporal Hebb rule. One of the main reasons to consider the continuous plasticity of these lateral weights was that this allows the model to adapt to change. We decided to keep the simpler alternative where w_ij ∈ {0, 1}. The adaptability can also be achieved with the basic mechanism of adapting the topology to a change of the world, that is, by adding or deleting connections.
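A minimal sketch of such an obstacle-avoidance coupling, assuming each sensor's inhibition pushes opposite to its bearing ϕ_i (the vector form and sensor layout are illustrative; the text only fixes the (1 − d_i)^3 shape):

```python
import numpy as np

def avoidance_inhibition(distances, bearings):
    """Accumulate inhibitory contributions (1 - d_i)^3, each pushing
    away from the bearing phi_i of its range sensor; d_i in [0, 1],
    with d_i = 1 meaning no obstacle in range."""
    inhibition = np.zeros(2)
    for d, phi in zip(distances, bearings):
        # cubic term: strong inhibition only when an obstacle is close
        inhibition -= (1.0 - d) ** 3 * np.array([np.cos(phi), np.sin(phi)])
    return inhibition

bearings = np.linspace(0, 2 * np.pi, 20, endpoint=False)
d = np.ones(20)
d[0] = 0.2                      # obstacle close in the positive-x direction
push = avoidance_inhibition(d, bearings)   # points away from the obstacle
```

The cubic shape makes the inhibition negligible until an obstacle is close, so the planned value-gradient movement dominates in open space.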
Figure 7: (A) The movement direction ∠ẏ and the direction of the value gradient ∠(Σ_{i,j} x_i (v_j − v_i)(s_j − s_i)) for the first 300 time steps. (B) Similar to a convolution between both curves, we plot the average squared difference Σ_t ‖h(t) − f(t + τ)‖² between both curves when one of them is shifted by a time delay τ. (We chose a norm ‖·‖² that accounts for the cyclic metric in angle space.) The typical time shift is τ* = 6.
Figure 8: Results of three different extensions of the sensorimotor map. (A) A trajectory with obstacle avoidance. (B) A trajectory from start S to goal G when the paths were blocked at A and B. (C) A sensorimotor map learned from range sensors. See section 8 for details.
Recall our rule to delete connections. As for growing neural gas (Fritzke, 1995), we associate an "age" a_ij with every connection and delete a connection when its age exceeds a threshold a_max. The difference from Fritzke is that we increase every connection's age by an amount M_ij φ(x_j) at every time step. As a result, if during execution of a planned trajectory an anticipated transition to a new stimulus does not occur, then the connections that contribute to this anticipation (for which M_ij φ(x_j) is high) will eventually be deleted. This adaptation of the topology has a crucial influence on the dynamics of the value field. If all connections of a specific pathway are deleted, the attractor of the value field rearranges to guide a way around this blocked pathway.
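This aging rule can be sketched as follows; the array shapes and step counts are assumptions, and M_ij and φ(x_j) are taken as given quantities from the model:

```python
import numpy as np

n = 5
age = np.zeros((n, n))
connected = np.ones((n, n), dtype=bool)
a_max = 10.0

def update_ages(age, connected, M, phi_x):
    """Grow every connection's age by M_ij * phi(x_j) and mark as
    disconnected any connection whose age exceeds a_max."""
    age += M * phi_x[None, :]          # age_ij += M_ij * phi(x_j)
    connected &= ~(age > a_max)        # topology adapts: pathway removed

# A connection that keeps anticipating a transition that never occurs
# (high M_10 * phi(x_0) at every step) is eventually pruned.
M = np.zeros((n, n))
M[1, 0] = 1.0
phi_x = np.zeros(n)
phi_x[0] = 1.0
for _ in range(11):
    update_ages(age, connected, M, phi_x)
```

Because only connections with high M_ij φ(x_j) age quickly, pruning is selective: the rest of the learned topology is untouched while the blocked pathway disappears.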
Figure 8B displays such an interesting trajectory, generated for a_max = 10: the limb is initially located at S and the goal location is G. The system has learned a representation of the original maze, as given in Figure 5A. But the maze has now been modified by blocking the pathways at A and B. The system first tries to follow a direct path, which is blocked at A. When the limb hits this barrier and continuously activates the connections crossing A (in terms of M_ij φ(x_j)), they are eventually deleted, which makes the value field rearrange and guide along another path crossing B. Now the limb hits the barrier at B and connections are deleted, which finally leads to a path that allows the limb to reach the goal. (See note 1 to access recordings of these experiments.) Finally, since the stimulus kernel (see equation 4.2) was chosen as a gaussian, the stimulus can also be given in representations other than directly as the location y of the limb. For instance, Figure 8C displays the sensorimotor map learned for the plane when the input was given as a 40-dimensional range vector (d_1, . . . , d_40), with each d_i ∈ [0, 1] for 40 different bearings. The only difference from the setup in section 5 was the choice of the kernel width: now we used σ_S = 1. The learned topology is slightly more dense close to the borders. This stems from the fact that the range vector changes more dramatically close to a wall, since the visual angle under which the wall is seen (and thus the number of entries of the range vector affected by the wall) varies more. Anticipation and planning work equally well for this stimulus representation. However, the model is not sufficient to handle ambiguous (partially observable) stimuli. For example, in the maze there exist many positions with very similar range sensor profiles. A sensorimotor map of the maze learned with only range sensor data would lead to an incorrect topology (cf. section 10).
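Since only the kernel evaluation depends on the stimulus format, switching from 2D positions to 40-dimensional range vectors amounts to changing the kernel's input and width. A sketch, assuming an unnormalized gaussian form for the kernel of equation 4.2 (the exact normalization is an assumption):

```python
import numpy as np

def kernel_activation(s, c, sigma_S):
    """Unnormalized gaussian stimulus kernel (assumed form of equation
    4.2): activation of a unit with codebook vector c for stimulus s."""
    return float(np.exp(-np.sum((s - c) ** 2) / (2 * sigma_S ** 2)))

# A 2D position stimulus with the narrow kernel used in the maze ...
a_pos = kernel_activation(np.array([0.5, 0.5]), np.array([0.5, 0.6]), 0.01)

# ... versus a 40-dimensional range profile with the wide kernel sigma_S = 1.
rng = np.random.default_rng(1)
c = rng.random(40)                     # hypothetical unit codebook vector
a_range = kernel_activation(c + 0.05, c, 1.0)
```

With σ_S = 0.01 the 2D kernel is sharply tuned, while σ_S = 1 keeps the 40-dimensional kernel responsive to small range-profile perturbations, matching the retuning described in the text.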
9 Discussion

A key mechanism of the sensorimotor map is that motor activations modulate the lateral connection strengths and thereby induce anticipatory shifts of the activity peak on the sensorimotor map. This modulatory sensorimotor coupling encodes a model of the change of stimuli depending on the current motor activities. The mechanism attributes a specific role to the lateral connectivity, namely motor-modulated anticipatory excitation, which differs significantly from previously proposed roles for lateral connections. However, we believe that the different views on the roles of lateral connections do not compete but complement each other; lateral connections may play different roles depending on the context and function of the respective neural representation. For instance, the role of lateral connections has been extensively discussed in the context of pure sensor representations, in particular for the visual cortex (Miikkulainen, Bednar, Choe, & Sirosh, 2005). In such sensor representations, the function of lateral connections could be subsumed as either enforcing coherence or competition between laterally
connected units. The formation of topographic maps, columnar structure, or patterns of orientation-selective receptive fields can be explained on this basis (e.g., Bednar et al., 2002). The stabilization of noisy or occluded stimuli and the disambiguation between contradictory stimuli can also be modeled, for example, with standard neural field dynamics involving local excitatory and global inhibitory lateral connections (Erlhagen & Schöner, 2002). In the context of temporal signal representations, the function of lateral connections is to induce specific temporal dynamics, learned, for example, with a temporal Hebb rule (spike-time-dependent plasticity; Dayan & Abbott, 2001). Self-organized temporal map models (Euliano & Principe, 1999; Somervuo, 1999; Wiemer, 2003; Varsta, 2002; Klemm & Alstrom, 2002) can learn to anticipate stimuli, for example, when a stimulus B always follows a stimulus A. The role we attributed to lateral connections naturally differs from these models, since we consider a sensorimotor representation where anticipation needs to depend on the current motor activities and for which we proposed the modulatory sensorimotor coupling. Long-term prediction, for example, path integration (see Etienne & Jeffery, 2004, for a review), is clearly related to the sensorimotor anticipation that we addressed with our model. Some existing models of path integration are based on one- or two-dimensional representational maps of position or head direction, and anticipation is realized by a motor-dependent translation of the activity pattern. For instance, in Hartmann and Wehner (1995) and Song and Wang (2005), the translational shift on a one-dimensional head direction representation is realized with two additional layers of inhibitory neurons (one for right and one for left movements) that are coupled to the motor system.
Zhang (1996) achieves an exact translation of the activity pattern by coupling a derivative term in the dynamics, while Stringer, Rolls, Trappenberg, and Araujo (2002) induce translational shifts on a two-dimensional place field representation with a complex coupling of head direction units and forward velocity units into the lateral place field dynamics, in effect similar to our approach. None of these approaches addresses the problem of planning or the emission of motor signals based on the learned forward model. Although our model implements sensorimotor anticipation, it is in its current form limited with regard to the task of exact path integration: only the direction but not the magnitude of anticipatory shifts is guaranteed to be correlated with the true movement, as the experiments in section 6 demonstrate. However, future extensions of the model might solve this problem, for example, by a precise tuning of the parameter η that allows us to calibrate the magnitude of the anticipatory shift (see Figure 3). In the context of machine learning, predictive forward models are typically learned as a function, for example, with a neural network (Jordan & Rumelhart, 1992; Wolpert, Ghahramani, & Jordan, 1995; Ghahramani, Wolpert, & Jordan, 1997; Wolpert, Ghahramani, & Flanagan, 2001). It is assumed that a goal trajectory is readily available such that the learned
function allows one to compute the motor signals necessary to follow this trajectory. A representational map of the state space is not formed. In contrast, some model-based reinforcement learning systems have addressed the self-organization of state space representations (Kröse & Eecen, 1994; Zimmer, 1996; Appl, 2000), for example, by using discrete fuzzy representations (e.g., the Fuzzy-ARTMAPs; Carpenter et al., 1992). However, these approaches do not propose a direct coupling of motor activities into a sensorimotor representation to realize anticipation and planning by neural dynamics on this representation; instead, they use the learned representation as an input to separate, standard reinforcement learning architectures.

10 Conclusion

The sensorimotor map we describe in this letter proposes a new mechanism of how motor signals can be jointly coupled with sensor signals on a sensorimotor representation. The immediate function of this sensorimotor map and the proposed modulatory sensorimotor coupling is the anticipation of the change of stimuli depending on the current motor activity. Anticipation on its own is a fundamental ingredient of sensorimotor control, for example, to consolidate noisy sensory information by fusing it with the temporal prediction or to bridge the inherent time lag of sensory information. However, the ability to anticipate also provides the basic ingredient for planning. The forward model implicitly encoded by the sensorimotor map can be used to realize planning. We considered standard reinforcement learning techniques as a starting point and proposed a value dynamics on the sensorimotor map that performs basically the same computations as value iteration. For this to work in a neural systems framework, it is crucial that there exists a neural representation of the state space. The sensorimotor map provides such a representation.
The self-organization and learning processes that develop the sensorimotor map do not set principled constraints on the type of sensor and motor representations coupled to the map. However, a more general problem remains unsolved and is a limiting factor: in our model, too, the self-organization of the sensorimotor representation is mainly sensor driven. This leads to problems when different states induce the same stimulus (partial observability, stimulus ambiguity), since the current self-organization process is not able to grow separate units for the same stimulus. The self-sustaining and anticipatory dynamics of the sensorimotor map are able to disambiguate such states depending on the temporal context, but a prerequisite is the existence of multiple units associated with the same stimulus. This leads us back to the challenge of understanding higher-level cognitive processes like internal simulation and planning, as mentioned in the context of Köhler's classic monkey experiments. The basic mechanisms of anticipation and planning proposed in this letter, in particular the
action-dependent modulation of lateral interactions, might be transferable to such higher-level representations. An open question is how animals and humans are able to organize such higher-level abstract representations, which clearly are not purely sensor based but state abstractions that capture both the sensor context and its relevance for behavior.

Acknowledgments

I thank the German Research Foundation (DFG) for funding the Emmy Noether fellowship TO 409/1-1, allowing me to pursue this research.

References

Abbott, L. (1991). Realistic synaptic inputs for network models. Network: Computation in Neural Systems, 2, 245–258.
Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27, 77–87.
Appl, M. (2000). Model-based reinforcement learning in continuous environments. Unpublished doctoral dissertation, Institut für Informatik, Technische Universität München.
Bagchi, S., Biswas, G., & Kawamura, K. (2000). Task planning under uncertainty using a spreading activation network. IEEE Transactions on Systems, Man and Cybernetics A, 30, 639–650.
Bednar, J. A., Kelkar, A., & Miikkulainen, R. (2002). Modeling large cortical networks with growing self-organizing maps. Neurocomputing, 44–46, 315–321.
Bergener, T., Bruckhoff, C., Dahm, P., Janßen, H., Joublin, F., Menzner, R., Steinhage, A., & von Seelen, W. (1999). Complex behavior by means of dynamical systems for an anthropomorphic robot. Neural Networks, 12, 1087–1099.
Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-dynamic programming. Nashua, NH: Athena Scientific.
Bishop, C., Hinton, G., & Strachan, I. (1997). GTM through time. In Proc. of IEE Fifth Int. Conf. on Artificial Neural Networks (pp. 111–116). London: IEE.
Carpenter, G., Grossberg, S., Markuzon, N., Reynolds, J., & Rosen, D. (1992). Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks, 5, 698–713.
Dahm, P., Bruckhoff, C., & Joublin, F. (1998). A neural field approach to robot motion control. In Proc. of 1998 IEEE Int. Conf. on Systems, Man, and Cybernetics (SMC 1998) (pp. 3460–3465). New York: IEEE.
Dayan, P., & Abbott, L. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Erlhagen, W., & Schöner, G. (2002). Dynamic field theory of movement preparation. Psychological Review, 109, 545–572.
Etienne, A. S., & Jeffery, K. J. (2004). Path integration in mammals. Hippocampus, 14, 180–192.
Euliano, N., & Principe, J. (1999). A spatio-temporal memory based on SOMs with activity diffusion. In Workshop on Self-Organizing Maps (pp. 253–266). Helsinki.
Fritzke, B. (1995). A growing neural gas network learns topologies. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 625–632). Cambridge, MA: MIT Press.
Gersho, A., & Gray, R. (1991). Vector quantization and signal compression. Boston: Kluwer.
Ghahramani, Z., Wolpert, D. M., & Jordan, M. I. (1997). Computational models of sensorimotor integration. In P. Morasso & V. Sanguineti (Eds.), Self-organization, computational maps and motor control (pp. 117–147). Dordrecht: Elsevier.
Grush, R. (2004). The emulation theory of representation: Motor control, imagery, and perception. Behavioral and Brain Sciences, 27, 377–396.
Hartmann, G., & Wehner, R. (1995). The ant's path integration system: A neural architecture. Biological Cybernetics, 73, 483–497.
Hesslow, G. (2002). Conscious thought as simulation of behaviour and perception. Trends in Cognitive Sciences, 6, 242–247.
Iossifidis, I., & Steinhage, A. (2001). Controlling an 8 DOF manipulator by means of neural fields. In A. Halme, R. Chatila, & E. Prassler (Eds.), Int. Conf. on Field and Service Robotics (FSR 2001). Helsinki.
Jordan, M., & Rumelhart, D. (1992). Forward models: Supervised learning with a distal teacher. Cognitive Science, 16, 307–354.
Klemm, K., & Alstrom, P. (2002). Emergence of memory. Europhysics Letters, 59, 662.
Köhler, W. (1917). Intelligenzprüfungen am Menschenaffen (3rd ed.). Berlin: Springer.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer.
Kröse, B., & Eecen, M. (1994). A self-organizing representation of sensor space for mobile robot navigation. In Proc. of Int. Conf. on Intelligent Robots and Systems (IROS 1994) (pp. 9–14). New York: IEEE.
Majors, M., & Richards, R. (1997). Comparing model-free and model-based reinforcement learning (Tech. Rep. No. CUED/F-INENG/TR.286). Cambridge: Cambridge University Engineering Department.
Mel, B. W. (1990). The sigma-pi column: A model of associative learning in cerebral cortex (Tech.
Rep. CNS Memo 6). Pasadena: Computation and Neural Systems Program, California Institute of Technology.
Mel, B. W., & Koch, C. (1990). Sigma-pi learning: On radial basis functions and cortical associative learning. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 474–481). San Mateo, CA: Morgan Kaufmann.
Miikkulainen, R., Bednar, J. A., Choe, Y., & Sirosh, J. (2005). Computational maps in the visual cortex. Berlin: Springer.
Phillips, W., & Singer, W. (1997). In search of common foundations for cortical computation. Behavioral and Brain Sciences, 20, 657–722.
Rose, P., & Scott, S. (2003). Sensory-motor control: A long-awaited behavioral correlate of presynaptic inhibition. Nature Neuroscience, 12, 1309–1316.
Schöner, G., & Dose, M. (1992). A dynamical system approach to task level system integration used to plan and control autonomous vehicle motion. Robotics and Autonomous Systems, 10, 253–267.
Schöner, G., Dose, M., & Engels, C. (1995). Dynamics of behavior: Theory and applications for autonomous robot architectures. Robotics and Autonomous Systems, 16, 213–245.
Somervuo, P. (1999). Time topology for the self-organizing map. In Proc. of Int. Joint Conf. on Neural Networks (IJCNN 1999) (pp. 1900–1905). New York: IEEE.
Song, P., & Wang, X.-J. (2005). Angular path integration by moving "hill of activity": A spiking neuron model without recurrent excitation of the head-direction system. J. Neuroscience, 25, 1002–1014.
Stringer, S. M., Rolls, E. T., Trappenberg, T. P., & Araujo, I. E. T. de. (2002). Self-organizing continuous attractor networks and path integration: Two-dimensional models of place cells. Network, 13, 429–446.
Sutton, R., & Barto, A. (1998). Reinforcement learning. Cambridge, MA: MIT Press.
Toussaint, M. (2004). Learning a world model and planning with a self-organizing, dynamic neural system. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16 (NIPS 2003) (pp. 929–936). Cambridge, MA: MIT Press.
Varsta, M. (2002). Self-organizing maps in sequence processing. Unpublished doctoral dissertation, Helsinki University of Technology.
von der Malsburg, C. (1973). Self-organization of orientation-sensitive cells in the striate cortex. Kybernetik, 15, 85–100.
Wiemer, J. (2003). The time-organized map algorithm: Extending the self-organizing map to spatiotemporal signals. Neural Computation, 15, 1143–1171.
Willshaw, D. J., & von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London, B194, 431–445.
Wolpert, D., Ghahramani, Z., & Flanagan, J. (2001). Perspectives and problems in motor learning. Trends in Cognitive Science, 5, 487–494.
Wolpert, D., Ghahramani, Z., & Jordan, M. (1995). An internal model for sensorimotor integration. Science, 269, 1880–1882.
Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. J. Neuroscience, 16, 2112–2126.
Zimmer, U. (1996). Robust world-modelling and navigation in a real world.
Neurocomputing, 13, 247–260.
Received November 9, 2004; accepted August 18, 2005.
LETTER
Communicated by Randall Beer
A Reflexive Neural Network for Dynamic Biped Walking Control Tao Geng [email protected] Department of Psychology, University of Stirling, Stirling, U.K.
Bernd Porr [email protected] Department of Electronics and Electrical Engineering, University of Glasgow, Glasgow, U.K.
Florentin Wörgötter [email protected] Department of Psychology, University of Stirling, Stirling, U.K., and Bernstein Center for Computational Neuroscience, University of Göttingen, Göttingen, Germany
Biped walking remains a difficult problem, and robot models can greatly facilitate our understanding of the underlying biomechanical principles as well as their neuronal control. The goal of this study is to specifically demonstrate that stable biped walking can be achieved by combining the physical properties of the walking robot with a small, reflex-based neuronal network governed mainly by local sensor signals. Building on earlier work (Taga, 1995; Cruse, Kindermann, Schumm, Dean, & Schmitz, 1998), this study shows that human-like gaits emerge without specific position or trajectory control and that the walker is able to compensate for small disturbances through its own dynamical properties. The reflexive controller used here has the following characteristics, which are different from earlier approaches: (1) Control is mainly local. Hence, it uses only two signals (anterior extreme angle and ground contact), which operate at the interjoint level. All other signals operate only at single joints. (2) Neither position control nor trajectory tracking control is used. Instead, the approximate nature of the local reflexes on each joint allows the robot mechanics itself (e.g., its passive dynamics) to contribute substantially to the overall gait trajectory computation. (3) The motor control scheme used in the local reflexes of our robot is more straightforward and has more biological plausibility than that of other robots, because the outputs of the motor neurons in our reflexive controller directly drive the motors of the joints rather than working as references for position or velocity control. As a consequence, the neural controller and the robot
Neural Computation 18, 1156–1196 (2006)
© 2006 Massachusetts Institute of Technology
mechanics are closely coupled as a neuromechanical system, and this study emphasizes that dynamically stable biped walking gaits emerge from the coupling between neural computation and physical computation. This is demonstrated by different walking experiments using a real robot as well as by a Poincaré map analysis applied to a model of the robot in order to assess its stability.

1 Introduction

There are two distinct schemes for leg coordination discussed in the literature on animal locomotion and on biologically inspired robotics: CPGs (central pattern generators) and reflexive controllers. It was found that motor neurons (and hence rhythmical movements) in many animals are driven by central networks of interneurons that generate the essential features of the motor pattern. However, sensory feedback signals also play a crucial role in such control systems by turning a stereotyped unstable pattern into the coordinated rhythm of natural movement (Reeve, 1999). These networks are referred to as CPGs. On the other hand, Cruse developed a reflexive controller model to understand the locomotion control of a slowly walking stick insect (Carausius morosus). In his model, reflexive mechanisms in each leg generate the step cycle of each individual leg. For interleg coordination, in accordance with observations in insects, he presented six mechanisms that can reestablish coordination in the case of minor disturbances (Cruse, Kindermann, Schumm, Dean, & Schmitz, 1998; Cruse & Warnecke, 1992). While neural systems modeled as CPGs or reflexive controllers explicitly or implicitly compute walking gaits, the mechanics also "compute" a large part of the walking movements (Lewis, 2001). This is called physical computation, namely, exploiting the system's physics, rather than explicit models, for global trajectory generation and control.
One distinct example of physical computation in animal locomotion is the "preflex," the nonlinear, passive visco-elastic properties of the musculoskeletal system itself (Brown & Loeb, 1999). Due to the physical nature of the preflex, the system can respond rapidly to disturbances (Cham, Bailey, & Cutkosky, 2000). Thus, in all animals, locomotion control is shared between neural computation and physical computation. In this letter, we present our design of a novel reflexive neural controller that has been implemented on a planar biped robot. We will show how a dynamically stable biped walking gait emerges on our robot as a result of a combination of neural and physical computation. Several issues are addressed in this article that we believe are relevant for understanding biologically motivated walking control. Specifically, we will show that it is possible to design a walking robot with a very sparse set of input signals and with a controller that operates in an approximate and self-regulating way. Both aspects may be of importance in biological systems too, because they
1158
T. Geng, B. Porr, and F. Wörgötter
allow for a much more limited structure of the neural network and reduce the complexity of the required information processing. Furthermore, in our robot, the controller is directly linked to the robot's motors (its "muscles"), leading to a more realistic, reflexive sensor-motor coupling than implemented in related approaches. These mechanisms allowed us for the first time to arrive at a dynamically stable artificial biped combining physical computation with a pure reflexive controller. The experimental part of this study is complemented by a dynamical model and the assessment of its stability using a Poincaré map approach. Robot simulations have recently been criticized, raising the issue that complex systems like a walking robot cannot be fully simulated because of uncontrollable contingencies in the design and in the world in which it is embedded. This notion, known as the embodiment problem, has been discussed to a large extent in the robotics literature (Porr & Wörgötter, 2005; Ziemke, 2001). This issue reappears also in our case, where we find that the simulations and their analysis indeed match the experiments and raise confidence in the design, while stopping short of the rich detail of the real system. This letter is organized as follows. First, we describe the mechanical design of our biped robot. Next, we present our neural model of a reflexive network for walking control. Then we demonstrate the results of several biped walking experiments and apply a Poincaré map analysis to the robot model. Finally, we compare our reflexive controller with other walking control mechanisms.

2 The Robot

Reflexive controllers such as Cruse's model involve no central processing unit that demands information on the real-time state of every limb and computes the global trajectory explicitly. Instead, local reflexes of every limb require very little information concerning the state of the other limbs.
Coordinated locomotion emerges from the interaction between local reflexes and the ground. Thus, such a distributed structure can immensely decrease the computational burden of the locomotion controller. With these eminent advantages, Cruse's reflexive controller and its variants have been implemented on several multilegged robots (Ferrell, 1995). In the case of biped robots, by contrast, though some of them also exploit some form of reflexive mechanism, their reflexes usually serve as an auxiliary function or as infrastructural units for other nonreflexive high-level or parallel controllers. For example, on a simulated 3D biped robot (Boone & Hodgins, 1997), specifically designed reflexive mechanisms were used to respond to two types of ground surface contact errors of the robot, slipping and tripping, while the robot's hopping height, forward velocity, and body attitude were separately controlled by three decoupled conventional controllers. On a real biped robot (Funabashi, Takeda, Itoh, & Higuchi, 2001), two prewired reflexes are
implemented to compensate for two distinct types of disturbances, representing an impulsive force and a continuous force, respectively. To date, no real biped robot has existed that depends exclusively on reflexive controllers for walking control. This may be because of the intrinsic instability specific to biped walking, which makes the dynamic stability of biped robots much more difficult to control than that of multilegged robots. After all, a pure local reflexive controller itself involves no mechanisms to ensure the global stability of the biped. While the controllers of biped walking robots generally require some kind of continuous position feedback for trajectory computation and stability control, some animals' fast locomotion is largely self-stabilized due to the passive, visco-elastic properties of their musculoskeletal system (Full & Tu, 1990). Not surprisingly, some robots can display a similar self-stabilization property (Iida & Pfeifer, 2004). Passive biped robots can walk down a shallow slope with no sensing, control, or actuation. However, compared with a powered biped, passive biped robots have obvious drawbacks, for example, the need to walk down a slope and the inability to control speed (Pratt, 2000). Some researchers have proposed equipping a passive biped with actuators to improve its performance. Van der Linde (1998) made a biped robot walk on level ground by pumping energy into a passive machine at each step. Nevertheless, no one has yet built a passive biped robot that has the capabilities of powered robots, such as walking at various speeds on various terrains (Pratt, 2000). Passive biped robots are usually equipped with circular feet, which can increase the basin of attraction of stable walking gaits and can make the motion of the stance leg look smoother.
Powered biped robots, in contrast, typically use flat feet so that their ankles can effectively apply torque to propel the robot forward in the stance phase and to facilitate stability control. Although our robot is a powered biped, it has no actuated ankle joints, rendering its stability control even more difficult than that of other powered bipeds. Since we intended to exploit our robot's passive dynamics during some stages of its gait cycle, similar to a passive biped, its foot bottom follows a curved form with a radius equal to the leg length. As for the mechanical design, the robot is 23 cm high from foot to hip. It has four joints: left hip, right hip, left knee, and right knee. Each joint is driven by an RC servo motor. A hard mechanical stop is installed on each knee joint, preventing it from going into hyperextension, similar to the function of knee caps on animals' legs. The built-in PWM (pulse width modulation) control circuits of the RC motors are disconnected, while their built-in potentiometers are used to measure joint angles. The potentiometer output voltage is sent to a PC through a DA/AD board (USB DUX, www.linuxusb-daq.co.uk). Each foot is equipped with a modified Piezo transducer (DN 0714071 from Farnell) to sense ground contact events. We constrain the robot to the sagittal plane by a boom. All three axes (pitch, roll, and
T. Geng, B. Porr, and F. Wörgötter
Figure 1: (A) The robot and (B) a schematic of the joint angles of one leg. (C) The structure of the boom. All three orthogonal axes (pitch, roll, and yaw) rotate freely, thus having no influence on the robot dynamics in its sagittal plane.
yaw) of the boom can rotate freely (see Figure 1C), thus having no influence on the dynamics of the robot in the sagittal plane. Note that the robot is not supported by the boom in the sagittal plane. In fact, it is always prone to trip and fall. The most important consideration in the mechanical design of our robot is the location of its center of mass. Its links are made of aluminium alloy, which is light and strong enough. The motor of each hip joint is an
Figure 2: Illustration of a walking step of the robot.
HS-475HB from Hitec. It weighs 40 g and can output a torque of up to 5.5 kg·cm. Due to the effect of the mechanical stop, the motor of the knee joint bears a smaller torque than that of the hip joint in stance phases, but must rotate quickly during swing phases for foot clearance. We use a PARK HPXF from Supertec on the knee joint, which is light (19 g) but fast (21 rad/s). Thus, about 70% of the robot's weight is concentrated in its trunk. The parts of the trunk are assembled in such a way that its center of mass is located as far forward as possible (see Figure 2). The effect of this design is illustrated in Figure 2. As shown, one walking step has two stages: from A to B and from B to C. During the first stage, the robot has to use its own momentum to rise up on the stance leg. When walking at a low speed, the robot may not have enough momentum to do this, so the distance the center of mass has to cover in this stage should be as short as possible, which is achieved by locating the center of mass of the trunk far forward. In the second stage, the robot falls forward naturally and catches itself on the next stance leg (see Figure 2). Then the walking cycle is repeated. The figure also clearly shows the movement of the curved foot of the stance leg. A stance phase begins with the heel touching the ground and terminates with the toe leaving the ground. 3 The Neural Structure of Our Reflexive Controller The reflexive controller model of Cruse et al. (1998) and Cruse and Saavedra (1996) that has been applied on many robots can be roughly divided into two levels: the single leg level and the interleg level. Figure 3 illustrates how Cruse's model creates a single leg movement pattern. A protracting leg switches to retraction as soon as it attains the AEP (anterior extreme
Figure 3: Single leg movement pattern of Cruse’s reflexive controller model (Cruse et al., 1998).
position). A retracting leg switches to protraction when it attains the PEP (posterior extreme position). On the interleg level, six different mechanisms have been described (Cruse et al., 1998), which coordinate leg movements by modifying the AEP and PEP of a receiving leg according to the state of a sending leg. Although Cruse’s model, as a reflexive controller, is for hexapod locomotion, where the problem of interleg coordination is much more complex than in biped walking, we can still compare its mechanism for the generation of single leg movement patterns with that of our reflexive controller. Cruse’s model depends on PEP, AEP, and GC (ground contact) signals to generate the movement pattern of the individual legs, whereas our reflexive controller presented here uses only GC and AEA (anterior extreme angle of hip joints) to trigger switching between stance and swing phases of each leg. Creation of the single leg movement pattern for our model is illustrated in Figure 4. Figures 4A to 4E represent a series of snapshots of the robot configuration while it is walking. At the time of Figure 4B, the left foot (black) has just touched the ground. This event triggers four local joint reflexes at the same time: flexor of left hip, extensor of left knee, extensor of right hip, and flexor of right knee. At the time of Figure 4E, the right hip joint attains its AEA, which triggers only the extensor reflex of the right knee. When the right foot (gray) contacts the ground, a new walking cycle will begin. Note that
Figure 4: Illustration of single leg movement pattern generation.
Figure 5: The neuron model of the reflexive controller on our robot. Gray circles = sensor neurons or receptors; vertical ovals = interneurons; horizontal ovals = motorneurons. Synapses: black circle = excitatory, black triangle = inhibitory. See Table 1 for abbreviations.
on the hip and knee joints, extensor refers to forward movement and flexor to backward movement. The reflexive walking controller of our robot follows a hierarchical structure (see Figure 5). The bottom level is the reflex circuit local to the joints,
Table 1: Meaning of Some Abbreviations.

AL (AR)   Stretch receptor for anterior angle of left (right) hip
GL (GR)   Sensor neuron for ground contact of left (right) foot
EI (FI)   Extensor (flexor) reflex interneuron
EM (FM)   Extensor (flexor) reflex motorneuron
ES (FS)   Extensor (flexor) reflex sensor neuron
including the motorneurons and angle sensor neurons involved in joint reflexes. The top level is a distributed neural network consisting of hip stretch receptors, ground contact sensor neurons, and interneurons for reflexes. Neurons are modeled as nonspiking neurons simulated on a Linux PC, communicating with the robot via the DA/AD board. Though somewhat simplified, they still retain some of the prominent neuronal characteristics. 3.1 Model Neuron Circuit of the Top Level. The joint coordination mechanism in the top level is implemented with the neuron circuit illustrated in Figure 5. Each of the ground contact sensor neurons has excitatory connections to the interneurons of the ipsilateral hip flexor and knee extensor as well as to the contralateral hip extensor and knee flexor. The stretch receptor of each hip has excitatory connections to its ipsilateral interneuron of the knee extensor and an inhibitory connection to its ipsilateral interneuron of the knee flexor. Detailed models of the interneuron, stretch receptor, and ground contact sensor neuron are described in the following subsections. 3.1.1 Interneuron Model. The interneuron model is adapted from one used in the neural controller of a hexapod simulating insect locomotion (Beer & Chiel, 1992). The state of each model neuron is governed by equations 3.1 and 3.2 (Gallagher, Beer, Espenschied, & Quinn, 1996):

τ_i dy_i/dt = −y_i + Σ_j ω_{i,j} u_j,   (3.1)

u_j = (1 + e^{θ_j − y_j})^{−1},   (3.2)
where y_i represents the mean membrane potential of the neuron. Equation 3.2 is a sigmoidal function that can be interpreted as the neuron's short-term average firing frequency, θ_j is a bias constant that controls the firing threshold, τ_i is a time constant associated with the passive properties of the cell membrane (Gallagher et al., 1996), and ω_{i,j} represents the connection strength from the jth neuron to the ith neuron. 3.1.2 Stretch Receptors. Stretch receptors play a crucial role in animal locomotion control. When the limb of an animal reaches an extreme position, its stretch receptor sends a signal to the controller, resetting the phase of the
limbs. There is also evidence that phasic feedback from stretch receptors is essential for maintaining the frequency and duration of normal locomotive movements in some insects (Chiel & Beer, 1997). While other biologically inspired locomotive models and robots use two stretch receptors on each leg to signal the attainment of the leg's AEP and PEP, respectively, our robot has only one stretch receptor on each leg, signaling the AEA of its hip joint. Furthermore, the function of the stretch receptor on our robot is only to trigger the extensor reflex on the knee joint of the same leg rather than to explicitly (in the case of CPG models) or implicitly (in the case of reflexive controllers) reset the phase relations between different legs. As a hip joint approaches the AEA, the outputs of the stretch receptors for the left (AL) and the right hip (AR) increase as

ρ_AL = (1 + e^{α_AL(Θ_AL − φ)})^{−1},   (3.3)

ρ_AR = (1 + e^{α_AR(Θ_AR − φ)})^{−1},   (3.4)
where φ is the real-time angular position of the hip joint, Θ_AL and Θ_AR are the hip anterior extreme angles, whose values are tuned by hand in an experiment, and α_AL and α_AR are positive constants. This model is inspired by a sensor neuron model presented in Wadden and Ekeberg (1998) that is thought capable of emulating the response characteristics of populations of sensor neurons in animals. 3.1.3 Ground Contact Sensor Neurons. Another kind of sensor neuron incorporated in the top level is the ground contact sensor neuron, which is active when the foot is in contact with the ground. Its output, similar to that of the stretch receptors, changes according to

ρ_GL = (1 + e^{α_GL(Θ_GL − V_L + V_R)})^{−1},   (3.5)

ρ_GR = (1 + e^{α_GR(Θ_GR − V_R + V_L)})^{−1},   (3.6)
where V_L and V_R are the output voltage signals from the piezo sensors of the left foot and right foot, respectively, Θ_GL and Θ_GR work as thresholds, and α_GL and α_GR are positive constants. While AEP and PEP signals account for switching between stance phase and swing phase in other walking control structures, ground contact signals play a crucial role in the phase transition control of our reflexive controller. This emphasized role of the ground contact signal also has some biological plausibility. When held in a standing position on a firm flat surface, a newborn baby will make stepping movements, alternating flexion and extension of each leg, which looks like walking. This "stepping reflex" is elicited by the foot's touching a flat surface. There is considerable evidence that
the stepping reflex, though different from actual walking, eventually develops into independent walking (Yang, Stephens, & Vishram, 1998). Concerning its nonlinear dynamics, the biped model is hybrid in nature, containing continuous (swing phase and stance phase) and discrete (ground contact event) elements. Hurmuzlu (1993) applied discrete mapping techniques to study the stability of bipedal locomotion and found that the timing of ground contact events has a crucial effect on the stability of biped walking. Our preference for a ground contact signal instead of PEP or AEP signals has further reasons. In PEP/AEP models, the movement pattern of a leg breaks down as soon as the AEP or PEP cannot be reached, which may happen as a consequence of an unexpected disturbance from the environment or an intrinsic failure. This can be catastrophic for a biped, though tolerable for a hexapod due to its high degree of redundancy. 3.2 Neural Circuit of the Bottom Level. In animals, a reflex is a local motor response to a local sensation, triggered in response to a suprathreshold stimulus. The quickest reflex in animals is the "monosynaptic reflex," in which the sensor neuron directly contacts the motor neuron. The bottom-level reflex system of our robot consists of reflexes local to each joint (see Figure 5). The neuron module for one reflex is composed of one angle sensor neuron and the motor neuron it contacts (see Figure 5). Each joint is equipped with two reflexes, an extensor reflex and a flexor reflex, both modeled as monosynaptic reflexes; that is, whenever its threshold is exceeded, the angle sensor neuron directly excites the corresponding motor neuron. This direct connection between angle sensor neuron and motor neuron is inspired by a reflex described in cockroach locomotion (Beer, Quinn, Chiel, & Ritzmann, 1997).
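The neuron models of section 3.1 (the leaky-integrator dynamics of equations 3.1 and 3.2 and the sigmoidal sensor neurons of equations 3.3 to 3.8) can be summarized in a short sketch. This is only an illustration using forward-Euler integration and placeholder parameter values, not the authors' implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron_step(y, inputs, weights, tau, dt):
    """One forward-Euler step of the leaky integrator, equation 3.1:
    tau * dy/dt = -y + sum_j w_j * u_j."""
    drive = sum(w * u for w, u in zip(weights, inputs))
    return y + dt * (-y + drive) / tau

def firing_rate(y, theta):
    """Equation 3.2: short-term average firing frequency,
    (1 + e^{theta - y})^{-1}, written as a standard sigmoid."""
    return sigmoid(y - theta)

def stretch_receptor(phi, theta_a, alpha):
    """Equations 3.3/3.4: output rises as the hip angle phi
    approaches the anterior extreme angle theta_a."""
    return sigmoid(alpha * (phi - theta_a))
```

Because every sensor neuron in the controller is the same sigmoid with a threshold and a slope, the stretch receptors, ground contact sensors, and joint-reflex sensor neurons differ only in which signal and threshold they are given.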
In addition, the motor neurons of the local reflexes receive an excitatory synapse and an inhibitory synapse from the interneurons of the top level, by which the top level can modulate the bottom-level reflexes. Each joint has two angle sensor neurons: one for the extensor reflex and the other for the flexor reflex (see Figure 5). Their models are similar to that of the stretch receptors described above. The extensor angle sensor neuron changes its output according to

ρ_ES = (1 + e^{α_ES(φ − Θ_ES)})^{−1},   (3.7)
where φ is the real-time angular position obtained from the potentiometer of the joint (see Figure 1B), Θ_ES is the threshold of the extensor reflex (see Figure 1B), and α_ES is a positive constant. Likewise, the output of the flexor sensor neuron is modeled as

ρ_FS = (1 + e^{α_FS(Θ_FS − φ)})^{−1},   (3.8)
where Θ_FS and α_FS are defined analogously. It should be particularly noted that the thresholds of the sensor neurons in the reflex modules do not work as desired positions for joint control, because our reflexive controller does not involve any exact position control algorithm that would ensure that the joint positions converge to a desired value. In fact, as will be shown in the next section, the joints often pass these thresholds in the swing and stance phases and begin their passive movement thereafter. The sensor neurons involved in the local reflex module of each joint can affect the movements of only the joint they belong to, having no direct or indirect connection to other joints. This is different for the phasic feedback signal, AEA, which works at the top (i.e., interjoint) level, sensing the position of the hip joints and contacting the motor neurons of the knee joints. The model of the motor neuron is the same as that of the interneurons presented in section 3.1.1. Note that on this robot, the output value of the motor neurons, after multiplication by a gain coefficient, is sent to the servo amplifier to directly drive the joint motor.1 In this way, the neural dynamics are directly coupled with the motor dynamics and, furthermore, with the biped walking dynamics. Thus, the robot and its neural controller constitute a closely coupled neuromechanical system. The voltage of a joint motor is determined by

Motor voltage = M_AMP (g_EM u_EM + g_FM u_FM),   (3.9)
where M_AMP represents the gain of the servo amplifier, g_EM and g_FM are the output gains of the motor neurons of the extensor and flexor reflex, respectively, and u_EM and u_FM are the outputs of the motor neurons. 4 Robot Walking Experiments The model neuron parameters chosen jointly for all experiments are listed in Tables 2 and 3. Only the thresholds of the sensor neurons and the output gains of the motor neurons are changed in different experiments. The time constants τ_i of all neurons take the same value of 5 ms. The weights of all inhibitory connections are set to −10. The weights of all excitatory
1 While we use motors to drive the robot, animals use muscles for walking. Muscles have their own special properties that make them particularly suitable for walking behaviors, for example, the preflex, which refers to the nonlinear, passive visco-elastic properties of the musculoskeletal system of animals (Brown & Loeb, 1999). Due to the physical nature of the preflex, the system can respond to disturbances rapidly. In the next stage of our work, we will build a Hill-type muscle model with RC motors. The motor neurons of our reflexive controller at the moment drive the motors directly. In the next stage, they will drive the muscle model directly, just as in animals.
Table 2: Parameters of Neurons for Hip and Knee Joints.

              θ_EI   θ_FI   θ_EM   θ_FM   α_ES   α_FS
Hip joints      5      5      5      5      4      1
Knee joints     5      5      5      5      4      4

Note: For the meaning of the subscripts, see Table 1.
Table 3: Parameters of Stretch Receptors and Ground Contact Sensor Neurons.

Θ_GL (V)   Θ_GR (V)   Θ_AL (deg)   Θ_AR (deg)   α_GL   α_GR   α_AL   α_AR
   2          2         = Θ_ES       = Θ_ES       4      4      4      4
Table 4: Specific Parameters for Walking Experiments.

              Θ_ES (deg)   Θ_FS (deg)   g_EM   g_FM
Hip joints        115           70        ±2     ±2
Knee joints       180          100       ±1.8   ±1.8
connections are 10, except those between interneurons and motor neurons, which are 0.1. We encourage readers to watch the video clips of the robot walking experiments:

Walking fast on a flat floor: http://www.cn.stir.ac.uk/~tgeng/robot/walkingfast.mpg
Walking at a medium speed: http://www.cn.stir.ac.uk/~tgeng/robot/walkingmedium.mpg
Walking slowly: http://www.cn.stir.ac.uk/~tgeng/robot/walkingslow.mpg
Climbing a shallow slope: http://www.cn.stir.ac.uk/~tgeng/robot/climbingslope.mpg

These videos can be viewed online with Windows Media Player (www.microsoft.com) or RealPlayer (www.real.com). 4.1 Passive Movements of the Robot. In a walking experiment with the specific parameters given in Table 4, the passive part of the movement of the robot is seen most clearly. (The sign of g_EM and g_FM depends on the hardware configuration of the motors on the left and right legs.)
Figure 6: Real-time data of one hip joint. (A) Hip joint angle. (B) Motor voltage measured directly at the motor neurons of the hip joint. During some periods (the gray areas), the motor voltage remains zero, and the hip joint moves passively.
Figure 6 shows the motor voltage and the angular movement of one hip joint while the robot is walking. During more than half of every gait cycle, the hip joint moves passively. As shown in Figure 7, during some period of every gait cycle (the gray area in Figure 7), the motor voltages of the motor neurons on all four joints remain zero, so all joints move passively until the swing leg touches the ground (see Figure 8). During this time, which is roughly one-third of a gait cycle (see Figures 7 and 8), the movement of the whole robot is exclusively under the control of "physical computation" following its passive dynamics; no feedback-based active control acts on it. This demonstrates very clearly how neurons and mechanical properties work together to generate the whole gait trajectory. It is also analogous to what happens in animal locomotion: muscle control in animals usually exploits the natural dynamics of the limbs. For instance, during the swing phase of the human walking gait, the leg muscles first produce a power spike to begin leg swing and then remain limp throughout the rest of the swing phase, similar to what is shown in Figure 8. Note that in Figure 8 and the corresponding stick diagrams of the walking gait, we omitted the detailed movement of the curved foot in order to show the leg movements clearly. The point on which the stance leg stands is the orthographic projection of the midpoint of the foot, not its exact ground contact point.
Figure 7: Motor voltages of the four joints measured directly at the motor neurons, while the robot is walking. (A) Left hip. (B) Right hip. (C) Left knee. (D) Right knee. Note that during one period of every gait cycle (gray area), all four motor voltages remain zero, and all four joints (i.e., the whole robot) move passively (see Figure 8).
4.2 Walking at Different Speeds and a Perturbed Gait. The walking speed of the robot can be changed easily by adjusting only the thresholds of the reflex sensor neurons and the output gains of the motor neurons (see Table 5). Figures 9A and 9B show two phase plots of the hip and knee joint positions, recorded while the robot was walking at different speeds on a flat floor. Figure 9C shows a perturbed walking gait. The bulk of the trajectory represents the normal orbit of the walking gait, while the few outlying
Figure 8: (A) A series of sequential frames of a walking gait cycle. The interval between two adjacent frames is 33 ms. Note that during the time between frame 10 and frame 15, which is nearly one-third of the length of a gait cycle (corresponding to the gray area in Figure 7), the robot is moving passively. At the time of frame 15, the swing leg touches the floor, and a new gait cycle begins. (B) Stick diagram of the gait drawn from the frames in A. The interval between any two consecutive snapshots is 67 ms.
trajectories are caused by external disturbances induced by small obstacles, such as thin books (less than 4% of the robot's size), obstructing the robot's path. After a disturbance, the trajectory soon returns to its normal orbit, demonstrating that the walking gaits are stable and to some degree robust against external disturbances. Here, robustness is defined as rapid convergence to a steady-state behavior despite unexpected perturbations (Lewis, 2001). When the neuron parameters are changed, as in the cases of fast and slow walking, the walking dynamics are implicitly drawn into a different gait cycle (see Figure 9). Figure 9D shows an experiment in which the neuron parameters are changed abruptly online while the robot is walking at a slow speed
Table 5: The Values of Neuron Parameters Chosen to Generate Different Speeds (see Figure 9).

                                     Θ_ES (deg)   Θ_FS (deg)   g_EM   g_FM
Low-speed walking      Hip joints        120           70       ±1.4   ±1.3
(see Figure 9A)        Knee joints       180          100       ±1.5   ±1.5
High-speed walking     Hip joints        110           85       ±2.5   ±2.5
(see Figure 9B)        Knee joints       180          100       ±1.8   ±1.8
Perturbed walking      Hip joints        115           90       ±2.5   ±2.5
gait (see Figure 9C)   Knee joints       180          100       ±1.5   ±1.5
Figure 9: Phase diagrams of hip joint position and knee joint position of one leg. Robot speed: (A) 28 cm/s; (B) 63 cm/s. (C) A perturbed walking gait. For the values of the neuron parameters chosen in these experiments, see Table 5. Note that the hip joint angle in these figures is an absolute value, not the angle relative to the robot body as shown in Figure 1B. (D) The walking speed is changed online.
(33 cm/s, the big orbit). After a short transient stage (the outlying trajectories), the gait cycle of the robot automatically transfers into another stable, high-speed orbit (the small one, 57 cm/s). In other words, when the neuron parameters are changed, physical computation closely coupled
Figure 10: The robot is climbing a shallow slope. The interval between any two consecutive snapshots is 67 ms.
with neural computation can autonomously shift the system into another global trajectory that is also dynamically stable. This experiment shows that our biped robot, as a neuromechanical system, is stable in a relatively large domain of its neuron parameters. For other real-time biped walking controllers, based either on biologically inspired mechanisms (e.g., CPGs) or on conventional trajectory preplanning and tracking control, it is still a puzzling problem how to change walking speed on the fly without undermining dynamic stability at the same time. However, this experiment shows that the walking speed of our robot can be drastically changed (nearly doubled) on the fly while stability is retained thanks to physical computation. 4.3 Walking Up a Shallow Slope. Figure 10 is a stick diagram of the robot walking up a shallow slope of about 4 degrees. Steeper slopes could not be mastered. In Figure 10, we can see that as the robot climbs the slope, its step length becomes smaller and the movement of its stance leg becomes slower (its stick snapshots become denser). Note that these adjustments of its gait take place autonomously due to the robot's physical properties (physical computation), not relying on any preplanned trajectory or precise control mechanism. This experiment demonstrates that such a closely coupled neuromechanical system can, to some degree, autonomously adapt to an unstructured terrain. 5 Stability Analysis of the Walking Gaits 5.1 Dynamic Model of the Robot. The dynamics of our robot are modeled as shown in Figure 11. With the Lagrange method, we obtain the equations of motion of the robot, which can be written in the form

D(q)q̈ + C(q, q̇) + G(q) = τ,   (5.1)
where q = [φ, θ_1, θ_2, ψ]^T is a vector describing the configuration of the robot
Figure 11: Model of the dynamics of our robot. Sizes and masses are the same as those of the real robot.
(for a definition of φ, θ_1, θ_2, ψ, see Figure 11). D(q) is the 4 × 4 inertia matrix, C(q, q̇) is the 4 × 1 vector of centripetal and Coriolis forces, and G(q) is the 4 × 1 vector representing gravity forces. τ = [0, τ_1, τ_2, τ_3]^T, where τ_1, τ_2, τ_3 are the torques applied at the stance hip (the hip joint of the stance leg in Figure 11), the swing hip, and the swing knee joint, respectively. Details of this equation can be found in the appendix. The dynamics of the DC motor (including gears) of each joint can be described by the following equations (here, the hip of the stance leg is taken as an example; the models of the other joints are analogous):

L_a di_a/dt + R_a i_a + n k_3 θ̇_1 = V_1,   (5.2)

τ_1 + I_1 θ̈_1 + k_f θ̇_1 = n k_2 i_a,   (5.3)
where V_1 is the applied armature voltage of the stance hip motor, obtained from the output of the motor neurons according to equation 3.9; i_a is the armature current, L_a the armature inductance, and R_a the armature resistance. k_3 is the back-emf constant, and k_2 is the motor torque constant. I_1 is the combined moment of inertia of the stance-hip motor and gear train referred to the gear output shaft. k_f is the viscous-friction coefficient of the combined motor and gear, and n is the gear ratio. Considering that the electrical time constant of the motor is much smaller than the mechanical time constant of the robot, we neglect the dynamics of the electrical circuit of the motor, which amounts to setting di_a/dt = 0. Thus, equation 5.2 reduces to

i_a = (1/R_a)(V_1 − n k_3 θ̇_1).   (5.4)
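Equations 5.3 and 5.4 together give the joint torque directly from the applied voltage and the joint's angular velocity and acceleration. A minimal sketch follows; all motor constants in it are placeholders, not the values of the actual servos:

```python
def armature_current(v, theta_dot, n, k3, ra):
    """Equation 5.4: armature current with the electrical transient
    neglected (di_a/dt = 0), including the back-emf term n*k3*theta_dot."""
    return (v - n * k3 * theta_dot) / ra

def joint_torque(v, theta_dot, theta_ddot, n, k2, k3, ra, inertia, kf):
    """Equation 5.3 rearranged for the torque applied to the link:
    tau = n*k2*i_a - I*theta_ddot - kf*theta_dot."""
    ia = armature_current(v, theta_dot, n, k3, ra)
    return n * k2 * ia - inertia * theta_ddot - kf * theta_dot
```

Substituting these torques into equation 5.1 closes the loop from the motor-neuron outputs (via equation 3.9) to the rigid-body dynamics of the robot.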
Combining equations 5.1, 5.3, and 5.4, we obtain the dynamics model of the robot with the applied motor voltages as its control input. The heel strike at the end of the swing phase and the knee strike at the end of the knee extensor reflex are assumed to be inelastic impacts, which is in accordance with observations on our real robot and on existing passive biped robots. This assumption implies conservation of the angular momentum of the robot just before and after the strikes, with which the value of q̇ just after a strike can be computed from its value just before the strike. Because the transient double support phase is very short in our robot's walking (usually less than 40 ms), it is neglected in our simulation, as is often done in the analysis of passive bipeds (Garcia, 1999). 5.2 Stability Analysis with Poincaré Maps. The method of Poincaré maps is usually employed for the stability analysis of cyclic movements of nonlinear dynamic systems such as passive bipeds (Garcia, 1999). Because our reflexive controller exploits natural dynamics for the robot's motion generation, rather than trajectory planning or tracking control, the Poincaré map approach can also be applied to the dynamics model of our robot together with the reflexive network as its controller. We choose the Poincaré section (Garcia, 1999) to be right after the heel strike of the swing leg. Each cyclic walking gait is a limit cycle in the state space, corresponding to a fixed point on the Poincaré section. Fixed points can be found by solving the roots of the mapping equation

P(x_n) − x_n = 0,   (5.5)
where x_n = [q, q̇]^T = [φ, θ_1, θ_2, ψ, φ̇, θ̇_1, θ̇_2, ψ̇]^T is the state vector on the Poincaré section at the beginning of the nth gait cycle. P(x_n) is a map function mapping x_n to x_{n+1}, built by combining the reflexive controller and the robot dynamics model described above.
Table 6: Fixed Parameters of the Knee Joints.

              Θ_ES,k (deg)   Θ_FS,k (deg)   G_M,k
Knee joints        180            110       0.9 G_M,h
Near a fixed point x*, the map function can be linearized as (Garcia, 1999)

P(x* + x̂) ≈ P(x*) + J x̂,   (5.6)

where J is the 8 × 8 Jacobian matrix of partial derivatives of P:

J = ∂P/∂x.   (5.7)
With any fixed point, J can be obtained by numerically evaluating P eight times in a small neighborhood of the fixed point. According to equation 5.6, a small perturbation x̂_i to the limit cycle x* at the start of the ith step will grow or decay from the ith step to the (i+1)th step approximately according to x̂_{i+1} ≈ J x̂_i. So if all eigenvalues of J lie within the unit circle, any small perturbation will decay to 0, and the perturbed walking gait will return to its limit cycle, which means the limit cycle is asymptotically stable (Garcia, 1999). The movements of the knee joints are needed mainly for timely ground clearance and have little influence on the stability of the walking gait. Therefore, in the simulation analysis and real experiments below, we set the knees' neuron parameters to fixed values (see Table 6) that ensure fast movements of the knee joints, preventing any possible scuffing of the swing leg. For simplicity, we also fix the threshold of the flexor sensor neurons of the hips (Θ_FS,h) to 85 degrees in the simulations and real experiments below. This does not damage the generality of the results, because similar results can be obtained provided that Θ_FS,h is in the interval of 70 to 90 degrees. For values outside this range, the robot will either fall or produce very unnatural gaits. Thus, we only need to tune two parameters of the hip joints: the threshold of the extensor sensor neurons (Θ_ES,h) and the gain of the motor neurons of the hip joints (G_M,h), which together determine the gait properties. Θ_ES,h − Θ_FS,h roughly determines the stride length (not exactly, because the hip joint moves passively after passing Θ_ES,h), while G_M,h determines the amplitude of the voltage applied to the motors of the hip joints. Since these two parameters have such clear physical interpretations, their tuning is straightforward.
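The stability test just described, building J by evaluating P near the fixed point and checking its eigenvalues against the unit circle, can be sketched generically. The sketch uses central differences (2n map evaluations) rather than the n evaluations mentioned in the text, and the toy maps in the usage example merely stand in for the actual step-to-step map of the robot:

```python
import numpy as np

def numerical_jacobian(P, x_star, eps=1e-6):
    """Finite-difference Jacobian of the return map P at x_star (equation 5.7)."""
    n = len(x_star)
    J = np.zeros((n, n))
    for k in range(n):
        dx = np.zeros(n)
        dx[k] = eps
        # Central difference along coordinate k
        J[:, k] = (P(x_star + dx) - P(x_star - dx)) / (2 * eps)
    return J

def is_asymptotically_stable(P, x_star):
    """A fixed point is asymptotically stable if every eigenvalue of J
    lies strictly inside the unit circle."""
    eigvals = np.linalg.eigvals(numerical_jacobian(P, x_star))
    return bool(np.all(np.abs(eigvals) < 1.0))
```

For example, the contracting map P(x) = 0.5x is stable at the origin, while P(x) = 2x is not.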
With each set of the controller parameters Θ_ES,h and G_M,h, we use a multidimensional Newton-Raphson method solving equation 5.5 to find
Figure 12: Stable domain of the controller parameters Θ_ES,h and G_M,h. The large area enclosed by the outer curve represents the range, obtained in simulations, in which fixed points are stable. The shaded area is the range of the two parameters in which stable gaits appear in experiments performed with the real robot. The maximum permitted value of G_M,h is 2.95 (higher values would destroy the motor of the hip joint). The two closed curves are a manual, continuous interpolation of the discrete boundaries obtained in the simulations and real experiments, respectively.
the fixed point (Garcia, 1999). Then we compute the Jacobian matrix J at the fixed point using the approach described above and evaluate the stability of the fixed point according to its eigenvalues. The result of this Poincaré map analysis is shown in Figure 12. We have found that asymptotically stable fixed points exist in a considerably large range of the controller parameters E S,h and G M,h (see Figure 12). For comparison, Figure 12 also shows the stable range of these two parameters obtained in real robot experiments. For the real robot, because no definite stability criterion, such as the eigenvalue test, is applicable, we regard a walking gait as stable if the robot does not fall. The best way to visualize the properties of a limit cycle is the phase plane, which can easily be obtained in the simulations but is not available for our real robot due to the lack of absolute position and speed sensors. Figure 13 shows two phase plane plots of the absolute angular position of one hip joint, φ (see Figure 11), and its derivative, φ̇. After being perturbed, the walking gait returns to its limit cycle within only a few steps, which is in accordance with the experimental results of the real robot presented in the last section.
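The Newton-Raphson search for the fixed point can be sketched as follows: it solves g(x) = P(x) − x = 0, with the Jacobian of P estimated by central differences. This is an illustrative reconstruction under assumed names (P, x0), not the code used in the article:

```python
import numpy as np

def find_fixed_point(P, x0, eps=1e-6, tol=1e-10, max_iter=50):
    # Newton-Raphson on g(x) = P(x) - x; the Newton matrix is
    # J_P(x) - I, with J_P estimated by central finite differences.
    x = np.asarray(x0, dtype=float)
    n = len(x)
    for _ in range(max_iter):
        g = P(x) - x
        if np.linalg.norm(g) < tol:
            break  # converged: P(x) is (numerically) equal to x
        J = np.empty((n, n))
        for k in range(n):
            dx = np.zeros(n)
            dx[k] = eps
            J[:, k] = (P(x + dx) - P(x - dx)) / (2.0 * eps)
        x = x - np.linalg.solve(J - np.eye(n), g)
    return x
```

The finite-difference Jacobian at the returned point then yields the eigenvalues used for the stability test described in the text.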
1178
T. Geng, B. Porr, and F. Wörgötter
Figure 13: Two limit cycles in the phase plane of φ and φ̇. (A) Corresponds to a fixed point found with this set of controller parameters: E S,h = 125 deg, G M,h = 2.8. (B) Corresponds to E S,h = 110 deg, G M,h = 2.5.
Because some details of the robot dynamics, such as uncertainties of the ground contact, nonlinear friction in the joints, and the inevitable noise and lag of the sensors, are difficult, if not impossible, to model precisely, the results of the simulations and real experiments are not exactly consistent (see Figure 12). However, the stability analysis and experiments with our real robot have in general shown that our biped robot, under control of the reflexive network, demonstrates stable walking gaits over a wide range of the critical controller parameters and returns to its normal orbit quickly after a disturbance. 6 Discussion and Comparison with Other Walking Control Mechanisms 6.1 Minimal Set of Phasic Feedbacks. The aim of locomotion control structures (modeled with either CPGs or reflexive controllers) is to control the phase relations between limbs or joints, attaining a stable phase locking that leads to a stable gait. Therefore, the locomotion controller needs phasic feedback from the legs or joints. In the case of reflexive controllers like Cruse's model (Cruse et al., 1998), the phasic feedback signals sent to the controller are the AEP and PEP signals, which provide sufficient information on phase relations, at least between adjacent legs. It is according to this information that the reflexive controller adjusts the PEP value of a leg, thus effectively changing the period of the leg and synchronizing it in or out of phase with its adjacent legs (Klavins, Komsuoglu, Full, & Koditschek, 2002). On the other hand, a CPG model, which can generate rhythmic movement patterns even without sensory feedback, must nonetheless be entrained by phasic feedback from the legs in order to achieve realistic locomotion gaits. In some animals, evidence exists that every limb involved
in cyclic locomotion has its own CPG (Delcomyn, 1980), and phasic feedback from muscles is indispensable to keep its CPGs in phase with the real-time movement of the limbs. Not surprisingly, CPG mechanisms used on various locomotive robots also require phasic feedback. Lewis, Etienne-Cummings, Hartmann, Xu, and Cohen (2003) implemented a CPG oscillator circuit to control a simple biped. AEP and PEP signals from its hip joints define the feedback to the CPG, resetting its oscillator circuit. Removal of the AEP or PEP signals caused quick deterioration of this biped's gait. On another quadruped robot (Fukuoka, Kimura, & Cohen, 2003), instead of discrete AEP and PEP signals, continuous position signals of the hip joints provide feedback to the neural oscillators of the CPG. The neural oscillator parameters were tuned in such a way that the minimum and maximum of the hip positions reset the flexor and extensor oscillators, respectively. This scheme evidently functions identically to AEP/PEP feedback. In summary, because AEP and PEP provide sufficient information about phase relations between legs, walking control structures usually depend on them (or their equivalents) as phasic feedback from the legs. However, the top level of the reflexive controller on our robot requires only the AEA signal as phasic feedback. Furthermore, this AEA signal only triggers the flexor reflex on the knee joint of the robot rather than triggering stance phases, as in other robots. In this sense, the role (and number) of the phasic feedback signals is much reduced in our reflexive controller. Although the AEA signal is by itself not sufficient to control the phase relations between legs, stable walking gaits appeared in our robot walking experiments (see section 4). This is because the reflexive controller and physical computation cooperate to accomplish the task of phasic walking gait control.
This shows that physical computation can help to simplify the controller structure. As described above, CPGs have been successfully applied to some real-time quadruped, hexapod, and other multilegged robots. However, for biped walking control based on CPG models, most of the current studies are performed with computer simulations. To our knowledge, no one has successfully realized real-time dynamic biped walking using a CPG model as the sole controller, because the CPG model itself cannot ensure stability of the biped gait. A well-known biped robot controlled by a CPG chip has been developed by Lewis et al. (2003). Its walking and running gaits look very natural, though they are performed on a treadmill instead of on a floor. However, this biped robot has a fatal weakness: its hips are fixed on a boom (rather than rotating freely around the boom axes, as in our robot), so it is actually supported by the boom. The boom greatly facilitates its control, avoiding the most difficult problem of dynamic stability control that is specific to biped robots. Thus, this robot is not a dynamic biped in a real sense; rather, it is more equivalent to one pair of legs of a multilegged robot. Using computer simulations, Taga (1995) found that stable biped gaits can be generated by combining CPGs and human biomechanics. In animals,
a CPG is a neural structure that is much more complex than a local reflex in anatomy and function. There is evidence that in mammalian and human locomotion, CPGs work on top of reflexes and exert their effects by modulating them. In evolution, simple monosynaptic reflexes must have appeared much earlier than the far more complex CPG structures. With both simulation analysis and real-system experiments, the current study has shown that local neuronal reflexes connected by a simple sensor-driven network are sufficient as a controller for dynamic biped walking, the most difficult form of legged locomotion in view of dynamic stability. 6.2 Physical Computation and Approximation. In contrast to exact representations and world models, physical computation often implies approximation. Approximation in the control mechanism leaves more room for physical computation. While conventional robots rely on precise trajectory planning and tracking control, biologically inspired robots rarely use preplanned or explicitly computed trajectories. Instead, they compute their movements approximately by exploiting the physical properties of themselves and the world, thus avoiding the accurate calibration and modeling required by conventional robotics. But in order to achieve a real-time walking gait in the real world, even these biologically inspired robots often have to depend on some kind of position or velocity control of their joints. For example, on a hexapod simulating the distributed locomotion control of insects (Beer et al., 1997), the outputs of motor neurons were integrated to produce a trajectory of joint positions that was tracked using proportional feedback position control. On a quadruped built by Kimura's group that implemented CPGs (neural oscillators) and local reflexes, all joints are PD controlled to move to their desired angles (Fukuoka et al., 2003).
Even on a half-passive biped controlled by a CPG chip, position control operated on its hip joints, though the passive dynamics of its knee joints was exploited for physical computation (Lewis, 2001). The principle of approximation embodied in the reflexive controller of our robot, however, goes one step further, in the sense that no position or velocity control is implemented on our robot. The neural structure of our reflexive controller does not depend on, or ensure the tracking of, any desired position. Indeed, it is this approximate nature of our reflexive controller that allows the physical properties of the robot itself, especially its passive dynamics (see Figure 8), to contribute implicitly to the generation of the overall gait trajectories, and that ensures its stability and robustness to some extent. As Raibert and Hodgins (1993, p. 353) argued, "Many researchers in neural motor control think of the nervous system as a source of commands that are issued to the body as direct orders. We believe that the mechanical system has a mind of its own, governed by the physical structure and the laws of physics. Rather than issuing commands,
the nervous system can only make suggestions which are reconciled with the physics of the system and the task." 7 Conclusions In this article, we presented the design of, and some walking experiments performed with, a novel neuromechanical structure for reflexive walking control. We demonstrated with a closely coupled neuromechanical system how physical computation can be exploited to generate a dynamically stable biped walking gait. In the experiments of walking at different speeds and climbing a shallow slope, it was also shown that the coupled dynamics of this neuromechanical system is sufficient to induce an autonomous, albeit limited, adaptation of the gait. While the biologically inspired model neurons used in our reflexive controller retain some properties of real neurons, they do not include one of the most significant features of real neurons: synaptic plasticity. As has been observed in human and animal locomotion, while walking gait generation may be reflexive, stability control of walking behavior has to be predictive. Although physical computation can ensure autonomy and stability to some extent, in order to obtain a stable walking gait over a wide parameter range, we have to rely on the plasticity of the neural structure. In the near future, we will apply proactive learning to this neuromechanical system (Porr & Wörgötter, 2003). The basic idea is to use the waveform resulting from the ground contact sensors to predict and thus avoid possible instabilities of the next walking step.
Appendix

In the following, we list the terms of the equations used in the simulation to build the Poincaré map function. For the definitions of l1 , l2 , l3 , l4 , l5 , φ, θ1 , θ2 , and ψ, see Figure 11. r is the radius of the curved foot, mt is the mass of the trunk, mh the mass of the thigh, and mk the mass of the shank with foot. g is the gravitational acceleration.

D11 = −4mk cos (φ) r 2 − 2mk l4 l2 + 2mk rl4 + 2mk l1 2 + l2 2 +2mt l1 l2 − 2l1 r − 2mt rl2 +4mk r cos (φ) l2 − 2mk r cos (φ) l4 + 2mk l4 2 + 2mt r cos (φ) l2 +2mt l2 l5 cos (θ1 ) − 2mt rl5 cos (θ1 ) +2mt r cos (φ) l1 + 2mt rl5 cos (θ1 − φ) + 2mt l1 l5 cos (θ1 ) −2mt cos (φ) r 2 + mt l5 2 −4mh cos (φ) r 2 − 2mh l1 l3 − 2mh l2 l3
+2mh rl3 + 2mk l1 l2 − 2mk l1 r − 4mk rl2 + 4mk r 2 +2mk l2 2 + 4mh l1 l2 − 4mh l1 r − 4mh rl2 +2mh l1 2 + 4mh r 2 + 2mh l2 2 −2mh rl3 cos (−θ2 + φ) −2mh l1 l3 cos (−θ1 + θ2 ) − 2mh l2 l3 cos (−θ1 + θ2 ) +2mh rl3 cos (−θ1 + θ2 ) − 2mk rl4 cos (−θ2 + ψ) +2mk rl4 cos (−θ1 + θ2 + ψ + φ) −2mk rl1 cos (−θ1 + θ2 + φ) −2mk l1 l4 cos (ψ) + 2mk l4 l2 cos (−θ1 + θ2 + ψ) +2mk l1 l4 cos (−θ1 + θ2 + ψ) +2mk rl1 cos (−θ1 + θ2 ) + 4mh r cos (φ) l2 −2mh r cos (φ) l3 + 4mh r cos (φ) l1 −2mk l1 2 cos (−θ1 + θ2 ) +2mh l3 2 + mt l1 2 + 2mt r 2 D12 = −mt rl5 cos (θ1 − φ) + mt rl5 cos (θ1 ) −mt (l1 + l2 )l5 cos (θ1 ) −mt mh rl3 cos (−θ1 + θ2 + φ) −mh rl3 cos (−θ1 + θ2 ) + mh l2 l3 cos (−θ1 + θ2 ) +mh l1 l3 cos (−θ1 + θ2 ) − mh l3 2 +mk rl1 cos (−θ1 + θ2 + φ) − mk rl4 cos (−θ1 + θ2 + ψ + φ) +mk l2 l1 cos (−θ1 + θ2 ) − mk rl1 cos (−θ1 + θ2 ) +2mk l1 l4 cos (ψ) +mk l1 2 cos (−θ1 + θ2 ) +mk l1 l4 cos (−θ1 + θ2 + ψ) −mk l4 l2 cos (−θ1 + θ2 + ψ) +mk rl4 cos (−θ1 + θ2 + ψ) −mk l1 2 − mk l4 2 D13 = −mh rl3 cos (−θ1 + θ2 + φ) +mh rl3 cos (−θ1 + θ2 ) − mh l2 l3 cos (−θ1 + θ2 )
−mh l1 l3 cos (−θ1 + θ2 ) + mh l3 2 + mk l1 2 − mk rl1 cos (−θ1 + θ2 ) +mk rl4 cos (−θ1 + θ2 ) + mk l2 l1 cos (−θ1 + θ2 ) +mk rl1 cos (−θ1 + θ2 ) −2mk l1 l4 cos (ψ) − mk l1 2 cos (−θ1 + θ2 ) +mk l1 l4 cos (−θ1 + θ2 + ψ) +mk l4 l2 cos (−θ1 + θ2 + ψ) −mk rl4 cos (−θ1 + θ2 + ψ) + mk l1 2 + mk l4 2 D14 = mk rl4 cos (−θ1 + θ2 + ψ + φ) −mk rl4 cos (−θ1 + θ2 + ψ) +mk l4 (l1 + l2 ) cos (−θ1 + θ2 + ψ) −mk l1 l4 cos (ψ) + mk l4 2 D21 = D12 D22 = mt l5 2 + mh l3 2 + mk l1 2 − 2mk l1 l4 cos (ψ) + mk l4 2 D23 = −mh l3 2 − mk l1 2 + 2mk l1 l4 cos (ψ) − mk l4 2 D24 = mk l1 l4 cos (ψ) − mk l4 2 D31 = D13 D32 = D23 D33 = mh l3 2 + mk l1 2 − 2l1 l4 cos (ψ) + mk l4 2 D34 = −mk l1 l4 cos (ψ) + mk l4 2 D41 = D14 D42 = D24 D43 = D34 D44 = mk l4 2 C1 = 2mk sin (φ) φ˙ 2 r 2 − 4mk sin (φ) φ˙ 2 l2 r −2mt sin (φ) φ˙ 2 (l1 + l2 )r
+2mt l5 sin (θ1 − φ) φ˙ 2 r ˙ 1 −2mt l5 cos (θ1 − φ) θ˙1 sin (φ) φl 2
+mt l5 sin (θ1 − φ) θ˙1 r +2mt sin (φ) φ˙ 2 r 2 + 4mh sin (φ) φ˙ 2 r 2 ˙ −3mt l5 sin (θ1 − φ) θ˙1 φr +2mt l5 sin (θ1 − φ) θ˙1 φ˙ cos (φ) r +2mh l3 sin (φ) φ˙ 2 r 2
+mh l3 sin (−θ1 + θ2 + φ) θ˙1 r ˙ +2mh l3 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φr +2mh l3 cos (−θ1 + θ2 + φ) θ˙1 θ˙2 sin (φ) l2 +2mh l3 cos (−θ1 + θ2 + φ) θ˙1 θ˙2 sin (φ) l1 2
−mh l3 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) l2 ˙ +3mh l3 sin (−θ1 + θ2 + φ) θ˙2 φr 2 −mk l1 sin (−θ1 + θ2 + φ) θ˙1 cos (φ) r
+2mk l1 sin (−θ1 + θ2 + φ) θ˙1 θ˙2 cos (φ) r −2mk l1 2 sin (−θ1 + θ2 + φ) θ˙1 θ˙2 cos (φ) −2mk l1 sin (−θ1 + θ2 + φ) θ˙1 θ˙2 cos (φ) l2 −2mk l1 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) r +2mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙2 ψ˙ sin (φ) l2 −2mk l1 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) l2 −2mk l1 2 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) 2
+mk l1 2 sin (−θ1 + θ2 + φ) θ˙1 cos (φ) +mk l4 cos (−θ1 + θ2 + ψ + φ) ψ˙ 2 sin (φ) l1 −2mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙2 ψ˙ sin (φ) r 2 +mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙2 sin (φ) l1
+2mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙2 ψ˙ sin (φ) l1 2 +mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙2 sin (φ) l2
−mk l4 cos (−θ1 + θ2 + ψ + φ) ψ˙ 2 l1 sin (−θ1 + θ2 + φ) 2
−mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙2 sin (φ) r
−2mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 θ˙2 sin (φ) l1 +2l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 θ˙2 sin (φ) r +2mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 ψ˙ sin (φ) r +2mk l1 2 cos (−θ1 + θ2 + φ) θ˙1 θ˙2 sin (φ) 2
−mk l1 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) l2 2 −mk l1 2 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) 2 +mk l1 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) r 2 +mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 sin (φ) l2
˙ 2 +2mk l4 cos (−θ1 + θ2 + ψ + φ) ψ˙ sin (φ) φl −2mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 ψ˙ sin (φ) l1 −2l1 cos (−θ1 + θ2 + φ) θ˙1 θ˙2 sin (φ) r ˙ 1 sin (−θ1 + θ2 + φ) θ˙2 −2mk l4 cos (−θ1 + θ2 + ψ + φ) ψl ˙ 1 sin (−θ1 + θ2 + φ) φ˙ −2mk l4 cos (−θ2 + ψ + φ) ψl ˙ −2mk l4 cos (−θ1 + θ2 + ψ + φ) ψ˙ sin (φ) φr 2 −mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 sin (φ) r
−2mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 ψ˙ sin (φ) l2 −2mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 θ˙2 sin (φ) l2 2
−mk l1 2 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) 2
+mk l1 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) r 2 −mh l3 sin (−θ1 + θ2 + φ) θ˙2 cos (φ) r
−2mh l3 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) r 2
−mk l1 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) l2 2 +mh l3 sin (−θ1 + θ2 + φ) θ˙2 cos (φ) l2 2 +mh l3 sin (−θ1 + θ2 + φ) θ˙2 cos (φ) r
˙ +mh l3 sin (−θ1 + θ2 + φ) θ˙1 φr +2mh l3 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) l2 2
+mh l3 sin (−θ1 + θ2 + φ) θ˙2 r ˙ −2mk l1 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φr
˙ 2 +2mk l1 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φl +2mk l1 cos (−θ1 + θ2 + φ) θ˙1 θ˙2 sin (φ) l2 2
−mh l3 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) l2 2
−mh l3 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) l1 2
−mh l3 sin (−θ1 + θ2 + φ) θ˙1 cos (φ) r −2mh l3 cos (−θ1 + θ2 + φ) θ˙1 θ˙2 sin (φ) r 2
+mh l3 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) r −2mh l3 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) l1 2
−mh l3 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) l1 +2mh l3 sin (−θ1 + θ2 + φ) θ˙1 θ˙2 cos (φ) r +2mh l3 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) r −2mh l3 sin (−θ2 + φ) θ˙1 φ˙ cos (φ) l2 2
+mh l3 sin (−θ1 + θ2 + φ) θ˙1 cos (φ) l2 2
+mh l3 sin (−θ1 + θ2 + φ) θ˙1 cos (φ) l1 −2mh l3 sin (−θ1 + θ2 + φ) θ˙1 θ˙2 cos (φ) l2 ˙ 2 +2mh l3 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φl ˙ 1 +2mh l3 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φl +2mh l3 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) l1 ˙ 1 −2mh l3 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φl −4mh sin (φ) φ˙ 2 l2 r −4mh sin (φ) φ˙ 2 l1 r +2mh l3 sin (−θ1 + θ2 + φ) φ˙ 2 r ˙ 2 −2mt l5 cos (θ1 − φ) θ˙1 sin (φ) φl 2
+mt l5 cos (θ1 − φ) θ˙1 sin (φ) l2 2
+mt l5 cos (θ1 − φ) θ˙1 sin (φ) l1 2
−mt l5 cos (θ1 − φ) θ˙1 sin (φ) r 2 +mh l3 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) r 2 −mt l5 sin (θ1 − φ) θ˙1 cos (φ) r
−2mt l5 sin (θ1 − φ) θ˙1 φ˙ cos (φ) l2 ˙ +2mt l5 cos (θ1 − φ) θ˙1 sin (φ) φr 2
+mt l5 sin (θ1 − φ) θ˙1 cos (φ) l2 2
+mt l5 sin (θ1 − φ) θ˙1 cos (φ) l1 −2mt l5 sin (θ1 − φ) θ˙1 φ˙ cos (φ) l1 +2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 θ˙2 r ˙ −3mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φr −mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ 2 cos (φ) l2 ˙ +2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 ψr ˙ −3mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 φr ˙ +3mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 φr −2mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φ˙ cos (φ) l2 2 −mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 r 2 −mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 r
˙ 2 −2mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 sin (φ) φl −2mk sin (φ) φ˙ 2 l1 r 2
+mk l1 sin (−θ1 + θ2 + φ) θ˙1 r +2mk l1 sin (−θ1 + θ2 + φ) φ˙ 2 r ˙ +3mk l1 sin (−θ1 + θ2 + φ) θ˙2 φr −2mk l1 sin (−θ1 + θ2 + φ) θ˙1 θ˙2 r −2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 φ˙ cos (φ) l2 −2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 φ˙ cos (φ) l1 +2mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φ˙ cos (φ) r +2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 φ˙ cos (φ) r +2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 ψ˙ cos (φ) l1 ˙ 1 cos (−θ1 + θ2 + φ) +2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 ψl 2
−mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 cos (φ) l2 ˙ −3mk l1 sin (−θ1 + θ2 + φ) θ˙1 φr 2 +mk l1 sin (−θ1 + θ2 + φ) θ˙2 r
2 −mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 cos (φ) l1
−2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 ψ˙ cos (φ) l2 −2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 ψ˙ cos (φ) l1 2
+mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 cos (φ) r +2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 ψ˙ cos (φ) r 2
−mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 cos (φ) l2 +4mk sin (φ) φ˙ 2 r 2 +2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 φ˙ cos (φ) l1 2
−mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 cos (φ) l1 +2mk l1 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) l2 +2mk l1 2 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) −2mk l4 sin (−θ1 + θ2 + ψ + φ) φ˙ 2 r +2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 θ˙2 cos (φ) l2 +2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 ψ˙ cos (φ) l2 −2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 θ˙2 cos (φ) r ˙ 1 cos (−θ1 + θ2 + φ) −2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 ψl −2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 φ˙ cos (φ) r +2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 φ˙ cos (φ) l2 2
+mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 cos (φ) r 2 +mk l1 sin (−θ1 + θ2 + φ) θ˙2 cos (φ) l2
+2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 θ˙2 cos (φ) l1 −2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 ψ˙ cos (φ) r 2
+mk l1 2 sin (−θ1 + θ2 + φ) θ˙2 cos (φ) −mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ 2 r −mk l4 cos (−θ1 + θ2 + ψ + φ) ψ˙ 2 sin (φ) r +mk l4 cos (−θ1 + θ2 + ψ + φ) ψ˙ 2 sin (φ) l2 +2mk l1 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) r 2 +mk l1 sin (−θ1 + θ2 + φ) θ˙1 cos (φ) l2
˙ 1 −2mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 sin (φ) φl
˙ 1 +2mk l4 cos (−θ1 + θ2 + ψ + φ) ψ˙ sin (φ) φl ˙ 1 sin (−θ1 + θ2 + φ) θ˙1 +2mk l4 cos (−θ1 + θ2 + ψ + φ) ψl ˙ 2 −2mh l3 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φl ˙ −2mh l3 cos (−θ1 + θ2 + φ) θ˙1 sin sin (φ) φr ˙ +2mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 sin (φ) φr ˙ 2 +2mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙2 sin (φ) φl ˙ 1 +2mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙2 sin (φ) φl ˙ −2mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙2 sin (φ) φr +2mk l1 2 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φ˙ +mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ 2 cos (φ) r −2mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φ˙ cos (φ) l1 ˙ +2mk l1 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φr −2mk l1 2 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φ˙ ˙ −2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 ψr ˙ 1 cos (−θ1 + θ2 + φ) +2mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φl C2 = −mt l5 sin (θ1 − φ) φ˙ 2 r ˙ 1 +mt l5 cos (θ1 − φ) θ˙1 sin (φ) φl ˙ +mt l5 sin (θ1 − φ) θ˙1 φr −mt l5 sin (θ1 − φ) θ˙1 φ˙ cos (φ) r +mk l1 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) r +mk l1 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) l2 +mk l1 2 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) +mk l4 cos (−θ1 + θ2 + ψ + φ) ψ˙ 2 l1 sin (−θ1 + θ2 + φ) ˙ 2 −mk l4 cos (−θ1 + θ2 + ψ + φ) ψ˙ sin (φ) φl ˙ 1 sin (−θ1 + θ2 + φ) θ˙2 +2mk l4 cos (−θ1 + θ2 + ψ + φ) ψl ˙ 1 sin (−θ1 + θ2 + φ) φ˙ +2mk l4 cos (−θ1 + θ2 + ψ + φ) ψl ˙ +mk l4 cos (−θ1 + θ2 + ψ + φ) ψ˙ sin (φ) φr +mh l3 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) r −mh l3 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) l2
˙ 2 −mk l1 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φl +mh l3 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) l1 −mh l3 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) r +mh l3 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) l2 ˙ 2 −mh l3 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φl ˙ 1 −mh l3 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φl −mh l3 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) l1 −mh l3 sin (−θ1 + θ2 + φ) φ˙ 2 r ˙ 2 +mt l5 cos (θ1 − φ) θ˙1 sin (φ) φl +mt l5 sin (θ1 − φ) θ˙1 φ˙ cos (φ) l2 ˙ −mt l5 cos (θ1 − φ) θ˙1 sin (φ) φr +mt l5 sin (θ1 − φ) θ˙1 φ˙ cos (φ) l1 ˙ +mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φr ˙ +mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 φr ˙ −mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 φr +mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φ˙ cos (φ) l2 ˙ 2 +mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 sin (φ) φl −mk l1 sin (−θ1 + θ2 + φ) φ˙ 2 r ˙ −mk l1 sin (−θ1 + θ2 + φ) θ˙2 φr −mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ 2 l1 cos (−θ1 + θ2 + φ) −mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φ˙ cos (φ) r −mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 φ˙ cos (φ) r ˙ 1 cos (−θ1 + θ2 + φ) −2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 ψl ˙ +mk l1 sin (−θ1 + θ2 + φ) θ˙1 φr −mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 φ˙ cos (φ) l1 −mk l1 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) l2 −mk l1 2 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) +mk l4 sin (−θ1 + θ2 + ψ + φ) φ˙ 2 r ˙ 1 cos (−θ1 + θ2 + φ) +2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 ψl +mk l4 sin (−θ1 + θ2 + ψ + φ) φ˙ cos (φ) r
−mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 φ˙ cos (φ) l2 −mk l1 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) r ˙ 2 +mk l1 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φl ˙ 1 +mk l4 cos (−θ2 + ψ + φ) θ˙1 sin (φ) φl ˙ 1 −mk l4 cos (−θ1 + θ2 + ψ + φ) sin (φ) φl ˙ 1 sin (−θ1 + θ2 + φ) θ˙1 −2mk l4 cos (−θ1 + θ2 + ψ + φ) ψl ˙ 2 +mh l3 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φl ˙ +mh l3 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φr ˙ −mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 sin (φ) φr ˙ 1 −mk l4 cos (−θ1 + θ2 + ψ + φφ) φl ˙ +mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙2 sin (φ) φr −mk l1 2 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φ˙ +mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φ˙ cos (φ) l1 ˙ −mk l1 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φr +mk l1 2 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φ˙ ˙ 1 cos (−θ1 + θ2 + φ) −2mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φl ˙ C3 = mh l3 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φr ˙ +mh l3 sin (−θ1 + θ2 + φ) θ˙2 φr −mk l1 sin (−θ1 + θ2 + φ) φ˙ cos (φ) r −mk l1 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) l2 −mk l4 cos (−θ1 + θ2 + ψ + φ) ψ˙ 2 l1 sin (−θ1 + θ2 + φ) ˙ −mk l4 cos (−θ1 + θ2 + ψ + φ) ψ˙ sin (φ) φr +mh l3 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) l2 ˙ −mk l1 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φr ˙ 2 +mk l1 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φl −mh l3 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) l1 +mh l3 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) r −mh l3 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) l2 ˙ 2 +mh l3 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φl
˙ 1 +mh l3 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φl ˙ 1 −mh l3 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φl +mh l3 sin (−θ1 + θ2 + φ) φ˙ 2 r ˙ −mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φr ˙ −mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 φr ˙ +mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 φr −mk l4 sin (−θ2 + ψ + φ) ψ˙ φ˙ cos (φ) l2 ˙ 2 −mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 sin (φ) φl +mk l1 sin (−θ1 + θ2 + φ) φ˙ 2 r ˙ +mk l1 sin (−θ1 + θ2 + φ) θ˙2 φr −mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 φ˙ cos (φ) l2 +mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 φ˙ cos (φ) l1 +mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φ˙ cos (φ) r +mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 φ˙ cos (φ) r ˙ 1 cos (−θ1 + θ2 + φ) +2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙2 ψl ˙ −mk l1 sin (−θ1 + θ2 + φ) θ˙1 φr +mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 φ˙ cos (φ) l1 +mk l1 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) l2 +mk l1 2 sin (−θ1 + θ2 + φ) θ˙2 φ˙ cos (φ) −mk l4 sin (−θ1 + θ2 + ψ + φ) φ˙ 2 r ˙ 1 cos (−θ1 + θ2 + φ) −2mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 ψl −mk l4 sin (−θ1 + θ2 + ψ + φ) φ˙ cos (φ) r +mk l4 sin (−θ1 + θ2 + ψ + φ) θ˙1 φ˙ cos (φ) l2 +mk l1 sin (−θ1 + θ2 + φ) θ˙1 φ˙ cos (φ) r ˙ 2 −mk l1 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φl ˙ 1 −mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 sin (φ) φl ˙ 1 +mk l4 cos (−θ1 + θ2 + ψ + φ) ψ˙ sin (φ) φl ˙ 1 sin (−θ1 + θ2 + φ) θ˙1 +2mk l4 cos (−θ1 + θ2 + ψ + φ) ψl ˙ 2 −mh l3 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φl ˙ −mh l3 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φr
˙ +mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙1 sin (φ) φr ˙ 2 +mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙2 sin (φ) φl ˙ 1 +mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙2 sin (φ) φl ˙ −mk l4 cos (−θ1 + θ2 + ψ + φ) θ˙2 sin (φ) φr +mk l1 2 cos (−θ1 + θ2 + φ) θ˙1 sin (φ) φ˙ −mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φ˙ cos (φ) l1 ˙ +mk l1 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φr −mk l1 2 cos (−θ1 + θ2 + φ) θ˙2 sin (φ) φ˙ ˙ 1 cos (−θ1 + θ2 + φ) +2mk l4 sin (−θ1 + θ2 + ψ + φ) ψ˙ φl C4 = −mk l4 sin (−θ1 + θ2 + ψ + φ) r φ˙ 2 ˙ 1 ψ˙ +mk l4 cos (−θ1 + θ2 + ψ + φ) sin (φ) φl ˙ 2 ψ˙ +mk l4 cos (−θ1 + θ2 + ψ + φ) sin (φ) φl ˙ ψ˙ −mk l4 cos (−θ1 + θ2 + ψ + φ) sin (φ) φr −mk l4 cos (−θ1 + θ2 + ψ + φ) l1 sin (−θ1 + θ2 + φ) φ˙ ψ˙ ˙ θ˙1 +mk l4 cos (−θ1 + θ2 + ψ + φ) sin (φ) φr ˙ ψ˙ +mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φr +mk l4 cos (−θ1 + θ2 + ψ + φ) l1 sin (−θ1 + θ2 + φ) θ˙1 ψ˙ ˙ 1 ψ˙ −mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φl +mk l4 sin (−θ1 + θ2 + ψ + φ) l1 cos (−θ1 + θ2 + φ) φ˙ ψ˙ ˙ 2 θ˙1 −mk l4 cos (−θ1 + θ2 + ψ + φ) sin (φ) φl ˙ θ˙2 −mk l4 cos (−θ1 + θ2 + ψ + φ) sin (φ) φr ˙ 1 θ˙2 +mk l4 cos (−θ1 + θ2 + ψ + φ) sin (φ) φl ˙ 2 θ˙2 −mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φl ˙ 2 θ˙1 +mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φl ˙ θ˙2 +mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φr ˙ θ˙1 −mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φr ˙ 1 θ˙2 −mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φl ˙ 2 ψ˙ −mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φl ˙ 1 θ˙1 +mk l4 sin (−θ1 + θ2 + ψ + φ) cos (φ) φl
+mk l4 sin (−θ1 + θ2 + ψ + φ) r φ˙ θ˙1 −mk l4 sin (−θ1 + θ2 + ψ + φ) r φ˙ ψ˙ −mk l4 sin (−θ1 + θ2 + ψ + φ) r φ˙ θ˙2

G 1 = mt g sin (φ) r − mt g sin (φ) l2 −mt g sin (φ) l1 + mt gl5 sin (θ1 − φ) +2mh g sin (φ) r − 2mh g sin (φ) l2 −2mh g sin (φ) l1 +mh gl3 sin (φ) +mh gl3 sin (−θ1 + θ2 + φ) +2mk g sin (φ) r + mk g sin (φ) l4 −2mk g sin (φ) l2 −mk g sin (φ) l1 + mk gl1 sin (−θ1 + θ2 + φ) −mk gl4 sin (−θ1 + θ2 + ψ + φ)

G 2 = −mt gl5 sin (θ1 − φ) −mh gl3 sin (−θ1 + θ2 + φ) −mk gl1 sin (−θ1 + θ2 + φ) +mk gl4 sin (−θ1 + θ2 + ψ + φ)

G 3 = mh gl3 sin (−θ1 + θ2 + φ) +mk gl1 sin (−θ1 + θ2 + φ) −mk gl4 sin (−θ1 + θ2 + ψ + φ)

G 4 = −mk gl4 sin (−θ1 + θ2 + ψ + φ)

Acknowledgments

This work was supported by SHEFC grant INCITE to F. W. We thank Kevin Swingler for correction of the text.

References

Beer, R., & Chiel, H. (1992). A distributed neural network for hexapod robot locomotion. Neural Computation, 4, 356–365.
Beer, R., Quinn, R., Chiel, H., & Ritzmann, R. (1997). Biologically inspired approaches to robotics. Communications of the ACM, 40(3), 30–38.
Boone, G., & Hodgins, J. (1997). Slipping and tripping reflexes for bipedal robots. Autonomous Robots, 4(3), 259–271.
Brown, I., & Loeb, G. (1999). Biomechanics and neural control of movement. New York: Springer-Verlag.
Cham, J., Bailey, S., & Cutkosky, M. (2000). Robust dynamic locomotion through feedforward-preflex interaction. In ASME IMECE Proceedings, Orlando, FL.
Chiel, H., & Beer, R. (1997). The brain has a body: Adaptive behavior emerges from interactions of nervous system, body, and environment. Trends in Neuroscience, 20, 553–557.
Cruse, H., Kindermann, T., Schumm, M., Dean, J., & Schmitz, J. (1998). Walknet—a biologically inspired network to control six-legged walking. Neural Networks, 11(7–8), 1435–1447.
Cruse, H., & Saavedra, M. (1996). Curve walking in crayfish. Journal of Experimental Biology, 199, 1477–1482.
Cruse, H., & Warnecke, H. (1992). Coordination of the legs of a slow-walking cat. Experimental Brain Research, 89, 147–156.
Delcomyn, F. (1980). Neural basis of rhythmic behavior in animals. Science, 210, 492–498.
Ferrell, C. (1995). A comparison of three insect-inspired locomotion controllers. Robotics and Autonomous Systems, 16, 135–159.
Fukuoka, Y., Kimura, H., & Cohen, A. (2003). Adaptive dynamic walking of a quadruped robot on irregular terrain based on biological concepts. International Journal of Robotics Research, 22, 187–202.
Full, R. J., & Tu, M. S. (1990). Mechanics of six-legged runners. Journal of Experimental Biology, 148, 129–146.
Funabashi, H., Takeda, Y., Itoh, S., & Higuchi, M. (2001). Disturbance compensating control of a biped walking machine based on reflex motions. JSME International Journal Series C—Mechanical Systems, Machine Elements and Manufacturing, 44, 724–730.
Gallagher, J., Beer, R., Espenschied, K., & Quinn, R. (1996). Application of evolved locomotion controllers to a hexapod robot. Robotics and Autonomous Systems, 19, 95–103.
Garcia, M. (1999). Stability, scaling, and chaos in passive-dynamic gait models.
Unpublished doctoral dissertation, Cornell University.
Hurmuzlu, Y. (1993). Dynamics of bipedal gait, part II: Stability analysis of a planar five-link biped. ASME Journal of Applied Mechanics, 60(2), 337–343.
Iida, F., & Pfeifer, R. (2004). Self-stabilization and behavioral diversity of embodied adaptive locomotion. Lecture Notes in Artificial Intelligence, 3139, 119–129.
Klavins, E., Komsuoglu, H., Full, R., & Koditschek, D. (2002). Neurotechnology for biomimetic robots. Cambridge, MA: MIT Press.
Lewis, M. (2001). Certain principles of biomorphic robots. Autonomous Robots, 11, 221–226.
Lewis, M., Etienne-Cummings, R., Hartmann, M., Xu, Z., & Cohen, A. (2003). An in silico central pattern generator: Silicon oscillator, coupling, entrainment, and physical computation. Biological Cybernetics, 88, 137–151.
Porr, B., & Wörgötter, F. (2003). Isotropic sequence order learning. Neural Computation, 15, 831–864.
Porr, B., & Wörgötter, F. (2005). Inside embodiment: What means embodiment to radical constructivists? Kybernetes, 34, 105–117.
Pratt, J. (2000). Exploiting inherent robustness and natural dynamics in the control of bipedal walking robots. Unpublished doctoral dissertation, Massachusetts Institute of Technology.
Raibert, H., & Hodgins, J. (1993). Biological neural networks in invertebrate neuroethology and robotics. Orlando, FL: Academic Press.
Reeve, R. (1999). Generating walking behaviors in legged robots. Unpublished doctoral dissertation, University of Edinburgh.
Taga, G. (1995). A model of the neuro-musculo-skeletal systems for human locomotion. Biological Cybernetics, 73, 97–111.
Van der Linde, R. Q. V. (1998). Active leg compliance for passive walking. In Proceedings of IEEE International Conference on Robotics and Automation. Piscataway, NJ: IEEE.
Wadden, T., & Ekeberg, O. (1998). A neuro-mechanical model of legged locomotion: Single leg control. Biological Cybernetics, 79, 161–173.
Yang, J., Stephens, M., & Vishram, R. (1998). Infant stepping: A method to study the sensory control of human walking. Journal of Physiology (London), 507, 927–937.
Ziemke, T. (2001). Are robots embodied? In First International Workshop on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems (pp. 75–93).
Received December 20, 2004; accepted September 28, 2005.
LETTER
Communicated by Robert Kass
A Set Probability Technique for Detecting Relative Time Order Across Multiple Neurons Anne C. Smith [email protected] Department of Anesthesiology and Pain Medicine, University of California at Davis, Davis, CA 95616, U.S.A.
Peter Smith [email protected] Department of Mathematics, University of Keele, Keele, Staffordshire, ST5 5BG, U.K.
With the development of multielectrode recording techniques, it is possible to measure the cell firing patterns of multiple neurons simultaneously, generating a large quantity of data. Identification of the firing patterns within these large groups of cells is an important and challenging problem in data analysis. Here, we consider the problem of measuring the significance of a repeat in the cell firing sequence across arbitrary numbers of cells. In particular, we consider the question: given a ranked order of cells numbered 1 to N, what is the probability that another sequence of length n contains j consecutive increasing elements? Assuming each element of the sequence is drawn with replacement from the numbers 1 through N, we derive a recursive formula for the probability of a sequence of length j or more. For n < 2j, a closed-form solution is derived. For n ≥ 2j, we obtain upper and lower bounds for these probabilities for various combinations of parameter values. These can be computed very quickly. For a typical case with small N (<10) and large n (<3000), sequences of length 7 and 8 are statistically very unlikely. A potential application of this technique is in the detection of repeats in hippocampal place cell order during sleep. Unlike most previous articles on increasing runs in random lists, we use a probability approach based on sets of overlapping sequences.

1 Introduction

Development of analysis techniques to find temporal patterns in large sets of neural spike train data is an important current research problem (Buzsáki, 2004; Brown, Kass, & Mitra, 2004). This is increasingly critical because of technology advances allowing large numbers of individual neurons (from 10 up to over 100) to be recorded simultaneously. One common target of this multielectrode recording technology is the hippocampal place cell. Place
Neural Computation 18, 1197–1214 (2006)
© 2006 Massachusetts Institute of Technology
cells are neurons that fire in a specific temporal order when rats navigate through space (O'Keefe & Dostrovsky, 1971). There is growing experimental evidence from place cells in rats (Pavlides & Winson, 1989; Wilson & McNaughton, 1994; Skaggs & McNaughton, 1996; Nádasdy, Hirase, Czurkó, Csicsvari, & Buzsáki, 1999; Kudrimoti, Barnes, & McNaughton, 1999; Louie & Wilson, 2001; Lee & Wilson, 2002, 2004) and other animals (Dave & Margoliash, 2000; Hoffman & McNaughton, 2002) that temporal ordering of place cell firing from behavior persists during both rapid eye movement (REM) and slow-wave sleep (SWS) stages of subsequent sleep. However, the detection of these patterns is particularly difficult due to shortcomings in statistical analysis techniques for multiple neurons and due to the lack of an observed behavioral correlate often present during awake recording. Techniques used to analyze temporal order among multiple neurons range from analysis of pairwise correlations (Wilson & McNaughton, 1994; Kudrimoti et al., 1999), to the joint peristimulus time histogram (JPSTH) for triplets (Aertsen, Gerstein, Habib, & Palm, 1989), to template matching techniques for multiple neurons (Abeles & Gerstein, 1988; Nádasdy et al., 1999; Louie & Wilson, 2001). Other techniques include the unitary events analysis (Grün, Diesmann, & Aertsen, 2001) and gravity methods (Gerstein & Aertsen, 1985). Recently, Lee and Wilson (2002, 2004) have employed a combinatorial method. This method differs from previous approaches in that it relies on an initial parsing of the data and then focuses on whether longer sequences within the data are likely to have occurred by chance. Here, using a similar overall approach but with different assumptions, we introduce a complementary set probability technique. Our approach can be described succinctly using elementary probability as follows. Assume an urn contains balls numbered 1 to N.
Pick a ball at random, record its number, and replace it in the urn. Repeat this procedure n times, resulting in a list of n numbers. This list is called a word. The goal is to compute the probability that the word contains a strictly increasing sequence (or run) of numbers of length j or more. For example, with j = 5, n = 9, and N = 8, one possible word is {5, 3, 2, 3, 5, 7, 8, 2, 2}, where the run of length 5 is 2, 3, 5, 7, 8. In the example of place cell firing, the parameters correspond as follows. The place cells observed during behavior are each assigned a number (1 through N) according to their position in the firing order. After smoothing the sleep data, we next write a list of length n, composed of numbers from 1 to N, corresponding to the observed order of cell firing during a subsequent sleep epoch. Parameter n can be larger or smaller than N. The parameter j corresponds to the length of the longest sequence of strictly increasing numbers observed within the list of n numbers. If the computed probability of j or more occurring by chance is low enough, then we can conclude that the temporal order of cell firing from behavior has been preserved during sleep. This article focuses on the development of an efficient technique to compute recursively upper and lower bounds for this probability of a strictly increasing subsequence of length j or more. Because
the calculations can be carried out very quickly, it is possible to study the properties of the probability function across a wide range of parameter values. Although longest increasing subsequence (LIS) problems have been studied in probability and combinatorics (Wolfowitz, 1944; Schensted, 1960; Knuth, 1970; Chryssaphinou & Vaggelatou, 2001), as well as in computer science (Albert et al., 2004; Bespamyatnikh & Segal, 2000), our approach is different from these methods and from that of Lee and Wilson (2004) in two respects. First, our approach based on set probability yields explicit formulas for the required probabilities. Second, we assume each element in the word is chosen with replacement from the reference sequence of length N. Thus, the resulting hypothesis is not restrictive, and the probability bounds are computationally feasible over a wide range of parameter values.

2 Methods

We wish to compute the probability of j or more strictly increasing consecutive elements occurring by chance in a word of length n, where each element is chosen independently with replacement from a reference sequence {1, . . . , N}. We do this in two steps. First, we compute the probability that j numbers picked independently with replacement from N numbers are strictly increasing. Second, we construct disjoint events based on the starting position of the strictly increasing sequence within the word. Since the events are now disjoint, it is possible to compute the required probability by summing them.

2.1 Step 1. Define the event F_r(j, n, N) as j consecutive increasing elements starting at element r. It follows that 1 ≤ r ≤ n − j + 1. Other numbers in the word can take any values. We start by asking how many ways one can pick j different numbers from an ordered list {1, . . . , N}. This is N!/(j!(N − j)!). Since these numbers are distinct, they can always be ordered in such a way as to make them strictly increasing. We then divide this count by the total number of ways of picking j numbers from N with replacement, N^j, giving

\Pr[F_r(j, n, N)] = N^{-j} \frac{N!}{j!(N-j)!},   (2.1)
which is in fact independent of r and n. We call this probability p_{j,N}. An alternative method to compute this probability is by construction. If we count all the possible ways that we can get j strictly increasing combinations from N numbers, we get

\Pr[F_r(j, n, N)] = N^{-j} \sum_{i_{j-1}=1}^{N-j+1} \sum_{i_{j-2}=1}^{i_{j-1}} \cdots \sum_{i_1=1}^{i_2} i_1.   (2.2)
Using induction, it can be shown that the multiple sum equals the binomial coefficient, that is,

\sum_{i_{j-1}=1}^{N-j+1} \sum_{i_{j-2}=1}^{i_{j-1}} \cdots \sum_{i_1=1}^{i_2} i_1 = \frac{N!}{j!(N-j)!}.   (2.3)
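Equations 2.1 to 2.3 can be sanity-checked by brute-force enumeration over all N^j equally likely words of length j. A minimal sketch in Python (the function name is my own; the authors' software is in Matlab):

```python
from itertools import product
from math import comb

def p_brute(j, N):
    """Probability that j draws with replacement from {1, ..., N}
    are strictly increasing, by enumerating all N**j words."""
    words = product(range(1, N + 1), repeat=j)
    hits = sum(1 for w in words if all(a < b for a, b in zip(w, w[1:])))
    return hits / N**j
```

For example, p_brute(3, 5) agrees with comb(5, 3) / 5**3 = 0.08, as equation 2.1 predicts.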
We calculate the probability p_{j,N} using equation 2.1, as it is computationally less intensive.

2.2 Step 2. To avoid overlapping events, we now define F_r^*(j, n, N) to be the event that a strictly increasing word of length j starts at position r, with no word of length j occurring before or including part of F_r^*(j, n, N). It follows that the number at position r − 1 (r ≥ 2) must not be less than the number at position r and that there can be any numbers from positions r + j to n. Since there can be any numbers in positions r + j to n, our calculations yield the probability of a strictly increasing sequence of j or more. The event F_r^*(j, n, N) can be expressed in the form

F_r^*(j, n, N) = \begin{cases} F_1 & r = 1 \\ F_r \setminus F_{r-1} & 1 < r < j + 1 \\ (F_r \setminus F_{r-1}) \setminus (F_1 \cup F_2 \cup \cdots \cup F_{r-j}) & r \ge j + 1 \end{cases}   (2.4)
where the notation A\B means the event A but not the event B, and the terms in parentheses (j, n, N) have been dropped, for simplicity, on the right-hand side. Since these events are disjoint, the desired total probability of a strictly increasing run of length j or more in n, denoted by H_j, is given by

H_j = \sum_{r=1}^{n-j+1} \Pr[F_r^*(j, n, N)].   (2.5)
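The probability H_j of equation 2.5 can also be estimated independently by Monte Carlo simulation: draw random words and count those containing a run of length j or more. A sketch in Python (helper names are my own, not from the original article):

```python
import random

def longest_increasing_run(word):
    """Length of the longest run of strictly increasing consecutive elements."""
    best = cur = 1
    for a, b in zip(word, word[1:]):
        cur = cur + 1 if b > a else 1
        best = max(best, cur)
    return best

def estimate_H(j, n, N, trials=20000, seed=1):
    """Monte Carlo estimate of H_j: fraction of random words of length n,
    drawn with replacement from 1..N, containing a run of j or more."""
    rng = random.Random(seed)
    hits = sum(
        longest_increasing_run([rng.randint(1, N) for _ in range(n)]) >= j
        for _ in range(trials)
    )
    return hits / trials
```

For the example word {5, 3, 2, 3, 5, 7, 8, 2, 2} from the introduction, longest_increasing_run returns 5, and estimate_H(4, 5, 8) falls close to the exact value 0.0325 reported later in Table 1.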
Note that the probability of exactly j strictly increasing numbers can be computed from H_j − H_{j+1}. It remains to compute the terms in equation 2.4. For the top equation, where r = 1, F_1 is computed directly from equation 2.1 and is the probability of a sequence of j or more strictly increasing elements starting at the first position. For 1 < r < j + 1, we note that

\Pr(F_r \setminus F_{r-1}) = \Pr(F_r) - \Pr(F_r \cap F_{r-1}),   (2.6)

where

\Pr(F_r \cap F_{r-1}) = \Pr[F_{r-1}(j + 1, n, N)].   (2.7)
That is, the probability of an increasing sequence of length j or more, starting at position r and with an element greater than or equal to the element at r − 1, is equal to the probability of a sequence of length j or more, starting at r, minus the probability of a sequence of length j + 1 or more starting at position r − 1. Note that Pr(F_r \ F_{r−1}) does not depend on its starting position in the word, r, or the word length, n. For n < 2j, the computations are complete, and H_j can be computed exactly from

H_j = p_{j,N} + (n - j)(p_{j,N} - p_{j+1,N}) = \frac{N![(N+1)j(n-j) + N(j+1)]}{N^{j+1}(j+1)!(N-j)!}.   (2.8)
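Both forms of equation 2.8 are easy to evaluate directly. A short Python sketch checking that the summed form and the closed-form fraction agree, and that they reproduce a Table 1 entry (function names are my own):

```python
from math import comb, factorial

def p(j, N):
    # p_{j,N} of equation 2.1
    return comb(N, j) / N**j

def H_sum(j, n, N):
    # H_j = p_{j,N} + (n - j)(p_{j,N} - p_{j+1,N}), valid for n < 2j
    return p(j, N) + (n - j) * (p(j, N) - p(j + 1, N))

def H_closed(j, n, N):
    # right-hand side of equation 2.8
    return (factorial(N) * ((N + 1) * j * (n - j) + N * (j + 1))
            / (N**(j + 1) * factorial(j + 1) * factorial(N - j)))
```

For j = 4, N = 8, n = 5, both forms give 0.0325 to four decimal places, matching Table 1.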
To define disjoint events at larger values of r (r ≥ j + 1), the computation is more complicated. The formula in equation 2.4 can be expanded as follows:

\Pr(F_r^*) = \Pr[(F_r \setminus F_{r-1}) \setminus (F_1 \cup F_2 \cup \cdots \cup F_{r-j})]
= \Pr(F_r \setminus F_{r-1}) - \Pr\left[(F_r \setminus F_{r-1}) \cap \left(\bigcup_{s=1}^{r-j} F_s\right)\right]
= \Pr(F_r) - \Pr(F_r \cap F_{r-1}) - \Pr\left(F_r \cap \bigcup_{s=1}^{r-j} F_s\right) + \Pr\left(F_r \cap F_{r-1} \cap \bigcup_{s=1}^{r-j} F_s\right).   (2.9)
In this expression,

\Pr\left(F_r \cap \bigcup_{s=1}^{r-j} F_s\right) = \Pr(F_r)\,\Pr\left(\bigcup_{s=1}^{r-j} F_s\right) = \Pr(F_r) \sum_{s=1}^{r-j} \Pr(F_s^*),   (2.10)
since F_r does not overlap any F_s, and

\Pr\left(F_r \cap F_{r-1} \cap \bigcup_{s=1}^{r-j} F_s\right) = \Pr\left(F_r \cup F_{r-1} \cup \bigcup_{s=1}^{r-j} F_s\right) - \Pr\left(\bigcup_{s=1}^{r-j} F_s\right) - \Pr(F_r) - \Pr(F_{r-1})
\quad + \Pr(F_r \cap F_{r-1}) + \Pr\left(F_r \cap \bigcup_{s=1}^{r-j} F_s\right) + \Pr\left(F_{r-1} \cap \bigcup_{s=1}^{r-j} F_s\right).   (2.11)
Therefore,

\Pr(F_{r,j,N}^*) = \Pr\left(F_r \cup F_{r-1} \cup \bigcup_{s=1}^{r-j} F_s\right) - \Pr(F_{r-1}) - \Pr\left(\bigcup_{s=1}^{r-j} F_s\right) + \Pr\left(F_{r-1} \cap \bigcup_{s=1}^{r-j} F_s\right)
= \Pr\left(F_r \cup F_{r-1} \cup \bigcup_{s=1}^{r-j} F_s\right) - \Pr\left(F_{r-1} \cup \bigcup_{s=1}^{r-j} F_s\right).   (2.12)
The multiple union terms can be evaluated using the identity (Grimmett & Stirzaker, 1982)

\Pr\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i} \Pr(A_i) - \sum_{i_1 < i_2} \Pr(A_{i_1} \cap A_{i_2}) + \sum_{i_1 < i_2 < i_3} \Pr(A_{i_1} \cap A_{i_2} \cap A_{i_3}) - \cdots + (-1)^{n+1} \Pr(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_n}).   (2.13)
Thus,

\Pr(F_{r,j,N}^*) = \Pr(F_r) - \Pr(F_r \cap F_{r-1}) - \Pr(F_r) \sum_{s=1}^{r-j} \Pr(F_s^*) + \Pr\left(F_r \cap F_{r-1} \cap \bigcup_{s=1}^{r-j} F_s\right).   (2.14)
For relatively small values of n and N (<20), equation 2.11 can be evaluated exactly making use of the construction of disjoint events. For example, for the pairs-intersection terms,

\Pr(F_{r_1} \cap F_{r_2}) = \begin{cases} p_{j,N}^2 & r_1 - r_2 \ge j \\ p_{r_1 - r_2 + j,\,N} & r_1 - r_2 < j. \end{cases}

That is, if there is a large gap (greater than or equal to j) between the runs' starting points, then the probability of a sequence of j or more is just squared. Alternatively, if the gap is small (less than j), then we compute the probability of a longer sequence of increasing elements. For larger values of n and N, the number of loops to be evaluated when expanding the union
terms becomes prohibitive, and it is more efficient to work with upper and lower bounds.

2.3 Upper Bound. An upper bound can be computed using

\Pr(F_r^*) = \Pr[(F_r \setminus F_{r-1}) \setminus (F_1 \cup F_2 \cup \cdots \cup F_{r-j})]
\le \Pr[(F_r \setminus F_{r-1}) \setminus (F_1 \cup F_2 \cup \cdots \cup F_{r-j-1})]
= \Pr(F_r \setminus F_{r-1}) - \Pr\left(F_r \cap \bigcup_{s=1}^{r-j-1} F_s\right) + \Pr\left(F_r \cap F_{r-1} \cap \bigcup_{s=1}^{r-j-1} F_s\right)
\le (p_{j,N} - p_{j+1,N})\left(1 - \sum_{s=1}^{r-j-1} \Pr(F_s^*)\right).   (2.15)
Intuitively, this upper bound should be close to the true probability because only one event has been neglected. This is the event that an increasing sequence of length j or more starting at r − j is part of a sequence of length j or more that also increases from r onwards.

2.4 Lower Bound. For the lower bound, we use Boole's inequality (\Pr(\bigcup_i A_i) \le \sum_i \Pr(A_i)) and thus

\Pr(F_r^*) = \Pr[(F_r \setminus F_{r-1}) \setminus (F_1 \cup F_2 \cup \cdots \cup F_{r-j})]
\ge \Pr(F_r \setminus F_{r-1}) - \sum_{s=1}^{r-j} \Pr(F_r \cap F_s)
= \begin{cases} p_{j,N} - p_{j+1,N} - (r-j)\,p_{j,N}^2 & \text{if positive} \\ 0 & \text{otherwise.} \end{cases}   (2.16)
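The recursions in equations 2.15 and 2.16 lend themselves to a direct implementation. A minimal Python sketch (the authors' software is in Matlab; the function names here are my own, and this reconstruction is checked only against the published Table 1 values):

```python
from math import comb

def p(j, N):
    # p_{j,N}: probability that j draws with replacement from 1..N
    # are strictly increasing (equation 2.1)
    return comb(N, j) / N**j

def H_bounds(j, n, N):
    """Lower and upper bounds on H_j via equations 2.16 and 2.15."""
    d = p(j, N) - p(j + 1, N)           # Pr(F_r \ F_{r-1})
    upper, lower = [], []
    for r in range(1, n - j + 2):
        if r == 1:
            u = lo = p(j, N)            # first case of equation 2.4
        elif r < j + 1:
            u = lo = d                  # middle case of equation 2.4: exact
        else:
            u = d * (1 - sum(upper[: r - j - 1]))        # equation 2.15
            lo = max(d - (r - j) * p(j, N) ** 2, 0.0)    # equation 2.16
        upper.append(u)
        lower.append(lo)
    return sum(lower), sum(upper)
```

For j = 4, N = 8, n = 8 this returns bounds that round to [0.0783, 0.0786], matching Table 1; for n < 2j the two bounds coincide with the exact value of equation 2.8.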
Because of the recursive structure of equations 2.15 and 2.16, the upper and lower bounds can be computed in seconds using Matlab (Mathworks, Natick, MA). (The software to perform these calculations is available online at www.ucdmc.ucdavis.edu/anesthesiology/staff/asmith.html.) In the next section we illustrate typical values for various parameter combinations.

2.5 Combinatorial/Shuffled Data Method. Our technique is most similar to the combinatorial technique developed by Lee and Wilson (2002, 2004) and used in the analysis of hippocampal place cell firing during non-REM sleep. Using the same definitions of reference sequence and word, their technique computes the probability that shuffled versions of the observed words contain a match or better to the reference sequence. As with the above
approach, if this probability is low, the experimenter might conclude that the order of the original reference sequence has been preserved. Their definition of a match or better is more complicated than that used in our set probability technique and allows interruptions in the increasing sequence. In particular, they define a word as containing an (x, y) match if there are x + y consecutive letters in the word of which at least x are strictly increasing. This means there are possibly as many as y interruptions in the increasing sequence. In addition, they define the parameter k as the number of distinct letters in the observed word. They suggest computing the required match probability either by a sequence-shuffling technique, which can be computationally intensive, or by an approximate technique. To use the approximate technique, it is necessary to rank the matches based on a (subjective) decision about the acceptable balance between the number of interruptions and the length of the increasing sequence. In the approximate match ranking case, it is possible to derive algorithms to compute upper and lower bounds for the probability of the (x, y) matches. Across many trials, the final output of their method is a match-trial ratio with corresponding Z-score relative to the match-trial ratio expected by chance (1/j! in our notation). The main differences between our technique and the Lee and Wilson technique are that we do not allow sequence interruptions and that we assume the observed word is chosen independently and with replacement from the reference sequence, rather than basing the probabilities on the shuffled observed sequence. For comparison with our method, we have computed exact formulas for two of the match probabilities we would obtain using the Lee and Wilson technique in the special case where each letter is observed only once within the word and the reference sequence length equals the word length (n = N = k). We call this probability P_shuffled(n, j).
The probability of shuffling a sequence of length j and getting all j elements strictly increasing is

P_shuffled(j, j) = 1/j!,   (2.17)

and, for a run one shorter than the word length (j = n − 1),

P_shuffled(j, j − 1) = (2j − 1)/j!.   (2.18)
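Equations 2.17 and 2.18 can be verified by enumerating all permutations of a short word. A sketch in Python (function names are my own):

```python
from itertools import permutations
from math import factorial

def shuffled_match_prob(n, j):
    """Fraction of the n! orderings of 1..n that contain a strictly
    increasing run of length j or more (brute force)."""
    def longest_run(w):
        best = cur = 1
        for a, b in zip(w, w[1:]):
            cur = cur + 1 if b > a else 1
            best = max(best, cur)
        return best
    hits = sum(1 for w in permutations(range(1, n + 1)) if longest_run(w) >= j)
    return hits / factorial(n)
```

For n = j = 4 this gives 1/4! (only the identity ordering qualifies), and for n = 4, j = 3 it gives (2·4 − 1)/4! = 7/24, in agreement with equations 2.17 and 2.18.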
We compare the two techniques in the next section.

3 Results

We illustrate our approach by considering place cells in rat hippocampal area CA1 and how to assess whether their temporal order is preserved during sleep. First, we compare our method with some examples from the more elaborate method of Lee and Wilson (2004). Second, we discuss two theoretical
scenarios from non-REM and REM sleep using parameter values consistent with experimental measurements (Lee & Wilson, 2002; Nádasdy et al., 1999; Kudrimoti et al., 1999). By ordering the place cells in behavior from 1 to N, we compute either the exact probability or bounds for the probability of j or more consecutive increasing elements in any word of length n using equations 2.8, 2.15, and 2.16.

3.1 Comparison with Combinatorial/Shuffled Data Approach. At first inspection, one might assume that, because our technique samples with replacement, our probabilities would always be lower than those computed using the shuffle technique of Lee and Wilson (2002, 2004). This is true for some parameter combinations. It is true, for example, for the cases derived in equations 2.17 and 2.18, as can be shown by comparing them with equation 2.8. However, for other situations, this is not the case. Consider the example in Lee and Wilson (2004) where the reference sequence is (1,2,3,4,5,6,7,8,9) and the observed word is (5,1,4,6,9,7,8,4). They report the best match using their (x, y) matching procedure to be (5, 1) (based on the run of numbers 1 4 6 9 7 8) and report the probability of this match or better to be 0.0580 based on exact computations, or bounded by 0.0195 and 0.1038. Using equations 2.15 and 2.16, our technique asks what the likelihood is of finding a run of four strictly increasing elements (here 1 4 6 9) in eight numbers from a reference sequence of length 9. We estimate this event to be more likely, lying between 0.0871 and 0.0875. With such a short sequence (nine observations) and a reasonable spread of values (4 appears twice, and 2 and 3 do not appear), it appears impossible to determine statistically whether these numbers have been chosen with or without replacement from the reference sequence.
Given this uncertainty and the simplicity of the current technique’s hypothesis, this larger probability estimated by the set probability technique will be less likely to indicate a significant sequence replay. It is also instructive to consider what happens to the computed match probabilities in cases where the reference sequence is longer. Consider, for example, that the same word (5,1,4,6,9,7,8,4) is observed, but now the reference sequence is composed of numbers 1 through 20. In the Lee and Wilson formulation, the computed match or better probability will be unaltered as the calculations rely on shuffling the observed word. Using our set probability method with N = 20, we would estimate the probability of four or more strictly increasing numbers as even larger and lying in the interval [0.1311, 0.1320]. 3.2 Example: Non-REM (SWS) Sleep. In slow-wave (non-REM) sleep, it is hypothesized that compressed encoding of behavioral sequences may occur during sharp wave/ripple events (Buzs´aki, 1989; Skaggs & McNaughton, 1996; August & Levy, 1999; N´adasdy et al., 1999; Lee & Wilson, 2002). These sharp/wave ripple events occur approximately once
per second and last for approximately 100 msec. During the event, the firing rate increases about sevenfold (Csicsvari, Hirase, Czurkó, Mamiya, & Buzsáki, 1999). We assume for our theoretical examples that the data have been preprocessed and parsed into words using the techniques outlined by previous authors (Nádasdy et al., 1999; Lee & Wilson, 2002). For example, one could assume that complex spike bursts (i.e., spike bursts with interspike intervals of less than, say, 6 msec) could be represented by a single spike occurring at the time of the first spike of the burst (Ranck, 1973; Nádasdy et al., 1999). As a first application, we consider estimating the probability of sequence repeats within short SWS/ripple events. We call each SWS/ripple event a trial. To be consistent with experiments, we choose parameters as follows: j is 4 or 5, N is 8 or 9, and n ranges from 5 to 10 cell firings. Our computed probabilities, either exact or the upper or lower bounds, are shown in Table 1. Note that in the low-probability regime and with these choices of parameters, the upper and lower bounds computed using equations 2.15 and 2.16 are very close to one another. For fixed N, as the word length increases, we find the probability of finding a j or more length strictly
Table 1: Tabulation of Probability of j or More Consecutive Increasing Numbers Chosen from N in a Word of Length n.

  j   N   n    H_j = Pr(j or more strictly     E (number of matches
               increasing numbers)             in 300 trials)
  4   8   5    0.0325                          9
          6    0.0479                          14
          7    0.0632                          18
          8    [0.0783, 0.0786]                [23, 23]
          9    [0.0931, 0.0937]                [27, 28]
          10   [0.1076, 0.1086]                [32, 32]
  4   9   5    0.0363                          10
          6    0.0533                          16
          7    0.0704                          21
          8    [0.0871, 0.0875]                [26, 26]
          9    [0.1035, 0.1042]                [31, 31]
          10   [0.1194, 0.1207]                [35, 36]
  5   9   5    0.0021                          0
          6    0.0041                          1
          7    0.0061                          1
          8    0.0081                          2
          9    0.0100                          3
          10   [0.0120, 0.0120]                [3, 3]

Notes: When a single number is shown, the value is exact (to four decimal places) and is computed using equation 2.8. When two numbers are shown, the values are lower and upper bounds computed using equations 2.16 and 2.15. The last column shows the approximate expected number of times one might find j or more strictly increasing numbers in 300 Bernoulli trials (300 × H_j).
increasing sequence moves from quite unlikely (p < 0.05) to quite likely. This is particularly noticeable in the j = 4 case. For example, if N = 9 and n > 5, then the chance of observing a run of length j = 4 or more is greater than 0.05. On average, approximately one SWS/ripple event occurs every second. We consider the case of 300 such events, corresponding to observations over approximately 300 sec (5 min) of recording and consistent with the number of longer sequences (n ≥ 4) observed by Lee and Wilson (2002). Assuming each event is a Bernoulli trial, we compute in Table 1 the number of trials with j or more strictly increasing numbers expected to occur by chance in 300 trials. For j = 4 and as n increases, we find the expected number of trials increases from as low as 9 in 300 to as high as 36 in 300. Note that for fixed n, there is not a big difference between the expected numbers of trials when N = 8 and N = 9 for the two j = 4 cases. However, by increasing j from 4 to 5, the expected number of trials drops rapidly, and even in a word of length n = 10, we might expect to observe only 3 words in 300. A second approach would be to concatenate all words together and look again for sequences of length j or more within the (much longer) word. By doing this, we treat the entire sleep episode as a time series. Over 50 minutes of sleep, one may expect to observe as many as 50 × 60 trials, each of which may contain a sequence with significant temporal ordering. In this case, we consider parameter values of N = 10 and N = 20 and vary n from 5 to 300 for various values of j. If the computed probability of j or more strictly increasing is still lower than p-values of 0.05 or 0.01, then this provides evidence that a statistically significant event is occurring. As expected, as sequence length j increases, the probability of observing a sequence of that length decreases (see Figure 1).
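The expected counts in the last column of Table 1 are 300 × H_j. A short check in Python using the closed form of equation 2.8, which applies here since n < 2j (the function name is my own; the table values appear to truncate rather than round):

```python
from math import comb

def H_exact(j, n, N):
    # equation 2.8 (valid for n < 2j), via p_{j,N} = C(N, j) / N**j
    p = lambda k: comb(N, k) / N**k
    return p(j) + (n - j) * (p(j) - p(j + 1))

# expected number of matches in 300 SWS/ripple trials, j = 4, N = 8
expected = [int(300 * H_exact(4, n, 8)) for n in range(5, 8)]  # n = 5, 6, 7
```

This reproduces the entries 9, 14, 18 of Table 1.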
As the length of the word (n) increases, this probability increases. In the case of N = 10 and n = 3000 (see Figure 1A), an observation of a single sequence of length 7 or more strictly increasing numbers is sufficient to conclude that the temporal order of the place cells has been preserved. Observation of a single sequence of six or more strictly increasing numbers would not be sufficient to indicate replay. For the larger N = 20 case (see Figure 1B), we require eight or more strictly increasing numbers to determine statistical significance. 3.3 Example: REM Sleep. In REM sleep, there is evidence of replay similar to the speed of the place cell firing during behavior (Louie & Wilson, 2002). In this case, for our theoretical study, we take 5 < n < 70. This is consistent with up to seven words with N = 10 cells and is about the maximum number of replays that might fill a 5 minute episode of REM. In this case we show probability bounds for two values of N (10 and 20) for various choices of j (see Figure 2). Interestingly, for both N = 10 (see Figure 2A) and N = 20 (see Figure 2B), an observation of a strictly increasing sequence of six or more is statistically
Figure 1: Probability bounds relevant for the analysis of sequences in long experiments. We show the upper and lower bounds of probability of j or more strictly increasing numbers as n increases for four different values of j and two reference sequence sizes (panel A, N = 10; panel B, N = 20). The region between the upper and lower bounds is shaded gray. As j increases, the difference between the upper and lower bounds becomes indistinguishable. The horizontal dashed lines indicate the standard probability cutoffs of 0.01 and 0.05. For the largest j value of 8, the probability curves lie below 0.05 for both reference sequence lengths and all values of n considered here.
Figure 2: Probability bounds relevant for the analysis of sequences in short experiments. Shown are upper and lower bounds of probability of j or more strictly increasing numbers as n increases for five different values of j and two reference sequence sizes (panel A, N = 10; panel B, N = 20). The region between the upper and lower bounds is shaded gray. As j increases, the difference between the upper and lower bounds becomes indistinguishable. The black horizontal dashed lines indicate the standard probability cutoffs of 0.01 and 0.05. For the largest shown j value of 8, the probability curves lie below 0.01 for both reference sequence lengths and all values of n considered here.
significant enough to indicate replay at the p < 0.05 level. As j increases, this level of significance increases considerably. In contrast, if there are 20 place cells (N = 20), an observation of only 5 strictly increasing numbers in a word of length 20 might easily occur by chance (see Figure 2B).

4 Discussion

In this letter, we outline a set probability technique for analyzing spatiotemporal order across neurons. We have derived formulas for the computation of bounds for the probability of a sequence of j strictly increasing elements in a word of length n, chosen from a reference sequence of length N. Our derivation makes use of techniques from elementary probability theory, and calculations can be performed in the order of seconds. While similar problems have been studied in the context of reliability (Wolfowitz, 1944) and in combinatorics (Schensted, 1960), as far as we are aware, the results presented here have not been reported previously. Some of the inferences about repeating spatiotemporal patterns in neural data have been questioned on statistical grounds (see the points made by Baker & Lemon, 2000; Oram, Wiener, Lestienne, & Richmond, 1999; Moore, Rosenberg, Hary, & Breeze, 1996; Vertes, 2004). Our technique has not been applied to experimental data and therefore does not either prove or disprove any existing experimental conclusions. However, it is interesting to note, for example, that previous analyses of sequences of place cells' order within SWS/ripple events have tended to focus on short words (e.g., j values less than 6). Depending on the parameter space being explored, our analysis indicates that observations of runs of this size are possible by chance alone. In contrast, a single strictly increasing sequence of length 8 is highly unlikely even within a word of length 3000 (see Figure 1) and should be enough for the experimenter to conclude that replay has occurred based on one single observation.

4.1 Comment on Choice of Null Hypothesis.
Because we make the assumption that elements from the reference sequence are picked with replacement, our probability estimates will differ from those of a permutation technique. Choosing between with-replacement and without-replacement techniques is analogous to deciding between a bootstrap technique and a randomization technique, respectively. Our assumption that the word is composed of numbers chosen with replacement was made because it is general and allows analytic tractability in solving the problem. Both combinatorial (Lee & Wilson, 2002, 2004) and template-matching techniques (Abeles & Gerstein, 1988) make use of permutations, or shuffling, to compute the probability of a match. Shuffling techniques are useful for surrogate spike data as they allow the experimenter to preserve firing rate or theta phase relations within the data set (Nádasdy et al., 1999; Gerstein, 2004). This is not so critical in this sequence problem as we assume
the spike data have already been parsed; that is, we assume the data have been smoothed in order that the place cells' firing order during behavior can be represented by the numbers 1 through N. A similar smoothing process is carried out on the spikes during sleep, though potentially on a different timescale.

4.2 Advantages of This Approach. An advantage of this approach over shuffling and combinatorial techniques (Nádasdy et al., 1999; Baker & Lemon, 2000; Lee & Wilson, 2004) is that it is straightforward to run through various parameter combinations very quickly. In addition, the technique yields analytic forms for upper and lower bounds, and therefore its accuracy does not depend on the number of shuffles or simulations carried out. Our approach is most similar to the recent combinatorial approach of Lee and Wilson (2002, 2004). They take an observed word, identify the best match within that word, and compute how likely permutations of the sequence would be to contain that match or better within the word. The final output of their method is a match-trial ratio with corresponding Z-score relative to the chance match-trial ratio (1/j! in our notation). Most importance was assigned in their approach to the low-probability trials involving words of length 4 or more. An advantage of our approach is that both the hypothesis and the calculation of probabilities are considerably simpler. One of the recommendations from our current analysis would be that even if a combinatorial method is employed later, a more practical approach to identification of statistically significant repeats would be to search for sequences of length 5 or more, rather than 4 or more, since sequences of 4 or more appear frequently by chance alone (see Table 1).

4.3 Limitations of This Approach. Currently this technique computes the probability of at least one sequence of length j or more within a longer word.
If we were to observe two sequences of length j or more within the word, corresponding in some cases to the gaps that are used in Lee and Wilson (2004), our current technique would yield a larger estimate of the probability of the event’s occurring by chance than necessary. The probability derivation in this case becomes much more difficult (and we leave this for a later publication). 4.4 Future Approaches. As with analysis of behavioral data (Smith et al., 2004), methods from time-series analysis and signal processing may yield more helpful results than techniques based entirely on probability. This is because neurons are inherently noisy and because our current model when applied to rat hippocampal data ignores an important source of data, namely the rat’s position, which is available during the awake portion of rat hippocampal experiments. A model that specifically takes into account the
A. Smith and P. Smith
stochastic nature of spike firing and positional information should be better able to determine whether replay of sequences is significant. For example, one area where our method fails is that it does not take into account the extent of place field overlap between cells. If two cells have a large overlap, then it makes sense that it is harder to say which cell fires before the other. This ambiguity should also be taken into account when looking for replay of sequences. Statistical models have been used to describe rat behavior during awake exploration. While they have been applied very successfully in this area (Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Brown, Frank, Tang, Quirk, & Wilson, 1998), they have not yet been applied to the noisier problem of decoding sleep.

Acknowledgments

This work was supported by the Department of Anesthesiology and Pain Medicine, UC Davis, and NIH grant MH071847.

References

Abeles, M., & Gerstein, G. L. (1988). Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. J. Neurophysiol., 60, 909–924.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiol., 61, 900–917.
Albert, M. H., Golynski, A., Hamel, A. M., Lopez-Ortiz, A., Rao, S. S., & Safari, M. A. (2004). Longest increasing subsequences in sliding windows. Theoret. Computer Science, 321, 405–414.
August, D. A., & Levy, W. B. (1999). Temporal sequence compression by an integrate and fire model of hippocampal area CA3. J. Comput. Neurosci., 6, 71–90.
Baker, S. N., & Lemon, R. N. (2000). Precise spatiotemporal repeating patterns in monkey primary and supplementary motor areas occur at chance levels. J. Neurophysiol., 84, 1770–1780.
Bespamyatnikh, S., & Segal, M. (2000). Enumerating longest increasing subsequences and patience sorting. Information Proc. Letters, 76, 7–11.
Brown, E. N., Frank, L. M., Tang, D., Quirk, M. C., & Wilson, M. A. (1998). A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. J. Neurosci., 18, 7411–7425.
Brown, E. N., Kass, R. E., & Mitra, P. P. (2004). Multiple neural spike train data analysis: State-of-the-art and future challenges. Nature Neurosci., 7(5), 456–461.
Buzsáki, G. (1989). Two-stage model of memory trace formation: A role for noisy brain states. Neuroscience, 31, 551–570.
Buzsáki, G. (2004). Large-scale recording of neuronal ensembles. Nature Neurosci., 7(5), 446–451.
Chryssaphinou, O., & Vaggelatou, E. (2001). Compound Poisson approximation for long increasing sequences. J. Appl. Prob., 38, 449–463.
Probability Technique for Time Order Across Neurons
Csicsvari, J., Hirase, H., Czurkó, A., Mamiya, A., & Buzsáki, G. (1999). Oscillatory coupling of hippocampal pyramidal cells and interneurons in the behaving rat. J. Neurosci., 19, 274–287.
Dave, A. S., & Margoliash, D. (2000). Song replay during sleep and computational rules for sensorimotor vocal learning. Science, 290, 812–816.
Gerstein, G. L. (2004). Searching for significance in spatio-temporal firing patterns. Acta Neurobiol. Exp., 64, 203–207.
Gerstein, G. L., & Aertsen, A. M. H. J. (1985). Representation of cooperative firing activity among simultaneously recorded neurons. J. Neurophysiol., 54, 1513–1528.
Grimmett, G. R., & Stirzaker, D. R. (1982). Probability and random processes. Oxford: Clarendon Press.
Grün, S., Diesmann, M., & Aertsen, A. (2001). Unitary events in multiple single-neuron spiking activity. I. Detection and significance. Neural Comp., 14, 43–80.
Hoffman, K. L., & McNaughton, B. L. (2002). Coordinated reactivation of distributed memory traces in primate neocortex. Science, 297(5589), 2070–2073.
Knuth, D. E. (1970). Permutations, matrices and generalized Young tableaux. Pacific J. Math., 34(3), 709–727.
Kudrimoti, H. S., Barnes, C. A., & McNaughton, B. L. (1999). Reactivation of hippocampal cell assemblies: Effects of behavioral state, experience, and EEG dynamics. J. Neurosci., 19, 4090–4101.
Lee, A. K., & Wilson, M. A. (2002). Memory of sequential experience in the hippocampus during slow wave sleep. Neuron, 36, 1183–1194.
Lee, A. K., & Wilson, M. A. (2004). A combinatorial method for analyzing sequential firing patterns involving an arbitrary number of neurons based on relative time order. J. Neurophys., 92, 2555–2573.
Louie, K., & Wilson, M. A. (2001). Temporally structured replay of awake hippocampal ensemble activity during rapid eye movement sleep. Neuron, 29, 145–156.
Moore, G. P., Rosenberg, J. R., Hary, D., & Breeze, P. (1996). "Replay" of hippocampal "memories." Science, 274(5290), 1216.
Nádasdy, Z., Hirase, H., Czurkó, A., Csicsvari, J., & Buzsáki, G. (1999). Replay and time compression of recurring spike sequences in the hippocampus. J. Neurosci., 19, 9497–9507.
O'Keefe, J., & Dostrovsky, J. (1971). The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely-moving rat. Brain Res., 34, 171–175.
Oram, M. W., Wiener, M. C., Lestienne, R., & Richmond, B. J. (1999). Stochastic nature of precisely timed spike patterns in visual system neuronal responses. J. Neurophysiol., 81, 3021–3033.
Pavlides, C., & Winson, J. (1989). Influences of hippocampal place cell firing in the awake state on the activity of these cells during subsequent sleep episodes. J. Neurosci., 9, 2907–2918.
Ranck, J. B., Jr. (1973). Studies on single neurons in dorsal hippocampal formation and septum in unrestrained rats. I. Behavioral correlates and firing repertoires. Exp. Neurol., 41, 461–531.
Schensted, C. (1961). Longest increasing and decreasing sequences. Canad. J. Math., 13, 179–191.
Skaggs, W. E., & McNaughton, B. L. (1996). Replay of neuronal firing sequences in rat hippocampus during sleep following spatial experience. Science, 271, 1870–1873.
Smith, A. C., Frank, L. M., Wirth, S., Yanike, M., Hu, D., Kubota, Y., Graybiel, A. M., Suzuki, W., & Brown, E. N. (2004). Dynamic analysis of learning in behavioral experiments. J. Neurosci., 24(2), 447–461.
Vertes, R. P. (2004). Memory consolidation in sleep: Dream or reality? Neuron, 44(1), 135–148.
Wilson, M. A., & McNaughton, B. L. (1994). Reactivation of hippocampal ensemble memories during sleep. Science, 265, 676–679.
Wolfowitz, J. (1944). Asymptotic distribution of runs up and down. Ann. Math. Statist., 15, 163–172.
Zhang, K. C., Ginzburg, I., McNaughton, B. L., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. J. Neurophys., 79(2), 1017–1044.
Received May 20, 2005; accepted September 9, 2005.
LETTER
Communicated by Steven Zucker
Payoff-Monotonic Game Dynamics and the Maximum Clique Problem

Marcello Pelillo
[email protected]
Andrea Torsello
[email protected]

Dipartimento di Informatica, Università Ca' Foscari di Venezia, 30172 Venezia Mestre, Italy
Evolutionary game-theoretic models and, in particular, the so-called replicator equations have recently proven to be remarkably effective at approximately solving the maximum clique and related problems. The approach is centered around a classic result from graph theory that formulates the maximum clique problem as a standard (continuous) quadratic program and exploits the dynamical properties of these models, which, under a certain symmetry assumption, possess a Lyapunov function. In this letter, we generalize previous work along these lines in several respects. We introduce a wide family of game-dynamic equations known as payoff-monotonic dynamics, of which replicator dynamics are a special instance, and show that they enjoy precisely the same dynamical properties as standard replicator equations. These properties make any member of this family a potential heuristic for solving standard quadratic programs and, in particular, the maximum clique problem. Extensive simulations, performed on random as well as DIMACS benchmark graphs, show that this class contains dynamics that are considerably faster than and at least as accurate as replicator equations. One problem associated with these models, however, relates to their inability to escape from poor local solutions. To overcome this drawback, we focus on a particular subclass of payoff-monotonic dynamics used to model the evolution of behavior via imitation processes and study the stability of their equilibria when a regularization parameter is allowed to take on negative values. A detailed analysis of these properties suggests a whole class of annealed imitation heuristics for the maximum clique problem, which are based on the idea of varying the parameter during the imitation optimization process in a principled way, so as to avoid unwanted inefficient solutions.
Experiments show that the proposed annealing procedure does help to avoid poor local optima by initially driving the dynamics toward promising regions in state space. Furthermore, the models outperform state-of-the-art neural network algorithms for maximum clique, such as mean field annealing, and compare well with powerful continuous-based heuristics. Neural Computation 18, 1215–1258 (2006)
© 2006 Massachusetts Institute of Technology
1 Introduction

Research in computational complexity has shown that many problems of practical interest are inherently intractable, in the sense that it is not possible to find fast (i.e., polynomial time) algorithms that solve them exactly unless the classes P and NP coincide, which is widely believed to be false. In some cases, we can indeed find good approximate solutions in polynomial time (Papadimitriou & Steiglitz, 1982), but unfortunately, it turns out that certain important problems remain intractable even to approximate. This is the case with the maximum clique problem (MCP), a classic problem in combinatorial optimization that asks for the largest complete subgraph of a given graph. Indeed, the best polynomial-time approximation algorithm for the MCP achieves an approximation ratio of only n^(1−o(1)) (Boppana & Halldórsson, 1992), where n is the number of vertices in the graph, and Håstad (1996) has shown that this is essentially the best we can achieve, by proving that unless NP = co-RP, the MCP cannot be approximated within a factor of n^(1−ε) for any ε > 0. Although this complexity result characterizes worst-case instances, it nevertheless indicates that the MCP is indeed a very difficult problem to solve.¹

Due to this pessimistic state of affairs, and because of its important applications in domains as diverse as computer vision, experimental design, information retrieval, and fault tolerance, much attention has recently gone into developing efficient heuristics for the MCP, for which no formal performance guarantees can be provided but which are nevertheless useful in practical applications. We refer to Bomze, Budinich, Pardalos, and Pelillo (1999) for a survey concerning algorithms, applications, and complexity issues of this important problem. In the neural network community, there has been much interest around the maximum clique problem.
Early attempts at encoding this and related problems in terms of a neural network were made in the late 1980s by Ballard, Gardner, and Srinivas (1987), Godbeer, Lipscomb, and Luby (1988), Ramanujam and Sadayappan (1988), Aarts and Korst (1989), and Shrivastava, Dasgupta, and Reddy (1990; see also Shrivastava, Dasgupta, & Reddy, 1992). However, few or no experimental results were presented, making it difficult to evaluate the merits of these algorithms. Lin and Lee (1992) used a quadratic zero-one formulation from Pardalos and Rodgers (1990) as the basis for their neural network heuristic. For an n-vertex graph, they used a network with 2(n + 1) computational nodes and real-valued connection weights. Grossman (1996) proposed a discrete, deterministic version of the Hopfield model for maximum clique, originally designed for an all-optical implementation. The model has a threshold parameter that determines the
¹ See Grötschel, Lovász, & Schrijver (1993) for classes of graphs for which the MCP can be solved in polynomial time.
character of the stable states of the network. The author suggests an annealing strategy on this parameter and an adaptive procedure to choose the network's initial state and threshold. On DIMACS graphs, the algorithm performs satisfactorily but does not compare well with more powerful heuristics such as simulated annealing. Jagota (1995) developed several variations of the Hopfield model, both discrete and continuous, to approximate maximum clique. He evaluated the performance of his algorithms over randomly generated graphs as well as on harder graphs obtained by generating cliques of varying size at random and taking their union. Experiments on graphs coming from the Solomonoff-Levin, or "universal," distribution are also presented in Jagota and Regan (1997). The best results were obtained using a stochastic steepest-descent dynamics and a mean field annealing algorithm, an efficient, deterministic approximation of simulated annealing. These algorithms, however, were also the slowest, and this motivated Jagota, Sanchis, and Ganesan (1996) to improve their running time. The mean field annealing heuristic was implemented on a 32-processor Connection Machine, and a two-temperature annealing strategy was used. Additionally, a reinforcement learning strategy was developed for the stochastic steepest-descent heuristic, to automatically adjust its internal parameters as the process evolves. On various benchmark graphs, all their algorithms obtained significantly larger cliques than other simpler heuristics but ran slightly slower. Compared to more sophisticated heuristics, they obtained significantly smaller cliques on average but were considerably faster.
Other attempts at solving the maximum clique problem using Hopfield-style neural networks can be found in Takefuji, Chen, Lee, and Huffman (1990), Funabiki, Takefuji, and Lee (1992), Wu, Harada, and Fukao (1994), Bertoni, Campadelli, and Grossi (2002), Pekergin, Mörgül, and Güzeliş (1999), Jagota, Pelillo, and Rangarajan (2000), and Wang, Tang, and Cao (2003). Almost invariably, all these works formulate the MCP in terms of an integer (usually 0-1) programming problem and use some variant of the Hopfield model to solve it. In a recent series of papers (Pelillo, 1995, 1999, 2002; Bomze, 1997; Bomze, Pelillo, & Giacomini, 1997; Bomze, Pelillo, & Stix, 2000; Jagota et al., 2000; Bomze, Budinich, Pelillo, & Rossi, 2002; Pelillo, Siddiqi, & Zucker, 1999), a completely different framework has been developed. The approach is centered around a classic result from graph theory due to Motzkin and Straus (1965), and variations thereof, which allows us to formulate the MCP as a standard quadratic program, namely, a continuous quadratic optimization problem with simplex (or probability) constraints, which replicator equations have proven remarkably effective at solving despite their simplicity. These are well-known continuous- and discrete-time dynamical systems developed and studied in evolutionary game theory, a discipline pioneered by J. Maynard Smith (1982) that aims to model the evolution of animal behavior using the principles and tools of noncooperative game theory
(Hofbauer & Sigmund, 1998). Evolutionary game-theoretic models are also gaining increasing popularity in economics, since they elegantly get rid of the much-debated assumptions of traditional game theory concerning the full rationality and complete knowledge of players (Weibull, 1995; Samuelson, 1997; Fudenberg & Levine, 1998). Interestingly, these dynamical equations also turn out to be related to the so-called relaxation labeling processes, a class of parallel, distributed algorithms developed in computer vision to solve (continuous) constraint satisfaction problems (Rosenfeld, Hummel, & Zucker, 1976; Hummel & Zucker, 1983). An independent connection between dynamical systems such as relaxation labeling and Hopfield-style networks, and game theory has been described by Miller and Zucker (1992, 1999).

This letter substantially expands on previous work along these lines. We introduce a wide family of evolutionary game dynamics, of which replicator equations represent just a special instance, characterized by having the growth rates of strategies ordered by their expected payoffs, so that strategies associated with higher payoffs grow faster. It is shown that these payoff-monotonic models enjoy precisely the same dynamical properties as standard, first-order replicator equations. In particular, when the payoff matrix is symmetric, they possess a quadratic Lyapunov function, which is strictly increasing along any nonconstant trajectory; furthermore, it is shown that their asymptotically stable stationary points are in one-to-one correspondence with (strict) local solutions of standard quadratic programs. We then specialize our discussion to a parameterized family of such quadratic problems arising from the MCP, which includes the Motzkin-Straus formulation as a special case, and show that a one-to-one correspondence exists between its local/global solutions and the maximal/maximum cliques of the corresponding graph, provided that its parameter is positive (and less than 1).
These properties therefore make any member of the payoff-monotonic family a potential heuristic for the MCP. In particular, we present extensive experimental results obtained with an exponential version of the standard replicator dynamics over hundreds of random as well as DIMACS benchmark graphs, and show that these dynamics are dramatically faster than their first-order counterpart and even more accurate. They also compare favorably with other simple neural network heuristics and obtain only slightly worse results than more sophisticated ones, such as mean field annealing. As with standard replicator equations, however, these models are inherently unable to escape from inefficient local solutions or, in other words, from small maximal cliques. Although this is not necessarily a problem when dealing with graphs arising from graph isomorphism or maximum common subtree problems (Pelillo, 1999, 2002; Pelillo et al., 1999), it makes them unsuited for harder problem instances. In an attempt to overcome this drawback, in the second part of the article, we focus on a well-known subclass of payoff-monotonic dynamics that arises in modeling the evolution of
behavior by way of imitation processes. Following Bomze et al. (2002), we investigate the properties of our parameterized Motzkin-Straus program when its parameter is allowed to take on negative values. In this case, an interesting picture emerges: as its absolute value grows larger, local maximizers corresponding to maximal cliques disappear, that is, they become unstable under any imitation dynamics. We derive bounds on the parameter that affects the stability of the solutions, and these results, which generalize those presented in Bomze et al. (2002), suggest a whole family of annealed imitation heuristics, which consist of starting from a large negative value of the parameter and then properly reducing it during the optimization process. At each step, imitation dynamics are run in order to obtain a local solution of the corresponding objective function. The rationale behind this idea is that at large absolute values of the annealing parameter, only local solutions corresponding to large maximal cliques will survive, together with various spurious maximizers. As the value of the parameter is reduced, spurious solutions disappear and smaller maximal cliques become stable. A similar idea has been proposed by Gee and Prager (1994) in a different context. Experiments conducted on both random and DIMACS graphs using an exponential imitation dynamics confirm the effectiveness of the proposed approach and the robustness of the annealing strategy. The overall conclusion is that the annealing procedure does help to avoid inefficient local solutions by initially driving the dynamics toward promising regions in state space and then refining the search as the annealing parameter is increased. Moreover, the algorithm outperforms state-of-the-art neural network heuristics for maximum clique, such as mean field annealing, and other heuristics based on Motzkin-Straus and related formulations.
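The Motzkin-Straus connection sketched in this introduction can be illustrated on a toy graph. In the sketch below (the graph, iteration count, and tolerances are made-up for illustration), we use the discrete-time first-order replicator iteration x_i ← x_i (Ax)_i / x′Ax, which is well defined here because the adjacency matrix is nonnegative; by Motzkin-Straus, the maximum of x′Ax over the simplex is 1 − 1/ω(G), and the support of the limit point identifies a clique:

```python
import numpy as np

# Toy graph: a triangle {0, 1, 2} plus a disjoint edge {3, 4}.
n = 5
A = np.zeros((n, n))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4)]:
    A[i, j] = A[j, i] = 1.0

# Discrete-time replicator dynamics on the simplex, started at the
# barycenter: x_i <- x_i (Ax)_i / x'Ax.
x = np.full(n, 1.0 / n)
values = []
for _ in range(500):
    pi = A @ x
    values.append(x @ pi)
    x = x * pi / (x @ pi)

# Motzkin-Straus: max of x'Ax over the simplex equals 1 - 1/omega(G).
value = x @ A @ x
support = {i for i in range(n) if x[i] > 1e-4}
omega = round(1.0 / (1.0 - value))
print(support, omega)        # {0, 1, 2} 3 -- the triangle is recovered

# The objective is nondecreasing along the trajectory, as one expects
# from the Lyapunov property of replicator dynamics on symmetric A.
assert all(b >= a - 1e-12 for a, b in zip(values, values[1:]))
```

Note that on harder graphs the plain Motzkin-Straus program admits spurious solutions; this tiny instance avoids them by construction.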
2 Payoff-Monotonic Game Dynamics

Evolutionary game theory considers an idealized scenario whereby, in a large population, pairs of individuals are repeatedly drawn at random to play a symmetric two-player game. In contrast to traditional game-theoretic models, players are not supposed to behave rationally or have complete knowledge of the details of the game. They act instead according to a preprogrammed behavior pattern, or pure strategy, and it is supposed that some evolutionary selection process operates over time on the distribution of behaviors. (We refer the reader to Hofbauer & Sigmund, 1998, and Weibull, 1995, for excellent introductions to this rapidly expanding field.) Let J = {1, . . . , n} be the set of available pure strategies, and for all i ∈ J, let x_i(t) be the proportion of population members playing strategy i at time t. The state of the population at a given instant is the vector x = (x_1, . . . , x_n)′, where a prime denotes transposition. Clearly, population states are
constrained to lie in the standard simplex Δ of the n-dimensional Euclidean space R^n:

Δ = {x ∈ R^n : x_i ≥ 0 for all i ∈ J, e′x = 1},

where e = Σ_i e_i = (1, . . . , 1)′, and e_i denotes the ith standard basis vector in R^n. The support of a population state x ∈ Δ, denoted by σ(x), is defined as the set of indices corresponding to its positive components, which correspond to nonextinct strategies:

σ(x) = {i ∈ J : x_i > 0}.

Given a subset of strategies S ⊆ J, the set of states where all strategies outside S are extinct, which corresponds to a face of Δ, is defined as

Δ_S = {x ∈ Δ : σ(x) ⊆ S},

and its (relative) interior is int(Δ_S) = {x ∈ Δ : σ(x) = S}. Clearly, Δ_J = Δ, and, accordingly, we shall write int(Δ) instead of int(Δ_J).

Let A = (a_ij) be the n × n payoff or utility matrix. Specifically, for each pair of strategies i, j ∈ J, a_ij represents the payoff of an individual playing strategy i against an opponent playing strategy j. In biological contexts, payoffs are typically measured in terms of Darwinian fitness or reproductive success (i.e., the player's expected number of offspring), whereas in economic applications, they usually represent firms' profits or consumers' utilities. If the population is in state x, the expected payoff earned by an i-strategist is

π_i(x) = Σ_{j=1}^n a_ij x_j = (Ax)_i,        (2.1)

while the mean payoff over the entire population is

π(x) = Σ_{i=1}^n x_i π_i(x) = x′Ax.        (2.2)
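In matrix form, equations 2.1 and 2.2 amount to a single matrix-vector product and a quadratic form. A minimal numerical sketch (the payoff matrix and population state below are arbitrary illustrations, not taken from the text):

```python
import numpy as np

# Arbitrary 2-strategy payoff matrix (illustrative only).
A = np.array([[0.0, 3.0],
              [1.0, 2.0]])
x = np.array([0.25, 0.75])       # population state on the simplex

pi = A @ x                       # equation 2.1: pi_i(x) = (Ax)_i
mean = x @ A @ x                 # equation 2.2: pi(x) = x'Ax

print(pi)      # [2.25 1.75]
print(mean)    # 0.25*2.25 + 0.75*1.75 = 1.875
```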
In evolutionary game theory, the assumption is made that the game is played over and over, generation after generation, and that the action of natural selection will result in the evolution of the fittest strategies. If
successive generations blend into each other, the evolution of behavioral phenotypes can be described by a set of ordinary differential equations. A general class of evolution equations is given by

ẋ_i = x_i g_i(x),        (2.3)

where a dot signifies the derivative with respect to time, and g = (g_1, . . . , g_n) is a function with open domain containing Δ. Here, the function g_i (i ∈ J) specifies the rate at which pure strategy i replicates when the population is in state x. It is usually required that the growth function g is regular (Weibull, 1995), which means that it is C¹ and that g(x) is always orthogonal to x, that is, g(x)′x = 0. The former condition guarantees that the system of differential equations 2.3 has a unique solution through any initial population state.² The condition g(x)′x = 0 instead ensures that the simplex Δ is invariant under equation 2.3; namely, any trajectory starting in Δ will remain in Δ.

A point x is said to be a stationary (or equilibrium) point of our dynamical system if ẋ_i = 0 for all i = 1, . . . , n. A stationary point x is said to be Lyapunov stable (or, more simply, stable) if for any neighborhood U of x there exists a neighborhood W of x such that any trajectory starting in W remains in U (formally, x(0) ∈ W implies x(t) ∈ U for all t ≥ 0). It is said to be asymptotically stable if, in addition, such trajectories converge to x.

Payoff-monotonic game dynamics represent a wide class of regular selection dynamics for which useful properties hold. Intuitively, for a payoff-monotonic dynamics, the strategies associated with higher payoffs will increase at a higher rate. Formally, a regular selection dynamics 2.3 is said to be payoff-monotonic if

g_i(x) > g_j(x) ⇔ π_i(x) > π_j(x)        (2.4)

for all x ∈ Δ. Although this class contains many different dynamics, it turns out that they share a lot of common properties. To begin, they all have the same set of stationary points.

Proposition 1. A point x ∈ Δ is stationary under any payoff-monotonic dynamics if and only if π_i(x) = π(x) for all i ∈ σ(x).

Proof. See Weibull (1995).
² Indeed, to ensure existence and uniqueness of solutions to equation 2.3, it is sufficient that g is (locally) Lipschitz continuous, that is, there exists a constant K such that ‖g(x) − g(y)‖ ≤ K‖x − y‖ for all x, y in any compact subset of the domain of g (see, e.g., Hirsch & Smale, 1974). It is well known that C¹ functions are also locally Lipschitz continuous.
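The regularity and monotonicity conditions just stated are easy to probe numerically. The sketch below (an illustration only; the random symmetric payoff matrix and the exponential transform are made-up choices, not taken from the text) builds two growth functions g, one linear in the payoffs and one an increasing exponential transform of them, and verifies at random simplex points that g(x)′x = 0 and that both order the strategies' growth rates exactly as their expected payoffs, as condition 2.4 requires:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
B = rng.random((n, n))
A = (B + B.T) / 2.0                # symmetric payoff matrix (illustrative)

def g_linear(x):
    """Growth rates of the standard replicator dynamics."""
    pi = A @ x
    return pi - x @ pi

def g_exp(x, kappa=3.0):
    """Growth rates built from an increasing transform of the payoffs."""
    phi = np.exp(kappa * (A @ x))
    return phi - x @ phi

for _ in range(100):
    x = rng.dirichlet(np.ones(n))  # random point in the simplex
    for g in (g_linear, g_exp):
        # Regularity: g(x) is orthogonal to x, so the simplex is invariant.
        assert abs(g(x) @ x) < 1e-10
    # Payoff monotonicity (condition 2.4): both dynamics order the
    # strategies' growth rates exactly as their expected payoffs.
    pi = A @ x
    assert (np.argsort(g_linear(x)) == np.argsort(pi)).all()
    assert (np.argsort(g_exp(x)) == np.argsort(pi)).all()
print("all checks passed")
```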
A well-known subclass of payoff-monotonic game dynamics is given by

ẋ_i = x_i [ φ(π_i(x)) − Σ_{j=1}^n x_j φ(π_j(x)) ],        (2.5)

where φ(u) is an increasing function of u. These models arise in modeling the evolution of behavior by way of imitation processes, where players are occasionally given the opportunity to change their own strategies (Hofbauer, 1995; Weibull, 1995).

When φ is the identity function, that is, φ(u) = u, we obtain the standard replicator equations,

ẋ_i = x_i [ π_i(x) − Σ_{j=1}^n x_j π_j(x) ],        (2.6)

whose basic idea is that the average rate of increase ẋ_i/x_i equals the difference between the average fitness of strategy i and the mean fitness over the entire population. Another popular model arises when φ(u) = e^(κu), which yields

ẋ_i = x_i [ e^(κπ_i(x)) − Σ_{j=1}^n x_j e^(κπ_j(x)) ],        (2.7)
where κ is a positive constant. As κ tends to 0, the orbits of this dynamics approach those of the standard, first-order replicator model, equation 2.6, slowed down by the factor κ; moreover, for large values of κ, the model approximates the so-called best-reply dynamics (Hofbauer, 1995; Hofbauer & Sigmund, 1998).

3 Payoff-Monotonic Dynamics and Quadratic Programming

In this section we explore the connections between payoff-monotonic dynamics and quadratic optimization problems. Consider the following standard quadratic program:³

maximize    π(x) = x′Ax
subject to  x ∈ Δ,        (3.1)

³ This terminology is due to Bomze (1998).
where A is an arbitrary n × n symmetric matrix. In evolutionary game theory, a symmetric payoff matrix arises in the context of doubly symmetric (or partnership) games, where the interests of the two players coincide (Hofbauer & Sigmund, 1998; Weibull, 1995).

A point x* ∈ Δ is said to be a global solution of program 3.1 if π(x*) ≥ π(x) for all x ∈ Δ. It is said to be a local solution if there exists an ε > 0 such that π(x*) ≥ π(x) for all x ∈ Δ whose distance from x* is less than ε; if, in addition, π(x*) = π(x) implies x* = x, then x* is said to be a strict local solution. Note that the solutions of equation 3.1 remain the same if the matrix A is replaced with A + kee′, where k is an arbitrary constant. In addition, observe that maximizing a nonhomogeneous quadratic form x′Qx + 2c′x over Δ is equivalent to solving equation 3.1 with A = Q + ec′ + ce′ (Bomze, 1998).

A point x ∈ Δ satisfies the Karush-Kuhn-Tucker (KKT) conditions for problem 3.1, that is, the first-order necessary conditions for local optimality, if there exist n + 1 real constants µ_1, . . . , µ_n and λ, with µ_i ≥ 0 for all i = 1, . . . , n, such that

(Ax)_i − λ + µ_i = 0   for all i = 1, . . . , n,

and

Σ_{i=1}^n x_i µ_i = 0.

Note that since both x_i and µ_i are nonnegative for all i = 1, . . . , n, the latter condition is equivalent to saying that i ∈ σ(x) implies µ_i = 0. Hence, the KKT conditions can be rewritten as

(Ax)_i = λ   if i ∈ σ(x),
(Ax)_i ≤ λ   if i ∉ σ(x),        (3.2)
for some real constant λ. On the other hand, it is clear that λ = x′Ax. A point x ∈ Δ satisfying equation 3.2 will be called a KKT point throughout. The following easily proved results establish a first connection between standard quadratic programs and payoff-monotonic dynamics.

Proposition 2. If x ∈ Δ is a KKT point for equation 3.1, then it is a stationary point of any payoff-monotonic dynamics. If x ∈ int(Δ), then the converse also holds.

Proof. The proof is a straightforward consequence of proposition 1 and the KKT conditions 3.2.
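Proposition 2's correspondence can be checked numerically at the limit of an interior trajectory, which should satisfy conditions 3.2. A minimal sketch with a made-up symmetric payoff matrix (not from the text), using the discrete-time replicator iteration x_i ← x_i (Ax)_i / x′Ax, well defined here since A is nonnegative:

```python
import numpy as np

# Illustrative symmetric payoff matrix: strategies 0 and 1 are equally
# fit against themselves, while strategy 2 earns nothing.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])

# Interior starting point; discrete-time replicator iteration.
x = np.array([0.3, 0.3, 0.4])
for _ in range(100):
    pi = A @ x
    x = x * pi / (x @ pi)

# Check the KKT conditions 3.2 at the limit point, with lambda = x'Ax.
lam = x @ A @ x
pi = A @ x
on = x > 1e-8
assert np.allclose(pi[on], lam)              # (Ax)_i = lambda on sigma(x)
assert np.all(pi[~on] <= lam + 1e-12)        # (Ax)_i <= lambda elsewhere
print(x, lam)                                # [0.5 0.5 0. ] 0.5
```

The trajectory converges to the face spanned by the two fit strategies; the payoff of the extinct strategy sits strictly below λ, exactly as equation 3.2 requires.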
Clearly, not all equilibria of payoff-monotonic dynamics correspond to KKT points of equation 3.1 (think, e.g., of the vertices of Δ), but if an equilibrium is approached by an interior trajectory, then the correspondence does hold.

Proposition 3. Let x = lim_{t→∞} x(t) be the limit point of a trajectory under any payoff-monotonic dynamics. If x(0) ∈ int(Δ), then x is a KKT point for program 3.1.

Proof. Since x is a limit point of a trajectory, it is a stationary point (see, e.g., Bhatia & Szegö, 1970), and hence, by proposition 1, π_i(x) = π(x) for all i ∈ σ(x). Suppose now, to the contrary, that π_j(x) > π(x) for some j ∉ σ(x). Because of payoff monotonicity and the stationarity of x, we have g_j(x) > g_i(x) = 0 for all i ∈ σ(x), and by continuity, there exists a neighborhood U of x such that g_j(y) > 0 for all y ∈ U. Then, for a sufficiently large T, g_j(x(t)) > 0 for all t ≥ T, and since x_j(t) > 0 for all t (recall that int(Δ) is invariant), we have ẋ_j(t) > 0 for t ≥ T. This implies x_j = lim_{t→∞} x_j(t) > 0, a contradiction.

The next proposition, which will be useful later, provides another necessary condition for local solutions of equation 3.1 when the payoff matrix has a particular structure.

Proposition 4. Let x ∈ Δ be a stationary point of any payoff-monotonic dynamics, and suppose that the payoff matrix A is symmetric with positive diagonal entries, that is, a_ii > 0 for all i = 1, . . . , n. Suppose that there exist i, j ∈ σ(x) such that a_ij = 0. For 0 < δ ≤ x_j, let y(δ) = x + δ(e_i − e_j) ∈ Δ. Then y(δ)′Ay(δ) > x′Ax.

Proof. From the symmetry of A, we have

y(δ)′Ay(δ) = [x + δ(e_i − e_j)]′A[x + δ(e_i − e_j)]
           = x′Ax + 2δ(e_i − e_j)′Ax + δ²(e_i − e_j)′A(e_i − e_j)
           = x′Ax + 2δ[(Ax)_i − (Ax)_j] + δ²(a_ii − 2a_ij + a_jj).

But since x is a stationary point, we have (Ax)_i = (Ax)_j (recall that i, j ∈ σ(x)), and by the hypothesis a_ij = 0, we have

y(δ)′Ay(δ) = x′Ax + δ²(a_ii + a_jj) > x′Ax,

which proves the proposition.
In an unpublished paper, Hofbauer (1995) showed that for symmetric payoff matrices, the population mean payoff x Ax is strictly increasing along the trajectories of any payoff-monotonic dynamics. This result generalizes the celebrated “fundamental theorem of natural selection” (Hofbauer & Sigmund, 1998; Weibull, 1995), whose original form traces back to R. A. Fisher (1930). Here, we provide a different proof, adapting a technique from Fudenberg and Levine (1998). Theorem 1. If the payoff matrix A is symmetric, then π(x) = x Ax is strictly increasing along any nonconstant trajectory of any payoff-monotonic dynamics. In other words, π(x(t)) ˙ ≥ 0 for all t, with equality if and only if x = x(t) is a stationary point. Proof. See Hofbauer (1995), or Pelillo (2002) for a different proof. Apart from the monotonicity result that provides a (strict) Lyapunov function for payoff-monotonic dynamics, the previous theorem also rules out complicated attractors like cycles, invariant tori, or even strange attractors. It also allows us to establish a strong connection between the stability properties of these dynamics and the solutions of equation 3.1. To this end, we need an auxiliary result. Lemma 1. Let x be a strict local solution of equation 3.1, and put S = σ (x). Then x is the only stationary point of any payoff-monotonic dynamics in int( S ). Moreover, x Ax > y Ay for all y ∈ S . Proof. Clearly, since x is a strict local solution of equation 3.1, it is a KKT point and hence is stationary under any payoff-monotonic dynamics by proposition 2. Suppose by contradiction that y ∈ int( S ) \ {x} is stationary too. Then it is easy to see that all points on the segment joining x and y, which is contained in int( S ) because of its convexity, consists entirely of stationary points. Hence, by theorem 1, π˙ = 0 on this segment, which means that π is constant, but this contradicts the hypothesis that x is a strict local solution of equation 3.1. 
Moreover, for a sufficiently small ε > 0 we have

xᵀAx > [(1 − ε)x + εy]ᵀA[(1 − ε)x + εy] = (1 − ε)²xᵀAx + 2ε(1 − ε)yᵀAx + ε²yᵀAy,

but since x is stationary and σ(y) ⊆ σ(x), yᵀAx = xᵀAx, from which we readily obtain xᵀAx > yᵀAy.

Theorem 2. A point x ∈ Δ is a strict local solution of program 3.1 if and only if it is asymptotically stable under any payoff-monotonic dynamics.
M. Pelillo and A. Torsello
Proof. If x is asymptotically stable, then there exists a neighborhood U of x in Δ such that every trajectory starting at a point y ∈ U will converge to x. Then, by virtue of theorem 1, we have xᵀAx > yᵀAy for all y ∈ U \ {x}, which shows that x is a strict local solution of equation 3.1.

On the other hand, suppose that x is a strict local solution of equation 3.1, and let S = σ(x). By lemma 1, the function V : int(Δ_S) → R defined as V(y) = π(x) − π(y) is clearly nonnegative in int(Δ_S), and it vanishes only when y = x. Furthermore, V̇ ≤ 0 by theorem 1 and, again from lemma 1, V̇ < 0 in int(Δ_S) \ {x}, as x is the only stationary point in int(Δ_S). This means that V is a strict Lyapunov function for any payoff-monotonic dynamics, and hence x is asymptotically stable (see, e.g., Bhatia & Szegő, 1970; Hirsch & Smale, 1974).

The results presented in this section show that continuous-time payoff-monotonic dynamics can be usefully employed to find (local) solutions of standard quadratic programs. In the rest of the article, we focus the discussion on a particular class of quadratic optimization problems that arise in conjunction with the maximum clique problem.

4 A Family of Quadratic Programs for Maximum Clique

Let G = (V, E) be an undirected graph with no self-loops, where V = {1, …, n} is the set of vertices and E ⊆ V × V is the set of edges. The order of G is the number of its vertices, and its size is the number of edges. Two vertices i, j ∈ V are said to be adjacent if (i, j) ∈ E. The adjacency matrix of G is the n × n symmetric matrix A_G = (a_ij) defined as follows:

a_ij = 1 if (i, j) ∈ E, and a_ij = 0 otherwise.
The degree of a vertex i ∈ V relative to a subset of vertices C, denoted by deg_C(i), is the number of vertices in C adjacent to it, that is,

deg_C(i) = Σ_{j∈C} a_ij.
Clearly, when C = V we obtain the standard degree notion, in which case we shall write deg(i) instead of deg_V(i). A subset C of vertices in G is called a clique if all its vertices are mutually adjacent; that is, for all i, j ∈ C with i ≠ j, we have (i, j) ∈ E. A clique is said to be maximal if it is not contained in any larger clique, and maximum if it is the largest clique in the graph. The clique number, denoted by ω(G), is defined as the cardinality of the maximum clique. The maximum clique problem is to find a clique whose cardinality equals the clique number.
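These graph-theoretic notions are straightforward to operationalize. The following minimal sketch (Python with NumPy is our choice, not the paper's; the example graph is hypothetical) encodes the adjacency matrix, the relative degree deg_C(i), and the clique and maximal-clique tests:

```python
import numpy as np

def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix A_G of an undirected graph without self-loops."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A

def deg_C(A, i, C):
    """Degree of vertex i relative to subset C: number of vertices in C adjacent to i."""
    return int(A[i, list(C)].sum())

def is_clique(A, C):
    """All distinct vertices of C are mutually adjacent."""
    C = list(C)
    return all(A[i, j] == 1 for k, i in enumerate(C) for j in C[k + 1:])

def is_maximal_clique(A, C):
    """A clique is maximal iff no outside vertex is adjacent to every vertex of C."""
    return is_clique(A, C) and all(
        deg_C(A, i, C) < len(C) for i in range(A.shape[0]) if i not in C)

# Hypothetical example: a triangle {0, 1, 2} with a pendant path 2-3-4.
A = adjacency(5, [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])
print(is_clique(A, {0, 1, 2}), is_maximal_clique(A, {0, 1, 2}))  # True True
print(is_maximal_clique(A, {0, 1}))  # False: vertex 2 extends it
```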
In the mid-1960s, Motzkin and Straus (1965) established a remarkable connection between the maximum clique problem and the following standard quadratic program:

maximize  f(x) = xᵀA_G x
subject to  x ∈ Δ ⊂ Rⁿ,    (4.1)

where n is the order of G. Specifically, if x* is a global solution of equation 4.1, they proved that the clique number of G is related to f(x*) by the following formula:

ω(G) = 1 / (1 − f(x*)).    (4.2)
Additionally, they showed that a subset of vertices C is a maximum clique of G if and only if its characteristic vector x^C, which is the vector of Δ defined as

x^C_i = 1/|C| if i ∈ C, and x^C_i = 0 otherwise,

is a global maximizer of f on Δ.4 Gibbons, Hearn, Pardalos, and Ramana (1997) and Pelillo and Jagota (1995) extended the Motzkin-Straus theorem by providing a characterization of maximal cliques in terms of local maximizers of f on Δ.

One drawback associated with the original Motzkin-Straus formulation, however, relates to the existence of "infeasible" solutions, that is, maximizers of f that are not in the form of characteristic vectors. Pelillo and Jagota (1995) have provided general characterizations of such solutions. To overcome this problem, consider the following family of standard quadratic programs:

maximize  f_α(x) = xᵀ(A_G + αI)x
subject to  x ∈ Δ ⊂ Rⁿ,    (4.3)

where α is an arbitrary real parameter and I is the identity matrix. This family includes as special cases the original Motzkin-Straus program (see equation 4.1) and the regularized version proposed by Bomze (1997), corresponding to the cases α = 0 and α = 1/2, respectively.
4 In their original paper, Motzkin and Straus proved just the "only if" part of this theorem. The converse direction is, however, a straightforward consequence of their result (Pelillo & Jagota, 1995).
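Equations 4.1 and 4.2 are easy to check numerically on a small example. In the sketch below (NumPy and the 5-vertex graph are our own illustration, not from the paper), the maximum clique has size 3, so the program's value at its characteristic vector is 1 − 1/3 and equation 4.2 recovers ω(G) = 3:

```python
import numpy as np

# Hypothetical 5-vertex graph: maximum clique C = {0, 1, 2}, so omega(G) = 3.
n = 5
A_G = np.zeros((n, n))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]:
    A_G[i, j] = A_G[j, i] = 1.0

C = [0, 1, 2]
x = np.zeros(n)
x[C] = 1.0 / len(C)        # characteristic vector x^C

f = x @ A_G @ x            # f(x^C) = 1 - 1/|C| = 2/3
omega = 1.0 / (1.0 - f)    # equation 4.2 recovers the clique number
print(f, omega)
```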
Proposition 5. Let x be a KKT point for program 4.3 with α < 1, and let C = σ(x) be the support of x. If C is a clique of G, then it is a maximal clique.

Proof. Suppose by contradiction that C is a nonmaximal clique. Hence, there exists j ∈ V \ C such that (i, j) ∈ E for all i ∈ C. Since α < 1, we have for all i ∈ C:

(A_G x + αx)_i = (A_G x)_i + αx_i = 1 − (1 − α)x_i < 1 = (A_G x)_j = (A_G x + αx)_j.

But due to equation 3.2, this contradicts the hypothesis that x is a KKT point for equation 4.3.

In general, however, the fact that a point x ∈ Δ satisfies the KKT conditions does not imply that σ(x) is a clique of G. For instance, it is easy to show that if for a subset C we have deg_C(i) = k for all i ∈ C (i.e., C induces a k-regular subgraph), and deg_C(i) ≤ k for all i ∉ C, then x^C is a KKT point for equation 4.3, provided that α ≥ 0.

The following theorem, which generalizes an earlier result by Bomze (1997), establishes a one-to-one correspondence between local (global) solutions of equation 4.3 and maximal (maximum) cliques of G. By adapting the proof technique from Bomze (1997), it has also been proved previously in Bomze et al. (2002) using concepts and results from evolutionary game theory. Here we provide a different proof based on standard facts from optimization theory.

Theorem 3. Let C be a subset of vertices of a graph G, and let x^C be its characteristic vector. Then, for any 0 < α < 1, C is a maximal (maximum) clique of G if and only if x^C is a local (global) solution of equation 4.3. Moreover, all solutions of equation 4.3 are strict and are characteristic vectors of maximal cliques of G.

Proof. See Bomze et al. (2002) for a proof that requires several previous results from evolutionary game theory, or appendix B for a self-contained proof that uses only basic concepts from optimization theory.

Corollary 1. Let C be a subset of vertices of a graph G with x^C as its characteristic vector, and let 0 < α < 1.
Then C is a maximal clique of G if and only if x^C is an asymptotically stable stationary point under any payoff-monotonic dynamics with payoff matrix A = A_G + αI.

Proof. The proof is obvious from theorems 2 and 3.

These results naturally suggest any dynamics in the payoff-monotonic class as a useful heuristic for the maximum clique problem, and this will be the subject of the next section.
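The role of the regularization term can be seen directly on characteristic vectors: for a clique C of size m, x^Cᵀ A_G x^C = (m − 1)/m and α‖x^C‖² = α/m, so f_α(x^C) = 1 − (1 − α)/m, which is strictly increasing in m whenever α < 1 — consistent with theorem 3's global-solution/maximum-clique correspondence. A one-function sketch of this arithmetic (the function name is ours):

```python
def f_alpha_on_clique(m, alpha):
    """f_alpha at the characteristic vector of a clique of size m:
    x'A_G x = (m - 1)/m and alpha * ||x||^2 = alpha/m, hence 1 - (1 - alpha)/m."""
    return 1.0 - (1.0 - alpha) / m

# With alpha = 1/2 (Bomze's regularization), larger cliques score strictly higher:
print([f_alpha_on_clique(m, 0.5) for m in (2, 3, 4, 5)])
# [0.75, 0.8333333333333334, 0.875, 0.9]
```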
5 Clique-Finding Payoff-Monotonic Dynamics

Let G = (V, E) be a graph, and let A_G denote its adjacency matrix. By using

A = A_G + αI,  0 < α < 1,    (5.1)
as the payoff matrix, any payoff-monotonic dynamics, starting from an arbitrary initial state, will eventually be attracted with probability one by the nearest asymptotically stable point, which, by virtue of corollary 1, will then correspond to a maximal clique of G. Clearly, in theory, there is no guarantee that the converged solution will be a global solution of equation 4.3 and therefore that it will yield a maximum clique in G. In practice, it is not unlikely, however, that the system converges toward a stationary point that is unstable, that is, a saddle of the Lyapunov function xᵀAx. This can be the case when the dynamics is started from the simplex barycenter and symmetry is not broken. Proposition 3 ensures, however, that the limit point of any interior trajectory will be at least a KKT point of program 4.3. The next proposition translates this fact into a different language.

Proposition 6. Let x ∈ Δ be the limit point of a trajectory of any payoff-monotonic dynamics starting in the interior of Δ. Then either σ(x) is a maximal clique or it is not a clique.

Proof. The proof is obvious from propositions 3 and 5.

The practical significance of the previous result reveals itself in large graphs: even if these are quite dense, cliques are usually much smaller than the graph itself. Now suppose we are returned a KKT point x. Then we put C = σ(x) and check whether C is a clique. This requires O(m²) steps if C contains m vertices, while checking whether this clique is maximal would require O(mn) steps and, as stressed above, usually m ≪ n. But proposition 6 now guarantees that the obtained clique C (if it is one) must automatically be maximal, and thus we are spared trying to add external vertices.

5.1 Experimental Results.
To assess the ability of our payoff-monotonic models to extract large cliques, we performed extensive experimental evaluations on both random and DIMACS benchmark graphs.5 For our simulations we used discretized versions of the continuous-time linear (see equation 2.6) and exponential (see equation 2.7) replicator dynamics (see appendix A for a description of our discretizations). We started the processes from the simplex barycenter and stopped them when a maximal clique (i.e., a strict local maximizer of f ) was found. Occasionally, when
5 Data can be found online at http://dimacs.rutgers.edu.
the system converged to a nonclique KKT point, we randomly perturbed the solution and let the game dynamics restart from this new point. Since unstable stationary points have a basin of attraction of measure zero around them, the process is pulled away with probability one, to converge, eventually, to another (hopefully asymptotically stable) stationary point. In an attempt to improve the quality of the final results, we used a mixed strategy as far as the regularization parameter α is concerned. Indeed, Bomze et al. (1997) showed that the original Motzkin-Straus formulation (i.e., α = 0), which is plagued with the presence of spurious solutions, usually yields slightly better results than its regularized version. Accordingly, we started the dynamics using α = 0 and, after convergence, we restarted it from the converged point using α = 1/2. This way, we are guaranteed to avoid spurious solutions, thereby obtaining a maximal clique.

In the first set of experiments we ran the algorithms over random graphs of order 100, 200, 300, 400, 500, and 1000 and with edge densities ranging from 0.25 to 0.95. For each order and density value, 100 different graphs were generated. Table 1 shows the results obtained in terms of clique size. Here, n refers to the graph order, ρ is the edge density, and the labels "RD linear" and "RD exp" indicate the first-order and the exponential replicator dynamics, respectively. The results are compared with the following state-of-the-art neural network heuristics for maximum clique: Jagota's continuous Hopfield dynamics (CHD) and mean field annealing (MFA) (Jagota, 1995; Jagota et al., 1996), the saturated linear dynamical network (SLDN) by Pekergin et al. (1999), an approximation approach introduced by Funabiki et al. (FTL) (1992), the iterative Hopfield nets (IHN) algorithm by Bertoni et al. (2002), and the Hopfield network learning (HNL) of Wang et al. (2003).
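The two-phase strategy just described (α = 0 from the barycenter, then α = 1/2 from the converged point) rests on the discretized linear replicator update x_i ← x_i (Ax)_i / xᵀAx. A minimal sketch of that procedure follows; the test graph, tolerances, and support threshold are our own choices, not the paper's:

```python
import numpy as np

def replicator_clique(A_G, alpha=0.5, x0=None, tol=1e-12, max_iter=100000):
    """Discrete-time linear replicator dynamics x_i <- x_i (Ax)_i / x'Ax with
    payoff matrix A = A_G + alpha*I. For 0 < alpha < 1, the asymptotically
    stable points are the characteristic vectors of maximal cliques."""
    n = A_G.shape[0]
    A = A_G + alpha * np.eye(n)
    x = np.full(n, 1.0 / n) if x0 is None else x0.copy()  # default: barycenter
    for _ in range(max_iter):
        Ax = A @ x
        x_new = x * Ax / (x @ Ax)        # x'Ax never decreases (theorem 1)
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x, np.flatnonzero(x > 1e-6)   # limit point and its support sigma(x)

# Hypothetical test graph: a triangle {0, 1, 2} with a pendant path 2-3-4.
A_G = np.zeros((5, 5))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]:
    A_G[i, j] = A_G[j, i] = 1.0

x, _ = replicator_clique(A_G, alpha=0.0)             # Motzkin-Straus phase
x, clique = replicator_clique(A_G, alpha=0.5, x0=x)  # regularized restart
print(clique.tolist())  # [0, 1, 2]
```

The converged point has weight 1/|C| on each clique vertex, as theorem 3 predicts.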
Figure 1 plots the corresponding CPU timings obtained with a (nonoptimized) C++ implementation on a machine equipped with a 2.5 GHz Pentium 4 processor. Since random graphs are notoriously easy to deal with, a second set of experiments was also performed on the DIMACS benchmark graphs (see Tables 2 and 3). Here, columns marked with n and ρ contain the number of vertices in the graph and the edge density, respectively. Columns "Clique Size" contain the size of the cliques found by the competing algorithms, while the column "Time" reports the CPU timings for the proposed dynamics. The sizes of the cliques obtained are compared against several algorithms from either the neural network or the continuous optimization literature, and against the best result over all algorithms featured in the DIMACS challenge (Johnson & Trick, 1996) (DIMACS best). The neural-based approaches include mean field annealing (MFA), the inverted neurons network (INN) model by Grossman (1996), and the IHN algorithm, while algorithms from the continuous optimization literature include the continuous-based heuristic (CBH) by Gibbons et al. (1997) and the QSH algorithm by Busygin, Butenko, and Pardalos (2002). The results are taken from the cited papers. No results are presented for CHD, SLDN, FTL, and
Game Dynamics and Maximum Clique
1231
Table 1: Results of Replicator Dynamics (RD) and State-of-the-Art Neural Network–Based or Optimization-Based Approaches on Random Graphs with Varying Order and Density.

n     ρ     RD Linear       RD Exponential  CHD    MFA    SLDN   FTL   IHN    HNL
100   0.25  4.90 ± 0.56     4.81 ± 0.60     4.48   —      4.83   4.2   —      —
100   0.50  8.01 ± 0.66     8.07 ± 0.64     7.38   8.50   8.07   8.0   9.13   9
100   0.75  15.10 ± 1.05    15.15 ± 1.11    13.87  —      15.05  14.1  —      —
100   0.90  28.50 ± 1.87    28.92 ± 1.74    27.92  30.02  —      —     —      30
100   0.95  41.81 ± 1.80    42.04 ± 1.78    —      —      —      —     —      —
200   0.25  5.34 ± 0.68     5.35 ± 0.64     —      —      —      4.9   —      —
200   0.50  9.04 ± 0.76     9.11 ± 0.77     —      —      —      8.5   10.60  11
200   0.75  17.77 ± 1.35    18.05 ± 1.23    —      —      —      —     —      —
200   0.90  36.74 ± 1.65    37.41 ± 1.55    —      —      —      —     —      39
200   0.95  57.33 ± 2.33    58.24 ± 2.09    —      —      —      —     —      —
300   0.25  5.58 ± 0.62     5.61 ± 0.62     —      —      —      5.1   —      —
300   0.50  9.54 ± 0.90     9.57 ± 0.88     —      —      —      8.9   11.60  11
300   0.75  19.05 ± 1.27    19.40 ± 1.17    —      —      —      —     —      —
300   0.90  40.74 ± 1.97    41.48 ± 1.81    —      —      —      —     —      46
300   0.95  66.48 ± 2.69    67.43 ± 2.47    —      —      —      —     —      —
400   0.25  5.73 ± 0.65     5.73 ± 0.60     5.53   —      5.70   4.9   —      —
400   0.50  9.99 ± 0.80     10.07 ± 0.77    9.24   10.36  9.91   8.9   12.30  —
400   0.75  20.17 ± 1.16    20.42 ± 1.10    18.79  —      20.44  17.7  —      —
400   0.90  43.63 ± 1.73    44.44 ± 1.64    43.24  49.94  —      —     —      —
400   0.95  73.11 ± 2.37    74.25 ± 2.30    —      —      —      —     —      —
500   0.25  5.81 ± 0.69     5.74 ± 2.30     —      —      —      6.2   —      —
500   0.50  10.14 ± 0.93    10.31 ± 0.91    —      —      —      9.4   12.80  12
500   0.75  20.90 ± 1.32    21.31 ± 1.27    —      —      —      —     —      —
500   0.90  46.10 ± 2.25    46.93 ± 2.29    —      —      —      —     —      56
500   0.95  78.73 ± 2.66    79.94 ± 2.68    —      —      —      —     —      —
1000  0.25  6.23 ± 0.62     6.17 ± 0.57     6.03   —      6.17   5.8   —      —
1000  0.50  10.74 ± 0.80    10.83 ± 0.82    10.25  —      10.93  10.4  —      —
1000  0.75  22.88 ± 1.28    23.04 ± 1.35    21.26  —      23.19  21.4  —      —
1000  0.90  52.60 ± 2.02    53.15 ± 2.11    —      —      —      —     —      —
1000  0.95  93.75 ± 3.12    94.80 ± 2.96    —      —      —      —     —      —
HNL, since the authors did not provide results on the DIMACS graphs. For the same reason, we did not report results on random graphs for INN, CBH, and QSH.

A number of conclusions can be drawn from these results. First, the exponential dynamics provides slightly better results than the linear one while being dramatically faster, especially on dense graphs. These results confirm earlier findings reported in Pelillo (1999, 2002) on graph classes arising from graph and tree matching problems. As for the comparison with the other algorithms, we note that our dynamics substantially outperform CHD and FTL and are, overall, as effective as SLDN (for which no results on DIMACS graphs are available). Observe that these approaches do not incorporate any procedure to escape from poor local solutions and, hence,
Figure 1: CPU time of replicator dynamics on random graphs with varying order and density. The x-axis represents the edge density, while the y-axis denotes time (in seconds).
are close in spirit to ours. Clearly our results are worse than those obtained with algorithms that do use some form of annealing or, in any case, are explicitly designed to avoid local optima, such as IHN, HNL, CBH, and QSH. Interestingly, however, the results are close to (and in some instances
Table 2: Results of Replicator Dynamics (RD) and State-of-the-Art Neural Network–Based or Optimization-Based Approaches on DIMACS Benchmark Graphs, Part I.

Graph         n    ρ     DIMACS Best  RD Lin  RD Exp  MFA  INN Avg  IHN  CBH  QSH  Time Lin (s)  Time Exp (s)
brock200_1    200  0.75  21           17      17      19   —        —    20   21   0.33          0.2
brock200_2    200  0.50  12           8       8       9    9.1      —    12   12   0.23          0.17
brock200_3    200  0.61  15           9       10      11   —        —    14   15   0.25          0.18
brock200_4    200  0.66  17           13      13      14   13.4     —    16   17   0.28          0.16
brock400_1    400  0.75  24           21      21      24   —        —    23   27   1.36          0.69
brock400_2    400  0.75  25           20      21      21   21.3     —    24   29   1.35          0.73
brock400_3    400  0.75  25           19      18      22   —        —    23   31   1.38          0.69
brock400_4    400  0.75  33           19      20      23   21.3     —    24   33   1.37          0.74
brock800_1    800  0.65  21           16      16      16   —        —    20   17   4.68          2.86
brock800_2    800  0.65  21           16      16      16   16.9     —    19   24   4.61          2.77
brock800_3    800  0.65  21           15      18      16   —        —    20   25   4.72          2.92
brock800_4    800  0.65  21           17      17      15   16.5     —    19   26   4.63          2.7
c-fat200-1    200  0.08  12           12      12      6    —        12   12   12   0.22          0.17
c-fat200-2    200  0.16  24           24      24      24   —        24   24   24   0.15          0.16
c-fat200-5    200  0.43  58           58      58      58   —        58   58   58   0.16          0.16
c-fat500-1    500  0.04  14           14      14      —    —        14   14   14   0.97          1.01
c-fat500-2    500  0.07  26           26      26      —    —        26   26   26   1             1.02
c-fat500-5    500  0.19  64           64      64      —    —        64   64   64   1.01          0.99
c-fat500-10   500  0.37  126          126     126     —    —        —    126  126  1.09          1
hamming6-2    64   0.90  32           32      32      —    —        32   32   32   0.09          0.03
hamming6-4    64   0.35  4            4       4       —    —        4    4    4    0.03          0.02
Table 2: Continued.

Graph          n      ρ     DIMACS Best  RD Lin  RD Exp  MFA  INN Avg  IHN  CBH  QSH  Time Lin (s)  Time Exp (s)
hamming8-2     256    0.97  128          82      81      —    —        128  128  128  3.4           0.51
hamming8-4     256    0.64  16           10      10      —    16.0     16   16   16   0.44          0.27
hamming10-2    65536  0.99  512          312     297     —    —        512  —    —    459.04        34.48
hamming10-4    65536  0.83  36           32      32      —    32.0     36   —    —    24.27         6.05
johnson8-2-4   28     0.56  4            4       4       —    —        4    4    4    0.01          0.00
johnson8-4-4   70     0.77  14           14      14      —    —        14   14   14   0.04          0.02
johnson16-2-4  120    0.76  8            8       8       —    —        8    8    8    0.12          0.07
johnson32-2-4  496    0.88  16           16      16      —    —        16   16   16   5.11          1.36
keller4        171    0.65  11           7       7       —    8.4      —    10   11   0.21          0.14
keller5        776    0.75  27           15      15      —    16.7     —    21   24   5.43          2.64
keller6        3361   0.82  59           31      31      —    33.4     —    —    —    174.89        54.34
Table 3: Results of Replicator Dynamics (RD) and State-of-the-Art Neural Network–Based or Optimization-Based Approaches on DIMACS Benchmark Graphs, Part II.

Graph         n     ρ     DIMACS Best  RD Lin  RD Exp  MFA  INN Avg  IHN  CBH  QSH  Time Lin (s)  Time Exp (s)
p_hat300-1    300   0.24  8            6       6       —    7.0      8    8    7    0.41          0.39
p_hat300-2    300   0.49  25           24      24      —    21.9     25   25   24   0.69          0.43
p_hat300-3    300   0.74  36           32      33      —    33.1     36   36   33   1.04          0.43
p_hat500-1    500   0.25  9            8       8       —    —        9    9    9    1.18          1.07
p_hat500-2    500   0.50  36           34      34      —    —        36   35   33   2.07          1.14
p_hat500-3    500   0.75  50           47      48      —    —        49   49   46   3.75          1.23
p_hat700-1    700   0.25  11           9       9       —    8.4      11   11   8    2.39          2.1
p_hat700-2    700   0.50  44           43      43      —    42.0     44   44   42   4.31          2.2
p_hat700-3    700   0.75  62           58      59      —    58.2     61   60   59   7.35          2.43
p_hat1000-1   1000  0.25  10           8       8       —    —        10   —    —    4.89          4.5
p_hat1000-2   1000  0.50  46           42      44      —    —        46   —    —    8.02          4.36
p_hat1000-3   1000  0.75  66           61      63      —    —        68   —    —    14.07         4.8
p_hat1500-1   1500  0.25  11           9       9       —    9.4      —    —    —    11.03         9.71
p_hat1500-2   1500  0.50  65           61      61      —    60.2     —    —    —    22.96         10.37
p_hat1500-3   1500  0.75  94           88      88      —    86.2     —    —    —    36.72         10.89
san200_0.7_1  200   0.70  30           15      15      —    —        30   15   30   0.68          0.22
san200_0.7_2  200   0.70  18           12      12      —    —        15   12   18   0.96          0.23
san200_0.9_1  200   0.90  70           45      45      42   —        70   46   70   1.74          0.35
san200_0.9_2  200   0.90  60           36      36      —    —        41   36   60   1.11          0.3
san200_0.9_3  200   0.90  44           32      32      33   —        —    30   35   0.76          0.26
Table 3: Continued.

Graph         n     ρ     DIMACS Best  RD Lin  RD Exp  MFA  INN Avg  IHN  CBH  QSH  Time Lin (s)  Time Exp (s)
san400_0.5_1  400   0.50  13           7       7       —    —        —    8    9    2.71          0.87
san400_0.7_1  400   0.70  40           20      20      —    —        40   20   40   2.41          0.95
san400_0.7_2  400   0.70  30           15      15      —    —        30   15   30   3.38          0.97
san400_0.7_3  400   0.70  22           12      12      —    —        —    14   16   4             0.89
san400_0.9_1  400   0.90  100          44      50      —    —        100  50   100  2.98          0.89
san1000       1000  0.50  10           8       8       —    —        10   —    —    29.95         6.91
sanr200_0.7   200   0.70  18           14      16      18   —        17   18   15   0.29          0.2
sanr200_0.9   200   0.90  42           37      39      41   —        41   41   37   0.72          0.24
sanr400_0.5   400   0.50  13           11      11      —    —        12   12   11   0.88          0.61
sanr400_0.7   400   0.70  21           18      18      —    —        21   20   18   1.24          0.69
even better than) those obtained with Jagota's MFA and Grossman's INN, which are also in this family.

6 Annealed Imitation Dynamics: Evolving Toward Larger Cliques

In an attempt to avoid inefficient local solutions, we now follow Bomze et al. (2002) and investigate the stability properties of equilibria of payoff-monotonic dynamics when the parameter α is allowed to take on negative values. Indeed, we shall restrict our analysis to imitation dynamics (see equation 2.5), but we first make a few observations pertaining to general selection dynamics.

For any regular selection dynamics ẋ_i = x_i g_i(x), we have

∂ẋ_i/∂x_j = δ_ij g_i(x) + x_i ∂g_i(x)/∂x_j,    (6.1)

where δ_ij is the Kronecker delta, defined as δ_ij = 1 if i = j and δ_ij = 0 otherwise. Assuming without loss of generality that σ(x) = {1, …, m}, the Jacobian of any regular selection dynamics at a stationary point x therefore has the following block triangular form:

J(x) = [ M(x)  N(x) ]
       [   O   D(x) ],    (6.2)

where the entries of M(x) and N(x) are given by x_i ∂g_i(x)/∂x_j, O is the (possibly empty) matrix containing all zeros, and D(x) = diag{g_{m+1}(x), …, g_n(x)}. An immediate consequence of this observation is that we can already say something about the spectrum of J(x) when m < n. In fact, the eigenvalues of J(x) are those of M(x) together with those of D(x), and since D(x) is diagonal, its eigenvalues coincide with its diagonal entries, that is, g_{m+1}(x), …, g_n(x). This set of eigenvalues governs the asymptotic behavior of the external flow under the system obtained by linearization around x; its elements are usually called transversal eigenvalues (Hofbauer & Sigmund, 1998). Without knowing the form of the growth functions g_i, however, it is difficult to provide further insights into the spectral properties of J(x), and therefore we now specialize our discussion to imitation dynamics (see equation 2.5). In this case, we have

g_i(x) = φ(π_i(x)) − Σ_{k=1}^{n} x_k φ(π_k(x)),    (6.3)
where φ is a strictly increasing function, and hence

∂ẋ_i/∂x_j = δ_ij [φ(π_i(x)) − Σ_k x_k φ(π_k(x))]
          + x_i [a_ij φ′(π_i(x)) − φ(π_j(x)) − Σ_k x_k a_kj φ′(π_k(x))].    (6.4)

When x is an equilibrium point, and hence π_i(x) = π(x) for all i ∈ σ(x), the previous expression simplifies to

∂ẋ_i/∂x_j = δ_ij [φ(π_i(x)) − φ(π(x))]
          + x_i [a_ij φ′(π(x)) − φ(π_j(x)) − φ′(π(x))π_j(x)].    (6.5)
Before we provide the main result of this section, we prove the following useful proposition, which generalizes an earlier result by Bomze (1986).

Proposition 7. Let x be a stationary point of any imitation dynamics, equation 2.5. Then:

a. −φ(π(x)) is an eigenvalue of J(x), with x as an associated eigenvector.
b. If y is an eigenvector of J(x) associated with an eigenvalue λ ≠ −φ(π(x)), then eᵀy = Σ_i y_i = 0.

Proof. Recall from proposition 1 that x is an equilibrium point for equation 2.5 if and only if π_i(x) = π(x) for all i ∈ σ(x). Hence, for i = 1, …, n, we have:

Σ_{j=1}^{n} x_j ∂ẋ_i/∂x_j = x_i [φ(π_i(x)) − φ(π(x))]
    + x_i Σ_{j=1}^{n} x_j [a_ij φ′(π(x)) − φ(π_j(x)) − π_j(x)φ′(π(x))]
  = x_i [φ(π_i(x)) − φ(π(x)) + φ′(π(x))π_i(x) − φ(π(x)) − φ′(π(x))π(x)]
  = −x_i φ(π(x)).

In other words, we have shown that J(x)x = −φ(π(x))x, which proves part a of the proposition.
To prove part b, first note that the columns of J(x) have a nice property: they all sum to −φ(π(x)). Indeed, for all j = 1, …, n we have:

Σ_{i=1}^{n} ∂ẋ_i/∂x_j = φ(π_j(x)) − φ(π(x))
    + Σ_{i=1}^{n} x_i [a_ij φ′(π(x)) − φ(π_j(x)) − π_j(x)φ′(π(x))]
  = φ(π_j(x)) − φ(π(x)) + φ′(π(x))π_j(x) − φ(π_j(x)) − φ′(π(x))π_j(x)
  = −φ(π(x)).

Now, the hypothesis J(x)y = λy yields

λ Σ_i y_i = Σ_i (J(x)y)_i = Σ_j y_j Σ_i J(x)_ij = −φ(π(x)) Σ_j y_j,

which implies Σ_i y_i = 0, since λ ≠ −φ(π(x)).
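Proposition 7 is easy to check numerically for the linear replicator dynamics (φ the identity), where ẋ_i = x_i((Ax)_i − xᵀAx) has the closed-form Jacobian J_ij = δ_ij((Ax)_i − xᵀAx) + x_i(a_ij − 2(Ax)_j) for symmetric A. The graph and the value of α below are our own working example:

```python
import numpy as np

# Payoff matrix A = A_G + alpha*I for a triangle-plus-path graph (our example).
A_G = np.zeros((5, 5))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]:
    A_G[i, j] = A_G[j, i] = 1.0
A = A_G + 0.5 * np.eye(5)

# Stationary point: characteristic vector of the maximal clique {0, 1, 2}.
x = np.array([1/3, 1/3, 1/3, 0.0, 0.0])
Ax = A @ x
pi_bar = x @ Ax                                # pi(x); here phi = identity

# Closed-form Jacobian of the linear replicator, A symmetric:
J = np.diag(Ax - pi_bar) + x[:, None] * (A - 2.0 * Ax[None, :])

print(np.allclose(J @ x, -pi_bar * x))         # part a: eigenpair (-pi(x), x)
print(np.allclose(J.sum(axis=0), -pi_bar))     # columns all sum to -phi(pi(x))
```

Both checks print True, mirroring the two identities established in the proof.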
Since we analyze the behavior of imitation dynamics restricted to the standard simplex Δ, we are interested only in the eigenvalues of J(x) associated with eigenvectors belonging to the tangent space e⊥ = {y ∈ Rⁿ : eᵀy = 0}. The previous result therefore implies that the eigenvalue −φ(π(x)) can be neglected in our analysis, and that the remaining ones, including the transversal eigenvalues, are indeed all relevant.

We now return to the maximum clique problem. Let a graph G = (V, E) be given, and for a subset of vertices C, let

γ(C) = max_{i∉C} (deg_C(i) − |C| + 1).    (6.6)

Note that if C is a maximal clique, then γ(C) ≤ 0. The next theorem shows that γ(C) plays a key role in determining the stability of equilibria of imitation dynamics.

Theorem 4. Let C be a maximal clique of graph G = (V, E), and let x^C be its characteristic vector. If γ(C) < α < 1, then x^C is an asymptotically stable stationary point under any imitation dynamics (see equation 2.5) with payoff matrix A = A_G + αI, and hence a (strict) local maximizer of f_α in Δ. Moreover, assuming C ≠ V, if α < γ(C), then x^C becomes unstable.
Note that if C is a maximal clique, then γ (C) ≤ 0. The next theorem shows that γ (C) plays a key role in determining the stability of equilibria of imitation dynamics. Theorem 4. Let C be a maximal clique of graph G = (V, E), and let xC be its characteristic vector. If γ (C) < α < 1, then xC is an asymptotically stable stationary point under any imitation dynamics (see equation 2.5) with payoff matrix A = AG + α I , and hence a (strict) local maximizer of f α in . Moreover, assuming C = V, if α < γ (C), then xC becomes unstable. Proof. Assume without loss of generality that C = {1, . . . , m} and suppose that γ (C) < α < 1. To simplify notations, put x = xC . We shall see that the eigenvalues of J (x) are real and negative. This implies that x is a sink and
hence an asymptotically stable point (Hirsch & Smale, 1974). The fact that x is a strict local maximizer of f_α in Δ follows directly from theorem 2.

As already noticed in the previous discussion, because of the block triangular form of J(x), its eigenvalues are those of M(x) = (x_i ∂g_i(x)/∂x_j)_{i,j=1,…,m} together with the n − m transversal eigenvalues g_{m+1}(x), …, g_n(x), where

g_i(x) = φ(π_i(x)) − Σ_{k=1}^{n} x_k φ(π_k(x)).

Since C is a (maximal) clique, π_k(x) = π(x) for all k ∈ C = σ(x), and therefore g_i(x) = φ(π_i(x)) − φ(π(x)). But φ is a strictly increasing function, and hence g_i(x) < 0 if and only if π_i(x) < π(x). Now, since C is a maximal clique, π_i(x) = (A_G x)_i = deg_C(i)/m for all i > m, and π(x) = (m − 1 + α)/m. But for all i > m, we have deg_C(i) − m + 1 ≤ γ(C) < α, and this yields π_i(x) < π(x). Hence, all transversal eigenvalues are negative.

It remains to show that the eigenvalues of M(x) are negative too. When A = A_G + αI, we have:

M(x)_ij = x_i ∂g_i(x)/∂x_j = (1/m)[(a_ij + αδ_ij)φ′(π(x)) − φ(π(x)) − φ′(π(x))π(x)].

Hence, in matrix form, we have

M(x) = (φ′(π(x))/m) { [1 − φ(π(x))/φ′(π(x)) − π(x)] eeᵀ + (α − 1)I },

where eeᵀ is the m × m matrix containing all ones, and the eigenvalues of M(x) are

λ₁ = (φ′(π(x))/m)(α − 1),

with multiplicity m − 1, and

λ₂ = (φ′(π(x))/m) { [1 − φ(π(x))/φ′(π(x)) − π(x)] m + α − 1 } = −φ(π(x)),

with multiplicity 1. Since α < 1 and φ is strictly increasing, we have λ₁ < 0. Moreover, recall from proposition 7 that the eigenvalue λ₂ = −φ(π(x)) is not
relevant to the imitation dynamics on the simplex Δ, since its eigenvector x does not belong to the tangent space e⊥. Hence, as far as the dynamics in the simplex is concerned, we can ignore it. Finally, to conclude the proof, suppose that α < γ(C) = max_{i>m} (deg_C(i) − m + 1). Then there exists i > m such that m − 1 + α < deg_C(i) and hence, dividing by m, we get π_i(x) − π(x) > 0 and then g_i(x) = φ(π_i(x)) − φ(π(x)) > 0, which implies that a transversal eigenvalue of J(x) is positive; that is, x is unstable.

Theorem 4 provides us with an immediate strategy to avoid unwanted local solutions, namely maximal cliques that are not maximum. Suppose that C is a maximal clique in G that we want to avoid. By letting α < γ(C), its characteristic vector x^C becomes an unstable stationary point of any imitation dynamics under f_α and thus will not be approached by any interior trajectory. Hence, if there is a clique D for which γ(D) < α still holds, there is a (more or less justified) hope of obtaining in the limit x^D, which automatically yields a larger maximal clique D. Unfortunately, two other cases could occur: (1) no other clique T satisfies γ(T) < α, that is, the absolute value of α is too large, and (2) even if there is such a clique, other attractors could emerge that are not characteristic vectors of a clique (note that this is excluded if α > 0 by theorem 3).

The proper choice of the parameter α is therefore a trade-off between the desire to remove unwanted maximal cliques and the emergence of spurious solutions. Instead of keeping the value of α fixed, our approach is to start with a sufficiently large negative α and adaptively increase it during the optimization process, in much the same spirit as simulated or mean field annealing procedures. Of course, in our case, the annealing parameter has no interpretation in terms of a hypothetical temperature.
The rationale behind this idea is that for values of α that are sufficiently negative, only the characteristic vectors of large maximal cliques will be stable, attractive points for the imitation dynamics, together with a set of spurious solutions. As the value of α increases, spurious solutions disappear, and at the same time, (characteristic vectors of) smaller maximal cliques become stable. We expect that at the beginning of the annealing process, the dynamics is attracted toward "promising" regions, and the search is further refined as the annealing parameter increases. In summary, a high-level description of the proposed algorithm is shown in Figure 2.

Algorithm:
1. Start with a sufficiently large negative α.
2. Let b be the barycenter of Δ and set x = b.
3. Run any imitation dynamics starting from x, under A_G + αI, until convergence, and let x be the converged point.
4. Unless a stopping condition is met, increase α and go to step 3.
5. Select α̂ with 0 < α̂ < 1 (e.g., α̂ = 1/2), run any imitation dynamics starting from the current x under A_G + α̂I until convergence, and extract a maximal clique from the converged solution.

Figure 2: Annealed Imitation Heuristic.

Note that the last step in the algorithm is necessary if we also want to extract the vertices comprising the clique found, as shown in theorem 3. It is clear that for the algorithm to work, we need to select an appropriate annealing schedule. To this end, we employ the following heuristic suggested in Bomze et al. (2002). Suppose that the underlying graph is a random one in the sense that edges are generated independently of each other with a certain probability q (in applications, q will be replaced by the actual graph density), and suppose that C is an unwanted clique of size m. Take δ > 0 small, say 0.01, and consider the quantity

γ_m = 1 − (1 − q)m − √(mq(1 − q)) δ^ν,    (6.7)
where ν = 1/(2(n − m)). Bomze et al. (2002) proved that γ(C) exceeds γ_m with probability 1 − δ. Thus, it makes sense to use γ_m as a heuristic proxy for the lower bound of γ(C), to avoid being attracted by a clique of size m. Furthermore, note that among all graphs with n vertices and m edges, the maximum possible clique number is the only integer c that satisfies the following relations:

c(c − 1)/2 ≤ m < c(c + 1)/2,    (6.8)

which, after some algebra, yields

√(8m + 1)/2 − 1/2 < c ≤ √(8m + 1)/2 + 1/2,    (6.9)

from which we get

c = ⌊√(8m + 1)/2 + 1/2⌋.    (6.10)
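Under our reading of equations 6.7 and 6.10 (the radical and floor were reconstructed from the layout, so treat this as a sketch), the schedule computations are a few lines of arithmetic; the function names and the sample parameters are our own:

```python
import math

def gamma_m(m, n, q, delta=0.01):
    """Equation 6.7 (as reconstructed): with probability 1 - delta, a clique C of
    size m in a random graph of order n and density q has gamma(C) > gamma_m."""
    nu = 1.0 / (2.0 * (n - m))
    return 1.0 - (1.0 - q) * m - math.sqrt(m * q * (1.0 - q)) * delta ** nu

def max_clique_upper_bound(num_edges):
    """Equation 6.10: the largest clique number possible with the given edge count."""
    return int(math.floor(math.sqrt(8 * num_edges + 1) / 2.0 + 0.5))

# A graph with 6 edges can contain at most a K4 (which has exactly 6 edges):
print(max_clique_upper_bound(6))   # 4
print(max_clique_upper_bound(5))   # 3

# Step 1 of the algorithm: initial alpha = (gamma_c + gamma_{c-1}) / 2.
n, q, m_edges = 200, 0.5, 9950     # hypothetical dense random graph
c = max_clique_upper_bound(m_edges)
alpha0 = 0.5 * (gamma_m(c, n, q) + gamma_m(c - 1, n, q))
```

As expected for the start of the schedule, alpha0 comes out strongly negative, destabilizing the characteristic vectors of all cliques smaller than the current size estimate.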
The previous results suggest a sort of two-level annealing strategy: the level of clique size, which in turn induces that of the "actual" annealing parameter α. More precisely, if we do not have any a priori information about the expected size of the maximum clique, we can use equation 6.10 to obtain an initial overestimate of it. By setting the initial value for α (step 1 of our algorithm) at some intermediate value between γ_c and γ_{c−1}, for example, α = (γ_c + γ_{c−1})/2, we expect that only the characteristic vectors of maximal cliques having size c will survive in f_α, together with many spurious solutions. After the initial cycle, we decrease c, recalculate γ_c and γ_{c−1}, and update α = (γ_c + γ_{c−1})/2 in step 4 as in the previous step. The whole process is iterated until either c reaches 1 or α becomes greater than zero.

7 Experimental Results

In this section we present experiments applying our annealed imitation heuristics to the same set of random and DIMACS graphs used in the previous experiments. For each graph considered, the algorithms were run using the two-level annealing schedule described at the end of the previous section. As for the internal cycle (step 3), we used the discretized versions of both the linear and exponential replicator dynamics. The processes were iterated until the Euclidean distance between two successive states became smaller than a threshold value. At the final cycle (step 5), the parameter α̂ was set to 1/2, and the dynamics were stopped when either a maximal clique (i.e., a local maximizer of f_{1/2} on Δ) was found or the distance between two successive points was smaller than a fixed threshold. When the process converged to a saddle point, the vector was perturbed, and the algorithm restarted from the new perturbed point.
Table 4 and Figure 3 show the results obtained with our annealed imitation heuristics (AIH) on the random graphs, in terms of clique sizes and computation time, respectively, while Tables 5 and 6 show the results obtained on the DIMACS benchmark. Several conclusions can be drawn from these experiments. First, both of our annealing heuristics perform, on average, significantly better than the
M. Pelillo and A. Torsello
Table 4: Results of the Annealed Imitation Heuristics (AIH) and State-of-the-Art Neural Network-Based or Optimization-Based Approaches on Random Graphs with Varying Order and Density.

| n | ρ | AIH Linear | AIH Exponential | CHD | MFA | SLDN | FTL | IHN | HNL |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.25 | 5.11 ± 0.53 | 5.22 ± 0.46 | 4.48 | — | 4.83 | 4.2 | — | — |
| 100 | 0.50 | 8.43 ± 0.77 | 8.84 ± 0.63 | 7.38 | 8.50 | 8.07 | 8.0 | 9.13 | 9 |
| 100 | 0.75 | 16.10 ± 0.92 | 16.43 ± 0.83 | 13.87 | — | 15.05 | 14.1 | — | — |
| 100 | 0.90 | 29.53 ± 1.71 | 30.20 ± 1.52 | 27.92 | 30.02 | — | — | — | 30 |
| 100 | 0.95 | 42.82 ± 1.74 | 42.94 ± 1.75 | — | — | — | — | — | — |
| 200 | 0.25 | 5.60 ± 0.60 | 5.87 ± 0.51 | — | — | — | 4.9 | — | — |
| 200 | 0.50 | 9.43 ± 0.89 | 10.14 ± 0.55 | — | — | — | 8.5 | 10.60 | 11 |
| 200 | 0.75 | 19.59 ± 1.14 | 19.96 ± 1.00 | — | — | — | — | — | — |
| 200 | 0.90 | 38.62 ± 1.80 | 39.77 ± 1.50 | — | — | — | — | — | 39 |
| 200 | 0.95 | 60.15 ± 1.79 | 60.60 ± 1.64 | — | — | — | — | — | — |
| 300 | 0.25 | 6.10 ± 0.54 | 6.30 ± 0.46 | — | — | — | 5.1 | — | — |
| 300 | 0.50 | 9.98 ± 0.92 | 10.89 ± 0.60 | — | — | — | 8.9 | 11.60 | 11 |
| 300 | 0.75 | 21.02 ± 1.11 | 21.78 ± 0.84 | — | — | — | — | — | — |
| 300 | 0.90 | 43.84 ± 2.02 | 45.33 ± 1.57 | — | — | — | — | — | 46 |
| 300 | 0.95 | 70.98 ± 2.35 | 72.09 ± 1.75 | — | — | — | — | — | — |
| 400 | 0.25 | 6.65 ± 0.50 | 6.43 ± 0.54 | 5.53 | — | 5.70 | 4.9 | — | — |
| 400 | 0.50 | 11.15 ± 0.76 | 11.24 ± 0.73 | 9.24 | 10.36 | 9.91 | 8.9 | 12.30 | — |
| 400 | 0.75 | 22.82 ± 0.89 | 22.80 ± 0.90 | 18.79 | — | 20.44 | 17.7 | — | — |
| 400 | 0.90 | 48.68 ± 1.51 | 48.65 ± 1.51 | 43.24 | 49.94 | — | — | — | — |
| 400 | 0.95 | 79.19 ± 2.00 | 79.00 ± 2.03 | — | — | — | — | — | — |
| 500 | 0.25 | 6.28 ± 0.68 | 6.68 ± 0.55 | — | — | — | 6.2 | — | — |
| 500 | 0.50 | 10.53 ± 0.84 | 11.73 ± 0.71 | — | — | — | 9.4 | 12.80 | 12 |
| 500 | 0.75 | 22.70 ± 1.24 | 23.91 ± 0.85 | — | — | — | — | — | — |
| 500 | 0.90 | 49.75 ± 2.10 | 52.26 ± 1.46 | — | — | — | — | — | 56 |
| 500 | 0.95 | 84.87 ± 2.16 | 86.50 ± 2.51 | — | — | — | — | — | — |
| 1000 | 0.25 | 6.61 ± 0.73 | 7.17 ± 0.45 | 6.03 | — | 6.17 | 5.8 | — | — |
| 1000 | 0.50 | 11.32 ± 0.85 | 12.73 ± 0.78 | 10.25 | — | 10.93 | 10.4 | — | — |
| 1000 | 0.75 | 24.88 ± 1.40 | 26.63 ± 1.03 | 21.26 | — | 23.19 | 21.4 | — | — |
| 1000 | 0.90 | 56.90 ± 2.40 | 60.46 ± 1.60 | — | — | — | — | — | — |
| 1000 | 0.95 | 101.98 ± 3.10 | 104.93 ± 3.47 | — | — | — | — | — | — |
corresponding plain replicator dynamics, where no annealing strategy is used, while paying a time penalty that is in most cases limited. Moreover, as we found in section 5.1 for the plain processes, the exponential version of AIH provides larger cliques than its linear counterpart, thereby improving on the results reported in Bomze et al. (2002). As for the comparison with other neural-based clique-finding algorithms, AIH performs substantially better than CHD and FTL (which were already outperformed by the plain dynamics), as well as SLDN, mean field annealing (MFA), and the inverted neurons network (INN). Furthermore, it provides results comparable to those obtained with the continuous-based heuristic (CBH) and Hopfield network learning (HNL).
Table 5: Results of the Annealed Imitation Heuristics (AIH) and State-of-the-Art Neural Network-Based or Optimization-Based Approaches on DIMACS Benchmark Graphs, Part I. Clique sizes, followed by CPU times in seconds for the two AIH variants.

| Graph | n | ρ | DIMACS Best | AIH Lin. | AIH Exp. | MFA | INN Avg. | IHN | CBH | QSH | Time AIH Lin. | Time AIH Exp. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| brock200_1 | 200 | 0.75 | 21 | 20 | 20 | 19 | — | — | 20 | 21 | 1.39 | 0.85 |
| brock200_2 | 200 | 0.50 | 12 | 10 | 10 | 9 | 9.1 | — | 12 | 12 | 0.50 | 0.27 |
| brock200_3 | 200 | 0.61 | 15 | 12 | 13 | 11 | — | — | 14 | 15 | 1.07 | 0.43 |
| brock200_4 | 200 | 0.66 | 17 | 16 | 16 | 14 | 13.4 | — | 16 | 17 | 1.77 | 0.38 |
| brock400_1 | 400 | 0.75 | 24 | 21 | 24 | 24 | — | — | 23 | 27 | 1.98 | 1.73 |
| brock400_2 | 400 | 0.75 | 25 | 22 | 24 | 21 | 21.3 | — | 24 | 29 | 2.06 | 3.34 |
| brock400_3 | 400 | 0.75 | 25 | 23 | 24 | 22 | — | — | 23 | 31 | 4.05 | 2.36 |
| brock400_4 | 400 | 0.75 | 33 | 23 | 23 | 23 | 21.3 | — | 24 | 33 | 6.57 | 1.59 |
| brock800_1 | 800 | 0.65 | 21 | 18 | 20 | 16 | — | — | 20 | 17 | 10.00 | 6.66 |
| brock800_2 | 800 | 0.65 | 21 | 18 | 18 | 16 | 16.9 | — | 19 | 24 | 10.06 | 7.03 |
| brock800_3 | 800 | 0.65 | 21 | 18 | 19 | 16 | — | — | 20 | 25 | 18.46 | 5.85 |
| brock800_4 | 800 | 0.65 | 21 | 18 | 18 | 15 | 16.5 | — | 19 | 26 | 18.52 | 21.35 |
| c-fat200-1 | 200 | 0.08 | 12 | 10 | 12 | 6 | — | 12 | 12 | 12 | 0.17 | 0.20 |
| c-fat200-2 | 200 | 0.16 | 24 | 22 | 24 | 24 | — | 24 | 24 | 24 | 0.21 | 0.18 |
| c-fat200-5 | 200 | 0.43 | 58 | 58 | 58 | 58 | — | 58 | 58 | 58 | 0.20 | 0.17 |
| c-fat500-1 | 500 | 0.04 | 14 | 12 | 14 | — | — | 14 | 14 | 14 | 1.03 | 1.31 |
| c-fat500-2 | 500 | 0.07 | 26 | 24 | 26 | — | — | 26 | 26 | 26 | 1.10 | 1.23 |
| c-fat500-5 | 500 | 0.19 | 64 | 62 | 64 | — | — | 64 | 64 | 64 | 1.59 | 1.21 |
| c-fat500-10 | 500 | 0.37 | 126 | 124 | 126 | — | — | — | 126 | 126 | 1.27 | 1.18 |
| hamming6-2 | 64 | 0.90 | 32 | 32 | 32 | — | — | 32 | 32 | 32 | 0.06 | 0.03 |
| hamming6-4 | 64 | 0.35 | 4 | 4 | 4 | — | — | 4 | 4 | 4 | 0.02 | 0.02 |
Table 5: Continued.

| Graph | n | ρ | DIMACS Best | AIH Lin. | AIH Exp. | MFA | INN Avg. | IHN | CBH | QSH | Time AIH Lin. | Time AIH Exp. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| hamming8-2 | 256 | 0.97 | 128 | 128 | 128 | — | — | 128 | 128 | 128 | 2.22 | 0.65 |
| hamming8-4 | 256 | 0.64 | 16 | 16 | 16 | — | 16.0 | 16 | 16 | 16 | 0.39 | 0.36 |
| hamming10-2 | 1024 | 0.99 | 512 | 512 | 512 | — | — | 512 | — | — | 405.76 | 150.89 |
| hamming10-4 | 1024 | 0.83 | 36 | 33 | 33 | — | 32.0 | 36 | — | — | 155.06 | 16.61 |
| johnson8-2-4 | 28 | 0.56 | 4 | 4 | 4 | — | — | 4 | 4 | 4 | 0.01 | 0.01 |
| johnson8-4-4 | 70 | 0.77 | 14 | 14 | 14 | — | — | 14 | 14 | 14 | 0.03 | 0.05 |
| johnson16-2-4 | 120 | 0.76 | 8 | 8 | 8 | — | — | 8 | 8 | 8 | 0.12 | 0.07 |
| johnson32-2-4 | 496 | 0.88 | 16 | 16 | 16 | — | — | 16 | 16 | 16 | 4.54 | 1.69 |
| keller4 | 171 | 0.65 | 11 | 8 | 9 | — | 8.4 | — | 10 | 11 | 0.67 | 0.92 |
| keller5 | 776 | 0.75 | 27 | 15 | 16 | — | 16.7 | — | 21 | 24 | 31.83 | 679.00 |
| keller6 | 3361 | 0.82 | 59 | 31 | 31 | — | 33.4 | — | — | — | 1444.15 | 869.99 |
Table 6: Results of the Annealed Imitation Heuristics (AIH) and State-of-the-Art Neural Network-Based or Optimization-Based Approaches on DIMACS Benchmark Graphs, Part II. Clique sizes, followed by CPU times in seconds for the two AIH variants.

| Graph | n | ρ | DIMACS Best | AIH Lin. | AIH Exp. | MFA | INN Avg. | IHN | CBH | QSH | Time AIH Lin. | Time AIH Exp. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| p_hat300-1 | 300 | 0.24 | 8 | 6 | 7 | — | 7.0 | 8 | 8 | 7 | 0.58 | 1.05 |
| p_hat300-2 | 300 | 0.49 | 25 | 24 | 25 | — | 21.9 | 25 | 25 | 24 | 0.70 | 2.15 |
| p_hat300-3 | 300 | 0.74 | 36 | 33 | 36 | — | 33.1 | 36 | 36 | 33 | 1.00 | 0.98 |
| p_hat500-1 | 500 | 0.25 | 9 | 8 | 8 | — | — | 9 | 9 | 9 | 1.71 | 3.98 |
| p_hat500-2 | 500 | 0.50 | 36 | 33 | 36 | — | — | 36 | 35 | 33 | 2.13 | 6.07 |
| p_hat500-3 | 500 | 0.75 | 50 | 43 | 49 | — | — | 49 | 49 | 46 | 3.20 | 5.68 |
| p_hat700-1 | 700 | 0.25 | 11 | 8 | 9 | — | 8.4 | 11 | 11 | 8 | 3.45 | 3.70 |
| p_hat700-2 | 700 | 0.50 | 44 | 38 | 44 | — | 42.0 | 44 | 44 | 42 | 4.60 | 11.01 |
| p_hat700-3 | 700 | 0.75 | 62 | 56 | 60 | — | 58.2 | 61 | 60 | 59 | 6.25 | 7.94 |
| p_hat1000-1 | 1000 | 0.25 | 10 | 9 | 10 | — | — | 10 | — | — | 7.19 | 8.89 |
| p_hat1000-2 | 1000 | 0.50 | 46 | 43 | 43 | — | — | 46 | — | — | 9.35 | 10.65 |
| p_hat1000-3 | 1000 | 0.75 | 66 | 60 | 64 | — | — | 68 | — | — | 12.43 | 32.35 |
| p_hat1500-1 | 1500 | 0.25 | 11 | 9 | 10 | — | 9.4 | — | — | — | 16.60 | 27.84 |
| p_hat1500-2 | 1500 | 0.50 | 65 | 60 | 64 | — | 60.2 | — | — | — | 23.09 | 31.57 |
| p_hat1500-3 | 1500 | 0.75 | 94 | 89 | 92 | — | 86.2 | — | — | — | 46.56 | 50.67 |
| san200_0.7_1 | 200 | 0.70 | 30 | 15 | 15 | — | — | 30 | 15 | 30 | 3.60 | 1.54 |
| san200_0.7_2 | 200 | 0.70 | 18 | 12 | 12 | — | — | 15 | 12 | 18 | 3.65 | 1.39 |
| san200_0.9_1 | 200 | 0.90 | 70 | 46 | 46 | 42 | — | 70 | 46 | 70 | 21.84 | 3.10 |
| san200_0.9_2 | 200 | 0.90 | 60 | 37 | 38 | — | — | 41 | 36 | 60 | 3.02 | 3.64 |
| san200_0.9_3 | 200 | 0.90 | 44 | 36 | 35 | 33 | — | — | 30 | 35 | 6.57 | 5.59 |
| san400_0.5_1 | 400 | 0.50 | 13 | 7 | 7 | — | — | — | 8 | 9 | 2.22 | 2.80 |
Table 6: Continued.

| Graph | n | ρ | DIMACS Best | AIH Lin. | AIH Exp. | MFA | INN Avg. | IHN | CBH | QSH | Time AIH Lin. | Time AIH Exp. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| san400_0.7_1 | 400 | 0.70 | 40 | 20 | 20 | — | — | 40 | 20 | 40 | 21.83 | 9.53 |
| san400_0.7_2 | 400 | 0.70 | 30 | 15 | 15 | — | — | 30 | 15 | 30 | 16.14 | 6.18 |
| san400_0.7_3 | 400 | 0.70 | 22 | 12 | 12 | — | — | — | 14 | 16 | 14.82 | 4.90 |
| san400_0.9_1 | 400 | 0.90 | 100 | 50 | 51 | — | — | 100 | 50 | 100 | 90.28 | 14.45 |
| san1000 | 1000 | 0.50 | 10 | 8 | 8 | — | — | 10 | — | — | 27.46 | 19.37 |
| sanr200_0.7 | 200 | 0.70 | 18 | 18 | 18 | 18 | — | 17 | 18 | 15 | 0.65 | 0.35 |
| sanr200_0.9 | 200 | 0.90 | 42 | 38 | 41 | 41 | — | 41 | 41 | 37 | 4.18 | 1.74 |
| sanr400_0.5 | 400 | 0.50 | 13 | 12 | 12 | — | — | 12 | 12 | 11 | 1.26 | 1.63 |
| sanr400_0.7 | 400 | 0.70 | 21 | 18 | 21 | — | — | 21 | 20 | 18 | 4.34 | 2.34 |
Figure 3: CPU time of the annealed imitation heuristics on random graphs with varying order and density. The x-axis represents the edge density, while the y-axis denotes time (in seconds).
The comparison with the iterative Hopfield nets (IHN) and the QSH algorithm is not as straightforward and depends on the specific graph class. More specifically, on the c-fat, hamming, and johnson graphs, the proposed AIH approaches perform as well as IHN and QSH. The annealed dynamics outperform QSH on the sanr graphs and provide slightly better results on the p_hat graphs, while QSH performs better on the brock
and the keller graphs. Note, however, that QSH does not report results for the largest instances, such as hamming10-2, hamming10-4, keller6, san1000, and all the p_hat graphs with 1000 or more vertices. As for IHN, that approach performs slightly better than AIH on the p_hat graphs, although no results for the largest instances are given, and it gives comparable results on the sanr graphs; no IHN results are reported for the brock and keller graphs, which are notoriously hard instances (Brockington & Culberson, 1996). The san graphs deserve a separate discussion. Indeed, these graphs were already found to be particularly hard for Motzkin-Straus-based approaches, like the proposed dynamics and CBH (Bomze et al., 2002). As expected, our approaches fail to provide good results on them and are substantially outperformed by IHN and QSH.

8 Conclusions

In this letter, we have introduced a wide family of game dynamic equations known as payoff-monotonic dynamics and have shown how their dynamical properties make any member of this family a potential heuristic for solving standard quadratic programs and, in particular, the maximum clique problem (MCP). Such systems can easily be implemented in a parallel network of locally interacting computational units and can be coded in a few lines of any high-level programming language. We have shown experimentally that an exponential version of the classic (linear) replicator dynamics is particularly effective at finding maximum or near-maximum cliques for several graph classes, being dramatically faster and even more accurate than its linear counterpart. However, these models are inherently unable to avoid poor local solutions on harder graph instances. In an attempt to avoid local optima, we have focused on a particular subclass of these dynamics, used to model the evolution of behavior via imitation processes, and have developed the annealed imitation dynamics.
This is a class of heuristics for the MCP whose basic ingredients are (1) a parameterized continuous formulation of the problem, (2) an instability analysis of the equilibria of imitation dynamics, and (3) a principled way of varying a regularization parameter during the evolution process. Experiments on various benchmark graphs have shown that the annealed imitation class contains algorithms that substantially outperform classic neural network algorithms for maximum clique, such as mean field annealing, and that compare well with sophisticated MCP heuristics from the continuous optimization literature.

Appendix A: Discrete-Time Replicator Dynamics

The results presented in section 3, and in particular theorem 1, show that continuous-time payoff-monotonic dynamics can be usefully employed to
find local solutions of standard quadratic programs. In practical computer implementations, however, we need a way of discretizing the models. A customary way of doing this is given by the following difference equations:

$$x_i(t+1) = \frac{x_i(t)\,\pi_i(t)}{\sum_{j=1}^n x_j(t)\,\pi_j(t)} \tag{A.1}$$

and

$$x_i(t+1) = \frac{x_i(t)\,e^{\kappa \pi_i(t)}}{\sum_{j=1}^n x_j(t)\,e^{\kappa \pi_j(t)}}, \tag{A.2}$$
which correspond to well-known discretizations of equations 2.6 and 2.7, respectively (Cabrales & Sobel, 1992; Gaunersdorfer & Hofbauer, 1995; Hofbauer & Sigmund, 1998; Weibull, 1995). Note that model A.1 is the standard discrete-time replicator dynamics, which has already proven to be remarkably effective in tackling maximum clique and related problems, and to be competitive with other, more elaborate neural network heuristics (Bomze, 1997; Bomze et al., 1997, 2000; Pelillo, 1995, 1999; Pelillo et al., 1999). Equation A.2 has been used in Pelillo (1999, 2002) as a heuristic for graph and tree isomorphism problems. Like their continuous counterparts, these dynamics are payoff monotonic, that is,

$$\frac{x_i(t+1) - x_i(t)}{x_i(t)} > \frac{x_j(t+1) - x_j(t)}{x_j(t)} \iff \pi_i(t) > \pi_j(t).$$

It is a well-known result in evolutionary game theory (Weibull, 1995; Hofbauer & Sigmund, 1998) that the fundamental theorem of natural selection (see theorem 1) also holds for the first-order linear dynamics (see equation A.1); namely, the average consistency $x^\top A x$ is a (strict) Lyapunov function for equation A.1, provided that $A = A^\top$. In other words, $x(t)^\top A x(t) < x(t+1)^\top A x(t+1)$ unless $x(t)$ is a stationary point. Unfortunately, unlike the continuous-time case, there is no such result for the discrete exponential dynamics, equation A.2: there is no guarantee that for any fixed value of the parameter κ, the dynamics increase the value of $x^\top A x$. Indeed, with high values of this parameter, the dynamics can exhibit oscillatory behavior (Pelillo, 1999). However, a recent result by Bomze (2005) allows us to define an adaptive approach that is guaranteed to find a (local) maximizer of $x^\top A x$.
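Equations A.1 and A.2 take only a few lines to implement. The sketch below (our illustration, not the authors' code) iterates the linear dynamics A.1 on a small graph with payoff matrix A = A_G + αI and checks the Lyapunov property stated above:

```python
import numpy as np

def linear_replicator_step(x, A):
    """One iteration of equation A.1 (payoffs pi = Ax must be nonnegative)."""
    y = x * (A @ x)
    return y / y.sum()

def exponential_replicator_step(x, A, kappa):
    """One iteration of equation A.2 with gain parameter kappa."""
    y = x * np.exp(kappa * (A @ x))
    return y / y.sum()

# 5-vertex graph: maximum clique {0, 1, 2}, plus the path 2-3-4.
A_G = np.zeros((5, 5))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]:
    A_G[i, j] = A_G[j, i] = 1.0
A = A_G + 0.5 * np.eye(5)       # regularized payoff matrix, alpha = 0.5

x = np.full(5, 0.2)             # barycenter of the simplex
for _ in range(200):
    x_new = linear_replicator_step(x, A)
    assert x_new @ A @ x_new >= x @ A @ x - 1e-12   # x'Ax never decreases
    x = x_new
print(np.round(x, 3))           # converges to the clique barycenter (1/3, 1/3, 1/3, 0, 0)
```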
We define the ε-stationary points as

$$\mathrm{Stat}_\varepsilon = \Big\{ x \in \Delta : \sum_i x_i \big( (Ax)_i - x^\top A x \big)^2 < \varepsilon \Big\}.$$

Clearly, this set is the union of open neighborhoods around the stationary points of any payoff-monotonic dynamics, and as ε → 0 it shrinks toward the stationary points themselves. Let $\bar m_A = \max_{ij} |a_{ij}|$, $\mathrm{span}(A) = \max_{ij} a_{ij} - \min_{ij} a_{ij}$, and, for any given ε, define $\kappa_A(\varepsilon)$ as the unique κ > 0 that satisfies

$$\kappa \exp(2 \bar m_A \kappa) = \frac{2\varepsilon}{\mathrm{span}(A)\,(\varepsilon + 2 \bar m_A^2)}.$$

Theorem 5. Suppose $A = A^\top$. Then for arbitrary ε > 0 and any positive $\kappa \le \kappa_A(\varepsilon)$, the objective function $x^\top A x$ is strictly increasing over time along the parts of trajectories of equation A.2 that are not ε-stationary, that is,

$$x(t)^\top A x(t) < x(t+1)^\top A x(t+1) \quad \text{if} \quad x(t) \notin \mathrm{Stat}_\varepsilon.$$
Proof. See Bomze (2005).

This means that for each point $x \in \Delta$ we can find a κ for which one iteration of equation A.2 increases $x^\top A x$. That is, by setting $\kappa = \kappa_A(\varepsilon)$ at each iteration, we are guaranteed to increase $x^\top A x$ along the trajectories of the system. Note, however, that this estimate of $\kappa_A(\varepsilon)$ is not tight. In particular, our experience shows that it severely underestimates the value of κ, slowing the convergence of the dynamics considerably. To obtain a better estimate of the parameter κ and improve the performance of the approach, in our experiments we employed the adaptive exponential dynamics described in Figure 4, which, as the next proposition shows, has $x^\top A x$ as a Lyapunov function.

Proposition 8. If the payoff matrix A is symmetric, then the function $x^\top A x$ is strictly increasing along any nonconstant trajectory of the adaptive exponential dynamics defined above. In other words, $x(t)^\top A x(t) \le x(t+1)^\top A x(t+1)$ for all t, with equality if and only if $x = x(t)$ is a stationary point.

Proof. By construction, the function is guaranteed to grow as long as a κ that increases $x^\top A x$ can be found. Theorem 5 guarantees that such a κ can indeed be found.
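Under our reading of the defining equation for κ_A(ε) above (the reconstruction of that formula from the typeset original is ours), the left-hand side κ e^{2m̄_A κ} is strictly increasing in κ, so the unique root can be found by bisection; a sketch:

```python
import math

def kappa_A(A, eps):
    """Bisection solve of kappa * exp(2*mbar*kappa) = 2*eps / (span*(eps + 2*mbar**2))."""
    entries = [a for row in A for a in row]
    mbar = max(abs(a) for a in entries)           # \bar m_A = max_ij |a_ij|
    span = max(entries) - min(entries)            # span(A)
    target = 2 * eps / (span * (eps + 2 * mbar ** 2))
    lo, hi = 0.0, 1.0
    while hi * math.exp(2 * mbar * hi) < target:  # bracket the root
        hi *= 2.0
    for _ in range(200):                          # bisect to machine precision
        mid = (lo + hi) / 2.0
        if mid * math.exp(2 * mbar * mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

A = [[0.5, 1.0, 1.0], [1.0, 0.5, 0.0], [1.0, 0.0, 0.5]]
k = kappa_A(A, eps=1e-3)
```

As the text observes, the resulting κ is very small (here on the order of 10⁻³), which is why the adaptive scheme of Figure 4 is preferred in practice.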
Algorithm
1. Start with a sufficiently large κ and from an arbitrary x(0) ∈ Δ. Set t ← 0.
2. While x(t) is not stationary do
3.   Compute x(t+1) using equation A.2;
4.   While x(t+1)⊤Ax(t+1) ≤ x(t)⊤Ax(t) do
5.     Reduce κ;
6.     Recompute x(t+1) using equation A.2;
7.   Endwhile;
8.   t ← t + 1;
9. Endwhile;

Figure 4: Adaptive exponential (discrete-time) replicator dynamics.
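A direct transcription of Figure 4 in Python (a sketch: halving is our choice for "Reduce κ", and stationarity is tested via the ε-stationarity measure defined earlier, with a floor on κ to guard against stalled floating-point comparisons):

```python
import numpy as np

def adaptive_exponential_replicator(A, x0, kappa=10.0, tol=1e-12, max_iter=10000):
    """Adaptive exponential (discrete-time) replicator dynamics of Figure 4."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        pi = A @ x
        if np.sum(x * (pi - x @ pi) ** 2) < tol:   # step 2: x(t) is stationary
            break
        y = x * np.exp(kappa * pi)                 # step 3: equation A.2
        x_new = y / y.sum()
        while x_new @ A @ x_new <= x @ A @ x and kappa > 1e-8:
            kappa /= 2.0                           # step 5: reduce kappa
            y = x * np.exp(kappa * pi)             # step 6: recompute x(t+1)
            x_new = y / y.sum()
        x = x_new                                  # step 8: t <- t + 1
    return x

# 5-vertex graph whose maximum clique is the triangle {0, 1, 2}.
A_G = np.zeros((5, 5))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]:
    A_G[i, j] = A_G[j, i] = 1.0
x = adaptive_exponential_replicator(A_G + 0.5 * np.eye(5), np.full(5, 0.2))
print(np.round(x, 3))   # barycenter of the maximum clique
```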
Appendix B: Proof of Theorem 3

Theorem 3. Let C be a subset of vertices of a graph G, and let $x^C$ be its characteristic vector. Then, for any 0 < α < 1, C is a maximal (maximum) clique of G if and only if $x^C$ is a local (global) solution of program 4.3. Moreover, all solutions of program 4.3 are strict and are characteristic vectors of maximal cliques of G.

Proof. Suppose that C is a maximal clique of G, and let |C| = m. We shall prove that $x^C$ is a strict local solution of program 4.3. To this end, we use standard second-order sufficiency conditions for constrained optimization
(Luenberger, 1984). Let $A_G = (a_{ij})$ be the adjacency matrix of G and, for notational simplicity, put $A = A_G + \alpha I$. First, we need to show that $x^C$ is a KKT point for program 4.3. It is easy to see that since C is a maximal clique, we have

$$(A_G x^C)_i \;\begin{cases} = \frac{m-1}{m} & \text{if } i \in C \\[2pt] \le \frac{m-1}{m} & \text{if } i \notin C. \end{cases}$$

Hence, if $i \in C$, then

$$(A x^C)_i = (A_G x^C)_i + \alpha x_i^C = \frac{m-1}{m} + \frac{\alpha}{m}, \tag{B.1}$$

and if $i \notin C$,

$$(A x^C)_i = (A_G x^C)_i \le \frac{m-1}{m} < \frac{m-1}{m} + \frac{\alpha}{m}. \tag{B.2}$$

Therefore, conditions 3.2 are satisfied and $x^C$ is a KKT point. Note that the Lagrange multipliers $\mu_i$ defined in section 3 are given by

$$\mu_i = \frac{m-1+\alpha}{m} - (A x^C)_i.$$
To conclude the first part of the proof, it remains to show that the Hessian of the Lagrangian associated with program 4.3, which in this case is simply $A_G + \alpha I$, is negative definite on the following subspace:

$$\Gamma = \{ y \in \mathbb{R}^n : e^\top y = 0 \text{ and } y_i = 0 \text{ for all } i \in \Upsilon \},$$

where $\Upsilon = \{ i \in V : x_i^C = 0 \text{ and } \mu_i > 0 \}$. But from equation B.2, $\Upsilon = V \setminus C$. Hence, for all $y \in \Gamma$, we have

$$
y^\top A y = \sum_{i=1}^n y_i \sum_{j=1}^n a_{ij} y_j + \alpha \sum_{i=1}^n y_i^2
= \sum_{i \in C} y_i \sum_{j \in C} a_{ij} y_j + \alpha \sum_{i \in C} y_i^2
= \sum_{i \in C} y_i \Big( \sum_{j \in C} y_j - y_i \Big) + \alpha \sum_{i \in C} y_i^2
= (\alpha - 1) \sum_{i \in C} y_i^2
= (\alpha - 1)\, y^\top y \le 0,
$$

with equality if and only if y = 0, the null vector. This proves that $A_G + \alpha I$ is negative definite on Γ, as required.

To prove the converse, suppose that $x^C \in \Delta$ is a local solution to program 4.3 and hence a KKT point. By proposition 2, $x^C$ is also a stationary point for payoff-monotonic dynamics, and since A has positive diagonal entries $a_{ii} = \alpha > 0$, all the hypotheses of proposition 4 are fulfilled. Therefore, it follows that C is a clique (i.e., $a_{ij} > 0$ for all $i, j \in C$); otherwise $x^C$ could not be a local solution of program 4.3. On the other hand, from proposition 5, C is also a maximal clique. Furthermore, if x is any local solution, and hence a KKT point, of program 4.3, then necessarily $x = x^S$, where $S = \sigma(x)$. Geometrically, this means that x is the barycenter of its own face. In fact, from the previous discussion, S has to be a (maximal) clique. Therefore, for all $i \in S$,

$$(A_G x)_i + \alpha x_i = 1 - (1 - \alpha) x_i = \lambda$$

for some constant λ. This amounts to saying that $x_i$ is constant for all $i \in \sigma(x)$, and $\sum_i x_i = 1$ yields $x_i = 1/|S|$. From what we have seen in the first part of the proof, this also shows that all local solutions of program 4.3 are strict. Finally, as for the "global/maximum" part of the theorem, simply notice that at local solutions $x = x^S$ of program 4.3, with $S = \sigma(x)$ a maximal clique, the value of the objective function $f_\alpha$ is $1 - (1 - \alpha)/|S|$.
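The closed-form value $f_\alpha(x^S) = 1 - (1-\alpha)/|S|$ at clique barycenters is easy to verify numerically; the following sketch (our illustration) checks it on a small graph whose maximal cliques are {0, 1, 2}, {2, 3}, and {3, 4}:

```python
import numpy as np

def f_alpha(x, A_G, alpha):
    """Objective of program 4.3: x'(A_G + alpha I)x."""
    return x @ (A_G + alpha * np.eye(len(x))) @ x

# 5-vertex graph: the triangle {0, 1, 2} plus the path 2-3-4.
A_G = np.zeros((5, 5))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]:
    A_G[i, j] = A_G[j, i] = 1.0

alpha = 0.25
for clique in [(0, 1, 2), (2, 3), (3, 4)]:        # the maximal cliques of the graph
    x = np.zeros(5)
    x[list(clique)] = 1.0 / len(clique)           # characteristic vector x^S
    assert np.isclose(f_alpha(x, A_G, alpha), 1 - (1 - alpha) / len(clique))
```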
Acknowledgments

We thank Manuel Bomze for many stimulating discussions and Claudio Rossi for his help in the early stages of this work.

References

Aarts, E., & Korst, J. (1989). Simulated annealing and Boltzmann machines. New York: Wiley.
Ballard, D. H., Gardner, P. C., & Srinivas, M. A. (1987). Graph problems and connectionist architectures (Tech. Rep. No. TR 167). Rochester, NY: University of Rochester.
Bertoni, A., Campadelli, P., & Grossi, G. (2002). A neural algorithm for the maximum clique problem: Analysis, experiments and circuit implementation. Algorithmica, 33(1), 71–88.
Bhatia, N. P., & Szegő, G. P. (1970). Stability theory of dynamical systems. Berlin: Springer-Verlag.
Bomze, I. M. (1986). Non-cooperative two-person games in biology: A classification. Int. J. Game Theory, 15(1), 31–57.
Bomze, I. M. (1997). Evolution towards the maximum clique. J. Global Optim., 10, 143–164.
Bomze, I. M. (1998). On standard quadratic optimization problems. J. Global Optim., 13, 369–387.
Bomze, I. M. (2005). Portfolio selection via replicator dynamics and projections of indefinite estimated covariances. Dynamics of Continuous, Discrete and Impulsive Systems B, 12, 527–564.
Bomze, I. M., Budinich, M., Pardalos, P. M., & Pelillo, M. (1999). The maximum clique problem. In D.-Z. Du & P. M. Pardalos (Eds.), Handbook of combinatorial optimization (Suppl. Vol. A, pp. 1–74). Boston: Kluwer.
Bomze, I. M., Budinich, M., Pelillo, M., & Rossi, C. (2002). Annealed replication: A new heuristic for the maximum clique problem. Discr. Appl. Math., 121(1–3), 27–49.
Bomze, I. M., Pelillo, M., & Giacomini, R. (1997). Evolutionary approach to the maximum clique problem: Empirical evidence on a larger scale. In I. M. Bomze, T. Csendes, R. Horst, & P. M. Pardalos (Eds.), Developments in global optimization (pp. 95–108). Dordrecht: Kluwer.
Bomze, I. M., Pelillo, M., & Stix, V. (2000). Approximating the maximum weight clique using replicator dynamics. IEEE Trans. Neural Networks, 11(6), 1228–1241.
Boppana, R., & Halldórsson, M. M. (1992). Approximating maximum independent sets by excluding subgraphs. BIT, 32, 180–196.
Brockington, M., & Culberson, J. C. (1996). Camouflaging independent sets in quasi-random graphs. In D. Johnson & M. Trick (Eds.), Cliques, coloring and satisfiability: Second DIMACS implementation challenge (pp. 75–88). Providence, RI: American Mathematical Society.
Busygin, S., Butenko, S., & Pardalos, P. M. (2002). A heuristic for the maximum independent set problem based on optimization of a quadratic over a sphere. J. Comb. Optim., 6, 287–297.
Cabrales, A., & Sobel, J. (1992). On the limit points of discrete selection dynamics. J. Econom. Theory, 57, 407–419.
Fisher, R. A. (1930). The genetical theory of natural selection. New York: Oxford University Press.
Fudenberg, D., & Levine, D. K. (1998). The theory of learning in games. Cambridge, MA: MIT Press.
Funabiki, N., Takefuji, Y., & Lee, K.-C. (1992). A neural network model for finding a near-maximum clique. J. Parallel Distrib. Comput., 14, 340–344.
Gaunersdorfer, A., & Hofbauer, J. (1995). Fictitious play, Shapley polygons, and the replicator equation. Games Econom. Behav., 11, 279–303.
Gee, A. W., & Prager, R. W. (1994). Polyhedral combinatorics and neural networks. Neural Computation, 6, 161–180.
Gibbons, L. E., Hearn, D. W., Pardalos, P. M., & Ramana, M. V. (1997). Continuous characterizations of the maximum clique problem. Math. Oper. Res., 22, 754–768.
Godbeer, G. H., Lipscomb, J., & Luby, M. (1988). On the computational complexity of finding stable state vectors in connectionist models (Hopfield nets) (Tech. Rep. No. 208/88). Toronto: University of Toronto.
Grossman, T. (1996). Applying the INN model to the maximum clique problem. In D. Johnson & M. Trick (Eds.), Cliques, coloring and satisfiability: Second DIMACS implementation challenge (pp. 122–145). Providence, RI: American Mathematical Society.
Grötschel, M., Lovász, L., & Schrijver, A. (1993). Geometric algorithms and combinatorial optimization. Berlin: Springer-Verlag.
Håstad, J. (1996). Clique is hard to approximate within n^{1−ε}. In Proc. 37th Ann. Symp. Found. Comput. Sci. (pp. 627–636). Los Alamitos, CA: IEEE Computer Society Press.
Hirsch, M. W., & Smale, S. (1974). Differential equations, dynamical systems, and linear algebra. New York: Academic Press.
Hofbauer, J. (1995). Imitation dynamics for games. Unpublished manuscript, Collegium Budapest.
Hofbauer, J., & Sigmund, K. (1998). Evolutionary games and population dynamics. Cambridge: Cambridge University Press.
Hummel, R. A., & Zucker, S. W. (1983). On the foundations of relaxation labeling processes. IEEE Trans. Pattern Anal. Machine Intell., 5, 267–287.
Jagota, A. (1995). Approximating maximum clique with a Hopfield neural network. IEEE Trans. Neural Networks, 6, 724–735.
Jagota, A., Pelillo, M., & Rangarajan, A. (2000). A new deterministic annealing algorithm for maximum clique. In Proc. IJCNN'2000: Int. J. Conf. Neural Networks (pp. 505–508). Piscataway, NJ: IEEE Press.
Jagota, A., & Regan, K. W. (1997). Performance of neural net heuristics for maximum clique on diverse highly compressible graphs. J. Global Optim., 10, 439–465.
Jagota, A., Sanchis, L., & Ganesan, R. (1996).
Approximately solving maximum clique using neural networks and related heuristics. In D. Johnson & M. Trick (Eds.), Cliques, coloring and satisfiability: Second DIMACS implementation challenge (pp. 169–204). Providence, RI: American Mathematical Society.
Johnson, D., & Trick, M. (Eds.). (1996). Cliques, coloring and satisfiability: Second DIMACS implementation challenge. Providence, RI: American Mathematical Society.
Lin, F., & Lee, K. (1992). A parallel computation network for the maximum clique problem. In Proc. 1st Int. Conf. Fuzzy Theory Tech. Baton Rouge, LA.
Luenberger, D. G. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.
Maynard Smith, J. (1982). Evolution and the theory of games. Cambridge: Cambridge University Press.
Miller, D. A., & Zucker, S. W. (1992). Efficient simplex-like methods for equilibria of nonsymmetric analog networks. Neural Computation, 4, 167–190.
Miller, D. A., & Zucker, S. W. (1999). Computing with self-excitatory cliques: A model and an application to hyperacuity-scale computation in visual cortex. Neural Computation, 11, 21–66.
Motzkin, T. S., & Straus, E. G. (1965). Maxima for graphs and a new proof of a theorem of Turán. Canad. J. Math., 17, 533–540.
Papadimitriou, C. H., & Steiglitz, K. (1982). Combinatorial optimization: Algorithms and complexity. Englewood Cliffs, NJ: Prentice Hall.
Pardalos, P. M., & Rodgers, G. P. (1990). Computational aspects of a branch and bound algorithm for quadratic zero-one programming. Computing, 45, 131–144.
Pekergin, F., Morgül, Ö., & Güzeliş, C. (1999). A saturated linear dynamical network for approximating maximum clique. IEEE Trans. Circuits Syst. I, 46(6), 677–685.
Pelillo, M. (1995). Relaxation labeling networks for the maximum clique problem. J. Artif. Neural Networks, 2, 313–328.
Pelillo, M. (1999). Replicator equations, maximal cliques, and graph isomorphism. Neural Computation, 11(8), 2023–2045.
Pelillo, M. (2002). Matching free trees, maximal cliques, and monotone game dynamics. IEEE Trans. Pattern Anal. Machine Intell., 24(11), 1535–1541.
Pelillo, M., & Jagota, A. (1995). Feasible and infeasible maxima in a quadratic program for maximum clique. J. Artif. Neural Networks, 2, 411–420.
Pelillo, M., Siddiqi, K., & Zucker, S. W. (1999). Matching hierarchical structures using association graphs. IEEE Trans. Pattern Anal. Machine Intell., 21(11), 1105–1120.
Ramanujam, J., & Sadayappan, P. (1988). Optimization by neural networks. In Proc. IEEE Int. Conf. Neural Networks (pp. 325–332). Piscataway, NJ: IEEE Press.
Rosenfeld, A., Hummel, R. A., & Zucker, S. W. (1976). Scene labeling by relaxation operations. IEEE Trans. Syst. Man and Cybern., 6, 420–433.
Samuelson, L. (1997). Evolutionary games and equilibrium selection. Cambridge, MA: MIT Press.
Shrivastava, Y., Dasgupta, S., & Reddy, S. (1990). Neural network solutions to a graph theoretic problem. In Proc. IEEE Int. Symp. Circuits Syst. (pp. 2528–2531). Piscataway, NJ: IEEE Press.
Shrivastava, Y., Dasgupta, S., & Reddy, S. M. (1992). Guaranteed convergence in a class of Hopfield networks.
IEEE Trans. Neural Networks, 3, 951–961.
Takefuji, Y., Chen, L., Lee, K., & Huffman, J. (1990). Parallel algorithms for finding a near-maximum independent set of a circle graph. IEEE Trans. Neural Networks, 1, 263–267.
Wang, R. L., Tang, Z., & Cao, Q. P. (2003). An efficient approximation algorithm for finding a maximum clique using Hopfield network learning. Neural Computation, 15, 1605–1619.
Weibull, J. W. (1995). Evolutionary game theory. Cambridge, MA: MIT Press.
Wu, J., Harada, T., & Fukao, T. (1994). New method to the solution of maximum clique problem: Mean-field approximation algorithm and its experimentation. In Proc. IEEE Int. Conf. Syst., Man, Cybern. (pp. 2248–2253). Piscataway, NJ: IEEE Press.
Received January 27, 2005; accepted September 9, 2005.
NOTE
Communicated by Anthony Bell
Correlation and Independence in the Neural Code Shun-ichi Amari [email protected]
Hiroyuki Nakahara [email protected] RIKEN Brain Science Institute, Wako-shi, Saitama, Japan
The decoding scheme of a stimulus can be different from the stochastic encoding scheme in neural population coding. The stochastic fluctuations are not independent in general, but an independent version could be used for ease of decoding. How much information is lost by using this unfaithful model for decoding? There have been discussions concerning this loss of information (Nirenberg & Latham, 2003; Schneidman, Bialek, & Berry, 2003). We elucidate the Nirenberg-Latham loss from the point of view of information geometry.
1 Introduction

The brain retains information about a stimulus in an excitation pattern of a population of neurons. The neural excitation is stochastic, so the information is kept and processed in terms of probability distributions. The stochastic process of generating an excitation pattern from a stimulus is described by an encoding scheme, and in general this encoding scheme produces correlations among neurons. However, it is tractable, and plausible, for the brain to use an uncorrelated model for further processing and decoding. How much information is lost by using an unfaithful model for decoding? This is a fundamental question discussed in Nirenberg and Latham (2003) and Schneidman, Bialek, and Berry (2003). Wu, Amari, and Nakahara (2001) studied this problem in terms of Fisher information and concluded that the loss of information is small in this specific structure, while the decoding process is greatly simplified. Nirenberg and Latham (2003) proposed a measure of the loss of information caused by the use of the unfaithful independent model for decoding; here, the Kullback-Leibler (KL) divergence is used for describing the loss (see also Nirenberg & Latham, 2005). Schneidman et al. (2003) questioned this definition by posing a fundamental problem: how the amount of information should be defined in neural encoding and decoding. They studied various quantities related to this problem in terms of Shannon information, and the relations among the various information-theoretic concepts are elucidated.

Neural Computation 18, 1259–1267 (2006)  © 2006 Massachusetts Institute of Technology

There have been heated discussions, and the effects caused by using different models for the encoding and decoding schemes remain to be clarified. It is true that there is no unique measure of the loss of information in neural coding and decoding. Shannon information is justified when it is used for the purpose of reproducing or transmitting messages correctly, because it gives the minimal length of a code to describe a message. However, there is no clear justification of this measure when one applies it for other purposes. Similarly, Fisher information is justified for the purpose of estimating the parameter given in the form of a stimulus, because it is the only invariant local measure in the manifold of probability distributions (Amari & Nagaoka, 2000). However, neural decoding is not merely for estimation but more naturally for generating an action. In such a case, the Bayes loss plays a role. Therefore, although these measures are useful in various respects and widely used, there is no unique measure. Various measures have their own meanings in their own right, as Schneidman et al. (2003) pointed out. It is not the purpose of this article to give a unique correct definition of the loss of information in neural coding and decoding. We are afraid that such arguments might lead us to theological debates. Instead, here we study the mathematical structures of various measures of difference between (conditional) probability distributions. In particular, we focus on the structure of the Nirenberg-Latham loss of information and prove that it is very natural from the information geometry point of view. Information geometry (Amari & Nagaoka, 2000) studies the intrinsic properties of the manifold of probability distributions and hence is useful for studying stochastic neural encoding and decoding. The KL divergence is the canonical invariant divergence between two probability distributions in a dually flat manifold (Amari & Nagaoka, 2000).
The Shannon information, Fisher information, Jensen-Shannon divergence, and many other invariant structures are derived from it. We study the properties of the Nirenberg-Latham loss of information from the point of view of information geometry. We give a necessary and sufficient condition that guarantees no loss of information in the sense of Nirenberg and Latham when using the unfaithful (i.e., uncorrelated) model for decoding. Moreover, the KL divergence of the encoding scheme (which is the conditional mutual information of noises) is decomposed orthogonally into two terms: one is the Nirenberg-Latham loss, and the other is due to an irrelevant normalization term. This elucidates the use of the Nirenberg-Latham loss for analyzing the decoding process. We are interested in extending the theory to a more general process of integrating different evidence in the Bayesian framework, denoted by posterior probabilities of a stimulus. This is a first step toward this interesting problem from the information geometrical viewpoint (Amari, 2005).
Correlation and Independence in the Neural Code
2 Encoding and Decoding

Let us consider a population of neurons activated by stimulus s. The firing pattern is represented by a vector $r = (r_1, \ldots, r_n)$, where $r_i$ is the activity of the ith neuron. Given stimulus s, the neurons fire stochastically, and its (conditional) probability $p(r|s)$ represents the encoding scheme. The expectation of r,

$$\bar{r}(s) = \int r\, p(r|s)\, dr, \tag{2.1}$$

is called the tuning curve, where the integral should be replaced by summation when the $r_i$ are discrete. Decoding is a process to infer s from the activity pattern r. An estimator is given by a function $\hat{s} = \hat{s}(r)$, and its accuracy is bounded by the inverse of the Fisher information. From the Bayesian standpoint, it is natural to consider the posterior distribution of s given r,

$$p(s|r) = \frac{p(r|s)\,p(s)}{p(r)}, \tag{2.2}$$

where $p(s)$ is the prior distribution of the stimulus and

$$p(r) = \int p(s)\,p(r|s)\,ds. \tag{2.3}$$
Here, we use the same letter p to represent probabilities, and the meaning is clear from the context. The posterior distribution retains all the information concerning s, and the neural system infers s based on it. The encoding probability $p(r|s)$ is not independent in general, and the activities of the $r_i$ are correlated. However, it would be difficult to take the correlation structure into account, and the brain might use a simplified independent version for decoding. It is given by

$$q(r|s) = \prod_{i=1}^{n} p_i(r_i|s), \tag{2.4}$$

where

$$p_i(r_i|s) = \int p(r_1, \ldots, r_n|s)\, dr_1 \cdots dr_{i-1}\, dr_{i+1} \cdots dr_n \tag{2.5}$$
is the marginal distribution of $r_i$. $q(r|s)$ is the independent version of $p(r|s)$, sometimes called the shuffled distribution. Wu, Amari, and Nakahara (2001) used this model (the unfaithful model) for decoding s in a neural field and analyzed the loss of Fisher information. The posterior distribution under the independence assumption is given by

$$q(s|r) = \frac{q(r|s)\,p(s)}{q(r)}, \tag{2.6}$$

where

$$q(r) = \int q(r|s)\,p(s)\,ds. \tag{2.7}$$
3 Nirenberg-Latham Loss of Information

Nirenberg and Latham (2003) proposed

$$I = KL[p(s|r) : q(s|r)] = \int p(r)\,p(s|r)\log\frac{p(s|r)}{q(s|r)}\,ds\,dr \tag{3.1}$$

as the loss of information. This is the average KL divergence between the two posterior distributions in the decoding scheme. One may consider a similar quantity,

$$\tilde{I} = KL[p(r|s) : q(r|s)] = \int p(s)\,p(r|s)\log\frac{p(r|s)}{q(r|s)}\,dr\,ds, \tag{3.2}$$

which is the average KL divergence in the encoding scheme. We show their properties and relation.

Theorem 1.

$$\tilde{I} = I + KL[p(r) : q(r)]. \tag{3.3}$$
The proof is immediate from the definition. It is obvious that $\tilde{I} = 0$ if and only if $p(r|s) = q(r|s)$, that is, when no correlation exists in the true encoding scheme. Moreover, $\tilde{I} \geq I$, and $\tilde{I} = 0$ implies $I = 0$, but the converse does not necessarily hold. This asymmetry arises from the fact that $p(s) = q(s)$ is common, but $p(r) \neq q(r)$ in general. Hence, it is interesting to see the difference between $\tilde{I} = 0$ and $I = 0$.
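Theorem 1 can be checked numerically for an arbitrary discrete encoding. The sketch below (random distributions of hypothetical size) verifies the identity of equation 3.3 by direct summation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random discrete encoding: 3 stimuli, responses r = (r1, r2) with 2x3 levels.
ns, n1, n2 = 3, 2, 3
p = rng.random((ns, n1, n2))
p /= p.sum(axis=(1, 2), keepdims=True)          # p(r|s)
prior = np.full(ns, 1.0 / ns)                   # p(s)

# Shuffled model q(r|s) = product of marginals, equation 2.4
q = p.sum(axis=2, keepdims=True) * p.sum(axis=1, keepdims=True)

joint_p = prior[:, None, None] * p              # p(s) p(r|s)
joint_q = prior[:, None, None] * q              # p(s) q(r|s)
pr, qr = joint_p.sum(axis=0), joint_q.sum(axis=0)

def kl(a, b):
    return np.sum(a * np.log(a / b))

I_tilde = kl(joint_p, joint_q)                  # equation 3.2
# Equation 3.1: average KL between the posteriors p(s|r) and q(s|r)
I = np.sum(joint_p * np.log((joint_p / pr) / (joint_q / qr)))
print(I_tilde, I + kl(pr, qr))                  # two sides of equation 3.3
```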
We show an interesting property of $\tilde{I}$:

Theorem 2.

$$\tilde{I} = I(R_1, R_2, \ldots, R_n|S), \tag{3.4}$$

where the right-hand side is the conditional mutual information of $R_1, \ldots, R_n$.

Proof. For fixed s, we have

$$KL[p(r|s) : q(r|s)\,|\,s] = E_{p(r|s)}\left[\log\frac{p(r|s)}{\prod_i p_i(r_i|s)}\right] = -H(R_1, \ldots, R_n|s) + \sum_{i=1}^{n} H(R_i|s) = I(R_1, \ldots, R_n|s), \tag{3.5}$$

which is the conditional mutual information. By averaging it over s, we have the theorem. This quantity is called the strength of noise correlations and is studied in Schneidman et al. (2003).

4 How Related Are I and $\tilde{I}$?

We show when $I = 0$ occurs in spite of $\tilde{I} \neq 0$:

Theorem 3. $I = 0$ if and only if

$$p(r|s) = k(r)\,q(r|s) \tag{4.1}$$
for some $k(r)$ not depending on s.

Proof. When $I = 0$, from

$$I = E_{p(r)}\,KL[p(s|r) : q(s|r)\,|\,r], \tag{4.2}$$

we have

$$p(s|r) = q(s|r) \tag{4.3}$$

for almost all r (that is, for all r for which $p(r) \neq 0$). By multiplying both sides by $p(r)q(r)/p(s)$ and using Bayes' theorem (see equations 2.2 and 2.6), we have

$$q(r)\,p(r|s) = p(r)\,q(r|s). \tag{4.4}$$

Hence, we have equation 4.1, where $k(r) = p(r)/q(r)$.
The theorem shows that even when the encoding scheme is correlated, $I = 0$ if the correlational part does not depend on s. In other words, $I = 0$ when the log likelihood is the same for the two cases except for a constant term $\log k(r)$ not depending on s. We can restate this by using the score function, which is the derivative of the log likelihood with respect to s.

Corollary 1. $I = 0$ when and only when the score functions are the same for the two encoding schemes.

We show an example first.

Example 1. For $s > 0$, the encoding scheme

$$p(r_1, \ldots, r_n|s) = \frac{1}{(\sqrt{2\pi s})^n}\,(1 + \tanh r_1 \cdots \tanh r_n)\exp\left(-\sum_i \frac{r_i^2}{2s}\right) \tag{4.5}$$

is not independent, and the $r_i$'s are positively correlated. The marginal distributions are

$$p_i(r_i|s) = \frac{1}{\sqrt{2\pi s}}\exp\left(-\frac{r_i^2}{2s}\right), \tag{4.6}$$

and

$$q(r|s) = \frac{1}{(\sqrt{2\pi s})^n}\exp\left(-\sum_i \frac{r_i^2}{2s}\right) \tag{4.7}$$
is different from $p(r|s)$. In this case $I = 0$, but $\tilde{I} \neq 0$.

Theorem 4. When the set of functions $f(r; s) = p(r|s)$ of r, parameterized by s, spans the $L^2$-space of r, where $p(r|s)$ is the marginal distribution of any $r_i$, $I = 0$ and $\tilde{I} = 0$ are equivalent.

Proof. We consider the case with $n = 2$. By integrating equation 4.1 with respect to $r_2$, we have

$$p_1(r_1|s) = \left[\int k(r_1, r_2)\,p_2(r_2|s)\,dr_2\right] p_1(r_1|s). \tag{4.8}$$

This implies

$$\int \{1 - k(r_1, r_2)\}\,p_2(r_2|s)\,dr_2 = 0. \tag{4.9}$$

When $\{p_2(r_2|s)\}$ spans the $L^2$-space, that is, forms a complete basis, $1 - k(r_1, r_2) \equiv 0$. Hence, under these conditions, when $I = 0$, $p(r|s) = q(r|s)$, so that $\tilde{I} = 0$.

Example 2. For additive encoding,

$$r_i = f_i(s) + n_i, \tag{4.10}$$

where $f_i(s)$ denotes a unimodal and continuous tuning function and the $n_i$ are jointly gaussian, $I = 0$ and $\tilde{I} = 0$ are equivalent.

In the above case, the marginal distribution of $r_i$ is

$$p_i(r_i|s) = \mathrm{const}\times\exp\left(-\frac{(r_i - f_i(s))^2}{2\sigma^2}\right). \tag{4.11}$$

For $t = f_i(s)$,

$$f(r_i, t) = p_i(r_i|t) = \mathrm{const}\times\exp\left(-\frac{(t - r_i)^2}{2\sigma^2}\right) \tag{4.12}$$
is a kernel whose eigenfunctions are complete, spanning the $L^2$-space. On the other hand, the functions 4.6 span only the even functions of $r_i$ and are not complete.

5 Geometry of Loss of Information

Finally, we study the relation between I and $\tilde{I}$ from the point of view of information geometry. This may justify the use of I as the loss caused by using the unfaithful independent model. The Bayesian posterior is a probability distribution whose total mass is normalized to 1. However, it is computationally easier to retain a posterior distribution without normalization, without causing any loss of information. Hence, we consider two unnormalized distributions $p(s, r)$ and $q(s, r)$ over s, where r is given. In other words, we regard $p(s, r)$ and $q(s, r)$ as unnormalized distributions of s where r is fixed, instead of regarding them as joint distributions of $(s, r)$. Given two such positive unnormalized
distributions $\tilde{p}(s)$ and $\tilde{q}(s)$, for which $\int \tilde{p}(s)\,ds \neq 1$ and $\int \tilde{q}(s)\,ds \neq 1$, their KL divergence is given by

$$KL[\tilde{p}(s) : \tilde{q}(s)] = \int\left\{\tilde{p}(s)\log\frac{\tilde{p}(s)}{\tilde{q}(s)} + \tilde{q}(s) - \tilde{p}(s)\right\}ds \tag{5.1}$$

(Amari & Nagaoka, 2000). Information geometry shows that one can decompose it into the following orthogonal sum of two nonnegative terms,

$$KL[\tilde{p}(s) : \tilde{q}(s)] = KL[\tilde{p}(s) : \hat{q}(s)] + KL[\hat{q}(s) : \tilde{q}(s)], \tag{5.2}$$

where, by putting $c_p = \int \tilde{p}(s)\,ds$ and $c_q = \int \tilde{q}(s)\,ds$,

$$\hat{q}(s) = \frac{c_p}{c_q}\,\tilde{q}(s) \tag{5.3}$$

has the same mass $c_p$ as $\tilde{p}(s)$. It is easy to show that equation 5.2 follows from equation 5.1, and both terms on the right-hand side are nonnegative. The term

$$KL[\tilde{p} : \hat{q}] = c_p\,KL[p : q], \tag{5.4}$$

where $p = \tilde{p}/c_p$ and $q = \tilde{q}/c_q$ are probability distributions, is the difference between the normalized distributions, and

$$KL[\hat{q} : \tilde{q}] = c_p\log\frac{c_p}{c_q} + c_q - c_p \geq 0, \tag{5.5}$$

which measures the difference in the total masses of $\hat{q}$ and $\tilde{q}$. The decomposition is orthogonal, in the sense that the Pythagorean relation holds (Amari & Nagaoka, 2000). The orthogonal decomposition, equation 5.2, in the present case is

$$KL[p(s, r) : q(s, r)\,|\,r] = p(r)\,KL[p(s|r) : q(s|r)\,|\,r] + p(r)\log\frac{p(r)}{q(r)} + q(r) - p(r). \tag{5.6}$$
By integrating both sides of equation 5.6 with respect to r, we have equation 3.3. This shows that the conditional noise information $\tilde{I} = I(R_1, \ldots, R_n|S)$ is decomposed orthogonally into two terms: one is the Nirenberg-Latham loss I, and the other corresponds to the irrelevant normalization constants.
This decomposition is different from equation 3.4 of Schneidman et al. (2003), which is

$$\tilde{I} = \mathrm{Syn}(R_1, \ldots, R_n) + I(R_1, \ldots, R_n), \tag{5.7}$$

where

$$\mathrm{Syn}(R_1, \ldots, R_n) = I(S : R_1, \ldots, R_n) - \sum_i I(S : R_i) \tag{5.8}$$

and

$$I(R_1, \ldots, R_n) = KL\left[p(r) : \prod_i p_i(r_i)\right] \tag{5.9}$$

is the mutual information among $R_1, \ldots, R_n$.

6 Conclusions

The information geometry framework is applied to elucidate the loss of information caused by the use of the independent version of the encoding scheme for decoding. A necessary and sufficient condition for the loss to vanish is given. Its geometrical meaning is clarified, justifying the use of this loss in the Bayesian framework.

Acknowledgments

We thank Peter Dayan for useful discussions.

References

Amari, S. (2005). Generalization of Bayes predictive distribution and optimality. Manuscript submitted for publication.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. Providence, RI: American Mathematical Society, and New York: Oxford University Press.
Nirenberg, S., & Latham, P. (2003). Decoding neural spike trains: How important are correlations? Proceedings of the National Academy of Sciences, USA, 100, 7348–7353.
Nirenberg, S., & Latham, P. (2005). Synergy, redundancy and independence in population codes. Journal of Neuroscience, 25, 5195–5206.
Schneidman, E., Bialek, W., & Berry, M. J., II. (2003). Synergy, redundancy, and independence in population codes. Journal of Neuroscience, 23, 11539–11553.
Wu, S., Amari, S., & Nakahara, H. (2001). Population coding with correlation and an unfaithful model. Neural Computation, 13, 775–797.
Received February 3, 2005; accepted October 3, 2005.
LETTER
Communicated by Haim Sompolinsky
Analysis of Spike Statistics in Neuronal Systems with Continuous Attractors or Multiple, Discrete Attractor States Paul Miller [email protected] Department of Physics and Volen Center for Complex Systems, Brandeis University, Waltham, MA 02454, U.S.A.
Attractor networks are likely to underlie working memory and integrator circuits in the brain. It is unknown whether continuous quantities are stored in an analog manner or discretized and stored in a set of discrete attractors. In order to investigate the important issue of how to differentiate the two systems, here we compare the neuronal spiking activity that arises from a continuous (line) attractor with that from a series of discrete attractors. Stochastic fluctuations cause the position of the system along its continuous attractor to vary as a random walk, whereas in a discrete attractor, noise causes spontaneous transitions to occur between discrete states at random intervals. We calculate the statistics of spike trains of neurons firing as a Poisson process with rates that vary according to the underlying attractor network. Since individual neurons fire spikes probabilistically and since the state of the network as a whole drifts randomly, the spike trains of individual neurons follow a doubly stochastic (Poisson) point process. We compare the series of spike trains from the two systems using the autocorrelation function, Fano factor, and interspike interval (ISI) distribution. Although the variation in rate can be dramatically different, especially for short time intervals, surprisingly both the autocorrelation functions and Fano factors are identical, given appropriate scaling of the noise terms. Since the range of firing rates is limited in neurons, we also investigate systems for which the variation in rate is bounded by either rigid limits or because of leak to a single attractor state, such as the Ornstein-Uhlenbeck process. In these cases, the time dependence of the variance in rate can be different between discrete and continuous systems, so that in principle, these processes can be distinguished using second-order spike statistics. 
Neural Computation 18, 1268–1317 (2006)  © 2006 Massachusetts Institute of Technology

1 Introduction

Many processes inside the brain, particularly in the cerebral cortex, operate in a noisy environment (Shadlen & Newsome, 1994; Buzsaki, 2004). Noise is apparent in the trial-to-trial variation in times and number of
spikes produced by any neuron, under what are intended to be identical external conditions. The effects of noise on the spiking statistics can tell us about the underlying states of a network of neurons (Ginzburg & Sompolinsky, 1994), which can be described by the firing rates of neurons as a function of time. States of the network where the firing rate would remain constant in the absence of noise are stationary attractor states (Hopfield, 1982; Amit, 1989; Zipser, Kehoe, Littlewort, & Fuster, 1993). The concept of attractor states is important in understanding many functions of the brain (Hopfield, 1982; Amit, 1989; Hopfield & Herz, 1995; Goldberg, Rokni, & Sompolinsky, 2004). Hippocampal place fields (O'Keefe & Dostrovsky, 1971; Samsonovich & McNaughton, 1997), working memory (Zipser et al., 1993; Camperi & Wang, 1998; Romo, Brody, Hernández, & Lemus, 1999; Compte, Brunel, Goldman-Rakic, & Wang, 2000; Durstewitz, Seamans, & Sejnowski, 2000; Miller, Brody, Romo, & Wang, 2003), and integration—for example, integration of velocity to obtain position (Cannon, Robinson, & Shamma, 1983; Robinson, 1989; Seung, 1996; Aksay, Baker, Seung, & Tank, 2000; Seung, Lee, Reis, & Tank, 2000; Sharp, Blair, & Cho, 2001; Taube & Bassett, 2003)—could all require a quasicontinuum of attractor states. In order to understand how the brain performs these functions, it is important to ask whether the brain digitizes information of a continuous quantity using a set of discrete attractor states as suggested in some models (Koulakov, Raghavachari, Kepecs, & Lisman, 2002; Goldman, Levine, Tank, & Seung, 2003), or if that information is stored as an analog variable in a continuous attractor as suggested by others (Seung, 1996; Seung et al., 2000; Compte et al., 2000; Miller et al., 2003; Loewenstein & Sompolinsky, 2003; Durstewitz, 2003). A single trial is typically insufficient to determine a precise rate, as the variance is on the same order as the firing rate for most cortical activity.
For example, a system with stable firing rates of 16 Hz (standard deviation, 4 Hz, assuming a coefficient of variation (CV) of 1) and 25 Hz (standard deviation 5 Hz) would need to spend approximately 1 second at each rate in order for the two rates to be distinguishable with any confidence. Hence, if the system were to drift continuously between the two rates in less than 2 seconds, a single trial would not be enough to distinguish such behavior from a discrete jump. If the levels were any closer together, the time spent near each level of rate would have to be correspondingly longer to separate out the rates. So it is unlikely that a single trial (lasting up to a few seconds for most experimental protocols) would distinguish discrete states from a continuous range. Hence, we consider statistical measures calculated from many trials to address the characteristics of continuous and discrete attractors in a neuronal network. Correlations in the spike times of neurons normally decay on the timescale of synaptic time constants. However, in continuous attractor networks, fluctuations on a short timescale can be temporally integrated
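The separability argument can be illustrated, under the idealized assumption of Poisson spike counts, by simulating the two stable rates from the text over a 1-second window:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1.0                      # observation window (s)
trials = 100_000

# Poisson spike counts at the two stable rates from the text; the count
# variance equals the mean, so the rate estimate N/T has SD sqrt(r/T).
n16 = rng.poisson(16.0 * T, trials)
n25 = rng.poisson(25.0 * T, trials)

print(n16.std(), n25.std())      # ~4 and ~5 spikes for T = 1 s
overlap = np.mean(n16 >= n25)    # fraction of trials where the 16 Hz
print(overlap)                   # process looks at least as fast
```

Even with a full second at each rate, a noticeable fraction of trials overlap, consistent with the claim that closer rate levels require correspondingly longer dwell times to separate.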
1270
Paul Miller
and accumulate in the manner of a random walk. Hence, the statistics of neuronal spike times can contain slower correlations due to changes in the underlying rate (Cox & Lewis, 1966; Perkel, Gerstein, & Moore, 1967; Saleh, 1978; Ginzburg & Sompolinsky, 1994; Middleton, Chacron, Lindner, & Longtin, 2003). Hence, statistical properties of the spike times can be a useful tool for analyzing the network in terms of its attractors, which determine the firing rates. Important quantities characterizing the statistics of neuronal spike times are the spike time correlation function, Fano factor, and distribution of interspike intervals (ISIs). The correlation function measures the relative likelihood of one spike time compared to others. The Fano factor measures the variance across trials in the total number of spikes, relative to the mean. The ISI distribution is a histogram of times between spikes. We calculate autocorrelation functions (Zohary, Shadlen, & Newsome, 1994; Bair, Zohary, & Newsome, 2001), Fano factors (Dayan & Abbott, 2001) and ISI distributions for spike trains produced according to a Poisson process. Our systems are examples of doubly stochastic Poisson point processes (Saleh, 1978), so called because not only are spikes generated probabilistically (with a probability in a small time window, δt of r (t)δt, where r (t) is the underlying rate), but the underlying rate, r (t), of the Poisson spike train varies randomly. We can think of the two types of noise as operating over different spatial and temporal scales, with variations in rate being relatively slow fluctuations in the entire network that result from the punctate, Poisson noise inherent in the spike train of each individual neuron. 
We do not consider regular spiking with a varying rate (in which case the following analysis is unnecessary, since the rate could be read out as a function of time for each spike train) since the high CV, near one, of real neurons is better approximated by a Poisson process.
2 Autocorrelation Function for a Poisson Process with Stochastic Rate Variation

We analyze the effects of rate variations by solving for a Poisson process, where no correlations in spike times exist apart from those due to any underlying change of rate. To be precise, we assume the state of the network drifts or changes discretely but slowly, with trial-to-trial randomness. We assume then that the firing rate of each neuron is primarily determined by the state of the network, which provides the synaptic input. Hence the slow, large-scale random variations in the state of the network induce similar random variations in the underlying firing rate, r(t), of each neuron. However, neurons do not fire regularly at a slowly varying firing rate, but emit spikes with a high CV. Since a Poisson process has a CV of one, and to make the problem tractable, we assume that each neuron emits spikes as a Poisson process, whose rate varies randomly with time and across trials.
Hence, in any time interval, the probability of a spike is $r(t)\,dt$, but $r(t)$ is not a fixed quantity. The statistics of such processes are determined by the probability distribution of the firing rate, $P(r, t)$. For example, in a continuous attractor, the state of the system follows a random walk, which leads to a gaussian distribution of firing rates, whereas for a set of discrete states, only specific firing rates are possible, with probabilities calculated from the binomial distribution. Such processes are termed doubly stochastic Poisson point processes (Saleh, 1978) and have been studied previously in physics with regard to the emission of photoelectrons from a surface. The probability distribution at one time, $P(r, t)$, can be dependent on a known rate, $r_0$, at an earlier time, $t_0$, in which case we write the conditional probability distribution as $P(r, t|r_0, t_0)$. The underlying firing rate of any neuron fluctuates from trial to trial about some mean value, $\bar{r}(t)$, with trial-to-trial variance, $\sigma^2(t)$. First, consider a process with constant average rate, $\bar{r}(t) = r_0$, such that

$$\int_{-\infty}^{\infty} r\,P(r, t|r_0, t_0)\,dr = r_0 \tag{2.1}$$

and variance, $\sigma^2(t')$, defined by

$$\int_{-\infty}^{\infty} r^2\,P(r, t|r_0, t_0)\,dr = r_0^2 + \sigma^2(t'), \tag{2.2}$$

where $t' = t - t_0$. The autocorrelation function measures how much more or less likely it is, on any particular trial, to observe a spike at time $t_1 + \tau$, given a spike at $t_1$, compared to what is predicted by the average rates. For a stationary process, an average can be taken over all values of $t_1$, so the autocorrelation function, $C_{xx}(\tau)$, is a function of the time lag, $\tau$, between two spikes. A negative value for $C_{xx}(\tau)$ indicates it is less likely than average to see two spikes separated by $\tau$ on the same trial (for example, due to the refractory period), while a positive value means a spike at one time predicts another spike is more likely than chance to occur after an interval of $\tau$. $C_{xx}(\tau)$ is zero if spike times are uncorrelated. For a Poisson process, the probability of a spike at time t is proportional to the rate, $r(t)$, so when the rate is known only probabilistically, the spike probability is proportional to $\int_{-\infty}^{\infty} r\,P(r, t|r_0, 0)\,dr$. Hence, the autocorrelation function is given, for positive $\tau$, by (Brody, 1999)
$$C^{\mathrm{ave}}_{xx}(\tau, T) = \frac{1}{T-\tau}\int_0^{T-\tau} dt_1\, C_{xx}(t_1, t_1+\tau) \tag{2.3}$$

$$= \frac{1}{T-\tau}\int_0^{T-\tau} dt_1 \int_{-\infty}^{\infty} dr_1\, r_1\, P(r_1, t_1|r_0, 0)\int_{-\infty}^{\infty} dr_2\, r_2\, P(r_2, t_1+\tau|r_1, t_1)\; -\; \frac{1}{T-\tau}\int_0^{T-\tau} dt_1 \int_{-\infty}^{\infty} dr_1\, r_1\, P(r_1, t_1|r_0, 0)\int_{-\infty}^{\infty} dr_2\, r_2\, P(r_2, t_1+\tau|r_0, 0),$$
where T is the total measurement interval. In all cases described here (apart from the leaky integrators of section 5), the correlation functions are symmetric ($C_{xx}(t_1, t_2) = C_{xx}(t_2, t_1)$), so we include formulas only for positive $\tau$. If there is no correlation between the rates at one time and another time, or if the probability distribution of rates does not change with time, then the first term in equation 2.3 is equal to the second term (the shuffle correction; Brody, 1998). In such a case, the autocorrelation is zero, as expected. The autocorrelation function depends on the time lag, $\tau$, alone if the underlying process is stationary, so that at least the first two moments (mean and variance) of the rate are constant. Since neural processes are rarely stationary, the subtraction of the shuffle correction in the above equation is designed to remove effects of nonstationarity in the average rate. However, if the variance in firing rate is not constant through the measurement interval, the autocorrelation function may also depend on the total measurement time, T. In general, we can rewrite the terms in the integrand of equation 2.3 as

$$C_{xx}(t_1, t_1+\tau) = \int_{-\infty}^{\infty} dr_1\, r_1\, P(r_1, t_1|r_0, 0)\,\langle r(t_1+\tau)\rangle|_{r_1, t_1} - \langle r(t_1)\rangle|_{r_0, 0}\,\langle r(t_1+\tau)\rangle|_{r_0, 0}, \tag{2.4}$$

where $\langle r(t_2)\rangle|_{r_1, t_1}$ is the mean value of the rate at time $t_2$, conditional on its earlier value, $r_1$, at time $t_1$. If the average rate remains constant, at any time the rate is equally likely to increase or decrease. Hence, if the rate is known at a certain time, that value is the new average rate for later times. In such a case, the second integral of the first term in equation 2.3 becomes equivalent to equation 2.1, resulting in
$$\int_{-\infty}^{\infty} dr_2\, r_2\, P(r_2, t_1+\tau|r_1, t_1) = r_1. \tag{2.5}$$
Hence, the first term of equation 2.3 gives

$$\frac{1}{T-\tau}\int_0^{T-\tau} dt_1 \int_{-\infty}^{\infty} dr_1\, r_1^2\, P(r_1, t_1|r_0, 0) = \frac{1}{T-\tau}\int_0^{T-\tau} dt_1\left[r_0^2 + \sigma^2(t_1)\right] = r_0^2 + \langle\sigma^2(t)\rangle_{(0, T-\tau)}, \tag{2.6}$$

which is the square of the mean rate plus the variance in rate averaged over the time interval of measurement. We have used the notation $\langle\sigma^2(t)\rangle_{(0, T-\tau)} = \frac{1}{T-\tau}\int_0^{T-\tau}\sigma^2(t)\,dt$. The shuffle correction leads to subtraction of a term $r_0^2$, canceling part of the above term. Hence, for a process where the rate varies on a trial-to-trial basis, maintaining a fixed overall average, $r_0$, the autocorrelation function is given by

$$C^{\mathrm{ave}}_{xx}(\tau, T) = \langle\sigma^2(t)\rangle_{(0, T-\tau)}. \tag{2.7}$$
2.1 Correlation Functions for a Continuous Attractor. A continuous attractor, or line attractor, is a range of stable states with no distinct changes between states. A neuronal network with a continuous attractor can be an integrator for any stimulus that causes the network's state to shift along the attractor. Once the stimulus is removed, the network remains stable, so it does not change systematically in the absence of input. Such a property makes a neuronal integrator equivalent to a continuous memory device. When a neuronal network has a continuous attractor, noise causes the state to change as a random walk (see Figure 1A). A random walk, or more technically a Wiener process, is essentially the temporal integral of uncorrelated gaussian noise. A continuous attractor integrates any noise that causes the state of the network to shift along the attractor (see Figure 2) in the same manner as it can integrate a stimulus. Hence, the property that allows a continuous attractor to store the memory of a stimulus also causes it to have memory of (and, hence, integrate) the noise, leading to a random walk of firing rates. For a random walk process with uniform, uncorrelated noise, the variance is linear in time, $\sigma^2(t) = \alpha t$, which leads to an autocorrelation function proportional to the measurement period, $T - \tau$:

$$C^{\mathrm{ave}}_{xx}(\tau, T) = \frac{\alpha(T-\tau)}{2}. \tag{2.8}$$
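Equation 2.8 can be checked by direct simulation of the random walk in rate. The sketch below uses the Figure 1A parameters ($r_0 = 50$ Hz, $\alpha = 400\ \mathrm{Hz^2/s}$); T and $\tau$ are arbitrary choices, and, as in the idealized analysis, the positivity constraint on rates is ignored:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = 400.0              # rate diffusion constant (Hz^2/s), as in Figure 1A
T, tau, dt = 20.0, 2.0, 0.01
trials = 2000

steps = int(T / dt)
t = np.arange(1, steps + 1) * dt
# Random-walk rates around r0 = 50 Hz (negative rates allowed, as in the
# idealized Wiener-process treatment)
paths = 50.0 + np.cumsum(
    rng.normal(0.0, np.sqrt(alpha * dt), (trials, steps)), axis=1)

var_t = paths.var(axis=0)          # trial-to-trial variance sigma^2(t) ~ alpha*t
mask = t <= T - tau
# time average of sigma^2(t) over (0, T - tau), equations 2.7 and 2.8
print(var_t[mask].mean(), alpha * (T - tau) / 2)
```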
Figure 1: (A) Rate as a function of time for different trials of an unbiased random walk process. The dashed line shows the initial rate of 50 Hz, with smooth curves indicating the standard deviation across trials, $50\,\mathrm{Hz} \pm 20\,\mathrm{Hz}\,\sqrt{t}$. (B) Time variation of the rate for a process with discrete states and stochastic transitions between states.

This is an unusual result, as correlation functions for spike times typically decay with lag time, $\tau$, while an increase of total measurement time, T, improves the sampling. The autocorrelation function increases with measurement time for a random walk process because the trial-to-trial variability in rates increases with time. The longer the time period of measurement, the more the spike rate on one trial is distinguishable from the spike rate on another trial. This shows up as a positive autocorrelation, since observation of a spike at a certain time is more likely to occur on a trial with high rate,
Figure 2: The line attractor is defined by the tuning curve of each neuron, $f_A(s)$, $f_B(s)$, $f_C(s)$. Fluctuations in s lead to fluctuations in the firing rates of neurons that are either positively correlated ($f_A$, $f_B$) or negatively correlated ($f_A$, $f_C$).
and an above-average rate at one point in time means that spikes are more likely than on average for any later time. We extend the above result to find the autocorrelation and cross-correlation functions between neurons whose rates depend on an underlying continuous attractor, parameterized by s, as shown in Figure 2. The firing rates of each neuron are partially determined by the position of the system along the continuous attractor, s, such that changes in s lead to coherent changes in the firing rates of all neurons in the system. We assume that an initial stimulus places the system at an initial point on the attractor with average location $\bar{s}_0$ and standard deviation $\sigma_0$. The firing rates of neurons (labeled A and B) also fluctuate independently, with standard deviations $\sigma_A$ and $\sigma_B$, about the value determined by the position in the attractor. So for two neurons, A and B, the probability distribution of their firing rates is given by

$$P_A(r|s) = \frac{1}{\sqrt{2\pi\sigma_A^2}}\exp\left(-\frac{[r - f_A(s)]^2}{2\sigma_A^2}\right) \tag{2.9}$$

$$P_B(r|s) = \frac{1}{\sqrt{2\pi\sigma_B^2}}\exp\left(-\frac{[r - f_B(s)]^2}{2\sigma_B^2}\right), \tag{2.10}$$
where $f_{A,B}$ are the tuning curves (average firing rate as a function of stimulus) that describe the attractor (as depicted in Figure 2). In the calculations, we expand these firing rate curves to second order in the fluctuations about the initial average stimulus, $\bar{s}_0$:

$$f_A(s) \approx f_A^0 + f_A'\,(s - \bar{s}_0) + f_A''\,(s - \bar{s}_0)^2/2. \tag{2.11}$$
The expansion in $s - \bar{s}_0$ is valid if the change in the network due to noise is small compared to the change in the network due to the complete range of stimuli. We assume the diffusion along the continuous attractor is specified by a random walk in s, such that

$$P(s, t|s_0, t_0) = \frac{1}{\sqrt{2\pi\alpha(t - t_0)}}\exp\left(-\frac{(s - s_0)^2}{2\alpha(t - t_0)}\right), \tag{2.12}$$

where $s_0$ is the initial position on the attractor for a given trial. Noise, of amplitude $\sigma_0$, during the initial stimulus presentation leads to a distribution of $s_0$, such that

$$P(s_0, t_0) = \frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left(-\frac{(s_0 - \bar{s}_0)^2}{2\sigma_0^2}\right). \tag{2.13}$$
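Equations 2.12 and 2.13 imply that the variance of the position along the attractor grows as $\sigma_0^2 + \alpha t$, which is the quantity that drives the correlation results below. A quick sketch, with arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma0, alpha, t = 0.3, 0.05, 2.0    # arbitrary illustrative values
trials = 200_000

s0 = rng.normal(0.0, sigma0, trials)                   # equation 2.13 (s0_bar = 0)
st = s0 + rng.normal(0.0, np.sqrt(alpha * t), trials)  # diffusion, equation 2.12
print(st.var(), sigma0**2 + alpha * t)                 # variances agree
```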
The cross-correlation function is now found by integrating over all possibilities. It is expressed in the somewhat cumbersome form (writing $t_2 = t_1 + \tau$):

$$C^{\mathrm{ave}}_{AB}(\tau, T) = \frac{1}{T-\tau}\int_0^{T-\tau} dt_1 \int_{-\infty}^{\infty} P(s_0)\,ds_0 \int_{-\infty}^{\infty} P(s_1, t_1|s_0, t_0)\,ds_1 \int_{-\infty}^{\infty} P(s_2, t_2|s_1, t_1)\,ds_2 \times \int_{-\infty}^{\infty} P_A(r_1|s_1)\,r_1\,dr_1 \int_{-\infty}^{\infty} P_B(r_2|s_2)\,r_2\,dr_2$$
$$-\ \frac{1}{T-\tau}\int_0^{T-\tau} dt_1 \left[\int_{-\infty}^{\infty} P(s_0)\,ds_0 \int_{-\infty}^{\infty} P(s_1, t_1|s_0, t_0)\,ds_1 \int_{-\infty}^{\infty} P_A(r_1|s_1)\,r_1\,dr_1\right] \times \left[\int_{-\infty}^{\infty} P(s_0)\,ds_0 \int_{-\infty}^{\infty} P(s_2, t_2|s_0, t_0)\,ds_2 \int_{-\infty}^{\infty} P_B(r_2|s_2)\,r_2\,dr_2\right], \tag{2.14}$$

with the complete final result, after carrying out the gaussian integrals with the second-order expansion of equation 2.11, averaging over $t_1$, and grouping terms,

$$C^{\mathrm{ave}}_{AB}(\tau, T) = f_A' f_B'\left[\sigma_0^2 + \frac{\alpha(T-\tau)}{2}\right] + \frac{f_A'' f_B''}{2}\left[\sigma_0^4 + \sigma_0^2\,\alpha(T-\tau) + \frac{\alpha^2(T-\tau)^2}{3}\right]. \tag{2.15}$$

For clarity, we consider just the terms including up to the first derivative of $f_{A,B}$ from here on. In this case, we have

$$C^{\mathrm{ave}}_{AB}(\tau, T) = f_A' f_B'\left[\sigma_0^2 + \frac{\alpha(T-\tau)}{2}\right], \tag{2.16}$$
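Equation 2.16 can be verified at the level of the underlying rates (the independent spiking noise $\sigma_A$, $\sigma_B$ drops out of the covariance in any case). The sketch below assumes hypothetical locally linear tuning curves with slopes $f_A'$ and $f_B'$ and arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma0, alpha = 0.2, 0.1       # initial spread and diffusion constant
T, tau, dt = 4.0, 1.0, 0.02
trials = 20_000
fa, fb = 3.0, -2.0             # hypothetical slopes f_A', f_B'

steps = int(T / dt)
# s(t) = s0 + random walk (equations 2.12 and 2.13, with s0_bar = 0)
s = rng.normal(0.0, sigma0, (trials, 1)) + \
    np.cumsum(rng.normal(0.0, np.sqrt(alpha * dt), (trials, steps)), axis=1)
rA, rB = fa * s, fb * s        # rate fluctuations about the mean rates

lag = int(tau / dt)
# time-averaged, shuffle-corrected cross-covariance of the two rates
covs = [np.cov(rA[:, i], rB[:, i + lag])[0, 1] for i in range(steps - lag)]
c = np.mean(covs)
print(c, fa * fb * (sigma0**2 + alpha * (T - tau) / 2))   # equation 2.16
```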
Equation 2.16 tells us that the cross-correlation is proportional to the product of the derivatives of the two tuning curves, $f_A' f_B'$ (Ben-Yishai, Lev Bar-Or, & Sompolinsky, 1995; Pouget, Zhang, Deneve, & Latham, 1998). Note also the two terms inside the brackets. The first is equal to the variance during the stimulus, which is a fixed quantity. The second term is exactly that derived earlier for a random walk with fixed noise, where the variance increases linearly with time. For the autocorrelation, we can simply set $f_A' = f_B' = 1$, measuring the noise along the line attractor in terms of rate instead of s. Note that, given a position in the continuous attractor labeled by $s(t)$, uncorrelated fluctuations in the rates do not affect the correlation function. Specifically, the cross-correlations and autocorrelations are unaffected by the quantities $\sigma_A$, $\sigma_B$ in equations 2.9 and 2.10. Rather, it is the manner in which the average rate of a neuron depends on the stimulus that determines the correlations.

2.2 Discrete Levels of Rate: Autocorrelation. For a system with discrete states, noise-driven changes in rate occur probabilistically. We assume the system has a set of equally spaced states, each with an identical average lifetime, $\tau_x$, with the stable rates for a particular neuron separated by $\Delta r$ (see Figure 1B). The discrete "hopping" between different values of rate is then described by an exponential distribution of times between "hops," such that the probability of remaining in one state for a time between t and $t + \delta t$ is $\exp[-t/\tau_x]\,\delta t/\tau_x$. Such an exponential distribution arises when the probability of hopping per unit time is fixed at $1/\tau_x$. The probability distribution of rates after M hops is binomial, with mean $r_0$ and variance $\sigma^2(M) = M(\Delta r)^2$. The probability of M jumps in total time t follows the Poisson distribution with mean $t/\tau_x$. The variance of rate as a function of time can be calculated (see appendix A) to give $\sigma^2(t) = (\Delta r)^2 t/\tau_x$. So when equation 2.7 is used,
1278
Paul Miller
the autocorrelation function for a process with discrete hopping is identical to that of a continuous random walk,

$$ C^{\mathrm{ave}}_{xx}(\tau, T) = \frac{(\Delta r)^2 (T - \tau)}{2\tau_x}, \qquad (2.17) $$

if the noise terms for the two processes are correctly scaled such that $\alpha = (\Delta r)^2/\tau_x$ in equation 2.8.

3 Fano Factor: General Results

The Fano factor is a measure of the variation in total spike count for a fixed process. Typically it varies between zero for regular spiking and one for Poisson firing. Time variation in the Fano factor indicates temporal correlations in the processes that lead to spikes. The Fano factor is defined by averaging over many trials of fixed time, T:

$$ F(T) = \frac{\langle N^2\rangle - \langle N\rangle^2}{\langle N\rangle}, \qquad (3.1) $$
where N is the number of spikes in the interval of length T. The expected number of spikes, λ(T), in time T, is given by the time integral of the firing rate:

$$ \lambda(T) = \int_0^T r(t)\,dt. \qquad (3.2) $$
For a Poisson process, where spikes are uncorrelated, occurring with probability r(t)δt in the time interval t to t + δt, the probability of N spikes in total time T is given by the Poisson distribution,

$$ P(N|T) = \frac{\lambda^N \exp(-\lambda)}{N!}. \qquad (3.3) $$
We are interested in processes where the rate is not fixed but can fluctuate from trial to trial, leading to different values of λ, in which case

$$ P(N|T) = \int_{-\infty}^{\infty} d\lambda\, P(\lambda|T)\,\frac{\lambda^N \exp(-\lambda)}{N!}. \qquad (3.4) $$
Spike Statistics of Graded Memory Systems
1279
Using the above result, we see that the mean number of spikes, ⟨N⟩, is given by

$$ \langle N\rangle = \sum_{N=0}^{\infty} N P(N) = \int_{-\infty}^{\infty} d\lambda\,\lambda\, P(\lambda|T) = \langle\lambda(T)\rangle. \qquad (3.5) $$
The mean square number (or second moment) is calculated similarly (Saleh, 1978):

$$ \langle N^2\rangle = \sum_{N=0}^{\infty} N^2 P(N) = \int_{-\infty}^{\infty} d\lambda\, P(\lambda|T)\sum_{N=0}^{\infty}\left[N(N-1) + N\right]\frac{\lambda^N \exp(-\lambda)}{N!} = \int_{-\infty}^{\infty} d\lambda\, P(\lambda|T)\left(\lambda^2 + \lambda\right) = \langle\lambda\rangle|_T + \langle\lambda^2\rangle|_T. \qquad (3.6) $$
Hence, the Fano factor for a Poisson process is given in terms of the moments of the time-integrated rate, λ, as

$$ F(T) = \frac{\langle N^2\rangle - \langle N\rangle^2}{\langle N\rangle} = \frac{\langle\lambda^2\rangle + \langle\lambda\rangle - \langle\lambda\rangle^2}{\langle\lambda\rangle} = 1 + \frac{\mathrm{Var}(\lambda)}{\langle\lambda\rangle}. \qquad (3.7) $$
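The decomposition in equation 3.7 can be checked directly by simulating a doubly stochastic process. The sketch below is a minimal illustration; the gamma distribution for λ and all parameter values are arbitrary illustrative choices, not taken from the text:

```python
import numpy as np

# Hedged sketch: empirical check of F = 1 + Var(lambda)/<lambda> for a
# doubly stochastic (Cox) process. The gamma distribution for lambda is an
# arbitrary illustrative choice, not taken from the text.
rng = np.random.default_rng(0)
n_trials = 200_000
lam = rng.gamma(shape=25.0, scale=2.0, size=n_trials)  # trial-by-trial integrated rate
counts = rng.poisson(lam)                              # Poisson spike counts given lambda

fano_empirical = counts.var() / counts.mean()
fano_predicted = 1.0 + lam.var() / lam.mean()          # equation 3.7
print(fano_empirical, fano_predicted)                  # both close to 3
```

With these gamma parameters, ⟨λ⟩ = 50 and Var(λ) = 100, so the predicted Fano factor is 3 rather than the value of 1 obtained for a simple Poisson process with a fixed rate.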
Note that if the process is not Poisson but has either more regular or less regular spiking statistics, we can calculate the Fano factor by assuming the number of spikes has a mean given by the time integral of the rate (this must be true by definition of the rate) but with some variation around the mean, such that

$$ P(N|\lambda) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(N-\lambda)^2}{2\sigma^2}\right]. \qquad (3.8) $$

A calculation of F(T) leads to

$$ F(T) = \frac{\langle N^2\rangle - \langle N\rangle^2}{\langle N\rangle} = \frac{\langle\sigma^2\rangle}{\langle\lambda\rangle} + \frac{\mathrm{Var}(\lambda)}{\langle\lambda\rangle}. \qquad (3.9) $$
Hence in general, the Fano factor splits into two terms—one from irregularities in the spiking statistics and a second arising from irregularities in the underlying rate. The first term is zero for a regular process and one for
a Poisson process, and typically gives a fixed, constant value. We consider the effects of the second, rate-dependent term in the following sections.

3.1 Calculation of the Fano Factor from the Autocorrelation Function. In this section we use Bayes' rule to relate the Fano factor to the temporal integral of the autocorrelation function. We evaluate the time derivative of the second moment of λ as

$$ \frac{d\langle\lambda^2(t)\rangle}{dt} = 2\langle\lambda(t) r(t)\rangle = 2\int dr_2\, P(r_2, t)\,\langle\lambda(t)|r_2, t\rangle\, r_2, \qquad (3.10) $$
where ⟨λ(t)|r₂, t⟩ is the average over all trajectories that reach a specific rate, r₂, at time t. Since we can write

$$ \langle\lambda(t)|r_2, t\rangle = \int_0^t dt_1\,\langle r_1(t_1)|r_2, t\rangle, \qquad (3.11) $$
we can substitute into equation 3.10 to give

$$ \frac{d\langle\lambda^2(t)\rangle}{dt} = 2\int dr_2\, P(r_2, t)\, r_2\int_0^t dt_1\,\langle r_1(t_1)|r_2, t\rangle = 2\int dr_2\, P(r_2, t)\, r_2\int_0^t dt_1\int dr_1\, P(r_1, t_1|r_2, t)\, r_1. \qquad (3.12) $$
Bayes' rule allows us to replace P(r₂, t)P(r₁, t₁|r₂, t) with P(r₁, t₁)P(r₂, t|r₁, t₁), so that the conditional rate is based on a prior rate. Hence equation 3.12 yields

$$ \frac{d\langle\lambda^2(t)\rangle}{dt} = 2\int_0^t dt_1\int dr_1\, P(r_1, t_1)\, r_1\int dr_2\, P(r_2, t|r_1, t_1)\, r_2, \qquad (3.13) $$

which is similar to the first term of equation 2.3. Care with the integral limits, such that all spikes and hence firing rates are integrated over a window 0 < t < T, leads to

$$ \mathrm{Var}(\lambda(T)) = \langle\lambda^2(T)\rangle - \langle\lambda(T)\rangle^2 = 2\int_0^T d\tau\,(T-\tau)\, C^{\mathrm{ave}}_{xx}(\tau, T), \qquad (3.14) $$
where C^ave_xx(τ, T) is given by equation 2.3. Hence, the Fano factor for a doubly stochastic Poisson process becomes

$$ F(T) = 1 + \frac{2\int_0^T d\tau\,(T-\tau)\, C^{\mathrm{ave}}_{xx}(\tau, T)}{\int_0^T dt\,\langle r(t)\rangle}. \qquad (3.15) $$
3.2 Fano Factor for Processes with Constant Mean Rate. If the mean rate is a constant, r₀, it is clear that the average value of λ is given by

$$ \langle\lambda(T)\rangle = \int_0^T dt\int_{-\infty}^{\infty} dr\, r(t)\, P(r, t|r_0, t=0) = \int_0^T dt\, r_0 = r_0 T. \qquad (3.16) $$
The second moment, ⟨λ²⟩, includes correlations, so it is not straightforward to calculate. We begin by noting that

$$ \frac{d\langle\lambda^2\rangle}{dt} = 2\left\langle\lambda\frac{d\lambda}{dt}\right\rangle \qquad (3.17) $$
$$ = 2\langle\lambda(t) r(t)\rangle, \qquad (3.18) $$

where the average is over trials. Similarly,

$$ \frac{d^2\langle\lambda^2\rangle}{dt^2} = 2\langle r^2(t)\rangle + 2\left\langle\lambda\frac{dr}{dt}\right\rangle, \qquad (3.19) $$
where the second term yields zero for a process with constant mean rate. So in general for such a process,

$$ \langle\lambda^2(T)\rangle = \int_0^T dt'\int_0^{t'} dt''\left[2r_0^2 + 2\sigma^2(t'')\right], \qquad (3.20) $$

giving

$$ \mathrm{Var}[\lambda(T)] = \int_0^T dt'\int_0^{t'} dt''\, 2\sigma^2(t''), \qquad (3.21) $$

where σ²(t) is the variance in rate at time t.
For both the random walk process and the discrete hopping process, where the variance in rates increases linearly with t, we have

$$ \mathrm{Var}[\lambda(T)] = \frac{\alpha T^3}{3}, \qquad (3.22) $$

where for a discrete network α = (Δr)²/τx. Hence, the Fano factor in both cases has a quadratic dependence on T and ⟨N⟩:

$$ F(T) = 1 + \frac{\langle\lambda^2\rangle - \langle\lambda\rangle^2}{\langle\lambda\rangle} = 1 + \frac{\alpha T^2}{3r_0} = 1 + \frac{\alpha\langle N\rangle^2}{3r_0^3}. \qquad (3.23) $$

It is therefore difficult to distinguish the spike statistics of continuous versus discrete processes. The problem arises because the variances of the two processes increase linearly with time. Hence, common statistical measures, which depend only on the variance of firing rates, contain terms such as α = (Δr)²/τx, the constant time derivative of the variance in firing rates. Statistical measures that depend on Δr separately from this combination (which can remain constant as Δr → 0 for a random walk process) are needed to distinguish a continuous from a discrete system. This is true for statistics involving the fourth-order cumulant in spike counts, since the fourth moment of the firing rates differs between the two systems, such that

$$ \langle r^4\rangle - 4\langle r^3\rangle\langle r\rangle - 3\langle r^2\rangle^2 + 12\langle r^2\rangle\langle r\rangle^2 - 6\langle r\rangle^4 = (\Delta r)^2\alpha t. \qquad (3.24) $$
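The fourth cumulant of the rate in equation 3.24 can be estimated from samples. The sketch below uses parameter values matching the three-state system of Figure 3 (Δr = 80 Hz, τx = 16 s) purely for illustration, and compares the discrete hopping process with a variance-matched gaussian random walk; only the former has a nonzero fourth cumulant:

```python
import numpy as np

# Hedged sketch: empirical fourth cumulant of the rate change (equation 3.24)
# for discrete hopping versus a variance-matched gaussian random walk.
rng = np.random.default_rng(4)
dr, tau_x, t, n = 80.0, 16.0, 8.0, 1_000_000
alpha = dr**2 / tau_x

m = rng.poisson(t / tau_x, size=n)                 # number of jumps by time t
hops = dr * (2.0 * rng.binomial(m, 0.5) - m)       # net rate change, discrete system
walk = np.sqrt(alpha * t) * rng.normal(size=n)     # matched continuous random walk

def kappa4(x):
    m1, m2, m3, m4 = (np.mean(x**k) for k in (1, 2, 3, 4))
    return m4 - 4*m3*m1 - 3*m2**2 + 12*m2*m1**2 - 6*m1**4

print(kappa4(hops), dr**2 * alpha * t)   # nonzero, near (dr^2)*alpha*t
print(kappa4(walk))                      # near zero for the gaussian case
```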
Setting r = 0 for the continuous system, equation 3.24 is an example that cumulants higher than second order are zero for the gaussian distribution. The details of obtaining a similar result in the moments of the spike count, N, and proof of equation 3.24 are given in appendix A. The above result is an indication that with the discrete hopping process, it is more probable for the rate to change by a large amount (more than a couple of standard deviations) than for the continuous random walk process. Since the probability of large excursions in rate is small, many trials are necessary to make use of this statistic, as we show below. In principle, the following set of moments, N4 − 3N2 2 + 2N4 − 6N3 + 6N2 N + 11N2 − 3N2 − 6N 3N2 − 3N2 − 3N = (r )2 t 2 /5,
(3.25)
can be used to differentiate systems with continuous versus discrete states. We tested the formula by generating sets of spike times stochastically on a computer. Since the process is very noisy, an experimentally unfeasible number of trials (5000) was required to produce each curve. The four curves for each system (see Figure 3B) indicate the variation that remains even after this large number of trials. We used a bounded system (see the section below) to test this statistical quantity. The boundaries quickly slow the increase in the fourth moment and cause the combination of moments to reach a maximum and decline. However, when we compare the curves with discrete states versus continuous states, the curve never rises significantly above zero for the continuous system. Hence, if this statistic is measured to be significantly above zero, it is an indication of a discrete set of attractors. If the deviation is not significantly above zero, it is most likely due to lack of statistical power.

4 Boundaries

Firing rates of neurons are limited, with an upper bound arising from saturation of synapses or channel kinetics that is rarely reached for energetic reasons, and a lower bound of zero arising from the physical impossibility of the firing rate becoming negative. Hence, it is important to consider the effects of boundaries on the range of firing rate for both continuous and discrete systems. In the calculations, we assume the boundaries are reflecting, which means that whenever the unbounded random process would cause the firing rate to fall beyond the boundary by an amount r_x, we instead assume the rate falls inside the boundary by the same amount, r_x. That is, noise still can move the rate from the boundary (the boundary is not an attractor state) but only to within the bounds.
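A reflecting boundary of this kind is straightforward to simulate. The sketch below (all parameter values illustrative) folds any excursion past a boundary back inside by the same amount, and shows two signatures of a rigidly bounded system: the mean rate drifts toward the midpoint, and the variance saturates at the value for a uniform distribution between the bounds:

```python
import numpy as np

# Hedged sketch of a random walk in rate with reflecting boundaries at
# r0 +/- rb. All parameter values are illustrative.
rng = np.random.default_rng(2)
alpha, dt, n_steps, n_trials = 50.0, 0.01, 8000, 5000
r0, rb = 100.0, 30.0
lo, hi = r0 - rb, r0 + rb

r = np.full(n_trials, lo)                 # start every trial at the lower boundary
for _ in range(n_steps):
    r += rng.normal(0.0, np.sqrt(alpha * dt), size=n_trials)
    r = np.where(r > hi, 2 * hi - r, r)   # reflect off the upper boundary
    r = np.where(r < lo, 2 * lo - r, r)   # reflect off the lower boundary

print(r.mean())   # drifts toward the midpoint r0
print(r.var())    # saturates near the uniform value rb**2 / 3
```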
We match different discrete and continuous bounded systems by ensuring that the initial linear increase in variance of rate with time is identical (as we did with the unbounded system) and also that the variance in rate as t → ∞ is identical between systems. This means that, assuming a discrete system with three states at r₀ and r₀ ± Δr, the corresponding five-state system has states at r₀, r₀ ± Δr′, and r₀ ± 2Δr′, where Δr′ = Δr/√3; the four-state system has states at r₀ ± Δr′/2 and r₀ ± 3Δr′/2, where Δr′ = √(6/11) Δr; the two-state system has states at r₀ ± Δr′/2, where Δr′ = √2 Δr; and the continuous system has boundaries at r₀ ± r_B, where r_B = √(3/2) Δr. Matching the initial linear increase in variance of rate requires us to keep a constant value of α = (Δr)²/τx for all systems.

4.1 Continuous Random Walk with Reflecting Boundaries. When a random walk has reflecting boundaries, each reflection can be accounted for by a trajectory of an unbounded random walk from a mirror source. If the initial firing rate is r₁, then mirror sources exist for all n at 2r₀ +
Figure 3: Plots of moments of the spike count from spike trains produced by computer simulations. The systems have an average rate of 100 Hz. The discrete system with three states has Δr = 80 Hz and τx = 16 s. The five-state and continuous systems are scaled to have the same initial and asymptotic variance in rates. Each dark line corresponds to the average of 5000 trials. Gray lines are analytic results for the corresponding unbounded processes [(Δr)²t²/5]. (A) The Fano factors. (B) The fourth-order combination of moments of N (see equation 3.25).
2r_b − r₁ + 4nr_b (odd numbers of reflections) and at r₁ + 4nr_b (even numbers of reflections), where the original source is included. The probability distribution of rate, P[r(t)], for any rate between the two boundaries, is the sum of the unbounded random-walk probabilities of reaching that rate when starting from any of the mirror sources. That is, for r₀ − r_b ≤ r ≤ r₀ + r_b,

$$ P[r, t_1+\tau\,|\,r_1, t_1] = \frac{1}{\sqrt{2\pi\alpha\tau}}\sum_{n=-\infty}^{\infty}\left\{\exp\left[\frac{-(r - 2r_0 + r_1 - 2r_b + 4nr_b)^2}{2\alpha\tau}\right] + \exp\left[\frac{-(r - r_1 + 4nr_b)^2}{2\alpha\tau}\right]\right\}. \qquad (4.1) $$

In practice, unless the calculation requires long time intervals, only a few values of n are required, between −n_max and n_max, such that 2ατ ≪ (4n_max r_b)². In the computational results, we use −8 ≤ n ≤ 8. This allows us to calculate the mean rate conditional on an earlier rate, ⟨r(t)|r₁, t₁⟩ (as needed in equation 2.4), and the variance in rate, Var[r(t)|r₁(0)], by integrating the above probability distribution between the boundaries. The full formulas used in the calculation are given in appendix B.

4.2 Small Number of Discrete States. For the system with discrete states, we simply limit the total number of states to bound the firing rate. We solve the system analytically for two, three, four, and five states. In all cases, the probability of a transition between states per time interval dt is dt/τx, with τx scaled for each system to maintain (Δr)²/τx = α. The only difference from the unbounded system is that for an edge state, the direction of the transition is fixed to be back to an intermediate state. So for two states, the system becomes simple to analyze, since the direction of the jump is predetermined for both states; only the time of the jump is stochastic. After an even number of jumps, the system always returns to its starting state, and after an odd number of jumps, the system is in the other state.
So, using the Poisson distribution for the number of jumps in time t, we find that after time t the probabilities of remaining in the same state, P(even), or of being in the other state, P(odd), are given by

$$ P(\mathrm{even}) = \frac{1}{2}\left[1 + \exp\left(\frac{-2t}{\tau_x}\right)\right], \qquad P(\mathrm{odd}) = \frac{1}{2}\left[1 - \exp\left(\frac{-2t}{\tau_x}\right)\right]. \qquad (4.2) $$
Similarly, in the three-state system, if the rate is at the central state at any time, after an even number of jumps (with probability P(even) in equation 4.2) it will return to the same value. After an odd number of jumps,
1286
Paul Miller
the rate is equally likely to be r₀ ± Δr. Hence, the system can be solved. We use similar methods to solve for the four-state and five-state systems, as detailed in appendix C. Once the conditional probabilities for the firing rate, dependent on its earlier value, are known, we can use the standard methods described earlier to calculate the autocorrelation function and Fano factor.

4.3 Results for Bounded Systems. One observation for a bounded process is that if the initial rate is low, it will drift up on average; similarly, an initially high rate will drift down. This is seen in Figure 4A, where, as expected, the average rate tends to the midpoint of the two boundaries a long time after the stimulus, as the system loses memory of the initial condition. Figure 4B contains two curves for each system showing the variance in rate as a function of time. For each system, the curve that is higher at early times is for a process where the initial rate is at the midpoint between the two boundaries. The lower solid curve is the result when the initial rate is at a boundary. Observation of such a dependence of the variance in firing rate on the initial stimulus, as well as drift of the average firing rate, would be strong evidence for this type of bounded set of attractor states. Figures 4C and 4D (solid lines) demonstrate that the autocorrelation decays on the timescale of the memory loss of the initial rate (cf. Figure 4A). Figure 4C is calculated as C_xx(t₁, t₁+τ) with t₁ = 1 s, whereas Figure 4D has t₁ = 15 s. Since the autocorrelation with τ = 0 is equal to the variance at that time, the range of variances at small time in Figure 4B produces the range of magnitudes of autocorrelation in Figure 4C. At large times, the more discrete states in the system, the more time constants it has, so the slower time constants contribute to a slower decay of the autocorrelation function. The continuous random walk (solid lines) has the slowest and least purely exponential decay of all. The Fano factors (see Figure 4E) vary across systems at small times, reflecting the variance in firing rate (see Figure 4B) and average initial
Figure 4: Statistics for bounded systems. (A) Average rate given different initial rates. (B) Variance in rates. For each system, the upper curve is the variance when the initial rate is at the midpoint, r₀, and the lower curve is when the initial rate is at a boundary. For the three-state system, the variance is independent of the initial rate, but in general, the more states, the greater the range. (C) Autocorrelation functions, C_xx(t₁, t₁+τ), with t₁ = 1 s; same legend as B. (D) Autocorrelation functions, C_xx(t₁, t₁+τ), with t₁ = 15 s; same legend as B. (E) Fano factors when the initial rate is r₀, the midpoint of the system; same legend as B. (F) Fano factors when the initial rate is at the lower boundary (upper curve) or upper boundary (lower curve) of the system; same legend as B.
firing rates. At large times, as with the Ornstein-Uhlenbeck process (see Figure 5B), the Fano factors for the bounded systems approach a constant value greater than one, due to the constant area under the autocorrelation functions and constant average rate.

5 Leaky Integrators

A second method for limiting the range of firing rates is to make the integrator imperfect, or leaky. A leaky integrator is defined by a stable rate, r_A, which corresponds to an attractor state, and a time constant, τ_L, for decay of firing rates to that state. In practice, neuronal systems are likely to have both "hard" boundaries for the firing rate and a slow drift to an attractor state. For the continuous system, reflecting boundaries are analogous to the walls of a flat (square) potential well, whereas a leak term makes the effective potential quadratic, with a minimum at the attractor state. Adding a leak to the continuous random walk leads to an Ornstein-Uhlenbeck (OU) process for the rate:

$$ \frac{dr}{dt} = -\frac{r - r_A}{\tau_L} + \sqrt{\alpha}\,\eta(t), \qquad (5.1) $$
where η(t) is the uncorrelated gaussian noise term. For the discrete integrator, we assume that the discrete levels of rate each decay toward a single value, r_A, such that the distance between them decays exponentially: Δr → Δr exp(−t/τ_L). For both the continuous random walk and the discrete case, the mean rate follows

$$ \langle r(t)\rangle = r_A + (r_0 - r_A)\exp(-t/\tau_L). \qquad (5.2) $$

The variance in rate for the set of discrete states is then given by

$$ \mathrm{Var}[r(t)] = \langle r(t)^2\rangle - \langle r(t)\rangle^2 = \frac{(\Delta r)^2 t}{\tau_x}\exp(-2t/\tau_L) = \alpha t\exp(-2t/\tau_L). \qquad (5.3) $$
However, for the OU process we have (Gillespie, 1992)

$$ \mathrm{Var}[r(t)] = \frac{\alpha\tau_L}{2}\left[1 - \exp\left(\frac{-2t}{\tau_L}\right)\right]. \qquad (5.4) $$
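A short Euler-Maruyama simulation of the OU dynamics in equation 5.1 reproduces the variance of equation 5.4; the sketch below uses illustrative parameter values:

```python
import numpy as np

# Hedged sketch: Euler-Maruyama simulation of the leaky (OU) rate dynamics of
# equation 5.1, compared with the variance of equation 5.4. Values illustrative.
rng = np.random.default_rng(3)
alpha, tau_L, r_A = 50.0, 2.0, 50.0
dt, t_end, n_trials = 1e-3, 4.0, 10_000
r = np.full(n_trials, r_A)                # start every trial at the attractor state

for _ in range(int(t_end / dt)):
    r += -(r - r_A) / tau_L * dt + np.sqrt(alpha * dt) * rng.normal(size=n_trials)

var_pred = 0.5 * alpha * tau_L * (1.0 - np.exp(-2.0 * t_end / tau_L))
print(r.var(), var_pred)                  # both near alpha*tau_L/2 for t >> tau_L
```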
These results are compared in Figure 5A, using the specific values τ_L = 2 s, Δr = 10 Hz, τ_x = 2 s, α = 50 s⁻³. The OU process has a finite variance in rate at large times due to stochastic jitter about the attractor state (equivalent to noise around a smooth potential minimum). In contrast, for the discrete
Figure 5: Statistics for the leaky integrators. System with an attractor state at r_A = 50 Hz and τ_L = 2 s, with α = 50 s⁻³ for the continuous system and Δr = 10 Hz, τ_x = 2 s for the discrete system. (A) The variance in rate for both continuous and discrete systems. (B) The Fano factor for both systems. For each system, the initial rate is 10 Hz for the upper curve and 50 Hz for the lower curve. (C) Autocorrelation, C_xx(t₁, t₁+τ), for the discrete system. (D) Autocorrelation, C_xx(t₁, t₁+τ), for the continuous system. (E) Average autocorrelation, C^ave_xx(τ, T), for the discrete system. (F) Average autocorrelation, C^ave_xx(τ, T), for the continuous system.
system, all allowed states collapse to a single rate at large times, so the variance in rate approaches zero. For both leaky processes, the trial-averaged value of the rate at a later time, t₂, given its value, r₁, at an earlier time, t₁, is no longer constant but is given by the decay to the attractor state as

$$ \langle r(t_2)|r_1, t_1\rangle = r_A + (r_1 - r_A)\exp\left[\frac{-(t_2 - t_1)}{\tau_L}\right]. \qquad (5.5) $$

This allows us to find the autocorrelation function for both systems, using equation 2.3:

$$ C_{xx}(t_1, t_1+\tau) = \langle r_1^2(t_1)\rangle\exp\left(\frac{-\tau}{\tau_L}\right) + \langle r_1(t_1)\rangle\, r_A\left[1 - \exp\left(\frac{-\tau}{\tau_L}\right)\right] - \langle r_1(t_1)\rangle\langle r_2(t_1+\tau)\rangle = \mathrm{Var}[r(t_1)]\exp\left(\frac{-\tau}{\tau_L}\right). \qquad (5.6) $$

So for the continuous, OU process, we have

$$ C_{xx}(t_1, t_1+\tau) = \frac{\alpha\tau_L}{2}\left[1 - \exp\left(\frac{-2t_1}{\tau_L}\right)\right]\exp\left(\frac{-\tau}{\tau_L}\right), \qquad (5.7) $$
$$ C^{\mathrm{ave}}_{xx}(\tau, T) = \frac{\alpha\tau_L}{2}\exp\left(\frac{-\tau}{\tau_L}\right)\left\{1 - \frac{\tau_L}{2(T-\tau)}\left[1 - \exp\left(\frac{-2(T-\tau)}{\tau_L}\right)\right]\right\}, $$

whereas for the discrete leaky system, we have

$$ C_{xx}(t_1, t_1+\tau) = \alpha t_1\exp\left(\frac{-2t_1}{\tau_L}\right)\exp\left(\frac{-\tau}{\tau_L}\right), \qquad (5.8) $$
$$ C^{\mathrm{ave}}_{xx}(\tau, T) = \frac{\alpha\tau_L}{2}\exp\left(\frac{-\tau}{\tau_L}\right)\left\{\frac{\tau_L}{2(T-\tau)}\left[1 - \exp\left(\frac{-2(T-\tau)}{\tau_L}\right)\right] - \exp\left(\frac{-2(T-\tau)}{\tau_L}\right)\right\}. $$

The results are plotted in Figures 5C to 5F. Note that the averaged autocorrelation function for the OU process (see Figure 5F) has a linear decay to zero at small times, like the full random walk, but has an exponential decay with a longer measurement period.
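As a consistency check on the averaged form for the OU case, the time average of C_xx(t₁, t₁+τ) over t₁ ∈ [0, T − τ] can be compared with the closed form for C^ave_xx(τ, T). The sketch below writes out both expressions explicitly, with illustrative parameter values:

```python
import numpy as np

# Hedged consistency check: averaging C_xx(t1, t1+tau) for the OU process
# over t1 in [0, T - tau] reproduces the closed form for C_xx^ave(tau, T).
alpha, tau_L = 50.0, 2.0

def c_xx(t1, tau):
    return 0.5 * alpha * tau_L * (1.0 - np.exp(-2.0 * t1 / tau_L)) * np.exp(-tau / tau_L)

def c_ave(tau, T):
    s = T - tau
    return (0.5 * alpha * tau_L * np.exp(-tau / tau_L)
            * (1.0 - tau_L / (2.0 * s) * (1.0 - np.exp(-2.0 * s / tau_L))))

T, tau = 8.0, 1.5
t1 = np.linspace(0.0, T - tau, 100_001)
avg_numeric, closed = np.mean(c_xx(t1, tau)), c_ave(tau, T)
print(avg_numeric, closed)   # agree to high accuracy
```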
We use equation 3.15 to calculate the Fano factors for the two types of leaky integrator. For the discrete system, we have

$$ F(t) = 1 + \frac{\alpha\tau_L^2\left\{t\exp(-2t/\tau_L) + \frac{\tau_L}{2}\left[1 - \exp(-t/\tau_L)\right]\left[1 - 3\exp(-t/\tau_L)\right]\right\}}{r_A t + \tau_L(r_0 - r_A)\left[1 - \exp(-t/\tau_L)\right]}, \qquad (5.9) $$

and for the continuous system

$$ F(t) = 1 + \frac{\frac{\alpha\tau_L^2}{2}\left\{2t - 4\tau_L\left[1 - \exp(-t/\tau_L)\right] + \tau_L\left[1 - \exp(-2t/\tau_L)\right]\right\}}{r_A t + \tau_L(r_0 - r_A)\left[1 - \exp(-t/\tau_L)\right]}. \qquad (5.10) $$

Figure 5B demonstrates that while the Fano factors for both processes begin by increasing quadratically together, for the discrete process a maximum occurs on a timescale of the order of τ_L, after which the Fano factor decays as 1/t back toward a value of one. In contrast, the Fano factor for the OU process rises to an asymptotic value of 1 + ατ_L²/r_A (a value of 5 in Figure 5B).

6 Distribution of Interspike Intervals

We have shown that whereas the second-order statistics of the two types of process can be identical, a difference does occur in the fourth-order statistics. The distributions of ISIs for the two processes will have identical second-order moments but will differ in their higher-order moments and hence have a different overall shape. In this section, we calculate the ISI distribution for a continuous attractor versus a set of discrete attractors and highlight their differences. For a Poisson process, following a spike at time t₁, the interspike interval, τs(t₁), has a probability distribution, P(τs), given by

$$ P(\tau_s)\,d\tau_s = \exp\left[-\int_{t_1}^{t_1+\tau_s} r(t')\,dt'\right] r(t_1+\tau_s)\,d\tau_s. \qquad (6.1) $$
For a nonstationary process, the quantity in the exponent, $\int_{t_1}^{t_1+\tau_s} r(t')\,dt' = \lambda(t_1, t_1+\tau_s)$, has a probability distribution that depends not only on the time interval, τs, but also on the initial time, t₁. For a random walk and an OU process, the probability distribution of λ is gaussian (see appendix D) and can be calculated if the rate is known at the start of the interval, r₁(t₁), and at the end of the interval, r₂(t₂).
For such continuous random walks, the probability distribution for r₂ is known,

$$ P(r_2, t_1+\tau_s|r_1, t_1) = \frac{1}{\sqrt{2\pi\alpha\tau_s}}\exp\left[-\frac{(r_2 - r_1)^2}{2\alpha\tau_s}\right], \qquad (6.2) $$
and the probability distribution for r₁, given a spike occurring at t₁, is, by Bayes' rule,

$$ P(r_1, t_1|t_{sp} = t_1) = P(r_1, t_1)\times P(t_{sp} = t_1|r_1, t_1)/P(t_{sp} = t_1) \qquad (6.3) $$
$$ = \frac{1}{\sqrt{2\pi\alpha t_1}}\exp\left[-\frac{(r_1 - r_0)^2}{2\alpha t_1}\right]\frac{r_1}{\langle r(t_1)\rangle}. \qquad (6.4) $$
We have used the notation t_sp for the time of one spike in the spike train, so, for example, P(t_sp = t₁) is the probability of a spike at time t₁. We can integrate over all probability distributions to find the distribution of interspike intervals for a random walk with a Poisson spike train:

$$ P(\tau_s|t_{sp} = t_1) = \int dr_1\, P(r_1|t_{sp} = t_1)\int dr_2\, P(r_2, t_1+\tau_s|r_1, t_1)\, r_2\int d\lambda\,\exp(-\lambda)\, P(\lambda|r_1, t_1; r_2, t_2) \qquad (6.5) $$
$$ = r_0\exp\left[-r_0\tau_s + \frac{\alpha\tau_s^2 t_1}{2} + \frac{\alpha\tau_s^3}{6}\right]\left[\left(1 - \frac{\alpha\tau_s t_1}{r_0}\right)^2 + \frac{\alpha t_1}{r_0^2} - \frac{\alpha\tau_s^2}{2r_0}\left(1 - \frac{\alpha\tau_s t_1}{r_0}\right)\right]. \qquad (6.6) $$
For a random walk of total duration T, the probability distribution for ISIs is given by the integral of the above expression over t₁, up to a maximal time of T − τs, normalized by the expected number of ISIs, r₀T − 1, which yields

$$ P(\tau_s) = \frac{r_0\exp\left[-r_0\tau_s + \alpha\tau_s^3/6\right]}{r_0 T - 1}\,\frac{2}{\alpha\tau_s^2}\Bigg(\left\{\exp\left[\alpha\tau_s^2(T-\tau_s)/2\right] - 1\right\}\left[1 - \frac{\alpha\tau_s^2}{2r_0} + \frac{6}{r_0^2\tau_s^2} + \frac{4}{r_0\tau_s} - \frac{\alpha\tau_s}{r_0^2}\right] \qquad (6.7) $$
$$ \quad + \exp\left[\alpha\tau_s^2(T-\tau_s)/2\right]\left[\frac{\alpha^2\tau_s^2(T-\tau_s)^2}{r_0^2} + \frac{\alpha^2\tau_s^3(T-\tau_s)}{2r_0^2} - \frac{2\alpha\tau_s(T-\tau_s)}{r_0} - \frac{3\alpha(T-\tau_s)}{r_0^2}\right]\Bigg). $$
When the spread in rates is significantly smaller than the average rate, r₀, the probability distribution is dominated by the initial term, r₀ exp(−r₀τs), which is the result for a static process. The result, equation 6.7, is limited to small τs ≪ r₀/(αT), because it is for an unbounded random walk, where the rate is unconstrained and unphysical contributions from negative rates can dominate the result for large τs. So we extend the result to the leaky integrator, for which the rates are more constrained, allowing a wider range of validity in τs. For the OU process, the resulting ISI distribution yields (see Figure 6D)

$$ P(\tau_s|t_{sp} = t_1) = \frac{\exp(-r_A\tau_s)}{\bar r_1}\exp\left[-(\bar r_1 - r_A) f(\tau_s)\left(1 + e^{-\tau_s/\tau_L}\right)\right]\exp\left[\frac{-\sigma^2_{\lambda,\tau_s}}{2}\right]\exp\left[\frac{-\sigma^2_{r,t_1} f^2(\tau_s)\left(e^{-\tau_s/\tau_L} + 1\right)^2}{2}\right] \qquad (6.8) $$
$$ \quad\times\left\{e^{-\tau_s/\tau_L}\left[\bar r_1 - f(\tau_s)\sigma^2_{r,t_1}\left(e^{-\tau_s/\tau_L} + 1\right)\right]^2 + \sigma^2_{r,t_1} e^{-\tau_s/\tau_L} + \left[\bar r_1 - f(\tau_s)\sigma^2_{r,t_1}\left(e^{-\tau_s/\tau_L} + 1\right)\right]\left[r_A\left(1 - e^{-\tau_s/\tau_L}\right) - \sigma^2_{r,\tau_s} f(\tau_s)\right]\right\}, $$
where f(τs) = τ_L tanh[−τs/(2τ_L)] (see equations D.6 and D.7), r̄₁ = ⟨r(t₁)⟩ (see equation 5.2), and σ²_{λ,τs} = (ατ_L²/2){2τs − 4τ_L[1 − exp(−τs/τ_L)] + τ_L[1 − exp(−2τs/τ_L)]} (see equation 5.10). For the discrete random walk, we do not have the full distribution of λ, but we can make progress by assuming that at most one transition between states occurs in the interspike interval. Hence, our calculation is to first order in τs/τx. We use the probability distribution of the firing rate, r₁, at time t, given by

$$ P(r_1 = r_0 + n\Delta r,\, t) = \sum_{N=|n|}^{\infty}\frac{1}{2^N}\exp\left(\frac{-t}{\tau_x}\right)\left(\frac{t}{\tau_x}\right)^N\frac{1}{\left(\frac{N+n}{2}\right)!\left(\frac{N-n}{2}\right)!}, \qquad (6.9) $$

where the sum runs over values of N with the same parity as n.
Assuming at most one transition, the probability of remaining at the same rate at the end of the ISI is P₀ = exp(−τs/τx), and the probability of increasing or decreasing by Δr is P±1 = [1 − exp(−τs/τx)]/2. In the case of no transition, ⟨exp(−λ)⟩ = exp(−r₁τs). With a single transition, the jump in rates is equally likely to occur at any time in the ISI, so we can evaluate ⟨exp(−λ)⟩ by calculating exp(−λ) as a function of the time of the transition and integrating across transition times. These results can be used in equation 6.5. For a jump down in rates within τs to a lower rate, r₁ − Δr,

$$ \langle\exp(-\lambda)\rangle = \exp(-r_1\tau_s)\,\frac{\exp(\Delta r\,\tau_s) - 1}{\Delta r\,\tau_s}. \qquad (6.10) $$
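Equation 6.10 follows from averaging exp(−λ) over a transition time distributed uniformly within the ISI. A minimal numerical sketch (all values illustrative; the transition is taken to the lower rate r₁ − Δr, the convention under which the (exp(Δr τs) − 1) form arises):

```python
import numpy as np

# Hedged sketch: numerical check of equation 6.10. A single transition to the
# lower rate r1 - dr occurs at a time u distributed uniformly within the ISI;
# averaging exp(-lambda) over u gives the closed form.
r1, dr, tau_s = 100.0, 80.0, 0.02
u = np.linspace(0.0, tau_s, 200_001)            # transition time within the ISI
lam = r1 * u + (r1 - dr) * (tau_s - u)          # integrated rate for a jump at u
numeric = np.exp(-lam).mean()
closed = np.exp(-r1 * tau_s) * (np.exp(dr * tau_s) - 1.0) / (dr * tau_s)
print(numeric, closed)   # the two agree
```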
We can then evaluate the ISI distribution using

$$ P(\tau_s, t_1) = \int dr_1\,\frac{r_1\, P(r_1, t_1)}{\langle r(t_1)\rangle}\int dr_2\, r_2\, P(r_2, t_2|r_1, t_1)\,\langle\exp(-\lambda)\rangle \qquad (6.11) $$

and allowing only r₂ = r₁ or r₂ = r₁ ± Δr. The resulting ISI distribution is given by

$$ P(\tau_s|t_{sp} = t_1) = \frac{\exp(-r_0\tau_s)}{r_0}\exp\left[\frac{t_1}{\tau_x}\left(\cosh(\Delta r\,\tau_s) - 1\right)\right] \qquad (6.12) $$
$$ \quad\times\Bigg(\left[\left(r_0 - \frac{t_1}{\tau_x}\Delta r\sinh(\Delta r\,\tau_s)\right)^2 + \frac{t_1}{\tau_x}(\Delta r)^2\cosh(\Delta r\,\tau_s)\right]\left[1 + \frac{\tau_s}{\tau_x}\left(\frac{\sinh(\Delta r\,\tau_s)}{\Delta r\,\tau_s} - 1\right)\right] $$
$$ \quad + \frac{1 - \cosh(\Delta r\,\tau_s)}{\tau_x}\left[r_0 - \frac{t_1}{\tau_x}\Delta r\sinh(\Delta r\,\tau_s)\right]\Bigg). $$

Note the similarity in form to the continuous random walk result (see equation 6.6), which the above formula reproduces in the limit Δr, τx → 0 with (Δr)²/τx = α. A similar calculation can be used to evaluate the ISI distribution for the leaky discrete integrator (see Figure 6C). To first order in τs/τx, we find
$$ P(\tau_s|t_{sp} = t_1) = \frac{\exp(-\bar r_1\tau_s)}{\bar r_1}\exp\left[\frac{t_1}{\tau_x}\left(\cosh(\Delta r_\tau) - 1\right)\right] \qquad (6.13) $$
$$ \quad\times\Bigg(\left\{\left[\bar r_1 - \frac{t_1}{\tau_x} e^{-t_1/\tau_L}\Delta r\sinh(\Delta r_\tau)\right]\left[\bar r_1 - \frac{t_1}{\tau_x} e^{-t_1/\tau_L} e^{-\tau_s/\tau_L}\Delta r\sinh(\Delta r_\tau)\right] + \frac{t_1}{\tau_x} e^{-2t_1/\tau_L} e^{-\tau_s/\tau_L}(\Delta r)^2\cosh(\Delta r_\tau)\right\} $$
$$ \quad\times\left\{1 + \frac{\tau_s}{\tau_x}\left[\frac{1}{2}\left(\langle\exp(-d\lambda)\rangle + \langle\exp(+d\lambda)\rangle\right) - 1\right]\right\} $$
$$ \quad + \frac{\tau_s}{\tau_x}\,\Delta r\, e^{-t_1/\tau_L} e^{-\tau_s/\tau_L}\left[\bar r_1 - \frac{t_1}{\tau_x} e^{-t_1/\tau_L}\Delta r\sinh(\Delta r_\tau)\right]\frac{1}{2}\left(\langle\exp(-d\lambda)\rangle - \langle\exp(+d\lambda)\rangle\right)\Bigg), $$
Figure 6: ISI distributions. (A, B) Data from computer simulations of the bounded system. Four sets of data, each the average of 5000 trials. The discrete system has states at 20 Hz, 100 Hz, and 180 Hz, with τx = 16 s. The continuous system is matched in average rate and variance of rate (see text). (A) Log of ISI probability. (B) ISI probability with the result for a constant rate subtracted. (C) Analytic results for a leaky discrete integrator, τ_L = 10 s, Δr = 10 Hz, and τx = 10 s. Black: r(0) = 50 Hz; gray: r(0) = 10 Hz. (D) Analytic results for a leaky continuous integrator, τ_L = 10 s, α = 10 s⁻³. Black: r(0) = 50 Hz; gray: r(0) = 10 Hz.
where we have defined

$$ \Delta r_\tau \equiv \Delta r\,\tau_L\, e^{-t_1/\tau_L}\left[1 - \exp\left(\frac{-\tau_s}{\tau_L}\right)\right], $$
$$ \langle\exp(-d\lambda)\rangle \equiv \frac{\tau_L}{\tau_s}\exp\left(\Delta r\,\tau_L e^{-t_1/\tau_L} e^{-\tau_s/\tau_L}\right)\left[E_1\left(\Delta r\,\tau_L e^{-t_1/\tau_L} e^{-\tau_s/\tau_L}\right) - E_1\left(\Delta r\,\tau_L e^{-t_1/\tau_L}\right)\right], \qquad (6.14) $$
$$ \langle\exp(+d\lambda)\rangle \equiv \frac{\tau_L}{\tau_s}\exp\left(-\Delta r\,\tau_L e^{-t_1/\tau_L} e^{-\tau_s/\tau_L}\right)\left[E_1\left(-\Delta r\,\tau_L e^{-t_1/\tau_L} e^{-\tau_s/\tau_L}\right) - E_1\left(-\Delta r\,\tau_L e^{-t_1/\tau_L}\right)\right]. $$
In equation 6.14, the E₁ function is defined by the integral $E_1(x) = \int_1^{\infty} dt\,\exp(-xt)/t$. We compare discrete and continuous processes in Figures 6A and 6B by numerically evaluating the ISI distributions. In all cases, we calculate the mean rate from the total number of spikes and subtract the ISI distribution expected for a Poisson process at this constant mean rate, ⟨r⟩. We correct for a finite measurement interval, so we subtract

$$ P_{\mathrm{const}}(\tau_s, T) = \exp(-\langle r\rangle\tau_s)\,\frac{\langle r\rangle(T - \tau_s)}{\langle r\rangle T - 1}. \qquad (6.15) $$
The term T − τs appears as the integration limit (as no ISIs longer than the measurement time are possible), and the denominator is the total number of ISIs (the number of spikes minus one). The main contribution to the ISI distribution for Poisson processes is the sum of exponentials, exp(−rτs), with a distribution P(r). When we subtract an exponential of the form exp(−⟨r⟩τs), we find extra ISIs at low and high τs and a minimum near τs = 1/⟨r⟩, whose depth increases with the variance in rates (see Figures 6B to 6D). On a logarithmic plot of the ISI frequency, we find an initially steep gradient that becomes shallower for larger τs (see Figure 6A), as is expected for the sum of many exponentials. Since the firing rate in the discrete system of attractors is more likely to move far from its initial rate in a short space of time, the occurrence of a low firing rate and correspondingly long ISIs is more prevalent (though still rare). Figure 6A demonstrates such an excess of long ISIs in the discrete system over the excess seen in the continuous system. A strong difference is seen between the ISI deviations of the two leaky systems (Figure 6D) because the variance in rate remains high for the continuous (OU) system, leading to increasing ISI deviations with time (Figures 6C and 6D). On the other hand, the variance in rates of the discrete system eventually falls to zero, so the ISI deviation diminishes at longer measurement intervals (Figure 6C).

7 Discussion

We have compared two types of doubly stochastic Poisson point processes. The processes are doubly stochastic because neuronal spikes are emitted stochastically with a probability proportional to an underlying rate, which itself varies stochastically. We find that key statistical features of the spike trains averaged over many trials are identical. While the underlying rate can vary continuously or switch between discrete values, the autocorrelation functions and the Fano factors are identical.
This is because both statistical measures depend on the second moment of the underlying rate as a function of time, which in both cases increases linearly with time. Similarly, as a result of the identical time dependence of their Fano factors, the two processes have ISI distributions with identical second moments (Saleh, 1978). This leads to a difficulty
in distinguishing a continuously varying rate (such as that required for analog memory storage, or in steady ramping activity) from a discretely jumping rate, where the jump times are stochastic but give the same behavior for the average rate. Only fourth-order statistics can distinguish the two cases, but in practice, calculations of such high-order statistics contain too much random error to be useful. The results presented here emphasize that single neuronal spike trains do not contain enough information to differentiate continuous attractor from discrete attractor networks, unless the jump in rates between discrete states is very large or many thousands of seconds of data are recorded. Since the system could change over many thousands of seconds, simultaneous recordings of multiple neurons involved in the network are probably needed to distinguish the two types of attractor systems. Working memory systems have been proposed that are based on either a continuous attractor (Seung et al., 2000; Miller et al., 2003) or a set of discrete attractors (Koulakov et al., 2002). The goal of these model systems, like integrator networks, is to produce neurons whose average firing rate is constant during the delay, when no stimulus is present. Both discrete and continuous systems exhibit the unusual property of an autocorrelation function that depends only on the time interval of comparison: it depends linearly on the time lag, τ (Ben-Yishai et al., 1995; Miller & Wang, in press), and increases with the measurement interval, T (Lewis, Gebber, Larsen, & Barman, 2001; Constantinidis & Goldman-Rakic, 2002). Such power-law behavior relies on noise fluctuations being integrated, so it does not occur if the noise in a discrete system is insufficient to cause the network to change states (Miller & Wang, in press). In the analysis in this article, the discrete system does have stochastic transitions between states. We find that just one parameter, the average lifetime between transitions, needs to be adjusted to match the behavior of the discrete system with the continuous one. Cross-correlation functions include a term that is proportional to the product of the gradients of the tuning curves of the two neurons, as has been pointed out by Pouget (Pouget et al., 1998) and others (Ben-Yishai et al., 1995; C. D. Brody, personal communication, 2004). In general, two terms occur in all correlation functions. An initial, constant term arises from fluctuations during the stimulus. A second term is linear in the time of measurement, such as the delay time for a working memory task, due to integration of noise during the memory period. Such behavior is unusual, as correlation functions typically decay with time, but in memory systems affected by noise, the correlations can last for the same duration as the memory of a stimulus. An unusual result is also found for the Fano factors, which increase quadratically with time for both systems (Gaspard & Wang, 1993). The Fano factor is a measure of the trial-to-trial variability in spike times. We see here that the Fano factor is the sum of two terms. The commonly observed term is a constant number due to the variability in the spike generation process
Paul Miller
when the underlying rates are stable. The second term, which can vary in time, contains the effects of trial-to-trial variations in the underlying rates. Fano factors that increase as a power law have been observed experimentally in neurons responding to vision (Teich, Heneghan, Lowen, Ozaki, & Kaplan, 1997; Baddeley et al., 1997) and in the auditory system (Turcott et al., 1994). Such power-law behavior has been considered in terms of optimal encoding of natural stimuli, but in the working memory systems we consider here, it arises from the internal dynamics of noise-driven fluctuations in an attractor network. If, as in real systems, the firing rates are bounded, the variance in rates cannot increase inexorably with time, but reaches a constant value on a timescale of the order of the leak time or the time for the rate to cross between the boundaries. The correlation functions become exponential on this timescale, and the Fano factors approach a constant value after a quadratic rise at small times. A result for rigidly bounded systems that can be tested in real experimental data is that the trial-to-trial variance, as well as the mean, firing rate behaves differently depending on the starting point. Like a leaky integrator, the mean rate drifts toward one value (the midpoint between the boundaries). However, unlike the leaky integrator, for a system with rigid boundaries for the firing rate, the variance in rate increases more slowly when the initial rate is near a boundary (see Figure 4B). Such behavior of both mean and variance in the firing rate, if seen in real data, would be strong evidence for such a bounded system of attractors.

Appendix A: Moments of Rate and Spike Count

We find the moments of rate for a system with discrete hopping by combining the Poisson distribution for the expected number of transitions, M, in time T,

$$P(M|T) = \frac{1}{M!}\left(\frac{T}{\tau_x}\right)^{M} e^{-T/\tau_x}, \tag{A.1}$$
with the binomial distribution for the distribution of rates after M hops:

$$P(r = r_0 - M\Delta r + 2n\Delta r \,|\, M) = \frac{M!}{n!(M-n)!}\, p^{n} (1-p)^{M-n}. \tag{A.2}$$
This allows us to calculate the following moments of the rate:

$$\langle r(t)\rangle = r_0, \tag{A.3}$$
$$\langle r(t)^2\rangle = r_0^2 + \alpha t, \tag{A.4}$$
$$\langle r(t)^3\rangle = r_0^3 + 3 r_0 \alpha t, \tag{A.5}$$
$$\langle r(t)^4\rangle = r_0^4 + 6 r_0^2 \alpha t + 3\alpha^2 t^2 + (\Delta r)^2 \alpha t, \tag{A.6}$$

where we have written $\alpha = (\Delta r)^2/\tau_x$. Notably, the variance increases linearly in time,

$$\langle r(t)^2\rangle - \langle r(t)\rangle^2 = \alpha t, \tag{A.7}$$

and the fourth-order cumulant is nonzero only when the gaps between states are discrete:

$$\langle r^4\rangle - 4\langle r^3\rangle\langle r\rangle - 3\langle r^2\rangle^2 + 12\langle r^2\rangle\langle r\rangle^2 - 6\langle r\rangle^4 = (\Delta r)^2 \alpha t. \tag{A.8}$$
For a Poisson spiking process, the probability of N spikes in time T is given by

$$P(N|T) = \frac{\lambda^{N}}{N!}\, e^{-\lambda}, \tag{A.9}$$

where $\lambda = \int_0^T r(t)\,dt$. This allows us to evaluate the moments of N in terms of $\lambda$:

$$\langle N\rangle = \sum_N \frac{\lambda^{N}}{N!}\, N\, e^{-\lambda} = \lambda,$$
$$\langle N^2\rangle = \sum_N \frac{\lambda^{N}}{N!}\, N^2\, e^{-\lambda} = \lambda^2 + \lambda,$$
$$\langle N^3\rangle = \sum_N \frac{\lambda^{N}}{N!}\, N^3\, e^{-\lambda} = \lambda^3 + 3\lambda^2 + \lambda,$$
$$\langle N^4\rangle = \sum_N \frac{\lambda^{N}}{N!}\, N^4\, e^{-\lambda} = \lambda^4 + 6\lambda^3 + 7\lambda^2 + \lambda. \tag{A.10}$$

For a process where the average rate is constant but the variance increases linearly with time as $\alpha t$, we can evaluate moments of $\lambda$ in terms of moments of the rate by taking the time derivative, using (Gillespie, 1992):

$$\frac{d\langle\lambda^{m}\rangle}{dt} = m\,\langle\lambda^{m-1} r(t)\rangle,$$
$$\frac{d\langle\lambda^{m} r^{k}(t)\rangle}{dt} = m\,\langle\lambda^{m-1} r^{k+1}(t)\rangle + \alpha\,\frac{k!}{2!\,(k-2)!}\,\langle\lambda^{m} r^{k-2}(t)\rangle. \tag{A.11}$$
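The Poisson moments in equation A.10 can be confirmed by direct summation of the series. The helper below is a minimal sketch (the pmf is built by recurrence, which avoids overflowing the factorial at large counts):

```python
import math

def poisson_raw_moment(lam, k, nmax=200):
    """k-th raw moment of a Poisson(lam) count, summing eq. A.9 directly."""
    p = math.exp(-lam)      # P(N = 0)
    total = 0.0
    for n in range(nmax):
        total += n**k * p
        p *= lam / (n + 1)  # recurrence: P(N = n+1) from P(N = n)
    return total

lam = 7.3   # arbitrary test value
print(poisson_raw_moment(lam, 2), lam**2 + lam)
print(poisson_raw_moment(lam, 3), lam**3 + 3*lam**2 + lam)
print(poisson_raw_moment(lam, 4), lam**4 + 6*lam**3 + 7*lam**2 + lam)
```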
This leads to:

$$\frac{d\langle\lambda\rangle}{dt} = \langle r(t)\rangle, \qquad \frac{d^2\langle\lambda^2\rangle}{dt^2} = 2\langle r^2(t)\rangle, \qquad \frac{d^3\langle\lambda^3\rangle}{dt^3} = 6\langle r^3(t)\rangle + 6\alpha\langle\lambda\rangle,$$
$$\frac{d^4\langle\lambda^4\rangle}{dt^4} = 24\langle r^4(t)\rangle + 48\,\alpha t\,\langle r^2(t)\rangle + 48\, r_0\, \alpha t\,\langle r(t)\rangle. \tag{A.12}$$

Using equations A.3 to A.6, we can then evaluate:

$$\langle\lambda\rangle = r_0 t,$$
$$\langle\lambda^2\rangle = r_0^2 t^2 + \alpha t^3/3,$$
$$\langle\lambda^3\rangle = r_0^3 t^3 + r_0 \alpha t^4,$$
$$\langle\lambda^4\rangle = r_0^4 t^4 + 2 r_0^2 \alpha t^5 + \alpha^2 t^6/3 + (\Delta r)^2 \alpha t^5/5.$$

Combining the results for the moments of $\lambda$ with the results for the moments of N yields

$$\langle N(t)\rangle = r_0 t,$$
$$\langle N(t)^2\rangle = r_0^2 t^2 + r_0 t + \alpha t^3/3,$$
$$\langle N(t)^3\rangle = r_0^3 t^3 + r_0 \alpha t^4 + 3 r_0^2 t^2 + \alpha t^3 + r_0 t,$$
$$\langle N(t)^4\rangle = r_0^4 t^4 + 2 r_0^2 \alpha t^5 + \alpha^2 t^6/3 + (\Delta r)^2 \alpha t^5/5 + 6 r_0^3 t^3 + 6 r_0 \alpha t^4 + 7 r_0^2 t^2 + 7\alpha t^3/3 + r_0 t. \tag{A.13}$$
So the Fano factor is given by

$$\frac{\langle N(t)^2\rangle - \langle N(t)\rangle^2}{\langle N(t)\rangle} = 1 + \frac{\alpha t^2}{3 r_0}, \tag{A.14}$$

and we can find a combination of moments up to fourth order in N that depends on only the gap in rates between states:

$$\frac{\langle N^4\rangle - 3\langle N^2\rangle^2 + 2\langle N\rangle^4 - 6\langle N^3\rangle + 6\langle N^2\rangle\langle N\rangle + 11\langle N^2\rangle - 3\langle N\rangle^2 - 6\langle N\rangle}{3\langle N^2\rangle - 3\langle N\rangle^2 - 3\langle N\rangle} = \frac{(\Delta r)^2 t^2}{5}. \tag{A.15}$$
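Because equations A.13 to A.15 are polynomial identities in t, they can be checked by substituting arbitrary numbers. The snippet below (a sketch, with one hypothetical parameter set) does exactly that:

```python
r0, dr, tau_x, t = 5.0, 2.0, 1.5, 0.7   # hypothetical parameters
alpha = dr**2 / tau_x
# Moments of N(t) from equation A.13
N1 = r0*t
N2 = r0**2*t**2 + r0*t + alpha*t**3/3
N3 = r0**3*t**3 + r0*alpha*t**4 + 3*r0**2*t**2 + alpha*t**3 + r0*t
N4 = (r0**4*t**4 + 2*r0**2*alpha*t**5 + alpha**2*t**6/3 + dr**2*alpha*t**5/5
      + 6*r0**3*t**3 + 6*r0*alpha*t**4 + 7*r0**2*t**2 + 7*alpha*t**3/3 + r0*t)
fano = (N2 - N1**2) / N1
print(fano, 1 + alpha*t**2/(3*r0))      # equation A.14
combo = (N4 - 3*N2**2 + 2*N1**4 - 6*N3 + 6*N2*N1 + 11*N2 - 3*N1**2 - 6*N1) \
        / (3*N2 - 3*N1**2 - 3*N1)
print(combo, dr**2 * t**2 / 5)          # equation A.15
```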
To calculate the full spike count distribution, P(N), we need to know the full distribution, $P[\lambda(t)]$. For the continuous random walk and continuous leaky integrator, we have a gaussian distribution for $\lambda$ with mean $\langle\lambda\rangle$ and variance $\sigma_\lambda^2$ (see appendix D), to give:

$$P(N) = \int d\lambda\, P(N|\lambda)\, P(\lambda)$$
$$= \frac{1}{\sqrt{2\pi\sigma_\lambda^2}}\int d\lambda\, \frac{\lambda^{N}}{N!}\exp(-\lambda)\exp\left[\frac{-(\lambda-\langle\lambda\rangle)^2}{2\sigma_\lambda^2}\right]$$
$$= \exp(-\langle\lambda\rangle)\exp\left(\frac{\sigma_\lambda^2}{2}\right)\sum_{k=0}^{\lfloor N/2\rfloor}\frac{(2k-1)!!}{(2k)!\,(N-2k)!}\,\sigma_\lambda^{2k}\left(\langle\lambda\rangle-\sigma_\lambda^2\right)^{N-2k}$$
$$= \exp(-\langle\lambda\rangle)\exp\left(\frac{\sigma_\lambda^2}{2}\right)\sum_{k=0}^{\lfloor N/2\rfloor}\frac{1}{k!\,(N-2k)!}\left(\frac{\sigma_\lambda^2}{2}\right)^{k}\left(\langle\lambda\rangle-\sigma_\lambda^2\right)^{N-2k}. \tag{A.16}$$
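Equation A.16 can be validated against a brute-force integral of the Poisson distribution over a gaussian $\lambda$. The sketch below assumes a mean well above the standard deviation, so that the (formally unbounded) gaussian puts negligible weight on negative $\lambda$; the parameter values are hypothetical:

```python
import math

def pN_closed(N, mean, var):
    """Closed-form spike count distribution of equation A.16."""
    s = sum((var/2)**k * (mean - var)**(N - 2*k)
            / (math.factorial(k) * math.factorial(N - 2*k))
            for k in range(N//2 + 1))
    return math.exp(-mean + var/2) * s

def pN_numeric(N, mean, var, npts=20001):
    """Brute-force integral of Poisson(N | lam) over a gaussian P(lam)."""
    sd = math.sqrt(var)
    lo, hi = max(mean - 10*sd, 1e-12), mean + 10*sd
    h = (hi - lo) / (npts - 1)
    total = 0.0
    for i in range(npts):
        lam = lo + i*h
        w = h * (0.5 if i in (0, npts - 1) else 1.0)   # trapezoid weights
        pois = math.exp(N*math.log(lam) - lam - math.lgamma(N + 1))
        gauss = math.exp(-(lam - mean)**2 / (2*var)) / math.sqrt(2*math.pi*var)
        total += w * pois * gauss
    return total

mean, var = 25.0, 9.0   # hypothetical mean and variance of lambda
for N in (18, 25, 31):
    print(N, pN_closed(N, mean, var), pN_numeric(N, mean, var))
```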
Appendix B: Small Numbers of Discrete States

We summarize here the conditional probabilities of the firing rate for systems with two, three, four, or five discrete states. We assume a Poisson distribution for the number of transitions in time, t (see equation A.1), and that transitions are equally likely to a state of higher rate as to a state of lower rate, unless the system is in a boundary state. For the two-state system, both states are boundary states. For the two-state system, we have

$$P[r_0+\Delta r/2,\, t+\tau\,|\, r_0+\Delta r/2,\, t] = \tfrac{1}{2}\left[1+\exp\left(\tfrac{-2\tau}{\tau_x}\right)\right],$$
$$P[r_0-\Delta r/2,\, t+\tau\,|\, r_0+\Delta r/2,\, t] = \tfrac{1}{2}\left[1-\exp\left(\tfrac{-2\tau}{\tau_x}\right)\right], \tag{B.1}$$

and by symmetry,

$$P[r_0-\Delta r/2,\, t+\tau\,|\, r_0-\Delta r/2,\, t] = \tfrac{1}{2}\left[1+\exp\left(\tfrac{-2\tau}{\tau_x}\right)\right],$$
$$P[r_0+\Delta r/2,\, t+\tau\,|\, r_0-\Delta r/2,\, t] = \tfrac{1}{2}\left[1-\exp\left(\tfrac{-2\tau}{\tau_x}\right)\right]. \tag{B.2}$$
For the other systems, we omit the symmetrically identical results for brevity.
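The closed forms above can be recovered from the master equation $dP/dt = QP$ with total hop rate $1/\tau_x$. The sketch below (Python, assuming a simple Euler integration with a fine step) compares the numerically integrated two-state occupancy with equation B.1:

```python
import math

def two_state_stay_prob(tau, tau_x, steps=100000):
    """Euler-integrate dP/dt = QP for the two-state hopping model
    (hop rate 1/tau_x) and return P(still in the initial state)."""
    p_same, p_other = 1.0, 0.0
    dt = tau / steps
    for _ in range(steps):
        flow_same = (p_other - p_same) * dt / tau_x
        flow_other = (p_same - p_other) * dt / tau_x
        p_same += flow_same
        p_other += flow_other
    return p_same

tau, tau_x = 0.8, 0.5   # hypothetical lag and hop time constant
print(two_state_stay_prob(tau, tau_x),
      0.5 * (1 + math.exp(-2 * tau / tau_x)))   # equation B.1
```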
For the three-state system, we have

$$P[r_0,\, t+\tau\,|\, r_0,\, t] = \tfrac{1}{2}\left[1+\exp\left(\tfrac{-2\tau}{\tau_x}\right)\right],$$
$$P[r_0\pm\Delta r,\, t+\tau\,|\, r_0,\, t] = \tfrac{1}{4}\left[1-\exp\left(\tfrac{-2\tau}{\tau_x}\right)\right], \tag{B.3}$$

and

$$P[r_0+\Delta r,\, t+\tau\,|\, r_0+\Delta r,\, t] = \tfrac{1}{4}\left[1+2\exp\left(\tfrac{-\tau}{\tau_x}\right)+\exp\left(\tfrac{-2\tau}{\tau_x}\right)\right],$$
$$P[r_0,\, t+\tau\,|\, r_0+\Delta r,\, t] = \tfrac{1}{2}\left[1-\exp\left(\tfrac{-2\tau}{\tau_x}\right)\right],$$
$$P[r_0-\Delta r,\, t+\tau\,|\, r_0+\Delta r,\, t] = \tfrac{1}{4}\left[1-2\exp\left(\tfrac{-\tau}{\tau_x}\right)+\exp\left(\tfrac{-2\tau}{\tau_x}\right)\right]. \tag{B.4}$$

For the four-state system, we have

$$P[r_0+3\Delta r/2,\, t+\tau\,|\, r_0+\Delta r/2,\, t] = \tfrac{1}{3}\exp\left(\tfrac{-\tau}{\tau_x}\right)\left[\sinh\left(\tfrac{\tau}{\tau_x}\right)+\sinh\left(\tfrac{\tau}{2\tau_x}\right)\right],$$
$$P[r_0+\Delta r/2,\, t+\tau\,|\, r_0+\Delta r/2,\, t] = \tfrac{1}{3}\exp\left(\tfrac{-\tau}{\tau_x}\right)\left[2\cosh\left(\tfrac{\tau}{\tau_x}\right)+\cosh\left(\tfrac{\tau}{2\tau_x}\right)\right],$$
$$P[r_0-\Delta r/2,\, t+\tau\,|\, r_0+\Delta r/2,\, t] = \tfrac{1}{3}\exp\left(\tfrac{-\tau}{\tau_x}\right)\left[2\sinh\left(\tfrac{\tau}{\tau_x}\right)-\sinh\left(\tfrac{\tau}{2\tau_x}\right)\right],$$
$$P[r_0-3\Delta r/2,\, t+\tau\,|\, r_0+\Delta r/2,\, t] = \tfrac{1}{3}\exp\left(\tfrac{-\tau}{\tau_x}\right)\left[\cosh\left(\tfrac{\tau}{\tau_x}\right)-\cosh\left(\tfrac{\tau}{2\tau_x}\right)\right], \tag{B.5}$$

and

$$P[r_0+3\Delta r/2,\, t+\tau\,|\, r_0+3\Delta r/2,\, t] = \tfrac{1}{3}\exp\left(\tfrac{-\tau}{\tau_x}\right)\left[\cosh\left(\tfrac{\tau}{\tau_x}\right)+2\cosh\left(\tfrac{\tau}{2\tau_x}\right)\right],$$
$$P[r_0+\Delta r/2,\, t+\tau\,|\, r_0+3\Delta r/2,\, t] = \tfrac{1}{3}\exp\left(\tfrac{-\tau}{\tau_x}\right)\left[2\sinh\left(\tfrac{\tau}{\tau_x}\right)+2\sinh\left(\tfrac{\tau}{2\tau_x}\right)\right],$$
$$P[r_0-\Delta r/2,\, t+\tau\,|\, r_0+3\Delta r/2,\, t] = \tfrac{1}{3}\exp\left(\tfrac{-\tau}{\tau_x}\right)\left[2\cosh\left(\tfrac{\tau}{\tau_x}\right)-2\cosh\left(\tfrac{\tau}{2\tau_x}\right)\right],$$
$$P[r_0-3\Delta r/2,\, t+\tau\,|\, r_0+3\Delta r/2,\, t] = \tfrac{1}{3}\exp\left(\tfrac{-\tau}{\tau_x}\right)\left[\sinh\left(\tfrac{\tau}{\tau_x}\right)-2\sinh\left(\tfrac{\tau}{2\tau_x}\right)\right]. \tag{B.6}$$

For the five-state system, we have

$$P[r_0\pm 2\Delta r,\, t+\tau\,|\, r_0,\, t] = \tfrac{1}{2}\exp\left(\tfrac{-\tau}{\tau_x}\right)\sinh^2\left(\tfrac{\tau}{2\tau_x}\right),$$
$$P[r_0\pm\Delta r,\, t+\tau\,|\, r_0,\, t] = \tfrac{1}{2}\exp\left(\tfrac{-\tau}{\tau_x}\right)\sinh\left(\tfrac{\tau}{\tau_x}\right),$$
$$P[r_0,\, t+\tau\,|\, r_0,\, t] = \exp\left(\tfrac{-\tau}{\tau_x}\right)\cosh^2\left(\tfrac{\tau}{2\tau_x}\right) \tag{B.7}$$
and

$$P[r_0+2\Delta r,\, t+\tau\,|\, r_0+\Delta r,\, t] = \tfrac{1}{4}\exp\left(\tfrac{-\tau}{\tau_x}\right)\left[\sinh\left(\tfrac{\tau}{\tau_x}\right)+\sqrt{2}\sinh\left(\tfrac{\tau}{\sqrt{2}\tau_x}\right)\right],$$
$$P[r_0+\Delta r,\, t+\tau\,|\, r_0+\Delta r,\, t] = \tfrac{1}{2}\exp\left(\tfrac{-\tau}{\tau_x}\right)\left[\cosh\left(\tfrac{\tau}{\tau_x}\right)+\cosh\left(\tfrac{\tau}{\sqrt{2}\tau_x}\right)\right],$$
$$P[r_0,\, t+\tau\,|\, r_0+\Delta r,\, t] = \tfrac{1}{2}\exp\left(\tfrac{-\tau}{\tau_x}\right)\sinh\left(\tfrac{\tau}{\tau_x}\right),$$
$$P[r_0-\Delta r,\, t+\tau\,|\, r_0+\Delta r,\, t] = \tfrac{1}{2}\exp\left(\tfrac{-\tau}{\tau_x}\right)\left[\cosh\left(\tfrac{\tau}{\tau_x}\right)-\cosh\left(\tfrac{\tau}{\sqrt{2}\tau_x}\right)\right],$$
$$P[r_0-2\Delta r,\, t+\tau\,|\, r_0+\Delta r,\, t] = \tfrac{1}{4}\exp\left(\tfrac{-\tau}{\tau_x}\right)\left[\sinh\left(\tfrac{\tau}{\tau_x}\right)-\sqrt{2}\sinh\left(\tfrac{\tau}{\sqrt{2}\tau_x}\right)\right] \tag{B.8}$$
and

$$P[r_0+2\Delta r,\, t+\tau\,|\, r_0+2\Delta r,\, t] = \tfrac{1}{2}\exp\left(\tfrac{-\tau}{\tau_x}\right)\left[\cosh^2\left(\tfrac{\tau}{2\tau_x}\right)+\cosh\left(\tfrac{\tau}{\sqrt{2}\tau_x}\right)\right],$$
$$P[r_0+\Delta r,\, t+\tau\,|\, r_0+2\Delta r,\, t] = \tfrac{1}{2}\exp\left(\tfrac{-\tau}{\tau_x}\right)\left[\sinh\left(\tfrac{\tau}{\tau_x}\right)+\sqrt{2}\sinh\left(\tfrac{\tau}{\sqrt{2}\tau_x}\right)\right],$$
$$P[r_0,\, t+\tau\,|\, r_0+2\Delta r,\, t] = \exp\left(\tfrac{-\tau}{\tau_x}\right)\sinh^2\left(\tfrac{\tau}{2\tau_x}\right),$$
$$P[r_0-\Delta r,\, t+\tau\,|\, r_0+2\Delta r,\, t] = \tfrac{1}{2}\exp\left(\tfrac{-\tau}{\tau_x}\right)\left[\sinh\left(\tfrac{\tau}{\tau_x}\right)-\sqrt{2}\sinh\left(\tfrac{\tau}{\sqrt{2}\tau_x}\right)\right],$$
$$P[r_0-2\Delta r,\, t+\tau\,|\, r_0+2\Delta r,\, t] = \tfrac{1}{2}\exp\left(\tfrac{-\tau}{\tau_x}\right)\left[\cosh^2\left(\tfrac{\tau}{2\tau_x}\right)-\cosh\left(\tfrac{\tau}{\sqrt{2}\tau_x}\right)\right]. \tag{B.9}$$
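The same master-equation check works for the five-state system. The sketch below encodes the generator (reflecting boundary states, symmetric interior hops) and compares the Euler-integrated occupancies against equations B.7 and B.9 for hypothetical parameter values:

```python
import math

def five_state_probs(tau, tau_x, start, steps=50000):
    """Euler-integrate the five-state master equation: total hop rate
    1/tau_x, boundary states (0 and 4) reflect inward, interior states
    hop up or down with probability 1/2."""
    hop = [{1: 1.0}, {0: 0.5, 2: 0.5}, {1: 0.5, 3: 0.5},
           {2: 0.5, 4: 0.5}, {3: 1.0}]
    p = [1.0 if i == start else 0.0 for i in range(5)]
    dt = tau / steps
    for _ in range(steps):
        dp = [-p[i] / tau_x for i in range(5)]
        for i, outs in enumerate(hop):
            for j, w in outs.items():
                dp[j] += w * p[i] / tau_x
        p = [p[i] + dt * dp[i] for i in range(5)]
    return p

tau, tau_x = 0.9, 1.0
u = tau / tau_x
p_mid = five_state_probs(tau, tau_x, start=2)
print(p_mid[2], math.exp(-u) * math.cosh(u/2)**2)        # eq. B.7, middle
print(p_mid[4], 0.5 * math.exp(-u) * math.sinh(u/2)**2)  # eq. B.7, +2*dr
p_top = five_state_probs(tau, tau_x, start=4)
print(p_top[2], math.exp(-u) * math.sinh(u/2)**2)        # eq. B.9, middle
```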
This leads to the following results for the average rate, $\langle r(t)|r_1,0\rangle$, given an initial rate, $r_1$; the variance in rate, $\mathrm{Var}[r(t)|r_1,0]$; the autocorrelation function, $C_{xx}[t,\tau|r_1,0]$; and the Fano factor, $F[t|r_1,0] = 1 + \mathrm{Var}[\lambda(t)|r_1,0]/\langle\lambda(t)|r_1,0\rangle$, which are plotted in Figure 4. For the two-state system,

$$\langle r(t)|r_0+\Delta r/2,0\rangle = r_0 + \frac{\Delta r}{2}\exp\left(\frac{-2t}{\tau_x}\right),$$
$$\mathrm{Var}[r(t)|r_0+\Delta r/2,0] = \frac{(\Delta r)^2}{4}\left[1-\exp\left(\frac{-4t}{\tau_x}\right)\right],$$
$$C_{xx}[t,t+\tau|r_0+\Delta r/2,0] = \frac{(\Delta r)^2}{4}\exp\left(\frac{-2\tau}{\tau_x}\right)\left[1-\exp\left(\frac{-4t}{\tau_x}\right)\right],$$
$$\langle\lambda(t)|r_0+\Delta r/2,0\rangle = r_0 t + \frac{\Delta r\,\tau_x}{4}\left[1-\exp\left(\frac{-2t}{\tau_x}\right)\right],$$
$$\mathrm{Var}[\lambda(t)|r_0+\Delta r/2,0] = \frac{(\Delta r)^2\tau_x^2}{4}\left[\frac{t}{\tau_x} - \frac{3}{4} + \exp\left(\frac{-2t}{\tau_x}\right) - \frac{1}{4}\exp\left(\frac{-4t}{\tau_x}\right)\right]. \tag{B.10}$$
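$\mathrm{Var}[\lambda(t)]$ above follows from the double time integral of the rate covariance, $\mathrm{Cov}[r(t_1),r(t_2)] = \mathrm{Var}[r(t_1)]\, e^{-2(t_2-t_1)/\tau_x}$. The sketch below (hypothetical parameters) evaluates that integral numerically and compares it with the closed form of equation B.10:

```python
import math

def var_lambda_two_state(t, dr, tau_x, n=2000):
    """Var[lambda(t)] = 2 * int_0^t dt1 int_{t1}^t dt2 Cov[r(t1), r(t2)],
    with Cov = Var[r(t1)] * exp(-2(t2-t1)/tau_x) for the two-state system.
    Outer integral by the midpoint rule; inner integral done exactly."""
    h = t / n
    total = 0.0
    for i in range(n):
        t1 = (i + 0.5) * h
        var_r = dr**2 / 4 * (1 - math.exp(-4*t1/tau_x))
        inner = tau_x / 2 * (1 - math.exp(-2*(t - t1)/tau_x))
        total += 2 * var_r * inner * h
    return total

t, dr, tau_x = 1.3, 2.0, 0.7
closed = dr**2 * tau_x**2 / 4 * (t/tau_x - 0.75 + math.exp(-2*t/tau_x)
                                 - 0.25*math.exp(-4*t/tau_x))
print(var_lambda_two_state(t, dr, tau_x), closed)
```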
For the three-state system, starting at the middle state,

$$\langle r(t)|r_0,0\rangle = r_0,$$
$$\mathrm{Var}[r(t)|r_0,0] = \frac{(\Delta r)^2}{2}\left[1-\exp\left(\frac{-2t}{\tau_x}\right)\right],$$
$$C_{xx}[t,t+\tau|r_0,0] = \frac{(\Delta r)^2}{2}\exp\left(\frac{-\tau}{\tau_x}\right)\left[1-\exp\left(\frac{-2t}{\tau_x}\right)\right],$$
$$\langle\lambda(t)|r_0,0\rangle = r_0 t,$$
$$\mathrm{Var}[\lambda(t)|r_0,0] = (\Delta r)^2\tau_x^2\left[\frac{t}{\tau_x} - \frac{3}{2} + 2\exp\left(\frac{-t}{\tau_x}\right) - \frac{1}{2}\exp\left(\frac{-2t}{\tau_x}\right)\right], \tag{B.11}$$
and starting from a boundary state,

$$\langle r(t)|r_0+\Delta r,0\rangle = r_0 + \Delta r\exp\left(\frac{-t}{\tau_x}\right),$$
$$\mathrm{Var}[r(t)|r_0+\Delta r,0] = \frac{(\Delta r)^2}{2}\left[1-\exp\left(\frac{-2t}{\tau_x}\right)\right],$$
$$C_{xx}[t,t+\tau|r_0+\Delta r,0] = \frac{(\Delta r)^2}{2}\exp\left(\frac{-\tau}{\tau_x}\right)\left[1-\exp\left(\frac{-2t}{\tau_x}\right)\right],$$
$$\langle\lambda(t)|r_0+\Delta r,0\rangle = r_0 t + \Delta r\,\tau_x\left[1-\exp\left(\frac{-t}{\tau_x}\right)\right],$$
$$\mathrm{Var}[\lambda(t)|r_0+\Delta r,0] = (\Delta r)^2\tau_x^2\left[\frac{t}{\tau_x} - \frac{3}{2} + 2\exp\left(\frac{-t}{\tau_x}\right) - \frac{1}{2}\exp\left(\frac{-2t}{\tau_x}\right)\right]. \tag{B.12}$$
For the four-state system, starting at the upper of the two inner states $[r(0) = r_0+\Delta r/2]$,

$$\langle r(t)|r_0+\Delta r/2,0\rangle = r_0 + \frac{\Delta r}{6}\left[4 e^{-t/(2\tau_x)} - e^{-2t/\tau_x}\right],$$
$$\mathrm{Var}[r(t)|r_0+\Delta r/2,0] = \frac{(\Delta r)^2}{36}\left[33 - 16 e^{-t/\tau_x} - 24 e^{-3t/(2\tau_x)} + 8 e^{-5t/(2\tau_x)} - e^{-4t/\tau_x}\right],$$
$$C_{xx}[t,t+\tau|r_0+\Delta r/2,0] = \frac{(\Delta r)^2}{36}\left\{ e^{-\tau/(2\tau_x)}\left[32 - 16 e^{-t/\tau_x} - 20 e^{-3t/(2\tau_x)} + 4 e^{-5t/(2\tau_x)}\right] + e^{-2\tau/\tau_x}\left[1 - 4 e^{-3t/(2\tau_x)} + 4 e^{-5t/(2\tau_x)} - e^{-4t/\tau_x}\right]\right\},$$
$$\langle\lambda(t)|r_0+\Delta r/2,0\rangle = r_0 t + \frac{\Delta r\,\tau_x}{12}\left[15 - 16 e^{-t/(2\tau_x)} + e^{-2t/\tau_x}\right],$$
$$\mathrm{Var}[\lambda(t)|r_0+\Delta r/2,0] = \frac{(\Delta r)^2\tau_x^2}{144}\left[\frac{516\, t}{\tau_x} - 1475 + 1824 e^{-t/(2\tau_x)} - 256 e^{-t/\tau_x} - 64 e^{-3t/(2\tau_x)} - 60 e^{-2t/\tau_x} + 32 e^{-5t/(2\tau_x)} - e^{-4t/\tau_x}\right], \tag{B.13}$$

while starting from the upper boundary state,

$$\langle r(t)|r_0+3\Delta r/2,0\rangle = r_0 + \frac{\Delta r}{6}\left[8 e^{-t/(2\tau_x)} + e^{-2t/\tau_x}\right],$$
$$\mathrm{Var}[r(t)|r_0+3\Delta r/2,0] = \frac{(\Delta r)^2}{36}\left[33 - 64 e^{-t/\tau_x} + 48 e^{-3t/(2\tau_x)} - 16 e^{-5t/(2\tau_x)} - e^{-4t/\tau_x}\right],$$
$$C_{xx}[t,t+\tau|r_0+3\Delta r/2,0] = \frac{(\Delta r)^2}{36}\left\{ e^{-\tau/(2\tau_x)}\left[32 - 64 e^{-t/\tau_x} + 40 e^{-3t/(2\tau_x)} - 8 e^{-5t/(2\tau_x)}\right] + e^{-2\tau/\tau_x}\left[1 + 8 e^{-3t/(2\tau_x)} - 8 e^{-5t/(2\tau_x)} - e^{-4t/\tau_x}\right]\right\},$$
$$\langle\lambda(t)|r_0+3\Delta r/2,0\rangle = r_0 t + \frac{\Delta r\,\tau_x}{12}\left[33 - 32 e^{-t/(2\tau_x)} - e^{-2t/\tau_x}\right],$$
$$\mathrm{Var}[\lambda(t)|r_0+3\Delta r/2,0] = \frac{(\Delta r)^2\tau_x^2}{144}\left[\frac{516\, t}{\tau_x} - 1667 + 2496 e^{-t/(2\tau_x)} - 1024 e^{-t/\tau_x} + 128 e^{-3t/(2\tau_x)} + 132 e^{-2t/\tau_x} - 64 e^{-5t/(2\tau_x)} - e^{-4t/\tau_x}\right]. \tag{B.14}$$
For the five-state system, starting at the middle state,

$$\langle r(t)|r_0,0\rangle = r_0,$$
$$\mathrm{Var}[r(t)|r_0,0] = \frac{(\Delta r)^2}{2}\left[1-e^{-t/\tau_x}\right]\left[3-e^{-t/\tau_x}\right],$$
$$C_{xx}[t,t+\tau|r_0,0] = \frac{(\Delta r)^2}{2} e^{-\tau/\tau_x}\left\{\cosh\left(\frac{\tau}{\sqrt{2}\tau_x}\right)\left[3-e^{-t/\tau_x}\right]\left[1-e^{-t/\tau_x}\right] + 2\sqrt{2}\sinh\left(\frac{\tau}{\sqrt{2}\tau_x}\right)\left[1-e^{-t/\tau_x}\right]\right\},$$
$$\langle\lambda(t)|r_0,0\rangle = r_0 t,$$
$$\mathrm{Var}[\lambda(t)|r_0,0] = (\Delta r)^2\tau_x^2\left[\frac{10\, t}{\tau_x} - 45 - 4 e^{-t/\tau_x} + e^{-2t/\tau_x} + e^{-t/\tau_x}\left(48\cosh\left(\frac{t}{\sqrt{2}\tau_x}\right) + 36\sqrt{2}\sinh\left(\frac{t}{\sqrt{2}\tau_x}\right)\right)\right], \tag{B.15}$$
and starting from the upper boundary state,
$$\langle r(t)|r_0+2\Delta r,0\rangle = r_0 + \Delta r\, e^{-t/\tau_x}\left[2\cosh\left(\frac{t}{\sqrt{2}\tau_x}\right) + \sqrt{2}\sinh\left(\frac{t}{\sqrt{2}\tau_x}\right)\right],$$
$$\mathrm{Var}[r(t)|r_0+2\Delta r,0] = \frac{(\Delta r)^2}{4}\left\{6 + 8 e^{-t/\tau_x} - e^{-2t/\tau_x}\left[2 + 12\cosh\left(\frac{\sqrt{2}\,t}{\tau_x}\right) + 8\sqrt{2}\sinh\left(\frac{\sqrt{2}\,t}{\tau_x}\right)\right]\right\},$$
$$C_{xx}[t,t+\tau|r_0+2\Delta r,0] = \frac{(\Delta r)^2}{2}\, e^{-\tau/\tau_x}\left\{\cosh\left(\frac{\tau}{\sqrt{2}\tau_x}\right)\left[3 + 4 e^{-t/\tau_x} - e^{-2t/\tau_x}\right] + 2\sqrt{2}\sinh\left(\frac{\tau}{\sqrt{2}\tau_x}\right)\left[1 + e^{-t/\tau_x}\right] - e^{-2t/\tau_x}\left[6\cosh\left(\frac{\tau+2t}{\sqrt{2}\tau_x}\right) + 4\sqrt{2}\sinh\left(\frac{\tau+2t}{\sqrt{2}\tau_x}\right)\right]\right\},$$
$$\langle\lambda(t)|r_0+2\Delta r,0\rangle = r_0 t + \Delta r\,\tau_x\left\{6 - 2 e^{-t/\tau_x}\left[3\cosh\left(\frac{t}{\sqrt{2}\tau_x}\right) + 2\sqrt{2}\sinh\left(\frac{t}{\sqrt{2}\tau_x}\right)\right]\right\},$$
$$\mathrm{Var}[\lambda(t)|r_0+2\Delta r,0] = (\Delta r)^2\tau_x^2\left[\frac{10\, t}{\tau_x} - 57 + 4 e^{-t/\tau_x} - e^{-2t/\tau_x} + e^{-t/\tau_x}\left(88\cosh\left(\frac{t}{\sqrt{2}\tau_x}\right) + 60\sqrt{2}\sinh\left(\frac{t}{\sqrt{2}\tau_x}\right)\right) - e^{-2t/\tau_x}\left(34\cosh\left(\frac{\sqrt{2}\,t}{\tau_x}\right) + 24\sqrt{2}\sinh\left(\frac{\sqrt{2}\,t}{\tau_x}\right)\right)\right]. \tag{B.16}$$
Appendix C: Analysis of Random Walk with Reflecting Boundaries

In order to calculate the autocorrelation functions (see Figures 4C and 4D) from equation 2.4, and hence the Fano factor (see Figure 4E) from equation 3.14, we need to evaluate the mean rate, $\langle r(t_2)\rangle$, at a later time, $t_2$, conditional on its value, $r_1 = r_0+d_1$, at an earlier time, $t_1$. For the random walk with reflecting boundaries this is given, using equation 4.1 and writing $\tau = t_2-t_1$, by

$$\langle r(t_2)|r_0+d_1,t_1\rangle = \int_{r_0-r_b}^{r_0+r_b} dr_2\, r_2\,\frac{1}{\sqrt{2\pi\alpha\tau}}\sum_{n=-\infty}^{\infty}\left\{\exp\left[\frac{-(r_2-r_0+d_1-2r_b+4n r_b)^2}{2\alpha\tau}\right] + \exp\left[\frac{-(r_2-r_0-d_1+4n r_b)^2}{2\alpha\tau}\right]\right\}$$
$$= r_0 + d_1\sum_{n=1}^{\infty}\left[\mathrm{erf}\left(\frac{-d_1+(4n+1)r_b}{\sqrt{2\alpha\tau}}\right) + \mathrm{erf}\left(\frac{-d_1-(4n-1)r_b}{\sqrt{2\alpha\tau}}\right) + \mathrm{erf}\left(\frac{-d_1-(4n+1)r_b}{\sqrt{2\alpha\tau}}\right) + \mathrm{erf}\left(\frac{-d_1+(4n-1)r_b}{\sqrt{2\alpha\tau}}\right)\right]$$
$$\quad + \frac{d_1}{2}\left[\mathrm{erf}\left(\frac{-d_1+r_b}{\sqrt{2\alpha\tau}}\right) - \mathrm{erf}\left(\frac{-d_1-r_b}{\sqrt{2\alpha\tau}}\right)\right]$$
$$\quad + 2 r_b\sum_{n=1}^{\infty}\left\{\left[1 + \mathrm{erf}\left(\frac{-d_1+(4n-1)r_b}{\sqrt{2\alpha\tau}}\right)\right] + \left[1 - \mathrm{erf}\left(\frac{-d_1-(4n+1)r_b}{\sqrt{2\alpha\tau}}\right)\right]\right\} + 2 r_b\left[1 + \mathrm{erf}\left(\frac{-d_1-r_b}{\sqrt{2\alpha\tau}}\right)\right]$$
$$\quad + \sqrt{\frac{2\alpha\tau}{\pi}}\sum_{n=0}^{\infty}\left\{\exp\left[\frac{-[-d_1+(4n+1)r_b]^2}{2\alpha\tau}\right] - \exp\left[\frac{-[-d_1-(4n+1)r_b]^2}{2\alpha\tau}\right] - \exp\left[\frac{-[-d_1-(4n-1)r_b]^2}{2\alpha\tau}\right] - \exp\left[\frac{-[-d_1+(4n-1)r_b]^2}{2\alpha\tau}\right]\right\}$$
$$\quad + \sqrt{\frac{2\alpha\tau}{\pi}}\left\{\exp\left[\frac{-(d_1+r_b)^2}{2\alpha\tau}\right] - \exp\left[\frac{-(d_1-r_b)^2}{2\alpha\tau}\right]\right\}. \tag{C.1}$$
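The image sum in the integrand of equation C.1 can be sanity-checked by verifying that the reflected propagator conserves probability on the interval. The sketch below (working in coordinates relative to $r_0$, with hypothetical parameters) builds the standard method-of-images kernel for two reflecting walls and integrates it over the interval:

```python
import math

def reflected_propagator(x, x0, t, alpha, rb, nmax=20):
    """Method-of-images propagator for a random walk with variance alpha*t
    and reflecting boundaries at +/- rb (coordinates relative to r0)."""
    s2 = alpha * t
    norm = 1.0 / math.sqrt(2 * math.pi * s2)
    total = 0.0
    for n in range(-nmax, nmax + 1):
        total += math.exp(-(x - x0 + 4*n*rb)**2 / (2*s2))         # direct images
        total += math.exp(-(x + x0 + 2*rb + 4*n*rb)**2 / (2*s2))  # reflected images
    return norm * total

alpha, rb, t, x0 = 1.0, 1.0, 0.5, 0.3
# Total probability on [-rb, rb] must stay 1: reflecting walls conserve mass.
npts = 4001
h = 2 * rb / (npts - 1)
mass = sum(reflected_propagator(-rb + i*h, x0, t, alpha, rb)
           * (0.5 if i in (0, npts - 1) else 1.0) * h for i in range(npts))
print(mass)
```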
The variance in rate plotted in Figure 4B is given for initial rates of $r_0$ and $r_0-r_b$. In both cases, the mirror sources are evenly spaced, simplifying the calculation a little. For $r(0) = r_0$, we have

$$\sigma_r^2 = -r_0^2 + \int_{r_0-r_b}^{r_0+r_b} dr_2\, r_2^2\,\frac{1}{\sqrt{2\pi\alpha t}}\sum_{n=-\infty}^{\infty}\exp\left[\frac{-(r_2-r_0+2n r_b)^2}{2\alpha t}\right]$$
$$= \alpha t + 4 r_b\sum_{n=0}^{\infty}\left\{(2n+1)r_b\left[1 - \mathrm{erf}\left(\frac{(2n+1)r_b}{\sqrt{2\alpha t}}\right)\right] - \sqrt{\frac{2\alpha t}{\pi}}\exp\left[\frac{-(2n+1)^2 r_b^2}{2\alpha t}\right]\right\}. \tag{C.2}$$
For $r(0) = r_0-r_b$, we have

$$\sigma_r^2 = -\langle r(t)|r_0-r_b,0\rangle^2 + \frac{2}{\sqrt{2\pi\alpha t}}\int_{r_0-r_b}^{r_0+r_b} dr_2\, r_2^2\sum_{n=-\infty}^{\infty}\exp\left[\frac{-(r_2-r_0+r_b+4n r_b)^2}{2\alpha t}\right]$$
$$= -\langle r(t)|r_0-r_b,0\rangle^2 + (r_0-r_b)^2 + \alpha t$$
$$\quad + 16 r_b^2\sum_{n=0}^{\infty}(2n+1)\left[1 - \mathrm{erf}\left(\frac{2(2n+1)r_b}{\sqrt{2\alpha t}}\right)\right]$$
$$\quad + 8 r_b (r_0-r_b)\sum_{n=0}^{\infty}\left\{(2n+1)\left[1 - \mathrm{erf}\left(\frac{2(2n+1)r_b}{\sqrt{2\alpha t}}\right)\right] - 2n\left[1 - \mathrm{erf}\left(\frac{4n r_b}{\sqrt{2\alpha t}}\right)\right]\right\}$$
$$\quad + 2(r_0-r_b)\sqrt{\frac{2\alpha t}{\pi}}\left\{-1 + 2\sum_{n=0}^{\infty}\left[\exp\left(\frac{-4 r_b^2 (2n)^2}{2\alpha t}\right) - \exp\left(\frac{-4 r_b^2 (2n+1)^2}{2\alpha t}\right)\right]\right\}$$
$$\quad - 8 r_b\sum_{n=0}^{\infty}\frac{\sqrt{\pi}\,(2n+1)r_b}{\sqrt{2\alpha t}}\exp\left(\frac{-4 r_b^2 (2n+1)^2}{2\alpha t}\right). \tag{C.3}$$
Appendix D: Gaussian Distribution of λ for Continuous Random Walks

In order to calculate the ISI distribution from equation 6.1, it is necessary to know the probability distribution of $\lambda(t_1, t_1+\tau_s) = \int_{t_1}^{t_1+\tau_s} r(t')\,dt'$. In this appendix, we prove by induction that the distribution is gaussian, that is,

$$P[\lambda(t_1,t_1+\tau) = \lambda] = \frac{1}{\sqrt{2\pi\sigma_{\lambda(\tau)}^2}}\exp\left[\frac{-\left(\lambda-\langle\lambda\rangle\right)^2}{2\sigma_{\lambda(\tau)}^2}\right] \tag{D.1}$$
for both the continuous random walk and the continuous leaky integrator (the Ornstein-Uhlenbeck, OU, process). For both processes, the distribution of rates as a function of time is known to be gaussian, with mean $\langle r(t)\rangle$ and standard deviation $\sigma_r(t)$. We will use in the derivation the distribution $P(\lambda|r)$, which denotes the distribution of $\lambda$ for all paths that end at a certain rate, $r$, such that

$$P(\lambda) = \int dr\, P(r)\, P(\lambda|r). \tag{D.2}$$
For gaussian $P(r)$ and gaussian $P(\lambda)$, we must also have gaussian $P(\lambda|r)$. We can calculate the mean of the distribution, $\langle\lambda|r\rangle$, by integrating over time the mean rates conditional on the final rate, $r(t)$:

$$\langle\lambda(t)|r(t)\rangle = \int dt_1\,\langle r_1(t_1)|r(t)\rangle. \tag{D.3}$$
The conditional mean rate is found using Bayes' rule:

$$\langle r_1(t_1)|r(t)\rangle = \int dr_1\, r_1\, P[r_1,t_1|r(t)] \tag{D.4}$$
$$= \int dr_1\, r_1\,\frac{P[r_1,t_1]\, P[r(t)|r_1,t_1]}{P[r(t)]}, \tag{D.5}$$
which results for the leaky random walk in

$$\langle\lambda(t)|r(t)\rangle = r_A t + \left[r_0 - 2 r_A + r(t)\right]\tau_L\,\frac{1-\exp(-t/\tau_L)}{1+\exp(-t/\tau_L)}, \tag{D.6}$$

which, setting $\tau_L\to\infty$, gives the uniform random walk result of $\langle\lambda(t)|r(t)\rangle = [r(t)+r_0]\,t/2$. We separate out the term linear in $r(t)$, writing in general

$$\langle\lambda(t)|r(t)\rangle = f(t)\, r(t) + g(t), \tag{D.7}$$
where $f$ and $g$ are found from equation D.6. Substituting from equation D.7 into equation D.2, using gaussian distributions and writing $\sigma_{\lambda|r}^2$ as the variance of $P[\lambda|r]$, we find the condition

$$\sigma_\lambda^2 = f(t)^2\sigma_r^2 + \sigma_{\lambda|r}^2. \tag{D.8}$$
We now check the general condition for the nth moment of $\lambda$,

$$\frac{d\langle\lambda^{n}\rangle}{dt} = n\,\langle\lambda^{n-1} r(t)\rangle, \tag{D.9}$$
by evaluating the left- and right-hand sides separately, assuming gaussian probability distributions. To evaluate the left-hand side, we find

$$\langle\lambda^{n}\rangle = \frac{1}{\sqrt{2\pi\sigma_\lambda^2}}\int d\lambda\,\lambda^{n}\exp\left[\frac{-(\lambda-\langle\lambda\rangle)^2}{2\sigma_\lambda^2}\right] = \sum_{k=0}^{n/2}\frac{n!\,(2k-1)!!}{(2k)!\,(n-2k)!}\,\langle\lambda\rangle^{n-2k}\sigma_\lambda^{2k} \tag{D.10}$$

for even $n$, with the upper limit of the sum replaced by $(n-1)/2$ for odd $n$. Taking the derivative with respect to time gives

$$\frac{d\langle\lambda^{n}\rangle}{dt} = \sum_{k=0}^{n/2}\frac{n!\,(2k-1)!!}{(2k)!\,(n-2k)!}\,\sigma_\lambda^{2k-2}\langle\lambda\rangle^{n-2k-1}\left[(n-2k)\langle r\rangle\sigma_\lambda^2 + k\,\langle\lambda\rangle\frac{d\sigma_\lambda^2}{dt}\right]. \tag{D.11}$$
To evaluate the right-hand side, we find

$$n\,\langle\lambda^{n-1} r(t)\rangle = \frac{1}{\sqrt{2\pi\sigma_r^2}}\int dr\, r\,\frac{n}{\sqrt{2\pi\sigma_{\lambda|r}^2}}\int d\lambda\,\lambda^{n-1}\exp\left[\frac{-(\lambda-\langle\lambda|r\rangle)^2}{2\sigma_{\lambda|r}^2}\right]\exp\left[\frac{-(r-\langle r\rangle)^2}{2\sigma_r^2}\right]$$
$$= \sum_{k=0}^{n/2}\frac{n!\,(2k-1)!!}{(2k)!\,(n-2k)!}\,\sigma_\lambda^{2k-2}\langle\lambda\rangle^{n-2k-1}\left[(n-2k)\left(\sigma_{\lambda|r}^2\langle r\rangle - \sigma_r^2 f(t) g(t)\right) + n\,\sigma_r^2 f(t)\langle\lambda\rangle\right]$$
$$= \sum_{k=0}^{n/2}\frac{n!\,(2k-1)!!}{(2k)!\,(n-2k)!}\,\sigma_\lambda^{2k-2}\langle\lambda\rangle^{n-2k-1}\left[(n-2k)\sigma_\lambda^2\langle r\rangle + 2k\,\sigma_r^2 f(t)\langle\lambda\rangle\right], \tag{D.12}$$

where in the last line we have used equation D.8 and the identity $\langle\lambda\rangle = f(t)\langle r\rangle + g(t)$ from equation D.7. Equating equations D.11 and D.12 leads to the requirement

$$\frac{d\sigma_\lambda^2}{dt} = 2 f(t)\,\sigma_r^2. \tag{D.13}$$
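Equation D.13 is easy to verify for the nonleaky random walk, where $f(t) = t/2$ (from equation D.6 with $\tau_L\to\infty$), $\sigma_r^2 = \alpha t$, and $\sigma_\lambda^2 = \alpha t^3/3$ (equation A.13). The snippet below is a minimal numerical sketch of that check:

```python
# Check of equation D.13, d(sigma_lambda^2)/dt = 2 f(t) sigma_r^2, for the
# nonleaky continuous random walk: f(t) = t/2, sigma_r^2 = alpha*t, and
# sigma_lambda^2 = alpha*t^3/3.  Parameter values are arbitrary.
alpha, t, dt = 0.8, 1.7, 1e-6
sig_lam2 = lambda s: alpha * s**3 / 3
lhs = (sig_lam2(t + dt) - sig_lam2(t - dt)) / (2 * dt)   # central difference
rhs = 2 * (t / 2) * (alpha * t)
print(lhs, rhs)
```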
We can evaluate a similar requirement on the conditional probability distribution being gaussian, from the requirement

$$P[\lambda(t_2)|r_2(t_2)] = \int dr_1\, P(r_1,t_1)\,\frac{P(r_2,t_2|r_1,t_1)}{P(r_2,t_2)}\int d\lambda_1\, P[\lambda_1(t_1)|r_1(t_1)]\, P[\lambda-\lambda_1|r_1(t_1),r_2(t_2)], \tag{D.14}$$
where $P[\lambda-\lambda_1|r_1(t_1),r_2(t_2)]$ is the probability of the integral of firing rate between $r_1$ at time $t_1$ and $r_2$ at time $t_2$ being equal to $\lambda-\lambda_1$. Integrating from a rate of $r_0$ at time $t = 0$ is implicitly assumed in the other terms. If the probability distributions are gaussian, they depend on only their variance, which is given by equation D.8, and mean, which we write as

$$\langle\lambda\rangle_{r_1,r_2,\tau} = d(\tau)\, r_1 + f(\tau)\, r_2 + h(\tau), \tag{D.15}$$
where τ = t2 − t1 . All the processes we consider are Markov, so given a definite rate at an earlier time, the distribution can depend on only the time difference at a later time, not the total time elapsed. With these definitions,
equation D.14 becomes the condition:

$$\frac{1}{\sqrt{2\pi\sigma_{\lambda|r,t_2}^2}}\exp\left[\frac{-\left[\lambda - d(t_2) r_0 - f(t_2) r_2 - h(t_2)\right]^2}{2\sigma_{\lambda|r,t_2}^2}\right]$$
$$= \frac{\sqrt{2\pi\sigma_{r,t_2}^2}\,\exp\left[\dfrac{\left[r_2-\langle r(t_2)\rangle\right]^2}{2\sigma_{r,t_2}^2}\right]}{\sqrt{2\pi\sigma_{r,\tau}^2}\sqrt{2\pi\sigma_{r,t_1}^2}\sqrt{2\pi\sigma_{\lambda|r,t_1}^2}\sqrt{2\pi\sigma_{\lambda|r,\tau}^2}}\int dr_1\,\exp\left[\frac{-\left[r_2-\langle r(t_2)|r_1,t_1\rangle\right]^2}{2\sigma_{r,\tau}^2}\right]\exp\left[\frac{-\left[r_1-\langle r(t_1)\rangle\right]^2}{2\sigma_{r,t_1}^2}\right]$$
$$\quad\times\int d\lambda_1\,\exp\left[\frac{-\left[\lambda_1 - d(t_1) r_0 - f(t_1) r_1 - h(t_1)\right]^2}{2\sigma_{\lambda|r,t_1}^2}\right]\exp\left[\frac{-\left[\lambda - \lambda_1 - d(\tau) r_1 - f(\tau) r_2 - h(\tau)\right]^2}{2\sigma_{\lambda|r,\tau}^2}\right]. \tag{D.16}$$
For both the leaky integrator and the continuous random walk, the mean rate at later times depends only linearly on an earlier known rate (see equation 5.2), so

$$\langle r(t_2)\rangle|r_1,t_1 = b(\tau)\, r_1 + c(\tau), \tag{D.17}$$

where $\tau = t_2-t_1$, $b(\tau) = \exp(-\tau/\tau_L)$, and $c(\tau) = r_A\left[1-\exp(-\tau/\tau_L)\right]$. Given such linear dependence, the above integrals can be solved to give the following algebraic requirement:

$$\sigma_{\lambda|r,t_2}^2\left[\sigma_{r,\tau}^2 + b^2(\tau)\,\sigma_{r,t_1}^2\right] = \left[f(t_1) + d(\tau)\right]^2\sigma_{r,t_1}^2\sigma_{r,\tau}^2 + \left[\sigma_{\lambda|r,t_1}^2 + \sigma_{\lambda|r,\tau}^2\right]\left[\sigma_{r,\tau}^2 + b^2(\tau)\,\sigma_{r,t_1}^2\right]. \tag{D.18}$$

Substituting terms, using equations 5.2, D.8, and D.7, allows us to confirm equation D.18 for the leaky (and therefore also nonleaky) continuous random walks. Direct use of equation A.11 confirms that the lower moments are consistent with a gaussian probability distribution for $\lambda$, and satisfaction of equation D.13, along with equation D.14 (and hence equation D.18), proves that we can use a gaussian distribution for $\lambda$ when calculating ISI distributions for these continuous random walk processes.

Acknowledgments

I am grateful to NIH-NIMH for support with a K25 Career Award. I appreciate helpful discussions with Alfonso Renart, David Luxat, Sridhar
Raghavachari, Caroline Geisler, Carlos Brody, and Xiao-Jing Wang during the preparation of this work.

References

Aksay, E., Baker, R., Seung, H. S., & Tank, D. W. (2000). Anatomy and discharge properties of pre-motor neurons in the goldfish medulla that have eye-position signals during fixations. J. Neurophysiol., 84, 1035–1049.
Amit, D. (1989). Modeling brain function. Cambridge: Cambridge University Press.
Baddeley, R., Abbott, L. F., Booth, M. C. A., Sengpiel, F., Freeman, T., Wakeman, E. A., & Rolls, E. T. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc. R. Soc. Lond.: Biol. Sci., 264, 1775–1783.
Bair, W., Zohary, E., & Newsome, W. T. (2001). Correlated firing in macaque visual area MT: Time scales and relationship to behavior. J. Neurosci., 21, 1676–1697.
Ben-Yishai, R., Lev Bar-Or, R., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848.
Brody, C. D. (1998). Slow covariations in neuronal resting potentials can lead to artefactually fast cross-correlations in their spike trains. J. Neurophysiol., 80, 3345–3351.
Brody, C. D. (1999). Correlations without synchrony. Neural Comput., 11, 1537–1551.
Buzsaki, G. (2004). Large-scale recording of neuronal ensembles. Nat. Neurosci., 7, 446–451.
Camperi, M., & Wang, X.-J. (1998). A model of visuospatial short-term memory in prefrontal cortex: Recurrent network and cellular bistability. J. Comput. Neurosci., 5, 383–405.
Cannon, S. C., Robinson, D. A., & Shamma, S. (1983). A proposed neural network for the integrator of the oculomotor system. Biol. Cybern., 49, 127–136.
Compte, A., Brunel, N., Goldman-Rakic, P. S., & Wang, X.-J. (2000). Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cereb. Cortex, 10, 910–923.
Constantinidis, C., & Goldman-Rakic, P. S. (2002). Correlated discharges among putative pyramidal neurons and interneurons in the primate prefrontal cortex. J. Neurophysiol., 88, 3487–3497.
Cox, D. R., & Lewis, P. A. W. (1966). The statistical analysis of series of events. New York: Wiley.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Durstewitz, D. (2003). Self-organizing neural integrator predicts interval times through climbing activity. J. Neurosci., 23, 5342–5353.
Durstewitz, D., Seamans, J. K., & Sejnowski, T. J. (2000). Dopamine-mediated stabilization of delay-period activity in a network model of prefrontal cortex. J. Neurophysiol., 83, 1733–1750.
Gaspard, P., & Wang, X.-J. (1993). Noise, chaos and (ε, τ)-entropy per unit time. Physics Reports, 6, 291–345.
Gillespie, D. T. (1992). Markov processes. Orlando, FL: Academic Press.
Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neural networks. Phys. Rev. E, 50, 3171–3191.
Goldberg, J. A., Rokni, U., & Sompolinsky, H. (2004). Patterns of ongoing activity and the functional architecture of the primary visual cortex. Neuron, 42, 489–500.
Goldman, M. S., Levine, J. H., Major, G., Tank, D. W., & Seung, H. S. (2003). Robust persistent neural activity in a model integrator with multiple hysteretic dendrites per neuron. Cereb. Cortex, 13, 1185–1195.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Hopfield, J. J., & Herz, A. V. M. (1995). Rapid local synchronization of action potentials: Towards computation with coupled integrate-and-fire neurons. Proc. Natl. Acad. Sci. USA, 92, 6655–6662.
Koulakov, A. A., Raghavachari, S., Kepecs, A., & Lisman, J. E. (2002). Model for a robust neural integrator. Nat. Neurosci., 5, 775–782.
Lewis, C. D., Gebber, G. L., Larsen, P. D., & Barman, S. M. (2001). Long-term correlations in the spike trains of medullary sympathetic neurons. J. Neurophysiol., 85, 1614–1622.
Loewenstein, Y., & Sompolinsky, H. (2003). Temporal integration by calcium dynamics in a model neuron. Nat. Neurosci., 6, 961–967.
Middleton, J. W., Chacron, M. J., Lindner, B., & Longtin, A. (2003). Firing statistics of a neuron model driven by long-range correlated noise. Phys. Rev. E, 68, 21920–21927.
Miller, P., Brody, C. D., Romo, R., & Wang, X.-J. (2003). A recurrent network model of somatosensory parametric working memory in the prefrontal cortex. Cereb. Cortex, 13, 1208–1218.
Miller, P., & Wang, X.-J. (2005). Power-law neuronal fluctuations in a recurrent network model of parametric working memory. J. Neurophysiol.
O'Keefe, J., & Dostrovsky, J. (1971). The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely moving rat. Experimental Brain Research, 34, 171–175.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes. I. The single spike train. Biophys. J., 7, 391–418.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population code. Neural Comput., 10, 373–401.
Robinson, D. A. (1989). Integrating with neurons. Annu. Rev. Neurosci., 12, 33–45.
Romo, R., Brody, C. D., Hernández, A., & Lemus, L. (1999). Neuronal correlates of parametric working memory in the prefrontal cortex. Nature, 399, 470–474.
Saleh, B. (1978). Photoelectron statistics. New York: Springer-Verlag.
Samsonovich, A., & McNaughton, B. L. (1997). Path integration and cognitive mapping in a continuous attractor neural network model. Journal of Neuroscience, 17, 5900–5920.
Seung, H. S. (1996). How the brain keeps the eyes still. Proc. Natl. Acad. Sci. USA, 93, 13339–13344.
Seung, H. S., Lee, D. D., Reis, B. Y., & Tank, D. W. (2000). Stability of the memory of eye position in a recurrent network of conductance-based model neurons. Neuron, 26, 259–271.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Sharp, P. E., Blair, H. T., & Cho, J. (2001). The anatomical and computational basis of the rat head-direction cell signal. Trends in Neurosci., 24, 289–294.
Taube, J. S., & Bassett, J. P. (2003). Persistent neural activity in head direction cells. Cereb. Cortex, 13(11), 1162–1172.
Teich, M. C., Heneghan, C., Lowen, S. B., Ozaki, T., & Kaplan, E. (1997). Fractal nature of the neural spike train in the visual system of the cat. J. Opt. Soc. Am. A, 14, 529–546.
Turcott, R. G., Lowen, S. B., Li, E., Johnson, D. H., Tsuchitani, C., & Teich, M. C. (1994). A nonstationary Poisson point process describes the sequence of action potentials over long time scales in lateral-superior-olive auditory neurons. Biol. Cybern., 70, 209–217.
Zipser, D., Kehoe, B., Littlewort, G., & Fuster, J. (1993). A spiking network model of short-term active memory. J. Neurosci., 13, 3406–3420.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370, 140–143.
Received January 26, 2005; accepted September 29, 2005.
LETTER
Communicated by Rajesh Rao
Optimal Spike-Timing-Dependent Plasticity for Precise Action Potential Firing in Supervised Learning Jean-Pascal Pfister jean-pascal.pfister@epfl.ch
Taro Toyoizumi [email protected]
David Barber [email protected]
Wulfram Gerstner wulfram.gerstner@epfl.ch Laboratory of Computational Neuroscience, School of Computer and Communication Sciences and Brain-Mind Institute, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland
Neural Computation 18, 1318–1348 (2006) © 2006 Massachusetts Institute of Technology

In timing-based neural codes, neurons have to emit action potentials at precise moments in time. We use a supervised learning paradigm to derive a synaptic update rule that optimizes by gradient ascent the likelihood of postsynaptic firing at one or several desired firing times. We find that the optimal strategy of up- and downregulating synaptic efficacies depends on the relative timing between presynaptic spike arrival and desired postsynaptic firing. If the presynaptic spike arrives before the desired postsynaptic spike timing, our optimal learning rule predicts that the synapse should become potentiated. The dependence of the potentiation on spike timing directly reflects the time course of an excitatory postsynaptic potential. However, our approach gives no unique reason for synaptic depression under reversed spike timing. In fact, the presence and amplitude of depression of synaptic efficacies for reversed spike timing depend on how constraints are implemented in the optimization problem. Two different constraints, control of postsynaptic rates and control of temporal locality, are studied. The relation of our results to spike-timing-dependent plasticity and reinforcement learning is discussed.

1 Introduction

Experimental evidence suggests that precise timing of spikes is important in several brain systems. In the barn owl auditory system, for example, coincidence-detecting neurons receive volleys of temporally precise spikes from both ears (Carr & Konishi, 1990). In the electrosensory system of mormyrid electric fish, medium ganglion cells receive input at precisely
timed delays after electric pulse emission (Bell, Han, Sugawara, & Grant, 1997). Under the influence of a common oscillatory drive as present in the rat hippocampus or olfactory system, the strength of a constant stimulus is coded in the relative timing of neuronal action potentials (Hopfield, 1995; Brody & Hopfield, 2003; Mehta, Lee, & Wilson, 2002). In humans, precise timing of first spikes in tactile afferents encodes touch signals at the fingertips (Johansson & Birznieks, 2004). Similar codes have also been suggested for rapid visual processing (Thorpe, Delorme, & Van Rullen, 2001), and for the rat's whisker response (Panzeri, Peterson, Schultz, Lebedev, & Diamond, 2001). The precise timing of neuronal action potentials also plays an important role in spike-timing-dependent plasticity (STDP). If a presynaptic spike arrives at the synapse before the postsynaptic action potential, the synapse is potentiated; if the timing is reversed, the synapse is depressed (Markram, Lübke, Frotscher, & Sakmann, 1997; Zhang, Tao, Holt, Harris, & Poo, 1998; Bi & Poo, 1998, 1999, 2001). This biphasic STDP function is reminiscent of a temporal contrast or temporal derivative filter and suggests that STDP is sensitive to the temporal features of a neural code. Indeed, theoretical studies have shown that given a biphasic STDP function, synaptic plasticity can lead to a stabilization of synaptic weight dynamics (Kempter, Gerstner, & van Hemmen, 1999, 2001; Song, Miller, & Abbott, 2000; van Rossum, Bi, & Turrigiano, 2000; Rubin, Lee, & Sompolinsky, 2001) while the neuron remains sensitive to temporal structure in the input (Gerstner, Kempter, van Hemmen, & Wagner, 1996; Roberts, 1999; Kempter et al., 1999; Kistler & van Hemmen, 2000; Rao & Sejnowski, 2001; Gerstner & Kistler, 2002a). While the relative firing time of pre- and postsynaptic neurons, and hence temporal aspects of a neural code, play a role in STDP, it is less clear whether STDP is useful to learn a temporal code.
In order to elucidate the computational function of STDP, we ask in this letter the following question: What is the ideal form of an STDP function in order to generate action potentials of the postsynaptic neuron with high temporal precision? This question naturally leads to a supervised learning paradigm: the task to be learned by the neuron is to fire at a predefined desired firing time t des. Supervised paradigms are common in machine learning in the context of classification and prediction problems (Minsky & Papert, 1969; Haykin, 1994; Bishop, 1995), but have more recently also been studied for spiking neurons in feedforward and recurrent networks (Legenstein, Naeger, & Maass, 2005; Rao & Sejnowski, 2001; Barber, 2003; Gerstner, Ritz, & van Hemmen, 1993; Izhikevich, 2003). Compared to unsupervised or reward-based learning paradigms, supervised paradigms on the level of single spikes are obviously less relevant from a biological point of view, since it is questionable what type of signal could tell the neuron about the "desired" firing time. Nevertheless, we think it is worth addressing the problem of supervised learning—first, as a problem in its own right, and second, as a starting point of spike-based reinforcement learning (Xie & Seung, 2004;
1320
J.-P. Pfister, T. Toyoizumi, D. Barber, and W. Gerstner
Seung, 2003). Reinforcement learning in a temporal coding paradigm implies that certain sequences of firing times are rewarded, whereas others are not. The “desired firing times” are hence defined indirectly via the presence or absence of a reward signal. The exact relation of our supervised paradigm to reward-based reinforcement learning will be presented in section 4. Section 2 introduces the stochastic neuron model and coding paradigm, which are used to derive the results presented in section 3.

2 Model

2.1 Coding Paradigm. In order to explain our computational paradigm, we focus on the example of temporal coding of human touch stimuli (Johansson & Birznieks, 2004), but the same ideas would apply analogously to the other neuronal systems with temporal codes already mentioned (Carr & Konishi, 1990; Bell et al., 1997; Hopfield, 1995; Brody & Hopfield, 2003; Mehta et al., 2002; Panzeri et al., 2001). For a given touch stimulus, spikes in an ensemble of N tactile afferents occur in a precise temporal order. If the same touch stimulus with identical surface properties and force vector is repeated several times, the relative timing of action potentials is reliably reproduced, whereas the spike timing in the same ensemble of afferents is different for other stimuli (Johansson & Birznieks, 2004). In our model, we assume that all input lines, labeled by the index j with 1 ≤ j ≤ N, converge onto one or several postsynaptic neurons. We think of the postsynaptic neuron as a detector for a given spatiotemporal spike pattern in the input. The full spike pattern detection paradigm will be used in section 3.3. As a preparation and first steps toward the full coding paradigm, we also consider the response of a postsynaptic neuron to a single presynaptic spike (section 3.1) or to one given spatiotemporal firing pattern (section 3.2).

2.2 Neuron Model. Let us consider a neuron i that is receiving input from N presynaptic neurons.
Let us denote the ensemble of all spikes of neuron j by $x_j = \{t_j^1, \ldots, t_j^{N_j}\}$, where $t_j^k$ denotes the time when neuron j fired its kth spike. The spatiotemporal spike pattern of all presynaptic neurons $1 \le j \le N$ will be denoted by boldface $\mathbf{x} = \{x_1, \ldots, x_N\}$. A presynaptic spike elicited at time $t_j^f$ evokes an excitatory postsynaptic potential (EPSP) of amplitude $w_{ij}$ and time course $\epsilon(t - t_j^f)$. For simplicity, we approximate the EPSP time course by a double exponential,

$$\epsilon(s) = \epsilon_0 \left[ \exp\left(-\frac{s}{\tau_m}\right) - \exp\left(-\frac{s}{\tau_s}\right) \right] \Theta(s), \tag{2.1}$$

with a membrane time constant of $\tau_m = 10$ ms and a synaptic time constant of $\tau_s = 0.7$ ms, which yields an EPSP rise time of 2 ms. Here $\Theta(s)$ denotes the Heaviside step function with $\Theta(s) = 1$ for $s > 0$ and $\Theta(s) = 0$ otherwise.
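Equation 2.1 is straightforward to evaluate numerically. The sketch below is our own illustration (not code from the letter); it uses the constants just quoted, plus the amplitude $\epsilon_0 = 1.3$ mV that is fixed in the next paragraph, and confirms the stated 2 ms rise time.

```python
import numpy as np

TAU_M = 10.0  # membrane time constant [ms]
TAU_S = 0.7   # synaptic time constant [ms]
EPS0 = 1.3    # EPSP amplitude scale epsilon_0 [mV]

def epsilon(s):
    """Double-exponential EPSP kernel of equation 2.1 (s in ms)."""
    s = np.asarray(s, dtype=float)
    return EPS0 * (np.exp(-s / TAU_M) - np.exp(-s / TAU_S)) * (s > 0)

# The kernel is causal and, with these constants, peaks about 2 ms after
# the presynaptic spike, matching the quoted EPSP rise time.
s_grid = np.linspace(0.0, 50.0, 5001)
peak_time = s_grid[np.argmax(epsilon(s_grid))]
```

The analytic peak position is $\ln(\tau_m/\tau_s)/(1/\tau_s - 1/\tau_m) \approx 2.0$ ms, which the grid search reproduces.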
Optimal STDP in Supervised Learning
1321
We set $\epsilon_0 = 1.3$ mV such that a spike at a synapse with $w_{ij} = 1$ evokes an EPSP with amplitude of approximately 1 mV. Since the EPSP amplitude is a measure of the strength of a synapse, we refer to $w_{ij}$ also as the efficacy (or “weight”) of the synapse between neuron j and i. Let us further suppose that the postsynaptic neuron i receives an additional input I(t) that could arise from either a second group of neurons or from intracellular current injection. We think of the second input as a teaching input that increases the probability that the neuron fires at or close to the desired firing time $t^{des}$. For simplicity, we model the teaching input as a square current pulse $I(t) = I_0\,\Theta(t - t^{des} + \Delta T/2)\,\Theta(t^{des} + \Delta T/2 - t)$ of amplitude $I_0$ and duration $\Delta T$. The effect of the teaching current on the membrane potential is
$$u^{teach}(t) = \int_0^{\infty} k(s)\, I(t - s)\, ds, \tag{2.2}$$
with $k(s) = k_0 \exp(-s/\tau_m)$, where $k_0$ is a constant that is inversely proportional to the capacitance of the neuronal membrane. In the context of the human touch paradigm discussed in section 2.1, the teaching input could represent some preprocessed visual information (“object touched by fingers starts to slip now”), feedback from muscle activity (“strong counterforce applied now”), cross-talk from other detector neurons in the same population (“your colleagues are active now”), or unspecific modulatory input due to arousal or reward (“be aware—something interesting happening now”). In the context of training of recurrent networks (e.g., Rao & Sejnowski, 2001), the teaching input consists of a short pulse of an amplitude that guarantees action potential firing. The membrane potential of the postsynaptic neuron i (spike response model; Gerstner & Kistler, 2002b) is influenced by the EPSPs evoked by all afferent spikes of stimulus $\mathbf{x}$, the “teaching” signal, and the refractory effects generated by spikes $t_i^f$ of the postsynaptic neuron:

$$u_i(t|\mathbf{x}, y_t) = u_{rest} + \sum_{j=1}^{N} w_{ij} \sum_{t_j^f \in x_j} \epsilon(t - t_j^f) + \sum_{t_i^f \in y_t} \eta(t - t_i^f) + u^{teach}(t), \tag{2.3}$$
where $u_{rest} = -70$ mV is the resting potential, $y_t = \{t_i^1, t_i^2, \ldots, t_i^F < t\}$ is the set of postsynaptic spikes that occurred before t, and $t_i^F$ always denotes the last postsynaptic spike before t. On the right-hand side of equation 2.3, $\eta(s)$ denotes the spike afterpotential generated by an action potential. We take

$$\eta(s) = \eta_0 \exp\left(-\frac{s}{\tau_m}\right) \Theta(s), \tag{2.4}$$
Figure 1: (A) Escape rate $g(u) = \rho_0 \exp\left(\frac{u - \vartheta}{\Delta u}\right)$. (B) Firing rate of the postsynaptic neuron as a function of the amplitude $I_0$ of a constant stimulation current (arbitrary units). (C) Interspike interval (ISI) distribution for different input currents ($I_0 = 1$, $I_0 = 1.5$, $I_0 = 2$).
where $\eta_0 < 0$ is a reset parameter that describes how much the voltage is reset after each spike (for the relation to integrate-and-fire neurons, see Gerstner & Kistler, 2002b). The spikes themselves are not modeled explicitly but reduced to formal firing times. Unless specified otherwise, we take $\eta_0 = -5$ mV. In a deterministic version of the model, output spikes would be generated whenever the membrane potential $u_i$ reaches a threshold $\vartheta$. In order to account for intrinsic noise and also for a small amount of synaptic noise generated by stochastic spike arrival from additional excitatory and inhibitory presynaptic neurons that are not modeled explicitly, we replace the strict threshold by a stochastic one. More precisely we adopt the following procedure (Gerstner & Kistler, 2002b). Action potentials of the postsynaptic neuron i are generated by a point process with time-dependent stochastic intensity $\rho_i(t) = g(u_i(t))$ that depends nonlinearly on the membrane potential $u_i$. Since the membrane potential in turn depends on both the input and the firing history of the postsynaptic neuron, we write

$$\rho_i(t|\mathbf{x}, y_t) = g(u_i(t|\mathbf{x}, y_t)). \tag{2.5}$$
We take an exponential to describe the stochastic escape across threshold, $g(u) = \rho_0 \exp\left(\frac{u - \vartheta}{\Delta u}\right)$, where $\vartheta = -50$ mV is the formal threshold, $\Delta u = 3$ mV is the width of the threshold region and therefore tunes the stochasticity of the neuron, and $\rho_0 = 1/\mathrm{ms}$ is the stochastic intensity at threshold (see Figure 1). Other choices of the escape function g are possible with no qualitative change of the results. For $\Delta u \to 0$, the model is identical to the deterministic leaky integrate-and-fire model with synaptic current injection (Gerstner & Kistler, 2002b).
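In discrete time, the model of equations 2.1 through 2.5 can be simulated by drawing a spike in each bin of width $\Delta$ with probability $1 - \exp(-\rho\Delta)$. The following is our own minimal sketch (not the authors' code), using the parameter values quoted above and omitting the teaching input:

```python
import numpy as np

rng = np.random.default_rng(0)

TAU_M, TAU_S = 10.0, 0.7      # membrane and synaptic time constants [ms]
U_REST, THETA = -70.0, -50.0  # resting potential and formal threshold [mV]
DU, RHO0 = 3.0, 1.0           # threshold width [mV], intensity at threshold [1/ms]
ETA0, EPS0 = -5.0, 1.3        # reset amplitude and EPSP scale [mV]
DT = 0.1                      # simulation time step [ms]

def eps(s):
    # Double-exponential EPSP kernel of equation 2.1.
    return EPS0 * (np.exp(-s / TAU_M) - np.exp(-s / TAU_S)) * (s > 0)

def g(u):
    # Exponential escape rate across the soft threshold (equation 2.5).
    return RHO0 * np.exp((u - THETA) / DU)

def simulate(pre_spikes, w, T=200.0):
    """Sample one postsynaptic spike train (equations 2.3-2.5, no teaching input)."""
    pre = np.asarray(pre_spikes)
    post = []
    for t in np.arange(0.0, T, DT):
        u = U_REST + w * eps(t - pre).sum()                        # EPSPs
        u += sum(ETA0 * np.exp(-(t - tf) / TAU_M) for tf in post)  # afterpotentials
        if rng.random() < 1.0 - np.exp(-g(u) * DT):                # spike in this bin?
            post.append(t)
    return post

spikes = simulate(pre_spikes=[50.0, 100.0, 150.0], w=1.0)
```

At the resting potential the intensity is $\rho_0 e^{-20/3} \approx 1.3 \times 10^{-3}$ per ms, so spontaneous spikes are rare on a 200 ms trial; raising w or injecting a teaching current would make firing near the input spikes much more likely.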
We note that the stochastic process, defined in equation 2.5, is similar to but different from a Poisson process, since the stochastic intensity depends on the set $y_t$ of the previous spikes of the postsynaptic neuron. Thus, the neuron model has some memory of previous spikes.

2.3 Stochastic Generative Model. The advantage of the probabilistic framework introduced above via the noisy threshold is that it is possible to describe the probability density¹ $P_i(y|\mathbf{x})$ of an entire spike train² $Y(t) = \sum_{t_i^f \in y} \delta(t - t_i^f)$ (see appendix A for details):

$$P_i(y|\mathbf{x}) = \left[\prod_{t_i^f \in y} \rho_i(t_i^f|\mathbf{x}, y_{t_i^f})\right] \exp\left(-\int_0^T \rho_i(s|\mathbf{x}, y_s)\, ds\right) = \exp\left(\int_0^T \left[\log(\rho_i(s|\mathbf{x}, y_s))\, Y(s) - \rho_i(s|\mathbf{x}, y_s)\right] ds\right). \tag{2.6}$$
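In discrete time the log of equation 2.6 becomes a sum of $\log\rho$ over the bins containing spikes minus the integral of $\rho$. A minimal sketch (our own illustration; the constant-rate example is hypothetical) makes the connection to the familiar Poisson likelihood explicit:

```python
import numpy as np

DT = 0.1  # width of a discretization bin [ms]

def log_likelihood(rho, spike_bins):
    """Discrete-time version of the log of equation 2.6.

    rho        : stochastic intensity in each bin [1/ms]
    spike_bins : indices of the bins that contain a postsynaptic spike
    """
    return np.log(rho[spike_bins]).sum() - rho.sum() * DT

# Sanity check against the homogeneous Poisson case, where the density of
# observing spikes at k fixed times in [0, T] is r**k * exp(-r * T):
r, T = 0.05, 100.0
rho = np.full(int(T / DT), r)
ll = log_likelihood(rho, np.array([100, 500, 900]))
expected = 3 * np.log(r) - r * T
```

For a history-dependent intensity, one would first evaluate $\rho$ along the given spike train (as in equation 2.5) and then apply the same formula.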
Thus, we have a generative model that allows us to describe explicitly the likelihood $P_i(y|\mathbf{x})$ of emitting a set of spikes y for a given input $\mathbf{x}$. Moreover, since the likelihood in equation 2.6 is a smooth function of its parameters, it is straightforward to differentiate it with respect to any variable. Let us differentiate $P_i(y|\mathbf{x})$ with respect to the synaptic efficacy $w_{ij}$, since this is a quantity that we will use later,

$$\frac{\partial \log P_i(y|\mathbf{x})}{\partial w_{ij}} = \int_0^T \left[Y(s) - \rho_i(s|\mathbf{x}, y_s)\right] \frac{\rho_i'(s|\mathbf{x}, y_s)}{\rho_i(s|\mathbf{x}, y_s)} \sum_{t_j^f \in x_j} \epsilon(s - t_j^f)\, ds, \tag{2.7}$$

where $\rho_i'(s|\mathbf{x}, y_s) = \left.\frac{dg}{du}\right|_{u = u_i(s|\mathbf{x}, y_s)}$.

In this letter, we propose three different optimal models: A, B, and C (see Table 1). The models differ in the stimulation paradigm and the specific task of the neuron. In section 3, the task and hence the optimality criteria are supposed to be given explicitly. However, the task in model C could also be defined indirectly by the presence or absence of a reward signal, as discussed in section 4.1. The common idea behind all three approaches is the notion of optimal performance. Optimality is defined by an objective function L that is directly related to the likelihood formula of equation 2.6 and that can be maximized by changes of the synaptic weights. Throughout
¹ For simplicity, we denoted the set of postsynaptic spikes from 0 to T by y instead of $y_T$.
² Capital Y is the spike train generated by the ensemble (lowercase) y.
Table 1: Summary of the Optimality Criterion L for the Unconstrained Scenarios (Au, Bu, Cu) and the Constrained Scenarios (Ac, Bc, Cc).

Unconstrained scenarios:
- Au — Postsynaptic spike imposed: $L_{A_u} = \log(\rho(t^{des}))$
- Bu — Postsynaptic spike imposed + spontaneous activity: $L_{B_u} = \log(\bar\rho(t^{des}))$
- Cu — Postsynaptic spike patterns imposed: $L_{C_u} = \log\left[\prod_i P_i(y^i|\mathbf{x}^i) \left(\prod_{k \ne i} P_i(0|\mathbf{x}^k)\right)^{\frac{\gamma}{M-1}}\right]$

Constrained scenarios:
- Ac — No activity: $L_{A_c} = L_{A_u} - \int_0^T \rho(t)\, dt$
- Bc — Stabilized activity: $L_{B_c} = L_{B_u} - \frac{1}{2T\sigma^2} \int_0^T (\bar\rho(t) - \nu_0)^2\, dt$
- Cc — Temporal locality constraint: $L_{C_c} = L_{C_u}$, with penalty matrix $P_{\ell\ell} = a\,(\ell - \tilde T_0)^2$

Notes: The constraint for scenario C is not included in the likelihood function $L_{C_c}$ itself, but rather in the deconvolution with a matrix P that penalizes quadratically the terms that are nonlocal in time. See appendix C for more details.
the article, this optimization is done by a standard technique of gradient ascent,

$$\Delta w_{ij} = \alpha \frac{\partial L}{\partial w_{ij}}, \tag{2.8}$$
with a learning rate α. Since the three models correspond to three different tasks, they have slightly different objective functions. Therefore, gradient ascent yields slightly different strategies for synaptic update. In the following, we start with the simplest model with the aim of illustrating the basic principles that generalize to the more complex models.

3 Results

In this section, we present synaptic update rules derived by optimizing the likelihood of postsynaptic spike firing at some desired firing time $t^{des}$. The essence of the argument is introduced in a particularly simple scenario, where the neuron is stimulated by one presynaptic spike and the neuron is inactive except at the desired firing time $t^{des}$. This is the raw scenario that is developed in several directions. First, we may ask how the postsynaptic spike at the desired time $t^{des}$ is generated. The spike could simply be given by a supervisor. As always in maximum likelihood approaches, we then optimize the likelihood that this spike could have been generated by the neuron model (i.e., the generative model) given the known input. Or the spike could have been generated by a strong current pulse of short duration applied by the supervisor (teaching input). In this case, the a priori likelihood that the generative model fires at or close to the desired firing time is much higher. The two conceptual paradigms give slightly different results, as discussed in scenario A.
Second, we may, in addition to the spike at the desired time t des , allow for other postsynaptic spikes generated spontaneously. The consequences of spontaneous activity for the STDP function are discussed in scenario B. Third, instead of imposing a single postsynaptic spike at a desired firing time t des , we can think of a temporal coding scheme where the postsynaptic neuron responds to one (out of M) presynaptic spike pattern with a desired output spike train containing several spikes while staying inactive for the other M − 1 presynaptic spike patterns. This corresponds to a pattern classification task, which is the topic of scenario C. Moreover, optimization can be performed in an unconstrained fashion or under some constraint. As we will see in this section, the specific form of the constraint influences the results on STDP, in particular, the strength of synaptic depression for post-before-pre timing. To emphasize this aspect, we discuss two constraints. The first constraint is motivated by the observation that neurons have a preferred working point defined by a typical mean firing rate that is stabilized by homeostatic synaptic processes (Turrigiano & Nelson, 2004). Penalizing deviations from a target firing rate is the constraint that we will use in scenario B. For a very low target firing rate, the constraint reduces to the condition of “no activity,” which is the constraint implemented in scenario A. The second type of constraint is motivated by the notion of STDP itself: changes of synaptic plasticity should depend on the relative timing of pre- and postsynaptic spike firing and not on other factors. If STDP is to be implemented by some physical or chemical mechanisms with finite time constants, we must require the STDP function to be local in time, that is, the amplitude of the STDP function approaches zero for large time differences. This is the temporal locality constraint used in scenario C.
While the unconstrained optimization problems are labeled with the subscript u (Au , Bu , Cu ), the constrained problems are marked by the subscript c (Ac , Bc , Cc ) (see Table 1).
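For the exponential escape rate $g(u) = \rho_0 \exp((u - \vartheta)/\Delta u)$, the factor $\rho'/\rho$ in equation 2.7 is simply the constant $1/\Delta u$, which makes the generic update of equations 2.7 and 2.8 easy to sketch in discrete time. The code below is our own illustration (names and parameter values are ours, not the authors'):

```python
import numpy as np

DT = 0.1      # time step [ms]
DU = 3.0      # width of the threshold region [mV]
ALPHA = 0.01  # learning rate

def grad_log_likelihood(Y, rho, epsp_trace):
    """Equation 2.7 for the exponential escape rate, where rho'/rho = 1/DU.

    Y          : discretized spike train, 1/DT in bins with a spike, else 0
    rho        : stochastic intensity in each bin [1/ms]
    epsp_trace : summed EPSP time course of synapse j in each bin
    """
    return ((Y - rho) / DU * epsp_trace).sum() * DT

def ascent_step(w, Y, rho, epsp_trace):
    # One gradient-ascent step, equation 2.8.
    return w + ALPHA * grad_log_likelihood(Y, rho, epsp_trace)

# If the observed spikes outnumber the expected ones along the EPSP trace,
# the weight grows; with no observed spike at all, it shrinks.
n = 1000
rho = np.full(n, 0.005)
epsp = np.ones(n)
Y_spike = np.zeros(n); Y_spike[500] = 1.0 / DT  # one observed spike
Y_empty = np.zeros(n)                           # no observed spikes
```

The term $Y(s) - \rho(s)$ acts as a prediction error: bins where a spike occurred but was unlikely push the weight up, and bins where no spike occurred push it down in proportion to the expected rate.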
3.1 Scenario A: One Postsynaptic Spike Imposed. Let us start with a particularly simple model, which consists of one presynaptic neuron and one postsynaptic neuron (see Figure 2A). Let us suppose that the task of the postsynaptic neuron i is to fire a single spike at time $t^{des}$ in response to the input, which consists of a single presynaptic spike at time $t^{pre}$, that is, the input is $x = \{t^{pre}\}$ and the desired output of the postsynaptic neuron is $y = \{t^{des}\}$. Since there is only a single pre- and a single postsynaptic neuron involved, we drop in this section the indices j and i of the two neurons.

3.1.1 Unconstrained Scenario Au: One Spike at $t^{des}$. In this section, we assume that the postsynaptic neuron has not been active in the recent past, that is, refractory effects are negligible. In this case, we have $\rho(t|x, y_t) = \rho(t|x)$ because of the absence of previous spikes. Moreover, since there is
Figure 2: (A) Scenario A: a single presynaptic neuron connected to a postsynaptic neuron with a synapse of weight w. (B) Optimal weight change given by equation 3.2 for scenario Au. This weight change is exactly the mirror image of an EPSP.
only a single presynaptic spike (i.e., $x = \{t^{pre}\}$), we write $\rho(t|t^{pre})$ instead of $\rho(t|x)$. Since the task of the postsynaptic neuron is to fire at time $t^{des}$, we can define the optimality criterion $L_{A_u}$ as the log likelihood of the firing intensity at time $t^{des}$,

$$L_{A_u} = \log\left(\rho(t^{des}|t^{pre})\right). \tag{3.1}$$
The gradient ascent on this function leads to the following STDP function,

$$\Delta w_{A_u} = \alpha \frac{\partial L_{A_u}}{\partial w} = \alpha \frac{\rho'(t^{des}|t^{pre})}{\rho(t^{des}|t^{pre})}\, \epsilon(t^{des} - t^{pre}), \tag{3.2}$$

where $\rho'(t|t^{pre}) \equiv \left.\frac{dg}{du}\right|_{u = u(t|t^{pre})}$. Since this optimal weight change $\Delta w_{A_u}$ can be calculated for any presynaptic firing time $t^{pre}$, we get an STDP function that depends on the time difference $\Delta t = t^{pre} - t^{des}$ (see Figure 2B). As we can see directly from equation 3.2, the shape of the potentiation is exactly a mirror image of an EPSP. This result is independent of the specific choice of the function g(u). The drawback of this simple model becomes apparent if the STDP function given by equation 3.2 is iterated over several repetitions of the experiment. Ideally, it should converge to an optimal solution given by $\Delta w_{A_u} = 0$ in equation 3.2. However, the optimal solution given by $\Delta w_{A_u} = 0$ is problematic: for $\Delta t < 0$, the optimal weight tends toward $\infty$, whereas for $\Delta t \ge 0$,
there is no unique optimal weight ($\Delta w_{A_u} = 0$, $\forall w$). The reason for this problem is that the model describes only potentiation and includes no mechanisms for depression.

3.1.2 Constrained Scenario Ac: No Other Spikes Than at $t^{des}$. In order to get some insight into where the depression could come from, let us consider a small modification of the previous model. In addition to the fact that the neuron has to fire at time $t^{des}$, let us suppose that it should not fire anywhere else. This condition can be implemented by an application of equation 2.6 to the case of a single input spike $x = \{t^{pre}\}$ and a single output spike $y = \{t^{des}\}$. In terms of notation, we set $P(y|x) = P(t^{des}|t^{pre})$ and similarly $\rho(s|x, y) = \rho(s|t^{pre}, t^{des})$ and use equation 2.6 to find

$$P(t^{des}|t^{pre}) = \rho(t^{des}|t^{pre}) \exp\left(-\int_0^T \rho(s|t^{pre}, t^{des})\, ds\right). \tag{3.3}$$
Note that for $s \le t^{des}$, the firing intensity does not depend on $t^{des}$; hence, $\rho(s|t^{pre}, t^{des}) = \rho(s|t^{pre})$ for $s \le t^{des}$. We define the objective function $L_{A_c}$ as the log likelihood of generating a single output spike at time $t^{des}$, given a single input spike at $t^{pre}$. Hence, with equation 3.3,

$$L_{A_c} = \log(P(t^{des}|t^{pre})) = \log(\rho(t^{des}|t^{pre})) - \int_0^T \rho(s|t^{pre}, t^{des})\, ds, \tag{3.4}$$
and the gradient ascent rule $\Delta w_{A_c} = \alpha\, \partial L_{A_c}/\partial w$ yields

$$\Delta w_{A_c} = \alpha \frac{\rho'(t^{des}|t^{pre})}{\rho(t^{des}|t^{pre})}\, \epsilon(t^{des} - t^{pre}) - \alpha \int_0^T \rho'(s|t^{pre}, t^{des})\, \epsilon(s - t^{pre})\, ds. \tag{3.5}$$

Since we have a single postsynaptic spike at $t^{des}$, equation 3.5 can directly be plotted as an STDP function. In Figure 3 we distinguish two different cases. In Figure 3A we optimize the likelihood $L_{A_c}$ in the absence of any teaching input. To understand this scenario, we may imagine that a postsynaptic spike has occurred spontaneously at the desired firing time $t^{des}$. Applying the appropriate weight update calculated from equation 3.5 will make such a timing more likely the next time the presynaptic stimulus is repeated. The reset amplitude $\eta_0$ has only a small influence. In Figure 3B, we consider a case where firing of the postsynaptic spike at the appropriate time was made highly likely by a teaching input of duration $\Delta T = 1$ ms centered around the desired firing time $t^{des}$. The form of the STDP function depends on the amount $\eta_0$ of the reset. If there is no reset ($\eta_0 = 0$), the
Figure 3: Optimal weight adaptation for scenario Ac given by equation 2.7 in the absence of a teaching signal (A) and in the presence of a teaching signal (B), for reset amplitudes $\eta_0 = -10$ mV, $-5$ mV, and 0 mV. The weight change in the post-before-pre region is governed by the spike afterpotential $u_{AP}(t) = \eta(t) + u^{teach}(t)$. The duration of the teaching input is $\Delta T = 1$ ms. The amplitude of the current $I_0$ is chosen so that $\max_t u^{teach}(t) = 5$ mV. $u_{rest}$ is chosen such that the spontaneous firing rate $g(u_{rest})$ matches the desired firing rate 1/T: $u_{rest} = \Delta u \log\left(\frac{1}{T\rho_0}\right) + \vartheta \simeq -60$ mV. The weight strength is w = 1.
STDP function shows strong synaptic depression of synapses that become active after the postsynaptic spike. This is due to the fact that the teaching input causes an increase of the membrane potential that decays back to rest with the membrane time constant $\tau_m$. Hence, the window of synaptic depression is also exponential with the same time constant. Qualitatively the same is true if we include a weak reset. The form of the depression window remains the same, but its amplitude is reduced. The inverse of the effect occurs only for strong reset to or below resting potential. A weak reset is standard in applications of integrate-and-fire models to in vivo data and is one of the possibilities for explaining the high coefficient of variation of neuronal spike trains in vivo (Bugmann, Christodoulou, & Taylor, 1997; Troyer & Miller, 1997). A further property of the STDP functions in Figure 3 is a negative offset for $|t^{pre} - t^{des}| \to \infty$. The amplitude of the offset can be calculated for $w \simeq 0$ and $\Delta t > 0$, that is, $\Delta w_0 = -\alpha\, \rho'(u_{rest}) \int_0^\infty \epsilon(s)\, ds$. This offset is due to the fact that we do not want spikes at other times than $t^{des}$. As a result, the optimal weight w (the solution of $\Delta w_{A_u} = 0$) should be as negative as possible ($w \to -\infty$ or $w \to w^{min}$ in the presence of a lower bound) for $\Delta t > 0$ or $\Delta t \ll 0$.

3.2 Scenario B: Spontaneous Activity. The constraint in scenario Ac of having strictly no other postsynaptic spikes than the one at time $t^{des}$ may seem artificial. Moreover, it is this constraint that leads to the negative
Figure 4: Scenario B. (A) N = 200 presynaptic neurons are firing one after the other at time $t_j = j\,\delta t$ with $\delta t = 1$ ms. (B) The optimal STDP function of scenario Bu.
offset of the STDP function discussed at the end of the previous paragraph. In order to relax the constraint of no spiking, we allow in scenario B for a reasonable spontaneous activity. As above, we start with an unconstrained scenario Bu before we turn to the constrained scenario Bc.

3.2.1 Unconstrained Scenario Bu: Maximize the Firing Rate at $t^{des}$. Let us start with the simplest model, which includes spontaneous activity. Scenario Bu is the analog of the model Au, but with two differences. First, we include spontaneous activity in the model. Since $\rho(t|\mathbf{x}, y_t)$ depends on the spiking history for any given trial, we have to define a quantity that is independent of the specific realizations y of the postsynaptic spike train. Second, instead of considering only one presynaptic neuron, we consider N = 200 presynaptic neurons, each emitting a single spike at time $t_j = j\,\delta t$, where $\delta t = 1$ ms (see Figure 4A). The input pattern will therefore be described by the set of delayed spikes $\mathbf{x} = \{x_j = \{t_j\},\ j = 1, \ldots, N\}$. As long as we consider only a single spatiotemporal spike pattern in the input, it is always possible to relabel neurons appropriately so that neuron j + 1 fires after neuron j. Let us define the instantaneous firing rate $\bar\rho(t)$ that can be calculated by averaging $\rho(t|y_t)$ over all realizations of postsynaptic spike trains:

$$\bar\rho(t|\mathbf{x}) = \left\langle \rho(t|\mathbf{x}, y_t) \right\rangle_{y_t|\mathbf{x}}. \tag{3.6}$$
Here the notation $\langle \cdot \rangle_{y_t|\mathbf{x}}$ means taking the average over all possible configurations of postsynaptic spikes up to t for a given input $\mathbf{x}$. In analogy to a Poisson process, a specific spike train with firing times $y_t = \{t_i^1, t_i^2, \ldots, t_i^F < t\}$ is generated with probability $P(y_t|\mathbf{x})$ given by equation 2.6. Hence, the
average $\langle \cdot \rangle_{y_t|\mathbf{x}}$ of equation 3.6 can be written as follows (see appendix B for the numerical evaluation of $\bar\rho(t)$):

$$\bar\rho(t|\mathbf{x}) = \sum_{F=0}^{\infty} \frac{1}{F!} \int_0^t \cdots \int_0^t \rho(t|\mathbf{x}, y_t)\, P(y_t|\mathbf{x})\, dt_i^F \cdots dt_i^1. \tag{3.7}$$
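Instead of summing over all spike configurations as in equation 3.7, the average of equation 3.6 can also be estimated by naive Monte Carlo: simulate many trials of the generative model and average the history-dependent intensity across them. The sketch below is our own illustration of that idea (it is not the numerical scheme of appendix B, and the constant drive $u_0$ is an illustrative stand-in for the EPSP-driven potential):

```python
import numpy as np

rng = np.random.default_rng(1)

DT, T = 0.5, 100.0        # time step and trial duration [ms]
TAU_M, ETA0 = 10.0, -5.0  # membrane time constant [ms], reset amplitude [mV]
THETA, DU, RHO0 = -50.0, 3.0, 1.0
U0 = -65.0                # constant input-driven potential [mV] (illustrative)

def run_trial():
    """One realization: return the history-dependent intensity rho(t | y_t)."""
    t_grid = np.arange(0.0, T, DT)
    rho_t = np.empty_like(t_grid)
    post = []
    for i, t in enumerate(t_grid):
        u = U0 + sum(ETA0 * np.exp(-(t - tf) / TAU_M) for tf in post)
        rho_t[i] = RHO0 * np.exp((u - THETA) / DU)
        if rng.random() < 1.0 - np.exp(-rho_t[i] * DT):
            post.append(t)
    return rho_t

# Equation 3.6: average the history-dependent intensity over realizations.
rho_bar = np.mean([run_trial() for _ in range(200)], axis=0)
```

Because each spike is followed by an inhibitory afterpotential, the averaged rate stays at or below the free rate $\rho_0 \exp((u_0 - \vartheta)/\Delta u)$.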
Analogous to model Au, we can define the quality criterion as the log likelihood $L_{B_u}$ of firing at the desired time $t^{des}$:

$$L_{B_u} = \log(\bar\rho(t^{des}|\mathbf{x})). \tag{3.8}$$
Thus, the optimal weight adaptation of synapse j is given by

$$\Delta w_j^{B_u} = \alpha\, \frac{\partial \bar\rho(t^{des}|\mathbf{x})/\partial w_j}{\bar\rho(t^{des}|\mathbf{x})}, \tag{3.9}$$

where $\partial \bar\rho(t|\mathbf{x})/\partial w_j$ is given by

$$\frac{\partial \bar\rho(t|\mathbf{x})}{\partial w_j} = \bar\rho'(t|\mathbf{x})\, \epsilon(t - t_j) + \left\langle \rho(t|\mathbf{x}, y_t)\, \frac{\partial}{\partial w_j} \log P(y_t|\mathbf{x}) \right\rangle_{y_t|\mathbf{x}}, \tag{3.10}$$
where $\frac{\partial}{\partial w_j} \log P(y_t|\mathbf{x})$ is given by equation 2.7 and $\bar\rho'(t|\mathbf{x}) = \left\langle \left.\frac{dg}{du}\right|_{u = u(t|\mathbf{x}, y_t)} \right\rangle_{y_t|\mathbf{x}}$. Figure 4B shows that for our standard set of parameters, the differences to scenario Au are negligible. Figure 5A depicts the STDP function for various values of the parameter $\Delta u$ at a higher postsynaptic firing rate. We can see a small undershoot in the pre-before-post region. The presence of this small undershoot can be understood as follows: enhancing a synapse of a presynaptic neuron that fires too early would induce a postsynaptic spike that arrives before the desired firing time and because of refractoriness would therefore prevent the generation of a spike at the desired time. The depth of this undershoot decreases with the stochasticity of the neuron and increases with the amplitude of the refractory period (if $\eta_0 = 0$, there is no undershoot). In fact, correlations between pre- and postsynaptic firing reflect the shape of an EPSP in the high-noise regime, whereas they show a trough for low noise (Poliakov, Powers, & Binder, 1997; Gerstner, 2001). Our theory shows that the pre-before-post region of the optimal plasticity function is a mirror image of these correlations.
3.2.2 Constrained Scenario Bc: Firing Rate Close to $\nu_0$. In analogy to model Ac, we introduce a constraint. Instead of imposing strictly no spikes at times $t \ne t^{des}$, we can relax the condition and minimize deviations of the
Figure 5: (A) The optimal STDP functions of scenario Bu for different levels of stochasticity described by the parameter $\Delta u$. The standard value ($\Delta u = 3$ mV) is given by the solid line; decreased noise ($\Delta u = 1$ mV and $\Delta u = 0.5$ mV) is indicated by dot-dashed and dashed lines, respectively. In the low-noise regime, enhancing a synapse that fires slightly too early can prevent the firing at the desired firing time $t^{des}$ due to refractoriness. To increase the firing rate at $t^{des}$, it is advantageous to decrease the firing probability some time before $t^{des}$. Methods: for each value of $\Delta u$, the initial weight $w_0$ is set such that the spontaneous firing rate is $\bar\rho = 30$ Hz. In all three cases, $\Delta w$ has been multiplied by $\Delta u$ in order to normalize the amplitude of the STDP function. Reset: $\eta_0 = -5$ mV. (B) Scenario Bc: optimal STDP function given by equation 3.13 for a teaching signal of duration $\Delta T = 1$ ms, for σ = 4, 6, and 8 Hz. The maximal increase of the membrane potential after 1 ms of stimulation with the teaching input is $\max_t u^{teach}(t) = 5$ mV. Synaptic efficacies $w_{ij}$ are initialized such that $u_0 = -60$ mV, which gives a spontaneous rate of $\bar\rho = \nu_0 = 5$ Hz. Standard noise level: $\Delta u = 3$ mV.
instantaneous firing rate $\bar\rho(t|\mathbf{x}, t^{des})$ from a reference firing rate $\nu_0$. This can be done by introducing into equation 3.8 a penalty term $P_B$ given by

$$P_B = \exp\left(-\frac{1}{T} \int_0^T \frac{(\bar\rho(t|\mathbf{x}, t^{des}) - \nu_0)^2}{2\sigma^2}\, dt\right). \tag{3.11}$$
For small σ, deviations from the reference rate yield a large penalty. For $\sigma \to \infty$, the penalty term has no influence. The optimality criterion is a combination of a high firing rate $\bar\rho$ at the desired time under the constraint of small deviations from the reference rate $\nu_0$. If we impose the penalty as a multiplicative factor and take as before the logarithm, we get

$$L_{B_c} = \log\left(\bar\rho(t^{des}|\mathbf{x})\, P_B\right). \tag{3.12}$$
Figure 6: Scenario C. N presynaptic neurons are fully connected to M postsynaptic neurons. Each postsynaptic neuron is trained to respond to a specific input pattern and not to respond to M − 1 other patterns, as described by the objective function of equation 3.14.
Hence, the optimal weight adaptation is given by

$$\Delta w_j^{B_c} = \alpha\, \frac{\partial \bar\rho(t^{des}|\mathbf{x})/\partial w_j}{\bar\rho(t^{des}|\mathbf{x})} - \frac{\alpha}{T\sigma^2} \int_0^T \left(\bar\rho(t|\mathbf{x}, t^{des}) - \nu_0\right) \frac{\partial \bar\rho(t|\mathbf{x}, t^{des})}{\partial w_j}\, dt. \tag{3.13}$$
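Given the averaged rate and its weight derivative on a time grid, equation 3.13 is a one-line computation. The sketch below is our own illustration (function and variable names are ours; the rates would in practice come from the averaging of equations 3.6 and 3.10):

```python
import numpy as np

ALPHA = 0.1                # learning rate
SIGMA, NU0 = 0.006, 0.005  # penalty width and reference rate [1/ms] (6 Hz, 5 Hz)
DT, T = 0.5, 200.0         # time step and trial duration [ms]

def dw_Bc(rho_bar, drho_bar, i_des):
    """Equation 3.13: likelihood term at t_des minus the homeostatic penalty.

    rho_bar  : averaged firing intensity in each time bin [1/ms]
    drho_bar : its derivative with respect to the weight w_j, per bin
    i_des    : index of the bin containing the desired firing time t_des
    """
    like = drho_bar[i_des] / rho_bar[i_des]
    penalty = ((rho_bar - NU0) * drho_bar).sum() * DT / (T * SIGMA**2)
    return ALPHA * (like - penalty)
```

When the averaged rate already sits at the reference rate $\nu_0$, the penalty vanishes and only the likelihood term at $t^{des}$ remains; when the rate is too high everywhere, the penalty dominates and the update turns negative.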
Since in scenario B, each presynaptic neuron j fires exactly once at time $t_j = j\,\delta t$ and the postsynaptic neuron is trained to fire at time $t^{des}$, we can interpret the weight adaptation $\Delta w_j^{B_c}$ of equation 3.13 as an STDP function $\Delta w^{B_c}$ that depends on the time difference $\Delta t = t^{pre} - t^{des}$. Figure 5 shows this STDP function for different values of the free parameter σ of equation 3.11. The higher the standard deviation σ, the less effective is the penalty term. In the limit of $\sigma \to \infty$, the penalty term can be ignored, and the situation is identical to that of scenario Bu.

3.3 Scenario C: Pattern Detection

3.3.1 Unconstrained Scenario Cu: Spike Pattern Imposed. This last scenario is a generalization of scenario Ac. Instead of restricting the study to a single pre- and postsynaptic neuron, we consider N presynaptic neurons and M postsynaptic neurons (see Figure 6). The idea is to construct M independent detector neurons. Each detector neuron $i = 1, \ldots, M$ should respond best to a specific prototype stimulus, say $\mathbf{x}^i$, by producing a desired spike train $y^i$, but should not respond to other stimuli, $y^i = 0$, $\forall \mathbf{x}^k$, $k \ne i$ (see Figure 7). The aim is to find a set of synaptic weights that maximizes the probability
Figure 7: Pattern detection after learning. (Top) The left raster plot represents the input pattern the ith neuron has to be sensitive to. Each line corresponds to one of the N = 400 presynaptic neurons. Each dot represents an action potential. The right figure represents one of the patterns the ith neuron should not respond to. (Middle) The left raster plot corresponds to 1000 repetitions of the output of neuron i when the corresponding pattern $\mathbf{x}^i$ is presented. The right plot is the response of neuron i to one of the patterns it should not respond to. (Bottom) The left graph represents the probability density of firing when pattern $\mathbf{x}^i$ is presented. This plot can be seen as the PSTH of the middle graph. Arrows indicate the supervised timing neuron i learned. The right graph describes the probability density of firing when pattern $\mathbf{x}^k$ is presented. Note the different scales of the vertical axes.
that neuron i produces $y^i$ when $\mathbf{x}^i$ is presented and produces no output when $\mathbf{x}^k$, $k \ne i$, is presented. Let the likelihood function $L_{C_u}$ be

$$L_{C_u} = \log\left[\prod_{i=1}^{M} P_i(y^i|\mathbf{x}^i) \prod_{k=1,\, k \ne i}^{M} P_i(0|\mathbf{x}^k)^{\frac{\gamma}{M-1}}\right], \tag{3.14}$$
where Pi (yi |xi ) (see equation 2.6) is the probability that neuron i produces the spike train yi when the stimulus xi is presented. The parameter γ
characterizes the relative importance of the patterns that should not be learned compared to those that should be learned. We get

$$L_{C_u} = \sum_{i=1}^{M} \left[\log(P_i(y^i|\mathbf{x}^i)) + \gamma \left\langle \log(P_i(0|\mathbf{x}^k)) \right\rangle_{\mathbf{x}^k \ne \mathbf{x}^i}\right], \tag{3.15}$$
where the notation $\langle \cdot \rangle_{\mathbf{x}^k \ne \mathbf{x}^i} \equiv \frac{1}{M-1} \sum_{k \ne i}$ means taking the average over all patterns other than $\mathbf{x}^i$. The optimal weight adaptation yields

$$\Delta w_{ij}^{C} = \alpha\, \frac{\partial}{\partial w_{ij}} \log P_i(y^i|\mathbf{x}^i) + \alpha\gamma \left\langle \frac{\partial}{\partial w_{ij}} \log P_i(0|\mathbf{x}^k) \right\rangle_{\mathbf{x}^k \ne \mathbf{x}^i}. \tag{3.16}$$
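Equation 3.16 combines, for each synapse, the gradient on the pattern to be detected with the average gradient on the patterns to be rejected (each individual gradient being given by equation 2.7). A minimal sketch of our own, with hypothetical names and parameter values:

```python
import numpy as np

ALPHA, GAMMA = 0.01, 1.0  # learning rate; importance of the rejected patterns

def batch_update(grad_own, grads_other):
    """Equation 3.16 for a single synapse (i, j).

    grad_own    : d log P_i(y^i | x^i) / dw_ij for the pattern to be detected
    grads_other : d log P_i(0 | x^k) / dw_ij for the M - 1 patterns to be rejected
    """
    return ALPHA * grad_own + ALPHA * GAMMA * float(np.mean(grads_other))
```

A positive gradient on the preferred pattern and near-zero gradients on the others yield potentiation; if the rejected patterns, which impose silence, pull strongly in the negative direction, they can cancel or reverse the update.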
The learning rule of equation 3.16 gives the optimal weight change for each synapse and can be evaluated after presentation of all pre- and postsynaptic spike patterns; it is a “batch” update rule. Since each pre- and postsynaptic neuron emits many spikes in the interval [0, T], we cannot directly interpret the result of equation 3.16 as a function of the time difference $\Delta t = t^{pre} - t^{des}$ as we did in scenario A or B. Ideally, we would like to write the total weight change of the optimal rule given by equation 3.16 as a sum of contributions

$$\Delta w_{ij}^{C} = \sum_{t^{pre} \in x_j^i} \sum_{t^{des} \in y^i} \Delta W^{C_u}(t^{pre} - t^{des}), \tag{3.17}$$
where $\Delta W^{C_u}(t^{pre} - t^{des})$ is an STDP function and the summation runs over all pairs of pre- and postsynaptic spikes. The number of pairs of pre- and postsynaptic spikes with a given time shift is given by the correlation function, which is best defined in discrete time. We assume time steps of duration $\delta t = 0.5$ ms. Since the correlation will depend on the presynaptic neuron j and the postsynaptic neuron i under consideration, we introduce a new index, $k = N(i - 1) + j$. We define the correlation in discrete time by its matrix elements $C_{k\ell}$ that describe the correlation between the presynaptic spike train $X_j^i(t)$ and the postsynaptic spike train $Y^i(t - T_0 + \ell\,\delta t)$. For example, $C_{3\ell} = 7$ implies that seven spike pairs of presynaptic neuron j = 3 with postsynaptic neuron i = 1 have a relative time shift of $T_0 - \ell\,\delta t$. With this definition, we can rewrite equation 3.17 in vector notation (see section C.1 for more details) as

$$\Delta w^C \stackrel{!}{=} C\, \Delta W^{C_u}, \tag{3.18}$$
Optimal STDP in Supervised Learning

[Figure 8 appears here: panels A and B plot \Delta W^{C_u} and \Delta W^{C_c} against t^{pre} - t^{des} [ms]; panel B shows curves for a = 0.04 and a = 0.4.]
Figure 8: (A) Optimal weight change for scenario C_u. In this case, no locality constraint is imposed, and the result is similar to the STDP function of scenario A_c (with \eta_0 = 0 and u^{teach}(t) = 0) represented in Figure 3. (B) Optimal weight change for scenario C_c as a function of the locality constraint characterized by a. The stronger the importance of the locality constraint, the narrower is the spike-spike interaction. For A and B, M = 20, \eta_0 = -5 mV. The initial weights w_{ij} are chosen so that the spontaneous firing rate matches the imposed firing rate.

where \Delta w^C = (\Delta w_{11}^C, \ldots, \Delta w_{1N}^C, \Delta w_{21}^C, \ldots, \Delta w_{MN}^C)^T is the vector containing all the optimal weight changes given by equation 3.16 and W^{C_u} is the vector containing the discretized STDP function, with components W_\ell^{C_u} = W^{C_u}(-T_0 + \ell\delta t) for 1 \le \ell \le 2\tilde{T}_0, where \tilde{T}_0 = T_0/\delta t. In particular, the center of the STDP function (t^{pre} = t^{des}) corresponds to the index \ell = \tilde{T}_0.
The symbol \stackrel{!}{=} expresses the fact that we want to find W^{C_u} such that \Delta w^C is as close as possible to C W^{C_u}. By taking the pseudo-inverse C^+ = (C^T C)^{-1} C^T of C, we can invert equation 3.18 and get

W^{C_u} = C^+ \Delta w^C.    (3.19)
The resulting STDP function is plotted in Figure 8A. As was the case for scenario A_u, the STDP function exhibits a negative offset. In addition to the fact that the postsynaptic neuron i should not fire at times other than the ones given by y^i, it should also not fire whenever a pattern x^k, k \neq i, is presented. The presence of the negative offset is due to those two factors. 3.3.2 Constrained Scenario C_c: Temporal Locality. In the previous paragraph, we obtained an STDP function with a negative offset. This negative offset does not seem realistic because it implies that the STDP function is not localized in time. In order to impose temporal locality (finite memory
[Figure 9 appears here: panel A plots \Delta W^{C_c} for M = 20, 60, 100; panel B plots \Delta W^{C_c} for w = 0.5 w_0, w_0, 1.5 w_0; both against t^{pre} - t^{des} [ms].]
Figure 9: (A) Optimal STDP function as a function of the number of input patterns M. (a = 0.04, N = 400). (B) Optimal weight change as a function of the weight w. If the weights are small (dashed line), potentiation dominates, whereas if they are big (dotted line), depression dominates.
span of the learning rule), we modify equation 3.19 in the following way (see section C.2 for more details):

W^{C_c} = (C^T C + P)^{-1} C^T \Delta w^C,    (3.20)

where P is a diagonal matrix that penalizes nonlocal terms. In this article, we take a quadratic suppression of terms that are nonlocal in time. With respect to a postsynaptic spike at t^{des}, the penalty term is proportional to (t - t^{des})^2. In matrix notation, and using our convention that the postsynaptic spike corresponds to \ell = \tilde{T}_0, we have

P_{\ell\ell'} = a \left( \ell - \tilde{T}_0 \right)^2 \delta_{\ell\ell'}.    (3.21)
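The locality-penalized solution of equations 3.20 and 3.21 changes only the matrix that is inverted. A sketch on the same kind of placeholder data; the value of a and the bin counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n_syn, n_bins = 50, 40
C = rng.poisson(2.0, size=(n_syn, n_bins)).astype(float)  # placeholder counts
dw = rng.normal(size=n_syn)                               # placeholder dw

T0_idx = n_bins // 2                  # index where t_pre - t_des = 0
a = 0.04                              # strength of the locality constraint
# Diagonal penalty P_ll = a * (l - T0_idx)^2  (equation 3.21)
P = np.diag(a * (np.arange(n_bins) - T0_idx) ** 2)

# Regularized least-squares solution (equation 3.20)
W_Cc = np.linalg.solve(C.T @ C + P, C.T @ dw)
```

Larger a suppresses long time lags more strongly and narrows the resulting STDP function, as in Figure 8B.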
The resulting STDP functions for different values of a are plotted in Figure 8B. The higher the parameter a, the more nonlocal terms are penalized and the narrower is the STDP function. Figure 9A shows the STDP functions for various numbers of input patterns M. No significant change can be observed for different numbers of input patterns M. This is due to the appropriately chosen normalization factor 1/(M - 1) in the exponent of equation 3.14. The target spike trains y^i have a certain number of spikes during the time window T; they set a target value for the mean rate. Let \nu^{post} = \frac{1}{TM} \sum_{i=1}^{M} \int_0^T Y^i(t)\,dt be the imposed firing rate. Let w_0 denote the amplitude of the synaptic strength such that the firing rate \bar{\rho}_{w_0} given by those weights is identical to the imposed firing rate: \bar{\rho}_{w_0} = \nu^{post}. If the actual weights are
[Figure 10 appears here: scatter plots of \Delta w^{rec} against \Delta w^{opt} for a = 0, a = 0.04, and a = 0.4.]
Figure 10: Correlation plot between the optimal synaptic weight change \Delta w^{opt} = \Delta w^{C_u} and the reconstructed weight change \Delta w^{rec} = C W^{C_c} using the temporal locality constraint. (A) No locality constraint; a = 0. Deviations from the diagonal are due to the fact that the optimal weight change given by equation 3.16 cannot be perfectly accounted for by a sum of pair effects. The mean deviations are given by equation C.7. (B) A weak locality constraint (a = 0.04) hardly changes the quality of the weight change reconstruction. (C) Strong locality constraint (a = 0.4). The horizontal lines arise because most synapses are subject to a few strong updates induced by pairs of pre- and postsynaptic spike times with small time shifts.
smaller than w_0, almost all the weights should increase, whereas if they are bigger than w_0, depression should dominate (see Figure 9B). Thus, the exact form of the optimal STDP function depends on the initial weight value w_0. Alternatively, a homeostatic process could ensure that the mean weight value is always in the appropriate regime. In equations 3.17 and 3.18, we imposed that the total weight change should be generated as a sum over pairs of pre- and postsynaptic spikes. This is an assumption that has been made in order to establish a link to standard STDP theory and experiments, where spike pairs have been at the center of interest (Gerstner et al., 1996; Kempter et al., 1999; Kistler & van Hemmen, 2000; Markram et al., 1997; Bi & Poo, 1998; Zhang et al., 1998). It is, however, clear by now that the timing of spike pairs is only one of several factors contributing to synaptic plasticity. We therefore asked how much we miss if we attribute the optimal weight changes calculated in equation 3.16 to spike pair effects only. To answer this question, we compared the optimal weight change \Delta w_{ij}^C from equation 3.16 with that derived from the pair-based STDP rule \Delta w_{ij}^{rec} = \sum_{t^{pre} \in x_j^i} \sum_{t^{des} \in y^i} W^{C_c}(t^{pre} - t^{des}), with or without locality constraint, that is, for different values of the locality parameter (a = 0, 0.04, 0.4) (see Figure 10). More precisely, we simulate M = 20 detector neurons, each having N = 400 presynaptic inputs, so each subplot of Figure 10 contains 8000 points. Each point in a graph corresponds to the optimal change of one weight for one detector neuron (x-axis) compared to the change of the same weight due to pair-based STDP
(y-axis). We found that in the absence of a locality constraint, the pair-wise contributions are well correlated with the optimal weight changes. With strong locality constraints, the quality of the correlation drops significantly. However, for a weak locality constraint that corresponds to an STDP function with reasonable potentiation and depression regimes, the correlation of the pair-based STDP rule with the optimal update is still good. This suggests that synaptic updates with an STDP function based on pairs of pre- and postsynaptic spikes are close to optimal in the pattern detection paradigm.
4 Discussion 4.1 Supervised versus Unsupervised and Reinforcement Learning. Our approach is based on the maximization of the probability of firing at desired times t^{des}, with or without constraints. From the point of view of machine learning, this is a supervised learning paradigm implemented as a maximum likelihood approach, using the spike response model with escape noise as a generative model. Our work can be seen as a continuous-time extension of the maximum likelihood approach proposed in Barber (2003). The starting point of all supervised paradigms is the comparison of a desired output with the actual output a neuron has, or would have, generated. The difference between the desired and actual output is then used as the driving signal for synaptic updates in typical model approaches (Minsky & Papert, 1969; Haykin, 1994; Bishop, 1995). How does this compare to experimental approaches? Experiments focusing on STDP have mostly been performed in vitro (Markram et al., 1997; Magee & Johnston, 1997; Bi & Poo, 1998). Since in typical experimental paradigms firing of the postsynaptic neuron is enforced by strong pulses of current injection, the neuron is not in a natural unsupervised setting; but the situation is also not fully supervised, since there is never a conflict between the desired and actual output of a neuron. In one of the rare in vivo experiments on STDP (Frégnac, Shulz, Thorpe, & Bienenstock, 1988, 1992), the spikes of the postsynaptic neuron are also imposed by current injection. Thus, a classification of STDP experiments in terms of supervised, unsupervised, or reward based is not as clear-cut as it may seem at first glance. From the point of view of neuroscience, paradigms of unsupervised or reinforcement learning are probably much more relevant than the supervised scenario discussed here.
However, most of our results from the supervised scenario analyzed in this article can be reinterpreted in the context of reinforcement learning following the approach proposed by Xie and Seung (2004). To illustrate the link between reinforcement learning and supervised learning, we define a global reinforcement signal R(x, y) that depends on the spike timing of the presynaptic neurons x and the postsynaptic neuron y.
The quantity optimized in reinforcement learning is the expected reward \langle R \rangle_{x,y}, averaged over all pre- and postsynaptic spike trains:

\langle R \rangle_{x,y} = \sum_{x,y} R(x, y)\, P(y|x)\, P(x).    (4.1)
If the goal of learning is to maximize the expected reward, we can define a learning rule that achieves this goal by changing synaptic efficacies in the direction of the gradient of the expected reward \langle R \rangle_{x,y}:

\Delta w = \alpha \left\langle R(x, y)\, \frac{\partial \log P(y|x)}{\partial w} \right\rangle_{x,y},    (4.2)

where \alpha is a learning parameter and \frac{\partial \log P(y|x)}{\partial w} is the quantity we discussed in this article. Thus, the quantities optimized in our supervised paradigm reappear naturally in a reinforcement learning paradigm. For an intuitive interpretation of the link between reinforcement learning and supervised learning, consider a postsynaptic spike that (spontaneously) occurred at time t_0. If no reward is given, no synaptic change takes place. However, if the postsynaptic spike at t_0 is linked to a rewarding situation, the synapse will try to recreate in the next trial a spike at the same time; that is, t_0 has the role of the desired firing time t^{des} introduced in this article. Thus, the STDP function with respect to a postsynaptic spike at t^{des} derived in this article can be seen as the spike timing dependence that maximizes the expected reward in a spike-based reinforcement learning paradigm. 4.2 Interpretation of STDP Function. Let us now summarize and discuss our results in a broader context. In all three scenarios, we found an STDP function with potentiation for pre-before-post timing. Thus, this result is structurally stable and independent of model details. However, depression for post-before-pre timing does depend on model details. In scenario A, we saw that the behavior of the post-before-pre region is determined by the spike afterpotential (see Table 2 for a summary of the results for the three scenarios). In the presence of a teaching input and firing rate constraints, a weak reset of the membrane potential after the spike means that the neuron effectively has a depolarizing spike afterpotential (DAP). In experiments, DAPs have been observed by Feldman (2000), Markram et al. (1997), and Bi and Poo (1998) for strong presynaptic input. Other studies have shown that the level of depression does not depend on the postsynaptic membrane potential (Sjöström, Turrigiano, & Nelson, 2001).
In any case, a weak reset (i.e., to a value below threshold rather than to the resting potential) is consistent with the findings of other researchers who used integrate-and-fire models to account for the high coefficient of
Table 2: Main Results for Each Scenario.

Unconstrained Scenarios
  A_u — pre-before-post: LTP ∼ EPSP
  B_u — pre-before-post: LTP/LTD ∼ reverse correlation
  C_u — pre-before-post: LTP ∼ EPSP, LTD ∼ background patterns

Constrained Scenarios
  A_c — post-before-pre: LTD (or LTP) ∼ spike afterpotential
  B_c — post-before-pre: LTD ∼ increased firing rate
  C_c — post-before-pre: LTD ∼ background patterns ∼ temporal locality
variation of spike trains in vivo (Bugmann et al., 1997; Troyer & Miller, 1997). In the presence of spontaneous activity (scenario B), a constraint on the spontaneous firing rate causes the optimal weight change to elicit depression of presynaptic spikes that arrive immediately after the postsynaptic one. In fact, the reason for the presence of depression in scenario B_c is directly related to the presence of a DAP caused by the strong teaching stimulus. In both scenarios A and B, depression occurs in order to compensate for the increased firing probability due to the DAP. In scenario C, we showed that the best way to adapt the weights (in a task where the postsynaptic neuron has to detect a specific input pattern among others) can be described by an STDP function. This task is similar to the one in Izhikevich (2003) in the sense that a neuron is designed to be sensitive to a specific input pattern, but different in that our work does not assume any axonal delays. The depression part in this scenario arises from a locality constraint: we impose that weight changes are explained by a sum of pair-based STDP functions. There are various ways of defining objective functions, and we have used three different objective functions in this article. The formulation of an objective function gives a mathematical expression of the functional role we assign to a neuron. The functional role depends on the type of coding (temporal coding or rate coding) and hence on the information the postsynaptic neurons will read out. The functional role also depends on the task or context in which a neuron is embedded. It might seem that different tasks and coding schemes could thus give rise to a huge number of objective functions.
However, the reinterpretation of our approach in the context of reinforcement learning provides a unifying viewpoint: even if the functional role of some neurons in a specific region of the brain can be different from other neurons of a different region, it is still possible to see the different objective functions as different instantiations of the same underlying concept: the maximization of the reward, where the reward is task specific. More specifically, all objective functions used in this letter maximized the firing probability at a desired firing time t des , reflecting the fact that
in the framework of timing-based codes, the task of a neuron is to fire at precise moments in time. With a different assumption on the neuron's role in signal processing, different objective functions need to be used. An extreme case is a situation where the neuron's task is to avoid firing at time t^{des}. A good illustration is given by the experiments done in the electrosensory lobe (ELL) of the electric fish (Bell et al., 1997). These cells receive two sets of input: the first one contains the pulses coming from the electric organ, and the second input conveys information about the sensory stimulus. Since a large fraction of the sensory stimulus can be predicted from the information coming from the electric organ, it is computationally interesting to subtract the predictable contribution and focus on only the unpredictable part of the sensory stimulus. In this context, a reasonable task would be to ask the neuron not to fire at time t^{des}, where t^{des} is the time at which the predictable stimulation arrives, and this task could be defined indirectly by an appropriate reward signal. An objective function of this type would, in the end, reverse the sign of the weight change of the causal part (LTD for the pre-before-post region), and this is precisely what is seen experimentally (Bell et al., 1997). In our framework, the definition of the objective function is closely related to the neuronal coding. In scenario C, we postulate that neurons emit a precise spike train whenever the "correct" input is presented and are silent otherwise. This coding scheme is clearly not the most efficient one. Another possibility is to require postsynaptic neurons to produce a specific but different spike train for each input pattern, and not only for the "correct" input. Such a modification of the scenario does not dramatically change the results. The only effect is to reduce the amount of depression and increase the amount of potentiation.
4.3 Optimality Approaches versus Mechanistic Models. Theoretical approaches to neurophysiological phenomena in general, and to synaptic plasticity in particular, can be roughly grouped into three categories: biophysical models that aim at explaining the STDP function from principles of ion channel dynamics and intracellular processes (Senn, Tsodyks, & Markram, 2001; Shouval, Bear, & Cooper, 2002; Abarbanel, Huerta, & Rabinovich, 2002; Karmarkar & Buonomano, 2002); mathematical models that start from a given STDP function and analyze computational principles such as intrinsic normalization of summed efficacies or sensitivity to correlations in the input (Kempter et al., 1999; Roberts, 1999; Roberts & Bell, 2000; van Rossum et al., 2000; Kistler & van Hemmen, 2000; Song et al., 2000; Song & Abbott, 2001; Kempter et al., 2001; Gütig, Aharonov, Rotter, & Sompolinsky, 2003); and models that derive "optimal" STDP properties for a given computational task (Chechik, 2003; Dayan & Häusser, 2004; Hopfield & Brody, 2004; Bohte & Mozer, 2005; Bell & Parra, 2005; Toyoizumi, Pfister, Aihara, & Gerstner, 2005a, 2005b). Optimizing the likelihood of postsynaptic firing in a predefined interval, as we did in this letter, is only one possibility
among others of introducing concepts of optimality (Barlow, 1961; Atick & Redlich, 1990; Bell & Sejnowski, 1995) into the field of STDP. Chechik (2003) uses concepts from information theory but restricts his study to the classification of stationary patterns. The paradigm considered in Bohte and Mozer (2005) is similar to our scenario B_c in that they use a fairly strong teaching input to make the postsynaptic neuron fire. Bell and Parra (2005) and Toyoizumi et al. (2005a) also use concepts from information theory, but they apply them to the pre- and postsynaptic spike trains. The work of Toyoizumi et al. (2005a) is a clear-cut unsupervised learning paradigm and hence distinct from our approach. Dayan and Häusser (2004) use concepts of optimal filter theory but are not interested in precise firing of the postsynaptic neuron. The work of Hopfield and Brody (2004) is similar to our approach in that it focuses on recognition of temporal input patterns, but we are also interested in triggering postsynaptic firing with precise timing. Hopfield and Brody emphasize the repair of disrupted synapses in a network that has previously acquired its function as a temporal pattern detector. Optimality approaches such as ours will never be able to make strict predictions about the properties of neurons or synapses. Optimality criteria may, however, help to elucidate computational principles and provide insights into potential tasks of electrophysiological phenomena such as STDP.

Appendix A: Probability Density of a Spike Train

The probability density of generating a spike train y_t = \{t_i^1, t_i^2, \ldots, t_i^F < t\} with the stochastic process defined by equation 2.5 can be expressed as follows,

P(y_t) = P(t_i^1, \ldots, t_i^F)\, R(t|y_t),    (A.1)
where P(t_i^1, \ldots, t_i^F) is the probability density of having F spikes at times t_i^1, \ldots, t_i^F and R(t|y_t) = \exp\left(-\int_{t_i^F}^{t} \rho(t'|y_{t'})\,dt'\right) corresponds to the probability of having no spikes from t_i^F to t. Since the joint probability P(t_i^1, \ldots, t_i^F) can be expressed as a product of conditional probabilities,

P(t_i^1, \ldots, t_i^F) = P(t_i^1) \prod_{f=2}^{F} P\left(t_i^f \mid t_i^{f-1}, \ldots, t_i^1\right),    (A.2)

equation A.1 becomes

P(y_t) = \rho(t_i^1|y_{t_i^1}) \exp\left(-\int_0^{t_i^1} \rho(t'|y_{t'})\,dt'\right) \cdot \prod_{f=2}^{F} \rho(t_i^f|y_{t_i^f}) \exp\left(-\int_{t_i^{f-1}}^{t_i^f} \rho(t'|y_{t'})\,dt'\right) \cdot \exp\left(-\int_{t_i^F}^{t} \rho(t'|y_{t'})\,dt'\right)
       = \prod_{t_i^f \in y_t} \rho(t_i^f|y_{t_i^f}) \exp\left(-\int_0^{t} \rho(t'|y_{t'})\,dt'\right).    (A.3)
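In numerical work one uses the logarithm of equation A.3: the log density of a spike train is the sum of the log rates at the spike times minus the integrated rate. A discrete-time sketch; the function name and grid settings are illustrative:

```python
import numpy as np

def spike_train_loglik(rho, spike_bins, dt=0.5e-3):
    """log P(y_t) from equation A.3 for a rate trace sampled on a time grid.

    rho        : (T,) samples of the conditional rate rho(t'|y_t') in each bin
    spike_bins : integer indices of the bins containing postsynaptic spikes
    """
    # sum_f log rho(t_f)  -  integral_0^t rho(t') dt'
    return np.sum(np.log(rho[spike_bins])) - np.sum(rho) * dt
```

For a constant rate r this reduces to F log r - r t, the familiar homogeneous-Poisson log density.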
Appendix B: Numerical Evaluation of \bar{\rho}(t)

Since it is impossible to numerically evaluate the instantaneous firing rate \bar{\rho}(t) with the analytical expression given by equation 3.6, we have to do it in a different way. In fact, there are two ways to evaluate \bar{\rho}(t). Before going into the details, let us first recall that, by the law of large numbers, the instantaneous firing rate is equal to the empirical density of spikes at time t,

\langle \rho(t|y_t) \rangle_{y_t} = \langle Y(t) \rangle_{Y(t)},    (B.1)

where Y(t) = \sum_{t_i^f \in y_t} \delta(t - t_i^f) is one realization of the postsynaptic spike train. Thus, the first and simpler method, based on the right-hand side of equation B.1, is to build a PSTH by counting spikes in small time bins [t, t + \delta t] over, say, K = 10,000 repetitions of an experiment. The second, and more advanced, method consists in evaluating the left-hand side of equation B.1 by Monte Carlo sampling. Instead of averaging over all possible spike trains y_t, we generate K = 10,000 spike trains by repetition of the same stimulus. A specific spike train y_t = \{t_i^1, t_i^2, \ldots, t_i^F < t\} will automatically appear with the appropriate probability given by equation 2.6. The Monte Carlo estimate \tilde{\rho}(t) of \bar{\rho}(t) can be written as

\tilde{\rho}(t) = \frac{1}{K} \sum_{m=1}^{K} \rho(t|y_t^m),    (B.2)
where y_t^m is the mth spike train generated by the stochastic process given by equation 2.5. Since we use the analytical expression of \rho(t|y_t^m), we will call equation B.2 a semianalytical estimate. Let us note that the semianalytical estimate \tilde{\rho}(t) converges more rapidly to the true value \bar{\rho}(t) than the empirical estimate based on the PSTH. In the limit of a Poisson process, \eta_0 = 0, the semianalytical estimate \tilde{\rho}(t) given by equation B.2 is equal to the analytical expression of equation 3.6, since the instantaneous firing rate \rho of a Poisson process is independent of the firing history y_t = \{t_i^1, t_i^2, \ldots, t_i^F < t\} of the postsynaptic neuron.
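The two estimators of appendix B can be compared on a simple inhomogeneous Poisson neuron (\eta_0 = 0), where \rho(t|y_t) does not depend on the firing history; all parameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
dt, T = 0.5e-3, 0.2                              # 0.5 ms bins, 200 ms window
t = np.arange(0.0, T, dt)
rho = 20.0 + 15.0 * np.sin(2 * np.pi * 10 * t)   # imposed rate rho(t) in Hz

# K repetitions of the "experiment" (Poisson case, eta_0 = 0)
K = 10_000
spikes = rng.random((K, t.size)) < rho * dt

# Method 1: PSTH, the empirical spike density (right-hand side of eq. B.1)
rho_psth = spikes.mean(axis=0) / dt

# Method 2: semianalytical estimate (eq. B.2). For a Poisson process
# rho(t | y_t^m) = rho(t) for every realization m, so the average is exact,
# which illustrates why this estimator converges faster than the PSTH.
rho_semi = np.mean(np.broadcast_to(rho, (K, t.size)), axis=0)
```

The PSTH fluctuates around rho with a sampling error that shrinks only as 1/sqrt(K), while the semianalytical estimate is exact here.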
Appendix C: Deconvolution

C.1 Deconvolution for Spike Pairs. With a learning rule such as equation 3.16, we know the optimal weight change \Delta w_{ij} for each synapse, but we still do not know the corresponding STDP function. Let us first define the correlation function c_k(\tau), k = N(i-1) + j, between the presynaptic spike train X_j^i(t) = \sum_{t^{pre} \in x_j^i} \delta(t - t^{pre}) and the postsynaptic spike train Y^i(t) = \sum_{t^{des} \in y^i} \delta(t - t^{des}),

c_k(\tau) = \int_0^T X_j^i(s)\, Y^i(s + \tau)\, ds, \qquad k = 1, \ldots, NM,    (C.1)

where we allow a range -T_0 \le \tau \le T_0, with T_0 \ll T. Since the sum of the pair-based weight changes W should be equal to the total adaptation of the weights \Delta w_k, we can write

\int_{-T_0}^{T_0} c_k(s)\, W(s)\, ds \stackrel{!}{=} \Delta w_k, \qquad k = 1, \ldots, NM.    (C.2)
If we want to express equation C.1 in matrix form, we need to discretize time in small bins \delta t and define the matrix elements

C_{k\ell} = \int_{\ell\delta t - T_0}^{(\ell+1)\delta t - T_0} c_k(s)\, ds.    (C.3)
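Since c_k(\tau) is a sum of products of delta functions, the matrix elements of equation C.3 are simply counts of (pre, post) spike pairs per time-lag bin, so one row of C can be built with a histogram of spike-time differences. A sketch; the spike times and bin settings are illustrative, and the Python index \ell - 1 corresponds to the paper's bin \ell:

```python
import numpy as np

def correlation_row(t_pre, t_des, T0=0.05, dt=0.5e-3):
    """One row C_k of the correlation matrix in equation C.3: the number of
    (pre, post) spike pairs whose shift t_des - t_pre falls in each time-lag
    bin of [-T0, T0)."""
    shifts = np.subtract.outer(t_des, t_pre).ravel()   # all pairwise t_des - t_pre
    edges = -T0 + dt * np.arange(int(round(2 * T0 / dt)) + 1)
    counts, _ = np.histogram(shifts, bins=edges)
    return counts
```

Stacking such rows for all NM (i, j) pairs yields the full NM x 2*T0/dt matrix C.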
Now equation C.2 becomes

\Delta w \stackrel{!}{=} C\, W,    (C.4)
where \Delta w = (\Delta w_{11}, \ldots, \Delta w_{1N}, \Delta w_{21}, \ldots, \Delta w_{MN})^T is the vector containing all the optimal weight changes and W is the vector containing the discretized STDP function: W_\ell = W(-T_0 + \ell\delta t), for \ell = 1, \ldots, 2\tilde{T}_0 with \tilde{T}_0 = T_0/\delta t. In order to solve this matrix equation, we have to compute the inverse of the nonsquare NM \times 2\tilde{T}_0 matrix C, which is known as the Moore-Penrose inverse (or pseudo-inverse),

C^+ = (C^T C)^{-1} C^T,    (C.5)
which exists only if (C^T C)^{-1} exists. In fact, the solution given by

W = C^+ \Delta w    (C.6)
minimizes the square distance

D = \frac{1}{2} (C W - \Delta w)^2.    (C.7)
C.2 Temporal Locality Constraint. If we want to impose a constraint of locality, we can add a term to the minimization of equation C.7 and define

E = D + \frac{1}{2} W^T P W,    (C.8)
where P is a diagonal matrix that penalizes nonlocal terms. In this article, we take a quadratic suppression of terms that are nonlocal in time:

P_{\ell\ell'} = a \left( \ell - \tilde{T}_0 \right)^2 \delta_{\ell\ell'}.    (C.9)
\tilde{T}_0 corresponds to the index \ell of the vector W in equations C.4 and C.8 for which t^{pre} - t^{des} = 0. Calculating the gradient of E given by equation C.8 with respect to W yields

\nabla_W E = C^T (C W - \Delta w) + P W.    (C.10)
By looking at the minimal value of E, that is, \nabla_W E = 0, we have

W = (C^T C + P)^{-1} C^T \Delta w.    (C.11)
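The closing remark, that a = 0 recovers the unconstrained pseudo-inverse solution, is easy to verify numerically on placeholder data:

```python
import numpy as np

rng = np.random.default_rng(4)
C = rng.poisson(2.0, size=(50, 40)).astype(float)   # placeholder counts
dw = rng.normal(size=50)                            # placeholder dw

W_pinv = np.linalg.pinv(C) @ dw                     # equation C.6
a = 0.0
l = np.arange(40)
P = np.diag(a * (l - 20) ** 2)                      # equation C.9 with a = 0
W_pen = np.linalg.solve(C.T @ C + P, C.T @ dw)      # equation C.11
```

With a = 0 the penalty matrix vanishes and both expressions reduce to the same least-squares solution.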
By setting a = 0, we recover the previous case.

Acknowledgments

This work was supported by the Swiss National Science Foundation (200020-103530/1 and 200020-108093/1). T.T. was supported by the Research Fellowships of the Japan Society for the Promotion of Science for Young Scientists and a Grant-in-Aid for JSPS Fellows.

References

Abarbanel, H., Huerta, R., & Rabinovich, M. (2002). Dynamical model of long-term synaptic plasticity. Proc. Natl. Academy of Sci. USA, 59, 10137–10143.
Atick, J., & Redlich, A. (1990). Towards a theory of early visual processing. Neural Computation, 4, 559–572.
Barber, D. (2003). Learning in spiking neural assemblies. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 149–156). Cambridge, MA: MIT Press.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Bell, A., & Sejnowski, T. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, A. J., & Parra, L. C. (2005). Maximising sensitivity in a spiking network. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 121–128). Cambridge, MA: MIT Press.
Bell, C., Han, V., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387, 278–281.
Bi, G., & Poo, M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bi, G., & Poo, M. (1999). Distributed synaptic modification in neural networks induced by patterned stimulation. Nature, 401, 792–796.
Bi, G., & Poo, M. (2001). Synaptic modification of correlated activity: Hebb's postulate revisited. Ann. Rev. Neurosci., 24, 139–166.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Bohte, S. M., & Mozer, M. C. (2005). Reducing spike train variability: A computational theory of spike-timing dependent plasticity. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 201–208). Cambridge, MA: MIT Press.
Brody, C., & Hopfield, J. (2003). Simple networks for spike-timing-based computation, with application to olfactory processing. Neuron, 37, 843–852.
Bugmann, G., Christodoulou, C., & Taylor, J. G. (1997). Role of temporal integration and fluctuation detection in the highly irregular firing of a leaky integrator neuron model with partial reset. Neural Computation, 9, 985–1000.
Carr, C. E., & Konishi, M. (1990). A circuit for detection of interaural time differences in the brain stem of the barn owl. J. Neurosci., 10, 3227–3246.
Chechik, G. (2003). Spike-timing-dependent plasticity and relevant mutual information maximization. Neural Computation, 15, 1481–1510.
Dayan, P., & Häusser, M. (2004). Plasticity kernels and temporal statistics. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Feldman, D. (2000). Timing-based LTP and LTD and vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Frégnac, Y., Shulz, D. E., Thorpe, S., & Bienenstock, E. (1988). A cellular analogue of visual cortical plasticity. Nature, 333(6171), 367–370.
Frégnac, Y., Shulz, D. E., Thorpe, S., & Bienenstock, E. (1992). Cellular analogs of visual cortical epigenesis. I: Plasticity of orientation selectivity. Journal of Neuroscience, 12(4), 1280–1300.
Gerstner, W. (2001). Coding properties of spiking neurons: Reverse- and cross-correlations. Neural Networks, 14, 599–610.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gerstner, W., & Kistler, W. K. (2002a). Mathematical formulations of Hebbian learning. Biological Cybernetics, 87, 404–415.
Gerstner, W., & Kistler, W. K. (2002b). Spiking neuron models. Cambridge: Cambridge University Press.
Gerstner, W., Ritz, R., & van Hemmen, J. L. (1993). Why spikes? Hebbian learning and retrieval of time-resolved excitation patterns. Biol. Cybern., 69, 503–515.
Gütig, R., Aharonov, R., Rotter, S., & Sompolinsky, H. (2003). Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. J. Neuroscience, 23, 3697–3714.
Haykin, S. (1994). Neural networks. Upper Saddle River, NJ: Prentice Hall.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Hopfield, J. J., & Brody, C. D. (2004). Learning rules and network repair in spike-timing-based computation networks. Proc. Natl. Acad. Sci. USA, 101, 337–342.
Izhikevich, E. (2003). Simple model of spiking neurons. IEEE Transactions on Neural Networks, 14, 1569–1572.
Johansson, R., & Birznieks, I. (2004). First spikes in ensembles of human tactile afferents code complex spatial fingertip events. Nature Neuroscience, 7, 170–177.
Karmarkar, U., & Buonomano, D. (2002). A model of spike-timing dependent plasticity: One or two coincidence detectors. J. Neurophysiology, 88, 507–513.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59, 4498–4514.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Computation, 13, 2709–2741.
Kistler, W. M., & van Hemmen, J. L. (2000). Modeling synaptic plasticity in conjunction with the timing of pre- and postsynaptic potentials. Neural Comput., 12, 385–405.
Legenstein, R., Naeger, C., & Maass, W. (2005). What target functions can be learnt with spike-timing-dependent plasticity? Neural Computation, 17, 2337–2382.
Magee, J. C., & Johnston, D. (1997). A synaptically controlled associative signal for Hebbian plasticity in hippocampal neurons. Science, 275, 209–213.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic AP and EPSP. Science, 275, 213–215.
Mehta, M. R., Lee, A. K., & Wilson, M. A. (2002). Role of experience and oscillations in transforming a rate code into a temporal code. Nature, 417, 741–746.
Minsky, M. L., & Papert, S. A. (1969). Perceptrons. Cambridge, MA: MIT Press.
Panzeri, S., Peterson, R., Schultz, S., Lebedev, & Diamond, M. (2001). The role of spike timing in the coding of stimulus location in rat somatosensory cortex. Neuron, 29, 769–777.
Poliakov, A. V., Powers, R. K., & Binder, M. C. (1997). Functional identification of the input-output transforms of motoneurones in the rat and cat. J. Physiology, 504, 401–424.
Rao, R. P. N., & Sejnowski, T. J. (2001). Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Computation, 13, 2221–2237.
Roberts, P. (1999). Computational consequences of temporally asymmetric learning rules: I. Differential Hebbian learning. J. Computational Neuroscience, 7, 235–246.
Roberts, P., & Bell, C. (2000). Computational consequences of temporally asymmetric learning rules: II. Sensory image cancellation. Computational Neuroscience, 9, 67–83.
1348
J.-P. Pfister, T. Toyoizumi, D. Barber, and W. Gerstner
Rubin, J., Lee, D. D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Physical Review Letters, 86, 364–367. Senn, W., Tsodyks, M., & Markram, H. (2001). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Computation, 13, 35–67. Seung, S. (2003). Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron, 40, 1063–1073. Shouval, H. Z., Bear, M. F., & Cooper, L. N. (2002). A unified model of NMDA receptor dependent bidirectional synaptic plasticity. Proc. Natl. Acad. Sci. USA, 99, 10831–10836. Sjöström, P., Turrigiano, G., & Nelson, S. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164. Song, S., & Abbott, L. (2001). Column and map development and cortical re-mapping through spike-timing dependent plasticity. Neuron, 32, 339–350. Song, S., Miller, K., & Abbott, L. (2000). Competitive Hebbian learning through spike-time-dependent synaptic plasticity. Nature Neuroscience, 3, 919–926. Thorpe, S., Delorme, A., & Van Rullen, R. (2001). Spike-based strategies for rapid processing. Neural Networks, 14, 715–725. Toyoizumi, T., Pfister, J.-P., Aihara, K., & Gerstner, W. (2005a). Spike-timing dependent plasticity and mutual information maximization for a spiking neuron model. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1409–1416). Cambridge, MA: MIT Press. Toyoizumi, T., Pfister, J.-P., Aihara, K., & Gerstner, W. (2005b). Generalized Bienenstock-Cooper-Munro rule for spiking neurons that maximizes information transmission. Proc. Natl. Acad. Sci. USA, 102, 5239–5244. Troyer, T. W., & Miller, K. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Computation, 9, 971–983. Turrigiano, G., & Nelson, S. (2004).
Homeostatic plasticity in the developing nervous system. Nature Reviews Neuroscience, 5, 97–107. van Rossum, M. C. W., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neuroscience, 20, 8812–8821. Xie, X., & Seung, S. (2004). Learning in neural networks by reinforcement of irregular spiking. Phys. Rev. E, 69, 041909. Zhang, L., Tao, H., Holt, C., Harris, W. A., & Poo, M.-M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.
Received August 10, 2004; accepted September 8, 2005.
LETTER
Communicated by Peter Latham
When Response Variability Increases Neural Network Robustness to Synaptic Noise

Gleb Basalyga
[email protected]
Emilio Salinas
[email protected]
Department of Neurobiology and Anatomy, Wake Forest University School of Medicine, Winston-Salem, NC 27157-1010, U.S.A.
Cortical sensory neurons are known to be highly variable, in the sense that responses evoked by identical stimuli often change dramatically from trial to trial. The origin of this variability is uncertain, but it is usually interpreted as detrimental noise that reduces the computational accuracy of neural circuits. Here we investigate the possibility that such response variability might in fact be beneficial, because it may partially compensate for a decrease in accuracy due to stochastic changes in the synaptic strengths of a network. We study the interplay between two kinds of noise, response (or neuronal) noise and synaptic noise, by analyzing their joint influence on the accuracy of neural networks trained to perform various tasks. We find an interesting, generic interaction: when fluctuations in the synaptic connections are proportional to their strengths (multiplicative noise), a certain amount of response noise in the input neurons can significantly improve network performance, compared to the same network without response noise. Performance is enhanced because response noise and multiplicative synaptic noise are in some ways equivalent. So if the algorithm used to find the optimal synaptic weights can take into account the variability of the model neurons, it can also take into account the variability of the synapses. Thus, the connection patterns generated with response noise are typically more resistant to synaptic degradation than those obtained without response noise. As a consequence of this interplay, if multiplicative synaptic noise is present, it is better to have response noise in the network than not to have it. 
These results are demonstrated analytically for the most basic network consisting of two input neurons and one output neuron performing a simple classification task, but computer simulations show that the phenomenon persists in a wide range of architectures, including recurrent (attractor) networks and sensorimotor networks that perform coordinate transformations. The results suggest that response variability could play an important dynamic role in networks that continuously learn.
Neural Computation 18, 1349–1379 (2006)
C 2006 Massachusetts Institute of Technology
1350
G. Basalyga and E. Salinas
1 Introduction

Neuronal networks face an inescapable trade-off between learning new associations and forgetting previously stored information. In competitive learning models, this is sometimes referred to as the stability-plasticity dilemma (Carpenter & Grossberg, 1987; Hertz, Krogh, & Palmer, 1991): in terms of inputs and outputs, learning to respond to new inputs will interfere with the learned responses to familiar inputs. A particularly severe form of performance degradation is known as catastrophic interference (McCloskey & Cohen, 1989). It refers to situations in which the learning of new information causes the virtually complete loss of previously stored associations. Biological networks must face a similar problem, because once a task has been mastered, plasticity mechanisms will inevitably produce further changes in the internal structural elements, leading to decreased performance. That is, within subnetworks that have already learned to perform a specific function, synaptic plasticity must at least partly appear as a source of noise. In the cortex, this problem must be quite significant, given that even primary sensory areas show a large capacity for reorganization (Wang, Merzenich, Sameshima, & Jenkins, 1995; Kilgard & Merzenich, 1998; Crist, Li, & Gilbert, 2001). Some mechanisms, such as homeostatic regulation (Turrigiano & Nelson, 2000) and specific types of synaptic modification rules (Hopfield & Brody, 2004), may help alleviate the problem, but by and large, how nervous systems cope with it remains unknown. Another factor that is typically considered a limitation for neural computation capacity is response variability.
The activity of cortical neurons is highly variable, as measured by either the temporal structure of spike trains produced during constant stimulation conditions or spike counts collected in a given time interval and compared across identical behavioral trials (Dean, 1981; Softky & Koch, 1992, 1993; Holt, Softky, Koch, & Douglas, 1996). Some of the biophysical factors that give rise to this variability, such as the balance between excitation and inhibition, have been identified (Softky & Koch, 1993; Shadlen & Newsome, 1994; Stevens & Zador, 1998). But its functional significance, if any, is not understood. Here we consider a possible relationship between the two sources of randomness just discussed, whereby response variability helps counteract the destabilizing effects of synaptic changes. Although noise generally hampers performance, recent studies have shown that, in nonlinear dynamical systems such as neural networks, this is not always the case. The best-known example is stochastic resonance, in which noise enhances the sensitivity of sensory neurons to weak periodic signals (Levin & Miller, 1996; Gammaitoni, Hänggi, Jung, & Marchesoni, 1998; Nozaki, Mar, Grigg, & Collins, 1999), but noise may play other constructive roles as well. For instance, when a system has an internal source of noise, externally added
Neuronal Variability Counteracts Synaptic Noise
1351
noise can reduce the total noise of the output (Vilar & Rubi, 2000). Also, adding noise to the synaptic connections of a network during learning produces networks that, after training, are more robust to synaptic corruption and have a higher capacity to generalize (Murray & Edwards, 1994). In this letter, we study another beneficial effect of noise on neural network performance. In this case, adding randomness to the neural responses reduces the impact of fluctuations in synaptic strength. That is, here, performance depends on two sources of variability, response noise and synaptic noise, and adding some amount of response noise produces better performance than having synaptic noise alone. The reason for this paradoxical effect is that response noise acts as a regularization factor that favors connectivity matrices with many small synaptic weights over connectivity matrices with few large weights, and this minimizes the impact of a synapse that is lost or has a wrong value. We study this regularization effect in three different cases: (1) a classification task, which in its simplest instantiation can be studied analytically; (2) a sensorimotor transformation; and (3) an attractor network that produces self-sustained activity. For the latter two, the interaction between noise terms is demonstrated by extensive numerical simulations.

2 General Framework

First we consider networks with two layers: an input layer that contains N sensory neurons and an output layer with K output neurons. A matrix r is used to denote the firing rates of the input neurons in response to M stimuli, so r_ij is the firing rate of input unit i when stimulus j is presented. These rates have a mean component r̄ plus noise, as described in detail below. The output units are driven by the first-layer responses, such that the firing rate of output unit k evoked by stimulus j is
$$R_{kj} = \sum_{i=1}^{N} w_{ki}\, r_{ij}, \qquad (2.1)$$
or in matrix form, R = wr, where w is the K × N matrix of synaptic connections between input and output neurons. The output neurons also have a set of desired responses F, where F_kj is the firing rate that output unit k should produce when stimulus j is presented. In other words, F contains target values that the outputs are supposed to learn. The error E is the mean squared difference between the actual driven responses R_kj and the desired ones,

$$E = \frac{1}{KM} \sum_{k=1}^{K} \sum_{j=1}^{M} \left\langle \left(R_{kj} - F_{kj}\right)^2 \right\rangle, \qquad (2.2)$$
or in matrix notation,

$$E = \frac{1}{KM}\, \mathrm{Tr}\!\left\langle (wr - F)(wr - F)^T \right\rangle. \qquad (2.3)$$

Here, Tr(A) = Σ_i A_ii is the trace of a matrix, and the angle brackets indicate an average over multiple trials, which corresponds to multiple samples of the noise in the inputs r. The optimal synaptic connections W are those that make the error as small as possible. These can be found by computing the derivative of equation 2.3 with respect to w (or with respect to w_ab, if the summations are written explicitly) and setting the result equal to zero (see, e.g., Golub & van Loan, 1996). These steps give
$$W = F\,\bar{r}^T C^{-1}, \qquad (2.4)$$
where r̄ = ⟨r⟩ and C^{-1} is the inverse (or the pseudo-inverse) of the correlation matrix C = ⟨r r^T⟩. The general outline of the computer experiments proceeds in five steps as follows. First, the matrix r̄ with the mean input responses is generated together with the desired output responses F. These two quantities define the input-output transformation that the network is supposed to implement. Second, response noise is added to the mean input rates, such that
$$r_{ij} = \bar{r}_{ij}\,(1 + \eta_{ij}). \qquad (2.5)$$
The random variables η_ij are independently drawn from a distribution with zero mean and variance σr^2,

$$\langle \eta_{ij} \rangle = 0, \qquad \langle \eta_{ij}^2 \rangle = \sigma_r^2, \qquad (2.6)$$
where the brackets again denote an average over trials. We refer to this as multiplicative noise. Third, the optimal connections are found using equation 2.4. Note that these connections take into account the response noise through its effect on the correlation matrix C. Fourth, the connections are corrupted by multiplicative synaptic noise with variance σW^2, that is,

$$W_{ij} \rightarrow W_{ij}\,(1 + \epsilon_{ij}), \qquad (2.7)$$
where

$$\langle \epsilon_{ij} \rangle = 0, \qquad \langle \epsilon_{ij}^2 \rangle = \sigma_W^2. \qquad (2.8)$$
Finally, the network’s performance is evaluated. For this, we measure the network error E_W, which is the square error obtained with the optimal but corrupted weights W, averaged over both types of noise,

$$E_W = \frac{1}{KM}\, \mathrm{Tr}\!\left\langle (Wr - F)(Wr - F)^T \right\rangle. \qquad (2.9)$$
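The five-step procedure can be sketched in a few lines of code. The snippet below is an illustrative reimplementation, not the authors' simulation code; the function names are mine, and the correlation matrix uses the fact that independent multiplicative response noise adds σr^2 diag(Σ_j r̄_ij^2) to r̄ r̄^T:

```python
import numpy as np

def optimal_weights(rbar, F, sigma_r):
    # Eq. 2.4: W = F rbar^T C^{-1}, with C = <r r^T>.
    # For independent multiplicative response noise,
    # C = rbar rbar^T + sigma_r^2 * diag(sum_j rbar_ij^2).
    C = rbar @ rbar.T + sigma_r**2 * np.diag((rbar**2).sum(axis=1))
    return F @ rbar.T @ np.linalg.pinv(C)

def network_error(W, rbar, F, sigma_r, sigma_w, trials, seed=0):
    # Eq. 2.9: squared error averaged over response noise (Eq. 2.5)
    # and multiplicative synaptic noise (Eq. 2.7).
    rng = np.random.default_rng(seed)
    K, M = F.shape
    total = 0.0
    for _ in range(trials):
        r = rbar * (1 + sigma_r * rng.standard_normal(rbar.shape))
        Wn = W * (1 + sigma_w * rng.standard_normal(W.shape))
        D = Wn @ r - F
        total += np.trace(D @ D.T) / (K * M)
    return total / trials
```

For the two-neuron network analyzed in section 3.1, optimal_weights reproduces the closed forms of equation 3.5, and sweeping sigma_r in network_error with sigma_w > 0 reproduces the dip of Figure 1A.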
Thus, the brackets in this case indicate an average over multiple trials and multiple networks, that is, multiple corruptions of the optimal weights W. The main result we report here is an interaction between the two types of noise: in all the network architectures that we have explored, for a fixed amount of synaptic noise σW, the best performance is typically found when the response noise has a certain nonzero variance. So, given that there is synaptic noise in the network, it is better to have some response noise rather than to have none. Before addressing the first example, we should highlight some features of the chosen noise models. Regarding response noise, equations 2.5 and 2.6, other models were tested in which the fluctuations were additive rather than multiplicative. Also, gaussian, uniform, and exponential distributions were tested. The results for all combinations were qualitatively the same, so the shape of the response noise distribution does not seem to play an important role; what counts is mainly the variance. On the other hand, the benefit of response noise is observed only when the synaptic noise is multiplicative; it disappears with additive synaptic noise. However, we do test several variants of the multiplicative model, including one in which the random variables ε_ij are drawn from a gaussian distribution and another in which they are binary, 0 or −1. The latter case represents a situation in which connections are eliminated randomly with a fixed probability.

3 Noise Interactions in a Classification Task

First we consider a task in which the two-layer, fully connected network is used to approximate a binary function. The task is to classify M stimuli on the basis of the N input firing rates evoked by each stimulus. Only one output neuron is needed, so K = 1. The desired response of this output neuron is the classification function

$$F_j = \begin{cases} 1 & \text{if } j \le M/2 \\ 0 & \text{otherwise}, \end{cases} \qquad (3.1)$$
where j goes from 1 to M. Therefore, the job of the output unit is to produce a 1 for the first M/2 input stimuli and a 0 for the rest.

3.1 A Minimal Network. In order to obtain an analytical description of the noise interactions, we first consider the simplest possible network that
exhibits the effect, which consists of two input neurons and two stimuli. Thus, N = M = 2, and the desired output is F = (1, 0). Note that with a single output neuron, the matrices W and F become row vectors. Now we proceed according to the five steps outlined in the preceding section; the goal is to show analytically that in the presence of synaptic noise, performance is typically better for a nonzero amount of response noise. The matrix of mean input firing rates is set to

$$\bar{r} = \begin{pmatrix} 1 & r_0 \\ r_0 & 1 \end{pmatrix}, \qquad (3.2)$$
where r0 is a parameter that controls the difficulty of the classification. When it is close to 1, the pairs of responses evoked by the two stimuli are very similar, and large errors in the output are expected; when it is close to 0, the input responses are most different, and the classification should be more accurate. After combining the mean responses with multiplicative noise, as prescribed by equation 2.5, the input responses in a given trial become

$$r = \begin{pmatrix} 1 + \eta_{11} & r_0(1 + \eta_{12}) \\ r_0(1 + \eta_{21}) & 1 + \eta_{22} \end{pmatrix}. \qquad (3.3)$$
Assuming that the fluctuations are independent across neurons, the correlation matrix is therefore

$$C = \langle r r^T \rangle = \begin{pmatrix} (1 + r_0^2)(1 + \sigma_r^2) & 2r_0 \\ 2r_0 & (1 + r_0^2)(1 + \sigma_r^2) \end{pmatrix}. \qquad (3.4)$$
Next, after calculating the inverse of C, equation 2.4 is used to find the optimal weights, which are

$$W_1 = \frac{\sigma_r^2(1 + r_0^2) + (1 - r_0^2)}{(1 + \sigma_r^2)^2 (1 + r_0^2)^2 - 4r_0^2}, \qquad W_2 = r_0\, \frac{\sigma_r^2(1 + r_0^2) - (1 - r_0^2)}{(1 + \sigma_r^2)^2 (1 + r_0^2)^2 - 4r_0^2}. \qquad (3.5)$$
Notice that these connections take into account the response variability through their dependence on σr . The next step is to corrupt these synaptic weights as prescribed by equation 2.7 and substitute the resulting
Figure 1: Noise interaction for a simple network of two input neurons and one output neuron (K = 1, N = M = 2). Both input responses and synaptic weights were corrupted by multiplicative gaussian noise. For all curves, solid lines are theoretical results, and symbols are simulation results averaged over 1000 networks and 100 trials per network. In all cases, r0 = 0.8. (A) Average square difference between observed and desired output responses, E W , as a function of the standard deviation (SD) of the response noise, σr . Squares and dashed line correspond to the error without synaptic noise (σW = 0); circles and solid lines correspond to the error with synaptic noise (σW = 0.15, 0.20, 0.25). (B) Dependence of the (uncorrupted) optimal weights W on σr .
expressions into equation 2.9. After making all the substitutions, calculating the averages, and simplifying, we obtain the average error,

$$E_W = \frac{1}{2} \left[ \sigma_W^2 \left(W_1^2 + W_2^2\right) (1 + \sigma_r^2)(1 + r_0^2) - W_1 - r_0 W_2 + 1 \right]. \qquad (3.6)$$
This is the average square difference between the desired and actual responses of the output neuron given the two types of noise. It is a function of only three parameters, σr , σW , and r0 , because the optimal weights themselves depend on σr and r0 . The interaction between noise terms for this simple N = M = 2 case is illustrated in Figure 1A, which plots the error as a function of σr with and without synaptic variability. Here, dashed and solid lines represent the theoretical results given by equations 3.5 and 3.6, and symbols correspond to simulation results averaged over 1000 networks and 100 trials per network. Without synaptic noise (dashed line), the error increases monotonically with σr , as one would normally expect when adding response variability. In contrast, when σW = 0.15, 0.2, or 0.25 (solid lines), the error initially decreases and then starts increasing again, slowly approaching the curve obtained with response noise alone.
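Because equations 3.5 and 3.6 are closed forms, the behavior just described can be regenerated in a few lines. The sketch below is illustrative (helper names are mine):

```python
def optimal_w(sr, r0):
    # Eq. 3.5: closed-form optimal weights of the 2x2 network
    d = (1 + sr**2) * (1 + r0**2)
    det = d**2 - 4 * r0**2
    return (d - 2 * r0**2) / det, r0 * (d - 2) / det

def avg_error(sr, sw, r0):
    # Eq. 3.6: average error under response noise sr and synaptic noise sw
    w1, w2 = optimal_w(sr, r0)
    d = (1 + sr**2) * (1 + r0**2)
    return 0.5 * (sw**2 * (w1**2 + w2**2) * d - w1 - r0 * w2 + 1)
```

With r0 = 0.8, avg_error(sr, 0, r0) grows monotonically in sr, while avg_error(sr, 0.25, r0) dips below its sr = 0 value for moderate sr, as in Figure 1A.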
Figure 1B shows how the optimal weights depend on σr . The solid lines were obtained from equations 3.5. The curves show that the effect of response noise is to decrease the absolute values of the optimal synaptic weights. Intuitively, that is why response variability is advantageous; smaller synaptic weights also mean smaller synaptic fluctuations, because the standard deviations (SD) are proportional to the mean values. So, there is a trade-off: the intrinsic effect of increasing σr is to increase the error, but with synaptic noise present, σr also decreases the magnitude of the weights, which lowers the impact of the synaptic fluctuations. That the impact of synaptic noise grows directly with the magnitude of the weights is also apparent from the first term in equation 3.6. The magnitude of the noise interaction can be quantified by the ratio E min /E 0 , where the numerator is the minimal value of the error curve and the denominator is the error obtained when only synaptic noise is present, that is, when σr = 0. The minimum error E min occurs at the optimal value of σr , denoted as σmin . The ratio E min /E 0 is equal to 1 if response variability provides no advantage and approaches 0 as σmin cancels more of the error due to synaptic noise. For the lowest solid curve in Figure 1A, the ratio is approximately 0.8, so response variability cancels about 20% of the square error generated by synaptic fluctuations. Note, however, that in these examples, the error is below E 0 for a large range of values of σr , not only near σmin , so response noise may be beneficial even if it is not precisely matched to the amount of synaptic noise. Figure 2 further characterizes the strength of the interaction between the two types of noise. Figures 2A and 2B show how the error and the optimal amount of response variability vary as functions of σW . 
These graphs indicate that the fraction of the error that σr is able to compensate for, as well as the optimal amount of response noise, increases with the SD of the synaptic noise. The minimum error, E_min, grows steadily with σW; clearly, σr cannot completely compensate for synaptic corruption. Also, σW has to be bigger than a critical value for the noise interaction to be observed (σW > 0.1, approximately). However, except when synaptic noise is very small, the optimal strategy is to add some response noise to the network. As in the previous figure, symbols and lines in Figure 2 correspond to simulation and theoretical results, respectively. To obtain the latter, the key is to calculate σmin. This is done by first substituting the optimal synaptic weights of equation 3.5 into the expression for the average error, equation 3.6, and second, calculating the derivative of the error with respect to σr^2 and equating it to zero. The resulting expression gives σmin^2 as a function of the only two remaining parameters, σW and r0. The dependence, however, is highly nonlinear, so in general the solution is implicit:

$$\sigma_r^8\left(1 - \sigma_W^2\right) + 2\sigma_r^6\left[1 + a^2\left(1 - 2\sigma_W^2\right)\right] + 6\sigma_r^4 a^2\left(1 - \sigma_W^2\right) + 2\sigma_r^2 a^2\left[1 + a^2 + 2a^2\sigma_W^2 - 4\sigma_W^2\right] + a^4\left(1 + 3\sigma_W^2\right) - 4a^2\sigma_W^2 = 0, \qquad (3.7)$$
Figure 2: Optimal amount of response noise in the minimal classification network. Same network with two sensory neurons and one output neuron as in Figure 1. Lines and symbols indicate theoretical and simulation results, respectively, averaged over 1000 networks and 100 trials per network. (A) Strength of the noise interaction quantified by E min (dashed line) and E min /E 0 (solid line), as a function of σW , which determines the synaptic variability. Here and in B, r0 = 0.8. (B) Optimal amount of response variability, σmin , as a function of σW , for the same data in A. (C) Strength of the noise interaction as a function of r0 , which parameterizes the discriminability of the mean input responses evoked by the two stimuli. Here and in D, σW = 1. (D) σmin , as a function of r0 for the same data in C.
where

$$a \equiv \frac{1 - r_0^2}{1 + r_0^2}. \qquad (3.8)$$
The value of σr that makes equation 3.7 true is σmin. For Figures 2A and 2B, the zero of the polynomial was found numerically for each combination of r0 and σW. Figures 2C and 2D show how E_min, E_min/E_0, and σmin depend on the separation between evoked input responses, as parameterized by r0. For these two plots, we chose a special case in which σmin can be obtained analytically from equation 3.7: σW = 1. In this particular case, the dependence of σmin on r0 has a closed form,

$$\sigma_{\min}^2 = \frac{\left(1 - r_0^2\right)^{2/3}}{1 + r_0^2} \left[ (1 + r_0)^{2/3} + (1 - r_0)^{2/3} \right]. \qquad (3.9)$$
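As a consistency check on this closed form, one can minimize equation 3.6 numerically at σW = 1 and compare the location of the minimum with equation 3.9. An illustrative sketch (names are mine):

```python
import numpy as np

def avg_error(sr, sw, r0):
    # Eqs. 3.5 and 3.6 combined: average error at the optimal weights
    d = (1 + sr**2) * (1 + r0**2)
    det = d**2 - 4 * r0**2
    w1 = (d - 2 * r0**2) / det
    w2 = r0 * (d - 2) / det
    return 0.5 * (sw**2 * (w1**2 + w2**2) * d - w1 - r0 * w2 + 1)

def sigma_min_closed(r0):
    # Eq. 3.9 (valid only for sigma_W = 1)
    s2 = (1 - r0**2)**(2.0 / 3.0) / (1 + r0**2) * ((1 + r0)**(2.0 / 3.0) + (1 - r0)**(2.0 / 3.0))
    return np.sqrt(s2)

r0 = 0.8
grid = np.linspace(0.01, 2.0, 40001)
errors = np.array([avg_error(s, 1.0, r0) for s in grid])
sigma_min_numeric = grid[errors.argmin()]
```

For r0 = 0.8 the numerical minimizer and the closed form agree to within the grid resolution.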
This function is shown in Figure 2D. In general, the numerical simulations are in good agreement with the theory, except that the scatter in Figure 2D tends to increase as r0 approaches 0. This is due to a key feature of the noise interaction, which is that it depends on the overlap between input responses across stimuli. This can be seen as follows. First, notice that in Figure 2C, the relative error approaches 1 as r0 gets closer to 0. Thus, the noise interaction becomes weaker when there is less overlap between input responses, which is precisely what r0 represents in equation 3.2. If there is no overlap at all, the benefit of response noise vanishes. This fact explains why more than one neuron is needed to observe the noise interaction in the first place. This observation can be demonstrated analytically by setting r0 = 0 in equations 3.5 and 3.6, in which case the average square error becomes

$$E_W(r_0 = 0) = \frac{1}{2} \left[ \frac{\sigma_W^2 - 1}{1 + \sigma_r^2} + 1 \right]. \qquad (3.10)$$
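Equation 3.10 is simple enough to evaluate directly (the function name is mine):

```python
def error_no_overlap(sr, sw):
    # Eq. 3.10: average square error when the input responses do not overlap (r0 = 0)
    return 0.5 * ((sw**2 - 1.0) / (1.0 + sr**2) + 1.0)
```

Sweeping sr for sw below, at, and above 1 exhibits the three regimes discussed next.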
This result has interesting implications. If σW^2 = 1, response noise makes no difference, so there is no optimal value. If σW^2 < 1, the error increases monotonically with response noise, so the optimal value is 0. And if σW^2 > 1, the optimal strategy is to add as much noise as possible. In this case, the variance of the output neuron is so high that there is no hope of finding a reasonable solution; the best thing to do is set the mean weights to zero, disconnecting the output unit. Thus, without overlap, either the synaptic noise is so high that the network is effectively useless, or, if σW is tolerable, response noise does not improve performance. At r0 = 0, the numerical solutions oscillate between these two extremes, producing an average error of 0.5 (left-most point in Figure 2C). In general, however, with nonzero overlap, there is a true optimal amount of response noise, and the more overlap there is, the larger its benefits are, as shown in Figure 2C. The simulation data points in Figure 2 were obtained using fluctuations ε and η in equations 2.7 and 3.3, respectively, sampled from gaussian
distributions. The results, however, were virtually identical when the distribution functions were either uniform or exponential. Thus, as noted earlier, the exact shapes of the noise distributions do not restrict the observed effect.

3.2 Regularization by Noise. Above, we mentioned that response noise tends to decrease the absolute value of the optimal synaptic weights. Why is this? The reason is that minimization of the mean square error in the presence of response noise is mathematically equivalent to minimization of the same error without response noise but with an imposed constraint forcing the optimal weights to be small. This is as follows. Consider equation 2.4, which specifies the optimal weights in the two-layer network. Response noise enters into the expression through the correlation matrix. By separating the input responses into mean plus noise, we have

$$C = \left\langle (\bar{r} + \eta)(\bar{r} + \eta)^T \right\rangle = \bar{r}\bar{r}^T + \left\langle \eta\eta^T \right\rangle = \bar{r}\bar{r}^T + D_\sigma, \qquad (3.11)$$
where we have assumed that the noise is additive and uncorrelated across neurons (additivity is considered for simplicity but is not necessary). This results in the diagonal matrix D_σ containing the variances of individual units, such that element j along the diagonal is the total variance, summed over all stimuli, of input neuron j. Thus, uncorrelated response noise adds a diagonal matrix to the correlation between average responses. In that case, equation 2.4 can be rewritten as

$$W = F\,\bar{r}^T \left( \bar{r}\bar{r}^T + D_\sigma \right)^{-1}. \qquad (3.12)$$
Now consider the mean square error without any noise but with an additional term that penalizes large weights. To restrict, for instance, the total synaptic weight provided by each input neuron, add the penalty term

$$\frac{1}{KM} \sum_{i,j} \lambda_i\, w_{ij}^2 \qquad (3.13)$$
to the original error expression, equation 2.3. Here, λ_i determines how much input neuron i is taxed for its total synaptic weight. Rewriting this as a trace, the total error to be minimized in this case becomes

$$E = \frac{1}{KM} \left( \mathrm{Tr}\!\left[ (w\bar{r} - F)(w\bar{r} - F)^T \right] + \mathrm{Tr}\!\left( w D_\lambda w^T \right) \right), \qquad (3.14)$$
where D_λ is a diagonal matrix that contains the penalty coefficients λ_i along the diagonal. The synaptic weights that minimize this error function are given by

$$W = F\,\bar{r}^T \left( \bar{r}\bar{r}^T + D_\lambda \right)^{-1}. \qquad (3.15)$$
But this solution has exactly the same form as equation 3.12, which minimizes the error in the presence of response noise alone, without any other constraints. Therefore, adding response noise is equivalent to imposing a constraint on the magnitude of the synaptic weights, with more noise corresponding to smaller weights. The penalty term in equation 3.13 can also be interpreted as a regularization term, which refers to a common type of constraint used to force the solution of an optimization problem to vary smoothly (Hinton, 1989; Haykin, 1999). Therefore, as has been pointed out previously (Bishop, 1995), the effect of response fluctuations can be described as regularization by noise. In our model, we assumed that the fluctuations in synaptic connections are proportional to their size. What happens, then, is that response noise forces the optimal weights to be small, and this significantly decreases the part of the error that depends on σW . In this way, smaller synaptic weights— and therefore a nonzero σr —typically lead to smaller output errors. Another way to look at the relationship between the two types of noise is to calculate the optimal mean synaptic weights taking the synaptic variability directly into account. For simplicity, suppose that there is no response noise. Substitute equation 2.7 directly into equation 2.3, and minimize with respect to W, now averaging over the synaptic fluctuations. With multiplicative noise, the result is again an expression similar to equations 3.12 and 3.15, where a correction proportional to the synaptic variance is added to the diagonal of the correlation matrix. In contrast, with additive synaptic noise, the resulting optimal weights are exactly the same as without any variability, because this type of noise cannot be compensated for. Therefore, the recipe for counteracting response noise is equivalent to the recipe for counteracting multiplicative synaptic noise. 
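This equivalence can be verified numerically: the noise-derived weights of equation 3.12 satisfy the stationarity condition of the penalized error, equation 3.14, with D_λ = D_σ, and they have a smaller norm than the unpenalized least-squares solution. An illustrative sketch, assuming additive noise of variance σ^2 per entry so that D_σ = σ^2 M I (all names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 4, 6
rbar = rng.uniform(0.0, 1.0, (N, M))                 # mean input rates
F = (np.arange(M) < M // 2).astype(float)[None, :]   # Eq. 3.1 targets, K = 1

sigma = 0.3
D_sigma = sigma**2 * M * np.eye(N)  # total noise variance per neuron, summed over stimuli

# Eq. 3.12: optimal weights in the presence of additive response noise
W_noise = F @ rbar.T @ np.linalg.inv(rbar @ rbar.T + D_sigma)

# Stationarity of the penalized error (Eq. 3.14) with D_lambda = D_sigma:
# the gradient 2 (W rbar - F) rbar^T + 2 W D_lambda should vanish at W_noise.
grad = 2 * (W_noise @ rbar - F) @ rbar.T + 2 * W_noise @ D_sigma

# Unconstrained least-squares weights, for comparison
W_plain = F @ rbar.T @ np.linalg.inv(rbar @ rbar.T)
```

The regularized weights shrink every component of the least-squares solution, which is exactly the "many small weights" effect described in the text.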
An argument outlining why this is generally true is presented in section 6.1.

3.3 Classification in Larger Networks. When the simple classification task is extended to larger numbers of first-layer neurons (N > 2) and more input stimuli to classify (M > 2), an important question can be studied: How does the interaction between synaptic and response noise depend on the dimensionality of the problem, that is, on N and M? To address this issue, we did the following. Each entry in the N × M matrix r̄ of mean responses was taken from a uniform distribution between 0 and 1. The desired output still consisted of a single neuron’s response given by equation 3.1, as before. So each one of the M input stimuli evoked a set of N neuronal responses,
Figure 3: Interaction between synaptic noise and response noise during the classification of M input stimuli. For each stimulus, the mean responses of N input neurons were randomly selected from a uniform distribution between 0 and 1. The output unit of the network had to classify the M response patterns by producing either a 1 or a 0. The synaptic noise SD was σW = 0.5. Results (circles) are averages over 1000 networks and 100 trials per network. All data are from computer simulations. (A) Relative error, E min /E 0 , as a function of the number of input neurons, N. The number of stimuli was kept constant at M = 10. (B) Optimal value of the response noise SD, σmin , as a function of the number of input neurons, N. Same simulations as in A. (C) Relative error as a function of the number of input stimuli, M. The number of input neurons was kept constant at N = 10. (D) Optimal value of the response noise SD as a function of M for the same simulations as in C.
each set drawn from the same distribution, and the output neuron had to divide the M evoked firing rate patterns into two categories. The optimal amount of response noise was found, and the process was repeated for different combinations of N and M. The results from these simulations are shown in Figure 3. All data points were obtained with the same amount of synaptic variability,
1362
G. Basalyga and E. Salinas
σ_W = 0.5. Each point represents an average over 1000 networks for which the optimal connections were corrupted. The amount of response noise that minimized the error, averaged over those 1000 corruption patterns, was found numerically by calculating the average error with the same mean responses and corruption patterns but different σ_r. For each combination of N and M, this resulted in σ_min, which is shown in Figure 3B. The actual average error obtained with σ_r = σ_min divided by the error for σ_r = 0 is shown in Figure 3A, as in Figure 2. Interestingly, the benefit conferred by response noise depends strongly on the difference between N and M. With M = 10 input stimuli, the effect of response noise is maximized when N = 10 neurons are used to encode them (see Figure 3A), and vice versa: when there are N = 10 neurons in the network, the maximum effect is seen when they encode M = 10 stimuli (see Figure 3C). Results with other numbers (5, 20, and 40 stimuli or neurons) were the same: response noise always had a maximum impact when N = M. This is not unreasonable. When there are many more neurons than stimuli, a moderate amount of synaptic corruption causes only a small error, because there is redundancy in the connectivity matrix. On the other hand, when there are many more input stimuli than neurons, the error is large anyway, because the N neurons cannot possibly span all the required dimensions, M. Thus, at both extremes, the impact of synaptic noise is limited. In contrast, when N = M, there is no redundancy, but the output error can potentially be very small, so the network is most sensitive to alterations in synaptic connectivity. Thus, response noise makes a big difference when the number of responses and the number of independent stimuli encoded are equal or nearly so. In Figures 3A and 3C, the relative error is not zero for N = M, but it is quite small (E_min = 0.23, E_min/E_0 = 0.004).
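The numerical procedure just described can be sketched in a few lines of Python. This is a schematic illustration, not the authors' code: the regularized least-squares formula below stands in for equation 2.4 and the random 0/1 targets stand in for equation 3.1 (neither equation is reproduced in this section), and all names and parameter choices are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = M = 10
sigma_W = 0.5                    # synaptic noise SD, as in Figure 3

# Mean responses: N neurons x M stimuli, uniform between 0 and 1.
r_mean = rng.uniform(0.0, 1.0, size=(N, M))
targets = rng.integers(0, 2, size=M).astype(float)   # desired 0/1 outputs

def optimal_weights(sigma_r):
    """Least-squares weights with the response-noise correction: sigma_r**2
    is added to the diagonal of the response correlation matrix (a stand-in
    for equation 2.4, which is not reproduced in this section)."""
    C = r_mean @ r_mean.T / M + sigma_r**2 * np.eye(N)
    L = r_mean @ targets / M
    return np.linalg.solve(C, L)

def avg_error(sigma_r, n_nets=200):
    """Mean squared output error, averaged over corrupted networks, with
    multiplicative synaptic noise and additive response noise."""
    w0 = optimal_weights(sigma_r)
    err = 0.0
    for _ in range(n_nets):
        w = w0 * (1.0 + sigma_W * rng.normal(size=N))    # corrupt the synapses
        r = r_mean + sigma_r * rng.normal(size=(N, M))   # noisy responses
        err += np.mean((w @ r - targets) ** 2)
    return err / n_nets

# Sweep the response noise SD and locate the minimum, as for Figure 3B.
sigmas = np.linspace(0.0, 1.0, 21)
errors = [avg_error(s) for s in sigmas]
sigma_min = float(sigmas[int(np.argmin(errors))])
```

Because the weights are recomputed for each value of σ_r, the response noise acts like a ridge penalty that shrinks the weights, which is what limits the damage done by the multiplicative synaptic corruption.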
This is primarily because the error without any response noise, E_0, can be very large. Interestingly, the optimal amount of response noise also seems to be largest when N = M, as suggested by Figures 3B and 3D. In contrast to previous examples, for all data points in Figure 3, the fluctuations in the synapses and in the firing rates, ξ and η, were drawn from uniform rather than gaussian distributions. As mentioned before, the variances of the underlying distributions should matter, but their shapes should not. Indeed, with the same variances, results for Figure 3 were virtually identical with gaussian or exponential distributions. A potential concern in this network is that although the variability of the output neuron depends on the interaction between the two types of noise, perhaps the interaction is of little consequence with respect to actual classification performance. The relevant measure for this is the probability of correct classification, pc. This probability is obtained by comparing the distributions of output responses to stimuli in one category versus the other, which is typically done using standard methods from signal detection theory (Dayan & Abbott, 2001). The algorithm underlying the calculation is quite simple: in each trial, the stimulus is assumed to belong to class 1
if the output firing rate is below a threshold; otherwise, the stimulus belongs to class 2. To obtain pc, the results should be averaged over trials and stimuli. Finally, note that an optimal threshold should be used to obtain the highest possible pc. We performed this analysis on the data in Figure 3. Indeed, pc also depended nonmonotonically on response variability. For instance, for N = M = 10, the values with and without response noise were pc(σ_r = σ_min) = 0.83 and pc(σ_r = 0) = 0.75, where chance performance corresponds to 0.5. Also, the maximum benefit of response noise occurred for N = M and decreased quickly as the difference between N and M grew, as in Figures 3A and 3C. However, the amount of response noise that maximized pc was typically about one-third of the amount that minimized the mean square error. Thus, the best classification probability for N = M = 10 was pc(σ_r = 0.13) = 0.91. Maximizing pc is not equivalent to minimizing the mean square error; the two quantities weight differently the bias and variance of the output response (see Haykin, 1999). Nevertheless, response noise can also counteract part of the decrease in pc due to synaptic noise, so its beneficial impact on classification performance is real.

4 Noise Interactions in a Sensorimotor Network

To illustrate the interactions between synaptic and response noise in a more biologically realistic situation, we apply the general approach outlined in section 2 to a well-known model of sensorimotor integration in the brain. We consider the classic coordinate transformation problem in which the location of an object, originally specified in retinal coordinates, becomes independent of gaze angle.
This type of computation has been thoroughly studied both experimentally (Andersen, Essick, & Siegel, 1985; Brotchie, Andersen, Snyder, & Goodman, 1995) and theoretically (Zipser & Andersen, 1988; Salinas & Abbott, 1995; Pouget & Sejnowski, 1997) and is thought to be the basis for generating representations of object location relative to the body or the world. Also, the way in which visual and eye-position signals are integrated here is an example of what seems to be a general principle for combining different information streams in the brain (Salinas & Thier, 2000; Salinas & Sejnowski, 2001). Such integration by "gain modulation" may have wide applicability in diverse neural circuits (Salinas, 2004), so it represents a plausible and general situation in which computational accuracy is important. From the point of view of the phenomenon at hand, the constructive effect of response noise, this example addresses an important issue: whether the noise interaction is still observed when network performance depends on a population of output neurons. In the classification task, performance was quantified through a single neuron's response, but in this case, it depends on a nonlinear combination of multiple firing rates, so maybe the
impact of response noise washes out in the population average. As shown below, this is not the case. The sensorimotor network has, as before, a feedforward architecture with two layers. The first layer contains N gain-modulated sensory units, and the second or output layer contains K motor units. Each sensory neuron is connected to all output neurons through a set of feedforward connections, as illustrated in Figure 4B. The sensory neurons are sensitive to two quantities: the location (or direction) of a target stimulus x, which is in retinal coordinates, and the gaze (or eye-position) angle y. The network is designed so that the motor layer generates or encodes a movement in a direction z, which represents the direction of the target relative to the head. The idea is that the profile of activity of the output neurons should have a single peak centered at direction z. The correct (i.e., desired) relationship between inputs and outputs is z = x − y, which is approximately how the angles x and y should be combined in order to generate a head-centered representation of target direction (Zipser & Andersen, 1988; Salinas & Abbott, 1995; Pouget & Sejnowski, 1997). In other words, z is the quantity encoded by the output neurons, and it should relate to the quantities encoded by the sensory neurons through the function z(x, y) = x − y. Many other functions are possible, but as far as we can tell, the choice has little impact on the qualitative effect of response noise. In this model, the mean firing rate of sensory neuron i is characterized by a product of two tuning functions, f_i(x) and g_i(y), such that

r_i(x, y) = r_max f_i(x) (1 − D + D g_i(y)) + r_B,   (4.1)
where r_B = 4 spikes per second is a baseline firing rate, r_max = 35 spikes per second, and D is the modulation depth, which is set to 0.9 throughout. The sensory neurons are gain modulated because they combine the information from their two inputs nonlinearly. The amplitude—but not the selectivity—of a visually triggered response, represented by f_i(x), depends on the
Figure 4: Network model of a sensorimotor transformation. In this network, N = 400, K = 25, M = 400. Target and movement directions, x and z, respectively, vary between −25 and 25, whereas gaze angle y varies between −15 and 15. The graphs correspond to a single trial in which x = −10, y = 10, and z = x − y = −20. Neither response noise nor synaptic corruption was included in this example. (A) Firing rates of the 400 gain-modulated input neurons arranged according to preferred stimulus location. (B) Network architecture. (C) Firing rates of the 25 output motor neurons arranged according to preferred target location.
direction of gaze (Andersen et al., 1985; Brotchie et al., 1995; Salinas & Thier, 2000). Note that in the expression above, the second index of the mean rate r̄_ij has been replaced by parentheses, indicating a dependence on x and y. This is to simplify the notation; the responses can still be arranged in a matrix r̄ if each value of the second index is understood to indicate a particular combination of values of x and y. For example, if the rates were evaluated in a grid with 10 x points and 10 y points, the second index would run from 1 to 100, covering all combinations. Indeed, this is how it is done in the computer. For simplicity, the tuning curves for different neurons in a given layer are assumed to have the same shape but different preferred locations or center points, which are always between −25 and 25. Visual responses are modeled as gaussian tuning functions of stimulus location x,
f_i(x) = exp(−(x − a_i)^2/(2σ_f^2)),   (4.2)
where a_i is the preferred location and σ_f = 4 is the tuning curve width. The dependence on eye position is modeled using sigmoidal functions of the gaze angle y,

g_i(y) = 1/(1 + exp(−(b_i − y)/d_i)),   (4.3)
where b_i is the center point of the sigmoid and d_i is chosen randomly between −7 and +7 to make sure that the curves g_i(y) have different slopes for different neurons in the array. In each trial of the task, response variability is included by applying a variant of equation 2.5:

r_ij = r̄_ij + sqrt(r̄_ij) η_ij.   (4.4)
This makes the variance of the rates proportional to their means, which in general is in good agreement with experimental data (Dean, 1981; Softky & Koch, 1992, 1993; Holt, Softky, Koch, & Douglas, 1996). This choice, however, is not critical (see below). The desired response for each output neuron is also described by a gaussian,

F_k(z) = r_max exp(−(z − c_k)^2/(2σ_F^2)) + r_B,   (4.5)
where σ_F = 4 and c_k is the preferred target direction of motor neuron k. This expression gives the intended response of output unit k in terms of the
encoded quantity z. Keep in mind, however, that the desired dependence on the sensory inputs is obtained by setting z = x − y. When driven by the first-layer neurons, the output rates are still calculated through a weighted sum,

R_k(z) = R_k(x, y) = Σ_{i=1}^N W_ki r_i(x, y).   (4.6)
This is equivalent to equation 2.1 but with the second index defined implicitly through x and y, as mentioned above. The optimal synaptic connections W_ki are determined exactly as before, using equation 2.4. Typical profiles of activity for input and output neurons are shown in Figures 4A and 4C for a trial with x = −10 and y = 10. The sensory neurons are arranged according to their preferred stimulus location a_i, whereas the motor neurons are arranged according to their preferred movement direction c_k. For this sample trial, no variability was included; the firing rate values in Figure 4A are scattered under a gaussian envelope (given by equation 4.2) because the gaze-dependent gain factors vary across cells. Also, the output profile of activity is gaussian and has a peak at the point z = −20, which is exactly where it should be given that the correct input-output transformation is z = x − y. With noise, the output responses would be scattered around the gaussian profile and the peak would be displaced. The error used to measure network performance is, in this case,

E_pop = ⟨|z − Z|⟩.   (4.7)
This is the absolute difference, averaged over trials and networks, between the desired movement direction z—the actual head-centered target direction—and the direction Z that is encoded by the center of mass of the output activity,

Z = Σ_i (R_i − r_B)^2 c_i / Σ_k (R_k − r_B)^2.   (4.8)
Therefore, equation 4.7 gives the accuracy with which the whole motor population represents the head-centered direction of the target, whereas equation 4.8 provides the recipe to read out such output activity. Now the idea is to corrupt the optimal connections and evaluate E_pop using various amounts of response noise to determine whether there is an optimum. Relative to the previous examples, the key differences are, first, that the error in equation 4.7 represents a population average, and second, that although the connections are set to minimize the average difference between desired and driven firing rates, the performance criterion is not based directly on it.
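The sensorimotor model of equations 4.1 through 4.8 can be sketched as follows. This is a minimal illustration rather than the authors' code: the network is smaller (N = 100 instead of 400), the distribution of the sigmoid centers b_i and the exclusion of near-zero slopes d_i are assumptions, and a plain least-squares fit stands in for equation 2.4.

```python
import numpy as np

rng = np.random.default_rng(1)

# Parameter values from the text; network size reduced for illustration.
N, K = 100, 25
r_max, r_B, D = 35.0, 4.0, 0.9        # rates in spikes/s; modulation depth
sigma_f = sigma_F = 4.0
a = np.linspace(-25, 25, N)           # preferred stimulus locations, eq. 4.2
b = rng.uniform(-15, 15, N)           # sigmoid centers (an assumption)
d = rng.choice([-1, 1], N) * rng.uniform(1, 7, N)  # slopes kept away from 0
c = np.linspace(-25, 25, K)           # preferred movement directions, eq. 4.5

def sensory_rates(x, y):
    """Gain-modulated mean rates: equation 4.1 with tunings 4.2 and 4.3."""
    f = np.exp(-(x - a) ** 2 / (2 * sigma_f ** 2))
    g = 1.0 / (1.0 + np.exp(-(b - y) / d))
    return r_max * f * (1 - D + D * g) + r_B

def desired_output(z):
    """Desired motor-layer profile: equation 4.5 with z = x - y."""
    return r_max * np.exp(-(z - c) ** 2 / (2 * sigma_F ** 2)) + r_B

def readout(R):
    """Center-of-mass decoder: equation 4.8."""
    w = (R - r_B) ** 2
    return float(np.sum(w * c) / np.sum(w))

# Noiseless least-squares weights over a 20 x 20 grid of (x, y) conditions
# (a stand-in for equation 2.4, which is not reproduced in this section).
xs, ys = np.linspace(-25, 25, 20), np.linspace(-15, 15, 20)
R_in = np.array([sensory_rates(x, y) for x in xs for y in ys])   # M x N
F = np.array([desired_output(x - y) for x in xs for y in ys])    # M x K
W, *_ = np.linalg.lstsq(R_in, F, rcond=None)                     # N x K

# Single trial as in Figure 4: x = -10, y = 10, so z should decode near -20.
Z = readout(sensory_rates(-10.0, 10.0) @ W)
E_pop = abs(-20.0 - Z)
```

In this noiseless sketch the decoded direction should land near z = −20, the analog of Figure 4C; corrupting W and adding response noise to the rates would then reproduce the experiment summarized in Figure 5.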
Figure 5: Noise interaction for the sensorimotor network depicted in Figure 4. Results are averaged over 100 networks and 100 trials per network. All data are from computer simulations. (A) Average absolute deviation between actual and encoded target locations, E_pop, as a function of response noise. Continuous lines are for three probabilities of weight elimination, p_W = 0.1, 0.3, and 0.5; the dashed line corresponds to p_W = 0. (B) Magnitude of the noise interaction, measured by the relative error E_min/E_0, as a function of the number of input neurons, N, for p_W = 0.2. (C) E_min and E_min/E_0 as functions of p_W. (D) Optimal response noise SD, σ_min, as a function of p_W.
Simulation results for this sensorimotor model are presented in Figure 5. A total of 400 sensory and 25 output neurons were used. These units were tested with all combinations of 20 values of x and 20 values of y, uniformly spaced (thus, M = 400). Synaptic noise was generated by random weight elimination. This means that after having set the connections to their optimal values given by equation 2.4, each one was reset to zero with a probability p_W. Thus, on average, a fraction p_W of the weights in each network was eliminated. As shown in Figure 5A, when p_W > 0, the error between the encoded and the true target direction has a minimum with respect to σ_r. These error curves represent averages over 100 networks. Interestingly, the
benefit of noise does not decrease when more sensory units are included in the first layer (see Figure 5B). That is, if p_W is constant, the proportion of eliminated synapses does not change, so the error caused by synaptic corruption cannot be reduced simply by adding more neurons. Figure 5C shows the minimum and relative errors as functions of p_W. This graph highlights the substantial impact that response noise has on this network: the relative error stays below 0.2 even when about a third of the synapses are eliminated. This is not only because the error without response noise is high, but also because the error with an optimal amount of noise stays low. For instance, with p_W = 0.3 and σ_r = σ_min, the typical deviation from the correct target direction is about 2 units, whereas with σ_r = 0, the typical deviation is about 10. Response noise thus cuts the deviation by about a factor of five, and importantly, the resulting error is still small relative to the range of values of z, which spans 50 units. Also, as observed in the classification task, in general it is better to include response noise even if σ_r is not precisely matched to the amount of synaptic variability (see Figure 5A). Figure 5D plots σ_min as a function of the probability of synaptic elimination. The optimal amount of response noise increases with p_W and reaches fairly high levels. For instance, at a value of 1, which corresponds to p_W near 0.15, the variance of the firing rates is equal to their mean, because of equation 4.4. We wondered whether the scaling law of the response noise would make any difference, so we reran the simulations with either additive noise (SD independent of mean) or noise with an SD proportional to the mean, as in equation 2.5. Results in these two cases were very similar: E_min and E_min/E_0 varied very much as in Figure 5C, and the optimal amount of noise grew monotonically with p_W, as in Figure 5D.
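The corruption procedure, random weight elimination, can be sketched as follows (a minimal illustration; the function and variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def eliminate_weights(W, p_w, rng):
    """Random synapse elimination: each weight is independently reset to
    zero with probability p_w, as used for Figure 5."""
    survives = rng.random(W.shape) >= p_w
    return W * survives

# Example: a 25 x 400 weight matrix with about 30% of synapses removed.
W = rng.normal(size=(25, 400))
W_cut = eliminate_weights(W, 0.3, rng)
frac_removed = float(np.mean(W_cut == 0.0))
```

Surviving weights are left untouched, so on average a fraction p_W of the synapses is removed, matching the description above.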
5 Noise Interactions in a Recurrent Network

The networks discussed in the previous sections had a feedforward architecture, and in those cases the contribution of response noise to the correlation matrix between neuronal responses could be determined analytically. In contrast, in recurrent networks, the dynamics are more complex and the effects of random fluctuations more difficult to ascertain. To investigate whether response noise can still counteract some of the effects of synaptic variability, we consider a recurrent network with a well-defined function and relatively simple dynamics characterized by attractor states. When the firing rates in this network are initialized at arbitrary values, they eventually stop changing, settling down at certain steady-state points in which some neurons fire intensely and others do not. The optimal weights sought are those that allow the network to settle at predefined sets of steady-state responses, and the error is thus defined in terms of the difference between the desired steady states and the observed ones. As before, response noise is taken into account when the optimal synaptic weights are generated,
although in this case, the correction it introduces (relative to the noiseless case) is an approximation. The attractor network consists of N continuous-valued neurons, each of which is connected to all other units via feedback synaptic connections (Hertz, Krogh, & Palmer, 1991). With the proper connectivity, such a network can generate, without any tuned input, a steady-state profile of activity with a cosine or gaussian shape (Ben-Yishai, Bar-Or, & Sompolinsky, 1995; Compte, Brunel, Goldman-Rakic, & Wang, 2000; Salinas, 2003). Such stable "bump"-shaped activity is observed in various neural models, including those for cortical hypercolumns (Hansel & Sompolinsky, 1998), head-direction cells (Zhang, 1996; Laing & Chow, 2001), and working memory circuits (Compte et al., 2000). Below, we find the connection matrix that allows the network to exhibit a unimodal activity profile centered at any point within the array.

5.1 Optimal Synaptic Weights in a Recurrent Architecture. The dynamics of the network are determined by the equation

τ dr_i/dt = −r_i + h(Σ_j W_ij r_j) + η_i,   (5.1)
where τ = 10 is the integration time constant, r_i is the response of neuron i, and h is the activation function of the cells, which relates total current to firing rate. The sigmoid function h(x) = 1/(1 + exp(−x)) is used, but this choice is not critical. As before, η_i represents the response fluctuations, which are drawn independently for each neuron in every time step. In this case, they are gaussian, with zero mean and a variance σ_r^2/Δt. The variance of η_i is divided by the integration time step Δt to guarantee that the variance of the rate r_i remains independent of the time step (van Kampen, 1992). For our purposes, manipulating this type of network is easier if the equations are expressed in terms of the total input currents to the cells (Hertz et al., 1991; Dayan & Abbott, 2001). If the current for neuron i is u_i = Σ_j W_ij r_j, then

τ du_i/dt = −u_i + Σ_j W_ij (h(u_j) + η_j),   (5.2)
is equivalent to equation 5.1. A stationary solution of equation 5.2 without input noise is such that all derivatives become zero. This corresponds to an attractor state α for which

u_i^α = Σ_j W_ij h(u_j^α).   (5.3)
Figure 6: Steady-state responses of a recurrent neural network with 20 neurons. Results show the input currents of all units after 1000 ms of simulation time, with responses evolving according to equation 5.2. Each neuron is labeled by an angle between −180 degrees and 180 degrees. (A) Steady-state responses for four sets of initial conditions with peaks near units −90 degrees, 0 degrees, 90 degrees, and 180 degrees. The observed activity profiles are indistinguishable from the desired gaussian curves. Neither synaptic nor response noise was included in this example. (B) Steady-state responses with and without noise. The desired activity profile is indicated by the solid line. The dotted line corresponds to the activity observed with noise after 1000 ms of simulation time, having started with an initial condition equal to the desired steady state. Vertical lines indicate the locations of the corresponding centers of mass. The absolute deviation is 34 degrees. Here, σ_r = 0.3 and p_W = 0.02.
The label α is used because the network may have several attractors or sets of fixed points. The desired steady-state currents are denoted as U_i^α. These are gaussian profiles of activity such that during steady state α = 1, neuron 1 is the most active (i.e., the gaussian is centered at neuron 1), during steady state α = 2, neuron 2 is the most active, and so on. Figure 6 illustrates the activity of the network at four steady states in the absence of noise (σ_W = 0 = σ_r). To make the network symmetric, the neurons were arranged in a ring, so their activity profiles wrap around. Because of this, each neuron is labeled with an angle. The observed currents u_i settle down at values that are almost exactly equal to the desired ones, U_i^α. The synaptic connections that achieved this match were found by enforcing the steady-state condition 5.3 for the desired attractors. That is, we minimized

E = (1/N_A) Σ_{α=1}^{N_A} Σ_i (U_i^α − Σ_j W_ij h(U_j^α))^2,   (5.4)
where U_i^α is a (wrap-around) gaussian function of i centered at α and N_A is the number of attractors; in the simulations, N_A is always equal to the number of neurons, N. This procedure leads to an expression for the optimal weights equivalent to equation 2.4. Thus, without response noise,

W = L C^{−1},   (5.5)
where

L_ij = (1/N_A) Σ_α U_i^α h(U_j^α),
C_ij = (1/N_A) Σ_α h(U_i^α) h(U_j^α).   (5.6)
To include the effects of response noise, we add a correction to the diagonal of the correlation matrix, as in the previous cases (see section 3.2). We thus set

C_ij = (1/N_A) Σ_α h(U_i^α) h(U_j^α) + δ_ij a σ_r^2/(2τ),   (5.7)
where a is a proportionality constant. The rationale for this is as follows. Strictly speaking, equation 5.2 with response noise does not have a steady state. But consider the simpler case of a single variable u with a constant asymptotic value u_∞, such that

τ du/dt = −u + u_∞ + η.   (5.8)
If the trajectory u(t) from t = 0 to t = T is calculated many times, starting from the same initial condition, the distribution of end points u(T) has a well-defined mean and variance, which vary smoothly as functions of T. The mean is always equal to the end point that would be observed without noise, whereas for T much longer than the integration time constant τ, the variance is equal to the variance of the fluctuations on the right-hand side of equation 5.8 divided by 2τ (van Kampen, 1992). These considerations suggest that we minimize

E = (1/N_A) Σ_{α,i} ⟨(U_i^α − Σ_j W_ij (h(U_j^α) + a η̃_j))^2⟩,   (5.9)
where the variance of η̃_j is σ_r^2/(2τ). This leads to equation 5.5 with the corrected correlation matrix given by equation 5.7.
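Equations 5.5 through 5.7 can be sketched as follows. This is a minimal illustration, not the authors' code: the amplitude and width of the desired gaussian current profiles are assumptions chosen only to resemble Figure 6.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def optimal_weights(U, sigma_r, a=0.5, tau=10.0):
    """Optimal recurrent weights, equations 5.5-5.7.

    U has shape (N_A, N); row alpha holds the desired steady-state currents
    U_i^alpha.  The diagonal term a * sigma_r**2 / (2 * tau) implements the
    response-noise correction of equation 5.7."""
    N_A, N = U.shape
    H = sigmoid(U)                      # h(U_j^alpha)
    L = U.T @ H / N_A                   # equation 5.6
    C = H.T @ H / N_A + np.eye(N) * a * sigma_r**2 / (2.0 * tau)
    return L @ np.linalg.inv(C)         # W = L C^{-1}, equation 5.5

# Desired attractors: wrap-around gaussian current profiles, one centered
# at each neuron (amplitude and width here are illustrative assumptions).
N = 20
idx = np.arange(N)
dist = np.abs(idx[:, None] - idx[None, :])
dist = np.minimum(dist, N - dist)                   # ring topology
U = 4.0 * np.exp(-dist**2 / (2.0 * 3.0**2)) - 2.0   # currents in [-2, 2]

W = optimal_weights(U, sigma_r=0.3)
# Check the steady-state condition 5.3: U should approximately equal
# h(U) W^T, row by row (one row per attractor).
residual = float(np.max(np.abs(sigmoid(U) @ W.T - U)))
```

For smooth attractor profiles the matrix C can be nearly singular, so the diagonal term a σ_r^2/(2τ) also regularizes the inversion; this is one way to see why taking response noise into account stabilizes the resulting weights.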
5.2 Performance of the Attractor Network. To evaluate the performance of this network, we compare the center of mass of the desired activity profile to that of the observed profile tracked during a period of time. For a particular attractor α, the network is first initialized very close to that desired steady state; then equation 5.2 is run for 1000 ms (100 time constants τ), and the absolute difference between the initial and the current centers of mass is recorded during the last 500 ms. The error for the recurrent networks, E_rec, is defined as the absolute difference averaged over this time period and all attractor states, that is, all values of α. Also, when there is synaptic noise, an additional average over networks is performed. This error function is similar to equation 4.7, except that the circular topology is taken into account. Thus, E_rec is the mean absolute difference between desired and observed centers of mass. It is expressed in degrees. Before exploring the interaction between synaptic and response noise, we used E_rec to test whether the noise-dependent correction to the correlation matrix in equation 5.7 was appropriate. To do this, a recurrent network without synaptic fluctuations was simulated multiple times with different values of the parameter a and various amounts of response noise. The desired attractors were kept constant. The resulting error curves are shown in Figure 7A. Each one gives the average absolute deviation between desired and observed centers of mass as a function of σ_r for a different value of a. The dependence on a was nonmonotonic. The optimal value we found was 0.5, which corresponds to the lowest curve (dashed) in the figure. This curve was well below the one observed without adjusting the synaptic weights. Therefore, the correction was indeed effective. Figure 7B shows E_rec as a function of σ_r when synaptic noise is also present in the recurrent network.
The three solid curves correspond to nets in which synapses were randomly eliminated with probabilities p_W = 0.005, 0.015, and 0.025. As with previous network architectures, a nonzero amount of response noise improves performance relative to the case where no response noise is injected. In this case, however, the mean absolute error is already about 25 degrees at the point at which response noise starts making a difference, around p_W = 0.005 (see Figure 7C). This is not surprising: these types of networks are highly sensitive to changes in their synapses, so even small mismatches can lead to large errors (Seung, Lee, Reis, & Tank, 2000; Renart, Song, & Wang, 2003). Also, Figure 7C shows that the ratio E_min/E_0 does not fall below 0.6, so the benefit of noise is not as large as in previous examples. The effect was somewhat weaker when synaptic variability was simulated using gaussian noise with SD σ_W instead of random synaptic elimination. Nevertheless, it is interesting that the interaction between synaptic and response noise is observed at all under
Figure 7: Interaction between synaptic and response noise in recurrent networks. (A) Average absolute difference between desired and observed centers of mass as a function of σ_r. Units are degrees. The different curves are for a = 0, 1.5, 1, and 0.5, from left to right. The lowest curve (dashed) was obtained with a = 0.5, confirming that the synaptic weights are optimized when response noise is taken into account. (B) Average error E_rec as a function of response noise. Continuous lines are for three probabilities of weight elimination, p_W = 0.005, 0.015, and 0.025; the dashed line corresponds to p_W = 0. Here and in the following panels, a = 0.5. (C) E_min/E_0 (left y-axis) and E_min (right y-axis) as functions of p_W. (D) Optimal response noise SD, σ_min, as a function of p_W for the same data in C.
these conditions, given that the response dynamics are richer and that the minimization of equation 5.9 may not be the best way to produce the desired steady-state activity.

6 Discussion

6.1 Why are Synaptic and Response Fluctuations Equivalent? We have investigated the simultaneous action of synaptic and response fluctuations on the performance of neural networks and found an interaction or
equivalence between them: when synaptic noise is multiplicative, its effect is similar to that of response noise. At heart, this is a simple consequence of the product of responses and synaptic weights contained in most neural models, which has the form Σ_j W_j r_j. With multiplicative noise in one of the variables, this weighted sum turns into Σ_j W_j (1 + ξ_j) r_j, which is the same whether it is the synapse or the response that fluctuates. In either case, the total stochastic component Σ_j W_j ξ_j r_j scales with the synaptic weights. The same result is obtained with additive response noise. Additive synaptic noise behaves differently, however. It instead leads to a total fluctuation Σ_j ξ_j r_j that is independent of the mean weights. Evidently, in this case, the mean values of the weights have no effect on the size of the fluctuations. Thus, the key requirement for some form of equivalence between the two noise sources is that the synaptic fluctuations must depend on the strength of the synapses. This condition was applied to the three sets of simulations presented above, which corresponded to the classification of arbitrary response patterns, a sensorimotor transformation, and the generation of multiple self-sustained activity profiles. This selection of problems was meant to illustrate the generality of the observations outlined in the above paragraph. And indeed, although the three problems differed in many respects, the results were qualitatively the same. We should also point out that in all the simulations, the criterion used to determine the optimality of the synaptic weights was based on a mean square error. But perhaps the noise interaction changes when a different criterion is used. To investigate this, we performed additional simulations of the small 2×1 network in which the optimal synaptic weights were those that minimized a mean absolute deviation; thus, the square in equation 2.2 was substituted with an absolute value.
In this case, everything proceeded as before, except that the mean weight values W had to be found numerically. For this, the averages were performed explicitly, and the downhill simplex method was used to search for the best weights (Press, Teukolsky, & Vetterling, 1992). The results, however, were very similar to those in Figure 2A. Although the shapes of the curves were not exactly the same, the relative and minimum errors found with the absolute value criterion varied as functions of σ_W very much as they did with the mean square error criterion. Therefore, our conclusions do not seem to depend strongly on the specific function used to weight the errors and find the best synaptic connection values.

6.2 When Should Response Noise Increase? According to the argument above, the most general way to state our results is this: assuming that neuronal activities are determined by weighted sums, any mechanism that is able to dampen the impact of response noise will automatically reduce the impact of multiplicative synaptic noise as well. Furthermore, we suggest that under some circumstances, it is better to add more response noise
G. Basalyga and E. Salinas
and increase the dampening factor than ignore the synaptic fluctuations altogether. There are two conditions for this scenario to make sense. First, the network must be highly sensitive to changes in connectivity. This can be seen, for instance, in Figure 3A, which shows that the highest benefit of response noise occurs when the number of neurons matches the number of conditions to be satisfied; it is at this point that the connections need to be most accurate. Second, the fluctuations in connectivity cannot be evaluated directly. That is, why not take into account the synaptic noise in exactly the same way as the response noise when the optimal connections are sought? For example, the average in equation 2.3 could also include an average over networks (synaptic fluctuations), in which case the optimal mean weights would depend not only on σr but also on σW . In the simulations, this could certainly be done and would lead to smaller errors. But we explicitly consider the possibility that either σW is unknown a priori, or there is no separate biophysical mechanism for implementing the corresponding corrections to the synaptic connections. Condition 2 is not unreasonable. Realistic networks with high synaptic plasticity must incorporate mechanisms to ensure that ongoing learning does not disrupt their previously acquired functionality. Thus, synaptic modification rules need to achieve two goals: establish new associations that are relevant for the current behavioral task and make adjustments to prevent interference from other, future associations. The latter may be particularly difficult to achieve if learning rates change unpredictably with time. 
It is not clear whether plausible (e.g., local) synaptic modification mechanisms could solve both problems simultaneously (see Hopfield & Brody, 2004), but the results suggest an alternative: synaptic modification rules could be used exclusively to learn new associations based on current information, whereas response noise could be used to indirectly make the connectivity more robust to synaptic fluctuations. Although this mechanism evidently does not solve the problem of combining multiple learned associations, it might alleviate it. Its advantage is that, assuming that neural circuits have evolved to adaptively optimize their function in the face of true noise, simply increasing their response variability would generate synaptic connectivity patterns that are more resistant to fluctuations. 6.3 When Is Synaptic Noise Multiplicative? The condition that noise should be multiplicative means that changes in synaptic weight should be proportional to the magnitude of the weight. Evidently not all types of synaptic modification processes lead to fluctuations that can be statistically modeled as multiplicative noise; for instance, saturation may prevent positive increases, thus restricting the variability of strong synapses. However, synaptic changes that generally increase with initial strength should be reasonably well approximated by the multiplicative model. Random synapse elimination fits this model because if a weak synapse disappears, the change is small, whereas if a strong synapse disappears, the change is large. Thus,
Neuronal Variability Counteracts Synaptic Noise
the magnitude of the changes correlates with initial strength. Another procedure that corresponds to multiplicative synaptic noise is this. Suppose the size of the synaptic changes is fixed, so that weights can vary only by ±δw, but suppose also that the probability of suffering a change increases with initial synaptic strength. In this case, all changes are equal, but on average, a population of strong synapses would show higher variability than a population of weak ones. In simulations, the disruption caused by this type of synaptic corruption is indeed lessened by response noise (data not shown). 7 Conclusion To summarize, the scenario we envision rests on five critical assumptions: (1) the activity of each neuron depends on synaptically weighted sums of its (noisy) inputs, (2) network performance is highly sensitive to changes in synaptic connectivity, (3) synaptic changes unrelated to a function that has already been learned can be modeled as multiplicative noise, (4) synaptic modification mechanisms are able to take into account response noise, so synaptic strengths are adjusted to minimize its impact, but (5) synaptic modification mechanisms do not directly account for future learning. Under these conditions, our results suggest that increasing the variability of neuronal responses would, on average, result in more accurate performance. Although some of these assumptions may be rather restrictive, the diversity of synaptic plasticity mechanisms together with the high response variability observed in many areas of the brain make this constructive noise effect worth considering. Acknowledgments This research was supported by NIH grant NS044894. References Andersen, R. A., Essick, G. K., & Siegel, R. M. (1985). Encoding of spatial location by posterior parietal neurons. Science, 230, 450–458. Ben-Yishai, R., Bar-Or, R. L., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848. Bishop, C. M. (1995). 
Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7, 108–116. Brotchie, P. R., Andersen, R. A., Snyder, L. H., & Goodman, S. J. (1995). Head position signals used by parietal neurons to encode locations of visual stimuli. Nature, 375, 232–235. Carpenter, G. A., & Grossberg, S. (1987). ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26, 4919–4930.
Compte, A., Brunel, N., Goldman-Rakic, P., & Wang, X.-J. (2000). Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cerebral Cortex, 10, 910–923. Crist, R. E., Li, W., & Gilbert, C. (2001). Learning to see: Experience and attention in primary visual cortex. Nature Neuroscience, 4(4), 519–525. Dayan, P., & Abbott, L. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press. Dean, A. (1981). The variability of discharge of simple cells in the cat striate cortex. Exp. Brain Res., 44, 437–440. Gammaitoni, L., Hänggi, P., Jung, P., & Marchesoni, F. (1998). Stochastic resonance. Rev. Mod. Phys., 70, 223–287. Golub, G. H., & van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore, MD: Johns Hopkins University Press. Hansel, D., & Sompolinsky, H. (1998). Modeling feature selectivity in local cortical circuits. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (pp. 499–567). Cambridge, MA: MIT Press. Haykin, S. (1999). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall. Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley. Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40, 185–234. Holt, G. R., Softky, W. R., Koch, C., & Douglas, R. J. (1996). Comparison of discharge variability in vitro and in vivo in cat visual cortex neurons. Journal of Neurophysiology, 75, 1806–1814. Hopfield, J. J., & Brody, C. D. (2004). Learning rules and network repair in spike-timing–based computation networks. Proc. Natl. Acad. Sci. USA, 101, 337–342. Kilgard, M. P., & Merzenich, M. M. (1998). Plasticity of temporal information processing in the primary auditory cortex. Nature Neuroscience, 1, 727–731. Laing, C. R., & Chow, C. C. (2001). Stationary bumps in networks of spiking neurons.
Neural Computation, 13(7), 1473–1494. Levin, J. E., & Miller, J. P. (1996). Broadband neural encoding in the cricket cercal sensory system enhanced by stochastic resonance. Nature, 380, 165–168. McCloskey, M., & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24, 109–165. Murray, A. F., & Edwards, P. J. (1994). Enhanced MLP performance and fault tolerance resulting from synaptic weight noise during training. IEEE Transactions on Neural Networks, 5(5), 792–802. Nozaki, D., Mar, D. J., Grigg, P., & Collins, J. J. (1999). Effects of colored noise on stochastic resonance in sensory neurons. Physical Review Letters, 82, 2402–2405. Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9, 222–237. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C. Cambridge: Cambridge University Press.
Renart, A., Song, P., & Wang, X. J. (2003). Robust spatial working memory through homeostatic synaptic scaling in heterogeneous cortical networks. Neuron, 38, 473–485. Salinas, E. (2003). Background synaptic activity as a switch between dynamical states in a network. Neural Computation, 15(7), 1439–1475. Salinas, E. (2004). Context-dependent selection of visuomotor maps. BMC Neuroscience, 5(1), 47. Salinas, E., & Abbott, L. F. (1995). Transfer of coded information from sensory to motor networks. Journal of Neuroscience, 15, 6461–6474. Salinas, E., & Sejnowski, T. J. (2001). Gain modulation in the central nervous system: Where behavior, neurophysiology and computation meet. Neuroscientist, 2, 539–550. Salinas, E., & Thier, P. (2000). Gain modulation: A major computational principle of the central nervous system. Neuron, 27, 15–21. Seung, H. S., Lee, D. D., Reis, B. Y., & Tank, D. W. (2000). Stability of the memory of eye position in a recurrent network of conductance-based model neurons. Neuron, 26, 259–271. Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579. Softky, W. P., & Koch, C. (1992). Cortical cells should fire regularly, but do not. Neural Computation, 4(5), 643–646. Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience, 13, 334–350. Stevens, C. F., & Zador, A. M. (1998). Input synchrony and the irregular firing of cortical neurons. Nature Neuroscience, 1, 210–217. Turrigiano, G. G., & Nelson, S. B. (2000). Hebb and homeostasis in neuronal plasticity. Curr. Opin. Neurobiol., 10, 358–364. van Kampen, N. G. (1992). Stochastic processes in physics and chemistry. Amsterdam: Elsevier. Vilar, J. M. G., & Rubi, J. M. (2000). Scaling of noise and constructive aspects of fluctuations. Berlin: Springer-Verlag. Wang, X., Merzenich, M. M., Sameshima, K., & Jenkins, W. (1995).
Remodelling of hand representation in adult cortex determined by timing of tactile stimulation. Nature, 378, 71–75. Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. Journal of Neuroscience, 16(6), 2112–2126. Zipser, D., & Andersen, R. A. (1988). A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331, 679–684.
Received July 15, 2005; accepted September 16, 2005.
LETTER
Communicated by Harel Shouval
Strongly Improved Stability and Faster Convergence of Temporal Sequence Learning by Using Input Correlations Only Bernd Porr [email protected] Department of Electronics and Electrical Engineering, University of Glasgow, Glasgow, GT12 8LT, Scotland
Florentin Wörgötter [email protected] Department of Psychology, University of Stirling, Stirling FK9 4LA, Scotland, and Bernstein Center of Computational Neuroscience, University of Göttingen, Germany
Currently all important, low-level, unsupervised network learning algorithms follow the paradigm of Hebb, where input and output activity are correlated to change the connection strength of a synapse. As a consequence, however, classical Hebbian learning always carries a potentially destabilizing autocorrelation term, which is due to the fact that every input is reflected in a weighted form in the neuron's output. This self-correlation can lead to positive feedback, where increasing weights will increase the output, and vice versa, which may result in divergence. This can be avoided by different strategies like weight normalization or weight saturation, which, however, can cause different problems. Consequently, in most cases, high learning rates cannot be used for Hebbian learning, leading to relatively slow convergence. Here we introduce a novel correlation-based learning rule that is related to our isotropic sequence order (ISO) learning rule (Porr & Wörgötter, 2003a), but replaces the derivative of the output in the learning rule with the derivative of the reflex input. Hence, the new rule uses input correlations only, effectively implementing strict heterosynaptic learning. This looks like a minor modification but leads to dramatically improved properties. Elimination of the output from the learning rule removes the unwanted, destabilizing autocorrelation term, allowing us to use high learning rates. As a consequence, we can mathematically show that the theoretical optimum of one-shot learning can be reached under ideal conditions with the new rule. This result is then tested against four different experimental setups, and we show that in all of them, very few (and sometimes only one) learning experiences are needed to achieve the learning goal. As a consequence, the new learning rule is up to 100 times faster and in general more stable than ISO learning.
Neural Computation 18, 1380–1412 (2006)
© 2006 Massachusetts Institute of Technology
Input Correlation Learning
1 Introduction Probably all existing correlation-based learning algorithms rely currently on Donald Hebb's (1949) famous paradigm that connections between network units should be strengthened if the two connected units are simultaneously active (Oja, 1982; Kohonen, 1988; Linsker, 1988). The Hebb rule can be formalized as

dρ_j/dt = µ u_j f(v),    (1.1)
where ρ_j is the connection strength and the output is calculated from the weighted sum v = Σ_j ρ_j u_j. The factor µ is called the learning rate. The linear operator f is just the identity, f(v) = v, for classical Hebbian learning (Hebb, 1949), and it is the derivative, f(v) = dv/dt, for differential Hebbian learning (Kosco, 1986). In spite of their success, Hebbian-type learning algorithms can be unstable because of the autocorrelation term in the learning rule. This can be seen if we replace v in equation 1.1 by the weighted sum. Apart from the cross-correlation terms, we get dρ_j/dt ∝ µ ρ_j u_j f(u_j). Hebbian learning is stable only if this autocorrelation term is zero or can be compensated for by additional measures (Oja, 1982; Bienenstock, Cooper, & Munro, 1982; Miller, 1996b; Porr & Wörgötter, 2003a). In the general case, however, this term leads to an exponentially growing instability and network divergence. Hebb rules have been employed in a wide variety of unsupervised learning tasks, and previously we have focused on the specific problem of temporal sequence learning (Porr & Wörgötter, 2001, 2003a). In this case, two (or more) signals exist that are correlated to each other, but with certain delays between them. In real life, this can happen, for example, when heat radiation precedes a pain signal when touching a hot surface, or when the smell of prey arrives before the predator is close enough to see it hiding in the shrubs. Such situations occur often during the lifetime of a creature, and in these cases it is advantageous to learn to react to the earlier stimulus instead of having to wait for the later signal. Temporal sequence learning enables the animal to react to the earlier stimulus. Thus, the animal learns an anticipatory action to avoid the late, unwanted stimulus.
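The autocorrelation instability described above can be made concrete with a minimal sketch (illustrative, not a simulation from the paper): a single unit with output v = ρu trained with the classical Hebb rule. Substituting v into the rule exposes the term µρu², which drives exponential weight growth:

```python
# Classical Hebb rule d(rho)/dt = mu * u * v for a single unit v = rho * u.
# Substituting v exposes the autocorrelation term mu * rho * u**2.
mu = 0.01
u = 1.0            # constant input, for clarity
rho = 0.1
history = [rho]
for _ in range(1000):
    v = rho * u                # output is the weighted input
    rho += mu * u * v          # Hebb: correlate input with output
    history.append(rho)

# rho(n) = rho(0) * (1 + mu * u**2)**n, i.e. unbounded exponential growth
print(history[-1])
```

Each update multiplies the weight by (1 + µu²), so without normalization or saturation the weight diverges, which is exactly the instability that the input-only correlation of ICO learning removes.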
From a more theoretical perspective, such situations are related to classical or instrumental conditioning, and in early studies, correlation-based, stimulus-substitution models were used to address the problem of how to learn such sequences (Sutton & Barto, 1981). Soon, however, these methods were superseded by reinforcement learning algorithms (Sutton, 1988; Watkins, 1989; Watkins & Dayan, 1992), partly because those algorithms had favorable mathematical properties (Dayan & Sejnowski, 1994) and partly because convergent learning could be achieved in behaving systems (Kaelbling, Littman, &
B. Porr and F. Wörgötter
Moore, 1996). Relations to biophysics, however, seem to exist more to the dopaminergic reward-based learning system (Schultz, Dayan, & Montague, 1997) than to (differential) Hebbian learning through long-term potentiation (LTP) at glutamatergic synapses (Malenka & Nicoll, 1999; for a review, see Wörgötter & Porr, 2005). Therefore, in a series of recent articles, we have tried to show that it is possible to solve reinforcement learning tasks by correlation-based (Hebbian) rules, realizing that such tasks can often be embedded into the framework of sequence learning, which allows for a Hebbian formalism (Porr & Wörgötter, 2003a, 2003b). However, we had to discover that the Hebbian learning rule, which we had designed to address problems of temporal sequence learning, produces exactly the same autocorrelation instability that often prevented convergence. To solve this problem, in this study we present a novel, heterosynaptic learning rule that allows implementation of fast and stable learning. This learning rule has been derived from isotropic sequence order (ISO) learning (Porr & Wörgötter, 2003a), which belongs to the class of differential Hebbian learning rules (Kosco, 1986). ISO learning, however, suffers from the problem discussed above. It too contains the destabilizing autocorrelation term; only for the limiting case of µ → 0 have we been able to prove that this term vanishes (Porr & Wörgötter, 2003a), but only when using a set of orthogonal input filters. However, a very simple alteration of ISO learning eliminates its autocorrelation term completely: if we correlate only inputs with each other, this term no longer exists. More specifically, we define an error signal at one of the inputs and correlate this error signal with the other inputs. Consequently, our rule can be used in applications where such an error signal can be identified, which is the case, in particular, in closed-loop feedback control.
In this study, we first derive the convergence properties of input correlation (ICO) learning, showing that one-shot learning is the theoretical limit for the learning rate. As an additional advantage, it will become clear that input filtering does not rely on orthogonal filters at the different inputs. Any input characteristic will suffice as long as the whole system contains an (additional) low-pass filter component. This, however, can also come from the transfer function of the environment in which the learning system is embedded. The advantage of now being able to choose almost arbitrary input filters will for the first time also allow approximating far more complex (e.g., nonlinear) output characteristics than was possible with ISO learning. In the second part of this study, we compare ICO learning with its equivalent differential Hebbian learning rule: the ISO learning rule. This comparison, performed on a simulated and real benchmark test, will demonstrate that input correlation learning is indeed much faster and more stable than the older ISO learning. Finally, we will present a set of experiments from different application domains showing that one-shot learning can be approached when using the ICO rule. These applications have been
specifically chosen to raise confidence that ICO learning can be applied in a variety of different situations. 2 Input Correlation Learning 2.1 The Neural Circuit. Figure 1A shows the basic components of the neural circuit. In contrast to Porr and Wörgötter (2003a), we will employ here the z-transform instead of the Laplace transform for the mathematical formalism. This is because the z-space provides a simple way to express the correlation and thus allows a straightforward proof of convergence and stability (see also the appendix). The learner consists of two inputs, x0 and x1, which are filtered with functions h,

u_0 = x_0 ∗ h_0,   u_j = x_1 ∗ h_j,    (2.1)
where the signal x1 is filtered by a filter bank of N filters, which are indexed by j. The filter functions h1, . . . , hN represent a filter bank with different characteristics so that it is possible to generate complex-shaped responses (Grossberg, 1995). The filtered inputs uk converge onto a single learning unit with weights ρk, and its output is given by

v = Σ_{k=0}^{N} ρ_k u_k.    (2.2)
The output will determine the behavior of the system, but not its learning. To make ICO learning comparable with ISO learning, for h we will use mostly resonators, as in our previous work. We will, however, also employ other filter functions if applicable. In discrete time, the resonator responses are given by

h(n) = (1/b) e^{an} sin(bn)  ↔  H(z) = 1 / [(z − e^p)(z − e^{p∗})],    (2.3)

where p∗ is the complex conjugate of p. Note that z-transformed functions are denoted by capital letters or as ρ(z) in the case of Greek letters. The index for the time steps is n. The real and imaginary parts of p are defined as a = Re(p) = −π f/Q and b = Im(p) = √((2π f)² − a²), respectively, which is the definition for continuous time. The transformation into discrete time is performed by the exponential e^p in equation 2.3, which is called the impulse invariance method. The parameter 0 ≤ f < 0.5 is the frequency of the resonator normalized to a sampling rate of one. The so-called quality
Figure 1: Circuit and weight change. (A) General form of the neural circuit in an open-loop condition. Inputs xk are filtered by resonators with impulse response hk and summed at v with weights ρk. The symbol d/dt denotes the derivative. The amplifier symbol denotes a changeable synaptic weight, ⊗ is a correlator, and Σ is a summation node. The filters h1, . . . , hN form a filter bank to cover a wider range of temporal differences between the inputs. (B) Weight change curve. Shown is the weight change for two identical resonators h0, h1 with Q = 0.51, f = 0.01. The two inputs x0 and x1 receive delta pulses x1(n) = δ(n) and x0(n) = δ(n − T). The temporal difference between the inputs is T. The resulting weight change after infinite time is Δρ. (C) Behavior of the weight ρ1 for ICO learning as compared to ISO learning. Pairs of delta pulses are applied as in B. The time between the delta pulses was set to T = 25. The pulse sequence was repeated every 2000 time steps until step 100,000. After step 100,000, only input x1 receives delta pulses. The learning rate was µ = 0.001.
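The weight-change behavior of Figure 1B can be reproduced qualitatively with a short numerical sketch. This is an illustrative reimplementation, not the authors' code: it uses the resonator of equation 2.3 with the legend's parameters (Q = 0.51, f = 0.01) and the learning rule of equation 2.4 with a single filter (N = 1):

```python
import numpy as np

# Resonator impulse response h(n) = (1/b) e^{an} sin(bn), equation 2.3.
f, Q = 0.01, 0.51
a = -np.pi * f / Q                          # a = Re(p)
b = np.sqrt((2 * np.pi * f) ** 2 - a ** 2)  # b = Im(p)
n = np.arange(500)
h = np.exp(a * n) * np.sin(b * n) / b

# Delta-pulse inputs: x1 precedes x0 by T time steps (the sequence x1 -> x0).
T = 25
x1 = np.zeros(500); x1[0] = 1.0             # early (predictive) input
x0 = np.zeros(500); x0[T] = 1.0             # late (reflex) input

u1 = np.convolve(x1, h)[:500]               # u1 = x1 * h1
u0 = np.convolve(x0, h)[:500]               # u0 = x0 * h0
du0 = np.diff(u0, prepend=0.0)              # discrete derivative of u0

# ICO rule, equation 2.4: d(rho1)/dt = mu * u1 * d(u0)/dt, summed over time.
mu = 0.001
rho1 = mu * np.sum(u1 * du0)
print(rho1)                                  # positive: weight grows for T > 0
```

Swapping the roles of the two pulses (the sequence x0 → x1) makes the accumulated change negative, matching the antisymmetric shape of the weight-change curve in Figure 1B.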
Q > 0.5 of the resonator defines the decay rate. We will mostly employ a very low quality (Q = 0.6), which results in a rapid decay. 2.2 The Learning Rule. The learning rule for the weight change ρ_j is

dρ_j/dt = µ u_j du_0/dt,   j > 0,    (2.4)
where only input signals are correlated with each other. Comparing equation 1.1 with the new learning rule, we see that the output v has been replaced by the input u0. The derivative indicates that the learning rule implements differential learning (Kosco, 1986). Thus, we have differential heterosynaptic learning. Weight changes can be calculated by correlating the resonator responses of H0 and H1 in the z-domain. In the open-loop case, this is straightforward and differs only formally from the Laplace domain used in Porr and Wörgötter (2003a), yielding the same weight change curves. Figure 1B shows the weight change curve for N = 1, H0 = H1 (for parameters, see the figure legend). Weights increase for T > 0 and decrease for T < 0, which means that a sequence of events x1 → x0 leads to a weight increase at ρ1, whereas the reverse sequence x0 → x1 leads to a decrease. Thus, learning is predictive in relation to the input x0. Weights stabilize if the input x0 is set to a constant value (or if x1 is set to zero). Figure 1C shows the behavior of ICO learning as compared to ISO learning in the open-loop case for the relatively high learning rate of µ = 0.001. One clearly sees that ISO learning contains an exponential instability, which leads to an upward bend in the straight line and prevents weight stabilization even when setting x0 = 0 at time step 100,000. This is different for ICO learning, which does not contain this instability. 3 ICO Learning Embedded in the Environment 3.1 The Closed-Loop Circuit: General Setup and Learning Goal. ICO learning is designed for a closed-loop system where the output of the learner v feeds back to its inputs xj after being modified by the environment. The resulting structure (see Figure 2), similar to that described in Porr, von Ferber, and Wörgötter (2003), is that of a subsumption architecture where we start with an inner feedback loop, which is superseded by an outer loop (Brooks, 1991).
(For a more detailed discussion of such nested structures, we refer to Porr et al., 2003.) 3.1.1 Feedback Loop. Initially only a stable inner reflex, or feedback loop, exists, which is established by the transfer function of the organism H0, the transfer function of the environment P0, the weight ρ0 ≠ 0, and the (here constant) set point SP. Such a reflex could, for example, be the retraction
Figure 2: ICO learning embedded in its environment. Σ is a linear summation unit. Except for the constant set point SP, the inside of the organism resembles ICO learning, as shown in Figure 1A, but here shown without the filter bank and transformed into the z-domain. D is a disturbance, which is delayed by T time steps. The term z − 1 denotes a derivative in the z-domain. Transfer functions P0 and P1 represent the environment and establish the feedback from the motor output v to the sensor inputs X0 and X1. S0 represents the input before subtracting the set point.
reaction of an animal when touching a hot surface. In such an avoidance scenario, X0 would represent the input to a pain receptor, with a desired state of SP = 0. Hence, a correctly designed reflex will indeed reestablish this desired state, but only in a reactive way, that is, only after the disturbance D has upset the state at X0 for a short while. The delay parameter z^{-T} is introduced here to define the timing relation between the late inner loop and the early (predictive) outer loop. Thus, the transfer function H0 establishes a fixed reaction of the organism by transferring sensor inputs into motor actions. The transfer function P0 establishes the environmental feedback from the motor output to the sensor input of the organism. The goal of the feedback loop is to keep S0 at the set point SP as precisely as possible. In this context, X0 can be understood as an error signal that has to be minimized. Without losing generality, we will set the set point SP for
all theoretical derivations from now on to zero (SP = 0), which means that S0 = X0 , and we interpret the sensor input as the error signal. 3.1.2 Learning Goal. We are going to explain now how learning is achieved. Initially the outer loop, formed by H1 , P1 , is inactive because ρ1 = 0. It receives the disturbance D at sensor input X1 earlier than the inner loop. In our example, one could think of a heat radiation signal that is felt before touching the hot surface. However, a naive system will not react in the right way, withdrawing the limb before touching, as can be seen in very young children, who will hurt themselves in such a situation. Hence, the learning goal for this system is to increase ρ1 such that an earlier appropriate reaction will be elicited after learning. As a consequence, after learning, X0 will, in an ideal case, never leave the set point again. In a way, one could think of this as the reflex being shifted earlier in time. In the general case, there will be a filter bank where every filter has its own corresponding weight ρ j , j > 0. In the following sections, we will establish the formalism for treating such closed-loop systems and provide a convergence proof. The main result of this section is that we will show that ICO learning approaches one-shot learning in a stable convergence domain provided the inner loop represents a stable feedback controller or, in other words, provided the reflex creates an appropriate and stable reaction. Readers not interested in the mathematical derivations, which rely on the application of some methods from control theory, might consider skipping this section. 3.2 Stability Proof 3.2.1 Responses to a Disturbance. The stability of a feedback system can be evaluated by looking at its impulse response to a disturbance. The actual reaction of the feedback system to a disturbance D can be calculated easily in the z-domain. In the simplest case, the disturbance is a delta pulse, which is just D = 1 in the z-domain. 
In more complex scenarios (as in the experiments), the disturbance is a random event, which we assume to be bounded and stable. Thus, we apply a disturbance D and observe the changes, for example, at the sensor input X0:

X_0 = D z^{-T} P_0 + X_0 H_0 ρ_0 P_0.    (3.1)

We can now solve for X_0 and get

X_0 = D z^{-T} P_0 / (1 − ρ_0 P_0 H_0).    (3.2)
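Reduced to scalar gains, the loop algebra of equations 3.1 and 3.2 can be verified directly: iterating the loop converges to the closed-form expression whenever |ρ0 P0 H0| < 1. The numbers below are illustrative assumptions, not values from the paper:

```python
# Scalar stand-ins for the transfer functions P0 and H0 (z^{-T} drops out
# for a constant disturbance). All values are illustrative.
P0, H0, rho0, D = 0.5, 1.0, -0.8, 1.0   # negative feedback: rho0*P0*H0 = -0.4

x0 = 0.0
for _ in range(200):
    x0 = D * P0 + x0 * H0 * rho0 * P0   # iterate equation 3.1

closed_form = D * P0 / (1 - rho0 * P0 * H0)   # equation 3.2
print(x0, closed_form)                         # both converge to 0.5 / 1.4
```

The iteration is a geometric contraction with ratio ρ0 P0 H0, which is a scalar version of the stability demand made below: the feedback must be designed so that X0 decays after a disturbance.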
This equation provides the response of the feedback loop to a disturbance D. We demand here that the feedback is designed in a way that X0 is stable and always decays to zero after a disturbance has occurred. (For a general stability analysis of feedback loops, see D'Azzo, 1988.) In addition, we introduce

F = X_0 H_0 = z^{-T} P_0 H_0 / (1 − ρ_0 P_0 H_0),    (3.3)
which is the response of the feedback loop at U0 to a delta pulse (D = 1). We will need this term later for the stability analysis. A pure feedback loop cannot maintain the set point all the time because the reaction to a disturbance D by the feedback loop is always too late. Thus, from the point of view of the feedback, it is desirable to predict the disturbance D to preempt the unwanted triggering of the feedback loop (Palm, 2000). Figure 2 accommodates this in the most general way by a formal delay parameter z^{-T}, which ensures that the input x1 receives the disturbance D earlier than input x0. This establishes a second predictive pathway, which is inactive at the start of learning (ρ1 = 0). The learning goal is to find a value for ρ1 so that the learner can use the earlier signal at x1 to generate an anticipatory reaction that prevents x0 from deviating from the set point SP. Generally the predictive pathway is set up as a filter bank where the input x1 feeds into different filters that generate the predictive response. The response of the system to a disturbance D with the predictive pathway can be obtained in the same way as demonstrated for the feedback loop:

X_0 = P_0 (D P_1 Σ_{k=1}^{N} ρ_k H_k + D z^{-T}) / (1 − P_0 ρ_0 H_0).    (3.4)
The goal now is to find a distribution of weights ρk so that the condition X0 = 0 is satisfied all the time. In other words, find weights that ensure that the input X0 never deviates from the set point. 3.2.2 Analysis of Stability • Learning rule in the z-domain. Stability is achieved if the weights ρj converge to a finite value. We will prove stability in the z-domain, which has two advantages: the derivative can be expressed in a very simple form, and the closed loop can be eliminated. The result also provides absolute values of the weights after a disturbance has occurred. Equation 2.4 can be rewritten in the z-domain as

(z − 1) ρ_j(z) = µ [(z − 1) U_0(z)] U_j(z^{-1}),    (3.5)
where (z − 1) is the derivative. Since the z-transform is not such a commonly used formalism, we refer the reader to the appendix for a detailed description of some of the methods used to arrive at equation 3.5. Note that the weight ρ_j(z) is the z-transformed version of ρ_j(t). The change of the weight ρ_j(z) on the left side is expressed in the same way as the derivative on the right side. This formulation also takes into account that any change of the weight ρ_j(z) might have an immediate impact on the values of U0 and U_j. Thus, we do not assume here that learning operates at low learning rates µ. At this point, we allow for any learning rate.

• Calculating the weight. To calculate the weight ρ_j(z), we need the filtered reflex input U0 = X0 H0, which can be directly obtained from equation 3.4. The resulting weight ρ_j(z) can now be evaluated using

ρ_j(z) = µ F (D P1 Σ_{k=1}^{N} ρ_k H_k + D z^{-T}) D^- P1^- H_j^-,   (3.6)

where from now on we abbreviate the time-reversed functions H(z^{-1}) by H^-. Solving for ρ_j(z) gives

ρ_j(z) = (µ F D D^- P1 P1^- Σ_{k=1, k≠j}^{N} ρ_k(z) H_k H_j^- + z^{-T} µ F D D^- P1^- H_j^-) / (1 − µ F D D^- P1 P1^- H_j H_j^-),   (3.7)
which is the value of the weight ρ_j(z) after a disturbance D. To get a better understanding of this equation, we restrict ourselves now to just one filter in the predictive pathway and set N = 1. In that case, the sum in the numerator vanishes to give

ρ1(z) = z^{-T} µ F D D^- P1^- H1^- / (1 − µ F D D^- P1 P1^- H1 H1^-) =: M / K.   (3.8)
Thus, we have a result that can be analyzed for the stability of the weight ρ1(z).

• Stability criterion. A system is bounded-input bounded-output stable if its impulse response y(n) and its corresponding transfer function Y satisfy the following condition,

|Y(e^{iω})| ≤ Σ_{n=−∞}^{+∞} |y(n)| < ∞,   (3.9)
B. Porr and F. Wörgötter
1390
for any ω (Diniz, 2002). In the following discussion, we assume that all functions can be expressed as fractions of polynomials. This is possible as long as the system behaves approximately linearly. Thus, the functions have zeros and poles in the z-domain. To keep the transfer function |Y(e^{iω})| of equation 3.9 bounded, one has to demand that there are no poles on the unit circle. Otherwise we would get unlimited exponential growth over time. Hence, stability analysis requires two components: we need to show that the numerator M in equation 3.8 remains bounded and that the denominator K contains no additional poles.

• Numerator M is bounded. We discuss first the numerator M of equation 3.8. It can be interpreted as a correlation between two signals. The first signal, F D, is the response of the feedback loop F to the disturbance D. The second signal, D P1 H1 = U1, is the response of the predictive pathway P1 H1 to the disturbance D. The question is under which conditions the correlation between these two signals is stable. The first signal is the impulse response of the stable feedback loop F. Stable feedback loops behave like low-pass filters; thus, they generate a damped exponential that decays to a constant value. The second signal, the response of the predictive pathway, will also be dominated by a low-pass characteristic because the filter H1 is, by definition, a resonator with a strong low-pass characteristic. Furthermore, we note that environmental transfer functions (here P1) generically establish a low-pass filter, as discussed in Porr et al. (2003). Both signals, the response of the feedback loop F and the response of the predictive pathway, converge to zero for infinite time. Hence, it can be assumed with great generality that the correlation of these two low-pass signals also converges. Thus, the numerator poses no threat to stability.

• Denominator K has no additional poles.
In the next step, the denominator K has to be assessed: we have to test whether it creates additional poles. The denominator consists of amplitude terms D D^- P1 P1^- H1 H1^- because, in general, |Y(ω)|^2 = Y(z)Y(z^{-1})|_{z=e^{iω}}. These terms are real valued, as is the learning rate, rendering the denominator K of equation 3.8 as

K = 1 − µ F |D P1 H1|^2 |_{z=e^{iω}},   (3.10)
which, for stability, must be unequal to zero to prevent additional poles. Thus, a simple stability criterion can be stated:

max(µ |F| |D P1 H1|^2)|_{z=e^{iω}} < 1   (3.11)
for all ω. If this criterion is maintained, we do not get additional poles. If F, D, and P1 are known, H1 and µ can be designed in such a way that our stability criterion is met.
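Criterion 3.11 can be evaluated numerically once the transfer functions are fixed. In the sketch below, the one-pole low-passes and the disturbance amplitude are illustrative stand-ins rather than the actual plant of this study; the unit circle is sampled and the largest admissible learning rate is read off:

```python
import numpy as np

w = np.linspace(0.0, np.pi, 2001)
z = np.exp(1j * w)

def lp(a):
    """DC-normalized one-pole low-pass H(z) = (1 - a) / (1 - a z^-1)."""
    return (1.0 - a) / (1.0 - a / z)

F = lp(0.90)                 # stable feedback-loop response (assumed)
D = 2.0 * np.ones_like(z)    # disturbance of amplitude 2 (flat spectrum)
P1 = lp(0.80)                # environmental transfer function, low-pass
H1 = lp(0.95)                # predictive filter, modeled as a low-pass

gain = np.abs(F) * np.abs(D * P1 * H1) ** 2
mu_max = 1.0 / gain.max()    # any mu below this satisfies criterion 3.11
print(mu_max)                # 0.25 here: the maximum is at omega = 0
```

Doubling the disturbance amplitude quarters the admissible µ, in line with the observation that larger gains speed up learning but push the effective learning rate toward one.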
Hence, we need to discuss only the one remaining complex function F, which is the impulse response of the feedback loop at U0. This loop is by construction stable. Also, we remember that D P1 H1 is the impulse response of the predictive pathway. Weight change results directly from the product of these two functions weighted by the learning rate (see equation 2.4). The question is: Up to which learning rate does this product obey equation 3.11? We note that if the disturbance D, the gain of the feedback loop F, or the gain of the predictive pathway P1 H1 increases in amplitude, learning becomes faster, and this is permitted as long as the effective learning rate,

µ |F| |D P1 H1|^2 |_{z=e^{iω}} = µ̃ < 1,   (3.12)
is below one. In other words, the system must not produce an overshoot during its first learning experience. On the other hand, this also means that one is allowed to increase µ up to that critical value, which means that one-shot learning can be reached with ICO learning. This is one of the central results of this study.

• Behavior of the final value of ρ1. Equation 3.8 also provides us with the final value of the weight ρ1(z). To gain a better understanding of the result, we multiply equation 3.8 by H1 P1:

ρ1(z) H1 P1 = z^{-T} G / (1 − G),   (3.13)
where the constant G = µ F D D^- P1 P1^- H1 H1^- is the same for the numerator and the denominator. The expression on the right-hand side of equation 3.13 is a formal description of a feedback-controlled amplifier. This amplifier can be inverting or noninverting, depending on the sign of the function G. The sign is determined only by the impulse response of the feedback loop F because all the other terms in G are positive. A second relevant observation is that the term on the left-hand side of equation 3.13 will have the same sign as the feedback reaction F and, because of the delay term z^{-T}, will act at the moment when the feedback would otherwise be triggered.

• More than one filter (N > 1). Having understood the case with just one filter (N = 1), we can now generalize to the case N > 1 and return to equation 3.7. Comparing equation 3.7 with equation 3.8 shows that the stability criteria from the special case also apply to the general case. The denominators are the same in both cases, so the criterion of equation 3.11 still holds. The only difference is the sum over correlations between different resonators (H_k correlated with H_j). The crucial question here is whether the correlation of these resonator responses is stable. The answer is affirmative because the correlation of one resonator H_k with another
one H_j is just the weight change for the case T = 0 of the learning rule (see Figure 1B). This weight change is stable for the same reason as given above: the correlation of two low-pass-filtered delta pulses is bounded. Thus, ICO learning is also stable for a filter bank embedded in a closed loop. The absolute values of the weights in a filter bank are not easy to understand because of the correlations between the filter functions H_j and H_k. These correlations do not play a role after successful learning because then x0 is constant, and therefore any weight change is suppressed anyway.

4 Applications

This section compares the performance of ICO learning with differential Hebbian (ISO) learning and shows that ICO learning can be applied successfully to different application domains. In sections 4.1 and 4.2, we use a biologically inspired task, which will be performed first as a simulation and then as a real robot experiment in which a robot is supposed to retrieve “food disks” (i.e., white paper snippets on the floor). This task is similar to the one described in Verschure, Voegtlin, and Douglas (2003) and to the second experiment in Porr and Wörgötter (2003b). In the simulation, we will compare ISO learning with ICO learning and show that the latter is able to perform one-shot learning under ideal noise-free conditions. The actual robot experiment will show that ICO learning also operates successfully in a physically embodied system where ISO learning fails. Other complex control examples will be presented in the last two experiments using different setups.

4.1 The Simulated Robot. This section presents a benchmark application that compares differential Hebbian (ISO) learning with our new input correlation (ICO) learning. Figure 3A presents the task, where a simulated robot has to learn to retrieve food disks in an arena. The disks also emit simulated sound signals. Two sets of sensor signals are used.
One sensor type (x0 ) reacts to (simulated) touch and the other sensor type (x1 ) to the sound. The actual choice of these modalities is not important for the experiment, but this creates a natural situation where sound precedes touch. Hence, learning must use the sound sensors that feed into x1 to generate an anticipatory reaction toward the food disk (Verschure et al., 2003). The circuit diagram is shown in Figure 3B. The reflex reaction is established by the difference of two touch detectors (LD), which cause a steering reaction toward the white disk. Hence, x0 is a transient signal that occurs only during touching of a disk. As a consequence, x0 is equal to zero if both LDs are not stimulated, which is the trivial case of not touching a disk at all, or when they are stimulated at the same time, which happens during a straight encounter with a disk. The latter situation occurs after successful learning, which leads to the head-on touching of the disks. The reflex has a constant weight ρ0 , which always guarantees a stable reaction. The predictive signal
Figure 3: The robot simulation. (A) The robot has a reflex mechanism (1), which elicits a sharp turn as soon as it touches the disk laterally and thereby pulls the robot into the center of the disk. The disk also emits “sound.” The robot’s task is to use this sound field to find the disk from a distance (2). (B) The robot has two touch detectors (LD), which establish with the filter H0 and the fixed weight ρ0 the reflex reaction x0 = LD_l − LD_r. The difference of the signals from two sound detectors (SD) feeds into a filter bank. The weights ρ1, . . . , ρN are variable and are changed by either ISO or ICO learning. Apart from the reflex reaction at the disk, the robot has a simple retraction mechanism when it collides with a wall (“retraction,” not used for learning). The output v is the steering angle of the robot. (C) Basic behavior until the first learning experience. The trace at ∗ continues in D, where the robot has learned to target the disks from a distance. The example here uses ICO learning with µ = 5 · 10^-5. Other parameters: filters are set to f0 = 0.01 for the reflex and f_j = 0.1/j, j = 1, . . . , 5, for the filter bank, where Q = 0.51 for all filters. The reflex weight was ρ0 = 0.005.
x1 is generated by using two signals coming from the sound detectors (SD). The signal is simply assumed to give the Euclidean distance (r_{r,l→s}) of the left (l) or right (r) microphone from a sound source (s). The difference of the signals from the left and the right microphone, r_{r→s} − r_{l→s}, is a measure of the azimuth of the sound source relative to the robot. Successful learning leads to a turning reaction that balances both sound signals and ideally results in a straight trajectory toward the target disk, ending in a head-on contact.
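The predictive signal described here is just a difference of two Euclidean distances. A minimal sketch (the microphone geometry and the `mic_offset` parameter are hypothetical choices of ours, not specified in the text):

```python
import numpy as np

def azimuth_signal(robot_pos, heading, source_pos, mic_offset=0.5):
    """x1 = r_{r->s} - r_{l->s}: right minus left microphone-to-source distance."""
    robot_pos = np.asarray(robot_pos, dtype=float)
    source_pos = np.asarray(source_pos, dtype=float)
    # unit vector pointing to the robot's left side
    left_dir = np.array([-np.sin(heading), np.cos(heading)])
    mic_l = robot_pos + mic_offset * left_dir
    mic_r = robot_pos - mic_offset * left_dir
    return np.linalg.norm(mic_r - source_pos) - np.linalg.norm(mic_l - source_pos)

# source straight ahead -> balanced signal, no turning reaction
print(azimuth_signal([0, 0], 0.0, [10, 0]))   # 0.0
# source to the robot's left -> positive signal, turn left
print(azimuth_signal([0, 0], 0.0, [0, 10]))   # > 0
```

A zero signal corresponds to the straight, head-on trajectory that successful learning is supposed to produce.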
After the robot encounters a disk, the disk is removed and randomly placed somewhere else. An example of successful learning is presented in Figures 3C and 3D. The robot first bumps into walls. Eventually it drives through the disk, which provides the robot with the first learning experience. In this example, just one experience has been sufficient for successful learning. The trace in Figure 3D continues the trace from Figure 3C. Such one-shot learning can be achieved with ICO learning but not with ISO learning. This will be tested now more systematically by comparing the performance for ISO and ICO learning in a few hundred simulations. We quantify successful and unsuccessful learning for increasing learning rates µ. Learning was considered successful when we received a sequence of four contacts with the disk at a subthreshold value of |x0 | < 0.2. We recorded the actual number of contacts until this criterion was reached. Hence, four contacts represent our statistical threshold for deciding between chance and actual successful learning. There were two reasons to choose a threshold of 0.2. First, when x0 is below the threshold, the robot visibly heads for the center of the food disk. Second, the signal x0 has only discrete values because of a discrete arena of 600 × 400 where the robot has a size of 20 × 10. Even if the robot heads perfectly toward the food disk, there will often be a temporal difference between the left and the right sensor because of the discrete representation of both the robot and the round-shaped food disk (diameter 20) leading to a small remaining value of x0 (aliasing effect). The log-log plots of the number of contacts in Figures 4A and 4B show that both rules follow a power law. The similarity of the curves for small learning rates reflects the mathematical equivalence of both rules for µ → 0. The dependence of failures on the learning rate is quite different for ISO learning as compared to ICO learning. 
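The success criterion just described (four consecutive disk contacts with |x0| < 0.2) is easy to state in code; this helper is a sketch of how contacts could be counted in such an experiment (the function and parameter names are ours, not from the original setup):

```python
def contacts_until_learned(x0_at_contacts, threshold=0.2, run_length=4):
    """Return the number of disk contacts needed until |x0| stayed below
    `threshold` for `run_length` consecutive contacts; None if never."""
    streak = 0
    for i, x0 in enumerate(x0_at_contacts, start=1):
        streak = streak + 1 if abs(x0) < threshold else 0
        if streak == run_length:
            return i
    return None

print(contacts_until_learned([0.5, 0.4, 0.1, 0.15, 0.05, 0.1]))  # 6
print(contacts_until_learned([0.5] * 10))                        # None
```

Requiring a consecutive run rather than a single sub-threshold contact is what separates actual learning from chance hits.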
For differential Hebbian (ISO) learning (see Figure 4B), errors increase roughly exponentially up to a learning rate of µ = 10^-4. This behavior reflects errors caused by the autocorrelation terms. Above µ = 10^-4, failures reach a plateau with some statistical variability. For ICO learning (see Figure 4A), failures remain essentially zero up to µ = 0.0002; the learned behavior diverges only above that value. In contrast to the ISO rule, this effect is due to overlearning, where the learning gain of the predictive pathway is higher than the gain of the feedback loop. Thus, the predictive pathway becomes unstable during the first learning experience. This means that the effective learning rate (see equation 3.12) has exceeded one. The actual learning rate µ is lower because it is multiplied by the gains of the feedback reaction F and the predictive pathway D H1 P1, which depend on the actual experimental setup. For two different learning rates (µ = 5 · 10^-6 and 5 · 10^-5), the weights ρ_j, j > 0, and the reflex input x0 are plotted in Figure 5. The data have been taken from four simulations of Figure 4. Thus, success has been measured in the same way as before, requiring |x0| to be below 0.2 for four consecutive learning experiences. At the low learning rate (see Figures 5A–5D), weights
Figure 4: Results from the simulated robot experiment. (A) Results from ICO learning. (B) Results from the ISO learning. Log-log plots show how many contacts with the target were required for successful learning at a given learning rate µ. Histograms show how many times learning was not successful. The bin size was set to 10 experiments, which gives an equal spacing on the log x-axis. Failures are shown on a linear axis.
Figure 5: Comparing ICO and ISO learning in individual simulated robot experiments. (A,B,E,F) Plots of the reflex input x0 at the contacts with the food source. (C,D,G,H) Plots of the weights for two learning rates, (A–D) µ = 5 · 10^-6 and (E–H) µ = 5 · 10^-5, for the two learning rules ISO and ICO. The inset in G shows steps 6000 to 10,000 plotted with a y-range of −55.72 · 10^-3, . . . , −55.54 · 10^-3. The inset in H shows steps 0 to 20,000 plotted with a y-range of −0.001, . . . , 0.0025.
converge to very similar values for ISO as well as ICO learning. This is not surprising, as for low learning rates, the autocorrelation term in ISO learning is small. However, even for such low learning rates, the weights drift in the ISO learning case. This can be seen in particular between steps 3000 and 7000 in Figure 5D. There are no contacts with food disks, and consequently x0 stays zero. However, the weights drift upward because of nonzero inputs to the filter bank through x1. ICO learning (see Figure 5C) does not show any weight drift, for three reasons. First, a constant input at x0 keeps the weights constant. Second, the predictive input x1 is zero at the moment x0 is triggered. This is the case after successful learning, as seen, for example, in Figure 5C between steps 32,000 and 36,000. Third, the derivative (u̇0) of the filtered input x0 is symmetric, so that the weight change is effectively zero. All of these factors contribute to stability. Even if x0 continually receives small transients, learning is stable. Transients can occur due to aliasing in the simulation or, in the real robot, due to mechanical imperfections. Such transients trigger unwanted weight change. However, they do not destabilize learning, because x0 acts as an error signal that always counteracts unwanted weight change. For example, a transient at the reflex input x0 causes the robot to learn an overly strong steering reaction to the left. The next time the robot enters the food disk, this overly strong left turn causes an error signal at x0 that reduces the steering reaction again. Thus, in these cases, weights will occasionally grow or shrink due to transients in x0, but they will be brought back to their optimal values as long as x0 carries a proper error signal.
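The no-drift property has a compact numerical illustration. The sketch below is a deliberately simplified stand-in: first-order low-passes replace the resonator filter bank, derivatives become one-step differences, and the signals are made up. It reproduces the situation between steps 3000 and 7000 of Figure 5D, where the reflex input is silent while the predictive input stays active. The ICO update, which correlates u1 only with the derivative of the reflex input u0, then vanishes exactly, whereas the ISO update, which correlates u1 with the derivative of the output v, keeps moving the weight through its autocorrelation term:

```python
import numpy as np

def lowpass(x, a=0.9):
    """First-order low-pass standing in for the resonator filters."""
    y = np.zeros_like(x)
    for n in range(1, len(x)):
        y[n] = a * y[n - 1] + (1 - a) * x[n]
    return y

rng = np.random.default_rng(0)
steps = 5000
x0 = np.zeros(steps)        # reflex input silent (after successful learning)
x1 = rng.random(steps)      # ongoing activity at the predictive input
u0, u1 = lowpass(x0), lowpass(x1)

mu, rho0 = 0.01, 1.0
rho1_iso = rho1_ico = 0.1   # start from an already learned weight
v_old = u0_old = 0.0
for n in range(steps):
    v = rho0 * u0[n] + rho1_iso * u1[n]
    rho1_iso += mu * u1[n] * (v - v_old)        # ISO: output derivative
    v_old = v
    rho1_ico += mu * u1[n] * (u0[n] - u0_old)   # ICO: reflex-input derivative
    u0_old = u0[n]

print(rho1_ico)   # exactly 0.1: no drift without reflex activity
print(rho1_iso)   # != 0.1: the autocorrelation term keeps changing the weight
```

The contrast holds for any activity at x1: as long as u0 stays constant, the ICO increment is identically zero.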
In the experiments with high learning rates (see Figures 5E to 5H), learning is very fast, resulting in stable weights for ICO after just two learning experiences, which appear in Figure 5E as large peaks. After the second peak, the weights undergo only minimal change. In fact, the “almost head-on” contacts (small peaks in x0) between steps 6000 and 10,000 of Figure 5G cause the weights to become more positive again. This is demonstrated in the inset of Figure 5G, which indicates that learning initially caused a slight overshooting of the weights. A different behavior is observed for ISO learning (see Figure 5H). After the second contact with the food disk, the system starts to diverge. The autocorrelation term dominates learning, leading to exponential growth of the weights. After step 22,000, the reflex input x0 is zero, which means that only the autocorrelation terms change the weights. Behaviorally, we observe that the robot first learns the right behavior: driving toward the food disk. This behavior corresponds to negative weights, as seen in Figures 5C, 5D, and 5G. After step 10,000, however, the weights drift to positive values, which behaviorally corresponds to avoidance. This avoidance becomes stronger and stronger, so that the robot never touches the food disk again. This unwanted ongoing learning is due to the movements of the robot, which cause a continuously changing sound signal x1, resulting in a nonvanishing autocorrelation term. Thus, while ICO learning (see
Figure 6: ICO learning simulation with three simultaneously present food disks. See Figure 3 for parameters. (A) Trace of the robot for the whole simulation. The trace before learning is in gray to differentiate it from the learning behavior. Initially, learning is switched off for the first 1000 steps to demonstrate purely reflexive behavior when encountering the disk at R. The first three learning experiences that follow are numbered 1 to 3. (B) Weight development during learning. The learning rate was again set to µ = 5 · 10^-5.
equation 2.4) is stable for both low and high learning rates, its differential Hebbian counterpart, ISO learning, is stable only at low learning rates. The benchmark tests have provided an ideal condition for learning where just one food disk was in the arena. This gave a perfect correlation between the proximal and the distal sensor. Having three food disks in the arena at the same time renders learning more difficult (see Figure 6). There is no longer a simple relationship between the reflex input x0 and the predictor x1: the sound fields from the different food disks superimpose so that the distal information is distorted. However, ICO learning also manages this scenario without any problems. Figure 6A depicts the trace of a run starting just before the first learning experience. Figure 6B shows the corresponding weight development, which is stable as well. Again, ISO learning is not able to perform this task at this high learning rate (data not shown). In summary, the simulations demonstrate that ICO learning is much more stable than the Hebbian ISO learning rule. ICO learning is able to operate with high learning rates, approaching one-shot learning under ideal noise-free conditions.

4.2 The Real Robot. In this section we show that the same food disk targeting task can also be solved by a real robot. This is not self-evident because of the complications that arise from the embodiment of the robot and its situatedness in a real environment. (See Ziemke, 2001, for a discussion of the embodiment principle.) In addition, we will show that it is possible to use filters other than resonators in the predictive pathway.
Figure 7: Experiment with a real robot. (A–C) Traces during the run, which lasted 8:46 min. A is taken from the start of the run at 0:06, showing the first reflex reaction; B and C show learned targeting behavior after 3:45 (14 contacts) and 4:48 (18 contacts), respectively. The development of the weights and the trace x0 is shown in D. The values of the weights for B and C are indicated by arrows. Parameters: the learning rate was set to µ = 0.00002, the reflex weight to ρ0 = 40, and the video input image v(Ξ, ϒ) was Ξ = [1, . . . , 96] × ϒ = [1, . . . , 64] pixels. The scan line for the reflex was at ϒ = 50 and for the predictor at ϒ = 2. The reflex x0 and the predictive signal x1 were generated by creating a weighted sum of thresholded gray levels, x_{0,1}(ϒ) = Σ_{Ξ=1}^{96} (Ξ − 96/2) Θ(v(Ξ, ϒ) − 128), where Θ is the Heaviside function. The predictive input is split up into a filter bank of five filters. The predictive filters have 100, 50, 33, 25, and 20 taps, where all coefficients are set to one. The reflex pathway is set up with a resonator set to f0 = 0.01 and Q = 0.51. The camera was a standard pinhole camera with a rather narrow viewing angle of ±35 degrees.
As before, the task of the robot is to target a white disk from a distance. Similar to the simulation, the robot has a reflex reaction that pulls the robot into the white disk just at the moment the robot drives over the disk. This reflex reaction is achieved by analyzing the bottom scan line of a camera mounted on the robot. The predictive pathway is created in a similar way. A scan line from the top of the image, which views the arena at a greater distance from the robot (hence “in its future”), is fed into a filter bank of five filters. In contrast to the simulation, these filters are set up as finite impulse response (FIR) filters with different numbers of taps where all coefficients are set to one. Thus, the only thing such a filter does is to smear the input
signal out over time, while the response duration is limited by the number of filter taps. We chose such filters for two reasons. First, in contrast to ISO learning, we do not need orthogonality between the reflex pathway and the predictive pathway. Thus, it is possible to employ different filter functions in the different pathways. This made it possible to solve a problem that exists with this robot setup: because we used a camera with a rather narrow viewing angle, we had to place the food disk rather centrally in front of the robot. The FIR filters generate step responses that result in a clearly observable behavioral change after learning as soon as the food disk enters the visual field of the robot. Resonator responses are so smooth that the reflex and the learned reaction would look too similar. The reflex behavior before learning is shown in Figure 7A, where the robot drives exactly straight ahead until it encounters the white disk. Only when it sees the disk directly in front of it is a sharp and abrupt turning reaction generated. The learning rate was set to the highest possible value; at higher learning rates, the system started to diverge. Learning takes longer than in the simulation: about 10 contacts with the white disk are needed until a learned behavior can be seen. Examples of successful learning are shown in Figures 7B and 7C. Now the robot’s turning reaction sets in at a distance of about 50 cm from the target disk. Thus, the robot has learned anticipatory behavior. The real robot is subject to complications that do not exist in the simulation. The inertia of the robot, imperfections of the motors, and noise from the camera render learning more difficult than in the simulation. As a consequence, we obtain a nonzero reflex input x0 all the time, as shown in the top trace of Figure 7D. This is also reflected in the weight development: the weights change during the whole experiment, although they do not diverge. Rather, they oscillate around their best value.
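The smearing effect of these all-ones FIR filters is easy to see directly; the tap counts below are the ones listed in the Figure 7 caption, while the input pulse is our own toy signal:

```python
import numpy as np

taps = [100, 50, 33, 25, 20]          # tap counts of the five predictive filters
bank = [np.ones(n) for n in taps]     # all coefficients set to one

x1 = np.zeros(300)
x1[50] = 1.0                          # pulse on the predictive scan line
responses = [np.convolve(x1, h)[:300] for h in bank]

# each filter turns the pulse into a rectangular step of length = tap count
print([int(r.sum()) for r in responses])   # [100, 50, 33, 25, 20]
```

The rectangular step responses are exactly what makes the learned reaction visibly different from the smooth resonator-driven reflex.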
The experiment can be run for a few hours without divergence. Another reason for weight change is the limited space in the arena. This effect can be drastic if the robot is caught in a corner of the arena. Imagine the robot first encounters a food disk and then directly bumps into a wall. The bump causes a retraction reaction, which changes the input x0 and therefore the reflex reaction. Consequently, learning is affected by such movements. Another aspect is the human operator who throws the food disks in front of the robot. If a food disk is thrown in front of the robot too late, the timing between x1 and x0 is different, which also leads to wrong correlations. All additional error sources, like noisy data, impose an upper limit on the learning rate. This limit, however, is not the theoretical one (see equation 3.11) but a practical limit to protect the robot from learning the wrong behavior during its first learning experience (Grossberg, 1987).

4.3 Control Applications. In the next two sections, we will demonstrate that ICO learning can also be used in more conventional control situations.
Figure 8: (A) Setup of the mechanical system. The position of the main arm is maintained by a PI controller controlling the motor force M with ρ0 = 6; its position is measured by a potentiometer P, SP = 100°, and the effective equilibrium point (EEP) reached is 93.2°. Note that the effective equilibrium point will be identical to the set point only for an ideal controller at infinite gain. A disturbance D is introduced by an orthogonally mounted smaller arm. System parameters: sampling interval 5 ms, µ = 2 × 10^-5, f0 = 10 Hz, Q0 = 0.6, Q1^j = 0.6, f1^j = 20 Hz / j, j = 1, . . . , 10. (B) Signal traces D, M, and P from one experiment. The inset ρ1^j shows the development of the connection weights. Disturbances are compensated after about four trials, and the weights stabilize.
To this end, we note first that a reflex is conceptually very similar to a conventional closed-loop controller, where a set point is maintained by a feedback reaction from the controller as soon as a disturbance is measured. In the next sections, we will show anticipatory control of a mechanical arm as well as feedforward compensation of a heat pulse in a temperature-controlled container, such as those commonly used for chemical reactions. Mainly we will try to demonstrate that in these situations, ICO learning also converges very fast, which may make it applicable in more industrial scenarios, too.

4.3.1 The Mechanical Arm. To show that ICO learning is also able to operate with a classical PI controller, we have set up another mechanical system. In addition, we show in this example how the weights can be kept stable if the input x0 is too noisy. Recall that weight stabilization occurs as soon as x0 = 0 (see equation 2.4). To ensure this, we employ a threshold around the set point SP, creating an interval within which x0 is set to zero. For our mechanical arm (see Figures 8A and 8B), a conventional PI controller defines the reflexive feedback loop controlling the arm position P = x0. The PI controller replaces the resonator H0 in this case. To stop the weights from fluctuating, we employ a threshold at x0 of ±1° around the set point. Disturbances (D = x1) arise from the pushing force of a second, small arm mounted orthogonally to the main arm. A fast-reacting touch sensor at the contact point measures D. The force D is transient (top trace in Figure 8B), and the small arm is pulled back by a spring. A moderately high learning rate was chosen to demonstrate how the system develops in time. The second trace, M = v, shows the motor signal of the main arm. Close inspection reveals that during learning, this signal first becomes biphasic (see the small inset curve), where the earlier component corresponds to the learned part and the later component to the PI controller’s reaction.
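The weight-freezing threshold can be sketched as a simple dead zone on the reflex input (the function and parameter names are ours; the ±1° interval is the one used above):

```python
def gated_reflex(position, set_point=100.0, dead_zone=1.0):
    """Reflex input x0 with a dead zone around the set point.

    Inside the interval the reported error is forced to zero, so the
    ICO update (proportional to the derivative of the filtered x0)
    vanishes and the weights stop fluctuating due to sensor noise.
    """
    error = position - set_point
    return 0.0 if abs(error) < dead_zone else error

print(gated_reflex(100.4))   # 0.0 -> weights frozen
print(gated_reflex(103.0))   # 3.0 -> feedback and learning active
```

The dead zone trades a small residual position error for weights that stay put once the disturbance is compensated.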
At the end of learning, only the first component remains (note the forward shift of M with respect to D). Trace P = x0 shows the position signal of the main arm. In the control situation, learning was off, and a biphasic reaction is visible with about a 10-degree position deviation (peak to peak). During learning, this deviation is almost fully compensated after four trials. The inset curves ρ1 at the bottom show that the connection weights have stabilized after the fourth trial. The fifth trial is shown to demonstrate the remaining variability of the system’s reaction.

4.3.2 Temperature Control. Figure 9 shows anticipatory temperature control against heat spikes, which could be potentially damaging in a real plant. A feedback loop with a resonator H0 guarantees a constant temperature SP in a container. The actual temperature is controlled by an electric heater (κv) and a cooling system (φv). The system can be considered nonlinear because cooling and heating are achieved by different techniques. The demanding task of learning here is to predict temperature changes caused by
Figure 9: Learning to keep the temperature ϑ in a container C constant against external disturbances. The container volume was 500 ml, the main heat source was a 500 W coil heater (κv), and the main cooling source was a pulse-width-modulated, valve-controlled water flow through a copper coil (φv) with a maximum of 750 ml/min at 17°C. The disturbance heat source (κD) received pulses of 1000 W from D. Data acquisition and control were performed with a USB-DUX board. The sampling rate was 1 Hz. The resonator in the feedback loop was set to f0 = 0.2 Hz, Q0 = 0.51, and its corresponding weight to ρ0 = 50. H1 is a filter bank of resonators with parameters given in the caption of Figure 10.
another heater, κD, which is switched on occasionally. In a real application, this predictive input would come from a second thermometer or another sensor that is able to predict the deviation from the set point SP. Several temperature experiments have been performed at different set points. In Figure 10A, a high gain and a small µ were used, and learning compensates over- and undershooting in about 15 trials. Figure 10B shows that with a high gain and a high learning rate, the heat spike is compensated in a single trial, which could represent a vital achievement in a real plant. In this case, compensation of the undershoot takes much longer (not shown). In Figure 10C, a low gain was used, and the system reacts rather slowly. Learning compensates the overshoot after four trials, and the effective equilibrium point is then reached again, which was not the case before learning. In all situations, the weights essentially stabilize and drift only slightly around their equilibrium, because no threshold was used at x0. These small oscillations are similar to the behavior of the weights in the real robot experiment, which also oscillated around their equilibrium. Furthermore, we note that learning already sets in strongly in the first trial, immediately influencing the output.
1404
B. Porr and F. Wörgötter
In Figure 10D, we show how the system reacts when using the Hebbian learning rule (ISO learning). We observe poor convergence even for a rather small learning rate of µ = 2 × 10⁻¹¹, which is more than 1000 times smaller than those used for Figures 10A to 10C. These findings mirror the results of the simulations performed above. Some compensation occurs, but weights drift much more. To achieve this specific result, a higher gain had to be used than in the equivalent experiment shown in Figure 10C. With a lower gain, convergence was never reached, probably due to the noise in the signals. It should also be noted that this experiment was the best out of 20 using ISO learning.

5 Discussion

In this letter, we have presented a modification of our old ISO learning rule, which has led to a dramatic improvement in convergence speed and stability. Mathematically, we were able to show that under ideal noise-free conditions, ICO learning approaches one-shot learning. We have discussed the relations of these types of differential Hebbian learning rules (Kosco, 1986; Klopf, 1986; Sutton & Barto, 1987; Roberts, 1999) to temporal sequence learning and to reward-based learning, most notably TD and Q learning (Sutton & Barto, 2002), and their embedding in the existing literature to a great extent in a set of older articles (see, in particular, Wörgötter & Porr, 2005, for a summary). Here we restrict the discussion to the relevant novel features of ICO learning. We have to discuss the different application domains of ICO versus ISO learning. ICO and ISO learning are identical when using an orthogonal filter set in the limit µ → 0. In this situation, the autocorrelation term of ISO learning vanishes, and convergence is guaranteed for ISO learning as well (Porr et al., 2003).
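The contrast between the two rules can be made concrete in a few lines of discrete-time Python (a sketch of ours, not the authors' implementation; the learning rate and signal values are made up). ICO learning changes a weight ρ_j in proportion to the predictive input u_j times the derivative of the reflex input u_0, so learning stops as soon as the reflex input no longer changes. For contrast, a plain Hebbian rule driven by the output v retains an autocorrelation term ρ_j u_j² that makes the weight drift even then (ISO learning proper uses the output derivative, which we do not reproduce here):

```python
# ICO vs. a plain Hebbian update on constant signals. MU and the
# signal values are illustrative, not taken from the paper.
MU = 0.01  # learning rate

def ico_step(rho, u_j, du0):
    """ICO: weight change ~ predictive input u_j times the
    *derivative* of the reflex input u_0 (cf. equation A.5)."""
    return rho + MU * u_j * du0

def hebb_step(rho, u_j, v):
    """Plain Hebbian: weight change ~ u_j times the output v,
    which itself contains rho * u_j (autocorrelation term)."""
    return rho + MU * u_j * v

rho_ico = rho_hebb = 1.0
u_j, u0 = 0.5, 0.0           # reflex input pinned at its set point
for _ in range(100):
    rho_ico = ico_step(rho_ico, u_j, du0=0.0)   # u_0 does not change
    v = rho_hebb * u_j + u0                     # output incl. learned pathway
    rho_hebb = hebb_step(rho_hebb, u_j, v)
# rho_ico is still exactly 1.0; rho_hebb has drifted upward.
```

Once the disturbance is compensated (u_0 constant), the ICO weight is frozen, while the Hebbian weight grows geometrically — the drift discussed above.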
The advantage of ISO learning as compared to ICO learning is its isotropy: all inputs can self-organize into reflex inputs or predictive inputs, depending on their temporal sequence (see Porr & Wörgötter, 2003a, for a discussion of this property). For ICO learning,
Figure 10: Temperature control experiments. Parameters of the filter bank H_1 are Q_1^j = 0.51 and f_1^j = 0.1j Hz, with j = 1, . . . , 12 for A, B and j = 1, . . . , 10 for C, D. Experiments with different parameters: (A) SP = 60 °C, EEP = 59.2 °C, ρ_0 = 250, disturbance pulse duration 10 s, µ = 4 × 10⁻⁸. (B) SP = 70 °C, EEP = 68.4 °C, ρ_0 = 250, disturbance pulse duration 20 s, µ = 4 × 10⁻⁷. (C) SP = 44.0 °C, EEP = 43.5 °C, ρ_0 = 150, disturbance pulse duration 12 s, µ = 7.5 × 10⁻⁷. (D) Same as in C, except for a higher feedback gain of ρ_0 = 250 (EEP = 43.9 °C) and a lower learning rate of µ = 2 × 10⁻¹¹, but using ISO learning as denoted in the figure. Note that the levels of the input signals at ρ_0 and ρ_j, j > 0, are different. This leads to different absolute values for ρ_0 and ρ_j, j > 0.
one needs to build the predefined subsumption architecture (see Figure 2) into the system from the beginning. This means that we have to set up a feedback system with a desired state and an error signal (x_0 → 0) that drives learning. In the context of technical control applications, this is usually given, so that ICO learning is the preferred choice over ISO learning (D'Azzo, 1988). In biology, however, self-organization is the key aspect. ISO learning has the ability to self-organize which pathways become reflex pathways and which pathways become predictive pathways. Reflex pathways can be superseded by other pathways, which in turn can become reflex pathways. This also means that ISO learning is able to use any input as an error signal, whereas ICO learning can use only x_0 as an error signal. By hierarchically superseding reflex loops, ISO learning is able to self-organize subsumption architectures (Brooks, 1989; Porr et al., 2003), which is not possible with ICO learning. The filter bank here is used to generate an appropriate behavioral response. In contrast to our older ISO learning, it is possible to use other filter functions, such as step functions, for the filter bank, as demonstrated in the real robot experiment. The only restriction imposed on the filter bank is that the loop should establish a low-pass characteristic. This characteristic has to be established by the closed loop, not the open loop: the actual filter in the filter bank need not itself possess a low-pass characteristic, as long as the closed loop established by the environment does. Filter banks have been employed in other learning algorithms, for example, in TD learning (Sutton & Barto, 2002). In contrast to our learning scheme, the filters there are used only for the critic, not for the actor. In other words, they are used to smear out the conditioned stimulus so that it can be correlated with the unconditioned stimulus.
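As an illustration of such a filter bank, the sketch below builds damped-sine impulse responses from a centre frequency f and quality factor Q and applies them by plain FIR convolution. The damped-sinusoid form and the decay constant πf/Q are our own stand-in for the paper's resonator transfer function, which is not reproduced here; the parameter values follow the caption of Figure 10.

```python
import math

def resonator_impulse_response(f, q, n_samples):
    """Damped sinusoid with centre frequency f (Hz) and quality
    factor q, sampled at 1 Hz; the decay constant pi*f/q is an
    assumed stand-in for the paper's resonator."""
    return [math.exp(-math.pi * f * n / q) * math.sin(2.0 * math.pi * f * n)
            for n in range(n_samples)]

def filter_signal(x, h):
    """Plain FIR convolution: u(n) = sum_k h(k) x(n - k)."""
    return [sum(h[k] * x[n - k] for k in range(min(len(h), n + 1)))
            for n in range(len(x))]

# Filter bank as in Figure 10, C/D: f_1^j = 0.1*j Hz, Q = 0.51, j = 1..10.
bank = [resonator_impulse_response(0.1 * j, 0.51, 200) for j in range(1, 11)]
```

Each u_j is then obtained by filtering the raw input with one element of `bank`, for example `u_1 = filter_signal(x, bank[0])`.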
In terms of synaptic plasticity, ISO and ICO learning differ substantially. While ISO learning can be interpreted as a homosynaptic learning rule (Porr, Saudargiene, & Wörgötter, 2004), ICO learning is strictly heterosynaptic. The neuronal literature on heterosynaptic plasticity normally emphasizes that it is essentially a modulatory process that modifies (conventional) homosynaptic learning (Bliss & Lomo, 1973; Markram, Lübke, Frotscher, & Sakmann, 1997), but cannot lead to plasticity on its own (Ikeda et al., 2003; Bailey, Giustetto, Huang, Hawkins, & Kandel, 2000; Jay, 2003). However, evidence was also found for a more direct influence of heterosynaptic plasticity in Aplysia siphon sensory cells (Clark & Kandel, 1984), in the amygdala (Humeau, Shaban, Bissière, & Lüthi, 2003), and in the limbic system (Beninger & Gerdjikov, 2004; Kelley, 2004). As a consequence, heterosynaptic learning rules have so far mostly been used to emulate modulatory processes, for example, by the implementation of three-factor learning rules trying to capture dopaminergic influence in the striatum and the cortex (Schultz & Suri, 2001). To our knowledge, ICO is the only learning rule that operates strictly heterosynaptically, which, for network learning and plasticity, might open new avenues as
compared to the well-established Hebb rules (Oja, 1982; Kohonen, 1988; Linsker, 1988; MacKay, 1990; Rosenblatt, 1958; von der Malsburg, 1973; Amari, 1977; Miller, 1996a). The tremendous stability of ICO, which is guaranteed for x_0 = 0 or can be enforced by thresholding x_0, will allow designing stable nested or chained architectures of several ICO learning units, where the "primary" units in such an architecture are controlled by the feedback neuronal activity of the "secondary" ones. Hence, the secondary neurons in such a setup would provide the x_0 signal by way of an internal feedback loop, which takes the role of, and replaces, the behavioral feedback employed here. Not only does this shed interesting light on neuronal feedback loops like the corticothalamic loops (Alexander, DeLong, & Strick, 1986; Morris, Cochran, & Pratt, 2005), but it might also offer interesting possibilities for novel network architectures, where stability can be built into the system by way of such loops. Like ISO learning, ICO learning develops a forward model (Palm, 2000) of the reflex reaction established by H_0, ρ_0, and P_0. The forward model is represented by the resonators and weights H_j, ρ_j, j > 0 (Porr et al., 2003). The main advantage of ICO learning over ISO learning is that it is not limited to resonators (H_j) as filters. We have shown here that instead of resonators, simple FIR filters can be used for the filter bank. The required low-pass characteristic came from the environment. The FIR filter was, however, just an example. Future research has to systematically explore which linear and nonlinear filters are suitable for ICO learning. Finding a target with a simulated or real-world device has been addressed many times before. The oldest model with hand-tuned fixed weights was employed by Walter (1953), where his tortoise had to find its home cage. To find the optimal weights, Paine and Tani (2004) have recently employed a genetic algorithm that is able to solve a T-maze task.
Their simulated robots need 63 generations. When it comes to learning, basically two paradigms are employed: reinforcement learning or Hebbian learning. In reinforcement learning, Q-learning seems to be the learning rule of choice. Q-learning generates optimal policies to retrieve a reward, where a policy associates a sensor input with an action. The Q-value evaluates whether a policy leads to a reward: the higher the Q-value, the more probable the future or immediate reward. Q-learning has been successfully applied by Bakker, Linåker, and Schmidhuber (2002) to a T-maze task. The robot has to learn that a road sign at the entrance of the T-maze gives the clue as to whether the reward is in the left or the right arm. To solve this task, the simulated Khepera robot needed 2500 episodes. Thrun (1995) also employs Q-learning to find a target. In contrast to Bakker et al., however, the robot navigates freely in an environment. This task probably comes closest to our task. Successful targeting behavior is established after approximately 20 episodes. Our robot needed approximately 15 contacts with the white disk to find it reliably. However, after 20 episodes, the success rate in Thrun's experiment
is still poor. A further 80 episodes are needed to bring the success rate up to 90%. Our robot has a comparable success rate of 90% after these 15 contacts, given that the camera can see the disk. The different convergence speeds suggest that Thrun employed a lower learning rate. The other learning rule that has been employed to solve targeting tasks is Hebbian learning. In Verschure and Voegtlin (1998) and Verschure et al. (2003), the robot has the task of finding targets. Similar to ours, their robot is equipped with proximal and distal sensors. The proximal sensors trigger reflex reactions. The task is to use the distal sensors to find the targets from a distance. In contrast to our heterosynaptic learning, Verschure and Voegtlin employ Hebbian, not heterosynaptic, learning. To limit unbounded weight growth, they modified the Hebbian learning rule. Verschure et al. did this directly by adding a decay term proportional to the weight. In Verschure and Voegtlin, infinite weight growth is counteracted by inhibiting the signals from the distal sensors or, in other words, the conditioned stimuli. Unfortunately, a direct comparison of the performances with our experiment is not possible because it is not clear from Verschure and Voegtlin or from Verschure et al. how many contacts with the target were needed to learn the behavior. Touzet and Santos (2001) have systematically compared different reinforcement learning algorithms applied to obstacle avoidance. Such systematic comparisons are difficult to achieve because of different hardware platforms, different environments, and different ways of documenting the robot runs. Thus, a systematic evaluation of the different learning rules will be the subject of further investigation.

Appendix: Using the z-Transform for the Convergence Proof

We describe in detail how we transformed the learning rule, equation 2.4, into the z-domain.
The z-transform of a sampled (or time-discrete) signal x(n) is defined as

X(z) = ∑_{n=−∞}^{∞} x(n) z⁻ⁿ.  (A.1)
The capital letter X(z) denotes the z-transform of the original signal x(n). The z-transform is the discrete version of the Laplace transform, which in turn is a generalized version of the Fourier transform. The original signal and its z-transform are equivalent if convergence can be guaranteed (Proakis & Manolakis, 1996). The z-transform has a couple of useful properties that simplify the convergence proof shown in section 3.2.2.
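For finite causal signals, the defining sum in equation A.1 is just a polynomial in z⁻¹, so the convolution property (equation A.2 below) can be checked numerically. The following sketch is ours, for illustration only, and is not part of the proof:

```python
def z_transform(x, z):
    """Evaluate X(z) = sum_n x(n) z^{-n} for a finite causal signal x."""
    return sum(xn * z ** (-n) for n, xn in enumerate(x))

def convolve(x, h):
    """Time-domain convolution of two finite causal signals."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y

# Transforming the convolution equals multiplying the transforms.
x, h, z = [1.0, 2.0, 3.0], [0.5, -1.0], 1.3
lhs = z_transform(convolve(x, h), z)
rhs = z_transform(x, z) * z_transform(h, z)
assert abs(lhs - rhs) < 1e-9
```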
• Convolution: The z-transform can be applied not only to signals but also to filters. Filtering in the time domain means convolution of the signal x(n) with the impulse response h(n) of the filter. In the z-domain, it is just a multiplication with the transformed impulse response:

  x(n) ∗ h(n) ⇔ X(z)H(z).  (A.2)

  For example, equation 2.1 turns into U_j = X_j H_j in the z-domain, where the capital letters indicate the z-transformed functions. Once transformed into the z-domain, equations can be solved by simple algebraic manipulations. For example, equation 3.1 can be solved for X_0 by subtracting X_0 H_0 ρ_0 P_0 from both sides and then dividing the equation by 1 − ρ_0 P_0 H_0.

• Correlation: The correlation of two signals can be derived from the convolution (see equation A.2) by recalling that a correlation is just a convolution where one signal is reversed in time. Time reversal x(−n), which in the z-domain is X(z⁻¹), leads directly to a formula for correlation:

  x(n) ∗ h(−n) ⇔ X(z)H(z⁻¹).  (A.3)

• Derivative: The derivative in the z-domain can be expressed as an operator (Bronstein & Semendjajew, 1989):

  d/dn ⇔ (z − 1).  (A.4)

With that background it is now possible to z-transform the learning rule, equation 2.4, ρ̇_j = µ u_j u̇_0:

(z − 1)ρ_j = µ U_j(z⁻¹)(z − 1)U_0(z),  (A.5)
which is equation 3.5. Note that the derivative on the right-hand side is not time-reversed because it belongs to U_0.

Acknowledgments

We thank David Murray Smith for fruitful comments on the manuscript. We are also grateful to Thomas Kulvicus and Tao Geng, who constructed the mechanical arm.

References

Alexander, G., DeLong, M., & Strick, P. (1986). Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annu. Rev. Neurosci., 9, 357–381.
Amari, S. I. (1977). Neural theory of association and concept-formation. Biol. Cybern., 26(3), 175–185.
Bailey, C. H., Giustetto, M., Huang, Y. Y., Hawkins, R. D., & Kandel, E. R. (2000). Is heterosynaptic modulation essential for stabilizing Hebbian plasticity and memory? Nat. Rev. Neurosci., 1(1), 11–20.
Bakker, B., Linåker, F., & Schmidhuber, J. (2002). Reinforcement learning in partially observable mobile robot domains using unsupervised event extraction. In Proceedings of the 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway, NJ: IEEE.
Beninger, R., & Gerdjikov, T. (2004). The role of signaling molecules in reward-related incentive learning. Neurotoxicity Research, 6(1), 91–104.
Bienenstock, E., Cooper, L., & Munro, P. (1982). Theory for the development of neuron selectivity, orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.
Bliss, T., & Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol., 232(2), 331–356.
Bronstein, I., & Semendjajew, K. (1989). Taschenbuch der Mathematik (24th ed.). Thun and Frankfurt: Harri Deutsch.
Brooks, R. A. (1989). How to build complete creatures rather than isolated cognitive simulators. In K. VanLehn (Ed.), Architectures for intelligence (pp. 225–239). Hillsdale, NJ: Erlbaum.
Brooks, R. A. (1991). Intelligence without representation. Artificial Intelligence, 47, 139–159.
Clark, G. A., & Kandel, E. R. (1984). Branch-specific heterosynaptic facilitation in Aplysia siphon sensory cells. PNAS, 81(8), 2577–2581.
Dayan, P., & Sejnowski, T. (1994). TD(λ) converges with probability 1. Mach. Learn., 14(3), 295–301.
D'Azzo, J. J. (1988). Linear control system analysis and design. New York: McGraw-Hill.
Diniz, P. S. R. (2002). Digital signal processing. Cambridge: Cambridge University Press.
Grossberg, S. (1987). Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11, 23–63.
Grossberg, S. (1995). A spectral network model of pitch perception. J. Acoust. Soc. Am., 98(2), 862–879.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological study. New York: Wiley-Interscience.
Humeau, Y., Shaban, H., Bissière, S., & Lüthi, A. (2003). Presynaptic induction of heterosynaptic associative plasticity in the mammalian brain. Nature, 426(6968), 841–845.
Ikeda, H., Akiyama, G., Fujii, Y., Minowa, R., Koshikawa, N., & Cools, A. (2003). Role of AMPA and NMDA receptors in the nucleus accumbens shell in turning behaviour of rats: Interaction with dopamine and receptors. Neuropharmacology, 44, 81–87.
Jay, T. (2003). Dopamine: A potential substrate for synaptic plasticity and memory mechanisms. Prog. Neurobiol., 69(6), 375–390.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Kelley, A. E. (2004). Ventral striatal control of appetitive motivation: Role in ingestive behaviour and reward-related learning. Neuroscience and Biobehavioural Reviews, 27, 765–776.
Klopf, A. H. (1986). A drive-reinforcement model of single neuron function. In J. S. Denker (Ed.), Neural networks for computing: AIP Conference Proceedings. New York: American Institute of Physics.
Kohonen, T. (1988). Self-organization and associative memory (2nd ed.). Berlin: Springer.
Kosco, B. (1986). Differential Hebbian learning. In J. S. Denker (Ed.), Neural networks for computing: AIP Conference Proceedings (pp. 277–282). New York: American Institute of Physics.
Linsker, R. (1988). Self-organisation in a perceptual network. Computer, 21(3), 105–117.
MacKay, D. J. (1990). Analysis of Linsker's application of Hebbian rules to linear networks. Network, 1, 257–298.
Malenka, R. C., & Nicoll, R. A. (1999). Long-term potentiation—a decade of progress? Science, 285, 1870–1874.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Miller, K. D. (1996a). Receptive fields and maps in the visual cortex: Models of ocular dominance and orientation columns. In E. Donnay, J. van Hemmen, & K. Schulten (Eds.), Models of neural networks III (pp. 55–78). Berlin: Springer-Verlag.
Miller, K. D. (1996b). Synaptic economics: Competition and cooperation in correlation-based synaptic plasticity. Neuron, 17, 371–374.
Morris, B., Cochran, S., & Pratt, J. (2005). PCP: From pharmacology to modelling schizophrenia. Curr. Opin. Pharmacol., 5(1), 101–106.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biol., 15(3), 267–273.
Paine, R. W., & Tani, J. (2004). Motor primitive and sequence self-organisation in a hierarchical recurrent neural network. Neural Networks, 17, 1291–1309.
Palm, W. J. (2000). Modeling, analysis and control of dynamic systems. New York: Wiley.
Porr, B., Saudargiene, A., & Wörgötter, F. (2004). Analytical solution of spike-timing dependent plasticity based on synaptic biophysics. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Porr, B., von Ferber, C., & Wörgötter, F. (2003). ISO learning approximates a solution to the inverse-controller problem in an unsupervised behavioural paradigm. Neural Comp., 15, 865–884.
Porr, B., & Wörgötter, F. (2001). Temporal Hebbian learning in rate-coded neural networks: A theoretical approach towards classical conditioning. In G. Dorffner, H. Bischof, & K. Hornik (Eds.), Artificial neural networks—ICANN 2001 (pp. 1115–1120). Berlin: Springer.
Porr, B., & Wörgötter, F. (2003a). Isotropic sequence order learning. Neural Comp., 15, 831–864.
Porr, B., & Wörgötter, F. (2003b). Isotropic sequence order learning in a closed loop behavioral system. Roy. Soc. Phil. Trans. Math., Phys. and Eng. Sciences, 361(1811), 2225–2244.
Proakis, J. G., & Manolakis, D. G. (1996). Digital signal processing. Upper Saddle River, NJ: Prentice Hall.
Roberts, P. D. (1999). Temporally asymmetric learning rules: I. Differential Hebbian learning. Journal of Computational Neuroscience, 7(3), 235–246.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev., 65(6), 386–408.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.
Schultz, W., & Suri, R. E. (2001). Temporal difference model reproduces anticipatory neural activity. Neural Comp., 13(4), 841–862.
Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44.
Sutton, R., & Barto, A. (1981). Towards a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88, 135–170.
Sutton, R. S., & Barto, A. (1987). A temporal-difference model of classical conditioning. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society (pp. 355–378). Mahwah, NJ: Erlbaum.
Sutton, R. S., & Barto, A. G. (2002). Reinforcement learning: An introduction (2nd ed.). Cambridge, MA: MIT Press.
Thrun, S. (1995). An approach to learning mobile robot navigation. Robotics and Autonomous Systems, 15, 301–319.
Touzet, C., & Santos, J. F. (2001). Q-learning and robotics. In IJCNN'99, European Simulation Symposium. Piscataway, NJ: IEEE.
Verschure, P., & Voegtlin, T. (1998). A bottom-up approach towards the acquisition, retention, and expression of sequential representations: Distributed adaptive control III. Neural Networks, 11, 1531–1549.
Verschure, P. F. M. J., Voegtlin, T., & Douglas, R. J. (2003). Environmentally mediated synergy between perception and behaviour in mobile robots. Nature, 425, 620–624.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14(2), 85–100.
Walter, W. G. (1953). The living brain. London: G. Duckworth.
Watkins, C. J. (1989). Learning from delayed rewards. Unpublished doctoral dissertation, Cambridge University.
Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.
Wörgötter, F., & Porr, B. (2005). Temporal sequence learning, prediction and control—a review of different models and their relation to biological mechanisms. Neural Comp., 17, 245–319.
Ziemke, T. (2001). Are robots embodied? In C. Balkenius, J. Zlatev, H. Kozima, K. Dautenhahn, & C. Breazeal (Eds.), Proceedings of the First International Workshop on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems. Lund: Lund University.
Received April 13, 2005; accepted September 28, 2005.
LETTER
Communicated by Laurent Itti
An Oscillatory Neural Model of Multiple Object Tracking

Yakov Kazanovich
yakov [email protected]
Institute of Mathematical Problems in Biology, Russian Academy of Sciences, Pushchino, Moscow Region, 142290, Russia

Roman Borisyuk
[email protected]
Institute of Mathematical Problems in Biology, Russian Academy of Sciences, Pushchino, Moscow Region, 142290, Russia, and Centre for Theoretical and Computational Neuroscience, University of Plymouth, Plymouth PL4 8AA, U.K.
An oscillatory neural network model of multiple object tracking is described. The model works with a set of identical visual objects moving around the screen. At the initial stage, the model selects into the focus of attention a subset of objects initially marked as targets. Other objects are used as distractors. The model aims to preserve the initial separation between targets and distractors while objects are moving. This is achieved by a proper interplay of synchronizing and desynchronizing interactions in a multilayer network, where each layer is responsible for tracking a single target. The results of the model simulation are presented and compared with experimental data. In agreement with experimental evidence, simulations with a larger number of targets have shown higher error rates. Also, the functioning of the model in the case of temporarily overlapping objects is presented.

Neural Computation 18, 1413–1440 (2006)
© 2006 Massachusetts Institute of Technology

1 Introduction

Selective visual attention is a mechanism that allows a living organism to extract from the incoming visual information the part that is most important at a given moment and that should be processed in more detail. This mechanism is necessary due to the limited processing capability of the visual system, which precludes the rapid analysis of the whole visual scene. Different types of attention are responsible for implementing different strategies of attention focus formation. Traditional theories characterized attention in spatial terms as a spotlight or a "zoom lens" that could move about the visual field, focusing on whatever fell within that spatial region (Posner, Snyder, & Davidson, 1980; Eriksen & St. James, 1986). More recent theories of attention state that in some cases, the underlying units of selection are discrete objects whose selection into the focus of attention
is independent of whatever location these objects occupy. This type of attention is called object-based attention (Egeth & Yantis, 1997; Scholl, 2001). For object-based attention, the limits of information processing concern the number of objects that can be simultaneously attended. An important experimental paradigm in the study of object-based attention is multiple object tracking (MOT). In the canonical MOT experiments (Pylyshyn & Storm, 1988; Pylyshyn, 2001; Scholl, 2001), an observer views a display with m simple identical objects (up to 10 to 12 objects such as points, plus signs, or circles). A certain subset of the objects (from 1 to m/2; m is supposed to be even) is briefly flashed to mark them as targets. Other objects are considered distractors. Then all objects begin moving independently and unpredictably about the screen without passing too near to each other and without moving off the display. The subjects' task is to track the subset of targets with their eyes fixed at the center of the screen. At various times during animation, one of the objects is flashed, and the observer should press a key to indicate whether this object is a target or a distractor. In other studies, the subject had to indicate all the targets using a computer mouse. It has been shown that trained subjects are quite efficient in performing MOT. Though the number of errors increases with an increasing number of targets, even for five targets the performance level was about 85% correct target identifications. It has been argued that the results of the experiments are in agreement with the hypothesis of resource-limited parallel processing and cannot be explained entirely in terms of a serial attention scanning process, since in the latter case sequential jumps of a single attentional spotlight from one target to another would have to be made with impossible velocities.
A large number of later studies (Yantis, 1992; Blaser, Pylyshyn, & Holcombe, 2000; Scholl & Tremoulet, 2000; Sears & Pylyshyn, 2000; Visvanathan & Mingolla, 2002; Oksama & Hyönä, 2004; Liu et al., 2005) confirmed that the early visual system is able "to individuate and keep track of approximately five visual objects and does so without using an encoding of any of their visual properties" (Pylyshyn, 2001). Visvanathan and Mingolla (2002) investigated MOT in the case when objects are allowed to overlap one another dynamically during a trial. The experiments included conditions with and without depth cues that signal occlusion. The results show that "although the tracking task does become more difficult . . . it does not become impossible, even in the purely two-dimensional case." When occlusion cues are added to the display, MOT performance returns to the level observed for nonoverlapping objects. Observers are better at tracking targets than identifying them; the identity tends to be lost when targets come close to each other (Pylyshyn, 2004). In the past few years, attention has become a popular field for neural network modeling. The models of attention can be subdivided into two categories. Connectionist models (Olshausen, Anderson, & Van Essen, 1993; Tsotsos et al., 1995; Moser & Silton, 1998; Grossberg & Raizada, 2000; Itti &
Koch, 2000, 2001) are based on a winner-take-all procedure and are implemented through modification of the weights of connections in a hierarchical neural network. Such models are difficult to use in the case of moving objects since the networks have to work in the space of the visual field; therefore, for any new position of the objects, the weights of the connections should be recomputed. Another type of attention model is represented by oscillatory neural networks (Kazanovich & Borisyuk, 1999; Wang, 1999; Corchs & Deco, 2001; Borisyuk & Kazanovich, 2003, 2004; Katayama, Yano, & Horiguchi, 2004). They are more suitable for object-based attention because in this case, the network operates in phase-frequency space, which makes the attention focus invariant to the locations of objects in physical space. In this letter, we present a neural network model of MOT based on the principles of oscillation synchronization and resonance. As far as we know, this is the first model of MOT and the first example of oscillatory neural network application to processing scenes with moving objects. The MOT model design is based on our earlier published attention model with a central oscillator (AMCO) (Kazanovich & Borisyuk, 1999, 2002, 2003; Borisyuk & Kazanovich, 2003, 2004). Each element of AMCO is an oscillator whose dynamics are described by three variables: the phase of oscillations, the amplitude of oscillations, and the natural frequency of the oscillator. The interaction between oscillators is implemented in terms of phase locking, resonance, and adaptation of the natural frequency. AMCO has a star-like architecture of connections. It contains a one-layer network of locally coupled oscillators, the so-called peripheral oscillators (POs), whose dynamics are controlled by a special central oscillator (CO) through global feedforward and feedback connections (Kryukov, 1991). 
The POs represent the columns of cortical neurons in the primary visual cortex (areas V1–V2) that respond to specific local features of the image. For simplicity, we use the contrast between the intensities of a pixel and the background as such features. The CO plays the role of the central executive of the attention system (Baddeley, 1996; Cowan, 1988). In AMCO, isolated objects are represented by synchronous assemblies of POs, and the focus of attention is formed by those POs that work synchronously with the CO. In the psychological literature, the central executive is considered a system that is responsible for attentional control of working memory (Baddeley, 1996, 2002, 2003; Shallice, 2002). The localization of the central executive in brain structures is still an open question. For a long time, the functions of the central executive were attributed mostly to a local region in the prefrontal cortex (D'Esposito et al., 1991; Loose, Kaufmann, Auer, & Lange, 2003), but later studies have shown that the central executive may be represented by a distributed network that includes lateral, orbitofrontal, and medial prefrontal cortices linked with motor control structures (Barbas, 2000; Andres, 2003). Recent neuroimaging data show that
"different executive functions do not only recruit various frontal areas, but also depend upon posterior (mainly parietal) regions" (Collett & Van der Linden, 2002). Daffner et al. (1998) show the relative contribution of the frontal and posterior parietal regions to the differential processing of novel and target stimuli under conditions in which subjects actively directed attention to novel stimuli. The prefrontal cortex may serve as the central node in determining the allocation of attentional resources to novel events, whereas the posterior lobe may provide the neural substrate for the dynamic process of updating one's internal model of the environment to take a novel event into account. There is some evidence that in addition to neocortical areas, the hippocampus may play an important role in implementing central executive functions: the hippocampus has the final position in the pyramid of cortical convergent zones (Damasio, 1989), participates in controlling the processing of information in most parts of the neocortex (Holscher, 2003), and coordinates the work of the attention system (Vinogradova, 2001; Herrmann & Knight, 2001; Duncan, 2001). An important property of a network with the star-like architecture is a relatively small number of connections (of order n, where n is the number of elements in the system) in comparison to systems with all-to-all connections (of order n²). This makes systems with a central element biologically plausible and technically feasible even in the case of large n. Kazanovich and Borisyuk (2002) and Borisyuk and Kazanovich (2004) showed that by combining synchronization and resonance in AMCO, it is possible to select objects into the attention focus one by one in the order determined by the saliency of the objects. More salient objects have the advantage of earlier selection.
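A minimal numerical sketch of the star-like architecture can be built from Kuramoto-style phase oscillators (this is our own simplification, not the authors' model: AMCO also adapts oscillation amplitudes and the CO's natural frequency, which are omitted here, and all coupling constants are made up). POs whose natural frequency lies close to the CO's phase-lock to it, while POs coding a different object keep their own rhythm:

```python
import math
import random

def simulate(omega_pos, omega_co, k_fb=2.0, g=0.5, dt=0.005, T=100.0):
    """Star-coupled phase oscillators: every PO receives feedback
    k_fb*sin(theta_co - theta_i) from the CO, and the CO is driven by
    the mean field of all POs. Returns the mean frequency of each PO
    and of the CO over [0, T]. All constants are illustrative."""
    random.seed(0)
    theta = [random.uniform(0.0, 2.0 * math.pi) for _ in omega_pos]
    theta0 = list(theta)
    theta_co = 0.0
    for _ in range(int(T / dt)):
        drive = sum(math.sin(t - theta_co) for t in theta) / len(theta)
        new_co = theta_co + dt * (omega_co + g * drive)
        theta = [t + dt * (w + k_fb * math.sin(theta_co - t))
                 for t, w in zip(theta, omega_pos)]
        theta_co = new_co
    freqs = [(t - t0) / T for t, t0 in zip(theta, theta0)]
    return freqs, theta_co / T

# One "object" (three POs near the CO's frequency) and one far from it.
freqs, f_co = simulate([1.0, 1.0, 1.0, 4.0, 4.0, 4.0], omega_co=1.0)
# freqs[0..2] end up locked to f_co; freqs[3..5] keep drifting.
```

The assembly at frequency 1.0 ends up in the "focus of attention" (phase-locked to the CO), while the assembly at 4.0 stays outside the locking range |ω − Ω| < k_fb.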
In section 2 we give a short description of AMCO and show how it can be used to track a single target moving among a set of distractors. The main idea of tracking m targets is to use a network that consists of m copies of interactive AMCO, with each copy tracking one particular target. When implementing this idea, the following problems have to be solved. First, one should prevent the situation where the same target is simultaneously tracked by several AMCO. Second, the model should be able to operate when objects intersect during their movements. Since objects are identical and assumed to be moving randomly and unpredictably, there is no possibility of errorless identification of a target after it has been occluded by a distractor. In this case, the best strategy for the attention system is to keep in the focus of attention either of the two objects that have just been separated. Also, there is no possibility of recalling the identities of two target objects after their mutual occlusion. But the attention system should be able to track them both after separation. This strategy allows the attention system to permanently hold exactly m isolated objects, which is important to prevent the multiplication of attended objects that would otherwise take place.
An Oscillatory Neural Model of Multiple Object Tracking
Central oscillator
Peripheral oscillators
Input image
Figure 1: AMCO architecture. The hollow arrow shows the assignment of natural frequencies of POs. Black arrows show synchronizing connections that are used to (1) bind the POs coding an object into a synchronous assembly and (2) synchronize an assembly of POs with the CO. The gray arrow shows desynchronizing connections that are used to avoid simultaneous synchronization of the CO with several assemblies of POs.
In section 3 we describe how interactive AMCO layers are combined in the MOT model. In section 4 the results of computer simulation of MOT are presented. Section 5 contains discussion.

2 Tracking a Single Target

The architecture of AMCO is shown in Figure 1. The input to the network is a grayscale image on the plane grid of the same size as the grid of POs. Each PO receives an external signal from the pixel whose location on the plane is identical to the location of the PO. We use a conventional coding scheme where the external signal determines the natural frequency of oscillations (Niebur, Kammen, & Koch, 1991; Kuramoto, 1991). It is assumed that the external signal is formed in the lateral geniculate nucleus and depends on the contrast between the intensities of the pixel and the background. The value of the natural frequency of a PO is given by the formula ωi = λ(B − Ii ), (0 ≤ Ii ≤ B), where ωi is the natural frequency of the ith PO, Ii is the gray level of the ith pixel, B is the gray level of the background, and λ is a scaling parameter. Thus, the natural frequency is higher if the contrast is higher.
The POs corresponding to the pixels of objects are called active, and their dynamics are determined by equations A.1 to A.4 in appendix A. The POs corresponding to the pixels of the background are called silent. While an oscillator is silent, it does not participate in the dynamics of the system. The POs have synchronizing local connections with their nearest neighbors. These connections are used to bind the POs representing an isolated object into a coherent assembly. This is done in agreement with the synchronization hypothesis of feature binding (Singer, 1999). The CO has desynchronizing feedforward and synchronizing feedback connections to each PO. The synchronizing connections are used to phase-lock the CO by an assembly of POs. The desynchronizing connections are used to segregate different assemblies of POs in frequency space to prevent simultaneous synchronization of the CO with several assemblies of POs. The interplay between the synchronizing and desynchronizing connections of the CO results in the competition of different assemblies of POs for the synchronization with the CO. Only one assembly of POs can win this competition; therefore, in AMCO at each moment, only one object can be attended (excluding short transitory periods). The POs representing this object work synchronously with the CO, which results in resonance: the amplitudes of these oscillators rapidly increase to a high level. The amplitudes of other POs (which do not work coherently with the CO) are shut down to a low level. A high enough amplitude of a PO is taken as a criterion that this oscillator is included in the focus of attention. The amplitude of the CO in AMCO is a constant. Since the CO can be phase-locked only by those POs whose natural frequencies are in some range around the natural frequency of the CO, the natural frequency of the CO is adapted to its current value. As a result, the natural frequency of the CO becomes equal to the current frequency of an assembly of POs. 
Such adaptation allows the CO to “search” for an assembly of POs that is an appropriate candidate for synchronization. Our previous publications on AMCO have dealt with stationary objects only. Images with moving objects are more difficult to process since the attention system should be able to properly react to the changes in object locations and also to the intersection and separation of objects during their movements. We illustrate the functioning of AMCO in the case of a visual field of size 25 × 50 pixels that contains nine black squares of size 7 × 7 on a white background. All pixels that belong to the squares receive the same illumination and therefore have the same contrast relative to the background. To comply with the timescales that are typically used in the MOT experiments, we have selected the time unit equal to 100 msec and used gamma oscillations as the range of working frequencies. In computations, the natural frequencies of all active POs were set to ωi = 5 (5 oscillations in 100 msec), which corresponds to the frequency 50 Hz. The amplitudes of active POs have the initial value 2 and vary in the range (0, 11). The threshold for a resonant
Figure 2: Single object tracking. Processing an image with a single target and eight distractors. Attended pixels are black, nonattended pixels of objects are gray, and pixels of the background are white.
amplitude is R = 8.8. If the amplitude of a PO exceeds R, the corresponding pixel is assumed to be included in the focus of attention. Figure 2 displays movie frames of the dynamics of the image at the moments (in seconds) 0, 0.4, 0.8, . . . (time interval between the frames is four time units). The frames are ordered from left to right and from top to bottom. The top-left frame shows the initial position of the squares. Later
the squares move around, each in one of the four directions (up, down, left, right) chosen randomly with probability 0.25. A new direction is chosen at the moments (in seconds) 0.1, 0.2, 0.3, . . . A movement is omitted if it would lead to crossing the borders of the visual field.
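The movement rule just described can be sketched as follows (a simplified illustration; the coordinate convention is ours, and collision handling between squares, used later for MOT, is omitted):

```python
import random

# Visual field and square size from section 2 (25 x 50 pixels, 7 x 7 squares);
# we assume x runs along the width and y along the height.
WIDTH, HEIGHT, SQUARE = 50, 25, 7

def step(x, y):
    """One movement step for the top-left corner (x, y) of a square:
    each of the four directions is chosen with probability 0.25, and the
    movement is omitted if it would cross the border of the visual field."""
    dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
    nx, ny = x + dx, y + dy
    if 0 <= nx <= WIDTH - SQUARE and 0 <= ny <= HEIGHT - SQUARE:
        return nx, ny
    return x, y  # movement omitted at the border
```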
Figure 3: Amplitudes of POs at the moments (top) t = 11.6 (second frame in the last row of Figure 2) and (bottom) t = 12.0 (third frame in the last row of Figure 2).
In Figure 2, attended pixels are shown as black, nonattended pixels are gray, and pixels of the background are white. Initially the squares are regularly distributed in the image. Due to movements, the distribution of the squares around the field becomes random, and complex objects appear that represent different combinations of overlapping squares. At the initial moment, no object is under attention. After a short lag, attention is automatically focused on a randomly chosen square. In the case shown in Figure 3, it is the square in the middle of the image. The focus of attention is firmly attached to an object while this object moves in isolation from other objects. But quite soon, the attended square crosses a complex object formed by several overlapping squares (this situation is shown in the fourth frame of the first row). When an attended object is included in a cluster of overlapping objects, attention gradually spreads to the pixels surrounding the attended object. If an attended complex object is later split into two isolated objects, attention is focused on one of them. Such a movement of the attention focus along the image is reminiscent of passing a baton in a relay race, with the only difference being that the process of baton passing has a probabilistic nature. One may think that the focus of attention nearly disappears in the second frame of the bottom row of frames and magically reappears in the next frame. In fact, there is no magic here. For a short time, the amplitudes of the oscillators in the attended area drop slightly below the threshold R (see Figure 3, top). But soon afterward, the amplitudes in the attended area again exceed the threshold (see Figure 3, bottom). At this moment, the two attended squares become separate, and attention is focused on one of them.
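The attention readout used in Figures 2 and 3 amounts to thresholding the PO amplitudes (a minimal sketch; the amplitude dynamics themselves are given by equations A.1 to A.4 in appendix A):

```python
# Resonance criterion from section 2: a pixel is counted in the focus of
# attention when its PO amplitude exceeds the threshold R = 8.8.
R = 8.8

def attended_pixels(amplitudes):
    """Return the indices of pixels whose PO amplitude exceeds R, that is,
    the pixels currently included in the focus of attention."""
    return [i for i, a in enumerate(amplitudes) if a > R]
```

With amplitudes varying in (0, 11), an attended assembly whose amplitudes briefly dip below R leaves the rendered focus for a moment and reappears once they recover, which is exactly the effect seen between consecutive frames of Figure 2.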
Note that in Figure 3, the columns whose amplitudes rise abruptly above those of the surrounding pixels correspond to the pixels to which squares have just moved.

3 The Model of Multiple Object Tracking

The architecture of the MOT network is shown in Figure 4. This architecture corresponds to the case of three targets (in general, the number of layers equals the number of targets); therefore, three layers of AMCO are shown as the components of the model. We consider these layers to be attentional subsystems. Each subsystem should track its own target. For convenience of reference, the layers are indexed by different colors: red, green, and blue. The equations of the dynamics of the network are given in appendix B for the general case of m targets. The POs that occupy the same location on the plane but belong to different layers form a column. The POs in a column are bound by strong all-to-all synchronizing connections. All POs in a column receive the same external signal from the corresponding pixel of the visual field. As in the case of a single AMCO, the external signal codes the contrast between the intensities of the given pixel and the background. This signal determines the values
Figure 4: Architecture of the network for MOT, shown for three layers (Layer 1 RED, Layer 2 GREEN, Layer 3 BLUE, with central oscillators CO1–CO3 and synchronizing and desynchronizing connections). The layers of the network are indexed by different colors. Each layer of the network is responsible for tracking a single target.
of the natural frequencies of POs; therefore, the natural frequencies of the oscillators in a column are identical, which results in rapid synchronization of POs in a column. The local connections between POs in a layer are restricted to the nearest neighbors. These connections are synchronizing. They are strong enough to synchronize the columns of POs that belong to an isolated object. Thus, a coherent assembly of POs is formed in the network in response to stimulation by an individual object in the visual field. The COs belonging to different layers are bound by desynchronizing connections. Such connections are introduced to prevent the synchronization of different COs with the same synchronous assembly of POs. As a result, nonoverlapping targets are coded in the network by noncoherent oscillatory activity of different assemblies of POs. As in the case of a single target, if an assembly of POs in the kth layer works synchronously with the CO in this layer, the amplitudes of these oscillators go up and exceed the threshold for the resonance. If all POs of
the kth layer corresponding to the pixels of an object are in the resonant state, this is interpreted as the fact that this object is included in the focus of attention of the kth attentional subsystem. If objects move slowly enough and do not overlap during their movements, the attention focus (after being formed) is rather stable due to the resonance of POs included in the focus of attention. Resonant oscillators have a much stronger influence on the CO of their layer, which prevents a jump of attention to an assembly of nonresonant oscillators. If the speed of object movements is high relative to the rate of the processes of synchronization and resonance, attention can spontaneously switch from one object to another. This results in errors in distinguishing between targets and distractors. Consider what happens to the attention focus if two objects cross each other. If both objects are unattended, no change of the focus of attention takes place. Temporarily a complex distractor (composed of two objects) is formed, but this has no effect on objects in the focus of attention. In the case when attended and unattended objects overlap, attention is spread to the whole composite object because all POs belonging to this object will be included in a common assembly of synchronized oscillators. This assembly will work synchronously with the same CO that was synchronous with the attended assembly of POs before overlapping. After becoming separate again, the objects renew their competition for being included in the focus of attention. Due to the desynchronizing influence of the CO on POs, only one of the two objects is able to win the competition. Since the whole composite object has been temporarily attended, the system has no information on how to detect which object had been attended before the intersection took place. In this situation, either of the two objects can be newly selected in the focus of attention. 
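The outcome of such an overlap episode for a single attentional subsystem can be sketched as follows (a toy illustration; the function and its names are ours):

```python
import random

def focus_after_separation(was_attended_a, was_attended_b):
    """Focus state of two objects after they separate (our reading of the
    text): if the composite object was attended, exactly one of the two
    objects keeps the focus, chosen at random with probability 0.5;
    an unattended composite leaves both objects unattended."""
    if not (was_attended_a or was_attended_b):
        return (False, False)        # unattended complex: nothing changes
    keep_a = random.random() < 0.5
    return (keep_a, not keep_a)      # exactly one object stays attended
```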
The choice is random; therefore, it may lead to an error in target identification with probability 0.5. If two attended objects overlap, both continue to be in the focus of attention despite the fact that there is a desynchronizing connection between the COs. This is achieved by making the desynchronization weak enough relative to the synchronizing influence that comes to both COs from the common assembly of synchronous POs. When the objects move apart, the desynchronizing connection between the COs renews its influence on these oscillators. As a result, each object will again be tracked by its own AMCO layer. Of course, it is possible that objects, say A and B, that before the intersection have been tracked by, for example, the “red” and “green” layers, will exchange their tracking systems: after separation, the “red” layer will be used to track object B, and object A will be tracked by the “green” layer. Whether this exchange happens depends on how strong the overlapping has been and how long it has lasted. In fact, the interplay between the synchronizing and desynchronizing influences on a CO is even more intricate than the one we have just described. Computational experiments have shown that a constant desynchronizing
interaction between COs cannot ensure the proper behavior of the COs. One type of error appears if the desynchronizing interaction between COs is too strong. In this case, a CO may lose synchronization with the assembly of POs at the moment when two attended objects overlap. As a result, only one CO will maintain synchronization with a composite object formed by two previously attended objects. Another type of error may appear if the strength of desynchronization between the COs is too weak. In this case, two COs may maintain synchronization with the same object after the separation of two simultaneously attended objects. In fact, no constant value of the connection strength between the COs allows the system to avoid one or the other of these errors. But the problem can be solved if the interaction between the COs increases after two attended objects overlap. This is done by using the idea of resonance, but applying it now to the amplitudes of the COs. Therefore, in the MOT model, the amplitudes of the COs are no longer constant. A resonant increase of the amplitude of a CO takes place if two attended objects cross each other (see equation B.5 in appendix B). According to the last term in equation B.1, this results in an increase of the strength of desynchronization between the COs that track these objects. Therefore, at the moment of separation of the objects, the strength of desynchronization will be high enough to prevent the situation when both COs track one object, leaving the other one outside the focus of attention. To follow the experimental conditions of MOT, the model should not only be able to track moving objects but should also make a proper choice of a set of targets at the initial phase of MOT. In the experiments, targets are indicated to the observer by a brief flash of light. In the model, the notion of saliency is used to formalize the choice of flashed objects in the focus of attention.
It is assumed that flashed objects are more salient than other objects and that this leads to automatic attraction of attention to these objects. In terms of the model, the saliency of a pixel determines the strength of the influence of the PO corresponding to this pixel on the COs. Thus, more salient objects have an advantage in being included in the focus of attention. The idea of the saliency map was introduced by Koch and Ullman (1985) and has been intensively used in computational models of visual search (Itti, Koch, & Niebur, 1998; Itti & Koch, 2000, 2001; Olshausen et al., 1993). The saliency map is a two-dimensional table that encodes the saliency of objects in the visual environment and determines the priority of their choice in the focus of attention. In the MOT model, the saliency map is formed as a set of parameters si that correspond to the pixels of the image and determine the strength of influence of POs on COs, as shown in equation B.1 in appendix B. To reflect the difference in saliency between flashed and nonflashed objects, the saliency si takes one of two positive values: a higher value Sflashed corresponds to the pixels of flashed objects, and a lower value Snonflashed corresponds to the pixels of nonflashed objects. For the pixels of the background, si = 0. The value of Sflashed should be several
times higher than Snonflashed to provide the assemblies of POs that represent flashed targets with much better chances of winning the competition for synchronization with the COs than the assemblies corresponding to nonflashed distractors. In computer simulations, we put Sflashed = 5 and Snonflashed = 0.2. When flashing is over, all objects become equally salient. This is reflected in the model by making all values of si for the pixels of objects identical. In simulations, si = 1.

4 Model Simulation and Comparison with Experimental Data

Two types of MOT model simulations are considered below, corresponding to movements without and with overlap, respectively. Computations of the first type are used to compare the performance of the model with recent experimental data of Oksama and Hyönä (2004). In simulations of the second type, overlapping objects are used to demonstrate that the model follows the procedure described in section 3. Oksama and Hyönä (2004) experimented with a set of 12 objects. To accelerate computation, we reduced the number of objects to 10, as in the experiments of Pylyshyn and Storm (1988). This does not significantly affect the results. In simulations, objects are black squares of size 7 × 7 pixels on a white background in a field of size 30 × 60 pixels. As in section 2, the pixels of the squares are coded in the network by natural frequencies of POs equal to 50 Hz. Tracking of k targets (2 ≤ k ≤ 5) implies that a network with k layers is used. The timescale for simulations has been chosen as in section 2. A single run of the model takes 7.2 sec and consists of three phases. The first phase takes 0.7 sec and is used for marking the targets. During this period, objects are motionless; the only distinction between targets and distractors is their saliency, as explained in section 3. The desirable result at the end of this phase is the focusing of attention on the targets so that each target is attended by one attentional subsystem.
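The two-valued saliency map described above can be sketched as follows (the saliency values are from the text; the function form is our reading):

```python
# Saliency values from the text: flashed objects get a several-times-higher
# saliency than nonflashed ones during marking; background pixels get 0;
# after flashing, all object pixels get saliency 1.
S_FLASHED, S_NONFLASHED = 5.0, 0.2

def saliency(is_object, is_flashed, during_flash):
    """Saliency s_i of a pixel, which scales the influence of the
    corresponding PO on the COs (equation B.1 in appendix B)."""
    if not is_object:
        return 0.0                      # background pixels
    if during_flash:
        return S_FLASHED if is_flashed else S_NONFLASHED
    return 1.0                          # all objects equally salient
```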
An error may appear if two attentional subsystems are focused on the same target or if a distractor is chosen in the focus of attention of one of the subsystems. Computations have shown that the probability of such errors is less than 0.005; therefore, the initial acquisition of targets in the focus of attention is nearly errorless. The second phase lasts 6 sec. This is when objects move in a random manner. The speed of motion is 1 pixel per 50 msec; that is, every 50 msec, all squares make a movement of 1 pixel in one of the four directions: up, down, left, or right. For each square, the direction of the movement is chosen randomly and independently with probability 0.25. To prevent collisions, the motion is subject to the restriction that the squares should always be separated by at least one pixel of the background. First, the horizontal or vertical direction of the movement is chosen with probability 0.5. Then, for horizontal movements, the left or right direction is taken with
the probability 0.5, and for vertical movement, the up or down direction is taken with the probability 0.5. If there is a danger of collision, the direction of the movement is reversed. If the danger of collision exists for both opposite directions, no movement is undertaken at the current moment. The same rules are applied to prevent the squares from crossing the borders of the visual field. The third phase takes 0.5 sec. During this phase, all movements are stopped. This time is given to the system to resolve an ambiguous situation when several objects are simultaneously attended by an attentional subsystem. Such a situation appears from time to time if objects are moving. When objects are stationary, the time interval of 0.5 sec is long enough for the attentional subsystem to choose which of these objects should be kept in the focus of attention. Other objects are automatically excluded from the focus of attention as a result of desynchronizing influence of the CO on POs in the corresponding layer. At the final moment of the third phase, the number of object identification errors is registered. According to the principles of system design and functioning, at this moment exactly k squares are attended by the network with k layers. Some of the attended squares are targets, but some of them may be distractors due to errors in attention focusing during object movements. Therefore, two types of error may appear: a target is identified as a distractor, or a distractor is identified as a target. According to the strategy implemented in the model, the number of errors is always even: if there is an error in attending a target (a target is missed by all attentional subsystems), this inevitably results in taking a distractor in the focus of attention. To estimate the performance of the model, we executed 50 runs of the system for each target set size k = 2, 3, 4, 5. The results of computations are shown in Table 1. 
The analysis of variance (ANOVA) test has been used to check whether the means of 50 trials in the four groups (corresponding to different numbers of targets) differ. The null hypothesis is that all the groups are drawn from the same probability distribution (or from different distributions with the same mean). The result is F = 43.7, and the p-value is less than 0.0001; therefore, the results of our simulations do not support the null hypothesis. Further analysis by a pairwise T-test gives the following results: T23 = 4.6, T34 = 3.7, T45 = 3.1. These values of the T-test do not support the null hypothesis that the mean of group k equals the mean of group (k − 1) for k = 3, 4, 5. Therefore, the alternative hypothesis that the mean of group k is larger than the mean of group (k − 1) is supported. In experiments, Oksama and Hyönä (2004) estimated human performance by using a probe object that the subject had to identify as a target or a distractor. To exclude a bias in guessing, a probe object was selected with probability 0.5 from the set of targets and from the set of distractors. In contrast to Oksama and Hyönä, we have not made direct computational experiments with probe objects but estimated the
Table 1: Results of Identification of Targets and Distractors in the MOT Simulation.

                                                 Target Set Size
                                                 2       3       4       5
Number of errors                                 2      38      86     134
Mean number of errors per trial                  0.04    0.76    1.72    2.68
Standard deviation                               0.3     1.1     1.5     1.6
Probability of error checked by probe objects    0.006   0.09    0.179   0.268
probability of error in probe identification based on the number of errors in each run. Let s be the number of objects (s = 10), k the number of targets (k = 2, 3, 4, 5), and e the number of the targets that have been mistakenly identified as distractors in a run of simulations (hence, the same number of distractors has been mistakenly identified as targets). Then the probability of error if checked by a probe object is

P = 0.5 (e/k + e/(s − k)) = 0.5se / (k(s − k)).
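The last row of Table 1 can be reproduced from the formula above (a short sketch; our reading is that the "number of errors" row counts both error types, so e is half the mean number of errors per trial):

```python
S = 10  # total number of objects

def probe_error_probability(e, k, s=S):
    """P = 0.5*(e/k + e/(s-k)) = 0.5*s*e/(k*(s-k)), where e is the number
    of targets mistakenly identified as distractors in a run."""
    return 0.5 * s * e / (k * (s - k))

# Mean number of errors per trial from Table 1; since errors come in pairs,
# e = mean_errors / 2 (our reading of the error count).
for k, mean_errors in {2: 0.04, 3: 0.76, 4: 1.72, 5: 2.68}.items():
    print(k, round(probe_error_probability(mean_errors / 2, k), 3))
# Reproduces the last row of Table 1: 0.006, 0.09, 0.179, 0.268.
```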
Using this formula, we computed the values of P for each run and averaged these values over all simulation trials. The results are presented in the last row of Table 1 and in Figure 5. For comparison, Figure 5 also shows the accuracy of humans in MOT (Oksama & Hyönä, 2004). Both error patterns in Figure 5 show similar behavior: the probability of error increases with a larger number of targets. The main difference is that the probability of error for the case of two targets is underestimated in simulations in comparison to the experimental data. To illustrate the functioning of the system in the case when objects are permitted to overlap during their movements, we present a computational experiment where six identical objects (three targets and three distractors) move in the visual field. Again, objects are black squares of size 7 × 7 pixels on a white background. The size of the field is 19 × 62 pixels. A session of simulation is divided into two phases. The first, short phase (1.2 sec) is used to mark the targets and select them into the attention focus. In this phase, all squares are isolated and motionless. The selection is done in exactly the same way as in the case with nonoverlapping objects. During the second phase, lasting 9.6 sec, objects move according to the same rules as in the case of a single target (see
Figure 5: Accuracy of probe object identification by the MOT model in comparison with humans. Error rates (simulation vs. experimental data) are shown as a function of the number of targets tracked. Experimental data are taken from Oksama and Hyönä (2004).
section 2). The movie frames in Figure 6 illustrate the dynamics of the image and the process of attention focus formation and switching. The frames are ordered from left to right and from top to bottom. The time interval between the frames is four time units (0.4 sec). The top-left frame shows the initial position of the objects in the image. The next frame shows a moment during the marking period when the attention focus with three targets is formed. Later, the objects begin their random movements, which lead to the formation of different combinations of overlapping objects. Colors in Figure 6 do not reflect the colors of the squares in the image (we have already mentioned that all squares in the image are black). Colors are used to show which objects are under the attention of the different attentional subsystems (network layers). The color of a pixel depends on the state of the oscillators in the column that corresponds to this pixel. If a PO in the “red” layer is in the resonant state, then the pixel is red. A similar principle of color assignment is applied to the green and blue colors. If several POs in a column are in the resonant state, the color of the pixel is a mixture of the basic colors. For example, the pixel is cyan if it is simultaneously attended
Figure 6: Multiple object tracking. Processing an image with three targets and three distractors. Each pixel is painted in red, green, blue, or a mixture of these colors. A pixel is red/green/blue if a “red”/“green”/“blue” oscillator in the column that corresponds to this pixel is in the resonant state. A pixel is cyan if both oscillators in the “blue” and “green” layers are in the resonant state. A pixel is black if no oscillator in the column is in the resonant state. A pixel is white if it belongs to the background.
in the “green” and “blue” layers. Black is used for nonattended pixels of objects, and white is used for the pixels of the background. The intensities of green, red, and blue colors are such that their combination in one pixel results in gray. (In fact, there are no such pixels in Figure 6.) Consider what happens with the attention focus while squares are moving. The pair of squares in the middle of the image is always outside the
attention focus. These squares are not attended whether they move as isolated objects or temporarily form a composite object. The POs representing the pixels of these objects always work with low amplitude. The pair of squares in the right part of the image shows the process of attention switching in the case when a complex object is formed by the overlap of attended and nonattended squares. If the time of overlap is short, after separation of the squares, attention correctly stays with the square that was attended before. If the area of overlap becomes large and the complex object exists for more than a very short time, attention spreads over the entire composite object. Therefore, after separation of the objects, either of them may be kept in the focus of attention, while the other is excluded from it. In fact, the functioning of the MOT model in this case does not differ from how AMCO works in the case of a single target. Computational experiments confirm that the much more intricate architecture of the MOT model does not lead to any additional complications in tracking a single target among distractors by a layer of the network. Finally, let us consider how the pair of squares in the left part of the image is processed by the system. Both squares are marked as targets and selected into the focus of attention. While these squares move separately, the attentional subsystem (network layer) tracking each object does not change; therefore, the colors of the squares (green and blue) in Figure 6 are kept unchanged during this period. The situation changes after a large enough area of overlap between the squares appears.
This area is painted cyan because it is simultaneously attended by two attentional subsystems, “green” and “blue.” Due to object movements, the size of the cyan area may increase or decrease; it may disappear if the overlap area is too small and reappear when the overlap area becomes large enough. The important fact is that as soon as the squares separate in the image field, each of them is tracked by a single attentional subsystem. The attentional subsystems may exchange targets only if at some moment the overlap of the squares does not allow the reliable identification of each square in the composite object. A thorough quantitative investigation of error rates in MOT with overlapping objects is in progress. Preliminary simulations have shown that error rates depend substantially on the speed of object movements. If the speed is low enough, as in the experiment presented above, the probability of error (excluding those errors that are inevitable after strong overlap of a target and a distractor) is rather low.

5 Discussion

Three main ideas are combined in the presented model of attention. First, oscillations and synchronization are used as a key mechanism in attention focus formation. The evidence that oscillatory activity and long-range
An Oscillatory Neural Model of Multiple Object Tracking
1431
phase synchronization play a major role in the attentional control of visual processing has been provided in studies of EEG (Herrmann & Knight, 2001), MEG (Sokolov et al., 1999; Gross et al., 2004), and local field potential recordings (Steinmetz et al., 2000; Fries, Reynolds, Rorie, & Desimone, 2001; Fries, Schroeder, Roelfsema, & Engel, 2002; Niebur, Hsiao, & Johnson, 2002; Fell, Fernandez, Klaver, Elger, & Fries, 2003; Liang, Bressler, Ding, Desimone, & Fries, 2003). In particular, it has been shown that modulation of neural synchrony determines the outcome of an attention-demanding behavioral task (Gross et al., 2004; Tallon-Baudry, 2004). Second, it is assumed that attention focusing is controlled by a special neural system, the so-called central executive (Baddeley, 1996; Cowan, 1988). In terms of the model, visual attention is normally characterized by synchronous activity of an assembly of neurons that represent the central executive and an assembly of neurons in the primary areas of the visual cortex. MOT represents a special situation when attention is distributed among several isolated objects. It is supposed that in this case, the central executive is split into several subsystems whose activity is desynchronized. Thus, synchronous oscillations are used to label different objects that are simultaneously included in the focus of attention. Third, the resonance is used to formalize the assumption that attention enhances neural activity in attended areas and inhibits responses to nonattended stimuli. The relation between the resonance as it is used in the model and attentional modulation of cortical activity should be explained in more detail. 
Though electrophysiological (Motter, 1993; Roelfsema, Lamme, & Spekreijse, 1998) and functional imaging studies (Somers, Dale, Seifert, & Tootel, 1999; Kanwisher & Wojciulik, 2000; Seifert, Somers, Dale, & Tootel, 2003) have shown that attentional modulation can be found as early as in the primary visual cortex, it is known that the strength of attentional effects increases as one moves up the cortical processing hierarchy (Treue, 2003). Moreover, Culham et al. (1998) in the fMRI studies of activation produced by attentive tracking of moving objects have found no enhancement in early visual cortex, but it has been shown that the signal more than doubles in parietal and frontal areas. How do these experimental results comply with the model? In the model, direct connections are used from POs to COs. This is a radical simplification of the real situation. In fact, there are many intermediate cortical structures in the pathway from the striate cortex to the higher regions occupied by the central executive. At the moment the flow of information reaches its final station, the difference in the activity of its attended and unattended components becomes quite clear. Still, this difference is not as high as the difference between resonant and nonresonant oscillations in the model. Therefore, one should not literally think that the amplitude of oscillations in the model is a relevant characteristic of the activity in the cortex as it is observed, for example, in fMRI studies. The amplitude of oscillations should be considered as a formal variable that positively correlates with the activity in the cortex and
1432
Y. Kazanovich and R. Borisyuk
determines the strength of interaction between cortical oscillators and the central executive. The theoretical explanation of MOT is based on the idea of preattentive assignment of an index to the objects tracked (Pylyshyn & Storm, 1988; Pylyshyn, 2001). It is supposed that this indexing can occur independently and in parallel at several places in the visual field. In contrast to this theory, in our model indexing is implemented in two stages, which include both preattentional and attentional levels. During the first stage, an oscillatory label is assigned to each object. All information related to a given object is coded in the form of synchronous (coherent) oscillations, while oscillations corresponding to different objects are incoherent. Although the oscillatory label is not constant but varies with time, it provides a reliable mechanism for distinguishing among identical targets. The oscillatory label is used in the second stage, when an attentional subsystem is associated with a single target, the one that is tracked by this subsystem. The number of this subsystem can be considered as the index of the target. The difference between attended and unattended objects is realized in the form of synchronization or desynchronization of the activity of assemblies of POs with the COs. The model has a rigid architecture of connections and predetermined interaction functions and parameters. Real biological systems for MOT must be much more flexible. A question that may arise is how a system with a fixed number of layers can adapt to tracking different numbers of targets. A trivial solution is to assume that the number of active COs is controlled by internal effort and is always set equal to the number of targets tracked. We think that a more plausible solution is a flexible type of interaction among the COs. 
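One way to picture such flexible interaction among the COs is as a signed coupling matrix: synchronizing within an assembly of COs that tracks one target, desynchronizing between assemblies. The sketch below is our own illustration, not code from the paper; the function name `co_coupling` and the unit weights are assumptions.

```python
import numpy as np

def co_coupling(assignment, w_sync=1.0, w_desync=1.0):
    """Signed coupling matrix for k central oscillators (COs).

    assignment[i] is the assembly (tracked target) that CO i belongs to.
    COs in the same assembly get a synchronizing (positive) weight,
    COs in different assemblies a desynchronizing (negative) weight,
    and there is no self-coupling (zero diagonal).
    """
    a = np.asarray(assignment)
    w = np.where(a[:, None] == a[None, :], w_sync, -w_desync)
    np.fill_diagonal(w, 0.0)
    return w

# Five COs tracking two targets: COs {0, 1, 2} form one assembly, {3, 4} the other.
W = co_coupling([0, 0, 0, 1, 1])
```

Tracking a single target would correspond to `co_coupling([0, 0, 0, 0, 0])`, which makes every off-diagonal entry synchronizing, so all five COs bind into one assembly.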
Suppose that the number k of COs is restricted (according to the experimental evidence, k = 5 is a reasonable upper bound for the number of targets tracked simultaneously), but the type of interaction between the COs can vary depending on the task. If one target is to be tracked, all the COs are bound into a synchronous assembly by synchronizing interaction among them. In the case of two targets, two assemblies of COs are formed, with synchronizing interaction among the COs within an assembly and desynchronizing interaction between the assemblies. This situation is equivalent to the case of the model with two COs. Similarly, an arbitrary number of targets below five can be tracked by grouping the COs into a proper number of assemblies. The dependence of the number of errors in MOT on the number of targets was the reason for associating MOT with resource-limited parallel processing (Pylyshyn & Storm, 1988). Our model presents an alternative explanation of this phenomenon. Although the processing of information in our model is purely parallel, computer simulations have shown that tracking becomes poorer as the number of targets increases. This is caused by the limited capacity of the phase space where several central oscillators have to operate simultaneously. Increasing the number of
central oscillators makes it more and more difficult for them to avoid temporal synchronization, which may result in unpredictable jumps of attention to nontarget objects. If the number of targets is below five, the probability of such jumps is very low for stationary objects, but it significantly increases when objects start moving with high speed. Due to the movements, the processes of synchronization and resonance do not have enough time to fully proceed, and this results in the loss of synchronization of a CO with a previously selected object. Although we have chosen the parameters of the model in such a way that it closely follows the real-time relations in the experiments with MOT, this fact should not be taken too literally. The model is too simple to be realistic in this respect. The very small number of pixels used for object representation, the restriction of PO interactions to nearest neighbors, and many other features of the model are dictated by the need to keep computations within a reasonable amount of time. In fact, the model is rather flexible in its operation time. Another choice of parameters, or even of the duration of the time unit, can lead to other time relations. Therefore, when comparing the performance of the model with human error rates, we used the experimental data averaged over time periods of 5, 9, and 13 sec. The data for 5 sec in the experiments with humans give lower values than those obtained in our simulations, but the pattern of error probabilities is the same. An improvement of the model in timescale representation is planned. In particular, the results of Oksama and Hyönä (2004) on the nonlinear variation of error rates with the duration of trials are a challenge for future development. The decision-making strategy used in the model is also oversimplified. The model is forced to always track a fixed number of objects. The experiments show that the human strategy is more clever (Pylyshyn & Storm, 1988; Oksama & Hyönä, 2004). If a subject has a feeling that correct identification of an object during tracking is doubtful, he or she is inclined to stop tracking this object and to focus attention on tracking a smaller number of objects. This causes a gradual reduction of human performance when the number of targets exceeds five and is probably the source of the smaller differences in the probability of error between the cases with four and five targets observed in MOT experiments relative to those found in simulations (see Figure 5). It is known that in MOT experiments, the quality of target tracking can be enhanced by grouping the targets into a virtual polygon and then tracking deformations of the polygon. This grouping can be done spontaneously or according to the instruction given to subjects (Yantis, 1992). These facts can be explained in terms of oscillatory neural networks by assuming that in this case, all targets are combined into a single visual object. The oscillators representing this object are synchronized along virtual borders of the polygon that are formed by some internal effort. As a result, attention is no longer divided, and the central executive operates in a standard manner as a single central oscillator. It is also possible that humans can follow some mixed strategy combining grouping with tracking individual objects. The
strategy of grouping may be an explanation for the cases when more than five targets are tracked successfully. But according to the data of Oksama and Hyönä (2004), if the number of targets exceeds five, the subjects are inclined to ignore some targets and focus attention on a smaller target set. In designing the MOT model, we intentionally tried to avoid the use of traditional image processing techniques such as shape analysis, connectivity testing, pattern recognition, and others. Therefore, the model can work equally well when objects in the visual field are not identical or even vary in shape. This is important, for example, if object movements take place in 3D space with the projection of objects on the retina constantly changing. The model design is also motivated by the fact that complex procedures of information processing demand more time and are assumed to be implemented by higher cortical structures. It was interesting to investigate whether MOT can be explained in terms of a simple network architecture where the only function of the top-down flow is to control the focus of attention.

Appendix A: Mathematical Description of AMCO

The oscillators comprising AMCO are described as generalized phase oscillators. The state of such an oscillator is described by three explicitly given variables: the phase of oscillations, the amplitude of oscillations, and the natural frequency of oscillations. The dynamics of AMCO are described by the following equations:

$$\frac{d\theta_0}{dt} = 2\pi\omega_0 + \frac{w_0}{n}\sum_{i=1}^{n} s_i a_i\, g(\theta_i - \theta_0) \tag{A.1}$$

$$\frac{d\theta_i}{dt} = 2\pi\omega_i - a_0 w_1 h(\theta_0 - \theta_i) + w_2 \sum_{j \in N_i} a_j\, p(\theta_j - \theta_i) + \rho \tag{A.2}$$

$$\frac{da_i}{dt} = \beta\bigl(-a_i + \gamma f(\theta_0 - \theta_i)\bigr) \tag{A.3}$$

$$\frac{d\omega_0}{dt} = -\alpha\left(2\pi\omega_0 - \frac{d\theta_0}{dt}\right) \tag{A.4}$$
In these equations, θ0 is the phase of the CO; θi (i = 1, . . . , n) are the phases of POs; dθ0/dt and dθi/dt are the current frequencies of the oscillators; ω0 is the natural frequency of the CO; ωi are the natural frequencies of POs; a0 is the amplitude of oscillations of the CO (a constant); ai are the amplitudes
of oscillations of POs; w0 , w1 , w2 are constant positive parameters that control the strength of interaction between oscillators; si is the parameter that distinguishes between active and silent oscillators; si = 1 if the PO is active, otherwise si = 0; Ni is the set of active POs in the nearest neighborhood of the oscillator i; ρ is a gaussian noise term with mean 0 and standard deviation σ ; functions g, h, p control the interaction between oscillators; f is a function that controls the amplitude of oscillations of POs and their transition to the resonant state; and α, β, γ are network parameters (positive constants). The values ωi are determined by the input signal; θ0 , θi , ω0 , a i are internal variables that characterize the state of the network. The functions g, h, p are 2π-periodic, odd, and unimodal in the interval of periodicity. f is 2π-periodic, even, positive, and unimodal in the interval of periodicity. An exact description of these functions and the values of the parameters that are used in computations can be found in Borisyuk and Kazanovich (2004). Equations A.1 and A.2 are traditional equations of phase locking. They correspond to the architecture of connections of Figure 1 and control the processes of synchronization and desynchronization in the network. Equation A.1 describes the dynamics of the CO. Equation A.2 describes the dynamics of POs. The noise ρ in equation A.2 is used as an additional source of desynchronization between assemblies of POs. It helps to randomize the location of different assemblies of POs in phase-frequency space, thus making them distinguishable by the CO. Equation A.3 describes the dynamics of the amplitude of oscillations of POs. This equation provides a mechanism for the resonant increase of the amplitude of oscillations. Let the interval of variation of f be ( f min , f max ). The amplitude of a PO increases to the maximum value a max = γ f max if the PO works synchronously with the CO. 
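As a numerical illustration of equations A.1 to A.4, the forward-Euler sketch below is ours, not the authors' implementation: sine couplings and f(x) = (1 + cos x)/2 are stand-ins with the required parity and unimodality (the exact functions are given in Borisyuk & Kazanovich, 2004), the nearest-neighbor w2 term is dropped, and all parameter values are assumptions.

```python
import numpy as np

def simulate_amco(n=8, omega_po=2.0, steps=8000, dt=1e-3,
                  w0=1.0, w1=2.0, a0=1.0, alpha=1.0, beta=10.0,
                  gamma=1.0, sigma=0.0, seed=0):
    """Forward-Euler sketch of equations A.1-A.4 (one CO, n identical POs).

    g is taken as sin and h as -sin (odd, 2*pi-periodic); f is taken as
    (1 + cos)/2 (even, positive, unimodal), so f_max = 1.
    """
    rng = np.random.default_rng(seed)
    theta0, omega0 = 0.0, omega_po                       # CO phase and natural frequency
    theta = np.linspace(0.0, np.pi, n, endpoint=False)   # PO phases, spread out
    a = np.zeros(n)                                      # PO amplitudes
    s = np.ones(n)                                       # all POs active (s_i = 1)
    for _ in range(steps):
        # A.1: the CO is pulled toward the mean field of active POs (g = sin)
        dtheta0 = 2 * np.pi * omega0 + (w0 / n) * np.sum(s * a * np.sin(theta - theta0))
        # A.2: POs are pulled toward the CO; with h = -sin the term -a0*w1*h synchronizes
        dtheta = (2 * np.pi * omega_po + a0 * w1 * np.sin(theta0 - theta)
                  + sigma * rng.standard_normal(n))
        # A.3: resonant amplitude dynamics with f = (1 + cos)/2
        da = beta * (-a + gamma * 0.5 * (1.0 + np.cos(theta0 - theta)))
        # A.4: the CO's natural frequency adapts toward its current frequency
        domega0 = -alpha * (2 * np.pi * omega0 - dtheta0)
        theta0 += dt * dtheta0
        theta += dt * dtheta
        a += dt * da
        omega0 += dt * domega0
    return a

# All POs share one natural frequency, so they phase-lock to the CO and their
# amplitudes rise above the resonance threshold R = 0.8 * gamma * f_max.
amps = simulate_amco()
```

With two PO groups of different natural frequencies, the frequency adaptation of equation A.4 lets the CO "search" for one group, which is the selection mechanism described in the text.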
The amplitude takes a low value a_min = γ f_min if the phase of the PO is significantly different from the phase of the CO. We say that a PO is in the resonant state if its amplitude exceeds the threshold R = 0.8 γ f_max. The parameter β determines the rate of amplitude increase and decay. Equation A.4 describes the adaptation of the natural frequency of the CO. According to this equation, the value of 2πω0 tends to the current frequency of the CO. Such adaptation allows the CO to “search” for an assembly of POs that is an appropriate candidate for synchronization.

Appendix B: Mathematical Description of the MOT Model

The equations of the MOT model dynamics represent a modification of equations A.1 to A.4 according to the scheme of Figure 4:

$$\frac{d\theta_0^k}{dt} = 2\pi\omega_0^k + \frac{w_0}{n_{res}}\sum_{i=1}^{n} s_i a_i^k\, g\bigl(\theta_i^k - \theta_0^k\bigr) - w_3 \sum_{l=1}^{m} a_0^l\, q\bigl(\theta_0^l - \theta_0^k\bigr) \tag{B.1}$$
$$\frac{d\theta_i^k}{dt} = 2\pi\omega_i^k - a_0^k w_1 h\bigl(\theta_0^k - \theta_i^k\bigr) + w_2 \sum_{j \in N_i} a_j^k\, p\bigl(\theta_j^k - \theta_i^k\bigr) + \frac{w_4}{m}\sum_{l=1}^{m} a_i^l\, p\bigl(\theta_i^l - \theta_i^k\bigr) + \rho \tag{B.2}$$

$$\frac{da_i^k}{dt} = \beta\bigl(-a_i^k + \gamma f\bigl(\theta_0^k - \theta_i^k\bigr)\bigr) \tag{B.3}$$

$$\frac{d\omega_0^k}{dt} = -\alpha\left(2\pi\omega_0^k - \frac{d\theta_0^k}{dt}\right) \tag{B.4}$$
In these equations, upper and lower indices are used to number layers and oscillators in a layer, respectively; k = 1, . . . , m, where m is the number of layers (the same as the number of targets in MOT). The normalizing parameter n_res is equal to the current number of resonant oscillators, but not less than 49 (the number of pixels in an object). The last term in equation B.1 describes the interaction between the central oscillators, and the function q determines the type of interaction (the negative sign before this term makes the interaction desynchronizing). The term before the noise in equation B.2 gives the interaction in a column of POs. The parameters w3, w4 are positive constants. The parameters si in equation B.1 form a saliency map: si > 0 for the pixels of objects and si = 0 for the pixels of the background. During the stage when some objects are flashed, the values of si are made higher for the pixels of flashed objects than for the pixels of nonflashed objects. When objects are homogeneously illuminated, the values of si are made identical for all pixels of objects. Equations B.3 and B.4 generalize equations A.3 and A.4 for the amplitudes of POs and the natural frequency of COs in the case of a multilayer network. The amplitudes of COs vary according to an equation similar to equation B.3,

$$\frac{da_0^k}{dt} = \beta\left(-a_0^k + \gamma_1\, r\!\left(\sum_{l=1,\ l\neq k}^{m} f\bigl(\theta_0^l - \theta_0^k\bigr) + 1\right)\right), \tag{B.5}$$

where

$$r(x) = \begin{cases} x, & x \le f_{max}, \\ f_{max}, & x > f_{max}. \end{cases}$$
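Equation B.5 together with the saturation r can be stepped numerically as follows. This is an illustrative sketch under our own assumptions: f is taken as 1 + cos (even, positive, with f_max = 2), and the function names and parameter values are ours. With this choice the saturated target for a CO in phase with its peers is twice the target for a fully desynchronized CO, matching the roughly twofold amplitude ratio mentioned in the text.

```python
import numpy as np

def r(x, f_max=2.0):
    """Saturation in equation B.5: r(x) = x for x <= f_max, else f_max."""
    return np.minimum(x, f_max)

def co_amplitude_step(a0k, theta0, k, dt=1e-3, beta=10.0, gamma1=1.0, f_max=2.0):
    """One Euler step of the CO amplitude equation (B.5) for layer k.

    theta0 holds the phases of all m COs; the sum runs over the other
    layers l != k, with f(x) = 1 + cos(x) as our stand-in.
    """
    others = np.delete(np.asarray(theta0), k)
    f_sum = np.sum(1.0 + np.cos(others - theta0[k]))
    da0k = beta * (-a0k + gamma1 * r(f_sum + 1.0, f_max))
    return a0k + dt * da0k

# Two COs in antiphase (desynchronized): f contributes nothing, so the
# amplitude of CO 0 relaxes toward gamma1 * r(0 + 1) = 1.
a_next = co_amplitude_step(0.0, np.array([0.0, np.pi]), k=0)
```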
The function r is necessary to normalize the variation of the amplitudes of COs. In computations, the values of the resonant amplitude of a CO were about two times higher than the amplitude of a nonresonant CO.

Acknowledgments

This work was supported by the Russian Foundation for Basic Research (Grants 03-04-48482 and 06-04-48806) and by the UK EPSRC (Grant EP/0036364).

References

Andres, P. (2003). Frontal cortex as the central executive: Time to revise our view. Cortex, 39, 871–895. Baddeley, A. (1996). Exploring the central executive. Quarterly Journal of Experimental Psychology, 49A, 5–28. Baddeley, A. (2002). Fractionating the central executive. In D. Stuss & R. T. Knight (Eds.), Principles of frontal lobe function (pp. 246–260). New York: Oxford University Press. Baddeley, A. (2003). Working memory and language: An overview. Journal of Communication Disorders, 36, 189–208. Barbas, H. (2000). Connections underlying the synthesis of cognition, memory, and emotion in primate prefrontal cortices. Brain Research Bulletin, 52, 319–330. Blaser, E., Pylyshyn, Z. W., & Holcombe, A. O. (2000). Tracking an object through feature space. Nature, 408, 196–199. Borisyuk, R., & Kazanovich, Y. (2003). Oscillatory neural network model of attention focus formation and control. BioSystems, 71, 29–36. Borisyuk, R., & Kazanovich, Y. (2004). Oscillatory model of attention-guided object selection and novelty detection. Neural Networks, 17, 899–915. Collette, F., & Van der Linden, M. (2002). Brain imaging of the central executive component of working memory. Neuroscience and Biobehavioral Reviews, 26, 105–125. Corchs, S., & Deco, G. (2001). A neurodynamical model for selective visual attention using oscillators. Neural Networks, 14, 981–990. Cowan, N. (1988). Evolving conceptions of memory storage, selective attention and their mutual constraints within the human information processing system. Psychological Bulletin, 104, 163–191. Culham, J., Brandt, S. A., Cavanagh, P., Kanwisher, N. 
G., Dale, A. M., & Tootell, R. (1998). Cortical fMRI activation produced by attentive tracking of moving targets. Journal of Neurophysiology, 80, 2657–2670. Daffner, K. R., Mesulam, M. M., Scinto, L. F. M., Cohen, L. G., Kennedy, B. P., West, W. C., & Holcomb, P. J. (1998). Regulation of attention to novel stimuli by frontal lobes: An event-related potential study. NeuroReport, 9, 787–791. Damasio, A. (1989). The brain binds entities and events by multiregional activation from convergent zones. Neural Computation, 1, 123–132.
D’Esposito, M., Detre, J. A., Alsop, D. C., Shin, R. R., Atlas, S., & Grossman, M. (1995). The neural basis of the central executive system of working memory. Nature, 378, 279–281. Duncan, J. (2001). An adaptive coding model of neural functions in prefrontal cortex. Nature Reviews Neuroscience, 2, 820–829. Egeth, H., & Yantis, S. (1997). Visual attention: Control, representation, and time course. Annual Review of Psychology, 48, 269–297. Eriksen, C. W., & St. James, J. D. (1986). Visual attention within and around the field of focal attention: A zoom lens model. Perception and Psychophysics, 40, 225–240. Fell, J., Fernandez, G., Klaver, P., Elger, C. E., & Fries, P. (2003). Is synchronized neuronal gamma activity relevant for selective attention? Brain Research Reviews, 42, 265–272. Fries, P., Reynolds, J. H., Rorie, A. E., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291, 1560–1563. Fries, P., Schroeder, J.-H., Roelfsema, P. R., Singer, W., & Engel, A. K. (2002). Oscillatory neural synchronization in primary visual cortex as a correlate of stimulus selection. Journal of Neuroscience, 22, 3739–3754. Gross, J., Schmitz, F., Schnitzler, I., Kessler, K., Shapiro, K., Hommel, B., & Schnitzler, A. (2004). Modulation of long-range neuronal synchrony reflects temporal limitations of visual attention in humans. Proc. Natl. Acad. Sci. (USA), 101, 13050–13055. Grossberg, S., & Raizada, R. (2000). Contrast sensitive perceptual grouping and object-based attention in the laminar circuits of primary visual cortex. Vision Research, 40, 1413–1432. Herrmann, C. S., & Knight, R. T. (2001). Mechanisms of human attention: Event related potentials and oscillations. Neuroscience and Biobehavioral Reviews, 25, 465–476. Hölscher, C. (2003). Time, space and hippocampal functions. Review of Neuroscience, 14, 253–284. Itti, L., & Koch, C. (2000). A saliency-based search mechanism for overt and covert shifts of visual attention. 
Vision Research, 40, 1489–1506. Itti, L., & Koch, C. (2001). Computational modeling of visual attention. Nature Reviews Neuroscience, 2, 194–203. Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1254–1259. Kanwisher, N., & Wojciulik, E. (2000). Visual attention: Insights from brain imaging. Nature Reviews Neuroscience, 1, 91–100. Katayama, K., Yano, M., & Horiguchi, T. (2004). Neural network model of selective visual attention using Hodgkin-Huxley equation. Biological Cybernetics, 91, 315–325. Kazanovich, Y. B., & Borisyuk, R. M. (1999). Dynamics of neural networks with a central element. Neural Networks, 12, 441–454. Kazanovich, Y., & Borisyuk, R. (2002). Object selection by an oscillatory neural network. BioSystems, 67, 103–111. Kazanovich, Y. B., & Borisyuk, R. M. (2003). Synchronization in oscillator systems with phase shifts. Progress in Theoretical Physics, 110, 1047–1058.
Koch, C., & Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 4, 219–227. Kryukov, V. I. (1991). An attention model based on the principle of dominanta. In A. V. Holden & V. I. Kryukov (Eds.), Neurocomputers and attention I. Neurobiology, synchronization and chaos (pp. 319–352). Manchester: Manchester University Press. Kuramoto, Y. (1991). Collective synchronization of pulse coupled oscillators and excitable units. Physica D, 50, 15–30. Liang, H., Bressler, S. L., Ding, M., Desimone, R., & Fries, P. (2003). Temporal dynamics of attention-modulated neuronal synchronization in macaque V4. Neurocomputing, 52–54, 481–487. Liu, G., Austen, E. L., Booth, K. S., Fisher, B. D., Argue, R., Rempel, M. I., & Enns, J. T. (2005). Multiple-object tracking is based on scene, not retinal coordinates. Journal of Experimental Psychology, 31, 235–247. Loose, R., Kaufmann, C., Auer, D. P., & Lange, K. W. (2003). Human prefrontal and sensory cortical activity during divided attention tasks. Human Brain Mapping, 18, 249–259. Mozer, M. C., & Sitton, M. (1998). Computational model of spatial attention. In H. Pashler (Ed.), Attention (pp. 341–393). London: UCL Press. Motter, B. C. (1993). Focal attention produces spatially selective processing in visual cortical areas V1, V2, and V4 in the presence of competing stimuli. Journal of Neurophysiology, 70, 909–919. Niebur, E., Hsiao, S. S., & Johnson, K. O. (2002). Synchrony: A neuronal mechanism for attentional selection? Current Opinion in Neurobiology, 12, 190–194. Niebur, E., Kammen, D. E., & Koch, C. (1991). Phase-locking in 1-D and 2-D networks of oscillating neurons. In W. Singer & H. Schuster (Eds.), Nonlinear dynamics and neuronal networks (pp. 173–204). Berlin: Vieweg Verlag. Oksama, L., & Hyönä, J. (2004). Is multiple object tracking carried out automatically by an early vision mechanism independent of higher-order cognition? An individual difference approach. 
Visual Cognition, 11, 631–671. Olshausen, B. A., Anderson, C. H., & Van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13, 4700–4719. Posner, M. I., Snyder, C. R. R., & Davidson, D. J. (1980). Attention and the detection of signals. Journal of Experimental Psychology: General, 109, 160–174. Pylyshyn, Z. W. (2001). Visual indexes, preconceptual objects, and situated vision. Cognition, 80, 127–158. Pylyshyn, Z. W. (2004). Some puzzling findings in multiple object tracking (MOT): I. Tracking without keeping track of object identities. Visual Cognition, 1, 301–322. Pylyshyn, Z. W., & Storm, R. W. (1988). Tracking multiple independent targets: Evidence for a parallel tracking mechanism. Spatial Vision, 3, 179–197. Roelfsema, P. R., Lamme, V., & Spekreijse, H. (1998). Object-based attention in the primary visual cortex of the macaque monkey. Nature, 395, 376–381. Scholl, B. J. (2001). Objects and attention: The state of the art. Cognition, 80, 1–46. Scholl, B. J., & Tremoulet, P. D. (2000). Perceptual causality and animacy. Trends in Cognitive Science, 4, 299–309.
Sears, C. R., & Pylyshyn, Z. W. (2000). Multiple object tracking and attentional processing. Canadian Journal of Experimental Psychology, 54, 1–14. Seifert, A. E., Somers, D. C., Dale, A. M., & Tootell, R. (2003). Functional MRI studies of human visual motion perception: Texture, luminance, attention and aftereffects. Cerebral Cortex, 13, 340–349. Shallice, T. (2002). Fractionation of the supervisory system. In D. T. Stuss & R. T. Knight (Eds.), Principles of frontal lobe function (pp. 261–277). New York: Oxford University Press. Singer, W. (1999). Neuronal synchrony: A versatile code for the definition of relations. Neuron, 24, 49–65. Sokolov, A., Lutzenberger, W., Pavlova, M., Pressl, H., Braun, C., & Birbaumer, N. (1999). Gamma-band MEG activity to coherent motion depends on task-driven attention. Neuroreport, 10, 1997–2000. Somers, D. C., Dale, A. M., Seifert, A. E., & Tootell, R. (1999). Functional MRI reveals spatially specific attentional modulation in human primary visual cortex. Proc. Natl. Acad. Sci. (USA), 96, 1663–1668. Steinmetz, P. N., Roy, A., Fitzgerald, P., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190. Tallon-Baudry, C. (2004). Attention and awareness in synchrony. Trends in Cognitive Sciences, 8, 523–525. Treue, S. (2003). Visual attention: The where, what, how and why of salience. Current Opinion in Neurobiology, 13, 428–432. Tsotsos, J. K., Culhane, S. M., Wai, W. Y. K., Lai, Y., Davis, N., & Nuflo, F. (1995). Modeling visual attention via selective tuning. Artificial Intelligence, 78, 507–545. Vinogradova, O. S. (2001). Hippocampus as comparator: Role of the two input and two output systems of the hippocampus in selection and registration of information. Hippocampus, 11, 578–598. Viswanathan, L., & Mingolla, E. (2002). Dynamics of attention in depth: Evidence from multi-element tracking. Perception, 31, 1415–1437. Wang, D. L. (1999). 
Object selection based on oscillatory correlation. Neural Networks, 12, 579–592. Yantis, S. (1992). Multielement visual tracking: Attention and perceptual organization. Cognitive Psychology, 24, 295–340.
Received April 25, 2005; accepted September 15, 2005.
LETTER
Communicated by Gustavo Deco
Analysis of Cluttered Scenes Using an Elastic Matching Approach for Stereo Images Christian Eckes [email protected] Fraunhofer Institute for Media Communications IMK, D-53754 Sankt Augustin, Germany
Jochen Triesch [email protected] Department of Cognitive Science, University of California, San Diego, La Jolla, CA 92093-0515, U.S.A.
Christoph von der Malsburg [email protected] Institut für Neuroinformatik, Ruhr-Universität Bochum, D-44780 Bochum, Germany
We present a system for the automatic interpretation of cluttered scenes containing multiple partly occluded objects in front of unknown, complex backgrounds. The system is based on an extended elastic graph matching algorithm that allows the explicit modeling of partial occlusions. Our approach extends an earlier system in two ways. First, we use elastic graph matching in stereo image pairs to increase matching robustness and disambiguate occlusion relations. Second, we use richer feature descriptions in the object models by integrating shape and texture with color features. We demonstrate that the combination of both extensions substantially increases recognition performance. The system learns about new objects in a simple one-shot learning approach. Despite the lack of statistical information in the object models and the lack of an explicit background model, our system performs surprisingly well for this very difficult task. Our results underscore the advantages of view-based feature constellation representations for difficult object recognition problems.

1 Introduction

The analysis of complex natural scenes with many partly occluded objects is among the most difficult problems in computer vision (Rosenfeld, 1984). The task is illustrated in Figure 1 and may be defined as follows: Given an input image with multiple, potentially partly occluded objects in front of an unknown complex background, recognize the identities and spatial

Neural Computation 18, 1441–1471 (2006)
C 2006 Massachusetts Institute of Technology
Figure 1: Scene analysis overview. Given a set of object models (examples on the left) and an input image of a complex scene (middle), the task of scene analysis is to compute a high-level interpretation of the scene giving the identities and locations of all known objects, as well as their occlusion relations (right). Our object models take the form of labeled graphs with a gridlike topology.
locations of all known objects present in the scene. Clearly, the ability to master this task is a fundamental prerequisite for countless application domains. Unknown backgrounds and partial occlusions of objects will be the order of the day whenever computer vision systems or visually guided robots are deployed in complex, natural environments. In such situations, it may be that segmentation cannot be decoupled from recognition, but the two problems have to be treated in an integrated manner. In fact, even today’s best “pure” segmentation approaches (e.g., Ren & Malik, 2002) seem to be severely limited in their ability to find object contours when compared to human observers. The system developed and tested here follows the popular approach of elastic graph matching (EGM), a biologically inspired object recognition approach that represents views of objects as 2D constellations of waveletbased features (Lades et al., 1993; Triesch & Eckes, 2004). EGM is an example of a class of architectures that represent particular views of objects as 2D constellations of image features (Fischler & Elschlager, 1973). EGM does not rely on initial segmentation, but directly matches observed image features to stored model features. It has been used successfully for recognizing objects, faces (including identity, pose, gender, and expression; Wiskott, ¨ Fellous, Kruger, & von der Malsburg, 1997), and hand gestures (Triesch & von der Malsburg, 2002), and has already demonstrated its potential for complex scene analysis in an earlier study (Wiskott & von der Malsburg, 1993). The description of object views as constellations of local features is potentially very powerful for dealing with partial occlusions, because the effect of features missing due to occlusion can be modeled explicitly in this approach. For example, missing features of the object “pig” in Figure 1 can be discounted if it is recognized that they are occluded by the object in front. 
In Bayesian terms, the missing features are explained away by the presence of the occluding object. In this capacity, our approach stands in contrast to
Analysis of Cluttered Scenes
1443
schemes employing more holistic object representations such as eigenspace approaches (Turk & Pentland, 1991; Murase & Nayar, 1995). The philosophy behind our current approach is to integrate multiple sources of information within an EGM framework. The motivation for this philosophy is that when dealing with very difficult (vision) problems, one is often well advised to follow the honorable “use all the help you can get” rationale. Recent research in the computer vision and machine learning communities has repeatedly highlighted the often surprising capabilities of systems that integrate many weak (i.e., individually poor) cues or classifiers. Our current system for scene analysis uses two extensions to the basic graph matching technique, both integrating a new source of information into the system. First, we use an extension of EGM to stereo image pairs, where object graphs are matched simultaneously in the left and right images subject to epipolar constraints (Triesch, 1999; Kefalea, 2001). The disparity information estimated during stereo graph matching is used to disambiguate occlusion relations during scene analysis. Second, we utilize richer features in the object description that fuse information from different visual cues (shape and texture, color) (Triesch & Eckes, 1998; Triesch & von der Malsburg, 2001). The combination of both extensions is shown to dramatically increase recognition performance. The remainder of the letter is organized as follows. In section 2 we describe conventional EGM in the context of scene analysis. Section 3 describes the extension of EGM for stereo image pairs. In section 4 we discuss our extension to richer feature descriptions—so-called compound feature sets. Section 5 presents our method for scene analysis in stereo images. Section 6 covers the results of our experiments. Finally, section 7 discusses the work from a broader perspective and relates it to other approaches in the field. 
2 Elastic Graph Matching

2.1 Object Representation. In elastic graph matching (EGM), views of objects are described as 2D constellations of features represented as labeled graphs. In particular, the representation of an object m ∈ {1, …, M} is an undirected, labeled graph G^m with N_V(m) nodes or vertices (set V^m) and N_E(m) edges (set E^m):

G^m = (V^m, E^m) with V^m = \{n_i^m, i = 1, \ldots, N_V(m)\} and E^m = \{e_j^m, j = 1, \ldots, N_E(m)\}.   (2.1)
Each node n_i^m is labeled with its corresponding position x_i^m in the training image I^m and with a feature set extracted from this position, denoted F(x_i^m, I^m). The feature set provides a local description of the appearance of the object m at location x_i^m. The edges e_j^m = (i, i′) connect neighboring nodes n_i^m and n_{i′}^m (i < i′) and are labeled with displacement vectors d_j^m = x_{i′}^m − x_i^m
encoding the spatial arrangement of the features. For our experiments, we use graphs with a simple grid-like topology. The distance between nodes in the x- and y-direction is 5 pixels, and each node is connected to its four nearest neighbors (see Figure 1). Thus, the number of nodes in the graph varies with object size. Our object database contains graphs of various sizes, ranging from 30 to 35 nodes covering the small tea box up to 198 nodes for the large stuffed pig. Note that in many applications of EGM, graph topologies are adapted to the particular problem domain (e.g., for faces or hand gestures).

2.2 Gabor Features. The features labeling the graph nodes are typically vectors of responses of Gabor-based wavelets, so-called Gabor jets (Lades et al., 1993). These features have been shown to be more robust with respect to small translations and rotations than raw pixel values, and they are heavily used in biological vision systems (Jones & Palmer, 1987). These wavelets can be thought of as plane waves restricted by a gaussian envelope function. They come in quadrature pairs (with odd and even symmetry, corresponding to sine and cosine functions) and can be compactly written as a single complex wavelet:
\psi_{\vec k}(\vec x) = \frac{k^2}{\sigma^2} \exp\left(-\frac{k^2 x^2}{2\sigma^2}\right) \left[ \exp(i \vec k \cdot \vec x) - \exp(-\sigma^2/2) \right].   (2.2)
For σ, we chose 2π. This choice leads to larger spatial filters than the choice σ = 2, which is sometimes preferred in the literature. Convolving the gray-level component I(\vec x) of an image at a certain position \vec x_0 with a filter kernel \psi_{\vec k} yields a complex number, denoted J_{\vec k}(\vec x_0, I):

J_{\vec k}(\vec x_0, I) = \int \psi_{\vec k}(\vec x_0 - \vec x) \, I(\vec x) \, d^2 x.   (2.3)
The family of filters in equation 2.2 is parameterized by the wave vector \vec k. To construct a set of features for a node of a model graph, different filters are obtained by choosing a discrete set of L × D vectors \vec k, with L = 3 different frequency levels and D = 8 different directions. In particular, we use

\vec k_{l,d} = \frac{\pi}{2} \left(\frac{1}{\sqrt{2}}\right)^{l} \begin{pmatrix} \cos(\pi d / D) \\ \sin(\pi d / D) \end{pmatrix}, \quad l \in \{0, 1, \ldots, L-1\}, \quad d \in \{0, 1, \ldots, D-1\}.   (2.4)
The set of all complex filter responses J_{\vec k_{l,d}} is aggregated into a complex vector called a Gabor-wavelet jet, or just Gabor jet. It is often convenient to
represent the complex filter responses as magnitude and phase values:

J_{\vec k_{l,d}} = |J_{\vec k_{l,d}}| \exp(i \phi_{\vec k_{l,d}}).   (2.5)
Previous studies have suggested that the magnitudes of the filter responses carry the information most useful for recognition purposes (Lades et al., 1993), and Shams and von der Malsburg (2002) have argued that their "population responses contain sufficient information to capture the perceptual essence of images." On this basis, we decided to use only the magnitude information to construct the feature sets used in the object representations. The feature set F(x_i^m, I^m) for node i of model m is thus a real vector with L × D components:

F(x_i^m, I^m) \equiv F_i^m = \left( |J_{\vec k_{0,0}}(x_i^m, I^m)|, \ldots, |J_{\vec k_{L-1,D-1}}(x_i^m, I^m)| \right)^T.   (2.6)
During recognition, we will attempt to establish correspondences between model features and image features in a matching process. To do so, we need to compare Gabor jets attached to the model nodes with Gabor jets extracted from specific points in the input image. We use the similarity function 4
Snode (F, F ) = cos F, F =
F · F F F
4 ,
(2.7)
where '·' denotes the dot or inner product. This similarity function measures the angle between the Gabor jets. It is invariant to changes in the length of the Gabor jets, which provides some robustness to changes in local image contrast. The fourth power helps to discriminate closely matching jets from merely average matches by making the distribution of similarity values sparser: high similarities occur less often, while the range of interesting similarity values (those close to 1) is spread out (Wiskott & von der Malsburg, 1993). Note that the computationally costly division by \|F\| and \|F'\| can be avoided by using normalized Gabor jets with \|F\| = \|F'\| = 1.

2.3 Elastic Matching Process. The goal of elastic matching is to establish correspondences between model features and image features. Matching individual model features to the image independent of one another suffers from considerable ambiguity. Robust matches can be obtained only by simultaneously matching constellations of many features. EGM performs a topographically constrained feature matching by comparing the entire model graph G^m with parameterized image graphs G^I(α, m). Such image graphs are generated by parameterized transformations T_α that can account for different transformations of the object in the image with respect to the
model, such as translation, scaling, rotation in plane, and (limited) rotation in depth. In particular, the transformation T_α maps the position of a model node x_i^m to a new position T_α(x_i^m). The model feature set attached to a particular node of the object model must be compared to features extracted at the transformed location of that node in the input image. This feature vector is denoted F(T_α(x_i^m), I). The similarity of model feature set i with the feature set extracted at the transformed node location in an input image I is written as

S_i(α, m, I) = S_{node}\left( F_i^m, F(T_α(x_i^m), I) \right),   (2.8)
where S_{node} is defined in equation 2.7. The similarity between the entire image graph corresponding to T_α and the model graph is defined as the mean similarity of the corresponding feature sets attached at each node:

S_{graph}\left( G^m, G^I(α, m) \right) = \frac{1}{N_V(m)} \sum_{i=1}^{N_V(m)} S_i(α, m, I).   (2.9)
EGM is an optimization process that maximizes this similarity function to find the best-matching image graph over the predefined space of allowed transformations T_α, α = 1, …, A:

\hat{α} = \arg\max_{α} \, S_{graph}\left( G^m, G^I(α, m) \right).   (2.10)
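For concreteness, the matching equations 2.7 to 2.10 can be sketched for the simplest transformation family, pure translations T_a(x) = x + a. The `extract_jet` callback stands in for the feature extraction of section 2.2; all names here are illustrative assumptions, not part of the original system.

```python
import numpy as np

def s_node(f, fp):
    """Eq. 2.7: fourth power of the cosine of the angle between two jets."""
    c = float(f @ fp) / (np.linalg.norm(f) * np.linalg.norm(fp) + 1e-12)
    return c**4

def match_graph(model_nodes, image, extract_jet, shifts):
    """Eqs. 2.8-2.10 restricted to translations T_a(x) = x + a.

    model_nodes: list of (position, jet) pairs; shifts: candidate
    translations a. Returns the best shift (eq. 2.10) and its graph
    similarity, the mean node similarity of eq. 2.9.
    """
    best_shift, best_sim = None, -1.0
    for a in shifts:
        sims = [s_node(jet, extract_jet(image, (x[0] + a[0], x[1] + a[1])))
                for x, jet in model_nodes]
        s_graph = float(np.mean(sims))            # eq. 2.9
        if s_graph > best_sim:                    # arg max of eq. 2.10
            best_shift, best_sim = a, s_graph
    return best_shift, best_sim
```

A toy usage: with an "image" that simply stores a jet per position and a model whose jets were taken at a shift of (2, 3), the matcher recovers that shift.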
The optimal transformation parameters \hat{α} depend on the particular object model m and the image I under consideration: \hat{α} = \hat{α}(I, m). Note that it is often useful in EGM to add a term to the graph similarity function S_{graph} that penalizes geometric distortions of the matched graph relative to the model graph, but this did not improve recognition performance in our particular application. In summary, the matching process yields a set of optimal correspondences \{T_{\hat{α}}(x_i^m)\}, the local similarity at each node of the matched graph, S_i(\hat{α}, m, I), and the global similarity for the complete graph, defined as \bar{S}(\hat{α}, m, I) = S_{graph}(G^m, G^I(\hat{α}, m)). These quantities are used in the following scene analysis to recognize the different objects and infer occlusion relations, as will be discussed below (see section 5).

2.4 Remarks on EGM. Before we go into the details, let us review our approach from a broader perspective. Graph matching in computer vision aims at solving the correspondence problem in object recognition and belongs to a family of NP-complete problems for which no efficient algorithms are known. Heuristics are therefore unavoidable. We follow a template-based approach to detect correspondences by generating (graph) templates for each
object, with features at the nodes describing local surface properties. This is a view-based approach to object recognition that preserves the local feature topology and is well supported by view-based theories of human vision (see, e.g., Tarr & Bülthoff, 1998; Edelman, 1995, 1998; Ullman, 1998). It is related to approaches based on matching deformable models to image features (see, e.g., Duc, Fischer, & Bigun, 1999, for face authentication) and to many other registration methods based on articulated template matching (see the pioneering work of Yuille, 1991). EGM aims at establishing a correspondence between image graphs and graphs taken from the model domain. In solving this problem, heuristics must be applied, despite some progress in generating model representations optimized for detecting subgraph isomorphisms (see, e.g., Messmer & Bunke, 1998). But how can we solve this problem more efficiently? One possibility is to perform hierarchical graph matching (see, e.g., Buhmann, Lades, & von der Malsburg, 1990) following a coarse-to-fine strategy (for an application in face detection, see Fleuret & Geman, 2001). However, this work focuses on how stereo and color features can be used to analyze complex scenes, so we refrained from implementing many of the ideas mentioned above. We have chosen a rather algorithmic version of the matching process in order to obtain a system with real-time or near-real-time capabilities. Without this, no sound judgment on the usefulness of our approach to cue combination can be made, since large-scale evaluation runs cannot be avoided. Our system selects graphs from an extended model domain able to sample translation, rotation, and scale, and matches these graphs to the image domain. It is evident that the space of possible translations, scalings, and rotations in plane and depth is too large to be searched exhaustively.
Fortunately, since we consider only moderate changes in scale and aspect here, a few simple heuristics can be used to make the matching problem tractable. Typically, a coarse-to-fine strategy is employed, where an initial search coarsely locates the object by evaluating the similarity function on a coarse scan grid and taking the global maximum, while subsequent optimization steps estimate the exact scale, rotation, and any nonrigid deformation parameters. This approach exploits the robustness of the Gabor features to small changes in scale and aspect of the object. We discuss the details of our matching scheme when we introduce the extension to matching in stereo image pairs in section 3.

3 Elastic Graph Matching in Stereo Image Pairs

The discussion so far has considered a single model graph being matched to a single input image. In this section, we extend this approach to matching in stereo image pairs. Stereo vision is traditionally considered in terms of attempting to recover the 3D layout of the scene, given two images taken from slightly different viewpoints, to produce a dense depth map of the
scene. The aim of scene analysis, however, is just to identify known objects in the scene and establish their locations and depth order. This does not necessarily imply recovering fine-grained three-dimensional shape information. Estimates of the relative distances of the different objects may be sufficient, and the computation of dense disparity maps may not be necessary. For this reason, we attempt to integrate the information from the left and right images at the level of entire object hypotheses rather than at the level of individual image and background features. The stereo problem is essentially a correspondence problem. Elastic graph matching is a successful strategy for solving correspondence problems, and its extension to stereo image pairs is straightforward (Triesch, 1999; Kefalea, 2001). Stereo graph matching searches the product space of the allowed left and right correspondences for a combined best match; that is, image graphs for the left and right input image are optimized simultaneously. This product space of simultaneous graph transformations in the left and right images, however, is far too large to be searched exhaustively, as it scales quadratically, O(A²), with the number A of transformations used in monocular matching. However, we can again speed up the search with a coarse-to-fine search heuristic. Also, just as in conventional stereo algorithms, it is possible to reduce the search space by exploiting the epipolar geometry, and in addition, we limit the allowed disparity range between the left and right image graphs to further reduce the search space. We extend the notation from above to distinguish left and right images, model graphs extracted from left and right training images, and node positions. Note that the two model graphs created from the left and right training images may have different numbers of nodes and edges. We denote the transformed node positions during matching as

T_{α^p}(x_i^{m^p}) \quad with \quad p \in \{R, L\},   (3.1)

where x_i^{m^L} denotes the position of the ith node in the left training image for model m. The set \{T_{α^L}, T_{α^R}\} encodes the space of allowed transformations (translation, scaling, and combinations thereof) of model graphs to image graphs. A = \{1, \ldots, A\} is the set of A transformations applied to the model positions. The similarity functions between the left and right image graphs and the left and right model graphs are defined by

S_{graph}^p\left( G^{m^p}, G^{I^p}(α^p, m^p) \right) = \frac{1}{N_V(m^p)} \sum_{i=1}^{N_V(m^p)} S_{node}\left( F_i^{m^p}, F(T_{α^p}(x_i^{m^p}), I^p) \right).   (3.2)

To obtain a combined or fused match similarity, we simply compute the average of the left and right matching similarities, resulting in the following
combined similarity function:

S_{stereo}\left( G^{m^L}, G^{m^R}, G^{I^L}(α^L, m^L), G^{I^R}(α^R, m^R) \right) = \frac{1}{2} S_{graph}^L\left( G^{m^L}, G^{I^L}(α^L, m^L) \right) + \frac{1}{2} S_{graph}^R\left( G^{m^R}, G^{I^R}(α^R, m^R) \right).   (3.3)
If the transformation parameters α^L and α^R were optimized independently, we would have two independent matching processes, which would double the complexity of the search. Such an increase in complexity, however, can be avoided if we exploit the epipolar constraint, which implies a set of allowed combinations of α^L and α^R. The epipolar constraint can be formalized as

\vec x^R \in E(\vec x^L) \implies T_{α^R}(x_i^{m^R}) \in E\left( T_{α^L}(x_i^{m^L}) \right),   (3.4)

where E(\vec x^L) denotes the epipolar plane defined by the point \vec x^L and the cameras' nodal points. This limits the product space of allowed transformation pairs (α^L, α^R) ∈ A × A to a subspace A_E ⊂ A × A:

A_E = \left\{ (α^L, α^R) \in A \times A \;\middle|\; T_{α^R}(x_i^{m^R}) \in E\left( T_{α^L}(x_i^{m^L}) \right) \right\}.   (3.5)
Given the definitions above, stereo graph matching aims at optimizing the combined similarity function over the subset of transformations fulfilling the stereo constraints:

\hat{α} \equiv (\hat{α}^L, \hat{α}^R) = \arg\max_{(α^L, α^R) \in A_E} S_{stereo}\left( G^{m^L}, G^{m^R}, G^{I^L}(α^L, m^L), G^{I^R}(α^R, m^R) \right).   (3.6)

In analogy to the previous section, the matching process computes a set of optimal correspondences for the left and right images, \{T_{\hat{α}^p}(x_i^{m^p})\}, the local similarity at each node of the matched graphs, S_i^p(\hat{α}^p, m^p, I^p), and the global similarity for the best stereo match:

\bar{S}(\hat{α}, m^L, m^R, I^L, I^R) = S_{stereo}\left( G^{m^L}, G^{m^R}, G^{I^L}(\hat{α}^L, m^L), G^{I^R}(\hat{α}^R, m^R) \right).   (3.7)

Additionally, it provides a disparity estimate for the optimal match for model m, denoted d(m, \hat{α}), which we compute from the mean
position of the matched left and right model graphs in the image pair:

d(m, \hat{α}) = \left\langle T_{\hat{α}^R}(x_i^{m^R}) \right\rangle_i - \left\langle T_{\hat{α}^L}(x_j^{m^L}) \right\rangle_j.   (3.8)
The x-component of this disparity estimate plays an important role in establishing occlusion relations in our scene analysis algorithm described in section 5.

3.1 Matching Schedule. For the experiments described below, we use a coarse-to-fine search heuristic for the stereo matching process. Since we use approximately parallel camera axes, the epipolar geometry is particularly simple. In a first, coarse matching step, the left and right image graphs are scanned across the image on a grid with 4-pixel spacing in the x- and y-direction. The disparity is allowed to vary in a range of 3 to 40 pixels in the x-direction and 9 pixels in the y-direction. In a second, fine matching step, the graphs are allowed to be scaled by a factor ranging from 0.9 to 1.1 in five discrete steps, independently of each other, while at the same time their locations are scanned across a fine grid with 2-pixel spacing in the x- and y-directions. At the same time, the disparity is allowed to be corrected by up to 2 pixels relative to the result of the first matching step. An example of stereo graph matching is shown in Figure 2, where we compare stereo graph matching with two independent matching processes in the left and right images. Due to proper handling of the epipolar constraint, stereo graph matching is able to avoid some matching errors that can occur with two independent matching processes. The matching schedule realizes a greedy optimization strategy and by construction converges to a local maximum of the graph matching similarity of equation 3.7. It should be noted that there is no standard matching schedule for template-based approaches that ensures convergence to a global maximum. Since the search space is huge, prior knowledge, or brute force, must be used.
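The stereo-specific pieces, the fused similarity of equation 3.3, the epipolar subspace of equation 3.5 for near-parallel camera axes, and the disparity estimate of equation 3.8, can be sketched as follows. The translation-only transformations, the disparity sign convention, and the function names are assumptions of this sketch; the numeric ranges follow the matching schedule above.

```python
import numpy as np
from itertools import product

def s_stereo(s_left, s_right):
    """Eq. 3.3: equal-weight fusion of the left and right graph similarities."""
    return 0.5 * s_left + 0.5 * s_right

def epipolar_pairs(shifts_left, shifts_right, d_min=3, d_max=40, dy_max=9):
    """Subspace A_E of eq. 3.5 for (near-)parallel camera axes: keep only
    translation pairs whose x-disparity lies in [d_min, d_max] and whose
    y-offset stays within dy_max (ranges from the matching schedule)."""
    return [(aL, aR) for aL, aR in product(shifts_left, shifts_right)
            if d_min <= aR[0] - aL[0] <= d_max and abs(aR[1] - aL[1]) <= dy_max]

def disparity(matched_left, matched_right):
    """Eq. 3.8: difference between the mean matched node positions of the
    right and left graphs; its x-component serves as the depth-order cue."""
    return (np.mean(np.asarray(matched_right, float), axis=0)
            - np.mean(np.asarray(matched_left, float), axis=0))
```

Restricting the fused optimization of equation 3.6 to the pairs returned by `epipolar_pairs` is what keeps the stereo search from scaling as O(A²).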
A common approach to face detection by Rowley, Baluja, and Kanade (1998) samples all possible translations, scales, and rotations before comparing the feature data extracted from the image with a statistical face model realized as a neural network. Only a more hierarchical and complex approach, such as that of Viola and Jones (2001b), may eventually help to overcome this problem. We refer to Forsyth and Ponce (2002) for a discussion of template-based approaches and their relation to alternative approaches to object recognition.

4 Compound Feature Sets

Our second extension to the standard EGM approach outlined so far is the introduction of richer feature combinations at the graph nodes, which we first introduced in Triesch and Eckes (1998). There are several ways of
Figure 2: Stereo matching versus two independent monocular matching steps: (a) Original stereo image pair. (b) Example result of stereo matching of the model “yellow duck” exploiting the epipolar constraint. (c) Result of two independent matching processes in the left and right images. In the right image, a background region obtained a higher similarity than the correct match. This solution was ruled out by the epipolar constraint in b.
introducing richer and more specific features. A standard approach is to assemble information from a bigger support area around the point of interest, that is, to somehow make the features less local. This approach is employed frequently, and the general idea is often referred to as considering a context patch. Example techniques using this approach are set out in Nelson & Selinger (1998) and Belongie, Malik, and Puzicha (2002). While this indeed leads to more specific features, a potential drawback of this idea has received little attention. The robust extraction of these kinds of features requires the entire context patch to be visible. In the presence of partial occlusions, however, features covering large regions of the object are more prone to being compromised by occlusions. Hence, it seems that just using “larger” features may turn out to be problematic in situations where partial occlusions are prevalent. At this point, more work is needed to assess this trade-off. Our approach in this article is to consider richer kinds of features but keep the support area corresponding to a particular node relatively small. In particular, we are interested in investigating the benefits of additional color features, since color has been shown to be a powerful cue in object recognition (Swain & Ballard, 1991). Since we are using a view-based feature constellation technique, we want to extract local color descriptions that we can use to label the graph nodes rather than using
a holistic description of object color in the form of a histogram, as in Swain and Ballard (1991). We label a node with a set of two feature vectors: a Gabor jet extracted from the gray-level image and a separate feature vector based on local color information. To this end, we use a simple local average of the color around a particular pixel position. We need to choose a suitable color space. We experimented with the RGB, HSI, and RY-Y-BY spaces. The last color space is often called YCbCr, since R−Y (Cr) and B−Y (Cb) are chroma differences and Y is the luma or brightness information (e.g., Pratt, 2001; Gonzalez & Woods, 1992). The transformations from RGB to RY-Y-BY color space and from RGB to HSI color space are given in the appendix. A color feature vector is extracted at a particular image location by simply averaging the colors in the node's neighborhood, covering a region of 5 × 5 pixels. The resulting three-component feature vectors are compared with the following similarity function, which is based on a weighted Euclidean distance:
S_r\left( \vec J, \vec J', s_1, s_2, s_3 \right) = 1 - \sqrt{ \frac{ \sum_{c=1}^{3} (s_c J_c - s_c J'_c)^2 }{ \sum_{c=1}^{3} (255 s_c)^2 } }.   (4.1)
This similarity function guarantees that the similarity falls in the interval [0, 1]. The scale factors s_c are used to enhance certain dimensions of the color space.1 Note that we make no effort to incorporate color constancy mechanisms, so performance must be expected to suffer in the presence of illumination changes. With the graph nodes now labeled with a set of feature vectors, or a compound feature set, where each feature has an associated similarity function, we need a way of fusing the individual similarities into one node similarity measure. Let a compound feature set be denoted by F^C, whereas an individual feature set is denoted F_m:

F^C := \{F_m, m = 1, \ldots, M\}.   (4.2)
We compare compound feature sets by a weighted average over the outputs of the individual similarity functions defined for each feature set F_m:

S_{compound}\left( F^C, F'^C \right) = \sum_{m=1}^{M} w_m \cdot S_m\left( F_m, F'_m \right), \quad where \quad \sum_m w_m = 1.   (4.3)
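A sketch of the color similarity (equation 4.1) and the weak-fusion rule (equation 4.3). Function names and input conventions are illustrative assumptions; colors are taken to be 8-bit three-component vectors.

```python
import numpy as np

def s_color(j, jp, scales=(1.0, 1.0, 1.0)):
    """Eq. 4.1: similarity in [0, 1] from a scaled Euclidean distance
    between two 3-component mean-color vectors (8-bit channels)."""
    s = np.asarray(scales, float)
    num = np.sum((s * (np.asarray(j, float) - np.asarray(jp, float))) ** 2)
    den = np.sum((255.0 * s) ** 2)
    return 1.0 - np.sqrt(num / den)

def s_compound(sims, weights):
    """Eq. 4.3: weak fusion as a weighted average of per-feature
    similarities (e.g., Gabor-jet similarity and color similarity)."""
    w = np.asarray(weights, float)
    assert abs(w.sum() - 1.0) < 1e-9   # weights must sum to one
    return float(w @ np.asarray(sims, float))
```

Setting one scale factor to zero (as with the 0.75:1:0 HSI weighting in footnote 1) simply drops that color dimension from both numerator and normalizer.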
1 We used the relative scaling of 1:1:1 in the RY-Y-BY space and 0.75:1:0 in the HSI space.
This similarity function simply replaces S_{node} in the previous discussion. This strategy avoids early decisions and characterizes this approach as an example of weak fusion. Weights are chosen to optimize performance. Note that overfitting is not a serious problem, since the total number of objects in all scenes is 858, much higher than the number of weights used in the system.

5 Scene Analysis

With our extended matching machinery in place, we are now ready to present our algorithm for scene analysis. It operates in two phases. In the first phase, we use the stereo graph matching technique described above to match object models to the input image pairs. In the second phase, we use an iterative method that evaluates matching similarities and disparity information to estimate occlusion relations and to accept or reject (partial) candidate matches. Objects are accepted in a front-to-back order so that possible occlusions can be properly taken into account when an object candidate is considered. The result of this analysis, the scene interpretation, lists all recognized objects and their locations and specifies the occlusion relations between them. One advantage of the graph representation used is that it allows us to explicitly represent partial occlusions of the object. To this end, we introduce a binary label v_i^{R/L} ∈ \{0, 1\} for each node of each matched object model that indicates whether this node is visible (v_i = 1) or occluded (v_i = 0). Given this definition, the average similarity of the visible part of the matched graph, here denoted \bar{S}_m^{vis}, is defined as

\bar{S}_m^{vis} = \frac{1}{2} \frac{ \sum_i v_{m,i}^L S_i^L(α^L, m^L, I^L) }{ \sum_i v_{m,i}^L } + \frac{1}{2} \frac{ \sum_i v_{m,i}^R S_i^R(α^R, m^R, I^R) }{ \sum_i v_{m,i}^R }.   (5.1)
This quantity is interesting because it explicitly discounts information from object parts that are estimated to be occluded. The (usually) low similarities of such nodes are explained away by the occlusion. The following algorithm for scene analysis simultaneously estimates the presence or absence of objects, their locations, and the visibility of individual nodes, thereby segmenting the scene into known objects and background regions. It iterates through the following steps (summarized in Figure 3):

1. INITIALIZATION: All image regions are marked as free, and all object models are matched to the image pair as described above. The nodes of all objects are marked as visible: v_{o,i}^R = 1 and v_{o,i}^L = 1 for all o, i.

2. FIND CANDIDATES: From all object models that have not already been accepted or rejected (see below), select a set of candidate objects for recognition. An object enters the set of candidates if two conditions hold:
Algorithm for Scene Analysis:
1. Initialization: mark all image areas as free, then match all object models.
2. Find candidate objects. IF set of candidates is empty THEN END.
3. Determine closest candidate object.
4. Accept or reject closest candidate object.
5. If accepted, mark corresponding image region as occupied and update node visibilities of all remaining object models and recompute their match similarities.
6. Go to step 2.

Figure 3: Overview of scene analysis algorithm.
a. A sufficient portion of it is estimated to be visible. Since we consider the left and right images at the same time, since the left and right model graphs for an object may have different numbers of nodes, and since different numbers of nodes may be visible in the left and right image, we define the number of visible nodes to be the maximum number of visible nodes in either the left or the right image. This number has to be above a threshold of φ = 20:

\max\left( \sum_i v_{o,i}^R, \; \sum_i v_{o,i}^L \right) > φ.
Robust recognition and, in particular, correct rejection on the basis of matching similarities can be made only if a sufficient number of similarity measurements are available. This limits the degree of occlusion the system can cope with. The threshold of 20 nodes corresponds to a minimum required visibility ranging from 60% of the surface area for small objects (e.g., the tea box) down to 10% for the largest objects (e.g., the toy pig). This is plausible, since very small objects are more likely to be overlooked in heavily cluttered scenes than larger objects.

b. The mean similarity of the visible nodes of the left and right graphs, denoted \bar{S}_o^{vis}, lies above a threshold θ: \bar{S}_o^{vis} > θ.

If the candidate set is empty because no more object hypotheses fulfill these requirements, the algorithm terminates.

3. FIND FRONT-MOST CANDIDATE: The identification of the front-most candidate is based on the fusion of a disparity cue and a node similarity cue. We evaluate all pairwise comparisons between candidates. The candidate judged to be "in front" most often is selected. The disparity cue simply compares the disparities of two candidates to select the closer object. The assumption behind the node similarity cue is that if the matched object graphs for two objects o and o′ overlap in the image, then the closer object must occlude the other, and the image region in question should look more similar to the model of the occluding object than to the model of the occluded object. Consequently, we can define the left and right node similarity occlusion indices C_{oo′}^{R/L} as follows:

C_{oo'}^{R/L} = \sum_{(i,j)} \left( S_{oi}^{R/L} - S_{o'j}^{R/L} \right) \quad for all (i, j) with \left\| x_{oi}^{R/L} - x_{o'j}^{R/L} \right\| \le d.
The distance threshold d was set to 5 pixels. We also define the average occlusion index \bar{C}_{oo'} as the mean of the left and right occlusion indices:

\bar{C}_{oo'} = \frac{1}{2} \left( C_{oo'}^R + C_{oo'}^L \right).
We fuse the estimates of the disparity and the node similarity cues as follows: if the absolute difference between the disparities is above a threshold of θd = 5 pixels, we use the estimate of the disparity cue to determine which of two objects is in front. Otherwise, we use the estimate of the node similarity cue. This choice reflects our observation that the disparity cue is reliable only for large disparity differences. This is also supported by studies on the use of disparity in biological vision (see, e.g., Harwerth, Moeller, & Wensveen, 1998).
4. ACCEPT/REJECT CANDIDATE: The algorithm now accepts the front-most candidate. If, however, this object graph shows a higher disparity than the maximum disparity of all previously accepted objects, this candidate is rejected, because in this case the analysis indicates that the candidate should have already been selected during an earlier iteration but was not.

5. UPDATE NODE VISIBILITIES: After a candidate has been accepted, we mark free image regions that are covered by the object as occupied. The nodes of other objects that have not been accepted or rejected earlier are now labeled as occluded if they fall in the image region covered by the just-accepted object.

6. GO TO STEP 2.

The proposed algorithm is conceptually simple and effective, but it has some important limitations. Most prominently, the initial matching of all object models does not take occlusions into account. A more elaborate (but slower) scheme would be to rematch all object models after a new candidate has been accepted, properly taking into account the occlusions due to all previously accepted objects. We also do not allow multiple matches of the same object model in different places. Finally, occlusions by objects that are unknown to the system are not handled in this approach. Although local occlusion can also be determined on the basis of local matching similarity (see Wiskott & von der Malsburg, 1993), future research must address this issue more thoroughly.

6 Results

6.1 The Database. We have recorded a database of stereo image pairs of cluttered scenes composed of 18 different objects in front of uniform and complex backgrounds.2 The database is available for download (Eckes & Triesch, 2004). A collage of the 18 objects is shown in Figure 4.
Note that there is considerable variation in object shape and size (compare the objects "pig" and "cell-phone" in the upper right corner), but there are also objects with nearly identical shapes, differing only in color and surface texture (compare the objects "Oliver Twist" and "Life of Brian" in the upper left corner). The database has two parts. Part A, the training part, is formed by one training image pair of each object in front of a uniform blue cloth background. Part B, the test part, contains image pairs of scenes with simple and complex backgrounds. There are 80 scenes composed of two to five
2 Images were acquired through a pair of SONY XC-999P cameras with 6 mm lenses, mounted on a modified ZEBRA TRC stereo head with a baseline of 30 cm. We used two IC-P-2M frame grabber boards with an AM-CLR module from Imaging Technology.
Analysis of Cluttered Scenes
Figure 4: Object database. A collage of the 18 objects used. See the text for a description.
objects in front of the same uniform blue cloth background. The total number of objects in these scenes is 280. The test part also contains 160 scenes composed of two to five objects in front of one of four structured backgrounds. Here, we have a total of 558 objects. Finally, the test part has five scenes without any objects: just the blue cloth and the four complex backgrounds. Thus, the test part has 263 scenes in total. Some example images are shown in Figure 5.

6.2 Evaluation. Every recognition system able to reject unknown objects uses confidence values to judge whether a given object has been localized and identified correctly. If the confidence value falls below a certain threshold, the so-called rejection threshold, the corresponding recognition hypothesis is discarded since it is most likely incorrect. In many applications, missing an object is a much less severe mistake than a so-called false-positive error, such as incorrectly recognizing an object that is not present in the scene at all (e.g., in security applications). Hence, the proper value for the rejection threshold depends on the application. Alternatively, the area between the receiver operating characteristic (ROC) curve and the upper limit of 100% recognition performance measures the total error regardless of the application scenario (see, e.g., Duda & Hart, 1973).
Figure 5: Example scenes with simple and complex backgrounds: (a, b) Two scenes with uniform background. Note the variation in size for object “pig” and the different color appearance. Image a stems from the left camera and b from the right. (c–f) Examples of scenes with the four complex backgrounds.
The most important parameter of our system is the similarity threshold θ, which we use as the rejection threshold. θ determines whether a model's matching similarity was high enough to allow it to enter the set of candidates during scene analysis (compare section 5). A low threshold θ will result in many false-positive recognitions, while a high θ will cause many false rejections. Depending on the kind of application, one type of error may be more harmful than the other, and θ should be chosen to reflect this trade-off for any particular application.3 An example of the effect of varying θ is given in Figure 6. In the following, we report our results in the form of ROC curves obtained by systematically varying θ for different versions of the system. The parameter θ was varied in all experiments within the interval [0.32, 0.84] in steps of 0.01, the corresponding system was tested on the complete database, and the numbers of correct and false-positive recognitions were recorded. Hence, the parameter θ is used as the rejection threshold that determines the working point of the investigated system. All other parameters are left constant at the values given
3 One could also learn a threshold that depends on the specific object model m, that is, θ = θ (m), but this was not attempted in this work.
[Figure 6 panels: (L, R) stereo image pair; (a) θ = 0.43, 0.49, 0.50; (b) θ = 0.67, 0.68, 0.685.]
Figure 6: Rejection performance of different systems for an example scene: (L+R) Complex scene with five objects. (a) Results of the stereo recognition using intensity-based Gabor processing (color-blind features) with three different values of θ. A higher rejection threshold leads to fewer false positives but also tends to remove some true positives. (b) Results of the same stereo scene recognition system when raw color Gabor features taken in HSI color space are used instead. Here, a higher rejection threshold is able to suppress false-positive recognitions without influencing already correct recognitions. The additional color information clearly improves the recognition performance. The system still misses the highly occluded gray toy ape in the background though.
above. A recognition result is considered correct if the object position estimated by the matching process lies within 7 pixels of the hand-labeled ground truth position. Our main results are shown in Figure 7. It compares the ROC curves for the base system, using only Gabor features and monocular analysis, with our full system, employing stereo graph matching and compound features incorporating color information. The monocular system analyzed the left and right images independently, and the results were simply averaged. Note that for the determination of occlusion relations during scene analysis, the monocular system must rely on the node similarity cue alone since disparity information is not available. The performance of the stereo graph
[Figure 7 plot: x-axis, number of false positives (0–1000) and false-positive rate (0%–20%); y-axis, number (0–800) and percentage (0%–100%) of correct object recognitions; curves for mono-greylevel, stereo-greylevel, mono-rc-hsi, mono-rc-rgb, mono-rc-ryyby, stereo-rc-hsi, stereo-rc-rgb, and stereo-rc-ryyby, with peak-performance points labeled by their θ values.]
Figure 7: Comparison by ROC analysis. Systematic variation of the rejection parameter θ over [0.32, 0.84] in steps of 0.01, recording correct object recognitions and false positives, leads to this ROC plot for all investigated systems. We have marked the peak-performance data points with the corresponding rejection value (see Table 1 for a summary). The three stereo systems using compound feature sets incorporating color information in HSI, RY-Y-BY, or RGB color (stereo-rc lines) clearly outperform the monocular systems. They also outperform the stereo system using only gray-level features (stereo-greylevel line) over a large range of rejection parameters and corresponding false-positive rates.
matching using only intensity-based Gabor wavelets has also been added for comparison. Table 1 gives an overview of the performance of the different systems. All scenes with complex and simple backgrounds and with two to five objects per scene were used in these experiments. We have also investigated which of the two different types of color feature sets, raw color compounds or color Gabor compounds, shows superior performance. The complete results, presented in Eckes (2006), revealed that raw color compound features support better recognition performance than color Gabor features. This is understandable since the first type of feature set explicitly encodes the local surface color, whereas the latter encodes the color texture of the objects. Hence, in the following we refrain from presenting any results based on the inferior color Gabor features in order to keep the discussion simple.
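The evaluation protocol described above, sweeping the rejection threshold and counting correct and false-positive recognitions, can be sketched as follows. `run_system` (the matcher) and the scene/hypothesis dict layout are illustrative assumptions; the sweep range [0.32, 0.84], the 0.01 step, and the 7-pixel correctness radius follow the text.

```python
import math

def roc_points(run_system, scenes, theta_lo=0.32, theta_hi=0.84, step=0.01):
    """Sweep the rejection threshold theta and record one ROC point per setting."""
    points = []
    n_steps = int(round((theta_hi - theta_lo) / step))
    for i in range(n_steps + 1):
        theta = theta_lo + i * step
        correct = false_pos = 0
        for scene in scenes:
            # run_system returns recognition hypotheses whose matching
            # similarity exceeds the rejection threshold theta.
            for hyp in run_system(scene, theta):
                truth = scene["truth"].get(hyp["label"])
                # Correct iff the estimated position lies within 7 pixels
                # of the hand-labeled ground truth position.
                if truth is not None and math.dist(hyp["pos"], truth) <= 7.0:
                    correct += 1
                else:
                    false_pos += 1
        points.append((theta, correct, false_pos))
    return points
```

Peak performance, as reported in Table 1, would then be the point with the maximum number of correct recognitions, regardless of the number of false positives.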
Table 1: Comparison of Peak Recognition Results.

Matching  Features           θ      Error (%)  Correct (Number)  False Positives (Number)
Mono      Gabor              0.51      55            388                  500
Mono      Compound (RYYBY)   0.71      47            457                  205
Mono      Compound (HSI)     0.65      40            510                  386
Mono      Compound (RGB)     0.71      44            478                  372
Stereo    Gabor              0.48      20            681                  850
Stereo    Compound (RYYBY)   0.68       8            787                  544
Stereo    Compound (HSI)     0.66       7            793                  293
Stereo    Compound (RGB)     0.72       6            803                  292
6.3 Discussion. The introduction of the stereo algorithm leads to an increase in peak performance, reducing the error rate from 55% to 20%. This improvement comes at the cost of an increase in the false-positive rate from around 10% to 18%. It is more than compensated, however, when color features are added to the system. For a fixed number of false positives, the extended system typically reaches twice as many correct recognitions as the original one, whereas the choice of color space, RGB, HSI, or RY-Y-BY, makes only a minor difference in peak performance (see Figure 7 for the ROC analysis). The best trade-off is shown by the system using raw color compounds from the HSI color space, since its number of correct object recognitions remains optimal even when the number of false positives is decreased. For very small numbers of false positives (N ≤ 10), however, the original system outperforms our extended version, as the magnification of the ROC in Figure 8 reveals.

The ROC curves show an interval of monotonic increase in recognition performance as the number of false positives increases, in accordance with expectations. The recognition systems, however, also show a decrease in correct object recognitions once the number of false positives rises above a certain value. This is due to competition between false-positive and correct object models matched at the same region in the scene.

Table 1 summarizes the peak performance of the different systems and makes it possible to study the effects of stereo graph matching and compound feature sets in isolation. Peak performance was defined as the maximum number of correctly recognized objects regardless of the number of false positives. The corresponding θ values are given in the table. Additional color and stereo information has resulted in an enormous increase in performance. The optimal combination of stereo and color cues enables the
[Figure 8 plot: magnification of Figure 7; x-axis, number of false positives (0–100) and false-positive rate (0%–2%); same legend as Figure 7, with selected points labeled by their θ values.]
Figure 8: Comparison by ROC analysis (magnification of Figure 7). For more than eight false positives (0.1% FPR), the stereo-based color compound systems outperform the original system. Interestingly, the stereo algorithm with intensity-based Gabor features shows worse performance than the original system until more than 54 false positives (1.2% FPR) are allowed.
system to reduce its error rate by almost a factor of 8. Interestingly, the use of the stereo algorithm with color-blind features increases the number of false positives significantly, but the combination of both extensions actually reduces the number of false positives dramatically in comparison to the original system. The best performance is achieved with color features represented in the HSI color space. As Figure 7 has shown, the difference between the various color compound features is small when looking at peak performance, but the stereo system based on HSI color compound features outperforms all other systems, as it keeps the optimal balance between performance and false-positive rate (FPR).

6.4 Efficiency. It takes roughly 30 seconds on a 1 GHz Pentium III with 512 MB RAM to analyze a 256 × 256 stereo image pair when 18 objects have been enrolled. Transformation and feature extraction take 2 seconds, and stereo matching takes 1 to 2 seconds per object. The time for the final scene analysis is negligible. Hence, most time is spent in the matching routines, which scale linearly with the number of objects n, O(n). A recognition system with 21 objects working on image sizes of 128 × 128 performs on the given hardware
in almost real time. Note that the research code is not optimized for speed. A system for face detection, tracking, and recognition based on the same methods, Gabor wavelet preprocessing and graph matching, performs in real time on similar image sizes (see Steffens, Elagin, & Neven, 1998).

7 Discussion

The system we have proposed is able to recognize and segment partially occluded objects in cluttered scenes with complex backgrounds on the basis of known shape, color, and texture. The specific approach presented here is based on elastic graph matching, an object recognition architecture that models views of objects as 2D constellations of image features represented as labeled graphs. The graph representation is particularly powerful for object recognition under partial occlusion because it naturally allows us to explicitly model occlusions of object parts (Wiskott & von der Malsburg, 1993). The contributions of this article have been the incorporation of richer object descriptions in the form of compound feature sets (Triesch & Eckes, 1998), the extension to graph matching in stereo image pairs (Triesch, 1999; Kefalea, 2001), and the introduction of a scene analysis algorithm that uses the disparity information provided by stereo graph matching to disambiguate occlusion relations. The combination of these innovations leads to substantial performance improvements.

At a more abstract level, it appears that the major reason for the significant boost in performance is the combined use of different modalities or cues: stereo, color, shape, texture, and occlusion. We are unaware of any other system that has attempted to integrate such a wide range of cues for the purpose of object recognition and scene analysis. The endeavor to integrate a wide range of cues may be motivated by studies of the development of different modalities in infants, indicating that the human visual system tries to exploit every available cue in order to recognize objects (Wilcox, 1999).
Note that stereo information is typically used as a low-level cue; that is, information from the left and right images is typically fused at a very early processing stage, whereas here we fuse the images only at the level of full (or partial) object descriptions. This may not be very plausible from a biological perspective, since the fusion of information from the left and right eyes in mammalian vision systems seems to occur quite early. Nevertheless, the results we obtained are promising. Another point worth highlighting is our system's ability to learn about new objects in a one-shot fashion, which is quite valuable. Our system's ability to handle unknown backgrounds is also worth mentioning. That is not to say that the system could not be substantially improved by gathering and utilizing statistical information about objects and backgrounds from many training images. On the contrary, we would expect this to improve the system substantially (e.g., Duc et al., 1999). A statistically well-founded system for object detection was presented in Viola and Jones (2001a, 2001b), in which
the types of features as well as the detection classifiers were learned from many training examples. Its high efficiency and well-founded learning architecture make it very attractive, although it is unclear how it can be extended to deal with occlusions. However, we view the good performance obtained in our study despite the lack of such statistical descriptions as strong testimony to the power of the chosen object representation.

We must admit that our system is shaped by a number of somewhat arbitrary choices concerning, for instance, the sequence of events, the relative weights in the creation of compound features, and the rather algorithmic nature of the system. It is our vision that the coordination of processes that together constitute the analysis of a visual scene will eventually be described as a dynamical system shaped by self-organization on the timescales of both learning and brain state organization. The actual sequence of events may then depend on particular features of a given scene.

In the following, we discuss some related work. The number of articles on object recognition is vast, and the goal of this section is to discuss our work in the light of some specific example approaches. We restrict our discussion to appearance-based object recognition approaches. A number of models of object recognition in the primate visual system have been proposed in the literature. Examples are Fukushima, Miyake, and Ito (1983), Mel (1997), and Deco and Schürmann (2000) (see the overview in Riesenhuber & Poggio, 2000). However, none of these systems tries to explicitly model partial occlusions. Some more closely related work has been proposed in the computer vision literature.

7.1 Modeling Occlusions in Object Recognition. Of particular interest is the work by Perona's group (Burl & Perona, 1996; Moreels, Maire, & Perona, 2004), which also describes objects as 2D constellations of features. Their approach is cast in a probabilistic formulation.
However, they restrict the analysis to binary features, features that either are or are not present. Thus, their method must make "hard" decisions at the early level of feature extraction, which is often problematic. A similar approach has also been proposed in Pope and Lowe (1996, 2000) and, more recently, Lowe (2004). Such object recognition models rely on detecting stable key points on the object surface (e.g., extrema in scale-space) from which robust features are extracted and compared to features stored in memory. The matched features vote for the most likely affine transformation on the basis of a Hough transformation, thereby establishing the correspondences between the training image and the current input scene. The matching is fast and able to detect highly occluded objects in some complicated scenes. To this end, either features are made so specific that detection of a small number of them, say three, is sufficient evidence for the presence of the object (Lowe), or missing features are accounted for by simply specifying a prior probability of any feature missing (Perona). Thus, these systems do not explicitly try
to model occlusions due to recognized objects in order to explain away missing features in partly occluded objects, as our approach does.

7.2 Integrating Segmentation and Recognition. The analysis of complex cluttered scenes is one of the most challenging problems in computer vision. Progress in this area has been comparatively slow, and we feel that one reason is that researchers have usually tried to decompose the problem in the wrong way. A specific problem may have been a frequent insistence on treating segmentation and recognition as separate problems, to be solved sequentially. We believe that this approach is inherently problematic, and this work is to be seen as a first step toward finding integrated solutions to both problems. Similarly, other classic computer vision problems such as stereo, shape from shading, and optic flow have been studied mostly in isolation, despite their ill-posed nature. From this perspective, we feel that the greatest progress may result from breaking with this tradition and building vision systems that try to find integrated solutions to several of these problems. This is a challenging endeavor, but we feel that it is time to address it.

With respect to segmentation and recognition, there is striking evidence for a very early combination of top-down and bottom-up processing in the human visual system (see, e.g., the work of Peterson and colleagues: Peterson, Harvey, & Weidenbacher, 1991; Peterson & Gibson, 1994; Peterson, 1999, or neurophysiological studies such as Mumford, 1991, or Lee, Mumford, Romero, & Lamme, 1998). In our current system, segmentation is purely model driven. But earlier work has also demonstrated the benefits of integrating bottom-up and model-driven processing for segmentation and object learning (Eckes & Vorbrüggen, 1996; Eckes, 2006). In these studies, even partial object knowledge was able to disambiguate the otherwise notoriously error-prone data-driven segmentation.
Such an approach leads to a more robust object segmentation that adapts and refines the object model in a stabilizing feedback loop. An interesting alternative to that approach is the work by Ullman and colleagues (Borenstein, Sharon, & Ullman, 2004; Borenstein & Ullman, 2002, 2004), in which bottom-up segmentation is combined with a learned representation of a particular object class. The system is able to segment objects within this category from the background. Images of different horses were used to learn a "patch representation" of the object class "horse," which is then used to recognize subpatterns in horse images. This top-down information is fused with a low-level segmentation by minimizing a cost function and improves the segmentation performance. In contrast to our work, the focus here is on improving segmentation based on prior object knowledge, assuming the object has already been recognized. We believe that the close integration of bottom-up segmentation and recognition processes may also benefit recognition performance, due to the explaining away of occluded object parts and the removal of unwanted features likely not belonging to the object, which would otherwise
introduce noise into higher-level object representations based on large receptive fields.

Another interesting segmentation architecture able to integrate low-level and high-level segmentation was presented by Tu, Chen, Yuille, and Zhu (2003). Low-level segmentation based on Markov random field (MRF) models of intensity values is combined with high-level object modules for face and text detection. An input image is parsed into low-level regions (based on brightness cues) and object regions by a Metropolis-Hastings dynamics that minimizes the cost function of a combined MRF model. Tu et al. trained prior models for facial and textual regions in the image with the help of AdaBoost from training examples. The outputs of the probabilistic models are combined with an MRF segmentation model. The additive priors in the energy function of the MRF model favor piecewise homogeneous regions in gray-level intensity with smooth borders and not-too-small region sizes. This system has a nice probabilistic formulation, but the data-driven Markov chain Monte Carlo method used for inference tends to be very slow. In addition, the system is unlikely to handle large occlusions, because the object recognizers for text and faces are not very robust to partial occlusion.

Combining recognition with attention (e.g., Itti & Koch, 2000) is another necessary extension to our system. The recognition process might focus first on areas with salient image features, which would significantly reduce the rather sequential graph matching effort. These features may also help to preselect or activate feature constellations in the model domain and may bias the recognition system to start with only a promising subset of known objects. In general, combining low-level segmentation and attention approaches such as that of Poggio, Gamble, and Little (1988) with recognition approaches is very promising and biologically highly relevant (see, e.g., Lee & Mumford, 2003).
However, we believe that a more dynamical formulation of the matching process (e.g., in the style of the neocognitron; see Fukushima, 2000), in combination with a recurrent and continuously refined low-level segmentation, must be established. Such an integrated model may help us understand how the brain solves the very difficult problem of vision and may also help us develop better machine vision. Despite a number of drawbacks, our method and the methods discussed here are promising examples of what integrated segmentation and object recognition systems might look like. There are still many open questions, and a biologically plausible dynamic formulation is desirable.

8 Conclusion

We have presented a system for the automatic analysis of cluttered scenes of partially occluded objects that obtains good results by integrating a number of visual cues that are typically treated separately. The addition of these cues has increased the performance of the scene analysis considerably, reducing the error rate by roughly a factor of 8. Only the combined use of
stereo and color cues was responsible for this unusually large improvement in performance. Our study of the different systems has shown that all proposed systems based on color and stereo cues outperform the monocular and color-blind recognition system over a large range of rejection settings whenever more than 20 false positives (0.4% false-positive rate) can be tolerated. Future research should try to incorporate bottom-up segmentation mechanisms and the learning of statistical object descriptions, ideally in an only weakly supervised fashion.

A direct comparison of different scene analysis approaches is still difficult at present because no standard database is available. The FERET test in face recognition (Phillips, Moon, Rizvi, & Rauss, 2000) has demonstrated the benefits of independent databases and benchmarks. We hope that the publication of our data set (Eckes & Triesch, 2004) will help fill this gap and facilitate future work in this area.

Appendix: Color Space Transformations

Let us specify the type of color transformation used in this work by following Gonzalez and Woods (1992), since there is often confusion about color spaces both in the research literature and in documents on ITU or ISO/IEC standards. The transformation from RGB to RY-Y-BY color space is given by
$$
\begin{pmatrix} RY \\ Y \\ BY \end{pmatrix}
=
\begin{pmatrix}
0.500 & -0.419 & -0.081 \\
0.299 & 0.587 & 0.114 \\
-0.169 & -0.331 & 0.500
\end{pmatrix}
\begin{pmatrix} R \\ G \\ B \end{pmatrix}
+
\begin{pmatrix} 127 \\ 0 \\ 127 \end{pmatrix}.
\qquad (A.1)
$$
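Equation A.1 can be transcribed directly; the sketch below assumes 8-bit channel values in [0, 255] and is illustrative, not the authors' code.

```python
def rgb_to_ryyby(r, g, b):
    """RGB to RY-Y-BY per equation A.1; r, g, b are assumed to lie in [0, 255]."""
    ry = 0.500 * r - 0.419 * g - 0.081 * b + 127.0
    y = 0.299 * r + 0.587 * g + 0.114 * b
    by = -0.169 * r - 0.331 * g + 0.500 * b + 127.0
    return ry, y, by
```

The offsets of 127 center the two chromatic channels so that achromatic pixels map to RY = BY = 127.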
The transformation from RGB to HSI color space is given by
$$
H = \arccos\!\left( \frac{\tfrac{1}{2}\left[ (R-G) + (R-B) \right]}{\sqrt{(R-G)^2 + (R-B)(G-B)}} \right),
\qquad
S = 1 - \frac{3\,\min(R,G,B)}{R+G+B},
\qquad
I = \frac{1}{3}(R+G+B).
\qquad (A.2)
$$
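A sketch of the conversion in equation A.2, in the convention of Gonzalez and Woods, with channels assumed normalized to [0, 1]. The handling of gray pixels (where hue is undefined) and the extension of the hue to [0, 2π) when B > G are standard additions not spelled out in the text.

```python
import math

def rgb_to_hsi(r, g, b):
    """RGB to HSI per equation A.2; r, g, b are assumed to lie in [0, 1]."""
    i = (r + g + b) / 3.0
    s = 0.0 if i == 0 else 1.0 - min(r, g, b) / i
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0:
        h = 0.0  # hue is undefined for achromatic pixels; 0 by convention
    else:
        # Clamp against rounding error before acos.
        h = math.acos(max(-1.0, min(1.0, num / den)))
        if b > g:  # acos covers only [0, pi]; mirror for the lower half
            h = 2.0 * math.pi - h
    return h, s, i
```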
Acknowledgments This work was in part supported by grant 0208451 from the National Science Foundation. We thank the developers of the FLAVOR software environment, which served as the platform for this research.
References

Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using shape contexts. IEEE Trans. PAMI, 24(4), 509–522.
Borenstein, E., Sharon, E., & Ullman, S. (2004). Combining top-down and bottom-up segmentations. In IEEE Workshop on Perceptual Organization in Computer Vision. Washington, DC.
Borenstein, E., & Ullman, S. (2002). Class-specific, top-down segmentation. In ECCV 2002, 7th European Conference on Computer Vision (pp. 109–122). Berlin: Springer.
Borenstein, E., & Ullman, S. (2004). Learning to segment. In The 8th European Conference on Computer Vision—ECCV 2004, Prague. Berlin: Springer-Verlag.
Buhmann, J., Lades, M., & von der Malsburg, C. (1990). Size and distortion invariant object recognition by hierarchical graph matching. In IJCNN International Conference on Neural Networks (pp. 411–416). San Diego, CA: IEEE.
Burl, M., & Perona, P. (1996). Recognition of planar object classes. In 1996 Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Computer Society.
Deco, G., & Schürmann, B. (2000). A hierarchical neural system with attentional top-down enhancement of the spatial resolution for object recognition. Vision Research, 40(20), 2845–2859.
Duc, B., Fischer, S., & Bigun, J. (1999). Face authentication with Gabor information on deformable graphs. IEEE Transactions on Image Processing, 8(4), 504–516.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Eckes, C. (2006). Fusion of visual cues for segmentation and object recognition. In preparation.
Eckes, C., & Triesch, J. (2004). Stereo cluttered scenes image database (SCS-IDB). Available online at http://greece.imk.fraunhofer.de/publications/data/SCS/SCS.zip.
Eckes, C., & Vorbrüggen, J. C. (1996). Combining data-driven and model-based cues for segmentation of video sequences. In World Conference on Neural Networks (pp. 868–875). Mahwah, NJ: INNS Press and Erlbaum.
Edelman, S. (1995). Representation of similarity in three-dimensional object discrimination. Neural Computation, 7, 408–423.
Edelman, S. (1998). Representation is representation of similarities (Tech. Rep. No. CS96-08, 1996). Jerusalem: Weizmann Science Press.
Fischler, M. A., & Elschlager, R. A. (1973). The representation and matching of pictorial structures. IEEE Trans. Computers, 22(1), 67–92.
Fleuret, F., & Geman, D. (2001). Coarse-to-fine face detection. International Journal of Computer Vision, 41(1/2), 85–107.
Forsyth, D. A., & Ponce, J. (2002). Computer vision: A modern approach. Upper Saddle River, NJ: Prentice Hall.
Fukushima, K. (2000). Active and adaptive vision: Neural network models. In S.-W. Lee, H. H. Bülthoff, & T. Poggio (Eds.), Biologically motivated computer vision. Berlin: Springer.
Fukushima, K., Miyake, S., & Ito, T. (1983). Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Trans. Systems, Man, and Cybernetics, SMC-13(5), 826–834.
Gonzalez, R. C., & Woods, R. E. (1992). Digital image processing. Reading, MA: Addison-Wesley.
Harwerth, R., Moeller, M., & Wensveen, J. (1998). Effects of cue context on the perception of depth from combined disparity and perspective cues. Optometry and Vision Science, 75(6), 433–444.
Itti, L., & Koch, C. (2000). A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40(12), 1489–1506.
Jones, J. P., & Palmer, L. A. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol., 56(6), 1233–1258.
Kefalea, E. (2001). Flexible object recognition for a grasping robot. Unpublished doctoral dissertation, Universität Bonn.
Lades, M., Vorbrüggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R. P., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Computers, 42, 300–311.
Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20(7), 1434–1448.
Lee, T. S., Mumford, D., Romero, R., & Lamme, V. A. (1998). The role of the primary visual cortex in higher level vision. Vision Research, 38(15/16), 2429–2454.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Mel, B. W. (1997). SEEMORE: Combining color, shape, and texture. Neural Computation, 9(4), 777–804.
Messmer, B. T., & Bunke, H. (1998). A new algorithm for error-tolerant subgraph isomorphism detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5), 493–504.
Moreels, P., Maire, M., & Perona, P. (2004). Recognition by probabilistic hypothesis construction. In Eighth European Conference on Computer Vision. Berlin: Springer.
Mumford, D. (1991). On the computational architecture of the neocortex. I. The role of the thalamo-cortical loop. Biological Cybernetics, 65(2), 135–145.
Murase, H., & Nayar, S. K. (1995). Visual learning and recognition of 3-D objects from appearance. Int. J. of Computer Vision, 14(2), 5–24.
Nelson, R. C., & Selinger, A. (1998). Large-scale tests of a keyed, appearance-based 3-D object recognition system. Vision Research, 38(15–16), 2469–2488.
Peterson, M. (1999). Knowledge and intention can penetrate early vision. Behavioral and Brain Sciences, 22(3), 389.
Peterson, M. A., & Gibson, B. S. (1994). Must figure-ground organization precede object recognition? An assumption in peril. Psychological Science, 7(5), 253–259.
Peterson, M. A., Harvey, E. M., & Weidenbacher, H. J. (1991). Shape recognition contributes to figure-ground reversal: Which route counts? Journal of Experimental Psychology: Human Perception and Performance, 17(4), 1075–1089.
Phillips, P., Moon, H., Rizvi, S., & Rauss, P. (2000). The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10), 1090–1104.
Poggio, T., Gamble, E., & Little, J. (1988). Parallel integration of vision modules. Science, 242, 436–440.
1470
C. Eckes, J. Triesch, and C. von der Malsburg
Pope, A. R., & Lowe, D. G. (1996). Learning appearance models for object recognition. In J. Ponce, A. Zisserman, & Hebert (Eds.), Object representation in computer vision II (pp. 201–219). Berlin: Springer. Pope, A. R., & Lowe, D. G. (2000). Probabilistic models of appearance for 3-D object recognition. International Journal of Computer Vision, 40(2), 149–167. Pratt, W. K. (2001). Digital image processing. New York: Wiley. Ren, X., & Malik, J. (2002). A probabilistic multi-scale model for contour completion based on image statistics. In Proceedings of the Seventh European Conference on Computer Vision (pp. 312–327). Berlin: Springer. Riesenhuber, M., & Poggio, T. (2000). Models of object recognition. Nature of Neuroscience, 3(Supp.), 1199–1204. Rosenfeld, A. (1984). Image analysis: Problems, progress, and prospects. Pattern Recognition, 17, 3–12. Rowley, H., Baluja, S., & Kanade, T. (1998). Neural network–based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1), 23– 38. Shams, L., & von der Malsburg, C. (2002). The role of complex cells in object recognition. Vision Research, 42, 2547–2554. Steffens, J., Elagin, E., & Neven, H. (1998). Personspotter—fast and robust system for human detection, tracking and recognition. In Proceedings of the Third International Conference on Face and Gesture Recognition (pp. 516–521). Piscataway, NJ: IEEE Computer Society. Swain, M., & Ballard, D. H. (1991). Color indexing. Int. J. of Computer Vision, 7(1), 11–32. Tarr, M., & Bulthoff, H. (1998). Image-based object recognition in man, monkey and machine. Cognition, 67(1–2), 1–20. Triesch, J. (1999). Vision-based robotic gesture Recognition. Aachen, Germany: Shaker Verlag. Triesch, J., & Eckes, C. (1998). Object recognition with multiple feature types. In L. Niklasson, M. Bod´en, & T. Ziemke (Eds.), Proceedings ICANN 98 (pp. 233–238). Berlin: Springer. Triesch, J., & Eckes, C. (2004). 
Object recognition with deformable feature graphs: Faces, hands, and cluttered scenes. In C. H. Chen (Ed.), Handbook of pattern recognition and computer vision. Singapore: World Scientific. Triesch, J., & von der Malsburg, C. (2001). A system for person-independent hand posture recognition against complex backgrounds. IEEE Trans. PAMI, 23(12), 1449–1453. Triesch, J., & von der Malsburg, C. (2002). Classification of hand postures against complex backgrounds using elastic graph matching. Image and Vision Computing, 20(13–14), 937–943. Tu, Z., Chen, X., Yuille, A., & Zhu, S. (2003). Image parsing: Unifying segmentation, detection, and recognition. In Proceedings of the Ninth IEEE International Conference on Computer Vision. Piscataway, NJ: IEEE Computer Society. Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86. Ullman, S. (1998). Three-dimensional object recognition based on the combination of views. Cognition, 67(1–2), 21–44.
Analysis of Cluttered Scenes
1471
Viola, P., & Jones, M. J. (2001a). Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Computer Society. Viola, P., & Jones, M. J. (2001b). Robust real-time object detection (Tech. Rep. CRL 2001/01). Cambridge, MA: Cambridge Research Laboratory. Wilcox, T. (1999). Object individuation: Infants’ use of shape, size, pattern, and color. Infant Behavior and Development, 72(2), 125–166. ¨ Wiskott, L., Fellous, J., Kruger, N., & von der Malsburg, C. (1997). Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 775–779. Wiskott, L., & von der Malsburg, C. (1993). A neural system for the recognition of partially occluded objects in cluttered scenes: A pilot study. International Journal of Pattern Recognition and Artificial Intelligence, 7(4), 935–948. Yuille, A. L. (1991). Deformable templates for face recognition. Journal of Cognitive Neuroscience, 3(1), 59–70.
LETTER
Communicated by Joachim Buhmann
Support Vector Machines for Dyadic Data
Sepp Hochreiter∗ [email protected]
Klaus Obermayer [email protected] Department of Electrical Engineering and Computer Science, Technische Universität Berlin, 10587 Berlin, Germany
We describe a new technique for the analysis of dyadic data, where two sets of objects (row and column objects) are characterized by a matrix of numerical values that describe their mutual relationships. The new technique, called potential support vector machine (P-SVM), is a large-margin method for the construction of classifiers and regression functions for the column objects. Contrary to standard support vector machine approaches, the P-SVM minimizes a scale-invariant capacity measure and requires a new set of constraints. As a result, the P-SVM method leads to a usually sparse expansion of the classification and regression functions in terms of the row rather than the column objects and can handle data and kernel matrices that are neither positive definite nor square. We then describe two complementary regularization schemes. The first scheme improves generalization performance for classification and regression tasks; the second scheme leads to the selection of a small, informative set of row support objects and can be applied to feature selection. Benchmarks for classification, regression, and feature selection tasks are performed with toy data as well as with several real-world data sets. The results show that the new method is at least competitive with but often performs better than the benchmarked standard methods for standard vectorial as well as true dyadic data sets. In addition, a theoretical justification is provided for the new approach.

1 Introduction

Learning from examples in order to predict is one of the standard tasks in machine learning. Many techniques have been developed to solve classification and regression problems, but by far, most of them were specifically designed for vectorial data. Vectorial data, where data objects are described by vectors of numbers that are treated as elements of a vector space, are very

∗ New affiliation: Institute for Bioinformatics, Johannes Kepler Universität Linz, 4040 Linz, Austria.
Neural Computation 18, 1472–1510 (2006)
© 2006 Massachusetts Institute of Technology
[Figure 1: two example data matrices, (A) dyadic and (B) pairwise; see the caption below.]
Figure 1: (A) Dyadic data. Column objects {a , b, . . .} and row objects {α, β, . . .} are represented by a matrix of numerical values that describe their mutual relationships. (B) Pairwise data. A special case of dyadic data, where row and column objects coincide.
convenient because of the structure imposed by the Euclidean metric. However, there are data sets for which a vector-based description is inconvenient or simply wrong, and representations that consider relationships between objects are more appropriate. In the following, we will study representations of data objects based on matrices. We consider “column” objects, which are the objects to be described, and “row” objects, which are the objects that serve for their description (see Figure 1A). The whole data set can then be represented using a rectangular matrix whose entries denote the relationships between the corresponding row and column objects. We call representations of this form dyadic data (Hofmann & Puzicha, 1998; Li & Loken, 2002; Hoff, 2005). If row and column objects are from the same set (see Figure 1B), the representation is usually called pairwise data, and the entries of the matrix can often be interpreted as the degree of similarity (or dissimilarity) between objects. Dyadic descriptions are more powerful than vector-based descriptions, as vectorial data can always be brought into dyadic form when required. This is often done for kernel-based classifiers or regression functions (Schölkopf & Smola, 2002; Vapnik, 1998), where a Gram matrix of mutual similarities is calculated before the predictor is learned. A similar procedure can also be used in cases where the row and column objects are from different sets. If both of them are described by feature vectors, a matrix can be calculated by applying a kernel function to pairs of feature vectors—one vector describing a row and the other vector describing a column object. One example for this is the drug-gene matrix of Scherf et al. (2000), which
was constructed as the product of a measured drug sample and a measured sample gene matrix and where the kernel function was a scalar product. In many cases, however, dyadic descriptions emerge because the matrix entries are measured directly. Pairwise data representations as a special case of dyadic data can be found for data sets where similarities or distances between objects are measured. Examples include sequential or biophysical similarities between proteins (Lipman & Pearson, 1985; Sigrist et al., 2002; Falquet et al., 2002), chromosome location or co-expression similarities between genes (Cremer et al., 1993; Lu, Wallrath, & Elgin, 1994; Heyer, Kruglyak, & Yooseph, 1999), co-citations of text documents (White & McCain, 1989; Bayer, Smart, & McLaughlin, 1990; Ahlgren, Jarneving, & Rousseau, 2003), or hyperlinks between web pages (Kleinberg, 1999). In general, these measured matrices are symmetric but may not be positive definite, and even if they are for the training set, they may not remain positive definite if new examples are included. Genuine dyadic data occur whenever two sets of objects are related. Examples are DNA microarray data (Southern, 1988; Lysov et al., 1988; Drmanac, Labat, Brukner, & Crkvenjakov, 1989; Bains & Smith, 1988), where the column objects are tissue samples, the row objects are genes, and every sample-gene pair is related by the expression level of this particular gene in this particular sample. Other examples are web documents, where both the column and row objects are web pages and column web pages are described by either hyperlinks to or from the row objects, which give rise to a rectangular matrix.¹ Further examples include documents in a database described by word frequencies (Salton, 1968) or molecules described by transferable atom equivalent (TAE) descriptors (Mazza, Sukumar, Breneman, & Cramer, 2001).
Traditionally, row objects have often been called “features,” and column vectors of the dyadic data matrix have mostly been treated as “feature vectors” that live in a Euclidean vector space, even when the dyadic nature of the data was made explicit (see Graepel, Herbrich, Bollmann-Sdorra, & Obermayer, 1999; Mangasarian, 1998) or the “feature map” method (Schölkopf & Smola, 2002). Difficulties, however, arise when features are heterogeneous, and apples and oranges must be compared. What theoretical arguments would, for example, justify treating the values of a set of TAE descriptors as coordinates of a vector in a Euclidean space? A nonvectorial approach to pairwise data is to interpret the data matrix as a Gram matrix and to apply support vector machines (SVMs) for classification and regression if the data matrix is positive semidefinite (Graepel et al., 1999). For indefinite (but symmetric) matrices, two other nonvectorial approaches have been suggested (Graepel et al., 1999). In the first approach, the data matrix is projected into the subspace spanned by the eigenvectors
1 Note that for pairwise data examples, the linking matrix was symmetric because links were considered bidirectional.
with positive eigenvalues. This is an approximation, and one would expect good results only if the absolute values of the negative eigenvalues are small compared to the dominant positive ones. In the second approach, directions of negative eigenvalues are processed by flipping the sign of these eigenvalues. All three approaches lead to positive semidefinite matrices on the training set, but positive semidefiniteness is not ensured if a new test object must be included. An embedding approach was suggested by Herbrich, Graepel, and Obermayer (2000) for antisymmetric matrices, but this was specifically designed for data sets where the matrix entries denote preference relations between objects. In summary, no general and principled method exists to learn classifiers or regression functions for dyadic data. In order to avoid the shortcomings noted, we suggest considering column and row objects on an equal footing and interpreting the matrix entries as the result of a kernel function or measurement kernel, which takes a row object, applies it to a column object, and outputs a number. It will turn out that mild conditions on this kernel suffice to create a vector space endowed with a dot product into which the row and the column objects can be mapped. Using this mathematical argument as a justification, we show how to construct classification and regression functions in this vector space in analogy to the large margin-based methods for learning perceptrons for vectorial data. Using an improved measure for model complexity and a new set of constraints that ensure good performance on the training data, we arrive at a generally applicable method to learn predictors for dyadic data. The new method is called the potential support vector machine (P-SVM).
It can handle rectangular matrices as well as pairwise data whose matrices are not necessarily positive semidefinite, but even when the P-SVM is applied to regular Gram matrices, it shows very good results when compared with standard methods. Due to the choice of constraints, the final predictor is expanded into a usually sparse set of descriptive row objects, which is different from the standard SVM expansion in terms of the column objects. This opens up another important application domain: a sparse expansion is equivalent to feature selection (Guyon & Elisseeff, 2003; Hochreiter & Obermayer, 2004b; Kohavi & John, 1997; Blum & Langley, 1997). An efficient implementation of the P-SVM using a modified sequential minimal optimization procedure for learning is described in Hochreiter and Obermayer (2004a).
2 The Potential Support Vector Machine

2.1 A Scale-Invariant Objective Function. Consider a set $\{x_i \mid 1 \leq i \leq L\} \subset \mathcal{X}$ of objects that are described by feature vectors $x_\phi^i \in \mathbb{R}^N$ and form a training set $X_\phi = \{x_\phi^1, \ldots, x_\phi^L\}$. The index $\phi$ is introduced because we will later use the kernel trick and assume that the vectors $x_\phi^i$ are images of a map $\phi$ that is induced by either a kernel or a measurement
function. Assume for the moment a simple binary classification problem, where class membership is indicated by labels $y_i \in \{+1, -1\}$, $\langle \cdot, \cdot \rangle$ denotes the dot product, and a linear classifier parameterized by the weight vector $w$ and the offset $b$ is chosen from the set

$\{\operatorname{sign} f(x_\phi) \mid f(x_\phi) = \langle w, x_\phi \rangle + b\}. \qquad (2.1)$

Standard SVM techniques select the classifier with the largest margin under the constraint of correct classification on the training set:

$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i \left( \langle w, x_\phi^i \rangle + b \right) \geq 1. \qquad (2.2)$
If the training data are not linearly separable, a large margin is traded against a small training error using a suitable regularization scheme. The maximum margin objective is motivated by bounds on the generalization error using the Vapnik-Chervonenkis (VC) dimension $h$ as a capacity measure (Vapnik, 1998). For the set of all linear classifiers defined on $X_\phi$ for which $\gamma \geq \gamma_{\min}$ holds, one obtains $h \leq \min\left( \left[ R^2 / \gamma_{\min}^2 \right], N \right) + 1$ (see Vapnik, 1998; Schölkopf & Smola, 2002). $[\cdot]$ denotes the integer part, $\gamma$ is the margin, and $R$ is the radius of the smallest sphere in data space that contains all the training data. Bounds derived using the fat-shattering dimension (Shawe-Taylor, Bartlett, Williamson, & Anthony, 1996, 1998; Schölkopf & Smola, 2002) and bounds on the expected generalization error (cf. Vapnik, 1998; Schölkopf & Smola, 2002) depend on $R/\gamma_{\min}$ in a similar manner. Both the selection of a classifier using the maximum margin principle and the values obtained for the generalization error bounds suffer from the problem that they are not invariant under linear transformations. This problem is illustrated in Figure 2 for a 2D classification problem. In the left figure, both classes are separated by the hyperplane with the largest margin (solid line). In the right figure, the same classification problem is shown, but scaled along the vertical axis by a factor $s$. Again, the solid line denotes the support vector solution, but when the classifier is scaled back to $s = 1$ (dashed line in the left figure), it no longer coincides with the original SVM solution. The ratio $R^2/\gamma^2$, which bounds the VC dimension, also depends on the scale factor $s$ (see the legend of Figure 2). The situation depicted in Figure 2 occurs whenever the data can be enclosed by a (nonspherical) ellipsoid (cf. Schölkopf, Shawe-Taylor, Smola, & Williamson, 1999).
The considerations of Figure 2 show that the optimal hyperplane is not scale invariant, and predictions of class labels may change if the data are rescaled before learning. Thus, the question arises of which scale factors should be used for classifier selection.
Figure 2: (Left) Data points from two classes (triangles and circles) are separated by the hyperplane with the largest margin (solid line). The two support vectors (black symbols) are separated by $d_x$ along the horizontal and by $d_y$ along the vertical axis; $4\gamma^2 = d_x^2 + d_y^2$ and $R^2/(4\gamma^2) = R^2/(d_x^2 + d_y^2)$. The dashed line indicates the classification boundary of the classifier shown on the right, scaled along the vertical axis by the factor $1/s$. (Right) The same data but scaled along the vertical axis by the factor $s$. The solid line denotes the maximum margin hyperplane; $4\gamma^2 = d_x^2 + s^2 d_y^2$ and $R^2/(4\gamma^2) = R^2/(d_x^2 + s^2 d_y^2)$. For $d_y \neq 0$, both the margin $\gamma$ and the term $R^2/\gamma^2$ depend on $s$.
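As a numeric sketch of the formulas in the figure legend (the values of $d_x$ and $d_y$ below are hypothetical, not taken from the paper), one can tabulate how $4\gamma^2$, and hence the bound-relevant ratio $R^2/\gamma^2$, changes with the scale factor $s$:

```python
# Illustrative only: dx, dy are hypothetical separations of the two support
# vectors along the horizontal and vertical axes (cf. the Figure 2 legend).
dx, dy = 1.0, 2.0

def four_gamma_sq(s):
    """4*gamma^2 after scaling the vertical axis by the factor s."""
    return dx**2 + (s * dy)**2

for s in (0.5, 1.0, 2.0):
    print(s, four_gamma_sq(s))  # the margin, hence R^2/gamma^2, varies with s
```

Since $d_y \neq 0$ here, the three printed values differ, which is exactly the scale dependence the text criticizes.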
Here we suggest scaling the training data such that the margin $\gamma$ remains constant while the radius $R$ of the sphere containing all training data becomes as small as possible. The result is a new sphere with radius $\tilde{R}$, which leads to a tighter margin-based bound for the generalization error. Optimality is achieved when all directions orthogonal to the normal vector $w$ of the maximal margin hyperplane are scaled to zero and $\tilde{R} = \min_{t \in \mathbb{R}} \max_i |\langle \hat{w}, x_\phi^i \rangle + t| \leq \max_i |\langle \hat{w}, x_\phi^i \rangle|$, where $\hat{w} := w/\|w\|$. If $|t|$ is small compared to $|\langle \hat{w}, x_\phi^i \rangle|$, for example, if the data are centered at the origin, $t$ can be neglected in the above inequality. Unfortunately, this formulation does not lead to an optimization problem that is easy to implement. Therefore, we suggest minimizing the upper bound,

$\frac{\tilde{R}^2}{\gamma^2} = \tilde{R}^2 \|w\|^2 \leq \max_i \langle w, x_\phi^i \rangle^2 \leq \sum_i \langle w, x_\phi^i \rangle^2 = \left\| X_\phi^\top w \right\|^2, \qquad (2.3)$

where the matrix $X_\phi := (x_\phi^1, x_\phi^2, \ldots, x_\phi^L)$ contains the training vectors $x_\phi^i$. In the second inequality, the maximum norm is bounded by the Euclidean norm. Its worst-case factor is $L$, but the bound is tight. The new objective is well defined also for cases where $X_\phi X_\phi^\top$ and/or $X_\phi^\top X_\phi$ is singular, and the kernel trick carries over to the new technique.
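The inequality chain in equation 2.3 is easy to check numerically. The following sketch (random synthetic data, illustrative only) verifies that the squared maximal projection is bounded by $\|X_\phi^\top w\|^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 3, 10
X = rng.normal(size=(N, L))  # columns are the training vectors x_phi^i
w = rng.normal(size=N)       # an arbitrary weight vector

projections = X.T @ w                 # <w, x_phi^i> for i = 1..L
max_term = np.max(projections ** 2)   # middle term of equation 2.3
sum_term = np.sum(projections ** 2)   # upper bound, equals ||X^T w||^2

assert max_term <= sum_term
assert np.isclose(sum_term, np.linalg.norm(X.T @ w) ** 2)
```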
It can be shown that replacing the objective function $\|w\|^2$ (see equation 2.2) by the upper bound

$w^\top X_\phi X_\phi^\top w = \left\| X_\phi^\top w \right\|^2 \qquad (2.4)$

on $\tilde{R}^2/\gamma^2$, equation 2.3, corresponds to the integration of sphering (whitening) and SVM learning if the data have zero mean. If the data have already been sphered, then the covariance matrix is given by $X_\phi X_\phi^\top = I$, and we recover the classical SVM. If not, minimizing the new objective leads to normal vectors that are rotated toward directions of low variance of the data when compared with the standard maximum margin solution. Note, however, that whitening can easily be performed in input space but becomes nontrivial if the data are mapped to a high-dimensional feature space using a kernel function. In order to test whether the situation depicted in Figure 2 actually occurs in practice, we applied a C-SVM with a radial basis function (RBF) kernel to the UCI data set Breast Cancer from section 3.1.1. The kernel width was chosen small and $C$ large enough so that no error occurred on the training set (to be consistent with Figure 2). Sphering was performed in feature space by replacing the objective function in equation 2.2 with the new objective, equation 2.4. Figure 3 shows the angle
$\psi = \arccos \frac{\langle w_{\mathrm{svm}}, w_{\mathrm{sph}} \rangle}{\sqrt{\langle w_{\mathrm{svm}}, w_{\mathrm{svm}} \rangle}\,\sqrt{\langle w_{\mathrm{sph}}, w_{\mathrm{sph}} \rangle}} = \arccos \frac{\sum_{i,j} \alpha_i^{\mathrm{svm}} \alpha_j^{\mathrm{sph}} y_i\, k(x^i, x^j)}{\sqrt{\sum_{i,j} \alpha_i^{\mathrm{svm}} \alpha_j^{\mathrm{svm}} y_i y_j\, k(x^i, x^j)}\,\sqrt{\sum_{i,j} \alpha_i^{\mathrm{sph}} \alpha_j^{\mathrm{sph}} k(x^i, x^j)}} \qquad (2.5)$
between the hyperplanes for the sphered (“sph”) and the nonsphered (“svm”) case as a function of the width $\sigma$ of the RBF kernel. The observed values range between 6 and 50 degrees, providing evidence for the situation shown in Figure 2. The new objective function, equation 2.4, leads to separating hyperplanes that are invariant under linear transformations of the data. As a consequence, neither the bounds nor the performance of the derived classifier depends on how the training data were scaled. But is the new objective function also related to a capacity measure for the model class, as the margin is? It is: Hochreiter and Obermayer (2004c) have shown that the capacity measure, equation 2.4, emerges when a bound for the generalization error is constructed using the technique of covering numbers.

2.2 Constraints for Complex Features. The next step is to formulate a set of constraints that enforce good performance on the training set and
Figure 3: Application of a C-SVM to the data set Breast Cancer of the UCI Benchmark Repository. The angle $\psi$ (see equation 2.5) between weight vectors $w_{\mathrm{svm}}$ and $w_{\mathrm{sph}}$ is plotted as a function of $\sigma$ ($C = 1000$). For $\sigma \to 0$, the data are already approximately sphered in feature space; hence, the objective functions from equations 2.2 and 2.4 lead to similar results.
implement the new idea that row and column objects are both mapped into a Hilbert space within which matrix entries correspond to scalar products and the classification or regression function is constructed. Assume that each matrix entry was measured by a device that determines the projection of an object feature vector $x_\phi$ onto a direction $z_\omega$ in feature space. The value $K_{ij}$ of such a complex feature $z_\omega^j$ for an object $x_\phi^i$ is then given by the dot product

$K_{ij} = \langle x_\phi^i, z_\omega^j \rangle. \qquad (2.6)$
In analogy to the index $\phi$ for $x_\phi$, the index $\omega$ indicates that we will later assume that the vectors $z_\omega^i$ are images in $\mathbb{R}^N$ of a map $\omega$ that is induced by either a kernel or a measurement function. A mathematical foundation of the ansatz, equation 2.6, is given in the appendix. Assume that we have $P$ devices whose corresponding vectors $z_\omega$ are the only measurable and accessible directions, and let $Z_\omega := (z_\omega^1, z_\omega^2, \ldots, z_\omega^P)$ be the matrix of all complex features. Then we can summarize our (incomplete) knowledge about the objects in $X_\phi$ using the data matrix

$K = X_\phi^\top Z_\omega. \qquad (2.7)$
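A minimal sketch of equation 2.7 (random synthetic feature vectors, illustrative only): the data matrix simply collects the dot products between the object vectors and the complex features:

```python
import numpy as np

rng = np.random.default_rng(4)
N, L, P = 4, 6, 3
X = rng.normal(size=(N, L))  # X_phi: L column objects as columns
Z = rng.normal(size=(N, P))  # Z_omega: P complex features as columns

K = X.T @ Z                  # K[i, j] = <x_phi^i, z_omega^j>, equation 2.7

assert K.shape == (L, P)
assert np.isclose(K[0, 0], X[:, 0] @ Z[:, 0])
```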
For DNA microarray data, for example, we could identify $K$ with the matrix of expression values obtained by a microarray experiment. For web data, we could identify $K$ with the matrix of ingoing or outgoing hyperlinks. For a document data set, we could identify $K$ with the matrix of word frequencies. Hence we assume that $x_\phi$ and $z_\omega$ live in a space of hidden causes that are responsible for the different attributes of the objects. The complex features $\{z_\omega^j\}$ span a subspace of the original feature space, but we do not require them to be orthogonal, normalized, or linearly independent. If we set $z_\omega^j = e_j$ (the $j$th Cartesian unit vector), that is, $Z_\omega = I$, $K_{ij} = x_j^i$, and $P = N$, the “new” description, equation 2.7, is fully equivalent to the “old” description using the original feature vectors $x_\phi$. We now define the quality measure for the performance of the classifier or the regression function on the training set. We consider the quadratic loss function,

$c(y_i, f(x_\phi^i)) = \frac{1}{2} r_i^2, \qquad r_i = f(x_\phi^i) - y_i = \langle w, x_\phi^i \rangle + b - y_i. \qquad (2.8)$

The mean squared error is

$R_{\mathrm{emp}}[f_{w,b}] = \frac{1}{L} \sum_{i=1}^{L} c\left( y_i, f(x_\phi^i) \right). \qquad (2.9)$
We now require that the selected classification or regression function minimizes $R_{\mathrm{emp}}$, that is, that

$\nabla_w R_{\mathrm{emp}}[f_{w,b}] = \frac{1}{L} X_\phi \left( X_\phi^\top w + b \mathbf{1} - y \right) = 0 \qquad (2.10)$

and

$\frac{\partial R_{\mathrm{emp}}[f]}{\partial b} = \frac{1}{L} \sum_i r_i = b + \frac{1}{L} \sum_i \left( \langle w, x_\phi^i \rangle - y_i \right) = 0, \qquad (2.11)$

where the labels for all objects in the training set are summarized by a label vector $y$. Since the quadratic loss function is convex in $w$ and $b$, only one minimum exists if $X_\phi X_\phi^\top$ has full rank. If $X_\phi X_\phi^\top$ is singular, then all points of minimal value correspond to a subspace of $\mathbb{R}^N$. From equation 2.11, we obtain

$b = -\frac{1}{L} \sum_{i=1}^{L} \left( \langle w, x_\phi^i \rangle - y_i \right) = -\frac{1}{L} \left( w^\top X_\phi - y^\top \right) \mathbf{1}. \qquad (2.12)$
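The optimality conditions 2.10 to 2.12 can be verified on a small random least-squares fit; the following is a sketch on synthetic data, not part of the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 2, 8
X = rng.normal(size=(N, L))           # columns are feature vectors x_phi^i
y = rng.choice([-1.0, 1.0], size=L)   # labels

# minimize ||X^T w + b*1 - y||^2 over (w, b) via least squares
A = np.hstack([X.T, np.ones((L, 1))])
sol, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = sol[:-1], sol[-1]

r = X.T @ w + b - y                          # residuals r_i
assert np.allclose(X @ r, 0, atol=1e-8)      # equation 2.10
assert np.isclose(r.mean(), 0, atol=1e-8)    # equation 2.11
assert np.isclose(b, -np.mean(X.T @ w - y))  # equation 2.12
```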
Condition 2.10 implies that the directional derivative should be zero along any direction in feature space, including the directions of the complex feature vectors $z_\omega$. We therefore obtain

$\frac{d R_{\mathrm{emp}}[f_{w + t z_\omega^j, b}]}{dt} = (z_\omega^j)^\top \nabla_w R_{\mathrm{emp}}[f_{w,b}] = \frac{1}{L} (z_\omega^j)^\top X_\phi \left( X_\phi^\top w + b \mathbf{1} - y \right) = 0, \qquad (2.13)$

and, combining all complex features,

$\frac{1}{L} Z_\omega^\top X_\phi \left( X_\phi^\top w + b \mathbf{1} - y \right) = \frac{1}{L} K^\top r = \sigma = 0. \qquad (2.14)$
Hence, we require that for every complex feature $z_\omega^j$, the mixed moment $\sigma_j$ between the residual error $r_i$ and the measured values $K_{ij}$ should be zero.

2.3 The Potential Support Vector Machine

2.3.1 The Basic P-SVM. The new objective from equation 2.4 and the new constraints from equation 2.14 constitute a new procedure for selecting a classifier or a regression function. The number of constraints is equal to the number $P$ of complex features, which can be larger or smaller than the number $L$ of data points or the dimension $N$ of the original feature space. Because the mean squared error of a linear function $f_{w,b}$ is a convex function of the parameters $w$ and $b$, the constraints enforce that $R_{\mathrm{emp}}$ is minimal.² If $K$ has at least rank $L$ (number of training examples), then the minimum even corresponds to $r = 0$ (cf. equation 2.14). Consequently, overfitting occurs, and a regularization scheme is needed. Before a regularization scheme can be defined, however, the mixed moments $\sigma_j$ must be normalized. The reason is that high values of $\sigma_j$ may be a result of either a high variance of the values of $K_{ij}$ or a high correlation between the residual error $r_i$ and the values of $K_{ij}$. Since we are interested in the latter, the most appropriate measure would be Pearson's correlation coefficient,

$\hat{\sigma}_j = \frac{\sum_{i=1}^{L} \left( K_{ij} - \bar{K}_j \right) \left( r_i - \bar{r} \right)}{\sqrt{\sum_{i=1}^{L} \left( K_{ij} - \bar{K}_j \right)^2}\ \sqrt{\sum_{i=1}^{L} \left( r_i - \bar{r} \right)^2}}, \qquad (2.15)$

² Note that $w = \left( X_\phi^\top \right)^{+} \left( y - b \mathbf{1} \right)$ fulfills the constraints, where $\left( X_\phi^\top \right)^{+}$ is the pseudoinverse of $X_\phi^\top$.
where $\bar{r} = \frac{1}{L} \sum_{i=1}^{L} r_i$ and $\bar{K}_j = \frac{1}{L} \sum_{i=1}^{L} K_{ij}$. If every column vector $\left( K_{1j}, K_{2j}, \ldots, K_{Lj} \right)^\top$ of the data matrix $K$ is normalized to mean zero and variance one, we obtain

$\sigma_j = \frac{1}{L} \sum_{i=1}^{L} K_{ij} r_i = \hat{\sigma}_j \frac{1}{\sqrt{L}} \left\| r - \bar{r} \mathbf{1} \right\|_2. \qquad (2.16)$
Because $\bar{r} = 0$ (see equation 2.11), the mixed moments are now proportional to the correlation coefficient $\hat{\sigma}_j$ with a proportionality constant independent of the complex feature $z_\omega^j$, and $\sigma_j$ can be used instead of $\hat{\sigma}_j$ to formulate the constraints. If the data vectors are normalized, the term $K^\top \mathbf{1}$ vanishes, and we obtain the basic P-SVM optimization problem:

$\min_w \ \frac{1}{2} \left\| X_\phi^\top w \right\|^2 \quad \text{s.t.} \quad K^\top \left( X_\phi^\top w - y \right) = 0. \qquad (2.17)$
The offset $b$ of the classification or regression function is given by equation 2.12, which simplifies after normalization to (see Hochreiter & Obermayer, 2004a)

$b = \frac{1}{L} \sum_{i=1}^{L} y_i. \qquad (2.18)$
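The column normalization assumed for the basic P-SVM (mean zero, variance one per column of $K$) can be sketched as follows; `normalize_columns` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def normalize_columns(K):
    """Shift every column of K to mean zero and scale it to variance one."""
    K = np.asarray(K, dtype=float)
    K = K - K.mean(axis=0)
    std = K.std(axis=0)
    std[std == 0.0] = 1.0  # leave constant columns unscaled
    return K / std

rng = np.random.default_rng(2)
K = rng.normal(loc=3.0, scale=2.0, size=(6, 4))
Kn = normalize_columns(K)

assert np.allclose(Kn.mean(axis=0), 0.0)
assert np.allclose(Kn.std(axis=0), 1.0)
```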
We will call this model selection procedure the potential support vector machine (P-SVM), and we will always assume normalized data vectors in the following.

2.3.2 The Kernel Trick. Following the standard support vector way to derive learning rules for perceptrons, we have so far considered linear classifiers only. Most practical classification problems, however, require nonlinear classification boundaries, which makes the construction of a proper feature space necessary. In analogy to the standard SVM, we now invoke the kernel trick. Let $x^i$ and $z^j$ be feature vectors that describe the column and row objects of the data set. We then choose a kernel function $k(x^i, z^j)$ and compute the matrix $K$ of relations between column and row objects:

$K_{ij} = k(x^i, z^j) = \langle \phi(x^i), \omega(z^j) \rangle = \langle x_\phi^i, z_\omega^j \rangle, \qquad (2.19)$
where $x_\phi^i = \phi(x^i)$ and $z_\omega^j = \omega(z^j)$. In the appendix, it is shown that any $L_2$ kernel corresponds (for almost all points) to a dot product in a Hilbert space in the sense of equation 2.19 and induces a mapping into a feature space within which a linear classifier can then be constructed. In the following sections, we will therefore distinguish between the actual measurements $x^i$ and $z^j$ and the feature vectors $x_\phi^i$ and $z_\omega^j$ “induced” by the kernel $k$. Potential choices for row objects and their vectorial description are $z^j = x^j$, $P = L$ (standard construction of a Gram matrix); $z^j = e_j$, $P = N$ (elementary features); $z^j$ is the $j$th cluster center of a clustering algorithm applied to the vectors $x^i$ (an example of a “complex” feature); or $z^j$ is the $j$th vector of a principal component analysis (PCA) or independent component analysis (ICA) preprocessing (another example of a “complex” feature). If the entries $K_{ij}$ of the data matrix are directly measured, the application of the kernel trick needs additional considerations. In the appendix, we show that if the measurement process can be expressed through a kernel $k(\mathbf{x}_i, \mathbf{z}_j)$, which takes a column object $\mathbf{x}_i$ and a row object $\mathbf{z}_j$ and outputs a number, the matrix $K$ of relations between the row and column objects can be interpreted as a dot product,

$K_{ij} = \langle x_\phi^i, z_\omega^j \rangle, \qquad (2.20)$
in some feature space, where $x_\phi^i = \phi(\mathbf{x}_i)$ and $z_\omega^j = \omega(\mathbf{z}_j)$. Note that we distinguish between an object $\mathbf{x}_i$ and its associated feature vectors $x^i$ or $x_\phi^i$, leading to differences in the definition of $k$ for the cases of vectorial and (measured) dyadic data. Equation 2.20 justifies the P-SVM approach, which was derived for the case of linear predictors, also for measured data.

2.3.3 The P-SVM for Classification. If the P-SVM is used for classification, we suggest a regularization scheme based on slack variables $\xi^+$ and $\xi^-$. Slack variables allow small violations of individual constraints if the correct choice of $w$ would otherwise lead to a large increase of the objective function. We obtain

$\min_{w,\,\xi^+,\,\xi^-} \ \frac{1}{2} \left\| X_\phi^\top w \right\|^2 + C \mathbf{1}^\top \left( \xi^+ + \xi^- \right)$
$\text{s.t.} \quad K^\top \left( X_\phi^\top w - y \right) + \xi^+ \geq 0, \quad K^\top \left( X_\phi^\top w - y \right) - \xi^- \leq 0, \quad 0 \leq \xi^+,\ \xi^- \qquad (2.21)$
for the primal problem. The above regularization scheme makes the optimization problem robust against outliers. A large value of a slack variable indicates that the
particular row object only weakly influences the direction of the classification boundary, because it would otherwise considerably increase the value of the complexity term. This happens in particular for high levels of measurement noise, which leads to large, spurious values of the mixed moments $\sigma_j$. If the noise is large, the value of $C$ must be small to remove the corresponding constraints via the slack variables $\xi$. If the strength of the measurement noise is known, the correct value of $C$ can be determined a priori. Otherwise it must be determined using model selection techniques. To derive the dual optimization problem, we evaluate the Lagrangian $L$,

$L = \frac{1}{2} w^\top X_\phi X_\phi^\top w + C \mathbf{1}^\top \left( \xi^+ + \xi^- \right) - \left( \alpha^+ \right)^\top \left[ K^\top \left( X_\phi^\top w - y \right) + \xi^+ \right] + \left( \alpha^- \right)^\top \left[ K^\top \left( X_\phi^\top w - y \right) - \xi^- \right] - \left( \mu^+ \right)^\top \xi^+ - \left( \mu^- \right)^\top \xi^-, \qquad (2.22)$
where the vectors $\alpha^+ \ge 0$, $\alpha^- \ge 0$, $\mu^+ \ge 0$, and $\mu^- \ge 0$ are the Lagrange multipliers for the constraints in equations 2.21. The optimality conditions require that

$$\nabla_w L = X_\phi X_\phi^\top w - X_\phi K \alpha = X_\phi X_\phi^\top w - X_\phi X_\phi^\top Z_\omega \alpha = 0,$$
(2.23)
where $\alpha = \alpha^+ - \alpha^-$ ($\alpha_i = \alpha_i^+ - \alpha_i^-$). Equation 2.23 is fulfilled if

$$w = Z_\omega \alpha.$$
(2.24)
In contrast to the standard SVM expansion of $w$ into its support vectors $x_\phi$, the weight vector $w$ is now expanded into a set of complex features $z_\omega$, which we will call support features in the following. We then arrive at the dual optimization problem:

$$\min_\alpha \; \frac{1}{2}\alpha^\top K^\top K \alpha - y^\top K \alpha$$
$$\text{s.t.} \quad -C\mathbf{1} \le \alpha \le C\mathbf{1},$$
(2.25)
which now depends on the data via the kernel or data matrix K only. The dual problem can be solved by a sequential minimal optimization (SMO) technique, which is described in Hochreiter and Obermayer (2004a). The SMO technique is essential if many complex features are used, because the
$P \times P$ matrix $K^\top K$ enters the dual formulation. Finally, the classification function $f$ has to be constructed using the optimal values of the Lagrange parameters $\alpha$:
$$f(x_\phi) = \sum_{j=1}^{P} \alpha_j K(x)_j + b,$$
(2.26)

where the expansion, equation 2.24, has been used for the weight vector $w$, and $b$ is given by equation 2.18. The classifier based on equation 2.26 depends on the coefficients $\alpha_j$, which were determined during optimization; on $b$, which can be computed directly; and on the measured values $K(x)_j$ for the new object $x$. The coefficients $\alpha_j = \alpha_j^+ - \alpha_j^-$ can be interpreted as class indicators because they separate the complex features into features that are relevant for class 1 and class −1, according to the sign of $\alpha_j$. Note that if we consider the Lagrange parameters $\alpha_j$ as parameters of the classifier, we find that

$$\frac{d\,R_{\mathrm{emp}}\!\left[f_{w + t z_\omega^j,\, b}\right]}{dt} = \sigma_j = \frac{\partial R_{\mathrm{emp}}[f]}{\partial \alpha_j}.$$
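The dual problem, equations 2.25, is a box-constrained quadratic program in $\alpha$. As a rough sketch (our own illustration, not the SMO solver of Hochreiter and Obermayer, 2004a), it can be solved by projected gradient descent; the bias rule `b = mean(y - K a)` below is a simplifying assumption standing in for equation 2.18, and the toy data are hypothetical:

```python
import numpy as np

def psvm_dual(K, y, C, n_iter=5000):
    """Minimize 0.5 a'K'K a - y'K a  s.t.  -C <= a <= C  (equations 2.25)
    by projected gradient descent -- an illustrative stand-in for SMO."""
    Q, p = K.T @ K, K.T @ y
    lr = 1.0 / (np.linalg.norm(Q, 2) + 1e-12)     # step size below 1/Lipschitz
    a = np.zeros(K.shape[1])
    for _ in range(n_iter):
        a = np.clip(a - lr * (Q @ a - p), -C, C)  # gradient step + box projection
    return a

# Toy example: two well-separated clusters, RBF Gram matrix as K.
X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5], [0.25, 0.25],
              [3.0, 3.0], [3.5, 3.0], [3.0, 3.5], [3.5, 3.5], [3.25, 3.25]])
y = np.array([1.0] * 5 + [-1.0] * 5)
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-D2 / 2.0)                             # RBF kernel, sigma = 1
a = psvm_dual(K, y, C=10.0)
b = float(np.mean(y - K @ a))                     # simplifying stand-in for eq. 2.18
pred = np.sign(K @ a + b)
```

The classifier for a new object then only needs the measured values $K(x)_j$ against the support features, exactly as in equation 2.26.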
The directional derivatives of the empirical error $R_{\mathrm{emp}}$ along the complex features in the primal formulation correspond to its partial derivatives with respect to the corresponding Lagrange parameters in the dual formulation. One of the most crucial properties of the P-SVM procedure is that the dual optimization problem depends on $K$ only via $K^\top K$. Therefore, $K$ is not required to be positive semidefinite or square. This allows not only the construction of SVM-based classifiers for matrices $K$ of general shape but also the extension of SVM-based approaches to the new class of indefinite kernels operating on the objects' feature vectors. In the following, we illustrate the application of the P-SVM approach to classification using a toy example. The data set consists of 70 objects $x$, 28 from class 1 and 42 from class 2, which are described by 2D feature vectors $x$ (open and solid circles in Figure 4). Two pairwise data sets were then generated by applying a standard RBF kernel and the (indefinite) sine kernel $k(x^i, x^j) = \sin(\theta \|x^i - x^j\|)$, which leads to an indefinite Gram matrix. Figure 4 shows the classification result. The sine kernel is more appropriate than the RBF kernel for this data set because it is better adjusted to the oscillatory regions of class membership, leading to a smaller number of support vectors and a smaller test error. In general, the value of $\theta$ has to be selected using standard model selection techniques. A large value of $\theta$ leads to a more complex set of classifiers, reduces the classification error on the training set, but increases the error on the test set. Non-Mercer kernels
[Figure 4 appears here. Left panel: kernel: RBF, SV: 50, σ: 0.1, C: 100. Right panel: kernel: sine, SV: 7, θ: 25, C: 100. Both panels span [0, 1.2] × [0, 1.2].]
Figure 4: Application of the P-SVM method to a toy classification problem. Objects are described by 2D feature vectors $x$. Seventy feature vectors were generated: 28 for class 1 (open circles) and 42 for class 2 (solid circles). A Gram matrix was constructed using the RBF kernel $k(x^i, x^j) = \exp(-\frac{1}{\sigma^2}\|x^i - x^j\|^2)$ (left) and the sine kernel $k(x^i, x^j) = \sin(\theta\|x^i - x^j\|)$ (right). White and gray indicate regions of class 1 and class 2 membership, respectively. Circled data points indicate support vectors (SV). For the parameters, see the figure.
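Because the dual depends on $K$ only through $K^\top K$, indefinite similarity matrices such as the sine-kernel matrix are admissible. The following sketch (our own numerical check, using the parameter values of Figure 4) verifies that the RBF Gram matrix is positive semidefinite while the sine-kernel matrix is not:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.2, size=(70, 2))     # 70 points in 2D, as in Figure 4

# Pairwise Euclidean distances
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

K_rbf = np.exp(-(D ** 2) / 0.1 ** 2)        # RBF kernel, sigma = 0.1
K_sine = np.sin(25.0 * D)                   # sine kernel, theta = 25

# The sine matrix has zero diagonal (sin 0 = 0), hence zero trace; since it
# is nonzero and symmetric, it must have negative eigenvalues.
min_eig_rbf = np.linalg.eigvalsh(K_rbf).min()
min_eig_sine = np.linalg.eigvalsh(K_sine).min()
```

A standard SVM would require a Mercer (positive semidefinite) kernel here; the P-SVM does not.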
extend the range of kernels that can be used and therefore open up a new direction of research for kernel design.

2.3.4 The P-SVM for Regression. The new objective function of equation 2.4 was motivated for a classification problem, but it can also be used to find an optimal regression function in a regression task. In regression, the term $\|X_\phi^\top w\|^2 = \|X_\phi^\top \hat{w}\|^2 \|w\|^2$, $\hat{w} := w/\|w\|$, is the product of a term that expresses the deviation of the data from the zero isosurface of the regression function and a term that corresponds to the smoothness of the regressor. The smoothness of the regression function is determined by the norm of the weight vector $w$. If $f(x_\phi^i) = \langle w, x_\phi^i \rangle + b$ and the length of the vectors $x_\phi$ is bounded by $B$, then the deviation of $f$ from the offset $b$ is bounded by

$$\left| f(x_\phi^i) - b \right| = \left| \langle w, x_\phi^i \rangle \right| \le \|w\|\,\|x_\phi^i\| \le \|w\| B,$$
(2.27)
where the first ≤ follows from the Cauchy-Schwarz inequality; hence, a smaller value of $\|w\|$ leads to a smoother regression function. This trade-off between smoothness and residual error is also reflected by equation A.24 in the appendix, which shows that equation 2.4 is the $L_2$-norm of the function $f$. The constraints of vanishing mixed moments carry over to regression problems, with the only modification that the target values $y_i$ in equations
[Figure 5 appears here. Top left panel: SV: 7, σ: 2, C: 20. Top right panel: SV: 5, σ: 4, C: 20. Bottom panel: SV: 14, σ: 2, C: 0.6. Axes span x ∈ [−10, 10], y ∈ [0, 10]; each panel marks x = −2.]
Figure 5: Application of the P-SVM method to a toy regression problem. Objects (small dots), described by the x-coordinate, were generated by randomly choosing a value from [−10, 10] and calculating the y-value from the true function (dashed line) plus gaussian noise N(0, 0.2). One outlier was added at x = 0 (thin arrows). A Gram matrix was generated using an RBF kernel $k(x^i, x^j) = \exp(-\frac{1}{\sigma^2}\|x^i - x^j\|^2)$. The solid lines show the regression result. Circled dots indicate support vectors (SV). For parameters, see the figure. Bold arrows mark x = −2.
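Since the same box-constrained dual, equations 2.25, is reused with real-valued targets, a minimal regression sketch can be written in a few lines (our own illustration: projected gradient instead of SMO, and the bias rule `b = mean(y - K a)` is a simplifying assumption):

```python
import numpy as np

# Fit y = sin(x) by reusing the classification dual (equations 2.25)
# with real-valued targets y.
x = np.linspace(-10.0, 10.0, 20)
y = np.sin(x)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 2.0 ** 2))  # RBF, sigma = 2

Q, p = K.T @ K, K.T @ y
lr = 1.0 / np.linalg.norm(Q, 2)                  # step size below 1/Lipschitz
alpha = np.zeros(x.size)
for _ in range(20000):
    # gradient step on 0.5 a'Qa - p'a, then projection onto the box [-C, C]
    alpha = np.clip(alpha - lr * (Q @ alpha - p), -20.0, 20.0)  # C = 20

b = float(np.mean(y - K @ alpha))                # simplifying bias rule
fit = K @ alpha + b
```

As in Figure 5, $C$ caps the influence of any single row object (e.g., an outlier), while $\sigma$ controls the smoothness of the fit.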
2.21 are real rather than binary numbers. The constraints are even more "natural" for regression because the $r_i$ are indeed the residuals that a regression function should minimize. We therefore propose to use the primal optimization problem, equations 2.21, and its corresponding dual, equations 2.25, also for the regression setting. Figure 5 shows the application of the P-SVM to a toy regression example (pairwise data). Fifty data points are randomly chosen from the true function (dashed line), and independent and identically distributed gaussian noise with mean 0 and standard deviation 0.2 is added to each y-component. One outlier was added at x = 0. The figure shows the P-SVM regression results (solid lines) for an RBF kernel and three different combinations of C and σ. The hyperparameter C controls the residual error. A smaller value of C increases the residual error at x = 0 but also the
number of support vectors. The width σ of the RBF kernel controls the overall smoothness of the regressor: A larger value of σ increases the error at x = 0 without increasing the number of support vectors (arrows in bold in Figure 5). 2.3.5 The P-SVM for Feature Selection. In this section we modify the P-SVM method for feature selection such that it can serve as a data preprocessing method in order to improve the generalization performance of subsequent classification or regression tasks (see also Hochreiter & Obermayer, 2004b). Due to the property of the P-SVM method to expand w into a sparse set of support features, it can be modified to optimally extract a small number of informative features from the set of row objects. The set of support features can then be used as input to an arbitrary predictor, for example, another P-SVM, a standard SVM, or a K-nearest-neighbor classifier. Noisy measurements can lead to spurious mixed moments; that is, complex features may contain no information about the objects’ attributes but still exhibit finite values of σ j . In order to prevent those features from affecting the classification boundary or the regression function, we introduce a correlation threshold ε and modify the constraints in equations 2.17 according to K X φw− K X φw−
y − 1 ≤ 0, y + 1 ≥ 0.
(2.28)
This regularization scheme is analogous to the ε-insensitive loss (Schölkopf & Smola, 2002). Absolute values of mixed moments smaller than ε are considered to be spurious, and the corresponding features do not influence the weight vector because the constraints remain fulfilled. The value of ε directly correlates with the strength of the measurement noise and can be determined a priori if it is known. If this is not the case, ε serves as a hyperparameter, and its value can be determined using model selection techniques. Note that data vectors have to be normalized (see section 2.3.1) before applying the P-SVM, because otherwise a global value of ε would not suffice. Combining equation 2.4 and equations 2.28, we then obtain the primal optimization problem,
$$\min_w \; \frac{1}{2}\|X_\phi^\top w\|^2$$
$$\text{s.t.} \quad K^\top(X_\phi^\top w - y) + \varepsilon\mathbf{1} \ge 0,$$
$$\qquad K^\top(X_\phi^\top w - y) - \varepsilon\mathbf{1} \le 0$$
(2.29)
for P-SVM feature selection. In order to derive the dual formulation, we have to evaluate the Lagrangian,

$$L = \frac{1}{2} w^\top X_\phi X_\phi^\top w - (\alpha^+)^\top\!\left[K^\top(X_\phi^\top w - y) + \varepsilon\mathbf{1}\right] + (\alpha^-)^\top\!\left[K^\top(X_\phi^\top w - y) - \varepsilon\mathbf{1}\right],$$
(2.30)
where we have used the notation from section 2.3.3. The vector $w$ is again expressed through the complex features,

$$w = Z_\omega \alpha,$$
(2.31)
and we obtain the dual formulation of equation 2.29:
$$\min_{\alpha^+,\,\alpha^-} \; \frac{1}{2}(\alpha^+ - \alpha^-)^\top K^\top K (\alpha^+ - \alpha^-) - y^\top K(\alpha^+ - \alpha^-) + \varepsilon\mathbf{1}^\top(\alpha^+ + \alpha^-)$$
$$\text{s.t.} \quad 0 \le \alpha^+, \quad 0 \le \alpha^-.$$
(2.32)
The term $\varepsilon\mathbf{1}^\top(\alpha^+ + \alpha^-)$ in this dual objective function enforces a sparse expansion of the weight vector $w$ in terms of the support features. This occurs because for large enough values of ε, this term forces all $\alpha_j$ toward zero except those of the complex features that are most relevant for classification or regression. If $K^\top K$ is singular and $w$ is not uniquely determined, ε enforces a unique solution, which is characterized by the sparsest representation through complex features. The dual problem is again solved by a fast sequential minimal optimization (SMO) technique (see Hochreiter & Obermayer, 2004a). Finally, let us address the relationship between the value of a Lagrange multiplier $\alpha_j$ and the importance of the corresponding complex feature $z_\omega^j$ for prediction. The change of the empirical error under a change of the weight vector by an amount $\beta$ along the direction of a complex feature $z_\omega^j$ is given by

$$R_{\mathrm{emp}}\!\left[f_{w+\beta z_\omega^j,\, b}\right] - R_{\mathrm{emp}}[f_{w,b}] = \beta\sigma_j + \frac{\beta^2}{2L}\sum_i K_{ij}^2 = \beta\sigma_j + \frac{\beta^2}{2} \le \frac{|\beta|\,\varepsilon}{L} + \frac{\beta^2}{2},$$
(2.33)
because the constraints, equation 2.28, ensure that $L|\sigma_j| \le \varepsilon$. If a complex feature $z_\omega^j$ is completely removed, then $\beta = -\alpha_j$ and

$$R_{\mathrm{emp}}\!\left[f_{w-\alpha_j z_\omega^j,\, b}\right] - R_{\mathrm{emp}}[f_{w,b}] \le \frac{|\alpha_j|\,\varepsilon}{L} + \frac{\alpha_j^2}{2}.$$
(2.34)
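Up to the constant $-\frac{1}{2}\|y\|^2$, the dual objective of equations 2.32 equals the lasso-type problem $\frac{1}{2}\|K\alpha - y\|^2 + \varepsilon\|\alpha\|_1$ with $\alpha = \alpha^+ - \alpha^-$, which explains the sparsity. The sketch below (our own illustration on hypothetical toy data, not the authors' SMO solver) solves it by projected gradient on $(\alpha^+, \alpha^-)$ and reads off the support features from the nonzero $\alpha_j$:

```python
import numpy as np

def psvm_select(K, y, eps, n_iter=5000):
    """Solve the eps-regularized dual (equations 2.32) by projected
    gradient on (alpha+, alpha-); returns alpha = alpha+ - alpha-."""
    Q, p = K.T @ K, K.T @ y
    lr = 1.0 / np.linalg.norm(Q, 2)
    ap = np.zeros(K.shape[1])
    am = np.zeros(K.shape[1])
    for _ in range(n_iter):
        g = Q @ (ap - am) - p
        ap = np.maximum(ap - lr * (g + eps), 0.0)   # gradient wrt alpha+, project >= 0
        am = np.maximum(am - lr * (-g + eps), 0.0)  # gradient wrt alpha-, project >= 0
    return ap - am

# Hypothetical toy data: only column 0 of K carries label information.
rng = np.random.default_rng(1)
y = np.sign(rng.standard_normal(50))
K = 0.5 * rng.standard_normal((50, 10))
K[:, 0] = y                                         # the informative feature
alpha = psvm_select(K, y, eps=15.0)
support = np.flatnonzero(np.abs(alpha) > 1e-6)      # selected support features
```

Ranking features by $|\alpha_j|$ then follows the bound of equation 2.34.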
The Lagrange parameter $\alpha_j$ is directly related to the increase in the empirical error when a feature is removed. Therefore, $\alpha$ serves as an importance measure for the complex features and allows ranking features according to the absolute values of its components. In the following, we illustrate the application of the P-SVM approach to feature selection using a classification toy example. The data set consists of 50 column objects $x$, 25 from each class, which are described by 2D feature vectors $x$ (the open and solid circles in Figure 6). Fifty row objects $z$ were randomly selected by choosing their 2D feature vectors $z$ according to a uniform distribution on [−1.2, 1.2] × [−1.2, 1.2]. $K$ was generated using an RBF kernel $k(x^i, z^j) = \exp(-\frac{1}{2\sigma^2}\|x^i - z^j\|^2)$ with σ = 0.2. Figure 6 shows the result of the P-SVM feature selection method with a correlation threshold ε = 20. Only six features were selected (the crosses in Figure 6), but every group of data points (column objects) is described (and detected) by one or two feature vectors (row objects). The number of selected features depends on ε and on σ, which determines how the strength of a complex feature decreases with the distance $\|x^i - z^j\|$. Smaller ε or larger σ would result in a larger number of support features.

2.4 The Dot Product Interpretation of Dyadic Data. The derivation of the P-SVM method is based on the assumption that the matrix $K$ is a dot product matrix whose elements denote a scalar product between the feature vectors that describe the row and the column objects. If $K$, however, is a matrix of measured values, the question arises, under which conditions can such a matrix be interpreted as a dot product matrix? In the appendix, we show that the question reduces to these conditions:

1. Can the set $\mathcal{X}$ of column objects $x$ be completed to a measure space?
2. Can the set $\mathcal{Z}$ of row objects $z$ be completed to a measure space?
3. Can the measurement process be expressed using the evaluation of a measurable kernel $k(x, z)$ from $L_2(\mathcal{X}, \mathcal{Z})$?

Conditions 1 and 2 can easily be fulfilled by defining a suitable σ-algebra on the sets. Condition 3 holds if $k$ is bounded and the sets $\mathcal{X}$ and $\mathcal{Z}$ are compact; it equates the evaluation of a kernel as known from standard SVMs with physical measurements, and physical characteristics of the measurement device determine the properties of the kernel, such as boundedness and continuity. But can a measurement process indeed be expressed through a kernel? There is no full answer to this question from a theoretical viewpoint;
[Figure 6 appears here. Panel: SV: 6, σ: 0.1, ε: 20; axes span [−1, 1] × [−1, 1].]
Figure 6: P-SVM method applied to a toy feature selection problem. Column and row objects are described by 2D feature vectors $x$ and $z$. Fifty row feature vectors, 25 from each class (open/solid circles), were chosen from N((±1, ±1), 0.1 I), where the centers are chosen with equal probability. One hundred column feature vectors were uniformly chosen from [−1.2, 1.2]². $K$ is constructed by an RBF kernel $\exp(-\frac{1}{2 \cdot 0.2^2}\|x^i - z^j\|^2)$. Black crosses indicate P-SVM support features.
practical applications have to confirm (or disprove) the chosen ansatz and data model.

3 Numerical Experiments and Applications

In this section, we apply the P-SVM method to various kinds of real-world data sets and provide benchmark results with previously proposed methods when appropriate. This section consists of three parts that cover classification, regression, and feature selection. The P-SVM is first tested as a classifier on data sets from the UCI Benchmark Repository, and its performance is compared with results obtained with the C- and ν-SVMs for different kernels. Then we apply the P-SVM to a measured (rather than constructed)
pairwise ("protein") and a genuine dyadic data set (World Wide Web). Second, we apply the P-SVM to regression problems taken from the UCI Benchmark Repository and compare with results obtained with C-support vector regression and Bayesian SVMs. Finally, we describe results obtained for the P-SVM as a feature selection method for microarray data and for the Protein and World Wide Web data sets.

3.1 Application to Classification Problems

3.1.1 UCI Data Sets. In this section we report benchmark results for the data sets Thyroid (5 features), Heart (13 features), Breast Cancer (9 features), and German (20 features) from the UCI Benchmark Repository and for the data set Banana (2 features) taken from Rätsch, Onoda, and Müller (2001). All data sets were preprocessed as described in Rätsch et al. and divided into 100 training and test set pairs. Data sets were generated through resampling, where data points were randomly selected for the training set and the remaining data were used for the test set. We downloaded the original 100 training and test set pairs from http://ida.first.fraunhofer.de/projects/bench/. For the data sets German and Banana, we restricted the training set to the first 200 examples of the original training set in order to keep hyperparameter selection feasible but used the original test sets. Pairwise data sets were generated by constructing the Gram matrix for radial basis function (RBF), polynomial (POL), and Plummer (PLU; see Hochreiter, Mozer, & Obermayer, 2003) kernels, and the Gram matrices were used as input for kernel Fisher discriminant analysis (KFD; Mika, Rätsch, Weston, Schölkopf, & Müller, 1999) and the C-, ν-, and P-SVM. Because KFD only selects a direction in input space onto which all data points are projected, we must select a classifier for the resulting one-dimensional classification task.
We follow Schölkopf and Smola (2002) and classify a data point according to its posterior probability under the assumption of a gaussian distribution for each label. Hyperparameters (C, ν, and kernel parameters) were optimized using five-fold cross validation on the corresponding training sets. To ensure a fair comparison, the hyperparameter selection procedure was the same for all methods, except that the ν values of the ν-SVM were selected from {0.1, . . . , 0.9}, in contrast to the selection of C, for which a logarithmic scale was used. To test the significance of the differences in generalization performance (percentage of misclassifications), we calculated for what percentage of the individual runs (out of a total of 100; see below) our method was significantly better or worse than the others. In order to do so, we first performed a test for the difference of two proportions for each training and test set pair (Dietterich, 1998). The difference of two proportions is the difference of the misclassification rates of two models on the test set, where the models are selected on the training set by the two methods that are to be compared. After this test we adjusted the false
discovery rate through the Benjamini-Hochberg procedure (Benjamini & Hochberg, 1995), which was recently shown to be correct for dependent outcomes of the tests (Benjamini & Yekutieli, 2001). The fact that the tests can be dependent is important because training and test sets overlap for the different training and test set pairs. The false discovery rate was controlled at 5%. We counted for each pair of methods the selected models from the first method that perform significantly (5% level) better than the selected models from the second method. Table 1 summarizes the percentage of misclassification averaged over 100 experiments. Despite the fact that C- and ν-SVMs are equivalent, results differ because of the somewhat different model selection results for the hyperparameters C and ν. Best and second-best results are indicated by bold and italic numbers. The significance tests did not reveal significant differences over the 100 runs in generalization performance for the majority of the cases (for details, see http://ni.cs.tu-berlin.de/publications/psvm sup). The P-SVM with RBF and PLU kernels, however, never performed significantly worse on any of the 100 runs of each data set than its competitors. Note that the significance tests do not tell whether the averaged misclassification rates of one method are significantly better or worse than the rates of another method. They provide information on how many runs out of 100 one method was significantly better than another method. The UCI benchmark results show that the P-SVM is competitive with other state-of-the-art methods for a standard problem setting (measurement matrix equivalent to the Gram matrix). Although the P-SVM method never performed significantly worse, it generally required fewer support vectors than other SVM approaches. This was also true for the cases where the P-SVM's performance was significantly better.

3.1.2 Protein Data Set. The Protein data set (cf. Hofmann & Buhmann, 1997) was provided by M.
Vingron and consists of 226 proteins from the globin families. Pairs of proteins are characterized by their evolutionary distance, which is defined as the probability of transforming one amino acid sequence into the other via point mutations. Class labels are provided that denote membership in one of the four families: hemoglobin-α (H-α), hemoglobin-β (H-β), myoglobin (M), and heterogeneous globins (GH). Table 2 summarizes the classification results, which were obtained with the generalized SVM (G-SVM; Graepel et al., 1999; Mangasarian, 1998) and the P-SVM method. Since the G-SVM interprets the columns of the data matrix as feature vectors for the column objects and applies a standard ν-SVM to these vectors (this is also called the feature map; Schölkopf & Smola, 2002), we call the G-SVM simply ν-SVM in the following. The table shows the percentage of misclassification for the four two-class classification problems, one class against the rest. The P-SVM yields better classification results, although a conservative test for significance was not possible due to the small
Table 1: Average Percentage of Misclassification for the UCI and the Banana Data Sets.
                 RBF             POL             PLU
Thyroid
  C-SVM          5.11 (2.34)     4.51 (2.02)     4.96 (2.35)
  ν-SVM          5.15 (2.23)     4.04 (2.12)     4.83 (2.03)
  KFD            4.96 (2.24)     6.52 (3.18)     5.00 (2.26)
  P-SVM          4.71 (2.06)     9.44 (3.12)     5.08 (2.18)
Heart
  C-SVM          16.67 (3.51)    18.26 (3.50)    16.33 (3.47)
  ν-SVM          16.87 (3.87)    17.44 (3.90)    18.47 (7.81)
  KFD            17.82 (3.45)    22.53 (3.37)    17.80 (3.86)
  P-SVM          16.18 (3.66)    16.67 (3.40)    16.54 (3.64)
Breast Cancer
  C-SVM          26.94 (5.18)    26.87 (4.79)    26.48 (4.87)
  ν-SVM          27.69 (5.62)    26.69 (4.73)    30.16 (7.83)
  KFD            27.53 (4.19)    31.30 (6.11)    27.19 (4.72)
  P-SVM          26.78 (4.58)    26.40 (4.54)    26.66 (4.59)
Banana
  C-SVM          11.88 (1.11)    12.09 (0.96)    11.81 (1.07)
  ν-SVM          11.67 (0.90)    12.72 (2.16)    11.74 (0.98)
  KFD            12.32 (1.12)    14.04 (3.89)    12.14 (0.96)
  P-SVM          11.59 (0.96)    14.93 (2.09)    11.52 (0.93)
German
  C-SVM          26.51 (2.60)    27.27 (3.23)    26.88 (3.12)
  ν-SVM          27.14 (2.84)    27.13 (3.06)    28.60 (6.27)
  KFD            26.58 (2.95)    30.96 (3.23)    26.90 (3.15)
  P-SVM          26.45 (3.20)    25.87 (2.45)    26.65 (2.95)
Notes: The table compares results obtained with kernel Fisher discriminant analysis (KFD) and the C-, ν-, and P-SVM for the radial basis function (RBF), $\exp(-\frac{1}{2\sigma^2}\|x^i - x^j\|^2)$, polynomial (POL), $(\langle x^i, x^j \rangle + \eta)^\delta$, and Plummer (PLU), $(\|x^i - x^j\| + \rho)^{-\zeta}$, kernels. Results were averaged over 100 experiments with separate training and test sets. For each data set, numbers in bold and italics highlight, respectively, the best and the second-best results, and the numbers in parentheses denote standard deviations of the trials. C, ν, and kernel parameters were determined using five-fold cross validation on the training set and usually differed among individual experiments.
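The three kernels of the table can be sketched as follows (a hypothetical helper of our own; the parameter names sigma, eta, delta, rho, and zeta mirror the notes above):

```python
import numpy as np

def gram(X, kind, **p):
    """Gram matrix for the RBF, polynomial (POL), or Plummer (PLU)
    kernels listed in the notes of Table 1 (illustrative sketch)."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    if kind == "RBF":
        return np.exp(-D2 / (2.0 * p["sigma"] ** 2))
    if kind == "POL":
        return (X @ X.T + p["eta"]) ** p["delta"]
    if kind == "PLU":
        return (np.sqrt(D2) + p["rho"]) ** (-p["zeta"])
    raise ValueError(kind)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K_rbf = gram(X, "RBF", sigma=1.0)
K_pol = gram(X, "POL", eta=1.0, delta=2.0)
K_plu = gram(X, "PLU", rho=1.0, zeta=2.0)
```

Note that the Plummer kernel is not a Mercer kernel in general, which is unproblematic for the P-SVM.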
number of data. However, the P-SVM selected 180 proteins as support vectors on average, compared to 203 proteins used by the ν-SVM (note that for 10-fold cross validation, 203 is the average training size). Here a smaller number of support vectors is highly desirable because it reduces the computational costs of sequence alignments, which are necessary for the classification of new examples.
Table 2: Percentage of Misclassification for the Protein Data Set for Classifiers Obtained with the P- and ν-SVM Methods.
           Reg.     H-α     H-β     M       GH
Size       —        72      72      39      30
ν-SVM      0.05     1.3     4.0     0.5     0.5
ν-SVM      0.1      1.8     4.5     0.5     0.9
ν-SVM      0.2      2.2     8.9     0.5     0.9
P-SVM      300      0.4     3.5     0.0     0.4
P-SVM      400      0.4     3.1     0.0     0.9
P-SVM      500      0.4     3.5     0.0     1.3
Notes: Column Reg. lists the values of the regularization parameter (ν for the ν-SVM and C for the P-SVM). Columns H-α to GH provide the percentage of misclassification for the four problems "one class against the rest" ("Size" denotes the number of proteins per class), computed using 10-fold cross validation. The best results for each problem are shown in bold. The data matrix was not positive semidefinite.
3.1.3 World Wide Web Data Set. The World Wide Web data sets consist of 8282 WWW pages collected during the WebKB project at Carnegie Mellon University in January 1997 from the web sites of the computer science departments of Cornell University and the universities of Texas, Washington, and Wisconsin. The pages were manually classified into the categories student, faculty, staff, department, course, project, and other. Every pair (i, j) of pages is characterized by whether page i contains a hyperlink to page j and vice versa. The data are summarized using two binary matrices and a ternary matrix. The first matrix ("out") contains a one if at least one outgoing link (i → j) exists and a zero otherwise; the second matrix ("in") contains a one if at least one ingoing link (j → i) exists and a zero otherwise; and the third, ternary matrix ("sym"), the average of the "out" and "in" matrices, contains a zero if no link exists, a value of 0.5 if only one unidirectional link exists, and a value of 1 if links exist in both directions. In the following, we restricted the data set to pages from the first six classes which had more than one in- or outgoing link. The data set thus consists of the four subsets Cornell (350 pages), Texas (286 pages), Wisconsin (300 pages), and Washington (433 pages). Table 3 summarizes the classification results for the C- and P-SVM methods. The parameter C for both SVMs was optimized for each cross-validation trial using another four-fold cross validation on the training set. Significance tests were performed to evaluate the differences in generalization performance using the 10-fold cross-validated paired t-test (Dietterich, 1998). We considered 48 tasks (4 universities, 4 classes, 3 matrices) and checked for each task whether the C-SVM or the P-SVM performed better using a p-value of 0.05. In 30 tasks, the P-SVM had a significantly better performance than the C-SVM, while the C-SVM was
Table 3: Percentage of Misclassification for the World Wide Web Data Sets for Classifiers Obtained with the P- and C-SVM Methods. Course Cornell University Size 57 C-SVM (Sym) 11.1 (6.2) C-SVM (Out) 12.6 (3.1) C-SVM (In) 11.1 (4.9) P-SVM (Sym) 12.3 (3.3) P-SVM (Out) 8.6 (3.8) P-SVM (In) 7.1 (4.1) University of Texas Size 52 C-SVM (Sym) 17.2 (9.0) C-SVM (Out) 9.5 (5.1) C-SVM (In) 12.6 (4.9) P-SVM (Sym) 15.8 (5.8) P-SVM (Out) 8.1 (7.6) P-SVM (In) 12.3 (5.6) University of Wisconsin Size 77 C-SVM (Sym) 27.0 (10.0) C-SVM (Out) 19.3 (7.5) C-SVM (In) 22.0 (8.6) P-SVM (Sym) 18.7 (4.5) P-SVM (Out) 12.0 (5.5) P-SVM (In) 13.3 (4.4) University of Washington Size 169 C-SVM (Sym) 19.6 (6.8) C-SVM (Out) 10.6 (4.6) C-SVM (In) 20.3 (6.4) P-SVM (Sym) 17.1 (4.4) P-SVM (Out) 10.6 (5.2) P-SVM (In) 11.8 (5.6)
Faculty
Project
Student
60 19.7 (5.3) 15.1 (6.0) 21.4 (4.3) 17.1 (6.2) 14.3 (6.3) 13.7 (6.6)
52 13.7 (4.0) 10.6 (4.9) 14.6 (5.5) 15.4 (6.3) 8.3 (4.9) 10.9 (5.5)
143 50.0 (11.5) 22.3 (10.3) 48.9 (15.5) 19.1 (5.7) 16.9 (7.9) 17.1 (5.7)
35 22.0 (9.1) 16.5 (5.8) 20.6 (7.8) 13.6 (7.3) 9.8 (3.6) 10.5 (6.3)
29 19.8 (6.7) 20.2 (8.7) 20.9 (5.1) 12.2 (6.2) 9.8 (3.9) 9.4 (4.6)
129 53.5 (11.8) 28.9 (11.7) 16.4 (7.6) 25.5 (6.9) 20.9 (6.7) 13.0 (5.0)
36 22.0 (5.5) 16.0 (3.8) 16.3 (5.8) 15.0 (9.3) 11.3 (6.5) 8.7 (8.2)
22 14.0 (6.4) 10.3 (4.8) 7.7 (4.5) 10.0 (5.4) 7.7 (4.2) 6.3 (7.1)
117 49.3 (11.1) 34.3 (10.5) 24.3 (9.9) 34.3 (8.6) 23.7 (4.8) 13.3 (5.9)
44 18.7 (6.8) 14.1 (3.0) 20.4 (5.3) 13.4 (6.6) 12.7 (2.9) 9.2 (6.2)
39 10.6 (3.5) 14.3 (4.8) 13.8 (4.7) 8.8 (2.1) 6.7 (3.4) 6.7 (2.0)
151 43.6 (8.3) 28.2 (9.8) 38.3 (11.9) 20.3 (6.8) 17.1 (4.3) 14.3 (6.9)
Notes: The percentage of misclassification was measured using 10-fold cross validation. The best and second-best results for each data set and classification task are indicated, respectively, in bold and italics; numbers in parentheses denote standard deviations of the trials.
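The construction of the three connectivity matrices described in section 3.1.3 can be sketched as follows (a hypothetical helper of our own; the link list and page indices are illustrative):

```python
import numpy as np

def link_matrices(links, n):
    """Build the "out", "in", and "sym" matrices of section 3.1.3
    from a list of directed links (i, j) among n pages (a sketch)."""
    K_out = np.zeros((n, n))
    for i, j in links:
        K_out[i, j] = 1.0            # at least one outgoing link i -> j
    K_in = K_out.T.copy()            # ingoing links
    K_sym = 0.5 * (K_out + K_in)     # 0, 0.5 (one direction), 1 (both)
    return K_out, K_in, K_sym

# Illustrative link list: pages 0 and 1 link to each other; 2 links to 0.
K_out, K_in, K_sym = link_matrices([(0, 1), (1, 0), (2, 0)], 3)
```

Classifying a column object (page) then amounts to applying the P-SVM with the rows of one of these matrices as measured relations.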
never significantly better than the P-SVM (for details, see http://ni.cs.tu-berlin.de/publications/psvm sup). Classification results are better for the asymmetric matrices "in" and "out" than for the symmetric matrix "sym," because there are cases for which highly indicative pages (hubs) exist that are connected to one particular class of pages by either in- or outgoing links. The symmetric case blurs the contribution of the indicative pages because ingoing and outgoing links can no longer be distinguished, which leads
Table 4: Regression Results for the UCI Data Sets.
         Robot Arm (10⁻³)   Boston Housing   Computer Activity   Abalone
SVR      5.84               10.27 (7.21)     13.80 (0.93)        0.441 (0.021)
BSVR     5.89               12.34 (9.20)     17.59 (0.98)        0.438 (0.023)
P-SVM    5.88               9.42 (4.96)      10.28 (0.47)        0.424 (0.017)
Note: The table shows the mean squared error and its standard deviation in parentheses. Best results for each data set are shown in bold. For the Robot Arm data, only one data set was available, and therefore no standard deviation is given.
to poorer performance. Because the P-SVM yields fewer support vectors, online classification is faster than for the C-SVM, and if web pages cease to exist, the P-SVM is less likely to be affected.

3.2 Application to Regression Problems. In this section we report results for the data sets Robot Arm (2 features), Boston Housing (13 features), Computer Activity (21 features), and Abalone (10 features) from the UCI Benchmark Repository. The data preprocessing is described in Chu, Keerthi, and Ong (2004), and the data sets are available as training and test set pairs at http://www.gatsby.ucl.ac.uk/∼chuwei/data. The sizes of the data sets were (training set/test set): Robot Arm: 200/200, 1 set; Boston Housing: 481/25, 100 sets; Computer Activity: 1000/6192, 10 sets; Abalone: 3000/1177, 10 sets. Pairwise data sets were generated by constructing the Gram matrices for RBF kernels of different widths σ, and the Gram matrices were used as input for the three regression methods: C-support vector regression (SVR; Schölkopf & Smola, 2002), Bayesian support vector regression (BSVR; Chu et al., 2004), and the P-SVM. Hyperparameters (C and σ) were optimized using n-fold cross validation (n = 50 for Robot Arm, n = 20 for Boston Housing, n = 4 for Computer Activity, and n = 4 for Abalone). Parameters were first optimized on a coarse 4 × 4 grid and later refined on a 7 × 7 fine grid around the values for C and σ selected in the first step (65 tests per parameter selection). Table 4 shows the mean squared error and the standard deviation of the results. We also performed a Wilcoxon signed rank test to verify the significance of these results (for details, see http://ni.cs.tu-berlin.de/publications/psvm sup), except for the Robot Arm data set, which has only one training and test set pair, and the Boston Housing data set, which contains too few test examples.
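The coarse-to-fine grid search described above can be sketched as follows (our own illustration; the geometric refinement window is an assumption about a detail the text leaves open, and the quadratic `score` below is a hypothetical stand-in for a cross-validation error):

```python
import itertools
import numpy as np

def coarse_to_fine(score, Cs, sigmas, refine=7, shrink=4.0):
    """Two-stage hyperparameter search: pick the best (C, sigma) on a
    coarse grid, then refine on a finer grid around that point.
    `score(C, sigma)` is assumed to return a cross-validation error."""
    C0, s0 = min(itertools.product(Cs, sigmas), key=lambda p: score(*p))
    fine_C = np.geomspace(C0 / shrink, C0 * shrink, refine)
    fine_s = np.geomspace(s0 / shrink, s0 * shrink, refine)
    return min(itertools.product(fine_C, fine_s), key=lambda p: score(*p))

# Hypothetical smooth score with optimum at C = 10, sigma = 1.
score = lambda C, s: (np.log10(C) - 1.0) ** 2 + np.log10(s) ** 2
best_C, best_s = coarse_to_fine(score,
                                np.geomspace(0.01, 1000, 4),   # coarse 4 x 4 grid
                                np.geomspace(0.1, 10, 4))
```

With a 4 × 4 coarse grid and a 7 × 7 fine grid, this evaluates 16 + 49 = 65 candidate settings, matching the "65 tests per parameter selection" in the text.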
On Computer Activity, SVR was significantly better (5% threshold) than BSVR, and on both data sets, Computer Activity and Abalone, SVR and BSVR were significantly outperformed by the P-SVM, which in addition used fewer support vectors than its competitors.
Table 5: Percentage of Misclassification and Number of Support Features for the Protein Data Set for the P-SVM Method.

ε      H-α          H-β          M            GH
0.2    1.3 (203)    4.9 (203)    0.9 (203)    1.3 (203)
1      2.6 (41)     5.3 (110)    1.3 (28)     4.4 (41)
10     3.5 (10)     8.8 (26)     1.8 (5)      13.3 (7)
20     3.5 (5)      8.4 (12)     4.0 (4)      13.3 (5)
Notes: The total number of features is 226. C was 100. The five columns (left to right) show the values of ε and the results for the four classification problems "one class against the rest" using 10-fold cross validation. The numbers of support features are given in parentheses.
3.3 Application to Feature Selection Problems. In this section we apply the P-SVM to feature selection problems of various kinds, using the correlation threshold regularization scheme (see section 2.3.5). We first reanalyze the Protein and World Wide Web data sets of sections 3.1.2 and 3.1.3 and then report results on three DNA microarray data sets. Further feature selection results can be found in Hochreiter and Obermayer (2005), where results for the Neural Information Processing Systems 2003 feature selection challenge are reported and where the P-SVM was the best stand-alone method for selecting a compact feature set, and in Hochreiter and Obermayer (2004b), where details of the microarray data set benchmarks are reported.

3.3.1 Protein and World Wide Web Data Sets. In this section we again apply the P-SVM to the Protein and World Wide Web data sets of sections 3.1.2 and 3.1.3. Using both regularization schemes simultaneously leads to a trade-off between a small number of features (a small number of measurements) and a better classification result. Reducing the number of features is beneficial if measurements are costly and if a small increase in prediction error can be tolerated. Table 5 shows the results for the Protein data sets for various values of the regularization parameter ε. C was set to 100 because it gave good results for a wide range of ε values. We chose a minimal ε = 0.2 because it resulted in a classifier where all complex features were support vectors. The size of the training set is 203. Note that C was smaller than in the experiments in section 3.1.2 because large values of ε push the dual variables α toward zero and large values of C then have no influence. The table shows that classification performance drops if fewer features are considered, but that 5% of the features suffice to obtain a performance that leads to only about 5% misclassifications, compared to about 2% at the optimum. Since every
Table 6: P-SVM Feature Selection and Classification Results (10-Fold Cross Validation) for the World Wide Web Data Set Cornell and the Classification Problem Student Pages Against the Rest.

Cornell Data Set, Student Pages

ε      % cl.   % mis.   # (%) SVs     |  ε      % cl.   % mis.   # (%) SVs
0.1    84      14       135 (38.6)    |  0.8    65      3.1      34 (9.7)
0.2    81      12       115 (32.8)    |  0.9    64      2.7      32 (9.1)
0.3    79      9.7       99 (28.3)    |  1.0    61      1.4      27 (7.7)
0.4    75      6.9       72 (20.6)    |  1.1    59      1.0      21 (6.0)
0.5    73      5.5       58 (16.6)    |  1.4    56      1.0      12 (3.4)
0.6    71      4.8       48 (13.7)    |  1.6    55      1.0      10 (2.8)
0.7    66      3.9       38 (10.9)    |  2.0    51      0.6       8 (2.3)

Notes: The columns show (left to right) the value of ε, the percentage of classified pages (cl.), the percentage of misclassifications (mis.), and the number (and percentage) of support vectors. C was obtained by a threefold cross validation on the corresponding training sets.
feature value has to be determined by a sequence alignment, this saving in computation time is essential for large databases like the Swiss-Prot database (130,000 proteins), where supplying all pairwise relations is currently impossible.
Table 6 shows the corresponding results (10-fold cross validation) for the P-SVM applied to the World Wide Web data set Cornell and the classification problem "student pages vs. the rest." Only ingoing links (matrix K of section 3.1.3) were used. C was optimized using threefold cross validation on the corresponding training sets for each of the 10-fold cross-validation runs. By increasing the regularization parameter ε, the number of web pages that have to be considered in order to classify a new page (the number of support vectors) decreases from 135 to 8. At the same time, the percentage of pages that can no longer be classified, because they receive no ingoing link from one of the support vector pages, increases. The percentage of misclassifications, however, is reduced from 14% for ε = 0.1 to 0.6% for ε = 2.0. With only 8 pages providing ingoing links, more than 50% of the pages could be classified, with only a 0.6% misclassification rate.
3.3.2 DNA Microarray Data Sets. In this section we describe the application of the P-SVM to DNA microarray data. The data were taken from Pomeroy et al. (2002) (Brain Tumor), Shipp et al. (2002) (Lymphoma), and van't Veer et al. (2002) (Breast Cancer). All three data sets consist of tissue samples from different patients that were characterized by the expression values of a large number of genes. All samples were labeled according to the outcome of a particular treatment (positive or negative), and the task was to predict the outcome of the treatment for a new patient.
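In both settings, the qualitative effect of the correlation-threshold regularizer ε can be mimicked by a much simpler procedure: discard every feature whose absolute correlation with the labels falls below a threshold, so that raising the threshold shrinks the selected set, as in Tables 5 and 6. The following sketch uses made-up data and is only an illustration of that effect, not the actual P-SVM quadratic program:

```python
import numpy as np

def threshold_feature_selection(K, y, eps):
    """Keep features whose absolute correlation with the labels exceeds eps.

    This only mimics the effect of the epsilon regularizer; the real
    P-SVM obtains its sparse feature set from a quadratic program."""
    Kc = (K - K.mean(axis=0)) / (K.std(axis=0) + 1e-12)  # standardize columns
    corr = np.abs(Kc.T @ y) / len(y)                     # |correlation| per feature
    return np.flatnonzero(corr > eps)

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=200)   # hypothetical labels
K = rng.normal(size=(200, 50))          # hypothetical measurement matrix
K[:, 0] += y                            # feature 0: strongly informative
K[:, 1] += 0.5 * y                      # feature 1: weakly informative

small = threshold_feature_selection(K, y, eps=0.05)
large = threshold_feature_selection(K, y, eps=0.4)
# raising eps shrinks the selected feature set
assert len(large) <= len(small)
assert 0 in large
```

With the small threshold many noise features survive by chance; with the large threshold only the genuinely informative ones remain, at the cost of discarding weak signals.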
We compared the classification performance of the following combinations of a feature selection and a classification method:

(1) Expression value of the TrkC gene / one-gene classification
(2) SPLASH (Califano, Stolovitzky, & Tu, 1999) / likelihood ratio classifier (LRC)
(3) Signal-to-noise statistics (STN) / K-nearest neighbor (KNN)
(4) Signal-to-noise statistics (STN) / weighted voting (voting)
(5) Fisher statistics (Fisher) / weighted voting (voting)
(6) R2W2 / R2W2
(7) P-SVM / ν-SVM
The P-SVM/ν-SVM results are taken from Hochreiter and Obermayer (2004b), where the details concerning the data sets and the gene selection procedure based on the P-SVM can be found; the results for the other combinations were taken from the references cited above. All results are summarized in Table 7. The comparison shows that the P-SVM method outperforms the standard methods.
4 Summary
In this contribution we have described the potential support vector machine (P-SVM) as a new method for classification, regression, and feature selection. The P-SVM selects models using the principle of structural risk minimization. In contrast to standard SVM approaches, it is based on a new objective function and a new set of constraints that lead to an expansion of the classification or regression function in terms of support features. The optimization problem is quadratic, always well defined, suited for dyadic data, and requires neither square nor positive-definite Gram matrices. Therefore, the method can also be used without preprocessing with matrices that are measured and with matrices that are constructed from a vectorial representation using an indefinite kernel function. In feature selection mode, the P-SVM allows selecting and ranking features through the support vector weights of its sparse set of support vectors. The sparseness constraint avoids the selection of redundant features. In a classification or regression setting, this is an advantage over statistical methods, where redundant features are often kept as long as they provide information about the objects' attributes. Because the dual formulation of the optimization problem can be solved by a fast SMO technique, the new P-SVM can be applied to data sets with many features. The P-SVM approach was compared with several state-of-the-art classification, regression, and feature selection methods.
Whenever significance tests could be applied, the P-SVM never performed significantly worse than its competitor, and in many cases, it performed significantly better. But even if no significant improvement in prediction error could be found, the P-SVM needed fewer support features, that is, fewer measurements, for evaluating new data objects.
Table 7: Results for the Three DNA Microarray Data Sets Brain Tumor, Lymphoma, and Breast Cancer.

Brain Tumor
Feature Selection/Classification   #F    E
TrkC gene                           1   33
SPLASH/LRC                          –   25
R2W2                                –   25
STN/voting                          –   23
STN/KNN                             8   22
TrkC & SVM & KNN                    –   20
P-SVM/ν-SVM                        45    7

Lymphoma
Feature Selection/Classification   #F    E
STN/KNN                             8   28
STN/voting                         13   24
R2W2                                –   22
P-SVM/ν-SVM                        18   21

Breast Cancer
Feature Selection/Classification   #F    E   ROC area
Fisher/voting                      70   26   0.88
P-SVM/ν-SVM                        30   15   0.77

Notes: The table shows the leave-one-out classification error E (% misclassifications) and the number #F of selected features. For Breast Cancer only, the minimal value of E for different thresholds was available; therefore, the area under a receiver operating curve is provided in addition. Best results are shown in bold.
Finally, we have suggested a new interpretation of dyadic data, where objects in the real world are not described by vectors. Structures like dot products are induced directly through measurements of object pairs, that is, relations between objects. This opens up a new field of research where relations between real-world objects determine mathematical structures.

Appendix: Measurements, Kernels, and Dot Products

In this appendix we address the question under what conditions a measurement kernel that gives rise to a measured matrix K can be interpreted as a dot product between feature vectors describing the row and column objects of a dyadic data set. Assume that column objects x (samples) and row objects z (complex features) are from sets X and Z, both of which can be completed by a σ-algebra and a measure µ to measurable spaces. Let (U, B, µ) be a measurable space with σ-algebra B and a σ-additive measure µ on the set U. We consider functions f : U → R on the set U. A function f is called µ-measurable on (U, B) if f⁻¹([a, b]) ∈ B for all a, b ∈ R, and µ-integrable if ∫_U f dµ < ∞. We define

‖f‖_L²µ := ( ∫_U f² dµ )^(1/2)    (A.1)

and the set

L²µ(U) := { f : U → R ; f is µ-measurable and ‖f‖_L²µ < ∞ }.    (A.2)

L²µ(U) is a Banach space with norm ‖·‖_L²µ. If we define the dot product

⟨f, g⟩_L²µ(U) := ∫_U f g dµ,    (A.3)

then the Banach space L²µ(U) is a Hilbert space with a dot product ⟨·, ·⟩_L²µ(U) and scalar field R. For simplicity, we denote this Hilbert space by L²(U). L²(U1, U2) is the Hilbert space of functions k with ∫_U1 ∫_U2 k²(u1, u2) dµ(u2) dµ(u1) < ∞, using the product measure µ(U1 × U2) = µ(U1) µ(U2). With these definitions, we see that H1 := L²(Z), H2 := L²(X), and H3 := L²(X, Z) are Hilbert spaces of L²-functions with domains Z, X, and X × Z, respectively. The dot product in Hi is denoted by ⟨·, ·⟩_Hi. Note that for discrete X or Z, the respective integrals can be replaced by sums (using a measure of Dirac delta functions at the discrete points; see Werner, 2000, p. 464, example c).
Let us now assume that k ∈ H3. The kernel k induces a Hilbert-Schmidt operator Tk,

f(x) = (Tk α)(x) = ∫_Z k(x, z) α(z) dµ(z),    (A.4)

which maps α ∈ H1 (a parameterization) to f ∈ H2 (a classifier). If we set µ(z) = Σ_{j=1}^{P} δ_{z_j}, we recover the P-SVM classification function (without b), equation 2.26,

f(u) = Σ_{j=1}^{P} α_j k(u, z_j) = Σ_{j=1}^{P} α_j K(u)_j,    (A.5)
where α_j = α(z_j) and δ_{z_j} is the Dirac delta function at location z_j. Theorem 1 provides the conditions a kernel must fulfill in order to be interpretable as a dot product between the objects' feature vectors:

Theorem 1 (Singular Value Expansion). Let α be from H1 and let k be a kernel from H3, which defines a Hilbert-Schmidt operator Tk : H1 → H2:

(Tk α)(x) = f(x) = ∫_Z k(x, z) α(z) dz.    (A.6)

Then

‖f‖²_H2 = ⟨Tk* Tk α, α⟩_H1,    (A.7)

where Tk* is the adjoint operator of Tk, and there exists an expansion

k(x, z) = Σ_n s_n e_n(z) g_n(x)    (A.8)

that converges in the L²-sense. The s_n ≥ 0 are the singular values of Tk, and e_n ∈ H1, g_n ∈ H2 are the corresponding orthonormal functions. For X = Z and a symmetric, positive-definite kernel k, we obtain the eigenfunctions e_n = g_n of Tk with corresponding eigenvalues s_n.

Proof. From f = Tk α we obtain

‖f‖²_H2 = ⟨Tk α, Tk α⟩_H2 = ⟨Tk* Tk α, α⟩_H1.    (A.9)
The singular value expansion of Tk is

Tk α = Σ_n s_n ⟨α, e_n⟩_H1 g_n    (A.10)

(see Werner, 2000, theorem VI.3.6). The values s_n are the singular values of Tk for the orthonormal systems {e_n} on H1 and {g_n} on H2. We define

r_nm := ⟨Tk e_n, g_m⟩_H2 = δ_nm s_m,    (A.11)

where the last "=" results from equation A.10 for α = e_n. The sum

Σ_m r²_nm = Σ_m (⟨Tk e_n, g_m⟩_H2)² ≤ ‖Tk e_n‖²_H2 < ∞    (A.12)

converges because of Bessel's inequality (the ≤ sign). Next, we complete the orthonormal system (ONS) {e_n} to an orthonormal basis (ONB) {ẽ_l} by adding an ONB of the kernel ker(Tk) of the operator Tk to the ONS {e_n}. The function α ∈ H1 possesses a unique representation through this basis: α = Σ_l ⟨α, ẽ_l⟩_H1 ẽ_l. We obtain

Tk α = Σ_l ⟨α, ẽ_l⟩_H1 Tk ẽ_l,    (A.13)

where we used that Tk is continuous. Because Tk ẽ_l = 0 for all ẽ_l ∈ ker(Tk), the image Tk α can be expressed through the ONS {e_n}:

Tk α = Σ_n ⟨α, e_n⟩_H1 Tk e_n
     = Σ_n Σ_m ⟨α, e_n⟩_H1 ⟨Tk e_n, g_m⟩_H2 g_m
     = Σ_{n,m} r_nm ⟨α, e_n⟩_H1 g_m.    (A.14)

Here we used the fact that {g_m} is an ONB of the range of Tk, and therefore Tk e_n = Σ_m ⟨Tk e_n, g_m⟩_H2 g_m. Because the set of functions {e_n g_m} is an ONS in H3 (which can be completed to an ONB) and Σ_{n,m} r²_nm < ∞ (cf. equation A.12), the kernel

k̃(z, x) := Σ_{n,m} r_nm e_n(z) g_m(x)    (A.15)
is from H3. We observe that the induced Hilbert-Schmidt operator Tk̃ is equal to Tk,

(Tk̃ α)(x) = Σ_{n,m} r_nm ⟨α, e_n⟩_H1 g_m(x) = (Tk α)(x),    (A.16)

where the first equals sign follows from equation A.15 and the second equals sign from equation A.14. Hence, the kernel k and the kernel k̃ are equal except for a set with zero measure: k =µ k̃. We obtain ⟨Tk e_l, g_t⟩_H2 = δ_lt s_l from equation A.11 and ⟨Tk e_l, g_t⟩_H2 = r_lt from equation A.16, and therefore r_lt = δ_lt s_l. Inserting r_nm = δ_nm s_n into equation A.15 proves equation A.8 in the theorem. The last statement of the theorem follows from the fact that |Tk| = (Tk* Tk)^(1/2) = Tk (Tk is positive and self-adjoint), and therefore e_n = g_n (Werner, 2000, proof of theorem VI.3.6).

As a consequence of this theorem, we can for finite Z define a mapping ω of row objects z and a mapping φ of column objects x into a common feature space where k is a dot product:

φ(x) := (s1 g1(x), s2 g2(x), . . .),
ω(z) := (e1(z), e2(z), . . .),
⟨φ(x), ω(z)⟩ = Σ_n s_n e_n(z) g_n(x) = k(z, x).    (A.17)

Note that finite Z ensures that ⟨ω(z), ω(z)⟩ converges. From equation A.14, we obtain for the classification or regression function

f(x) = Σ_n s_n ⟨α, e_n⟩_H1 g_n(x).    (A.18)
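For finite sets X and Z (the dyadic case used throughout the article), the operator Tk is just the measured matrix K, and the expansion of equation A.8 reduces to the ordinary singular value decomposition. The following small numerical sketch uses made-up data; the kernel k(x, z) = sin(⟨x, z⟩) is chosen because it is indefinite and yields a non-square matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical column objects x (L samples) and row objects z (P features)
X = rng.normal(size=(6, 3))   # L = 6
Z = rng.normal(size=(4, 3))   # P = 4

# an indefinite, non-square "measurement" kernel: k(x, z) = sin(<x, z>)
K = np.sin(X @ Z.T)           # shape (L, P)

# singular value expansion: K = G diag(s) E^T, i.e.
# k(x_i, z_j) = sum_n s_n g_n(x_i) e_n(z_j)
G, s, Et = np.linalg.svd(K, full_matrices=False)
K_rebuilt = (G * s) @ Et
assert np.allclose(K, K_rebuilt)
assert np.all(s >= 0)

# {g_n} and {e_n} are orthonormal systems, as stated in theorem 1
assert np.allclose(G.T @ G, np.eye(len(s)), atol=1e-10)
assert np.allclose(Et @ Et.T, np.eye(len(s)), atol=1e-10)
```

The columns of G and the rows of Et play the roles of the orthonormal systems {g_n} and {e_n}, and the s_n are nonnegative, exactly as the theorem states.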
It is well defined because sets of zero measure vanish through integration in equation A.4, which is confirmed through the expansion of equation A.18, where the zero measure is absorbed in the terms ⟨α, e_n⟩_H1. To obtain absolute and uniform convergence of the sum for f(x), we must enforce ‖k(x, ·)‖²_H1 ≤ K², as can be seen in corollary 1:

Corollary 1 (Linear Classification in ℓ²). Let the assumptions of theorem 1 hold, and let ∫_Z (k(x, z))² dz ≤ K² for all x ∈ X. We define w := (⟨α, e1⟩_H1, ⟨α, e2⟩_H1, . . .) and φ(x) := (s1 g1(x), s2 g2(x), . . .). Then w, φ(x) ∈ ℓ², where ‖w‖²_ℓ² ≤ ‖α‖²_H1 and ‖φ(x)‖²_ℓ² ≤ K², and the following sum converges absolutely and uniformly:

f(x) = ⟨w, φ(x)⟩_ℓ² = Σ_n s_n ⟨α, e_n⟩_H1 g_n(x).    (A.19)
Proof. First, we show that φ(x) ∈ ℓ²:

‖φ(x)‖²_ℓ² = Σ_n (s_n g_n(x))² = Σ_n ((Tk e_n)(x))²
           = Σ_n (⟨k(x, ·), e_n⟩_H1)² ≤ ‖k(x, ·)‖²_H1
           ≤ sup_{x∈X} { ∫_Z (k(x, z))² dz } ≤ K²,    (A.20)

where we used Bessel's inequality for the first ≤, the supremum over x ∈ X for the second ≤ (the supremum exists because { ∫_Z (k(x, z))² dz } is a bounded subset of R), and the assumption of the corollary for the last ≤. To prove ‖w‖²_ℓ² ≤ ‖α‖²_H1, we again use Bessel's inequality:

‖w‖²_ℓ² = Σ_n (⟨α, e_n⟩_H1)² ≤ ‖α‖²_H1.    (A.21)
Finally, we prove that the sum

f(x) = ⟨w, φ(x)⟩_ℓ² = Σ_n s_n ⟨α, e_n⟩_H1 g_n(x)    (A.22)

converges absolutely and uniformly. The fact that the sum converges in the L² sense follows directly from the singular value expansion of theorem 1. We now choose an m ∈ N with

Σ_{n=m}^{∞} (⟨α, e_n⟩_H1)² ≤ ε²/K²    (A.23)

for ε > 0 (because of equation A.21, such an m exists), and we apply the Cauchy-Schwarz inequality:

Σ_{n=m}^{∞} |s_n ⟨α, e_n⟩_H1 g_n(x)| ≤ ( Σ_{n=m}^{∞} (s_n g_n(x))² )^(1/2) ( Σ_{n=m}^{∞} (⟨α, e_n⟩_H1)² )^(1/2)
                                    ≤ K · (ε/K) = ε,    (A.24)

where we used inequalities A.20 and A.23. Because m is independent of x, the convergence is absolute and uniform, too. Equation A.4 or, equivalently, A.19, is a linear classification or regression function in ℓ². We find that the expansion of the classifier f converges absolutely and uniformly and therefore that f is continuous. This can be seen
because the e_n are eigenfunctions of the compact, positive, self-adjoint operator (Tk* Tk)^(1/2) and the g_n are isometric images of the e_n (see Werner, 2000, theorem VI.3.6 and the text before theorem VI.4.2). Hence, the orthonormal functions g_n are continuous. We also justified the analysis in equation 2.27 and derivatives of g_m with respect to x.

Relation of the theoretical considerations to the P-SVM. Using µ(x) = Σ_{i=1}^{L} δ_{x^i}, µ(z) = Σ_{j=1}^{P} δ_{z^j}, and α_j := α(z^j), we obtain

f(x) = Σ_{j=1}^{P} α_j k(x, z^j) = ⟨ φ(x), Σ_{j=1}^{P} α_j ω(z^j) ⟩,
X_φ = ( φ(x¹), φ(x²), . . . , φ(x^L) ),
Z_ω = ( ω(z¹), ω(z²), . . . , ω(z^P) ),
w = Σ_{j=1}^{P} α_j ω(z^j)  (expansion into support vectors),
K_ij = ⟨ φ(x^i), ω(z^j) ⟩ = Σ_n s_n e_n(z^j) g_n(x^i) = k(x^i, z^j),
K = X_φᵀ Z_ω, and
‖f‖²_H2 = αᵀ Kᵀ K α = ‖X_φᵀ w‖²  (the objective function).    (A.25)
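The chain of identities in equation A.25 can be verified numerically in the finite setting (a sketch with random example data; here Phi and Omega collect the feature maps φ(x^i) and ω(z^j) obtained from the SVD of a measured matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
L, P = 6, 4
# a "measured", indefinite, non-square matrix K
K = np.sin(rng.normal(size=(L, 3)) @ rng.normal(size=(3, P)))

G, s, Et = np.linalg.svd(K, full_matrices=False)
Phi = G * s          # row i: phi(x^i) = (s_1 g_1(x^i), s_2 g_2(x^i), ...)
Omega = Et.T         # row j: omega(z^j) = (e_1(z^j), e_2(z^j), ...)
assert np.allclose(Phi @ Omega.T, K)   # K_ij = <phi(x^i), omega(z^j)>

alpha = rng.normal(size=P)
w = Omega.T @ alpha                    # w = sum_j alpha_j omega(z^j)
# the objective function: alpha^T K^T K alpha = ||X_phi^T w||^2
# (rows of Phi are the columns of X_phi, so X_phi^T w = Phi @ w)
lhs = alpha @ K.T @ K @ alpha
rhs = np.sum((Phi @ w) ** 2)
assert np.isclose(lhs, rhs)
```

Both sides equal ‖Kα‖², which is exactly why the dual problem can be solved without ever constructing the feature space explicitly.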
Note that in the P-SVM formulation, w is not unique with respect to the zero subspace of the matrix X_φ. Here we obtain the related result that w is not unique with respect to the subspace mapped to the zero function by Tk. Interestingly, we recovered the new objective function, equation 2.4, as the L²-norm ‖f‖²_H2 of the classification function. This again motivates the use of the new objective function as a capacity measure. We also find that the primal problem of the P-SVM, equation 2.21, corresponds to the formulation in H2, while the dual, equation 2.25, corresponds to the formulation in H1. Primal and dual P-SVM formulations can be transformed into each other by the property ⟨Tk α, Tk α⟩_H2 = ⟨Tk* Tk α, α⟩_H1.

Acknowledgments

We thank M. Albery-Speyer, C. Büscher, C. Minoux, R. Sanyal, A. Paus, and S. Seo for their help with the numerical simulations. This work was funded by the Anna-Geissler-Stiftung and the Monika-Kuntzner-Stiftung, and by the BMWA under project no. 10024213.
References

Ahlgren, P., Jarneving, B., & Rousseau, R. (2003). Requirements for a cocitation similarity measure with special reference to Pearson's correlation coefficient. Journal of the American Society for Information Science and Technology, 54, 550–560.
Bains, W., & Smith, G. (1988). A novel method for nucleic acid sequence determination. Journal of Theoretical Biology, 135, 303–307.
Bayer, A. E., Smart, J. C., & McLaughlin, G. W. (1990). Mapping intellectual structure of a scientific subfield through author cocitations. Journal of the American Society for Information Science, 41(6), 444–452.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57(1), 289–300.
Benjamini, Y., & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4), 1165–1188.
Blum, A., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97, 245–271.
Califano, A., Stolovitzky, G., & Tu, Y. (1999). Analysis of gene expression microarrays for phenotype classification. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (pp. 75–85). Menlo Park, CA: AAAI Press.
Chu, W., Keerthi, S. S., & Ong, C. J. (2004). Bayesian support vector regression using a unified loss function. IEEE Transactions on Neural Networks, 15(1), 29–44.
Cremer, T., Kurz, A., Zirbel, R., Dietzel, S., Rinke, B., Schröck, E., Speicher, M. R., Mathieu, U., Jauch, J., Emmerich, P., Scherthan, H., Ried, T., Cremer, C., & Lichter, P. (1993). Role of chromosome territories in the functional compartmentalization of the cell nucleus. Cold Spring Harbor Symp. Quant. Biol., 58, 777–792.
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10, 1895–1923.
Drmanac, R., Labat, I., Brukner, I., & Crkvenjakov, R. (1989). Sequencing of megabase plus DNA by hybridization: Theory of the method. Genomics, 4, 114–128.
Falquet, L., Pagni, M., Bucher, P., Hulo, N., Sigrist, C. J., Hofmann, K., & Bairoch, A. (2002). The PROSITE database, its status in 2002. Nucleic Acids Research, 30, 235–238.
Graepel, T., Herbrich, R., Bollmann-Sdorra, P., & Obermayer, K. (1999). Classification on pairwise proximity data. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 438–444). Cambridge, MA: MIT Press.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In A. Smola, P. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers. Cambridge, MA: MIT Press.
Heyer, L. J., Kruglyak, S., & Yooseph, S. (1999). Exploring expression data: Identification and analysis of coexpressed genes. Genome Research, 11, 1106–1115.
Hochreiter, S., Mozer, M. C., & Obermayer, K. (2003). Coulomb classifiers: Generalizing support vector machines via an analogy to electrostatic systems. In S. Becker,
S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 545–552). Cambridge, MA: MIT Press.
Hochreiter, S., & Obermayer, K. (2004a). Classification, regression, and feature selection on matrix data (Tech. Rep. No. 2004/2). Berlin: Technische Universität Berlin, Fakultät für Elektrotechnik und Informatik.
Hochreiter, S., & Obermayer, K. (2004b). Gene selection for microarray data. In B. Schölkopf, K. Tsuda, & J.-P. Vert (Eds.), Kernel methods in computational biology (pp. 319–355). Cambridge, MA: MIT Press.
Hochreiter, S., & Obermayer, K. (2004c). Sphered support vector machine (Tech. Rep.). Berlin: Technische Universität Berlin, Fakultät für Elektrotechnik und Informatik.
Hochreiter, S., & Obermayer, K. (2005). Nonlinear feature selection with the potential support vector machine. In I. Guyon, S. Gunn, M. Nikravesh, & L. Zadeh (Eds.), Feature extraction, foundations and applications. Berlin: Springer.
Hoff, P. D. (2005). Bilinear mixed-effects models for dyadic data. Journal of the American Statistical Association, 100(469), 286–295.
Hofmann, T., & Buhmann, J. M. (1997). Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), 1–25.
Hofmann, T., & Puzicha, J. (1998). Unsupervised learning from dyadic data (Tech. Rep. No. TR-98-042). Berkeley, CA: International Computer Science Institute.
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the Association for Computing Machinery, 46(5), 604–632.
Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2), 273–324.
Li, H., & Loken, E. (2002). A unified theory of statistical analysis and inference for variance component models for dyadic data. Statistica Sinica, 12, 519–535.
Lipman, D., & Pearson, W. (1985). Rapid and sensitive protein similarity searches. Science, 227, 1435–1441.
Lu, Q., Wallrath, L. L., & Elgin, S. C. R. (1994). Nucleosome positioning and gene regulation. Journal of Cellular Biochemistry, 55, 83–92.
Lysov, Y., Florent'ev, V., Khorlin, A., Khrapko, K., Shik, V., & Mirzabekov, A. (1988). DNA sequencing by hybridization with oligonucleotides. Doklady Academy Nauk USSR, 303, 1508–1511.
Mangasarian, O. L. (1998). Generalized support vector machines (Tech. Rep. No. 98-14). Madison: Computer Sciences Department, University of Wisconsin.
Mazza, C. B., Sukumar, N., Breneman, C. M., & Cramer, S. M. (2001). Prediction of protein retention in ion-exchange systems using molecular descriptors obtained from crystal structure. Anal. Chem., 73, 5457–5461.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K.-R. (1999). Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, & S. Douglas (Eds.), Neural networks for signal processing, 9 (pp. 41–48). Piscataway, NJ: IEEE.
Pomeroy, S. L., Tamayo, P., Gaasenbeek, M., Sturla, L. M., Angelo, M., McLaughlin, M. E., Kim, J. Y., Goumnerova, L. C., Black, P. M., Lau, C., Allen, J. C., Zagzag, D., Olson, J. M., Curran, T., Wetmore, C., Biegel, J. A., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D. N., Mesirov, J. P., Lander, E. S., & Golub, T. R. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870), 436–442.
Rätsch, G., Onoda, T., & Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning, 42(3), 287–320.
Salton, G. (1968). Automatic information organization and retrieval. New York: McGraw-Hill.
Scherf, U., Ross, D. T., Waltham, M., Smith, L. H., Lee, J. K., Tanabe, L., Kohn, K. W., Reinhold, W. C., Myers, T. G., Andrews, D. T., Scudiero, D. A., Eisen, M. B., Sausville, E. A., Pommier, Y., Botstein, D., Brown, P. O., & Weinstein, J. N. (2000). A gene expression database for the molecular pharmacology of cancer. Nature Genetics, 24(3), 236–244.
Schölkopf, B., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Generalization bounds via eigenvalues of the Gram matrix (Tech. Rep. No. NC2-TR-1999-035). London: Royal Holloway, University of London.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.
Shawe-Taylor, J., Bartlett, P. L., Williamson, R., & Anthony, M. (1996). A framework for structural risk minimization. In Proceedings of the 9th Annual Conference on Computational Learning Theory (pp. 68–76). New York: Association for Computing Machinery.
Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1926–1940.
Shipp, M. A., Ross, K. N., Tamayo, P., Weng, A. P., Kutok, J. L., Aguiar, R. C. T., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G. S., Ray, T. S., Koval, M. A., Last, K. W., Norton, A., Lister, T. A., Mesirov, J., Neuberg, D. S., Lander, E. S., Aster, J. C., & Golub, T. R. (2002). Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8(1), 68–74.
Sigrist, C. J., Cerutti, L., Hulo, N., Gattiker, A., Falquet, L., Pagni, M., Bairoch, A., & Bucher, P. (2002). PROSITE: A documented database using patterns and profiles as motif descriptors. Brief Bioinformatics, 3, 265–274.
Southern, E. (1988). United Kingdom patent application GB8810400.
van't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., & Friend, S. H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Werner, D. (2000). Funktionalanalysis (3rd ed.). Berlin: Springer.
White, H. D., & McCain, K. W. (1989). Bibliometrics. Annual Review of Information Science and Technology, 24, 119–186.
Received March 12, 2004; accepted July 15, 2005.
NOTE
Communicated by Alexandre Pouget
Optimal Neuronal Tuning for Finite Stimulus Spaces W. Michael Brown [email protected] Computational Biology, Sandia National Laboratories, Albuquerque, NM, 87123, U.S.A.
Alex Bäcker [email protected] Computational Biology, Sandia National Laboratories, Albuquerque, NM 87123, and Division of Biology, California Institute of Technology, Pasadena, CA 91125, U.S.A.
The efficiency of neuronal encoding in sensory and motor systems has been proposed as a first principle governing response properties within the central nervous system. We present a continuation of a theoretical study presented by Zhang and Sejnowski, where the influence of neuronal tuning properties on encoding accuracy is analyzed using information theory. When a finite stimulus space is considered, we show that the encoding accuracy improves with narrow tuning for one- and two-dimensional stimuli. For three dimensions and higher, there is an optimal tuning width.

Neural Computation 18, 1511–1526 (2006)
© 2006 Massachusetts Institute of Technology

1 Introduction

The potential impact of coding efficiency on neuronal response properties within the central nervous system was first proposed by Attneave (1954) and has since been studied using both theoretical and experimental approaches. The issue of optimal neuronal tuning widths has received much attention in the recent literature. Empirical examples of both finely tuned receptive fields (Kuffler, 1953; Lee, 1999) and broadly tuned neurons (Georgopoulos, Schwartz, & Kettner, 1986; Knudsen & Konishi, 1978) have been found. Theoretical arguments have also been made for both sharp (Barlow, 1972; Lettvin, Maturana, McCulloch, & Pitts, 1959) and broad (Baldi & Heiligenberg, 1988; Eurich & Schwegler, 1997; Georgopoulos et al., 1986; Hinton, McClelland, & Rumelhart, 1986; Salinas & Abbott, 1994; Seung & Sompolinsky, 1993; Snippe, 1996; Snippe & Koenderink, 1992) tuning curves as a means to increase encoding accuracy. Using Fisher information, Zhang and Sejnowski (1999) offered an intriguing solution where the choice of narrow or broad tuning curves depends on the dimensionality of the stimulus space. They found that for one dimension the encoding accuracy increases with decreasing tuning width, and that for two dimensions the encoding accuracy is independent of the tuning width. For three dimensions and higher, the results suggest that encoding accuracy should increase with increasing tuning width. The result, which is widely cited in works on neuronal encoding, offers a universal scaling rule for all radially symmetric tuning functions. However, this scaling rule is highly unintuitive in that, for more than two dimensions, it predicts optimal encoding accuracy for infinite tuning widths, that is, tuning widths for which neurons have no discrimination power and all neurons are indistinguishable from each other. In this note, we analyze this effect and show that when a finite stimulus space is considered, there is an optimal tuning width (in terms of Fisher information) for all stimulus dimensionalities.

2 Fisher Information

2.1 Fisher Information for an Infinite Stimulus Space. The Cramér-Rao inequality gives a lower bound for the variance of any unbiased estimator (Cover & Thomas, 1991) and is useful for studying neuronal encoding accuracy in that it represents the minimum mean-squared reconstruction error that can be achieved by any decoding strategy (Seung & Sompolinsky, 1993). Let x = (x1, x2, . . . , xD) be a vector describing a D-dimensional stimulus. The Cramér-Rao bound is then given by

v(x) ≥ J⁻¹(x),    (2.1)

where v is the covariance matrix of a set of unbiased estimators for x, J is the Fisher information matrix, and the matrix inequality is given in the sense that v(x) − J⁻¹(x) must be a nonnegative definite matrix (Cover & Thomas, 1991). For an encoding variable representing neuronal firing rates, the Fisher information matrix for a neuron is given by

J_ij(x) = E[ (∂/∂x_i) ln P[n | x, τ] · (∂/∂x_j) ln P[n | x, τ] ],    (2.2)

where E represents the expectation value over the probability distribution P[n | x, τ] for firing n spikes at stimulus x within a time window τ. For multiple neurons with independent spiking, the total Fisher information for N neurons is given by the sum

J(x) = Σ_{a=1}^{N} J_a(x).    (2.3)
If the neurons are restricted to radial symmetric tuning functions and distributed identically throughout the stimulus space such that the
distributions of estimation errors in each dimension are identical and uncorrelated, the Fisher information matrix becomes diagonal (Zhang & Sejnowski, 1999), and the total Fisher information reduces to

J(x) = Σ_{a=1}^{N} Σ_{i=1}^{D} E[ ( (∂/∂x_i) ln P_a[n | x, τ] )² ].    (2.4)
For homogeneous Poisson spike statistics,

P[n | x, τ] = ( (τ · f(x))^n / n! ) exp(−τ f(x)),    (2.5)
where f(x) describes the mean firing rate of the neuron with respect to the stimulus, or the neuronal tuning function. Equation 2.4 then becomes

J(x) = τ Σ_{a=1}^{N} Σ_{i=1}^{D} (1 / f_a(x)) ( ∂f_a(x)/∂x_i )².    (2.6)
For a gaussian tuning function,

f(x) = F exp( −(1/(2σ²)) Σ_{i=1}^{D} (x_i − c_i)² ),    (2.7)
where F represents the mean peak firing rate, c = (c1, c2, . . . , cD) represents the preferred stimulus of the neuron, and σ represents the tuning width parameter for the neuron. Substitution into equation 2.6 gives

J(x) = (Fτ/σ⁴) Σ_{a=1}^{N} [ Σ_{i=1}^{D} (x_i − c_{a,i})² ] exp( −(1/(2σ²)) Σ_{i=1}^{D} (x_i − c_{a,i})² ).    (2.8)
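Equation 2.8 can be sanity-checked against equation 2.6 evaluated with numerical derivatives of the gaussian tuning function (a sketch; F, τ, the stimulus point, and the tuning centers are arbitrary example values, not taken from the paper):

```python
import numpy as np

def gaussian_rate(x, c, sigma, F=50.0):
    """Gaussian tuning function, equation 2.7."""
    return F * np.exp(-np.sum((x - c) ** 2) / (2.0 * sigma ** 2))

def fisher_closed_form(x, centers, sigma, F=50.0, tau=1.0):
    """Total Fisher information, equation 2.8 (Poisson spiking, gaussian tuning)."""
    d2 = np.sum((x - centers) ** 2, axis=1)
    return F * tau / sigma ** 4 * np.sum(d2 * np.exp(-d2 / (2 * sigma ** 2)))

def fisher_numeric(x, centers, sigma, F=50.0, tau=1.0, h=1e-5):
    """Equation 2.6 with central-difference derivatives of f."""
    total = 0.0
    for c in centers:
        for i in range(len(x)):
            xp, xm = x.copy(), x.copy()
            xp[i] += h
            xm[i] -= h
            df = (gaussian_rate(xp, c, sigma, F) - gaussian_rate(xm, c, sigma, F)) / (2 * h)
            total += tau * df ** 2 / gaussian_rate(x, c, sigma, F)
    return total

rng = np.random.default_rng(3)
x = np.array([0.3, 0.7])                       # example 2D stimulus
centers = rng.uniform(0, 1, size=(20, 2))      # 20 example preferred stimuli
sigma = 0.2
assert np.isclose(fisher_closed_form(x, centers, sigma),
                  fisher_numeric(x, centers, sigma), rtol=1e-4)
```

The agreement follows from ∂f/∂x_i = −((x_i − c_i)/σ²) f, so that (1/f)(∂f/∂x_i)² = ((x_i − c_i)²/σ⁴) f, which summed over i and a gives exactly equation 2.8.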
Assuming that the preferred stimuli of the neurons are uniformly distributed throughout the stimulus space, the average Fisher information per neuron for an infinite stimulus space can be found by replacing the summation with an integral (Zhang & Sejnowski, 1999):

⟨J⟩ = ∫_{−∞}^{∞} J₁(x) dx1 . . . dxD,    (2.9)
where J₁(x) represents the Fisher information for a single neuron rather than the total as given in equation 2.4. Under these assumptions, we can make the replacement ξ = x − c:

⟨J⟩ = ∫_{−∞}^{∞} J₁(ξ) dξ1 . . . dξD.    (2.10)
For Poisson spike statistics and gaussian tuning (see equation 2.8),

⟨J⟩ = (Fτ/σ⁴) [ −σ² ξ exp( −ξ²/(2σ²) ) + σ³ √(π/2) erf( ξ/(√2 σ) ) ]_{−∞}^{∞},    (2.11)
for D = 1. Here, the gaussian error function is given by

erf(b) = (2/√π) ∫₀^{b} exp(−t²) dt.    (2.12)
If we assume that the stimulus space (integration interval) is infinite and that σ is finite, equation 2.11 reduces to

⟨J⟩_{1D} = Fτ√(2π) / σ.    (2.13)
Due to symmetry with respect to different dimensions, equation 2.13 can be generalized to any dimensionality to give the result reported by Zhang and Sejnowski (1999):

\bar{J} = F D \tau \, (2\pi)^{D/2} \sigma^{D-2}.   (2.14)
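Because the integrand in equation 2.9 factorizes across dimensions for gaussian tuning, the D-dimensional result can be assembled from two one-dimensional integrals. The sketch below (plain Python with a simple Simpson rule over a wide interval standing in for the infinite stimulus space; F = τ = 1 and σ = 0.5 are arbitrary test values) checks the σ^{D−2} scaling of equation 2.14:

```python
import math

def simpson(g, a, b, n=2000):
    """Composite Simpson rule on [a, b] with n (even) intervals."""
    h = (b - a) / n
    s = g(a) + g(b)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * g(a + k * h)
    return s * h / 3.0

def avg_fisher_infinite(sigma, D, F=1.0, tau=1.0, half_width=10.0):
    """Equation 2.9 for gaussian tuning, via the factorization
    J = F * tau * D * B * A**(D-1) / sigma**4, where
    A = integral of exp(-t^2 / 2 sigma^2) and
    B = integral of t^2 exp(-t^2 / 2 sigma^2)."""
    w = half_width * sigma   # +/- 10 sigma approximates the infinite space
    A = simpson(lambda t: math.exp(-t * t / (2 * sigma ** 2)), -w, w)
    B = simpson(lambda t: t * t * math.exp(-t * t / (2 * sigma ** 2)), -w, w)
    return F * tau * D * B * A ** (D - 1) / sigma ** 4

for D in (1, 2, 3):
    sigma = 0.5
    closed = D * (2 * math.pi) ** (D / 2) * sigma ** (D - 2)
    print(D, avg_fisher_infinite(sigma, D), closed)  # numeric matches closed form
```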
The Fisher information based on equation 2.14 as a function of tuning width is shown in Figure 1. Although we have shown the Fisher information per neuron as an average, it is actually the exact Fisher information for each neuron because of the assumption of homogeneous tuning widths and an infinite stimulus space. Therefore, the total Fisher information can be found by multiplying the Fisher information per neuron by the number of neurons to get Fisher information across the stimulus space that is independent of x. Equation 2.14 does not describe the influence of tuning width on encoding accuracy in the limit as σ approaches infinity, however, and therefore is unusable when the tuning width is large relative to the stimulus space. This is relevant, since for D > 2, equation 2.14 predicts optimal tuning widths to be infinite. Furthermore, when using neuronal firing rate as an encoding variable, this becomes relevant in that for a finite number of neurons with
Figure 1: Scaling rule for the average Fisher information per neuron as a function of tuning width for different stimulus dimensionalities. Dashed lines represent the Fisher information calculated when the stimulus space is infinite (see equation 2.14), and solid lines represent calculations for a finite stimulus space (see equation 2.15). In this plot, the Fisher information is divided by the peak firing rate (F ) and the time window (τ ).
finite firing rates and a finite decoding time, the range for the stimulus must be finite. That is, with discrete spiking events, there is a range of stimulus space beyond which no Fisher information can be conveyed within a reasonable decoding time. 2.2 Fisher Information for a Finite Stimulus Space. In order to study Fisher information within a finite stimulus space, we begin by considering a stimulus range normalized to lie in the inclusive range between 0 and 1. In this case, the tuning width is expressed in terms of a fraction of the finite stimulus space. If we consider an infinite number of neurons with preferred stimuli evenly distributed across the finite stimulus space, the average Fisher information per neuron is given (for radial symmetric tuning functions) by
\bar{J}_f = \int_{0}^{1} dc_1 \cdots dc_D \int_{0}^{1} J_1(x, c) \, dx_1 \cdots dx_D,   (2.15)
where J_1(x, c) is the Fisher information at x for the neuron with preferred stimulus c. Here,

J_1(x, c) = \frac{F \tau}{\sigma^4} \left[ \sum_{i=1}^{D} (x_i - c_i)^2 \right] \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{D} (x_i - c_i)^2 \right).   (2.16)
For the one-dimensional case with Poisson spiking statistics and gaussian tuning (see equation 2.8), the average Fisher information per neuron is given by

\bar{J}_{f1D} = F \tau \left[ \frac{\sqrt{2\pi}}{\sigma} \, \mathrm{erf}\left( \frac{\sqrt{2}}{2\sigma} \right) + 4 \exp\left( -\frac{1}{2\sigma^2} \right) - 4 \right],   (2.17)

for the two-dimensional case by

\bar{J}_{f2D} = 4 F \tau \left[ \pi \, \mathrm{erf}^2\left( \frac{\sqrt{2}}{2\sigma} \right) + 3\sqrt{2\pi}\,\sigma \, \mathrm{erf}\left( \frac{\sqrt{2}}{2\sigma} \right) \left( \exp\left( -\frac{1}{2\sigma^2} \right) - 1 \right) + 4\sigma^2 \left( 1 - 2\exp\left( -\frac{1}{2\sigma^2} \right) + \exp\left( -\frac{1}{\sigma^2} \right) \right) \right],   (2.18)

and for the three-dimensional case by

\bar{J}_{f3D} = 12 F \tau \sigma \left[ \frac{\sqrt{2}}{2} \pi^{3/2} \, \mathrm{erf}^3\left( \frac{\sqrt{2}}{2\sigma} \right) - 4\pi\sigma \, \mathrm{erf}^2\left( \frac{\sqrt{2}}{2\sigma} \right) \left( 1 - \exp\left( -\frac{1}{2\sigma^2} \right) \right) - 5\sqrt{2\pi}\,\sigma^2 \, \mathrm{erf}\left( \frac{\sqrt{2}}{2\sigma} \right) \left( 2\exp\left( -\frac{1}{2\sigma^2} \right) - \exp\left( -\frac{1}{\sigma^2} \right) - 1 \right) + 4\sigma^3 \left( 3\exp\left( -\frac{1}{2\sigma^2} \right) - 3\exp\left( -\frac{1}{\sigma^2} \right) + \exp\left( -\frac{3}{2\sigma^2} \right) - 1 \right) \right].   (2.19)

The influence of tuning width on the average Fisher information per neuron is plotted in Figure 1 for the first D = 1–3 dimensionalities. For one and two dimensions, the average Fisher information, and thus encoding accuracy, increases with decreasing tuning width. For three dimensions, there is an optimal tuning width in terms of Fisher information, given by the σ at which dJ/dσ = 0 (approximately 22.541% of the stimulus space). For higher dimensionalities, optimal tuning widths also exist. This can be seen by the fact that as the tuning width approaches infinity, the derivative of the probability of firing a given number of spikes with respect to a
stimulus goes to zero. For the example presented here,

\lim_{\sigma \to \infty} \frac{\partial}{\partial x_k} f(x) = \lim_{\sigma \to \infty} \frac{F (c_k - x_k)}{\sigma^2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{D} (x_i - c_i)^2 \right) = 0,   (2.20)

for any k from 1 to D. In this limit, the tuning function is independent of x, the probability distribution for spike firing (see equation 2.5) is independent of x, and the resulting derivatives give a Fisher information of zero (see equation 2.2). In the limit as the tuning width goes to zero, equation 2.14 becomes valid (the limits of the error functions and exponentials are equivalent in both the case where the tuning width is infinitesimal and the case when the stimulus space is infinite):
\lim_{\sigma \to 0} \bar{J}_f = \lim_{\sigma \to 0} \int_{0}^{1} dc_1 \cdots dc_D \int_{0}^{1} J_1(x, c) \, dx_1 \cdots dx_D = \lim_{\sigma \to 0} \int_{-\infty}^{\infty} J_1(\xi) \, d\xi_1 \cdots d\xi_D = \begin{cases} \infty, & D = 1 \\ 4\pi F \tau, & D = 2 \\ 0, & D > 2. \end{cases}   (2.21)
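Because J_1 factorizes across dimensions for gaussian tuning, equation 2.15 reduces to \bar{J}_f = FτD·B(σ)·A(σ)^{D−1}/σ⁴, where A and B are the one-dimensional integrals of exp(−(x−c)²/2σ²) and (x−c)²·exp(−(x−c)²/2σ²) over the unit square in x and c. A sketch (plain Python; the closed forms for A and B follow from the same integration used in the appendix, with F = τ = 1) that locates the three-dimensional optimum numerically:

```python
import math

def A(sigma):
    """Closed form of the double integral of exp(-(x-c)^2 / 2 sigma^2)
    over the unit square."""
    return (math.sqrt(2 * math.pi) * sigma * math.erf(1 / (math.sqrt(2) * sigma))
            + 2 * sigma ** 2 * (math.exp(-1 / (2 * sigma ** 2)) - 1))

def B(sigma):
    """Closed form of the double integral of
    (x-c)^2 exp(-(x-c)^2 / 2 sigma^2) over the unit square."""
    return sigma ** 4 * (math.sqrt(2 * math.pi) / sigma
                         * math.erf(1 / (math.sqrt(2) * sigma))
                         + 4 * (math.exp(-1 / (2 * sigma ** 2)) - 1))

def avg_fisher_finite(sigma, D, F=1.0, tau=1.0):
    """Equation 2.15 via the separable factorization."""
    return F * tau * D * B(sigma) * A(sigma) ** (D - 1) / sigma ** 4

# Grid search over sigma for the D = 3 optimum.
sigmas = [0.05 + 0.001 * k for k in range(500)]
best = max(sigmas, key=lambda s: avg_fisher_finite(s, 3))
print(best)  # ~0.225, matching the 22.541% reported in the text
```

The same routine reproduces the one- and two-dimensional behavior: \bar{J}_f only grows as σ shrinks.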
Therefore, at least one maximum in the Fisher information must exist for higher dimensionalities. While we expect that the optimal tuning width shifts toward a larger fraction of the stimulus space as the dimensionality is increased, we are unable to find a closed form for equation 2.15 that can prove this. The deviation of the results from equation 2.14 is explained by the fact that an increase in tuning width in equation 2.15 results in an increase in tuning width relative to the stimulus space, an effect that is not easily seen when an infinite stimulus space is considered. Clearly, this deviation should be expected, as the limit given in equation 2.20 contradicts the result in equation 2.14. An infinite tuning width produces a neuronal tuning function that is flat at all but infinite stimuli. A neuron with such a tuning curve cannot discriminate between finite stimuli and therefore contributes zero Fisher information within finite stimulus ranges. This result, which is true for all dimensionalities, tells us that an infinite tuning width can never be optimal, at least under the assumptions presented in this work. 2.3 Determining the Finite Stimulus Range. As stated earlier, when using neuronal firing rate as an encoding variable, there is a physiological limit to the range of stimulus space that can be perceived. This limit is due
Figure 2: Fisher information as a function of stimulus (J_x(x)) calculated using equation 2.24. The average Fisher information per neuron calculated using equation 2.22 is found by normalizing the finite stimulus space to lie between 0 and 1 and distributing the preferred stimuli of the neurons between α and 1 − α such that the drop in Fisher information at the corners of the stimulus space does not fall below J cutoff. The shaded regions are not included in the finite stimulus space and therefore are not included in the average. For this plot, σ = 0.2, J cutoff = 8, and α = −0.23. The Fisher information is divided by the peak firing rate (F) and the time window (τ).
to the fact that the number of neurons is finite, the firing rates are finite, and the decoding time is finite. This limit will depend on the number of neurons, the preferred stimuli of the neurons, the mean peak firing rate, and the decoding time. There is no implication, however, that the stimulus space range should equal the range of preferred stimuli for the neurons. It is therefore important to consider the influence of the preferred stimuli range on the optimal tuning width. We can evaluate the effect of the preferred stimuli range on the optimum tuning width by fixing the integration interval for the stimulus space (x) to lie between 0 and 1, and setting the integration interval for the preferred stimuli (c) such that the drop in Fisher information at the corners of the stimulus space does not fall below some threshold value (J cutoff ) (see Figure 2). In this case, the tuning width always represents a fraction of the stimulus space. The range of the preferred stimuli is dependent on both σ and J cutoff ;
however, it will always be centered on the stimulus space and identical in each dimension. The choice of the Fisher information as a threshold to limit the stimulus space is reasonable because there is a finite range of stimulus space within which the mean-squared error will be tolerable based on a finite distribution of preferred stimuli. We can evaluate the effect of J cutoff on the Fisher information per neuron in terms of σ as

\bar{J}_\alpha(\sigma, \alpha) = \frac{1}{(1 - 2\alpha)^D} \int_{\alpha}^{1-\alpha} dc_1 \cdots dc_D \int_{0}^{1} J_1(x, c) \, dx_1 \cdots dx_D,   (2.22)

such that

J_x(x_1 = 0, \ldots, x_D = 0) = J_x(x_1 = 1, \ldots, x_D = 1) = J_{\mathrm{cutoff}},   (2.23)
where J_x represents the Fisher information as a function of the stimulus,

J_x(x) = \int_{\alpha}^{1-\alpha} J_1(x, c) \, dc_1 \cdots dc_D,   (2.24)
and α determines the range of the preferred stimuli. For a given J cutoff and σ, we solve for the range of the preferred stimuli (α to 1 − α) using a numerical evaluation of equation 2.23 such that the Fisher information at the corners of the stimulus space will be equal to J cutoff. Of course, for each value of σ, there is a maximum J cutoff for which a solution exists simply because the Fisher information at any point in the stimulus space is finite. Once α has been determined, equation 2.22 can then be solved analytically to determine the average Fisher information per neuron over the stimulus space range, using a range of preferred stimuli, which may be larger or smaller than the range of the stimulus space, depending on the value of J cutoff. Because equation 2.22 represents an average Fisher information per neuron, it is normalized to account for the range of the neuronal distribution. For a finite neuronal distribution range, there is a finite range of stimulus space within which the mean squared error will be tolerable. By determining the average Fisher information per neuron using equation 2.22, the finite stimulus space is always determined by the points at which the Fisher information begins to fall below this threshold, regardless of the tuning width. The results for this evaluation are shown in Figure 3 for the first three dimensionalities. For all cases, an increase in J cutoff results in an increase in the range of the preferred stimuli (decrease in α); that is, when the cutoff becomes low, the range of the stimulus space must become larger relative to the distribution of the neurons such that the stimulus space includes
Figure 3: Plot of the average Fisher information per neuron for one to three dimensions calculated using equation 2.22 (left) along with the corresponding ranges for the preferred stimuli (right). The plotted values for J cutoff were chosen so that the range of σ is large enough to illustrate differences (as σ increases, the maximum J cutoff that can be achieved decreases). For all dimensionalities, there is an increase in the range of preferred stimuli and a decrease in the average Fisher information per neuron as J cutoff increases. The Fisher information is divided by the peak firing rate (F ) and the time window (τ ).
regions with higher mean squared errors. This increase in the range of the preferred stimuli results in a decrease in the average Fisher information per neuron. This is due to the fact that the total Fisher information within the finite stimulus space (0–1) for a single neuron decreases as c deviates from 0.5. As the range of the preferred stimuli increases, the average Fisher information for each neuron within the stimulus space will decrease. The results quantify the idea that the encoding accuracy within a finite stimulus space of interest can be increased by increasing the range of preferred stimuli around the center of the stimulus space. This increase comes at an energetic cost resulting from an increase in the number of neurons, with diminishing returns due to a decrease in the average Fisher information per neuron within the stimulus space. While a change in the curvature of the average Fisher information as a function of σ can result from a change in the Fisher information cutoff, the Fisher information per neuron always improves with narrow tuning for one and two dimensions. For three dimensions, there will always be an optimal tuning width; however, as shown in Figure 3, the optimal tuning width will shift toward a larger fraction of the stimulus space with an increase in the Fisher information cutoff. Numerical evaluation of the optimal tuning width calculated with J cutoff ranging from 10^{-5} to 10 results in an increase in the optimal tuning width from 0.21 to 0.41, a decrease in α from 0.48 to −0.5, and a decrease in the optimal average Fisher information per neuron from 8.34 to 1.78.

3 Conclusion

The result in equation 2.14 gives the Fisher information per neuron when an infinite stimulus space is considered. It is implicit in this model that the Fisher information at any point in the stimulus space is constant (independent of the stimulus) due to an infinite range of the preferred stimuli for the neurons (see the substitution in equation 2.10).
We have continued this analysis using a finite stimulus space for two reasons. First, equation 2.14 is not accurate when the tuning width is large relative to the stimulus space due to an assumption in the derivation that σ is finite. Therefore, it is convenient to use a finite stimulus space in order to ascertain accurate results at any σ . Second, physiological limits preclude both the case where the stimulus space is infinite and the case where the range of the preferred stimuli is infinite. Using a finite stimulus space and a finite range of preferred stimuli introduces edge effects that are important to consider in models of encoding accuracy simply because these edge effects must exist in animal physiology. In the case where a finite range of preferred stimuli is considered, the Fisher information cannot be independent of the stimulus, even when an infinite number of neurons are considered. This results in limited regions of the stimulus space where the encoding accuracy is tolerable. When using a finite stimulus space, the Fisher information per neuron is not constant
and therefore must be reported as an average. The fact that both of these conditions must be true leaves us with the case where we are interested only in limited regions of the stimulus space and for each neuron are concerned only with the contribution to encoding accuracy that lies within this space. In our model, the limits to the encoding accuracy are governed by the limits to the range of the preferred stimuli. The minimum Fisher information across the finite stimulus space of interest can be increased by increasing the range of the preferred stimuli. The cost of this increase is both an increase in the number of neurons and a decrease in the average Fisher information per neuron within the stimulus space. The optimal tuning width for three dimensions is dependent on the distribution of the preferred stimuli within the stimulus space. The average Fisher information per neuron at a given tuning width will change depending on the desired value for J cutoff . However, the following rule is universal under the framework of our model: the encoding accuracy will improve with narrow tuning for one and two dimensions, and for higher dimensions there will be at least one optimal tuning width. In general, when a finite number of neurons are considered to encode a finite stimulus space, there should be an optimal tuning width (in terms of encoding accuracy) for any dimension. Although our results show an infinitesimal tuning width to be optimal for one and two dimensions, for a finite number of neurons this cannot be the case, as the tuning curves will become too narrow to cover the stimulus space without gaps (Eurich & Wilke, 2000; Wilke & Eurich, 2002). Therefore, the variance in the encoding accuracy as a function of the tuning width is also important to consider. 
We have based our work on the model developed by Zhang and Sejnowski (1999), assuming independent spike firing, constant tuning widths, radial symmetric tuning curves, and neuron distributions such that the estimation errors in different dimensions are always identical and uncorrelated. The model is desirable in that it is mathematically simple and therefore useful for studying the effect of dimensionality on encoding accuracy. However, when applied in a biological setting, many other factors have been shown to influence optimal tuning widths. In addition to Fisher information and variance of encoding accuracy, an objective function for optimal tuning widths should also consider energetic constraints (Bethge, Rotermund, & Pawelzik, 2002), heterogeneity in the tuning widths across distinct stimulus dimensions (Eurich & Wilke, 2000), heterogeneity in the tuning widths within a stimulus dimension (Wilke & Eurich, 2002), noise models (Wilke & Eurich, 2002), covariance of the noise (Karbowski, 2000; Pouget, Deneve, Ducom, & Latham, 1999; Wilke & Eurich, 2002; Wu, Amari, & Nakahara, 2002), nonsymmetric tuning curves (Eurich & Wilke, 2000), decoding time and maximum firing rates (Bethge et al., 2002), hidden dimensions (Eurich & Wilke, 2000), choice of encoding variable(s) (Eckhorn, Grusser, Kroller, Pellnitz, & Popel, 1976), and biased estimators.
Appendix

The steps in the integration used to derive the average Fisher information per neuron for one dimension (\bar{J}_{f1D}) are given below. (The integrations for higher dimensionalities are similar.) Starting from

\bar{J}_{f1D} = \int_{0}^{1} dc \int_{0}^{1} J_1(x, c) \, dx,

the inner integral is

\int_{0}^{1} J_1(x, c) \, dx = \frac{F \tau}{\sigma^4} \int_{0}^{1} (x - c)^2 \exp\left( -\frac{(x - c)^2}{2\sigma^2} \right) dx.

Using the antiderivative

\int (x - c)^2 \exp\left( -\frac{(x - c)^2}{2\sigma^2} \right) dx = -\sigma^2 (x - c) \exp\left( -\frac{(x - c)^2}{2\sigma^2} \right) + \frac{\sigma^3 \sqrt{2\pi}}{2} \, \mathrm{erf}\left( \frac{\sqrt{2}(x - c)}{2\sigma} \right)

and evaluating at x = 0 and x = 1 gives

\int_{0}^{1} J_1(x, c) \, dx = \frac{F \tau}{\sigma^2} \left[ -c \exp\left( -\frac{c^2}{2\sigma^2} \right) - (1 - c) \exp\left( -\frac{(1 - c)^2}{2\sigma^2} \right) \right] + \frac{F \tau \sqrt{2\pi}}{2\sigma} \left[ \mathrm{erf}\left( \frac{\sqrt{2} c}{2\sigma} \right) + \mathrm{erf}\left( \frac{\sqrt{2} (1 - c)}{2\sigma} \right) \right].

Integrating over c, the two exponential terms are equal by the symmetry c ↔ 1 − c, and

\int_{0}^{1} -2c \exp\left( -\frac{c^2}{2\sigma^2} \right) dc = -2\sigma^2 \left( 1 - \exp\left( -\frac{1}{2\sigma^2} \right) \right).

The two error-function terms are likewise equal by symmetry, and integration by parts gives

\int_{0}^{1} \mathrm{erf}\left( \frac{\sqrt{2} c}{2\sigma} \right) dc = \mathrm{erf}\left( \frac{\sqrt{2}}{2\sigma} \right) + \frac{\sqrt{2} \sigma}{\sqrt{\pi}} \left( \exp\left( -\frac{1}{2\sigma^2} \right) - 1 \right).

Combining these pieces,

\bar{J}_{f1D} = -2 F \tau \left( 1 - \exp\left( -\frac{1}{2\sigma^2} \right) \right) + \frac{F \tau \sqrt{2\pi}}{\sigma} \, \mathrm{erf}\left( \frac{\sqrt{2}}{2\sigma} \right) + 2 F \tau \left( \exp\left( -\frac{1}{2\sigma^2} \right) - 1 \right) = F \tau \left[ \frac{\sqrt{2\pi}}{\sigma} \, \mathrm{erf}\left( \frac{\sqrt{2}}{2\sigma} \right) + 4 \exp\left( -\frac{1}{2\sigma^2} \right) - 4 \right],

which is equation 2.17.
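As a sanity check, the closed form of equation 2.17 can be compared against a brute-force evaluation of the double integral that defines it. A sketch in plain Python (F = τ = 1, with σ = 0.2 an arbitrary test value, using a two-dimensional composite Simpson rule):

```python
import math

def jf1d_closed(sigma, F=1.0, tau=1.0):
    """Equation 2.17."""
    return F * tau * (math.sqrt(2 * math.pi) / sigma
                      * math.erf(math.sqrt(2) / (2 * sigma))
                      + 4 * math.exp(-1 / (2 * sigma ** 2)) - 4)

def jf1d_numeric(sigma, F=1.0, tau=1.0, n=400):
    """Double integral over x and c of J_1(x, c) on the unit square,
    via a composite Simpson rule in each variable (n even)."""
    def w(k):  # Simpson weights on an n-interval grid
        return 1 if k in (0, n) else (4 if k % 2 else 2)
    h = 1.0 / n
    total = 0.0
    for i in range(n + 1):
        c = i * h
        for j in range(n + 1):
            x = j * h
            u2 = (x - c) ** 2
            total += w(i) * w(j) * u2 * math.exp(-u2 / (2 * sigma ** 2))
    return F * tau / sigma ** 4 * total * (h / 3.0) ** 2

print(jf1d_closed(0.2), jf1d_numeric(0.2))  # the two agree
```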
Acknowledgments We thank Shawn Martin at Sandia National Laboratories and the reviewers for their guidance in presenting this work. Support for this work was provided by Sandia National Laboratories’ LDRD and Mathematics, Information, and Computational Sciences Program of the U.S. Department of Energy, and Caltech’s Beckman Institute. Sandia is a multiprogram laboratory operated by Sandia Corp., a Lockheed Martin Company, for the U.S. Department of Energy’s National Nuclear Security Administration. References Attneave, F. (1954). Some informational aspects of visual perception. Psychological Review, 61(3), 183–193. Baldi, P., & Heiligenberg, W. (1988). How sensory maps could enhance resolution through ordered arrangements of broadly tuned receivers. Biological Cybernetics, 59(4–5), 313–318. Barlow, H. B. (1972). Single units and sensation: A neuron doctrine for perceptual psychology? Perception, 1(4), 371–394. Bethge, M., Rotermund, D., & Pawelzik, K. (2002). Optimal short-term population coding: When Fisher information fails. Neural Computation, 14(10), 2317–2351. Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley. Eckhorn, R., Grusser, O. J., Kroller, J., Pellnitz, K., & Popel, B. (1976). Efficiency of different neuronal codes: Information transfer calculations for three different neuronal systems. Biological Cybernetics, 22(1), 49–60. Eurich, C. W., & Schwegler, H. (1997). Coarse coding: Calculation of the resolution achieved by a population of large receptive field neurons. Biological Cybernetics, 76(5), 357–363. Eurich, C. W., & Wilke, S. D. (2000). Multidimensional encoding strategy of spiking neurons. Neural Computation, 12(7), 1519–1529. Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233(4771), 1416–1419. Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In J. L. 
McClelland (Ed.), Parallel distributed processing (Vol. 1, pp. 77–109). Cambridge, MA: MIT Press. Karbowski, J. (2000). Fisher information and temporal correlations for spiking neurons with stochastic dynamics. Physical Review E, 61(4 Pt. B), 4235–4252. Knudsen, E. I., & Konishi, M. (1978). A neural map of auditory space in the owl. Science, 200(4343), 795–797. Kuffler, S. W. (1953). Discharge patterns and functional organization of mammalian retina. Journal of Neurophysiology, 16(1), 37–68. Lee, B. B. (1999). Single units and sensation: A retrospect. Perception, 28(12), 1493– 1508. Lettvin, J. Y., Maturana, H. R., McCulloch, W. S., & Pitts, W. H. (1959). What the frog’s eye tells the frog’s brain. Proceedings of the Institute of Radio Engineers (New York), 47, 1940–1951.
Pouget, A., Deneve, S., Ducom, J. C., & Latham, P. E. (1999). Narrow versus wide tuning curves: What’s best for a population code? Neural Computation, 11(1), 85–90. Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rates. Journal of Computational Neuroscience, 1(1–2), 89–107. Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proceedings of the National Academy of Sciences U S A, 90(22), 10749– 10753. Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8(3), 511–529. Snippe, H. P., & Koenderink, J. J. (1992). Information in channel-coded systems: Correlated receivers. Biological Cybernetics, 67(2), 183–190. Wilke, S. D., & Eurich, C. W. (2002). Representational accuracy of stochastic neural populations. Neural Computation, 14(1), 155–189. Wu, S., Amari, S., & Nakahara, H. (2002). Population coding and decoding in a neural field: A computational study. Neural Computation, 14(5), 999–1026. Zhang, K., & Sejnowski, T. J. (1999). Neuronal tuning: To sharpen or broaden? Neural Computation, 11(1), 75–84.
Received September 17, 2004; accepted November 1, 2005.
LETTER
Communicated by Yann Le Cun
A Fast Learning Algorithm for Deep Belief Nets Geoffrey E. Hinton [email protected]
Simon Osindero [email protected] Department of Computer Science, University of Toronto, Toronto, Canada M5S 3G4
Yee-Whye Teh [email protected] Department of Computer Science, National University of Singapore, Singapore 117543
We show how to use "complementary priors" to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind. 1 Introduction Learning is difficult in densely connected, directed belief nets that have many hidden layers because it is difficult to infer the conditional distribution of the hidden activities when given a data vector. Variational methods use simple approximations to the true conditional distribution, but the approximations may be poor, especially at the deepest hidden layer, where the prior assumes independence. Also, variational learning still requires all of the parameters to be learned together and this makes the learning time scale poorly as the number of parameters increases. We describe a model in which the top two hidden layers form an undirected associative memory (see Figure 1) and the remaining hidden layers
Neural Computation 18, 1527–1554 (2006)
© 2006 Massachusetts Institute of Technology
G. Hinton, S. Osindero, and Y.-W. Teh
[Figure 1 diagram: a 28 × 28 pixel image layer feeds two 500-unit hidden layers, with 2000 top-level units and 10 label units above; a note reads "This could be the top level of another sensory pathway."]
Figure 1: The network used to model the joint distribution of digit images and digit labels. In this letter, each training case consists of an image and an explicit class label, but work in progress has shown that the same learning algorithm can be used if the “labels” are replaced by a multilayer pathway whose inputs are spectrograms from multiple different speakers saying isolated digits. The network then learns to generate pairs that consist of an image and a spectrogram of the same digit class.
form a directed acyclic graph that converts the representations in the associative memory into observable variables such as the pixels of an image. This hybrid model has some attractive features:
- There is a fast, greedy learning algorithm that can find a fairly good set of parameters quickly, even in deep networks with millions of parameters and many hidden layers.
- The learning algorithm is unsupervised but can be applied to labeled data by learning a model that generates both the label and the data.
- There is a fine-tuning algorithm that learns an excellent generative model that outperforms discriminative methods on the MNIST database of hand-written digits.
- The generative model makes it easy to interpret the distributed representations in the deep hidden layers.
- The inference required for forming a percept is both fast and accurate.
- The learning algorithm is local. Adjustments to a synapse strength depend only on the states of the presynaptic and postsynaptic neuron.
- The communication is simple. Neurons need only communicate their stochastic binary states.
Section 2 introduces the idea of a "complementary" prior that exactly cancels the "explaining away" phenomenon that makes inference difficult in directed models. An example of a directed belief network with complementary priors is presented. Section 3 shows the equivalence between restricted Boltzmann machines and infinite directed networks with tied weights. Section 4 introduces a fast, greedy learning algorithm for constructing multilayer directed networks one layer at a time. Using a variational bound, it shows that as each new layer is added, the overall generative model improves. The greedy algorithm bears some resemblance to boosting in its repeated use of the same "weak" learner, but instead of reweighting each data vector to ensure that the next step learns something new, it re-represents it. The "weak" learner that is used to construct deep directed nets is itself an undirected graphical model. Section 5 shows how the weights produced by the fast, greedy algorithm can be fine-tuned using the "up-down" algorithm. This is a contrastive version of the wake-sleep algorithm (Hinton, Dayan, Frey, & Neal, 1995) that does not suffer from the "mode-averaging" problems that can cause the wake-sleep algorithm to learn poor recognition weights. Section 6 shows the pattern recognition performance of a network with three hidden layers and about 1.7 million weights on the MNIST set of handwritten digits. When no knowledge of geometry is provided and there is no special preprocessing, the generalization performance of the network is 1.25% errors on the 10,000-digit official test set. This beats the 1.5% achieved by the best backpropagation nets when they are not handcrafted for this particular application. It is also slightly better than the 1.4% errors reported by Decoste and Schoelkopf (2002) for support vector machines on the same task. Finally, section 7 shows what happens in the mind of the network when it is running without being constrained by visual input.
The network has a full generative model, so it is easy to look into its mind—we simply generate an image from its high-level representations. Throughout the letter, we consider nets composed of stochastic binary variables, but the ideas can be generalized to other models in which the log probability of a variable is an additive function of the states of its directly connected neighbors (see appendix A for details).
Figure 2: A simple logistic belief net containing two independent, rare causes that become highly anticorrelated when we observe the house jumping. The bias of −10 on the earthquake node means that in the absence of any observation, this node is e^{10} times more likely to be off than on. If the earthquake node is on and the truck node is off, the jump node has a total input of 0, which means that it has an even chance of being on. This is a much better explanation of the observation that the house jumped than the odds of e^{−20}, which apply if neither of the hidden causes is active. But it is wasteful to turn on both hidden causes to explain the observation because the probability of both happening is e^{−10} × e^{−10} = e^{−20}. When the earthquake node is turned on, it "explains away" the evidence for the truck node.
2 Complementary Priors The phenomenon of explaining away (illustrated in Figure 2) makes inference difficult in directed belief nets. In densely connected networks, the posterior distribution over the hidden variables is intractable except in a few special cases, such as mixture models or linear models with additive gaussian noise. Markov chain Monte Carlo methods (Neal, 1992) can be used to sample from the posterior, but they are typically very time-consuming. Variational methods (Neal & Hinton, 1998) approximate the true posterior with a more tractable distribution, and they can be used to improve a lower bound on the log probability of the training data. It is comforting that learning is guaranteed to improve a variational bound even when the inference of the hidden states is done incorrectly, but it would be much better to find a way of eliminating explaining away altogether, even in models whose hidden variables have highly correlated effects on the visible variables. It is widely assumed that this is impossible. A logistic belief net (Neal, 1992) is composed of stochastic binary units. When the net is used to generate data, the probability of turning on unit i is a logistic function of the states of its immediate ancestors, j, and of the
A Fast Learning Algorithm for Deep Belief Nets
weights, w_ij, on the directed connections from the ancestors:

$$p(s_i = 1) = \frac{1}{1 + \exp\left(-b_i - \sum_j s_j w_{ij}\right)}, \tag{2.1}$$
where b_i is the bias of unit i. If a logistic belief net has only one hidden layer, the prior distribution over the hidden variables is factorial because their binary states are chosen independently when the model is used to generate data. The nonindependence in the posterior distribution is created by the likelihood term coming from the data. Perhaps we could eliminate explaining away in the first hidden layer by using extra hidden layers to create a “complementary” prior that has exactly the opposite correlations to those in the likelihood term. Then, when the likelihood term is multiplied by the prior, we will get a posterior that is exactly factorial. It is not at all obvious that complementary priors exist, but Figure 3 shows a simple example of an infinite logistic belief net with tied weights in which the priors are complementary at every hidden layer (see appendix A for a more general treatment of the conditions under which complementary priors exist). The use of tied weights to construct complementary priors may seem like a mere trick for making directed models equivalent to undirected ones. As we shall see, however, it leads to a novel and very efficient learning algorithm that works by progressively untying the weights in each layer from the weights in higher layers.

2.1 An Infinite Directed Model with Tied Weights.

We can generate data from the infinite directed net in Figure 3 by starting with a random configuration at an infinitely deep hidden layer1 and then performing a top-down “ancestral” pass in which the binary state of each variable in a layer is chosen from the Bernoulli distribution determined by the top-down input coming from its active parents in the layer above. In this respect, it is just like any other directed acyclic belief net.
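The top-down ancestral pass can be sketched in a few lines: each layer applies equation 2.1 to the states of the layer above and then samples binary states. The layer sizes and random weights below are illustrative, not the network from the letter.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ancestral_pass(top_state, weights, biases):
    """Top-down generative pass through a logistic belief net.

    weights[k] maps layer k (layer 0 is the top) to layer k+1;
    biases[k] are the biases of layer k+1. Equation 2.1 is applied
    to every unit, then a Bernoulli sample is drawn.
    """
    states = [top_state]
    for W, b in zip(weights, biases):
        p = sigmoid(states[-1] @ W + b)               # eq. 2.1 for each unit
        states.append((rng.random(p.shape) < p) * 1.0)  # Bernoulli sample
    return states

# Tiny illustrative net: 3 top units -> 4 hidden units -> 5 visible units.
top = (rng.random(3) < 0.5) * 1.0
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(4, 5))]
bs = [np.zeros(4), np.zeros(5)]
states = ancestral_pass(top, Ws, bs)
print([s.shape for s in states])
```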
Unlike other directed nets, however, we can sample from the true posterior distribution over all of the hidden layers by starting with a data vector on the visible units and then using the transposed weight matrices to infer the factorial distributions over each hidden layer in turn. At each hidden layer, we sample from the factorial posterior before computing the factorial posterior for the layer above.2 Appendix A shows that this procedure gives unbiased samples
1 The generation process converges to the stationary distribution of the Markov chain, so we need to start at a layer that is deep compared with the time it takes for the chain to reach equilibrium. 2 This is exactly the same as the inference procedure used in the wake-sleep algorithm (Hinton et al., 1995) but for the models described in this letter no variational approximation is required because the inference procedure gives unbiased samples.
G. Hinton, S. Osindero, and Y.-W. Teh
Figure 3: An infinite logistic belief net with tied weights. The downward arrows represent the generative model. The upward arrows are not part of the model. They represent the parameters that are used to infer samples from the posterior distribution at each hidden layer of the net when a data vector is clamped on V0 .
because the complementary prior at each layer ensures that the posterior distribution really is factorial. Since we can sample from the true posterior, we can compute the derivatives of the log probability of the data. Let us start by computing the derivative for a generative weight, w_ij^00, from a unit j in layer H_0 to unit i in layer V_0 (see Figure 3). In a logistic belief net, the maximum likelihood learning rule for a single data vector, v^0, is

$$\frac{\partial \log p(\mathbf{v}^0)}{\partial w_{ij}^{00}} = \left\langle h_j^0 \left(v_i^0 - \hat{v}_i^0\right) \right\rangle, \tag{2.2}$$
where ⟨·⟩ denotes an average over the sampled states and v̂_i^0 is the probability that unit i would be turned on if the visible vector was stochastically
reconstructed from the sampled hidden states. Computing the posterior distribution over the second hidden layer, V_1, from the sampled binary states in the first hidden layer, H_0, is exactly the same process as reconstructing the data, so v_i^1 is a sample from a Bernoulli random variable with probability v̂_i^0. The learning rule can therefore be written as

$$\frac{\partial \log p(\mathbf{v}^0)}{\partial w_{ij}^{00}} = \left\langle h_j^0 \left(v_i^0 - v_i^1\right) \right\rangle. \tag{2.3}$$
The dependence of v_i^1 on h_j^0 is unproblematic in the derivation of equation 2.3 from equation 2.2 because v̂_i^0 is an expectation that is conditional on h_j^0. Since the weights are replicated, the full derivative for a generative weight is obtained by summing the derivatives of the generative weights between all pairs of layers:

$$\frac{\partial \log p(\mathbf{v}^0)}{\partial w_{ij}} = \left\langle h_j^0 \left(v_i^0 - v_i^1\right) \right\rangle + \left\langle v_i^1 \left(h_j^0 - h_j^1\right) \right\rangle + \left\langle h_j^1 \left(v_i^1 - v_i^2\right) \right\rangle + \cdots \tag{2.4}$$
All of the pairwise products except the first and last cancel, leaving the Boltzmann machine learning rule of equation 3.1.

3 Restricted Boltzmann Machines and Contrastive Divergence Learning

It may not be immediately obvious that the infinite directed net in Figure 3 is equivalent to a restricted Boltzmann machine (RBM). An RBM has a single layer of hidden units that are not connected to each other and have undirected, symmetrical connections to a layer of visible units. To generate data from an RBM, we can start with a random state in one of the layers and then perform alternating Gibbs sampling. All of the units in one layer are updated in parallel given the current states of the units in the other layer, and this is repeated until the system is sampling from its equilibrium distribution. Notice that this is exactly the same process as generating data from the infinite belief net with tied weights. To perform maximum likelihood learning in an RBM, we can use the difference between two correlations. For each weight, w_ij, between a visible unit, i, and a hidden unit, j, we measure the correlation ⟨v_i^0 h_j^0⟩ when a data vector is clamped on the visible units and the hidden states are sampled from their conditional distribution, which is factorial. Then, using alternating Gibbs sampling, we run the Markov chain shown in Figure 4 until it reaches its stationary distribution and measure the correlation ⟨v_i^∞ h_j^∞⟩. The gradient of the log probability of the training data is then

$$\frac{\partial \log p(\mathbf{v}^0)}{\partial w_{ij}} = \left\langle v_i^0 h_j^0 \right\rangle - \left\langle v_i^\infty h_j^\infty \right\rangle. \tag{3.1}$$
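A single weight update of this kind, with the Markov chain truncated after one full Gibbs step as in contrastive divergence, can be sketched as follows. This is a minimal illustration of equations 2.1 and 3.1 with made-up sizes, not the code from appendix B.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, a, b, v0, lr=0.1):
    """One CD-1 update of an RBM.

    W: visible-by-hidden weight matrix; a, b: visible and hidden biases.
    """
    # Clamp the data and sample the factorial posterior over hidden units.
    p_h0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(p_h0.shape) < p_h0) * 1.0
    # One full Gibbs step: reconstruct v, then recompute hidden probabilities.
    p_v1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(p_v1.shape) < p_v1) * 1.0
    p_h1 = sigmoid(v1 @ W + b)
    # Truncated form of eq. 3.1: <v0 h0> minus the one-step correlation.
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    a += lr * (v0 - v1)
    b += lr * (p_h0 - p_h1)
    return W, a, b

W = 0.01 * rng.normal(size=(6, 4))
a, b = np.zeros(6), np.zeros(4)
v0 = (rng.random(6) < 0.5) * 1.0
W, a, b = cd1_update(W, a, b, v0)
print(W.shape)
```

Running the chain to t = ∞ instead of truncating it would recover the exact maximum likelihood rule of equation 3.1.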
Figure 4: This depicts a Markov chain that uses alternating Gibbs sampling. In one full step of Gibbs sampling, the hidden units in the top layer are all updated in parallel by applying equation 2.1 to the inputs received from the current states of the visible units in the bottom layer; then the visible units are all updated in parallel given the current hidden states. The chain is initialized by setting the binary states of the visible units to be the same as a data vector. The correlations in the activities of a visible and a hidden unit are measured after the first update of the hidden units and again at the end of the chain. The difference of these two correlations provides the learning signal for updating the weight on the connection.
This learning rule is the same as the maximum likelihood learning rule for the infinite logistic belief net with tied weights, and each step of Gibbs sampling corresponds to computing the exact posterior distribution in a layer of the infinite logistic belief net. Maximizing the log probability of the data is exactly the same as minimizing the Kullback-Leibler divergence, KL(P^0 ‖ P_θ^∞), between the distribution of the data, P^0, and the equilibrium distribution defined by the model, P_θ^∞. In contrastive divergence learning (Hinton, 2002), we run the Markov chain for only n full steps before measuring the second correlation.3 This is equivalent to ignoring the derivatives that come from the higher layers of the infinite net. The sum of all these ignored derivatives is the derivative of the log probability of the posterior distribution in layer V_n, which is also the derivative of the Kullback-Leibler divergence between the posterior distribution in layer V_n, P_θ^n, and the equilibrium distribution defined by the model. So contrastive divergence learning minimizes the difference of two Kullback-Leibler divergences:

$$\mathrm{KL}\left(P^0 \,\|\, P_\theta^\infty\right) - \mathrm{KL}\left(P_\theta^n \,\|\, P_\theta^\infty\right). \tag{3.2}$$
Ignoring sampling noise, this difference is never negative because Gibbs sampling is used to produce P_θ^n from P^0, and Gibbs sampling always reduces the Kullback-Leibler divergence with the equilibrium distribution.

3 Each full step consists of updating h given v, then updating v given h.

It
is important to notice that P_θ^n depends on the current model parameters, and the way in which P_θ^n changes as the parameters change is being ignored by contrastive divergence learning. This problem does not arise with P^0 because the training data do not depend on the parameters. An empirical investigation of the relationship between the maximum likelihood and the contrastive divergence learning rules can be found in Carreira-Perpinan and Hinton (2005).

Contrastive divergence learning in a restricted Boltzmann machine is efficient enough to be practical (Mayraz & Hinton, 2001). Variations that use real-valued units and different sampling schemes are described in Teh, Welling, Osindero, and Hinton (2003) and have been quite successful for modeling the formation of topographic maps (Welling, Hinton, & Osindero, 2003), for denoising natural images (Roth & Black, 2005), and for denoising images of biological cells (Ning et al., 2005). Marks and Movellan (2001) describe a way of using contrastive divergence to perform factor analysis, and Welling, Rosen-Zvi, and Hinton (2005) show that a network with logistic, binary visible units and linear, gaussian hidden units can be used for rapid document retrieval. However, it appears that the efficiency has been bought at a high price: When applied in the obvious way, contrastive divergence learning fails for deep, multilayer networks with different weights at each layer because these networks take far too long even to reach conditional equilibrium with a clamped data vector. We now show that the equivalence between RBMs and infinite directed nets with tied weights suggests an efficient learning algorithm for multilayer networks in which the weights are not tied.

4 A Greedy Learning Algorithm for Transforming Representations

An efficient way to learn a complicated model is to combine a set of simpler models that are learned sequentially.
To force each model in the sequence to learn something different from the previous models, the data are modified in some way after each model has been learned. In boosting (Freund, 1995), each model in the sequence is trained on reweighted data that emphasize the cases that the preceding models got wrong. In one version of principal components analysis, the variance in a modeled direction is removed, thus forcing the next modeled direction to lie in the orthogonal subspace (Sanger, 1989). In projection pursuit (Friedman & Stuetzle, 1981), the data are transformed by nonlinearly distorting one direction in the data space to remove all nongaussianity in that direction. The idea behind our greedy algorithm is to allow each model in the sequence to receive a different representation of the data. The model performs a nonlinear transformation on its input vectors and produces as output the vectors that will be used as input for the next model in the sequence. Figure 5 shows a multilayer generative model in which the top two layers interact via undirected connections and all of the other connections
Figure 5: A hybrid network. The top two layers have undirected connections and form an associative memory. The layers below have directed, top-down generative connections that can be used to map a state of the associative memory to an image. There are also directed, bottom-up recognition connections that are used to infer a factorial representation in one layer from the binary activities in the layer below. In the greedy initial learning, the recognition connections are tied to the generative connections.
are directed. The undirected connections at the top are equivalent to having infinitely many higher layers with tied weights. There are no intralayer connections, and to simplify the analysis, all layers have the same number of units. It is possible to learn sensible (though not optimal) values for the parameters W0 by assuming that the parameters between higher layers will be used to construct a complementary prior for W0 . This is equivalent to assuming that all of the weight matrices are constrained to be equal. The task of learning W0 under this assumption reduces to the task of learning an RBM, and although this is still difficult, good approximate solutions can be found rapidly by minimizing contrastive divergence. Once W0 has been learned, the data can be mapped through W0T to create higher-level “data” at the first hidden layer. If the RBM is a perfect model of the original data, the higher-level “data” will already be modeled perfectly by the higher-level weight matrices. Generally, however, the RBM will not be able to model the original data perfectly, and we can make the generative model better using the following greedy algorithm:
1. Learn W_0 assuming all the weight matrices are tied.

2. Freeze W_0 and commit ourselves to using W_0^T to infer factorial approximate posterior distributions over the states of the variables in the first hidden layer, even if subsequent changes in higher-level weights mean that this inference method is no longer correct.

3. Keeping all the higher weight matrices tied to each other, but untied from W_0, learn an RBM model of the higher-level “data” that was produced by using W_0^T to transform the original data.

If this greedy algorithm changes the higher-level weight matrices, it is guaranteed to improve the generative model. As shown in Neal and Hinton (1998), the negative log probability of a single data vector, v^0, under the multilayer generative model is bounded by a variational free energy, which is the expected energy under the approximating distribution, Q(h^0|v^0), minus the entropy of that distribution. For a directed model, the “energy” of the configuration v^0, h^0 is given by

$$E(\mathbf{v}^0, \mathbf{h}^0) = -\left[\log p(\mathbf{h}^0) + \log p(\mathbf{v}^0|\mathbf{h}^0)\right], \tag{4.1}$$
so the bound is

$$\log p(\mathbf{v}^0) \geq \sum_{\text{all } \mathbf{h}^0} Q(\mathbf{h}^0|\mathbf{v}^0)\left[\log p(\mathbf{h}^0) + \log p(\mathbf{v}^0|\mathbf{h}^0)\right] - \sum_{\text{all } \mathbf{h}^0} Q(\mathbf{h}^0|\mathbf{v}^0) \log Q(\mathbf{h}^0|\mathbf{v}^0), \tag{4.2}$$
where h0 is a binary configuration of the units in the first hidden layer, p(h0 ) is the prior probability of h0 under the current model (which is defined by the weights above H0 ), and Q(·|v0 ) is any probability distribution over the binary configurations in the first hidden layer. The bound becomes an equality if and only if Q(·|v0 ) is the true posterior distribution. When all of the weight matrices are tied together, the factorial distribution over H0 produced by applying W0T to a data vector is the true posterior distribution, so at step 2 of the greedy algorithm, log p(v0 ) is equal to the bound. Step 2 freezes both Q(·|v0 ) and p(v0 |h0 ), and with these terms fixed, the derivative of the bound is the same as the derivative of
$$\sum_{\text{all } \mathbf{h}^0} Q(\mathbf{h}^0|\mathbf{v}^0) \log p(\mathbf{h}^0). \tag{4.3}$$
So maximizing the bound with respect to the weights in the higher layers is exactly equivalent to maximizing the log probability of a data set in which h0 occurs with probability Q(h0 |v0 ). If the bound becomes tighter, it
is possible for log p(v^0) to fall even though the lower bound on it increases, but log p(v^0) can never fall below its value at step 2 of the greedy algorithm because the bound is tight at this point and the bound always increases.

The greedy algorithm can clearly be applied recursively, so if we use the full maximum likelihood Boltzmann machine learning algorithm to learn each set of tied weights and then we untie the bottom layer of the set from the weights above, we can learn the weights one layer at a time with a guarantee that we will never decrease the bound on the log probability of the data under the model.4 In practice, we replace the maximum likelihood Boltzmann machine learning algorithm by contrastive divergence learning because it works well and is much faster. The use of contrastive divergence voids the guarantee, but it is still reassuring to know that extra layers are guaranteed to improve imperfect models if we learn each layer with sufficient patience. To guarantee that the generative model is improved by greedily learning more layers, it is convenient to consider models in which all layers are the same size so that the higher-level weights can be initialized to the values learned before they are untied from the weights in the layer below. The same greedy algorithm, however, can be applied even when the layers are different sizes.

5 Back-Fitting with the Up-Down Algorithm

Learning the weight matrices one layer at a time is efficient but not optimal. Once the weights in higher layers have been learned, neither the weights nor the simple inference procedure are optimal for the lower layers. The suboptimality produced by greedy learning is relatively innocuous for supervised methods like boosting. Labels are often scarce, and each label may provide only a few bits of constraint on the parameters, so overfitting is typically more of a problem than underfitting. Going back and refitting the earlier models may therefore cause more harm than good.
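Before turning to back-fitting, the greedy phase (steps 1 to 3 of section 4) can be sketched in code: learn an RBM, freeze it, map the data through W^T, and repeat on the transformed “data.” This is a minimal illustration using CD-1 and made-up layer sizes, not the code from appendix B.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Fit one RBM with CD-1 (mean-field reconstruction for brevity)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.normal(size=(n_visible, n_hidden))
    a, b = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in data:
            p_h0 = sigmoid(v0 @ W + b)
            h0 = (rng.random(n_hidden) < p_h0) * 1.0
            v1 = sigmoid(h0 @ W.T + a)          # mean-field reconstruction
            p_h1 = sigmoid(v1 @ W + b)
            W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
            a += lr * (v0 - v1)
            b += lr * (p_h0 - p_h1)
    return W, a, b

def greedy_stack(data, layer_sizes):
    """Learn each layer, freeze it, and use W^T to make higher-level 'data'."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(x, n_hidden)
        weights.append(W)
        x = sigmoid(x @ W + b)   # activation probabilities become the next RBM's data
    return weights

data = (rng.random((20, 8)) < 0.5) * 1.0
weights = greedy_stack(data, [6, 4])
print([W.shape for W in weights])
```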
Unsupervised methods, however, can use very large unlabeled data sets, and each case may be very high-dimensional, thus providing many bits of constraint on a generative model. Underfitting is then a serious problem, which can be alleviated by a subsequent stage of back-fitting in which the weights that were learned first are revised to fit in better with the weights that were learned later. After greedily learning good initial values for the weights in every layer, we untie the “recognition” weights that are used for inference from the “generative” weights that define the model, but retain the restriction that the posterior in each layer must be approximated by a factorial distribution in which the variables within a layer are conditionally independent given
4 The guarantee is on the expected change in the bound.
the values of the variables in the layer below. A variant of the wake-sleep algorithm described in Hinton et al. (1995) can then be used to allow the higher-level weights to influence the lower-level ones. In the “up-pass,” the recognition weights are used in a bottom-up pass that stochastically picks a state for every hidden variable. The generative weights on the directed connections are then adjusted using the maximum likelihood learning rule in equation 2.2.5 The weights on the undirected connections at the top level are learned as before by fitting the top-level RBM to the posterior distribution of the penultimate layer.

The “down-pass” starts with a state of the top-level associative memory and uses the top-down generative connections to stochastically activate each lower layer in turn. During the down-pass, the top-level undirected connections and the generative directed connections are not changed. Only the bottom-up recognition weights are modified. This is equivalent to the sleep phase of the wake-sleep algorithm if the associative memory is allowed to settle to its equilibrium distribution before initiating the down-pass. But if the associative memory is initialized by an up-pass and then only allowed to run for a few iterations of alternating Gibbs sampling before initiating the down-pass, this is a “contrastive” form of the wake-sleep algorithm that eliminates the need to sample from the equilibrium distribution of the associative memory. The contrastive form also fixes several other problems of the sleep phase. It ensures that the recognition weights are being learned for representations that resemble those used for real data, and it also helps to eliminate the problem of mode averaging.
If, given a particular data vector, the current recognition weights always pick a particular mode at the level above and ignore other very different modes that are equally good at generating the data, the learning in the down-pass will not try to alter those recognition weights to recover any of the other modes as it would if the sleep phase used a pure ancestral pass. A pure ancestral pass would have to start by using prolonged Gibbs sampling to get an equilibrium sample from the top-level associative memory. By using a top-level associative memory, we also eliminate a problem in the wake phase: independent top-level units seem to be required to allow an ancestral pass, but they mean that the variational approximation is very poor for the top layer of weights. Appendix B specifies the details of the up-down algorithm using MATLAB-style pseudocode for the network shown in Figure 1. For simplicity, there is no penalty on the weights, no momentum, and the same learning rate for all parameters. Also, the training data are reduced to a single case.
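The directed part of one up-down step can be sketched as follows. The top-level RBM and all biases are omitted, and the layer sizes and random initial weights are illustrative assumptions, not the network of appendix B: the up-pass trains the generative weights with the rule of equation 2.2, and the down-pass trains only the recognition weights with the symmetric delta rule.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p) * 1.0

sizes = [8, 6, 4]                # visible, hidden, penultimate (illustrative)
rec = [0.05 * rng.normal(size=(sizes[k], sizes[k + 1])) for k in range(2)]
gen = [W.copy() for W in rec]    # generative weights start tied, as after greedy learning
lr = 0.1

v = (rng.random(sizes[0]) < 0.5) * 1.0   # a "data" vector

# Up-pass: recognition weights stochastically pick a state for every hidden
# variable; generative weights are adjusted with the rule of equation 2.2.
up = [v]
for W in rec:
    up.append(sample(sigmoid(up[-1] @ W)))
for k, W in enumerate(gen):
    v_hat = sigmoid(up[k + 1] @ W.T)             # top-down reconstruction
    W += lr * np.outer(up[k] - v_hat, up[k + 1])

# Down-pass: generate top-down from the top state; only the bottom-up
# recognition weights are modified, with a symmetric delta rule.
down = [up[-1]]
for W in reversed(gen):
    down.insert(0, sample(sigmoid(down[0] @ W.T)))
for k, W in enumerate(rec):
    h_hat = sigmoid(down[k] @ W)                 # bottom-up prediction
    W += lr * np.outer(down[k], down[k + 1] - h_hat)

print([s.shape for s in up])
```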
5 Because weights are no longer tied to the weights above them, v̂_i^0 must be computed using the states of the variables in the layer above i and the generative weights from these variables to i.
6 Performance on the MNIST Database

6.1 Training the Network.

The MNIST database of handwritten digits contains 60,000 training images and 10,000 test images. Results for many different pattern recognition techniques are already published for this publicly available database, so it is ideal for evaluating new pattern recognition methods. For the basic version of the MNIST learning task, no knowledge of geometry is provided, and there is no special preprocessing or enhancement of the training set, so an unknown but fixed random permutation of the pixels would not affect the learning algorithm. For this “permutation-invariant” version of the task, the generalization performance of our network was 1.25% errors on the official test set.

The network shown in Figure 1 was trained on 44,000 of the training images that were divided into 440 balanced mini-batches, each containing 10 examples of each digit class.6 The weights were updated after each mini-batch. In the initial phase of training, the greedy algorithm described in section 4 was used to train each layer of weights separately, starting at the bottom. Each layer was trained for 30 sweeps through the training set (called “epochs”). During training, the units in the “visible” layer of each RBM had real-valued activities between 0 and 1. These were the normalized pixel intensities when learning the bottom layer of weights. For training higher layers of weights, the real-valued activities of the visible units in the RBM were the activation probabilities of the hidden units in the lower-level RBM. The hidden layer of each RBM used stochastic binary values when that RBM was being trained. The greedy training took a few hours per layer in MATLAB on a 3 GHz Xeon processor, and when it was done, the error rate on the test set was 2.49% (see below for details of how the network is tested). When training the top layer of weights (the ones in the associative memory), the labels were provided as part of the input.
The labels were represented by turning on one unit in a “softmax” group of 10 units. When the activities in this group were reconstructed from the activities in the layer above, exactly one unit was allowed to be active, and the probability of picking unit i was given by

$$p_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}, \tag{6.1}$$
where xi is the total input received by unit i. Curiously, the learning rules are unaffected by the competition between units in a softmax group, so the 6 Preliminary experiments with 16 × 16 images of handwritten digits from the USPS database showed that a good way to model the joint distribution of digit images and their labels was to use an architecture of this type, but for 16 × 16 images, only three-fifths as many units were used in each hidden layer.
synapses do not need to know which unit is competing with which other unit. The competition affects the probability of a unit turning on, but it is only this probability that affects the learning. After the greedy layer-by-layer training, the network was trained, with a different learning rate and weight decay, for 300 epochs using the up-down algorithm described in section 5. The learning rate, momentum, and weight decay7 were chosen by training the network several times and observing its performance on a separate validation set of 10,000 images that were taken from the remainder of the full training set. For the first 100 epochs of the up-down algorithm, the up-pass was followed by three full iterations of alternating Gibbs sampling in the associative memory before performing the down-pass. For the second 100 epochs, 6 iterations were performed, and for the last 100 epochs, 10 iterations were performed. Each time the number of iterations of Gibbs sampling was raised, the error on the validation set decreased noticeably. The network that performed best on the validation set was tested and had an error rate of 1.39%. This network was then trained on all 60,000 training images8 until its error rate on the full training set was as low as its final error rate had been on the initial training set of 44,000 images. This took a further 59 epochs, making the total learning time about a week. The final network had an error rate of 1.25%.9 The errors made by the network are shown in Figure 6. The 49 cases that the network gets correct but for which the second-best probability is within 0.3 of the best probability are shown in Figure 7. The error rate of 1.25% compares very favorably with the error rates achieved by feedforward neural networks that have one or two hidden layers and are trained to optimize discrimination using the backpropagation algorithm (see Table 1). 
When the detailed connectivity of these networks is not handcrafted for this particular task, the best reported error rate for stochastic online learning with a separate squared error on each of the 10 output units is 2.95%. These error rates can be reduced to 1.53% in a net with one hidden layer of 800 units by using small initial weights, a separate cross-entropy error function on each output unit, and very gentle learning

7 No attempt was made to use different learning rates or weight decays for different layers, and the learning rate and momentum were always set quite conservatively to avoid oscillations. It is highly likely that the learning speed could be considerably improved by a more careful choice of learning parameters, though it is possible that this would lead to worse solutions.

8 The training set has unequal numbers of each class, so images were assigned randomly to each of the 600 mini-batches.

9 To check that further learning would not have significantly improved the error rate, the network was then left running with a very small learning rate and with the test error being displayed after every epoch. After six weeks, the test error was fluctuating between 1.12% and 1.31% and was 1.18% for the epoch on which the number of training errors was smallest.
Figure 6: The 125 test cases that the network got wrong. Each case is labeled by the network’s guess. The true classes are arranged in standard scan order.
(John Platt, personal communication, 2005). An almost identical result of 1.51% was achieved in a net that had 500 units in the first hidden layer and 300 in the second hidden layer by using “softmax” output units and a regularizer that penalizes the squared weights by an amount carefully chosen using a validation set. For comparison, nearest neighbor has a reported error rate (http://oldmill.uchicago.edu/wilder/Mnist/) of 3.1% if all 60,000 training cases are used (which is extremely slow) and 4.4% if 20,000 are used. This can be reduced to 2.8% and 4.0% by using an L3 norm. The only standard machine learning technique that comes close to the 1.25% error rate of our generative model on the basic task is a support vector machine that gives an error rate of 1.4% (Decoste & Schoelkopf, 2002). But it is hard to see how support vector machines can make use of the domain-specific tricks, like weight sharing and subsampling, which LeCun, Bottou, and Haffner (1998) use to improve the performance of discriminative
Figure 7: All 49 cases in which the network guessed right but had a second guess whose probability was within 0.3 of the probability of the best guess. The true classes are arranged in standard scan order.
neural networks from 1.5% to 0.95%. There is no obvious reason why weight sharing and subsampling cannot be used to reduce the error rate of the generative model, and we are currently investigating this approach. Further improvements can always be achieved by averaging the opinions of multiple networks, but this technique is available to all methods. Substantial reductions in the error rate can be achieved by supplementing the data set with slightly transformed versions of the training data. Using one- and two-pixel translations, Decoste and Schoelkopf (2002) achieve 0.56%. Using local elastic deformations in a convolutional neural network, Simard, Steinkraus, and Platt (2003) achieve 0.4%, which is slightly better than the 0.63% achieved by the best hand-coded recognition algorithm (Belongie, Malik, & Puzicha, 2002). We have not yet explored the use of distorted data for learning generative models because many types of distortion need to be investigated, and the fine-tuning algorithm is currently too slow.

6.2 Testing the Network.

One way to test the network is to use a stochastic up-pass from the image to fix the binary states of the 500 units in the lower layer of the associative memory. With these states fixed, the label units are given initial real-valued activities of 0.1, and a few iterations of alternating Gibbs sampling are then used to activate the correct label unit. This method of testing gives error rates that are almost 1% higher than the rates reported above.
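This Gibbs-based test procedure can be sketched as follows. The weights here are random placeholders rather than a trained network, and the softmax reconstruction of the label group uses probabilities rather than a single sampled winner; everything else (clamped lower-layer states, label activities initialized to 0.1, a few rounds of alternating Gibbs sampling) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())   # numerically stable form of eq. 6.1
    return e / e.sum()

n_low, n_top, n_lab = 500, 2000, 10
W_low = 0.01 * rng.normal(size=(n_low, n_top))   # lower layer <-> top layer
W_lab = 0.01 * rng.normal(size=(n_lab, n_top))   # label group <-> top layer

low = (rng.random(n_low) < 0.5) * 1.0            # clamped by the up-pass
labels = np.full(n_lab, 0.1)                     # initial label activities
for _ in range(5):                               # a few Gibbs iterations
    p_top = sigmoid(low @ W_low + labels @ W_lab)
    top = (rng.random(n_top) < p_top) * 1.0
    labels = softmax(W_lab @ top)                # softmax label reconstruction

print(int(labels.argmax()))                      # the network's guess
```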
Table 1: Error Rates of Various Learning Algorithms on the MNIST Digit Recognition Task.

Version of MNIST Task | Learning Algorithm | Test Error %
Permutation invariant | Our generative model: 784 → 500 → 500 ↔ 2000 ↔ 10 | 1.25
Permutation invariant | Support vector machine: degree 9 polynomial kernel | 1.4
Permutation invariant | Backprop: 784 → 500 → 300 → 10, cross-entropy and weight-decay | 1.51
Permutation invariant | Backprop: 784 → 800 → 10, cross-entropy and early stopping | 1.53
Permutation invariant | Backprop: 784 → 500 → 150 → 10, squared error and on-line updates | 2.95
Permutation invariant | Nearest neighbor: all 60,000 examples and L3 norm | 2.8
Permutation invariant | Nearest neighbor: all 60,000 examples and L2 norm | 3.1
Permutation invariant | Nearest neighbor: 20,000 examples and L3 norm | 4.0
Permutation invariant | Nearest neighbor: 20,000 examples and L2 norm | 4.4
Unpermuted images; extra data from elastic deformations | Backprop: cross-entropy and early-stopping convolutional neural net | 0.4
Unpermuted de-skewed images; extra data from 2 pixel translations | Virtual SVM: degree 9 polynomial kernel | 0.56
Unpermuted images | Shape-context features: hand-coded matching | 0.63
Unpermuted images; extra data from affine transformations | Backprop in LeNet5: convolutional neural net | 0.8
Unpermuted images | Backprop in LeNet5: convolutional neural net | 0.95
A better method is to first fix the binary states of the 500 units in the lower layer of the associative memory and to then turn on each of the label units in turn and compute the exact free energy of the resulting 510-component binary vector. Almost all the computation required is independent of which label unit is turned on (Teh & Hinton, 2001), and this method computes the exact conditional equilibrium distribution over labels instead of approximating it by Gibbs sampling, which is what the previous method is doing. This method gives error rates that are about 0.5% higher than the ones quoted because of the stochastic decisions made in the up-pass. We can remove this noise in two ways. The simpler is to make the up-pass deterministic by using probabilities of activation in place of
A Fast Learning Algorithm for Deep Belief Nets
Figure 8: Each row shows 10 samples from the generative model with a particular label clamped on. The top-level associative memory is run for 1000 iterations of alternating Gibbs sampling between samples.
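The clamped-label sampling procedure used to produce such figures can be sketched as follows: clamp the label units to a class, run alternating Gibbs sampling in the top-level associative memory, then generate an image by a single down-pass through the generative connections. This is an illustrative NumPy sketch, not the authors' code; the weight matrices are placeholders and the generative biases below the top layer are omitted for brevity.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_with_clamped_label(label, pentop, labtop, topbiases,
                                down_weights, n_gibbs=1000, rng=None):
    """Sample from the model's class-conditional distribution: run
    alternating Gibbs sampling in the top-level associative memory with
    the label units clamped, then perform a single down-pass through
    the (hypothetical) generative weight matrices."""
    rng = np.random.default_rng() if rng is None else rng
    pen = (rng.random(pentop.shape[0]) < 0.5).astype(float)  # arbitrary start
    for _ in range(n_gibbs):                                 # alternating Gibbs
        top_p = logistic(pen @ pentop + label @ labtop + topbiases)
        top = (top_p > rng.random(top_p.shape)).astype(float)
        pen_p = logistic(top @ pentop.T)                     # label stays clamped
        pen = (pen_p > rng.random(pen_p.shape)).astype(float)
    state = pen                                              # single down-pass
    for W in down_weights:
        state = logistic(state @ W)
    return state                                             # pixel probabilities
```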
stochastic binary states. The second is to repeat the stochastic up-pass 20 times and average either the label probabilities or the label log probabilities over the 20 repetitions before picking the best one. The two types of average give almost identical results, and these results are also very similar to using a single deterministic up-pass, which was the method used for the reported results. 7 Looking into the Mind of a Neural Network To generate samples from the model, we perform alternating Gibbs sampling in the top-level associative memory until the Markov chain converges to the equilibrium distribution. Then we use a sample from this distribution as input to the layers below and generate an image by a single down-pass through the generative connections. If we clamp the label units to a particular class during the Gibbs sampling, we can see images from the model’s class-conditional distributions. Figure 8 shows a sequence of images for each class that were generated by allowing 1000 iterations of Gibbs sampling between samples. We can also initialize the state of the top two layers by providing a random binary image as input. Figure 9 shows how the class-conditional state of the associative memory then evolves when it is allowed to run freely, but with the label clamped. This internal state is “observed” by performing a down-pass every 20 iterations to see what the associative memory has
Figure 9: Each row shows 10 samples from the generative model with a particular label clamped on. The top-level associative memory is initialized by an up-pass from a random binary image in which each pixel is on with a probability of 0.5. The first column shows the results of a down-pass from this initial highlevel state. Subsequent columns are produced by 20 iterations of alternating Gibbs sampling in the associative memory.
in mind. This use of the word mind is not intended to be metaphorical. We believe that a mental state is the state of a hypothetical, external world in which a high-level internal representation would constitute veridical perception. That hypothetical world is what the figure shows. 8 Conclusion We have shown that it is possible to learn a deep, densely connected belief network one layer at a time. The obvious way to do this is to assume that the higher layers do not exist when learning the lower layers, but this is not compatible with the use of simple factorial approximations to replace the intractable posterior distribution. For these approximations to work well, we need the true posterior to be as close to factorial as possible. So instead of ignoring the higher layers, we assume that they exist but have tied weights that are constrained to implement a complementary prior that makes the true posterior exactly factorial. This is equivalent to having an undirected model that can be learned efficiently using contrastive divergence. It can also be viewed as constrained variational learning because a penalty term—the divergence between the approximate and true
posteriors—has been replaced by the constraint that the prior must make the variational approximation exact. After each layer has been learned, its weights are untied from the weights in higher layers. As these higher-level weights change, the priors for lower layers cease to be complementary, so the true posterior distributions in lower layers are no longer factorial, and the use of the transpose of the generative weights for inference is no longer correct. Nevertheless, we can use a variational bound to show that adapting the higher-level weights improves the overall generative model. To demonstrate the power of our fast, greedy learning algorithm, we used it to initialize the weights for a much slower fine-tuning algorithm that learns an excellent generative model of digit images and their labels. It is not clear that this is the best way to use the fast, greedy algorithm. It might be better to omit the fine-tuning and use the speed of the greedy algorithm to learn an ensemble of larger, deeper networks or a much larger training set. The network in Figure 1 has about as many parameters as 0.002 cubic millimeters of mouse cortex (Horace Barlow, personal communication, 1999), and several hundred networks of this complexity could fit within a single voxel of a high-resolution fMRI scan. This suggests that much bigger networks may be required to compete with human shape recognition abilities. Our current generative model is limited in many ways (Lee & Mumford, 2003). It is designed for images in which nonbinary values can be treated as probabilities (which is not the case for natural images); its use of top-down feedback during perception is limited to the associative memory in the top two layers; it does not have a systematic way of dealing with perceptual invariances; it assumes that segmentation has already been performed; and it does not learn to sequentially attend to the most informative parts of objects when discrimination is difficult. 
It does, however, illustrate some of the major advantages of generative models as compared to discriminative ones:
- Generative models can learn low-level features without requiring feedback from the label, and they can learn many more parameters than discriminative models without overfitting.
- In discriminative learning, each training case constrains the parameters only by as many bits of information as are required to specify the label. For a generative model, each training case constrains the parameters by the number of bits required to specify the input.
- It is easy to see what the network has learned by generating from its model. It is possible to interpret the nonlinear, distributed representations in the deep hidden layers by generating images from them.
- The superior classification performance of discriminative learning methods holds only for domains in which it is not possible to learn a good generative model. This set of domains is being eroded by Moore's law.
Appendix A: Complementary Priors

A.1 General Complementarity. Consider a joint distribution over observables, x, and hidden variables, y. For a given likelihood function, P(x|y), we define the corresponding family of complementary priors to be those distributions, P(y), for which the joint distribution, P(x, y) = P(x|y)P(y), leads to posteriors, P(y|x), that exactly factorize, that is, leads to a posterior that can be expressed as P(y|x) = Π_j P(y_j|x). Not all functional forms of likelihood admit a complementary prior. In this appendix, we show that the following family constitutes all likelihood functions admitting a complementary prior,

P(x|y) = (1/Ω(y)) exp( Σ_j Φ_j(x, y_j) + β(x) ) = exp( Σ_j Φ_j(x, y_j) + β(x) − log Ω(y) ),    (A.1)

where Ω(y) is the normalization term. For this assertion to hold, we need to assume positivity of distributions: that both P(y) > 0 and P(x|y) > 0 for every value of y and x. The corresponding family of complementary priors then assumes the form

P(y) = (1/C) exp( log Ω(y) + Σ_j α_j(y_j) ),    (A.2)

where C is a constant to ensure normalization. This combination of functional forms leads to the following expression for the joint,

P(x, y) = (1/C) exp( Σ_j Φ_j(x, y_j) + β(x) + Σ_j α_j(y_j) ).    (A.3)
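One consequence worth making concrete: with pairwise terms Φ_j(x, y_j) = (xᵀW)_j y_j over binary variables (the restricted-Boltzmann-machine form, used here purely as an illustrative special case of equation A.1), the posterior P(y|x) implied by the joint in equation A.3 is exactly a product of independent logistic units. This can be checked by brute-force enumeration on a toy model:

```python
import itertools
import numpy as np

def check_factorial_posterior(W, beta, alpha):
    """Enumerate the binary joint P(x, y) proportional to
    exp(sum_j Phi_j(x, y_j) + beta(x) + sum_j alpha_j(y_j)) with
    Phi_j(x, y_j) = (x^T W)_j y_j and beta(x) = beta^T x, and verify that
    the exact posterior P(y|x) equals the factorial prediction
    prod_j P(y_j|x), where P(y_j = 1|x) = logistic((x^T W)_j + alpha_j)."""
    nx, ny = W.shape
    xs = [np.array(x, float) for x in itertools.product([0, 1], repeat=nx)]
    ys = [np.array(y, float) for y in itertools.product([0, 1], repeat=ny)]
    for x in xs:
        # exact conditional by enumeration (beta(x) cancels in the conditional)
        w = np.array([np.exp((x @ W) @ y + beta @ x + alpha @ y) for y in ys])
        p = w / w.sum()
        q1 = 1.0 / (1.0 + np.exp(-(x @ W + alpha)))   # per-unit posteriors
        q = np.array([np.prod(np.where(y == 1, q1, 1 - q1)) for y in ys])
        if not np.allclose(p, q):
            return False
    return True
```

For non-pairwise likelihood terms the same check would fail, which is the content of the assertion that equation A.1 characterizes the likelihoods admitting a complementary prior.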
To prove our assertion, we need to show that every likelihood function of form equation A.1 admits a complementary prior and vice versa. First, it can be directly verified that equation A.2 is a complementary prior for the likelihood functions of equation A.1. To show the converse, let us assume that P(y) is a complementary prior for some likelihood function P(x|y). Notice that the factorial form of the posterior simply means that the
joint distribution P(x, y) = P(y)P(x|y) satisfies the following set of conditional independencies: y_j ⊥⊥ y_k | x for every j ≠ k. This set of conditional independencies corresponds exactly to the relations satisfied by an undirected graphical model having edges between every hidden and observed variable and among all observed variables. By the Hammersley-Clifford theorem and using our positivity assumption, the joint distribution must be of the form of equation A.3, and the forms for the likelihood function equation A.1 and prior equation A.2 follow from this.

A.2 Complementarity for Infinite Stacks. We now consider a subset of models of the form in equation A.3 for which the likelihood also factorizes. This means that we now have two sets of conditional independencies:

P(x|y) = Π_i P(x_i|y)    (A.4)

P(y|x) = Π_j P(y_j|x).    (A.5)
This condition is useful for our construction of the infinite stack of directed graphical models. Identifying the conditional independencies in equations A.4 and A.5 as those satisfied by a complete bipartite undirected graphical model, and again using the Hammersley-Clifford theorem (assuming positivity), we see that the following form fully characterizes all joint distributions of interest,

P(x, y) = (1/Z) exp( Σ_{i,j} Ψ_{i,j}(x_i, y_j) + Σ_i γ_i(x_i) + Σ_j α_j(y_j) ),    (A.6)

while the likelihood functions take on the form

P(x|y) = exp( Σ_{i,j} Ψ_{i,j}(x_i, y_j) + Σ_i γ_i(x_i) − log Ω(y) ).    (A.7)
Although it is not immediately obvious, the marginal distribution over the observables, x, in equation A.6 can also be expressed as an infinite directed model in which the parameters defining the conditional distributions between layers are tied together. An intuitive way of validating this assertion is as follows. Consider one of the methods by which we might draw samples from the marginal distribution P(x) implied by equation A.6. Starting from an arbitrary configuration of y, we would iteratively perform Gibbs sampling using, in alternation, the distributions given in equations A.4 and A.5. If we run this Markov chain for long enough, then, under the mild assumption that the chain
mixes properly, we will eventually obtain unbiased samples from the joint distribution given in equation A.6. Now let us imagine that we unroll this sequence of Gibbs updates in space, such that we consider each parallel update of the variables to constitute states of a separate layer in a graph. This unrolled sequence of states has a purely directed structure (with conditional distributions taking the form of equations A.4 and A.5 in alternation). By equivalence to the Gibbs sampling scheme, after many layers in such an unrolled graph, adjacent pairs of layers will have a joint distribution as given in equation A.6.

We can formalize the above intuition for unrolling the graph as follows. The basic idea is to unroll the graph "upwards" (i.e., moving away from the data), so that we can put a well-defined distribution over the infinite stack of variables. Then we verify some simple marginal and conditional properties of the joint distribution and thus demonstrate the required properties of the graph in the "downwards" direction.

Let x = x^(0), y = y^(0), x^(1), y^(1), x^(2), y^(2), . . . be a sequence (stack) of variables, the first two of which are identified as our original observed and hidden variables. Define the functions

f(x′, y′) = (1/Z) exp( Σ_{i,j} Ψ_{i,j}(x′_i, y′_j) + Σ_i γ_i(x′_i) + Σ_j α_j(y′_j) )    (A.8)

f_x(x′) = Σ_{y′} f(x′, y′)    (A.9)

f_y(y′) = Σ_{x′} f(x′, y′)    (A.10)

g_x(x′|y′) = f(x′, y′) / f_y(y′)    (A.11)

g_y(y′|x′) = f(x′, y′) / f_x(x′),    (A.12)

and define a joint distribution over our sequence of variables as follows:

P(x^(0), y^(0)) = f(x^(0), y^(0))    (A.13)

P(x^(i) | y^(i−1)) = g_x(x^(i) | y^(i−1)),   i = 1, 2, . . .    (A.14)

P(y^(i) | x^(i)) = g_y(y^(i) | x^(i)),   i = 1, 2, . . .    (A.15)
We verify by induction that the distribution has the following marginal distributions:

P(x^(i)) = f_x(x^(i)),   i = 0, 1, 2, . . .    (A.16)

P(y^(i)) = f_y(y^(i)),   i = 0, 1, 2, . . .    (A.17)
For i = 0 this is given by definition of the distribution in equation A.13. For i > 0, we have

P(x^(i)) = Σ_{y^(i−1)} P(x^(i) | y^(i−1)) P(y^(i−1)) = Σ_{y^(i−1)} [ f(x^(i), y^(i−1)) / f_y(y^(i−1)) ] f_y(y^(i−1)) = f_x(x^(i)),    (A.18)

and similarly for P(y^(i)). Now we see that the following conditional distributions also hold true:

P(x^(i) | y^(i)) = P(x^(i), y^(i)) / P(y^(i)) = g_x(x^(i) | y^(i))    (A.19)

P(y^(i) | x^(i+1)) = P(y^(i), x^(i+1)) / P(x^(i+1)) = g_y(y^(i) | x^(i+1)).    (A.20)
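The induction step of equation A.18 can be verified mechanically on a toy binary model: construct f from equation A.8 (taking pairwise Ψ_{i,j}(x_i, y_j) = Ψ_{i,j} x_i y_j as an illustrative special case), form f_x, f_y, and g_x from equations A.9 to A.11, and confirm that propagating f_y through g_x recovers f_x:

```python
import itertools
import numpy as np

def check_marginal_preserved(Psi, gamma, alpha):
    """For f(x, y) proportional to
    exp(sum_ij Psi_ij x_i y_j + sum_i gamma_i x_i + sum_j alpha_j y_j)
    (equation A.8 with pairwise Psi), verify the induction step A.18:
    sum_y g_x(x|y) f_y(y) = f_x(x) for every binary x."""
    nx, ny = Psi.shape
    xs = [np.array(x, float) for x in itertools.product([0, 1], repeat=nx)]
    ys = [np.array(y, float) for y in itertools.product([0, 1], repeat=ny)]
    f = np.array([[np.exp(x @ Psi @ y + gamma @ x + alpha @ y) for y in ys]
                  for x in xs])
    f /= f.sum()                 # 1/Z normalization
    f_x = f.sum(axis=1)          # equation A.9
    f_y = f.sum(axis=0)          # equation A.10
    g_x = f / f_y                # equation A.11, one column per y
    recon = g_x @ f_y            # sum_y g_x(x|y) f_y(y)
    return np.allclose(recon, f_x)
```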
So our joint distribution over the stack of variables also leads to the appropriate conditional distributions for the unrolled graph in the "downwards" direction. Inference in this infinite graph is equivalent to inference in the joint distribution over the sequence of variables; that is, given x^(0), we can obtain a sample from the posterior simply by sampling y^(0)|x^(0), x^(1)|y^(0), y^(1)|x^(1), . . . . This directly shows that our inference procedure is exact for the unrolled graph.

Appendix B: Pseudocode for Up-Down Algorithm

We now present MATLAB-style pseudocode for an implementation of the up-down algorithm described in section 5 and used for back-fitting. (This method is a contrastive version of the wake-sleep algorithm; Hinton et al., 1995.) The code outlined below assumes a network of the type shown in Figure 1 with visible inputs, label nodes, and three layers of hidden units. Before applying the up-down algorithm, we would first perform layer-wise greedy training as described in sections 3 and 4.

% UP-DOWN ALGORITHM
%
% the data and all biases are row vectors.
% the generative model is: lab <--> top <--> pen --> hid --> vis
% the number of units in layer foo is numfoo
% weight matrices have names fromlayer tolayer
% "rec" is for recognition biases and "gen" is for generative biases.
% for simplicity, the same learning rate, r, is used everywhere.
% PERFORM A BOTTOM-UP PASS TO GET WAKE/POSITIVE PHASE
% PROBABILITIES AND SAMPLE STATES
wakehidprobs = logistic(data*vishid + hidrecbiases);
wakehidstates = wakehidprobs > rand(1, numhid);
wakepenprobs = logistic(wakehidstates*hidpen + penrecbiases);
wakepenstates = wakepenprobs > rand(1, numpen);
waketopprobs = logistic(wakepenstates*pentop + targets*labtop + topbiases);
waketopstates = waketopprobs > rand(1, numtop);
% POSITIVE PHASE STATISTICS FOR CONTRASTIVE DIVERGENCE
poslabtopstatistics = targets' * waketopstates;
pospentopstatistics = wakepenstates' * waketopstates;

% PERFORM numCDiters GIBBS SAMPLING ITERATIONS USING THE TOP LEVEL
% UNDIRECTED ASSOCIATIVE MEMORY
negtopstates = waketopstates; % to initialize loop
for iter=1:numCDiters
  negpenprobs = logistic(negtopstates*pentop' + pengenbiases);
  negpenstates = negpenprobs > rand(1, numpen);
  neglabprobs = softmax(negtopstates*labtop' + labgenbiases);
  negtopprobs = logistic(negpenstates*pentop + neglabprobs*labtop + topbiases);
  negtopstates = negtopprobs > rand(1, numtop);
end;

% NEGATIVE PHASE STATISTICS FOR CONTRASTIVE DIVERGENCE
negpentopstatistics = negpenstates'*negtopstates;
neglabtopstatistics = neglabprobs'*negtopstates;
% STARTING FROM THE END OF THE GIBBS SAMPLING RUN, PERFORM A
% TOP-DOWN GENERATIVE PASS TO GET SLEEP/NEGATIVE PHASE
% PROBABILITIES AND SAMPLE STATES
sleeppenstates = negpenstates;
sleephidprobs = logistic(sleeppenstates*penhid + hidgenbiases);
sleephidstates = sleephidprobs > rand(1, numhid);
sleepvisprobs = logistic(sleephidstates*hidvis + visgenbiases);
% PREDICTIONS
psleeppenstates = logistic(sleephidstates*hidpen + penrecbiases);
psleephidstates = logistic(sleepvisprobs*vishid + hidrecbiases);
pvisprobs = logistic(wakehidstates*hidvis + visgenbiases);
phidprobs = logistic(wakepenstates*penhid + hidgenbiases);

% UPDATES TO GENERATIVE PARAMETERS
hidvis = hidvis + r*wakehidstates'*(data-pvisprobs);
visgenbiases = visgenbiases + r*(data - pvisprobs);
penhid = penhid + r*wakepenstates'*(wakehidstates-phidprobs);
hidgenbiases = hidgenbiases + r*(wakehidstates - phidprobs);
% UPDATES TO TOP LEVEL ASSOCIATIVE MEMORY PARAMETERS
labtop = labtop + r*(poslabtopstatistics - neglabtopstatistics);
labgenbiases = labgenbiases + r*(targets - neglabprobs);
pentop = pentop + r*(pospentopstatistics - negpentopstatistics);
pengenbiases = pengenbiases + r*(wakepenstates - negpenstates);
topbiases = topbiases + r*(waketopstates - negtopstates);

% UPDATES TO RECOGNITION/INFERENCE APPROXIMATION PARAMETERS
hidpen = hidpen + r*(sleephidstates'*(sleeppenstates - psleeppenstates));
penrecbiases = penrecbiases + r*(sleeppenstates - psleeppenstates);
vishid = vishid + r*(sleepvisprobs'*(sleephidstates - psleephidstates));
hidrecbiases = hidrecbiases + r*(sleephidstates - psleephidstates);

Acknowledgments

We thank Peter Dayan, Zoubin Ghahramani, Yann Le Cun, Andriy Mnih, Radford Neal, Terry Sejnowski, and Max Welling for helpful discussions and the referees for greatly improving the manuscript. The research was supported by NSERC, the Gatsby Charitable Foundation, CFI, and OIT. G.E.H. is a fellow of the Canadian Institute for Advanced Research and holds a Canada Research Chair in machine learning.

References

Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4), 509–522.
Carreira-Perpinan, M. A., & Hinton, G. E. (2005). On contrastive divergence learning. In R. G. Cowell & Z. Ghahramani (Eds.), Artificial Intelligence and Statistics, 2005 (pp. 33–41). Fort Lauderdale, FL: Society for Artificial Intelligence and Statistics.
Decoste, D., & Schoelkopf, B. (2002). Training invariant support vector machines. Machine Learning, 46, 161–190.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121(2), 256–285.
Friedman, J., & Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association, 76, 817–823.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20, 1434–1448.
Marks, T. K., & Movellan, J. R. (2001). Diffusion networks, product of experts, and factor analysis. In T. W. Lee, T.-P. Jung, S. Makeig, & T. J. Sejnowski (Eds.), Proc. Int. Conf. on Independent Component Analysis (pp. 481–485). San Diego, CA.
Mayraz, G., & Hinton, G. E. (2001). Recognizing hand-written digits using hierarchical products of experts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 189–197.
Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56, 71–113.
Neal, R. M., & Hinton, G. E. (1998). A new view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan (Ed.), Learning in graphical models (pp. 355–368). Norwell, MA: Kluwer.
Ning, F., Delhomme, D., LeCun, Y., Piano, F., Bottou, L., & Barbano, P. (2005). Toward automatic phenotyping of developing embryos from videos. IEEE Transactions on Image Processing, 14(9), 1360–1371.
Roth, S., & Black, M. J. (2005). Fields of experts: A framework for learning image priors. In IEEE Conf. on Computer Vision and Pattern Recognition (pp. 860–867). Piscataway, NJ: IEEE.
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2(6), 459–473.
Simard, P. Y., Steinkraus, D., & Platt, J. (2003). Best practice for convolutional neural networks applied to visual document analysis. In International Conference on Document Analysis and Recognition (ICDAR) (pp. 958–962). Los Alamitos, CA: IEEE Computer Society.
Teh, Y., & Hinton, G. E. (2001). Rate-coded restricted Boltzmann machines for face recognition. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 908–914). Cambridge, MA: MIT Press.
Teh, Y., Welling, M., Osindero, S., & Hinton, G. E. (2003). Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4, 1235–1260.
Welling, M., Hinton, G., & Osindero, S. (2003). Learning sparse topographic representations with products of Student-t distributions. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 1359–1366). Cambridge, MA: MIT Press.
Welling, M., Rosen-Zvi, M., & Hinton, G. E. (2005). Exponential family harmoniums with an application to information retrieval. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1481–1488). Cambridge, MA: MIT Press.
Received June 8, 2005; accepted November 8, 2005.
LETTER
Communicated by Kechen Zhang
Optimal Tuning Widths in Population Coding of Periodic Variables Marcelo A. Montemurro [email protected]
Stefano Panzeri [email protected] Faculty of Life Sciences, University of Manchester, Manchester M60 1QD, U.K.
We study the relationship between the accuracy of a large neuronal population in encoding periodic sensory stimuli and the width of the tuning curves of individual neurons in the population. By using general simple models of population activity, we show that when considering one or two periodic stimulus features, a narrow tuning width provides better population encoding accuracy. When encoding more than two periodic stimulus features, the information conveyed by the population is instead maximal for finite values of the tuning width. These optimal values are only weakly dependent on model parameters and are similar to the width of tuning to orientation or motion direction of real visual cortical neurons. A very large tuning width leads to poor encoding accuracy, whatever the number of stimulus features encoded. Thus, optimal coding of periodic stimuli is different from that of nonperiodic stimuli, which, as shown in previous studies, would require infinitely large tuning widths when coding more than two stimulus features. 1 Introduction The width of the tuning curves of individual neurons to sensory stimuli plays an important role in determining the nature of a neuronal population code (Pouget, Deneve, Ducom, & Latham, 1999; Zhang & Sejnowski, 1999). On the one hand, narrow tuning makes single neurons highly informative about a specific range of stimulus values. On the other hand, coarse tuning increases the size of the population activated by the stimulus, but at the price of making individual neurons less precisely tuned. An important question is whether there is an optimal value of tuning width that allows a most effective trade-off between these two partly conflicting requirements of high encoding accuracy by single neurons and engagement of a large population. 
When considering the encoding of nonperiodic stimulus variables, such as the Cartesian coordinates of position in space, Zhang and Sejnowski (1999) demonstrated that under very general conditions, the information about the stimulus conveyed by the population Neural Computation 18, 1555–1576 (2006)
C 2006 Massachusetts Institute of Technology
scales as σ^(D−2), where σ stands for the width of the tuning curve and D is equal to the number of encoded stimulus features. Thus, extremely narrow tuning curves are better for encoding one nonperiodic stimulus feature, whereas extremely coarse tuning curves are better for encoding more than two nonperiodic features (Zhang & Sejnowski, 1999). However, many important stimulus variables are described by periodic quantities. Examples of such variables are the direction of motion and the orientation of a visual stimulus. Thus, it is crucial to investigate how the population accuracy in encoding periodic stimuli depends on the tuning width.

Here, we address this problem, and we find that under general conditions and in marked contrast with the case of nonperiodic stimuli, the population accuracy in encoding periodic stimuli decreases quickly for large tuning widths, whatever the stimulus dimensionality. For stimulus dimensions D > 2, there is a finite optimal value of the tuning curve width for which the population conveys maximal information. If D is in the range 2 to 6, the optimal tuning widths predicted by the model are similar in magnitude to those observed in visual cortex. This suggests that the tuning widths of visual cortical neurons are efficient at transmitting information about a small number of periodic stimulus features.

2 Model of Encoding Periodic Variables

We consider a population made up of a large number N of neurons. The response of each individual neuron is quantified by the number of spikes fired in a certain poststimulus time window. Thus, the overall neuronal population response is represented as an N-dimensional spike count vector. We assume that the neurons are tuned to a small number D of periodic stimulus variables, such as the orientation or the direction of motion of a visual object. The stimulus variable will be described as an angular vector θ = (θ_1, . . . , θ_D) of dimension D.
A real cortical neuron may also encode sensory variables that are not periodic, such as retinal position or speed of motion. However, for simplicity here, we will focus entirely on periodic variables. The neuronal tuning curves, which quantify the mean spike count of each neuron to the presented D-dimensional stimulus, are all taken to be identical in shape but having their maxima at different angles. Thus, each neuron is uniquely identified by its preferred angle φ. For concreteness, we choose the following multidimensional circular normal form for the tuning function,

f(θ − φ) ≡ b + f_0(θ − φ) = b + m Π_{i=1}^{D} exp( [cos(ν(θ_i − φ_i)) − 1] / (νσ)² ),    (2.1)
where b is the baseline (nonstimulus-induced) firing. The stimulus-modulated part of the tuning curve f_0(θ − φ) depends on σ, the width of tuning, and on m, the response modulation. As in Zhang and Sejnowski (1999), we assume that the preferred angles of the neurons are distributed across the population uniformly over the D-dimensional cuboid region [−π/ν, π/ν]^D. The parameter ν sets the period of the tuning function, which is equal to (2π)/ν. For example, ν = 1 corresponds to a period equal to 2π and would describe a motion direction angle, whereas ν = 2 corresponds to a period equal to π and would describe an orientation angle. For simplicity, we assume that different stimulus dimensions are mapped in a separable way. Thus, the stimulus-dependent part of the multidimensional tuning curve in equation 2.1 is written as a product of a one-dimensional circular normal function over the different stimulus dimensions:

g(θ_i − φ_i) = exp( [cos(ν(θ_i − φ_i)) − 1] / (νσ)² ).    (2.2)
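A direct implementation of equations 2.1 and 2.2 makes two properties of this tuning model easy to confirm numerically: the tuning factor is genuinely periodic with period 2π/ν, and for small σ it approaches a gaussian, since cos(x) − 1 ≈ −x²/2. The sketch below works in radians; the parameter values used in the checks are illustrative, not taken from the paper.

```python
import numpy as np

def circular_normal(theta, phi, sigma, nu=1.0):
    """One-dimensional circular normal tuning factor, equation 2.2:
    g(theta - phi) = exp([cos(nu*(theta - phi)) - 1] / (nu*sigma)^2).
    Angles in radians."""
    return np.exp((np.cos(nu * (theta - phi)) - 1.0) / (nu * sigma) ** 2)

def tuning_curve(theta, phi, b, m, sigma, nu=1.0):
    """Multidimensional tuning function, equation 2.1: baseline b plus
    modulation m times the product of one-dimensional circular normal
    factors over the D stimulus dimensions."""
    theta, phi = np.atleast_1d(theta), np.atleast_1d(phi)
    return b + m * np.prod(circular_normal(theta, phi, sigma, nu))
```

For ν = 2 (orientation) the period is π, and for σ well below one radian the factor is numerically indistinguishable from exp(−(θ − φ)²/(2σ²)).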
This tuning function has been used in neural coding models (see, e.g., Pouget, Zhang, Deneve, & Latham, 1998 and Sompolinsky, Yoon, Kang, & Shamir, 2001). It was chosen here because, unlike the most commonly used gaussian models, it is a genuinely periodic function of the stimulus, and it accurately fits experimental tuning curves for both orientation-sensitive (Swindale, 1998) and direction-sensitive (Kohn & Movshon, 2004) neurons in primary visual cortex. In Figure 1 we plot two examples of the one-dimensional circular normal function in equation 2.2, compared with their respective gaussian approximations. If σ is smaller than ≈20°, the circular normal function closely resembles a gaussian tuning curve (see Figure 1). For tuning widths much larger than ≈20°, the shape of the circular tuning functions differs substantially from the gaussian. The above assumption that the multidimensional tuning curve is just a product of one-dimensional tuning curves of individual features has been used extensively in population coding models. In addition to being mathematically convenient for its simplicity, the multiplicative form of the multidimensional tuning function describes well the tuning of neurons in higher visual cortical areas to, for example, complex multidimensional rotations (Logothetis, Pauls, & Poggio, 1995) or to local features describing complex boundary configurations (Pasupathy & Connor, 2001). We assume that the neurons in the population are uncorrelated at fixed stimulus. Thus, the stimulus-conditional probability of population response P(r|θ) is a product of the spike count distribution of each individual neuron. Although it is a simplification, this assumption is useful because it makes our model mathematically tractable, and it is sufficient to account in most cases for the majority of information transmitted by real neuronal populations (Nirenberg, Carcieri, Jacobs, & Latham, 2001; Petersen,
[Figure 1 plot: firing rate (Hz) versus θ − φ (deg), with curves for tuning widths σ = 20° and σ = 40°.]

Figure 1: Comparison of periodic and nonperiodic tuning functions for an orientation-selective neuron coding one stimulus variable. The curves show mean firing rates as a function of the difference between the presented and the preferred stimulus of the neuron. Solid lines: periodic tuning function given by equation 2.2 for ν = 2 (orientation selectivity) and for two values of the tuning width σ. Dashed lines: nonperiodic (gaussian) tuning function for the same tuning widths. The difference between the periodic and nonperiodic tuning functions becomes apparent for large tuning widths and for angles away from the preferred one.
Panzeri, & Diamond, 2001; Oram, Hatsopoulos, Richmond, & Donoghue, 2001). However, in section 8 we introduce a specific model that takes into account cross-neuronal correlations and demonstrate that the conclusions obtained with the uncorrelated assumption are still valid in that correlated case.

Following Zhang and Sejnowski (1999), we choose a general model of the activity of the single neuron in the population by requiring that the probability that the neuron with preferred angle φ emits r spikes in response to stimulus θ is an arbitrary function of the mean spike count only:

P(r|θ) = S(r, f(θ − φ)).    (2.3)
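A simple instance of equation 2.3 is a Poisson spike count with mean f(θ − φ): the response distribution then depends on the stimulus only through the mean count, which is exactly the requirement above. A minimal sketch for D = 1, with angles in radians and illustrative tuning parameters (not values from the paper):

```python
import numpy as np

def poisson_response(theta, phi, b=5.0, m=45.0, sigma=0.35, nu=2.0,
                     n_trials=1, rng=None):
    """Single-neuron responses P(r|theta) = S(r, f(theta - phi)) with S
    Poisson: the spike count depends on the stimulus only through the
    mean rate f(theta - phi) of equation 2.1 (D = 1 case)."""
    rng = np.random.default_rng() if rng is None else rng
    f = b + m * np.exp((np.cos(nu * (theta - phi)) - 1.0) / (nu * sigma) ** 2)
    return rng.poisson(f, size=n_trials)
```

At the preferred angle (θ = φ) the mean count equals b + m, which a large sample of draws should reproduce.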
In this article, some specific cases of single-neuron models that satisfy this assumption are studied in detail. We shall also derive scaling rules of the encoding efficiency as a function of σ and D that are valid for any model of single-neuron firing that satisfies equation 2.3.

3 Fisher Information

The ability of the population to encode accurately a particular stimulus value can be quantified by means of Fisher information (Cover & Thomas, 1991). When the stimulus is a D-dimensional periodic variable, Fisher information is a D × D matrix, J, whose elements i, j are measured in units of deg⁻² and are defined as follows:

J_{i,j}(θ) = − ∫ dr P(r|θ) [ ∂² log P(r|θ) / ∂θ_i ∂θ_j ].    (3.1)
Fisher information provides a good measure of stimulus encoding because it sets a limit on the accuracy with which a particular stimulus value can be reconstructed from a single observation of the neuronal population activity. In fact, it satisfies the following generalized Cramér-Rao matrix inequality (Cover & Thomas, 1991),

    Σ ≥ J⁻¹,   (3.2)

where Σ is the covariance matrix of the D-dimensional error made by any unbiased estimation method reconstructing the stimulus from the neuronal population activity. Since the neurons fire independently, the population Fisher information is simply given by the sum over all neurons of the single-neuron Fisher information (Cover & Thomas, 1991). The latter, denoted J^(neuron)_{i,j}(θ − φ), has the following expression:

    J^(neuron)_{i,j}(θ − φ) = −∫ dr S(r, f(θ − φ)) ∂² log S(r, f(θ − φ))/(∂θ_i ∂θ_j).   (3.3)
By computing explicitly the derivatives in the above expression, one can rewrite the single-neuron Fisher information in equation 3.3 as follows:

    J^(neuron)_{i,j}(θ − φ) = f₀²(θ − φ) [∫ dr S′(r, f(θ − φ))²/S(r, f(θ − φ))] × sin(ν(θ_i − φ_i)) sin(ν(θ_j − φ_j))/(ν²σ⁴),   (3.4)
where S′(r, z) = ∂S(r, z)/∂z and f₀(θ − φ) ≡ f(θ − φ) − b is the stimulus-modulated part of the tuning function; we require the integral over responses in the above equation to be convergent, so that the single-neuron Fisher information is finite. Since the number of neurons in the population is assumed to be large and since a neuron is uniquely identified by the center φ of its tuning curve, we can compute the population Fisher information by approximating the sum over the neurons by an integral over the preferred angles φ,

    J_{i,j}(θ) = (Nν^D/(2π)^D) ∫_{−π/ν}^{π/ν} d^Dφ J^(neuron)_{i,j}(θ − φ),   (3.5)
where ∫_{−π/ν}^{π/ν} d^Dφ denotes the angular integration over the D-dimensional cuboid region [−π/ν, π/ν]^D. Since the term in square brackets in equation 3.4 is an even function of each of the angular variables, the integral in equation 3.5 will be nonzero only when i = j. It is also clear that because of symmetry over index permutations, the population Fisher information J_{i,j}(θ) is proportional to the identity matrix. Moreover, since the integrand in equation 3.5 is a periodic function integrated over its whole period, the Fisher information in the equation does not depend on the value of the angular stimulus variable. Thus, dropping index notation and the θ-dependency in the argument, we will denote by J the diagonal element of the Fisher information matrix and call it simply Fisher information.

4 Poisson Model

We begin our analysis of population coding of periodic stimuli by considering a neuronal firing model satisfying equation 2.3: we assume that single-neuron statistics at fixed stimulus is described by a Poisson process with mean given by the neuronal tuning function, equation 2.1:

    P(r|θ) = (f(θ − φ))^r exp(−f(θ − φ))/r!.   (4.1)
The Fisher information conveyed by the Poisson neuronal population is as follows,

    J = (Nν^D/(2π)^D) ∫_{−π/ν}^{π/ν} d^Dφ [f₀²(θ − φ)/f(θ − φ) · sin²(ν(θ_1 − φ_1))/(ν²σ⁴)],   (4.2)
where the term in square brackets is the single-neuron Fisher information, and θ1 and φ1 denote the projections of θ and φ along the first stimulus dimension. In the following, we will study how the Poisson population Fisher information depends on the model parameters.
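The single-neuron term of equation 4.2 can be checked directly against the definition of Fisher information: for Poisson firing, summing P(r|θ)(∂ log P(r|θ)/∂θ)² over spike counts must equal f₀² sin²(ν(θ − φ))/(f ν²σ⁴). A minimal sketch for one stimulus dimension follows; the parameter values are illustrative.

```python
import math

b, m, nu, sigma = 0.5, 5.0, 2, 0.5      # illustrative parameters (sigma in radians)
theta, phi = 0.3, 0.0

def tuning(th):
    """Periodic tuning curve of equation 2.1 for one stimulus dimension."""
    return b + m * math.exp((math.cos(nu * (th - phi)) - 1.0) / (nu * sigma) ** 2)

lam = tuning(theta)                                                   # Poisson mean
dlam = -(lam - b) * math.sin(nu * (theta - phi)) / (nu * sigma ** 2)  # df/dtheta

# Single-neuron Fisher term of equation 4.2: f0^2 sin^2(nu(theta-phi))/(f nu^2 sigma^4).
analytic = dlam ** 2 / lam

# Direct evaluation: sum_r P(r) (d log P(r)/d theta)^2 over the Poisson pmf.
direct, p = 0.0, math.exp(-lam)
for r in range(200):
    direct += p * ((r / lam - 1.0) * dlam) ** 2
    p *= lam / (r + 1)

print(analytic, direct)   # the two values agree
```

The agreement simply restates the familiar Poisson identity that Fisher information equals (dλ/dθ)²/λ.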
[Figure 2 appears here: two panels (A, Poisson; B, gaussian) plotting Fisher information per neuron J/N [deg⁻²] against tuning width σ [deg] for D = 1, 2, 3, and 4.]

Figure 2: (A) Fisher information per neuron J/N as a function of the tuning curve width σ for a population of orientation-selective "Poisson" neurons, for stimulus dimensions D = 1, 2, 3, and 4. Solid lines correspond to the periodic-stimulus Fisher information J and were calculated with equation 4.3. For plotting, the parameter m (which has only a trivial multiplicative effect) was fixed to 5. Dotted lines correspond to the nonperiodic Fisher information J_np and were calculated using gaussian tuning curves of width σ (Zhang & Sejnowski, 1999). (B) Fisher information per neuron J/N for a population of gaussian independent neurons, for different dimensions of the periodic stimulus variable. Parameters were as follows: m = 5, b = 0.5, α = 1, and β = 1.
4.1 Analytical Solution with No Baseline Firing. First, we consider the case in which there is no baseline firing: b = 0. In this case, it is possible to integrate equation 4.2 exactly and obtain the following analytical solution for J,

    J = (Nm/σ²) K₁(ν²σ²) K₀^{D−1}(ν²σ²),   (4.3)

where

    K_n(x) = e^{−1/x} I_n(1/x),   (4.4)
and I_n(x) stands for the nth-order modified Bessel function of the first kind. As in the nonperiodic case (Zhang & Sejnowski, 1999), N and m affect the Fisher information in equation 4.3 only as trivial multiplicative factors. Thus, we can focus on the dependence of Fisher information on σ and D. Figure 2A compares the periodic-stimulus Fisher information J (normalized in the plot as Fisher information per neuron, J/N) to that obtained in the nonperiodic case (which we will denote as J_np, and has
a very simple σ-dependence: J_np ∝ σ^{D−2}; see Zhang & Sejnowski, 1999). It is apparent that the periodic-stimulus Fisher information J depends on σ in a more complex way than the nonperiodic one J_np. While for D = 1 there is no qualitative difference between the periodic and the nonperiodic case (with both J and J_np being divergent at σ = 0), significant differences appear for D ≥ 2. If D = 2, J is not constant with σ, but has a maximum at σ = 0 and then decays rapidly. If D > 2, in sharp contrast with J_np, J exhibits a maximum at finite σ. The optimal values of σ that maximized J were 26.6, 34.1, 39.9, and 44.9 degrees for D = 3, 4, 5, and 6, respectively.

The dependence of J on σ and D and its relation to J_np can be understood by comparing their respective expressions and taking into account the properties of the functions K_n(ν²σ²), as follows. For small σ, K_n(ν²σ²) scales to zero as σ. Thus, the small-σ scaling of the periodic-stimulus model is identical to that of Zhang and Sejnowski (1999). This is because as σ → 0, the periodic tuning function tends to a gaussian and σ can be rescaled away from the angular integrals as in the nonperiodic case. For large σ, K₀(ν²σ²) increases monotonically toward 1, whereas K₁(ν²σ²) (which has a maximum at νσ ≈ 0.8) decreases toward zero as 1/σ². Thus, J decreases rapidly to zero as 1/σ⁴ for any D. This is very different from the nonperiodic case, in which J_np grows to infinity for large σ if D > 2 (Zhang & Sejnowski, 1999). The occurrence of a finite-σ maximum of J can also be understood in terms of the K_n functions. If D is 1 or 2, then the dominant factor is 1/σ², which leads to a maximum of J at σ = 0. For larger (but finite) D, the term K₀^{D−1} becomes more and more important, and thus the maximum of J is shifted toward larger σ. However, unlike in the nonperiodic case, since K₀ saturates at 1 and K₁ goes to zero for σ → ∞, this maximum must be reached at a finite σ value.
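The finite-σ maxima of equation 4.3 can be located by a plain grid search. The sketch below evaluates I_n by its power series (so no external libraries are assumed) and recovers, approximately, the optimal widths quoted above for ν = 2.

```python
import math

def bessel_i(n, x, terms=60):
    """Modified Bessel function of the first kind, I_n(x), by its power series."""
    return sum((x / 2.0) ** (2 * k + n) / (math.factorial(k) * math.factorial(k + n))
               for k in range(terms))

def K(n, x):
    """K_n(x) = exp(-1/x) I_n(1/x), equation 4.4."""
    return math.exp(-1.0 / x) * bessel_i(n, 1.0 / x)

def fisher(sigma_deg, D, nu=2):
    """Equation 4.3 up to the constant factor N*m; sigma given in degrees."""
    s = math.radians(sigma_deg)
    x = (nu * s) ** 2
    return K(1, x) * K(0, x) ** (D - 1) / s ** 2

def optimal_sigma(D, nu=2):
    grid = [10 + 0.1 * i for i in range(701)]   # search 10..80 degrees
    return max(grid, key=lambda sd: fisher(sd, D, nu))

opts = {D: optimal_sigma(D) for D in (3, 4, 5, 6)}
print(opts)   # -> roughly {3: 26.6, 4: 34.1, 5: 39.9, 6: 44.9}
```

The 1/σ² prefactor pulls the maximum toward small widths, while the K₀^{D−1} factor pushes it outward as D grows, exactly as described in the text.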
An infinite optimal σ value can only be reached in the D → ∞ limit, where K₀^{D−1} is dominant and the other terms can be neglected.

It is interesting to compare the optimal values of tuning width obtained with our model with the widths of orientation tuning curves observed in visual cortex (summarized in Table 1). The width of orientation tuning of V1 neurons is typically in the range of 17 to 21 degrees. In higher cortical visual areas, the tuning curves get progressively broader: σ is approximately 26 degrees in MT and 38 degrees in V4. Thus, tuning widths of cortical neurons are neither too narrow nor too wide and are similar in magnitude to the optimal ones obtained from information-theoretic principles when considering multidimensional periodic stimuli. What is the advantage of using tuning widths in this intermediate range of 15 to 40 degrees observed in cortex? Our model results suggest that unlike very narrow or very wide tuning widths, tuning widths in this intermediate range are rather efficient at conveying information over a range of stimulus dimensions. Intermediate tuning widths
Table 1: Typical Values of Tuning Widths to Either Orientation or Motion Direction of Neurons in Different Visual Cortex Areas.

    Species   Area   Stimulus   σ [deg]
    Ferret    V1     O          15–17∗
    Cat       V1     O          14.6†
    Macaque   V1     O          19–22∗
    Macaque   MT     O          24–27∗
    Macaque   V4     O          38†
    Macaque   V1     D          25–29∗
    Macaque   MT     D          35–40∗

Notes: O = orientation; D = motion direction. These values were taken or derived from published reports. † = the original data were reported as the standard deviation σ of the experimental tuning curve. ∗ = the original published values were given as full widths at half height, which we converted into σ using equation 2.1 with D = 1. Since the conversion depends on the value of the baseline firing b, we report the converted σ assuming a baseline firing ranging from 10% of the response modulation m (lower σ value) to zero baseline firing (upper σ value). Sources: Usrey, Sceniak, and Chapman (2003); Henry, Dreher, and Bishop (1974); Albright (1984); McAdams and Maunsell (1999).
(e.g., σ = 25 degrees) would not be efficient for D = 1. However, they would be highly efficient at encoding a handful of periodic stimulus features (e.g., D = 2, …, 5). On the other hand, using a small width of tuning, say, σ = 5 degrees, would be more efficient for D = 1, only marginally more efficient for D = 2, but very inefficient for D > 3. Using a large σ of 90 degrees would lead to poor information transmission for any value of D below ≈ 15. Therefore, the advantage of intermediate tuning widths is that they offer highly efficient information transmission across a range of stimulus dimensions, provided that the population encodes more than one stimulus feature.

The results obtained for orientation tuning are easily extended to coding of direction stimulus variables (i.e., ν = 1). It is easy to see that apart from an overall multiplicative factor ν², the Fisher information in equation 4.3 depends on σ only through the product νσ. (Notice that this is true not only for the Poisson model solution in equation 4.3, but also for the general Fisher information in equation 3.5.) Thus, the dependence of the information on σ for direction-selective populations will be the same as that obtained for orientation-selective neurons, with an overall rescaling of σ by 2. Therefore, for any D, optimal tuning widths for direction-selective populations are exactly twice those corresponding to orientation-selective populations. Table 1 shows that in both V1 and MT, the motion direction tuning widths are larger than the orientation tuning widths, by a factor of ≈ 1.5. Cortical direction tuning widths are hence also efficient for D > 1.
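The factor-2 relation between direction-selective (ν = 1) and orientation-selective (ν = 2) optima can be confirmed numerically from equation 4.3, since J there depends on σ only through the product νσ apart from an overall factor. A sketch for D = 3, again with a power-series evaluation of I_n:

```python
import math

def bessel_i(n, x, terms=60):
    """I_n(x) by its power series (no external libraries assumed)."""
    return sum((x / 2.0) ** (2 * k + n) / (math.factorial(k) * math.factorial(k + n))
               for k in range(terms))

def K(n, x):
    return math.exp(-1.0 / x) * bessel_i(n, 1.0 / x)

def fisher(sigma, D, nu):
    """Equation 4.3 up to constant factors; sigma in radians."""
    x = (nu * sigma) ** 2
    return K(1, x) * K(0, x) ** (D - 1) / sigma ** 2

def optimal_sigma(D, nu):
    grid = [0.15 + 0.001 * i for i in range(1500)]   # 0.15 .. 1.65 rad
    return max(grid, key=lambda s: fisher(s, D, nu))

s_orient = optimal_sigma(3, nu=2)   # orientation-selective population
s_direct = optimal_sigma(3, nu=1)   # direction-selective population
print(math.degrees(s_orient), math.degrees(s_direct), s_direct / s_orient)
```

The printed ratio is very close to 2, as the rescaling argument predicts.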
4.2 Effect of Baseline Firing. If there is baseline firing, that is, b > 0, it is not possible to express J by a simple analytical formula such as equation 4.3. However, we can gain insight into the effect of baseline firing by obtaining approximate solutions for the limiting cases of very small and very large b, as follows. Here we will focus on how robust the optimal σ values found above are in the D > 2 case. When b is very small, we can expand the integrand in equation 4.2 in powers of b/f₀(θ − φ).¹ Keeping terms up to first order in b/f₀(θ − φ), we obtain:

    J = (Nm/σ²) [K₁(ν²σ²) K₀^{D−1}(ν²σ²) − b/(2m)].   (4.5)
The first term corresponds to the Fisher information, equation 4.3, for the case b = 0, which has a maximum at a certain value of tuning curve width, which we indicate by σ∗. The second term is a perturbative correction whose magnitude decreases monotonically with increasing σ. Consequently, for D > 2, the effect of a small baseline firing is to increase slightly the optimal σ with respect to the σ∗ values obtained for b = 0. How much can the optimal σ values increase when increasing the baseline firing? This can be established by considering the opposite limit, b → ∞. In this case, f₀(θ − φ)/b ≪ 1 for all angles, and we can expand the integrand in equation 4.2 in powers of f₀(θ − φ)/b. Up to first order in b⁻¹, we find:

    J = (Nm²/(bσ²)) [K₁(ν²σ²/2) K₀^{D−1}(ν²σ²/2) − (m/b) K₁(ν²σ²/3) K₀^{D−1}(ν²σ²/3)].   (4.6)
The first contribution is the asymptotic behavior for b → ∞ and has a maximum at σ = √2 σ∗. It can be shown that the correction term tends to decrease the optimal value of σ from its √2 σ∗ asymptotic large-b value. This suggests that the maximal effect of baseline firing on the optimal σ values consists in an increase by a factor √2 with respect to the optimal value for b = 0, and this maximal effect is reached when baseline firing dominates. We have verified this prediction by integrating numerically equation 4.2 for different values of b. We found (data not shown) that for D = 3 and 4, the optimal σ values were located, for any b, between σ∗ and √2 σ∗. Thus, optimal tuning width values are robust to the introduction of baseline firing.
¹ This expansion is valid if b/f₀(θ − φ) ≪ 1 for all angles. This condition can be met if σ is nonzero and b is sufficiently small.
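The √2 bound can be checked against the leading term of equation 4.6, whose K arguments are those of equation 4.3 halved. A sketch for D = 3 and ν = 2 (power-series I_n; overall constants dropped):

```python
import math

def bessel_i(n, x, terms=60):
    """I_n(x) by its power series."""
    return sum((x / 2.0) ** (2 * k + n) / (math.factorial(k) * math.factorial(k + n))
               for k in range(terms))

def K(n, x):
    return math.exp(-1.0 / x) * bessel_i(n, 1.0 / x)

def argmax_sigma(scale, D=3, nu=2):
    """Maximizer over sigma of K1(x/scale) K0^(D-1)(x/scale) / sigma^2, x = (nu sigma)^2."""
    def J(s):
        x = (nu * s) ** 2 / scale
        return K(1, x) * K(0, x) ** (D - 1) / s ** 2
    grid = [0.2 + 0.001 * i for i in range(1300)]   # 0.2 .. 1.5 rad
    return max(grid, key=J)

s_star = argmax_sigma(scale=1)   # b = 0 optimum of equation 4.3
s_inf = argmax_sigma(scale=2)    # optimum of the leading term of equation 4.6
print(s_inf / s_star)            # close to sqrt(2)
```

Halving the argument of the K functions stretches the optimum by exactly √2, which is why the large-b optimum sits at √2 σ∗.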
5 Gaussian Model

We consider next another model of single-cell firing, the gaussian model, which describes well the statistics of spike counts of visual neurons when spikes are counted over sufficiently long windows (Abbott, Rolls, & Tovée, 1996; Gershon, Wiener, Latham, & Richmond, 1998). This model, unlike the Poisson one, permits considering the effect of autocorrelations between the spikes emitted by the same neuron. The statistics of single-neuron spike counts for the gaussian model are defined as follows:

    P(r|θ) = (1/(√(2π) ψ(ζ))) exp(−(r − f(θ − φ))²/(2ψ²(ζ))),   (5.1)
where the standard deviation of spike counts, ψ, is an arbitrary function of the mean: ζ = f(θ − φ). Under these assumptions, it is easy to show that the population Fisher information J has the following expression,

    J = (Nν^D/(2π)^D) ∫_{−π/ν}^{π/ν} d^Dφ [f₀²(θ − φ) (1/ψ²(ζ) + 2ψ′²(ζ)/ψ²(ζ)) sin²(ν(θ_1 − φ_1))/(ν²σ⁴)],   (5.2)
where the term in square brackets is the single-neuron Fisher information, and ψ′(ζ) = ∂ψ(ζ)/∂ζ. Since the variance of spike counts of cortical neurons is well described by a power law function of the mean spike count (Tolhurst, Movshon, & Thompson, 1981; Gershon et al., 1998), from now on we will assume that ψ is a power law function of the tuning curve:

    ψ = α^{1/2} f^{β/2}(θ − φ).   (5.3)
In this case, equation 5.2 becomes

    J = (Nν^D/(2π)^D) ∫_{−π/ν}^{π/ν} d^Dφ f₀²(θ − φ) (1/(α f^β(θ − φ)) + β²/(2 f²(θ − φ))) sin²(ν(θ_1 − φ_1))/(ν²σ⁴).   (5.4)
For cortical neurons, the parameters α and β are typically close to 1, both distributed within the range 0.8 to 1.4 (Gershon et al., 1998; Dayan & Abbott, 2001). If all spike times are independent, then the spiking process is Poisson, and α and β would both be 1. Deviations of α and β from 1 require the presence of autocorrelations. Therefore, the study of how J depends on α and
β allows an understanding of how autocorrelations influence population coding.

The dependence of the gaussian model Fisher information on σ and D is plotted in Figure 2B. For this plot, we chose m = 5. The parameters α and β were both set to 1, so that, as for the Poisson process, the variance of spike counts equals the mean. In this case, the scaling of Fisher information is essentially identical to that of the Poisson model in Figure 2A. We investigated numerically the dependence of J on α and β. We found that the shape of Fisher information plotted in Figures 2A and 2B is conserved across the entire range analyzed. In particular, J scaled to zero for large σ for any D and scaled as σ^{D−2} for small σ. For D = 1, J was always divergent at σ = 0. For D = 2, the maximum was always at σ = 0. For D > 2, J always had a maximum at finite values of σ. The position of the maximum varied slightly as a function of α and β. Results for the position of the maximum for D = 3 and D = 4 are reported in Figure 3. The position of the maxima was almost unchanged when varying α. It varied by less than 3 degrees for D = 3 and 5 degrees for D = 4 when β varied within the typical cortical range 0.8 to 1.4.

The similarity between the gaussian model Fisher information, equation 5.4, and the Poisson model Fisher information, equation 4.3, can be explained by noting that if m is large and β < 2, then the second additive term within parentheses in equation 5.4 can be neglected and the gaussian model Fisher information has the following approximated solution:

    J ≈ (Nm^{2−β}/(α(2 − β)σ²)) K₁(ν²σ²/(2 − β)) K₀^{D−1}(ν²σ²/(2 − β)).   (5.5)
This expression is (apart from an overall multiplicative factor) identical to the population Fisher information for the Poisson model, equation 4.3, with an overall rescaling of the arguments of the K functions by a factor 2 − β. Therefore, if the exponent β of the power law variance-mean relationship, equation 5.3, is approximately 1 (as for real cortical neurons), the optimal values of the gaussian model in the large-m case are almost identical to the ones obtained for the Poisson population. It is worth noting that the α- and β-dependence of J arising from the large-m approximation in equation 5.5 is compatible with the intermediate-m numerical results of Figure 3, which showed that the optimal σ values depend very mildly on α and decrease monotonically with β at fixed D.

If m is not very large, then the second additive term within parentheses in equation 5.4 contributes to the gaussian Fisher information. However, we have verified that under a wide range of parameters, this contribution is less dependent on σ than the other one and does not shift the maxima or alter the dependence of J on σ and D in a prominent way.
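The trend just described follows from equation 5.5: rescaling the K arguments by 2 − β moves the maximizer to √(2 − β) σ∗, so the optimal width decreases as β grows and reduces to the Poisson value at β = 1. A sketch for D = 3 and ν = 2 (α only rescales J and is dropped; power-series I_n):

```python
import math

def bessel_i(n, x, terms=60):
    """I_n(x) by its power series."""
    return sum((x / 2.0) ** (2 * k + n) / (math.factorial(k) * math.factorial(k + n))
               for k in range(terms))

def K(n, x):
    return math.exp(-1.0 / x) * bessel_i(n, 1.0 / x)

def opt_sigma_deg(beta, D=3, nu=2):
    """Maximizer of the large-m approximation, equation 5.5 (constants dropped)."""
    def J(s):
        x = (nu * s) ** 2 / (2.0 - beta)
        return K(1, x) * K(0, x) ** (D - 1) / s ** 2
    grid = [0.2 + 0.001 * i for i in range(1300)]   # sigma in radians
    return math.degrees(max(grid, key=J))

sig = {beta: opt_sigma_deg(beta) for beta in (0.8, 1.0, 1.2)}
print(sig)   # optimum decreases as beta grows; beta = 1 recovers the Poisson value
```

Note that this large-m approximation exaggerates the β-dependence somewhat relative to the intermediate-m numerics of Figure 3, but the monotonic trend is the same.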
[Figure 3 appears here: four panels plotting the optimal tuning width σ_opt [deg] against α (at fixed β = 0.8, 1.0, 1.2) and against β (at fixed α = 0.8, 1.0, 1.2), for D = 3 and D = 4.]

Figure 3: Optimal tuning width, σ_opt (corresponding to a maximum of Fisher information J), for a population of orientation-selective "gaussian" neurons, as a function of the parameters α and β defining the power law spike count mean-variance relationship, equation 5.3. (A) The number of encoded stimulus variables is D = 3. Here β is kept fixed while α is varied. (B) Here, α is kept fixed, β is varied, and D = 3. (C) Now D is 4; β is kept fixed, while α is varied. (D) Again D is 4. Here, α is kept fixed, and β is varied.
Thus, we conclude that the values of optimal tuning widths obtained with the Poisson model are robust to changes in model details such as the introduction of autocorrelations parameterized by α and β, as long as these parameters remain within the realistic cortical range.

6 General Multiplicative Noise Model

To further check the robustness of the above conclusions, we introduced a more general model of single-neuron firing: the multiplicative noise model. This model, unlike the Poisson and the gaussian models, has the advantage of not assuming a particular functional form for the variability of neuronal responses at fixed stimulus. It assumes that the variability of spike counts in response to any stimulus is generated by an arbitrary stochastic process
modulated by an arbitrary function ψ of the mean spike rate. In this case, the spike rate of each neuron is given by the following equation,

    r = f(θ − φ) + ε ψ(ζ) z,   (6.1)

where z is an arbitrary stochastic process of zero mean and unit variance with distribution Q(z), ψ(ζ) ≡ ψ(f(θ − φ)), and ε is a parameter that modulates the overall strength of the response variability. Under these assumptions, the single-neuron spike count probability is

    P(r|θ) = (1/(ε ψ(ζ))) Q((r − f(θ − φ))/(ε ψ(ζ))),   (6.2)
and the single-neuron Fisher information, equation 3.3, has the following form:

    J^(neuron)_{i,i}(θ) = [f₀²(θ − φ) sin²(ν(θ_i − φ_i))/(ε²ψ²(ζ)ν²σ⁴)] × [T₀(Q) + 2ε ψ′(ζ) T₁(Q) + ε² ψ′²(ζ) T₂(Q)].   (6.3)
The coefficients T_i(Q) are functions of the noise distribution only and are defined as follows:

    T₀(Q) = ∫ Q′²(z)/Q(z) dz;
    T₁(Q) = ∫ Q′(z) dz + ∫ Q′²(z) z/Q(z) dz;
    T₂(Q) = 1 + ∫ Q′²(z) z²/Q(z) dz + 2 ∫ Q′(z) z dz.   (6.4)
Although equation 6.3 permits the computation of the population Fisher information for any multiplicative noise model, in the following we will concentrate on examining two interesting limiting cases: very low and very high noise strengths. In examining these two cases, we will assume that (as for the gaussian model) the variance of the noise ψ² is in a power law relationship with the mean, equation 5.3.

6.1 Low Noise Limit. We first consider the low noise limit, ε ≪ 1. In this case, responses are almost deterministic, and single neurons convey information by stimulus-induced changes in the mean spike rate (Brunel & Nadal, 1998). In this limit, the population Fisher information J can be calculated by keeping only the leading order in ε in the single-neuron
Fisher information, equation 6.3, and then integrating it over the preferred stimuli, obtaining:

    J = (Nm^{2−β} T₀(Q)/(ε² α(2 − β)σ²)) K₁(ν²σ²/(2 − β)) K₀^{D−1}(ν²σ²/(2 − β)).   (6.5)
The dependence of the low-noise Fisher information on σ and D is thus affected only by β, with all other model parameters entering only as an overall multiplicative factor. Equation 6.5 is identical (apart from an overall multiplicative factor) to the large-m approximation of the gaussian model Fisher information, equation 5.5. It is also identical (apart from a rescaling of the argument of K_n) to the Poisson model exact solution in equation 4.3.

6.2 High Noise Limit. We considered next the case of very noisy neurons: ε ≫ 1. In this limit, information is transmitted entirely by stimulus-modulated changes of the variance ψ². (If the variance were not stimulus dependent, then information would be zero in the high noise limit; Samengo & Treves, 2001.) Taking the ε → ∞ limit of the single-neuron Fisher information, integrating it over the preferred angles, and taking into account equation 5.3, we obtain the following expression for the population Fisher information:

    J = (Nν^D/(2π)^D) ∫_{−π/ν}^{π/ν} d^Dφ f₀²(θ − φ) [T₂(Q)/(α f^β(θ − φ))] sin²(ν(θ_1 − φ_1))/(ν²σ⁴).   (6.6)
In this limit, J is thus independent of the noise strength ε and depends on the details of the single-neuron model only through a multiplicative factor T₂(Q). It can be seen that its expression is similar to the first (and dominant) additive term of the gaussian solution, equation 5.4. Because of this similarity, when integrating numerically equation 6.6, we found that the dependence of the high-noise population Fisher information on σ and D is remarkably consistent with that obtained for the Poisson and gaussian models (see Figure 2), and that the changes in the position of the maxima when varying α and β were again similar to those reported in Figure 3 for the gaussian model (data not shown).

In summary, the dependence on σ and D of the information transmitted by a population of neurons described by the general multiplicative model behaves, in both the high-noise and the low-noise limit, in a way consistent with the results obtained above.

7 General Scaling Limits for Large and Small σ

After having analyzed three different classes of single-neuron firing models, in this section we switch back to the most general firing model in equation
2.3, in which the single-neuron statistics is an arbitrary function of the mean spike rate; we consider its small- and large-σ scaling. We will derive that for any such firing model, Fisher information scales in a universal way as σ^{D−2} for small σ and goes to zero as 1/σ⁴ for large σ. Thus, for D > 2, Fisher information reaches a maximum for a finite value of the tuning width, whatever the firing model considered.

7.1 Small σ Scaling. When σ ≪ 1, the exponential in equation 2.1 gives a nonzero contribution to the tuning function only when θ − φ ∼ 0. In this regime, we can take a Taylor expansion of the cosines in the exponent of equation 2.1, obtaining the following,

    f(θ − φ) ≈ b + m exp(−|θ − φ|²/(2σ²)) ≡ G(|θ − φ|²/σ²),   (7.1)
where in the above G(|θ − φ|²/σ²) is the standard gaussian tuning function. The population Fisher information becomes

    J = (Nν^D/(2π)^D) ∫_{−π/ν}^{π/ν} d^Dφ Ã(|θ − φ|²/σ²) (θ_1 − φ_1)²/σ⁴,   (7.2)
where, following Zhang and Sejnowski (1999), the function Ã is defined as follows:

    Ã(|θ − φ|²/σ²) = exp(−|θ − φ|²/σ²) ∫ dr S′(r, G(|θ − φ|²/σ²))²/S(r, G(|θ − φ|²/σ²)).   (7.3)
By introducing new integration variables ξ_i = (θ_i − φ_i)/σ, one can rewrite equation 7.2 as follows:

    J = σ^{D−2} (Nν^D/(2π)^D) ∫_{−π/(νσ)}^{π/(νσ)} d^Dξ Ã(|ξ|²) ξ_1².   (7.4)
By taking the small-σ limit of the above expression, one gets

    J = σ^{D−2} (Nν^D/(2π)^D) ∫_{−∞}^{+∞} d^Dξ Ã(|ξ|²) ξ_1².   (7.5)
If the improper integral above converges, then the periodic Fisher information scales as σ D−2 for small σ , coinciding with the universal scaling rule for nonperiodic stimuli found by Zhang and Sejnowski (1999). It is important
to note that for a given neuronal model defined by a probability function S, equation 7.5 is the Fisher information that would be obtained if the tuning curve were gaussian with width σ and the stimulus variable ξ were nonperiodic (Zhang & Sejnowski, 1999). Thus, the small-σ scaling of the periodic-stimulus Fisher information J is ∝ σ^{D−2} whenever the Fisher information of the corresponding gaussian nonperiodic tuning model is well defined (see Wilke & Eurich, 2001, for cases and parameters in which the gaussian model nonperiodic Fisher information is not well defined).

7.2 Large σ Scaling. When σ ≫ 1, the argument of the exponential in equation 2.1 is very small. Thus, the following expansion will be valid:

    f(θ − φ) ≈ b + m + (m/(νσ)²) Σ_{i=1}^{D} [cos(ν(θ_i − φ_i)) − 1] ≈ f(0) + O(1/σ²).   (7.6)

Consequently, the population Fisher information becomes

    J ≈ (m²/σ⁴) ∫ dφ ∫ dr [S′(r, m + b)²/S(r, m + b)] sin²(ν(θ_1 − φ_1))/ν².   (7.7)
Thus, for large σ, Fisher information goes to zero as 1/σ⁴ for any stimulus dimensionality.

8 Correlations Between Neurons

The analysis above considered a population of independent neurons. In this section, we address the effect of correlated variability among the populations on the position of the optimal values of the tuning curve width. For simplicity, we shall consider that the firing statistics are governed by a multivariate gaussian distribution as follows,

    P(r|θ) = (1/√((2π)^N |C|)) exp(−(1/2)(r − f)ᵀ C⁻¹ (r − f)),   (8.1)
where C is the population correlation matrix and f stands for a column vector whose components are the neuron tuning functions, that is, f ≡ [f(θ − φ⁽¹⁾), …, f(θ − φ⁽ᴺ⁾)]ᵀ, φ⁽ᵏ⁾ being the preferred stimulus of the kth neuron. The correlation matrix is defined as follows,

    C⁽ᵏˡ⁾ = [δ_{kl} + (1 − δ_{kl}) q] ψ(ζ⁽ᵏ⁾) ψ(ζ⁽ˡ⁾),   (8.2)
where ζ⁽ᵖ⁾ ≡ f(θ − φ⁽ᵖ⁾), and (to ensure that the correlation matrix C is positive definite) the cross-correlation strength parameter q is allowed to vary in the range 0 ≤ q < 1. The Fisher information for this gaussian-correlated model is given by the following expression (Abbott & Dayan, 1999):

    J_{ij}(θ) = (∂fᵀ/∂θ_i) C⁻¹ (∂f/∂θ_j) + (1/2) Tr[C⁻¹ (dC/dθ_i) C⁻¹ (dC/dθ_j)].   (8.3)
By inserting the correlation matrix definition given by equation 8.2 into equation 8.3, and taking the continuous limit for N ≫ 1 (Abbott & Dayan, 1999; Wilke & Eurich, 2001), one arrives at the following expression:

    J_{ij}(θ) = (Nν^{D−2}/((2π)^D (1 − q) σ⁴)) ∫_{−π/ν}^{π/ν} d^Dφ f₀²(θ − φ) [1/ψ²(ζ) + (2 − q) ψ′²(ζ)/ψ²(ζ)] × sin(ν(θ_i − φ_i)) sin(ν(θ_j − φ_j)).   (8.4)
Note that as for the uncorrelated model, the only nonzero elements of the Fisher information matrix are the diagonal ones; these elements are all identical and do not depend on the value of the angular stimulus variable. Thus, dropping index and θ-dependency notation, we will again simply denote by J the diagonal element of the Fisher information matrix in equation 8.4.

The expression of J for the correlated model in equation 8.4 is almost identical to the one for the uncorrelated gaussian model in equation 5.2, the only difference being a q-dependent relative weight of the two additive terms in equation 8.4. Since, as explained in section 5, the first additive term is the prominent one in shaping the σ- and D-dependence of J, the correlated-model J behaves very much like the uncorrelated-gaussian-model J in equation 5.2. In particular, the cross-correlation parameter q affects the optimal σ values in a very marginal way. Thus, we expect that the only appreciable effect of the cross-correlation strength q is to shift slightly the position of the maximum for D > 2. The variations of the optimal σ values of the correlated model as a function of q (obtained by integrating numerically equation 8.4) are reported in Figure 4 for D = 3 and D = 4. ψ(ζ) was again chosen according to equation 5.3 with α = 1 and β = 1. It is apparent that the values of the optimal tuning widths found for the uncorrelated model are virtually unchanged throughout the entire allowed range of cross-correlation strength q.

9 Discussion

Determining how the encoding accuracy of a population depends on the tuning width of individual neurons is crucial for understanding the
[Figure 4 appears here: optimal tuning width σ_opt [deg] versus cross-correlation strength q, for D = 3 and D = 4.]

Figure 4: Optimal tuning widths for a population of orientation-selective neurons with gaussian firing statistics in the presence of uniform cross-correlations, as a function of the cross-correlation strength parameter q. The two cases D = 3 (lower curve) and D = 4 (upper curve) are separately reported.
transformation of the sensory representation across different levels of the cortical hierarchy (Hinton, McClelland, & Rumelhart, 1986; Zohary, 1992). Generalizing previous results obtained for nonperiodic stimuli (Zhang & Sejnowski, 1999), here we have determined how the encoding accuracy of periodic stimuli depends on the tuning width. Although we found no universal scaling rule, the dependence of the encoding accuracy of periodic stimuli on the width of tuning was remarkably consistent across neural models and model parameters. This indicates that the key properties of encoding of periodic variables are general.

The encoding accuracy of angular variables differs significantly from that of nonperiodic stimuli. The two major differences are that (1) whatever the number of stimulus features D, very large tuning widths are inefficient for encoding a finite number of periodic variables, and (2) for D > 2, intermediate values of tuning widths (within the range observed in cortex) provide the population with the largest representational capacity. These differences suggest that population coding of periodic stimuli may be influenced by
computational constraints that differ from those influencing the coding of nonperiodic stimuli.

As for the nonperiodic case (Zhang & Sejnowski, 1999), the neural population information about periodic stimuli depends crucially on D, the number of stimulus features being encoded. This number is not known exactly; therefore, it is difficult to derive precisely the optimal tuning widths in each cortical area. However, some evidence indicates that neurons may encode a small number of stimulus features, and in many cases the number of encoded features is more than one. For example, visual neurons extract a few features out of a rich dynamic stimulus (Brenner, Bialek, & de Ruyter van Steveninck, 2000; Touryan, Lau, & Dan, 2002), and a small number of different stimulus maps are often found to coexist over the same area of cortical tissue (Swindale, 2004). Our results show that in this regime in which more than one (but no more than a few) periodic features are encoded, tuning widths within the range observed in cortex are efficient at transmitting information.

We showed that the optimal tuning width for population coding increases with the number of periodic features being encoded. Neurophysiological data in Table 1 show a progressive increase of tuning widths across the cortical hierarchy, consistent with the idea that higher visual areas encode complex objects described by a combination of several stimulus parameters (Pasupathy & Connor, 2001), thus requiring larger tuning widths for efficient coding.

In summary, our results demonstrate that tuning curves of intermediate widths offer computational advantages when considering the encoding of periodic stimuli.

Acknowledgments

We thank I. Samengo and R. Petersen for interesting discussions. This research was supported by the Human Frontier Science Program, the Royal Society, and Wellcome Trust grant 066372/Z/01/Z.

References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code.
Neural Comp., 11, 91–101. Abbott, L. F., Rolls, E. T., & Tov´ee, M. J. (1996). Representational capacity of face coding in monkeys. Cerebral Cortex, 6, 498–505. Albright, T. D. (1984). Direction and orientation selectivity of neurons in visual area MT of the macaque. J. Neurophysiol., 52, 1106–1130. Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling maximizes information transmission. Neuron, 26, 695–702. Brunel, N., & Nadal, J. P. (1998). Mutual information, Fisher information and population coding. Neural Computation, 10, 1731–1757.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Gershon, E. D., Wiener, M. C., Latham, P. E., & Richmond, B. J. (1998). Coding strategies in monkey V1 and inferior temporal cortices. Journal of Neurophysiology, 79, 1135–1144.
Henry, G. H., Dreher, B., & Bishop, P. O. (1974). Orientation specificity of cells in cat striate cortex. Journal of Neurophysiology, 37, 1394–1409.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 77–109). Cambridge, MA: MIT Press.
Kohn, A., & Movshon, A. (2004). Adaptation changes the direction tuning of macaque MT neurons. Nature Neuroscience, 7, 764–772.
Logothetis, N., Pauls, J., & Poggio, T. (1995). Shape representation in the inferior temporal cortex of monkeys. Current Biology, 5, 552–563.
McAdams, C. J., & Maunsell, J. H. R. (1999). Effects of attention on orientation-tuning functions of single neurons in macaque cortical area V4. Journal of Neuroscience, 19(1), 431–441.
Nirenberg, S., Carcieri, S. M., Jacobs, A., & Latham, P. E. (2001). Retinal ganglion cells act largely as independent encoders. Nature, 411, 698–701.
Oram, M. W., Hatsopoulos, N., Richmond, B., & Donoghue, J. (2001). Excess synchrony in motor cortical neurons provides redundant direction information with that from coarse temporal measures. Journal of Neurophysiology, 86, 1700–1716.
Pasupathy, A., & Connor, C. E. (2001). Shape representation in area V4: Position-specific tuning for boundary conformation. Journal of Neurophysiology, 86, 2505–2519.
Petersen, R. S., Panzeri, S., & Diamond, M. (2001). Population coding of stimulus location in rat somatosensory cortex. Neuron, 32, 503–514.
Pouget, A., Deneve, S., Ducom, J.-C., & Latham, P. (1999). Narrow versus wide tuning curves: What’s best for a population code? Neural Computation, 11, 85–90.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population coding. Neural Computation, 10, 373–401.
Samengo, I., & Treves, A. (2001). Representational capacity of a set of independent neurons. Physical Review E, 63, 011910.
Sompolinsky, H., Yoon, H., Kang, K., & Shamir, M. (2001). Population coding in neuronal systems with correlated noise. Physical Review E, 64, 051904.
Swindale, N. V. (1998). Orientation tuning curves: Empirical description and estimation of parameters. Biological Cybernetics, 78, 45–56.
Swindale, N. V. (2004). How different feature spaces may be represented in cortical maps. Network, 15, 217–242.
Tolhurst, D. J., Movshon, J. A., & Thompson, I. D. (1981). The dependence of response amplitude and variance of cat visual cortical neurones on stimulus contrast. Experimental Brain Research, 41, 414–419.
Touryan, J., Lau, B., & Dan, Y. (2002). Isolation of relevant visual features from random stimuli for cortical complex cells. Journal of Neuroscience, 22, 10811–10818.
Usrey, W. M., Sceniak, M. P., & Chapman, B. (2003). Receptive fields and response properties of neurons in layer 4 of ferret visual cortex. Journal of Neurophysiology, 89, 1003–1015.
Wilke, S. D., & Eurich, C. W. (2001). Representational accuracy of stochastic neural populations. Neural Comp., 14, 155–189. Zhang, K., & Sejnowski, T. (1999). Neuronal tuning: To sharpen or to broaden? Neural Computation, 11, 75–84. Zohary, E. (1992). Population coding of visual stimuli by cortical neurons tuned to more than one dimension. Biological Cybernetics, 66, 265–272.
Received February 7, 2005; accepted November 3, 2005.
LETTER
Communicated by Michael Hasselmo
How Inhibitory Oscillations Can Train Neural Networks and Punish Competitors

Kenneth A. Norman∗ [email protected]
Ehren Newman∗ [email protected]
Greg Detre [email protected]
Sean Polyn [email protected]

Department of Psychology, Princeton University, Princeton, NJ 08544, U.S.A.
We present a new learning algorithm that leverages oscillations in the strength of neural inhibition to train neural networks. Raising inhibition can be used to identify weak parts of target memories, which are then strengthened. Conversely, lowering inhibition can be used to identify competitors, which are then weakened. To update weights, we apply the Contrastive Hebbian Learning equation to successive time steps of the network. The sign of the weight change equation varies as a function of the phase of the inhibitory oscillation. We show that the learning algorithm can memorize large numbers of correlated input patterns without collapsing and that it shows good generalization to test patterns that do not exactly match studied patterns.

1 Introduction

The idea that memories compete to be retrieved is one of the most fundamental axioms of neural processing. According to this view, retrieval success is a function of the amount of input that the target memory receives, relative to other, competing memories. If the target memory receives substantially more input than competing memories, it will be retrieved quickly and accurately; if support for the target memory is low relative to competing memories, the target memory will be retrieved slowly or not at all. This view implies that, to maximize subsequent retrieval success, the memory system can enact two distinct kinds of changes. The more obvious way to improve retrieval is to strengthen the target memory. However, it should also be possible to improve retrieval by weakening competing

∗ The first two authors contributed equally to this research.
Neural Computation 18, 1577–1610 (2006)
C 2006 Massachusetts Institute of Technology
memories. Over the past decade, the idea that competitors are punished has received extensive empirical support. Studies by Michael Anderson and others have demonstrated the following regularity: if a memory receives input from the retrieval cue but the memory is not ultimately retrieved, then the memory is weakened. Furthermore, this weakening appears to be proportional to the amount of input that a competitor receives (e.g., Anderson, 2003). Put simply, the more that a memory competes (so long as it is not ultimately retrieved), the more it is punished. We review some illustrative findings in section 1.1. However, despite the obvious functional utility of punishing competitors and the large body of psychological research indicating that competitor punishment occurs, extant computational models of memory have not directly addressed the issue of competitor punishment. In section 2, we present a theory of how the brain can exploit regular oscillations in neural inhibition to punish competing memories and strengthen weak parts of target memories. In section 3, we explore the functional properties of our oscillation-based learning algorithm: How well can it store patterns, relative to other learning algorithms that do not explicitly incorporate competitor punishment, and how does increasing overlap between patterns affect the learning algorithm’s ability to store item-specific features of individual patterns? The ultimate goal of this work is to show how explicitly incorporating competitor punishment mechanisms into neural learning algorithms can improve the algorithms’ ability to memorize overlapping patterns and improve our understanding of how inhibitory oscillations (e.g., theta oscillations) contribute to learning.

1.1 Data Indicating Competitor Punishment

The phenomenon of competitor punishment is nicely illustrated by data from Michael Anderson’s retrieval-induced forgetting (RIF) paradigm (for a comprehensive overview of RIF results, see Anderson, 2003).
In the RIF paradigm, participants are given a list of category-exemplar pairs (e.g., Fruit-Apple, Fruit-Kiwi, and Fruit-Pear) one at a time and are told to memorize the pairs. Immediately after viewing the pairs, participants are given a practice phase where they practice retrieving a subset of the items on the list (e.g., they are given Fruit-Pe and must say “pear”). After a delay (e.g., 20 minutes), participants’ memory for all of the items on the study list is tested. Practicing Fruit-Pe improves recall of “pear” but hurts recall of competing items (e.g., “apple”). Importantly, Anderson and Spellman (1995) found that reduced recall of “apple” is evident even when subjects are given independent cues that were not presented during the practice stage (e.g., Red-A ). This finding indicates that the Apple representation itself, and not just the Fruit-Apple connection, has been weakened. Anderson, Bjork, and Bjork (1994) also found that practicing Fruit-Pe results in more punishment for strong associates of fruit (“apple”) than weak associates of fruit
(“kiwi”). Intuitively, strong associates compete more than weak associates, so they suffer more competitor punishment. The fact that impaired recall of the competitor (relative to baseline) lasts on the order of tens of minutes and possibly longer suggests that impaired recall is due to changes in synaptic weights, as opposed to carryover of activation states from the retrieval practice phase. Anderson’s retrieval-induced forgetting experiments are a particularly well-characterized example of competitor punishment. Importantly, though, they are not the only example of this dynamic. To illustrate the generalized nature of this phenomenon, we briefly review findings from other domains that can be understood in terms of competitor punishment:

• Metaphor comprehension. Glucksberg, Newsome, and Goldvarg (2001) showed that word meanings that are not applicable to the current sentence suffer lasting inhibition. For example, after reading, “My lawyer is a shark,” participants were slower to evaluate sentences that reference the concept “swim” (e.g., “Geese are good swimmers”). Glucksberg et al. argue that when participants read, “My lawyer is a shark,” less appropriate meanings of shark (e.g., “swim”) compete with more appropriate meanings (e.g., “vicious”). “Swim” receives input but loses the competition, so it is punished.

• Task switching. Mayr and Keele (2000) found that after switching from task A to task B, it is more difficult to switch back to task A than to switch to a new task (task C). This can be explained in terms of task A’s competing with task B during the initial switch. Because task A competes but loses (i.e., participants eventually succeed in switching to task B), the neural representation of task A is punished, thereby making it more difficult to reactivate later.

• Negative priming. In negative priming tasks, participants have to process one object and ignore other objects on each trial (e.g., participants might be instructed to name the green object and ignore objects that are not green). Objects that were ignored sometimes reappear as target objects on subsequent trials. Studies have found that participants are slower to process objects that were ignored on previous trials compared to objects that were not ignored (see Fox, 1995, for a review). This can be explained in terms of the idea that nontarget objects compete with target objects. By attending to the target color, participants bias the competition such that the target object wins the competition. Because the representations of nontarget objects receive strong input from the stimulus array but lose the competition, these representations are punished.

• Cognitive dissonance reduction. Freedman (1965) showed that if a child is given a mild threat not to play with a toy (and then does not play with the toy), the child ends up liking the toy less. In contrast, if a
child is given a strong threat not to play with a toy, there is no attitude change. This paradigm can be understood in terms of a competition between wanting to play with the toy (“play”) and not wanting to play with the toy (“don’t play”). In the mild threat condition, “don’t play” just barely wins over “play”; “play” is a losing competitor, so it is punished, resulting in reduced liking. In the strong threat condition, “don’t play” wins easily over “play”; there is not strong competition, so “play” is not punished. The ubiquitous presence of competitor punishment in psychology spurred us to develop a learning mechanism that can account for this dynamic. Our goal is to come up with a neural network learning algorithm that punishes representations if and only if they compete (and lose) during retrieval. Also, the learning algorithm should be able to efficiently train new patterns into the network in a manner that supports subsequent recall of these patterns. Ultimately we believe that these goals are synergistic: pushing away competitors during training should improve the accuracy of recall at test.

2 The Learning Algorithm

In this section, we present a learning algorithm for rate-coded neural networks that can punish competitors as well as train new patterns into a network. The algorithm depends critically on changes in the strength of neural inhibition. By way of background, we first review how inhibition works in our model. Then we provide an overview of how the learning algorithm works. Finally, we provide a more detailed account of how the learning algorithm exploits changes in neural inhibition to punish competitors and strengthen weak memories.

2.1 The Role of Inhibition

Neural systems need some way of controlling excitatory neural activity so this activity does not spread across the entire system, causing a seizure. In keeping with O’Reilly and Munakata (2000), we argue that inhibitory interneurons act to control excitatory activity.
Inhibitory interneurons accomplish this goal by sampling the overall amount of excitatory activity within a particular region via diffuse input projections and sending back a commensurate amount of inhibition via diffuse output projections. In this manner, inhibitory interneurons act to enforce a set point on the amount of excitatory activity. Increasing the strength of inhibition leads to a decrease in the overall amount of excitatory activity, and reducing the strength of inhibition leads to an increase in the overall amount of excitatory activity. In terms of this framework, an excitatory neuron will be active if the amount of excitation that it receives is sufficient to counteract the amount of inhibition that it receives. These active units make up the representation of the input pattern. Units that
receive a substantial amount of excitatory input, but not enough to counteract the effects of inhibition, can be thought of as competitors. Given processing noise, it is possible that these competing units could be recalled in place of target units.

2.2 Précis of the Learning Algorithm

The learning algorithm utilizes the Contrastive Hebbian Learning (CHL) weight change equation (Ackley, Hinton, & Sejnowski, 1985; Hinton & Sejnowski, 1986; Hinton, 1989; Movellan, 1990). CHL learning involves contrasting a more desirable state of network activity (sometimes called the plus state) with a less desirable state of network activity (sometimes called the minus state). The CHL equation adjusts network weights to strengthen the more desirable state of network activity (so it is more likely to occur in the future) and weaken the less desirable state of network activity (so it is less likely to occur in the future):

\Delta W_{ij} = \epsilon \left[ X_i^{+} Y_j^{+} - X_i^{-} Y_j^{-} \right]. \qquad (2.1)
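As a concrete illustration, the CHL update in equation 2.1 can be sketched as an outer-product rule. This is a minimal sketch, not code from the article; the function name, the learning rate value, and the NumPy representation are our own.

```python
import numpy as np

def chl_update(W, x_plus, y_plus, x_minus, y_minus, lr=0.1):
    """Contrastive Hebbian Learning (equation 2.1): strengthen plus-state
    coactivity and weaken minus-state coactivity. Rows index presynaptic
    units (X), columns index postsynaptic units (Y)."""
    return W + lr * (np.outer(x_plus, y_plus) - np.outer(x_minus, y_minus))

# A unit pair coactive in the plus state but not the minus state is strengthened;
# a pair coactive only in the minus state is weakened.
W = np.zeros((2, 2))
W = chl_update(W,
               x_plus=np.array([1.0, 0.0]), y_plus=np.array([1.0, 0.0]),
               x_minus=np.array([1.0, 0.0]), y_minus=np.array([0.0, 1.0]))
```

After this update, W[0, 0] has increased (plus-state coactivity) and W[0, 1] has decreased (minus-state coactivity), while weights from the inactive sender are untouched.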
In the above equation, X_i is the activation of the presynaptic (sending) unit, and Y_j is the activation of the postsynaptic (receiving) unit. The plus and minus superscripts refer to plus-state and minus-state activity, respectively. ΔW_ij is the change in weight between the sending and receiving units, and ε is the learning rate parameter. Our algorithm uses changes in the strength of neural inhibition to generate plus and minus patterns to feed into the CHL equation. To memorize a pattern of activity, we start by soft-clamping the target pattern onto the network. Clamp strength was tuned such that, given a normal level of inhibition, all of the target features (and only those features) are active. This pattern serves as the plus state for learning. We then create two distinct kinds of minus patterns by raising and lowering inhibition, respectively. Raising inhibition distorts the target pattern by making it harder for target units to stay on. Lowering inhibition distorts the target pattern by making it easier for nontarget units to be active. Next, the learning algorithm updates weights by two separate CHL-based comparisons. First, it applies CHL to the difference in network activity given normal versus high inhibition. Second, it applies CHL to the difference in network activity given normal versus low inhibition.

2.2.1 Comparing Normal versus High Inhibition

At a functional level, the normal versus high-inhibition comparison strengthens weak parts of the target pattern by increasing their connectivity with other parts of the target pattern. Raising inhibition acts as a kind of stress test on the target pattern. If a target unit is receiving relatively little collateral support from other target units, such that its net input is just above threshold, raising inhibition will
trigger a decrease in the activation of that unit. However, if a target unit is receiving strong collateral support, such that its net input is far above threshold, it will be relatively unaffected by this manipulation. The CHL equation (applied to normal versus high inhibition) strengthens units that turn off when inhibition is raised, by increasing weights from other active units. These weight changes ensure that a target unit that drops out on a given trial will receive more input the next time that cue is presented. If the same pattern is presented repeatedly, eventually the input to that unit will increase to the point where it no longer drops out in the high-inhibition condition. At this point, the unit should be well connected to the rest of the target representation, making it possible for the network to complete that unit, and no further strengthening will occur.

2.2.2 Comparing Normal versus Low Inhibition

The normal-versus-low-inhibition comparison punishes competing units by reducing their connectivity with target units. As discussed earlier, competing units can be defined as nontarget units that (given normal levels of inhibition) receive almost enough net input to come on, but not enough input to be active in the final, settled state of the network. If a nontarget unit is located just below threshold, then lowering inhibition will cause that unit to become active. However, if a nontarget unit is far below threshold (i.e., it is not receiving strong input), it will be relatively unaffected by this manipulation. The CHL equation (applied to normal versus low inhibition) weakens units that turn on when inhibition is lowered, by reducing weights from other active units. These weight changes ensure that a unit that competes on one trial will receive less input the next time that cue is presented.
If the same cue is presented repeatedly, eventually the input to that unit will diminish to the point where it no longer activates in the low-inhibition condition. At this point, the unit is no longer a competitor, so no further punishment occurs.

2.3 Implementing the Algorithm Using Inhibitory Oscillations

The fact that the learning algorithm involves changes in the strength of inhibition led us to consider how the algorithm relates to neural theta oscillations (rhythmic changes in local field potential at a frequency of approximately 4 to 8 Hz in humans). Theta oscillations depend critically on changes in the firing of inhibitory interneurons (Buzsaki, 2002; Toth, Freund, & Miles, 1997), and there are several data points indicating that theta oscillations might play a role in learning (e.g., Seager, Johnson, Chabot, Asaka, & Berry, 2002; Huerta & Lisman, 1996). In section 4, we assess the correspondence between our algorithm and theta in more detail. At this point, the key insights are that continuous inhibitory oscillations are widespread in the brain, and these oscillations might serve as a neural substrate for our learning algorithm. The version of the learning algorithm described in the previous section (where inhibition is set to normal, higher than normal, or lower than
normal) is useful for expository purposes, but the discrete nature of the inhibitory states conflicts with the continuous nature of theta oscillations. To remedy this, we devised an implementation of the learning algorithm that oscillates inhibition in a continuous sinusoidal fashion (from higher than normal to lower than normal). Also, instead of changing weights by comparing normal versus high inhibition and normal versus low inhibition, we change weights by comparing network activity on successive time steps. With regard to the CHL algorithm, the key intuition is that at each point in the inhibitory oscillation, the network is either moving toward the target state (i.e., the pattern of network activity when inhibition is at its normal level) or away from its target state, toward a less desirable state where there is either too little activity (in the case of high inhibition) or too much activity (in the case of low inhibition). Consider the pattern of activity at time t and the pattern of activity at time t + 1. If inhibition is moving toward its normal level, then the activity pattern at time t + 1 will be closer to the target state than the activity pattern at time t. In this situation, we will use the CHL equation to adapt weights to make the pattern of activity at time t more like the pattern at time t + 1. However, if inhibition is moving away from its normal level, then the activity pattern at time t + 1 will be farther from the target state than the activity pattern at time t. In this situation, we will use the CHL equation to adapt weights to make the pattern of activity at time t + 1 more like the pattern at time t. These rules are formalized in equation 2.2:
\Delta W_{ij} =
\begin{cases}
\epsilon \left[ X_i(t+1)\, Y_j(t+1) - X_i(t)\, Y_j(t) \right] & \text{if inhibition is returning to its normal value} \\
\epsilon \left[ X_i(t)\, Y_j(t) - X_i(t+1)\, Y_j(t+1) \right] & \text{if inhibition is moving away from its normal value.}
\end{cases}
\qquad (2.2)
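Equation 2.2 can be sketched in a few lines; because the two branches differ only in sign, a single signed delta suffices. The function name and learning rate here are illustrative, not from the article.

```python
import numpy as np

def oscillating_chl_update(W, x_t, y_t, x_t1, y_t1, returning_to_normal, lr=0.1):
    """Equation 2.2: CHL applied to successive time steps. When inhibition is
    returning to its normal value, move the time-t activity pattern toward the
    time-(t+1) pattern; when inhibition is moving away from normal, apply the
    same update with the sign flipped."""
    delta = np.outer(x_t1, y_t1) - np.outer(x_t, y_t)
    sign = 1.0 if returning_to_normal else -1.0
    return W + lr * sign * delta
```

Applying the rule in the two phases with identical activity snapshots yields exactly opposite weight changes, which is the sign flip the text describes.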
Note that the two equations are identical except for a change in sign. These equations collectively serve the same functions as the normal-versus-high-inhibition and normal-versus-low-inhibition comparisons described earlier: competitors are punished when the network moves between normal and low inhibition and back again, and weak parts of the target representation are strengthened when the network moves between normal and high inhibition and back again. However, instead of changing weights by comparing snapshots taken at disparate points in time, equation 2.2 achieves the same goal by comparing temporally adjacent network states. Figure 1 summarizes the learning algorithm.

2.4 Network Architecture and Biological Relevance

Although we think the oscillating learning algorithm may be applicable to multiple brain structures, the work described here focuses on applying the algorithm to a neocortical network architecture. McClelland, McNaughton, and O’Reilly
Figure 1: Summary of the combined learning algorithm, showing how target and competitor activity change during different phases of the inhibitory oscillation and how these changes in activity affect learning. Moving from normal to high back to normal inhibition serves to identify and strengthen weak parts of the target pattern. Moving from normal to low back to normal inhibition serves to identify and punish competitors.
(1995) and many others (e.g., Hinton & Ghahramani, 1997; Grossberg, 1999) have argued that the goal of neocortical processing is to gradually develop an internal model of the structure of the environment that allows the network to generate predictions about unseen input features. According to the Complementary Learning Systems model developed by McClelland et al. (1995), one of the defining features of cortical learning is that the cortex assigns similar representations to similar inputs. This property allows the network to generalize to new patterns based on their similarity to previously encountered patterns. McClelland et al. contrast this with hippocampal learning, which (according to their model) involves assigning distinct representations to input patterns regardless of their similarity; this property allows the hippocampus to do one-trial memorization but hurts its ability to generalize to new patterns based on similarity (see also Marr, 1971; McNaughton & Morris, 1987; Rolls, 1989; Hasselmo, 1995; Norman & O’Reilly, 2003). To implement a model of cortical learning, our initial simulations used a simple two-layer network, comprising an input-output layer and a hidden layer. The network is shown in Figure 2. The input-output layer was used to present patterns to the network. The hidden layer was allowed to self-organize. Every input-output unit was connected to every input-output unit (including itself) and to every hidden unit. All of the synaptic connections were bidirectional and modifiable according to the dictates of the learning algorithm.
Figure 2: Diagram of the network used in our simulations. Patterns were presented to the lower part of the network (the input-output layer). The upper part of the network (the hidden layer) was allowed to self-organize. Every unit in the input-output layer was connected to every input-output unit (including itself) and to every hidden unit via modifiable, bidirectional weights. All of the simulations described in the article used an 80-unit input-output layer. The hidden layer contained 40 units except when specifically noted otherwise.
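The connectivity just described can be sketched as a single weight matrix. The layer sizes come from the text; the random initialization range and the absence of hidden-to-hidden connections are our assumptions, since the text does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)
N_IO, N_HID = 80, 40                # layer sizes given in the text
n = N_IO + N_HID

# Every input-output unit connects to every input-output unit (including
# itself) and to every hidden unit, via bidirectional, modifiable weights.
# Initialization range is an illustrative assumption.
W = rng.uniform(0.0, 0.5, size=(n, n))

# Hidden-to-hidden connections are not mentioned in the text, so this sketch
# assumes they are absent.
W[N_IO:, N_IO:] = 0.0
```

Indexing the first 80 rows/columns as input-output units and the remaining 40 as hidden units keeps both directions of each projection available in one matrix, matching the bidirectional weights in the figure.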
This architecture is capable of completing missing pieces of input patterns via input-layer recurrents and backprojections from the hidden layer. More important, the hidden layer gives it the ability to adaptively rerepresent the inputs to facilitate this process of pattern completion. An important goal of the simulations below is to assess whether the oscillating algorithm, applied to this simple cortical architecture, can simultaneously meet the following desiderata for cortical learning (McClelland et al., 1995; O’Reilly & Norman, 2002):
• The network should assign similar hidden-layer representations to similar inputs.

• After repeated exposure to a set of patterns, the network should be able to fill in missing pieces of those patterns, even if the patterns are highly correlated (insofar as real-world input patterns show a high degree of correlation; Simoncelli & Olshausen, 2001).

• The network should be able to generalize to input patterns that resemble (but do not exactly match) trained patterns.
2.5 General Simulation Methods

The simulation was implemented using a modified version of O’Reilly’s Leabra algorithm (O’Reilly & Munakata, 2000). Apart from a small number of changes listed below (most importantly, relating to the weight update algorithm and how we added an oscillating component to inhibition), all other aspects of the algorithm used here were identical to Leabra. (For a more detailed description of the Leabra algorithm, see O’Reilly & Munakata, 2000.) As per the Leabra algorithm, we explicitly simulated only excitatory units and excitatory connections between these units; we did not explicitly simulate inhibitory interneurons. Excitatory activity was controlled by means of a k-winner-take-all (kWTA) inhibitory mechanism (O’Reilly & Munakata, 2000; Minai & Levy, 1994). The kWTA algorithm sets the amount of inhibition for each layer to a value such that at most k units in that layer show activation values above .25 (fewer than k units will be active if excitatory input does not exceed the leak current, which exerts a constant drag on unit activation). According to this algorithm, all of the units in a layer receive the same amount of inhibitory input on a given time step, but the amount of inhibition can vary across layers. The kWTA algorithm can be viewed as a shortcut that captures the “set-point” role of inhibitory interneurons while reducing computational overhead (relative to explicitly simulating the neurons). The kWTA algorithm also makes it easy to specify the desired amount of activity in a layer by changing the k model parameter. The k parameter was set to k = 8 in both the input-output and hidden layers, except when specified otherwise. To implement the inhibitory oscillation required for the learning algorithm, we used the following procedure. First, at each time step, we used the kWTA algorithm to compute a baseline (normal) level of inhibition. Then we added an oscillating component to the baseline inhibition value.
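The two steps just described can be sketched as follows: a kWTA-style choice of layer-wide inhibition plus a sinusoidal component added for one cycle per trial. This is a toy sketch, not the Leabra implementation; function names and amplitude values are illustrative, while the 20-step onset delay and 80-step period follow the values given in the article.

```python
import math

def kwta_inhibition(net_input, k):
    """Toy kWTA: choose a layer-wide inhibition level between the k-th and
    (k+1)-th strongest net inputs, so that at most k units exceed it.
    (Leabra's actual kWTA computation is more involved.)"""
    ordered = sorted(net_input, reverse=True)
    if k >= len(ordered):
        return ordered[-1] - 1.0
    return 0.5 * (ordered[k - 1] + ordered[k])

def inhibition_offset(t, onset=20, period=80, max_amp=1.0, min_amp=-1.0):
    """Oscillating component added to the baseline inhibition: zero before the
    onset delay, then one sinusoidal cycle -- up to max_amp (high inhibition),
    back to normal, down to min_amp (low inhibition), back to normal."""
    if t < onset or t >= onset + period:
        return 0.0
    s = math.sin(2 * math.pi * (t - onset) / period)
    return max_amp * s if s >= 0 else -min_amp * s
```

On each time step, the total inhibition for the layer would then be `kwta_inhibition(net_input, k) + inhibition_offset(t)`, with positive offsets suppressing weakly supported target units and negative offsets letting competitors activate.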
The oscillating component was added only to the input-output layer, not the hidden layer. We limited the oscillation to the input-output layer because we wanted to build the simplest possible architecture that exhibits the desired learning dynamic. We found that adding oscillations to the hidden layer increases the complexity of the model’s behavior, but it does not substantially affect learning performance in either a positive or negative fashion (see appendix C for a concrete demonstration of this point). The magnitude of the oscillating component was varied in a sinusoidal fashion
from min to max (where min and max are negative and positive numbers, respectively).1 At the start of each training trial, the target pattern was soft-clamped onto the input-output layer. Over the course of a trial, inhibition was oscillated once from its normal value to the high-inhibition value, then back to normal, then down to the low-inhibition value, then back to normal. The onset of the inhibitory oscillation was delayed 20 time steps from the onset of the stimulus. This delay ensures that activity will reach its equilibrium state (corresponding to the retrieved memory) prior to the start of the oscillation. The period of the inhibitory oscillation was set to 80 time steps. This number was chosen because it gives the network enough time for changes in inhibition to lead to changes in activation, but no more time than was necessary. In principle, we could oscillate inhibition multiple times per stimulus. However, given the way that we calculated weight updates (see below), the effects of multiple inhibitory cycles could be simulated perfectly by staying with one oscillation per stimulus and increasing the learning rate. For a summary of key model parameters relating to the inhibitory oscillation (and other aspects of the model as well), see appendix A. At each time step (starting at the beginning of the inhibitory oscillation), weight updates were calculated using equation 2.2. However, these weight updates were not applied until the end of the trial. This policy makes it easier to analyze network behavior because weight changes cannot feed back and influence patterns of activation within a trial.

3 Simulations

In the following simulations, we explore the oscillating algorithm’s ability to meet the desiderata outlined in section 2.4. In particular, we are interested in the algorithm’s ability to support omnidirectional pattern completion, that is, its ability to recall any piece of a pattern when given the rest of the pattern as a cue.
Footnote 1: We chose values for min and max according to the following criteria: min has to be low enough to allow competitors to activate during the low-inhibition phase, but not so low that the network becomes epileptic. Max has to be high enough such that poorly supported target units turn off during the high-inhibition phase, but not so high that well-supported target units turn off also.

The use of the term omnidirectional sets this kind of pattern completion apart from asymmetric forms of pattern completion where, for example, the first half of the pattern can cue recall of the second half, but not vice versa. To illustrate the strengths and weaknesses of the oscillating algorithm, we compare it to O'Reilly's Leabra algorithm (O'Reilly, 1996; O'Reilly & Munakata, 2000). Leabra consists of two parts. The core of Leabra is a CHL-based error-driven learning rule, which we will refer to as Leabra-Error. In contrast to the oscillating algorithm, which uses changes in the strength of inhibition to generate patterns for CHL, the Leabra-Error algorithm learns by comparing the following two phases:
• A minus phase, where some features of the to-be-learned pattern are omitted, and the network has to fill in the missing features.

• A plus phase, where the entire to-be-learned pattern is clamped on.
The level of inhibition is kept constant across both phases. By comparing minus and plus patterns using CHL, the network learns to minimize the discrepancy between its "guess" about missing features and the actual pattern. The full version of Leabra complements Leabra-Error with a simple Hebbian learning rule that (during the plus phase) strengthens weights between sending and receiving units when they are both active and weakens weights when the receiving unit is active but the sending unit is not. This Hebbian rule was developed by Grossberg (1976), who called it instar learning; O'Reilly and Munakata (2000) describe the same algorithm using the name CPCA Hebbian Learning.2 Simulations conducted by O'Reilly (see, e.g., O'Reilly, 2001; O'Reilly & Munakata, 2000) have demonstrated that adding small amounts of CPCA Hebbian Learning to Leabra-Error boosts the learning performance of Leabra-Error by forcing it to represent meaningful input features in the hidden layer. As recommended by O'Reilly and Munakata (2000), our Leabra comparison simulations used a small proportion of CPCA Hebbian Learning (such that weight changes were 99% driven by Leabra-Error and 1% by CPCA Hebb).3 Finally, to compare the form of CHL inherent in Leabra-Error to the form of CHL inherent in the oscillating algorithm more directly, we also ran simulations using the Leabra-Error rule on its own (without any CPCA Hebbian Learning). Bias weight learning was turned off in the Leabra and Leabra-Error simulations in order to better match the oscillating-algorithm simulations (which did not include bias weight learning). In graphs of simulation results, error bars indicate the standard error of the mean, computed across simulated participants. When error bars are not visible, this is because they are too small relative to the size of the symbols on the graph (and thus are covered by the symbols).
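For reference, the two learning rules being contrasted can be sketched schematically. This is a simplified sketch with our own variable names and rate-coded activations; it is not the exact Leabra implementation.

```python
import numpy as np

def chl_update(act_plus, act_minus, lr=0.05):
    """Contrastive Hebbian learning: the weight change is the difference between
    unit coactivity in the plus (desired) and minus (less desired) states."""
    return lr * (np.outer(act_plus, act_plus) - np.outer(act_minus, act_minus))

def cpca_hebb_update(pre, post, w, lr=0.05):
    """Grossberg's instar / CPCA Hebbian rule: each active receiving unit moves
    its incoming weights toward the current sending pattern, strengthening
    weights from active senders and weakening weights from inactive senders."""
    return lr * post[None, :] * (pre[:, None] - w)
```

The key difference discussed in the text is not the CHL equation itself but where the minus state comes from: in the oscillating algorithm it is generated by perturbing inhibition, whereas in Leabra-Error it is generated by blanking part of the pattern.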
Footnote 2: It is important to emphasize that CPCA Hebbian Learning and Contrastive Hebbian Learning are completely different algorithms: the latter algorithm operates based on differences between two activation states, whereas the former algorithm operates based on single activity snapshots.

Footnote 3: O'Reilly and Munakata (2000) found that higher proportions of Hebbian learning hurt performance by causing the network to overfocus on prototypical features and to underfocus on lower-frequency features. Pilot simulation work, not published here, confirms that this was true in our Leabra simulations as well.
3.1 Simulation 1: Omnidirectional Pattern Completion as a Function of Input Pattern Overlap and Test Pattern Noise.

In this simulation, we explore the oscillating algorithm's ability to memorize both correlated and uncorrelated patterns. When given a large number of correlated input patterns, some self-organizing learning algorithms have a tendency to overrepresent shared features and underrepresent item-specific features, leading to poor recall of item-specific features (e.g., Norman & O'Reilly, 2003, discuss this problem as it applies to CPCA Hebbian Learning). In this section, we show that the oscillating algorithm is not subject to this problem. To the contrary, we show that the oscillating algorithm meets all three of the desiderata outlined in section 2.4:
• The oscillating algorithm outperforms both Leabra and Leabra-Error at recalling individuating features of highly correlated input patterns, in terms of both asymptotic capacity and training speed.

• The oscillating algorithm shows good generalization to test cues that do not exactly match stored patterns.

• The oscillating algorithm learns representations that reflect the similarity structure of the input space.
3.1.1 Methods

• Input pattern creation. We gave the network 200 binary input patterns to learn. Each pattern had 8 (out of 80) units active. To generate the patterns, we started with a single prototype pattern and then distorted the prototype by randomly turning off some number of (prototype) units and turning on an equivalent number of (nonprototypical) units. By varying the number of "flipped bits," we were able to vary the average overlap between input patterns. There were three overlap conditions: 57% average overlap (achieved by flipping two bits), 28% average overlap (achieved by flipping four bits), and 11% average overlap (achieved by flipping all eight bits). We call the last condition the unrelated pattern condition because the patterns do not possess any central tendency. In creating the patterns (for all of the levels of bit flipping noted), we implemented a minimum pairwise distance constraint, such that every input pattern differed from every other input pattern by at least two (out of eight) active bits.

• Training and testing. All three algorithms were repeatedly presented with the 200-pattern training set until learning reached asymptote. After each epoch of training, we tested pattern completion by measuring the network's ability to recall a single nonprototypical feature from each pattern, given all of the other features of that pattern as a retrieval cue. In the simulations reported here, recall was marked as correct if the activation of
the correct unit was larger than the activation of all of the other (noncued) input-output units.4 To assess the model's ability to generalize to test cues that do not exactly match studied patterns, we distorted retrieval cues by adding gaussian noise to the pattern that was clamped onto the network. Specifically, each unit's external input value was adjusted by a value sampled from a zero-mean gaussian distribution. These input values, once adjusted by noise, remained fixed throughout the trial. We manipulated the amount of noise at test by adjusting the variance of the noise distribution.

• Applying Leabra to omnidirectional pattern completion. For our Leabra and Leabra-Error simulations, we constructed minus phase patterns by randomly blanking out four of the eight units in the input pattern, thereby forcing the network to guess the correct values of these units. In the plus phase, we clamped the full eight-unit pattern onto the input layer. Every time that a pattern was presented, a different (randomly selected) set of four units was blanked. Otherwise, if the same four units were blanked each time, the learning algorithm would learn to recall those four units but not any of the other units.5

• Learning rates. Based on pilot simulations, we selected .0005 as our default learning rate for Leabra and Leabra-Error. Simulations using this learning rate yielded asymptotic capacity that was almost identical to the capacity achieved with lower learning rates, and training time was within acceptable bounds. For the oscillating algorithm, we were able to achieve near-peak performance with much higher learning rates. We found that a learning rate of .05 for the oscillating algorithm yielded the best combination of high final capacity and (relatively) short training time. The number of training epochs for each algorithm/learning-rate combination was adjusted to ensure that training lasted long enough to reach asymptote.
Large differences in learning rates were mirrored by commensurately large differences in training duration (e.g., the oscillating algorithm simulations with learning rate .05 took 250 epochs to reach asymptote; in contrast, Leabra simulations with learning rate .0005 took 10,000 epochs to reach asymptote).
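The prototype-distortion procedure described under "Input pattern creation" might be sketched as follows. The parameter names and the rejection-sampling treatment of the minimum pairwise distance constraint are our assumptions.

```python
import numpy as np

def make_patterns(n_patterns=200, n_units=80, n_active=8, n_flip=4,
                  min_diff=2, seed=0):
    """Distort a single prototype by turning off `n_flip` prototype units and
    turning on `n_flip` nonprototypical units; resample until every pair of
    patterns differs by at least `min_diff` active bits (i.e., a Hamming
    distance of 2 * min_diff)."""
    rng = np.random.default_rng(seed)
    proto_on = np.arange(n_active)            # the prototype's active units
    proto_off = np.arange(n_active, n_units)  # pool of nonprototypical units
    patterns = []
    while len(patterns) < n_patterns:
        off = rng.choice(proto_on, size=n_flip, replace=False)
        on = rng.choice(proto_off, size=n_flip, replace=False)
        p = np.zeros(n_units, dtype=int)
        p[proto_on] = 1
        p[off] = 0
        p[on] = 1
        # Minimum pairwise distance constraint: reject overly similar patterns.
        if all((p != q).sum() >= 2 * min_diff for q in patterns):
            patterns.append(p)
    return np.array(patterns)
```

With `n_flip` set to 2, 4, or 8, this reproduces the 57%, 28%, and 11% average-overlap conditions described in the text.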
Footnote 4: We have also run pattern completion simulations using a more restrictive recall criterion, whereby recall was marked "correct" if the activation of the correct unit was more than .5 and none of the incorrect units had activation more than .5. All of the advantages of the oscillating algorithm (relative to other algorithms) that are shown in simulation 1 and simulation 2 are also present when we use this more restrictive recall criterion.

Footnote 5: This is not the only way to train a Leabra network so it supports pattern completion. However, it is the most effective method that we were able to find. Any conclusions that we reach about Leabra and Leabra-Error are restricted to the particular variants that we used in our simulations and may not apply to simulations where other methods are used to generate partial patterns for the minus phase.
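The test-cue distortion and the recall-scoring criteria (the default criterion from the main text and the stricter criterion from footnote 4) might be sketched as follows; function and parameter names are ours.

```python
import numpy as np

def distort_cue(pattern, noise_sd, rng):
    """Add zero-mean gaussian noise to the external input clamped at test.
    The noisy values stay fixed for the whole trial (section 3.1.1)."""
    return pattern + rng.normal(0.0, noise_sd, size=pattern.shape)

def recall_correct(act, target, cued, strict=False):
    """Score one pattern completion trial.

    Default criterion: the to-be-recalled unit's activation exceeds that of
    every other noncued unit. Strict criterion (footnote 4): the target
    exceeds .5 and no incorrect unit exceeds .5.
    """
    others = ~np.asarray(cued)
    others[target] = False          # exclude the target unit itself
    if strict:
        return bool(act[target] > 0.5 and np.all(act[others] <= 0.5))
    return bool(act[target] > act[others].max())
```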
Figure 3: This figure shows the number of patterns (out of 200) successfully recalled at the end of training by each algorithm, as a function of the amount of noise applied to retrieval cues at test (x-axis: test pattern noise x 10^-2) and the amount of overlap between input patterns: (A) unrelated patterns; (B, C) correlated patterns with 57% and 28% average overlap. Leabra and Leabra-Error outperform the oscillating algorithm given low input pattern overlap and low levels of test pattern noise. However, for higher levels of input pattern overlap and test pattern noise, the oscillating algorithm outperforms Leabra and Leabra-Error.
3.1.2 Results

• Capacity. Figure 3 shows the asymptotic number of patterns learned for the oscillating algorithm, Leabra, and Leabra-Error. For unrelated input patterns and low levels of test pattern noise, the oscillating algorithm learns approximately 150 of 200 patterns, but it does less well than both Leabra and Leabra-Error. However, for higher levels of test pattern noise and higher levels of input pattern overlap, the relative position of the algorithms reverses, and the oscillating algorithm performs substantially better than Leabra and Leabra-Error. Appendix C shows that the oscillating algorithm's advantage for highly overlapping inputs is still obtained when inhibition is oscillated in the hidden layer (in addition to the input-output layer).
Figure 4: This figure plots, for the oscillating algorithm, Leabra, and Leabra-Error, the average pairwise overlap between patterns in the hidden layer (at the end of training), as a function of input-pattern overlap (unrelated, 28%, 57%). Hidden-layer overlap is lower for the oscillating algorithm than for Leabra and Leabra-Error.
• Hidden representations. The oscillating algorithm's superior performance for high levels of input pattern overlap and test pattern noise stems from its ability to maintain reasonable levels of pattern separation on the hidden layer, even when inputs are very similar. Figure 4 plots the average pairwise overlap between patterns in the hidden layer (at the end of training) as a function of input overlap.6 The figure shows that all three algorithms maintain good pattern separation in the hidden layer given low input overlap, but as input overlap increases, hidden overlap increases much more sharply in the Leabra and Leabra-Error simulations than in the simulations using the oscillating algorithm. The high level of hidden layer overlap in the Leabra and Leabra-Error simulations facilitates recall
Footnote 6: Both input overlap and hidden overlap were operationalized using a cosine similarity measure; this measure ranges from zero (no overlap) to one (maximal overlap).
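The overlap measure from footnote 6, and the input-hidden similarity score discussed later in this section, might be computed as follows (a sketch; function names are ours).

```python
import numpy as np

def cosine_overlap(a, b):
    """Overlap between two activity patterns (footnote 6):
    0 = no overlap, 1 = maximal overlap."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def input_hidden_similarity(input_pats, hidden_pats):
    """Correlation, across all pattern pairs, between input-layer overlap and
    hidden-layer overlap: the similarity score discussed later in the section."""
    n = len(input_pats)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    x = [cosine_overlap(input_pats[i], input_pats[j]) for i, j in pairs]
    y = [cosine_overlap(hidden_pats[i], hidden_pats[j]) for i, j in pairs]
    return float(np.corrcoef(x, y)[0, 1])
```

A score of 1 would mean the hidden layer perfectly preserves the similarity structure of the input space; the text reports a score of roughly .5 for the oscillating algorithm.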
of shared features but makes it difficult for the network to recall the unique features of individual patterns. This problem is especially severe given high levels of test pattern noise. When hidden representations are located close together, this increases the odds that, given a noisy input pattern, the network will slip out of the correct attractor into a neighboring attractor.

The oscillating algorithm's good pattern separation in the high-overlap condition is due in large part to its ability to punish competitors. If the representations of two patterns (call them pattern A and pattern B) get too close to each other, then pattern A will start appearing as a competitor (during the low-inhibition phase) during study of pattern B, and vice versa. Assuming that both A and B are presented a large number of times at training, the ensuing competitor punishment will have the effect of pushing apart the hidden layer representations of A and B so they no longer compete with one another.

Another factor that contributes to the oscillating algorithm's good performance is its ability to focus learning on features that are not already well learned. Given a large number of correlated patterns, the oscillating algorithm stops learning about prototypical features relatively early in training (once their representation is strong enough to resist increased inhibition) and focuses instead on learning idiosyncratic features of individual items (which are less able to resist increased inhibition). Reducing learning of prototypical features, relative to item-specific features, improves pattern separation and (through this) pattern completion performance. While the oscillating algorithm shows more pattern separation than Leabra and Leabra-Error, it still possesses the key property that it assigns similar hidden representations to similar stimuli.
In this respect, the oscillating algorithm (applied to this two-layer cortical network) differs strongly from hippocampal architectures that automatically assign distinct representations to stimuli (e.g., Norman & O'Reilly, 2003).

To quantify the oscillating algorithm's tendency to use similarity-based representations, we computed the correlation (across all pairs of patterns) between input-layer overlap and hidden-layer overlap. Figure 5 plots this input-hidden similarity score for the oscillating algorithm, Leabra, and Leabra-Error as a function of input pattern overlap. The average similarity score for the oscillating algorithm is approximately .5. For the values of input pattern overlap plotted here, the similarity scores for the oscillating algorithm are higher than the scores for Leabra-Error but lower than the scores for Leabra.

The observed difference between Leabra and the oscillating algorithm can be viewed in terms of a simple trade-off: Leabra learns representations that are true to the structure of the input space, but (given similar input patterns) this fidelity leads to high hidden layer overlap that hurts recall. The oscillating algorithm gives up a small amount of this fidelity, but as a result of this sacrifice, it is much better able to recall given high levels of input pattern overlap and noisy test cues. Furthermore, we should emphasize that the trade-off observed here is a direct consequence of the limited
Figure 5: This figure plots, for the oscillating algorithm, Leabra, and Leabra-Error, the models' tendency to assign similar hidden representations to similar input patterns (the input-hidden similarity score), as a function of input pattern overlap (unrelated, 28%, 57%). See the text for more detail on how this similarity score was computed. Similarity scores for the oscillating algorithm are higher than similarity scores associated with Leabra-Error and lower than similarity scores associated with Leabra.
size of the hidden layer. When hidden layer size is increased, the oscillating algorithm is able to utilize the extra hidden units to simultaneously boost similarity scores and capacity; we demonstrate this point in appendix B.

• Training speed. Finally, in addition to measuring capacity, we can also evaluate how quickly the various algorithms reach their asymptotic capacity. Across all of the conditions described above, the oscillating algorithm learns more quickly than Leabra and Leabra-Error. To illustrate this point, we selected two conditions where asymptotic capacity was approximately matched between the oscillating algorithm and either Leabra or Leabra-Error. Specifically, we compared the oscillating algorithm and Leabra-Error for unrelated input patterns with .06 test pattern noise; we also compared the oscillating algorithm and Leabra for input patterns with 57% overlap and zero test pattern noise. For each of these conditions, we plotted the time course of learning across epochs, for a variety of Leabra and Leabra-Error learning rate values, in Figure 6.
Figure 6: Training speed for the oscillating algorithm, Leabra, and Leabra-Error (number of patterns learned as a function of epochs of training, log scale from 1e+0 to 1e+5). The left-hand figure plots the time course of training for the oscillating algorithm (learning rate .05) and Leabra-Error (learning rates .03, .01, and .0005) for unrelated input patterns and .06 test pattern noise; the right-hand figure plots the time course of training for the oscillating algorithm and Leabra (same learning rates) for 57% input overlap and zero test pattern noise. In both figures, the oscillating algorithm training curves are located to the left of the Leabra and Leabra-Error training curves, across a wide variety of Leabra and Leabra-Error learning rate values. Error bars were omitted from the graph for visual clarity. For all of the points shown here, the standard error was less than 3.5.
In Figure 6, the oscillating algorithm training curves lie to the left of the Leabra and Leabra-Error training curves across the full range of Leabra and Leabra-Error learning rates that we tested (ranging from .0005 to .03). While it is possible to increase the initial speed of learning in Leabra and Leabra-Error by increasing the learning rate parameter, this also has the effect of lowering the asymptotic capacity of the Leabra and Leabra-Error networks to below the asymptotic capacity of the oscillating algorithm. The Leabra and Leabra-Error variants used here learn more slowly than the oscillating algorithm because they learn about only a subset of intraitem associations on each trial. For example, if the top four units are blanked during the minus phase in Leabra-Error, the network will learn how to complete from the bottom four units to the top four units, but it will not learn how to complete from the top four to the bottom four. Thus, it takes multiple passes through the study set (blanking a different set of units each time) for Leabra-Error to strengthen all of the different connections that are
required to support omnidirectional pattern completion. In contrast, the oscillating algorithm has the ability to learn about the whole pattern on each trial.

3.2 Simulation 2: Three-Layer Autoencoder.

In this simulation, we set out to replicate and extend the results of simulation 1 using a different network architecture: the three-layer autoencoder (Ackley et al., 1985). These networks consist of an input layer that is connected to a hidden layer, which in turn is connected to an output layer. During training, the to-be-learned pattern is presented (in its entirety) to the input layer, and activity is allowed to propagate through the network. The goal of autoencoder learning is to adjust network weights so the network is able to reconstruct a copy of the input pattern on the output layer. The main difference between the autoencoder architecture and the two-layer architecture used in simulation 1 is that, in the autoencoder architecture, there are no direct connections between the input units that receive external input (from the retrieval cue) at test and the to-be-retrieved output unit; everything has to go through the hidden layer. Thus, the autoencoder architecture constitutes a more stringent test of whether a learning algorithm can develop information-preserving hidden representations that support pattern completion. We compared the oscillating algorithm's ability to learn patterns in the autoencoder architecture to Leabra and Leabra-Error. Also, the way the network was structured (with distinct input and output layers) made it possible for us to explore two additional comparison algorithms: Almeida-Pineda recurrent backpropagation and standard (nonrecurrent) backpropagation. The results of this simulation replicate the key finding from simulation 1: the oscillating algorithm outperforms the comparison algorithms at omnidirectional pattern completion when both input pattern overlap and test pattern noise are high.
However, unlike in simulation 1, the oscillating algorithm also outperforms the comparison algorithms in tests with unrelated input patterns and low levels of test pattern noise. We attribute this latter finding to the fact that the oscillating algorithm automatically strengthens weak connections between target units. In contrast, error-driven algorithms like backpropagation and Leabra-Error strengthen connections between target units only if this is needed to reduce error at training.

3.2.1 Autoencoder Methods

• Network architecture. To implement an autoencoder architecture, we added an output layer to the network, so the network was composed of an 80-unit input layer, a 40-unit hidden layer, and an 80-unit output layer. The input layer had a full bidirectional projection to the hidden layer, and the hidden layer had a full bidirectional projection to the output layer. To maximize comparability with our initial simulations, every input unit was directly connected to every other input unit, and every output unit was directly connected to every other output unit (the one crucial difference
was that, in the autoencoder simulations, the input units were not directly connected to the output units). We used the same connection parameters as in simulation 1; all connections were modifiable. The one exception to the scheme outlined here was the backpropagation autoencoder, which had only feedforward connections (from the input layer to the hidden layer and from the hidden layer to the output layer).

• Training and testing. All of the algorithms were given 150 patterns to memorize. The training set was repeatedly presented (in a different order each time) until learning reached asymptote. The details of training for each algorithm are presented below. After each training epoch, we tested pattern completion. On each test trial, we left out one feature from the input pattern and measured how well the network was able to recall the missing feature on the output layer. As with all of the simulations described earlier, we only tested recall of item-specific features (i.e., features that were not part of the prototype pattern), and we scored recall as being correct based on whether the to-be-recalled output unit was more active than all of the other (nontarget) units in the output layer. Finally, as in simulation 1, we manipulated the level of input pattern overlap and also explored how distorting test cues (by adding gaussian noise to the test patterns) affected pattern completion performance.7

• Methods for oscillating algorithm simulations. We trained the oscillating algorithm version of the autoencoder by simultaneously presenting the same pattern to both the input and output layers. During training, inhibition was oscillated on the input and output layers but not on the hidden layer. The input layer and output layer oscillation parameters were identical to each other and identical to the parameters that we used in simulation 1.

• Methods for Leabra and Leabra-Error simulations. The Leabra and Leabra-Error autoencoder simulations used a two-phase design.
In the minus phase, the complete target pattern was clamped onto the input layer (but not the output layer) and the network was allowed to settle. In the plus phase, the complete target pattern was clamped onto both the input and output layers and the network was allowed to settle. Otherwise the details of the Leabra and Leabra-Error simulations were the same as in simulation 1.

• Methods for recurrent and nonrecurrent backpropagation simulations. For our simulations using the Almeida-Pineda (A-P) recurrent backpropagation algorithm (Almeida, 1989; Pineda, 1987), we used the
Footnote 7: A given amount of test pattern distortion had less of an effect in simulation 2 than in simulation 1 because in simulation 2 we were only distorting the input layer pattern, whereas in simulation 1 we were applying the distortion to a shared input-output layer (so the distortion had a direct effect on output activity, in addition to an indirect effect via the distorted input pattern). To compensate for this difference, we used a wider range of test pattern noise values in simulation 2 than in simulation 1.
rbp++ program contained in the PDP++ neural network software package.8 For our simulations using the (nonrecurrent) backpropagation algorithm (Rumelhart, Hinton, & Williams, 1986), we used the bp++ program contained in the PDP++ software package. For both sets of simulations (rbp++ and bp++), we used the default learning parameters built in to the software package (e.g., momentum = .9), except we changed the learning rate (as described below) and, as with all of the other simulations in this article, turned off bias weight learning.

• Learning rates. We used a learning rate of .0005 for all four comparison algorithms. The oscillating algorithm simulations used a learning rate of .05. We allowed each algorithm to run until learning reached asymptote (10,000–20,000 epochs for Leabra, Leabra-Error, and A-P recurrent backpropagation; 100,000 epochs for feedforward backpropagation; 250 epochs for the oscillating algorithm).9

3.2.2 Results of Autoencoder Simulations.

The results of our autoencoder capacity simulations are shown in Figure 7. The oscillating algorithm outperforms the comparison algorithms in every condition, with the sole exception of high input overlap, low test pattern noise (where the oscillating algorithm's performance is comparable to backpropagation). As in simulation 1, the oscillating algorithm's advantage over other algorithms is larger for high test pattern noise than for low test pattern noise. The most notable difference between simulation 1 and simulation 2 is that, in this simulation, the oscillating algorithm outperforms the comparison algorithms given low input pattern overlap and no test pattern noise (whereas the opposite is true in simulation 1). The fact that the oscillating algorithm shows better pattern completion than the four comparison algorithms can be explained as follows:
• Pattern completion performance is a direct function of the strength of links between features within a pattern.

• Standard autoencoder training (as embodied by the four comparison algorithms) does not force the network to learn strong associations between input features.
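The second point can be illustrated with a hypothetical toy example: a linear "autoencoder" whose identity weights represent every feature individually achieves zero reconstruction error, yet it cannot pattern-complete, because no weight links one feature to another. This example is ours, purely for illustration.

```python
import numpy as np

# Hypothetical toy: a linear 4-4-4 "autoencoder" with identity weights, so each
# hidden unit copies exactly one input feature (features are represented, not linked).
W_in = np.eye(4)    # input -> hidden
W_out = np.eye(4)   # hidden -> output

full = np.array([1.0, 1.0, 0.0, 1.0])
recon = W_out @ (W_in @ full)       # reconstruction error is zero on the full pattern

cue = full.copy()
cue[3] = 0.0                        # blank one feature at test
completed = W_out @ (W_in @ cue)    # the blanked feature is NOT filled in:
                                    # no weight ties it to the other features
```

An algorithm that only minimizes reconstruction error has no pressure to move away from this kind of solution; the oscillating algorithm's inhibitory "stress test" does, because units must receive collateral support from other units to survive increased inhibition.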
In order to minimize reconstruction error, autoencoders need to represent all of the input features somewhere in the hidden layer, but they do not have to link these features. In many situations, the autoencoder can minimize reconstruction error by representing features individually and then
Footnote 8: This software can be downloaded from Randy O'Reilly's PDP++ web site at the University of Colorado: http://psych.colorado.edu/~oreilly/PDP++/PDP++.html.

Footnote 9: We also ran backpropagation simulations with larger learning rates, and the results were qualitatively similar to the results obtained here.
Figure 7: Results of three-layer autoencoder simulations where we manipulated input pattern overlap and test pattern noise. Number of patterns learned is plotted against test pattern noise (x 10^-2) for the oscillating algorithm, A-P backpropagation, standard backpropagation, Leabra, and Leabra-Error, at three levels of input pattern overlap (unrelated, 28%, and 57% average overlap). The oscillating algorithm performs better than the other algorithms in all conditions except for high input overlap, low test pattern noise (where the oscillating algorithm's performance is comparable to backpropagation). In general, the oscillating algorithm's advantage is accentuated for high levels of test pattern noise.
reconstructing the input on a feature-by-feature basis. This strategy leads to poor pattern completion performance. In contrast, the oscillating learning algorithm places strong pressure on the network to form direct associations between the features of to-be-memorized patterns (because units need collateral support from other units in order to withstand the "stress test" of increased inhibition). These strong links result in good pattern completion performance.

4 Discussion

The research presented here shows how oscillations in the strength of neural inhibition can facilitate learning. Specifically, lowering inhibition can be used to identify competing memories so they can be punished, and raising
inhibition can be used to identify weak parts of memories so they can be strengthened. The specific weight change equation (CHL) that we use in the oscillating algorithm is not novel. Rather, the novel claim is that changes in the strength of inhibition can be used to generate minus (i.e., less desirable) patterns to feed into the CHL equation. In this section, we provide a brief overview of the primary virtues of the oscillating learning algorithm relative to the other algorithms considered here. Then we discuss how the oscillating algorithm relates to neural data on oscillations and learning and how the oscillating algorithm relates to the BCM rule (Bienenstock, Cooper, & Munro, 1982).

4.1 Functional Properties of the Learning Algorithm.

In section 3, we showed that the oscillating algorithm (applied to a cortical network architecture) meets the desiderata for cortical learning outlined earlier: good completion of overlapping patterns (after repeated exposure to those patterns), good generalization to retrieval cues that do not exactly match studied patterns, and good correspondence between the structure of the input patterns and the hidden representations (i.e., similar input patterns tend to get assigned similar hidden representations). We attributed the oscillating algorithm's good performance for overlapping inputs and noisy test cues to its ability to punish competing memories. Whenever memories start to blend together, they start to compete with one another at retrieval, and the competitor punishment mechanism pushes them apart. In this manner, the oscillating algorithm retains good pattern separation in the hidden layer (see Figure 4) even when inputs overlap strongly.
As discussed earlier, this extra separation is not without costs (e.g., it incrementally degrades the hidden layer’s ability to represent the structure of the input space, compared to Leabra), but the costs are small relative to the following benefit: maintaining good separation between representations helps to ensure that memories can be accurately stored and accessed even in difficult situations (e.g., when there are many similar memories stored in the system, and the cue only slightly favors one memory over the other). The fact that the oscillating algorithm outperforms all of the comparison algorithms in simulation 2 (pattern completion with an autoencoder architecture), even for unrelated input patterns, points to another key property of the algorithm: it automatically probes for weak parts of the attractor (by raising inhibition) and strengthens these weak parts. This automatic probing and strengthening ensures that the network will be able to pattern-complete from one arbitrary subpart of the pattern to another, regardless of whether that particular partial pattern has been encountered before. In contrast, the other algorithms that we examined (e.g., Leabra-Error) show a large performance hit when the partial patterns used to cue retrieval at test do not exactly match the patterns used to cue retrieval (during the minus phase) at training.
How Inhibitory Oscillations Can Train
1601
4.2 Relating the Oscillating Algorithm to Neural Theta Oscillations. We think that neural theta oscillations (and theta-dependent learning processes) may serve as the neural substrate of the oscillating learning algorithm. Theta oscillations have been observed in humans in both neocortex and the hippocampus. Raghavachari et al. (2001) found that cortical theta was gated by stimulus presentation during a memory experiment, and Rizzuto et al. (2003) found that theta phase is reset by stimulus onset. Both findings indicate that theta oscillations are present at the right time to support stimulus memorization. Other findings point to a more direct link between theta and synaptic plasticity. In a recent study, Seager et al. (2002) found that eyeblink conditioning occurred more quickly when animals were trained during periods of high versus low hippocampal theta power (see Berry & Seager, 2001, for a review of similar studies). Also, Huerta and Lisman (1996) induced theta oscillations in a hippocampal slice preparation and found that the direction of plasticity (long-term potentiation versus long-term depression) depends on the phase of theta (see also Holscher, Anwyl, & Rowan, 1997, for a similar result in anesthetized animals, and Hyman, Wyble, Goyal, Rossi, & Hasselmo, 2003, for a similar result in behaving animals). The finding that LTP is obtained during one phase of theta and LTD is obtained during another phase fits very well with the oscillating algorithm’s postulate that one part of the inhibitory oscillation (going from normal to high to normal inhibition) is primarily concerned with strengthening target memories, and the other part of the oscillation (going from normal to low to normal inhibition) is primarily concerned with weakening competitors. Although this result is very encouraging, more work is needed to explore the mapping between the LTP/LTD findings, and our model. 
The mapping is not straightforward because the studies noted above used local field potential to index theta, and it is unclear how much local field potential is driven by excitatory versus inhibitory neurons. One could reasonably ask why we think the oscillation in our algorithm relates to theta oscillations as opposed to, say, alpha or gamma oscillations. Functionally, the frequency of the oscillation in our algorithm is bounded by two constraints. First, the oscillation has to be fast enough such that the oscillation completes at least one full cycle (and ideally more) when a stimulus is presented. This rules out slow oscillations (less than 1 Hz). Also, if the oscillation is too fast relative to the speed of spreading activation in the network, competitors will not have a chance to activate during the low-inhibition phase. This constraint rules out very fast oscillations. Thus, although we are not certain that theta is the correct frequency, the functional constraints outlined here and the findings relating theta to learning (outlined above) make this a possibility worth pursuing. 4.3 Applications to Hippocampal Architectures. Although this article has focused on cortical network architectures, we also think that our ideas
about theta (that theta can help to selectively strengthen weak target units and punish competitors) may be applicable to hippocampal architectures. At this time, there are several theories (other than ours) regarding how theta oscillations might contribute to hippocampal processing. For example, Hasselmo, Bodelon, and Wyble (2002) argue that theta oscillations help tune hippocampal dynamics for encoding versus retrieval, such that dynamics are optimized for encoding during one phase of theta and dynamics are optimized for retrieval during another phase of theta. Hasselmo et al.’s model varies the relative strengths of different excitatory projections as a function of theta (to foster encoding versus retrieval) but does not vary inhibition. In contrast, our model varies the strength of inhibition but does not vary the strength of excitatory inputs. At this time, it is unclear how our model relates to Hasselmo et al.’s model. We do not see any direct contradictions between our model and Hasselmo et al.’s model (insofar as they manipulate different model parameters as a function of theta), so it seems possible that the two models could be merged, but further simulation work is needed to address this question.
4.4 Relating the Oscillating Algorithm to BCM. In this section, we briefly review another algorithm (the BCM algorithm: Bienenstock et al., 1982) that can be viewed as implementing competitor punishment. Like the CPCA Hebbian Learning rule, the BCM algorithm is set up to learn about clusters of correlated features. The main difference between BCM and CPCA Hebbian Learning relates to the circumstances under which synaptic weakening (LTD) occurs. CPCA Hebbian Learning reduces synaptic weights when the receiving unit is active but the sending unit is not. In contrast, BCM reduces synaptic weights from active sending units when the receiving unit’s activation is above zero but below its average level of activation. Thus, BCM actively pushes away input patterns from weakly activated hidden units. This form of synaptic weakening can be construed as a form of competitor punishment: if a memory receives enough input to activate its hidden representation but not enough to fully activate that representation, that memory is weakened. In contrast, if a memory does not receive enough input to activate its hidden representation, the memory is not affected. The main functional difference between competitor punishment in BCM versus the oscillating algorithm is that BCM can punish competitors only if their representations show above-zero (but below-average) activation. In contrast, the oscillating algorithm actively probes for competitors (by lowering inhibition) and is therefore capable of punishing competitors even if they are not active given normal levels of inhibition. This “active probing” mechanism should result in much more robust competitor punishment. Importantly, BCM’s form of competitor punishment and the oscillating algorithm’s form of competitor punishment are not mutually exclusive. It is possible that combining the algorithms would result in better performance
than either algorithm taken in isolation. We will explore ways of integrating BCM with the oscillating learning algorithm in future research.
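The two weakening conditions contrasted in this section can be written side by side. This is a schematic sketch with hypothetical function names and a simplified (non-sliding) BCM threshold, not the authors' code:

```python
def cpca_dw(x, y, w, lrate=0.01):
    """CPCA Hebbian rule: dw = lrate * y * (x - w).

    Weakening occurs when the receiving unit (y) is active but the
    sending unit (x) is inactive: then x - w < 0 and w is reduced.
    """
    return lrate * y * (x - w)


def bcm_dw(x, y, y_avg, lrate=0.01):
    """BCM-style rule: dw = lrate * x * y * (y - theta).

    The modification threshold theta is tied to the receiving unit's
    average activity (a real BCM implementation uses a sliding time
    average; passing y_avg directly is a simplification).  Weights from
    active senders are weakened when 0 < y < y_avg (competitor
    punishment) and strengthened when y > y_avg.
    """
    return lrate * x * y * (y - y_avg)
```

With an active sender (x = 1) and a weakly active receiver (y = 0.2, below y_avg = 0.5), `bcm_dw` is negative, but with y = 0 it is exactly zero: the memory is punished only if it receives enough input to become partially active, exactly the contrast with the oscillating algorithm's active probing described above.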
4.5 Applying the Oscillating Algorithm to Psychological Data. In this article, we have focused on functional properties of the learning algorithm (e.g., its capacity for learning patterns, given different levels of overlap). Another way to constrain the model is to explore its ability to simulate detailed patterns of psychological data. In one line of research, we have used the model to account for several key findings relating to retrieval-induced forgetting (Anderson, 2003; see section 1.1 for more discussion of this phenomenon). For example, the model can explain the finding that forgetting of competing items is cue independent (Anderson & Spellman, 1995), the finding that competitor punishment effects are larger when subjects are asked to retrieve the target versus when they are just shown the target (Anderson, Bjork, & Bjork, 2000), and the finding that the amount of competitor punishment is proportional to the strength of the competitor (Anderson et al., 1994). This modeling work is described in detail in Norman, Newman, and Detre (2006).
5 Conclusions

The research presented here started with a psychological puzzle: How can we account for data showing that competitors are punished? In the course of addressing this issue, we found that competitor punishment mechanisms can boost networks’ ability to learn highly overlapping patterns (by ensuring that hidden representations do not collapse together). We also observed that the changing inhibition aspect of our algorithm bears a strong resemblance to neural theta oscillations. As such, this research may end up speaking to the longstanding puzzle of how theta oscillations contribute to learning. The challenge now is to follow up the admittedly preliminary results presented here with a more detailed assessment of how the basic principles of the oscillating algorithm (competitor punishment via decreased inhibition and selective strengthening via increased inhibition) can shed light on psychological, neural, and functional issues.
Appendix A: Model Parameters

A.1 Basic Network Parameters. At the beginning of each simulation, all of the weights were initialized to random values from the uniform distribution centered on .5 with range = .4. The initial weight values were symmetric, such that the initial weight from unit i to unit j was equivalent to the initial weight from unit j to unit i. This symmetry was maintained
through learning because the weight update equations are symmetric. Other model parameters were as follows:

Parameter                    Value
stm gain                     0.4
input/output layer dtvm      0.2
hidden layer dtvm            0.15
i kwta pt                    0.325
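The initialization procedure described in section A.1 can be sketched as follows. This is a hypothetical helper (the actual simulations were run in the Leabra framework); zeroing the diagonal is an assumption:

```python
import numpy as np

def init_symmetric_weights(n, center=0.5, width=0.4, seed=None):
    """Initialize an (n, n) weight matrix as described in Appendix A.1:
    values drawn uniformly from the distribution centered on `center`
    with range `width`, then symmetrized so that w[i, j] == w[j, i].
    Zeroing the diagonal (no self-connections) is an assumption."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(center - width / 2.0, center + width / 2.0, size=(n, n))
    i_upper = np.triu_indices(n, k=1)
    w[(i_upper[1], i_upper[0])] = w[i_upper]  # mirror upper triangle to lower
    np.fill_diagonal(w, 0.0)
    return w
```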
Apart from the parameters mentioned above, all other parameters shared by the oscillating learning algorithm and Leabra were set to their Leabra default values.

A.2 Oscillation Parameters. The oscillating component of inhibition was varied from min = −1.21 to max = 1.96. As per equation 2.2, the sign of the learning rate was shifted from positive to negative depending on whether inhibition was moving toward its normal (midpoint) value or away from its normal value. The network was given 20 time steps to settle into a stable state before the onset of the inhibitory oscillation. Figure 8 shows how inhibition was oscillated on each trial and how the sign of the learning rate was changed as a function of the phase of the inhibitory oscillation.

Appendix B: Effects of Hidden Layer Size

In this simulation, we show that the oscillating algorithm can take advantage of additional hidden layer resources to store more patterns and to more accurately represent the structure of the input space. Specifically, we explored the effect of increased hidden layer size (120 versus 40 units) on the oscillating algorithm, Leabra, and Leabra-Error. The hidden layer k value was adjusted as a function of hidden layer size to ensure that (on average) 20% of the hidden units would be active for both the 120-hidden-unit simulations and the 40-hidden-unit simulations. Input patterns had 57% average overlap, and we used a test pattern noise value of .04.

Figure 9 shows the effects of hidden layer size on capacity and on the fidelity of the network’s representations (as indexed by our “similarity score” measure). There are two important results. First, increasing hidden layer size boosts the number of patterns learned by the oscillating algorithm. The effect of increasing hidden layer size is numerically larger for the oscillating algorithm than for Leabra and Leabra-Error, so the capacity advantage for the oscillating algorithm is preserved in the 120-hidden-unit condition.
The second important result is that for all three algorithms, our input-hidden similarity metric (as described in section 3.1.2) is substantially larger for the large network simulations: For the oscillating algorithm, the
[Figure 8: line graph; x-axis “Time Elapsed (Number of Time Steps)” (0–100), left y-axis “Learning Rate” (−0.06 to 0.06), right y-axis “Inhibitory Oscillation” (−2 to 3); series: Learning Rate, Inhibitory Oscillation.]

Figure 8: Illustration of how inhibition was oscillated on each trial. At each time step, the inhibitory oscillation component depicted on this graph was added to the value of inhibition computed by the kWTA algorithm. The graph also shows how the sign of the learning rate was set to a positive value when the inhibitory oscillation was moving toward its midpoint, and it was set to a negative value when the inhibitory oscillation was moving away from its midpoint.
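The oscillation schedule depicted in Figure 8 can be sketched as follows. The min/max values and the 20-step settling period come from Appendix A.2; the sinusoidal shape, the single cycle per trial, and the learning-rate magnitude are illustrative assumptions:

```python
import math

def oscillation_schedule(n_steps=100, settle=20, osc_min=-1.21, osc_max=1.96,
                         lrate_mag=0.03):
    """Per-trial schedule of (inhibitory oscillation component, learning rate).

    The oscillation component is added to the inhibition computed by the
    kWTA algorithm.  The learning rate is positive while the oscillation
    moves toward its midpoint and negative while it moves away, as in
    Figure 8.  osc_min/osc_max and the settling period follow Appendix A.2.
    """
    mid = (osc_min + osc_max) / 2.0
    amp = (osc_max - osc_min) / 2.0
    schedule = []
    for t in range(n_steps):
        if t < settle:
            schedule.append((0.0, 0.0))  # let the network settle first
            continue
        phase = 2.0 * math.pi * (t - settle) / (n_steps - settle)
        osc = mid + amp * math.sin(phase)
        rising = math.cos(phase) > 0.0
        toward_mid = (osc > mid and not rising) or (osc < mid and rising)
        schedule.append((osc, lrate_mag if toward_mid else -lrate_mag))
    return schedule
```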
similarity score is .576 for the 40-unit simulation and .788 for the 120-unit simulation.
Appendix C: Effects of Hidden Layer Oscillations

For simplicity, the oscillating algorithm simulations oscillated inhibition in the input-output layer but not the hidden layer. In this simulation, we show that the same qualitative pattern of results is obtained when inhibition is simultaneously oscillated in both the input-output layer and in the hidden layer. We selected hidden layer oscillation parameters such that over the course of training, the effect of the inhibitory oscillation on network activity (operationalized as the difference in average network activation from the peak of the inhibitory oscillation to the trough of the oscillation) was approximately
[Figure 9: two bar graphs. (A) “Capacity as a Function of Hidden-Layer Size” (57% average input overlap, test noise .04): number of patterns learned (0–200) versus number of hidden units (40, 120). (B) “Similarity Score as a Function of Hidden-Layer Size” (57% average input overlap): input-hidden similarity score (0–1.0) versus number of hidden units (40, 120). Series: oscillating algorithm, Leabra, Leabra-Error.]

Figure 9: (A) Capacity scores (number of patterns learned out of 200) for the oscillating algorithm, Leabra, and Leabra-Error, given 57% input pattern overlap and .04 test pattern noise. Increasing hidden layer size increases the number of patterns learned by the oscillating algorithm, and the oscillating algorithm continues to perform well relative to Leabra and Leabra-Error. (B) Input-hidden similarity scores (see simulation 1 for how these were calculated) given 57% input pattern overlap. All three algorithms show better similarity scores for the larger network.
[Figure 10: line graph, “Effect of Hidden-Layer Oscillations on Capacity” (57% average input overlap): number of patterns learned (0–200) versus test pattern noise (0–16 × 10⁻²). Series: oscillating algorithm with hidden oscillations, oscillating algorithm without hidden oscillations, Leabra, Leabra-Error.]

Figure 10: Capacity scores (number of patterns learned out of 200) for the oscillating algorithm with and without hidden layer oscillations, given 57% input pattern overlap. Results for Leabra and Leabra-Error are included for comparison purposes. The same qualitative pattern of results is present both with and without hidden layer oscillations.
equated for the input-output layer and the hidden layer.10 The simulation used input patterns with 57% overlap. The results of this simulation are shown in Figure 10. The oscillating algorithm continues to outperform Leabra and Leabra-Error at learning highly overlapping patterns, even with the addition of oscillations in the hidden layer. We did not attempt to fine-tune the performance of the model once we added hidden oscillations, so the detailed pattern of results obtained here (e.g., the fact that the network performed slightly better without hidden oscillations) should not be viewed as reflecting parameter-independent properties of the model. Rather, these results constitute an existence proof that the oscillating algorithm advantages that we find in our “default parameter” simulations (without hidden layer oscillations) can also be observed in simulations with comparably sized hidden layer and input-output-layer oscillations.

Footnote 10: To equate the average effect of the inhibitory oscillation on activity in the input-output layer versus the hidden layer, we used a much smaller oscillation in the hidden layer than in the input-output layer: hidden oscillation min = −0.18 and max = 0.10; for the input oscillation, we used our standard max = 1.96 and a slightly smaller than usual min = −1.11; we set the learning rate to .03. The input layer inhibitory oscillation needs to be large in order to offset the strong (excitatory) external input coming into the target units. Hidden units do not receive this strong external input, so less inhibition is required to deactivate them during the high-inhibition phase.

Acknowledgments

This research was supported by NIH grant R01MH069456, awarded to K.A.N.

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Almeida, L. B. (1989). Backpropagation in nonfeedforward networks. In I. Aleksander (Ed.), Neural computing. London: Kogan Page.
Anderson, M. C. (2003). Rethinking interference theory: Executive control and the mechanisms of forgetting. Journal of Memory and Language, 49, 415–445.
Anderson, M. C., Bjork, E. L., & Bjork, R. A. (2000). Retrieval-induced forgetting: Evidence for a recall-specific mechanism. Memory and Cognition, 28, 522.
Anderson, M. C., Bjork, R. A., & Bjork, E. L. (1994). Remembering can cause forgetting: Retrieval dynamics in long-term memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 5, 1063–1087.
Anderson, M. C., & Spellman, B. A. (1995). On the status of inhibitory mechanisms in cognition: Memory retrieval as a model case. Psychological Review, 102, 68.
Berry, S. D., & Seager, M. A. (2001). Hippocampal theta oscillations and classical conditioning. Neurobiology of Learning and Memory, 76, 298–313.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex.
Journal of Neuroscience, 2(2), 32–48.
Buzsaki, G. (2002). Theta oscillations in the hippocampus. Neuron, 33, 325–340.
Fox, E. (1995). Negative priming from ignored distractors in visual selection. Psychonomic Bulletin and Review, 2, 145–173.
Freedman, J. L. (1965). Long-term behavioral effects of cognitive dissonance. Journal of Experimental Social Psychology, 1, 145–155.
Glucksberg, S., Newsome, M. R., & Goldvarg, G. (2001). Inhibition of the literal: Filtering metaphor-irrelevant information during metaphor comprehension. Metaphor and Symbolic Activity, 16, 277–293.
Grossberg, S. (1976). Adaptive pattern classification and universal recoding I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121–134.
Grossberg, S. (1999). How does the cerebral cortex work? Learning, attention, and grouping by the laminar circuits of visual cortex. Spatial Vision, 12, 163–186.
Hasselmo, M. E. (1995). Neuromodulation and cortical function: Modeling the physiological basis of behavior. Behavioural Brain Research, 67, 1–27.
Hasselmo, M. E., Bodelon, C., & Wyble, B. P. (2002). A proposed function for hippocampal theta rhythm: Separate phases of encoding and retrieval enhance reversal of prior learning. Neural Computation, 14, 793–818.
Hinton, G. E. (1989). Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1, 143–150.
Hinton, G. E., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society (London) B, 352, 1177–1190.
Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart, J. L. McClelland, & PDP Research Group (Eds.), Parallel distributed processing, Vol. 1: Foundations (pp. 282–317). Cambridge, MA: MIT Press.
Holscher, C., Anwyl, R., & Rowan, M. J. (1997). Stimulation on the positive phase of hippocampal theta rhythm induces long-term potentiation that can be depotentiated by stimulation on the negative phase in area CA1 in vivo. Journal of Neuroscience, 17, 6470.
Huerta, P. T., & Lisman, J. E. (1996). Synaptic plasticity during the cholinergic theta-frequency oscillation in vitro. Hippocampus, 49, 58–61.
Hyman, J. M., Wyble, B. P., Goyal, V., Rossi, C. A., & Hasselmo, M. E. (2003). Stimulation in hippocampal region CA1 in behaving rats yields long-term potentiation when delivered to the peak of theta and long-term depression when delivered to the trough. Journal of Neuroscience, 23, 11725–11731.
Marr, D. (1971). Simple memory: A theory for archicortex. Philosophical Transactions of the Royal Society (London) B, 262, 23–81.
Mayr, U., & Keele, S. (2000). Changing internal constraints on action: The role of backward inhibition. Journal of Experimental Psychology: General, 1, 4–26.
McClelland, J. L., McNaughton, B. L., & O’Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102, 419–457.
McNaughton, B. L., & Morris, R. G. M. (1987). Hippocampal synaptic enhancement and information storage within a distributed memory system. Trends in Neurosciences, 10(10), 408–415.
Minai, A. A., & Levy, W. B. (1994). Setting the activity level in sparse random networks. Neural Computation, 6, 85–99.
Movellan, J. R. (1990). Contrastive Hebbian learning in the continuous Hopfield model. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, & G. E. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School (pp. 10–17). San Mateo, CA: Morgan Kaufmann.
Norman, K. A., Newman, E. L., & Detre, G. J. (2006). A neural network model of retrieval-induced forgetting (Tech. Rep. No. 06-1). Princeton, NJ: Princeton University, Center for the Study of the Brain, Mind, and Behavior.
Norman, K. A., & O’Reilly, R. C. (2003). Modeling hippocampal and neocortical contributions to recognition memory: A complementary-learning-systems approach. Psychological Review, 4, 611–646.
O’Reilly, R. C. (1996). The Leabra model of neural interactions and learning in the neocortex. Unpublished doctoral dissertation, Carnegie Mellon University.
O’Reilly, R. C. (2001). Generalization in interactive networks: The benefits of inhibitory competition and Hebbian learning. Neural Computation, 13, 1199–1242.
O’Reilly, R. C., & Munakata, Y. (2000). Computational explorations in cognitive neuroscience: Understanding the mind by simulating the brain. Cambridge, MA: MIT Press.
O’Reilly, R. C., & Norman, K. A. (2002). Hippocampal and neocortical contributions to memory: Advances in the complementary learning systems framework. Trends in Cognitive Sciences, 12, 505–510.
Pineda, F. J. (1987). Generalization of backpropagation to recurrent neural networks. Physical Review Letters, 18, 2229–2232.
Raghavachari, S., Kahana, M. J., Rizzuto, D. S., Caplan, J. B., Kirschen, M. P., Bourgeois, B., Madsen, J. R., & Lisman, J. E. (2001). Gating of human theta oscillations by a working memory task. Journal of Neuroscience, 9, 3175–3183.
Rizzuto, D. S., Madsen, J. R., Bromfield, E. B., Schulze-Bonhage, A., Seelig, D., Aschenbrenner-Scheibe, R., & Kahana, M. J. (2003). Reset of human neocortical oscillations during a working memory task. Proceedings of the National Academy of Sciences, 13, 7931–7936.
Rolls, E. T. (1989). Functions of neuronal networks in the hippocampus and neocortex in memory. In J. H. Byrne & W. O. Berry (Eds.), Neural models of plasticity: Experimental and theoretical approaches (pp. 240–265). San Diego, CA: Academic Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & PDP Research Group (Eds.), Parallel distributed processing, Vol. 1: Foundations (pp. 318–362). Cambridge, MA: MIT Press.
Seager, M. A., Johnson, L. D., Chabot, E. S., Asaka, Y., & Berry, S. D. (2002). Oscillatory brain states and learning: Impact of hippocampal theta-contingent training. Proceedings of the National Academy of Sciences, 99, 1616–1620.
Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24, 193–216.
Toth, K., Freund, T. F., & Miles, R. (1997). Disinhibition of rat hippocampal pyramidal cells by GABAergic afferents from the septum. Journal of Physiology, 500, 463–474.
Received November 8, 2004; accepted December 13, 2005.
LETTER
Communicated by Christopher Moore
Temporal Decoding by Phase-Locked Loops: Unique Features of Circuit-Level Implementations and Their Significance for Vibrissal Information Processing Miriam Zacksenhouse [email protected] Sensory-Motor Integration Laboratory, Technion Institute of Technology, Haifa, Israel
Ehud Ahissar [email protected] Department of Neurobiology, Weizmann Institute, Rehovot, Israel
Rhythmic active touch, such as whisking, evokes a periodic reference spike train along which the timing of a novel stimulus, induced, for example, when the whiskers hit an external object, can be interpreted. Previous work supports the hypothesis that the whisking-induced spike train entrains a neural implementation of a phase-locked loop (NPLL) in the vibrissal system. Here we extend this work and explore how the entrained NPLL decodes the delay of the novel, contact-induced stimulus and facilitates object localization. We consider two implementations of NPLLs, which are based on a single neuron or a neural circuit, respectively, and evaluate the resulting temporal decoding capabilities. Depending on the structure of the NPLL, it can lock in either a phase- or co-phase-sensitive mode, which is sensitive to the timing of the input with respect to the beginning of either the current or the next cycle, respectively. The co-phase-sensitive mode is shown to be unique to circuit-based NPLLs. Concentrating on temporal decoding in the vibrissal system of rats, we conclude that both the nature of the information processing task and the response characteristics suggest that the computation is sensitive to the co-phase. Consequently, we suggest that the underlying thalamocortical loop should implement a circuit-based NPLL.

Neural Computation 18, 1611–1636 (2006)  © 2006 Massachusetts Institute of Technology

1 Introduction

One of the major computational tasks facing the vibrissal somatosensory system is to determine the angle of the vibrissa on contact with an external obstacle. The vibrissal system receives external sensory input from the trigeminal neurons whose response patterns include both whisking-locked spikes and contact-induced spikes (Szwed, Bagdasarian, & Ahissar, 2003). The whisking-locked spike train provides a periodic reference input at the whisking frequency. The contact-induced activity represents the
timing of the novel event of interest. When whisking frequency is consistent across cycles, the resulting computational task is equivalent to decoding the temporal delay or phase shift of a novel input with respect to a reference periodic input (Ahissar & Zacksenhouse, 2001), a basic computational task shared by other active sensory tasks, including vision (Ahissar & Arieli, 2001).

At the algorithmic level (Marr, 1982), it was suggested that this computation can be performed by phase-locked loops (PLLs) (Ahissar & Vaadia, 1990; Ahissar, Haidarliu, & Zacksenhouse, 1997; Ahissar & Zacksenhouse, 2001). PLLs are (electronic) circuits that can lock to the frequency of their external input and perform important processing tasks, including frequency tracking and demodulation (Gardner, 1979). One of the major motivations for this hypothesis is based on the implementation level. Specifically, neuronal implementations of circuit-based PLLs (NPLLs), like the one shown in Figure 1 and detailed in section 2, require neuronal oscillators whose frequencies can be controlled by the input rate (rate-controlled oscillator, RCO) (Ahissar et al., 1997; Ahissar, 1998). Thus, the evidence that over 10% of the individual neurons in the somatosensory cortex can operate as controllable neural oscillators (Ahissar et al., 1997; Ahissar & Vaadia, 1990; Amitai, 1994; Flint, Maisch, & Kriegstein, 1997; Lebedev & Nelson, 1995; Nicolelis, Baccala, Lin, & Chapin, 1995; Silva, Amitai, & Connors, 1991) provided the initial motivation and further support for the hypothesis that these neurons function as RCOs in circuit-based NPLLs. Other requirements for implementing PLLs in the vibrissal system and agreement with the model predictions have also been demonstrated:
- The frequencies of the local cortical oscillators can be increased by local glutamatergic excitation (Ahissar et al., 1997).
- These oscillators can track the whisker frequency (Ahissar et al., 1997).
- The whisker frequency is encoded in the latency of the response of thalamic neurons (Ahissar, Sosnik, & Haidarliu, 2000; Sosnik, Haidarliu, & Ahissar, 2001).
- Thalamic neurons respond after (and not before, as would be expected from relay neurons) cortical neurons (Nicolelis et al., 1995), as predicted by a thalamocortical PLL (Ahissar et al., 1997).
While these investigations focused on the response of the vibrissal system to the reference, whisking-induced input, the response to the novel contact-induced input was not investigated in detail. The purpose of this article is to investigate and demonstrate how NPLLs respond to the novel contact-induced input and assess the resulting temporal decoding capabilities. Furthermore, we address the issue of whether circuit-based NPLLs provide any computational advantages over single-neuron implementations (Hoppensteadt, 1986). Specifically, we distinguish between two
locking modes, which are sensitive to either the phase or co-phase of the input (the normalized delay of the input with respect to the preceding or succeeding oscillatory event, respectively). It is shown that single neurons can implement only phase-sensitive NPLLs, while circuit-based NPLLs can implement both. In the context of the vibrissal thalamocortical system, both the response characteristics and the nature of the information processing task suggest that the computation should be sensitive to the co-phase and thus should be implemented by circuit-based NPLLs.

Section 2 develops a mathematical model of NPLLs and describes four possible variants and their respective characteristics. Section 3 investigates the temporal decoding capabilities provided by the different NPLLs and evaluates them with respect to the temporal decoding task performed by the vibrissal system. Section 4 investigates the response characteristics of cortical oscillators and determines which of the four NPLL variants they implement. The information processing capabilities of NPLLs implemented by single neural oscillators and by neural circuits are discussed in section 5, considering both temporal decoding and temporal pattern generation.

2 Mathematical Modeling of PLLs

Different neuronal implementations of the well-known electronic PLLs (Gardner, 1979) are possible (Ahissar, 1998), including, for example, the neuronal circuit of Figure 1. The instantaneous frequency of the neural oscillator depends on its intrinsic frequency and the rate of its input (rate-controlled oscillator, RCO). The input to the RCO is generated by an ensemble of phase-detecting neurons, grouped together as a PD, whose ensemble output rate depends on the delay between the external spike and the RCO-evoked spike.
When the NPLL locks to the external input, the instantaneous frequency of the internal RCO tracks the instantaneous frequency of the external input, and the deviation from the intrinsic frequency is encoded in the output rate of the PD (Ahissar et al., 1997; Ahissar, 1998).

2.1 Phase Models. The activity of a neural oscillator may involve a single spike or a burst of spikes, which repeat periodically. It is natural to describe the periodic activity as a function of a phase variable (Rand, Cohen, & Holmes, 1986). By normalizing the phase of the oscillator θosc(t) to a unit interval, that is, θosc ∈ [0, 1], it describes the fraction of the elapsed cycle. When the phase reaches the unit level, it resets to zero, and the oscillator generates a single spike or a burst of spikes. The phase of a free oscillator varies at a constant rate whose inverse determines its intrinsic period τosc, so θ̇osc = 1/τosc (Zacksenhouse, 2001). The input to the oscillator affects the rate at which the phase changes and thus the period of the oscillator. In general, the effect may depend on the complete history of the input. However, here we assume that upon completing an oscillatory cycle and generating a spike in the case of a neural
M. Zacksenhouse and E. Ahissar
Figure 1: Neuronal phase-locked loop (NPLL). (A) Schematic illustration of an NPLL, which includes a phase detector (PD) and a rate-controlled oscillator (RCO). (B) Schematic illustration of a particular PD, the subthreshold-activated correlation PD: input events, marked by upward arrows, arrive from either an external source or the internal oscillator and produce subthreshold activation of fixed strength and duration. The subthreshold activations are summed and evoke a fixed-rate response when the threshold is crossed. Thus, the PD responds when the subthreshold activations from both the internal and external sources overlap. Other implementations are discussed in the text and depicted in Figures 2 and 3.
oscillator, the oscillator is reset independent of its history. Thus, the period is assumed to vary only as a function of the phase of the input during the current cycle. Specifically, the instantaneous frequency during the nth cycle is θ̇osc = 1/τosc + h(θosc | {ηk}_{k=N(tn)+1}^{N(t)}), where ηk is the time of occurrence of the kth input event, N(t) is the number of input events that occurred up to time t (the counting process; Snyder, 1975), and tn is the time of occurrence of the nth oscillatory event (and the start of the nth cycle). The function h(θosc | {ηk}_{k=N(tn)+1}^{N(t)}) describes the effect of the input events that occur during the nth cycle and depends in general on their time of occurrence and the phase of the oscillator. The above effect may be simplified in two extreme but very important cases: pulse-coupled oscillators and rate-controlled oscillators. In the first case, the effect of an isolated input event, usually from a single source, is short compared with the inter-event interval, and in the extreme assumed instantaneous. In the second case, the effects from different input events, coming usually from different sources, are highly overlapping, so the rate rather than the timing of the individual events determines the overall effect.

2.1.1 Pulse-Coupled Oscillators. The instantaneous effect of the input is described by θ̇osc = 1/τosc ± f(θosc) δ(t − ηN(t)), where f(θosc) is known as the phase-response curve (PRC) (Perkel, Schulman, Bullock, Moore, & Segundo, 1964; Kawato & Suzuki, 1978; Winfree, 1980; Yamanishi, Kawato, & Suzuki, 1980; Zacksenhouse, 2001). Upon integration,
θosc = t/τosc ± Σ_{k=N(tn)+1}^{N(t)} f(θosc(ηk)),
and the perturbed period τp(n) is given by

τp(n) = τosc [ 1 ∓ Σ_{k=N(tn)+1}^{N(tn+τp)} f(θosc(ηk)) ].
When only one input event occurs during the oscillatory cycle, the modified period is

τp(n) = τosc [1 ∓ f(ϕ(n))],    (2.1)
where ϕ(n) = (η_{N(tn)+1} − tn)/τosc is the phase of the oscillator at the time of occurrence of that input event. As will be further discussed in section 4, a single pulse-coupled oscillator is equivalent to a PLL. However, a PLL may also be implemented by a
(neural) circuit that includes an RCO, whose characteristics are detailed below.

2.1.2 Rate-Controlled Oscillator (RCO). For simplicity, we assume that the RCO response depends on its input spike rate r(t), independent of its phase, so θ̇RCO = 1/τRCO ± hRCO(r(t)), where τRCO denotes the intrinsic period of the RCO and hRCO describes the effect of the input rate on the instantaneous frequency. Upon integration, the perturbed period is given by

τp(n) = τRCO [ 1 ∓ ∫_{tn}^{tn+τp} hRCO(r(t)) dt ].

This can be expressed in terms of the lumped rate parameter, R(n), which describes the integrated effect of the input to the RCO during its nth cycle on the duration of that cycle, as

τp(n) = τRCO [1 ∓ R(n)],    R(n) = ∫_{tn}^{tn+τp} hRCO(r(t)) dt.    (2.2)
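Equation 2.2 can be sketched numerically for a linear hRCO, in which case R(n) is proportional to the spike count within the cycle. Because τp appears in its own integration bound, the sketch below resolves it by fixed-point iteration; the helper name, gain, and parameter values are illustrative assumptions:

```python
def rco_perturbed_period(input_spikes, t_n, tau_rco, gain=0.02, inhibitory=True):
    """Perturbed period of a rate-controlled oscillator (eq. 2.2) with a
    linear h_RCO: the lumped rate parameter R(n) is `gain` times the number
    of input spikes falling inside the cycle [t_n, t_n + tau_p)."""
    tau_p = tau_rco
    for _ in range(20):  # fixed point: tau_p sets its own integration bound
        R = gain * sum(1 for s in input_spikes if t_n <= s < t_n + tau_p)
        # inhibitory input lengthens the period, excitatory shortens it
        tau_p = tau_rco * (1 + R if inhibitory else 1 - R)
    return tau_p

# ten early spikes: inhibition stretches a 100 ms cycle to 120 ms
spikes = [0.005 * i for i in range(10)]
tau_p = rco_perturbed_period(spikes, t_n=0.0, tau_rco=0.1)
assert abs(tau_p - 0.12) < 1e-9
```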
Thus, the lumped rate parameter, R(n), describes the integrated effect of the input during the nth cycle, with the net effect of either shortening or lengthening the period, as denoted by the ∓ sign, respectively. These effects are usually associated with excitatory and inhibitory inputs, respectively, although intrinsic currents may cause the reverse effect (Jones, Pinto, Kaper, & Kopell, 2000; Pinto, Jones, Kaper, & Kopell, 2003). In the linear case, that is, linear hRCO, the lumped parameter R(n) is proportional to the total number of spikes that occur during the nth cycle. The RCO can be implemented by an integrate-and-fire neuron as analyzed and simulated in Zacksenhouse (2001). When the integrate-and-fire RCO is embedded in an inhibitory PLL, the lumped rate parameter is an approximately linear function of the total number of spikes (Zacksenhouse, 2001, equation A8). An RCO that is embedded in a PLL receives its input from a PD, whose response characteristics are analyzed next.

2.2 Phase Detectors. The PD receives input from two sources, the external input and the internal RCO, and converts the interval between them into an output spike rate. For unique decoding, the conversion should be monotonic, with either a decreasing or increasing response (Ahissar, 1998). In particular, the PD may compute the correlation between the two inputs and respond maximally when the interval is zero (correlation-based PD, or Corr-PD). Alternatively, the PD may compute the time difference between its two inputs and respond minimally when the interval is zero (difference-based PD, or Diff-PD) (Kleinfeld, Berg, & O'Connor, 1999).
2.2.1 Input Representation. The mathematical formulation of the computation performed by the PD depends on the representation of its input signals, which may be analog, binary, or discrete (Gardner, 1979). Analog signals are described by waveforms (usually sinusoidal) that vary with the phase of the cycle. Binary signals are described by rectangular waveforms whose onset is taken to be the origin. Discrete signals consist of discrete events that occur once per cycle, at a particular phase, which is taken to be the origin. In the context of neural implementations, discrete signals may describe the spike trains from single oscillating neurons, binary signals may describe the spike trains from bursting neurons, and analog signals may describe the average firing rate from a population of neurons and postsynaptic potentials. The main difference between these representations is the information they provide (or do not provide) about the phase variable. Analog signals may provide continuous indication of the phase, while binary and discrete signals provide information only at a specific phase (the origin). (It is noted that binary and discrete signals may be derived from each other and are essentially equivalent. In particular, the zero crossing of the rectangular wave is a discrete signal, and discrete signals may be used to generate rectangular waves using a memory device that is switched on for a fixed duration whenever a discrete event occurs.) The nature of operation of the PD is directly related to the nature of its inputs. The phase between analog inputs is detected using multiplier circuits, or mixers (Gardner, 1979), which operate as correlation-based PDs. The phase between binary or discrete signals is detected using logical devices (Gardner, 1979), which may operate as either correlation-based PDs (e.g., binary AND-gate) or difference-based PDs (e.g., binary Exclusive-OR gate) (Ahissar, 1998). 
Considering the spiking nature of neuronal signaling, we adopt the discrete, or equivalently the binary, representations in this work. These representations facilitate the unified investigation and comparison of PLLs with correlation-based and difference-based PDs. We use the term event to reflect either an event in a discrete representation or the rising edge of a rectangular signal in a binary representation. In the following mathematical formulation, significant phase variables are defined by normalizing the corresponding time intervals with respect to the intrinsic period of the RCO, τRCO. In particular, each input event is localized with respect to the preceding and the succeeding RCO events, as shown in Figure 2. The normalized intervals between the kth input event and the preceding or succeeding RCO events are referred to as the phase, ϕ(k), and co-phase, ψ(k), respectively. The normalized intervals since the last input or RCO events are denoted by θi and θo, respectively. Equation 2.2 describes how the period of the RCO is perturbed by the input it receives from the PD. The external input is composed of two spike trains: (1) a reference spike train described by a free oscillator with period τip and normalized input period ζ = τip/τRCO, and (2) a novel spike whose timing with respect
Figure 2: Phase relationships between input events, marked by upward arrows ending at the horizontal axis, and RCO-generated events, marked by upward arrows originating at the horizontal axis, and the corresponding response of a subthreshold-activated PD. (A, B) Cases in which the input is lagging or leading the RCO events, respectively. The horizontal axes gauge time normalized by the period of the intrinsic RCO, τRCO, so they correspond directly to the phase. The normalized intervals between the kth input event and the preceding or succeeding RCO events are referred to as the phase, ϕ(k), and co-phase, ψ(k), respectively. Each event causes a subthreshold activity during a window of normalized duration θw, which, for clarity, are marked for only one pair of events in each panel. The resulting responses evoked by that pair of events are depicted on the short axes, for a correlation-based and a difference-based PD, respectively.
to the reference spike train carries the information to be detected by the PLL.

2.2.2 Correlation-based PD. Two types of simple correlation-based PDs are analyzed in detail and shown to have similar response characteristics, which are then abstracted to characterize the response of general correlation-based PDs.

Simple example of a threshold-activated PD. This is the case depicted in Figure 1B. Each of the inputs to the PD, coming from either the external input or the internal RCO, produces excitatory subthreshold activation for a normalized duration θW. The correlation-based PD responds at a constant rate when these activities overlap, as shown schematically in Figure 2. Consequently, the instantaneous output rate of the PD is given by

rCorr(θi, θo) = r0 U(θW − θi) U(θW − θo),
(2.3)
where r0 is the constant output rate and U(·) is the unit function (a function that is 1 when its argument is positive and 0 elsewhere). In general, the duration of the subthreshold activation may depend on whether the input is coming from the external input or the internal RCO, but for simplicity of notation, this difference is ignored, and a single θW is used. As derived in equation 2.2, the lumped effect of the PD on the period of the RCO depends on the lumped rate parameter R. Assuming a linear case (and, without loss of generality, a unit gain), the lumped rate parameter is given by integrating the instantaneous rate given in equation 2.3. Consequently, an input event that occurs at a phase ϕ(k) and co-phase ψ(k) would result in a lumped rate parameter

RCorr(k) = { r0(θW − ϕ(k))  if ϕ(k) < θW,
             r0(θW − ψ(k))  if ψ(k) < θW,
             0              otherwise.    (2.4)
The duration of the subthreshold activation θW is assumed to be short enough that at most one of the first two conditions holds during regular operation (i.e., θW < τRCO/2). In the vibrissal system, the duration of individual reference signals, that is, whisking-locked responses of individual first-order trigeminal neurons, is indeed shorter than half of the whisking cycle and is usually confined to the protraction (forward movement of the whiskers) period (Szwed et al., 2003). When the RCO's period is locked to the whisking period, the above relationship would hold. Considering the order of input and RCO events that cause the PD to respond, we refer to the first and second cases as input lagging and leading, respectively.
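The piecewise-linear response of equation 2.4 can be written out directly; the following sketch uses illustrative parameter values (not from the text):

```python
def r_corr(phase, cophase, theta_w, r0):
    """Lumped response of the linear correlation-based PD (eq. 2.4); with
    theta_w < 1/2 at most one of the two window conditions can hold."""
    if phase < theta_w:        # input lagging the RCO event
        return r0 * (theta_w - phase)
    if cophase < theta_w:      # input leading the RCO event
        return r0 * (theta_w - cophase)
    return 0.0                 # activations never overlap

# the response falls off linearly with the relevant phase variable
assert abs(r_corr(0.1, 0.9, theta_w=0.25, r0=2.0) - 0.3) < 1e-9
assert r_corr(0.6, 0.6, theta_w=0.25, r0=2.0) == 0.0
```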
Figure 3: Gated PD. The external input evokes a burst of spikes, which is gated by the PD. The onset of the gating is determined by the intrinsic oscillator (RCO).
Simple example of a gated PD. In this case, the external input is composed of a burst of spikes (Szwed et al., 2003), which is gated by the PD, as shown in Figure 3. The onset of the burst is determined by the external input, while the onset of the gating is determined by the RCO. The burst lasts for a normalized duration of θburst and, for simplicity, is assumed to have a constant rate of rip. The duration of the gating window is θgate and, for simplicity, is assumed to equal the duration of the input burst, so θgate = θburst ≡ θW. Thus, the instantaneous output rate of the gated PD is the same as that for the constant-rate PD, with r0 = rip, and equation 2.4 describes the resulting lumped rate parameter due to an isolated external event.

General correlation-based PD. According to equation 2.4, the response of a correlation-based PD decreases linearly with the relevant phase variable (the phase or the co-phase when the input is lagging or leading, respectively). In general, the response may be nonlinear, but its derivative should characteristically be negative:

RCorr(k) = { g(ϕ(k))  if ϕ(k) < θW  (input lagging),
             g(ψ(k))  if ψ(k) < θW  (input leading),
             0         otherwise,

where dg(x)/dx ≤ 0.    (2.5)
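The gated variant reduces to the same lumped response: the PD output is the burst rate times the burst/gate overlap. A sketch with illustrative default parameters (names and values are assumptions, not from the text):

```python
def gated_pd_lumped(offset, theta_w=0.2, r_ip=5.0):
    """Lumped output of the gated correlation PD of Figure 3: an input
    burst of rate r_ip and normalized duration theta_w is relayed only
    while the RCO-opened gate (same duration) is open.  `offset` is the
    signed delay between burst and gate onsets (positive: input lagging,
    negative: input leading)."""
    overlap = max(theta_w - abs(offset), 0.0)   # burst/gate overlap
    return r_ip * overlap

# reproduces the threshold-activated case of eq. 2.4 with r0 = r_ip
assert abs(gated_pd_lumped(0.05) - 5.0 * (0.2 - 0.05)) < 1e-9
assert gated_pd_lumped(0.3) == 0.0              # no overlap, no response
```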
Equations 2.4 and 2.5 specify the response of a linear or nonlinear correlation-based PD, respectively, to a pair of input and RCO events.

2.2.3 Difference-based PD. Two types of difference-based PDs, analogous to the ones considered above, are analyzed in detail and shown to have similar response characteristics, which are abstracted to characterize the response of general difference-based PDs.

Simple example of a threshold-activated PD. Here the external input events evoke superthreshold activation of fixed strength and duration, while the RCO events evoke inhibitory activation of a similar strength and duration. The difference-based PD responds at a constant rate when the overall activation is superthreshold, that is, when an external event but not an RCO event occurred during the last window θW, as shown schematically in Figure 2. Consequently, the instantaneous output rate is given by

rDiff(θi, θo) = r0 U(θW − θi) U(θo − θW).
(2.6)
In the linear case considered in the context of the correlation-based PD, the lumped rate parameter is given by

RDiff(k) = { r0 ϕ(k)   if ϕ(k) < θW,
             r0 ψ(k)   if ψ(k) < θW,
             r0 θW     otherwise.    (2.7)

The window is assumed short enough that at most one of the first two conditions, corresponding to input lagging or leading, respectively, holds during regular operation.

Simple example of a gated PD. In this case, the external input involves a burst of spikes, which is relayed by the PD except for the duration of the gate, which blocks the PD response. Using the parameters defined above, the instantaneous output rate of the gated PD is the same as that for the constant-rate PD, and equation 2.7 describes the resulting lumped rate parameter due to an isolated external event.

General difference-based PD. According to equation 2.7, the response of a difference-based PD increases linearly with the phase or co-phase in the respective working regions. In general, the response of a difference-based PD may be nonlinear, but its derivative should characteristically be positive:

RDiff(k) = { g(ϕ(k))  if ϕ(k) < θW  (input lagging),
             g(ψ(k))  if ψ(k) < θW  (input leading),
             g(θW)    otherwise,

where dg(x)/dx ≥ 0.    (2.8)
Table 1: PLL Characterization.

                   Correlation-Based PD                         Difference-Based PD
                   ePLL (ζ ≤ 1)          iPLL (ζ ≥ 1)          ePLL (ζ ≤ 1)      iPLL (ζ ≥ 1)
Input              Lagging               Leading               Leading           Lagging
Relevant phase     Phase ϕ               Co-phase ψ            Co-phase ψ        Phase ϕ
Steady phase       ϕ∞ = g⁻¹[1 − ζ]       ψ∞ = g⁻¹[ζ − 1]       ψ∞ = g⁻¹[1 − ζ]   ϕ∞ = g⁻¹[ζ − 1]
Linear case        ϕ∞ = θw − (1 − ζ)/r0  ψ∞ = θw − (ζ − 1)/r0  ψ∞ = (1 − ζ)/r0   ϕ∞ = (ζ − 1)/r0
As ζ → 1           ϕ∞ ↑ θw               ψ∞ ↑ θw               ψ∞ ↓ 0            ϕ∞ ↓ 0
Equations 2.7 and 2.8 specify the response of a linear or nonlinear difference-based PD, respectively, to a pair of input and RCO events.
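Equation 2.7 can be sketched in the same form as its correlation-based counterpart, with the slope reversed and a saturated (rather than zero) off-window response; parameter values are illustrative:

```python
def r_diff(phase, cophase, theta_w, r0):
    """Lumped response of the linear difference-based PD (eq. 2.7)."""
    if phase < theta_w:        # input lagging the RCO event
        return r0 * phase
    if cophase < theta_w:      # input leading the RCO event
        return r0 * cophase
    return r0 * theta_w        # saturated response outside both windows

# monotonically increasing in the relevant phase variable, unlike eq. 2.4
assert r_diff(0.05, 0.9, 0.25, 2.0) < r_diff(0.15, 0.9, 0.25, 2.0)
assert abs(r_diff(0.5, 0.5, 0.25, 2.0) - 2.0 * 0.25) < 1e-9
```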
2.3 PLL Stable Response. During stable 1:1 phase entrainment to a periodic, external input, the response of the PD (and thus of the PLL) is sensitive to either the phase or the co-phase, depending on the type of the PD and its connection to the RCO. A PLL in which the PD connection to the RCO is excitatory is referred to as an excitatory PLL (ePLL), while a PLL in which the PD is connected to the RCO via an inhibitory interneuron is referred to as an inhibitory PLL (iPLL) (Ahissar, 1998). The following theorem characterizes the operation of the different PLLs and is summarized in Table 1.

Theorem 1: PLL Characterization. During stable 1:1 entrainment of an ePLL/iPLL, the input is lagging or leading, respectively, when the PD is correlation based, and leading or lagging, respectively, when the PD is difference based. Within the working range (i.e., ζ ≤ 1 or ζ ≥ 1), as the input period approaches the intrinsic period of the RCO, the corresponding phase variable, that is, the steady-state phase ϕ∞ for lagging input or the steady-state co-phase ψ∞ for leading input, increases when the PD is correlation based and decreases when the PD is difference based.

Proof. Considering an interval of time during which the input is consistently lagging (i.e., ϕ(k) < θW for all the input events in the interval), the phase of the (k + 1)th input is related to the phase of the kth input by ϕ(k + 1) = ϕ(k) + ζ − τp(k)/τRCO. According to equation 2.2 and either equation 2.5 or 2.8, the perturbed period is given by

τp(k)/τRCO = 1 ∓ g(ϕ(k)),    (2.9)
regardless of the type of the PD, so

ϕ(k + 1) = ϕ(k) + ζ − 1 ± g(ϕ(k)).    (2.10)

When the input is consistently leading (i.e., ψ(k) < θW), the co-phase of the (k + 1)th input is related to the co-phase of the kth input by ψ(k + 1) = ψ(k) − ζ + τp(k)/τRCO. According to equation 2.2 and either equation 2.5 or 2.8, the perturbed period is given by

τp(k)/τRCO = 1 ∓ g(ψ(k)),    (2.11)

regardless of the type of the PD, so

ψ(k + 1) = ψ(k) − ζ + 1 ∓ g(ψ(k)).    (2.12)
Equations 2.10 and 2.12 imply that the equilibrium condition is specified by ϕ∞ = g⁻¹[±(1 − ζ)] when the input is lagging and by ψ∞ = g⁻¹[±(1 − ζ)] when the input is leading. Furthermore, the stability condition is given by −2 < ± dg/dx|ϕ∞ < 0 when the input is lagging and by −2 < ∓ dg/dx|ψ∞ < 0 when the input is leading. For the correlation-based PD, dg/dx ≤ 0, so the ePLL stabilizes with lagging input at ϕ∞ = g⁻¹[1 − ζ] and the iPLL stabilizes with leading input at ψ∞ = g⁻¹[ζ − 1]. For the difference-based PD, dg/dx ≥ 0, so the opposite holds. The period of the input to an ePLL is shorter than the intrinsic period of the RCO, so ζ ≤ 1. As the frequency of the input approaches the intrinsic frequency of the RCO, ζ ↑ 1, so (1 − ζ) ↓ 0. The period of the input to an iPLL is longer than the intrinsic period of the RCO, so ζ ≥ 1. As the frequency of the input approaches the intrinsic frequency of the RCO, ζ ↓ 1, so (ζ − 1) ↓ 0. For a correlation-based PD, dg/dx ≤ 0, and so both the steady phase ϕ∞ and the steady co-phase ψ∞ increase as the frequency of the input approaches the intrinsic frequency of the RCO from below or above for the ePLL/iPLL, respectively. For the difference-based PD, dg/dx ≥ 0, so both the steady phase ϕ∞ and the steady co-phase ψ∞ decrease as the frequency of the input approaches the intrinsic frequency of the RCO from below or above for the ePLL/iPLL, respectively. The linear operating curves of the different PLLs, given in Table 1, are depicted in Figure 4 in terms of the absolute delay as a function of the input period. The different panels depict the effect of the nominal rate r0, and the parallel curves within each panel depict the effect of the intrinsic period τRCO. It is noted that the operating range increases as the nominal rate increases.
However, according to the proof of the PLL characterization theorem, r0 should be less than 2 to ensure stability, so only the top panels depict stable (top left) and marginally stable (top right) operating curves.
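The entrainment dynamics of equation 2.10 and the Table 1 fixed point can be checked numerically for the linear correlation-based ePLL; the function name and parameter values below are illustrative assumptions:

```python
def iterate_phase(zeta, r0, theta_w, phi0=0.02, n=500):
    """Iterate eq. 2.10 for an ePLL with a linear correlation-based PD,
    g(x) = r0*(theta_w - x): phi(k+1) = phi(k) + zeta - 1 + g(phi(k))."""
    phi = phi0
    for _ in range(n):
        phi = phi + zeta - 1 + r0 * (theta_w - phi)
    return phi

# for zeta < 1 and 0 < r0 < 2 the map contracts (|1 - r0| < 1) and
# converges to the Table 1 value phi_inf = theta_w - (1 - zeta)/r0
phi_inf = iterate_phase(zeta=0.9, r0=0.5, theta_w=0.25)
assert abs(phi_inf - (0.25 - (1 - 0.9) / 0.5)) < 1e-9
```

With r0 ≥ 2 the same map diverges or oscillates, matching the stability bound quoted above.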
Figure 4: Linear steady-state curves describing the absolute delay between the external input and internal oscillatory events as a function of the input period for four types of PLLs. The curves shift in parallel as the intrinsic period of the RCO is increased from 80 to 120 msec in steps of 10 msec, as indicated by the arrows in the bottom left panel. The nominal rate r0 is 1 (top left panel), 2 (top right panel), 5 (bottom left panel), and 10 (bottom right panel).
Based on the PLL characterization theorem, we can classify the PLLs into two groups according to whether they are sensitive to the phase or co-phase of the input relative to the intrinsic oscillator. The phase-sensitive PLLs include (a1) the ePLL with correlation-based PD and (a2) the iPLL with difference-based PD, while the co-phase-sensitive PLLs include (b1) the iPLL with a correlation-based PD and (b2) the ePLL with a difference-based PD.

3 Temporal Decoding

3.1 Vibrissal Temporal Decoding Task. The entrainment of the PLL by a periodic input prepares the PLL to properly decode a novel input. In order to clarify this subtle issue, we consider in more detail the encoding of object
Table 2: Total Response Rt of Phase-Locked NPLLs to a Novel Input.

                         Correlation-Based PD                        Difference-Based PD
Novel Input/RCO Event    ePLL                  iPLL                  ePLL             iPLL
Leading (ψn)             g(ϕ∞) + g(ψn)         g(ψ∞) + g(ψn)         g(ψ∞) + g(ψn)    g(ϕ∞) + g(ψn)
Lagging (ϕn)             g(ϕ∞) + g(ϕn)         g(ψ∞) + g(ϕn)         g(ψ∞) + g(ϕn)    g(ϕ∞) + g(ϕn)
Linear PD:
  Leading (ψn)           r0[2θw − (ϕ∞ + ψn)]   r0[2θw − (ψ∞ + ψn)]   r0(ψ∞ + ψn)      r0(ϕ∞ + ψn)
  Lagging (ϕn)           r0[2θw − (ϕ∞ + ϕn)]   r0[2θw − (ψ∞ + ϕn)]   r0(ψ∞ + ϕn)      r0(ϕ∞ + ϕn)
location in the vibrissal system. The location of the object is encoded in the firing pattern of neurons in the trigeminal ganglion and probably also in the brainstem. In particular, the firing patterns of trigeminal neurons, which provide the external input to the vibrissal system, include two components (Kleinfeld et al., 1999; Szwed et al., 2003): (1) a reference signal composed of spikes at a preferred phase of the whisking cycle and (2) a contact-induced signal composed of spikes that are evoked on contact with an external object. The first component is periodic at the whisking period. The second component is the novel input whose time of occurrence relative to the reference signal (the first component) has to be decoded.
3.2 Effect of Novel Input. The external input to the PD is composed of two components: the reference, periodic input, and the novel input. We make the simple and physiologically appropriate assumption that the PD’s response to each of these components is the same and independent of each other, so the total response of the PD is the sum of the individual responses. For the gated PD described in section 2.2.3, for example, once the gate is opened by the RCO, the PD relays the bursts of activity that it receives from either or both of the external inputs. Using equations 2.4, 2.5, 2.7, and 2.8 for the response to either component of the external input, the total response Rt of the different NPLLs can be derived as summarized in Table 2 and depicted in Figure 5. As evident from Table 2 and Figure 5, the total PD response varies monotonically with the delay between the novel input and the oscillatory event (i.e., ϕn or ψn ) as long as the novel input is confined to either always lead or always lag the RCO event. However, in order to provide sensory decoding, the response should vary monotonically with the delay between the novel input and the reference events. The relevant decoding ranges are specified by the temporal detection theorem stated and proven in the next section.
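The additivity assumption behind the linear rows of Table 2 can be sketched for the correlation-based PD; the helper name and numerical values are illustrative assumptions:

```python
def total_response(steady, novel, theta_w, r0):
    """Total lumped response of a linear correlation-based PD to the
    reference plus a novel input, assuming the two responses simply add
    (Table 2, linear rows), with g(x) = r0*(theta_w - x)."""
    g = lambda x: r0 * (theta_w - x)
    return g(steady) + g(novel)

# co-phase-sensitive iPLL locked at psi_inf = 0.05: the total rate falls
# monotonically as the novel input moves away from the RCO event
rt_near = total_response(0.05, 0.02, theta_w=0.25, r0=1.0)
rt_far = total_response(0.05, 0.10, theta_w=0.25, r0=1.0)
assert rt_near > rt_far
assert abs(rt_near - (2 * 0.25 - (0.05 + 0.02))) < 1e-9
```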
Figure 5: Total PD response as a function of the phase ϕn (increasing to the right) and co-phase ψn (increasing to the left) of a novel input to a PLL that is entrained by a periodic reference input with the indicated phase ϕ∞ or co-phase ψ∞ relationship. The left/right pairs of panels depict the total response of correlation-/difference-based PDs embedded in an ePLL (upper panels) and an iPLL (bottom panels). The arrows below the axes indicate the reference input, while the arrows above the axes indicate the RCO events. The roman numerals indicate the corresponding zone of the novel input as indicated in Table 3. The solid/dashed lines indicate the response when the novel input lags/leads the reference input.
3.3 PLL Temporal Decoding Capabilities

Theorem 2: PLL Temporal Detection. During 1:1 stable phase locking to a periodic external input, a PLL can monotonically decode a novel input when it has a fixed order with respect to both the reference input and the RCO events. The resulting decoding ranges are specified in Table 3.

Proof. The output of the PD varies monotonically with the phase of the novel input along the RCO cycle as long as the order between them is fixed (second column of Table 3). The phase difference between the novel input
Table 3: Monotonic Decoding Ranges.

Novel Input/      Novel Input/   Zone         Correlation-Based PD    Difference-Based PD
Reference Input   RCO Events     (Figure 5)   ePLL        iPLL        ePLL        iPLL
Leading           Leading        I            θW          θW − ψ∞     θW − ψ∞     θW
Leading           Lagging        II           ϕ∞          0           0           ϕ∞
Lagging           Leading        III          0           ψ∞          ψ∞          0
Lagging           Lagging        IV           θW − ϕ∞     θW          θW          θW − ϕ∞
and the reference input varies monotonically with the phase of the novel input along the cycle of the RCO as long as the order between them is fixed (first column of Table 3). Hence, the response of the PD varies monotonically with the phase difference between the novel input and the reference input when the order of the novel input with respect to both the reference input and the RCO events is fixed, as specified by each row. Finally, the relevant ranges with the specified phase relationships between the novel input and both the reference input (first column of Table 3) or the RCO events (second column of Table 3) follow directly given the steady-state phase and co-phase of the reference input with respect to the RCO events. It is apparent that the decoding range depends on whether the PLL is phase or co-phase sensitive. When the order between the novel input and the reference input is determined by the nature of the temporal decoding task, it is possible to distinguish between two decoding modes: (1) a narrow but monotonic, and thus unambiguous, decoding range (e.g., a correlation-based ePLL decoding a novel input that lags the reference input over the range θW − ϕ∞; bottom row of the third column in Table 3; see also the top-left panel in Figure 5), and (2) a wide but partially ambiguous detection range (e.g., a correlation-based iPLL decoding a novel input that lags the reference input over the range θW + ψ∞; bottom two rows of the fourth column in Table 3; see also the bottom-left panel in Figure 5). The ambiguity stems from the fact that the order of the novel input with respect to the RCO events is not constrained in this case. Thus, the temporal detection theorem provides a design criterion for selecting the PLL that best matches the requirements of a given temporal decoding task.
The sensory information is encoded in the phase difference δ between the novel input and the reference input and can be expressed in terms of the phase of the novel input with respect to the closest RCO event and the phase of the reference input with respect to the same RCO event, as specified in Table 4. The following PLL temporal decoding theorem specifies how this informative phase difference may be determined from the response of the PD.
Table 4: Phase Difference δ Between the Novel Input and the Reference Input.

Novel Input/      Novel Input/    Phase-Sensitive   Co-Phase-Sensitive
Reference Input   RCO Events      PLLs              PLLs
Leading           Leading (ψn)    ϕ∞ + ψn           ψn − ψ∞
Leading           Lagging (ϕn)    ϕ∞ − ϕn           NA
Lagging           Leading (ψn)    NA                ψ∞ − ψn
Lagging           Lagging (ϕn)    ϕn − ϕ∞           ψ∞ + ϕn
Table 5: Parameters of the Relationship δ = aθW + bR∞/r0 + cRt/r0 Specifying the Phase Difference δ Between the Novel Input and the Reference Input as a Function of the Steady-State Response (R∞) and Total Response (Rt) of the PD.

Novel Input/      Novel Input/    Correlation-Based PD               Difference-Based PD
Reference Input   RCO Events      ePLL             iPLL              ePLL              iPLL
Leading           Leading (ψn)    a=2; b=0; c=−1   a=0; b=2; c=−1    a=0; b=−2; c=1    a=0; b=0; c=1
Leading           Lagging (ϕn)    a=0; b=−2; c=1   NA                NA                a=0; b=2; c=−1
Lagging           Leading (ψn)    NA               a=0; b=−2; c=1    a=0; b=2; c=−1    NA
Lagging           Lagging (ϕn)    a=0; b=2; c=−1   a=2; b=0; c=−1    a=0; b=0; c=1     a=0; b=−2; c=1
Theorem 3: PLL Temporal Decoding. Consider a PLL that is phase-locked to a periodic reference signal, and denote by R∞ the steady-state response of its PD. A novel input induces an additional response, so the total response of the PD is given by Rt. The phase difference δ between the novel input and the reference input may be determined by δ = aθW + bR∞/r0 + cRt/r0, with the parameters given in Table 5 for the specific PLL variant.

Proof. The PLL decoding theorem follows directly from Tables 4 and 3 after expressing the steady-state phase or co-phase in terms of the steady response R∞ using equations 2.4 and 2.7.

It is noted that in some cases, the computation involves the steady-state PD response R∞. This may be made available by PLLs that do not receive the novel input and thus continue to respond at R∞ even when the novel input appears. However, when operating in the regime for which the specific
PLL variant has the maximum decoding range (as specified in Table 2), the steady-state PD response is not required. Specifically, when the novel input lags both the RCO event and the reference event, the phase difference δ may be directly inferred from the PD response of the co-phase-sensitive PLLs (e.g., an iPLL with a correlation-based PD) after an appropriate offset.
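Concretely, the theorem reduces phase decoding to a table lookup plus one affine evaluation. The sketch below is our illustration in Python, not code from the article; the (a, b, c) triples are transcribed from the correlation-based-PD columns of Table 5, and `None` marks the NA entries:

```python
# Hypothetical decoder for Theorem 3: delta = a + b*R_inf/r0 + c*R_t/r0.
# (a, b, c) triples transcribed from the correlation-based-PD columns of
# Table 5; None marks the NA (undecodable) configurations.
TABLE5_CORR = {
    # (variant, novel vs. reference, novel vs. RCO event): (a, b, c)
    ("ePLL", "leading", "leading"): (2, 0, -1),
    ("ePLL", "leading", "lagging"): (0, -2, 1),
    ("ePLL", "lagging", "leading"): None,
    ("ePLL", "lagging", "lagging"): (0, 2, -1),
    ("iPLL", "leading", "leading"): (0, 2, -1),
    ("iPLL", "leading", "lagging"): None,
    ("iPLL", "lagging", "leading"): (0, -2, 1),
    ("iPLL", "lagging", "lagging"): (2, 0, -1),
}

def decode_phase(variant, novel_vs_ref, novel_vs_rco, R_inf, R_t, r0=1.0):
    """Return delta, or None where the variant cannot decode the configuration."""
    params = TABLE5_CORR[(variant, novel_vs_ref, novel_vs_rco)]
    if params is None:
        return None
    a, b, c = params
    return a + b * R_inf / r0 + c * R_t / r0
```

For instance, for an iPLL with a correlation-based PD whose novel input lags both the reference input and the RCO event, δ = 2 − Rt/r0, independent of R∞, consistent with the remark that the steady-state response is not required in that regime.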
3.4 Significance for Vibrissal Temporal Decoding

3.4.1 Decoding Range. The whisking-locked reference signal is evoked at the onset of the protraction phase of whisking, that is, the phase of forward movement, while the contact-induced signal is evoked later during protraction, upon contact with the object (Szwed et al., 2003). Thus, the contact-induced novel input lags the whisking-induced reference input. According to the temporal detection theorem, the decoding ranges that the different PLL variants can achieve in this case are specified in the last two rows of Table 2. In particular, the phase-sensitive PLLs (i.e., the ePLL with correlation-based PD and the iPLL with difference-based PD; zone IV, solid curves in the upper-left and lower-right panels of Figure 5) yield an unambiguous but narrow decoding range, while the co-phase-sensitive PLLs (i.e., the iPLL with correlation-based PD and the ePLL with difference-based PD; zones III and IV, solid curves in the lower-left and upper-right panels of Figure 5) yield a wide but partially ambiguous decoding range. In the latter case, the response is ambiguous when the novel input lags the reference input by less than twice the co-phase ψ∞, that is, when contact with the object occurs relatively close to the preferred phase of the whisking cycle. However, the response is still informative, since it provides an approximate indication of the phase of the novel input; furthermore, the ambiguity can be resolved by considering the responses of a population of PLLs that receive reference signals produced at different preferred phases (Ahissar, 1998). Hence, in the case of vibrissal temporal decoding, the widest detection and decoding ranges are obtained with co-phase-sensitive PLLs, for which the input leads the intrinsic oscillator during stable entrainment, in agreement with the observed oscillatory delay (Nicolelis et al., 1995; Ahissar et al., 1997).
The two co-phase-sensitive PLLs, that is, the iPLL with a correlation-based PD and the ePLL with a difference-based PD, differ in their input operating ranges, which include input periods that are longer or shorter than the intrinsic period of the oscillator, respectively (see Table 1, first row). Recordings from whisking-range oscillatory neurons in the somatosensory cortex indicate that they track mainly frequencies below their spontaneous frequency (Ahissar et al., 1997). Thus, given the above theorems, the observations suggest that the somatosensory cortex participates in the implementation of iPLLs with correlation-based PDs.
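To make the entrainment and its input-leading steady state concrete, here is a toy iteration (our illustrative sketch, not the authors' model; the PD characteristic r0(1 − |offset|) and all constants are assumptions): an inhibitory loop with a correlation-like PD locks onto an input whose period exceeds the oscillator's intrinsic period, and at steady state the input leads the oscillator.

```python
# Toy iPLL (illustrative only): an inhibitory connection lengthens the RCO
# period in proportion to a correlation-like PD response that peaks at
# coincidence. All constants are assumptions, not values from the article.
def simulate_ipll(T_input=1.3, T_intrinsic=1.0, gain=0.5, n_cycles=200):
    t_rco = t_in = 0.0
    offset = 0.0                             # input event time minus RCO event time
    r = 1.0
    for _ in range(n_cycles):
        r = max(0.0, 1.0 - abs(offset))      # correlation-like PD response
        t_rco += T_intrinsic + gain * r      # inhibition slows the RCO
        t_in += T_input
        offset = t_in - t_rco
    return offset, r

offset, r = simulate_ipll()
# At steady state the periods match (T_intrinsic + gain*r == T_input) and the
# offset is negative: the input leads the oscillator during stable entrainment.
```

With these numbers the iteration settles at offset = −0.4 and r = 0.6; the symmetric fixed point with the input lagging (offset = +0.4) is unstable, which is why only the input-leading phase relationship is observed.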
3.4.2 Frequency Modulation Experiments. Additional support for the above conclusion may be drawn from frequency modulation experiments in which the whiskers are stimulated by air puffs whose frequency is modulated by a slowly varying sinusoidal signal. The responses of neurons along the paralemniscal pathway (one of the two major vibrissal sensory pathways) followed the oscillatory input with varying latencies and spike counts. As the frequency of the input varied sinusoidally between 3 and 7 Hz at a modulating frequency of 0.5 Hz, so did the latency and the spike count of these neurons. However, while the former varied in phase with the frequency of the stimulus, the latter varied in antiphase (Ahissar et al., 2000; Ahissar, Sosnik, Bagdasarian, & Haidarliu, 2001). In particular, the latencies and spike counts of cortical neurons in layer 5a, which receive input from thalamic neurons in the medial division of the posterior nucleus (POm) along the paralemniscal pathway, were inversely related, as plotted in Figure 6 (connected stars).

Under the hypothesis that the thalamocortical loops in the vibrissal paralemniscal system implement NPLLs, the thalamic neurons should act as PDs (Ahissar et al., 1997). According to section 2.2.2, the observed relationship in neurons of layer 5a indicates that the thalamic neurons that drive these cortical neurons behave as correlation-based PDs. Thus, considering the four possible PLL variants, the observed latencies and inverse relationship are consistent only with the assumption that the paralemniscal pathway implements iPLLs with correlation-based PDs, in agreement with our previous conclusion. Indeed, simulations of an iPLL with a correlation-based PD demonstrate a similar relationship, as shown in Figure 6. The circles in Figure 6 depict the relationship between the spike count and the latency of the response of the simulated PD to a frequency-modulated spike train input.
To facilitate comparison, the linear fit to the measured data is marked by a dashed line, demonstrating good agreement with the simulated results.
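The inverse relationship itself can be caricatured in one line. In this hypothetical sketch, only the slope of −0.08 (spikes per msec) comes from the linear fit reported in Figure 6; the intercept r0 = 6.4 and the linear form are illustrative assumptions:

```python
# Hypothetical caricature of the inverse latency/spike-count relation in
# layer 5a. Only the slope of -0.08 comes from the linear fit in Figure 6;
# the intercept r0 = 6.4 and the linear form are assumptions.
def pd_spike_count(latency_ms, r0=6.4, slope=0.08):
    return max(0.0, r0 - slope * latency_ms)
```

Higher-frequency cycles, with their longer latencies, thus map to smaller spike counts, as the connected stars in Figure 6 show.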
4 Single Neural Oscillators

A single neuron can also be modeled as a PLL (Hoppensteadt, 1986). However, as indicated by equation 2.1, single neural oscillators are sensitive to the phase at which the input events occur, not the co-phase. This is indeed the basis for characterizing neural oscillators using phase-response curves. In particular, a single neural oscillator, described by equation 2.1, is equivalent to a PLL with lagging input, as described by equation 2.9, where f(ϕ) = g(ϕ). However, a single neural oscillator cannot operate as an NPLL with leading input, as described by equation 2.11, since its dynamics depend only on the phase, never the co-phase, of the input. As discussed above, the PLL temporal detection theorem suggests that single oscillators, which can be sensitive only to the phase but not to the co-phase, would provide a narrow decoding range when the novel input
[Figure 6: scatter plot of response (spikes per cycle, 0–7) versus latency (0–80 msec); see caption.]

Figure 6: Average spike count versus average latency of paralemniscal cortical neurons recorded from layer 5a of the barrel cortex in experiments with a frequency-modulated (FM) stimulus (Ahissar et al., 2000; Ahissar et al., 2001). Data points are marked with stars, each representing the latency and spike count for one cycle, averaged across 36 repetitions of the same FM sequence (first cycle excluded). Results from consecutive cycles are connected by a solid line. The dashed line is a linear fit to the data with a slope of −0.08, and the circles are generated from a simulated iPLL with a correlation-based PD in response to a frequency-modulated input spike train.
lags the reference input. Hence, we conclude that single oscillators are not optimal for temporal decoding of the whisking-induced signals.

Since single neurons can be sensitive only to the phase, and not the co-phase, of the external input, the PLL characterization theorem implies that they can implement either an ePLL with a correlation-based PD or an iPLL with a difference-based PD. To be able to track frequencies below their spontaneous frequency (Ahissar et al., 1997), single neurons should operate as iPLLs with difference-based PDs.

5 Summary and Discussion

5.1 Temporal Decoding Tasks. In the context of neural information processing, temporal decoding refers to the ability to respond in a way that is sensitive to the temporal pattern of neural activity, not just its average
spike rate. This article addresses a specific temporal decoding task, which is sensitive to the phase of an information-carrying signal (a spike train) relative to a periodic reference signal (a periodic spike train). Phase-decoding capabilities facilitate the interpretation of neural activity evoked during active touch or active vision (Ahissar & Arieli, 2001). In such active processes, the controlled movements of the sensory organs evoke the reference spike train, while the sensed features of the environment evoke the information-carrying signal. During whisking, for example, the sensory organs are the flexible whiskers, which scan the environment rhythmically, and the relevant feature is the position of an object in that environment. The angle of contact, and thus the relative angular position of the object, can be inferred from the phase along the whisking cycle at which the contact occurred.

5.2 Temporal Decoding Capabilities of PLLs. PLLs are well-developed electronic circuits designed to track periodic signals over a wide frequency range with good noise-rejection performance. The output of the internal oscillator reproduces a cleaned version of the original signal, while the PD followed by a low-pass filter demodulates the input signal. Similarly, neuronal PLLs may be used to track the period of the input spike train and encode its variations in the output—the number of spikes per cycle—of the PD. In this mode, the sensitivity of the PLL may be defined as the change in the output of the PD induced by a small change in the period of the input. Previous work (Zacksenhouse, 2001) indicates that the sensitivity of the iPLL is relatively constant compared with the sensitivity of single-neuron oscillators. By tracking the frequency of the input, PLLs can also be used to detect the relative phase of a novel input, and thus accomplish the phase-decoding task, which is critical for the interpretation of active sensation, as discussed above.
Specifically, the PD decodes the phase of the novel input with respect to the periodic activity of the internal oscillator. However, the internal oscillator of an entrained PLL is phase-locked to the reference input, and so the PLL indirectly decodes the phase of the novel input with respect to that reference spike train. The performance of PLLs with respect to this task is the focus of this article. The four PLL variants, involving correlation- and difference-based PDs with either inhibitory (iPLL) or excitatory (ePLL) connections, operate in two locking modes, which are sensitive to either the phase or co-phase of the input. In particular, the iPLL with correlation-based PD and the ePLL with difference-based PD are sensitive to the co-phase of the input, and thus establish a unique response pattern that cannot be produced by single-neuron oscillators. The operating range over which the timing of the novel input may be decoded with respect to the reference signal has been determined and provides a design criterion for selecting the optimum PLL variant that best matches the requirements for a given task.
5.3 Circuit-Based versus Single-Neuron PLLs. The relationship between the operation of a single neuron and that of a phase-locked loop was suggested and extensively explored in Hoppensteadt (1986). The cell body is modeled as a voltage-controlled oscillator (VCO, equivalent to the RCO here), and the synaptic effect as a monotonically increasing nonlinear function of the combined effect of the VCO and the external input. The resulting model was shown here to be equivalent to either an iPLL with a difference-based PD or an ePLL with a correlation-based PD.

5.4 Temporal Decoding in the Vibrissal System. The hypothesis that temporal decoding in the vibrissal system is facilitated by neural circuits implementing PLLs has received substantial support from a range of observations: (1) the existence of neural oscillators in the relevant range of frequencies (Ahissar et al., 1997); (2) the existence of PD-like neurons, which exhibit frequency-dependent gated outputs (Ahissar et al., 2000); (3) phase locking of oscillators and PD-like neurons to a range of input frequencies (Ahissar et al., 1997); (4) monotonic direct relationships between input frequencies and locking phases in both oscillators and PD-like neurons (Ahissar et al., 2000); (5) monotonic inverse relationships between input frequencies and locking spike counts in PD-like neurons (Ahissar et al., 2000), which depend on the length of the stimulus burst (Ahissar et al., 2001; Sosnik et al., 2001); and (6) an estimated sensitivity that agrees with the observed marginal stability (Zacksenhouse, 2001). In this article, we provide additional theoretical support for this hypothesis based on the decoding range of the different PLL variants. The nature of the temporal decoding task facing the vibrissal system suggests that the widest detection and decoding ranges may be achieved with the co-phase-sensitive PLLs, which cannot be implemented by single neural oscillators.
Furthermore, new observations from frequency modulation experiments are described that support the hypothesis that the vibrissal system implements iPLLs with correlation-based PDs.

5.5 Control Capabilities of PLLs. Neural oscillators play an important role not only in decoding but also in generating temporal patterns (Zacksenhouse, 2001). In particular, networks of coupled oscillators are assumed to generate the patterns of rhythmic movements that underlie a diverse range of rhythmic tasks, including locomotion (Nishii, Uno, & Suzuki, 1994; Rand et al., 1986) and chewing (Rowat & Selverston, 1993), for example. These networks can generate the patterns of activity even in the absence of any sensory feedback and thus are referred to as central pattern generators (CPGs). However, feedback may still play an important role in these tasks (Ekeberg, 1993; Grillner et al., 1995), in particular in tasks that involve open-loop unstable dynamical systems. The closed-loop control of such tasks, and in particular the control of yoyo-playing with oscillatory units, revealed additional advantages of PLLs over single-neuron oscillators (Jin & Zacksenhouse, 2002, 2003). In
this application, the neural oscillator determines when to start the upward movement and receives a once-per-cycle input at a characteristic phase of the movement, when the yoyo reaches the bottom of its flight. As discussed here, single neural oscillators, or single-cell PLLs, may establish only input-lagging phase relationships, while neural network PLLs may also establish a unique input-leading phase relationship. The latter was demonstrated to have critical control advantages, which are essential in the context of yoyo playing (Jin & Zacksenhouse, 2002, 2003). Thus, the unique temporal detection characteristics of circuit PLLs also provide control capabilities beyond those of directly coupled neural oscillators.
Acknowledgments

This work was supported by the United States-Israel Binational Science Foundation Grant #2003222, the MINERVA Foundation, the Human Frontiers Science Programme, and the Fund for the Promotion of Research at the Technion. E.A. holds the Helen Diller Family Professorial Chair of Neurobiology.
References

Ahissar, E. (1998). Temporal-code to rate-code conversion by neuronal phase-locked loops. Neural Comput., 10, 597–650.
Ahissar, E., & Arieli, A. (2001). Figuring space by time. Neuron, 22, 185–201.
Ahissar, E., Haidarliu, S., & Zacksenhouse, M. (1997). Decoding temporally encoded sensory input by cortical oscillations and thalamic phase comparators. Proc. Natl. Acad. Sci. USA, 94, 11633–11638.
Ahissar, E., Sosnik, R., Bagdasarian, K., & Haidarliu, S. (2001). Temporal frequency of whisker movement. II. Laminar organization of cortical representations. J. Neurophysiol., 86, 354–367.
Ahissar, E., Sosnik, R., & Haidarliu, S. (2000). Transformation from temporal to rate coding in somatosensory thalamocortical pathway. Nature, 406, 302–305.
Ahissar, E., & Vaadia, E. (1990). Oscillatory activity of single units in a somatosensory cortex of an awake monkey and their possible role in texture analysis. Proc. Natl. Acad. Sci. USA, 87, 8935–8939.
Ahissar, E., & Zacksenhouse, M. (2001). Temporal and spatial coding in the rat vibrissal system. Prog. Brain Res., 130, 75–87.
Amitai, Y. (1994). Membrane potential oscillations underlying firing patterns in neocortical neurons. Neuroscience, 63, 151–161.
Ekeberg, O. (1993). A neuro-mechanical model of undulatory swimming. Biol. Cybern., 69, 363–374.
Flint, A. C., Maisch, U. S., & Kriegstein, A. R. (1997). Postnatal development of low [Mg2+] oscillations in neocortex. J. Neurophysiol., 78, 1990–1996.
Gardner, F. M. (1979). Phaselock techniques (2nd ed.). New York: Wiley.
Grillner, S., Deliagina, T., Ekeberg, O., El Manira, A., Hill, R. H., Lansner, A., Orlovsky, G. N., & Wallen, P. (1995). Neural networks that co-ordinate locomotion and body orientation in the lamprey. Trends Neurosci., 18, 270–279.
Hoppensteadt, F. C. (1986). An introduction to the mathematics of neurons. Cambridge: Cambridge University Press.
Jin, H., & Zacksenhouse, M. (2002). Necessary condition for simple oscillatory neural control of robotic yoyo. In Int. Joint Conf. on Neural Networks, World Cong. on Intell. Comp. (IJCNN-WCCI'02) (pp. 1427–1432). Honolulu, HI.
Jin, H., & Zacksenhouse, M. (2003). Oscillatory neural control of dynamical systems. IEEE Trans. Neural Networks, 14(2), 317–325.
Jones, S. R., Pinto, D. J., Kaper, T. J., & Kopell, N. (2000). Alpha-frequency rhythms desynchronize over long cortical distances: A modeling study. J. Comp. Neurosci., 9, 271–291.
Kawato, M., & Suzuki, R. (1978). Biological oscillators can be stopped: Topological study of a phase response curve. Biol. Cybern., 30, 241–248.
Kleinfeld, D., Berg, R. W., & O'Connor, S. M. (1999). Anatomical loops and their electrical dynamics in relation to whisking by rat. Somatosensory and Motor Res., 16(2), 69–88.
Lebedev, M. A., & Nelson, R. J. (1995). Rhythmically firing (20–50 Hz) neurons in monkey primary somatosensory cortex: Activity patterns during initiation of vibratory-cued hand movements. J. Comp. Neurosci., 2, 313–334.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. New York: Freeman.
Nicolelis, M. A. L., Baccala, L. A., Lin, R. C. S., & Chapin, J. K. (1995). Sensorimotor encoding by synchronous neural ensemble activity at multiple levels of the somatosensory system. Science, 268, 1353–1358.
Nishii, J., Uno, Y., & Suzuki, R. (1994). Mathematical models for the swimming pattern of a lamprey I and II. Biol. Cybern., 72, 1–9, 11–18.
Perkel, D. H., Schulman, J. H., Bullock, T. H., Moore, G. P., & Segundo, J. P. (1964). Pacemaker neurons: Effects of regularly spaced synaptic input. Science, 145, 61–63.
Pinto, D. J., Jones, S. R., Kaper, T. J., & Kopell, N. (2003). Analysis of state-dependent transitions in frequency and long-distance coordination in a model oscillatory cortical circuit. J. Comp. Neurosci., 15, 283–298.
Rand, R. H., Cohen, A. H., & Holmes, P. J. (1986). Systems of coupled oscillators as models of central pattern generators. In A. H. Cohen, S. Rossignol, & S. Grillner (Eds.), Neural control of rhythmic movements in vertebrates. New York: Wiley.
Rowat, P. F., & Selverston, A. I. (1993). Modeling the gastric mill central pattern generator of the lobster with a relaxation-oscillator network. J. Neurophysiol., 70(3), 1030–1053.
Silva, L. R., Amitai, Y., & Connors, B. W. (1991). Intrinsic oscillations of neocortex generated by layer 5 pyramidal neurons. Science, 251, 432–435.
Sosnik, R., Haidarliu, S., & Ahissar, E. (2001). Temporal frequency of whisker movement. I. Representations in brainstem and thalamus. J. Neurophysiol., 86, 339–353.
Snyder, D. L. (1975). Random point processes. New York: Wiley.
Szwed, M., Bagdasarian, K., & Ahissar, E. (2003). Encoding of vibrissal active touch. Neuron, 40, 621–630.
Winfree, A. T. (1980). Geometry of biological time. Berlin: Springer.
Yamanishi, J., Kawato, M., & Suzuki, R. (1980). Two coupled oscillators as a model for the coordinated finger tapping by both hands. Biol. Cybern., 37, 219–225.
Zacksenhouse, M. (2001). Sensitivity of basic oscillatory mechanisms for pattern generation and detection. Biol. Cybern., 85(4), 301–311.
Received September 10, 2004; accepted September 29, 2005.
LETTER
Communicated by Mark Ungless
Representation and Timing in Theories of the Dopamine System Nathaniel D. Daw [email protected] UCL, Gatsby Computational Neuroscience Unit, London, WC1N3AR, U.K.
Aaron C. Courville [email protected] Carnegie Mellon University, Robotics Institute and Center for the Neural Basis of Cognition, Pittsburgh, PA 15213, U.S.A.
David S. Touretzky [email protected] Carnegie Mellon University, Computer Science Department and Center for the Neural Basis of Cognition, Pittsburgh, PA 15213, U.S.A.
Although the responses of dopamine neurons in the primate midbrain are well characterized as carrying a temporal difference (TD) error signal for reward prediction, existing theories do not offer a credible account of how the brain keeps track of past sensory events that may be relevant to predicting future reward. Empirically, these shortcomings of previous theories are particularly evident in their account of experiments in which animals were exposed to variation in the timing of events. The original theories mispredicted the results of such experiments due to their use of a representational device called a tapped delay line. Here we propose that a richer understanding of history representation and a better account of these experiments can be given by considering TD algorithms for a formal setting that incorporates two features not originally considered in theories of the dopaminergic response: partial observability (a distinction between the animal's sensory experience and the true underlying state of the world) and semi-Markov dynamics (an explicit account of variation in the intervals between events). The new theory situates the dopaminergic system in a richer functional and anatomical context, since it assumes (in accord with recent computational theories of cortex) that problems of partial observability and stimulus history are solved in sensory cortex using statistical modeling and inference and that the TD system predicts reward using the results of this inference rather than raw sensory data. It also accounts for a range of experimental data, including the experiments involving programmed temporal variability and other previously unmodeled dopaminergic response phenomena, which we suggest are related to subjective noise in animals' interval timing. Finally, it offers new experimental predictions and a rich theoretical framework for designing future experiments.

Neural Computation 18, 1637–1677 (2006)
1 Introduction

The responses of dopamine neurons in the primate midbrain are well characterized by a temporal difference (TD; Sutton, 1988) reinforcement learning (RL) theory, in which neuronal spiking is supposed to signal error in the prediction of future reward (Houk, Adams, & Barto, 1995; Montague, Dayan, & Sejnowski, 1996; Schultz, Dayan, & Montague, 1997). Although such theories have been influential, a key computational issue remains: How does the brain keep track of those sensory events that are relevant to predicting future reward, when the rewards and their predictors may be separated by long temporal intervals?

The problem traces to a disconnect between the physical world and the abstract formalism underlying TD learning. The formalism is the Markov process, a model world that proceeds stochastically through a series of states, sometimes delivering reward. The TD algorithm learns to map each state to a prediction about the reward expected in the future. This is possible because, in a Markov process, future states and rewards are conditionally independent of past events, given only the current state. There is thus no need to remember past events: the current state contains all information relevant to prediction.

This assumption is problematic when it comes to explaining experiments on dopamine neurons, which often involve delayed contingencies. In a typical experiment, a monkey learns that a transient flash of light signals that, after a 1-second delay, a drop of juice will be delivered. Because of this temporal gap, the animal's immediate sensory experiences (gap, flash, gap, juice) do not by themselves correspond to the states of a Markov process. This example also demonstrates that these issues of memory are tied up with issues of timing, in this case, marking the passage of the 1-second interval.
Existing TD theories of the dopamine system address these issues using variations on a device called a tapped delay line (Sutton & Barto, 1990), which redefines the state to include a buffer of previous sensory events within some time window. If the window is large enough to encompass relevant history, which is assumed in the dopamine theories, then the augmented states form a Markov process, and TD learning can succeed. Clearly, this approach fudges an issue of selection: How can the brain adaptively decide which events should be remembered, and for how long? In practice, the tapped delay line is also an awkward representation for predicting events whose timing can vary. As a result, the theory incorrectly predicted the firing of dopamine neurons in experiments in which the timing of events was
varied (Hollerman & Schultz, 1998; Fiorillo & Schultz, 2001). This problem has received only limited attention (Suri & Schultz, 1998, 1999; Daw, 2003).

In this letter, we take a deeper look at these issues by adopting a more appropriate formalism for the experimental situation. In particular, we propose modeling the dopamine response using a TD algorithm for a partially observable semi-Markov process (also known as a hidden semi-Markov model), which generalizes the Markov process in two ways. This richer formalism incorporates variability in the timing between events (semi-Markov dynamics; Bradtke & Duff, 1995) and a distinction between the sensory experience and the underlying but only partially observable state (Kaelbling, Littman, & Cassandra, 1998).

The established theory of RL with partial observability offers an elegant approach to maintaining relevant sensory history. The idea is to use Bayesian inference, with a statistical description ("world model") of how the hidden process evolves, to infer a probability distribution over the likely values of the unobservable state. If the world model is correct, this inferred state distribution incorporates all relevant history (Kaelbling et al., 1998) and can itself be used in place of the delay line as a state representation for TD learning.

Applied to theories of the dopamine system, this viewpoint casts new light on a number of issues. The system is viewed as making predictions using an inferred state representation rather than raw sensory history. This reframes the problem of representing adequate stimulus history in the computationally more familiar terms of learning an appropriate world model. It also situates the dopamine neurons in a broader anatomical and functional context, since predominant models of sensory cortex envision it performing the sort of world modeling and hidden state inference we require (Doya, 1999; Rao, Olshausen, & Lewicki, 2002).
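The inference step this approach relies on can be sketched generically. The following is a standard HMM-style forward update (our illustration, not code from the letter; the matrices in the test values are made up), producing the inferred state distribution that stands in for the delay line:

```python
# Generic Bayesian belief update over hidden states (standard HMM forward
# step; illustration only). T[i][j]: transition probability from state i to j.
# O[j][k]: probability of observing k in state j. b: current belief vector.
def belief_update(b, obs, T, O):
    n = len(b)
    # predict: propagate the belief through the transition model
    predicted = [sum(b[i] * T[i][j] for i in range(n)) for j in range(n)]
    # correct: reweight by the likelihood of the current observation
    unnorm = [predicted[j] * O[j][obs] for j in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

The resulting distribution summarizes all history relevant under the world model, which is why it can replace the delay-line vector as the state representation fed to TD learning.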
Combined with a number of additional assumptions (notably, about the relative strength of positive and negative error representation in the dopamine response; Niv, Duff, & Dayan, 2005; Bayer & Glimcher, 2005), the new model accounts for puzzling results on the responses of dopamine neurons when event timing is varied; further, armed with this account of temporal variability, we consider the effect of noise in internal timing processes and show that this can address other experimental phenomena. Previous models can be viewed as approximations to the new one under appropriate limits. The rest of the letter is organized as follows. In section 2 we discuss previous models and how they cope with temporal variability. We realize our own account of the system in several stages. We begin in section 3 with a general overview of the pieces of the model. In section 4, we develop and simulate a fully observable semi-Markov TD model, and in the following section we generalize it to the partially observable case. As a limiting case of the complete, partially observable model, the simpler model is appropriate for analyzing the complete model’s behavior in many situations. After presenting results about the behavior of each model, we discuss to what
extent its predictions are upheld experimentally. Finally, in section 6, we conclude with more general discussion.

2 Previous Models

In this section, we review the TD algorithm and its use in models of the dopamine response, focusing on the example of a key, problematic experiment. These models address unit recordings of dopamine neurons in primates performing a variety of appetitive conditioning tasks (for review, see Schultz, 1998). These experiments can largely be viewed as variations on Pavlovian trace conditioning, a procedure in which a transient cue such as a flash of light is followed after some interval by a reinforcer, regardless of the animal's actions. In fact, some of the experiments were conducted using delay conditioning (in which the initial stimulus is not punctate but rather prolonged to span the gap between stimulus onset and reward) or involved instrumental requirements (that is, the cue signaled the monkey to perform some behavioral response such as a key press to obtain a reward). For most of the data considered here, no notable differences in dopamine behavior have been observed between these methodological variations, to the extent that comparable experiments have been done. Thus, here, and in common with much prior modeling work on the system, we will neglect action selection and stimulus persistence and idealize the tasks as Pavlovian trace conditioning.

2.1 The TD Algorithm. The TD theory of the dopamine response (Houk et al., 1995; Montague et al., 1996; Schultz et al., 1997) involves modeling the experimental situation as a Markov process and drawing on the TD algorithm for reward prediction in such processes (Sutton, 1988). Such a process comprises a set S of states, a transition function T, and a reward function R. The process has discrete dynamics; at each time step t, a real-valued reward r_t and a successor state s_{t+1} ∈ S are drawn.
The distributions over r_t and s_{t+1} are specified by the functions T and R and depend only on the value of the current state s_t. In modeling the experimental situation, the process time steps are taken to correspond to short, constant-length blocks of real time, and the state corresponds to some representation, internal to the animal, of relevant experimental stimuli. We can define a value function mapping states to expected cumulative discounted future reward,
V_s ≡ E[ Σ_{τ=t}^{t_end} γ^{τ−t} r_τ | s_t = s ],          (2.1)
where the expectation is taken with respect to stochasticity in the state transitions and reward magnitudes, t_end is the time the current trial ends, and γ is a parameter controlling the steepness of temporal discounting. The goal of the TD algorithm is to use samples of states s_t and rewards r_t to learn an approximation V̂ to the true value function V. If such an estimate were correct, it would satisfy

V̂_{s_t} = E[r_t + γ V̂_{s_{t+1}} | s_t],          (2.2)
which is just the value function definition rewritten recursively. The TD learning rule is based on this relation: given a sample of a pair of adjacent states and an intervening reward, the TD algorithm nudges the estimate V̂_{s_t} toward r_t + γ V̂_{s_{t+1}}. The change in V̂_{s_t} is thus proportional to the TD error,

δ_t = r_t + γ V̂_{s_{t+1}} − V̂_{s_t},          (2.3)
with values updated as V̂_{s_t} ← V̂_{s_t} + ν · δ_t for learning rate ν. In this article, we omit consideration of eligibility traces, as appear in the TD-λ algorithm (Sutton, 1988; Houk et al., 1995; Sutton & Barto, 1998). These would allow error at time t to directly affect states encountered some time steps before, an elaboration that can speed up learning but does not affect our general argument.

2.2 TD Models of the Dopamine Response.

2.2.1 Model Specification. The TD models of the dopamine response assume that dopamine neurons fire at a rate proportional to the prediction error δ_t added to some constant background activity level, so that positive δ_t corresponds to neuronal excitation and negative δ_t to inhibition. They differ in details of the value function definition (e.g., whether discounting is used) and in how the state is represented as a function of the experimental stimuli. Here we roughly follow the formulation of Montague et al. (1996; Schultz et al., 1997), on which most subsequent work has built.

The state is taken to represent both current and previous stimuli, represented using tapped delay lines (Sutton & Barto, 1990). Specifically, assuming the task involves only a single stimulus, the state s_t is defined as a binary vector whose ith element is one if the stimulus was last seen at time t − i and zero otherwise. For multiple stimuli, the representation is the concatenation of several such history vectors, one for each stimulus. Importantly, reward delivery is not represented with its own delay line. In fact, reward delivery is assumed to have no effect on the state representation. As illustrated in Figure 1, stimulus delivery sets off a cascade of internal states, whose progression, once per time step, tracks the time since stimulus
N. Daw, A. Courville, and D. Touretzky
Figure 1: The state space for a tapped delay line model of a trace conditioning experiment. The stimulus initiates a cascade of states that mark time relative to it. If the interstimulus interval is deterministic, the reward falls in one such state. ISI: interstimulus interval; ITI: intertrial interval.
delivery. These time steps are taken to correspond to constant slices of real time, of duration perhaps 100 ms. The value is estimated linearly as the dot product of the state vector with a weight vector: Vˆ st = st · wt . For the case of a single stimulus, this is equivalent to a table maintaining a separate value for each of the marker states shown in Figure 1. 2.2.2 Account for Basic Findings. The inclusion of the tapped delay line enables the model to mimic basic dopamine responses in trace conditioning (Montague et al., 1996; Schultz et al., 1997). Dopamine neurons burst to unexpected rewards or reward-predicting signals, when δt is positive, and pause when an anticipated reward is omitted (and δt is negative). The latter is a timed response and occurs in the model because value is elevated in the marker states intervening between stimulus and reward. If reward is omitted, the difference γ Vˆ st+1 − Vˆ st in equation 2.3 is negative at the state where the reward was expected, so negative error is seen when that state is encountered without reward. 2.2.3 Event Time Variability. This account fails to predict the response of dopamine neurons when the timing of rewards is varied from trial to trial (Hollerman & Schultz, 1998; see also Fiorillo & Schultz, 2001). Figure 2 (left) shows the simulated response when a reward is expected 1 second after the stimulus but instead delivered 0.5 second early (top) or late (bottom) in occasional probe trials. The noteworthy case is what follows early reward. Experiments (Hollerman & Schultz, 1998) show that the neurons burst to the early reward but do not subsequently pause at the time reward was originally expected. In contrast, because reward arrival does not affect the model’s stimulus representation, the delay line will still subsequently arrive in the state in which reward is usually received. 
There, when the reward is not delivered again, the error signal will be negative, predicting (contrary to experiment) a pause in dopamine cell firing at the time the reward was originally expected.
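The tapped delay line account can be made concrete in a few lines of code. The following is a minimal illustrative sketch (ours, not the authors' implementation; parameter values and names are arbitrary) of TD(0) over a delay-line state representation, reproducing the near-zero error after training on a fixed-delay reward and the negative error (the predicted pause) when that reward is omitted:

```python
import numpy as np

# Minimal TD(0) over a tapped delay line (illustrative sketch; parameter
# values and names are ours, not the authors').
T, gamma, lr = 10, 0.95, 0.3   # marker states per trial, discount, learning rate
rew_t = 5                      # reward arrives 5 steps after the stimulus
w = np.zeros(T)                # one value weight per marker state

def run_trial(w, reward_time):
    """One trial starting at stimulus onset; returns the TD error per step."""
    deltas = np.zeros(T)
    for t in range(T - 1):
        r = 1.0 if t == reward_time else 0.0
        deltas[t] = r + gamma * w[t + 1] - w[t]   # TD error, equation 2.3
        w[t] += lr * deltas[t]                    # value update
    return deltas

for _ in range(500):                  # train with the reward always on time
    run_trial(w, rew_t)

trained = run_trial(w.copy(), rew_t)  # reward as expected: errors near zero
omitted = run_trial(w.copy(), -1)     # reward omitted: negative error at rew_t
```

Because reward omission leaves the delay line untouched, `omitted[rew_t]` comes out strongly negative: the marker state where reward was expected produces negative error, the modeled pause.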
Representation and Timing in Theories of the Dopamine System
[Figure 2 panels: left, "Delay line model"; right, "Delay line model with reset". Each panel plots TD error δ(t) against time, with stimulus and reward onsets marked.]
Figure 2: TD error (simulated dopamine response) in two tapped delay line models (γ = 0.95) of a conditioning task in which reward was delivered early (top) or late (bottom) in occasional probe trials. (Middle: reward delivered at normal time.) (Left) In the standard model, positive error is seen to rewards arriving at unexpected times and negative error is seen on probe trials at the time reward had been originally expected. (Right) In a modified version of the model in which reward resets the delay line, no negative error is seen following an early reward.
It might seem that this problem could be solved simply by adding a second delay line to represent the time since reward delivery, so that the model could learn not to expect a second reward after an early one. However, in the experiment discussed here, the model might not have had the opportunity for such learning. Since early rewards were delivered only in occasional probe trials, value predictions were presumably determined by experience with the reward occurring at its normal time. Further, even given extensive experience with early rewards, two tapped delay lines plus a linear value function estimator could never learn the appropriate discrimination, because (as is easy to verify) the expected future value at different points in the task is a nonlinear function of the two-delay-line state representation. A number of authors have proposed fixing this misprediction by assuming that reward delivery resets the representational system, for example, by clearing the delay line representing time since the stimulus (Suri & Schultz, 1998, 1999; Brown, Bullock, & Grossberg, 1999). This operation negates all pending predictions and avoids negative TD error when they fail to play out. Figure 2 (right) verifies that this device eliminates the spurious inhibition after an early reward. However, it is unclear from the original work under what circumstances such a reset is justified or appropriate, and doubtful that this simple, ad hoc rule generalizes properly to other situations. We return to these considerations in the discussion. Here, we investigate a more
systematic approach to temporal representation in such experiments, based on the view of stimulus history taken in work on partially observable Markov processes (Kaelbling et al., 1998). On this view, a (in principle, learnable) world model is used to determine relevant stimulus history. Therefore, we outline an appropriate family of generative models for reinforcer and stimulus delivery in conditioning tasks: the partially observable semi-Markov process. 3 A New Model: A Broad Functional Framework In this article, we specify a new TD model of the dopamine system incorporating semi-Markov dynamics and partial observability. Our theory envisions the dopaminergic value learning system as part of a more extensive framework of interacting learning systems than had been previously considered in dopaminergic theories. Figure 3 lays out the pieces of the model; those implemented in this article are shown in black. The idea of the theory is to address prediction in the face of partial observability by using a statistical model of the world’s contingencies to infer a probability distribution over the world’s (unobservable) state and then to use this inferred representation as a basis for learning to predict values using a TD algorithm. Thus, we have:
• A model learning system that learns a forward model of state transitions, state dwell times, and observable events. Similar functions are often ascribed to cortical areas, particularly prefrontal cortex (Owen, 1997; Daw, Niv, & Dayan, 2005).
Figure 3: Schematic of the interaction between multiple learning systems suggested in this article. Those discussed in detail here are shown in black. [Boxes: model learning; state estimation (sensory cortex); value prediction (ventral striatum); action selection (dorsal striatum); TD error signal (dopamine, serotonin?).]
• A state estimation system that infers the world's state (and related latent variables) using sensory observations and the world model. This broadly corresponds to cortical sensory processing systems (Rao et al., 2002).

• A value prediction system that uses a TD error signal to map this inferred state representation to a prediction of future reward. This portion of the system works similarly to previous TD models of the dopamine system, except that we assume semi-Markov rather than Markov state transition dynamics. We associate this aspect of the model with the dopamine neurons and their targets (Schultz, 1998). Additionally, as discussed below, information about negative errors or aversive events, which may be missing from the dopaminergic error signal, could be provided by other systems such as serotonin (Daw, Kakade, & Dayan, 2002).
In this article, since we are studying asymptotic dopamine responding, we assume the correct model has already been learned and do not explicitly perform model learning (though we have studied it elsewhere in the context of theories of conditioning behavior; Courville & Touretzky, 2001; Courville, Daw, Gordon, & Touretzky, 2003; Courville, Daw, & Touretzky, 2004). Model fitting can be performed by variations on the expectation-maximization algorithm (Dempster, Laird, & Rubin, 1977); a version for hidden semi-Markov models is presented by Guedon and Cocozza-Thivent (1990). Here we would require an online version of these methods (as in Courville & Touretzky, 2001). RL is ultimately concerned with action selection in Markov decision processes, and it is widely assumed that the dopamine system is involved in control as well as prediction. In RL approaches such as actor-critic (Sutton, 1984), value prediction in a Markov process (as studied here) is a subproblem useful for learning action selection policies. Hence we assume that there is also:
• An action selection system that uses information from the TD error (or perhaps the learned value function) to learn an action selection policy. Traditionally, this is associated with the dopamine system's targets in the dorsal striatum (O'Doherty et al., 2004).
As we are focused on dopamine responses in Pavlovian tasks, we do not address policy learning in this article. 4 A New Model: Semi-Markov Dynamics We build our model in two stages, starting with a model that incorporates semi-Markov dynamics but not partial observability. This simplified model is useful for both motivating the description of the complete model and studying its behavior, since the simplified model is easier to analyze and
represents a good approximation to the complete model's behavior under certain conditions.

4.1 A Fully Observable Semi-Markov Model. A first step toward addressing issues of temporal variability in dopamine experiments is to adopt a formalism that explicitly models such variability. Here we generalize the TD models presented in section 2 to use TD in a semi-Markov process, which adds temporal variability in the state transitions. In a semi-Markov process, state transitions occur as in a Markov process, except that they occur irregularly. The dwell time for each state visit is randomly drawn from a distribution associated with the state. In addition to transition and reward functions (T and R), semi-Markov models contain a function D specifying the dwell time distribution for each state. The process is known as semi-Markov because although the identities of successor states obey the Markov conditional independence property, the probability of a transition at a particular instant depends not just on the current state but on the time that has already been spent there. We model rewards and stimuli as instantaneous events occurring on the transition into a state. We require additional notation. It can at times be useful to index random variables either by their time t or by a discrete index k that counts state transitions. The time spent in state s_k is d_k, drawn conditional on s_k from the distribution specified by the function D. If the system entered that state at time τ, delivering reward r_k, then we can also write that s_t = s_k for all τ ≤ t < τ + d_k and r_t = r_k for t = τ while r_t = 0 for τ < t < τ + d_k. It is straightforward to adapt standard reinforcement learning algorithms to this richer formal framework, a task first tackled by Bradtke and Duff (1995). Our formulation is closer to that of Mahadevan, Marchalleck, Das, & Gosavi (1997; Das, Gosavi, Mahadevan, & Marchalleck, 1999). We use the value function

V̂(s_k) = E[r_{k+1} − ρ·d_k + V̂(s_{k+1}) | s_k],   (4.1)
where the expectation is now taken additionally with respect to randomness in the dwell time d_k. There are two further changes to the formulation here. First, for bookkeeping purposes, we omit the reward r_k received on entering state s_k from that state's value. Second, in place of the exponentially discounted value function of equation 2.2, we use an average reward formulation, in which

ρ ≡ lim_{n→∞} (1/n) · Σ_{τ=t}^{t+n−1} r_τ

is the average reward per time step. This represents a limit of the exponentially discounted case as the discounting factor γ → 1 (for details, see Tsitsiklis & Van Roy, 2002) and has some useful properties for modeling dopamine responses (Daw & Touretzky, 2002; Daw et al., 2002). Following that work, we will henceforth assume that the value function is infinite horizon, that is, when written in
Figure 4: The state space for a semi-Markov model of a trace conditioning experiment. States model intervals of time between events, which vary according to the distributions sketched in the insets. Stimulus and reward are delivered on state transitions. ISI: interstimulus interval; ITI: intertrial interval. Here, the ISI is constant, while the ITI is drawn from an exponential distribution.
the unrolled form of equation 2.1 as a sum of rewards, the sum does not terminate on a trial boundary but rather continues indefinitely. In TD for semi-Markov processes (Bradtke & Duff, 1995; Mahadevan et al., 1997; Das et al., 1999), value updates occur irregularly, whenever there is a state transition. The error signal is

δ_k = r_{k+1} − ρ_k·d_k + V̂(s_{k+1}) − V̂(s_k),   (4.2)

where ρ_k is now subscripted because it must be estimated separately (e.g., by the average reward over the last n states, ρ_k = Σ_{k′=k−n+1}^{k+1} r_{k′} / Σ_{k′=k−n}^{k} d_{k′}).

4.2 Connecting This Theory to the Dopamine Response. Here we discuss a number of issues related to simulating the dopamine response with the algorithm described in the previous section.
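The windowed estimate of the average reward rate ρ_k can be sketched as a sliding window over recent rewards and dwell times. This is our own illustration (the article's exact window indexing differs slightly, and all names here are ours):

```python
from collections import deque

# Sliding-window estimate of the average reward rate rho_k: total reward
# over total dwell time for the last n state transitions.
class AvgRewardRate:
    def __init__(self, n):
        self.rewards = deque(maxlen=n)
        self.dwells = deque(maxlen=n)

    def update(self, r_k, d_k):
        """Record one transition's reward and dwell time; return rho_k."""
        self.rewards.append(r_k)
        self.dwells.append(d_k)
        return sum(self.rewards) / sum(self.dwells)

est = AvgRewardRate(n=3)
print(est.update(1.0, 5.0))   # one unit of reward per 5 s so far -> 0.2
print(est.update(0.0, 5.0))   # one unit of reward per 10 s -> 0.1
```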
the modeled dopamine signal. Instead, TD error is triggered only by state transitions, which are here taken to be always signaled by external events. Thus, this simple scheme cannot account for the finding that dopamine neurons pause when reward is omitted (Schultz et al., 1997). (It would instead assume the world remains in the ISI state with zero TD error until another event occurs, signaling a state transition and triggering learning.) In section 5, we handle these cases by using the assumption of partial observability to infer a state transition in the absence of a signal; however, when states are signaled reliably, that model will reduce to this one. We thus investigate this model in the context of experiments not involving reward omission. 4.2.2 Scaling of Negative Error. Because the background firing rates of dopamine neurons are low, excitatory responses have a much larger magnitude (measured by spike count) than inhibitory responses thought to represent the same absolute prediction error (Niv, Duff, et al., 2005). Recent work quantitatively comparing the firing rate to estimated prediction error confirms this observation and suggests that the dopamine response to negative error is rescaled or partially rectified (Bayer & Glimcher, 2005; Fiorillo, Tobler, & Schultz, 2003; Morris, Arkadir, Nevet, Vaadia, & Bergman, 2004). This fact can be important when mean firing rates are computed by averaging dopamine responses over trials containing both positive and negative prediction errors, since the negative errors will be underrepresented (Niv, Duff, et al., 2005). To account for this situation, we assume that dopaminergic firing rates are proportional to δ + ψ, positively rectified, where ψ is a small background firing rate. We average this rectified quantity over trials to simulate the dopamine response. 
The pattern of direct proportionality with rectification beneath a small level of negative error is consistent with the experimental results of Bayer and Glimcher (2005). Note that we assume that values are updated based on the complete error signal, with the missing information about negative errors reported separately to targets (perhaps by serotonin; Daw et al., 2002). An alternative possibility (Niv, Duff, et al., 2005) is that complete negative error information is present in the dopaminergic signal, though scaled differently, and targets are properly able to decode such a signal. There are as yet limited data to support or distinguish among these suggestions, but the difference is not material to our argument here. This article explores the implications for dopaminergic recordings of the asymmetry between bursts and pauses. Such asymmetry is empirically well demonstrated and distinct from speculation as to how dopamine targets might cope with it. 4.2.3 Interval Measurement Noise. We will in some cases consider the effects of internal timing noise on the modeled dopamine signal. In the model, time measurements enter the error signal calculation through the estimated dwell time durations dk . Following behavioral studies (Gibbon,
1977), we assume that for a constant true duration, these vary from trial to trial with a standard deviation proportional to the length of the true interval. We take the noise to be normally distributed.

4.3 Results: Simulated Dopamine Responses in the Model. Here we present simulations demonstrating the behavior of the model in various conditions. We consider unsignaled and signaled reward and the effect of externally imposed or subjective variability in the timing of events. Finally, we discuss experimental evidence relating to the model's predictions.

4.3.1 Results: Free Reward Delivery. The simplest experimental finding about dopamine neurons is that they burst when animals receive random, unsignaled reward (Schultz, 1998). The semi-Markov model's explanation for this effect is different from the usual TD account. This "free reward" experiment can be modeled as a semi-Markov process with a single state (see Figure 5, bottom right). Assuming Poisson delivery of rewards with magnitude r, mean rate λ, and mean interreward interval θ = 1/λ, the dwell times are exponentially distributed. We examine the TD error using equation 4.2. The state's value V̂ is arbitrary (since it only appears subtracted from itself in the error signal), and ρ = r/θ asymptotically. The TD error on receiving a reward of magnitude r after a delay d is thus

δ = r − ρ·d + V̂ − V̂   (4.3)
  = r·(1 − d/θ),   (4.4)
which is positive if d < θ and negative if d > θ , as illustrated in Figure 5 (left). That is, the TD error is relative to the expected delay θ : rewards occurring earlier than usual have higher value than expected, and conversely for later-than-average rewards. Figure 5 (right top) confirms that the semi-Markov TD error averaged over multiple trials is zero. However, due to the partial rectification of inhibitory responses, excitation dominates in the average over trials of the simulated dopamine response (see Figure 5, right middle), and net phasic excitation is predicted. 4.3.2 Results: Signaled Reward and Timing Noise. When a reward is reliably signaled by a stimulus that precedes it, dopaminergic responses famously transfer from the reward to the stimulus (Schultz, 1998). The corresponding semi-Markov model is the two-state model of Figures 4 and 6a. (We assume the stimulus timing is randomized.) As in free-reward tasks, the model predicts that the single-trial response to an event can vary from positive to negative depending on the interval preceding it, and, if sufficiently variable, the response averaged over trials will skew excitatory due to partial rectification of negative errors.
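The predicted asymmetry for free rewards is easy to verify numerically. The sketch below is our illustration, with arbitrary parameter values chosen to match Figure 5 (θ = 5 s, r = 1, rectification threshold −0.1); it samples Poisson interreward intervals and applies equation 4.4:

```python
import numpy as np

rng = np.random.default_rng(0)
r, theta = 1.0, 5.0                  # reward magnitude, mean interreward interval
rho = r / theta                      # asymptotic average reward rate
d = rng.exponential(theta, 100_000)  # exponentially distributed delays
delta = r - rho * d                  # TD error on reward receipt, equation 4.4
thresh = -0.1                        # negative errors rectified at this level
dopamine = np.maximum(delta, thresh) # simulated dopamine response
# Mean TD error is near zero, but the mean rectified response is positive.
```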
Figure 5: TD error to rewards delivered freely at Poisson intervals, using the semi-Markov TD model of equation 4.2. The state space consists of a single state (illustrated bottom right), with reward delivered on entry. (Left) Error in a single trial ranges from strongly positive through zero to strongly negative (top to bottom), depending on the time since the previous reward. Traces are aligned with respect to the previous reward. (Right) Error averaged over trials, aligned on the current reward. Right top: Mean TD error over trials is zero. Right middle: Mean TD error over trials with negative errors partially rectified (simulated dopamine signal) is positive. Mean interreward interval: 5 sec; reward magnitude: 1; rectification threshold: −0.1.
Analogous to rewards, this is evident here also for cues, whose occurrence (but not timing) in this task is signaled by the reward in the previous trial. As shown in Figure 6a, the model thus predicts the transfer of the (trial-averaged) response from the wholly predictable reward to the temporally unpredictable stimulus. (Single-trial cue responses covary with the preceding intertrial interval in a manner exactly analogous to reward responses in Figure 5 and are not illustrated separately.) Variability in the stimulus-reward interval has analogous effects. If the stimulus-reward interval is jittered slightly (see Figure 6b), there is no effect on the average simulated dopamine response. This is because, in contrast to the situations considered thus far, minor temporal jitter produces only small negative and positive prediction errors, which fail to reach the threshold
Figure 6: Semi-Markov model of experiments in which reward delivery is signaled. Tasks are illustrated as semi-Markov state spaces next to the corresponding simulated dopaminergic response. When the stimulus-reward interval is (a) deterministic or (b) only slightly variable, excitation is seen to the stimulus but not the reward. (c) When the stimulus-reward interval varies appreciably, excitation is seen in the trial-averaged reward response as well. (ITI changed between conditions to match average trial lengths.) (a) Mean ITI: 5 sec; ISI: 1 sec. (b) Mean ITI: 4.75 sec; ISI: 1–1.5 sec, uniform. (c) Mean ITI: 3 sec; ISI: 1–5 sec, uniform. Reward magnitude: 1; rectification threshold: −0.1.
of rectification and thus cancel each other out in the average. But if the variability is substantial, then a response is seen on average (see Figure 6c), because large variations in the interstimulus interval produce large positive and negative variations in the single-trial prediction error, exceeding the rectification threshold. Responding is broken out separately by delay in Figure 7. In general, the extent to which rectification biases the average dopaminergic response to be excitatory depends on how often, and by how much, negative TD error exceeds the rectification threshold. This in turn depends on the amount of jitter in the term −ρk · dk in equation 4.2, with larger average rewards ρ and more sizable jitter in the interreward intervals d promoting a net excitatory response.
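This dependence on jitter magnitude can be checked directly. In the sketch below (ours; the value of ρ and the jitter ranges are arbitrary illustrative choices), the single-trial error at the reward is approximated as −ρ times the deviation of the delay from its mean, which is zero-mean before rectification:

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_rectified_error(jitter, rho=0.2, thresh=-0.1, n=100_000):
    """Average rectified TD error at the reward for uniform delay jitter."""
    dev = rng.uniform(-jitter, jitter, n)  # delay deviation from its mean
    delta = -rho * dev                     # single-trial error at the reward
    return np.maximum(delta, thresh).mean()

small = mean_rectified_error(0.25)  # errors stay within the threshold: ~0
large = mean_rectified_error(2.0)   # errors exceed the threshold: net positive
```

With small jitter no error reaches the rectification threshold and the trial average stays near zero; with large jitter the clipped negative tail no longer cancels the positive one, so the average response turns excitatory.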
Figure 7: Semi-Markov TD error to rewards occurring at different delays after a stimulus (earlier to later, top to bottom); same task as Figure 6c. (Left) Task illustrated as a semi-Markov state space; rewards arrive at random, uniformly distributed intervals after the stimulus. (Right) Model predicts a decline in reward response with delay, with inhibition for rewards later than average. Parameters as in Figure 6c.
Finally, a parallel effect can be seen when we consider the additional effect of subjective time measurement noise. We repeat the conditioning experiment with a deterministic programmed ITI but add variability due to modeled subjective noise in timing. Figure 8 demonstrates that this noise has negligible effect when the delay between stimulus and reward is small, but for a longer delay, the modeled dopaminergic response to the reward reemerges and cannot be trained away. This is because timing noise has a constant coefficient of variation (Gibbon, 1977) and is thus more substantial for longer delays. 4.4 Discussion: Data Bearing on These Predictions. We have shown that the semi-Markov model explains dopaminergic responses to temporally unpredictable free rewards and reward predictors (Schultz, 1998). Compared to previous models, this account offers a new and testably different pattern of explanation for these excitatory responses. On stimulus or reward receipt, the “cost” −ρk · dk of the delay preceding it is subtracted from the phasic dopamine signal (see equation 4.2). Because of this
Figure 8: Effect of timing noise on modeled dopaminergic response to signaled reward depends on the interval between stimulus and reward (ISI). (Left) For ISI = 1 sec, error variation due to timing noise is unaffected by rectification and response to reward is minimal. (Right) For ISI = 6 sec, error variation from timing noise exceeds rectification level, and response to reward emerges. Mean ITI: 3 sec; reward magnitude: 1; rectification threshold: −0.1; coefficient of variation of timing noise: 0.5.
subtraction, the model predicts that the single-trial phasic dopamine response should decline as the intertrial or interstimulus interval preceding it increases. This prediction accords with results (albeit published only in abstract form) of Fiorillo and Schultz (2001) from a conditioning experiment in which the stimulus-reward interval varied uniformly between 1 and 3 seconds. The theory further predicts that the response to later-than-average rewards should actually become negative; the available data are ambiguous on this point.1 However, suggestively, all published dopamine neuron recordings exhibit noticeable trial-to-trial variability in excitatory responses (e.g., to temporally unpredictable stimuli), including many trials with a response at or below baseline. The suggestion that this variability reflects the preceding interevent interval could be tested with reanalysis of the data. These phenomena are not predicted by the original tapped delay line model. This is because, unlike the semi-Markov model, it assesses the cost of a delay not all together in the phasic response to an event (so that variability in the delay impacts on the event response) but instead gradually during the interval preceding it, on the passage through each marker state. (In particular, at each time step, the error includes a term of −ρ in the average reward formulation, or in the exponentially discounted version a related penalty arising from discounting; Daw & Touretzky, 2002.) On that account, rewards or reward predictors arriving on a Poisson or uniformly distributed random schedule should generally excite neurons regardless of their timing,
1 The publicly presented data from Fiorillo and Schultz (2001) include spike counts for different stimulus-reward delays, supporting the conclusion that the mean response never extends below baseline. However, the accompanying spike rasters suggest that this conclusion may depend on the length of the time window over which the spike counts are taken.
and the ubiquitous response variability must be attributed to unmodeled factors. The new theory also predicts that dopamine neurons should not respond when small amounts of temporal jitter precede an event, but that an excitatory response should emerge, on average, when the temporal variability is increased. This is consistent with the available data. In the experiment discussed above involving 1 to 3 second interstimulus intervals, Fiorillo and Schultz (2001) report net excitation to reward. Additionally, in an experiment involving a sequence of two stimuli predicting reward, neurons were excited by the second stimulus only when its timing varied (Schultz, Apicella, & Ljungberg, 1993). There is also evidence for tolerance of small levels of variability. In an early study (Ljungberg, Apicella, & Schultz, 1992), dopamine neurons did not respond to rewards (“no task” condition) or stimuli (“overtrained” condition) whose timing varied somewhat. Finally, the model predicts similar effects of subjective timing noise, and unpublished data support the model’s prediction that it should be impossible to train away the dopaminergic response to a reward whose timing is deterministically signaled by a sufficiently distant stimulus (C. Fiorillo, personal communication, 2002). Thus, insofar as data are available, the predictions of the theory discussed in this section appear to be borne out. A number of these phenomena would be difficult to explain using a tapped delay line model. The major gap in the theory as presented so far is the lack of account for experiments involving reward omission. We now show how to treat these as examples of partial observability. The model discussed so far follows as a limiting case of the resulting, more complex model whenever the world’s state is directly observable. 5 A New Model: Partial Observability Here we extend the model described in the previous section to include partial observability. 
We specify the formalism and discuss algorithms for state inference and value learning. Next, we present simulation results and analysis for several experiments involving temporal variability and reward omission. Finally, we discuss how the model’s predictions fare in the light of available data. 5.1 A Partially Observable Semi-Markov Model. Partial observability results from relaxing the one-to-one correspondence between states and observables that was assumed above. We assume that there is a set O of possible observations (which we take, for presentational simplicity, as each instantaneous and binary) and that reward is simply a special observation. The state evolves as before, but it is not observable; instead, each state is associated with a multinomial distribution over O, specified by an observation function O. Thus, when the process enters state sk , it emits an
accompanying observation o_k ∈ O according to a multinomial conditioned on the state. One observation in O is the null observation, ∅. If no state transition occurs at time t, then o_t = ∅. That is, nonempty observations can occur only on state transitions. Crucially, the converse is not true: state transitions can occur silently. This is how we model omitted reward in a trace conditioning experiment. We wish to find a TD algorithm for value prediction in this formalism. Most of the terms in the error signal of equation 4.2 are unavailable, because the states they depend on are unobservable. In fact, it is not even clear when to apply the update, since the times of state transitions are themselves unobservable. Extending a standard approach to partial observability in Markov processes (Chrisman, 1992; Kaelbling et al., 1998) to the semi-Markov case, we assume that the animal learns a model of the hidden process (that is, the functions T, O, and D determining the conditional probabilities of state transitions, observations, and dwell time durations). Such a model can be used together with the observations to infer estimates of the unavailable quantities. Note that given such a model, the values of the hidden states could in principle be computed offline using value iteration. (Since the hidden process is just a normal semi-Markov process, partial observability does not affect the solution.) We return to this point in the discussion; here, motivated by evidence of dopaminergic involvement in error-driven learning, we present an online TD algorithm for learning the same values by sampling, assisted by the model. The new error signal has a form similar to the fully observable case:

δ_{s,t} = β_{s,t} · (r_{t+1} − ρ_t · E[d_t] + E[V̂(s_{t+1})] − V̂(s)).   (5.1)
We discuss the differences, from left to right. First, the new error signal is state- as well as time-dependent. We compute an error signal for each state s at each time step t, on the hypothesis that the process transitioned out of s at t. The error signal is weighted by the probability that this is actually the case:

\beta_{s,t} \equiv P(s_t = s, \phi_t = 1 \mid o_1, \ldots, o_{t+1}), \qquad (5.2)
where \phi_t is a binary indicator that takes the value one if the state transitioned between times t and t + 1 (self-transitions count) and zero otherwise. Note that this determination is conditioned on observations made through time t + 1; this is chosen to parallel the one-time-step backup in the TD algorithm. \beta can be tracked using a version of the standard forward-backward recursions for hidden Markov models; equations are given in the appendix. The remaining changes in the error signal of equation 5.1 are the expected dwell time E[d_t] and the expected value of the successor state E[\hat{V}_{s_{t+1}}]. These
N. Daw, A. Courville, and D. Touretzky
are computed from the observations using the model, again conditioned on the hypothesis that the system left state s at time t:

E[d_t] \equiv \sum_{d=1}^{\infty} d \cdot P(d_t = d \mid s_t = s, \phi_t = 1, o_1, \ldots, o_{t+1}) \qquad (5.3)

E[\hat{V}_{s_{t+1}}] \equiv \sum_{s' \in S} \hat{V}_{s'} \, P(s_{t+1} = s' \mid s_t = s, \phi_t = 1, o_{t+1}). \qquad (5.4)
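Equations 5.1 through 5.4 amount to one vector-valued update per time step, once the inference machinery has supplied the departure posterior β and the model-based expectations. The following minimal sketch is our illustration only: the inference outputs are taken as given rather than computed by the forward-backward recursions, and all function and variable names are assumptions, not the authors' implementation.

```python
def td_step(V, beta, r_next, rho, E_dwell, E_V_next, lr=0.1):
    """One update of the partially observable semi-Markov TD rule
    (cf. equation 5.1), with all inference outputs taken as given.
    V: list of value estimates V_hat, one per hidden state (mutated).
    beta[s]: posterior that state s was departed at t (cf. eq. 5.2).
    E_dwell[s], E_V_next[s]: model-based expectations (eqs. 5.3, 5.4).
    Returns the vector of state-specific errors delta[s]."""
    delta = [b * (r_next - rho * d + v_next - v)
             for b, d, v_next, v in zip(beta, E_dwell, E_V_next, V)]
    for s, d_s in enumerate(delta):
        V[s] += lr * d_s  # reduce the error for each state hypothesis
    return delta
```

In the full algorithm, β and the two expectations would come from the forward-backward recursions and model formulas given in the appendix.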
Expressions for computing these quantities using the inference model are given in the appendix. Assuming the inference model is correct (i.e., that it accurately captures the process generating the observations), this TD algorithm for value learning is exact in that it has the same fixed point as value iteration using the inference model. The proof is sketched in the appendix. Note also that in the fully observable limit (i.e., when s, d, and \phi can be inferred with certainty), the algorithm reduces exactly to the semi-Markov rule of equation 4.2. Simulations (not reported here) demonstrate that in general, when the posterior distributions over these parameters are relatively well specified (i.e., when uncertainty is moderate), this algorithm behaves similarly to the semi-Markov algorithm described in section 4. The main differences come about in cases of considerable state uncertainty, as when reward is omitted.

We have described the TD error computation for learning values in a partially observable semi-Markov process. It may be useful to review how the computation actually proceeds. At each time step, the system receives a (possibly empty) observation or reward, and the representational system uses this to update its estimate of the state departure distribution \beta and other latent quantities. The TD learning system uses these estimates to compute the TD error \delta, which is reported by the dopamine system (perhaps assisted by the serotonin system). Stored value estimates \hat{V} are updated to reduce the error, and the cycle repeats.

5.2 Connecting This Theory to the Dopamine Response. In order to finalize the specification of the model, we discuss several further issues about simulating the dopamine response with the partially observable semi-Markov algorithm.

5.2.1 Vector vs. Scalar Error Signals. As already mentioned, equation 5.1 is a vector rather than a scalar error signal, since it contains an error for each state's value.
Previous models have largely assumed that the dopamine response reports a scalar error signal, supported by experiments showing striking similarity between the responses of most dopaminergic neurons (Schultz, 1998). However, there is some variability between neurons.
Notably, only a subset (55–75%) of neurons displays any particular sort of phasic response (Schultz, 1998). Also, several studies report sizable subsets of neurons showing qualitatively different patterns of responding in the same situation (e.g., excitation versus inhibition; Schultz & Romo, 1990; Mirenowicz & Schultz, 1996; Waelti, Dickinson, & Schultz, 2001; Tobler, Dickinson, & Schultz, 2003; though see Ungless, Magill, & Bolam, 2004). We suggest that dopamine neurons might code a vector error signal like equation 5.1 in some distributed manner and that this might account for response variation between neurons. Absent data from experiments designed to constrain such a hypothesis, for this article we illustrate the dopamine response as a scalar, cumulative error over all the states:

\delta_t = \sum_{s \in S} \delta_{s,t}. \qquad (5.5)
This quantity may be interpreted in terms of either a vector or scalar model of the dopamine signal. If dopamine neurons uniformly reported this scalar signal, then values could be learned by apportioning the state-specific error among its targets according to \beta_{s,t} (with \hat{V}_s updated proportionally to \delta_t \cdot \beta_{s,t} / \sum_{s' \in S} \beta_{s',t}). This is a reasonable approximation to the full algorithm so long as there is only moderate state uncertainty, and it works well in our experience (simulations not reported here). The simplest vector signal would have different dopamine neurons reporting the state-dependent error \delta_{s,t} for different states; the scalar error \delta_t could then be viewed as modeling the sum or average over a large group of neurons. It is noteworthy that articles on dopamine neuron recordings predominantly report data in terms of such averages, accompanied by traces of a very few individual neurons. However, such a sparsely coded vector signal is probably unrealistic given the relative homogeneity reported for individual responses. A viable compromise might be a more coarsely coded vector scheme in which each dopamine neuron reports the cumulative TD error over a random subset of states. For large enough subsets (e.g., more than 50% of states per neuron), such a scheme would capture the overall homogeneity but limited between-neuron variability in the single-unit responses. In this case, the aggregate error signal from equation 5.5 would represent both the average over neurons and, roughly, a typical single-neuron response.

5.2.2 World Models and Asymptotic Model Uncertainty. As already mentioned, because our focus is on the influence of an internal model on asymptotic dopamine responding rather than on the process of learning such a model, for each task we take as given a fixed world model based on the actual structure that generated the task events. For instance, for trace-conditioning experiments, the model is based on Figure 4.
Importantly, however, we assume that animals never become entirely certain
[Figure 9 diagram: ISI and ITI states, each with a dwell-time duration; one transition emits reward 96% (stim 2%, nothing 2%), the other emits stim 96% (reward 2%, nothing 2%).] Figure 9: State space for semi-Markov model of trace conditioning experiment, with asymptotically uncertain dwell time distributions and observation models. For simplicity, analogous noise in the transition probabilities (small chance of self-transition) is not illustrated.
about the world’s precise contingencies; each model is thus systematically altered to include asymptotic uncertainty in its distributions over dwell times, transitions, and observations. This variance could reflect irreducible sensor noise (e.g., time measurement error) and persistent uncertainty due to assumed nonstationarity in the contingencies being modeled (Kakade & Dayan, 2000, 2002). We represent even deterministic dwell-time distributions as gaussians with nonzero variance. Similarly, the multinomials describing observation and state transition probabilities attribute nonzero probability even to anomalous events (such as state self-transitions or reward omission). The effects of these modifications on the semi-Markov model of trace conditioning are illustrated in Figure 9. 5.3 Results: Simulated Dopaminergic Responding in the Model. We first consider the basic effect of reward omission. Figure 10 (left top) shows the effect on a trace-conditioning task, with the inferred state distribution illustrated by a shaded bar under the trace. As time passes without reward, inferred probability mass leaks into the ITI state (shown as the progression from black to white in the bar, blown up on the inset), accompanied by negative TD error. Because the model’s judgment that the reward was omitted occurs progressively, the predicted dopaminergic inhibition is slightly delayed and prolonged compared to the expected time of reward delivery. Repeated reward omission also reduces the value predictions attributed to preceding stimuli, which in turn has an impact on the dopaminergic responses to the stimuli and to the subsequent rewards when they arrive. Figure 11 shows how, asymptotically, the degree of partial reinforcement trades off relative responding between stimuli and rewards.
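For concreteness, the asymptotic-uncertainty modifications of section 5.2.2, in force throughout these simulations, can be sketched as follows. This is an illustrative sketch only: the function name, interface, and smoothing scheme are our assumptions, with the 2% anomaly probability and 0.08 dwell-time CV taken from the figure captions.

```python
def soften_model(obs_probs, dwell_mean, anomaly=0.02, cv=0.08):
    """Inject asymptotic uncertainty into an otherwise deterministic
    task model (cf. Figure 9).  obs_probs maps each possible
    observation on a transition to its nominal probability; every
    outcome receives at least the anomaly probability, and a
    deterministic dwell time becomes a gaussian with sd = cv * mean."""
    n = len(obs_probs)
    smoothed = {o: (1.0 - n * anomaly) * p + anomaly
                for o, p in obs_probs.items()}
    dwell = (dwell_mean, cv * dwell_mean)  # gaussian (mean, sd)
    return smoothed, dwell
```

With obs_probs = {'reward': 1.0, 'stim': 0.0, 'nothing': 0.0}, the smoothed multinomial reproduces the 96%/2%/2% split shown in Figure 9.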
[Figure 10: panels of avg. rectified δ(t) versus time for "Reward omitted" and "Reward timing probed" conditions, with stim and rew event markers and the ISI/ITI state space of the inference model.] Figure 10: Simulated dopamine responses from partially observable semi-Markov model of trace conditioning, with reward omitted or delivered at an unexpected time. (Left top) Reward omission inhibits response, somewhat after the time reward was expected. (Left bottom) State space of inference model. (Right) When reward is delivered earlier (top) or later (bottom) than expected, excitation is seen to reward. No inhibition is seen after early rewards, at the time reward was originally expected. Shaded bars show inferred state (white: ITI, black: ISI). Mean ITI: 5 sec; ISI: 1 sec; reward magnitude: 1; rectification threshold: −0.1; probability of anomalous event in inference model: 2%; CV of dwell time uncertainty: 0.08.
Hollerman and Schultz (1998) generalized the reward omission experiment to include probe trials in which the reward was delivered a half-second early or late. Figure 10 (right) shows the behavior of the partially observable semi-Markov model on this task. In accord with the findings discussed in section 2, the semi-Markov model predicts no inhibition at the time the reward was originally expected. As shown by the shaded bar underneath the trace, this is because the model assumes that the early reward has signaled an early transition into the ITI state, where no further reward is expected. While the inference model judges such an event unlikely, it is a better explanation for the observed data than any other path through the state space. 5.4 Discussion: Data Bearing on These Predictions. The partially observable model behaves like the fully observable model for the experiments reported in the previous section (simulations not shown) and additionally accounts for dopamine responses when reward is omitted (Schultz et al., 1997) or delivered at unexpected times (Hollerman & Schultz, 1998). The results are due to the inference model making judgments about the likely progression of the hidden state when observable signals are absent
[Figure 11: panels of avg. rectified δ(t) for 4%, 25%, 50%, 75%, and 96% reinforcement, each marking stim and rew; at left, the inference state space with reward delivered with probability p.] Figure 11: Simulated dopamine responses from partially observable semi-Markov model of trace conditioning with different levels of partial reinforcement. (Left) State space for inference. (Right) As chance of reinforcement increases, phasic responding to the reward decreases while responding to the stimulus increases. Shaded bars show inferred state (white: ITI, black: ISI). Mean ITI: 5 sec; ISI: 1 sec; reward magnitude: 1; rectification threshold: −0.1; probability of anomalous event in inference model: 2%; CV of dwell time uncertainty: 0.08.
or unusual. Since such judgments unfold progressively with the passage of time, simulated dopaminergic pauses are both later and longer than bursts. This difference is experimentally verified by reports of population duration and latency ranges (Schultz, 1998; Hollerman & Schultz, 1998) and is unexplained by delay line models.
However, in order to obtain pause durations similar to experimental data, it was necessary to assume fairly low levels of variance in the inference model's distribution over the interstimulus interval. That this variance (0.08) was much smaller than the level of timing noise suggested by behavioral experiments (0.3–0.5 according to Gibbon, 1977, or even 0.16 reported more recently by Gallistel, King, & McDonald, 2004) goes against the natural assumption that the uncertainty in the inference model captures the noise in internal timing processes. It is possible that because of the short, 1-second interevent durations used in the recordings, the monkeys are operating in a more accurate timing regime that seems behaviorally to dominate for subsecond intervals (see Gibbon, 1977). In this respect, it is interesting to compare the results of Morris et al. (2004), who recorded dopamine responses after slightly longer (1.5 and 2 seconds) deterministic trace intervals and reported noticeably more prolonged inhibitory responses. A fuller treatment of these issues would likely require both a more realistic theory that includes spike generation and as yet unavailable experimental analysis of the trial-to-trial variability in dopaminergic pause responses. The new model shares with previous TD models the prediction that the degree of partial reinforcement should trade off phasic dopaminergic responding between a stimulus and its reward (see Figure 11). This is well verified experimentally (Fiorillo et al., 2003; Morris et al., 2004). One of these studies (Fiorillo et al., 2003) revealed an additional response phenomenon: a tonic "ramp" of responding between stimulus and reward, maximal for 50% reinforcement.
Such a ramp is not predicted by our model (in fact, we predict very slight inhibition preceding reward, related to the possibility of an early, unrewarded state transition), but we note that such ramping is seen only in delay, and not trace, conditioning (Fiorillo et al., 2003; Morris et al., 2004). Therefore, it might reflect some aspect of the processing of temporally extended stimuli that our theory (which incorporates only instantaneous stimuli) does not yet contemplate. Alternatively, Niv, Duff, et al. (2005) suggest that the ramp might result from trial averaging over errors on the progression between states in a tapped-delay line model, due to the asymmetric nature of the dopamine response. This explanation can be directly reproduced in our semi-Markov framework by assuming that there is at least one state transition during the interstimulus interval, perhaps related to the persistence of the stimulus (simulations not reported; Daw, 2003). Finally, we could join the authors of the original study (Fiorillo et al., 2003) in assuming that the ramp is an entirely separate signal multiplexed with the prediction error signal. (Note, however, that although they associate the ramp with “uncertainty,” this is not the same kind of uncertainty that we mainly discuss in this article. Our uncertainty measures posterior ignorance about the hidden state; Fiorillo et al. are concerned with known stochasticity or “risk” in reinforcer delivery.)
6 Discussion We have developed a new account of the dopaminergic response that builds on previous ones in a number of directions. The major features in the model are a partially observable state and semi-Markov dynamics; these are accompanied by a number of further assumptions (including asymptotic uncertainty, temporal measurement noise, and the rectification of negative TD error) to produce new or substantially different explanations for a range of dopaminergic response phenomena. We have focused particularly on a set of experiments that exercise both the temporal and hidden state aspects of the model—those involving the state uncertainty that arises when an event can vary in its timing or be altogether omitted. 6.1 Comparing Different Models. The two key features of our model, partial observability and semi-Markov dynamics, are to a certain extent separable and each interesting in its own right. We have already shown how a fully observable semi-Markov model with interesting properties arises as a limiting case of our model when, as in many experimental situations, the observations are unambiguous. Another interesting relative, which has yet to be explored, would include partial observability and state inference but not semi-Markov dynamics. One way to construct such a model is to note that any discrete-time semi-Markov process of the sort described here has a hidden Markov model that is equivalent to it for generating observation sequences. This can be obtained by subdividing each semi-Markov state into a series of discrete time marker states, each lasting one time step (see Figure 12). State inference and TD are straightforward in this setting (in
[Figure 12 diagram: the ISI and ITI states rendered as chains of one-step marker substates, with stim and reward emitted on the transitions between the two chains.] Figure 12: Markov model equivalent to the semi-Markov model of a trace conditioning experiment from Figure 4. The ISI and ITI states are subdivided into a series of substates that mark the passage of time. Stimuli and rewards occur only on transitions from one set of states into the other; dwell time distributions are encoded by the transition probabilities from each marker state.
particular, standard TD can be performed directly on the state posteriors, which are Markov; Kaelbling et al., 1998). Moreover, the fully observable limit of this model (equivalently, the model obtained by subdividing states in our fully observable semi-Markov model) is just a version of the familiar tapped delay line model, in which a series of marker states marks time from each event. Since each new event launches a new cascade of marker states and stops the old one, this model includes a “reset” device similar to those discussed in section 2. An interesting avenue for exploration would be intermediate hybrids, in which long delays are subdivided more coarsely into a series of a few temporally extended, semi-Markov states. A somewhat similar division of long delays into a series of several, temporally extended (internal) substates of different length is a feature of some behavioral theories of timing (Killeen & Fetterman, 1988; Machado, 1997), in part because animals engage in different behaviors during different portions of a timed interval. Although the Markov and semi-Markov formalisms are equivalent as generative models, the TD algorithms for value learning in each are qualitatively different, because of their different approaches for “backing up” reward predictions over a long delay. The semi-Markov algorithms do so directly, while Markov algorithms employ a series of intermediate marker states. There is thus an empirical question as to which sort of algorithm best corresponds to the dopaminergic response. A key empirical signature of the semi-Markov model would be its prediction that when reward or stimulus timing can vary, later-than-average reward-related events should inhibit dopamine neurons (e.g., see Figure 7). Markov models generally predict no such inhibition, but instead excitation that could shrink toward but not below baseline. 
In principle, this question could be addressed with trial-by-trial reanalysis of any of a number of extant data sets, since the new model predicts the same pattern of dopaminergic excitation and inhibition as a function of the preceding interval in a number of experiments, including ones as simple as free reward delivery (see Figure 5). Unfortunately, the one study directly addressing this issue was published only as an abstract (Fiorillo & Schultz, 2001). Though the original interpretation followed the Markov account, the publicly presented data appear ambiguous. (See the discussion in section 4 and note 1.) A signature of many (though not all; Pan, Schmidt, Wickens, & Hyland, 2005) tapped delay line algorithms would be phasic dopaminergic activity during the period between stimulus and reinforcer, reflecting value backing up over intermediate marker states during initial acquisition of a stimulus-reward association. No direct observations have been reported of such intermediate activity, which may not be determinative since it would be subtle and fleeting. As already mentioned, Niv, Duff, et al. (2005) have suggested that the tonic ramp observed in the dopamine response by Fiorillo et al. (2003) might reflect the average over trials of such a response. Should
this hypothesis be upheld by a trial-by-trial analysis of the data, it would be the best evidence for Markov over semi-Markov TD. A final consideration regarding the trade-off between Markov and semiMarkov approaches is that, as Niv, Duff, and Dayan (2004) point out, Markov models are quite intolerant of timing noise. Our results suggest that semi-Markov models are more robustly able to account for dopaminergic response patterns in the light of presumably noisy internal timing. It is worth discussing the contributions of two other features of our model that differ from the standard TD account. Both have also been studied separately in previous work. First, the asymmetry between excitatory and inhibitory dopamine responses has appreciable effects only when averaging over trials with different prediction errors. Thus, it is crucial in the present semi-Markov account, where most of the burst responses we simulate originate from the asymmetric average over errors that differ from trial to trial due to differences in event timing. Delay line models do not generally predict similar effects of event timing on error, and so in that context, asymmetric averaging has mainly been important in understanding experiments in which error varies due to differential reward delivery (as in Figure 7; Niv, Duff, et al., 2005). In contrast, the use of an average-reward TD rule (rather than the more traditional discounted formulation) plays a more cosmetic role in this work. In the average reward formulation, trial-to-trial variability in delays dk affects the prediction error (see equation 4.2) through the term −ρk · dk ; in a discounted version, analogous effects would occur due to long delays being more discounted (as γ dk ). One advantage of the average reward formulation is that it is invariant to dilations or contractions of the timescale of events, which may be relevant to behavior (discussed below). 
Previous work on average-reward TD in the context of delay line models has suggested that this formulation might shed light on tonic dopamine, dopamine-serotonin interactions, and motivational effects on response vigor (Daw & Touretzky, 2002; Daw et al., 2002; Niv, Daw, & Dayan, 2005). 6.2 Behavioral Data. Our model is concerned with the responses of dopamine neurons. However, insofar as dopaminergically mediated learning may underlie some forms of both classical and instrumental conditioning (e.g., Parkinson et al., 2002; Faure, Haberland, Condé, & Massioui, 2005), the theory suggests connections with behavioral data as well. Much more work will be needed to develop such connections fully, but we mention a few interesting directions here. Our theory generalizes and provides some justification (in limited circumstances) for the "reset" hypothesis that had previously been proposed, on an ad hoc basis, to correct the TD account of the Hollerman and Schultz (1998) experiment (Suri & Schultz, 1998, 1999; Brown et al., 1999). In our theory, "reset" (here, an inferred transition into the ITI state) occurs after reward because this is consistent with the inference model for that experiment. But
in other circumstances, for instance, those in which a stimulus is followed by a sequence of more than one reward, information about the stimulus remains predictively relevant after the first reward, and our theory (unlike its predecessors) would not immediately discard it. Dopamine neurons have not been recorded in such situations, but behavioral experiments offer some clues. Animals can readily learn that a stimulus predicts multiple reinforcers; in classical conditioning, this has been repeatedly, though indirectly, demonstrated by showing that adding or removing reinforcers to a sequence has effects on learning (upward and downward unblocking; Holland & Kenmuir, 2005; Holland, 1988; Dickinson & Mackintosh, 1979; Dickinson, Hall, & Mackintosh, 1976). No such learning would be possible in the tapped delay line model if the first reward reset the representation. Because it has somewhat different criteria for what circumstances trigger a reset, the Suri and Schultz (1998, 1999) model would also have problems learning a task known variously as feature-negative occasion setting or sequential conditioned inhibition (Yin, Barnet, & Miller, 1994; Bouton & Nelson, 1998; Holland, Lamoureux, Han, & Gallagher, 1999). However, we should note that a different sort of behavioral experiment does support a reward-triggered reset in one situation. In an instrumental conditioning task requiring animals to track elapsed time over a period of about 90 seconds, reward arrival seems to reset animals’ interval counts (Matell & Meck, 1999). It is unclear how to reconcile this last finding with the substantial evidence from classical conditioning. Gallistel and Gibbon (2000) have argued that behavioral response acquisition in classical conditioning experiments is timescale invariant in the sense that contracting or dilating the speed of all events does not affect the number of stimulus-reward pairings before a response is seen. 
This would not be true for responses learned by simple tapped delay line TD models (since doubling the speed of events would halve the number of marker states and thereby speed acquisition), but this property naturally holds for our fully observable semi-Markov model (and would hold for the partially observable model if the unspecified model learning phase were itself timescale invariant). There is also behavioral evidence that may relate to our predictions about the relationship between the dopamine response and the preceding interval. The latency to animals’ behavioral responses across many instrumental tasks is correlated with the previous interreinforcement interval, with earlier responding after shorter intervals (“linear waiting”; Staddon & Cerutti, 2003). Given dopamine’s involvement in response vigor (e.g., Dickinson, Smith, & Mirenowicz, 2000; Satoh, Nakai, Sato, & Kimura, 2003; McClure, Daw, & Montague, 2003; Niv, Daw, et al., 2005), this effect may reflect enhanced dopaminergic activity after shorter intervals, as our model predicts. However, the same reasoning applied to reward omission (after which dopaminergic responding is transiently suppressed) would incorrectly predict slower responding. In fact, animals respond earlier following reward
omission (Staddon & Innis, 1969; Mellon, Leak, Fairhurst, & Gibbon, 1995); we thus certainly do not have a full account of the many factors influencing behavioral response latency, particularly on instrumental tasks. Particularly given these complexities, we stress that our theory is not intended as a theory of behavioral timing. To the contrary, it is adjunct to such a theory: it assumes an extrinsic timing signal. We have investigated how information about the passage of time can be combined with sensory information to make inferences about future reward probabilities and drive dopaminergic responding. We have explored the resulting effect on the dopaminergic response of one feature common to most behavioral timing models—scalar timing noise (Gibbon, 1977; Killeen & Fetterman, 1988; Staddon & Higa, 1999)—but we are otherwise rather agnostic about the timing substrate. One prominent timing theory, BeT (Killeen & Fetterman, 1988), assumes animals time intervals by probabilistically visiting a series of internally generated behavioral states of extended duration (see also LeT; Machado, 1997). While these behavioral “states” may seem to parallel the semi-Markov states of our account, it is important to recall that in our theory, the states are not actual internal states of the animal but rather are notional features of the animal’s generative model of events in the external world. That generative model, plus extrinsic information about the passage of time, is used to infer a distribution over the external state. One concrete consequence of this difference can be seen in our simulations, in which even though the world is assumed to change abruptly and discretely, the internal representation is continuous and can sometimes change gradually (unlike the states of BeT). 
Finally, both behavioral and neuroscientific data from instrumental conditioning tasks suggest that depending on the circumstances, animals seem to evaluate potential actions using either TD-style model-free or dynamic-programming-style model-based RL methods (Dickinson & Balleine, 2002; Daw, Niv, & Dayan, in press). This apparent heterogeneity of control is puzzling and relates to a potential objection to our framework. Specifically, our assumption that animals maintain a full world model may seem to make redundant the use of TD to learn value estimates, since the world model itself already contains the information necessary to derive value estimates (using dynamic programming methods such as value iteration). This tension may be resolved by considerations of efficiency and of balancing the advantages and disadvantages of both RL approaches. Given that the world model is in principle learned online and thus subject to ongoing change, it is computationally more frugal and often not appreciably less accurate to use TD to maintain relatively up-to-date value estimates rather than to repeat laborious episodes of value iteration. This simple idea is elaborated in RL algorithms like prioritized sweeping (Moore & Atkeson, 1993) and Dyna-Q (Sutton, 1990). In fact, animals behaviorally exhibit characteristics of each sort of planning under circumstances to which that method is computationally well suited (Daw et al., 2005).
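The economy argued for here is the essence of algorithms like Dyna-Q: a direct TD update from each real transition, plus a few cheap replayed updates from the learned model in place of a full sweep of value iteration. The following minimal tabular sketch is ours, not from the paper; the deterministic one-step model, learning rate, and all names are simplifying assumptions.

```python
import random

def max_q(Q, s):
    """Greedy value of state s under tabular Q (0 if unvisited)."""
    vals = [q for (st, _), q in Q.items() if st == s]
    return max(vals) if vals else 0.0

def dyna_q_step(Q, model, s, a, r, s2, alpha=0.1, gamma=0.9, n_plan=5):
    """One Dyna-Q step (Sutton, 1990): learn from the real transition
    (s, a, r, s2), record it in the model, then replay a few
    model-simulated transitions instead of sweeping all states."""
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        r + gamma * max_q(Q, s2) - Q.get((s, a), 0.0))
    model[(s, a)] = (r, s2)              # deterministic one-step model
    for _ in range(n_plan):              # cheap simulated experience
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        Q[(ps, pa)] = Q.get((ps, pa), 0.0) + alpha * (
            pr + gamma * max_q(Q, ps2) - Q.get((ps, pa), 0.0))
```

Each planning replay costs one table lookup and update, so values track an evolving model far more cheaply than repeated episodes of value iteration.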
6.3 Future Theoretical Directions. A number of authors have previously suggested schemes for combining a world model with TD theories of the dopamine system (Dayan, 2002; Dayan & Balleine, 2002; Suri, 2001; Daw et al., in press; Daw et al., 2005; Smith, Becker, & Kapur, 2005). However, all of this work concerns the use of a world model for planning actions. The present account explores a separate, though not mutually exclusive, function of a world model: for state inference to address problems of partial observability. Our work thus situates the hypothesized dopaminergic RL system in the context of a broader view of the brain’s functional anatomy, in which the subcortical RL systems receive a refined, inferred sensory representation from cortex (similar frameworks have been suggested by Doya, 1999, 2000 and by Szita & Lorincz, 2004). Our Bayesian, generative view on this sensory processing is broadly consistent with recent theories of sensory cortex (Lewicki & Olshausen, 1999; Lewicki, 2002). Such theories may suggest how to address an important gap in the present work: we have not investigated how the hypothesized model learning and inference functions might be implemented in brain tissue. In the area of cortical sensory processing, such implementational questions are an extremely active area of research (Gold & Shadlen, 2002; Deneve, 2004; Rao, 2004; Zemel, Huys, Natarajan, & Dayan, 2004). Also, in the context of a generative model whose details are closer to our own, it has been suggested that a plausible neural implementation might be possible using sampling rather than exact inference for the state posterior (Kurth-Nelson & Redish, 2004). Here, we intend no claim that animal brains are using the same methods as we have to implement the calculations; our goal is rather to identify the computations and their implications for the dopamine response. 
The other major gap in our presentation is that while we have discussed reward prediction in partially observable semi-Markov processes, we have not explicitly treated action selection in the analogous decision processes. It is widely presumed that dopaminergic value learning ultimately supports the selection of high-valued actions, probably by an approach like actor-critic algorithms (Sutton, 1984; Sutton & Barto, 1998), which use TD-derived value predictions to influence a separate process of learning action selection policies. There is suggestive evidence from functional anatomy that distinct dopaminergic targets in the ventral and dorsal striatum might subserve these two functions (Montague et al., 1996; Voorn, Vanderschuren, Groenewegen, Robbins, & Pennartz, 2004; Daw et al., in press; but see also Houk et al., 1995; Joel, Niv, & Ruppin, 2002, for alternative views). This idea has also recently received more direct support from fMRI (O'Doherty et al., 2004) and unit recording (Daw, Touretzky, & Skaggs, 2004) studies. We do not envision the elaborations in the present theory as substantially changing this picture. That said, hidden state deeply complicates the action selection problem in Markov decision processes (i.e., partially observable MDPs, or POMDPs; Kaelbling et al., 1998). The difficulty, in a nutshell, is that the optimal action when the state is uncertain may be different
N. Daw, A. Courville, and D. Touretzky
from what would be the optimal action if the agent were certainly in any particular state (e.g., for exploratory or information-gathering actions). In general, the optimal action varies with the state posterior, which is a continuous-valued quantity, unlike the manageably discrete state variable that determines actions in a fully observable MDP. Behavioral and neural recording experiments on monkeys in a task involving a simple form of state uncertainty suggest that animals indeed use their degree of state uncertainty to guide their behavior (Gold & Shadlen, 2002). Since the appropriate action can vary continuously with the state posterior, incorporating action selection in the present model would require approximating the policy (and value) as functions of the full belief state, preferably nonlinearly.

6.4 Future Experimental Directions. Our work enriches previous theories of dopaminergic responding by identifying two important theoretical issues that should guide future experiments: the distinction between Markov and semi-Markov backup and partial observability. We have already discussed how the former issue suggests new experimental analyses; similarly, issues of hidden state are ripe for future experiment. Partial observability suggests a particular question: whether dopaminergic neurons report an aggregate error signal or separate error signals tied to different hypotheses about the world's state (see section 5.2). This could be studied more or less directly, for instance, by placing an animal in a situation where ambiguous reward expectancy (e.g., one reward, or two, or three) resolved into a situation where the intermediate reward was expected with certainty. With a scalar error code, dopamine neurons should not react to this event; with a vector error code, different neurons should report both positive and negative error.
More generally, it would be interesting to study how dopaminergic neurons behave in many situations of sensory ambiguity (as in noisy motion discriminations, e.g., Gold & Shadlen, 2002, where much is known about how cortical neurons track uncertainty but there is no indication how, if at all, the dopaminergic system is involved). The present theory and the antecedent theory of partially observable Markov decision processes suggest a framework by which many such experiments could be designed and analyzed.

Appendix

Here we present and sketch derivations for the formulas for inference in the partially observable semi-Markov model. Inference rules for similar hidden semi-Markov models have been described by Levinson (1986) and Guedon & Cocozza-Thivent (1990). We also sketch the proof of the correctness of the TD algorithm for the model.
Below, we use abbreviated notation for the transition, dwell time, and observation functions. We write the conditional transition probabilities as $T_{s',s} \equiv P(s_k = s \mid s_{k-1} = s')$; the conditional dwell time distributions as $D_{s,d} \equiv P(d_k = d \mid s_k = s)$; and the observation probabilities as $O_{s,o} \equiv P(o_k = o \mid s_k = s)$. These functions, together with the observation sequence, are given; the goal is to infer posterior distributions over the unobservable quantities needed for TD learning. The chief quantity necessary for the TD learning rule of equation 5.1 is $\beta_{s,t} = P(s_t = s, \phi_t = 1 \mid o_1 \ldots o_{t+1})$, the probability that the process left state $s$ at time $t$. To perform the one time step of smoothing in this equation, we use Bayes' theorem on the subsequent observation:

$$\beta_{s,t} = \frac{P(o_{t+1} \mid s_t = s, \phi_t = 1) \cdot P(s_t = s, \phi_t = 1 \mid o_1 \ldots o_t)}{P(o_{t+1} \mid o_1 \ldots o_t)}. \tag{A.1}$$
In this equation and several below, we have made use of the Markov property: the conditional independence of $s_{t+1}$ and $o_{t+1}$ from the previous observations and states given the predecessor state $s_t$. In semi-Markov processes (unlike Markov processes), this property holds only at a state transition, that is, when $\phi_t = 1$. The first term of the numerator of equation A.1 can be computed by integrating over $s_{t+1}$ in the model: $P(o_{t+1} \mid s_t = s, \phi_t = 1) = \sum_{s' \in S} T_{s,s'} O_{s',o_{t+1}}$. Call the second term of the numerator of equation A.1 $\alpha_{s,t}$. Computing it requires integrating over the possible durations of the stay in state $s$:

$$\alpha_{s,t} = P(s_t = s, \phi_t = 1 \mid o_1 \ldots o_t) \tag{A.2}$$

$$= \sum_{d=1}^{\infty} P(s_t = s, \phi_t = 1, d_t = d \mid o_1 \ldots o_t) \tag{A.3}$$

$$= \sum_{d=1}^{\infty} \frac{P(o_{t-d+1} \ldots o_t \mid s_t = s, \phi_t = 1, d_t = d, o_1 \ldots o_{t-d}) \cdot P(s_t = s, \phi_t = 1, d_t = d \mid o_1 \ldots o_{t-d})}{P(o_{t-d+1} \ldots o_t \mid o_1 \ldots o_{t-d})} \tag{A.4}$$

$$= \sum_{d=1}^{\infty} \frac{O_{s,o_{t-d+1}} \, D_{s,d} \, P(s_{t-d+1} = s, \phi_{t-d} = 1 \mid o_1 \ldots o_{t-d})}{P(o_{t-d+1} \ldots o_t \mid o_1 \ldots o_{t-d})}, \tag{A.5}$$
where the sum need not actually be taken out to infinity, but only until the last time a nonempty observation was observed (where a state transition must have occurred). The derivation makes use of the fact that the observation ot is empty with probability one except on a state transition. Thus, under the hypothesis that the system dwelled in state s from time t − d + 1 through time t, the probability of the sequence of null observations during that period equals just the probability of the first, Os,ot−d+1 .
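This forward recursion, together with the entry-probability update of equation A.6 below, can be sketched in code. The variable names and the per-step normalization by the evidence are our own; the sketch assumes, as in the text, that observations between state transitions are null with probability one, so only the first observation after each entry carries likelihood.

```python
import numpy as np

def hsmm_forward(T, D, O, obs, prior):
    """Forward recursion for the partially observable semi-Markov model
    (equations A.2-A.6, in our own notation).
    T[s_prev, s] : transition probabilities T_{s',s}
    D[s, d-1]    : dwell-time probabilities D_{s,d}
    O[s, o]      : observation probabilities O_{s,o} (entry observation)
    obs          : observed symbol indices o_1 .. o_n
    prior[s]     : distribution over the state entered at time 1
    Returns post[t, s] = P(s_t = s, phi_t = 1 | o_1 .. o_t), assuming
    the observation sequence has nonzero probability under the model."""
    n, S = len(obs), T.shape[0]
    max_d = D.shape[1]
    # survival function P(dwell >= d), needed for the evidence terms
    surv = np.cumsum(D[:, ::-1], axis=1)[:, ::-1]
    ent = np.zeros((n + 1, S))   # ent[k, s] ~ P(enter s at time k+1, o_1..o_k)
    ent[0] = prior
    alpha = np.zeros((n, S))     # joint P(s_t = s, phi_t = 1, o_1..o_t)
    post = np.zeros((n, S))
    for t in range(n):
        ev = 0.0                 # evidence P(o_1..o_t), summing over dwells
        for s in range(S):
            for d in range(1, min(t + 1, max_d) + 1):
                # dwelt in s from t-d+1 .. t; only the entry observation counts
                e = ent[t - d + 1, s] * O[s, obs[t - d + 1]]
                alpha[t, s] += e * D[s, d - 1]   # state ends exactly at t
                ev += e * surv[s, d - 1]         # state has lasted >= d steps
        ent[t + 1] = alpha[t] @ T                # equation A.6
        post[t] = alpha[t] / ev
    return post
```

In practice, as the text notes, the sum over durations can be truncated at the last nonempty observation, and the cached entry probabilities make this a dynamic program.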
Integrating over predecessor states, the quantity $P(s_{t-d+1} = s, \phi_{t-d} = 1 \mid o_1 \ldots o_{t-d})$, the probability that the process entered state $s$ at time $t - d + 1$, equals

$$\sum_{s' \in S} T_{s',s} \cdot P(s_{t-d} = s', \phi_{t-d} = 1 \mid o_1 \ldots o_{t-d}) = \sum_{s' \in S} T_{s',s} \cdot \alpha_{s',t-d}. \tag{A.6}$$
Thus, $\alpha$ can be computed recursively, and prior values of $\alpha$ back to the time of the last nonempty observation can be cached, allowing dynamic programming analogous to the Baum-Welch procedure for hidden Markov models (Baum, Petrie, Soules, & Weiss, 1970). Finally, the normalizing factors in the denominators of equations A.5 and A.1 can be computed by similar recursions, after integrating over the state occupied at $t - d$ (see equation A.5) or $t$ (see equation A.1) and the value of $\phi$ at those times. Though we do not make use of this quantity in the learning rules, the belief state over state occupancy, $B_{s,t} = P(s_t = s \mid o_1 \ldots o_t)$, can also be computed by a recursion on $\alpha$ exactly analogous to equation A.2, substituting $P(d_t \ge d \mid s_t = s)$ for $D_{s,d}$. The two expectations in the TD learning rule of equation 5.1 are

$$E[V_{s_{t+1}}] = \sum_{s' \in S} V_{s'} \, P(s_{t+1} = s' \mid s_t = s, \phi_t = 1, o_{t+1}) \tag{A.7}$$

$$= \sum_{s' \in S} V_{s'} \, \frac{T_{s,s'} \, O_{s',o_{t+1}}}{\sum_{s'' \in S} T_{s,s''} \, O_{s'',o_{t+1}}} \tag{A.8}$$
and

$$E[d_t] = \sum_{d=1}^{\infty} d \cdot P(d_t = d \mid s_t = s, \phi_t = 1, o_1 \ldots o_{t+1}) \tag{A.9}$$

$$= \sum_{d=1}^{\infty} d \cdot P(d_t = d \mid s_t = s, \phi_t = 1, o_1 \ldots o_t) \tag{A.10}$$

$$= \sum_{d=1}^{\infty} \frac{d \cdot P(s_t = s, d_t = d, \phi_t = 1 \mid o_1 \ldots o_t)}{\alpha_{s,t}}, \tag{A.11}$$
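As a concrete illustration, equation A.8 is just a posterior-weighted average of successor-state values: the transition and observation probabilities jointly score each candidate successor, and normalizing those scores gives the posterior used to weight $V$. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def expected_next_value(V, T, O, s, o_next):
    """Equation A.8: expected value of the successor state, given that
    state s has just ended (phi_t = 1) and the next observation is o_next.
    V[s'] : state values; T[s, s'] : row of outgoing transition
    probabilities from s; O[s', o] : observation probabilities."""
    w = T[s] * O[:, o_next]           # T_{s,s'} * O_{s',o_{t+1}} per successor
    return float(V @ (w / w.sum()))   # normalize over s'' and average V_{s'}
```

$E[d_t]$ of equation A.11 follows analogously by normalizing the per-duration terms of the sum in equation A.5 by $\alpha_{s,t}$.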
where the sum can again be truncated at the time of the last nonempty observation, and P(st = s, dt = d, φt = 1|o1 . . . ot ) is computed as on the right-hand side of equation A.2. The proof that the TD algorithm of equation 5.1 has the same fixed point as value iteration is sketched below. We assume the inference model correctly matches the process generating the samples. With each TD update, Vs is nudged toward some target value with some step size βs,t . It is easy
to show that, analogous to the standard stochastic update situation with constant step sizes, the fixed point is the average of the targets, weighted by their probabilities and their step sizes. Fixing some arbitrary $t$, the update targets and associated step sizes $\beta$ are functions of the observations $o_1, \ldots, o_{t+1}$, which are, by assumption, samples generated with probability $P(o_1, \ldots, o_{t+1})$ by a semi-Markov process whose parameters match those of the inference model. The fixed point is

$$V_s = \frac{\sum_{o_1 \ldots o_{t+1}} \left[ P(o_1 \ldots o_{t+1}) \cdot \beta_{s,t} \cdot \left( r(o_{t+1}) - \rho_t \cdot E[d_t] + E[V_{s_{t+1}}] \right) \right]}{\sum_{o_1 \ldots o_{t+1}} \left[ P(o_1 \ldots o_{t+1}) \cdot \beta_{s,t} \right]}, \tag{A.12}$$
where we have written the reward $r_{t+1}$ as a function of the observation, $r(o_{t+1})$, because rewards are just a special case of observations in the partially observable framework. The expansions of $\beta_{s,t}$, $E[d_t]$, and $E[V_{s_{t+1}}]$ are all conditioned on the observations $o_1, \ldots, o_{t+1}$, whose probability under the inference model is assumed to match the empirical probability appearing in the numerator and denominator of this expression. Thus, we can marginalize out the observations in the sums in the numerator and denominator, reducing the fixed-point equation to

$$V_s = \sum_{s' \in S} T_{s,s'} \left( \sum_{o \in O} \left[ O_{s',o} \cdot r(o) \right] + V_{s'} \right) - \rho_t \cdot \sum_d d \cdot D_{s,d}, \tag{A.13}$$
which (assuming $\rho_t = \rho$) is Bellman's equation for the value function and is also, by definition, the same fixed point as value iteration.

Acknowledgments

This work was supported by National Science Foundation grants IIS-9978403 and DGE-9987588. A.C. was funded in part by a Canadian NSERC PGS B fellowship. N.D. is funded by a Royal Society USA Research Fellowship and the Gatsby Foundation. We thank Sham Kakade, Yael Niv, and Peter Dayan for their helpful insights, and Chris Fiorillo, Wolfram Schultz, Hannah Bayer, and Paul Glimcher for helpfully sharing with us, often prior to publication, their thoughts and experimental observations.

References

Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164–171.
Bayer, H. M., & Glimcher, P. W. (2005). Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47, 129–141. Bouton, M. E., & Nelson, J. B. (1998). Mechanisms of feature-positive and feature-negative discrimination learning in an appetitive conditioning paradigm. In N. A. Schmajuk & P. C. Holland (Eds.), Occasion setting: Associative learning and cognition in animals (pp. 69–112). Washington, DC: American Psychological Association. Bradtke, S. J., & Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov decision problems. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 393–400). Cambridge, MA: MIT Press. Brown, J., Bullock, D., & Grossberg, S. (1999). How the basal ganglia use parallel excitatory and inhibitory learning pathways to selectively respond to unexpected rewarding cues. Journal of Neuroscience, 19(23), 10502–10511. Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92) (pp. 183–188). San Jose, CA: AAAI Press. Courville, A. C., Daw, N. D., Gordon, G. J., & Touretzky, D. S. (2003). Model uncertainty in classical conditioning. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press. Courville, A. C., Daw, N. D., & Touretzky, D. S. (2004). Similarity and discrimination in classical conditioning: A latent variable account. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press. Courville, A. C., & Touretzky, D. S. (2001). Modeling temporal structure in classical conditioning. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 3–10). Cambridge, MA: MIT Press.
Das, T., Gosavi, A., Mahadevan, S., & Marchalleck, N. (1999). Solving semi-Markov decision problems using average reward reinforcement learning. Management Science, 45, 560–574. Daw, N. D. (2003). Reinforcement learning models of the dopamine system and their behavioral implications. Unpublished doctoral dissertation, School of Computer Science, Carnegie Mellon University. Daw, N. D., Kakade, S., & Dayan, P. (2002). Opponent interactions between serotonin and dopamine. Neural Networks, 15, 603–616. Daw, N. D., Niv, Y., & Dayan, P. (in press). Actions, values, policies, and the basal ganglia. In E. Bezard (Ed.), Recent breakthroughs in basal ganglia research. New York: Nova Science. Daw, N. D., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience. Daw, N. D., & Touretzky, D. S. (2002). Long-term reward prediction in TD models of the dopamine system. Neural Computation, 14, 2567–2583. Daw, N., Touretzky, D., & Skaggs, W. (2004). Contrasting neuronal correlates between dorsal and ventral striatum in the rat. In Cosyne04 Computational and Systems Neuroscience Abstracts, Vol. 1.
Dayan, P. (2002). Motivated reinforcement learning. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 11–18). Cambridge, MA: MIT Press. Dayan, P., & Balleine, B. W. (2002). Reward, motivation and reinforcement learning. Neuron, 36, 285–298. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 39, 1–38. Deneve, S. (2004). Bayesian inference in spiking neurons. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press. Dickinson, A., & Balleine, B. (2002). The role of learning in motivation. In C. R. Gallistel (Ed.), Stevens' handbook of experimental psychology (3rd ed.), Vol. 3: Learning, motivation and emotion (pp. 497–533). New York: Wiley. Dickinson, A., Hall, G., & Mackintosh, N. J. (1976). Surprise and the attenuation of blocking. Journal of Experimental Psychology: Animal Behavior Processes, 2, 313–322. Dickinson, A., & Mackintosh, N. J. (1979). Reinforcer specificity in the enhancement of conditioning by posttrial surprise. Journal of Experimental Psychology: Animal Behavior Processes, 5, 162–177. Dickinson, A., Smith, J., & Mirenowicz, J. (2000). Dissociation of Pavlovian and instrumental incentive learning under dopamine antagonists. Behavioral Neuroscience, 114, 468–483. Doya, K. (1999). What are the computations in the cerebellum, the basal ganglia, and the cerebral cortex? Neural Networks, 12, 961–974. Doya, K. (2000). Complementary roles of basal ganglia and cerebellum in learning and motor control. Current Opinion in Neurobiology, 10, 732–739. Faure, A., Haberland, U., Condé, F., & Massioui, N. E. (2005). Lesion to the nigrostriatal dopamine system disrupts stimulus-response habit formation. Journal of Neuroscience, 25, 2771–2780. Fiorillo, C. D., & Schultz, W. (2001).
The reward responses of dopamine neurons persist when prediction of reward is probabilistic with respect to time or occurrence. In Society for Neuroscience Abstracts, 27, 827.5. Fiorillo, C. D., Tobler, P. N., & Schultz, W. (2003). Discrete coding of reward probability and uncertainty by dopamine neurons. Science, 299, 1898–1902. Gallistel, C. R., & Gibbon, J. (2000). Time, rate and conditioning. Psychological Review, 107(2), 289–344. Gallistel, C. R., King, A., & McDonald, R. (2004). Sources of variability and systematic error in mouse timing behavior. Journal of Experimental Psychology: Animal Behavior Processes, 30, 3–16. Gibbon, J. (1977). Scalar expectancy theory and Weber’s law in animal timing. Psychological Review, 84, 279–325. Gold, J. I., & Shadlen, M. N. (2002). Banburismus and the brain: Decoding the relationship between sensory stimuli, decisions, and reward. Neuron, 36, 299– 308. Guedon, Y., & Cocozza-Thivent, C. (1990). Explicit state occupancy modeling by hidden semi-Markov models: Application of Derin’s scheme. Computer Speech and Language, 4, 167–192.
Holland, P. C. (1988). Excitation and inhibition in unblocking. Journal of Experimental Psychology: Animal Behavior Processes, 14, 261–279. Holland, P. C., & Kenmuir, C. (2005). Variations in unconditioned stimulus processing in unblocking. Journal of Experimental Psychology: Animal Behavior Processes, 31, 155–171. Holland, P. C., Lamoureux, J. A., Han, J., & Gallagher, M. (1999). Hippocampal lesions interfere with Pavlovian negative occasion setting. Hippocampus, 9, 143–157. Hollerman, J. R., & Schultz, W. (1998). Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neuroscience, 1, 304–309. Houk, J. C., Adams, J. L., & Barto, A. G. (1995). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 249–270). Cambridge, MA: MIT Press. Joel, D., Niv, Y., & Ruppin, E. (2002). Actor-critic models of the basal ganglia: New anatomical and computational perspectives. Neural Networks, 15, 535–547. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99–134. Kakade, S., & Dayan, P. (2000). Acquisition in autoshaping. In S. A. Solla, T. K. Leen, & K. R. Muller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press. Kakade, S., & Dayan, P. (2002). Acquisition and extinction in autoshaping. Psychological Review, 109, 533–544. Killeen, P. R., & Fetterman, J. G. (1988). A behavioral theory of timing. Psychological Review, 95, 274–295. Kurth-Nelson, Z., & Redish, A. (2004). µagents: Action-selection in temporally dependent phenomena using temporal difference learning over a collective belief structure. Society for Neuroscience Abstracts, 30, 207.1. Levinson, S. E. (1986). Continuously variable duration hidden Markov models for automatic speech recognition. 
Computer Speech and Language, 1, 29–45. Lewicki, M. S. (2002). Efficient coding of natural sounds. Nature Neuroscience, 5, 356–363. Lewicki, M. S., & Olshausen, B. A. (1999). A probabilistic framework for the adaptation and comparison of image codes. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 16, 1587–1601. Ljungberg, T., Apicella, P., & Schultz, W. (1992). Responses of monkey dopamine neurons during learning of behavioral reactions. Journal of Neurophysiology, 67, 145–163. Machado, A. (1997). Learning the temporal dynamics of behavior. Psychological Review, 104, 241–265. Mahadevan, S., Marchalleck, N., Das, T., & Gosavi, A. (1997). Self-improving factory simulation using continuous-time average-reward reinforcement learning. In Proceedings of the 14th International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann. Matell, M. S., & Meck, W. H. (1999). Reinforcement-induced within-trial resetting of an internal clock. Behavioural Processes, 45, 159–171. McClure, S. M., Daw, N. D., & Montague, P. R. (2003). A computational substrate for incentive salience. Trends in Neurosciences, 26, 423–428.
Mellon, R. C., Leak, T. M., Fairhurst, S., & Gibbon, J. (1995). Timing processes in the reinforcement-omission effect. Animal Learning and Behavior, 23, 286– 296. Mirenowicz, J., & Schultz, W. (1996). Preferential activation of midbrain dopamine neurons by appetitive rather than aversive stimuli. Nature, 379, 449–451. Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16, 1936–1947. Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13, 103–130. Morris, G., Arkadir, D., Nevet, A., Vaadia, E., & Bergman, H. (2004). Coincident but distinct messages of midbrain dopamine and striatal tonically active neurons. Neuron, 43, 133–143. Niv, Y., Daw, N. D., & Dayan, P. (2005). How fast to work: Response vigor, motivation, and tonic dopamine. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press. Niv, Y., Duff, M. O., & Dayan, P. (2004). The effects of uncertainty on TD learning. In Cosyne04—Computational and Systems Neuroscience Abstracts, vol. 1. Niv, Y., Duff, M. O., & Dayan, P. (2005). Dopamine, uncertainty, and TD learning. Behavioral and Brain Functions, 1, 6. O’Doherty, J., Dayan, P., Schultz, J., Deichmann, R., Friston, K., & Dolan, R. J. (2004). Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science, 304, 452–454. Owen, A. M. (1997). Cognitive planning in humans: Neuropsychological, neuroanatomical and neuropharmacological perspectives. Progress in Neurobiology, 53, 431–450. Pan, W. X., Schmidt, R., Wickens, J., & Hyland, B. (2005). Dopamine cells respond to predicted events during classical conditioning: Evidence for eligibility traces in the reward-learning network. Journal of Neuroscience, 25, 6235–6242. Parkinson, J. A., Dalley, J. W., Cardinal, R. 
N., Bamford, A., Fehnert, B., Lachenal, G., Rudarakanchana, N., Halkerston, K., Robbins, T. W., & Everitt, B. J. (2002). Nucleus accumbens dopamine depletion impairs both acquisition and performance of appetitive Pavlovian approach behaviour: Implications for mesoaccumbens dopamine function. Behavioral Brain Research, 137, 149–163. Rao, R. P. N. (2004). Hierarchical Bayesian inference in networks of spiking neurons. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press. Rao, R. P. N., Olshausen, B. A., & Lewicki, M. S. (2002). Probabilistic models of the brain: Perception and neural function. Cambridge, MA: MIT Press. Satoh, T., Nakai, S., Sato, T., & Kimura, M. (2003). Correlated coding of motivation and outcome of decision by dopamine neurons. Journal of Neuroscience, 23, 9913– 9923. Schultz, W. (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80, 1–27. Schultz, W., Apicella, P., & Ljungberg, T. (1993). Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. Journal of Neuroscience, 13, 900–913.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599. Schultz, W., & Romo, R. (1990). Dopamine neurons of the monkey midbrain: Contingencies of responses to stimuli eliciting immediate behavioral reactions. Journal of Neurophysiology, 63, 607–624. Smith, A. J., Becker, S., & Kapur, S. (2005). A computational model of the functional role of the ventral-striatal D2 receptor in the expression of previously acquired behaviors. Neural Computation, 17, 361–395. Staddon, J. E. R., & Cerutti, D. T. (2003). Operant conditioning. Annual Review of Psychology, 54, 115–144. Staddon, J. E. R., & Higa, J. J. (1999). Time and memory: Towards a pacemaker-free theory of interval timing. Journal of the Experimental Analysis of Behavior, 71, 215–251. Staddon, J. E., & Innis, N. K. (1969). Reinforcement omission on fixed-interval schedules. Journal of the Experimental Analysis of Behavior, 12, 689–700. Suri, R. E. (2001). Anticipatory responses of dopamine neurons and cortical neurons reproduced by internal model. Experimental Brain Research, 140, 234–240. Suri, R. E., & Schultz, W. (1998). Learning of sequential movements with dopamine-like reinforcement signal in neural network model. Experimental Brain Research, 121, 350–354. Suri, R. E., & Schultz, W. (1999). A neural network with dopamine-like reinforcement signal that learns a spatial delayed response task. Neuroscience, 91, 871–890. Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. Unpublished doctoral dissertation, University of Massachusetts. Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9–44. Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning (pp. 216–224). San Mateo, CA: Morgan Kaufmann. Sutton, R. S., & Barto, A. G. (1990).
Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. Moore (Eds.), Learning and computational neuroscience: Foundations of adaptive networks (pp. 497–537). Cambridge, MA: MIT Press. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press. Szita, I., & Lorincz, A. (2004). Kalman filter control embedded into the reinforcement learning framework. Neural Computation, 16, 491–499. Tobler, P., Dickinson, A., & Schultz, W. (2003). Coding of predicted reward omission by dopamine neurons in a conditioned inhibition paradigm. Journal of Neuroscience, 23, 10402–10410. Tsitsiklis, J. N., & Van Roy, B. (2002). On average versus discounted reward temporal-difference learning. Machine Learning, 49, 179–191. Ungless, M. A., Magill, P. J., & Bolam, J. P. (2004). Uniform inhibition of dopamine neurons in the ventral tegmental area by aversive stimuli. Science, 303, 2040–2042. Voorn, P., Vanderschuren, L. J., Groenewegen, H. J., Robbins, T. W., & Pennartz, C. M. (2004). Putting a spin on the dorsal-ventral divide of the striatum. Trends in Neurosciences, 27, 468–474.
Waelti, P., Dickinson, A., & Schultz, W. (2001). Dopamine responses comply with basic assumptions of formal learning theory. Nature, 412, 43–48. Yin, H., Barnet, R. C., & Miller, R. R. (1994). Second-order conditioning and Pavlovian conditioned inhibition: Operational similarities and differences. Journal of Experimental Psychology: Animal Behavior Processes, 20, 419–428. Zemel, R., Huys, Q., Natarajan, R., & Dayan, P. (2004). Probabilistic computation in spiking neurons. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.
Received February 24, 2005; accepted September 29, 2005.
LETTER
Communicated by Richard Zemel
Experiments with AdaBoost.RT, an Improved Boosting Scheme for Regression

D. L. Shrestha
[email protected]

D. P. Solomatine
[email protected]
UNESCO-IHE Institute for Water Education, Westvest 7, Delft, The Netherlands
The application of boosting techniques to regression problems has received relatively little attention in contrast to research aimed at classification problems. This letter describes a new boosting algorithm, AdaBoost.RT, for regression problems. Its idea is to filter out the examples whose relative estimation error is higher than the preset threshold value, and then to follow the AdaBoost procedure. Thus, it requires selecting the suboptimal value of the error threshold to demarcate examples as poorly or well predicted. Some experimental results using the M5 model tree as a weak learning machine for several benchmark data sets are reported. The results are compared to other boosting methods, bagging, artificial neural networks, and a single M5 model tree. The preliminary empirical comparisons show higher performance of AdaBoost.RT for most of the considered data sets.

1 Introduction

Recently many researchers have investigated various techniques for combining the predictions from multiple predictors to produce a single predictor. The resulting predictor is generally more accurate than an individual one. The ensemble of predictors is often called a committee machine (or mixture of experts). In a committee machine, an ensemble of predictors (often referred to as weak learning machines or simply machines) is generated by means of a learning process; the overall predictions of the committee machine are the combination of the individual committee members' predictions (Tresp, 2001). Figure 1 presents the basic scheme of a committee machine. Each machine (1 through T) is trained using training examples sampled from the given training set. A filter is employed when different machines are to be fed with different subsets (denoted as type A) of the training set; in this case, the machines can be run in parallel. Flows of type B appear when machines pass unclassified data subsets to the subsequent machines, thus making a hierarchical committee machine.
Neural Computation 18, 1678–1710 (2006)  © 2006 Massachusetts Institute of Technology

[Figure 1: Block diagram of a committee machine. Flows of type A distribute data between the machines. Flows of type B appear when machines pass unclassified data subsets to the subsequent machines, thus making a hierarchy.]

The individual outputs yi for each example
from each machine are combined to produce the overall output y of the ensemble. Bagging (Breiman, 1996a, 1996b) and boosting (Schapire, 1990; Freund & Schapire, 1996, 1997) are the two popular committee machines that combine the outputs from different predictors to improve overall accuracy. Several studies of boosting and bagging in classification have demonstrated that these techniques are generally more accurate than the individual classifiers. Boosting can be used to reduce the error of any "weak" learning machine that consistently generates classifiers that need be only a little bit better than random guessing (Freund & Schapire, 1996). Boosting works by repeatedly running a given weak learning machine on different distributions of training examples and combining their outputs. At each iteration, the distribution of training examples depends on the performance of the machine in the previous iteration. The method used to calculate the distribution of the training examples differs between boosting methods; the outputs of the different machines are combined by voting over multiple classifiers in the case of classification problems, or by a weighted average or median in
case of regression ones. There are different versions of the boosting algorithm for classification and regression problems; they are covered in detail in the following section. This letter introduces a new boosting scheme, AdaBoost.RT, for regression problems, initially outlined by the authors in 2004 (Solomatine & Shrestha, 2004). It is based on the following idea. Typically in regression, it is not possible to predict the output exactly as in classification, as real-valued variables may exhibit chaotic behavior, local nonstationarity, and high levels of noise (Avnimelech & Intrator, 1999). Some discrepancy between the predicted and the observed value is typically inevitable. A possibility here is to use insensitive margins to differentiate correct predictions from incorrect ones, as is done, for example, in support vector machines (Vapnik, 1995). The discrepancy (prediction error) for each example allows us to distinguish whether an example is well (acceptably) or poorly predicted. Once we define the measure of such a discrepancy, it is straightforward to follow the AdaBoost procedure (Freund & Schapire, 1997), modifying the loss function so that it fits regression problems better. AdaBoost.RT addresses some of the issues associated with the existing boosting schemes covered in section 2 by introducing a number of novel features. First, AdaBoost.RT uses the so-called absolute relative error threshold φ to project training examples into two classes (poorly and well-predicted examples) by comparing the prediction error (absolute relative error) with the threshold φ. (The reasons to use absolute relative error instead of absolute error, as is done in many boosting algorithms, are discussed in section 3.) Second, the weight-updating parameter is computed differently than in the AdaBoost.R2 algorithm (Drucker, 1997), to give more emphasis to the harder examples when the error rate is very low.
Third, the algorithm does not have to stop when the error rate is greater than 0.5, as happens in some other algorithms. This makes it possible to run a user-defined number of iterations, and in many cases the performance is improved. Last, the outputs from the individual machines are combined by a weighted average, whereas most of the methods use the median, and this appears to give better performance. This letter covers a number of experiments with AdaBoost.RT using model trees (MTs) as the weak learning machine, comparing it to other boosting algorithms, bagging, and artificial neural networks (ANNs). Finally, conclusions are drawn.

2 The Boosting Algorithms

The original boosting approach, boosting by filtering, is described by Schapire (1990). It was motivated by the PAC (probably approximately correct) learning theory (Valiant, 1984; Kearns & Vazirani, 1994). Boosting by filtering requires a large number of training examples, which is not feasible in many real-life cases. This limitation can be overcome by using
Experiments with AdaBoost.RT
another boosting algorithm, AdaBoost (Freund & Schapire, 1996, 1997), which exists in several versions. In boosting by subsampling, a training set of fixed size is used, and the training examples are resampled according to a given probability distribution during training. In boosting by reweighting, all the training examples are used to train the weak learning machine, with weights assigned to each example. This technique is applicable only when the weak learning machine can handle weighted examples. Originally, boosting schemes were developed for binary classification problems. Freund and Schapire (1997) extended AdaBoost to the multiclass case, in versions they called AdaBoost.M1 and AdaBoost.M2. Recently several applications of boosting algorithms to classification problems have been reported (e.g., Quinlan, 1996; Drucker, 1999; Opitz & Maclin, 1999). Application of boosting to regression problems has received attention as well. Freund and Schapire (1997) extended AdaBoost.M2 to boosting regression problems and called it AdaBoost.R. It solves regression problems by reducing them to classification ones. Although experimental work shows that AdaBoost.R can be effective by projecting the regression data into classification data sets, it suffers from two drawbacks. First, it expands each example in the regression sample into many classification examples, and their number grows linearly in the number of boosting iterations. Second, the loss function changes from iteration to iteration and even differs between examples in the same iteration. In the framework of AdaBoost.R, Ridgeway, Madigan, and Richardson (1999) performed experiments by projecting regression problems into classification ones on a data set of infinite size. Breiman (1997) proposed the arc-gv (arcing game value) algorithm for regression problems. Drucker (1997) developed the AdaBoost.R2 algorithm, which is an ad hoc modification of AdaBoost.R.
He conducted some experiments with AdaBoost.R2 for regression problems and obtained good results. Avnimelech and Intrator (1999) extended the boosting algorithm to regression problems by introducing the notions of weak and strong learning and an appropriate equivalence between them. They introduced so-called big errors with respect to a threshold γ, which has to be chosen in advance. They constructed triplets of weak learning machines and combined them to reduce the error rate using the median of the outputs of the three machines. Using the framework of Avnimelech and Intrator (1999), Feely (2000) introduced the big error margin (BEM) boosting technique. Namee, Cunningham, Byrne, and Corrigan (2000) compared the performance of AdaBoost.R2 with the BEM technique. Recently many researchers (e.g., Friedman, Hastie, & Tibshirani, 2000; Friedman, 2001; Duffy & Helmbold, 2000; Zemel & Pitassi, 2001; Ridgeway, 1999) have viewed boosting as a “gradient machine” that optimizes a particular loss function. Friedman et al. (2000) explained the AdaBoost algorithm from a statistical perspective. They showed that the
AdaBoost algorithm is a Newton method for optimizing a particular exponential loss function. Although all these methods involve diverse objectives and optimization approaches, they are all similar except for the one considered by Zemel and Pitassi (2001). In this latter approach, a gradient-based boosting algorithm was derived, which forms new hypotheses by modifying only the distribution of the training examples. This is in contrast to the former approaches, where the new regression models (hypotheses) are formed by changing the distribution of the training examples and modifying the target values. The following subsections briefly describe a boosting algorithm for classification problems, AdaBoost.M1 (Freund & Schapire, 1997), because our new boosting algorithm is its direct extension for regression problems. The threshold- or margin-based boosting algorithms for regression problems, including AdaBoost.R2, which are similar to our algorithm, are described as well.

2.1 AdaBoost.M1. The AdaBoost.M1 algorithm (see appendix A) works as follows. The first weak learning machine is supplied with a training set of m examples with a uniform distribution of weights, so that each example has an equal chance to be selected for the first training set. The performance of the machine is evaluated by computing the classification error rate εt as the ratio of incorrect classifications. Knowing εt, the weight updating parameter βt is computed as follows:

βt = εt / (1 − εt).
(2.1)
As εt is constrained to [0, 0.5], βt is constrained to [0, 1]. βt is a measure of confidence in the predictor: if εt is low, then βt is also low, and a low βt means high confidence in the predictor. To compute the distribution for the next machine, the weight of each example is multiplied by βt if the previous machine classified this example correctly (this reduces the weight of the example); otherwise the weight remains unchanged. The weights are then normalized to make their set a distribution. The process is repeated until the preset number of machines has been constructed or εt exceeds 0.5. Finally, the machine weight, denoted by W, is computed using βt as

W = log(1/βt).
(2.2)
W is used to weight the outputs of the machines when the overall output is calculated. Notice that W becomes larger when εt is low, and, consequently, more weight is given to the machine when combining the outputs from individual machines.
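The reweighting step of AdaBoost.M1 described above can be sketched as follows (a minimal Python sketch; the function name and list-based representation are ours, not from the paper):

```python
import math

def adaboost_m1_update(weights, correct):
    """One AdaBoost.M1 reweighting step (a sketch).

    weights -- current distribution D_t over the m examples
    correct -- boolean list: did machine t classify example i correctly?
    Returns (new_weights, W), where W = log(1/beta_t) is the machine weight.
    """
    # Error rate: total weight of the misclassified examples.
    eps_t = sum(w for w, ok in zip(weights, correct) if not ok)
    assert 0.0 < eps_t < 0.5, "AdaBoost.M1 terminates when eps_t >= 0.5"
    beta_t = eps_t / (1.0 - eps_t)          # equation 2.1
    # Multiply weights of correctly classified examples by beta_t (< 1).
    new_w = [w * beta_t if ok else w for w, ok in zip(weights, correct)]
    z = sum(new_w)                          # normalization factor
    new_w = [w / z for w in new_w]
    W = math.log(1.0 / beta_t)              # equation 2.2
    return new_w, W
```

For instance, with four equally weighted examples and one misclassification, εt = 0.25 and βt = 1/3; after normalization the misclassified example carries half of the total weight, so the next machine concentrates on it.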
The essence of the boosting algorithm is that “easy” examples that are correctly classified by most of the previous weak machines are given a lower weight, and “hard” examples that often tend to be misclassified get a higher weight. Thus, the idea of boosting is that each subsequent machine focuses on such hard examples. Freund and Schapire (1997) derived an exponential upper bound for the error rate of the resulting ensemble, and it is smaller than that of the single (weak learning) machine. This does not guarantee that the performance on an independent test set will be improved; however, if the weak hypotheses are “simple” and the total number of iterations is “not too large,” then the difference between the training and test errors can also be theoretically bounded. As reported by Freund and Schapire (1997), AdaBoost.M1 has the disadvantage that it is unable to handle weak learning machines with an error rate greater than 0.5. In this case, the value of βt will exceed unity; consequently, the correctly classified examples will get a higher weight, and W becomes negative. As a result, the boosting iterations have to be terminated. Breiman (1996c) describes a method of resetting all the weights of the examples to be equal and restarting if either εt is not less than 0.5 or εt becomes 0. Following the revision described by Breiman (1996c), Opitz and Maclin (1999) used a very small positive value of W (e.g., 0.001) rather than a negative value or 0 when εt is larger than 0.5. They reported results slightly better than those achieved by the standard AdaBoost.M1.

2.2 AdaBoost.R2. The AdaBoost.R2 (Drucker, 1997) boosting algorithm is an ad hoc modification of AdaBoost.R (Freund & Schapire, 1997), which is an extension of AdaBoost.M2 for regression problems.
Drucker’s method followed the spirit of the AdaBoost.R algorithm: it repeatedly uses a regression tree as a weak learning machine, increasing the weights of the poorly predicted examples and decreasing the weights of the well-predicted ones. Similar to the error rate in classification, he introduced the average loss to measure the performance of the machine; it is given by
L̄t = Σ_{i=1}^{m} Lt(i) Dt(i),   (2.3)
where Lt(i) is one of three candidate loss functions (see appendix B), all of which are constrained to [0, 1]. The definition of βt remains unchanged. However, unlike the projection of regression data into classification data in AdaBoost.R, the reweighting procedure is formulated in such a way that the
[Figure 2: Weight updating parameter βt (left y-axis) or weight of the machine W (right y-axis) versus error rate ε (AdaBoost.RT) or average loss L̄t (AdaBoost.R2). BetaR2 and BetaRT represent the weight-updating parameters for AdaBoost.R2 and AdaBoost.RT, respectively; WR2 and WRT represent the weights of the machine for AdaBoost.R2 and AdaBoost.RT, respectively.]
poorly predicted examples get higher weights and well-predicted examples get lower weights:

Dt+1(i) = Dt(i) βt^(1 − Lt(i)) / Zt.   (2.4)
In this way, all the weights are updated according to exponential functions of βt, so that the weight of an example with lower loss (smaller error) is strongly reduced, thus reducing the probability that this example will be picked for the next machine. Finally, the outputs from each machine are combined using the weighted median. Figure 2 presents the relation between βt or W and L̄t. Similar to AdaBoost.M1, AdaBoost.R2 has the drawback that it is unable to handle weak learning machines with an error rate greater than 0.5. Furthermore, the algorithm is sensitive to noise and outliers, as the reweighting formulation is proportional to the prediction error. This algorithm also has an advantage: it does not have any parameter that has to be calibrated,
which is not the case with the other boosting algorithms described in this article.

2.3 Boosting Method of Avnimelech and Intrator (1999). Avnimelech and Intrator (1999) were the first to introduce a threshold-based boosting algorithm for regression problems, using an analogy between classification errors and large errors in regression. The idea of using a threshold for a big error is similar to the ε-insensitive loss function used in support vector machines (Vapnik, 1995), where the loss for examples having an error less than ε is zero. The examples whose prediction errors are greater than the big error margin γ are filtered out, and in subsequent iterations the weak learning machine concentrates on them. The fundamental concept of this algorithm is that the mean squared error, which is often significantly greater than the squared median of the error, can be reduced by reducing the number of large errors. Unlike other boosting algorithms, the method of Avnimelech and Intrator is an ensemble of only three weak learning machines. Depending on the distribution of training examples, their method has three versions (see appendix C). Initially the training examples have to be split into three sets, Set 1, Set 2, and Set 3, in such a way that Set 1 is smaller than the other two sets. The first machine is trained on Set 1 (constituting a portion of the original training set, e.g., 15%). The training set for the second machine is composed of all examples in Set 2 on which the first machine has a big error and the same number of examples sampled from Set 2 on which the first machine has a small error. Note that a big error is defined with respect to the threshold γ, which has to be chosen in advance. In Boost1, the training data set for the last machine consists of only examples on which exactly one of the previous machines had a big error.
But in Boost2, the training data set for the last machine is composed of the data set constructed for Boost1 plus examples on which the previous machines had big errors of different signs. The authors further modified this version and called it Modified Boost2, where the training data set for the last machine is composed of the data set constructed for Boost2, plus examples on which both previous machines had big errors but there is a big difference between the magnitudes of the errors. The final output is the median of the outputs of the three machines. They proved that the error rate could be reduced from ε to 3ε² − 2ε³ or even less. One of the problems of this method is the selection of the optimal value of the threshold γ for big errors. Theoretically, the optimal value may be the lowest value for which there exists a γ-weak learner. A γ-weak learner is a learner for which there exists some ε < 0.5 such that, for any given distribution D and δ > 0, it is capable of finding, with probability 1 − δ, a function f_D such that Σ_{i: |f_D(xi) − yi| > γ} D(i) < ε. There are, however, practical considerations, such as the limitation of data sets and the limited number of machines in the ensemble, which make it difficult to choose the desired value of γ in
advance. Furthermore, an issue concerning splitting the training data into three subsets may arise. Since not all of the training examples are used to construct each weak learning machine, waste of data is a crucial issue for small data sets. Furthermore, similar to AdaBoost.R2, the algorithm does not work when the error rate of the first weak learning machine on Set 2 is greater than 0.5. It should also be noted that the use of an absolute error to identify big errors is not always the right measure in practical applications where the variability of the target values is very high (an explanation follows in section 3).

2.4 BEM Boosting Method. The big error margin (BEM) boosting method (Feely, 2000) is quite similar to the AdaBoost.R2 method. It is based on the approach of Avnimelech and Intrator (1999). Similar to their method, the prediction error is compared with a preset threshold value called the BEM, and the corresponding example is classified as either well or poorly predicted. The BEM method (see appendix D) counts the number of correct and incorrect predictions by comparing the prediction error with the preset BEM, which has to be chosen in advance. The prediction is considered to be incorrect if the absolute value of the prediction error is greater than the BEM; otherwise the prediction is considered to be correct. Counting the number of correct or incorrect predictions allows for computing the distribution of the training examples using the so-called UpFactor and DownFactor:

UpFactor_t = m / errCount_t,   DownFactor_t = 1 / UpFactor_t.   (2.5)
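The copy-count bookkeeping of equation 2.5 can be sketched as follows (a hypothetical helper; variable names are ours):

```python
def bem_copy_counts(copies, errors, bem):
    """Sketch of the BEM reweighting (equation 2.5).

    copies -- current number of copies of each example in the training set
    errors -- absolute prediction errors for the examples
    bem    -- the preset big error margin
    Returns the number of copies of each example for the next machine.
    """
    m = sum(copies)                      # current training-set size
    err_count = sum(c for c, e in zip(copies, errors) if abs(e) > bem)
    up = m / err_count                   # UpFactor_t
    down = 1.0 / up                      # DownFactor_t
    # Incorrectly predicted examples get more copies, correct ones fewer.
    return [c * up if abs(e) > bem else c * down
            for c, e in zip(copies, errors)]
```

With four single-copy examples of which one exceeds the margin, the poorly predicted example is quadrupled while each well-predicted one shrinks to a quarter copy, illustrating how the training-set composition (and size) drifts between iterations.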
Using these values, the number of copies of each training example to be included for the subsequent machine in the ensemble is calculated. If an example is correctly predicted by the preceding machine in the ensemble, then the number of copies of this example for the next machine is given by the number of its copies in the current training set multiplied by the DownFactor. Similarly, for an incorrectly predicted example, the number of its copies to be presented to the subsequent machine is the current number multiplied by the UpFactor. Finally, the outputs from the individual machines are combined to give the final output. Similar to the method of Avnimelech and Intrator (1999), this method requires tuning of the value of the BEM in order to achieve optimum results. The way the training examples’ distribution is represented differs from that in other boosting algorithms: in the BEM boosting method, the distribution represents how many times an example will be included in the training set rather than the probability that it will appear. Furthermore, it has another drawback: the size of the training data set changes for each machine and may increase to an impractical value after a few iterations. In
addition, this method has the same problem encountered in the boosting method of Avnimelech and Intrator (1999) due to the use of the absolute error measure when the variability of the target values is very high. In the new algorithm presented in the next section, an attempt has been made to resolve some of the problems noted here.
3 Methodology

This section describes a new boosting algorithm for regression problems, AdaBoost.RT (R stands for regression and T for threshold).

AdaBoost.RT Algorithm
1. Input:
   • Sequence of m examples (x1, y1), . . . , (xm, ym), where the output y ∈ R
   • Weak learning algorithm Weak Learner
   • Integer T specifying the number of iterations (machines)
   • Threshold φ (0 < φ < 1) for demarcating correct and incorrect predictions
2. Initialize:
   • Machine number or iteration t = 1
   • Distribution Dt(i) = 1/m for all i
   • Error rate εt = 0
3. Iterate while t ≤ T:
   • Call Weak Learner, providing it with distribution Dt
   • Build the regression model: ft(x) → y
   • Calculate the absolute relative error for each training example:
     AREt(i) = |(ft(xi) − yi) / yi|
   • Calculate the error rate of ft(x): εt = Σ_{i: AREt(i) > φ} Dt(i)
   • Set βt = εt^n, where n is the power coefficient (e.g., linear, square, or cubic)
   • Update the distribution Dt:
     Dt+1(i) = (Dt(i) / Zt) × (βt if AREt(i) ≤ φ; 1 otherwise),
     where Zt is a normalization factor chosen such that Dt+1 will be a distribution
   • Set t = t + 1
4. Output the final hypothesis:
   f_fin(x) = Σt [log(1/βt) ft(x)] / Σt log(1/βt)

Similar to the boosting methods described in section 2, the performance of the machine is evaluated by computing the error rate εt. Furthermore, similar to the threshold-based boosting methods of Avnimelech and Intrator (1999) and BEM, the regression problem in AdaBoost.RT is projected into a binary classification problem by demarcating well-predicted and poorly predicted examples. However, unlike the absolute big error margin in the methods of Avnimelech and Intrator (1999) and BEM, we use the absolute relative error (hereafter referred to simply as relative error) to demarcate examples as either well or poorly predicted. If the relative error for any particular example is greater than the preset relative error, the so-called threshold φ, the predicted value for this example is considered to be poorly predicted; otherwise, it is well predicted. The weight updating parameter βt is computed differently than it is in AdaBoost.M1 or AdaBoost.R2. We reformulated the expression for βt to overcome the situation in which the machine weight W becomes negative when εt is greater than 0.5. βt is linear in εt when the value of the power coefficient n is one. Other possible values of n are two (square law) and three (cubic law). As the value of n increases, relatively more weight is given to the harder examples when εt is very low (close to 0.1 or even less), as βt deviates further from 1, and the machine concentrates on these hard examples. However, a very high value of n may cause the algorithm to become unstable, because the machine will be trained on only a limited number of harder examples and may fail to generalize. So in our experiments, we used the square law for βt. It is worth mentioning that the normalized W used for combining the outputs from each machine is independent of n.
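The algorithm above can be condensed into a short sketch (assumptions: a generic `fit(X, y, D)` callback stands in for the weak learner, all target values are nonzero, and the early exit on a zero error rate is our own pragmatic guard, not part of the original pseudocode):

```python
import math

def adaboost_rt(X, y, fit, T=10, phi=0.1, n=2):
    """Sketch of AdaBoost.RT with a generic weak learner.

    fit(X, y, D) must return a prediction function trained on (X, y)
    resampled (or weighted) according to the distribution D.
    """
    m = len(X)
    D = [1.0 / m] * m
    machines, weights = [], []
    for _ in range(T):
        f = fit(X, y, D)
        # Absolute relative error for each training example (y_i != 0).
        are = [abs((f(xi) - yi) / yi) for xi, yi in zip(X, y)]
        # Error rate: total weight of the poorly predicted examples.
        eps = sum(d for d, e in zip(D, are) if e > phi)
        if eps == 0.0:
            # Every example is within the threshold (guard of ours).
            machines.append(f)
            weights.append(1.0)
            break
        beta = eps ** n            # square law (n = 2) in our experiments
        # Well-predicted examples are down-weighted by beta; others unchanged.
        D = [d * beta if e <= phi else d for d, e in zip(D, are)]
        Z = sum(D)                 # normalization factor
        D = [d / Z for d in D]
        machines.append(f)
        weights.append(math.log(1.0 / beta))
    wsum = sum(weights)
    # Final hypothesis: weighted average of the individual machines' outputs.
    return lambda x: sum(w * f(x) for w, f in zip(weights, machines)) / wsum
```

Note that, unlike AdaBoost.M1, nothing here forces termination when eps exceeds 0.5; the machine weights log(1/βt) simply become small, which is exactly the behavior discussed below.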
Unlike AdaBoost.M1 and AdaBoost.R2, our AdaBoost.RT does not have a “natural” stopping rule (when εt exceeds 0.5) but can generate any number of machines. This number, in principle, should be determined on the basis of analysis of the cross-validation error (training stops when the cross-validation error starts to increase). We found that even if εt for some of the machines in the ensemble is greater than 0.5, the final output of the combined machine is better than that of a single machine. Finally, the outputs from the different machines are combined using a weighted average. Similar to the threshold-based boosting methods of Avnimelech and Intrator (1999) and BEM, AdaBoost.RT has the drawback that the threshold φ has to be chosen in advance, which introduces additional complexity that has to be handled. The experiments with AdaBoost.RT have shown that the performance of the committee machine is sensitive to φ. Figure 3 shows the relation between φ and εt. If φ is too low (e.g., < 1%), then it is generally very difficult to get a sufficient number of correctly predicted
[Figure 3: Error rate versus relative error threshold for selected data sets (SieveQ3, Boston Housing, Friedman#1). The error rate for all data sets drops exponentially with the relative error threshold.]
examples. If φ is very high, it is possible that there are only a few “hard” examples, which are often outliers, and these examples will get boosted. This will seriously affect the performance of the committee machine and may make it unstable. In order to estimate a preliminary value of φ, it is necessary to collect statistics on the prediction errors (relative errors) of a single machine before starting the boosting. Theoretically, the lower and upper boundaries of φ should reside between the minimum and maximum values of the relative error. A simple line search within these boundaries may be required to find the suboptimal value of φ that minimizes the root-mean-squared error (RMSE), and this can be done using the cross-validation data set. Note that φ is relative and is constrained to [0, 1].

3.1 Loss Function and Evaluation of the Machine. An obvious question may arise while dealing with regression problems: What is the loss function? For AdaBoost.R2, it is explicitly mentioned in the algorithm, and it has three different functional forms. For AdaBoost.RT, however, we tried to replace the misclassification error rate used by Freund and Schapire (1997) with a loss function that better fits regression problems. Note that because of the use of the threshold in AdaBoost.RT, the loss function is represented by a logical expression (computer code) rather than by a simple formula as in AdaBoost.R2.
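The simple line search for φ described above can be sketched as follows (a sketch; `train_and_eval` is a hypothetical callback that trains an AdaBoost.RT ensemble for a given φ and returns its RMSE on the cross-validation set):

```python
def search_phi(train_and_eval, lo=0.01, hi=0.5, step=0.01):
    """Simple line search for the suboptimal threshold phi.

    Scans phi over [lo, hi] in increments of step and keeps the value
    that minimizes the cross-validation RMSE reported by the callback.
    """
    best_phi, best_rmse = None, float("inf")
    phi = lo
    while phi <= hi + 1e-12:      # tolerance for float accumulation
        rmse = train_and_eval(phi)
        if rmse < best_rmse:
            best_phi, best_rmse = phi, rmse
        phi += step
    return best_phi, best_rmse
```

Any smarter one-dimensional minimizer could replace the grid scan, but a coarse scan matches the calibration procedure actually used in the experiments reported below.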
Many boosting schemes use absolute error as the loss function, and this choice can be justified in many applications. In AdaBoost.RT, however, the error is measured by the relative error. This choice is based on our belief that any machine learning algorithm should reflect the essence of the problem or process being learned. One of the motivations to build an improved boosting scheme was to increase the accuracy of predictors used in engineering and hydrology. In these domains, experts often judge a predictor’s accuracy by relative rather than absolute error. For example, consider a hydrological model that predicts the water level in a river. A hydrologist would ignore an absolute error of 10 cm when the target water level is 10 m (a relative error of 1%). However, the same 10 cm error for a target level of 50 cm is not small (20%). It can be said that, in a sense, relative error adapts to the magnitude of the target value. In our view, using relative error is justified in many real-life applications. However, if the target values are zero or close to zero, the following phenomenon may appear. For small target values, relative errors could be high even if a machine makes a good prediction (in the sense of absolute error); in this case, such examples will be assigned (perhaps unfairly) a higher probability of being selected for the next machine. This may lead to specialization on the examples with small target values only, because model accuracy is measured by relative rather than absolute error. We have not investigated this issue in our experiments. A possible remedy for this potential problem is special handling of examples with very low target values (e.g., by appropriately modifying the relative error) or data transformation (adding a constant). It should be noted that the relative error is employed to evaluate the performance of the weak learning machines.
However, to make comparisons with other machines and with previous research on benchmark data sets, the classic RMSE is used to evaluate the performance of the ensemble machine (see, e.g., Cherkassky & Ma, 2004). Minimizing absolute relative errors greater than φ will reduce the number of large errors and consequently lead to a lower value of RMSE.
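The hydrological example above can be made concrete (a trivial sketch; the function name is ours):

```python
def relative_error(pred, obs):
    """Absolute relative error, as used in AdaBoost.RT (assumes obs != 0)."""
    return abs((pred - obs) / obs)

# The same 10 cm absolute error is negligible at a 10 m water level
# but large at a 50 cm level.
high_level = relative_error(10.10, 10.0)   # about 0.01, a 1% relative error
low_level = relative_error(0.60, 0.50)     # about 0.20, a 20% relative error
```

With a threshold of, say, φ = 0.1, the first prediction would be classed as well predicted and the second as poorly predicted, even though both have the same absolute error.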
4 Bagging

Bagging (Breiman, 1996a) is another popular ensemble technique used to improve the accuracy of a single learning machine. In bagging, multiple versions of a predictor are generated and used to obtain an aggregated predictor (Breiman, 1996b). Bagging is “bootstrap” (Efron & Tibshirani, 1993) aggregating, which works by training each classifier on a bootstrap sample, that is, a sample of size m chosen uniformly at random with replacement from the original training set of size m. Boosting and bagging are to some extent similar: in both, each machine is trained on a different set of training examples, and the outputs from the machines are combined to get a
single output. There is, however, a considerable difference in how the training examples are sampled and how the outputs from the machines are combined. In bagging, for each machine in a trial t = 1, 2, . . . , T, a training set of m examples is randomly sampled with replacement from the whole training set of m examples. Thus, it is possible that many of the original examples are repeated in the resulting training set, while others are left out. In boosting, however, the training examples are sampled according to the performance of the machine in the previous iterations. Based on these training examples, the classifier ht or predictor ft(x) is constructed. In bagging, the predictions from the machines in an ensemble are combined by simply taking their average. In boosting, predictions are given different weights and then averaged. Breiman (1996a, 1996c) showed that bagging is effective for unstable learning algorithms, where a small perturbation in the training set can cause significant changes in the predictions. He also claimed that neural networks and regression trees are examples of unstable learning algorithms.

5 Experimental Setup

5.1 Data Sets. In order to evaluate the performance of boosting, several data sets were selected. The first group was composed of some data sets from the UCI repository (Blake & Merz, 1998) and the Friedman#1 set (Friedman, 1991). These data sets can also be used as benchmarks for comparing results with previous research. The second group of data sets is based on hydrological data used in previous research (Solomatine & Dulal, 2003), where artificial neural networks (ANNs) and model trees (MTs) were used to predict river flows in the Sieve catchment in Italy. In addition, for comparison with the boosting method of Avnimelech and Intrator, we also used Laser data from the Santa Fe time series competition (data set A) (Weigend & Gershenfeld, 1993). A brief description of the data sets follows.
Hydrological data. Two hydrological data sets, SieveQ3 and SieveQ6, are constructed from hourly rainfall (REt) and river flow (Qt) data in the Sieve river basin in Italy. From the history of previous rainfall and runoff, a lagged vector of rainfall and runoff time series is constructed. Prediction of river flows Qt+i several hours ahead (i = 3 or 6) is based on using the previous values of flow (Qt−τ) and previous values of rainfall (REt−τ), τ being between 0 and 3 hours. The regression models were based on 1854 examples. The SieveQ3 data set is used for the 3-hour-ahead prediction of river flow and consists of six continuous input attributes; the regression model is formulated as Qt+3 = f(REt, REt−1, REt−2, REt−3, Qt, Qt−1). The SieveQ6 data set is used for the 6-hour-ahead prediction of river flow and consists of two continuous input attributes; the regression model is formulated as Qt+6 = f(REt, Qt).
Benchmark data. The four data sets in this group are Boston Housing, Auto-Mpg, and CPU from the UCI repository, and Friedman#1. Boston Housing has 506 cases with 12 continuous and 1 binary-valued input attributes. The output attribute is the median price of a house in the Boston area in the United States. Auto-Mpg data concern the city-cycle fuel consumption of automobiles in miles per gallon, to be predicted in terms of three multivalued discrete and four continuous attributes. CPU data characterize the relative performance of computers on the basis of a number of relevant attributes; the data consist of six continuous and one discrete input attributes. Friedman#1 is the synthetic data set used in Friedman (1991) on multivariate adaptive regression splines (MARS). It has 10 independent random variables x1, . . . , x10, each uniformly distributed over [0, 1]. The response (output attribute) is given by

y = 10 sin(π x1 x2) + 20(x3 − 0.5)² + 10x4 + 5x5 + ε,   (5.1)

where ε is N(0, 1) noise; the variables x6, . . . , x10 have no predictive ability.

Laser data. This time series is the intensity of an NH3-FIR laser that exhibits Lorenz-like chaos. It has sampling noise due to the A/D conversion to 256 discrete values. We constructed input vectors using 16 previous values to predict the next value, as Avnimelech and Intrator (1999) did. For SieveQ3, SieveQ6, and Boston Housing, the data set was randomly divided into training, cross-validation (sometimes referred to simply as validation), and test subsets. The division of data for Boston Housing is kept as in the previous research by Drucker (1997). The division of data for SieveQ3 and SieveQ6 is based on the previous research by Solomatine and Dulal (2003). The remaining three data sets from the UCI repository were randomly divided into a training set (66% of the data) and a test set (34%) only. The Laser data set was randomly split into a training set (75% of the data) and a test set (25%), which is similar to the split used by Avnimelech and Intrator. Table 1 summarizes the characteristics of the data sets.

5.2 Methods. The data sets for training, testing, and cross-validation (if any) were randomly constructed without replacement from the original data. This process was repeated 10 times with different random number generator seeds to obtain 10 statistically different data sets. It is worth pointing out that the cross-validation data sets for the first three data sets in Table 1 (hereafter referred to as Group 1) are used only to tune the suboptimal φ value for AdaBoost.RT, the big error γ for the method of Avnimelech and Intrator, and the error margin for the BEM boosting method, while the other techniques (committee and noncommittee) are trained on a combination of the training and cross-validation data sets.
Table 1: Data Sets Summary.

Data Set            Training    CV    Test    Continuous    Discrete
Group 1
  SieveQ3               1554    300     300             7           0
  SieveQ6               1554    300     300             3           0
  Boston Housing         401     80      25            13           1
Group 2
  Auto-Mpg               262      –     136             5           3
  CPU                    137      –      72             7           1
  Friedman#1             990      –     510             6           0
  Laser                 7552      –    2518            17           0

Note: CV = cross-validation data set. Continuous and Discrete refer to the input attributes.
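The synthetic Friedman#1 data set of equation 5.1 can be generated as follows (a sketch; the function name and seeding are ours):

```python
import math
import random

def friedman1(m, seed=0):
    """Generate m examples of the Friedman#1 data set (equation 5.1).

    All 10 inputs are uniform on [0, 1]; x6..x10 enter the input vector
    but not the response, so they carry no predictive ability.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(m):
        x = [rng.random() for _ in range(10)]
        y = (10 * math.sin(math.pi * x[0] * x[1])
             + 20 * (x[2] - 0.5) ** 2
             + 10 * x[3] + 5 * x[4]
             + rng.gauss(0, 1))          # N(0, 1) noise term
        data.append((x, y))
    return data
```

Fixing the seed makes the 10 repeated experimental splits reproducible, which is convenient when comparing committee methods on the same random subsets.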
All boosting algorithms for regression problems mentioned in section 2, including our boosting method, were implemented in Java by the authors in the form of classes added to the Weka software (Witten & Frank, 2000). Our implementation of the boosting algorithms makes it possible to boost a number of weak learners that are already implemented in the original version of Weka, including the M5 model tree and ANNs. In all the experiments with the boosting algorithms, we used the MT (Quinlan, 1992; see also Witten & Frank, 2000) as a weak learning machine. This learning technique is simple and fast to train. Solomatine and Dulal (2003) have shown that the MT can be used as an alternative to ANNs in rainfall-runoff modeling. The MT is similar to a regression tree, but it has first-order (linear) regression models in its leaves, as opposed to the zeroth-order models in regression trees. In addition, to show the applicability of the boosting algorithms to different learning machines, experiments were conducted for the Laser data set using ANNs as the weak learning machine. Besides these, we performed experiments with bagging using the MT as a weak learning machine and carried out experiments with ANNs to compare the results. Each experiment with boosting, bagging, or ANNs was repeated 10 times on subsets of all seven data sets. Moreover, we constructed 10 machines for each subset to form a committee machine for bagging and all boosting methods (except that of Avnimelech and Intrator). All possible efforts were undertaken to ensure a fair comparison of the various methods.

6 Experimental Results and Discussion

This section reports the results of applying our boosting algorithm to the data sets. We investigated the use of cross-validation data sets to overcome
1694
D. Shrestha and D. Solomatine
the overfitting problem while selecting a suboptimal value for the threshold. The performance of AdaBoost.RT was compared for different numbers of machines; it achieved an overall better result even when more weak learners were added after the error rate became greater than 0.5. Finally, we investigated the problem of bias toward the values more commonly occurring in the training examples. The performance of AdaBoost.RT was compared with that of the other boosting methods, bagging, and ANNs.

6.1 Weak Learner: M5 Model Tree (MT). The first experiment with all data sets was aimed at building a single machine: an MT. The purpose of these experiments was twofold: to establish a benchmark against which to compare the boosting and bagging techniques and to optimize the parameters of the single machine, if any, such as a pruning factor (number of linear models) in the case of the MT (pruning of MT is often necessary for better generalization). A number of MTs of varying complexity (number of linear models) were generated; their performance for one of the Boston Housing data sets is presented in Figure 4.

6.2 Boosting of M5 Model Tree

6.2.1 AdaBoost.RT. The next step entails experiments with AdaBoost.RT. As described in section 3, an appropriate value for φ has to be selected. This was achieved using a calibration process in which a range of possible values was tested to find the suboptimal value of φ. In the developed computer program, the user can select the lower and upper boundaries and the increment of φ and run the algorithm over this range of values. If there is a cross-validation data set, then φ should correspond to the minimal RMSE on it. Otherwise, φ should be taken where the training error is minimal; in this case, however, there is a possibility of overfitting. This calibration process is, in fact, a simple line search that minimizes the cost function (in this case, RMSE).
Calibrating φ adds to the total training time, and indeed there is a trade-off between computation time (hence cost) and prediction accuracy. The experiment was performed using the cross-validation data set for the Group 1 data sets. Table 2 shows that using this set, the problem of overfitting the data is overcome; however, the performance improvement is marginal. The experiments were also carried out for the Group 2 data sets, which do not have corresponding cross-validation data sets. Although the performance of AdaBoost.RT was not satisfactory on the Auto-Mpg data set, it outperformed the single machine for the CPU, Friedman#1, and Laser data sets (23.5%, 21.5%, and 31.7% RMSE reduction, respectively).
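The calibration of φ described above is a simple grid (line) search. The following is a minimal sketch, not the authors' Java/Weka program; `train_and_rmse(phi)` is a hypothetical caller-supplied function that trains the ensemble for a given threshold and returns the RMSE on the cross-validation (or, failing that, the training) set:

```python
def calibrate_phi(train_and_rmse, phi_lo=0.01, phi_hi=0.4, step=0.01):
    """Line search: return the phi (and its RMSE) with minimal validation RMSE."""
    best_phi, best_rmse = None, float("inf")
    phi = phi_lo
    while phi <= phi_hi + 1e-12:          # small tolerance for float accumulation
        rmse = train_and_rmse(phi)        # train ensemble with threshold phi
        if rmse < best_rmse:
            best_phi, best_rmse = phi, rmse
        phi += step
    return best_phi, best_rmse
```

For example, `calibrate_phi(lambda p: (p - 0.08) ** 2)` returns a φ close to 0.08, mirroring the minimum near 8% reported later for the Friedman#1 data set.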
Experiments with AdaBoost.RT
1695
Figure 4: Performance of M5 model trees against the number of linear models for one of the Boston Housing data sets. (Plot of training and test RMSE versus the number of linear models.)

Table 2: Comparison of Performance (RMSE) of AdaBoost.RT on Test Data for Group 1 Data Sets with and Without Cross Validation.

                          AdaBoost.RT
Data Set          MT    With Cross Validation   Without Cross Validation
SieveQ3         14.66          13.81                    15.36
SieveQ6         27.01          26.37                    26.91
Boston Housing   3.50           3.23                     3.38

Notes: The results are averaged for 10 different runs (each run for each subset of data sets). Bold type signifies the minimum value of RMSE for each data set.
It can be observed from Table 3 that AdaBoost.RT reduces the RMSE for all of the data sets considered. In 70 experiments (10 experiments for each data set), AdaBoost.RT wins over the single machine in 54 experiments (77% of the time on average; see Table 5). The two-tail sign test also indicates that AdaBoost.RT is significantly better than the single machine at the 99% confidence level. Note that, as pointed out by Quinlan (1996) and Opitz and Maclin (1999), there are cases where the performance of a committee machine is even worse than that of the single machine. This is the case in
Table 3: Comparison of Performance (RMSE) of Different Machines on Test Data.

Data Set          MT      RT      R2      AI      BEM    Bagging   ANN
SieveQ3          14.67   13.81   15.01   14.14   22.3    13.93    17.09
SieveQ6          26.55   26.37   28.36   26.82   35.15   26.47    32.87
Boston Housing    3.62    3.23    3.23    4.11    3.44    3.24     3.54
Auto-Mpg          2.97    2.92    2.91    3.09    3.00    2.86     3.79
CPU              34.65   26.52   24.45   46.7    23.2    32.64    13.91
Friedman#1        2.19    1.72    1.82    2.24    1.69    2.06     1.51
Laser             4.04    2.76    2.69    4.5     2.87    3.35     3.95

Notes: The results are averaged for 10 different runs. RT, R2, AI, and BEM represent the boosting algorithms AdaBoost.RT, AdaBoost.R2, the method of Avnimelech and Intrator (1999), and the BEM boosting method, respectively. Bold type signifies the minimum value of RMSE for each data set.
16 experiments out of 70. Freund and Schapire (1996) also observed that boosting is sensitive to noise, and this may be partly responsible for the increase in error in these cases. We investigated the performance of AdaBoost.RT with up to 50 machines (see Figure 5) for one of the Friedman#1 data sets. It is observed that the RMSE was reduced significantly when only five or six machines were trained; after about 10 machines, the error did not decrease much, so we used 10 machines in all experiments. An additional reason was that Breiman (1996b) also used 10 machines in bagging. The performance of AdaBoost.RT for different values of φ on one of the Friedman#1 test data sets is presented in Figure 6. It can be observed that performance is sensitive to the value of φ. The minimum error corresponds to a φ of around 8%; the error then increases continuously as φ increases, and at a value of around 40%, AdaBoost.RT becomes unstable due to overfitting and the boosting of noise. Figure 7 shows the RMSE and the error rate of AdaBoost.RT against the number of machines for one of the Friedman#1 data sets. It demonstrates that an AdaBoost.R2-like algorithm would have stopped at six machines, when the error rate becomes greater than 0.5, but AdaBoost.RT continued and achieved higher performance. There is, of course, a danger of boosting noise, but this is present in any boosting algorithm. Again, ideally, stopping should be associated with the cross-validation error and not with the error rate on the training set. If this is not done, then the number of machines should be determined on the basis of experiments, as shown in Figure 5. Our final results indeed show that the generalization properties of AdaBoost.RT are good when 10 machines are used beyond the level of 0.5 error rate.
Figure 5: Performance of AdaBoost.RT against different numbers of machines for one of the Friedman#1 data sets. (Plot of training and test RMSE versus the number of machines.)
Figure 6: Performance of AdaBoost.RT against different values of the relative error threshold for one of the Friedman#1 data sets. (Plot of training and test RMSE versus the relative error threshold.)
Figure 7: Performance and error rate of AdaBoost.RT against different numbers of machines for one of the Friedman#1 data sets. Although the error rate is higher than 0.5 for machine 7, the RMSE is still decreasing after machine 7. (Plot of training and test RMSE, left axis, and error rate, right axis, versus the number of machines.)
Figure 8 depicts the histogram of the training and test data sets for one of the Friedman#1 data sets. It is observed that large numbers of training and test examples lie in the middle range (13–19). Many experiments have shown that most learning machines are biased toward the values more commonly occurring in the training examples, so there is a high probability that the extreme values will be poorly predicted. Because the main idea behind a boosting algorithm is to improve performance on the hard examples (often extreme events), it is possible that the performance on the tails (extreme events) is improved at the expense of the commonly occurring cases. This effect was also reported by Namee et al. (2000). We investigated whether AdaBoost.RT also suffers from this problem. One of the Friedman#1 data sets was selected for this purpose, and the resulting average error (absolute error in this case) for different ranges of test data is presented in Figure 9. It can be seen that AdaBoost.RT outperformed MT on all ranges of the data set (low, medium, and high). Figure 10 shows that AdaBoost.RT works very well for the extreme cases and was quite successful in reducing errors on these parts of the data range without causing performance in the middle range to deteriorate.
Figure 8: Histogram of the output variable y in one of the Friedman#1 data sets. (Frequency of training and test examples versus y.)
Figure 9: Comparison of the average absolute error using MT and AdaBoost.RT in different ranges for one of the Friedman#1 data sets. AdaBoost.RT outperforms MT on all ranges except one. (Plot of average absolute error versus the output variable y.)
Figure 10: Histogram of absolute error using MT and AdaBoost.RT for one of the Friedman#1 test data sets. AdaBoost.RT does well on the hard examples, as the frequency of larger absolute errors is reduced.
6.2.2 AdaBoost.R2. For comparison, experiments with AdaBoost.R2 were performed as well; MT was used as a weak learning machine. We used the linear and square loss functions to calculate the weight-updating parameter. The performance of this boosting method for all test data sets is shown in Table 3. In general, the overall RMSE is satisfactory. There is a significant increase of performance on the Laser and CPU data sets, as RMSE is reduced from 4.04 to 2.69 (33.4% reduction) and from 34.65 to 24.45 (29.4% reduction), respectively. For the other three data sets, the performance improvement was less significant: from 16.9% for Friedman#1 to almost no improvement for Auto-Mpg. For SieveQ3 and SieveQ6, however, performance deteriorated. The likely reason for the deteriorating performance of AdaBoost.R2 is the boosting of outliers and noise. Noisy examples, which obviously have large prediction errors, are strongly boosted, while those with smaller prediction errors are weakly boosted, because the weight-updating parameters depend on the magnitude of the prediction error. Intuitively, some examples will typically generate extraordinarily high prediction errors, resulting in low overall performance of the committee machine. Experiments with the boosting method of Avnimelech and Intrator were conducted as well. Fifteen percent of the training set was assigned to Set 1, and the rest was divided into two equal sets, similar to the experiments
performed by Avnimelech and Intrator (1999) with the Laser data set. As in AdaBoost.RT, the big error γ is selected using the cross-validation set for the Group 1 data sets and the training set for Group 2. We used the Modified Boost2 version of their boosting method. The results are reported in Table 3. The performance of boosting is worse than that of the single machine for all data sets except SieveQ3. The reason could be that insufficient data are available for training because of the partition of the training data into three subsets. Furthermore, a nonoptimal partition of the training data and selection of the big error lead to deteriorating performance. Finally, the experiments with the BEM boosting method (Feely, 2000) were repeated. As in other threshold-based boosting algorithms, the BEM value is calibrated using cross validation (for the Group 1 data sets) and the training data set (for Group 2). The results are reported in Table 3. The BEM method is more accurate than the single machine for the benchmark and Laser data sets. RMSE on the test data sets is reduced by 33% for the CPU data set and 22.8% for the Friedman#1 data set. However, for the hydrological data sets, this method suffers from overfitting: the RMSE in training drops dramatically, while the RMSE in testing is much higher than that of the single machine. The reason for overfitting is that after several iterations, the training data consist of a large number of copies of just a few examples (those having higher errors), and the machine consequently fails to generalize. Furthermore, the BEM boosting method is highly sensitive to the value of BEM. Its computation cost is very high compared to the other boosting algorithms, since the size of the training data increases after a few iterations.

6.3 Bagging. The experiments with bagging were performed as well.
Since Breiman (1996b) noted that most of the improvement from bagging is evident within 10 replications, and since it was interesting to see the performance improvement that can be brought by a single order-of-magnitude increase in computation cost, the number of machines was set to 10. The results of these experiments are presented in Table 3. Bagging outperformed AdaBoost.RT on only one data set out of seven, namely Auto-Mpg (by only 3.4%). On the other six data sets, AdaBoost.RT surpasses bagging by as much as 18.8% (for the CPU data set). These results demonstrate that AdaBoost.RT outperforms bagging, although the differences for the Group 1 data sets are not considerable. Bagging, however, always improves the performance over a single machine, although only marginally. The two-tail sign test also indicates that bagging is significantly better than the single machine at the 99% confidence level. It is worth mentioning that bagging seems to be reliable and consistent but may not be the most accurate method.

6.4 Neural Networks. Finally, an ANN model (multilayer perceptron) was set up to compare the techniques described above against a popular machine learning method. The hyperbolic tangent function was used for
Table 4: Comparison of Performance (RMSE) of AdaBoost.R2 and AdaBoost.RT Using Different Combining Schemes in Test Data Set for the Group 1 Data Sets.

                      Weighted Average             Weighted Median
Data Set          AdaBoost.R2  AdaBoost.RT    AdaBoost.R2  AdaBoost.RT
SieveQ3              21.92        13.81          15.01        13.73
SieveQ6              32.15        26.37          28.36        26.30
Boston Housing        3.57         3.23           3.23         3.39

Notes: The results are averaged for 10 different runs. Bold type signifies the minimum value of RMSE for each data set.
the hidden layer and the linear transfer function for the output layer. For all the data sets, default values of 0.2 and 0.7 were taken as the learning and momentum rates, respectively. The number of epochs was 1000. The number of hidden nodes was selected according to the number of attributes, and in most cases it was less than the number of attributes. We used NeuroSolutions (http://www.nd.com) and NeuralMachine (http://www.datamachine.com) software. The results are presented in Table 3. In comparison to MT, it is observed that for three of the seven data sets, the performance of the ANN is worse. For example, in the case of SieveQ6, the RMSE value achieved using the ANN is 32.87, as compared to 26.55 using MT. This may be due to using nonoptimal values of the ANN parameters. However, for Friedman#1, ANN outperformed MT considerably (RMSE is lower by more than 30%). Interestingly, for the CPU data set, ANN is considerably better than all other methods.

6.5 Use of Different Combining Methods for Boosting. It has already been mentioned that AdaBoost.RT uses the weighted average when combining the outputs of the machines in the ensemble. In contrast, AdaBoost.R2 uses the weighted median. In order to compare these two competitive boosting algorithms with different schemes for combining outputs, we repeated the experiments for the Group 1 data sets using both the weighted average and the weighted median. The results are presented in Table 4. When the weighted average is used in AdaBoost.R2, AdaBoost.RT leads for all three data sets. Furthermore, if a weighted median is used for AdaBoost.RT, AdaBoost.R2 leads on only one data set. If we examine the 30 experiments in this group of data sets, it is observed that AdaBoost.RT outperformed AdaBoost.R2 in 23 experiments using a weighted average, while AdaBoost.R2 outperformed AdaBoost.RT in only 12 experiments using a weighted median as the combination scheme.
This means that AdaBoost.RT outperforms AdaBoost.R2 regardless of the combination scheme on the data sets considered.
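The two combining schemes compared above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the per-machine weights would be the confidence weights (e.g., log(1/β_t)) produced by boosting:

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cum = np.cumsum(w)
    return float(v[np.searchsorted(cum, 0.5 * cum[-1])])

def weighted_average(values, weights):
    """Confidence-weighted mean of the machine outputs."""
    return float(np.average(values, weights=weights))
```

A machine with a dominant weight pulls the weighted median all the way to its own output, whereas the weighted average only shifts toward it; this is why the two schemes can rank AdaBoost.RT and AdaBoost.R2 differently.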
Table 5: Qualitative Scoring Matrix for Different Machines.

Machine    MT   RT   R2   AI   BEM   Bagging   ANN   Total
MT          0   23   30   73   43     30       47     41
RT         77    0   60   90   67     60       59     69
R2         70   40    0   80   61     61       57     62
AI         27   10   20    0   34     16       43     25
BEM        57   33   39   66    0     46       44     47
Bagging    70   40   39   84   54      0       56     57
ANN        53   41   43   57   56     44        0     49
Total      59   31   38   75   53     43       51      0

Note: Bold type indicates the total value for each learner.
6.6 Comparison of All Methods. From the performance comparison presented in Table 3, it has been found that the results are mixed: no machine beats all other machines on all data sets. Since the results presented in Table 3 are an average over 10 independent runs, it is also possible to count how many times one machine beats another in the 70 independent runs over all data sets (10 runs for each data set). Table 5 presents the qualitative performance of one machine over another in percentage terms (a qualitative scoring matrix). Its element QSM_{i,j} should be read as the percentage of independent runs, over all data sets, in which the ith machine (row header in Table 5) beats the jth machine (column header), and is given by the following formula:
\[
QSM_{i,j} =
\begin{cases}
\dfrac{\operatorname{Count}_K\!\left(RMSE_{K,j} > RMSE_{K,i}\right)}{N_t}, & i \neq j,\\[6pt]
0, & i = j,
\end{cases}
\tag{6.1}
\]
where $RMSE_{K,i}$ is the RMSE of the $i$th machine (row headers in the first column of Table 5) in run $K$, $RMSE_{K,j}$ is the RMSE of the $j$th machine (column headers in the first row), $N_t$ is the total number of independent runs over all data sets, and $K = 1, 2, \ldots, N_t$. The following relation holds:

\[
QSM_{i,j} + QSM_{j,i} = 1 \qquad \forall\, i, j \ \text{and} \ i \neq j.
\tag{6.2}
\]
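The qualitative scoring matrix of equation 6.1 can be computed as in the following sketch (an illustration under the assumption that `rmse` holds one RMSE per machine per independent run):

```python
import numpy as np

def qualitative_scoring_matrix(rmse):
    """rmse: array of shape (Nt, M) - RMSE of each of M machines in Nt runs.
    QSM[i, j] = percentage of runs in which machine i beats machine j
    (i.e., has the lower RMSE); diagonal entries are zero."""
    rmse = np.asarray(rmse, dtype=float)
    nt, m = rmse.shape
    qsm = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j:
                qsm[i, j] = 100.0 * np.sum(rmse[:, j] > rmse[:, i]) / nt
    return qsm
```

In the absence of ties, `qsm[i, j] + qsm[j, i] == 100`, which is relation 6.2 expressed in percent.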
For example, the value of 77 appearing in the second row and the first column means that AdaBoost.RT beats MT in 77% of the runs on average. Table 5 demonstrates that AdaBoost.RT beats the other machines considered in 69% of the runs on average, compared with overall scores of 62%, 47%, and 24% for AdaBoost.R2, the BEM boosting method, and the method of Avnimelech and Intrator, respectively. To analyze the relative performance of the algorithms more precisely, a measure reflecting the relative performance is needed.
Table 6: Quantitative Scoring Matrix for Different Machines.

Machine     MT     RT     R2     AI     BEM   Bagging   ANN    Total
MT          0.0  −13.7  −12.0    7.5   −4.3   −6.9     −5.8   −35.2
RT         13.7    0.0    1.4   19.5    8.7    7.5      5.9    56.7
R2         12.0   −1.4    0.0   17.6    7.9    5.9      4.2    46.2
AI         −7.5  −19.5  −17.6    0.0  −10.1  −13.7    −10.8   −79.2
BEM         4.3   −8.7   −7.9   10.1    0.0   −1.6     −4.2    −8.0
Bagging     6.9   −7.5   −5.9   13.7    1.6    0.0      0.3     9.2
ANN         5.8   −5.9   −4.2   10.8    4.2   −0.3      0.0    10.3

Note: Bold type indicates the total value for each learner.
For this reason, we used the so-called quantitative scoring matrix, or simply scoring matrix (Solomatine & Shrestha, 2004), shown in Table 6. It gives the average relative performance (in %) of one technique (either a single machine or a committee machine) over another over all the data sets. The element SM_{i,j} of the scoring matrix should be read as the average performance of the ith machine (row header in Table 6) over the jth machine (column header) and is calculated as follows:
\[
SM_{i,j} =
\begin{cases}
\dfrac{1}{N}\displaystyle\sum_{k=1}^{N} \dfrac{RMSE_{k,j} - RMSE_{k,i}}{\max\!\left(RMSE_{k,j},\, RMSE_{k,i}\right)}, & i \neq j,\\[6pt]
0, & i = j,
\end{cases}
\tag{6.3}
\]
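Equation 6.3 can be computed as in the following sketch (illustrative; `rmse` is assumed to hold the average RMSE of each machine on each data set):

```python
import numpy as np

def quantitative_scoring_matrix(rmse):
    """rmse: array of shape (N, M) - average RMSE of each of M machines on N data sets.
    SM[i, j] = mean relative performance (%) of machine i over machine j,
    where each per-data-set difference is normalized by the larger RMSE."""
    rmse = np.asarray(rmse, dtype=float)
    n, m = rmse.shape
    sm = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j:
                sm[i, j] = 100.0 * np.mean(
                    (rmse[:, j] - rmse[:, i]) / np.maximum(rmse[:, j], rmse[:, i])
                )
    return sm
```

The matrix is antisymmetric (`sm[i, j] == -sm[j, i]`), so summing each row gives the overall score of each machine, as in the Total column of Table 6.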
where N is the number of data sets. The value of 13.7 appearing in the second row and the first column of Table 6 means that the performance of AdaBoost.RT over MT is 13.7% higher when averaged over all seven data sets considered. By summing the elements row-wise, one can determine the overall score of each machine. It can be clearly observed from Tables 5 and 6 that AdaBoost.RT scores highest on average, which implies that on the data sets used, AdaBoost.RT is comparatively better than the other techniques considered.

6.7 Comparison with Previous Research. In spite of the attention that boosting receives in classification problems, relatively few comparisons between boosting techniques for regression exist. Drucker (1997) performed some experiments using AdaBoost.R2 with the Friedman#1 and Boston Housing data sets. He obtained RMSEs of 1.69 and 3.27 for the Friedman#1 and Boston Housing data sets, respectively. Our results are consistent with his; however, there are certain procedural differences in the experimental settings. He used 200 training examples, 40 pruning examples, and 5000 test examples per run, with 10 runs, to determine the best loss function. In our experiments, we used only 990 training examples and 510
Table 7: Comparison of Performance (RMSE) of Boosting Algorithms on Laser Data Set (Test Data) Using ANN as a Weak Learning Machine.

Boosting Algorithm    RMSE     NMSE
ANN                   4.07    0.008042
AdaBoost.RT           2.48    0.002768
AdaBoost.R2           2.84    0.003700
AI                    3.13    0.004402
BEM                   3.42    0.005336

Notes: The results are averaged for 10 different runs. Bold type indicates the minimum value of RMSE among different learners.
test examples, and the number of iterations was only 10 against his 100. We did not optimize the loss functions; moreover, the weak learning machine is completely different, as we used the M5 model tree (potentially more accurate), whereas he used a regression tree. We also conducted experiments with the Laser data set using ANNs as a weak learning machine to compare with the previous work of Avnimelech and Intrator (1999). The architecture of the neural networks is the same as described before. The hidden layer consisted of six nodes. The number of epochs was limited to 500. The results, reported in Table 7, show that AdaBoost.RT is superior to all the other boosting methods. The RMSE of AdaBoost.RT is 2.48, as compared to 2.84 for AdaBoost.R2, 3.13 for Avnimelech and Intrator's method, and 3.42 for the BEM method. The normalized mean squared error (NMSE: the mean squared error divided by the variance of the target values) was also calculated. Avnimelech and Intrator obtained an NMSE value of 0.0047, while ours was 0.0044 using their method on the Laser data set. There were some procedural differences in the experimental settings: they used 5 different partitions of 8000 training examples and 2000 test examples, whereas we used 10 different partitions of 7552 training examples and 2518 test examples. In spite of these differences, it can be said that our result is consistent with theirs.

7 Conclusions

Experiments with a new boosting algorithm, AdaBoost.RT, for regression problems were presented and analyzed. Unlike several recent boosting algorithms for regression problems that follow the idea of gradient descent, AdaBoost.RT builds the regression model by simply changing the distribution of the sample and is a variant of AdaBoost.M1 modified for regression. The training examples are projected into two classes by comparing the accuracy of prediction with the preset relative error threshold.
The idea of using an error threshold is analogous to the insensitivity range used, for example,
in support vector machines. The loss function is computed using relative error rather than absolute error; in our view, this is justified in many real-life applications. Unlike most other boosting algorithms, the AdaBoost.RT algorithm does not have to stop when the error rate is greater than 0.5. The modified weight-updating parameter not only ensures that the value of the machine weight is nonnegative but also places relatively higher emphasis on the harder examples. The boosting method of Avnimelech and Intrator (1999) suffers from data shortage, and the BEM boosting method (Feely, 2000) is time-consuming on large data sets. In this sense, AdaBoost.RT is a preferred option. Compared to AdaBoost.R2, however, AdaBoost.RT needs a parameter to be selected initially, and this introduces additional complexity that has to be handled, as in other threshold-based boosting algorithms (for example, those of Avnimelech and Intrator and BEM). The algorithmic structure of AdaBoost.RT is such that it updates the weight-updating parameter (used to calculate the probability of being chosen for the next machine) by the same value. This feature ensures that outliers (noisy examples) do not dominate the training sets for the subsequent machines. The experimental results demonstrated that AdaBoost.RT outperforms a single machine (i.e., a weak learning machine, for which an M5 model tree was used) for all of the data sets considered. The two-tail sign test also indicates that AdaBoost.RT is better than a single machine with a confidence level higher than 99%. Compared with the other boosting algorithms, it was observed that AdaBoost.RT surpasses them on most of the data sets considered. The qualitative and quantitative performance measures (scoring matrices) also give an indication of the higher accuracy of AdaBoost.RT. However, for a more accurate and statistically significant comparison, more experiments are needed.
An obvious next step would be to automate the choice of a (sub)optimal value for the threshold depending on the characteristics of the data set and to test other functional relationships for the weight-updating parameters.

Appendix A: AdaBoost.M1 Algorithm

1. Input:
   - Sequence of m examples (x1, y1), ..., (xm, ym), where the labels y ∈ Y = {1, ..., k}
   - Weak learning algorithm Weak Learner
   - Integer T specifying the number of iterations (machines)
2. Initialize:
   - Machine number or iteration t = 1
   - Distribution Dt(i) = 1/m for all i
   - Error rate εt = 0
3. Iterate while the error rate εt < 0.5 and t ≤ T:
   - Call Weak Learner, providing it with distribution Dt
   - Get back a hypothesis ht : X → Y
   - Calculate the error rate of ht: εt = Σ_{i : ht(xi) ≠ yi} Dt(i)
   - Set βt = εt / (1 − εt)
   - Update the distribution Dt:
     Dt+1(i) = (Dt(i)/Zt) × (βt if ht(xi) = yi; 1 otherwise),
     where Zt is a normalization factor chosen such that Dt+1 is a distribution
   - Set t = t + 1
4. Output the final hypothesis:
   hfin(x) = arg max_{y ∈ Y} Σ_{t : ht(x) = y} log(1/βt)
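A runnable sketch of AdaBoost.M1 follows. This is an illustration under stated assumptions, not the authors' Weka implementation: `weak_learner(X, y, D)` is a hypothetical caller-supplied function that trains a predictor under distribution D, and `stump_learner` is a toy 1-D decision stump used only for demonstration:

```python
import numpy as np

def adaboost_m1(X, y, weak_learner, T=10):
    """AdaBoost.M1 sketch: reweight examples, combine hypotheses by weighted vote."""
    m = len(y)
    D = np.full(m, 1.0 / m)                   # initial uniform distribution
    ensemble = []                             # list of (hypothesis, log(1/beta))
    for _ in range(T):
        h = weak_learner(X, y, D)
        miss = h(X) != y
        eps = D[miss].sum()                   # weighted error rate
        if eps == 0:                          # perfect hypothesis: keep it and stop
            ensemble.append((h, 1.0))
            break
        if eps >= 0.5:                        # weak learner no better than chance
            break
        beta = eps / (1.0 - eps)
        D = D * np.where(miss, 1.0, beta)     # down-weight correct examples
        D = D / D.sum()                       # normalize (the Z_t factor)
        ensemble.append((h, np.log(1.0 / beta)))
    labels = np.unique(y)
    def predict(Xq):
        votes = np.zeros((len(Xq), len(labels)))
        for h, w in ensemble:
            pred = h(Xq)
            for li, lab in enumerate(labels):
                votes[:, li] += w * (pred == lab)
        return labels[np.argmax(votes, axis=1)]
    return predict

def stump_learner(X, y, D):
    """Toy weighted decision stump on a single feature (illustrative only)."""
    best = None
    for thr in np.unique(X):
        for lo, hi in ((0, 1), (1, 0)):
            pred = np.where(X < thr, lo, hi)
            err = D[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, thr, lo, hi)
    _, thr, lo, hi = best
    return lambda Xq: np.where(Xq < thr, lo, hi)
```

On a separable toy set such as X = [0, 1, 2, 3], y = [0, 0, 1, 1], the first stump is already perfect, so the ensemble reduces to a single weighted hypothesis.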
Appendix B: AdaBoost.R2 Algorithm

1. Input:
   - Sequence of m examples (x1, y1), ..., (xm, ym), where the output y ∈ R
   - Weak learning algorithm Weak Learner
   - Integer T specifying the number of iterations (machines)
2. Initialize:
   - Machine number or iteration t = 1
   - Distribution Dt(i) = 1/m for all i
   - Average loss Lt = 0
3. Iterate while the average loss Lt < 0.5 and t ≤ T:
   - Call Weak Learner, providing it with distribution Dt
   - Build the regression model: ft(x) → y
   - Calculate the loss for each training example: lt(i) = |ft(xi) − yi|
   - Calculate the loss function Lt(i) for each training example using one of three functional forms:
     Linear: Lt(i) = lt(i)/Denomt
     Square law: Lt(i) = (lt(i)/Denomt)^2
     Exponential: Lt(i) = 1 − exp(−lt(i)/Denomt)
     where Denomt = max_{i=1,...,m} lt(i)
   - Calculate the average loss: Lt = Σ_{i=1}^{m} Lt(i) Dt(i)
   - Set βt = Lt/(1 − Lt)
   - Update the distribution Dt: Dt+1(i) = Dt(i) βt^{(1−Lt(i))} / Zt, where Zt is a normalization factor chosen such that Dt+1 is a distribution
   - Set t = t + 1
4. Output the final hypothesis (the weighted median):
   ffin(x) = inf{ y ∈ Y : Σ_{t : ft(x) ≤ y} log(1/βt) ≥ (1/2) Σ_t log(1/βt) }
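The three loss forms in step 3 can be sketched as follows (an illustration; `pred` and `y` are hypothetical arrays of predictions and targets):

```python
import numpy as np

def r2_losses(pred, y, form="linear"):
    """AdaBoost.R2 per-example loss L_t(i), normalized into [0, 1]."""
    l = np.abs(np.asarray(pred, dtype=float) - np.asarray(y, dtype=float))
    denom = l.max()                   # Denom_t: the largest absolute error
    if denom == 0:
        return np.zeros_like(l)      # perfect fit: zero loss everywhere
    r = l / denom
    if form == "linear":
        return r
    if form == "square":
        return r ** 2
    if form == "exponential":
        return 1.0 - np.exp(-r)
    raise ValueError(f"unknown loss form: {form}")
```

All three forms assign loss 1 to the worst example; the square law suppresses small errors more strongly, while the exponential form compresses large ones.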
Appendix C: Boosting Method of Avnimelech and Intrator (1999)

1. Split the training examples into three sets (Set 1, Set 2, and Set 3) in such a way that Set 1 is smaller than the other two sets.
2. Train the first machine on Set 1.
3. Assign to the second machine all the examples from Set 2 on which the first machine has a big error and a similar number of examples from Set 2 on which it does not have a big error, and train the second machine on them.
4. Assign the training examples to the third machine according to one of the following versions:
   - Boost1: all examples on which exactly one of the first two machines has a big error.
   - Boost2: the data set constructed for Boost1 plus all examples on which both machines have a big error but these errors have different signs.
   - Modified Boost2: the data set constructed for Boost2 plus all examples on which both machines have a big error but there is a "big" difference between the magnitudes of the errors.
5. Combine the outputs of the three machines using the median as the final prediction of the ensemble.

Appendix D: BEM Boosting Method (Feely, 2000)

1. Input:
   - Sequence of m examples (x1, y1), ..., (xm, ym), where the output y ∈ R
   - Weak learning algorithm Weak Learner
   - Integer T specifying the number of iterations
   - BEM for demarcating correct and incorrect predictions
2. Initialize:
   - Machine number or iteration t = 1
   - Distribution Dt(i) = 1/m for all i
   - Error count errCountt = 0
3. Iterate while t ≤ T:
   - Call Weak Learner, providing it with distribution Dt
   - Build the regression model: ft(x) → y
   - Calculate the absolute error AEt(i) for each training example
   - Calculate the error count of ft(x): errCountt = #{i : AEt(i) > BEM} over the whole training data
   - Calculate the Upfactor and Downfactor:
     Upfactort = m / errCountt
     Downfactort = 1 / Upfactort
   - Update the distribution Dt:
     Dt+1(i) = Dt(i) × (Upfactort if AEt(i) > BEM; Downfactort otherwise)
   - Sample new training data according to the distribution Dt+1
   - Set t = t + 1
4. Combine the outputs of the individual machines.

References

Avnimelech, R., & Intrator, N. (1999). Boosting regression estimators. Neural Computation, 11(2), 499–520.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science. Available online at http://www.ics.uci.edu/∼mlearn/MLRepository.html.
Breiman, L. (1996a). Stacked regressions. Machine Learning, 24(1), 49–64.
Breiman, L. (1996b). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (1996c). Bias, variance, and arcing classifiers (Tech. Rep. 460). Berkeley: Statistics Department, University of California.
Breiman, L. (1997). Prediction games and arcing algorithms. Neural Computation, 11(7), 1493–1518.
Cherkassky, V., & Ma, Y. (2004). Comparison of loss functions for linear regression. In Proc. of the International Joint Conference on Neural Networks (pp. 395–400). Piscataway, NJ: IEEE.
Drucker, H. (1997). Improving regressors using boosting techniques. In D. H. Fisher, Jr. (Ed.), Proc. of the 14th International Conference on Machine Learning (pp. 107–115). San Mateo, CA: Morgan Kaufmann.
Drucker, H. (1999). Boosting using neural networks. In A. J. C. Sharkey (Ed.), Combining artificial neural nets (pp. 51–77). London: Springer-Verlag.
Duffy, N., & Helmbold, D. P. (2000). Leveraging for regression. In Proc. of the 13th Annual Conference on Computational Learning Theory (pp. 208–219). San Francisco: Morgan Kaufmann.
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. New York: Chapman and Hall.
Feely, R. (2000). Predicting stock market volatility using neural networks. Unpublished B.A. (Mod.) dissertation, Trinity College Dublin.
Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. In Proc. of the 13th International Conference on Machine Learning (pp. 148–156). San Francisco: Morgan Kaufmann.
Freund, Y., & Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Friedman, J. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19(1), 1–82.
Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2), 337–374.
Kearns, M., & Vazirani, U. V. (1994). An introduction to computational learning theory. Cambridge, MA: MIT Press.
Namee, B. M., Cunningham, P., Byrne, S., & Corrigan, O. I. (2000). The problem of bias in training data in regression problems in medical decision support. Pádraig Cunningham's online publications, TCD-CS-2000-58. Available online at http://www.cs.tcd.ie/Padraig.Cunningham/online-pubs.html.
Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11, 169–198.
Quinlan, J. R. (1992). Learning with continuous classes. In Proc. of the 5th Australian Joint Conference on AI (pp. 343–348). Singapore: World Scientific.
Quinlan, J. R. (1996). Bagging, boosting and C4.5. In Proc. of the 13th National Conference on Artificial Intelligence (pp. 725–730). Menlo Park, CA: AAAI Press.
Ridgeway, G. (1999). The state of boosting. Computing Science and Statistics, 31, 172–181.
Ridgeway, G., Madigan, D., & Richardson, T. (1999). Boosting methodology for regression problems. In Proc. of the 7th International Workshop on Artificial Intelligence and Statistics (pp. 152–161). San Francisco: Morgan Kaufmann.
Schapire, R. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.
Solomatine, D. P., & Dulal, K. N. (2003). Model tree as an alternative to neural network in rainfall-runoff modelling. Hydrological Science Journal, 48(3), 399–411.
Solomatine, D. P., & Shrestha, D. L. (2004). AdaBoost.RT: A boosting algorithm for regression problems. In Proc. of the International Joint Conference on Neural Networks (pp. 1163–1168). Piscataway, NJ: IEEE. Tresp, V. (2001). Committee machines. In Y. H. Hu & J.-N. Hwang (Eds.), Handbook for neural network signal processing. Boca Raton, FL: CRC Press. Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142. Vapnik, V. (1995). The nature of statistical learning theorey. New York: Springer. Weigend A. S., & Gershenfeld G. (1993). Time series prediction: Forecasting the future and understanding the past. In Proceedings of the NATO Advanced Research Workshop on Comparative Time Series Analysis. Menlo Park, CA: Addison-Wesley. Witten, I. H., & Frank, E. (2000). Data mining. San Francisco: Morgan Kaufmann. Zemel, R., & Pitassi, T. (2001). A gradient-based boosting algorithm for regression problems. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Received November 16, 2004; accepted October 21, 2005.
LETTER
Communicated by Carter Wendelken
A Connectionist Computational Model for Epistemic and Temporal Reasoning

Artur S. d’Avila Garcez
[email protected]
Department of Computing, City University, London EC1V 0HB, UK
Luís C. Lamb
[email protected]
Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre RS, 91501-970, Brazil
The importance of the efforts to bridge the gap between the connectionist and symbolic paradigms of artificial intelligence has been widely recognized. The merging of theory (background knowledge) and data learning (learning from examples) into neural-symbolic systems has indicated that such a learning system is more effective than purely symbolic or purely connectionist systems. Until recently, however, neural-symbolic systems were not able to fully represent, reason, and learn expressive languages other than classical propositional and fragments of first-order logic. In this article, we show that nonclassical logics, in particular propositional temporal logic and combinations of temporal and epistemic (modal) reasoning, can be effectively computed by artificial neural networks. We present the language of a connectionist temporal logic of knowledge (CTLK). We then present a temporal algorithm that translates CTLK theories into ensembles of neural networks and prove that the translation is correct. Finally, we apply CTLK to the muddy children puzzle, which has been widely used as a testbed for distributed knowledge representation. We provide a complete solution to the puzzle with the use of simple neural networks, capable of reasoning about knowledge evolution in time and of knowledge acquisition through learning.

Neural Computation 18, 1711–1738 (2006)
© 2006 Massachusetts Institute of Technology

1 Introduction

The importance of the efforts to bridge the gap between the connectionist and symbolic paradigms of artificial intelligence has been widely recognised (Ajjanagadde, 1997; Cloete & Zurada, 2000; Shastri, 1999; Sun, 1995; Sun & Alexandre, 1997). The merging of theory (background knowledge) and data learning (learning from examples) into neural networks has been shown to provide a learning system that is more effective than purely symbolic or purely connectionist systems, especially when data are noisy
(Towell & Shavlik, 1994). This contributed decisively to the growing interest in developing neural-symbolic learning systems, hybrid systems based on neural networks that are capable of learning from examples and background knowledge (d’Avila Garcez, Broda, & Gabbay, 2002). Typically, translation algorithms from a symbolic to a connectionist representation, and vice versa, are employed to provide a neural implementation of a logic, a sound logical characterization of a neural system, or a hybrid learning system that brings together features from connectionism and symbolic artificial intelligence (AI). As argued in Browne and Sun (2001), if connectionism is an alternative paradigm to artificial intelligence, neural networks must be able to compute symbolic reasoning in an efficient and effective way. Further, it is argued that in hybrid systems, usually the connectionist component is fault tolerant, while the symbolic component may be brittle and rigid. We tackle this problem of the symbolic component by offering a principled way to compute, represent, and learn propositional temporal and epistemic reasoning within connectionist models. Until recently, neural-symbolic systems were not able to fully represent, compute, and learn expressive languages other than propositional and fragments of first-order, classical logic (Cloete & Zurada, 2000). However, in d’Avila Garcez, Lamb, Broda, and Gabbay (2004) and d’Avila Garcez, Lamb, and Gabbay (2002), a new approach to knowledge representation and reasoning in neural-symbolic systems based on neural network ensembles was proposed. This new approach shows that nonclassical, modal logics can be effectively represented in artificial neural networks. Learning in the connectionist modal logic system is achieved by training each network in the ensemble, which corresponds to the current knowledge of an agent within a possible world. 
In this letter, following the formalization of connectionist modal logics (CML) proposed in d’Avila Garcez, Lamb, et al. (2002), d’Avila Garcez, Lamb, Broda, and Gabbay (2003), and d’Avila Garcez et al. (2004), we show that temporal and epistemic logics, by means of temporal and epistemic logic programming fragments, can be effectively represented in, and combined with, artificial neural networks (d’Avila Garcez & Lamb, 2004). This is done by providing a translation algorithm from temporal logic theories to the initial architecture of a neural network. A theorem then shows that the given temporal theory and the network are equivalent in the sense that the network computes the fixed point of the theory. We then validate the connectionist temporal logic of knowledge (CTLK) system by applying it to a distributed time and knowledge representation problem known as the muddy children puzzle (Fagin, Halpern, Moses, & Vardi, 1995). As an extension of CML that includes temporal operators, CTLK provides a combined (multimodal) connectionist system of knowledge and time. This allows the modeling of evolving situations such as changing environments or possible worlds, and the construction of a connectionist
model for reasoning about the temporal evolution of knowledge. These features, combined with the computational power of neural networks, lead us toward a rich neural-symbolic learning system (d’Avila Garcez, Broda, et al., 2002), where various forms of nonclassical reasoning are naturally represented, derived, and learned. Hence, the approach presented here extends the representation power of artificial neural networks beyond the classical level. There has been a considerable amount of work in symbolic AI using nonclassical logics. It is important that these are investigated within the neural computation paradigm.1 For instance, applications in AI and computer science have made extensive use of decidable modal logics, including the analysis and model checking of distributed and multi-agent systems, program verification and specification, and hardware model checking. In the case of temporal and epistemic logics, these have found a large number of applications, notably in game theory and in models of knowledge and interaction in multi-agent systems (Pnueli, 1977; Fagin et al., 1995; Gabbay, Hodkinson, & Reynolds, 1994). Therefore, this work contributes toward the representation of such expressive, highly successful logical languages in neural networks. Our long-term aim is to contribute to the challenge of representing expressive symbolic formalisms within learning systems. We are proposing a methodology for the representation of several nonclassical logics in artificial neural networks. Such expressive logics have been successfully used in computer science, and we believe that connectionist approaches should take them into account by means of adequate computational models catering for reasoning, knowledge representation, and learning. 
According to Valiant (2003), the two most fundamental phenomena of intelligent cognitive behavior are the ability to learn from experience and the ability to reason from what has been learned: We are seeking a semantics of knowledge that can computationally support the basic phenomena of intelligent behaviour. It should support integrated algorithms for learning and reasoning that are computationally tractable and have some nontrivial scope. Another requirement is that it has a principled way of ensuring that the knowledge-base from which reasoning is done is robust, in the sense that errors in the deductions are at least controllable. (Valiant, 2003)
Taking the requirements put forward by Valiant into consideration, this article provides a robust connectionist computational model for epistemic and temporal reasoning. Knowledge is represented by a symbolic language, while deduction and learning are carried out by a connectionist engine.

1 It is well known that modal logics correspond, in terms of expressive power, to a two-variable fragment of first-order logic (Vardi, 1997). Further, as the two-variable fragment of predicate logic is decidable, this explains why modal logics are so robustly decidable and amenable to multiple applications.
The remainder of this letter is organized as follows. In section 2, we recall some useful preliminary concepts and present the CML system. In section 3, we present the CTLK system, introduce the temporal algorithm, which translates temporal logic programs into artificial neural networks, and prove that the translation is correct. In section 4, we use CML and CTLK to tackle the muddy children puzzle and compare the solutions provided by each system. In section 5, we conclude and discuss directions for future work.

2 Preliminaries

The CTLK uses ensembles of connectionist inductive learning and logic programming (C-ILP) neural networks (d’Avila Garcez, Broda, et al., 2002; d’Avila Garcez & Zaverucha, 1999). C-ILP networks are single hidden-layer networks that can be trained with backpropagation (Rumelhart, Hinton, & Williams, 1986). In C-ILP, a translation algorithm (given later) maps a logic program P into a single hidden-layer neural network N such that N computes the least fixed point of P. This provides a massively parallel model for computing the stable model semantics of P (Gelfond & Lifschitz, 1991). In addition, N can be trained with examples using backpropagation, having P as background knowledge. The knowledge acquired by training can then be extracted (d’Avila Garcez, Broda, & Gabbay, 2001), closing the learning cycle (as in Towell & Shavlik, 1994). Let us exemplify how the C-ILP translation algorithm works. Each rule rl of program P is mapped from the input layer to the output layer of N through one neuron (Nl) in the single hidden layer of N. Thresholds and weights are such that the hidden layer computes a logical and of the input layer, while the output layer computes a logical or of the hidden layer.
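With bipolar activations (+1 for true, −1 for false), this and/or behavior can be realized by simple threshold units. The sketch below is our illustration of the idea, not the paper's code; the weight W = 1 and the function names are ours:

```python
# A hedged illustration (not the paper's code) of how bipolar threshold
# units realize the logical and/or just described: +1 encodes true, -1
# encodes false, and negated body literals receive weight -W.

W = 1.0

def and_neuron(inputs, signs):
    """Fires (+1) only when every positive literal is +1 and every
    negated literal (sign -1) is -1, i.e., when the rule body holds."""
    n = len(inputs)
    potential = sum(W * s * x for s, x in zip(signs, inputs))
    return 1 if potential > W * (n - 1) else -1

def or_neuron(inputs):
    """Fires (+1) when at least one hidden neuron fires."""
    n = len(inputs)
    potential = sum(W * x for x in inputs)
    return 1 if potential > W * (1 - n) else -1

# Rule r1: B AND C AND NOT D -> A, evaluated with B = C = true, D = false.
h1 = and_neuron([1, 1, -1], [1, 1, -1])   # body holds, so h1 = +1
a = or_neuron([h1, -1])                   # output A fires: a = +1
```

The thresholds W·(n − 1) and W·(1 − n) are what make a single unit behave as an n-ary conjunction or disjunction.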
Hence, the translation algorithm from P to N has to implement the following conditions: (C1 ) The input potential of a hidden neuron (Nl ) can only exceed Nl ’s threshold (θl ), activating Nl , when all the positive antecedents of rule rl are assigned truth-value true while all the negative antecedents of rl are assigned false; and (C2 ) the input potential of an output neuron (A) can only exceed A’s threshold (θ A), activating A, when at least one hidden neuron Nl that is connected to A is activated. Example 1 (C-ILP). Consider the logic program P = {r1 : B ∧ C ∧ ¬D → A; r2 : E ∧ F → A; r3 : B}. The translation algorithm derives the network N of Figure 1, setting weights (W) and thresholds (θ ) in such a way that conditions C1 and C2 above are satisfied. Note that if N has to be fully connected, any other link (not shown in Figure 1) should receive weight zero initially. Each input and output neuron of N is associated with an atom of P. As a result, each input and output vector of N can be associated with an interpretation for P, so that an atom (e.g., A) is true iff its corresponding neuron (neuron A) is activated. Note also that each hidden neuron Nl corresponds
[Figure 1 here: input neurons B, C, D, E, and F feed hidden neurons N1, N2, and N3 with weights W (and −W on the connection from D), which in turn feed output neurons A and B, with thresholds θ1, θ2, θ3, θA, and θB.]
Figure 1: Sketch of a neural network N that represents logic program P.
to a rule rl of P. In order to compute a stable model, output neuron B should feed input neuron B such that N is used to iterate the fixed-point operator TP of P (d’Avila Garcez, Lamb, et al., 2002). This is done by transforming N into a recurrent network Nr, containing feedback connections from the output to the input layer of N, all with fixed weights Wr = 1. As a result, the activation of output neuron B feeds back into the activation of input neuron B, which allows the network to compute chains such as A → B and B → C. In the case of P above, given any initial activation to the input layer of Nr (network of Figure 1 recurrently connected), it always converges to the following stable state: A = false, B = true, C = false, D = false, E = false, and F = false. In CML, a (one-dimensional) ensemble of C-ILP neural networks is used to represent modalities such as necessity and possibility. In CTLK, a two-dimensional C-ILP ensemble is used to represent the evolution of knowledge through time. In both cases, each C-ILP network can be seen as representing a (learnable) possible world that contains information about the knowledge held by an agent in a distributed system.
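The recurrent computation just described can be sketched symbolically. This is a hedged illustration of the fixed-point iteration (not the authors' implementation); the Boolean rule encoding and names below are ours:

```python
# A minimal sketch of the C-ILP idea: a logic program becomes a
# single-hidden-layer network (hidden = AND of a rule body, output = OR
# over rules), and recurrent iteration computes the program's fixed point.
# Rules are (positive_body, negative_body, head); the encoding is ours.

rules = [
    ({"B", "C"}, {"D"}, "A"),   # r1: B AND C AND NOT D -> A
    ({"E", "F"}, set(), "A"),   # r2: E AND F -> A
    (set(), set(), "B"),        # r3: B (a fact)
]
atoms = sorted({a for pos, neg, h in rules for a in pos | neg | {h}})

def step(state):
    """One input->hidden->output pass: hidden = AND of body, output = OR."""
    hidden = [all(state[a] for a in pos) and not any(state[a] for a in neg)
              for pos, neg, _ in rules]
    return {a: any(h and head == a for h, (_, _, head) in zip(hidden, rules))
            for a in atoms}

# Recurrent iteration from the empty interpretation to a stable state.
state = {a: False for a in atoms}
while (nxt := step(state)) != state:
    state = nxt
# Stable model: only B is true, matching the stable state quoted above.
```

Running this reproduces the stable state A = false, B = true, C = D = E = F = false given in the text.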
2.1 Connectionist Modal Logic. Modal logic began with the analysis of concepts such as necessity and possibility under a philosophical logic perspective (Chagrov & Zakharyaschev, 1997). Modal logic was found to be appropriate to study mathematical necessity (in the logic of provability), time, knowledge, and other modalities (Fagin et al., 1995; Chagrov & Zakharyaschev, 1997). A main feature of modal logics is the use of Kripke’s possible world semantics, a fundamental abstraction in the semantics of modal logics (Broda, Gabbay, Lamb, & Russo, 2004; Chagrov & Zakharyaschev, 1997; Fagin et al., 1995; Gabbay, 1996). In such a semantics, a proposition is necessarily true in a world if it is true in all worlds possible in relation to that world, whereas it is possibly true in a world if it is true in at least one world that is possible in relation to that same world. This is expressed in the semantics formalization by the use of a binary accessibility relation between possible worlds. The language of basic propositional modal logic extends the language of propositional logic with the necessity (□) and possibility (♦) operators. The □ modality is also known as a universal operator, since it expresses the idea of a proposition being “universally” true, or true under all interpretations. The ♦ modality is seen as an existential operator, in the sense that it “quantifies” a formula that is possibly true, that is, true in some interpretation. We start with a simple example to illustrate intuitively how an ensemble of neural networks is used for modeling nonclassical reasoning with propositional modal logic. We will see that C-ILP ensembles are appropriate to represent the □ and ♦ modalities in a connectionist setting. In order to reason with modal operators (temporal, epistemic, spatial, or multimodal), a variety of proof systems for modal logics have been developed over the years (Broda et al., 2004; Gabbay, 1996).
Further, logic programming approaches to deal with intensional (Orgun & Wadge, 1992) and modal operators (Orgun & Ma, 1994; Farinas del Cerro & Herzig, 1995) have also been extensively developed, leading to implementations of symbolic modal reasoning. In some of these reasoning mechanisms, formulas are labeled by the worlds (or states) in which they hold, facilitating the modal reasoning process, an approach we adopt here. Consider Figure 2. It shows an ensemble of three C-ILP neural networks labeled ω1, ω2, ω3, which might communicate in different ways. We look at ω1, ω2, and ω3 as possible worlds. Input and output neurons may now represent □L, ♦L, or L, where L is a literal (an atom or a negation of an atom), □ denotes modal necessity (□L is also read “box L,” and it means that L is necessarily true), and ♦ denotes modal possibility (♦L is also read “diamond L,” and it means that L is possibly true). □A will be true in a world ωi if A is true in all worlds ωj to which ωi is related. Similarly, ♦A will be true in a world ωi if A is true in some world ωj to which ωi is related. As a result, if neuron □A is activated at a world (network) ω1, denoted by ω1 : □A, and ω1 is related to worlds (networks) ω2 and ω3, then neuron A must be activated in ω2 and ω3. Similarly, if neuron ♦A is activated
[Figure 2 here: three C-ILP networks ω1, ω2, and ω3; hidden neurons labeled M, ∧, and ∨ connect neurons for p, q, r, and s across the networks.]
Figure 2: The ensemble of networks that represents modal program P.
in ω1, then a neuron A must be activated in an arbitrary network related to ω1.2 It is also possible to make use of connectionist modal logic (CML) ensembles to compute that □A holds at a possible world, say ωi, whenever A holds at all possible worlds related to ωi, by connecting the output neurons of the related networks to a hidden neuron in ωi, which connects to an output neuron labeled □A. Dually for ♦A, whenever A holds at some possible world related to ωi, we connect the output neuron representing A to a hidden neuron in ωi that connects to an output neuron labeled ♦A.3
2 In practice, the easiest way to implement ♦A will be to create a network N in which A is set up, as described later in our translation algorithm. 3 These rules are actually instances of the rules known as □-Elimination, □-Introduction, ♦-Elimination, and ♦-Introduction (Broda et al., 2004). In a connectionist setting, □-Elimination deals with neurons labeled by formulas of the type ωi : □α. Intuitively, the modal necessity is “eliminated” in the sense that the rule allows the inference
Due to the simplicity of each network in the CML ensemble, when it comes to learning, we can use backpropagation on each network to learn the local knowledge in each possible world. The way we should connect the different networks in the ensemble is then given by the meaning of □ and ♦, which is not supposed to change, and the accessibility relation. Let us give an example of the use of CML.

Example 2 (CML). Let P = {ω1 : r → □q, ω1 : ♦s → r, ω2 : s, ω3 : q → ♦p, R(ω1, ω2), R(ω1, ω3)}. We start by applying the C-ILP translation algorithm (given later), which creates three neural networks to represent the worlds ω1, ω2, and ω3 (see Figure 2). Then we apply the modalities algorithm (also given later) in order to connect the networks. Hidden neurons labeled by {M, ∨, ∧} are created using the modalities algorithm. The remaining neurons are all created using the translation algorithm. For clarity, unconnected input and output neurons are not shown in Figure 2. Taking ω1, □q is connected to q in both ω2 and ω3 so that whenever □q is active, q is also active. Taking ω2, since s is a fact in ω2, and ω1 is related to ω2, neuron s must be connected to a neuron ♦s in ω1 such that, whenever s is active, ♦s also is. Dually, ♦s is connected to s, and for □q, neurons q in ω2 and ω3 are connected to neuron □q in ω1, so that if q is active in both ω2 and ω3, then □q is active in ω1. The modalities algorithm will perform the translation of modal logic programs into neural networks; it reflects the underlying meaning of the □ and ♦ modalities as formally interpreted in Kripke models, according to the definitions below.

Definition 1 (Kripke models). A Kripke model for a modal language L is a tuple M = ⟨Ω, R, v⟩, where Ω is a set of possible worlds, v is a mapping that assigns to each propositional letter of L a subset of Ω, and R is a binary relation over Ω.
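The world-to-world propagation of example 2 can be sketched as a fixed-point computation. The code below is ours, not the paper's; the string tags box:/dia: encoding □ and ♦ are an assumed representation, and only the two propagation rules the example needs (♦-introduction and □-elimination, with ♦ lifted from plain atoms only) are included:

```python
# Hedged sketch (ours): iterate a modal immediate consequence operator
# over the worlds of example 2 until a fixed point is reached.
# "box:q" stands for the modal atom BOX q, "dia:s" for DIAMOND s.

R = {("w1", "w2"), ("w1", "w3")}                  # accessibility relation
rules = {                                          # per-world clauses: body -> head
    "w1": [({"dia:s"}, "r"), ({"r"}, "box:q")],
    "w2": [(set(), "s")],
    "w3": [({"q"}, "dia:p")],
}
worlds = rules.keys()
I = {w: set() for w in worlds}                     # empty interpretation

changed = True
while changed:
    changed = False
    new = {w: set(I[w]) for w in worlds}
    for w in worlds:
        for body, head in rules[w]:                # local clauses
            if body <= I[w]:
                new[w].add(head)
    for (wi, wj) in R:
        for a in I[wj]:                            # diamond-introduction
            if ":" not in a:                       # (plain atoms only, a
                new[wi].add("dia:" + a)            #  simplification of ours)
        for a in I[wi]:                            # box-elimination
            if a.startswith("box:"):
                new[wj].add(a[4:])
    if new != I:
        I, changed = new, True
```

At the fixed point, ω1 derives ♦s, r, and □q; q is propagated to ω2 and ω3; and ω3 derives ♦p, matching the discussion of Figure 2.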
We say that a modal formula α is true at a possible world ω of a model M, written (M, ω) |= α, if one of the following satisfiability conditions holds.

Definition 2 (satisfiability of modal formulas). Let L be a modal language, and let M = ⟨Ω, R, v⟩ be a Kripke model. The satisfiability relation |= is uniquely defined as follows:
i. (M, ω) |= p iff ω ∈ v(p) for a propositional letter p
ii. (M, ω) |= ¬α iff (M, ω) ⊭ α

of α from □α at all worlds, say ωj, related to ωi via the accessibility relation. In the case of ♦-Introduction, in a connectionist setting, the ♦ is “introduced,” in the sense that from a formula ωj : α and a relation R(ωi, ωj), one can infer a formula ♦α at the world ωi. The cases of □-Introduction and ♦-Elimination are analogous.
iii. (M, ω) |= α ∧ β iff (M, ω) |= α and (M, ω) |= β
iv. (M, ω) |= α ∨ β iff (M, ω) |= α or (M, ω) |= β
v. (M, ω) |= α → β iff (M, ω) ⊭ α or (M, ω) |= β
vi. (M, ω) |= □α iff for all ω1 ∈ Ω, if R(ω, ω1) then (M, ω1) |= α
vii. (M, ω) |= ♦α iff there is a ω1 such that R(ω, ω1) and (M, ω1) |= α.

In order to reason with CML, we have to define an algorithm that translates a class of extended modal programs into ensembles of neural networks. Formal definitions of extended modal programs are as follows.

Definition 3 (modal literal). A modal atom is of the form MA where M ∈ {□, ♦} and A is an atom. A modal literal is of the form ML where L is a literal.

Definition 4 (modal logic program). A modal program is a finite set of clauses of the form a1, . . . , an → an+1 where ai (1 ≤ i ≤ n) is either an atom or a modal atom, and an+1 is an atom.

We define extended modal programs as modal programs extended with modalities □ and ♦ in the head of clauses, and negation in the body of clauses. In addition, each clause is labeled by the possible world in which it holds, as in Gabbay’s labeled deductive systems (Broda et al., 2004; Gabbay, 1996).

Definition 5 (extended modal logic program). An extended modal program is a finite set of clauses C of the form ωi : l1, . . . , ln → ln+1, where li (1 ≤ i ≤ n) is either a literal or a modal literal and ln+1 is either an atom or a modal atom, and a finite set of relations R(ωi, ωj) between worlds ωi and ωj in C, where ωi is a label representing a world in which the associated clause holds. For example, P = {ω1 : r → □q, ω1 : ♦s → r, ω2 : s, ω3 : q → ♦p, R(ω1, ω2), R(ω1, ω3)} is an extended modal program.

Definition 6 (modal immediate consequence operator MTP). Let P = {P1, . . . , Pk} be an extended modal program, where each Pi is the set of modal clauses that hold in a world ωi (1 ≤ i ≤ k). Let BP denote the set of atoms occurring in P, and let I be an interpretation for P. Let a be either an atom or a modal atom.
The mapping MTP : 2^BP → 2^BP in ωi is defined as follows: MTP(I) = {a ∈ BP | either i or ii or iii or iv or v below holds}.
i. l1, . . . , ln → a is a clause in P and {l1, . . . , ln} ⊆ I.
ii. a is of the form ωi : A, ωi is a particular world associated with A, and there is a world ωk such that R(ωi, ωk), and ωk : l1, . . . , lm → ♦A is a clause in P with {l1, . . . , lm} ⊆ I.
iii. a is of the form ωi : ♦A and there exists a world ωj such that R(ωi, ωj), and ωj : l1, . . . , lm → A is a clause in P with {l1, . . . , lm} ⊆ I.
iv. a is of the form ωi : □A and for each world ωj such that R(ωi, ωj), ωj : l1, . . . , lo → A is a clause in P with {l1, . . . , lo} ⊆ I.
v. a is of the form ωi : A and there exists a world ωk such that R(ωk, ωi), and ωk : l1, . . . , lo → □A is a clause in P with {l1, . . . , lo} ⊆ I.

3 Connectionist Temporal Logic of Knowledge

Temporal logic and its combination with other modalities such as knowledge and belief operators have long been the subject of intensive investigation (Fagin et al., 1995; Gabbay, Kurucz, Wolter, & Zakharyaschev, 2003; Hintikka, 1962). Temporal logic has evolved from philosophical logic to become one of the main logical systems used in computer science and artificial intelligence (Pnueli, 1977; Fagin et al., 1995; Gabbay et al., 1994). It has been shown to be a powerful formalism for the modeling, analysis, verification, and specification of distributed systems (Fagin et al., 1995; Halpern, van der Meyden, & Vardi, 2004). Further, in logic programming, several approaches to deal with temporal and modal reasoning have been developed, leading to applications in databases, knowledge representation, and the specification of systems (see, e.g., Farinas del Cerro & Herzig, 1995; Orgun & Ma, 1994; Orgun & Wadge, 1994). In this section, we show how temporal logic programs can be expressed in a connectionist setting in conjunction with a knowledge operator. We do so by extending CML into a connectionist temporal logic of knowledge (CTLK), which allows the specification of knowledge evolution through time in network ensembles.
In what follows, we present a temporal algorithm, which translates temporal logic programs into artificial neural networks, and a theorem showing that the temporal theory and the network ensemble are equivalent, and therefore that the translation is correct. Let us start by presenting a simple example.

Example 3 (next time operator). One of the typical axioms of temporal logics of knowledge is Ki○α → ○Ki α (Halpern et al., 2004), where ○ denotes the next time temporal operator. This means that what an agent i knows today (Ki○α) about tomorrow (α), she still knows tomorrow (○Ki α). In other words, this axiom states that an agent does not forget what she knew. This can be represented in an ensemble of C-ILP networks with the use of a network that represents the agent’s knowledge today, a network that represents the agent’s knowledge tomorrow, and the appropriate connections between networks. Clearly, an output neuron K○α of a network that represents agent i at time t needs to be connected to an output neuron Kα of a network that represents agent i at time t + 1 in such a way that, whenever K○α is activated, Kα is also activated. This is illustrated in Figure 3,
[Figure 3 here: output neuron K○α of the network for agent i at time t is connected, via a hidden neuron (black circle), to output neuron Kα of the network for agent i at time t + 1.]
Figure 3: Simple example of connectionist temporal reasoning.
where the black circle denotes a neuron that is always activated, and the activation value of output neuron K○α is propagated to output neuron Kα. Weights must be such that Kα is also activated. Generally, the idea behind a connectionist temporal logic is to have (instead of a single ensemble) a number n of ensembles, each representing the knowledge held by a number of agents at a given time point t. Figure 4 illustrates how this dynamic feature can be combined with the symbolic features of the knowledge represented in each network, allowing not only the analysis of the current state (possible world or time point) but also the analysis of how knowledge changes through time.

3.1 The Language of CTLK. In order to reason over time and represent knowledge evolution, we combine temporal logic programming (Orgun & Ma, 1994) and the knowledge operator Ki into a connectionist temporal logic of knowledge (CTLK). The implementation of Ki is analogous to that of □; we treat Ki as a universal modality as done in Fagin et al. (1995). This will become clearer when we apply a temporal operator and Ki to the muddy children puzzle in section 4.

Definition 7 (connectionist temporal logic). The language of CTLK contains:
1. A set {p, q, r, . . .} of primitive propositions
2. A set of agents A = {1, . . . , n}
[Figure 4 here: a two-dimensional ensemble in which networks for agents 1, 2, and 3 are replicated at time points t1, t2, and t3.]
Figure 4: Evolving knowledge through time.
3. A set of connectives Ki (i ∈ A), where Ki p reads agent i knows p
4. The temporal operator ○ (next time)
5. A set of extended modal logic clauses of the form t : ML1, . . . , MLn → MLn+1, where t is a label representing a discrete time point in which the associated clause holds, M ∈ {□, ♦}, and Lj (1 ≤ j ≤ n + 1) is a literal.

We consider the case of a linear flow of time. As a result, the semantics of CTLK requires that we build models in which possible states form a linear temporal relationship. Moreover, to each time point, we associate the set of formulas holding at that point by a valuation map. The definitions are as follows.
Definition 8 (time line). A time line T is a sequence of ordered points, each one corresponding to a natural number.

Definition 9 (temporal model). A model M is a tuple M = (T, R1, . . . , Rn, π), where (i) T is a (linear) time line; (ii) Ri (i ∈ A) is an agent accessibility relation over T; and (iii) π is a map associating with each propositional variable p of CTLK a set π(p) of time points in T.

The truth conditions for CTLK’s well-formed formulas are then defined by the following satisfiability relation:

Definition 10 (satisfiability of temporal formulas). Let M = ⟨T, R1, . . . , Rn, π⟩ be a temporal model for CTLK. The satisfiability relation |= is uniquely defined as follows:
i. (M, t) |= p iff t ∈ π(p)
ii. (M, t) |= ¬α iff (M, t) ⊭ α
iii. (M, t) |= α ∧ β iff (M, t) |= α and (M, t) |= β
iv. (M, t) |= α ∨ β iff (M, t) |= α or (M, t) |= β
v. (M, t) |= α → β iff (M, t) ⊭ α or (M, t) |= β
vi. (M, t) |= □α iff for all u ∈ T, if R(t, u) then (M, u) |= α
vii. (M, t) |= ♦α iff there exists a u such that R(t, u) and (M, u) |= α
viii. (M, t) |= ○α iff (M, t + 1) |= α
ix. (M, t) |= Ki α iff for all u ∈ T, if Ri(t, u) then (M, u) |= α

Since every clause is labeled by a time point t ranging from 1 to n, if ○A holds at time point n, our time line will have n + 1 time points; otherwise, it will contain n time points. Results provided by Brzoska (1991), Orgun and Wadge (1992, 1994), and Farinas del Cerro and Herzig (1995) for temporal extensions of logic programming apply directly to CTLK. The following definitions will be needed to express the computation of CTLK in neural networks.

Definition 11 (temporal clause). A clause of the form t : L1, . . . , Lo → Lo+1 is called a CTLK temporal clause, which holds at time point t, where Lj (1 ≤ j ≤ o + 1) is either a literal, a modal literal, or of the form Ki Lj (i ∈ A).

Definition 12 (temporal immediate consequence operator TP). Let P = {P1, . . .
, Pk} be a CTLK temporal logic program (i.e., a finite set of CTLK temporal clauses). The mapping TPi : 2^BP → 2^BP at time point ti (1 ≤ i ≤ k) is defined as follows: TPi(I) = {L ∈ BP | either i or ii or iii below holds}. (i) there exists a clause ti−1 : L1, . . . , Lm → ○L in Pi−1 and {L1, . . . , Lm} is satisfied
by an interpretation J for Pi−1;4 (ii) L is qualified by ○, there exists a clause ti+1 : L1, . . . , Lm → L in Pi+1, and {L1, . . . , Lm} is satisfied by an interpretation J for Pi+1; (iii) L ∈ MTPi(I). A global temporal immediate consequence operator can be defined as TP(I1, . . . , Ik) = ⋃_{j=1}^{k} {TPj(Ij)}.

3.2 The CTLK Algorithm. In this section, we present an algorithm to translate temporal logic programs into (two-dimensional) neural network ensembles. We consider temporal clauses and make use of CML’s modalities algorithm and of C-ILP’s translation algorithm, both reproduced below. The temporal algorithm is concerned with how to represent the next time connective ○ and the knowledge operator K, which may appear in clauses of the form ti : Ka L1, . . . , Kb Lo → ○Kc Lo+1, where a, b, c, . . . are agents and 1 ≤ i ≤ n. In such clauses, we extend a normal clause of the form L1, . . . , Lo → Lo+1 to allow the quantification of each literal with a knowledge operator indexed by different agents {a, b, c, . . .} varying from 1 to m. We also label the clause with a time point ti in our timescale varying from 1 to n, and we allow the use of the next time operator on the left-hand side of the knowledge operator.5 For example, the clause t1 : Kj α, Kk β → ○Kj γ states that if agent j knows α and agent k knows β at time t1, then agent j knows γ at time t2. The CTLK algorithm is presented below, where Nk,t will denote a C-ILP neural network for agent k at time t. Let q denote the number of clauses occurring in P. Let ol denote the number of literals in the body of clause l. Let µl denote the number of clauses in P with the same consequent, for each clause l. Let h(x) = 2/(1 + e^{−βx}) − 1, where β ∈ (0, 1). Let Amin be the minimum activation for a neuron to be considered “active” (or true), Amin ∈ (0, 1). Set Amin > (MAXP(o1, . . . , oq, µ1, . . . , µq) − 1)/(MAXP(o1, . . . , oq, µ1, . . . , µq) + 1).
Let W (respectively, −W) be the weight of connections associated with positive (respectively, negative) literals. Set W ≥ (2/β) · (ln(1 + Amin) − ln(1 − Amin))/(MAXP(o1, . . . , oq, µ1, . . . , µq) · (Amin − 1) + Amin + 1).⁶
⁴ Notice that this definition implements a credulous approach in which every agent is assumed to be truthful, and therefore every agent believes not only what he knows about tomorrow but also what he is informed about tomorrow by other agents. A more skeptical approach could be implemented by restricting the derivation of ○A to interpretations in Pi only.
⁵ Notice that, according to definition 10, if ○A is true at time t and t is the last time point n, the CTLK algorithm will create n + 1 points, as described here.
⁶ These values for Amin and W are obtained from the proof of the correctness of the C-ILP translation algorithm.
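The bounds on Amin and W above can be computed mechanically. The sketch below is an illustration under the stated inequalities, not the authors' code; the helper name `cilp_parameters` and the safety margin of 0.01 above the Amin bound are our own choices:

```python
import math

def h(x, beta=0.5):
    """Bipolar semilinear activation h(x) = 2 / (1 + e^(-beta*x)) - 1."""
    return 2.0 / (1.0 + math.exp(-beta * x)) - 1.0

def cilp_parameters(body_sizes, mu_counts, beta=0.5, margin=0.01):
    """Pick A_min and W satisfying the C-ILP bounds quoted above.

    body_sizes -- o_1..o_q, number of literals in each clause body
    mu_counts  -- mu_1..mu_q, number of clauses sharing each consequent
    """
    max_p = max(list(body_sizes) + list(mu_counts))
    # A_min > (MAX_P - 1) / (MAX_P + 1); nudge just above the bound
    a_min = (max_p - 1) / (max_p + 1) + margin
    # W >= (2/beta)*(ln(1+A_min) - ln(1-A_min)) / (MAX_P*(A_min-1) + A_min + 1)
    w = (2.0 / beta) * (math.log(1 + a_min) - math.log(1 - a_min)) \
        / (max_p * (a_min - 1) + a_min + 1)
    return a_min, w
```

For instance, a program with clause bodies of sizes (3, 2, 2, 1) and at most four clauses per consequent gives MAXP = 4, so Amin must exceed 3/5 and the denominator of the W bound stays positive.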
A Connectionist Model for Temporal Reasoning
Temporal Algorithm

For each time point t (1 ≤ t ≤ n) in P, for each agent k (1 ≤ k ≤ m) in P, do:
1. For each clause l in P containing ○K_k L_i in the body:
a. Create an output neuron ○K_k L_i in N_k,t (if it does not exist yet);
b. Create an output neuron K_k L_i in N_k,t+1 (if it does not exist yet);
c. Define the thresholds of ○K_k L_i and K_k L_i as θ = (1 + Amin) · (1 − µl) · W/2;
d. Set h(x) as the activation function of output neurons ○K_k L_i and K_k L_i;
e. Add a hidden neuron L^∨ to N_k,t, and set the step function as the activation function of L^∨;
f. Connect K_k L_i in N_k,t+1 to L^∨ and set the connection weight to 1;
g. Set the threshold θ^∨ of L^∨ such that −mAmin < θ^∨ < Amin − (m − 1);⁷
h. Connect L^∨ to ○K_k L_i in N_k,t and set the connection weight to W^M such that W^M > h^−1(Amin) + µl W + θ.
2. For each clause in P containing ○K_k L_i in the head:
a. Create an output neuron ○K_k L_i in N_k,t (if it does not exist yet);
b. Create an output neuron K_k L_i in N_k,t+1 (if it does not exist yet);
c. Define the thresholds of ○K_k L_i and K_k L_i as θ = (1 + Amin) · (1 − µl) · W/2;
d. Set h(x) as the activation function of ○K_k L_i and K_k L_i;
e. Add a hidden neuron L^○ to N_k,t+1, and set the step function as the activation function of L^○;
f. Connect ○K_k L_i in N_k,t to L^○ and set the connection weight to 1;
g. Set the threshold θ^○ of L^○ such that −1 < θ^○ < Amin;⁸
h. Connect L^○ to K_k L_i in N_k,t+1 and set the connection weight to W^M such that W^M > h^−1(Amin) + µl W + θ.
3. Call the modalities algorithm.

Modalities Algorithm
1. Rename each literal ML_j in P by a new literal not occurring in P of the form L^□_j if M = □, or L^♦_j if M = ♦;
2. Call the C-ILP translation algorithm;
3. For each output neuron L^♦_j in N_k,t, do:
a. Add a hidden neuron L^M_j to an arbitrary network N′;
⁷ A maximum number of m agents could be making use of L^∨.
⁸ A maximum number of 1 agent will be making use of L^○.
b. Set the step function as the activation function of L^M_j;
c. Connect L^♦_j in N_k,t to L^M_j and set the connection weight to 1;
d. Set the threshold θ^M of L^M_j such that −1 < θ^M < Amin;
e. Create an output neuron L_j in N′, if it does not exist yet;
f. Connect L^M_j to L_j in N′ and set the connection weight to W^M.
4. For each output neuron L^□_j in N_k,t, do:
a. Add a hidden neuron L^M_j to each N_k,t+1 such that Rk(t, t + 1);
b. Set the step function as the activation function of L^M_j;
c. Connect L^□_j in N_k,t to L^M_j and set the connection weight to 1;
d. Set the threshold θ^M of L^M_j such that −1 < θ^M < Amin;
e. Create output neurons L_j in N_k,t+1, if they do not exist yet;
f. Connect L^M_j to L_j in N_k,t+1 and set the connection weight to W^M.
5. For each output neuron L_j in N_k,t, do:
a. Add a hidden neuron L^∨_j to N_k,t−1 such that Rk(t − 1, t);
b. Set the step function as the activation function of L^∨_j;
c. For each output neuron L^♦_j in N_k,t−1, do:
i. Connect L_j in N_k,t to L^∨_j and set the connection weight to 1;
ii. Set the threshold θ^∨ of L^∨_j such that −mAmin < θ^∨ < Amin − (m − 1);
iii. Create an output neuron L^♦_j in N_k,t−1 if it does not exist yet;
iv. Connect L^∨_j to L^♦_j in N_k,t−1 and set the connection weight to W^M.
d. Add a hidden neuron L^∧_j to N_k,t−1 such that Rk(t − 1, t);
e. Set the step function as the activation function of L^∧_j;
f. For each output neuron L^□_j in N_k,t−1, do:
i. Connect L_j in N_k,t to L^∧_j and set the connection weight to 1;
ii. Set the threshold θ^∧ of L^∧_j such that m − (1 + Amin) < θ^∧ < mAmin;
iii. Create an output neuron L^□_j in N_k,t−1 if it does not exist yet;
iv. Connect L^∧_j to L^□_j in N_k,t−1 and set the connection weight to W^M.

C-ILP Translation Algorithm
1. For each clause l of P of the form L1, . . . , Lo → Lo+1, do:⁹
a. Create input neurons L1, . . . , Lo and output neuron Lo+1 in N (if they do not exist yet);
b. Add a neuron Nl to the hidden layer of N;
⁹ Here Li can be of the form Kj Li or ¬Kj Li.
c. Connect each neuron Li (1 ≤ i ≤ o) in the input layer to the neuron Nl in the hidden layer. If Li is a positive literal, then set the connection weight to W; otherwise, set the connection weight to −W;
d. Connect Nl in the hidden layer to neuron Lo+1 in the output layer, and set the connection weight to W;
e. Set h(x) as the activation function of Nl and Lo+1;
f. Define the threshold of Nl in the hidden layer as ((1 + Amin) · (ol − 1)/2) · W;
g. Define the threshold of neuron Lo+1 in the output layer as ((1 + Amin) · (1 − µl)/2) · W.

Theorem 3 below shows that the network ensemble N obtained from the temporal algorithm is equivalent to the original CTLK program P in the sense that N computes the temporal immediate consequence operator TP of P (see definition 12). The theorem makes use of theorems 1 and 2, which follow:

Theorem 1 (correctness of C-ILP; d'Avila Garcez, Broda, et al., 2002). For each definite logic program P, there exists a feedforward neural network N with exactly one hidden layer and semilinear neurons such that N computes the fixed-point operator TP of P.

Theorem 2 (correctness of CML; d'Avila Garcez, Lamb, et al., 2002). For any extended modal program P, there exists an ensemble of feedforward neural networks N with a single hidden layer and semilinear neurons such that N computes the modal fixed-point operator MTP of P.

Theorem 3 (correctness of CTLK). For any CTLK program P, there exists an ensemble of single hidden-layer neural networks N such that N computes the temporal fixed-point operator TP of P.

Proof. We need to show that K_k L_i is active in N_t+1 if and only if either (1) there exists a clause of P of the form ML1, . . . , MLo → ○K_k L_i such that ML1, . . . , MLo are satisfied by an interpretation (input vector), or (2) ○K_k L_i is active in N_t. Case 1 follows from theorem 1.
The proof of case 2 follows from theorem 2, as the algorithm for ○ is a special case of the algorithm for ♦ in which a more careful selection of world (i.e., t + 1) is made when applying the ♦-elimination rule.

4 Case Study: The Muddy Children Puzzle

In this section, we apply the CTLK system to the muddy children puzzle, a classic example of reasoning in multi-agent environments. We also compare the CTLK solution with a previous (connectionist modal logic-based)
solution, which uses snapshots in time instead of a time flow. We start by stating the puzzle as described in Fagin et al. (1995). A number n of (truthful and intelligent) children are playing in a garden. A certain number of children k (k ≤ n) have mud on their faces. Each child can see if the others are muddy but not himself or herself. Now, consider the following situation: a caretaker announces that at least one child is muddy (k ≥ 1) and asks, Does any of you know if you have mud on your own face?¹⁰ To help understand the puzzle, let us consider the cases in which k = 1, k = 2, and k = 3. If k = 1 (only one child is muddy), the muddy child answers yes at the first instance, since she cannot see any other muddy child. All the other children answer no at the first instance. If k = 2, suppose children 1 and 2 are muddy. In the first instance, all children can only answer no. This allows child 1 to reason as follows: if 2 had said yes the first time, she would have been the only muddy child. Since 2 said no, she must be seeing someone else muddy; and since I cannot see anyone else muddy apart from 2, I myself must be muddy! Child 2 can reason analogously and also answers yes the second time. If k = 3, suppose children 1, 2, and 3 are muddy. Every child can only answer no the first two times. Again, this allows child 1 to reason as follows: if 2 or 3 had said yes the second time, they would have been the only two muddy children. Thus, there must be a third person with mud. Since I can see only 2 and 3 with mud, this third person must be me! Children 2 and 3 can reason analogously to conclude as well that yes, they are muddy. The above cases clearly illustrate the need to distinguish between an agent's individual knowledge and common knowledge about the world in a particular situation. For example, when k = 2, after everybody says no in the first round, it becomes common knowledge that at least two children are muddy.
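The round-by-round argument above can be checked with a small simulation. This is a plain-Python sketch of the reasoning only, not the connectionist implementation; the function name and encoding are ours. A muddy child answers yes at round r exactly when the common knowledge "at least r children are muddy" and the r − 1 muddy faces she sees leave her own face as the only explanation:

```python
def muddy_children_rounds(muddy):
    """Round at which each muddy child first answers 'yes'.

    muddy -- list of booleans, one per child; assumes k = sum(muddy) >= 1.
    After r-1 rounds of everybody answering 'no', it is common knowledge
    that at least r children are muddy.
    """
    k = sum(muddy)
    answers = {}
    for r in range(1, len(muddy) + 1):
        for i, is_muddy in enumerate(muddy):
            # A muddy child sees k-1 muddy faces; once k-1 == r-1 she
            # knows the extra muddy child must be herself.
            if is_muddy and i not in answers and k - 1 == r - 1:
                answers[i] = r
        if len(answers) == k:
            break
    return answers
```

With three muddy children the simulation reproduces the text: all three answer yes only in round 3.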
Similarly, when k = 3, after everybody says no twice, it becomes common knowledge that at least three children are muddy, and so on. In other words, when it is common knowledge that there are at least k − 1 muddy children, after the announcement that nobody knows if they are muddy or not, it becomes common knowledge that there are at least k muddy children, for if there were only k − 1 muddy children, all of them would know that they had mud on their faces. Notice that this reasoning process can start only once it is common knowledge that at least one child is muddy, as announced by the caretaker.

4.1 Distributed Knowledge Representation. In this section, we formalize the muddy children puzzle using CTLK. For comparison, we start by reproducing the CML formalization presented in (d'Avila Garcez et al., 2004; d'Avila Garcez, Lamb, et al., 2002). Typically, the way to represent the knowledge of a particular agent is to express the idea that an agent knows a fact α if the agent considers that α is true at every world that the agent
¹⁰ Of course, if k > 1, they already know that there are muddy children among them.
sees as possible. In such a formalization, a Kj modality that represents the knowledge of an agent j is used analogously to a □ modality as defined in section 2.1. In addition, we use pi to denote that proposition p is true for agent i. For example, Kj pi means that agent j knows that p is true for agent i. We omit the subscript j of K whenever it is clear from the context. We use pi to say that child i is muddy and qk to say that at least k children are muddy (k ≤ n). Let us consider the case in which three children are playing in the garden (n = 3). Rule r11 below states that when child 1 knows that at least one child is muddy and that neither child 2 nor child 3 is muddy, then child 1 knows that she herself is muddy. Similarly, rule r21 states that if child 1 knows that there are at least two muddy children and she knows that child 2 is not muddy, then she must also be able to know that she herself is muddy, and so on. The rules for children 2 and 3 are interpreted analogously.

Snapshot Rules for Agent (Child) 1
r11 : K1 q1 ∧ K1 ¬p2 ∧ K1 ¬p3 → K1 p1
r21 : K1 q2 ∧ K1 ¬p2 → K1 p1
r31 : K1 q2 ∧ K1 ¬p3 → K1 p1
r41 : K1 q3 → K1 p1

Each set of rules rml (1 ≤ l ≤ n, m ∈ N+) is implemented in a C-ILP network. Figure 5 shows the implementation of rules r11 to r41 (for agent 1). In addition, it contains p1 and Kq1, Kq2, and Kq3, all represented as facts. Note the difference between p1 (child 1 is muddy) and Kp1 (child 1 knows she is muddy). Facts are highlighted in gray in Figure 5. This setting complies with a presentation of the puzzle given in Fagin et al. (1995), in which snapshots of the knowledge evolution along time rounds are taken in order to logically deduce the solution of the problem without the addition of a time variable. In contrast with p1 and Kqk (1 ≤ k ≤ 3), K¬p2 and K¬p3 must be obtained from agents 2 and 3, respectively, whenever agent 1 does not see mud on their foreheads.
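Following the C-ILP translation algorithm of section 3.2, the four snapshot rules can be hand-compiled into a single hidden-layer network. The sketch below is an illustration, not the authors' code: the parameter choices β = 0.5, Amin = 0.7, and W = 16 (which satisfy the section 3.2 bounds for this program, where MAXP = 4) and the atom names (`Knp2` standing for K¬p2, etc.) are ours:

```python
import math

BETA, A_MIN, W = 0.5, 0.7, 16.0   # our choices; they satisfy the section 3.2 bounds

def h(x):
    """Bipolar semilinear activation used by C-ILP."""
    return 2.0 / (1.0 + math.exp(-BETA * x)) - 1.0

# One hidden neuron per rule r11..r41; every body literal here is positive.
RULES = [["Kq1", "Knp2", "Knp3"],   # r11: K1q1 & K1~p2 & K1~p3 -> K1p1
         ["Kq2", "Knp2"],           # r21
         ["Kq2", "Knp3"],           # r31
         ["Kq3"]]                   # r41
MU = len(RULES)                     # all four rules share the consequent K1p1

def k1p1(inputs):
    """Activation of output neuron K1p1 for bipolar inputs (+1 true, -1 false)."""
    hidden = [h(W * sum(inputs[a] for a in body)
                - (1 + A_MIN) * (len(body) - 1) / 2 * W)   # hidden threshold
              for body in RULES]
    theta_out = (1 + A_MIN) * (1 - MU) / 2 * W             # output threshold
    return h(W * sum(hidden) - theta_out)
```

When K1q1, K1¬p2, and K1¬p3 are all true, rule r11 fires and K1p1 exceeds Amin; with every input false, K1p1 stays below −Amin.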
Figure 6 illustrates the interaction between three agents in the muddy children puzzle. The arrows connecting C-ILP networks implement the fact that when a child is muddy, the other children can see that. For clarity, only the rules rm1 , corresponding to neuron K1 p1 , are shown in Figure 5. Analogously, the rules for K2 p2 and K3 p3 would be represented in similar C-ILP networks. This is indicated in Figure 6 by neurons highlighted in black. In addition, Figure 6 shows only positive information about the problem. Recall that negative information such as ¬ p1 , K¬ p1 , K¬ p2 is to be added explicitly to the network, as shown in Figure 5. This completes the connectionist representation of the snapshot solution to the muddy children puzzle.
Figure 5: The implementation of rules {r11, . . . , r41} for Agent 1.
4.2 Temporal Knowledge Representation. The addition of a temporal variable to the muddy children puzzle allows one to reason about knowledge acquired after each time round. For example, assume as before that there are three muddy children playing in a garden. First, they all answer no when asked if they know whether they are muddy. Moreover, as each muddy child can see the other children, they will reason as previously described, and answer no the second time round, reaching the correct conclusion in time round 3. This solution requires, at each round, that the C-ILP networks be expanded with the knowledge acquired from reasoning about what is seen and what is heard by each agent. This clearly requires each agent to reason about time. The snapshot solution should then be seen as representing the knowledge held by the agents at an arbitrary time t.
Figure 6: Interaction among agents in the muddy children puzzle.
The knowledge held by the agents at time t + 1 would then be represented by another set of C-ILP networks, appropriately connected to the original set of networks. Let us consider again the case in which k = 3. There are alternative ways of modeling this, but one possible representation is as follows:

Temporal Rules for Agent (Child) 1
t1 : ¬K1 p1 ∧ ¬K2 p2 ∧ ¬K3 p3 → ○K1 q2
t2 : ¬K1 p1 ∧ ¬K2 p2 ∧ ¬K3 p3 → ○K1 q3

Temporal Rules for Agent (Child) 2
t1 : ¬K1 p1 ∧ ¬K2 p2 ∧ ¬K3 p3 → ○K2 q2
t2 : ¬K1 p1 ∧ ¬K2 p2 ∧ ¬K3 p3 → ○K2 q3

Temporal Rules for Agent (Child) 3
t1 : ¬K1 p1 ∧ ¬K2 p2 ∧ ¬K3 p3 → ○K3 q2
t2 : ¬K1 p1 ∧ ¬K2 p2 ∧ ¬K3 p3 → ○K3 q3

In addition, the snapshot rules are still necessary here to assist each agent's reasoning at any particular time point. Finally, the interaction between the agents, as depicted in Figure 6, is also necessary to model the fact that each child will know that another child is muddy when they see each other, analogously to the □ modality. This can be represented as ti : p1 → K2 p1, ti : p1 → K3 p1 for time i = 1, 2, 3, and analogously for p2 and p3. Together with ti : ¬p2 → K1 ¬p2, ti : ¬p3 → K1 ¬p3, also for time i = 1, 2, 3, and analogously for K2 and K3, this completes the formalization. The rules above, the temporal rules, and the snapshot rules for Agent (child 1) are described, following the temporal algorithm, in Figure 7, where dotted lines indicate negative weights and solid lines indicate positive weights. The network of Figure 7 provides a complete solution to the muddy children puzzle. It is worth noting that each network remains a simple single hidden-layer neural network that can be trained with the use of standard backpropagation or other off-the-shelf learning algorithms.

4.3 Learning.
The merging of theory (background knowledge) and data learning (learning from examples) in neural networks has been shown to provide a learning system that is more effective than purely symbolic or purely connectionist systems, especially when data are noisy (Towell & Shavlik, 1994). The temporal algorithm introduced here allows one to perform theory and data learning in neural networks when the theory includes temporal knowledge.
Figure 7: An agent's knowledge evolution in time in the muddy children puzzle. (Panels show Agent 1 at t1, t2, and t3.)
In this section, we use the temporal algorithm introduced above and standard backpropagation to compare learning from data only with learning from theory and data with temporal background knowledge. Since we show a relationship between temporal and epistemic logics and artificial neural network ensembles, we should also be able to learn epistemic and temporal knowledge in the ensemble (and, indeed, to perform knowledge extraction of revised temporal and epistemic rules after learning, but this is left as future work). We train two ensembles of C-ILP neural networks to compute a solution to the muddy children puzzle. To one of them we add temporal and epistemic background knowledge in the form of a single rule, t1 : ¬K1 p1 ∧ ¬K2 p2 ∧ ¬K3 p3 → ○K1 q2, by applying the temporal algorithm. To the other, we do not add any rule, and we then compare the average accuracies of the two ensembles. We consider, in particular, the case in which Agent 1 is to decide whether he is muddy at time t2. Each training example expresses the knowledge held by Agent 1 at t2, according to the truth values of the atoms K1 ¬p2, K1 ¬p3, K1 q1, K1 q2, and K1 q3. As a result, we have 32 examples containing all possible combinations of truth values for input neurons K1 ¬p2, K1 ¬p3, K1 q1, K1 q2, and K1 q3, where input value 1 indicates truth value true, while input −1 indicates truth value false. For each example, we are concerned with whether Agent 1 will know that he is muddy, that is, with whether output neuron K1 p1 is active. For example, if all the inputs are false (input vector [−1, −1, −1, −1, −1]), then Agent 1 should not know whether he is muddy (K1 p1 is false). If, however, K1 q2 is true and either K1 ¬p2 or K1 ¬p3 is true, then Agent 1 should be able to recognize that indeed he is muddy (K1 p1 is true). This allows us to create the 32 training examples.
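The 32-example training set can be enumerated directly. In this sketch the atom order, the names (`Knp2` standing for K1¬p2, and so on), and the `target` helper are our own; the label for K1p1 follows our reading of the snapshot rules r11 to r41:

```python
from itertools import product

ATOMS = ["Knp2", "Knp3", "Kq1", "Kq2", "Kq3"]   # K1~p2, K1~p3, K1q1, K1q2, K1q3

def target(ex):
    """Desired value of output K1p1 (1 = true, -1 = false) for one input vector."""
    t = {a: v == 1 for a, v in zip(ATOMS, ex)}
    fired = ((t["Kq1"] and t["Knp2"] and t["Knp3"]) or    # r11
             (t["Kq2"] and (t["Knp2"] or t["Knp3"])) or   # r21, r31
             t["Kq3"])                                     # r41
    return 1 if fired else -1

# All 32 combinations of bipolar truth values for the five input neurons.
EXAMPLES = [(ex, target(ex)) for ex in product([-1, 1], repeat=5)]
```

As stated in the text, the all-false vector yields K1p1 false, while any vector with K1q2 and one of K1¬p2, K1¬p3 true yields K1p1 true.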
From the description of the muddy children puzzle, we know that at t2, K1 q2 should be true (i.e., K1 q2 is a fact). This information can be derived from the temporal rule given as background knowledge above, but not from the training examples. Although the background knowledge can be changed by the training examples, it places a bias on certain combinations (in this case, the examples in which K1 q2 is true), and this may produce better performance, typically when the background knowledge is correct. This effect has been observed, for instance, by Towell and Shavlik (1994), in experiments on DNA sequence analysis in which background knowledge is expressed by production rules. The set of examples is noisy, and background knowledge counteracts the noise and reduces the chances of overfitting. We have evaluated the two C-ILP ensembles using eightfold cross-validation, so that each time, four examples were left for testing. We have used a learning rate η = 0.2, a momentum term α = 0.1, activation function h(x) = tanh(x), and bipolar inputs in {−1, 1}. For each training task, the training set was presented to the network for 10,000 epochs. For both ensembles, the networks reached a training set error smaller than 0.01
before 10,000 epochs had elapsed. In other words, all the networks were trained successfully. As for the networks' test set performance, the results corroborate the importance of exploiting any available background knowledge. For the first ensemble, in which the networks were trained with no background knowledge, an average test set accuracy of 81.25% was obtained. For the second ensemble, to which the temporal rule had been added, an average test set accuracy of 87.5% was obtained—a noticeable difference in performance considering there is a single rule in the background knowledge. In both cases, exactly the same training parameters were used. The experiments above illustrate that the merging of temporal background knowledge and data learning may provide a system that is more effective than a purely connectionist system. The focus of this article has been on the theory of neural-symbolic systems, their expressiveness, and their reasoning capabilities. More extensive experiments to validate the system proposed here would be useful and will be carried out in connection with knowledge extraction, using applications containing continuous attributes.
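The reported training setup can be approximated in a few lines. The sketch below is our own minimal reconstruction, not the authors' C-ILP ensembles: a single 5-4-1 tanh network (the hidden-layer size of 4 is arbitrary, and biases are omitted), trained by online backpropagation with the stated learning rate η = 0.2 and momentum α = 0.1 on the 32 examples, labeled by our reading of the snapshot rules:

```python
import math, random
from itertools import product

random.seed(0)

def target(ex):
    """K1p1 label under the snapshot rules (our reading): 1 = true, -1 = false."""
    knp2, knp3, kq1, kq2, kq3 = (v == 1 for v in ex)
    return 1.0 if (kq1 and knp2 and knp3) or (kq2 and (knp2 or knp3)) or kq3 else -1.0

DATA = [([float(v) for v in ex], target(ex)) for ex in product([-1, 1], repeat=5)]

H = 4                                        # hidden units (arbitrary choice)
w1 = [[random.uniform(-0.5, 0.5) for _ in range(5)] for _ in range(H)]
w2 = [random.uniform(-0.5, 0.5) for _ in range(H)]
v1 = [[0.0] * 5 for _ in range(H)]
v2 = [0.0] * H                               # momentum buffers
eta, alpha = 0.2, 0.1                        # learning rate and momentum, as in the text

def forward(x):
    hid = [math.tanh(sum(w * xi for w, xi in zip(ws, x))) for ws in w1]
    return hid, math.tanh(sum(w * hi for w, hi in zip(w2, hid)))

def mse():
    return sum((forward(x)[1] - t) ** 2 for x, t in DATA) / len(DATA)

err_before = mse()
for _ in range(200):                         # online backpropagation with momentum
    for x, t in DATA:
        hid, out = forward(x)
        d_out = (out - t) * (1 - out ** 2)   # tanh derivative
        d_hid = [d_out * w2[j] * (1 - hid[j] ** 2) for j in range(H)]
        for j in range(H):
            v2[j] = alpha * v2[j] - eta * d_out * hid[j]
            w2[j] += v2[j]
            for i in range(5):
                v1[j][i] = alpha * v1[j][i] - eta * d_hid[j] * x[i]
                w1[j][i] += v1[j][i]
err_after = mse()
```

On this seeded run the mean squared error drops below its initial value; we make no claim about the 81.25% and 87.5% accuracies reported above, which come from cross-validated ensembles with and without background knowledge.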
5 Conclusions and Future Work

Connectionist temporal and modal logics provide neural-symbolic learning systems with the ability to make use of more expressive representation languages. In his seminal paper, Valiant (1984) argues for rich logic-based knowledge representation mechanisms in learning systems. The connectionist model proposed here addresses such a need while complying with important principles of connectionism such as massive parallelism and learning. A very important feature of our system is the temporal dimension, which can be combined with an epistemic dimension for knowledge, belief, desires, and intentions. This article provides the first account of how to integrate such dimensions in a neural-symbolic system. We have illustrated this by providing a full solution to the muddy children puzzle, where agents can reason about knowledge evolution in time.

Although a number of multimodal systems—for example, combining knowledge and time (Gabbay et al., 2003; Halpern et al., 2004) and combining beliefs, desires, and intentions (Rao & Georgeff, 1998)—have been proposed for distributed knowledge representation, little attention has been paid to the integration of a learning component for knowledge acquisition. This work contributes to bridging this gap by allowing the knowledge representation to be integrated in a neural learning system. One could also think of the system presented here as a massively distributed system where each ensemble (or set of ensembles) can be seen as a neural-symbolic processor. This would open several interesting research avenues. For instance, one could investigate how to reason about protocols and actions in this
neural-symbolic distributed system or how to train the processors in order to learn how to preserve the security of such systems. The connectionist temporal and knowledge logic presented here allows the representation of a variety of properties, such as knowledge-based specifications, in the style of Fagin et al. (1995). These specifications are frequently represented using temporal and modal logics, but without a learning feature, which comes naturally in CTLK. As future work, we aim to investigate knowledge extraction of modalities from trained artificial neural network ensembles in which not only discrete data are used, as in the muddy children example, but also continuous attributes are relevant. In addition, since the models of the modal logic S4 can be used to model intuitionistic modal logics, we may also have a system that can combine reasoning about time and learn intuitionistic theories. This is an interesting result, as neural-symbolic systems can be used to "think" constructively, in the sense of Brouwer (Gabbay et al., 2003). In summary, we believe that the connectionist temporal and epistemic computational model presented here opens several interesting research avenues in the domain of neural-symbolic integration, allowing the distributed representation, computation, and learning of expressive knowledge representation formalisms.

Acknowledgments

We thank Gary Cottrell and two anonymous referees for several constructive comments that led to the improvement of this article. A.G. is partly supported by the Nuffield Foundation, UK. L.L. is partly supported by the Brazilian Research Council CNPq and by the CAPES and FAPERGS Foundations.

References

Ajjanagadde, V. (1997). Rule-based reasoning in connectionist networks. Unpublished doctoral dissertation, University of Minnesota.
Broda, K., Gabbay, D. M., Lamb, L. C., & Russo, A. (2004). Compiled labelled deductive systems: A uniform presentation of non-classical logics. Baldock, UK: Research Studies Press/Institute of Physics Publishing.
Browne, A., & Sun, R. (2001). Connectionist inference models. Neural Networks, 14, 1331–1355.
Brzoska, C. (1991). Temporal logic programming and its relation to constraint logic programming. In Proc. International Symposium on Logic Programming (pp. 661–677). Cambridge, MA: MIT Press.
Chagrov, A., & Zakharyaschev, M. (1997). Modal logic. Oxford: Clarendon Press.
Cloete, I., & Zurada, J. M. (Eds.). (2000). Knowledge-based neurocomputing. Cambridge, MA: MIT Press.
d'Avila Garcez, A. S., Broda, K., & Gabbay, D. M. (2001). Symbolic knowledge extraction from trained neural networks: A sound approach. Artificial Intelligence, 125, 155–207.
d'Avila Garcez, A. S., Broda, K., & Gabbay, D. M. (2002). Neural-symbolic learning systems: Foundations and applications. Berlin: Springer-Verlag.
d'Avila Garcez, A. S., & Lamb, L. C. (2004). Reasoning about time and knowledge in neural-symbolic learning systems. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16 (pp. 921–928). Cambridge, MA: MIT Press.
d'Avila Garcez, A. S., Lamb, L. C., Broda, K., & Gabbay, D. M. (2003). Distributed knowledge representation in neural-symbolic learning systems: A case study. In Proceedings of AAAI International FLAIRS Conference (pp. 271–275). Menlo Park, CA: AAAI Press.
d'Avila Garcez, A. S., Lamb, L. C., Broda, K., & Gabbay, D. M. (2004). Applying connectionist modal logics to distributed knowledge representation problems. International Journal on Artificial Intelligence Tools, 13(1), 115–139.
d'Avila Garcez, A. S., Lamb, L. C., & Gabbay, D. M. (2002). A connectionist inductive learning system for modal logic programming. In Proc. ICONIP'02 (pp. 1992–1997). Singapore: IEEE Press.
d'Avila Garcez, A. S., & Zaverucha, G. (1999). The connectionist inductive learning and logic programming system. Applied Intelligence Journal [Special issue], 11(1), 59–77.
Fagin, R., Halpern, J., Moses, Y., & Vardi, M. (1995). Reasoning about knowledge. Cambridge, MA: MIT Press.
Fariñas del Cerro, L., & Herzig, A. (1995). Modal deduction with applications in epistemic and temporal logics. In D. M. Gabbay, C. J. Hogger, & J. A. Robinson (Eds.), Handbook of logic in artificial intelligence and logic programming (Vol. 4, pp. 499–594). New York: Oxford University Press.
Gabbay, D. M. (1996). Labelled deductive systems (Vol. 1). New York: Oxford University Press.
Gabbay, D. M., Hodkinson, I., & Reynolds, M. (1994).
Temporal logic: Mathematical foundations and computational aspects (Vol. 1). New York: Oxford University Press.
Gabbay, D., Kurucz, A., Wolter, F., & Zakharyaschev, M. (2003). Many-dimensional modal logics: Theory and applications. Dordrecht: Elsevier.
Gelfond, M., & Lifschitz, V. (1991). Classical negation in logic programs and disjunctive databases. New Generation Computing, 9, 365–385.
Halpern, J. Y., van der Meyden, R., & Vardi, M. Y. (2004). Complete axiomatizations for reasoning about knowledge and time. SIAM Journal on Computing, 33(3), 674–703.
Hintikka, J. (1962). Knowledge and belief. Ithaca, NY: Cornell University Press.
Orgun, M. A., & Ma, W. (1994). An overview of temporal and modal logic programming. In Proceedings of International Conference on Temporal Logic, ICTL'94 (pp. 445–479). Berlin: Springer.
Orgun, M. A., & Wadge, W. W. (1992). Towards a unified theory of intensional logic programming. Journal of Logic Programming, 13(4), 413–440.
Orgun, M. A., & Wadge, W. W. (1994). Extending temporal logic programming with choice predicates non-determinism. Journal of Logic and Computation, 4(6), 877–903.
Pnueli, A. (1977). The temporal logic of programs. In Proceedings of 18th IEEE Annual Symposium on Foundations of Computer Science (pp. 46–57). Piscataway, NJ: IEEE Computer Society Press.
Rao, A. S., & Georgeff, M. P. (1998). Decision procedures for BDI logics. Journal of Logic and Computation, 8(3), 293–343.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.
Shastri, L. (1999). Advances in SHRUTI: A neurally motivated model of relational knowledge representation and rapid inference using temporal synchrony. Applied Intelligence Journal [Special issue], 11(1), 79–108.
Sun, R. (1995). Robust reasoning: Integrating rule-based and similarity-based reasoning. Artificial Intelligence, 75(2), 241–296.
Sun, R., & Alexandre, F. (1997). Connectionist symbolic integration. Hillsdale, NJ: Erlbaum.
Towell, G. G., & Shavlik, J. W. (1994). Knowledge-based artificial neural networks. Artificial Intelligence, 70(1), 119–165.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142.
Valiant, L. G. (2003). Three problems in computer science. Journal of the ACM, 50(1), 96–99.
Vardi, M. Y. (1997). Why is modal logic so robustly decidable? In N. Immerman & P. Kolaitis (Eds.), Descriptive complexity and finite models (pp. 149–184). Providence, RI: American Mathematical Society.
Received August 23, 2004; accepted October 5, 2005.
ARTICLE
Communicated by Liam Paninski
Multivariate Information Bottleneck Noam Slonim [email protected] Department of Physics and the Lewis–Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, U.S.A.
Nir Friedman [email protected] School of Computer Science and Engineering, Hebrew University, Jerusalem 91904, Israel
Naftali Tishby [email protected] School of Computer Science and Engineering, and Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel
The information bottleneck (IB) method is an unsupervised, model-independent data organization technique. Given a joint distribution, p(X, Y), this method constructs a new variable, T, that extracts partitions, or clusters, over the values of X that are informative about Y. Algorithms that are motivated by the IB method have already been applied to text classification, gene expression, neural code, and spectral analysis. Here, we introduce a general principled framework for multivariate extensions of the IB method. This allows us to consider multiple systems of data partitions that are interrelated. Our approach utilizes Bayesian networks for specifying the systems of clusters and which information terms should be maintained. We show that this construction provides insights about bottleneck variations and enables us to characterize the solutions of these variations. We also present four different algorithmic approaches that allow us to construct solutions in practice and apply them to several real-world problems.

1 Introduction

The volume of available data in a variety of domains has grown rapidly over the past few years. Examples include the consistent growth in the amount of

Preliminary and partial versions of this work appeared in Proc. of the 17th Conf. on Uncertainty in Artificial Intelligence (UAI-17), 2001, and in Proc. of Neural Information Processing Systems (NIPS-14), 2002.

Neural Computation 18, 1739–1789 (2006)
© 2006 Massachusetts Institute of Technology
online text and the dramatic increase in the available genomic information. As a result, there is a crucial need for data analysis methods. A major goal in this context is the development of unsupervised data representation methods to reveal the inherent hidden structure in a given body of data. Such methods include various dimension-reduction, geometric embedding, and statistical modeling approaches. Arguably the most fundamental class of such methods is clustering techniques. At first, the clustering problem seems intuitively clear: similar elements should be assigned to the same cluster and dissimilar ones to different clusters. However, formalizing this notion in a well-defined way is not obvious. Nonetheless, such a formulation is essential in order to obtain an objective interpretation of the results. Indeed, while numerous clustering methods exist, many of them are driven by an algorithm rather than by a clear optimization principle, making their results hard to interpret. Clustering methods can be roughly divided into two categories. The first is based on a choice of some distance or distortion measure among the data points, which presumably reflects some background knowledge about the data. Such a measure implicitly represents the important attributes of the data and is as important as the choice of the data representation itself. For most problems, a proper choice of the distance measure is the main practical difficulty, and it is through this choice that much of the arbitrariness of the results can enter. Another class of methods is based on statistical assumptions about the origin of the data. Such assumptions enable the design of a statistical model, such as a mixture of gaussians, whose parameters are then estimated based on the given data. Roughly speaking, these approaches represent different frameworks for thinking about the data.
For instance, in the statistical approach, one usually thinks of the given data as a sample taken from an underlying distribution, whereas in the distance-based approach, such an assumption is typically not required. Distance-based methods can be further divided into central and pairwise clustering methods. In the latter, one uses the direct distances between the points to partition the data, using various (often graph-theoretical) algorithms. In central clustering, one is provided with a distance function that can be applied to new points, and one generates cluster centroids by minimizing the expected distance to the data points. Central clustering is more closely related to statistical modeling, whereas the results of the pairwise methods are often more difficult to interpret. Quite separate from this hierarchy of clustering techniques, the information bottleneck (IB) method (Tishby, Pereira, & Bialek, 1999; Slonim, 2002) stems from a rather different, information-theoretic, perspective. The basic idea is surprisingly simple. While choosing the distance measure is notoriously difficult, often one can naturally specify another relevant variable that should be predicted by the obtained clusters. This specification determines which relevant underlying structure we desire. An illustrative example is the problem of speech recognition. While it is not obvious what
the proper distance measure is among speech utterances, in many applications one would agree that the corresponding text is the relevant variable here. Moreover, this scheme allows different valid ways to cluster the same data. For example, one could also define the relevant variable as the identity of the speaker rather than the text and obtain an entirely different, yet equally valid, quantization of the signal. A common data type that calls for this type of analysis is co-occurrence data, such as verbs and direct objects in sentences (Pereira, Tishby, & Lee, 1993), words and documents (Baker & McCallum, 1998; Hofmann, 1999; Slonim & Tishby, 2000), tissues and gene expression patterns (Tishby & Slonim, 2000), or galaxies and spectral components (Slonim, Somerville, Tishby, & Lahav, 2001). In most such cases, there is no obvious “correct” measure of similarity between the data items. Thus, one would like to rely purely on the joint statistics of the co-occurrences and organize the data such that the relevant information among the variables is captured in the best possible way. 1.1 The Contributions of This Work. The main contribution of this article is in providing a general formulation for a multivariate extension of the IB method. This extension allows us to consider cases where the clustering is relevant with respect to several variables and multiple systems of clusters are constructed simultaneously. For example, in symmetric, or two-sided, clustering, we want to find two systems of clusters that are informative about each other. A possible application is relating documents to words, where we seek clustering of documents according to word usage and a corresponding clustering of words (Slonim & Tishby, 2000). In parallel clustering we attempt to build several systems of clusters of one variable in order to capture independent aspects of the information it conveys about another, relevant, variable. 
A possible example is the analysis of gene expression data, where multiple independent distinctions about tissues are relevant for the expression of genes. Furthermore, it is possible to think of more complicated scenarios, where there are more than two observed variables. For example, given many input variables that represent a visual stimulus, we might want to recover a smaller set of features that are most informative about an output variable that here represents the firing pattern of a neuron. A most general formulation should consider the compression of different subsets of the observed variables, while maximizing the information about other predefined subsets. The multivariate IB principle, as suggested in this work, provides such a principled general formulation. To address this type of problem within the IB framework, we use the concept of multi-information, a natural extension of the pairwise concept of mutual information. Our approach further utilizes the theory of probabilistic graphical models such as Bayesian networks for specifying the trade-off terms: which variables to compress and which information terms
should be maintained. In particular, we use one Bayesian network, denoted G_in, to specify a set of variables that are compressed versions of the observed variables: each new variable compresses its parents in the network. A second network, G_out, specifies the relations, or dependencies, that should be maintained or predicted: each variable should be predicted by its parents in the network. We formulate the general principle as a trade-off between the multi-information each network carries, where we want to minimize the information maintained by G_in and at the same time maximize the information maintained by G_out. We show that, as with the original IB principle, it is possible to analytically characterize the form of the optimal solutions to the general multivariate trade-off principle. The original IB principle yielded practical algorithms that could be applied to a variety of real-world problems. Here, we show that all these algorithms are naturally extended to the new conceptual framework and illustrate their applicability on several real-world data sets: text processing applications, gene expression data analysis, and protein sequence analysis.

2 The Information Bottleneck Method

In the original IB principle (Tishby et al., 1999), the relevance of one variable, X, with respect to another one, Y, is quantified in terms of the mutual information (Cover & Thomas, 1991),

I(X; Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}.
This functional is symmetric and nonnegative, and it equals zero if and only if the variables are independent. It measures how many bits are needed on average to convey the information X has about Y and vice versa. The IB method is thus based on the availability of two variables, with their (assumed given) joint distribution, p(X, Y), where X is the variable we want to compress and Y is the variable we would like to predict. Using a correspondence between this task and the core problems of Shannon’s (1948) information theory, Tishby et al. (1999) formulated it as a trade-off between two types of information terms. The idea is to seek partitions of X values that preserve as much mutual information as possible about Y while losing as much information as possible about the original representation, X. Thus, among all the distinctions made by X, one tries to maintain only those that are most relevant to predict Y. Finding optimal representations is posed as a construction of an auxiliary variable, T, that represents soft partitions of X values through q (T | X), such that I (T; X) is minimized while I (T; Y) is maximized. Notice that throughout this article, we denote by q (·) the unknown distributions that involve the representation parameters, T, and by p(·) the distributions that are given as input and do not change during the analysis.
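As a minimal numerical sketch of the mutual information defined above (the function name and toy tables are ours, not from the article), I(X; Y) can be computed directly from a joint probability table:

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits for a joint distribution p(x, y) given as a 2-D array."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = pxy > 0                        # convention: 0 log 0 = 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])))

# Independent variables carry zero information; a perfectly coupled pair
# of fair bits carries exactly one bit.
p_indep = np.array([[0.25, 0.25], [0.25, 0.25]])
p_coupled = np.array([[0.5, 0.0], [0.0, 0.5]])
```

As the text notes, the functional is symmetric, so `mutual_information(p)` and `mutual_information(p.T)` agree.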
Since T is a compressed representation of X, its distribution should be completely determined given X alone; that is, q(T | X, Y) = q(T | X), or

q(X, Y, T) = p(X, Y) q(T | X).    (2.1)
An equivalent formulation is to require the following Markovian independence relation, known as the IB Markovian relation:

T ↔ X ↔ Y.    (2.2)
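A small check (toy numbers ours) that constructing q(x, y, t) = p(x, y) q(t | x) as in equation 2.1 indeed enforces the IB Markovian relation of equation 2.2, that is, I(T; Y | X) = 0:

```python
import numpy as np

p_xy = np.array([[0.3, 0.1], [0.1, 0.5]])    # toy p(X, Y)
q_t_x = np.array([[0.7, 0.3], [0.4, 0.6]])   # toy q(T | X), rows sum to 1
q = p_xy[:, :, None] * q_t_x[:, None, :]     # q(x, y, t) = p(x, y) q(t | x)

# Conditional mutual information I(T; Y | X) in bits;
# it vanishes exactly when T <-> X <-> Y holds.
i_ty_given_x = 0.0
for x in range(q.shape[0]):
    px = q[x].sum()
    joint = q[x] / px                        # q(y, t | x)
    py = joint.sum(axis=1, keepdims=True)    # q(y | x)
    pt = joint.sum(axis=0, keepdims=True)    # q(t | x)
    m = joint > 0
    i_ty_given_x += px * float(np.sum(joint[m] * np.log2(joint[m] / (py * pt)[m])))
```

Because q(y, t | x) factors as q(y | x) q(t | x) by construction, the conditional mutual information is zero up to floating-point error.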
Tishby et al. (1999) formulated this optimization problem as the minimization of the following IB functional,

L[q(T | X)] = I(T; X) − β I(T; Y),    (2.3)
where β is a positive Lagrange multiplier and the minimization takes place over all the normalized q(T | X) distributions. The distributions q(T) and q(Y | T) that are further involved in this functional must satisfy the probabilistic consistency conditions,

q(t) = \sum_x p(x) q(t | x)
q(y | t) = \frac{1}{q(t)} \sum_x q(x, y, t) = \frac{1}{q(t)} \sum_x p(x, y) q(t | x),    (2.4)
where the IB Markovian relation is used in the last equality. As shown in Tishby et al. (1999), every stationary point of the IB functional must satisfy

q(t | x) = \frac{q(t)}{Z(x, β)} \exp(−β D_{KL}[p(y | x) \| q(y | t)]),    (2.5)
where D_{KL}[p \| q] = E_p[\log (p/q)] is the familiar Kullback-Leibler (KL) divergence (Cover & Thomas, 1991) and Z(x, β) is a normalization (partition) function. The three equations in equations 2.4 and 2.5 must be satisfied self-consistently at all the stationary points of L. The form of the optimal solution in equation 2.5 is reminiscent of the general form of the rate distortion optimal solution (Cover & Thomas, 1991). However, the effective IB distortion, d_{IB}(X, T) = D_{KL}[p(Y | X) \| q(Y | T)], emerges here from the trade-off principle rather than being chosen arbitrarily. The self-consistent condition in equation 2.5, together with the marginalization constraints in equation 2.4, can be turned into an iterative algorithm, similar to expectation maximization (EM) and the Blahut-Arimoto algorithm in information theory (Cover & Thomas, 1991). In particular, finding {q(T | X), q(T), q(Y | T)} that correspond to a (local) stationary point of L,
involves the following iterations:

q^{(m)}(t) = \sum_x p(x) q^{(m)}(t | x)
q^{(m)}(y | t) = \frac{1}{q^{(m)}(t)} \sum_x p(x, y) q^{(m)}(t | x)
q^{(m+1)}(t | x) = \frac{q^{(m)}(t)}{Z^{(m+1)}(x, β)} e^{−β D_{KL}[p(y|x) \| q^{(m)}(y|t)]},    (2.6)
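The updates in equation 2.6 can be prototyped in a few lines. This is only a sketch under our own choices (function name, random initialization, natural logarithm in the exponent, toy joint distribution), not the authors' reference implementation:

```python
import numpy as np

def ib_iterations(pxy, n_clusters, beta, n_iter=50, seed=0):
    """Self-consistent IB updates of equation 2.6.

    Assumes a strictly positive joint table pxy; uses natural logs,
    so beta is rescaled relative to a bits-based implementation.
    """
    rng = np.random.default_rng(seed)
    nx, _ = pxy.shape
    px = pxy.sum(axis=1)                     # p(x)
    py_x = pxy / px[:, None]                 # p(y | x)
    q_t_x = rng.dirichlet(np.ones(n_clusters), size=nx)  # random init q(t | x)
    for _ in range(n_iter):
        q_t = px @ q_t_x                                 # q(t)
        q_y_t = q_t_x.T @ pxy / q_t[:, None]             # q(y | t)
        # d(x, t) = D_KL[ p(y|x) || q(y|t) ]
        d = np.einsum('xy,xty->xt', py_x,
                      np.log(py_x[:, None, :] / q_y_t[None, :, :]))
        q_t_x = q_t[None, :] * np.exp(-beta * d)         # unnormalized update
        q_t_x /= q_t_x.sum(axis=1, keepdims=True)        # divide by Z(x, beta)
    q_t = px @ q_t_x
    q_y_t = q_t_x.T @ pxy / q_t[:, None]
    return q_t_x, q_t, q_y_t

# Toy joint distribution (ours), strictly positive.
pxy = np.array([[0.20, 0.05, 0.05],
                [0.05, 0.20, 0.05],
                [0.05, 0.05, 0.20],
                [0.05, 0.025, 0.025]])
q_t_x, q_t, q_y_t = ib_iterations(pxy, n_clusters=2, beta=1.0)
```

For strictly positive joint tables the returned distributions stay normalized throughout; the convergence of these iterations is the result quoted in the text.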
The general convergence of these iterations was proved in Tishby et al. (1999).

3 Multivariate Extensions of the IB Method

The original formulation of the IB principle concentrated on the trade-off between the compression of one variable, X, and the information this compression maintains about another, relevant variable, Y. We now describe a more general formulation for a multivariate extension of the IB trade-off principle. The motivation is to provide a similar framework for higher-dimensional data and with different probabilistic dependencies, as found in ample examples of real-world problems. This extension combines the theory of probabilistic graphical models, such as Bayesian networks, and a multivariate extension of the mutual information concept, known as multi-information.

3.1 Bayesian Networks and Multi-Information. A Bayesian network over a set of random variables X ≡ {X_1, . . . , X_n} is a directed acyclic graph (DAG), G, in which vertices are annotated by names of random variables. For each variable X_i, we denote by Pa^G_{X_i} the set of parents of X_i in G. We say that a distribution p(X) is consistent with G if it can be factored in the form

p(X_1, . . . , X_n) = \prod_i p(X_i | Pa^G_{X_i}),

and use the notation p |= G to denote that. The information that the random variables {X_1, . . . , X_n} ∼ p(X_1, . . . , X_n) share about each other is given by (see, e.g., Studeny & Vejnarova, 1998)

I[p(X)] = I(X) = D_{KL}[p(X_1, . . . , X_n) \| p(X_1) \cdots p(X_n)] = E_{p(X_1, . . . , X_n)} \left[ \log \frac{p(X_1, . . . , X_n)}{p(X_1) \cdots p(X_n)} \right].
This multi-information measures the average number of bits that can be gained by a joint compression of all the variables versus independent compression. Like the mutual information, it is nonnegative and equals zero if and only if all the variables are independent. When the variables have known independence relations, the multi-information can be simplified as follows.

Proposition 1. Let X = {X_1, . . . , X_n} ∼ p(X), and let G be a Bayesian network structure over X such that p |= G. Then

I[p(X)] = I(X) = \sum_i I(X_i ; Pa^G_{X_i}).

That is, the multi-information is the sum of "local" mutual information terms between each variable and its parents. Notice that even if p(X) is not consistent with G, this sum is well defined, but it may capture only part of the real multi-information. Hence, we introduce the following definition.

Definition 1. Let X = {X_1, . . . , X_n} ∼ p(X), and let G be a Bayesian network structure over X. The multi-information in p(X) with respect to G is defined as

I^G[p(X)] = \sum_i I(X_i ; Pa^G_{X_i}),    (3.1)
where each of the local mutual information terms is calculated using the marginal distributions of p(X). If p |= G, then I[p(X)] = I^G[p(X)], but in general, I[p(X)] ≥ I^G[p(X)] (see below). In this case, we often want to know how close p is to the distribution class that is consistent with G. We define this notion through the M-projection of p on that class:

D_{KL}[p \| G] = \min_{q |= G} D_{KL}[p \| q].    (3.2)
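For intuition, consider two variables and a graph G with no edges, so that distributions consistent with G factor as q(x_1) q(x_2). The projection then keeps p's marginals (this is proposition 2 below), and the projection distance is exactly I(X_1; X_2) (a special case of proposition 3 below). A quick numerical check, with toy numbers of our own:

```python
import numpy as np

def kl_bits(p, q):
    """KL divergence in bits between two distributions given as arrays."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

def entropy_bits(v):
    v = np.asarray(v, dtype=float).ravel()
    v = v[v > 0]
    return float(-np.sum(v * np.log2(v)))

p = np.array([[0.3, 0.1], [0.2, 0.4]])   # toy joint over (X1, X2)
px1, px2 = p.sum(axis=1), p.sum(axis=0)

q_star = np.outer(px1, px2)              # M-projection onto the empty graph
d_proj = kl_bits(p, q_star)

# For the empty graph, I_G[p] = 0, so the projection distance equals
# the full multi-information I[p(X)] = I(X1; X2).
i_x1x2 = entropy_bits(px1) + entropy_bits(px2) - entropy_bits(p)
```

The two quantities agree up to floating-point error, and both are strictly positive here because the toy variables are dependent.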
The following proposition specifies the form of the distribution q for which the minimum is attained.

Proposition 2. Let p(X) be a distribution, and let G be a DAG over X. Then

D_{KL}[p \| G] = D_{KL}[p \| q^*],    (3.3)

where q^* is given by

q^*(X) = \prod_{i=1}^n p(X_i | Pa^G_{X_i}).    (3.4)
Thus, q^* is equivalent to the factorization of p using the conditional independencies implied by G. This proposition extends the Csiszár-Tusnády (1984) lemma, which refers to the special case of n = 2 (see also Cover & Thomas, 1991, p. 365). Next, we provide two possible interpretations of the M-projection distance, D_{KL}[p \| G], in terms of the structure of G.

Proposition 3. Let X = {X_1, . . . , X_n} ∼ p(X), and let G be a Bayesian network structure over X. Assume that the order X_1, . . . , X_n is consistent with the DAG G (i.e., Pa^G_{X_i} ⊆ {X_1, . . . , X_{i−1}}). Then

D_{KL}[p \| G] = \sum_i I(X_i ; \{X_1, . . . , X_{i−1}\} \setminus Pa^G_{X_i} \,|\, Pa^G_{X_i}) = I[p(X)] − I^G[p(X)].

As an immediate corollary, we have I[p(X)] ≥ I^G[p(X)], as mentioned earlier. Thus, D_{KL}[p \| G] can be expressed as a sum of local conditional mutual information terms, where each term corresponds to a possible violation of a Markov independence assumption with respect to G. If every X_i is independent of {X_1, . . . , X_{i−1}} \setminus Pa^G_{X_i} given Pa^G_{X_i}, as implied by G, then D_{KL}[p \| G] is zero. The more strongly these conditional independence assumptions are violated in p, the larger the corresponding D_{KL}[p \| G] becomes. Recall that the Markov independence assumptions with respect to a given order are necessary and sufficient for the factored form of distributions consistent with G (Pearl, 1988). Therefore, we see that D_{KL}[p \| G] = 0 if and only if p is consistent with G. An alternative interpretation of this measure is given in terms of multi-information terms. Specifically, we see that D_{KL}[p \| G] can be written as the difference between the real multi-information, I[p(X)] = I(X), and the multi-information when p is forced to be consistent with G, I^G[p(X)], which in particular cannot be larger. Hence, we can think of D_{KL}[p \| G] as the residual information between the variables that is not captured by the dependencies implied by G.

3.2 Multi-Information Bottleneck Principle. Let us consider a set of random variables, X = {X_1, . . . , X_n}, distributed according to p(X). We assume that p(X) is known and forms the input to our analysis; finite sample issues are discussed in section 7. Given p(X), we specify a set of partition variables T = {T_1, . . . , T_k}, such that for each subset of X that we would like to compress, we specify a corresponding subset of the compression variables T. Recall that in the original IB, the IB Markovian relation, T ↔ X ↔ Y, defines a solution space with all the distributions over {X, Y, T} with q(X, Y, T) = p(X, Y) q(T | X).
Analogously, we define the solution space in our case through a set of IB
Markovian independence relations that imply that each compression variable, T_j ∈ T, is completely determined given the variables it represents. This is achieved by introducing a DAG, G_in, over X ∪ T where the variables in T are leaves. G_in is defined such that p(X) is consistent with its structure restricted to X. The edges from X to T define what compresses what, and the independencies implied by G_in correspond to the required set of IB Markovian independence relations. In particular, since we require that every T_j is a leaf, every T_j is independent of all the other variables, given the variables it compresses, Pa^{G_in}_{T_j}.^1 The multivariate IB solution space thus consists of all the distributions q(X, T) |= G_in, for which

q(X, T) = p(X) \prod_{j=1}^k q(T_j | Pa^{G_in}_{T_j}),    (3.5)

where the free parameters correspond to the stochastic mappings q(T_j | Pa^{G_in}_{T_j}), and the other unknown q(·) distributions are determined by marginalization over q(X, T) using the Markovian structure of G_in. Analogous to the original IB formulation, the information that we would like to minimize is now given by I^{G_in}, where I^{G_in} = I(X, T) since q(X, T) |= G_in; that is, this is the real multi-information in q(X, T). Minimizing this quantity attempts to make the T variables as independent of the X variables as possible. Once G_in is defined, we need to specify the relevant information that we wish to preserve. We do that by specifying another DAG, G_out, which determines what predicts what. Specifically, for every T_j, we define the variables it should preserve information about as its children in G_out. Thus, using definition 1, we may think of I^{G_out} as a measure of how much information the variables in T maintain about their target variables. This suggests that we should maximize I^{G_out}. The multivariate IB functional can now be written as

L^{(1)}[q(X, T)] = I^{G_in}[q(X, T)] − β I^{G_out}[q(X, T)],    (3.6)

where the minimization is done subject to the normalization constraints on the partition distributions, and β is the positive Lagrange multiplier controlling the trade-off.^2

^1 It is possible to apply a similar derivation where the T variables are not required to be leaves in G_in. This might be useful in various situations where, for example, T_j is used to design a better code for T_j. A relevant example is the relay channel (Cover & Thomas, 1991, Chap. 14).
^2 Since I^{G_out} typically consists of several mutual information terms, in principle it is possible to define a separate Lagrange multiplier for each of these terms. This might be useful if, for example, the preservation of one information term is of greater importance than the preservation of the others.

Notice that this functional is a direct generalization of the
Figure 1: The source (left) and target networks for the original IB principle. The target network for the multivariate IB principle is presented in the middle panel. The target network for the structural principle is described in the right panel.
original IB functional. Again, we are interested in the competition between the complexity of the representation, measured through the compression multi-information, I^{G_in}, and the accuracy of the relevant predictions provided by this representation, as quantified by the multi-information I^{G_out}. For β → 0, the focus is on the compression term, I^{G_in}, yielding a trivial solution in which every T_j consists effectively of one value to which all the values of Pa^{G_in}_{T_j} are mapped, where all the distinctions among these values, relevant or not, are lost. If β → ∞, we concentrate on maintaining the relevant information terms as high as possible. This yields a trivial solution at the opposite extreme, where every T_j is a one-to-one map of Pa^{G_in}_{T_j} with no loss of relevant information. The interesting cases are, of course, in between, where β takes positive finite values.

Example 1. Consider the application of the multivariate principle with G_in and G_out^(a) of Figure 1. G_in specifies that T compresses X, and G_out^(a) specifies that T should preserve information about Y. For these choices, I^{G_in} = I(T; X) + I(X; Y) and I^{G_out} = I(T; Y). The resulting functional is

L^{(1)} = I(X; Y) + I(T; X) − β I(T; Y),

where I(X; Y) is constant and can be ignored. Thus, we end up with a functional equivalent to that of the original IB functional.
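Example 1's bookkeeping can be verified numerically: with q(x, y, t) = p(x, y) q(t | x), proposition 1 says the full multi-information of q equals I(X; Y) + I(T; X). The toy tables and helper below are ours:

```python
import numpy as np

def mi_bits(pab):
    """Mutual information (bits) of a 2-D joint table."""
    pab = np.asarray(pab, dtype=float)
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    m = pab > 0
    return float(np.sum(pab[m] * np.log2(pab[m] / (pa * pb)[m])))

p_xy = np.array([[0.3, 0.1], [0.1, 0.5]])     # toy p(X, Y)
q_t_x = np.array([[0.9, 0.1], [0.2, 0.8]])    # toy q(T | X)
q_xyt = p_xy[:, :, None] * q_t_x[:, None, :]  # q = p(x, y) q(t|x), so q |= G_in

# I^{G_in} as the sum of local terms: I(Y; X) + I(T; X).
i_gin = mi_bits(p_xy) + mi_bits(q_xyt.sum(axis=1))

# Full multi-information of (X, Y, T) from its definition.
q_x = q_xyt.sum(axis=(1, 2))
q_y = q_xyt.sum(axis=(0, 2))
q_t = q_xyt.sum(axis=(0, 1))
prod = q_x[:, None, None] * q_y[None, :, None] * q_t[None, None, :]
m = q_xyt > 0
multi_info = float(np.sum(q_xyt[m] * np.log2(q_xyt[m] / prod[m])))
```

Since q is consistent with G_in by construction, the two quantities coincide up to floating-point error.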
3.3 A Structural Variational Principle. We now describe an alternative and closely related variational principle, which provides more insight
into the relationship of IB to generative models with maximum-likelihood estimation. As before, we trade between two complementary goals. On the one hand, we want to compress the observed variables, that is, to minimize I^{G_in}. On the other hand, instead of maximizing I^{G_out}, we now utilize the compression variables to drive q(X, T) toward a desired structure, G_out, that represents which dependencies and independencies we would like to impose. Let us consider again the two-variable case shown in Figure 1, where G_in specifies that T is a compressed version of X. Ideally, T should preserve all the information about Y. This is equivalent to the situation where T separates X from Y (Cover & Thomas, 1991, Chap. 2), that is, X ↔ T ↔ Y, as specified by G_out^(b) of Figure 1. Thus, we wish to construct q(T | X) such that the specified independencies in G_out are satisfied as much as possible. Notice that in general, G_in and G_out are incompatible. Except for trivial cases, we cannot achieve both sets of independencies simultaneously (Pereira et al., 1993; Slonim & Weiss, 2002). Instead, we aim to come as close as possible to achieving this by a trade-off between the two. We formalize this by requiring that q(X, T) be closely approximated by its closest distribution among all distributions consistent with G_out. As previously discussed, a natural information theoretic measure of this discrepancy is D_{KL}[q \| G]. Thus, the functional that we want to minimize is

L^{(2)}[q(X, T)] = I^{G_in}[q(X, T)] + γ D_{KL}[q(X, T) \| G_out],    (3.7)
where γ is, again, a positive Lagrange multiplier. We will refer to this functional as the structural multivariate IB functional. The range of γ is between 0, in which case we have the trivial (maximally compressed) solution, and ∞, in which we strive to make q as consistent as possible with G_out. Notice that the γ → ∞ limit is different in nature from the β → ∞ limit, since forcing the dependency structure is different from preserving all the information, as we see next.

Example 2. Consider again the example of Figure 1, with G_in and G_out^(b). In this case, we have I^{G_in} = I(X; Y) + I(T; X) and I^{G_out} = I(T; X) + I(T; Y). From D_{KL}[q \| G_out] = I^{G_in} − I^{G_out} (see proposition 3), we obtain

L^{(2)} = I(T; X) − γ I(T; Y) + (1 + γ) I(X; Y),

where the last (constant) term can be ignored. Hence, we end up with the original IB functional. Thus, we can think of the original IB problem as finding a compression T of X that results in a joint distribution, q(X, Y, T), that is as close as possible to the DAG where X and Y are independent given T.
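The identity used in Example 2 can also be checked numerically: for G_out with edges T → X and T → Y, the M-projection of q is q(t) q(x | t) q(y | t), and D_{KL}[q \| G_out] = I(X; Y | T) = I(X; Y) − I(T; Y) under the IB Markovian relation. A sketch with our own toy tables:

```python
import numpy as np

def mi_bits(pab):
    """Mutual information (bits) of a 2-D joint table."""
    pab = np.asarray(pab, dtype=float)
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    m = pab > 0
    return float(np.sum(pab[m] * np.log2(pab[m] / (pa * pb)[m])))

p_xy = np.array([[0.3, 0.1], [0.1, 0.5]])     # toy p(X, Y)
q_t_x = np.array([[0.9, 0.1], [0.2, 0.8]])    # toy q(T | X)
q = p_xy[:, :, None] * q_t_x[:, None, :]      # q(x, y, t) = p(x, y) q(t|x)

# M-projection of q onto G_out: q*(x, y, t) = q(t) q(x|t) q(y|t).
q_t = q.sum(axis=(0, 1))
q_x_t = q.sum(axis=1) / q_t                   # q(x | t), shape (x, t)
q_y_t = q.sum(axis=0) / q_t                   # q(y | t), shape (y, t)
q_star = q_t[None, None, :] * q_x_t[:, None, :] * q_y_t[None, :, :]

m = q > 0
d_proj = float(np.sum(q[m] * np.log2(q[m] / q_star[m])))  # D_KL[q || G_out]

gap = mi_bits(p_xy) - mi_bits(q.sum(axis=0))  # I(X; Y) - I(T; Y)
```

The projection distance and the information gap agree up to floating-point error, illustrating the structural reading of the original IB problem.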
From proposition 3 we obtain

L^{(2)} = I^{G_in} + γ (I^{G_in} − I^{G_out}) = (1 + γ) I^{G_in} − γ I^{G_out},

which is similar to the functional L^{(1)} under the transformation β = γ/(1 + γ). In this transformation, the range γ ∈ [0, ∞) corresponds to the range β ∈ [0, 1). Notice that when β = 1, we have L^{(1)} = D_{KL}[q \| G_out], which is the extreme case of L^{(2)}. Thus, from a mathematical perspective, L^{(2)} is a special case of L^{(1)} with the restriction β ≤ 1. As we saw, the two principles require different versions of G_out to reconstruct the original IB functional. More generally, for a given principle, different choices of G_out yield different optimization problems. Alternatively, given G_out, the two principles yield different optimization problems. In the previous example, we saw that these two effects can compensate for each other. In other words, using the structural variational principle with a different choice of G_out ends up with the same optimization problem, which in this case is equivalent to the original IB problem. To further understand the relation between the two principles, we consider the range of solutions for extreme values of β and γ. When β → 0 and γ → 0, in both formulations we simply minimize I^{G_in}. In the other limit, the two principles differ. When β → ∞, in L^{(1)} we simply maximize I^{G_out}. Here, applying L^{(1)} with G_out^(a) corresponds to maximizing I(T; Y). However, applying L^{(1)} with G_out^(b) corresponds to maximizing I(T; X) + I(T; Y); thus, information about X will be preserved even if it is irrelevant to Y. When γ → ∞, in L^{(2)} we simply minimize D_{KL}[q \| G_out], that is, minimize the violations of conditional independencies implied by G_out (see proposition 3). For G_out^(b), this minimizes I(X; Y | T) = I(X; Y) − I(T; Y) (where we used the structure of G_in and proposition 3); hence, this is equivalent to maximizing I(T; Y). For G_out^(a), when γ → ∞, we minimize I(T; X) + I(X; Y | T) = I(X; Y) + I(T; X) − I(T; Y). Thus, unlike the application of L^{(1)} to G_out^(a), we cannot ignore the term I(T; X). To summarize, we can say that L^{(1)} focuses on the edges that are present in G_out, while L^{(2)} focuses on the edges that are missing or, more precisely, on the conditional independencies implied by their absence. Thus, although both principles can be applied to any choice of G_out, some choices make more sense for L^{(1)} than for L^{(2)}, and vice versa.
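For reference, the one-line algebra behind the β = γ/(1 + γ) correspondence used above is:

```latex
\begin{align*}
\mathcal{L}^{(2)} &= \mathcal{I}^{G_{\mathrm{in}}} + \gamma\left(\mathcal{I}^{G_{\mathrm{in}}} - \mathcal{I}^{G_{\mathrm{out}}}\right)
 = (1+\gamma)\,\mathcal{I}^{G_{\mathrm{in}}} - \gamma\,\mathcal{I}^{G_{\mathrm{out}}} \\
 &= (1+\gamma)\left[\mathcal{I}^{G_{\mathrm{in}}} - \frac{\gamma}{1+\gamma}\,\mathcal{I}^{G_{\mathrm{out}}}\right]
 = (1+\gamma)\,\mathcal{L}^{(1)}\Big|_{\beta = \gamma/(1+\gamma)}.
\end{align*}
```

Since (1 + γ) > 0 is a constant scale for any fixed γ, the two functionals share their stationary points, and as γ ranges over [0, ∞), β covers exactly [0, 1).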
3.4 Examples: IB Variations

3.4.1 Parallel IB. In Figure 2A we consider a simple extension of the original IB, where we introduce k compression variables, {T_1, . . . , T_k}, instead of one. Similar to the original IB problem, we want {T_1, . . . , T_k} to preserve the information X maintains about Y, as specified by the DAG
Figure 2: Possible source and target networks for the parallel, symmetric, and triplet IB examples.
G_out^(a) in the same panel. We call this example the parallel IB, as {T_1, . . . , T_k} compress X in parallel. Here, I^{G_in} = I(X; Y) + \sum_{j=1}^k I(T_j; X) and I^{G_out} = I(T_1, . . . , T_k; Y); thus,

L_a^{(1)} = \sum_{j=1}^k I(T_j; X) − β I(T_1, . . . , T_k; Y).    (3.8)
That is, we attempt to minimize the information between X and every T_j while maximizing the information all the T_j's preserve together about Y. From the structure of G_in, we can also obtain

\sum_{j=1}^k I(T_j; X) = I(T_1, . . . , T_k; X) + I(T_1, . . . , T_k),    (3.9)
where I(T_1, . . . , T_k) is the multi-information of all the compression variables. Thus,

L_a^{(1)} = I(T_1, . . . , T_k; X) + I(T_1, . . . , T_k) − β I(T_1, . . . , T_k; Y).    (3.10)

In other words, we aim to find {T_1, . . . , T_k} that compress X, preserve the information about Y, and remain independent of each other as much as possible. Recall that using L^{(2)}, we aim at minimizing violation of independencies in G_out. This suggests that the DAG G_out^(b) of Figure 2A captures our intuitions above. In this DAG, X and Y are independent given every T_j, and all the T_j's are independent of each other. Here, I^{G_out} = I(T_1, . . . , T_k; X) + I(T_1, . . . , T_k; Y), and using equation 3.9, we have

L_b^{(2)} = \sum_{j=1}^k I(T_j; X) + γ (I(T_1, . . . , T_k) − I(T_1, . . . , T_k; Y)),
which is reminiscent of equation 3.10.

3.4.2 Symmetric IB. Another natural extension of the original IB is the symmetric IB. Here, we want to compress X into T_X and Y into T_Y such that T_X extracts the information X contains about Y, while T_Y extracts the information Y contains about X. The DAG G_in of Figure 2B captures the form of the compression. For G_out^(a) in the same panel, we have

L_a^{(1)} = I(T_X; X) + I(T_Y; Y) − β I(T_X; T_Y).    (3.11)
Thus, on one hand, we attempt to compress, and on the other hand, we attempt to make T_X and T_Y as informative about each other as possible. Notice that if T_X is informative about T_Y, then it is also informative about Y. Second, we use the structural principle, L^{(2)}, for which we are interested in approximating the conditional independencies implied by G_out. This suggests that G_out^(b) of Figure 2B represents our desired target model. Here, both T_X and T_Y are sufficient to separate X from Y, while being dependent on each other. Thus, we obtain

L_b^{(2)} = I(T_X; X) + I(T_Y; Y) − γ I(T_X; T_Y).    (3.12)
As in example 1, we see that by using the structural variational principle with a different G_out, we end up with the same optimization problem as by using L^{(1)}. Other alternative specifications of G_out that are interesting in this
context (Friedman, Mosenzon, Slonim, & Tishby, 2001) are omitted here for brevity.

3.4.3 Triplet IB. A challenging task in the analysis of sequence data, such as DNA and protein sequences or natural language text, is to identify features that are relevant for predicting another symbol in the sequence. Typically these features are different for forward prediction versus backward prediction. For example, the textual features that predict the next word to be "information" are clearly different from those that predict the previous word to be "information." Here, we address this issue by extracting features of both types such that their combination is highly informative about a symbol between other known symbols. The DAG G_in of Figure 2C is one way of capturing the form of the compression, where we denote by X_p, Y, X_n the previous, current, and next symbol in a given sequence, respectively. Here, T_p compresses X_p, while T_n compresses X_n. For the choice of G_out, we consider again two alternatives. First, we simply require that the combination of T_p and T_n will maximally preserve the information that X_p and X_n hold about the current symbol, Y. This is specified by G_out^(a) in the same panel, for which we obtain

L_a^{(1)} = I(T_p; X_p) + I(T_n; X_n) − β I(T_p, T_n; Y).    (3.13)

Second, we use the structural principle, L^{(2)}, with G_out^(b) of Figure 2C. Here, T_p and T_n are independent, and both are needed to make Y independent of X_p and X_n. Hence, the resulting T_p and T_n partitions provide compact, independent, and informative evidence regarding the value of Y. This specification yields

L_b^{(2)} = I(T_p; X_p) + I(T_n; X_n) − γ I(T_p, T_n; Y),    (3.14)
which is equivalent to equation 3.13. We will term this example the triplet IB.

4 Characterization of the Solution

As shown in Tishby et al. (1999), it is possible to implicitly characterize the form of the optimal solutions to the original IB functional. Here, we provide a similar characterization for the multivariate IB case. Specifically, we want to describe the distributions $q(T_j \mid \mathrm{Pa}_{T_j}^{G_{in}})$ that optimize the trade-off defined by each of the two principles. We present this characterization for $\mathcal{L}^{(1)}$; a similar analysis for $\mathcal{L}^{(2)}$ is straightforward. We first need some additional notational shorthands. We denote by $U_j = \mathrm{Pa}_{T_j}^{G_{in}}$ the variables that $T_j$ should compress, by $V_{X_i} = \mathrm{Pa}_{X_i}^{G_{out}}$ the
variables that should maintain information about $X_i$, and by $V_{T_j} = \mathrm{Pa}_{T_j}^{G_{out}}$ the variables that should maintain information about $T_j$. We also denote $V_{X_i}^{-j} = V_{X_i} \setminus \{T_j\}$ and, similarly, $V_{T_\ell}^{-j} = V_{T_\ell} \setminus \{T_j\}$. To simplify the presentation, we also assume that $U_j \cap V_{T_j} = \emptyset$.³ In addition, we use the notation
$$E_{p(\cdot \mid u_j)}\big[D_{KL}\big[p(Y \mid Z, u_j)\,\big\|\,p(Y \mid Z, t_j)\big]\big] = \sum_{z} p(z \mid u_j)\, D_{KL}\big[p(Y \mid z, u_j)\,\big\|\,p(Y \mid z, t_j)\big] = E_{p(Y, Z \mid u_j)}\left[\log \frac{p(Y \mid Z, u_j)}{p(Y \mid Z, t_j)}\right],$$

where $Y$ is a random variable and $Z$ is a set of random variables. Notice that this term implies averaging over all values of $Y$ and $Z$ using the conditional distribution $p(Y, Z \mid u_j)$. In particular, if $Y$ or $Z$ intersects with $U_j$, then only the values that are consistent with $u_j$ have positive weights in this averaging. Also notice that if $Z$ is empty, this term reduces to the standard $D_{KL}[p(Y \mid u_j) \,\|\, p(Y \mid t_j)]$. The main result of this section is as follows.

Theorem 1. Assume that $p(\mathbf{X})$, $G_{in}$, $G_{out}$, and $\beta$ are given and that $q(\mathbf{X}, \mathbf{T}) \models G_{in}$. The conditional distributions $\{q(T_j \mid U_j)\}_{j=1}^{k}$ are a stationary point of $\mathcal{L}^{(1)}[q(\mathbf{X}, \mathbf{T})] = \mathcal{I}^{G_{in}}[q(\mathbf{X}, \mathbf{T})] - \beta\, \mathcal{I}^{G_{out}}[q(\mathbf{X}, \mathbf{T})]$ if and only if

$$q(t_j \mid u_j) = \frac{q(t_j)}{Z_{T_j}(u_j, \beta)}\, e^{-\beta\, d(t_j, u_j)}, \qquad (4.1)$$
where $Z_{T_j}(u_j, \beta)$ is a normalization function, and
$$d(t_j, u_j) \equiv \sum_{i:\, T_j \in V_{X_i}} E_{q(\cdot \mid u_j)}\big[D_{KL}\big[q(X_i \mid V_{X_i}^{-j}, u_j)\,\big\|\,q(X_i \mid V_{X_i}^{-j}, t_j)\big]\big] + \sum_{\ell:\, T_j \in V_{T_\ell}} E_{q(\cdot \mid u_j)}\big[D_{KL}\big[q(T_\ell \mid V_{T_\ell}^{-j}, u_j)\,\big\|\,q(T_\ell \mid V_{T_\ell}^{-j}, t_j)\big]\big] + D_{KL}\big[q(V_{T_j} \mid u_j)\,\big\|\,q(V_{T_j} \mid t_j)\big]. \qquad (4.2)$$
The first sum is over all $X_i$ such that $T_j$ participates in predicting $X_i$. The second sum is over all $T_\ell$ such that $T_j$ participates in predicting $T_\ell$. The last term is relevant in cases where $T_j$ should be predicted by some $V_{T_j} \neq \emptyset$. This theorem provides an implicit set of equations for $q(T_j \mid U_j)$ through the multivariate relevant distortion $d(T_j, U_j)$, which in turn depends on
³ This is, in fact, the standard situation, since $U_j \cap \mathbf{T} = \emptyset$ and typically $V_{T_j} \subset \mathbf{T}$.
those unknown distributions. This distortion measures the degree of proximity of the conditional distributions in which $U_j$ is involved to those where we replace $U_j$ with its compact representative, $T_j$. For example, if some cluster $t_j \in \mathcal{T}_j$ behaves more similarly to $u_j \in \mathcal{U}_j$ than another cluster $t'_j \in \mathcal{T}_j$, we have $d(t_j, u_j) < d(t'_j, u_j)$, which implies $q(t_j \mid u_j) > q(t'_j \mid u_j)$. In other words, if $t_j$ is a good representative of $u_j$, the corresponding membership probability, $q(t_j \mid u_j)$, is increased accordingly. As in the original IB problem, equation 4.1 must be solved self-consistently with the equations for the other distributions that involve $T_j$, which emerge through marginalization over $q(\mathbf{X}, \mathbf{T})$ using the conditional independencies implied by $G_{in}$. Notice that when $\beta$ is small, the $q(T_j \mid U_j)$ are diffuse, since $\beta$ reduces the differences between the distortions for different values of $T_j$. When $\beta \to \infty$, all the probability mass will be assigned to the value $t_j$ with the smallest distortion; that is, the above stochastic mapping will become deterministic.
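To make the role of $\beta$ in equation 4.1 concrete, here is a minimal numerical sketch. The marginal and distortion values are hypothetical, chosen only for illustration:

```python
import numpy as np

def soft_assignment(q_t, d, beta):
    """Eq. 4.1 for a single u: q(t | u) = q(t) exp(-beta * d(t, u)) / Z_T(u, beta).
    q_t: marginal q(T); d: distortions d(t, u) for this u; beta: trade-off parameter."""
    w = q_t * np.exp(-beta * d)
    return w / w.sum()  # the normalization Z_T(u, beta)

# Hypothetical marginal and distortions for one u over three clusters.
q_t = np.array([0.5, 0.3, 0.2])
d = np.array([0.1, 0.5, 0.9])

print(soft_assignment(q_t, d, beta=0.01))   # close to the prior q(T): diffuse
print(soft_assignment(q_t, d, beta=100.0))  # mass on argmin d: effectively deterministic
```

As the comments indicate, small $\beta$ leaves the assignment close to the marginal, while large $\beta$ makes the stochastic mapping effectively deterministic.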
4.1 Examples. For $G_{in}$ and $G_{out}$ of Figure 1, it is easy to verify that equation 4.2 amounts to $d(T, X) = D_{KL}[p(Y \mid X) \,\|\, q(Y \mid T)]$, in full analogy to equation 2.5, as required.

For the parallel IB case of $G_{out}^{(a)}$ in Figure 2A, we have

$$d(T_j, X) = E_{q(\cdot \mid X)}\big[D_{KL}\big[q(Y \mid \mathbf{T}^{-j}, X)\,\big\|\,q(Y \mid \mathbf{T}^{-j}, T_j)\big]\big], \qquad (4.3)$$
where we used the notation $\mathbf{T}^{-j} = \mathbf{T} \setminus \{T_j\}$. Notice that due to the structure of $G_{in}$, $q(Y \mid \mathbf{T}^{-j}, X) = p(Y \mid X)$.

For the symmetric IB case of $G_{out}^{(a)}$ in Figure 2B, we obtain

$$d(T_X, X) = E_{p(\cdot \mid X)}\big[D_{KL}\big[q(T_Y \mid X)\,\big\|\,q(T_Y \mid T_X)\big]\big] \qquad (4.4)$$
and a symmetric expression for $d(T_Y, Y)$. Thus, $T_X$ attempts to make predictions similar to those of $X$ about $T_Y$.

Last, for the triplet IB case of $G_{out}^{(a)}$ in Figure 2C, we have

$$d(T_p, X_p) = E_{q(\cdot \mid X_p)}\big[D_{KL}\big[q(Y \mid T_n, X_p)\,\big\|\,q(Y \mid T_n, T_p)\big]\big]. \qquad (4.5)$$
Thus, $q(t_p \mid x_p)$ increases when the predictions about $Y$ given by $t_p$ are similar to those given by $x_p$ (when averaging over $T_n$). The distortion term for $T_n$ is defined analogously.

5 Multivariate IB Algorithms

Similar to the original IB functional, the multivariate IB functional is not convex with respect to all of its arguments simultaneously. Except for trivial cases, it always has multiple minima. Since theorem 1 provides necessary
conditions for internal minima of the functional, they can be used to find such solutions. However, as in many other optimization problems, different heuristics can also be employed to construct solutions, with relative advantages and disadvantages. Here, we show that the four algorithmic approaches proposed for the original IB problem (Slonim, 2002) can be extended to the multivariate case. We concentrate on the variational principle $\mathcal{L}^{(1)}$; deriving the same algorithms for $\mathcal{L}^{(2)}$ is straightforward.

5.1 Iterative Optimization Algorithm: Multivariate iIB. We start with the case where $\beta$ is fixed. Following Tishby et al. (1999), we apply the fixed-point equations in equation 4.1, alternately with the equations for the other distributions that involve some $T$ variables. Given the intermediate solution of the algorithm at the $m$th iteration, $\{q^{(m)}(T_j \mid U_j)\}_{j=1}^{k}$, we find $q^{(m)}(T_j)$ and $d^{(m)}(T_j, U_j)$ out of

$$q^{(m)}(\mathbf{X}, \mathbf{T}) = p(\mathbf{X}) \prod_{j=1}^{k} q^{(m)}(T_j \mid U_j) \qquad (5.1)$$
and then update

$$q^{(m+1)}(t_j \mid u_j) \leftarrow \frac{q^{(m)}(t_j)}{Z_{T_j}^{(m+1)}(u_j, \beta)}\, e^{-\beta\, d^{(m)}(t_j, u_j)}, \qquad q^{(m+1)}(t_{j'} \mid u_{j'}) \leftarrow q^{(m)}(t_{j'} \mid u_{j'}), \quad \forall j' \neq j. \qquad (5.2)$$
In Figure 3 we present pseudocode for this iterative algorithm, which we will term multivariate iIB. As an example, consider the case of the symmetric IB. Given $\{q^{(m)}(T_X \mid X), q^{(m)}(T_Y \mid Y)\}$, we find $q^{(m)}(T_X)$, $q^{(m)}(T_Y \mid X)$, and $q^{(m)}(T_Y \mid T_X)$ out of $q^{(m)}(X, Y, T_X, T_Y) = p(X, Y)\, q^{(m)}(T_X \mid X)\, q^{(m)}(T_Y \mid Y)$, from which we obtain $d^{(m)}(T_X, X) = D_{KL}[q^{(m)}(T_Y \mid X) \,\|\, q^{(m)}(T_Y \mid T_X)]$. Next, we update

$$q^{(m+1)}(t_x \mid x) \leftarrow \frac{q^{(m)}(t_x)}{Z^{(m+1)}(x, \beta)}\, e^{-\beta\, d^{(m)}(t_x, x)}, \qquad q^{(m+1)}(t_y \mid y) \leftarrow q^{(m)}(t_y \mid y). \qquad (5.3)$$
In the next iteration, we find a new version of $q(T_Y \mid Y)$ while $q(T_X \mid X)$ is kept fixed. We repeat these updates until convergence to a stationary point. We note that proving convergence in general, for any choice of $\beta$ and any choice of $G_{in}$ and $G_{out}$, is more involved than for the original IB problem due to the complex structure of equation 4.2. Nonetheless, in all our experiments, on real and synthetic data, the algorithm always converged to a (locally) optimal solution.
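The symmetric-IB updates of equation 5.3 can be sketched in a few lines of numpy. This is our illustrative reading of the procedure on a toy joint distribution, not the authors' reference implementation; all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """KL divergence along the last axis, in nats."""
    with np.errstate(divide="ignore", invalid="ignore"):
        t = p * np.log(p / np.maximum(q, 1e-300))
    return np.where(p > 0, t, 0.0).sum(axis=-1)

def symmetric_iib_step(p_xy, q_tx_x, q_ty_y, beta):
    """One half-iteration of eq. 5.3: update q(T_X | X) with q(T_Y | Y) held fixed.
    p_xy: joint p(x, y); q_tx_x[x, t]: q(t_x | x); q_ty_y[y, t]: q(t_y | y)."""
    p_x = p_xy.sum(axis=1)                              # p(x)
    q_tx = q_tx_x.T @ p_x                               # q(t_x) = sum_x p(x) q(t_x | x)
    q_ty_x = (p_xy / p_x[:, None]) @ q_ty_y             # q(t_y | x)
    q_x_tx = (q_tx_x * p_x[:, None]) / q_tx[None, :]    # q(x | t_x)
    q_ty_tx = q_x_tx.T @ q_ty_x                         # q(t_y | t_x)
    d = kl(q_ty_x[:, None, :], q_ty_tx[None, :, :])     # d(t_x, x) = KL[q(T_Y|x) || q(T_Y|t_x)]
    new = q_tx[None, :] * np.exp(-beta * d)
    return new / new.sum(axis=1, keepdims=True)         # normalization per x

# Toy joint over 4 x-values and 4 y-values, two clusters per side.
p_xy = rng.random((4, 4)); p_xy /= p_xy.sum()
q_tx_x = rng.random((4, 2)); q_tx_x /= q_tx_x.sum(axis=1, keepdims=True)
q_ty_y = rng.random((4, 2)); q_ty_y /= q_ty_y.sum(axis=1, keepdims=True)
for _ in range(50):  # alternate the two updates until (local) convergence
    q_tx_x = symmetric_iib_step(p_xy, q_tx_x, q_ty_y, beta=10.0)
    q_ty_y = symmetric_iib_step(p_xy.T, q_ty_y, q_tx_x, beta=10.0)
```

By the symmetry of the problem, the same function serves both updates once the roles of $X$ and $Y$ (and hence $p(X,Y)$ versus $p(Y,X)$) are swapped.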
Figure 3: Pseudocode of the multivariate iterative IB algorithm (multivariate iIB). JS denotes the Jensen-Shannon divergence (see equation 5.8). In principle, we repeat this procedure for different initializations and choose the solution that minimizes $\mathcal{L} = \mathcal{I}^{G_{in}} - \beta\, \mathcal{I}^{G_{out}}$.
5.2 Deterministic Annealing Algorithm: Multivariate dIB. It is often desirable to explore a hierarchy of solutions at different $\beta$ values. Thus, we now present a multivariate deterministic annealing-like procedure that extends the original approach in Tishby et al. (1999). In deterministic annealing, we iteratively increase $\beta$ and then adapt the solution at the previous $\beta$ value to the new one (Rose, 1998). Recall that for $\beta \to 0$, the solution consists of essentially one cluster per $T_j$. As $\beta$ is increased, at some critical point the values of some $T_j$ diverge and show different behaviors. Successive increments of $\beta$ will reach additional bifurcations that we wish to identify. Thus, for each $T_j$, we end up with a bifurcating hierarchy that traces the sequence of solutions at different $\beta$ values. To detect these bifurcations, we adopt the method suggested in Tishby et al. (1999). Given the solution from the previous $\beta$ value, we construct an initial problem in which every $T_j$ value is duplicated. Let $t_j^l$ and $t_j^r$ be
Figure 4: Pseudocode of the multivariate deterministic annealing-like algorithm (multivariate dIB). JS denotes the Jensen-Shannon divergence (see equation 5.8). $\mathrm{Ne}_{T_j}^{G_{out}}$ denotes the neighbors of $T_j$ in $G_{out}$ (parents/direct descendants). $f(\beta, \varepsilon_\beta)$ is a simple function used to increase $\beta$ based on its current value and on some scaling parameter $\varepsilon_\beta$.
two such duplications of $t_j \in \mathcal{T}_j$. Then we set $q^*(t_j^l \mid u_j) = q(t_j \mid u_j)\left(\frac{1}{2} + \alpha\,\hat{\varepsilon}(t_j, u_j)\right)$ and $q^*(t_j^r \mid u_j) = q(t_j \mid u_j)\left(\frac{1}{2} - \alpha\,\hat{\varepsilon}(t_j, u_j)\right)$, where $\hat{\varepsilon}(t_j, u_j)$ is a randomly drawn noise term and $\alpha > 0$ is a (small) scale parameter. Thus, each copy is a slightly perturbed version of $t_j$. For large enough $\beta$, this perturbation suffices for the two copies to diverge; otherwise, they collapse to the same solution. Specifically, given this initial point, we apply the multivariate iIB algorithm. After convergence, if $t_j^l$ and $t_j^r$ are sufficiently different, we declare that $t_j$ has split and incorporate $t_j^l$ and $t_j^r$ into the hierarchy we construct for $T_j$. Finally, we increase $\beta$ and repeat the whole process. We will term this algorithm multivariate dIB. A pseudocode is given in Figure 4.
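The duplication step can be sketched as follows, assuming (as one plausible choice) uniformly drawn noise; the function name and noise model are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def duplicate_and_perturb(q_t_u, alpha=0.01):
    """dIB duplication step: split every cluster t into two noisy copies,
    q*(t^l | u) = q(t | u) * (1/2 + alpha * eps(t, u)),
    q*(t^r | u) = q(t | u) * (1/2 - alpha * eps(t, u)).
    q_t_u[u, t] holds q(t | u); eps is drawn uniformly here (our choice)."""
    eps = rng.uniform(-1.0, 1.0, size=q_t_u.shape)
    left = q_t_u * (0.5 + alpha * eps)
    right = q_t_u * (0.5 - alpha * eps)
    return np.concatenate([left, right], axis=1)  # columns: all t^l copies, then all t^r

q = np.array([[0.7, 0.3], [0.2, 0.8]])            # a q(T | U) with rows summing to 1
q2 = duplicate_and_perturb(q)
print(q2.sum(axis=1))  # each row still sums to 1: the copies split the original mass
```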
There are several difficulties with this algorithm. The parameters $b^{(j)}$ that are involved in detecting the bifurcations often need to scale with $\beta$. Further, one may need to tune the rate of increasing $\beta$; otherwise, cluster splits might be skipped. Last, the duplication process is stochastic in nature and involves additional parameters. Some of these issues were addressed rigorously for the original IB problem (Parker, Gedeon, & Dimitrov, 2002). Extending this work to our context seems like a natural direction for future research.

5.3 Agglomerative Algorithm: Multivariate aIB. The agglomerative IB algorithm was introduced in Slonim & Tishby (1999) as a simple, approximate algorithm for the original IB problem. It employs a greedy agglomerative clustering technique to find a hierarchical clustering tree in a bottom-up fashion and was found useful for various problems (Slonim, 2002). We now present a multivariate extension of this algorithm. To this aim, it is more convenient to consider the problem of maximizing

$$\mathcal{L}_{\max}[q(\mathbf{X}, \mathbf{T})] = \mathcal{I}^{G_{out}}[q(\mathbf{X}, \mathbf{T})] - \beta^{-1} \cdot \mathcal{I}^{G_{in}}[q(\mathbf{X}, \mathbf{T})], \qquad (5.4)$$
which is clearly equivalent to minimizing $\mathcal{L}^{(1)}$ (see equation 3.6). Our algorithm starts with the most fine-grained solution, $T_j = U_j$. Thus, every $u_j$ is solely assigned to a unique singleton cluster, $t_j \in \mathcal{T}_j$, and the assignment probabilities, $\{q(T_j \mid U_j)\}_{j=1}^{k}$, are deterministic, either 0 or 1 (that is, "hard" clustering). Nonetheless, the following analysis holds for the general case of soft clustering as well. Given the singleton initialization, we reduce the cardinality of one $T_j$ by agglomerating, or merging, two of its values, $t_j^l$ and $t_j^r$, into a single value $\bar{t}_j$. Formally, this is defined through

$$q(\bar{t}_j \mid u_j) = q(t_j^l \mid u_j) + q(t_j^r \mid u_j). \qquad (5.5)$$
The corresponding conditional merger distribution is defined through

$$\Pi_z = \{\pi_{l,z}, \pi_{r,z}\} = \left\{\frac{q(t_j^l \mid z)}{q(\bar{t}_j \mid z)}, \frac{q(t_j^r \mid z)}{q(\bar{t}_j \mid z)}\right\}. \qquad (5.6)$$

Note that if $Z = \emptyset$, then $\Pi_z = \Pi = \left\{\frac{q(t_j^l)}{q(\bar{t}_j)}, \frac{q(t_j^r)}{q(\bar{t}_j)}\right\}$; that is, these are the relative weights of each of the clusters that participate in the merger (Slonim & Tishby, 1999). The basic question in an agglomerative process is which pair to merge at each step. Let $T_j^{bef}$ and $T_j^{aft}$ denote the random variables that correspond to $T_j$ before and after a merger in $T_j$, respectively. Then our merger cost is
given by

$$\Delta\mathcal{L}_{\max}(t_j^l, t_j^r) = \mathcal{L}_{\max}^{bef} - \mathcal{L}_{\max}^{aft}, \qquad (5.7)$$

where $\mathcal{L}_{\max}^{bef}$ and $\mathcal{L}_{\max}^{aft}$ are calculated based on $T_j^{bef}$ and $T_j^{aft}$, respectively. The greedy procedure evaluates all the potential mergers, for every $T_j$, and applies the one that minimizes $\Delta\mathcal{L}_{\max}(t_j^l, t_j^r)$. This is repeated until all the $T$ variables degenerate into trivial clusters. The resulting set of hierarchies describes a range of solutions at all the different resolutions. A direct calculation of all the potential merger costs using equation 5.7 is typically infeasible. However, as in Slonim & Tishby (1999), one may calculate $\Delta\mathcal{L}_{\max}(t_j^l, t_j^r)$ while examining only the distributions that involve $t_j^l$ and $t_j^r$ directly. An essential concept in this derivation is the Jensen-Shannon (JS) divergence. Specifically, the JS divergence between two probability distributions, $p_1$ and $p_2$, with respect to the positive weights $\Pi = \{\pi_1, \pi_2\}$, $\pi_1 + \pi_2 = 1$, is given by

$$JS_\Pi[p_1, p_2] = \pi_1 D_{KL}[p_1 \,\|\, \bar{p}] + \pi_2 D_{KL}[p_2 \,\|\, \bar{p}], \qquad (5.8)$$
where $\bar{p} = \pi_1 p_1 + \pi_2 p_2$. The JS divergence is nonnegative and upper bounded, and it equals zero if and only if $p_1 = p_2$. It is also symmetric, but it does not satisfy the triangle inequality. Comparing the JS divergence between the empirical distributions of two samples against some predefined threshold is asymptotically the optimal way to determine whether both samples came from a single source (Gutman, 1989; Schreibman, 2000).

Theorem 2.
Let $t_j^l, t_j^r \in \mathcal{T}_j$ be two clusters. Then,

$$\Delta\mathcal{L}_{\max}(t_j^l, t_j^r) = q(\bar{t}_j) \cdot d_A(t_j^l, t_j^r), \qquad (5.9)$$

where

$$d_A(t_j^l, t_j^r) \equiv \sum_{i:\, T_j \in V_{X_i}} E_{q(\cdot \mid \bar{t}_j)}\Big[JS_{\Pi_{V_{X_i}^{-j}}}\big[q(X_i \mid V_{X_i}^{-j}, t_j^l),\, q(X_i \mid V_{X_i}^{-j}, t_j^r)\big]\Big] + \sum_{\ell:\, T_j \in V_{T_\ell}} E_{q(\cdot \mid \bar{t}_j)}\Big[JS_{\Pi_{V_{T_\ell}^{-j}}}\big[q(T_\ell \mid V_{T_\ell}^{-j}, t_j^l),\, q(T_\ell \mid V_{T_\ell}^{-j}, t_j^r)\big]\Big] + JS_\Pi\big[q(V_{T_j} \mid t_j^l),\, q(V_{T_j} \mid t_j^r)\big] - \beta^{-1} \cdot JS_\Pi\big[q(U_j \mid t_j^l),\, q(U_j \mid t_j^r)\big]. \qquad (5.10)$$
That is, the merger cost is the product of the "weight" of the merger components, $q(\bar{t}_j)$, and their "distance," $d_A(t_j^l, t_j^r)$. Notice that due to the JS
Figure 5: Pseudocode of the multivariate agglomerative IB algorithm (multivariate aIB).
properties, this "distance" is symmetric, but it is not a metric. It is small for pairs that give similar predictions about the variables that $T_j$ should predict and have different predictions, or minimal overlap, about the variables that $T_j$ should compress. Equation 5.10 is clearly analogous to equation 4.2. While for the multivariate iIB the optimization is governed by the KL divergences between data and cluster centroids, here it is controlled through JS divergences, which are related to the likelihood that the two merged clusters have a common source. For brevity, in the rest of this section, we focus on the simpler hard clustering case, for which $JS_\Pi[q(U_j \mid t_j^l), q(U_j \mid t_j^r)] = H[\Pi]$, where $H$ is Shannon's entropy. A pseudocode of the general procedure is given in Figure 5.

Examples. For the original IB problem ($G_{in}$ and $G_{out}$ in Figure 1), we obtain

$$\Delta\mathcal{L}_{\max}(t_l, t_r) = q(\bar{t}) \cdot \big(JS_\Pi[q(Y \mid t_l),\, q(Y \mid t_r)] - \beta^{-1} H[\Pi]\big), \qquad (5.11)$$
which is consistent with the original aIB algorithm (Slonim & Tishby, 1999).
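Equation 5.11 is easy to reproduce numerically. Below is a minimal sketch of the weighted JS divergence (equation 5.8) and the resulting merger cost for the hard clustering case; the helper names are ours:

```python
import numpy as np

def kl(p, q):
    """KL divergence in nats, with the 0 * log 0 = 0 convention."""
    return float(np.sum(np.where(p > 0, p * np.log(p / np.maximum(q, 1e-300)), 0.0)))

def js(p1, p2, pi1, pi2):
    """Weighted Jensen-Shannon divergence of eq. 5.8: JS_Pi[p1, p2]."""
    pbar = pi1 * p1 + pi2 * p2
    return pi1 * kl(p1, pbar) + pi2 * kl(p2, pbar)

def merger_cost(q_tl, q_tr, q_y_tl, q_y_tr, beta_inv=0.0):
    """Eq. 5.11 for hard clustering:
    dL_max(tl, tr) = q(tbar) * (JS_Pi[q(Y|tl), q(Y|tr)] - beta_inv * H[Pi])."""
    q_tbar = q_tl + q_tr
    pi1, pi2 = q_tl / q_tbar, q_tr / q_tbar
    h_pi = -(pi1 * np.log(pi1) + pi2 * np.log(pi2))  # Shannon entropy of the merger weights
    return q_tbar * (js(q_y_tl, q_y_tr, pi1, pi2) - beta_inv * h_pi)

# Two equally weighted clusters with identical predictions about Y merge at zero cost.
p = np.array([0.4, 0.6])
print(merger_cost(0.25, 0.25, p, p))  # 0.0
```

As expected from the JS properties, the cost vanishes only when the two clusters predict identical distributions over $Y$ (for $\beta^{-1} = 0$).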
For the parallel IB ($G_{in}$ and $G_{out}^{(a)}$ in Figure 2A), we have

$$\Delta\mathcal{L}_{\max}(t_j^l, t_j^r) = q(\bar{t}_j) \cdot \big(E_{q(\cdot \mid \bar{t}_j)}\big[JS_{\Pi_{\mathbf{T}^{-j}}}[q(Y \mid \mathbf{T}^{-j}, t_j^l),\, q(Y \mid \mathbf{T}^{-j}, t_j^r)]\big] - \beta^{-1} H[\Pi]\big), \qquad (5.12)$$

where again we used $\mathbf{T}^{-j} = \mathbf{T} \setminus \{T_j\}$.

For the symmetric IB ($G_{in}$ and $G_{out}^{(a)}$ in Figure 2B), we obtain

$$\Delta\mathcal{L}_{\max}(t_X^l, t_X^r) = q(\bar{t}_X) \cdot \big(JS_\Pi[q(T_Y \mid t_X^l),\, q(T_Y \mid t_X^r)] - \beta^{-1} H[\Pi]\big), \qquad (5.13)$$
and an analogous expression for $T_Y$.

5.4 Sequential Optimization Algorithm: Multivariate sIB. An agglomerative approach is relatively computationally demanding. If we start from $T_j = U_j$, the time complexity is $O(\sum_{j=1}^{k} |U_j|^3 \cdot |\mathcal{V}_j|)$ (where $|\mathcal{V}_j|$ denotes the complexity of calculating a single merger in $T_j$), while the space complexity is $O(\sum_{j=1}^{k} |U_j|^2)$; that is, infeasible for large data sets. Moreover, it is not guaranteed to extract even locally optimal solutions. Recently, we suggested a framework for casting an agglomerative clustering algorithm into a sequential optimization algorithm, which is guaranteed to converge to a stable solution in much better time and space complexity (Slonim, Friedman, & Tishby, 2002). Next, we describe how to apply this idea in our context. The sequential procedure maintains for each $T_j$ a flat partition with $M_j$ (hard) clusters. At each step, we draw a $u_j \in U_j$ out of its current cluster (denoted here $t_j(u_j)$) and represent it as a new singleton cluster. Using equation 5.9, we now merge $u_j$ into a cluster $t_j^{new}$ such that $t_j^{new} = \mathrm{argmin}_{t_j \in \mathcal{T}_j} \Delta\mathcal{L}_{\max}(\{u_j\}, t_j)$, to obtain a (possibly new) partition $T_j^{new}$ with the appropriate cardinality. Since this step can only increase the (upper-bounded) functional $\mathcal{L}_{\max}$, we are guaranteed to converge to a stable solution. It is easy to verify that the time complexity is $O(\ell \cdot \sum_j |U_j| \cdot |\mathcal{T}_j| \cdot |\mathcal{V}_j|)$, where $\ell$ is the number of iterations we should perform until convergence is attained. Since typically $\ell \cdot |\mathcal{T}_j| \ll |U_j|^2$, we get a significant run-time improvement. Moreover, we dramatically improve the memory consumption, toward $O(\sum_j |\mathcal{T}_j|^2)$. We will term this algorithm multivariate sIB. A pseudocode is given in Figure 6.

6 Illustrative Applications

In this section we consider a few illustrative applications of the general methodology. In practice we do not have access to the true joint distribution, $p(\mathbf{X})$, but only a finite sample drawn out of this distribution. Here, a
Figure 6: Pseudocode of the multivariate sequential IB algorithm (multivariate sIB). In principle, we repeat this procedure for different initializations and choose the solution that maximizes $\mathcal{L}_{\max} = \mathcal{I}^{G_{out}} - \beta^{-1} \mathcal{I}^{G_{in}}$.
pragmatic approach was taken where we estimated $p(\mathbf{X})$ through a simple normalization. Our results seem satisfactory even in extreme undersampling situations, and we leave the theoretical analysis of finite sample effects on our methodology for future research. An appealing property of an information-theoretic approach, and of the IB framework in particular, is that it can be applied to data sets of different types in exactly the same manner. There is no need to tailor the algorithms to the given data or to define a domain-specific distortion measure. To demonstrate this point, we apply our method to a variety of data types, including natural language text, protein sequences, and gene expression data (see appendix B for preprocessing and implementation details). In all cases, the quality of the results can be assessed on similar grounds, in terms of compression versus preservation of relevant information.
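The "simple normalization" estimate and the mutual-information quantities reported throughout this section can be sketched as follows (a generic helper, not code from the paper):

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in nats from a joint distribution given as a 2D array."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p_xy * np.log(p_xy / (p_x * p_y))
    return float(np.where(p_xy > 0, terms, 0.0).sum())

# Estimate the joint by simple normalization of a raw count matrix.
counts = np.array([[10.0, 0.0], [0.0, 10.0]])
p = counts / counts.sum()
print(mutual_information(p))  # log(2) ≈ 0.693 nats: one variable fully determines the other
```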
6.1 Parallel IB Applications. We consider the specification of $G_{in}$ and $G_{out}^{(a)}$ of Figure 2A. The problem thus is to partition $X$ into $k$ sets of clusters, $\mathbf{T} = \{T_1, \ldots, T_k\}$, that minimize the information they maintain about $X$, maximize the information they hold about $Y$, and remain as independent of each other as possible. For various technical reasons, the sIB algorithm is most suitable here. Specifically, we first apply sIB with $k = 1$, which is equivalent to solving the original IB problem. Given this solution, denoted $T_1$, we apply sIB again with $k = 2$, while $T_1$ is kept fixed. That is, given $T_1$, we look for $T_2$ such that $I(T_1, T_2; Y) - \beta^{-1}(I(T_1, T_2; X) + I(T_1; T_2))$ is maximized. Next, we hold $T_1$ and $T_2$ fixed while looking for $T_3$, and so forth. Loosely speaking, in $T_1$ we aim to extract the first principal partition of the data. In $T_2$ we seek a second, approximately independent, principal partition, and so on.

6.1.1 Parallel sIB for Style Versus Topic Text Clustering. A well-known concern in cluster analysis is that there might be more than one meaningful way to partition a given body of data. For example, text documents might have two possible dichotomies: by their topics and by their writing styles. Next, we construct such an example and solve it using our parallel IB approach. We selected four books: The Beasts of Tarzan and The Gods of Mars by E. R. Burroughs and The Jungle Book and Rewards and Fairies by R. Kipling. Thus, in addition to the partition by writing style, there is a possible topic partition: the "jungle" topic versus all the rest. We split each book into "documents" consisting of 200 successive words each. We defined $p(w, d)$ as the number of occurrences of the word $w$ in the document $d$, normalized by the total number of words in the corpus, and applied parallel sIB to cluster the documents into two partitions, $T_1$ and $T_2$, of two clusters each.
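The staged search described above (holding $T_1$ fixed while optimizing $T_2$) can be checked numerically. A minimal sketch for hard partitions, with beta^{-1} = 0 so that only the relevant-information term $I(T_1, T_2; Y)$ remains; all names are ours:

```python
import numpy as np

def mi(p):
    """Mutual information (nats) between rows and columns of a 2D joint."""
    px = p.sum(axis=1, keepdims=True); py = p.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        t = p * np.log(p / (px * py))
    return float(np.where(p > 0, t, 0.0).sum())

def staged_score(p_xy, t1, t2, n1, n2):
    """I(T1, T2; Y) for hard partitions t1, t2 of the X values. With beta^{-1} = 0
    the compression terms drop, so this is the quantity maximized when seeking T2."""
    joint = np.zeros((n1 * n2, p_xy.shape[1]))       # p((t1, t2), y)
    for x in range(p_xy.shape[0]):
        joint[t1[x] * n2 + t2[x]] += p_xy[x]
    return mi(joint)

# Toy check: X identifies Y; T1 captures one bit, and a T2 that is independent of
# T1 (capturing the remaining bit) scores higher than a T2 redundant with T1.
p_xy = np.eye(4) / 4.0
t1 = np.array([0, 0, 1, 1])
print(staged_score(p_xy, t1, np.array([0, 1, 0, 1]), 2, 2))  # complementary partition: higher
print(staged_score(p_xy, t1, np.array([0, 0, 1, 1]), 2, 2))  # redundant partition: lower
```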
Since this setting already implies significant compression, we used $\beta^{-1} = 0$ and concentrated on maximizing $I(T_1, T_2; W)$. In Table 1, we see that $T_1$ shows almost perfect correlation with an authorship partitioning, while $T_2$ is correlated with a topical partitioning. Moreover, $I(T_1; T_2) \approx 0.001$ nats; that is, these two partitions are practically independent. In addition, with only four clusters, $I(T_1, T_2; W) \approx 0.3$ nats, which is about 13% of the original (empirical) information, $I(D; W)$.

6.1.2 Parallel sIB for Gene Expression Data Analysis. As our second example, we used the mRNA gene expression measurements of approximately 6800 human genes in 72 samples of leukemia (Golub et al., 1999). These data include independent annotations of their components, including the type of leukemia (ALL versus AML), the type of cells, the donating hospital, and others. We normalized the measurements of the genes in each sample independently to get an estimated joint distribution, $p(S, G)$, over samples and genes (where $p(S)$ is uniform). Given this joint distribution, we
Table 1: Results for Parallel sIB Applied to Style Versus Topic Text Clustering.

                                     T1,a   T1,b   T2,a   T2,b
The Beasts of Tarzan (Burroughs)      315      2    315      2
The Gods of Mars (Burroughs)          407      0      1    406
The Jungle Book (Kipling)               0    255    254      1
Rewards and Fairies (Kipling)           0    367     42    325

Notes: Each entry indicates the number of "documents" in some cluster and some class. For example, the first cluster of the first partition, T1,a, includes 315 "documents" taken from the book The Beasts of Tarzan.
Table 2: Results for Parallel sIB Applied to Gene Expression Measurements of Leukemia Samples (Golub et al., 1999).

              T1,a   T1,b   T2,a   T2,b   T3,a   T3,b   T4,a   T4,b
AML             23      2     14     11     12     13     13     12
ALL              0     47     37     10      9     38     22     25
B-cell           0     38     37      1      6     32     20     18
T-cell           0      9      0      9      3      6      2      7
Average PS    0.64   0.72   0.71   0.66   0.53   0.76   0.70   0.69

Note: Each entry indicates the number of samples in some cluster and some class. Note that T-cell/B-cell annotations are available only for samples annotated as ALL type. The last row indicates the average "prediction strength" score (Golub et al., 1999) in the cluster.
applied the parallel sIB to cluster the samples into four clustering hierarchies, consisting of two clusters each (again, with $\beta^{-1} = 0$). In Table 2 we present the four partitions. $T_1$ almost perfectly matches the AML versus ALL annotation, while $T_2$ is correlated with the B-cell/T-cell split. For $T_3$ we note that the average "prediction strength" score (see Golub et al., 1999) is very different between the two clusters. For $T_4$ we were not able to find any clear correlation with the available annotations, suggesting that this partition overfits the data or that further meaningful partitions of these data are not expressed in the provided annotations. In terms of information, $I(\mathbf{T}; G)$ preserves about 54% of the original information, $I(S; G) \approx 0.23$ nats.

6.2 Symmetric IB Applications. Here, we illustrate the applicability of all our algorithms for $G_{in}$ and $G_{out}^{(a)}$ of Figure 2B.

6.2.1 Symmetric dIB and iIB for Word-Topic Clustering. We start with a simple text processing example, constructed out of the 20-news-group data (Lang, 1995). These data consist of approximately 20,000 documents, distributed among 20 different topics. We defined $p(w, c)$ as the number of occurrences of a word $w$ in all documents of topic $c$, normalized by
Figure 7: Application of the symmetric dIB to the 20-news-group data. The learned hierarchy of topic clusters, $T_c$, is presented after four splits. The numerical value inside each ellipse denotes the bifurcation $\beta$ value. Notice that at the early stages, the algorithm is inconclusive regarding the assignment of the electronics topic, which demonstrates that the hierarchy obtained by the dIB algorithm does not necessarily construct a tree. Below every topic cluster, $t_c$, we present its most probable word cluster ($t_w^* = \mathrm{argmax}_{t_w \in T_w} q(t_w \mid t_c)$) through its five most probable words, sorted by $q(w \mid t_w^*)$.
the total number of words in the corpus, and applied the symmetric dIB algorithm. We start with $\beta^{-1} = 0$ and gradually "anneal" the system to extract a hierarchy of word clusters, $T_w$, and a corresponding hierarchy of topic clusters, $T_c$. The obtained $T_c$ partitions were typically hard; hence, this hierarchy can be presented as a simple tree-like structure (see Figure 7). For every $t_c$, we find its most probable word cluster ($t_w^* = \mathrm{argmax}_{t_w} q(t_w \mid t_c)$) and present it in the same figure. Evidently there is a strong semantic relation in every such pair. The word clusters also exploited the soft clustering capability to deal with words that are relevant to several topics, as illustrated in Table 3. In terms of information, after four splits, $|T_w| = 14$, $|T_c| = 9$, and $I(T_w; T_c) = 0.6$ nats, which is about 70% of the original information, $I(W; C)$. We further applied the symmetric iIB algorithm to the same data. For purposes of comparison, the input parameters were set as in the dIB result: $|T_w| = 14$, $|T_c| = 9$, and $\beta \approx 22.72$. We performed 100 different random initializations, 8 of which converged to a better minimum of $\mathcal{L}$ than the one
Table 3: Results for "Soft" Word Clustering Using Symmetric dIB over the 20-News-group Data.

W             q(tw | w)   tc*                q(tc* | tw)
war           0.92        politics           0.44
              0.06        religion-mideast   0.34
              0.02        religion-mideast   0.93
killed        0.86        politics           0.44
              0.08        religion-mideast   0.34
              0.06        religion-mideast   0.93
evidence      0.77        religion-mideast   0.34
              0.23        politics           0.44
price         0.74        hardware           0.31
              0.26        sport              0.35
speed         0.99        hardware           0.31
              0.01        sport              0.35
application   0.58        hardware           0.31
              0.42        windows            0.84

Notes: The first column indicates the word, that is, the W value. The second column presents q(tw | w) for all the clusters for which this probability was nonzero. tc* denotes the topic cluster that maximizes q(tc | tw); it is represented in the table, in the third column, by the joint topic of its members (see Figure 7). The last column presents the probability of this topic cluster given tw.
found by dIB (see Figure 8). Thus, by tracking the changes in the solution as $\beta$ increases, the dIB approach succeeds in finding a relatively good solution and also provides more detail by describing a hierarchy of solutions. Nonetheless, if one is interested in a flat partition, applying iIB with a sufficient number of initializations will probably yield a better optimum.

6.2.2 Symmetric sIB and aIB for Protein Sequence Analysis. As a second example, we used a subset of five protein classes taken from the PRINTS database (Attwood et al., 2000) (see Table 4). All five classes share a common protein structural unit, known as the glutathione S-transferase (GST) domain. A well-established database (the Pfam database, http://www.sanger.ac.uk/Pfam) has chosen not to model these groups separately due to the high sequence similarity between them. Nonetheless, our unsupervised symmetric IB algorithms extract clusters that are well correlated with these groups. We represented each protein as a count vector over the different 4-mers of amino acids, or features, present in these data. We defined $p(f \mid r)$ as the relative frequency of a feature $f$ in a protein $r$ and further defined $p(r)$ as uniform. Given these data, we applied the symmetric aIB and sIB (with $\beta^{-1} = 0$) to extract protein clusters, $T_R$, and feature clusters, $T_F$, such that $I(T_R; T_F)$ is maximized.
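The 4-mer count representation described above can be sketched as follows; the toy sequence is hypothetical, not from the PRINTS data:

```python
from collections import Counter

def kmer_profile(seq, k=4):
    """Count vector over the k-mers occurring in a sequence (k = 4 in the text)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# p(f | r): relative frequency of feature f within protein r.
protein = "MSKLAVLKAVLA"  # hypothetical toy sequence, not from the data set
counts = kmer_profile(protein)
total = sum(counts.values())
p_f_given_r = {f: c / total for f, c in counts.items()}
print(len(p_f_given_r), round(sum(p_f_given_r.values()), 6))  # 9 distinct 4-mers, sums to 1.0
```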
Figure 8: Application of symmetric dIB and symmetric iIB to the 20-news-group data. The iIB results over 100 different initializations are sorted with respect to $\mathcal{L} = I(T_w; W) + I(T_c; C) - 22.72 \cdot I(T_w; T_c)$. In eight cases, iIB converged to a better minimum.
Table 4: Data Set Details of the Protein GST Domain Test.

Class   Family Name            Number of Proteins
c1      GST, no class label                   298
c2      S crystalline                          29
c3      Alpha class GST                        40
c4      Mu class GST                           32
c5      Pi class GST                           22
For the sIB results, with 10 protein clusters and 10 feature clusters, we obtain $I(T_R; T_F) = 1.1$ nats (about 30% of the original information), and the algorithm almost perfectly recovers the manual biological partitioning of the proteins (see Table 5). For each $t_R$, we identify its most probable feature cluster ($t_F^* = \mathrm{argmax}_{t_F} q(t_F \mid t_R)$) and present in Table 6 its most probable features, which apparently are good indicators for the biological class that is correlated with $t_R$.
Table 5: Results for Applying Symmetric sIB to the GST Protein Data Set with |TR| = 10, |TF| = 10.

Class/Cluster   tR1   tR2   tR3   tR4   tR5   tR6   tR7   tR8   tR9   tR10
c1              107    49    47    42    30    17     4     1     1      0
c2                0     0     0     0     0     0    29     0     0      0
c3                0     0     0     0     0     0     0    39     0      1
c4                0     0     0     0     0     2     0     0    30      0
c5                0     7     0     0     0     0     0     0     1     14
Errors            0     7     0     0     0     2     4     1     2      1

Notes: Each entry indicates the number of proteins in some cluster and some class. The last row indicates the number of "errors" for each cluster, defined as the number of proteins in this cluster that are not labeled by the cluster's most dominant label.
Table 6: Results for Symmetric sIB: Indicative Features for GST Protein Classes.

TR          tF*    q(tF*|tR)   Feature   q(f|tF*)   c1     c2     c3     c4     c5
tR7 (c2)    tF10   0.91        RYLA      0.022      0.08   1.00   0.00   0.19   0.00
                               GRGR      0.020      0.04   0.72   0.58   0.00   0.00
                               NGRG      0.019      0.04   0.72   0.43   0.00   0.00
tR9 (c4)    tF8    0.89        FPNL      0.025      0.06   0.00   0.00   1.00   0.00
                               AILR      0.018      0.03   0.00   0.03   0.88   0.64
                               SNAI      0.017      0.03   0.00   0.03   0.94   0.46
tR10 (c5)   tF2    0.85        LDLL      0.019      0.04   0.00   0.08   0.00   0.59
                               SFAD      0.017      0.04   0.00   0.03   0.00   0.64
                               FETL      0.017      0.04   0.00   0.03   0.00   0.59
tR8 (c3)    tF1    0.85        FPLL      0.018      0.01   0.00   0.90   0.00   0.77
                               YGKD      0.017      0.03   0.00   0.88   0.00   0.50
                               AAGV      0.016      0.03   0.45   0.78   0.00   0.00
tR5 (c1)    tF9    0.83        TLVD      0.015      0.13   0.00   0.00   0.00   0.00
                               WESR      0.015      0.12   0.00   0.00   0.00   0.00
                               EFLK      0.015      0.00   0.00   0.00   0.19   0.00
tR4 (c1)    tF5    0.80        IPVL      0.010      0.11   0.00   0.00   0.00   0.00
                               ARFW      0.010      0.11   0.00   0.00   0.00   0.00
                               KIPV      0.009      0.09   0.00   0.00   0.00   0.05

Notes: The left column indicates the index of the cluster in TR and, in parentheses, the most dominant protein class in it. Given this protein cluster, the second column indicates its most probable feature cluster, tF* = argmax_{tF} q(tF | tR). The next column indicates the probability of this feature cluster given the protein cluster. Results are presented only when this value is greater than 0.8, indicating high coupling between both clusters. We further sort all features by q(f | tF*) and present the top three features in the next column. The last five columns indicate for each of these features its relative frequency in all five classes (estimated as the number of proteins that contain this feature, normalized by the total number of proteins in the class). Clearly, the extracted features are correlated with the biological class associated with tR.
1770
N. Slonim, N. Friedman, and N. Tishby
[Figure 9 here: a tree of protein clusters, each node labeled with its class composition (e.g., c1~49 c6~3), with the root, containing all proteins, labeled ALL.]
Figure 9: Application of the symmetric aIB to the GST protein data set. The learned protein cluster hierarchy, TR , is presented from |T R | = 10 and below. In each cluster, the number of proteins from every class is indicated. For example, in the extreme upper right cluster, there are 39 proteins from the class c 3 and a single protein from the unlabeled class, c 1 . After completing the experiments, we found that 36 out of the unlabeled (c 1 ) proteins were recently labeled as Omega class. This class is denoted by c 6 in the figure. Notice that all its members were clustered in the three left-most clusters.
Thus, our unsupervised analysis finds clusters that highly correlate with a manual partitioning of the proteins and simultaneously extracts features (subsequences of amino acids) that are good indicators for each such class. Last, we apply the symmetric aIB to the same data. For comparison purposes, we consider the solution at |T R | = |T F | = 10. Here, I (TR ; TF ) = 0.9 nats, clearly inferior to the sIB result. However, the differences are mainly in the feature clusters, TF , while the protein clusters obtained by aIB strongly correlate with the corresponding sIB solution, and thus also correlate with the protein class labels. In Figure 9 we present the TR hierarchy. Notice that many of our clusters correspond to the “unlabeled” c 1 class, and thus presumably correlate with additional—yet unknown—subclasses in the GST domain. In fact, after completing our experiments, it was brought to our attention that one such new class was recently defined in a different database, the InterPro database (Apweiler et al., 2000). Thirty-six proteins out of this new Omega class were present in our data. In Figure 9 we see
that all these proteins were assigned automatically to a single branch in our hierarchy.
6.3 Triplet IB Application. We consider the specification of G_in and G_out of Figure 2C and use the triplet sIB algorithm over a simple text processing example.

6.3.1 Triplet sIB for Natural Language Processing. Our data consisted of seven Tarzan books by E. R. Burroughs, from which we got a sequence of about 600,000 words. We defined three random variables, Wp, W, and Wn, corresponding to the previous, current, and next word in the sequence, respectively. For simplicity, we defined W as the set of the 10 most frequent words in our data that are not neutral ("stop") words. Hence, we considered only word triplets in which the middle word was one of these 10 and defined p(wp, w, wn) as the relative frequency of a triplet {wp - w - wn} among all these triplets. Given p(wp, w, wn), we applied the triplet sIB algorithm to construct two systems of clusters: Tp for the first word in the triplets and Tn for the last word in the triplets, with |Tp| = 10, |Tn| = 10, and β−1 = 0, such that I(Tp, Tn; W) is maximized. In 50 different random initializations, the obtained solution always preserved more than 90% of the original information, I(Wp, Wn; W) = 1.6 nats, although the dimensions of the compressed distribution q(Tp, W, Tn) are more than 200 times smaller than those of the original matrix, p(Wp, W, Wn). The best solution preserved about 94% of the original information, and we concentrate on this solution in what follows. For every w ∈ W, a manual examination of the two clusters that maximize q(tp, w, tn) indicates that they consist of word pairs that are indicative of the word in between them, reflecting how Tp and Tn preserve the information about W (data not shown). To validate the predictive power of Tp and Tn about W, we examined another book by E. R. Burroughs (The Son of Tarzan), which was not used while estimating p(Wp, W, Wn) and constructing Tp and Tn.
In this book, for every occurrence of one of the 10 words in W, we try to predict it using its two neighbors, w p and wn . Specifically, w p and wn correspond to two clusters, tp ∈ T p , tn ∈ Tn ; thus, we predict the in-between word to be w ˆ = argmaxw q (w | tp , tn ). For comparison, we also predict the in-between word from the complete joint statistics, w ˆ = argmaxw p(w | w p , wn ), and while using a single neighbor, w ˆ = argmaxw p(w | w p ), and w ˆ = argmaxw p(w | wn ). In Table 7 we present the precision and recall (Sebastiani, 2002) for all these prediction schemes. In spite of the significant compression implied by Tp and Tn , the averaged precision of their predictions is similar to those obtained using the original complete joint statistics. In terms of recall, predictions from the triplet IB clusters are even superior to those using the original Wp , Wn variables, since using q (Tp , W, Tn ) allows us to make predictions even for triplets that were not included in our training data.
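The prediction scheme just described can be sketched in a few lines; the cluster assignments, words, and triplet counts below are invented stand-ins for illustration, not the paper's data:

```python
from collections import Counter

# Sketch: map each neighbor word to its cluster and predict the middle word
# as w_hat = argmax_w q(w | tp, tn). All names and counts here are hypothetical.

tp_of = {"the": 0, "a": 0, "huge": 1}    # hypothetical Wp -> Tp assignment
tn_of = {"roared": 0, "slept": 1}        # hypothetical Wn -> Tn assignment

triplets = [("the", "lion", "roared")] * 5 + [("a", "lion", "roared")] * 3 \
         + [("the", "ape", "slept")] * 4 + [("huge", "ape", "roared")] * 2

# Accumulate counts over (tp, w, tn); normalizing would give q(tp, w, tn),
# but the argmax prediction is unaffected by normalization.
counts = Counter((tp_of[wp], w, tn_of[wn]) for wp, w, wn in triplets)

def predict(wp, wn):
    """Predict the in-between word: w_hat = argmax_w q(w | tp, tn)."""
    tp, tn = tp_of[wp], tn_of[wn]
    cands = {w: c for (a, w, b), c in counts.items() if (a, b) == (tp, tn)}
    return max(cands, key=cands.get) if cands else None

# "a _ slept" never occurred as a triplet, but its clusters did:
print(predict("a", "slept"))   # -> ape
```

This illustrates the recall advantage noted above: cluster-level statistics allow predictions for triplets absent from the training data.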
Table 7: Precision and Recall Results for Triplet sIB.

                        Precision                      Recall
W              Tp,Tn   Wp,Wn    Wp     Wn      Tp,Tn   Wp,Wn    Wp     Wn
apeman (33)      5.9     7.4    4.3    1.5      24.2    30.3   81.8    3.0
apes (78)       43.3    25.6   93.6   11.4      16.7    14.1   37.2    6.4
eyes (177)      82.6    80.7   58.0   65.3      32.2    28.3   49.2   18.1
girl (240)      43.3    30.0    0.0   37.5       5.4     1.3    0.0    1.3
great (219)     91.7    92.0   58.0   91.0      50.2    47.5   21.5   55.7
jungle (241)    49.3    53.7    0.0   37.6      27.4    24.1    0.0   18.3
tarzan (48)     41.3    66.7   30.9    7.7      39.6    25.0   60.4   47.9
time (145)      70.4    82.2   70.6   31.1      47.6    25.5   53.1   34.5
two (148)       41.0    92.3   84.6   91.7      10.8     8.1    7.4   14.9
way (101)       59.6    80.8   61.3   61.3      27.7    20.8   18.8   18.8
Microaveraged   53.3    55.4   28.2   34.3      27.9    22.2   22.8   22.5
Notes: The left column indicates the word w ∈ W and in parentheses its number of occurrences in the test sequence. The next column presents the precision of the predictions while using the triplet sIB clusters statistics, that is, q (W | Tp , Tn ). The third column presents the precision while using the original joint statistics, that is, p(W | Wp , Wn ). The next two columns present the precision while using only one word neighbor for the prediction, that is, p(W | Wp ) and p(W | Wn ), respectively. The last four columns indicate the recall of the predictions while using these four different prediction schemes. The last row presents the microaveraged precision and recall.
We note that this type of application might be useful in tasks like speech recognition, optical character recognition, and more, where it is not feasible to use the original joint distribution due to its high dimensionality. The triplet IB clusters provide a reasonable alternative that is dramatically less demanding. For biological sequence data, the analysis demonstrated here might be useful to gain further insights about the data properties.

7 Discussion and Future Research

7.1 An Information-Theoretic Perspective. The traditional approach to the analysis and modeling of empirical data is through generative models and maximum likelihood parameter estimation. However, complex data sets rarely come with their "correct" parametric model. Thus, the choice of the parametric class often involves nonobvious assumptions about the data. Shannon's (1948) information theory represents a radically different approach, which is concerned with two fundamental trade-offs. The first is between compression and distortion and is known as rate distortion theory or (lossy) source coding. The second is between reliable error correction and its cost and is known as the capacity-cost trade-off or channel coding (Cover & Thomas, 1991).
These two problems are dual components of one larger problem: the trade-off between distortion and cost, that is, the minimum average cost that is required for communication with at most a given average distortion. Shannon was able to break this problem, in the point-to-point communication setting, into a rate-distortion trade-off on one hand and a capacity-cost trade-off on the other. In the former, one minimizes mutual information subject to an average distortion constraint, while in the latter, one maximizes information subject to an average cost constraint. The optimal solution to both problems requires only the specification of the distortion or cost functions, without any assumptions about the nature of the underlying distributions, although they should be accessible. This is in sharp contrast to the statistical modeling approach. Nonetheless, in this work, we use a fundamental concept in the statistical modeling literature, known as Bayesian networks, in order to extend the information-theoretic trade-offs paradigm. In the original IB, we balance between losing irrelevant distinctions made by X, while maintaining those that are relevant about Y. The first part—minimizing I (T; X)—is analogous to rate distortion, while the second part—maximizing I (T; Y)—is analogous to channel coding. Tishby et al. (1999) formulated both parts in a single principle and characterized its general solution. Here, we described a natural extension of these ideas, which allows the consideration of more complicated scenarios. This extension required an elevation of the point-to-point communication problem that lies behind the original IB to a network communication setting, a notoriously more difficult and largely unsolved problem. Nonetheless, as we demonstrated, such an extension is both viable and natural. 
In particular, the source-channel separation that exists in the original IB no longer holds, nor is it needed, as G_in and G_out replace the source and the channel terms of the original IB, respectively. It is thus possible that the current work will provide a way around the difficult multiterminal information theory problem, which remains unsolved to date (Cover & Thomas, 1991).

7.2 Finite Data and Sample Complexity Issues. An immediate obstacle in our information-theoretic approach is the assumption that the joint distribution, p(X), is known, as typically we are given only a sample out of this distribution. Several ideas were suggested in this context for the original IB problem and are worth mentioning. First, finite sample effects become more severe as the complexity of the new representation, T, increases. That is, overfitting is more evident as the number of clusters increases (Pereira et al., 1993; Still & Bialek, 2003). Thus, in particular, one should apply with great care agglomerative algorithms, which for small samples begin with substantial overfitting. Having said that, it is worth noting that the agglomerative IB algorithm is supported through the work of Gutman (1989). Specifically, theorem 1 in that work implies
that the JS divergence, used here to determine which clusters to merge, is the optimal criterion for deciding whether two empirical samples came from a single source. Next, given the empirical evidence, it might be useful to use the least informative distribution as the input to the IB analysis. The notion of least informative distributions, under expectation constraints, is related to maximum entropy methods and has been used recently for dimension reduction (Globerson & Tishby, 2004). One can prove generalization sample complexity bounds for the IB problem within this framework. Finally, it was shown that one can introduce a corrected IB functional in which finite sample effects are taken directly into account for the relevant information term (Still & Bialek, 2003; Atwal & Slonim, 2005). Extending these works in our context is clearly important yet beyond the scope of this work.

7.3 Future Research

7.3.1 Choosing the Number of Clusters. A natural question in cluster analysis is the estimation of the "correct" number of clusters. It is important to bear in mind, though, that this question might have more than one proper answer, for example, if there is a natural hierarchical structure in p(X) such that different resolutions convey multiple important insights. The number of clusters we extract is related to the trade-off parameter β, where low β values imply a relatively low resolution, while high β values suggest that a large number of clusters should be employed. The deterministic annealing multivariate IB algorithm seems to be most relevant here since it automatically adjusts the resolutions of the different clustering systems as β is increased. For the original IB problem, a recent rigorous treatment characterizes the maximal β value that can be employed for given data before overfitting effects take place (Still & Bialek, 2003). Extending this work in our context is a natural direction for future research.
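As a small sanity check on the JS merge criterion of section 7.2, one can verify numerically, in the simplest (original IB) setting where the merged variable T is the only relevant one, that merging two clusters reduces I(T; Y) by exactly q(t̄) times the JS divergence between their conditionals. The joint distribution below is an arbitrary random example, not data from the text:

```python
import numpy as np

def mi(p):
    """Mutual information (nats) of a joint distribution p[t, y]."""
    pt, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    return float((p * np.log(p / (pt @ py))).sum())

def js(pi1, pi2, p1, p2):
    """Jensen-Shannon divergence with weights (pi1, pi2), as in equation 5.8."""
    pbar = pi1 * p1 + pi2 * p2
    kl = lambda a, b: float((a * np.log(a / b)).sum())
    return pi1 * kl(p1, pbar) + pi2 * kl(p2, pbar)

rng = np.random.default_rng(0)
q = rng.random((3, 4)); q /= q.sum()        # arbitrary joint q(t, y), |T| = 3

# Merge t=0 and t=1 into t_bar: rows of the joint simply add up.
q_merged = np.vstack([q[0] + q[1], q[2]])

qt = q.sum(1)                               # cluster priors q(t)
q_bar = qt[0] + qt[1]                       # q(t_bar)
pi1, pi2 = qt[0] / q_bar, qt[1] / q_bar     # merger weights
delta = mi(q) - mi(q_merged)                # information lost by the merger
print(np.isclose(delta, q_bar * js(pi1, pi2, q[0] / qt[0], q[1] / qt[1])))
```

The identity holds exactly, which is why the JS divergence serves as the greedy merge cost in the agglomerative algorithm.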
7.3.2 Relation to Other Methods and Parametric Multivariate IB. The possible connections with other data analysis methods merit further investigation. The general structure of the multivariate iIB algorithm is reminiscent of EM (Dempster, Laird, & Rubin, 1977). Moreover, there are strong relations between the original IB problem and maximum likelihood estimation for mixture models (Slonim & Weiss, 2002). Hence, it is natural to look for further relationships between generative models and different multivariate IB problems. In particular, formulating new problems with the multivariate IB framework might suggest new generative models that are worth exploring. Other connections are, for example, to other dimensionality reduction techniques, such as independent component analysis (ICA) (Bell & Sejnowski, 1995). The parallel IB provides an ICA-like decomposition with
an important distinction. In contrast to ICA, it is aimed at preserving information about predefined aspects of the data, specified through the choice of the relevant variables in G_out. A parametric variant of our framework might be useful in different situations (Elidan & Friedman, 2003). This issue seems to be better addressed by the structural multivariate IB principle, L^(2). In this formulation, we aim to minimize the KL divergence between q(X, T) and the target class defined by G_out. If we further require a particular parametric form over this class, minimizing this KL corresponds to finding q(X, T) with minimum violation of the conditional independencies implied by G_out and with the appropriate parametric form. In particular, this means that the number of free parameters can be drastically reduced, avoiding possible redundant solutions.

7.3.3 Multivariate Relevance Compression Function. In the original IB problem, the trade-off in the IB functional is quantified by the relevance-compression function (also known as the information curve). Given p(X, Y), this concave function bounds the maximal achievable I(T; Y) for any level of compression, I(T; X) (Gilad-Bachrach, Navot, & Tishby, 2003). It is intimately related to the rate distortion function and the capacity cost function, and in a sense unifies them, as discussed earlier. It depends solely on p(X, Y) and characterizes the structure in this joint distribution: the clearer this structure is, the steeper this curve becomes. Analogously, given p(X), we may consider the multivariate relevance compression function as the two-dimensional optimal curve of the maximally attainable I^{G_out} for any level of I^{G_in}. From the variational principle, L^(1), it follows that the slope of this curve is β−1. Thus, assuming that this curve is differentiable, it must be downward concave, as in the original IB case. Importantly, though, this function depends on the specification of G_in and G_out.
That is, given p(X), there are many different such functions that characterize the structure in this joint distribution in multiple ways.

7.3.4 How to Specify G_in and G_out. The underlying assumption in our formulation is that G_in and G_out are provided as part of the problem setup. However, specifying these two networks might be far from trivial. For example, in the parallel IB case, where T = {T1, . . . , Tk}, setting k can be seen as a model selection task, and certainly not an easy one. Thus, an important issue is to develop automatic methods for specifying both networks. Possible guidance can come from the multivariate relevance compression function. Specifically, it seems plausible to prefer specifications that yield steeper relevance compression curves. This issue clearly deserves further investigation.

7.4 Conclusion. Our formulation corresponds to a rich family of optimization problems that are all unified under the same information-theoretic
framework. In particular, it allows one to extract structure from data in many different ways. In this work, we focused on three examples: parallel IB, symmetric IB, and triplet IB. However, we believe that this is only the tip of the iceberg. An immediate corollary of our analysis is that the general term of clustering conceals a broad family of many distinct problems that deserve special consideration. To the best of our knowledge, the multivariate IB framework described in this work is the first successful attempt to define these subproblems, solve them, and demonstrate their importance.

Appendix A: Proofs

Proof of Proposition 1. Using the multi-information definition in equation 3.1 and the fact that p(X) |= G, we get

\[
\mathcal{I}(\mathbf{X})
= E_{p(\mathbf{x})}\left[\log\frac{p(\mathbf{x})}{p(x_1)\cdots p(x_n)}\right]
= E_{p(\mathbf{x})}\left[\log\frac{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}{\prod_{i=1}^{n} p(x_i)}\right]
= \sum_{i=1}^{n} E_{p(\mathbf{x})}\left[\log\frac{p(x_i \mid \mathbf{pa}_{X_i}^G)}{p(x_i)}\right]
= \sum_{i=1}^{n} I(X_i; \mathbf{Pa}_{X_i}^G).
\]
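Proposition 1 is easy to check numerically on a small chain DAG X1 → X2 → X3; the distribution below is an arbitrary random example constructed to factorize according to G, not data from the text:

```python
import numpy as np

def mi(pxy):
    """Mutual information (nats) of a joint distribution pxy[x, y]."""
    px, py = pxy.sum(1, keepdims=True), pxy.sum(0, keepdims=True)
    return float((pxy * np.log(pxy / (px @ py))).sum())

# An arbitrary distribution that factorizes according to the chain X1 -> X2 -> X3.
rng = np.random.default_rng(1)
p1 = rng.dirichlet(np.ones(2))              # p(x1)
p21 = rng.dirichlet(np.ones(3), size=2)     # p(x2 | x1), rows indexed by x1
p32 = rng.dirichlet(np.ones(2), size=3)     # p(x3 | x2), rows indexed by x2
p = p1[:, None, None] * p21[:, :, None] * p32[None, :, :]   # p(x1, x2, x3)

# Multi-information I(X) = E_p log [ p(x) / (p(x1) p(x2) p(x3)) ].
marg = p1[:, None, None] * p.sum((0, 2))[None, :, None] * p.sum((0, 1))[None, None, :]
multi_info = float((p * np.log(p / marg)).sum())

# Proposition 1: I(X) = I(X2; Pa_X2) + I(X3; Pa_X3) = I(X2; X1) + I(X3; X2).
decomp = mi(p.sum(2)) + mi(p.sum(0))
print(np.isclose(multi_info, decomp))
```

When p is consistent with G, the multi-information decomposes exactly into the edge-wise mutual information terms, as the proposition states.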
Proof of Proposition 2.

\[
\begin{aligned}
D_{KL}[p\|G] &= \min_{q \models G} E_{p(\mathbf{x})}\left[\log\frac{p(x_1,\ldots,x_n)}{q(x_1,\ldots,x_n)}\right] \\
&= \min_{q \models G}\left\{ E_{p(\mathbf{x})}\left[\log\frac{p(x_1,\ldots,x_n)}{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}\right]
+ E_{p(\mathbf{x})}\left[\log\frac{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}{\prod_{i=1}^{n} q(x_i \mid \mathbf{pa}_{X_i}^G)}\right]\right\} \\
&= E_{p(\mathbf{x})}\left[\log\frac{p(x_1,\ldots,x_n)}{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}\right]
+ \min_{q \models G} \sum_{i=1}^{n}\sum_{x_i,\,\mathbf{pa}_{X_i}^G} p(\mathbf{pa}_{X_i}^G)\, p(x_i \mid \mathbf{pa}_{X_i}^G) \log\frac{p(x_i \mid \mathbf{pa}_{X_i}^G)}{q(x_i \mid \mathbf{pa}_{X_i}^G)} \\
&= E_{p(\mathbf{x})}\left[\log\frac{p(x_1,\ldots,x_n)}{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}\right]
+ \min_{q \models G} \sum_{i=1}^{n}\sum_{\mathbf{pa}_{X_i}^G} p(\mathbf{pa}_{X_i}^G)\, D_{KL}\!\left[p(x_i \mid \mathbf{pa}_{X_i}^G)\,\big\|\,q(x_i \mid \mathbf{pa}_{X_i}^G)\right],
\end{aligned}
\]

and since the right term is nonnegative and equals zero if and only if we choose q(x_i | pa_{X_i}^G) = p(x_i | pa_{X_i}^G), we get the desired result.

Proof of Proposition 3. We use proposition 2:
\[
\begin{aligned}
D_{KL}[p\|G] &= \min_{q \models G} E_{p(\mathbf{x})}\left[\log\frac{p(x_1,\ldots,x_n)}{q(x_1,\ldots,x_n)}\right]
= E_{p(\mathbf{x})}\left[\log\frac{p(x_1,\ldots,x_n)}{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}\right] \\
&= E_{p(\mathbf{x})}\left[\log\frac{\prod_{i=1}^{n} p(x_i \mid x_1,\ldots,x_{i-1})}{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}\right]
= \sum_{i=1}^{n} E_{p(\mathbf{x})}\left[\log\frac{p(x_i \mid x_1,\ldots,x_{i-1})}{p(x_i \mid \mathbf{pa}_{X_i}^G)}\right] \\
&= \sum_{i=1}^{n} I\!\left(X_i;\, \{X_1,\ldots,X_{i-1}\} \setminus \mathbf{Pa}_{X_i}^G \,\middle|\, \mathbf{Pa}_{X_i}^G\right),
\end{aligned}
\]

where we used the consistency of the order X_1, ..., X_n with the order of the DAG G. To prove the second part of the proposition, we note that

\[
\begin{aligned}
D_{KL}[p\|G] &= E_{p(\mathbf{x})}\left[\log\frac{p(x_1,\ldots,x_n)}{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}\right] \\
&= E_{p(\mathbf{x})}\left[\log\frac{p(x_1,\ldots,x_n)}{\prod_{i=1}^{n} p(x_i)}\right]
- E_{p(\mathbf{x})}\left[\log\frac{\prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_{X_i}^G)}{\prod_{i=1}^{n} p(x_i)}\right] \\
&= \mathcal{I}(\mathbf{X}) - \sum_{i=1}^{n} E_{p(\mathbf{x})}\left[\log\frac{p(x_i \mid \mathbf{pa}_{X_i}^G)}{p(x_i)}\right]
= \mathcal{I}(\mathbf{X}) - \sum_{i=1}^{n} I(X_i; \mathbf{Pa}_{X_i}^G).
\end{aligned}
\]
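The second identity of proposition 3 can likewise be checked numerically for a joint that does not factorize according to G, using the M-projection of proposition 2 (p's own conditionals). The chain DAG and random joint below are arbitrary illustrations:

```python
import numpy as np

def mi(pxy):
    """Mutual information (nats) of a joint distribution pxy[x, y]."""
    px, py = pxy.sum(1, keepdims=True), pxy.sum(0, keepdims=True)
    return float((pxy * np.log(pxy / (px @ py))).sum())

rng = np.random.default_rng(2)
p = rng.random((2, 3, 2)); p /= p.sum()     # arbitrary p(x1,x2,x3), NOT consistent with G

p12, p23 = p.sum(2), p.sum(0)               # p(x1, x2) and p(x2, x3)
p1, p2, p3 = p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1))

# M-projection onto the chain family X1 -> X2 -> X3 (proposition 2):
# q(x) = p(x1, x2) p(x3 | x2), built from p's own conditionals.
q = p12[:, :, None] * (p23 / p23.sum(1, keepdims=True))[None, :, :]
dkl = float((p * np.log(p / q)).sum())      # D_KL[p || G]

# Multi-information of p.
multi = float((p * np.log(p / (p1[:, None, None] * p2[None, :, None] * p3[None, None, :]))).sum())

# Proposition 3: D_KL[p || G] = I(X) - [I(X2; X1) + I(X3; X2)].
print(np.isclose(dkl, multi - mi(p12) - mi(p23)))
```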
Proof of Theorem 1. The basic idea is to find stationary points of L^(1) subject to the normalization constraints. Thus, we add Lagrange multipliers and use definition 1 to get the Lagrangian,

\[
\tilde{\mathcal{L}}[q(\mathbf{X},\mathbf{T})]
= \sum_{\ell=1}^{k} I(T_\ell; \mathbf{U}_\ell)
- \beta\left[\sum_{i=1}^{n} I(X_i; \mathbf{V}_{X_i}) + \sum_{\ell=1}^{k} I(T_\ell; \mathbf{V}_{T_\ell})\right]
+ \sum_{\ell}\sum_{\mathbf{u}_\ell} \lambda(\mathbf{u}_\ell) \sum_{t_\ell} q(t_\ell \mid \mathbf{u}_\ell),
\tag{A.1}
\]

where we drop terms that depend only on the observed variables X. To differentiate L̃ with respect to a specific parameter q(t_j | u_j), we use the following two lemmas. In the proofs of these two lemmas, we assume that q(X, T) |= G_in and that the T variables are all leaves in G_in.

Lemma 1. Under the above normalization constraints, for every event a over X ∪ T (that is, a is some assignment to some subset of X ∪ T), we have

\[
\frac{\partial q(a)}{\partial q(t_j \mid \mathbf{u}_j)} = q(\mathbf{u}_j)\, q(a \mid t_j, \mathbf{u}_j).
\tag{A.2}
\]

Proof. Let Z denote all the random variables in X ∪ T such that their values are not set by the event a. In the following, the notation \(\sum_{\mathbf{z},a} q(\mathbf{x},\mathbf{t})\) means that the sum is only over the variables in Z (where the others are set through a):

\[
\frac{\partial q(a)}{\partial q(t_j \mid \mathbf{u}_j)}
= \frac{\partial}{\partial q(t_j \mid \mathbf{u}_j)} \sum_{\mathbf{z},a} q(\mathbf{x},\mathbf{t})
= \frac{\partial}{\partial q(t_j \mid \mathbf{u}_j)} \sum_{\mathbf{z},a} p(\mathbf{x})\prod_{\ell=1}^{k} q(t_\ell \mid \mathbf{u}_\ell)
= \sum_{\mathbf{z},a} p(\mathbf{x})\,\frac{\partial}{\partial q(t_j \mid \mathbf{u}_j)} \prod_{\ell=1}^{k} q(t_\ell \mid \mathbf{u}_\ell).
\]

Clearly the derivatives are nonzero only for terms in which t_ℓ = t_j and u_ℓ = u_j. For each such term, the derivative is simply \(\prod_{\ell \neq j} q(t_\ell \mid \mathbf{u}_\ell)\). Dividing and multiplying every such term by q(t_j | u_j), we obtain

\[
\frac{\partial q(a)}{\partial q(t_j \mid \mathbf{u}_j)}
= \frac{1}{q(t_j \mid \mathbf{u}_j)} \sum_{\mathbf{z}\setminus\{t_j,\mathbf{u}_j\},\,a,\,t_j,\mathbf{u}_j} p(\mathbf{x})\prod_{\ell=1}^{k} q(t_\ell \mid \mathbf{u}_\ell)
= \frac{q(a, t_j, \mathbf{u}_j)}{q(t_j \mid \mathbf{u}_j)}
= q(\mathbf{u}_j)\, q(a \mid t_j, \mathbf{u}_j).
\]

Using this lemma, we get:

Lemma 2. For every Y, Z ⊆ X ∪ T,

\[
\frac{\partial I(\mathbf{Y};\mathbf{Z})}{\partial q(t_j \mid \mathbf{u}_j)}
= q(\mathbf{u}_j)\left[\sum_{\mathbf{y},\mathbf{z}} q(\mathbf{y},\mathbf{z} \mid t_j,\mathbf{u}_j)\log\frac{q(\mathbf{y}\mid\mathbf{z})}{q(\mathbf{y})} - 1\right].
\tag{A.3}
\]

Proof.

\[
\frac{\partial I(\mathbf{Y};\mathbf{Z})}{\partial q(t_j \mid \mathbf{u}_j)}
= \sum_{\mathbf{y},\mathbf{z}} \log\frac{q(\mathbf{y}\mid\mathbf{z})}{q(\mathbf{y})}\,\frac{\partial q(\mathbf{y},\mathbf{z})}{\partial q(t_j \mid \mathbf{u}_j)}
+ \sum_{\mathbf{y},\mathbf{z}} \frac{\partial q(\mathbf{y},\mathbf{z})}{\partial q(t_j \mid \mathbf{u}_j)}
- \sum_{\mathbf{y},\mathbf{z}} q(\mathbf{z}\mid\mathbf{y})\,\frac{\partial q(\mathbf{y})}{\partial q(t_j \mid \mathbf{u}_j)}
- \sum_{\mathbf{y},\mathbf{z}} q(\mathbf{y}\mid\mathbf{z})\,\frac{\partial q(\mathbf{z})}{\partial q(t_j \mid \mathbf{u}_j)}.
\]

Applying lemma 1 for each of these derivatives, we get the desired result.

We now can differentiate each mutual information term that appears in L̃ of equation A.1. Note that we can ignore terms that do not depend on the value of T_j, since these are constants with respect to q(t_j | u_j). Therefore, by taking the derivative and equating to zero, we get

\[
\begin{aligned}
\log q(t_j \mid \mathbf{u}_j) = \log q(t_j)
- \beta \Bigg[
&\sum_{i:\,T_j\in\mathbf{V}_{X_i}}\; \sum_{\mathbf{v}_{X_i}^{-j},\,x_i} q(\mathbf{v}_{X_i}^{-j}\mid\mathbf{u}_j)\, q(x_i\mid\mathbf{v}_{X_i}^{-j},\mathbf{u}_j)\, \log\frac{p(x_i)}{q(x_i\mid\mathbf{v}_{X_i}^{-j},t_j)} \\
+ &\sum_{\ell:\,T_j\in\mathbf{V}_{T_\ell}}\; \sum_{\mathbf{v}_{T_\ell}^{-j},\,t_\ell} q(\mathbf{v}_{T_\ell}^{-j}\mid\mathbf{u}_j)\, q(t_\ell\mid\mathbf{v}_{T_\ell}^{-j},\mathbf{u}_j)\, \log\frac{q(t_\ell)}{q(t_\ell\mid\mathbf{v}_{T_\ell}^{-j},t_j)} \\
+ &\sum_{\mathbf{v}_{T_j}} q(\mathbf{v}_{T_j}\mid\mathbf{u}_j)\, \log\frac{q(\mathbf{v}_{T_j})}{q(\mathbf{v}_{T_j}\mid t_j)}
\Bigg] + c(\mathbf{u}_j),
\end{aligned}
\tag{A.4}
\]

where c(u_j) is a term that depends only on u_j. To get the desired KL form, we add and subtract

\[
\sum_{\mathbf{v}_{X_i}^{-j},\,x_i} q(\mathbf{v}_{X_i}^{-j}\mid\mathbf{u}_j)\, q(x_i\mid\mathbf{v}_{X_i}^{-j},\mathbf{u}_j)\, \log\frac{q(x_i\mid\mathbf{v}_{X_i}^{-j},\mathbf{u}_j)}{p(x_i)}
\tag{A.5}
\]

for every term in the first outside summation. Note again that this is possible since we can absorb in c(u_j) every expression that depends only on u_j. A similar transformation applies to the other two summations on the right-hand side of equation A.4. Hence, we end up with

\[
\begin{aligned}
\log q(t_j \mid \mathbf{u}_j) = \log q(t_j)
- \beta \Bigg[
&\sum_{i:\,T_j\in\mathbf{V}_{X_i}} E_{q(\cdot\mid\mathbf{u}_j)}\!\left[ D_{KL}\!\left[q(X_i\mid\mathbf{v}_{X_i}^{-j},\mathbf{u}_j)\,\big\|\,q(X_i\mid\mathbf{v}_{X_i}^{-j},t_j)\right]\right] \\
+ &\sum_{\ell:\,T_j\in\mathbf{V}_{T_\ell}} E_{q(\cdot\mid\mathbf{u}_j)}\!\left[ D_{KL}\!\left[q(T_\ell\mid\mathbf{v}_{T_\ell}^{-j},\mathbf{u}_j)\,\big\|\,q(T_\ell\mid\mathbf{v}_{T_\ell}^{-j},t_j)\right]\right] \\
+ &\; D_{KL}\!\left[q(\mathbf{V}_{T_j}\mid\mathbf{u}_j)\,\big\|\,q(\mathbf{V}_{T_j}\mid t_j)\right]
\Bigg] + c(\mathbf{u}_j).
\end{aligned}
\tag{A.6}
\]
Finally, taking the exponent and applying the normalization constraints for each distribution q(t_j | u_j) completes the proof.

Proof of Theorem 2. The following lemma allows drawing the connection between T_j and the other variables after every merger.

Lemma 3. Let Y, Z ⊂ X ∪ T. Then

\[
q(\mathbf{z}, \bar{t}_j) = q(\mathbf{z}, t_j^{\,l}) + q(\mathbf{z}, t_j^{\,r}),
\tag{A.7}
\]

and

\[
q(\mathbf{y}\mid\mathbf{z}, \bar{t}_j) = \pi_{l,\mathbf{z}}\; q(\mathbf{y}\mid\mathbf{z}, t_j^{\,l}) + \pi_{r,\mathbf{z}}\; q(\mathbf{y}\mid\mathbf{z}, t_j^{\,r}).
\tag{A.8}
\]

Proof. We use the following notations: W = Z ∩ U_j, Z^{-W} = Z \ {W}, U_j^{-W} = U_j \ {W}. Note that in principle, it might be that W = ∅. For every z, t̄_j we have

\[
q(\mathbf{z},\bar{t}_j) = q(\mathbf{z})\, q(\bar{t}_j\mid\mathbf{z})
= q(\mathbf{z}) \sum_{\mathbf{u}_j^{-w}} q(\mathbf{u}_j^{-w}\mid\mathbf{z})\, q(\bar{t}_j\mid\mathbf{z}^{-w},\mathbf{w},\mathbf{u}_j^{-w})
= q(\mathbf{z}) \sum_{\mathbf{u}_j^{-w}} q(\mathbf{u}_j^{-w}\mid\mathbf{z})\, q(\bar{t}_j\mid\mathbf{u}_j),
\]

where in the last step, we used the structure of G_in and the fact that Z^{-W} ∩ U_j = ∅. Using equation 5.5, we find that

\[
q(\mathbf{z},\bar{t}_j)
= q(\mathbf{z}) \sum_{\mathbf{u}_j^{-w}} q(\mathbf{u}_j^{-w}\mid\mathbf{z}) \left[ q(t_j^{\,l}\mid\mathbf{u}_j) + q(t_j^{\,r}\mid\mathbf{u}_j) \right]
= q(\mathbf{z}) \sum_{\mathbf{u}_j^{-w}} q(\mathbf{u}_j^{-w}\mid\mathbf{z}) \left[ q(t_j^{\,l}\mid\mathbf{z}^{-w},\mathbf{w},\mathbf{u}_j^{-w}) + q(t_j^{\,r}\mid\mathbf{z}^{-w},\mathbf{w},\mathbf{u}_j^{-w}) \right],
\]

where again we used the structure of G_in. Since Z = Z^{-W} ∪ {W}, we get

\[
q(\mathbf{z},\bar{t}_j)
= q(\mathbf{z}) \sum_{\mathbf{u}_j^{-w}} \left[ q(\mathbf{u}_j^{-w}, t_j^{\,l}\mid\mathbf{z}) + q(\mathbf{u}_j^{-w}, t_j^{\,r}\mid\mathbf{z}) \right]
= q(\mathbf{z}, t_j^{\,l}) + q(\mathbf{z}, t_j^{\,r}),
\]

as required. To prove the second part, we first note that if q(z, t̄_j) = 0, then both sides of equation A.8 are trivially equal; thus, we assume that this is not the case. Then, for every y, z, t̄_j, we have

\[
q(\mathbf{y}\mid\mathbf{z},\bar{t}_j)
= \frac{q(\mathbf{y},\mathbf{z},\bar{t}_j)}{q(\mathbf{z},\bar{t}_j)}
= \frac{q(\mathbf{y},\mathbf{z},t_j^{\,l}) + q(\mathbf{y},\mathbf{z},t_j^{\,r})}{q(\mathbf{z},\bar{t}_j)}
= \frac{q(t_j^{\,l}\mid\mathbf{z})}{q(\bar{t}_j\mid\mathbf{z})}\; q(\mathbf{y}\mid\mathbf{z},t_j^{\,l})
+ \frac{q(t_j^{\,r}\mid\mathbf{z})}{q(\bar{t}_j\mid\mathbf{z})}\; q(\mathbf{y}\mid\mathbf{z},t_j^{\,r}).
\]

Hence from equation 5.6, we get the desired form.

Next, we need the following simple lemma. Recall that we denote by T_j^{bef}, T_j^{aft} the random variables that correspond to T_j before and after the merger, respectively. Let V = V^{-j} ∪ {T_j} be a set of random variables that includes T_j, and let V^{bef} = V^{-j} ∪ {T_j^{bef}}, and similarly for V^{aft}. Let Y be a set of random variables such that T_j ∉ Y. Using these notations, we have:

Lemma 4. The reduction of the mutual information I(Y; V) due to the merger {t_j^l, t_j^r} ⇒ t̄_j is given by

\[
\Delta I(\mathbf{Y};\mathbf{V}) \equiv I(\mathbf{Y};\mathbf{V}^{bef}) - I(\mathbf{Y};\mathbf{V}^{aft})
= q(\bar{t}_j)\cdot E_{q(\cdot\mid\bar{t}_j)}\!\left[ JS_{\mathbf{v}^{-j}}\!\left[ q(\mathbf{Y}\mid t_j^{\,l},\mathbf{v}^{-j}),\; q(\mathbf{Y}\mid t_j^{\,r},\mathbf{v}^{-j}) \right]\right].
\]
Proof. Using the chain rule for mutual information (Cover & Thomas, 1991, p. 22), we get

\[
\Delta I(\mathbf{Y};\mathbf{V})
= I(\mathbf{V}^{-j};\mathbf{Y}) + I(T_j^{bef};\mathbf{Y}\mid\mathbf{V}^{-j})
- I(\mathbf{V}^{-j};\mathbf{Y}) - I(T_j^{aft};\mathbf{Y}\mid\mathbf{V}^{-j})
= I(T_j^{bef};\mathbf{Y}\mid\mathbf{V}^{-j}) - I(T_j^{aft};\mathbf{Y}\mid\mathbf{V}^{-j}).
\]

From equation 5.5, we find that

\[
\Delta I(\mathbf{Y};\mathbf{V}) = \sum_{\mathbf{v}^{-j}} q(\mathbf{v}^{-j})\, \Delta I(\mathbf{v}^{-j}),
\]

where we used the notation

\[
\begin{aligned}
\Delta I(\mathbf{v}^{-j}) =
&\sum_{\mathbf{y}} q(t_j^{\,l},\mathbf{y}\mid\mathbf{v}^{-j}) \log\frac{q(\mathbf{y}\mid t_j^{\,l},\mathbf{v}^{-j})}{q(\mathbf{y}\mid\mathbf{v}^{-j})}
+ \sum_{\mathbf{y}} q(t_j^{\,r},\mathbf{y}\mid\mathbf{v}^{-j}) \log\frac{q(\mathbf{y}\mid t_j^{\,r},\mathbf{v}^{-j})}{q(\mathbf{y}\mid\mathbf{v}^{-j})} \\
&- \sum_{\mathbf{y}} q(\bar{t}_j,\mathbf{y}\mid\mathbf{v}^{-j}) \log\frac{q(\mathbf{y}\mid\bar{t}_j,\mathbf{v}^{-j})}{q(\mathbf{y}\mid\mathbf{v}^{-j})}.
\end{aligned}
\]

Using lemma 3 (with Z = Y ∪ V^{-j}), we obtain

\[
q(\bar{t}_j,\mathbf{y}\mid\mathbf{v}^{-j}) = q(t_j^{\,l},\mathbf{y}\mid\mathbf{v}^{-j}) + q(t_j^{\,r},\mathbf{y}\mid\mathbf{v}^{-j}).
\]

Setting this in the previous equation, we get

\[
\begin{aligned}
\Delta I(\mathbf{v}^{-j})
&= \sum_{\mathbf{y}} q(t_j^{\,l},\mathbf{y}\mid\mathbf{v}^{-j}) \log\frac{q(\mathbf{y}\mid t_j^{\,l},\mathbf{v}^{-j})}{q(\mathbf{y}\mid\bar{t}_j,\mathbf{v}^{-j})}
+ \sum_{\mathbf{y}} q(t_j^{\,r},\mathbf{y}\mid\mathbf{v}^{-j}) \log\frac{q(\mathbf{y}\mid t_j^{\,r},\mathbf{v}^{-j})}{q(\mathbf{y}\mid\bar{t}_j,\mathbf{v}^{-j})} \\
&= q(\bar{t}_j\mid\mathbf{v}^{-j})\,\pi_{l,\mathbf{v}^{-j}} \sum_{\mathbf{y}} q(\mathbf{y}\mid t_j^{\,l},\mathbf{v}^{-j}) \log\frac{q(\mathbf{y}\mid t_j^{\,l},\mathbf{v}^{-j})}{q(\mathbf{y}\mid\bar{t}_j,\mathbf{v}^{-j})}
+ q(\bar{t}_j\mid\mathbf{v}^{-j})\,\pi_{r,\mathbf{v}^{-j}} \sum_{\mathbf{y}} q(\mathbf{y}\mid t_j^{\,r},\mathbf{v}^{-j}) \log\frac{q(\mathbf{y}\mid t_j^{\,r},\mathbf{v}^{-j})}{q(\mathbf{y}\mid\bar{t}_j,\mathbf{v}^{-j})}.
\end{aligned}
\]

However, using again lemma 3, we see that

\[
q(\mathbf{y}\mid\bar{t}_j,\mathbf{v}^{-j}) = \pi_{l,\mathbf{v}^{-j}}\; q(\mathbf{y}\mid t_j^{\,l},\mathbf{v}^{-j}) + \pi_{r,\mathbf{v}^{-j}}\; q(\mathbf{y}\mid t_j^{\,r},\mathbf{v}^{-j}).
\]

Therefore, using the JS definition in equation 5.8, we get

\[
\Delta I(\mathbf{v}^{-j}) = q(\bar{t}_j\mid\mathbf{v}^{-j})\cdot JS_{\mathbf{v}^{-j}}\!\left[ q(\mathbf{Y}\mid t_j^{\,l},\mathbf{v}^{-j}),\; q(\mathbf{Y}\mid t_j^{\,r},\mathbf{v}^{-j}) \right].
\]

Setting this back in the expression for ΔI(Y; V), we get

\[
\Delta I(\mathbf{Y};\mathbf{V})
= \sum_{\mathbf{v}^{-j}} q(\mathbf{v}^{-j})\, q(\bar{t}_j\mid\mathbf{v}^{-j})\cdot JS_{\mathbf{v}^{-j}}\!\left[ q(\mathbf{Y}\mid t_j^{\,l},\mathbf{v}^{-j}),\; q(\mathbf{Y}\mid t_j^{\,r},\mathbf{v}^{-j}) \right]
= q(\bar{t}_j)\cdot E_{q(\cdot\mid\bar{t}_j)}\!\left[ JS_{\mathbf{v}^{-j}}\!\left[ q(\mathbf{Y}\mid t_j^{\,l},\mathbf{v}^{-j}),\; q(\mathbf{Y}\mid t_j^{\,r},\mathbf{v}^{-j}) \right]\right].
\]

Using this lemma, we now prove the theorem. Note that the only information terms in L = I^{G_out} − β^{-1} I^{G_in} that change due to a merger in T_j are those that involve T_j. Therefore,

\[
\Delta\mathcal{L}(t_j^{\,l}, t_j^{\,r})
= \sum_{i:\,T_j\in\mathbf{V}_{X_i}} \Delta I(X_i;\mathbf{V}_{X_i})
+ \sum_{\ell:\,T_j\in\mathbf{V}_{T_\ell}} \Delta I(T_\ell;\mathbf{V}_{T_\ell})
+ \Delta I(T_j;\mathbf{V}_{T_j})
- \beta^{-1}\,\Delta I(T_j;\mathbf{U}_j).
\tag{A.9}
\]
Applying lemma 4 for each of these information terms, we get the desired form.

Appendix B: Implementation and Preprocess Details

In this appendix we describe the details of the implementation and the preprocess applied in our examples. In several cases, in order to avoid too high dimensionality, we apply feature selection by information gain before the clustering is applied. Specifically, given a joint distribution, p(X, Y), we sort all X values by their contribution to I(X; Y), \( p(x) \sum_y p(y \mid x) \log \frac{p(y \mid x)}{p(y)} \), and select only the top sorted values for further analysis.

Parallel sIB for Style Versus Topic Text Clustering
- All books were downloaded from the Gutenberg Project, http://promo.net/pg/.
- Uppercase characters were lowered, digits were united to one symbol, and nonalphanumeric characters were ignored.
- Each book was split into "documents" consisting of 200 successive words each, ending up with 1346 documents and 15,528 distinct words.
- After normalization, we got an estimated joint distribution, p(D, W), where p(D) is uniform and each entry indicates the probability that a random word position is equal to w ∈ W while the document is d ∈ D.
- We applied the parallel sIB algorithm to this p(D, W) with T = {T1, T2}, |Tj| = 2, and 5 different initializations per Tj.
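The document-splitting and normalization steps above can be sketched as follows; `build_joint` and the toy word sequence are hypothetical names for illustration, not the paper's code or data:

```python
from collections import Counter

def build_joint(words, doc_len=200):
    """Split a word sequence into fixed-length 'documents' and estimate a
    joint p(D, W) with uniform p(D), as described in the preprocessing above."""
    docs = [words[i:i + doc_len] for i in range(0, len(words) - doc_len + 1, doc_len)]
    total = doc_len * len(docs)
    # p(d, w) = count of w in document d, normalized by the total word count.
    return docs, {(d, w): n / total
                  for d, doc in enumerate(docs)
                  for w, n in Counter(doc).items()}

docs, joint = build_joint(("the ape saw the lion " * 80).split())
print(len(docs), round(sum(joint.values()), 6))   # 2 documents, probabilities sum to 1
```

With a fixed document length, p(D) comes out uniform automatically, since every document contributes exactly `doc_len` word positions.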
Parallel sIB for Gene Expression Data Analysis
- We used the gene expression measurements of approximately 6800 human genes in 72 samples of leukemia (Golub et al., 1999).
- We removed about 1500 genes that were not expressed in the data and normalized the measurements of the remaining 5288 genes in each sample independently, to get an estimated joint distribution p(S, G) over samples and genes, with uniform p(S).
- We sorted all genes by their contribution to I(S; G) and selected the 500 most informative ones.
- After renormalization of the measurements in each sample, we ended up with an estimated joint distribution, p(S, G), with |S| = 72, |G| = 500, and p(S) = 1/|S|.
- We applied the parallel sIB algorithm to this p(S, G) with T = {T1, . . . , T4}, |Tj| = 2, and 5 different initializations per Tj.
Symmetric dIB and iIB for Word-Topic Clustering
- We used the 20-news-group corpus (Lang, 1995), which contains about 20,000 documents and messages, distributed among 20 discussion groups, or topics.
- We removed all file headers, lowered uppercase characters, united digits into one symbol, and ignored nonalphanumeric characters.
- We further removed stop words and words with only one occurrence, ending up with a counts matrix of |D| = 19,997 documents versus |W| = 74,000 unique words.
- By summing the word counts of all the documents in each topic and applying simple normalization, we extracted an estimated joint distribution, p(W, C), of words versus topics with |C| = 20.
- We sorted all words by their contribution to I(W; C) and selected the 200 most informative ones. After renormalization, we ended up with a joint distribution with |W| = 200, |C| = 20.
- We applied the symmetric dIB algorithm to this joint distribution. Increasing β was done through β(r+1) = (1 + εβ) β(r), with εβ = 0.001 and β(0) = εβ. The split detection parameter was b = 1/β; that is, as β increases, the algorithm becomes more liberal for declaring cluster splits. The scaling factor for the stochastic duplication was α = 0.005. However, before the first split, we used α1 = 0.95, so as to avoid the attractor of the trivial fixed point, q(Tj | Uj) = q(Tj).
Symmetric sIB and aIB for Protein Sequence Analysis
- Each protein was represented as a count vector over all the 38,421 different 4-mers of amino acids present in the data.
- After normalizing the counts for each protein independently, we got a joint distribution p(R, F) with p(R) = 1/|R|.
- We sorted all features by their contribution to I(R; F) and selected the top 2,000.
- After renormalization, we ended up with a joint distribution p(R, F) with |R| = 421, |F| = 2,000, and p(R) = 1/|R|.
- In the sIB algorithm, we used for the initialization the strategy described in Slonim & Tishby (2000): we randomly initialize only TF and optimize it using the original sIB algorithm (Slonim et al., 2002), such that I(TF; R) is maximized. Given this TF, we randomly initialize TR and again use the original sIB algorithm to optimize it such that I(TR; TF) is maximized. We use these two solutions as the initialization to the symmetric sIB algorithm, and continue by the general framework described in Figure 6 until convergence is attained. We repeat this procedure 100 times and select the solution that maximizes I(TR; TF).
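The information-gain feature selection used throughout this appendix (sorting X values by their contribution to I(X; Y)) can be sketched as follows; the toy joint distribution is an arbitrary random example:

```python
import numpy as np

def info_gain(p):
    """Per-value contribution of x to I(X; Y): p(x) * KL[p(y|x) || p(y)].
    `p` is a joint distribution p[x, y]; zero entries contribute zero."""
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(p > 0, p * np.log(p / (px * py)), 0.0).sum(1)

rng = np.random.default_rng(3)
p = rng.random((6, 4)); p /= p.sum()        # toy joint p(x, y)
gain = info_gain(p)
top = np.argsort(gain)[::-1][:2]            # indices of the 2 most informative x values
```

Each per-value term is a weighted KL divergence and hence nonnegative, and the terms sum to I(X; Y), so truncating the sorted list keeps the values carrying the most relevant information.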
Triplet sIB for Natural Language Processing
r
r r r
The seven Tarzan books, available from the Gutenberg project (http://promo.net/pg/), were Tarzan and the Jewels of Opar, Tarzan of the Apes, Tarzan the Terrible, Tarzan the Untamed, The Beasts of Tarzan, The Jungle Tales of Tarzan, and The Return of Tarzan. We followed the same preprocessing as in section 6.1.1, ending up with a sequence of 580,806 words taken from a vocabulary of 19,458 distinct words. The 10 most frequent words in the above books that are not stop words (i.e., W values) were apemans, apes, eyes, girl, great, jungle, tarzan, time, two, and way. After removing triplets with fewer than three occurrences, we had 672 different triplets with a total of 4,479 occurrences. The number
1786
r
r r
N. Slonim, N. Friedman, and N. Tishby
of distinct first words was 90, and the number of distinct last words was 233. Thus, after simple normalization, we had an estimated joint distribution, p(Wp, W, Wn), with |Wp| = 90, |W| = 10, |Wn| = 233. As in the symmetric IB, we first randomly initialize Tp and optimize it using the original sIB algorithm (Slonim et al., 2002), such that I(Tp; W) is maximized. Similarly, we find Tn such that I(Tn; W) is maximized. Using these initializations and the general scheme described in Figure 6, we optimize both systems of clusters until they converge to a local maximum of I(Tp, Tn; W). We repeat this procedure for 50 different initializations to extract different locally optimal solutions. In the predictions over the new sequence (The Son of Tarzan), for each of the 10 words in W, we define the following quantities: A1(w) is the number of w occurrences correctly predicted as w (true positives); A2(w) is the number of words incorrectly predicted as w (false positives); A3(w) is the number of w occurrences incorrectly not predicted as w (false negatives). The precision and recall for w are then defined as Prec(w) = A1(w)/(A1(w) + A2(w)) and Rec(w) = A1(w)/(A1(w) + A3(w)), where the microaveraged precision and recall are defined by (Sebastiani, 2002):

<Prec> = Σ_w A1(w) / Σ_w [A1(w) + A2(w)],   <Rec> = Σ_w A1(w) / Σ_w [A1(w) + A3(w)].   (B.1)
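The microaveraging in equation B.1 pools the per-word counts before dividing, rather than averaging per-word ratios. A minimal sketch with hypothetical counts (the word keys and numbers below are illustrative, not taken from the Tarzan data):

```python
# Micro-averaged precision and recall (equation B.1).
def micro_precision_recall(counts):
    """counts maps each word w to (A1, A2, A3) =
    (true positives, false positives, false negatives)."""
    tp = sum(a1 for a1, a2, a3 in counts.values())
    fp = sum(a2 for a1, a2, a3 in counts.values())
    fn = sum(a3 for a1, a2, a3 in counts.values())
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative counts for three words.
counts = {"jungle": (8, 2, 4), "tarzan": (20, 5, 0), "girl": (2, 3, 6)}
prec, rec = micro_precision_recall(counts)
# Pooled counts: tp = 30, fp = 10, fn = 10, so prec = rec = 0.75.
```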
Appendix C: Notations

Capital letters, {X, Y, T}, denote names of random variables. Lowercase letters, {x, y, t}, denote specific values taken by these variables. p(X) denotes the probability distribution function, and p(x) denotes the specific (scalar) value, p(X = x)—the probability that the assignment to the random variable X is the specific value x. We further use X ∼ p(X) to denote that X is distributed according to p(X). Probability distributions that are given as input and do not change during the analysis are denoted by p(·), while probability distributions that involve changeable parameters are denoted by q(·). Calligraphic notations, {X, Y, T}, denote the spaces to which the values of the random variables belong. Thus, X is the set of all possible values (or assignments) to X. The notation Σ_x means summation over all x ∈ X, and |X| stands for the cardinality of X. For simplicity, we limit the discussion to discrete random variables with a finite number of possible values. Sets of random variables are denoted by boldface capital letters {X, T} and specific values taken by those sets by boldface lowercase letters {x, t}.
Multivariate Information Bottleneck
The boldface calligraphic notation T stands for the set of all possible assignments to T.

Acknowledgments

Insightful discussions with Ori Mosenzon are greatly appreciated. We thank Gill Bejerano, who prepared the GST proteins data set and brought to our attention the existence of the new Omega class in these data. We thank Esther Singer and Michal Rosen-Zvi for comments on previous drafts of this article. This work was supported in part by the Israel Science Foundation (ISF), the Israeli Ministry of Science, and the US-Israel Bi-National Science Foundation. N.S. was also supported by an Eshkol fellowship. N.F. was also supported by an Alon fellowship and the Harry and Abe Sherman Senior Lectureship in Computer Science. Experiments reported in this work were run on equipment funded by an ISF Basic Equipment Grant.

References

Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M. D., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Hermjakob, H., Hulo, N., Jonassen, I., Kahn, D., Kanapin, A., Karavidopoulou, Y., Lopez, R., Marx, B., Mulder, N. J., Oinn, T. M., Pagni, M., Servant, F., Sigrist, C. J., & Zdobnov, E. M. (2000). InterPro—an integrated documentation resource for protein families, domains and functional sites. Bioinformatics, 16, 1145–1150.
Attwood, T. K., Croning, M. D. R., Flower, D. R., Lewis, A. P., Mabey, J. E., Scordis, P., Selley, J., & Wright, W. (2000). PRINTS-S: The database formerly known as PRINTS. Nucl. Acids Res., 28, 225–227.
Atwal, G. S., & Slonim, N. (2005). Information bottleneck with finite samples. Unpublished manuscript.
Baker, L. D., & McCallum, A. K. (1998). Distributional clustering of words for text classification. In Proc. of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM.
Bell, A. J., & Sejnowski, T. J. (1995).
An information-maximization approach to blind separation and blind deconvolution. Neural Comp., 7, 1129–1159.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Csiszár, I., & Tusnády, G. (1984). Information geometry and alternating minimization procedures. Statistics and Decisions, suppl. 1, 205–237.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
Elidan, G., & Friedman, N. (2003). The information bottleneck expectation maximization algorithm. In Proc. of the 19th Conf. on Uncertainty in Artificial Intelligence (UAI-19). San Mateo, CA: Morgan Kaufmann.
Friedman, N., Mosenzon, O., Slonim, N., & Tishby, N. (2001). Multivariate information bottleneck. In Proc. of Uncertainty in Artificial Intelligence (UAI-17). San Mateo, CA: Morgan Kaufmann.
Gilad-Bachrach, R., Navot, A., & Tishby, N. (2003). An information theoretic tradeoff between complexity and accuracy. In The Sixteenth Annual Conference on Learning Theory (COLT). New York: Springer.
Globerson, A., & Tishby, N. (2004). The minimum information principle in discriminative learning. In C. Meek, M. Chickering, & J. Halpern (Eds.), Uncertainty in artificial intelligence. Banff, Canada: AUAI Press.
Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C., & Lander, E. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
Gutman, M. (1989). Asymptotically optimal classification for multiple tests with empirically observed statistics. IEEE Trans. Inf. Theory, 35(2), 401–408.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proc. of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 50–57). New York: ACM.
Lang, K. (1995). Learning to filter netnews. In Proc. of the 12th International Conf. on Machine Learning (ICML). San Mateo, CA: Morgan Kaufmann.
Parker, E. A., Gedeon, T., & Dimitrov, A. G. (2002). Annealing and the rate distortion problem. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 969–976). Cambridge, MA: MIT Press.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Francisco: Morgan Kaufmann.
Pereira, F. C., Tishby, N., & Lee, L. (1993). Distributional clustering of English words. In 30th Annual Meeting of the Association for Computational Linguistics. New York: ACM.
Rose, K. (1998).
Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210–2239.
Schriebman, A. (2000). Stochastic modeling for efficient computation of information theoretic quantities. Unpublished doctoral dissertation, Hebrew University, Jerusalem, Israel. Available online at http://www.cs.huji.ac.il/labs/learning/Theses/theses list.html.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Systems Technical Journal, 27, 379–423, 623–656.
Slonim, N. (2002). The information bottleneck: Theory and applications. Unpublished doctoral dissertation, Hebrew University, Jerusalem, Israel.
Slonim, N., Friedman, N., & Tishby, N. (2002). Unsupervised document classification using sequential information maximization. In Proc. of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM.
Slonim, N., Somerville, R., Tishby, N., & Lahav, O. (2001). Objective classification of galaxy spectra using the information bottleneck method. Monthly Notices of the Royal Astronomical Society, 323, 270–284.
Slonim, N., & Tishby, N. (1999). Agglomerative information bottleneck. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 617–623). Cambridge, MA: MIT Press.
Slonim, N., & Tishby, N. (2000). Document clustering using word clusters via the information bottleneck method. In Proc. of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 208–215). New York: ACM.
Slonim, N., & Weiss, Y. (2002). Maximum likelihood and the information bottleneck. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Still, S., & Bialek, W. (2003). How many clusters? Physics/030101.
Studeny, M., & Vejnarova, J. (1998). The multi-information function as a tool for measuring stochastic dependence. In M. I. Jordan (Ed.), Learning in graphical models (pp. 261–298). Dordrecht: Kluwer.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In Proc. of 37th Allerton Conference on Communication and Computation.
Tishby, N., & Slonim, N. (2000). Data clustering by Markovian relaxation and the information bottleneck method. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Received February 25, 2005; accepted September 15, 2005.
LETTER
Communicated by Manfred Opper
Variational Bayesian Multinomial Probit Regression with Gaussian Process Priors Mark Girolami [email protected]
Simon Rogers [email protected] Department of Computing Science, University of Glasgow, Glasgow, Scotland
It is well known in the statistics literature that augmenting binary and polychotomous response models with gaussian latent variables enables exact Bayesian analysis via Gibbs sampling from the parameter posterior. By adopting such a data augmentation strategy, dispensing with priors over regression coefficients in favor of gaussian process (GP) priors over functions, and employing variational approximations to the full posterior, we obtain efficient computational methods for GP classification in the multiclass setting.1 The model augmentation with additional latent variables ensures full a posteriori class coupling while retaining the simple a priori independent GP covariance structure from which sparse approximations, such as multiclass informative vector machines (IVM), emerge in a natural and straightforward manner. This is the first time that a fully variational Bayesian treatment for multiclass GP classification has been developed without having to resort to additional explicit approximations to the nongaussian likelihood term. Empirical comparisons with exact analysis using Markov chain Monte Carlo (MCMC) and with Laplace approximations illustrate the utility of the variational approximation as a computationally economic alternative to full MCMC, and the variational approximation is shown to be more accurate than the Laplace approximation.

1 Introduction

Albert and Chib (1993) were the first to show that by augmenting binary and multinomial probit regression models with a set of continuous latent variables y_k, corresponding to the kth response value, where y_k = m_k + ε, ε ∼ N(0, 1), and m_k = Σ_j β_kj x_j, an exact Bayesian analysis can be performed by Gibbs sampling from the parameter posterior. As an example, consider binary probit regression on target variables t_n ∈ {0, 1};
1 Matlab code to allow replication of the reported results is available online at http:// www.dcs.gla.ac.uk/people/personal/girolami/pubs 2005/VBGP/index.htm.
Neural Computation 18, 1790–1817 (2006)
C 2006 Massachusetts Institute of Technology
the probit likelihood for the nth data sample taking unit value (t_n = 1) is P(t_n = 1 | x_n, β) = Φ(β^T x_n), where Φ is the standardized normal cumulative distribution function (CDF). This can be obtained by the following marginalization, ∫ P(t_n = 1, y_n | x_n, β) dy_n = ∫ P(t_n = 1 | y_n) p(y_n | x_n, β) dy_n, and by definition P(t_n = 1 | y_n) = δ(y_n > 0), so we see that the required marginal is simply the normalizing constant of a left-truncated univariate gaussian: P(t_n = 1 | x_n, β) = ∫ δ(y_n > 0) N_{y_n}(β^T x_n, 1) dy_n = Φ(β^T x_n). The key observation here is that working with the joint distribution P(t_n = 1, y_n | x_n, β) = δ(y_n > 0) N_{y_n}(β^T x_n, 1) provides a straightforward means of Gibbs sampling from the parameter posterior, which would not be the case if the marginal term, Φ(β^T x_n), was employed in defining the joint distribution over data and parameters. This data augmentation strategy can be adopted in developing efficient methods to obtain binary and multiclass gaussian process (GP) (Williams & Rasmussen, 1996) classifiers, as will be presented in this article. With the exception of Neal (1998), where a full Markov chain Monte Carlo (MCMC) treatment of GP-based classification is provided, all other approaches have focused on methods to approximate the problematic form of the posterior,2 which allow analytic marginalization to proceed. Laplace approximations to the posterior were developed in Williams and Barber (1998), and lower- and upper-bound quadratic likelihood approximations were considered in Gibbs and MacKay (2000). Variational approximations for binary classification were developed in Seeger (2000), where a logit likelihood was considered, and mean field approximations were applied to probit likelihood terms in Opper and Winther (2000) and Csato, Fokue, Opper, Schottky, and Winther (2000).
Additionally, incremental (Quinonero-Candela & Winther, 2003) or sparse approximations based on assumed density filtering (ADF; Csato & Opper, 2002), informative vector machines (IVM; Lawrence, Seeger, & Herbrich, 2003), and expectation propagation (EP; Minka, 2001; Kim, 2005) have been proposed. With the exceptions of Williams and Barber (1998), Gibbs and MacKay (2000), Seeger and Jordan (2004), and Kim (2005), the focus of most recent work has largely been on the binary GP classification problem. Seeger and Jordan (2004) developed a multiclass generalization of the IVM employing a multinomial-logit softmax likelihood. However, considerable representational effort is required to ensure that the scaling of computation and storage required of the proposed method matches that of the original IVM with linear scaling in the number of classes. In contrast, by adopting the probabilistic representation of Albert and Chib (1993), we will see that GP-based K-class classification and efficient sparse approximations (IVM generalizations with scaling linear in the number of classes) can be realized by optimizing a strict lower
2 The likelihood is nonlinear in the parameters due to either the logistic or probit link functions required in the classification setting.
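The augmentation identity above—that marginalizing the auxiliary variable recovers the probit—can be checked numerically: sampling y = β^T x + ε with ε ∼ N(0, 1) and counting the fraction of draws with y > 0 recovers Φ(β^T x). A minimal sketch (the value of β^T x is illustrative only):

```python
import math
import random

def std_normal_cdf(x):
    # Phi(x) via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def augmented_probit_estimate(m, n_samples=200000, seed=0):
    """Estimate P(t = 1) = integral of delta(y > 0) N_y(m, 1) dy by
    sampling the auxiliary variable y = m + eps, eps ~ N(0, 1)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if m + rng.gauss(0.0, 1.0) > 0.0)
    return hits / n_samples

m = 0.7  # illustrative value of beta^T x
# augmented_probit_estimate(m) and std_normal_cdf(m) agree to Monte Carlo accuracy.
```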
Figure 1: Graphical representation of the conditional dependencies within the general multinomial probit regression model with gaussian process priors.
bound of the marginal likelihood of a multinomial probit regression model that requires the solution of K computationally independent GP regression problems while still operating jointly (statistically) on the data. We will also show that the accuracy of this approximation is comparable to that obtained via MCMC. The following section introduces the multinomial probit regression model with GP priors.

2 Multinomial Probit Regression

Define the data matrix as X = [x_1, . . . , x_N]^T, which has dimension N × D, and the N × 1–dimensional vector of associated target values as t, where each element t_n ∈ {1, . . . , K}. The N × K matrix of GP random variables m_nk is denoted by M. We represent the N × 1–dimensional columns of M by m_k and the corresponding K × 1–dimensional rows by m_n. The N × K matrix of auxiliary variables y_nk is represented as Y, where the N × 1–dimensional columns are denoted by y_k and the corresponding K × 1–dimensional rows as y_n. The M × 1 vector of covariance kernel hyperparameters for each class is denoted by ϕ_k, and associated hyperparameters ψ_k and α_k complete the model.3 The graphical representation of the conditional dependency structure in the auxiliary variable multinomial probit regression model with GP priors in the most general case is shown in Figure 1.

3 Prior Probabilities

From the graphical model in Figure 1, a priori we can assume class-specific GP independence and define model priors such that m_k | X, ϕ_k ∼ GP(ϕ_k) = N_{m_k}(0, C_{ϕ_k}), where the matrix C_{ϕ_k} of dimension N × N defines
the class-specific GP covariance.4 Typical examples of such GP covariance functions are radial basis–style functions such that the i, jth element of each C_{ϕ_k} is defined as exp{−Σ_{d=1}^{M} ϕ_kd (x_id − x_jd)^2}, where in this case M = D; however, there are many other forms of covariance functions that may be employed within the GP function prior (see, e.g., MacKay, 2003). As in Albert and Chib (1993), we employ a standardized normal noise model such that the prior on the auxiliary variables is y_nk | m_nk ∼ N_{y_nk}(m_nk, 1) to ensure appropriate matching with the probit function. Rather than having this variance fixed, it could also be made an additional free parameter of the model and would therefore yield a scaled probit function. For the presentation here, we restrict ourselves to the standardized model and consider extensions to a scaled probit model as possible further work. The relationship between the additional latent variables y_n (denoting the nth row of Y) and the targets t_n as defined in multinomial probit regression (Albert & Chib, 1993) is adopted here:

t_n = j   if   y_nj = max_{1≤k≤K} {y_nk}.   (3.1)
This has the effect of dividing R^K (y space) into K nonoverlapping K-dimensional cones C_k = {y : y_k > y_i, k ≠ i}, where R^K = ∪_k C_k, and so each P(t_n = i | y_n) can be represented as δ(y_ni > y_nk ∀ k ≠ i). We then see that, similar to the binary case, where the probit function emerges from explicitly marginalizing the auxiliary variable, the multinomial probit takes the form given below (details are given in appendix A):

P(t_n = i | m_n) = ∫ δ(y_ni > y_nk ∀ k ≠ i) Π_{j=1}^{K} p(y_nj | m_nj) dy
 = ∫_{C_i} Π_{j=1}^{K} p(y_nj | m_nj) dy
 = E_{p(u)} { Π_{j≠i} Φ(u + m_ni − m_nj) },
where the random variable u is standardized normal, p(u) = N(0, 1). A hierarchic prior on the covariance function hyperparameters is employed such that each hyperparameter has, for example, an independent exponential distribution, ϕ_kd ∼ Exp(ψ_kd), and a gamma distribution is placed on the mean values of the exponential, ψ_kd ∼ Γ(σ_k, τ_k), thus forming a conjugate pair. Of course, as detailed in Girolami and Rogers (2005), a
4 The model can be defined by employing K − 1 GP functions and an alternative truncation of the gaussian over the variables y_nk; however, for the multiclass case, we define a GP for each class.
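The expectation form of the multinomial probit above is easy to check by simulation: drawing y_n = m_n + noise and counting how often class i attains the maximum should agree with the E_{p(u)} expression. A rough Monte Carlo sketch (the class count and mean values are illustrative; both quantities are themselves estimates):

```python
import math
import random

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_argmax_simulation(m, i, n_samples=40000, seed=1):
    """P(t = i | m) by drawing y = m + eps, eps ~ N(0, I), labeling by argmax."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        y = [mk + rng.gauss(0.0, 1.0) for mk in m]
        if max(range(len(m)), key=lambda k: y[k]) == i:
            hits += 1
    return hits / n_samples

def prob_probit_expectation(m, i, n_samples=40000, seed=2):
    """P(t = i | m) = E_{p(u)} prod_{j != i} Phi(u + m_i - m_j)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        u = rng.gauss(0.0, 1.0)
        prod = 1.0
        for j, mj in enumerate(m):
            if j != i:
                prod *= std_normal_cdf(u + m[i] - mj)
        total += prod
    return total / n_samples

m = [0.3, -0.2, 0.8]  # illustrative GP mean values for K = 3 classes
# The two estimates agree to Monte Carlo accuracy.
```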
more general form of covariance function can be employed that will allow the integration of heterogeneous types of data, which takes the form of a weighted combination of base covariance functions. The associated hyper-hyperparameters α = {σ_{k=1,...,K}, τ_{k=1,...,K}} can be estimated via type II maximum likelihood or set to reflect some prior knowledge of the data. Alternatively, vague priors can be employed such that, for example, each σ_k = τ_k = 10^{−6}. Defining the parameter set as Θ = {Y, M} and the hyperparameters as Ξ = {ϕ_{k=1,...,K}, ψ_{k=1,...,K}}, the joint likelihood takes the form

p(t, Θ, Ξ | X, α) = Π_{n=1}^{N} Σ_{i=1}^{K} δ(y_ni > y_nk ∀ k ≠ i) δ(t_n = i)
 × Π_{k=1}^{K} p(y_nk | m_nk) p(m_k | X, ϕ_k) p(ϕ_k | ψ_k) p(ψ_k | α_k).   (3.2)
4 Gaussian Process Multiclass Classification

We now consider both exact and approximate Bayesian inference for GP classification with multiple classes employing the multinomial probit regression model.

4.1 Exact Bayesian Inference: The Gibbs Sampler. The representation of the joint likelihood (see equation 3.2) is particularly convenient in that samples can be drawn from the full posterior over the model parameters (given the hyperparameter values), p(Θ | t, X, Ξ, α), using a Gibbs sampler in a very straightforward manner, with scaling per sample of O(K N^3). Full details of the Gibbs sampler are provided in appendix D, and this sampler will be employed in the experimental section.

4.2 Approximate Bayesian Inference: The Laplace Approximation. The Laplace approximation of the posterior over GP variables, p(M | t, X, Ξ, α) (where Y is marginalized), requires finding the mode of the unnormalized posterior. Approximate Bayesian inference for GP classification with multiple classes employing a multinomial logit (softmax) likelihood has been developed previously by Williams and Barber (1998). Due to the form of the multinomial logit likelihood, a Newton iteration to obtain the posterior mode will scale at best as O(K N^3). Employing the multinomial probit likelihood, we find that each Newton step will scale as O(K^3 N^3). Details are provided in appendix E.

4.3 Approximate Bayesian Inference: A Variational and Sparse Approximation. Employing a variational Bayes approximation (Beal, 2003; Jordan, Ghahramani, Jaakkola, & Saul, 1999; MacKay, 2003) by using an approximating ensemble of factored posteriors such that p(Θ | t, X, Ξ, α) ≈
Π_i Q(Θ_i) = Q(Y) Q(M) for multinomial probit regression is more appealing from a computational perspective, as a sparse representation, with scaling O(K N S^2) (where S is the subset of samples entering the model and S ≪ N), can be obtained in a straightforward manner, as will be shown in the following sections. The lower bound5 (see, e.g., Beal, 2003; Jordan et al., 1999; MacKay, 2003) on the marginal likelihood, log p(t | X, Ξ, α) ≥ E_{Q(Θ)}{log p(t, Θ | X, Ξ, α)} − E_{Q(Θ)}{log Q(Θ)}, is maximized by distributions that take the unnormalized form Q(Θ_i) ∝ exp E_{Q(Θ\Θ_i)}{log p(t, Θ | X, Ξ, α)}, where Q(Θ\Θ_i) denotes the ensemble distribution with the ith component of Θ removed. Details of the required posterior components are given in appendix A. The approximate posterior over the GP random variables takes a factored form such that

Q(M) = Π_{k=1}^{K} Q(m_k) = Π_{k=1}^{K} N_{m_k}(m̃_k, Σ_k),   (4.1)
where the shorthand tilde notation denotes posterior expectation, f̃(a) = E_{Q(a)}{f(a)}, and so the required posterior mean for each k is given as m̃_k = Σ_k ỹ_k, where Σ_k = C_{ϕ_k}(I + C_{ϕ_k})^{−1} (see appendix A for full details). We will see that each row, y_n, of Y will have posterior correlation structure induced, ensuring that the appropriate class-conditional posterior dependencies will be induced in M̃. It should be stressed here that while there are K a posteriori independent GP processes, the associated K-dimensional posterior means for each of N data samples induce posterior dependencies between each of the K columns of M̃ due to the posterior coupling over each of the auxiliary variables y_n. We will see that this structure is particularly convenient in obtaining sparse approximations (Lawrence et al., 2003) for the multiclass GP in particular. Due to the multinomial probit definition of the dependency between each element of y_n and t_n (see equation 3.1), the posterior for the auxiliary variables follows as

Q(Y) = Π_{n=1}^{N} Q(y_n) = Π_{n=1}^{N} N_{y_n}^{t_n}(m̃_n, I),   (4.2)
where N_{y_n}^{t_n}(m̃_n, I) denotes a conic truncation of a multivariate gaussian such that if t_n = i, where i ∈ {1, . . . , K}, then the ith dimension has the largest
5 The bound follows from the application of Jensen's inequality, for example, log p(t|X) = log ∫ Q(Θ) [p(t, Θ|X)/Q(Θ)] dΘ ≥ ∫ Q(Θ) log [p(t, Θ|X)/Q(Θ)] dΘ.
value. The required posterior expectations ỹ_nk for all k ≠ i and ỹ_ni follow as

ỹ_nk = m̃_nk − E_{p(u)}{ N_u(m̃_nk − m̃_ni, 1) Φ_u^{n,i,k} } / E_{p(u)}{ Φ(u + m̃_ni − m̃_nk) Φ_u^{n,i,k} },   (4.3)

ỹ_ni = m̃_ni − Σ_{j≠i} ( ỹ_nj − m̃_nj ),   (4.4)
where Φ_u^{n,i,k} = Π_{j≠i,k} Φ(u + m̃_ni − m̃_nj), and p(u) = N_u(0, 1). The expectations with respect to p(u), which appear in equation 4.3, can be obtained by quadrature or straightforward sampling methods. If we also consider the set of hyperparameters, Ξ, in this variational treatment, then the approximate posterior for the covariance kernel hyperparameters takes the form

Q(ϕ_k) ∝ N_{m̃_k}(0, C_{ϕ_k}) Π_{d=1}^{M} Exp(ϕ_kd | ψ̃_kd),
and the required posterior expectations can be estimated employing importance sampling. Expectations can be approximated by drawing S samples such that each ϕ_kd^s ∼ Exp(ψ̃_kd), and so

f̃(ϕ_k) ≈ Σ_{s=1}^{S} f(ϕ_k^s) w(ϕ_k^s),   where   w(ϕ_k^s) = N_{m̃_k}(0, C_{ϕ_k^s}) / Σ_{s'=1}^{S} N_{m̃_k}(0, C_{ϕ_k^{s'}}).   (4.5)
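Equation 4.5 is a standard self-normalized importance sampler: draw from the exponential prior and reweight by the GP marginal likelihood term. The reweighting mechanics can be sketched on a scalar example where the answer is known; the weight function below is chosen so that the tilted density is Exp(2), whose mean is 1/2 (all specific values are illustrative):

```python
import math
import random

def self_normalized_is(weight_fn, proposal_draw, n_samples=100000):
    """Estimate the mean of the tilted density proportional to
    proposal(x) * weight_fn(x) -- the reweighting used in equation 4.5."""
    draws = [proposal_draw() for _ in range(n_samples)]
    weights = [weight_fn(x) for x in draws]
    total = sum(weights)
    return sum(x * w for x, w in zip(draws, weights)) / total

rng = random.Random(6)
# Proposal Exp(1); weight e^{-x} tilts it to Exp(2), whose mean is 0.5.
est = self_normalized_is(lambda x: math.exp(-x), lambda: rng.expovariate(1.0))
```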
This form of importance sampling within a variational Bayes procedure has been employed previously in Lawrence, Milo, Niranjan, Rashbass, and Soullier (2004). Clearly the scaling of the above estimator per sample is similar to that required in the gradient-based methods that search for optima of the marginal likelihood as employed in GP regression and classification (e.g., MacKay, 2003). Finally, we have that each Q(ψ_kd) = Γ_{ψ_kd}(σ_k + 1, τ_k + ϕ̃_kd), and the associated posterior mean is simply ψ̃_kd = (σ_k + 1)/(τ_k + ϕ̃_kd).

4.4 Summarizing Variational Multiclass GP Classification. We can summarize what has been presented by the following iterations that, in the general case, for all k and d, will optimize the bound on the marginal likelihood (explicit expressions for the bound are provided in appendix C):

m̃_k ← C_{ϕ_k}(I + C_{ϕ_k})^{−1}(m̃_k + p_k)   (4.6)

ϕ̃_k ← Σ_s ϕ_k^s w(ϕ_k^s)   (4.7)
ψ̃_kd ← (σ_k + 1)/(τ_k + ϕ̃_kd),   (4.8)
where each ϕ_kd^s ∼ Exp(ψ̃_kd), w(ϕ_k^s) is defined as previously, and p_k is the kth column of the N × K matrix P whose elements p_nk are defined by the rightmost terms in equations 4.3 and 4.4: for t_n = i, then for all k ≠ i,

p_nk = − E_{p(u)}{ N_u(m̃_nk − m̃_ni, 1) Φ_u^{n,i,k} } / E_{p(u)}{ Φ(u + m̃_ni − m̃_nk) Φ_u^{n,i,k} },   and   p_ni = − Σ_{j≠i} p_nj.
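The tilted expectations defining p_nk are one-dimensional integrals over u and can be approximated by plain Monte Carlo, as equation 4.3 suggests. A sketch for a single data point (the mean values and class count are illustrative; i is the true class):

```python
import math
import random

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def correction_vector(m, i, n_samples=20000, seed=7):
    """p_n: Monte Carlo version of the rightmost terms of equations 4.3
    and 4.4 for one data point with true class i."""
    rng = random.Random(seed)
    u_draws = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    K = len(m)
    p = [0.0] * K
    for k in range(K):
        if k == i:
            continue
        num = den = 0.0
        for u in u_draws:
            prod = 1.0  # Phi_u^{n,i,k}: product over j not in {i, k}
            for j in range(K):
                if j != i and j != k:
                    prod *= std_normal_cdf(u + m[i] - m[j])
            num += std_normal_pdf(u - (m[k] - m[i])) * prod
            den += std_normal_cdf(u + m[i] - m[k]) * prod
        p[k] = -num / den
    p[i] = -sum(p[k] for k in range(K) if k != i)
    return p

p = correction_vector(m=[0.2, -0.1, 0.5], i=2)
# p[k] < 0 for the losing classes, p[i] > 0, and the entries sum to zero.
```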
These iterations can be viewed as obtaining K one-against-all binary classifiers; most important, however, they are not statistically independent of each other but are a posteriori coupled via the posterior mean estimates of each of the auxiliary variables y_n. The computational scaling will be linear in the number of classes and cubic in the number of data points, O(K N^3). It is worth noting that if the covariance function hyperparameters are fixed, then the costly matrix inversion requires being computed only once. The Laplace approximation will require a matrix inversion for each Newton step when finding the mode of the posterior (Williams & Barber, 1998).

4.4.1 Binary Classification. Previous variational treatments of GP-based binary classification include Seeger (2000), Opper and Winther (2000), Gibbs and MacKay (2000), Csato and Opper (2002), and Csato et al. (2000). It is, however, interesting to note in passing that for binary classification, the outer plate in Figure 1 is removed, and further simplification follows, as only K − 1 = 1 set of posterior mean values requires being estimated: Σ = C_ϕ(I + C_ϕ)^{−1}, and so the posterior expectations m̃ = Σ ỹ now operate on N × 1–dimensional vectors m̃ and ỹ. The posterior Q(y) is now a product of truncated univariate gaussians, so the expectation for the latent variables y_n has an exact analytic form. For a unit-variance gaussian truncated below zero if t_n = 1 and above zero if t_n = −1, the required posterior mean ỹ has elements that can be obtained by the following analytic expression, derived from straightforward results for corrections to the mean of a gaussian due to truncation: ỹ_n = m̃_n + t_n N_{m̃_n}(0, 1)/Φ(t_n m̃_n).6 So the following iteration will guarantee an increase in the bound of the marginal likelihood,

m̃ ← C_ϕ(I + C_ϕ)^{−1}(m̃ + p),   (4.9)
(4.9)
where each element of the N × 1 vector p is defined as pn = tn Nm n ). n (0, 1)/(tn m 4.5 Variational Predictive Distributions. The predictive distribution, P(tnew = k|xnew , X, t), for a new sample xnew follows from results for +∞ For t = +1, then y = 0 yN y ( m, 1)/{1 − (− m)}dy = m + Nm m), and for (0, 1)/( 0 t = −1, then y = −∞ yN y ( m, 1)/(− m)dy = m − Nm m). (0, 1)/(− 6
standard GP regression.7 The N × 1 vector C_{ϕ_k}^{new} contains the covariance function values between the new point and those contained in X, and c_{ϕ_k}^{new} denotes the covariance function value for the new point and itself. So the GP posterior p(m_new | x_new, X, t) is a product of K gaussians, each with mean and variance

m̃_k^{new} = ỹ_k^T (I + C_{ϕ_k})^{−1} C_{ϕ_k}^{new},
σ_k^{2,new} = c_{ϕ_k}^{new} − (C_{ϕ_k}^{new})^T (I + C_{ϕ_k})^{−1} C_{ϕ_k}^{new}.

Using the shorthand ν_k^{new} = √(1 + σ_k^{2,new}), it is then straightforward (details in appendix B) to obtain the predictive distribution over possible target values as
P(t_new = k | x_new, X, t) = E_{p(u)} { Π_{j≠k} Φ( [u ν_k^{new} + m̃_k^{new} − m̃_j^{new}] / ν_j^{new} ) },
where, as before, u ∼ N_u(0, 1). The expectation can be obtained numerically employing sample estimates from a standardized gaussian. For the binary case, the standard result follows:

P(t_new = 1 | x_new, X, t) = ∫ δ(y_new > 0) N_{y_new}(m̃_new, ν_new) dy_new = 1 − Φ(−m̃_new/ν_new) = Φ(m̃_new/ν_new).
5 Sparse Variational Multiclass GP Classification

The dominant O(N^3) scaling of the matrix inversion required in the posterior mean updates in GP regression has been the motivation behind a large body of literature focusing on reducing this cost via reduced-rank approximations (Williams & Seeger, 2001) and sparse online learning (Csato & Opper, 2002; Quinonero-Candela & Winther, 2003), where assumed density filtering (ADF) forms the basis of online learning and sparse approximations for GPs. Likewise, in Lawrence et al. (2003) the informative vector machine (IVM) (refer to Lawrence, Platt, & Jordan, 2005, for comprehensive details) is proposed, which employs informative point selection criteria (Seeger, Williams, & Lawrence, 2003) and ADF updating of the approximations of
7 Conditioning on Y, ϕ̃, ψ̃, and α is implicit.
the GP posterior parameters. Only binary classification has been considered in Lawrence et al. (2003), Csato and Opper (2002), and Quinonero-Candela and Winther (2003), and it is clear from Seeger and Jordan (2004) that extension of ADF-based approximations such as IVM to the multiclass problem is not at all straightforward when a multinomial-logit softmax likelihood is adopted. However, we now see that sparse GP-based classification for multiple classes (multiclass IVM) emerges as a simple by-product of online ADF approximations to the parameters of each Q(mk ) (multivariate gaussian). The ADF approximations when adding the nth data sample, selected at the lth of S iterations, for each of the K GP posteriors, Q(mk ), follow simply from details in Lawrence et al. (2005) as given below:
Σ_{k,n} ← C_{ϕ_k}^n − M_k^T M_{k,n}   (5.1)

s_k ← s_k − (1 + s_kn)^{−1} diag( Σ_{k,n} Σ_{k,n}^T )   (5.2)

M_k^l ← (1 + s_kn)^{−1/2} Σ_{k,n}^T   (5.3)

m̃_k ← m̃_k + [ (ỹ_nk − m̃_nk)/(1 + s_kn) ] Σ_{k,n}.   (5.4)
Each ỹ_nk − m̃_nk = p_nk, as defined in section 4.4, and can be obtained from the current stored approximate values of each m̃_n1, . . . , m̃_nK via equations 4.3 and 4.4; Σ_{k,n}, an N × 1 vector, is the nth column of the current estimate of each Σ_k; likewise, C_{ϕ_k}^n is the nth column of each GP covariance matrix. All elements of each M_k and m̃_k are initialized to zero, while each s_k has initial unit values. Of course, there is no requirement to explicitly store each N × N–dimensional matrix Σ_k; only the S × N matrices M_k and N × 1 vectors s_k require storage and maintenance. We denote indexing into the lth row of each M_k by M_k^l, and the nth element of each s_k by s_kn, which is the estimated posterior variance. The efficient Cholesky factor updating as detailed in Lawrence et al. (2005) will ensure that for N data samples, K distinct GP priors, and a maximum of S samples included in the model, where S ≪ N, at most O(K S N) storage and O(K N S^2) compute scaling will be realized. As an alternative to the entropic scoring heuristic of Seeger et al. (2003) and Lawrence et al. (2003), we suggest that an appropriate criterion for point inclusion assessment will be the posterior predictive probability of a target value given the current model parameters for points that are currently not included in the model, that is, P(t_m | x_m, {m̃_k}, {Σ_k}), where the subscript m indexes such points. From the results of the previous section, this is equal
to Pr(y_m ∈ C_{t_m = k}), which is expressed as

E_{p(u)} { Π_{j≠k} Φ( [u ν_k^m + m̃_mk − m̃_mj] / ν_j^m ) },   (5.5)
where k is the value of t_m, ν_j^m = √(1 + s_jm), and so the data point with the smallest posterior target probability should be selected for inclusion. This scoring criterion requires no additional storage overhead, as all m̃_k and s_k are already available, and it can be computed for all m not currently in the model in, at most, O(K N) time.8 Intuitively, points in regions of low target posterior certainty, that is, class boundaries, will be the most influential in updating the approximation of the target posteriors. And so the inclusion of points with the most uncertain target posteriors will yield the largest possible translation of each updated m̃_k into the interior of their respective cones C_k. Experiments in the following section will demonstrate the effectiveness of this multiclass IVM.

6 Experiments

6.1 Illustrative Multiclass Toy Example. Ten-dimensional data vectors, x, were generated such that if t = 1, then 0.5 > x_1^2 + x_2^2 > 0.1; for t = 2, then 1.0 > x_1^2 + x_2^2 > 0.6; and for t = 3, then [x_1, x_2]^T ∼ N(0, 0.01 I), where I denotes an identity matrix of appropriate dimension. Finally, x_3, . . . , x_10 are all distributed as N(0, 1). Both of the first two dimensions are required to define the three class labels, with the remaining eight dimensions being irrelevant to the classification task. Each of the three target values was sampled uniformly, thus creating a balance of samples drawn from the three target classes. Two hundred forty draws were made from the above distribution, and the sample was used in the proposed variational inference routine, with a further 4620 points being used to compute a 0-1 loss class prediction error. A common radial basis covariance function of the form exp{−Σ_d ϕ_d |x_id − x_jd|^2} was employed, and vague hyperparameters, σ = τ = 10^{−3}, were placed on the length-scale hyperparameters ψ_1, . . . , ψ_10.
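The common radial basis covariance used in this experiment can be sketched in a few lines of pure Python (the toy inputs and length scales are illustrative):

```python
import math

def rbf_covariance(X, phi):
    """Gram matrix with entries exp{-sum_d phi_d * (x_id - x_jd)^2}."""
    n = len(X)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = sum(p * (a - b) ** 2 for p, a, b in zip(phi, X[i], X[j]))
            C[i][j] = math.exp(-s)
    return C

X = [[0.0, 1.0], [0.5, 0.2], [1.0, 1.0]]  # three 2-d inputs (illustrative)
C = rbf_covariance(X, phi=[1.0, 0.5])
# C is symmetric with unit diagonal, since C[i][i] = exp(0) = 1.
```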
The posterior expectations of the auxiliary variables y were obtained from equations 4.3 and 4.4, where the gaussian integrals were computed using 1000 samples drawn from p(u) = N(0, 1). The variational importance sampler employed 500 samples drawn from each Exp(ψ̃_d) in estimating the corresponding posterior means φ̃_d for the covariance function parameters. Each M and Y was initialized randomly, and φ̃ had unit initial values. In this
⁸ Assuming constant time to approximate the expectation.
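To make the inclusion criterion of equation 5.5 concrete, the expectation over u can be approximated by simple Monte Carlo. The sketch below is an illustrative reconstruction, not the authors' implementation: Φ is computed via `math.erf`, and the shorthand ν_j = √(1 + s_j) is an assumption of this sketch.

```python
import math
import random

def std_norm_cdf(x):
    """Standard normal CDF Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def target_posterior(m_tilde, s, n_samples=4000, seed=0):
    """Monte Carlo estimate of equation 5.5: Pr(y_m in C_k) for every class k
    of a candidate point, given posterior means m_tilde and variances s."""
    rng = random.Random(seed)
    K = len(m_tilde)
    nu = [math.sqrt(1.0 + sj) for sj in s]   # assumed: nu_j = sqrt(1 + s_j)
    totals = [0.0] * K
    for _ in range(n_samples):
        u = rng.gauss(0.0, 1.0)
        for k in range(K):
            prod = 1.0
            for j in range(K):
                if j != k:
                    prod *= std_norm_cdf((u * nu[k] + m_tilde[k] - m_tilde[j]) / nu[j])
            totals[k] += prod
    return [t / n_samples for t in totals]
```

The K estimates form a proper distribution over classes; the candidate whose probability of its observed target is smallest would be the one selected for inclusion.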
Variational Bayesian Multinomial Probit Regression
Figure 2: (a) Convergence of the lower bound on the marginal likelihood for the toy data set considered. (b) Evolution of estimated posterior means for the inverse squared length scale parameters (precision parameters) in the RBF covariance function. (c) Evolution of out-of-sample predictive performance on the toy data set.
example, the variational iterations ran for 50 steps, where each step corresponds to the sequential posterior mean updates of equations 4.6 to 4.8. The value of the variational lower bound was monitored during each step, and, as would be expected, a steady convergence in the improvement of the bound can be observed in Figure 2a. The development of the estimated posterior mean values for the covariance function parameters φ̃_d, Figure 2b, shows automatic relevance detection (ARD) in progress (Neal, 1998), where the eight irrelevant features are effectively removed from the model. From Figure 2c we can see that the development of the predictive performance (out of sample) follows that of the lower bound (see Figure 2a), achieving a predictive performance of 99.37% at convergence. As a comparison to our multiclass GP classifier, we use a directed acyclic graph (DAG) SVM (Platt, Cristianini, & Shawe-Taylor, 2000) (assuming equal class distributions, the scaling⁹ is O(N³K⁻¹)) on this example. By employing the posterior mean values of the covariance function length-scale parameters (one for each of the 10 dimensions) estimated by the proposed variational procedure in the RBF kernel of the DAG SVM, a predictive performance of 99.33% is obtained. So on this data set, the proposed GP classifier has comparable performance, under 0-1 loss, to the DAG SVM. However, the estimation of the covariance function parameters is a natural part of the approximate Bayesian inference routines employed in GP classification. There is no natural method of obtaining estimates of the 10 kernel parameters for the SVM without resorting to cross-validation (CV), which, in the case of a single parameter, is feasible but rapidly becomes infeasible as the number of parameters increases.

⁹ This assumes the use of standard quadratic optimization routines.
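The toy data-generating process and the ARD radial basis covariance can be sketched as follows. This is an illustrative reconstruction under stated assumptions: classes are indexed 0 to 2 for t = 1 to 3, the class-3 blob is sampled directly, and the two annuli are sampled uniformly in squared radius (an implementation choice the text does not specify).

```python
import math
import random

def make_toy(n, seed=1):
    """Generate n ten-dimensional toy vectors: class 0 if 0.1 < x1^2+x2^2 < 0.5,
    class 1 if 0.6 < x1^2+x2^2 < 1.0, class 2 drawn from N(0, 0.01 I) in (x1, x2);
    dimensions 3..10 are irrelevant N(0, 1) noise."""
    rng = random.Random(seed)
    X, t = [], []
    while len(X) < n:
        c = rng.randrange(3)                      # balanced classes
        if c == 2:
            x1, x2 = rng.gauss(0, 0.1), rng.gauss(0, 0.1)
        else:
            lo, hi = (0.1, 0.5) if c == 0 else (0.6, 1.0)
            r2 = lo + (hi - lo) * rng.random()    # uniform in squared radius
            th = 2 * math.pi * rng.random()
            x1, x2 = math.sqrt(r2) * math.cos(th), math.sqrt(r2) * math.sin(th)
        X.append([x1, x2] + [rng.gauss(0, 1) for _ in range(8)])
        t.append(c)
    return X, t

def ard_rbf(xi, xj, phi):
    """ARD RBF covariance exp{-sum_d phi_d (x_id - x_jd)^2}; a large phi_d
    shrinks the contribution of dimension d, which is how ARD prunes features."""
    return math.exp(-sum(p * (a - b) ** 2 for p, a, b in zip(phi, xi, xj)))
```

Small posterior means φ̃_d for the eight noise dimensions make those coordinates drop out of the covariance, which is the pruning behavior visible in Figure 2b.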
6.2 Comparing Laplace and Variational Approximations to Exact Inference via Gibbs Sampling. This section provides a brief empirical comparison of the variational approximation developed in previous sections to a full MCMC treatment employing the Gibbs sampler detailed in appendix D. In addition, a Laplace approximation is considered in this short comparative study. Variational approximations provide a strict lower bound on the marginal likelihood, and this bound is one of the approximation's attractive characteristics. However, it is less well understood how much the parameters obtained from such approximations differ from those obtained using exact methods. Preliminary analysis of the asymptotic properties of variational estimators is provided in Wang and Titterington (2004). A recent experimental study of EP and Laplace approximations to binary GP classifiers has been undertaken by Kuss and Rasmussen (2005), and it is motivating to consider a similar comparison for the variational approximation in the multiple-class setting. Kuss and Rasmussen observed that the marginal and predictive likelihoods, computed over a wide range of covariance kernel hyperparameter values, were less well preserved by the Laplace approximation than by the EP approximation when compared to those obtained by MCMC. We therefore consider the predictive likelihood obtained by the Gibbs sampler and compare it to the variational and Laplace approximations of the GP-based classifiers. The toy data set from the previous section is employed, and, as in Kuss and Rasmussen (2005), a covariance kernel of the form s exp{−φ ∑_d (x_id − x_jd)²} is adopted. Both s and φ are varied in the range (log scale) −1 to +5, and at each pair of hyperparameter values a multinomial probit GP classifier is induced using (1) MCMC via the Gibbs sampler, (2) the proposed variational approximation, and (3) a Laplace approximation of the probit model.
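The hyperparameter sweep can be set up as in the sketch below. The grid resolution and the use of natural logarithms are assumptions; the text gives only the log-scale range of −1 to +5 for both s and φ.

```python
import math

def kernel(xi, xj, s, phi):
    """Covariance s * exp{-phi * sum_d (x_id - x_jd)^2} of section 6.2."""
    return s * math.exp(-phi * sum((a - b) ** 2 for a, b in zip(xi, xj)))

def hyper_grid(lo=-1.0, hi=5.0, steps=7):
    """All (s, phi) pairs on a log-scale grid over [lo, hi]; a classifier is
    induced at each pair and its predictive likelihood recorded."""
    vals = [math.exp(lo + i * (hi - lo) / (steps - 1)) for i in range(steps)]
    return [(s, p) for s in vals for p in vals]
```

Evaluating each of the three inference methods at every grid point yields the isocontour surfaces compared in Figures 3 and 4.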
For the Gibbs sampler, after a burn-in of 2000 samples, the following 1000 samples were used for inference purposes, and the predictive likelihood (probability of target values in the test set) and test error (0-1 error loss) were estimated from the 1000 post-burn-in samples as detailed in appendix D. We first consider a binary classification problem by merging classes 2 and 3 of the toy data set into one class. The first thing to note from Figure 3 is that the predictive likelihood response under the variational approximation preserves, to a rather good degree, the predictive likelihood response obtained when using Gibbs sampling across the range of hyperparameter values. However, the Laplace approximation does not do as good a job in replicating the levels of the response profile obtained using MCMC over the range of hyperparameter values considered. This finding is consistent with the results of Kuss and Rasmussen (2005). The Laplace approximation to the multinomial probit model has O(K 3 N3 ) scaling (see appendix E), which limits its application to situations where the number of classes is small. For this reason, in the following experiments we instead consider the multinomial logit Laplace approximation
Figure 3: Isocontours of predictive likelihood for binary classification problem: (a) Gibbs sampler, (b) variational approximation, (c) Laplace approximation.
Figure 4: Isocontours of predictive likelihood for multiclass classification problem: (a) Gibbs sampler, (b) variational approximation, (c) Laplace approximation.
(Williams & Barber, 1998). In Figure 4 the isocontours of predictive likelihood for the toy data set in the multiclass setting under various hyperparameter settings are provided. As with the binary case, the variational multinomial probit approximation provides predictive likelihood response levels that are good representations of those obtained from the Gibbs sampler. The Laplace approximation for the multinomial logit suffers from the same distortion of the contours as does the Laplace approximation for the binary probit; in addition, the information in the predictions is lower. We note, as in Kuss and Rasmussen (2005), that for s = 1 (log s = 0), the Laplace approximation compares reasonably with results from both MCMC and variational approximations. In the following experiment, four standard multiclass data sets (Iris, Thyroid, Wine, and Forensic Glass) from the UCI Machine Learning Data
Table 1: Results of Comparison of Gibbs Sampler, Variational, and Laplace Approximations When Applied to Several UCI Data Sets.

Data set        Metric                  Laplace            Variational        Gibbs Sampler
Toy Data        Marginal likelihood     −169.27 ± 4.27     −232.00 ± 17.13    −94.07 ± 11.26
                Predictive error        3.97 ± 2.00        3.65 ± 1.95        3.49 ± 1.69
                Predictive likelihood   −98.90 ± 8.22      −72.27 ± 9.25      −73.44 ± 7.67
Iris            Marginal likelihood     −143.87 ± 1.04     −202.98 ± 1.37     −45.27 ± 6.17
                Predictive error        3.88 ± 2.00        4.08 ± 2.16        4.08 ± 2.16
                Predictive likelihood   −10.43 ± 1.12      −7.35 ± 1.27       −7.26 ± 1.40
Thyroid         Marginal likelihood     −158.18 ± 1.94     −246.24 ± 1.63     −68.82 ± 8.29
                Predictive error        4.73 ± 2.36        3.86 ± 2.04        3.94 ± 2.02
                Predictive likelihood   −19.01 ± 2.55      −14.62 ± 2.70      −14.47 ± 2.39
Wine            Marginal likelihood     −152.22 ± 1.29     −253.90 ± 1.52     −68.65 ± 6.19
                Predictive error        2.95 ± 2.16        2.65 ± 1.87        2.78 ± 2.07
                Predictive likelihood   −14.57 ± 1.29      −10.16 ± 1.47      −10.47 ± 1.41
Forensic Glass  Marginal likelihood     −275.11 ± 2.87     −776.79 ± 5.75     −268.21 ± 5.46
                Predictive error        36.54 ± 4.74       32.79 ± 4.57       34.00 ± 4.62
                Predictive likelihood   −90.38 ± 3.25      −77.60 ± 3.91      −79.86 ± 4.80

Note: Best results for Predictive likelihood are highlighted in bold.
Repository,¹⁰ along with the toy data previously described, are used. For each data set, a random 60% training and 40% testing split was used to assess the performance of each of the classification methods being considered, and 50 random splits of each data set were used. For the toy data set, 50 random train and test sets were generated. The hyperparameters for an RBF covariance function taking the form exp{−∑_d φ_d (x_id − x_jd)²} were estimated employing the variational importance sampler, and these were then fixed and employed in all the classification methods considered. The marginal likelihood for the Gibbs sampler was estimated by using 1000 samples from the GP prior. For each data set and each method (multinomial logit Laplace approximation, variational approximation, and Gibbs sampler), the marginal likelihood (lower bound in the case of the variational approximation), predictive error (0-1 loss), and predictive likelihood were measured. The results, given as the mean and standard deviation over the 50 data splits, are listed in Table 1. The predictive likelihood obtained from the multinomial logit Laplace approximation is consistently, across all data sets, lower than that of the

¹⁰ Available online at http://www.ics.uci.edu/∼mlearn/MPRepository.html.
variational approximation and the Gibbs sampler. This indicates that the predictions from the Laplace approximation are less informative about the target values than those of both other methods considered. In addition, the variational approximation yields predictive distributions that are as informative as those provided by the Gibbs sampler; however, the 0-1 prediction errors obtained across all methods do not differ as significantly. Kuss and Rasmussen (2005) made a similar observation for the binary GP classification problem when Laplace and EP approximations were compared to MCMC. It will be interesting to further compare EP and variational approximations in this setting. We have observed that the predictions obtained from the variational approximation are in close agreement with those of MCMC, while the Laplace approximation suffers from some inaccuracy. This has also been reported for the binary classification setting in Kuss and Rasmussen (2005).
The isocontours of constant target posterior probability at a level of one-third (the decision boundaries) for each of the three classes are shown by the solid and dashed lines. What is interesting is that the 50 included points (circled) all sit close to, or on, the corresponding decision boundaries as would be expected given the selection criteria proposed. These can be considered as a probabilistic analog to the support vectors of an SVM. The rates of 0-1 error convergence using both random and informative point sampling are shown in Figure 5b. The procedure was repeated 20 times using the same data samples, and the error bars show one standard deviation over these repeats. It is clear that on this example at least, random sampling has the slowest convergence, and the informative point inclusion strategy achieves less than 1% predictive error after the inclusion of only 30 data points. Of course we should bridle our enthusiasm by recalling that the estimated covariance kernel parameters are already supplied. Nevertheless, multiclass IVM makes Bayesian GP inference on large-scale problems with multiple classes feasible, as will be demonstrated in the following example.
Figure 5: (a) Scatter plot of the first two dimensions of the 1000 available data samples. Each class is denoted by ×, +, or •, and the decision boundaries, denoted by the contours of target posterior probability equal to one-third, are plotted as solid and dashed lines. The 50 points selected based on the proposed criterion are circled, and it is clear that these sit close to the decision boundaries. (b) The averaged predictive performance (percentage predictions correct) over 20 random starts (dashed line denotes random sampling; solid line denotes informative sampling), with the slowest-converging plot characterizing what is achieved under a random sampling strategy.
6.4 Large-Scale Example of Sparse GP Multiclass Classification. The Isolet data set¹¹ comprises 6238 examples of letters from the alphabet spoken in isolation by 30 individual speakers, and each letter is represented by 617 features. An independent collection of 1559 spoken letters is available for classification test purposes. The best reported test performance over all 26 classes of letter was 3.27% error, achieved using 30-bit error-correcting codes with an artificial neural network. Here we employ a single RBF covariance kernel with a common inverse length scale of 0.001 (further fine tuning is of course possible), and a maximum of 2000 points from the available 6238 are to be employed in the sparse multiclass GP classifier. As in the previous example, data are standardized; both random and informative sampling strategies were employed, with the results given in Figure 6 illustrating the superior convergence of an informative sampling strategy. After including 2000 of the available 6238 samples in the model under the informative sampling strategy, a test error rate of 3.52% is achieved. We are unaware of any multiclass GP classification method that has been applied to such a large-scale problem in terms of both the number of data samples available and the number of classes.

¹¹ The data set is available online from http://www.ics.uci.edu/mlearn/databases/isolet.
Figure 6: (a) The predictive likelihood computed on held-out data for both random sampling (solid line with + markers) and informative sampling (solid line with markers). The predictive likelihood is computed once every 50 inclusion steps. (b) The predictive performance (percentage predictions correct) achieved for both random sampling (solid line with + markers) and informative sampling (solid line with markers)
Qi, Minka, Picard, and Ghahramani (2004) have presented an empirical study of ARD when employed to select basis functions in relevance vector machine (RVM) (Tipping, 2000) classifiers. It was observed that reliance on the marginal likelihood alone as a criterion for model identification ran the risk of overfitting the available data sample by producing an overly sparse representation. The authors then employ an approximation to the leave-one-out error, which emerges from the EP iterations, to counteract this problem. For Bayesian methods that rely on optimizing in-sample marginal likelihood (or an appropriate bound), great care has to be taken when setting the convergence tolerance, which determines when the optimization routine should halt. However, in the experiments we have conducted, this phenomenon did not appear to be a problem, with the exception of one data set, discussed in the following section.

6.5 Comparison with Multiclass SVM. To briefly compare the performance of the proposed approach to multiclass classification with a number of multiclass SVM methods, we consider the recent study of Duan and Keerthi (2005). In that work, four forms of multiclass classifier were considered: WTAS (one-versus-all SVM with winner-takes-all class selection), MWVS (one-versus-one SVM with a maximum-votes class selection strategy), PWCP (one-versus-one SVM with probabilistic outputs employing pairwise coupling; see Duan & Keerthi, 2005, for details), and PWCK (kernel logistic regression with pairwise coupling of binary outputs). Five multiclass data sets from the UCI Machine Learning Data Repository were employed: ABE (16 dimensions and 3 classes), a subset of the Letters data
Table 2: SVM and Variational Bayes GP Multiclass Classification Comparison.

Data set  WTAS          MWVS          PWCP          PWCK          VBGPM          VBGPS
SEG       9.4 ± 0.5     7.9 ± 1.2     7.9 ± 1.2     7.5 ± 1.2     *7.8 ± 1.5     11.5 ± 1.2
DNA       10.2 ± 1.3    9.9 ± 0.9     8.9 ± 0.8     9.7 ± 0.7     74.0 ± 0.3     13.3 ± 1.3
ABE       1.9 ± 0.8     1.9 ± 0.6     1.8 ± 0.6     1.8 ± 0.6     *1.8 ± 0.8     2.4 ± 0.8
WAV       17.2 ± 1.4    17.8 ± 1.4    16.4 ± 1.4    15.6 ± 1.1    25.2 ± 1.2     *15.6 ± 0.7
SAT       11.1 ± 0.6    11.0 ± 0.7    10.9 ± 0.4    11.2 ± 0.6    12.0 ± 0.4     12.1 ± 0.4

Note: The asterisk highlights the cases where the proposed GP-based multiclass classifiers were part of the best-performing set.
set using the letters A, B, and E; DNA (180 dimensions and 3 classes); SAT (36 dimensions and 6 classes), Satellite Image; SEG (18 dimensions and 7 classes), Image Segmentation; and WAV (21 dimensions and 3 classes), Waveform. For each of these, Duan and Keerthi (2005) created 20 random partitions into training and test sets for three different sizes of training set, ranging from small to large. Here we consider only the smallest training set sizes. In Duan and Keerthi (2005), thorough and extensive cross-validation was employed to select the (single) length-scale parameter of the gaussian kernel and the associated regularization parameters used in each of the SVMs. The proposed importance sampler is employed to obtain the posterior mean estimates for both single and multiple length scales: VBGPS (variational Bayes gaussian process classification) for a single length scale and VBGPM for multiple length scales, with a common GP covariance shared across all classes. We monitor the bound on the marginal likelihood and consider that convergence has been achieved when less than a 1% increase in the bound is observed, for all data sets except ABE, where a 10% convergence criterion was employed due to a degree of overfitting being observed after this point. In all experiments, data were standardized to have zero mean and unit variance. The percentage test errors averaged over each of the 20 data splits (mean ± standard deviation) are reported in Table 2. For each data set, the classifiers that obtained the lowest prediction error and whose performances were indistinguishable from each other at the 1% significance level using a paired t-test are highlighted in bold. An asterisk highlights the cases where the proposed GP-based multiclass classifiers were part of the best-performing set.
We see that in three of the five data sets, performance equal to the best-performing SVMs is achieved by one of the GP-based classifiers without recourse to any cross-validation or in-sample tuning, with comparable performance being achieved for SAT and DNA. The performance of VBGPM is particularly poor on DNA, possibly due to the large number (180) of binary features.
7 Conclusion and Discussion

The main novelty of this work has been to adopt the data augmentation strategy employed in obtaining an exact Bayesian analysis of binary and multinomial probit regression models for GP-based multiclass classification (of which binary is a specific case). While a full Gibbs sampler can be straightforwardly obtained from the joint likelihood of the model, approximate inference employing a factored form for the posterior is appealing from the point of view of computational effort and efficiency. The variational Bayes procedures developed provide simple iterations due to the inherent decoupling effect of the auxiliary variable between the GP components related to each class. The scaling is still dominated by an O(N³) term due to the matrix inversion required in obtaining the posterior mean for the GP variables and the repeated computing of multivariate gaussians required for the weights in the importance sampler. However, with the simple decoupled form of the posterior updates, we have shown that ADF-based online and sparse estimation yields a full multiclass IVM that has linear scaling in the number of classes and the number of available data points, and this is achieved in a straightforward manner. An empirical comparison with full MCMC suggests that the variational approximation proposed is superior to a Laplace approximation. Further ongoing work includes an investigation into the possible equivalences between EP- and variational-based approximate inference for the multiclass GP classification problem as well as developing a variational treatment of GP-based ordinal regression (Chu & Ghahramani, 2005).

Appendix A

A.1 Q(M). We employ the shorthand Q(φ) = ∏_k Q(φ_k) in the following. Consider the Q(M) component of the approximate posterior. We have

    Q(M) ∝ exp{ E_{Q(Y)Q(φ)} [ ∑_n ∑_k log p(y_nk | m_nk) + ∑_k log p(m_k | φ_k) ] }
         ∝ exp{ E_{Q(Y)Q(φ)} [ ∑_k log N_{ỹ_k}(m_k, I) + log N_{m_k}(0, C_{φ_k}) ] }
         ∝ ∏_k N_{ỹ_k}(m_k, I) N_{m_k}( 0, ( C̃⁻¹_{φ_k} )⁻¹ ),
and so we have

    Q(M) = ∏_{k=1}^{K} Q(m_k) = ∏_{k=1}^{K} N_{m_k}(m̃_k, Σ_k),
where Σ_k = (I + C̃⁻¹_{φ_k})⁻¹ and m̃_k = Σ_k ỹ_k. Now each element of C⁻¹_{φ_k} is a nonlinear function of φ_k, and so, if considered appropriate, a first-order approximation can be made to the expectation of the matrix inverse such that C̃⁻¹_{φ_k} ≈ C⁻¹_{φ̃_k}, in which case Σ_k = C_{φ̃_k}(I + C_{φ̃_k})⁻¹.
A.2 Q(Y)

    Q(Y) ∝ exp{ E_{Q(M)} [ ∑_n log p(t_n | y_n) + log p(y_n | m_n) ] }
         ∝ exp{ ∑_n log p(t_n | y_n) + log N_{y_n}(m̃_n, I) }
         ∝ ∏_n N_{y_n}(m̃_n, I) δ(y_ni > y_nk ∀ k ≠ i) δ(t_n = i).

Each y_n is then distributed as a truncated multivariate gaussian such that for t_n = i, the ith dimension of y_n is always the largest, and so we have

    Q(Y) = ∏_{n=1}^{N} Q(y_n) = ∏_{n=1}^{N} N^{t_n}_{y_n}(m̃_n, I),
where N^{t_n}_{y_n}(·, ·) denotes a K-dimensional gaussian truncated such that the dimension indicated by the value of t_n is always the largest. The posterior expectation of each y_n is now required. Note that

    Q(y_n) = Z_n⁻¹ ∏_k N_{y_nk}(m̃_nk, 1),

where Z_n = Pr(y_n ∈ C) and C = { y_n : y_nj < y_ni, ∀ j ≠ i }. Now

    Z_n = Pr(y_n ∈ C)
        = ∫_{−∞}^{+∞} N_{y_ni}(m̃_ni, 1) ∏_{j≠i} ∫_{−∞}^{y_ni} N_{y_nj}(m̃_nj, 1) dy_nj dy_ni
        = E_{p(u)} { ∏_{j≠i} Φ(u + m̃_ni − m̃_nj) },
where u is a standardized gaussian random variable such that p(u) = Nu (0, 1). For all k = i, the posterior expectation follows as ynk = Zn−1
+∞ −∞
ynk
K j=1
N ynj ( mnj , 1)dynj
Variational Bayesian Multinomial Probit Regression
= Zn−1
+∞
−∞
yni −∞
ynk N ynk ( mnk , 1)
=m nk − Zn−1 E p(u)
N yni ( mni , 1)(yni − m nj )dyni dynk
j=i,k
mnk − m ni , 1) Nu (
1811
j=i,k
(u + m ni − m nj ) .
The required expectation for the ith component follows as

    ỹ_ni = Z_n⁻¹ ∫_{−∞}^{+∞} y_ni N_{y_ni}(m̃_ni, 1) ∏_{j≠i} Φ(y_ni − m̃_nj) dy_ni
         = m̃_ni + Z_n⁻¹ E_{p(u)} { u ∏_{j≠i} Φ(u + m̃_ni − m̃_nj) }
         = m̃_ni + ∑_{k≠i} ( m̃_nk − ỹ_nk ).

The final expression in the above follows from noting that for a random variable u ∼ N(0, 1) and any differentiable function g(u), E{u g(u)} = E{g′(u)}, in which case

    E_{p(u)} { u ∏_{j≠i} Φ(u + m̃_ni − m̃_nj) } = ∑_{k≠i} E_{p(u)} { N_u(m̃_nk − m̃_ni, 1) ∏_{j≠i,k} Φ(u + m̃_ni − m̃_nj) }.
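The three expressions above can be checked numerically. The sketch below estimates Z_n and the posterior expectations ỹ_nk of the cone-truncated gaussian by Monte Carlo over u, using the identities just derived; the sample size is an illustrative choice.

```python
import math
import random

def std_norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def posterior_y(m, i, n_samples=20000, seed=0):
    """Posterior expectations of y under Q(y_n), the K-dimensional unit-variance
    gaussian with means m truncated so that dimension i (the target) is largest."""
    rng = random.Random(seed)
    K = len(m)
    Z = 0.0
    num = [0.0] * K            # numerators of the correction terms, k != i
    for _ in range(n_samples):
        u = rng.gauss(0.0, 1.0)
        cdfs = [std_norm_cdf(u + m[i] - m[j]) for j in range(K)]
        prod_all = 1.0
        for j in range(K):
            if j != i:
                prod_all *= cdfs[j]
        Z += prod_all          # Z_n = E{prod_{j!=i} Phi(u + m_i - m_j)}
        for k in range(K):
            if k != i:
                partial = 1.0
                for j in range(K):
                    if j != i and j != k:
                        partial *= cdfs[j]
                # N_u(m_k - m_i, 1): gaussian density of u with mean m_k - m_i
                num[k] += std_norm_pdf(u - (m[k] - m[i])) * partial
    Z /= n_samples
    y = [0.0] * K
    for k in range(K):
        if k != i:
            y[k] = m[k] - (num[k] / n_samples) / Z
    # ith component from the sum identity derived above
    y[i] = m[i] + sum(m[k] - y[k] for k in range(K) if k != i)
    return y
```

By construction the shifts balance, ∑_k (ỹ_nk − m̃_nk) = 0, and the target dimension's expectation is pushed above the others, as the truncation requires.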
A.3 Q(φ_k). For each k, we obtain the posterior component

    Q(φ_k) ∝ exp{ E_{Q(m_k)Q(ψ_k)} [ log p(m_k | φ_k) + log p(φ_k | ψ_k) ] }
           = Z_k⁻¹ N_{m̃_k}(0 | C_{φ_k}) ∏_d Exp_{φ_kd}(ψ̃_kd),

where Z_k is the corresponding normalizing constant for each posterior, which is unobtainable in closed form. As such, the required expectations can be obtained by importance sampling.

A.4 Q(ψ_k). The final posterior component required is

    Q(ψ_k) ∝ exp{ E_{Q(φ_k)} [ log p(φ_k | ψ_k) + log p(ψ_k | α_k) ] }
           ∝ ∏_d Exp_{φ̃_kd}(ψ_kd) Γ_{ψ_kd}(σ_k, τ_k)
           = ∏_d Γ_{ψ_kd}(σ_k + 1, τ_k + φ̃_kd),

and the required posterior mean values follow as ψ̃_kd = (σ_k + 1)/(τ_k + φ̃_kd).
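The two updates of A.3 and A.4 can be sketched as follows. The ψ̃ update is the closed-form gamma posterior mean; for φ̃, the intractable weight N_{m̃_k}(0 | C_φ) is abstracted into a `log_weight` callback so the sketch shows only the self-normalized importance-sampling mechanics, with a simple one-dimensional toy weight in the test.

```python
import math
import random

def update_psi(phi_tilde, sigma_k, tau_k):
    """Closed-form posterior mean of each psi_kd (appendix A.4):
    (sigma_k + 1) / (tau_k + phi_tilde_kd)."""
    return [(sigma_k + 1.0) / (tau_k + p) for p in phi_tilde]

def importance_mean(log_weight, proposal_sample, n=8000, seed=0):
    """Self-normalized importance sampling for the posterior mean of phi:
    draw phi_s from the Exp(psi) proposal and weight each draw by the
    remaining posterior factor, supplied as log_weight(phi_s)."""
    rng = random.Random(seed)
    samples = [proposal_sample(rng) for _ in range(n)]
    lw = [log_weight(s) for s in samples]
    mx = max(lw)                               # subtract max for stability
    w = [math.exp(l - mx) for l in lw]
    return sum(wi * si for wi, si in zip(w, samples)) / sum(w)
```

With an Exp(1) proposal and log-weight −φ, the implied target is Exp(2), whose mean 0.5 the estimator should recover; in the real model the weight depends on the data through C_φ.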
Appendix B

The predictive distribution for a new point x_new can be obtained by first marginalizing the associated GP random variables such that

    p(y_new | x_new, X, t) = ∫ p(y_new | m_new) p(m_new | x_new, X, t) dm_new
        = ∏_{k=1}^{K} ∫ N_{y_k^new}(m_k^new, 1) N_{m_k^new}(m̃_k^new, σ_k^{2,new}) dm_k^new
        = ∏_{k=1}^{K} N_{y_k^new}(m̃_k^new, (ν_k^new)²),

where the shorthand ν_k^new = √(1 + σ_k^{2,new}) is employed. Now that we have the predictive posterior for the auxiliary variable y_new, the appropriate conic truncation of this spherical gaussian yields the required distribution P(t_new = k | x_new, X, t) as follows. Using the shorthand P(t_new = k | y_new) = δ(y_k^new > y_i^new ∀ i ≠ k) δ(t_new = k) ≡ δ_{k,new}, then

    P(t_new = k | x_new, X, t) = ∫ P(t_new = k | y_new) p(y_new | x_new, X, t) dy_new
        = ∫_{C_k} p(y_new | x_new, X, t) dy_new
        = ∫ δ_{k,new} ∏_{k=1}^{K} N_{y_k^new}(m̃_k^new, (ν_k^new)²) dy_k^new
        = E_{p(u)} { ∏_{j≠k} Φ( (u ν_k^new + m̃_k^new − m̃_j^new) / ν_j^new ) }.

This is the probability that the auxiliary variable y_new is in the cone C_k, so

    ∑_{k=1}^{K} P(t_new = k | x_new, X, t) = ∑_{k=1}^{K} ∫_{C_k} p(y_new | x_new, X, t) dy_new = ∫_{R^K} p(y_new | x_new, X, t) dy_new = 1,
thus yielding a properly normalized posterior distribution over classes 1, . . . , K.

Appendix C

The variational bound conditioned on the current values of φ̃_k, ψ̃_k, α_k (assuming these are fixed values) can be obtained in the following manner using the expansion of the relevant components of the lower bound:

    ∑_k ∑_n E_{Q(M)Q(Y)} { log p(y_nk | m_nk) }    (C.1)
    + ∑_k E_{Q(M)} { log p(m_k | X, φ_k) }         (C.2)
    − ∑_k E_{Q(m_k)} { log Q(m_k) }                (C.3)
    − ∑_n E_{Q(y_n)} { log Q(y_n) }.               (C.4)
Expanding each component in turn obtains

    −(NK/2) log 2π − (1/2) ∑_k ∑_n ( ȳ²_nk + m̄²_nk − 2 ỹ_nk m̃_nk )    (C.5)

    −(1/2) ∑_k m̃_kᵀ C̃⁻¹_{φ_k} m̃_k − (1/2) ∑_k log|C_{φ_k}|~ − (1/2) ∑_k trace( C̃⁻¹_{φ_k} Σ_k ) − (NK/2) log 2π    (C.6)

    −(NK/2) − (NK/2) log 2π − (1/2) ∑_k log |Σ_k|    (C.7)

    −(1/2) ∑_k ∑_n ( ȳ²_nk + m̃²_nk − 2 ỹ_nk m̃_nk ) − ∑_n log Z_n − (NK/2) log 2π.    (C.8)
Combining and manipulating equations C.5 to C.8 (the log 2π terms cancel) gives the following expression for the lower bound:

    NK/2 − (1/2) ∑_k trace{Σ_k}
    − (1/2) ∑_k m̃_kᵀ C̃⁻¹_{φ_k} m̃_k − (1/2) ∑_k trace( C̃⁻¹_{φ_k} Σ_k )
    − (1/2) ∑_k log|C_{φ_k}|~ + (1/2) ∑_k log |Σ_k| + ∑_n log Z_n,

where each Z_n = E_{p(u)} { ∏_{j≠i} Φ(u + m̃_ni − m̃_nj) }.
Appendix D

Details of the Gibbs sampler required to obtain samples from the posterior p(M, Y | t, X, Φ, α) now follow. From the definition of the joint likelihood (see equation 3.2), it is straightforward to see that the conditional distribution for each y_n | m_n will be a truncated gaussian defined in the cone C_{t_n}, centered at m_n with identity covariance, and denoted by N^{t_n}_y(m_n, I). The distribution for each m_k | y_k is multivariate gaussian with covariance Σ_k = C_{φ_k}(I + C_{φ_k})⁻¹ and mean Σ_k y_k. Thus, the Gibbs sampler, for each n and k, takes the simple form

    y_n^{(i)} | m_n^{(i−1)} ∼ N^{t_n}_y(m_n^{(i−1)}, I)
    m_k^{(i)} | y_k^{(i)} ∼ N_m(Σ_k y_k^{(i)}, Σ_k),

where the superscript (i) denotes the ith sample drawn. The dominant scaling will be O(KN³) per sample draw. With the multinomial probit likelihood for a new data point defined as

    P(t_new = k | m_new) = E_{p(u)} { ∏_{j≠k} Φ(u + m_k^new − m_j^new) },

the predictive distribution¹² is then obtained from

    P(t_new = k | x_new, X, t) = ∫ P(t_new = k | m_new) p(m_new | x_new, X, t) dm_new.

A Monte Carlo estimate of the above required marginal posterior expectation can be obtained by drawing samples from the full posterior

¹² Conditioning on Φ and α is implicit.
distribution, p(M, Y | t, X, Φ, α), using the above sampler. Then for each Y^{(i)} sampled, an additional set of samples m_k^{new,s} is drawn, such that for each k, m_k^{new,s} | y_k^{(i)} ∼ N_m(μ_k^{new,i}, σ_k^{2,new}), where μ_k^{new,i} = (y_k^{(i)})ᵀ (I + C_{φ_k})⁻¹ C^{new}_{φ_k}, and the associated variance is σ_k^{2,new} = c^{new}_{φ_k} − (C^{new}_{φ_k})ᵀ (I + C_{φ_k})⁻¹ C^{new}_{φ_k}. The approximate predictive distribution can then be obtained by the following Monte Carlo estimate:

    (1/N_samps) ∑_{s=1}^{N_samps} E_{p(u)} { ∏_{j≠k} Φ(u + m_k^{new,s} − m_j^{new,s}) }.
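One sweep of the sampler above can be sketched as follows. This is a deliberately simplified stand-in: the cone-truncated draw uses plain rejection sampling (adequate for illustration, inefficient in general), and the N×N matrix Σ_k = C_{φ_k}(I + C_{φ_k})⁻¹ is replaced by a scalar `sigma` so the sketch stays self-contained.

```python
import math
import random

def sample_truncated_y(m, t, rng):
    """Draw y ~ N(m, I) conditioned on the cone C_t = {y : y_t > y_j, all j != t}
    by rejection sampling."""
    while True:
        y = [rng.gauss(mk, 1.0) for mk in m]
        if all(y[t] > y[j] for j in range(len(m)) if j != t):
            return y

def gibbs_step(M, T, sigma, rng):
    """One Gibbs sweep (appendix D): sample each y_n | m_n from the truncated
    gaussian, then each m_nk | y_nk from a gaussian whose mean and variance use
    the scalar `sigma` as a stand-in for Sigma_k = C(I + C)^-1."""
    N, K = len(T), len(M[0])
    Y = [sample_truncated_y(M[n], T[n], rng) for n in range(N)]
    M_new = [[rng.gauss(sigma * Y[n][k], math.sqrt(sigma)) for k in range(K)]
             for n in range(N)]
    return Y, M_new
```

Iterating `gibbs_step` yields the (Y, M) chain whose post-burn-in draws feed the Monte Carlo predictive estimate above.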
An additional Metropolis-Hastings subsampler can be employed within the above Gibbs sampler to draw samples from the posterior p(Φ, Ψ | t, X, α) if the covariance function hyperparameters are to be integrated out.

Appendix E

The Laplace approximation requires the Hessian matrix of second-order derivatives of the joint log likelihood with respect to each m_n. The derivatives of the noise component, log P(t_n = k | m_n) = log E_{p(u)} { ∏_{j≠k} Φ(u + m_nk − m_nj) }, follow, where we denote expectation with respect to a gaussian truncated in the cone C_k as E_{N^k_y}{·}:

    ∂/∂m_ni log P(t_n = k | m_n) = P(t_n = k | m_n)⁻¹ ∫_{C_k} (y_ni − m_ni) N_{y_n}(m_n, I) dy = E_{N^k_y}{y_ni} − m_ni

and

    ∂²/∂m_nj ∂m_ni log P(t_n = k | m_n) = E_{N^k_y}{y_ni y_nj} − E_{N^k_y}{y_ni} E_{N^k_y}{y_nj} − δ_ij.

This then defines an NK × NK–dimensional Hessian matrix that, unlike the Hessian of the multinomial logit counterpart, cannot be decomposed into a diagonal plus multiplicative form (refer to Williams & Barber, 1998, for details), due to the cross-diagonal elements E_{N^k_y}{y_ni y_nj}, and so the required matrix inversions of the Newton step and those required to obtain the predictive covariance will operate on a full NK × NK matrix.

Acknowledgments

This work is supported by Engineering and Physical Sciences Research Council grants GR/R55184/02 & EP/C010620/1. We are grateful to
Chris Williams, Jim Kay, and Joaquin Quiñonero-Candela for motivating discussions regarding this work. In addition, the comments and suggestions made by the anonymous reviewers helped to improve the manuscript significantly.

References

Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679.

Beal, M. (2003). Variational algorithms for approximate Bayesian inference. Unpublished doctoral dissertation, University College London.

Chu, W., & Ghahramani, Z. (2005). Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6, 1019–1041.

Csato, L., Fokue, E., Opper, M., Schottky, B., & Winther, O. (2000). Efficient approaches to gaussian process classification. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 252–257). Cambridge, MA: MIT Press.

Csato, L., & Opper, M. (2002). Sparse online gaussian processes. Neural Computation, 14, 641–668.

Duan, K., & Keerthi, S. (2005). Which is the best multi-class SVM method? An empirical study. In N. C. Oza, R. Polikar, J. Kittler, & F. Roli (Eds.), Proceedings of the Sixth International Workshop on Multiple Classifier Systems (pp. 278–285). Seaside, CA.

Gibbs, M. N., & MacKay, D. J. C. (2000). Variational gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6), 1458–1464.

Girolami, M., & Rogers, S. (2005). Hierarchic Bayesian models for kernel learning. In Proceedings of the 22nd International Conference on Machine Learning (pp. 241–248). New York: ACM.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37, 183–233.

Kim, H. C. (2005). Bayesian and ensemble kernel classifiers. Unpublished doctoral dissertation, Pohang University of Science and Technology. Available online at http://home.postech.ac.kr/∼grass/publication/.
Kuss, M., & Rasmussen, C. E. (2005). Assessing approximate inference for binary gaussian process classification. Journal of Machine Learning Research, 6, 1679–1704. Lawrence, N. D., Milo, M., Niranjan, M., Rashbass, P., & Soullier, S. (2004). Reducing the variability in cDNA microarray image processing by Bayesian inference. Bioinformatics, 20(4), 518–526. Lawrence, N. D, Platt, J. C., & Jordan, M. I. (2005). Extensions of the informative vector machine. In J. Winkler, N. D. Lawrence, & M. Niranjan (Eds.), Proceedings of the Sheffield Machine Learning Workshop. Berlin: Springer-Verlag. Lawrence, N. D., Seeger, M., & Herbrich, R. (2003). Fast sparse gaussian process methods: The informative vector machine. In S. Thrun, S. Becker, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Variational Bayesian Multinomial Probit Regression
1817
MacKay, D. J. C (2003). Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press. Minka, T. P. (2001). A family of algorithms for approximate Bayesian inference. Unpublished doctoral dissertation, Massachusetts Institute of Technology. Neal, R. (1998). Regression and classification using gaussian process priors. In A. P. Dawid, M. Bernardo, J. O. Berger, & A. F. M. Smith (Eds.), Bayesian statistics 6 (pp. 475–501). New York: Oxford University Press. Opper, M., & Winther, O. (2000). Gaussian processes for classification: Mean field algorithms. Neural Computation, 12, 2655–2684. Platt, J. C., Cristianini, N., & Shawe-Taylor, J. (2000). Large margin DAGs for multi¨ class classification. In S. A. Solla, T. K. Leen, & K.-R. Muller (Eds.), Advances in neural information processing systems, 12 (pp. 547–553). Cambridge, MA: MIT Press. Qi, Y., Minka, T. P., Picard, R. W., & Ghahramani, Z. (2004). Predictive automatic relevance determination by expectation propagation. In R. Greiner & D. Schuurmans (Eds.), Proceedings of the Twenty-First International Conference on Machine Learning. New York: ACM. Quinonero-Candela, J., & Winther, O. (2003). Incremental gaussian processes. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press. Seeger, M. (2000). Bayesian model selection for support vector machines, gaussian ¨ processes and other kernel classifiers. In S. A. Solla, T. K. Leen, & K.-R. Muller (Eds.), Advances in neural information processing Systems, 12 (pp. 603–609). Cambridge, MA: MIT Press. Seeger, M., & Jordan, M. I. (2004). Sparse gaussian Process classification with multiple classes (Tech. Rep. 661). Berkeley: Department of Statistics, University of California. Seeger, M., Williams, C. K. I., & Lawrence, N. D. (2003). Fast forward selection to speed up sparse gaussian process regression. In C. M. Bishop, & B. J. 
Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. Np: Society for Artificial Intelligence and Statistics. Tipping, M. (2000). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244. Wang, B., & Titterington, D. M. (2004). Convergence and asymptotic normality of variational Bayesian approximations for exponential family models with missing values (Tech. Rep. No. 04–02). Glasgow: Department of Statistics, University of Glasgow. Williams, C. K. I., & Barber, D. (1998). Bayesian classification with gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1342–1351. Williams, C. K. I., & Rasmussen, C. E. (1996). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural processing systems, 8 (pp. 598–604). Cambridge, MA: MIT Press. Williams, C. K. I., & Seeger, M. (2001). Using the Nystrom method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 682–688). Cambridge, MA: MIT Press.
Received July 1, 2005; accepted November 8, 2005.
LETTER
Communicated by Youshen Xia
A Novel Neural Network for a Class of Convex Quadratic Minimax Problems Xing-Bao Gao [email protected] College of Mathematics and Information Science, Shaanxi Normal University, Xi’an, Shaanxi 710062, China
Li-Zhi Liao [email protected] Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
Based on the inherent properties of convex quadratic minimax problems, this article presents a new neural network model for a class of convex quadratic minimax problems. By defining a proper convex energy function, we show that the new model is stable in the sense of Lyapunov and converges to an exact saddle point in finite time. Furthermore, global exponential stability of the new model is shown under mild conditions. Compared with the existing neural networks for the convex quadratic minimax problem, the proposed neural network has finite-time convergence, a simpler structure, and lower complexity. Thus, the proposed neural network is more suitable for parallel implementation by using simple hardware units. The validity and transient behavior of the proposed neural network are illustrated by some simulation results.

1 Introduction

In this letter, we are interested in the following convex quadratic minimax problem:

min_{x∈U} max_{y∈V} f(x, y),   (1.1)

where

f(x, y) = (1/2) x^T Hx + h^T x − x^T Qy − (1/2) y^T Sy − s^T y,   (1.2)
H ∈ R^{m×m}, S ∈ R^{n×n}, Q ∈ R^{m×n}, h ∈ R^m, and s ∈ R^n are given, with H and S symmetric and positive semidefinite, U = {x ∈ R^m | a_i ≤ x_i ≤ b_i, i = 1, 2, ..., m}, V = {y ∈ R^n | c_j ≤ y_j ≤ d_j, j = 1, 2, ..., n}, and some −a_i (or −c_j, b_i, d_j) could be +∞.

Neural Computation 18, 1818–1846 (2006)
© 2006 Massachusetts Institute of Technology

Minimax problems provide a useful reformulation of optimality conditions and also arise in a variety of engineering and economic contexts, including game theory, military scheduling, and automatic control. In particular, problem 1.1 includes:

- (Piecewise) linear programming (H = 0 and S = 0)
- Linear programming (H = 0, S = 0, and Q = 0)
- Quadratic programming (S = 0 and H ≠ 0)
- Linear and quadratic programming with bound constraints
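As a quick concreteness check, equation 1.2 transcribes directly into code. The sketch below uses pure Python with made-up 2 × 2 data (the values of H, S, Q, h, and s are placeholders, not data from the letter); with H = 0 and S = 0 only the linear terms survive, matching the (piecewise) linear special case listed above.

```python
# Sketch: evaluating the saddle function f(x, y) of equation 1.2 for small
# dense matrices stored as nested lists. All numeric data below are made up.

def matvec(A, v):
    """Multiply a matrix A (list of rows) by a vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def f(x, y, H, h, Q, S, s):
    """f(x, y) = 0.5 x'Hx + h'x - x'Qy - 0.5 y'Sy - s'y  (equation 1.2)."""
    return (0.5 * dot(x, matvec(H, x)) + dot(h, x)
            - dot(x, matvec(Q, y))
            - 0.5 * dot(y, matvec(S, y)) - dot(s, y))

# With H = 0 and S = 0 the quadratic terms vanish: only h'x - x'Qy - s'y remains.
H = [[0.0, 0.0], [0.0, 0.0]]
S = [[0.0, 0.0], [0.0, 0.0]]
Q = [[1.0, 0.0], [0.0, 1.0]]
h = [1.0, -1.0]
s = [0.5, 0.5]
x = [2.0, 1.0]
y = [1.0, 3.0]
print(f(x, y, H, h, Q, S, s))  # h'x - x'Qy - s'y = 1 - 5 - 2 = -6.0
```

For this data the quadratic terms contribute nothing, so f reduces to the bilinear-plus-linear form of the linear special cases.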
In many engineering and scientific applications, real-time online solutions of minimax problems are desired. However, traditional algorithms (Fukushima, 1992; He, 1996, 1999; Rockafellar, 1987; Solodov & Tseng, 1996; Tseng, 2000) are not suitable for real-time online implementation because the computing time required for a solution depends greatly on the dimension and structure of the problem and on the complexity of the algorithm used. One promising approach to handling problems with high dimension and dense structure is to employ an artificial neural network–based circuit implementation. Because of the dynamic nature of optimization and the potential of electronic implementation, neural networks can be implemented physically by designated hardware such as application-specific integrated circuits, where the optimization procedure is truly done in parallel. Therefore, the neural network approach can solve optimization problems in running times that are orders of magnitude faster than those of conventional optimization algorithms executed on general-purpose digital computers. It is thus of great interest to develop neural network models that can provide a real-time online solution. In recent years, the neural network approach for solving optimization problems has been studied by many researchers, and many good results have been achieved (Bouzerdorm & Pattison, 1993; Friesz, Bernstein, Mehta, Tobin, & Ganjlizadeh, 1994; Gao, 2003, 2004; Gao & Liao, 2003; Gao, Liao, & Xue, 2004; Han, Liao, Qi, & Qi, 2001; He & Yang, 2000; Xia, 2004; Xia & Feng, 2004; Xia, Feng, & Wang, 2004; Xia & Wang, 1998, 2000, 2001). Since the saddle point condition for equation 1.2 can be formulated as the following linear variational inequality LVI(M, q, C), to find a vector z∗ ∈ C such that

(z − z∗)^T (Mz∗ + q) ≥ 0,  ∀z ∈ C,   (1.3)
where M ∈ R^{k×k}, q ∈ R^k, and C ⊆ R^k is a nonempty closed convex set (see remark 1), problem 1.1 can be solved by using the models in Gao et al. (2004), He and Yang (2000), and Xia and Wang (1998, 2000). In particular,
Gao et al. (2004) proposed the following neural network,

d/dt [x; y] = −λ [x − P_U(x − Hx − h + Qy); y − P_V(y − Sy − s − Q^T x)],   (1.4)
where λ > 0 is a scaling constant, P_U : R^m → U is the projection operator defined by

P_U(u) = arg min_{v∈U} ‖u − v‖,

‖·‖ is the Euclidean norm, and P_V : R^n → V is the projection operator defined similarly to P_U. Gao et al. (2004) also provided several simple and feasible sufficient conditions to ensure the asymptotic stability of equation 1.4. Although model 1.4 has a one-layer structure and is exponentially stable for any initial point in U × V when the matrices H and S are positive definite, its convergence is not very satisfactory, since it may not be stable and does not have finite-time convergence when H and S are only positive semidefinite. For example, for the problem

min_x max_y (xy),   (1.5)

where x, y ∈ R^1, model 1.4 simplifies to

dx/dt = λy,  dy/dt = −λx.   (1.6)
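The trajectories of system 1.6 circle the origin and never approach the saddle point (0, 0); a short forward-Euler sketch (the step size and horizon are arbitrary choices for this illustration) makes this easy to see numerically.

```python
# Numerical check that system 1.6 (dx/dt = lambda*y, dy/dt = -lambda*x)
# does not converge to the saddle point (0, 0): the exact flow circles the
# origin, and a forward-Euler discretization (dt is an arbitrary choice for
# this sketch) makes the norm grow at every step.
lam = 1.0
dt = 0.01
x, y = 1.0, 0.0
r0 = (x * x + y * y) ** 0.5
for _ in range(10000):
    x, y = x + dt * lam * y, y - dt * lam * x
r1 = (x * x + y * y) ** 0.5
print(r0, r1)  # r1 > r0: the discrete trajectory spirals outward
```

Each Euler step multiplies the squared norm by exactly 1 + dt²λ², so the distance from the saddle point can only grow.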
It is easy to see that equation 1.6 is divergent. The models proposed by He and Yang (2000) and Xia and Wang (1998) have good stability performance. However, the model in He and Yang (2000) is not suitable for parallel implementation because of its varying parameter, and the model in Xia and Wang (1998) has a complex structure that can be simplified further. It is therefore desirable to build a new neural network for equation 1.1 with lower complexity and good stability and convergence properties. Based on these considerations, in this article we propose a new neural network model for solving problem 1.1 by means of necessary and sufficient conditions for the saddle point of equation 1.2, define a convex energy function by introducing a convex function, and prove that the proposed neural network is stable in the sense of Lyapunov and converges to an exact saddle point in finite time when the matrices H and S are only positive semidefinite. Furthermore, global exponential stability of the new model is also shown when H and S are positive definite. Compared
with the existing neural networks and some conventional numerical methods, the new model has lower complexity and finite-time convergence, and its asymptotic stability requires only the positive semidefiniteness of the matrices H and S. Thus, the new model is very simple and more suitable for hardware implementation.

The solution of problem 1.1 is closely related to the saddle point of f(x, y). A point (x∗, y∗) ∈ U × V is said to be a saddle point of f(x, y) if

f(x∗, y) ≤ f(x∗, y∗) ≤ f(x, y∗),  ∀(x, y) ∈ U × V.   (1.7)
Throughout this letter, we assume that the set K∗ = {(x, y) ∈ U × V | (x, y) is a saddle point of f(x, y)} is nonempty and that there exists a finite point (x∗, y∗) ∈ K∗. Obviously, if (x∗, y∗) ∈ K∗ is a saddle point of f(x, y), then it must be a solution of problem 1.1. Therefore, it suffices to find a saddle point of f(x, y) for problem 1.1. For the convenience of later discussions, we introduce the following definition:

Definition 1. A neural network is said to have finite-time convergence to one of its equilibrium points z∗ if there exists a time τ₀ such that the output trajectory z(t) of this network reaches z∗ for t ≥ τ₀ (see Xia et al., 2004).

In our following discussions, we let ‖·‖ denote the Euclidean norm, I_n denote the identity matrix of order n, and ∇ϕ(x) = (∂ϕ(x)/∂x_1, ∂ϕ(x)/∂x_2, ..., ∂ϕ(x)/∂x_n)^T ∈ R^n denote the gradient vector of the differentiable function ϕ(x) at x. For any vector u ∈ R^n, u^T denotes its transpose. For any n × n real symmetric matrix M, λ_min(M) and λ_max(M) denote its minimum and maximum eigenvalues, respectively. A basic property of the projection mapping on a closed convex set U ⊆ R^m is (Kinderlehrer & Stampacchia, 1980)

[w − P_U(w)]^T [P_U(w) − p] ≥ 0,  ∀w ∈ R^m, p ∈ U.   (1.8)
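Property 1.8 can be spot-checked for a box set U with the componentwise projection used throughout the letter; the bounds, dimension, and sample points below are arbitrary choices for this sketch.

```python
import random

# Spot-check of the projection property 1.8 for a box U = [a, b]^m:
# [w - P_U(w)]' [P_U(w) - p] >= 0 for all w in R^m and p in U.
# Bounds, dimension, and sample points are arbitrary choices.
random.seed(0)
a, b, m = -1.0, 2.0, 3

def proj_box(w):
    return [min(b, max(wi, a)) for wi in w]

ok = True
for _ in range(1000):
    w = [random.uniform(-10, 10) for _ in range(m)]
    p = [random.uniform(a, b) for _ in range(m)]
    pw = proj_box(w)
    val = sum((wi - pwi) * (pwi - pi) for wi, pwi, pi in zip(w, pw, p))
    ok = ok and val >= -1e-12
print(ok)  # the inequality holds for every sample
```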
The rest of the letter is organized as follows. In section 2, a neural network model for problem 1.1 is proposed. The stability and convergence of the proposed network are analyzed in section 3. The simulation results of our proposed neural network are reported in section 4. Finally, some concluding remarks are drawn in section 5. 2 A Neural Network Model In this section, a neural network for solving problem 1.1 is presented, and the comparisons with the existing neural networks and some conventional numerical methods are discussed. First, we provide a necessary and
sufficient condition for the saddle point of f(x, y) in equation 1.2. This result provides the theoretical foundation for us to design the neural network for problem 1.1.

Theorem 1. (x∗, y∗) ∈ K∗ if and only if

(x − x∗)^T (Hx∗ + h − Qy∗) ≥ 0,  ∀x ∈ U,
(y − y∗)^T (Sy∗ + s + Q^T x∗) ≥ 0,  ∀y ∈ V.   (2.1)
Proof. From equation 1.7 and theorem 3.3.3 in Bazaraa, Sherali, and Shetty (1993), this can be easily proved.

Remark 1. Theorem 1 indicates that z∗ = ((x∗)^T, (y∗)^T)^T ∈ K∗ if and only if it is a solution of the monotone LVI(M, q, C) defined in equation 1.3 with k = m + n,

M = [H  −Q; Q^T  S],  q = [h; s],  and C = U × V.   (2.2)
From equations 1.8 and 2.1, we can easily establish the following result, which shows that a saddle point (x∗, y∗) of f(x, y) in equation 1.2 is the projection of some vector on U × V.

Lemma 1. (x∗, y∗) ∈ K∗ if and only if

x∗ = P_U(x∗ − Hx∗ − h + Qy∗),
y∗ = P_V(y∗ − Sy∗ − s − Q^T x∗),   (2.3)

where P_U(x) = [(P_U(x))_1, (P_U(x))_2, ..., (P_U(x))_m]^T with (P_U(x))_i = min{b_i, max{x_i, a_i}} for i = 1, 2, ..., m, and P_V(y) = [(P_V(y))_1, (P_V(y))_2, ..., (P_V(y))_n]^T with (P_V(y))_j = min{d_j, max{y_j, c_j}} for j = 1, 2, ..., n.

Lemma 1 indicates that a saddle point (x∗, y∗) of f(x, y) in equation 1.2 can be obtained by solving equation 2.3. Based on the above results, we propose the following dynamical system for a neural network model to solve problem 1.1:

d/dt [x; y] = −λ [2(x − P_U(x − Hx − h + Q P_V(y − Sy − s − Q^T x))); y − P_V(y − Sy − s − Q^T x)],   (2.4)
where λ > 0 is a scaling constant.
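The right-hand side of network 2.4 transcribes directly into code with the box projections of lemma 1. The sketch below is ours (function and variable names are not from the letter), and the scalar test data at the end are made up; at an equilibrium point the right-hand side vanishes.

```python
def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def proj_box(w, lo, hi):
    # (P_U(w))_i = min{b_i, max{w_i, a_i}}, componentwise (lemma 1)
    return [min(h_, max(w_, l_)) for w_, l_, h_ in zip(w, lo, hi)]

def rhs(x, y, H, h, Q, S, s, aU, bU, cV, dV, lam=1.0):
    """Right-hand side of network 2.4: first v = P_V(y - Sy - s - Q^T x),
    then u = P_U(x - Hx - h + Qv); dx = -2*lam*(x - u), dy = -lam*(y - v)."""
    Qt = [list(col) for col in zip(*Q)]                  # Q^T
    v_arg = [yi - Syi - si - qi for yi, Syi, si, qi
             in zip(y, matvec(S, y), s, matvec(Qt, x))]
    v = proj_box(v_arg, cV, dV)
    u_arg = [xi - Hxi - hi + Qvi for xi, Hxi, hi, Qvi
             in zip(x, matvec(H, x), h, matvec(Q, v))]
    u = proj_box(u_arg, aU, bU)
    dx = [-2.0 * lam * (xi - ui) for xi, ui in zip(x, u)]
    dy = [-lam * (yi - vi) for yi, vi in zip(y, v)]
    return dx, dy

# Scalar sanity check (made-up data): H = S = Q = 1, h = s = 0,
# U = V = [0, 1e9]. The origin is then an equilibrium point.
dx, dy = rhs([0.0], [0.0], [[1.0]], [0.0], [[1.0]], [[1.0]], [0.0],
             [0.0], [1e9], [0.0], [1e9])
print(dx, dy)  # both components are zero
```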
Figure 1: The architecture of network 2.4.
As a result of lemma 1, we have the following result, which describes the relationship between an equilibrium point of equation 2.4 and a saddle point of f(x, y) in equation 1.2.

Lemma 2. (x^T, y^T)^T ∈ K∗ if and only if (x, y) is an equilibrium point of network 2.4.

Proof. From lemma 1 and equations 2.3 and 2.4, the result is trivial.
Lemma 2 also illustrates that a saddle point (x∗, y∗) of f(x, y) in equation 1.2 is the projection of some vector on U × V and can be obtained as an equilibrium point of equation 2.4. The architecture of neural network 2.4 is shown in Figure 1, where the vectors x and y are the network's outputs, the vectors h = (h_1, h_2, ..., h_m)^T and s = (s_1, s_2, ..., s_n)^T are the external inputs, the projection operators P_U(·) and P_V(·) could be implemented by some piecewise activation functions
(Bouzerdorm & Pattison, 1993), and the other parameters are defined by H = (h_{ij})_{m×m}, Q = (q_{ij})_{m×n}, S = (s_{ij})_{n×n}, and λ̂ = 2λ. According to Figure 1, the circuit realizing the proposed neural network 2.4 consists of m + n integrators, m + n linear piecewise activation functions, (m + n)² weighted connections for H, S, Q, and Q^T, and (m + n)² + 2(m + n) adders. Thus, it can be implemented by using simple hardware units.

For the convenience of later discussions, we denote z = (x^T, y^T)^T ∈ R^{m+n} and

v = P_V(y − Sy − s − Q^T x),  u = P_U(x − Hx − h + Qv).   (2.5)

It should be noted that the definition of u in equation 2.5 requires the value of v. Then the proposed neural network 2.4 can be written as

dz/dt = −λF(z) = −λ [2(x − u); y − v].   (2.6)
To show the advantages of the proposed neural network 2.4, we compare it with four existing neural network models and some conventional numerical methods. First, we look at model 1.4 proposed by Gao et al. (2004). The function F(z) in equation 2.6 and the right-hand side of equation 1.4 are totally different, since u = P_U(x − Hx − h + Qv) ≠ P_U(x − Hx − h + Qy). It is easy to see that the complexity of the above model is about the same as that of the proposed neural network 2.4, yet the stability conditions are different. When the matrices H and S are only positive semidefinite, theorem 3 ensures that neural network 2.4 is stable in the sense of Lyapunov and converges to a saddle point in finite time, but model 1.4 may not be stable and may not converge even when the initial point z0 lies in U × V (see examples 2–5 in section 4). Thus the stability and finite-time convergence conditions of model 1.4 are stronger than those of 2.4. To clarify this issue further, we consider problem 1.5; then model 2.4 can be written as

dx/dt = 2λ(y − x),  dy/dt = −λx.
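A forward-Euler sketch of this specialization of model 2.4 (dx/dt = 2λ(y − x), dy/dt = −λx; the step size, horizon, and initial point are arbitrary choices) confirms convergence to the saddle point (0, 0).

```python
# Forward-Euler sketch of model 2.4 specialized to problem 1.5:
# dx/dt = 2*lambda*(y - x), dy/dt = -lambda*x. Step size and horizon
# are arbitrary choices for this illustration.
lam = 1.0
dt = 0.001
x, y = 1.0, 1.0
for _ in range(20000):                  # integrate to t = 20
    x, y = x + dt * 2.0 * lam * (y - x), y - dt * lam * x
r = (x * x + y * y) ** 0.5
print(r)  # close to 0: the trajectory converges to the saddle point (0, 0)
```

The linearization has eigenvalues −λ ± iλ, so the continuous trajectory decays like e^{−λt} while spiraling, in contrast with the rotation-only flow of system 1.6.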
Obviously, this system differs from model 1.6 and is stable and convergent, whereas system 1.6 is divergent. Second, we compare the proposed neural network 2.4 with the models proposed by He and Yang (2000) and Xia and Wang (1998). The model
proposed by Xia and Wang (1998) for problem 1.1 is defined as

dz/dt = −λ(I_{m+n} + M^T) e(z),   (2.7)

where λ > 0 is a scaling constant, M and q are defined in equation 2.2, and

e(z) = z − P_{U×V}(z − Mz − q).   (2.8)
In terms of the model complexity, it is easy to see that the total multiplications/divisions and additions/subtractions per iteration for equation 2.7 are 2(m + n)² + m + n and 2(m + n)² + 2(m + n), respectively. But the total multiplications/divisions and additions/subtractions per iteration for neural network 2.4 are (m + n)² + 2m + n and (m + n)² + 2(m + n), respectively. Thus the asymptotic complexity of model 2.4 is about half of that of model 2.7. Furthermore, for problem 1.1, the model proposed by He and Yang (2000) is

dz/dt = λ{P_{U×V}[z − θα(z)(M^T e(z) + Mz + q)] − z},   (2.9)
where λ > 0 is a scaling constant, θ ∈ (0, 2), M, q, and e(z) are defined in equations 2.2 and 2.8, respectively, and α(z) = ‖e(z)‖²/‖(I_{m+n} + M^T)e(z)‖². It is easy to see that this model is not suitable for parallel implementation due to the choice of the varying parameter α(z), and it requires computing two projections and the terms e(z), M^T e(z), and α(z) per iteration. Even though the proposed neural network 2.4 has a two-layer structure, it is required to compute only one projection and the term F(z) in equation 2.6 per iteration. Since the complexity of F(z) is about the same as that of e(z), the proposed neural network has a low complexity. Therefore, model 2.4 is simpler than models 2.7 and 2.9 and reduces the model complexity in implementation. In addition, no result on the finite-time convergence of models 2.7 and 2.9 is available in the literature, and the stability of model 2.9 requires that the initial point z0 lie in U × V, yet theorem 3 ensures that neural network 2.4 is stable and convergent in finite time for any initial point z0 ∈ R^{m+n}. Third, we compare the proposed neural network 2.4 with the model proposed by Gao (2004). Model 2.4 is designed to solve convex quadratic minimax problems, while the model in Gao (2004) is developed to solve nonlinear convex programming problems. Thus, the energy functions and theoretical results of the two models are different. In particular, model 2.4 is globally exponentially stable when the matrices H and S are positive definite (see theorem 4), but the model in Gao (2004) has no exponential stability result. Moreover, for a convex quadratic problem (problem 1.1 with S = 0 and V = {y ∈ R^n | y ≥ 0}), even though the two models are the same, the finite-time convergence results are different. Model 2.4 is stable
and converges to a saddle point in finite time (see theorem 3), but no result on the finite-time convergence of the model in Gao (2004) is available in the literature.

Finally, we compare the proposed neural network 2.4 with two typical numerical methods: a modified projection-type method (Solodov & Tseng, 1996) and a forward-backward splitting method (Tseng, 2000). For problem 1.1, the modified projection-type method proposed by Solodov and Tseng (1996) is defined as

z^{k+1} = z^k − θγ(z^k) N^{−1} (I_{m+n} + M^T) e(z^k),   (2.10)

where θ ∈ (0, 2), N is an (m + n) × (m + n) symmetric positive-definite matrix, M and e(z) are defined in equations 2.2 and 2.8, respectively, and γ(z) = ‖e(z)‖²/[e(z)^T (I_{m+n} + M) N^{−1} (I_{m+n} + M^T) e(z)]. It is easy to see that this method is not suitable for parallel implementation due to the choice of the varying parameter γ(z^k), and its asymptotic complexity is about two times that of model 2.4 even when N = I_{m+n}. Furthermore, for problem 1.1, the forward-backward splitting method proposed by Tseng (2000) is defined as

z̄^k = P_{U×V}(z^k − θ(Mz^k + q)),
z^{k+1} = P_{U×V}(z̄^k + θM(z^k − z̄^k)),   (2.11)
where θ is a positive constant and M and q are defined in equation 2.2. This method can be viewed as a prediction-correction method and requires two projections per iteration, and its asymptotic complexity is about two times that of model 2.4. In addition, besides the positive semidefiniteness requirement on the matrix M, the parameters N and θ are key to the convergence of method 2.10, and method 2.11 is globally convergent only when θ < ν/‖M‖ with 0 < ν < 1. On the other hand, the stability and convergence of model 2.4 require only the positive semidefiniteness of the matrices H and S, and model 2.4 has finite-time convergence without the condition θ < ν/‖M‖. Thus, model 2.4 is simpler than methods 2.10 and 2.11, avoids the difficulty of choosing the network parameters, and requires a weaker convergence condition.

3 Stability Analysis

In this section, we study some theoretical properties of model 2.4. First, we prove the following lemma, which will be very useful in our later discussion.

Lemma 3. Let

ϕ(z) = (1/2)(‖y − Sy − s − Q^T x‖² − ‖y − Sy − s − Q^T x − v‖²),   (3.1)
where v is defined in equation 2.5. Then the following is true:

i. ϕ(z) is continuously differentiable and convex on R^{m+n}.
ii. For any z, z′ = ((x′)^T, (y′)^T)^T ∈ R^{m+n}, the following inequality holds: ϕ(z) ≤ ϕ(z′) + (z − z′)^T ∇ϕ(z′) + (z − z′)^T W(z − z′)/2, where

W = [QQ^T  −Q(I_n − S); −(I_n − S)Q^T  (I_n − S)²] = [−Q; I_n − S] [−Q^T  I_n − S].   (3.2)
Proof. From equation 1.8, we can easily verify that for any closed convex set Ω,

‖P_Ω(p) − P_Ω(w)‖² ≤ (p − w)^T [P_Ω(p) − P_Ω(w)] ≤ ‖p − w‖²,  ∀p, w ∈ R^n.   (3.3)
Let ϕ₁(z) = ‖y − Sy − s − Q^T x‖²/2 and ϕ₂(z) = ‖y − Sy − s − Q^T x − v‖²/2, where v is defined in equation 2.5. Then ϕ(z) = ϕ₁(z) − ϕ₂(z).

i. Obviously, ϕ₂(z) is a compound function of the two functions ψ(w) = ‖w − P_V(w)‖²/2 and w = y − Sy − s − Q^T x. According to lemma 3.7 in Smith, Friesz, Bernstein, and Suo (1997), the function ψ(w) is continuously differentiable and ∇ψ(w) = w − P_V(w). Thus, the compound function ϕ₂(z) is differentiable with respect to z, and

∇ϕ₂(z) = [−Q(y − Sy − s − Q^T x − v); (I_n − S)(y − Sy − s − Q^T x − v)].   (3.4)

Therefore, ϕ(z) defined in equation 3.1 is also continuously differentiable, and

∇ϕ(z) = [−Qv; (I_n − S)v].   (3.5)
For any z, z′ ∈ R^{m+n}, let v′ = P_V(y′ − Sy′ − s − Q^T x′). Then from equation 3.5, we have

(z − z′)^T [∇ϕ(z) − ∇ϕ(z′)] = (v − v′)^T [(I_n − S)(y − y′) − Q^T (x − x′)] ≥ ‖v − v′‖²,

where the last step follows by setting p = y − Sy − s − Q^T x and w = y′ − Sy′ − s − Q^T x′ on the left-hand side of equation 3.3. Thus, ϕ(z) is convex on R^{m+n} by theorem 3.4.5 in Ortega and Rheinboldt (1970).

ii. Similar to the proof of lemma 3i, we can prove that ϕ₂(z) is also convex on R^{m+n} by equation 3.4 and the right-hand side of equation 3.3. Thus, for all z, z′ ∈ R^{m+n}, we have

ϕ₁(z) = ϕ₁(z′) + (z − z′)^T ∇ϕ₁(z′) + (1/2)(z − z′)^T W(z − z′)

and

ϕ₂(z) ≥ ϕ₂(z′) + (z − z′)^T ∇ϕ₂(z′)

from theorem 3.3.3 in Bazaraa et al. (1993). Therefore, lemma 3ii holds from ϕ(z) = ϕ₁(z) − ϕ₂(z).

Remark 2. If V is a closed convex cone, for example, V = {y ∈ R^n | y ≥ 0}, then ϕ(z) = ‖v‖²/2 (v is defined in equation 2.5) is continuously differentiable on R^{m+n}. However, this may not be true for a general closed convex set V. For example, let V = {x ∈ R^1 | −1 ≤ x ≤ 1}; then

v² = [P_V(x)]² = 1 if x > 1;  x² if −1 ≤ x ≤ 1;  1 if x < −1,

and

2ϕ(x) = x² − [x − P_V(x)]² = 2x − 1 if x > 1;  x² if −1 ≤ x ≤ 1;  −2x − 1 if x < −1.

Thus [P_V(x)]² ≠ x² − [x − P_V(x)]², and [P_V(x)]² is not differentiable on (−∞, +∞).

From lemma 3i, we can define the function

G(z, z∗) = (1/2)[(x − x∗)^T H(x − x∗) + 3(y − y∗)^T S(y − y∗)] + (1/2)‖z − z∗‖² + ϕ(z) − ϕ(z∗) − (z − z∗)^T ∇ϕ(z∗),   (3.6)
where z∗ ∈ K∗ is finite and ϕ(z) is defined in equation 3.1. Then we have the following result, which explores some properties of G(z, z∗).

Lemma 4. Let G(z, z∗) be the function in equation 3.6 and W be the matrix in equation 3.2. Then the following is true:

i. G(z, z∗) is continuously differentiable and convex on R^{m+n}, and

(1/2)‖z − z∗‖² ≤ G(z, z∗) ≤ (µ₁/2)‖z − z∗‖²,  ∀z ∈ R^{m+n},   (3.7)

where µ₁ = 1 + λ_max(W) + max{λ_max(H), 3λ_max(S)} ≥ 1.
ii. ∇G(z, z∗)^T F(z) ≥ ‖F(z)‖²/2, ∀z ∈ R^{m+n}.
iii. ∇G(z, z∗)^T F(z) ≥ 2µ₂ G(z, z∗), ∀z ∈ R^{m+n}, where µ₂ = 2 min{λ_min(H), λ_min(S)}/µ₁.
Proof. i. Obviously, µ₁ ≥ 1 by the positive semidefiniteness of the matrices W, H, and S. From lemma 3i, we know that ϕ(z) is continuously differentiable and convex. Thus, G(z, z∗) defined in equation 3.6 is also continuously differentiable and convex on R^{m+n}, and

ϕ(z) ≥ ϕ(z∗) + (z − z∗)^T ∇ϕ(z∗),  ∀z ∈ R^{m+n},

from theorem 3.3.3 in Bazaraa et al. (1993). Therefore, the first inequality in equation 3.7 holds by equation 3.6, since the matrices H and S are positive semidefinite. On the other hand, for all z ∈ R^{m+n}, we have from lemma 3ii that

ϕ(z) ≤ ϕ(z∗) + (z − z∗)^T ∇ϕ(z∗) + (1/2)λ_max(W)‖z − z∗‖².

Thus,

G(z, z∗) ≤ (1/2)[λ_max(H)‖x − x∗‖² + 3λ_max(S)‖y − y∗‖² + (1 + λ_max(W))‖z − z∗‖²] ≤ (µ₁/2)‖z − z∗‖²,  ∀z ∈ R^{m+n}.

ii. From equations 2.3, 2.5, 3.5, and 3.6, we have

∇G(z, z∗) = [(I_m + H)(x − x∗) − Q(v − y∗); (I_n + 3S)(y − y∗) + (I_n − S)(v − y∗)].
Then, for all z ∈ R^{m+n}, it is straightforward to obtain

∇G(z, z∗)^T F(z)
= 2(x − u)^T [(I_m + H)(x − x∗) − Q(v − y∗)] + (y − v)^T [2(I_n + S)(y − y∗) − (I_n − S)(y − v)]
= 2(u − x∗)^T [x − u − H(x − x∗) + Q(v − y∗)] + 2(x − x∗)^T H(x − x∗) + 2‖x − u‖² + ‖y − v‖² + (y − v)^T S(y − v) + 2(y − y∗)^T S(y − y∗) + 2(v − y∗)^T [y − v − S(y − y∗) − Q^T (x − x∗)]
= 2‖x − u‖² + ‖y − v‖² + 2(x − x∗)^T H(x − x∗) + (y − v)^T S(y − v) + 2(y − y∗)^T S(y − y∗) + 2(u − x∗)^T (x − u − Hx − h + Qv) + 2(u − x∗)^T (Hx∗ + h − Qy∗) + 2(v − y∗)^T (y − v − Sy − s − Q^T x) + 2(v − y∗)^T (Sy∗ + s + Q^T x∗)
≥ (1/2)‖F(z)‖² + 2(x − x∗)^T H(x − x∗) + 2(y − y∗)^T S(y − y∗) + (y − v)^T S(y − v),   (3.8)

where the last step follows from equation 2.1 (u ∈ U, v ∈ V), from

(x − Hx − h + Qv − u)^T (u − x∗) ≥ 0,

obtained by setting w = x − Hx − h + Qv and p = x∗ ∈ U in equation 1.8, and from

(y − Sy − s − Q^T x − v)^T (v − y∗) ≥ 0,

obtained by setting w = y − Sy − s − Q^T x and p = y∗ ∈ V in equation 1.8, respectively. Therefore, lemma 4ii holds by equation 3.8, since the matrices H and S are positive semidefinite.

iii. If min{λ_min(H), λ_min(S)} = 0, then µ₂ = 0 from the definition of µ₂, and the result follows from lemma 4ii. If min{λ_min(H), λ_min(S)} > 0, then µ₂ > 0. From equation 3.8 and the right-hand side of equation 3.7, we have

∇G(z, z∗)^T F(z) ≥ 2[λ_min(H)‖x − x∗‖² + λ_min(S)‖y − y∗‖²] ≥ 2 min{λ_min(H), λ_min(S)}‖z − z∗‖² ≥ 2µ₂ G(z, z∗),  ∀z ∈ R^{m+n}.
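Lemma 4ii can be spot-checked numerically on a scalar instance. The data below are made up (m = n = 1, H = S = 2, Q = 1, h = s = 0, U = V = [0, +∞)), for which z∗ = (0, 0) ∈ K∗ and the gradient formula above reduces to ∇G = (3x − v, 7y − v).

```python
import random

# Spot-check of lemma 4ii: grad G(z, z*)^T F(z) >= ||F(z)||^2 / 2 for all z,
# on a scalar instance with made-up data H = S = 2, Q = 1, h = s = 0,
# U = V = [0, +inf), saddle point z* = (0, 0).
random.seed(1)

def pieces(x, y):
    v = max(-y - x, 0.0)          # P_V(y - Sy - s - Q^T x) = P_V(-y - x)
    u = max(-x + v, 0.0)          # P_U(x - Hx - h + Q v)  = P_U(-x + v)
    return u, v

ok = True
for _ in range(1000):
    x = random.uniform(-5.0, 5.0)
    y = random.uniform(-5.0, 5.0)
    u, v = pieces(x, y)
    F = (2.0 * (x - u), y - v)
    gG = (3.0 * x - v, 7.0 * y - v)      # grad G(z, z*) at z* = (0, 0)
    lhs = gG[0] * F[0] + gG[1] * F[1]
    rhs_val = 0.5 * (F[0] ** 2 + F[1] ** 2)
    ok = ok and lhs >= rhs_val - 1e-9
print(ok)  # the inequality of lemma 4ii holds at every sampled point
```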
The results in lemma 4 are very important and pave the way for the stability results of neural network 2.4. In particular, neural network 2.4 has the following basic property:

Theorem 2. For any z0 ∈ R^{m+n}, there exists a unique and continuous solution z(t) of neural network 2.4 for all t ≥ 0 with z(0) = z0.

Proof.
From equation 3.3, we have, for any closed convex set Ω,

‖P_Ω(p) − P_Ω(w)‖ ≤ ‖p − w‖,  ∀p, w ∈ R^n.

Thus, for any z, z′ ∈ R^{m+n}, by equation 2.5 and the above inequality, we have

‖u − u′‖ ≤ ‖(I_m − H)(x − x′) + Q(v − v′)‖ ≤ ‖I_m − H‖ · ‖x − x′‖ + ‖Q‖ · ‖v − v′‖

and

‖v − v′‖ ≤ ‖I_n − S‖ · ‖y − y′‖ + ‖Q‖ · ‖x − x′‖,

where v′ = P_V(y′ − Sy′ − s − Q^T x′) and u′ = P_U(x′ − Hx′ − h + Qv′). From the above two inequalities and

‖F(z) − F(z′)‖ ≤ 2(‖x − x′‖ + ‖u − u′‖) + ‖y − y′‖ + ‖v − v′‖,  ∀z, z′ ∈ R^{m+n},

we can see that F(z) is Lipschitz continuous on R^{m+n}. Thus, the result can be established from theorem 1 in Han et al. (2001).

The results of lemma 2 and theorem 2 indicate that neural network model 2.4 is well defined. Now we are in the position to prove the following stability results for this model.

Theorem 3. Neural network 2.4 is stable in the sense of Lyapunov, and for any z0 ∈ R^{m+n}, its trajectory will reach a saddle point of f(x, y) within a finite time when the scaling parameter λ is large enough. In particular, if problem 1.1 has a unique solution, then neural network 2.4 is globally asymptotically stable.

Proof. From theorem 2, for any z0 ∈ R^{m+n}, let z(t) be the unique and continuous solution of neural network 2.4 for all t ≥ 0 with z(0) = z0.
For the function G(z, z∗) defined in equation 3.6, we have from lemma 4ii that

(d/dt) G(z(t), z∗) = −λ∇G(z(t), z∗)^T F(z(t)) ≤ −(λ/2)‖F[z(t)]‖² ≤ 0,  ∀t ≥ 0.   (3.9)

From equation 3.9, we know that G(z(t), z∗) is monotonically nonincreasing on [0, +∞). Then

‖z(t) − z∗‖² ≤ 2G(z(t), z∗) ≤ 2G(z0, z∗),  ∀t ≥ 0,

from the first inequality in equation 3.7. Thus, the set {z(t) | t ≥ 0} is bounded. Therefore, there exist a limit point ẑ and a sequence {t_n} with 0 < t_1 < t_2 < ... < t_n < t_{n+1} < ... and t_n → +∞ as n → +∞ such that

lim_{k→+∞} z(t_k) = ẑ.   (3.10)

On the other hand, for all s ≥ 0, we have from equation 3.9 that

(λ/2) ∫_0^s ‖F[z(t)]‖² dt ≤ G(z0, z∗) − G(z(s), z∗) ≤ G(z0, z∗).

Thus,

∫_0^{+∞} ‖F[z(t)]‖² dt ≤ (2/λ) G(z0, z∗) < +∞.

This implies that lim_{t→+∞} F[z(t)] = 0. Therefore, F(ẑ) = 0 from equation 3.10, so ẑ ∈ K∗ from lemma 2. By replacing z∗ with ẑ in G(z, z∗), we can prove that G(z, ẑ) ≥ ‖z − ẑ‖²/2 for all z ∈ R^{m+n} and that G(z(t), ẑ) is monotonically nonincreasing on [0, +∞). From the continuity of G(z, ẑ), it follows that for every ε > 0 there exists δ > 0 such that

G(z, ẑ) < (1/2)ε²,  if ‖z − ẑ‖ ≤ δ.   (3.11)

From equations 3.7, 3.10, and 3.11, there exists a natural number N such that

‖z(t) − ẑ‖² ≤ 2G(z(t), ẑ) ≤ 2G(z(t_N), ẑ) < ε²,  when t ≥ t_N.
That is, lim_{t→+∞} z(t) = ẑ. This indicates that the solution z(t) of neural network 2.4 converges to a point in K∗; that is, the solution z(t) of neural network 2.4 converges to a saddle point of f(x, y).

Now we show that the convergence time is finite. Without loss of generality, we let lim_{t→+∞} z(t) = z∗ ∈ K∗. If z0 ∉ K∗, then F(z0) ≠ 0 from lemma 2. Thus, there exist τ > 0 and δ > 0 such that ‖F[z(t)]‖² ≥ δ for all t ∈ [0, τ). It follows from equation 3.9 that

G[z(t), z∗] ≤ G(z0, z∗) − (λ/2) ∫_0^t ‖F[z(s)]‖² ds ≤ G(z0, z∗) − λδτ/2,  ∀t ≥ τ.

Therefore,

‖z(t) − z∗‖² ≤ µ₁‖z0 − z∗‖² − λδτ,  ∀t ≥ τ,

from equation 3.7. Let λ = µ₁‖z0 − z∗‖²/(δτ) in the above inequality; then ‖z(t) − z∗‖ ≡ 0 for all t ≥ τ. This implies that z(t) ≡ z∗ for all t ≥ τ. In particular, if problem 1.1 has a unique solution z∗, then K∗ = {z∗} since K∗ ≠ ∅, and for each z0 ∈ R^{m+n}, the trajectory z(t) with z(0) = z0 will approach z∗. So neural network 2.4 is globally asymptotically stable at z∗.

Remark 3. Compared with the existing finite-time convergence result for model 1.4 in Xia and Feng (2004) (see theorem 3 in Xia & Feng, 2004), theorem 3 for neural network 2.4 does not require the additional condition that the initial point z0 satisfy (z0 − z∗)^T M(z0 − z∗) = 0 or [e(z0)]^T M e(z0) = 0, where z∗ ∈ K∗ and e(z) is defined in equation 2.8. Unlike the existing finite-time convergence result for model 1.4 in Xia (2004) (see remark 1 in Xia, 2004), theorem 3 holds without requiring the positive definiteness of the matrix H or S (see examples 3–5 in section 4).

When the matrices H and S are positive definite, we have the following exponential stability result for neural network 2.4.

Theorem 4. If the matrices H and S are positive definite, then neural network 2.4 is globally exponentially stable at the unique saddle point of f(x, y).

Proof. From the hypothesis of this theorem and K∗ ≠ ∅, there exists a unique saddle point z∗ ∈ K∗ of f(x, y). From theorem 2, let z(t) be the unique solution of system 2.4 with z(0) = z0 ∈ R^{m+n} for all t ≥ 0. Since H and S are positive definite, we have λ_min(H) > 0 and λ_min(S) > 0, and thus µ₂ > 0. For the function G(z, z∗) defined in equation 3.6, it follows from lemma 4iii that

(d/dt) G(z(t), z∗) ≤ −2λµ₂ G(z(t), z∗),  ∀t ≥ 0.
1834
X.-B. Gao and L.-Z. Liao
Thus,

G(z(t), z∗) ≤ G(z0, z∗) e^(−2λµ2 t),   ∀t ≥ 0.

This and lemma 4i imply that

∥z(t) − z∗∥ ≤ √(2G(z0, z∗)) e^(−λµ2 t) ≤ √µ1 ∥z0 − z∗∥ e^(−λµ2 t),   ∀t ≥ 0.
Remark 4. Obviously, model 2.4 can be applied to solve a class of linear variational inequality problems LVI(M, q, C) defined in equation 1.3 with M, q, and C being defined in equation 2.2 and H and S being symmetric (see examples 2 and 5 in section 4).

Remark 5. The results obtained in this article hold for any closed convex subset of Rn. In particular, the common cases for U or V are (1) {x ∈ Rn | ∥x∥ ≤ c} (ball or l2 norm constraint) and (2) {x ∈ Rn | |x1| + · · · + |xn| ≤ d} (l1 norm constraint).

4 Illustrative Examples

In this section, five examples are provided to illustrate the theoretical results of section 3 and the simulation behavior of the dynamical system 2.4. The simulations are conducted in Matlab with the ODE solver ODE45, a nonstiff, medium-order method. Example 1 shows the effectiveness of the proposed neural network 2.4 for problem 1.1 with an infinite number of saddle points.

4.1 Example 1. Consider problem 1.1 with U = V = {x ∈ R2 | x ≥ 0}, h = s = (0, 0)T,

H = [4, −2; −2, 1],   Q = [−1, 1; 1, −2],   and   S = [1, −2; −2, 4].
This problem has an infinite number of saddle points: x∗ = (0, 0)T and y∗ = (2r, r)T (r ≥ 0). From the analysis in section 3, we use neural network 2.4 to solve this problem; all simulation results show that neural network 2.4 always converges to one of its equilibrium points. For example, let λ = 100. Figure 2 shows the convergence behavior of the error ∥z(t) − z∗∥ based on neural network 2.4 with 20 random initial points. Example 2 shows that the proposed neural network 2.4 can be applied to solve a class of linear variational inequality problems.
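The stated family of saddle points can be verified against the projected stationarity conditions x∗ = PU[x∗ − (Hx∗ + h − Qy∗)] and y∗ = PV[y∗ − (Sy∗ + QTx∗ + s)]; this is our reading of the optimality conditions for problem 1.1, so the sketch below is illustrative rather than the paper's code:

```python
import numpy as np

# Example 1 data: U = V = {x in R^2 | x >= 0}, h = s = (0, 0)^T
H = np.array([[4.0, -2.0], [-2.0, 1.0]])
Q = np.array([[-1.0, 1.0], [1.0, -2.0]])
S = np.array([[1.0, -2.0], [-2.0, 4.0]])
h = np.zeros(2)
s = np.zeros(2)

def proj(v):
    """Projection onto the nonnegative orthant."""
    return np.maximum(v, 0.0)

def saddle_residual(x, y):
    """Residual of the (assumed) projected stationarity conditions."""
    rx = x - proj(x - (H @ x + h - Q @ y))
    ry = y - proj(y - (S @ y + Q.T @ x + s))
    return np.linalg.norm(rx) + np.linalg.norm(ry)

# Every (x*, y*) = ((0, 0), (2r, r)) with r >= 0 should be a saddle point.
for r in [0.0, 0.5, 1.0, 3.7]:
    print(r, saddle_residual(np.zeros(2), np.array([2 * r, r])))  # residual 0.0
```

The residual vanishes identically along the ray y∗ = (2r, r), which is exactly the continuum of saddle points reported above.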
A Novel Neural Network
1835
Figure 2: Convergence behavior of the error ∥z(t) − z∗∥ based on neural network 2.4 with 20 random initial points in example 1.
4.2 Example 2. Consider the linear variational inequality problem LVI(M, q, C) defined in equation 1.3 with C = {z ∈ R4 | −8 ≤ zi ≤ 9, i = 1, 2, 3, 4},

M = [0.1, 0.1, 0.5, −0.5; 0.1, 0.1, −0.5, 0.5; −0.5, 0.5, 0.2, 0.1; 0.5, −0.5, 0.1, 0.05],   and   q = (1, −1, 1, −1)T.

This problem has a unique solution z∗ = (1, −1, −2/3, 4/3)T. Let U = V = {x ∈ R2 | −8 ≤ xi ≤ 9, i = 1, 2}, h = s = (1, −1)T,

H = [0.1, 0.1; 0.1, 0.1],   Q = [−0.5, 0.5; 0.5, −0.5],   and   S = [0.2, 0.1; 0.1, 0.05].
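Before running the network, z∗ can be checked directly through the projection residual z − PC[z − (Mz + q)], the standard fixed-point characterization of an LVI solution (a quick numerical check, not part of the paper):

```python
import numpy as np

M = np.array([[ 0.1,  0.1,  0.5, -0.5 ],
              [ 0.1,  0.1, -0.5,  0.5 ],
              [-0.5,  0.5,  0.2,  0.1 ],
              [ 0.5, -0.5,  0.1,  0.05]])
q = np.array([1.0, -1.0, 1.0, -1.0])
z_star = np.array([1.0, -1.0, -2.0 / 3.0, 4.0 / 3.0])

def proj_C(z, lo=-8.0, hi=9.0):
    """Projection onto the box C = {z : -8 <= z_i <= 9}."""
    return np.clip(z, lo, hi)

residual = z_star - proj_C(z_star - (M @ z_star + q))
print(np.linalg.norm(residual))  # ~0, so z* solves LVI(M, q, C)
```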
From remark 4, model 2.4 can be applied to solve this problem. All our simulation results show that neural network 2.4 is asymptotically stable at z∗. For example, let λ = 100. Figure 3 displays the convergence behavior
Figure 3: Convergence behavior of the error ∥z(t) − z∗∥ based on neural network 2.4 with 20 random initial points in example 2.
of the error ∥z(t) − z∗∥ based on neural network 2.4 with 20 random initial points. It should be mentioned that neural network 1.4 cannot be used to solve this problem. In fact, Figure 4 shows that model 1.4 with initial point (2, 2, 2, 2)T ∈ R4 and λ = 100 is not stable: the error ∥z(t) − z∗∥ approaches 0.0178632. Example 3 shows that the proposed neural network 2.4 can be applied to solve large-scale problems.

4.3 Example 3. Consider problem 1.1 with U = {x ∈ R2n | −1 ≤ x ≤ 1}, V = {y ∈ Rn | −1 ≤ y ≤ 1}, h = (0, 0, . . . , 0)T ∈ R2n, s = −(1, 1, . . . , 1)T ∈ Rn, S being the n × n zero matrix, H being the 2n × 2n tridiagonal matrix with diagonal (1, 2, . . . , 2, 1) and −1 on the sub- and superdiagonals, and Q being the 2n × n matrix formed by stacking two n × n identity matrices, Q = [In; In].
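Reading H as the 2n × 2n tridiagonal matrix with diagonal (1, 2, . . . , 2, 1) and −1 on the off-diagonals, and Q as two stacked n × n identity blocks (our reconstruction of the display), the stated saddle point checks out for any n:

```python
import numpy as np

def build_example3(n):
    """Assemble H (2n x 2n tridiagonal) and Q (2n x n, stacked identities)."""
    m = 2 * n
    H = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    H[0, 0] = H[-1, -1] = 1.0
    Q = np.vstack([np.eye(n), np.eye(n)])
    h = np.zeros(m)
    s = -np.ones(n)
    return H, Q, h, s

n = 5
H, Q, h, s = build_example3(n)
x_star = 0.5 * np.ones(2 * n)   # stated saddle point, interior of U
y_star = np.zeros(n)            # interior of V

# At an interior point both (assumed) gradients must vanish; S is zero here.
print(np.linalg.norm(H @ x_star + h - Q @ y_star))  # 0.0
print(np.linalg.norm(Q.T @ x_star + s))             # 0.0
```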
Figure 4: Transient behavior of model 1.4 in example 2.
This problem has a unique saddle point x∗ = (0.5, 0.5, . . . , 0.5)T ∈ R2n and y∗ = (0, 0, . . . , 0)T ∈ Rn. We use neural network 2.4 to solve this problem; all simulation results show that this neural network is asymptotically stable at z∗. For example, let λ = 100. Figures 5a and 5b show the trajectories of the first 20 components of x and y of neural network 2.4 with 6 random initial points z0 for n = 1500 and 2000, respectively. Next, we compare the proposed model 2.4 with other methods. For simplicity, we let N = Im+n in method 2.10. Figure 6 shows that model 1.4 with initial point (0, 0, . . . , 0)T ∈ R18 and λ = 10 is not stable: the error ∥z(t) − z∗∥ approaches 1.0327. Tables 1 and 2 report the numerical results with two different initial points obtained by methods 2.4, 2.7, 2.9, 2.10, and 2.11, respectively, where "Iter." denotes the number of iterations; t_f denotes the time at which the stopping criterion ∥dz/dt∥/λ < 10⁻⁵ is met for models 2.4, 2.7, and 2.9; and the stopping rule for methods 2.10 and 2.11 is ∥z_{k+1} − z_k∥ < 10⁻⁵. From Tables 1 and 2, we can see that the proposed method not only provides a better solution but also converges faster than methods 2.7, 2.9, 2.10, and 2.11, except for method 2.10 with θ = 1. Example 4 illustrates that the proposed neural network 2.4 converges faster than other methods.
Figure 5: Transient behavior of model 2.4 in example 3. (a) n = 1500. (b) n = 2000.
Figure 6: Transient behavior of model 1.4 in example 3.

Table 1: Numerical Results of Example 3 with Initial Point −(1, 1, . . . , 1)T ∈ R1800.

Method  Parameters                        Iter.  CPU (sec.)  ∥z(t_f) − z∗∥
2.4     λ = 100, t_f = 0.0931              81     3.110      5.34 × 10⁻⁶
2.7     λ = 100, t_f = 0.1803             121     9.641      4.08 × 10⁻⁶
2.9     λ = 100, θ = 1.8, t_f = 0.2631    117     8.797      6.80 × 10⁻⁶
2.9     λ = 100, θ = 1, t_f = 0.2516       89     6.672      1.22 × 10⁻⁵
2.10    θ = 1.8                           106     3.578      6.35 × 10⁻⁶
2.10    θ = 1 (best θ value)               28     0.985      1.09 × 10⁻⁵
2.10    θ = 0.2                           105     3.563      5.81 × 10⁻⁵
2.11    θ = 0.2475, ν = 0.99              246     8.291      2.43 × 10⁻⁵
2.11    θ = 0.15, ν = 0.6                 604    20.172      4.46 × 10⁻⁵
4.4 Example 4. Consider problem 1.1 with U = {x ∈ R4 | xi ≥ 0, i = 1, . . . , 4}, V = {y ∈ R4 | yi ≥ 0, i = 1, 2}, H and S being 4 × 4 zero matrices, h = −(6, 6, 5, 5)T, s = −(0, 0, 10, 5)T, and

Q = [1, −2, 1, 0; −1, 30, 0, 1; 0, 0, 1, 0; 0, 0, 0, 1].
Table 2: Numerical Results of Example 3 with Initial Point (2, 2, . . . , 2)T ∈ R1800.

Method  Parameters                        Iter.  CPU (sec.)  ∥z(t_f) − z∗∥
2.4     λ = 100, t_f = 0.0949              73     3.062      5.33 × 10⁻⁶
2.7     λ = 100, t_f = 0.1559             113     8.203      4.08 × 10⁻⁶
2.9     λ = 100, θ = 1.8, t_f = 0.1568     85     6.968      6.80 × 10⁻⁶
2.9     λ = 100, θ = 1, t_f = 0.2599       89     6.453      1.22 × 10⁻⁵
2.10    θ = 1.8                           109     3.688      6.12 × 10⁻⁶
2.10    θ = 1 (best θ value)               30     1.047      7.98 × 10⁻⁶
2.10    θ = 0.2                           113     3.828      5.76 × 10⁻⁵
2.11    θ = 0.2475, ν = 0.99              240     8.032      2.42 × 10⁻⁵
2.11    θ = 0.15, ν = 0.6                 591    19.766      4.51 × 10⁻⁵
Then this problem has a unique solution x∗ = (10, 5, 0, 0)T and y∗ = (0, 0, −6, −6)T. From the analysis in section 3, neural network 2.4 can be applied to solve this problem. In this case, it becomes

dz/dt = d/dt (x; y) = −λ ( 2{x − PU[x − h + Q PV(y − QTx − s)]} ; y − PV(y − QTx − s) ),   (4.1)
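A forward-Euler sketch of the dynamics in equation 4.1, with the display read as dz/dt = −λ(2{x − PU[x − h + Q PV(y − QTx − s)]}; y − PV(y − QTx − s)); the step size, horizon, and initial point are our choices:

```python
import numpy as np

# Example 4 data
h = -np.array([6.0, 6.0, 5.0, 5.0])
s = -np.array([0.0, 0.0, 10.0, 5.0])
Q = np.array([[ 1.0, -2.0, 1.0, 0.0],
              [-1.0, 30.0, 0.0, 1.0],
              [ 0.0,  0.0, 1.0, 0.0],
              [ 0.0,  0.0, 0.0, 1.0]])

def P_U(x):                       # x_i >= 0 for i = 1, ..., 4
    return np.maximum(x, 0.0)

def P_V(y):                       # y_1, y_2 >= 0; y_3, y_4 unconstrained
    out = y.copy()
    out[:2] = np.maximum(out[:2], 0.0)
    return out

lam, dt = 1000.0, 2e-7            # small step: Q contains a large entry (30)
rng = np.random.default_rng(0)
x, y = rng.uniform(-2, 2, 4), rng.uniform(-2, 2, 4)
for _ in range(200_000):          # integrate to t = 0.04
    y_hat = P_V(y - (Q.T @ x + s))
    dx = 2.0 * (x - P_U(x - h + Q @ y_hat))
    dy = y - y_hat
    x, y = x - lam * dt * dx, y - lam * dt * dy

print(np.round(x, 3), np.round(y, 3))  # approaches x* = (10, 5, 0, 0), y* = (0, 0, -6, -6)
```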
where x ∈ R4 and y ∈ R4. All simulation results show that neural network 4.1 is always asymptotically stable at (x∗, y∗). For example, let λ = 1000. Figure 7 shows the convergence behavior of the error ∥z(t) − z∗∥ based on equation 4.1 with 20 random initial points. Next, we compare the proposed model 2.4 with other methods. For simplicity, we let N = Im+n in method 2.10. As a matter of fact, Figure 8 shows that model 1.4 with initial point (2, 2, . . . , 2)T ∈ R8 and λ = 100 is not stable: the error ∥z(t) − z∗∥ approaches 1.38113. Tables 3 and 4 report the numerical results obtained by methods 2.4, 2.7, 2.9, 2.10, and 2.11 with two different initial points, respectively, where Iter., t_f, and the stopping criteria are the same as in example 3. From Tables 3 and 4, we can see that the proposed method not only provides a better solution but also converges faster than the other methods. Example 5 shows that the conditions of theorem 3 in Xia and Feng (2004) are not enough to ensure the finite-time convergence of model 1.4.

4.5 Example 5. Consider the linear variational inequality problem LVI(M, q, C) defined in equation 1.3 with C = R3,

M = [1, −1, −1; −1, 1, −1; 1, 1, 0],   and   q = (0, 0, 1)T.
Figure 7: Convergence behavior of the error ∥z(t) − z∗∥ based on equation 4.1 with 20 random initial points in example 4.

Table 3: Numerical Results of Example 4 with Initial Point −(2, 2, . . . , 2)T ∈ R8.

Method  Parameters                         Iter.   CPU (sec.)  ∥z(t_f) − z∗∥
2.4     λ = 1000, t_f = 0.0187               177    0.062      3.45 × 10⁻⁶
2.7     λ = 1000, t_f = 0.0490            53,585   13.829      5.98 × 10⁻⁶
2.9     λ = 1000, θ = 1.8, t_f = 0.4575      745    0.312      1.60 × 10⁻⁵
2.9     λ = 1000, θ = 1, t_f = 0.9082       1281    0.516      4.03 × 10⁻⁵
2.10    θ = 1.8                           19,229    1.297      1.25 × 10⁻⁴
2.10    θ = 0.9 (best θ value)              2290    0.171      4.76 × 10⁻⁴
2.10    θ = 0.2                             4140    0.266      1.69 × 10⁻³
2.11    θ = 0.0329, ν = 0.99              20,661    1.656      3.04 × 10⁻⁴
2.11    θ = 0.0199, ν = 0.6               45,813    3.672      5.01 × 10⁻⁴
This problem has a unique solution z∗ = (−0.5, −0.5, 0)T. From remark 4, this problem can be solved by model 2.4. In this case, model 2.4 becomes

dz/dt = d/dt (x; y) = −λ ( 2[Hx − Q(y − QTx − s)] ; QTx + s ),   (4.2)
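Since C = R3, the projections disappear and equation 4.2 is a linear ODE; a few thousand Euler steps reproduce the convergence shown in Figure 9 (the step size and initial point are arbitrary choices):

```python
import numpy as np

H = np.array([[1.0, -1.0], [-1.0, 1.0]])
Q = np.array([[1.0], [1.0]])
s = np.array([1.0])

lam, dt = 100.0, 1e-5
z = np.array([2.0, -1.0, 3.0])          # arbitrary initial point
for _ in range(10_000):                 # integrate to t = 0.1
    x, y = z[:2], z[2:]
    dx = 2.0 * (H @ x - Q @ (y - (Q.T @ x + s)))
    dy = Q.T @ x + s
    z = z - lam * dt * np.concatenate([dx, dy])

print(np.round(z, 4))  # -> (-0.5, -0.5, 0)
```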
Figure 8: Transient behavior of model 1.4 in example 4.

Table 4: Numerical Results of Example 4 with Initial Point (2, 2, . . . , 2)T ∈ R8.

Method  Parameters                         Iter.   CPU (sec.)  ∥z(t_f) − z∗∥
2.4     λ = 1000, t_f = 0.0190               141    0.047      5.99 × 10⁻⁶
2.7     λ = 1000, t_f = 0.0539            58,965   13.563      7.61 × 10⁻⁶
2.9     λ = 1000, θ = 1.8, t_f = 0.5038      785    0.312      1.22 × 10⁻⁵
2.9     λ = 1000, θ = 1, t_f = 0.8924       1245    0.500      4.41 × 10⁻⁵
2.10    θ = 1.8                           19,026    1.234      1.26 × 10⁻⁴
2.10    θ = 0.9 (best θ value)              2319    0.156      3.85 × 10⁻⁴
2.10    θ = 0.2                             5292    0.360      1.39 × 10⁻³
2.11    θ = 0.0329, ν = 0.99              20,746    1.671      3.04 × 10⁻⁴
2.11    θ = 0.0199, ν = 0.6               46,007    3.719      5.02 × 10⁻⁴
where x ∈ R2, y ∈ R1, s = 1, H = [1, −1; −1, 1], and Q = (1, 1)T.
All simulation results show that neural network 4.2 is always asymptotically stable at z∗ . For example, let λ = 100. Figure 9 displays the convergence
Figure 9: Convergence behavior of the error ∥z(t) − z∗∥ based on equation 4.2 with 20 random initial points in example 5.
behavior of the error ∥z(t) − z∗∥ based on equation 4.2 with 20 random initial points. It should be mentioned that model 1.4 cannot be used to solve this problem. When applied to this problem, model 1.4 becomes

dz1/dt = λ(z2 − z1 + z3),
dz2/dt = λ(z1 − z2 + z3),   (4.3)
dz3/dt = −λ(z1 + z2 + 1).

It is easy to verify that the solution of equation 4.3 is

z1(t) = [√2 z3^0 sin(ω1 t) − √2 ω2 cos(ω1 t) − 1 + (z1^0 − z2^0) e^(−2λt)]/2,
z2(t) = [√2 z3^0 sin(ω1 t) − √2 ω2 cos(ω1 t) − 1 − (z1^0 − z2^0) e^(−2λt)]/2,
z3(t) = z3^0 cos(ω1 t) + ω2 sin(ω1 t),
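The closed-form expressions can be sanity-checked numerically: a central difference of z(t) should reproduce the right-hand side of the system, while the trajectory itself keeps oscillating at amplitude O(1) instead of converging (λ, z0, and the sample times below are arbitrary):

```python
import numpy as np

lam = 100.0
z0 = np.array([-3.0, 0.0, 0.0])        # the initial point used in the text
w1 = np.sqrt(2.0) * lam
w2 = -np.sqrt(2.0) * (z0[0] + z0[1] + 1.0) / 2.0

def z(t):
    """Closed-form solution of the model-1.4 system in example 5."""
    osc = np.sqrt(2) * z0[2] * np.sin(w1 * t) - np.sqrt(2) * w2 * np.cos(w1 * t)
    dec = (z0[0] - z0[1]) * np.exp(-2.0 * lam * t)
    return np.array([(osc - 1 + dec) / 2,
                     (osc - 1 - dec) / 2,
                     z0[2] * np.cos(w1 * t) + w2 * np.sin(w1 * t)])

def rhs(v):
    """Right-hand side of the three coupled ODEs."""
    return lam * np.array([v[1] - v[0] + v[2],
                           v[0] - v[1] + v[2],
                           -(v[0] + v[1] + 1.0)])

h = 1e-7
for t in [0.0, 0.01, 0.1, 1.0]:
    # Central difference of the closed form matches the vector field.
    assert np.allclose((z(t + h) - z(t - h)) / (2 * h), rhs(z(t)), atol=1e-3)
print(np.linalg.norm(z(1.0) - np.array([-0.5, -0.5, 0.0])))  # stays O(1)
```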
where ω1 = √2 λ and ω2 = −√2 (z1^0 + z2^0 + 1)/2. Obviously, the solution of equation 4.3 is divergent and has no finite-time convergence when |ω2| + |z3^0| > 0. However, for any z0 ∈ R3 with |ω2| + |z3^0| > 0 and (z0 − z∗)T M(z0 − z∗) + [e(z0)]T Me(z0) > 0 (e(z) is defined in equation 2.8), for example, z0 = (−3, 0, 0)T, the conditions of theorem 3 in Xia and Feng (2004) are satisfied. Thus, the conditions of theorem 3 in Xia and Feng (2004) are not enough to ensure the finite-time convergence of model 1.4 when the model's trajectory z(t) with z(0) = z0 ∈ U × V does not converge to one of its equilibrium points z∗.

From the above examples and their simulation results, we have the following remark.

Remark 6. (i) For model 1.4, the simulation results show that it is stable for example 1; yet unlike model 2.4, its stability and convergence might not be guaranteed when the initial point z0 ∈ U × V (see Friesz et al., 1994; Gao, 2003, 2004; Gao et al., 2004; Xia, 2004; Xia & Feng, 2004; Xia & Wang, 2001), even when the matrices H and S are positive semidefinite (see examples 2–5). (ii) For method 2.10, computational results for example 3 show that method 2.10 is better than the others when θ = 1; yet unlike model 2.4, it is not suitable for parallel implementation because of the varying parameter γ(z) mentioned in model 2.9, and its performance depends on the choices of the parameters N and θ. (iii) Since H is positive semidefinite and S = 0 in examples 3 to 5, the existing finite-time convergence result for model 1.4 in Xia (2004) cannot be applied to these examples (see remark 1 in Xia, 2004).

5 Conclusion

In this letter, we have proposed a new neural network for solving a class of convex quadratic minimax problems by means of its inherent properties. We have shown that the new model is stable in the sense of Lyapunov and converges to an exact saddle point in finite time when matrices H and S are positive semidefinite.
Furthermore, the global exponential stability of the proposed neural network is also obtained under certain conditions. Compared with the existing neural networks and two typical numerical methods, the proposed neural network has finite-time convergence, a simpler structure, and lower complexity. Thus, the proposed neural network is more suitable for hardware implementation. Since the new network can be applied directly to solve a class of linear variational inequality problems and a broad class of optimization problems, it has great application potential. Illustrative examples confirm the theoretical results and demonstrate that our new model is reliable and attractive.
Acknowledgments

We are very grateful to the two anonymous reviewers for their comments and constructive suggestions on earlier versions of this article. The research was supported in part by grants from Hong Kong Baptist University, the Research Grants Council of Hong Kong, and NSFC grant no. 10471083 of China.

References

Bazaraa, M. S., Sherali, H. D., & Shetty, C. M. (1993). Nonlinear programming: Theory and algorithms (2nd ed.). New York: Wiley.
Bouzerdoum, A., & Pattison, T. R. (1993). Neural network for quadratic optimization with bound constraints. IEEE Trans. Neural Networks, 4, 293–304.
Friesz, T. L., Bernstein, D. H., Mehta, N. J., Tobin, R. L., & Ganjlizadeh, S. (1994). Day-to-day dynamic network disequilibria and idealized traveler information systems. Operations Research, 42, 1120–1136.
Fukushima, M. (1992). Equivalent differentiable optimization problems and descent methods for asymmetric variational inequality problems. Mathematical Programming, 53, 99–110.
Gao, X. B. (2003). Exponential stability of globally projected dynamic systems. IEEE Trans. Neural Networks, 14, 426–431.
Gao, X. B. (2004). A novel neural network for nonlinear convex programming. IEEE Trans. Neural Networks, 15, 613–621.
Gao, X. B., & Liao, L.-Z. (2003). A neural network for monotone variational inequalities with linear constraints. Physics Letters A, 307, 118–128.
Gao, X. B., Liao, L.-Z., & Xue, W. M. (2004). A neural network for a class of convex quadratic minimax problems with constraints. IEEE Trans. Neural Networks, 15, 622–628.
Han, Q. M., Liao, L.-Z., Qi, H. D., & Qi, L. Q. (2001). Stability analysis of gradient-based neural networks for optimization problems. J. Global Optim., 19, 363–381.
He, B. S. (1996). Solution and application of a class of general linear variational inequalities. Science in China, Series A, 39, 395–404.
He, B. S. (1999). Inexact implicit methods for monotone general variational inequalities. Mathematical Programming, 86, 199–217.
He, B. S., & Yang, H. (2000). A neural-network model for monotone linear asymmetric variational inequalities. IEEE Trans. Neural Networks, 11, 3–16.
Kinderlehrer, D., & Stampacchia, G. (1980). An introduction to variational inequalities and their applications. New York: Academic Press.
Ortega, J. M., & Rheinboldt, W. C. (1970). Iterative solution of nonlinear equations in several variables. New York: Academic Press.
Rockafellar, R. T. (1987). Linear-quadratic programming and optimal control. SIAM J. Control Optim., 25, 781–814.
Smith, T. E., Friesz, T. L., Bernstein, D. H., & Suo, Z. G. (1997). A comparative analysis of two minimum-norm projective dynamics and their relationship to variational
inequalities. In M. C. Ferris & J. S. Pang (Eds.), Complementarity and variational problems: State of the art (pp. 405–424). Philadelphia: SIAM.
Solodov, M. V., & Tseng, P. (1996). Modified projection-type methods for monotone variational inequalities. SIAM J. Control Optim., 34, 1814–1830.
Tseng, P. (2000). A modified forward-backward splitting method for maximal monotone mappings. SIAM J. Control Optim., 38, 431–446.
Xia, Y. S. (2004). An extended projection neural network for constrained optimization. Neural Computation, 16, 863–883.
Xia, Y. S., & Feng, G. (2004). On convergence rate of projection neural networks. IEEE Trans. Automatic Control, 49, 91–96.
Xia, Y. S., Feng, G., & Wang, J. (2004). A recurrent neural network with exponential convergence for solving convex quadratic program and related linear piecewise equations. Neural Networks, 17, 1003–1015.
Xia, Y. S., & Wang, J. (1998). A general methodology for designing globally convergent optimization neural networks. IEEE Trans. Neural Networks, 9, 1331–1343.
Xia, Y. S., & Wang, J. (2000). On the stability of globally projected dynamical systems. J. Optim. Theory Appl., 106, 129–160.
Xia, Y. S., & Wang, J. (2001). Global asymptotic and exponential stability of a dynamic neural system with asymmetric connection weights. IEEE Trans. Automatic Control, 46, 635–638.
Received August 9, 2004; accepted June 14, 2005.
LETTER
Communicated by Emilio Salinas
Dynamic Gain Changes During Attentional Modulation Arun P. Sripati [email protected] Department of Electrical and Computer Engineering, Zanvyl-Krieger Mind Brain Institute, Johns Hopkins University, Baltimore, MD 21218, U.S.A.
Kenneth O. Johnson [email protected] Department of Neuroscience, Zanvyl-Krieger Mind Brain Institute, Johns Hopkins University, Baltimore, MD 21218, U.S.A.
Attention causes a multiplicative effect on firing rates of cortical neurons without affecting their selectivity (Motter, 1993; McAdams & Maunsell, 1999a) or the relationship between the spike count mean and variance (McAdams & Maunsell, 1999b). We analyzed attentional modulation of the firing rates of 144 neurons in the secondary somatosensory cortex (SII) of two monkeys trained to switch their attention between a tactile pattern recognition task and a visual task. We found that neurons in SII cortex also undergo a predominantly multiplicative modulation in firing rates without affecting the ratio of variance to mean firing rate (i.e., the Fano factor). Furthermore, both additive and multiplicative components of attentional modulation varied dynamically during the stimulus presentation. We then used a standard conductance-based integrate-and-fire model neuron to ascertain which mechanisms might account for a multiplicative increase in firing rate without affecting the Fano factor. Six mechanisms were identified as biophysically plausible ways that attention could modify the firing rate: spike threshold, firing rate adaptation, excitatory input synchrony, synchrony between all inputs, membrane leak resistance, and reset potential. Of these, only a change in spike threshold or in firing rate adaptation affected model firing rates in a manner compatible with the observed neural data. The results indicate that only a limited number of biophysical mechanisms can account for observed attentional modulation.

1 Introduction

Recent experiments have shown that attention can modify neuronal responses to relevant stimuli (e.g., Moran & Desimone, 1985; Motter, 1993; Hsiao, O'Shaughnessy, & Johnson, 1993). The analysis presented here was

Neural Computation 18, 1847–1867 (2006)
C 2006 Massachusetts Institute of Technology
1848
A. Sripati and K. Johnson
motivated by the effect of attention on neuronal firing in visual area V4 (McAdams & Maunsell, 1999a, 1999b). When attention was directed into receptive fields of these neurons, their firing rates were scaled multiplicatively (almost doubled) without affecting orientation selectivity, and the ratio of spike count variance to mean (i.e., the Fano factor) was unaffected. Our aim was to examine whether similar attentional effects on firing rate are seen in other sensory areas besides visual cortex. A finding that the effects are similar will support the idea that the mechanisms of attention are common throughout cortex. Our second motivation was to understand the unusual nature of the change in gain: multiplying a random variable (in this case, spike count) by a factor g increases its mean and standard deviation by g, but its variance increases by g 2 . In contrast, data from V4 cortex (McAdams & Maunsell, 1999b) indicate that the Fano factor is unaffected, which suggests that the effect of attention is not a simple change in gain. We reasoned that the biophysical mechanisms might be inferred by investigating which of them could produce a gain change without affecting the Fano factor. We analyzed the responses of 144 neurons in the somatosensory cortex (SII) of two monkeys trained to switch their attention between a tactile pattern recognition task and a visual dimming detection task (Steinmetz et al., 2000; Hsiao et al., 1993). Because the tactile stimuli (letters) used in this study did not fall along a continuum, we were unable to compute the change in gain experienced by a single neuron. Instead, we measured the extent of attentional modulation over the entire population of neurons and separated the effect of attention into multiplicative and additive components. In addition, the temporal modulation of the multiplicative and additive components during the trial was examined by repeating this procedure over short time intervals across the trial. 
We then examined six mechanisms that affect the mean firing rate and variability in a standard integrate-and-fire cortical neuron model (Troyer & Miller, 1997; Shadlen & Newsome, 1998; Salinas & Sejnowski, 2000), with the aim of finding which mechanisms could account for the observed data. Candidate mechanisms were chosen based on their biophysical feasibility and their ability to be modulated rapidly over the timescale required for attention (hundreds of milliseconds). To simplify our analysis, we assumed that the model neuron receives inputs with fixed firing rates and that attentional modulation acts either by modifying an intrinsic biophysical parameter (e.g., spike threshold) or changing the input synchrony. The results indicate that attentional modulation alters either the spike threshold or firing rate adaptation.

2 Methods

2.1 Neurophysiology and Behavioral Tasks. Two macaque monkeys, M1 and M2, were trained to perform both tactile and visual discrimination
Dynamic Gain Changes During Attentional Modulation
1849
tasks (Hsiao et al., 1993; Steinmetz et al., 2000). In the visual task, three white squares appeared on a computer screen, and after a random interval, one of the two outer squares dimmed slightly. To obtain a reward, the monkey was required to detect the dimming and indicate which of the squares dimmed by turning a switch. In the tactile task, the monkey discriminated raised capital letters (6.0 mm high) embossed on a rotating drum (Johnson & Phillips, 1988) that were scanned from proximal to distal across the center of a fingerpad (at 15 mm/sec). To obtain a reward, the animal responded when the letter scanning across the fingerpad matched a target letter displayed on a computer screen. For monkey M1, the target letter remained constant throughout the collection of data from a single neuron. For monkey M2, the task was made more difficult by changing the target letter randomly after each response. Single-unit recordings were obtained from the SII cortex of both monkeys using standard methods. These data have been analyzed for attentional modulation of the firing rate (Hsiao et al., 1993) and synchrony (Steinmetz et al., 2000).

2.2 Statistical Methods. To estimate the 95% confidence intervals shown in Figures 3 and 4, we used a bootstrap method. A random subset of neurons was drawn with replacement, and the slope and intercept were computed as a function of time for that subset. Nonparametric estimates of the variance in the computed slope and intercepts were obtained by repeating the procedure several times. To estimate the significance of the trends in the gain modulation, gaussian random vectors were generated using the means and variances estimated from the bootstrap data. As a result, these random vectors have no temporal trends. We then performed a multivariate analysis-of-variance (manova1, Matlab 7.0) between the bootstrap data and the randomly generated data to determine the significance of the observed trends in gain.
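The resampling scheme can be sketched as follows; the synthetic attended/unattended rates (with an assumed gain of 1.7), the number of bootstrap draws, and the seed are all stand-ins for the recorded data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic population: attended rate = 1.7 * unattended + noise (assumed data).
unattended = rng.uniform(5, 40, size=60)
attended = 1.7 * unattended + rng.normal(0.0, 2.0, size=60)

def fit(u, a):
    """Least-squares slope (multiplicative) and intercept (additive) components."""
    slope, intercept = np.polyfit(u, a, 1)
    return slope, intercept

n_boot = 2000
stats = np.empty((n_boot, 2))
for b in range(n_boot):
    idx = rng.integers(0, len(unattended), size=len(unattended))  # resample neurons
    stats[b] = fit(unattended[idx], attended[idx])

lo, hi = np.percentile(stats[:, 0], [2.5, 97.5])  # 95% CI for the slope
print(fit(unattended, attended)[0], (lo, hi))
```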
2.3 Conductance-Based Integrate-and-Fire Model. We used a conductance-based integrate-and-fire model as a simplified model of neuronal firing. This model reproduces the Poisson-like variability exhibited by cortical neurons (Salinas & Sejnowski, 2000; Shadlen & Newsome, 1998; Troyer & Miller, 1997). Parameters of this model were examined for their ability to account for the experimental data. The model is driven by excitatory and inhibitory inputs that arrive randomly. Each input spike triggers an exponentially decaying change in conductance that leads to the characteristic postsynaptic current—excitatory postsynaptic current (EPSC) or inhibitory postsynaptic current (IPSC)—at the soma. The membrane potential in the model is governed by a linear differential equation (2.1), driven by postsynaptic currents and by intrinsic adaptation and leak currents. When the membrane potential crosses the threshold Vθ , a spike is registered and the potential is held at a reset
potential, Vreset, for a time equal to the refractory period:

C dV/dt = −gL(V − EL) − gSRA(V − EK) − gAMPA(V − EAMPA) − gGABA(V − EGABA).   (2.1)

Table 1: Conductance Changes Accompanying Each Type of Spike in Equation 2.1.

Current (Spike Type)                   Conductance Change Due to Spike
SRA (output spike at t = t_j^out)      gSRA = ḡSRA exp[−(t − t_j^out)/τSRA], t > t_j^out
AMPA (excitatory spike at t = t_j^E)   gAMPA^j = ḡAMPA exp[−(t − t_j^E)/τAMPA], t > t_j^E
GABA (inhibitory spike at t = t_j^I)   gGABA^j = (ḡGABA/D){exp[−(t − t_j^I)/τGABA,1] − exp[−(t − t_j^I)/τGABA,2]}, t > t_j^I

Note: D is adjusted to make gGABA^j equal ḡGABA at its peak.
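A stripped-down Euler simulation of equation 2.1, keeping only the leak, AMPA, and GABA terms (the SRA current is omitted, the GABA conductance is simplified to a single exponential, and every parameter value is an illustrative placeholder rather than a value from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder parameters (illustrative only)
C, g_L = 0.5, 25.0                       # nF, nS
E_L, E_AMPA, E_GABA = -74.0, 0.0, -70.0  # mV
V_th, V_reset = -54.0, -60.0             # mV
tau_AMPA, tau_GABA = 2.0, 5.0            # ms
g_unit_AMPA, g_unit_GABA = 2.0, 4.0      # nS added per input spike

dt, T = 0.1, 1000.0                      # ms
rate_E = 160 * 40 / 1000.0               # 160 excitatory inputs at 40 Hz (spikes/ms)
rate_I = 40 * 40 / 1000.0                # 40 inhibitory inputs at 40 Hz

V, gE, gI, spikes = E_L, 0.0, 0.0, 0
for _ in range(int(T / dt)):
    gE += g_unit_AMPA * rng.poisson(rate_E * dt)  # random excitatory arrivals
    gI += g_unit_GABA * rng.poisson(rate_I * dt)
    gE -= dt * gE / tau_AMPA                      # exponential conductance decay
    gI -= dt * gI / tau_GABA
    dVdt = (-g_L * (V - E_L) - gE * (V - E_AMPA) - gI * (V - E_GABA)) / C
    V += dVdt * dt * 1e-3                         # nS*mV/nF = mV/s; dt is in ms
    if V >= V_th:                                 # threshold crossing: spike and reset
        V, spikes = V_reset, spikes + 1

print(spikes)  # output spikes in 1 s of simulated time
```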
In the above equation, C is the membrane capacitance, and g L is the conductance due to leak channels. The membrane potential is influenced by three types of currents: a current due to fast excitatory synapses (AMPA), a current due to fast inhibitory synapses (GABA), and a spike rate adaptation current (SRA). The conductances gSRA , gAMPA , and gGABA are the sums of the conductances evoked by past output spikes and excitatory and inhibitory input spikes, respectively (see Table 1). The arrival of the inputs is stochastic, and as a result, the membrane potential executes a random walk toward the threshold (Troyer & Miller, 1997). In the baseline condition, the model produces a firing rate of 40 spikes per second when driven by 160 excitatory inputs and 40 inhibitory inputs firing at 40 spikes per second. Input synchrony was controlled as described below. Model parameters for the baseline condition are listed in the appendix and are based on in vitro electrophysiology (McCormick, Connors, Lighthall, & Prince, 1985). 2.4 Input Synchrony. We used a method in which the synchrony of the inputs could be systematically modulated independent of their firing rates (Salinas & Sejnowski, 2000). Briefly, each input is modeled as a random walk to threshold. When the random walk reaches threshold, a spike is registered, and the random walk is reset. Synchrony between spikes is controlled using the degree of correlation between the random walks. However, changing the correlation between random walks also affects the firing rate. Therefore, we adjusted the standard deviation of individual steps in the random walks to maintain a constant firing rate (Salinas & Sejnowski,
2000). Synchrony between excitatory-excitatory (E-E), excitatory-inhibitory (E-I), and inhibitory-inhibitory (I-I) input pairs was controlled using the corresponding correlations φE, φEI, and φI between the random walks.

3 Results

3.1 Observed Changes in Population Firing. We recorded the responses of 178 neurons from the somatosensory cortex (SII) across two monkeys, M1 and M2. Eighty percent (144/178) of these neurons were selected for further analysis based on the following criteria. First, at least two trials were required in both attended and unattended conditions (176/178). Among these neurons, we found that poorly isolated cells had a very high spike count variance (due to huge variations in the number of recorded spikes from trial to trial); these cells were eliminated by requiring the variance of the spike count to be less than 10 times the mean (144/176). All cells had a firing rate greater than 5 impulses per second. Our analysis was restricted to a 2.6 second time window during which the target stimulus was scanned across the skin. The target letter contacted the skin at a time t = 1.4 seconds after trial onset. Because the raised letter stimuli used in this study did not fall along any continuum, we were unable to compute the change in gain experienced by a single neuron. Instead, we measured the extent of attentional modulation over the entire population of neurons. We separated the effect of attention into multiplicative and additive components by plotting the spike count for each neuron in the attended condition versus the spike count in the unattended condition. In the plot, a deviation in slope from unity would indicate a multiplicative modulation (i.e., gain change) due to attention over the entire population of neurons, whereas a constant shift (positive or negative y-intercept) would indicate an additive modulation.
Since neurons in SII cortex are known to exhibit both increases and decreases in firing rates with changes in attention (Hsiao et al., 1993), we separated the neurons into three categories in M1 and M2 (t-test for unequal variances for firing rates, p < 0.05): (1) neurons that did not show any significant modulation during the first 600 ms after the stimulus onset—52% (31/60) in M1 and 27% (23/84) in M2; (2) neurons that increased their firing rates during this period—27% (16/60) in M1 and 44% (37/84) in M2; and (3) neurons that decreased their firing rate during this period—22% (13/60) in M1 and 29% (24/84) in M2. Neurons from the second and third categories were selected for analysis of dynamic changes in gain and variability. Figure 1 shows the change in firing rates with a shift in attention during a 200 ms time period that resulted in the maximum gain change for neurons in monkeys M1 (see Figure 1A) and M2 (see Figure 1B) from the second category (i.e., those that increased their firing rates in the tactile task). Thus, for example, a neuron in monkey M1 with a firing rate of 10 impulses per second in the visual (unattended) task will have a firing rate of
Figure 1: Effect of attention on the firing rate during the 200 ms period of maximum gain change. (A) Monkey M1. Spikes are counted between t = 1.8 and 2.0 seconds. (B) Monkey M2. Spikes are counted between t = 0.8 and 1.0 seconds. Data in both panels are from neurons that increased their firing significantly during the peak firing epoch ( p < 0.05, t-test). Best-fitting slopes, with 95% confidence intervals, were 1.85 ± 0.23 (M1) and 1.70 ± 0.11 (M2). The corresponding y-intercepts were 3.77 ± 4.26 (M1) and 0.85 ± 2.18 (M2).
14.8 impulses per second in the tactile task, with the multiplicative component contributing to most of the increase. We then plotted spike count variance against the mean spike count in the same time period for the unattended and attended conditions (see Figure 2).
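The spike count statistics plotted in Figure 2 reduce to counting spikes per trial in the chosen window and taking the mean and variance. A minimal sketch with hypothetical Poisson spike trains (not the recorded data):

```python
import numpy as np

def count_stats(spike_times_per_trial, t_start, t_stop):
    """Mean, variance, and Fano factor of spike counts in [t_start, t_stop)."""
    counts = np.array([np.sum((t >= t_start) & (t < t_stop))
                       for t in spike_times_per_trial])
    mean = counts.mean()
    var = counts.var(ddof=1)
    return mean, var, var / mean

# Hypothetical neuron: Poisson spikes at 40 spikes/s over a 2.6 s trial,
# counted in the 200 ms window from t = 1.8 to 2.0 s (Fano factor near 1).
rng = np.random.default_rng(0)
trials = [np.sort(rng.uniform(0.0, 2.6, size=rng.poisson(40 * 2.6)))
          for _ in range(500)]
mean, var, fano = count_stats(trials, 1.8, 2.0)
```

For a Poisson train, the expected count in the 200 ms window is 8 and the Fano factor is 1; the recorded SII data had Fano factors near 2.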
Dynamic Gain Changes During Attentional Modulation
Figure 2: Effect of attention on the variability of neuronal responses. (A) Monkey M1. (B) Monkey M2. Plots show the spike count variance versus spike count for neurons whose spike rate was increased by the focus of attention (data from neurons as in Figure 1). Crosses correspond to neurons in the attended condition, and dots correspond to the unattended condition. Fitted powers (with 95% confidence intervals): M1: attended, 1.0 ± 0.13; unattended, 1.1 ± 0.11. M2: attended, 1.2 ± 0.09; unattended, 1.1 ± 0.08. Coefficients (with 95% confidence intervals): M1: attended, 1 ± 0.24; unattended, 1.4 ± 0.21. M2: attended, 1.2 ± 0.16; unattended, 1.5 ± 0.13.
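The power and coefficient fits quoted in the caption can be obtained by least squares on log-transformed (mean, variance) pairs. A sketch, using hypothetical Poisson-like neurons for which the fitted power and coefficient should both be near 1:

```python
import numpy as np

def variance_mean_power_fit(spike_counts):
    """Fit variance = coeff * mean**power across neurons.

    spike_counts: list of 1-D arrays, one per neuron (counts over trials).
    Returns (power, coeff) from a least-squares line in log-log space.
    """
    means = np.array([c.mean() for c in spike_counts])
    variances = np.array([c.var(ddof=1) for c in spike_counts])
    power, log_coeff = np.polyfit(np.log(means), np.log(variances), deg=1)
    return power, np.exp(log_coeff)

# Hypothetical Poisson neurons: variance equals mean, so power and
# coefficient should both come out close to 1.
rng = np.random.default_rng(1)
counts = [rng.poisson(lam, size=200) for lam in np.linspace(2, 40, 25)]
power, coeff = variance_mean_power_fit(counts)
```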
The mean firing rate in the attended condition (M1, 20.4; M2, 22.6 spikes/s) was more than 1.5 times that in the unattended condition (M1, 12.9; M2, 12.8 spikes/s). Furthermore, the relationship between the mean spike count and the spike count variance is unaffected by the attentional state (i.e., the regressions in Figure 2 are not significantly different from one another). These results agree well with attentional effects observed in the visual cortex (see Figure 4 of McAdams & Maunsell, 1999b).

3.2 Dynamic Modulation of Additive and Multiplicative Components. We examined the temporal variations of additive and multiplicative modulation by repeating this procedure for successive 200 ms windows across the duration of the trial. Additive and multiplicative effects were modulated dynamically during the trial in both monkeys (see Figures 3 and 4). Although trends in both additive and multiplicative modulations were significant (MANOVA, p < 0.0005; see section 2), the effects were predominantly multiplicative in both monkeys (except among neurons in monkey M1 that experienced a decrease in firing rate). The time course of multiplicative and additive modulation differed between the two monkeys. In monkey M1, multiplicative gain reached a maximum 0.6 seconds before stimulus onset, whereas the additive component peaked roughly 0.2 seconds after stimulus onset. In monkey M2, maximum multiplicative gain occurred 0.5 seconds after stimulus onset, and additive gain peaked 0.3 seconds after stimulus onset. Similar modulations were observed among the subpopulation of neurons that decreased their firing rates. Despite the changes in multiplicative and additive modulation across the trial, we did not observe significant modulations in the Fano factors in any subpopulation of neurons.

3.3 Mechanisms of Gain Change in a Model Neuron.
The analysis of SII data shows that in general, many neurons exhibit a multiplicative gain change when the animal shifts its focus of attention (see Figure 1), and that the Fano factor is unaffected by the gain change (see Figure 2). We performed computer simulations on a model cortical neuron to identify which mechanisms could account for a multiplicative gain change without affecting the Fano factor. With independent Poisson inputs, the standard balanced conductance-based model generates Poisson-like output with a Fano factor of 1.0 (Shadlen & Newsome, 1998). To replicate the Fano factor of 2.0 seen in our data, we increased the level of input synchrony and set φ E = φ I = 0.2. This corresponds to a 20% common drive for all inputs, a correlation that is commonly seen in neuronal data (e.g., Britten, Shadlen, Newsome, & Movshon, 1992; Shadlen, Britten, Newsome, & Movshon, 1996). We confirmed that our simulations did not depend on the exact Fano factor chosen by repeating them for other levels of input synchrony (data not shown). We selected parameters of the model neuron such that the output firing rate would approximately equal that of the individual inputs over a
Figure 3: Multiplicative and additive components of attentional modulation observed in monkey M1 within two subpopulations. (A) Neurons that increased their firing rates during the first 600 ms of the stimulus. (B) Neurons from monkey M1 that significantly decreased their firing rates during the tactile (attended) task. Dotted lines indicate 95% confidence intervals computed using bootstrap.
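The bootstrap confidence intervals mentioned in the caption can be computed by resampling neurons with replacement and refitting the regression in each resample. A minimal sketch with hypothetical data (the authors' exact bootstrap procedure may differ in detail):

```python
import numpy as np

def bootstrap_slope_ci(unattended, attended, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the regression slope (multiplicative
    component): resample neurons with replacement, refit the line."""
    rng = np.random.default_rng(seed)
    n = len(unattended)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample neurons with replacement
        slopes[b], _ = np.polyfit(unattended[idx], attended[idx], deg=1)
    lo, hi = np.percentile(slopes, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Hypothetical spike counts with a true population gain of 1.5.
rng = np.random.default_rng(1)
unattended = rng.uniform(5.0, 60.0, size=50)
attended = 1.5 * unattended + rng.normal(0.0, 3.0, size=50)
lo, hi = bootstrap_slope_ci(unattended, attended)
```

The same resampling can be applied to the intercept (additive component) to produce the dotted confidence bands in Figures 3 and 4.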
physiological range (0–100 spikes/s). We then analyzed six putative mechanisms, each corresponding to a single parameter in the model (e.g., firing threshold), in two steps. First, the mechanism was used to double the firing rate from 40 to 80 spikes per second while leaving the input firing rates
Figure 4: Multiplicative and additive components of attentional modulation for monkey M2 neurons, plotted analogously to the monkey M1 data (see Figure 3).
unchanged. Second, we examined the ability of the mechanism to modulate the slope of the input-output relationship without affecting the Fano factor (see Figure 5). A multiplicative modulation in the input-output relationship corresponds to a gain change across a population of neurons, each receiving inputs firing at different rates. Thus, we were able to determine whether a given mechanism produced an effect on the firing rate that was consistent with data. Note that although the analysis is shown for populations of
Figure 5: Two types of gain change observed in the conductance-based integrate-and-fire model neuron. (A) Input-output relationship for two different values of threshold. The solid line corresponds to the baseline value (−54 mV) and the dotted line to a lowered threshold (−55 mV). (B) Spike count variance plotted against output rate for each of these data points (dots: baseline; crosses: lowered threshold). The solid line represents the relationship (y = x) expected of a Poisson process. (C, D) Corresponding plots for gain produced by increasing the membrane time constant by four times to 80 ms.
neurons that undergo an increase in gain, it is equally applicable to a population of neurons that experience a decrease in gain. Table 2 summarizes the effect of each mechanism on the slope of the input-output relationship and the Fano factor. Figure 5 indicates the two types of gain change produced in the model neuron. A slight change in threshold produced a multiplicative gain change without affecting the Fano factor (see Figures 5A and 5B). On the other hand, changing the membrane time constant had an additive effect on the input-output relationship, without affecting the Fano factor. We found that two of the six mechanisms had only weak effects on the firing rate but adverse
Table 2: Parameter Changes Required to Double the Output Firing Rate and Their Effects on Gain and Fano Factor.
Parameter | % Change in Parameter Needed | Fano Factor at 40 Spikes/s After Change | Multiplicative Gain (Slope at 40 Spikes/s) | Consistent with Data?
1. Reset potential | Not possible | NA | NA | No
2. Synchrony of E inputs | +50% | 3.85 | 1.8X | No
3. Synchrony between all input pairs | Not possible | NA | NA | No
4. Membrane time constant | +300% | 1.5 | 1X | No
5. Adaptation conductance | −100% | 2.0 | 2X | Yes
6. Threshold | −1.8% | 2.0 | 2X | Yes

Note: Missing values indicate that it was not possible to double the firing rate using that parameter.
effects on output variability; these were the reset potential and synchrony between all input pairs (see Figure 6). Changes in synchrony between all input pairs result in a cancellation of the effect of excitatory and inhibitory inputs; this effect has been discussed in detail by Salinas and Sejnowski (2000). On the other hand, changes in excitatory input synchrony produced a multiplicative gain change but a disproportionate increase in output variability (see Figures 7A and 7B and Table 2). Changes in inhibitory input synchrony produced effects that were identical to those produced by changes in excitatory input synchrony (data not shown). Finally, changes in firing rate adaptation produced a multiplicative change in firing while leaving the Fano factor unaffected (see Figures 7C and 7D). The results above also indicate that each constraint on attentional modulation can be violated independently: the membrane time constant did not affect the Fano factor but produced an additive increase in firing, whereas excitatory synchrony produced a gain change but had a large effect on the Fano factor. In summary, of the six mechanisms considered here, only threshold and firing rate adaptation modulated firing rates in a manner consistent with a shift in attention.

4 Discussion

Understanding how neural representations are affected by attentional state is a topic of intense investigation (for a review, see Reynolds & Chelazzi, 2004). However, it is unclear how attentional state affects neural coding
Figure 6: Effect of reset potential and synchrony between all pairs of inputs on output firing. (A) Output rate versus reset potential. (B) Output spike count variance versus output firing rate for these data. (C, D) The corresponding figures for increases in synchrony between all pairs of inputs (E-E, I-I, E-I). Input synchrony was increased without affecting the input firing rates (see section 2).
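A common construction for input trains with a given pairwise count correlation, consistent with the "common drive" description in section 2 (though not necessarily the authors' exact procedure), is to add a shared Poisson component of rate φ·λ to an independent component of rate (1 − φ)·λ. The sketch below works with spike counts rather than spike times:

```python
import numpy as np

def correlated_poisson_counts(rate, phi, n_trains, duration, n_trials, rng):
    """Spike counts for Poisson inputs with pairwise count correlation phi.

    Each train is a shared 'common drive' component of rate phi*rate plus
    an independent component of rate (1 - phi)*rate, so every train keeps
    the requested rate and every pair of trains has count correlation phi.
    """
    common = rng.poisson(phi * rate * duration, size=(n_trials, 1))
    private = rng.poisson((1 - phi) * rate * duration,
                          size=(n_trials, n_trains))
    return common + private

# Two input trains at 40 spikes/s with 20% common drive (phi = 0.2),
# matching the baseline synchrony used in the simulations.
rng = np.random.default_rng(2)
counts = correlated_poisson_counts(40.0, 0.2, 2, 1.0, 20000, rng)
empirical_corr = np.corrcoef(counts[:, 0], counts[:, 1])[0, 1]
```

Because the shared component contributes all of the covariance and the total rate is unchanged, the empirical pairwise correlation converges to φ while each train's mean count stays at rate × duration.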
and what mechanisms underlie the modulation. Our analysis shows that attention multiplicatively modulates the population firing in a dynamic manner during a trial without affecting the firing variability; in addition, the unique nature of the observed gain change limits the possible mechanisms underlying attentional modulation.

4.1 Dynamic Gain Modulation. Many studies have shown that a shift in the focus of attention modulates neuronal responses over timescales of hundreds of milliseconds (Hsiao et al., 1993; Luck, Chelazzi, Hillyard, & Desimone, 1997; McAdams & Maunsell, 1999b). We observed a similar time course for the gain of neuronal responses (see Figures 3 and 4), which rose to a peak of about 1.5 times and decreased to 1.0 times over a period of
Figure 7: Effect of spike rate adaptation and excitatory input synchrony on output firing. (A) Change in the input-output relationship produced by an increase in synchrony from zero (solid) to φE = 0.2 (dotted). (B) Output spike count variance versus output firing rate for the corresponding data points. Changing excitatory input synchrony produced a Fano factor of 3.85 at an output rate of 40 spikes per second. (C, D) Corresponding plots for the spike rate adaptation conductance gSRA. The baseline model firing rate was elevated from 40 to 80 spikes per second by reducing the adaptation conductance to zero.
about 500 ms. We found different time courses for gain in the two monkeys, which could be due to task-related differences. In monkey M1, the target letter remained constant throughout, whereas in monkey M2, the target letter changed after each trial. Thus, the attentional focus of M1 tended to wax and wane since the animal knew that target letters never followed each other in succession. In contrast, the attentional focus of M2 remained consistently high throughout all tactile trials. Therefore, we hypothesized that the peak in gain observed in monkey M1 prior to stimulus onset may be related to stimulus expectation.
Despite the dynamic modulation of gain, the relationship between spike count variance and mean was unaffected by attentional state. Because the firing rates of neurons varied during the trial, we selected a short interval of 200 ms over which the rate remained relatively constant. Although the Fano factor is known to depend on the counting duration (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997), we found little variation over the short timescales required for the analysis (0.1–0.5 s). The observation that spike count variance is proportional to the mean is thus supported not only across stimulus conditions (Shadlen & Newsome, 1998) but also across variations in attentional state. Therefore, the discharge variability may well be an intrinsic property rather than a function of the input.

4.2 Mechanisms Underlying Attentional Modulation. The main result is that the observed change in gain and variability with attentional state limits the possible mechanisms that can produce this modulation. We selected six biophysical mechanisms that we hypothesized could account for the neural data. Although we did not systematically test whether the effect could be due to some combination of these mechanisms, it is likely that attention functions through one or more of these possibilities. Of the six mechanisms considered, only changes in spike threshold or in firing rate adaptation produced a gain modulation without affecting the Fano factor. Primarily, these parameters affect the distance between the threshold and the steady-state membrane potential to bring about a multiplicative gain change. The following mechanisms were found to be inconsistent with the observed data: changes in (1) reset potential, (2) synchrony between all input pairs, (3) synchrony between excitatory or inhibitory inputs, and (4) membrane time constant.
The first two mechanisms affect the variability of the membrane potential about the steady state and produce an increase in the discharge variability. Changing the membrane time constant had an additive effect on firing rate because of the diminishing contribution of leak current with increasing input firing rates. Biophysical mechanisms underlying attentional modulation have been investigated in relatively few studies. Reynolds, Chelazzi, and Desimone (1999) proposed that attention may increase the effective synaptic strength, although how this might occur biophysically is unclear. Niebur and Koch (1994) first suggested that input correlations may play a role in attentional modulation. Furthermore, input synchrony has a multiplicative effect on firing rate (Salinas & Sejnowski, 2000; Tiesinga, Fellous, Salinas, Jose, & Sejnowski, 2004). However, input synchrony dramatically affects the output variability, as reported here as well as in both computational and in vitro studies (Stevens & Zador, 1998; Salinas & Sejnowski, 2000; Feng & Brown, 2000; Harsch & Robinson, 2000). Finally, Chance and Abbott (2002) have shown in vitro that balanced synaptic input can have a multiplicative effect on firing rate (see below).
Although the observation that attention modulates neuronal gain is relatively recent (McAdams & Maunsell, 1999a), gain modulation itself is a widespread phenomenon (Salinas & Thier, 2000). Simulations using biophysical neuronal models have shown that balanced synaptic input has a modulatory effect on the overall gain and variability (Burkitt, Meffin, & Grayden, 2003). Finally, the impact of Poisson inputs on the output firing rate and variability may differ depending on whether an integrate-and-fire model or a Hodgkin-Huxley model is used (Tiesinga, Jose, & Sejnowski, 2000). Further in vitro studies using dynamic clamp methods need to be performed to develop a statistical description of the output as a function of the inputs (cf. Chance & Abbott, 2002).

4.3 The Neuron Model. We used a conductance-based integrate-and-fire model for cortical neurons with parameters derived from in vitro electrophysiology (McCormick et al., 1985). While the essential features of integration of synaptic EPSPs and IPSPs on the soma are present in the model, the nonlinear dynamics of spiking are lumped together into a threshold (Koch, 1999). The assumption of a fixed threshold is justified because a substantial part of the discharge variability can be attributed to irregular synaptic input (Calvin & Stevens, 1968; Mainen & Sejnowski, 1995; Nowak, Sanchez-Vives, & McCormick, 1997). We used balanced input to produce high discharge variability in the baseline condition (Shadlen & Newsome, 1998). Due to the balanced input, the membrane potential rapidly achieves a steady-state value after an action potential; the subsequent random walk to the threshold produces a Poisson-like discharge variability (Troyer & Miller, 1997). Cortical neurons are not far from an approximate balance; for example, receptive fields of area 3b neurons in the somatosensory cortex show roughly equal strengths of inhibition and excitation, with a median ratio close to 0.8 (DiCarlo, Johnson, & Hsiao, 1998).
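To make the model concrete, here is a minimal conductance-based integrate-and-fire sketch in the spirit of the one described above. It simplifies the synaptic drive to noisy constant conductances (rather than the individual balanced Poisson synapses of the paper) and omits refractoriness and adaptation; the leak, reversal, threshold, and reset values follow the appendix, while the drive levels are hypothetical:

```python
import numpy as np

def run_lif(g_exc, g_inh, v_th=-54.0, duration_ms=1000.0, dt=0.05, seed=0):
    """Conductance-based integrate-and-fire neuron (Euler integration).

    Conductances in nS, voltages in mV, time in ms. Gaussian fluctuations
    around the mean synaptic conductances stand in for stochastic input.
    Returns the number of output spikes.
    """
    rng = np.random.default_rng(seed)
    g_L, E_L = 25.0, -74.0         # leak conductance and reversal
    E_exc, E_inh = 0.0, -61.0      # synaptic reversal potentials
    V_reset = -60.0                # reset potential
    C = 500.0                      # pF, so tau_m = C / g_L = 20 ms
    V, n_spikes = V_reset, 0
    for _ in range(int(duration_ms / dt)):
        # Noisy instantaneous conductances (clipped at zero).
        ge = max(g_exc * (1.0 + 0.5 * rng.standard_normal()), 0.0)
        gi = max(g_inh * (1.0 + 0.5 * rng.standard_normal()), 0.0)
        I = g_L * (E_L - V) + ge * (E_exc - V) + gi * (E_inh - V)  # pA
        V += dt * I / C
        if V >= v_th:
            V = V_reset
            n_spikes += 1
    return n_spikes

n_base = run_lif(20.0, 20.0)                        # hypothetical drive
n_low_threshold = run_lif(20.0, 20.0, v_th=-55.0)   # lowered threshold
```

Lowering the spike threshold from −54 to −55 mV raises the firing rate in this simplified model, in line with the threshold mechanism illustrated in Figure 5A.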
Our results were independent of balance as long as input synchrony was adjusted to yield the requisite discharge variability at baseline. This is important since changing input balance affects response selectivity, whereas a gain change leaves the selectivity unaffected (e.g., McAdams & Maunsell, 1999a). Because attention appears to affect only the gain of a neuron's response without changing its selectivity, we did not consider input balance to be a putative mechanism. We made several assumptions regarding the nature of inputs to the model. First, we used input correlations and balanced synaptic input to obtain the high discharge variability seen in our data (Salinas & Sejnowski, 2000; Shadlen & Newsome, 1998). Second, we assumed inputs to be unaffected by attentional state. Although attentional effects increase progressively along the sensory pathway (Hsiao et al., 1993; Reynolds & Chelazzi, 2004), it is unclear whether attentional modulation in lower cortical areas has a feedforward effect on higher cortical areas. Therefore, we considered the simplified situation in which input firing rates are unaffected
by attentional state. We also sought to distinguish the impact of synchrony from that of the firing rate in order to identify how each affects gain and variability. It is unclear whether increasing the gain in a population of neurons will affect their synchrony. The underlying mechanisms are likely to be constrained further by a network model that reproduces the attentional effects on gain and variability reported here, as well as on pair-wise synchrony reported previously (Steinmetz et al., 2000).

4.4 Biophysical Feasibility. Each of the above mechanisms was selected based on its ability to be modulated in a biophysically feasible manner, over durations of hundreds of milliseconds.

4.4.1 Reset Potential. The reset potential is determined by the balance between sodium and potassium conductances. This can be modulated by a specific potassium channel conductance (e.g., M-type K+ channel; Hille, 2001).

4.4.2 Input Synchrony. Increased gamma-band oscillations can lead to increased synchrony, as has been observed recently (Fries, Reynolds, Rorie, & Desimone, 2001). However, we have found that a change in the synchrony alone (without affecting the firing rates) increases the firing variability far beyond physiological levels. Our results are consistent with in vitro experiments (Stevens & Zador, 1998; Harsch & Robinson, 2000) as well as with other theoretical and modeling studies (Salinas & Sejnowski, 2000; Feng & Brown, 2000).

4.4.3 Membrane Time Constant. The leak resistance can be modulated by a change in the properties of K-channels. For example, the M-type potassium current can be affected at short timescales and can be modulated by acetylcholine inputs (Hille, 2001).

4.4.4 Adaptation Conductance. Calcium-triggered potassium channels are responsible for adaptation in the firing rates of neocortical neurons (McCormick et al., 1985; Hille, 2001).
These channels can be modulated by second messenger effects; for example, acetylcholine or norepinephrine inputs can lead to a change in channel properties over hundreds of milliseconds (Hille, 2001). There is evidence that cholinergic modulation is related to the level of attention (Kodama, Honda, Watanabe, & Hikosaka, 2002; Davidson & Marrocco, 2000). Computational studies also suggest a possible role of feedback modulation in controlling attention, which can be controlled by the extent of firing rate adaptation (Wilson, Krupa, & Wilkinson, 2000; Rothenstein, Martinez-Trujillo, Treue, Tsotsos, & Wilson, 2002). 4.4.5 Threshold. A constant depolarizing current can affect the threshold of production of an action potential (Johnston & Wu, 1995). This can be
achieved by a tonic input from other cortical areas. This input need not even be stimulus dependent. A similar observation was made by Chance and Abbott (2002), who found that the level of balanced background synaptic input to neurons in vitro offsets the stimulus-driven current, bringing the steady state closer to threshold, and produces a multiplicative gain change. They too report that the output variability (measured by the coefficient of variation) is unaffected, which is consistent with our findings. An increase in tonic input is consistent with the findings of Luck et al. (1997), in which the spontaneous firing rates of neurons in V2 and V4 were found to increase by 30% to 40% when attention was directed into the receptive field.

5 Conclusions

Our results show that in the secondary somatosensory cortex, neurons experience a multiplicative increase in firing rates during attentional modulation while preserving the Fano factor. These effects are similar to those seen in visual area V4 (McAdams & Maunsell, 1999b) and suggest that the mechanisms of attention are common to both sensory areas. While there have been many studies of the effects of attention on perception and the underlying neural responses (Kastner & Ungerleider, 2000), the mechanisms that underlie the modulation of neuronal responses at the single-neuron level are unknown. In this study, we employed a deductive approach to identifying the mechanisms responsible for attentional modulation at the single-neuron level and concluded that attention causes changes in spike threshold or in firing rate adaptation. Further studies are required to understand how changes in these mechanisms account for other effects of selective attention, such as the modulation of synchrony with attentional state (Steinmetz et al., 2000).
Appendix

Model neuron parameters at baseline (i.e., output rate = 40 spikes/s, Fano factor = 2.0):

Conductances (in nS): ḡSRA = 3.5; ḡAMPA = 2.01; ḡGABA = 27.85; gL = 25
Time constants (in ms): τm = 20; τrefrac = 1.72; τSRA = 100; τAMPA = 5; τGABA,1 = 5.6; τGABA,2 = 0.28
Reversal potentials (in mV): EK = −80; EL = −74; EAMPA = 0; EGABA = −61; Vrest = Vreset = −60; Vθ = −54
Numbers of inputs: ME = 160 excitatory, MI = 40 inhibitory
Input rates (in spikes/s): excitatory rate λE = 40; inhibitory rate λI = αλE, with α = 1.7
Parameters for all random walks: Nreset = 20, Nθ = 40, Nrest = 0
Input synchrony: φE = φI = 0.2

The subthreshold equation was numerically integrated with a time step of 0.05 ms.

Acknowledgments

We thank Steven Hsiao, Ernst Niebur, and Alfredo Kirkwood for helpful discussions. This work was supported by NIH grants NS34086 and NS18787.

References

Britten, K. H., Shadlen, M. N., Newsome, W. T., & Movshon, J. A. (1992). The analysis of visual motion: A comparison of neuronal and psychophysical performance. J. Neurosci., 12, 4745–4765.
Burkitt, A. N., Meffin, H., & Grayden, D. B. (2003). Study of neuronal gain in a conductance-based leaky integrate-and-fire neuron model with balanced excitatory and inhibitory synaptic input. Biol. Cybern., 89, 119–125.
Calvin, W. H., & Stevens, C. F. (1968). Synaptic noise and other sources of randomness in motor neuron inter-spike intervals. J. Neurophysiol., 31, 574–587.
Chance, F. S., & Abbott, L. F. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
Davidson, M. C., & Marrocco, R. T. (2000). Local infusion of scopolamine into intraparietal cortex slows covert orienting in rhesus monkeys. J. Neurophysiol., 83, 1536–1549.
DiCarlo, J. J., Johnson, K. O., & Hsiao, S. S. (1998). Structure of receptive fields in area 3b of primary somatosensory cortex in the alert monkey. J. Neurosci., 18(7), 2626–2645.
Feng, J., & Brown, D. (2000). Impact of correlated inputs on the output of the integrate-and-fire model. Neural Computation, 12, 671–692.
Fries, P., Reynolds, J. H., Rorie, A. E., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291, 1560–1563.
Harsch, A., & Robinson, H. P. C. (2000). Postsynaptic variability of firing in rat neocortical neurons: The roles of input synchronization and synaptic NMDA receptor conductance. J. Neurosci., 20, 6181–6192.
Hille, B. (2001). Ion channels of excitable membranes (3rd ed.). Sunderland, MA: Sinauer.
Hsiao, S. S., O'Shaughnessy, D. M., & Johnson, K. O. (1993). Effects of selective attention on spatial form processing in monkey primary and secondary somatosensory cortex. J. Neurophysiol., 70(1), 444–447.
Johnson, K. O., & Phillips, J. R. (1988). A rotating drum stimulator for scanning embossed patterns and textures across the skin. J. Neurosci. Methods, 22, 221–231.
Johnston, D., & Wu, S. M. (1995). Foundations of cellular neurophysiology. Cambridge, MA: MIT Press.
Kastner, S., & Ungerleider, L. G. (2000). Mechanisms of visual attention in the human cortex. Annual Review of Neuroscience, 23, 315–341.
Koch, C. (1999). Biophysics of computation. New York: Oxford University Press.
Kodama, T., Honda, Y., Watanabe, M., & Hikosaka, K. (2002). Release of neurotransmitters in the monkey frontal cortex is related to level of attention. Psychiatry and Clinical Neurosciences, 56, 341–342.
Luck, S. J., Chelazzi, L., Hillyard, S. A., & Desimone, R. (1997). Neural mechanisms of spatial selective attention in areas V1, V2 and V4 of macaque visual cortex. J. Neurophysiol., 77, 24–42.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
McAdams, C. J., & Maunsell, J. H. R. (1999a). Effects of attention on orientation-tuning functions of single neurons in macaque cortical area V4. Journal of Neuroscience, 19, 431–441.
McAdams, C. J., & Maunsell, J. H. R. (1999b). Effects of attention on the reliability of individual neurons in the monkey visual cortex. Neuron, 23, 765–773.
McCormick, D. A., Connors, B. W., Lighthall, J. W., & Prince, D. A. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. J. Neurophysiol., 54, 782–806.
Moran, J., & Desimone, R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science, 229, 782–784.
Motter, B. C. (1993). Focal attention produces spatially selective processing in visual cortical areas V1, V2, and V4 in the presence of competing stimuli. Journal of Neurophysiology, 70, 909–919.
Niebur, E., & Koch, C. (1994). A model for the neuronal implementation of selective visual attention based on temporal correlation among neurons. J. Comput. Neurosci., 1, 141–158.
Nowak, L. G., Sanchez-Vives, M. V., & McCormick, D. A. (1997). Influence of low and high frequency inputs on spike timing in visual cortical neurons. Cerebral Cortex, 7, 487–501.
Reynolds, J. H., & Chelazzi, L. (2004). Attentional modulation of visual processing. Annual Review of Neuroscience, 27, 611–647.
Reynolds, J. H., Chelazzi, L., & Desimone, R. (1999). Competitive mechanisms subserve attention in macaque areas V2 and V4. J. Neurosci., 19, 1736–1753.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rothenstein, A. L., Martinez-Trujillo, J. C., Treue, S., Tsotsos, J. K., & Wilson, H. R. (2002). Modeling attentional effects in cortical areas MT and MST of the macaque monkey through feedback loops. Society for Neuroscience Abstracts, 559.7.
Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. J. Neurosci., 20(16), 6193–6209.
Salinas, E., & Thier, P. (2000). Gain modulation: A major computational principle of the central nervous system. Neuron, 27, 15–21.
Shadlen, M. N., Britten, K. H., Newsome, W. T., & Movshon, J. A. (1996). A computational analysis of the relationship between neuronal and behavioral responses to visual motion. J. Neurosci., 16, 1486–1510.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation and information coding. J. Neurosci., 18, 3870–3896.
Steinmetz, P. N., Roy, A., Fitzgerald, P. J., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190.
Stevens, C. F., & Zador, A. M. (1998). Input synchrony and the irregular firing of cortical neurons. Nature Neuroscience, 1, 210–217.
Tiesinga, P. H. E., Fellous, J.-M., Salinas, E., Jose, J. V., & Sejnowski, T. J. (2004). Synchronization as a mechanism for attentional modulation. Neurocomputing, 58, 641–646.
Tiesinga, P. H. E., Jose, J. V., & Sejnowski, T. J. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Physical Review E, 62, 8413–8419.
Troyer, T. W., & Miller, K. D. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Computation, 9, 971–983.
Wilson, H. R., Krupa, B., & Wilkinson, F. (2000). Dynamics of perceptual oscillation in form vision. Nature Neuroscience, 3(2), 170–176.
Received February 3, 2005; accepted June 28, 2005.
LETTER
Communicated by Odelia Schwartz
On the Analysis and Interpretation of Inhomogeneous Quadratic Forms as Receptive Fields Pietro Berkes [email protected]
Laurenz Wiskott [email protected] Institute for Theoretical Biology, Humboldt University Berlin, D-10115 Berlin, Germany
In this letter, we introduce some mathematical and numerical tools to analyze and interpret inhomogeneous quadratic forms. The resulting characterization is in some aspects similar to that given by experimental studies of cortical cells, making it particularly suitable for application to second-order approximations and theoretical models of physiological receptive fields. We first discuss two ways of analyzing a quadratic form by visualizing the coefficients of its quadratic and linear term directly and by considering the eigenvectors of its quadratic term. We then present an algorithm to compute the optimal excitatory and inhibitory stimuli—those that maximize and minimize the considered quadratic form, respectively, given a fixed energy constraint. The analysis of the optimal stimuli is completed by considering their invariances, which are the transformations to which the quadratic form is most insensitive, and by introducing a test to determine which of these are statistically significant. Next we propose a way to measure the relative contribution of the quadratic and linear term to the total output of the quadratic form. Furthermore, we derive simpler versions of the above techniques in the special case of a quadratic form without linear term. In the final part of the letter, we show that for each quadratic form, it is possible to build an equivalent two-layer neural network, which is compatible with (but more general than) related networks used in some recent articles and with the energy model of complex cells. We show that the neural network is unique only up to an arbitrary orthogonal transformation of the excitatory and inhibitory subunits in the first layer.

1 Introduction

Recent research in neuroscience has seen an increasing number of extensions of established linear techniques to their nonlinear equivalent in both experimental and theoretical studies.
Neural Computation 18, 1868–1895 (2006)  © 2006 Massachusetts Institute of Technology

This is the case, for example, for spatiotemporal receptive field estimates in physiological studies (see
Simoncelli, Pillow, Paninski, & Schwartz, 2004, for a review) and information-theoretical models like principal component analysis (PCA) (Schölkopf, Smola, & Müller, 1998) and independent component analysis (ICA) (see Jutten & Karhunen, 2003, for a review). Additionally, new nonlinear unsupervised algorithms have been introduced, for example, slow feature analysis (SFA) (Wiskott & Sejnowski, 2002). The study of the resulting nonlinear functions can be a difficult task because of the lack of appropriate tools to characterize them qualitatively and quantitatively. During a recent project concerning the self-organization of complex cell receptive fields in the primary visual cortex (V1) (Berkes & Wiskott, 2002, 2005b; see section 2), we developed some of these tools to analyze quadratic functions in a high-dimensional space. Because of the complexity of the methods, we describe them here in a separate letter. The resulting characterization is in some aspects similar to that given by physiological studies, making it particularly suitable to be applied to the analysis of nonlinear receptive fields. We are going to focus on the analysis of the inhomogeneous quadratic form

$$g(\mathbf{x}) = \frac{1}{2}\,\mathbf{x}^T\mathbf{H}\mathbf{x} + \mathbf{f}^T\mathbf{x} + c\,, \qquad (1.1)$$
where x is an N-dimensional input vector, H an N × N matrix, f an N-dimensional vector, and c a constant. Although some of the mathematical details of this study are specific to quadratic forms only, it should be straightforward to extend most of the methods to other nonlinear functions while preserving the same interpretations. In other contexts, it might be more useful to approximate the function under consideration by a quadratic form using a Taylor expansion up to the second order and then apply the algorithms described here. In experimental studies, quadratic forms occur naturally as a second-order approximation of the receptive field of a neuron in a Wiener expansion (Marmarelis & Marmarelis, 1978; van Steveninck & Bialek, 1988; Lewis, Henry, & Yamada, 2002; Schwartz, Chichilnisky, & Simoncelli, 2002; Touryan, Lau, & Dan, 2002; Rust, Schwartz, Movshon, & Simoncelli, 2004; Simoncelli et al., 2004). Quadratic forms were also used in various theoretical articles, either explicitly (Hashimoto, 2003; Bartsch & Obermayer, 2003) or implicitly in the form of neural networks (Hyvärinen & Hoyer, 2000, 2001; Körding, Kayser, Einhäuser, & König, 2004). The analysis methods used in these studies are discussed in section 10. Table 1 lists some important terms and variables used throughout the article. We will refer to ½ xT Hx as the quadratic term, to fT x as the linear term, and to c as the constant term of the quadratic form. Without loss of generality, we assume that H is a symmetric matrix, since if necessary we can substitute
Table 1: Definitions of Some Important Terms.

N : Number of dimensions of the input space
⟨·⟩t : Mean over time of the expression between the two brackets
x : Input vector
g, g˜ : The considered inhomogeneous quadratic form and its restriction to a sphere
H, hi : N × N matrix of the quadratic term of the inhomogeneous quadratic form (see equation 1.1) and ith row of H (i.e., H = (h1, . . . , hN)T). H is assumed to be symmetric.
vi, µi : ith eigenvector and eigenvalue of H, sorted by decreasing eigenvalues (i.e., µ1 ≥ µ2 ≥ . . . ≥ µN)
V, D : The matrix of the eigenvectors V = (v1, . . . , vN) and the diagonal matrix of the eigenvalues, so that VT HV = D
f : N-dimensional vector of the linear term of the inhomogeneous quadratic form (see equation 1.1)
c : Scalar value of the constant term of the inhomogeneous quadratic form (see equation 1.1)
x+, x− : Optimal excitatory and inhibitory stimuli, ‖x+‖ = ‖x−‖ = r
H in equation 1.1 by the symmetric matrix ½(H + HT) without changing the values of the function g. We define µ1, . . . , µN to be the eigenvalues to the eigenvectors v1, . . . , vN of H sorted in decreasing order µ1 ≥ µ2 ≥ . . . ≥ µN. V = (v1, . . . , vN) denotes the matrix of the eigenvectors and D the diagonal matrix of the corresponding eigenvalues, so that VT HV = D. Furthermore, ⟨·⟩t indicates the mean over time of the expression included in the angle brackets. In the next section we introduce the model system that we use for illustration throughout this letter. Section 3 describes two ways of analyzing a quadratic form by visualizing the coefficients of its quadratic and linear term directly and by considering the eigenvectors of its quadratic term. We then present in section 4 an algorithm to compute the optimal excitatory and inhibitory stimuli—the stimuli that maximize and minimize a quadratic form, respectively, given a fixed energy constraint. In section 5 we consider the invariances of the optimal stimuli, which are the transformations to which the function is most insensitive, and in the following section we introduce a test to determine which of these are statistically significant. In section 7 we discuss two ways to determine the relative contribution of the different terms of a quadratic form to its output. Furthermore, in section 8 we consider the techniques described above in the special case of a quadratic form without the linear term. In the end, we present in section 9 a two-layer neural network architecture equivalent to a given quadratic form. The letter concludes with a discussion of the relation of our approach to other studies in section 10.
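These definitions can be made concrete with a short numerical sketch (Python/NumPy is used here purely for illustration; it is not part of the original letter, whose code is in Matlab). It checks that substituting ½(H + Hᵀ) leaves g unchanged and builds the sorted eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
H = rng.standard_normal((N, N))    # generic quadratic term, not yet symmetric
f = rng.standard_normal(N)
c = 0.5

def g(x, H, f, c):
    """Inhomogeneous quadratic form of equation 1.1."""
    return 0.5 * x @ H @ x + f @ x + c

x = rng.standard_normal(N)

# Substituting the symmetric matrix 0.5*(H + H^T) leaves g unchanged.
H_sym = 0.5 * (H + H.T)
assert np.isclose(g(x, H, f, c), g(x, H_sym, f, c))

# Eigendecomposition of the symmetric quadratic term, sorted so that
# mu_1 >= mu_2 >= ... >= mu_N, with V^T H V = D.
mu, V = np.linalg.eigh(H_sym)
order = np.argsort(mu)[::-1]
mu, V = mu[order], V[:, order]
D = np.diag(mu)
assert np.allclose(V.T @ H_sym @ V, D)
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, so they are explicitly re-sorted to match the decreasing convention of Table 1.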
2 Model System

To illustrate the analysis techniques presented here, we use the quadratic forms presented in Berkes and Wiskott (2002) in the context of a theoretical model of self-organization of complex-cell receptive fields in the primary visual cortex (see also Berkes & Wiskott, 2005b). In this section, we summarize the settings and main results of this example system. We generated image sequences from a set of natural images by moving an input window over an image by translation, rotation, and zoom and subsequently rescaling the collected stimuli to a standard size of 16 × 16 pixels. For efficiency reasons, the dimensionality of the input vectors x was reduced from 256 to 50 input dimensions and whitened using principal component analysis (PCA). We then determined quadratic forms (also called functions or units in the following) by applying SFA to the input data. SFA is an implementation of the temporal slowness principle (see Wiskott & Sejnowski, 2002, and references there). Given a finite-dimensional function space, SFA extracts the functions that, applied to the input data, return output signals that vary as slowly as possible in time (as measured by the variance of the first derivative) under the constraint that the output signals have zero mean and unit variance and are decorrelated. The functions are sorted by decreasing slowness. For analysis, the quadratic forms are projected back from the 50 first principal components to the input space. Note that the rank of the quadratic term after the transformation is the same as before, and it thus has only 50 eigenvectors. The units receive visual stimuli as an input and can be interpreted as nonlinear receptive fields.
They were analyzed with the algorithms presented here and with sine-grating experiments similar to the ones performed in physiology and were found to reproduce many properties of complex cells in V1—not only the primary ones, that is, response to edges and phase-shift invariance (see sections 4 and 5), but also a range of secondary ones such as direction selectivity, nonorthogonal inhibition, end inhibition, and side inhibition. This model system is complex enough to require an extensive analysis and is representative of the application domain considered here, which includes second-order approximations and theoretical models of physiological receptive fields.

3 Visualization of Coefficients and Eigenvectors

One way to analyze a quadratic form is to look at its coefficients. The coefficients f1, . . . , fN of the linear term can be visualized and interpreted directly. They give the shape of the input stimulus that maximizes the linear part given a fixed norm. The quadratic term can be interpreted as a sum over the inner product of the jth row h j of H with the vector of the products x j xi between the jth
[Figure 1 image: top panels show the linear term f; bottom panels show rows h 52, h 56, h 60, h 116, h 120, h 124, h 180, h 184, h 188 of the quadratic term for units 4 and 28.]
Figure 1: Some of the quadratic form coefficients of two functions learned in the model system. The top plots show the coefficients of the linear term f, reshaped to match the two-dimensional shape of the input. The bottom plots show the coefficients of nine of the rows h j of the quadratic term. The crosses indicate the spatial position of the corresponding reference index j.
variable x j and all other variables:
$$\mathbf{x}^T\mathbf{H}\mathbf{x} = \sum_{j=1}^{N} x_j\,(\mathbf{h}_j^T\mathbf{x}) = \sum_{j=1}^{N} \mathbf{h}_j^T \begin{pmatrix} x_j x_1 \\ x_j x_2 \\ \vdots \\ x_j x_N \end{pmatrix} \qquad (3.1)$$
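Equation 3.1 is easy to verify numerically; the following is an illustrative Python/NumPy sketch with random data (not taken from the model system):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 8
H = rng.standard_normal((N, N))
H = 0.5 * (H + H.T)                     # symmetric quadratic term
x = rng.standard_normal(N)

# Right-hand side of equation 3.1: each row h_j acts as a linear
# filter on x, weighted by the corresponding variable x_j.
row_sum = sum(x[j] * (H[j] @ x) for j in range(N))

assert np.isclose(x @ H @ x, row_sum)
```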
In other words, the response of the quadratic term is formed by the sum of N linear filters h j which respond to all combinations of the jth variable with the other ones. If the input data have a two-dimensional spatial arrangement, as in our model system, the interpretation of the rows can be made easier by visualizing them as a series of images (by reshaping the vector h j to match the structure of the input) and arranging them according to the spatial position of the corresponding variable x j . In Figure 1 we show some of the
[Figure 2 image: eigenvectors of units 4 and 28; the eigenvalues of unit 4 range from 6.56 down to −12.03, those of unit 28 from 12.23 down.]

Figure 2: Eigenvectors of the quadratic term of two functions learned in the model system sorted by decreasing eigenvalues as indicated above each eigenvector.
coefficients of two units learned in the model system. In both, the linear term looks unstructured. The absolute values of its coefficients are small in comparison to those of the quadratic term so that its contribution to the output of the functions is very limited (cf. section 7). The row vectors h j of unit 4 have a localized distribution of their coefficients; they respond only to combinations of the corresponding variable x j and its neighbors. The filters h j are shaped like a four-leaf clover and centered on the variable itself. Pairs of opposed leaves have positive and negative values, respectively. This suggests that the unit responds to stimuli oriented in the direction of the two positive leaves and is inhibited by stimuli with an orthogonal orientation, which is confirmed by successive analysis (cf. later in this section and section 4). In unit 28 the appearance of h j depends on the spatial position of x j. In the bottom half of the receptive field, the interaction of the variables with their close neighbors along the vertical orientation is weighted positively, with a negative flank on the sides. In the top half, the rows have similar coefficients but with reversed polarity. As a consequence, the unit responds strongly to vertical edges in the bottom half, while vertical edges in the top half result in strong inhibition. Edges extending over the whole receptive field elicit only a weak total response. Such a unit is said to be end inhibited. Another possibility for visualizing the quadratic term is to display its eigenvectors. The output of the quadratic form to one of the (normalized) eigenvectors equals half of the corresponding eigenvalue, since ½ viT Hvi = ½ viT (µi vi) = ½ µi. The first eigenvector can be interpreted as the stimulus that among all input vectors with norm 1 maximizes the output of the quadratic term. The jth eigenvector maximizes the quadratic term in the subspace that excludes the previous j − 1 ones.
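The eigenvalue property used here, namely that a normalized eigenvector yields the response ½µi and that the first eigenvector maximizes the quadratic term over unit vectors, can be checked with a short sketch (a random symmetric H is used for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10
H = rng.standard_normal((N, N))
H = 0.5 * (H + H.T)

mu, V = np.linalg.eigh(H)               # ascending order
order = np.argsort(mu)[::-1]            # re-sort: decreasing eigenvalues
mu, V = mu[order], V[:, order]

# Response of the quadratic term to each normalized eigenvector is mu_i / 2.
responses = np.array([0.5 * V[:, i] @ H @ V[:, i] for i in range(N)])
assert np.allclose(responses, 0.5 * mu)

# The first eigenvector maximizes the quadratic term among unit vectors.
w = rng.standard_normal(N)
w /= np.linalg.norm(w)
assert 0.5 * w @ H @ w <= 0.5 * mu[0] + 1e-12
```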
In Figure 2 we show the eigenvectors of the two functions previously analyzed in Figure 1. In unit 4, the first eigenvector looks like a Gabor wavelet (i.e., a sine grating multiplied by a gaussian). The second eigenvector has the same form except for a 90 degree phase shift of the sine grating. Since the two eigenvalues have almost the same magnitude, the response of the quadratic term
is similar for the two eigenvectors and also for linear combinations with constant norm 1. For this reason, the quadratic term of this unit has the main characteristics of complex cells in V1: a strong response to an oriented grating with an invariance to the phase of the grating. The last two eigenvectors, which correspond to the stimuli that minimize the quadratic term, are Gabor wavelets with orientation orthogonal to the first two. This means that the output of the quadratic term is inhibited by stimuli at an orientation orthogonal to the preferred one. A similar interpretation can be given in the case of unit 28, although in this case, the first and the last two eigenvectors have the same orientation but occupy two different halves of the receptive field. This confirms that unit 28 is end inhibited. A direct interpretation of the remaining eigenvectors in the two functions is difficult (see also section 8), although the magnitude of the eigenvalues shows that some of them elicit a strong response. Moreover, the interaction of the linear and quadratic terms to form the overall output of the quadratic form is not considered but cannot generally be neglected. The methods presented in the following sections often give a more direct and intuitive description of quadratic forms.

4 Optimal Stimuli

Another characterization of a nonlinear function can be borrowed from neurophysiological experiments, where it is common practice to characterize a neuron by the stimulus to which the neuron responds best (for an overview, see Dayan & Abbott, 2001). Analogously, we can compute the optimal excitatory stimulus of g, the input vector x+ that maximizes g given a fixed norm ‖x+‖ = r.¹ Note that x+ depends qualitatively on the value of r: if r is very small, the linear term of the equation dominates, so that x+ ≈ f, while if r is very large, the quadratic part dominates, so that x+ equals the first eigenvector of H (see also section 8).
We usually choose r to be the mean norm of all input vectors, since we want x+ to be representative of the typical input. In the same way, we can also compute the optimal inhibitory stimulus x− , which minimizes the response of the function.
1 The fixed norm constraint corresponds to a fixed energy constraint (Stork & Levinson, 1982) used in experiments involving the reconstruction of the Wiener kernel of a neuron (Dayan & Abbott, 2001). During physiological experiments in the visual system, one sometimes uses stimuli with fixed contrast instead. The optimal stimuli under these two constraints may be different. For example, with fixed contrast, one can extend a sine grating indefinitely in space without changing its intensity, while with fixed norm, its maximum intensity is going to dim as the extent of the grating increases. The fixed contrast constraint is more difficult to enforce analytically (e.g., because the surface of constant contrast is not bounded).
The problem of finding the optimal excitatory stimulus under the fixed energy constraint can be mathematically formulated as follows: maximize
$$g(\mathbf{x}) = \frac{1}{2}\,\mathbf{x}^T\mathbf{H}\mathbf{x} + \mathbf{f}^T\mathbf{x} + c$$

under the constraint

$$\mathbf{x}^T\mathbf{x} = r^2\,. \qquad (4.1)$$
This problem is known as the trust region subproblem and has been extensively studied in the context of numerical optimization, where a nonlinear function is minimized by successively approximating it by an inhomogeneous quadratic form, which is in turn minimized in a small neighborhood. Numerous studies have analyzed its properties in particular in the numerically difficult case where H is near to singular (see Fortin, 2000, and references there). We make use of some basic results and extend them where needed. If the linear term is equal to zero (i.e., f = 0), the problem can be easily solved (it is simply the first eigenvector scaled to norm r; see section 8). In the following, we consider the more general case where f ≠ 0. We can use a Lagrange formulation to find the necessary conditions for the extremum:
$$\mathbf{x}^T\mathbf{x} = r^2 \qquad (4.2)$$

and

$$\nabla\!\left[g(\mathbf{x}) - \tfrac{1}{2}\,\lambda\,\mathbf{x}^T\mathbf{x}\right] = 0 \qquad (4.3)$$

$$\Leftrightarrow\quad \mathbf{H}\mathbf{x} + \mathbf{f} - \lambda\mathbf{x} = 0 \qquad (4.4)$$

$$\Leftrightarrow\quad \mathbf{x} = (\lambda\mathbf{I} - \mathbf{H})^{-1}\,\mathbf{f}\,, \qquad (4.5)$$
where we inserted the factor ½ for mathematical convenience. According to theorem 3.1 in Fortin (2000), if an x that satisfies equation 4.5 is a solution to equation 4.1, then (λI − H) is positive semidefinite (i.e., all eigenvalues are greater than or equal to 0). This imposes a strong lower bound on the range of possible values for λ. Note that the matrix (λI − H) has the same eigenvectors vi as H with eigenvalues (λ − µi). For (λI − H) to be positive semidefinite, all eigenvalues must be nonnegative, and thus λ must be greater than or equal to the largest eigenvalue µ1:

$$\mu_1 \le \lambda\,. \qquad (4.6)$$
An upper bound for λ can be found by considering an upper bound for the norm of x. First, we note that matrix (λI − H)−1 is symmetric and has the same eigenvectors as H with eigenvalues 1/(λ − µi). We also know that ‖Av‖ ≤ ‖A‖ ‖v‖ for every matrix A and vector v. ‖A‖ is here the spectral norm of A, which for symmetric matrices is simply the largest absolute eigenvalue. With this we find an upper bound for λ:

$$r = \|\mathbf{x}\| \qquad (4.7)$$

$$= \|(\lambda\mathbf{I} - \mathbf{H})^{-1}\,\mathbf{f}\| \qquad (4.8)$$

$$\le \|(\lambda\mathbf{I} - \mathbf{H})^{-1}\|\,\|\mathbf{f}\| \qquad (4.9)$$

$$= \max_i \left(\frac{1}{\lambda - \mu_i}\right) \|\mathbf{f}\| \qquad (4.10)$$

$$\overset{(4.6)}{=} \frac{1}{\lambda - \mu_1}\,\|\mathbf{f}\| \qquad (4.11)$$

$$\Leftrightarrow\quad \lambda \le \frac{\|\mathbf{f}\|}{r} + \mu_1\,. \qquad (4.12)$$
The optimization problem, equation 4.1, is thus reduced to a search over λ on the interval [µ1, (‖f‖/r + µ1)] until x defined by equation 4.5 fulfills the constraint ‖x‖ = r (see equation 4.2). Vector x and norm ‖x‖ can be efficiently computed for each λ using the eigenvalue decomposition of f:

$$\mathbf{x} \overset{(4.5)}{=} (\lambda\mathbf{I} - \mathbf{H})^{-1}\,\mathbf{f} \qquad (4.13)$$

$$= (\lambda\mathbf{I} - \mathbf{H})^{-1} \Big(\sum_i \mathbf{v}_i(\mathbf{v}_i^T\mathbf{f})\Big) \qquad (4.14)$$

$$= \sum_i (\lambda\mathbf{I} - \mathbf{H})^{-1}\,\mathbf{v}_i(\mathbf{v}_i^T\mathbf{f}) \qquad (4.15)$$

$$= \sum_i \frac{1}{\lambda - \mu_i}\,\mathbf{v}_i(\mathbf{v}_i^T\mathbf{f}) \qquad (4.16)$$

and

$$\|\mathbf{x}\|^2 = \sum_i \left(\frac{1}{\lambda - \mu_i}\right)^{2} (\mathbf{v}_i^T\mathbf{f})^2\,, \qquad (4.17)$$
where the terms viT f and (viT f)2 are constant for each quadratic form and can be computed in advance. The last equation also shows that the norm of x is monotonically decreasing in the considered interval, so that there is exactly one solution and the search can be efficiently performed by a bisection method. x− can be found in the same way by maximizing the negative of g. The pseudocode of an algorithm that implements all the considerations above can be found in Berkes and Wiskott (2005a). A Matlab version can be downloaded online from the authors' home pages (http://itb.biologie.hu-berlin.de/{˜berkes,˜wiskott}).
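The bisection search can be sketched as follows. This is a simplified Python/NumPy reimplementation written for this note, not the authors' published Matlab code; it assumes the generic case v1ᵀf ≠ 0 and ignores the numerically hard near-singular case treated by Fortin (2000):

```python
import numpy as np

def optimal_stimulus(H, f, r, tol=1e-12, max_iter=500):
    """Maximize g(x) = 0.5 x^T H x + f^T x subject to ||x|| = r by
    bisecting on lambda (equations 4.5 to 4.17); generic case f != 0."""
    mu, V = np.linalg.eigh(H)
    mu, V = mu[::-1], V[:, ::-1]          # decreasing eigenvalues
    proj = V.T @ f                        # terms v_i^T f, constant per form

    def norm_x(lam):
        return np.sqrt(np.sum((proj / (lam - mu)) ** 2))   # equation 4.17

    lo = mu[0] + 1e-12                    # lower bound, equation 4.6
    hi = np.linalg.norm(f) / r + mu[0]    # upper bound, equation 4.12
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        if norm_x(lam) > r:               # ||x|| decreases with lambda
            lo = lam
        else:
            hi = lam
        if hi - lo < tol:
            break
    lam = 0.5 * (lo + hi)
    return V @ (proj / (lam - mu))        # equation 4.16

# Illustrative data (not from the letter's model system).
rng = np.random.default_rng(4)
N = 12
H = rng.standard_normal((N, N))
H = 0.5 * (H + H.T)
f = rng.standard_normal(N)
r = 2.0
x_plus = optimal_stimulus(H, f, r)
assert np.isclose(np.linalg.norm(x_plus), r, rtol=1e-4)
```

Since ‖x(λ)‖ is monotonically decreasing on the interval, the bisection converges to the unique solution; x− is obtained by running the same routine on −g, that is, with −H and −f.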
[Figure 3 image: optimal excitatory and inhibitory stimuli for units 1, 5, 6, 10, 11, 15, . . . , 26, 30.]
Figure 3: Optimal stimuli of some of the units in the model system. x+ looks like a Gabor wavelet in almost all cases, in agreement with physiological data. x− is usually structured and is also similar to a Gabor wavelet, which suggests that inhibition plays an important role.
If the matrix H is negative definite (i.e., all its eigenvalues are negative) there is a global maximum that may not lie on the sphere, which might be used in substitution for x+ if it lies in a region of the input space that has a high probability of being reached (the criterion is quite arbitrary, but the region could be chosen to include, for example, 75% of the input data with highest density). The gradient of the function disappears at the global extremum such that it can be found by solving a simple linear equation system:

$$\nabla g(\mathbf{x}) = \mathbf{H}\mathbf{x} + \mathbf{f} = 0 \qquad (4.18)$$

$$\Leftrightarrow\quad \mathbf{x} = -\mathbf{H}^{-1}\,\mathbf{f}\,. \qquad (4.19)$$
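Equation 4.19 amounts to a single linear solve; a minimal sketch with illustrative values (the negative definite H is constructed here only for the example):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 8
A = rng.standard_normal((N, N))
H = -(A @ A.T) - 0.1 * np.eye(N)        # negative definite by construction
f = rng.standard_normal(N)

# The gradient Hx + f vanishes at the global maximum (equation 4.19).
x_star = np.linalg.solve(H, -f)         # x = -H^{-1} f
assert np.allclose(H @ x_star + f, 0)
```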
In the same way, a positive definite matrix H has a negative global minimum, which might be used in substitution for x− . In Figure 3 we show the optimal stimuli of some of the units in the model system. In almost all cases, x+ looks like a Gabor wavelet, in agreement with physiological data for neurons of the primary visual cortex (Pollen & Ronner, 1981; Adelson & Bergen, 1985; Jones & Palmer, 1987). The functions respond best to oriented stimuli having the same frequency as x+ . x− is usually structured as well and looks like a Gabor wavelet too, which suggests that inhibition plays an important role. x+ can be used to compute the position and size of the receptive fields as well as the preferred orientation and frequency of the units for successive experiments. Note that although x+ is the stimulus that elicits the strongest response in the function, it does not necessarily mean that it is representative of the class of stimuli that give the most important contribution to its output. This
depends on the distribution of the input vectors. If x+ lies in a low-density region of the input space, it is possible that other kinds of stimuli drive the function more often. In that case, they might be considered more relevant than x+ to characterize the function. Symptomatic of this effect would be an output of the function at its optimal stimulus that lies far outside the range of normal activity. This means that x+ can be an atypical, artificial input that pushes the function into an uncommon state. A similar effect has also been reported in a physiological article comparing the response of neurons to natural stimuli and to artificial stimuli such as sine gratings (Baddeley et al., 1997). The characterization of a neuron or a nonlinear function as a feature detector via the optimal stimulus is thus at least incomplete (see also MacKay, 1985). However, the optimal stimuli remain extremely informative in practice.

5 Invariances

Since the considered functions are nonlinear, the optimal stimuli do not provide a complete description of their properties. We can gain some additional insights by studying a neighborhood of x+ and x−. An interesting question is to which transformations of x+ or x− the function is invariant. This is similar to the common interpretation of neurons as detectors of a specific feature of the input that are invariant to a local transformation of that feature. For example, complex cells in the primary visual cortex are thought to respond to oriented bars and to be invariant to a local translation. In this section, we consider the function g˜ defined as g restricted to the sphere S of radius r, since as in section 4, we want to compare input vectors having fixed energy. Notice that although g˜ and g take the same values on S (i.e., g˜(x) = g(x) for each x ∈ S), they are two distinct mathematical objects. For example, the gradient of g˜ in x+ is zero because x+ is by definition a maximum of g˜.
On the other hand, the gradient of g in the same point is Hx+ + f, which is in general different from zero. Strictly speaking, there is no invariance in x+ , since it is a maximum, and the output of g˜ decreases in all directions (except in the special case where the linear term is zero and the first two or more eigenvalues are equal). On the other hand, in a general, noncritical point x∗ (i.e., a point where the gradient does not disappear), the rate of change in any direction w is given by its inner product with the gradient, ∇ g˜ (x∗ ) · w. For all vectors orthogonal to the gradient (which span an N − 2 dimensional space), the rate of change is thus zero. Note that this is not merely a consequence of the fact that the gradient is a first-order approximation of g˜ . By the implicit function theorem (see e.g., Walter, 1995, theorem 4.5), in each open neighborhood U of a noncritical point x∗ , there is an N − 2–dimensional level surface {x ∈ U ⊂ S | g˜ (x) = g˜ (x∗ )}, since the domain of g˜ (the sphere S) is an N − 1–dimensional surface and its range (R) is one-dimensional. Each noncritical point thus belongs to an N − 2 dimensional surface where
Figure 4: Definition of invariance. This figure shows a contour plot of g˜ (x) on the surface of the sphere S in a neighborhood of x+ . Each general point x∗ on S lies on an N − 2–dimensional level surface (as indicated by the closed lines) where the output of the function g˜ does not change. The only interesting direction in x∗ is the one of maximal change, as indicated by the gradient ∇ g˜ (x∗ ). On the space orthogonal to it, the rate of change is zero. In x+ the function has a maximum, and its output decreases in all directions. There is thus no strict invariance. Considering the second derivative, however, we can identify the directions of minimal change. The arrows in x+ indicate the direction of the invariances (see equation 5.9) with a length proportional to the corresponding second derivative.
the value of g˜ stays constant. This is a somewhat surprising result: for an optimal stimulus, there does not exist any invariance (except in some degenerate cases); for a general suboptimal stimulus, there exist many invariances. This shows that although it might be useful to observe, for example, that a given function f that maps images to real values is invariant to stimulus rotation, one should keep in mind that in a generic point, there is a large number of other transformations to which the function is equally invariant but would lack an easy interpretation. The strict concept of invariance is thus not useful for our analysis, since in the extrema we have no invariances at all, while in a general point, they are the typical case and the only interesting direction is the one of maximal change, as indicated by the gradient. In the extremum x+, however, since the output changes in all directions, we can relax the definition of invariance and look for the transformation to which the function changes as little as possible, as indicated by the direction with the smallest absolute value of the second derivative (see Figure 4). (In a noncritical point, this weak definition of invariance still does not help. If the quadratic form that represents the second derivative has positive as well as negative eigenvalues, there is still an N − 3–dimensional surface where the second derivative is zero.)
Figure 5: Invariances. (a) To compute the second derivative of the quadratic form on the surface of the sphere, one can study the function along special paths on the sphere, known as geodetics. Geodetics of a sphere are great circles. (b) This plot illustrates how the invariances are visualized. Starting from the optimal stimulus (top), we move on the sphere in the direction of an invariance until the response of the function drops below 80% of the maximal output or α reaches 90 degrees. In the figure, two invariances of unit 4 are visualized. The one on the left represents a phase shift invariance and preserves more than 80% of the maximal output until 90 degrees (the output at 90 degrees is 99.6% of the maximum). The one on the right represents an invariance to orientation change with an output that drops below 80% at 55 degrees.
To study the invariances of the function g in a neighborhood of its optimal stimulus respecting the fixed energy constraint, we have defined the function g˜ as the function g restricted to S. This is particularly relevant here since we want to analyze the derivatives of the function, that is, its change under small movements. Any straight movement in space is going to leave the surface of the sphere. We must therefore be able to define movements on the sphere itself. This can be done by considering a path ϕ(t) on the surface of S such that ϕ(0) = x+ and then studying the change of g along ϕ. By doing this, however, we add the rate of change of the path (i.e., its acceleration) to that of the function. Of all possible paths, we must take the ones that have as little acceleration as possible—those that have just the acceleration that is needed to stay on the surface. Such a path is called a geodetic. The geodetics of a sphere are great circles, and our paths are thus defined as ϕ(t) = cos (t/r ) · x+ + sin (t/r ) · r w
(5.1)
for each direction w in the tangential space of S in x+ (i.e., for each w orthogonal to x+ ), as shown in Figure 5a. The 1/r factor in the cosine
and sine arguments normalizes the function such that (d/dt)ϕ(0) = w with ‖w‖ = 1. For the first derivative of g˜ along ϕ, we obtain by straightforward calculations (with (g˜ ◦ ϕ)(t) := g˜(ϕ(t)))

$$\frac{d}{dt}(\tilde{g}\circ\varphi)(t) = \frac{d}{dt}\left[\frac{1}{2}\,\varphi(t)^T\mathbf{H}\,\varphi(t) + \mathbf{f}^T\varphi(t) + c\right] = \ldots \qquad (5.2)$$

$$= -\frac{1}{r}\sin(t/r)\cos(t/r)\,\mathbf{x}^{+T}\mathbf{H}\mathbf{x}^{+} + \cos(2t/r)\,\mathbf{x}^{+T}\mathbf{H}\mathbf{w} + \sin(t/r)\cos(t/r)\,r\,\mathbf{w}^T\mathbf{H}\mathbf{w} - \frac{1}{r}\sin(t/r)\,\mathbf{f}^T\mathbf{x}^{+} + \cos(t/r)\,\mathbf{f}^T\mathbf{w}\,, \qquad (5.3)$$

and for the second derivative,

$$\frac{d^2}{dt^2}(\tilde{g}\circ\varphi)(t) = -\frac{1}{r^2}\cos(2t/r)\,\mathbf{x}^{+T}\mathbf{H}\mathbf{x}^{+} - \frac{2}{r}\sin(2t/r)\,\mathbf{x}^{+T}\mathbf{H}\mathbf{w} + \cos(2t/r)\,\mathbf{w}^T\mathbf{H}\mathbf{w} - \frac{1}{r^2}\cos(t/r)\,\mathbf{f}^T\mathbf{x}^{+} - \frac{1}{r}\sin(t/r)\,\mathbf{f}^T\mathbf{w}\,. \qquad (5.4)$$

At t = 0 we have

$$\frac{d^2}{dt^2}(\tilde{g}\circ\varphi)(0) = \mathbf{w}^T\mathbf{H}\mathbf{w} - \frac{1}{r^2}\left(\mathbf{x}^{+T}\mathbf{H}\mathbf{x}^{+} + \mathbf{f}^T\mathbf{x}^{+}\right)\,, \qquad (5.5)$$
that is, the second derivative of g˜ in x+ in the direction of w is composed of two terms: wT Hw corresponds to the second derivative of g in the direction of w, while the constant term −1/r² · (x+T Hx+ + fT x+) depends on the curvature of the sphere 1/r² and on the gradient of g in x+ orthogonal to the surface of the sphere,

$$\nabla g(\mathbf{x}^{+})\cdot\mathbf{x}^{+} = (\mathbf{H}\mathbf{x}^{+} + \mathbf{f})^T\mathbf{x}^{+} \qquad (5.6)$$

$$= \mathbf{x}^{+T}\mathbf{H}\mathbf{x}^{+} + \mathbf{f}^T\mathbf{x}^{+}\,. \qquad (5.7)$$
To find the direction in which g˜ decreases as little as possible, we only need to minimize the absolute value of the second derivative (see equation 5.5). This is equivalent to maximizing the first term wT Hw in equation 5.5 since the second derivative in x+ is always negative (because x+ is a maximum of g˜ ) and the second term is constant. w is orthogonal to x+ , and thus the maximization must be performed in the space tangential
to the sphere in x+. This can be done by computing a basis b₂, …, b_N of the tangential space (e.g., using the Gram-Schmidt orthogonalization on x+, e₁, …, e_{N−1}, where eᵢ is the canonical basis of ℝᴺ) and replacing the matrix H by

$$ \tilde H = B^T H B \,, \qquad (5.8) $$

where B = (b₂, ⋯, b_N). The direction of the smallest second derivative corresponds to the eigenvector ṽ₁ of H̃ with the largest positive eigenvalue. The eigenvector must then be projected back from the tangential space into the original space by a multiplication with B:

$$ \mathbf{w}_1 = B \tilde{\mathbf{v}}_1 \,. \qquad (5.9) $$
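The construction in equations 5.5, 5.8, and 5.9 is straightforward to implement numerically. The following is a minimal numpy sketch (my own illustrative code, not the authors' Matlab implementation; the function name is chosen for illustration) that returns the invariance directions together with their second derivatives:

```python
import numpy as np

def invariance_directions(H, f, x_plus, r):
    """Directions w_i of the invariances of an optimal stimulus x+ on
    the sphere of radius r, with the second derivative of g~ along each
    (equations 5.5, 5.8, and 5.9). Directions are returned as columns,
    ordered by decreasing eigenvalue of the restricted quadratic term."""
    N = H.shape[0]
    # Orthonormal basis b_2, ..., b_N of the tangential space at x+,
    # obtained by orthogonalizing (x+, e_1, ..., e_{N-1}) via QR
    # (equivalent to Gram-Schmidt); the first column of Q is +-x+/r.
    M = np.column_stack([x_plus, np.eye(N)[:, :N - 1]])
    Q, _ = np.linalg.qr(M)
    B = Q[:, 1:]                        # tangential basis, shape (N, N-1)
    H_tilde = B.T @ H @ B               # equation 5.8
    mu, V = np.linalg.eigh(H_tilde)     # eigenvalues in ascending order
    order = np.argsort(mu)[::-1]        # largest eigenvalue first
    W = B @ V[:, order]                 # equation 5.9: back-projection
    # Second derivative along each direction (equation 5.5); the second
    # term is the same constant for every direction.
    grad_term = (x_plus @ H @ x_plus + f @ x_plus) / r**2
    return W, mu[order] - grad_term
```

At a maximum of g˜ all these second derivatives are negative, and the first column of W is the most invariant direction (the one with the smallest absolute second derivative).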
The remaining eigenvectors, corresponding to eigenvalues of decreasing value, are also interesting, as they point in orthogonal directions where the function changes with a gradually increasing rate of change. To visualize the invariances, we move x+ (or x−) along a path on the sphere in the direction of a vector wᵢ according to

$$ \mathbf{x}(\alpha) = \cos(\alpha)\,\mathbf{x}^{+} + \sin(\alpha)\,r\mathbf{w}_i \qquad (5.10) $$

for α ∈ [−90°, 90°], as illustrated in Figure 5b. At each point, we measure the response of the function to the new input vector and stop when it drops below 80% of the maximal response. In this way, we generate for each invariance a movie like those shown in Figure 6 for some of the optimal stimuli (the corresponding animations are available at the authors' home pages). Each frame of such a movie contains a nearly optimal stimulus. Using this analysis, we can systematically scan a neighborhood of the optimal stimuli, starting from the transformations to which the function is most insensitive up to those that lead to a great change in response. Note that our definition of invariance applies only locally to a small neighborhood of x+. The path followed in equation 5.10 goes beyond such a neighborhood and is appropriate only for visualization. The pseudocode of an algorithm that computes and visualizes the invariances of the optimal stimuli can be found in Berkes and Wiskott (2005a). A Matlab version can be downloaded from the authors' home pages.

6 Significant Invariances

The procedure described above finds for each optimal stimulus a set of N − 1 invariances, ordered by the degree of invariance (i.e., by increasing magnitude of the second derivative). We would like to know which of these are statistically significant. An invariance can be defined as significant
[Figure 6 image panels: (a) Unit 4, Inv. 1 – phase shift; (b) Unit 6, Inv. 3 – position change; (c) Unit 13, Inv. 4 – size change; (d) Unit 14, Inv. 5 – frequency change; (e) Unit 9, Inv. 3 – orientation change; (f) Unit 6, Inv. 5 – curvature change. Relative outputs between 80% and 100% of the maximum, at displacements α between −90° and 90°.]
Figure 6: Selected invariances for some of the optimal excitatory stimuli shown in Figure 3. The central patch of each plot represents the optimal stimulus of a unit, while the ones on the sides are produced by moving it in one (left patch) or the other (right patch) direction of the eigenvector corresponding to the invariance. In this image, we stopped before the output dropped below 80% of the maximum to make the interpretation of the invariances easier. The relative output of the function in percent and the angle of displacement α (see equation 5.10) are given above the patches. The animations corresponding to these invariances are available at the authors’ home pages.
if the function changes exceptionally little (less than chance level) in that direction, which can be measured by the value of the second derivative: the smaller its absolute value, the slower the function will change. To test for their significance, we compare the second derivatives of the invariances of the quadratic form we are considering with those of random inhomogeneous quadratic forms that are equally adapted to the statistics of the input data. We therefore constrain the random quadratic forms to produce an output that has the same variance and mean as the output of the analyzed ones when applied to the input stimuli. Without loss of generality, we assume here zero mean and unit variance. These constraints are compatible with the ones that are usually imposed on the functions learned by many theoretical models. Because of this normalization, the distribution of the random quadratic forms depends on the distribution of the input data. To understand how to efficiently build random quadratic forms under these constraints, it is useful to think in terms of a dual representation of the problem. A quadratic form over the input space is equivalent to a linear function over the space of the input expanded to all
monomials of degree one and two using the function Φ((x₁, …, xₙ)ᵀ) := (x₁x₁, x₁x₂, x₁x₃, …, xₙxₙ, x₁, …, xₙ)ᵀ, that is,

$$ \frac{1}{2}\mathbf{x}^T \underbrace{\begin{pmatrix} h_{11} & h_{12} & \cdots & h_{1n} \\ h_{12} & h_{22} & & \vdots \\ \vdots & & \ddots & \\ h_{1n} & \cdots & & h_{nn} \end{pmatrix}}_{H} \mathbf{x} + \underbrace{\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_n \end{pmatrix}}_{\mathbf{f}}{}^{\!T} \mathbf{x} + c \;=\; \underbrace{\begin{pmatrix} \tfrac{1}{2} h_{11} \\ h_{12} \\ h_{13} \\ \vdots \\ \tfrac{1}{2} h_{nn} \\ f_1 \\ \vdots \\ f_n \end{pmatrix}}_{\mathbf{q}}{}^{\!T} \underbrace{\begin{pmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ \vdots \\ x_n x_n \\ x_1 \\ \vdots \\ x_n \end{pmatrix}}_{\Phi(\mathbf{x})} + c \,. \qquad (6.1) $$
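The dual representation in equation 6.1 is easy to exercise numerically. Below is a small numpy sketch (my own illustration; the names `expand` and `unpack` are not from the paper) that builds Φ(x) and recovers H and f from a coefficient vector q:

```python
import numpy as np

def expand(x):
    """Phi(x): all monomials x_i * x_j with i <= j, followed by x itself."""
    iu = np.triu_indices(len(x))
    return np.concatenate([np.outer(x, x)[iu], x])

def unpack(q, n, c=0.0):
    """Invert equation 6.1: rebuild H and f from the coefficient vector
    q over the expanded space. The diagonal coefficients in q are
    h_ii / 2, so adding H to its transpose restores the full symmetric
    matrix while the off-diagonal entries are simply mirrored."""
    m = n * (n + 1) // 2
    iu = np.triu_indices(n)
    H = np.zeros((n, n))
    H[iu] = q[:m]
    H = H + H.T     # doubles the diagonal, mirrors the off-diagonal terms
    return H, q[m:].copy(), c
```

With these conventions, `q @ expand(x) + c` equals ½xᵀHx + fᵀx + c for every x, which is exactly the identity stated by equation 6.1.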
We can whiten the expanded input data Φ(x) by subtracting its mean ⟨Φ(x)⟩_t and transforming it with a whitening matrix S. In this new coordinate system, each vector of norm 1, applied to the input data using the scalar product, fulfills the unit-variance and zero-mean constraints by construction. We can thus choose a random vector q′ of length 1 in the whitened, expanded space and derive the corresponding quadratic form in the original input space:

$$ \mathbf{q}'^{T} \big( S(\Phi(\mathbf{x}) - \langle \Phi(\mathbf{x}) \rangle_t) \big) = \underbrace{(S^T \mathbf{q}')}_{=: \mathbf{q}}{}^{T} \big( \Phi(\mathbf{x}) - \langle \Phi(\mathbf{x}) \rangle_t \big) \qquad (6.2) $$
$$ = \mathbf{q}^T \big( \Phi(\mathbf{x}) - \langle \Phi(\mathbf{x}) \rangle_t \big) \qquad (6.3) $$
$$ \overset{(6.1)}{=} \frac{1}{2}\mathbf{x}^T H \mathbf{x} + \mathbf{f}^T \mathbf{x} \underbrace{- \mathbf{q}^T \langle \Phi(\mathbf{x}) \rangle_t}_{=: c} \qquad (6.4) $$
$$ = \frac{1}{2}\mathbf{x}^T H \mathbf{x} + \mathbf{f}^T \mathbf{x} + c \,, \qquad (6.5) $$
with appropriately defined H and f according to equation 6.1. We can next compute the optimal stimuli and the second derivative of the invariances of the obtained random quadratic form. To make sure that we get independent measurements, we keep only one second derivative chosen at random for each random function. This operation, repeated over many quadratic forms, allows us to determine a distribution of the second derivatives of the invariances and a corresponding confidence interval. Figure 7a shows the distribution of 50,000 independent second derivatives of the invariances of random quadratic forms and the distribution of the second derivatives of all invariances of the first 50 units learned in the model system. The dashed line indicates the 95% confidence interval
[Figure 7 plots: (a) occurrences (%) versus second derivative, for the model system and for random quadratic forms; (b) number of relevant invariances versus unit number.]
Figure 7: Significant invariances. (a) Distribution of 50,000 independently drawn second derivatives of the invariances of random quadratic forms and distribution of the second derivatives of all invariances of the first 50 units learned in the model system. The dashed line indicates the 95% confidence interval as derived from the random quadratic forms. The distribution in the model system is more skewed toward small second derivatives and has a clear peak near zero. Twenty-eight percent of all invariances were classified as significant. (b) Number of significant invariances for each of the first 50 units learned in the model system (the functions were sorted by decreasing slowness; see section 2). The number of significant invariances tends to decrease with decreasing slowness.
derived from the former distribution. The latter is more skewed toward small second derivatives and has a clear peak near zero. Twenty-eight percent of all invariances were classified as significant. Figure 7b shows the number of significant invariances for each individual quadratic form in the model system. Each function has 49 invariances, since the rank of the quadratic term is 50 (see section 2). The plot shows that the number of significant invariances decreases with increasing ordinal number (the functions are ordered by slowness, the first ones being the slowest). Forty-six units out of 50 have three or more significant invariances. The first invariance, which corresponds to a phase shift invariance, was always classified as significant, which confirms that the units behave like complex cells. Note that since the eigenvalues of a quadratic form are not independent of each other, with the method presented here it is possible to make statements only about the significance of individual invariances, and not about the joint probability distribution of two or more invariances.

7 Relative Contribution of the Quadratic, Linear, and Constant Term

The inhomogeneous quadratic form has a quadratic, a linear, and a constant term. It is sometimes of interest to know what their relative contribution to the output is. The answer to this question depends on the considered input. For example, the quadratic term dominates for large input vectors,
[Figure 8 plots: (a) absolute activity of the constant, linear, and quadratic term versus unit number, with a magnified subplot for the first 10 units; (b) histogram of occurrences (%) of the mean log ratio between the linear and quadratic activities.]
Figure 8: Relative contribution of the quadratic, linear, and constant term. (a) The absolute value of the output of the quadratic, linear, and constant term in x+ for each of the first 50 units in the model system. In all but the first 2 units, the quadratic term has a larger output. The subplot shows a magnified version of the contribution of the terms for the first 10 units. (b) Histogram of the mean of the logarithm of the ratio between the activity of the linear and the quadratic term in the model system when applied to 90,000 test input vectors. A negative value means that the quadratic term dominates, and a positive value means the linear term dominates. In all but 4 units (units 1, 7, 8, and 24), the quadratic term is greater on average.
while the linear or even the constant term dominates for input vectors with a small norm. A first possibility is to look at the contribution of the individual terms at a particular point. A privileged point is, for example, the optimal excitatory stimulus, especially if the quadratic form can be interpreted as a feature detector (cf. section 4). Figure 8a shows for each function in the model system the absolute value of the output of all terms with x+ as an input. In all functions except the first two, the activity of the quadratic term is greater than that of the linear and of the constant term. The first function basically computes the mean pixel intensity, which explains the dominance of the linear term. The second function is dominated by a constant term from which a quadratic expression very similar to the squared mean pixel intensity is subtracted. As an alternative we can consider the ratio between linear and quadratic term, averaged over all input stimuli:
$$ \left\langle \log \frac{\left| \mathbf{f}^T \mathbf{x} \right|}{\left| \frac{1}{2}\mathbf{x}^T H \mathbf{x} \right|} \right\rangle_t = \left\langle \log \left| \mathbf{f}^T \mathbf{x} \right| - \log \left| \frac{1}{2}\mathbf{x}^T H \mathbf{x} \right| \right\rangle_t . \qquad (7.1) $$
The logarithm ensures that a given ratio (e.g., linear/quadratic = 3) has the same weight as the inverse ratio (e.g., linear/quadratic = 1/3) in the mean.
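As a sketch, equation 7.1 can be computed directly from a data matrix (illustrative numpy code; the function name is my own, and rows of X are input vectors):

```python
import numpy as np

def mean_log_ratio(H, f, X):
    """Mean log ratio between the absolute activities of the linear and
    the quadratic term over the inputs in X (equation 7.1). Negative
    values mean the quadratic term dominates on average."""
    lin = np.abs(X @ f)
    quad = np.abs(0.5 * np.einsum('ni,ij,nj->n', X, H, X))
    return np.mean(np.log(lin) - np.log(quad))
```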
A negative result means that the quadratic term dominates, while for a positive value, the linear term dominates. Figure 8b shows the histogram of this measure for the functions in the model system. In all but 4 units (units 1, 7, 8, and 24), the quadratic term is on average greater than the linear one.

8 Quadratic Forms Without the Linear Term

In the case of a quadratic form without the linear term,

$$ g(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T H \mathbf{x} + c \,, \qquad (8.1) $$
the mathematics of sections 4 and 5 becomes much simpler. The quadratic form is now centered at x = 0, and the direction of maximal increase corresponds to the eigenvector v₁ with the largest positive eigenvalue. The optimal excitatory stimulus x+ with norm r is thus

$$ \mathbf{x}^{+} = r \mathbf{v}_1 \,. \qquad (8.2) $$
Similarly, the eigenvector v_N corresponding to the largest negative eigenvalue points in the direction of x−. The second derivative, equation 5.5, in x+ becomes in this case

$$ \frac{d^2}{dt^2}(\tilde g \circ \varphi)(0) = \mathbf{w}^T H \mathbf{w} - \frac{1}{r^2} \mathbf{x}^{+T} H \mathbf{x}^{+} \qquad (8.3) $$
$$ \overset{(8.2)}{=} \mathbf{w}^T H \mathbf{w} - \mathbf{v}_1^T H \mathbf{v}_1 \qquad (8.4) $$
$$ = \mathbf{w}^T H \mathbf{w} - \mu_1 \,. \qquad (8.5) $$

The vector w is by construction orthogonal to x+ and therefore lies in the space spanned by the remaining eigenvectors v₂, …, v_N. Since µ₁ is the maximum value that wᵀHw can assume for vectors of length 1, it is clear that equation 8.5 is always negative (as it should be, since x+ is a maximum) and that its absolute value is successively minimized by the eigenvectors v₂, …, v_N in this order. The value of the second derivative on the sphere in the direction of vᵢ is given by

$$ \frac{d^2}{dt^2}(\tilde g \circ \varphi)(0) = \mathbf{v}_i^T H \mathbf{v}_i - \mu_1 \qquad (8.6) $$
$$ = \mu_i - \mu_1 \,. \qquad (8.7) $$
In the same way, the invariances of x− are given by v N−1 , . . . , v1 with second derivative values (µi − µ N ).
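In the purely quadratic case, the whole analysis therefore reduces to a single eigendecomposition, as the following sketch shows (illustrative numpy code with my own naming, not the authors' implementation):

```python
import numpy as np

def optimal_stimuli_no_linear(H, r):
    """Optimal stimuli of g(x) = 0.5 x^T H x + c on the sphere of radius
    r, and the second derivatives of their invariances (section 8)."""
    mu, V = np.linalg.eigh(H)          # ascending eigenvalues
    mu, V = mu[::-1], V[:, ::-1]       # reorder so that mu_1 >= ... >= mu_N
    x_plus = r * V[:, 0]               # equation 8.2
    x_minus = r * V[:, -1]             # eigenvector of the most negative mu
    sd_plus = mu[1:] - mu[0]           # equation 8.7, all <= 0
    sd_minus = mu[-2::-1] - mu[-1]     # invariances of x-, all >= 0
    return x_plus, x_minus, sd_plus, sd_minus
```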
Since, as shown in Figure 8a, in the model system the linear term is mostly small in comparison with the quadratic one, the first and last eigenvectors of our units are expected to be very similar to their optimal stimuli. This can be verified by comparing Figures 2 and 3. Moreover, successive eigenvectors are almost equal to the directions of the most relevant invariances (compare, for example, unit 4 in Figure 2 and Figure 5b). This does not have to be the case in general. For example, the data in Lewis et al. (2002) show that cochlear neurons in the gerbil ear have a linear as well as a quadratic component. In such a situation, the algorithms must be applied in their general formulation.

9 Decomposition of a Quadratic Form in a Neural Network

As also noticed by Hashimoto (2003), for each quadratic form there exists an equivalent two-layer neural network, which can be derived by rewriting the quadratic form using its eigenvector decomposition:

$$ g(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T H \mathbf{x} + \mathbf{f}^T \mathbf{x} + c \qquad (9.1) $$
$$ = \frac{1}{2}\mathbf{x}^T V D V^T \mathbf{x} + \mathbf{f}^T \mathbf{x} + c \qquad (9.2) $$
$$ = \frac{1}{2}(V^T \mathbf{x})^T D (V^T \mathbf{x}) + \mathbf{f}^T \mathbf{x} + c \qquad (9.3) $$
$$ = \sum_{i=1}^{N} \frac{\mu_i}{2} (\mathbf{v}_i^T \mathbf{x})^2 + \mathbf{f}^T \mathbf{x} + c \,. \qquad (9.4) $$

The network has a first layer formed by a set of N linear subunits s_k(x) = v_kᵀx followed by a quadratic nonlinearity weighted by the coefficients µ_k/2. The output neuron sums the contribution of all subunits plus the output of a direct linear connection from the input layer (see Figure 9a). Since the eigenvalues can be negative, some of the subunits give an inhibitory contribution to the output. It is interesting to note that in an algorithm that learns quadratic forms, the number of inhibitory subunits in the equivalent neural network is not fixed but is a learned feature. As an alternative, one can scale the weights vᵢ by √(|µᵢ|/2) and specify which subunits are excitatory and which are inhibitory according to the sign of µᵢ, since

$$ g(\mathbf{x}) \overset{(9.4)}{=} \sum_{i=1}^{N} \frac{\mu_i}{2} (\mathbf{v}_i^T \mathbf{x})^2 + \mathbf{f}^T \mathbf{x} + c \qquad (9.5) $$
$$ = \sum_{\substack{i=1 \\ \mu_i > 0}}^{N} \left( \left( \sqrt{\tfrac{|\mu_i|}{2}}\, \mathbf{v}_i \right)^{\!T} \mathbf{x} \right)^{\!2} - \sum_{\substack{i=1 \\ \mu_i < 0}}^{N} \left( \left( \sqrt{\tfrac{|\mu_i|}{2}}\, \mathbf{v}_i \right)^{\!T} \mathbf{x} \right)^{\!2} + \mathbf{f}^T \mathbf{x} + c \,. \qquad (9.6) $$
Figure 9: Neural networks related to inhomogeneous quadratic forms. In all plots we assume that the norm of the subunits is 1 (i.e., vi = 1). The ellipse in the input layer represents a multidimensional input. (a) Neural network equivalent to an inhomogeneous quadratic form. The first layer consists of N linear subunits, followed by a quadratic nonlinearity weighted by the coefficients µi /2. The output neuron sums the contribution of each subunit plus the output of a direct linear connection from the input layer. (b) Simpler neural network used in some theoretical studies. The output of the linear subunits is squared but not weighted and can give only an excitatory (positive) contribution to the output. There is no direct linear connection between input and output layer. (c) The energy model of complex cells consists of two linear subunits whose weights are Gabor filters having the same shape except for a 90 degree phase difference. The output is given by the square sum of the response of the two subunits.
This equation also shows that the subunits are unique only up to an orthogonal transformation (i.e., a rotation or reflection) of the excitatory subunits and another one of the inhibitory subunits, which can be seen as follows. Let A₊ and A₋ be the matrices having as rows the vectors √(|µᵢ|/2) vᵢ for positive and negative µᵢ, respectively. Equation 9.6 can then be
[Figure 10 image: subunit weight patches for unit 4 after two different random rotations, with second-layer weighting coefficients (between about −1.22 and 1.54) shown above the patches.]

Figure 10: Random rotations of the positive and negative subunits. Two examples of the weights of the subunits of unit 4 after a random rotation as in equation 9.8. The numbers above the patches are the weighting coefficients on the second layer when the weight vectors of the first layer are normalized to norm 1. The subunits before rotation are equal to the eigenvectors of unit 4, and their weighting coefficients are equal to half the eigenvalues (see Figure 2, top).
rewritten as

$$ g(\mathbf{x}) = \| A_+ \mathbf{x} \|^2 - \| A_- \mathbf{x} \|^2 + \mathbf{f}^T \mathbf{x} + c \,. \qquad (9.7) $$

Since the length of a vector does not change under rotation or reflection, the output of the function remains unchanged if we introduce two orthogonal transformations R₊ and R₋:

$$ g(\mathbf{x}) = \| R_+ A_+ \mathbf{x} \|^2 - \| R_- A_- \mathbf{x} \|^2 + \mathbf{f}^T \mathbf{x} + c \,. \qquad (9.8) $$
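The decomposition into excitatory and inhibitory subunits and the rotation invariance of equations 9.6 to 9.8 can be checked numerically; the following is an illustrative numpy sketch (the function name is my own):

```python
import numpy as np

def quadratic_form_as_network(H, f, c):
    """Equivalent two-layer network of an inhomogeneous quadratic form
    (equation 9.6): the rows of A_pos and A_neg are the excitatory and
    inhibitory linear subunits sqrt(|mu_i|/2) * v_i, followed by a
    squaring nonlinearity, a direct linear connection f, and offset c."""
    mu, V = np.linalg.eigh(H)
    A_pos = np.sqrt(mu[mu > 0] / 2)[:, None] * V[:, mu > 0].T
    A_neg = np.sqrt(-mu[mu < 0] / 2)[:, None] * V[:, mu < 0].T
    def g(x):
        # equation 9.7: ||A+ x||^2 - ||A- x||^2 + f^T x + c
        return (np.sum((A_pos @ x) ** 2) - np.sum((A_neg @ x) ** 2)
                + f @ x + c)
    return A_pos, A_neg, g
```

Applying arbitrary orthogonal matrices R₊ and R₋ to A₊ and A₋ leaves the output unchanged (equation 9.8), which is why the subunits are only unique up to such rotations or reflections.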
Figure 10 shows the weights of the subunits of the neural network equivalent to unit 4 as defined by the eigenvectors of H (see equation 9.4) after a random rotation of the excitatory and inhibitory subunits. The subunits are not as structured as in the case of the eigenvectors (cf. Figure 2), although the orientation and frequency can still be identified. The neural model suggests alternative ways to learn quadratic forms, for example, by adapting the weights by backpropagation. The high number of parameters involved, however, could make it difficult for an incremental optimization method to avoid local extrema. On the other hand, each network of this form can be transformed into a quadratic form and analyzed with the techniques described in this article, which might be useful, for example, to compute the optimal stimuli and the invariances. The equivalent neural network shows that quadratic forms are compatible with the hierarchical model of the visual cortex first proposed by Hubel and Wiesel (1962), in which complex cells pool over simple cells having similar orientation and frequency in order to achieve phase-shift invariance. This was later formalized in the energy model of complex cells (Adelson & Bergen, 1985), which can be implemented by the neural network introduced above. The subunits are interpreted as simple cells and the output
unit as a complex cell. In its usual description, the energy model consists of only two excitatory subunits. If, for example, the subunits are two Gabor wavelets with identical envelope function, frequency, and orientation but with a 90 degree phase difference (see Figure 9c), the network will reproduce the basic properties of complex cells: edge detection and phase-shift invariance. Additional excitatory or inhibitory subunits might introduce additional complex cell invariances, broaden or sharpen the orientation and frequency tuning, and provide end or side inhibition. However, as mentioned in the previous section, the neural network is not unique, so that the subunits can assume different forms, many of which might not be similar to simple cells (see Figure 10). For example, as discussed in section 8, if the linear term is missing and the subunits are defined using the eigenvectors of H as in equation 9.4, the linear filters of the subunits can be interpreted as the optimal stimuli and the invariances thereof. As shown in Figure 2, the invariances themselves need not be structured like a simple cell, since they only represent transformations of the optimal stimuli.
10 Relation to Other Studies

As mentioned in section 1, quadratic forms occur in experimental studies as a second-order approximation of the receptive field of neurons. The linear and quadratic terms correspond in this case to the first two terms in a Wiener expansion. They can be estimated from a stimulus-response electrophysiological recording using the spike-triggered average (STA) and the spike-triggered covariance matrix (STC) (van Steveninck & Bialek, 1988; Lewis et al., 2002; Schwartz et al., 2002; Touryan et al., 2002; Rust et al., 2004; Simoncelli et al., 2004). Most of these studies perform an analysis of the first principal components of the STC, which is motivated in terms of identifying the stimuli that contribute most to the variance of the output of the neuron (Lewis et al., 2002; Schwartz et al., 2002; Rust et al., 2004; Simoncelli et al., 2004) or in terms of a gaussian approximation of the spike-triggered ensemble (van Steveninck & Bialek, 1988; Touryan et al., 2002). The extracted principal components span the subspace of stimuli that governs the response of a cell (Rust et al., 2004). If the linear term is negligible, our analysis is mostly consistent with this interpretation: ordering the eigenvectors by decreasing eigenvalues, the first one corresponds to the optimal stimulus and the following ones to the most relevant invariances (see section 8). Every stimulus that is generated by a linear combination of the optimal stimulus and the most relevant invariances is going to produce a strong output in the quadratic form. However, using the concept of invariances, we can refine the analysis and identify a hypercone in this subspace where the output is more than 80% of the maximal one, with a large extension in the most invariant directions and a small one in the least invariant ones (see section 5).
Figure 11: Interpretation of the invariances. This figure illustrates that although the vector corresponding to an invariance (center) might be difficult to interpret or even look unstructured, when applied to the optimal excitatory stimulus (left) it can code for a meaningful invariance (right). The invariance shown here is the curvature invariance of Figure 6f.
The stimuli lying in this hypercone are all nearly optimal stimuli, and their visualization can give good insight in the overall behavior of the neuron. In our approach, the quadratic forms are interpreted as second-order approximations of the input-output functions computed by the neurons, and the resulting characterization is similar to the one given by classical physiological experiments (e.g., De Valois, Albrecht, & Thorell, 1982; De Valois, Yund, & Hepler, 1982; Schiller, Finlay, & Volman, 1976a, 1976b). Because of this interpretation, the linear term cannot be neglected or eliminated as in the experimental studies cited above. Only if the linear term is proved to be reasonably close to zero can one consider the quadratic term alone and apply the methods described in section 8. Two recent theoretical studies (Hashimoto, 2003; Bartsch & Obermayer, 2003) learned quadratic forms without the linear term from natural images. The eigenvectors of H were visualized and interpreted as “relevant features.” Some of them were discarded because they were “unstructured.” According to our analysis, this interpretation holds for only the two eigenvectors with the largest positive and negative eigenvalues. We think that the remaining eigenvectors should not be visualized directly but applied as transformations to the optimal stimuli. Therefore, it is possible for them to look unstructured but still represent a structured invariance, as illustrated in Figure 11. For example, Hashimoto (2003, Fig. 5a) shows the eigenvectors of a quadratic form learned by a variant of SFA performed by gradient descent. The two largest eigenvectors look like two Gabor wavelets and have the same orientation and frequency. According to the interpretation above and to Hashimoto, this shows that the network responds best to an oriented stimulus and is invariant to a phase shift. 
The third eigenvector looks like a Gabor wavelet with the same frequency as the first two but a slightly different orientation. Hashimoto suggests that this eigenvector makes the interpretation of that particular quadratic form difficult. According to our
analysis, that vector might code for a rotation invariance, which would be compatible with complex cell behavior. Neural networks closely related to those presented in section 9 were used in some theoretical studies (Hyvärinen & Hoyer, 2000, 2001; Körding et al., 2004). There, a small set of linear subunits (2 to 25) was connected to an output unit that took the sum of the squared activities (see Figure 9b). These networks differ from inhomogeneous quadratic forms in that they lack a direct linear contribution to the output and have far fewer subunits (a quadratic form of dimension N has N subunits). The most important difference, however, is related to the normalization of the weights. In the theoretical studies cited above, the weights are normalized to a fixed norm, and the activity of the subunits is not weighted. In particular, since there are no negative coefficients, no inhibition is possible, whereas this turned out to be essential for a number of complex cell properties in our simulations. However, the results of section 9 show that it is possible to use the algorithms presented here to analyze and interpret the weights of this kind of neural network.
11 Conclusion

We have presented a collection of tools to analyze nonlinear functions, in particular quadratic forms. These tools allow us to visualize the coefficients of the individual terms of an inhomogeneous quadratic form, to compute its optimal stimuli (i.e., the stimuli that maximize or minimize the quadratic form under a fixed energy constraint) and their invariances (i.e., the transformations of the optimal stimuli to which the quadratic form is most insensitive), and to determine which of these invariances are statistically significant. We have also proposed a way to measure the relative contribution of the linear and quadratic term. Moreover, we have discussed a neural network architecture equivalent to a given quadratic form. The methods presented here can be used in a variety of fields, in particular in physiological experiments to study the nonlinear receptive fields of neurons and in theoretical studies.
Acknowledgments

This work has been supported by a grant from the Volkswagen Foundation. We are grateful to Thomas Neukircher for some interesting discussions about some of the mathematical ideas of the article and to Henning Sprekeler and two anonymous reviewers for useful comments on the manuscript. The home page of the first author contains additional material to this article, including the animations corresponding to Figure 6 and Matlab source code for the algorithms of sections 4 and 5.
References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, 2(2), 284–299.
Baddeley, R., Abbott, L., Booth, M., Sengpiel, F., Freeman, T., Wakeman, E., & Rolls, E. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc. R. Soc. Lond. B Biol. Sci., 264(1389), 1775–1783.
Bartsch, H., & Obermayer, K. (2003). Second-order statistics of natural images. Neurocomputing, 52–54, 467–472.
Berkes, P., & Wiskott, L. (2002). Applying slow feature analysis to image sequences yields a rich repertoire of complex cell properties. In J. R. Dorronsoro (Ed.), Artificial Neural Networks—ICANN 2002 Proceedings (pp. 81–86). Berlin: Springer.
Berkes, P., & Wiskott, L. (2005a). On the analysis and interpretation of inhomogeneous quadratic forms as receptive fields. Cognitive Sciences EPrint Archive (CogPrints) 4081, http://cogprints.org/4081/.
Berkes, P., & Wiskott, L. (2005b). Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision, 5(6), 579–602, http://journalofvision.org/5/6/9, doi: 10.1167/5.6.9.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press.
De Valois, R., Albrecht, D., & Thorell, L. (1982). Spatial frequency selectivity of cells in macaque visual cortex. Vision Res., 22, 545–559.
De Valois, R., Yund, E., & Hepler, N. (1982). The orientation and direction selectivity of cells in macaque visual cortex. Vision Res., 22(5), 531–544.
Fortin, C. (2000). A survey of the trust region subproblem within a semidefinite framework. Unpublished master's thesis, University of Waterloo.
Hashimoto, W. (2003). Quadratic forms in natural images. Network: Computation in Neural Systems, 14(4), 765–788.
Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154.
Hyvärinen, A., & Hoyer, P. (2000). Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7), 1705–1720.
Hyvärinen, A., & Hoyer, P. (2001). A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41(18), 2413–2423.
Jones, J. P., & Palmer, L. A. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6), 1233–1257.
Jutten, C., & Karhunen, J. (2003). Advances in nonlinear blind source separation. Proc. of the 4th Int. Symp. on Independent Component Analysis and Blind Signal Separation (ICA2003) (pp. 245–256).
Körding, K., Kayser, C., Einhäuser, W., & König, P. (2004). How are complex cell properties adapted to the statistics of natural scenes? Journal of Neurophysiology, 91(1), 206–212.
Lewis, E., Henry, K., & Yamada, W. (2002). Tuning and timing in the gerbil ear: Wiener-kernel analysis. Hearing Research, 174, 206–221.
MacKay, D. (1985). The significance of “feature sensitivity.” In D. Rose & V. Dobson (Eds.), Models of the visual cortex (pp. 47–53). New York: Wiley.
Marmarelis, P., & Marmarelis, V. (1978). Analysis of physiological systems: The white-noise approach. New York: Plenum Press.
Pollen, D., & Ronner, S. (1981). Phase relationship between adjacent simple cells in the visual cortex. Science, 212, 1409–1411.
Rust, N. C., Schwartz, O., Movshon, J. A., & Simoncelli, E. (2004). Spike-triggered characterization of excitatory and suppressive stimulus dimensions in monkey V1. Neurocomputing, 58–60, 793–799.
Schiller, P., Finlay, B., & Volman, S. (1976a). Quantitative studies of single-cell properties in monkey striate cortex. I. Spatiotemporal organization of receptive fields. J. Neurophysiol., 39(6), 1288–1319.
Schiller, P., Finlay, B., & Volman, S. (1976b). Quantitative studies of single-cell properties in monkey striate cortex. II. Orientation specificity and ocular dominance. J. Neurophysiol., 39(6), 1320–1333.
Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299–1319.
Schwartz, O., Chichilnisky, E., & Simoncelli, E. (2002). Characterizing neural gain control using spike-triggered covariance. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 269–276). Cambridge, MA: MIT Press.
Simoncelli, E. P., Pillow, J., Paninski, L., & Schwartz, O. (2004). Characterization of neural responses with stochastic stimuli. In M. Gazzaniga (Ed.), The cognitive neurosciences (3rd ed.). Cambridge, MA: MIT Press.
Stork, D., & Levinson, J. (1982). Receptive fields and the optimal stimulus. Science, 216, 204–205.
Touryan, J., Lau, B., & Dan, Y. (2002). Isolation of relevant visual features from random stimuli for cortical complex cells. Journal of Neuroscience, 22(24), 10811–10818.
van Steveninck, R., & Bialek, W. (1988). Coding and information transfer in short spike sequences. Proc. R. Soc. Lond. B Biol. Sci., 234, 379–414.
Walter, W. (1995). Analysis 2. Berlin: Springer-Verlag.
Wiskott, L., & Sejnowski, T. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770.
Received February 16, 2005; accepted September 14, 2005.
LETTER
Communicated by Peter Thomas
Comment on "Characterization of Subthreshold Voltage Fluctuations in Neuronal Membranes," by M. Rudolph and A. Destexhe

Benjamin Lindner [email protected]
André Longtin [email protected]
Department of Physics, University of Ottawa, Ottawa, K1N 6N5, Canada
In two recent articles, Rudolph and Destexhe (2003, 2005) studied a leaky integrator model (an RC circuit) driven by correlated ("colored") gaussian conductance noise and gaussian current noise. In the first article, they derived an expression for the stationary probability density of the membrane voltage; in the second, they modified this expression to cover a larger parameter regime. Here we show by standard analysis of solvable limit cases (white noise limit of additive and multiplicative noise sources; only slow multiplicative noise; only additive noise) and by numerical simulations that their first result does not hold for the general colored-noise case, and we uncover the errors made in the derivation of a Fokker-Planck equation for the probability density. Furthermore, we demonstrate analytically (including an exact integral expression for the time-dependent mean value of the voltage) and by comparison to simulation results that the extended expression for the probability density works much better but still does not exactly solve the full colored-noise problem. We also show that at stronger synaptic input, the stationary mean value of the linear voltage model may diverge, and we give an exact condition relating the system parameters for which this takes place.
Neural Computation 18, 1896–1931 (2006). © 2006 Massachusetts Institute of Technology

1 Introduction

The inherent randomness of neural spiking has stimulated the exploration of stochastic neuron models for several decades (Holden, 1976; Tuckwell, 1988, 1989). The subthreshold membrane voltage of cortical neurons shows strong fluctuations in vivo, caused mainly by synaptic stimuli coming from as many as tens of thousands of presynaptic neurons. In the theoretical literature, these stimuli have been approximated in different ways. The most biophysically realistic description is to model an extended neuron with different sorts of synapses distributed over the dendrite and possibly the soma, with each synapse following its own kinetics when
excited by random incoming pulses that change the local conductance. In a point-neuron model for the membrane potential in the spike-generating zone, this amounts to an effective conductance noise for each sort of synapse. If the contribution of a single spike is small and the effective input rates are high, the incoming spike trains can be well approximated by gaussian white noise; this is known as the diffusion approximation of spike train input (see, e.g., Holden, 1976). Furthermore, these conductance fluctuations driving the membrane voltage dynamics will be correlated in time (the noise will be "colored") due to the synaptic filtering (Brunel & Sergi, 1998). Assuming the validity of the diffusion approximation, two further common approximations found in the theoretical literature are to (1) replace the conductance noise by a current noise and (2) neglect the correlation of the noise and use a white noise. Exploring the validity of these approximations has been the aim of a number of recent theory articles (Rudolph & Destexhe, 2003, 2005; Richardson, 2004; Richardson & Gerstner, 2005). Rudolph and Destexhe (hereafter referred to as R&D) recently studied the subthreshold voltage dynamics driven by colored gaussian conductance and current noises, with the goal of deriving analytical expressions for the probability density of the voltage fluctuations in the absence of a spike-generating mechanism. Such expressions are desirable because they permit one to use experimentally measured voltage traces in vivo to determine (or at least to obtain constraints on) synaptic parameters. R&D gave a one-dimensional Fokker-Planck equation for the evolution of the probability density of the voltage variable and solved this equation in the stationary state. Comparing this solution to numerical simulations of the full model, they found good agreement.
In a recent article, however, they discovered a disagreement between their formula and simulations in extreme parameter regimes (Rudolph & Destexhe, 2005). R&D proposed an extended expression that is functionally equivalent to their original formula; it results from effective correlation times that were introduced into their original formula in a heuristic manner. According to R&D, this new expression fits simulation results well for various parameter sets. In this comment we show that neither of the proposed formulas is an exact solution of the mathematical problem that R&D posed. We demonstrate this by the analysis of limit cases, by means of an exact analytical result for the mean value of the voltage, and by numerical simulation results. The failure of the first formula is pronounced; for example, it fails dramatically if the synaptic correlation times are varied by only one order of magnitude relative to R&D's standard parameters. The extended expression, although not an exact solution of the problem, seems to provide a reasonable approximation for the probability density of the membrane voltage if the conductance noise is not too strong. We also show that if the conductance noise is strong, the model itself and not only the solutions proposed by R&D becomes problematic: the moments of the voltage, such
as its stationary mean value, diverge. For the mean value we will give an exact solution and identify by means of this solution the parameters for which a divergence is observed. This letter is organized as follows. In the next section, we introduce the model that R&D studied. Then we study the limit cases of only white noise (section 3), of only additive colored noise (section 4), and of slow ("static") multiplicative noise (section 5). In section 6 we derive expressions for the time-dependent and the stationary mean value of the voltage at arbitrary values of the correlation times. Section 7 is devoted to a comparison of numerical simulations to the various theoretical formulas. We summarize and discuss our findings in section 8. In the appendix, we uncover the errors in the derivation of the Fokker-Planck equation that R&D made. We anticipate that our results will help future investigations of the neural colored noise problem.

2 Basic Model

The current balance equation for a patch of passive membrane is
C_m dV(t)/dt = −g_L(V(t) − E_L) − (1/a) I_syn(t),   (2.1)
where C_m is the specific membrane capacitance, a is the membrane area, and g_L and E_L are the leak conductance and reversal potential, respectively. The total synaptic current is given by

I_syn(t) = g_e(t)(V(t) − E_e) + g_i(t)(V(t) − E_i) − I(t),   (2.2)
with g_e,i being the noisy conductances for excitatory and inhibitory synapses and E_e,i the respective reversal potentials; I(t) is an additional noisy current. With respect to the conductances, R&D assume the diffusion approximation to be valid. This means approximating the superposition of incoming presynaptic spikes at the excitatory and inhibitory synapses by gaussian white noise. Including a first-order linear synaptic filter, the conductances are consequently described by Ornstein-Uhlenbeck processes (OUPs); similarly, R&D also assume an OUP for the current I(t):

dg_e(t)/dt = −(1/τ_e)(g_e(t) − g_e0) + √(2σ_e²/τ_e) ξ_e(t)   (2.3)

dg_i(t)/dt = −(1/τ_i)(g_i(t) − g_i0) + √(2σ_i²/τ_i) ξ_i(t)   (2.4)

dI(t)/dt = −(1/τ_I)(I(t) − I_0) + √(2σ_I²/τ_I) ξ_I(t).   (2.5)
Here the functions ξ_e,i,I(t) are independent gaussian white noise sources with ⟨ξ_k(t)ξ_l(t′)⟩ = δ_k,l δ(t − t′) (here k, l ∈ {e, i, I}, and the brackets ⟨···⟩ stand for a stationary ensemble average). The processes g_e, g_i, and I are gaussian distributed around the mean values g_e0, g_i0, and I_0 with variances σ_e², σ_i², and σ_I², respectively:

ρ_e(g_e) = (1/√(2πσ_e²)) exp[−(g_e − g_e0)²/(2σ_e²)]   (2.6)

ρ_i(g_i) = (1/√(2πσ_i²)) exp[−(g_i − g_i0)²/(2σ_i²)]   (2.7)

ρ_I(I) = (1/√(2πσ_I²)) exp[−(I − I_0)²/(2σ_I²)].   (2.8)
As discussed by R&D, these solutions permit unphysical negative conductances, which become especially important if g_e0/σ_e and g_i0/σ_i are small. Furthermore, the three processes are exponentially correlated, with the correlation times given by τ_e, τ_i, and τ_I, respectively:

⟨(g_e(t) − g_e0)(g_e(t + τ) − g_e0)⟩ = σ_e² exp[−|τ|/τ_e]   (2.9)

⟨(g_i(t) − g_i0)(g_i(t + τ) − g_i0)⟩ = σ_i² exp[−|τ|/τ_i]   (2.10)

⟨(I(t) − I_0)(I(t + τ) − I_0)⟩ = σ_I² exp[−|τ|/τ_I].   (2.11)
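As a numerical illustration of equations 2.3 to 2.11 (our own sketch, not part of the original letter), a single OUP can be integrated with a simple Euler-Maruyama scheme and its stationary mean, variance, and correlation time checked against the stated properties; all parameter values below are illustrative only, not R&D's standard values:

```python
import math
import random

def simulate_oup(g0, sigma, tau, dt, n_steps, seed=1):
    """Euler-Maruyama scheme for dg/dt = -(g - g0)/tau + sqrt(2 sigma^2/tau) xi(t)."""
    rng = random.Random(seed)
    g, path = g0, []
    amp = math.sqrt(2.0 * sigma ** 2 / tau * dt)   # std of one noise increment
    for _ in range(n_steps):
        g += -(g - g0) / tau * dt + amp * rng.gauss(0.0, 1.0)
        path.append(g)
    return path

# illustrative parameters in arbitrary units
g0, sigma, tau, dt = 0.05, 0.01, 2.0, 0.01
path = simulate_oup(g0, sigma, tau, dt, 500_000)
mean = sum(path) / len(path)
var = sum((x - mean) ** 2 for x in path) / len(path)
lag = int(tau / dt)                                # one correlation time
cov = sum((path[i] - mean) * (path[i + lag] - mean)
          for i in range(len(path) - lag)) / (len(path) - lag)
# expect mean ~ g0, var ~ sigma^2, and cov/var ~ exp(-1) by eq. 2.9
print(mean, var, cov / var)
```

The ratio cov/var estimates the autocorrelation at lag τ, which equation 2.9 predicts to be e⁻¹ ≈ 0.368.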
Note that R&D used another parameter to quantify the strength of the noise processes: D_e,i,I = 2σ²_e,i,I/τ_e,i,I. Here we will not follow this unusual scaling¹ but consider variations of the correlation times at either fixed variance σ²_e,i,I of the OUPs or fixed noise intensities σ²_e,i,I τ_e,i,I. Equation 2.1 can be looked upon as a one-dimensional dynamics driven by multiplicative and additive colored noises. Equivalently, it can be, together

¹ In general, two different intensity scalings for an OUP η(t) are used in the literature (see, e.g., Hänggi & Jung, 1995). (1) Fixing the noise intensity Q = ∫₀^∞ dT ⟨η(t)η(t + T)⟩ = σ²τ, allowing for a proper white noise limit by letting τ approach zero. With fixed noise intensity and τ → ∞ (static limit), the effect of the OUP vanishes, since the variance of the process tends to zero. (2) Fixing the noise variance σ², which leads to a finite effect of the noise for τ → ∞ (static limit) but makes the noise effect vanish as τ → 0. R&D use functions α_e,i,I(t), the long-time limit of which is proportional to the noise intensity σ²_e,i,I τ_e,i,I.
with equations 2.3, 2.4, and 2.5, regarded as a four-dimensional nonlinear dynamical system driven by only additive white noise. For such a process it is in general quite difficult to calculate the statistics, such as the stationary probability density P_0(V, g_e, g_i, I) or the stationary marginal density for the driven variable ρ(V) = ∫ dg_e dg_i dI P_0(V, g_e, g_i, I), unless so-called potential conditions are met (see, e.g., Risken, 1984). It can be easily shown that the above problem does not fulfill these potential conditions, and no solution has yet been found. R&D have proposed a solution for the stationary marginal density of the membrane voltage ρ(V) for colored noises of arbitrary correlation times driving their system. Their solution for the stationary probability of the membrane voltage reads

ρ_RD(V) = N exp[ (a_1/(2b_2)) ln(b_2V² + b_1V + b_0) + ((2a_0b_2 − a_1b_1)/(b_2√(4b_2b_0 − b_1²))) arctan((2b_2V + b_1)/√(4b_2b_0 − b_1²)) ],   (2.12)

with N being the normalization constant and with these abbreviations:

a_0 = [2C_m a (g_L E_L a + g_e0 E_e + g_i0 E_i) + 2I_0 C_m a + σ_e²τ_e E_e + σ_i²τ_i E_i]/(2(C_m a)²)

a_1 = −[2C_m a (g_L a + g_e0 + g_i0) + σ_e²τ_e + σ_i²τ_i]/(2(C_m a)²)

b_0 = [σ_e²τ_e E_e² + σ_i²τ_i E_i² + σ_I²τ_I]/(2(C_m a)²)

b_1 = −[2(σ_e²τ_e E_e + σ_i²τ_i E_i)]/(2(C_m a)²)

b_2 = [σ_e²τ_e + σ_i²τ_i]/(2(C_m a)²).   (2.13)
In a subsequent Note on their article, Rudolph and Destexhe (2005) considered the case of only multiplicative colored noise (σ I = 0) and showed that the solution in equation 2.12 does not fit numerical simulations for certain parameter regimes. They claim that this disagreement is due to a filtering problem not properly taken into account in their previous work. They proposed a new solution for the case of only multiplicative noise that is functionally equivalent to equation 2.12 for σ I = 0 but simply replaces
correlation times by effective correlation times,

τ′_e,i = 2τ_e,i τ_0/(τ_e,i + τ_0),   (2.14)

where τ_0 = aC_m/(ag_L + g_e0 + g_i0). Explicitly, this extended expression is given by

ρ_RD,ext(V) = N exp[ A_1 ln( (σ_e²τ′_e/(C_m a)²)(V − E_e)² + (σ_i²τ′_i/(C_m a)²)(V − E_i)² ) + A_2 arctan( (σ_e²τ′_e(V − E_e) + σ_i²τ′_i(V − E_i))/((E_e − E_i)√(σ_e²τ′_e σ_i²τ′_i)) ) ]   (2.15)

with the abbreviations

A_1 = −[2C_m a(g_e0 + g_i0) + 2C_m a²g_L + σ_e²τ′_e + σ_i²τ′_i]/(2(σ_e²τ′_e + σ_i²τ′_i))   (2.16)

A_2 = [g_L a(σ_e²τ′_e(E_L − E_e) + σ_i²τ′_i(E_L − E_i)) + (g_e0 σ_i²τ′_i − g_i0 σ_e²τ′_e)(E_e − E_i)]/[(E_e − E_i)√(σ_e²τ′_e σ_i²τ′_i)(σ_e²τ′_e + σ_i²τ′_i)/(2C_m a)].   (2.17)

The introduction of the effective correlation times was justified by considering the effective-time-constant (ETC) or gaussian approximation from Richardson (2004) (see below), which reduces the system to one with additive noise. The new formula, equation 2.15, fits their simulation results well for various combinations of parameters (Rudolph & Destexhe, 2005). In this comment, we will show that neither of these formulas yields the exact solution of the mathematical problem. As we will show first, the original formula fails significantly outside the limited parameter range investigated in R&D (2003). Apparently the second formula provides a good fit for a number of parameter sets. It also reproduces two of the simple limit cases in which the first formula fails. By means of the third limit case, as well as of an exact solution for the stationary mean value (derived in section 6), we can show that the new formula is not an exact result either. To demonstrate the invalidity of the first expression in the general case, we will show that equation 2.12 fails in three limits that are tractable by standard techniques: (1) the white noise limit of all three colored noise sources, that is, keeping the noise intensities σ²_e,i,I τ_e,i,I fixed and letting all noise correlation times tend to zero, τ_e,i,I → 0; (2) the case of additive colored noise only; and (3) the limit of large τ_e,i in the case of multiplicative colored
noises with fixed variances σ_e² and σ_i². In all cases, we also ask whether mean and variance can be expected to be finite, as R&D tacitly assumed. We will also compare both solutions proposed by R&D as well as our own analytical results for the limit cases to numerical simulation results. While the failure of the first formula, equation 2.12, is pronounced except for a small parameter regime, deviations of the extended expression, equation 2.15, are much smaller, and for the six different parameter sets inspected, the new formula can be regarded at least as a good approximation. Parameters can be found, however, where deviations of this new formula from numerical simulations become more serious. To simplify the notation, we will use the new variable v = V − V̄ with

V̄ = (g_L a E_L + g_e0 E_e + g_i0 E_i + I_0)/(g_L a + g_e0 + g_i0).   (2.18)

Then the equations can be recast into

v̇ = −βv − y_e(v − V_e) − y_i(v − V_i) + y_I   (2.19)

ẏ_e,i,I = −y_e,i,I/τ_e,i,I + √(2σ̃²_e,i,I/τ_e,i,I) ξ_e,i,I(t)   (2.20)

with the abbreviations

β = (g_L a + g_e0 + g_i0)/(aC_m)   (2.21)

V_e,i = E_e,i − V̄   (2.22)

σ̃_e,i,I = σ_e,i,I/(aC_m).   (2.23)
Once we have found an expression for the probability density of v, the density for the original variable V is given by the former density taken at v = V − V̄. Finally, we briefly explain the effective-time-constant (ETC) or gaussian approximation (cf. Richardson, 2004; Richardson & Gerstner, 2005, and references therein), which we will refer to later. Assuming weak noise sources, the voltage will fluctuate around the deterministic equilibrium value v = 0 with an amplitude proportional to the square root of the sum of the noise variances; for example, for only excitatory conductance fluctuations, we would have a proportionality to the standard deviation of y_e, that is, |v| ∝ √⟨y_e²⟩. From this we can see that the multiplicative terms y_e v and y_i v make a contribution proportional to the squares of the standard deviations and can therefore be neglected for weak noise. The resulting dynamics contains only additive noise sources:

v̇ = −βv + y_e V_e + y_i V_i + y_I.   (2.24)
The stationary probability density is a gaussian,

ρ_ETC(v) = exp[−v²/(2⟨v²⟩_ETC)]/√(2π⟨v²⟩_ETC)   (2.25)

with zero mean and a variance given by Richardson (2004):

⟨v²⟩_ETC = V_e² (σ̃_e²τ_e/β)/(1 + βτ_e) + V_i² (σ̃_i²τ_i/β)/(1 + βτ_i) + (σ̃_I²τ_I/β)/(1 + βτ_I).   (2.26)
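Since the dynamics of equation 2.24 is linear, equation 2.26 is exact for that reduced system, which makes it easy to check numerically. The following sketch (our own, with arbitrary illustrative parameters and σ̃_I = 0) integrates equation 2.24 with an Euler scheme and compares the sample variance with equation 2.26:

```python
import math
import random

def etc_variance(beta, Ve, Vi, se, si, te, ti):
    """<v^2>_ETC of eq. 2.26 with the current-noise term set to zero."""
    return (Ve ** 2 * se ** 2 * te / (beta * (1 + beta * te))
            + Vi ** 2 * si ** 2 * ti / (beta * (1 + beta * ti)))

# illustrative parameters in arbitrary units
beta, Ve, Vi, se, si, te, ti = 1.0, 1.0, -0.5, 0.2, 0.3, 2.0, 0.5
dt, n = 0.01, 1_000_000
rng = random.Random(2)
ae = math.sqrt(2 * se ** 2 / te * dt)   # OUP noise increments, eq. 2.20
ai = math.sqrt(2 * si ** 2 / ti * dt)
v = ye = yi = 0.0
acc = acc2 = 0.0
kept = 0
for k in range(n):
    v += (-beta * v + ye * Ve + yi * Vi) * dt   # eq. 2.24 with y_I = 0
    ye += -ye / te * dt + ae * rng.gauss(0, 1)
    yi += -yi / ti * dt + ai * rng.gauss(0, 1)
    if k >= n // 10:                            # discard the transient
        acc += v
        acc2 += v * v
        kept += 1
var_sim = acc2 / kept - (acc / kept) ** 2
print(var_sim, etc_variance(beta, Ve, Vi, se, si, te, ti))
```

The two printed numbers should agree up to sampling error and the O(dt) discretization bias of the Euler scheme.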
The solution takes into account the effect of the mean conductances on the effective membrane time constant 1/β through equation 2.21.

3 The White Noise Limit

If we fix the noise intensities

Q_e,i,I = σ̃²_e,i,I τ_e,i,I,   (3.1)
we may consider the limit of white noise by letting τ_e,i,I → 0. A special case of this has been recently considered by Richardson (2004) with σ_I = 0 (only multiplicative noise present). In the white noise limit, the three OUPs approach mutually independent white noise sources,

y_e → √(2Q_e) ξ_e(t),  y_i → √(2Q_i) ξ_i(t),  y_I → √(2Q_I) ξ_I(t),   (3.2)

and thus the current balance equation, equation 2.19, becomes

v̇ = −βv − √(2Q_e)(v − V_e)ξ_e(t) − √(2Q_i)(v − V_i)ξ_i(t) + √(2Q_I)ξ_I(t),   (3.3)

which is equivalent² to a driving by a single gaussian noise ξ(t),

v̇ = −βv + √(2Q_e(v − V_e)² + 2Q_i(v − V_i)² + 2Q_I) ξ(t),   (3.4)

with ⟨ξ(t)ξ(t′)⟩ = δ(t − t′). Since we approach the white noise limit having in mind colored noises with negligible correlation times, equation 3.4 has to be interpreted in the sense of Stratonovich (Risken, 1984; Gardiner, 1985).
² The sum of three independent gaussian noise sources gives one gaussian noise, the variance of which equals the sum of the variances of the single noise sources.
The drift and diffusion coefficients then read (Risken, 1984)

D⁽¹⁾ = −βv + Q_e(v − V_e) + Q_i(v − V_i) = −βv + (1/2) dD⁽²⁾/dv   (3.5)

D⁽²⁾ = Q_I + Q_e(v − V_e)² + Q_i(v − V_i)²,   (3.6)

and the stationary solution of the probability density is given by (Risken, 1984)

ρ_wn(v) = N exp[ −ln D⁽²⁾(v) + ∫^v dx D⁽¹⁾(x)/D⁽²⁾(x) ],   (3.7)

where the subscript wn refers to white noise. After carrying out the integral, the solution can be written as follows,

ρ_wn(v) = N exp[ −((β + b̃_2)/(2b̃_2)) ln(b̃_2v² + b̃_1v + b̃_0) + (βb̃_1/(b̃_2√(4b̃_0b̃_2 − b̃_1²))) arctan((2b̃_2v + b̃_1)/√(4b̃_0b̃_2 − b̃_1²)) ]   (3.8)

with these abbreviations:

b̃_0 = Q_I + Q_eV_e² + Q_iV_i²   (3.9)

b̃_1 = −2(Q_eV_e + Q_iV_i)   (3.10)

b̃_2 = Q_e + Q_i.   (3.11)
Different versions of the white noise case have been discussed and also analytically studied in the literature (see, e.g., Hanson & Tuckwell, 1983; Lánský & Lánská, 1987; Lánská, Lánský, & Smith, 1994; Richardson, 2004). In particular, equation 3.8 is consistent with the expression for the voltage density in a leaky integrate-and-fire neuron driven by white noise³ given by Richardson (2004). Since equation 2.12 was proposed by R&D as the solution for the probability density at arbitrary correlation times of the colored noise sources, it should also be valid in the white noise limit and agree with equation 3.8. On closer inspection, it becomes apparent that both equations 2.12 and 2.15
³ The density equation 3.8 results from equation 2.18 in Richardson (2004) when firing and reset in the integrate-and-fire neuron become negligible. This can be formally achieved by letting threshold and reset voltage go to positive infinity.
have the structure of the white noise solution, equation 3.8. Comparing the factors of the terms in the exponential, we find that the first solution (in terms of the shifted voltage variable and using the noise intensities, equation 3.1) can be written as

ρ_RD(v, Q_e, Q_i, Q_I) = ρ_wn(v, Q_e/2, Q_i/2, Q_I/2),   (3.12)
where the additional arguments of the functions indicate the parametric dependence of the densities on the noise intensities. According to equation 3.12, if formulated in terms of the noise intensities (and not the noise variances), the first formula proposed by R&D does not depend on the correlation times τ_e,i,I at all. Furthermore, it is evident from equation 3.12 that the expression is incorrect in the white noise limit. If all correlation times τ_e,i,I simultaneously go to zero, the density approaches the white noise solution with only half of the true values of the noise intensities. The density will certainly depend on the noise intensities and will change if one uses only half of their values. We may also rewrite R&D's extended expression, equation 2.15, in terms of the white noise density:

ρ_RD,ext(v, Q_e, Q_i) = ρ_wn(v, Q_e/(1 + βτ_e), Q_i/(1 + βτ_i), Q_I = 0).   (3.13)
This expression agrees with the original solution by R&D only for the specific parameter set, τe = τi = 1/β.
(3.14)
We note that since the extended expression can be expressed by means of the white noise density, it makes sense to describe the extended expression by means of effective noise intensities,

Q′_e,i = Q_e,i/(1 + βτ_e,i),   (3.15)

rather than in terms of the effective correlation times τ′_e,i (cf. equation 2.14) used by R&D. The assertion behind equation 3.13 is the following: the probability density of the membrane voltage is always equivalent to the white noise density; correlations in the synaptic input (i.e., finite values of τ_e,i,I) lead to the rescaled (smaller) noise intensities Q′_e,i given in equation 3.15. If we consider the white noise limit of the right-hand side of equation 3.13, we find that the extended expression, equation 2.15, reproduces this limit:
lim_{τ_e,τ_i→0} ρ_RD,ext(V, Q_e, Q_i) = ρ_wn(V, Q_e, Q_i, Q_I = 0).   (3.16)
So there is no problem with the extended expression in the white noise limit.

3.1 Divergence of Moments in the White Noise Limit and in R&D's Expressions for the Probability Density. We consider the density equation 3.8 in the limits v → ±∞ and conclude whether the moments and, in particular, the mean value of the white noise density are finite; similar arguments will be applied to the solutions proposed by R&D. At large v and to leading order in 1/v, we obtain

ρ_wn(v) ∼ N b̃_2^{−(β+b̃_2)/(2b̃_2)} |v|^{−(β+b̃_2)/b̃_2} exp[ ±πβb̃_1/(2b̃_2√(4b̃_0b̃_2 − b̃_1²)) ]  as v → ±∞.   (3.17)

When calculating the nth moment, we have to multiply by vⁿ and obtain a nondiverging integral only if vⁿρ_wn(v) decays faster than v⁻¹. This is the case only if n − (β + b̃_2)/b̃_2 < −1 or, using equation 3.11,

⟨vⁿ⟩_wn < ∞  iff  β > n(Q_e + Q_i),   (3.18)

where "iff" stands for "if and only if" and the index wn indicates that we consider the white noise case. Note that no symmetry argument applies for odd n, since the asymptotic limits differ for +∞ and −∞ according to equation 3.17. For the mean, this implies that

|⟨v⟩_wn| < ∞  iff  β > Q_e + Q_i;   (3.19)
otherwise, the integral diverges. In general, the power-law tail in the density is a hint that (for white noise at least) we face the problem of rare strong deviations in the voltage that are due to the specific properties of the model (multiplicative gaussian noise). Because of equation 3.12, similar conditions (differing by a prefactor of 1/2 on the respective right-hand sides) also apply for the finiteness of the mean and variance of the original solution, equation 2.12, proposed by R&D. For the mean value of this solution, one obtains the condition

|⟨v⟩_RD| < ∞  iff  β > (Q_e + Q_i)/2,   (3.20)
which should hold true in the general colored noise case but does not agree with the condition in equation 3.19 even in the white noise case.
From the extended expression we obtain

|⟨v⟩_RD,ext| < ∞  iff  β > Q_e/(1 + βτ_e) + Q_i/(1 + βτ_i).
(3.21)
Note that equation 3.21 agrees with equation 3.19 only in the white noise case (i.e., for τ_e, τ_i → 0). Below we will show that equation 3.19 gives the correct condition for a finite mean value in the general case of arbitrary correlation times, too. Since for finite τ_e, τ_i the two conditions, equation 3.19 and equation 3.21, differ, we can already conclude that the equation 2.15 that led to condition 3.21 cannot be the exact solution of the original problem.

4 Additive Colored Noise

Setting the multiplicative colored noise sources to zero, R&D obtain an expression for the marginal density in the case of additive colored noise only (cf. equations 3.7–3.9 in R&D),

ρ_add,RD(V) = N exp[ −a²g_LC_m(V − E_L − I_0/(g_La))²/(σ_I²τ_I) ],
(4.1)
which corresponds, in our notation and in terms of the shifted variable v, to

ρ̃_add,RD(v) = N exp[ −βv²/Q_I ].   (4.2)
Evidently, once more a factor of 2 is missing in the white noise case (where the process v(t) itself becomes an OUP), since for an OUP we should have ρ ∼ exp[−βv²/(2Q_I)]. However, there is also a missing additional dependence on the correlation time. For additive noise only, the original problem given in equation 2.1 reduces to

v̇ = −βv + y_I,   (4.3)

ẏ_I = −y_I/τ_I + (√(2Q_I)/τ_I) ξ_I(t).   (4.4)
This system is mathematically similar to the gaussian or effective-time-constant approximation, equation 2.25, in which likewise no multiplicative noise is present. The density function for the voltage is well known; for clarity, we show here how to calculate it.
The system, equations 4.3 and 4.4, obeys the two-dimensional Fokker-Planck equation

∂_t P(v, y_I, t) = [ ∂_v(βv − y_I) + ∂_{y_I}(y_I/τ_I) + (Q_I/τ_I²)∂²_{y_I} ] P(v, y_I, t).   (4.5)

The stationary problem (∂_t P_0(v, y_I) = 0) is solved by an ansatz P_0(v, y) ∼ exp[Av² + Bvy + Cy²], yielding the solution for the full probability density:

P_0(v, y_I) = N exp[ −(τ_I(1 + βτ_I)/(2Q_I))(y_I² − 2βv y_I + cv²) ],  c = β(1 + βτ_I)/τ_I.   (4.6)
Integrating over y_I yields the correct marginal density,

ρ_add(v) = √(β(1 + βτ_I)/(2πQ_I)) exp[ −(βv²/(2Q_I))(1 + βτ_I) ],   (4.7)
which is in disagreement with equation 4.2 and hence also with equation 4.1. From the correct solution given in equation 4.7, we also see what happens in the limit of infinite τ_I for fixed noise intensity Q_I: the exponent tends to minus infinity except at v = 0; or, put differently, the variance of the distribution tends to zero, and we end up with a δ function at v = 0. This limit makes sense (cf. note 1) but is not reflected at all in the original solution, equation 2.12, given by R&D. We can also rewrite the solution in terms of the white noise solution in the case of vanishing multiplicative noise:

ρ_add(v) = ρ_wn(v, Q_e = 0, Q_i = 0, Q_I/[1 + βτ_I]).   (4.8)
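The disagreement is easy to see numerically. The following sketch (our own, with arbitrary illustrative parameters) simulates equations 4.3 and 4.4 with an Euler scheme and compares the sample variance with the correct value Q_I/(β(1 + βτ_I)) from equation 4.7 and with the value Q_I/(2β) implied by equation 4.2:

```python
import math
import random

beta, QI, tauI = 1.0, 0.5, 2.0        # illustrative values
dt, n = 0.01, 1_000_000
rng = random.Random(3)
amp = math.sqrt(2 * QI) / tauI * math.sqrt(dt)   # noise increment of eq. 4.4
v = y = 0.0
acc = acc2 = 0.0
kept = 0
for k in range(n):
    v += (-beta * v + y) * dt                      # eq. 4.3
    y += -y / tauI * dt + amp * rng.gauss(0, 1)    # eq. 4.4
    if k >= n // 10:                               # discard the transient
        acc += v
        acc2 += v * v
        kept += 1
var_sim = acc2 / kept - (acc / kept) ** 2
var_eq47 = QI / (beta * (1 + beta * tauI))   # correct colored-noise variance
var_eq42 = QI / (2 * beta)                   # variance implied by eq. 4.2
print(var_sim, var_eq47, var_eq42)
```

With these parameters the simulation should reproduce the variance of equation 4.7 and clearly miss that of equation 4.2.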
Thus, what R&D assumed for the multiplicative noise is in fact true for the additive noise: the density in the general colored noise case is given by the white noise density with a rescaled noise intensity Q′_I = Q_I/[1 + βτ_I] (or, equivalently, a rescaled correlation time τ′_I = 2τ_I/[1 + βτ_I] in equation 4.2 with Q_I = σ̃_I²τ_I). We cannot perform the limit of only additive noise in the extended expression, equation 2.15, proposed by R&D, because this solution was meant for the case of only multiplicative noise. If, however, we generalize that expression to the case of additive and multiplicative colored noises, we can consider the limit of only additive noise in this expression. This is done by taking the original solution by R&D, equation 2.12, and replacing not only the correlation times of the multiplicative noises τ_e,i by the effective
ones τ′_e,i but also that of the additive noise τ_I by an effective correlation time,

τ′_I = 2τ_I/(1 + βτ_I).   (4.9)
If we now take the limit Q_e = Q_i = 0, we obtain the correct density,

ρ_RD,ext,add(v) = ρ_wn(v, Q_e = 0, Q_i = 0, Q_I/[1 + βτ_I]),   (4.10)

as becomes evident on comparing the right-hand sides of equation 4.10 and equation 4.8. Finally, we note that the case of additive noise is the only limit that does not pose any condition on the finiteness of the moments.

5 Static Multiplicative Noises Only (Limit of Large τ_e,i)

Here we assume for simplicity σ̃_I = 0 and consider multiplicative noise with fixed variances σ̃²_e,i only. If the noise sources are much slower than the internal timescale of the system, that is, if 1/(βτ_e) and 1/(βτ_i) are practically zero, we can neglect the time derivative in equation 2.19. This means that the voltage adapts instantaneously to the multiplicative ("static") noise sources, which is strictly justified only for βτ_e, βτ_i → ∞. If τ_e, τ_i attain large but finite values (βτ_e, βτ_i ≫ 1), the formula derived below will be an approximation that works better the larger these values are. Because of the slowness of the noise sources compared to the internal timescale, we call the resulting expression the "static-noise" theory for simplicity. This does not imply that the total system (membrane voltage plus noise sources) is not in the stationary state: we assume that any initial condition of the variables has decayed on a timescale t much larger than τ_e,i.⁴ For a simulation of the density, this has the practical implication that we should choose a simulation time much larger than any of the involved correlation times. Setting the time derivative in equation 2.19 to zero, we can determine at which position the voltage variable will be for a given quasi-static pair of (y_e, y_i) values, yielding v =
(y_eV_e + y_iV_i)/(β + y_e + y_i).   (5.1)
⁴ In the strict limit of βτ_e, βτ_i → ∞, this would imply that t goes to infinity faster than the correlation times τ_e,i do.
This sharp position will correspond to a δ peak of the probability density,

δ( v − (y_eV_e + y_iV_i)/(β + y_e + y_i) ) = (|y_i(V_i − V_e) − βV_e|/(v − V_e)²) δ( y_e + (βv + y_i(v − V_i))/(v − V_e) )   (5.2)

(here we have used δ(ax) = δ(x)/|a|). This peak has to be averaged over all possible values of the noise, that is, integrated over the two gaussian distributions in order to obtain the marginal density:

ρ_static(v) = ⟨δ(v − v(t))⟩
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} (dy_e dy_i/(2πσ̃_eσ̃_i)) (|y_i(V_i − V_e) − βV_e|/(v − V_e)²) δ( y_e + (βv + y_i(v − V_i))/(v − V_e) ) × exp[ −y_e²/(2σ̃_e²) − y_i²/(2σ̃_i²) ].   (5.3)

Carrying out these integrals yields

ρ_static(v) = (σ̃_eσ̃_i|V_e − V_i|/(πβ²)) e^{−v²/(2µ(v))} [ e^{−ν(v)/µ(v)}/µ(v) + (√(πν(v))/µ(v)^{3/2}) erf(√(ν(v)/µ(v))) ],   (5.4)

where erf(z) is the error function (Abramowitz & Stegun, 1970) and the functions µ(v) and ν(v) are given by

µ(v) = (σ̃_e²(v − V_e)² + σ̃_i²(v − V_i)²)/β²   (5.5)

ν(v) = (σ̃_e²V_e(v − V_e) + σ̃_i²V_i(v − V_i))²/(2σ̃_e²σ̃_i²(V_e − V_i)²).   (5.6)
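Equations 5.4 to 5.6 can be verified by direct Monte Carlo sampling of the quasi-static map, equation 5.1: draw gaussian pairs (y_e, y_i), record v, and compare a histogram estimate with the closed-form density. The sketch below is our own, with arbitrary illustrative parameters:

```python
import math
import random

def rho_static(v, beta, Ve, Vi, se, si):
    """Static-noise density, eqs. 5.4-5.6 (se, si stand for sigma-tilde_e,i)."""
    mu = (se ** 2 * (v - Ve) ** 2 + si ** 2 * (v - Vi) ** 2) / beta ** 2
    nu = (se ** 2 * Ve * (v - Ve) + si ** 2 * Vi * (v - Vi)) ** 2 / (
        2 * se ** 2 * si ** 2 * (Ve - Vi) ** 2)
    pref = se * si * abs(Ve - Vi) / (math.pi * beta ** 2)
    return pref * math.exp(-v * v / (2 * mu)) * (
        math.exp(-nu / mu) / mu
        + math.sqrt(math.pi * nu) * math.erf(math.sqrt(nu / mu)) / mu ** 1.5)

beta, Ve, Vi, se, si = 1.0, 1.0, -0.7, 0.3, 0.2   # illustrative
rng = random.Random(4)
n, half = 1_000_000, 0.05
hits = 0
for _ in range(n):
    ye, yi = rng.gauss(0.0, se), rng.gauss(0.0, si)
    v = (ye * Ve + yi * Vi) / (beta + ye + yi)     # quasi-static voltage, eq. 5.1
    if abs(v) < half:
        hits += 1
density_mc = hits / (n * 2 * half)                 # histogram estimate at v = 0
print(density_mc, rho_static(0.0, beta, Ve, Vi, se, si))
```

The histogram value at v = 0 should match the formula up to sampling error and the finite bin width.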
If one of the expressions by R&D, equation 2.12 or 2.15, were the correct solution, it should converge for σ_I = 0 and τ_e,i → ∞ to the formula for the static case, equation 5.4. In general, this is not the case, since the functional structures of the white-noise solution and of the static-noise approximation are quite different. There is, however, one limit case in which the extended expression yields the same (although trivial) function. If we fix the noise intensities Q_e,i and let the correlation times go to infinity, the variances will go to zero, and the static-noise density, equation 5.4, approaches a δ peak at v = 0. Although the extended expression, equation 2.15, has a different functional dependence on system parameters and voltage, the same thing happens in the extended expression for τ_e,i → ∞, because the effective noise intensities Q′_e,i = Q_e,i/(1 + βτ_e,i) approach zero in this limit. The white noise solution at vanishing noise intensities is, however, also
a δ peak at v = 0. Hence, in the limit of large correlation time at fixed noise intensities, both the static-noise theory, equation 5.4, and the extended expression yield the probability density of a noise-free system and therefore agree. For fixed variance, where a nontrivial large-τ limit of the probability density exists, the static-noise theory and the extended expression by R&D differ, as we will also verify numerically. A final remark concerns the asymptotic behavior of the static-noise solution, equation 5.4. The asymptotic expansions for v → ±∞ show that the density goes like |v|⁻² in both limits. Hence, in this case, we cannot obtain a finite variance of the membrane voltage at all (the integral ∫dv v²ρ_static(v) will diverge). The mean may be finite, since the coefficients of the v⁻² term are symmetric in v. The estimation in the following section, however, will demonstrate that this is valid only strictly in the limit τ_e,i → ∞ but not at any large but finite value of τ_e,i. So the mean may diverge for large but finite τ_e,i.

6 Mean Value of the Voltage for Arbitrary Values of the Correlation Times

By inspection of the limit cases, we have already seen that the moments do not have to be finite for an apparently sensible choice of parameters. For the white noise case, it was shown that the mean of the voltage is finite only if β > Q_e + Q_i. Next, we show by direct analytical solution of the stochastic differential equation, equation 2.19, involving the colored noise sources, equation 2.20, that this condition (i.e., equation 3.19) holds in general, and thus a divergence of the mean is obtained for β < Q_e + Q_i. For only one realization of the process, equation 2.19, the driving functions y_e(t), y_i(t), and y_I(t) can be regarded as just time-dependent parameters in a linear differential equation.
The solution is then straightforward (see also Richardson, 2004, for the special case of only multiplicative noise):

v(t) = v0 exp[−βt − ∫₀ᵗ du (ye(u) + yi(u))] + ∫₀ᵗ ds (Ve ye(s) + Vi yi(s) + yI(s)) e^{−β(t−s)} exp[−∫ₛᵗ du (ye(u) + yi(u))].   (6.1)

The integrated noise processes we,i(s, t) = ∫ₛᵗ du ye,i(u) in the exponents are independent gaussian processes with variance

⟨w²e,i(s, t)⟩ = 2Qe,i (t − s − τe,i + τe,i e^{−(t−s)/τe,i}).   (6.2)
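As a sanity check on equation 6.2, the variance of the integrated noise can be estimated by Monte Carlo. The sketch below assumes the standard OU correlation function ⟨y(t)y(t′)⟩ = (Q/τ) e^{−|t−t′|/τ}, consistent with the variance σ² = Q/τ quoted in the Figure 1 caption; the parameter values are illustrative, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

Q, tau = 0.5, 1.0      # noise intensity and correlation time (illustrative values)
T, dt = 2.0, 1e-3      # integration window t - s and time step
M = 20000              # number of independent realizations

# Stationary OU process y with <y(t)y(t')> = (Q/tau) exp(-|t-t'|/tau)
y = rng.normal(0.0, np.sqrt(Q / tau), size=M)   # stationary initial condition
w = np.zeros(M)                                  # w = integral of y over [s, t]

a = np.exp(-dt / tau)                            # exact one-step OU decay
b = np.sqrt(Q / tau * (1.0 - a * a))             # matching noise amplitude
for _ in range(int(T / dt)):
    y_new = a * y + b * rng.normal(size=M)       # exact OU update
    w += 0.5 * dt * (y + y_new)                  # trapezoidal accumulation
    y = y_new

var_mc = w.var()
var_theory = 2.0 * Q * (T - tau + tau * np.exp(-T / tau))  # equation 6.2
print(var_mc, var_theory)
```

With M = 20000 realizations, the sample variance should reproduce equation 6.2 to within a few percent.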
1912
B. Lindner and A. Longtin
For a gaussian variable w, we know that ⟨e^w⟩ = e^{⟨w²⟩/2} (Gardiner, 1985). Using this relation for the integrated noise processes together with equation 6.2, and expressing the average ⟨ye,i(s) exp[−∫ₛᵗ du ye,i(u)]⟩ by a derivative of the exponential with respect to s, we find an integral expression for the mean value

⟨v(t)⟩ = v0 e^{(Qe+Qi−β)t} exp[−τe fe(t) − τi fi(t)] − ∫₀ᵗ ds {Ve fe(s) + Vi fi(s)} e^{(Qe+Qi−β)s − τe fe(s) − τi fi(s)},   (6.3)

where fe,i(s) = Qe,i (1 − exp[−s/τe,i]). The stationary mean value corresponding to the stationary density is obtained from this expression in the asymptotic limit t → ∞. We want to draw attention to the fact that this mean value is finite under exactly the same condition as in the white noise case:

⟨|v|⟩ < ∞ iff β > Qe + Qi.   (6.4)
First, this is so because otherwise the exponent (Qe + Qi − β)t in the first line is positive and the exponential diverges for t → ∞. Furthermore, if β < Qe + Qi, the exponential in the integrand diverges at large s. In terms of the original parameters of R&D, the condition for a finite stationary mean value of the voltage reads

⟨|v|⟩ < ∞ iff gL a + ge0 + gi0 > (σe² τe + σi² τi)/(a Cm).   (6.5)
Note that this condition depends also on a and Cm, and not only on the synaptic parameters. R&D use as standard parameter values (Rudolph & Destexhe, 2003, p. 2589) ge0 = 0.0121 µS, gi0 = 0.0573 µS, σe = 0.012 µS, σi = 0.0264 µS, τe = 2.728 ms, τi = 10.49 ms, a = 34,636 µm², and Cm = 1 µF/cm². They state that the parameters have been varied in numerical simulations from 0% to 260% relative to these standard values, covering more than "the physiological range observed in vivo" (Rudolph & Destexhe, 2003). Inserting the standard values into the relation, equation 6.5, yields

0.0851 µS > 0.0221 µS.   (6.6)
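The arithmetic behind equations 6.5 and 6.6 can be reproduced directly; a minimal sketch in Python, with the standard parameter values quoted above converted to SI units before combining:

```python
# R&D standard parameter values, in SI units
gL = 0.0452e-3            # leak conductance density, S/cm^2
a = 34636e-8              # membrane area, cm^2
Cm = 1e-6                 # specific capacitance, F/cm^2
ge0, gi0 = 0.0121e-6, 0.0573e-6   # mean synaptic conductances, S
se, si = 0.012e-6, 0.0264e-6      # conductance standard deviations, S
te, ti = 2.728e-3, 10.49e-3       # synaptic correlation times, s

lhs = gL * a + ge0 + gi0                     # left-hand side of eq. 6.5
rhs = (se**2 * te + si**2 * ti) / (a * Cm)   # right-hand side of eq. 6.5
print(lhs * 1e6, rhs * 1e6)   # in microsiemens; lhs > rhs, so the mean is finite

# Doubling sigma_i (200% of the standard value) violates the condition
rhs2 = (se**2 * te + (2 * si) ** 2 * ti) / (a * Cm)
print(rhs2 * 1e6)             # now exceeds the left-hand side: diverging mean
```

The first print reproduces the values in equation 6.6 (up to rounding); the second shows that doubling σi tips the balance toward a diverging mean.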
So in this case, the mean will be finite. However, using twice the standard value for the inhibitory noise standard deviation, σi = 0.0528 µS (corresponding to 200% of the standard value), with all other parameters as before, leads to a diverging mean because we obtain 0.0852 µS on the right-hand side of equation 6.5, while the left-hand side is unchanged. This means,
even in the parameter regime that R&D studied, the model predicts an infinite mean value of the voltage. A stronger violation of equation 6.5 will be observed by either increasing the standard deviations σe,i and/or correlation times τe,i or decreasing the mean conductances ge,i. We also note that for higher moments, and especially for the variance, the condition for finiteness will be even more restrictive, as can be concluded from the limit cases investigated before. The stationary mean value at arbitrary correlation times can be inferred from equation 6.3 by taking the limit t → ∞. Assuming the relation, equation 6.4, holds true, we can neglect the first term involving the initial condition v0 and obtain

⟨v⟩ = −∫₀^∞ ds {Ve fe(s) + Vi fi(s)} exp[(Qe + Qi − β)s − τe fe(s) − τi fi(s)].   (6.7)

We can also use equation 6.7 to recover the white noise result for the mean as, for instance, found in Richardson (2004) by taking τe,i → 0. In this case, we can integrate equation 6.7 and obtain

⟨v⟩wn = −{Ve Qe + Vi Qi} ∫₀^∞ ds exp[(Qe + Qi − β)s] = −(Ve Qe + Vi Qi)/(β − Qe − Qi).   (6.8)
Because of the similarity of the R&D solution to the white noise solution (cf. equation 3.12), we can also infer that the mean value of the former density is

⟨v⟩RD = −(Ve Qe + Vi Qi)/(2β − Qe − Qi).   (6.9)
Note the different prefactor of β in the denominator, which is due to the factor 1/2 in the noise intensities of the solution, equation 2.12, by R&D. Finally, we can also easily determine the mean value for the extended expression by R&D (Rudolph & Destexhe, 2005), since this solution is also equivalent to the white noise solution with rescaled noise intensities. Using the noise intensities Q̃e,i from equation 3.15, we obtain

⟨v⟩RD,ext = −(Ve Q̃e(τe) + Vi Q̃i(τi))/(β − Q̃e(τe) − Q̃i(τi))
          = −(Ve Qe(1 + βτi) + Vi Qi(1 + βτe)) / [β(1 + βτi)(1 + βτe) − Qe(1 + βτi) − Qi(1 + βτe)].   (6.10)
We will verify numerically that this expression is not equal to the exact solution, equation 6.7. One can, however, show that for small to medium values of the correlation times τe,i and weak noise intensities, these differences are not drastic. If we expand both equation 6.3 and equation 6.10 for small noise intensities Qe, Qi (assuming for the former that the products Qe τe, Qi τi are small, too), the resulting expressions agree to first order and also agree with a recently derived weak-noise result for filtered Poissonian shot noise given by Richardson and Gerstner (2005, cf. eq. D.3):

⟨v⟩RD,ext ≈ ⟨v⟩ ≈ −(Ve Qe(1 + βτi) + Vi Qi(1 + βτe)) / [β(1 + βτi)(1 + βτe)] + O(Qe², Qi²).   (6.11)
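Equations 6.7, 6.8, and 6.10 can be compared numerically. The sketch below uses the dimensionless parameters of Figure 7 together with sample reversal coefficients Ve = 1.5, Vi = −0.5 borrowed from Figure 6a, and nearly white noise, where all three results should nearly coincide:

```python
import numpy as np

beta, Qe, Qi = 1.0, 0.2, 0.3   # dimensionless parameters as in Figure 7
Ve, Vi = 1.5, -0.5             # sample values taken from Figure 6a
te, ti = 1e-3, 1e-3            # short correlation times: nearly white noise

fe = lambda s: Qe * (1.0 - np.exp(-s / te))
fi = lambda s: Qi * (1.0 - np.exp(-s / ti))

# Exact stationary mean, equation 6.7, by trapezoidal quadrature
s = np.linspace(0.0, 40.0, 400001)
g = (Ve * fe(s) + Vi * fi(s)) * np.exp((Qe + Qi - beta) * s - te * fe(s) - ti * fi(s))
ds = s[1] - s[0]
v_exact = -0.5 * ds * np.sum(g[1:] + g[:-1])

# White-noise limit, equation 6.8
v_wn = -(Ve * Qe + Vi * Qi) / (beta - Qe - Qi)

# Extended expression, equation 6.10 (white-noise form with rescaled intensities)
qe, qi = Qe / (1 + beta * te), Qi / (1 + beta * ti)
v_ext = -(Ve * qe + Vi * qi) / (beta - qe - qi)
print(v_exact, v_wn, v_ext)
```

For these short correlation times, all three values lie close to the white-noise result −0.3; differences grow once τe,i become comparable to 1/β.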
The higher-order terms differ, and that is why a discrepancy between the two expressions can be seen at nonweak noise. The results for the mean value achieved in this section are useful in two respects. First, we can check whether trajectories indeed diverge for parameters where the relation, equation 6.4, is violated. Second, the exact solution for the stationary mean value and the simple expressions resulting from the different solutions proposed by R&D can be compared in order to reveal their range of validity. This is done in the next section.

7 Comparison to Simulations

Here we compare the different formulas for the probability density of the membrane voltage and its mean value to numerical simulations for different values of the correlation times, restricting ourselves to the case of multiplicative noise only. For the simulations, we followed a single realization v(t) using a simple Euler procedure. The probability density at a certain voltage is then proportional to the time spent by the realization in a small region around this voltage. Decreasing the time step Δt or increasing the simulation time did not change our results. We will first discuss the original expression, equation 2.12, proposed by R&D and the analytical solutions for the limit cases of white and static multiplicative noise, equations 3.8 and 5.4, respectively; later we examine the validity of the new extended expression. Finally, we also check the stationary and time-dependent mean value of the membrane voltage and discuss how well these simple statistical characteristics are reproduced by the different theories, including our exact result, equation 6.3. To check the validity of the different expressions, we first use a dimensionless parameter set with β = 1, but also the original parameter set used by R&D (2003). In both cases, we consider variations of the correlation times over three orders of magnitude (standard values are varied between 10% and 1000%).
Note that the latter choice goes beyond the range originally considered by R&D (2003), where parameter variations were limited to the range 0% to 260%.
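The Euler procedure described above can be sketched as follows. The drift term is read off from the solution, equation 6.1 (dv/dt = −(β + ye + yi)v + Ve ye + Vi yi for purely multiplicative noise), and the reversal coefficients Ve, Vi are sample values borrowed from Figure 6a, not the parameter set actually used for Figure 1; the density is estimated from the time spent in small voltage bins:

```python
import numpy as np

rng = np.random.default_rng(1)

beta, Qe, Qi = 1.0, 0.2, 0.3   # dimensionless parameters; QI = 0 (no additive noise)
Ve, Vi = 1.5, -0.5             # sample values from Figure 6a
te, ti = 0.01, 0.01            # short correlation times: nearly white noise
dt, T = 1e-3, 500.0
n = int(T / dt)

def ou_coeffs(tau, Q):
    """One-step update coefficients for an OU process with variance Q/tau."""
    a = np.exp(-dt / tau)
    return a, np.sqrt(Q / tau * (1.0 - a * a))

ae, be = ou_coeffs(te, Qe)
ai, bi = ou_coeffs(ti, Qi)
xe, xi = rng.normal(size=n), rng.normal(size=n)   # pregenerated increments

ye = rng.normal(0.0, np.sqrt(Qe / te))   # stationary initial conditions
yi = rng.normal(0.0, np.sqrt(Qi / ti))
v, vs = 0.0, np.empty(n)
for k in range(n):
    v += dt * (-(beta + ye + yi) * v + Ve * ye + Vi * yi)  # Euler step
    ye = ae * ye + be * xe[k]                              # OU conductance noises
    yi = ai * yi + bi * xi[k]
    vs[k] = v

# density from time spent in voltage bins, after discarding a transient
hist, edges = np.histogram(vs[n // 10:], bins=80, density=True)
print(vs[n // 10:].mean())   # should lie near the white-noise mean for these parameters
```

With these short correlation times, the estimated density and its mean should be close to the white-noise results; larger τe,i shift the density away from that limit, as Figure 1 illustrates.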
7.1 Probability Density of the Membrane Voltage—Original Expression by R&D. In a first set of simulations, we ignore the physical dimensions of all the parameters and pick rather arbitrary but simple values (β = 1, Qi = 0.75, Qe = 0.075). Keeping the ratio of the correlation times (τi = 5τe) and the values of the noise intensities Qe, Qi fixed, we vary the correlation times. In Figure 1, simulation results are shown for τe = 10⁻², 10⁻¹, 1, and 10. We recall that with fixed noise intensities, the probability density according to the result by R&D given in equation 2.12 should not depend on τe at all. It is obvious, however, in Figure 1a that the simulation data depend strongly on the correlation times, in contrast to what is predicted by equation 2.12. The difference between the original theory by R&D and the simulations is smallest for an intermediate correlation time (τe = 1). In contrast to the general discrepancy between simulations and equation 2.12, the white noise formula, equation 3.8, and the formula from the static noise theory (cf. the solid and dotted lines in Figure 1b) agree well with the simulations at τe = 0.01 (circles) and τe = 10 (diamonds), respectively. The small differences between simulations and theory decrease as we go to smaller or larger correlation times, respectively, as expected. R&D also present results of numerical simulations (Rudolph & Destexhe, 2003), which seem to agree fairly well with their formula. In order to give a sense of the reliability of these data, we have repeated the simulations for one parameter set in Rudolph and Destexhe (2003, Fig. 2b). These data are shown in Figure 2 and compared to R&D's original solution, equation 2.12. For this specific parameter set, the agreement is indeed relatively good, although there are differences between the formula and the simulation results in the location of the maximum as well as at the flanks of the density.
These differences do not vanish when extending the simulation time or decreasing the time step; hence, the curve according to equation 2.12 does not seem to be an exact solution but at best a good approximation. The disagreement becomes significant if the correlation times are changed by one order of magnitude (see Figure 3); in this case, we keep the variances of the noises constant, as R&D have done, rather than the noise intensities as in Figure 1. The asymptotic formulas for either vanishing (see Figure 3a) or infinite (see Figure 3b) correlation times derived in this article do a much better job in these limits. Note that the large correlation time used in Figure 3b is outside the range considered by R&D (2003). Regardless of the fact that the correlation times we have used in Figures 3a and 3b are possibly outside the physiological range, an analytical solution should also cover these cases. Regarding the question of whether the correlation time is short (close to the white noise limit), long (close to the static limit), or intermediate (as seems to be the case in the original parameter set of Figure 2b in Rudolph & Destexhe, 2003), it is not the absolute value of τe,i,I that matters
Figure 1: Probability density of the shifted voltage compared to results of numerical simulations. (a) The density according to equation 2.12 (theory by R&D) is compared to simulations at different correlation times as indicated (τi = 5τe). Since the noise intensities are fixed, the simulated densities at different τe should all fall onto the solid line given by equation 2.12, which is not the case. (b) The simulations at small (τe = 0.01) and large (τe = 10) correlation times are compared to our expressions found in the limit cases of white and static noise: equations 3.8 and 5.4, respectively. Note that in the constant-intensity scaling, equation 5.4 depends implicitly on τe,i since the variances change as σ²e,i = Qe,i/τe,i. Parameters: β = 1, Qe = 0.075, Qi = 0.75, QI = 0, Δt = 0.001, and simulation time T = 10⁵.
but the product βτe,i,I. Varying one or more of the parameters gL, ge0, gi0, a, or Cm can push the dynamics into one of the limit cases without the necessity of changing τe,i,I.
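This point can be made concrete by computing β and the products βτe,i for R&D's standard parameter set. The sketch below assumes that β is the total mean conductance divided by the total capacitance, consistent with the form of condition 6.5:

```python
# R&D standard parameter values, in SI units
gL = 0.0452e-3            # leak conductance density, S/cm^2
a = 34636e-8              # membrane area, cm^2
Cm = 1e-6                 # specific capacitance, F/cm^2
ge0, gi0 = 0.0121e-6, 0.0573e-6   # mean synaptic conductances, S
te, ti = 2.728e-3, 10.49e-3       # synaptic correlation times, s

# Inverse effective membrane time constant (assumed form, cf. eq. 6.5), in 1/s
beta = (gL * a + ge0 + gi0) / (a * Cm)
print(beta, beta * te, beta * ti)   # products of order one: intermediate regime

# Modified set used for Figure 6b: ten-fold ge0 and roughly tripled correlation times
beta2 = (gL * a + 0.121e-6 + 0.0574e-6) / (a * Cm)
print(beta2 * 7.5e-3, beta2 * 30e-3)
```

For the standard set, βτe ≈ 0.67 and βτi ≈ 2.6, so neither the white-noise nor the static limit applies; the Figure 6b modifications push these products to roughly 4.2 and 16.8, the values quoted below in section 7.2.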
Figure 2: Probability density of the membrane voltage (simulations with τe = 2.728 ms, τi = 10.45 ms versus R&D's original solution) corresponding to the parameters in Figure 2b of Rudolph and Destexhe (2003): gL = 0.0452 mS/cm², a = 34,636 µm², Cm = 1 µF/cm², EL = −80 mV, Ee = 0 mV, Ei = −75 mV, σe = 0.012 µS, σi = 0.0264 µS, ge0 = 0.0121 µS, gi0 = 0.0573 µS; additive-noise parameters (σI, I0) are zero; we used a time step of Δt = 0.1 ms and a simulation time of 100 s.
7.2 Probability Density of the Membrane Voltage—Extended Expression by R&D. So far we have not considered the extended expression (R&D, 2005) with the effective correlation times. Plotting the simulation data shown in Figures 1a and 3 against this new formula gives very good, although not perfect, agreement (cf. Figures 4a and 5a). Note, for instance, in Figure 4a that the height of the peak for τe = 1 and the location of the maximum for τe = 0.1 are slightly underestimated by the new theory. Since most of the data look similar to gaussians, we may also check whether they are described by the ETC theory (cf. equation 2.25). This is shown in Figures 4b and 5b and reveals that for the two parameter sets studied so far, the noise intensities are sufficiently small that the ETC formula gives an approximation almost as good as the extended expression by R&D. One exception is shown in Figure 4b: at small correlation times where the noise is effectively white (τe = 0.1), the ETC formula fails since the noise variances become large. For τe = 0.01, the disagreement is even worse (not shown). In this range, the extended expression captures the density better, in particular its nongaussian features (e.g., the asymmetry in the density). Since the agreement of the extended expression with numerical simulations was so far very good, one could argue that it represents the exact solution to the problem and that the small differences are merely due to numerical inaccuracy. We will check whether the extended expression is the exact solution in two ways. First, we know how the density behaves if both multiplicative noises are very slow (βτe, βτi ≫ 1), namely, according to equation 5.4.
Figure 3: Probability density of the membrane voltage for different orders of magnitude of the correlation times τe, τi. (a) Simulations with τe = 0.2728 ms, τi = 1.045 ms compared to the white-noise theory and R&D's original solution. (b) Simulations with τe = 27.28 ms, τi = 104.5 ms compared to the static-noise theory and R&D's original solution. Parameters as in Figure 2 except for the correlation times, which were chosen one order of magnitude smaller (a) or larger (b).
We thus have an additional check of whether the extended expression, equation 2.15, is exact, by comparing it not only to numerical simulation results but also to the static noise theory. Second, we have derived an exact integral expression, equation 6.7, for the stationary mean value, so we can compare the stationary mean value according to the extended expression by R&D (given in equation 6.10) to the exact expression and to numerical simulations. To check the extended expression against the static noise theory, we have to choose parameter values for which βτe and βτi are much larger than one; at the same time, the noise variances should be sufficiently large.
Figure 4: Probability density of the membrane voltage for simulation data and parameters as in Figure 1a. The extended expression, equation 2.15 (a), and the effective time constant approximation, equation 2.25 (b), are compared to results of numerical simulations.
We compare both theories, equation 2.15 and equation 5.4, once for the system of equations 2.19 and 2.20 with simplified parameters at strong noise (Qe = Qi = 1) and large correlation times (βτe,i = 20) (see Figure 6a) and once for the original system (see Figure 6b). For the latter, increases in βτe,i can be achieved by increasing either gL, ge0, gi0 or the synaptic correlation times τe,i. We do both: we increase ge0 to ten times the standard value by R&D (i.e., ge0 = 0.0121 µS → ge0 = 0.121 µS) and also multiply the standard values of the correlation times by roughly three (i.e., τe = 2.728 ms, τi = 10.45 ms → τe = 7.5 ms, τi = 30 ms); additionally, we choose a larger standard deviation for the inhibitory conductance than
Figure 5: Probability density of the membrane voltage for simulation data and parameters as in Figures 2 and 3 (τe = 0.2728, 2.728, and 27.28 ms with τi = 1.045, 10.45, and 104.5 ms, respectively). The extended expression, equation 2.15 (a), and the effective time constant approximation, equation 2.25 (b), are compared to results of numerical simulations.
in R&D's standard parameter set (σi = 0.0264 µS → σi = 0.045 µS). For these parameters, we have βτe ≈ 4.2 and βτi ≈ 16.8, so we may expect reasonable agreement between the static noise theory and the true probability density of the voltage obtained by simulation. Indeed, for both parameter sets, the static noise theory works reasonably well. For the simulation of the original system (see Figure 6b), we also checked that the agreement is significantly enhanced (agreement within line width) by using larger correlation times (e.g., τe = 20 ms, τi = 100 ms), as can be expected. Compared to the static noise theory, the extended expression by R&D shows stronger, although not large, deviations. There are differences in the location and height of the maximum of the densities for
Figure 6: Probability density of the membrane voltage for long correlation times; static noise theory (equation 5.4, solid lines) and extended expression by R&D (equation 2.15, dashed lines) against numerical simulations (symbols). (a) Density of the shifted variable v with Qe = Qi = 3, β = 1, τe = τi = 20, Ve = 1.5, Vi = −0.5. Here, the mean value is infinite. In the simulation, we implemented reflecting boundaries affecting the density only in its tails (not shown in the figure). (b) Density for the original voltage variable with gL = 0.0452 mS/cm², a = 34,636 µm², Cm = 1 µF/cm², EL = −80 mV, Ee = 0 mV, Ei = −75 mV, σe = 0.012 µS, σi = 0.045 µS, ge0 = 0.121 µS, gi0 = 0.0574 µS, τe = 7.5 ms, τi = 30 ms. Here the mean value is finite. Inset: Same data on a logarithmic scale.
both parameter sets; prominent also is the difference between the tails of the densities (see the Figure 6b inset). Hence, there are parameters that are not completely outside the physiological range, for which the extended expression yields only an approximate description and for which the static
noise theory works better than the extended expression by R&D. This is in particular the case for strong and long-correlated noise.

7.3 Mean Value of the Membrane Voltage. The second way to check the expressions by R&D is to compare their mean values to the exact expression for the stationary mean, equation 6.7. We do this for the transformed system, equations 2.19 and 2.20, with dimensionless parameters. In Figure 7, the stationary mean value is shown as a function of the correlation time τe of the excitatory conductance. In the two panels, we keep the noise intensities Qe and Qi fixed; the correlation time of inhibition is small (in Figure 7a) or medium (in Figure 7b) compared to the intrinsic timescale (1/β = 1). We choose noise intensities Qi = 0.3 and Qe = 0.2 so that the mean value is finite because equation 6.4 is satisfied. In Figure 7a, the disagreement between the extended expression by R&D (dash-dotted line) and the exact solution (thick solid line) is apparent for medium values of the correlation time. To verify this further, we also compare to numerical simulation results. The latter agree with our exact theory for the mean value within the numerical error of the simulation. We also plot two limits that may help to explain why the new theory by R&D works in this special case at very small and very large values of τe. At small values, both noises are effectively white, and we have already discussed that in this case, the extended expression for the probability density, equation 2.15, approaches the correct white noise limit. Hence, the first moment should also be correctly reproduced in this limit. On the other hand, going to large correlation time τe at fixed noise intensity Qe means that the effect of the colored noise ye(t) on the dynamics vanishes. Hence, in this limit, we obtain the mean value of a system that is driven only by one white noise (i.e., yi(t)).
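These two limits can be checked directly on the mean of the extended expression, equation 6.10. A small sketch, where Ve, Vi are sample values from Figure 6a and τi = 10⁻² as in Figure 7a:

```python
beta, Qe, Qi = 1.0, 0.2, 0.3   # dimensionless parameters as in Figure 7
Ve, Vi = 1.5, -0.5             # sample values from Figure 6a
ti = 1e-2                      # small inhibitory correlation time (Figure 7a)

def v_wn(qe, qi):
    # white-noise mean, equation 6.8
    return -(Ve * qe + Vi * qi) / (beta - qe - qi)

def v_ext(te):
    # mean of the extended expression, equation 6.10
    qe, qi = Qe / (1 + beta * te), Qi / (1 + beta * ti)
    return -(Ve * qe + Vi * qi) / (beta - qe - qi)

qi_eff = Qi / (1 + beta * ti)
print(v_ext(1e-8), v_wn(Qe, Qi))      # tau_e -> 0: close to the full white-noise mean
print(v_ext(1e8), v_wn(0.0, qi_eff))  # tau_e -> infinity: excitatory noise drops out
```

At very small τe the extended-expression mean is close to the full white-noise value, and at very large τe it reduces to the mean of a system driven by inhibitory noise alone; at intermediate τe it deviates from the exact result, as Figure 7a shows.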
Also this limit is correctly described by R&D's new theory, since the effective noise intensity Q̃e = Qe/(1 + βτe) vanishes for τe → ∞ if Qe is fixed. However, for medium values of τe, the new theory predicts a larger mean value than the true one. The mean value, equation 6.9, of the original solution, equation 2.12 (dotted lines in Figure 7), does not depend on the correlation time τe at all. If the second correlation time τi is of the order of the effective membrane time constant 1/β (see Figure 7b), the deviations between the mean value of the extended expression and the exact solution are smaller but extend over all values of τe. In this case, the new solution does not approach the correct one in either of the limit cases, τe → 0 or τe → ∞. The overall deviations of the mean according to the extended expression from the exact solution are nevertheless small. Also, for both panels, the differences in the mean are small compared to the standard deviations of the voltage. Thus, the expression, equation 6.10, corresponding to the extended expression, can be regarded as a good approximation for the mean value. Finally, we illustrate the convergence or divergence of the mean when the condition, equation 6.4, is obeyed or violated, respectively. First, we choose
Figure 7: Stationary mean value ⟨v⟩ of the shifted voltage (in arbitrary units) versus the correlation time τe (in arbitrary units) of the excitatory conductance. Noise intensities Qe = 0.2, Qi = 0.3, QI = 0, and β = 1 are fixed in all panels. Correlation time of the inhibitory conductance: τi = 10⁻² (a) and τi = 1 (b). Shown are the exact analytical result, equation 6.7 (solid line); the mean value according to the original solution, equation 6.9 (dotted line); and the mean value according to the extended expression, equation 6.10 (dash-dotted line). In panel a, we also compare to the mean value of the white noise solution for Qe = 0.2, Qi = 0.3 (thin solid line) and for Qe = 0, Qi = 0.3 (dashed line), as well as to numerical simulation results (symbols).
the original system and the standard set of parameters by Rudolph and Destexhe (2003) and simulate a large number of trajectories in parallel. All of these are started at the same value (V = 0) and each with independent noise sources, the initial values of which are drawn from the stationary
Figure 8: Time-dependent mean value ⟨V(t)⟩ of the original voltage variable (in volts) as a function of time (in seconds) for the initial value V(t = 0) = 0 V and different values of the inhibitory conductance standard deviation σi; numerical simulations of equations 2.19 and 2.20 (symbols) and theory according to equation 6.3 (lines). For all curves, ge0 = 0.0121 µS, gi0 = 0.0573 µS, σe = 0.012 µS, τe = 2.728 ms, τi = 10.49 ms, a = 34,636 µm², and Cm = 1 µF/cm². For the dashed line (theory) and the gray squares (simulations), we chose σi = 0.0264 µS; in this case, parameters correspond to the standard parameter set of Rudolph and Destexhe (2003). For the solid line (theory) and the black circles, we used σi = 0.066 µS, corresponding to 250% of the standard value by R&D. For the standard parameter set, the mean value saturates at a finite level; in the second case, the mean diverges and goes beyond 100 mV within 31 ms. Simulations were carried out for 10⁶ voltage trajectories using an adaptive time step (always smaller than 0.01 ms) that properly took into account the trajectories that diverge the strongest. The large number of trajectories was required in order to obtain a reliable estimate of the time-dependent mean value in the case of strong noise (σi = 0.066 µS), where the voltage fluctuations are quite large.
gaussian densities. In an experiment, this corresponds exactly to fixing the voltage of the neuron via voltage clamp and then letting the voltage evolve freely under the influence of synaptic input (which has not been affected by the voltage clamp). In Figure 8, we compare the time-dependent average of all trajectories to our theory, equation 6.3 (in terms of the original variable and parameters). For R&D's standard parameters, the mean value reaches a finite value (⟨V⟩ ≈ −65 mV) after a relaxation of roughly 20 ms. The time course of the mean value is well reproduced by our theory, as it should be. Increasing one of the noise standard deviations to 2.5 times its standard value (σi = 0.0264 µS → 0.066 µS), which is still in the range inspected by
R&D, results in a diverging mean.5 Again the theory (solid line) is confirmed by the simulation results (black circles). Starting from zero voltage, the mean voltage goes beyond 100 mV within 31 ms. In contrast to this, the mean value of the extended expression is finite (the condition, equation 3.21, is obeyed), and the mean value formula for this density, equation 6.10, yields a stationary mean voltage of −66 mV. Thus, in the general colored noise case, the extended expression cannot be used to decide whether the moments of the membrane voltage will be finite. We note that the divergence of the mean is due to a small number of strongly deviating voltage trajectories in the ensemble over which we average. This implies that the divergence will not be seen in a typical trajectory and that a large ensemble of realizations and a careful simulation of the rare strong deviations (adaptive time step) are required to confirm the diverging mean predicted by the theory. Thus, although the linear model with multiplicative gaussian noise is thought to be a simple system compared to nonlinear spike generators with Poissonian input noise, its careful numerical simulation may be much harder than that of the latter type of model.

8 Conclusions

We have investigated the formula for the probability density of the membrane voltage driven by multiplicative and/or additive noise (conductance and/or current noise) proposed by R&D in their original article. Their solution deviates from the numerical simulations in all three limits we have studied (white noise driving, colored additive noise, and static multiplicative noise). The deviation is significant over extensive parameter ranges. The extended expression by R&D (2005), however, seems to provide a good approximation to the probability density of the system for a large range of parameters.
In the appendix, we show where errors were made in the derivation of the Fokker-Planck equation on which both the original and extended expressions are based. Although there are serious flaws in the derivation, we have seen that the new formula (obtained by an ad hoc introduction of effective correlation times in the original solution) gives a very reasonable approximation to the probability density for weak noise. What could be the reason for this good agreement? The best, though still phenomenological, reasoning for the solution, equation 2.15, is as follows. First, an approximation to the probability
5 These parameter values were not considered by R&D to be in the physiological range. We cannot, however, exclude that other parameter variations (e.g., decreasing the leak conductance or increasing the synaptic correlation times) will lead to a diverging mean for parameters in the physiological range.
density should work in the solvable white noise limit:

lim_{τe,τi→0} ρappr(v, Qe, Qi, τe, τi) = ρwn(v, Qe, Qi).   (8.1)
Second, we know that at weak multiplicative noise of arbitrary correlation time, the effective time constant approximation will be approached:

ρappr(v, Qe, Qi, τe, τi) = ρETC(v, Qe, Qi, τe, τi)   (Qe, Qi small).   (8.2)
The latter density, given in equation 2.25, can be expressed by the white noise density with rescaled noise intensities (note that the variance in the ETC approximation given in equation 2.26 has this property); furthermore, it is close to the density for white multiplicative noise if the noise is weak:

ρETC(v, Qe, Qi, τe, τi) = ρETC(v, Qe/(1 + βτe), Qi/(1 + βτi), 0, 0)
                        ≈ ρ(v, Qe/(1 + βτe), Qi/(1 + βτi), 0, 0)   (Qe, Qi small)
                        = ρwn(v, Qe/(1 + βτe), Qi/(1 + βτi)).   (8.3)
Hence, using this equation together with equation 8.1, one arrives at

ρappr(v, Qe, Qi, τe, τi) ≈ ρwn(v, Qe/(1 + βτe), Qi/(1 + βτi)).   (8.4)
This approximation, which also obeys equation 8.1, is the extended expression by R&D. It is expected to work in the white noise and weak noise limits and can be regarded as an interpolation formula between these limits. We have seen that for stronger noise and large correlation times (i.e., in a parameter regime where neither of the above assumptions of weak or uncorrelated noise holds true), this density and its mean value disagree with numerical simulation results as well as with our static noise theory. Regarding the parameter sets for which we checked the extended expression for the probability density, it is remarkable that the differences from numerical simulations were not larger. Two issues remain. First, we have shown that the linear model with gaussian conductance fluctuations can show a diverging mean value. Certainly, for higher moments, as, for instance, the variance, the restrictions on parameters will be even more severe than those for the mean value (this can be concluded from the tractable limit cases we have considered). As demonstrated in the case of the stationary mean value, the parameter regime for such a divergence cannot be determined using the different solutions proposed by R&D. Of course, a real neuron can be driven by a strong synaptic input without showing a diverging mean voltage—the divergence of moments found
above is just due to the limitations of the model. One such limitation is the diffusion approximation on which the model is based. Applying this approximation, the synaptically filtered spike train inputs have been replaced by OUPs. In the original model with spike train input, it is well known that the voltage cannot go below the lowest reversal potential Ei or above the excitatory reversal potential Ee if no current (additive) noise is present (see, e.g., Lánský & Lánská, 1987, for the case of unfiltered Poissonian input). In this case, we do not expect a power law behavior of the probability density at large values of the voltage. Another limitation of the model considered by R&D is that no nonlinear spike-generating mechanism has been included. In particular, the mechanism responsible for the voltage reset after an action potential would prevent any power law at strong, positive voltage. Thus, we see that at strong synaptic input, the shot-noise character of the input and nonlinearities in the dynamics cannot be neglected and even determine whether the mean of the voltage is finite. The second issue concerns the consequences of the diffusion approximation for the validity of the achieved results. Even if we assume noise weak enough that all the lower moments like mean and variance are finite, is there any effect of the shot-noise character of the synaptic input that is not taken into account properly by the diffusion approximation? Richardson and Gerstner (2005) have recently addressed this issue and shown that the shot-noise character will affect the statistics of the voltage and that its contribution is comparable to that resulting from the multiplicativity of the noise. Thus, for a consistent treatment, one should either include both features (as done by Richardson and Gerstner, 2005, in the limit of weak synaptic noise) or none (corresponding to the effective timescale approximation; cf. Richardson & Gerstner, 2005).
Summarizing, we believe that the use of the extended expression by R&D is restricted to parameters obeying

β ≫ Q_e + Q_i. (8.5)
This restriction is consistent with (1) the diffusion approximation on which the model is based, (2) the qualitative justification of the extended expression by R&D given above, and (3) the finiteness of the stationary mean and variance. For parameters that do not obey the condition in equation 8.5, one should take into account the shot-noise statistics of the synaptic drive. Recent perturbation results were given by Richardson and Gerstner (2005) assuming weak noise; we note that the small parameter in this theory is (Q_e + Q_i)/β and therefore exactly equal to the small parameter in equation 8.5. The most promising result in our letter seems to be the exact solution for the time-dependent mean value, a statistical measure that can be easily determined in an experiment and might tell us a lot about the synaptic
B. Lindner and A. Longtin
dynamics and its parameters. The only weakness of this formula is that it is still based on the diffusion approximation, that is, on the assumption of gaussian conductance noise. One may, however, overcome this limitation by repeating the calculation for synaptically filtered shot noise.

Appendix: Analysis of the Derivation of the Fokker-Planck Equation

Here we show where in the derivation of the Fokker-Planck equation by R&D errors have been made. Let us first note that although R&D use a so-called Ito rule, there is no difference between the Ito and Stratonovich interpretations of the colored noise–driven membrane dynamics. Since the noise processes possess a finite correlation time, the Ito-Stratonovich dilemma occurring in systems driven by white multiplicative noise is not an issue here. To comprehend the errors in the analytical derivation of the Fokker-Planck equation in R&D, it suffices to consider the case of only additive OU noise. For clarity we will use our own notation: the OUP is denoted by y_I(t), and we set h_I = 1 (the latter function is used in R&D for generality). R&D give a formula for the differential of an arbitrary function F(v(t)) in their equation B.9:

dF(v(t)) = ∂_v F(v(t)) dv + (1/2) ∂_v² F(v(t)) (dv)². (A.1)
R&D use the membrane equation in its differential form, which for vanishing multiplicative noises reads

dv = f(v) dt + dw_I, (A.2)
where the drift term is f(v) = −βv and w_I is the integrated OU process y_I:

w_I(t) = ∫_0^t ds y_I(s). (A.3)
Inserting equation A.2 into equation A.1, we obtain

dF(v(t)) = ∂_v F(v(t)) f(v(t)) dt + ∂_v F(v(t)) dw_I + (1/2) ∂_v² F(v(t)) (dw_I)². (A.4)
This should correspond to equation B.10 in R&D for the case of zero multiplicative noise. However, our formula differs from equation B.10 in one important respect: R&D have replaced (dw_I)² by 2α_I(t) dt using their Ito
rule,⁶ equation A.13a. Dividing by dt, averaging, and using the fact that for finite τ_I dw_I(t)/dt = y_I(t), we arrive at

⟨dF(v(t))⟩/dt = ⟨∂_v F(v(t)) f(v(t))⟩ + ⟨∂_v F(v(t)) y_I(t)⟩ + (1/2) ⟨∂_v² F(v(t)) (dw_I)²/dt⟩. (A.5)

This should correspond to equation B.12 in R&D (again for the case of vanishing multiplicative noise) but is not equivalent to the latter equation for two reasons. First, R&D set the second term on the right-hand side to zero, reasoning that the mean value ⟨y_I(t)⟩ is zero (they also use an argument about h_{e,i,I}, which is irrelevant in the additive noise case considered here). Evidently, if y_I(t) is a colored noise, it will be correlated to its values in the past y_I(t′) with t′ < t. The voltage v(t) and any nontrivial function F(v(t)) is a functional of and therefore correlated to y_I(t′) with t′ < t. Consequently, there is also a correlation between y_I(t) and F(v(t)), and thus

⟨∂_v F(v(t)) y_I(t)⟩ ≠ ⟨∂_v F(v(t))⟩⟨y_I(t)⟩ = 0. (A.6)
Hence, setting the second term (which actually describes the effect of the noise on the system) to zero is wrong.⁷ This also applies to the respective terms due to the multiplicative noise. Second, the last term on the right-hand side of equation A.5 was treated as a finite term in the limit t → ∞. According to R&D's equation A.13a (for i = j), equation 3.2, and equation 3.3, lim_{t→∞} ⟨(dw_I)²⟩ = lim_{t→∞} 2⟨α_I(t)⟩ dt = σ̃_I² τ_I dt and thus ⟨(dw_I)²⟩/dt → σ̃_I² τ_I as t → ∞. However, the averaged variance of dw_I = y_I(t) dt is ⟨(dw_I)²⟩ = ⟨y_I(t)²⟩ (dt)² = σ̃_I² (dt)², and therefore the last term in equation A.5 is of first order in dt (since ⟨(dw_I)²⟩/dt = ⟨y_I(t)²⟩ dt ∼ dt) and vanishes. This is the second error in the derivation. We note that the limit in equation 3.3 is not correctly carried out. Even if we follow R&D in using their relations, equation A.13a, together with the correct relation, equation A.10a (instead of the white noise formula, equation A.12a), we obtain that for finite τ_I, the mean squared increment
⁶ Note that R&D use α_I(t) for two different expressions: according to equation B.8 for the stochastic quantity σ̃_I² [τ_I (1 − exp(−t/τ_I)) − t] + w_I²(t)/(2τ_I), but also, according to equation 3.2, for the average of this stochastic quantity.
⁷ For readers still unconvinced of equation A.6, a simple example will be useful. Let F(v(t)) = v²(t)/2. Then ⟨∂_v F(v(t)) y_I(t)⟩ = ⟨v(t) y_I(t)⟩. In the stationary state, this average can be calculated as ∫∫ dv dy_I v y_I P_0(v, y_I) using the density equation 4.6. This yields ⟨v(t) y_I(t)⟩ = Q_I/[1 + βτ_I], which is finite for all finite values of the noise intensity Q_I and correlation time τ_I. Note that this line of reasoning is valid only for truly colored noise (τ_I > 0); the white noise case has to be treated separately.
⟨(dw_I)²⟩ is zero in linear order in dt for all times t, which is in contradiction to equation 3.3 in R&D. This incorrect limit stems from using the white noise formula, equation A.12a, which R&D assume to go from equation 3.2 to equation 3.3 in R&D (2003). The use of equation A.12a is justified by R&D by the steady-state limit t → ∞ with t/τ_I ≫ 1. However, t → ∞ with t/τ_I ≫ 1 does not imply that τ_I → 0 and that one can use equation A.12a, which holds true only for τ_I → 0. In other words, a steady-state limit does not imply a white noise limit. We now show that keeping the proper terms in equation A.5 does not lead to a useful equation for the solution of the original problem. After applying what was explained above, equation A.5 reads correctly,

⟨dF(v(t))⟩/dt = ⟨∂_v F(v(t)) f(v(t))⟩ + ⟨∂_v F(v(t)) y_I(t)⟩. (A.7)
Because of the correlation between v(t) and y_I(t), we have to use the full two-dimensional probability density to express the averages:

⟨∂_v F(v(t)) f(v(t))⟩ = ∫dv ∫dy_I (∂_v F(v)) f(v) P(v, y_I, t) = ∫dv (∂_v F(v)) f(v) ρ(v, t),

⟨∂_v F(v(t)) y_I(t)⟩ = ∫dv ∫dy_I (∂_v F(v)) y_I P(v, y_I, t). (A.8)
Inserting these relations into equation A.7, performing an integration by parts, and setting F(v) = 1 leads us to

∂_t ρ(v, t) = −∂_v ( f(v) ρ(v, t) ) − ∂_v ∫dy_I y_I P(v, y_I, t), (A.9)
which is not a closed equation for ρ(v, t), nor a Fokker-Planck equation. The above equation with f(v) = −βv can also be obtained by integrating the two-dimensional Fokker-Planck equation, equation 4.5, over y_I. In conclusion, by neglecting a finite term and assuming a vanishing term to be finite, R&D have effectively replaced one term by the other; the colored noise drift term is replaced by a white noise diffusion term, the latter with a prefactor that corresponds to only half of the noise intensity. This amounts to a white noise approximation of the colored conductance noise, although with a noise intensity that is not correct in the white noise limit of the problem.
Acknowledgments

This research was supported by NSERC Canada and a Premier's Research Excellence Award from the government of Ontario. We also acknowledge an anonymous reviewer for bringing to our attention the Note by R&D (2005), which at that time had not yet been published.

References

Abramowitz, M., & Stegun, I. A. (1970). Handbook of mathematical functions. New York: Dover.
Brunel, N., & Sergi, S. (1998). Firing frequency of leaky integrate-and-fire neurons with synaptic current dynamics. J. Theor. Biol., 195, 87–95.
Gardiner, C. W. (1985). Handbook of stochastic methods. Berlin: Springer-Verlag.
Hänggi, P., & Jung, P. (1995). Colored noise in dynamical systems. Adv. Chem. Phys., 89, 239–326.
Hanson, F. B., & Tuckwell, H. C. (1983). Diffusion approximation for neuronal activity including synaptic reversal potentials. J. Theor. Neurobiol., 2, 127–153.
Holden, A. V. (1976). Models of the stochastic activity of neurones. Berlin: Springer-Verlag.
Lánská, V., Lánský, P., & Smith, C. E. (1994). Synaptic transmission in a diffusion model for neural activity. J. Theor. Biol., 166, 393–406.
Lánský, P., & Lánská, V. (1987). Diffusion approximation of the neuronal model with synaptic reversal potentials. Biol. Cybern., 56, 19–26.
Richardson, M. J. E. (2004). Effects of synaptic conductance on the voltage distribution and firing rate of spiking neurons. Phys. Rev. E, 69, 051918.
Richardson, M. J. E., & Gerstner, W. (2005). Synaptic shot noise and conductance fluctuations affect the membrane voltage with equal significance. Neural Comp., 17, 923–948.
Risken, H. (1984). The Fokker-Planck equation. Berlin: Springer.
Rudolph, M., & Destexhe, A. (2003). Characterization of subthreshold voltage fluctuations in neuronal membranes. Neural Comp., 15, 2577–2618.
Rudolph, M., & Destexhe, A. (2005). An extended analytical expression for the membrane potential distribution of conductance-based synaptic noise. Neural Comp., 17, 2301–2315.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Tuckwell, H. C. (1989). Stochastic processes in the neurosciences. Philadelphia: Society for Industrial and Applied Mathematics.
Received February 3, 2005; accepted October 5, 2005.
LETTER
Communicated by Klaus-Robert Müller
Kernel Projection Classifiers with Suppressing Features of Other Classes

Yoshikazu Washizawa∗
[email protected]
Tokyo Institute of Technology, Ookayama, Meguro-ku, Tokyo 152-8552, Japan
Yukihiko Yamashita [email protected] Tokyo Institute of Technology, Ookayama, Meguro-ku, Tokyo 152-8552, Japan
We propose a new classification method based on a kernel technique called the suppressed kernel sample space projection classifier (SKSP), which is extended from the kernel sample space projection classifier (KSP). In kernel methods, samples are classified after they are mapped from an input space to a high-dimensional space called a feature space. The space that is spanned by samples of a class in a feature space is defined as a kernel sample space. In KSP, an unknown input vector is classified into the class whose projection norm onto the kernel sample space is maximal. KSP can be interpreted as a special type of kernel principal component analysis (KPCA). KPCA is also used in classification problems. However, KSP has more useful properties compared with KPCA, and its accuracy is almost the same as or better than that of the KPCA classifier. Since KSP is a single-class classifier, it uses only self-class samples for learning. Thus, for a multiclass classification problem, even if there are very many classes, the computational cost per class does not change. However, we expect that more useful features can be obtained for classification if samples from other classes are used. By extending KSP to SKSP, the effects of other classes are suppressed, and useful features can be extracted with an oblique projection. Experiments on two-class classification problems indicate that SKSP shows high accuracy in many classification problems.

1 Introduction

Kernel-based methods for pattern recognition, such as support vector machines (SVMs), kernel principal component analysis (KPCA), and the

∗ Current address: Brain Science Institute, RIKEN, 2-1 Hirosawa, Wako-shi, Saitama 351-0198, Japan
Neural Computation 18, 1932–1950 (2006)
© 2006 Massachusetts Institute of Technology
kernel Fisher discriminant (KFD), achieve high accuracy (Müller, Mika, Rätsch, Tsuda, & Schölkopf, 2001; Vapnik, 1998). In particular, SVMs are widely used for classification and regression problems. In kernel-based methods, an input pattern is classified after it has been mapped to a feature space F whose dimension is very high or infinite. Let Φ be a mapping from an N-dimensional input space to a feature space:

Φ: R^N → F. (1.1)
Instead of calculating Φ directly, one can introduce a kernel function k that satisfies

k(x, y) = ⟨Φ(x), Φ(y)⟩, (1.2)
where ⟨·, ·⟩ denotes the inner product. If we use a Mercer kernel function (Schölkopf, Mika, et al., 1999), a Φ exists that satisfies equation 1.2. Mercer kernel functions that are widely used are as follows:

Linear: k(x, y) = ⟨x, y⟩ (1.3)
Polynomial: k(x, y) = (⟨x, y⟩ + 1)^d (1.4)
Gaussian: k(x, y) = exp(−‖x − y‖² / (2σ²)). (1.5)
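As a minimal illustration, the three kernels of equations 1.3 to 1.5 can be coded directly (the function names are ours):

```python
import numpy as np

# The three Mercer kernels of equations 1.3-1.5 as plain functions.
def linear_kernel(x, y):
    return float(np.dot(x, y))                                    # eq. 1.3

def polynomial_kernel(x, y, d=2):
    return float((np.dot(x, y) + 1.0) ** d)                       # eq. 1.4

def gaussian_kernel(x, y, sigma=1.0):
    return float(np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2)))  # eq. 1.5

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(linear_kernel(x, y))      # 0.0  (orthogonal vectors)
print(polynomial_kernel(x, y))  # 1.0  ((0 + 1)^2)
print(gaussian_kernel(x, x))    # 1.0  (identical inputs)
```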
When equation 1.3 is the kernel function, Φ is an identity operator. When the polynomial function, equation 1.4, or the gaussian kernel function, equation 1.5, is used, an N-dimensional input vector is mapped into a (_{N+d}C_d − 1)-dimensional or an infinite-dimensional Hilbert space, respectively. SVMs can be calculated only from the inner products of learning samples or an unknown input vector. Thus, kernel methods involve exchanging inner products in the feature space for kernel functions. Using a mapping Φ, SVMs can achieve high accuracy for many types of classification and regression problems. However, SVMs are two-class classifiers, and a quadratic optimization problem has to be solved in the learning stage, so the computational cost is very high. Thus, for multiclass classification problems, the computational cost increases with the number of classes. A kernel-based method called kernel PCA (KPCA), which is extended from PCA or the Karhunen-Loève transform (KLT), has been proposed (Schölkopf, Smola, & Müller, 1998). Subspace methods utilizing PCA or KLT are used widely for pattern recognition or classification problems (Watanabe & Pakvasa, 1973; Oja, 1983). KPCA has also been applied to classification problems in work by Tsuda (1999), Maeda and Murase (1999), and Murata
and Onoda (2001). We call such methods KPCA classifiers. KPCA classifiers perform better than conventional subspace methods. Although they do not exceed kernel two-class classifiers (e.g., SVM or KFD) in accuracy, they have an advantage from the viewpoint of the computational cost of a multiclass classification problem because they are single-class classifiers. Note that SVMs and KFDs are two-class classifiers, and they require all samples belonging to all classes in the learning stage. A single-class classifier of each class needs only samples belonging to that class in the learning stage. Thus, a KPCA classifier is suitable for solving multiclass classification problems. Moreover, if the number of classes changes, two-class classifiers of all classes have to be reconstructed. In contrast, single-class classifiers require only the newly added classes to be constructed. Subspace methods in the input space need to reduce the dimension of the space that expresses the features of each class because the spaces that are spanned by samples of various classes overlap. However, in the feature space, the overlap with other classes is small. Thus, for subspace methods in a feature space, reducing the dimension of the space is unnecessary (see section 4). Thus, the kernel sample space projection classifier (KSP) was proposed as a kernel subspace method that does not require reducing the dimension of the space for each class. We describe the details of KSP in section 2. Although KSP can be interpreted as a special case of KPCA, it is constructed by calculating an inverse matrix, whereas KPCA needs the solution of an eigenvalue problem. Thus, KSP can incorporate incremental learning with updates of an inverse matrix. Moreover, a gaussian elimination method can be applied to obtain an inverse matrix. This is useful in applications that need many reconstructions, such as leave-one-out cross validation or time-series data.
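The incremental-learning idea mentioned above rests on updating the inverse of a Gram matrix when one sample is added, rather than recomputing it. A minimal sketch of ours, using the standard bordered-matrix (Schur complement) block-inverse identity rather than Rohde's exact formulation:

```python
import numpy as np

# Sketch: when one sample is added, the Gram matrix grows by one bordered
# row/column, and its inverse can be updated from the old inverse via the
# Schur complement instead of being recomputed from scratch.
def add_sample_inverse(A_inv, b, c):
    """Inverse of [[A, b], [b^T, c]] given A_inv = A^{-1} (b a vector, c a scalar)."""
    v = A_inv @ b
    s = c - b @ v                        # scalar Schur complement
    n = len(b)
    out = np.empty((n + 1, n + 1))
    out[:n, :n] = A_inv + np.outer(v, v) / s
    out[:n, n] = -v / s
    out[n, :n] = -v / s
    out[n, n] = 1.0 / s
    return out

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)                    # gaussian Gram matrix of 6 samples
K_inc = add_sample_inverse(np.linalg.inv(K[:5, :5]), K[:5, 5], K[5, 5])
print(np.allclose(K_inc, np.linalg.inv(K)))   # True
```

The update costs O(L²) per added sample instead of O(L³) for a fresh inversion, which is what makes leave-one-out and time-series reconstructions cheap.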
KSP has performance that is almost the same as or higher than that of the KPCA classifier (Washizawa & Yamashita, 2004). In this letter, we propose a suppressed kernel sample space projection classifier (SKSP) that is extended from KSP. KSP and KPCA can extract features that are included in two or more classes. Such common features among classes are not only useless for classification; they are also harmful as noise. Therefore, such features have to be suppressed. We introduce a suppression trick to inhibit them. For the criterion of SKSP, we add a term that suppresses such common features by using some of the samples belonging to other classes in the learning stage. Section 2 describes the characterization of KSP by an optimality criterion and introduces regularization. Section 3 defines the SKSP criterion as an extension of this viewpoint of KSP and provides its solution. Section 4 shows experimental results on two-class classification problems to compare the performance of SKSP with other classification methods. Finally, section 5 discusses the advantages of SKSP and shows the differences between it and other methods.
Figure 1: Kernel sample space projection.
2 Kernel Sample Space Projection

2.1 Definition of KSP. Let Ω_i be a set of vectors that belong to class i and f_1^i, f_2^i, . . . , f_{L_i}^i be samples in Ω_i, where L_i is the number of samples in class i. We define an operator S_i as

S_i = [Φ(f_1^i) Φ(f_2^i) . . . Φ(f_{L_i}^i)]. (2.1)
R(S_i) is the kernel sample space of class i, where R(A) denotes the range of A. Generally, the dimension of the feature space is much larger than the number of samples L_i. Therefore, a kernel sample space is sparse and expresses features of the class. In KSP, the similarity between an unknown input vector and a class i is evaluated by the norm of its projection onto the kernel sample space of class i. An unknown input vector f_x is classified according to the maximum norm (see Figure 1). Let K_{S_i} be the Gram matrix in F of class i:

K_{S_i} = S_i* S_i (2.2)

        = [ k(f_1^i, f_1^i)      · · ·  k(f_1^i, f_{L_i}^i)
            ...                  ...    ...
            k(f_{L_i}^i, f_1^i)  · · ·  k(f_{L_i}^i, f_{L_i}^i) ], (2.3)
where A* is the adjoint operator of A. The projection operator P_{R(S_i)} and the projection norm of Φ(f_x) onto R(S_i) are expressed as

P_{R(S_i)} = S_i K_{S_i}^† S_i* (2.4)
‖P_{R(S_i)} Φ(f_x)‖² = ⟨h(f_x), K_{S_i}^† h(f_x)⟩, (2.5)
where h(x) = S_i* Φ(x) is the empirical kernel map defined by Schölkopf, Mika, et al. (1999), and A† is a Moore-Penrose generalized inverse operator that satisfies AA†A = A, A†AA† = A†, (AA†)* = AA†, and (A†A)* = A†A. If A is a nonsingular operator, A† = A⁻¹. Although Φ(f_x) is a vector in a high-dimensional space or in an infinite Hilbert space, S_i* Φ(f_x) is an L_i-dimensional real vector, that is, S_i* Φ(f_x) = (k(f_1^i, f_x), k(f_2^i, f_x), . . . , k(f_{L_i}^i, f_x))ᵀ, and we can calculate it directly. An unknown input vector f_x is classified into class i if and only if

‖P_{R(S_i)} Φ(f_x)‖² > ‖P_{R(S_j)} Φ(f_x)‖² ∀ j ≠ i. (2.6)
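Equations 2.3 to 2.6 translate into a few lines of numerical code. The sketch below is our own illustration (gaussian kernel, synthetic data, and all names are assumptions); it computes the projection norm ⟨h(f_x), K† h(f_x)⟩ of equation 2.5:

```python
import numpy as np

def gaussian_gram(X, Y, sigma=1.0):
    # matrix of k(x_s, y_t) for rows of X and Y (gaussian kernel, eq. 1.5)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def ksp_score(train, x, sigma=1.0):
    """||P_{R(S_i)} Phi(f_x)||^2 = <h(f_x), K^+ h(f_x)>, equations 2.3-2.5."""
    K = gaussian_gram(train, train, sigma)             # Gram matrix K_{S_i}
    h = gaussian_gram(train, x[None, :], sigma)[:, 0]  # empirical kernel map h(f_x)
    return float(h @ np.linalg.pinv(K) @ h)

rng = np.random.default_rng(2)
train = rng.normal(size=(20, 2))
# A training sample lies in the kernel sample space, so its projection norm
# equals k(f, f) = 1 for the gaussian kernel; a distant point scores near 0.
print(ksp_score(train, train[0]))                 # ~1.0
print(ksp_score(train, np.array([50.0, 50.0])))   # ~0.0
```

Classification by equation 2.6 then amounts to evaluating `ksp_score` per class and taking the maximum.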
2.2 Properties of KSP. KSP is characterized by the following optimality criterion; accordingly, a regularized KSP and a suppressed KSP are defined:

min: J[X_i] = (1/L_i) Σ_{s=1}^{L_i} ‖Φ(f_s^i) − X_i Φ(f_s^i)‖² (2.7)

subject to: N(X_i) ⊃ R(S_i)^⊥, (2.8)
where N(A) is the null space of A. To obtain the solution of KPCA, one has to solve an eigenvalue problem whose size is equal to the number of samples. However, in KSP, we obtain a solution by simply calculating an inverse problem of the size of the number of samples. If we use the Sherman-Morrison-Woodbury theorem and its extension in Rohde (1965), incremental learning is easily achieved in KSP. As we mention further in the next section, Tikhonov's regularization is introduced to KSP in order to avoid the overfitting problem, whereas KPCA is interpreted as a truncated singular value decomposition (TSVD) from the viewpoint of its regularization. If the Gram matrix K_{S_i} is nonsingular, K_{S_i}^† = K_{S_i}^{-1}. Since the problem we have to solve is not the generalized inverse but the inverse problem, we can obtain a solution more easily. If a gaussian kernel function is used, the Gram matrix is always nonsingular unless equal samples exist. If a polynomial kernel function is used and the dimension of its feature space is large enough, the Gram matrix is considered to be nonsingular unless equal samples exist.

2.3 Regularization of KSP. Generally, a set of learning samples will include outliers or noisy samples. Thus, the generalization capability of classifiers may not be high, even if they can classify finite learning samples correctly. This is the overfitting or overlearning problem, and it can be avoided by using regularization or model selection. For example, Cortes
and Vapnik (1995) and Rätsch, Onoda, and Müller (2001) introduced a soft margin for SVM and the AdaBoost technique. In KSP, the learned patterns are always classified correctly as long as the Gram matrix is nonsingular. Thus, the overfitting problem may occur when it is used. The overfitting problem occurs when the classifier has a decision boundary that is too complex. If the classifier has a discriminant function, the complexity of its decision boundaries is measured using the variation of the function with respect to a very small variation in the input vector. Let Δ be this variation, δf_x be a small variation of f_x, and d: R^N → R be a discriminant function. Then Δ is expressed as

Δ = [(d(f_x + δf_x) − d(f_x)) / d(f_x)] / [‖δf_x‖ / ‖f_x‖]. (2.9)
For KSP, Δ_KSP is expressed as

Δ_KSP = [(‖P_{R(S_i)} Φ(f_x + δf_x)‖ − ‖P_{R(S_i)} Φ(f_x)‖) / ‖P_{R(S_i)} Φ(f_x)‖] · [‖f_x‖ / ‖δf_x‖]
      ≤ [‖P_{R(S_i)} Φ(f_x + δf_x) − P_{R(S_i)} Φ(f_x)‖ / ‖δf_x‖] · [‖f_x‖ / ‖P_{R(S_i)} Φ(f_x)‖]. (2.10)
To suppress Δ_KSP directly is difficult because it is nonlinear. Instead of Δ_KSP, we introduce Δ'_KSP, which is the variation of the feature with respect to a very small variation δΦ(f_x) of Φ(f_x). Then we have

Δ'_KSP = [(‖P_{R(S_i)}(Φ(f_x) + δΦ(f_x))‖ − ‖P_{R(S_i)} Φ(f_x)‖) / ‖δΦ(f_x)‖] · [‖Φ(f_x)‖ / ‖P_{R(S_i)} Φ(f_x)‖]
       ≤ [‖P_{R(S_i)}(Φ(f_x) + δΦ(f_x)) − P_{R(S_i)} Φ(f_x)‖ / ‖δΦ(f_x)‖] · [‖Φ(f_x)‖ / ‖P_{R(S_i)} Φ(f_x)‖] (2.11)
       = [‖P_{R(S_i)} δΦ(f_x)‖ / ‖δΦ(f_x)‖] · [‖Φ(f_x)‖ / ‖P_{R(S_i)} Φ(f_x)‖]. (2.12)
Suppression of Δ has been discussed in reference to ill-posed problems and regularization (Groetsch, 1993). Most ill-posed problems are caused by the first part of Δ, (d(f_x + δf_x) − d(f_x))/‖δf_x‖. If d(f) = Af, the maximum value of the first part of Δ is given by its operator norm, which is defined as ‖A‖ = sup_{‖f‖=1} ‖Af‖. Tikhonov's regularization avoids ill-posed problems (Tikhonov & Arsenin, 1977). It suppresses the Frobenius norm, which is defined as ‖A‖₂² = tr[A*A]. Since ‖A‖ ≤ ‖A‖₂, we can avoid ill-posed problems by suppressing the Frobenius norm.
For KSP, from equation 2.12, we add a term for suppressing ‖X_i‖₂ to equation 2.7 by using Tikhonov regularization. We define the regularized KSP as follows.

Definition 1 (Regularized KSP). Regularized KSP is defined by the solution of the following optimization problem:

min: J[X_i] = (1/L_i) Σ_{s=1}^{L_i} ‖Φ(f_s^i) − X_i Φ(f_s^i)‖² + µ‖X_i‖₂², (2.13)

subject to: N(X_i) ⊃ R(S_i)^⊥, (2.14)
where µ > 0 is a regularization parameter.

Theorem 1. A solution of regularized KSP is

P̃_{R(S_i)} = S_i (K_{S_i} + µL_i I)^{-1} S_i*, (2.15)
where I denotes the identity matrix. This theorem is easily proven from theorem 3 (see section 3.3).

3 Suppressed KSP

3.1 Definition of Suppressed KSP. In KSP, an orthogonal projection can extract the features of each category. Thus, the projection norm of an unknown input vector Φ(f_x) onto R(S_i) stands for the similarity between f_x and class i. However, KSP may extract features that belong to more than one class. Such features cannot be used for classification, since they may be as harmful as noise. They can be suppressed using an oblique projection because its direction is determined by a set of samples that belong to other classes. Let Ω̄_i be a set of vectors that do not belong to class i. Let g_1^i, g_2^i, . . . , g_{M_i}^i be samples in Ω̄_i, and

T_i = [Φ(g_1^i) Φ(g_2^i) . . . Φ(g_{M_i}^i)] (3.1)

U_i = [S_i  T_i] (3.2)

K_{U_i} = U_i* U_i. (3.3)
Next, we introduce the following criterion for SKSP.
Definition 2 (Suppressed KSP). Suppressed KSP (SKSP) is defined by the solution of the following optimization problem:

min: J[X_i] = (1/L_i) Σ_{s=1}^{L_i} ‖Φ(f_s^i) − X_i Φ(f_s^i)‖² + (α/M_i) Σ_{t=1}^{M_i} ‖X_i Φ(g_t^i)‖² (3.4)

subject to: N(X_i) ⊃ R(U_i)^⊥, (3.5)
where α is a parameter that controls the strength of the suppression of other classes. However, if K_{U_i} is nonsingular (i.e., all samples are linearly independent), α vanishes from the solution. After introducing regularization to SKSP, α appears in its solution. The concept behind the criterion is based on least-squares estimation (Luenberger, 1969) and the relative Karhunen-Loève transform (Yamashita & Ogawa, 1996). The additional term ‖X_i Φ(g_t^i)‖² is used to minimize the features extracted from {g_t^i}_{t=1}^{M_i} by X_i. Through this term, the features present in both {f_s^i}_{s=1}^{L_i} and {g_t^i}_{t=1}^{M_i} are suppressed, because the similarity between an unknown input pattern f_x and class i is obtained from ‖X_i Φ(f_x)‖². Criterion 3.4 cannot determine where X_i maps a vector out of R(U_i). Constraint 3.5 removes the component orthogonal to R(U_i). We provide a solution to this criterion in the form of the next theorem.

Theorem 2. Let

D_i = [ I_{L_i}       0_{L_i, M_i}
        0_{M_i, L_i}  0_{M_i, M_i} ], (3.6)

where I_a is the identity matrix whose size is a and 0_{a,b} is the a × b matrix of which all elements are zero. If K_{U_i} is nonsingular, the SKSP operator P'_{R(S_i)} is given as

P'_{R(S_i)} = U_i D_i K_{U_i}^{-1} U_i*. (3.7)
3.2 Properties of SKSP.

Proposition 1. If K_{U_i} is nonsingular, P'_{R(S_i)} is a projection operator onto R(S_i).

Proposition 2. If K_{U_i} is nonsingular,

P'_{R(S_i)} P_{R(U_i)} = P'_{R(S_i)}, (3.8)

P_{R(U_i)} P'_{R(S_i)} = P'_{R(S_i)}, (3.9)
Figure 2: Suppressed kernel sample space projection.
where P_{R(U_i)} is the orthogonal projection operator onto R(U_i).

Proposition 3. If K_{U_i} is nonsingular, P'_{R(S_i)} v = 0 for all v ∈ R(T_i).
Figure 2 shows a sketch of SKSP. From propositions 1, 2, and 3, P'_{R(S_i)} Φ(f_x) can be understood as follows: first, Φ(f_x) is orthogonally projected onto R(U_i); then it is projected onto R(S_i) along R(T_i). The similarity between f_x and Ω_i against Ω̄_i is given as ‖P'_{R(S_i)} Φ(f_x)‖. If Ω̄_i = ∅, then P'_{R(S_i)} = P_{R(S_i)}. Thus, SKSP is an extension of KSP. In actual problems, KSP can extract the necessary features by itself. Thus, we do not have to use all samples of other classes; only the samples that are similar to class i have to be included in Ω̄_i. Since the similarity of an input vector is evaluated using the projection norm onto R(S_i), it is sufficient that samples whose projection norms are large are included in Ω̄_i.
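Propositions 1 and 3 can be verified numerically from equation 3.7. The sketch below is our own illustration on arbitrary synthetic data (all names and parameter choices are assumptions): an own-class sample keeps its full norm k(f, f) = 1, while a suppression sample is annihilated.

```python
import numpy as np

def gaussian_gram(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def sksp_score(S_samples, T_samples, x, sigma=1.0):
    """||P'_{R(S_i)} Phi(f_x)||^2 with P' = U_i D_i K_{U_i}^{-1} U_i* (eq. 3.7)."""
    L, M = len(S_samples), len(T_samples)
    U = np.vstack([S_samples, T_samples])          # samples behind U_i = [S_i T_i]
    D = np.diag([1.0] * L + [0.0] * M)             # D_i of equation 3.6
    K_U = gaussian_gram(U, U, sigma)
    h = gaussian_gram(U, x[None, :], sigma)[:, 0]  # h_i(f_x) = U_i* Phi(f_x)
    w = D @ np.linalg.solve(K_U, h)                # coefficients of P' Phi(f_x)
    return float(w @ K_U @ w)                      # ||U_i w||^2 = w^T K_{U_i} w

rng = np.random.default_rng(3)
S = rng.normal(loc=+2.0, size=(10, 2))   # class-i samples
T = rng.normal(loc=-2.0, size=(10, 2))   # suppression samples (other classes)
print(sksp_score(S, T, S[0]))   # ~1.0: P' acts as the identity on R(S_i)
print(sksp_score(S, T, T[0]))   # ~0.0: P' annihilates R(T_i) (proposition 3)
```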
3.3 Regularized SKSP. We also introduce a Tikhonov regularization term to SKSP, as for KSP.

Definition 3 (Regularized SKSP).

min: J[X_i] = (1/L_i) Σ_{s=1}^{L_i} ‖Φ(f_s^i) − X_i Φ(f_s^i)‖² + (α/M_i) Σ_{t=1}^{M_i} ‖X_i Φ(g_t^i)‖² + µ‖X_i‖₂² (3.10)

subject to: N(X_i) ⊃ R(U_i)^⊥, (3.11)
where µ > 0 is the regularization parameter.

Theorem 3. Let

D̃_i = [ L_i I_{L_i}    0_{L_i, M_i}
         0_{M_i, L_i}   (M_i/α) I_{M_i} ]. (3.12)

A solution of regularized SKSP is

P̃'_{R(S_i)} = U_i D_i (K_{U_i} + µD̃_i)^{-1} U_i*. (3.13)
Let

K̃_{U_i} = K_{U_i} + µD̃_i,

h_i(f_x) = U_i* Φ(f_x) = (k(f_x, f_1^i), . . . , k(f_x, f_{L_i}^i), k(f_x, g_1^i), . . . , k(f_x, g_{M_i}^i))ᵀ.

Then the similarity between an unknown input vector f_x and Ω_i against Ω̄_i is given as

‖P̃'_{R(S_i)} Φ(f_x)‖² = ⟨h_i(f_x), K̃_{U_i}^{-1} D_i K_{U_i} D_i K̃_{U_i}^{-1} h_i(f_x)⟩. (3.14)
4 Experiments

We used several practical data sets that were used in Rätsch et al. (2001), Mika, Rätsch, Weston, Schölkopf, and Müller (1999), and Rätsch (2001). (The data sets were downloaded from http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm.) The collection consists of 13 binary classification problems, and each data set consists of 100 or 20 realizations. We used the gaussian kernel function (see equation 1.5) in our approach. The parameters of the kernel function, the regularization parameter, and α were obtained with a fivefold cross validation. We extracted training samples from the first five realizations of each data set. For each of these realizations, a fivefold cross-validation procedure gives the best model among several parameters. About 10 values were tested for each parameter in two stages: first, a global search was done to find a good guess for the parameter; the guess became more precise in the second stage. The model parameters were obtained as the median of the five estimations and were used throughout the training on all 100 realizations of the data set. If there were identical samples, we added only one of them to the learning set, because the kernel Gram matrix is singular if identical samples exist. The kernel sample space was not changed by this operation. The
Table 1: Mean Test Error Rates and Their Standard Deviations.

Data set         | SKSP        | KSP          | AB Reg      | SVM          | KFD
-----------------|-------------|--------------|-------------|--------------|------------
Banana           | 10.5 ± 0.4  | 11.3 ± 0.6   | 10.9 ± 0.4  | 11.5 ± 0.7   | 10.8 ± 0.5
Breast Cancer    | 25.0 ± 4.9  | 27.0 ± 4.1   | 26.5 ± 4.5  | 26.0 ± 4.7   | 24.5 ± 4.6
Diabetes         | 23.1 ± 1.6  | 26.5 ± 1.8   | 23.8 ± 1.8  | 23.5 ± 1.73  | 23.2 ± 1.6
Flare Solar      | 44.8 ± 1.8  | 46.8 ± 4.7   | 34.2 ± 2.2  | 32.4 ± 1.8   | 33.2 ± 1.7
German           | 23.2 ± 2.2  | 25.5 ± 2.4   | 24.7 ± 2.4  | 23.6 ± 2.1   | 23.7 ± 2.2
Heart            | 15.8 ± 2.2  | 18.8 ± 3.5   | 16.5 ± 3.5  | 16.0 ± 3.3   | 16.1 ± 3.4
Image            | 2.6 ± 0.5   | 3.2 ± 0.6    | 2.7 ± 0.6   | 3.0 ± 0.6    | 4.8 ± 0.6
Ringnorm         | 15.5 ± 1.9  | 41.4 ± 17.8  | 1.6 ± 0.1   | 1.7 ± 0.1    | 1.5 ± 0.1
Splice           | 14.2 ± 0.7  | 12.2 ± 0.9   | 9.5 ± 0.7   | 10.9 ± 0.7   | 10.5 ± 0.6
Thyroid          | 3.9 ± 2.2   | 3.9 ± 2.3    | 4.6 ± 2.2   | 4.8 ± 2.2    | 4.2 ± 2.1
Titanic          | 27.3 ± 8.8  | 31.1 ± 13.5  | 22.6 ± 1.2  | 22.4 ± 1.0   | 23.3 ± 2.1
Twonorm          | 2.3 ± 0.1   | 2.8 ± 0.1    | 2.7 ± 0.2   | 3.0 ± 0.2    | 2.6 ± 0.2
Waveform         | 10.7 ± 0.7  | 11.1 ± 0.4   | 9.8 ± 0.8   | 9.9 ± 0.4    | 9.9 ± 0.4
Number of bold   | 7           | 1            | 2           | 2            | 2
Number of italic | 1           | 0            | 3           | 3            | 6

Notes: AB Reg: Regularized AdaBoost. KFD: kernel Fisher discriminant. In the original table, the best result in each row is set in boldface and the second best in italics; the last two rows count these per method.
Table 2: Comparison of Run Times (Seconds).

                  | KSP  | SKSP | SVM
Learning stage    | 0.19 | 0.53 | 3.40
Recognition stage | 0.13 | 2.02 | 0.04
suppression set Ω̄_i of class i consists of all samples in the other class, since the learning sets of these problems are not large. Table 1 lists mean test error rates and their standard deviations. The underlined results show significance by a t-test at the 5% significance level between SKSP and SVM. The results other than KSP and SKSP are taken from the articles mentioned above. Furthermore, we compared the run times of SKSP, KSP, and SVM. Table 2 shows the computational cost in the learning stage and the recognition stage of one realization of the German data set. We used the SVM Toolbox for Matlab, provided by A. Schwaighofer, to compare run times. For KSP and SKSP, we used ordinary Matlab code. We used a computer with a Pentium 4 3-GHz CPU and 2-Gbyte memory for the computations. To show that the reduction of dimension in the feature space is often not needed, we provide another experimental result. We present the
Figure 3: Error rates of the KPCA classifier with respect to its rank (error rate [%] versus rank of KPCA, 0–30, for the 13 benchmark data sets: banana, breast-cancer, diabetis, flare-solar, german, heart, image, ringnorm, splice, thyroid, titanic, twonorm, waveform).
error rates of the KPCA classifier with respect to its rank in Figure 3. Kernel parameters that achieved the best result were chosen at each rank from several preset parameters. The SKSP classifier had the lowest error rates in many problems. Thus, we can say that SKSP can suppress the effect of features that overlap with the other class and can extract important features. In KSP and SKSP, the error rates on the Ringnorm data set were high. Data of each class are sampled from an isotropic normal distribution. The means of the two classes are almost the same, and their variances are different. Since any pair of random variables of an isotropic normal distribution is independent and their kernel covariance operator vanishes (Bach & Jordan, 2002), it is difficult to extract features by a subspace. According to Figure 3, KPCA cannot extract their features either. Figure 3 shows that error rates are stable and low at higher ranks in many classification problems. These results support the validity of the concept of KSP. The sharp increases of error rates in some problems are due to ill-posedness, which can be suppressed by the regularization introduced to KSP and SKSP.

5 Discussion

Here, we compare SKSP with other classification methods and clarify their differences.
1944
Y. Washizawa and Y. Yamashita
The remarkable difference between SKSP and SVM or KFD is that SKSP is a quadratic discriminant function, whereas SVM and KFD are linear discriminant functions in the feature space. For classification in the input space (i.e., without a kernel method), quadratic classifiers generally perform better than linear ones because they have more degrees of freedom. When kernel methods are used, the advantage of a quadratic classifier is not obvious a priori, because all the discriminant functions are nonlinear; however, the advantages of SKSP were shown by our experiments. In SVM, a separating hyperplane is determined by a few samples called support vectors (SVs). The separating hyperplane depends only on samples around the boundary and not on the other samples or their distribution. If there are noisy samples or outliers in the learning set, they will degrade the separating hyperplane because they become SVs with high probability. Thus, SVM is not fundamentally robust, even if regularizing methods such as the soft margin (Cortes & Vapnik, 1995) or ν-SVM (Schölkopf, Bartlett, Smola, & Williamson, 1999) are used. In general, a classifier has a trade-off between robustness and sparseness in relation to the number of samples. The computational cost of SVM in the recognition stage is low because its solution is sparse. Let k and s be the computational costs of evaluating a kernel function and of a multiplication, respectively, and let L and L_SV be the numbers of learning samples and SVs, respectively. The main computational costs in the recognition stage are then given as:

SVM: (k + s) L_SV
KFD: (k + s) L
SKSP: s L² + (k + s) L.
Note that s < k and L_SV < L. Only SKSP has a second-order term in L, because it is a quadratic discriminant in the feature space. However, as mentioned in section 3, not all samples belonging to the other classes have to be included; hence, we can decrease L and thus the computational cost. The SVM learning stage is time-consuming because a quadratic optimization problem must be solved. KFD involves solving an eigenvalue problem or a quadratic optimization problem transformed from it (Mika et al., 2000). On the other hand, the SKSP solution is given in closed form, with one matrix inversion and several multiplications; moreover, as stated above, not all the samples belonging to the other classes have to be used. Computing the inverse and the multiplications takes O(L³) steps. Generally, matrix inversion is easier than quadratic optimization. Thus, the computational cost of KSP or SKSP is lower than that of SVM in the learning stage.
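The recognition-stage counts above can be tabulated directly. The sketch below uses hypothetical values for the per-kernel-evaluation cost k, the per-multiplication cost s, the learning-set size L, and the number of support vectors L_SV; only the formulas come from the text.

```python
# Sketch: recognition-stage operation counts quoted above, for illustrative
# (hypothetical) values of k, s, L, and L_SV.

def recognition_cost(method, k, s, L, L_sv=None):
    """Return the dominant recognition-stage cost for one test sample."""
    if method == "SVM":
        return (k + s) * L_sv          # one kernel evaluation + weight per SV
    if method == "KFD":
        return (k + s) * L             # every learning sample contributes
    if method == "SKSP":
        return s * L**2 + (k + s) * L  # the quadratic form adds an s*L^2 term
    raise ValueError(method)

# Illustrative numbers only (k > s and L_SV < L, as noted in the text).
k, s, L, L_sv = 10.0, 1.0, 700, 400
costs = {m: recognition_cost(m, k, s, L, L_sv if m == "SVM" else None)
         for m in ("SVM", "KFD", "SKSP")}
```

With any such values the ordering SVM < KFD < SKSP holds, reflecting the sparseness of the SVM solution and the second-order term of SKSP.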
Moreover, when there is a huge number of learning samples, we can easily introduce the multitemplate method to SKSP or KSP because they are not two-class classifiers. In the multitemplate method, subclasses of a class are prepared, and an input vector is classified into one of the subclasses. On the other hand, employing this method for two-class classifiers (e.g., SVM or KFD) is useless because they must use all samples of all classes.

Appendix

We prepare the following lemmas and corollaries for the proofs of theorems 2 and 3.

Lemma 1 (Israel & Greville, 1974). Let H_1, H_2, H_3, and H_4 be Hilbert spaces and A ∈ B(H_3, H_4), B ∈ B(H_1, H_2), C ∈ B(H_1, H_4), where B(H, H′) denotes the set of bounded linear operators from H to H′. Assume that R(A), R(B), and R(C) are closed. Then the operator equation

AXB = C,   (A.1)

has a solution X ∈ B(H_2, H_3) when R(A) ⊃ R(C) and N(B) ⊂ N(C). A general form of the solution is given by

X = A† C B† + Y − A† A Y B B†,   (A.2)

where Y is an arbitrary operator in B(H_2, H_3).

Corollary 1. Let A ∈ B(H_1, H_2), B ∈ B(H_1, H_3). If R(A) and R(B) are closed, the operator equation

A = XB,   (A.3)

has a solution when N(B) ⊂ N(A).

Proof of Proposition 1.
P_{R(S_i)} P_{R(S_i)} = U_i D_i K_{U_i}^{-1} U_i^* U_i D_i K_{U_i}^{-1} U_i^*
= U_i D_i D_i K_{U_i}^{-1} U_i^*
= P_{R(S_i)}.   (A.4)
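Both lemma 1's general solution and the idempotency just shown can be checked numerically in a finite-dimensional sketch. Real matrices stand in for the operators; the shapes, and the diagonal 0/1 form of D (which supplies the property D D = D used in the middle step above), are illustrative assumptions.

```python
# Finite-dimensional numerical sketch of lemma 1 and of proposition 1's
# idempotency. For lemma 1 we force R(A) ⊃ R(C) and N(B) ⊂ N(C) by
# constructing C = A C0 B for a random C0.
import numpy as np

rng = np.random.default_rng(0)

# --- Lemma 1: X = A† C B† + Y − A† A Y B B† solves A X B = C. ---
d1, d2, d3, d4 = 5, 4, 3, 6
A = rng.standard_normal((d4, d3))            # A in B(H3, H4)
B = rng.standard_normal((d2, d1))            # B in B(H1, H2)
C = A @ rng.standard_normal((d3, d2)) @ B    # ensures the range/null conditions
Ap, Bp = np.linalg.pinv(A), np.linalg.pinv(B)
Y = rng.standard_normal((d3, d2))            # arbitrary operator in B(H2, H3)
X = Ap @ C @ Bp + Y - Ap @ A @ Y @ B @ Bp    # equation A.2
assert np.allclose(A @ X @ B, C)

# --- Proposition 1: P = U D K_U^{-1} U^T is idempotent when D D = D. ---
U = rng.standard_normal((8, 5))
K_U = U.T @ U                                # Gram matrix, nonsingular here
D = np.diag([1.0, 1.0, 1.0, 0.0, 0.0])      # selects the first coordinates
P = U @ D @ np.linalg.inv(K_U) @ U.T
assert np.allclose(P @ P, P)
```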
Proof of Proposition 2.

P_{R(S_i)} P_{R(U_i)} = U_i D_i K_{U_i}^{-1} U_i^* U_i K_{U_i}^{-1} U_i^* = P_{R(S_i)},
P_{R(U_i)} P_{R(S_i)} = U_i K_{U_i}^{-1} U_i^* U_i D_i K_{U_i}^{-1} U_i^* = U_i D_i K_{U_i}^{-1} U_i^* = P_{R(S_i)}.
Proof of Proposition 3. v is expressed as

v = Σ_{j=1}^{M_i} ν_j φ(g_j) = U_i v′,

where

v′ = (0 ⋯ 0 ν_1 ν_2 ⋯ ν_{M_i})^T,

with L_i leading zeros. Then

P_{R(S_i)} v = U_i D_i K_{U_i}^{-1} U_i^* v = U_i D_i K_{U_i}^{-1} U_i^* U_i v′ = U_i D_i v′ = 0.
Proofs of Theorems 2 and 3. Here we omit the symbol i for the class, for brevity. When we let µ = 0 in equation 3.10, it reduces to equation 3.4; thus, we let µ ≥ 0. Since ‖u − P_{R(U)} v‖ ≤ ‖u − v‖ for all u ∈ R(U) and all v ∈ H, X can be expressed as X = U B. From corollary 1, X in definition 1 can also be expressed as X = C U^*. Then we can let

X = U A U^*,   (A.5)
where A is a real matrix of size (L + M) × (L + M). Equation 3.10 yields

J = (1/L) Σ_{s=1}^{L} ‖φ(f_s) − U A U^* φ(f_s)‖² + (α/M) Σ_{t=1}^{M} ‖U A U^* φ(g_t)‖² + µ ‖U A U^*‖₂²   (A.6)

  = (1/L) Σ_{s=1}^{L} [ k(f_s, f_s) − 2⟨φ(f_s), U A U^* φ(f_s)⟩ + ⟨U A U^* φ(f_s), U A U^* φ(f_s)⟩ ]
    + (α/M) Σ_{t=1}^{M} ⟨U A U^* φ(g_t), U A U^* φ(g_t)⟩ + µ tr[U A^* U^* U A U^*]

  = (1/L) Σ_{s=1}^{L} [ k(f_s, f_s) − 2⟨U^* φ(f_s), A U^* φ(f_s)⟩ + ⟨A U^* φ(f_s), K_U A U^* φ(f_s)⟩ ]
    + (α/M) Σ_{t=1}^{M} ⟨A U^* φ(g_t), K_U A U^* φ(g_t)⟩ + µ tr[A^* K_U A K_U],

where φ(·) denotes the map into the feature space H.
Note that U^* φ(f_s) and U^* φ(g_t) are the s-th and (L + t)-th columns of K_U, respectively. We define D and D̃ as in equations 3.6 and 3.12, respectively. Then J is expressed as

J = (1/L) tr[ K_U D − 2 (K_U D)^* A K_U D + (K_U D)^* A^* K_U A K_U D ]
    + (α/M) tr[ (K_U D)^* A^* K_U A K_U D ] + µ tr[ A^* K_U A K_U ]
  = tr[ (1/L) K_U D − (2/L) D K_U A K_U D + K_U A^* K_U A K_U D̃^{-1} + µ A^* K_U A K_U ],   (A.7)

since D^* = D, D̃^* = D̃, and K_U^* = K_U. Hence, the variation of J with respect to A is given as

δJ = tr[ K_U (δA)^* K_U A K_U D̃^{-1} + K_U A^* K_U (δA) K_U D̃^{-1} − (2/L) D K_U (δA) K_U D
        + µ (δA)^* K_U A K_U + µ (δA) K_U A^* K_U ]
   = tr[ (δA)^* ( K_U A K_U D̃^{-1} K_U − (1/L) K_U D K_U + µ K_U A K_U )
        + (δA) ( K_U D̃^{-1} K_U A^* K_U − (1/L) K_U D K_U + µ K_U A^* K_U ) ]
   = 2 tr[ (δA)^* ( K_U A K_U D̃^{-1} K_U − (1/L) K_U D K_U + µ K_U A K_U ) ].   (A.8)

J is minimum when

K_U A (K_U + µ D̃) D̃^{-1} K_U = (1/L) K_U D K_U.   (A.9)
In the case of µ > 0, lemma 1 gives

A (K_U + µ D̃) D̃^{-1} = (1/L) K_U† K_U D K_U K_U† + W − K_U† K_U W K_U K_U†
                      = W + K_U† K_U ( (1/L) D − W ) K_U K_U†,   (A.10)
where W is an arbitrary operator. Let W′ = W − (1/L) D. It follows that

A (K_U + µ D̃) D̃^{-1} = (1/L) D + W′ − K_U† K_U W′ K_U K_U†.
Then we have

A = (1/L) D D̃ (K_U + µ D̃)^{-1} + W′ D̃ (K_U + µ D̃)^{-1} − K_U† K_U W′ K_U K_U† D̃ (K_U + µ D̃)^{-1}.   (A.11)

Since (1/L) D D̃ = D, the X that minimizes J is given as

X = U A U^*
  = U D (K_U + µ D̃)^{-1} U^* + U W′ D̃ (K_U + µ D̃)^{-1} U^* − U K_U† K_U W′ K_U K_U† D̃ (K_U + µ D̃)^{-1} U^*.   (A.12)

Since K_U = U^* U, we have R(K_U) = R(U^*), and hence

K_U† K_U U^* = U^*,
U K_U† K_U = U.

It is clear that

U^* (U D̃^{-1} U^* + µ I) = (U^* U + µ D̃) D̃^{-1} U^*,

so that

(K_U + µ D̃)^{-1} U^* = D̃^{-1} U^* (U D̃^{-1} U^* + µ I)^{-1}.

Then we have

K_U K_U† D̃ (K_U + µ D̃)^{-1} U^* = K_U K_U† U^* (U D̃^{-1} U^* + µ I)^{-1}
                                 = U^* (U D̃^{-1} U^* + µ I)^{-1}
                                 = D̃ (K_U + µ D̃)^{-1} U^*.

Hence, equation A.12 yields

X = U D (K_U + µ D̃)^{-1} U^* + U W′ D̃ (K_U + µ D̃)^{-1} U^* − U W′ D̃ (K_U + µ D̃)^{-1} U^*
  = U D (K_U + µ D̃)^{-1} U^*.   (A.13)
In the case of µ = 0, if K_U is nonsingular, equation A.9 yields

A = D K_U^{-1},   (A.14)
X = U D K_U^{-1} U^*.   (A.15)
References

Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Groetsch, C. W. (1993). Inverse problems in the mathematical sciences. Braunschweig: Vieweg.
Israel, A. B., & Greville, T. N. E. (1974). Generalized inverses: Theory and applications. New York: Wiley.
Luenberger, D. G. (1969). Optimization by vector space methods. New York: Wiley.
Maeda, E., & Murase, H. (1999). Multi-category classification by kernel based nonlinear subspace method. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (Vol. 2, pp. 1025–1028). Piscataway, NJ: IEEE Press.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K.-R. (1999). Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, & S. Douglas (Eds.), Neural networks for signal processing IX (pp. 41–48). Piscataway, NJ: IEEE.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A., & Müller, K. (2000). Invariant feature extraction and classification in kernel spaces. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 526–532). Cambridge, MA: MIT Press.
Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., & Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2), 181–201.
Murata, H., & Onoda, T. (2001). Applying kernel based subspace classification to a non-intrusive monitoring for household electric appliances. In G. Dorffner, H. Bischof, & K. Hornik (Eds.), International Conference on Artificial Neural Networks (ICANN) (pp. 692–698). Berlin: Springer-Verlag.
Oja, E. (1983). Subspace methods of pattern recognition. New York: Wiley.
Rätsch, G. (2001). Robust boosting via convex optimization. Unpublished doctoral dissertation, University of Potsdam.
Rätsch, G., Onoda, T., & Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning, 42(3), 287–320.
Rohde, C. A. (1965). Generalized inverses of partitioned matrices. Journal of the Society for Industrial and Applied Mathematics, 13, 1033–1035.
Schölkopf, B., Bartlett, P., Smola, A., & Williamson, R. (1999). Shrinking the tube: A new support vector regression algorithm. In M. S. Kearns, S. A. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Schölkopf, B., Mika, S., Burges, C., Knirsch, P., Müller, K.-R., Rätsch, G., & Smola, A. (1999). Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5), 1000–1017.
Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299–1319.
Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of ill-posed problems. Silver Spring, MD: V. H. Winston and Sons.
Tsuda, K. (1999). Subspace classifier in the Hilbert space. Pattern Recognition Letters, 20(5), 513–519.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Washizawa, Y., & Yamashita, Y. (2004). Kernel sample space projection classifier for pattern recognition. In 17th International Conference on Pattern Recognition (Vol. 2, pp. 435–438). Los Alamitos, CA: IEEE CS Press.
Watanabe, S., & Pakvasa, N. (1973, February). Subspace method in pattern recognition. In Proc. 1st Int. Joint Conf. on Pattern Recognition, Washington, DC (pp. 25–32). New York: IEEE Press.
Yamashita, Y., & Ogawa, H. (1996). Relative Karhunen-Loève transform. IEEE Transactions on Signal Processing, 44(2), 1031–1033.
Received March 16, 2004; accepted December 16, 2005.
LETTER
Communicated by Peter Latham
Implications of Neuronal Diversity on Population Coding Maoz Shamir [email protected] Center for BioDynamics, Boston University, Boston, MA 02215, U.S.A.
Haim Sompolinsky halm@fiz.huji.ac.il Racah Institute of Physics and Center for Neural Computation, Hebrew University of Jerusalem, Jerusalem 91904, Israel
In many cortical and subcortical areas, neurons are known to modulate their average firing rate in response to certain external stimulus features. It is widely believed that information about the stimulus features is coded by a weighted average of the neural responses. Recent theoretical studies have shown that the information capacity of such a coding scheme is very limited in the presence of the experimentally observed pairwise correlations. However, central to the analysis of these studies was the assumption of a homogeneous population of neurons. Experimental findings show a considerable measure of heterogeneity in the response properties of different neurons. In this study, we investigate the effect of neuronal heterogeneity on the information capacity of a correlated population of neurons. We show that the information capacity of a heterogeneous network is not limited by the correlated noise but scales linearly with the number of cells in the population. This information cannot be extracted by the population vector readout, whose accuracy is greatly suppressed by the correlated noise. On the other hand, we show that an optimal linear readout that takes into account the neuronal heterogeneity can extract most of this information. We study analytically the nature of the dependence of the optimal linear readout weights on the neuronal diversity. We show that simple online learning can generate readout weights with the appropriate dependence on the neuronal diversity, thereby yielding efficient readout.

1 Introduction

In many areas of the central nervous system, information on specific stimulus features is coded by the average firing rates of a large population of neurons (see Hubel & Wiesel, 1962; Georgopoulos, Schwartz, & Kettner, 1982; Razak & Fuzessery, 2002; Coltz, Johnson, & Ebner, 2000). Recently, Yoon and Sompolinsky (1999) and Sompolinsky, Yoon, Kang, and Shamir (2001) have

Neural Computation 18, 1951–1986 (2006)
C 2006 Massachusetts Institute of Technology
1952
M. Shamir and H. Sompolinsky
shown that information coded by the mean responses of a neural population is greatly suppressed in the presence of the experimentally observed positive correlations. Thus, in the presence of noise correlations, there exists a finite amount of noise in the neural representation of the stimulus. This noise cannot be overcome by increasing the population size. However, central to the analysis of Sompolinsky et al. was the assumption of a homogeneous population of neurons. Empirical observations show a considerable measure of heterogeneity in the response properties of different neurons (Ringach, Shapley, & Hawken, 2002). Here we study the possible role of neuronal diversity in coding for information. We address two questions. First, what is the effect of heterogeneity on the information capacity? In particular, we are interested in the scaling of the information capacity with the number of cells in the population. Second, how does neuronal diversity affect biological readout mechanisms? We address these questions in the context of a statistical model for the responses of a population of N neurons coding for an angular variable, θ , which we term the stimulus, such as the direction of arm movement during a simple reaching task or the orientation of a grating stimulus. Below we define the statistical model of the neural responses and review the main results for a homogeneous population of neurons. We use the Fisher information (see, e.g., Thomas & Cover, 1991; Kay, 1993) and the accuracy of biologically plausible readout mechanisms as measures of information capacity. The first question is addressed in section 2, where the Fisher information of a diverse population of neurons is studied. In section 3 we investigate the efficiency of linear readout mechanisms. 
First, we study the population vector (see Georgopoulos, Schwartz, & Kettner, 1986) readout; then we investigate the optimal linear estimator (see Salinas & Abbott, 1994) and show a different scaling of their efficiencies with the population size in the presence of correlations. Finally we summarize our results in section 4 and discuss further extensions of our theory. 1.1 The Statistical Model. We consider a system of N neurons coding for an angle, θ ∈ [−π, π). Let ri denote the activity of the ith neuron during a single trial; ri can be thought of as the number of spikes the ith neuron has fired within a specific time interval around the onset of a stimulus, θ . Assuming the firing rates of the neurons are sufficiently high, we model the probability distribution of the neural responses during a single trial, given the stimulus θ , according to a multivariate gaussian distribution: P(r | θ ) =
(1/Z) exp[ −(1/2) (r − m(θ))^T C^{-1} (r − m(θ)) ].   (1.1)
Here m_i(θ), the tuning curve of neuron i, is the mean activity of the ith neuron averaged over many trials with the same stimulus, θ; C is the firing rate covariance matrix; X^T denotes the transpose of X; and Z is a normalization constant. We model the single neuron tuning curve by a unimodal
Implications of Neuronal Diversity on Population Coding
1953
bell-shaped function of θ with a maximum at the neuron’s preferred direction, mi (θ ) = m(εi , φi − θ ),
(1.2)
where φ_i is the preferred direction of neuron i. We take the preferred directions of the neurons to be evenly spaced on the ring: φ_k = −π(N + 1)/N + 2πk/N. The ε_i is a set of parameters characterizing the tuning curve of the ith neuron and representing the deviation from homogeneity. For example, ε_i can quantify how much broader or narrower the tuning curve of neuron i is than the stereotypical tuning curve, or it can represent the ratio between the tuning amplitude of the ith neuron and the tuning amplitude of the stereotypical tuning curve. Here, for simplicity, we shall assume that ε_i is a scalar. It is important to note that ε_i is a number that characterizes the tuning curve of neuron i; it is a property of the ith neuron and does not change from trial to trial (we ignore effects of plasticity and learning). However, different neurons in the population may have different values of their ε parameters, reflecting the heterogeneity of the population. Different neural populations can be characterized by different realizations of their set {ε_i}_{i=1}^N. We shall assume the {ε_i}_{i=1}^N are independent and identically distributed random variables that are independent of the stimulus, that is, P({ε_i}) = ∏_{i=1}^N p(ε_i). We distinguish between two sources of randomness in this model. One source is the "warm fluctuations," represented by the trial-to-trial variability of the neural responses, equation 1.1. The second is the "quenched disorder," represented by the heterogeneity of the tuning curves of the different neurons, reflected by the distribution of the {ε_i}. Throughout this article, we will be interested in calculating quantities that involve spatial averaging over the entire population. The value of such quantities will depend on the specific realization of the neural heterogeneity, the {ε_i}, and will fluctuate from one realization of the neuronal heterogeneity to another. We can characterize such a quantity by its statistics with respect to the quenched randomness.
Although for local quantities the quenched fluctuations may be considerable, they are uncorrelated spatially; hence, the quenched fluctuations of quantities that involve spatial averaging, relative to their means, will decrease to zero in the large N limit. This property of a random variable with vanishing standard deviation to mean ratio, in the large N limit, is called self-averaging. Note that the value of a self-averaging quantity in a typical system will be equal to its mean across different systems. The practical implications of this property are that one can replace a self-averaging quantity by its quenched mean for large systems. Hence, instead of computing self-averaged quantities for a specific realization of the neuronal heterogeneity, we can calculate the average of this quantity over different realizations of the heterogeneity. We denote by angular brackets averaging with respect to the trialto-trial fluctuations of the neural responses, given a specific stimulus:
1954
M. Shamir and H. Sompolinsky
⟨X⟩ = ∫ dr X P(r | θ). This averaging is done with a fixed set of parameters, {ε_i}, reflecting the fact that the single-neuron tuning curves are fixed and unchanged across many different trials. Averaging over the quenched disorder is denoted by double angular brackets, ⟨⟨X({ε_i})⟩⟩ = ∏_{i=1}^N ∫ dε_i p(ε_i) X({ε_i}). Fluctuations with respect to the distribution of the neural responses in a given system are denoted by δ, that is, δX ≡ X − ⟨X⟩. We use ∆ to denote quenched fluctuations: ∆X ≡ X − ⟨⟨X⟩⟩. It is convenient to write the tuning curve of neuron i as the sum of its quenched average, ⟨⟨m_i(θ)⟩⟩, plus a fluctuation ∆m_i(θ):

m_i(θ) = f(φ_i − θ) + ∆m_i(θ),   (1.3)
f(φ_i − θ) ≡ ⟨⟨m_i(θ)⟩⟩.   (1.4)
Note that in the last equation, we used equation 1.2 and the fact that the statistics of the {ε_i} are independent of the neuronal preferred directions and the stimulus. Similar to the single-cell tuning curve, we model f(θ) by a smooth bell-shaped function that peaks at θ = 0. Specifically, in our numerical simulations, we used the following average tuning curve,

f(θ) = (f_max − f_ref) exp[ (cos(θ) − 1) / σ² ] + f_ref,   (1.5)
where σ, (f_max − f_ref), and f_ref are the tuning width, the tuning amplitude, and a reference value for the stereotypical average tuning curve f(θ), respectively. For a given stimulus, the quenched fluctuations of the tuning curves, {∆m_i(θ)} (see equation 1.3), are a set of zero-mean independent random variables with respect to the statistics of the quenched disorder. An important quantity for the calculation of the Fisher information is the derivative of the ith tuning curve with respect to θ, m_i′ = ∂m_i/∂θ. The quenched fluctuations of the tuning curve derivatives, {∆m_i′(θ)}, are also smooth periodic functions of θ, as differences of such functions. Using the independence of the {ε_i} and equation 1.2, one obtains

m_i′(θ) = f′(φ_i − θ) + ∆m_i′(θ),   (1.6)
⟨⟨∆m_i′(θ)⟩⟩ = ∫ p(ε_i) (∂/∂θ) ∆m_i(ε_i, θ) dε_i = (d/dθ) ∫ p(ε_i) ∆m_i(ε_i, θ) dε_i = 0,   (1.7)
⟨⟨∆m_i′(θ) ∆m_j′(θ)⟩⟩ = δ_ij K(φ_i − θ),   (1.8)

where K(φ_i − θ) is the variance of the tuning curve derivative of a neuron with preferred direction φ_i, given a stimulus θ, with respect to the quenched disorder. We further assume that the quenched fluctuations of the tuning curve derivatives, {∆m_i′}, follow gaussian statistics. In section 2, where we
Implications of Neuronal Diversity on Population Coding
1955
study the Fisher information of a heterogeneous population of neurons, we use equations 1.6 to 1.8 to define the gaussian statistics of the quenched fluctuations of tuning curves. For small, quenched fluctuations, one can expand the tuning curves, equation 1.2, in powers of εi and approximate mi (θ ) = f (φi − θ ) + εi g(φi − θ )
(1.9)
where g(θ ) = ∂m(ε = 0, θ )/∂ε. A simple example, where the approximation of equation 1.9 is exact, is in the case where mi (θ ) = f (φi − θ )(1 + εi ).
(1.10)
We term this model the amplitude diversity model. In section 3, where we address the question of readout, we use the specific form of equation 1.10 for the tuning curves in order to make the analysis analytically tractable. This model, equation 1.10, is also used for all of the numerical results presented in this article. We further assume, in section 3 and in all of the numerical results, that the {ε_i} are independent and identically distributed (i.i.d.) gaussian random variables with zero mean and variance κ,

⟨⟨ε_i⟩⟩ = 0   ∀ i,   (1.11)
⟨⟨ε_i ε_j⟩⟩ = δ_ij κ   ∀ i, j.   (1.12)

In this case, ∆m_i′ = −ε_i f′(φ_i − θ) and

⟨⟨(∆m_i′(θ))²⟩⟩ = K(φ_i − θ) = κ |f′(φ_i − θ)|².   (1.13)
We assume the response covariance of two neurons i and j, C_ij, is independent of the stimulus, θ, and depends only on the functional distance between the two neurons, that is, the difference in their preferred directions. A decrease in the response covariance of different neurons with the increase in their functional distance has been reported in cortical areas (see, e.g., Zohary, Shadlen, & Newsome, 1994; Lee, Port, Kruse, & Georgopoulos, 1998; van Kan, Scobey, & Gabor, 1985; Mastronarde, 1983). Specifically, in all of our numerical results, we have used exponentially decaying correlations,

C_ij = C(φ_i − φ_j) = a [ δ_ij + c (1 − δ_ij) exp( −|φ_i − φ_j| / ρ ) ],   (1.14)
where a is the variance of the single-neuron response, c and ρ are the correlation strength and correlation length, respectively, and |φi − φ j | ∈ [0, π] is the functional distance between neurons i and j. Note that in
1956
M. Shamir and H. Sompolinsky
this simple model, we did not incorporate any measure of diversity in the higher-order statistics of the neuronal responses. Hence, the Fourier modes of the system are eigenvectors of the covariance matrix. We assume that in the biologically relevant regime, every neuron is correlated with a substantial fraction of the entire population. Mathematically, this means that as we consider the limit of large N, both ρ and c remain finite. In this regime, the eigenvalues of the covariance matrix, C, scale linearly with the size of the system, N. For all numerical results presented in this article, we used the amplitude diversity model for the neuronal tuning curves, equation 1.10, with the following parameters: ρ = 1, c = 0.4, a = 40 [sec⁻¹], σ = 1/√2, f_max = 60 [sec⁻¹], f_ref = 20 [sec⁻¹], and κ = 0.25, unless stated otherwise. Note that these parameters are given as rates, that is, counts per second. In order to obtain the spike count statistics in a given time interval T, the tuning curves and correlation strength were scaled by a factor of T; unless stated otherwise, we used T = 0.5 [sec].

1.2 The Fisher Information. Throughout this article, we will be interested in studying the efficiency of different estimators, θ̂(r), of the stimulus θ. We define the efficiency of an estimator by the inverse of its average quadratic estimation error, 1/⟨(θ̂ − θ)²⟩. It is convenient to distinguish between two sources of estimation error: the bias, b = ⟨θ̂⟩ − θ, and the trial-to-trial variability, ⟨(δθ̂)²⟩. The Fisher information (see, e.g., Thomas & Cover, 1991; Kay, 1993) is given by
J = ⟨ ( ∂ log P(r | θ) / ∂θ )² ⟩.   (1.15)
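As a concrete illustration, the statistical model of section 1.1, with the parameter values quoted above, can be simulated in a few lines. This is a hedged sketch, not the authors' code: the functional distance |φ_i − φ_j| is taken as the distance on the ring, and one trial of spike counts is drawn from the gaussian approximation of equation 1.1.

```python
# Sketch of the statistical model of section 1.1 (amplitude diversity tuning
# curves, eq. 1.10, and exponentially decaying correlations, eq. 1.14),
# scaled by T to give spike-count statistics. Illustrative, not authoritative.
import numpy as np

N, T = 100, 0.5
rho, c, a = 1.0, 0.4, 40.0
sigma, f_max, f_ref, kappa = 1 / np.sqrt(2), 60.0, 20.0, 0.25

rng = np.random.default_rng(0)
phi = -np.pi * (N + 1) / N + 2 * np.pi * np.arange(1, N + 1) / N  # preferred directions
eps = rng.normal(0.0, np.sqrt(kappa), N)                          # quenched disorder

def f(theta):  # stereotypical tuning curve, eq. 1.5
    return (f_max - f_ref) * np.exp((np.cos(theta) - 1) / sigma**2) + f_ref

def mean_rates(theta):  # amplitude diversity model, eq. 1.10
    return f(phi - theta) * (1 + eps)

# Covariance, eq. 1.14; |phi_i - phi_j| taken as the ring distance in [0, pi].
d = np.abs(phi[:, None] - phi[None, :])
d = np.minimum(d, 2 * np.pi - d)
C = a * np.where(np.eye(N, dtype=bool), 1.0, c * np.exp(-d / rho))

# One trial of spike counts for stimulus theta = 0 (gaussian approximation).
r = rng.multivariate_normal(T * mean_rates(0.0), T * C)
```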
From the Cramér-Rao inequality, the square estimation error of any readout θ̂(r) is bounded by

⟨(θ̂ − θ)²⟩ = ⟨(δθ̂)²⟩ + b(θ)²,   (1.16)
⟨(δθ̂)²⟩ ≥ (1 + b′)² / J,   (1.17)

where b′ = db/dθ.
The Fisher information of this model is given by

J = m′^T C^{-1} m′,   (1.18)

where m′ = dm/dθ. Note that the Fisher information has the form of a squared signal-to-noise ratio, where the signal is the sensitivity of the neural
Figure 1: The Fisher information and population vector efficiency of a homogeneous population of neurons. The Fisher information is shown by the solid line as a function of the number of cells in the population, N. The analytical result for the population vector efficiency, equation 1.26, is shown by the dashed line. The open circles show the results of a numerical calculation averaging the population vector estimation error over 2000 trials estimating θ = 0. In this figure, κ = 0 was used for a homogeneous population.
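The saturation shown in Figure 1, and its contrast with a heterogeneous population, can be reproduced by evaluating J = m′^T C^{-1} m′ directly. This is a sketch under the model of section 1.1; the tuning-curve derivative for the amplitude diversity model is taken as m_i′ = −(1 + ε_i) f′(φ_i − θ), and the parameter values are those quoted there.

```python
# Sketch: Fisher information J = m'^T C^{-1} m' for the amplitude diversity
# model, contrasting a homogeneous population (kappa = 0, J saturates) with a
# heterogeneous one (kappa > 0, J keeps growing with N), as in figure 1.
import numpy as np

rho, c, a = 1.0, 0.4, 40.0
sigma, f_max, f_ref = 1 / np.sqrt(2), 60.0, 20.0

def fprime(theta):  # derivative of the stereotypical tuning curve, eq. 1.5
    return -(f_max - f_ref) * np.exp((np.cos(theta) - 1) / sigma**2) \
           * np.sin(theta) / sigma**2

def fisher(N, kappa, theta=0.0, seed=0):
    rng = np.random.default_rng(seed)
    phi = -np.pi * (N + 1) / N + 2 * np.pi * np.arange(1, N + 1) / N
    eps = rng.normal(0.0, np.sqrt(kappa), N) if kappa > 0 else np.zeros(N)
    mprime = -(1 + eps) * fprime(phi - theta)   # d m_i / d theta for eq. 1.10
    d = np.abs(phi[:, None] - phi[None, :])
    d = np.minimum(d, 2 * np.pi - d)            # ring distance
    C = a * np.where(np.eye(N, dtype=bool), 1.0, c * np.exp(-d / rho))
    return mprime @ np.linalg.solve(C, mprime)

J_hom = [fisher(N, 0.0) for N in (100, 400)]    # saturates with N
J_het = [fisher(N, 0.25) for N in (100, 400)]   # grows with N
```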
responses to small changes in the stimulus, m′, and the squared noise is represented by the correlation matrix, C.

1.2.1 Fisher Information of an Isotropic Population. In the limit of an isotropic population, K(θ) = 0 (see equation 1.8), the signal, m′ = f′, resides in a low-dimensional subspace of the neural responses, spanned by the slowly varying collective modes of the system. However, due to the correlations, both the squared signal and the noise scale linearly with the size of the system, yielding J = O(1) even in the large N limit (see Sompolinsky et al., 2001, for further discussion). Figure 1 shows the Fisher information of an isotropic system, κ = 0 in the amplitude diversity model (solid line), as a function of the population size, N. As can be seen from the figure, the Fisher information of an isotropic system saturates to a finite value in the limit of large N. In contrast, in a diverse population of neurons, the signal, m′, will have an O(√N) projection on a subspace spanned by eigenvectors
of C corresponding to eigenvalues of O(1). This effect of the neuronal diversity is studied in section 2.

1.3 Linear Readout. A linear readout is an estimator of the form ẑ = Σ_i w_i r_i, where w is a fixed weights vector that defines the readout. It is convenient to adopt complex notation for the stimulus. We denote by z = e^{iθ} a two-dimensional unit vector in the complex plane pointing in the direction of the stimulus, θ. Similarly, the estimator, ẑ = x̂ + iŷ, will represent a two-dimensional vector in the complex plane with θ̂ = arg(ẑ). One can measure the performance of such a readout either by the efficiency of angular estimation, ⟨(θ̂ − θ)²⟩⁻¹, or by the Euclidean distance between z and ẑ. In this work, we employ both measures. We shall call ⟨(θ̂ − θ)²⟩⁻¹ the efficiency of the estimator and ⟨|ẑ − z|²⟩ the Euclidean error. Let E(w) be the Euclidean error of a linear readout with a weights vector w,

E(w) = ∫ (dθ/2π) ⟨|ẑ − z|²⟩ = w† Q w − w† U − U† w + 1,   (1.19)
Q = ∫ (dθ/2π) ⟨r r^T⟩,   (1.20)
U = ∫ (dθ/2π) ⟨r⟩ e^{iθ},   (1.21)
where X† denotes the conjugate transpose of X. It is important to note that, being a function of the neural responses, the linear estimator, zˆ = rT w, is a random variable that fluctuates from trial to trial with a probability distribution that depends on the stimulus, θ (see equation 1.1). The Euclidean error, E(w), defined in equation 1.19, incorporates two averaging steps. First is averaging the Euclidean error that results from the trial-to-trial fluctuations of the linear readout, |ˆz − z|2 , for a given stimulus angle, θ . The second is averaging the Euclidean error over the different possible stimuli by integrating over the stimulus angle, θ , assuming a uniform prior. The optimal linear estimator (Salinas & Abbott, 1994) is defined by the set of linear weights, wole , that minimizes E. The optimal linear estimator weights are given by wole = Q−1 U (see also equations 3.4 to 3.6), and the average quadratic estimation error of the optimal linear estimator is given by E(wole ) = 1 − U† Q−1 U. Below we present the main results for the optimal linear estimator efficiency in a correlated homogeneous population of neurons and in an uncorrelated heterogeneous population. The study of the optimal linear estimator performance in a heterogeneous population of correlated cells is the focus of section 3. 1.3.1 Linear Readout in a Correlated Homogeneous Population of Neurons. In the case of a homogeneous population of neurons, K (θ ) = 0 (see equation 1.8), the optimal linear estimator, zˆ ole = rT wole , is given by
(see Shamir & Sompolinsky, 2004)

ẑ_pv = [ f̃^{(1)∗} / ( N ( |f̃^{(1)}|² + c̃_1 ) ) ] Σ_{j=1}^{N} e^{iφ_j} r_j,   (1.22)

where we have used the following definitions for the Fourier transforms:

f̃^{(n)} = (1/N) Σ_{j=1}^{N} f(φ_j) e^{inφ_j} = ∫ (dφ/2π) f(φ) e^{inφ},   (1.23)
c̃_n = (1/N²) Σ_{j,k=1}^{N} C_jk e^{in(φ_j − φ_k)} = ∫ (dφ/2π) C(φ) e^{inφ}.   (1.24)
Note that N c̃_n is the eigenvalue of the correlation matrix, C. The asterisk in the numerator of the right-hand side of equation 1.22, f̃^{(1)∗}, denotes the complex conjugate of the Fourier transform f̃^{(1)}. However, since the average tuning curve, f(θ), is a real, even function of the stimulus, θ, its Fourier transforms are purely real. We shall omit the conjugate notation for real-valued terms hereafter. This linear estimator, equation 1.22, is the population vector readout. In this case, one can show that the population vector is unbiased with respect to its argument and that its average quadratic error, E(PV), is given by

E(PV) = 1 / ( 1 + |f̃^{(1)}|² / c̃_1 ),   (1.25)

which is of O(1) even in the limit of large N. Assuming small angular estimation errors, one can expand θ̂ in the fluctuations of x̂ and ŷ and study the efficiency of the population vector angular estimation. For a homogeneous population, this error will result only from the trial-to-trial fluctuations and will be independent of the stimulus in its magnitude. The efficiency of the population vector, in this case, is given by (see Sompolinsky et al., 2001; details of the calculation of the population vector efficiency also appear in appendix B)

1 / ⟨(δθ̂_pv)²⟩ = 2 |f̃^{(1)}|² / c̃_1 < J.   (1.26)
Thus, in the biologically relevant regime for the correlations, c > 0, c = O(1), ρ = O(1), both the numerator and denominator of equation 1.26 are O(1) in N, and the efficiency of the population vector saturates to a finite limit for large N. This can be seen in Figure 1, which shows the population vector efficiency as a function of the number of cells in the population
Figure 2: Linear readout efficiency in an uncorrelated population of neurons. The population vector efficiency, ⟨(θ̂_pv − θ)²⟩⁻¹ (solid line), is shown as a function of the population size, N. The inverses of the population vector bias, ⟨b²_pv⟩⁻¹, and of the population vector variance, ⟨(δθ̂_pv)²⟩⁻¹, are shown by the open circles and boxes, respectively. The efficiency of the optimal linear estimator, ⟨(θ̂_ole − θ)²⟩⁻¹, is represented by the dashed line, and the inverse of its average variance, ⟨(δθ̂_ole)²⟩⁻¹, by the open triangles. The dotted line shows the population vector efficiency in an uncorrelated homogeneous neural population. The estimators' efficiency was calculated numerically by averaging over 400 different realizations of the neuronal diversity for each point. The estimation error for a given realization of the neuronal diversity was computed by averaging the angular estimation error of the readout over 500 trials of estimating θ = 0. In this figure, c = 0 was used. For the population vector efficiency in a homogeneous neural population (dotted line), κ = 0 was used. (Axes: x, N from 0 to 500; y, OLE and PV efficiency [deg⁻²] from 0 to 0.4.)
(dashed line and open circles). For large systems, the population vector efficiency reaches a size-independent limit.

1.3.2 Linear Readout in a Heterogeneous Population of Uncorrelated Neurons. In a diverse population of neurons, the population vector readout is no longer the optimal linear estimator. Moreover, the population vector estimator is biased. Figure 2 shows the efficiency of the population vector for angular estimation, ⟨(θ̂_pv − θ)²⟩⁻¹, in an uncorrelated (c = 0)
heterogeneous (κ > 0 in the amplitude diversity model, equation 1.10) population of neurons (solid line). The inverse of the population vector bias, ⟨b²_pv⟩⁻¹, is shown by the open circles. The inverse of the population vector variance is shown by the open squares. For comparison, we plot the efficiency of the population vector in a homogeneous population (κ = 0) by the dotted line. From the figure, one can see that in a heterogeneous population of neurons, the efficiency of the population vector is decreased, relative to the homogeneous case, due to the added bias term that scales as ⟨b²_pv⟩ ∝ 1/N; note that the population vector variance remains the same (compare open boxes and dotted line). The dashed line in Figure 2 shows the efficiency of the optimal linear estimator, and the triangles show the inverse of its variance. The performance of the optimal linear estimator is superior to that of the population vector for two reasons. First, the optimal linear estimator is practically unbiased (compare the dashed line and open triangles). Second, the variance of the optimal linear estimator is smaller than the variance of the population vector. This results from the fact that the population vector extracts information only from the slowly varying collective modes of the system, whereas the optimal linear estimator can extract information from the higher-order modes as well, thus increasing its signal-to-noise ratio. However, the efficiency of both readouts scales linearly with the size of the system. Thus, in the case of an uncorrelated population, there is no qualitative difference between the two readouts. Below we show that in the presence of correlations, the neuronal diversity produces a qualitative effect on both the information capacity of the system (section 2) and the efficiency of different readouts (section 3).
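The contrast between the correlated and uncorrelated regimes is easy to reproduce numerically. The sketch below is an illustration, not the article's own code: it assumes a circular gaussian tuning curve f(φ) = exp((cos φ − 1)/σ²) and a correlation matrix with diagonal a and off-diagonal decay c·exp(−|φᵢ − φⱼ|/ρ) (concrete choices in the spirit of equation 1.14), and evaluates the population vector efficiency 2|f̃⁽¹⁾|²/c̃₁ of equation 1.26 using the discrete Fourier sums of equations 1.23 and 1.24.

```python
import numpy as np

def pv_efficiency(N, a=1.0, c=0.3, rho=1.0, sigma=0.5):
    """Population vector efficiency 2|f~(1)|^2 / c~_1 (equation 1.26)."""
    phi = np.linspace(-np.pi, np.pi, N, endpoint=False)   # preferred angles
    f = np.exp((np.cos(phi) - 1.0) / sigma**2)            # assumed tuning curve
    f1 = abs(np.mean(f * np.exp(1j * phi)))               # f~(1), equation 1.23
    # assumed correlation matrix: diagonal a, off-diagonal c*exp(-|phi_i - phi_j|/rho)
    d = np.abs((phi[:, None] - phi[None, :] + np.pi) % (2 * np.pi) - np.pi)
    C = c * np.exp(-d / rho)
    np.fill_diagonal(C, a)
    # c~_1, equation 1.24 (real by symmetry of C)
    c1 = np.real(np.sum(C * np.exp(1j * (phi[:, None] - phi[None, :])))) / N**2
    return 2.0 * f1**2 / c1

for N in (100, 400, 1600):
    print(N, pv_efficiency(N), pv_efficiency(N, c=0.0))
```

With c > 0 the printed efficiency barely changes between N = 400 and N = 1600, because c̃₁ stays O(1) and the readout saturates; with c = 0, c̃₁ = a/N and the efficiency grows in proportion to N, as in the dotted line of Figure 2.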
2 The Fisher Information of a Diverse Correlated Population

The Fisher information, equation 1.18, of this model with K(θ) > 0 (see equation 1.8) is given by
$$J = \sum_{i,j=1}^{N} \left(f_{i}' + m_{i}'\right) C_{ij}^{-1} \left(f_{j}' + m_{j}'\right) = \sum_{i,j=1}^{N} f_{i}'\,C_{ij}^{-1}\,f_{j}' + 2\sum_{i,j=1}^{N} m_{i}'\,C_{ij}^{-1}\,f_{j}' + \sum_{i,j=1}^{N} m_{i}'\,C_{ij}^{-1}\,m_{j}'. \tag{2.1}$$
The Fisher information of a specific system, that is, for a given realization of the {mᵢ}, is a random variable that fluctuates from one realization of the neuronal diversity to another with respect to the quenched distribution of the {mᵢ}. The statistics of the Fisher information can be characterized by its moments. We find (see appendix A) that to a leading order in N,
$$\langle J \rangle = N\bar{K}d + J_{\mathrm{homog}} \tag{2.2}$$
$$J_{\mathrm{homog}} = \mathbf{f}'^{T} C^{-1} \mathbf{f}' = O(1) \tag{2.3}$$
$$\langle(\delta J)^{2}\rangle = 2N\overline{K^{2}}d^{2} + O(1), \tag{2.4}$$
where d is the diagonal element of the inverse of the correlation matrix, d = [C⁻¹]ᵢᵢ; J_homog is the Fisher information of a homogeneous population (κ = 0), which is of O(1) in the presence of correlation; and
$$\bar{K} = \int \frac{d\theta}{2\pi}\, K(\theta) \tag{2.5}$$
$$\overline{K^{2}} = \int \frac{d\theta}{2\pi}\, K(\theta)^{2}. \tag{2.6}$$
From equations 2.2 and 2.4, one finds that for large systems, the sample-to-sample fluctuations of the Fisher information are small relative to its mean, $\sqrt{\langle(\delta J)^{2}\rangle}/\langle J\rangle = O(1/\sqrt{N})$. Hence, for large populations, the Fisher information of a typical system will be equal to its mean across many samples. This property of the Fisher information is an example of self-averaging. Figure 3a shows the mean and Figure 3b the variance of the Fisher information as a function of the population size N. The mean and variance of J were computed by averaging over 400 realizations of the neuronal populations (open circles) in the amplitude diversity model with κ = 0, 0.05, 0.1, 0.25 from bottom to top. The analytical results, equations 2.2 and 2.4, are shown by the solid lines as a function of the population size. Note that κ = 0 is the case of an isotropic population of neurons. In this case, the Fisher information saturates to a finite value in the limit of large N. In contrast, for κ > 0, the Fisher information of the system increases linearly with the population size (top lines), with a slope that is linear in κ. Interestingly, after averaging over the heterogeneity of the population, the statistics of the Fisher information are independent of the stimulus, θ. Hence, for all θ ∈ [−π, π), the Fisher information, J(θ), will be equal to its quenched average up to O(√N) corrections. Figure 4 shows the stimulus fluctuations of the Fisher information for a typical realization of the neuronal diversity in a system of N = 1000 neurons in the amplitude diversity model, equation 1.10, with κ = 0.25. From the figure, one can see that the Fisher information is a smooth function of θ and is equal to ⟨J⟩ up to small fluctuations of O(√N). For local discrimination tasks, linear readout can extract all the information coded by the first-order statistics of the neural responses (see Shamir & Sompolinsky, 2004). Below we study the efficiency of the linear readout for global estimation tasks.
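The self-averaging of the Fisher information can be checked directly. The sketch below is an illustration under stated assumptions, not the article's code: it instantiates the amplitude diversity model at θ = 0 with εᵢ drawn i.i.d. from a gaussian of variance κ, a circular gaussian tuning curve, and an exponentially decaying correlation matrix, and compares the Monte Carlo average of equation 2.1 with the quenched mean of equation 2.2, computed here exactly as J_homog + κ Σᵢ (fᵢ′)² [C⁻¹]ᵢᵢ.

```python
import numpy as np

rng = np.random.default_rng(0)
N, kappa, a, c, rho, sigma = 200, 0.25, 1.0, 0.3, 1.0, 0.5

phi = np.linspace(-np.pi, np.pi, N, endpoint=False)
f = np.exp((np.cos(phi) - 1.0) / sigma**2)      # assumed tuning curve at theta = 0
fp = f * np.sin(phi) / sigma**2                 # derivative of f(phi_i - theta) at theta = 0

# assumed correlation matrix (diagonal a, exponential decay with range rho)
dist = np.abs((phi[:, None] - phi[None, :] + np.pi) % (2 * np.pi) - np.pi)
C = c * np.exp(-dist / rho)
np.fill_diagonal(C, a)
Cinv = np.linalg.inv(C)

J_homog = fp @ Cinv @ fp                        # equation 2.3
# exact quenched mean for eps_i ~ N(0, kappa): the N*Kbar*d term of equation 2.2
predicted = J_homog + kappa * np.sum(fp**2 * np.diag(Cinv))

samples = []
for _ in range(1000):
    eps = rng.normal(0.0, np.sqrt(kappa), N)    # one realization of the quenched diversity
    v = (1.0 + eps) * fp                        # f'_i + m'_i in the amplitude model
    samples.append(v @ Cinv @ v)                # equation 2.1

print(np.mean(samples), predicted, J_homog)
```

The Monte Carlo mean agrees with the quenched prediction, and the per-sample spread relative to the mean shrinks as N grows, which is the self-averaging property described above.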
Figure 3: The mean (a) and variance (b) of the Fisher information with respect to the quenched statistics are shown as a function of the system size, N. The open circles show statistics of the Fisher information as calculated numerically by averaging the Fisher information of 400 different realizations of the neuronal diversity. The statistics were calculated in the amplitude diversity model with κ = 0, 0.05, 0.1, 0.25 from bottom to top. The solid lines in a show the analytical result for the average Fisher information, equations 2.2 and 2.3. The solid lines in b show the leading, O(N), term of equation 2.4 for the Fisher information variance, $2N\overline{K^{2}}d^{2}$. (Axes: x, N from 0 to 500; y, mean of FI [deg⁻²] in a; variance of FI [deg⁻⁴], ×10⁻⁴ scale, in b.)
Figure 4: The Fisher information of a specific system. The Fisher information of a single system is plotted (solid line) as a function of the stimulus, θ, for a given typical realization of the neuronal diversity, the {εᵢ}. The quenched average of the Fisher information, ⟨J⟩, is shown by the dashed line for comparison. The dotted lines show $\langle J\rangle \pm 2\sqrt{\langle(\delta J)^{2}\rangle}$. In this figure, we used N = 1000. (Axes: x, θ [deg] from −180 to 180; y, J(θ) [deg⁻²] from 0 to 0.5.)
3 Linear Readout

We now turn to the efficiency of linear readouts in a correlated heterogeneous population of neurons. For simplicity we restrict the discussion to the amplitude diversity model, equation 1.10. We first study the efficiency of the population vector readout. As mentioned above, in a homogeneous population of neurons, the population vector yields the optimal linear estimator. However, although the Fisher information in a diverse population grows linearly with the size of the system, the population vector efficiency saturates to a finite limit, as in the homogeneous case. In contrast, we show that the optimal linear estimator efficiency scales linearly with the number of cells in the population, N.

3.1 The Population Vector Readout. The efficiency of the population vector readout depends on the specific realization of the {εᵢ}. We therefore characterize the population vector efficiency by its statistics with respect to the quenched disorder. Calculation of the quenched statistics of the
population vector readout (see appendix B) reveals that the population vector bias is typically of order 1/√N, with ⟨b_pv⟩ = 0 and variance
$$\langle b_{pv}^{2}\rangle = \frac{\kappa}{2N}\,\frac{\widetilde{f^{2}}^{(0)} - \widetilde{f^{2}}^{(2)}}{|\tilde{f}^{(1)}|^{2}} = O\!\left(\frac{\kappa}{N}\right), \tag{3.1}$$
where we have used the following notation for the Fourier transform of f²(θ):
$$\widetilde{f^{2}}^{(n)} = \int \frac{d\varphi}{2\pi}\, f^{2}(\varphi)\,e^{in\varphi}. \tag{3.2}$$
Note that after the quenched averaging, the statistics of the bias and variance of the population vector readout are independent of the stimulus. Analysis of the population vector trial-to-trial fluctuations (see appendix B) shows that the variance of the population vector estimator, (δθ̂_pv)², is a self-averaging quantity of order 1, with
$$\langle(\delta\hat{\theta}_{pv})^{2}\rangle = \frac{\tilde{c}_{1}}{2|\tilde{f}^{(1)}|^{2}} = O(1), \tag{3.3}$$
which is the same as the population vector efficiency in the homogeneous case, equation 1.26. Figure 5 shows the average efficiency of the population vector in a diverse population (circles) in terms of one over the average square angular estimation error, ⟨(θ̂ − θ)²⟩⁻¹. The population vector efficiency was calculated by averaging over 400 realizations of the neuronal diversity. The estimation error for a given realization of the neuronal diversity was computed by averaging the angular estimation error of the population vector for 200 trials of estimating θ = 0. The analytical results of substituting equations 3.1 and 3.3 into equation 1.16 are shown by the overlapping solid line. Comparing Figures 1 and 5 and equations 1.26 and 3.3, it is easy to see that the efficiency of the population vector readout is almost unaffected by the neuronal diversity in the limit of large N. Thus, although the Fisher information of a diverse population scales linearly with the size of the system, the efficiency of the population vector saturates to a finite limit as N grows, in the presence of correlations. Similarly, we find that to a leading order in N, the neuronal diversity has no effect on the population vector performance in terms of the Euclidean distance measure, equation 1.19. Hence, the result of equation 1.25 still holds up to a correction of O(1/√N) (see appendix B), yielding O(1) error even in the large N limit. These results raise the question of whether it is possible to obtain a linear estimator that will be able to use the neuronal diversity in order to overcome the correlated noise and obtain an efficiency that scales linearly with the size of the system. As discussed above, in
Figure 5: Linear readout average efficiency. The average efficiency of the population vector (open circles) and of the optimal linear estimator (open boxes) is plotted as a function of the size of the system. The efficiency of these readouts was calculated numerically by averaging over 400 different realizations of the neuronal diversity for each point. The estimation error for a given realization of the neuronal diversity was computed by averaging the angular estimation error of the readout over 200 trials of estimating θ = 0. The Fisher information is shown in the upper dashed line for comparison. The bottom solid line shows the analytical result for the population vector error, equations 1.16, 3.3, and 3.1. (Axes: x, N from 0 to 500; y, estimator's efficiency [deg⁻²] from 0 to 0.1.)
an isotropic population of neurons, the population vector is the optimal linear estimator. However, in a diverse population, the population vector is no longer optimal. Below we study the efficiency of the optimal linear estimator in a heterogeneous population of neurons.

3.2 The Optimal Linear Estimator. The optimal linear estimator weights are given by (see equations 1.19–1.21)
$$\mathbf{w}_{ole} = Q^{-1}\mathbf{U} \tag{3.4}$$
$$Q_{ij} = C_{ij} + (1 + \varepsilon_{i})(1 + \varepsilon_{j}) \int \frac{d\theta}{2\pi}\, f(\phi_{i} - \theta)\, f(\phi_{j} - \theta) \tag{3.5}$$
$$U_{j} = (1 + \varepsilon_{j})\,\tilde{f}^{(1)}\,e^{i\phi_{j}}. \tag{3.6}$$
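Equations 3.4 to 3.6 can be evaluated numerically for a concrete model. The sketch below is an illustration under stated assumptions, not the article's code: it assumes a circular gaussian tuning curve, an exponentially decaying correlation matrix with diagonal a, gaussian εᵢ of variance κ, and approximates the stimulus average in equation 3.5 on a θ grid. It then computes the Euclidean error E(w_ole) = 1 − U†Q⁻¹U; with κ > 0 the error keeps falling roughly like 1/N instead of saturating.

```python
import numpy as np

rng = np.random.default_rng(1)

def ole_error(N, kappa=0.25, a=1.0, c=0.3, rho=1.0, sigma=0.5, n_grid=128):
    """Euclidean error of the OLE, E(w_ole) = 1 - U^dag Q^-1 U (equations 3.4-3.6)."""
    phi = np.linspace(-np.pi, np.pi, N, endpoint=False)
    eps = rng.normal(0.0, np.sqrt(kappa), N)         # quenched amplitude diversity
    f = np.exp((np.cos(phi) - 1.0) / sigma**2)       # assumed tuning shape
    f1 = np.mean(f * np.exp(1j * phi)).real          # f~(1) (real for this even tuning)
    # stimulus average of f(phi_i - theta) f(phi_j - theta), approximated on a grid
    thetas = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    F = np.exp((np.cos(phi[:, None] - thetas[None, :]) - 1.0) / sigma**2)
    S = (F @ F.T) / n_grid
    # assumed correlation matrix
    dist = np.abs((phi[:, None] - phi[None, :] + np.pi) % (2 * np.pi) - np.pi)
    C = c * np.exp(-dist / rho)
    np.fill_diagonal(C, a)
    Q = C + np.outer(1.0 + eps, 1.0 + eps) * S       # equation 3.5
    U = (1.0 + eps) * f1 * np.exp(1j * phi)          # equation 3.6
    w = np.linalg.solve(Q, U)                        # equation 3.4
    return 1.0 - np.real(np.conj(U) @ w)

for N in (100, 200, 400, 800):
    print(N, np.mean([ole_error(N) for _ in range(5)]))
```

The printed error decreases steadily with N even though the noise is correlated, in line with the linear scaling of the optimal linear estimator efficiency reported in Figure 5.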
The efficiency of the optimal linear estimator, in terms of one over the average quadratic estimation error, ⟨(θ̂ − θ)²⟩⁻¹, is shown in Figure 5 (boxes). The optimal linear estimator efficiency was calculated by averaging over 400 realizations of the neuronal diversity. The estimation error for a given realization of the neuronal diversity was computed by averaging the angular estimation error of the optimal linear estimator over 200 trials of estimating θ = 0. From the figure, one can see that while the population vector efficiency (solid line and open circles) saturates to a finite limit, the optimal linear estimator efficiency scales linearly with the population size, N. Obtaining a complete analytical expression for the optimal linear estimator and its efficiency is not an easy task, mainly due to the difficulty of inverting the random matrix Q. However, for large populations, one can expand the optimal linear estimator to a leading order in 1/N, as presented in the following section.

3.2.1 The Zeroth Approximation of the Optimal Linear Estimator. To a leading order in 1/N, the optimal linear estimator weights are given by (see appendix C)
$$w_{ole,j} = w_{j}^{(0)} + O(N^{-3/2}) \tag{3.7}$$
$$w_{j}^{(0)} = \frac{1}{N\kappa}\,\frac{\varepsilon_{j}}{\tilde{f}^{(1)}}\,e^{i\phi_{j}}. \tag{3.8}$$
Figure 6 shows the average overlap between the optimal linear estimator and the zeroth approximation readout weights, $\mathrm{Real}\left\{\mathbf{w}_{ole}^{\dagger}\mathbf{w}^{(0)}/\left(\|\mathbf{w}_{ole}\|\,\|\mathbf{w}^{(0)}\|\right)\right\}$. As can be seen from the figure, the overlap approaches the value of 1 as N grows. Hence, for large N, the zeroth approximation converges to the optimal linear estimator, $\mathbf{w}^{(0)} \xrightarrow{N\to\infty} \mathbf{w}_{ole}$. We find (see appendix D) that the average Euclidean error, equation 1.19, of the zeroth approximation fluctuates from sample to sample with mean and standard deviation of the order of 1/N. Assuming small angular estimation errors, we can study the angular estimation efficiency of the zeroth approximation. Our calculations (see appendix D) show that the bias of the zeroth approximation is zero, on average, with fluctuations that are of order 1/√N:
$$\langle b\rangle = 0 \tag{3.9}$$
$$\langle b^{2}\rangle = \left(1 + \frac{1}{2\kappa}\right)\frac{1}{N}\,\frac{\widetilde{f^{2}}^{(0)} - \widetilde{f^{2}}^{(2)}}{|\tilde{f}^{(1)}|^{2}}. \tag{3.10}$$
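The convergence of the zeroth approximation toward the optimal linear estimator can be probed numerically. The sketch below is an illustration under stated assumptions (circular gaussian tuning, exponentially decaying correlations with diagonal a, gaussian εᵢ of variance κ, stimulus average on a θ grid); it builds w_ole from equations 3.4 to 3.6 and w⁽⁰⁾ from equation 3.8 for one realization of the diversity, and returns their normalized overlap together with the Euclidean errors of both readouts. By the optimality of w_ole, E(w⁽⁰⁾) ≥ E(w_ole) always.

```python
import numpy as np

rng = np.random.default_rng(2)

def ole_and_zeroth(N, kappa=0.25, a=1.0, c=0.3, rho=1.0, sigma=0.5, n_grid=128):
    """Return (overlap, E(w0), E(w_ole)) for one realization of the diversity."""
    phi = np.linspace(-np.pi, np.pi, N, endpoint=False)
    eps = rng.normal(0.0, np.sqrt(kappa), N)          # quenched diversity
    f = np.exp((np.cos(phi) - 1.0) / sigma**2)        # assumed tuning shape
    f1 = np.mean(f * np.exp(1j * phi)).real           # f~(1) is real here
    thetas = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    F = np.exp((np.cos(phi[:, None] - thetas[None, :]) - 1.0) / sigma**2)
    S = (F @ F.T) / n_grid                            # stimulus average of f_i f_j
    dist = np.abs((phi[:, None] - phi[None, :] + np.pi) % (2 * np.pi) - np.pi)
    C = c * np.exp(-dist / rho)
    np.fill_diagonal(C, a)
    Q = C + np.outer(1.0 + eps, 1.0 + eps) * S        # equation 3.5
    U = (1.0 + eps) * f1 * np.exp(1j * phi)           # equation 3.6
    w_ole = np.linalg.solve(Q, U)                     # equation 3.4
    w0 = eps * np.exp(1j * phi) / (N * kappa * f1)    # equation 3.8
    overlap = (np.conj(w_ole) @ w0).real / (np.linalg.norm(w_ole) * np.linalg.norm(w0))
    err = lambda w: (np.conj(w) @ Q @ w).real - 2.0 * (np.conj(U) @ w).real + 1.0
    return overlap, err(w0), err(w_ole)

for N in (100, 400, 800):
    print(N, ole_and_zeroth(N))
```

In this toy model the overlap grows with N, mirroring Figure 6, while the gap between E(w⁽⁰⁾) and E(w_ole) illustrates the sensitivity to higher-order corrections discussed around Figure 7.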
Figure 6: The average overlap between the optimal linear estimator and the zeroth approximation of the optimal linear estimator linear weights, $\mathrm{Real}\left\{\mathbf{w}_{ole}^{\dagger}\mathbf{w}^{(0)}/\left(\|\mathbf{w}_{ole}\|\,\|\mathbf{w}^{(0)}\|\right)\right\}$, is shown as a function of the system size N. Note that we find the imaginary part of the average overlap to be on the order of 10⁻³ and to decrease rapidly with N (results not shown). Every point on the graph shows the overlap averaged over 400 realizations of the neuronal diversity. (Axes: x, N from 0 to 1000; y, overlap from 0.2 to 1.)
The trial-to-trial fluctuations of the zeroth approximation obey the following statistics (see appendix D):
$$\langle(\delta\hat{\theta})^{2}\rangle = \frac{a}{2N\kappa|\tilde{f}^{(1)}|^{2}} \tag{3.11}$$
$$\left\langle\left(\Delta(\delta\hat{\theta})^{2}\right)^{2}\right\rangle = \frac{\tilde{B}^{(0)} + \frac{1}{2}\tilde{B}^{(2)}}{2(N\kappa)^{2}|\tilde{f}^{(1)}|^{4}}, \tag{3.12}$$
where a is the single-cell response variance (see equation 1.14) and B(θ) ≡ C²(θ). Thus, the efficiency of the zeroth-order approximation scales linearly with the size of the system, yielding an estimation error on the order of 1/√N. Figure 7 shows the efficiency of this readout as a function of the population size N; the open circles show the numerical evaluation of the efficiency, and the solid line shows the analytical results of substituting
Figure 7: The efficiency of the zeroth-order approximation of the optimal linear estimator as a function of the population size, N. The efficiency is shown in terms of one over the average quadratic angular estimation error, ⟨(θ̂ − θ)²⟩⁻¹. The analytical results of equations 1.16, 3.10, and 3.11 are shown by the solid line. The open circles show the numerical estimation of the efficiency. For comparison, we present the optimal linear estimator efficiency (dashed line), as calculated numerically. The numerical calculation of the readouts' efficiencies was done by averaging over 400 different realizations of the neuronal diversity and over 200 trials of simulating the neuronal stochastic responses to stimulus θ = 0 for each realization. (Axes: x, N from 0 to 2000; y, readout efficiency [deg⁻²] from 0 to 0.04.)
equations 3.11 and 3.10 into equation 1.16. For comparison, the optimal linear estimator efficiency is shown by the dashed line. As can be seen from the figure, the efficiency of the zeroth approximation scales linearly with the size of the system, even in the presence of correlated noise. Nevertheless, its performance is considerably inferior to that of the optimal linear estimator. This result contrasts with the high degree of similarity between w(0) and wole (see Figure 6), emphasizing the importance of the higher-order corrections to the optimal linear estimator readout. We find that by also incorporating the first-order corrections, we can retrieve most of the efficiency of the optimal linear estimator. However, the first-order corrections are nonlocal in space and involve global averages of the neuronal diversity across the entire population (results not shown). On the other hand, the analysis of the
zeroth-order approximation efficiency is sufficient to prove the linear scaling of the optimal linear estimator efficiency with the population size. The problem of fine tuning of the optimal linear estimator weights is addressed in section 4.

4 Summary and Discussion

The efficiency of population codes has been the subject of considerable theoretical effort. Early studies investigated the efficiency of the code using the theoretical concept of the Fisher information and quantifying the accuracy of simple readout mechanisms (Paradiso, 1988; Seung & Sompolinsky, 1993). Assuming the trial-to-trial fluctuations in the responses of different cells have zero correlation, these studies have shown that the coding efficiency of the population grows linearly with the number of cells. Zohary et al. (1994) have shown the possible detrimental effect of nonzero cross-correlations on the coding efficiency of the population. On the other hand, differing results were presented by Abbott and Dayan (1999), claiming that correlations do not have a detrimental effect on the coding efficiency. Abbott and Dayan considered several models for the cross-correlations. One model incorporated a short-range correlation structure. In terms of the current study, this corresponds to scaling the correlation length, ρ, inversely with the number of cells in the population, ρ ∼ 1/N (see equation 1.14); hence, in the large N limit, this model is very similar to a model without correlations. Another interesting model studied by Abbott and Dayan is one with uniform correlations;¹ it corresponds to the other extreme of taking the limit of very large correlation length, ρ → ∞ (see equation 1.14). Uniform correlations generate large collective fluctuations in the neural responses. However, these collective fluctuations are limited to the uniform direction, (1, 1, . . . , 1), which contains no information about the stimulus identity.
Thus, signal and noise in this model are completely segregated into orthogonal subspaces of the phase space of neural responses. Due to this segregation, one can treat this model as one of an independent population of neurons and obtain qualitatively similar results. Qualitatively different results are obtained in the intermediate regime of ρ = O(1). In this regime, a considerable fraction of neuronal pairs shows significant correlations. Moreover, correlations are stronger for pairs of neurons with closer preferred directions. These properties of the ρ = O(1) regime are in agreement with experimental findings (see, e.g., Zohary et al., 1994; Lee et al., 1998). In this case, signal and noise are not segregated, and the collective fluctuations in the population response to the stimulus cause the

¹ Abbott and Dayan also considered a model in which information is coded by the correlated neuronal responses. This coding strategy is not the topic of our article and was addressed elsewhere (Shamir & Sompolinsky, 2004; see also Wu, Amari, & Nakahara, 2004).
saturation of the information capacity of the system. Typical values for the neural response properties and pairwise correlations yield an accuracy bound that is inconsistent with the known psychophysical accuracy (for a discussion in greater detail, see Sompolinsky et al., 2001; Wu, Amari, & Nakahara, 2002). This raises the theoretical question of how a neural population with the experimentally observed correlation structure can overcome these collective fluctuations and obtain a more accurate code that will be able to account for psychophysical performance. One possible mechanism for overcoming the strong collective noise fluctuations is by utilizing the heterogeneity inherent in any neural population. In this work, we studied the effect of neuronal heterogeneity on the coding capabilities of large populations of neurons with correlated firing rate fluctuations. The heterogeneity of the system enables information to be coded in all of the spatial modes of the network. The population vector readout extracts information only from the slowly varying collective modes of the system; hence, the optimal linear estimator yields performance superior to that of the population vector (see Figure 2). Moreover, the diversity of the population adds a bias to the population vector estimate of θ, thus decreasing its efficiency with respect to the homogeneous case (see Figure 2). However, in an uncorrelated population of neurons, the efficiency of both readouts scales linearly with the number of cells in the population (see Figure 2). In a correlated population of neurons, in the biologically interesting regime of ρ = O(1), correlations generate large fluctuations in the slowly varying collective modes of the system.
If the information coded by the neuronal responses resides only in this low-dimensional subspace of collective fluctuations, as in the homogeneous case, then the information capacity of the system will remain finite even in the limit of infinitely large networks (see Figure 3, bottom line). The neuronal diversity, applied here to the first-order statistics of the neuronal responses, allows information to be coded in other, more rapidly varying spatial modes of the system. In this case, signal and noise are not segregated; however, they are also not entirely overlapping, as in the homogeneous case. Consequently, the Fisher information of this system does not saturate to a finite limit; rather, it scales linearly with the population size (see equations 2.2–2.4 and Figure 3). We further investigated the efficiency of the population vector and the optimal linear estimator in the case of correlated heterogeneous populations of neurons. It was shown that the population vector readout fails to extract the information coded by the neuronal diversity and that its efficiency is bounded due to the correlated noise in the slowly varying collective modes of the system (see Figure 5). On the other hand, the optimal linear estimator can extract most of the information coded by the quenched fluctuations of the neuronal responses. A numerical study of the optimal linear estimator performance revealed that its efficiency scales linearly with the size of the system (see Figure 5). Note that for both correlated and uncorrelated cases,
the optimal linear estimator bias has a negligible contribution to the error. Our study shows that for large N, the optimal linear estimator converges to the simple form of the zeroth approximation, $w_{i}^{(0)} \propto \varepsilon_{i} e^{i\phi_{i}}$ (see Figure 6). We have shown analytically that the efficiency of the zeroth approximation scales linearly with N, equations 3.9 to 3.12. However, its performance is considerably inferior to that of the optimal linear estimator (see Figure 7). These two findings highlight the sensitivity of the optimal linear estimator performance to the higher-order corrections and raise the question of whether the central nervous system can perform such an accurate readout. This last question is addressed in the context of supervised learning of the optimal linear estimator weights. We assume the linear readout weights of the system are learned by an online learning process (e.g., Radons, 1993; Hansen, Pathria, & Salamon, 1993) and ask whether the performance of this readout will converge to that of the optimal linear estimator in a reasonable time. In every learning step, an example stimulus is chosen randomly from a uniform distribution on the circle, $\{\theta^{k}\}_{k=1}^{p} \sim$ i.i.d. $U([-\pi, \pi))$. The kth example stimulus, $\theta^{k}$, generates a response of the neural population, $\mathbf{r}^{(k)}$, that is distributed according to $\mathbf{r}^{(k)} \sim P(\mathbf{r}^{(k)}|\theta^{k})$, as defined in equation 1.1. Given the neural responses, the system calculates its estimation of the kth example, $\hat{z}^{k} = \mathbf{r}^{(k)\dagger}\mathbf{w}(k)$, and updates the readout weights according to the estimation error in the kth example. Formally, we define a momentary cost function,
$$E^{k}(\mathbf{w}) = \frac{1}{2}\left|\hat{z}^{k} - z^{k}\right|^{2} = \frac{1}{2}\left|\mathbf{r}^{(k)\dagger}\mathbf{w} - e^{i\theta^{k}}\right|^{2}, \tag{4.1}$$
and the learning dynamics are defined by
$$\mathbf{w}(k+1) = \mathbf{w}(k) - \eta(k)\frac{\partial E^{k}(\mathbf{w}(k))}{\partial \mathbf{w}} = \mathbf{w}(k) - \eta(k)\left(\hat{z}^{k} - z^{k}\right)\mathbf{r}^{(k)}, \tag{4.2}$$
where η(k) is the momentary learning rate and ∂/∂w is the gradient with respect to w. Note that in Hebbian learning, the synaptic weight is modified in proportion to the product of its input and output. Here, in the case of supervised learning of a linear system, the update takes the form of the output error, (ẑᵏ − zᵏ), times the input, r⁽ᵏ⁾. Figure 8 shows the online learning curve of the optimal linear estimator for a population of N = 400 neurons (solid line). The Euclidean error, equation 1.19, of the zeroth-order approximation to the optimal linear estimator and of the population vector readout are plotted for comparison by the horizontal solid and dashed lines, respectively. As can be seen from the figure, the learning converges rather fast and presents a performance superior to that of the zeroth-order approximation after p ∼ N learning steps. Hence, despite the sensitivity of the optimal linear estimator to the higher-order, O(N⁻³/²), corrections, a
Figure 8: Online learning curve for the optimal linear estimator. The generalization error, equation 1.19, is plotted as a function of the number of examples shown to the system, p, per weight that needs to be learned, N. For comparison, we show the zeroth-order error and the population vector readout error by the horizontal solid and dashed lines, respectively. In this graph, N = 400 was used. In our learning dynamics, we scaled the learning rate according to η(k) = η₀/(1 + k/k₀) with η₀ = 5·10⁻⁶ and k₀ = 20,000. (Axes: x, p/N on a log scale from 10⁻² to 10⁴; y, (E(w) − E_ole)/E_ole on a log scale.)
simple learning rule converges to the optimal linear estimator performance rather fast. Interest in learning the optimal linear estimator has recently arisen in the context of constructing brain-machine interfaces (e.g., Wessberg et al., 2000; Carmena et al., 2003; Schwartz, Taylor, & Tillery, 2001). For this application, learning can be done using a batch learning algorithm (Seung, Tishby, & Sompolinsky, 1992). We found that, similar to online learning, batch learning converges to the optimal linear estimator performance on the scale of several N examples, where N is the size of the system (results not shown). Wessberg et al. (2000) and later Carmena et al. (2003) applied a batch algorithm to learning optimal linear estimator weights for predicting motor commands from cortical neuron activity (mainly from motor cortex). Wessberg et al. also investigated the scaling of the learned readout accuracy with the number of its input neurons. Their results suggest that the optimal linear estimator accuracy scales linearly with
the number of pooled neurons. Their results are consistent with our finding on the scaling of the optimal linear estimator performance with the population size in a correlated heterogeneous network. Our results indicate that in order to obtain an accuracy of 5 degrees, similar to the psychophysical accuracy in a simple reaching task, it is sufficient to implement the optimal linear estimator readout on a population of several hundred neurons (see Figure 5). The results of this work are not limited to the amplitude diversity model, equation 1.10, and can be generalized to other kinds of neuronal heterogeneities. Assuming a small measure of heterogeneity in the system, κ ≪ 1, one can approximate the tuning curves by equation 1.9 and generalize the calculation of appendixes C and D, yielding a zeroth approximation to the optimal linear estimator of the form $w_{j}^{(0)} = \frac{1}{N\kappa\,\tilde{g}^{(1)}}\,\varepsilon_{j} e^{i\phi_{j}}$ (see equation 1.9 for the definition of g). For example, one may consider the case of a population with diverse tuning curve widths. The tuning curve width of a specific cell, σᵢ, can be characterized as a sum of the average width, σ̄, and a fluctuation, σᵢ = σ̄ + εᵢ. Interestingly, in this case one obtains that g̃⁽¹⁾ < 0; hence, the zeroth-order approximation predicts that the optimal estimator will give a higher weight to neurons with a sharper tuning curve (ε < 0) than to neurons with broader-than-average tuning curves (ε > 0). Similarly, the calculation can be expanded to a model in which the tuning curve of every cell is characterized by a vector of heterogeneity parameters. Qualitatively, the results are the same. This argument excludes diversity in the preferred directions of the cells. The deviations from homogeneity in the distribution of preferred directions in the population result from finite sampling size. This corresponds to scaling the measure of heterogeneity, κ, inversely with the number of cells, κ ∼ 1/N.
Hence, in the limit of large populations, this source of heterogeneity is not expected to have a significant contribution to the coding efficiency of the system. It is interesting to note, however, that one of the problems that motivated the work of Salinas and Abbott (1994), which introduced the concept of the optimal linear estimator, was the finite-size heterogeneity in the distribution of preferred directions. Further extensions of this theory for studying the coding of a more complex stimulus can be obtained once a biologically plausible model for the tuning and pairwise correlation of the neural responses is established. In this work, we considered only variability in the first-order statistics of the neural responses. However, the second-order statistics of neural responses, namely, the firing rates covariance, is also very diverse (see, e.g., Zohary et al., 1994; Lee et al., 1998; Maynard et al., 1999). The way in which diversity in the second-order statistics affects our conclusions depends on the amount of overlap and segregation between eigenvectors of C corresponding to its largest eigenvalues and f . Unfortunately, the relation between the diversity of the first- and second-order statistics of the neural responses is not yet clear. One possibility is to consider independent
Implications of Neuronal Diversity on Population Coding
[Figure 9 appeared here: mean FI [deg⁻²] versus population size N (0–500).]

Figure 9: The mean Fisher information of a model with heterogeneity in the first- and second-order statistics. The quenched mean of the Fisher information, ⟨J⟩, was calculated by averaging over 100 realizations of the neuronal diversity (open circles), with Δ = 0.25, in the amplitude diversity model with κ = 0, 0.1, 0.25 from bottom to top. The Fisher information was calculated for the stimulus value θ = 0. For comparison, the average Fisher information of the system with Δ = 0 is shown by the solid lines, as computed by equation 2.2.
additive variability to the covariance matrix. Assuming the covariance matrix obeys

\[ C_{ij} = \bar C_{ij} + T_{ij} \tag{4.3} \]
\[ \bar C_{ij} = \bar C(\phi_i - \phi_j) \tag{4.4} \]
\[ T_{ij} = \xi_{ij} + \xi_{ji}, \tag{4.5} \]

where ξ is a random gaussian matrix, that is, the {ξ_ij} are i.i.d. gaussian random variables with mean 0 and variance Δ², independent also of the heterogeneity of the first-order statistics. Hence, T is a real symmetric matrix with maximum eigenvalue \(2\Delta\sqrt{2N}\) (see, e.g., Mehta, 1991). The matrix \(\bar C\) obeys equation 1.14. Figure 9 shows the mean Fisher information of such a system with Δ = 0.25 for different values of κ in the amplitude diversity model (open circles); for comparison, the results with Δ = 0 are
M. Shamir and H. Sompolinsky
shown in solid lines. As can be seen from the figure, adding heterogeneity to the second-order statistics does not alter our previous results qualitatively. In particular, the Fisher information of a system with κ = 0 saturates to a finite limit, whereas the Fisher information of systems with κ > 0 grows linearly with the size of the system. In this work, we emphasized the possible role of the "quenched" fluctuations (neuronal heterogeneity) in coding information in correlated populations of neurons. In another study, the stimulus-dependent "thermal" (trial-to-trial) fluctuations were suggested as a primary source of information (Shamir & Sompolinsky, 2001, 2004). Neural populations exhibit both stimulus-dependent thermal and quenched fluctuations. Studying the optimal linear estimator efficiency, equations 1.19 to 1.21, we find that only the stimulus average of the correlation matrix appears in E(w). Hence, the performance of the optimal linear estimator is not affected by the tuning of the higher-order statistics. Similarly, the contribution of the stimulus-dependent correlations to the Fisher information, \(J^{cov} = \mathrm{Tr}\{C' C^{-1} C' C^{-1}\}\) (where \(C' = dC/d\theta\); see Shamir & Sompolinsky, 2004), is not affected by the heterogeneity of the first-order statistics of the neural responses. Thus, although thermal and quenched fluctuations are usually considered sources of noise in the system, our theory suggests that these fluctuations may have a central role in coding information in populations of neurons with correlated firing-rate statistics. However, since the information capacity of both thermal and quenched stimulus-dependent fluctuations scales linearly with the population size, several decoding strategies are possible. Our current theory is unable to make a definitive claim about which strategy is most plausible. Nevertheless, in order to utilize neuronal diversity, fine tuning of the readout mechanism is required.
On the other hand, our theory suggests that if neuronal diversity is not utilized by the readout, a nonlinear readout is essential to obtain a high degree of accuracy. Further theoretical as well as experimental effort is needed to provide additional tests for the alternative decoding schemes used in the brain.
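The random-matrix bound quoted above for the second-order perturbation T = ξ + ξᵀ can be checked numerically. In the sketch below, the variance parameter of the ξ entries (`delta`) and all sizes are illustrative assumptions; the largest eigenvalue of T is compared with the Wigner semicircle edge \(2\Delta\sqrt{2N}\).

```python
# Numerical check of the semicircle edge for T = xi + xi^T.
# Illustrative sketch; `delta` is an assumed variance parameter.
import numpy as np

def max_eig_T(N, delta, seed=0):
    rng = np.random.default_rng(seed)
    xi = rng.normal(0.0, delta, size=(N, N))  # i.i.d. gaussian entries
    T = xi + xi.T                             # real symmetric perturbation
    return np.linalg.eigvalsh(T)[-1]          # largest eigenvalue

N, delta = 500, 0.25
lam_max = max_eig_T(N, delta)
edge = 2.0 * delta * np.sqrt(2.0 * N)         # predicted spectral edge
print(lam_max, edge)                          # ratio approaches 1 as N grows
```

For the values used here the two numbers agree to within a few percent; the agreement improves with N.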
Appendix A: Calculation of the Fisher Information Statistics

The Fisher information of a diverse population, equation 2.1, can be written as the sum of three terms, J = I_1 + 2I_2 + I_3, where

\[ I_1 = \sum_{i,j=1}^{N} f_i\, C^{-1}_{ij}\, f_j \tag{A.1} \]
\[ I_2 = \sum_{i,j=1}^{N} m_i\, C^{-1}_{ij}\, f_j \tag{A.2} \]
\[ I_3 = \sum_{i,j=1}^{N} m_i\, C^{-1}_{ij}\, m_j . \tag{A.3} \]
The first term, I_1, is the Fisher information of an isotropic system. As discussed in section 1, in the biologically relevant parameter regime for the correlations, ρ = O(1) and c = O(1), c > 0, this term saturates to a finite limit even when the population size grows to infinity, yielding a contribution to J that is O(1). The second term, I_2, has zero mean, ⟨I_2⟩ = 0, and variance

\[ \langle (\Delta I_2)^2 \rangle = \sum_{i=1}^{N} K(\phi_i - \theta)\, v_i^2 \tag{A.4} \]
\[ v_i = \sum_{j=1}^{N} C^{-1}_{ij}\, f_j . \tag{A.5} \]

Now, \(\sum_i K(\phi_i - \theta) = O(N)\). Since the mean square \(\overline{v^2} = \mathbf f^{T} C^{-2} \mathbf f / N = O(1/N)\) (the eigenvalues of \(C^{-2}\) associated with the dominant Fourier modes scale like \(N^{-2}\)), \(v_i = O(1/\sqrt N)\ \forall i\), yielding \(\sum_i v_i^4 = O(1/N)\). Hence, applying the Cauchy–Schwarz inequality to equation A.4, one obtains

\[ \langle (\Delta I_2)^2 \rangle \le \sqrt{ \sum_{i=1}^{N} K^2(\phi_i - \theta)\; \sum_{i=1}^{N} v_i^4 } = O(1). \tag{A.6} \]
The last term, I_3, has a mean that scales linearly with the size of the system, N:

\[ \langle I_3 \rangle = \sum_{i=1}^{N} K(\phi_i - \theta)\, C^{-1}_{ii} = N \bar K d, \tag{A.7} \]

where d is the diagonal element of \(C^{-1}\) and \(\bar K = \int \frac{d\varphi}{2\pi}\, K(\varphi)\). Note that one can approximate d using the fact that the eigenvalue spectrum of the correlation matrix, C, contains a small number, p(N) = o(N), of low Fourier modes with eigenvalues scaling like N, while the rest of the spectrum rapidly decays to a(1 − c). Hence, for large N, one can approximate \(C^{-1}_{ii} = \frac{1}{a(1-c)} + O(p(N)/N)\), ∀i ∈ {1, …, N}. In this model, the eigenvalue spectrum of C decays algebraically with the Fourier mode, n, \(\tilde c_n \propto n^{-2}\); hence, \(p(N) \propto \sqrt N\), and in the limit of large N, one can neglect the O(p(N)/N) contribution to the diagonal of \(C^{-1}\).
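The statement about the diagonal of \(C^{-1}\) can be illustrated with a circulant surrogate for the correlation matrix. The spectrum used below, \(\lambda_n = a(1-c) + acN/n^2\), is an assumed stand-in for the model of equation 1.14 (which is not reproduced in this excerpt); it has \(p(N) \propto \sqrt N\) large low Fourier modes over a flat bulk a(1 − c), so the diagonal of \(C^{-1}\) should approach 1/(a(1 − c)).

```python
# Sketch: diagonal of C^{-1} for a long-range circulant correlation matrix.
# The spectrum lam_n = a(1-c) + a*c*N/n^2 is an assumed stand-in for the
# model's equation 1.14 (few low modes ~ N, flat bulk a(1-c)).
import numpy as np

N, a, c = 1024, 1.0, 0.5
m = np.minimum(np.arange(N), N - np.arange(N))    # Fourier mode index |n|
lam = np.where(m == 0, a * (1 - c) + a * c * N,
               a * (1 - c) + a * c * N / np.maximum(m, 1) ** 2)

col = np.real(np.fft.ifft(lam))                   # first column of circulant C
C = np.array([np.roll(col, k) for k in range(N)]).T
d = np.diag(np.linalg.inv(C)).mean()              # diagonal element of C^{-1}

print(d, 1.0 / (a * (1 - c)))                     # d -> 1/(a(1-c)) for large N
```

For a circulant matrix the diagonal of the inverse equals the mean of the reciprocal eigenvalues, which the code confirms; the residual gap from 1/(a(1 − c)) is the O(p(N)/N) correction discussed above.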
The variance of I_3 is given by

\[ \langle (\Delta I_3)^2 \rangle = 2 \sum_{ij} K(\phi_i-\theta)\, K(\phi_j-\theta)\, (C^{-1}_{ij})^2 = 2 \sum_{i=1}^{N} K^2(\phi_i-\theta)\, (C^{-1}_{ii})^2 + 2 \sum_{i\ne j} K(\phi_i-\theta)\, K(\phi_j-\theta)\, (C^{-1}_{ij})^2 . \tag{A.8} \]

The last term on the right-hand side of equation A.8 can be bounded by \(\max\{K\}^2 \sum_{i\ne j} (C^{-1}_{ij})^2\). For a long-range correlation matrix with a strongly decaying eigenvalue spectrum, we can approximate the off-diagonal terms of \(C^{-1}\) by

\[ C^{-1}_{kj} \approx -\frac{1}{N a (1-c)} \sum_{|n| < p(N)} e^{in(\phi_k - \phi_j)} \qquad (k \ne j). \tag{A.9} \]

Using this approximation, one obtains \(\sum_{i\ne j}(C^{-1}_{ij})^2 \propto p(N) = o(N)\). Hence, for large N, to leading order in N, the quenched fluctuations in I_3 are given by

\[ \langle (\Delta I_3)^2 \rangle = 2 N \overline{K^2}\, d^2, \tag{A.10} \]
where \(\overline{K^2} = \int \frac{d\varphi}{2\pi}\, K^2(\varphi)\). As a corollary of the above, the Fisher information of such a system is a self-averaging quantity with a quenched mean and variance that scale linearly with the population size.

Appendix B: Statistics of the Population Vector Efficiency

B.1 Average Euclidean Error. The population vector linear readout weights are given by \(w_j = \frac{\gamma}{N} e^{i\phi_j}\) with \(\gamma = \frac{\tilde f^{(1)}}{|\tilde f^{(1)}|^2 + \tilde c_1}\). (Note that \(\tilde f^{(1)}\), and hence also γ, are real.) In the amplitude variability model, equation 1.10, one obtains

\[ U_j = (1+\varepsilon_j)\,\tilde f^{(1)} e^{i\phi_j} \tag{B.1} \]
\[ Q_{ij} = C_{ij} + F_{ij} \tag{B.2} \]
\[ F_{ij} = (1+\varepsilon_i)(1+\varepsilon_j) \int \frac{d\theta}{2\pi}\, f(\phi_i-\theta)\, f(\phi_j-\theta). \tag{B.3} \]
For the calculation of E(PV), equation 1.19, we need to compute the following terms:

\[ U^{\dagger} w = \sum_{j=1}^{N} \tilde f^{(1)} (1+\varepsilon_j)\, e^{-i\phi_j}\, \frac{\gamma}{N} e^{i\phi_j} = \tilde f^{(1)} \gamma\, \frac{1}{N} \sum_{j=1}^{N} (1+\varepsilon_j) = \gamma \tilde f^{(1)} (1 + I_1) \tag{B.4} \]
\[ w^{\dagger} C w = \gamma^2 \tilde c_1 \tag{B.5} \]
\[ w^{\dagger} F w = \frac{\gamma^2}{N^2} \sum_{j,k=1}^{N} (1+\varepsilon_j)(1+\varepsilon_k)\, e^{i(\phi_j - \phi_k)} \int \frac{d\theta}{2\pi} f(\phi_j-\theta)\, f(\phi_k-\theta) = \gamma^2 \int \frac{d\theta}{2\pi} \left| \frac{1}{N} \sum_{j=1}^{N} f(\phi_j-\theta)(1+\varepsilon_j)\, e^{i\phi_j} \right|^2 = \gamma^2 \int \frac{d\theta}{2\pi} \left| \tilde f^{(1)} + \frac{1}{N} \sum_{j=1}^{N} f(\phi_j-\theta)\,\varepsilon_j e^{i\phi_j} \right|^2 = |\gamma \tilde f^{(1)}|^2 + 2 |\gamma \tilde f^{(1)}|^2 I_1 + \gamma^2 I_2, \tag{B.6} \]

where we have used in equation B.5 the fact that w is an eigenvector of the correlation matrix C with eigenvalue \(N\tilde c_1\). The terms I_1 and I_2 are defined by

\[ I_1 = \frac{1}{N} \sum_{i=1}^{N} \varepsilon_i \tag{B.7} \]
\[ I_2 = \frac{1}{N^2} \sum_{k,j} \varepsilon_k \varepsilon_j \int \frac{d\theta}{2\pi} f(\phi_k-\theta)\, f(\phi_j-\theta)\, e^{i(\phi_k-\phi_j)}. \tag{B.8} \]

Substituting the above in equation 1.19 and expressing γ in terms of the Fourier transforms of the average tuning curve and of the correlation matrix, we obtain

\[ E(PV) = 1 - \frac{|\tilde f^{(1)}|^2}{|\tilde f^{(1)}|^2 + \tilde c_1} - 2\, \frac{|\tilde f^{(1)}|^2\, \tilde c_1}{\left(|\tilde f^{(1)}|^2 + \tilde c_1\right)^2}\, I_1 + \frac{|\tilde f^{(1)}|^2}{\left(|\tilde f^{(1)}|^2 + \tilde c_1\right)^2}\, I_2 . \tag{B.9} \]
Now I_1 is a gaussian random variable that fluctuates from one realization of the neural population to another, with ⟨I_1⟩ = 0 and \(\langle(\Delta I_1)^2\rangle = \kappa/N\).
The random variable I_2 obeys the following statistics:

\[ \langle I_2 \rangle = \frac{\kappa}{N}\, \widetilde{f^2}(0) = O(\kappa/N) \tag{B.10} \]
\[ \langle (\Delta I_2)^2 \rangle = \frac{\kappa^2}{N^2} \left( \widetilde{G^2}(0) + \widetilde{G^2}(2) \right) = O(\kappa^2/N^2), \tag{B.11} \]

with \(G(\varphi) = \int \frac{d\theta}{2\pi}\, f(\theta)\, f(\theta - \varphi)\), \(\widetilde{G^2}(n) = \int \frac{d\varphi}{2\pi}\, G^2(\varphi)\, e^{in\varphi}\), and \(\widetilde{f^2}(n) = \int \frac{d\varphi}{2\pi}\, f^2(\varphi)\, e^{in\varphi}\). Hence, E(PV) is a self-averaging quantity of O(1) with fluctuations that are \(O(1/\sqrt N)\):

\[ E(PV) = \frac{1}{1 + |\tilde f^{(1)}|^2 / \tilde c_1} + O(1/\sqrt N). \tag{B.12} \]
B.2 Angle Estimation Error. Let x̂ and ŷ be the real and imaginary parts of the population vector, respectively:

\[ \hat z = \hat x + i \hat y = w^T \mathbf r. \tag{B.13} \]

The population vector angular estimator is given by

\[ \hat\theta(\hat x, \hat y) = \arctan \frac{\hat y}{\hat x}. \tag{B.14} \]

For small angular estimation errors, we can expand θ̂ in powers of δx̂ and δŷ around their means and approximate

\[ \hat\theta = \hat\theta\!\left(\langle \hat x \rangle, \langle \hat y \rangle\right) + \frac{\partial \hat\theta\!\left(\langle \hat x \rangle, \langle \hat y \rangle\right)}{\partial \hat x}\, \delta\hat x + \frac{\partial \hat\theta\!\left(\langle \hat x \rangle, \langle \hat y \rangle\right)}{\partial \hat y}\, \delta\hat y. \tag{B.15} \]

The first term on the right-hand side of equation B.15 yields the bias; the two other terms provide the trial-to-trial fluctuations of the estimator. Thus, to lowest order in the fluctuations at θ = 0, the bias and variance of θ̂ are given by

\[ \langle \hat\theta \rangle = \arctan \frac{\langle \hat y \rangle}{\langle \hat x \rangle} \approx \frac{\langle \hat y \rangle}{\langle \hat x \rangle} \tag{B.16} \]
\[ \langle (\delta\hat\theta)^2 \rangle = \frac{\langle (\delta\hat y)^2 \rangle}{\langle \hat x \rangle^2}. \tag{B.17} \]
Computing the statistics of x̂ and ŷ, one obtains

\[ \langle \hat x \rangle = \gamma \tilde f^{(1)} \tag{B.18} \]
\[ \langle (\Delta \hat x)^2 \rangle = \frac{\gamma^2 \kappa}{2N} \left( \widetilde{f^2}(0) + \widetilde{f^2}(2) \right). \tag{B.19} \]

Thus, x̂ is a self-averaging quantity with respect to the quenched fluctuations. For ŷ, we obtain

\[ \langle \hat y \rangle = 0 \tag{B.20} \]
\[ \langle (\Delta \hat y)^2 \rangle = \frac{\gamma^2 \kappa}{2N} \left( \widetilde{f^2}(0) - \widetilde{f^2}(2) \right) \tag{B.21} \]
\[ \langle (\delta \hat y)^2 \rangle = \frac{1}{2}\, \gamma^2 \tilde c_1 . \tag{B.22} \]

Note that the trial-to-trial fluctuations of the population vector readout, reflected here by \(\langle(\delta\hat y)^2\rangle\), have zero variability with respect to the quenched statistics. Summarizing the above results yields, for the bias,

\[ \langle b_{pv} \rangle = \frac{\langle \hat y \rangle}{\langle \hat x \rangle} = 0 \tag{B.23} \]
\[ \langle b_{pv}^2 \rangle = \frac{\langle (\Delta \hat y)^2 \rangle}{\langle \hat x \rangle^2} = \frac{\kappa}{2N}\, \frac{\widetilde{f^2}(0) - \widetilde{f^2}(2)}{|\tilde f^{(1)}|^2}. \tag{B.24} \]

Note that in the above equations, we have used the self-averaging property of x̂, neglecting its quenched fluctuations. For the population vector variance, one obtains

\[ \langle (\delta\hat\theta_{pv})^2 \rangle = \frac{\langle (\delta\hat y)^2 \rangle}{\langle \hat x \rangle^2} = \frac{\tilde c_1}{2 |\tilde f^{(1)}|^2}. \tag{B.25} \]
Note that since \(\langle(\delta\hat y)^2\rangle\) has zero sample-to-sample variability and x̂ is a self-averaging quantity, \(\langle(\delta\hat\theta_{pv})^2\rangle\) is a self-averaging quantity as well. To summarize the above results: the population vector readout in a diverse population has a small bias that decays as the pool size increases, and trial-to-trial fluctuations that remain O(1) even in the limit of infinitely large neuronal populations.
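The O(1) trial-to-trial error of the population vector is easy to reproduce in simulation. The tuning curve f(φ) = 1 + cos φ, the two shared-noise modes, and all parameter values below are illustrative assumptions rather than the model of section 1; the point is only that the angular scatter of the population vector readout does not shrink as N grows.

```python
# Monte Carlo sketch: population vector error under correlated noise.
# Tuning curve, noise structure, and parameters are illustrative assumptions.
import numpy as np

def pv_std(N, trials=3000, kappa=0.25, s_shared=0.2, s_priv=0.5, seed=1):
    rng = np.random.default_rng(seed)
    phi = 2 * np.pi * np.arange(N) / N           # preferred directions
    f = 1.0 + np.cos(phi)                        # mean tuning curve at theta = 0
    errs = []
    for _ in range(trials):
        eps = rng.normal(0, np.sqrt(kappa), N)   # quenched amplitude diversity
        z, w = rng.normal(0, s_shared, 2)        # shared (correlated) noise modes
        r = (1 + eps) * f + z * np.cos(phi) + w * np.sin(phi) \
            + rng.normal(0, s_priv, N)           # plus private noise
        zhat = np.mean(r * np.exp(1j * phi))     # population vector
        errs.append(np.angle(zhat))              # angular error (true theta = 0)
    return np.std(errs)

print(pv_std(100), pv_std(1000))  # comparable: the error does not vanish with N
```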
Appendix C: The Zeroth-Order Approximation to the Optimal Linear Estimator

The main difficulty in calculating the optimal linear estimator, equations 3.4 to 3.6, is the inversion of the random matrix Q. It is useful to first study how Q acts as a linear transformation. Let us denote \(v^n_j = \frac{1}{N} e^{in\phi_j}\) and \(t^n_j = \frac{1}{N}\varepsilon_j e^{in\phi_j}\), so that \(U = N \tilde f^{(1)} (\mathbf v^1 + \mathbf t^1)\). Then

\[ \sum_{j=1}^{N} Q_{ij}\, v^n_j = N\left( \tilde c_n + |\tilde f^{(n)}|^2 \right) v^n_i + N |\tilde f^{(n)}|^2\, t^n_i + \sum_{m} S_1^{(n-m)}\, |\tilde f^{(m)}|^2 \left( v^m_i + t^m_i \right) \tag{C.1} \]
\[ \sum_{j=1}^{N} Q_{ij}\, t^n_j = N\kappa |\tilde f^{(n)}|^2 \left( t^n_i + v^n_i \right) + \sum_{m} \tilde c_m\, S_1^{(n-m)}\, v^m_i + \sum_{m} |\tilde f^{(m)}|^2 \left( S_1^{(n-m)} + S_2^{(n-m)} \right) \left( v^m_i + t^m_i \right), \tag{C.2} \]

where \(S_1^{(n)} = \sum_{j=1}^{N} \varepsilon_j e^{in\phi_j}\) and \(S_2^{(n)} = \sum_{j=1}^{N} \varepsilon_j^2 e^{in\phi_j}\). Note that \(S_1^{(n)}\) and \(S_2^{(n)}\) are of order \(\sqrt N\). Hence, to leading order in N, we can treat the subspace spanned by \(\{\mathbf v^{(1)}, \mathbf t^{(1)}\}\) as an invariant subspace of Q. Let us denote by \(\hat{\mathbf x} = (\hat x_1, \hat x_2)^T\) the reduction of an N-dimensional vector \(\mathbf x\) to the subspace spanned by \(\{\mathbf v^{(1)}, \mathbf t^{(1)}\}\); that is, \(\hat x_1 \mathbf v^1 + \hat x_2 \mathbf t^1\) is equal to the projection of \(\mathbf x\) on the two-dimensional subspace. With this notation, \(\hat U = N \tilde f^{(1)} \binom{1}{1}\). The reduction of Q is given by

\[ \hat Q = N \begin{pmatrix} \tilde c_1 + |\tilde f^{(1)}|^2 & \kappa |\tilde f^{(1)}|^2 \\ |\tilde f^{(1)}|^2 & \kappa |\tilde f^{(1)}|^2 \end{pmatrix} + O(\sqrt N). \tag{C.3} \]

Thus, to leading order in 1/N, one obtains \(\hat Q^{-1} \hat U = \frac{1}{\kappa \tilde f^{(1)}} \binom{0}{1}\), and hence, to this order of approximation,

\[ w_{ole,j} = \frac{1}{N \kappa \tilde f^{(1)}}\, \varepsilon_j e^{i\phi_j} + O(N^{-3/2}). \tag{C.4} \]
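The readout of equation C.4 can be exercised in a small simulation: because its weights are proportional to the quenched fluctuations ε_j, coherent (correlated) noise modes average out, and the angular error now decreases with N, in contrast with the population vector. The tuning curve f(φ) = 1 + cos φ and the noise structure below are illustrative assumptions, not the model of equation 1.14.

```python
# Sketch of the zeroth-order estimator w_j = eps_j e^{i phi_j} / (N kappa f1),
# cf. equation C.4. Tuning curve and noise model are illustrative assumptions.
import numpy as np

def ole0_std(N, trials=3000, kappa=0.25, s_shared=0.2, s_priv=0.5, seed=2):
    rng = np.random.default_rng(seed)
    phi = 2 * np.pi * np.arange(N) / N
    f = 1.0 + np.cos(phi)                        # tuning curve at theta = 0
    f1 = 0.5                                     # first Fourier coefficient of f
    errs = []
    for _ in range(trials):
        eps = rng.normal(0, np.sqrt(kappa), N)   # fresh diversity realization
        w = eps * np.exp(1j * phi) / (N * kappa * f1)
        z, u = rng.normal(0, s_shared, 2)        # shared noise modes
        r = (1 + eps) * f + z * np.cos(phi) + u * np.sin(phi) \
            + rng.normal(0, s_priv, N)
        errs.append(np.angle(np.sum(w * r)))     # error of the angle estimate
    return np.std(errs)

print(ole0_std(100), ole0_std(1600))  # the error shrinks roughly like 1/sqrt(N)
```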
Appendix D: Statistics of the Efficiency of the Zeroth Approximation to the Optimal Linear Estimator

D.1 The Euclidean Error. The zeroth approximation readout weights are given by \(w^{(0)}_j = \frac{1}{N\kappa \tilde f^{(1)}}\, \varepsilon_j e^{i\phi_j}\). For the calculation of \(E^{(0)} \equiv E(w^{(0)})\) (see
equation 1.19), we need to compute the following terms:

\[ U^{\dagger} w^{(0)} = w^{(0)\dagger} U = 1 + \frac{1}{N\kappa} \sum_{i=1}^{N} \left( \varepsilon_i^2 + \varepsilon_i \right) \tag{D.1} \]
\[ w^{(0)\dagger} C w^{(0)} = \frac{1}{\kappa^2 |\tilde f^{(1)}|^2}\, I_1 \tag{D.2} \]
\[ w^{(0)\dagger} F w^{(0)} = 1 + \frac{2}{N\kappa} \sum_{i=1}^{N} \left( \varepsilon_i^2 + \varepsilon_i \right) + \frac{1}{\kappa^2 |\tilde f^{(1)}|^2} \left( I_2 + 2\,\mathrm{Re}\{I_3\} + I_4 \right), \tag{D.3} \]

where

\[ I_1 = \frac{1}{N^2} \sum_{jk} \varepsilon_j \varepsilon_k\, e^{i(\phi_j-\phi_k)}\, C_{jk} \tag{D.4} \]
\[ I_2 = \frac{1}{N^2} \sum_{jk} \varepsilon_j \varepsilon_k \int \frac{d\theta}{2\pi} f(\phi_j-\theta)\, f(\phi_k-\theta)\, e^{i(\phi_j-\phi_k)} \tag{D.5} \]
\[ I_3 = \frac{1}{N^2} \sum_{jk} \varepsilon_j^2 \varepsilon_k \int \frac{d\theta}{2\pi} f(\phi_j-\theta)\, f(\phi_k-\theta)\, e^{i(\phi_j-\phi_k)} \tag{D.6} \]
\[ I_4 = \frac{1}{N^2} \sum_{jk} \varepsilon_j^2 \varepsilon_k^2 \int \frac{d\theta}{2\pi} f(\phi_j-\theta)\, f(\phi_k-\theta)\, e^{i(\phi_j-\phi_k)}. \tag{D.7} \]

The statistics of I_1 are given by \(\langle I_1 \rangle = \frac{\kappa a}{N}\) and \(\langle (\Delta I_1)^2 \rangle = \frac{2\kappa^2}{N^2} \left( \tilde B(0) + \tfrac{1}{2} \tilde B(2) \right)\), where \(\tilde B(n) = \int \frac{d\varphi}{2\pi}\, C^2(\varphi)\, e^{in\varphi}\). For I_2, I_3, and I_4, one obtains \(\langle I_2 \rangle = \frac{\kappa}{N} \widetilde{f^2}(0)\), \(\langle I_3 \rangle = 0\), \(\langle I_4 \rangle = \frac{2\kappa^2}{N} \widetilde{f^2}(0)\) for the means, and for the variances,

\[ \langle (\Delta I_2)^2 \rangle = \frac{\kappa^2}{N^2} \left( \widetilde{G^2}(0) + \widetilde{G^2}(2) \right) \tag{D.8} \]
\[ \langle (\Delta\, 2\mathrm{Re}\{I_3\})^2 \rangle = 8\, \frac{\kappa^3}{N^2} \left( \widetilde{G^2}(0) + \widetilde{G^2}(2) \right) + O(N^{-3}) \tag{D.9} \]
\[ \langle (\Delta I_4)^2 \rangle = 4\, \frac{\kappa^4}{N^2} \left( \widetilde{G^2}(0) + \widetilde{G^2}(2) \right) + O(N^{-3}), \tag{D.10} \]

where \(G(\varphi) = f * f(\varphi) \equiv \int \frac{d\theta}{2\pi}\, f(\theta)\, f(\varphi - \theta)\). Substituting the above in equation 1.19 yields

\[ E^{(0)} = \frac{1}{\kappa^2 |\tilde f^{(1)}|^2} \left( I_1 + I_2 + 2\,\mathrm{Re}\{I_3\} + I_4 \right). \tag{D.11} \]
Hence, \(E^{(0)}\) is a random variable with quenched fluctuations whose mean and standard deviation are O(1/N):

\[ \langle E^{(0)} \rangle = \frac{1}{N\kappa |\tilde f^{(1)}|^2} \left( a + \widetilde{f^2}(0)\, [1 + 2\kappa] \right) \tag{D.12} \]
\[ \langle (\Delta E^{(0)})^2 \rangle = O(1/N^2), \tag{D.13} \]

where we have used \(\langle \Delta I_x\, \Delta I_y \rangle \le \sqrt{ \langle (\Delta I_x)^2 \rangle\, \langle (\Delta I_y)^2 \rangle }\) in the last inequality.
D.2 Angle Estimation Error. Let x̂ and ŷ denote the real and imaginary parts, respectively, of the zeroth approximation estimator: \(\hat z = \mathbf r^T \mathbf w^{(0)} = \hat x + i\hat y\). Assuming small angular estimation errors, we use equations B.14 to B.17 to calculate the statistics of the zeroth approximation at θ = 0. As above, after averaging over the statistics of the neuronal diversity, the results are independent of the specific choice of θ = 0. The statistics of x̂ and ŷ are given by

\[ \langle \hat x \rangle = 1 \tag{D.14} \]
\[ \langle (\Delta \hat x)^2 \rangle = \frac{1 + \frac{1}{2\kappa}}{N}\; \frac{\widetilde{f^2}(0) + \widetilde{f^2}(2)}{|\tilde f^{(1)}|^2}. \tag{D.15} \]

Hence, x̂ is a self-averaging quantity with respect to the quenched disorder. The statistics of ŷ are given by

\[ \langle \hat y \rangle = 0 \tag{D.16} \]
\[ \langle (\Delta \hat y)^2 \rangle = \frac{1 + \frac{1}{2\kappa}}{N}\; \frac{\widetilde{f^2}(0) - \widetilde{f^2}(2)}{|\tilde f^{(1)}|^2} \tag{D.17} \]
\[ \langle (\delta \hat y)^2 \rangle = \frac{a}{2 N \kappa |\tilde f^{(1)}|^2} \tag{D.18} \]
\[ \langle (\Delta (\delta \hat y)^2)^2 \rangle = \frac{\tilde B(0) + \frac{1}{2} \tilde B(2)}{2 N^2 \kappa^2 |\tilde f^{(1)}|^4}. \tag{D.19} \]

Summarizing the above results yields, for the bias,

\[ \langle \hat\theta \rangle = 0 \tag{D.20} \]
\[ \langle (\Delta \hat\theta)^2 \rangle = \frac{1 + \frac{1}{2\kappa}}{N}\; \frac{\widetilde{f^2}(0) - \widetilde{f^2}(2)}{|\tilde f^{(1)}|^2}, \tag{D.21} \]
where we have used the self-averaging property of xˆ , neglecting its quenched fluctuations. For the trial-to-trial variability of the angle
estimation, one obtains

\[ \langle (\delta \hat\theta)^2 \rangle = \frac{a}{2 N \kappa |\tilde f^{(1)}|^2} \tag{D.22} \]
\[ \langle (\Delta (\delta \hat\theta)^2)^2 \rangle = \frac{\tilde B(0) + \frac{1}{2} \tilde B(2)}{2 N^2 \kappa^2 |\tilde f^{(1)}|^4}. \tag{D.23} \]
Hence, the zeroth approximation of the optimal linear estimator has a bias and trial-to-trial fluctuations that are of order \(O(1/\sqrt N)\).

Acknowledgments

This research is partially supported by the Israel Science Foundation, Center of Excellence grant 8006/00, and by the USA-Israel Binational Science Foundation. M.S. was supported in part by a scholarship from the Clore Foundation and by the Burroughs Wellcome Fund.

References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Comput., 11(1), 91–101.
Carmena, J. M., Lebedev, M. A., Crist, R. E., O'Doherty, J. E., Santucci, D. M., Dimitrov, D., Patil, P. G., Henriquez, C. S., & Nicolelis, M. A. (2003). Learning to control a brain-machine interface for reaching and grasping by primates. PLoS Biol., 1(2), 193–207.
Coltz, J. D., Johnson, M. T., & Ebner, T. J. (2000). Population code for tracking velocity based on cerebellar Purkinje cell simple spike firing in monkeys. Neurosci. Lett., 296(1), 1–4.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1982). On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. J. Neurosci., 2(11), 1527–1537.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233(4771), 1416–1419.
Hansen, L. K., Pathria, R., & Salamon, P. (1993). Stochastic dynamics of supervised learning. J. Phys. A: Math. Gen., 26(1), 63–71.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol., 160, 106–154.
Kay, S. M. (1993). Fundamentals of statistical signal processing. Upper Saddle River, NJ: Prentice Hall.
Lee, D., Port, N. L., Kruse, W., & Georgopoulos, A. P. (1998). Variability and correlated noise in the discharge of neurons in motor and parietal areas of the primate cortex. J. Neurosci., 18(3), 1161–1170.
Mastronarde, D. N. (1983). Correlated firing of cat retinal ganglion cells. II. Responses of X- and Y-cells to single quantal events. J. Neurophysiol., 49(2), 325–349.
Maynard, E. M., Hatsopoulos, N. G., Ojakangas, C. L., Acuna, B. D., Sanes, J. N., Normann, R. A., & Donoghue, J. P. (1999). Neuronal interactions improve cortical population coding of movement direction. J. Neurosci., 19(18), 8083–8093.
Mehta, M. L. (1991). Random matrices (2nd ed.). San Diego, CA: Academic Press.
Paradiso, M. A. (1988). A theory for the use of visual orientation information which exploits the columnar structure of striate cortex. Biol. Cybern., 58(1), 35–49.
Radons, R. (1993). On stochastic dynamics of supervised learning. J. Phys. A: Math. Gen., 26(14), 3455–3461.
Razak, K. A., & Fuzessery, Z. M. (2002). Functional organization of the pallid bat auditory cortex: Emphasis on binaural organization. J. Neurophysiol., 87(1), 72–86.
Ringach, D. L., Shapley, R. M., & Hawken, M. J. (2002). Orientation selectivity in macaque V1: Diversity and laminar dependence. J. Neurosci., 22(13), 5639–5651.
Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rates. J. Comp. Neurosci., 1(1–2), 89–107.
Schwartz, A. B., Taylor, D. M., & Tillery, S. I. (2001). Extraction algorithms for cortical control of arm prosthetics. Curr. Opin. Neurobiol., 11(6), 701–707.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. USA, 90(22), 10749–10753.
Seung, H. S., Tishby, N., & Sompolinsky, H. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45(8), 6056–6091.
Shamir, M., & Sompolinsky, H. (2001). Correlation codes in neuronal networks. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Shamir, M., & Sompolinsky, H. (2004). Nonlinear population codes. Neural Comput., 16(6), 1105–1136.
Sompolinsky, H., Yoon, H., Kang, K., & Shamir, M. (2001). Population coding in neuronal systems with correlated noise. Phys. Rev. E, 64(5 Pt. 1), 051904.
Thomas, J. A., & Cover, T. M. (1991). Elements of information theory. New York: Wiley.
van Kan, P. L., Scobey, R. P., & Gabor, A. J. (1985). Response covariance in cat visual cortex. Exp. Brain Res., 60(3), 559–563.
Wessberg, J., Stambaugh, C. R., Kralik, J. D., Beck, P. D., Laubach, M., Chapin, J. K., Kim, J., Biggs, S. J., Srinivasan, M. A., & Nicolelis, M. A. (2000). Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature, 408(6810), 361–365.
Wu, S., Amari, S., & Nakahara, H. (2002). Population coding and decoding in a neural field: A computational study. Neural Comput., 14(5), 999–1026.
Wu, S., Amari, S., & Nakahara, H. (2004). Information processing in a neuron ensemble with the multiplicative correlation structure. Neural Netw., 17(2), 205–214.
Yoon, H., & Sompolinsky, H. (1999). The effect of correlations on the Fisher information of population codes. In M. J. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370(6485), 140–143.
Received May 26, 2005; accepted January 10, 2006.
LETTER
Communicated by Jonas Sjoeberg
A Neighborhood-Based Enhancement of the Gauss-Newton Bayesian Regularization Training Method

Miguel Pinzolas
[email protected]
Departamento de Ingeniería de Sistemas y Automática, Universidad Politécnica de Cartagena, 30202, Cartagena, Spain

Ana Toledo
[email protected]
Departamento de Tecnología Electrónica, Universidad Politécnica de Cartagena, 30202, Cartagena, Spain

Juan Luís Pedreño
[email protected]
Departamento de Tecnologías de la Información y las Comunicaciones, Universidad Politécnica de Cartagena, 30202, Cartagena, Spain
This work develops and tests a neighborhood-based approach to the Gauss-Newton Bayesian regularization training method for feedforward backpropagation networks. The proposed method improves the training efficiency, significantly reducing requirements on memory and computational time while maintaining the good generalization feature of the original algorithm. This version of the Gauss-Newton Bayesian regularization greatly expands the scope of application of the original method, as it allows training networks up to 100 times larger without losing performance.

1 Introduction

The Gauss-Newton Bayesian regularization training method (GNBR) (Foresee & Hagan, 1997) is a network training strategy that emphasizes not only achieving a good training score but also obtaining a network that generalizes well. Good generalization is always a desirable characteristic of a neural network because it measures the ability of the trained net to give reasonable responses to inputs that are different from the training set. Achieving a small training error does not ensure good generalization; in fact, if the error on the training set is driven to a very small value, overfitting can occur, especially if the training data are contaminated with noise. A rigorous treatment of overfitting and the bias/variance trade-off can be found in Geman, Bienenstock, and Doursat (1992). Neural Computation 18, 1987–2003 (2006)
© 2006 Massachusetts Institute of Technology
Overfitting reflects the fact that the network has memorized the training examples, but it has not learned to generalize to new situations. There are a number of strategies that can be employed to avoid overfitting. One of them is to minimize a combination of squared errors and weights (Foresee & Hagan, 1997; MacKay, 1992). This approach reflects the fact that generally a network whose neurons are saturated is more likely to be overfitting than one whose neurons are working near their linear zone (Bartlett, 1997). The subsequent problem is to determine the optimal regularization parameters that adequately combine the squared error and weights so as to produce a network that generalizes well. One approach to the automation of this process comes from the Bayesian framework described in MacKay (1992). In this framework, the weights and biases of the network are assumed to be random variables with specified distributions. The regularization parameters are related to the unknown variances associated with these distributions, so that these parameters can be estimated using statistical techniques. Once the error index to minimize is fixed, a suitable minimization method must be selected to perform the network training. A usual choice is the Levenberg-Marquardt (LM) second-order algorithm (Levenberg, 1944; Marquardt, 1963; Hagan & Menhaj, 1994). A detailed discussion of the use of Bayesian regularization in combination with LM training can be found in Foresee and Hagan (1997). In general, second-order optimization methods have been demonstrated to perform better than first-order ones. However, when the error reduction is measured against time, their performance degrades as the number of parameters to be optimized becomes considerable. This degradation is due to the fact that in these methods, the Hessian or some approximation to it must be computed at each iteration and then inverted.
This requires the creation and storage of a matrix of n² elements, where n is the number of adjustable parameters, plus its inversion, which takes roughly n³ operations. As the number of adjustable parameters in neural networks is usually large, the use of second-order training algorithms has typically been restricted to applications in which the networks employed are small or the training time is not a relevant factor. And even in these cases, the use of second-order algorithms can be impractical due to the memory requirements. That is why some effort has been made in the past few years to reduce the time and memory needs of these algorithms (see, e.g., Ngia & Sjoberg, 2000; Asirvadam, McLoone, & Irwin, 2002; Lera & Pinzolas, 1998, 2002; Toledo, Pinzolas, Ibarrola, & Lera, 2005). Among them, there are noticeable achievements by Lera and Pinzolas (1998, 2002) and Toledo et al. (2005). In these works, a considerable reduction of the storage and number of operations is reported by means of a neighborhood-based approach to the LM second-order training algorithm. With this modification, it is shown that the performance of the LM method can be maintained, and even improved, with a small fraction of the memory requirements of the original method.
NB Enhancement of the GNBR Training Method
In this work, an extension of this approach is considered together with the Bayesian regularization strategy in order to join its good generalization ability to the good training performance of the neighborhood-based second-order algorithms. We show that by selecting and training a subset (neighborhood) of the whole set of weights of the network, a similar training performance is reached when measured against time, while only a very reduced portion of the memory is used compared with the original Gauss-Newton Bayesian regularization method. We also show that for some neighborhood sizes, the generalization capabilities of the trained network are better while only a fraction of the memory and time is used. As a consequence, much larger networks can be trained, and the scope of application of the Gauss-Newton Bayesian regularization becomes notably broadened.

2 Learning Algorithm

Whereas typical training algorithms aim to reduce the sum of squared errors, regularization adds a term that penalizes the sum of squares of the network weights. The objective function becomes

\[ F = \alpha \cdot E_D + \beta \cdot E_W, \tag{2.1} \]

where E_D is the summed squared error (which can be substituted by the mean squared error or some other usual error index), E_W is the sum of squares of the network weights, and α and β are parameters that dictate the emphasis of the training. If α ≫ β, then the training algorithm will drive the errors smaller. If α ≪ β, the training will emphasize weight-size reduction at the expense of network errors, thus producing a smoother network response. To deduce the optimal values of these parameters, the Bayesian framework of MacKay (1992) is applied. In it, the weights of the network are considered random variables. Assuming that both the noise in the data set and the prior distribution for the weights are gaussian, it can be found, as detailed in Foresee and Hagan (1997), that the optimal values of α and β, α^MP and β^MP, can be approximated by

\[ \alpha^{MP} = \frac{\gamma}{2 E_W(w^{MP})}, \qquad \beta^{MP} = \frac{n - \gamma}{2 E_D(w^{MP})}, \tag{2.2} \]

where

\[ \gamma = N - 2 \alpha^{MP}\, \mathrm{tr}\!\left( H_{MP}^{-1} \right) \tag{2.3} \]
is called the effective number of parameters, N is the total number of parameters in the network, n is the number of examples in the data set, and H_MP is the Hessian of the objective function evaluated at the weights w^MP. An approximation to the Hessian H_MP is obtained naturally if the LM optimization algorithm is used to locate the w^MP corresponding to the minimum (Hagan & Menhaj, 1994). The conjunction of the Bayesian regularization and the Gauss-Newton approximation to the Hessian matrix gives the Gauss-Newton Bayesian regularization training scheme, which can be summarized as follows:

1. Initialize α, β, and the weights.
2. Take one step of the Levenberg-Marquardt algorithm to minimize the objective function in equation 2.1.
3. Compute the effective number of parameters, making use of the Gauss-Newton approximation to the Hessian available in the LM training algorithm.
4. Compute new estimates for the objective function parameters α and β, following equation 2.2.
5. Iterate steps 2 through 4 until convergence.

In the core of the LM optimization process, the Jacobian j_W of the errors with respect to the weight and bias variables W is calculated. Then an approximation of the Hessian (jj) is created by computing

\[ jj = j_W^T \cdot j_W . \tag{2.4} \]

As in the LM method, the change dW in the vector W is obtained by solving

\[ je = j_W^T \cdot E, \qquad H = jj + I \cdot \lambda, \qquad H \cdot dW = -je, \tag{2.5} \]

where E contains all the errors and I is the identity matrix. The learning coefficient λ is an adaptive parameter that balances the behavior of the method between the inverse-Hessian and steepest-descent extremes. Its value is increased by λ_INC (typically λ_INC = 10) until the change in the weights results in a reduced performance value. The change is then made to the network, and λ is decreased by λ_DEC (typically λ_DEC = 0.1).
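The GNBR scheme above can be condensed into a self-contained numerical sketch. The toy 1-5-1 tanh network, the finite-difference Jacobian, and all constants below are illustrative assumptions, not the authors' implementation; following Foresee and Hagan (1997), the effective-parameter update is applied with the coefficient of the weight term E_W.

```python
# Condensed GNBR sketch (cf. equations 2.1-2.5). Toy 1-5-1 tanh network,
# finite-difference Jacobian, and constants are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
t = x ** 3 + 0.05 * rng.normal(size=x.size)       # noisy data to fit

def net(w):                                        # 1-5-1 network, 16 weights
    w1, b1, w2, b2 = w[:5], w[5:10], w[10:15], w[15]
    return np.tanh(np.outer(x, w1) + b1) @ w2 + b2

def err(w):                                        # residual vector
    return net(w) - t

def jac(w, h=1e-6):                                # Jacobian of the errors
    J = np.empty((x.size, w.size))
    for k in range(w.size):
        d = np.zeros_like(w); d[k] = h
        J[:, k] = (err(w + d) - err(w - d)) / (2 * h)
    return J

Np = 16
w = 0.1 * rng.normal(size=Np)
alpha, beta, lam = 1.0, 1.0, 1e-2                  # data/weight coefficients

def F(w):                                          # objective, eq. 2.1
    return alpha * np.sum(err(w) ** 2) + beta * np.sum(w ** 2)

sse_start = np.sum(err(w) ** 2)
for _ in range(100):
    J, e = jac(w), err(w)
    H = 2 * alpha * J.T @ J + 2 * beta * np.eye(Np)   # GN Hessian of F
    g = 2 * alpha * J.T @ e + 2 * beta * w            # gradient of F
    while True:                                        # one LM cycle, eq. 2.5
        dw = np.linalg.solve(H + lam * np.eye(Np), -g)
        if F(w + dw) < F(w):
            w, lam = w + dw, max(lam * 0.1, 1e-12); break
        lam *= 10.0
        if lam > 1e12: break
    # Bayesian updates (eq. 2.2-2.3); gamma uses the coefficient of E_W,
    # following the Foresee & Hagan (1997) convention.
    gamma = Np - 2 * beta * np.trace(np.linalg.inv(H))
    beta = gamma / (2 * np.sum(w ** 2))
    alpha = (x.size - gamma) / (2 * np.sum(err(w) ** 2))

print(sse_start, np.sum(err(w) ** 2))              # training error decreases
```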
It can be seen from equation 2.5 that if m is the number of weights, it is necessary to calculate and store the m² elements of the H matrix at each LM cycle and find its inverse, which needs about m³ operations to be carried out. This makes the LM method very expensive in both memory and number of operations when the network to be trained has a significant number of adaptive weights. Neighborhood-based Levenberg-Marquardt (NBLM) is a modification of the LM algorithm based on a reduction in the number of weights adapted at each iteration (Lera & Pinzolas, 1998, 2002; Toledo et al., 2005). To do this, the network is divided into several neighborhoods, which act as independent learning units. This allows applying the LM method to one neighborhood in each step of the algorithm, thus decreasing the number of operations and the amount of memory required. The neighborhoods can be built in many ways; a brief discussion of the different methods of forming neighborhoods is provided in section 3. Also, the selection of the neighborhood to be trained at each iteration could be done in several ways. In this work, at each iteration, all neighborhoods had equal probabilities of being chosen for training. The NBLM algorithm can be summarized as follows:

1. Define the network structure, assign initial weights, and define neighborhoods (as in Lera & Pinzolas, 1998, 2002, or as in Toledo et al., 2005).
2. Assign one initial value λ_i of the learning coefficient for each neighborhood considered.
3. Randomly select the neighborhood to be trained.
4. Perform an LM cycle on the selected neighborhood, and evaluate the network error with the new weights. Stop the learning process if one of the termination conditions (elapsed time or maximum number of epochs, for example) is met.
5. If the error has decreased, decrease the value of the corresponding λ_i, and go back to step 3.
6. Otherwise, increase the value of λ_i, and go back to step 4.
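The NBLM accept/reject logic with one λ_i per neighborhood can be sketched on a toy problem. The linear least-squares setting below (where the reduced Jacobian of a neighborhood is simply a block of columns) and all sizes are illustrative assumptions; steps 3 to 6 above are followed literally.

```python
# Minimal NBLM loop on a toy linear least-squares problem: only the selected
# neighborhood's weights are updated per iteration, each neighborhood keeping
# its own lambda_i. Problem and sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 60)); w_true = rng.normal(size=60)
y = A @ w_true + 0.01 * rng.normal(size=200)

idx = np.array_split(np.arange(60), 6)       # six chunked neighborhoods
lam = np.full(6, 1e-2)                       # one lambda_i per neighborhood
w = np.zeros(60)
sse = lambda w: np.sum((A @ w - y) ** 2)

sse_start = sse(w)
for it in range(300):
    k = rng.integers(6)                      # step 3: random neighborhood
    J = A[:, idx[k]]                         # reduced Jacobian (s columns)
    e = A @ w - y
    H = J.T @ J + lam[k] * np.eye(idx[k].size)
    step = np.linalg.solve(H, -J.T @ e)      # step 4: reduced LM cycle
    w_new = w.copy(); w_new[idx[k]] += step
    if sse(w_new) < sse(w):                  # step 5: accept, shrink lambda_i
        w, lam[k] = w_new, lam[k] * 0.1
    else:                                    # step 6: grow lambda_i, retry later
        lam[k] *= 10.0

print(sse_start, sse(w))                     # error driven near the noise floor
```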
Applying Bayesian regularization to this modification results in the proposed neighborhood-based Bayesian regularization method (NBBR). The corresponding algorithm is:

1. Define the network structure, compute the initial values for α and β, assign initial weights, and define neighborhoods by chunking the total vector of weights (see section 3). Assign one initial value λ_i of the learning coefficient for each neighborhood considered.
2. Randomly select the neighborhood to be trained.
3. Perform an LM cycle on the selected neighborhood, and evaluate the network error with the new weights. Stop the learning process if the terminating condition is met.
4. Compute the effective number of parameters, making use of equation 2.3. This equation must be applied taking into account that N is now the number of parameters (weights) in the selected neighborhood, and H_MP is the Gauss-Newton approximation to the reduced Hessian calculated in step 3.
5. Compute new estimates for the objective function parameters α and β.
6. If the error has decreased, decrease the value of λ_i, and go back to step 2.
7. Otherwise, increase the value of λ_i, and go back to step 3.

As we will show in section 4, NBBR shows a better convergence rate than GNBR when compared against time, and it maintains the good generalization properties of Bayesian regularization while using a fraction of the memory, which allows training very large networks without losing performance. This better convergence rate can be attributed to three related facts. First, the use of a different learning coefficient for each neighborhood allows a better estimation of the local shape of the error surface in each w-subspace. Second, as MacKay (1992) noted, some of the approximations employed in the GNBR method hold better when the K/N quotient is large, K being the number of output units times the number of data pairs and N being the number of free parameters. When a neighborhood is trained, N is restricted to the number of weights belonging to it, so that K/N is bigger than in the case of training the whole net. Finally, the time required for one iteration of the learning algorithm is drastically reduced. Therefore, in the same time, more iterations (and therefore more adaptations of the learning coefficients) are carried out by the neighborhood-based algorithm, allowing a quick decrease of the error, mainly at the beginning of the training process.
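The memory and operation savings behind the third point follow directly from the m² storage and roughly m³ inversion cost quoted in section 2: splitting m weights into k equal neighborhoods of size s ≈ m/k reduces storage by a factor of (m/s)² and operations per cycle by (m/s)³. A worked example, taking m = 901 from the SISO network discussed in section 3 (the neighborhood counts are assumptions):

```python
# Worked example: memory and operation savings of neighborhood training.
m = 901                      # total weights (SISO example of section 3)
for k in (1, 5, 10, 50):     # number of equal neighborhoods (assumptions)
    s = m // k               # approximate neighborhood size
    print(k, s, (m * m) / (s * s), (m ** 3) / (s ** 3))
```

With k = 10 neighborhoods, each LM cycle needs roughly a hundredth of the storage and a thousandth of the inversion work of the full method.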
3 Neighborhood Selection

The neighborhoods can be formed in many ways. One approach is to randomly group the neurons into equally sized neighborhoods (as in Lera & Pinzolas, 1998, 2002). Alternatively, neighborhoods could be arranged according to some distance measure defined between neurons or weights. Toledo et al. (2005) proposed a method that forms neighborhoods by "chunking" the total vector of weights into equal-sized chunks, each
NB Enhancement of the GNBR Training Method
constituting a neighborhood. The key to this method is that the total vector of weights of the network is assembled in a fixed order. For example, if the network is a two-layer SISO (single-input, single-output) network with a first layer of 300 sigmoids and a linear output layer, the vector of weights has its 901 components ordered as follows: the first 300 correspond to the weights of the nonlinear layer, the following 300 to the biases of the nonlinear layer, after them come the 300 weights of the linear neuron, and the last component is its bias. Forming the neighborhoods by slicing this vector ensures, especially for the smaller neighborhood sizes, that each neighborhood is formed mainly by weights or biases of the same layer.

This procedure brings two main advantages. First, the neighborhoods formed in this way are more likely to have similar local shapes in the error surface. This helps the algorithm adjust better to the shape of the error surface in the subspace spanned by the weights belonging to the neighborhood. For example, if a neighborhood is formed only by weights of a linear layer, the corresponding error surface is quadratic; thus, the convergence of the LM algorithm is quadratic, and low values of the learning coefficient λ can be used to exploit this fact. Second, with the neighborhoods formed in this way, the matrices involved in the minimization are better conditioned for inversion, as weights of the same class and on the same layer tend to have similar ranges of variation for the corresponding components of the Jacobian. This method has been used in all the experiments in this work. Undoubtedly, more elaborate criteria for grouping weights into neighborhoods can be found; still, as shown in the following section, the results obtained with this approach are significant.

4 Experimental Results

In the experiments, two different goals were pursued.
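Before moving to the experiments, the chunking scheme just described can be sketched in Python (a hypothetical helper; NumPy's `array_split` also handles a weight count that is not an exact multiple of the chunk count):

```python
import numpy as np

def chunk_neighborhoods(n_weights, n_chunks):
    """Slice the layer-ordered weight vector into contiguous, (nearly)
    equal-sized chunks; each chunk is one neighborhood of weight indices."""
    return np.array_split(np.arange(n_weights), n_chunks)

# The 901-weight SISO example: indices 0-299 are hidden weights, 300-599
# hidden biases, 600-899 output weights, and 900 the output bias.
neighborhoods = chunk_neighborhoods(901, 10)
```

With small chunks, each neighborhood falls almost entirely inside one of these index ranges, which is the point of ordering the weight vector layer by layer.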
First, it was necessary to study the evolution of the training error with different neighborhood sizes. The results of a training episode depend strongly on the initial weights and the training set of examples. In the first experiment, the examples were extracted from randomly generated functions in order to discard, as far as possible, any structural influence of the training set on the results. In each experiment, we carried out a set of 100 training episodes, initializing all the weights from a uniformly distributed, symmetric random number generator giving values in the [−1, 1] interval. The mean of the error evolutions was then plotted against time. Comparisons were performed against both the GNBR and the conjugate gradient with Fletcher-Reeves updates (CGF) methods (Scales, 1985), for several neighborhood sizes. The neighborhood size is given as a percentage, indicating the fraction of the total weights that is considered for training at each training cycle.

A second experiment was aimed at studying the relationship between learning performance and generalization capabilities. The function
employed is very similar to the one used in Lera and Pinzolas (1998, 2002). In this second experiment, the evolution of both the training error and the effective number of parameters is studied to justify the relationship between training and generalization errors and these two quantities.

The last set of experiments is designed to test that the improvements in training performance do not come at the cost of losing generalization capabilities. To prove this, four test functions were used: two one-dimensional functions (a triangular wave and a sinusoidal one) and two two-dimensional ones (sinusoidal and triangular 2D waves). The two kinds of signals pose different challenges to generalization: the triangular ones are characterized by abrupt changes in direction that lead to discontinuities in the first derivative, while the sinusoidal ones are smooth.

4.1 Experiment 1. To create the sets of training examples, 100 random functions were generated in the following way:

1. A SISO neural network was generated, with 900 hyperbolic tangent neurons in the hidden layer and one linear neuron in the output.
2. The weights of the hidden layer were initialized randomly in the interval [−1000, 1000] to ensure that the generated function is spiky enough.
3. The weights of the output neuron were randomly initialized in the interval [−1, 1].

For each experiment, a different network was created, and 2000 examples were generated by feeding that network uniformly distributed inputs in the [−10, 10] interval. The form of the functions generated by this method was similar to that of Figure 1. The training neural network was composed of a hidden layer with 300 neurons (with hyperbolic tangent activation function) and one linear neuron in the output layer. One hundred training episodes were run for each of four different neighborhood sizes with NBBR, GNBR, and CGF. The initial weights for each training episode were chosen randomly with uniform distribution in the [−1, 1] interval.
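The generator of steps 1 to 3 can be sketched as follows. The hidden-bias range is an assumption (the paper specifies only the hidden-layer weights), and the output bias is drawn from [−1, 1] like the output weights:

```python
import numpy as np

def random_spiky_function(rng, n_hidden=900, w_range=1000.0):
    """Random SISO net: n_hidden tanh units, hidden parameters in
    [-w_range, w_range] (spiky), linear output layer in [-1, 1]."""
    w1 = rng.uniform(-w_range, w_range, n_hidden)  # hidden weights (step 2)
    b1 = rng.uniform(-w_range, w_range, n_hidden)  # hidden biases (assumed range)
    w2 = rng.uniform(-1.0, 1.0, n_hidden)          # output weights (step 3)
    b2 = rng.uniform(-1.0, 1.0)                    # output bias
    return lambda x: np.tanh(np.outer(x, w1) + b1) @ w2 + b2

rng = np.random.default_rng(0)
f = random_spiky_function(rng)
x = rng.uniform(-10.0, 10.0, 2000)  # 2000 training inputs in [-10, 10]
y = f(x)
```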
The means of the error evolutions are shown in Figure 2, in which neighborhoods are labeled by the percentage of weights considered for training at each iteration. As in the GNBR algorithm, the performance function varies at each step of the training, and the mean squared error (MSE) has been selected as the measure of how well the network reproduces the training examples. The figure shows that all neighborhood sizes perform better than GNBR and CGF. It is noticeable that the training performance is better even with the smallest neighborhood size, which requires only 1% of the memory needed by the GNBR method. A summary of the memory and time-per-cycle savings of NBBR is given in Table 1, together with the minimum MSE reached after training.
Figure 1: Random function for training. (Output plotted against input over [−10, 10].)
Figure 2: MSE mean error evolution against time (sec.) for NBBR (10%, 20%, 50%, 80%), GNBR, and CGF.
Table 1: Training Comparison Between NBBR and GNBR.

                                      NBBR Training
                             10%     20%     50%     80%     GNBR
Memory usage                 1%      4%      25%     64%     100%
Relative time (one iter.)    1.00    1.10    1.88    3.55    4.88
Final MSE in training        1.13    1.04    0.99    1.28    1.62
Experiments carried out with higher input-output dimensions showed similar tendencies. However, it cannot be guaranteed that the behavior of the error evolutions will be the same in all cases, as some neighborhood sizes could perform better than others for particular kinds of objective functions. This can be seen, for example, in the second experiment. In general, it can be expected that the better the size of the network is matched to the problem, the worse the smaller neighborhood sizes perform. In spite of this, small neighborhood sizes may be the only option for very large networks, or when training time or memory is severely limited.

4.2 Experiment 2. In the second experiment, a smoother function was chosen to examine the relationship between the expected test-set and training-set errors. This relationship has been studied by Moody (1992), who found the second-order approximation

ε_test(ξ, ξ′) ≈ ε_train(ξ) + 2 σ²_eff p_eff / n.   (4.1)

In this equation, n is the size of the training sample ξ, ξ′ is the set used for testing, σ²_eff is the effective noise variance in the response variables, and p_eff is the effective number of parameters in a nonlinear model. Although this expression does not accurately bound the generalization error, as pointed out by MacKay (1992), it effectively gives a guide on what should be expected: from equation 4.1, lower generalization errors should be found for those identifiers that achieve lower training errors with fewer effective parameters. The examples used for training were created by contaminating with zero-mean noise, uniform in the interval [−0.5, 0.5], the 991 samples obtained by uniformly sampling the following function in the x ∈ [0.1, 10] interval:

y = sin(8x) / x.   (4.2)
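The two data sets can be reproduced in a few lines; a regular grid of 991 points on [0.1, 10] is assumed for the uniform sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.1, 10.0, 991)          # 991 samples of x in [0.1, 10]
y_clean = np.sin(8.0 * x) / x            # noiseless target (used for validation)
noise = rng.uniform(-0.5, 0.5, x.shape)  # zero-mean uniform noise
y_train = y_clean + noise                # noisy training targets
```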
Figure 3: (A) Noiseless function for validation. (B) Noisy samples used for training. (Both panels plot sin(8x)/x against x from 0 to 10.)
For testing generalization, the noiseless samples have been used. The two sets can be observed in Figure 3. In Figure 4, the means of 100 training episodes are shown for four different neighborhood sizes of the NBBR, the GNBR method, and the conjugate gradient. As in the previous experiment, 300 neurons (this time, logistic sigmoids) were used in the hidden layer and one linear neuron in the
Figure 4: MSE mean error evolution against time (sec.) for NBBR (10%, 20%, 50%, 80%), GNBR, and CGF.
output layer. For each training episode, the initial weights were chosen randomly in the [−1, 1] interval. Figure 5 shows the evolution of the mean effective number of parameters for several neighborhood sizes against time. It can be observed that for the 10% and 20% neighborhoods, this number changes abruptly due to the large number of neighborhoods, each with its own effective number of parameters. For the larger neighborhoods, this number remains more stable. The MSE on the noiseless validation set for the NBBR, GNBR, and CGF training methods is shown in Figure 6. It can be observed that, as expected, the neighborhoods attaining lower error scores and having the smaller effective numbers of parameters at the end of training achieve better generalization.

4.3 Experiment 3. The behavior of the method in generalization has been tested further using the functions shown in Figure 7. The training sets were generated by uniformly sampling the functions and adding zero-mean noise, uniform in the [−0.1, 0.1] interval. The networks employed had 300 hyperbolic tangent neurons in the hidden layer, one linear neuron in the output layer, and one or two inputs, depending on the input dimension of the objective functions. As in the previous experiment, the networks were deliberately oversized to make them prone to overfitting.
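A sketch of how such noisy 2D training sets might be generated; the grid resolution, domain, and triangular-wave period below are illustrative assumptions, as the paper does not specify them:

```python
import numpy as np

def triangular(x, period=2.0):
    """Triangular wave in [-1, 1]: piecewise linear, with discontinuities
    in the first derivative at the turning points."""
    t = (x / period) % 1.0
    return 4.0 * np.abs(t - 0.5) - 1.0

rng = np.random.default_rng(0)
xx, yy = np.meshgrid(np.linspace(-3.0, 3.0, 50), np.linspace(-3.0, 3.0, 50))
clean = np.sin(xx) * np.cos(yy)                      # 2D sin(x)cos(y) objective
train = clean + rng.uniform(-0.1, 0.1, clean.shape)  # zero-mean uniform noise
```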
Figure 5: Mean evolution of the effective number of parameters against time (sec.) for NBBR (10%, 20%, 50%, 80%) and GNBR.
Figure 6: Mean of the MSE for the test set, for NBBR (10%, 20%, 50%, 80%), GNBR, and CGF.
Figure 7: Objective functions for testing generalization. (A) 1D sin(x). (B) 1D triangular. (C) 2D sin(x)cos(y). (D) 2D triangular.

Table 2: Generalization Comparison Between NBBR and GNBR.

                                   NBBR Generalization
                             10%     20%     50%     80%     GNBR
Memory usage                 1%      4%      25%     64%     100%
Relative time (one iter.)    1.00    1.11    1.86    3.49    4.80
MSE (sine wave)              0.0143  0.0107  0.0083  0.0665  0.0444
MSE (2D sine-cosine)         0.0003  0.0003  0.0058  0.0562  0.2244
MSE (triangular wave)        0.1041  0.0179  0.0066  0.0033  0.0121
MSE (2D triangular wave)     0.0041  0.0033  0.0141  0.0888  0.1108
After training, the output of the network was compared against noiseless samples to test whether good generalization had been achieved. As in the previous sections, 100 training episodes were considered for each
Figure 8: Generalization results for the 2D-sin(x)cos(y) test case when the time used for training is short. (A) Noisy training data. (B) 10% NBBR. (C) 20% NBBR. (D) 50% NBBR. (E) 80% NBBR. (F) GNBR.
neighborhood size and for GNBR. The mean results for generalization are summarized in Table 2. For the one-dimensional case, it can be observed that no performance is lost for some neighborhood sizes. In the case of the sinusoidal function, all neighborhood sizes but 80% perform better than GNBR. For the triangular wave, the performance of the smallest neighborhood degrades notably, but for the rest, it is similar to or better than that of the GNBR method. In all cases, the savings in memory and computing time are significant, as shown in Table 2.

When the time for training is very short, the advantages of the proposed method become evident. Figure 8 shows the generalization performance attained by the different neighborhood sizes and GNBR when training is stopped after a short period of time. In spite of the noisy training data, the smaller neighborhood sizes succeed in capturing the essential tendency of the function. On the other hand, the largest NBBR neighborhood size and the GNBR method are not able to approach the objective function in the time employed for training.

5 Conclusions

In this work, a neighborhood-based Bayesian regularization algorithm is proposed. This method significantly reduces the memory and computations required by Gauss-Newton Bayesian regularization, while maintaining and even improving its good generalization properties. It is shown experimentally that NBBR can achieve better performance than GNBR even for very small neighborhood sizes. In this way, very large neural networks can be efficiently trained using a fraction of the memory and time GNBR would require, while reaching lower learning errors and better generalization performance.

References

Asirvadam, V. S., McLoone, S. F., & Irwin, G. W. (2002). Parallel and separable recursive Levenberg-Marquardt training algorithm. In Proceedings of the 2002 IEEE Workshop on Neural Networks for Signal Processing (pp. 129–138). Piscataway, NJ: IEEE Press.
Bartlett, P. L. (1997). For valid generalization, the size of the weights is more important than the size of the network. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 134–140). Cambridge, MA: MIT Press.
Foresee, F. D., & Hagan, M. T. (1997). Gauss-Newton approximation to Bayesian regularization. In Proceedings of the IJCNN'97 (Vol. 3, pp. 1930–1935). Piscataway, NJ: IEEE Press.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Comp., 4, 1–58.
Hagan, M. T., & Menhaj, M. (1994). Training feedforward networks with the Marquardt algorithm. IEEE Trans. Neural Networks, 5, 989–993.
Lera, G., & Pinzolas, M. (1998). A quasi-local Levenberg-Marquardt algorithm for neural network training. In Proceedings of the IJCNN'98 (Vol. 3, pp. 2242–2246). Piscataway, NJ: IEEE Press.
Lera, G., & Pinzolas, M. (2002). Neighborhood-based Levenberg-Marquardt algorithm for neural network training. IEEE Trans. Neural Networks, 13(5), 1200–1203.
Levenberg, K. (1944). A method for the solution of certain problems in least squares. Quart. Appl. Math., 2, 164–168.
MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Comp., 4(3), 448–472.
Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. J. Soc. Ind. Appl. Math., 11, 431–441.
Moody, J. E. (1992). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 847–854). Cambridge, MA: MIT Press.
Ngia, L. S. H., & Sjöberg, J. (2000). Efficient training of neural nets for nonlinear adaptive filtering using a recursive Levenberg-Marquardt algorithm. IEEE Trans. on Signal Processing, 48(7), 1915–1926.
Scales, L. E. (1985). Introduction to non-linear optimization. New York: Springer-Verlag.
Toledo, A., Pinzolas, M., Ibarrola, J. J., & Lera, G. (2005). Improvement of the neighborhood-based Levenberg-Marquardt algorithm by local adaptation of the learning coefficient. IEEE Trans. Neural Networks, 16(4), 988–992.
Received December 27, 2004; accepted January 18, 2006.
LETTER

Communicated by William Lytton

Exact Simulation of Integrate-and-Fire Models with Synaptic Conductances

Romain Brette
[email protected]
Département d'Informatique, Équipe Odyssée, École Normale Supérieure, 75230 Paris Cedex 05, France
Computational neuroscience relies heavily on the simulation of large networks of neuron models. There are essentially two simulation strategies: (1) using an approximation method (e.g., Runge-Kutta) with spike times binned to the time step and (2) calculating spike times exactly in an event-driven fashion. In large networks, the computation time of the best algorithm for either strategy scales linearly with the number of synapses, but each strategy has its own assets and constraints: approximation methods can be applied to any model but are inexact; exact simulation avoids numerical artifacts but is limited to simple models. Previous work has focused on improving the accuracy of approximation methods. In this article, we extend the range of models that can be simulated exactly to a more realistic model: an integrate-and-fire model with exponential synaptic conductances.

1 Introduction

There is an increasingly large body of evidence that in neurons, spike timing matters. Neurons have been shown to produce spike trains with submillisecond accuracy in vitro (Mainen & Sejnowski, 1995), a property that is shared by a large class of spiking neuron models (Brette & Guigon, 2003). Functional properties of in vivo neural networks rely on precise synchronization of neural discharges (e.g., in olfaction; Stopfer, Bhagavan, Smith, & Laurent, 1997). Synaptic plasticity depends on the relative timing of presynaptic and postsynaptic spikes (Abbott & Nelson, 2000), which has important functional implications (Song & Abbott, 2001). Therefore, spiking neuron models, in particular, the integrate-and-fire model (Lapicque, 1907; Knight, 1972) and variants, have become increasingly popular in computational neuroscience (Gerstner & Kistler, 2002), which has raised the problem of simulating them in an efficient and accurate way.
Neural Computation 18, 2004–2027 (2006)  © 2006 Massachusetts Institute of Technology

Assuming that only chemical synapses are considered (we will not consider electrical gap junctions in this article), neurons communicate with each other by spikes, which are discrete events. A spiking neuron model describes a transformation from a set of input spike trains into an output spike train. It typically comprises a set of state variables (e.g., the membrane potential) whose evolution is governed by a set of differential equations. Incoming spikes induce discrete changes in the state variables, and outgoing spikes are triggered by a threshold condition. One of the simplest spiking models is the leaky integrate-and-fire model with instantaneous synaptic interactions, which is described by one state variable, the membrane potential V, governed by the equation

τ dV/dt = −(V − V0),   (1.1)
where τ is the membrane time constant and V0 is the resting potential. A spike coming from synapse i at time t induces an instantaneous change in the potential: V → V + wi, where wi is the weight of synapse i. A spike is triggered when V reaches the threshold Vt, at which point V is instantaneously reset to Vr. More complex models include conductances, in particular synaptic conductances, for example,

τ dV/dt = −(V − V0) − Σij wi g(t − tij)(V − Es),   (1.2)

where tij is the time of the jth spike coming from synapse i, Es is the synaptic reversal potential, and g(·) is the unitary conductance triggered by a single spike. Although it would seem that simulating such a model requires storing spike times, biophysical models can usually be reformulated in the form of a spiking model as described above—a set of differential equations with spikes triggering discrete changes in the state variables (Destexhe, Mainen, & Sejnowski, 1994). For example, equation 1.2 with exponential conductances g(s) = exp(−s/τs) can be rewritten as

τ dV/dt = −(V − V0) − g (V − Es),
τs dg/dt = −g,

and each spike coming from synapse i at time tij triggers an instantaneous change in the total synaptic conductance: g → g + wi. With this formulation, storing spike times is not necessary (unless transmission delays are considered). Transformations of this kind can be applied to all reasonable synaptic models (e.g., α-functions), and in many cases, the number of state variables grows with the number of synapse types, not with the number of synapses (Lytton, 1996). There are essentially two strategies to simulate networks of spiking models: (1) integrating the differential equations with an approximation method
(e.g., Euler, Runge-Kutta), advancing the simulation by fixed time steps and communicating the spikes only at the boundaries of the steps; and (2) simulating the network exactly (up to machine precision) in an event-driven way, that is, advancing the simulation from spike to spike and calculating spike times exactly. By “exact,” we mean that results are derived from analytical formulas but are stored with the precision of the floating-point representation of the machine (this is not symbolic computation). There exist hybrid strategies such as independent adaptive time-stepping methods (Lytton & Hines, 2005), which we will not discuss here. Both strategies perform similarly in terms of simulation time, as we will show in the next section, but they are not equivalent. The greatest asset of the approximation strategy is that it can be applied to any model. However, by definition, it is imprecise. It may not seem very relevant at first sight because real neurons are noisy, but numerical error is not equivalent to random noise. For example, Hansel, Mato, Meunier, and Neltner (1998) showed that networks of spiking neurons could fail to display synchronization properties when implemented with a naive Euler or Runge-Kutta algorithm, which led them to improve it with a better handling of the reset (see also Shelley & Tao, 2001). It is generally hard to predict how small errors in spike times are reverberated in recurrent networks, whether they are amplified to create numerical artefacts or remain irrelevant, and this is certainly an important issue for research purposes. Exact simulation obviously avoids this problem and makes results perfectly reproducible, but it applies to a limited range of models—essentially, to models with instantaneous interactions (Mattia & Del Giudice, 2000) or, at best, to linear models with synaptic currents (Makino, 2003). 
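To make strategy 1 concrete, here is a minimal clock-driven (Euler) sketch of a single integrate-and-fire neuron with one exponential synaptic conductance, using the g → g + wi reformulation from section 1. All parameter values are illustrative, not taken from the article:

```python
def simulate_clock_driven(spikes, weights, T=100.0, dt=0.1,
                          tau=20.0, tau_s=5.0, V0=0.0, Es=70.0,
                          Vt=20.0, Vr=0.0):
    """Euler integration of tau dV/dt = -(V - V0) - g (V - Es) and
    tau_s dg/dt = -g; each incoming spike increments g by its weight,
    so no spike times need to be stored.  `spikes` maps a time bin to
    the list of synapse indices firing in that bin."""
    n_steps = int(T / dt)
    V, g = V0, 0.0
    out = []                                  # output spike times (binned)
    for k in range(n_steps):
        for i in spikes.get(k, []):           # propagate incoming spikes
            g += weights[i]                   # g -> g + w_i
        V += dt / tau * (-(V - V0) - g * (V - Es))  # update phase
        g += dt / tau_s * (-g)
        if V > Vt:                            # threshold: emit spike and reset
            out.append(k * dt)
            V = Vr
    return out
```

With no input the neuron stays at rest; with sufficiently strong periodic input it fires repeatedly, and the spike times carry the time-step discretization error discussed above.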
The purpose of this article is to extend exact simulation to more realistic models with noninstantaneous synaptic conductances. We will describe a method to simulate exactly a common spiking model: the leaky integrate-and-fire model with exponential synaptic conductances, in which the membrane potential V is governed by the differential equation

τ dV/dt = −(V − V0) − g+(t)(V − E+) − g−(t)(V − E−),   (1.3)

where g+(·) (resp. g−(·)) is the total excitatory (resp. inhibitory) conductance (relative to the leak conductance) and E+ (resp. E−) is the excitatory (resp. inhibitory) reversal potential. Both conductances are described as sums of spike-triggered exponential conductances,

g(t) = Σij wi exp(−(t − tij)/τs) Θ(t − tij),   (1.4)
where Θ(s) = 1 if s ≥ 0 and Θ(s) = 0 otherwise, and τs is the synaptic time constant. The constraint of the method we present is that both excitatory and inhibitory conductances must have the same synaptic time constant τs. This is an important limitation, but it still significantly advances the boundaries of exact simulation in terms of biophysical realism. In the next section, we analyze the computational costs of the two simulation strategies and show that for large networks, the simulation time grows linearly with the number of synapses. Then we describe our method to simulate exactly integrate-and-fire models with exponential conductances. Finally, we test it in two settings: (1) a single neuron with spike-time-dependent plasticity and excitatory inputs (as in Song, Miller, & Abbott, 2000) and (2) random networks of excitatory and inhibitory neurons (as in Brunel, 2000, but with conductances instead of currents).

2 Computational Complexity of Network Simulations

In order to simplify the analysis, we will ignore the problem of managing transmission delays and consider only simulations on a single CPU. We analyze the time required to simulate 1 second of biological time for a network comprising N neurons, with p synapses per neuron and average firing rate F. Before analyzing the simulation time of each strategy separately, we first note that any algorithm must perform at least F × N × p operations per second of biological time: indeed, the network produces on average F × N spikes per second, and each of these needs to be sent to p target neurons. Therefore, the computational cost of any algorithm is at least linear in the number of synapses.

2.1 Approximation Methods. A step t → t + dt of a typical approximation algorithm consists of two phases:

Update: The state variables of all neurons are updated with an integration method (e.g., Euler, Runge-Kutta), and the neurons that are going to spike are identified by a threshold condition (V > Vt).

Propagation: Spikes are propagated to the targets of these neurons.

Assuming that the number of state variables does not grow with the number of neurons or synapses, which is usually the case (see Lytton, 1996), the cost of the update phase is of order N for each step, that is, O(N/dt) per second of biological time. This component grows with the complexity of the neuron models and the precision of the simulation. Every second (biological time), an average of F × N spikes is produced by the neurons. Each of these needs to be propagated to p target neurons (propagation meaning changing the state variables of the target neurons).
Thus, the propagation phase consists of F × N × p spike propagations per second. These are essentially additions of weights wi to state variables, and thus are simple operations whose cost does not grow with the complexity of the models. Summing up, the total computational cost per second of biological time is of the order

Update + Propagation ≈ cU × N/dt + cP × F × N × p,

where cU is the cost of one update and cP is the cost of one spike propagation; typically, cU is much higher than cP, but this is implementation dependent. Therefore, for very dense networks, the total is dominated by the propagation phase and is linear in the number of synapses, which is optimal. In practice, however, the first phase is negligible only when the following condition is met:

(cP / cU) × F × p × dt ≫ 1.   (∗)

For example, the average firing rate in the cortex may be as low as F = 1 Hz (Olshausen & Field, 2005), and assuming p = 10,000 synapses per neuron and dt = 0.1 ms, we get F × p × dt = 1. In this case, considering that each operation in the update phase is heavier than in the propagation phase (especially for complex models), that is, cP < cU, the update phase is likely to dominate the total computational cost. Thus, it appears that even in networks with realistic connectivity, increases in precision (smaller dt) can be detrimental to the efficiency of the simulation. One way around this problem is to use higher-order integration methods, allowing larger time steps for comparable levels of precision. Several authors (Hansel et al., 1998; Shelley & Tao, 2001) have noticed that for integrate-and-fire models, the discontinuity in the state variables at spike time annihilates the increase in precision associated with higher-order integration methods (e.g., second-order Runge-Kutta), but they have proposed a correction of the reset that solves this problem. Considering that the additional cost is negligible (of order F × N), this correction should always be used in simulations. We note, however, that it imposes smoothness constraints on the models, which implies higher values of cU.
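The arithmetic of this example is worth making explicit:

```python
# Numbers from the example: F = 1 Hz, p = 10,000 synapses, dt = 0.1 ms.
# F * p * dt is of order 1, so with cP < cU condition (*) fails and the
# update phase is expected to dominate the simulation time.
F = 1.0        # average firing rate (Hz)
p = 10_000     # synapses per neuron
dt = 0.1e-3    # time step (s)
ratio = F * p * dt
print(ratio)
```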
networks (see, e.g., Ziegler, Praehofer, & Kim, 2000). We can describe a step of a typical exact event-driven algorithm as follows:

1. Determine the next event, that is, which neuron is going to spike next.
2. Update the state variables of this neuron.
3. Propagate the spike, that is, update the state variables of the target neurons.

In order to determine the next event, one needs to maintain an ordered list of future events. This list contains the time of the next spike for every neuron. Spike times are conditional on the fact that no spike is received in the meantime, but causality implies that the earliest spike time in the list is always valid. In terms of data structure, this list is a priority queue. For every event, we need to extract the highest-priority item and insert or modify p items in this queue. There exist data structures and algorithms that can implement these two operations in constant (O(1)) time, for example, calendar queues (Brown, 1988). These are similar to the ring buffers used in Morrison, Mehring, Geisel, Aertsen, and Diesmann (2005), except there is no fixed time step (schematically, events are stored in a calendar queue as in an agenda, but the duration of days can change). We can subdivide the computational cost of handling one event as follows:

- Updating the neuron and its targets: p + 1 updates (the cost is modulated by model complexity)
- Updating the spike times of p + 1 neurons (again, the cost is modulated by model complexity)
- Extracting the highest-priority event and inserting p events in the queue: p + 1 operations (the cost depends on the implementation of the priority queue)

Since there are F × N spikes per second of biological time, the computational cost is approximately proportional to F × N × p. The total computational cost per second of biological time can be written concisely as

Update + Spike + Queue ≈ (cU + cS + cQ) × F × N × p,

where cU is the cost of one update of the state variables, cS is the cost of calculating the time of the next spike, and cQ is the cost of inserting an event in the priority queue. Thus, the simulation time is linear in the number of synapses, which is optimal. Nevertheless, we note that the operations involved are heavier than in the propagation phase of approximation methods (see the previous section); therefore, the multiplicative factor is likely
to be larger. However, in the worst case, exact simulation is slower than approximate simulation only by a constant multiplicative factor, and it can outperform it in cases when condition (∗) is not met. Therefore, it seems advisable to use exact simulation when possible. Several exact simulation engines have been developed specifically for spiking neural networks that can handle propagation delays and complex network structures (Mattia & Del Giudice, 2000; Rochel & Martinez, 2003). Considerable effort has been devoted to the efficient management of event queues (Lee & Farhat, 2001; Mattia & Del Giudice, 2000; Connolly, Marian, & Reilly, 2003), and event-handling algorithms can be made parallel (see, e.g., Grassmann & Anlauf, 1998). In principle, general clock-driven simulation engines such as NEST (Morrison et al., 2005) can also handle exact simulation, provided there is a minimum transmission delay (larger than the time step), although they are probably less efficient than dedicated event-driven simulation engines (but one may appreciate the greater expressivity). To simulate a particular neuron model exactly, one needs to provide the simulator with three functions:
- A function that updates the neuron following an incoming spike
- A function that updates the neuron following an outgoing spike (reset)
- A function that gives the time of the next spike (possibly +∞ if there is none), given the values of the state variables
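The three functions above can be sketched as a small interface together with a toy event loop. This is an illustration, not the API of any of the simulators cited here; all names (`ExactNeuron`, `simulate`, the `Coincidence` toy model) are hypothetical.

```python
import math
from abc import ABC, abstractmethod

class ExactNeuron(ABC):
    """The three functions an event-driven simulator needs from a model."""

    @abstractmethod
    def receive_spike(self, t, w):
        """Update the state variables for an incoming spike of weight w at time t."""

    @abstractmethod
    def reset(self, t):
        """Update the state variables after an outgoing spike at time t."""

    @abstractmethod
    def next_spike_time(self, t):
        """Time of the next outgoing spike given the current state (math.inf if none)."""

def simulate(neuron, input_spikes, t_end):
    """Advance one neuron event by event, interleaving sorted input spikes
    with the model's own predicted threshold crossings. (A network engine
    would use a priority queue of events instead of a sorted list.)"""
    inputs = sorted(input_spikes)              # (time, weight) pairs
    out, i, t = [], 0, 0.0
    while True:
        t_spike = neuron.next_spike_time(t)
        t_in = inputs[i][0] if i < len(inputs) else math.inf
        if t_spike <= min(t_in, t_end):        # outgoing spike comes first
            out.append(t_spike)
            neuron.reset(t_spike)
            t = t_spike
        elif t_in <= t_end:                    # incoming spike comes first
            neuron.receive_spike(t_in, inputs[i][1])
            i += 1
            t = t_in
        else:
            return out

class Coincidence(ExactNeuron):
    """Toy model: fires 0.5 time units after the second input since the last reset."""
    def __init__(self):
        self.n, self.t_fire = 0, math.inf
    def receive_spike(self, t, w):
        self.n += 1
        if self.n == 2:
            self.t_fire = t + 0.5
    def reset(self, t):
        self.n, self.t_fire = 0, math.inf
    def next_spike_time(self, t):
        return self.t_fire
```

The loop re-asks the model for its next spike time after every event, which is exactly why predicted spike times must be cheap to compute (or cheap to rule out, as in the spike test of section 3.3).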
So far, algorithms have been developed for simple pulse-coupled integrate-and-fire models (Claverol, Brown, & Chad, 2002; Delorme & Thorpe, 2003) and more complex ones such as some instances of the spike response model (Makino, 2003; Marian, Reilly, & Mackey, 2002; Gerstner & Kistler, 2002), but there is none for more realistic models with synaptic conductances, which are often used in computational neuroscience.

3 Exact Calculations for the IF Model with Conductances

In this section we describe the functions required to simulate an integrate-and-fire model with exponential synaptic conductances, as defined by equations 1.3 and 1.4. First, we reformulate the model into a standard spiking neuron model with two dynamic state variables. Then we calculate the solution of this pair of differential equations given the initial state. Finally, we describe a method to calculate the time of the next spike, with a quick test to check whether it is +∞ (i.e., no spike).

3.1 Rewriting the Model. We start by expressing equations 1.3 and 1.4 as a spiking model as described in section 1: a set of autonomous differential equations with incoming spikes triggering instantaneous changes in the state variables. First, we express time in units of the membrane time
Exact Simulation of Integrate-and-Fire Models
2011
constant τ, with t = 0 as the origin of time, and voltage relative to the resting potential V0, that is, we assume V0 = 0. We can rewrite equation 1.3 as follows:

dV/dt = −V + (g+(t) + g−(t))(Es(t) − V),    (3.1)
where Es(t) is the effective synaptic reversal potential defined as

Es(t) = (g+(t) E+ + g−(t) E−) / (g+(t) + g−(t)).

In intervals with no spike, Es(t) is constant. Indeed, g+(·) and g−(·) both satisfy

τs dg+/dt = −g+ and τs dg−/dt = −g−,

and it follows, assuming no spike in [0, t],

Es(t) = (g+(0) e^(−t/τs) E+ + g−(0) e^(−t/τs) E−) / (g+(0) e^(−t/τs) + g−(0) e^(−t/τs))
      = (g+(0) E+ + g−(0) E−) / (g+(0) + g−(0)).

We define the total synaptic conductance as g(t) = g+(t) + g−(t). Then we can rewrite the model as a system of two differential equations,

dV/dt = −V + g(Es − V),    (3.2a)
τs dg/dt = −g,             (3.2b)
and both g and Es are instantaneously modified when a spike is received. The reset conditions follow from simple calculations:

- When a spike is received from an inhibitory synapse with weight w > 0:
  Es → (g Es + w E−) / (g + w)
  g → g + w
  (note that the order is important)
- When a spike is received from an excitatory synapse with weight w > 0:
  Es → (g Es + w E+) / (g + w)
  g → g + w
- When V reaches the threshold Vt: V → Vr

Thus, we have reformulated the integrate-and-fire model with exponential synaptic conductances as a two-variable spiking model (i.e., an autonomous system of two differential equations with incoming spikes triggering instantaneous resets). We can sum up the first two reset conditions in a single condition by using signed weights (i.e., w < 0 for spikes coming from inhibitory synapses) as follows:

Es → (g Es + αw + β|w|) / (g + |w|)
g → g + |w|

with α = (E+ − E−)/2 and β = (E+ + E−)/2.

3.2 Solution of the Differential Equation. In equation 3.2a, let us express V as a function of g:

dV/dg = τs (1 + 1/g) V − τs Es.

It follows that

d/dg [V exp(−τs (g + log g))] = −τs Es exp(−τs (g + log g)).

Integrating between g(0) and g(t), we get:

V(t) e^(−τs g(t)) g(t)^(−τs) − V(0) e^(−τs g(0)) g(0)^(−τs) = −τs Es ∫ from g(0) to g(t) of g^(−τs) e^(−τs g) dg
  = −τs^τs Es ∫ from τs g(0) to τs g(t) of h^(−τs) e^(−h) dh
  = −τs^τs Es (γ(1 − τs, τs g(t)) − γ(1 − τs, τs g(0))),
where

γ(a, x) = ∫ from 0 to x of e^(−t) t^(a−1) dt

is the nonnormalized incomplete gamma integral. The incomplete gamma integral has fast computation algorithms, which are implemented in most numerical libraries. In the following, we will use the algorithm from Press, Flannery, Teukolsky, and Vetterling (1993). Recalling that g(t) = g(0) e^(−t/τs), we get:

g(0)^(−τs) exp(t − τs g(0) e^(−t/τs)) V(t) = V(0) e^(−τs g(0)) g(0)^(−τs) − τs^τs Es (γ(1 − τs, τs g(t)) − γ(1 − τs, τs g(0))).

The incomplete gamma integral can be developed as either a power series,

γ(a, x) = e^(−x) x^a Σ from n=0 to ∞ of Γ(a) x^n / Γ(a + 1 + n),    (3.3)

where

Γ(a) = ∫ from 0 to ∞ of e^(−t) t^(a−1) dt

is the gamma integral, or a continued fraction,

γ(a, x) = Γ(a) − e^(−x) x^a [1 / (x + 1 − a − 1·(1 − a) / (x + 3 − a − 2·(2 − a) / (x + 5 − a − · · ·)))].    (3.4)

In both expressions, the factor e^(−x) x^a appears. Let us define ρ(a, x) = e^x x^(−a) γ(a, x). Then we can rewrite:

τs^τs Es γ(1 − τs, τs g(0)) = τs Es g(0) ρ(1 − τs, τs g(0)) e^(−τs g(0)) g(0)^(−τs)
τs^τs Es γ(1 − τs, τs g(t)) = τs Es g(t) ρ(1 − τs, τs g(t)) g(0)^(−τs) exp(t − τs g(0) e^(−t/τs)).

Thus we get:

e^(t − τs g(t)) (V(t) + τs Es g(t) ρ(1 − τs, τs g(t))) = e^(−τs g(0)) (V(0) + τs Es g(0) ρ(1 − τs, τs g(0))).
Finally, we get the following expression for V(t):

V(t) = −τs Es g(t) ρ(1 − τs, τs g(t)) + exp(−t + τs (g(t) − g(0))) (V(0) + τs Es g(0) ρ(1 − τs, τs g(0))).    (3.5)
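As an illustration (a Python sketch, not the author's released C implementation), equation 3.5 can be evaluated with the power series of equation 3.3. The helper `rho` computes ρ(a, x) = e^x x^(−a) γ(a, x); the sketch assumes τs < 1 (τs expressed in units of the membrane time constant), so that a = 1 − τs > 0.

```python
import math

def rho(a, x, tol=1e-15):
    """rho(a, x) = e^x x^(-a) gamma(a, x), via the power series of equation 3.3:
    rho(a, x) = sum over n >= 0 of x^n / (a (a+1) ... (a+n)).
    (For large x, the continued fraction of equation 3.4 is more efficient.)"""
    term = 1.0 / a
    total = term
    ap = a
    while abs(term) > tol * abs(total):
        ap += 1.0
        term *= x / ap
        total += term
    return total

def V_exact(t, V0, g0, Es, tau_s):
    """Equation 3.5: membrane potential at time t, assuming no spike is
    received in [0, t]; time is in units of the membrane time constant."""
    gt = g0 * math.exp(-t / tau_s)              # conductance decay (3.2b)
    a = 1.0 - tau_s
    return (-tau_s * Es * gt * rho(a, tau_s * gt)
            + math.exp(-t + tau_s * (gt - g0))
            * (V0 + tau_s * Es * g0 * rho(a, tau_s * g0)))
```

At t = 0 the two ρ terms cancel and `V_exact` returns V0 exactly; at any later t, the value agrees with a fine numerical integration of equation 3.2a. The table lookup and linear interpolation of Table 1 are omitted here.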
Thus, we can calculate the value of the solution at any time very efficiently. It turns out that the power series, equation 3.3, converges rapidly for x < a + 1, while the continued fraction, equation 3.4, is more efficient for x > a + 1. Although at first sight it may seem that using an infinite series or continued fraction makes the calculation very heavy, it is in fact not fundamentally different from calculating an exponential function. Only the first terms are needed to achieve an accuracy close to the precision of the floating-point representation.

It is possible to use precalculated tables for the exponential function and incomplete gamma integral to calculate expression 3.5. If linear interpolation is used, one can obtain almost exact results. We tested the accuracy of formula 3.5 calculated with tables with linear interpolation for the exponential function and the ρ function (more precisely, the function g → τs g × ρ(1 − τs, τs g)). The results are shown in Table 1: with conductances tabulated with precision 10^−4 (in units of the leak conductance), the relative accuracy for the membrane potential is of order 10^−9, 1000 times better than with a second-order Runge-Kutta (RK) method with dt = 0.01 ms (10 million times better than RK with dt = 1 ms; note that time steps lower than 0.1 ms are likely to slow the simulation; see the previous section). Thus, it is certainly acceptable to use precalculated tables to speed up the simulation while keeping the calculations almost exact. The speed-up factor for computing V(t) was about 10 (compared to using the series expansion 3.3 with relative precision 10^−15).

3.3 Spike Test.
In many situations, the time of the next spike for a given neuron (ignoring future incoming spikes) will be +∞ most of the time, because the distribution of the membrane potential in cortical networks usually displays an approximately gaussian distribution centered far from the threshold (see, e.g., Destexhe, Rudolph, Fellous, & Sejnowski, 2001). This phenomenon appears clearly in the simulations (see the next section). Here we will describe an algorithm that can quickly test whether the neuron is going to spike. We assume that E s > Vt ; otherwise, there can be no spike. The dynamics of the differential system, equations 3.2a and 3.2b, and therefore whether the membrane potential reaches the threshold Vt after some time, is completely determined by the initial condition (V(0), g(0)). If the model spikes for an initial condition (V0 , g0 ), then it does so for any initial condition (V1 , g0 ) such that V0 < V1 . Thus, there is a minimal potential V ∗ (g0 ) above which
Table 1: Accuracy of the Calculation of V(t) with Precalculated Tables (Using Equation 3.5) and with the Second-Order Runge-Kutta Method, for t = τs and t = 5τs (τs = 3 ms, τ = 20 ms).

                Table 10^−3         Table 10^−4          RK 1 ms             RK 0.01 ms
t = τs
  Accuracy      1.9 ± 1.2 × 10^−7   1.9 ± 1.2 × 10^−9    1.9 ± 1.2 × 10^−2   1.6 ± 1 × 10^−6
  Speed (Hz)    8.7 × 10^6          8.8 × 10^6           5.9 × 10^6          8.2 × 10^4
t = 5τs
  Accuracy      8.5 ± 5.2 × 10^−8   8.5 ± 5.2 × 10^−10   1.2 ± 0.6 × 10^−2   1 ± 0.5 × 10^−6
  Speed (Hz)    8.8 × 10^6          8.8 × 10^6           1.5 × 10^6          1.6 × 10^4

Notes: The initial potential V(0) is picked at random from a uniform distribution between 0 and 1 (normalized potential); the initial conductance is also picked from a uniform distribution, between 0.5 and 5.0. The mean and standard deviation are calculated from 10^5 iterations. We chose τs = 3 ms and τ = 20 ms. We used two levels of precision for the tables: for the first column, 10^−3 for conductance (in units of the conductance leak) and 10^−3 for time (in units of the membrane time constant); for the second column, 10^−4 for conductance and 10^−4 for time. The true value for V(t) was calculated using the series formula for the incomplete gamma integral, with relative precision 10^−15. For the Runge-Kutta method, we used time steps 1 ms and 0.01 ms. The speed is indicated as the number of operations per second. Individual update operations were about two to three times faster for the Runge-Kutta method than for the quasi-exact method with tables.
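The tabulation with linear interpolation can be sketched as follows (illustrative Python; the grid step and range below are arbitrary choices for demonstration, not the settings used for Table 1):

```python
import math

def rho_series(a, x, tol=1e-15):
    # power series of equation 3.3 for rho(a, x) = e^x x^(-a) gamma(a, x)
    term = total = 1.0 / a
    ap = a
    while abs(term) > tol * abs(total):
        ap += 1.0
        term *= x / ap
        total += term
    return total

def build_table(tau_s, g_max, dg):
    """Tabulate f(g) = tau_s * g * rho(1 - tau_s, tau_s * g) on a regular grid
    covering [0, g_max] (one extra entry so interpolation at g_max is safe)."""
    n = int(g_max / dg) + 2
    return [tau_s * (i * dg) * rho_series(1.0 - tau_s, tau_s * i * dg)
            for i in range(n)]

def lookup(table, dg, g):
    """Linear interpolation in the precalculated table."""
    i = int(g / dg)
    frac = g / dg - i
    return table[i] * (1.0 - frac) + table[i + 1] * frac
```

With a grid step of 10^−4 in conductance, the interpolation error on this smooth function is far below the 10^−9 relative accuracy reported in Table 1, at the cost of one multiply-add per lookup instead of a series evaluation.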
Figure 1: Computation of spike timings. (A) Minimal potential for spiking V ∗ as a function of the synaptic conductance g (see text). The solution V(g) is below V ∗ ; hence, it never hits the threshold (no spike). We used Vt = 1, E s = 5, τs = 1/4, g(0) = 1, V(0) = 0.2. (The time goes from right to left, since g decreases with time.) (B) Summary of the spike test algorithm. (C) Number of iterations in spike timing computation as a function of accuracy. The initial potential V(0) is picked at random from a uniform distribution between 0 and 1 (normalized potential); the initial conductance is also picked from a uniform distribution, between 2.5 and 5 (we chose values high enough to guarantee spiking). The mean and variance are calculated from 105 iterations (the dashed lines represent mean ± standard deviation). Accuracy is calculated on the potential axis; if t is the computed spike time and ε is the accuracy, V(t) is within ε of the threshold.
the neuron spikes. The set of points C∗ = {(V∗(g), g) : g ≥ 0} defines the minimum spiking curve (see Figure 1A); that is, the neuron spikes if and only if its state (V, g) is above this curve. We need to calculate this curve. Consider the trajectory in the phase space of a solution V(g) starting on C∗ from (V∗(g0), g0). By definition, all solutions starting from (V0, g0) with V0 > V∗(g0) hit the threshold, and all solutions starting from (V0, g0) with V0 < V∗(g0) do not. Because the phase space is two-dimensional, trajectories cannot cross. It follows that
any trajectory that hits the threshold must be above the trajectory of V(g) at all times, and conversely. Therefore, the trajectory of V(g) is precisely the minimum spiking curve C∗. Besides, this trajectory must be tangential to the threshold V = Vt; otherwise, there would be a trajectory below it that hits the threshold. Therefore, the minimal potential V∗(g) is the solution of the following differential equation,

dV∗/dg = τs (1 + 1/g) V∗ − τs Es,

such that dV∗/dg = 0 at threshold, that is, with g∗ being the conductance at threshold:

0 = (1 + 1/g∗) Vt − Es
g∗ = 1 / ((Es/Vt) − 1).
Note that in Figure 1A, solutions travel from right to left in the phase space. Therefore, for g(0) < g∗, there can be no spike in the future (the solution hits threshold at a negative time), so that g∗ is the minimal conductance for spiking. To test whether the trajectory of the membrane potential is above or below the trajectory of the minimal potential, we only need to compare V(t) at the time when g(t) = g∗ with the threshold Vt. We can calculate this value using equation 3.5:

V(g∗) = −τs Es g∗ ρ(1 − τs, τs g∗) + (g∗/g(0))^τs exp(τs (g∗ − g(0))) (V(0) + τs Es g(0) ρ(1 − τs, τs g(0))).    (3.6)

The neuron spikes if and only if V(g∗) > Vt. In the worst case, the algorithm takes about the same time as updating the membrane potential. In many cases, the simple checks Es > Vt and g > g∗ are sufficient most of the time (for example, 86% of the time for the random networks simulated in the next section). The algorithm is summed up in Figure 1B.
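The spike test can be sketched as follows (an illustrative Python version, not the released C code; `_rho` re-implements the power series of equation 3.3):

```python
import math

def _rho(a, x, tol=1e-15):
    # power series of equation 3.3 for rho(a, x) = e^x x^(-a) gamma(a, x)
    term = total = 1.0 / a
    ap = a
    while abs(term) > tol * abs(total):
        ap += 1.0
        term *= x / ap
        total += term
    return total

def will_spike(V0, g0, Es, Vt, tau_s):
    """Spike test of Figure 1B: True iff the trajectory starting from
    (V0, g0) reaches the threshold Vt, ignoring future incoming spikes."""
    if Es <= Vt:
        return False                          # quick test: reversal below threshold
    g_star = 1.0 / (Es / Vt - 1.0)            # minimal conductance for spiking
    if g0 <= g_star:
        return False                          # quick test: g already below g*
    # full test: potential when g has decayed to g*, equation 3.6
    V_star = (-tau_s * Es * g_star * _rho(1.0 - tau_s, tau_s * g_star)
              + (g_star / g0) ** tau_s * math.exp(tau_s * (g_star - g0))
              * (V0 + tau_s * Es * g0 * _rho(1.0 - tau_s, tau_s * g0)))
    return V_star > Vt
```

With the parameters of Figure 1A (Vt = 1, Es = 5, τs = 1/4, g(0) = 1, V(0) = 0.2), g∗ = 0.25 and V(g∗) ≈ 0.82 < Vt, so the test correctly reports no spike; raising V(0) above roughly 0.51 makes the trajectory cross the threshold.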
part, the trajectory is concave (differentiate equation 3.2a); therefore, the Newton-Raphson method converges to the first crossing, which is the correct one. At every iteration of the Newton-Raphson method, we need to evaluate the membrane potential at the current approximation of the spike timing. Note that only one evaluation of the incomplete gamma integral is then necessary, since the one for time 0 is the same for all iterations. Numerical results show that this method converges very quickly in the present case. Figure 1C shows that extremely high accuracy is reached after just four iterations. Considering that in many situations (see the next section with simulation results) spike timings are computed for only a few percent of all events, computing spike times is expected to have a small impact on the performance of the simulation.

3.5 Summary: Three Functions for Exact Simulation. In section 2.2, we indicated that a simulator needs three functions to simulate a model exactly (recall that since we are not doing symbolic calculus, "exact" should be understood as several orders of magnitude more accurate than approximation methods; we discuss this matter further in section 5). Here we sum up our results and describe these functions. The state of a neuron is determined by three variables: the membrane potential V, the total synaptic conductance g, and the effective synaptic reversal potential Es. The first two variables evolve continuously between spikes according to a system of differential equations (3.2a and 3.2b). Variable V is reset when a spike is produced (threshold condition V > Vt), and variables g and Es are updated when a spike is received.

3.5.1 Update After an Incoming Spike. When a spike is received with signed weight w (i.e., w < 0 when the synapse is inhibitory) at time t, the following operations are executed, given that tl is the time of the last update:

1. V → V(t − tl) using formula 3.5 with V(0) = V and g(0) = g.
2. g → g × exp(−(t − tl)/τs) (note that the calculation has actually been done when updating V).
3. Es → (g Es + αw + β|w|) / (g + |w|) with α = (E+ − E−)/2 and β = (E+ + E−)/2.
4. g → g + |w|.

3.5.2 Update After an Outgoing Spike. When a spike is produced at time t, the following operations are executed, given that tl is the time of the last update:

1. V → Vr.
2. g → g × exp(−(t − tl)/τs).
3.5.3 Spike Timing. To calculate the time of the next spike, we first check whether this is +∞ according to the algorithm summed up in Figure 1B. If it is finite (i.e., if the neuron is going to spike), then spike timing is computed by the Newton-Raphson method (about four iterations of formula 3.5), which is guaranteed to converge to the correct value.

4 Simulations

Models were simulated on a standard desktop PC with a general event-driven simulator called MVASpike (developed by Olivier Rochel and freely available online at http://www.comp.leeds.ac.uk/olivierr/mvaspike/). The C code for the functions discussed in this article is available on the author's Web page (http://www.di.ens.fr/∼brette/papers/Brette2005NC.htm). To give an idea of typical simulation times, we provide some figures for the coefficients in the formulas of section 2 (complexity). In our simulations, the cost cU of one update of the membrane potential was approximately:

cU (exact) ≈ 1 µs (series expansion, precision 10^−15)
cU (table) ≈ 0.1 µs (precalculated tables)
cU (RK2) ≈ 0.04 µs (second-order Runge-Kutta)

Calculating the timing of the next spike costs at most cS ≈ cU ≈ 0.1 µs if the spike test is negative (which occurred most of the time in our simulations; see below), and about cS ≈ 3cU to 4cU ≈ 0.3 to 0.4 µs (with tables) if the spike test is positive. The cost of one queue insertion cQ depends on the structure of the network (and obviously on the implementation of priority queues) and can be fairly high when the network has heterogeneous transmission delays. For example, for the random network described in section 4.2, and with the software we used for event management, cQ ≈ 4 to 5 µs, so that the program spends most of its time managing events and delays. In this case, the problem of dealing with transmission delays is the same with an approximation method (cP ≈ cQ).
When delays are homogeneous, there can be a drastic reduction in computation time because, for a given outgoing spike, the p corresponding events are inserted at the same place in the queue.

4.1 Single Neuron with STDP

4.1.1 The Model. We simulated the scenario depicted in Song and Abbott (2001, Fig. 1B), consisting of a single integrate-and-fire model receiving excitatory inputs from 1000 synapses (with exponential conductances) with spike-timing-dependent plasticity (see Figure 2A). Each
Figure 2: Simulation of an integrate-and-fire model with spike-time-dependent plasticity. (A) Architecture of the model (see details in the text). (B) Statistics of the outcome of the spike test (cf. Figure 1B). (C) Distribution of the weights after 30 minutes of simulation. White: without precalculated tables; black: with tables (precision 10^−4); stripes: with the original algorithm but one initial weight flipped (see the text). (D) Black: weights obtained after 30 seconds with precalculated tables versus weights obtained without tables. Gray: weights obtained after 30 seconds without tables but with one initial weight flipped versus weights obtained without tables. (E) As in D but after 30 minutes.
synaptic input consists of spike trains modeled as Poisson processes with constant rate 10 Hz. The parameters of the integrate-and-fire model are V0 = −74 mV, Vt = −54 mV, Vr = −60 mV, and τ = 20 ms. Synaptic conductances are initially distributed uniformly in [0, g], with g = 0.015 (in units of the leak conductance). The synaptic time constant is τs = 5 ms, and the reversal potential is Es = E+ = 0 mV. Conductances evolve slowly with spike-time-dependent plasticity as described in Song and Abbott (2001): a pair of presynaptic and postsynaptic spikes occurring at times tpre and tpost triggers a synaptic change equal to

A+ exp(−(tpost − tpre)/τ+)    if tpre < tpost
−A− exp(−(tpre − tpost)/τ−)   if tpost < tpre
with τ+ = τ− = 20 ms, A+ = 0.005 × g, and A− = A+ × 1.05. All synaptic changes combine linearly, but synaptic conductances must lie in the interval [0, g]. This learning rule has been shown to produce synaptic competition and stabilization of firing rates (Song & Abbott, 2001; Kempter, Gerstner, & van Hemmen, 2001; Cateau & Fukai, 2003). We ran the model for 30 minutes of biological time (which took about 10 minutes of CPU time).
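This pair-based rule can be sketched as follows (illustrative Python; `stdp_dw` and `clip_weight` are hypothetical helper names, not part of the simulation code described above):

```python
import math

def stdp_dw(t_pre, t_post, A_plus, A_minus, tau_plus, tau_minus):
    """Synaptic change for one pre/post spike pair (Song & Abbott, 2001)."""
    if t_pre < t_post:
        return A_plus * math.exp(-(t_post - t_pre) / tau_plus)    # potentiation
    return -A_minus * math.exp(-(t_pre - t_post) / tau_minus)     # depression

def clip_weight(w, g_max):
    """Changes combine linearly, but conductances must stay in [0, g_max]."""
    return min(g_max, max(0.0, w))
```

With the parameter values above (g = 0.015, A+ = 0.005 × g, A− = 1.05 × A+, τ+ = τ− = 20 ms), a pre spike 20 ms before a post spike adds A+ e^(−1) to the weight, and the slight excess of depression over potentiation (the factor 1.05) drives the competition that pushes weights toward the bounds.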
About 8% of the weights were flipped from one side to the other side of the bimodal distribution. However, 30 minutes is a very long time from the point of view of the neuron: it corresponds to 90,000 membrane time constants and 18 million received spikes, so the observed discrepancy might be due only to the fact that the model, seen as a dynamical system, is chaotic or at least unstable. In order to test this hypothesis, we tested the effect of changing the initial conditions very slightly in the original algorithm (without tables). We flipped one of the initial weights from 0 to g and ran the simulation again with the same seed for the random number generator, so that the timings of all incoming spikes are unchanged. It turns out that after 30 seconds, the weights differ from the original ones even more than the weights computed previously using tables (Pearson correlation c = 0.9912 versus c = 0.9996; see Figure 2D). After 30 minutes, the weights differed significantly from the original ones, in the same way as with tables (Pearson
correlation c = 0.84 versus c = 0.88; see Figure 2E), but the distribution of the weights remained almost unchanged (see Figure 2C). Interestingly, after 30 minutes, the correlation between the weights obtained with the original algorithm and with precalculated tables was unaffected by the precision of the tables (c = 0.87 to 0.88 for precisions 10^−3, 10^−4, and 10^−5). Moreover, the correlation between weights obtained with different precisions for the tables was equivalent to the correlation with the weights obtained with the original algorithm (c = 0.85). We conclude that the long-term outcome of the simulation is very sensitive to initial conditions, and increasing the precision of the tables does not seem to make the weights converge. This indicates that the system is chaotic, which implies that only statistical measures are meaningful. We note that the distribution of weights is very well reproduced when using precalculated tables, and in the initial phase of the simulation (at least the first 30 seconds), the individual weights follow the original ones more faithfully than if just one initial weight (out of 1000) is modified.
4.2 Random Network

4.2.1 The Model. We simulated a network of excitatory and inhibitory neurons receiving external excitatory inputs and connected randomly (see Figure 3A), as described in Brunel (2000), but with synaptic conductances instead of currents. Neuron models had the same parameter values as for the spike-timing-dependent plasticity (STDP) model, but there were two types of synapses: excitatory (reversal potential E+ = 0 mV and weight w+ = 0.12 in units of the leak conductance) and inhibitory (E− = −80 mV, w− = 1.2). Transmission delays were included in the model; they were picked at random from a uniform distribution in the interval 0.1–1 ms. Each neuron received excitatory spike trains, modeled as Poisson processes with rate 500 Hz. We ran simulations lasting 1 second with eight different configurations for the numbers of excitatory neurons (4000–10,000) and inhibitory neurons (1000–3000) and the connection probability (0.02–0.1). The number of synapses per neuron ranged from 150 to 520 (not including external synapses); the average firing rates ranged from 2 Hz to 30 Hz. A sample of the spike trains produced is shown in Figure 3C.

4.2.2 Profiling. As for the STDP model, we analyzed the outcome of the spike test (see Figure 3B). On average, spike timing had to be calculated for only 4% of the events. Thus, again, the calculation of spike timing (the heaviest operation) has a small influence on the computational complexity of the simulation. Besides, in 86% of the cases, only the quick spike test was executed. Thus, most of the simulation time was spent in updating the state variables and handling event queues.
Figure 3: Simulation of random networks of excitatory and inhibitory neurons. (A) Architecture of the model (see details in the text). (B) Statistics of the outcome of the spike test (cf. Figure 1B). (C) Sample of the spike trains produced by the network in 500 ms. (D) Simulation time as a function of the total number of events (spike transmissions) for eight different configurations of the network.
4.2.3 Scaling of the Simulation Time. We wanted to check that the simulation time scales with the number of events handled, that is, is proportional to F × N × p × T (cf. section 2.2), where F is the average firing rate, N is the number of neurons, p is the number of synapses per neuron, and T is the time of the simulation. Figure 3D shows that for the eight configurations we simulated, the relationship between the simulation time and the total number of events is very well fitted by a line, as expected. 5 Discussion We have shown that exact simulation of spiking neural networks, in an event-driven way, is in fact not restricted to simple pulse-coupled integrateand-fire models. We have presented a method to simulate integrate-and-fire models with exponential synaptic conductances, which is a popular model in computational neuroscience. The simulation time scales linearly with the total number of events (number of spike transmissions), as for algorithms
based on integration methods (Euler, Runge-Kutta) or, in fact, for optimal algorithms. For medium-sized networks, exact simulation may even be faster than approximation methods (this is, however, implementation dependent). It may be argued that what we call "exact" simulation means only "more precise than integration methods," because the calculations involve series for which we consider only a finite number of terms and spike timings are calculated approximately by the Newton-Raphson method. In principle, it is true that calculations are not exact: after all, in any case, they cannot be more precise than the floating-point representation in the machine. However, it is relevant to distinguish these so-called exact methods from approximation methods with time steps for two essential reasons:

i. The error made in neglecting the tail of a series is not of the same order as the error made with Euler or Runge-Kutta methods. The former decreases exponentially with the number of iterations, while the latter decreases only polynomially. Therefore, the level of precision that can be reached with the former method is extremely high, incomparably higher than with integration methods. Such high-precision approximations are done routinely when using exponential or logarithmic functions in computer programs. The Newton-Raphson method that we use to calculate spike times also converges exponentially.

ii. In approximation methods, other types of errors are induced by the fact that spike times are constrained to the boundaries of time steps. For example, the order of spikes within one time step is lost; these spikes are glued together. This might cause problems when investigating synchronization properties or spike-timing-dependent plasticity. These errors do not arise when using exact, event-driven methods.

One important difficulty with event-driven simulations is to introduce noise in the models.
It has been shown in vitro that the response of cortical neurons to realistic currents injected at the soma is reproducible (Mainen & Sejnowski, 1995); therefore, a large part of the noise would come from synapses, from either the input spike trains or transmission failures. Therefore, one reasonable way to introduce noise in an event-driven simulation is to add random input spikes to the neuron models (as in Song & Abbott, 2001) or to introduce random failures when inserting a spike in an event queue. Finally, an important limitation of the algorithm we have presented is that the time constants of excitatory and inhibitory synapses must be identical. We acknowledge that this is restrictive, but it still represents significant progress in the biophysical realism of exactly simulable models. We were able to find analytical expressions for the membrane potential and for the spike test because the model could be reduced to a differential system with only two variables. If the time constants of excitatory and inhibitory
synapses are different, this is no longer possible. We do not know whether this is definitely intractable or whether other tricks can be found. One possible track of investigation might be to use series expansions of the solutions of the differential equations. Finally, we mention two other directions in which efforts should be made to extend the range of models that can be simulated exactly: (1) integrate-and-fire models with soft threshold (e.g., quadratic, Ermentrout & Kopell, 1986; and exponential, Fourcaud-Trocme, Hansel, van Vreeswijk, & Brunel, 2003), which are more realistic than the leaky integrate-and-fire model, and (2) two-dimensional integrate-and-fire models, which can account for adaptation (Liu & Wang, 2001) and resonance (Richardson, Brunel, & Hakim, 2003).

Acknowledgments

I thank Olivier Rochel and Michelle Rudolph for fruitful discussions. This work was partially supported by the EC IP project FP6-015879, FACETS.

References

Abbott, L. F., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nat. Neurosci., 3 (Suppl), 1178.
Brette, R., & Guigon, E. (2003). Reliability of spike timing is a general property of spiking model neurons. Neural Comput., 15(2), 279–308.
Brown, R. (1988). Calendar queues: A fast O(1) priority queue implementation for the simulation event set problem. Commun. ACM, 31(10), 1220–1227.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8(3), 183–208.
Cateau, H., & Fukai, T. (2003). A stochastic method to predict the consequence of arbitrary forms of spike-timing-dependent plasticity. Neural Comput., 15(3), 597–620.
Claverol, E., Brown, A., & Chad, J. (2002). Discrete simulation of large aggregates of neurons. Neurocomputing, 47, 277–297.
Connolly, C., Marian, I., & Reilly, R. (2003, Aug. 28–30). Approaches to efficient simulation with spiking neural networks.
Paper presented at the Eighth Neural Computation and Psychology Workshop, University of Kent, U.K. Delorme, A., & Thorpe, S. J. (2003). Spikenet: An event-driven simulation package for modelling large networks of spiking neurons. Network, 14(4), 613–627. Destexhe, A., Mainen, Z., & Sejnowski, T. (1994). An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Comput., 6(1), 14–18. Destexhe, A., Rudolph, M., Fellous, J. M., & Sejnowski, T. J. (2001). Fluctuating synaptic conductances recreate in vivo–like activity in neocortical neurons. Neuroscience, 107(1), 13–24. Ermentrout, B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. Siam J. Appl. Math., 46(2), 233–253.
2026
R. Brette
Fourcaud-Trocme, N., Hansel, D., van Vreeswijk, C., & Brunel, N. (2003). How spike generation mechanisms determine the neuronal response to fluctuating inputs. J. Neurosci., 23(37), 11628–11640. Gerstner, W., & Kistler, W. M. (2002). Spiking neuron models. Cambridge: Cambridge University Press. Grassmann, C., & Anlauf, J. (1998). Distributed, event driven simulation of spiking neural networks. Proceedings of the International ICSC/IFAC Symposium on Neural Computation (NC'98) (pp. 100–105). Canada: ICSC Academic Press. Hansel, D., Mato, G., Meunier, C., & Neltner, L. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Comput., 10(2), 467–483. Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Comput., 13(12), 2709–2741. Knight, B. W. (1972). Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59(6), 734–766. Lapicque, L. (1907). Recherches quantitatives sur l'excitation électrique des nerfs traitée comme une polarisation. J. Physiol. Pathol. Gen., 9, 620–635. Lee, G., & Farhat, N. H. (2001). The double queue method: A numerical method for integrate-and-fire neuron networks. Neural Netw., 14(6–7), 921–932. Liu, Y. H., & Wang, X. J. (2001). Spike-frequency adaptation of a generalized leaky integrate-and-fire model neuron. J. Comput. Neurosci., 10(1), 25–45. Lytton, W. W. (1996). Optimizing synaptic conductance calculation for network simulations. Neural Comput., 8(3), 501–509. Lytton, W. W., & Hines, M. L. (2005). Independent variable time-step integration of individual neurons for network simulations. Neural Comput., 17(4), 903–921. Mainen, Z., & Sejnowski, T. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506. Makino, T. (2003). A discrete-event neural network simulator for general neuron models. Neural Comput. and Applic., 11, 210–223. Marian, I., Reilly, R., & Mackey, D. (2002). 
Efficient event-driven simulation of spiking neural networks. In Proceedings of the 3rd WSEAS International Conference on Neural Networks and Applications. Interlaken, Switzerland. Mattia, M., & Del Giudice, P. (2000). Efficient event-driven simulation of large networks of spiking neurons and dynamical synapses. Neural Comput., 12(10), 2305– 2329. Morrison, A., Mehring, C., Geisel, T., Aertsen, A., & Diesmann, M. (2005). Advancing the boundaries of high connectivity network simulation with distributed computing. Neural Comput., 17, 1776–1801. Olshausen, B. A., & Field, D. J. (2005). How close are we to understanding V1? Neural Comput., 17(8), 1665–1699. Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1993). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press. Richardson, M. J., Brunel, N., & Hakim, V. (2003). From subthreshold to firing-rate resonance. J. Neurophysiol., 89(5), 2538–2554. Rochel, O., & Martinez, D. (2003). An event-driven framework for the simulation of networks of spiking neurons. In Proc. 11th European Symposium on Artificial Neural Networks (pp. 295–300). Bruges, Belgium.
Shelley, M. J., & Tao, L. (2001). Efficient and accurate time-stepping schemes for integrate-and-fire neuronal networks. J. Comput. Neurosci., 11(2), 111–119. Song, S., & Abbott, L. (2001). Cortical development and remapping through spike timing–dependent plasticity. Neuron, 32, 339–350. Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat. Neurosci., 3(9), 919–926. Stopfer, M., Bhagavan, S., Smith, B. H., & Laurent, G. (1997). Impaired odour discrimination on desynchronization of odour-encoding neural assemblies. Nature, 390(6655), 70–74. Zeigler, B. P., Praehofer, H., & Kim, T. G. (2000). Theory of modeling and simulation: Integrating discrete event and continuous complex dynamic systems (2nd ed.). Orlando, FL: Academic Press.
Received July 15, 2005; accepted January 18, 2006.
NOTE
Communicated by Peter Rowat
Bursting Without Slow Kinetics: A Role for a Small World? Jie Shao [email protected]
Tzu-Hsin Tsao [email protected]
Robert Butera [email protected] Laboratory for Neuroengineering, Georgia Institute of Technology, Atlanta, GA 30332-0535, U.S.A.
Bursting, a dynamical phenomenon whereby episodes of neural action potentials are punctuated by periodic episodes of inactivity, is ubiquitous in neural systems. Examples include components of the respiratory rhythm generating circuitry in the brain stem, spontaneous activity in the neonatal rat spinal cord, and developing neural networks in the retina of the immature ferret. Bursting can also manifest itself in single neurons. Bursting dynamics require one or more kinetic processes slower than the timescale of the action potentials. Such processes usually manifest themselves in intrinsic ion channel properties, such as slow voltage-dependent gating or calcium-dependent processes, or synaptic mechanisms, such as synaptic depression. In this note, we show rhythmic bursting in a simulated neural network where no such slow processes exist at the cellular or synaptic level. Rather, the existence of rhythmic bursting is critically dependent on the connectivity of the network and manifests itself only when connectivity is characterized as small world. The slow process underlying the timescale of bursting manifests itself as a progressive synchronization of the network within each burst. In one class of rhythmic bursting networks, the connectivity is predominantly mediated by excitatory synapses. Examples include rhythmic bursting in transverse brain stem slices (Smith, Ellenberger, Ballanyi, Richter, & Feldman, 1991) and rhythmic activity in disinhibited embryonic spinal cords (Streit, 1993; O’Donovan, 1989). In the former case, burst initiation is attributed to some combination of intrinsic cellular properties and/or recurrent excitatory coupling, and burst termination is due to the slower kinetics of ionic currents intrinsic to the component neurons. In the latter example, initiation is due to recurrent excitation, and termination is attributed to the slower kinetics associated with synaptic depression (Tabak, Senn, O’Donovan, & Rinzel, 2000). 
Neural Computation 18, 2029–2035 (2006)
© 2006 Massachusetts Institute of Technology
Previous modeling studies (Netoff, Clewley, Aron, Keck, & White, 2004) demonstrate a dependency of network activity on network topology. It has been shown that regular network firing activity intensifies and eventually transitions into bursting as network topology changes from regular lattice to small world (SW), and eventually to random connectivity. In a regular lattice topology, each neuron is connected to a fixed number k of its nearest neighbors. SW topology is obtained from the regular lattice by reconnecting each synaptic connection to a randomly chosen postsynaptic neuron with a reconnection probability p. In a random network topology, synaptic connections are randomly assigned to pairs of neurons. Our study presents a new hypothesis: neuronal network topology alone is sufficient to support network-wide bursting activity. We focused on a neuronal network consisting of Morris-Lecar (ML) model neurons instead of other canonical model neurons such as Hodgkin-Huxley (HH). The ML model is distinct from HH in two ways. First, the ML model has no absolute refractory period. Second, the response of the ML model to input during an action potential (AP) is similar to that of bursting neurons during the active phase. Specifically, an AP in the ML model, once generated, can be modulated in shape and duration by successive excitatory stimuli to the neuron model. Figure 1 illustrates our primary results. As p is increased from 0 to 1, the connectivity of the network changes from local to SW to random (see Figure 1a). In a regular network (p = 0), the pacemaker node (cell numbers 0–4 and 507–511) initiates a wave of activity that progresses around both sides of the ring and terminates at the opposite side (see Figure 1b1) when the two waves collide. 
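The lattice, SW, and random regimes described above can be sketched with a Watts-Strogatz-style construction. The following Python fragment is our own illustration (the function names are assumptions, not part of the authors' NEURON scripts): it builds a directed ring lattice with N = 512 and k = 10 and redirects each connection to a random postsynaptic target with probability p.

```python
import random

def ring_lattice(n, k):
    """Directed ring lattice: each neuron projects to its k nearest
    neighbors (k/2 on each side)."""
    edges = []
    for i in range(n):
        for j in range(1, k // 2 + 1):
            edges.append((i, (i + j) % n))
            edges.append((i, (i - j) % n))
    return edges

def rewire(edges, n, p, rng=random):
    """Watts-Strogatz-style rewiring: with probability p, redirect each
    connection to a randomly chosen postsynaptic neuron (no self-loops)."""
    rewired = []
    for pre, post in edges:
        if rng.random() < p:
            new_post = rng.randrange(n)
            while new_post == pre:
                new_post = rng.randrange(n)
            rewired.append((pre, new_post))
        else:
            rewired.append((pre, post))
    return rewired

edges = ring_lattice(512, 10)            # N = 512, k = 10, as in Figure 1
sw_edges = rewire(edges, 512, 0.025)     # small-world regime
random_edges = rewire(edges, 512, 1.0)   # fully random regime
```

At p = 0 the lattice is unchanged; at p = 1 every postsynaptic target is random, matching the continuum illustrated in Figure 1a.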
In the SW regime, a different dynamic emerges (see Figure 1b2): after the initial wave triggered by the pacemaker node, successive waves from this node propagate with a slightly increased velocity. New wave sources appear due to the presence of long-range connections within the network. The network becomes increasingly active until the entire episode of activity ultimately ceases. The burst of activity lasts several times longer than the firing period of the pacemaker node. As p is increased in the SW regime, the period of the bursting rhythm decreases (see Figure 1b3). In the random regime (see Figure 1b4), the entire network fires in near synchrony with the pacemaking node. Figure 1a (triangles) illustrates the dominant frequency of the network as a function of p. The period progressively decreases in the SW regime, and the transition from the regular rhythm to the bursting (SW) rhythm is distinct. The period of the regular network is identical to that of the isolated cells comprising the pacemaking node, and the period of the random network is slightly (1–3%) faster than the pacemaking node due to recurrent excitatory input speeding up the pacing cells.
Figure 1: Small-world connectivity leads to bursting in a network of ML models. Simulations were performed in NEURON (Hines & Carnevale, 1997, 2000) using nominal parameter values for the ML model (Rinzel & Ermentrout, 1989) scaled to a whole-cell capacitance based on an arbitrary soma compartment surface area. The 10 pacemaker node neurons had Istim set to 2.5 µA/cm2. For all simulations, N = 512, k = 10, synaptic delay = 1.5 ms, and synaptic weight = 0.1 µS. (a) Normalized clustering coefficient (circles), mean path length (squares), and dominant frequency (triangles). Error bars indicate mean, minimum, and maximum values of the associated measures across 10 sets of simulations for each value of p and 4 classes of initial conditions for each simulation. (b1–b4) Typical network activity at particular values of p; plots are raster plots indicating cell firing. Values of p are 0, 0.025, 0.079, and 0.794, respectively.
These simulations were run for four classes of initial conditions and for 10 randomly generated connectivity sets for each value of p. The general results just described were robust for neighborhood sizes k from 6 to 20 (and larger, though fewer simulations were run). As long as a small-world region still existed in the clustering-coefficient and mean-path-length plots (see Figure 1a), rhythms persisted. The rhythms also persisted when we used parameter sets corresponding to both class I (saddle node) and class II (Andronov-Hopf) excitability (Rinzel & Ermentrout, 1989). Rhythms also persisted when moderate parameter heterogeneity was introduced to the network by randomly assigning Istim to each neuron according to a normal distribution with a standard deviation of 0.5 µA/cm2. Synaptic delay was an important component in the robustness of this phenomenon. The simulations in Figure 1 were repeated for synaptic delays of 0 to 3 ms in 0.75 ms increments. Small-world bursting occurred consistently for delays of 0.75 and 1.5 ms. Such rhythms did not occur at all in the absence of a synaptic delay: rather, a transition from propagating to near-synchronous firing occurred at a specific value of p. For longer delays, the range of p of the bursting regime became smaller, with bursting at lower values of p replaced by a self-sustained excited state (Roxin, Riecke, & Solla, 2004; Netoff et al., 2004; Lago-Fernandez, Huerta, Corbacho, & Siguenza, 2000; Lago-Fernandez, Corbacho, & Huerta, 2001). Within each burst, the network synchronization level varies within a range and eventually increases to a local maximum. This phenomenon is demonstrated in Figure 2, where we consider the variation in the size of a time window preceding a specific time point during which 85% of the population fires at least once. 
By the definition of this time window, its size can be considered a measure of population-wide synchrony; a smaller time window implies a more synchronized network state, while a larger time window implies a less synchronized network state. This measure was relatively insensitive to our choice of threshold for claiming that population-wide activity had occurred; similar results were obtained with values from 70% to 95%. In a separate set of simulations investigating the effect of network synchronization on network activity, a stimulus sufficient to generate a single AP was administered at a randomly generated time to each neuron during a prespecified time window while poststimulus activity was monitored. This random delivery of stimuli was drawn from a uniform distribution and applied to the network initially in a quiescent state. In effect, we impose an initial synchrony condition on the network and examine the likelihood of poststimulus activity. The range of time-window sizes for which the probability of poststimulus activity equals one is 28 to 44 ms (this range corresponds to the dashed lines in Figure 2), which roughly corresponds to the lower and upper bounds on time-window size observed in the main simulations. Note that Figure 2c was computed when the network was in a quiescent state, not entirely analogous to the state of population bursting in Figures 2a1 and 2a2.
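The windowed synchrony measure described above can be stated compactly in code. This sketch is our own illustration (function name and data layout are assumptions, not the authors' code): the synchrony at time t is quantified by the smallest window w such that at least 85% of the population fired within [t − w, t].

```python
import math

def sync_window(spike_times, n_neurons, t, frac=0.85):
    """Smallest w such that at least `frac` of the population fired at
    least once in [t - w, t]; None if that fraction has not yet spiked.
    spike_times: per-neuron lists of spike times (ms), sorted ascending."""
    ages = []
    for spikes in spike_times:
        recent = [s for s in spikes if s <= t]
        if recent:
            ages.append(t - recent[-1])  # time since this neuron's last spike
    need = math.ceil(frac * n_neurons)
    if len(ages) < need:
        return None
    ages.sort()
    return ages[need - 1]  # window just wide enough to cover `need` neurons
```

A smaller returned window means a more synchronized state; None signals that the required fraction of the population has not yet fired by time t.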
Figure 2: Typical progressive synchronization during a burst and associated probabilities of bursting given initial synchrony conditions. Synchronization measures (a1/b1) for associated simulations (a2/b2) for p values of 0.0316 and 0.0631, respectively. Synchronization at each given time is measured by the time window preceding the reference point during which 85% of the neuron population fires at least once. (c) Probability of a self-sustained burst from equilibrium conditions given an initial level of synchronized network-wide firing (closed and open circles represent p = 0.0316 and 0.0631, respectively). Dashed lines at 28 and 44 ms are also shown in panels a1/a2 for reference.
However, the same general trend is evident: bursting ceases when the network activity becomes too synchronous, and synchrony increases over the latter half of each burst. To summarize briefly, burst initiation is due to the initial firings of the pacemaking node and recurrent excitation within the network, and burst termination can be attributed to
changing levels of activity synchrony in the network beyond the range where a self-sustained burst is likely. These results do not occur with integrate-and-fire or Hodgkin-Huxley models. Our results uniquely depend on the two input-output dynamics of the ML model stated earlier that distinguish ML model neurons from other model neurons such as HH or integrate and fire. Compared with the topology-dependent activity variations shown in other studies, the more distinct transition of network activity to bursting at lower values of p can be attributed to our smaller simulation population and relatively higher connectivity (k = 10, about 1.9% of N). The two-step transition (from normal firing activity, to intense firing characterized as "seizing," and then to bursting) described in Netoff et al. (2004) did become more prominent as we expanded our simulation population to 3000 while the rest of the simulation setup remained the same (results not shown). There are many neural systems where SW connectivity is likely to be present, and we speculate that the mechanisms here may foster slower neural rhythms with timescales that currently cannot be accounted for by the time constants of voltage-gated, calcium-activated, or synaptic-kinetic mechanisms. The input-output properties of the ML model are not esoteric; if one interprets the ML model as a generic measure of neural activity, the response of the ML model to input during the active phase is remarkably similar to the response of endogenously bursting neurons. We speculate that the input-output properties of endogenously bursting neurons may endow excitatory networks with the ability to maintain rhythms much slower than the bursting kinetics of the isolated neurons. 
Network rhythms in a transverse brain stem slice containing the pre-Bötzinger complex have been observed to burst with periods as slow as minutes, a timescale that cannot be accounted for by the kinetic properties of the bursting neurons within the transverse slice (C. A. DelNegro, personal communication, April 2005). Similarly, in the developing spinal cord, bursts of neural activity can also be described as bursts of bursts (Tabak, Rinzel, & O'Donovan, 2001). Acknowledgments This work was supported by the National Institutes of Health (R01MH62057). We thank Tay Netoff, John White, and Hermann Riecke for helpful comments on their related work. References Hines, M. L., & Carnevale, N. T. (1997). The NEURON simulation environment. Neural Computation, 9, 1179–1209. Hines, M. L., & Carnevale, N. T. (2000). Expanding NEURON's repertoire of mechanisms with NMODL. Neural Computation, 12, 995–1007.
Lago-Fernandez, L. F., Corbacho, F. J., & Huerta, R. (2001). Connection topology dependence of synchronization of neural assemblies on class 1 and 2 excitability. Neural Networks, 14, 687–696. Lago-Fernandez, L. F., Huerta, R., Corbacho, F. J., & Siguenza, J. A. (2000). Fast response and temporal coherent oscillations in small-world networks. Physical Review Letters, 84, 2758–2761. Netoff, T. I., Clewley, R., Aron, S., Keck, T., & White, J. A. (2004). Epilepsy in small world networks. Journal of Neuroscience, 24, 8075–8083. O'Donovan, M. J. (1989). Motor activity in the isolated spinal cord of the chick embryo: Synaptic drive and firing pattern of single motoneurons. Journal of Neuroscience, 9, 943–958. Rinzel, J., & Ermentrout, B. (1989). Methods in neuronal modeling. Cambridge, MA: MIT Press. Roxin, A., Riecke, H., & Solla, S. A. (2004). Self-sustained activity in a small-world network of excitable neurons. Physical Review Letters, 92, 198101. Smith, J. C., Ellenberger, H. H., Ballanyi, K., Richter, D. W., & Feldman, J. L. (1991). Pre-Bötzinger complex: A brainstem region that may generate respiratory rhythm in mammals. Science, 254, 726–729. Streit, J. (1993). Regular oscillations of synaptic activity in spinal networks in vitro. Journal of Neurophysiology, 70, 871–878. Tabak, J., Rinzel, J., & O'Donovan, M. J. (2001). The role of activity-dependent network depression in the expression and self-regulation of spontaneous activity in the developing spinal cord. Journal of Neuroscience, 21, 8966–8978. Tabak, J., Senn, W., O'Donovan, M. J., & Rinzel, J. (2000). Modeling of spontaneous activity in developing spinal cord using activity-dependent depression in an excitatory network. Journal of Neuroscience, 20, 3041–3056.
Received June 1, 2005; accepted February 17, 2006.
LETTER
Communicated by Jose C. Principe
Error Entropy in Classification Problems: A Univariate Data Analysis Luís M. Silva [email protected]
Carlos S. Felgueiras [email protected] Instituto de Engenharia Biomédica, Laboratório Sinal e Imagem Biomédica, 4200-465, Porto, Portugal
Luís A. Alexandre [email protected] Departamento de Informática, Universidade da Beira Interior, Covilhã, Portugal, and Instituto de Telecomunicações, Networks and Multimedia Group, Covilhã, Portugal
J. Marques de Sá [email protected] Instituto de Engenharia Biomédica, Laboratório Sinal e Imagem Biomédica, 4200-465, Porto, Portugal, and Faculdade de Engenharia da Universidade do Porto, Departamento de Engenharia Electrotécnica e Computadores, 4200-465, Porto, Portugal
Entropy-based cost functions are attracting growing interest in unsupervised and supervised classification tasks. Better performance in terms of both error rate and convergence speed has been reported. In this letter, we study the principle of error entropy minimization (EEM) from a theoretical point of view. We use Shannon's entropy and study univariate data splitting in two-class problems. In this setting, the error variable is a discrete random variable, which keeps the mathematical analysis of the error entropy relatively simple. We start by showing that for uniformly distributed data, there is equivalence between the EEM split and the optimal classifier. In a more general setting, we prove the necessary conditions for this equivalence and show the existence of class configurations where the optimal classifier corresponds to maximum error entropy. The presented theoretical results provide practical guidelines that are illustrated with a set of experiments with both real and simulated data sets, in which the effectiveness of EEM is compared with the usual mean square error minimization.
Neural Computation 18, 2036–2061 (2006)
© 2006 Massachusetts Institute of Technology
1 Introduction Entropy and the related concepts of mutual information and Kullback-Leibler divergence have been used in learning systems (supervised or unsupervised) in several ways. The principle of minimum cross-entropy enunciated by Kullback (1959) was introduced as a powerful tool to build complete probability distributions when only partial knowledge is available. The maximization of mutual information between the input and output of a neural network (the Infomax principle) was introduced by Linsker (1988) as an unsupervised method that can be applied, for example, to feature extraction. Recently Principe, Xu, and Fisher (2000) proposed new approaches to the application of entropic criteria to learning systems. In particular, they proposed the minimization of Rényi's quadratic entropy of the error for regression, time series prediction, and feature extraction tasks (Erdogmus & Principe, 2000, 2002). The principle is as follows. Given an adaptive system (e.g., a neural network) with output variable Y and a target variable T, the error variable is measured as the difference between the target and the output of the system, E = T − Y. The minimization of error entropy implies a reduction in the expected information contained in the error, which leads to the maximization of the mutual information between the desired target and the system output (Erdogmus & Principe, 2000). This means that the classifier is learning the target variable. Entropy-based cost functions, as functions of the probability density functions, reflect the global behavior of the error distribution; therefore, learning systems with entropic cost functions are expected to outperform those that use the popular mean square error (MSE) rule, which reflects only the second-order statistics of the error. In this letter, we are concerned with the criterion of error entropy minimization (EEM) between the output of a classifier and the desired target. 
Santos, Alexandre, and Marques de Sá (2004) and Santos, Marques de Sá, Alexandre, and Sereno (2004) applied the EEM rule using Rényi's quadratic entropy to classification tasks, obtaining better results than with the MSE rule. Silva, Marques de Sá, and Alexandre (2005) have also proposed the use of Shannon's entropy with the EEM principle; the results were again better than those obtained with MSE. Despite the evidence provided by these experimental results, which suggests that EEM is an interesting alternative to the MSE principle, very little is known about the theoretical properties of EEM when applied to data classification, in terms of convergence to the optimal classifier and whether the Bayes error is attainable. This letter is meant as a contribution to the theoretical study of the EEM principle, using Shannon's entropy, in classification tasks. We analyze the case of univariate data splitting in two-class problems. We will use Shannon's formula (Shannon, 1948) for the entropy
of a discrete random variable X, H_X, taking N values with probabilities p_i:

H_X = −∑_{i=1}^{N} p_i log p_i.   (1.1)
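Equation 1.1 is straightforward to evaluate numerically. Here is a minimal Python sketch (our own illustration, using the natural logarithm and the convention 0 log 0 = 0):

```python
import math

def shannon_entropy(probs):
    """Equation 1.1: H_X = -sum_i p_i log p_i, with 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

For two outcomes, shannon_entropy([0.5, 0.5]) attains the maximum log 2, while a degenerate distribution gives 0.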
Despite the simplicity of the univariate data-splitting model, this analysis will provide interesting insights and practical guidelines for the error entropy minimization rule. The organization of the letter is as follows. In section 2 we introduce the univariate data-splitting problem. In section 3 we analyze univariate EEM splits in the case of uniformly distributed data and show their convergence to the optimal classifier. In section 4 we present a more general analysis of univariate EEM splits and show the existence of situations where the optimal classifier corresponds to maximum error entropy. In section 5 we illustrate the presented theoretical results with simulated and real data. Finally, in section 6, we draw some conclusions and discuss future work. 2 The Univariate Split Problem Let us consider the two-class classification problem with class-conditional distributions given by F_t(x) = P(X ∈ ]−∞, x] | T = t), t ∈ {−1, 1}, where X and T are univariate input and target random variables, respectively, and f_t(x) the corresponding probability density functions (pdf). The simplest possible linear discrimination rule corresponds to a classifier output, y, given by

y = g(x) = y′ if x ≤ x′, −y′ if x > x′,   (2.1)
where x′ is a data-splitting threshold and y′ ∈ {−1, 1} is a class label. The theoretical optimal rule corresponds to a split point x∗ and class label y∗ such that

(x∗, y∗) = arg min_{(x′, y′)} P(g(X) ≠ T)   (2.2)

with minimum probability of error, P∗, given by

P∗ = inf_{x′} { I_{y′=−1} ( p F_1(x′) + q (1 − F_{−1}(x′)) ) + I_{y′=1} ( p (1 − F_1(x′)) + q F_{−1}(x′) ) },   (2.3)

where p = P(T = 1) and q = P(T = −1) (the class priors). In equation 2.3, the first term inside braces corresponds to the situation where P∗ is reached when y′ = −1 is at the left of x′; the second term corresponds to swapping
the class labels. A split given by (x∗, y∗) is called a theoretical Stoller split (for details, see Devroye, Györfi, & Lugosi, 1996). We define the error variable E = T − Y as the difference between the target and the classifier's output and notice that E ∈ {−2, 0, 2}.¹ What does it mean to minimize the error entropy in this situation? Does it also lead to the optimal solution for the class of linear threshold decision rules represented by equation 2.1? As we are dealing with a discrete random variable, entropy is a concave function of the p_i in equation 1.1 (Kapur, 1993), where each p_i corresponds to the probability of E taking one of the values {−2, 0, 2}. These are precisely the probabilities of error P_t for each class t ∈ {−1, 1} and the probability of correct classification 1 − ∑_t P_t. Denoting F_t(x′) simply as F_t and considering from now on, without loss of generality, that y′ = −1, one has

P_{−1} = P(E = −2) = q (1 − F_{−1})
P_1 = P(E = 2) = p F_1
1 − P_{−1} − P_1 = P(E = 0) = q F_{−1} + p (1 − F_1).   (2.4)

Thus, the discrete entropy is

H_E = −[ P_{−1} log P_{−1} + P_1 log P_1 + (1 − P_{−1} − P_1) log(1 − P_{−1} − P_1) ].   (2.5)

In the following sections, we study the behavior of equation 2.5 as we vary x′. Section 3 is devoted to the case of uniform distributions, and the following sections consider the situation where the data distributions can be described in terms of continuous class-conditional density functions, where the following applies: Theorem 1. For continuous class-conditional density functions f_{−1} and f_1, the Stoller split occurs either at an intersection of q f_{−1} with p f_1 or at +∞ or −∞. Proof.
See section B.1.
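Equations 2.4 and 2.5 translate directly into code. The sketch below is our own illustration (function names are assumptions): given the two class-conditional CDFs and the prior p, it returns H_E for the split at threshold x′ with label y′ = −1 on the left.

```python
import math

def _plogp(p):
    # convention: 0 log 0 = 0
    return p * math.log(p) if p > 0 else 0.0

def split_error_entropy(F_neg, F_pos, x, p=0.5):
    """H_E for the split at threshold x (label y' = -1 on the left).
    F_neg, F_pos: CDFs of classes t = -1 and t = 1; p = P(T = 1)."""
    q = 1.0 - p
    P_neg = q * (1.0 - F_neg(x))   # P(E = -2), eq. 2.4
    P_pos = p * F_pos(x)           # P(E =  2), eq. 2.4
    P_ok = 1.0 - P_neg - P_pos     # P(E =  0)
    return -(_plogp(P_neg) + _plogp(P_pos) + _plogp(P_ok))  # eq. 2.5
```

For uniform classes on [0, 1] and [0.7, 1.7] with p = 1/2, the entropy at the overlap edge x′ = 0.7 is lower than at the overlap midpoint, consistent with the behavior analyzed in section 3.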
3 EEM Splits for Uniform Distributions Let us consider that the two classes have univariate uniform distributions,

f_{−1}(x) = (1/(b − a)) I_{[a,b]}(x),   f_1(x) = (1/(d − c)) I_{[c,d]}(x),   (3.1)

¹ E = −2 and E = 2 mean a misclassification for class t = −1 and t = 1, respectively; E = 0 means correct classification.
Figure 1: Schematic drawing of the simple problem of setting x′ to classify two overlapped uniform classes.
where I_{[·]}(x) is the indicator function. We first assume that the classes overlap, such that a < c ≤ b < d. Figure 1 depicts this situation in terms of the density functions f_t(x). For this problem, making use of the formulas in equation 2.4, it is straightforward to compute H_E as in equation 2.5 for x′ varying over the real line. Indeed, one has

H_E(x′) = −[ q log q + p log p ],  x′ < a;

H_E(x′) = −[ (q(b − x′)/(b − a)) log(q(b − x′)/(b − a)) + (q(x′ − a)/(b − a) + p) log(q(x′ − a)/(b − a) + p) ],  x′ ∈ [a, c[;

H_E(x′) = −[ (q(b − x′)/(b − a)) log(q(b − x′)/(b − a)) + (p(x′ − c)/(d − c)) log(p(x′ − c)/(d − c)) + (q(x′ − a)/(b − a) + p(d − x′)/(d − c)) log(q(x′ − a)/(b − a) + p(d − x′)/(d − c)) ],  x′ ∈ [c, b[;

H_E(x′) = −[ (p(x′ − c)/(d − c)) log(p(x′ − c)/(d − c)) + (q + p(d − x′)/(d − c)) log(q + p(d − x′)/(d − c)) ],  x′ ∈ [b, d[;

H_E(x′) = −[ p log p + q log q ],  x′ ≥ d,   (3.2)

with the convention 0 log 0 = 0.
Figure 2 (dashed line) shows some examples for p = 1/2, [a, b] = [0, 1], and different values of c and d. First, one can see that although within each interval of x′ (corresponding to the different cases above) H_E is a concave function, as a whole H_E is not concave. Second, whenever the overlap is nondegenerate (all panels of Figure 2 except 2c), we have two local minima located at the extremes of the
Figure 2: Shannon entropy (dashed line) and probability of error (solid line) plotted as functions of x′. Panels: (a) [c, d] = [0.7, 1.7]; (b) [c, d] = [0.7, 3]; (c) [c, d] = [1, 1.7]; (d) [c, d] = [0.2, 1.7].
overlapped regions. A local maximum (global in some cases, as in Figure 2d), say at x_0, is located within the overlapped region. If we have equal support for the two distributions (and equal priors), entropy is perfectly symmetric about x_0, and this is exactly the midpoint of the overlapped region (see Figure 2a). In the other cases, we have a local and a global minimum, and x_0 is deviated toward the former. Let us now determine the probability of error P for this example. Making use of the above expressions, we have

P(x′) = P_{−1}(x′) + P_1(x′) =
  q,  x′ < a;
  q (b − x′)/(b − a),  x′ ∈ [a, c[;
  q (b − x′)/(b − a) + p (x′ − c)/(d − c),  x′ ∈ [c, b[;
  p (x′ − c)/(d − c),  x′ ∈ [b, d[;
  p,  x′ ≥ d.   (3.3)
Figure 3: (a) Contour lines of H_E with a general path, P_path, produced by P. (b) P and H_E plotted as functions of x′ for the path P_path in (a).
Figure 2 (solid line) plots P as a function of x′ for the same values of a, b, c, and d. One can see that the global minimum of the error entropy corresponds to the theoretical Stoller split. In fact, for this problem, it also corresponds to the optimal decision in the Bayes sense. If we take the special case where b − a = d − c (see Figure 2a), using the minimum probability of error criterion, we may locate x∗ anywhere in [c, b]; for entropy, it is preferable to choose either x∗ = c or x∗ = b. The reason is that the choice x∗ ∈ ]c, b[ increases the uncertainty or instability of the system. At c or b, E takes only two values of {−2, 0, 2}; otherwise, E can assume every value in that set, which implies an increase in entropy. In other words, entropy prefers to classify one class correctly and leave all the errors to the other one. Figure 2 can be easily reproduced for unequal priors, where the general behavior is the same. In fact, we can show that: Theorem 2. Suppose we have two overlapped uniform distributions as in equation 3.1 such that a < c ≤ b < d. H_E and P have the same global minimum. Proof. Consider the P_{−1} × P_1 plane. First, notice that a probability path, P_path, produced by P as in equation 3.3 is always composed of three linear segments: two along the axes connected by the remaining one (in some situations degenerate to a point). Second, notice that H_E, as a function of the probabilities, is concave and symmetric about the vertical plane P_{−1} = P_1. Therefore, the global minimum of H_E always coincides with the global minimum of the probability of error. The demonstration is illustrated in Figure 3a, where the contour lines of H_E are plotted as functions of P_{−1} and P_1. The solid line represents P_path (Figure 3b plots P and H_E as functions of x′ for this path), and the dashed lines are contours of equal probability.
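Theorem 2 can also be checked numerically. The following sketch is our own verification (not from the letter): it scans x′ over a grid for the configuration of Figure 2b and locates the global minima of P and H_E, which coincide at the right edge of the overlap, x′ = b = 1.

```python
import math

def uniform_cdf(lo, hi):
    return lambda x: min(max((x - lo) / (hi - lo), 0.0), 1.0)

def stats(x, F_neg, F_pos, p=0.5):
    """(P(x'), H_E(x')) per equations 3.3 and 3.2, with y' = -1 on the left."""
    q = 1.0 - p
    P_neg, P_pos = q * (1.0 - F_neg(x)), p * F_pos(x)
    probs = [P_neg, P_pos, 1.0 - P_neg - P_pos]
    H = -sum(t * math.log(t) for t in probs if t > 0)
    return P_neg + P_pos, H

# Figure 2b configuration: [a, b] = [0, 1], [c, d] = [0.7, 3], p = q = 1/2
F_neg, F_pos = uniform_cdf(0.0, 1.0), uniform_cdf(0.7, 3.0)
grid = [i / 1000.0 for i in range(-500, 3501)]
x_min_P = min(grid, key=lambda x: stats(x, F_neg, F_pos)[0])
x_min_H = min(grid, key=lambda x: stats(x, F_neg, F_pos)[1])
```

Both argmins land at x′ = 1, as theorem 2 asserts for overlapped uniforms with unequal support.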
Error Entropy in Classification Problems
2043
Figure 4: Stoller split problem for two univariate continuous distributions.
When we have separable classes, it is obvious that we should set x∗ anywhere in ]b, c[. The minimum entropy value (HE = 0) also occurs in that interval because P(E = 0) = 1. Again we are led to the minimum probability of error.

4 EEM Splits for Mutually Symmetric Distributions

4.1 Critical Points of the Error Entropy. Suppose the two classes Ct, t ∈ {−1, 1}, are represented by arbitrary continuous pdf's, ft(x). We define the center at of a distribution as its median. Let us consider, without loss of generality, that class C1 is centered at 0 and that the center of class C−1 lies in the nonpositive part of the real line. Figure 4 depicts this setting.

Theorem 3. In the univariate two-class problem, the Stoller split x∗ is a critical point of the error entropy if the error probabilities of each class at x∗ are equal.

Proof.
From formula 2.5, one derives

dH/dx = q f−1 log(P−1) − (q f−1 − p f1) log(1 − P−1 − P1) − p f1 log(P1).   (4.1)
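Equation 4.1 can be sanity-checked against a finite-difference derivative of H. A minimal sketch (plain Python; the two unit-variance gaussian classes and the evaluation point x0 are arbitrary choices):

```python
import math

# Two gaussian classes (sigma = 1): C_{-1} centered at -1.5, C_1 at 0,
# equal priors p = q = 1/2. All values below are illustrative choices.
p = q = 0.5
mu_m, mu_1 = -1.5, 0.0

def Phi(z):                      # standard gaussian CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):                      # standard gaussian pdf
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

P_m = lambda x: q * (1.0 - Phi(x - mu_m))   # error probability of C_{-1}
P_1 = lambda x: p * Phi(x - mu_1)           # error probability of C_1

def H(x):                        # three-valued error entropy
    ts = (P_m(x), P_1(x), 1.0 - P_m(x) - P_1(x))
    return -sum(t * math.log(t) for t in ts)

def dH(x):                       # analytical derivative, equation 4.1
    fm, f1 = phi(x - mu_m), phi(x - mu_1)
    return (q * fm * math.log(P_m(x))
            - (q * fm - p * f1) * math.log(1.0 - P_m(x) - P_1(x))
            - p * f1 * math.log(P_1(x)))

x0, eps = -0.3, 1e-6
numeric = (H(x0 + eps) - H(x0 - eps)) / (2.0 * eps)
print(abs(numeric - dH(x0)))     # agreement to high precision
```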
2044
L. Silva, C. Felgueiras, L. Alexandre, and J. Marques de Sá
A critical point (root of the first derivative) of H must satisfy

dH/dx = 0 ⇔ p f1 / (q f−1) = log( P−1/(1 − P−1 − P1) ) / log( P1/(1 − P−1 − P1) ).   (4.2)
If the densities are continuous, the Stoller split x∗ is obtained either at a p f1 versus q f−1 intersection, p f1(x∗) = q f−1(x∗), or at +∞ or −∞ (see theorem 1). In the latter case, the error probabilities of each class are unequal. In the former case, we have, from equation 4.2, p f1(x∗) = q f−1(x∗) ⇔ P−1(x∗) = P1(x∗),
(4.3)
where P−1(x∗), P1(x∗) are the probabilities of error of each class with split point at x∗.

Example: In the uniform example of Figure 2a, the Stoller split can be at any point of [c, b] = [0.7, 1], but the critical point (in this case a maximum) of entropy occurs at the middle point of that interval, which corresponds precisely to the split where the two classes have equal error probabilities.

The above result states the conditions for a correspondence between the Stoller split and an entropy extremum. This means that the EEM principle cannot be applied in general situations. Moreover, theorem 3 says nothing about the nature (maximum or minimum) of the critical point. As we will see, the solution in theorem 3 is not guaranteed to be an entropy minimum.
Let us determine the sign of d²H/dx² at x∗. One has

d²H/dx² = q (df−1/dx) log( P−1/(1 − P−1 − P1) ) − p (df1/dx) log( P1/(1 − P−1 − P1) )
− (q f−1 − p f1)²/(1 − P−1 − P1) − q² f−1²/P−1 − p² f1²/P1.   (4.4)
In order to deal with expression 4.4, we make a simplification by analyzing the case of mutually symmetric distributions, defined as follows:

Definition 1. Two class distributions represented by probability densities g1 and g2 and priors p and q, respectively, are mutually symmetric if p g1(a1 − x) = q g2(x − a2), where at is the center of the density gt.
If the classes are mutually symmetric, one must have p = q = 1/2 and

(df−1/dx)|x∗ = −(df1/dx)|x∗.   (4.5)

In the conditions of theorem 3, we have f1(x∗) = f−1(x∗) and P−1(x∗) = P1(x∗). If we define

(1/2) f1(x∗) ≡ f and P1(x∗) ≡ P,   (4.6)

then

(d²H/dx²)|x∗ = −2 [ (df/dx)|x∗ log( P/(1 − 2P) ) + f²/P ].   (4.7)
Therefore, for mutually symmetric distributions, we only need to analyze what happens on one side of one of the distribution centers (in this case, a1). Since we have set a1 = 0, x∗ occurs at half the distance from the median of C−1 to the origin, somewhere in ]−∞, 0]. Let

G(x∗) = (df/dx) log( P/(1 − 2P) ) + f²/P,   (4.8)

where we drop the dependence of the derivative on x∗ in order to simplify notation. G(x∗) plays the key role in the analysis of the error entropy critical points. If the classes are sufficiently distant, that is, C−1 sliding to the left (x∗ → −∞ or x∗ tending to the infimum of the support of C1), then df/dx > 0, and we can rewrite expression 4.8 as

G(x∗) = (df/dx) [ log( P/(1 − 2P) ) + f² / ( (df/dx) P ) ].   (4.9)
Using the results given in section A.1, the second term between the parentheses is finite, while P can be made sufficiently small such that the first term is greater in absolute value than the second. Thus, G(x∗) < 0, and equation 4.7 is positive. Hence, the Stoller split x∗ is an entropy minimum. If the classes are sufficiently close, that is, C−1 sliding rightward (x∗ → 0), there are three situations to consider. Define xM and xm as the abscissas where f has its mode and median, respectively. Then:

1. xM = xm. In this case, f is symmetric, and by the continuity of G(·) and the fact that G(xM) > 0 (since (df/dx)|xM = 0), G(x∗) is positive in a neighborhood of xM.
2. xM < xm. Again, G(x∗) > 0 in a neighborhood of xM, because G(xM) > 0.

3. xM > xm. We have no guarantee of a sign change in G(x∗).

The first two situations show that G(x∗) changes its sign, which means that the Stoller split turns out to be an entropy maximum if the distributions are close enough. In the third situation, we may or may not have a sign change of G(x∗). In fact, as shown in section 4.2.3 for the log normal distribution, there are situations where there is always an entropy minimum at an intersection of the posterior densities, but the Stoller split changes its location as the distributions get closer. Furthermore, for each probability distribution, the ratio x∗/Δ between the possible solution of G(x∗) = 0 and the distribution's scale Δ is a constant. In fact, for two variables X and Y, with Y = Δ · X (Y is a scaled version of X), we have x∗Y/Δ = x∗X.

4.2 Critical Points for Some Distributions. This section presents three examples of univariate split problems that illustrate the results of the previous sections. In the first two examples, for the triangular and gaussian distributions, we determine the minimum distance between classes such that the Stoller split is an entropy minimum. We define d/Δ as a normalized distance between the centers of the two classes, where d = a1 − a−1 and Δ is the distribution's scale. Remember from the end of the previous section that it suffices to set Δ = 1. We also set p = q = 1/2 in all examples. The third example shows that one can have an entropy minimum at an intersection point where the probabilities of error are equal but that is not the location of the Stoller split.

4.2.1 The Triangular Distribution Case. The triangular density function with width (scale) Δ is given by

f(x) = 0, x < 0;
f(x) = 2/Δ − (4/Δ²) |x − Δ/2|, 0 ≤ x ≤ Δ;   (4.10)
f(x) = 0, x > Δ.

Setting Δ = 1, class C1 is centered at 1/2 and class C−1 moves between −1/2 and 1/2. The Stoller split occurs at x∗ = (1/2 + a−1)/2. Carrying out the computation of G(x∗), one finds that x∗ will be an entropy minimum iff

(2/Δ²) [ 2 + log( x∗²/(Δ² − 2x∗²) ) ] < 0 ⇔ x∗ < Δ/√(e² + 2),   (4.11)

and a maximum otherwise.
Thus, for any Δ, the Stoller split is an entropy minimum if

d/Δ > 1 − 2/√(e² + 2) ≈ 0.3473.   (4.12)
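This turning value can be verified numerically: the curvature of the error entropy at the Stoller split changes sign as d crosses d/Δ ≈ 0.3473. A sketch (plain Python, Δ = 1; the probe distances d = 0.30 and d = 0.40, chosen on either side of the threshold, and the finite-difference step are illustrative choices):

```python
import math

# Triangular classes with scale Delta = 1: f(x) = 2 - 4|x - 1/2| on [0, 1].
# C_1 is centered at 1/2; C_{-1} is the same shape shifted left by d.
def F(x):                        # CDF of the unit triangular density
    if x <= 0.0: return 0.0
    if x <= 0.5: return 2.0 * x * x
    if x <= 1.0: return 1.0 - 2.0 * (1.0 - x) ** 2
    return 1.0

def H(x, d):                     # error entropy at split x, centers d apart
    P_m = 0.5 * (1.0 - F(x + d)) # C_{-1}'s CDF is F shifted left by d
    P_1 = 0.5 * F(x)
    ts = (P_m, P_1, 1.0 - P_m - P_1)
    return -sum(t * math.log(t) for t in ts if t > 0.0)

def curvature(d, step=1e-4):     # numerical H'' at the Stoller split
    xs = (1.0 - d) / 2.0         # split halfway between the two centers
    return (H(xs + step, d) - 2.0 * H(xs, d) + H(xs - step, d)) / step**2

threshold = 1.0 - 2.0 / math.sqrt(math.e**2 + 2.0)   # approx 0.3473
print(threshold, curvature(0.30), curvature(0.40))
```

Below the threshold the curvature is negative (entropy maximum); above it, positive (entropy minimum).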
4.2.2 The Gaussian Distribution Case. For gaussian distributions, one has at = µt, where µt is the distribution mean of class Ct. G(x∗) can easily be rewritten as a function of d. Indeed, setting Δ ≡ σ = 1,

G(x∗) = ( d/(4√(2π)) ) exp(−d²/8) log( (1 − Φ(d/2)) / (2Φ(d/2)) ) + exp(−d²/4) / ( 4π (1 − Φ(d/2)) ),   (4.13)

where Φ(·) is the standard gaussian cumulative distribution function. If d is below some value, expression 4.13 is positive, and the Stoller split is an entropy maximum. If it is above, the Stoller split is an entropy minimum. This turning value was numerically determined to be tvalue = 1.405231264.

4.2.3 The Log Normal Distribution Case. The log normal distribution has density

g(x|µ, σ) = ( 1/(xσ√(2π)) ) exp( −(log(x) − µ)²/(2σ²) ).   (4.14)
We consider the splitting problem where f−1(x) ≡ g(x) and f1(x) ≡ g(−x + a−1 + xm), where xm ≡ a1 is the center (median) of f1. Note that this is precisely situation 3 referred to in section 4.1 (xM > xm). In fact, xm = e^µ and xM = e^(µ−σ²). Figure 5 shows the splitting problem under two different conditions: in Figure 5a, the distributions are distant, and in Figure 5b, the distributions have their inner intersection point at their centers. We found that this intersection is always an entropy minimum (thick solid line), but the Stoller split moves to one of the outer intersections (as we can see from the minimum probability of error curve, represented by the dashed line) as the distributions get closer. This illustrates the way theorem 3 was enunciated, because one can have an intersection point with equal probabilities of error, and thus an entropy critical point, that does not correspond to the Stoller split intersection.
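The gaussian turning value quoted in section 4.2.2 can be reproduced by root finding on expression 4.13. A sketch (plain Python, bisection on the bracket [1, 2], which is an assumption justified by the signs of G at its endpoints):

```python
import math

def Phi(z):                      # standard gaussian CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def G(d):                        # expression 4.13 (sigma = 1)
    first = (d / (4.0 * math.sqrt(2.0 * math.pi))) * math.exp(-d * d / 8.0) \
            * math.log((1.0 - Phi(d / 2.0)) / (2.0 * Phi(d / 2.0)))
    second = math.exp(-d * d / 4.0) / (4.0 * math.pi * (1.0 - Phi(d / 2.0)))
    return first + second

# Bisection for the root of G on [1, 2]: G > 0 (entropy maximum) below it,
# G < 0 (entropy minimum) above it.
lo, hi = 1.0, 2.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if G(mid) > 0.0:
        lo = mid
    else:
        hi = mid
print(lo)                        # converges to approx 1.4052
```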
Figure 5: The log normal distribution case. (a) If the distributions are distant, the Stoller split is an entropy minimum at the inner intersection. (b) The inner intersection is still an entropy minimum, but the Stoller split is at one of the outer intersections.
5 EEM Splits in Practice

5.1 The Empirical Stoller Split and MSE. In section 2 we saw how to obtain a theoretical Stoller split for a given problem when the class distributions are known. However, in practice, one has available only a set of examples whose distributions are in general unknown. Stoller (1954)
proposed the following practical rule to choose (x′, y′) such that the empirical error is minimal:

(x′, y′) = arg min_{(x,y) ∈ R×{−1,1}} (1/N) Σ_{i=1}^{N} [ I{Xi ≤ x, Ti = y} + I{Xi > x, Ti = −y} ].   (5.1)
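A brute-force sketch of this rule and of its equivalence with the MSE criterion discussed next (plain Python; the seven-point data set is an arbitrary illustration, and the left-of-split prediction −y follows the indicator convention of equation 5.1 as printed):

```python
# Empirical Stoller split (equation 5.1) by brute force over candidate
# thresholds, and its equivalence with the MSE criterion (c = 1/(4N)).
X = [0.10, 0.20, 0.25, 0.30, 0.80, 0.90, 1.00]
T = [-1,   -1,    1,   -1,    1,    1,    1]
N = len(X)

def emp_error(x, y):
    """Empirical error of the split (x, y) as in equation 5.1."""
    return sum((xi <= x and ti == y) or (xi > x and ti == -y)
               for xi, ti in zip(X, T)) / N

def mse(x, y):
    """MSE cost, equation 5.2, for the rule 'output -y left of x, y right'."""
    preds = [-y if xi <= x else y for xi in X]
    return sum((ti - pi) ** 2 for ti, pi in zip(T, preds)) / (4 * N)

candidates = [(x, y) for x in sorted(X) for y in (-1, 1)]
best_stoller = min(candidates, key=lambda s: emp_error(*s))
best_mse = min(candidates, key=lambda s: mse(*s))
print(best_stoller, emp_error(*best_stoller))  # one misclassified point: 1/7
```

Since each misclassified point contributes exactly (ti − yi)² = 4 to the MSE sum, the two criteria rank every candidate split identically.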
The probability of error of Stoller's rule converges to the Bayes error as N → ∞ (for details, see Devroye et al., 1996). If we take the MSE cost function,

MSE = c Σ_{i=1}^{N} (ti − yi)²,   (5.2)
where c is a constant,² it is easy to see that it is equivalent to Stoller's rule, equation 5.1, in the sense that the same discrimination rule, equation 2.1, is determined. In fact,

MSE = c [ Σ_{Xi∈C−1} (ti − yi)² + Σ_{Xi∈C1} (ti − yi)² ]   (5.3)
    = c [ Σ_{Xi∈C−1} 4 I{Xi > x} + Σ_{Xi∈C1} 4 I{Xi ≤ x} ]   (5.4)
    = 4c Σ_{i=1}^{N} [ I{Xi > x, Ti = −1} + I{Xi ≤ x, Ti = 1} ],   (5.5)
which is the same as in equation 5.1 if we take c = 1/(4N) and use the convention that class C−1 lies to the left of the splitting point. Thus, the solution to

(x′, y′) = arg min_{(x,y) ∈ R×{−1,1}} MSE   (5.6)

is the same as in equation 5.1.

5.2 EEM Empirical Procedure. We have to develop a practical rule to minimize (or maximize, depending on the conditions of the problem) the
² The value of c (which can be 1/N by the MSE definition, or 1/2 for derivative simplification reasons) has no influence on the minimization of the cost function.
error entropy,

H(x) = −P−1(x) log P−1(x) − P1(x) log P1(x) − (1 − P−1(x) − P1(x)) log(1 − P−1(x) − P1(x)),   (5.7)
where

P−1(x) = ∫_x^{+∞} q f−1(s) ds = q ( 1 − ∫_{−∞}^{x} f−1(s) ds ),   (5.8)

P1(x) = p ∫_{−∞}^{x} f1(s) ds.   (5.9)
Since we do not know the true class distributions, we estimate them using the gaussian kernel density estimator (Parzen, 1962),

f_t(x) ≈ (1/(Nh)) Σ_{xi∈Ct} (1/√(2π)) exp( −(x − xi)²/(2h²) ).   (5.10)

Hence,

∫_{−∞}^{x} f_t(s) ds ≈ (1/N) Σ_{xi∈Ct} Φ( (x − xi)/h | 0, 1 ),   (5.11)

where Φ(x|µ, σ²) is the cumulative gaussian distribution, with mean µ and variance σ², at x. Expression 5.11 is used to compute and optimize H(x) as in equation 5.7, and expression 5.1 is used to obtain the optimal solution for MSE. The optimization algorithm we used in our experiments is based on the golden section search with parabolic interpolation (Press, Teukolsky, Vetterling, & Flannery, 1992).

5.3 Experiments

5.3.1 Simulated Data: The Two-Class Gaussian Problem. We first studied how the EEM procedure works with simulated gaussian data, where all the conditions can be controlled. To ensure the conditions of theorem 3, two classes with gaussian distributions differing only in location (σ was set to 1) were generated. We also set p = q = 0.5. Several experiments were made varying the normalized distance d/σ between the classes. Taking into account the tvalue for gaussian classes, the distance values were chosen so as to have one maximization problem (d/σ = 1) and two minimization problems (d/σ = 1.5 and 3), one of them very close to tvalue. We also varied the number of available training (# train) and test (# test) patterns for each class. The
Table 1: Test Error (%) and Standard Deviations Obtained with EEM and MSE for the Simulated Gaussian Data.

d = 3; Bayes error: 6.68%

          # train = 100            # train = 1000           # train = 100,000
# test    EEM         MSE          EEM         MSE          EEM         MSE
50        6.79(2.41)  7.02(2.59)   6.75(2.51)  6.72(2.60)   6.75(2.51)  6.66(2.41)
500       6.82(0.83)  7.07(1.00)   6.70(0.81)  6.76(0.81)   6.66(0.81)  6.69(0.81)
5000      6.81(0.30)  7.11(0.59)   6.69(0.25)  6.72(0.27)   6.68(0.25)  6.67(0.25)
50,000    6.81(0.20)  7.11(0.60)   6.69(0.08)  6.74(0.13)   6.68(0.08)  6.68(0.08)

d = 1.5; Bayes error: 22.66%

# test    EEM          MSE          EEM          MSE          EEM          MSE
50        25.23(4.65)  23.22(4.22)  24.67(4.58)  22.74(4.23)  22.61(4.21)  22.54(4.10)
500       25.33(2.75)  23.30(1.54)  24.82(2.48)  22.84(1.36)  22.83(1.38)  22.67(1.32)
5000      25.32(2.49)  23.22(0.84)  24.72(2.15)  22.82(0.48)  22.80(0.46)  22.68(0.42)
50,000    25.46(2.54)  23.27(0.77)  24.83(2.21)  22.81(0.23)  22.82(0.24)  22.67(0.13)

d = 1; Bayes error: 30.85%

# test    EEM          MSE          EEM          MSE          EEM          MSE
50        30.63(4.64)  31.56(4.60)  30.90(4.48)  31.05(4.61)  30.70(4.82)  30.69(4.56)
500       30.95(4.64)  31.56(4.60)  30.88(1.45)  31.01(1.46)  30.80(1.49)  30.75(1.44)
5000      30.93(0.47)  31.37(0.84)  30.87(0.47)  31.04(0.50)  30.84(0.46)  30.84(0.45)
50,000    30.93(0.17)  31.39(0.70)  30.86(0.14)  31.02(0.26)  30.85(0.14)  30.86(0.15)

Notes: Different values of d were used, and the Bayes error was determined for each case. Standard deviations are in parentheses.
solution was determined for both EEM and MSE with the training set and tested with the test set over 1000 repetitions. To determine the value of h to use in each problem, we conducted preliminary experiments in which we varied h in order to choose the best one. The final values used were h = 1.7, 0.1, and 0.8 for d = 1, 1.5, and 3, respectively. As these problems can be solved optimally, in the Bayes sense, by a unique split, we determined the Bayes error for each experiment for comparison purposes. Table 1 shows the mean values and standard deviations of the test error for each experiment. For d = 1 and d = 3, both EEM and MSE achieve Bayes discrimination if the training sets are asymptotically large, with slightly better results for EEM. However, with small training sets, EEM outperforms MSE. In fact, we find lower test errors and standard deviations for EEM, which means that its solutions have greater stability and generalization capability. Increasing the number of test patterns has the major effect of decreasing the standard deviation of the error estimates. In this sense, the results for d = 1.5 were quite unexpected. As we can see, the results of EEM are always worse than those of MSE, mainly for small sample sizes. Further investigation revealed that the problem was due to the proximity of d = 1.5 to the turning value. The estimate of entropy has high variance, and the location of extrema is highly dependent on the value
Table 2: Test Error (%) and Standard Deviations Obtained with EEM (Maximization Approach) and MSE for the Simulated Gaussian Data (d = 1.5).

          # train = 100            # train = 1000           # train = 100,000
# test    EEM          MSE          EEM          MSE          EEM          MSE
50        22.95(3.93)  23.22(4.22)  22.78(4.01)  22.74(4.23)  22.47(4.14)  22.54(4.10)
500       22.73(1.28)  23.30(1.54)  22.71(1.32)  22.84(1.36)  22.63(1.33)  22.67(1.32)
5000      22.73(0.41)  23.22(0.84)  22.65(0.43)  22.82(0.48)  22.66(0.41)  22.68(0.42)
50,000    22.75(0.17)  23.27(0.77)  22.67(0.14)  22.81(0.23)  22.67(0.13)  22.67(0.13)

Note: Standard deviations are in parentheses.
of h. To solve this problem, we investigated the possibility of transforming the minimization problem into a maximization problem, obtaining a more accurate and stable procedure. This is achieved by increasing the value of h (the details are described in section A.2). The performance increases not only in terms of a lower test error but also in the number of iterations needed. Table 2 presents the comparison between MSE and the maximization approach, where h was determined by formula A.5 with c = 3. As we can see, EEM now behaves similarly to the d = 1 and d = 3 cases above, outperforming the results of MSE.

5.3.2 Real Data. The EEM and MSE procedures were also applied to real data. We used four data sets: Corkstoppers, from Marques de Sá (2001), and Iris, Wine, and Glass, from the UCI repository (Newman, Hettich, Blake, & Merz, 1998). We intended to use the previous results for gaussian distributions; therefore, we conducted hypothesis tests on the normality of the samples and the homogeneity of variances. Table 3 gives a brief description of the data used and the results of these tests. All samples except the ones from Glass verify the normality assumption (for a significance level α = 0.05). The homogeneity of variance property can also be ensured at the same significance level, except for Wine and Glass. Thus, we expect a worse performance of EEM on these data sets, because the conditions of theorem 3 are not ensured. Taking into account the d/σ values, we have two minimization problems (Corkstoppers and Iris petal length) and four maximization problems. The train and test procedure was a simple holdout method: half of the data set for training and half for testing. This was repeated 1000 times, varying the train and test sets. The results obtained are shown in Table 4. They show that EEM outperforms MSE in most cases, with significantly better results in four of the six data sets (according to the µEEM = µMSE test). Even in Wine, EEM outperformed MSE.
In Corkstoppers, the minimization of error entropy performed poorly. This is in agreement with the
Table 3: Description of the Univariate Two-Class Problems Used from Real Data.

                Corkstoppers  Iris          Iris         Iris          Wine               Glass
x               N             Sepal Length  Sepal Width  Petal Length  Alcalinity of Ash  Na
classes         1 vs. 2       2 vs. 3       2 vs. 3      2 vs. 3       1 vs. 2            1 vs. 2
d/σ             1.474         1.036         0.629        2.341         0.717              0.068
Normality       0.97; 0.76    0.58; 0.91    0.45; 0.53   0.25; 0.29    0.18; 0.43         0.04; 0.00
σ1² = σ2² test  0.72          0.15          0.85         0.26          0.01               0.02

Notes: x is the input variable used, and classes are the two classes used from each data set. The last two rows show the p-values for the normality and homogeneity of variance tests.
Table 4: Percentage of Test Error for the Univariate Split Problems of Table 3 with EEM and MSE.

             Corkstoppers  Iris          Iris         Iris          Wine         Glass
                           Sepal Length  Sepal Width  Petal Length
EEM          22.94(4.50)   27.25(4.62)   41.25(8.30)  8.15(2.73)    33.64(4.13)  52.64(3.22)
MSE          26.19(4.84)   30.24(5.43)   40.77(8.2)   8.52(3.17)    35.43(4.50)  52.81(3.29)
µEEM = µMSE  0.00          0.00          0.098        0.005         0.00         0.122

Notes: The last row presents the p-values of the test of equality of means µEEM = µMSE. Standard deviations are in parentheses.
results obtained for the gaussian simulated problem with d = 1.5. Thus, the results of Corkstoppers in Table 4 were obtained with the maximization procedure, using equation A.5 with c = 3 to set h. For Iris petal length, we used the minimization approach, with better results than MSE, but it was interesting to notice that the maximization approach achieved even better results: a test error of 7.18% with standard deviation 2.78%. We sought an explanation for the difference between the maximization and minimization results and found it in the small number of patterns used each time in the training sets, where each class density is estimated with approximately 25 patterns. Also, the optimal value used for the minimization was a mere h = 0.16 (empirically found), which in conjunction with the small number of patterns produces very rough density estimates (see Figure 6a), contrasting with those obtained with a large h (see Figure 6b, for the maximization procedure). Furthermore, as the training sets (remember that each experiment is repeated 1000 times) may vary considerably, the same value of h = 0.16 for all of them is certainly not an optimal
Figure 6: Density estimates of a training set for the two-class problem of Iris petal length, with h = 0.16 in (a) (minimization procedure) and h = 1 in (b) (maximization procedure).
choice. On the other hand, as the maximization approach uses a larger h, the possible differences between training sets are smoothed out and have less influence on the final result. This explains the difference between the minimization and maximization results. In conclusion, for very small data sets (when the sample may not be representative of the distribution), one should consider the maximization approach.

6 Discussion and Conclusions

We analyzed the relation between the theoretical Stoller split (univariate two-class discrimination problem) and the error entropy minimization (EEM) principle. Besides the possible practical applications of this analysis to univariate data splitting with EEM (e.g., in tree classifier design, using the popular univariate data splitting approach), the results derived from the analysis are also important as a first step toward a needed theoretical assessment of EEM when applied to neural networks (e.g., multilayer perceptrons, MLPs). For instance, this work has shown that for certain class configurations, one must use entropy maximization instead of minimization. We started by verifying that for two uniform classes, the EEM principle leads to the optimal classifier within the class of Stoller split decision rules. This optimal solution also corresponds to the optimal decision rule obtained using the minimum probability of error criterion. Thus, the Bayes error is also guaranteed in this situation. For general class density functions, it was proven (in theorem 3) that a Stoller split occurs at an entropy extremum only if the error probabilities for both classes are equal; this restricts the applicability of the EEM principle to univariate splitting in the sense that the optimal classifier may not be achieved. Moreover, we showed that for
mutually symmetric distributions, in the conditions of theorem 3, the Stoller split may be either an entropy minimum or maximum, depending on the proximity of the classes. In particular, it was possible to determine the turning proximity values for the triangular and gaussian distributions. These were used as a guideline for the empirical procedure, where EEM outperformed MSE, especially for small sample sizes. With simulated data, we concluded that the EEM principle requires fewer training data than MSE and also fewer iterations of the optimization algorithm. This evidence of fast convergence will be studied in more detail in future work, particularly when EEM is applied to general MLP classification. We also encountered a high sensitivity of the discrimination process to the smoothing parameter, h. This phenomenon has already been reported in previous work (Santos, Alexandre, et al., 2004; Santos, Marques de Sá, et al., 2004; Silva et al., 2005). Meanwhile, our analysis highlighted the fact that in cases where d/Δ is near the turning proximity value, it is preferable to set h so as to convert the minimization process into a maximization process. Furthermore, our maximization results show that exact density estimation is not needed; a density estimate that captures the main characteristics of the data is sufficient. All of these findings are certainly important for the future study of the influence of h on MLP classifiers with either threshold or continuous activation functions.

Appendix A: Additional Results

A.1 A Result on the Hölder Exponent

Definition 2. Let α ∈ R+ and x0 ∈ R. A function f : R → R is said to be C^[α](x0) if there exist L > 0 and a polynomial P of degree [α]³ such that
∀δ > 0 : |x − x0| < δ ⇒ | f(x) − P(x − x0) | ≤ L |x − x0|^α.   (A.1)
The maximum value of α that satisfies equation A.1 is known as the Hölder exponent of f at x0. The polynomial P is the Taylor expansion of order [α] of f at x0. The Hölder exponent α measures how irregular f is at the point x0: the higher the exponent α, the more regular f is. Figure 7 shows the behavior of f in a neighborhood of x0 for different values of α.
³ [α] represents the largest integer less than α: if α is not an integer, [α] ≡ ⌊α⌋; otherwise, [α] ≡ α − 1.
Figure 7: Local behavior of f for different values of α.
Theorem 4. Let f : R → R be a continuous function such that f ≡ 0 for x ≤ x0 and f is differentiable for x > x0. If the Hölder exponent of f at x0 is α, then

lim_{x→x0+} f²(x) / [ ( ∫_{x0}^{x} f(y) dy ) (df/dx)(x) ] = (α + 1)/α.   (A.2)

The idea of the previous theorem is that in a sufficiently small neighborhood of x0, f behaves like L(x − x0)^α. Then

f²(x) / [ ( ∫_{x0}^{x} f(y) dy ) (df/dx)(x) ] = L²(x − x0)^{2α} / [ ( L(x − x0)^{α+1}/(α + 1) ) αL(x − x0)^{α−1} ] = (α + 1)/α.
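The limit A.2 is easily verified numerically for the prototypical case f(x) = x^α, x0 = 0, whose Hölder exponent at 0 is α (plain Python; the trapezoidal step count and the evaluation point are arbitrary choices):

```python
import math

# Check lim f^2 / (integral(f) * f') = (alpha+1)/alpha for f(x) = x^alpha.
def ratio(alpha, x=1e-2, n=20000):
    step = x / n
    # trapezoidal-rule integral of f over [0, x]
    integral = sum(((i * step) ** alpha + ((i + 1) * step) ** alpha) / 2.0
                   * step for i in range(n))
    derivative = alpha * x ** (alpha - 1.0)
    return x ** (2.0 * alpha) / (integral * derivative)

for a in (0.5, 1.0, 2.0):
    print(a, ratio(a), (a + 1.0) / a)   # last two columns agree
```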
The left-hand side of equation A.2 is also bounded if f has left-unlimited support. The proof of this result can be made using a geometric argument. In fact,

( ∫_{−∞}^{x} f(y) dy ) (df/dx)(x) > ( f(x) b / 2 ) ( f(x) / b ),
where b is the base of the shaded triangle in Figure 7.⁴ Thus,

f²(x) / [ ( ∫_{−∞}^{x} f(y) dy ) (df/dx)(x) ] < 2.
A.2 Turning Minimization into Maximization. Estimating a density function with the kernel method leads to an estimate with

µ̂ = x̄,   σ̂² = h² + s²,   (A.3)
where x̄ is the sample mean and s² is the sample (uncorrected) variance. When h is too small, the kernel estimate has a large variance, leading to a nonsmooth entropy function. When h is large, we have an oversmoothed density, but the entropy is smooth and preserves the extrema. Figure 8 depicts this dichotomy. In Figures 8b and 8c, the values of h are given by the optimal rule for gaussian distributions (Silverman, 1986) and by expression A.5 with c = 3, respectively. The vertical solid line shows the theoretical Stoller split for the problem. It is important to note that this is a minimization problem. In practice, the increased h has the effect of bringing the classes closer together, producing the maximum instead of the minimum in Figure 8c. This means that it is more efficient to maximize entropy when d/σ is close to the turning value. How can one set h in order to have a maximization problem? Just ensure that

d/σ̂ ≈ tvalue/c,   (A.4)

where c > 1 and σ̂ is the standard deviation of the estimated density. Thus, with straightforward calculations, one has

h² ≈ (dc/tvalue)² − s²,   (A.5)
or h equal to some large value (empirically obtained) if the right-hand side of equation A.5 is nonpositive. An evident choice for c may be c = tvalue, because this implies d/σ̂ = 1, which is the third gaussian problem of section 5.3.1. An increase in c leads to an increased h, and the entropy function becomes smoother. Of course, one cannot increase h indefinitely, because with an almost flat H, the optimization algorithm may fail to find its maximum.

⁴ Note that the behavior of f in this situation is similar to the case of f with left-limited support and α > 1.
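A sketch of this rule for setting h (plain Python; the values of d, s, and c are illustrative, and the fallback 5.0 stands in for the empirically obtained large value mentioned above):

```python
import math

# Setting h to turn the EEM minimization into a maximization (section A.2):
# choose h so that the estimated normalized distance d/sigma-hat is about
# t_value / c with c > 1.
t_value = 1.405231264      # gaussian turning value from section 4.2.2
d = 1.5                    # distance between the estimated class centers
s = 1.0                    # pooled sample standard deviation
c = 3.0

h_sq = (d * c / t_value) ** 2 - s ** 2          # expression A.5
h = math.sqrt(h_sq) if h_sq > 0 else 5.0        # fallback: some large h
sigma_hat = math.sqrt(h ** 2 + s ** 2)          # equation A.3
print(h, d / sigma_hat)    # d/sigma-hat equals t_value/c, below t_value
```

The estimated normalized distance lands safely below the turning value, so the empirical procedure can maximize the entropy instead of minimizing it.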
Figure 8: Error entropy for different values of h in the gaussian distribution example with d = 1.5.
Appendix B: Proof of Theorem 1

Proof. First, assume that there is no intersection of q f−1 with p f1 (see Figure 9a). Then P∗ = min(p, q) ≤ 1/2 occurs at +∞ or −∞. For intersecting posterior densities, one has to distinguish two cases. First, assume that for δ > 0,

p f1(x) < q f−1(x), x ∈ [x0 − δ, x0], and p f1(x) > q f−1(x), x ∈ [x0, x0 + δ],   (B.1)
where x0 is an intersection point (see Figure 9b). The probabilities of error at x0 and x0 − δ are P(x0 ) = p
x0 −δ
−∞
f 1 (t)dt +
x0 x0 −δ
f 1 (t)dt + q
+∞ x0
f −1 (t)dt
(B.2)
Figure 9: Possible no-intersection or intersection situations in a two-class problem with continuous class-conditional density functions. The light shaded area in b and c represents P(x0), where x0 is the intersection point. The dark shaded area in b represents the amount of error probability added to P(x0) when the splitting point is moved to x0 − δ. The dashed area in c is the amount of error probability subtracted from P(x0) when the splitting point is moved to x0 − δ.
P(x0 − δ) = p ∫_{−∞}^{x0−δ} f1(t) dt + q ∫_{x0−δ}^{x0} f−1(t) dt + q ∫_{x0}^{+∞} f−1(t) dt.   (B.3)
Hence,

P(x0) − P(x0 − δ) = p ∫_{x0−δ}^{x0} f1(t) dt − q ∫_{x0−δ}^{x0} f−1(t) dt < 0   (B.4)
by condition B.1. It is easily seen, using similar arguments, that P(x0) − P(x0 + δ) < 0. Thus, x0 is a minimum of P(x). Now, suppose that (see Figure 9c)

p f1(x) > q f−1(x), x ∈ [x0 − δ, x0], and p f1(x) < q f−1(x), x ∈ [x0, x0 + δ].   (B.5)
Then x0 is a maximum of P(x). This can be proven as above, or simply by noticing that this situation is precisely the same as the previous one with a relabeling of the classes. For relabeled classes, the probability of error P(r)(x) is given by

P(r)(x) = p(1 − F(r)−1(x)) + q F(r)1(x) = 1 − [ q(1 − F−1(x)) + p F1(x) ] = 1 − P(x).   (B.6)
Thus, P (r ) (x) is just a reflection of P(x) around 1/2, which means that P(x) maxima are P (r ) (x) minima and vice versa. The Stoller split is chosen as the minimum up to a relabel (see expression 2.3). Acknowledgments This work was supported by the Portuguese FCT-Fundac¸a˜ o para a Ciˆencia e a Tecnologia (project POSC/EIA/56918/2004). L.M.S. is also supported by FCT grant SFRH/BD/16916/2004. References Devroye, L., Gyorfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Berlin: Springer-Verlag. Erdogmus, D., & Principe, J. (2000). Comparison of entropy and mean square error criteria in adaptive system training using higher order statistics. In Proceedings of the Intl. Conf. on ICA and Signal Separation (pp. 75–80). Helsinki, Finland. Erdogmus, D., & Principe, J. C. (2002). An error-entropy minimization algorithm for supervised training of nonlinear adaptive systems. IEEE Transactions on Signal Processing, 50(7), 1780–1786. Kapur, J. (1993). Maximum-entropy models in science and engineering (rev. ed). New York: Wiley. Kullback, S. (1959). Statistics and information theory. New York: Wiley. Linsker, R. (1988). Self-organization in a perceptual network. IEEE Computer, 21, 105–117. Marques de S´a, J. (2001). Pattern recognition: Concepts, methods and applications. Berlin: Springer-Verlag. Newman, D., Hettich, S., Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Irvine: University of California, Irvine, Department of
Information and Computer Sciences. Available online at http://www.ics.uci.edu/~mlearn/MLRepository.html.
Parzen, E. (1962). On the estimation of a probability density function and the mode. Annals of Mathematical Statistics, 33, 1065–1076.
Press, W., Teukolsky, S., Vetterling, W., & Flannery, B. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Principe, J. C., Xu, D., & Fisher, J. (2000). Information theoretic learning. In S. Haykin (Ed.), Unsupervised adaptive filtering, Vol. 1: Blind source separation (pp. 265–319). New York: Wiley.
Santos, J., Alexandre, L., & Marques de Sá, J. (2004). The error entropy minimization algorithm for neural network classification. In Int. Conf. on Recent Advances in Soft Computing. Nottingham, U.K.: Nottingham Trent University.
Santos, J., Marques de Sá, J., Alexandre, L., & Sereno, F. (2004). Optimization of the error entropy minimization algorithm for neural network classification. In C. Dagli, A. Buczak, D. Enke, M. Embrechts, & O. Ersoy (Eds.), Intelligent engineering systems through artificial neural networks (Vol. 14, pp. 81–86). New York: American Society of Mechanical Engineers.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Silva, L., Marques de Sá, J., & Alexandre, L. (2005). Neural network classification using Shannon's entropy. In European Symposium on Artificial Neural Networks. Bruges, Belgium: d-side publications.
Silverman, B. (1986). Density estimation for statistics and data analysis. London: Chapman & Hall.
Stoller, D. (1954). Univariate two-population distribution free discrimination. Journal of the American Statistical Association, 49, 770–777.
Received February 2, 2005; accepted February 17, 2006.
LETTER
Communicated by Jerome Friedman
Online Adaptive Decision Trees: Pattern Classification and Function Approximation Jayanta Basak [email protected],[email protected] IBM India Research Lab, Indian Institute of Technology, New Delhi 110016, India
Recently we have shown that decision trees can be trained in the online adaptive (OADT) mode (Basak, 2004), leading to better generalization scores. OADTs were limited by the fact that they could handle only two-class classification tasks with a given structure. In this article, we provide an architecture based on OADT, namely ExOADT, which can handle multiclass classification tasks and is able to perform function approximation. ExOADT is structurally similar to OADT, extended with a regression layer. We also show that ExOADT is capable not only of adapting the local decision hyperplanes in the nonterminal nodes but also of smoothly changing the structure of the tree depending on the data samples. We provide the learning rules based on steepest gradient descent for the new model ExOADT. Experimentally we demonstrate the effectiveness of ExOADT in pattern classification and function approximation tasks. Finally, we briefly discuss the relationship of ExOADT with other classification models.
1 Introduction
Decision trees (Duda, Hart, & Stork, 2001; Durkin, 1992; Fayyad & Irani, 1992; Friedman, 1991; Breiman, Friedman, Olshen, & Stone, 1983; Quinlan, 1993, 1996; Brodley & Utgoff, 1995), well known for their simplicity and interpretability, usually perform hard splits of data sets, and if a mistake is committed in splitting the data set at a node, it cannot be corrected further down the tree. Attempts have been made to address this problem (Friedman, Kohavi, & Yun, 1996); in general, they employ various kinds of look-ahead operators to estimate the potential gain in the objective after more than one recursive split of the data set. Moreover, a decision tree is always constructed in batch mode; the branching decision at each node of the tree is induced based on the entire set of training data.
Variations have been proposed for efficient construction mechanisms to alleviate this difficulty (Utgoff, Berkman, & Clouse, 1997; Mehta, Agrawal, & Rissanen, 1996). Online algorithms (Kalai & Vempala, n.d.; Albers, 1996) sometimes prove to be more useful in the case of streaming data, very large data sets, and situations where memory is limited.
Neural Computation 18, 2062–2101 (2006) © 2006 Massachusetts Institute of Technology
Unlike decision trees, neural
networks (Haykin, 1999) make decisions based on the activation of all nodes; therefore, even if a node is faulty (or commits mistakes) during training, it does not affect the performance of the network very much. Neural networks also provide a computational framework for online learning and adaptivity. In short, a decision tree can be viewed as a token passing system, and a pattern is treated as a token that follows a path in the tree from the root node to a leaf node. Each decision node behaves as a gating channel depending on the attribute values of the pattern token. Neural networks, on the other hand, resemble the nervous system in the sense of distributed representation. A pattern is viewed as a phenomenon of activation, and the activation propagates through the links and nodes of the network to reach the output layer. In a recent article (Basak, 2004), we showed that decision trees with a specified depth can be trained in the online adaptive mode (OADT), leading to a better generalization score as compared to conventional decision trees and neural networks. Online adaptive decision trees are complete binary trees with a specified depth where a subset of leaf nodes is assigned for each class, in contrast to the hierarchical mixture of experts (Jordan & Jacobs, 1993, 1994), where the root node always represents the class. In the online mode, for each training sample, the tree adapts the decision hyperplane at each intermediate node from the response of the leaf nodes and the assigned class label of the sample. Unlike the hierarchical mixture of experts, OADT uses the top-down structure of a decision tree. Instead of probabilistically weighing the decisions of the experts and combining them, OADT allocates a data point to a set of leaf nodes (ideally one) representative of a class. 
Training of OADT was performed by considering the representation of a class by an aggregate of leaf node activations, and, correspondingly, the hyperplanes of the intermediate (nonterminal) nodes were updated to reduce an aggregated average error. OADTs also proved to be more efficient in terms of classification accuracy when compared to neural networks and the hierarchical mixture of experts. However, OADTs are seriously limited by two factors. First, in our previous work (Basak, 2004), we outlined how to represent a two-class classification problem using the leaf nodes of an OADT in a top-down manner, but it was not clear how to represent a multiclass classification problem using a single such network. Second, OADTs were designed to address the classification task only, and the problem of function approximation was not addressed. In this letter, we present an extension of OADT (we call it ExOADT) to handle both multiclass pattern classification and function approximation problems. Various attempts have been made to enhance decision trees by constructing oblique decision trees (Murthy, Kasif, & Salzberg, 1994) and hybridizing decision trees with other computational paradigms such as neural networks, support vector machines, and fuzzy set theoretic concepts for the purpose of classification (Bennett, Wu, & Auslender, 1998;
Strömberg, Zrida, & Isaksson, 1991; Golea & Marchand, 1990; Boz, 2000; Janickow, 1998; Suárez & Lutsko, 1999). In almost all of these approaches, the basic formalism of decision trees is maintained in the sense that the data are recursively partitioned locally at each nonterminal node, and finally the class labels are assigned at the leaf nodes. Thus, the overall construction of the decision tree is data driven, and it reduces the bias at the cost of variance. Model-based construction of decision trees (Geman & Jedynak, 2001) has also been attempted in order to optimize the overall structure. Deviations from this data-driven construction of decision trees are the hierarchical mixture of experts (HME) and the online adaptive decision tree (OADT). In both cases, instead of local splitting criteria at each nonterminal node, a global objective measure is minimized. In the case of HME, the decisions are hierarchically agglomerated in a bottom-up fashion, and learning rules based on expectation-maximization (EM) are used. In the case of OADT, the decision is made in a top-down manner, and a steepest descent rule is used. In most decision tree construction approaches and their applications (Chien, Huang, & Chen, 2002; Cho, Kim, & Kim, 2002; Riley, 1989; Salzberg, Delcher, Fasman, & Henderson, 1998; Yang & Pedersen, 1997; Zamir & Etzioni, 1998), the decision tree–like network is modeled for classification tasks. Decision trees are employed for function approximation mostly in the context of reinforcement learning for value function estimation (Uther & Veloso, 1998; Pyeatt & Howe, 1998; Wang & Dietterich, 1999), where the input space is divided into different regions and in each subspace a local regression is performed.
One of the most elegant approaches in this line is TreeNet (Friedman, 2001; Friedman, Hastie, & Tibshirani, 1998), in which additive expansions of the predicted functions are performed by tree boosting, with piecewise constant approximation performed at a finer granularity. A large number of studies are available in the literature for function approximation based on neural networks, radial basis functions, and additive splines (Girosi, Jones, & Poggio, 1995; Buhmann, 1990; Broomhead & Lowe, 1988; Grimson, 1982; Moody & Darken, 1989; Poggio & Girosi, 1990a, 1990b; Powell, 1987; Poggio, Torre, & Koch, 1985), which are mostly based on regularization theory (Tikhonov & Arsenin, 1977). A comprehensive discussion of these approaches can be found in Girosi et al. (1995). Recently, the function approximation task has been dealt with in the framework of structural risk minimization, namely, support vector regression (Vapnik, 1998; Vapnik, Golowich, & Smola, 1997; Smola & Schölkopf, 1998a, 1998b; Gunn, 1998; Burges, 1998). In this letter, we show that the extension of the online adaptive decision tree (ExOADT) is able not only to handle the multiclass classification problem but also to provide a robust smooth estimate of the desired functions. ExOADT not only adapts the local hyperplanes in the nonterminal nodes in the online mode but is also able to smoothly change the structure depending on the data set. The rest of the letter is organized as follows. In section 2,
Figure 1: Structure of an extended OADT network. The network consists of two parts: an adaptive decision tree and a regression layer. Each intermediate node of the adaptive decision tree accepts the input pattern vector x as input, and it activates its two child nodes differentially depending on the embedded sigmoidal function defined by its parameters (w, θ). The root node activation u(1) is always unity. The leaf nodes of the adaptive tree are connected to the output layer through an adaptive connection matrix β. The outputs of the output layer are denoted o_1, o_2, ..., o_k, respectively.
we describe the model and the associated learning rules for pattern classification and function approximation. We then demonstrate the effectiveness of the model in performing function approximation and pattern classification on certain real-life data sets in section 3. We discuss various aspects of the model in section 4 and conclude in section 5.
2 Description of the Model
The structure of the ExOADT is shown in Figure 1. It consists of a complete binary tree of a certain depth l with the leaf nodes connected to the output nodes via an adaptive weight matrix β. Each nonterminal node i has a local decision hyperplane characterized by a vector (w_i, θ_i) with ||w_i|| = 1, such that (w_i, θ_i) together has n free variables for an n-dimensional input space.
2.1 Activation of the Nodes. As a convention, we number the nodes in ExOADT in breadth-first order with the root node numbered 1; this convention is followed throughout this letter. The activations of the two child nodes 2i and 2i + 1 of node i are given as

$$u_{2i} = u_i \, f(\mathbf{w}_i \cdot \mathbf{x} + \theta_i), \qquad (2.1)$$
$$u_{2i+1} = u_i \, f(-(\mathbf{w}_i \cdot \mathbf{x} + \theta_i)), \qquad (2.2)$$
where x is the input pattern, u_i represents the activation of node i, and f(·) represents the local soft partitioning function attributed by the parameter vector (w_i, θ_i). We choose f(·) as a sigmoidal function. The activation of the root node is set to unity. Thus, the activation of a leaf node p can be represented as

$$u_p = \prod_{i \in P_p} f\big(lr(i, p)\,(\mathbf{w}_i \cdot \mathbf{x} + \theta_i)\big). \qquad (2.3)$$
Here P_p is the path from the root node to the leaf node p, and lr(·) is an indicator function of whether the leaf node p is on the right or left path of node i, such that

$$lr(i, p) = \begin{cases} 1 & \text{if } p \text{ is on the left path of } i \\ -1 & \text{if } p \text{ is on the right path of } i \\ 0 & \text{otherwise.} \end{cases} \qquad (2.4)$$
Considering a sigmoidal activation function f(·), Basak (2004) shows that the sum of the activations of all leaf nodes is always unity provided that the root node has unit activation. The activation of the output node k, in the case of pattern classification, is given as

$$o_k = f\Big(\sum_{j \in \Lambda} \beta_{jk} u_j\Big), \qquad (2.5)$$

where Λ is the set of leaf nodes, β is the connection matrix between the leaf nodes and the output nodes, and f(·) is a sigmoidal activation function to restrict the output to [0, 1]. In the case of a k-class classification problem, we have k such output nodes. In the case of function approximation, we have only one output node (considering that there is only one predictor variable),
and the output is simply the weighted sum of the leaf node activations:

$$o = \sum_{j \in \Lambda} \beta_j u_j. \qquad (2.6)$$
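As a concrete illustration, the forward pass of equations 2.1 to 2.6 can be sketched as follows. This is a minimal NumPy sketch, not the author's implementation; the array layout (index 0 unused, breadth-first numbering from 1) and the function names are our choices.

```python
import numpy as np

def sigmoid(v, m=1.0):
    # Sigmoidal transfer function with steepness m.
    return 1.0 / (1.0 + np.exp(-m * v))

def forward(x, w, theta, depth, beta, m1, m2=1.0, classify=True):
    """Forward pass of an ExOADT-style tree: node i gates its children
    2i and 2i+1 as in equations 2.1-2.2, so each leaf activation is the
    product along its root-to-leaf path (equation 2.3)."""
    n_leaves = 2 ** depth
    u = np.zeros(2 * n_leaves)           # node activations, index 0 unused
    u[1] = 1.0                           # root activation is unity
    for i in range(1, n_leaves):         # nonterminal nodes 1 .. 2^depth - 1
        g = sigmoid(w[i] @ x + theta[i], m1)
        u[2 * i] = u[i] * g              # left child
        u[2 * i + 1] = u[i] * (1.0 - g)  # right child, since f(-v) = 1 - f(v)
    z = u[n_leaves:]                     # leaf activations
    if classify:
        return sigmoid(z @ beta, m2)     # sigmoidal outputs, equation 2.5
    return z @ beta                      # linear output, equation 2.6
```

Because each gate splits the parent activation exactly between the two children, the leaf activations computed this way always sum to the root activation of 1, which is the property attributed to Basak (2004) above.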
Note that this kind of representation also performs a piecewise linear approximation of the output function. However, due to the soft splitting of the input data in the nonterminal nodes, it is not essential that only one leaf node be activated at a time. Second, due to soft splitting, the activation of the leaf nodes makes a smooth transition in the range [0, 1], governed by the fact that (Basak, 2004)

$$\sum_{j \in \Lambda} u_j = 1. \qquad (2.7)$$
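This unit-sum property can be verified directly from equations 2.1 and 2.2. Writing Λ for the set of leaf nodes, and using the sigmoid identity f(−v) = 1 − f(v):

```latex
u_{2i} + u_{2i+1}
  = u_i\big[f(\mathbf{w}_i\cdot\mathbf{x}+\theta_i) + f(-(\mathbf{w}_i\cdot\mathbf{x}+\theta_i))\big]
  = u_i,
\qquad\text{hence}\qquad
\sum_{j\in\Lambda} u_j = \sum_{i\in\text{level } l-1} u_i = \cdots = u_1 = 1.
```

That is, every node's activation is split exactly between its two children, so summing over any level of the complete tree and applying induction from the leaves up to the root yields unity.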
For each nonterminal node, we select the sigmoidal form of activation

$$f(v) = \frac{1}{1 + \exp(-m_1 v)}, \qquad (2.8)$$
where m_1 decides the steepness of the function and is the same for all nonterminal nodes. Considering the minimum activation of a maximally activated leaf node to be greater than a certain threshold, we derived (Basak, 2004) that

$$m_1 \ge \frac{1}{\delta} \log(l/\epsilon), \qquad (2.9)$$

where δ is a margin between a point (pattern) and the closest decision hyperplane, and ε is a constant such that the minimum activation of a maximally activated node should be no less than 1 − ε. For example, if we choose ε = 0.1 and δ = 1, then m_1 = log(10l). Note that the parameter m_1 can be learned as part of the supervised training. However, we fix m_1 in such a way that the maximally activated leaf node gets an activation as close to unity as possible with a margin ε. This is done in analogy with the classical decision tree, where only one leaf node is assigned to a given pattern. For the sigmoidal functions of the output nodes for classification, the steepness parameter m_2 is different from that of the nonterminal nodes, and we always select m_2 = 1. The parameter m_2 can also be learned along with β during the supervised training. However, since the output o_k of the kth output node can be expressed as

$$o_k = \frac{1}{1 + \exp\big(-m_2 \sum_{j \in \Lambda} \beta_{jk} u_j\big)}, \qquad (2.10)$$
the parameter m_2 acts as a scaling factor on β. Since we do not impose any restriction on the maximum or minimum value of β (unlike w, where we always have ||w|| = 1), we need to fix m_2 at some value. Equation 2.10 reveals that we can fix m_2 at any value, and β will change accordingly. Here we selected m_2 = 1.
2.2 Learning Rule for Pattern Classification
For pattern classification, we define the loss function as the squared error, based on the assumption that the model noise introduced on the predictor is gaussian in nature (Amari, 1967, 1998), such that

$$E(\mathbf{x}) = \frac{1}{2} \sum_k \big(y_k(\mathbf{x}) - o_k(\mathbf{x})\big)^2, \qquad (2.11)$$
where y_k ∈ {0, 1} depending on whether the pattern belongs to the kth class. Subject to steepest descent, the parameters are changed as

$$\Delta \mathbf{w}_i = -\eta_{opt} \frac{\partial E}{\partial \mathbf{w}_i}, \qquad (2.12)$$
$$\Delta \theta_i = -\eta_{opt} \frac{\partial E}{\partial \theta_i}, \qquad (2.13)$$
$$\Delta \beta_k = -\eta_{opt} \frac{\partial E}{\partial \beta_k}, \qquad (2.14)$$

where η_opt is the step size of the steepest descent using line search (Friedman, 2001):

$$\eta_{opt} = \arg\min_\eta E(\mathbf{x};\, \mathbf{w} + \Delta\mathbf{w},\, \theta + \Delta\theta,\, \beta + \Delta\beta). \qquad (2.15)$$
Considering that each w_i has n − 1 free variables in an n-dimensional space subject to ||w_i|| = 1, we have

$$\Delta \mathbf{w}_i = -\eta_{opt} \, (I - \mathbf{w}_i \mathbf{w}_i^T) \frac{\partial E}{\partial \mathbf{w}_i}. \qquad (2.16)$$

Evaluating the partials, we get

$$\Delta \mathbf{w}_i = \eta_{opt} \, m_1 q_i (I - \mathbf{w}_i \mathbf{w}_i^T) \mathbf{x}, \qquad (2.17)$$
$$\Delta \theta_i = \lambda \eta_{opt} \, m_1 q_i, \qquad (2.18)$$

and

$$\Delta \beta_j = \eta_{opt} \, m_2 e_j o_j (1 - o_j) \mathbf{z}. \qquad (2.19)$$
The parameter λ is a constant to differentiate the learning rates of w and θ. Normally the parameter θ depends on the range of the distribution of input attribute values. In our algorithm, we normalize all input attributes to [−1, 1], so we choose λ = 1. The error in each output node j is e_j, that is, e_j = y_j − o_j. The error at each intermediate (nonterminal) node i is q_i, which is given as

$$q_i = \sum_{j \in \Lambda} \delta_j u_j (1 - v_{ij}) \, lr(i, j), \qquad (2.20)$$
where δ is the backpropagated error such that

$$\delta_j = \sum_{k \in O} \beta_{jk} e_k, \qquad (2.21)$$
where O is the set of output nodes. Each nonterminal node i locally computes v_{ij}, which is given as

$$v_{ij} = f\big((\mathbf{w}_i \cdot \mathbf{x} + \theta_i) \, lr(i, j)\big). \qquad (2.22)$$
The activation of the leaf nodes is represented by the vector z, that is, z = [u_j | j ∈ Λ]. Equations 2.17 and 2.18 reveal that we first compute the error δ backpropagated to the leaf nodes from the output nodes and then compute the local changes in each nonterminal node. Note that the values of q can be computed in each nonterminal node independent of the other nonterminal nodes. Equations 2.17 and 2.18 show that the parameters of a nonterminal node i are updated based on the error only in those leaf nodes that are on either the left path or the right path of i. If a leaf node j is on neither the left nor the right path of node i, then δ_j does not affect the parameter changes in i. Therefore, the root node parameters are affected by all the leaf nodes; that is, the parameters in the root node start capturing the gross behavior of the data in terms of linear separability. As we go down the tree, only the data points in a subspace affect the parameters of the nonterminal nodes. Equation 2.19 is the adaptive updating of the parameters β to obtain the linear regression, except for the factor o_j(1 − o_j), which is a nonlinear transformation due to the presence of the sigmoidal activation. In order to perform the line search on η to get η_opt such that η_opt = argmin_η(E(x) + ΔE(x)), we let the first-order approximation of ΔE = −E. Algebraic manipulation (see the appendix) yields

$$\eta_{opt} = \frac{E}{m_2^2 \|\mathbf{z}\|^2 \sum_{k \in O} e_k^2 o_k^2 (1 - o_k)^2 + m_1^2 \sum_i q_i^2 \big(\lambda + \|\mathbf{x}\|^2 - (\mathbf{w}_i \cdot \mathbf{x})^2\big)}, \qquad (2.23)$$

where the second sum runs over the nonterminal nodes.
At a near-optimal point, if E becomes very small, then the denominator in equation 2.23 becomes very small, and the behavior will become unstable. In order to stabilize the behavior of the network, we make

$$\eta_{opt} = \frac{E}{1 + m_2^2 \|\mathbf{z}\|^2 \sum_{k \in O} e_k^2 o_k^2 (1 - o_k)^2 + m_1^2 \sum_i q_i^2 \big(\lambda + \|\mathbf{x}\|^2 - (\mathbf{w}_i \cdot \mathbf{x})^2\big)}. \qquad (2.24)$$
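Putting the pieces together, one online classification update (the errors of equations 2.20 to 2.22, the stabilized step size of equation 2.24, and the parameter changes of equations 2.17 to 2.19) can be sketched as follows. This is our own illustrative NumPy code, not the published implementation; the breadth-first array layout with index 0 unused, the function names, and the renormalization of w after each step are our choices, with λ = 1 and m2 = 1 as in the text.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lr_ind(i, j):
    """lr(i, j) of equation 2.4: +1 if node j lies in the left subtree of
    nonterminal node i, -1 if in the right subtree, 0 otherwise
    (breadth-first numbering from 1)."""
    node = j
    while node > i:
        if node // 2 == i:
            return 1 if node == 2 * i else -1
        node //= 2
    return 0

def update(x, y, w, theta, beta, depth, m1, m2=1.0, lam=1.0):
    """One online ExOADT classification step; w has one row per node
    (row 0 unused), beta is (n_leaves, n_classes). Returns the loss E
    computed before the parameters are changed."""
    n_leaves = 2 ** depth
    u = np.zeros(2 * n_leaves)
    u[1] = 1.0
    for i in range(1, n_leaves):                     # equations 2.1-2.2
        g = sigmoid(m1 * (w[i] @ x + theta[i]))
        u[2 * i], u[2 * i + 1] = u[i] * g, u[i] * (1.0 - g)
    z = u[n_leaves:]                                 # leaf activations
    o = sigmoid(m2 * (z @ beta))                     # equation 2.10
    e = y - o                                        # output errors
    E = 0.5 * e @ e                                  # equation 2.11
    delta = beta @ e                                 # equation 2.21
    q = np.zeros(n_leaves)
    for i in range(1, n_leaves):
        a = w[i] @ x + theta[i]
        for jj in range(n_leaves):
            s = lr_ind(i, n_leaves + jj)
            if s:
                v = sigmoid(m1 * s * a)              # equation 2.22
                q[i] += delta[jj] * z[jj] * (1.0 - v) * s   # equation 2.20
    denom = (1.0
             + m2 ** 2 * (z @ z) * np.sum(e ** 2 * o ** 2 * (1 - o) ** 2)
             + m1 ** 2 * np.sum(q[1:] ** 2 * (lam + x @ x - (w[1:] @ x) ** 2)))
    eta = E / denom                                  # equation 2.24
    for i in range(1, n_leaves):
        w[i] += eta * m1 * q[i] * (x - (w[i] @ x) * w[i])   # (I - w w^T) x
        w[i] /= np.linalg.norm(w[i])                 # keep ||w_i|| = 1
        theta[i] += lam * eta * m1 * q[i]            # equation 2.18
    beta += eta * m2 * np.outer(z, e * o * (1 - o))  # equation 2.19
    return E
```

Repeatedly presenting the same sample drives the loss down, which is the behavior the stabilized step size is designed to preserve near an optimum.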
2.3 Learning Rule for Function Approximation
The learning rule for function approximation is the same as that for classification except that we use a linear instead of a sigmoidal unit in the output. Since there is no sigmoidal function at the output node, we denote the steepness parameter of the nonterminal node transfer function by m. We consider a squared error loss at the output,

$$E = \frac{1}{2}(y - o)^2, \qquad (2.25)$$

where $o = \sum_{j \in \Lambda} \beta_j u_j$. Following steepest descent, we get the learning rules

$$\Delta \mathbf{w}_i = \eta_{opt} \, m (y - o) r_i (I - \mathbf{w}_i \mathbf{w}_i^T) \mathbf{x}, \qquad (2.26)$$
$$\Delta \theta_i = \lambda \eta_{opt} \, m (y - o) r_i, \qquad (2.27)$$

and

$$\Delta \beta = \eta_{opt} (y - o) \mathbf{z}, \qquad (2.28)$$
where

$$r_i = \sum_{j \in \Lambda} \beta_j u_j (1 - v_{ij}) \, lr(i, j), \qquad (2.29)$$

the activation of the leaf nodes is z = [u_j | j ∈ Λ], and the error at the output node is e = (y − o). The learning rate parameter η_opt is obtained by performing a line search so that η_opt = argmin_η |y − (o + Δo)|. Letting the first-order approximation of Δo = −e, we get

$$\eta_{opt} = \frac{1}{\sum_{j \in \Lambda} u_j^2 + m^2 \sum_i r_i^2 \big(\lambda + \|\mathbf{x}\|^2 - (\mathbf{w}_i \cdot \mathbf{x})^2\big)}. \qquad (2.30)$$
In order to stabilize the learning behavior at a near-optimal point, we modify the learning rate as

$$\eta_{opt} = \frac{1}{1 + \sum_{j \in \Lambda} u_j^2 + m^2 \sum_i r_i^2 \big(\lambda + \|\mathbf{x}\|^2 - (\mathbf{w}_i \cdot \mathbf{x})^2\big)}. \qquad (2.31)$$
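A corresponding sketch of one function-approximation step (equations 2.26 to 2.29 with the stabilized step of equation 2.31) is given below. Again, this is our own illustrative code, not the published implementation: β is a vector over the leaves, the output is linear as in equation 2.6, and the layout and names follow our conventions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lr_ind(i, j):
    # lr(i, j) of equation 2.4: +1 / -1 / 0 for left subtree / right
    # subtree / neither (breadth-first numbering from 1).
    node = j
    while node > i:
        if node // 2 == i:
            return 1 if node == 2 * i else -1
        node //= 2
    return 0

def fa_update(x, y, w, theta, beta, depth, m, lam=1.0):
    """One online ExOADT function-approximation step (sketch).
    Returns the loss 0.5*(y - o)^2 computed before the step."""
    n_leaves = 2 ** depth
    u = np.zeros(2 * n_leaves)
    u[1] = 1.0
    for i in range(1, n_leaves):                     # equations 2.1-2.2
        g = sigmoid(m * (w[i] @ x + theta[i]))
        u[2 * i], u[2 * i + 1] = u[i] * g, u[i] * (1.0 - g)
    z = u[n_leaves:]
    o = z @ beta                                     # linear output, eq. 2.6
    e = y - o
    r = np.zeros(n_leaves)
    for i in range(1, n_leaves):
        a = w[i] @ x + theta[i]
        for jj in range(n_leaves):
            s = lr_ind(i, n_leaves + jj)
            if s:
                v = sigmoid(m * s * a)
                r[i] += beta[jj] * z[jj] * (1.0 - v) * s   # equation 2.29
    eta = 1.0 / (1.0 + z @ z                         # equation 2.31
                 + m ** 2 * np.sum(r[1:] ** 2 * (lam + x @ x - (w[1:] @ x) ** 2)))
    for i in range(1, n_leaves):                     # equations 2.26-2.27
        w[i] += eta * m * e * r[i] * (x - (w[i] @ x) * w[i])
        w[i] /= np.linalg.norm(w[i])                 # keep ||w_i|| = 1
        theta[i] += lam * eta * m * e * r[i]
    beta += eta * e * z                              # equation 2.28
    return 0.5 * e ** 2
```

Note that, as the text observes, the effective step scales with the error e, which is why the predictor variable is confined to [−1, 1] before training.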
One point that may be noted here is that the learning rate depends on the error e = (y − o). Since the output is not necessarily bounded in [0, 1] for the function approximation task, the step size can be large depending on the value of the predictor variable. However, we always restrict the input to [−1, 1] so that the value of θ is also restricted to that space. Therefore, if the step size becomes large through a greedy approximation of the steepest descent, then θ can become very large, and the network may not converge. In order to address this problem, we always confine the predictor variable to [−1, 1], and after the network is trained, we rescale the output accordingly.
3 Experimental Results
We experimented with both synthetic and real-life data sets for classification and function approximation. In this section and later in the letter, we demonstrate the performance of ExOADT and also compare it with other classification and function approximation paradigms.
3.1 Protocol
We trained ExOADT in the online mode. The entire batch of samples is repeatedly presented in the online mode to ExOADT, and we refer to the number of times the batch of samples is presented to an OADT as the number of epochs. If the size of the data set is large (i.e., the data density is high), then it is observed that ExOADT takes fewer epochs to converge, and for relatively smaller data sets, ExOADT takes a larger number of epochs. On average, we found that OADT converges near its local optimal solution within 50 epochs, although the required number of epochs increases with the depth of the tree. Accordingly, we report all results for 200 epochs. The performance of ExOADT depends on the depth (l) of the tree and the parameter m. We have chosen the parameters ε and δ as 0.1 and 1, respectively; that is, m1 = log(10l). Since m1 is determined by l in our setting, the performance is solely dependent on l. We report the results for different depths of the tree.
We choose the steepness parameter of the output node transfer function m2 = 1 for all experiments in pattern classification. We initialize each w_i randomly with the constraint that ||w_i|| = 1. We always initialize each θ_i = 0. For the regression layer, we initialize each β_ik as β_ik = 1/N for every i and k, where N = |Λ| is the number of leaf nodes.
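The initialization just described can be written down directly. This is a sketch under our own conventions (the node arrays carry one row per node with index 0 unused, matching the breadth-first numbering; the function name and seed handling are ours).

```python
import numpy as np

def init_params(depth, n_features, n_classes, seed=0):
    """Protocol initialization: each w_i random with ||w_i|| = 1,
    each theta_i = 0, and each beta_jk = 1/N with N the number
    of leaf nodes."""
    rng = np.random.default_rng(seed)
    n_leaves = 2 ** depth
    w = rng.standard_normal((n_leaves, n_features))   # rows 1..2^depth - 1 used
    w /= np.linalg.norm(w, axis=1, keepdims=True)     # enforce ||w_i|| = 1
    theta = np.zeros(n_leaves)
    beta = np.full((n_leaves, n_classes), 1.0 / n_leaves)
    return w, theta, beta
```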
We normalize all input patterns such that any component of x lies in the range [−1, +1]. We normalize each component of x (say, x_i) as

$$\hat{x}_i = \frac{2x_i - (x_i^{max} + x_i^{min})}{x_i^{max} - x_i^{min}}, \qquad (3.1)$$
where x_i^max and x_i^min are the maximum and minimum values of x_i over all observations, respectively. Note that in this normalization, we do not take outliers into account. Data outliers can badly influence the variables in such a normalization. Instead of linear scaling, using a nonlinear scaling or a more robust scaling, such as one based on the interquartile range, may improve the performance of our model. However, we stick to simple linear scaling mainly for two reasons. First, data outliers can be handled separately. We did the linear scaling to fit the data to our model, not for data cleansing; we therefore preserve the exact structure of the data distribution for each variable. Second, we compared our model with several existing classifiers. In order to investigate the performance of the other classifiers, we used certain standard software packages, as described in the next section. Since these packages use linear scaling for normalization, we used them for a fair comparison.
3.2 Results of Classification
We first demonstrate the effectiveness of ExOADT in the classification task. We have considered data sets available in the UCI machine learning repository (Blake & Merz, 1998); Table 1 summarizes the data set description. We selected data sets from the repository mostly based on the requirement that the data should not contain any missing values and should consist of only numeric attributes. We modified the Ecoli data originally used in Nakai and Kanehisa (1991) and later in Horton and Nakai (1996) for predicting protein localization sites. The original data set consists of eight different classes, of which three classes—outer membrane lipoprotein, inner membrane lipoprotein, and inner membrane cleavable signal sequence—have only five, two, and two instances, respectively.
Since we report the 10-fold cross-validation score, we omitted samples from these three classes and report the results for the remaining five classes. Note that in the original work (Nakai & Kanehisa, 1991), the results are not cross-validated either. We performed 10-fold cross validation on each data set. Table 2 summarizes the performance of ExOADT on these data sets. We also provide a comparison of the results with the hierarchical mixture of experts (HME) (Jordan & Jacobs, 1993, 1994), C4.5 (Quinlan, 1993, 1996), SVM (Vapnik, 1998; Smola & Schölkopf, 1998a; Gunn, 1998), the k-nearest neighbor algorithm (Duda et al., 2001), the naive Bayes classifier, bagging (Breiman, 1996), and the AdaBoost (Schapire & Singer, 1999) algorithm. We implemented the hierarchical mixture of experts by using the MATLAB toolbox as provided in
Table 1: Description of the Pattern Classification Data Sets Obtained from the UCI Machine Learning Repository (Blake & Merz, 1998).

Data Set                          Number of Instances   Number of Features   Number of Classes
Indian Diabetes (Pima)            768                   8                    2
Diagnostic Breast Cancer (Wdbc)   569                   30                   2
Prognostic Breast Cancer (Wpbc)   198                   32                   2
Liver Disease (Bupa)              345                   6                    2
Flower (Iris)                     150                   4                    3
Bacteria (Ecoli)                  326                   7                    5
AI (Monks1)                       556                   6                    2
AI (Monks2)                       601                   6                    2
AI (Monks3)                       554                   6                    2
Murphy (2001, 2003). We used the implementation of WEKA (Garner, 1995; Witten & Frank, 2000) in studying the performance of C4.5, k-nearest neighbor, the naive Bayes classifier, bagging, and AdaBoost. To implement SVM, we used the publicly available code in Canu, Grandvalet, and Rakotomamonjy (2003), which is also referred to in Cristianini and Shawe-Taylor (2000). Note that WEKA also provides an implementation of SVM, particularly sequential minimal optimization (SMO) (Platt, 1998). However, the SMO algorithm as provided in WEKA cannot handle more than two class labels. We therefore implemented the multiclass SVM using the software provided in Canu et al. (2003). We also graphically depict the classification results in Figure 2, which illustrates the performances of the best-performing ExOADT and the best-performing HME. Note that depicting the results of the best-performing model structure is a bit overoptimistic. However, in model selection, the best-performing model is usually selected based on the cross-validation performance. In the domain of neural networks as well, the results of the best-performing neural network structure are usually reported. In this context, we must mention that we implemented a multilayer perceptron (MLP) with backpropagation learning using the WEKA tool. We studied the performance of the MLP for various structures and observed that for certain data sets, the MLP provides very poor performance. For example, with the liver disease data set (Bupa), the MLP often gets stuck in local minima if the parameters (such as the learning rate and the momentum factor) are not properly tuned. With different experiments, we obtained a maximum score of 57.97% for this data set using an MLP with one hidden layer and 15 hidden nodes. Possibly an increase in the number of hidden layers may increase the score.
Table 2: Ten-Fold Cross-Validation Performance of ExOADT on the Classification Data Sets as in Table 1 for Different Depths from Depth = 3 to Depth = 7.

Pima:
  C4.5: 74.09 (bagging 74.87, AdaBoost 71.35)
  k-NN: 70.44 (bagging 73.44, AdaBoost 70.44)
  Naive Bayes: 75.78 (bagging 75.91, AdaBoost 76.56)
  SVM (gaussian kernel): 66.93 (size = 1), 73.30 (size = 5)
  HME (heights 3-8): 71.09, 69.92, 70.03, 68.60, 68.75, 68.98
  ExOADT (depths 3-7): 73.85, 77.49, 76.97, 75.80, 77.10

Wdbc:
  C4.5: 92.44 (bagging 94.73, AdaBoost 95.96)
  k-NN: 96.13 (bagging 96.49, AdaBoost 96.13)
  Naive Bayes: 93.15 (bagging 93.15, AdaBoost 95.96)
  SVM (gaussian kernel): 96.48 (size = 1), 95.80 (size = 5)
  HME (heights 3-8): 92.80, 91.05, 93.86, 93.50, 93.50, 92.97
  ExOADT (depths 3-7): 96.65, 97.18, 97.18, 97.53, 97

Wpbc:
  C4.5: 76.77 (bagging 80.81, AdaBoost 75.76)
  k-NN: 72.22 (bagging 71.72, AdaBoost 72.22)
  Naive Bayes: 66.67 (bagging 66.67, AdaBoost 73.74)
  SVM (gaussian kernel): 76.70 (size = 1), 72.79 (size = 5)
  HME (heights 3-8): 68.67, 74.78, 68.22, 70.17, 72.61, 66.72
  ExOADT (depths 3-7): 77.82, 78.26, 79.26, 80.76, 81.82

Bupa:
  C4.5: 66.38 (bagging 72.46, AdaBoost 67.54)
  k-NN: 64.35 (bagging 61.45, AdaBoost 64.35)
  Naive Bayes: 55.94 (bagging 55.36, AdaBoost 65.51)
  SVM (gaussian kernel): 60.62 (size = 1), 67.84 (size = 5)
  HME (heights 3-8): 65.29, 66.19, 62.24, 62.19, 63.48, 63.90
  ExOADT (depths 3-7): 64.62, 66.76, 63.52, 62.31, 62.36

Iris:
  C4.5: 95.33 (bagging 94.67, AdaBoost 94)
  k-NN: 96 (bagging 96, AdaBoost 96)
  Naive Bayes: 96 (bagging 96, AdaBoost 94.67)
  SVM (gaussian kernel): 91.33 (size = 1), 90.67 (size = 5)
  HME (heights 3-8): 83.33, 81.33, 86.67, 84, 84.67, 87.33
  ExOADT (depths 3-7): 95.33, 96, 96, 96.67, 96.67

Ecoli:
  C4.5: 84.66 (bagging 83.44, AdaBoost 85.58)
  k-NN: 82.82 (bagging 83.13, AdaBoost 82.82)
  Naive Bayes: 88.04 (bagging 87.42, AdaBoost 88.04)
  SVM (gaussian kernel): 74.18 (size = 1), 80.95 (size = 5)
  HME (heights 3-8): not reported
  ExOADT (depths 3-7): 78.51, 81.29, 76.01, 76.61, 79.97

Monks1:
  C4.5: 89.93 (bagging 100, AdaBoost 91.37)
  k-NN: 88.49 (bagging 87.23, AdaBoost 88.49)
  Naive Bayes: 66.55 (bagging 66.19, AdaBoost 65.83)
  SVM (gaussian kernel): 86.97 (size = 1), 98.52 (size = 5)
  HME (heights 3-8): 96.96, 95.54, 98.39, 98.20, 98.39, 98
  ExOADT (depths 3-7): 90.11, 96.76, 94.77, 98.36, 98.03

Monks2:
  C4.5: 92.85 (bagging 98.67, AdaBoost 99.67)
  k-NN: 89.52 (bagging 89.52, AdaBoost 89.52)
  Naive Bayes: 65.39 (bagging 65.06, AdaBoost 61.40)
  SVM (gaussian kernel): 94.86 (size = 1), 95.67 (size = 5)
  HME (heights 3-8): 76.20, 76.36, 77.05, 80.20, 81.02, 80.54
  ExOADT (depths 3-7): 72.26, 78.05, 80.07, 79.10, 82.72

Monks3:
  C4.5: 98.92 (bagging 98.92, AdaBoost 97.83)
  k-NN: 78.70 (bagging 83.75, AdaBoost 78.70)
  Naive Bayes: 91.52 (bagging 91.52, AdaBoost 94.95)
  SVM (gaussian kernel): 95.66 (size = 1), 93.68 (size = 5)
  HME (heights 3-8): 91.52, 92.62, 92.40, 91.70, 92.07, 92.01
  ExOADT (depths 3-7): 95.83, 96.19, 96.73, 97.82, 96.57

Notes: We compare the performance with the classifiers C4.5, k-NN, naive Bayes, SVM, bagging, boosting, and the hierarchical mixture of experts. The C4.5 (J4.8), k-NN, naive Bayes, bagging, and boosting implementations are available in the WEKA software package (Garner, 1995; Witten & Frank, 2000). In WEKA, the value of k for the k-NN classifier is selected by the leave-one-out rule. We implemented SVM with gaussian kernels using the MATLAB code available in Canu et al. (2003) and Cristianini and Shawe-Taylor (2000), and the HME has been implemented using the toolbox (Murphy, 2001, 2003).
Online Adaptive Decision Trees 80
C4.5 kNN NB SVM Bg−C4.5 Bg−kNN Bg−NB AB−C4.5 AB−kNN AB−NB HME ExOADT
78 76 74 72
2077
100
96 94 92
70
90
68
88
66
C4.5 kNN NB SVM Bg−C4.5 Bg−kNN Bg−NB AB−C4.5 AB−kNN AB−NB HME ExOADT
98
86
PIMA
WDBC
A
B
85
C4.5 kNN NB SVM Bg−C4.5 Bg−kNN Bg−NB AB−C4.5 AB−kNN AB−NB HME ExOADT
80
75
70
75
C4.5 kNN NB SVM Bg−C4.5 Bg−kNN Bg−NB AB−C4.5 AB−kNN AB−NB HME ExOADT
70
65
60
55
65
50
60
WPBC
BUPA
C 98
D C4.5 kNN NB SVM Bg−C4.5 Bg−kNN Bg−NB AB−C4.5 AB−kNN AB−NB HME ExOADT
96 94 92 90 88
90
C4.5 kNN NB SVM Bg−C4.5 Bg−kNN Bg−NB AB−C4.5 AB−kNN AB−NB ExOADT
85
80 86 84 82
75
IRIS
E
ECOLI
F
Figure 2: Pictorial illustration of the performance of different classifiers. NB = naive Bayes classifier. Bg = bagging. AB = Adaboost ensemble classifiers. For HME and ExOADT, the best results are depicted.
(Panels G–I: results on the MONKS1, MONKS2, and MONKS3 data sets.)
Figure 2: Continued.
In general, HME performs better than MLP in most cases in terms of classification score, as also reported in Jordan and Jacobs (1993); we therefore do not report the classification scores obtained with MLP separately. We observe that for the Pima Indians Diabetes data set, ExOADT (of depth = 7) outperforms all classifiers, including bagging and AdaBoost. We observe a similar set of results for the breast cancer data (both the diagnostic and the prognostic data sets). In the case of liver disease, however, ExOADT performs poorly compared to bagging, although it still performs better than the hierarchical mixture of experts. For the Iris data set, ExOADT again performs better than its counterparts, including bagging and AdaBoost. In the case of the Ecoli data, however, ExOADT cannot perform as well as the others. One of the most interesting aspects is that for this task of protein site localization, naive Bayes performs best, even better than the bagged and boosted trees. Here we did not report the results of the hierarchical mixture of experts since it could not properly converge
using the EM algorithm. In the artificial domain of the Monks problems, we find that bagged and boosted trees perform very well, better than the others. Leaving out bagging and boosting, however, we observe that ExOADT performs better than the other classifiers, including SVM. Overall, ExOADT has the potential to be established as a best single classifier when compared with single decision trees, the hierarchical mixture of experts, and even SVM, although multiclassifier systems like bagging and boosting have the capacity to outperform ExOADT in many situations. It may be noted here that the classification and function approximation performance of a model also depends on the data set, as pointed out by Friedman (2001). Without theoretical support it is difficult to claim that ExOADT is the best single classifier (bagging and boosting combine the decisions of more than one classifier); however, we claim that on the data sets that we used, ExOADT performs best. 3.3 Results on Function Approximation. We now demonstrate the effectiveness of ExOADT in function approximation tasks. As mentioned before, we normalize the input data in the range [−1, 1]. Since the learning rate parameter is always multiplied by the output error e = (y − o), the learning step size can be large for a large predictor output. This in turn results in a larger step size for the changes in θ of the nonterminal nodes, and the network search can shoot out of the normalized input space, in which case the network may not converge. We therefore normalize the predictor variable also in the range [−1, 1], and the output produced by ExOADT is rescaled accordingly for comparison with the original predictor values. First, we demonstrate the behavior of ExOADT in approximating synthetic functions. We considered a one-dimensional function as shown in Figure 3.
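The [−1, 1] normalization and rescaling protocol described above can be sketched as follows; this is a minimal illustration with our own helper names, not code from the original implementation.

```python
# Map a variable to [-1, 1] via its observed min/max, and invert the map
# so that network outputs can be rescaled back to the original range.
def make_scaler(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # guard against constant variables

    def to_unit(v):          # original range -> [-1, 1]
        return 2.0 * (v - lo) / span - 1.0

    def from_unit(u):        # [-1, 1] -> original range
        return lo + (u + 1.0) * span / 2.0

    return to_unit, from_unit

y = [5.0, 10.0, 20.0]                      # illustrative target values
to_unit, from_unit = make_scaler(y)
scaled = [to_unit(v) for v in y]           # endpoints map to -1 and +1
restored = [from_unit(u) for u in scaled]  # recovers the original values
```

Applying `to_unit` to the targets before training and `from_unit` to the network output afterward reproduces the protocol described above.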
The function is difficult to approximate with a standard multilayer perceptron because it contains two humps of very different sizes; an MLP network often requires a very long training time to fine-tune the smaller hump. Figure 3A illustrates the results obtained by ExOADT for depths from depth = 3 to depth = 6 under the noiseless condition. Note that in the one-dimensional case every nonterminal node has w = 1, and according to the learning rules (see equation 2.17), w is not updated; only the value of θ is updated. We observe that initially the complexity of the predicted function increases with depth; however, beyond a certain depth the complexity does not increase further, and the network is able to produce a smooth function. This is in contrast to conventional multilayer neural networks, where, without any imposed regularization (Haykin, 1999), the network often overfits as the number of hidden nodes increases. We then added noise to the predictor variable by randomly shifting each value within a range [−ε, ε]. Figures 3B, 3C, and 3D illustrate the performance of
(Figure 3, panels A–D: y versus x, with curves for depth = 3, 4, 5, and 6.)
Figure 3: Approximation of a one-dimensional function using ExOADT for different depths from depth = 3 to depth = 6. (A) The noiseless situation. (B–D) The approximated curves obtained at noise levels ε = 1, 2, and 3, respectively.
the ExOADT for noise levels ε = 1, 2, and 3, respectively. We observe that even with a high degree of noise, the ExOADT is able to find a smooth function without overfitting, even when the depth is increased to l = 6. There is an embedded regularization in ExOADT, since the summation of the activations of the leaf nodes is always unity. We next demonstrate the behavior of ExOADT in approximating two-dimensional functions. We generated four gaussian humps on the four corners of a two-dimensional plane, as shown in Figure 4. We illustrate the function approximated by ExOADT for different depths and different numbers of training samples. We observe behavior similar to the one-dimensional case: initially, as the depth increases, the complexity of the estimated function increases and it fits the training data more closely; beyond a certain depth, however, the function complexity does not increase, and as a result the model does not overfit the data (see Figures 4A, 4B, and 4C). We then experimented by introducing noise in the predictor variable. We added
Figure 4: Approximation of a two-dimensional function for different depths of ExOADT. (A–C) The function interpolated from only 300 data points by ExOADT of depth = 5, 6, and 7, respectively. (D–F) The same function interpolated from 300 noisy data points with noise level ε = 1 for depth = 5, 6, and 7, respectively. (G–I) The functions interpolated from 1000 noisy data points with noise level ε = 3 for depth = 5, 6, and 7, respectively.
noise in the same way as in the one-dimensional case: we randomly shifted the predictor value in the range [−ε, ε]. Figures 4D, 4E, and 4F illustrate the results obtained on noisy data by ExOADT of depths 5 to 7 for 300 sample points with ε = 1. We then experimented with ε = 3 and observed that ExOADT fails to interpolate the function within 200 epochs. We then considered 1000 randomly generated noisy points with noise level ε = 3 and approximated the functions with depths from 5 to 7. Figures 4G, 4H, and 4I illustrate the respective results. We observe that in the highly noisy situation, an ExOADT of depth = 5 fails to interpolate the function properly, although it is able to do so with depths 6 and 7.
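The synthetic-data protocol above can be sketched as follows. The positions, height, and width of the four gaussian humps are illustrative assumptions (the paper does not list them), and noise is a uniform random shift of the target value in [−ε, ε]:

```python
import math
import random

# Four gaussian humps at the corners of [-1, 1] x [-1, 1]; the height and
# width values are assumptions for illustration only.
def humps(x1, x2, height=1.0, width=0.5):
    centers = ((-1, -1), (-1, 1), (1, -1), (1, 1))
    return sum(height * math.exp(-((x1 - c1) ** 2 + (x2 - c2) ** 2)
                                 / (2 * width ** 2))
               for c1, c2 in centers)

def noisy_sample(n, eps, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        y = humps(x1, x2) + rng.uniform(-eps, eps)  # shift in [-eps, eps]
        data.append((x1, x2, y))
    return data

train = noisy_sample(300, eps=1.0)  # 300 noisy points at noise level 1
```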
Table 3: Description of the Function Approximation Data Sets Obtained from the Bilkent University Data Repository (Guvenir & Uysal, 2000).
Data Set          Instances   Variables   Max. Predicted Value   Min. Predicted Value
Baseball (BB)        337          16            6100                    109
Housing (HH)         506          13              50                      5
Kinematics (KI)     8192           8               1.458521               0.040165
Elevator (EV)       8752          18               0.078                  0.012
Ailerons (AL)       7154          40              −0.0003                −0.0035
Friedman (FR)     40,768          10              30.522                 −1.228
3.4 Function Approximation Results on Real-Life Data. We now demonstrate the effectiveness of ExOADT in approximating functions in real-life data sets. We collected the data sets for function approximation from the publicly available Bilkent University data repository (Guvenir & Uysal, 2000). We chose six data sets, mostly on the grounds that the data should not contain any categorical variable and there should be no missing values. Table 3 summarizes the data sets in terms of the number of variables and the number of samples. We provide a brief description as in Guvenir and Uysal (2000). The Baseball data set contains the performance of players in terms of batting average, number of runs, number of hits, number of doubles, and other statistics, together with their salaries in thousands of dollars. The data were collected in 1992 (the original source description is provided in Guvenir & Uysal, 2000); the task is to find a correspondence between the performance of the players and their salaries. The Ailerons data set addresses a control problem: flying an F16 aircraft. The attributes describe the status of the aircraft, and the goal is to predict the control action on the ailerons. The Elevator data set is similar to Ailerons and was also obtained from the task of controlling an F16 aircraft; however, its attributes differ from the Ailerons domain, and the goal (predictor) variable is related to an action taken on the elevators of the aircraft (Guvenir & Uysal, 2000). The Boston Housing data originally appeared in Harrison and Rubinfeld (1978) and are available in the StatLib library maintained by Carnegie Mellon University and UCI (Blake & Merz, 1998; Guvenir & Uysal, 2000). The task is to predict housing prices in the Boston area from such factors as pollution, accessibility, crime rate, and structure. The Kinematics data set is concerned with the forward kinematics of an eight-link robot arm.
The Friedman data set is an artificial data set originally used by Friedman (1991) in approximating functions with MARS (multivariate adaptive regression splines). We divided the data randomly in different proportions. First, we considered 25% of the data for training and the remaining 75% for testing. We then considered
50% for both training and testing, and finally considered 75% for training and 25% for testing. We report the results in terms of the absolute difference of the actual predictor value and the output of the network for both the training and test sets:

Absolute error = E_x |y(x) − o(x)|,   (3.2)

where E stands for the expected value. We also report the normalized error measure as provided by Friedman (2001), which is given as

Normalized error = E_x |y(x) − o(x)| / E_x |y(x) − median_x y(x)|.   (3.3)
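Equations 3.2 and 3.3 translate directly into code, with sample averages standing in for the expectations:

```python
from statistics import median

# Equation 3.2: mean absolute difference between targets y and outputs o.
def absolute_error(y, o):
    return sum(abs(yi - oi) for yi, oi in zip(y, o)) / len(y)

# Equation 3.3: absolute error normalized by the error of the best
# constant predictor (the median of y).
def normalized_error(y, o):
    m = median(y)
    denom = sum(abs(yi - m) for yi in y) / len(y)
    return absolute_error(y, o) / denom

y = [1.0, 2.0, 4.0]   # illustrative targets
o = [1.5, 2.0, 3.0]   # illustrative network outputs
```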
The normalized error is the average absolute error normalized by the optimum constant error (Friedman, 2001). We report the average absolute error and the normalized error for both the training data set and the test data set for different sizes of the training data in Table 4. We compare our results with the performance of support vector regression (SVR) (Vapnik et al., 1997; Smola & Schölkopf, 1998b; Gunn, 1998). We approximated the functions in SVR using the ε-insensitive loss function (Vapnik et al., 1997) with gaussian kernels. We implemented the algorithm in MATLAB using the software available in Canu et al. (2003), which is also referred to in Cristianini and Shawe-Taylor (2000). We also pictorially depict the performance of both ExOADT and SVR in Figure 5. We observed that the performance of SVR depends on the choice of ε and the gaussian kernel size. For the normalized data set, we chose ε = 0.01 and report the results for different sizes of the gaussian kernels. The results show that SVR is much better than ExOADT in terms of the training error. If we compare the test errors, however, they are more or less comparable, although SVR has an edge over ExOADT. We report here the best results obtained with SVR by fine-tuning its parameters. We report the results for ExOADT with the learning rate set to 0.1 and δ = 1 in all cases; we did not investigate whether fine-tuning the parameters of ExOADT would improve the results. We did not report the results of SVR for the Kinematics data set with 75% training data or for the other larger data sets, since we could not obtain results for these data sets using SVR due to the memory limitation. Note that due to its memory requirements, SVR is bottlenecked in terms of scalability, which is not true for ExOADT, since it operates in the online mode without needing to store the data points.
We demonstrate this by reporting the results of ExOADT for certain larger data sets: Elevator, Ailerons, and the Friedman (as used in MARS) data sets.
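For reference, the ε-insensitive loss used by the SVR baseline above (Vapnik et al., 1997) charges nothing for errors inside the ε-tube and linearly beyond it; a one-function sketch:

```python
# Epsilon-insensitive loss: zero inside the tube, linear outside.
def eps_insensitive(y, o, eps=0.01):
    return max(0.0, abs(y - o) - eps)
```

With ε = 0.01 (the value used for the normalized data sets above), an output within 0.01 of the target incurs no loss.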
Table 4: Performance of ExOADT on the Function Approximation Data Sets as in Table 3, for Different Depths from 3 to 7.

(Rows give the absolute and Friedman (normalized) errors on the training and test sets; columns give ExOADT depths 3/4/5/6/7 and, where available, SVR gaussian-kernel sizes 0.5/1/2/3/4/5.)

Baseball (BB), 25% training:
  ExOADT (depth = 3/4/5/6/7)
    Absolute, Train: 158.07 / 131.64 / 119.42 / 121.02 / 119.65
    Absolute, Test:  661.98 / 595.15 / 601.37 / 584.85 / 553.59
    Friedman, Train: 0.1656 / 0.1379 / 0.1251 / 0.1268 / 0.1254
    Friedman, Test:  0.6936 / 0.6236 / 0.6301 / 0.6128 / 0.5800
  SVR (kernel size = 0.5/1/2/3/4/5)
    Absolute, Train: 56.7471 / 57.0302 / 68.4514 / 119.8526 / 172.1374 / 209.5573
    Absolute, Test:  684.1490 / 576.9421 / 593.5051 / 546.8188 / 519.8181 / 504.6622
    Friedman, Train: 0.05946 / 0.05976 / 0.07172 / 0.1256 / 0.1804 / 0.2196
    Friedman, Test:  0.7168 / 0.6045 / 0.6219 / 0.5730 / 0.5447 / 0.5288

Baseball (BB), 50% training:
  ExOADT
    Absolute, Train: 328.80 / 219.18 / 191.07 / 218.13 / 211.01
    Absolute, Test:  517.48 / 680.67 / 580.77 / 678.56 / 729.81
    Friedman, Train: 0.3445 / 0.2297 / 0.2002 / 0.2286 / 0.2211
    Friedman, Test:  0.5422 / 0.7132 / 0.6085 / 0.7110 / 0.7647
  SVR
    Absolute, Train: 56.5042 / 55.5422 / 129.6090 / 212.3789 / 262.3273 / 298.2583
    Absolute, Test:  647.7100 / 572.5080 / 601.8826 / 550.6553 / 514.1745 / 498.5944
    Friedman, Train: 0.0592 / 0.0582 / 0.1358 / 0.2225 / 0.2749 / 0.3125
    Friedman, Test:  0.6787 / 0.5999 / 0.6364 / 0.5770 / 0.5387 / 0.5224

Baseball (BB), 75% training:
  ExOADT
    Absolute, Train: 308.77 / 276.57 / 253.88 / 259.19 / 291.37
    Absolute, Test:  503.87 / 555.76 / 513.63 / 517.54 / 577.67
    Friedman, Train: 0.3235 / 0.2898 / 0.2660 / 0.2716 / 0.3053
    Friedman, Test:  0.5280 / 0.5823 / 0.5382 / 0.5423 / 0.6053
  SVR
    Absolute, Train: 57.3366 / 59.1674 / 151.4899 / 229.6618 / 289.5370 / 329.1706
    Absolute, Test:  545.4307 / 567.1411 / 532.5809 / 516.1467 / 475.0909 / 448.6760
    Friedman, Train: 0.0601 / 0.0620 / 0.1587 / 0.2406 / 0.3034 / 0.3449
    Friedman, Test:  0.5715 / 0.5942 / 0.5580 / 0.5408 / 0.4978 / 0.4701

Housing (HH), 25% training:
  ExOADT
    Absolute, Train: 1.747 / 1.537 / 1.321 / 1.759 / 1.319
    Absolute, Test:  2.706 / 2.580 / 2.694 / 2.916 / 3.034
    Friedman, Train: 0.2676 / 0.2353 / 0.2022 / 0.2693 / 0.2020
    Friedman, Test:  0.4144 / 0.3951 / 0.4125 / 0.4466 / 0.4646
  SVR
    Absolute, Train: 0.4231 / 0.4713 / 0.9389 / 1.3615 / 1.7887 / 2.0910
    Absolute, Test:  3.5631 / 3.0910 / 2.5980 / 2.5112 / 2.4927 / 2.7686
    Friedman, Train: 0.0648 / 0.0722 / 0.1438 / 0.2085 / 0.2739 / 0.3202
    Friedman, Test:  0.5456 / 0.4733 / 0.3978 / 0.3845 / 0.3817 / 0.4239

Housing (HH), 50% training:
  ExOADT
    Absolute, Train: 2.094 / 1.909 / 1.716 / 1.524 / 1.599
    Absolute, Test:  2.623 / 2.521 / 2.261 / 2.536 / 2.257
    Friedman, Train: 0.3206 / 0.2924 / 0.2628 / 0.2334 / 0.2450
    Friedman, Test:  0.4017 / 0.3861 / 0.3462 / 0.3883 / 0.3456
  SVR
    Absolute, Train: 0.4244 / 0.6415 / 1.2046 / 1.6172 / 1.8200 / 2.0522
    Absolute, Test:  2.8217 / 2.4236 / 2.1322 / 2.2677 / 2.2789 / 2.4503
    Friedman, Train: 0.0649 / 0.0982 / 0.1845 / 0.2476 / 0.2787 / 0.3142
    Friedman, Test:  0.4321 / 0.3711 / 0.3265 / 0.3472 / 0.3489 / 0.3752

Housing (HH), 75% training:
  ExOADT
    Absolute, Train: 1.788 / 1.698 / 1.570 / 1.476 / 1.491
    Absolute, Test:  2.281 / 2.049 / 2.162 / 2.422 / 2.367
    Friedman, Train: 0.2737 / 0.2601 / 0.2405 / 0.2260 / 0.2283
    Friedman, Test:  0.3494 / 0.3137 / 0.3310 / 0.3708 / 0.3624
  SVR
    Absolute, Train: 0.4179 / 0.6348 / 1.3013 / 1.6397 / 1.8293 / 1.9801
    Absolute, Test:  2.7586 / 2.2838 / 2.0105 / 2.1420 / 2.2519 / 2.3860
    Friedman, Train: 0.0640 / 0.0972 / 0.1993 / 0.2511 / 0.2801 / 0.3032
    Friedman, Test:  0.4224 / 0.3497 / 0.3078 / 0.3280 / 0.3448 / 0.3653

Kinematics (KI), 25% training:
  ExOADT
    Absolute, Train: 0.0978 / 0.0740 / 0.0592 / 0.0474 / 0.0433
    Absolute, Test:  0.1007 / 0.0782 / 0.0684 / 0.0677 / 0.0690
    Friedman, Train: 0.4539 / 0.3435 / 0.2747 / 0.2198 / 0.2010
    Friedman, Test:  0.4675 / 0.3630 / 0.3174 / 0.3141 / 0.3203
  SVR
    Absolute, Train: 0.0134 / 0.0143 / 0.0537 / 0.0843 / 0.1017 / 0.1136
    Absolute, Test:  0.0914 / 0.0769 / 0.0750 / 0.0982 / 0.1129 / 0.1229
    Friedman, Train: 0.0620 / 0.0664 / 0.2491 / 0.3910 / 0.4721 / 0.5271
    Friedman, Test:  0.4243 / 0.3571 / 0.3482 / 0.4558 / 0.5238 / 0.5703

Kinematics (KI), 50% training:
  ExOADT
    Absolute, Train: 0.1008 / 0.0776 / 0.0604 / 0.0536 / 0.0488
    Absolute, Test:  0.1052 / 0.0836 / 0.0658 / 0.0619 / 0.0614
    Friedman, Train: 0.4679 / 0.3602 / 0.2803 / 0.2491 / 0.2268
    Friedman, Test:  0.4883 / 0.3883 / 0.3055 / 0.2875 / 0.2849
  SVR: — (not obtained)

Kinematics (KI), 75% training:
  ExOADT
    Absolute, Train: 0.0927 / 0.0752 / 0.0631 / 0.0535 / 0.0528
    Absolute, Test:  0.0922 / 0.0763 / 0.0649 / 0.0587 / 0.0590
    Friedman, Train: 0.4304 / 0.3488 / 0.2927 / 0.2485 / 0.2451
    Friedman, Test:  0.4279 / 0.3540 / 0.3014 / 0.2723 / 0.2739

Elevator (EV), 25% training:
  ExOADT
    Absolute, Train: 0.0017 / 0.0017 / 0.0016 / 0.0015 / 0.0015
    Absolute, Test:  0.0017 / 0.0018 / 0.0017 / 0.0017 / 0.0016
    Friedman, Train: 0.4135 / 0.4067 / 0.3799 / 0.3632 / 0.3557
    Friedman, Test:  0.4266 / 0.4278 / 0.4072 / 0.4029 / 0.4004

Elevator (EV), 50% training:
  ExOADT
    Absolute, Train: 0.0016 / 0.0015 / 0.0014 / 0.0014 / 0.0014
    Absolute, Test:  0.0017 / 0.0016 / 0.0015 / 0.0015 / 0.0015
    Friedman, Train: 0.4009 / 0.3632 / 0.3477 / 0.3431 / 0.3409
    Friedman, Test:  0.4174 / 0.3774 / 0.3687 / 0.3666 / 0.3709

Elevator (EV), 75% training:
  ExOADT
    Absolute, Train: 0.0019 / 0.0018 / 0.0017 / 0.0017 / 0.0015
    Absolute, Test:  0.0020 / 0.0018 / 0.0017 / 0.0017 / 0.0016
    Friedman, Train: 0.4788 / 0.4375 / 0.4087 / 0.4062 / 0.3726
    Friedman, Test:  0.4871 / 0.4433 / 0.4217 / 0.4183 / 0.3888

Ailerons (AL), 25% training:
  ExOADT
    Absolute, Train: 0.000117 / 0.000116 / 0.000118 / 0.000115 / 0.000115
    Absolute, Test:  0.000123 / 0.000129 / 0.000128 / 0.000130 / 0.000131
    Friedman, Train: 0.3861 / 0.3828 / 0.3894 / 0.3795 / 0.3795
    Friedman, Test:  0.4059 / 0.4257 / 0.4224 / 0.4290 / 0.4323

Ailerons (AL), 50% training:
  ExOADT
    Absolute, Train: 0.000121 / 0.000117 / 0.000116 / 0.000115 / 0.000116
    Absolute, Test:  0.000125 / 0.000124 / 0.000126 / 0.000124 / 0.000125
    Friedman, Train: 0.3993 / 0.3861 / 0.3828 / 0.3795 / 0.3828
    Friedman, Test:  0.4125 / 0.4092 / 0.4158 / 0.4092 / 0.4125

Ailerons (AL), 75% training:
  ExOADT
    Absolute, Train: 0.000124 / 0.000123 / 0.000117 / 0.000120 / 0.000112
    Absolute, Test:  0.000128 / 0.000127 / 0.000124 / 0.000130 / 0.000123
    Friedman, Train: 0.4092 / 0.4059 / 0.3861 / 0.3960 / 0.3696
    Friedman, Test:  0.4224 / 0.4191 / 0.4092 / 0.4290 / 0.4059

Friedman (FR), 25% training:
  ExOADT
    Absolute, Train: 1.1516 / 0.9132 / 0.8544 / 0.8022 / 0.7826
    Absolute, Test:  1.1546 / 0.9267 / 0.8862 / 0.8653 / 0.8718
    Friedman, Train: 0.2833 / 0.2247 / 0.2102 / 0.1974 / 0.1925
    Friedman, Test:  0.2841 / 0.2280 / 0.2180 / 0.2129 / 0.2145

Friedman (FR), 50% training:
  ExOADT
    Absolute, Train: 1.0698 / 0.8817 / 0.8428 / 0.8105 / 0.8119
    Absolute, Test:  1.0737 / 0.8938 / 0.8586 / 0.8419 / 0.8542
    Friedman, Train: 0.2632 / 0.2169 / 0.2074 / 0.1994 / 0.1997
    Friedman, Test:  0.2641 / 0.2199 / 0.2112 / 0.2071 / 0.2102

Friedman (FR), 75% training:
  ExOADT
    Absolute, Train: 1.0510 / 0.9132 / 0.8653 / 0.8195 / 0.7998
    Absolute, Test:  1.0548 / 0.9131 / 0.8680 / 0.8367 / 0.8308
    Friedman, Train: 0.2586 / 0.2247 / 0.2129 / 0.2016 / 0.1968
    Friedman, Test:  0.2595 / 0.2246 / 0.2135 / 0.2059 / 0.2043
Notes: The table also demonstrates a comparison with the SVR for certain data sets. We implemented the SVR using the toolbox as available in Canu et al. (2003).
Figure 5: Pictorial illustration of the performance of ExOADT and SVR for function approximation. We provide the results for different depths of ExOADT and different kernel sizes of SVR. (A–F) The results for the Baseball (BB), Housing (HH), Kinematics (KI), Elevator (EV), Ailerons (AI), and Friedman (MARS) data sets, respectively. Each figure illustrates the performance on the training and test data separately, with training sample sizes of 25%, 50%, and 75%, respectively. For the Elevator (EV), Ailerons (AI), and Friedman (MARS) data sets (panels D–F), we have results only for ExOADT; results for SVR could not be obtained due to memory constraints.
4 Discussion 4.1 Relationship with Decision Trees. The proposed model, ExOADT, is essentially an extension of OADT (Basak, 2004) to multiclass (more than two classes) classification and function approximation tasks. We train ExOADT using a greedy gradient descent rule in the online mode. Since we do not freeze the learning, ExOADT can adapt to changing patterns. One interesting observation is that an ExOADT of depth l can embed any decision tree of depth d with 1 ≤ d ≤ l. Consider the different structures of binary decision trees of depth at most l that can occur for a k-class problem, where every nonterminal node is described by a condition of the form wᵀx + θ ≤ 0: if the condition is satisfied, the pattern is allocated to the left child; otherwise, it is allocated to the right child. Evidently, the axis-parallel decision trees are also embedded in this structure. If we restrict the values βij ∈ {0, 1} for all i and j, then we can realize any binary decision tree of depth at most l for a k-class problem. For a two-class classification problem, we illustrate a few examples in Figure 6. We consider only those decision trees that do not represent any null condition in the decision space, that is, each nonterminal node has exactly two children; if a nonterminal node had only one child in either the left or right path, the other path would represent a null space, which is an "unknown" decision. Evidently, for a k-class problem there are k^N possible decision trees, where N = 2^l is the number of leaf nodes of a tree of depth l. However, this count includes the possibilities where not all k classes are represented. If we consider the number of decision trees that represent exactly k classes, the problem is different: it is equivalent to finding the number of ways N balls can be placed in k holes so that each hole gets at least one ball.
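This ball-placement count can be evaluated by inclusion–exclusion and checked against brute-force enumeration; a small sketch:

```python
from itertools import product
from math import comb

# Number of ways to place N labeled balls into k holes with no hole empty
# (i.e., the number of surjections from N items onto k bins).
def T(k, N):
    return sum((-1) ** i * comb(k, i) * (k - i) ** N for i in range(k + 1))

# Brute force: enumerate all k**N placements and keep those covering all holes.
def T_brute(k, N):
    return sum(1 for assign in product(range(k), repeat=N)
               if len(set(assign)) == k)
```

For example, T(2, 4) = 2^4 − 2 = 14, already fewer than the k^N = 16 unrestricted placements.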
This is equivalent to the problem in clustering of finding the number of possible ways a set of N points can be partitioned into k clusters so that each cluster contains at least one point. Let T(k, N) represent this number. Then

T(k, N) = Σ_{i=0}^{k} (−1)^i C(k, i) (k − i)^N,   (4.1)
which can be proved by induction. Evidently, T(k, N) ≪ k^N for large N and k > 1. In the case of ExOADT, even if we restrict βij ∈ {0, 1} for a k-class classification problem, we can generate 2^(kN) different structures, most of which are invalid or ambiguous. For example, these representations also include the possibility of multiple class labels being assigned to a single leaf node. In short, even if we restrict the values of β to {0, 1}, the valid search space is less than a fraction (k/2^k)^N of the entire number of possibilities. The fraction (k/2^k)^N is much smaller than unity for a
(Figure 6 shows example connection matrices β, with one row per leaf node taken from {10, 01, 00, 11}, assigning each leaf to Class 1, Class 2, neither, or both.)
Figure 6: Realization of different decision tree structures for different connection matrices β.
large value of N, the number of leaf nodes. Thus, for a continuous space of β, we have a very narrow search space for obtaining the desired solution, and therefore the steepest gradient descent algorithm works. Unlike the existing OADT (Basak, 2004), ExOADT provides the flexibility of embedding different decision tree structures in a single framework. In OADT, the structure is fixed for a given depth, and we are allowed to change only the local hyperplanes at each nonterminal node. In ExOADT, we can not only change the local hyperplanes but also, in effect, smoothly change the structure of the tree. 4.2 Relationship with RBF. In radial basis function networks, the joint distribution P(x, y) is normally modeled as a sum of kernel densities represented by the hidden nodes (Moody & Darken, 1989). The output is then usually obtained by regressing the predictor variable on the activations of the hidden nodes. In general, gaussian kernels are selected for the hidden nodes. In the case of normalized RBF networks (Moody & Darken, 1989), the function is approximated as (see Girosi et al., 1995)

o(x) = Σ_{i=1}^{n} c_i G(x − t_i) / Σ_{i=1}^{n} G(x − t_i),   (4.2)
where G(·) is the kernel function and the t_i are the kernel centers or parameters. In the case of normalized RBFs, the regions spanned by the individual hidden nodes are no longer elliptical, and the sum of the activations of the hidden nodes is always normalized to unity. Girosi et al. (1995) noted that normalized RBFs have a tight connection with regularization theory. In the case of ExOADT, the sum of the leaf node activations is always unity (or some constant) provided that the root node activation is unity. The leaf nodes in ExOADT thus behave similarly to the hidden nodes of normalized RBFs. However, the space spanned by the leaf nodes is not governed by parametric kernel functions; rather, it is spanned by the decision hyperplanes of the nonterminal nodes on the path from the root to each leaf. The very fact that the leaf node activations sum to unity imposes a certain inherent regularization in ExOADT similar to that of normalized RBFs (Girosi et al., 1995). Instead of kernel functions, each leaf node's activation region, defined by the hyperplanes, is determined by the data distribution. 4.3 Relationship with MLP. In ExOADT, each nonterminal node has a hyperplane defined by (w, θ), where we constrain ‖w‖ = 1. If we relax this constraint and set w = 0 for all nonterminal nodes except those in the last-but-one layer (i.e., the parents of the leaf nodes), then ExOADT is structurally the same as a multilayer perceptron with one hidden layer. Since w = 0 for all previous layers, the conditioning of a path
from the root to the leaf becomes independent of x, and therefore the hyperplanes defined in the last layer interact with each other to generate complex decision boundaries as in an MLP (Haykin, 1999). Setting ‖w‖ = 1 for all previous layers instead constrains the network (ExOADT) to confine the decision space to certain regions along each path, and these regions, defined by the leaf nodes, do not interact with each other. This gives ExOADT greater efficiency than the multilayer perceptron in learning classification and function approximation tasks, which is reflected in the results. 4.4 Relationship with HME. The hierarchical mixture of experts is structurally similar to decision trees except that decision trees compute class labels by partitioning the decision space in a top-down manner, whereas HME computes class labels by agglomerating the decision space hierarchically upward. ExOADT follows the same concept as classical decision trees. In HME, however, the branching factor from child nodes toward parent nodes is decided by the number of classes, whereas in ExOADT the branching factor is always two; it always generates a binary tree. An analogy for the difference between HME and ExOADT can be drawn from the clustering (unsupervised learning) literature: in agglomerative clustering, smaller clusters are hierarchically agglomerated upward based on certain criteria (linkage algorithms), whereas in divisive clustering, the data set is iteratively partitioned downward based on certain criteria. 4.5 Other Issues. ExOADT inherently provides a certain regularization in the approximation of the class labels and the predictor variables, in the sense that the sum of activations of all leaf nodes is always unity. However, it is not clear how to select the depth of an ExOADT. One alternative is to create a growing ExOADT, where we first start with an ExOADT of depth one.
Once the network converges to a local optimum, we increase the depth. For example, let the connection matrix β be [β1, β2] for one class. We can add one more level to the tree and set the connection matrix to [β1, β1, β2, β2] for the same class. Independent of the parameters of the nonterminal nodes in the added level, this configuration is equivalent to the ExOADT of depth = 1, because the sum of the activations of two sibling leaf nodes is equal to that of their parent. Therefore, the increase in depth does not initially cause any change in performance. We then train the network again with the increased depth, depth = 2. After convergence, we again increase the depth by adding one more level to the tree and adjust the connection matrix β in the same way as above. Thus, we create an ExOADT by iteratively increasing its depth. However, we found that the growing ExOADT does not perform better than the normal ExOADT. One reason is that when we create the growing ExOADT, its construction is similar to that of decision trees in the sense that the soft partitioning is more affected by
the local decision. The data set is first linearly partitioned by an ExOADT of depth = 1 to obtain an optimum score. As the depth is increased, the network starts with these hyperplanes and then reiterates to obtain the next best solution. As this process continues, the next solution is always guided by the starting point of the previous solution. On the other hand, if we start with an ExOADT of a certain depth (say, depth = 7), the search for the hyperplanes is activated at each layer from a random initialization. As mentioned in the case of OADT, ExOADT exhibits better performance than the existing single classifiers on certain data sets, and it learns entirely in the online mode. Learning in the online mode relieves a classifier or function approximator of the memory storage required in the batch mode. Thus, ExOADT can be used effectively for larger data sets as well, trading memory requirements for the time required to learn the class labels. In Tables 5 and 6, we report the CPU time in seconds taken by the ExOADT algorithm for classification and function approximation, respectively, including the total time for initialization and learning over 200 epochs. Table 5 reports the CPU time for classification with 90% of the data used for training and 10% for testing; we chose 90% for training since 10-fold cross-validation (as we performed) always uses 90% of the data for training in each fold. Table 6 reports the CPU time for function approximation, where we used 75% of the data samples for training and 25% for testing. We can observe that the CPU time increases exponentially with the depth of the tree. We have noted that ExOADT converges well before 200 epochs, although we report all results for 200 epochs; the CPU time reported in Tables 5 and 6 can be reduced if the number of epochs is reduced.
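The depth-growing step described earlier can be sketched as follows: each leaf splits into two children whose activations sum to the parent's, and the corresponding entries of β are duplicated, so the network output is unchanged before retraining (the activation values here are illustrative):

```python
# Output of one output node: weighted sum of leaf activations.
def output(beta, activations):
    return sum(b * a for b, a in zip(beta, activations))

parent_act = [0.3, 0.7]             # leaf activations at the current depth
beta = [0.9, 0.2]                   # [beta1, beta2] for one class
child_act = [0.1, 0.2, 0.45, 0.25]  # sibling pairs sum to their parent
beta_grown = [beta[0], beta[0], beta[1], beta[1]]  # [beta1, beta1, beta2, beta2]
```

Since 0.1 + 0.2 = 0.3 and 0.45 + 0.25 = 0.7, the grown tree reproduces the original output exactly, as the text argues.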
It is interesting to observe in Table 6 that the Friedman data set took almost 30 hours for 200 epochs with 75% of the data for training. However, SVR could not handle even 10% of this data set for training due to memory constraints. We have not investigated pruning of ExOADT, as is often performed in classical decision trees. In ExOADT, as discussed in section 4 (see Figure 6), pruning can be performed by constraining the values of βij. However, simply constraining β by adding a regularization term λ‖β‖² (λ being a Lagrange multiplier) to the cost function may not serve the purpose, as that would create the null decision space. We can instead regularize the network by constraining β so that direct sibling leaf nodes approach the same β values for all class labels. A suitable regularization criterion in the space of β still needs to be formulated.

5 Conclusions and Future Scope

We reported a new model for pattern classification and function approximation. This model is similar to an earlier reported model, the online adaptive decision tree (OADT; Basak, 2004), extended with a regression layer; we call it ExOADT. OADT was designed only for two-class classification
2096
J. Basak
Table 5: CPU Time Taken by ExOADT of Different Depths for Different Data Sets of Classification.

                         CPU Time (seconds)
Data Set   Depth = 3   Depth = 4   Depth = 5   Depth = 6   Depth = 7
Pima             124         241         481        1029        2404
Wpbc              43          87         179         386         919
Wdbc             121         243         494        1060        2551
Bupa              48          93         188         406        1060
Iris              21          40          80         174         456
Monks1            78         151         302         655        1713
Monks2            84         163         327         712        1872
Monks3            79         152         305         660        1715
Ecoli             47          91         181         393        1020

Notes: The CPU time is for 200 epochs, including the time for initialization and learning. A 90% data set has been used for training, and 10% has been used for testing.
Table 6: CPU Time Taken by ExOADT of Different Depths for Different Data Sets of Function Approximation.

                               CPU Time (seconds)
Data Set          Depth = 3   Depth = 4   Depth = 5   Depth = 6   Depth = 7
Baseball (BB)            45          87         184         397         998
Housing (HH)             64         129         264         591        1482
Kinematics (KI)         923        1872        3782        8258      21,387
Elevator (EV)          1200        2419        4938      10,618      26,585
Ailerons (AI)          2127        5000        6477      13,618      32,306
Friedman (MARS)        4722        9449      19,350      45,243     109,138

Notes: The CPU time is for 200 epochs, including the time for initialization and learning. Seventy-five percent of the data set has been used for training and 25% for testing.
tasks. The current model can handle multiclass classification problems and effectively approximate continuous functions. We observe that ExOADT exhibits robust behavior in approximating the functions in the presence of a high amount of noise. We also observe that on certain data sets, ExOADT performs better than all single classifiers such as SVM; however, it falls short of the multiclassifier systems such as bagging and boosting in certain cases. ExOADT is able to learn completely in the online mode, and since we do not freeze the learning, it can adapt to a changing situation. In other words, if the input-output relationship of the data set changes—even if we have a dynamical relationship between the input variables and the output observation—ExOADT can capture the altered mapping between the input
and output, provided the rate of change is much slower than the rate of learning. This adaptive behavior is possible since we need not freeze the learning of ExOADT. Note that this kind of adaptivity is not possible in classical decision trees and ensemble classifiers. ExOADT does not require any user-defined parameter. Its only free parameter is the depth, and we demonstrated that beyond a certain depth, the performance of ExOADT does not vary to a great extent. One of the most interesting aspects of ExOADT is that its tree structure and depth can be smoothly changed using the different weights in the output regression layer. In OADT, we have seen that only the local hyperplanes in the nonterminal nodes can change for a given depth of the tree, and the allocation of class labels to the leaf nodes is also fixed. ExOADT not only adds the flexibility of assigning class labels differentially to the leaf nodes but also adds a set of free variables that softly adapt the structure of the tree along with the local hyperplanes in the nonterminal nodes. We briefly discussed the relationship of ExOADT to other classification models such as the decision tree, RBF network, and multilayer perceptron. We have used only steepest gradient descent learning; the performance can perhaps be enhanced by using smarter learning algorithms. Moreover, the performance of ExOADT can possibly be improved by imposing a suitable regularization on the output regression layer.

Appendix: Derivation of Learning Rate

We choose the optimal learning rate η_opt such that E(x) + ΔE(x) → 0 for any new pattern x according to the first-order approximation (line search), that is, ΔE → −E. We can approximate ΔE (for a small change in E) as

\Delta E = \sum_i \left( \frac{\partial E}{\partial w_i} \right)^T \Delta w_i + \sum_i \frac{\partial E}{\partial \theta_i}\, \Delta \theta_i + \sum_k \left( \frac{\partial E}{\partial \beta_k} \right)^T \Delta \beta_k. \qquad (A.1)
Evaluating the partials from equations 2.12 to 2.19, we get

\Delta E(x) = -\eta \left[ \frac{m_2^2}{z^2} \sum_{k \in O} e_k^2\, o_k^2 (1 - o_k)^2 + m_1^2 \sum_i q_i^2 \left( \lambda + \|x\|^2 - (w_i^T x)^2 \right) \right]. \qquad (A.2)

Since we choose η = η_opt such that ΔE(x) → −E(x), we have

\eta_{\mathrm{opt}} = \frac{E}{\dfrac{m_2^2}{z^2} \sum_{k \in O} e_k^2\, o_k^2 (1 - o_k)^2 + m_1^2 \sum_i q_i^2 \left( \lambda + \|x\|^2 - (w_i^T x)^2 \right)}. \qquad (A.3)
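The line-search rule ΔE → −E behind equation A.3 can be illustrated on a generic differentiable error. The sketch below uses a toy quadratic of my own choosing, not the ExOADT error: under the first-order model ΔE ≈ −η‖∇E‖², setting η = E/‖∇E‖² drives E + ΔE toward zero.

```python
import numpy as np

def error(theta):
    # Toy differentiable error, E(theta) = 0.5 * ||theta||^2 (not ExOADT's E).
    return 0.5 * float(theta @ theta)

def grad(theta):
    return theta  # gradient of the toy error

theta = np.array([3.0, -1.0, 2.0])
E = error(theta)            # 7.0
g = grad(theta)

# First-order model of a gradient step: Delta E ~= -eta * ||g||^2.
# Choosing eta so that Delta E = -E, i.e. E + Delta E -> 0:
eta_opt = E / float(g @ g)  # here 7 / 14 = 0.5

theta_new = theta - eta_opt * g
# Because the first-order expansion ignores curvature, the error drops
# from 7.0 to 1.75 rather than exactly to zero.
print(eta_opt, error(theta_new))
```

In practice such a step is repeated, so the residual error left by the ignored curvature shrinks over successive patterns.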
References Albers, S. (1996). Competitive online algorithms (Tech. Rep. No. BRICS Lecture Series LS-96-2). University of Aarhus, BRICS, Department of Computer Science. Amari, S. (1967). Theory of adaptive pattern classifiers. IEEE Trans. Electronic Computers, EC-16, 299–307. Amari, S. I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276. Basak, J. (2004). Online adaptive decision trees. Neural Computation, 16, 1959–1981. Bennett, K. P., Wu, D., & Auslender, L. (1998). On support vector decision trees for database marketing (Tech. Rep. No. RPI Math Report 98-100). Troy, NY: Department of Mathematical Sciences, Rensselaer Polytechnic Institute. Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Available online at http://www.ics.uci.edu/∼mlearn/MLRepository.html. Boz, O. (2000). Converting a trained neural network to a decision tree: DecText—decision tree extractor. Unpublished doctoral dissertation, Lehigh University. Available online at citeseer.ist.psu.edu/boz00converting.html. Breiman, L. (1996). Bagging predictors. Machine Learning, 26, 123–140. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1983). Classification and regression trees. New York: Chapman & Hall. Brodley, C. E., & Utgoff, P. E. (1995). Multivariate decision trees. Machine Learning, 19, 45–77. Broomhead, D. S., & Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2, 321–355. Buhmann, M. D. (1990). Multivariate cardinal interpolation with radial basis functions. Constructive Approximation, 6, 225–255. Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121–167. Canu, S., Grandvalet, Y., & Rakotomamonjy, A. (2003). SVM and kernel methods Matlab toolbox. Rouen, France: Perception Systèmes et Information, INSA de Rouen. Available online at http://asi.insa-rouen.fr/%7Earakotom/toolbox/. Chien, J., Huang, C., & Chen, S. (2002).
Compact decision trees with cluster validity for speech recognition. In IEEE Int. Conf. Acoustics, Speech, and Signal Processing (pp. 873–876). Piscataway, NJ: IEEE Press. Cho, Y. H., Kim, J. K., & Kim, S. H. (2002). A personalized recommender system based on web usage mining and decision tree induction. Expert Systems with Applications, 23, 329–342. Cristianini, N., & Shawe-Taylor, J. (2000). Support vector machines. Available online at http://www.support-vector.net/software.html. Duda, R., Hart, P., & Stork, D. (2001). Pattern classification (2nd ed.). New York: Wiley. Durkin, J. (1992). Induction via ID3. AI Expert, 7, 48–53. Fayyad, U. M., & Irani, K. B. (1992). On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8, 87–102. Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19, 1–141. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189–1232.
Friedman, J. H., Hastie, T., & Tibshirani, R. (1998). Additive logistic regression: A statistical view of boosting (Tech. Rep.). Stanford, CA: Stanford University, Department of Statistics. Friedman, J. H., Kohavi, R., & Yun, Y. (1996). Lazy decision trees. In H. Shrobe & T. Senator (Eds.), Proceedings of the Thirteenth National Conference on Artificial Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conference (pp. 717–724). Menlo Park, CA: AAAI Press. Garner, S. (1995). Weka: The Waikato environment for knowledge analysis. In Proc. of the New Zealand Computer Science Research Students Conference (pp. 57–64). Available online at citeseer.nj.nec.com/garner95weka.html. Geman, D., & Jedynak, B. (2001). Model-based classification trees. IEEE Trans. Information Theory, 47, 1075–1082. Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural network architectures. Neural Computation, 7, 219–269. Golea, M., & Marchand, M. (1990). A growth algorithm for neural network decision trees. Europhysics Letters, 12, 105–110. Grimson, W. E. L. (1982). A computational theory of visual surface interpolation. Proc. of the Royal Society of London B, 298, 395–427. Gunn, S. R. (1998). Support vector machines for classification and regression (Tech. Rep. No. http://www.ecs.soton.ac.uk/∼srg/publications/pdf/SVM.pdf). Southampton: University of Southampton, Faculty of Engineering, Science and Mathematics, School of Electronics and Computer Science. Guvenir, H. A., & Uysal, I. (2000). Bilkent University function approximation repository. Available online at http://funapp.cs.bilkent.edu.tr/DataSets/. Harrison, D., & Rubinfeld, D. (1978). Hedonic prices and the demand for clean air. J. Environ. Economics and Management, 5, 81–102. Haykin, S. (1999). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall. Horton, P., & Nakai, K. (1996).
A probabilistic classification system for predicting the cellular localization sites of proteins. In Intelligent Systems in Molecular Biology (pp. 109–115). New York: AAAI Press. Janikow, C. Z. (1998). Fuzzy decision trees: Issues and methods. IEEE Trans. Systems, Man, and Cybernetics, 28, 1–14. Jordan, M. I., & Jacobs, R. A. (1993). Hierarchical mixtures of experts and the EM algorithm (Tech. Rep. No. AI Memo 1440). Cambridge, MA: Massachusetts Institute of Technology, Artificial Intelligence Laboratory. Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214. Kalai, A., & Vempala, S. (n.d.). Efficient algorithms for online decision. Available online at http://citeseer.nj.nec.com/585165.html. Mehta, M., Agrawal, R., & Rissanen, J. (1996). SLIQ: A fast scalable classifier for data mining. In P. Apers, M. Bouzeghoub, & G. Gardarin (Eds.), Advances in database technology (pp. 18–32). Berlin: Springer. Moody, J., & Darken, C. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281–294. Murphy, K. (2001). The Bayes net toolbox for MATLAB. Computing Science and Statistics, 33.
Murphy, K. (2003). Bayes net toolbox for MATLAB. Available online at http://www.ai.mit.edu/∼murphyk/Software/index.html. Murthy, S. K., Kasif, S., & Salzberg, S. (1994). A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2, 1–32. Nakai, K., & Kanehisa, M. (1991). Expert system for predicting protein localization sites in gram-negative bacteria. PROTEINS: Structure, Function, and Genetics, 11, 95–110. Platt, J. C. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines (Tech. Rep. No. MSR-TR-98-14). Redmond, WA: Microsoft Research. Poggio, T., & Girosi, F. (1990a). Networks for approximation and learning. Proc. of the IEEE, 78(9), 1481–1497. Poggio, T., & Girosi, F. (1990b). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978–982. Poggio, T., Torre, V., & Koch, C. (1985). Computational vision and regularization theory. Nature, 317, 314–319. Powell, M. J. D. (1987). Radial basis functions for multivariable interpolation: A review. In J. C. Mason & M. G. Cox (Eds.), Algorithms for approximation. Oxford: Clarendon Press. Pyeatt, L. D., & Howe, A. E. (1998). Decision tree function approximation in reinforcement learning (Tech. Rep. No. CS-98-112). Fort Collins, CO: Colorado State University. Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann. Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90. Riley, M. D. (1989). Some applications of tree based modeling to speech and language indexing. In Proc. DARPA Speech and Natural Language Workshop (pp. 339–352). San Mateo, CA: Morgan Kaufmann. Salzberg, S., Delcher, A. L., Fasman, K. H., & Henderson, J. (1998). A decision tree system for finding genes in DNA. Journal of Computational Biology, 5, 667–680. Schapire, R. E., & Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions.
Machine Learning, 37, 297–336. Smola, A. J., & Schölkopf, B. (1998a). On a kernel-based method for pattern recognition, regression, function approximation and operator inversion. Algorithmica, 22, 211–231. Smola, A. J., & Schölkopf, B. (1998b). A tutorial on support vector regression (Tech. Rep. No. NC-TR-98-030). London: Royal Holloway College, University of London, NeuroCOLT. Strömberg, J. E., Zrida, J., & Isaksson, A. (1991). Neural trees—using neural nets in a tree classifier structure. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 137–140). Piscataway, NJ: IEEE. Suárez, A., & Lutsko, J. F. (1999). Globally optimal fuzzy decision trees for classification and regression. IEEE Trans. Pattern Analysis and Machine Intelligence, 21, 1297–1311. Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of ill-posed problems. Washington, DC: W. H. Winston. Utgoff, P. E., Berkman, N. C., & Clouse, J. A. (1997). Decision tree induction based on efficient tree restructuring. Machine Learning, 29(1), 5–44.
Uther, W. T. B., & Veloso, M. M. (1998). Tree based discretization for continuous state space reinforcement learning. In Proc. Fifteenth National Conference on Artificial Intelligence (AAAI-98). Cambridge, MA: MIT Press. Vapnik, V. (1998). Statistical learning theory. New York: Wiley. Vapnik, V., Golowich, S., & Smola, A. (1997). Support vector method for function approximation, regression estimation, and signal processing. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 281–287). Cambridge, MA: MIT Press. Wang, X., & Dietterich, T. (1999). Efficient value function approximation using regression trees. In T. Dean (Ed.), Proceedings of the IJCAI-99 Workshop on Statistical Machine Learning for Large-Scale Optimization. San Francisco: Morgan Kaufmann. Witten, I. H., & Frank, E. (2000). Data mining: Practical machine learning tools and techniques with Java implementations. San Francisco: Morgan Kaufmann. Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proc. Fourteenth Int. Conference on Machine Learning (ICML-97) (pp. 412–420). San Francisco: Morgan Kaufmann. Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration. In Research and development in information retrieval (pp. 46–54). New York: ACM.
Received April 11, 2005; accepted December 16, 2005.
LETTER
Communicated by Simon Laughlin
Neuronal Algorithms That Detect the Temporal Order of Events Gonzalo G. de Polavieja [email protected] Neural Processing Laboratory, Instituto "Nicolás Cabrera" and Department of Theoretical Physics, Universidad Autónoma de Madrid, Madrid 28049, Spain
One of the basic operations in sensory processing is the computation of the temporal order of excitation of sensors. Motivated by the discrepancy between models and experiments at high signal contrast, we obtain families of algorithms as solutions of a general set of equations that define temporal order detection as an input-to-output relationship. Delays and nonlinear operations are the basis of all algorithms found, but different algorithmic structures exist when the operations are multiplications, OR gates, different types of AND-NOT logical gates, or concatenated AND-NOT gates. Among others, we obtain the Hassenstein-Reichardt model, a network using a multiplicative operation that has been proposed to explain fly optomotor behavior. We also find extensions of the Barlow-Levick model (based on an AND-NOT gate with delayed inhibition and nondelayed excitation as inputs), originally proposed to explain the bipolar cell response of the rabbit retina to motion stimuli. In the extended models, there are two more steps, another AND-NOT gate, and a subtraction or two subtractions that make the model responsive only to motion. In response to low-contrast inputs, the concatenated AND-NOT gates or the AND-NOT gate followed by a subtraction in these new models act as the multiplicative operation in the Hassenstein-Reichardt model. At high contrast, the new models behave like the Hassenstein-Reichardt model except that they are independent of contrast, as observed experimentally.

1 Introduction

What are the possible neural networks that determine whether a group of sensors A has been excited before a group B or vice versa? A study of this problem has been pursued in the auditory system of the barn owl (Carr & Konishi, 1988) and more quantitatively in the visual system of insects and vertebrates. When A and B are photoreceptors or groups of them looking at two different points in space, temporal order translates into direction of visual motion.
Important behaviors like prey detection (Land & Collett, 1974), gaze stabilization (Hausen & Egelhaaf, 1989; Egelhaaf et al., 2002), and distance estimation (Esch & Burns, 1996; Srinivasan, Zhang, & Bidwell, 1997) can depend on the ability to compute the direction of motion. Neural Computation 18, 2102–2121 (2006)
© 2006 Massachusetts Institute of Technology
Neural Algorithms That Detect the Temporal Order of Events
2103
Figure 1: Two models of temporal order detection have been obtained from experimental data. (a) Hassenstein-Reichardt model, obtained from the optomotor behavior of beetle and fly. Its main operations are a delay τ and a multiplication. (b) Barlow-Levick model, proposed to explain the response of ganglion cells in the rabbit retina. Its main operations are a delay τ and an AND-NOT logical gate with excitatory and delayed inhibitory inputs. The AND-NOT gate gives a nonzero output only when receiving a signal from A and not from a channel that delays the signal from B, which we denote Bτ. Black squares (circles) indicate excitatory (inhibitory) inputs.
The first proposal for a network of temporal order detection was given by Exner (1894). Quantitative models emerged later in the study of the computation of the direction of motion in insect and vertebrate retinae. The first mathematical model was proposed by Hassenstein and Reichardt to account for the optomotor behavior of the Chlorophanus beetle and was later extended to the fly (Hassenstein & Reichardt, 1956; Reichardt, 1961). The model, illustrated in Figure 1A, has two arms. The left arm consists of the coincidence detection of a delayed signal from A and the signal from B. This arm detects A → B motion, and its mirror-symmetric arm analogously detects B → A motion. The two relevant operations for the detection of temporal order are a delay and a multiplication. The underlying biological network responsible for these operations remains elusive due to the complexity of the network and the small size of the neurons implicated. However, the electrophysiological properties of neurons in the lobula plate, a ganglion postsynaptic to the neurons computing the direction of motion, are consistent with the Hassenstein-Reichardt (HR) model at low light contrast (see, e.g., Borst & Egelhaaf, 1989; Haag, Denk, & Borst, 2004). In vertebrates there is better experimental access to the network implicated in motion detection. The Barlow-Levick (BL) model (Barlow & Levick, 1965), shown in Figure 1B, has been proposed to explain extracellular recordings from the ganglion cells of the rabbit retina. Its main operations are a delay and an AND-NOT logical gate with excitation and delayed inhibition as inputs. The AND-NOT
2104
G. de Polavieja
gate has an output only when there is a signal coming from A and not from a channel that delays the signal from B. This scheme then has A → B motion as its preferred direction and B → A motion as its null direction. The underlying physiology of this model is, as for the HR model, unknown, but it involves amacrine cells (Euler, Detwiler, & Denk, 2002; Taylor & Vaney, 2003). Two other models have made an impact on our understanding of motion detection: the energy model (Adelson & Bergen, 1985) and the gradient detector (Limb & Murphy, 1975). The algorithm of the energy model is known to be mathematically equivalent to the HR model (Adelson & Bergen, 1985). The gradient model detects image velocity and not only temporal order. An interesting relationship between models using multiplication and velocity estimation has been shown using a version of estimation theory within a statistical mechanics framework (Potters & Bialek, 1994). Optimal velocity estimation implies a smooth transition between a gradient scheme at high signal-to-noise ratio (SNR) and a multiplication scheme at low signal-to-noise ratio (Potters & Bialek, 1994). Behavioral experiments show that bees can detect velocity (Kirchner & Srinivasan, 1989; Srinivasan, Lehrer, Kirchner, & Zhang, 1991), but no evidence for a model like the gradient detector has been found (see, however, discussions of their possible relevance in biological motion detection in Hildreth & Koch, 1987, and Srinivasan, 1990). Although previous studies have greatly advanced our understanding by establishing a methodology, identifying key features (e.g., delay and nonlinearities), and explaining many aspects of motion coding and perception, some fundamental problems are still unsolved. The HR model works well for low contrast but is inconsistent with the observed independence of the response on image contrast at high-contrast values (Fermi & Reichardt, 1963; Büchner, 1976; Götz, 1964; Hengstenberg & Götz, 1967; Egelhaaf & Borst, 1989).
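The difference between the two schemes is easy to see on binary event trains. The sketch below is my own minimal discrete-time encoding (a one-step delay and a summed output), not the original continuous-time models; it evaluates both detectors on A → B motion, B → A motion, and a lone flash of A.

```python
# Minimal discrete-time comparison of the HR and BL schemes on binary
# event trains; the one-step delay is an illustrative simplification.

def delayed(x, t):
    # One-step delay with zero initial condition.
    return x[t - 1] if t >= 1 else 0

def hr_response(A, B):
    # Hassenstein-Reichardt: A_tau * B - B_tau * A, summed over time.
    return sum(delayed(A, t) * B[t] - delayed(B, t) * A[t] for t in range(len(A)))

def bl_response(A, B):
    # Single Barlow-Levick arm: A AND NOT B_tau, summed over time.
    return sum(A[t] * (1 - delayed(B, t)) for t in range(len(A)))

a_to_b = ([1, 0], [0, 1])     # A fires, then B
b_to_a = ([0, 1], [1, 0])     # B fires, then A
static_a = ([1, 0], [0, 0])   # a lone flash of A

print([hr_response(*s) for s in (a_to_b, b_to_a, static_a)])  # [1, -1, 0]
print([bl_response(*s) for s in (a_to_b, b_to_a, static_a)])  # [1, 0, 1]
```

The HR output is direction-signed and silent for the static input, whereas the single BL arm responds to the lone flash of A exactly as it does to preferred-direction motion.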
The BL model has the serious problem of not being a proper motion detector as it also responds to static stimuli. Its output to motion in its preferred direction is the same as the excitation of a single receptor, while its output to motion in the null direction is the same as the excitation of neither of the receptors. In the limit of small signals, the BL scheme reduces to a “dirty multiplication” (Thorson, 1966; Torre & Poggio, 1978). In fact, a model similar to the HR scheme in Figure 1A but with BL AND-NOT gates instead of the multiplications was shown to have an output with a term that depends on single-receptor activation minus the HR terms, so further spatiotemporal processing is needed to make a proper motion detector (Thorson, 1966). Furthermore, we have yet to discover how a real neural circuit detects motion, and in the two best-studied examples, insects and ganglion cells in vertebrate retina, the neural structures involved appear more intricate and complicated than the models (Taylor & Vaney, 2003; Higgins, Douglass, & Strausfeld, 2004). There is a need to explore more ways to detect the temporal order of events. Here we propose a simple general procedure to obtain models
of temporal order detection. It consists of translating the operation of a temporal order detector into a set of simultaneous equations whose solutions are the temporal order detection algorithms. We find families of models with a range of structures depending on the nonlinearities involved. These models are able to reconcile the apparent differences between the computational operations that are performed by the lateral excitation (HR) models and the lateral inhibition (BL) models. Algorithms that are extensions of the BL model are found that are proper motion detectors, as they do not respond to the activation of single sensors. These models function like the HR model at low image contrast and are independent of contrast at high image contrast, converging to a value that depends on temporal frequency, as displayed by experimental data (de Ruyter van Steveninck, Bialek, Potters, Carlson, & Lewen, 1996). The letter is organized as follows. First, we give the general procedure to obtain algorithms of motion detection. Second, we use this method to obtain different algorithmic structures depending on the nonlinearities used. Third, we show that some of these models are extended BL schemes and that they behave like the HR model at low contrast and are independent of contrast at high contrast. Finally, we discuss the implications of the results in the study of neural circuitry involved in motion detection.

2 Simple Method to Obtain Algorithms That Detect Temporal Order

Our procedure involves translating the function of the network into an input-output relationship and obtaining the algorithms that obey it given a set of allowed operations. Consider a discrete space-time, say, two points in space at time t − τ, A(t − τ) and B(t − τ), and at time t, A(t) and B(t). There is motion from A to B when A is activated first and is then followed by the activation of B. A simple representation of this is to say that A(t − τ) = 1, B(t − τ) = 0, A(t) = 0, and B(t) = 1.
There are 16 possible matrices of the discrete spatiotemporal input, represented in Figure 2. An algorithm for the detection of temporal order has to respond to these 16 inputs with the outputs given below the matrices in Figure 2. These output values mean that the algorithm cannot respond to static stimuli and that it has to distinguish A → B from B → A motion. We are now ready to express the problem formally. For compactness, let A(t − τ) be the vector formed by the 16 matrix elements A(t − τ) in the order given in Figure 2, and similarly for A(t), B(t − τ), and B(t), and let R be the vector of output values,

A(t − τ) = (1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0)
A(t) = (0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0)
B(t − τ) = (0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0)
B(t) = (1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1)
R = (1, −1, 0, 0, 0, 0, 0, 0, −1, 1, 1, −1, 0, 0, 0, 0).
(2.1)

Figure 2: The 16 possible 2 × 2 input matrices for binary detection elements and, below them, the outputs for an ideal temporal order detector. The first matrix, for example, has values A(t − τ) = B(t) = 1, represented in white, and A(t) = B(t − τ) = 0, in black, corresponding to light exciting A first and then B. For this A → B motion, the output is 1. The second matrix corresponds to B → A motion and has an output of −1.
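Equation 2.1 can be written down directly as data and sanity-checked; the short sketch below (variable names are mine) confirms that the first two patterns encode clean A → B and B → A motion, that the 16 columns enumerate all distinct binary input matrices, and that only six inputs demand a nonzero output.

```python
# Equation 2.1 as data: the 16 input patterns and the required outputs.
A_prev = (1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0)  # A(t - tau)
A_now  = (0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0)  # A(t)
B_prev = (0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0)  # B(t - tau)
B_now  = (1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1)  # B(t)
R      = (1, -1, 0, 0, 0, 0, 0, 0, -1, 1, 1, -1, 0, 0, 0, 0)

# Pattern 1 is clean A -> B motion; pattern 2 is clean B -> A motion.
assert (A_prev[0], A_now[0], B_prev[0], B_now[0], R[0]) == (1, 0, 0, 1, 1)
assert (A_prev[1], A_now[1], B_prev[1], B_now[1], R[1]) == (0, 1, 1, 0, -1)

# All 16 distinct binary input matrices appear exactly once.
assert len(set(zip(A_prev, A_now, B_prev, B_now))) == 16

# Only 6 of the 16 inputs require a nonzero, direction-signed response.
assert sum(r != 0 for r in R) == 6
```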
Having defined the input-output relationship, we need to restrict the possible operations among the elements of the input matrix. Consider two operations, the sum + and an abstract operation ⊗ that in succeeding sections will be substituted by relevant nonlinear operations like multiplications and OR gates. We are interested in obtaining temporal order detectors using sums and the operation ⊗. Formally, this consists of finding the parameters {a, b, . . . , u} that make the set of operations with one or two matrix elements equal to the output vector R when the inputs are those of
the matrices in Figure 2, written in vector form in equation 2.1,

R = a A(t − τ) + b A(t) + c B(t − τ) + d B(t) + e A(t − τ) ⊗ A(t − τ) + f A(t) ⊗ A(t) + g B(t − τ) ⊗ B(t − τ) + h B(t) ⊗ B(t) + i A(t − τ) ⊗ B(t − τ) + j A(t − τ) ⊗ B(t) + k A(t) ⊗ B(t − τ) + l A(t) ⊗ B(t) + m B(t − τ) ⊗ A(t) + n B(t) ⊗ A(t − τ) + o B(t − τ) ⊗ A(t − τ) + p B(t) ⊗ A(t) + q A(t − τ) ⊗ A(t) + r A(t) ⊗ A(t − τ) + s B(t − τ) ⊗ B(t) + u B(t) ⊗ B(t − τ),
(2.2)

where the operations are entry-wise, that is,

A(t − τ) ⊗ A(t) = (1, 0, . . . , 0) ⊗ (0, 1, . . . , 0) = (1 ⊗ 0, 0 ⊗ 1, . . . , 0 ⊗ 0).
(2.3)
Note that we have used the parameter u instead of the parameter t that would correspond by the alphabetical order used, to avoid confusion with time t. Equation 2.2 will be used throughout this article to obtain different models. When using commutative operations, that is, when A ⊗ B = B ⊗ A, we do not need the terms with parameters from m to p and r and u, as they would be redundant with other terms. Equation 2.2 contains all possible terms with one and two elements of the input matrices in Figure 2. Similar procedures can be used including the operation of three or four elements. It is instructive to consider in some detail the procedure for finding whether there exists a model using only the linear terms. This corresponds to the solutions of equation 2.2 without the operation ⊗ (e = f = . . . = u = 0), a A(t − τ) + b A(t) + c B(t − τ) + d B(t) = R.
(2.4)
Substituting the vectors of equation 2.1 into equation 2.4, we obtain 16 relationships between the four parameters a, b, c, and d:

a + d = 1;  b + c = −1;  a + b = 0;  c + d = 0;
0 = 0;  a + b + c + d = 0;  a + c = 0;  a + b + c = −1;
a + c + d = 1;  b + c + d = −1;  b + d = 0;  a + b + d = 1;
a = 0;  b = 0;  c = 0;  d = 0.
(2.5)
It is, however, impossible to obey these 16 relationships simultaneously. The last four relations require all the parameters to be zero, while previous relations, for example, the sixth and tenth, imply a = 1. Therefore, no
temporal order detection is possible with only the linear terms; we need to include a nonlinear operation ⊗, giving terms like +A(t − τ) ⊗ B(t). Systems of temporal order detection thus need to operate on input values at time t and at time t − τ. For example, the term +A(t − τ) ⊗ B(t) can be computed by a system delaying the signal from A by a time τ to operate on it with B(t). To write down the temporal detection algorithms from the point of view of the system, we include the delay as a subscript, that is,

+A(t − τ) ⊗ B(t) (input operations) −→ +Aτ ⊗ B (system operations).
(2.6)

By considering a discrete space-time and binary inputs, the conditions for a motion detector are then given by 16 relationships between parameters, as in equation 2.2. This procedure has the advantage, along with simplicity, of requiring only a minimal set of conditions. There are several extra conditions we do not impose, with the following advantages. First, we fix the output values only for discrete inputs. This allows us to obtain models with different outputs when the input is a continuous signal. Second, we do not restrict the outputs when the inputs have negative values. This could be incorporated in equation 2.2 in a simple way, for example, by requiring that the outputs remain the same when the sign of the input changes. However, vertebrate and insect retinae probably process motion detection differently for negative contrast, so we choose to leave open the behavior of the models for negative input. Finally, we do not model explicitly the event-detecting processing steps before temporal order detection. Some sensory systems first extract relevant features (i.e., bars or contours in the case of the visual system) and only later compute temporal order. However, basic features of the HR and the BL models compare well to experiments performed without explicitly modeling the first sensory steps (i.e., photoreceptors and first interneurons), and we follow the same strategy here.
3 Hassenstein-Reichardt Model as the Simplest Solution for a Multiplicative Nonlinearity

Substituting in equation 2.2 the abstract operation for a multiplication ×, our procedure involves finding the parameters that obey

a A(t − τ) + b A(t) + c B(t − τ) + d B(t)
+ e A(t − τ) × A(t − τ) + f A(t) × A(t)
+ g B(t − τ) × B(t − τ) + h B(t) × B(t)
+ i A(t − τ) × B(t − τ) + j A(t − τ) × B(t)
+ k A(t) × B(t − τ) + l A(t) × B(t)
+ q A(t − τ) × A(t) + s B(t − τ) × B(t) = R.
(3.1)
Neural Algorithms That Detect the Temporal Order of Events
2109
We find a family of solutions obeying a = −e, b = − f, c = −g, d = −h, j = 1, and k = −1 and zero for the remaining parameters. The simplest member of this family corresponds to a = b = c = d = 0, such that A(t − τ ) × B(t) − B(t − τ ) × A(t) = R. Thus, the simplest set of input operations that distinguish temporal order is of the form A(t − τ ) × B(t) − B(t − τ ) × A(t).
(3.2)
As discussed in equation 2.6, these input operations correspond to the system operations Aτ × B − Bτ × A,
(3.3)
that is, the HR model depicted in Figure 1A. In this article, we consider only the simplest solutions. For example, the algorithm Aτ² − Aτ + Aτ × B − Bτ × A is a valid solution for a = −e = −1, but it only adds terms with no net output to the HR model. We do not consider models obtained with definitions of motion detection different from the one given in Figure 2. For example, we could consider a definition with null outputs for the inputs in the third row, as they are neither clean motions nor static cases like the rest of the rows. This alternative definition would correspond to a substitution of R in equation 2.2 for R′ = (1, −1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0). Models with this output contain terms operating on three or four matrix elements. In the case of multiplications, the simplest model we find for this case is of the form

Aτ × B − Bτ × A + (A × Aτ × Bτ − A × Aτ × B + A × B × Bτ − Aτ × B × Bτ), (3.4)

which adds four extra terms to the HR model found in equation 3.3, each with the multiplication of three matrix elements, a complexity that does not correspond to known biology.

4 Other Models with the Nonlinearity Using Only Excitation

To search for other algorithms for which the nonlinearity is also excitatory, we consider other operations. An AND logical gate with inputs x1 and x2, x1 ∧ x2, has an output of 1 when x1 = x2 = 1 and zero otherwise. An AND gate can be realized biophysically with a spike threshold that is reached only when two excitatory inputs are active. The input-output of an AND gate is the same as that of the multiplication for binary inputs. Therefore, our procedure finds the same models for AND gates and multiplications, the simplest being Aτ ∧ B − Bτ ∧ A.
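For binary inputs, the claim that the form in equation 3.2 distinguishes temporal order, and that AND gates and multiplications give the same model, can be checked exhaustively. A minimal sketch in Python (the function names and the particular input patterns tested are mine; the paper's full 16-condition table of Figure 2 is not reproduced here):

```python
from itertools import product

def hr(a_tau, a, b_tau, b):
    # Simplest HR solution of equation 3.1: A(t - tau) x B(t) - B(t - tau) x A(t)
    return a_tau * b - b_tau * a

def hr_and(a_tau, a, b_tau, b):
    # Same model built from AND gates instead of multiplications
    return int(a_tau and b) - int(b_tau and a)

# AND gates and multiplications coincide on all 16 binary input patterns
for a_tau, a, b_tau, b in product((0, 1), repeat=4):
    assert hr(a_tau, a, b_tau, b) == hr_and(a_tau, a, b_tau, b)

# Temporal order A -> B: A active at t - tau, B active at t
assert hr(1, 0, 0, 1) == +1
# Opposite order B -> A: the sign of the output is reversed
assert hr(0, 1, 1, 0) == -1
# Static and flicker patterns give no net output
assert hr(0, 0, 0, 0) == 0   # no input
assert hr(1, 1, 1, 1) == 0   # simultaneous flicker on both receptors
assert hr(1, 1, 0, 0) == 0   # activity at A only
assert hr(0, 0, 1, 1) == 0   # activity at B only
```

The exhaustive loop also makes concrete why the procedure finds identical solutions for the two nonlinearities: on {0, 1} inputs the two operations are the same function.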
2110
G. de Polavieja
Figure 3: Some network structures performing temporal order detection. (A, B) Two models of temporal order detection based on (A) OR gates and (B) AND-NOT gates. (C, D) Two models of temporal order detection with a first layer of AND-NOT gates and a second layer subtracting their output from the output coming from a single point in space. The Barlow-Levick model corresponds to the first layer of C. (E, F) Two models using concatenated AND-NOT gates. The Barlow-Levick model corresponds to the first layer of E.
An OR gate with inputs x1 and x2 , x1 ∨ x2 , has an output of 1 when x1 = 1 or x2 = 1 or when both equal 1, x1 = x2 = 1, and zero otherwise. OR gates can be realized biophysically with a spike threshold lower than each of two excitatory inputs. Substituting the operation in equation 2.2 by the operation ∨, we find a solution corresponding to the system operations Aτ − Bτ − A + B + A ∨ Bτ − Aτ ∨ B,
(4.1)
depicted in Figure 3A. No experimental evidence for visual motion detection based on OR gates has been found so far.

5 Models with the Nonlinearity Using Excitation and Inhibition

The recordings implicated in the computation of visual motion detection are those found in the vertebrate retina (Barlow & Levick, 1965). In this case, the
relevant nonlinearity is the AND-NOT gate or a divisive operation (Amthor & Grzywacz, 1993; Grzywacz & Amthor, 1993). Anatomical evidence also points to the AND-NOT gate as a candidate operation in insect motion detection (Higgins et al., 2004). An AND-NOT gate with inputs x1 and x2 of the form x1 . ∼ x2 has an output of 1 only when x1 = 1 and x2 = 0 and zero otherwise. This is similar to a divisive operation x1/x2, which has a high value (corresponding to the value 1 in the AND-NOT gate) only when x1 = 1 and x2 = 0 and low values otherwise (corresponding to the value 0 in the AND-NOT gate). The biophysics behind this operation could vary among systems. It could be based on the subtraction of excitatory and inhibitory inputs followed by a threshold, silent inhibition (Torre & Poggio, 1978) in nonspiking neurons (in spiking neurons, silent inhibition has only a subtractive effect on the firing rate; see Holt & Koch, 1997), calcium-enhanced calcium release (Barlow, 1996), or network effects (Holt & Koch, 1997).

5.1 Models with AND-NOT Gates and Without Linear Terms. Substituting the operation in equation 2.2 by the operation AND-NOT, we find a family of models without the linear terms and another family with them. Models without the linear terms (a = b = c = d = 0) are obtained with parameters i = 1 − p, j = −1 + p, k = p, l = −p, m = p − 1, n = −p, and o = 1 − p and zero for the remaining parameters. The simplest cases are obtained for p = 0 and p = 1, corresponding, respectively, to the system operations Aτ . ∼ Bτ − Aτ . ∼ B − Bτ . ∼ Aτ + Bτ . ∼ A
(5.1)
A. ∼ Bτ − A. ∼ B − B. ∼ Aτ + B. ∼ A.
(5.2)
Both models contain only AND-NOT gates, but only the second, depicted in Figure 3B, contains BL interactions, represented by the terms A. ∼ Bτ and B. ∼ Aτ. However, this model also contains two other AND-NOT gates that do not have the BL structure.

5.2 Models with AND-NOT Gates and Linear Terms. We have obtained two simple models including linear terms. The first one has b = −1, d = 1, k = 1, and n = −1 and zero for the remaining parameters and obeys B − (B. ∼ Aτ) − {A − (A. ∼ Bτ)},
(5.3)
depicted in Figure 3C. The second model has a = 1, c = −1, j = −1, and o = 1 and zero for the remaining parameters, obeying Aτ − (Aτ . ∼ B) − {Bτ − (Bτ . ∼ A)},
(5.4)
shown in Figure 3D. Combinations of any of the arms of these two models are also a solution, for example, B − (B. ∼ Aτ) − {Bτ − (Bτ . ∼ A)}. The first of these two models, equation 5.3, is particularly interesting, as it extends in a simple way the BL model to make it a proper motion detector. Its layer of AND-NOT gates has a pure BL structure and is sensitive to the direction of motion. The second layer has a structure with an inhibitory input from the first layer and excitation from a single receptor. This configuration reverses the preferred direction of the detector and eliminates responses to stimuli coming from only one receptor. The third step, a subtraction of the output from the two arms, eliminates the response to flicker affecting both receptors simultaneously. Note that the second and third steps can be interchanged without any consequence. Also note that the information from isolated sensors, for example, B − A in equation 5.3, could be coming from operations involving other receptors, say, C and D, that nevertheless do not respond in the A ↔ B direction. For example, B − A can be substituted for structures like B. ∼ D − A. ∼ C if C and D are not active in A ↔ B motion but could participate in the calculation of time order in the directions A ↔ C and B ↔ D, respectively.

5.3 Models with Concatenated AND-NOT Gates. Given the relevance of the AND-NOT gates, we have also considered an extension of our procedure in equation 2.2 to include terms with the operation of three elements, in this case corresponding to concatenated AND-NOT gates. We find a model with concatenated AND-NOT gates of the form B. ∼ (B. ∼ Aτ) − A. ∼ (A. ∼ Bτ),
(5.5)
represented in Figure 3E. This model is similar to the one in equation 5.3 except that the output from the BL layer and the output from a single receptor meet at a second layer of AND-NOT gates. A second model is found to be of the form Aτ . ∼ (Aτ . ∼ B) − Bτ . ∼ (Bτ . ∼ A),
(5.6)
which is depicted in Figure 3F. This model is similar to the one in equation 5.4, except that the second layer also uses AND-NOT gates. The AND-NOT gates of this second model do not have the structure of the BL scheme. Combinations of any of the arms of these two models are also a solution, for example, B. ∼ (B. ∼ Aτ) − Bτ . ∼ (Bτ . ∼ A). The two models in equations 5.3 and 5.5, illustrated in Figures 3C and 3E, are particularly interesting, as their first layer of operations has a BL structure that is corrected in a second layer to make it a temporal order detector.
6 Response of the Extended Barlow-Levick Models to Continuous Signals

The models obtained have the same input-output relationship for binary signals. For example, the HR model, equation 3.3, has the same input-output relationship as the extended BL models, equations 5.3 and 5.5. In particular, the multiplicative operation of the HR model has the same input-output relationship as the subtraction of the signal coming from a single receptor and an AND-NOT gate or as two concatenated AND-NOT gates, Aτ × B = B − (B. ∼ Aτ) = B. ∼ (B. ∼ Aτ).
(6.1)
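The identity in equation 6.1 can be checked exhaustively over the four binary input combinations; a small sketch in Python (the function names are mine):

```python
from itertools import product

def and_not(x1, x2):
    # AND-NOT gate x1 . ~x2: output 1 only when x1 = 1 and x2 = 0
    return int(x1 == 1 and x2 == 0)

for a_tau, b in product((0, 1), repeat=2):
    multiplicative = a_tau * b                    # HR-style branch, A_tau x B
    subtractive = b - and_not(b, a_tau)           # B - (B . ~A_tau)
    concatenated = and_not(b, and_not(b, a_tau))  # B . ~(B . ~A_tau)
    # all three expressions of equation 6.1 agree on binary inputs
    assert multiplicative == subtractive == concatenated
```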
However, the response of the models should differ for continuous signals, and this helps us to identify which is used in nature. We are therefore interested in comparing their differences to known features of experimental data. Motion experiments typically use a sinusoidal grating of mean intensity I0, modulation ΔI, velocity ω, and spatial period λ, I(x, t) = I0 + ΔI sin(2π(x − ωt)/λ). When the receptors A and B are separated by a distance Δx and the delay in the system is given by a first-order low-pass filter of time constant τ, the signals arriving at the nonlinearity are of the form (Zanker, Srinivasan, & Egelhaaf, 1999)

A(t) = I0 + ΔI sin((2πω/λ) t), (6.1a)
Aτ(t) = I0 + F ΔI sin((2πω/λ) t − Φ), (6.1b)
B(t) = I0 + ΔI sin((2πω/λ)(t − Δx/ω)), (6.1c)
Bτ(t) = I0 + F ΔI sin((2πω/λ)(t − Δx/ω) − Φ), (6.1d)
with F = (1 + (2πωτ/λ)²)^(−1/2) and Φ = arctan(2πωτ/λ) the amplitude and phase response resulting from the filter, respectively. The time integral for the HR model then gives (Zanker et al., 1999)

∫₀^∞ dt (Aτ(t) × B(t) − Bτ(t) × A(t)) = (ΔI)² sin(2πΔx/λ) sin(arctan(2πωτ/λ)) / (1 + 4π²ω²τ²/λ²)^(1/2). (6.2)
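As an illustration of the continuous-signal comparison, one can check numerically that the HR model and the extended BL model of equation 5.3 respond with the same sign to a given direction of motion and reverse sign when the direction is reversed. This is only a sketch under stated assumptions: it replaces the first-order low-pass filter with an ideal delay, uses the smoothed AND-NOT gate s(x) = (1 + tanh(5(x − 1/2)))/2 introduced below, and takes the grating parameters of Figure 4:

```python
import math

# Grating and model parameters taken from Figure 4
I0, dI = 0.5, 0.4            # mean intensity and modulation
lam, omega = 10.0, 0.5       # spatial period and velocity
dx, tau = 2.5, 5.0           # receptor separation and delay
T = lam / omega              # temporal period of the grating

def s(x):
    # smoothed AND-NOT threshold, s(x) = (1 + tanh(5(x - 1/2))) / 2
    return (1.0 + math.tanh(5.0 * (x - 0.5))) / 2.0

def signals(t, direction):
    # grating intensity at receptors A (x = 0) and B (x = dx); NOTE: an
    # ideal delay of tau is used here instead of a first-order low-pass filter
    def receptor(x, td):
        return I0 + dI * math.sin(2.0 * math.pi * omega * (td - direction * x / omega) / lam)
    return receptor(0.0, t), receptor(0.0, t - tau), receptor(dx, t), receptor(dx, t - tau)

def hr(a, a_tau, b, b_tau):
    # Hassenstein-Reichardt model, equation 3.3
    return a_tau * b - b_tau * a

def ext_bl(a, a_tau, b, b_tau):
    # extended Barlow-Levick model, equation 5.3, with smoothed gates
    return (b - s(b - a_tau)) - (a - s(a - b_tau))

def mean_output(model, direction, n=4000):
    # time average over ten full grating periods
    return sum(model(*signals(k * 10.0 * T / n, direction)) for k in range(n)) / n

hr_fwd, hr_bwd = mean_output(hr, +1), mean_output(hr, -1)
bl_fwd, bl_bwd = mean_output(ext_bl, +1), mean_output(ext_bl, -1)

assert hr_fwd > 0 > hr_bwd           # HR output reverses sign with direction
assert bl_fwd > 0 > bl_bwd           # extended BL prefers the same direction
assert abs(bl_fwd + bl_bwd) < 1e-9   # equation 5.3 is antisymmetric under A <-> B
```

The sign reversal of the extended BL model under direction reversal follows exactly from its antisymmetry under exchange of the two receptors.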
The output of the HR model to a sinusoidal grating has a simple form, the product of three terms. The first term depends only on contrast ΔI, the second on the spatial period λ, and the third on the temporal frequency ω/λ. Similar results can be obtained using a high-pass filter for the arm without delay (Buchner, 1976). The first two columns in Figure 4 compare this output of the HR model to the output of the extended BL model in equation 5.3 of the form B − (B. ∼ Aτ) − {A − (A. ∼ Bτ)}. The AND-NOT gate with inputs x1 and x2 of the form x1 . ∼ x2 can be written mathematically in terms of the Heaviside function Θ(x1 − x2 − 1/2), that is, 0 when x1 − x2 − 1/2 < 0 and 1 when x1 − x2 − 1/2 > 0. Biological implementations of the models must use smooth operations, and for this reason we use a smooth version of the Heaviside function of the form s(x1 − x2) = (1 + tanh(5((x1 − x2) − 1/2)))/2. Other parameters for this function or other smooth functions do not change the results qualitatively; for example, the extrema of the output are the same. We expect different species to differ in the values of the parameters of this function. We have numerically calculated the time integral of the extended BL model using this smooth function and compared the results with the three terms resulting from the HR model in equation 6.2:
• Term sin(2πΔx/λ). At fixed temporal frequency ω/λ and contrast ΔI, the HR model has an output proportional to sin(2πΔx/λ). Figure 4A illustrates the output of the HR and the extended BL model in equation 5.3, respectively, against the inverse of the spatial period λ with a fixed temporal frequency ω/λ = 1/20 and contrast ΔI = 0.4 and the remaining parameters τ = 5, Δx = 2.5, and I0 = 0.5. Both models show a sinusoidal variation with extrema at 1/λ = n/(4Δx) and n = {1, 3, . . .}.

• Term sin(arctan(2πωτ/λ))(1 + 4π²ω²τ²/λ²)^(−1/2). For fixed contrast ΔI and spatial period λ, the HR model has an output proportional to sin(arctan(2πωτ/λ))(1 + 4π²ω²τ²/λ²)^(−1/2). Figure 4B illustrates the output of the HR and extended BL models for ΔI = 0.4 and λ = 10 and with the model parameters as before. There is again very good correspondence between the two models.

• Term (ΔI)². Fixing the spatial period λ and velocity ω, the HR model has an output proportional to (ΔI)². Figure 4C gives the output of the HR and the extended BL model for τ = 5, Δx = 2.5, I0 = 0.5, λ = 20, and ω = 0.5. The extended BL model shows different behaviors at low and high contrast. While at low contrast it behaves similarly to the HR model, at high contrast it converges to an output value. Interestingly, this convergence is not due to saturation, but instead its value depends on temporal frequency. Figure 4D shows the output for the two models against contrast and velocity, clearly showing that the output of the extended BL model converges to a value that depends on temporal frequency.

Figure 4: Comparison of the output of the Hassenstein-Reichardt (left) and the extended Barlow-Levick models in equations 5.3 (middle) and 5.5 (right). The AND-NOT gates are smoothed using the function s(x) = (1 + tanh(5(x − 1/2)))/2 for x > 0 and zero for x < 0. Other parameters and smooth functions give similar results. (A) Output varying the spatial period λ for fixed temporal frequency ω/λ = 1/20 and fixed contrast ΔI = 0.4 and remaining parameters τ = 5, Δx = 2.5, and I0 = 0.5, chosen for illustrative purposes. (B) Output varying the grating velocity ω at fixed contrast ΔI = 0.4 and spatial period λ = 10, and remaining parameters as in A. (C) Output varying the contrast ΔI at fixed spatial period λ = 20 and velocity ω = 0.5, and remaining parameters as in A. (D) Same as C but showing that the output at high contrast is not saturated but converges to a value that depends on temporal frequency.

The third column in Figure 4 gives the output for the extension of the BL model in equation 5.5. The results are similar to those in the second column, but with a slightly worse correspondence to the HR model. For example, it shows a maximum (minimum) at a value larger (lower) than ω/λ = 1/(2πτ) (ω/λ = −1/(2πτ)). Both of the extended BL models have an output very similar to that of the HR model, except for their contrast independence at high contrast, consistent with experimental data. It is important to note that the results do not depend qualitatively on the smoothing function. For example, using pure AND-NOT gates, the output depends on the inverse of the spatial period in a way similar to Figure 4A but with flatter output at the minima and maxima of the sine function. The output using pure AND-NOT gates also depends on contrast similarly to Figure 4C but with a more abrupt transition from the low-contrast to the high-contrast behavior. Also note that the results shown in Figure 4 need the full network structure of the extended BL models, as single elements like an isolated AND-NOT gate have different outputs.

7 Discussion

We have given a general scheme to obtain models for the detection of the order of events as solutions of a set of simultaneous equations. Different families of network structures have been obtained depending on the nature of the nonlinearity they use. Models based on OR gates, different types of AND-NOT gates, and concatenated gates have been given. This methodology can be used to obtain other algorithms by using other nonlinearities (e.g., logarithms, powers) or by adding terms known to take place in real systems.
We have also shown that extended Barlow-Levick (BL) models function similarly to the Hassenstein-Reichardt (HR) model at low contrast and are contrast independent at high contrast, converging to a value that depends on temporal frequency, as displayed by experimental data (de Ruyter van Steveninck et al., 1996). One of the lessons from this study is that there are several families of models that are able to perform the detection of the order of events. Their final outputs are similar, and differences can always be minimized by adding extra elements. For example, even if the extended BL models obtained have been shown to be contrast independent at high contrast, a modified HR scheme with contrast adaptation or a saturating nonlinearity can display similar behavior (Egelhaaf & Borst, 1989). The relevance of having several models is therefore in making predictions for experiments designed to study the networks implicated in motion detection and not only their final output. For example, neurons participating
in a motion detection network could be found firing for a shorter time in one direction of motion than in the opposite direction. While these neurons might not be thought to be critical in HR or extended BL schemes, they could be performing the processing of an OR gate, the most relevant processing step in a motion detection network like that of Figure 3A. A search for a network performing motion detection is in fact mostly a search for nonlinearities and delays. When physiological results show the existence of a nonlinearity, we can use models like those in Figure 3 to predict the existence of the other operations needed for the computation of motion detection. For example, in the case of finding an OR gate, the model in Figure 3A predicts that its output has to be subtracted from the sum of the delayed and nondelayed inputs. Other nonlinearities make different predictions. If an AND-NOT gate of the BL type (excitation and delayed inhibition as inputs) is found, the simplest models, Figures 3C and 3E, predict that its output is compared to the undelayed line. On the other hand, when an AND-NOT gate is found with inhibition and delayed excitation, as in Figures 3D and 3F, it is predicted that its output is compared to the delayed line. In case AND-NOT gates are found without any delayed inputs, Figure 3B predicts that their outputs have to be mixed with outputs of different types of AND-NOT gates. As an example of the predictive value of the models, in the following we discuss the predictions that the extended BL models make for the insect and vertebrate retinae, given our current knowledge. Most of the discussions on insect motion detection focus on the output of motion detectors, as it is much easier to record from lobula plate neurons, which are postsynaptic to the motion detector networks believed to be in the medulla.
However, a recent study has used anatomical and physiological data in the medulla (Douglass & Strausfeld, 2001) to propose a model that maps to the HR scheme (Higgins et al., 2004). This HDS (Higgins-Douglass-Strausfeld) model has three steps. First, neuron T5 receives excitatory input from the transmedullary neuron Tm1 (say, our B) and inhibitory input from Tm9 (our A), the latter acting as a low-pass filter. The HDS model assumes this interaction to be implementing a "dirty multiplication" by shunting inhibition (Thorson, 1966; Torre & Poggio, 1978). Although less anchored in empirical evidence, the HDS model assumes two more steps. In the second step, an interneuron that receives the same input from two T5 cells (from the left and right arms of the model) returns a weighted inhibition to them. The final step is spatial averaging, necessary to avoid the response to a single receptor. The extended BL models share with the HDS model the first step, in our case without the need to invoke the dirty multiplication approximation, which is valid only for low contrast. Neuron T5 would then be performing an AND-NOT gate with inputs from Tm1 and Tm9. The prediction of the model is that the output of this gate has to be further compared (by subtraction or a second AND-NOT gate) with information from T1. This second step could in principle take place in the dendritic tree of T5 in
a shunting-inhibition model or in a different interneuron. For small inputs, the AND-NOT gate B. ∼ Aτ performed by shunting inhibition reduces to B(1 − Aτ/Imax) (Thorson, 1966; Torre & Poggio, 1978). The extended BL model can then be approximated at low contrasts by the HR model, B − (B. ∼ Aτ) − {A − (A. ∼ Bτ)} ≈ (Aτ B − ABτ)/Imax.
(7.1)
Other possibilities are open for motion detection in insects, like networks using AND-NOT gates with delayed excitatory and nondelayed inhibitory inputs like the one in Figure 3D. This model predicts that the output of the AND-NOT gates has to be subtracted from the single-receptor delayed signal, and it also reduces to the HR model in the small signal limit of the shunting inhibition, Aτ − (Aτ . ∼ B) − {Bτ − (Bτ . ∼ A)} ≈ (Aτ B − ABτ )/Imax .
(7.2)
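Once the shunting-inhibition form X . ∼Y ≈ X(1 − Y/Imax) is substituted, the reductions in equations 7.1 and 7.2 follow by direct algebra; a quick numerical sketch (input values are arbitrary):

```python
import random

I_max = 1.0

def shunt(x, y):
    # small-signal shunting-inhibition approximation of the AND-NOT gate X . ~Y
    return x * (1.0 - y / I_max)

random.seed(0)
for _ in range(1000):
    a, a_tau, b, b_tau = (random.random() for _ in range(4))
    ext_bl = (b - shunt(b, a_tau)) - (a - shunt(a, b_tau))           # eq. 7.1, left side
    delayed = (a_tau - shunt(a_tau, b)) - (b_tau - shunt(b_tau, a))  # eq. 7.2, left side
    hr = (a_tau * b - a * b_tau) / I_max                             # HR model, right side
    # both reductions hold exactly once the shunting form is substituted
    assert abs(ext_bl - hr) < 1e-12
    assert abs(delayed - hr) < 1e-12
```

Expanding the left side of equation 7.1, B − B(1 − Aτ/Imax) = B Aτ/Imax, so the linear terms cancel and only the HR product remains; the "≈" in the text refers to the validity of the shunting approximation itself at low contrast.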
Note that the HDS and our extended BL models work with positive numbers. In our case, the models are solutions obtained for binary input values 0 and 1, as shown in Figure 2. This means that inputs below zero are treated as the symbol 0 and inputs above zero are treated as the symbol 1. This procedure treats negative values as if they were a value of 0. However, for multiplication, negative numbers matter, as negative × negative = positive and negative × positive = negative. A simple way to achieve this behavior is to use four subsystems: ON-ON (A and B responding to light increases), OFF-OFF, ON-OFF, and OFF-ON. This separation into channels is common to the HDS model and our extended BL models, and has even been proposed in a silicon implementation of the HR model (see, e.g., Liu, 2000). The vertebrate visual system also shows differences at low and high contrast. Humans perceive low-contrast gratings to be slower than high-contrast ones at the same velocity (e.g., Blakemore & Snowden, 1999). To understand the network implicated in motion detection, we need to look for the neurons presynaptic to a motor system that respond differently depending on the external motion direction. The nucleus of the optic tract in the pretectum and the accessory optic system distinguish the direction of motion and connect to the motor system that controls eye movements. The role of these systems is probably to respond to wide-field stimulation by summing the responses of many motion detectors, as does the insect lobula plate (Krapp & Hengstenberg, 1996). Neurons in the nucleus of the optic tract have an output that depends quadratically on contrast, but at high contrast the response tends to a plateau (Ibbotson, Clifford, & Mark, 1999), consistent with our extended BL models. Also note that the nucleus of the optic tract receives inputs from the ganglion cells, which can act as the AND-NOT gate of the pure BL model.
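The four-subsystem idea described above can be made concrete: splitting each signal into an ON channel (positive part) and an OFF channel (negative part), all four subsystems operate on nonnegative numbers, and the signed product is recovered as ON-ON + OFF-OFF − ON-OFF − OFF-ON. A sketch (the channel encoding is my own illustration):

```python
def channels(x):
    # ON channel carries positive values, OFF channel carries negative ones;
    # both channels are nonnegative, so each subsystem sees only positive numbers
    return max(x, 0.0), max(-x, 0.0)

def signed_product(a, b):
    a_on, a_off = channels(a)
    b_on, b_off = channels(b)
    # ON-ON and OFF-OFF subsystems add; the mixed subsystems are subtracted
    return a_on * b_on + a_off * b_off - a_on * b_off - a_off * b_on

# the four-channel combination reproduces the signed multiplication exactly
for a in (-1.5, -0.3, 0.0, 0.7, 2.0):
    for b in (-2.0, -0.1, 0.0, 0.4, 1.2):
        assert abs(signed_product(a, b) - a * b) < 1e-12
```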
The prediction of the extended BL models is then that the output of the AND-NOT gates in ganglion cells
is subtracted from nonmotion information from one point in space. The outputs of these subtractions from the two arms in the models are further subtracted, in this case possibly in the nucleus of the optic tract.

Acknowledgments

I am very grateful to Horace Barlow, Brian Burton, and Simon Laughlin for their critical and detailed reading of the manuscript and for many useful suggestions. Discussions with Sara Arganda, Raul Guantes, Mikko Juusola, and Ignacio Ramis are also acknowledged. I thank the program Understanding the Brain at KITP, Santa Barbara, for its support, and acknowledge grants from the Wellcome Trust, MEC, fBBVA, and CAM-UAM.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2, 284–299.
Amthor, F. R., & Grzywacz, N. M. (1993). Inhibition in on-off directionally selective ganglion cells in the rabbit retina. J. Neurophysiol., 69, 2174–2187.
Barlow, H. B. (1996). Intraneuronal information processing, directional selectivity and memory for spatio-temporal sequences. Network, 7, 251–259.
Barlow, H. B., & Levick, W. R. (1965). The mechanism of directionally selective units in rabbit retina. J. Physiol. (Lond.), 178, 477–504.
Blakemore, M., & Snowden, R. J. (1999). The effect of contrast on perceived speed: A general phenomenon? Perception, 28, 33–48.
Borst, A., & Egelhaaf, M. (1989). Principles of visual motion detection. Trends Neurosci., 12, 297–306.
Buchner, E. (1976). Elementary movement detectors in an insect visual system. Biol. Cybern., 24, 85–101.
Carr, C. E., & Konishi, M. (1988). Axonal delay lines for time measurement in the owl's brainstem. Proc. Nat. Acad. Sci., 85, 8311–8315.
de Ruyter van Steveninck, R. R., Bialek, W., Potters, M., Carlson, R. H., & Lewen, G. D. (1996). Adaptive movement computation by the blowfly visual system. In D. Waltz (Ed.), Natural and artificial parallel computation: Proc. of the 5th NEC Research Symposium (pp. 21–41). Philadelphia: SIAM.
Douglass, J. K., & Strausfeld, N. J. (2001). Pathways in dipteran insects for early motion processing. In J. M. Zanker & J. Zeil (Eds.), Motion vision: Computational, neural and ecological constraints (pp. 66–81). Berlin: Springer.
Egelhaaf, M., & Borst, A. (1989). Transient and steady-state response properties of movement detectors. J. Opt. Soc. Am. A, 6, 116–127.
Egelhaaf, M., Kern, R., Krapp, H. G., Kretzberg, J., Kurtz, R., & Warzecha, A. K. (2002). Neural encoding of behaviorally relevant visual-motion information in the fly. TINS, 25, 96–102.
Esch, H., & Burns, J. (1996). Distance estimation by foraging honeybees. J. Exp. Biol., 199, 155–162.
Euler, T., Detwiler, P. B., & Denk, W. (2002). Directionally selective calcium signals in dendrites of starburst amacrine cells. Nature, 418, 845–852.
Exner, S. (1894). Entwurf zu einer physiologischen Erklärung der psychischen Erscheinungen. I. Teil (pp. 37–140). Leipzig: Deuticke.
Fermi, G., & Reichardt, W. E. (1963). Optomotorische Reaktionen der Fliege Musca domestica. Kybernetik, 2, 15–28.
Götz, K. G. (1964). Optomotorische Untersuchungen des visuellen Systems einiger Augenmutanten der Fruchtfliege Drosophila. Kybernetik, 2, 77–92.
Grzywacz, N. M., & Amthor, F. R. (1993). Facilitation in on-off directionally selective ganglion cells in the rabbit retina. J. Neurophysiol., 69, 2188–2199.
Haag, J., Denk, W., & Borst, A. (2004). Fly motion vision is based on Reichardt detectors regardless of the signal-to-noise ratio. Proc. Nat. Acad. Sci., 101, 16333–16338.
Hassenstein, B., & Reichardt, W. (1956). Systemtheoretische Analyse der Zeit-, Reihenfolgen- und Vorzeichenauswertung bei der Bewegungsperzeption des Rüsselkäfers Chlorophanus. Zeitschrift für Naturforschung, 11b, 513–524.
Hausen, K., & Egelhaaf, M. (1989). Neural mechanisms of visual course control in insects. In D. Stavenga & R. Hardie (Eds.), Facets of vision (pp. 391–424). New York: Springer.
Hengstenberg, R., & Götz, K. G. (1967). Der Einfluß des Schirmpigmentgehalts auf die Helligkeits- und Kontrastwahrnehmung bei Drosophila-Augenmutanten. Kybernetik, 3, 276–285.
Higgins, C. M., Douglass, J. K., & Strausfeld, N. J. (2004). The computational basis of an identified neuronal circuit for elementary motion detection in dipterous insects. Vis. Neurosci., 21, 567–586.
Hildreth, E. C., & Koch, C. (1987). The analysis of motion: From computational theory to neuronal mechanisms. Ann. Rev. Neurosci., 10, 477–533.
Holt, G. R., & Koch, C. (1997). Shunting inhibition does not have a divisive effect on firing rates. Neural Comp., 9, 1001–1013.
Ibbotson, M. R., Clifford, C. W. G., & Mark, R. F. (1999). A quadratic nonlinearity underlies direction selectivity in the nucleus of the optic tract. Visual Neurosci., 16, 991–1000.
Kirchner, W. H., & Srinivasan, M. V. (1989). Freely flying honey bees use image motion to estimate object distance. Naturwissenschaften, 76, 281–282.
Krapp, H. G., & Hengstenberg, R. (1996). Estimation of self-motion by optic flow processing in single visual interneurons. Nature, 384, 463–466.
Land, M. F., & Collett, T. S. (1974). Chasing behaviour of houseflies (Fannia canicularis). J. Comp. Physiol., 89, 331–357.
Limb, J. O., & Murphy, J. A. (1975). Estimating the velocity of moving images in television signals. Comp. Graph. Im. Process., 4, 311–327.
Liu, S.-C. (2000). A neuromorphic aVLSI model of global motion processing in the fly. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 47, 1458–1467.
Potters, M., & Bialek, W. (1994). Statistical mechanics and visual signal processing. J. Phys. I France, 4, 1755–1775.
Reichardt, W. (1961). Autocorrelation, a principle for the evaluation of sensory information by the central nervous system. In W. A. Rosenblith (Ed.), Sensory communication (pp. 303–317). New York: Wiley.
Srinivasan, M. V. (1990). Generalized gradient schemes for the measurement of two-dimensional image motion. Biol. Cybern., 63, 421–431.
Srinivasan, M. V., Lehrer, M., Kirchner, W. H., & Zhang, S. W. (1991). Range perception through apparent image speed in freely flying honey bees. Visual Neurosci., 6, 519–535.
Srinivasan, M. V., Zhang, S. W., & Bidwell, N. (1997). Visually mediated odometry in honeybees. J. Exp. Biol., 200, 2513–2522.
Taylor, W. R., & Vaney, D. I. (2003). New directions in retinal research. Trends in Neurosci., 26, 379–385.
Thorson, J. (1966). Small-signal analysis of a visual reflex in the locust II. Frequency dependence. Kybernetik, 3, 56–66.
Torre, V., & Poggio, T. (1978). A synaptic mechanism possibly underlying directional selectivity to motion. Proc. Roy. Soc. London B, 202, 409–416.
Zanker, J. M., Srinivasan, M. V., & Egelhaaf, M. (1999). Speed tuning in elementary motion detectors of the correlation type. Biol. Cybern., 80, 109–116.
Received May 19, 2005; accepted February 17, 2006.
LETTER
Communicated by Stephen Jose Hanson
Excessive Noise Injection Training of Neural Networks for Markerless Tracking in Obscured and Segmented Environments

C. P. Unsworth
[email protected]
Department of Engineering Science, University of Auckland, Auckland 1001, New Zealand
G. Coghill [email protected] Department of Electrical and Computer Engineering, University of Auckland, Auckland 1001, New Zealand
In this letter, we demonstrate that the generalization properties of a neural network (NN) can be extended to encompass objects that obscure or segment the original image in its foreground or background. We achieve this by piloting an extension of the noise injection training technique, which we term excessive noise injection (ENI), on a simple feedforward multilayer perceptron (MLP) network with vanilla backward error propagation. Six tests are reported that show the ability of an NN to distinguish six similar states of motion of a simplified human figure that has become obscured by moving vertical and horizontal bars and random blocks for different levels of obscuration. Four more extensive tests are then reported to determine the bounds of the technique. The results from the ENI network were compared to results from the same NN trained on clean states only. The results provide strong evidence that it is possible to track a human subject behind objects using this technique, and thus this technique lends itself to a real-time markerless tracking system from a single video stream.

1 Introduction

One of the great advantages of neural networks (NN) is their ability to generalize (Bishop, 1995). This is the ability of a NN to correctly classify patterns that have never been presented to it before. A technique known as noise injection (Hanson, 1990; Sietsma & Dow, 1991; Matsuoka, 1992; Edwards & Murray, 1993; Guozhong, 1996; Grandvalet, Canu, & Boucheron, 1997) is well documented for improving the generalizing properties of an NN. Sietsma and Dow (1991) showed that this leads to an improvement of performance of the NN and requires the NN to have more weights

Neural Computation 18, 2122–2145 (2006)
© 2006 Massachusetts Institute of Technology
or hidden units. Levin, Lieven, and Lowenberg (2000) showed that noise injection does not depend on the type of noise used and that around 10% of noise can be injected into the input space to improve the generalizing properties of the network. Human motion analysis is a broad topic and well documented in Aggarwal and Cai (1999). This letter is concerned with tracking a human subject. There are two main approaches to tracking a human subject. The first approach is that of markered tracking, which requires multiple cameras and identifiable markers attached to key points on the subject (Wren, 1996; Gavrila & Davis, 1996). This form of tracking lends itself to gait analysis and sports performance measurements. One of the advantages of markered tracking is that if many cameras are employed, it is possible to locate the absolute positions of the markers when they become obscured by objects (known as segmentation; Aggarwal, 2003) or by body occlusions, using the different camera angles available. Disadvantages are that such systems are high in complexity and high in cost for the consumer. A second approach is that of markerless tracking of human motion (Calais & Legrand, 2000; Gao, Xu, & Tsuji, 1994). Markerless tracking aims to simplify this study considerably by removing identifiable markers on the subject and the need for multiple cameras. Markerless tracking systems are applicable to the same applications as markered tracking plus videoconferencing and security monitoring systems. The advantages and disadvantages of markerless tracking are reciprocal to those of markered tracking. After development costs, a realized markerless tracking system is low in complexity and low in consumer cost. Common problems with these systems arise when the image becomes segmented or obscured (Aggarwal, 2003) in the foreground or the background. Horizontal or vertical lines can typically interfere with some tracking algorithms for human subjects.
This results in the algorithms' locking onto incorrect objects in the foreground or background and becoming confused.

2 Motivation

A small amount of noise that has been synthetically added to the input images can be used to aid the generalization of an NN. The motivation for this work was to determine a method by which an object of interest could be successfully tracked behind clutter without markers. Our conjecture was that one could extend the noise injection concept such that it would not only generalize the NN in the standard way but would also allow the NN to take into account the physical obscuration or segmentation that may be part of an input image. In this letter, we explore this topic by piloting a method of training an NN with what we have termed excessive noise injection (ENI). We demonstrate that training an NN with ENI allows the excessive noise-injected neural network (ENINN) to perform markerless tracking of an object of interest in a segmented or obscured
Figure 1: Six states of a walking motion (S1–S6).
environment, thus supporting the original conjecture. We demonstrate this with a simplified human figure, a result with direct applicability to the field of human motion tracking and of interest to the field of human vision (Johansson, 1973; Schiffrar & Lorenceau, 1996; Gavrila, 1999). However, the technique is object independent and equally applicable to any object that one might wish to track in a cluttered environment.

3 Method

To demonstrate our conjecture, a simplified figure of a human performing six states of a walking motion was designed to a similar prescription as that in Gao et al. (1994), shown in Figure 1. Each of the six states of the figure consisted of an 80 × 40 image containing 3200 pixels. The human figure was in black and the background in white. It was decided to train the NN on black and white images rather than color images. In using color, one provides another degree of freedom (the colors themselves), which serves to act as distinct features for the NN to pick up on and distinguish the image. Therefore, using black and white reduces the degrees of freedom the NN can work with in order to discern one image from another and thus poses a harder problem than the color case. For each figure, 200 ENI versions of the clean image were generated. We elected to introduce the new term excessive noise injection in order to differentiate it from standard noise injection (around 10%), which is employed for standard generalization of an NN. Each ENI version consisted of 25% (6 dB) of noise added to the clean image. The noise was added by randomly flipping 25%, or 800, of the pixels to the opposite color. An example of an ENI image is given in Figure 2. The ENI images were then used to train the NN to be robust to obscuration in the image.
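The ENI procedure just described (flip a randomly chosen 25% of the 3,200 pixels, 200 versions per clean state) can be sketched as follows. This is our own minimal Python illustration; the original experiments used the Stuttgart Neural Network Simulator, and the function names here are hypothetical.

```python
import random

def eni_image(clean, noise_frac, rng):
    """Flip a randomly chosen fraction of a binary image's pixels
    to the opposite colour (excessive noise injection)."""
    noisy = list(clean)
    n_flip = int(round(noise_frac * len(clean)))
    for i in rng.sample(range(len(clean)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy

def build_training_set(states, noise_frac=0.25, versions=200, seed=0):
    """200 ENI versions of each clean state image, shuffled, as in Section 3."""
    rng = random.Random(seed)
    data = [(eni_image(img, noise_frac, rng), label)
            for label, img in enumerate(states)
            for _ in range(versions)]
    rng.shuffle(data)  # the order is re-shuffled before each training cycle
    return data

# An 80 x 40 image has 3200 pixels; 25% ENI flips exactly 800 of them.
clean = [0] * 3200
noisy = eni_image(clean, 0.25, random.Random(1))
assert sum(a != b for a, b in zip(clean, noisy)) == 800

train = build_training_set([clean] * 6)
assert len(train) == 1200  # 6 states x 200 ENI versions
```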
The NN used was a simple feedforward MLP trained with vanilla backward error propagation (Rumelhart, Hinton, & Williams, 1986; Ham & Kostanic, 2001; O'Reilly, 2001; Haykin, 2002), consisting of 3200 input neurons (the number of pixels in the image), one hidden layer of 10 neurons, and 6 output neurons representing the six states of the walking motion. Each of the 6 output neurons would be trained to respond with a 1 for correct
Figure 2: A 25% ENI image used for training.
Figure 3: Schematic of the feedforward MLP with backpropagation network used (input layer: 3200 neurons; hidden layer: 10 neurons; output layer: 6 neurons).
identification of the state of motion and a 0 for nonidentification. This is shown in Figure 3. The Stuttgart Neural Network Simulator (SNNS) (Zell et al., 1995) was employed to obtain the results reported here. The NN was then trained by passing all of the ENI images of the six states of motion (1200 images) to it. The weights of the NN were adjusted under standard backpropagation rules (Ham & Kostanic, 2001). Then the order of the 1200 ENI images was randomly shuffled and passed back to the NN to continue the learning process. This training cycle was repeated 100 times. Thus, the NN was trained on 120,000 presentations of the ENI images. After the training phase had finished, the validation phase was performed. This was done by passing the NN 600 new ENI images (100 for each phase of motion) that were different from those of the training ENI images. The training and validation graphs are shown in Figure 4. The lower line corresponds to the training data, and the upper line corresponds to the validation set. During the validation phase, 595 images out of the 600 images were correctly
Figure 4: Training and validation plot of the network.
identified, giving an accuracy of 99% when training with 25% ENI and validating with images with a 25% level of obscuration.

4 Results

What follows is a series of six tests that were devised to determine how well the 25% ENINN could identify images behind obscuring objects. The first three tests were designed to examine how the ENINN performs for moving vertical and horizontal bar obscuration. This was because vertical and horizontal bars can be problematic for some markerless tracking algorithms. Test 4 was designed to determine how the ENINN performed when the input images were obscured by a collection of random blocks. For tests 1 to 4, the obscuration level used was the same as the level of noise the ENINN was trained on: 25%. Tests 5 and 6 were for vertical and horizontal bars and random blocks, respectively. In these tests, the obscuration level was twice as high as the level of noise the ENINN was trained on: 50%. 4.1 Test 1: 25% Obscuration as a Moving Horizontal Bar. The first test consisted of 25% obscuration, which was localized in the form of a moving horizontal bar. This test was applied to states 5 and 3. The four input images shown in Figure 5 are I1-I4. The outputs (N1-N6) are the responses from the six neurons in the output layer, shown in Table 1. N5 correctly identifies the 25% obscured image. Another point to note is that although the ENINN was trained by injecting random noise onto the image, the image was detected behind the obscuring object, which was not random-like. This seems contrary to the way the network was trained. One explanation is that since the ENINN successfully detected the image
Figure 5: 25% obscured input images (I1-I4) of state 5.

Table 1: Network Output for Figure 5 Images.

        N1        N2        N3        N4        N5        N6
I1   0         0.00006   0.00001   0.02259   0.99806   0.00759
I2   0         0         0.00198   0.00022   0.99982   0.00143
I3   0.00043   0.00012   0.00001   0.00008   0.99999   0.00002
I4   0.00004   0.0003    0         0.00002   0.99937   0.14115
behind an obscuration of this type (although the obscuring object does not seem like noise to us), the obscuring object could be viewed as noise that is localized to a region of space. This test was also applied to state 3. The input images of state 3 obscured by a horizontal bar are displayed in Figure 6, and the network results are shown in Table 2. 4.2 Test 2: 25% Obscuration as a Moving Vertical Bar. The second test consisted of 25% obscuration, which was localized in the form of a moving vertical bar. This test was applied to states 5 and 3. Figure 7 shows the four inputs to the network, and Table 3 demonstrates how N5 identifies the correct state, 5. Figure 8 shows the input to the network for state 3, and Table 4 shows the network's output. State 3 was chosen because it is one state that could be easily mistaken for its adjacent states. With this in mind, one would expect the network to have some difficulty with obscuration in the vertical sense. Nevertheless, the network correctly identifies state 3 for the horizontal bar and the vertical bar. In the vertical case, there is a slight contention, as the second input image (I2) obtains a 0.38171 result from N2. But this is still small in comparison to the 0.89297 result obtained from the correct neuron N3. 4.3 Test 3: 25% Obscuration as Two Moving Horizontal or Vertical Bars. The next test localized 25% obscuration in the form of two moving horizontal bars (see Figure 9) and then two moving vertical bars (see Figure 10). The results are given in Tables 5 and 6, respectively. This
Figure 6: 25% obscured input images (I1-I4) of state 3.

Table 2: Network Output for Figure 6 Images.

        N1        N2        N3        N4        N5        N6
I1   0.00034   0.00985   0.99907   0.00013   0         0.00066
I2   0.00005   0.00002   0.99987   0.00006   0.00021   0.00011
I3   0.00001   0.00021   0.99947   0.00006   0.00125   0.0006
I4   0.00013   0.00007   0.99997   0.00028   0         0.00007
Figure 7: 25% vertically obscured input images of state 5.

Table 3: Network Output for Figure 7 Images.

        N1        N2        N3        N4        N5        N6
I1   0.00001   0         0.00001   0.00158   1         0.00056
I2   0.00011   0.06067   0         0.00034   0.9978    0.00101
I3   0.00055   0.00003   0.00002   0.0098    0.98955   0.00053
I4   0.00002   0.00001   0         0.00239   0.99958   0.00005
is applied to state 2, which is another difficult state to resolve. This test was devised to allow more opportunities for the two bars to obscure prominent features of the image than in the single-bar case. Neuron N2 correctly identifies state 2 for both the horizontal and vertical cases of two moving bars. Table 6 shows that the first input image (I1) obtains a 0.52169 result from N3, but this is still small in comparison to the 0.98571 result obtained from the correct neuron N2.
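The decision rule implicit in these tables (the predicted state is the output neuron with the largest response, and the gap to the runner-up indicates contention) can be sketched as follows. The `classify` helper is our own illustration, not part of the original SNNS setup.

```python
def classify(outputs):
    """Predicted state = index of the output neuron with the largest
    response (returned 1-based); the gap to the runner-up flags
    'contention' cases between similar states."""
    ranked = sorted(range(len(outputs)), key=outputs.__getitem__, reverse=True)
    best, second = ranked[0], ranked[1]
    return best + 1, outputs[best] - outputs[second]

# Table 4, image I2 (state 3 behind a vertical bar): N2 shows some
# contention at 0.38171, but N3 still wins clearly at 0.89297.
state, margin = classify([0.00239, 0.38171, 0.89297, 0.0, 0.00003, 0.00063])
assert state == 3
assert 0.51 < margin < 0.52
```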
Figure 8: 25% vertically obscured input images of state 3.

Table 4: Network Output for Figure 8 Images.

        N1        N2        N3        N4        N5        N6
I1   0.00007   0.00019   0.99967   0.00024   0.00002   0.00106
I2   0.00239   0.38171   0.89297   0         0.00003   0.00063
I3   0         0.00004   0.99936   0.03921   0.00001   0.00027
I4   0.00005   0.00015   0.99996   0.00674   0.00001   0.00001
Figure 9: 25% horizontally obscured input images of state 2.
Figure 10: 25% vertically obscured input images of state 2.
4.4 Test 4: 25% Obscuration as Sixteen Random Blocks. In this test, 25% obscuration was localized in the form of 16 (10 × 5) random blocks. This was applied to all states (twice per state) displayed in Figure 11. This test was devised to allow the greatest degree of freedom for the obscuration of prominent features to occur. The results from the network are displayed
Table 5: Network Output for Figure 9 Images.

        N1        N2        N3        N4        N5        N6
I1   0.00004   0.99995   0.00156   0.00024   0         0.00031
I2   0.01044   0.99982   0.00077   0         0.00002   0
I3   0.00021   0.99926   0.00013   0         0.00053   0.0001
I4   0.0001    0.99998   0         0.00003   0.00075   0.00014
Table 6: Network Output for Figure 10 Images.

        N1        N2        N3        N4        N5        N6
I1   0.00083   0.98571   0.52169   0.00036   0         0.00025
I2   0.01544   0.99762   0.00021   0.00063   0.00002   0.00004
I3   0.00071   0.99999   0.00002   0.00001   0.00003   0
I4   0.00012   0.99923   0         0         0.00272   0.00027
in Table 7. The network correctly identifies all the images that had been obscured by 25% of noise localized in the form of 16 random blocks. 4.5 Test 5: 50% Obscuration, Two Moving Vertical and Horizontal Bars. In the two following tests, twice as much noise, 50% (3 dB), was applied to the testing images as the ENINN was originally trained on. The 50% obscuration was localized in the form of two moving vertical and horizontal bars applied to state 2. The input images are shown in Figure 12. Table 8 shows how the NN correctly identifies state 2. One point to note here is that there is an overlap of the bars of 200 pixels, which reduces the level of obscuration to 44%, or about 3.6 dB. Although this is less than the 50% stated, it is significantly more than the 25% ENI that the NN was trained on. Since the ENINN was trained on 25% of noise, one might expect the network not to identify the image. However, it is quite clear that the NN does discern the correct image from the other similar states. Just as a human can perceive an image behind obscuration, it seems that the ENINN does not need all the information to discern the state of the image. The NN is probably picking up on clusters of data that appear in state 2 and not in other states in order to identify the image. 4.6 Test 6: 50% Obscuration as Thirty-Two Random Blocks. In test 6 (see Figure 13), we apply 50% obscuration localized in the form of 32 (10 × 5) random blocks, applied to all states (once per state). The results are shown in Table 9: the network identifies the correct state of motion in each case. For input image I6, although neuron N6 identifies state 6 with a response several times higher than any of the other neurons, it is still a low value. In this event, even
Figure 11: 25% random block obscuration input images.
though the correct neuron identified the correct state, one would probably choose to reassess this image.

5 Bounds of the Technique

Four extensive tests now follow that examine the bounds of the technique for the six states of motion presented. Test 7 examines how a 25% trained ENINN performed when presented with 100 images for each state with 50%
Table 7: Network Output for Figure 11 Images (two images, a and b, per state).

         N1        N2        N3        N4        N5        N6
I1a   0.9997    0.00018   0         0.00459   0.00806   0.00001
I1b   0.99378   0.01716   0         0         0.00338   0.00014
I2a   0.00018   0.99985   0.10209   0         0.00015   0.00002
I2b   0.0074    0.98698   0.00039   0.00001   0.00012   0.00004
I3a   0.00001   0.00026   0.99848   0.08134   0.0004    0.00044
I3b   0.00001   0.00026   0.99897   0.0001    0.0001    0.00069
I4a   0.00001   0.00001   0.00005   0.99574   0.00212   0.00093
I4b   0.00007   0.00039   0.00003   0.99921   0.00085   0.00149
I5a   0.00024   0.00019   0         0.00352   0.9997    0
I5b   0.00001   0.00387   0.00007   0.00207   0.97629   0.00003
I6a   0.00077   0.00003   0.00006   0.00096   0.00656   0.99954
I6b   0.00046   0.00001   0.00016   0.00139   0.00146   0.96858
Figure 12: 50% vertical and horizontal obscuration input images.

Table 8: Network Output for Figure 12 Images.

        N1        N2        N3        N4        N5        N6
I1   0.0003    0.99987   0.00274   0.00001   0.00002   0.00017
I2   0.0001    0.99955   0.00002   0.00006   0.00124   0.00003
I3   0.00034   0.96913   0.00001   0.00001   0.02627   0.00059
I4   0.00088   0.70501   0.01252   0.00013   0         0.09599
of obscuration in the form of random blocks. Test 8 examines how a 25% trained ENINN performed when presented with 100 images for each state with 75% of obscuration in the form of random blocks. Together, tests 7 and 8 will determine how much obscuration can be applied to a 25% ENINN before it starts to degrade in performance. Test 9 examines how a 50% trained ENINN performed when presented with 100 images for each state with 50% of obscuration in the form of random blocks. Test 10 examines how a 50% trained ENINN performed when presented with 100 images for each state with 75% of obscuration in the form of random blocks. Together,
Figure 13: 50% random block obscuration input images.
tests 7 to 10 will determine the maximum amount of ENI that an NN can be trained and tested on before it overgeneralizes and cannot distinguish one image from another. 5.1 Test 7: 25% Trained ENINN Presented with 50% Obscuration as Random Blocks. The results of test 7 for each state are given in Figure 14. The way to interpret the following plots is now explained. For state 1, the upper left plot of Figure 14, 100 images of state 1 with 50% of obscuration in the form of random blocks were presented to the ENINN. The connected
Table 9: Network Output for Figure 13 Images.

        N1        N2        N3        N4        N5        N6
I1   0.72369   0.01193   0.00058   0.00874   0.00061   0.00002
I2   0.00082   0.99719   0.00405   0.02308   0.00009   0
I3   0.00047   0.0001    0.99521   0.00023   0.00137   0
I4   0.00027   0.00001   0.00023   0.97582   0.05142   0.00003
I5   0.00024   0         0.00612   0.00054   0.97049   0.00009
I6   0.00004   0.00035   0.00412   0.00004   0.01529   0.07713
Figure 14: Output from 25% trained ENINN presented with 50% obscuration as random blocks.
Table 10: Output from 25% Trained ENINN: Number of Incorrectly Identified Images per State.

Obscuration Level    1    2    3    4    5    6   Mean    SD
50%                  4    1    1    0    1    1    1.3   1.3
75%                  9    7   22    0   12   16   11     7.6
circles represent the values returned by the network from the desired output neuron—in this case, neuron 1. The crosses represent the values returned from the nondesired output neurons—namely, neurons 2 to 6. Optimally, one would expect the output from the desired neuron to be unity and the output from the nondesired neurons to be close to zero. For each case, a count was made of the number of times a nondesired neuron's value was higher than the desired neuron's value. A maximum count of one was given for each test image. Thus, if more than one of the nondesired neurons had a value higher than the desired neuron, as occurs for sample 36 in state 1 (where two cross markers are higher than a circular marker), a maximum count of 1, not 2, is given. From a visual inspection of the results of test 7, one can see that the output from the desired neurons for all six states was significantly larger than those of the undesired states. Very few incorrect predictions occurred. This is probably due to the fact that the NN does not need all the information to discern the state of the image. The NN is probably picking up on clusters of data that appear in the desired state and not in other states in order to identify the image. Thus, it is possible to have more obscuration than the ENINN is trained for. Also, it should be noted that some states are easier to recognize than others (e.g., state 4 in Figure 14). Thus, recognition is image dependent. Tabulated results of the number of incorrect predictions, mean, and standard deviation across all states for Figure 14 are given in Table 10. 5.2 Test 8: 25% Trained ENINN Presented with 75% Obscuration as Random Blocks. The results of test 8 for each state are given in Figure 15. As one can see, more incorrect predictions are made than for 50% obscuration. Again some states are easier to identify than others.
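The counting rule just described — an image scores at most one error, no matter how many nondesired neurons beat the desired one — can be sketched as follows; the function name is our own.

```python
def count_errors(desired, results):
    """Number of incorrectly identified images in a batch: an image is
    counted (at most once) when ANY nondesired neuron responds more
    strongly than the desired neuron."""
    errors = 0
    for outputs in results:
        target = outputs[desired]
        if any(v > target for i, v in enumerate(outputs) if i != desired):
            errors += 1  # capped at 1 even if several neurons beat the target
    return errors

# Desired neuron 0 (state 1): the second image has two nondesired
# neurons above the desired one but still contributes a single count.
batch = [
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.3, 0.6, 0.5, 0.0, 0.0, 0.0],
]
assert count_errors(0, batch) == 1
```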
The results for the 25% ENINN when presented with images with 50% and 75% obscuration are tabulated in Table 10. The mean and standard deviation were determined across all six states. Thus, at 50% obscuration an average of about 1% of predictions are incorrect, and at 75% obscuration an average of 11% are incorrect. Some states (e.g., state 3) become harder to discern than others as the amount of obscuration increases.
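As a quick arithmetic check, the Mean column of Table 10 and the sample standard deviation of its 75% row can be recomputed from the per-state error counts:

```python
from statistics import mean, stdev

errors_50 = [4, 1, 1, 0, 1, 1]     # Table 10, 50% obscuration row
errors_75 = [9, 7, 22, 0, 12, 16]  # Table 10, 75% obscuration row

assert round(mean(errors_50), 1) == 1.3
assert round(mean(errors_75)) == 11
assert round(stdev(errors_75), 1) == 7.6  # sample standard deviation
```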
Figure 15: Output from 25% trained ENINN presented with 75% obscuration as random blocks.
5.3 Test 9: 50% Trained ENINN Presented with 50% Obscuration as Random Blocks. Tests 9 and 10 train the ENINN with 50% (3 dB) of noise and then test it by presenting the ENINN with 100 images of each state carrying 50% and 75% obscuration, respectively. An example of a typical state 1, 50% ENI image used for training is
Figure 16: A clean state 1 image and a 50% ENI state 1 image.
shown in Figure 16. The clean version of state 1 is located next to it for comparison. If one looks closely, the feet of the subject are partly discernible. The results for test 9 for each state are shown in Figure 17. As Figure 17 shows, the 50% trained ENINN performs badly at this level of training noise. We believe this is because not enough clusters of pertinent information are retained in the training images for the network to differentiate between one state and another. This implies that at this level of ENI, the ENINN overgeneralizes and cannot distinguish one state from another. Similar results were found when the 50% trained ENINN was tested on images with 75% obscuration. Both results are tabulated in Table 11. It is clear from the tabulated results that an average of approximately 84% incorrect predictions are made for obscuration levels of 50% and 75%. The results in Tables 10 and 11 can be used to plot general performance charts of the ENINN for each of the states. The general performance charts are shown in Figure 18. Figure 18 clearly shows the performance of the ENINN for different levels of ENI in the training and amount of obscuration placed on the testing images. For all six states of motion, it is quite evident that an ENI level of 25% allows the ENINN to cope with segmentation and obscuration of 50% extremely well, with 1% errors, and at 75% with 11% errors. As the level of ENI increases to the 50% region, the ENINN gives errors of 84%.

6 Performance Comparison of the NN Trained with Zero Noise Injection

This section describes how the NN performed when trained on the six clean states only, with no noise injection. The six clean states were shuffled and passed to the NN until the states were learned by the network. Table 12 shows how the NN had sufficiently learned the six states. Clearly, the NN successfully detects the six states of motion when trained with no NI. The
Figure 17: Output from 50% trained ENINN presented with 50% obscuration as random blocks.
noisy data from tests 1, 2, 3, and 5 were then passed to the cleanly trained NN. The performance of the NN is shown in Tables 13 to 19. Only 36% (10 of 28 images) were detected using the cleanly trained NN for tests 1, 2, 3, and 5. In comparison, the ENINN detected 100%. Thus, a significant improvement was obtained using the ENINN in these initial tests.
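The network being compared in these sections can be sketched as follows. This is our own NumPy illustration of a single-hidden-layer sigmoid MLP with vanilla backward error propagation, not the original SNNS implementation; the input is shrunk from 3,200 to 32 pixels so the sketch runs quickly, and all names and hyperparameters are hypothetical.

```python
import numpy as np

# Scaled-down analogue of the paper's 3200-10-6 network.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 32, 10, 6
W1 = rng.normal(0.0, 0.1, (n_hid, n_in))
W2 = rng.normal(0.0, 0.1, (n_out, n_hid))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    h = sigmoid(W1 @ x)        # hidden-layer activations
    return h, sigmoid(W2 @ h)  # output-layer activations

def train_step(x, t, lr=0.5):
    """One vanilla backprop update (gradient descent on 1/2 * SSE)."""
    global W1, W2
    h, y = forward(x)
    delta_out = (y - t) * y * (1.0 - y)             # output error signal
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)  # back-propagated signal
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)
    return float(np.sum((y - t) ** 2))

# Toy stand-ins for the six clean states, with 1/0 targets as in Figure 3;
# ENI training would instead cycle over shuffled noise-injected images.
states = (rng.random((6, n_in)) > 0.5).astype(float)
targets = np.eye(n_out)

first = last = None
for cycle in range(200):  # the paper used 100 cycles over 1200 ENI images
    total = sum(train_step(states[k], targets[k]) for k in range(6))
    first = total if first is None else first
    last = total

assert last < first  # the training error decreases
```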
Table 11: Output from 50% Trained ENINN: Number of Incorrectly Identified Images per State.

Obscuration Level    1    2    3    4    5    6   Mean    SD
50%                 84   88   68   79   85   93   83.8   8.6
75%                 75   88   67   90   85   91   82.7   9.6
Figure 18: General performance charts for the ENINN for the six states of motion: (a) 25% trained ENINN, (b) 50% trained ENINN.

Table 12: Cleanly Trained NN Performance on Six States (Network Output).

        N1        N2        N3        N4        N5        N6
I1   0.96977   0.01191   0.00792   0.00846   0.00530   0.01898
I2   0.00823   0.97828   0.01705   0.00040   0.00065   0.00876
I3   0.00218   0.01661   0.97398   0.00886   0.01918   0.01598
I4   0.01148   0.00746   0.01429   0.97848   0.00482   0.01146
I5   0.00429   0.00949   0.00290   0.00916   0.97798   0.01514
I6   0.02783   0.00055   0.01641   0.01639   0.01626   0.96958
Table 13: Cleanly Trained NN Output for Images of Figure 6: Test 1—Gait 3, Horizontal Bar.

        N1        N2        N3        N4        N5        N6
I1   0.25259   0.04587   0.01316   0.43446   0.00060   0.00144
I2   0.04958   0.00265   0.10766   0.00386   0.74338   0.32277
I3   0.00138   0.17598   0.03762   0.02917   0.15914   0.00148
I4   0.10396   0.00733   0.10574   0.62934   0.00082   0.54964
Table 14: Cleanly Trained NN Output for Images of Figure 5: Test 1—Gait 5, Horizontal Bar.

        N1        N2        N3        N4        N5        N6
I1   0.36868   0.22533   0.00105   0.39275   0.00160   0.04688
I2   0.02488   0.00165   0.08509   0.02006   0.88511   0.20263
I3   0.00239   0.01664   0.01063   0.01400   0.67422   0.00042
I4   0.72152   0.00941   0.00131   0.00348   0.01372   0.32227
Table 15: Cleanly Trained NN Output for Images of Figure 7: Test 2—Gait 5, Vertical Bar.

        N1        N2        N3        N4        N5        N6
I1   0.01579   0.00012   0.00425   0.01416   0.83916   0.00534
I2   0.00571   0.00242   0.00123   0.03010   0.30000   0.01207
I3   0.06823   0.04226   0.00772   0.00338   0.00903   0.19619
I4   0.01480   0.00032   0.01549   0.00042   0.73287   0.20428
Table 16: Cleanly Trained NN Output for Images of Figure 8: Test 2—Gait 3, Vertical Bar.

        N1        N2        N3        N4        N5        N6
I1   0.20898   0.02661   0.05028   0.31068   0.79725   0.00207
I2   0.29838   0.01540   0.12334   0.04745   0.00312   0.01088
I3   0.17068   0.12153   0.03596   0.00177   0.00151   0.25829
I4   0.09587   0.00162   0.70543   0.01004   0.06657   0.20264
Table 17: Cleanly Trained NN Output for Images of Figure 9: Test 3—Gait 2, Two Horizontal Bars.

        N1        N2        N3        N4        N5        N6
I1   0.00489   0.88235   0.23118   0.02933   0.00147   0.01430
I2   0.01182   0.02071   0.05195   0.00858   0.00005   0.00641
I3   0.12336   0.01145   0.15884   0.00059   0.02747   0.09789
I4   0.46085   0.96221   0.02083   0.00262   0.00252   0.33148
Table 18: Cleanly Trained NN Output for Images of Figure 10: Test 3—Gait 2, Two Vertical Bars.

        N1        N2        N3        N4        N5        N6
I1   0.01564   0.24621   0.00285   0.00360   0.03234   0.05786
I2   0.02168   0.10940   0.04299   0.00214   0.00055   0.27214
I3   0.54083   0.40633   0.23138   0.08105   0.00031   0.07037
I4   0.01535   0.36749   0.77264   0.00008   0.00330   0.79547
Table 19: Cleanly Trained NN Output for Images of Figure 12: Test 5—Gait 2, Two Horizontal and Vertical Bars.

        N1        N2        N3        N4        N5        N6
I1   0.00411   0.04181   0.50766   0.00902   0.00238   0.08841
I2   0.03359   0.00365   0.03366   0.04584   0.00009   0.05982
I3   0.01832   0.03139   0.03846   0.00099   0.02273   0.44582
I4   0.10695   0.49596   0.06902   0.00629   0.00636   0.08321
Figure 19: Response of cleanly trained NN to 100 images of gait 5 at the (a) 50% and (b) 75% noise level.
Tests 7 and 8 were then repeated for the cleanly trained NN. (Tests 4 and 6 were not repeated because tests 7 and 8 are more intensive versions of tests 4 and 6.) Figure 19 shows the response of the cleanly trained NN to 100 images of gait 5 at the 50% and 75% noise level. The cleanly trained NN responded in a similar way for the other gaits at the corresponding noise levels. Randomly guessing a state would give a one in six chance of getting the correct state. Thus, over 100 images, one would expect to get 17 correct
Table 20: Output from Cleanly Trained NN: Number of Incorrectly Identified Images per State.

Obscuration Level    1    2    3    4    5    6   Mean    SD
50%                 88   82   79   86   85   87   84.5   3.4
75%                 88   89   94   90   98   81   90     5.8
states identified. Therefore, by random guesswork, one would expect 83 incorrectly identified states. Comparing this result to Table 20, it is obvious that random guesswork would provide a better result than using a cleanly trained NN. The likely reason that the cleanly trained NN performs so badly is that it is not allowed to randomly guess an image. Rather, it is constrained to make a prediction from the measurements it makes. When the measurements are cluttered excessively with noise, its prediction becomes highly inaccurate. In comparison, for the ENINN results in Table 10, an average of 1.3 errors occurs in every 100 images at the 50% noise level, and an average of 11 errors occurs in every 100 images at the 75% noise level. Thus, the ENINN is extremely noise robust.

7 The Effect of Scaling

Introducing more states to the problem increases what is termed the scale of the problem. How robust such a technique is to scaling is an important issue. The scaling problem is application dependent. Obviously, an application that requires a large sequence of gait positions is going to lead to a network that takes a very long time to train, if it trains at all. One way to reduce the training period would be to reduce the effective scale by extracting the key states of motion from a continuous video stream (as we have demonstrated with the human motion example). By capturing and identifying these key positions first, a continuous video stream could then be broken down into a sequence of states. The continuous sequence could then be approximated using application software (Abrosoft Fantamorph) that is able to morph between key states in order to produce a flowing image similar in appearance to the original sequence. Although this is a suggestion, we accept that not all applications will be practical in scaling terms using this technique.
8 Conclusion

In this letter, it has been demonstrated that the generalization properties of a neural network (NN) can be extended to encompass objects that obscure or segment the original image in its foreground and background. We achieve this by piloting an extension of the noise injection training technique, which
we term excessive noise injection, on a simple feedforward multilayer perceptron (MLP) network with standard backward propagation; one could, however, apply the ENI training technique to other NN types. Six simple tests were reported that show the ability of a 25% trained ENINN to distinguish among six similar states of motion of a simplified human figure that have become obscured by moving vertical and horizontal bars and random blocks for levels of obscuration of 25% and 50%. The results showed a very high success rate. One important observation that we drew from the first six tests was that although the ENINN was trained by injecting random noise onto the image, the image was detected behind the obscuring object, which was not random. This seems contrary to the way the network was trained. One explanation is that since the ENINN successfully detected the image behind an obscuration of this type (even though the obscuring object does not seem like noise to us), the obscuring object could be viewed as noise that is localized to a region of space. A further four extensive tests using random block obscuration were then reported to determine the bounds of the technique. A 25% trained ENINN was presented with levels of obscuration of 50% and 75%. This gave a 1% error rate at 50% obscuration and an 11% error rate at 75% obscuration. In comparison, results for the same NN trained with no NI gave an 85% error rate at 50% obscuration and a 90% error rate at 75% obscuration (both proved worse than randomly guessing the state). Thus, the ENINN significantly outperformed its non-NI counterpart. This implies that the ENINN requires only part of the information from images to discern one image from another. Similarly, a 50% trained ENINN was presented with levels of obscuration of 50% and 75%. This gave an 84% error rate at both the 50% and 75% obscuration levels.
This implies that overgeneralization was occurring at 50% ENI and that the ENINN could not differentiate one state from another. From the results, one can say that a level of around 25% ENI can be used successfully to identify images with up to 75% of object obscuration or segmentation, and that a region can be said to exist between the standard generalization and overgeneralization regions, at approximately 10% to 25% noise injection, that allows for obscuring objects to be encompassed into the generalization of an NN. The results provide strong evidence that it could be possible to track a subject behind objects using this technique, which thus lends itself to a real-time markerless tracking system from a single video stream.

References

Aggarwal, J. K. (2003). Problems, ongoing research and future directions in motion research. Machine Vision and Applications, 14(4), 199-201.
C. P. Unsworth and G. Coghill
Excessive Noise Injection for Markerless Tracking
Received June 17, 2005; accepted February 17, 2006.
LETTER
Communicated by Arnaud Delorme
Analytical Integrate-and-Fire Neuron Models with Conductance-Based Dynamics for Event-Driven Simulation Strategies Michelle Rudolph [email protected]
Alain Destexhe [email protected] Unité de Neurosciences Intégratives et Computationnelles, CNRS, 91198 Gif-sur-Yvette, France
Event-driven simulation strategies were proposed recently to simulate integrate-and-fire (IF) type neuronal models. These strategies can lead to computationally efficient algorithms for simulating large-scale networks of neurons; most important, such approaches are more precise than traditional clock-driven numerical integration approaches because the timing of spikes is treated exactly. The drawback of such event-driven methods is that, in order to be efficient, the membrane equations must be solvable analytically, or at least provide simple analytic approximations for the state variables describing the system. This requirement prevents, in general, the use of conductance-based synaptic interactions within the framework of event-driven simulations and, thus, the investigation of network paradigms where synaptic conductances are important. We propose here a number of extensions of the classical leaky IF neuron model involving approximations of the membrane equation with conductance-based synaptic current, which lead to simple analytic expressions for the membrane state and therefore can be used in the event-driven framework. These conductance-based IF (gIF) models are compared to commonly used models, such as the leaky IF model or biophysical models in which conductances are explicitly integrated. All models are compared with respect to various spiking response properties in the presence of synaptic activity, such as the spontaneous discharge statistics, the temporal precision in resolving synaptic inputs, and gain modulation under in vivo–like synaptic bombardment. Being based on the passive membrane equation with fixed-threshold spike generation, the proposed gIF models are situated between leaky IF and biophysical models but are much closer to the latter with respect to their dynamic behavior and response characteristics, while still being nearly as computationally efficient as simple IF neuron models.
gIF models should therefore provide a useful tool for efficient and precise simulation of large-scale neuronal networks with realistic, conductance-based synaptic interactions. Neural Computation 18, 2146–2210 (2006)
© 2006 Massachusetts Institute of Technology
1 Introduction

Computational modeling approaches face a problem linked to the size of neuronal populations necessary to describe phenomena that are relevant at macroscopic biological scales, for example, at the level of the neocortex or visual cortex. Tens to hundreds of thousands of neurons, each synaptically linked with tens of thousands of others, organized in computational “modules” (e.g., Mountcastle, 1997), might be necessary to capture emergent functional properties and realistic dynamic behaviors. However, simulations of such modules are still beyond the limits of currently available conventional computational hardware if the neuronal units, as building blocks of the network module, are endowed with biophysically realistic functional dynamics. In such a case, the only reasonable compromise is to trade off complex dynamics in neuronal units against the achievable scale of the simulated network module. Indeed, reducing the neuronal dynamics down to that of simple integrate-and-fire (IF) neurons allows the efficient modeling of networks with tens to hundreds of thousands of sparsely interconnected neurons (e.g., Brunel, 2000; Wielaard, Shelley, McLaughlin, & Shapley, 2001; Shelley, McLaughlin, Shapley, & Wielaard, 2002; Delorme & Thorpe, 2003; Mehring, Hehl, Kubo, Diesmann, & Aertsen, 2003; Hill & Tononi, 2005). Another way to optimize neural network simulations is to search for more efficient modeling strategies. In biophysical models, neuronal dynamics is described by systems of coupled differential equations. Such systems are in general not analytically solvable, and numerical methods based on a discretization of space and time remain the principal simulation tool. A variety of techniques exist, which all have in common that the state variables of the system in question are evaluated at specific points of a discretized time axis.
In such synchronous or clock-driven approaches, the algorithmic complexity and, hence, computational load scale linearly with the chosen temporal resolution. The latter introduces an artificial cutoff for timescales captured by the simulation and in this way sets strict limits for describing short-term dynamical transients (Tsodyks, Mit’kov, & Sompolinsky, 1993; Hansel, Mato, Meunier, & Neltner, 1998) and might have a crucial impact on the accuracy of simulations with spike-timing-dependent plasticity (STDP) or dynamic synapses (e.g., Markram, Wang, & Tsodyks, 1998; Senn, Markram, & Tsodyks, 2000). Recently, a new and efficient approach was proposed (Watts, 1994; Mattia & Del Giudice, 2000; Reutimann, Giugliano, & Fusi, 2003) that sets the algorithmic complexity free from its dependence on the temporal resolution and, thus, from the constraints imposed on timescales of involved biophysical processes. In such asynchronous or event-driven approaches, the gain in accuracy is counterbalanced by the fact that the computational load scales linearly with the number of events, that is, with the average activity,
in the network. The latter strongly constrains the dynamic regimes that can be simulated efficiently. However, a coarse evaluation of this activity-dependent computational load suggests that the event-driven simulation strategy remains an efficient and more exact alternative to clock-driven approaches if cortical activity typically seen in vivo is considered (for a review, see Destexhe, Rudolph, & Paré, 2003). The event-driven approach was successfully applied in a variety of contexts. These range from networks of spiking neurons with complex dynamics, focused on hardware implementations (Watts, 1994; Giugliano, 2000; Mattia & Del Giudice, 2000), up to networks of several hundreds of thousands of neurons modeling the processing of information in the visual cortex (Delorme, Gautrais, van Rullen, & Thorpe, 1999; Delorme & Thorpe, 2003). Recently, stochastic neuronal dynamics was made accessible to the event-based approach based on the analytic expression of the probability density function for the evolution of the state variable (Reutimann et al., 2003). This approach is applicable in cases in which neuronal dynamics can be described by stochastic processes, such as intrinsically noisy neurons or neurons with synaptic noise stemming from their embedding into larger networks. The latter provides an efficient simulation strategy for studying networks of interacting neurons with an arbitrary number of external afferents, modeled in terms of effective stochastic processes (Ricciardi & Sacerdote, 1979; Lánský & Rospars, 1995; Destexhe, Rudolph, Fellous, & Sejnowski, 2001). So far, concrete applications of event-driven strategies of IF dynamics have been restricted to current-based synaptic interactions (Mattia & Del Giudice, 2000; Reutimann et al., 2003; Hines & Carnevale, 2004).
However, to simulate states of intense activity similar to cortical activity in vivo, in particular high-conductance states with rapid synaptically driven transient changes in the membrane conductance (Destexhe et al., 2003), it is necessary to consider conductance-based synaptic interactions. In this letter, we propose an extension of the classical leaky IF neuron model (Lapicque, 1907; Knight, 1972), the gIF model, which incorporates various aspects of presynaptic-activity-dependent state dynamics seen in real cortical neurons in vivo. The relative simplicity of this extension provides analytic expressions for the state variables, which allow this model to be used together with event-driven simulation strategies. This therefore provides the basis for efficient and exact simulations of large-scale networks with realistic synaptic interactions. In the first half of this letter, we outline the basic idea and introduce three analytic extensions of the classic leaky IF neuron model that incorporate presynaptic-activity-dependent state dynamics as well as the state-dependent scaling of synaptic inputs. The second half of the letter is dedicated to a detailed investigation of the dynamics of these different models, specifically in comparison with more realistic biophysical models of cortical neurons as well as the widely used leaky IF neuron model.
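The contrast with clock-driven integration can be sketched in a few lines. The following minimal illustration is ours, not taken from the letter: a single leaky IF unit is driven by a queue of presynaptic event times, and its state is advanced in closed form between events, so no global time step is needed. All numerical values (the time constant, the per-event jump, the threshold) are arbitrary example choices.

```python
import heapq
import math

# Minimal event-driven loop for a single leaky IF unit (illustrative values).
TAU_M, DM, M_THRES = 20.0, 0.1, 1.0  # ms, state jump per event, threshold

def run(event_times):
    """Process presynaptic event times (ms); return postsynaptic spike times."""
    m, t_last, spikes = 0.0, 0.0, []
    queue = list(event_times)
    heapq.heapify(queue)                      # pending synaptic events
    while queue:
        t = heapq.heappop(queue)
        m *= math.exp(-(t - t_last) / TAU_M)  # exact decay since last event
        m += DM                               # instantaneous synaptic update
        if m >= M_THRES:                      # threshold crossing: spike + reset
            spikes.append(t)
            m = 0.0
        t_last = t
    return spikes

spikes = run([float(i) for i in range(50)])   # one input per ms for 50 ms
```

Because the decay is evaluated analytically, the spike times are limited only by the timing of the input events, not by a discretization grid — the property the letter exploits.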
2 Integrate-and-Fire Neuron Models with Presynaptic-Activity Dependent State Dynamics

After a brief review of the leaky integrate-and-fire (LIF) neuron model, the core idea behind the incorporation of specific aspects of presynaptic-activity dependent state dynamics is presented. Based on this, three extended integrate-and-fire models of increasing complexity but with simple analytic forms of their state variables are proposed.

2.1 The Classic LIF Neuron Model. In the simplest form of IF neuron models, the leaky integrate-and-fire model (Lapicque, 1907; Knight, 1972), the evolution of the state variable is described by the following first-order differential equation,

τ_m dm(t)/dt + m(t) = 0,    (2.1)
where τ_m denotes the membrane time constant and m(t) the state variable, m_rest ≤ m(t) ≤ m_thres, at time t. Upon arrival of a synaptic input at time t0, m(t) is instantaneously updated by Δm, that is, m(t0) → m(t0) + Δm. After that, the state variable decays exponentially with time constant τ_m toward a resting state m_rest (for the moment, we assume m_rest = 0) until the arrival of a new synaptic input at time t1 (see Figure 1A, cLIF). If m(t) crosses a threshold value m_thres (usually assumed to be m_thres = 1), the cell fires a spike and is reset to its resting value m_rest, at which it stays for an absolute refractory period t_ref. The simple form of equation 2.1 allows an explicit solution between the arrivals of synaptic events, given by

m(t) = m(t0) e^{−(t−t0)/τ_m},    (2.2)
where t0 ≤ t < t1 and 0 ≤ m(t) ≤ 1. Depending on the value of the membrane time constant τ_m, we will distinguish two different LIF models. First, a model with large τ_m, of the order of membrane time constants typically observed in low-conductance states, will be referred to as a classic leaky integrate-and-fire (cLIF) neuron model. Second, a LIF model with small τ_m mimicking a (static) high-conductance state will be referred to as a very leaky integrate-and-fire (vLIF) neuron model.

2.2 LIF Neuron Models with Presynaptic-Activity Dependent State Dynamics. In real neurons, the effect of synaptic inputs can be described by transient changes in the conductance of the postsynaptic membrane. This synaptic conductance component G_s(t) adds to a constant leak conductance G_L (the contribution of active membrane conductances will not be
considered here for simplicity), yielding a time-dependent total membrane conductance,

G_m(t) = G_L + G_s(t)    (2.3)

(see Figure 2A, top), which determines the amplitude and shape of postsynaptic potentials (PSPs) and, hence, the response of the cellular membrane to subsequent synaptic inputs. We will restrict what follows to two direct consequences of changes in the membrane conductance caused by synaptic activity. First, a change in G_m results in a change of the membrane time constant τ_m of the form

τ_m(t) = C / G_m(t),    (2.4)
where C denotes the membrane capacity (see Figure 2A, middle). This leads to an alteration in the membrane integration time and, therefore, the shape of the PSPs (see Figure 2A, bottom). For larger G_m, the PSP rise and decay times will be shorter, whereas a lower G_m results in a slower membrane. Second, a change in the membrane conductance will also result in a change of the PSP peak height and, thus, the amplitude of the subthreshold cellular response to synaptic stimulation.
Figure 1: Comparison of excitatory postsynaptic potentials (EPSPs) for different neuron models. (A) The EPSP time course in biophysical models described by a passive membrane equation with synaptic input following an exponential time course (BM exp/PME), α-kinetics (BM α), and two-state kinetics (BM 2-state) is compared to that in a corresponding classic LIF neuron model (cLIF) for three different total membrane conductances (1 × G_L, 5 × G_L, 10 × G_L; G_L = 17.18 nS, corresponding to τ_m^L = 22.12 ms). (B) Comparison of the EPSP time course for the gIF models for membrane conductances as in A. (C) EPSP peak height (left) and integral over the EPSP peak (right) for biophysical models as a function of the total membrane conductance, given in multiples of G_L. Whereas the EPSP shape in classical IF neuron models (cLIF) remains constant, both the EPSP peak height and integral decrease with increasing membrane conductance for the biophysical models. (D) EPSP peak height (left) and integral over the EPSP peak (right) for the gIF models. Whereas the peak amplitude stays constant for the gIF1 model, the decrease in the EPSP integral for this model as well as the functional dependence of the peak height and integral for the gIF2 and gIF3 models compare well to the detailed models shown in C. Computational models are described in appendix A; time courses of synaptic conductances and parameters used are given in Tables 1 and 2.
[Figure 1 graphic: panels A and B show EPSP traces (scale bars 0.05 mV; 2, 10, and 20 ms) at 1 × G_L, 5 × G_L, and 10 × G_L; panels C and D plot peak height (mV) and integral (mV ms) against multiples of G_L for the BM exp/PME, BM α, BM 2-state, cLIF, gIF1, gIF2, and gIF3 models.]
Generalizations of the IF model incorporating a time-dependent, more precisely a spike-time-dependent (Wehmeier, Dong, Koch, & van Essen, 1989; LeMasson, Marder, & Abbott, 1993; Stevens & Zador, 1998a; Giugliano, Bove, & Grattarola, 1999; Jolivet, Lewis, & Gerstner, 2004), membrane time constant have already been investigated in the context of the spike response model (e.g., Gerstner & van Hemmen, 1992, 1993; Gerstner, Ritz, & van Hemmen, 1993; Gerstner & Kistler, 2002). In this case, right after a postsynaptic spike, the membrane conductance is increased due to the contribution of ion channels linked to the spike generation. This leads to a reduction of the membrane time constant immediately after a spike, which shapes the amplitude and form of subsequent postsynaptic potentials (for experimental studies showing a dependence of the PSP shape on intrinsic membrane conductances, see, e.g., Nicoll, Larkman, & Blakemore, 1993; Stuart & Spruston, 1998). However, in contrast to the spike response model, in which the membrane time constant is a function of the time since the last postsynaptic spike and, hence, the postsynaptic activity, in this article the time dependence results directly from synaptic inputs and, hence, is a consequence of the overall presynaptic activity. To study in more detail the scaling properties of the PSPs as a function of the total membrane conductance, we investigated the peak amplitude and integral of excitatory postsynaptic potentials (EPSPs) as a function of a static membrane conductance G_L for different models of postsynaptic
Figure 2: Integrate-and-fire models with presynaptic-activity dependent state dynamics. (A) Synaptic inputs lead to transient changes in the total membrane conductance G_m that depend on the synaptic kinetics and release statistics. Exponential synapses, for instance, cause an instantaneous increase in G_m, followed by an exponential decay back to its resting (leak) value (left, top). This change in G_m is paralleled by a transient change in the effective membrane time constant τ_m (left, middle) and a change in the membrane state variable m (left, bottom). At high synaptic input rates (right), the membrane is set into a high-conductance state with rapidly fluctuating membrane conductance and, hence, membrane time constant. The resulting high-amplitude variations of the membrane state variable m resemble those found in vivo during active network states. (B) Comparison of high-conductance dynamics in the passive biophysical model with two-state synaptic kinetics (BM), the classical LIF neuron model (cLIF), and the gIF models (bottom). In each panel, the upper trace shows a 500 ms time course of the membrane time constant and the lower trace the corresponding membrane state time course. The gIF3 model comes closest to the dynamics seen in the detailed model. Computational models are described in appendix A; time courses of synaptic conductances and parameters used are given in Tables 1 and 2. Input rates were 10 Hz (A, left), 6 kHz (A, right), and 3 kHz (B).
[Figure 2 graphic: panel A plots G_m (nS), τ_m (ms), and m (mV) against time (ms) at low and high input rates; panel B shows τ_m and membrane state traces (scale bars 2 ms, 100 ms, 10 mV; firing threshold, resting state, and input onset marked) for the BM, cLIF, gIF1, gIF2, and gIF3 models.]
Table 1: Models and Parameters of Postsynaptic Membrane Conductances.

Conductance time course:
  Exponential synapse:  G_s^exp(t) = G e^{−t/τ_s}
  α-synapse:            G_s^α(t) = G (t/τ_s) e^{−t/τ_s}
  Two-state kinetics:   G_s^2s(t) = G r_∞ (1 − e^{−tA}) for 0 ≤ t < t_dur;
                        G r(t_dur) e^{−β(t−t_dur)} for t_dur ≤ t < ∞

Total conductance per quantal event:
  Exponential synapse:  G_total^exp = G τ_s
  α-synapse:            G_total^α = G τ_s
  Two-state kinetics:   G_total^2s = (G T_max α)/(A² β) {β t_dur A + α T_max − α T_max e^{−t_dur A}}

Excitatory conductance parameters:
  Exponential and α-synapse:  G = 0.66 nS, τ_s = 2 ms
  Two-state kinetics:  G = 1.2 nS, α = 1.1 (ms mM)^−1, β = 0.67 ms^−1, T_max = 1 mM, t_dur = 1 ms

Inhibitory conductance parameters:
  Exponential and α-synapse:  G = 0.632 nS, τ_s = 10 ms
  Two-state kinetics:  G = 0.6 nS, α = 5 (ms mM)^−1, β = 0.1 ms^−1, T_max = 1 mM, t_dur = 1 ms

Notes: The conductance time course, total quantal conductance, and conductance parameters used in this study are given for exponential synapses, α-synapses, and synapses described by (pulse-based) two-state kinetic models (Destexhe et al., 1994, 1998; see section A.1 for definitions). For simplicity, the time of release t0 was set to 0. G denotes the maximal conductance and τ_s the time constant describing the synaptic kinetics. In the two-state kinetic model, t_dur is the duration of transmitter release, and r_∞ = α T_max / A with A = β + α T_max, where T_max denotes the transmitter concentration and α and β the forward and backward transmitter binding rates, respectively. Conductance parameters for the different models were chosen to yield the same total conductance: G_total^exp = G_total^α = G_total^2s.
conductance dynamics (exponential synapse; α-synapse, see Rall, 1967; synapse with two-state kinetics, see Destexhe, Mainen, & Sejnowski, 1994, 1998; computational models are described in appendix A). In all models, the parameters were adjusted to yield the same total conductance applied to the membrane for individual synaptic inputs (see Table 1). As expected, the EPSP peak height declines for increasing membrane conductance, with absolute amplitude and shape depending on the kinetic model used (see Figures 1A and 1C, left). The integral over the EPSP was much less dependent on the kinetic model, a consequence of the equalized total conductance for each synaptic event, but decreased markedly for increasing G_L (see Figure 1C, right).
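The equalization of the total quantal conductance in Table 1 can be checked numerically. The short script below is ours, not part of the letter: it evaluates the closed-form total conductance of the two-state kinetic synapse from Table 1 and confirms that it matches G τ_s of the exponential and α-synapses for both the excitatory and inhibitory parameter sets.

```python
import math

# Check that the Table 1 parameter choices equalize the total
# conductance per quantal event across the three synapse models.
def total_two_state(G, alpha, beta, Tmax, tdur):
    """Total conductance of the two-state kinetic synapse (Table 1)."""
    A = beta + alpha * Tmax
    return (G * Tmax * alpha / (A**2 * beta)) * (
        beta * tdur * A + alpha * Tmax - alpha * Tmax * math.exp(-tdur * A))

# Excitatory: exponential/alpha synapse gives G * tau_s = 0.66 * 2 nS ms
exc = total_two_state(G=1.2, alpha=1.1, beta=0.67, Tmax=1.0, tdur=1.0)
# Inhibitory: exponential/alpha synapse gives G * tau_s = 0.632 * 10 nS ms
inh = total_two_state(G=0.6, alpha=5.0, beta=0.1, Tmax=1.0, tdur=1.0)

assert abs(exc - 0.66 * 2.0) < 0.01    # ≈ 1.32 nS ms
assert abs(inh - 0.632 * 10.0) < 0.01  # ≈ 6.32 nS ms
```

Both parameter sets reproduce the exponential-synapse totals to within rounding of the published parameters, consistent with the note in Table 1.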
In the LIF neuron model, neither of these effects seen in the biophysical model and experiments occurs. Both the EPSP peak height and the integral over the EPSP (determined by the membrane time constant) are constant (see Figures 1A and 1C). The question arising now is which modifications or extensions of the LIF neuron model can account for EPSP peak height and EPSP integral as functions of the membrane conductance. In what follows, we will present such extensions. The constructed models will be called conductance-based (g-based) integrate-and-fire neuron models, which we refer to as gIF neuron models.

2.3 gIF1—The Basic gIF Model. We consider the simplest transient change in the total membrane conductance after a synaptic input, namely, the exponential synapse. At the arrival of a synaptic event at time t0, the synaptic contribution to G_m (see equation 2.3) rises instantaneously by a constant value ΔG_s and decays afterward exponentially with time constant τ_s to zero until the arrival of a new synaptic event at time t1:

G_s(t0) → G_s(t0) + ΔG_s,
G_s(t) = G_s(t0) e^{−(t−t0)/τ_s}  for t0 ≤ t < t1.    (2.5)
Due to the additivity of conductances for this model, equations 2.5 hold also for multiple synaptic inputs. This yields, in general, an average synaptic contribution to the membrane conductance whose statistical properties, such as mean, variance, or spectral properties, are determined by the statistics and functional properties of quantal synaptic release events (Rudolph & Destexhe, 2005). Due to the correspondence between membrane conductance and membrane time constant (see equation 2.4), equations 2.5 can be translated into corresponding changes in τ_m. The state of the membrane at time t is characterized by a membrane time constant τ_m(t) obeying

1/τ_m(t) = 1/τ_m^L + 1/τ_m^s(t),    (2.6)

where τ_m^L = C/G_L denotes the membrane time constant at the resting state without synaptic activity and τ_m^s(t) = C/G_s(t) the time-varying time constant due to synaptic conductances. After the arrival of a synaptic input, 1/τ_m^s rises instantaneously by 1/Δτ_m^s, with Δτ_m^s = C/ΔG_s:

1/τ_m^s(t0) → 1/τ_m^s(t0) + 1/Δτ_m^s.    (2.7)
Due to the decay of G_s given in equation 2.5, the membrane time constant τ_m increases (“decays”) after this change back to its resting value τ_m^L
according to

1/τ_m(t) = 1/τ_m^L + (1/τ_m^s(t0)) e^{−(t−t0)/τ_s}  for t0 ≤ t < t1.    (2.8)
This well-known translation of membrane conductance changes due to synaptic activity into corresponding changes in the membrane time constant now provides a simple way to incorporate the effect of conductance-based synaptic activity into the IF neuron model framework. Replacing the membrane time constant τ_m in the LIF model, equation 2.1, by the time-dependent membrane time constant τ_m(t) given by equation 2.8 yields

τ_m(t) dm(t)/dt + m(t) = 0,    (2.9)
which describes the evolution of the neuronal state variable between two synaptic arrivals. Equation 2.9 can be solved explicitly, leading to

m(t) = m(t0) exp[−(t−t0)/τ_m^L − (τ_s/τ_m^s(t0)) (1 − e^{−(t−t0)/τ_s})],    (2.10)
where m(t0) and τ_m^s(t0) are the membrane state and the synaptic contribution to the membrane time constant after the last synaptic input at time t0. This solution defines the core of the gIF models considered in this article. The apparent difference from the LIF neuron model is that the state variable m(t) now decays with an effective time constant that is no longer constant but depends on the overall synaptic activity. In contrast to the synaptic-input-modulated membrane dynamics between synaptic events, the membrane state m(t) is still updated by a constant value Δm upon arrival of a synaptic input, that is,

m(t0) → m(t0) + Δm.    (2.11)
This corresponds to a constant PSP peak amplitude, independent of the current membrane state (see Figure 1D, left). However, due to the state-dependent membrane time constant, the decay and, hence, the shape of the PSP will depend on the level of synaptic activity (see Figure 1B). This dependence is reflected in the change of the integral over the PSP, which decreases with decreasing average membrane time constant and matches closely the values observed in conductance-based models of synaptic kinetics (see Figure 1; compare Figures 1C and 1D, right). We note that in the subsequent gIF models introduced in the following sections, the constant Δm in equation 2.11 will be replaced by expressions that capture the effect of a state-dependent scaling of the PSP peak amplitude.
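The closed form above can be verified numerically. The sanity check below is ours, not from the letter; the post-event value of 1/τ_m^s(t0) and the elapsed time are arbitrary illustrative choices, while the resting time constant matches the τ_m^L = 22.12 ms of Figure 1. Forward-Euler integration of equation 2.9, with τ_m(t) taken from equation 2.8, reproduces equation 2.10.

```python
import math

# Sanity check that the closed form (2.10) solves equation 2.9
# when tau_m(t) follows equation 2.8.
tau_L, tau_s = 22.12, 2.0   # ms; resting and synaptic decay time constants
k0 = 0.2                    # 1/tau_m^s(t0) just after the event (assumed)
m0, T = 0.8, 6.0            # post-event state and elapsed time (ms)

def m_analytic(t):
    """Equation 2.10."""
    return m0 * math.exp(-t / tau_L - tau_s * k0 * (1.0 - math.exp(-t / tau_s)))

# forward-Euler integration of equation 2.9 with a fine fixed step
dt, m, t = 1e-4, m0, 0.0
for _ in range(int(T / dt)):
    inv_tau_m = 1.0 / tau_L + k0 * math.exp(-t / tau_s)  # equation 2.8
    m += dt * (-m * inv_tau_m)
    t += dt

assert abs(m - m_analytic(T)) < 1e-3
```

The agreement is what makes the event-driven use of the gIF1 model possible: the costly fixed-step loop is replaced by a single evaluation of equation 2.10 per event.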
So far, only one conductance component and its impact on the membrane time constant was considered (see equation 2.6). However, many studies in anesthetized animals (e.g., Borg-Graham, Monier, & Frégnac, 1998; Hirsch, Alonso, Reid, & Martinez, 1998; Paré, Shink, Gaudreau, Destexhe, & Lang, 1998; for a review, see Destexhe et al., 2003), as well as during wakefulness and natural sleep periods, have directly shown that both excitatory and inhibitory synaptic conductance contributions shape the state of the cellular membrane. Generalizing equations 2.5 to 2.10 to the situation of independent excitatory and inhibitory synaptic inputs, we finally obtain a set of state equations that define the dynamics of the gIF1 model. Specifically, the inverse membrane time constant is a sum of resting-state as well as excitatory and inhibitory synaptic contributions:

1/τ_m(t) = 1/τ_m^L + 1/τ_m^e(t) + 1/τ_m^i(t).    (2.12)
It “decays” with two different time constants τ_e and τ_i for excitatory and inhibitory synaptic conductances, respectively, to its resting value τ_m^L:

1/τ_m(t) = 1/τ_m^L + (1/τ_m^e(t0)) e^{−(t−t0)/τ_e} + (1/τ_m^i(t0)) e^{−(t−t0)/τ_i}.    (2.13)
With this, the solution of equation 2.9 is given by

m(t) = m(t0) exp[−(t−t0)/τ_m^L − (τ_e/τ_m^e(t0)) (1 − e^{−(t−t0)/τ_e}) − (τ_i/τ_m^i(t0)) (1 − e^{−(t−t0)/τ_i})],    (2.14)

where τ_m^e(t0) and τ_m^i(t0) are the excitatory and inhibitory synaptic contributions to the membrane time constant at time t0. Upon arrival of a synaptic event at time t0, the state variable and the synaptic contributions to the membrane time constant are updated according to

m(t0) → m(t0) + Δm_{e,i},    (2.15)

1/τ_m^{e,i}(t0) → 1/τ_m^{e,i}(t0) + 1/Δτ_m^{e,i},    (2.16)
where the indices e and i denote excitatory and inhibitory synaptic inputs, respectively.

2.4 gIF2—An Extended gIF Model. So far we have considered the effect of synaptic activity only on the membrane time constant τ_m, whereas the amplitude of individual synaptic events was kept constant. In other words,
the update of the state variable m(t) by Δm at the arrival of a synaptic event (equations 2.11 and 2.15) was independent of the presynaptic activity. However, for equal synaptic conductance time course and membrane state, a leakier and, hence, faster membrane will result in an effective reduction of the PSP amplitude (see Figures 1A and 1C, left). To extend the gIF1 model in this direction, we analytically solved the membrane equation for a single synaptic input event described by an exponential conductance time course (see section B.1) and approximated the obtained explicit expression for the PSP time course by an α-function (see section B.2). This yields for the update Δm(τ_m(t0)) of the membrane state variable m(t) due to the arrival of a synaptic input at time t0

m(t0) → m(t0) + Δm(τ_m(t0)),    (2.17)

with

Δm(τ_m(t0)) = Δm̃(τ̃_m) (1/τ̃_m + 1/τ_s + 1/Δτ_m^s) (1/τ_m(t0) + 1/τ_s + 1/Δτ_m^s)^{−1}.    (2.18)

Here, τ_m(t0) denotes the actual membrane time constant at the time of the synaptic event, and Δm̃(τ̃_m) is the reference value for the PSP peak amplitude in a control state characterized by the membrane time constant τ̃_m (see section B.2). Throughout the text and in all numerical simulations, this state was taken to be the resting state at m = m_L = 0, that is, τ̃_m ≡ τ_m^L. Equations 2.17 and 2.18 can be generalized to independent excitatory and inhibitory synaptic input channels. In this case, the membrane state variable m(t) is subject to the update

m(t0) → m(t0) + Δm_{e,i}(τ_m(t0)),    (2.19)

with

Δm_{e,i}(τ_m(t0)) = Δm̃_{e,i}(τ̃_m) (1/τ̃_m + 1/τ_{e,i} + 1/Δτ_m^{e,i}) (1/τ_m(t0) + 1/τ_{e,i} + 1/Δτ_m^{e,i})^{−1}.    (2.20)
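Taken together, equations 2.12 to 2.20 define a compact event-driven update: store m and the two inverse synaptic time-constant contributions, advance them in closed form to each event time, and then apply the conductance-scaled state jump. The class below is our illustrative sketch of this gIF1/gIF2 update, with the reference state taken at rest (τ̃_m = τ_m^L); every numerical parameter (Δm̃_{e,i}, 1/Δτ_m^{e,i}, τ_e, τ_i) is an arbitrary example value, not taken from the letter.

```python
import math

# Illustrative event-driven sketch of the gIF1/gIF2 update rules
# (equations 2.12-2.20); all parameter values are arbitrary examples.
class GIF2:
    def __init__(self, tau_L=22.12, tau_e=2.0, tau_i=10.0):
        self.tau_L = tau_L
        self.tau = {'e': tau_e, 'i': tau_i}
        self.dm_ref = {'e': 0.05, 'i': -0.03}    # reference jumps at rest
        self.inv_dtau = {'e': 0.02, 'i': 0.01}   # 1/Dtau_m^{e,i} per event
        self.k = {'e': 0.0, 'i': 0.0}            # 1/tau_m^e(t0), 1/tau_m^i(t0)
        self.m, self.t = 0.0, 0.0

    def inv_tau_m(self):
        """Equation 2.12: inverse membrane time constant."""
        return 1.0 / self.tau_L + self.k['e'] + self.k['i']

    def event(self, t, kind):
        """Advance to time t (eqs. 2.13, 2.14), then apply an 'e'/'i' input."""
        dt = t - self.t
        expo = -dt / self.tau_L
        for s in ('e', 'i'):
            expo -= self.tau[s] * self.k[s] * (1 - math.exp(-dt / self.tau[s]))
            self.k[s] *= math.exp(-dt / self.tau[s])      # equation 2.13
        self.m *= math.exp(expo)                          # equation 2.14
        # gIF2 scaling of the jump (eqs. 2.19-2.20), reference state = rest
        ref = 1.0 / self.tau_L + 1.0 / self.tau[kind] + self.inv_dtau[kind]
        now = self.inv_tau_m() + 1.0 / self.tau[kind] + self.inv_dtau[kind]
        self.m += self.dm_ref[kind] * ref / now
        self.k[kind] += self.inv_dtau[kind]               # equation 2.16
        self.t = t

n = GIF2()
n.event(0.0, 'e')   # at rest the jump equals the reference amplitude
n.event(2.0, 'e')   # in the leakier state the jump is slightly reduced
```

One reading choice made here is that τ_m(t0) in equation 2.18 is evaluated after the decay to the event time but before adding the new event's own conductance contribution; the letter's appendix B would settle this detail.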
These equations describe the scaling of the PSP peak amplitude depending on the actual membrane conductance (see Figure 1) and define, together with equations 2.12, 2.13, and 2.14, the dynamics of the gIF2 model.

2.5 gIF3—A gIF Model with Synaptic Reversal Potentials. Barrages of synaptic inputs not only modulate the membrane time constant but also
result in a change of the actual membrane state variable due to the presence of reversal potentials for synaptic conductances. In this case, excitatory synaptic inputs push the membrane closer to firing threshold, whereas inhibitory inputs in general will have the opposite effect. The average membrane potential is determined by the actual values of the inhibitory and excitatory conductances, as well as the leak conductance, and their respective reversal potentials. To account for this effect, we extend the defining equation of the LIF model, equation 2.9, by an actual “effective reversal state” m_rest (e.g., see Borg-Graham et al., 1998; Shelley et al., 2002) to which the state variable m(t) decays with the membrane time constant τ_m(t) (see section B.3):

τ_m(t) dm(t)/dt + m(t) − m_rest = 0.    (2.21)
Here, τm (t) is given by equation 2.8 and mrest =
mL ms + s τmL τm (t0 )
1 1 + s τmL τm (t0 )
−1
,
(2.22)
where m L denotes the true resting (leak reversal) state in the absence of synaptic activity (usually assumed to be m L = 0), and ms is the value of the state variable corresponding to the synaptic reversal. Note that in order to allow a solution of equation B.3, mrest is assumed to stay constant between two synaptic inputs. When a new synaptic event arrives, mrest is updated according to equation 2.22 with a new value for τms , thus endowing mrest with an indirect time dependence. Equation 2.21, which describes the evolution of the neuronal state variable between two synaptic inputs in the presence of a synaptic reversal potential, can be explicitly solved, yielding
$$m(t) = m_{rest} + \left[ m(t_0) - m_{rest} \right] \exp\left( -\frac{t-t_0}{\tau_{mL}} - \frac{\tau_s}{\tau_m^{s}(t_0)} \left( 1 - e^{-\frac{t-t_0}{\tau_s}} \right) \right). \tag{2.23}$$

In the case of excitatory and inhibitory synaptic inputs, this solution generalizes to

$$m(t) = m_{rest} + \left[ m(t_0) - m_{rest} \right] \exp\left( -\frac{t-t_0}{\tau_{mL}} - \frac{\tau_e}{\tau_m^{e}(t_0)} \left( 1 - e^{-\frac{t-t_0}{\tau_e}} \right) - \frac{\tau_i}{\tau_m^{i}(t_0)} \left( 1 - e^{-\frac{t-t_0}{\tau_i}} \right) \right), \tag{2.24}$$
2160
M. Rudolph and A. Destexhe
with the generalized effective reversal state variable

$$m_{rest} = \left( \frac{m_L}{\tau_{mL}} + \frac{m_e}{\tau_m^{e}(t_0)} + \frac{m_i}{\tau_m^{i}(t_0)} \right) \left( \frac{1}{\tau_{mL}} + \frac{1}{\tau_m^{e}(t_0)} + \frac{1}{\tau_m^{i}(t_0)} \right)^{-1}. \tag{2.25}$$
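A minimal sketch of the subthreshold gIF3 dynamics between two events, assuming a single synaptic channel (equations 2.22 and 2.23); the parameter values and function names here are illustrative, not the paper's:

```python
import math

# Effective reversal state (equation 2.22) and the explicit inter-event
# solution of the membrane equation (equation 2.23).

def m_rest(m_L, tau_mL, m_s, tau_ms_t0):
    """Effective reversal state m_rest, equation 2.22."""
    num = m_L / tau_mL + m_s / tau_ms_t0
    den = 1.0 / tau_mL + 1.0 / tau_ms_t0
    return num / den

def m_between_events(m_t0, t, t0, m_L, tau_mL, m_s, tau_s, tau_ms_t0):
    """Explicit solution for m(t) between two synaptic inputs, equation 2.23."""
    rest = m_rest(m_L, tau_mL, m_s, tau_ms_t0)
    decay = math.exp(-(t - t0) / tau_mL
                     - (tau_s / tau_ms_t0) * (1.0 - math.exp(-(t - t0) / tau_s)))
    return rest + (m_t0 - rest) * decay

# Sanity check: with a vanishing synaptic conductance term (tau_ms -> inf),
# the solution reduces to a plain leaky decay toward m_L = 0.
m = m_between_events(m_t0=0.5, t=22.12, t0=0.0, m_L=0.0,
                     tau_mL=22.12, m_s=2.667, tau_s=2.0, tau_ms_t0=1e12)
assert abs(m - 0.5 * math.exp(-1.0)) < 1e-6
```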
Finally, the change in the membrane state variable due to synaptic inputs will bring the membrane closer to or farther away from the corresponding synaptic reversal potential, thus yielding a change of the synaptic current linked to the synaptic events. The latter results in an additional modulation of the PSP peak amplitude (see Figure 3A), which appears to be particularly important for inhibitory synaptic inputs. Because the synaptic reversal potential lies in general between the resting state and the firing threshold, inhibitory inputs can have both a depolarizing and a hyperpolarizing effect as the membrane state increases from its resting value to firing threshold (see Figures 3A right and 3C left). To incorporate this effect in the gIF3 model, we extended the solution of the full membrane equation for a single exponential synaptic input (see section B.3). From this we obtain a simple explicit expression for the PSP peak amplitude, hence the update $\Delta m(\tau_m(t_0), m(t_0))$ of $m(t)$ at arrival of a synaptic input at time $t_0$, as a function of both the actual total membrane conductance and the actual membrane state (hence, the actual distance to the reversal state):

$$m(t_0) \longrightarrow m(t_0) + \Delta m(\tau_m(t_0), m(t_0)), \tag{2.26}$$
with

$$\Delta m(\tau_m(t_0), m(t_0)) = \Delta\tilde{m}(\tilde{\tau}_m, \tilde{m}) \, \frac{m(t_0) - m_s}{\tilde{m} - m_s} \left( \frac{1}{\tilde{\tau}_m} + \frac{1}{\tau_s} + \frac{1}{\tau_m^{s}} \right) \left( \frac{1}{\tau_m(t_0)} + \frac{1}{\tau_s} + \frac{1}{\tau_m^{s}} \right)^{-1}, \tag{2.27}$$
where $\Delta\tilde{m}(\tilde{\tau}_m, \tilde{m})$ denotes the reference value for the PSP peak amplitude in a control state characterized by the membrane time constant $\tilde{\tau}_m$ and membrane state $\tilde{m}$ (taken to be the resting state at $m = m_L = 0$), and $m_s$ is the value of the state variable corresponding to the synaptic reversal. Generalizing the model to cope with both excitatory and inhibitory synaptic inputs, equations 2.26 and 2.27 take the form

$$m(t_0) \longrightarrow m(t_0) + \Delta m_{\{e,i\}}(\tau_m(t_0), m(t_0)), \tag{2.28}$$

with
Figure 3: Excitatory and inhibitory postsynaptic potentials. (A) Comparison of EPSPs (left) and IPSPs (right) in the two-state kinetic model (gray) and the gIF3 model (black) for different membrane potentials. The arrow marks the reversal potential for inhibition. (B) EPSP peak height for the two-state kinetic model (left) and the gIF3 model (right) as a function of the total membrane conductance, given in multiples of the leak conductance $G_L$ = 17.18 nS, for different membrane potentials ranging from the leak reversal ($E_L$ and $m_L$, respectively) to the firing threshold ($E_{thres}$ and $m_{thres}$, respectively). (C) Comparison of IPSP peak heights for the two-state kinetic model (left) and the gIF3 model (right) as a function of the total membrane conductance and different membrane potentials as in B. The parameters used for the synaptic kinetics and the time course of synaptic conductances are given in Tables 1 and 2. The membrane state values in the gIF3 model were normalized as described in section A.3.
$$\Delta m_{\{e,i\}}(\tau_m(t_0), m(t_0)) = \Delta\tilde{m}_{\{e,i\}}(\tilde{\tau}_m, \tilde{m}) \, \frac{m(t_0) - m_{\{e,i\}}}{\tilde{m} - m_{\{e,i\}}} \left( \frac{1}{\tilde{\tau}_m} + \frac{1}{\tau_{\{e,i\}}} + \frac{1}{\tau_m^{\{e,i\}}} \right) \left( \frac{1}{\tau_m(t_0)} + \frac{1}{\tau_{\{e,i\}}} + \frac{1}{\tau_m^{\{e,i\}}} \right)^{-1}, \tag{2.29}$$
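The state- and conductance-dependent rescaling of equation 2.29 can be sketched as follows; the function name is ours, and the example values are loosely based on Table 2 (gIF3):

```python
# Sketch of the gIF3 PSP update (equations 2.28 and 2.29): the amplitude
# scales with the distance of the current state m(t0) to the synaptic
# reversal state, and with the actual membrane conductance.

def psp_amplitude_gif3(dm_ref, m_ref, m_t0, m_rev,
                       tau_m_ref, tau_m_t0, tau_syn, tau_m_syn):
    """Delta m_{e,i}(tau_m(t0), m(t0)), equation 2.29."""
    driving = (m_t0 - m_rev) / (m_ref - m_rev)
    ref = 1.0 / tau_m_ref + 1.0 / tau_syn + 1.0 / tau_m_syn
    act = 1.0 / tau_m_t0 + 1.0 / tau_syn + 1.0 / tau_m_syn
    return dm_ref * driving * ref / act

# Excitatory input: as m(t0) moves from rest (0) toward the excitatory
# reversal state (m_rev = 2.667), the EPSP amplitude shrinks.
a_rest = psp_amplitude_gif3(0.0076, 0.0, 0.0, 2.667, 22.12, 22.12, 2.0, 575.96)
a_depol = psp_amplitude_gif3(0.0076, 0.0, 0.2, 2.667, 22.12, 22.12, 2.0, 575.96)
assert a_rest == 0.0076 and 0.0 < a_depol < a_rest
```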
where $m_{\{e,i\}}$ denotes the synaptic reversal state variable for excitation (index $e$) and inhibition (index $i$), respectively. The last two equations, together with equations 2.12, 2.13, 2.24, and 2.25, define the dynamics of the gIF3 model. Due to the incorporation of the effect of the synaptic reversal states, the gIF3 model is the most realistic of the introduced gIF models with presynaptic-activity-dependent state dynamics and state-dependent synaptic input amplitude, and it best describes the behavior seen in the biophysical model (see Figures 1 and 3).

3 Response Dynamics of gIF Models

In this section, we compare the spiking response dynamics of the introduced gIF models with presynaptic-activity-dependent state dynamics to that of biophysical models with multiple synaptic inputs described by two-state kinetics, to the passive membrane equation with exponential conductance-based synapses and fixed spike threshold, as well as to the behavior seen in leaky IF neuron models. In the following section, we first characterize the statistics of spontaneous discharge activity. In section 3.2, we study the temporal resolution of synaptic inputs in the different models. Finally, in section 3.3, we investigate the modulatory effect of synaptic inputs on the cellular gain. Computational models are described in appendix A, with parameters provided in Tables 1 and 2.

3.1 Spontaneous Discharge Statistics. Spontaneous discharge activity in the investigated models was evoked by Poisson-distributed random release at excitatory and inhibitory synaptic terminals with stationary rates that were selected independently in a physiologically relevant parameter regime (see appendix A). In the biophysical model, 10,000 input channels for excitatory and 3000 for inhibitory synapses releasing in a range from 0 to 10 Hz each were used, thus yielding total input rates from $\nu_e$ = 0 to 100 kHz for excitation and from $\nu_i$ = 0 to 30 kHz for inhibition (see section A.1).
The cell’s output rate $\nu_{out}$ increased gradually with increasing $\nu_e$, up to 300 Hz for extreme excitatory drive (see Figure 4, BM). Moreover, in the investigated input parameter regime, excitatory and inhibitory synaptic release rates yielding comparable output rates were related in a nearly linear fashion, as indicated by the linear behavior of the equi-$\nu_{out}$ lines.
Table 2: Parameter Values for Integrate-and-Fire Neuron Models.

                      cLIF/vLIF                 gIF1                 gIF2                 gIF3
Membrane properties   τ_mL = 22.12 ms           τ_mL = 22.12 ms      τ_mL = 22.12 ms      τ_mL = 22.12 ms
                      (vLIF: τ_mL = 4.42 ms)                                              m_L = 0
Excitatory inputs     Δm_e = 0.0095             Δm_e = 0.0095        Δm_e = 0.0095        Δm_e = 0.0076
                                                τ_e = 2 ms           τ_e = 2 ms           τ_e = 2 ms
                                                τ_m^e = 575.96 ms    τ_m^e = 575.96 ms    τ_m^e = 575.96 ms
                                                                                          m_e = 2.667
Inhibitory inputs     Δm_i = −0.0072            Δm_i = −0.0072       Δm_i = −0.0072       Δm_i = 0.0014
                                                τ_i = 10 ms          τ_i = 10 ms          τ_i = 10 ms
                                                τ_m^i = 601.3 ms     τ_m^i = 601.3 ms     τ_m^i = 601.3 ms
                                                                                          m_i = 0.167

Notes: Values for the passive (leak) membrane time constant τ_mL, the excitatory and inhibitory synaptic time constants (τ_e and τ_i, respectively), the synaptic reversal states (m_e and m_i), the changes in the membrane state variable (Δm_e and Δm_i), and the synaptic contributions to the membrane time constant (τ_m^e and τ_m^i) for excitatory and inhibitory synaptic inputs, respectively, are given. For definitions, see section 2 and appendix A.
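For convenience, the gIF1/gIF2 column of Table 2 can be collected in code. The composition rule below is our assumption: it follows the additive inverse-time-constant structure that appears in the denominators of equations 2.22 and 2.25 (the exact definition of $\tau_m(t)$ is equation 2.8, which is not reproduced in this excerpt).

```python
# Table 2 parameters for the gIF1/gIF2 models (keys follow the table symbols).
params = {
    "tau_mL": 22.12,   # leak membrane time constant (ms)
    "dm_e": 0.0095,    # EPSP peak amplitude (state units)
    "tau_e": 2.0,      # excitatory synaptic time constant (ms)
    "tau_me": 575.96,  # excitatory contribution to tau_m (ms)
    "dm_i": -0.0072,   # IPSP peak amplitude (state units)
    "tau_i": 10.0,     # inhibitory synaptic time constant (ms)
    "tau_mi": 601.3,   # inhibitory contribution to tau_m (ms)
}

# Assuming inverse time constants add (cf. the denominators of equations
# 2.22 and 2.25), one active channel of each type gives a slightly faster
# membrane than the leak value alone.
tau_m = 1.0 / (1.0 / params["tau_mL"]
               + 1.0 / params["tau_me"]
               + 1.0 / params["tau_mi"])
assert 20.0 < tau_m < params["tau_mL"]
```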
In the classic LIF (cLIF) neuron model, single independent input channels for excitatory and inhibitory synapses were used, with rates that were downscaled by a factor of 5 compared to the total input rates in the biophysical model ($0 \le \nu_e \le 20$ kHz and $0 \le \nu_i \le 5$ kHz for excitatory and inhibitory inputs, respectively) to account for the larger membrane time constant (see section A.3). Although the firing rates were, in general, larger than for corresponding input values in the biophysical model, a linear dependence of the equi-$\nu_{out}$ lines was found here as well (see Figure 4, cLIF). The latter suggests that inhibitory inputs play a less crucial role in determining the output rate. A qualitatively similar behavior was also observed in the very leaky IF (vLIF) neuron model, which mimics a (static) high-conductance state by a small membrane time constant (see section A.3). In this case, two single independent input channels for excitatory and inhibitory inputs with rates equivalent to the total input rates in the biophysical model were used. Although the firing rates were, in general, larger than for corresponding input values in the cLIF as well as the biophysical model (see Figure 4, vLIF), the linear dependence of $\nu_{out}$ on $\nu_e$ and $\nu_i$ corresponded qualitatively to that seen in both of these models, with a large slope as in the cLIF model. Both the overall higher output rates and the diminished modulatory effect of inhibitory inputs can be viewed as a direct consequence of a static membrane time constant or, equivalently, membrane conductance. This contrasts with the situation seen in models driven with synaptic conductances, where the intensity of synaptic inputs determines the conductance state of the membrane. Here, at high input rates, the high membrane conductance will shunt
Figure 4: Spontaneous discharge rate as a function of the total frequency of inhibitory and excitatory synaptic inputs. The biophysical model with two-state kinetic synapses (BM) is compared with the passive membrane equation (PME), the classical leaky and very leaky IF (with five times reduced membrane time constant) neuron models (cLIF and vLIF, respectively), as well as the gIF neuronal models with presynaptic-activity-dependent state dynamics (gIF1, gIF2, gIF3). The synaptic input in the biophysical model consisted of 10,000 independent excitatory and 3000 independent inhibitory channels, releasing with individual rates between 0 and 10 Hz each. For all other models, two independent input channels for excitation and inhibition releasing at rates between 0 and 100 kHz (for excitation) and 0 and 30 kHz (for inhibition) were used (except for the leaky IF, in which case the rates varied between 0 and 20 kHz for excitation and 0 and 6 kHz for inhibition). The parameters used for the synaptic kinetics and the time course of synaptic conductances are given in Tables 1 and 2 as well as appendix A.
the membrane and in this way effectively lower the impact of individual synaptic inputs. The latter results in lower average firing rates compared to models with smaller and fixed membrane conductance but comparable synaptic input drive. The modulatory effect of inhibitory inputs for an equivalent synaptic input regime was larger in all gIF models (see Figure 4, gIF1 to gIF3). Specifically, in the gIF1 and gIF3 models, the slope of the equi-νout lines was smaller than in the LIF models. Indeed, the gIF3 model reproduced best the qualitative behavior seen in the biophysical model, with only the output rate being larger for comparable input settings. Major deviations from the behavior seen in the biophysical model were observed only for the gIF2 model. In this case, the equi-νout lines showed a nonlinear dependence on νe and νi , with firing rates that were markedly lower in most of the investigated parameter regime. To explain this finding, we note that in the gIF2 model, only the impact of the total conductance state on the PSP peak amplitudes is incorporated, while the current value of the membrane state variable and, hence, the distance to synaptic reversal potentials is not considered (see section 2.4). In general, a higher membrane conductance, as seen for higher input rates of both excitation and inhibition, will yield smaller PSP amplitudes. On the other hand, as described in section A.3, the amplitude of the PSPs was adjusted to those seen in the biophysical model close to firing threshold. Here, EPSPs have a smaller amplitude due to the smaller distance to the excitatory reversal potential, whereas for IPSPs, the opposite holds. As observed in the gIF2 model, this will lead to an effective decrease, or asymptotic “saturation,” of the firing rate, in particular for high input rates where the PSP peak amplitudes are rescaled to smaller values due to the shunting effect of the membrane. 
The output rate can be modulated by tuning the amplitude values for PSPs (not shown), but without qualitative change in the nonlinear behavior seen in Figure 4 (gIF2). In order to decompose the effect of active and synaptic conductances on the discharge activity, simulations of the passive membrane equation with fixed spike threshold were performed (see section A.2). Interestingly, the slope of the equi-$\nu_{out}$ lines was much smaller than in the biophysical as well as the gIF3 model, and the firing rate increased much faster beyond 300 Hz for increasing $\nu_e$ than in all other investigated models (see Figure 4, PME). This indicates that incorporating a realistic PSP shape alone, without the transient effects of spike-generating conductances on the membrane time constant and spike threshold, does not suffice to reproduce a realistic spontaneous discharge behavior. Surprisingly, the gIF3 model, although dynamically simpler, reproduced the spiking behavior seen in the biophysical model much better (see Figure 4). A possible explanation for this observation might be that the instantaneous rise of the membrane state variable on arrival of a synaptic input mimics a marked transient increase in the total membrane conductance and, hence, a faster membrane due to spike generation. This instantaneous update could effectively compensate for
the lack of the temporal effect of active membrane conductances and thus lead to a behavior closer to that seen in the biophysical model. To further characterize the statistics of the discharge activity in the different models, we calculated the coefficient of variation $C_V$, defined by

$$C_V = \frac{\sigma_{ISI}}{\overline{ISI}}, \tag{3.1}$$
where $\sigma_{ISI}$ denotes the standard deviation of the interspike intervals (ISIs) and $\overline{ISI}$ the mean ISI. In all investigated models, higher firing rates led to a more regular discharge, that is, smaller $C_V$ values (see Figure 5). However, whereas in the biophysical model and for the passive membrane equation the regime with high discharge variability was broad and increased for higher input rates (see Figure 5, BM and PME), $C_V$ values around unity were obtained in the LIF models only for a very tight balance between inhibitory and excitatory drive. The latter depended in the investigated parameter regime only minimally on the input rates (see Figure 5, cLIF and vLIF). This finding of a tight balance is in agreement with previously reported results (e.g., Softky & Koch, 1993; Rudolph & Destexhe, 2003). Interestingly, no differences were observed between the cLIF and vLIF models, although the membrane time constant in the vLIF model was five times smaller than in the cLIF model, thus yielding a much faster decay of individual PSPs. This indicates that the required higher input rates for excitation and inhibition and the resulting quantitatively different random-walk process close to threshold in the vLIF model did not relax the requirement of a narrow tuning of the synaptic input rates. Although qualitative differences were found among the gIF models, high $C_V$ values were observed in a generally broader regime of input frequencies (see Figure 5, gIF1 to gIF3) when compared with the LIF models. The gIF3 model came closest to the discharge behavior observed in the biophysical model. This is interesting, as it suggests that biologically more realistic discharge statistics can indeed be obtained with a simple threshold model without the involvement of complex conductance-based spike-generating mechanisms (Rudolph & Destexhe, 2003).
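For reference, $C_V$ (equation 3.1) can be computed directly from recorded spike times. The sketch below uses a surrogate Poisson spike train, not data from the paper's simulations:

```python
import math
import random

def cv_isi(spike_times):
    """C_V = sigma_ISI / mean_ISI for a sorted list of spike times (eq. 3.1)."""
    isis = [t1 - t0 for t0, t1 in zip(spike_times, spike_times[1:])]
    mean = sum(isis) / len(isis)
    var = sum((x - mean) ** 2 for x in isis) / len(isis)
    return math.sqrt(var) / mean

# A perfectly regular train has C_V = 0.
assert cv_isi([0.0, 10.0, 20.0, 30.0]) == 0.0

# A Poisson train (exponential ISIs) has C_V close to 1.
random.seed(0)
t, spikes = 0.0, []
for _ in range(20000):
    t += random.expovariate(1.0 / 10.0)  # exponential ISIs, mean 10 ms
    spikes.append(t)
assert abs(cv_isi(spikes) - 1.0) < 0.05
```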
Support for this was also found in simulations with the passive membrane equation, although here a generally broader input regime resulting in high $C_V$ values, as well as a smaller slope of the equi-$C_V$ lines compared to the biophysical and gIF3 models, was observed (see Figure 5, PME). The smaller slope of the equi-$C_V$ lines in the gIF models compared to the LIF models further suggests that here inhibitory inputs can tune the neural discharge activity over a much broader range of driving excitatory inputs. However, comparing the results obtained for the gIF3 model with the behavior observed for the gIF1 and gIF2 models (see Figure 5, gIF1 and gIF2) shows that both the incorporation of the state-dependent PSP amplitude and the effect of the reversal potential on the PSP amplitude are necessary to reproduce the discharge statistics of the biophysical model.
Figure 5: Coefficient of variation $C_V$ as a function of the total frequency of inhibitory and excitatory synaptic inputs. The $C_V$ is defined as $C_V = \sigma_{ISI}/\overline{ISI}$, where $\sigma_{ISI}$ denotes the standard deviation of the interspike intervals and $\overline{ISI}$ the mean interspike interval. The biophysical model with two-state kinetic synapses (BM) is compared with the passive membrane equation (PME), the classical leaky and very leaky IF (cLIF and vLIF, respectively), as well as the gIF models. The parameters used for the synaptic kinetics, the time course of synaptic conductances, and the synaptic release activity are the same as for Figure 4.
In order to fully reproduce the spontaneous discharge statistics seen in experiments (e.g., Smith & Smith, 1965; Noda & Adey, 1970; Burns & Webb, 1976; Softky & Koch, 1993; Stevens & Zador, 1998b), the high irregularity should also stem from a Poisson process; that is, the interspike intervals must both follow a gamma distribution and be independent (Christodoulou & Bugmann, 2001; Rudolph & Destexhe, 2003).
Table 3: Specific Parameter Setup Used in Some of the Simulations.

Model   ν_e       ν_i        ν_out     C_V
BM      24 kHz    9.3 kHz    13.5 Hz   1.09
PME     30 kHz    8.2 kHz    12.8 Hz   0.96
cLIF    6 kHz     1.68 kHz   13.7 Hz   0.36
vLIF    26 kHz    8.4 kHz    12.5 Hz   0.80
gIF1    32 kHz    8.4 kHz    12.2 Hz   0.92
gIF2    40 kHz    6.0 kHz    11.9 Hz   0.95
gIF3    20 kHz    10.2 kHz   13.9 Hz   0.91

Notes: For all models, the synaptic input rates for excitation and inhibition (ν_e and ν_i, respectively) were chosen to yield an output rate ν_out of about 13 Hz, a high discharge variability C_V around 1 (except for cLIF and vLIF), and in the conductance-based models (BM, PME, gIF1, gIF2, gIF3) a total input conductance about five times larger than the leak conductance and comparable to the leak conductance of the vLIF model. For both the cLIF and vLIF models, no combination of driving frequencies yields C_V values around 1 at an output rate of 13 Hz. Therefore, in the vLIF model, the input rates were chosen by taking the inhibitory rate of the gIF1 model and adjusting the excitatory rate to yield the desired output rate of about 13 Hz. For the cLIF model, the inhibitory input rate was then reduced five times and the excitatory rate adjusted to yield the same ν_out. Model descriptions are given in appendix A.
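The ISI histograms discussed next are fitted with gamma distributions of the form $\rho_{ISI}(T) = (a r / q!)\,(rT)^q e^{-rT}$ (see the Figure 6 caption). As a quick numerical sanity check of this form, it integrates to $a$ over all $T$; here we use the BM fit parameters reported in that caption:

```python
import math

def rho_isi(T, q, r, a):
    """Gamma-distribution fit to the ISI histogram (Figure 6 caption)."""
    return a * r / math.factorial(q) * (r * T) ** q * math.exp(-r * T)

# BM fit parameters: q = 3, r = 0.091 / ms, a = 0.737.
q, r, a = 3, 0.091, 0.737
dT = 0.01  # ms
# Riemann sum over 0..2000 ms; should recover the captured mass a.
mass = sum(rho_isi(i * dT, q, r, a) * dT for i in range(1, 200_000))
assert abs(mass - a) < 1e-3
```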
To test this, we chose for each model a synaptic activity that resulted in an average output rate around 13 Hz, a value consistent with the spontaneous discharge rate observed in cortical neurons in vivo (e.g., Evarts, 1964; Steriade & McCarley, 1990). Moreover, in order to account for experimental observations in the cortex in vivo (Borg-Graham et al., 1998; Paré et al., 1998), in all models except the cLIF model, synaptic activity was chosen to yield an about fivefold reduced membrane time constant compared to the leak time constant (see appendix A and Table 3). In all cases, the ISI histograms (ISIHs) could be well fit with gamma distributions (see Figure 6). However, only in the biophysical and the gIF models was the behavior expected from a Poisson process, namely, gamma-distributed ISIs and a flat autocorrelogram (Figure 6, BM and gIF1 to gIF3), accompanied by a high $C_V$ value around unity (see Table 3). Indeed, the discharge in both the cLIF and vLIF models was much more regular, although the ISIHs resembled gamma distributions for the examples studied (Figure 6, cLIF and vLIF). Surprisingly, for both LIF models, no parameter setup in the investigated parameter space yielded both a high $C_V$ and a desired output rate around 13 Hz at the same time. Moreover, the autocorrelogram in the LIF model showed a peak at small lag times (Figure 6, cLIF, star), indicating that subsequent output spikes were not independent. This behavior was mirrored in simulations of the PME model (Figure 6, PME). Here, despite the highly irregular discharge, a pronounced peak in the ISIH at small interspike intervals also suggests deviations from a spontaneous
Figure 6: Typical interspike-interval histograms (ISIHs) and autocorrelograms (insets) for the biophysical model (BM), the passive membrane equation (PME), the classical and very leaky IF (cLIF and vLIF, respectively), as well as the three gIF models. Synaptic input rates were chosen to yield comparable output rates and, except for the cLIF model, a fivefold decrease in input resistance compared to the quiescent case (see Table 3). ISIHs were fitted with gamma distributions $\rho_{ISI}(T) = \frac{a\,r}{q!}\,(rT)^q\,e^{-rT}$, where $\rho_{ISI}(T)$ denotes the probability for the occurrence of ISIs of length $T$, and $r$, $a$, and $q$ are parameters. Fitted parameters are: $q = 3$, $r = 0.091$ ms$^{-1}$, $a = 0.737$ (BM); $q = 1$, $r = 0.042$ ms$^{-1}$, $a = 0.824$ (PME); $q = 10$, $r = 0.159$ ms$^{-1}$, $a = 0.965$ (cLIF); $q = 1$, $r = 0.029$ ms$^{-1}$, $a = 0.970$ (vLIF); $q = 1$, $r = 0.031$ ms$^{-1}$, $a = 0.873$ (gIF1); $q = 1$, $r = 0.028$ ms$^{-1}$, $a = 0.845$ (gIF2); $q = 1$, $r = 0.034$ ms$^{-1}$, $a = 0.868$ (gIF3).
discharge that is both independent and Poisson distributed, thus indicating limitations of a fixed threshold spike-generating mechanism to reproduce realistic discharge statistics. Finally, it is interesting to note that in the ISIH of all models, a small peak for small ISIs was observed (see Figure 6). A possible explanation is that the models here are driven by uncorrelated synaptic inputs. In general, correlated input leads to an increase in the variability of the membrane state (Rudolph & Destexhe, 2005), which translates into a higher firing
rate. In order to obtain a desired output rate, the lack of correlation in the synaptic input has to be “compensated” for by an increase in the ratio between excitatory and inhibitory synaptic rates. The resulting pronounced excitatory drive causes volleys of “preferential” firing, which show up as peaks in the ISIH.

3.2 Temporal Resolution of Synaptic Inputs. As a second comparative test of the considered models, we investigated to what extent synaptic inputs could be temporally resolved. Due to the smaller membrane time constant and, hence, faster membrane in high-conductance states, a better temporal resolution of synaptic inputs is expected. To test this, we complemented a fixed background activity (see Table 3) with a suprathreshold periodic stimulus corresponding to a simultaneous activation of 30 synapses (6 synapses in the cLIF model). The frequency of the stimulus $\nu_{stim}$ was changed in successive trials between 1 and 200 Hz, corresponding to interstimulus intervals $T_{stim}$ ranging from 1000 down to 5 ms, respectively. The response of the cell to the periodic stimulus results in peaks at corresponding interspike intervals $T_{ISI}$ in the ISIH (see Figure 7A). Due to limitations in the temporal resolution capability of the membrane and the related low-pass filtering of synaptic inputs, $T_{ISI}$ will in general be larger than $T_{stim}$, especially for high-frequency stimuli. We used this “deviation” from the ideal situation, for which the stimulus would lead to a sharp peak in the ISIH at $T_{ISI} = T_{stim}$, as
Figure 7: Temporal resolution of periodic synaptic inputs. (A) In addition to Poisson synaptic inputs (see Table 3), a periodic stimulus of frequency $\nu_{stim}$ between 1 Hz and 200 Hz, with an amplitude corresponding to the simultaneous activation of 30 synapses (6 synapses in the cLIF model), was used. The temporal resolution was quantified by computing the ratio between the stimulus interval $T_{stim}$ and the mode of the corresponding ISIH at $T_{ISI}$ (see equation 3.2). If multiple peaks were present in the ISIH (see Figure 8), the interspike interval $T_{ISI}$ of the leading peak was used. (B) Output frequency $\nu_{out}$ as a function of $\nu_{stim}$ for the biophysical model (BM), the passive membrane equation (PME), the classical and very leaky IF (cLIF and vLIF, respectively), as well as the three gIF models. Compared to the biophysical model, the LIF neuron models yield much lower firing rates for all stimuli (left). Only in the gIF models were the output rates comparable to those observed in the biophysical model (right). (C) Temporal resolution as a function of stimulus frequency. For the chosen parameter setup, the LIF models were not able to resolve frequencies beyond 30 Hz (cLIF) and 80 Hz (vLIF; see left panel). Moreover, the temporal resolution changed abruptly in these models due to mode locking (see text). This behavior was not seen in the biophysical model and occurred in the gIF1 and gIF2 models (right) in a reduced fashion for high $\nu_{stim}$. In contrast, the gIF3 model temporally resolved inputs above 120 Hz in a reliable fashion.
a measure of temporal resolution (Destexhe et al., 2003), namely,

$$\text{Temporal resolution} = \frac{T_{stim}}{T_{ISI}}. \tag{3.2}$$
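This measure is straightforward to compute from a binned ISIH. The sketch below (toy histogram and function name are ours) illustrates equation 3.2:

```python
# Temporal resolution (equation 3.2) estimated from a binned ISI histogram:
# the ratio of the interstimulus interval T_stim to the ISI at the
# histogram peak, T_ISI.

def temporal_resolution(isi_histogram, bin_width, t_stim):
    """T_stim / T_ISI, with T_ISI taken at the mode of the ISIH."""
    peak_bin = max(range(len(isi_histogram)), key=isi_histogram.__getitem__)
    t_isi = (peak_bin + 0.5) * bin_width  # bin center
    return t_stim / t_isi

# A 100 Hz stimulus (T_stim = 10 ms) whose response ISIs cluster around
# 12.5 ms is resolved with a ratio of 0.8, i.e., imperfect resolution.
hist = [0] * 10 + [2, 5, 9, 4, 1] + [0] * 5   # counts in 1 ms bins
assert temporal_resolution(hist, 1.0, 10.0) == 0.8
```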
Already for the output rates $\nu_{out}$ as a function of the stimulus frequency, marked differences between the biophysical, PME, and gIF models, on one hand, and the leaky IF neuron models, on the other hand, were found (see Figure 7B). In the conductance-based models, $\nu_{out}$ increased in coherence with the input frequency, $\nu_{out} \sim \nu_{stim}$, up to an input frequency of about 80 Hz (see Figure 7B, solid lines). For the given parameter setup, this frequency marks the stimuli for which the temporal resolution started to decrease (see below). In contrast, in the LIF models, the rise of the output rate as a function of $\nu_{stim}$ was much smaller and, surprisingly, nearly independent of the membrane time constant (see Figure 7B, dashed lines). Larger membrane time constants impair the ability of the neuronal membrane to temporally resolve fast synaptic inputs. Theoretically, no inputs faster than the total membrane time constant can be reliably resolved, thus making the neuronal membrane a low-pass filter. In agreement with this, in the cLIF model, the temporal resolution dropped markedly at around 30 Hz (see Figure 7C, light gray dashed line), whereas the smaller membrane time constant in the vLIF model allowed resolving inputs up to 80 Hz (see Figure 7C, dark gray dashed line). Although the membrane time constant in the vLIF model corresponded to that in the biophysical model, in the latter case no sharp drop but a smooth decrease in the temporal resolution measure for increasing $\nu_{stim}$ was observed (compare the dark gray dashed and black solid lines in Figure 7C). This abrupt change in the response behavior is typical for IF neurons and at least partially linked to the fixed firing threshold and current-based (i.e., not state-dependent) update of the membrane state upon arrival of a synaptic input or stimulus (for experimental and theoretical investigations, see, e.g., Brumberg, 2002; Fourcaud-Trocmé, Hansel, van Vreeswijk, & Brunel, 2003; Gutkin, Ermentrout, & Rudolph, 2003).
The ISIH in the biophysical model showed only one wide peak for stimulation frequencies up to 200 Hz (see Figure 8, BM, stars), indicating a reliable cellular response to the suprathreshold input, jittered around the stimulus times due to the presence of random synaptic activity. This peak shifted toward smaller ISIs for increasing $\nu_{stim}$. However, the higher the stimulating frequency, the more the low-pass property of the membrane determined the response, leading to a limit in the temporal resolution. Although at high $\nu_{stim}$ there was still a clear response of the cell (see Figure 8, BM, right), this response became increasingly decoupled from the stimulus in the sense that the cell spikes at a designated rate driven by the suprathreshold input, but independent of its temporal characteristics.
Figure 8: Typical ISIHs for all models at four different stimulation frequencies (10 Hz, 50 Hz, 100 Hz, and 150 Hz; parameters of synaptic inputs are the same as in Figure 7). The cLIF and vLIF models were not able to resolve higher frequencies in a reliable fashion. Together with the gIF1 and gIF2 models, these LIF models showed a locking to the stimulus frequency, thus leading to sudden jumps in the temporal resolution (see Figure 7C). The gIF3 model and passive membrane equation (PME) came closest to the behavior of the biophysical model (BM) with respect to both the temporal resolution of high frequencies (see the stars for corresponding modes in the ISIHs) and the lack of locking behavior.
In contrast, the response of the LIF models showed multiple peaks in the ISIH (see Figure 8, cLIF and vLIF) as a result of mode skipping. As in the biophysical model, the ISIs of the peaks decrease for increasing $\nu_{stim}$. The latter also leads to a decrease in the amplitude of the leading peak, indicating that fewer and fewer responses can follow the temporal structure of the driving stimulus. At some stimulation frequency, determined by the membrane time constant, this leading peak disappears, and the following peak must be viewed as the direct cellular response to the stimulus. Because the $T_{ISI}$ of the leading peak was used to estimate the temporal resolution (see equation 3.2), this leads to an abrupt change in the temporal resolution, as shown in Figure 7C (left) by the steplike decline. Although the cell still responds locked to the stimulus for higher frequencies, modes of the stimulus are persistently skipped (for a discussion of this issue, see, e.g., Gutkin et al., 2003). Mode skipping was, although to a smaller extent, also observed in simulations of the passive membrane equation (see Figure 7C, left, PME) as well as the gIF1 and gIF2 models (see Figure 7C). In general, there were fewer modes in the ISIH of the gIF models, as indicated by the larger peaks (see Figure 8, gIF1 and gIF2; for the LIF models, many peaks occurred outside the ISI regime depicted in Figure 8). Moreover, all gIF models as well as the PME model reliably resolved synaptic inputs beyond 100 Hz. The gIF3 model best reproduced the response behavior found in the biophysical model with respect to both the temporal resolution of high stimulating frequencies and a graded decrease of the temporal resolution for very high frequencies (see Figure 7C, right).
Quantitative differences observed for the temporal resolution between these two models are primarily attributable to the effects of active membrane conductances, which are significant at high stimulation frequency, as well as to the activity-dependent spike threshold present in the biophysical but not the gIF3 model. This conclusion is also supported by the response behavior of the passive membrane with fixed spike threshold, which mimicked that of the biophysical model in both output rate and temporal resolution for a very broad regime at the low end of driving input frequencies (see Figures 7B and 7C, left). Finally, we note that the ISIHs of the PME and gIF3 models showed in general only one peak (see Figure 8, gIF3 and PME, stars), in accordance with the biophysical model. Only for very large νstim did a second peak occur, which, at least for the gIF3 model, was always less pronounced than the leading peak. These results suggest that mode skipping is attributable not only to the sharp firing threshold and, thus, the lack of more realistic spike-generation dynamics, but also to the nature of the synaptic inputs. The latter update the membrane state in the LIF model in a state-independent fashion, thus resembling current inputs. This state-independent update is also present in the gIF1 and, partially, the gIF2 model, despite the conductance-based dynamics that here describes the cellular behavior in between synaptic inputs.
IF Neuron Models with Conductance-Based Dynamics
2175
3.3 Gain Modulation.

In a final set of simulations, we addressed the question of the extent to which the simplified models capture the modulatory effect of synaptic background activity on the response gain. To that end, we stimulated the cells periodically with 0 to 50 simultaneously releasing excitatory synapses with parameter settings given in Table 2, thus leading to synaptic stimuli of different amplitude. In addition, synaptic background activity was altered by scaling the frequency of excitatory and inhibitory inputs between 0.5 and 2.0 around the values given in Table 3. The behavior was then characterized by the probability of emitting a spike in response to a given excitatory stimulus. In all models, the response probability showed a sigmoidal behavior as a function of the stimulation amplitude (see Figure 9). Therefore, the amplitude at which the response probability reaches 50% (mid-amplitude) and the maximum slope (gain) were used to further quantify the response (see Figure 10A, left). In all cases, synaptic background activity was efficient in modulating the cellular response (see Figure 9), in particular the response gain. With the exception of the gIF2 model, the response became more graded for increasing frequency of the synaptic background, as indicated by the smaller slope of the response curves for higher synaptic background activity (see Figure 10C). In the LIF models, a sharp decrease of the slope to nearly zero was observed for synaptic background frequencies larger than the optimal one (see below; Figure 10B, left, star). In the gIF2 model (see Figure 9, gIF2), the response probability for a given stimulus amplitude was, in general, higher for smaller synaptic background activity. This result can again be explained by the membrane dynamics of the gIF2 model. In this model, the PSP amplitude depends on the total membrane conductance only and decreases for larger membrane conductance.
Therefore, in high-conductance states caused by intense synaptic inputs, the fluctuations of the membrane state variable are reduced, which leads to an effective reduction of the discharge rate and, hence, of the response probability. Although the response curves in the different models showed a sigmoidal behavior, only the gIF3 model came qualitatively close to the behavior seen in the biophysical model (see Figure 9; compare BM and gIF3), closely followed by the passive membrane with fixed spike threshold (see Figure 9, PME). In the LIF models, the mid-amplitude covered a wide range of values as a function of the background intensity (see Figure 10B, left, dashed lines). Our results indicate an optimal regime (see Figure 10B, left, star) where the stimulation amplitude needed to evoke 50% of the response is smallest. This behavior resembles the stochastic resonance phenomenon: for small noise levels, stronger stimuli are needed to cross the firing threshold, whereas more intense synaptic background activity will cause more spontaneous spikes, which interfere with the response to a stimulus of given amplitude. Although some indications of such an optimal response regime are also present in the biophysical and gIF models (see Figure 10B, right), the
[Figure 9 panels: response probability (0–1) versus stimulation amplitude (10–40 synapses) for the BM, cLIF, vLIF, PME, gIF1, gIF2, and gIF3 models, each shown for SBA scaling factors 0.5, 1.0, 1.5, and 2.0.]
Figure 9: Response probability as function of stimulation amplitude and synaptic background activity for the biophysical model (BM), the passive membrane equation (PME), classic and very leaky IF (cLIF and vLIF, respectively), and the three gIF models (gIF1 to gIF3). The stimulation amplitude is given by the number of simultaneously releasing excitatory synapses (between 0 and 50; see appendix A and Table 2 for parameters). The synaptic background activity was changed by applying a common scaling factor (ranging from 0.5 to 2.0) to the frequency of excitatory and inhibitory synaptic inputs given in Table 3. The response probability shows, in general, a sigmoidal behavior, but only the gIF3 (and to a lesser extent the gIF1) model was able to capture the qualitative behavior of the response curves seen in the biophysical model.
total range of covered mid-amplitudes in these models was much smaller than in the LIF models. This wider working range is a direct result of the conductance-based nature of these models. The total membrane conductance increases with the level of synaptic activity, thereby lowering the
[Figure 10 panels: (A) response probability versus stimulation amplitude with sigmoidal fit, and mid-amplitude plotted against slope; (B) mid-amplitude and (C) slope versus SBA scaling (0.6–1.8) for BM, PME, cLIF, vLIF, gIF1, gIF2, and gIF3.]
Figure 10: Gain modulation in the biophysical model (BM), the passive membrane equation (PME), classic and very leaky IF (cLIF and vLIF, respectively), and the three gIF models. (A) Left: Response probability ρ(gstim) (gray) as a function of the number of simultaneously releasing excitatory synapses. ρ(gstim) was fitted with a sigmoidal function (black), ρ(gstim) = (1 − exp(−a gstim))/(1 + b exp(−a gstim)), where a and b are free parameters and gstim denotes the stimulation amplitude. From these fits, the stimulation amplitude (number of simultaneously releasing excitatory synapses; see Figure 9) yielding a probability of 50% (mid-amplitude) and the slope were estimated and plotted against each other for different levels of synaptic background activity (right). (B, C) Mid-amplitude and slope of response probability curves as functions of the synaptic background activity (SBA) for the different models. In all cases, the synaptic background activity was changed by applying a common scaling factor (SBA scaling), ranging from 0.5 to 2.0, to the frequency of excitatory and inhibitory synaptic inputs given in Table 3.
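The two quantities extracted from the fits, the mid-amplitude and the gain, can be computed numerically from the fitted sigmoid ρ(g) = (1 − e^(−ag))/(1 + b e^(−ag)). A small sketch (the parameter values a = 0.1 and b = 3 are illustrative, not fitted values from the paper):

```python
import math

def rho(g, a, b):
    # Sigmoidal response curve used for the fits in Figure 10A.
    e = math.exp(-a * g)
    return (1.0 - e) / (1.0 + b * e)

def mid_amplitude(a, b, lo=0.0, hi=200.0, iters=60):
    # Bisection for the stimulation amplitude where rho reaches 50%;
    # rho increases monotonically from 0 toward 1, so this is well posed.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if rho(mid, a, b) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gain(a, b, lo=0.0, hi=200.0, n=20000):
    # Maximum slope of rho, estimated by finite differences on a fine grid.
    h = (hi - lo) / n
    return max((rho(lo + (i + 1) * h, a, b) - rho(lo + i * h, a, b)) / h
               for i in range(n))

# Illustrative parameters a = 0.1, b = 3.
g_half = mid_amplitude(0.1, 3.0)
```

Solving ρ(g) = 1/2 analytically gives g_half = ln(2 + b)/a, which the bisection reproduces; for b > 1 the maximum slope of this sigmoid is a(1 + b)/(4b), attained at g = ln(b)/a.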
probability of evoking a response for stimuli of given amplitude by shifting the response curve to higher stimulation amplitudes (shunting effect). On the other hand, more intense synaptic activity increases the fluctuation amplitude of the membrane state and thus fosters the (spontaneous) discharge rate. The latter results in a shift of the response curve to smaller stimulation amplitudes. Both effects robustly counterbalance each other, leaving the mid-amplitude nearly unaffected in the investigated parameter regime. This allows the cell to respond in a discriminating manner over a broad synaptic input regime (see Figures 10B and 10C, solid), which may provide the basis for more efficient computations and therefore add computational advantages to single-cell dynamics. In contrast, in the LIF models, this conductance-induced shift of the response curve is lacking, leaving only the observed increase in the mid-amplitude for increased synaptic noise. This leads to a saturation of the cellular response for moderate total synaptic input rates (see Figures 10B and 10C, dashed) and therefore narrows the synaptic input regime that can be discriminated and utilized for computations.

4 Event-Driven Implementation of gIF Models

The analytic form of the state variable m(t) (see equations 2.14 and 2.24) as well as its update Δm at the arrival of a synaptic event (see equations 2.20 and 2.29) allow the introduced gIF models with presynaptic-activity-dependent state dynamics to be used together with event-driven simulation strategies. In this section, we first briefly recall the basic ideas behind event-driven and clock-driven simulation strategies, before we present a specific implementation of the proposed models in an event-driven framework along with a coarse evaluation of its performance.

4.1 Clock-Driven vs. Event-Driven Simulation Strategies.
In most computational neuronal models, in particular biophysical models, neuronal dynamics is described by systems of, in general, nonlinear coupled differential equations. The strict constraints in solving such systems analytically led to the development of a variety of numerical techniques based on the discretization of space and time. Because neuronal state variables are evaluated on a discretized time axis, or time grid, and times of occurring events, such as synaptic inputs or spikes, are assigned to discrete grid points, such simulation techniques are called synchronous or clock driven. While the algorithmic complexity, or computational load, scales with the number of neurons in the modeled network as well as with the number of differential equations describing the single-neuron dynamics, it also scales linearly with the temporal resolution, or time-grid constant, used. Although the computational load is expected to be largely independent of the activity in the network, the temporal discretization in numerical methods, in particular those utilizing fixed time steps, introduces an artificial cutoff for timescales
captured by the simulation. As a direct consequence, short-term dynamical transients might be captured only incompletely or not at all (Tsodyks et al., 1993; Hansel et al., 1998). Moreover, the artificial assignment of event times to grid points might lead to a bias in dynamic behaviors, such as the emergence of synchronous network activity or oscillating network states (e.g., Hansel et al., 1998; Shelley & Tao, 2001), or, as we have observed, to an impact on the weight development of synapses subject to spike-timing-dependent plasticity. A more accurate way of simulating neural activity is to keep the exact event times and to evaluate state variables at the times of occurring events, thus freeing the algorithmic complexity from its dependence on the temporal resolution. Such an asynchronous or event-driven approach was proposed recently (Watts, 1994; Mattia & Del Giudice, 2000; Reutimann et al., 2003). It makes use of two ideas. First, in biological neural networks, interaction between neurons occurs primarily through synapses, and synaptic inputs can be viewed as discrete events in time. Second, at least in mammalian cortex, the average firing rate is low (around 10 Hz for spontaneous activity during active states; see Evarts, 1964; Hubel, 1959; Steriade, 1978; Matsumura, Cope, & Fetz, 1988; Holmes & Woody, 1989; Steriade, Timofeev, & Grenier, 2001). Therefore, synaptic events occur in relative isolation, and a single neuron is dynamically decoupled from the network for most of the time. If the differential equations describing the biophysical dynamics allow an analytic solution of the state variables (for a relaxation of this requirement, see Hines & Carnevale, 2004; Lytton & Hines, 2005), the neuronal state at a time t can be explicitly determined from an initial state at an earlier time t0 and the elapsed time interval t − t0, without iteratively updating state variables at discrete time steps in between t0 and t.
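As a minimal illustration of this idea (not the gIF equations themselves, which involve activity-dependent time constants), consider a passive leak dm/dt = −(m − mrest)/τ: the analytic solution lets the simulator "jump" directly from t0 to t in one evaluation, whereas a clock-driven integrator must step through every grid point in between. All constants below are illustrative:

```python
import math

def exact_jump(m0, m_rest, tau, dt):
    # Analytic solution of dm/dt = -(m - m_rest)/tau after an interval dt:
    # a single evaluation, regardless of how large dt is.
    return m_rest + (m0 - m_rest) * math.exp(-dt / tau)

def clock_driven(m0, m_rest, tau, dt, h=0.01):
    # Forward-Euler integration on a fixed grid of step h (clock-driven):
    # the cost grows linearly with dt / h, and the result is approximate.
    m, t = m0, 0.0
    while t < dt:
        step = min(h, dt - t)
        m += -step * (m - m_rest) / tau
        t += step
    return m

# Jumping 5 ms in one step versus 500 Euler steps (tau = 20 ms, illustrative).
m_exact = exact_jump(1.0, 0.0, tau=20.0, dt=5.0)
m_grid = clock_driven(1.0, 0.0, tau=20.0, dt=5.0)
```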
This uncouples the computational load from the numerical accuracy of the simulation and, thus, from constraints imposed by the timescales of the involved biophysical processes. However, it does so at the expense of requiring an analytically describable evolution, either exact or approximated, of the neural state variables. Moreover, although the computational load still depends on the number of neurons in the same way as in clock-driven approaches, it now scales in addition linearly with the number of events, that is, with the average activity, in the network. However, comparing the activity-dependent computational load of event-driven simulations with that of clock-driven simulations with reasonable temporal resolution suggests that the event-driven simulation strategy remains a highly efficient alternative to clock-driven approaches if network activity typically seen in the cortex in vivo is considered. For instance, assuming a network of N neurons, each interconnected by 10^4 synapses (Szentagothai, 1965; Cragg, 1967; Gruner, Hirsch, & Sotelo, 1974; DeFelipe & Fariñas, 1992; DeFelipe, Alonso-Nanclares, & Arellano, 2002), and with an average discharge rate of 10 Hz for each neuron (e.g., Steriade et al., 2001), the total number of events generated is of the order of N · 10^5 per second.
This number equals that of the state variable updates that need to be performed within the same time interval. On the other hand, in clock-driven simulations, N/dt state variable updates have to be performed per second, where dt denotes the temporal resolution of the simulation. Thus, for dt = 0.01 ms, the number of state updates is the same in both simulation approaches. However, the event-driven simulation will be superior in accuracy compared to the approach with fixed temporal binning, as its precision is constrained only by the machine precision.

4.2 A Specific Event-Driven Implementation.

In its most efficient implementation, an event-driven simulation strategy makes use of an analytic form of the state equations describing the evolution of the membrane state variables in between the arrival of synaptic events. This way, the sole knowledge of the membrane state at the last synaptic event, along with the time difference, is sufficient to calculate and update the state variables at the arrival time of a new synaptic event. This allows "jumping" from event to event rather than evaluating the membrane state variables on the temporal grid that defines synchronous, or clock-driven, simulation strategies. In all IF neuron models with presynaptic-activity-dependent state dynamics proposed here (gIF1 in section 2.3, gIF2 in section 2.4, and gIF3 in section 2.5), the equations describing the membrane state and its update at the arrival of synaptic events are analytically closed and, hence, applicable in event-driven simulations. We incorporated the gIF models in the NEURON simulation environment (Hines & Carnevale, 1997, 2004), which provides an efficient and flexible framework for event-driven modeling (scripts for the gIF models are available online at http://cns.iaf.cnrs-gif.fr/), as well as in a custom C/C++ software tool for large-scale network simulations.
In these implementations, the following steps are executed on arrival of a synaptic event at time t1:

Step 1. At the arrival of a synaptic event, independent of the current state of the membrane, the actual membrane time constant τm(t1) and its synaptic contributions τm^{e,i}(t1) are calculated using equation 2.13 (for gIF1, gIF2, and gIF3) with the values of the synaptic contributions to the membrane time constant τm^{e,i}(t0) for the previous synaptic event at time t0.

Step 2. In the gIF3 model, the effective reversal state mrest(t1) is calculated (see equation 2.25) based on the values of the synaptic contributions to the membrane time constant τm^{e,i}(t1) at time t1 (calculated in step 1).

Step 3. If the neuron is not in its refractory period, the state variable m(t1) is calculated using equation 2.14 for the gIF1 and gIF2 models or equation 2.24 for the gIF3 model, in which t0 denotes the time of the previous synaptic event, m(t0) is the membrane state, and τm^{e,i}(t0) are the excitatory
and inhibitory synaptic contributions to the total membrane time constant at time t0. For the gIF3 model, in addition, the actual effective reversal state mrest(t1) (calculated in step 2) is used.

Step 4. If the neuron is not in its refractory period, the state variable m(t1) is updated by Δm^{e,i} = const (for the gIF1 model), Δm^{e,i}(τm(t1)) (for the gIF2 model, equation 2.20), or Δm^{e,i}(τm(t1), m(t1)) (for the gIF3 model, equation 2.29), where the indices e and i denote excitatory and inhibitory synaptic inputs, respectively.

Step 5. Depending on the type of synaptic input received at time t1, the corresponding synaptic contribution to the membrane time constant τm^{e,i}(t1) is updated by Δτm^{e,i} = const (see equation 2.16 for gIF1, gIF2, and gIF3).

Step 6. If the updated m(t1) exceeds the firing threshold mthres, a spike is generated and the cell enters an absolute refractory period, after which the state variable is reset to its resting value mrest. The spike event is added to an internal event list and causes, after a transmission delay, a synaptic event in each target cell.

Note that this implementation constitutes only one possibility, as the order of updating the state variables can be modified.
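The six steps can be sketched as an event-driven update loop. The sketch below is schematic only: the concrete update rules (equations 2.13 to 2.29 of the paper) are not reproduced here and are replaced by illustrative placeholders (exponentially decaying synaptic contributions to the time constant, constant gIF1-like state increments); every constant is hypothetical, and refractoriness and spike delivery are omitted.

```python
import math
from dataclasses import dataclass

# Illustrative placeholder constants (NOT values from the paper).
TAU0, TAU_SYN = 20.0, 5.0        # leak and synaptic-decay time constants (ms)
M_REST, M_THRES = 0.0, 1.0       # resting and threshold values of m
DM = {"e": 0.2, "i": -0.1}       # per-event state updates (gIF1-like: const)
DTAU = {"e": 1.0, "i": 2.0}      # per-event time-constant increments (const)

@dataclass
class GIFState:
    t: float = 0.0               # time of the last processed event
    m: float = M_REST            # membrane state variable
    tau_e: float = 0.0           # excitatory contribution to tau_m
    tau_i: float = 0.0           # inhibitory contribution to tau_m

def on_event(s, t1, kind):
    """Process one synaptic event of type 'e' or 'i' arriving at time t1;
    returns True if a spike is emitted (refractoriness omitted for brevity)."""
    dt = t1 - s.t
    # Step 1: decay the synaptic contributions and recompute tau_m(t1)
    # (placeholder for equation 2.13).
    s.tau_e *= math.exp(-dt / TAU_SYN)
    s.tau_i *= math.exp(-dt / TAU_SYN)
    tau_m = TAU0 + s.tau_e + s.tau_i
    # Steps 2-3: evolve m analytically from the last event to t1
    # (placeholder for equations 2.14/2.24; no separate mrest(t1) here).
    s.m = M_REST + (s.m - M_REST) * math.exp(-dt / tau_m)
    # Step 4: apply the event-triggered state update (constant, as in gIF1).
    s.m += DM[kind]
    # Step 5: bump the matching time-constant contribution (equation 2.16 form).
    if kind == "e":
        s.tau_e += DTAU["e"]
    else:
        s.tau_i += DTAU["i"]
    s.t = t1
    # Step 6: threshold crossing -> spike and reset.
    if s.m >= M_THRES:
        s.m = M_REST
        return True
    return False
```

Feeding the loop a train of excitatory events at 1 ms intervals makes the state ratchet up, with an activity-dependent decay in between, until a spike is emitted and the state resets.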
4.3 Performance Evaluation.

As outlined in the previous section, the analytic form of the state equations allows the use of the gIF models in precise and efficient simulation strategies, in particular event-driven simulation approaches. However, due to the more complex dynamics of the gIF models compared to the LIF neuron models, a reduction in performance compared to their LIF counterparts must be expected. To investigate this issue in more detail, we analyzed the performance of all neuron models by comparing the time needed to simulate 100 s of neural activity in the NEURON simulation environment (Hines & Carnevale, 1997, 2004), running on a 3 GHz Dell Precision 350 workstation (see appendix A). In the LIF and gIF models, synaptic inputs were chosen to be Poisson distributed, with a rate between 4 and 80 kHz for the excitatory channel and between 1 and 20 kHz for the inhibitory channel. Both rates were varied proportionally, such that the total average rate took values between 5 and 100 kHz. In the biophysical model, 8000 excitatory and 2000 inhibitory channels were driven by Poisson distributed inputs with average rates between 0.5 and 10 Hz, thus yielding the same total average rate. Synaptic and cellular properties were the same as in the previous models (see Tables 1 and 2 as well as appendix A). In the investigated input parameter regime, the total simulation time, consisting of both the updates of the neural state variables and the
generation of random synaptic inputs, was at least two orders of magnitude smaller for the IF neuron models than for the biophysical model (see Figure 11A). Moreover, as expected, for lower and biophysically more realistic input rates around 20 kHz, the event-driven simulation strategy was more efficient than the clock-driven simulation approach (see Figure 11A; compare the solid and dashed lines). At higher rates, however, event-driven simulations were outperformed by corresponding clock-driven simulations (see Figure 11A, star), due to the approximately linear scaling of their simulation time with the number of events (note the logarithmic scale in Figures 11A to 11C; for linear plots, see the insets). The performance of the latter was nearly independent of the input drive. Considering the time needed for updating the neuronal state variables only, no significant differences between clock-driven and event-driven simulation strategies were observed in the case of the LIF and gIF models (see Figure 11B; compare the gray dashed and solid lines). Whereas the performance of the biophysical model as well as of the passive membrane model with fixed spike threshold remained nearly independent of the input drive (see Figure 11B, black solid and dashed lines), both the LIF and gIF models showed an approximately linear scaling with the total input rate (see Figure 11B, gray solid and dashed lines; see also the inset), independent of which simulation strategy was used. This linear scaling behavior with the number of events is expected for event-driven simulation strategies. Its unexpected appearance in clock-driven simulations, where the simulation time ideally depends only on the chosen temporal resolution, can be explained by the use of a common optimization scheme as well as by the relation between the temporal resolution t_res of the simulation and the number of occurring events.
Although in the ideal clock-driven approach the neuronal state is evaluated at each point on a fixed time grid, an optimization can be achieved by not updating the state variable when no synaptic event was present within the preceding time interval of length t_res. Thus, the state updates, which constitute the major part of the computational load in the considered models, are mainly driven by input events as long as the number of events (total input rate) is smaller than 1/t_res. This leads to a scaling comparable to that expected for pure event-driven simulation strategies. Accordingly, due to the use of a time resolution of 0.01 ms (see section A.3), for biophysically rather unrealistic rates beyond 100 kHz, the clock-driven simulations no longer scaled with the input rate and therefore ran faster than corresponding event-driven simulations. A better performance for clock-driven simulations could also be achieved by lowering the temporal resolution, but not without a crucial impact on precision (for a discussion of this subject, see Hansel et al., 1998; Shelley & Tao, 2001). Finally, the LIF neuron model outperformed the biophysical model on average by a factor of about 600 and the passive membrane model by a factor of 15, whereas the gIF models were, on average, only a factor of three slower than the corresponding LIF simulations (see Figures 11C and 11D).
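The scaling arguments of sections 4.1 and 4.3 amount to simple arithmetic, sketched here with the values quoted in the text (10^4 synapses per neuron, a 10 Hz mean rate, and a 0.01 ms temporal resolution):

```python
# Section 4.1: events delivered per neuron per second in the network.
synapses_per_neuron = 1e4        # ~10^4 synapses per cortical neuron
mean_rate_hz = 10.0              # average discharge rate per neuron
events_per_neuron = synapses_per_neuron * mean_rate_hz   # 1e5 -> N * 10^5 total

# Clock-driven updates per neuron per second at resolution dt = 0.01 ms.
dt_s = 0.01e-3
updates_per_neuron = 1.0 / dt_s                          # also ~1e5

# Section 4.3: with the skip-if-no-event optimization, clock-driven updates
# remain event-driven in practice as long as the total input rate stays
# below 1/t_res, i.e., 100 kHz for t_res = 0.01 ms.
crossover_rate_hz = 1.0 / dt_s
```

Both counts coincide at dt = 0.01 ms, which is why the clock-driven curves in Figure 11 stop scaling with the input rate only beyond roughly 100 kHz.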
[Figure 11 panels: (A, B) simulation time in seconds (logarithmic scale, with linear insets) versus total input rate (20–100 kHz) for BM, PME, LIF, gIF1, gIF2, and gIF3 under clock-driven and event-driven strategies; (C, D) performance ratios relative to the clock-driven LIF simulation.]
Figure 11: Performance evaluation of the biophysical model (BM), the passive membrane equation (PME), the leaky IF (LIF), as well as the three gIF models (gIF1 to gIF3). (A) Total time needed for simulating 100 s of neural activity (neural dynamics and generation of random synaptic inputs) as a function of the total synaptic input rate. Whereas in clock-driven simulations (solid lines) the simulation time was nearly independent of the input rate, a linear scaling with input rate was observed in event-driven simulations (dashed lines; see the linear plot in the inset). The star marks the setting at which the efficacies of clock- and event-driven simulation approaches cross over (see text). (B) Total time needed for simulating 100 s of neural activity (neural dynamics without generation of random synaptic inputs) as a function of the total synaptic input rate. Both simulation strategies yield the same qualitative scaling behavior, with the gIF models only minimally slower than the LIF neuron model. The inset shows a linear plot of the simulation times for the LIF and gIF models. (C) Performance ratios relative to the clock-driven simulation of the LIF neuron model as a function of total input rate (neural dynamics only; see B). Whereas the gIF models were only about three times slower than the LIF neuron models in the investigated input parameter regime, the biophysical model showed about a 600 times performance deficit, and also the clock-driven simulation of the passive membrane equation (PME) remained about 15 times slower than equivalent simulations with the LIF model. (D) Comparison between the performances, normalized to the clock-driven simulation of the LIF neuron model (BM: 592.8; PME: 15.65; IF, clock driven: 1.04 ± 0.31; gIF1 to gIF3, clock driven: 2.82 ± 0.86, 2.70 ± 0.59, 2.95 ± 0.59; gIF1 to gIF3, event driven: 3.06 ± 0.66, 3.05 ± 0.69, 3.01 ± 0.73).
This performance deficit is a direct consequence of the more complicated neuronal dynamics defining the gIF models, which involves the calculation of additional exponentials; these constitute the main computational load in the implementation. Without major impairment of precision, this load can be dramatically reduced by the use of look-up tables as well as by the reduction of memory and function-call overheads through optimized C/C++ programming. Preliminary results using an event-driven implementation of the classic IF model in custom C/C++ software show that 100 million events could be simulated in less than 90 s using a standard PC-based workstation (see appendix A). Of these, about 97.8% (approximately 88 s) account for the generation of random synaptic inputs as described above, and 2.2% (approximately 2 s) for the update of the neural state variable of the IF model. This compares to about 860 s and 70 s for generating random synaptic inputs and updating neural state variables, respectively, in the classical IF neuron model if the same number of events is handled in the NEURON simulation environment, thus indicating a further performance gain of at least one order of magnitude with customized C/C++ implementations. The latter would allow simulating, in real time, medium-scale neural networks of a few thousand neurons with biophysically more realistic conductance-based dynamics and average rates of a few Hz.

5 Discussion

The leaky integrate-and-fire neuron model, whose dynamics is characterized by an instantaneous change of the membrane state variable upon arrival of a synaptic input, followed by a decay with fixed time constant, has proven to be an efficient model suitable for large-scale network simulations. However, due to its simple dynamics, in particular the current-based handling of synaptic inputs, the use of the LIF model for simulating biophysically more realistic neural network behavior is strongly limited.
In this article, we proposed a simple extension of the LIF neuron toward conductance-based dynamics, the gIF model, which is consistent with the impact of synaptic inputs under in vivo–like conditions. Below we summarize and discuss the basic approach we followed here, along with an evaluation of the performance and possible extensions of the proposed gIF models with conductance-based state dynamics.

5.1 The gIF Neuron Models.

With respect to their dynamical complexity, gIF models are situated in between the LIF and full conductance-based IF models, the latter modeled here by a passive membrane equation with fixed spike threshold. Whereas the membrane state still undergoes an instantaneous change upon arrival of a synaptic input, the subsequent decay is governed by a time-dependent membrane time constant whose value is the result of synaptic activity. This state-dependent dynamics captures the primary effect of synaptic conductances on the cellular membrane and
therefore describes a simple implementation of time-varying conductance states, as found in cortical neurons during active network states in vivo. Extensions of this model investigated in this article include the incorporation of the scaling of postsynaptic potentials as a function of the total membrane conductance (gIF2 model), as well as of changes in the driving force due to the presence of synaptic reversal potentials (gIF3 model). As we have shown, in all cases the state-dependent membrane dynamics of the gIF models is described by simple analytic expressions, thus providing the basis for an implementation of these models in exact event-driven simulation strategies. The latter have proven to be an efficient alternative to the most commonly used clock-driven approaches for modeling medium- and large-scale neural networks with biophysically realistic activity.

5.2 Response Characteristics of gIF Neuron Models.

To test the validity of the proposed models, their spiking response behavior was compared with that of LIF neurons, of a model of a passive membrane with exponential synapses and fixed spike threshold, as well as of a biophysically detailed model of cortical neurons with Hodgkin-Huxley spike-generation mechanism and two-state kinetic synapses. Three aspects of neural dynamics were investigated. First, the statistical investigation of the spontaneous discharge activity with respect to firing rate and irregularity for corresponding synaptic input parameters showed that, compared to the LIF models, the gIF1 and gIF3 models much better reproduced the behavior seen in the biophysical model. Second, due to the explicit incorporation of presynaptic-activity-dependent state dynamics, only the gIF1 and gIF3 models showed a temporal resolution of synaptic inputs comparable to that seen in the more realistic biophysical model.
Finally, due to the synaptic input-dependent dynamics of the gIF models, aspects of gain modulation seen in the biophysical model were much better captured by the gIF1 and gIF3 models than by corresponding LIF neuron models. Interestingly, despite its mathematically simpler structure, the response behavior of the gIF3 model came generally closer to that of the biophysical model than did that of the PME model. This marked gain in biophysical realism, faithfully capturing crucial aspects of high-conductance states in vivo, came at the cost of only a minor decrease in computational performance compared to the LIF neuron model. The gIF2 model, however, failed to reproduce, qualitatively and quantitatively, the response behavior seen in the gIF1 and gIF3 models, as well as in the biophysical model. Recalling its definition, the cellular dynamics of this intermediate model captures only the impact of the actual total membrane conductance, or membrane time constant, on the amplitude of PSPs, but not their dependence on the actual membrane state and, hence, on the distance to the corresponding synaptic reversal potentials. The reported observations therefore suggest that both the total membrane conductance and
the membrane-state-dependent scaling of the postsynaptic potentials are crucial for recovering a more realistic conductance-driven neuronal state dynamics.

5.3 Limits of the gIF and LIF Neuron Models.

An overall evaluation of the response characteristics of the gIF models, in particular the gIF3 model, in comparison with those of the LIF neuron models suggests that the former provide a better description of the neural dynamics observed in the detailed biophysical model or in real cells. However, the simple analytic description of these models also strictly limits the dynamic behaviors that can be faithfully reproduced. As our simulations show, the most notable quantitative differences between the gIF3 and biophysical models are found in the spontaneous discharge activity (see Figure 4). Although the general dependence of the output rate νout on the excitatory and inhibitory input rates is reproduced, the firing rates in the gIF3 model cover a much broader regime than in the biophysical model, especially at higher input rates. The primary reason for this difference is the missing description of conductance-based spike-generating mechanisms, which, especially at high firing rates, will provide an additional contribution to the total membrane conductance, modulating the shunting properties of the membrane and shaping the response to the driving synaptic inputs. This conclusion was confirmed by using a purely passive model with a fixed threshold mechanism replacing a more realistic spike generation. Surprisingly, the deviations from the biophysical model were stronger than in the gIF3 model, probably due to the difference in handling the update of the membrane state upon arrival of a synaptic input. Indeed, this observation suggests that the instantaneous update of the membrane state could effectively mimic the fast transient changes of membrane conductance and time constant caused by the activation of spike-generating conductances.
However, despite the excellent match between the gIF3 and biophysical model, the deviations at high input and firing rates resulting from the additional effect of active membrane conductances constitute one crucial limit of the gIF models. On the other hand, the LIF neuron models incorporate neither of these activity-dependent shunting effects and must, in this respect, be considered a less faithful approximation of biophysical neuronal dynamics. The fixed-threshold description of spike generation in the gIF models will also have an impact on the achievable variability of spontaneous responses. Although the gIF3 model came qualitatively closest to reproducing the input dependence of the spontaneous discharge variability (see Figure 5), the CV was in general lower than in the biophysical model, although well within the experimentally observed regime (e.g., see Rudolph & Destexhe, 2003). This indicates that the lack of a faithful description of spike generation might set another crucial limit on the dynamic behaviors that can be captured by the gIF models or any other passive model with fixed spike threshold. Specifically, a variable spike threshold and fast transient changes in the membrane conductance linked to spike generation provide other sources of variability
IF Neuron Models with Conductance-Based Dynamics
2187
in the neuronal discharge in conjunction with the synaptic input drive. Indeed, the upper limit of achievable CV values in the gIF model, as in the PME model, is defined by the synaptic inputs (for independent Poisson inputs, CV ≤ 1), whereas active membrane conductances linked to spike generation can lead to a CV > 1, for example, in bursting cells (Svirskis & Rinzel, 2000). Fusing simple IF neuronal models incorporating biophysically more realistic membrane dynamics (e.g., Lytton & Stewart, in press) with the idea behind the gIF models could provide a way to reproduce the full diversity of irregular discharge behaviors seen experimentally. In the classical LIF model considered here, the limiting factors are the fixed spike threshold and the current-based description of synaptic activity, which were shown to yield generally lower CV values (e.g., Rudolph & Destexhe, 2003). Finally, other quantitative differences between the gIF3 model and the biophysical model were observed, such as the enhanced temporal resolution for input frequencies above 100 Hz (see Figure 7C) or the shallower scaling of the response gain as a function of overall synaptic activity (see Figure 10) in the gIF3 model. These differences can be traced back to the absence of a biophysically realistic spike-generating mechanism, although these can be incorporated (see below). However, the observed differences between the gIF, in particular the gIF3, and biophysical models were less crucial than between the biophysical and the passive model with fixed spike threshold, despite the fact that the latter reproduced realistic PSP shapes and thus should capture temporal aspects of membrane dynamics to a better extent. Much more crucial were the differences observed between the biophysical model and the LIF model, in which both spike-generating mechanisms as well as conductance-based synaptic activity are absent. 
In this case, not just quantitative but also qualitative differences in the investigated response behavior were observed, such as a sensitive scaling of the spontaneous discharge, which is responsible for the narrow regime in which highly irregular activity can be observed (see Figure 5), or the failure to faithfully resolve high-frequency synaptic inputs (see Figure 7). The missing activity-dependent membrane dynamics also results in a modulation of response gain that markedly deviates from that observed in real cells (see Figure 10; e.g., Chance et al., 2002; Fellous, Rudolph, Destexhe, & Sejnowski, 2003; Prescott & De Koninck, 2003; Shu, Hasenstaub, Badoual, Bal, & McCormick, 2003). 5.4 Possible Extensions of gIF Neuron Models. The proposed simple extensions of the LIF model concern only synaptic inputs. Neither active membrane conductances, such as those considered in the spike response model (e.g., Gerstner & van Hemmen, 1992, 1993; Gerstner et al., 1993; Gerstner & Kistler, 2002), nor simplifications of the spike-generation mechanism (e.g., based on Hodgkin-Huxley kinetic models; Hodgkin & Huxley, 1952), such as those considered in the various instances of nonlinear
IF neuron models (e.g., Abbott & van Vreeswijk, 1993; Fourcaud-Trocmé et al., 2003; for a general review, see Gerstner & Kistler, 2002), were incorporated into the gIF models. The gIF models presented here are also different from mathematically more abstract models (e.g., Izhikevich, 2001, 2003) and will not capture intrinsic phenomena like bursting or spike rate adaptation. Moreover, the instantaneous update of the membrane state variable upon arrival of a synaptic input abstracts from a more realistic shape of the membrane potential following synaptic stimulation (for a model that considers this aspect, see, e.g., Kuhn, Aertsen, & Rotter, 2004). Therefore, aspects of neuronal dynamics that depend on either the exact form of PSPs or the presence of intrinsic state-dependent membrane currents cannot be captured by the gIF models proposed here. However, as we show, a great variety of principal spiking response characteristics for both spontaneous and stimulated synaptic activity can be faithfully described by models incorporating a synaptically driven fluctuating membrane time constant, that is, a presynaptic-activity-dependent membrane state dynamics alone. This includes spontaneous activity typically seen in high-conductance states in cortical networks in vivo, as well as the response to transient synaptic stimuli occurring during such states. We suggest that this crucial but restricted gain in realistic biophysical dynamics, capturing characteristic aspects of high-conductance states in vivo, is a fair trade for the simplicity of the considered extensions, which allow efficient application in large-scale network models. Various extensions of the gIF model are currently under investigation (see also appendix C). First, the gIF1 to gIF3 models do not incorporate realistic PSP time courses, but instead are described by an instantaneous update of the membrane state upon arrival of a synaptic input. 
To relax this restriction and approach biophysically more realistic situations, analytic approximations of the full solution of the membrane equation can be used. A first attempt in this direction is presented in section C.1 for the PSPs of a passive leaky membrane subject to synaptic inputs with exponential conductance time course. The obtained approximation (gIF4 model; see equation C.4) describes excitatory and inhibitory postsynaptic potentials to an excellent extent (see Figures 12A to 12C). Moreover, the mathematical form of this approximation is sufficiently simple to allow implementation of the resulting model in the framework of the event-driven simulation strategy. However, for this purpose, the latter needs to be modified by incorporating the prediction of threshold crossings to cope with responses that, due to the PSP time course, now occur temporally separated from the synaptic input (see Figure 12D). Second, the gIF1 to gIF3 models do not incorporate spike-generating mechanisms but instead are based on purely passive membrane dynamics. To relax this restriction, simplified models describing active membrane conductances need to be considered. First attempts in this direction are presented in sections C.2 and C.3, where the quadratic IF (Latham,
Richmond, Nelson, & Nirenberg, 2000; Feng, 2001; Hansel & Mato, 2001; Brunel & Latham, 2003; Fourcaud-Trocmé et al., 2003) and the exponential IF (Fourcaud-Trocmé et al., 2003) neuron models, respectively, are extended within the context of the gIF model approach. Although in all cases considered so far, the defining state equation cannot be solved analytically, appropriate approximations of their solutions might provide a sufficiently exact and mathematically simple description of active membrane dynamics close to spike threshold. With such descriptions, the modified event-driven simulation strategy mentioned above can be utilized to implement the resulting models for efficient and precise network simulations. This approach might also be extendable to linearized versions of active membrane currents described by Hodgkin-Huxley kinetics (e.g., Mauro, Conti, Dodge, & Schor, 1970; Koch, 1999) and yield analytically simple expressions for the membrane state equations, thus allowing the capture of aspects of subthreshold activity of active conductances in a computationally efficient way. Third, the gIF1 to gIF3 models do not incorporate specific intrinsic cellular properties, such as adaptation or bursting. However, the fact that the passive membrane equation B.1 can still be solved analytically in the presence of an exponentially decaying function coupled to a linear function of the membrane state variable allows modeling biophysically more realistic membrane dynamics in an exact and efficient way (e.g., Lytton & Stewart, in press). Fourth, so far the emergence of high-conductance states requires an intense synaptic activity modeled by individual synaptic input channels. Although this is assumed to occur naturally in large-scale neural networks with self-sustained activity, this requirement might not be fulfilled in smaller network models or sparsely connected networks. 
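The event-driven strategy referred to above can be sketched in a few lines: between synaptic events the state decays analytically, so no clock-driven time stepping is needed, and each event triggers an instantaneous state update. The following is a minimal illustrative sketch in normalized units (m = 0 at rest, m = 1 at threshold), not the authors' implementation; all parameter values (time constant, update sizes, refractory period) are assumptions for illustration only.

```python
import math

def run_event_driven(events, tau_m=20e-3, dm_exc=0.0093, dm_inh=-0.0070,
                     m_thres=1.0, t_ref=1e-3):
    """Event-driven simulation of a simple IF-type neuron in normalized
    units. Between synaptic events the state decays exactly (no clock);
    on each event it is updated instantaneously, in the spirit of the
    gIF models. `events` is an iterable of (time_s, is_excitatory)
    pairs. Returns the list of spike times. Values are illustrative."""
    m, t_last, spikes = 0.0, 0.0, []
    for t, is_exc in sorted(events):
        m *= math.exp(-(t - t_last) / tau_m)  # exact inter-event decay
        t_last = t
        if spikes and t < spikes[-1] + t_ref:
            continue  # absolute refractory period: input has no effect
        m += dm_exc if is_exc else dm_inh
        if m >= m_thres:
            spikes.append(t)
            m = 0.0  # reset to rest
    return spikes
```

Because the decay between events is computed in closed form, the cost per synaptic event is constant, which is what makes this strategy attractive for large-scale network simulations.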
The gIF models could be extended by including effective noise sources, as proposed by Reutimann et al. (2003) in the context of event-driven simulation strategies. Finally, the gIF1 to gIF3 models incorporate synaptic dynamics described by an exponential conductance time course. To relax this restriction, neuronal dynamics can be extended to other and more realistic synaptic kinetic models, such as conductance changes following α-functions (Rall, 1967), n-state kinetics (Destexhe et al., 1994, 1998; Destexhe & Rudolph, 2004), or a frequency-dependent dynamics (e.g., Lytton, 1996; Markram et al., 1998; Giugliano, 2000). The availability of exact event times in event-driven simulation approaches also allows the straightforward implementation of spike-timing dependent plastic changes (e.g., according to models in Song & Abbott, 2001; Froemke & Dan, 2002). 5.5 Future Directions. In addition to the different extensions of the gIF models outlined in the previous section, possible future research directions also include a more detailed study of each model with respect to specific dynamic behaviors. So far, only basic response characteristics, such
as the spontaneous discharge (see section 3.1) or the response to periodic synaptic drive (see sections 3.2 and 3.3), were investigated. Other response characteristics, such as the cellular response to oscillating input currents in high-conductance states, in particular the frequency-response behavior (Brunel, Chance, Fourcaud, & Abbott, 2001; Fourcaud-Trocmé et al., 2003) and the instantaneous spike frequency (Rauch, La Camera, Lüscher, Senn, & Fusi, 2003), or the detection of brief transient changes in statistical properties of the synaptic inputs, such as the temporal correlation among the activity of individual input channels (Rudolph & Destexhe, 2001), could provide further insight into the validity and limitations of the proposed simplified neuron models. Furthermore, differences in behaviors seen among these models, in particular between current-based and conductance-based models, will reveal basic requirements that neuronal models have to fulfill in order to faithfully reproduce specific dynamic behaviors. The latter could be used to constrain the complexity of neuronal models used in large-scale network simulations, thus allowing the construction of more efficient network models. Another important aspect linked to the faithful reproduction of specific and biophysically realistic neuronal behaviors using simplified models is the correct choice of model parameters. In this article, the subthreshold dynamics, specifically the characteristics of synaptic inputs as well as
Figure 12: A gIF model with realistic PSP time course. (A) Comparison of excitatory (top) and inhibitory (bottom) postsynaptic potentials in the gIF4 model (black traces; equation C.2) and the LIFcd model (gray traces; numerical solution of equation B.1). Traces are shown for different membrane potentials ranging from the leak reversal E_L to the firing threshold E_thres, as well as for three different total membrane conductances given in multiples of the leak conductance G_L (G_L = 17.18 nS). Model parameters used for the simulations are given in Tables 1 and 2. (B) Relative error (V_LIFcd(t) − V_gIF4(t))/(V_LIFcd(t) − E_thres) between the numerical solution of the PSP time course in the LIFcd model (V_LIFcd(t)) and the gIF4 model (V_gIF4(t)) at firing threshold as a function of time after synaptic input. Results are shown for the three total membrane conductances used in A. In all cases, the error was smaller than 1% for times covering the PSP peak (gray area), suggesting the possibility of precise prediction of spike times based on the simple analytic state equation describing the gIF4 model (equation C.4). (C) Example of a membrane potential time course resulting from a barrage of synaptic inputs. The numerical solution of the LIFcd model (gray) is compared with the gIF4 model (black) for an identical synaptic input pattern (total input rate 100 Hz). (D) Generalization of the event-driven simulation strategy. Upon arrival of a synaptic event at time t, the membrane state (black) is updated from its value at t_0. Threshold crossing is predicted (t_s) and overwritten (t′_s) if another synaptic event arrives at t_1 < t_s.
postsynaptic potentials, of a biophysically detailed model were used to adjust parameters in the simpler LIF and gIF models. Other studies used suprathreshold response characteristics, such as the output firing rate (e.g., Rauch et al., 2003) or static as well as dynamic response properties (Fourcaud-Trocmé et al., 2003), to fit model parameters in conductance-driven and current-driven LIF models. In further investigations, techniques for fitting high-dimensional parameter spaces, typical for detailed biophysical models and experimental recordings, to low-dimensional ones must be evaluated with respect to their applicability and quality. Finally, the most challenging task for the future is to evaluate and understand neuronal dynamics at the network level, in particular under in vivo-like conditions, as well as the emergence of specific functional neuronal behaviors if neuronal models are endowed with self-organizing capabilities, such as plastic synapses. We hope that the simplified models proposed in this article will provide useful and efficient tools to facilitate this task. The evaluation of this method from experimental data and the assessment of its sensitivity at the network level will be the subject of forthcoming studies. Appendix A: Computational Models and Methods In this appendix, we briefly describe the biophysical model (BM; section A.1), the IF model based on the passive membrane equation and fixed threshold for spike generation (PME; section A.2), and the various IF neuron models (cLIF, vLIF, gIF; section A.3), and summarize the parameters used for numerical simulations. A.1 Biophysical Model. In what we refer to as the biophysical model (BM), membrane dynamics was simulated using a single-compartment neuron, described by the active membrane equation,

\frac{dV(t)}{dt} = -\frac{1}{\tau_m^L} \left( V(t) - E_L \right) - \frac{1}{C} I_{act}(t) - \frac{1}{C} I_{syn}(t).
(A.1)
Here, V(t) denotes the membrane potential, E_L = −80 mV the leak reversal potential, τ_m^L = C/G_L the membrane time constant, C = a C_m the membrane capacitance (specific membrane capacitance C_m = 1 µF cm⁻², membrane area a = 38,013 µm²), and G_L = a g_L the passive (leak) conductance (leak conductance density g_L = 0.0452 mS cm⁻²). In equation A.1, I_act(t) denotes the active current responsible for spike generation. Voltage-dependent conductances were described by Hodgkin-Huxley-type models (Hodgkin & Huxley, 1952), with kinetics taken from a model of hippocampal pyramidal cells (Traub & Miles, 1991) and adjusted to match voltage-clamp data of cortical pyramidal cells (Huguenard, Hamill, & Prince, 1988). Models for the sodium current I_Na and the delayed-rectifier
potassium current I_Kd were incorporated into the model, with conductance densities of 36.1 mS cm⁻² and 7 mS cm⁻², respectively. The synaptic input current,

I_{syn}(t) = G_e^{2s}(t) \left( V(t) - E_e \right) + G_i^{2s}(t) \left( V(t) - E_i \right),
(A.2)
was described by a sum over a large number of individual excitatory and inhibitory synaptic conductances,

G_{\{e,i\}}^{2s}(t) = \sum_{n=1}^{N_{\{e,i\}}} G_{\{e,i\}}\, m_{\{e,i\}}^{(n)}(t)    (A.3)
(see also Table 1) with respective reversal potentials E_e = 0 mV and E_i = −75 mV. In the last equation, N_e = 10,000 and N_i = 3000 denote the total numbers of excitatory and inhibitory synapses, modeled by α-amino-3-hydroxy-5-methyl-4-isoxazolepropionic acid (AMPA) and γ-aminobutyric acid (GABA_A) postsynaptic receptors with quantal conductances G_e = 1.2 nS and G_i = 0.6 nS, respectively (Destexhe et al., 1994). The functions m_e^{(n)}(t) and m_i^{(n)}(t) represent the fractions of postsynaptic receptors in the open state at each individual synapse and were described by the following pulse-based two-state kinetic equation,

\frac{dm_{\{e,i\}}(t)}{dt} = \alpha_{\{e,i\}}\, T(t - t_0) \left[ 1 - m_{\{e,i\}}(t) \right] - \beta_{\{e,i\}}\, m_{\{e,i\}}(t),
(A.4)
for t ≥ t0 , where t0 denotes the release times at the synapse in question, α{e,i} and β{e,i} are forward and backward rate constants for opening of the excitatory and inhibitory receptors, respectively. T(t) denotes the concentration of released neurotransmitter in the synaptic cleft at time t and is considered to be a step function with T(t) = Tmax for a short time period t0 ≤ t < tdur after a release and T(t) = 0 afterward (pulse kinetics). The parameters of these kinetic models of synaptic currents were obtained by fitting the model to postsynaptic currents recorded experimentally (Destexhe et al., 1998), and are given in Table 1. To simulate synaptic background activity, all synapses were activated randomly according to independent (temporally uncorrelated) Poisson processes with mean rates νe and νi between 0 and 10 Hz for both AMPA and GABA synapses, respectively. This yielded total input rates from νe = 0 to 100 kHz for excitation and from νi = 0 to 30 kHz for inhibition.
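The background activity described above, many independent Poisson synapses, can be generated compactly by exploiting the fact that the superposition of independent Poisson processes is itself a Poisson process at the summed rate. The following is an illustrative sketch, not the authors' code; all names are ours.

```python
import random

def poisson_spike_train(total_rate_hz, duration_s, seed=42):
    """Merged synaptic background: the superposition of N independent
    Poisson processes at rate nu each is itself Poisson at rate N * nu,
    so a single train at the total rate represents the whole population.
    Inter-event intervals are drawn from an exponential distribution."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(total_rate_hz)
        if t >= duration_s:
            return times
        times.append(t)

# 10,000 excitatory synapses at 10 Hz each -> 100 kHz total, as in the text
exc_events = poisson_spike_train(10_000 * 10.0, duration_s=0.1)
```

For 0.1 s at 100 kHz this yields on the order of 10,000 events; drawing them as a single merged train is what makes the total rates quoted in the text tractable in the simplified IF models.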
In some simulations, in particular for evaluating the state dependence of the shape and amplitude of excitatory and inhibitory postsynaptic potentials, synaptic models with exponential conductance time course,

G_{\{e,i\}}^{exp}(t) = \begin{cases} 0 & \text{for } t < t_0 \\ G_{\{e,i\}}\, \exp\left[ -\frac{t - t_0}{\tau_{\{e,i\}}} \right] & \text{for } t \geq t_0, \end{cases}    (A.5)

or α-kinetics,

G_{\{e,i\}}^{\alpha}(t) = \begin{cases} 0 & \text{for } t < t_0 \\ G_{\{e,i\}}\, \frac{t - t_0}{\tau_{\{e,i\}}} \exp\left[ -\frac{t - t_0}{\tau_{\{e,i\}}} \right] & \text{for } t \geq t_0, \end{cases}    (A.6)
were used. Here, G_{e,i} and τ_{e,i} denote the quantal conductances and synaptic time constants for excitation and inhibition, respectively. Corresponding parameter values are given in Table 1. Simulations of 100 to 1,000 s of neural activity with a temporal resolution of 0.1 ms were performed using the NEURON simulation environment (Hines & Carnevale, 1997), running on a 3 GHz Dell Precision 350 workstation (Dell Computer Corporation, Round Rock, TX) under the SUSE 8.1 LINUX operating system. A.2 Passive Membrane Equation with Fixed Spike Threshold. In what we refer to as the passive membrane equation with fixed spike threshold (PME), membrane dynamics was simulated using a single-compartment neuron, described by the passive membrane equation,

\frac{dV(t)}{dt} = -\frac{1}{\tau_m^L} \left( V(t) - E_L \right) - \frac{1}{C} I_{syn}(t),
(A.7)
with parameter values and synaptic activity described in section A.1. Spike generation was described by a fixed voltage threshold (E_thres = −50 mV) and reset potential (E_rest = −80 mV). Simulations of 1000 s of neural activity with a temporal resolution of 0.01 ms were performed using the NEURON simulation environment (Hines & Carnevale, 1997), running on a 3 GHz Dell Precision 350 workstation (Dell Computer Corporation, Round Rock, TX) under the SUSE 8.1 LINUX operating system. A.3 IF Neuron Models. Integrate-and-fire (IF) neuron models were implemented according to the state equations given in section 2.1 for the classical and very leaky integrate-and-fire neuron models (cLIF and vLIF, respectively), as well as sections 2.3, 2.4, and 2.5 for the gIF1, gIF2, and gIF3
IF Neuron Models with Conductance-Based Dynamics
2195
neuron models, respectively. Membrane properties as well as parameter values for excitatory and inhibitory synaptic inputs (see Table 2) were chosen by a normalization of the IF models between the resting state (m_rest = 0, corresponding to V_m = E_L = −80 mV) and the firing threshold (m_thres = 1, corresponding to V_m = −50 mV). Average EPSP and IPSP peak amplitudes were estimated by numerical simulations of the passive membrane equation with synaptic models described by two-state kinetics (see section A.1), and were found to be 0.28 mV for EPSPs and −0.21 mV for IPSPs at firing threshold (used for estimating m in the cLIF, vLIF, gIF1, and gIF2 models; see sections 2.1, 2.3, and 2.4, respectively), as well as 0.46 mV (EPSPs) and 0.043 mV (IPSPs) at rest (used for estimating m in the gIF3 model; see section 2.5). Synaptic inputs in the gIF models were simulated by exponential conductance changes (see equation A.5) with parameter values given in Table 2. Due to the additivity of synaptic inputs in this case, in all IF models, synaptic inputs were simulated by single independent input channels for excitation and inhibition, with total rates of 0 ≤ νe ≤ 100 kHz and 0 ≤ νi ≤ 30 kHz, respectively. These rates correspond to those used in the biophysical model (see section A.1). In conjunction with the cLIF neuron model, an additional model with a small static membrane time constant (the vLIF model) was considered to mimic high-conductance states with a static synaptic conductance. In this case, synaptic rates were 0 ≤ νe ≤ 20 kHz and 0 ≤ νi ≤ 6 kHz, respectively. The refractory period in all cases was tref = 1 ms. To evaluate the computational performance of the constructed models, simulations utilizing the clock-driven simulation strategy (time resolution 0.01 ms) as well as the event-driven simulation strategy (Watts, 1994; Mattia & Del Giudice, 2000; Reutimann et al., 2003) were used. In all cases, simulations of 1000 s of neural activity were performed. 
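The normalization just described can be made concrete: with m = 0 at E_L = −80 mV and m = 1 at the −50 mV threshold, the quoted PSP peak amplitudes translate directly into fixed state updates. A small illustrative sketch (variable names are ours, not the authors'):

```python
E_L, E_thres = -80.0, -50.0  # mV: rest (m = 0) and firing threshold (m = 1)

def to_state(v_mv):
    """Map a membrane potential (in mV) to the normalized state variable m."""
    return (v_mv - E_L) / (E_thres - E_L)

# State updates corresponding to the PSP peak amplitudes quoted in the text
# (0.28 mV EPSPs and -0.21 mV IPSPs at firing threshold):
dm_epsp = 0.28 / (E_thres - E_L)
dm_ipsp = -0.21 / (E_thres - E_L)
```

With a 30 mV span between rest and threshold, a 0.28 mV EPSP corresponds to a state update of roughly 0.009, which gives a feel for how many near-coincident inputs are needed to reach threshold.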
Appendix B: The Membrane Equation with Exponential Synaptic Conductance Time Course In section B.1, we briefly outline the explicit solution of the membrane equation,

\frac{dV(t)}{dt} = -\frac{1}{\tau_m^L} \left( V(t) - E_L \right) - \frac{1}{C} G_s^{exp}(t) \left( V(t) - E_s \right),
(B.1)
for a single synaptic input event arriving at time t_0 and described by an exponential conductance time course:

G_s^{exp}(t) = \begin{cases} 0 & \text{for } t < t_0 \\ G\, e^{-(t - t_0)/\tau_s} & \text{for } t \geq t_0. \end{cases}
(B.2)
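Equations B.1 and B.2 are easy to check numerically before turning to the closed-form solution. The following is a minimal forward-Euler sketch of the PSP following a single input at t0 = 0; all parameter values are illustrative, not those of Tables 1 and 2.

```python
import math

def psp_trace(tau_mL=20e-3, tau_s=2e-3, tau_ms=5e-3,
              E_L=-80e-3, E_s=0.0, dt=1e-6, t_end=30e-3):
    """Forward-Euler integration of equations B.1/B.2 for a single
    excitatory input arriving at t0 = 0, starting from rest (V = E_L).
    tau_ms = C/G encodes the synaptic peak conductance. Returns
    (V at t_end, peak V). Parameter values are illustrative."""
    v, peak = E_L, E_L
    for i in range(int(t_end / dt)):
        g_over_c = math.exp(-i * dt / tau_s) / tau_ms  # G_s^exp(t) / C
        v += dt * (-(v - E_L) / tau_mL - g_over_c * (v - E_s))
        peak = max(peak, v)
    return v, peak
```

The trace rises to a peak a few synaptic time constants after the input and then relaxes back toward E_L, never exceeding E_s, which is the qualitative PSP shape that the closed-form solution below reproduces exactly.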
In equation B.1, E_L and E_s denote the leak and synaptic reversal potentials, respectively. G in equation B.2 denotes the maximal conductance, linked to the update of the membrane time constant τ_m at time t_0 by τ_m^s = C/G (see equations 2.6 and 2.7). In sections B.2 and B.3, this solution is then approximated, and simple analytic expressions for the PSP peak amplitude as a function of the actual membrane state, incorporating the effect of the synaptic reversal potential, are deduced. These expressions constitute the basis for describing the membrane updates upon arrival of a synaptic input in the gIF2 and gIF3 models and are applicable in event-driven simulation strategies due to their analytic form. B.1 Solution of the Membrane Equation. To simplify notation but without loss of generality, we solve equations B.1 and B.2 for t_0 = 0. With the boundary condition V(t)|_{t→−∞} → E_L, equation B.1 yields V(t) = E_L for t < 0. For t ≥ 0, explicit integration gives
V(t) = \exp\left[ -\frac{t}{\tau_m^L} + \frac{\tau_s}{\tau_m^s}\, e^{-t/\tau_s} \right] \left\{ E_L\, e^{-\tau_s/\tau_m^s} + \int_0^t ds\; \exp\left[ -\frac{s}{\tau_s} + \frac{s}{\tau_m^L} - \frac{\tau_s}{\tau_m^s}\, e^{-s/\tau_s} \right] \left( \frac{E_s}{\tau_m^s} + \frac{E_L}{\tau_m^L}\, e^{s/\tau_s} \right) \right\}.    (B.3)
The integral expression, which is of the general form

X(t) := \int_0^t ds\; \exp\left[ A_1 e^{-s/\tau_s} \right] \left( A_2 e^{A_3 s} + A_4 e^{A_5 s} \right)

with

A_1 = -\frac{\tau_s}{\tau_m^s}, \quad A_2 = \frac{E_s}{\tau_m^s}, \quad A_3 = -\frac{1}{\tau_s} + \frac{1}{\tau_m^L}, \quad A_4 = \frac{E_L}{\tau_m^L}, \quad A_5 = \frac{1}{\tau_m^L},
can be rewritten in terms of gamma functions. To that end, we expand the factor \exp[A_1 e^{-s/\tau_s}]:

X(t) = \sum_{n=0}^{\infty} \frac{A_1^n}{n!} \int_0^t ds\; e^{-ns/\tau_s} \left( A_2 e^{A_3 s} + A_4 e^{A_5 s} \right)

= \sum_{n=0}^{\infty} \frac{A_1^n}{n!} \int_0^t ds \left( A_2\, e^{-\frac{n - A_3\tau_s}{\tau_s} s} + A_4\, e^{-\frac{n - A_5\tau_s}{\tau_s} s} \right)

= \sum_{n=0}^{\infty} \frac{A_1^n}{n!} \left[ \frac{A_2 \tau_s}{A_3\tau_s - n} \left( e^{-\frac{n - A_3\tau_s}{\tau_s} t} - 1 \right) + \frac{A_4 \tau_s}{A_5\tau_s - n} \left( e^{-\frac{n - A_5\tau_s}{\tau_s} t} - 1 \right) \right]

= A_2 \tau_s (-A_1)^{A_3\tau_s} \left( \Gamma\left[ -A_3\tau_s, -A_1 e^{-t/\tau_s} \right] - \Gamma\left[ -A_3\tau_s, -A_1 \right] \right) + A_4 \tau_s (-A_1)^{A_5\tau_s} \left( \Gamma\left[ -A_5\tau_s, -A_1 e^{-t/\tau_s} \right] - \Gamma\left[ -A_5\tau_s, -A_1 \right] \right),
where the incomplete gamma function \Gamma[z, a] = \int_a^{\infty} dt\; t^{z-1} e^{-t} was used. With z\,\Gamma[-z, a] = a^{-z} e^{-a} - \Gamma[-z+1, a] and the fact that A_3\tau_s = A_5\tau_s - 1, we obtain after insertion of X(t) into equation B.3,

V(t) = \exp\left[ -\frac{t}{\tau_m^L} + \frac{\tau_s}{\tau_m^s} e^{-t/\tau_s} \right] \left\{ (E_L - E_s)\, e^{-\tau_s/\tau_m^s} + E_s \exp\left[ \frac{t}{\tau_m^L} - \frac{\tau_s}{\tau_m^s} e^{-t/\tau_s} \right] + \left( \Gamma\left[ -\frac{\tau_s}{\tau_m^L}, \frac{\tau_s}{\tau_m^s} \right] - \Gamma\left[ -\frac{\tau_s}{\tau_m^L}, \frac{\tau_s}{\tau_m^s} e^{-t/\tau_s} \right] \right) \left( \frac{\tau_s}{\tau_m^s} \right)^{\tau_s/\tau_m^L} \frac{\tau_s}{\tau_m^L} \left( E_s - E_L \right) \right\}.    (B.4)
The latter can be further simplified by noting that

\Gamma\left[ -\frac{\tau_s}{\tau_m^L}, \frac{\tau_s}{\tau_m^s} \right] - \Gamma\left[ -\frac{\tau_s}{\tau_m^L}, \frac{\tau_s}{\tau_m^s} e^{-t/\tau_s} \right] = -\Gamma\left[ -\frac{\tau_s}{\tau_m^L}, \frac{\tau_s}{\tau_m^s} e^{-t/\tau_s}, \frac{\tau_s}{\tau_m^s} \right],

where \Gamma[z, a, b] = \int_a^b dt\; t^{z-1} e^{-t} denotes the generalized incomplete gamma function. We obtain

V(t) = \exp\left[ -\frac{t}{\tau_m^L} + \frac{\tau_s}{\tau_m^s} e^{-t/\tau_s} \right] \left\{ (E_L - E_s)\, e^{-\tau_s/\tau_m^s} + E_s \exp\left[ \frac{t}{\tau_m^L} - \frac{\tau_s}{\tau_m^s} e^{-t/\tau_s} \right] - \Gamma\left[ -\frac{\tau_s}{\tau_m^L}, \frac{\tau_s}{\tau_m^s} e^{-t/\tau_s}, \frac{\tau_s}{\tau_m^s} \right] \left( \frac{\tau_s}{\tau_m^s} \right)^{\tau_s/\tau_m^L} \frac{\tau_s}{\tau_m^L} \left( E_s - E_L \right) \right\}    (B.5)
as the general form of the membrane potential (i.e., postsynaptic potential) time course following a single synaptic stimulation described by an exponential conductance time course.
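The series expansion underlying the gamma-function representation above can be checked numerically by comparing the truncated, term-by-term integrated series for X(t) against direct quadrature of its defining integral. This is an illustrative sketch with made-up parameter values, not the authors' code.

```python
import math

# Illustrative parameters (not the values of Tables 1 and 2)
tau_s, tau_mL, tau_ms = 2e-3, 20e-3, 5e-3
E_L, E_s = -80e-3, -75e-3
A1 = -tau_s / tau_ms
A2, A3 = E_s / tau_ms, -1.0 / tau_s + 1.0 / tau_mL
A4, A5 = E_L / tau_mL, 1.0 / tau_mL

def X_quadrature(t, n=100_000):
    """Direct trapezoidal quadrature of the integral defining X(t)."""
    ds = t / n
    f = lambda s: math.exp(A1 * math.exp(-s / tau_s)) * (
        A2 * math.exp(A3 * s) + A4 * math.exp(A5 * s))
    return ds * (0.5 * f(0.0) + sum(f(k * ds) for k in range(1, n))
                 + 0.5 * f(t))

def X_series(t, n_max=30):
    """Term-by-term integrated power series of exp[A1 e^(-s/tau_s)]."""
    total = 0.0
    for n in range(n_max):
        c = A1**n / math.factorial(n)
        total += c * (A2 * tau_s / (A3 * tau_s - n)
                      * (math.exp(-(n - A3 * tau_s) * t / tau_s) - 1.0)
                      + A4 * tau_s / (A5 * tau_s - n)
                      * (math.exp(-(n - A5 * tau_s) * t / tau_s) - 1.0))
    return total
```

Because |A1| < 1 for a single quantal input, the series converges rapidly, and a few dozen terms already agree with the quadrature to near machine precision.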
B.2 Analytic Approximation of the PSP Peak Height. Equation B.5 provides a rather complicated expression for the PSP time course and is therefore unsuitable for directly deducing the simple equation for the PSP peak amplitude we are looking for. However, guided by the general shape of the PSP, we will approximate equation B.5 using an α-function,

V_\alpha(t) = V_\alpha^0\, \frac{t}{\tau_\alpha}\, e^{-t/\tau_\alpha} + E_L,
(B.6)
where V_\alpha^0 denotes the maximum and \tau_\alpha the time constant of the α-function. The desired peak amplitude is then given by V_{max} = V_\alpha^0 e^{-1}. In order to approximate V(t), equation B.5, with V_\alpha(t), we power-expand the difference V(t) - V_\alpha(t) up to second order in t at t = 0. This yields the equations

0 = (E_L - E_s)\, \tau_\alpha + \tau_m^s\, V_\alpha^0,
0 = (E_L - E_s)\, \tau_\alpha^2 \left( \tau_m^s \tau_s + \tau_m^L (\tau_m^s + \tau_s) \right) + 2 \tau_m^L (\tau_m^s)^2 \tau_s\, V_\alpha^0,

from which the parameters V_\alpha^0 and \tau_\alpha can be deduced. We obtain

V_\alpha^0 = \frac{2 (E_s - E_L)\, \tau_m^L \tau_s}{\tau_m^s \tau_s + \tau_m^L (\tau_m^s + \tau_s)},
(B.7)
which yields for the PSP peak amplitude

V_{max} = \frac{2 e^{-1} (E_s - E_L)\, \tau_m^L \tau_s}{\tau_m^s \tau_s + \tau_m^L (\tau_m^s + \tau_s)}.
(B.8)
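Equation B.8 is straightforward to evaluate; a small sketch (all parameter values are illustrative, not those of Tables 1 and 2):

```python
import math

def psp_peak(tau_mL, tau_s, tau_ms, E_L, E_s):
    """PSP peak amplitude according to equation B.8 (alpha-function
    approximation). tau_ms = C/G is the membrane time constant that the
    synaptic peak conductance alone would give; example values below
    are illustrative."""
    return (2.0 * math.exp(-1.0) * (E_s - E_L) * tau_mL * tau_s
            / (tau_ms * tau_s + tau_mL * (tau_ms + tau_s)))

# A stronger synapse (larger G, hence smaller tau_ms = C/G) yields a
# larger peak:
weak = psp_peak(20e-3, 2e-3, 50e-3, -80e-3, 0.0)    # ~2.1 mV
strong = psp_peak(20e-3, 2e-3, 5e-3, -80e-3, 0.0)   # ~15.7 mV
```

Note the saturation built into the formula: as tau_ms → 0 (very strong conductance), the peak approaches 2e⁻¹(E_s − E_L) rather than growing without bound.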
In equation B.8, \tau_m^s denotes the change in the membrane time constant at arrival of a synaptic input, and \tau_m^L is the membrane time constant before the arrival of the synaptic input, which, due to the chosen baseline (membrane at rest for t < 0), equals the passive (leak) membrane time constant. However, in the general situation, \tau_m^L has to be replaced by the total membrane time constant \tau_m(t) at the time t_0 of the arrival of a new synaptic event, which contains leak as well as synaptic contributions. To optimize the applicability of equation B.8 in event-driven simulation strategies, we consider the relative change of the PSP peak amplitude as a function of the actual state with respect to a control state. Let \tilde{V}_{max} denote the PSP peak amplitude in the control state (e.g., the resting state), characterized by the total membrane time constant \tilde{\tau}_m. Then equation B.8 yields

V_{max}(t) = \tilde{V}_{max} \left( \frac{1}{\tilde{\tau}_m} + \frac{1}{\tau_s} + \frac{1}{\tau_m^s} \right) \left( \frac{1}{\tau_m(t)} + \frac{1}{\tau_s} + \frac{1}{\tau_m^s} \right)^{-1}    (B.9)
for the PSP peak amplitude following a synaptic input in the actual state at time t, characterized by a total membrane time constant \tau_m(t). B.3 Incorporation of the Synaptic Reversal Potential. Equation B.5 shows that the full solution for the PSP depends not only on the average membrane conductance (see the previous section), but also on the actual membrane state V(t) and, hence, the distance to the corresponding synaptic reversal potential E_s. Moreover, in the presence of synaptic reversal potentials, for example, for excitation and inhibition, the membrane will have an effective reversal state to which it decays exponentially with a time constant associated with the total membrane conductance. This effective reversal state is determined by the synaptic conductance contributions and the leak conductance, as well as the "distance" of the current membrane state to the respective reversal potentials of the conductances and the leak. To incorporate both effects in a simple approximation of the PSP peak amplitude, thereby extending the result in equation B.9, we start from the membrane equation B.1. In the case of a single synaptic input, it can be rewritten in the form

\frac{dV(t)}{dt} + \frac{1}{\tau_m(t)} \left( V(t) - V_{rest}(t) \right) = 0,
(B.10)
where

V_{rest}(t) = \left( \frac{E_L}{\tau_m^L} + \frac{E_s}{\tau_m^s(t)} \right) \left( \frac{1}{\tau_m^L} + \frac{1}{\tau_m^s(t)} \right)^{-1},    (B.11)

\frac{1}{\tau_m(t)} = \frac{1}{\tau_m^L} + \frac{1}{\tau_m^s(t)},    (B.12)

\frac{1}{\tau_m^s(t)} = \frac{1}{\tau_m^s(t_0)}\, e^{-(t - t_0)/\tau_s} \quad \text{for } t \geq t_0.    (B.13)
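Equation B.11 is simply a conductance-weighted average of the reversal potentials (since 1/τ is proportional to conductance); a one-line sketch with illustrative values:

```python
def effective_rest(E_L, tau_mL, E_s, tau_ms):
    """Effective reversal state V_rest of equation B.11: because 1/tau is
    proportional to conductance, this is the conductance-weighted average
    of the leak and synaptic reversal potentials. Values illustrative."""
    return (E_L / tau_mL + E_s / tau_ms) / (1.0 / tau_mL + 1.0 / tau_ms)

# Equal leak and synaptic conductances pull the membrane to the midpoint:
v_rest = effective_rest(-80.0, 20e-3, 0.0, 20e-3)  # -40.0 mV
```

As the synaptic conductance grows (smaller tau_ms), the effective reversal state moves toward E_s, which is the state dependence the gIF2 and gIF3 models are built to capture.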
Due to the time dependence of V_{rest}(t), equation B.10 does not provide a closed-form solution (see equation B.5) simple enough to build a basis for an IF neuron model usable within an event-based simulation approach. In order to obtain such a form, we use the fact that a single synaptic input makes only a minimal contribution to the total membrane time constant. Therefore, we can truncate the explicit time dependence of V_{rest}(t) by replacing the synaptic contribution to the membrane time constant, \tau_m^s(t), with its value at time t_0, that is, at the arrival of the synaptic event. Equation B.11 then yields

V_{rest}(t) \simeq V_{rest} = \left( \frac{E_L}{\tau_m^L} + \frac{E_s}{\tau_m^s(t_0)} \right) \left( \frac{1}{\tau_m^L} + \frac{1}{\tau_m^s(t_0)} \right)^{-1}.
(B.14)
Note that V_{rest} will still indirectly depend on time, because \tau_m^s(t_0) is updated whenever a synaptic event arrives. To make the impact of synaptic reversal potentials on the PSP peak amplitude explicit, we extend the membrane equation B.1 by a constant stimulating current I_0:

\frac{dV(t)}{dt} = -\frac{1}{\tau_m^L} \left( V(t) - E_L \right) - \frac{1}{C} G_s(t) \left( V(t) - E_s \right) + \frac{1}{C} I_0.
(B.15)
Different values of I_0 will hold the membrane potential at different base values, from which we will deduce the desired dependence of the PSP peak height. Defining E_L' = E_L + \frac{\tau_m^L}{C} I_0 as the effective leak reversal in the presence of a constant current, equation B.15 can be brought into the form

\frac{dV(t)}{dt} = -\frac{1}{\tau_m^L} \left( V(t) - E_L' \right) - \frac{1}{C} G_s(t) \left( V(t) - E_s \right),
(B.16)
which has the same functional form as the original membrane equation B.1. Therefore, and because I_0 is assumed to be constant, we can proceed along the lines outlined in sections B.1 and B.2 and obtain for the PSP peak amplitude

V_{max} = \frac{2 e^{-1} (E_s - E_L')\, \tau_m^L \tau_s}{\tau_m^s \tau_s + \tau_m^L (\tau_m^s + \tau_s)}.
(B.17)
Note that E_L' denotes the base potential over which the PSP arises and therefore corresponds to the actual state of the membrane at the time of the synaptic input. Finally, we deduce the relative change of the PSP peak amplitude with respect to a control state as a function of the actual state of the membrane. Let the actual state be characterized by the membrane time constant \tau_m(t) and the membrane state variable V(t). Let \tilde{V}_{max} denote the PSP peak amplitude in the control state (e.g., the resting state), characterized by the time constant \tilde{\tau}_m and state variable \tilde{V}. Then we obtain for the PSP peak amplitude in the actual state

V_{max}(t) = \tilde{V}_{max}\, \frac{V(t) - E_s}{\tilde{V} - E_s} \left( \frac{1}{\tilde{\tau}_m} + \frac{1}{\tau_s} + \frac{1}{\tau_m^s} \right) \left( \frac{1}{\tau_m(t)} + \frac{1}{\tau_s} + \frac{1}{\tau_m^s} \right)^{-1}.    (B.18)

The first factor, (V(t) - E_s)/(\tilde{V} - E_s), is responsible for the effect of the distance of the actual membrane state to the reversal potential, whereas the
IF Neuron Models with Conductance-Based Dynamics
second term describes the scaling of the PSP peak amplitude as a function of the membrane time constant, in accordance with equation B.9.

Appendix C: Extension to Nonlinear IF Neuron Models

In this appendix, we briefly present first attempts to generalize the idea behind the gIF models to neuronal models that contain, in addition to synaptic conductances, state-dependent currents due to active membrane conductances, used, for example, to describe spike generation. We begin by analytically approximating the time course of a passive membrane after synaptic input in order to incorporate a more realistic PSP shape into the model (see section C.1). State equations for the quadratic integrate-and-fire (QIF, section C.2) and exponential integrate-and-fire (EIF, section C.3) neuronal models with conductance-based synaptic interactions will then be presented. A complete description of this work, as well as a detailed evaluation of the behavior of these models, will be presented in a forthcoming study.

C.1 gIF Models with Realistic PSP Time Course. In appendix B.1, we simplified the exact solution (see equation B.3) of the passive membrane equation with exponential synaptic conductance time course (see equation B.1) in order to obtain an analytic expression for the PSP peak amplitude as a function of the actual membrane state and total membrane time constant (see sections B.2 and B.3). The advantage of this approach was that the resulting simple expression for the PSP peak amplitude could be used in the gIF neuron models as an instantaneous update of the membrane state upon the arrival of a synaptic input. However, the temporal shapes of EPSPs and IPSPs are neglected, which might have a subtle impact on the cellular response characteristics, in particular their temporal aspects. Here, we analytically approximate the full solution and provide a simple expression for the PSP shape that can be used in event-driven simulation strategies. Equation B.3 can be rewritten as

$$ V(t) = e^{Q_1(t)} \left[ E_L\, e^{-\tau_s/\tau_m^s} + \int_0^t ds\; e^{Q_2(s)} \left( \frac{E_s}{\tau_m^s} + \frac{E_L}{\tau_m^L}\, e^{s/\tau_s} \right) \right] \tag{C.1} $$
with

$$ Q_1(t) = -\frac{t}{\tau_m^L} + \frac{\tau_s}{\tau_m^s}\, e^{-t/\tau_s}, \qquad Q_2(t) = \frac{t}{\tau_m^L} - \frac{t}{\tau_s} - \frac{\tau_s}{\tau_m^s}\, e^{-t/\tau_s}. $$
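The exact expression C.1 can be checked numerically. The sketch below (illustrative, not the authors' code) evaluates the integral by simple trapezoidal quadrature and recovers the boundary behavior $V(t) \to E_L$ for $t \to 0$ and for large $t$:

```python
import math

def Q1(t, tau_m_L, tau_m_s, tau_s):
    return -t / tau_m_L + (tau_s / tau_m_s) * math.exp(-t / tau_s)

def Q2(t, tau_m_L, tau_m_s, tau_s):
    return t / tau_m_L - t / tau_s - (tau_s / tau_m_s) * math.exp(-t / tau_s)

def v_exact(t, E_L, E_s, tau_m_L, tau_m_s, tau_s, n=4000):
    """Equation C.1 evaluated by trapezoidal quadrature (sketch only)."""
    def f(s):
        return math.exp(Q2(s, tau_m_L, tau_m_s, tau_s)) * (
            E_s / tau_m_s + (E_L / tau_m_L) * math.exp(s / tau_s))
    h = t / n
    integral = 0.5 * h * (f(0.0) + f(t)) + h * sum(f(i * h) for i in range(1, n))
    return math.exp(Q1(t, tau_m_L, tau_m_s, tau_s)) * (
        E_L * math.exp(-tau_s / tau_m_s) + integral)
```

At $t = 0$ the integral vanishes and the prefactors cancel exactly, giving $V(0) = E_L$; for large $t$ the synaptic conductance has decayed and the membrane relaxes back to $E_L$.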
Power-expanding $Q_1(t)$ and $Q_2(t)$ up to first order at $t = 0$ and using the boundary conditions $V(t) \to E_L$ for $t \to 0$ and $t \to \infty$, equation C.1 can be approximated by

$$ V_{\mathrm{PSP}}(t) = E_L + \frac{E_s}{\tau_m^s} \left( \frac{1}{\tau_s} - \frac{1}{\tau_m^L} - \frac{1}{\tau_m^s} \right)^{-1} \left[ e^{-t \left( \frac{1}{\tau_m^L} + \frac{1}{\tau_m^s} \right)} - e^{-t/\tau_s} \right]. \tag{C.2} $$
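A sketch of equation C.2 (illustrative parameter names, not the authors' code); note the two time scales, the fast synaptic decay $\tau_s$ and the slower effective membrane decay $1/\tau_m^L + 1/\tau_m^s$:

```python
import math

def v_psp(t, E_L, E_s, tau_m_L, tau_m_s, tau_s):
    """Approximate PSP time course of equation C.2, input arriving at t = 0."""
    rate_eff = 1.0 / tau_m_L + 1.0 / tau_m_s          # effective membrane rate
    pref = (E_s / tau_m_s) / (1.0 / tau_s - rate_eff)
    return E_L + pref * (math.exp(-t * rate_eff) - math.exp(-t / tau_s))
```

The bracket vanishes at $t = 0$ and for $t \to \infty$, so the approximation respects the boundary conditions used in the expansion.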
We note that this general form of the PSP time course as a difference of two exponentials was already suggested in the context of the spike response model (Jolivet & Gerstner, 2004) and quadratic integrate-and-fire models (Latham et al., 2000). If the synaptic input arrives at time $t_0$ on top of a barrage of synaptic inputs, equation C.2 can be generalized to provide the desired expression of the PSP time course as a function of the actual membrane state $V(t_0)$ and membrane time constant $\tau_m(t_0)$. Following the argumentation presented in section B.3, we obtain for $t \geq t_0$

$$ V_{\mathrm{PSP}}(t) = E_L + \bigl( V(t_0) - E_L \bigr)\, e^{-(t-t_0) \left( \frac{1}{\tau_m(t_0)} + \frac{1}{\tau_m^s} \right)} + Q\, E_s \left( \frac{1}{\tau_m^s(t_0)} + \frac{1}{\tau_m^s} \right) \left( \frac{1}{\tau_s} - \frac{1}{\tau_m(t_0)} - \frac{1}{\tau_m^s} \right)^{-1} \left[ e^{-(t-t_0) \left( \frac{1}{\tau_m(t_0)} + \frac{1}{\tau_m^s} \right)} - e^{-(t-t_0)/\tau_s} \right], \tag{C.3} $$
where $\tau_m^s(t_0)$ denotes the synaptic contribution to the membrane time constant at time $t_0$, and $\tau_m^s$ the update of $\tau_m^s(t_0)$ due to the synaptic input (see equation 2.6). $Q$ is a PSP scaling factor that was introduced in generalizing the approximation of equation C.1 to arbitrary effective reversal states. It depends primarily on the distance of the actual membrane state from the synaptic reversal potential, and a good approximation of $Q$ is given by the first factor in equation B.18, thus yielding

$$ V_{\mathrm{PSP}}(t) = E_L + \bigl( V(t_0) - E_L \bigr)\, e^{-(t-t_0) \left( \frac{1}{\tau_m(t_0)} + \frac{1}{\tau_m^s} \right)} + E_s\, \frac{V(t_0) - E_s}{E_L - E_s} \left( \frac{1}{\tau_m^s(t_0)} + \frac{1}{\tau_m^s} \right) \left( \frac{1}{\tau_s} - \frac{1}{\tau_m(t_0)} - \frac{1}{\tau_m^s} \right)^{-1} \left[ e^{-(t-t_0) \left( \frac{1}{\tau_m(t_0)} + \frac{1}{\tau_m^s} \right)} - e^{-(t-t_0)/\tau_s} \right]. \tag{C.4} $$

This defines, along with equations 2.6, 2.7, and 2.8 for updating the membrane time constant upon arrival of a synaptic input, the basis of a gIF4 model. In order to evaluate the validity of the gIF4 model, we compared the EPSP and IPSP time course obtained from the analytic approximation,
equation C.4, with the numerical solution of the underlying passive membrane equation B.1 (LIFcd model). For both excitatory and inhibitory synaptic inputs, the gIF4 model described the postsynaptic membrane potential time course remarkably well (see Figure 12A) for all membrane potentials ranging from the leak reversal E_L up to the firing threshold E_thres. For the realistic synaptic and cellular parameter values used here, the deviation, or relative error, was smaller for leakier and, hence, faster membranes (see Figure 12B) and did not exceed 1% for times that covered the PSP peak (see Figure 12B, gray area). The latter suggests that an exact and computationally fast prediction of spike times based on the analytic form of the approximated EPSP time course given by equation C.4 should be possible. However, although excellent agreement between the gIF4 model and the LIFcd model was reached for small or medium synaptic input rates (see Figure 12C), total input rates above 1 kHz led to deviations from the numerical simulation. The reason for this can be found in the approximate character of the PSP time course, in particular the PSP peak amplitude scaling (factor Q in equation C.3). This error could be reduced by fine-tuning the quantal conductance of the synaptic input or the scaling factor Q. To what extent the observed deviations at high input rates offset the simplicity and, thus, computational efficiency of the obtained analytic approximation remains to be investigated. In contrast to the instantaneous update of the membrane state characteristic of the integrate-and-fire neuron models presented in section 2, the incorporation of a realistic PSP shape in the gIF4 model (equation C.4) no longer allows the application of the basic event-driven simulation strategy presented in section 4.1.
Indeed, in models with an instantaneous rise of the membrane potential, the firing threshold can be crossed only at the time of a synaptic input, whereas in models with an explicit EPSP time course, the membrane state can cross the threshold, and thus generate an output spike, even milliseconds after the synaptic input occurred. The latter makes it necessary to predict a future threshold crossing from the EPSP time course. In this case, a generalization of the event-driven simulation strategy (Hines & Carnevale, 2004) provides both an exact and computationally efficient way to simulate neural activity (see Figure 12D). First, upon arrival of a synaptic event at time t, the actual membrane state V(t) is calculated from the membrane state V(t_0) at the time t_0 of the previous synaptic input using equation C.4, and the membrane time constant τ_m(t) is updated using equations 2.7 and 2.8. Then equation C.4 is used to predict the time t_s of a possible future threshold crossing, which is emitted as an event into the network. If another synaptic input arrives at a time t_1 < t_s, the membrane potential is updated, and a possible new spike time t_s' will overwrite the previous prediction t_s. An implementation of this approach for the gIF4 model using the simple analytic form of the PSP time course, equation C.4, as well as an evaluation of its precision and efficiency, will be the subject of a forthcoming study.
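The update-and-predict control flow described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the membrane relaxes exponentially between events, PSPs are applied as instantaneous jumps, and the threshold-crossing "prediction" is therefore trivial; with the full PSP time course of equation C.4, a root of V_PSP(t) = E_thres would be sought instead:

```python
import math

E_L, E_THRES = -80.0, -55.0    # leak reversal and firing threshold (mV)
TAU_M = 20.0                   # fixed membrane time constant (ms); illustrative

def advance(v, dt):
    """Analytic relaxation of the membrane state toward E_L over dt."""
    return E_L + (v - E_L) * math.exp(-dt / TAU_M)

def predict_spike_time(v, t_now):
    """Predicted time of a future threshold crossing, or None.

    With instantaneous PSP jumps, a crossing can only occur at the event
    itself; with the PSP time course of equation C.4 one would instead
    solve V_PSP(ts) = E_THRES for a later time ts."""
    return t_now if v >= E_THRES else None

def run(event_times, dv=8.0):
    """Process synaptic events (times in ms); return emitted spike times."""
    v, t_prev, spikes, ts_pending = E_L, 0.0, [], None
    for t in sorted(event_times):
        v = advance(v, t - t_prev) + dv        # decay, then instantaneous PSP
        t_prev = t
        ts_pending = predict_spike_time(v, t)  # new prediction overwrites old
        if ts_pending is not None:
            spikes.append(ts_pending)
            v, ts_pending = E_L, None          # reset after the spike
    return spikes
```

The key design point is that each new synaptic event both updates the state and invalidates any previously scheduled spike prediction, which is exactly the overwrite rule described in the text.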
C.2 gQIF—A Conductance-Based Quadratic IF Model. In order to describe the effect of state-dependent currents due to active membrane conductances, we generalize the passive membrane equation B.1 by incorporating a nonlinear current $I_{\mathrm{act}}(V(t))$,

$$ \frac{dV(t)}{dt} = -\frac{1}{\tau_m^L}\bigl(V(t) - E_L\bigr) - \frac{1}{C}\, I_{\mathrm{act}}(V(t)) - \frac{1}{C}\, I_{\mathrm{syn}}(t), \tag{C.5} $$
where $I_{\mathrm{syn}}(t)$ denotes the synaptic current (e.g., Fourcaud-Trocmé et al., 2003). Depending on the functional form of $I_{\mathrm{act}}(V(t))$, different models can be defined. In this and the following section, we give the explicit form of the state equation for various models in the presence of synaptic conductances, thus defining the basis for nonlinear IF neuron models with presynaptic-activity-dependent state dynamics. However, a detailed analytical and numerical investigation exceeds the scope of this article and will be the subject of a forthcoming study. The quadratic integrate-and-fire (QIF) neuronal model (Latham et al., 2000; Feng, 2001; Hansel & Mato, 2001; Brunel & Latham, 2003; Fourcaud-Trocmé et al., 2003) is defined by

$$ I_{\mathrm{act}}(V(t)) = -\frac{C}{2 \Delta_T \tau_m^L}\bigl(V(t) - E_T\bigr)^2 - \frac{C}{\tau_m^L}\bigl(V(t) - E_L\bigr) + I_T, \tag{C.6} $$
where $E_T$, defined by

$$ \left. \frac{d I_{\mathrm{act}}(V)}{dV} \right|_{V = E_T} = -\frac{C}{\tau_m^L}, \tag{C.7} $$
denotes the threshold membrane state at which the slope of the I–V curve vanishes,

$$ I_T = \frac{C}{\tau_m^L}\bigl(E_T - E_L\bigr) + I_{\mathrm{act}}(E_T) \tag{C.8} $$
denotes the corresponding threshold current, and

$$ \Delta_T = -C \left( \tau_m^L \left. \frac{d^2 I_{\mathrm{act}}(V)}{dV^2} \right|_{V = E_T} \right)^{-1} \tag{C.9} $$
the spike slope factor. If the synaptic input current exceeds the threshold current $I_T$, the membrane potential diverges to infinity in finite time. The latter can be used to define the discharge of a spike, after which the membrane potential is reset to a resting value. If the synaptic current takes the form $I_{\mathrm{syn}}(V(t)) = G_s^{\mathrm{exp}}(t)\,(V(t) - E_s)$, with the synaptic conductance $G_s^{\mathrm{exp}}(t)$ given by equation B.2, we can define the basic
state equation for a QIF model with presynaptic-activity-dependent state dynamics (gQIF model):

$$ \frac{dV(t)}{dt} = \frac{1}{2 \Delta_T \tau_m^L}\bigl(V(t) - E_T\bigr)^2 - \frac{1}{C}\, I_T - \frac{1}{\tau_m^s(t)}\bigl(V(t) - E_s\bigr) \tag{C.10} $$
for $t \geq t_0$, with the time-dependent synaptic contribution to the membrane time constant, $\tau_m^s(t)$, given by equation B.13. The latter is updated upon arrival of a synaptic event at time $t_0$ according to equation 2.7. Unfortunately, equation C.10 is difficult to solve analytically due to the explicit time dependence of $\tau_m^s(t)$. However, if the conductance increment caused by a single synaptic input is small compared to the total membrane conductance (in the realistic cases considered here, the conductance change following a synaptic input is about 30 times smaller than the leak conductance; see Table 2), a good approximate solution of equation C.10 can be obtained by assuming a constant $\tau_m^s(t)$ between the arrival of two synaptic inputs at $t_0$ and $t_1$. When a new synaptic input arrives, $\tau_m^s(t_0)$ is updated according to equations 2.6 and 2.7. This approach was also used in the gIF3 model for the effective reversal state (equations 2.25 and B.14) and provided a good approximation even for low input rates. A detailed numerical analysis, along with an analytical investigation of equation C.10 with respect to the divergences defining spiking events, exceeds the scope of this article and will be the subject of a forthcoming study.

C.3 gEIF—A Conductance-Based Exponential IF Model. Along the lines outlined in the last section, an exponential integrate-and-fire (EIF) neuron model (Fourcaud-Trocmé et al., 2003) can be defined with

$$ I_{\mathrm{act}}(V(t)) = -\frac{C \Delta_T}{\tau_m^L}\, e^{(V(t) - E_T)/\Delta_T}, \tag{C.11} $$
where $E_T$ and $\Delta_T$ are defined in equations C.7 and C.9, respectively. With a conductance-based synaptic current, we obtain the basic state equation for an EIF model with presynaptic-activity-dependent state dynamics (gEIF model):

$$ \frac{dV(t)}{dt} = -\frac{1}{\tau_m(t)}\bigl(V(t) - V_{\mathrm{rest}}(t)\bigr) + \frac{\Delta_T}{\tau_m^L}\, e^{(V(t) - E_T)/\Delta_T}, \tag{C.12} $$
where the effective reversal state $V_{\mathrm{rest}}(t)$ is given by equation B.11, and $\tau_m(t)$ obeys equations B.12 and B.13, with update $\tau_m^s$ upon arrival of a synaptic event (see equation 2.7) at time $t_0$. As in the case of the gQIF model, the state equation C.12 cannot be solved analytically in exact form. Again, for efficient use in event-driven simulation strategies, approximations remain the only tool to assess the membrane state at time t based on the state at the time
of the previous synaptic event t_0 and to predict a possible threshold crossing for spike generation.

Acknowledgments

We thank Romain Brette, Martin Pospischil, and Andrew Davison for stimulating discussions. This research was supported by CNRS, HFSP, and the European Community (integrated project FACETS, IST-15879).

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Borg-Graham, L. J., Monier, C., & Frégnac, Y. (1998). Visual input evokes transient and strong shunting inhibition in visual cortical neurons. Nature, 393, 369–373.
Brumberg, J. C. (2002). Firing pattern modulation by oscillatory input in supragranular pyramidal neurons. Neurosci., 114, 239–246.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comp. Neurosci., 8, 183–208.
Brunel, N., Chance, F. S., Fourcaud, N., & Abbott, L. F. (2001). Effects of synaptic noise and filtering on the frequency response of spiking neurons. Phys. Rev. Lett., 86, 2186–2189.
Brunel, N., & Latham, P. E. (2003). Firing rate of the noisy quadratic integrate-and-fire neuron. Neural Comput., 15, 2281–2306.
Burns, B. D., & Webb, A. C. (1976). The spontaneous activity of neurons in the cat's visual cortex. Proc. R. Soc. London B, 194, 211–223.
Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation from background synaptic input. Neuron, 15, 773–782.
Christodoulou, C., & Bugmann, G. (2001). Coefficient of variation vs. mean interspike interval curves: What do they tell us about the brain? Neurocomputing, 38–40, 1141–1149.
Cragg, B. G. (1967). The density of synapses and neurones in the motor and visual areas of the cerebral cortex. J. Anat., 101, 639–654.
DeFelipe, J., Alonso-Nanclares, L., & Arellano, J. I. (2002). Microstructure of the neocortex: Comparative aspects. J. Neurocytol., 31, 299–316.
DeFelipe, J., & Fariñas, I. (1992).
The pyramidal neuron of the cerebral cortex: Morphological and chemical characteristics of the synaptic inputs. Prog. Neurobiol., 39, 563–607.
Delorme, A., Gautrais, J., van Rullen, R., & Thorpe, S. (1999). SpikeNET: A simulator for modeling large networks of integrate and fire neurons. Neurocomputing, 26–27, 989–996.
Delorme, A., & Thorpe, S. J. (2003). SpikeNET: An event-driven simulation package for modeling large networks of spiking neurons. Network: Comput. Neural Syst., 14, 613–627.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994). Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic formalism. J. Comput. Neurosci., 1, 195–230.
Destexhe, A., Mainen, Z., & Sejnowski, T. J. (1998). Kinetic models of synaptic transmission. In C. Koch & I. Segev (Eds.), Methods of neuronal modeling (pp. 1–26). Cambridge, MA: MIT Press.
Destexhe, A., & Rudolph, M. (2004). Extracting information from the power spectrum of synaptic noise. J. Comput. Neurosci., 17, 327–345.
Destexhe, A., Rudolph, M., Fellous, J.-M., & Sejnowski, T. J. (2001). Fluctuating synaptic conductances recreate in vivo–like activity in neocortical neurons. Neurosci., 107, 13–24.
Destexhe, A., Rudolph, M., & Paré, D. (2003). The high-conductance state of neocortical neurons in vivo. Nature Rev. Neurosci., 4, 739–751.
Evarts, E. V. (1964). Temporal patterns of discharge of pyramidal tract neurons during sleep and waking in the monkey. J. Neurophysiol., 27, 152–171.
Fellous, J.-M., Rudolph, M., Destexhe, A., & Sejnowski, T. J. (2003). Synaptic background noise controls the input/output characteristics of single cells in an in vitro model of in vivo activity. Neurosci., 122, 811–829.
Feng, J. (2001). Is the integrate-and-fire model good enough? A review. Neural Netw., 14, 955–975.
Fourcaud-Trocmé, N., Hansel, D., van Vreeswijk, C., & Brunel, N. (2003). How spike generation mechanisms determine the neuronal response to fluctuating inputs. J. Neurosci., 23, 11628–11640.
Froemke, R., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.
Gerstner, W., & Kistler, W. M. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Gerstner, W., Ritz, R., & van Hemmen, J. L. (1993). A biologically motivated and analytically soluble model of collective oscillations in the cortex: I. Theory of weak locking. Biol. Cybern., 68, 363–374.
Gerstner, W., & van Hemmen, J. L.
(1992). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Gerstner, W., & van Hemmen, J. L. (1993). Coherence and incoherence in a globally coupled ensemble of pulse-emitting units. Phys. Rev. Lett., 71, 312–315.
Giugliano, M. (2000). Synthesis of generalized algorithms for the fast computation of synaptic conductances with Markov kinetic models in large network simulations. Neural Comput., 12, 903–931.
Giugliano, M., Bove, M., & Grattarola, M. (1999). Activity-driven computational strategies of a dynamically regulated integrate-and-fire model neuron. J. Comput. Neurosci., 7, 247–254.
Gruner, J. E., Hirsch, J. C., & Sotelo, C. (1974). Ultrastructural features of the isolated suprasylvian gyrus in the cat. J. Comp. Neurol., 154, 1–27.
Gutkin, B. S., Ermentrout, G. B., & Rudolph, M. (2003). Spike generating dynamics and the conditions for spike-time precision in cortical neurons. J. Comput. Neurosci., 15, 91–103.
Hansel, D., & Mato, G. (2001). Existence and stability of persistent states in large neuronal networks. Phys. Rev. Lett., 86, 4175–4178.
Hansel, D., Mato, G., Meunier, C., & Neltner, L. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Comput., 10, 467–483.
Hill, S., & Tononi, G. (2005). Modeling sleep and wakefulness in the thalamocortical system. J. Neurophysiol., 93, 1671–1698.
Hines, M. L., & Carnevale, N. T. (1997). The NEURON simulation environment. Neural Comput., 9, 1179–1209.
Hines, M. L., & Carnevale, N. T. (2004). Discrete event simulation in the NEURON environment. Neurocomputing, 58–60, 1117–1122.
Hirsch, J. A., Alonso, J. M., Reid, R. C., & Martinez, L. M. (1998). Synaptic integration in striate cortical simple cells. J. Neurosci., 18, 9517–9528.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117, 500–544.
Holmes, W. R., & Woody, C. D. (1989). Effects of uniform and non-uniform synaptic "activation-distributions" on the cable properties of modeled cortical pyramidal neurons. Brain Res., 505, 12–22.
Hubel, D. (1959). Single-unit activity in striate cortex of unrestrained cats. J. Physiol., 147, 226–238.
Huguenard, J. R., Hamill, O. P., & Prince, D. A. (1988). Developmental changes in Na+ conductances in rat neocortical neurons: Appearance of a slow inactivating component. J. Neurophysiol., 59, 778–795.
Izhikevich, E. M. (2001). Resonate-and-fire neurons. Neural Netw., 14, 883–894.
Izhikevich, E. M. (2003). Simple model of spiking neurons. IEEE Trans. Neural Networks, 14, 1569–1572.
Jolivet, R., & Gerstner, W. (2004). Predicting spike times of a detailed conductance-based neuron model driven by stochastic spike arrival. J. Physiol. (Paris), 98, 442–451.
Jolivet, R., Lewis, T. J., & Gerstner, W. (2004). Generalized integrate-and-fire models of neuronal activity approximate spike trains of a detailed model to a high degree of accuracy. J. Neurophysiol., 92, 959–976.
Knight, B. W. (1972). Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59, 734–766.
Koch, C. (1999). Biophysics of computation. New York: Oxford University Press.
Kuhn, A., Aertsen, A., & Rotter, S. (2004). Neuronal integration of synaptic input in the fluctuation-driven regime. J. Neurosci., 24, 2345–2356.
Lánský, P., & Rospars, J. P. (1995). Ornstein-Uhlenbeck model neuron revisited. Biol. Cybern., 72, 397–406.
Lapicque, L. (1907). Recherches quantitatives sur l'excitation électrique des nerfs traitée comme une polarization. J. Physiol. Pathol. Gen., 9, 620–635.
Latham, P. E., Richmond, B. J., Nelson, P. G., & Nirenberg, S. (2000). Intrinsic dynamics in neuronal networks. I. Theory. J. Neurophysiol., 83, 808–827.
LeMasson, G., Marder, E., & Abbott, L. F. (1993). Activity-dependent regulation of conductances in model neurons. Science, 259, 1915–1917.
Lytton, W. W. (1996). Optimizing synaptic conductance calculation for network simulations. Neural Comput., 8, 501–509.
Lytton, W. W., & Hines, M. L. (2005). Independent variable time-step integration of individual neurons for network simulations. Neural Comput., 17, 903–921.
Lytton, W. W., & Stewart, M. (in press). RBF: Rule-based firing for network simulations. Neurocomputing.
Markram, H., Wang, Y., & Tsodyks, M. (1998). Differential signaling via the same axon of neocortical pyramidal neurons. Proc. Natl. Acad. Sci. USA, 95, 5323–5328.
Matsumura, M., Cope, T., & Fetz, E. E. (1988). Sustained excitatory synaptic input to motor cortex neurons in awake animals revealed by intracellular recording of membrane potentials. Exp. Brain Res., 70, 463–469.
Mattia, M., & Del Giudice, P. (2000). Efficient event-driven simulation of large networks of spiking neurons and dynamical synapses. Neural Comput., 12, 2305–2329.
Mauro, A., Conti, F., Dodge, F., & Schor, R. (1970). Subthreshold behavior and phenomenological impedance of the squid giant axon. J. General Physiol., 55, 497–523.
Mehring, C., Hehl, U., Kubo, M., Diesmann, M., & Aertsen, A. (2003). Activity dynamics and propagation of synchronous spiking in locally connected random networks. Biol. Cybern., 88, 395–408.
Mountcastle, V. B. (1997). The columnar organization of the neocortex. Brain, 120, 701–722.
Nicoll, A., Larkman, A., & Blakemore, C. (1993). Modulation of EPSP shape and efficacy by intrinsic membrane conductances in rat neocortical pyramidal neurons in vitro. J. Physiol., 468, 693–710.
Noda, H., & Adey, R. (1970). Firing variability in cat association cortex during sleep and wakefulness. Brain Res., 18, 513–526.
Paré, D., Shink, E., Gaudreau, H., Destexhe, A., & Lang, E. J. (1998). Impact of spontaneous synaptic activity on the resting properties of cat neocortical neurons in vivo. J. Neurophysiol., 79, 1450–1460.
Prescott, S. A., & De Koninck, Y. (2003). Gain control of firing rate by shunting inhibition: Roles of synaptic noise and dendritic saturation. Proc. Natl. Acad. Sci. USA, 100, 2076–2081.
Rall, W. (1967). Distinguishing theoretical synaptic potentials computed for different soma-dendritic distributions of synaptic inputs. J. Neurophysiol., 30, 1138–1168.
Rauch, A., La Camera, G., Lüscher, H.-R., Senn, W., & Fusi, S. (2003). Neocortical pyramidal cells respond as integrate-and-fire neurons to in vivo like input currents. J. Neurophysiol., 90, 1598–1612.
Reutimann, J., Giugliano, M., & Fusi, S. (2003). Event-driven simulations of spiking neurons with stochastic dynamics. Neural Comput., 15, 811–830.
Ricciardi, L. M., & Sacerdote, L. (1979). The Ornstein-Uhlenbeck process as a model for neuronal activity. I. Mean and variance of the firing time. Biol. Cybern., 35, 1–9.
Rudolph, M., & Destexhe, A. (2001). Correlation detection and resonance in neural systems with distributed noise sources. Phys. Rev. Lett., 86, 3662–3665.
Rudolph, M., & Destexhe, A. (2003). The discharge variability of neocortical neurons during high-conductance states. Neurosci., 119, 855–873.
Rudolph, M., & Destexhe, A. (2005). Multi-channel shot noise and characterization of cortical network activity. Neurocomputing, 65–66, 641–646.
Senn, W., Markram, H., & Tsodyks, M. (2000). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Comput., 13, 35–67.
Shelley, M., McLaughlin, D., Shapley, R., & Wielaard, J. (2002). States of high conductance in a large-scale model of the visual cortex. J. Comput. Neurosci., 13, 93–109.
Shelley, M. J., & Tao, L. (2001). Efficient and accurate time-stepping schemes for integrate-and-fire neuronal networks. J. Comput. Neurosci., 11, 111–119.
Shu, Y., Hasenstaub, A., Badoual, M., Bal, T., & McCormick, D. A. (2003). Barrages of synaptic activity control the gain and sensitivity of cortical neurons. J. Neurosci., 23, 10388–10401.
Smith, D. R., & Smith, G. K. (1965). A statistical analysis of the continuous activity of single cortical neurons in the cat unanesthetized isolated forebrain. Biophys. J., 5, 47–74.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Song, S., & Abbott, L. F. (2001). Cortical development and remapping through spike timing–dependent plasticity. Neuron, 32, 339–350.
Steriade, M. (1978). Cortical long-axoned cells and putative interneurons during the sleep-waking cycle. Behav. Brain Sci., 3, 465–514.
Steriade, M., & McCarley, R. W. (1990). Brainstem control of wakefulness and sleep. New York: Plenum.
Steriade, M., Timofeev, I., & Grenier, F. (2001). Natural waking and sleep states: A view from inside neocortical neurons. J. Neurophysiol., 85, 1969–1985.
Stevens, C. F., & Zador, A. M. (1998a). Novel integrate-and-fire-like model of repetitive firing in cortical neurons. In Proc. of the 5th Joint Symposium on Neural Comput. (Vol. 8, pp. 172–177). La Jolla, CA: University of California, San Diego.
Stevens, C. F., & Zador, A. M. (1998b). Input synchrony and the irregular firing of cortical neurons. Nature Neurosci., 1, 210–217.
Stuart, G., & Spruston, N. (1998). Determinants of voltage attenuation in neocortical pyramidal neuron dendrites. J. Neurosci., 18, 3501–3510.
Svirskis, G., & Rinzel, J. (2000). Influence of temporal correlation of synaptic input on the rate and variability of firing in neurons. Biophys. J., 79, 629–637.
Szentágothai, J. (1965).
The use of degeneration in the investigation of short neuronal connections. In M. Singer & J. P. Schade (Eds.), Progress in brain research, 14 (pp. 1–32). Amsterdam: Elsevier.
Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. Cambridge: Cambridge University Press.
Tsodyks, M., Mit'kov, I., & Sompolinsky, H. (1993). Pattern of synchrony in inhomogeneous networks of oscillators with pulse interaction. Phys. Rev. Lett., 71, 1280.
Watts, L. (1994). Event-driven simulations of networks of spiking neurons. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 927–934). San Mateo, CA: Morgan Kaufmann.
Wehmeier, U., Dong, D., Koch, C., & van Essen, D. (1989). Modeling the mammalian visual system. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 335–359). Cambridge, MA: MIT Press.
Wielaard, D. J., Shelley, M., McLaughlin, D., & Shapley, R. (2001). How simple cells are made in a nonlinear network model of the visual cortex. J. Neurosci., 21, 5203–5211.
Received June 23, 2005; accepted February 17, 2006.
LETTER
Communicated by C. Lee Giles
The Crystallizing Substochastic Sequential Machine Extractor: CrySSMEx Henrik Jacobsson [email protected] School of Humanities, University of Skövde, Skövde, Sweden, and Department of Computer Science, University of Sheffield, United Kingdom
This letter presents an algorithm, CrySSMEx, for extracting minimal finite state machine descriptions of dynamic systems such as recurrent neural networks. Unlike previous algorithms, CrySSMEx is parameter free and deterministic, and it efficiently generates a series of increasingly refined models. A novel finite stochastic model of dynamic systems and a novel vector quantization function have been developed to take into account the state-space dynamics of the system. The experiments show that (1) extraction from systems that can be described as regular grammars is trivial, (2) extraction from high-dimensional systems is feasible, and (3) extraction of approximate models from chaotic systems is possible. The results are promising, and an analysis of shortcomings suggests some possible further improvements. Some largely overlooked connections between the field of rule extraction from recurrent neural networks and other fields are also identified.

1 Introduction

Computer simulation is today a widely used method for testing theories. The domains under study through simulation lie within fields like physics, chemistry, engineering, economics, ecology, molecular biology, and neuroscience. The simulated systems are often very complicated to analyze, partly because of their intrinsic complexity and partly because of the massive amounts of data stemming from numerous instances of models and automated repetitions of simulations (e.g., for multiple alternative scenarios). Many analytical and numerical analysis tools, generic as well as domain-specific ones, have been devised to understand simulated models better.
For recurrent neural networks (RNNs) (e.g., Kremer, 2001; Kolen & Kremer, 2001), it has been natural to analyze them as finite state machines (FSMs), partly due to their common source of origin (McCulloch & Pitts, 1943) and partly due to the fact that they have often been trained to perform regular language recognition (e.g., Cleeremans, McClelland, & Servan-Schreiber, 1989; Christiansen & Chater, 1999). This has led to the development of algorithms for transforming one model into another, for example, from RNNs

Neural Computation 18, 2211–2255 (2006)
© 2006 Massachusetts Institute of Technology
into FSMs. Similar approximative transformations are central issues in numerous fields, some of which will be brought up and related to explicitly at the end of this article. Unlike previous approaches, the novel algorithm presented here, CrySSMEx (the crystallizing substochastic sequential machine extractor), is parameter free, handles missing data, generates approximate rules if the underlying system is chaotic, and returns results at any time (Craven & Shavlik, 1999); that is, coarse models are initially created and then iteratively refined.1 CrySSMEx is partially based on conclusions drawn from a recent review of earlier algorithms (Jacobsson, 2005) for rule extraction from recurrent neural networks (RNN-RE). The underlying idea behind CrySSMEx is to observe the state and output of a dynamic system, quantize the state space, and refine the quantization of the state space such that the resulting machine typically is minimal, deterministic, and equivalent to the RNN. The aim of the algorithm will now be presented, together with a brief discussion and overview of what distinguishes CrySSMEx from earlier RNN-RE algorithms.

1.1 Aim of CrySSMEx. CrySSMEx differs from many earlier approaches in that it strives for fidelity rather than accuracy of the rules. Fidelity is the degree to which the rules mimic the network, whereas accuracy is related to how well the rules generalize to unseen examples (Andrews, Diederich, & Tickle, 1995). When fidelity is the goal and the underlying network makes mistakes, the machine extracted from the network should also replicate those mistakes. Some earlier approaches have also focused on fidelity (e.g., Vahed & Omlin, 2004), but most work has had accuracy as the prime goal for the rules (e.g., Giles, Miller, Chen, Chen, & Sun, 1992; Zeng, Goodman, & Smyth, 1993), which makes sense if the network is used as an intermediate step for acquiring symbolic knowledge from data, such as for grammar induction.
In some cases, this approach has been very successful, as when the extracted rules were equivalent to the symbolic data generator (e.g., Giles, Miller, Chen, Chen, et al., 1992; Giles, Miller, Chen, Sun, et al., 1992). One reason to strive for fidelity is that it makes the rules useful for analyzing erroneous RNNs. One could compare an erroneous RNN with a sick patient and an RNN-RE algorithm with an instrument of a doctor diagnosing the patient. The doctor would not learn much from an accuracy-seeking instrument describing what the patient should be like if completely healthy, which is basically what accuracy-optimizing methods strive for.
1 CrySSMEx is to be pronounced somewhat like "Christmas." The implementation of the algorithm referred to here is available as an open source distribution at http://cryssmex.sourceforge.net. The latest version and associated information can be found on this home page.
The Crystallizing Substochastic Sequential Machine Extractor
Instead, the analysis tool should generate an analysis that reflects the actual condition of the patient. Another difference between accuracy and fidelity is that the latter does not presuppose the existence of any task in which errors can be defined. Instead, the quality of extraction is measured by how well the extracted model mimics the underlying system. This allows for the analysis of simulated systems other than just RNNs. Therefore, in this article, the extraction of rules from RNNs will be treated as an interesting special case of extraction from a broad range of dynamic systems (see section 2.1).

1.2 What Is New in CrySSMEx. The three main criteria in a recent taxonomy of RNN-RE methods (Jacobsson, 2005) were (1) the means of state observation, (2) the type of rules extracted, and (3) the state-space quantization method. The observation of states in CrySSMEx, as in many other approaches (e.g., Watrous & Kuhn, 1992; Manolios & Fanelli, 1994; Tiňo & Vojtek, 1998; Tiňo & Köteles, 1999; Tiňo, Čerňanský, & Beňušková, 2004), is solely based on sampling the system as it behaves in its domain. The novel components of CrySSMEx are the rule type (see section 2.2) and the quantization method (see section 3). But what really distinguishes CrySSMEx from all earlier approaches is the integration of the four basic elements found in previous approaches (Jacobsson, 2005):
- Quantization of state
- Observation of the underlying system
- Rule construction
- Rule minimization
These four subprocedures have typically been quite separable in RNN-RE algorithms. In earlier approaches, the quantization of the state space has been done by traditional clustering techniques with no sensitivity to, or any integration with, the dynamics of the RNN. Also, the minimization of the rules (when conducted at all) was just a postprocessing of the rules. In CrySSMEx, all four constituents are tightly integrated into one system, resulting in an empirical loop of model refinement through model-based data selection (cf. section 4). 1.3 Overview. This letter is structured such that the main loop of the algorithm (in section 4.2) could be understood at an abstract level without knowing all the details of the constituents. Therefore, readers are recommended to look briefly at algorithm 3 in section 4.2, the point of convergence of this article, before continuing to read the letter. To further aid readers, a list of important abbreviations can be found in appendix B.
H. Jacobsson
The letter is otherwise organized as follows. In section 2 the specific class of dynamic systems analyzable with CrySSMEx is defined together with a discrete stochastic model of these systems. In section 3 a novel vector quantizer is described. And section 4 connects the constituents of CrySSMEx into one coherent algorithm. The remaining sections contain experiments, discussion, and conclusions.

2 Modeling Dynamic Systems

This section introduces a class of dynamic systems (see section 2.1), a finite stochastic model of these systems (see section 2.2), and a means of transforming the dynamic system into the stochastic model through system observation (see section 2.2.3). The translation process of the system into a model is refined by other parts of CrySSMEx (see sections 3–4) so that more precise translations can be made.

2.1 Situated Discrete Time Dynamic Systems. The target domain for CrySSMEx is a general class of dynamic systems that includes RNNs. Therefore, only properties of RNNs that are of importance for rule extraction are included. Other properties typically associated with neural networks, such as weights, activation functions, and learning, are omitted. The resulting class of systems will here be referred to as a situated discrete time dynamic system, incorporating state, input, output, and dynamics of the system. The system is situated in the sense that it has a defined interface with a domain with which it interacts. After this point in the article, the extraction of rules from such dynamic systems rather than from only RNNs will be considered, but the underlying problems are precisely the same.

2.1.1 Definition

Definition 1. A situated discrete time dynamic system (SDTDS) is a quadruple ⟨S, I, O, γ⟩ where S ⊆ R^{n_s} is a set of state vectors, I ⊆ R^{n_i} is a set of input vectors, O ⊆ R^{n_o} is a set of output vectors, γ : S × I → S × O is the transition function, and n_s, n_i, n_o ∈ N are the dimensionalities of the state, input, and output spaces, respectively. If the system, at time t, occupies a state s̄(t) and is fed an input ı̄(t), then the resulting next state and produced output are determined by [s̄(t + 1), ō(t + 1)] = γ(s̄(t), ı̄(t)).

The current and initial state of the system are not included in the SDTDS model since they are something imposed on the system (the SDTDS specifies the framework and behavior for any arbitrary initial state, just as a function specifies the image of any arbitrary member of the domain of the function). To simplify descriptions, the transition function γ can be subdivided into two functions: γ_s : S × I → S and γ_o : S × I → O.
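To make definition 1 concrete, the quadruple can be sketched in a few lines of Python. The class, the one-dimensional toy transition function, and all names below are illustrative assumptions, not taken from the CrySSMEx implementation.

```python
import math

class SDTDS:
    """Sketch of definition 1: a situated discrete time dynamic system,
    reduced to its transition function gamma : S x I -> S x O."""

    def __init__(self, gamma):
        self.gamma = gamma

    def step(self, s, i):
        # [s(t+1), o(t+1)] = gamma(s(t), i(t))
        return self.gamma(s, i)

# A hypothetical system with n_s = n_i = n_o = 1:
def gamma(s, i):
    s_next = 0.5 * s + i        # gamma_s : S x I -> S
    o_next = math.tanh(s_next)  # gamma_o : S x I -> O
    return s_next, o_next

system = SDTDS(gamma)
s, o = system.step(0.0, 1.0)  # s(t+1) = 1.0, o(t+1) = tanh(1.0)
```

Note that, as in the definition, the current state is not part of the object: the caller supplies any state and the system specifies the transition.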
It should be noted that the functional dependencies are those of a Mealy system rather than a Moore system, in that the output is determined by state and input rather than being a function of state alone (Hopcroft & Ullman, 1979). The reason for this choice is that a Mealy model can in this context subsume a Moore model, but not necessarily vice versa.2 In its current implementation, CrySSMEx also requires the set of input vectors to be finite, which, for example, is the case for any symbol-processing RNN. This restriction is not included in the definition since it is more a matter of what is put into the SDTDS than a restriction of the system itself. Other than that, there are no theoretical restrictions on the SDTDS as defined above for CrySSMEx to analyze it. There are, however, some implicit requirements that a rule extraction algorithm makes of the underlying SDTDS, which cause some systems of general interest not to fall under the definition above (Jacobsson, 2005). For example, the state, input, and output must be distinctly separable as well as fully and unintrusively observable. Moreover, γ must be a noise-free function; that is, the observed system is assumed to be completely deterministic.

2.1.2 Collection of Data from an SDTDS. An RNN-RE algorithm should transform an RNN into a discrete model mimicking the RNN to a satisfactory degree. To do this, a compositional approach has typically been adopted, where data are gathered from the internal activations of the RNN and a model is then built from these data (Tickle, Andrews, Golea, & Diederich, 1998). Within the RNN-RE field, two subtypes of the compositional approach exist: one where the RNN-RE algorithm interacts directly with the RNN while performing a breadth-first search, and another where the data are collected from the RNN during interaction with the domain in which it was trained (Jacobsson, 2005).
In CrySSMEx, the latter is chosen for three reasons: (1) the data (and hence the extracted model) will only contain aspects of the RNN relevant for the domain, (2) it is far more efficient since, in effect, the domain is used as a heuristic when searching among all the possible models that describe the behavior of the system (Jacobsson & Ziemke, 2003b), and (3) it is possible to do the extraction off-line, that is, pregenerated data can be used in CrySSMEx since no direct interaction between extractor and underlying system is needed. When the SDTDS is set to hold a certain initial state and is then fed a sequence of input vectors from a domain, it will generate a sequence of states and outputs as a result. This domain interaction is the basis for the data collection, and the result is recorded as a sequence of transition events.
2 A Moore model, and a Moore machine extraction version of CrySSMEx, has also been implemented, but it is not presented here since it involves small changes in many different parts of the descriptions.
Definition 2. An SDTDS transition event at a time t, ω(t), is a quadruple ⟨s̄(t), ı̄(t), ō(t + 1), s̄(t + 1)⟩ ∈ R^{n_s} × R^{n_i} × R^{n_o} × R^{n_s} where s̄(t + 1) is the state vector reached after the SDTDS received input ı̄(t) while occupying state s̄(t), and ō(t + 1) is the output generated in the transition.

Definition 3. A transition event set, Ω, consists of selected transition events recorded from the SDTDS with a given set of input sequences.

The reason that Ω is defined to consist of selected events is that it is quite possible that some events are not wanted in the model, as when the user has made an explicit reset of the state with no wish to model the transition caused by this. The user may also want to let the system "settle in" before starting data collection.
2.1.3 Building a Stochastic Dynamic Model from a Quantized SDTDS. The essential part of CrySSMEx, and of all earlier RNN-RE algorithms, is the quantization of the state space. The set of possible states in the state space of the SDTDS is uncountable and must be transformed to a finite domain to make the extraction of a finite machine possible.

Definition 4. A quantizer Λ : R^n → {1, 2, . . . , m} is a function that separates an n-dimensional real space into m uniquely labeled disjoint subspaces.

The maximum number of subspaces, m (i.e., the cardinality of the codomain of the function Λ), will, for pragmatic reasons, be denoted |Λ|. Although not explicitly stated in most RNN-RE articles, all three spaces of RNNs (input, state, and output) are actually labeled using some form of quantization function. The quantization of the state space is, of course, of central concern, but the input and output also need to be labeled into a finite set of symbols to produce the extracted finite machine. The state, input, and output quantizers will be denoted Λ_s, Λ_i, and Λ_o, respectively. The SDTDS is in itself capable of reacting to any of the possible input vectors (since the SDTDS definition includes the entire vector spaces in the domains of the transition function), but in its current implementation, CrySSMEx requires the input domain to be finite (and Λ_i must be invertible). The frequencies of quantized transitions in the transition event set, Ω, are transformed into a joint probability distribution that will later be used to build a dynamic model that mimics the SDTDS (see section 2.2.3):

Definition 5. A stochastic dynamic model of an SDTDS is a joint probability mass function p̄ induced from a transition event set Ω and quantizers Λ_o, Λ_i, and Λ_s. The stochastic model is defined as a function p̄ : [1, |Λ_s|] × [1, |Λ_i|] × [1, |Λ_o|] × [1, |Λ_s|] → [0, 1] where p̄(i, k, l, j) denotes the probability that if one picks a random transition event from Ω, it would be a transition from a state
enumerated i by Λ_s, over an input vector enumerated k by Λ_i, which generated an output enumerated l by Λ_o, and a new state enumerated j by Λ_s.3

2.2 Substochastic Sequential Machines. Stochastic machines have been extracted earlier (Tiňo & Vojtek, 1998; Tiňo & Köteles, 1999), but without modeling the output of the system explicitly. In CrySSMEx, the output of the system will be modeled as well. The stochastic dynamic model (p̄ in definition 5) collected from the SDTDS in interaction with its domain gives us information about the estimated probabilities of the effect and outcome of transitions in the system as viewed through the quantizers. These probabilities are used to build a finite stochastic machine model of the SDTDS. This type of machine resembles stochastic sequential machines (Paz, 1971) or probabilistic automata (Rabin, 1963) but has some distinguishing features since there is a realistic possibility of model incompleteness due to a finite observed set of transition events. This is due to the fact that the sample of input sequences in Ω will not necessarily provide examples of all possible input symbols in all possible enumerations of the quantized state space. The choice here is to make a "closed world assumption," and consequently, only what is observed in Ω will be included in the model. Missing data must therefore be handled when the model is built from Ω. This causes the probabilistic model to become a substochastic sequential machine (SSM) rather than the stochastic sequential machine of Paz (1971). As a consequence, this incompleteness of the model implies that probability can "leak" out from the state of the machine during parsing of input sequences, causing the probability distributions to become substochastic (see appendix A). The details of what this entails will be clarified in the following sections. First, however, some additional definitions and notational conventions will be introduced; then the full SSM definition will be given.
2.2.1 Notation of Probability Distributions as Vectors. Sometimes a probability distribution is preferably denoted as a vector (cf. Paz, 1971). The probability mass function over a discrete stochastic variable X is denoted p(X = x_i), or p(x_i) for short. p(x_i) is interpreted as the probability of X having the value x_i. If we want to express this probability as an element of a vector, it is convenient to write p(x_i) as x̄_i. The full vector, representing the full distribution over X, is denoted x̄, with no index. The vector and probability notations of distributions will be used interchangeably since they are more conveniently expressed as one or the other depending on context. Important types of substochastic vectors and operations on them are defined in appendix A.
3 The awkward order of i, j, k, and l is due to other contexts of the variables of p later in this article.
2.2.2 SSM Definition.

Definition 6. A substochastic sequential machine (SSM) is a quadruple ⟨Q, X, Y, P = {p(q_j, y_l | q_i, x_k)}⟩ where Q is a finite set of state elements (SEs), X is a finite set of input symbols, Y is a finite set of output symbols, and P is a finite set of conditional probabilities (cf. the explanation of equation 2.3) where q_i, q_j ∈ Q, x_k ∈ X, and y_l ∈ Y.

The terminology here is somewhat different from that of conventional finite state machines. The input and output domains of the SSM will still be considered alphabets of symbols, whereas the Q of the SSM will instead be denoted state elements (SEs) so as not to confuse them with the state of the SDTDS. Also, the actual state of the SSM is more properly described as a (sub)stochastic distribution over these elements. The interpretation of p(q_j, y_l | q_i, x_k) is that it is the probability of the machine entering the SE q_j, and in this transition producing symbol y_l, given that it occupied only SE q_i and was fed input symbol x_k. A more detailed description of the SSM interpretation is given in section 2.2.4, which describes the use of an SSM as a parser of input symbols. But first, the construction of an SSM from a model of the SDTDS will be described.

2.2.3 Translation of an SDTDS into an SSM. It is quite straightforward to see the similarities between the SDTDS and the SSM (cf. definitions 1 and 6). The difference lies mainly in the discreteness of the input, state, and output domains of the SSM versus the uncountable domains of the SDTDS. In practice, however, the SSM can be seen as a subclass of the set of SDTDSs since a substochastic SE distribution can be subsumed as an SDTDS state, and correspondingly for input and output. When transforming an SDTDS into an SSM model, the uncountable domains S, I, and O of the SDTDS are reduced to the finite domains Q, X, and Y.
The SSM is created from a quantized SDTDS such that the domains of the SSM are isomorphic to the codomains of the respective quantizers. In other words, Q of the SSM is isomorphic to [1, |Λ_s|], and correspondingly for the input and output symbols. In the following text, an SE denoted q_i ∈ Q will correspond to the portion of the state space of the SDTDS enumerated i by the Λ_s quantizer. The joint probabilities of observed and quantized SDTDS transitions (p̄) are translated into joint probabilities of SSM transitions according to

$$p(q_i, x_k, y_l, q_j) = \bar{p}\big(\Lambda_s(\bar{s}(t)) = i,\ \Lambda_i(\bar{\imath}(t)) = k,\ \Lambda_o(\bar{o}(t+1)) = l,\ \Lambda_s(\bar{s}(t+1)) = j\big), \tag{2.1}$$

that is, the joint probability of SSM transitions is defined such that it corresponds to the observed frequency of transitions in the SDTDS. The
conditional probability of the SSM, p(q_j, y_l | q_i, x_k), is calculated from the joint probability according to equations 2.2 and 2.3:

$$p(q_i, x_k) = \sum_{j=1}^{|Q|} \sum_{l=1}^{|Y|} p(q_i, x_k, y_l, q_j), \tag{2.2}$$

$$p(q_j, y_l \mid q_i, x_k) = \begin{cases} \dfrac{p(q_i, x_k, y_l, q_j)}{p(q_i, x_k)} & \text{if } p(q_i, x_k) > 0 \\[4pt] 0 & \text{if } p(q_i, x_k) = 0. \end{cases} \tag{2.3}$$
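As a sketch, the estimation in equations 2.1 through 2.3 can be written in a few lines of Python. The grid quantizer, the function names, and the dictionary layout here are illustrative assumptions (CrySSMEx's own state quantizer, the CVQ of section 3, is adaptive, not a fixed grid):

```python
from collections import Counter

def make_grid_quantizer(low, high, bins):
    """A toy quantizer in the sense of definition 4: label the boxes of a
    fixed grid over [low, high)^n with {1, ..., bins**n}."""
    def quantize(v):
        label = 0
        for x in v:
            # clamp so that out-of-range values still receive a label
            cell = min(bins - 1, max(0, int((x - low) / (high - low) * bins)))
            label = label * bins + cell
        return label + 1  # enumeration starts at 1, as in the text
    return quantize

def estimate_conditionals(events, Ls, Li, Lo):
    """Quantize the transition events and estimate p(q_j, y_l | q_i, x_k)
    (eqs. 2.1-2.3). Dead transitions simply have no entry (probability
    zero), so the resulting model may be substochastic."""
    joint = Counter()                                  # ~ eq. 2.1 (counts)
    for (s, i, o_next, s_next) in events:
        joint[(Ls(s), Li(i), Lo(o_next), Ls(s_next))] += 1
    marginal = Counter()                               # eq. 2.2
    for (qi, xk, yl, qj), c in joint.items():
        marginal[(qi, xk)] += c
    cond = {}                                          # eq. 2.3
    for (qi, xk, yl, qj), c in joint.items():
        cond[(qj, yl, qi, xk)] = c / marginal[(qi, xk)]
    return cond

# Two observed 1-d transition events (state, input, output, next state):
Ls = make_grid_quantizer(0.0, 1.0, 2)  # reused for input and output here
events = [((0.1,), (0.9,), (0.2,), (0.8,)),
          ((0.1,), (0.9,), (0.7,), (0.8,))]
cond = estimate_conditionals(events, Ls, Ls, Ls)
# cond[(2, 1, 1, 2)] == 0.5: from state box 1 over input box 2, output
# box 1 together with next state box 2 was observed half the time.
```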
Although conceptually appealing, the distribution P = {p(q_j, y_l | q_i, x_k)} is perhaps a bit haphazardly termed conditional probabilities, since a conditional probability p(a|b) traditionally is undefined if p(b) = 0. But in the SSM, we need these to be defined since there might actually be cases where there is no transition from an SE q_i over a specific symbol x_k, simply because there are no observations in Ω of any such event.

Definition 7. If p(q_j, y_l | q_i, x_k) = 0 for all q_j ∈ Q and y_l ∈ Y, then the transition from q_i over input x_k will be referred to as a dead transition.

Definition 8. The procedure of transforming an SDTDS from Ω, through the stochastic dynamic model p̄ of definition 5, into an SSM as defined above in equations 2.1 to 2.3 will in pseudocode be denoted as ssm = create_machine(Ω, Λ_s, Λ_i, Λ_o) where Λ_s, Λ_i, and Λ_o are the state, input, and output quantizers, respectively, and ssm is the resulting SSM.

When an SSM is created with create_machine, the SDTDS from which Ω was sampled will be referred to as the underlying system of the SSM. Next, the exact calculations of the state and output of the SSM will be described. The SSM processes input such that its distributions over Q and Y correspond to the degree of belief in the occupied state and output enumeration of the underlying system.

2.2.4 Parsing an Input Sequence Using an SSM. Unlike a standard discrete Mealy machine, where exactly one state is occupied at a time (Hopcroft & Ullman, 1979), the complete description of the state occupied by an SSM is the substochastic distribution over zero, one, or more SEs. Likewise, the transitions generate substochastic distributions of output symbols rather than individual symbols. The exact calculations of distributions are as follows. Let q̄(t) = (q̄_1(t), q̄_2(t), . . . , q̄_n(t)) be a substochastic vector denoting the distribution over Q at time t and x_k(t) ∈ X be the input symbol fed to the machine in
that time step. The resulting distribution vector over Q, q̄(t + 1), is calculated by⁴

$$\bar{q}(t+1) = P_q(\bar{q}(t), x_k(t)), \tag{2.4}$$

where each element q̄_j(t + 1) of q̄(t + 1) (corresponding to a probability of an SE) is calculated by

$$\bar{q}_j(t+1) = \sum_{i=1}^{|Q|} \bar{q}_i(t) \cdot \sum_{l=1}^{|Y|} p\big(q_j(t+1), y_l \mid q_i(t), x_k(t)\big), \tag{2.5}$$

and, concurrently, the distribution of output symbols ȳ(t + 1) over Y is generated in the transition by

$$\bar{y}(t+1) = P_y(\bar{q}(t), x_k(t)), \tag{2.6}$$

where each element ȳ_l(t + 1) of ȳ(t + 1) (corresponding to a probability of an output symbol) is calculated by

$$\bar{y}_l(t+1) = \sum_{i=1}^{|Q|} \bar{q}_i(t) \cdot \sum_{j=1}^{|Q|} p\big(q_j, y_l(t+1) \mid q_i(t), x_k(t)\big). \tag{2.7}$$
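The parsing step of equations 2.4 to 2.7 can be sketched as follows, assuming the conditional probabilities are stored in a dictionary keyed by (q_j, y_l, q_i, x_k) with zero-based indices (a representation of our choosing, not the article's):

```python
def ssm_step(cond, n_states, n_outputs, q, xk):
    """Compute the next SE distribution (eqs. 2.4-2.5) and the output
    symbol distribution (eqs. 2.6-2.7) from a (sub)stochastic SE vector
    q and an input symbol xk. If probability mass sits on an SE whose
    transition over xk is dead, the results sum to less than one."""
    q_next = [0.0] * n_states
    y_next = [0.0] * n_outputs
    for qi, mass in enumerate(q):
        if mass == 0.0:
            continue
        for qj in range(n_states):
            for yl in range(n_outputs):
                p = cond.get((qj, yl, qi, xk), 0.0)
                q_next[qj] += mass * p   # eq. 2.5
                y_next[yl] += mass * p   # eq. 2.7
    return q_next, y_next

# A tiny machine: from SE 0 on input 0, go to SE 1 and emit output 0.
q_next, y_next = ssm_step({(1, 0, 0, 0): 1.0}, 2, 2, [1.0, 0.0], 0)
```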
Note that if the transition from q_i(t) over x_k(t) is dead and q̄_i(t) > 0, then the respective sums of the probabilities of the distributions q̄(t + 1) and ȳ(t + 1) will be less than 1. In such cases, the distributions of the machine will become substochastic (cf. appendix A). Another possibility of parsing is, when possible, to divide the probabilities by their sum after each symbol. This mode of parsing will be referred to as normalized parsing,

$$\hat{P}_{*}(\bar{q}(t), x_k(t)) = \mathrm{normalize}\big(P_{*}(\bar{q}(t), x_k(t))\big), \tag{2.8}$$

where the '∗' is either q or y (normalize is defined in appendix A). One may argue that instead of the notion of substochastic probabilities and state "leaking" from the machine, it would be better to add a state element q_dead to which all dead transitions are then made (producing an additional "dead" output symbol, y_dead). This would work, and it would also, as far as we can judge, create a machine equivalent to the machines

4 Note that this is a case where the notational choice of letting p(q_i) = q̄_i comes into play (cf. section 2.2.1); it is implicit that q̄_i(t + 1) and p(q_i(t + 1)) refer to the probability p(Q = q_i) at time t + 1.
of Paz (1971). It would, however, wreck the otherwise complete semantic connection between the underlying system and the SSM, since there would be no corresponding elements in S and Y of the SDTDS for q_dead and y_dead. To illustrate the parsing of symbol sequences with SSMs, some examples will be explored in section 2.2.7. But first some important types of SEs must be introduced. There are a number of properties of SSMs and SEs that can be used for a deeper analysis of the machines. In this article, only the ones that are crucial for CrySSMEx will be mentioned: deterministic and equivalent SSM SEs.

2.2.5 SSM Determinism. An SSM will always be deterministic in the sense that the state element and output symbol distributions are always deterministically calculated. Therefore, the determinism of an SE is instead defined such that it reflects the degree to which the SSM determines the succeeding occupied state enumerations and output symbols of the underlying dynamic system. For this purpose, entropy, and especially conditional entropy (Cover & Thomas, 1990), are suitable (see definition 23 in appendix A). A conditional entropy H(Y|X = x) can be interpreted as the remaining uncertainty of variable Y given that variable X is known to have the value x. Here, the conditional SSM-based entropy of the output given an SE q_i and input x_k in an SSM ssm will be denoted H_ssm(Y|Q = q_i, X = x_k) and is defined by

$$H_{ssm}(Y \mid Q = q_i, X = x_k) = H\big(P_y(\bar{q}, x_k)\big), \tag{2.9}$$

where q̄ is here the degenerate (see appendix A) SE distribution vector with q̄_i = 1.0. The conditional entropy of the SE given the previous SE and input symbol is likewise denoted H_ssm(Q|Q = q_i, X = x_k) and is here defined by

$$H_{ssm}(Q \mid Q = q_i, X = x_k) = H\big(P_q(\bar{q}, x_k)\big), \tag{2.10}$$

with q̄ degenerate as in equation 2.9. The interpretation of the entropies in equations 2.9 and 2.10 is that, given that the distribution over Q is concentrated on only q_i and the input then is x_k, they return the degree of uncertainty of the SSM regarding the succeeding output symbol and occupied state enumeration of the underlying SDTDS, respectively. This is an idealized interpretation due to the substochastic nature of the model: the conditional entropy will also be zero when the SSM has another type of uncertainty, namely when the transition from q_i over x_k is dead.

Definition 9. An SE q_i ∈ Q of an SSM ssm is deterministic iff H_ssm(Y|Q = q_i, X = x_k) = 0 and H_ssm(Q|Q = q_i, X = x_k) = 0 for all x_k ∈ X.
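As an illustration, equations 2.9 and 2.10 can be computed directly from the conditional probabilities. The dictionary representation and zero-based indices below are our assumptions, and the example machine is SSM A of example 1 in section 2.2.7:

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a (sub)stochastic vector."""
    return -sum(p * math.log2(p) for p in dist if p > 0.0)

def h_ssm(cond, n_states, n_outputs, qi, xk, over):
    """Conditional SSM entropy of the next output symbol (over='Y',
    eq. 2.9) or the next SE (over='Q', eq. 2.10), given the degenerate
    SE distribution on qi and input xk."""
    if over == "Y":
        dist = [sum(cond.get((qj, yl, qi, xk), 0.0) for qj in range(n_states))
                for yl in range(n_outputs)]
    else:
        dist = [sum(cond.get((qj, yl, qi, xk), 0.0) for yl in range(n_outputs))
                for qj in range(n_states)]
    return entropy(dist)

# SSM A of example 1 (states q1=0, q2=1; inputs a=0, b=1; outputs c=0, d=1):
cond = {(1, 0, 0, 0): 1.0, (1, 0, 0, 1): 0.1, (0, 0, 0, 1): 0.9,
        (0, 0, 1, 0): 0.8, (0, 1, 1, 0): 0.2, (1, 1, 1, 1): 1.0}
h_ssm(cond, 2, 2, 0, 0, "Y")  # 0.0: from q1 on a, the output is certain
h_ssm(cond, 2, 2, 0, 1, "Q")  # ~0.469 bits: from q1 on b, the next SE is uncertain
```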
A deterministic SE has exactly zero or one outgoing transition for each input symbol.

Definition 10. An SSM is deterministic iff all SEs q_i ∈ Q are deterministic.
If a machine is deterministic and its initial SE distribution is degenerate, then all subsequent SE and output distributions will both be either degenerate or exhausted (cf. appendix A). This definition of a deterministic machine differs somewhat from that of traditional deterministic finite automata (Hopcroft & Ullman, 1979), in which states (corresponding to the state elements of the SSM) must have transitions to exactly one state for all input symbols. It is quite straightforward to see that a deterministic SSM, in which there are no dead transitions, is equivalent to the nonstochastic standard Mealy machines as defined in Hopcroft & Ullman (1979), if a degenerate distribution over Q is defined as the initial state. Such a machine must always occupy only one SE at a time and generate a single output symbol at a time. SSM determinism will be used as a termination criterion in CrySSMEx (see algorithm 3). The conditional entropies will also be used as a basis for selection of the most informative state vectors of in order to perform optimization of the SDTDS state quantizer (see algorithm 2). 2.2.6 Equivalence and Nonequivalence of SEs. The second important property of SEs is equivalence. In automata theory, two states q i and q j of a machine are equivalent if, and only if, the output of the automata would be the same for all possible future input sequences independent of which of the two possible states that are occupied initially. This can be tested quite efficiently in traditional nonstochastic automata (Hopcroft & Ullman, 1979), but for stochastic machines, it is a bit more difficult. In fact, it is even impossible in general for substochastic machines since the model would not “know” what the outcome of dead transition would be in the underlying system. 
It would, for example, be impossible to determine what other state elements an SE with no outgoing transition is or is not equivalent to since the outcome of any possible future input sequence is undefined in the model. The only way to determine the equivalence of such an SE to other SEs is to go back to the underlying SDTDS to record the missing transitions and thereby make it part of the SSM model. Since this would break the closed-world assumption, it is not considered. It is, however, possible to determine that two SEs are not equivalent if they, in their outgoing transitions, share some input symbols and transitions over these that lead to discrepancies in the future output of the SSM. So what we will do is provide an algorithm that returns true if and only if
two SEs are not decisively inequivalent (NDI).5 For example, an SE that has no outgoing transitions will be NDI equivalent with all other SEs since there will be no decisive evidence of the opposite. Two SEs with no input symbols in common in their outgoing transitions will also always be NDI equivalent. To determine the NDI equivalence of SEs q_i and q_j, the recursive function NDI_equivalent(ssm, ū, v̄, ∅) (described in algorithm 1) is called, where ū and v̄ are the corresponding degenerate SE distributions for q_i and q_j, respectively (the need for the empty set is clarified in algorithm 1), and the result is true or false depending on whether the SE distributions ū and v̄ are NDI equivalent or not. The algorithm is highly recursive and uses a "trick" based on the support sets (see appendix A) of the SE distributions to avoid infinite recursions that could otherwise occur. If we allow ourselves to jump ahead to a later example, consider the testing of equivalence between SEs q_5 and q_6 in Figure 2. When starting in SE q_5 and feeding the SSM symbol b, the SE distribution gradually approaches a pure SE q_6, but it will never quite reach it. The algorithm would not stop if it were not for using reencounters of SE support sets as a termination criterion. Since there is a finite set of possible support sets (2^Q), this guarantees that the algorithm will terminate. What is lacking at this moment is a formal proof that NDI_equivalent works as intended for all possible SSMs under all possible conditions. A formal proof of a method for equivalence testing of states in stochastic sequential machines, however, does exist (Paz, 1971). In that proof, strong similarities with this algorithm occur, but a formal one-to-one connection is yet to be made. For now, the experiments of section 5 will be the only indication that the algorithm as a whole works for the presented cases.
In addition to these experiments, the algorithm has been successfully tested in a number of hand-made SSMs, with properties making them interesting to analyze with respect to SE equivalence (e.g., the SSM of Figure 2). For three SEs q i , q j , and q k of an SSM, it may very well hold that q i and q j are NDI equivalent and likewise for q j and q k , while q i and q k are not. In other words, the relation is not transitive. It is required, by other parts of CrySSMEx (see algorithm 3), that states be grouped into disjoint equivalence sets, which is not possible if the equivalence relation is not transitive (and symmetric and reflexive as well). Definition 11. Let π(q i ) denote the set of SEs that q i is NDI equivalent with. Two SEs q i and q j are defined as universally NDI equivalent (UNDI equivalence) if π(q i ) = π(q j ). 5 It is also possible to test if SEs are decisively equivalent as well, that is, when all subsequent SEs have the same symbols for outgoing transitions. But preliminary studies have shown that more interesting results are achieved using NDI equivalence simply because dead transitions are quite common.
NDI_equivalent(ssm, ū, v̄, H)
Input: an SSM ssm, SE distributions ū and v̄, and a history of state support sets H.
Output: returns true if ū and v̄ are not decisively inequivalent given possible future input sequences.
begin
1   if ∃x_k ∈ X : (P̂_y(ū, x_k) ≠ P̂_y(v̄, x_k) ∧ sup(P_q(ū, x_k)) ≠ ∅ ∧ sup(P_q(v̄, x_k)) ≠ ∅) then
        return false;
        /* i.e., the output must be the same for both SE distributions for all possible input symbols. */
2   else if ū = v̄ then
        return true;
        /* i.e., if the distributions are identical, they are equivalent. */
3   else if ⟨sup(ū), sup(v̄)⟩ ∈ H then
        return true;
        /* i.e., a loop has been encountered. Eventual inequivalence will be encountered in another branch of the recursion tree. */
4   else
        /* If the equivalence/inequivalence cannot be asserted, then subsequent inputs must be tested. */
        R := true; k := 1;
        while R = true ∧ x_k ∈ X do
            /* As long as no inequivalence has been shown . . . */
            ū′ := P̂_q(ū, x_k);
            v̄′ := P̂_q(v̄, x_k);
5           if sup(ū′) ≠ ∅ ∧ sup(v̄′) ≠ ∅ then
                /* . . . continue testing recursively. */
                H′ := H ∪ {⟨sup(ū), sup(v̄)⟩};
                R := NDI_equivalent(ssm, ū′, v̄′, H′);
            end
            k := k + 1;
        end
        return R;
    end
end

Algorithm 1: The recursive function NDI_equivalent(ssm, ū, v̄, ∅) returns true if and only if there is no evidence that the future ssm output could differ depending on which of the SE distributions ū or v̄ is occupied in the SSM. The if-statement on line 2 can be logically omitted, since line 3 will catch the equivalence in subsequent levels of recursion (line 4), but it makes the algorithm considerably more efficient in most realistic cases. The empty support set tests on lines 1 and 5, together with the normalized parsing (P̂_q and P̂_y), cause the algorithm to return true when assessment of inequivalence cannot be performed.
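A compact Python rendering of algorithm 1 is sketched below, under the same assumed dictionary representation of the conditional probabilities as in the earlier sketches (zero-based indices). The float comparisons and the encoding of the history H as a set of support-set pairs are pragmatic simplifications of ours, not part of the original algorithm:

```python
def parse(cond, nQ, nY, q, xk):
    """P_q and P_y in one sweep: next SE and output distributions."""
    qn, yn = [0.0] * nQ, [0.0] * nY
    for qi in range(nQ):
        for qj in range(nQ):
            for yl in range(nY):
                p = q[qi] * cond.get((qj, yl, qi, xk), 0.0)
                qn[qj] += p
                yn[yl] += p
    return qn, yn

def normalize(v):
    s = sum(v)
    return [x / s for x in v] if s > 0 else list(v)

def sup(v):
    """Support set of a (sub)stochastic vector."""
    return frozenset(i for i, x in enumerate(v) if x > 0)

def ndi_equivalent(cond, nQ, nY, nX, u, v, hist=frozenset()):
    """Sketch of algorithm 1: True iff there is no decisive evidence
    that u and v lead to different future outputs."""
    for xk in range(nX):                       # line 1
        uq, uy = parse(cond, nQ, nY, u, xk)
        vq, vy = parse(cond, nQ, nY, v, xk)
        if sup(uq) and sup(vq) and normalize(uy) != normalize(vy):
            return False
    if u == v:                                 # line 2
        return True
    if (sup(u), sup(v)) in hist:               # line 3: loop encountered
        return True
    hist = hist | {(sup(u), sup(v))}           # extend the history
    for xk in range(nX):                       # lines 4-5
        uq, _ = parse(cond, nQ, nY, u, xk)
        vq, _ = parse(cond, nQ, nY, v, xk)
        if sup(uq) and sup(vq):
            if not ndi_equivalent(cond, nQ, nY, nX,
                                  normalize(uq), normalize(vq), hist):
                return False
    return True

# SSM A of example 1 below (q1=0, q2=1; a=0, b=1; c=0, d=1):
cond_A = {(1, 0, 0, 0): 1.0, (1, 0, 0, 1): 0.1, (0, 0, 0, 1): 0.9,
          (0, 0, 1, 0): 0.8, (0, 1, 1, 0): 0.2, (1, 1, 1, 1): 1.0}
ndi_equivalent(cond_A, 2, 2, 2, [1.0, 0.0], [0.0, 1.0])  # False: outputs over a differ
```

Termination follows the same argument as in the text: along any recursion path, each level adds a new support-set pair to the history, and there are only finitely many such pairs.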
Figure 1: The two SSMs of examples 1 (A) and 2 (B) (with q_i denoted by i). A transition label x:y:p is to be read as a transition with x as input, y as output, and p as the probability of this transition. For example, the transition label "b:c:0.1" from q_1 to q_2 corresponds to the conditional probability p(q_2, c|q_1, b) = 0.1. If p is 1.0, the probability is omitted from the label.
UNDI equivalence is a transitive relation (symmetry and reflexivity are inherited from the NDI equivalence) and can therefore be used to define nonoverlapping equivalence sets. There is, however, more than one way of translating the NDI equivalence into a transitive relation, and this issue is brought up again in section 6.

Definition 12. A set of UNDI equivalence sets, E, consists of disjoint sets of SEs, e ∈ E where e ⊆ Q (with all e's together covering all SEs in the SSM), such that for all q_i, q_j ∈ e, q_i and q_j are UNDI equivalent.

In the pseudocode notation, the function call E = generate_UNDI_equivalence_sets(ssm) will be used to denote the generation of a set of UNDI equivalence sets E from an SSM ssm. A machine with equivalent SEs can be collapsed to a smaller machine by collapsing all equivalence sets into new, individual SEs. This collapsing, or merging, will, however, be part of another subalgorithm (the merge_cvq function of definition 17).

2.2.7 SSM Examples and Interpretations

Example 1. Consider an SSM with Q = {q_1, q_2}, X = {a, b}, Y = {c, d} and transition probabilities P = {p(q_2, c|q_1, a) = 1.0, p(q_2, c|q_1, b) = 0.1, p(q_1,
2226
H. Jacobsson
c|q 1 , b) = 0.9, p(q 1 , c|q 2 , a) = 0.8, p(q 1 , d|q 2 , a) = 0.2, p(q 2 , d|q 2 , b) = 1.0} (SSM A in Figure 1). All zero probabilities are left out from description, for example, that p(q 1 , a|q 1 , a) = 0.0. If we let the initial SE vector be q (0) = (1.0, 0.0) (that p(Q = q 1 ) = 1.0 at time t = 0) and then parse the string aabbba with the machine, the sequence of SE and output symbol distribution vectors (where the two elements of vector y correspond to probabilities of symbol c and d, respectively) would be as follows: (a) (a) (b) (b) (b) (a)
q (1) = (0.0, 1.0), y(1) = (1.0, 0.0) q (2) = (1.0, 0.0), y(2) = (0.8, 0.2) q (3) = (0.9, 0.1), y(3) = (1.0, 0.0) q (4) = (0.81, 0.19), y(4) = (0.9, 0.1) q (5) = (0.729, 0.271), y(5) = (0.81, 0.19) q (6) = (0.271, 0.729), y(6) = (0.9458, 0.0542).
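The parse above can be reproduced with a short script; the dict encoding of the transition probabilities, keyed by (next SE, output, current SE, input), is an assumption of this sketch, not the letter's notation:

```python
# SSM A of example 1: p(qj, y | qi, x) as a dict.
P = {(2, 'c', 1, 'a'): 1.0, (2, 'c', 1, 'b'): 0.1, (1, 'c', 1, 'b'): 0.9,
     (1, 'c', 2, 'a'): 0.8, (1, 'd', 2, 'a'): 0.2, (2, 'd', 2, 'b'): 1.0}

def parse_step(q, x):
    """One (unnormalized) parsing step: returns the next SE distribution
    and the output symbol distribution."""
    q_next, y = {1: 0.0, 2: 0.0}, {'c': 0.0, 'd': 0.0}
    for (qj, yo, qi, xi), p in P.items():
        if xi == x:
            q_next[qj] += q[qi] * p
            y[yo] += q[qi] * p
    return q_next, y

q = {1: 1.0, 2: 0.0}  # q(0): the machine starts in q1
for x in 'aabbba':
    q, y = parse_step(q, x)
print(q)  # ≈ {1: 0.271, 2: 0.729}
print(y)  # ≈ {'c': 0.9458, 'd': 0.0542}
```

Since SSM A has no dead transitions, no normalization is needed here; both distributions keep summing to one at every step.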
Note that since the SSM of example 1 has no dead transitions, the SE and output probabilities each always sum to one. In the next example, an SSM that has some dead transitions is shown.

Example 2. Consider an SSM with Q = {q1, q2}, X = {a, b}, Y = {c, d} and transition probabilities P = {p(q2, c|q1, a) = 1.0, p(q1, c|q1, b) = 0.9, p(q2, c|q1, b) = 0.1, p(q1, d|q2, a) = 1.0} (SSM B in Figure 1). Note that the machine has dead transitions, since q2 has no outgoing transition over symbol b. SE q2 is also an example of a deterministic SE. If q(0) = (0.0, 1.0) and the SSM is fed symbol b as input, the probabilities of all SEs and outputs would therefore immediately reach zero. In other words, the possibility of being in SE q2 is eliminated by the symbol b, and as a consequence of the SSM’s “observing” b, the probability of this impossibility vanishes from the machine. If we instead let q(0) = (1.0, 0.0) and then parse a sequence of t bs, the sum of the SE probabilities would be 0.9^(t−1) for t ≥ 1.

As the examples illustrate, the SSM acts as an observer of inputs, from which it derives a modeled degree of belief of what the actual enumeration of the state and output of the underlying system would be given the same input sequence. Typically, if an SSM is given a uniform initial SE distribution, the SE distribution will, for each input symbol, gradually become more and more focused on a small number of possible SEs (and output symbols). In a way, the SSM can be seen to “condense,” or “crystallize,” to a minimal hypothesis of the factual state of the underlying system. An SSM can be quite different and counterintuitive compared to the state machines typically encountered in the literature, as illustrated in the next example.

Example 3. The SSM of Figure 2 represents a more complex SSM. This machine is not fully connected and also contains a “dead SE” from
Figure 2: A more complex SSM example where UNDI equivalence sets have been grouped together. See the discussion in Example 3.
which there are no transitions (q10). This is a perfectly correct form of SSM, and, if provided with an initial SE distribution, this machine can process input sequences just like the SSMs of the previous examples.

In this machine, many properties of the UNDI equivalence become clear. The set of equivalence sets returned by generate_UNDI_equivalence_sets is {{q1}, {q2, q3, q4}, {q5, q6, q7}, {q8}, {q9}, {q10}}. State element q10 will, since it has no outgoing transition, be NDI equivalent with all other SEs. However, since it is the only element with this property, it is not UNDI equivalent with anything but itself. q8, on the other hand, is the only one that is NDI equivalent with only itself. One can easily see that the SEs of the equivalence sets {q2, q3, q4} and {q5, q6, q7} each share their output symbols. q3 is special, since it has no outgoing transition over symbol b, whereas q2 and q4 do. q3 is, however, NDI equivalent with SEs q2 and q4, since it cannot be decided that symbol b should result in any different output given any of these three SEs. Then, for the same reason, why is q9 not UNDI equivalent to q5, q6, and q7, although it too constantly gives c as output? The reason is that SE q9 is also NDI equivalent with q1, which none of q5, q6, and q7 are; therefore, it is not UNDI equivalent with them the way q3 is with q2 and q4. If one added a q11, NDI equivalent with q3 but not with q2 and q4, then this situation would change (even though q11 may seem completely unrelated to q3). Another thing to notice is that the transition from the equivalence set {q2, q3, q4}, from q2 to q10, makes no difference for the assessment of the equivalence of q2 with q3 and q4, since the transition is to a dead SE from which no decisive inequivalences can be derived.

Now the format of the extracted rules of CrySSMEx has been described.
The next step is to define a vector quantization function that will later be orchestrated to work in conjunction with these rules.
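Before moving on, the substochastic mass decay of example 2 (the sum of SE probabilities being 0.9^(t−1) after t bs) can be checked numerically. As in the text, SSM B simply has no transition from q2 over b; the dict encoding of the transition probabilities is an assumption of this sketch:

```python
# SSM B of example 2, encoded as p(qj, y | qi, x) in a dict keyed by
# (next SE, output, current SE, input). Note there is no entry for
# input b from q2: that transition is dead.
P = {(2, 'c', 1, 'a'): 1.0, (1, 'c', 1, 'b'): 0.9,
     (2, 'c', 1, 'b'): 0.1, (1, 'd', 2, 'a'): 1.0}

def step(q, x):
    """One unnormalized parsing step over input x."""
    q_next = {1: 0.0, 2: 0.0}
    for (qj, yo, qi, xi), p in P.items():
        if xi == x:
            q_next[qj] += q[qi] * p
    return q_next

q = {1: 1.0, 2: 0.0}
for t in range(1, 6):
    q = step(q, 'b')
    print(t, sum(q.values()))  # total mass ≈ 0.9**(t-1)
```

Every b leaks the mass that had accumulated in q2 out of the machine, which is exactly the “observed impossibility” the example describes.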
3 The Crystalline Vector Quantizer

The observation and quantization of the state space of the underlying SDTDS is perhaps the most significant constituent of RNN-RE algorithms. In previous work, quite traditional clustering algorithms have been used to partition the state space of the RNN (Jacobsson, 2005), such as self-organizing maps and k-means clustering. The problem with these clustering algorithms is that they partition the state space solely according to spatial properties, for example, so that data points have low intracluster distances and high intercluster distances (Everitt, Landau, & Leese, 2001). In the case of RNNs and other dynamic systems, however, the spatial requirements should give way to functional requirements. The spatial (e.g., Euclidean) proximity of two states of the SDTDS is of less importance for deciding whether they belong to the same cluster than the invariance of the apparent behavior of the SDTDS with respect to these states. Similar problems exist when clustering internal activations of feedforward networks (Sharkey & Jackson, 1995). This means, among other things, that the quantizer may need to have varying granularity in different regions of the state space. A partitioning that is guided only by the dynamics of the SDTDS is, however, an idealization (Casey, 1996; Blair & Pollack, 1997; Jacobsson & Ziemke, 2003b). Instead we will have to be content with partitions that are equivalent for a specific and finite set of input sequences (the finite Ω of definition 3).

To satisfy the functional requirements, we need a quantizer that allows generating a division of the state space based on spatial properties but also splitting and merging regions into new regions when the functional requirements are not satisfied (details of when exactly it is appropriate to split or merge are covered in section 4). For this purpose, a novel quantizer is suggested here: a crystalline vector quantizer (CVQ).
The CVQ has some resemblance to the hierarchical decision tree representation extracted from feedforward networks by Craven and Shavlik (1996) but differs in the details. The CVQ is built on a graph, which will now be defined. How the information of this graph is used to quantize a vector space is then described in section 3.2, and CVQ training in section 3.3.

3.1 Definition of CVQ Graph

Definition 13. A CVQ graph is a quadruple CVQ = ⟨NLeaf, NVQ, NMerged, nroot⟩, where nroot ∈ NLeaf ∪ NVQ is the root node of the CVQ graph and where the constituents are defined as in definitions 14 to 16.

The CVQ graph is a directed graph and could thus be described as a set of vertices and edges, but for notational reasons, it is easier to omit the edges from the description and, instead of edges, let nodes have explicit references
[Figure 3 graphic: the root VQ node n0 has model vectors M = [m1, m2, m3] and children n1 (a merged node), n2 (a leaf node, ID = 1), and n3 (a VQ node with model vectors M = [m1, m2] and children n4, a merged node, and n5, a leaf node, ID = 2); the merged nodes n1 and n4 both link (ngroup) to the leaf node n6, ID = 3.]
Figure 3: Example of a CVQ with NLeaf = {n2 , n5 , n6 }, NMerged = {n1 , n4 }, NVQ = {n0 , n3 } and nroot = n0 .
to other nodes. The first node type, however, has no explicit references to any other nodes.

Definition 14. A leaf node in a CVQ graph, n ∈ NLeaf, has only one constituent, n = ⟨ID⟩, where ID ∈ N is a unique enumeration of the leaf nodes within the CVQ and 1 ≤ ID ≤ |NLeaf|.

Definition 15. A vector quantizer (VQ) node in a CVQ graph, n ∈ NVQ, is a tuple n = ⟨M, C⟩, where M is a list of K model vectors, [m1, m2, . . . , mK], where mi ∈ Rd, and C is a (nonrepetitive) list of child nodes, [c1, c2, . . . , cK], where ci ∈ NLeaf ∪ NVQ ∪ NMerged. d ∈ N is the dimensionality of the vector space the CVQ will be used to quantize.

Definition 16. A merged node in a CVQ graph, n ∈ NMerged, contains only a “link,” n = ⟨ngroup⟩, where ngroup ∈ NLeaf ∪ NVQ ∪ NMerged.

The interpretation will be clarified in the next section, where the use of a CVQ as a quantization function is described and in which all CVQ node constituents are relevant. The example of Figure 3 is also useful for understanding the interpretation of the CVQ nodes.

The constituents of a CVQ are as defined above, but there are a number of constraints on how the CVQ graph can be constructed; for example, there
may be no cycles in the graph. These constraints are not straightforward to formalize but are quite intuitive. So instead of lengthy formal descriptions, an example will illustrate a typical CVQ topology (see Figure 3). Also, the way the CrySSMEx algorithm builds the CVQ defines the constraints in exact detail (see section 4).

3.2 Quantizing with a CVQ. When a CVQ is used as a quantizer (see definition 4), the corresponding quantization function is denoted cvq and is in turn defined by the recursive function winner : (NLeaf ∪ NMerged ∪ NVQ) × Rd → {1, 2, . . . , m}, as defined in^6

    cvq(v) = winner(nroot, v),                                    (3.1)

where winner is recursively defined as

    winner(n, v) =
        n.ID                  if n ∈ NLeaf
        winner(n.ngroup, v)   if n ∈ NMerged
        winner(n.c_w, v)      if n ∈ NVQ,                         (3.2)

where w, the index of the winning child of a VQ node, is determined according to

    w = argmin_{1 ≤ i ≤ |n.C|} ‖v − n.m_i‖,                       (3.3)
where ‖v − n.m_i‖ denotes the Euclidean distance between the vector to be quantized and the ith model vector of the VQ node. If two model vectors are at equal distance from the data vector, the smaller of the indices is returned. The division of a two-dimensional state space using the CVQ of Figure 3, with instantiated example model vectors, is shown in Figure 4.

3.3 CVQ Training. The training of the CVQ in CrySSMEx is tightly connected with operations of the SSM and sampling of the SDTDS (as discussed in section 1.2). Here, however, the operations that will later be used to refine the CVQ are defined independent of their role in CrySSMEx (see section 4 for this context). The training consists of replacing leaf nodes with either merged nodes or VQ nodes, adding new leaf nodes, and then reenumerating the IDs appropriately. Replacement of a leaf node with a VQ node results in a larger number of leaf nodes and is referred to as CVQ splitting. Replacement of
^6 An object-orientation-like notation will be adopted here, where X.Y means “the Y of X.”
Figure 4: How a two-dimensional space would be quantized if the example CVQ in Figure 3 had model vectors n0 . M = [(0.25, 0.75), (0.75, 0.75), (0.5, 0.25)] and n3 .M = [(0.30, 0.20), (0.55, 0.35)].
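The quantization of equations 3.1 to 3.3, applied to the CVQ of Figure 3 with the model vectors of the caption above, can be sketched in Python. The node classes are minimal stand-ins for definitions 14 to 16, and the association of model vectors with children follows the drawing order of Figure 3, which is an assumption of this sketch:

```python
import math
from dataclasses import dataclass

@dataclass
class Leaf:          # definition 14
    ID: int

@dataclass
class Merged:        # definition 16: a link to another node
    ngroup: object

@dataclass
class VQ:            # definition 15: model vectors and matching children
    M: list
    C: list

def winner(n, v):
    """Equation 3.2; cvq(v) = winner(nroot, v) as in equation 3.1."""
    if isinstance(n, Leaf):
        return n.ID
    if isinstance(n, Merged):
        return winner(n.ngroup, v)
    # Equation 3.3: recurse into the child of the nearest model vector.
    w = min(range(len(n.C)), key=lambda i: math.dist(v, n.M[i]))
    return winner(n.C[w], v)

# The CVQ of Figure 3 with the model vectors of Figure 4.
n6, n5, n2 = Leaf(3), Leaf(2), Leaf(1)
n1, n4 = Merged(n6), Merged(n6)
n3 = VQ(M=[(0.30, 0.20), (0.55, 0.35)], C=[n4, n5])
n0 = VQ(M=[(0.25, 0.75), (0.75, 0.75), (0.50, 0.25)], C=[n1, n2, n3])

print(winner(n0, (0.8, 0.8)))   # 1
print(winner(n0, (0.2, 0.8)))   # 3 (via the merged node n1)
print(winner(n0, (0.5, 0.3)))   # 2 (via the nested VQ node n3)
```

math.dist gives the Euclidean distance of equation 3.3, and taking min over indices resolves ties toward the smaller index, matching the tie rule stated in the text.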
several leaf nodes with merged nodes results in a smaller number of leaf nodes and is referred to as CVQ merging. After completion of each of these operations, leaf nodes will be reenumerated. First, merging will be described, then basic splitting, and then an operation called complete splitting. “The user” mentioned in the following descriptions will be another part of the CrySSMEx algorithm, but it should be possible to use the CVQ in other contexts.

3.3.1 The Initial CVQ. The initial CVQ, denoted cvq0, is the simplest possible CVQ, consisting of only one leaf node (nroot) with ID = 1. All vectors will thereby be quantized as nroot.ID = 1 by the initial CVQ.

3.3.2 Merging. Merging of nodes in a CVQ corresponds to merging regions in the quantized space. This is conveniently described with an example:

Example 4. In the example of Figure 3, nodes n1 and n4 have been merged into n6. Before this merge, n1 and n4 were two separate leaf nodes, but then the “user” discovered that the corresponding regions should not be separated for some reason. The merge was then conducted by creating a new leaf node, n6, and then replacing the leaf nodes n1 and n4 with merged nodes connected to n6.

In principle, any number of leaf nodes can be merged simultaneously. The merge is an operation on the CVQ graph, not necessarily related to
any spatial properties of the quantized space; disconnected regions can be merged. The decision of which leaf nodes to merge is also entirely independent of their position in the CVQ graph.

Definition 17. The merging of one or more groups of leaf nodes will be denoted cvq′ := merge_cvq(cvq, E), where E is a set of disjoint sets of IDs covering all leaves of the CVQ. The result, cvq′, is the CVQ where leaves have been merged into one new leaf node per set in E (trivial sets in E, with only one member, are simply ignored). The leaf nodes are also reenumerated before the resulting CVQ is returned.

E will later (in algorithm 3) be connected to the set of equivalence sets generated from SSMs by the function generate_UNDI_equivalence_sets (described in definition 12).

Example 5. If cvq′ = merge_cvq(cvq, {{1, 3, 5}, {2, 4}, {6}}) is called, it will replace the leaf nodes with IDs 1, 3, and 5 with merged nodes connected to a new leaf node, and correspondingly for 2 and 4. The leaf node with ID = 6 will be left unaltered. The CVQ-based quantization function cvq′ will then quantize vectors into the range [1, 3], whereas cvq quantized into the range [1, 6].

3.3.3 Basic Splitting. When a CVQ leaf node is split, this corresponds to splitting the corresponding region enumerated by that leaf. It is easiest to describe this mechanism with an example:

Example 6. A set of two-dimensional data vectors V is quantized as in Figure 4. Now the user has discovered that there are actually two types of vectors quantized as 1 (let us call the set of these vectors V1). The user wants the two classes to be correctly separated by the CVQ. To do this, the user collects all data vectors V1 = {vi : cvq(vi) = 1} and separates this set into two sets, V1+ and V1−, corresponding to the two classes. The node with ID = 1 (i.e., n2 in Figure 3) would then be replaced with a VQ node with two model vectors and two new leaf nodes as children.
The model vectors of the VQ node are then set to be the average of the vectors in sets V1+ and V1− , respectively. Above, a leaf was split into only two new leaves, but in general, a leaf’s corresponding region in the quantized space can be split into any number of regions.
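A minimal sketch of the basic split, assuming tuple vectors and per-class means as model vectors; the node classes and names are mine, and the ID reenumeration is left to the caller:

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    ID: int

@dataclass
class VQ:
    M: list   # model vectors
    C: list   # child nodes, one per model vector

def basic_split(parent, i, labeled):
    """Replace the leaf parent.C[i] by a VQ node whose model vectors are
    the per-class means of `labeled`, a list of (vector, label) pairs."""
    by_class = {}
    for v, lbl in labeled:
        by_class.setdefault(lbl, []).append(v)
    means = [tuple(sum(xs) / len(vs) for xs in zip(*vs))
             for vs in by_class.values()]
    parent.C[i] = VQ(M=means, C=[Leaf(ID=0) for _ in means])
    return parent.C[i]

# Hypothetical data for the leaf with ID = 1: two classes, '+' and '-'.
root = VQ(M=[(0.0, 0.0)], C=[Leaf(ID=1)])
V1 = [((0.0, 0.0), '+'), ((1.0, 1.0), '+'), ((2.0, 2.0), '-')]
node = basic_split(root, 0, V1)
print(node.M)  # [(0.5, 0.5), (2.0, 2.0)]
```

One new model vector (and one new leaf) is created per class found in the data, so a leaf can be split into any number of regions, as noted above.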
3.3.4 Complete Splitting. It is possible that the model vectors will not perfectly separate the data vectors after a basic split, since they might be linearly inseparable or the average vectors of the data sets may not be the perfect
model vectors for separating the data.7 In such cases, CrySSMEx would typically resplit the resulting region again automatically (see section 4, where CrySSMEx is described in detail). It is, however, possible that an imperfect split may cause a nonminimal machine to be extracted and also that CrySSMEx will not terminate due to spurious SEs. Therefore a “perfect,” or complete, split mechanism has also been devised. To perform a complete split, the splitting is first conducted as in example 6. But if the enumerated data vectors still are not separated, then the new leaves will be resplit using the corresponding subsets of the data vectors until the data vector class can be uniquely inferred from the identity of the involved leaf nodes. After this, all involved leaf nodes “belonging” to the same class are merged.

Definition 18. The complete split of several VQ nodes using several data sets at once will be denoted cvq′ := split_cvq(cvq, D), where cvq is the CVQ to be split and D = [D1, D2, . . . , D|cvq|] is a list of data sets, where Di is the data set for splitting the leaf node with ID = i (if the node should not be split, then Di = ∅). The elements of a data set are pairs ⟨v, ℓ⟩, where v ∈ Rn is the data vector and ℓ ∈ N is the label, or class, of the data vector. The leaf nodes of cvq′ are also reenumerated immediately after the completion of all splits.

There is also a possibility that the averages of two or more classes are exactly the same, in which case the splitting will fail completely. This is unlikely to occur by chance, and no fixes are included in the definition of the algorithm. It has not occurred in any of the experiments (see section 5), and if it did, the implemented algorithm would abort execution and generate a warning. It is also very important that there is no pair of differently classified but identical data vectors.
This should not happen in the context of CrySSMEx due to the way data are collected from a deterministic machine, but it should be kept in mind if the CVQ is to be used in another context.

4 The Crystallizing Substochastic Sequential Machine Extractor

4.1 Data Selection from Ω. Perhaps the most important point of convergence of the various constituents described so far in this letter is where subsets of Ω are selected and classified based on properties of the extracted
7 In fact, it can be quite inefficient to use the averages as model vectors. The average vectors are chosen basically because they are simple and deterministic and the calculation is parameter free. More sophisticated methods, such as resource-allocating learning vector quantization (Everitt et al., 2001), have also been tested, but they do not make a big difference apart from longer computation times and somewhat smaller CVQ graphs.
SSM. The goal of CrySSMEx is to generate a deterministic SSM from the underlying deterministic SDTDS by dividing the state space into a minimal set of quanta that can be used to describe the SDTDS perfectly in the context of Ω. To do this, indeterministic SEs of the SSM are targeted for splitting in the corresponding CVQ using selected state vectors from Ω. The basis for the selection of state vectors is to choose the set that should convey the most information, primarily about the output of the SSM and, secondarily, about the next state element of the SSM. The entropies Hssm(Y|Q = qi, X = xk) and Hssm(Q|Q = qi, X = xk) (definitions 9 and 10, respectively) are used for this selection. This basis for selection is not the only one possible, however, and this will be brought up again in section 6. The entire selection procedure is contained in the function collect_split_data, described in algorithm 2.

4.2 CrySSMEx Main Loop. The ingredients for CrySSMEx have now been presented:
- The SDTDS, which represents the class of specimens for CrySSMEx to analyze (see definition 1)
- The data set Ω, that is, the SDTDS transition event set (see definition 3)
- SSMs, which can be viewed as a subtype of SDTDSs (see definition 6)
- SDTDS translation into SSM through quantization of input, output, and state (see definition 8)
- Generation of UNDI-equivalence sets of SEs in SSMs (see definition 12)
- The CVQ (see definitions 13 to 16), used as a quantizer of vectors through the function cvq (see equation 3.1)
- Merging and splitting of CVQ leaf nodes (see definitions 17 and 18)
- A mechanism for selecting and labeling state vectors of Ω based on properties of the SSM (see algorithm 2)
These constituents are integrated into the CrySSMEx algorithm as described in algorithm 3. The principle behind the algorithm is that the SSM should be kept as small as possible through the merging of SEs that are UNDI equivalent while at the same time splitting indeterministic SEs. Before an SE is deemed indeterministic, it is important to decide that it is not so merely because it, over one input symbol, transits to two or more SEs that are actually equivalent (or at least UNDI equivalent). If the machine were not minimized through the merging of equivalent state elements, it would risk an explosion of SEs due to unjustified splits.

If the algorithm does not converge in due time, additional termination criteria could be added. For example, one may want to limit the number of possible iterations or put a limit on |Q|. The extracted, then possibly
collect_split_data(Ω, ssm, Λi, Λs, Λo)
Input: A transition event set, Ω, an SSM, ssm, an input quantizer, Λi, a state quantizer, Λs, and an output quantizer, Λo.
Output: A list of data sets D, one data set per q ∈ Q. The elements of each data set are pairs ⟨v, ℓ⟩, where v ∈ Rn is a data vector and ℓ ∈ N is the assigned label of the vector.
begin
    D := [∅, ∅, . . . , ∅];
    for ∀⟨s(t), ı(t), o(t + 1), s(t + 1)⟩ ∈ Ω do
        qi := Λs(s(t)); xk := Λi(ı(t)); yl := Λo(o(t + 1)); qj := Λs(s(t + 1));
        /* If qi is indeterministic, the state vector should be stored in D with an appropriate labeling. */
        if ∃xm : Hssm(Y|Q = qi, X = xm) > 0 then
            xmax := argmax_{xm ∈ X} Hssm(Y|Q = qi, X = xm);
            if xk = xmax then
                /* If the output is indeterministic with respect to qi and xk, label the state vector with the output symbol, yl. */
                Di := Di ∪ {⟨s(t), yl⟩};
            end
        else if ∃xm : Hssm(Q|Q = qi, X = xm) > 0 then
            /* If the output is uniquely determined from qi, but the next state is not, label the state vector using the next SE, qj. */
            xmax := argmax_{xm ∈ X} Hssm(Q|Q = qi, X = xm);
            if xk = xmax then
                Di := Di ∪ {⟨s(t), qj⟩};
            end
        end
    end
    return D;
end

Algorithm 2: collect_split_data selects state vectors from Ω and labels them according to either Λo or Λs such that they are suitable for use in splitting CVQ nodes. The resulting list of data sets, D, consists of one data set of labeled state vectors for each SSM SE; that is, Di corresponds to the data set for splitting state qi.
indeterministic, SSM will still be a model of the underlying SDTDS, and, moreover, the more computational resources that are dedicated to iterate CrySSMEx, the better a model the SSM will be of the SDTDS, in terms of fidelity.
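For concreteness, the output-entropy test that drives the selection in algorithm 2 can be sketched as follows; the dict encoding of the SSM transition probabilities and the normalization of substochastic mass are assumptions of this illustration (definition 9 fixes the actual form):

```python
import math

def output_entropy(P, qi, x):
    """H_ssm(Y | Q = qi, X = x): entropy (in bits) of the output symbol
    distribution reachable from SE qi on input x, after normalization."""
    y_mass = {}
    for (qj, yo, qs, xs), p in P.items():
        if qs == qi and xs == x:
            y_mass[yo] = y_mass.get(yo, 0.0) + p
    total = sum(y_mass.values())
    if total == 0.0:
        return 0.0  # dead transition: nothing to split on
    return sum(-(p / total) * math.log2(p / total)
               for p in y_mass.values())

# SSM A of example 1: from q2, input a gives output c with 0.8 and
# d with 0.2; from q1, input a gives c deterministically.
P_A = {(2, 'c', 1, 'a'): 1.0, (2, 'c', 1, 'b'): 0.1, (1, 'c', 1, 'b'): 0.9,
       (1, 'c', 2, 'a'): 0.8, (1, 'd', 2, 'a'): 0.2, (2, 'd', 2, 'b'): 1.0}
print(output_entropy(P_A, 2, 'a'))  # ≈ 0.722 bits: q2 is output-indeterministic on a
print(output_entropy(P_A, 1, 'a'))  # 0.0: output deterministic from q1 on a
```

An SE with positive output entropy for some input is a split candidate; only if all output entropies vanish does the next-SE entropy of definition 10 come into play, mirroring the two branches of algorithm 2.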
CrySSMEx(Ω, Λo)
Input: An SDTDS transition event set, Ω, and an output quantization function, Λo.
Output: A deterministic SSM mimicking the SDTDS within the domain Ω as described by Λo.
begin
    let Λi be an invertible quantizer for all ı in Ω;
    i := 0;
    ssm0 := create_machine(Ω, Λi, cvq0, Λo);
    /* ssm0 has Q = {q1} with all transitions to itself. */
    repeat
        i := i + 1;
        D := collect_split_data(Ω, ssm_{i−1}, Λi, cvq_{i−1}, Λo);
        cvq_i := split_cvq(cvq_{i−1}, D);
        ssm_i := create_machine(Ω, Λi, cvq_i, Λo);
        if ssm_i has UNDI-equivalent states then
            /* Merge SEs if possible. */
            E := generate_UNDI_equivalence_sets(ssm_i);
            cvq_i := merge_cvq(cvq_i, E);
            ssm_i := create_machine(Ω, Λi, cvq_i, Λo);
        end
    until ssm_i is deterministic;
    return ssm_i;
end

Algorithm 3: The main loop of CrySSMEx.
5 Experiments

The main purpose of the experiments in this article is to show that CrySSMEx manages to extract machines in contexts previously unsolved using RNN-RE algorithms. Another purpose is to identify remaining weaknesses of CrySSMEx by running it on challenging SDTDSs. Deeper analysis of how and why CrySSMEx behaves as it does, and of the resulting SSMs and CVQs, will have to be conducted in future work.

5.1 An Illustrative Example. Most previous work on RNN-RE algorithms has been experimentally tested on regular language domains (Jacobsson, 2005). The aim of this experiment is to demonstrate that this kind of domain is trivial and at the same time to illustrate the extraction process. It has already been proven that if an RNN is robustly performing a regular language recognition task, then this model can always be extracted (Casey, 1996). But no technique can guarantee that such an extraction is possible in practice: there is no guarantee for CrySSMEx either, of course, but it does seem to be quite a straightforward process.
Figure 5: The state space of the RNN in the badiiguuu domain. The states (∗) and transitions between states are shown, and the decision hyperplanes (Sharkey & Jackson, 1995) for each output unit are plotted (only three are visible).
Elman (1990) used a simple regular language to train a simple recurrent network (SRN). The domain consisted of three subsequences, ba, dii, and guuu, repeated in random order (e.g., babadiibaguuudiiguuu . . .). The task for the RNN was to do next-symbol prediction. In essence, only the vowels were at all predictable. An SRN with two hidden nodes was here trained on this domain, with the symbols represented as six-dimensional vectors with one node active for each symbol. To generate Ω, a string of 10^5 randomly ordered substrings was used on the trained RNN. The state space of the RNN is shown in Figure 5. Three iterations completed the extraction, and the sequence of state-space divisions, the CVQ graphs, and the SSMs are shown in Figure 6. The breadth-first technique of Giles, Miller, Chen, et al. (1992) has also been tested on this domain, and it resulted in a large number of states never visited by the RNN when predicting the actual strings (Jacobsson & Ziemke, 2003b).

Most essential features of CrySSMEx are exemplified in this extraction. In the initial SSM, ssm0, it can be seen that input symbol i generates the maximum amount of uncertainty regarding the output symbol (the output symbol C here corresponds to the nonsymbol generated by the RNN when it predicts a consonant, with no possibility of predicting the exact symbol due to the random ordering of substrings). For that reason, collect_split_data will select state vectors that the RNN occupied when it received i as input and label them according to the output label as determined by Λo. The CVQ is then split according to the selected data, resulting in cvq1. The
Figure 6: Extraction from RNN predicting in the badiiguuu domain. The statespace divisions (cf. Figure 5), CVQ graphs, and SSMs are shown for the initial model and all subsequent iterations of CrySSMEx.
same procedure is repeated again with ssm1 , and an SSM of three SEs, of which two are UNDI equivalent, is generated (not shown). This results in two merged nodes in cvq 2 . As can be seen in the state-space division, cvq 2 merges two (locally) disconnected subspaces. From both SEs of ssm2 , the output can now be deterministically predicted, but q 2 is still indeterministic since transitions from it over symbol u are ambiguously mapped to q 1
and q2. Therefore, collect_split_data selects those RNN state vectors in Ω enumerated 2 by cvq2 from which a transition induced by symbol u was made, and labels them according to the cvq2 enumeration of the subsequent state vector. After the split of cvq2, CrySSMEx terminates, since the resulting SSM is deterministic and will fully mimic the underlying RNN within Ω. Note that there are some dead transitions in ssm3, for example, for symbol g in q3. If the underlying RNN is fed a g while occupying a state in the corresponding subspace, it will certainly react in some manner, but since that event was not recorded in Ω, the resulting SSM does not model it. Also, the resulting SSM is not a model of the input source; for example, although not supported in Ω, ssm3 models the outcome of infinite sequences of symbols i and u. This is due to the fact that CrySSMEx does not build a model of the domain; it builds a model of how the underlying system interacts with its domain, without guarantees of generalization to situations outside Ω.

5.2 An RNN Trained on a Context-Free Language. Prediction of symbols in the context-free language a^n b^n is a challenging domain for RNNs that has been studied quite intensely (e.g., Wiles & Elman, 1995; Bodén & Wiles, 2000; Gers & Schmidhuber, 2001). CrySSMEx was here used to analyze 100 successfully trained^8 SRNs (of one input node, two state nodes, and one output node) on predicting the predictable symbols of randomly ordered a^n b^n strings (1 ≤ n ≤ 10). Ω was here generated by exposing the RNN to exactly 200 a^n b^n strings of each length (1 ≤ n ≤ 10) in random order. For all 100 RNNs, extraction was successful, resulting in SSMs of 11 SEs. An example of an extracted machine, together with the CVQ-quantized state space, is shown in Figure 7. The regular grid lattice quantizer of Giles, Miller, Chen, et al.
(1992) was also tested, and it typically never found any SSM with the same behavior as the RNN until the state space was divided into at least 40 × 40 grids. The number of SEs was then between 25 and 70. If the breadth-first search of that article is employed, the number of states becomes even higher (Jacobsson & Ziemke, 2003b), because then many states that would not have been visited when processing a^n b^n strings are also included.

In this domain, some problems for CrySSMEx are exposed. First, although all extracted SSMs had the same |Q| and all SSMs generated exactly the same outputs as the RNNs (within the sampled domain), actually two types of SSMs were extracted: 90 SSMs of one type and 10 of the other. This is probably due to different forms of dynamics of the underlying RNNs (Tonkes, Blair, & Wiles, 1998). The extraction also took either 9 or 10 iterations, depending on the underlying RNN. Clearly, CrySSMEx is sensitive
8 Using a genetic algorithm; see Jacobsson and Ziemke (2003a) for more details and a more detailed list of references.
Figure 7: The two first machines (ssm0 and ssm1 ) and the last (ssm10 ) in the sequence of machines extracted by CrySSMEx in the an bn domain. The state space of the RNN and its cvq 10 division of the state space are shown below the machines. Note that some distant states belong to the same region, while some nearby states are divided due to the functionally driven quantization strategy employed in CrySSMEx. Some of the disjoint regions are also merged in the CVQ (which cannot be seen in the diagram).
to internal properties of the RNN that give rise to these differences. This may be a problem, but it may also be a key to a window of analysis of the dynamics of the underlying RNN.

A more serious problem was that if Ω was too small, for example, with just 10 occurrences of each string length, CrySSMEx could get stuck in loops where an ssm_i would be exactly equal to ssm_{i−n}, where n ≥ 1. This was due to merging and splitting of SEs cancelling each other over one or more iterations. The mechanisms behind these loops are not entirely clear, and the issue requires more targeted experiments. It seems, however, to be linked with some kind of data starvation, since it has occurred only for smaller Ωs. To circumvent this problem, CrySSMEx is now implemented to abort execution by default if a loop is encountered. Another option, also implemented, is to skip the merge completely if it would result in a loop. This approach is successful in that CrySSMEx terminates with a deterministic SSM equivalent with the RNN, but unfortunately with more SEs than the 11 otherwise extracted.

A third problem sometimes occurred when Ω was generated with longer a^n b^n strings. In some cases, when the RNN generalized perfectly to longer strings, this posed no problem. In other cases, erroneous prediction of the RNN was successfully modeled in the SSM, for example, that it predicts an a prematurely if n = 11. But in other cases, the temporal dependencies of the errors are quite complex, and the SSMs seemed to grow indefinitely (without ever exceeding the size of Ω, of course). It is known that RNNs with weights in the vicinity of the correct solution have chaotic error gradients (e.g., Bodén & Wiles, 2000), and perhaps a chaotic RNN may explain the difficulty for CrySSMEx (cf. section 5.4).

5.3 A Large RNN. An SRN of one input node, one output node, and 10^3 state nodes (i.e., more than 10^6 weights) was created to test the feasibility of extracting rules from SDTDSs of high dimensionality.
The weights were initialized in the interval [−0.01, 0.01], and the network was then exposed to a sequence of 10^4 randomly ordered inputs (I = {(0), (1)}). The output quantizer used in this case gave three symbols, +, −, and 0, corresponding to whether the activation of the output node increased, decreased, or remained the same. The input symbols a and b corresponded to the binary activation of the single input unit. The continuous activation function of all nodes, 1/(1 + exp(−net)), should in principle make it impossible for the output to stabilize completely, so that symbol 0 should never be needed, but limits in machine precision made it necessary. This kind of network, with small, random weights, has been studied theoretically and proven to implement definite memory machines (Hammer & Tiňo, 2003). This is, however, the first time a large-scale network of this type has been studied using RNN-RE. The extracted machine, with |Q| = 19, is shown in Figure 8. The machine perfectly emulated the behavior of the SRN within Ω as "viewed" through Λ_o.

Figure 8: An SSM extracted from an RNN with 10^3 state nodes and random weights. To save space, numbering is omitted and repetitive transition labels are bundled.

It may be of interest to mention that the generated data took up over 230 MB of storage, yet only six iterations of the CrySSMEx main loop were needed to extract a machine of 19 states with the same apparent behavior as the significantly larger RNN. Some more preliminary experiments were carried out with the same RNN architecture but with larger random weights. For smaller weights, |Q| decreased, and for larger weights, the extraction could become impossible in that the SSM size seemed to grow indefinitely. SSMs were, of course, still extracted, but no deterministic SSMs were found within a reasonable time. A high-dimensional state space is not needed to make extraction of deterministic SSMs impossible, however, as the next experiment will demonstrate.

5.4 A Chaotic System. To do rule extraction from a chaotic system may be considered unreasonable. A chaotic system will never repeat its trajectory in state space, and infinitesimal differences between two states will grow over time (Devaney, 1992). These properties make the system impossible to describe deterministically with a finite set of states. Any attempt to group two distinct SDTDS states into the same subspace will fail, since their future trajectories will inevitably diverge if the system is chaotic. It is, however, possible to use CrySSMEx to extract indeterministic SSMs from chaotic systems. To demonstrate this, an iterated quadratic map was used:

s(t + 1) = a · s(t) · (1 − s(t)).    (5.1)
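Two properties of the map in equation 5.1 that this section relies on, sensitive dependence in the chaotic regime and periodic output-symbol patterns in the cyclic regimes, can be checked numerically. This is a sketch with my own function names and iteration counts, not code from the article:

```python
# Numerical checks on the quadratic map of equation 5.1 (a sketch; the
# helper names, seeds, and iteration counts are my own assumptions).

def quad_map(a, s):
    return a * s * (1.0 - s)

# (1) Sensitive dependence at a = 4.0: two states 1e-10 apart become
# macroscopically different within a few dozen steps.
a = 4.0
s1, s2 = 0.3, 0.3 + 1e-10
max_gap = 0.0
for t in range(60):
    s1, s2 = quad_map(a, s1), quad_map(a, s2)
    if t >= 40:                        # after the tiny gap has been amplified
        max_gap = max(max_gap, abs(s1 - s2))

# (2) The +/-/0 output quantization applied to cyclic regimes:
# a = 3.5 (period-four cycle) yields the symbol pattern ...+-+-...,
# a = 3.839 (three-cycle) yields ...++-++-...
def output_symbols(a, s0=0.3, settle=10000, n=12):
    s = s0
    for _ in range(settle):            # let the system settle on its attractor
        s = quad_map(a, s)
    syms = []
    for _ in range(n):
        s_next = quad_map(a, s)
        syms.append('+' if s_next > s else '-' if s_next < s else '0')
        s = s_next
    return ''.join(syms)

two_cycle_pattern = output_symbols(3.5)      # alternates + and -
three_cycle_pattern = output_symbols(3.839)  # period-3 pattern with two +
```

The symbol sequences match the period-two and period-three output cycles described in the text; the exact phase of each pattern depends on the initial state.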
The constant a, in the interval [0, 4], determines whether the attractor of the system is a fixed point, cyclic, or chaotic (Devaney, 1992). This system falls under the SDTDS definition, but with I = ∅ and O = ∅. A similar experiment, in the same domain, was conducted by Crutchfield and Young (1990), but their approach was quite different from CrySSMEx in that a fixed (unknown) translation from state space into a discrete set of observations was assumed. In CrySSMEx, it is precisely this translation that is the target for refinement.

Data were generated by running the system for 10^5 time steps, after an initial 10^5 unmonitored steps to let the system "settle in" on its attractor. The output symbols were, as in the previous example, +, −, and 0, corresponding to whether the state increased, decreased, or remained unchanged, respectively. This choice of output symbols is just one of many possible Λ_o, which is why it is part of the input parameters of CrySSMEx. Some readers might protest that this contradicts earlier claims in this article that CrySSMEx is parameter free. The subtle difference here is that although CrySSMEx requires the output quantization as a parameter, this quantization is for RNN applications typically defined a priori as a direct consequence of the symbolic domain of the RNN. In the above case, however, a number of output quantizations are conceivable.

The resulting machines are trivial when a is set such that the attractor is fixed or cyclic. If it is fixed, an SSM with one SE and a transition generating symbol 0 is enough for describing the dynamics. If the system is cyclic, a finite set of SEs is enough to describe the system deterministically; for example, if a = 3.5 (having a period-four cycle), two SEs are enough, since the system, as viewed through Λ_o, generates the output sequence . . . + − + − . . .. If a = 3.839, the system has a three-cycle attractor (Devaney, 1992) and generates the sequence . . . + + − + + − . . .
and consequently, the SSM had three SEs. If a is chosen so that the system is chaotic, CrySSMEx will not terminate (at least not until the finite set Ω is fully accounted for). But the extracted SSMs can nevertheless be argued to account for some of the dynamics of the system. To test the fidelity of the SSMs, the extracted machines were initialized with the Λ_s enumeration of an initial state (chosen within the attractor) of the underlying system, and both the SSM and the system were run in parallel until the SSM failed to predict the output symbol of the system.

Six quadratic map systems, with a = 3.7, a = 3.75, a = 3.8, a = 3.9, a = 3.95, and a = 4.0, respectively, were analyzed. The quality of the extracted SSMs, in terms of the average time until the SSM output deviates from the underlying system, typically increased over the iterations of CrySSMEx (see Figure 9). The number of SEs grew exponentially for all systems, and faster for higher values of a. The number of SEs in relation to the number of correctly predicted symbols reveals that the cost, in terms of SSM size, of each correctly predicted symbol also increases exponentially. It is, however,
Figure 9: Some results of CrySSMEx modeling chaotic systems, from extracted ssm_0 to ssm_20. (Top) The average number of correctly predicted symbols before the SSM failed to predict the output symbol of the system. (Bottom) The number of SEs of the extracted SSMs. When a = 4.0, the extraction was aborted at iteration 15 due to limited memory resources.
interesting to note that invested computational time clearly pays off in SSM fidelity. Other values of a were also tested, but if a = 3.85, for example, only three SEs were needed to predict the system indefinitely. So the seemingly
monotonic relation between a and the number of SEs and prediction difficulty is merely an illusion.

6 Open Issues

There are four main directions for future research: to further test, apply, understand (e.g., through mathematical proofs), and improve the algorithm. One obvious line of possible research is to test the outcome of using CrySSMEx in applications of previous RNN-RE algorithms (accounted for in Jacobsson, 2005). How to proceed with testing CrySSMEx further and applying it to other systems will not, however, be brought up here in detail. There are also many possibilities for improving CrySSMEx (e.g., through CVQ refinement, quantization of SDTDS input, and extension to indeterministic SDTDSs) that are not discussed further here. The most critical issue for now is perhaps the theoretical understanding of the algorithm. There are at least two central decisions in the algorithm that have been made on heuristic grounds:
• NDI-equivalent SEs are now grouped using UNDI equivalence (see definition 11). There is, however, more than one way to group NDI-equivalent SEs if the relation is nontransitive. The chosen solution is only one possible, and quite restrictive, way. For example, if the SE pairs q_1, q_2 and q_2, q_3 are NDI equivalent, respectively, while q_1 and q_3 are not, then the possible hypothesis equivalence sets are {{q_1}, {q_2}, {q_3}}, {{q_1, q_2}, {q_3}}, or {{q_1}, {q_2, q_3}}, of which only the first, which results in no merge, is generated by UNDI equivalence.

• When data are collected in collect split data (see algorithm 2), a single input symbol is selected based on conditional entropy. It is, however, possible that another symbol should be selected or that more than one symbol should be included. The selected symbol very strongly affects what the model vectors will be, and even if the seemingly most informative symbol is selected, the selection mechanism includes no heuristics about the underlying geometrical consequences of the decision.
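The nontransitivity example above can be made concrete by enumerating all set partitions of {q1, q2, q3} and keeping those that merge only NDI-equivalent pairs. The enumeration is my own sketch (the relation is the hypothetical one from the text); exactly the three listed hypothesis equivalence sets survive, and UNDI equivalence selects only the merge-free one:

```python
# Enumerate partitions of {q1, q2, q3} consistent with merging only
# NDI-equivalent pairs, for the hypothetical relation q1~q2, q2~q3
# (but not q1~q3) described in the text.
from itertools import combinations

def partitions(items):
    """All set partitions of a list, as lists of blocks."""
    if not items:
        yield []
        return
    head, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[head]] + part                      # head in its own block
        for i in range(len(part)):                 # or head joins a block
            yield part[:i] + [[head] + part[i]] + part[i + 1:]

ndi = {frozenset(p) for p in [('q1', 'q2'), ('q2', 'q3')]}

def consistent(part):
    """A partition is consistent if every within-block pair is NDI equivalent."""
    return all(frozenset(pair) in ndi
               for block in part for pair in combinations(block, 2))

valid = [part for part in partitions(['q1', 'q2', 'q3']) if consistent(part)]
# Three consistent partitions remain:
# {{q1},{q2},{q3}}, {{q1,q2},{q3}}, and {{q1},{q2,q3}}.
```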
The solution to both of these issues should involve more than just finding alternatives to UNDI equivalence and entropy-based selection of input symbols. I suspect that it may involve systematic testing of merging and splitting in a breadth-first search manner. It may even be necessary to split one SE at a time. In this respect, it is perhaps reasonable to consider CrySSMEx a promising first step toward a more generic approach rather than a final solution to the problem of RNN-RE. Moreover, the fact that the algorithm performs quite
well on complex domains while there still are obvious ways to improve it can also be considered quite valuable.

There are at this point no mathematical proofs that CrySSMEx will always provide the expected results, and clearly, some of the experiments demonstrate that it will not. A proof should distinguish the set of problems that can be solved by CrySSMEx from those that cannot. Therefore, a proof, or at least a deep theoretical analysis of the algorithm, is important. But such an analysis will arguably, since the parts are tightly integrated, require a merge of theories surrounding all CrySSMEx constituents, bringing together ideas from areas such as automata theory (Hopcroft & Ullman, 1979), stochastic machines (Paz, 1971), information theory (Cover & Thomas, 1990), and cluster analysis (Everitt et al., 2001). There are also strong connections with the highly developed mathematical field of dynamic systems theory (Devaney, 1992), especially within the context of symbolic dynamics (Crutchfield, 1994). And the whole procedure of generating minimal algorithms (in this case, SSMs) to explain a source of data (Ω) is, of course, related to algorithmic information theory (Chaitin, 1987).

The algorithmic complexity remains an open issue too. The experiments clearly show how evasive this issue is. For example, the analysis of an RNN with a 10^3-dimensional state space ended up in a very modest SSM (see section 5.3), whereas a chaotic one-dimensional autonomous system generated enormous SSMs (see section 5.4). The SSM size will be bounded by |Ω| for chaotic systems, but such an answer is quite unsatisfying since the algorithm should typically be terminated before it memorizes the entire data set.
Given that the system is not chaotic, however, the computational complexity issue will arguably be more interesting, but at the same time very difficult to analyze, since there are some deeply challenging factors to include, such as properties of γ (of the SDTDS) in combination with the selected input sequences.

7 Discussion and Conclusions

7.1 Relation to Earlier Work. Extraction of deterministic SSMs using CrySSMEx has now been shown to be possible for a number of challenging domains. Extraction of stochastic SSMs from a chaotic system was also shown to be possible. The domains on which earlier approaches have been tested have, almost exclusively, been relatively simple binary regular grammars (cf. Jacobsson, 2005). Arguably, the context-free domain, the high-dimensional SRN, and the chaotic system tested in the work reported in this article all constitute significantly more difficult problems.

As discussed in section 1, there are three main differences between CrySSMEx and earlier approaches: the SSM, the CVQ, and the integration of quantization, observation, and minimization. These three ingredients make CrySSMEx more efficient than earlier algorithms simply because CrySSMEx
performs a directed and deterministic search for a minimal quantization of the state space. Earlier approaches have relied on quantizers to find this minimal quantization without any information about the underlying dynamic system context of the state space.

Another difference from most earlier approaches is that CrySSMEx is parameter free (apart from Λ_o, which is typically derivable from the RNN domain context). This is quite an advantage, since when the algorithm is used as an analysis tool, the results are guaranteed not to be colored by the choice of parameters. Since CrySSMEx is also deterministic, there is no need to run it more than once on the same data.

Another main feature is that the algorithm quickly creates an initial coarse stochastic model, which it then refines gradually until a deterministic model is found (if possible). This "anytime rule extraction" possibility was considered by Craven and Shavlik (1999) to be an important aspect with respect to the scalability of the algorithms. The algorithm can also handle missing data due to the substochastic nature of the extracted model, which is important since it builds models from observations of a system, observations that may, or may not, include all relevant aspects of the underlying system.

Another distinguishing feature of CrySSMEx as compared to most other RE algorithms is that the hypothesis space includes the system space: SSMs are a subset of SDTDSs. I believe this is quite unusual for RE algorithms and could turn out to be very fruitful. I have already used this feature for some verification of CrySSMEx; two SSMs, of which one is extracted from the other, should generally be equivalent to each other. One may also, if not satisfied with reading nondeterministic SSMs, always attempt to extract a deterministic description of them.
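The gradual refinement toward a deterministic model can be illustrated in miniature. The following toy (my own, far simpler than CrySSMEx's CVQ) splits cells of a one-dimensional state space until every cell has a unique (output symbol, successor cell) pair; applied to the period-four cycle of the quadratic map at a = 3.5 from section 5.4, it terminates with two cells, matching the two SEs reported there:

```python
# Toy refine-until-deterministic loop over interval cells of [0, 1]
# (my own sketch, not the CVQ algorithm itself).
import bisect

def trajectory(a, s0=0.3, settle=10000, n=40):
    """Settled trajectory of the quadratic map s <- a*s*(1-s)."""
    s = s0
    for _ in range(settle):
        s = a * s * (1.0 - s)
    traj = [s]
    for _ in range(n):
        s = a * s * (1.0 - s)
        traj.append(s)
    return traj

def refine(states):
    bounds = [0.0]                           # left edges of cells
    cell = lambda x: bisect.bisect_right(bounds, x) - 1
    while True:
        events = {}                          # cell -> set of (output, successor)
        for s, t in zip(states, states[1:]):
            out = '+' if t > s else '-' if t < s else '0'
            events.setdefault(cell(s), set()).add((out, cell(t)))
        ambiguous = [c for c, ev in events.items() if len(ev) > 1]
        if not ambiguous:
            return len(bounds)               # number of cells when deterministic
        # split the first ambiguous cell at the midpoint of its member states
        members = sorted(s for s in states if cell(s) == ambiguous[0])
        bisect.insort(bounds, (members[0] + members[-1]) / 2)

n_cells = refine(trajectory(3.5))
```

The single initial cell is ambiguous (it emits both + and −), so it is split once, after which each cell deterministically emits one symbol and one successor.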
Since the SDTDS definition is so abstract, other interesting systems also fall into this class, for example, backpropagation learning if I = ∅, O = ∅, and S is the weight space (Λ_o could then simply be the enumeration of SSMs extracted from the network). Last, but not least, the extraction results not only in a machine but also in a hierarchically structured geometrical organization of the state space of the underlying system (cf. the pure black box model of Vahed & Omlin, 2004). Intuitively, the relation between the structure of the CVQ graph, the topology of the SDTDS state space, and the SSM should contain important seeds for further development of CrySSMEx and deeper analysis of the underlying system.

7.2 Related Fields. Following previous arguments (cf. section 1.1) that a rule extraction algorithm should not even assume that the underlying system is a neural network, I argue that RNN-RE should not be considered a field of neural computation; it should be considered a field of machine learning applied to models of neural computation (cf. Craven & Shavlik, 1994). The consequence is that related algorithms are not primarily found in the literature of neural computation (apart from pure RNN-RE algorithms).
One important field that is not immediately associated with machine learning is control theory. Especially when it comes to the system identification aspect of control theory, one suggested definition could as well have been for RE: "System identification deals with the problem of building mathematical models of dynamical systems based on observed data from the system" (Ljung, 1999, p. 1). In control theory (e.g., Young & Garg, 1995; Marculescu, Marculescu, & Pedram, 1996; Kumar & Garg, 2001), and especially for discrete event systems, problems similar to those of RNN-RE algorithms have been dealt with for a long time. There are, however, some distinguishing features that separate CrySSMEx from the algorithms of control theory, for example, the assumed full observability, discrete time, and determinism of the underlying system. To mature, however, the RNN-RE field needs to take into account the well-developed theory of this related field.

But once the connection is made to control theory, an abundance of other (some partly overlapping) fields also needs to be taken into account: examples are inductive logic programming (e.g., Muggleton & Raedt, 1994), grammar induction (e.g., Moore, 1956; Gold, 1967; Lang, 1992; de la Higuera, 2005), computational learning theory (e.g., Valiant, 1984; Angluin, 1987, 2004), symbolic dynamics (Crutchfield, 1994), computational scientific and mathematical discovery (e.g., Simon, 1995–1996; Langley, Shrager, & Saito, 2002; Colton, Bundy, & Walsh, 2000), closed-loop discovery, also known as active learning (e.g., Cohn, Atlas, & Ladner, 1994; Bryant, Muggleton, Page, & Sternberg, 1999), software testing (Bergadano & Gunetti, 1996), and data mining. Taken together, these fields form an almost insurmountable abundance of literature (only a few examples are cited here). Most likely, there are other fields that are also important to take into account (cf. references in section 6). Some of the goals of these fields are widely different.
For example, the goal of system identification is to better facilitate control of the underlying system, whereas for software testing, it is to find errors. The terminology is also very diverse; the underlying systems may be called plants, machines, or dynamic systems in control theory; an abstract teacher in computational learning theory; or an interactive user in inductive logic programming. The process is described as system identification, model induction, scientific discovery, or data mining, for example. The hypothesis space of the induced models likewise ranges over differential equations, finite state machines, statements about systems, and ad hoc representations of engineering problems. After a brief review of the leading articles and books in these fields, it becomes obvious from the lack of cross-referencing that the potential connections are not fully exploited. Yet all these fields have one thing in common with rule extraction: one of their central goals is to automatically induce models, conjectures, concepts, and/or predictions based on observations. The exact nature of the similarities and differences of these fields to each other and to RNN-RE is beyond the scope of this letter. But since the goals
of these fields overlap with science in general, I suggest that a natural way to bring them closer together could be to build an encompassing theory by taking advantage of the deep insights that philosophers of science have already provided (e.g., Simon, 1973; Williamson, 2004).

7.3 Future Directions and Final Thoughts. There are three major reasons for the choice of the word crystallizing in this work: (1) when a uniformly initialized SSM parses a sequence of inputs, its SE distribution gradually crystallizes into a more ordered distribution; (2) the quantization of the state space geometrically resembles the process of crystallization; and (3) the whole CrySSMEx process conceptually crystallizes a model of the underlying system from a rough and unordered model to a more refined model. Another reason is that one ambition with CrySSMEx is to make otherwise opaque dynamic systems crystal clear for human observers. To what degree this last ambition can be fulfilled by CrySSMEx remains perhaps unanswered in this letter.

The comprehensibility of extracted rules has always been of central concern for rule extraction techniques (Andrews et al., 1995). But as a consequence of optimizing fidelity, the rule set (i.e., the number of SEs and transitions) can become very large. This may, of course, reduce our capacity to comprehend the rules, but this is a consequence of embracing Albert Einstein's famous motto: "A scientific theory should be as simple as possible, but no simpler." The extracted machines are just that: as simple as possible, but not so simple that they deviate from the behavior of the underlying system for the sake of comprehensibility. I would, however, argue that to sacrifice comprehensibility for the sake of fidelity may be a route toward something more significant for science than current RNN-RE techniques are.

As presented in this letter, CrySSMEx postprocesses pregenerated data.
It would, however, be quite straightforward to let Ω be resampled based on the extracted SSMs in such a way that data are gathered about dead transitions or infrequently visited SEs (cf. active learning; Cohn et al., 1994; Bryant et al., 1999). The process of model-based data selection is already an ingredient of CrySSMEx (see algorithm 2), but by letting the extractor "interrogate" the underlying system, the empirical loop would be completed. I propose calling this kind of algorithm an empirical machine (Jacobsson & Ziemke, 2005). The generation of models and selection of data must, however, be based on an efficient strategy in order to minimize the number of queries needed by the machine to build fruitful models. A possible such strategy would be to follow Popper's ideas that theories must be falsifiable and to generate hypotheses regarding the extracted models (of singular or multiple SDTDSs) (Popper, 1990). Of special interest is the relation of falsifiability to universality and precision, two properties that could probably be quite straightforwardly translated into quantifiable properties of the models. Universality and precision could then be two
goals for theories competing for empirical data that can falsify them, a framework I suggest calling a Popperian machine (Jacobsson & Ziemke, 2005).

One thing that distinguishes RNN-RE from most of the suggested related fields makes it especially promising: it focuses on simulated, and thus empirically accessible, systems. Many of the practical problems of induction are simply not present for SDTDSs, for example, lack of data, noisy data, and partial observations. Moreover, since an SDTDS is simulated, it is wide open for active learning, since no costly actuators and sensors need to be situated in a real-world environment. The counterargument is, of course, that the interesting scientific problems are out there in the challenging and noisy reality. Why would we automate scientific discovery within comparably uninteresting simulated universes? There is only one reality, whereas there is an infinity of possible simulated realities, so why not focus on the reality that counts? My argument is that the reason is precisely that they are individually relatively uninteresting, numerous, and still complex. Some reasons for wanting to automate a process are that it is too exhaustive, too repetitive, too unrewarding, and yet too challenging for humans. Another fundamental requirement is that the process should be possible to automate, and the properties of these systems clearly make it so. Moreover, since the goal of many of these simulators is to mimic reality, an analysis tool tailored to such systems may at a later stage be ideal for analyzing reality itself. As pointed out at the beginning of this letter, there are many important scientific fields with an abundance of simulated models, models around which the same sound scientific methodology that generated them could be applied. Could the field of machine learning ask for more?
These are toy world problems of scientific significance, created by respected researchers in need of automated assistance by what could possibly be empirical and Popperian machines.

Appendix A: Substochastic Vectors

Some important types of, and operations on, substochastic vectors will here be defined (some of these are also found in Paz, 1971):

Definition 19. A substochastic vector v is a vector where all elements are nonnegative and the sum of the elements is ≤ 1.

A special case of the substochastic distribution is where all probabilities are zero:

Definition 20. An exhausted substochastic vector v is the special case of a substochastic vector where all elements are 0.
As another special case, we find vectors with more conventional probabilistic properties:

Definition 21. A stochastic vector v is the special case of a substochastic vector where the sum of the elements is exactly 1.

A special case of stochastic vectors is where only one element is probable:

Definition 22. A degenerate vector is a stochastic vector with one element with probability 1 and the rest 0.

Definition 23. The entropy of an n-dimensional substochastic vector v is here denoted H(v) and is calculated by

H(v) = − ∑_{i=1}^{n} v_i log v_i.
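Definitions 19 to 23 transcribe directly into code. This is a sketch; the predicate names and the floating-point tolerances are my own, and the usual convention 0 · log 0 = 0 is applied in the entropy:

```python
# Predicates and entropy for substochastic vectors (definitions 19-23).
import math

def is_substochastic(v):                 # definition 19
    return all(x >= 0 for x in v) and sum(v) <= 1 + 1e-12

def is_exhausted(v):                     # definition 20: all mass is zero
    return all(x == 0 for x in v)

def is_stochastic(v):                    # definition 21: mass is exactly 1
    return is_substochastic(v) and abs(sum(v) - 1) <= 1e-12

def is_degenerate(v):                    # definition 22: one certain element
    return is_stochastic(v) and v.count(1) == 1 and v.count(0) == len(v) - 1

def entropy(v):                          # definition 23, with 0*log(0) = 0
    return -sum(x * math.log(x) for x in v if x > 0)
```

For example, entropy([0.5, 0.5]) equals log 2, the maximum for a two-element stochastic vector.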
Definition 24. The function normalize is used to transform a substochastic vector into a stochastic vector, if possible, according to

normalize(v) = v / (∑_{i=1}^{n} v_i)   if ∑_{i=1}^{n} v_i > 0,
normalize(v) = v · 0                   otherwise.
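Definition 24 amounts to rescaling by the total mass when it is positive, and mapping an exhausted vector to itself (v · 0). A minimal sketch, with function naming my own:

```python
# Definition 24: rescale a substochastic vector to a stochastic one when
# its mass is positive; an exhausted vector stays exhausted (v * 0).

def normalize(v):
    total = sum(v)
    if total > 0:
        return [x / total for x in v]
    return [0.0 for _ in v]              # the exhausted case
```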
Definition 25. The support set of a substochastic vector v = (v_1, v_2, . . . , v_n) is the set {i : v_i > 0} and is denoted sup(v).

Appendix B: List of Abbreviations

CrySSMEx: Crystallizing SSM extractor
CVQ: Crystalline vector quantizer
NDI equivalent: Not decisively inequivalent
RE: Rule extraction
RNN: Recurrent neural network
RNN-RE: RNN-specific RE
SDTDS: Situated discrete time dynamic system
SE: State element (of an SSM)
SSM: Substochastic sequential machine
UNDI equivalent: Universally NDI equivalent
VQ: Vector quantizer
Λ: Quantizer function
Ω: Transition event set (from an SDTDS)
Acknowledgments

I thank André Grüning, Andreas Hansson, Gunnar Búason, Amanda Sharkey, and Tom Ziemke for many valuable comments on this manuscript. I also thank the anonymous reviewers for helping me improve the article significantly. I especially thank them for the suggested connection of my work to control theory.

References

Andrews, R., Diederich, J., & Tickle, A. B. (1995). Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge Based Systems, 8(6), 373–389.
Angluin, D. (1987). Learning regular sets from queries and counterexamples. Information and Computation, 75, 87–106.
Angluin, D. (2004). Queries revisited. Theoretical Computer Science, 313(2), 175–194.
Bergadano, F., & Gunetti, D. (1996). Testing by means of inductive program learning. ACM Transactions on Software Engineering and Methodology, 5(2), 119–145.
Blair, A., & Pollack, J. (1997). Analysis of dynamical recognizers. Neural Computation, 9(5), 1127–1142.
Bodén, M., & Wiles, J. (2000). Context-free and context-sensitive dynamics in recurrent neural networks. Connection Science, 12(3/4), 196–210.
Bryant, C. H., Muggleton, S. H., Page, C. D., & Sternberg, M. J. E. (1999). Combining active learning with inductive logic programming to close the loop in machine learning. In Proceedings of the AISB'99 Symposium on AI and Scientific Creativity. Unpublished manuscript.
Casey, M. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6), 1135–1178.
Chaitin, G. J. (1987). Algorithmic information theory. Cambridge: Cambridge University Press.
Christiansen, M. H., & Chater, N. (1999). Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23(2), 157–205.
Cleeremans, A., McClelland, J. L., & Servan-Schreiber, D. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1, 372–381.
Cohn, D.
A., Atlas, L., & Ladner, R. E. (1994). Improving generalization with active learning. Machine Learning, 15(2), 201–221.
Colton, S., Bundy, A., & Walsh, T. (2000). On the notion of interestingness in automated mathematical discovery. International Journal of Human Computer Studies, 53(3), 351–375.
Cover, T. M., & Thomas, J. A. (1990). Elements of information theory. New York: Wiley.
Craven, M. W., & Shavlik, J. W. (1994). Using sampling and queries to extract rules from trained neural networks. In W. W. Cohen & H. Hirsh (Eds.), Machine learning: Proceedings of the Eleventh International Conference. San Francisco: Morgan Kaufmann.
Craven, M. W., & Shavlik, J. W. (1996). Extracting tree-structured representations of trained networks. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 24–30). Cambridge, MA: MIT Press.
Craven, M. W., & Shavlik, J. W. (1999). Rule extraction: Where do we go from here? (Machine Learning Research Group Working Paper 99-1). Madison: Department of Computer Sciences, University of Wisconsin.
Crutchfield, J. P. (1994). The calculi of emergence: Computation, dynamics, and induction. Physica D, 75, 11–54.
Crutchfield, J. P., & Young, K. (1990). Computation at the onset of chaos. In W. Zurek (Ed.), Complexity, entropy and the physics of information. Reading, MA: Addison-Wesley.
de la Higuera, C. (2005). A bibliographical study of grammatical inference. Pattern Recognition, 38, 1332–1348.
Devaney, R. L. (1992). A first course in chaotic dynamical systems. Reading, MA: Addison-Wesley.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Everitt, B. S., Landau, S., & Leese, M. (2001). Cluster analysis. London: Arnold.
Gers, F. A., & Schmidhuber, J. (2001). LSTM recurrent networks learn simple context free and context sensitive languages. IEEE Transactions on Neural Networks, 12(6), 1333–1340.
Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., & Sun, G. Z. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3), 393–405.
Giles, C. L., Miller, C. B., Chen, D., Sun, G. Z., Chen, H. H., & Lee, Y. C. (1992). Extracting and learning an unknown grammar with recurrent neural networks. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 317–324). San Mateo, CA: Morgan Kaufmann.
Gold, M. E. (1967). Language identification in the limit. Information and Control, 10(5), 447–474.
Hammer, B., & Tiňo, P. (2003). Recurrent neural networks with small weights implement definite memory machines. Neural Computation, 15(8), 1897–1929.
Hopcroft, J., & Ullman, J. D. (1979). Introduction to automata theory, languages, and computation. Reading, MA: Addison-Wesley.
Jacobsson, H. (2005). Rule extraction from recurrent neural networks: A taxonomy and review. Neural Computation, 17(6), 1223–1263.
Jacobsson, H., & Ziemke, T. (2003a). Improving procedures for evaluation of connectionist context-free language predictors. IEEE Transactions on Neural Networks, 14(4), 963–966.
Jacobsson, H., & Ziemke, T. (2003b). Reducing complexity of rule extraction from prediction RNNs through domain interaction (Tech. Rep. No. HS-IDA-TR-03-007). Skövde: Department of Computer Science, University of Skövde, Sweden.
Jacobsson, H., & Ziemke, T. (2005). Rethinking rule extraction from recurrent neural networks. Paper presented at the IJCAI-05 Workshop on Neural-Symbolic Learning and Reasoning.
Kolen, J. F., & Kremer, S. C. (Eds.). (2001). A field guide to dynamical recurrent networks. Piscataway, NJ: IEEE Press.
Kremer, S. C. (2001). Spatiotemporal connectionist networks: A taxonomy and review. Neural Computation, 13(2), 248–306.
Kumar, R., & Garg, V. K. (2001). Control of stochastic discrete event systems modeled by probabilistic languages. IEEE Transactions on Automatic Control, 46(4), 593–606.
Lang, K. J. (1992). Random DFA's can be approximately learned from sparse uniform examples. In Proceedings of the Fifth ACM Workshop on Computational Learning Theory (pp. 45–52). New York: ACM.
Langley, P., Shrager, J., & Saito, K. (2002). Computational discovery of communicable scientific knowledge. In L. Magnani, N. J. Nersessian, & C. Pizzi (Eds.), Logical and computational aspects of model-based reasoning. Dordrecht: Kluwer Academic.
Ljung, L. (1999). System identification: Theory for the user. Englewood Cliffs, NJ: Prentice Hall.
Manolios, P., & Fanelli, R. (1994). First order recurrent neural networks and deterministic finite state automata. Neural Computation, 6(6), 1155–1173.
Marculescu, D., Marculescu, R., & Pedram, M. (1996). Stochastic sequential machine synthesis targeting constrained sequence generation. In DAC '96: Proceedings of the 33rd Annual Conference on Design Automation (pp. 696–701). New York: ACM Press.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133.
Moore, E. F. (1956). Gedanken-experiments on sequential machines. In C. E. Shannon & J. McCarthy (Eds.), Annals of mathematical studies (Vol. 34, pp. 129–153). Princeton, NJ: Princeton University Press.
Muggleton, S., & Raedt, L. D. (1994). Inductive logic programming: Theory and methods. Journal of Logic Programming, 19–20, 629–679.
Paz, A. (1971). Introduction to probabilistic automata. Orlando, FL: Academic Press.
Popper, K. R. (1990). The logic of scientific discovery (14th ed.). London: Unwin Hyman.
Rabin, M. O. (1963). Probabilistic automata. Information and Control, 6, 230–245.
Sharkey, N.
E., & Jackson, S. A. (1995). An internal report for connectionists. In R. Sun & L. A. Bookman (Eds.), Computational architectures integrating neural and symbolic processes (pp. 223–244). Boston: Kluwer. Simon, H. A. (1973). Does scientific discovery have a logic? Philosophy of Science, 40, 471–480. Simon, H. A. (1995–1996). Machine discovery. Foundations of Science, 1(2), 171–200. Tickle, A. B., Andrews, R., Golea, M., & Diederich, J. (1998). The truth will come to light: Directions and challenges in extracting the knowledge embedded within mined artificial neural networks. IEEE Transactions on Neural Networks, 9(6), 1057– 1068. ˇ nansk ˇ P., Cer ˇ ´ M., & Benuˇ ˇ skov´a, L. (2004). Markovian architectural bias of Tino, y, recurrent neural networks. IEEE Transactions on Neural Networks, 6–15. ˇ P., & Koteles, ¨ Tino, M. (1999). Extracting finite-state representations from recurrent neural networks trained on chaotic symbolic sequences. IEEE Transactions on Neural Networks, 10(2), 284–302. ˇ P., & Vojtek, V. (1998). Extracting stochastic machines from recurrent neural Tino, networks trained on complex symbolic sequences. Neural Network World, 8(5), 517–530.
The Crystallizing Substochastic Sequential Machine Extractor
2255
Tonkes, B., Blair, A., & Wiles, J. (1998). Inductive bias in context-free language learning. In Proceedings of the Ninth Australian Conference on Neural Networks (pp. 52–56). Brisbane, Australia: Department of Computer Science and Electrical Engineering, University of Queensland. Vahed, A., & Omlin, C. W. (2004). A machine learning method for extracting symbolic knowledge from recurrent neural networks. Neural Computation, 16, 59–71. Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27, 1134–1142. Watrous, R. L., & Kuhn, G. M. (1992). Induction of finite-state automata using secondorder recurrent networks. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 309–317). San Mateo, CA: Morgan Kaufmann. Wiles, J., & Elman, J. L. (1995). Learning to count without a counter: A case study of dynamics and activation landscapes in recurrent neural networks. In Proceedings of the Seventeenth Annual Conference of the Cognitive Science Society (pp. 482–487). Cambridge MA: MIT Press. Williamson, J. (2004). A dynamic interaction between machine learning and the philosophy of science. Minds and Machines, 14(4), 539–549. Young, S., & Garg, V. K. (1995). Model uncertainty in discrete event systems. SIAM Journal on Control and Optimization, 33(1), 208–226. Zeng, Z., Goodman, R. M., & Smyth, P. (1993). Learning finite state machines with self-clustering recurrent networks. Neural Computation, 5(6), 976–990.
LETTER
Communicated by Emilio Salinas
Assessing Neuronal Coherence with Single-Unit, Multi-Unit, and Local Field Potentials Magteld Zeitler [email protected] Department of Medical Physics and Biophysics, Institute for Neuroscience, Radboud University Nijmegen, 6525 EZ Nijmegen, Netherlands
Pascal Fries [email protected] F. C. Donders Centre for Cognitive Neuroimaging and Department of Biophysics, Institute for Neuroscience, Radboud University Nijmegen, 6525 EZ Nijmegen, Netherlands
Stan Gielen [email protected] Department of Medical Physics and Biophysics, Institute for Neuroscience, Radboud University Nijmegen, 6525 EZ Nijmegen, Netherlands
The purpose of this study was to obtain a better understanding of neuronal responses to correlated input, in particular focusing on the aspect of synchronization of neuronal activity. The first aim was to obtain an analytical expression for the coherence between the output spike train and correlated input and for the coherence between output spike trains of neurons with correlated input. For Poisson neurons, we could derive that the peak of the coherence between the correlated input and multi-unit activity increases proportionally with the square root of the number of neurons in the multi-unit recording. The coherence between two typical multi-unit recordings (2 to 10 single units) with partially correlated input increases proportionally with the number of units in the multi-unit recordings. The second aim of this study was to investigate to what extent the amplitude and signal-to-noise ratio of the coherence between input and output varied for single-unit versus multi-unit activity and how they are affected by the duration of the recording. The same problem was addressed for the coherence between two single-unit spike series and between two multi-unit spike series. The analytical results for the Poisson neuron and numerical simulations for the conductance-based leaky integrate-and-fire neuron and for the conductance-based Hodgkin-Huxley neuron show that the expectation value of the coherence function does not increase for a longer duration of the recording. The only effect of a longer duration of the spike recording is a reduction of the noise in the Neural Computation 18, 2256–2281 (2006)
© 2006 Massachusetts Institute of Technology
Assessing Neural Coherence
2257
coherence function. The results of analytical derivations and computer simulations for model neurons show that the coherence for multi-unit activity is larger than that for single-unit activity. This is in agreement with the results of experimental data obtained from monkey visual cortex (V4). Finally, we show that multitaper techniques greatly contribute to a more accurate estimate of the coherence by reducing the bias and variance in the coherence estimate. 1 Introduction The recent advent of multiple electrode recording technology makes it possible to study the simultaneous spiking activity of many neurons. This allows us to explore how stimuli are encoded by neuronal activity and how groups of neurons act in concert to define the function of a given brain region. However, in spite of the considerable technological developments and the advanced analysis tools (for an overview, see Brown, Kass, & Mitra, 2004), there are many fundamental questions regarding the interpretation of multi-unit activity. For a long time, the study of isolated single units has been considered the gold standard in animal neurophysiology. However, it appears that the use of measures of aggregate neuronal activity, such as multi-unit or local field potential recordings, greatly enhances the sensitivity of correlation and coherence analyses (see, e.g., Baker, Pinches, & Lemon, 2003; Rolls, Franco, Aggelopoulos, & Reece, 2003). This empirical observation is not yet understood. Related to this is the question of whether a multi-unit recording of duration T, consisting of m single units with the same correlated input, carries the same information as a single-unit recording of duration mT. 
Many studies (see, e.g., Singer & Gray, 1995; Kreiter & Singer, 1996; Engel, Fries, & Singer, 2001; Fries, Neuenschwander, Engel, Goebel, & Singer, 2001) have demonstrated that neurons in early and intermediate visual cortex in cat and macaque exhibit significant correlated fluctuations in their responses to visual stimuli. These cells undergo attention-modulated fluctuations in excitability that enhance temporal coherence of the responses to visual stimuli (Fries, Reynolds, Rorie, & Desimone, 2001; Fries, Schröder, Roelfsema, Singer, & Engel, 2002). The coherence is an important parameter, since it provides a measure for the similarity between two signals. Moreover, coherence among subthreshold membrane potential fluctuations likely expresses functional relationships during states of expectancy or attention, allowing the grouping and selection of distributed neuronal responses for further processing (Fries, Neuenschwander, et al., 2001). The coherence between spike activity and local field potential was larger for multi-unit activity than for single-unit activity. Along the same lines, Baker et al. (2003) studied the cross-correlation and coherence between local field potentials and neural spike trains in monkey primary motor cortex. They concluded that a (small) population of neurons is necessary to effectively encode the
cortical oscillatory signal, that is, the rapid modulations of synaptic input reflected in the oscillatory local field potential. Several studies reported a lack of evidence for synchronized neuronal activity. For example, Tovee and Rolls (1992) did not observe clear synchronization in neuronal responses in the inferior temporal visual cortex, nor did Luck, Chelazzi, Hillyard, and Desimone (1997) in V2 and V4. However, Kreiter and Singer (1996) did find clear synchronization in the middle temporal area (MT) if two cells were activated by the same stimulus. Besides recordings in different areas and the use of different types of stimuli, the statistical analysis technique might also play an important role in detecting synchronization. Advanced multitaper techniques (Percival & Walden, 2002) have proven to be useful in estimating coherence between spike trains and local field potentials by improving the signal-to-noise ratio (Pesaran, Pezaris, Shahani, Mitra, & Andersen, 2002; see also Jarvis & Mitra, 2001). These multitaper techniques improved the significance of synchronized oscillatory neuronal activity. The aim of this study was threefold. First, we wanted to obtain a quantitative understanding of the interpretation of correlated output spike trains in terms of correlated input (indirectly related to the local field potential) to the neurons. In order to do so, we started with a network of simple Poisson neurons, the behavior of which could be analyzed analytically. This simple model was then made more realistic by replacing the Poisson neurons by conductance-based neurons. 
The second aim of this study was to investigate to what extent the shape, amplitude, and signal-to-noise ratio of the coherence between input and output varied for single-unit versus multi-unit activity and whether the recording of single-unit activity over a long period of time could produce the same cross-correlation and coherence with local field potential as multi-unit activity over a shorter period of time. We addressed the same question for the coherence between two spike outputs for both two single-unit and two multi-unit spike series. The third aim of this study was to investigate the effectiveness of analysis techniques in revealing coherent activity in multi-unit activity. These three topics were investigated by comparing the results of coherence for single-unit and multi-unit activity in theoretical analyses for Poisson neurons, in computer simulations for conductance-based model neurons, and for data measured in monkey visual cortex (V4) (Fries, Reynolds, et al., 2001).
2 Methods and Theory In order to obtain better insight into the coherence between the local field potential (LFP) on the one hand and single-unit or multi-unit activity on the other, and into the coherence between spike trains of neurons that receive partially correlated input, we will start with a simple model (see
Figure 1). The local field potential mainly reflects the sum of postsynaptic potentials from local cell groups (Buzsáki, 2004). Therefore, the local field potential is seen as indirectly related to the correlated input of the neurons. We consider groups of neurons receiving correlated input that is reflected in a simulated LFP. We therefore modeled those neurons as rate-varying Poisson processes with a baseline firing rate plus rate modulations driven by the LFP fluctuations. Note that in this study, we refer to the LFP as common rate fluctuations of the input signal (for short, common input). In order to prevent any misunderstanding, we would like to point out that this meaning of common input differs from the usual physiological meaning of common input, which implies that two neurons receive the same synaptic input due to a bifurcating axon. In this study, we will determine the coherence between different signals present in the model, as shown in Figure 1. First, we concentrate on the Poisson model and derive an expression for the coherence between the common input (LFP) and the response of a single Poisson neuron (a small circle in Figure 1). After deriving a similar expression for multi-unit activity, we compare the two spike-field coherence functions. We finish the theoretical part, concerning the coherence functions, by deriving expressions for the spike-spike coherences, first between two single-unit activities and then between two multi-unit series of Poisson neurons. Simulation results for these coherence measures complete the Poisson model section. We continue with simulations of the complete model, including the conductance-based neurons (the large circles in Figure 1). The common input (LFP) to the Poisson neurons will be taken as the local field potential in order to determine the spike-field coherences between the common input and the response(s) of the conductance-based neuron(s). 
The spike-spike coherences are taken between the responses of two conductance-based neurons (single units) and then between the sums of 10 responses (multi-units) of this neuron type. We finish with the coherence analysis of experimental data. 2.1 Poisson Model and Coherences. In the simple model in Figure 1, we feed Poisson neurons with partially common rate fluctuations N_c σ η_0(t) and uncorrelated noise (1 − N_c)σ η_i(t) (as described below), in order to translate the LFP into a series of (partially) correlated spike trains. For this part of the model, we derive analytical expressions for the coherence between LFP and single-unit or multi-unit activity and for the coherence between spike trains. The spike output of the Poisson neurons is fed into a set of neurons, which could be conductance-based leaky integrate-and-fire neurons or conductance-based Hodgkin-Huxley neurons. The Poisson neurons each receive an input

$$x_i(t) = \lambda + N_c\,\sigma\,\eta_0(t) + (1 - N_c)\,\sigma\,\eta_i(t) \tag{2.1}$$
with a constant input λ, gaussian colored noise η_0, and gaussian white noise η_i, with ⟨η_i(t)η_j(t+τ)⟩ = δ_{ij}δ(τ). The common input ratio N_c varies from zero (uncorrelated input to all neurons) to one (the same input to all neurons). Both η_0(t) and η_i(t) have zero mean and a variance of one. In this study, σ is set to λ/3, so the total input to the neurons is always positive and, therefore, so is the probability that a spike occurs. Experiments in visual cortex (Fries, Neuenschwander, et al., 2001; Fries, Reynolds, et al., 2001; Fries et al., 2000) have shown that the local field potential, which represents a measure of the local correlated input to a group of neurons (Buzsáki, 2004), has a peak in the power spectrum in the range between 40 and 60 Hz. Therefore, we used bandpass-filtered gaussian white noise η_0(t) as a time-dependent common rate fluctuation, which was obtained by filtering gaussian white noise with a bandpass filter with 3 dB points at 45 and 55 Hz and a quality factor Q of 5. The response of Poisson neuron i to the input x_i(t) is represented by a sequence of action potentials y_i(t) = Σ_j δ(t − t_j^i), where t_j^i represents the occurrence time of the jth spike of neuron i. In this study, we introduce a discretization of time in bins Δt of 1 ms, such that y_i(t) = 1 for an action potential in the time interval [t, t + Δt) with probability x_i(t)Δt, and y_i(t) = 0 with probability 1 − x_i(t)Δt. Multi-unit activity is defined as the sum of m single-unit activities, z(t) = Σ_{i=1}^{m} Σ_j δ(t − t_j^i). A commonly used measure to estimate the relation between input x(t) and output y(t) of a neuron is the normalized cross-covariance function
Figure 1: Schematic overview of the network of neurons for the simulations. A set of Poisson neurons receives common rate fluctuations (local field potential) and uncorrelated input to generate a set of correlated spike trains. These spike trains provide the input for a set of neurons, which are modeled as leaky integrate-and-fire (LIF) neurons or Hodgkin-Huxley (HH) neurons. A population of Poisson neurons is represented by an oval with small circles. Each Poisson neuron receives a common input given by λ + N_c σ η_0(t) and a unique input given by (1 − N_c)σ η_{ij}(t), which is uncorrelated in time and space. λ is a constant, η_0 represents the common rate fluctuations to the Poisson neurons and is represented by bandpass-filtered gaussian white noise, and η_{ij} is gaussian white noise for the jth Poisson neuron of the ith population. Poisson model: only one population of 20 Poisson neurons is used for the Poisson model. y_i(t) represents the single-unit activity of Poisson neuron i; multi-unit activity is the sum of the responses of 10 neurons. LIF (HH) model: each of the 20 LIF (HH) neurons (large circles) receives input from one of the 20 populations of 100 Poisson neurons each (ovals). Single-unit activity is the response of one conductance-based neuron; multi-unit activity is the sum of 10 single-unit activities.
or correlation coefficient function, which is defined by (Marmarelis & Marmarelis, 1978)

$$\rho_{xy}(\tau) \equiv \frac{C_{xy}(\tau)}{\sqrt{C_{xx}(0)\,C_{yy}(0)}}, \tag{2.2}$$
with the cross-covariance function between two ergodic signals x and y defined as

$$C_{xy}(\tau) = \iint x(t+\tau)\,y(t)\,p(x(t+\tau), y(t))\,dx(t+\tau)\,dy(t) - \bar{x}\,\bar{y}, \tag{2.3}$$
where p(x(t+τ), y(t)) is the joint probability distribution of x(t+τ) and y(t), and x̄ and ȳ represent the averaged values of the signals x and y, respectively. The coherence function γ(ω) reflects how much of the variation in the output y can be attributed to a linear filtering of the input signal x. The coherence function γ(ω) is defined by

$$|\gamma(\omega)| = \frac{|C_{xy}(\omega)|}{\sqrt{|C_{xx}(\omega)|\,|C_{yy}(\omega)|}}. \tag{2.4}$$
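In practice, equation 2.4 is estimated from sampled data by averaging spectra over segments. The sketch below is our own illustration (not code from this study; the segment length, window, and test signal are arbitrary choices) of a segment-averaged magnitude-coherence estimate:

```python
import numpy as np

def coherence(x, y, nperseg=256):
    """Magnitude coherence |gamma(omega)| (eq. 2.4), estimated by
    averaging windowed spectra over non-overlapping segments."""
    n = min(len(x), len(y)) // nperseg * nperseg
    w = np.hanning(nperseg)
    X = np.fft.rfft(w * x[:n].reshape(-1, nperseg), axis=1)
    Y = np.fft.rfft(w * y[:n].reshape(-1, nperseg), axis=1)
    Sxy = np.mean(X * np.conj(Y), axis=0)  # cross-spectrum C_xy(omega)
    Sxx = np.mean(np.abs(X) ** 2, axis=0)  # auto-spectrum C_xx(omega)
    Syy = np.mean(np.abs(Y) ** 2, axis=0)  # auto-spectrum C_yy(omega)
    return np.abs(Sxy) / np.sqrt(Sxx * Syy)

rng = np.random.default_rng(0)
x = rng.standard_normal(20000)
y = np.convolve(x, np.ones(5) / 5, mode="same")  # y: linearly filtered x
g = coherence(x, y)
```

Consistent with the interpretation below equation 2.4, `g` stays close to 1 wherever the filter gain does not vanish, and adding independent noise to `y` would pull it below 1.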
The coherence takes values in the range between 0 (input and output are fully uncorrelated) and 1 (the output is equal to the input after convolution by a linear system). First, we determine the coherence between the single-unit activity of a Poisson neuron and the common rate fluctuations by deriving expressions for the covariance functions in the denominator and the cross-covariance function in the numerator of equation 2.4. Consider x(t) to be the input given by equation 2.1 and y_i(t) = y(t) the response of a single Poisson neuron. Each Poisson neuron is represented by a small circle in Figure 1. The covariance function of the input is given by

$$\begin{aligned} C_{xx}(\tau) &= \iint x(t)\,x(t+\tau)\,p(x(t), x(t+\tau))\,dx(t)\,dx(t+\tau) - \bar{x}^2 \\ &= \int x(t)\,x(t+\tau)\,p(\eta_0(t), \eta_0(t+\tau))\,p(\eta_i(t), \eta_i(t+\tau))\,d\eta_0(t)\,d\eta_0(t+\tau)\,d\eta_i(t)\,d\eta_i(t+\tau) - \bar{x}^2 \\ &= N_c^2\sigma^2\rho(\tau) + (1-N_c)^2\sigma^2\delta(\tau), \end{aligned} \tag{2.5}$$
where the joint probability distributions for τ ≠ 0 are given by

$$\begin{aligned} p(\eta_0(t+\tau), \eta_0(t)) &= \frac{1}{2\pi\sqrt{1-\rho^2(\tau)}}\,\exp\!\left(-\frac{\eta_0^2(t+\tau) - 2\rho(\tau)\,\eta_0(t+\tau)\,\eta_0(t) + \eta_0^2(t)}{2(1-\rho^2(\tau))}\right) \\ p(\eta_i(t+\tau), \eta_i(t)) &= p(\eta_i(t))\,p(\eta_i(t+\tau)), \end{aligned} \tag{2.6}$$

with ρ(τ) = ρ_{η_0η_0}(τ) being the normalized covariance function of the gaussian colored noise η_0. In order to obtain equation 2.5, we used, for the common colored noise η_0(t) and for the uncorrelated noise η_i(t) in the input signal x(t) defined in equation 2.1, for τ ≠ 0:

$$\begin{aligned} \int \eta_0(t+\tau)\,p(\eta_0(t+\tau), \eta_0(t) \mid \tau)\,d\eta_0(t+\tau) &= \rho(\tau)\,\eta_0(t)\,p(\eta_0(t)) = \rho(\tau)\,\frac{\eta_0(t)}{\sqrt{2\pi}}\exp\!\left(-\frac{\eta_0^2(t)}{2}\right) \\ \int \eta_i(t+\tau)\,p(\eta_i(t+\tau), \eta_i(t) \mid \tau)\,d\eta_i(t+\tau) &= \delta(\tau)\,\eta_i(t)\,p(\eta_i(t)) = \delta(\tau)\,\frac{\eta_i(t)}{\sqrt{2\pi}}\exp\!\left(-\frac{\eta_i^2(t)}{2}\right). \end{aligned} \tag{2.7}$$

The first term on the right-hand side of equation 2.5 is due to the common rate fluctuations to the neurons, and the second term to the neuron-specific input fluctuations. The covariance function of a single-unit response results in

$$\begin{aligned} C_{yy}(\tau) &= p(y(t+\tau)=1, y(t)=1) - \bar{y}^2 \\ &= \int p(y(t+\tau)=1 \mid \eta_0(t+\tau), \eta_i(t+\tau))\,p(y(t)=1 \mid \eta_0(t), \eta_i(t)) \\ &\qquad\times p(\eta_0(t), \eta_0(t+\tau))\,p(\eta_i(t), \eta_i(t+\tau))\,d\eta_0(t)\,d\eta_0(t+\tau)\,d\eta_i(t)\,d\eta_i(t+\tau) - \bar{y}^2 \\ &= \Delta t^2\sigma^2 N_c^2(\rho(\tau) - \delta(\tau)) + \Delta t\,\lambda(1 - \Delta t\,\lambda)\,\delta(\tau), \end{aligned} \tag{2.8}$$
where ρ is the normalized covariance function of the gaussian colored noise.
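The generative model described above (the inputs of equation 2.1 and the 1 ms binning of the Poisson spikes) can be simulated directly. The sketch below is our own illustration: a crude brick-wall 45–55 Hz filter stands in for the Q = 5 bandpass of the text, and the seed and N_c are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 0.001            # 1 ms time bins
n = 2 ** 16           # number of bins (about 65 s)
lam, Nc, m = 20.0, 0.5, 10
sigma = lam / 3.0     # sigma = lambda/3, as in the text

# Common rate fluctuation eta_0: gaussian white noise passed through a
# crude 45-55 Hz brick-wall filter (a stand-in for the Q = 5 bandpass),
# rescaled to zero mean and unit variance.
freqs = np.fft.rfftfreq(n, dt)
spec = np.fft.rfft(rng.standard_normal(n))
spec[(freqs < 45.0) | (freqs > 55.0)] = 0.0
eta0 = np.fft.irfft(spec, n)
eta0 = (eta0 - eta0.mean()) / eta0.std()

# Inputs x_i(t) of eq. 2.1 and binned Poisson spikes: y_i(t) = 1 with
# probability x_i(t)*dt (clipped against rare negative excursions).
eta = rng.standard_normal((m, n))
x = lam + Nc * sigma * eta0 + (1.0 - Nc) * sigma * eta
p = np.clip(x * dt, 0.0, 1.0)
y = (rng.random((m, n)) < p).astype(int)
z = y.sum(axis=0)          # multi-unit activity z(t)
rate = y.mean() / dt       # empirical firing rate, close to lam
```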
The cross-covariance function between the input x and the single-unit response y is given by

$$C_{xy}(\tau) = \int x(t+\tau)\,p(x(t+\tau), y(t)=1 \mid \tau)\,dx(t+\tau) - \bar{x}\,\bar{y} = \Delta t\,\sigma^2 N_c^2\,\rho(\tau) + \Delta t\,\sigma^2(1-N_c)^2\,\delta(\tau). \tag{2.9}$$
The first term on the right-hand side is due to the common rate fluctuations, and the second term is due to the neuron-specific input fluctuations. The local field potential is considered to be a measure of the local common rate fluctuation of the neurons near the recording electrode. Therefore, we will take only the contributions of the common rate fluctuations in equations 2.5 and 2.9 into account for determining an analytical expression for the spike-field coherence between single-unit activity and local field potential. The spike-field coherence between the single-unit activity and the common rate fluctuations can be obtained by taking the Fourier transform of equation 2.8 and of the first terms on the right-hand side of equations 2.5 and 2.9. This results in

$$\gamma_{SpF}^{SU}(\omega) = \frac{\Delta t\,\sigma^2 N_c^2\,|\rho(\omega)|}{\sigma N_c\sqrt{|\rho(\omega)|}\,\sqrt{|\Delta t\,\lambda(1-\Delta t\,\lambda) + (\Delta t\,\sigma)^2 N_c^2(\rho(\omega)-1)|}} = \frac{\Delta t\,\sigma N_c\sqrt{|\rho(\omega)|}}{\sqrt{|\Delta t\,\lambda(1-\Delta t\,\lambda) + (\Delta t\,\sigma)^2 N_c^2(\rho(\omega)-1)|}} \approx \frac{\Delta t\,\sigma N_c}{\sqrt{\Delta t\,\lambda}}\sqrt{|\rho(\omega)|}, \tag{2.10}$$
where ρ(ω) is the Fourier transform of the normalized covariance function of the colored noise. The approximation in the last step is valid since (Δtσ)² ≪ Δtλ. In order to obtain an expression for the coherence between multi-unit activity and the common rate fluctuations, we have to determine the covariance function of multi-unit activity and the cross-covariance function between multi-unit activity and common rate fluctuations. Since the probability that a neuron fires twice within a time bin Δt is very small ((Δtλ)² ≪ 1), we take only terms up to second order in Δt into account. For multi-unit activity z, which is the summation over m simultaneously recorded single-unit signals y_i(t) with a common input ratio N_c, and for m ≪ 1/(Δtλ), we find
for the multi-unit covariance function:

$$C_{zz}(\tau) = \sum_{j=0}^{m}\sum_{k=0}^{m} jk\; p(z(t+\tau)=j, z(t)=k) - \bar{z}^2 \approx m(\Delta t)^2\sigma^2 N_c^2\,(m\rho(\tau) - \delta(\tau)) + m\,\Delta t\,\lambda(1-\Delta t\,\lambda)\,\delta(\tau). \tag{2.11}$$
The cross-covariance function between multi-unit activity and the total input is given by

$$C_{xz}(\tau) = \sum_{j=0}^{m} \int x(t+\tau)\, j\, p(x(t+\tau), z(t)=j)\,dx(t+\tau) - \bar{x}\,\bar{z} \approx m\,\Delta t\,\sigma^2 N_c^2\,\rho(\tau) + m\,\Delta t\,\sigma^2(1-N_c)^2\,\delta(\tau). \tag{2.12}$$
Equation 2.12 is equal to equation 2.9 except for the factor m. Combining equation 2.11 with the first terms on the right-hand side of equations 2.5 and 2.12 leads to the expression for the spike-field coherence between multi-unit activity and the common rate fluctuations:

$$\gamma_{SpF}^{MU}(\omega) \equiv \frac{|C_{xz}(\omega)|}{\sqrt{|C_{xx}(\omega)|\,|C_{zz}(\omega)|}} \approx \frac{\Delta t\,\sigma N_c\sqrt{m|\rho(\omega)|}}{\sqrt{|\Delta t\,\lambda(1-\Delta t\,\lambda) + (\Delta t\,\sigma)^2 N_c^2(m\rho(\omega)-1)|}} \approx \frac{\Delta t\,\sigma N_c}{\sqrt{\Delta t\,\lambda}}\sqrt{m|\rho(\omega)|}. \tag{2.13}$$
The spike-field coherence for multi-unit activity, which is the summation of m single-unit recordings, is thus equal to that for single-unit activity (see equation 2.10) except for a factor √m. We can also calculate the coherence between two single-unit responses or between two multi-unit recordings. The cross-covariance function between two single-unit signals y_1 and y_2 is given by

$$C_{y_1 y_2}(\tau) = p(y_1(t+\tau)=1, y_2(t)=1) - \bar{y}_1\,\bar{y}_2 = (\Delta t\,\sigma)^2 N_c^2\,\rho(\tau). \tag{2.14}$$
The spike-spike coherence between two simultaneously recorded single-unit signals with partly common rate fluctuations is given by

$$|\gamma_{SpSp}^{SU}| \equiv \frac{|C_{y_1 y_2}(\omega)|}{|C_{yy}(\omega)|} = \frac{(\Delta t\,\sigma)^2 N_c^2\,|\rho(\omega)|}{|\Delta t\,\lambda(1-\Delta t\,\lambda) + (\Delta t\,\sigma)^2 N_c^2(\rho(\omega)-1)|} \approx \frac{(\Delta t\,\sigma N_c)^2}{\Delta t\,\lambda}\,|\rho(\omega)|, \tag{2.15}$$
where we used C_yy = C_{y_1 y_1} = C_{y_2 y_2}. The cross-covariance function of two multi-unit signals is given by

$$C_{z_1 z_2}(\tau) = \sum_{j,k=0}^{m} jk\; p(z_1(t+\tau)=j, z_2(t)=k \mid \tau) - \bar{z}^2 \approx m^2 N_c^2 (\Delta t\,\sigma)^2\,\rho(\tau). \tag{2.16}$$
The spike-spike coherence between two simultaneously recorded multi-unit signals is given by

$$|\gamma_{SpSp}^{MU}| = \frac{m^2(\Delta t\,\sigma)^2 N_c^2\,|\rho(\omega)|}{|m^2(\Delta t\,\sigma)^2 N_c^2\,\rho(\omega) + m(\Delta t\,\lambda(1-\Delta t\,\lambda) - (\Delta t\,\sigma)^2 N_c^2)|} \approx \frac{m(\Delta t\,\sigma)^2 N_c^2\,|\rho(\omega)|}{|\Delta t\,\lambda(1-\Delta t\,\lambda) + (\Delta t\,\sigma)^2 N_c^2(m\rho(\omega)-1)|} \approx \frac{(\Delta t\,\sigma N_c)^2}{\Delta t\,\lambda}\,m|\rho(\omega)|. \tag{2.17}$$
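The scaling relations implied by equations 2.10, 2.13, 2.15, and 2.17 can be checked by evaluating the closed-form approximations directly. The sketch below uses illustrative parameter values of our own choosing (Δt = 1 ms, λ = 20 spikes/s, σ = λ/3, N_c = 0.5, and |ρ(ω)| = 1 at the spectral peak):

```python
import numpy as np

# Illustrative values (our own choices): 1 ms bins, lambda = 20 spikes/s,
# sigma = lambda/3, common input ratio Nc = 0.5, |rho(omega)| = 1 at the peak.
dt, lam, Nc, m, rho = 0.001, 20.0, 0.5, 10, 1.0
sigma = lam / 3.0

def spike_field(m_units):
    """Approximate spike-field coherence, eqs. 2.10 (m = 1) and 2.13."""
    return dt * sigma * Nc / np.sqrt(dt * lam) * np.sqrt(m_units * rho)

def spike_spike(m_units):
    """Approximate spike-spike coherence, eqs. 2.15 (m = 1) and 2.17."""
    return (dt * sigma * Nc) ** 2 / (dt * lam) * m_units * rho
```

Evaluating these confirms the text: the multi-unit spike-field coherence exceeds the single-unit one by √m, the multi-unit spike-spike coherence by m, and each spike-spike coherence is the square of the corresponding spike-field coherence.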
Equations 2.15 and 2.17 show that for low firing rates (Δtλ ≪ 1) and for m ≪ 1/(Δtλ), the expected spike-spike coherence between multi-unit signals is approximately m times larger than the expected spike-spike coherence between single-unit signals. Equations 2.13 and 2.17 show that the spike-spike coherence is (approximately) the square of the spike-field coherence and thus much smaller. In summary, for our Poisson model, the spike-field coherence and the spike-spike coherence are larger for multi-unit recordings than for single-unit recordings, and the spike-spike coherences are much smaller than the spike-field coherences. 2.2 Conductance-Based LIF Model. Since the simple Poisson model is not very realistic, we will discuss a model where conductance-based leaky integrate-and-fire (LIF) neurons receive spike input from the Poisson neurons. The membrane equation of the neurons is then given by

$$C\,\frac{dU}{dt} = -I_e(t) - I_l(t), \tag{2.18}$$
with membrane capacitance C, membrane potential U, and excitatory and leak currents I_e and I_l, respectively. These currents are given by

$$I_e(t) = G_e(t)\,(U(t) - E_e), \qquad I_l(t) = G_l\,(U(t) - E_r), \tag{2.19}$$
with the excitatory reversal potential E_e, rest potential E_r, and excitatory (leak) conductance G_e(t) (G_l). The excitatory conductance depends on the recent presynaptic spike times and is modeled by

$$G_e(t) = \sum_{i=1}^{m}\sum_{k=1}^{k_i^{max}} g_e(t - t_i^k), \tag{2.20}$$

with t_i^k the time of the kth spike of neuron i and with m the number of input neurons. In this study, the conductivity is modeled by an alpha function:

$$g_e(t) = g_0\,\frac{t}{\tau_e}\,\exp\!\left(-\frac{t}{\tau_e}\right)\Theta(t). \tag{2.21}$$
Here τ_e denotes the time-to-peak of the conductivity g_e(t), and Θ is the Heaviside function. When the membrane potential reaches the threshold U_thr, a spike is generated, and the membrane potential U is reset. Specific values for the LIF model are (Stroeve & Gielen, 2001): membrane capacitance C = 325 pF, threshold potential U_thr = −55 mV, excitatory reversal potential E_e = 0 mV, rest potential E_r = −75 mV, leak conductance G_l = 25 nS, g_0 = 3.24 nS, and τ_e = 1.5 ms. Each LIF neuron (a large circle in Figure 1) receives input from a population of 100 Poisson neurons (oval), with a spike rate output modulated by a common input (λ + N_c σ η_0(t)) and an uncorrelated input ((1 − N_c)σ η_i(t)), where η_0(t) is gaussian colored noise and η_i(t) is gaussian white noise, both with zero mean and variance one. For our simulations, these quantities are chosen as for the Poisson model, except for σ, which was set to σ = 20/12 for λ = 20. In our simulations, we derived the membrane potential by using Euler integration with a step width of 1 ms. 2.3 Conductance-Based Hodgkin-Huxley Model. The next modification of our simple model in Figure 1 is the replacement of the conductance-based LIF neurons (circles) by conductance-based Hodgkin-Huxley neurons. These neurons are characterized by the differential equation

$$C\,\frac{dU}{dt} = -I_{Na}(t) - I_K(t) - I_l(t) - I_e(t), \tag{2.22}$$
where the sodium and potassium currents are given by

$$I_{Na}(t) = g_{Na}\,m^3 h\,(U(t) - V_{Na}), \qquad I_K(t) = g_K\,n^4\,(U(t) - V_K), \tag{2.23}$$
and the leak and excitatory currents are as described before (see equation 2.19). V_Na and V_K are the sodium and potassium reversal potentials. The time-varying gate variables m, h, and n are given by the differential equation

$$\frac{dx}{dt} = \frac{x_\infty - x}{\tau_x}, \qquad x \in \{m, h, n\}, \tag{2.24}$$

with τ_x = 1/(α_x + β_x) and x_∞ = α_x/(α_x + β_x). These parameters are expressed by

$$\begin{aligned}
\alpha_m &= 0.1\,\frac{U + 40}{1 - \exp(-0.1(U + 40))} & \beta_m &= 4\exp(-(U + 65)/18) \\
\alpha_n &= \frac{0.01(U + 55)}{1 - \exp(-0.1(U + 55))} & \beta_n &= 0.125\exp(-(U + 65)/80) \\
\alpha_h &= 0.07\exp(-(U + 65)/20) & \beta_h &= \frac{1}{1 + \exp(-0.1(U + 35))}.
\end{aligned} \tag{2.25}$$
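The gating dynamics of equations 2.24 and 2.25 can be sketched as follows. This is our own illustration (using the standard Hodgkin-Huxley sign convention for α_n and the 0.05 ms Euler step mentioned in the text; U is in mV, rates in 1/ms):

```python
import numpy as np

# Gating rate functions of eq. 2.25 and steady states of eq. 2.24.
def alpha_m(U): return 0.1 * (U + 40.0) / (1.0 - np.exp(-0.1 * (U + 40.0)))
def beta_m(U):  return 4.0 * np.exp(-(U + 65.0) / 18.0)
def alpha_n(U): return 0.01 * (U + 55.0) / (1.0 - np.exp(-0.1 * (U + 55.0)))
def beta_n(U):  return 0.125 * np.exp(-(U + 65.0) / 80.0)
def alpha_h(U): return 0.07 * np.exp(-(U + 65.0) / 20.0)
def beta_h(U):  return 1.0 / (1.0 + np.exp(-0.1 * (U + 35.0)))

def x_inf(alpha, beta, U):
    """Steady state x_inf = alpha/(alpha + beta)."""
    return alpha(U) / (alpha(U) + beta(U))

def euler_gate_step(x, alpha, beta, U, dt=0.05):
    """One forward-Euler step of eq. 2.24 (dt in ms, 0.05 ms as in the text)."""
    tau = 1.0 / (alpha(U) + beta(U))
    return x + dt * (x_inf(alpha, beta, U) - x) / tau

m_rest = x_inf(alpha_m, beta_m, -65.0)  # sodium activation at rest, ~0.05
```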
The typical values of the parameters at 6.3 °C for the squid axon are: membrane capacitance per unit area, C = 1 µF/cm²; maximum conductances per unit area for the sodium, potassium, and leak currents, g_Na = 120 mS/cm², g_K = 36 mS/cm², and G_l = 0.3 mS/cm²; excitatory reversal potential, E_e = 0 mV; rest potential, E_r = −75 mV; sodium reversal potential, V_Na = 50 mV; potassium reversal potential, V_K = −77 mV; g_0 = 1.5 µS/cm²; and τ_e = 1.5 ms. As for the conductance-based LIF model, we use spike trains as input for the conductance-based HH neurons. We derived the membrane potential
using Euler integration with a step width of 0.05 ms for the HH neurons. The sequence of output action potentials of the HH model was represented in time bins of 1 ms. 2.4 Multitaper Method. The usual way of estimating the frequency content of a signal is by taking the Fourier spectrum (periodogram). If the signal x(t) has a stochastic character, the variance of the spectral estimates in the Fourier-transformed signal may be considerable. This is particularly important if we are dealing with the coherence of two stochastic spike series. This is not solved by taking a signal of longer duration, since a longer time signal gives rise to a higher spectral resolution in the Fourier-transformed signal but does not decrease the variance of each point in the frequency spectrum. To solve this problem, the multitaper estimation procedure was introduced (see Thomson, 1982; Mitra & Pesaran, 1999). The key idea behind the Welch method and the multitaper method is that a physiological signal has no discontinuities in the frequency spectrum and that the variability in the estimate of a signal can be reduced by smoothing in the frequency domain. The multitaper method achieves this by optimizing the minimum of bias and variance of the estimate. This involves the use of multiple orthonormal data tapers, which provide a local eigenbasis in frequency space for finite-length data sequences. A simple example of the method is given by the direct multitaper spectral estimate S_MT(f) of a discrete time series signal x_t, with t = nΔt and n ∈ {1, 2, …, N}, defined as the average over individual tapered spectral estimates,

$$S_{MT}(f) = \frac{1}{K}\sum_{k=1}^{K} |\tilde{x}_k(f)|^2, \tag{2.26}$$

where

$$\tilde{x}_k(f) = \sum_{t=1}^{N} w_t(k)\,x_t \exp(-2\pi i f t). \tag{2.27}$$
Here wt (k) (k = 1, 2, . . . , K ) are K orthogonal taper functions with appropriate properties. Let wk (k, W, N) be the kth taper of length N and frequency bandwidth parameter W. This forms an orthogonal basis set for sequences of length N, characterized by a bandwidth W. The important feature of these sequences is that for a given bandwidth parameter W and taper length N, K = 2NW − 1 sequences out of a total of N each have their energy effectively concentrated within a range 2W in frequency space. This range can be shifted from [−W, W] centered around zero frequency to any nonzero center frequency interval
[f_0 − W, f_0 + W] by simply multiplying by the appropriate phase factor exp(2πi f_0 t). The product of the number N of samples in the signal and the bandwidth W of the spectral estimator (NW) is used to balance between variance and resolution of the power spectral density estimation. In this article, we use a simple set of orthonormal sine tapers {ω_{t,k} : t = 1, …, N; k = 0, …, N − 1} (McCoy, Walden, & Percival, 1997). The kth taper is given by

$$\omega_{t,k} = \sqrt{\frac{2}{N+1}}\,\sin\!\left(\frac{(k+1)\pi t}{N+1}\right). \tag{2.28}$$
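A minimal numpy sketch of the sine-taper estimate follows (our own implementation, not code from this study; the 1/K normalization reads equation 2.26 as an average over the K tapered periodograms, and the test signal is an arbitrary choice):

```python
import numpy as np

def sine_tapers(N, K):
    """First K orthonormal sine tapers of length N (eq. 2.28)."""
    t = np.arange(1, N + 1)
    k = np.arange(K)[:, None]
    return np.sqrt(2.0 / (N + 1)) * np.sin((k + 1) * np.pi * t / (N + 1))

def multitaper_psd(x, K):
    """Direct multitaper estimate S_MT(f): the average of K tapered
    periodograms (eqs. 2.26 and 2.27)."""
    tapers = sine_tapers(len(x), K)
    return np.mean(np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2, axis=0)

# Example matching the text: N = 512 samples at 1 kHz, K = 6 tapers.
rng = np.random.default_rng(2)
t = np.arange(512) / 1000.0
x = np.sin(2 * np.pi * 50.0 * t) + 0.1 * rng.standard_normal(512)
S = multitaper_psd(x, 6)   # spectrum peaks near the 50 Hz bin
```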
For our analysis, we used signals of length 0.512 s and the first K = 2NW − 1 tapers, which gave K = 6. This means that the bandwidth W of the spectral estimator is 6.83 Hz. The frequency bin width is f_s/nfft = 1.95 Hz, with sampling frequency f_s (1000 Hz) and where nfft is the number of data points in the FFT (512). 2.5 Neurophysiology 2.5.1 Surgery. Experiments were performed on two male Macaca mulatta, weighing 8 to 11 kg. Each monkey was surgically implanted with a head post, a scleral eye coil, and a recording chamber. Surgery was conducted under aseptic conditions with isofluorane anesthesia. Antibiotics and analgesics were administered after the operation. The skull remained intact during the surgery. Subsequently, small holes (5 mm in diameter) were drilled within the recording chamber under ketamine anesthesia and xylazine analgesia. All experimental procedures were performed in accordance with the National Institutes of Health guidelines and approved by the National Institute of Mental Health Intramural Animal Care and Use Committee. 2.5.2 Recording Technique. Neuronal recordings were made through the surgically implanted chamber overlying area V4. Recordings were made from two hemispheres in two monkeys. Four to eight tungsten microelectrodes (Frederick Haer and Co., Brunswick, ME) were inserted through the intact dura mater by means of a hydraulic microdrive (Frederick Haer) mounted to the recording chamber. The electrodes had tip impedances of 1 to 2 MΩ and were separated by 650 or 900 µm. Each electrode was advanced separately at a very slow rate (1.5 µm/s) to minimize suppression artifacts (dimpling) resulting from the deformation of the cortical surface by the electrode. Data amplification, filtering, and acquisition were done with a multichannel acquisition processor (MAP) system from Plexon Incorporated (Dallas, TX). The signal from each electrode was passed through a headstage with unit gain and an output impedance of 240 Ω. It
Assessing Neural Coherence
2271
was then split to separately extract the spike and the LFP components. For spike recordings, the signals were filtered with a passband of 100 to 8000 Hz, further amplified, and digitized at 40 kHz. A threshold was set interactively, and spike waveforms were stored for a time window from 150 µs before to 700 µs after threshold crossing. The threshold clearly separated spikes from noise but was chosen to include multi-unit activity. Off-line, we performed a principal component analysis of the waveforms and plotted the first against the second principal component. Those waveforms that corresponded to artifacts were excluded. For multi-unit analyses, all other waveforms were accepted. For single-unit analyses, only clearly isolated clusters of high-amplitude spikes were accepted. For all further analyses involving spikes, only the times of threshold crossing were kept and downsampled to 1 kHz. For LFP recordings, the signals were filtered with a passband of 0.7 to 170 Hz, further amplified, and digitized at 1 kHz. Each electrode was lowered separately until it recorded visually driven activity. Once this had been achieved for all electrodes, we fine-tuned the electrode positions to optimize the signal-to-noise ratio of the multiple spike recordings and obtain as many isolated single units as possible. Since the penetration was halted as soon as clear visually driven activity was obtained, most of the recordings were presumably done from the superficial layers of the cortex.

2.5.3 Behavioral Paradigm and Visual Stimulation. Stimuli were presented on a 17-inch CRT monitor, 0.57 m from the monkeys' eyes, with a resolution of 800 × 600 pixels and a screen refresh rate of 120 Hz (noninterlaced). Stimulus generation and behavioral control were accomplished with the CORTEX software package (http://www.cortex.salk.edu/). A trial started when the monkey touched a bar mounted in front of him; 250 ms later, a fixation point appeared at the center of the screen.
When the monkey brought his gaze within 0.7 degree of the fixation spot for at least 1000 ms, stimulus presentation commenced. The task of the monkey was to fixate the fixation target while a drifting sine-wave grating was presented within the receptive field. He had to release the bar between 150 and 650 ms after a change in stimulus color of the sine-wave grating. That change in stimulus color could occur at an unpredictable moment in time between 500 and 5000 ms after stimulus onset. With this task, we ensured that the monkey was constantly monitoring the grating that induced the recorded neuronal activity while fixating the fixation target. The first 300 ms after stimulus onset were discarded in order to avoid strong stimulus-onset-related transients, and the rest of the data were analyzed until the time of the color change. Successful trial completion was rewarded with four drops of diluted apple juice. If the monkey released the bar too early or moved his gaze out of the fixation window, the trial was immediately aborted and followed by a time-out.
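Before turning to the results, the coherence analysis can be exercised end to end on surrogate data: a toy version with m = 10 Poisson units driven by a common 50 Hz rate fluctuation, a noisy "LFP" carrying the same fluctuation, and a plain Welch-style coherence estimate (not the multitaper estimator of section 2; all parameter values and names here are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
fs, dur, m = 1000, 200, 10                 # 1 ms bins, 200 s, 10 units
n = fs * dur
t = np.arange(n) / fs
common = np.sin(2 * np.pi * 50.0 * t)      # common 50 Hz rate fluctuation
lfp = common + 0.5 * rng.standard_normal(n)
rate = 20.0 * (1 + 0.5 * common) / fs      # spike probability per 1 ms bin
spikes = rng.random((m, n)) < rate         # m conditionally independent units
single = spikes[0].astype(float)
multi = spikes.sum(axis=0).astype(float)

def coherence(x, y, seg=512):
    """Magnitude coherence from Hann-windowed, segment-averaged periodograms."""
    k = len(x) // seg
    w = np.hanning(seg)
    X = np.fft.rfft((x[:k * seg].reshape(k, seg) - x.mean()) * w)
    Y = np.fft.rfft((y[:k * seg].reshape(k, seg) - y.mean()) * w)
    Sxy = (X * Y.conj()).mean(axis=0)
    return np.abs(Sxy) / np.sqrt((abs(X) ** 2).mean(axis=0) *
                                 (abs(Y) ** 2).mean(axis=0))

f = np.fft.rfftfreq(512, 1.0 / fs)
i50 = np.argmin(np.abs(f - 50.0))
c_single = coherence(lfp, single)[i50]     # spike-field coherence, one unit
c_multi = coherence(lfp, multi)[i50]       # larger than c_single
```

In the weak-coherence regime the multi-unit/single-unit ratio approaches the square root of the number of summed units; with the strong toy modulation used here the multi-unit coherence partly saturates, so the ratio is smaller.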
2272
M. Zeitler, P. Fries, and S. Gielen
3 Results

In this section, we describe coherence estimates between various signals. We always first analyze the spike-field coherence followed by the spike-spike coherence for both single-unit and multi-unit activity. The simulation results will be shown first for the Poisson model neurons (small circles in Figure 1), followed by the conductance-based neurons (LIF and HH; big circles in Figure 1). We end this section with the results of the spike-field and spike-spike coherences of experimental data. Finally, we compare an analysis without and with the use of multitaper techniques.

3.1 Simulation Results of the Poisson Model. The top panels of Figure 2 show the predicted (dashed line) and the simulated (solid line) coherence between the LFP and single-unit activity (see Figure 2A) and between LFP and multi-unit activity (see Figure 2B) for the Poisson neurons. In both cases, there is a good match between the simulated and predicted spike-field coherence functions. The "predicted" coherence functions were obtained using the Fourier transform of the normalized covariance function ρ(τ) of the LFP. Since the LFP had a finite duration, ρ(ω) has noisy fluctuations that are evident in the "predicted" coherence function of Figure 2. The coherence is larger for the multi-unit activity in Figure 2B than for the single-unit activity in Figure 2A. The ratio between the peak coherence for multi-unit versus single-unit activity (0.37/0.12 = 3.08) is in agreement with the square root of the number of neurons (√10 = 3.16) that contribute to the multi-unit activity (see Equations 2.10 and 2.13). One could argue that the larger coherence for the multi-unit case could be due to the fact that the multi-unit recording with 10 (simultaneously measured) single-unit signals contains 10 times more action potentials.
In order to correct for this, the single-unit signal in our simulations was 10 times longer than the multi-unit signal such that the number of action potentials was the same in both signals. Figure 2C shows the simulated (solid line) and predicted (dashed line) spike-spike coherence for single-unit activity for the Poisson neurons. Figure 2D shows the same results for multi-unit activity. The simulated and predicted coherence are in agreement for the single-unit and multi-unit data. The spike-spike coherence for multi-unit activity increases linearly with the number of units (m = 10) in the multi-unit recording as long as m ≪ 1/(λt). This is shown by the peaks of the coherences in Figures 2C and 2D (0.015 versus 0.14). The spike-spike coherence differs from the spike-field coherence in two aspects (see equations 2.17 and 2.13). The first difference concerns the factor m versus √m for spike-spike versus spike-field coherence. The second difference is that the spike-field coherence is proportional to ρ(ω), whereas
Figure 2: Predicted (dashed lines) and simulated (solid lines) coherence functions for LFP and single-unit (A,C) and multi-unit (B,D) signals for the Poisson neurons (see Figure 1). Parameter values used were λ = 20, σ = 20/12, Nc = 0.4, and a simulation duration of 512 s. The number of action potentials in the multi-unit and in the single-unit signals is about 20,480 spikes. (A) The coherence between LFP and single-unit activity. (B) The coherence between LFP and multi-unit activity shows a peak near 50 Hz, which is larger than that for single-unit activity shown in A. (C) The predicted and simulated coherences between two single-unit activities. (D) The predicted and simulated coherence function between two multi-unit activities.
the spike-spike coherence is proportional to the square of the normalized covariance function of the common rate fluctuations, ρ²(ω). Since 0 < |ρ(ω)| < 1, ρ²(ω) is smaller and narrower than ρ(ω). Both aspects are reproduced in Figure 2. The peak value of the spike-spike coherence (see Figure 2D: 0.14) is approximately the square of the peak value of the spike-field coherence (see Figure 2B: 0.37). Equations 2.10, 2.13, 2.15, and 2.17 for the spike-field and spike-spike coherence do not depend on the duration of the LFP and spike series. Therefore, the expectation value for the coherence functions will not change if the duration of the single-unit recordings increases. The only effect of a longer duration of the spike recording is a reduction of the noise in the coherence
Figure 3: Coherences between LFP and single- and multi-unit activities for the conductance-based LIF model (dashed-dotted lines), the HH model (dashed lines), and the predictions for the Poisson model (solid line) according to Equations 2.10, 2.13, 2.15, and 2.17. Parameter values used were λ = 20, σ = 20/12, Nc = 0.4, and a simulation duration of 512 s. (A) Spike-field coherence estimates for single-unit activity. (B) Spike-field coherence estimates for multi-unit activity. (C) Spike-spike coherence estimates for single-unit activity. (D) Spike-spike coherence estimates for multi-unit activity.
function. Therefore, a smaller coherence for single-unit recording relative to multi-unit recording cannot be compensated by a longer recording time for the single-unit recordings.

3.2 Simulation Results for the Conductance-Based LIF and HH Model. Figure 3 shows the spike-field and the spike-spike coherences for single-unit and multi-unit recordings for the conductance-based LIF neuron (dashed-dotted line), the conductance-based HH neuron (dashed line) model, and the predictions for the Poisson model (solid line) according to equations 2.10, 2.13, 2.15, and 2.17, all with σ = 20/12. The parameters were chosen in such a way that the mean firing rate was the same for the Poisson neuron, the LIF, and the HH neurons. Figure 3A (3B) shows the coherence between the LFP and single-unit (multi-unit) activity. For both the
single-unit and multi-unit recordings, the spike-field coherence estimate shows a significant peak near 50 Hz. The peak value of the spike-field coherence estimates for multi-unit recording in Figure 3B is considerably higher than the peak value for the single-unit recording in Figure 3A. The spike-field coherence estimates for the LIF and HH network have much higher values than the spike-field estimates of the Poisson network. The ratio of the two peak spike-field coherence values (multi-unit/single-unit) is smaller than the square root of the number m of neurons active in the multi-unit (m = 10; √m = 3.16). Figure 3C (3D) shows the coherence between two single-unit (multi-unit) recordings. For the single-unit recordings, no significant peak near 50 Hz is visible. The predicted coherence for the Poisson model is small and lies almost on the x-axis, with a small (hardly visible) peak near 50 Hz for the multi-unit activity. For multi-unit activity (see Figure 3D), a significant peak near 50 Hz is visible. The peak coherence is larger for the LIF neuron and the HH model than for the Poisson neuron, for both the spike-field coherence and the spike-spike coherence. The question is whether the higher coherence values for the LIF and HH neuron are due to the dynamics of these neurons or due to the different type of input (continuous LFP signal for the Poisson neurons versus spike input to the LIF and HH neuron). In order to investigate this, we have calculated the coherence between the spike input to the LIF and HH neuron (i.e., the sum of spike series of the Poisson neurons) and their output. These coherence values are much higher than the coherence between the input and the output of a Poisson neuron, with the same input as the LIF and HH neuron. Therefore, we conclude that the higher coherence values of the LIF and HH neuron are the result of the dynamic properties of those neuron types.

3.3 Data from Monkey Visual Cortex.
As a final test of the significance of the model simulations, we have analyzed data obtained in monkey visual cortex (Fries, Reynolds, et al., 2001). The data consisted of single- and multi-unit activity and local field potential activity recorded simultaneously in area V4 of the awake macaque monkey. Figure 4A shows the coherence between the measured LFP and the single-unit signal, which contains 15,371 spikes. The dashed-dotted lines indicate the 95% confidence levels of the coherence estimates, calculated with 130 bootstraps, and the solid line is the average of the bootstrap replications. For a multi-unit recording with a similar number (n = 16,031) of spikes and thus a shorter duration, the spike-field coherence is shown in Figure 4B. In both Figures 4A and 4B, there is a peak in spike-field coherence near 50 Hz. For multi-unit activity (sum of approximately eight single-unit activities), this peak is significantly higher than that for single-unit activity. Figure 4C shows the spike-field coherence for a multi-unit signal with a duration equal to the duration of the single-unit recording used for
Figure 4: Coherences between LFP and single-unit or multi-unit recordings (experimental data), using the multitaper method. For the multitaper method, we used a set of six orthonormal sine tapers. The 95% confidence level (dashed-dotted lines) is obtained using 130 bootstraps. (A) Coherence between LFP and single-unit recording with 15,371 spikes. (B) Coherence between LFP and multi-unit recording, with 16,031 spikes. (C) Coherence between LFP and multi-unit recording, with 668,766 spikes. (D) Coherence between two single-unit recordings. The variance in coherence is too large to detect a significant peak near 50 Hz. (E) Coherence between two multi-unit recordings. (F) Coherence between two multi-unit recordings, with durations equal to those of the single-unit recordings used in Figure 4A. Compared to Figure 4E, the 95% confidence regime has been reduced.
Figure 4A. The coherence estimate in Figure 4C, including its 95% confidence level, lies entirely within the 95% regime shown in Figure 4B. Figures 4B and 4C illustrate that increasing the duration of a spike recording improves the signal-to-noise ratio but does not change the expectation value of the coherence function. The spike-spike coherence for single-unit signals in Figure 4D does not show a significant peak near 50 Hz. Neither does the spike-spike coherence for multi-unit signals if the analyzed time period is shortened such that
Figure 5: The effect of the multitaper method on the spike-spike coherence between multi-units. (A) The spike-spike coherence estimate of experimental data without the use of multitapers. There is no significant peak near 50 Hz. (B) Using the multitaper method with sine tapers resulted in a significant peak and a significant reduction of the 95% regime.
the number of spikes is the same as in the longer single-unit recording in Figure 4E. The coherence values for the multi-unit signals in Figure 4E are larger than for single-unit signals shown in Figure 4D. However, the 95% confidence regime is relatively large. Figure 4F shows the spike-spike coherence for multi-unit signals with duration equal to the duration of the single-unit activities used in Figure 4D. The coherence function in Figure 4F shows a significant peak near 50 Hz. The signal-to-noise ratio is considerably better than in Figure 4E. The results shown in Figure 4 are typical for the spike signals that were obtained in the study by Fries, Reynolds, et al. (2001). All coherence estimates of Figure 4 were obtained with the multitaper method as described in section 2. Each trial was cut into equally long segments of 512 ms such that the number of tapers was constant. In fact, this is a combination of the Welch method and the multitaper technique. As an alternative to using the Welch method with equally long time segments, one could use the multitaper technique for the analysis of spike signals, which in general each have a different duration. Since the trial durations are different, so is the number of samples in each trial. In order to keep the smoothing bandwidth in the frequency domain (2W) constant, the number of tapers given by K = 2NW − 1 is different for each trial. Since averaging over power spectra of different trials requires that the frequency resolution of the spectra be the same for all trials, all signals (after application of the tapers) are made equal in duration by adding zeros (zero padding). Now the average FFTs of the cross- and covariance functions can be derived for the coherence functions.
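The unequal-trial recipe above (a per-trial taper count K = 2NW − 1 that keeps the bandwidth fixed, followed by zero padding to a common FFT length) can be sketched as follows (sine tapers as in section 2; the function name and parameter defaults are our own):

```python
import numpy as np

def tapered_spectra(trials, fs=1000.0, W=5.0, nfft=4096):
    """Tapered, zero-padded FFTs for trials of unequal duration.

    Each trial of N samples gets K = 2*(N/fs)*W - 1 sine tapers, keeping the
    smoothing bandwidth 2W fixed; zero padding to nfft puts all trials on a
    common frequency grid so their spectra can be averaged.
    """
    rows = []
    for x in trials:
        N = len(x)
        K = max(1, int(round(2.0 * (N / fs) * W - 1.0)))
        t = np.arange(1, N + 1)
        for k in range(K):
            w = np.sqrt(2.0 / (N + 1)) * np.sin((k + 1) * np.pi * t / (N + 1))
            rows.append(np.fft.rfft(x * w, n=nfft))   # zero padding to nfft
    return np.asarray(rows)

rng = np.random.default_rng(0)
trials = [rng.standard_normal(N) for N in (512, 700, 1100)]
S = tapered_spectra(trials)   # 4 + 6 + 10 = 20 tapered spectra, common grid
```

Longer trials contribute more tapers, so each trial is smoothed over the same 2W window while all spectra share the nfft frequency grid.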
Figure 5A shows the spike-spike coherence estimate of experimental multi-unit spike trains without the use of multitapers and without using the Welch method. The frequency resolution in Figure 5 is much higher than that in Figure 4F because the number of data points in the frequency domain is eight times larger. The variance in the coherence estimate is large, and no significant peak is visible near 50 Hz. By applying the multitaper method with W = 5 Hz, the variance is reduced, and a significant peak near 50 Hz is visible in the same data (see Figure 5B).

4 Discussion

The coherence between neuronal signals (e.g., EEG, MEG, two spike trains) or between neuronal input (e.g., a local field potential) and neuronal output is generally considered as an important measure for synchronization or temporal locking. The main result of this study is that the coherence reaches higher values when multi-unit spike activity is used instead of single-unit activity. This cannot be overcome by extending the recording time of the single-unit signal. The latter only improves the signal-to-noise ratio (SNR) of the coherence. The SNR can also be improved by using multitaper techniques or by using the Welch method. Experimental data obtained in monkey V4 could be reproduced by simulations. Our results illustrate the significance of multi-unit activity over single-unit activity and provide new insights for the interpretation of multi-unit activity and for the interpretation of coherence estimates using oscillatory activity such as β-oscillations and γ-oscillations in cognitive neuroscience studies. Our results will be discussed in more detail below.
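The variance-reduction point can be illustrated directly: for two truly incoherent noise signals, the spurious coherence floor of the estimator falls as more segments (or, equivalently, tapers) are averaged, roughly like 1/√K (a sketch with our own toy parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512 * 64
x = rng.standard_normal(n)
y = rng.standard_normal(n)                 # truly incoherent with x

def coherence_floor(x, y, seg):
    """Mean estimated coherence of incoherent noise over n/seg averaged segments."""
    k = len(x) // seg
    X = np.fft.rfft(x[:k * seg].reshape(k, seg))
    Y = np.fft.rfft(y[:k * seg].reshape(k, seg))
    c = np.abs((X * Y.conj()).mean(axis=0)) / np.sqrt(
        (abs(X) ** 2).mean(axis=0) * (abs(Y) ** 2).mean(axis=0))
    return c[1:-1].mean()                  # skip DC and Nyquist bins

floor_4 = coherence_floor(x, y, 8192)      # only 4 averaged segments: high floor
floor_64 = coherence_floor(x, y, 512)      # 64 averaged segments: much lower floor
```

With a single segment the estimate is identically 1 whatever the data, which is why averaging over segments or tapers is essential before interpreting a coherence peak.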
Although many studies have investigated the firing behavior of Poisson neurons and integrate-and-fire neurons for partially correlated and uncorrelated input (for an overview, see Salinas & Sejnowski, 2001), most studies have focused on the mean firing rate and the coefficient of variation (see, e.g., Feng & Brown, 2000; Stroeve & Gielen, 2001; Salinas & Sejnowski, 2000, 2002; Kuhn, Aertsen, & Rotter, 2004). The coefficient of variation is an important parameter to understand the temporal structure of spike trains, but this parameter itself cannot provide insight into the temporal correlation of the action potential signals of different neurons receiving partially correlated input. As far as we know, this study is the first to give analytical expressions and results of computer simulations for the coherence between local field potential and neuronal firing and for the coherence between spike signals for neurons receiving (partly) correlated input. In this study, we investigated the relations between spike-field and spike-spike coherences for single-unit and multi-unit activity. Analytical expressions (see equations 2.10, 2.13, 2.15, and 2.17) showed that the spike-field coherence values are higher than the spike-spike coherence values and that the coherences are larger for multi-unit recordings than for single-unit recordings. Although we could derive analytical expressions for the
coherence between input and spike output only for the Poisson neurons, simulations show qualitatively similar results for an ensemble of conductance-based LIF neurons or HH neurons. For Poisson neurons, the spike-spike coherence should be proportional to the square of the spike-field coherence. This was confirmed by simulations (compare Figures 2B and 2D), where the spike-spike coherence equals the square of the spike-field coherence for the Poisson model. Figures 3A, 3B, and 3D show similar results for the conductance-based LIF model and the Hodgkin-Huxley neuron model. The full width at half maximum and the amplitude of the peak are smaller for the spike-spike than for the spike-field coherence, as expected if the spike-spike coherence is proportional to the square of the spike-field coherence, which takes values between zero and one. However, coherence values were typically larger for the conductance-based LIF and Hodgkin-Huxley neuron than for the Poisson neuron. This is due to the characteristic dynamic properties of the neuron models. Our results demonstrate that multi-unit activity gives significantly higher estimates for the coherence than single-unit activity even if the number of action potentials in both signals is the same (see Figures 4A, 4B and 4D, 4E). This is partially due to the fact that the mean firing rate is typically higher in a multi-unit recording than in a single-unit recording and the modulations in firing rate are larger. Equations 2.10 and 2.13 show that the coherence decreases with the square root of firing rate (proportional to λ) but increases linearly with modulation depth σ. Since firing rate and modulation depth increase proportionally when adding single-unit signals, the coherence will effectively increase with the number of single-unit contributions in a multi-unit signal.
Several studies have reported a lack of evidence for synchronized neuronal activity; see, for example, Tovee and Rolls (1992) in the inferior temporal visual cortex and Luck et al. (1997), who did not observe clear synchronization in neuronal responses in V2 and V4. This is in contrast to findings by Fries, Reynolds, et al. (2001). Our results indicate that the explanation for these apparently contradictory findings may be related to the techniques used to analyze the neuronal data. In Figure 3C the spike-spike coherence between single-unit signals is small and disappears in the relatively high variance of the estimate. Simulations with larger values of σ (larger modulations of the stimulus) showed a clear, small, and narrow peak near 50 Hz. However, the signal-to-noise ratio increases to plausible levels only for unrealistically high modulations of the input. Therefore, the variance in experimental data should be reduced by using dedicated data analysis techniques like the multitaper method (see Figures 4 and 5). Our simulations were done for data segments of equal duration (512 ms) and with a constant number (K = 6) of tapers repeated over many time segments. This results in smoothing of the frequency spectrum by averaging over many signals.
In electrophysiological experiments, the recording duration will vary by experiment and will typically be much longer than 512 ms. Therefore, Fries, Reynolds, et al. (2001) used a different number of tapers for each recording signal, such that smoothing was done over the same frequency window (2W = constant) for all experimental data. Since the duration of their recordings was typically much longer than 512 ms, the longer duration gives more samples in the time domain, which results in a higher resolution in the frequency domain. This is illustrated in Figure 5. Their result shows a higher resolution in the frequency domain but averaging over a smaller number of signals. Effectively the result is the same: the reduction of smoothing by the smaller number of signals is compensated by smoothing by the tapers over a larger number of samples in the frequency domain. However, note that the multitaper method with Slepian sequences as tapers is optimal among quadratic estimators because of the good concentration properties of Slepian sequences (see Percival & Walden, 2002). The lack of optimality of the Welch estimates means that the Welch method gives a more biased estimate than the multitaper estimate with Slepian sequences, the variance and the frequency resolution being equal. The bias will grow as the size of the windows becomes smaller.

Acknowledgments

We thank Robert Desimone and John Reynolds for supporting the experimental recordings.

References

Baker, S. N., Pinches, E. M., & Lemon, R. N. (2003). Synchronization in monkey cortex during a precision grip task. II. Effect of oscillatory activity on corticospinal output. J. Neurophysiol., 89, 1941–1953. Brown, E. N., Kass, R. E., & Mitra, P. P. (2004). Multiple neural spike train data analysis: State-of-the-art and future challenges. Nature Neuroscience, 7, 456–461. Buzsáki, G. (2004). Large-scale recording of neuronal ensembles. Nature Neuroscience, 7, 446–451. Engel, A. K., Fries, P., & Singer, W. (2001).
Dynamic predictions: Oscillations and synchrony in top-down processing. Nature Reviews Neuroscience, 2(10), 704–716. Feng, J., & Brown, D. (2000). Impact of correlated inputs on the output of the integrate-and-fire model. Neural Computation, 12, 671–692. Fries, P., Neuenschwander, S., Engel, A. K., Goebel, R., & Singer, W. (2001). Rapid feature selective neuronal synchronization through correlated latency shifting. Nature Neuroscience, 4, 194–200. Fries, P., Reynolds, J. H., Rorie, A. E., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291, 1560–1563.
Fries, P., Schröder, J. H., Roelfsema, P. R., Singer, W., & Engel, A. K. (2002). Oscillatory neuronal synchronization in primary visual cortex as a correlate of stimulus selection. J. Neurosci., 22, 3739–3754. Jarvis, M. R., & Mitra, P. P. (2001). Sampling properties of the spectrum and coherence of sequences of action potentials. Neural Computation, 13, 717–749. Kreiter, A. K., & Singer, W. (1996). Stimulus-dependent synchronization of neural responses in the visual cortex of awake macaque monkey. J. Neurosci., 16(7), 2381–2396. Kuhn, A., Aertsen, A., & Rotter, S. (2004). Neuronal integration of synaptic input in the fluctuation-driven regime. J. Neurosci., 24, 2345–2356. Luck, S. J., Chelazzi, L., Hillyard, S. A., & Desimone, R. (1997). Neural mechanisms of spatial selective attention in areas V1, V2, and V4 of macaque visual cortex. J. Neurophysiol., 77(1), 24–42. Marmarelis, P. Z., & Marmarelis, V. Z. (1978). Analysis of physiological systems: The white-noise approach. New York: Plenum Press. McCoy, E. J., Walden, A. T., & Percival, D. B. (1997). Multitaper spectral estimation of power law processes. IEEE Trans. on Sign. Proc., 46(3), 655–668. Mitra, P. P., & Pesaran, B. (1999). Analysis of dynamic brain imaging data. Biophys. J., 76(2), 691–708. Percival, D. B., & Walden, A. T. (2002). Spectral analysis for physical applications: Multitaper and conventional univariate techniques. Cambridge: Cambridge University Press. Pesaran, B., Pezaris, J. S., Shahani, M., Mitra, P. P., & Andersen, R. A. (2002). Temporal structure in neuronal activity during working memory in macaque parietal cortex. Nature Neuroscience, 5, 805–811. Rolls, E. T., Franco, L., Aggelopoulos, N. C., & Reece, S. (2003). An information theoretic approach to the contributions between the firing rates and the correlations between the firing of neurons. J. Neurophysiol., 89, 2810–2822. Salinas, E., & Sejnowski, T. (2000).
Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. J. Neurosci., 20, 6193–6209. Salinas, E., & Sejnowski, T. (2001). Correlated neuronal activity and the flow of neural information. Nat. Rev. Neurosci., 2, 539–550. Salinas, E., & Sejnowski, T. J. (2002). Integrate-and-fire neurons driven by correlated stochastic input. Neural Computation, 14, 2111–2155. Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Ann. Rev. Neurosci., 18, 555–586. Stroeve, T., & Gielen, C. (2001). Correlation between uncoupled conductance-based integrate-and-fire neurons due to common and synchronous presynaptic firing. Neural Computation, 13, 2005–2030. Thomson, D. J. (1982). Spectrum estimation and harmonic analysis. Proc. IEEE, 70, 1055–1096. Tovee, M. J., & Rolls, E. T. (1992). Oscillatory activity is not evident in the primate temporal visual-cortex with static stimuli. NeuroReport, 3(4), 369–372.
Received June 30, 2005; accepted February 17, 2006.
NOTE
Communicated by Terrence Sejnowski
Consistency of Pseudolikelihood Estimation of Fully Visible Boltzmann Machines Aapo Hyvärinen [email protected] HIIT Basic Research Unit, Department of Computer Science, University of Helsinki, Finland
A Boltzmann machine is a classic model of neural computation, and a number of methods have been proposed for its estimation. Most methods are plagued by either very slow convergence or asymptotic bias in the resulting estimates. Here we consider estimation in the basic case of fully visible Boltzmann machines. We show that the old principle of pseudolikelihood estimation provides an estimator that is computationally very simple yet statistically consistent.

1 Introduction

Assume we observe a binary random vector x ∈ {−1, +1}^n, and we want to model its probability distribution function by

$$P(\mathbf{x}) = \frac{1}{Z(\mathbf{M}, \mathbf{b})} \exp\left(\frac{1}{2}\mathbf{x}^T \mathbf{M} \mathbf{x} + \mathbf{b}^T \mathbf{x}\right). \qquad (1.1)$$
The parameter matrix M = (m_1, . . . , m_n) has to be constrained in some way to make it well defined, because M and M^T give the same probability distribution, and the diagonal elements of M do not interact with x at all. We choose the conventional constraint that M is symmetric and has zero diagonal. The vector b is an n-dimensional parameter vector. This is a special case ("fully visible," that is, no latent variables) of the Boltzmann machine framework (Ackley, Hinton, & Sejnowski, 1985). The central problem in the estimation is that we do not know the constant Z(M, b). In principle, Z is given by the sum

$$Z(\mathbf{M}, \mathbf{b}) = \sum_{\boldsymbol{\xi} \in \{-1,+1\}^n} \exp\left(\frac{1}{2}\boldsymbol{\xi}^T \mathbf{M} \boldsymbol{\xi} + \mathbf{b}^T \boldsymbol{\xi}\right), \qquad (1.2)$$
whose computation is exponential in the dimension n. Thus, for any larger dimension n, direct numerical computation of Z is out of the question.

Neural Computation 18, 2283–2292 (2006)
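To make the combinatorial cost of equation 1.2 concrete, a brute-force evaluation is feasible only for toy dimensions (the function name is ours):

```python
import itertools
import numpy as np

def partition_function(M, b):
    """Z(M, b) of equation 1.2 by explicit summation over all 2^n states."""
    n = len(b)
    total = 0.0
    for s in itertools.product([-1.0, 1.0], repeat=n):
        x = np.array(s)
        total += np.exp(0.5 * x @ M @ x + b @ x)
    return total

# n = 2, zero bias: states (1,1) and (-1,-1) each contribute exp(m12), the
# two mixed states exp(-m12), so Z = 2*exp(m12) + 2*exp(-m12).
m12 = 0.7
M = np.array([[0.0, m12], [m12, 0.0]])
Z = partition_function(M, np.zeros(2))
```

Already at n = 40 the sum has about 10^12 terms, which is why the methods discussed next avoid Z altogether.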
© 2006 Massachusetts Institute of Technology
2284
A. Hyvärinen
For continuous-valued variables, we could use score matching (Hyvärinen, 2005), but here we have binary variables. Maximum likelihood estimation of the model is not possible without some kind of computation of the normalization constant Z, also called the partition function. Typical methods for maximum likelihood estimation are thus computationally very complex (e.g., Markov chain Monte Carlo, MCMC, methods). Different kinds of approximation methods have therefore been developed, including pseudolikelihood (Besag, 1975), contrastive divergence (Hinton, 2002), and linear response theory (Kappen & Rodriguez, 1998). None of these approximative methods has been shown to be consistent. Our contribution here is to show that pseudolikelihood is consistent, and it is closely connected to contrastive divergence.

2 Pseudolikelihood of the Model

In pseudolikelihood estimation (Besag, 1975), we consider the conditional probabilities P(x_i | x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n; θ), that is, conditional probabilities of the random variable given all other variables, where θ denotes the parameter vector. Let us denote by x_{\i} the vector with x_i removed,

$$\mathbf{x}_{\setminus i} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n), \qquad (2.1)$$
and the logarithms of the conditional probabilities by

$$C_i(x_i; \mathbf{x}_{\setminus i}, \theta) = \log P(x_i \,|\, \mathbf{x}_{\setminus i}, \theta). \qquad (2.2)$$
We then estimate the model by maximizing these conditional probabilities in the same way as one would maximize ordinary likelihood. Given a sample x(1), . . . , x(T), the pseudolikelihood (normalized as a function of sample size by dividing by T) is thus of the form

$$J_{PL}(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n} C_i(x_i(t); \mathbf{x}_{\setminus i}(t), \theta). \qquad (2.3)$$
Consistency of the pseudolikelihood has been thoroughly investigated for Markov random fields (see, e.g., Gidas, 1988, and Mase, 1995). However, there seem to be few results for the basic case of a random vector. It is easy to compute the pseudolikelihood for the model in equation 1.1. We have

$$P(x_i \,|\, \mathbf{x}_{\setminus i}, \mathbf{M}, \mathbf{b}) = \frac{\exp(x_i \mathbf{m}_i^T \mathbf{x} + b_i x_i)}{\exp(\mathbf{m}_i^T \mathbf{x} + b_i) + \exp(-\mathbf{m}_i^T \mathbf{x} - b_i)}, \qquad (2.4)$$
Pseudolikelihood Estimation of Boltzmann Machines
which gives

C_i(x_i \mid x_{\setminus i}, M, b) = x_i m_i^T x + b_i x_i - \log \cosh(m_i^T x + b_i) - \log 2,   (2.5)
and thus, for a given sample x(1), \ldots, x(T) of T observations,

J_{PL}(M, b) = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n} \left[ x_i(t)\, m_i^T x(t) + b_i x_i(t) - \log \cosh(m_i^T x(t) + b_i) \right] + \text{const.},   (2.6)
where the constant does not depend on the parameters.

3 Consistency Proof

We now proceed to prove the consistency of the maximum pseudolikelihood estimator obtained by maximization of J_PL with respect to the parameters. The natural starting point is to analyze the points where the gradient of J_PL with respect to the parameters is zero. The point of true parameter values is one such point, as shown in the following proposition:

Proposition 1. Assume the data are generated by the distribution in equation 1.1 for parameters \tilde{m}_{ij} and \tilde{b}_i. Then the gradient of J_PL is zero at m_{ij} = \tilde{m}_{ij}, b_i = \tilde{b}_i.

Proof. We first compute the derivative of the pseudolikelihood with respect to m_{ij}, i \neq j:

\frac{\partial J_{PL}}{\partial m_{ij}} = \frac{1}{T} \sum_t \left[ x_i(t) x_j(t) - x_j(t) \tanh(m_i^T x(t) + b_i) \right].   (3.1)
A well-known property of Boltzmann machines is that

E\{x_i \mid x_{\setminus i}\} = \frac{\exp(\tilde{m}_i^T x + \tilde{b}_i) - \exp(-\tilde{m}_i^T x - \tilde{b}_i)}{\exp(\tilde{m}_i^T x + \tilde{b}_i) + \exp(-\tilde{m}_i^T x - \tilde{b}_i)} = \tanh(\tilde{m}_i^T x + \tilde{b}_i).   (3.2)

At the point where the parameters have the true values, the derivative thus becomes

\frac{\partial J_{PL}}{\partial m_{ij}}(\tilde{M}, \tilde{b}) = \frac{1}{T} \sum_{t=1}^{T} x_j(t) \left( x_i(t) - E\{x_i(t) \mid x_{\setminus i}(t)\} \right).   (3.3)
Now, by the basic properties of conditional expectations, x_i - E\{x_i \mid x_{\setminus i}\}, which is the residual in the best prediction of x_i given x_{\setminus i}, is uncorrelated with x_{\setminus i} and thus with x_j.¹ Thus, we have in the limit of T \to \infty,

\frac{\partial J_{PL}}{\partial m_{ij}} = E\{x_j\}\, E_{x_i}\{x_i - E\{x_i \mid x_{\setminus i}\}\} = E\{x_j\} \times 0,   (3.4)

because the expectation of the residual is zero: E_{x_i}\{E\{x_i \mid x_{\setminus i}\}\} = E\{x_i\}. Thus, the gradient with respect to m_{ij} is zero. As for the b_i, we obtain

\frac{\partial J_{PL}}{\partial b_i} = E\{x_i\} - E\{\tanh(m_i^T x + b_i)\},   (3.5)
which is zero by the same logic. We have proven the proposition.

We still have to make sure that this critical point is really the global maximum of the pseudolikelihood. To this end, we have to make the following assumption. Denote by \bar{x} = (x_1, \ldots, x_n, 1)^T an augmented data vector. We assume

E\{ (q^T \bar{x})^2 \cosh^{-2}(m^T x + b) \} > 0   (3.6)

for any vector q \in R^{n+1} of nonzero norm, and for any m \in R^n and b \in R. This is not a very strong assumption because, obviously, the expectation is always nonnegative (cosh is a positive function). Basically, the expectation could be zero only in some pathological cases. Now we use the concavity of J_PL, which is possible due to the following proposition:

Proposition 2. Assuming equation 3.6 and in the limit of an infinite sample, J_PL is strictly concave with respect to the vector consisting of the elements of M and b.

Proof. Since a sum of strictly concave functions is still strictly concave, we can consider each term in the sum with respect to i separately. Each such term is a function of [m_i, b_i] only. So we only have to prove that

J_i(m_i, b_i) = E\{ (m_i^T x + b_i)\, x_i - \log \cosh(m_i^T x + b_i) \}   (3.7)
¹ In general, we have for any two random variables x, y:

E_{x,y}\{(E\{y \mid x\} - y)\, x\} = \int_x \int_y \left( \int_{y'} p(y' \mid x)\, y'\, dy' \cdot x - xy \right) p(x, y)\, dx\, dy = \int_x \left( \int_{y'} p(y' \mid x)\, x y'\, dy' \right) \left( \int_y p(x, y)\, dy \right) dx - \int_x \int_y p(x, y)\, x y\, dy\, dx.

Since p(y' \mid x) \int_y p(x, y)\, dy = p(x, y'), the two terms are equal, and the difference is zero.
is strictly concave. The Hessian of J_i with respect to m_i equals

H_{m_i} J_i = -E\{ x x^T \cosh^{-2}(m_i^T x + b_i) \}.   (3.8)

The second derivative with respect to b_i equals

\frac{\partial^2 J_i}{\partial b_i^2} = -E\{ \cosh^{-2}(m_i^T x + b_i) \},   (3.9)

and the cross-derivatives equal

\frac{\partial^2 J_i}{\partial m_i\, \partial b_i} = -E\{ x^T \cosh^{-2}(m_i^T x + b_i) \}.   (3.10)

Collecting these in a single matrix, we see that the total Hessian equals

H_{[m_i, b_i]} J_i = -E\{ \bar{x} \bar{x}^T \cosh^{-2}(m_i^T x + b_i) \},   (3.11)
which is, by our assumption in equation 3.6, negative-definite for any values of the parameters. A function whose Hessian is always negative-definite is strictly concave. Thus, we have proven the strict concavity of J_PL.

This leads us finally to the theorem:

Theorem 1. Assume equation 3.6. Then the pseudolikelihood estimator is (globally) consistent for the model in equation 1.1.

Proof. A strictly concave function defined on a real space has a single maximum. If the function is differentiable (as J_PL is here), the maximum is obtained at the point of zero gradient. This would seem to prove the theorem. However, there is one additional complication, because M is constrained to be symmetric and to have zero diagonal. This is actually not problematic, since it means only that the optimization is constrained to a linear subspace. The restriction of a strictly concave function to a linear subspace is still strictly concave. Also, since the gradient is zero for the true parameter values, the projection of the gradient is zero for the true parameter values. Thus, the restrictions of symmetry and zero diagonal do not change anything. So we have proven that, in the limit of an infinite sample, the pseudolikelihood is maximized by the true parameter values alone. This implies the theorem.

4 Gradient Algorithm

Let us briefly consider how the pseudolikelihood can be maximized computationally. The simplest way of maximizing the pseudolikelihood is by gradient ascent. The relevant gradients were already given above. However, since
M is constrained to be symmetric and to have zero diagonal, the gradient has to be projected onto this linear space. Thus, we compute the symmetrized gradient,

D(\hat{m}_{ij}) = \frac{1}{2} \frac{\partial J_{PL}}{\partial m_{ij}} + \frac{1}{2} \frac{\partial J_{PL}}{\partial m_{ji}},   (4.1)

where the derivatives are given in equation 3.1 and evaluated at the current estimates of the parameters. We then update the current estimates \hat{m}_{ij}, for i \neq j only, using this projected gradient in a gradient ascent step:

\Delta \hat{m}_{ij} = \mu D(\hat{m}_{ij}) \quad \text{for all } i \neq j,   (4.2)

where \mu is a step size. As for the b_i, we can use the gradient directly and update

\Delta \hat{b}_i = \mu \frac{\partial J_{PL}}{\partial b_i},   (4.3)
where the derivative is given in equation 3.5. The algorithm we have given here is a batch algorithm, using the whole sample to calculate the pseudolikelihood. Online variants are easy to construct as well.

5 Connection to Contrastive Divergence

Contrastive divergence (Hinton, 2002) is an approximation of MCMC methods. It consists of two related ideas: first, we fix the initial values in the MCMC method to be equal to the sample points themselves, and second, we take a small number of steps in the MCMC method—perhaps just one. This is a general framework that can be applied to nonnormalized models with continuous-valued or discrete-valued variables and also to latent variable models. We shall here prove that for the model in equation 1.1, contrastive divergence is equivalent to pseudolikelihood if we use single-step Gibbs sampling, which is the most basic setting. In the general MCMC setting, the expected gradient update for m_{ij}, i \neq j, is given by (Ackley, Hinton, & Sejnowski, 1985)

\Delta m_{ij} = \hat{E}\{x_i x_j\} - E_M\{x_i x_j\},   (5.1)

where \hat{E} denotes the expectation over the sample distribution and E_M denotes the expectation over the distribution given by the model with current parameter values.
In contrastive divergence, the expected gradient update for m_{ij} is given by

\Delta m_{ij} = \hat{E}\{x_i(t) x_j(t)\} - \hat{E}\, E_{G(k)}\{x_i(t) x_j(t)\},   (5.2)

where E_{G(k)} means the expectation under the distribution given by one step of Gibbs sampling on the kth variable, that is, replacing x_k(t) by a random variable that follows the conditional distribution of x_k given all the other variables. In the simplest random update scheme, the index k is a random variable that has a uniform distribution over the indices 1, \ldots, n. Note that there are two different methods called contrastive divergence defined in Hinton (2002): one based on an objective function and the other based on an approximative gradient of that objective function. Here, we consider the latter because it is the one used in practice. As above, the expectation of the conditional distribution can be computed as

E_{G(i)}\{x_i(t)\} = \tanh(m_i^T x(t) + b_i),   (5.3)

while E_{G(k)}\{x_i(t)\} = x_i(t) for k \neq i. Now, in the second term on the right-hand side of equation 5.2, there is a probability of (n - 2)/n that the index k is equal to neither i nor j. Then the Gibbs sampling has no effect and can be ignored. With probability 1/n, k equals i, and with the same probability, it equals j. Thus, equation 5.2 equals

\Delta m_{ij} = \hat{E}\{x_i(t) x_j(t)\} - \frac{n-2}{n} \hat{E}\{x_i(t) x_j(t)\} - \frac{1}{n} \hat{E}\{\tanh(m_i^T x(t) + b_i)\, x_j(t)\} - \frac{1}{n} \hat{E}\{x_i(t) \tanh(m_j^T x(t) + b_j)\}
= \frac{2}{n} \left[ \hat{E}\{x_i(t) x_j(t)\} - \frac{1}{2} \hat{E}\{\tanh(m_i^T x(t) + b_i)\, x_j(t)\} - \frac{1}{2} \hat{E}\{x_i(t) \tanh(m_j^T x(t) + b_j)\} \right].   (5.4)

As for the parameters b_i, we obtain in a similar way

\Delta b_i = \hat{E}\{x_i(t)\} - \hat{E}\, E_{G(k)}\{x_i(t)\} = \hat{E}\{x_i(t) - \tanh(m_i^T x(t) + b_i)\}.   (5.5)
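The relationship between these expected updates and the symmetrized pseudolikelihood gradients can be checked numerically. The following sketch uses ±1-valued data and invented names; it is an illustration under those assumptions, not the author's code:

```python
import numpy as np

def cd_expected_update(X, M, b):
    """Expected single-step-Gibbs CD updates (equations 5.4 and 5.5), x in {-1, +1}."""
    T, n = X.shape
    A = np.tanh(X @ M.T + b)        # A[t, i] = tanh(m_i^T x(t) + b_i)
    C = X.T @ X / T                 # empirical second moments
    B = A.T @ X / T                 # B[i, j] = sample mean of tanh(m_i^T x + b_i) x_j
    dM = (2.0 / n) * (C - 0.5 * B - 0.5 * B.T)
    db = (X - A).mean(axis=0)
    return dM, db

def pl_symmetrized_gradient(X, M, b):
    """Symmetrized pseudolikelihood gradient (equations 3.1 and 4.1)."""
    T = X.shape[0]
    A = np.tanh(X @ M.T + b)
    gM = (X.T @ X - A.T @ X) / T    # equation 3.1, all i, j at once
    return 0.5 * (gM + gM.T), (X - A).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(100, 4))
M = rng.normal(size=(4, 4)); M = 0.5 * (M + M.T); np.fill_diagonal(M, 0.0)
b = rng.normal(size=4)

dM, db = cd_expected_update(X, M, b)
D, gb = pl_symmetrized_gradient(X, M, b)
# dM equals (2/n) * D entrywise, and the b updates coincide exactly.
```

In this toy run the m-updates agree up to the constant 2/n and the b-updates agree exactly, matching the multiplicative-constant caveat discussed in the text.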
As the gradient step size in contrastive divergence is typically taken from a sequence that converges to zero fast enough, the convergence of contrastive divergence is given by the point where the expected gradient is zero. Now, the expected gradients in equations 5.4 and 5.5 are equal (up
to some insignificant multiplicative constants) to the corresponding symmetrized gradients of the pseudolikelihood. So the two methods converge to the same points.

The convergence of contrastive divergence (the same gradient version as analyzed here) was analyzed in Carreira-Perpiñán and Hinton (2005), with the conclusion that contrastive divergence is asymptotically "biased" for the model in equation 1.1. This discrepancy with our results is due to a difference in the definition of bias. In Carreira-Perpiñán and Hinton (2005), the bias was computed as the Kullback-Leibler divergence between the distributions given by the model when the estimated parameters for contrastive divergence or likelihood are used. Thus, their conclusion was that contrastive divergence gives, in general, a different estimate from likelihood. However, they also noted that the difference disappears (asymptotically) if the data are really generated by the model, which is the case we consider here. Different variants of contrastive divergence that always give the same estimate as maximum likelihood were further developed in Carreira-Perpiñán and Hinton (2005). (See also Welling & Sutton, 2005, for related work.)
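The batch gradient algorithm of section 4 can be sketched as follows (a toy illustration with invented names; the data are coded as ±1):

```python
import numpy as np

def pl_ascent(X, n_steps=200, mu=0.1):
    """Maximize the pseudolikelihood (equation 2.6) by projected gradient ascent
    (equations 4.1-4.3) for a fully visible Boltzmann machine; X has entries +-1."""
    T, n = X.shape
    M, b = np.zeros((n, n)), np.zeros(n)
    for _ in range(n_steps):
        A = np.tanh(X @ M.T + b)
        gM = (X.T @ X - A.T @ X) / T        # equation 3.1
        D = 0.5 * (gM + gM.T)               # symmetrized gradient, equation 4.1
        np.fill_diagonal(D, 0.0)            # stay on the zero diagonal
        M += mu * D                         # equation 4.2
        b += mu * (X - A).mean(axis=0)      # equations 3.5 and 4.3
    return M, b
```

Because every step adds a symmetric, zero-diagonal increment to M, the iterates remain in the constrained linear subspace discussed in the consistency proof.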
6 Simulation Results

We performed simulations to validate the different estimation methods for the fully visible Boltzmann machine. We created random matrices M so that the elements had independent normal distributions with zero mean and a standard deviation of .5. The parameters b_i were randomly generated from the same distribution. The dimension n was set to 5, which is small enough to enable exact sampling from the distribution; this is important in order to be able to reliably validate the estimation results. We generated data from the distribution in equation 1.1 and estimated the parameters using maximum pseudolikelihood for various sample sizes: 500, 1000, 2000, 4000, 8000, and 16,000. We also estimated the parameters using ordinary likelihood for comparison; exact computation of the maximum likelihood estimator was possible due to the small dimension. For each sample size, we created five different data sets and ran the estimation once on each data set using a random initial point. For each estimation, the estimation error was computed as the Euclidean distance between the real matrix [M, b] and its estimate. Finally, we took the mean of the logarithms of the five estimation errors.

The results are shown in Figure 1. The maximum pseudolikelihood estimator appears to be consistent in the sense that the estimation error goes to zero as the sample size grows, as implied by our theorem. Surprisingly, its estimation errors are not really larger than those of ordinary maximum likelihood. Actually, the errors are almost identical; they seem to depend more on the randomly generated parameters than on the method.
Figure 1: The estimation errors of maximum pseudolikelihood/contrastive divergence (solid line) and maximum likelihood (dashed line). Horizontal axis: log10 of sample size. Vertical axis: log10 of estimation error.
7 Conclusion

We have shown that pseudolikelihood, a rather old estimation principle (Besag, 1975), provides a consistent estimator for the fully visible Boltzmann machine. This estimator turns out to be a special case of contrastive divergence. The literature on Boltzmann machines does not seem to have paid much attention to pseudolikelihood estimation so far.

We considered the fully visible case only, because that is where pseudolikelihood estimation can be directly applied. Extensions to hidden variables are an important subject for future work and have been partly addressed in work on contrastive divergence (Carreira-Perpiñán & Hinton, 2005).

Acknowledgments

This work was supported by the Academy of Finland, Academy Research Fellow position and project #106473. I am grateful to Sam Roweis and Patrik Hoyer for interesting discussions and to Miguel Carreira-Perpiñán and Geoffrey Hinton for providing access to unpublished results.
References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24, 179–195.
Carreira-Perpiñán, M. Á., & Hinton, G. E. (2005). On contrastive divergence learning. In Proc. Workshop on Artificial Intelligence and Statistics (AISTATS 2005). Barbados.
Gidas, B. (1988). Consistency of maximum likelihood and pseudo-likelihood estimators for Gibbsian distributions. In W. Fleming & P.-L. Lions (Eds.), Stochastic differential systems, stochastic control theory and applications. New York: Springer.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6, 695–709.
Kappen, H., & Rodriguez, F. (1998). Efficient learning in Boltzmann machines using linear response theory. Neural Computation, 10(5), 1137–1156.
Mase, S. (1995). Consistency of the maximum pseudo-likelihood estimator of continuous state space Gibbsian processes. Annals of Applied Probability, 5(3), 603–612.
Welling, M., & Sutton, C. (2005). Learning Markov random fields using contrastive free energies. In Proc. Workshop on Artificial Intelligence and Statistics (AISTATS 2005). Barbados.
Received September 19, 2005; accepted March 24, 2006.
LETTER
Communicated by Pawan Sinha
Images, Frames, and Connectionist Hierarchies

Peter Dayan
[email protected]
Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR
The representation of hierarchically structured knowledge in systems using distributed patterns of activity is an abiding concern for the connectionist solution of cognitively rich problems. Here, we use statistical unsupervised learning to consider semantic aspects of structured knowledge representation. We meld unsupervised learning notions formulated for multilinear models with tensor product ideas for representing rich information. We apply the model to images of faces.

1 Introduction

What do we know when we know the story of Moby Dick or the face of an acquaintance? This question, or, more formally, that of competently representing objects with hierarchical structure in connectionist systems, has a critical part to play in addressing a range of pressing challenges in computational cognitive science. Various ingenious suggestions have been made (many starting and building from the foundation provided by the seminal collection of papers in Hinton, 1991), involving a wide range of computationally sophisticated mechanisms. Much of this work is strongly motivated by aspects of predicate logic. Thus, it is concerned with discrete literals and logical terms or propositions linking them, and specifically with the idea that it should be possible to fashion a representation freshly on the fly for essentially arbitrary concepts. This rather episodic view accurately characterizes some cognitive tasks, particularly those associated with linguistics. However, there are many other tasks for which there is substantial structured semantic knowledge, that is, (typically hierarchically) structured networks of statistical relationships among a set of entities. This semantic structure provides a framework within which episodic information should be viewed. The example we use in this article is visual images of faces.
There are numerous strong statistical constraints in such images—for one simple instance, the close similarity of the two eyes or two ears—and it is these that we seek to capture.

Neural Computation 18, 2293–2319 (2006)    © 2006 Massachusetts Institute of Technology

1.1 Unsupervised Learning. The requirements for this view amount to finding this semantic structure and using it to practical effect. We seek
to do both of these in a connectionist context, with distributed representations and without explicit pointers and the like. One obvious direction to turn for ideas and methods is statistical unsupervised learning algorithms (see Hinton, 1990; Rao, Olshausen, & Lewicki, 2002), which are explicitly designed to extract and represent semantic structure of various sorts, and whose connectionist credentials are burnished by their widespread use for modeling the nature and development of the tuning properties of cortical neurons. However, bar a few exceptions (notably for our work, Tenenbaum & Freeman, 2000, and, under a somewhat whiggish interpretation, Grimes & Rao, 2005), they have not been much applied to hierarchical structure.

Versions of unsupervised learning based on density estimation can be viewed in the informal terms of characterizing the statistical structure of the input patterns in terms of low-dimensional manifolds and finding a coordinate system that parameterizes these manifolds. For instance, G. Hinton (personal communication, 1994) has estimated that images of faces live in a roughly 30- to 40-dimensional implicit space, embedded in the huge numbers of dimensions of pixel-based inputs. Edelman (1999) presents an excellent discussion of this sort of representation, albeit somewhat divorced from the context of statistical unsupervised learning. The way the manifolds are embedded in the space in which the input lives captures the overall statistical constraints among the collection of patterns. The manifolds are useful in that individual examples can be represented in terms of their coordinates. The resulting representation system, if learned correctly, provides an optimally compact representation for new inputs drawn from the same distribution. It explicitly does not, however, provide sensible coordinates for inputs that come from different distributions.
It is intended to solve a different problem from that of representing arbitrary episodic structure. Unsupervised learning algorithms that employ strong priors can make strong claims for the manifolds and coordinate systems they extract, in the sense of finding things like underlying independent structure in the collection of examples.

In sum, we consider the problem of discovering and using semantic structure in domains in which it has inherently hierarchical forms. Although it is an important task for the future, in this letter, we do not seek to find the hierarchy itself (for images, this is provided naturally by the focus of attention) but rather to elucidate its representational implications.

1.2 Representations for Visual Objects. Connectionist representations sit at levels of detail and abstraction above those of neurally realizable codes. We focus on a relatively narrow and concrete question about the representation of hierarchical structure in a particular domain, and therefore adopt three broad constraints and (gross) simplifications associated with the sort of visual images we use: segmentation, invariance, and distributed representations. Further, again to focus on representation, we do not attempt to solve the challenging general problem of detection and classification of
faces and other objects in images, a task that over the past several years has attracted a number of powerful and probabilistic approaches (Burl, Leung, & Perona, 1995; Burl, Weber, & Perona, 1998; Schiele & Crowley, 1996, 2000; Fei-Fei, Fergus, & Perona, 2003; Fergus, Perona, & Zisserman, 2003; Liebe & Schiele, 2003, 2004; Schneiderman & Kanade, 2004; Amit & Trouvé, 2005; Sudderth, Torralba, Freeman, & Willsky, 2005; Crandall, Felzenszwalb, & Huttenlocher, 2005). We discuss the relationship between our work and these ideas later.

Issues of segmentation and invariance mostly have to do with preprocessing. We help ourselves to a mechanism capable of extracting the elements of a scene (e.g., a whole face, an eye, a nose) at appropriate scales, in normalized coordinates. This is exactly the intent of Olshausen, Anderson, and Van Essen's (1993) explicit shifter circuit, and of the recent architecture of Amit and Mascaro (2003), which powerfully integrates detection and recognition. It also underlies von der Malsburg's (1988) dynamic link architecture, and indeed has some resonances in the more bottom-up invariance sought in architectures such as the MAX model (Riesenhuber & Poggio, 1999). Even in the face of the limited evidence about basis functions associated with the focus of attention (Connor, Gallant, Preddie, & Van Essen, 1996) for achieving an equivalent of shifting (Pouget & Sejnowski, 1997), this is obviously a large simplification. We justify it on the basis of our key interest in the question of representation. We also make the great simplification of restricting the manifold to be a mixture of factor analysers (Hinton, Dayan, & Revow, 1997). This does lead to a distributed code, but one that is obviously far too simple to reflect faithfully the sort of population code representation that we might legitimately expect in the brain (Pouget, Dayan, & Zemel, 2000).

1.3 Tensors and Distributed Representations.
Smolensky (1990) suggested that tensor products are the natural means for representing and manipulating structured knowledge that is represented as distributed patterns of activity over multiple units. This idea has exerted significant influence over a wealth of subsequent work in the field, including, for instance, the approaches of Plate (1995, 2003) and Gayler (1998), who have studied generalizations and simplifications of tensor products, and also the community working on recursive autoassociative memories (Hinton, 1990; Pollack, 1990; Sperduti, 1994). Our work, which is an extension of Riesenhuber and Dayan (1996), also fits comfortably into this tradition, albeit in the context of semantic statistical ideas of unsupervised learning and thus the multilinear modeling framework of Tenenbaum and Freeman (2000; see also Vasilescu & Terzopoulos, 2002, 2003, and Grimes & Rao, 2005). Compared with Riesenhuber and Dayan (1996), we consider a much richer domain of visual objects (Blanz & Vetter, 1999), and employ a more powerful unsupervised learning algorithm that can also automatically cluster the objects into classes.
In section 2 we describe the statistics of a structured domain, using a form of discrete, multiscale representation of images of faces as an example. We also describe the multilinear, unsupervised learning model that we employ to capture these statistics. In section 3, we generalize this approach to encompass unsupervised clustering of separate object classes or subclasses. Finally, in section 4, we consider how our model fits in with other ideas on structured knowledge representation and present some more speculative notions about domains rather far removed from face images.

2 Multilinear Models

The critical representational notion in this article is that hierarchically structured image objects should be considered as mappings from some form of generalized focus of attention or eye position e to a form of observation x that would be made at that focus of attention, that is,

image object : attentional position ⇒ observation,
I : e ⇒ x.   (2.1)
Connor et al.'s (1996) findings on the effects of the focus of attention on receptive fields in area V4 of visual cortex underlay Riesenhuber and Dayan's (1996) suggestion of a model of exactly this form (see also Salinas & Abbott, 1997). However, mappings of this sort date back at least to ideas on the interactions between action and observation (see, e.g., the extensive discussion in Bridgeman, van der Heijden, & Velichkovsky, 1994). In terms of a statistical generative model (MacKay, 1956; Neisser, 1967; Grenander, 1976–1981; Mumford, 1994; Hinton & Zemel, 1994; Dayan, Hinton, Neal, & Zemel, 1995; Olshausen & Field, 1996; Hinton & Ghahramani, 1997), often ascribed to feedback connections between cortical areas, the mapping in equation 2.1 suggests that correlations among the observations x should be explained by two structural features of the inputs: the existence of multiple attentional foci for the same underlying object and the semantic (and episodic) structure of the image objects I.

Figure 1 illustrates one way to conceive of the generative structure in images of faces and also shows how we generated the training data for this letter. We used the face images from Blanz and Vetter (1999) together with fiducial markers (T. Vetter, personal communication, May 2005; M. Riesenhuber, personal communication, May 2005; Riesenhuber, Jarudi, Gilad, & Sinha, 2004) that locate particular features (such as the pupils of the eyes) in each image. The markers for the faces in the database were labeled by hand. As discussed in section 1, doing this automatically is an important task for preprocessing and is one of the key computations in models such as Olshausen et al.'s (1993) and Amit and Mascaro's (2003) shifter models; however, we do not model it explicitly here.
Figure 1: Hierarchical image decomposition. The left and right eyes, the nose, and the mouth of 190 faces from the Blanz and Vetter (1999) database (an example is shown in the top left, with the segments defined by fiducial markers) are warped into the common reference frame specified by one of the faces. Seven different images at three different resolutions are defined for each warped face (the four separate subparts at the highest resolution, the two eyes together, and the nose and mouth together at medium resolution, and all of them combined at the lowest resolution), and are projected onto the top 20 principal components of the seven separate covariance matrices. The 1488 pixels are therefore represented three times over (once per resolution) in 140 numbers (middle rectangular block). The subparts of the face and the face itself can be reconstructed from these coefficients to quite high fidelity (lower figures; with the irregular outlines showing which parts defined the foci of attention).
The markers create a linear object class representation of the faces (Vetter & Poggio, 1997; Beymer & Poggio, 1996), which allows them to be warped into a common reference frame, which we arbitrarily define based on the first face in the database (we could equally well have used the average face).
For simplicity, we concentrate in this article on the two eyes, the nose, and the mouth. The top left-hand image in Figure 1 is an example face from the database. The irregular lines delimit regions containing the eyes, nose, and mouth, using the fiducial markers. These regions are then separately warped into canonical coordinates, defined as those of the eyes, nose, and mouth of the first face in the database. The image at the top right of the figure shows the result of this warping. The full images in the database are defined over 100 × 100 = 10,000 pixels; in the canonical representation, the right eye, left eye, nose, and mouth are defined by 433, 394, 310, and 351 pixels, respectively (since these are the sizes of these features in the first face).

We assume that subjects can set their focus of attention at one of three resolutions, and can thereby attend to one of seven discrete parts or subparts. At the highest resolution, the four individual elements of the face can be separately attended; at a medium resolution, either the two eyes or the nose and mouth together can be selected; at the lowest resolution, all four parts are represented collectively. The difference in resolution arises since items in the focus of attention are represented in a fixed-size structure, so, for instance, the fidelity with which the full face can be represented is roughly a quarter that of the individual elements. In practice, we create this fixed structure by projecting the full input onto a fixed number d (d = 20 in the figure) of the principal eigenvectors of their covariance matrices (using separate covariance matrices for each of the seven resolutions). In terms of the relationship in equation 2.1, an observation x is the reduced, d-dimensional description of one element of a face at one resolution. Principal component analysis (PCA) is exactly the outcome of the simplest Hebbian unsupervised learning algorithm applied to the warped images (Linsker, 1988).
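The fixed-size coding step can be sketched generically (this is ordinary PCA with invented names, not the authors' code):

```python
import numpy as np

def pca_codes(parts, d=20):
    """Project one image part (one row per face) onto its top-d principal
    components, giving a fixed-size, d-dimensional code per face."""
    mean = parts.mean(axis=0)
    centered = parts - mean
    cov = centered.T @ centered / len(parts)
    vals, vecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    top = vecs[:, np.argsort(vals)[::-1][:d]]    # d leading eigenvectors
    return centered @ top, mean, top

def reconstruct(codes, mean, top):
    """Map the d coefficients back to (approximate) pixels."""
    return codes @ top.T + mean
```

With d equal to the full input dimension, the reconstruction is exact; with a small d it is the compressed, fixed-size description used in the text.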
Performing PCA is sensible because of the linear class structure created by the fiducial markers (Vetter & Poggio, 1997; Beymer & Poggio, 1996). In our highly simplified description, we consider the warping and projection to happen at the lowest levels of visual processing in both recognition and generative directions. We thus have eigenfaces (Turk & Pentland, 1991) plus equivalent eigenanalyses for the six other substructures. One can alternatively think of these coefficients as part-specific features of the input. The middle panel of Figure 1 shows the seven separate sets of coefficients for the particular face, and the three lower panels show how well these coefficients can reconstruct the parts of the face at the different resolutions. The irregular lines show which parts were separately decoded from the coefficients and then pasted together. For the left image, the 4 collections of 20 coefficients representing the individual subparts have been separately decoded and pasted to generate a single image. For the middle image, at an intermediate resolution, the pair of 20 coefficients has been used—one for the two eyes together and one for the nose and mouth together. For the right image, at the lowest resolution, only a single set of coefficients has
been used for all the subparts. The inset images show the principal components and the mean at this lowest resolution. If one looks closely, this reconstruction (depending on only 20 coefficients) is a little worse than that at the high resolution (depending on 80), but the difference is relatively subtle. However, note that the faces are certainly not all the same; for instance, the mean face looks quite different from this particular example. In total, including all possible resolutions, the complete input associated with each face lives in a 7 × 20 = 140-dimensional space. One way of illustrating the overall task for unsupervised learning is through the covariance matrix of all the faces in this space. That we normalized the dimensionality of each input using PCA implies that there is no off-diagonal structure within the 20 × 20 blocks along the diagonal of this full covariance matrix, but we can expect substantial structure between the blocks, because of correlations between the features of the subparts (e.g., the eyes are usually similar to each other), and the relationships between the different resolutions. Note that the PCA at the lower resolutions was formally separate from the PCA at the higher resolution, so the coefficients are not trivially related to each other. The rows of Figure 2 show the top few eigenvectors of the full covariance matrix, ordered by increasing eigenvalue (every fifth one of which is shown on the left of the figure). The eigenvectors can be thought of in 7 sets of 20 columns arising from the image substructures, whence some interrelationships are apparent. For instance, the “rightwards” structure in the eigenvectors arises since PCA was used to generate the fixed-size representations of all seven elements of each complete face description, and the resulting coefficients are also ordered. 
Also, the forms of the weightings in the components associated with similar parts (e.g., left and right eyes) are somewhat similar, even though separate eigenanalyses were used to generate the 20 coefficients per component. To capture the common structure among the component coefficients of the faces and their parts, we consider a φ-dimensional hidden or latent space (φ ≤ 140). That is, we consider the full representation of a face to be a φ-dimensional vector h from which the PCA coefficients x associated with each of the seven elements can be generated. In terms of relationship 1, this entity should parameterize a map from attentional focus e to the PCA-reduced, d-dimensional observation x that could be made at that focus of attention. Following Tenenbaum and Freeman (2000), we do this using a bilinear model, with
\[ x_i = \sum_{jk} O_{ijk}\, e_j h_k + \eta_i, \tag{2.2} \]
where \(\eta_i\) is component-wise independent noise and O is an observation or imaging tensor that specifies how the latent description h of the face determines the mapping from attentional focus to observation.
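A minimal numpy sketch of the bilinear imaging step in equation 2.2, with illustrative dimensions and random parameters (none of the numbers below come from the paper's fitted model):

```python
import numpy as np

# Bilinear imaging model of equation 2.2: x_i = sum_jk O_ijk e_j h_k + eta_i.
# All dimensions and parameters below are illustrative.
rng = np.random.default_rng(1)
d, n_foci, phi = 20, 7, 40                # observation dim, foci, latent dim
O = rng.normal(size=(d, n_foci, phi))     # observation / imaging tensor
h = rng.normal(size=phi)                  # latent description of one face
e = np.zeros(n_foci)
e[2] = 1.0                                # one-hot focus of attention
eta = 0.01 * rng.normal(size=d)           # component-wise independent noise

x = np.einsum('ijk,j,k->i', O, e, h) + eta
```

With a one-hot e, the contraction over j simply selects the slice `O[:, 2, :]`, which corresponds to the asymmetric form of the model discussed later in this section.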
Figure 2: Eigenvectors and eigenvalues of the covariance matrix for the full 140 × 140–dimensional representation of the faces. The structure in the 140-dimensional space is evident for the eigenvectors with the largest eigenvalues. The eigenvalues of every fifth eigenvector are shown.
The last required element of the model is to allow for different classes of faces. We do this by considering a mixture model. This adds significant representational power, which is necessary given the highly constrained factor analysis–based representation that we are employing. It can also be seen as an abstraction of the sort of population code representation ubiquitously employed by the cortex. We consider each class (each mixture component) as being a separate (informal) manifold in h space. We describe the manifold of class c by a φ-dimensional mean value \(\nu^c\) and a ψ × φ–dimensional factor loading matrix \(G^c\), where ψ is the true underlying dimension of the manifold. This makes the full latent description of a specific face from the class be
\[ h = g \cdot G^c + \nu^c, \tag{2.3} \]
where g are the (episodic) ψ-dimensional factor values for this specific face and are assumed to have an identity covariance matrix.
Thus, in total, we have the multilinear model
\[ x_i = \sum_{jk} O_{ijk}\, e_j \Big( \sum_l g_l G^c_{lk} + \nu^c_k \Big) + \eta_i. \tag{2.4} \]
We consider the parameters of the imaging model O to be fixed for all classes of images, since they share a single latent space; the parameters of each class to be \(G^c\) and \(\nu^c\); and the parameters of a particular face within a class (its unique episodic description) to be the factors g. Figure 3 shows the full generative model in the case that there are two classes (c ∈ {1, 2}) with a handful of faces assigned (for the moment, arbitrarily) to each. The episodic descriptions of the faces from each class are shown as rectangles containing their ψ-dimensional descriptions (g). These, via factor loadings (\(G^c\)) and together with class-specific mean values \(\nu^c\) (not shown), specify a location in a common φ-dimensional space (h), which acts as the model's hidden representation of the 140-dimensional full representation of the face. The focus of attention (e) acts in a multilinear manner to select which resolution and which subpart should be imaged. This creates the canonical (in this case, 20-dimensional) representation x of the part or subpart, which can then be imaged in canonical (warped) coordinates by reversing (as best as possible) the projection from the collection of eigenvectors used to create the reduced input representations. In the terms of Tenenbaum and Freeman (2000), the imaging process in equation 2.2 is symmetric, with h and e being treated equally. Given that there are only a few possible foci of attention, we can also consider an asymmetrical model with \(O^e_{ik} = \sum_j O_{ijk} e_j\) for the vector e associated with attentional focus e. In this case, instead of using a φ-dimensional mean vector \(\nu^c\) for each class, it is easiest to use a 7d-dimensional mean for each class and attentional focus. This has the disadvantage of not capturing the fact that there is coordinated structure in the mean coming from the observation process.
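Chaining equations 2.3 and 2.4 gives the full generative path from class to observation; a minimal numpy sketch with synthetic, illustrative parameters (not the fitted model):

```python
import numpy as np

# Class-conditional generative path: g -> h (equation 2.3) -> x (equation
# 2.4, noiseless). All sizes and parameters are synthetic illustrations.
rng = np.random.default_rng(2)
d, n_foci, phi, psi = 20, 7, 40, 10
O = rng.normal(size=(d, n_foci, phi))     # shared imaging tensor
G_c = rng.normal(size=(psi, phi))         # factor loadings of class c
nu_c = rng.normal(size=phi)               # latent mean of class c

g = rng.normal(size=psi)                  # episodic factors ~ N(0, I)
h = g @ G_c + nu_c                        # latent description (equation 2.3)
e = np.zeros(n_foci)
e[0] = 1.0                                # attend to the first focus
x = np.einsum('ijk,j,k->i', O, e, h)      # observation (equation 2.4, no noise)
```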
However, it has the didactic advantage of making more meaningful the comparisons between different values of the hidden dimension φ, uncorrupted by errors in the means. We therefore use this variant in the figures below. Concomitantly, we allow each (of the e = 1, . . . , 7) distinct attentional foci to have separate, independent noise terms and so make \(\eta \sim \mathcal{N}[0, U]\), with diagonal covariance matrix
\[ U = \operatorname{diag}\big(\upsilon^e_1, \ldots, \upsilon^e_d\big) \tag{2.5} \]
consisting of the uniquenesses \(\upsilon^e_i\) (the independent variance terms) for each component i and attentional focus e. In this case, again dropping the class label c for convenience, we can write
\[ x^e_i = \nu^e_i + \sum_k O^e_{ik} \sum_l g_l G_{lk} + \eta^e_i. \tag{2.6} \]
Figure 3: Generative model. Face factors g in one class (2) are mapped through a class-specific factor loading matrix G2 into a hidden latent representation h and are transformed by the observation tensor O to give the reduced representation x of one of the subparts, from which the (warped) image can be reconstructed via the principal components. Top-down control (the transparent gray blobs) acts to control the choice of face and the choice of attentional focus, which influences the use of O and the reconstruction process. The warping is not shown. There is one collection of eigenvectors for each of the right eye, left eye, nose, mouth, both eyes, mouth and nose, and the whole face.
Having specified a rather rich representational structure for the images, we use unsupervised learning to infer the parameters. In section 3, we consider the case that we are ignorant of the true class of each face, turning to the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). If, however, we do know the classes (in this section, we arbitrarily assign half the faces to one class, the other half to the other), then a maximum likelihood fit of the parameters \(\nu^e\), \(O^e\), \(G^c\), and U to observed data amounts to a form of weighted least-squares minimization, where the weights arise as part of the full gaussian model. In this section, we consider the related unweighted least-squares problem for which Tenenbaum and Freeman (2000) suggested a solution method involving singular value decomposition in an inner loop. To encompass the next section, in the full weighted problem that has to be solved, we use a conjugate gradient scheme (Carl Rasmussen's minimize). As is conventional, we add a baseline to \(\upsilon^e_i\) to prevent the problem from becoming ill conditioned. The unweighted case studied by Tenenbaum and Freeman (2000) can be seen as introducing extra parameters g for each face and then minimizing with respect to \(\nu^e\), \(O^e\), \(G^c\), and g the mean square error,
\[ \sum_{ei} \Big( \nu^e_i + \sum_k O^e_{ik} \sum_l g_l G_{lk} - x^e_i \Big)^2, \tag{2.7} \]
averaged over all the faces in all the classes (which only share \(O^e\)). In this case, we can readily judge the model by considering the reconstructions of the reduced representations \(x_i\),
\[ \hat{\nu}^e_i + \sum_k \hat{O}^e_{ik} \sum_l \hat{g}_l \hat{G}_{lk}, \tag{2.8} \]
of the inputs at each attentional focus arising from each face associated with the optimized values. (With the important exception of the prior over the factors, this is very similar to the outcome of a noise clean-up process associated with use of the generative model.) Figure 4 duly shows the result of this optimization in various ways. Figure 4A shows the reconstruction error per pattern as a function of φ, the underlying dimension of the hidden space, and ψ, the number of hidden factors. Reconstruction is already quite good for a hidden dimension of around 30 or 40 and around 20 to 30 factors. As might be expected, in the face of multilinearity, it is not generally very useful to trade off φ and ψ; the reconstruction is high quality only if both are adequate. This is particularly true for this case of only two classes of face. Figure 4B shows how the multiple possible observations of a single face are reconstructed as a function of the number of factors ψ, using as a hidden
Figure 4: Reconstruction and reconstruction error for the multilinear model. (A) Mean square reconstruction error per full (7 × 20 = 140-dimensional) representation of the faces from two classes as a function of the dimensionality φ of the hidden space and the number ψ of the factors within each space (for comparison, the mean square weight of the representations is 3.7). (B) Reconstruction of a single face pattern (the lowest three images, showing high, medium, and low resolutions as in the bottom row of Figure 1 for φ = 30 dimensions and ψ = {10, 20, 30} factors). (C) Histograms of the reconstruction errors for φ = 30, 70 for various numbers of factors.
dimension φ = 30. The model generates the reduced observations x1 , . . . , x7 , and these have then been mapped into the canonical face coordinates, just as in the bottom row of Figure 1. Again, the differences are rather subtle, and the reconstruction is quite competent even for relatively few factors. This arises because of the redundancy in the full 140-dimensional representation of the faces. Figure 4C shows the quality of reconstruction in a slightly different manner. Each subplot shows a histogram of the errors in reconstructing all the elements of the xi for one whole class of faces. The upper row is for a hidden dimension of φ = 30, the lower for a hidden dimension of φ = 70. Along the rows, the number of factors ψ increases. Again, the high quality of the reconstruction is readily apparent.
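For fixed ν, O, and G, fitting the factors g of a single face under the squared error of equation 2.7 is ordinary linear least squares; a minimal numpy sketch follows, with illustrative dimensions and synthetic "observations" rather than the paper's data:

```python
import numpy as np

# Inner problem of equation 2.7: with nu, O, G fixed, the factors g of one
# face minimizing the squared error solve a linear least-squares problem.
rng = np.random.default_rng(3)
d, n_foci, phi, psi = 20, 7, 40, 10
O = rng.normal(size=(n_foci, d, phi))     # O[e, i, k]
G = rng.normal(size=(psi, phi))           # G[l, k]
nu = rng.normal(size=(n_foci, d))         # nu[e, i]

# Design matrix P[(e,i), l] = sum_k O[e,i,k] G[l,k], flattened over (e, i).
P = np.einsum('eik,lk->eil', O, G).reshape(n_foci * d, psi)

g_true = rng.normal(size=psi)
x = nu + (P @ g_true).reshape(n_foci, d)  # noiseless synthetic observations

g_hat, *_ = np.linalg.lstsq(P, (x - nu).ravel(), rcond=None)
```

With noiseless synthetic data, the least-squares fit recovers the generating factors exactly.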
A further way to test the model's ability to capture the structure of the domain is to see how well it can construct one part of a face from other parts. If we know which class the face comes from and the attentional focus of a given sample x, then we can reconstruct the mean (and variance) of the observations at all the other possible attentional foci. Under gaussian assumptions, the best way to do this is to use the full (in this case, 140 × 140–dimensional) covariance matrices shown in Figure 2. Consider the case that we observe a face from its first attentional focus \(x_1\). Then write the covariance matrix for the class as
\[ \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{1\bar{1}} \\ \Sigma_{1\bar{1}}^{T} & \Sigma_{\bar{1}\bar{1}} \end{pmatrix}, \]
where 1 represents all the (in this case, 20) indices associated with the first attentional focus, and \(\bar{1}\) the (120) indices associated with the other attentional foci. In this case, for jointly gaussian \(x_1, x_2, \ldots, x_7\), we have the conditional means
\[ E[x_2, \ldots, x_7 \mid x_1] = (x_1 - \bar{x}_1) \cdot \Sigma_{11}^{-1} \Sigma_{1\bar{1}} + [\bar{x}_2, \ldots, \bar{x}_7], \]
where the \(\bar{x}_i\) are the unconditional means of the observations. Remember that the key parts of the reconstructions are therefore the deviations from their means of the reduced representations of the subparts. Figure 5 shows this for the first class of face. Each small graph shows a histogram of the errors in reconstructing the part shown in the icon in the column from the part shown in the icon in the row. These errors are normalized by the standard deviations of the reconstructed parts so that they are comparable. Various features of these histograms are in accord with obvious intuitions. For instance, given the whole, low-resolution face, the reconstruction of all the other resolutions is good, with the medium resolution easier to reconstruct than the others. The medium-resolution depiction of the combined mouth and nose supports reconstruction of the high-resolution mouth and nose representations much better than it does the high-resolution eye representations, and conversely. The reconstruction of the nose from the eyes or the mouth is superior to the reconstruction of any of the other high-resolution parts. However, the predictions in Figure 5 are based on the nearly \(10^6\) components of the class-conditional covariance matrices. We seek reconstruction based on our factor analysis model. Here, given a sample, such as \(x_1\), we infer the distribution over the unknown factors g associated with the whole face, which we can then use to synthesize approximations to \(x_2, \ldots, x_7\). In the next section, we do this (implicitly) using the full factor model; here, we
Figure 5: Reconstruction of the reduced representations. The plots show histograms of the errors in reconstruction at high, medium, and low resolutions (columns) based on inputs at each of these resolutions (rows), using the full covariance matrix for the first class of faces. The errors are normalized by the standard deviations of the representations of the reconstructed part to make them directly comparable.
use the solution from the unweighted least-squares problem and therefore an empirical sample factor covariance matrix,
\[ \hat{\Sigma}_{ij} = \hat{g}_i \hat{g}_j, \tag{2.9} \]
averaging over the samples, and uniquenesses,
\[ \hat{\upsilon}^e_i = \Big( \hat{\nu}^e_i + \sum_k \hat{O}^e_{ik} \sum_l \hat{g}_l \hat{G}_{lk} - x^e_i \Big)^2 \tag{2.10} \]
Figure 6: Reconstructions of all the parts from the full trilinear model, using ψ = 20, 40 factors (rows) and φ = 10, 30, 70 hidden dimensions (columns). Each plot is exactly as in Figure 5, with the same limits for each graph.
using the best fit \(\hat{g}\) to the full \(x_1, \ldots, x_7\). Then if we write the ψ × d–dimensional matrices,
\[ \hat{P}^e_{li} = \sum_k \hat{O}^e_{ik} \hat{G}_{lk}, \]
we have, approximately,
\[ E[g \mid x_1] = \big( \hat{\Sigma}^{-1} + \hat{P}^1 \cdot [\hat{U}^1]^{-1} \cdot \hat{P}^{1T} \big)^{-1} \cdot \hat{P}^1 [\hat{U}^1]^{-1} (x_1 - \hat{\nu}^1), \]
where \(\hat{U}^1\) snips out just the uniquenesses associated with the attentional focus that is actually observed. For new attentional focus e, we have
\[ E[x^e_i \mid x_1] = \hat{\nu}^e_i + \sum_k \hat{O}^e_{ik} \sum_l E[g_l \mid x_1] \hat{G}_{lk}. \tag{2.11} \]
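The one-focus inference and cross-focus prediction of equation 2.11 can be sketched as follows; this is a synthetic illustration with an identity factor covariance and made-up dimensions, not the fitted model:

```python
import numpy as np

# Infer factors from a single focus, then predict another focus (equation
# 2.11). Identity factor covariance; all dimensions are illustrative.
rng = np.random.default_rng(4)
d, psi = 20, 10
P1 = rng.normal(size=(psi, d))            # P-hat for the observed focus
P2 = rng.normal(size=(psi, d))            # P-hat for a new focus e
nu1 = rng.normal(size=d)
nu2 = rng.normal(size=d)
Sigma = np.eye(psi)                       # factor covariance (equation 2.9)
U1 = 0.1 * np.ones(d)                     # uniquenesses at the observed focus

g_true = rng.normal(size=psi)
x1 = nu1 + g_true @ P1                    # noiseless sample at focus 1

# Posterior mean of the factors given x1 only.
A = np.linalg.inv(Sigma) + (P1 / U1) @ P1.T
g_post = np.linalg.solve(A, (P1 / U1) @ (x1 - nu1))

x2_pred = nu2 + g_post @ P2               # predicted observation (eq. 2.11)
```

With small uniquenesses and a noiseless sample, the posterior mean is close to the generating factors, shrunk slightly toward zero by the prior.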
Figure 6 shows reconstruction (for testing data) according to equation 2.11, using the same format (and the same limits for each individual plot) as Figure 5 and for various values of φ and ψ. The first thing to notice is that when there are sufficient factors and dimensions (φ = 30; ψ = 20), reconstruction is nearly indistinguishable from that involving the full covariance matrix (in Figure 5). This is despite the use of many fewer parameters. For
Figure 7: Factor effects. Each row shows the effect of a unit change to a single factor on high-, medium-, and low-resolution images, shown in the same format as the bottom row of Figure 1. The four rows are for the four hidden factors that exert the greatest influence over the right and left eyes, nose, and mouth (top to bottom). Influence can be either positive (bright) or negative (dark).
too small a hidden dimension, there is a near-uniform degradation in the quality of all the reconstructions. As a final view on the multilinear model, we can use the linearity to map the hidden factors back into the image to see the effect of changing one of their values on the whole face. First, we assess which hidden factor exerts the greatest single influence over each of the four high-resolution parts. We then calculate the net effect of changing this factor on all parts of the image and at all resolutions by multiplying together the various observation and factor matrices, and projecting the resulting change back into the full image. Figure 7 shows the result. There is one row each for the factor with the strongest influence over right and left eye, and nose and mouth; each
column shows full images constructed from the subparts at high, medium, and low resolutions, just as in the bottom row of Figure 1. The most obvious aspect of these plots is that the factors chosen for maximal effect on one subpart do indeed have a greater effect on this subpart than on the others. However, they are much more promiscuous than one might have expected from the reconstruction plots in Figure 6, in which there appeared to be a rather modest effect of one subpart on others. For instance, in these factor effect plots, the changes to the two eyes are almost the same for both factors. There are also more subtle effects—for example, the factor that changes the left eye the most also has a tendency to change the left part of the nose. These plots also show the correct operation of the observation hierarchy in that the changes to parts at the high resolution are replicated at lower resolutions. This was not a foregone conclusion—there are separate parameters for \(O^e\) for different attentional foci e.

3 Clustering

We have so far assumed perfect knowledge of the classes from the outset (using an arbitrary division into two equally sized groups). However, this is clearly unreasonable, and we should also infer the classes from the data themselves. As Tenenbaum and Freeman (2000) noted in their bilinear work, the EM algorithm (Dempster et al., 1977) is ideal for this, provided that we have a fully probabilistic model for each class. In the E step, the posterior probability that each input face comes from each class is assessed. In the M step, these posterior probabilities are used as class-specific importance weightings for each face when both the parameters associated specifically with each class and the common parameters are updated. In our case, the model of equation 2.4, together with the basic assumptions, amounts to a full generative model (albeit in the eigenfeatures x rather than the pixel input).
It is therefore straightforward to perform an exact E step, estimating the posterior responsibilities of the clusters for the data. Here, we consider operating in the regime in which it is possible to study a face from all possible attentional foci in order to calculate the posterior responsibilities. However, the generative model would also make it possible to do incremental learning based on only partial views—x only from a few attentional foci. Since it is necessary to estimate the uniquenesses, learning is more brittle than for the previous section. We therefore execute only a partial M step, improving the estimates of the parameters of each class given these responsibilities. We also anneal the minimum uniqueness as a way of avoiding premature convergence. Once again, we use Rasmussen’s minimize. Figure 8 shows the result of performing clustering on the entire collection of faces. The faces are relatively homogeneous, and so we do not expect a strong underlying cluster structure. In fact, on synthetic data that actually satisfy the precepts of the full generative model, EM finds the true clusters
Figure 8: Unsupervised clustering of the face classes. (Column 1) Posterior responsibilities of each of the four clusters for the 190 faces. (Column 2) Deviations of the mean face of each class (those for which the posterior responsibility is greater than 0.8) from the overall mean face. (Column 3) Histograms of the errors in reconstructing the within-class face representations. (Column 4) Histograms of the errors in reconstructing the representations of faces from the other classes. Here φ = ψ = 70. For this figure, the partiality of the M-step involved a fixed number of line searches in minimize for \(\hat{O}\), \(\hat{U}\), and \(\hat{G}\), and a learning rate of 0.01 for changing the prior responsibilities of the clusters. The initial minimal uniqueness was 0.9 and was annealed toward 0.1 at a rate of 0.995 per iteration. We weakened the prior over g by multiplying the data by a factor of 10. The histograms are directly comparable with those in Figure 4.
(data not shown). The left column shows the posterior responsibilities of each of four classes for all 190 faces. EM does indeed assign faces to each class, though with varying frequencies (27%, 47%, 7%, and 19%, respectively). The second column shows how the mean (warped) face from each class deviates from the overall mean face—some reasonable structure is apparent, such as different overall skin tone. The third and fourth columns show minimal evidence of efficacy of the clustering in that histograms of errors in the reconstructions of all the (reduced representations of the) faces show that within-class faces (third column) are reconstructed more proficiently than out-of-class faces (fourth column).
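The E step described above can be sketched for a plain mixture of gaussians; here diagonal covariances stand in for the factor-analysis classes, and the faces and all sizes are synthetic illustrations:

```python
import numpy as np

# E step: posterior responsibility of each class for each face, under a
# mixture of diagonal gaussians (a stand-in for the factor-analysis
# classes; the 190 "faces" here are synthetic).
rng = np.random.default_rng(5)
n_faces, dim, n_classes = 190, 140, 4
X = rng.normal(size=(n_faces, dim))
means = rng.normal(size=(n_classes, dim))
var = np.ones((n_classes, dim))
prior = np.full(n_classes, 1.0 / n_classes)

# Stable log joint log p(x, c), then normalize over classes.
log_joint = (np.log(prior)
             - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
             - 0.5 * (((X[:, None, :] - means[None]) ** 2) / var[None]).sum(-1))
log_joint -= log_joint.max(axis=1, keepdims=True)
resp = np.exp(log_joint)
resp /= resp.sum(axis=1, keepdims=True)   # posterior responsibilities
```

These responsibilities then act as the per-face importance weights in the (partial) M step.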
Figure 9: Class-specific factor effects. Each column is associated with one class, each row with the latent factor within the class having the maximal impact on the relevant part. Each figure shows the net effect on the high-resolution version of each image of a unit change in the factors. After Figure 7.
Figure 9 shows another view of the differences between the different classes, using the same underlying scheme as in Figure 7. Here, each column is associated with a single class, showing in successive rows the impact on the high-resolution version only (equivalent to just the left-hand column of Figure 7) of the factor with the maximal effect on the right eye, left eye, nose, and mouth. The differences between the different classes are quite marked, despite the fact that all the effects happen through the medium of the same hidden space (h).

4 Discussion

In this letter, we have considered the representation of hierarchically organized classes of images whose key structuring entity is an explicit variable:
a focus of attention. We showed how to build a factor analysis–based generative model for such classes, and how it can be inferred from data. This was in both a simple case, in which class identity was assumed to be known, and a richer case, involving a mixture-generative model and the EM algorithm, in which unsupervised clustering is also essential. We used the face data from Blanz and Vetter (1999) as our key example. Our prime objective has been to investigate how a single representation can encompass multilevel, statistical, hierarchical structures of different identifiable sorts. Only after we understand this better, perhaps also in richer statistical models than the multilinear gaussian ones here, could the foundation of key cognitively compelling computations over those representations become set. Manipulating more richly structured knowledge is a topic of some current interest in the belief net community (Koller & Pfeffer, 1998; Milch, Marthi, & Russell, 2004); we have considered it in more connectionist terms. Our work has a rather diverse range of links. First, it took its structuring of the problem of hierarchical representation from Riesenhuber and Dayan (1996). That article set out to put into context the neurophysiological results of Connor et al. (1996), who tested Olshausen et al.’s (1993) shifter model of attention. Connor et al. found that a major effect of specifying the focus of (visual) attention was not to translate and scale the mapping from lower to higher visual areas, as expected from the shifter model, but rather to scale multiplicatively the activity of at least one population of V4 neurons. Riesenhuber and Dayan (1996) and Salinas and Abbott (1997) treated this scaling as part of a basis function representation involving the simultaneous coding of the focus of attention and image-object features. 
As has been well explored (Poggio, 1990), particularly through the medium of models of parietal cortex (Pouget & Sejnowski, 1997; Deneve & Pouget, 2003), such basis function mappings allow a simple instantiation of complex functions of all the variables represented. One can see the multilinear form in equation 2.6 in basis function terms, involving an interaction between a representation \(\sum_l g_l G_{lk}\) of the image contents and the effect \(O^e_{ik}\) of the attentional focus e. More general basis functions could be used to allow nonlinear models of the classes themselves and of the effects of the focus of attention. Amit and Mascaro's (2003) shifter-like model is also related. That model has an attractively sophisticated shifting process that integrates bottom-up and top-down information; it would be interesting to employ within it the sort of hierarchical representations of the top-down information that has been our focus. We have relied on shifting to achieve the sort of preprocessing normalization leading to our observations, x. A second key antecedent is the work of Tenenbaum and Freeman (2000) on multilinear generative models for the mostly unsupervised separation of different factors (“content,” for us the nature of the face, and “style,” the attentional focus) that interact to determine inputs. They articulated a general framework to study the sort of multiplicative interactions that we have employed, related them to a range of existing ideas about bilinearity
in perceptual domains (Koenderink and van Doorn, 1997), and showed a most interesting application to typefaces. The particular method that Tenenbaum and Freeman (2000) used to fit their generative model (avoiding one gradient step through the use of singular value decomposition) could be adapted to our case, but an iterative method seems to be required for trilinear and higher-order models in any event. We used a gradient-based minimizer to solve the whole problem. Vasilescu and Terzopoulos (2002, 2003), building partly on the work of De Lathauwer (1997) and Kolda (2001) (and based on ideas dating back at least to Tucker, 1966), used a tensor extension to singular value decomposition (SVD) to find what can be seen as a joint coordinate scheme for structured collections of images. Take the case of faces. Their method starts from a data tensor, with the different dimensions of structural variation of the images (such as viewpoints, lighting conditions, and identities) kept as separate dimensions. Just as SVD on a normal two-dimensional matrix finds left and right coordinate systems for the two spaces acted on by the matrix, together with singular values that link them, joint SVD on the tensor finds coordinate systems for each dimension together with what is called a core tensor that links them collectively. Each coordinate system parameterizes its dimension of variability. In terms of this scheme, our method is rather like using the focus of attention as the independent dimension and marginalizing over identity (which can then be separated through the medium of the mixture model). In these terms, we might expect the observation tensor O to have a formal relationship with the SVD coordinates associated with the focus of attention. Given this, extensions of the tensor decomposition idea, such as to independent components analysis (De Lathauwer & Vandewalle, 2004; Vasilescu & Terzopoulos, 2005), could be most useful directions for our work on representation.
A third link is to tensor product–based representations of structured knowledge (Smolensky, 1990; Plate, 1995, 2003; Gayler, 1998). This strand of work has placed most of its efforts into the problem of the representation of arbitrary episodic structured facts, with representational elements newly minted for each new case. The same is true for methods that are further from tensor product notions, such as Rachkovskij and Kussul's (2001) context-dependent thinning method for binary distributed representations or Kanerva's (1996) binary spatter codes. By contrast, we have focused on the semantic structure underlying domains such as faces. However, the basic linear operations inherent in the multilinear models (such as equation 2.4) are indeed just tensor products of various sorts. Perhaps an even closer link is to recursive autoassociative memories (RAAMs; Pollack, 1990), their ancestor in Hinton's notion of reduced descriptions (Hinton, 1990) and their relational descendants (Sperduti, 1994), since at their heart they are autoencoders, which are best understood as forms of statistical generative model. However, again, RAAMs are normally considered in episodic rather than semantic terms, so the influence
2314
P. Dayan
exerted by the overall statistical structure of a domain can be hard to discern. Paccanaro and Hinton's (2001) linear relational embedding (LRE) can be seen as an intermediate case. LRE, which significantly generalizes and formalizes Hinton's (1989) famous family trees network, does learn aspects of the semantics of a domain, considering the overall structure of the group of related facts about a number of different episodic examples. A final important link arises through ideas in computational vision for the representation of structured objects such as faces. There is a huge wealth of techniques based on generative models of various sorts, from the sort of image-based methods favored by Edelman (1999) through a variety of approaches that decompose objects into parts and learn something about the relative positions and form of these parts. For some methods, the parts are intended to capture something about the true structure of the object (Fischler & Elschlager, 1973; Grenander, 1976–1981; Mjolsness, 1990; Revow, Williams, & Hinton, 1996). For other methods, the parts are features more like dense subcomponents, or local patches of the images, or local wavelet coefficients (Burl et al., 1995, 1998; Schiele & Crowley, 1996, 2000; Fei-Fei et al., 2003; Leibe & Schiele, 2003, 2004; Schneiderman & Kanade, 2004; Amit & Trouvé, 2005). Most, though not all, methods incorporate explicit knowledge about the geometrical relationships among the parts and have it play a key role in the recognition processes of detection and classification. Our method is best seen as image based, and although it has implicit information about these relationships in its ability to generate representations of parts from wholes (one of the main intents in Riesenhuber & Dayan, 1996), we have not considered such sophisticated recognition issues. The most important future direction for this work is in the direction of knowledge structures that are more general than images.
Take stories as an extreme, but seductively motivating, example, to which, for instance, Dolan (1989) took the idea of tensor product representations. However, as with other tensor product notions, this was formulated before the widespread formulation of the sort of sophisticated statistical unsupervised learning model that Tenenbaum and Freeman (2000) promulgated. Stories of a given class (just like faces of a given class) share a semantic structure that defines constraints among the actors and actions in the story (just like the eyes in a face). A most critical difference is that although there are intuitive notions of scale (perhaps summarization scale) and substructure in stories, there is no obvious equivalent of what we have called the attentional focus e, as a way of defining observations at different scales or resolutions. One possible generalization of the key mapping definition (see equation 2.1) is:
\[ \text{story} : \text{question} \Rightarrow \text{answer}, \qquad I : e \Rightarrow x \tag{4.1} \]
with question and answer being coded in the same latent space. In a linear version, this would imply a generalization of equation 2.4, with
\[ x_i = \sum_{jk} O_{ijk} \Big( \sum_m q_m Q_{mj} \Big) \Big( \sum_l g_l G^c_{lk} \Big) + \eta_i \tag{4.2} \]
(ignoring the means), where q are the hidden factors underlying the question, x is a representation of the answer, and O maps together the question and story representations. Altogether q · Q acts as the equivalent of the attentional focus e. In this case, the answer should perhaps live in the same representational space as the question, that is, be itself captured through factors Q, although this poses a rather more challenging unsupervised learning problem. Of course, the restriction to a purely multilinear generative model is rather severe for learning. It will be important to consider nonlinear generalizations of this in which the eye position and the latent space (or the question and the story) interact in a richer manner. Some of the recent structured image representations mentioned above may provide some pointers. The first of a very large number of steps might be to take advantage of the much greater flexibility of a mixture model.

Acknowledgments

I am very grateful to Max Riesenhuber for most helpful discussion and comments and particularly for encouraging and helping me to use the faces from Blanz and Vetter (1999). Two anonymous reviewers and Yali Amit and Chris Williams made very helpful suggestions. I also thank Volker Blanz and Thomas Vetter for making the faces available, and am particularly grateful to the latter for allowing me to use the fiducial markers. Support was from the Gatsby Charitable Foundation.

References

Amit, Y., & Mascaro, M. (2003). An integrated network for invariant visual detection and recognition. Vision Research, 43, 2073–2088.
Amit, Y., & Trouvé, A. (2005). POP: Patchwork of parts models for object recognition. Unpublished technical report. Available online from http://galton.uchicago.edu/ amit/Papers/pop.pdf.
Beymer, D., & Poggio, T. (1996). Image representations for visual learning. Science, 272, 1905–1909.
Blanz, V., & Vetter, T. (1999). A morphable model for the synthesis of 3D faces. In SIGGRAPH'99 (pp. 187–194).
New York: ACM Computer Society Press.
Bridgeman, B., van der Heijden, A. H. C., & Velichkovsky, B. (1994). Visual stability and saccadic eye movements. Behavioral and Brain Sciences, 17, 247–258.
2316
P. Dayan
Burl, M. C., Leung, T. K., & Perona, P. (1995). Face localization via shape statistics. In M. Bichsel (Ed.), Proceedings of the International Workshop on Automatic Face and Gesture Recognition (pp. 154–159). Piscataway, NJ: IEEE.
Burl, M., Weber, M., & Perona, P. (1998). A probabilistic approach to object recognition using local photometry and global geometry. Berlin: Springer.
Connor, C. E., Gallant, J. L., Preddie, D. C., & Van Essen, D. C. (1996). Responses in area V4 depend on the spatial relationship between stimulus and attention. Journal of Neurophysiology, 75, 1306–1308.
Crandall, D., Felzenszwalb, P., & Huttenlocher, D. (2005). Spatial priors for part-based recognition using statistical models. In Proceedings of the International Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE.
Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7, 889–904.
De Lathauwer, L. (1997). Signal processing based on multilinear algebra. Unpublished doctoral dissertation, Katholieke Universiteit Leuven, Belgium.
De Lathauwer, L., & Vandewalle, J. (2004). Dimensionality reduction in higher-order signal processing and rank-(R1, R2, …, RN) reduction in multilinear algebra. Linear Algebra and Its Applications, 391, 31–55.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39, 1–38.
Deneve, S., & Pouget, A. (2003). Basis functions for object-centered representations. Neuron, 37, 347–359.
Dolan, C. P. (1989). Tensor manipulation networks: Connectionist and symbolic approaches to comprehension, learning and planning (Tech. Rep. UCLA-AI-89-06). Los Angeles: Computer Science Department, AI Lab, UCLA.
Edelman, S. (1999). Representation and recognition in vision. Cambridge, MA: MIT Press.
Fei-Fei, L., Fergus, R., & Perona, P. (2003). A Bayesian approach to unsupervised one-shot learning of object categories.
In Proceedings of the International Conference on Computer Vision (ICCV) (Vol. 1). Piscataway, NJ: IEEE.
Fergus, R., Perona, P., & Zisserman, A. (2003). Object class recognition by unsupervised scale-invariant learning. In Proceedings of the International Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE.
Fischler, M. A., & Elschlager, R. A. (1973). The representation and matching of pictorial structures. IEEE Transactions on Computers, 22, 67–92.
Gayler, R. W. (1998). Multiplicative binding, representation operators and analogy. In K. Holyoak & D. Gentner (Eds.), Advances in analogy research: Integration of theory and data from the cognitive, computational, and neural sciences. Sofia, Bulgaria: New Bulgarian University.
Grenander, U. (1976–1981). Lectures in pattern theory I, II and III: Pattern analysis, pattern synthesis and regular structures. Berlin: Springer-Verlag.
Grimes, D. B., & Rao, R. P. N. (2005). Bilinear sparse coding for invariant vision. Neural Computation, 17, 47–73.
Hinton, G. E. (1989). Learning distributed representations of concepts. In R. G. M. Morris (Ed.), Parallel distributed processing: Implications for psychology and neurobiology (pp. 46–61). New York: Oxford University Press.
Images, Frames, and Connectionist Hierarchies
2317
Hinton, G. E. (1990). Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 46, 47–76.
Hinton, G. E. (Ed.). (1991). Connectionist symbol processing. Cambridge, MA: MIT Press.
Hinton, G. E., Dayan, P., & Revow, M. (1997). Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks, 8, 65–74.
Hinton, G. E., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society of London, B, 352, 1177–1190.
Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length and Helmholtz free energy. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 3–10). San Mateo, CA: Morgan Kaufmann.
Kanerva, P. (1996). Binary spatter-coding of ordered K-tuples. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, & B. Sendhoff (Eds.), Proceedings of ICANN 1996 (pp. 869–873). Berlin: Springer-Verlag.
Koenderink, J. J., & van Doorn, A. J. (1997). The generic bilinear calibration-estimation problem. International Journal of Computer Vision, 23, 217–234.
Kolda, T. G. (2001). Orthogonal tensor decompositions. SIAM Journal on Matrix Analysis and Applications, 23, 243–255.
Koller, D., & Pfeffer, A. (1998). Probabilistic frame-based systems. In Proceedings of the 15th National Conference on Artificial Intelligence (pp. 580–587). Madison, WI: AAAI Press.
Leibe, B., & Schiele, B. (2003). Interleaved object categorization and segmentation. In British Machine Vision Conference (BMVC'03). Norwich, UK.
Leibe, B., & Schiele, B. (2004). Scale invariant object categorization using a scale-adaptive mean-shift search. In DAGM'04 Annual Pattern Recognition Symposium (pp. 145–153). Berlin: Springer-Verlag.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21, 105–117.
MacKay, D. M. (1956). The epistemological problem for automata. In C. E. Shannon & J.
McCarthy (Eds.), Automata studies (pp. 235–251). Princeton, NJ: Princeton University Press.
Milch, B., Marthi, B., & Russell, S. (2004). BLOG: Relational modeling with unknown objects. In Proceedings of the ICML-04 Workshop on Statistical Relational Learning. Banff, Canada: International Machine Learning Society.
Mjolsness, E. (1990). Bayesian inference on visual grammars by neural nets that optimize (Tech. Rep. YALEU/DCS/TR-854). New Haven, CT: Computer Science Department, Yale University.
Mumford, D. (1994). Neuronal architectures for pattern-theoretic problems. In C. Koch & J. Davis (Eds.), Large-scale theories of the cortex (pp. 125–152). Cambridge, MA: MIT Press.
Neisser, U. (1967). Cognitive psychology. New York: Appleton-Century-Crofts.
Olshausen, B. A., Anderson, C. H., & Van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13, 4700–4719.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Paccanaro, A., & Hinton, G. E. (2001). Learning distributed representations of concepts using linear relational embedding. IEEE Transactions on Knowledge and Data Engineering, 13, 232–245.
Plate, T. A. (1995). Holographic reduced representations. IEEE Transactions on Neural Networks, 6, 623–641.
Plate, T. A. (2003). Holographic reduced representations. Stanford, CA: CSLI Publications.
Poggio, T. (1990). A theory of how the brain might work. Cold Spring Harbor Symposium on Quantitative Biology, 55, 899–910.
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46, 77–105.
Pouget, A., Dayan, P., & Zemel, R. S. (2000). Computation with population codes. Nature Reviews Neuroscience, 1, 125–132.
Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9, 222–237.
Rachkovskij, D. A., & Kussul, E. M. (2001). Binding and normalization of binary sparse distributed representations by context-dependent thinning. Neural Computation, 13, 411–452.
Rao, R. P. N., Olshausen, B. A., & Lewicki, M. S. (2002). Probabilistic models of the brain. Cambridge, MA: MIT Press.
Revow, M., Williams, C. K. I., & Hinton, G. E. (1996). Using generative models for handwritten digit recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18, 592–606.
Riesenhuber, M., & Dayan, P. (1996). Neural models for part-whole hierarchies. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 17–23). Cambridge, MA: MIT Press.
Riesenhuber, M., Jarudi, I., Gilad, S., & Sinha, P. (2004). Face processing in humans is compatible with a simple shape-based model of vision. Proceedings of the Royal Society of London, B (Suppl.), 271, S448–S450.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025.
Salinas, E., & Abbott, L. F. (1997).
Invariant visual responses from attentional gain fields. Journal of Neurophysiology, 77, 3267–3272.
Schiele, B., & Crowley, J. L. (1996). Probabilistic object recognition using multidimensional receptive field histograms. In International Conference on Pattern Recognition. Piscataway, NJ: IEEE.
Schiele, B., & Crowley, J. L. (2000). Recognition without correspondence using multidimensional receptive field histograms. International Journal of Computer Vision, 36, 31–50.
Schneiderman, H., & Kanade, T. (2004). Object detection using the statistics of parts. International Journal of Computer Vision, 56, 151–177.
Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46, 159–216.
Sperduti, A. (1994). Labeling RAAM. Connection Science, 6, 429–459.
Sudderth, E. B., Torralba, A., Freeman, W. T., & Willsky, A. S. (2005). Learning hierarchical models of scenes, objects and parts. In Proceedings of the International Conference on Computer Vision. Piscataway, NJ: IEEE.
Tenenbaum, J. B., & Freeman, W. T. (2000). Separating style and content with bilinear models. Neural Computation, 12, 1247–1283.
Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31, 279–311.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71–86.
Vasilescu, M. A. O., & Terzopoulos, D. (2002). Multilinear analysis of image ensembles: TensorFaces. In A. Heyden, G. Sparr, M. Nielsen, & P. Johansen (Eds.), Computer vision (pp. 447–460). Berlin: Springer-Verlag.
Vasilescu, M. A. O., & Terzopoulos, D. (2003). Multilinear subspace analysis for image ensembles. In Proceedings of Computer Vision and Pattern Recognition Conference, CVPR '03 (Vol. 2, pp. 93–99). Piscataway, NJ: IEEE.
Vasilescu, M. A. O., & Terzopoulos, D. (2005). Multilinear independent components analysis. In Proceedings of Computer Vision and Pattern Recognition Conference, CVPR '05 (Vol. 2, pp. 547–553). Piscataway, NJ: IEEE.
Vetter, T., & Poggio, T. (1997). Linear object classes and image synthesis from a single example image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 733–742.
von der Malsburg, C. (1988). Pattern recognition by labelled graph matching. Neural Networks, 1, 141–148.
Received August 1, 2005; accepted April 4, 2006.
LETTER
Communicated by Andrea D’Avella
Properties of Synergies Arising from a Theory of Optimal Motor Behavior

Manu Chhabra
Department of Computer Science, University of Rochester, Rochester, NY 14627, U.S.A.
Robert A. Jacobs
Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627, U.S.A.
We consider the properties of motor components, also known as synergies, arising from a computational theory (in the sense of Marr, 1982) of optimal motor behavior. An actor's goals were formalized as cost functions, and the optimal control signals minimizing the cost functions were calculated. Optimal synergies were derived from these optimal control signals using a variant of nonnegative matrix factorization. This was done using two different simulated two-joint arms—an arm controlled directly by torques applied at the joints and an arm in which forces were applied by muscles—and two types of motor tasks—reaching tasks and via-point tasks. Studies of the motor synergies reveal several interesting findings. First, optimal motor actions can be generated by summing a small number of scaled and time-shifted motor synergies, indicating that optimal movements can be planned in a low-dimensional space by using optimal motor synergies as motor primitives or building blocks. Second, some optimal synergies are task independent—they arise regardless of the task context—whereas other synergies are task dependent—they arise in the context of one task but not in the contexts of other tasks. Biological organisms use a combination of task-independent and task-dependent synergies. Our work suggests that this may be an efficient combination for generating optimal motor actions from motor primitives. Third, optimal motor actions can be rapidly acquired by learning new linear combinations of optimal motor synergies. This result provides further evidence that optimal motor synergies are useful motor primitives. Fourth, synergies with similar properties arise regardless of whether one uses an arm controlled by torques applied at the joints or an arm controlled by muscles, suggesting that synergies, when considered in "movement space," are more a reflection of task goals and constraints than of fine details of the underlying hardware.
Neural Computation 18, 2320–2342 (2006)
© 2006 Massachusetts Institute of Technology
Properties of Optimal Synergies
2321
1 Introduction

Marr (1982) defined three levels of analysis of a complex information processing device. The top level, known as the computational theory, examines what the device does and why. A distinguishing feature of this level is that it provides an explanation for why a device does what it does by studying the device's goals. Although there may be many different ways of developing a computational theory of aspects of human behavior, an increasingly popular way is through optimal models that formalize goals as mathematical constraints or criteria, search for behaviors that optimize the criteria, and compare the optimal behaviors with human behaviors. If there is a close match, then it is hypothesized that people are behaving as they do because they are efficiently satisfying the same goals as were built into the optimal model. This approach is commonplace in the study of human motor behavior (see Todorov, 2004, for a review). Flash and Hogan (1985), for example, proposed an optimal model of how people plan trajectories for reaching movements. This model emphasizes that trajectories should be smooth—the model searches for trajectories that minimize the jerk of a movement (i.e., the third derivative of position with respect to time). It is able to explain the fact that reaches tend to move along straight lines and tend to have bell-shaped velocity profiles. Harris and Wolpert (1998) developed an optimal model of motor control that attempts to minimize the variance of the point reached at the end of a movement despite motor noise whose magnitude is dependent on the size of the control signals. They showed that this model explains several aspects of both eye movements and hand reaches.

This article is concerned with motor synergies arising from a computational theory (in the sense of Marr, 1982) of optimal motor behavior. To understand motor synergies, it is helpful to first understand the degrees of freedom problem (Bernstein, 1967).
Biological motor systems typically have many degrees of freedom, where the degrees of freedom in a system are the number of dimensions in which the system can independently vary (Rosenbaum, 1991). Because the number of degrees of freedom of a system carrying out a task often exceeds the number of degrees of freedom needed to specify the task, the degrees of freedom are typically redundant (Jordan & Rosenbaum, 1989). Consider, for example, the problem of touching the tip of your nose. The location of your nose has three degrees of freedom (its x, y, and z position in Cartesian coordinates), but the joints of your arm have seven degrees of freedom (the shoulder has three degrees of freedom, and the elbow and wrist each have two). Consequently, there are many different settings of your arm’s joint positions that all allow you to touch your nose. Which setting should you use? A solution to this problem is to create motor synergies, which are dependencies among dimensions of the motor system. For example, a motor synergy might be a coupling of the motions of your shoulder and elbow.
2322
M. Chhabra and R. Jacobs
Motor synergies provide two types of benefits to motor systems. First, synergies ameliorate the problem of redundancy—they can constrain the set of possible shoulder, elbow, and wrist positions that allow you to touch your nose. Second, synergies reduce the number of degrees of freedom that must be independently controlled, thereby making it easier to control a motor system (Bernstein, 1967). Because synergies make motor systems easier to control, they are often hypothesized to serve as motor primitives, building blocks, or basis functions: they provide fundamental units of motor behavior that can be linearly combined to form more complex units of behavior. Investigators of motor control are attempting to develop a comprehensive understanding of biological motor synergies. Typically, these researchers analyze neuroscientific or behavioral data using mathematical techniques in order to derive the motor synergies used by an organism. Sanger (1995) analyzed people’s cursive handwriting using principal component analysis (PCA) to discover their motor synergies. He showed that linear combinations of these synergies closely reconstructed human handwriting. Thoroughman and Shadmehr (2000) studied people’s motor learning behaviors to derive motor synergies based on gaussian radial basis functions. They showed that linear combinations of these synergies matched people’s behaviors when adapting to new environmental conditions. Mussa-Ivaldi, Giszter, and Bizzi (1994) identified frogs’ motor synergies by stimulating sites in their spinal cords and verified that stimulation of two sites leads to the vector summation of the forces generated by stimulating each site separately. 
A possible confusion in the motor control literature is that synergies derived from neuroscientific or behavioral data using mathematical techniques are sometimes referred to as “optimal.” For example, Sanger (1995) derived synergies from human behavioral data using PCA, a linear optimal dimensionality-reduction technique, and referred to the results as “optimal movement primitives.” It is important to keep in mind, however, that these synergies arise from optimal analysis of people’s actions and are not necessarily the same ones as would arise from a computational theory (again, in the sense of Marr, 1982) of optimal motor behavior. Based on the discussion above, a computational theory might involve a model that formalizes the actor’s goals as mathematical criteria and searches for the actions that optimize the criteria. An optimal analysis of the optimal actions could then derive the motor synergies. Synergies discovered in this way would be “optimal” in the sense that they arise from a computational theory of optimal motor behavior. To date, we know of only one study of motor synergies that arise from a computational theory. Todorov and Jordan (2002) proposed a computational theory that uses an optimal feedback controller as a model of motor coordination and noted that this controller produces motor synergies. In brief, the controller implements the “principle of minimal intervention”—it
does not attempt to control a system along dimensions that are irrelevant for a task. Because the system's degrees of freedom are controlled along some task dimensions but not others, couplings or synergies among the degrees of freedom arise. Todorov and Jordan thereby explained the emergence of synergies.

Although Todorov and Jordan (2002) explained the emergence of synergies, they did not study the specific properties of these synergies. In contrast, this article details the properties of synergies arising from a theory of optimal motor behavior. We have created an optimal controller for a nonlinear system that formalizes goals as mathematical constraints and searches for control signals that optimize the constraints. This was done using two different simulated two-joint arms—an arm controlled directly by torques applied at the joints and an arm in which forces are applied by muscles—and two types of motor tasks—reaching tasks (move an end effector from one point to another) and via-point tasks (move from one point to another while passing through an intermediate point). In all cases, we derived synergies from the optimal control signals using an extension to nonnegative matrix factorization (d'Avella, Saltiel, & Bizzi, 2003) and studied the properties of these synergies. Our studies of the resulting motor synergies reveal several interesting findings. First, optimal motor actions can be generated by summing a small number of scaled and time-shifted motor synergies, indicating that optimal movements can be planned in a low-dimensional space by using optimal motor synergies as motor primitives or building blocks. Second, some optimal synergies are task independent—they arise regardless of the task context—whereas other synergies are task dependent—they arise in the context of one task but not in the contexts of other tasks. Biological organisms use a combination of task-independent and task-dependent synergies.
Our work suggests that this may be an efficient combination for generating optimal motor actions from motor primitives. Third, optimal motor actions can be rapidly acquired by learning new linear combinations of optimal motor synergies. This result provides further evidence that optimal motor synergies are useful motor primitives. Fourth, synergies with similar properties arise regardless of whether one uses an arm controlled directly by torques applied at the joints or an arm controlled by muscles, suggesting that synergies, when considered in "movement space," are more a reflection of task goals and constraints than of fine details of the underlying hardware.

2 Computing the Optimal Control Signals

We simulated a two-joint arm that can be characterized as a second-order nonlinear dynamical system (e.g., Hollerbach & Flash, 1982):

M(θ)θ̈ + C(θ, θ̇) + Bθ̇ = τ,    (2.1)
Table 1: Parameter Values Used in the Simulation of a Two-Joint Arm.

Parameter   Value                Parameter   Value
b_11        0.05 kg m^2 s^-1     b_22        0.05 kg m^2 s^-1
b_21        0.025 kg m^2 s^-1    b_12        0.025 kg m^2 s^-1
m_1         1.4 kg               m_2         1.0 kg
l_1         0.30 m               l_2         0.33 m
I_1         0.025 kg m^2         I_2         0.045 kg m^2
s_1         0.11 m               s_2         0.16 m

Note: The parameters {b_ij} denote the (i, j)th elements of the joint friction matrix; m_i, l_i, and I_i denote the mass, length, and moment of inertia of the ith link, respectively; and s_i denotes the distance from the ith joint to the ith link's center of mass.
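As an illustration, equation 2.1 with the Table 1 parameters can be integrated numerically. The sketch below assumes the standard two-link-arm forms for M(θ) and C(θ, θ̇), which the text does not spell out, so the exact expressions and function names are assumptions rather than the authors' code:

```python
import numpy as np

# Table 1 parameters for the two-joint arm.
b11, b12, b21, b22 = 0.05, 0.025, 0.025, 0.05   # joint friction, kg m^2 s^-1
m1, m2 = 1.4, 1.0        # link masses, kg
l1, l2 = 0.30, 0.33      # link lengths, m
I1, I2 = 0.025, 0.045    # link moments of inertia, kg m^2
s1, s2 = 0.11, 0.16      # joint-to-center-of-mass distances, m

B = np.array([[b11, b12], [b21, b22]])  # joint friction matrix

def arm_dynamics(theta, theta_dot, tau):
    """Joint accelerations from M(th) th'' + C(th, th') + B th' = tau.

    A common two-link parameterization of M and C is assumed here.
    """
    a1 = I1 + I2 + m2 * l1 ** 2
    a2 = m2 * l1 * s2
    a3 = I2
    c2 = np.cos(theta[1])
    M = np.array([[a1 + 2.0 * a2 * c2, a3 + a2 * c2],
                  [a3 + a2 * c2,       a3]])
    sin2 = np.sin(theta[1])
    C = a2 * sin2 * np.array([-theta_dot[1] * (2.0 * theta_dot[0] + theta_dot[1]),
                              theta_dot[0] ** 2])
    return np.linalg.solve(M, tau - C - B @ theta_dot)

def euler_step(theta, theta_dot, tau, dt=0.007):
    """One Euler step; 7 ms matches the torque-update interval in section 4."""
    theta_ddot = arm_dynamics(theta, theta_dot, tau)
    return theta + dt * theta_dot, theta_dot + dt * theta_ddot
```

Fifty such steps cover one 350 ms movement; there is no gravity term, matching equation 2.1, which contains none.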
where τ is a vector of torques, θ is a vector of joint angles, M(θ) is an inertia matrix, C(θ, θ̇) is a vector of Coriolis forces, and B is a joint friction matrix. We used the same parameter values for the arm as Li and Todorov (2004). These values are listed in Table 1.

We studied two types of tasks: reaching tasks and via-point tasks. In a reaching task, the arm must be controlled so that its end effector moves from a start location to a target location. A via-point task is identical except that there is an additional requirement that the end effector also move through an intermediate location known as a via-point.

For any reaching or via-point task, there are many time-varying torque vectors τ(t) that will move the arm so that it successfully performs the task. As discussed above, this multiplicity of control solutions is due to redundancy in the two-joint arm and is known as the degrees-of-freedom problem. How do we choose a particular solution? According to the optimality framework, an actor's goals are formalized as mathematical constraints that are combined in a cost function, and an optimal control signal is a signal that minimizes this function. For the reaching task, we used the following cost function:

J(τ(t)) = (1/2)||e(T) − e*||^2 + k_1 ||ė(T)||^2 + (k_2/2) ∫_0^T τ(t)^T τ(t) dt,    (2.2)
where k1 and k2 are constants (we used the same values as Todorov & Li, 2005: k1 = 0.001 and k2 = 0.0001), T is the duration of the movement, e(T) is the end-effector location at time T, and e ∗ is the target location at time T. The first term penalizes reaches that deviate from the target location, the second term penalizes reaches that do not have a zero velocity at the end of the movement, and the third term penalizes reaches that require large torques (or “energy”). This cost function has previously been used by Li and Todorov (2004; see also Todorov & Li, 2005). Minimization of this function results in control signals that produce reaches with several properties of
natural movements, including bell-shaped velocity profiles, lower velocities at higher curvatures, and near-zero velocities at the beginnings and ends of movements.

For the via-point task, we modified the above cost function to also penalize movements that do not pass through the via-point midway through the movement. The cost function has the form

J(τ(t)) = (1/2)||e(T) − e*||^2 + (1/2)||e(T/2) − e_v*||^2 + k_1 ||ė(T)||^2 + (k_2/2) ∫_0^T τ(t)^T τ(t) dt,    (2.3)
where e_v* is the via-point, or desired end-effector location, at the middle of the movement. This function penalizes reaches that deviate from the via-point at time T/2.

To find the optimal control signal for a reaching or via-point task, the corresponding cost function must be minimized. Unfortunately, when using nonlinear systems such as the two-joint arm described above, this minimization is computationally intractable. Researchers typically resort to approximate methods to find locally optimal solutions. We used one such method, known as the iterative linear quadratic regulator (iLQR), developed by Li and Todorov (2004; see also Todorov & Li, 2005). We now briefly summarize this method in a generic setting. A continuous-time nonlinear dynamical system is given by

ẋ(t) = f(x(t), u(t)),    (2.4)
where x is the state of the system and u is the input control signal. For the two-joint arm described above, the state x is (θ, θ̇)^T, and the control u is τ. Consider a cost function of the form

J(u(t)) = Σ_i h_i(x(t_i)) + ∫_0^T l(x(t), u(t)) dt.    (2.5)
Note that the cost functions for the reaching and via-point tasks are of this form. In the cost function for the reaching task, for example, the two discrete penalties (deviation of the end-effector location from the target location at time T and deviation of the end-effector velocity from zero at time T) correspond to the first term on the right-hand side of equation 2.5, and a continuous energy-like cost for large torques corresponds to the second term.

The iLQR starts with an initial guess of the optimal control signal and iteratively improves it. From the control signal u_i(t) at iteration i, the trajectory x_i(t) is computed using a standard Euler approximation. The algorithm
uses three steps to find the control signal for the next iteration. It starts by linearly approximating the dynamical system given in equation 2.4 and quadratically approximating the cost function given in equation 2.5. These approximations are made around (x_i(t), u_i(t)) at each time step t. Using these approximations, a modified linear quadratic gaussian (LQG) control problem is then formulated in the (δx, δu) space, where x + δx and u + δu are improved approximations to x and u, respectively. This formulation is valid only where these approximations are accurate: a small region around (x_i(t), u_i(t)). Finally, the optimal correction to control u_i(t) at iteration i, denoted δu_i*(t), is computed by solving this modified LQG problem. This step requires the solution to a modified Riccati-like set of equations. Fortunately, finding this solution is computationally efficient. Once the optimal corrections have been obtained, the algorithm sets u_{i+1}(t) = u_i(t) + δu_i*(t) and proceeds to the next iteration. The algorithm stops if there is no significant improvement in the trajectory. The end result is a locally optimal trajectory x* and locally optimal control signal u*.

We have found that the iLQR works well on both reaching and via-point tasks when using the two-joint arm. Figure 1 shows examples of optimal reaching (top row) and via-point (bottom row) movements computed by the iLQR. The graphs in the left column show the movement of the end effector (horizontal and vertical axes give the x and y coordinates of the end effector in Cartesian space), whereas the graphs in the right column show the velocity profiles (horizontal axes represent time, and vertical axes represent velocity of the end effector). Clearly, the iLQR produces smooth movements with bell-shaped velocity profiles. In addition, the velocity profile for the via-point movement (bottom-right graph) indicates that end-effector velocity decreases with increasing path curvature.
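The costs in equations 2.2 and 2.3 are straightforward to evaluate on a discretized trajectory. The sketch below is an illustration rather than the authors' implementation: the array shapes, function names, and the Riemann-sum approximation of the integral are assumptions, while the default weights follow the values k1 = 0.001 and k2 = 0.0001 quoted from Todorov and Li (2005):

```python
import numpy as np

def reaching_cost(e, e_dot, tau, e_star, dt, k1=0.001, k2=0.0001):
    """Discretized version of equation 2.2.

    e, e_dot : (T, 2) end-effector positions and velocities over time
    tau      : (T, 2) joint torques over time
    e_star   : (2,) target location
    """
    terminal = 0.5 * np.sum((e[-1] - e_star) ** 2)   # miss the target at time T
    stopping = k1 * np.sum(e_dot[-1] ** 2)           # nonzero final velocity
    effort = 0.5 * k2 * np.sum(tau * tau) * dt       # Riemann sum of tau^T tau
    return terminal + stopping + effort

def via_point_cost(e, e_dot, tau, e_star, e_via, dt, k1=0.001, k2=0.0001):
    """Equation 2.3: equation 2.2 plus a penalty at the movement midpoint T/2."""
    mid = len(e) // 2
    return (reaching_cost(e, e_dot, tau, e_star, dt, k1, k2)
            + 0.5 * np.sum((e[mid] - e_via) ** 2))
```

A trajectory that ends exactly at the target with zero velocity and zero torque incurs zero cost, which is why the iLQR's iterative reduction of this quantity drives the arm toward the target.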
3 Obtaining Optimal Synergies

As discussed above, motor synergies are dependencies among dimensions of a motor system. They are useful because they can ameliorate the problem of redundancy and because they reduce the number of degrees of freedom that must be independently controlled, thereby making it easier to control a motor system. Synergies are often hypothesized to serve as motor primitives, building blocks, or basis functions. Researchers have used a variety of methods to compute motor synergies. We used a variant of nonnegative matrix factorization developed by d'Avella et al. (2003). This algorithm requires two inputs. One input is the number of synergies, denoted N. The other is a matrix of control signals, where each control signal is a 2 × T matrix of optimal torques computed by the iLQR for a given task (this matrix has 2 × T elements because torques are applied to both joints of the two-joint arm at each time step of a movement, and there are T time steps per movement). The input matrix of control signals is a vertical stack of individual control signal matrices.
Figure 1: Examples of optimal reaching (top row) and via-point movements (bottom row) computed using the iLQR. The graphs in the left column show the movement of the end effector: the horizontal and vertical axes represent the x and y coordinates of the end effector in Cartesian space. The thick lines show the orientation of the two-joint arm at the start of the movement, and the thin lines show the path of the end effector. The graphs in the right column show the velocity profiles: the horizontal axes represent time, and the vertical axes represent velocity of the end effector.
For example, if the iLQR was used to find the optimal control signals for 500 reaching tasks (tasks with different initial configurations of the arm or different target locations) and each reach had a duration of 400 time steps, then the matrix would consist of 1000 rows, where each block of two rows is a 2 × 400 element matrix giving the optimal torques for each joint at each time step of a reach. As its output, the algorithm seeks a set of synergies such that every control signal can be expressed as a sum of scaled and time-shifted synergies. Mathematically, it seeks a set of N synergies, denoted {w_i, i = 1, …, N}, such that control signal m can be written as follows:
m(t) = Σ_{i=1}^{N} c_i w_i(t − t_i),    (3.1)
where {c_i, i = 1, …, N} is a set of coefficients that scale the synergies, and {t_i, i = 1, …, N} is a set of times that time-shift the synergies. The algorithm searches for the synergies, scaling coefficients, and time shifts that minimize the sum of squared errors between the actual control signals and the reconstructed signals.

A technical detail is that the algorithm requires a set of nonnegative control signals (each element of a control vector must be nonnegative). In our case, a torque vector might have negative elements. We overcame this problem in a manner inspired by biological motor systems' use of agonist and antagonist muscles to apply torques at joints. We recoded a 2 × 1 torque vector as a 4 × 1 vector in which the first two elements give the anticlockwise and clockwise torques for the first joint (shoulder), and the last two elements provide the same information for the second joint (elbow).¹ For example, if torque (2, −1)^T is applied to the joints, it means that a +2 torque is applied to the first joint in the anticlockwise direction, and a +1 torque is applied to the second joint in the clockwise direction. We recoded this torque vector to the nonnegative vector (2, 0, 0, 1)^T.

4 Simulation Results

This section reports the results of seven experiments. The first four experiments used the two-joint arm described above in which torques were applied at the joints. The last three experiments used the same arm, except forces were applied by muscles. All experiments used the same collection of reaching and via-point tasks. We created 320 instances of each task as follows. Ten initial positions of the arm were randomly generated by uniformly sampling the first joint angle from the interval [−π/4, π/2] and the second joint angle from the interval [0, 3π/4]. For each initial position, 32 target locations were generated.
A target was generated by randomly selecting a movement distance (sampled uniformly from the range 10–50 cm) and an angle of movement (sampled uniformly from the range 0–2π). For the via-point task, a via-point was placed at a random angle (sampled uniformly from the set [−π/3, π/3]) from the line joining the initial and target locations.
1. Below we present results in which we consider the number of motor synergies required to reconstruct optimal movements with small error when an arm is controlled by torques applied directly to its joints. It is possible that the operation of mapping a two-dimensional vector with real values to a four-dimensional vector with nonnegative values introduces a bias into the estimate of this number. Nonetheless, our use of the mapping is justified as follows. We wish to compare synergies obtained when an arm is controlled by torques applied directly to its joints with synergies obtained when an arm is controlled by forces applied by muscles. Therefore, it is necessary to use the same representational format and dimensionality-reduction algorithm for obtaining synergies in both cases. When an arm is controlled by muscles, synergies are extracted on the basis of muscle activations, which must be nonnegative values.
The via-point's distance from the
Figure 2: The graphs plot the root mean squared error (RMSE) between actual and reconstructed test items for reaching (left graph) and via-point (right graph) tasks as a function of the number of synergies used in the reconstructions. The error bars give the standard errors of the means.
initial location was selected randomly to be between one-third and two-thirds of the distance between initial and target locations. The duration of a movement was 350 msec, and new torques were applied every 7 msec. 4.1 Experiment 1: A Small Set of Synergies Can Reconstruct Optimal Movements. The first experiment evaluated whether optimal reaching or via-point control signals can be expressed as a sum of a small number of scaled and time-shifted synergies. If so, then the synergies can be regarded as useful motor primitives. For each type of task, the iLQR was applied to each instance of the task to generate 320 optimal control signals. These signals were divided into five equal-sized sets, which were then used by a fivefold cross-validation procedure to create training and test data items. Four sets of control signals were used for training, and the remaining set was used for testing. This was repeated for all five such combinations of training and test sets. During training, nonnegative matrix factorization was used as described above to discover a set of synergies. During testing, these synergies were time-shifted and linearly combined to reconstruct the test control signals. Nonnegative matrix factorization was used to find the time shifts and linear coefficients. The results for the reaching and via-point tasks are shown in the left and right graphs of Figure 2, respectively. The horizontal axes give the number of synergies. The vertical axes give the root mean squared error (RMSE) between actual and reconstructed test control signals. The error bars show the standard errors of the means. With both reaching and via-point tasks, the error is near its minimum when relatively few synergies (about six or seven) were used. For our purposes, this is an important result
Figure 3: The similarity matrix when six synergies were obtained from the reaching task and the via-point task. The lightness of the square at row i and column j gives the cosine of the angle between the ith reaching-task synergy vector and the jth via-point task synergy vector: white is a value of 1, black is a value of 0, and intermediate gray-scale values represent intermediate values.
because it means that the synergies are useful motor primitives: optimal movements can be planned in a relatively low-dimensional space by time-shifting and linearly combining a small number of synergies. Furthermore, the fact that the error curves for the reaching and via-point tasks are similar suggests that these tasks have similar task complexity. This is surprising because generating optimal via-point movements intuitively seems more complicated than generating optimal reaches. 4.2 Experiment 2: Task-Independent and Task-Dependent Synergies. The second experiment evaluated whether optimal motor synergies are task independent or task dependent. This issue is interesting due to recent neurophysiological findings. d'Avella and Bizzi (2005), for example, recorded electromyographic activity from 13 muscles of the hind limbs of frogs performing jumping, swimming, and walking movements. An analysis of the underlying motor synergies revealed that some synergies were used in all types of movements, whereas other synergies were movement dependent. Figure 3 shows the similarity matrix when six synergies were obtained for the reaching task and six synergies were obtained for the via-point task. The lightness of the square at row i and column j gives the cosine of the angle between the ith reaching-task synergy vector and the jth via-point
Figure 4: The six synergies obtained for the reaching task. Each synergy is represented by four columns; the first two columns represent the anticlockwise and clockwise torques for the first joint (shoulder), whereas the second two columns represent this same information for the second joint (elbow). Torques were linearly scaled to the interval [0, 1]. White indicates a torque of 1, black indicates a torque of 0, and intermediate shades of gray represent intermediate values.
task synergy vector: white is a value of 1, black is a value of 0, and intermediate gray-scale values represent intermediate values. Some synergies, such as the third reaching-task synergy and the first via-point task synergy, are highly similar, indicating that these synergies are task independent. In contrast, other synergies, such as the fourth reaching-task synergy or the third via-point task synergy, are dissimilar from all other synergies, indicating that they are task dependent. This result suggests that the combination of task-independent and task-dependent synergies found in biological organisms (e.g., d’Avella & Bizzi, 2005; Jing, Cropper, Hurwitz, & Weiss, 2004) may be efficient for generating optimal motor actions from motor synergies. 4.3 Experiment 3: Visualizing Synergies. In experiment 3, we obtained synergies for the purpose of visualizing the movements induced by these synergies. Using our collections of instances of each type of task, six synergies for the reaching task and six synergies for the via-point task were calculated as described above. The scaling coefficients for the reaching-task
Figure 5: Movements induced by six synergies obtained for the reaching task. The horizontal and vertical axes of each graph give the x and y coordinates of the end effector in Cartesian space, the gray lines show the initial configuration of the arm, the black lines show the movements of the end effector, and the number next to each movement indicates the synergy that was applied (using the same labels as Figure 4). The left and right graphs illustrate induced movements when the initial configuration of the arm was near the center of the work space or at a far edge of the work space, respectively.
synergies or the via-point task synergies were set to their average values over the collection of reaching tasks or via-point tasks, respectively. The time-shift parameters were set to zero. The six synergies obtained for the reaching task are illustrated in Figure 4. The horizontal axis labels the synergies, and the vertical axis depicts time. Each synergy is represented by four columns; the first two columns represent the anticlockwise and clockwise torques for the first joint, whereas the second two columns represent this same information for the second joint. Torques were linearly scaled to the interval [0, 1]. White indicates a torque of 1, black indicates a torque of 0, and intermediate shades of gray represent intermediate values. Figure 5 illustrates movements based on these synergies. The left graph shows the induced movements when the initial arm configuration was near the center of the workspace. The horizontal and vertical axes of the graph give the x and y coordinates of the end-effector in Cartesian space, the gray lines show the initial configuration of the arm, the black lines show the movements of the end effector, and the number next to each movement indicates the synergy that was applied (using the same labels as Figure 4). The induced movements tend to be relatively straight (though some are curved) and tend to cover a wide range of directions. The right graph of Figure 5 shows the induced movements when the initial arm configuration was at a far edge of the work space. Again, the movements tend to be relatively straight. As should be expected, movements in this case are directed toward the center of the work space. Figure 5 demonstrates that synergies tend to broadly cover all possible directions of motion.
Figure 6: The six synergies obtained for the via-point task.
Figure 7: Movements induced by synergies obtained from the via-point task. The left and right graphs illustrate movements induced by task-independent and task-dependent synergies, respectively. The number next to each movement indicates the synergy that was applied (using the same labels as Figure 6).
The six synergies obtained for the via-point task are illustrated in Figure 6. It uses the same format as Figure 4. Figure 7 illustrates movements based on these synergies. The left graph illustrates movements induced by two synergies that were highly similar to synergies obtained from the reaching task—that is, these are task-independent synergies. The induced movements are relatively straight. Consequently, the underlying
Figure 8: Velocity curves for induced movements. The left graph plots the velocity curve for a movement based on a synergy obtained from the reaching task. The right graph plots the velocity curves for two movements based on two task-dependent synergies obtained from the via-point task.
synergies are useful for both reaching and via-point tasks. The right graph illustrates movements based on four synergies that are task dependent; these synergies were not similar to synergies obtained from the reaching task. The induced movements tend to be almost piecewise linear, with a region of large curvature near the middle of the movement that is preceded and followed by regions of relatively straight motion. Figure 8 shows the velocity curves (velocity at each moment in time) for induced movements. The left graph plots the velocity curve for a movement based on a synergy obtained from the reaching task. This curve has a bell-shaped profile, which is commonly found for reaching movements. The right graph plots the velocity curves for two movements based on two task-dependent synergies obtained from the via-point task. The shapes of these curves are typical for via-point movements. In summary, we find that the synergies for reaching and via-point movements have intuitive forms. Movements based on synergies obtained from the reaching task tend to be straight, to broadly cover the directions available to the arm based on its initial configuration, and to have bell-shaped velocity profiles. Movements based on task-independent via-point synergies tend to have these same properties. In contrast, movements based on task-dependent via-point synergies tend to have a piecewise-linear shape with a region of high curvature near the middle of the movement and have velocity profiles with two bell shapes. 4.4 Experiment 4: Learning with Synergies. Experiment 4 evaluated whether the use of optimal motor synergies makes it easier to learn to perform new optimal motor actions. If motor synergies are useful motor primitives, then this ought to be the case.
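The bell-shaped profile mentioned above is the classic shape predicted by the minimum-jerk model of Flash and Hogan (1985), which is cited in the references. As a point of comparison, that profile can be generated directly; this sketch is our illustration, not part of the paper's analysis, and the distance and duration values are arbitrary.

```python
import numpy as np

def min_jerk_speed(t, D, T):
    """Speed profile of a minimum-jerk reach of distance D and duration T
    (Flash & Hogan, 1985): a symmetric bell peaking at 1.875 * D / T."""
    s = t / T
    return (D / T) * (30 * s**2 - 60 * s**3 + 30 * s**4)

t = np.linspace(0.0, 0.35, 351)        # a 350 msec movement, 1 msec steps
v = min_jerk_speed(t, D=0.3, T=0.35)   # a 30 cm reach
```

The speed is zero at both endpoints and peaks exactly at mid-movement, which is the qualitative shape seen in the left graph of Figure 8.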
The task was to learn to generate a reaching movement starting from an initial configuration of the arm so that the arm’s end point reached a randomly selected target location. When synergies were used, control signals were expressed as linear combinations of synergies (to minimize computational demands, we did not time-shift synergies), meaning that the parameter values that needed to be learned were the linear coefficients. When synergies were not used, the values that needed to be learned were the torques applied to each joint at each moment in time. From a collection of 320 instances of the reaching task, fivefold cross validation was used to create training and test sets. Policy gradient, a type of reinforcement learning algorithm, was used to learn estimates of the relevant parameter values (Sutton, McAllester, Singh, & Mansour, 2000). This algorithm was applied for 300 iterations. Learning with synergies occurred as follows. We calculated the optimal movements for each instance in a training set using the iLQR, and obtained four motor synergies using nonnegative matrix factorization. The policy gradient algorithm was then used to learn to perform each instance of the reaching task in the test set. At each iteration of the learning process, we numerically computed the derivatives of the reaching-task cost function (see equation 2.2) with respect to the linear coefficients used in the linear combination of synergies and performed gradient descent with the constraint that the coefficients had to be nonnegative. When learning without synergies, we computed the derivatives of the reaching-task cost function with respect to the torques at each joint and at each time step and performed gradient descent. Step sizes or learning rates that produced near-optimal performance were used when performing gradient descent with and without synergies. The results for a typical instance of a reaching task from a test set are shown in Figure 9. 
The graph on the left shows the learning curves for learning with and without motor synergies. The horizontal axis gives the iteration number, and the vertical axis gives the value of the reaching-task cost function. Whereas learning without synergies was slow and never achieved good performance, learning with synergies was rapid and achieved excellent performance. Indeed, learning with synergies achieved roughly the same cost as the iLQR. The graph on the right shows the movements learned with and without synergies in Cartesian coordinates and the movement calculated by the iLQR. The movement learned without synergies never reached the target location, whereas the movement learned with synergies did. Overall, the results indicate that optimal synergies are useful motor primitives or building blocks in the sense that their use in linear combinations leads to rapid and accurate acquisition of new optimal motor actions.
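The synergy-space learning described above can be caricatured as projected gradient descent on the linear coefficients. This is a toy sketch under our own assumptions: the experiment used the policy gradient algorithm with the reaching-task cost of equation 2.2, whereas here a simple quadratic stand-in cost makes the nonnegativity projection easy to see.

```python
import numpy as np

def learn_coefficients(cost_grad, n_coeffs, lr=0.1, n_iter=300):
    """Gradient descent on synergy coefficients with a nonnegativity
    constraint, enforced by projecting back onto c >= 0 after each step."""
    c = np.zeros(n_coeffs)
    for _ in range(n_iter):
        c = np.maximum(c - lr * cost_grad(c), 0.0)
    return c

# Toy quadratic cost ||c - target||^2: the constrained minimum simply
# clips the negative entry of the target to zero.
target = np.array([1.0, 2.0, 0.5, -1.0])
c = learn_coefficients(lambda c: 2.0 * (c - target), n_coeffs=4)
```

With four coefficients instead of a full torque sequence, the search space is far smaller, which is the intuition behind the rapid learning curve in Figure 9.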
4.5 Experiment 5: Motor Synergies When Forces Are Applied by Muscles. Whereas experiments 1 to 4 simulated a two-joint arm controlled directly by torques applied at the joints, experiments 5 to 7 simulated the
Figure 9: The graph on the left shows the learning curves for learning with and without motor synergies on a typical instance of a reaching task. The horizontal axis gives the iteration number, and the vertical axis gives the value of the reaching-task cost function. The graph on the right shows the movements learned with and without synergies in Cartesian space and the movement calculated by the iLQR.
same arm except forces were applied by muscles. We conducted experiments with muscles so that we could verify that the results reported above are also valid when simulating more complex and biologically realistic systems such as an arm controlled by muscles. We used the muscle model developed by Todorov and Li (2005; see also Brown, Cheng, & Loeb, 1999). In brief, this model uses six muscles that apply forces to a two-joint arm. The control signal is the neural input to the muscles. This input passes through a nonlinear low-pass filter to produce muscle activations. The tension of a muscle is a function of the muscle's current activation, length, and length velocity. The tension produces forces on the arm's links that, in turn, produce joint torques. Note that this system is significantly more complicated than the arm in which torques are applied directly at the joints. This system has a six-dimensional control space (neural input to each of six muscles), a 10-dimensional state space (six muscle activations and the angular position and velocity of each joint), muscle activations that might saturate, and dynamics with temporal delays (due to the low-pass filtering of neural input). In experiments 1 to 4, nonnegative matrix factorization was applied to the optimal control signals to obtain motor synergies. In contrast, this factorization was not applied to the control signals—the neural input—in experiments 5 to 7; rather, it was applied to the optimal muscle activations. We found that the factorization procedure was significantly more robust when applied to the muscle activations due to the smoothness of their values (recall that the activations are low-pass accumulations of the neural inputs). Factorization of muscle activations was also conducted by d'Avella et al. (2003).
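The mapping from neural input to muscle activation can be sketched as a first-order low-pass filter. This is a simplified, linear stand-in for the nonlinear filter of the Todorov and Li (2005) model; the time constant and step size below are our assumptions chosen only to illustrate the smoothing.

```python
import numpy as np

def activations(u, dt=0.007, tau=0.05):
    """Low-pass filter neural input u (shape (T, 6), nonnegative) into
    muscle activations a, using explicit Euler steps of size dt and a
    single time constant tau."""
    a = np.zeros_like(u)
    for t in range(1, u.shape[0]):
        a[t] = a[t - 1] + (dt / tau) * (u[t] - a[t - 1])
    return a

a = activations(np.ones((50, 6)))  # step input: activation rises smoothly
```

The smoothing is why the activations are better behaved inputs to the factorization than the raw neural commands.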
Figure 10: The graphs plot the root mean squared error (RMSE) between actual and reconstructed test items for reaching (left graph) and via-point (right graph) tasks when using the arm controlled by muscles as a function of the number of synergies used in the reconstructions. The error bars give the standard errors of the means.
Data for experiments 5 to 7 were collected in a similar manner as for experiments 1 to 4. We created 320 instances of the reaching task and the via-point task. Fivefold cross validation was used to create training and test sets of task instances. The iLQR was used to calculate the optimal sequence of neural inputs for each training instance (the cost functions for the reaching and via-point tasks given above were suitably modified by replacing the torque vector—the control input in experiments 1 to 4—with the neural input vector—the control input in experiments 5 to 7). Optimal muscle activations were created from the optimal neural inputs by low-pass filtering. Nonnegative matrix factorization was applied to the optimal muscle activations to generate optimal synergies. Based on these synergies, nonnegative matrix factorization was also used to perform task instances from test sets by computing optimal sums of scaled and time-shifted synergies. Experiment 5 parallels experiment 1 in the sense that it evaluated whether optimal muscle activations can be expressed as a linear combination of a small number of time-shifted synergies. The results for the reaching and via-point tasks are shown in the left and right graphs of Figure 10, respectively. With both reaching and via-point tasks, the error is near its minimum when relatively few synergies were used. We conclude that synergies are useful motor primitives because optimal movements can be planned in a relatively low-dimensional space by summing a small number of scaled and time-shifted synergies. 4.6 Experiment 6: Task-Independent and Task-Dependent Synergies When Forces Are Applied by Muscles. Experiment 6 parallels experiment 2 in the sense that it evaluated whether optimal motor synergies are task independent or task dependent (d’Avella & Bizzi, 2005; Jing et al., 2004).
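The factorization step can be conveyed by a basic multiplicative-update algorithm in the style of Lee and Seung. This is a simplified stand-in: the factorization actually used here is the time-shift-capable variant of d'Avella et al. (2003), which this sketch omits, and the data below are random placeholders.

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative matrix V ~= W @ H by multiplicative updates.
    Rows of H play the role of synergies; W holds their scaling weights."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update synergies
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update weights
    return W, H

# Placeholder data: 24 nonnegative activation channels over 100 time steps.
V = np.abs(np.random.default_rng(1).standard_normal((24, 100)))
W, H = nmf(V, k=6)
rmse = np.sqrt(np.mean((V - W @ H) ** 2))
```

The multiplicative form guarantees that W and H stay nonnegative, mirroring the nonnegativity required of muscle activations.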
Figure 11: The similarity matrix for the six synergies obtained from the reaching task and the six synergies obtained from the via-point task using the arm controlled by muscles. The lightness of the square at row i and column j gives the cosine of the angle between the ith reaching-task synergy vector and the jth via-point task synergy vector: white is a value of 1, black is a value of 0, and intermediate gray-scale values represent intermediate values.
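The similarity measure used in Figures 3 and 11 is the cosine of the angle between pairs of flattened synergy vectors. A minimal implementation (ours, assuming nonzero synergy vectors):

```python
import numpy as np

def similarity_matrix(A, B):
    """Entry (i, j) is the cosine of the angle between synergy vectors
    A[i] and B[j]; rows of A and B are flattened synergies."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T
```

Because the synergies are nonnegative, every entry lies in [0, 1], with 1 (white in the figures) indicating identical directions.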
Figure 11 shows the similarity matrix for the six synergies obtained from the reaching task and the six synergies obtained from the via-point task. Some synergies, such as the second reaching-task synergy and third via-point task synergy, are highly similar, indicating that these synergies are task independent. In contrast, other synergies, such as the third reaching-task synergy and the second via-point task synergy, are dissimilar from all other synergies, indicating that they are task dependent. A combination of task-independent and task-dependent synergies was also found in experiment 2, which used an arm controlled directly by torques. This result suggests that the combination of task-independent and task-dependent synergies found in biological organisms may be efficient for generating optimal motor actions from motor primitives. 4.7 Experiment 7: Visualizing Synergies When Forces Are Applied by Muscles. In experiment 7, we obtained synergies for the purpose of visualizing the movements induced by these synergies when forces are applied by muscles. Consequently, experiment 7 parallels experiment 3 and was conducted in an analogous manner. Six synergies were obtained for the reaching task when the arm was controlled by forces applied by muscles. These synergies are illustrated in
Figure 12: Six synergies were obtained for the reaching task when the arm was controlled by forces applied by muscles. The left graph shows the muscle activations. For each synergy, there are six columns corresponding to the activations of the six muscles. The right graph shows the torques generated by the synergies. For each synergy, the four columns indicate the anticlockwise and clockwise torques for joints 1 and 2, respectively.
Figure 12. The left graph shows the muscle activations. For each synergy, there are six columns corresponding to the activations of the six muscles.2 Muscle activations were linearly scaled to the interval [0, 1]. White indicates an activation of 1, black indicates an activation of 0, and intermediate shades of gray indicate intermediate activation values. The right graph shows the torques generated by the synergies. For each synergy, the four columns indicate the anticlockwise and clockwise torques for joints 1 and 2, respectively. Figure 13 illustrates the movements induced by these synergies. The left graph shows the induced movements when the initial arm configuration was near the center of the work space. These movements tend to be relatively straight (though some are curved) and tend to cover a wide range of directions. The right graph shows the induced movements when the initial arm configuration was at a far edge of the work space. The movements tend to be relatively straight and are directed toward the center of the work space. For our purposes, a notable feature of these induced movements is that they closely resemble the movements induced by reaching-task synergies obtained when the arm was controlled by torques applied directly at the joints (see experiment 3 above). These data are consistent with the idea that synergies, when considered in “movement space,” are more a reflection of task goals and constraints than of fine details of the underlying hardware.
2. The muscles are (1) biceps long, brachialis, brachioradialis; (2) triceps lateral, anconeus; (3) deltoid anterior, coracobrachialis; (4) deltoid posterior; (5) biceps short; and (6) triceps long. See Li and Todorov (2004) for details.
Figure 13: Movements induced by six synergies obtained for the reaching task when the arm was controlled by forces applied by muscles. The left and right graphs illustrate induced movements when the initial configuration of the arm was near the center of the work space or at a far edge of the work space, respectively. The number next to each movement indicates the synergy that was applied (using the same labels as Figure 12).
5 Discussion In summary, this letter has considered the properties of synergies arising from a computational theory (in the sense of Marr, 1982) of optimal motor behavior. An actor's goals were formalized as cost functions, and the optimal control signals minimizing the cost functions were calculated by the iLQR. Optimal synergies were derived from these optimal control signals using a variant of nonnegative matrix factorization. This was done for both reaching and via-point tasks and for a simulated two-joint arm controlled by torques applied at the joints as well as an arm in which forces were applied by muscles. In brief, studies of the motor synergies revealed several interesting findings: (1) optimal motor actions can be generated by summing a small number of scaled and time-shifted motor synergies; (2) some optimal synergies are task independent, whereas other synergies are task dependent; (3) optimal motor actions can be rapidly acquired by learning new linear combinations of optimal motor synergies; and (4) synergies with similar properties arise regardless of whether one uses an arm controlled by torques applied at the joints or an arm controlled by muscles. Future work will need to address shortcomings of our experiments. Our findings were obtained using simple motor tasks and a simple two-joint arm. We used reaching and via-point tasks because these are commonly performed movements and are frequently studied in the literature. We used a two-joint arm because it is computationally tractable. We conjecture that our basic results will still be found even with more complex tasks. This hypothesis is based on the fact that many complex movements can be regarded as combinations of simpler reaching and via-point movements.
We also conjecture that our results will still be found with more complex arms. This hypothesis is based on the fact that we obtained similar results regardless of whether we used a simple arm—a two-joint arm controlled by torques applied at the joints—or a more complex arm—a two-joint arm controlled by forces applied by muscles. Computationally, an obstacle to using more complex tasks and arms is the need to calculate optimal control signals. Using current computer technology, the calculation of optimal controls for nonlinear systems with many degrees of freedom is typically not possible. Our findings were also obtained using specific mathematical techniques, such as the iLQR optimization method and the nonnegative matrix factorization method. We believe that our choices of mathematical techniques were reasonable. Again, this is an area in which important computational issues will need to be addressed before future studies can consider more complex motor tasks and arms. In particular, there is a need to develop improved dimensionality-reduction techniques for obtaining synergies. For example, the nonnegative matrix factorization method, like other methods, cannot be applied when movements have widely different durations and, thus, control signals have widely different dimensions. Future work will need to address this and many other unsolved problems.
Acknowledgments
We thank E. Todorov for help with the iLQR optimal control algorithm and two anonymous reviewers for helpful comments on an earlier version of this article. This work was supported by NIH research grant R01-EY13149.
References
Bernstein, N. (1967). The coordination and regulation of movements. London: Pergamon.
Brown, I. E., Cheng, E. J., & Loeb, G. E. (1999). Measured and modeled properties of mammalian skeletal muscle; II: The effects of stimulus frequency on force-length and force-velocity relationships. Journal of Muscle Research and Cell Motility, 20, 627–643.
d'Avella, A., & Bizzi, E. (2005). Shared and specific muscle synergies in natural motor behaviors. Proceedings of the National Academy of Sciences USA, 102, 3076–3081.
d'Avella, A., Saltiel, P., & Bizzi, E. (2003). Combinations of muscle synergies in the construction of a natural motor behavior. Nature Neuroscience, 6, 300–308.
Flash, T., & Hogan, N. (1985). The coordination of arm movements: An experimentally confirmed mathematical model. Journal of Neuroscience, 5, 1688–1703.
Harris, C. M., & Wolpert, D. M. (1998). Signal-dependent noise determines motor planning. Nature, 394, 780–784.
Hollerbach, J. M., & Flash, T. (1982). Dynamic interactions between limb segments during planar arm movement. Biological Cybernetics, 44, 67–77.
Jing, J., Cropper, E. C., Hurwitz, I., & Weiss, K. R. (2004). The construction of movement with behavior-specific and behavior-independent modules. Journal of Neuroscience, 24, 6315–6325.
Jordan, M. I., & Rosenbaum, D. A. (1989). Action. In M. I. Posner (Ed.), Foundations of cognitive science. Cambridge, MA: MIT Press.
Li, W., & Todorov, E. (2004). Iterative linear-quadratic regulator design for nonlinear biological movement systems. In Proceedings of the First International Conference on Informatics in Control, Automation, and Robotics (pp. 222–229).
Marr, D. (1982). Vision. New York: Freeman.
Mussa-Ivaldi, F. A., Giszter, S. F., & Bizzi, E. (1994). Linear combination of primitives in vertebrate motor control. Proceedings of the National Academy of Sciences USA, 91, 7534–7538.
Rosenbaum, D. A. (1991). Human motor control. San Diego: Academic Press.
Sanger, T. D. (1995). Optimal movement primitives. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7. Cambridge, MA: MIT Press.
Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Thoroughman, K. A., & Shadmehr, R. (2000). Learning of action through adaptive combination of motor primitives. Nature, 407, 742–747.
Todorov, E. (2004). Optimality principles in sensorimotor control. Nature Neuroscience, 7, 907–915.
Todorov, E., & Jordan, M. I. (2002). Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5, 1226–1235.
Todorov, E., & Li, W. (2005). A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. Proceedings of the 2005 American Control Conference (Vol. 1, pp. 300–306).
Received August 16, 2005; accepted February 17, 2006.
LETTER
Communicated by Carlos Brody
Recognition by Variance: Learning Rules for Spatiotemporal Patterns
Omri Barak
[email protected]
Misha Tsodyks
[email protected]
Department of Neurobiology, Weizmann Institute of Science, Rehovot 76100, Israel
Recognizing specific spatiotemporal patterns of activity, which take place at timescales much larger than the synaptic transmission and membrane time constants, is a demand from the nervous system exemplified, for instance, by auditory processing. We consider the total synaptic input that a single readout neuron receives on presentation of spatiotemporal spiking input patterns. Relying on the monotonic relation between the mean and the variance of a neuron’s input current and its spiking output, we derive learning rules that increase the variance of the input current evoked by learned patterns relative to that obtained from random background patterns. We demonstrate that the model can successfully recognize a large number of patterns and exhibits a slow deterioration in performance with increasing number of learned patterns. In addition, robustness to time warping of the input patterns is revealed to be an emergent property of the model. Using a leaky integrate-and-fire realization of the readout neuron, we demonstrate that the above results also apply when considering spiking output. 1 Introduction Recognizing the spoken word recognize presents no particular problem for most people, and yet the underlying computation our nervous system has to perform is far from trivial. The stimulus presented to the auditory system can be viewed as a pattern of energy changes on different frequency bands over a period of several hundred milliseconds. This timescale should be contrasted with the membrane and synaptic transmission time constants of the individual neurons, which are at least an order of magnitude smaller. The auditory modality is not the only one concerned with the recognition of spatiotemporal patterns of activity; reading braille and understanding sign language are among the exemplars of this rule. 
Songbirds provide an experimentally accessible model system to study this ability, where Margoliash and Konishi (1985) showed that specific learned (bird's own) songs elicit a higher neuronal response than nonlearned songs with similar

Neural Computation 18, 2343–2358 (2006)
© 2006 Massachusetts Institute of Technology
statistics (same dialect). The representation of external stimuli in the brain is a spatiotemporal pattern of spikes (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). Experiments on many different systems demonstrated a high-precision spiking response to dynamic stimuli, thereby presenting a spatiotemporal pattern of spikes to the next level in the hierarchy (Aertsen, Smolders, & Johannesma, 1979; Berry, Warland, & Meister, 1997; Uzzell & Chichilnisky, 2004; Bair & Koch, 1996). These experiments motivated us to limit the current work to the study of spiking patterns. The computational task at hand can be formulated in the following way: given the statistical ensemble of all possible spatiotemporal spiking patterns, a randomly chosen specific finite subset is designated as learned patterns, while the remaining possible patterns are termed background patterns. The task is to build a model that recognizes a learned pattern as a familiar one by producing a larger output when presented with it, compared to when presented with a typical background pattern. The model therefore reduces the high-dimensional input to a one-dimensional output. We emphasize that in the task that we consider in this letter, the selected group of learned patterns is to be distinguished from infinitely many other random patterns, as opposed to the task of classification, where two (or more) sets of selected patterns are to be distinguished from one another. Several network models have been proposed to address this task. Hopfield and Brody (2001) introduced a network of integrate-and-fire neurons capable of recognizing spoken words, encoded as a spatiotemporal pattern of spikes, regardless of time warp effects. Jin (2004) used a spiking neural network to implement a synfire chain that responds only to specific spatiotemporal sequences of spikes, regardless of variations in the intervals between spikes.
Although the actual biological solution probably includes a network of neurons, there is an advantage in modeling the same task with a single neuron. Single-neuron models are usually more amenable to analytic treatment, thereby facilitating the understanding of the model's mechanisms. A well-known single-neuron model that performs classification is the perceptron (Minsky & Papert, 1969). It might seem that the recognition problem for spatiotemporal spike patterns can be reduced to it by a simple binning over time, in which the instantaneous spatial patterns of each time bin of each learned pattern are all considered as separate input patterns for a normal perceptron. This, however, is misleading, as spatiotemporal patterns cannot be shuffled over time without destroying their temporal information, while binned patterns contain no temporal information to start with. Bressloff and Taylor (1992) introduced a two-layer model of time-summating neurons that can be viewed as a discrete-time single-neuron model with an exponentially decaying response function for the inputs. This model solves the first problem mentioned above by adding the temporal memory of the response function. While having an architecture similar to ours, the task it performs is different: the mapping of one spatiotemporal pattern to another. Such a mapping task neither necessitates integration of
the input over time, nor does it reduce the dimensionality of the input. Hence, the analysis performed there does not apply to our case. In this article, we consider the input current to a single readout neuron, receiving a spatiotemporal pattern of spikes through synaptic connections. Theoretical results show that the firing rate of an integrate-and-fire model neuron is a monotonic function of both the mean and the variance of its input current (Tuckwell, 1988). Rauch, La Camera, Luscher, Senn, and Fusi (2003) demonstrated experimentally the same behavior in vitro by using in vivo–like current injections to neocortical pyramidal neurons. It is important to note that the variance referred to in this work is the variance over time, and not the variance over space referred to in other works (see, e.g., Silberberg, Bethge, Markram, Pawelzik, & Tsodyks, 2004). The proposed framework, along with the above results, allows us to derive learning rules for the synaptic weights, independent of the particular response properties of the readout neuron. We then use these rules in computer simulations to illustrate and quantify the model's ability to recognize spatiotemporal patterns.

2 Methods

2.1 Description of the Model. Our model (see Figure 1A) consists of a single readout neuron receiving $N$ afferent spike trains within a fixed time window $T$. Each spike contributes a synaptic current with an initial amplitude $W_i$ depending on the synapse, and an exponential decay with a synaptic time constant $\tau_S$. The total synaptic current is the input current to the neuron. The synaptic weights are normalized according to $\sum_{i=1}^{N} W_i^2 = 1$, and we assume $T \gg \tau_S$. We chose input patterns where each of the $N$ neurons fires exactly $n$ spikes, so that the precise spike timing provides the information for recognition. For each learned pattern $\mu \in \{1, \ldots, p\}$, each neuron $i \in \{1, \ldots, N\}$ fires spike number $k \in \{1, \ldots, n\}$ at time $t_{i,k}^{\mu} \in [0, T]$. An input pattern, $t^{\mu} = \{t_{i,k}^{\mu}\}_{i=1,k=1}^{N,n}$, is therefore defined as a fixed realization of spike times drawn from a common probability distribution $P(t)$, where each $t_{i,k}$ is an independent random variable distributed uniformly in $[0, T]$ (see Figure 1B). The input current $I^{\mu}(t)$, upon presentation of pattern $\mu$, is determined according to

$$I^{\mu}(t) = \sum_{i=1}^{N} W_i I_i^{\mu}(t)$$
$$\tau_S \dot{I}_i^{\mu}(t) = -I_i^{\mu}(t) + \xi_i^{\mu}(t)$$
$$\xi_i^{\mu}(t) = \sum_{k=1}^{n} \delta(t - t_{i,k}^{\mu}). \tag{2.1}$$
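The pattern generation and current dynamics of equation 2.1 can be sketched numerically. The following is a minimal illustration, not the authors' code: it uses the paper's default parameters (N = 400, n = 5, T = 250 msec, τS = 3 msec) and the closed-form solution of the current as a sum of decaying exponentials; variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, T, tau_s = 400, 5, 250.0, 3.0   # the paper's default parameters (msec)

# A pattern: spike times t[i, k] drawn uniformly in [0, T] (see Figure 1B).
t_spikes = rng.uniform(0.0, T, size=(N, n))

# Random synaptic weights normalized so that sum_i W_i^2 = 1.
W = rng.standard_normal(N)
W /= np.linalg.norm(W)

# Each spike adds a decaying exponential Theta(t - t_ik) * exp(-(t - t_ik)/tau_s)
# (the solution of equation 2.1 for one delta input); the total input current
# is the weighted sum of the normalized currents over neurons.
dt = 0.25
grid = np.arange(0.0, T, dt)
lag = grid[None, None, :] - t_spikes[:, :, None]          # shape (N, n, time)
I_norm = np.where(lag >= 0, np.exp(-lag / tau_s), 0.0).sum(axis=1)
I_total = W @ I_norm                                      # I(t) on the grid
```

The time-averaged normalized current per neuron comes out close to nτS/T, consistent with equation A.2 in the appendix.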
Figure 1: The model and sample input. (A) The components of the model. A spatiotemporal input pattern ξiµ (t) is fed via synapses to a readout neuron. Each presynaptic spike from neuron i contributes a decaying exponential with time constant τ S and initial height Wi to the input current. (B) A typical input pattern; each dot represents a spike. (C) Normalized synaptic currents (W = 1) for five input neurons (arbitrary units).
Here $I_i^{\mu}(t)$ are normalized synaptic currents, resulting from the spike train of a single input neuron, with the synaptic weights factored out (see Figure 1C).
The following leaky integrate-and-fire model (Lapicque, 1907; Tuckwell, 1988) will be used for an illustration of a typical readout neuron:

$$\tau_m \dot{V} = -V + I, \tag{2.2}$$

where $I$ is the input current and $V$ the membrane potential. Once $V$ reaches a threshold, an output spike is generated, and $V$ is immediately set to a constant reset value. We omitted a refractory period in this model since the output rates are not high.

3 Results

3.1 Moments of the Input Current. As discussed in section 1, we focus our attention on the mean and variance of the input current to the neuron upon presentation of an input pattern. In doing so, we are relying on the monotonic relations between these values and the spiking output of the readout neuron. The mean input current upon presentation of pattern $\mu$ (assuming $T \gg \tau_S$) is computed (see the appendix) to be $\langle I^{\mu} \rangle = \frac{n\tau_S}{T} \sum_{i=1}^{N} W_i$ (with $\langle f \rangle = \frac{1}{T}\int_0^T f\,dt$). Since there is no pattern-specific information contained in the mean current, it cannot be used for recognition. We therefore consider the variance of the input current to the readout neuron on presentation of pattern $\mu$ (abbreviated as variance of pattern $\mu$), and calculate it to be

$$\operatorname{Var}(I^{\mu}) \stackrel{\mathrm{def}}{=} \langle (I^{\mu})^2 \rangle - \langle I^{\mu} \rangle^2 = W^{\mathsf{T}} C^{\mu} W, \tag{3.1}$$

where

$$C_{ij}^{\mu} = \langle \delta I_i^{\mu}\, \delta I_j^{\mu} \rangle = \frac{\tau_S}{2T} \sum_{k_i,k_j=1}^{n} e^{-|t_{i,k_i}^{\mu} - t_{j,k_j}^{\mu}|/\tau_S} - \left(\frac{n\tau_S}{T}\right)^2, \qquad \delta I_i^{\mu} \stackrel{\mathrm{def}}{=} I_i^{\mu} - \langle I_i^{\mu} \rangle. \tag{3.2}$$
We see that the variance depends on the matrix $C^{\mu}$, which is the temporal covariance matrix of the normalized synaptic currents (see Figure 1C) resulting from presentation of pattern $\mu$. The $ij$th element of $C^{\mu}$ can be described as counting the fluctuations in the number of coincidences between pattern $\mu$'s spike trains $i$ and $j$, with a temporal resolution of $\tau_S$. Our aim is to maximize the variance for all learned patterns, leading to the cost function $-\sum_{\mu} \operatorname{Var}(I^{\mu})$. This is not a priori the best choice for a cost function, as different functions can perhaps take into account the spread of variances for the different patterns, but it is the most natural one.
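Because equation 3.2 gives C^µ in closed form in the spike times, the variance of a pattern can be evaluated without integrating the currents. A sketch (function and variable names are my own; a smaller N is used to keep the pairwise arrays light):

```python
import numpy as np

def pattern_cov(t_spikes, T, tau_s):
    """C^mu of equation 3.2:
    C_ij = (tau_s/2T) * sum_{ki,kj} exp(-|t_{i,ki} - t_{j,kj}|/tau_s) - (n*tau_s/T)^2
    for a single pattern with spike times t_spikes[i, k]."""
    N, n = t_spikes.shape
    # |t_{i,ki} - t_{j,kj}| for all neuron pairs and spike pairs.
    diff = np.abs(t_spikes[:, None, :, None] - t_spikes[None, :, None, :])
    C = (tau_s / (2.0 * T)) * np.exp(-diff / tau_s).sum(axis=(2, 3))
    return C - (n * tau_s / T) ** 2

rng = np.random.default_rng(1)
N, n, T, tau_s = 100, 5, 250.0, 3.0
t_spikes = rng.uniform(0.0, T, size=(N, n))
C = pattern_cov(t_spikes, T, tau_s)

W = rng.standard_normal(N)
W /= np.linalg.norm(W)
var = W @ C @ W          # Var(I^mu) = W^T C^mu W, equation 3.1
```

The matrix is symmetric by construction, and its diagonal (each train's coincidences with itself) is positive.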
Our problem is therefore to maximize $W^{\mathsf{T}} C W$ (see equation 3.1), where $C = \sum_{\mu} C^{\mu}$, with the constraint $\sum_{i=1}^{N} W_i^2 = 1$.

3.2 Derivation of Learning Rules. The solution to the maximization problem is known to be the leading eigenvector of $C$; however, we are also interested in deriving learning rules for $W$. By performing stochastic projected gradient descent (Kelley, 1962) on the cost function with the norm constraint on $W$, we derive the following rule:

$$\dot{W}_i(t) = \eta(t)\, \delta I^{\mu}(t) \left[ \delta I_i^{\mu}(t) - \delta I^{\mu}(t)\, W_i(t) \right], \tag{3.3}$$
which is similar to the online Oja's rule (Oja, 1982), with the input vector composed of the instantaneous normalized synaptic currents and the output being the total input current to the neuron. This is to be expected since the optimal $W$ is the leading eigenvector of the covariance matrix of the normalized synaptic currents. Since Oja's rule requires many iterations to converge, it is of interest to also look for a "one-shot" learning rule. We construct a correlation-based rule by using the coincidences between the different inputs,

$$W_i = \kappa \sum_{j=1}^{N} C_{ij}, \tag{3.4}$$
with κ a normalization factor. Figure 2 demonstrates the convergence of Oja’s rule to the optimal set of weights W. It can also be seen that the correlation rule is a step in the right direction, though suboptimal. We also verified that the correlation between the leading eigenvector and the weight vector resulting from Oja’s rule converges to 1, ensuring that the convergence is in the weights, and not only in performance. 3.3 Simulation Results. The main goal of the simulations was to assess the performance of the model for an increasing number of learned patterns. Unless otherwise stated, the parameter values in all simulations were N = 400 inputs, T = 250 msec, n = 5 spikes, and τ S = 3 msec. All numerical simulations were performed in MATLAB. As explained in section 2, we compare the input current variance for learned and background patterns. As shown earlier, the Oja learning rule converges to the leading eigenvector of the matrix C (see Figure 2), and we therefore used the MATLAB eig function instead of applying the learning rule in all simulations. An obvious exception is the illustration of convergence for the Oja rule. Figure 3A shows that both rules lead to higher input variances for the learned patterns relative to the background patterns, and, as expected, the
Figure 2: Convergence of Oja’s rule. The average variance of five learned patterns. Oja’s rule starts at the variance of the background patterns but quickly converges to the upper limit defined by the leading eigenvector of C. The correlation rule can be seen to be better than the background but suboptimal. The Oja learning step declined with iterations according to ηi = 0.02/(1 + i/5) (see equation 3.3).
Oja rule performs better. This is also illustrated for a specific pair of patterns in Figure 3B, where the difference in variances is evident. Although we took a neural model independent approach, we illustrate with a specific realization of the readout neuron—a leaky integrate-and-fire model (see section 2 and equation 2.2)—that the difference in the input variance can be used to elicit a pattern-specific spiking response (see Figures 3C and 3D). In order to quantify the performance of the model, we defined the probability for false alarm (see Figure 4A) as the probability of classifying a background pattern as a learned one, if a classification threshold is put at the variance of the worst (or 10th percentile) learned pattern variance. Probability for false alarm was calculated for each condition by creating p learned patterns, choosing the appropriate (Oja or correlation) weights, and calculating the variance of learned patterns. Then an additional 500 patterns were generated to estimate the distribution of background variances for each set of weights. The fraction of background variances above the minimal (or 10th percentile) learned pattern variance was calculated. The probability for false alarm was estimated as the mean of repeating this process 50 times. Figure 4B shows that for a small number of patterns, both learning rules perform quite well. However, as the number of patterns increases, the correlation rule quickly deteriorates, while the Oja rule exhibits a much later and slower decline in performance.
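The two weight computations — the one-shot correlation rule of equation 3.4 and the leading-eigenvector solution used in the simulations — reduce to a few lines of linear algebra. A sketch in which random symmetric matrices stand in for the pattern covariances C^µ, with numpy's `eigh` playing the role the text assigns to MATLAB's `eig`:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 5

# Random symmetric matrices as stand-ins for the pattern covariances C^mu.
A = rng.standard_normal((p, N, N))
C = sum(a + a.T for a in A)      # C = sum over learned patterns of C^mu

# One-shot correlation rule (equation 3.4): W_i = kappa * sum_j C_ij,
# with kappa fixed by the constraint sum_i W_i^2 = 1.
W_corr = C.sum(axis=1)
W_corr /= np.linalg.norm(W_corr)

# Optimal weights: the leading eigenvector of C maximizes W^T C W at ||W|| = 1.
eigvals, eigvecs = np.linalg.eigh(C)
W_opt = eigvecs[:, -1]           # eigh returns eigenvalues in ascending order

# The correlation rule is a step in the right direction but suboptimal.
assert W_opt @ C @ W_opt >= W_corr @ C @ W_corr
```

The final assertion mirrors Figure 2: the eigenvector solution attains the maximal variance, while the correlation-rule weights reach a lower value.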
Figure 3: Simulation results for recognition. (A, B) Input current to the readout neuron. Note that in B, the mean current for the correlation rule is larger than that of the Oja rule, but there is no difference in mean between learned and background patterns. (C, D) Membrane potential and spike output for the leaky integrate-and-fire readout neuron with τm = 10 msec and threshold of 2.6 and 6.5 for Oja and correlation rules, respectively.
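A minimal Euler-scheme sketch of the leaky integrate-and-fire readout of equation 2.2, with τm = 10 msec as in the caption; the constant input currents and the threshold value here are stand-ins of my own, not values from the paper:

```python
import numpy as np

def lif_spikes(I, dt=0.1, tau_m=10.0, threshold=1.0, v_reset=0.0):
    """Euler integration of tau_m * dV/dt = -V + I (equation 2.2);
    emit a spike and reset V whenever V crosses the threshold."""
    V, spikes = 0.0, []
    for k, I_k in enumerate(I):
        V += dt / tau_m * (-V + I_k)
        if V >= threshold:
            spikes.append(k * dt)   # spike time in msec
            V = v_reset
    return spikes

t = np.arange(0.0, 250.0, 0.1)
suprathreshold = lif_spikes(1.5 * np.ones_like(t))  # steady-state V would be 1.5
subthreshold = lif_spikes(0.5 * np.ones_like(t))    # steady-state V would be 0.5

assert len(suprathreshold) > 0 and len(subthreshold) == 0
```

With a constant current the neuron fires periodically once the current exceeds the threshold, which is the mechanism by which the larger variance of learned patterns translates into extra output spikes in Figures 3C and 3D.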
Figure 4: Quantification of the model’s performance. (A) The estimation of probability for false alarm. Sixty patterns were learned and their variances plotted (see the dots at the bottom of the figure). The variances of 500 background patterns were used to estimate their distribution. The results of choosing different recognition thresholds are shown. (B, C) Probability for false alarm as a function of the number of learned patterns for 0% (B) and 10% (C) miss. Notice that the x-axis is in log-scale, illustrating the slow degradation in performance. Values of 0 and 1 are actually less than 0.001 and more than 0.999, respectively, due to the finite sample used to estimate the background variance distribution. Error bars are smaller than the markers.
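The false-alarm estimation procedure can be sketched as follows. The variance samples below are random stand-ins rather than actual pattern variances, but the thresholding logic (threshold at the worst or 10th-percentile learned variance) follows the text:

```python
import numpy as np

def false_alarm_prob(learned_var, background_var, miss_percent=0.0):
    """Fraction of background patterns whose variance exceeds a threshold placed
    at the worst (0% miss) or, e.g., 10th-percentile (10% miss) learned variance."""
    threshold = np.percentile(learned_var, miss_percent)
    return float(np.mean(background_var > threshold))

rng = np.random.default_rng(3)
learned = rng.normal(1.0, 0.2, size=60)        # stand-in learned-pattern variances
background = rng.normal(0.5, 0.2, size=500)    # stand-in background variances

p0 = false_alarm_prob(learned, background, 0.0)    # 0% miss
p10 = false_alarm_prob(learned, background, 10.0)  # 10% miss
assert 0.0 <= p10 <= p0 <= 1.0   # raising the threshold can only reduce false alarms
```

In the paper this estimate is additionally averaged over 50 repetitions of learning and background sampling.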
[Figure 4 panels: (A) background-variance distribution with recognition thresholds at 0% and 10% miss, giving false alarms of 22% and 3%, respectively; (B, C) probability for false alarm vs. number of learned patterns (log scale) for the Oja and correlation rules.]
Figure 5: Robustness to time warp. (A) Twenty patterns were learned at T = 250 msec, and then the model was tested with warped versions of various ratios. The performance hardly changes over a large range. Note that the x-axis is in log2 scale. (B) Schematic explanation of the reason for this robustness. The model relies on the number and quality of input coincidences, not on their location, making it robust to time warp.
Since the matrix $C$ depends on the number and temporal precision of the coincidences in the input, and not on their exact location in time (see equation 3.2), we expected that the model's performance would be robust to time warp effects (see Figure 5B). To quantify this robustness, a set of 20 patterns was learned to obtain the synaptic weights $W$. Warped versions of both learned and background patterns were formed by setting $t^{\mu} \to \alpha t^{\mu}$, $T \to \alpha T$. The original weights were then used to calculate
the variance of the warped versions and thus determine the probability for false alarm. Note that this warp changes both the firing rate and the average pattern variance, resulting in a different threshold for each $\alpha$. A biological realization of this dynamic threshold can use the firing rate as a plausible tuning signal. Figure 5A shows that indeed there is a large plateau, indicating a small decrease in performance for a large range of expansion or contraction of the patterns.

3.4 Capacity Estimate. The performance results presented in the previous section are based on numerical simulations. In order to get a rough analytical estimate of the capacity, we first approximate the input spikes as Poisson trains to facilitate calculations. We then compare the variance of a random pattern to that obtained by the correlation rule. The capacity will be defined as the limit where the two variances coincide. Since the $C^{\mu}$ matrices depend on coincidences with a temporal resolution of $\tau_S$, we approximate the normalized synaptic currents by using independent random binary variables $x_{it}^{\mu}$:

$$x_{it}^{\mu} = \frac{1}{\sqrt{2}} \begin{cases} 1 - \frac{n\tau_S}{T}, & \text{with probability } \frac{n\tau_S}{T} \\[0.5ex] -\frac{n\tau_S}{T}, & \text{with probability } 1 - \frac{n\tau_S}{T} \end{cases} \tag{3.5}$$

$$\langle x \rangle = 0 \tag{3.6}$$

$$\langle x^2 \rangle = \frac{n\tau_S}{2T} \tag{3.7}$$

$$\langle x^m \rangle \cong \frac{n\tau_S}{2^{m/2}\,T}, \quad m > 1 \tag{3.8}$$

$$C_{ij}^{\mu} = \frac{\tau_S}{T} \sum_{t=0}^{T/\tau_S} x_{it}^{\mu} x_{jt}^{\mu}, \tag{3.9}$$

where $\langle f(t) \rangle = \int f \, dP(t)$. For a background pattern $\{x_{it}\}$, the $W$'s and the $x_{it}$'s are independent variables, since the $W$'s are determined by the learned patterns. Thus, the variance of a background pattern is given by

$$\langle \operatorname{Var}_{\mathrm{bg}} \rangle = \sum_{i,j=1}^{N} W_i W_j \frac{\tau_S}{T} \sum_{t=0}^{T/\tau_S} \langle x_{it} x_{jt} \rangle = \sum_{i=1}^{N} W_i^2 \langle x^2 \rangle = \langle x^2 \rangle = \frac{n\tau_S}{2T}. \tag{3.10}$$
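The background-variance estimate of equation 3.10 can be checked by Monte Carlo with the binary variables of equation 3.5 (a sketch; the trial count and the smaller N are my own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
N, n, T, tau_s = 200, 5, 250.0, 3.0
q = n * tau_s / T                 # probability of the "spike" value in equation 3.5
bins = int(T / tau_s)             # number of tau_s-wide time bins

# Fixed weights, independent of the background patterns, with sum_i W_i^2 = 1.
W = rng.standard_normal(N)
W /= np.linalg.norm(W)

trials = []
for _ in range(500):
    # x = (1/sqrt(2))(1 - q) with probability q, -(1/sqrt(2)) q otherwise:
    # zero mean, <x^2> = q(1 - q)/2, close to n*tau_s/(2T) for q << 1.
    x = np.where(rng.random((N, bins)) < q, 1.0 - q, -q) / np.sqrt(2.0)
    C = (tau_s / T) * x @ x.T     # equation 3.9
    trials.append(W @ C @ W)      # variance of one background pattern

print(np.mean(trials))            # close to n*tau_s/(2T) = 0.03
```

The empirical mean sits slightly below nτS/(2T) because ⟨x²⟩ = q(1 − q)/2 carries an extra factor (1 − q); for q = 0.06 the difference is small.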
For the correlation rule, the synaptic weights are given by

$$W_i = \kappa \sum_{j=1}^{N} C_{ij} = \kappa \sum_{\mu=1}^{p} \sum_{j=1}^{N} \frac{\tau_S}{T} \sum_{t=0}^{T/\tau_S} x_{it}^{\mu} x_{jt}^{\mu}. \tag{3.11}$$

We find $\kappa$ from the normalization constraint:

$$1 = \Big\langle \sum_{i=1}^{N} W_i^2 \Big\rangle = \kappa^2 \sum_{\mu,\nu=1}^{p} \sum_{i,j,k=1}^{N} \Big(\frac{\tau_S}{T}\Big)^2 \sum_{t,s=0}^{T/\tau_S} \langle x_{it}^{\mu} x_{jt}^{\mu} x_{is}^{\nu} x_{ks}^{\nu} \rangle$$
$$= \kappa^2 \left[ \langle x^4 \rangle \frac{Np}{T/\tau_S} + \langle x^2 \rangle^2 \left( \frac{N^2 p}{T/\tau_S} - \frac{2Np}{T/\tau_S} + Np^2 \right) \right] \cong \kappa^2 \frac{Np}{4T/\tau_S} \left[ \frac{(N-2)n}{T/\tau_S} + np + 1 \right]. \tag{3.12}$$

The variance of pattern $\mu$ can now be estimated as

$$\langle \operatorname{Var}_{\mathrm{Corr}}(\mu) \rangle = \kappa^2 \sum_{\eta,\nu=1}^{p} \sum_{i,j,k,l=1}^{N} \Big(\frac{\tau_S}{T}\Big)^3 \sum_{t,s,v=0}^{T/\tau_S} \langle x_{it}^{\mu} x_{jt}^{\mu} x_{is}^{\nu} x_{ks}^{\nu} x_{jv}^{\eta} x_{lv}^{\eta} \rangle$$
$$= \kappa^2 \frac{nNp\tau_S}{8T} \Big( 1 - 8n\tau_S/T + 3np - 6n^2 p\tau_S/T + 7n^2\tau_S^2/T^2 + n^2 p^2 + 5nN\tau_S/T - 6n^2 N\tau_S^2/T^2 + 3n^2 Np\tau_S/T + n^2 N^2 \tau_S^2/T^2 \Big) \tag{3.13}$$

$$\cong \begin{cases} \dfrac{n\tau_S}{2T} \, \dfrac{N}{p\,T/\tau_S}, & N \gg p\,T/\tau_S, \\[1.5ex] \dfrac{n\tau_S}{2T} \,\big(1 + O(1)\big), & N \ll p\,T/\tau_S. \end{cases} \tag{3.14}$$

Thus, we see that unless $N \gg p\,T/\tau_S$, the variance of the learned patterns is similar to that of the background ones; they cannot be successfully
recognized. We conclude that the maximal number of patterns that can be recognized in the network with a correlation learning rule is on the order of $N\tau_S/T$. As we showed in the previous section, the performance of the Oja rule is superior to that of the correlation rule, but we do not have analytical estimates for its capacity.

4 Discussion

By considering the input current to the readout neuron as the model's output, we were able to derive learning rules that are independent of a particular response function of that neuron. These rules enabled recognition of a large number of spatiotemporal spiking patterns by a single neuron. An emergent property of the model, of particular importance when considering speech recognition applications, is robustness of the recognition to time warping of the input patterns. We illustrated that these model-independent rules are applicable to specific spiking models by showing pattern recognition by a leaky integrate-and-fire neuron. A biological realization of the proposed learning rules requires plasticity to be induced as a function of the input to the neuron, without an explicit role for postsynaptic spikes. While at odds with the standard plasticity rules (Abbott & Nelson, 2000), there are some experimental indications that such mechanisms exist. Sjostrom, Turrigiano, and Nelson (2004) used a long-term depression protocol in which the postsynaptic cell is depolarized but does not generate action potentials. Nonlinear effects in the dendrites, such as NMDA spikes (Schiller, Major, Koester, & Schiller, 2000), can provide a mechanism for coincidence detection. It remains a challenge to see whether our proposed rules are compatible with those and similar mechanisms. Although our model was defined as a single neuron receiving spike trains, the same framework is applicable for any linear summation of time-decaying events.
One alternative realization could be using bursts of spikes as the elementary input events while still converging on a single readout neuron. This realization is in accord with experimental findings in rabbit retinal ganglion cells showing temporal accuracy of bursts in response to visual stimuli (Berry et al., 1997). A more general realization is an array of different local networks of neurons firing a spatiotemporal pattern of population spikes. The readout neuron in this case can be replaced by a single readout network, where the fraction of active cells stands for the magnitude of the input current. The process of recognition entails several stages of processing, of which our model is but one. As such, our model does not address all the different aspects associated with recognition. For example, if the spike pattern is reversed, our model, which is sensitive to coincidences and not to directionality, will recognize the reversed pattern as if it was the original. Yet we know from everyday experience that reversed words sound completely different. A possible solution to this problem lies in the transformation
between the sensory stimulus and the spike train received by our model, which is not the immediate recipient of the sensory data. Since the transformation is not a simple linear one, time reversal of the stimulus will, in general, not lead to time reversal of the spike train associated with this stimulus. Finally, we present some directions for future work. The performance of the model was quantified by defining the probability for a false alarm (incorrect identification of a background pattern as a learned one). Computer simulations showed a very slow degradation in the model's performance when the number of learned patterns increased. These results, however, are numerical and sample only a small volume of the parameter space. A more complete understanding of the model's behavior could be achieved by analytical estimation of this performance, which remains a challenge for a future study. An emergent feature of the model, which is of great ecological significance, is robustness to time warping of the input patterns. There are, however, a variety of other perturbations to which recognition should be robust. These include background noise, probabilistic neurotransmitter release, and jitter in the spike timing. The robustness of the model to these perturbations was not addressed in this study.

Appendix: Derivation of the Moments of the Input Current

Starting from equation 2.1, we calculate the normalized synaptic currents,
$$I_i^{\mu}(t) = \sum_{k=1}^{n} \Theta(t - t_{i,k}^{\mu})\, e^{-(t - t_{i,k}^{\mu})/\tau_S}, \tag{A.1}$$

where $\Theta$ is the Heaviside step function. We can now calculate the moments ($\langle f \rangle = \frac{1}{T}\int_0^T f\,dt$) of the input current upon presentation of pattern $\mu$ (assuming $T \gg \tau_S$):

$$\langle I_i^{\mu} \rangle = \frac{n\tau_S}{T} \tag{A.2}$$

$$\langle I^{\mu} \rangle = \frac{n\tau_S}{T} \sum_{i=1}^{N} W_i. \tag{A.3}$$

The variance of the input current can be calculated as follows:

$$\langle I_i^{\mu} I_j^{\mu} \rangle = \frac{\tau_S}{2T} \sum_{k_i,k_j=1}^{n} e^{-|t_{i,k_i}^{\mu} - t_{j,k_j}^{\mu}|/\tau_S} \tag{A.4}$$

$$\langle (I^{\mu})^2 \rangle = \sum_{i,j=1}^{N} W_i W_j \langle I_i^{\mu} I_j^{\mu} \rangle \tag{A.5}$$

$$\operatorname{Var}(I^{\mu}) \stackrel{\mathrm{def}}{=} \langle (I^{\mu})^2 \rangle - \langle I^{\mu} \rangle^2 = \sum_{i,j=1}^{N} W_i W_j \left( \langle I_i^{\mu} I_j^{\mu} \rangle - \langle I_i^{\mu} \rangle \langle I_j^{\mu} \rangle \right) = \sum_{i,j=1}^{N} W_i W_j \langle \delta I_i^{\mu}\, \delta I_j^{\mu} \rangle \stackrel{\mathrm{def}}{=} \sum_{i,j=1}^{N} W_i W_j C_{ij}^{\mu} = W^{\mathsf{T}} C^{\mu} W, \tag{A.6}$$

where

$$C_{ij}^{\mu} = \langle \delta I_i^{\mu}\, \delta I_j^{\mu} \rangle = \frac{\tau_S}{2T} \sum_{k_i,k_j=1}^{n} e^{-|t_{i,k_i}^{\mu} - t_{j,k_j}^{\mu}|/\tau_S} - \left(\frac{n\tau_S}{T}\right)^2, \qquad \delta I_i^{\mu} \stackrel{\mathrm{def}}{=} I_i^{\mu} - \langle I_i^{\mu} \rangle. \tag{A.7}$$
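The closed form of equation A.4 can be checked against direct numerical integration of the currents of equation A.1. A small self-contained sketch; spike times are deliberately kept a few τS away from T so that the finite integration window matches the infinite-tail closed form:

```python
import numpy as np

T, tau_s, n = 250.0, 3.0, 5
rng = np.random.default_rng(5)
# Spike times kept away from T: the tail contribution beyond T is then negligible.
t_i = rng.uniform(0.0, T - 10 * tau_s, n)
t_j = rng.uniform(0.0, T - 10 * tau_s, n)

def current(t_spk, grid):
    # Equation A.1: a sum of Heaviside-gated decaying exponentials.
    lag = grid[None, :] - t_spk[:, None]
    return np.where(lag >= 0, np.exp(-lag / tau_s), 0.0).sum(axis=0)

grid = np.arange(0.0, T, 0.01)
numeric = np.mean(current(t_i, grid) * current(t_j, grid))   # <I_i I_j> by quadrature

closed = (tau_s / (2 * T)) * np.exp(
    -np.abs(t_i[:, None] - t_j[None, :]) / tau_s).sum()      # equation A.4

assert abs(numeric - closed) < 5e-4
```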
Acknowledgments

We thank Ofer Melamed, Barak Blumenfeld, Alex Loebel, and Alik Mokeichev for critical reading of the manuscript. We thank two anonymous reviewers for constructive comments on the previous version of the manuscript. The study is supported by the Israeli Science Foundation and the Irving B. Harris Foundation. M. T. is the incumbent of the Gerald and Hedy Oliven Professorial Chair in Brain Research.

References

Abbott, L., & Nelson, S. (2000). Synaptic plasticity: Taming the beast. Nat. Neurosci., 3(Suppl.), 1178–1183.
Aertsen, A., Smolders, J., & Johannesma, P. (1979). Neural representation of the acoustic biotope: On the existence of stimulus-event relations for sensory neurons. Biol. Cybern., 32(3), 175–185.
Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the behaving macaque monkey. Neural Comput., 8(6), 1185–1202.
Berry, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. PNAS, 94(10), 5411–5416.
Bressloff, P., & Taylor, J. (1992). Perceptron-like learning in time-summating neural networks. Journal of Physics A: Mathematical and General, 25(16), 4373–4388.
Hopfield, J., & Brody, C. (2001). What is a moment? Transient synchrony as a collective mechanism for spatiotemporal integration. Proc. Natl. Acad. Sci. USA, 98(3), 1282–1287.
Jin, D. (2004). Spiking neural network for recognizing spatiotemporal sequences of spikes. Phys. Rev. E Stat. Nonlin. Soft Matter Phys., 69(2 Pt. 1), 021905.
Kelley, H. (1962). Method of gradients. In G. Leitmann (Ed.), Optimization techniques (pp. 205–254). New York: Academic Press.
Lapicque, L. (1907). Recherches quantitatives sur l'excitation électrique des nerfs traitée comme une polarisation. J. Physiol. Paris, 9, 622–635.
Margoliash, D., & Konishi, M. (1985). Auditory representation of autogenous song in the song system of white-crowned sparrows. PNAS, 82(17), 5997–6000.
Minsky, M., & Papert, S. (1969). Perceptrons: An introduction to computational geometry. Cambridge, MA: MIT Press.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biol., 15(3), 267–273.
Rauch, A., La Camera, G., Luscher, H., Senn, W., & Fusi, S. (2003). Neocortical pyramidal cells respond as integrate-and-fire neurons to in vivo–like input currents. J. Neurophysiol., 90(3), 1598–1612.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Schiller, J., Major, G., Koester, H., & Schiller, Y. (2000). NMDA spikes in basal dendrites of cortical pyramidal neurons. Nature, 404(6775), 285–289.
Silberberg, G., Bethge, M., Markram, H., Pawelzik, K., & Tsodyks, M. (2004). Dynamics of population rate codes in ensembles of neocortical neurons. J. Neurophysiol., 91(2), 704–709.
Sjostrom, P., Turrigiano, G., & Nelson, S. (2004). Endocannabinoid-dependent neocortical layer-5 LTD in the absence of postsynaptic spiking. J. Neurophysiol., 92(6), 3338–3343.
Tuckwell, H. (1988). Introduction to theoretical neurobiology: Vol. 2. Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
Uzzell, V., & Chichilnisky, E. (2004). Precision of spike trains in primate retinal ganglion cells. J. Neurophysiol., 92(2), 780–789.
Received June 30, 2005; accepted February 17, 2006.
LETTER
Communicated by Jonathan Victor
Estimating Spiking Irregularities Under Changing Environments Keiji Miura [email protected] Department of Physics, Kyoto University, Kyoto 606-8502, and Intelligent Cooperation and Control, PRESTO, JST, Chiba 277-8561, Japan
Masato Okada [email protected] Department of Complexity Science and Engineering, University of Tokyo, Chiba 277-8561; Intelligent Cooperation and Control, PRESTO, JST, Chiba 277-8561; and Brain Science Institute, RIKEN, Saitama 351-0198, Japan
Shun-ichi Amari [email protected] Brain Science Institute, RIKEN, Saitama 351-0198, Japan
We considered a gamma distribution of interspike intervals as a statistical model for neuronal spike generation. A gamma distribution is a natural extension of the Poisson process taking the effect of a refractory period into account. The model is specified by two parameters: a time-dependent firing rate and a shape parameter that characterizes spiking irregularities of individual neurons. Because the environment changes over time, observed data are generated from a model with a time-dependent firing rate, which is an unknown function. A statistical model with an unknown function is called a semiparametric model and is generally very difficult to solve. We used a novel method of estimating functions in information geometry to estimate the shape parameter without estimating the unknown function. We obtained an optimal estimating function analytically for the shape parameter independent of the functional form of the firing rate. This estimation is efficient without Fisher information loss and better than maximum likelihood estimation. We suggest a measure of spiking irregularity based on the estimating function, which may be useful for characterizing individual neurons in changing environments. 1 Introduction The firing patterns of cortical neurons look very noisy (Holt, Softky, Koch, & Douglas, 1996), so probabilistic models are necessary to describe them (Cox & Lewis, 1966; Sakai, Funahashi, & Shinomoto, 1999; Tuckwell, 1988). For Neural Computation 18, 2359–2386 (2006)
© 2006 Massachusetts Institute of Technology
example, Baker and Lemon (2000) showed that the firing patterns recorded from motor areas can be explained using a continuous-time rate-modulated gamma process. Their model had a rate parameter ξ and a shape parameter κ related to spiking irregularity. ξ was assumed to be a function of time because it depended strongly on the behavior of the monkey. κ was assumed to be unique to individual neurons and constant over time. The assumption that κ is unique to individual neurons is also supported by other studies (Shinomoto, Miura, & Koyama, 2005; Shinomoto, Miyazaki, Tamura, & Fujita, 2005; Shinomoto, Shima, & Tanji, 2003). However, these indirect supports are not conclusive. Therefore, we need to accurately estimate κ to make the assumption more reliable. If the assumption is correct, neurons may be identified by κ estimated from the spiking patterns, and κ may provide useful information about the function of a neuron. In other words, it may be possible to classify neurons according to functional firing patterns rather than static anatomical properties. Thus, it is very important to accurately estimate κ in the field of neuroscience. In reality, however, it is very difficult to estimate all the parameters in the model from the observed spike data. The reason is that the unknown function for the time-dependent firing rate ξ (t) has infinite degrees of freedom. This kind of estimation problem is called a semiparametric model (Bickel, Klaassen, Ritov, & Wellner, 1993; Groeneboom & Wellner, 1992; Pfanzagl, 1990; van der Vaart, 1998) and is generally very difficult to solve. Are there any ingenious methods of estimating κ accurately to overcome this difficulty? Ikeda (2005) pointed out that the problem we need to consider is the semiparametric model. However, the problem remains unsolved. 
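The gamma interspike-interval model described above can be illustrated numerically. A hedged sketch (the parameterization with mean interval 1/ξ and shape κ matches the description above; the specific parameter values and variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(6)
kappa, xi = 2.5, 10.0   # shape (irregularity) and firing rate in spikes/sec

# Gamma interspike intervals with shape kappa and mean 1/xi: scale = 1/(kappa*xi).
# kappa = 1 recovers the Poisson process; kappa > 1 suppresses short intervals,
# mimicking a refractory period.
isi = rng.gamma(shape=kappa, scale=1.0 / (kappa * xi), size=100_000)

print(isi.mean())                 # close to 1/xi = 0.1 sec
print(isi.std() / isi.mean())     # coefficient of variation, close to 1/sqrt(kappa)
```

The coefficient of variation 1/√κ is one way the shape parameter manifests as spiking irregularity, independent of the rate ξ.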
There is a method called estimating functions (Godambe, 1960, 1976, 1991; Mukhopadhyay, 2004) for semiparametric problems, and a general theory has been developed (Amari, 1987; Amari & Kawanabe, 1997; Amari & Kumon, 1988) from the viewpoint of information geometry (Amari, 1982, 1985, 1998; Amari, Kurata, & Nagaoka, 1992; Amari & Nagaoka, 2001; Murray & Rice, 1993). However, the method of estimating functions cannot be applied to our problem in its original form. In this letter, we consider the semiparametric model suggested by Ikeda (2005) instead of the continuous-time rate-modulated gamma process. In this discrete-time rate-modulated model, the firing rate varies in time but assumes a fixed value during each interspike interval. This model is a mixture model and can represent various types of interspike interval distributions by adjusting its weight function. For our model, the difficulty of semiparametric models can be explained as follows. If one parameterizes the rate function to be estimated in a manner that does not make assumptions about its form, one needs one parameter for each spike. There are then more parameters than data points, unless there are repeated measures over the same time period. In spite of this difficulty, κ can be estimated by assuming that we have (at least) two
Estimating Spiking Irregularities Under Changing Environments
observations at each rate and using the method of estimating functions for semiparametric models, in which we do not need to estimate the firing rate. Various attempts have been made to solve semiparametric models. Neyman and Scott (1948) pointed out that the maximum likelihood method does not generally provide a consistent estimator when the number of parameters increases in proportion to the number of observations. In fact, we show that maximum likelihood estimation for our problem is biased. Ritov and Bickel (1990) and Bickel et al. (1993) considered asymptotic attainability of the information bound purely mathematically, but their results are not practical for application to our problem. Amari and Kawanabe (1997) gave a practical method, the method of estimating functions, for estimating a finite number of parameters without estimating an unknown function. If this method can be applied, κ can be estimated consistently, independently of the functional form of the firing rate. In this article, we show that the model we consider belongs to the class of the exponential form defined by Amari and Kawanabe (1997). However, an estimating function does not exist unless multiple observations are given for each firing rate ξ. We show that if multiple observations are given, the method of estimating functions can be applied; in that case, the estimating function of κ can be obtained analytically, and κ can be estimated consistently, independently of the functional form of the firing rate. In general, estimation using estimating functions is not efficient. However, for our problem, this method yields an optimal estimator in the sense of Fisher information (Amari & Kawanabe, 1997). That is, we obtain an efficient estimator whose mean-square error is asymptotically the smallest. The estimator also generalizes well even when the assumptions of the model are violated.
We suggest a measure of spiking irregularity based on the estimating function, which may be useful for characterizing individual neurons in the case where only a single observation is given for each firing rate.

2 Maximum Likelihood Estimation

2.1 Simple Case. We consider the following gamma distribution, defined as

q(T; ξ, κ) = (ξκ)^κ / Γ(κ) · T^{κ−1} e^{−ξκT},    (2.1)
where the random variable T denotes an interspike interval. We generate interspike intervals from this distribution and align them to make a spike train. The mean and variance of the interspike intervals are

E[T] = 1/ξ,  Var(T) = 1/(ξ²κ).    (2.2)
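As a quick numerical check of equations 2.1 and 2.2, one can sample interspike intervals from this gamma density and compare the empirical moments with the formulas above (a minimal sketch in Python/NumPy; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
xi, kappa = 2.0, 4.0          # mean firing rate and shape parameter
n = 200_000

# q(T; xi, kappa) is a gamma density with shape kappa and scale 1/(xi*kappa)
T = rng.gamma(shape=kappa, scale=1.0 / (xi * kappa), size=n)

print(T.mean())   # ~ 1/xi             (equation 2.2)
print(T.var())    # ~ 1/(xi**2 * kappa)
```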
K. Miura, M. Okada, and S.-I. Amari
Figure 1: Probability densities q(T; ξ = 1, κ) of gamma distributions for κ = 1, 4, 16.
ξ is the mean firing rate, and κ is called a shape parameter. κ = 1 corresponds to a Poisson process, in which the instantaneous firing rate (hazard function) is constant over time, independent of the previous firing time; in this case, a spike train looks irregular. When κ is large, the gamma distribution can be approximated by a normal distribution whose variance decreases with increasing κ; in the limit of large κ, the interspike intervals become completely regular. Thus, κ is related to spiking irregularity. Figure 1 plots the probability densities of gamma distributions for various κ. We can scale T so that ξ = 1 because ξ always appears as ξT in q(T; ξ, κ). We assume that ξ changes over time under changing environments. One may assume that ξ is generated for each T from a probability density k(ξ) whose functional form is unknown. Let ξ^(l) be the lth firing rate and T^(l) be the lth observed interspike interval. Thus, we have N + 1 parameters {ξ^(1), ξ^(2), ..., ξ^(N), κ} and N observations {T^(1), T^(2), ..., T^(N)}. The purpose is to estimate κ, which may be unique to individual neurons, by maximum likelihood estimation. Here we estimate all the parameters because we need all the ξ^(l)'s to estimate κ. The likelihood that the T^(l)'s are generated from the gamma distribution with {ξ^(1), ξ^(2), ..., ξ^(N), κ} is given by

L = ∏_{l=1}^{N} q(T^(l); ξ^(l), κ).    (2.3)
In maximum likelihood estimation, we choose the parameter values that maximize the likelihood. Without loss of generality, we can consider maximizing the log likelihood:

log L = Σ_{l=1}^{N} log q(T^(l); ξ^(l), κ).    (2.4)
The estimated parameters must satisfy

∂ log L / ∂κ = Σ_{l=1}^{N} u(T^(l); ξ^(l), κ) = 0 and    (2.5)

∂ log L / ∂ξ^(l) = v(T^(l); ξ^(l), κ) = 0    (2.6)

for all l, where the score functions for q(T; ξ, κ) are defined as

u(T; ξ, κ) = ∂ log q(T; ξ, κ) / ∂κ = 1 − ξT + log T + log(κξ) − φ(κ) and    (2.7)

v(T; ξ, κ) = ∂ log q(T; ξ, κ) / ∂ξ = κ/ξ − κT,    (2.8)
where the digamma function φ(κ) is defined using the gamma function Γ(κ) as

φ(κ) = Γ′(κ) / Γ(κ).    (2.9)

The parameters are estimated by solving these equations as

κ̂ = ∞ and    (2.10)

ξ̂^(l) = 1 / T^(l)    (2.11)

for all l. This result can be understood intuitively as follows. When the mean μ and standard deviation σ of a normal distribution are estimated from a single observation x, they are estimated as μ̂ = x and σ̂ = 0. Similarly, ξ and κ of a gamma distribution q(T; ξ, κ) are estimated from a single observation T as ξ̂ = 1/T and κ̂ = ∞, corresponding to zero variance. Thus, two or more observations are required to estimate κ.
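The degeneracy κ̂ = ∞ can also be seen directly: substituting ξ = 1/T into log q gives κ log κ − κ − log Γ(κ) − log T, which is increasing in κ (since φ(κ) < log κ), so the profiled likelihood has no finite maximizer. A minimal numerical illustration (our own sketch, not part of the original derivation):

```python
import numpy as np
from scipy.special import gammaln

def profile_loglik(kappa, T):
    """log q(T; xi=1/T, kappa): the log likelihood of equation 2.1
    profiled over xi for a single observation T."""
    xi = 1.0 / T
    return (kappa * np.log(xi * kappa) + (kappa - 1.0) * np.log(T)
            - xi * kappa * T - gammaln(kappa))

T = 0.7
ks = np.array([1.0, 2.0, 10.0, 100.0, 1000.0])
vals = profile_loglik(ks, T)
print(vals)   # strictly increasing in kappa: the supremum is at kappa = infinity
```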
2.2 Cases with Multiple Observations for Each ξ. Next we consider the case where m observations are given for each ξ^(l), which may be distributed according to k(ξ). Let {T} = {T_1, ..., T_m} be the m observations generated from the same distribution, specified by ξ and κ. We have N such observation sets {T^(l)}, l = 1, ..., N, with a common κ and different ξ^(l). Thus, {T_1^(l), ..., T_m^(l)} are generated from the same firing rate ξ^(l). Let us take one set {T}. The probability model can be written as

p({T}; ξ, κ) = ∏_{i=1}^{m} q(T_i; ξ, κ).    (2.12)
In this case, the score functions for p({T}; ξ, κ) become

u = ∂ log p({T}; ξ, κ) / ∂κ = Σ_{i=1}^{m} (1 − ξT_i + log T_i + log(κξ) − φ(κ)) and    (2.13)

v = ∂ log p({T}; ξ, κ) / ∂ξ = Σ_{i=1}^{m} (κ/ξ − κT_i).    (2.14)
Note that a score function is defined as the derivative of the log likelihood with respect to a parameter. Then κ can be estimated by solving the equation

Σ_{l=1}^{N} Σ_{i=1}^{m} (log T_i^(l) + log(κ̂ ξ̂^(l)) − φ(κ̂)) = 0,    (2.15)

where

ξ̂^(l) = ( (1/m) Σ_{i=1}^{m} T_i^(l) )^{−1}    (2.16)

for all l. As we show numerically later, the maximum likelihood estimator is biased even when an infinite number of observations is given (N → ∞) for fixed m. When the number of parameters is finite, the maximum likelihood estimator is in general asymptotically consistent. However, as Neyman and Scott (1948) pointed out, when the number of parameters increases with the number of observations, the maximum likelihood estimator is not necessarily asymptotically consistent. To obtain an unbiased estimator of κ, we use the method of estimating functions in what follows.
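Equations 2.15 and 2.16 reduce the maximum likelihood problem to one-dimensional root finding in κ̂, which makes the bias easy to observe numerically. A sketch, assuming SciPy is available (symbols as in the text; the choice of k(ξ) is an arbitrary illustration):

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(1)
kappa_true, m, N = 4.0, 2, 50_000

# N blocks of m intervals; each block l has its own rate xi^(l) ~ k(xi)
xi = rng.uniform(0.5, 2.0, size=N)                 # an arbitrary k(xi)
T = rng.gamma(kappa_true, 1.0 / (xi * kappa_true)[:, None], size=(N, m))

xi_hat = 1.0 / T.mean(axis=1)                      # equation 2.16

def mle_equation(kappa):
    # equation 2.15: sum over l, i of log T_i^(l) + log(kappa*xi_hat^(l)) - digamma(kappa)
    return np.sum(np.log(T) + np.log(kappa * xi_hat)[:, None] - digamma(kappa))

kappa_mle = brentq(mle_equation, 0.1, 100.0)
print(kappa_mle)   # noticeably above kappa_true = 4: the MLE is biased for fixed m
```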
Figure 2: Schematic diagram for finding estimating functions from the viewpoint of information geometry. u⊥ can be obtained by projecting u so that u⊥ and v’s are orthogonal to each other. In information geometry, score functions are represented as tangent vectors. The cone T N represents the multidimensional space spanned by v’s. The cone T E represents the multidimensional space spanned by u⊥ ’s. Although T E is one-dimensional in our model, it can be multidimensional in general. The cone T A represents the zero-mean functions of T that are orthogonal to both T N and T E . Note that in information geometry for semiparametric models, tangent vectors are functions and span an infinite-dimensional Hilbert space.
3 Theory of Estimating Functions

We introduce the information-geometric theory of estimating functions developed by Amari and Kawanabe (1997). Their method is based on the global geometry of families of probability distributions (see Figure 2) and provides a general way of finding unbiased estimators for semiparametric problems. Here, however, we summarize only the conditions necessary for obtaining the estimator for our problem. Let p(x; θ, k) be a probability density function of a random variable x, where the parameter of interest θ is a scalar and the nuisance parameter k is infinite-dimensional, typically a function. The purpose is to estimate θ consistently without estimating k. A function y(x, θ) that does not depend on k is called an (unbiased) estimating function when it satisfies, for all k,

E_{θ,k}[y(x, θ)] = 0,    (3.1)
where E_{θ,k} denotes the expectation with respect to p(x; θ, k). When a nontrivial estimating function exists, we obtain an estimator θ̂ of θ by solving

Σ_{l=1}^{N} y(x^(l), θ̂) = 0,    (3.2)
where {x^(1), ..., x^(N)} are N independent observations. As this sample average approximates the expectation with respect to the true probability when the number of observations is large enough, θ can be estimated asymptotically. In maximum likelihood estimation, the interest score function

u(x, θ, k) = ∂ log p(x; θ, k) / ∂θ    (3.3)

played the role of an estimating function in equation 2.5, provided k is known. Note that an interest score function is defined as the derivative of the log likelihood with respect to the parameter of interest, which we want to estimate. When k is unknown, the interest score function is not an estimating function in its original form. By differentiating equation 3.1 with respect to k, we get

E_{θ,k}[v y(x, θ)] = 0,    (3.4)

where

v(x, θ, k) = ∂ log p(x; θ, k) / ∂k    (3.5)

is the functional derivative (Fréchet derivative). We call v a nuisance score function because it is the derivative of the log likelihood with respect to a "nuisance" parameter, which we do not need to estimate. We define an inner product as

a · b = E_{θ,k}[a b].    (3.6)
Then equation 3.4 means that y(x, θ) must be orthogonal to all v's. Note that v is infinite-dimensional when k is a function. Let us first consider an easier example where k is a scalar. In this case, we can construct y(x, θ), which is orthogonal to v, by projection in the sense of this inner product, as shown in Figure 2. The projection is given as

u_⊥ = u − (u · v / v · v) v.    (3.7)
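For the gamma model of section 2, with θ = κ and a scalar nuisance ξ, the projection of equation 3.7 can be carried out by Monte Carlo, replacing each inner product E_{θ,k}[ab] with a sample average (our own illustrative sketch, using the score functions of equations 2.7 and 2.8):

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(2)
xi, kappa, n = 1.5, 3.0, 1_000_000

T = rng.gamma(kappa, 1.0 / (xi * kappa), size=n)

u = 1.0 - xi * T + np.log(T) + np.log(kappa * xi) - digamma(kappa)  # eq 2.7
v = kappa / xi - kappa * T                                          # eq 2.8

# inner product a.b = E[a b], estimated by the sample mean
uv = np.mean(u * v)
vv = np.mean(v * v)
u_perp = u - (uv / vv) * v                                          # eq 3.7

print(np.mean(u_perp))       # ~ 0  (equation 3.8)
print(np.mean(u_perp * v))   # ~ 0  (equation 3.9, exact for the in-sample product)
```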
In fact, we have

E_{θ,k}[u_⊥] = E_{θ,k}[u] − (u · v / v · v) E_{θ,k}[v] = 0 and    (3.8)

u_⊥ · v = u · v − (u · v / v · v) v · v = 0,    (3.9)

where the expectations of the score functions are 0, as

E_{θ,k}[u] = ∂/∂θ ∫ p(x; θ, k) dx = 0.    (3.10)
When k is a function, v is infinite-dimensional, and the projection is very difficult to obtain in closed form, except in the case of a mixture model of the exponential form,

p(x; θ, k) = ∫ q(x; ξ, θ) k(ξ) dξ,    (3.11)

where

q(x; ξ, θ) = exp{ξ · s(x, θ) + r(x, θ) − ψ(θ, ξ)}.    (3.12)

In this mixture model, {ξ^(1), ξ^(2), ...} is an unknown sequence in which each ξ is independently and identically distributed (i.i.d.) according to a probability density function k(ξ). The lth observation x^(l) is then distributed according to q(x^(l); ξ^(l), θ). In effect, x is i.i.d. according to p(x; θ, k). The nuisance score function for this model is given as follows. A small deviation of k(ξ) in the direction of a(ξ) can be represented by a curve k(ξ, t) starting from k(ξ),

k(ξ, t) = k(ξ) + t a(ξ),    (3.13)
where t (0 ≤ t < ε) is the parameter of the curve. The nuisance score function in the direction of a(ξ) is

v = (d/dt) log p(x; θ, k)|_{t=0} = ∫ a(ξ) exp(ξ · s − ψ(θ, ξ)) dξ / ∫ k(ξ) exp(ξ · s − ψ(θ, ξ)) dξ.    (3.14)

Note that the nuisance score functions depend on x only through s(x, θ). Thus, the vector space spanned by the nuisance scores is generated by the random variable s(x, θ). In this case, we have an estimating function as

u_I = u − E_{θ,k}[u|s],    (3.15)
where E_{θ,k}[u|s] is the conditional expectation of u conditioned on s. In fact, u_I is orthogonal to any function of s(x, θ) because

E_{θ,k}[u_I f(s)] = ∫ E_{θ,k}[u_I f(s)|s] p(s) ds = ∫ E_{θ,k}[u_I|s] f(s) p(s) ds = 0.    (3.16)

It has been shown that the projected score function gives an optimal estimator when we estimate only one of many parameters. Thus, u_I gives an efficient estimating function for θ if it does not depend on k(ξ) (Amari & Kawanabe, 1997; Bickel et al., 1993). There is a special case in which ∂_θ s is a function of s. In this case, the estimating function simplifies to

u_I = {∂_θ s − E[∂_θ s|s]} · E_ξ[ξ|s] + ∂_θ r − E[∂_θ r|s] = ∂_θ r − E[∂_θ r|s],    (3.17)

where

E_ξ[ξ|s] = ∫ ξ k(ξ) exp(ξ · s − ψ) dξ / ∫ k(ξ) exp(ξ · s − ψ) dξ.    (3.18)
4 Estimation by Estimating Functions

4.1 Simple Case. We consider the following statistical model of interspike intervals, proposed by Ikeda (2005). Interspike intervals are distributed according to a gamma distribution whose mean firing rate changes over time; the mean firing rate ξ at each time is determined randomly according to an unknown probability density k(ξ). To demonstrate that the model is of the exponential form defined by Amari and Kawanabe (1997), we define s, r, and ψ as

s(T, κ) = −κT,    (4.1)

r(T, κ) = (κ − 1) log(T), and    (4.2)

ψ(κ, ξ) = −κ log(ξκ) + log Γ(κ).    (4.3)

Here, T denotes an interspike interval. The model is described by

p(T; κ, k(ξ)) = ∫ q(T; ξ, κ) k(ξ) dξ,    (4.4)
where

q(T; ξ, κ) = (ξκ)^κ / Γ(κ) · T^{κ−1} e^{−ξκT} = e^{ξ s(T,κ) + r(T,κ) − ψ(κ,ξ)}.    (4.5)
Note that this type of model is called a semiparametric model because it has both an unknown finite-dimensional parameter κ, a scalar in this case, and an unknown function k(ξ). To estimate κ without estimating k(ξ), let us calculate the estimating function according to the method of the previous section. It was shown there that for mixture distributions of the exponential form, the estimating function u_I is given by the projection of the score function u = ∂_κ log p as

u_I(T, κ) = u − E[u|s] = (∂_κ s − E[∂_κ s|s]) · E[ξ|s] + ∂_κ r − E[∂_κ r|s] = ∂_κ r − E[∂_κ r|s],    (4.6)

where the relation

E[∂_κ s|s] = s/κ = −T = ∂_κ s    (4.7)

holds because the random variables T and s determine each other (s = −κT). For the same reason,

E[∂_κ r|s] = log(T) = ∂_κ r.    (4.8)

Then,

u_I = 0.    (4.9)
This means that the set of estimating functions is empty. The restrictions imposed on u_I are necessary conditions that any estimating function must satisfy by definition (Amari & Kawanabe, 1997). Therefore, we have proved that no estimating function of κ exists for the model; two or more random variables may be needed. We may generalize the model so that the ξ_i's are not independently generated but are related. Let us consider the multivariate model described by

p(T_1, ..., T_n; κ, k(ξ_1, ..., ξ_n)) = ∫ ∏_{i=1}^{n} q(T_i; ξ_i, κ) k(ξ_1, ..., ξ_n) dξ.    (4.10)
In this case, s_i, r, and ψ are defined as

s_i(T, κ) = −κT_i,    (4.11)

r(T, κ) = (κ − 1) Σ_{i=1}^{n} log(T_i), and    (4.12)

ψ(κ, ξ) = −κ Σ_{i=1}^{n} log(ξ_i κ) + n log Γ(κ).    (4.13)
The numbers of random variables T_i and s_i are again the same, so each s_i determines T_i, u_I again vanishes, and the set of estimating functions is empty. This result implies that two or more observations are needed for each ξ.

4.2 Cases with Multiple Observations for Each ξ. Let us consider the case where we have multiple observations for each ξ. Here, a consistent estimator of κ exists. Let {T} = {T_1, ..., T_m} be the m observations that have the same firing rate ξ. The probability model can be written as

p({T}; κ, k(ξ)) = ∫ ∏_{i=1}^{m} q(T_i; ξ, κ) k(ξ) dξ,    (4.14)
where

∏_{i=1}^{m} q(T_i; ξ, κ) = ∏_{i=1}^{m} (ξκ)^κ / Γ(κ) · T_i^{κ−1} e^{−ξκT_i} = e^{ξ·s({T},κ) + r({T},κ) − ψ(κ,ξ)}.    (4.15)
We define s, r, and ψ as

s({T}, κ) = −κ Σ_{i=1}^{m} T_i,    (4.16)

r({T}, κ) = (κ − 1) Σ_{i=1}^{m} log(T_i), and    (4.17)

ψ(κ, ξ) = −mκ log(ξκ) + m log Γ(κ).    (4.18)

Then the estimating function is given by

u_I({T}, κ) = u − E[u|s] = (∂_κ s − E[∂_κ s|s]) · E[ξ|s] + ∂_κ r − E[∂_κ r|s]
= ∂_κ r − E[∂_κ r|s]
= Σ_{i=1}^{m} log(T_i) − Σ_{i=1}^{m} E[log(T_i)|s]
= Σ_{i=1}^{m} log(T_i) − m E[log(T_1)|s],    (4.19)
where we used

E[∂_κ s|s] = s/κ = ∂_κ s.    (4.20)

The last equality in equation 4.19 holds because of the permutation symmetry among the T_i's. The conditional expectation of log T_1 is given as (see the appendix)

E[log(T_1)|s] = log(−s/κ) − φ(mκ) + φ(κ),    (4.21)

where the digamma function is defined as

φ(κ) = Γ′(κ) / Γ(κ).    (4.22)
Note that E[log(T_1)|s] does not depend on the unknown function k(ξ). Thus, we have

u_I({T}, κ) = Σ_{i=1}^{m} log(T_i) − m log(Σ_{i=1}^{m} T_i) + m φ(mκ) − m φ(κ).    (4.23)
The form of u_I can be understood as follows. If we scale T as t = ξT, we have E[t] = 1. Then we can show that u_I does not depend on ξ because

Σ_{i=1}^{m} log(T_i) − m log(Σ_{i=1}^{m} T_i) = Σ_{i=1}^{m} log(t_i) − m log(Σ_{i=1}^{m} t_i).    (4.24)
This implies that we can estimate κ without estimating ξ. κ can be estimated consistently from N independent sets of observations {T^(l)} = {T_1^(l), ..., T_m^(l)}, l = 1, ..., N, as the value κ̂ that solves

Σ_{l=1}^{N} u_I({T^(l)}, κ̂) = 0.    (4.25)
In fact, the expectation of u_I is 0 independent of k(ξ):

E[u_I] = ∫ E[u − E[u|s] | s] p(s) ds = 0.    (4.26)
u_I yields an efficient estimating function. An efficient estimator is one whose variance attains the Cramér-Rao lower bound asymptotically. Thus, there is no estimator of κ whose mean-square estimation error is smaller than that given by u_I. As u_I does not depend on k(ξ), it is the optimal estimating function whatever k(ξ) is, or whatever the sequence ξ^(1), ..., ξ^(N) is. Maximum likelihood estimation for this problem gives an estimating function

u_MLE = Σ_{i=1}^{m} log(T_i) + m log(ξ̂) + m log κ − m φ(κ),    (4.27)

where

1/ξ̂ = (1/m) Σ_{i=1}^{m} T_i.    (4.28)

u_MLE is similar to u_I but differs by the data-independent term

u_MLE − u_I = m log(mκ) − m φ(mκ).    (4.29)
As a result, the maximum likelihood estimator κ̂ is biased. Figure 3 shows the biases for maximum likelihood estimation and the proposed estimation. The maximum likelihood estimation is biased even when an infinite number of observations is given, while the estimating function is asymptotically unbiased. In the numerical calculation, we used the model with κ = 4 and m = 2. An interspike interval with firing rate ξ can be generated as follows. First, a normalized interspike interval t is generated according to the standard gamma distribution with ξ = 1. Second, an interspike interval T is obtained by dividing t by ξ: T = t/ξ. Note that log T = log t − log ξ. Then, as shown in equation 4.24, u_I does not depend on ξ or k(ξ) at all. Therefore, we fixed ξ to 1 without loss of generality. The figure was obtained as follows. We generated T's according to the standard gamma distribution and substituted them into the estimating equations (equations 4.25 and 4.27) to estimate κ. We repeated the estimation many times (n = 10^4) for each number of observations and calculated the mean (bias) and the quartiles of κ̂. Note that the result does not depend on k(ξ).
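The experiment behind Figure 3 can be reproduced in a few lines: generate pairs (m = 2) of standard-gamma intervals, then solve equation 4.25 (proposed method) and the corresponding root-finding problem for u_MLE (equations 4.27-4.29) in κ. A sketch, assuming SciPy:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def estimate_kappa(T, use_mle=False):
    """T: (N, m) array of interspike intervals, each row sharing one rate.
    Solves sum_l u({T^(l)}, kappa) = 0 for kappa (equations 4.23/4.25;
    the MLE variant adds the offset of equation 4.29)."""
    N, m = T.shape
    stat = np.sum(np.log(T)) - m * np.sum(np.log(T.sum(axis=1)))  # data part of u_I
    def f(kappa):
        if use_mle:   # u_MLE = u_I + m*log(m*kappa) - m*digamma(m*kappa)
            return stat + N * m * (np.log(m * kappa) - digamma(kappa))
        return stat + N * m * (digamma(m * kappa) - digamma(kappa))
    return brentq(f, 1e-3, 1e3)

rng = np.random.default_rng(3)
kappa_true, m, N = 4.0, 2, 100_000
T = rng.gamma(kappa_true, 1.0 / kappa_true, size=(N, m))  # xi = 1 w.l.o.g. (eq 4.24)

print(estimate_kappa(T))                # ~ 4.0: consistent
print(estimate_kappa(T, use_mle=True))  # systematically too large: biased
```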
Figure 3: Biases of κ̂ for maximum likelihood estimation and the proposed method for m = 2. The dotted line represents the true value, κ = 4. The error bars represent the quartiles over repeated trials. The estimation was repeated 10^4 times for each number of observations. The maximum likelihood estimation is biased even when an infinite number of observations is given, while the estimating function is asymptotically unbiased.
We examined whether our estimator works when κ is close to zero. Figure 4 plots the biases for κ = 0.5; the details of Figure 4 are the same as those of Figure 3. Figure 4 demonstrates that the proposed estimator also works for κ = 0.5.

Figure 4: Biases of κ̂ for maximum likelihood estimation and the proposed method for m = 2. The dotted line represents the true value, κ = 0.5. The error bars represent the quartiles over repeated trials. The estimation was repeated 10^4 times for each number of observations. The maximum likelihood estimation is biased even when an infinite number of observations is given, while the estimating function is asymptotically unbiased. The proposed estimator works even if κ is close to zero.

5 Cases Where the Firing Rate Is Continuously Modulated

So far, we have considered only cases in which two or more consecutive firing rates are the same. In this section, we remove this assumption and consider more general cases where consecutive firing rates are not necessarily the same but the firing rate changes slowly and continuously. Although the assumption of the statistical model is violated, we try to estimate κ by the proposed method heuristically. We compare the estimating functions with various m and the measure of spiking irregularity, which we introduce based on the estimating function.

5.1 Measure of Spiking Irregularity. In this section, we introduce a practical measure of spiking irregularity for experimental data based on the estimating function. The new measure may be useful in the case where the firing rate changes slowly and continuously. For experimental data, the assumption that m consecutive interspike intervals have the same or similar ξ is most plausible for m = 2; therefore, we set m = 2 in the estimating function. Let {T_1, T_2, ..., T_N} form a single spike train, where T_i denotes the ith interspike interval, and let N be odd. There are two types of possible
estimating equations, depending on the choice of the starting point:

(1)  (2/(N−1)) Σ_{i=1}^{(N−1)/2} (1/2) log[4T_{2i−1}T_{2i} / (T_{2i−1} + T_{2i})²] − log 2 + φ(2κ̂) − φ(κ̂) = 0,    (5.1)

where ξ_1 = ξ_2, ξ_3 = ξ_4, ..., ξ_{N−2} = ξ_{N−1} are assumed, and

(2)  (2/(N−1)) Σ_{i=1}^{(N−1)/2} (1/2) log[4T_{2i}T_{2i+1} / (T_{2i} + T_{2i+1})²] − log 2 + φ(2κ̂) − φ(κ̂) = 0,    (5.2)

where ξ_2 = ξ_3, ξ_4 = ξ_5, ..., ξ_{N−1} = ξ_N are assumed. We take the average of these equations as

(1/(N−1)) Σ_{i=1}^{N−1} (1/2) log[4T_iT_{i+1} / (T_i + T_{i+1})²] − log 2 + φ(2κ̂) − φ(κ̂) = 0.    (5.3)
We estimate κ by solving this equation for κ̂ numerically. To avoid troublesome numerical iteration, we suggest using part of the estimating equation,

S_I ≡ −(1/(N−1)) Σ_{i=1}^{N−1} (1/2) log[4T_iT_{i+1} / (T_i + T_{i+1})²],    (5.4)

as a measure of spiking irregularity. φ(2κ) − φ(κ) is monotonic in κ, so it is clear that we can easily solve equation 5.3 for κ̂. The correspondence between κ̂ and S_I is shown in Figure 5. Note that we assumed ξ_1 = ξ_2, ξ_2 = ξ_3, and so on; therefore, unless all the ξ_i's are the same, κ̂ is biased. However, when the firing rate changes slowly enough, κ̂ is approximately correct, as we show in the example below. In addition, S_I is similar to the measure of local variation L_V, which is known to be useful for cell classification (Shinomoto, Miyazaki, et al., 2005; Shinomoto et al., 2003); S_I may therefore also be useful for cell classification. The measure of local variation (Shinomoto et al., 2003),

L_V = (1/(N−1)) Σ_{i=1}^{N−1} 3(T_i − T_{i+1})² / (T_i + T_{i+1})² = 3 − (12/(N−1)) Σ_{i=1}^{N−1} T_iT_{i+1} / (T_i + T_{i+1})²,    (5.5)

looks similar to the measure of spiking irregularity S_I. In fact, there is a tight inequality between these measures. By using Jensen's inequality, we
Figure 5: Correspondence between κ̂ and S_I. The lower bound of S_I is 0. For κ = 1, S_I is 1 − log 2 = 0.307.
obtain

S_I = −(1/(N−1)) Σ_{i=1}^{N−1} (1/2) log[4T_iT_{i+1} / (T_i + T_{i+1})²]
    ≥ −(1/2) log[(1/(N−1)) Σ_{i=1}^{N−1} 4T_iT_{i+1} / (T_i + T_{i+1})²]
    = −(1/2) log(1 − L_V/3).    (5.6)
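Both measures and the bound of equation 5.6 are straightforward to compute from a spike train (a sketch on a synthetic Poisson train, κ = 1; the function names are ours):

```python
import numpy as np

def spiking_irregularity(T):
    """S_I of equation 5.4 for a 1-D array of interspike intervals."""
    a, b = T[:-1], T[1:]
    return -np.mean(0.5 * np.log(4.0 * a * b / (a + b) ** 2))

def local_variation(T):
    """L_V of equation 5.5 (Shinomoto et al., 2003)."""
    a, b = T[:-1], T[1:]
    return np.mean(3.0 * (a - b) ** 2 / (a + b) ** 2)

rng = np.random.default_rng(4)
T = rng.gamma(1.0, 1.0, size=100_001)          # Poisson train (kappa = 1)

si, lv = spiking_irregularity(T), local_variation(T)
print(si)                                      # ~ 1 - log 2 = 0.307 for kappa = 1
print(si >= -0.5 * np.log(1.0 - lv / 3.0))     # Jensen bound, equation 5.6: True
```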
Thus, L_V gives a lower bound of S_I. Since S_I and κ̂ are inversely related, as shown in Figure 5, L_V gives an upper bound on κ̂. This relation may be useful when only the value of L_V is available in the existing literature. Note that this relation holds for any spike train, independent of its statistical model and for any N.

5.2 AR Model. As an example of a rate-modulated case, let us consider an AR model in which the ξ_i's are given as
log ξ_{i+1} = e^{−1/τ} log ξ_i + √(1 − e^{−2/τ}) ε σ_{i+1},    (5.7)
Figure 6: Sample processes of ξ_i for the AR model with ε = 0.3, for τ = 2, 8, 32. The larger the time constant τ, the longer the correlations.
where the σ_i's are independently and identically distributed according to the standard normal distribution with mean 0 and variance 1, and ε sets the amplitude of the rate fluctuations. Here, we assumed that the log ξ_i's obey a Gaussian process so that the ξ_i's are nonnegative. τ represents the time correlation length of log ξ:

⟨log ξ_{i+j} log ξ_i⟩ = ε² e^{−j/τ}.    (5.8)
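The AR process of equations 5.7 and 5.8 can be simulated directly; starting the recursion from the stationary distribution log ξ_1 ~ N(0, ε²) keeps the marginal variance equal to ε² for every i (a sketch; we write eps for the amplitude parameter ε assumed above):

```python
import numpy as np

def ar_log_rates(n, tau, eps, rng):
    """Generate xi_1..xi_n with log xi following the AR(1) process of
    equation 5.7: stationary, Var[log xi] = eps**2, correlation length tau."""
    a = np.exp(-1.0 / tau)
    log_xi = np.empty(n)
    log_xi[0] = eps * rng.standard_normal()          # stationary start
    for i in range(1, n):
        log_xi[i] = (a * log_xi[i - 1]
                     + np.sqrt(1.0 - a * a) * eps * rng.standard_normal())
    return np.exp(log_xi)

rng = np.random.default_rng(5)
xi = ar_log_rates(200_000, tau=8.0, eps=0.3, rng=rng)
lx = np.log(xi)
print(np.var(lx))                   # ~ eps**2 = 0.09
print(np.mean(lx[4:] * lx[:-4]))    # ~ eps**2 * exp(-4/tau)   (equation 5.8)
```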
The coefficient √(1 − e^{−2/τ}) multiplying the noise ensures that the variance of log ξ_i is independent of τ. Thus, we have two free parameters, τ and ε. Figure 6 illustrates sample paths for ε = 0.3; the larger the time constant τ, the longer the correlation of ξ. Next, we generated interspike intervals by using these ξ_i's and estimated κ from them. Figure 7 plots the results of the numerical calculation. The dotted line represents the true value, κ = 4. S_I denotes the estimate given by S_I using equation 5.3; note that what is plotted is not S_I itself but the estimate of κ obtained by putting S_I through the function shown in Figure 5. m = 2 denotes the estimate given by the estimating function with m = 2 using equation 5.1, that is, estimating κ without averaging over the two types of pairing (ξ_1 = ξ_2, ξ_3 = ξ_4, ... and ξ_2 = ξ_3, ξ_4 = ξ_5, ...) as S_I does.

Figure 7: Asymptotic biases for the AR model with ε = 0.3. The dotted line represents the true value, κ = 4. m = 2 denotes the estimation obtained using the estimating function with m = 2 (equation 5.1); S_I denotes the estimation given by S_I (equation 5.3); m = 2 and S_I give the same result. L_V denotes the estimation given by L_V (equation 5.9), and "moment estimator" denotes the estimation by the moment estimator. The number of observations is N = 10^7. The biases decrease with increasing τ; m = 2 (and S_I) always gives the smallest bias.

The figure shows that biases exist even if the number of observations is infinite, although they decrease with increasing τ. The bias increases with increasing m. It is not the case that there is an optimal m depending on τ: the estimating function with m = 2 always gives the smallest bias. The bias given by the estimating function with m = 2 and that given by S_I are the same because the AR model does not distinguish ξ_i's with even and odd i. Note that the biases in Figure 7 are much smaller than those in Figure 3 for the maximum likelihood estimation. Although we fixed ε = 0.3 in Figure 7, the bias increases as ε increases. We also estimated κ by the moment estimator, κ̂ = Mean(T)²/Variance(T). If the firing rate is constant over time, the mean and variance are given
by equations 2.2, and κ can be estimated correctly. Figure 7 shows that the moment estimator is always the worst, independent of τ. In fact, because the moment estimator assumes that the firing rate is constant over time, it does not properly capture the spiking irregularity κ when the firing rate changes over time. κ can also be estimated by using L_V:

κ̂ = 3/(2L_V) − 1/2.    (5.9)

Note that the expectation value of L_V for the gamma distribution is ⟨L_V⟩ = 3/(2κ + 1) (Shinomoto et al., 2003). Figure 7 shows that the estimate of κ obtained by putting L_V through equation 5.9 is slightly more biased than that given by S_I. We also compared the variances directly in the limit of large τ, where there is no bias: the variance of the estimate given by L_V is larger than that given by S_I, although the difference is only about 3%. Figure 8 shows how the bias and variability of the estimator for the AR model depend on the number of observations; the true value is κ = 4. The estimation by S_I gives a smaller bias than the estimating function with m = 2 for finite observations. This can be intuitively understood as follows. The number of terms summed in S_I is about twice that in the estimating function with m = 2, and because the bias decreases with the number of observations, the bias given by S_I is smaller. The bias for the estimating function with m = 4 is the smallest in a certain range of the number of observations. In fact, the bias for m = 4 becomes 0 for a certain number of observations, because the asymptotic bias is always negative while the bias due to finite observations is always positive, as shown in Figure 3. However, tuning the number of observations in this way is not a good strategy, because the variance of the estimator decreases as the number of observations increases, as we show next. So far, we have considered only the biases of the estimators. However, it is also important to know the trial-to-trial variability of the estimators: the smaller the variability, the better the estimate. The error bars in Figure 8 show the quartiles of the estimators of κ over repeated trials. The estimation was repeated 10^4 times for each number of observations. While the variability decreases with increasing m, the bias increases with increasing m. Thus, there is a trade-off between bias and variability.
The variability of the estimator given by S_I is smaller than that given by the estimating function with m = 2. Thus, S_I gives a relatively good estimator of κ for the AR model; in particular, as far as the bias is concerned, S_I looks optimal among the estimators considered here. We believe that S_I generally works well in many models in which the firing rate changes slowly, including Markov models, because such models do not distinguish even and odd i for T_i.
Figure 8: Bias and variability for the AR model with finite observations, ε = 0.3 and τ = 8. The dotted line represents the true value, κ = 4. m = 2 denotes the estimation obtained using the estimating function with m = 2 (equation 5.1); S_I denotes the estimation by S_I (equation 5.3). The error bars represent the quartiles over repeated trials. The estimation was repeated 10^4 times for each number of observations. The bias for S_I is smaller than that for m = 2 for finite observations.
6 Summary and Discussion

We estimated the shape parameter κ of the semiparametric model suggested by Ikeda (2005) without estimating the firing rate ξ. The maximum likelihood estimator is not consistent for this problem because the number of nuisance parameters ξ^(l) increases with the number of observed interspike intervals T. We showed that the model is of the exponential form defined by Amari and Kawanabe (1997) and can be analyzed by the method of estimating functions for semiparametric models. We found that an estimating function does not exist unless multiple observations are given for each firing rate ξ. If multiple observations are given, the method of estimating functions can be applied; in that case, the estimating function of κ can be obtained analytically, and κ can be estimated consistently, independent of the functional form of the firing-rate density k(ξ). In general, the estimating function is not efficient.
However, this method provided an optimal estimator in the sense of Fisher information for our problem; that is, we obtained an efficient estimator. We also suggested a measure of spiking irregularity based on the estimating function, which may be useful for characterizing individual neurons when only a single observation is given for each firing rate.

Various measures of spiking randomness have been used in previous studies. The coefficient of variation CV is the global standard deviation of the interspike intervals normalized by their mean (Holt et al., 1996):

$$CV = \frac{\sqrt{\mathrm{Var}[T]}}{\overline{T}}. \tag{6.1}$$
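As a toy numerical illustration (the ISI values below are invented for illustration, not taken from the letter): CV computed over interspike intervals whose underlying rate changes midway is inflated even though the local irregularity is unchanged, which is exactly the confound discussed next.

```python
import statistics

def cv(isis):
    # Coefficient of variation (equation 6.1): SD of the ISIs over their mean.
    return statistics.pstdev(isis) / statistics.mean(isis)

# Nearly regular spiking at a constant rate (ISIs in ms):
stationary = [10.0, 11.0, 9.0, 10.0, 10.5, 9.5]
# The same local jitter, but the firing rate halves in the second half:
rate_change = [10.0, 11.0, 9.0, 20.0, 21.0, 19.0]

print(cv(stationary), cv(rate_change))  # the second value is much larger
```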
Although CV increases with increasing irregularity, it also becomes large when the firing rate changes over time. Thus, we cannot distinguish these two cases by CV alone. The local variation of interspike intervals, LV, is locally normalized and relatively independent of firing-rate changes (Shinomoto, Miyazaki, et al., 2005; Shinomoto et al., 2003). However, it is an ad hoc measure and has no corresponding parameter in statistical models. The measure of spiking irregularity SI that we introduced in this article is an estimator of κ and is relatively independent of firing-rate changes. For these reasons, we suggest using SI as a measure of spiking irregularity.

In this article, we focused on the semiparametric model, in which the firing rate may vary from one interspike interval to the next. However, a better fit to experimental data could be obtained with other models, depending on the situation. For instance, the firing rate can be assumed to be a function of continuous time (Baker & Lemon, 2000; Brown, Barbieri, Ventura, Kass, & Frank, 2002), or early stages of sensory cortex can be explained by more deterministic models such as a noisy leaky integrate-and-fire model (Reich, Victor, & Knight, 1998). The selection of an appropriate model (Brown et al., 2002; Reich et al., 1998) is very important and will depend on the recorded area and its state. It is also important to know to what extent the proposed estimator is robust when the ISI distribution is close to, but not precisely, a gamma distribution. We believe that this robustness can be evaluated numerically or analytically; however, it is beyond the scope of this letter, and we leave it for future work.

Appendix: Calculation of E[log(T_1)|s] in Equation 4.21

To calculate E[log(T_1)|s] in equation 4.21, let us use Bayes' theorem:

$$p(T \mid s) = \frac{p(T, s)}{p(s)} = \frac{p(T, s)}{\int p(T, s)\,dT}. \tag{A.1}$$
K. Miura, M. Okada, and S.-I. Amari
By repeated use of the beta integral,

$$\int_0^1 x^{a-1}(1-x)^{b-1}\,dx = B(a,b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)} = \frac{(a-1)!\,(b-1)!}{(a+b-1)!}, \tag{A.2}$$
we obtain

$$
\begin{aligned}
p(s) &= \int \delta\Bigl(s + \kappa \sum_{i=1}^{m} T_i\Bigr)\, \prod_{i=1}^{m} \bigl(q(T_i;\xi,\kappa)\,dT_i\bigr)\, k(\xi)\,d\xi \\
&= \int \prod_{i=1}^{m-1} \bigl(q(T_i;\xi,\kappa)\,dT_i\bigr)\; q\Bigl(-\frac{s}{\kappa} - \sum_{i=1}^{m-1} T_i;\,\xi,\kappa\Bigr)\, \frac{k(\xi)}{\kappa}\,d\xi \\
&= \int \Bigl(-\frac{s}{\kappa} - \sum_{i=1}^{m-1} T_i\Bigr)^{\kappa-1} \prod_{i=1}^{m-1} \bigl(T_i^{\kappa-1}\,dT_i\bigr)\; \frac{e^{s\xi}\,(\xi\kappa)^{m\kappa}\, k(\xi)}{\Gamma(\kappa)^m\,\kappa}\,d\xi \\
&= \int B(\kappa,\kappa)\,\Bigl(-\frac{s}{\kappa} - \sum_{i=1}^{m-2} T_i\Bigr)^{2\kappa-1} \prod_{i=1}^{m-2} \bigl(T_i^{\kappa-1}\,dT_i\bigr)\; \frac{e^{s\xi}\,(\xi\kappa)^{m\kappa}\, k(\xi)}{\Gamma(\kappa)^m\,\kappa}\,d\xi \\
&= \cdots \\
&= \int \prod_{i=1}^{m-2} B(i\kappa,\kappa)\; \Bigl(-\frac{s}{\kappa} - T_1\Bigr)^{(m-1)\kappa-1} T_1^{\kappa-1}\,dT_1\; \frac{e^{s\xi}\,(\xi\kappa)^{m\kappa}\, k(\xi)}{\Gamma(\kappa)^m\,\kappa}\,d\xi \\
&= \prod_{i=1}^{m-1} B(i\kappa,\kappa)\; \frac{(-s)^{m\kappa-1}}{\Gamma(\kappa)^m} \int \xi^{m\kappa}\, e^{s\xi}\, k(\xi)\,d\xi, \tag{A.3}
\end{aligned}
$$

where the second line integrates out $T_m$ with the delta function and the third line substitutes the gamma density $q(T;\xi,\kappa) = (\xi\kappa)^{\kappa}\, T^{\kappa-1}\, e^{-\xi\kappa T}/\Gamma(\kappa)$.
Similarly, by Taylor expansion of $\log(T_1)$ about $T_1 = -\frac{s}{\kappa}$, we get

$$
\begin{aligned}
E[\log(T_1)\mid s] &= \frac{1}{p(s)} \int \log(T_1)\,\delta\Bigl(s + \kappa \sum_{i=1}^{m} T_i\Bigr)\, \prod_{i=1}^{m} \bigl(q(T_i;\xi,\kappa)\,dT_i\bigr)\, k(\xi)\,d\xi \\
&= \int \log(T_1)\,\Bigl(-\frac{s}{\kappa} - T_1\Bigr)^{(m-1)\kappa-1} T_1^{\kappa-1}\,dT_1\; \prod_{i=1}^{m-2} B(i\kappa,\kappa)\; \frac{e^{s\xi}\,(\xi\kappa)^{m\kappa}\, k(\xi)}{\Gamma(\kappa)^m\,\kappa}\, \frac{d\xi}{p(s)} \\
&= \Bigl[\log\Bigl(-\frac{s}{\kappa}\Bigr)\, B((m-1)\kappa,\kappa) - \sum_{j=1}^{\infty} \frac{1}{j}\, B((m-1)\kappa+j,\kappa)\Bigr] \prod_{i=1}^{m-2} B(i\kappa,\kappa) \\
&\qquad \times \Bigl(-\frac{s}{\kappa}\Bigr)^{m\kappa-1} \int \frac{e^{s\xi}\,(\xi\kappa)^{m\kappa}\, k(\xi)}{\Gamma(\kappa)^m\,\kappa}\, \frac{d\xi}{p(s)} \\
&= \log\Bigl(-\frac{s}{\kappa}\Bigr) - \sum_{j=1}^{\infty} \frac{1}{j}\, \frac{B((m-1)\kappa+j,\kappa)}{B((m-1)\kappa,\kappa)}. \tag{A.4}
\end{aligned}
$$
Next we show that

$$\sum_{j=1}^{\infty} \frac{1}{j}\, \frac{B((m-1)\kappa+j,\kappa)}{B((m-1)\kappa,\kappa)} = \phi(m\kappa) - \phi(\kappa), \tag{A.5}$$
where the digamma function is defined as

$$\phi(\kappa) = \frac{\Gamma'(\kappa)}{\Gamma(\kappa)}. \tag{A.6}$$
Let κ be an integer. We define $I_l$ as

$$I_l = \sum_{j=1}^{\infty} \frac{1}{j}\, \frac{(m\kappa-1)(m\kappa-2)\cdots((m-1)\kappa+l)}{(m\kappa+j-1)(m\kappa+j-2)\cdots((m-1)\kappa+j+l)}. \tag{A.7}$$
Then the infinite series can be rewritten as

$$\sum_{j=1}^{\infty} \frac{1}{j}\, \frac{B((m-1)\kappa+j,\kappa)}{B((m-1)\kappa,\kappa)} = I_0. \tag{A.8}$$
By repeatedly using the recursion

$$I_l = I_{l+1} - \frac{1}{\kappa - l - 1}, \tag{A.9}$$
we get

$$I_0 = I_{\kappa-1} - \sum_{l=0}^{\kappa-2} \frac{1}{\kappa-1-l} = \sum_{j=1}^{\infty} \Bigl(\frac{1}{j} - \frac{1}{m\kappa+j-1}\Bigr) - \sum_{l=1}^{\kappa-1} \frac{1}{l} = \sum_{l=1}^{m\kappa-1} \frac{1}{l} - \sum_{l=1}^{\kappa-1} \frac{1}{l}. \tag{A.10}$$

The last equality follows by telescoping of the infinite sum. By using the formula for harmonic numbers (Havil, 2003),

$$\sum_{l=1}^{n} \frac{1}{l} = \gamma + \phi(n+1), \tag{A.11}$$
where Euler's constant is γ = 0.57721…, we get

$$\sum_{j=1}^{\infty} \frac{1}{j}\, \frac{B((m-1)\kappa+j,\kappa)}{B((m-1)\kappa,\kappa)} = \phi(m\kappa) - \phi(\kappa). \tag{A.12}$$
Although we assumed that κ is an integer during the proof, numerical calculation shows that the result also holds for noninteger κ. Thus, we obtain

$$E[\log(T_1)\mid s] = \log\Bigl(-\frac{s}{\kappa}\Bigr) - \phi(m\kappa) + \phi(\kappa). \tag{A.13}$$
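For integer κ, the identity in equation A.12 can be checked numerically; the following sketch (not part of the original letter) compares a truncated version of the series with the harmonic-number form of φ(mκ) − φ(κ), using log-gamma to evaluate the beta functions stably:

```python
import math

def log_beta(a, b):
    # log B(a, b) via log-gamma, numerically stable for large arguments
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def series(m, kappa, terms=100000):
    # Truncation of sum_{j>=1} (1/j) B((m-1)kappa + j, kappa) / B((m-1)kappa, kappa).
    # The summand decays like j^{-(kappa+1)}, so the truncation error is tiny.
    lb0 = log_beta((m - 1) * kappa, kappa)
    return sum(
        math.exp(log_beta((m - 1) * kappa + j, kappa) - lb0) / j
        for j in range(1, terms + 1)
    )

def digamma_difference(m, kappa):
    # For integer arguments, phi(m*kappa) - phi(kappa) = sum_{l=kappa}^{m*kappa-1} 1/l,
    # which follows from equation A.11.
    return sum(1.0 / l for l in range(kappa, m * kappa))

# Example: m = 2, kappa = 2 gives 1/2 + 1/3 = 5/6 on both sides.
print(series(2, 2), digamma_difference(2, 2))
```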
Note that E[log(T_1)|s] does not depend on the unknown function k(ξ).

Acknowledgments

This work was supported in part by grants from the Japan Society for the Promotion of Science (Nos. 14084212 and 16500093).

References

Amari, S. (1982). Differential geometry of curved exponential families—curvatures and information loss. Ann. Statist., 10, 357–385.
Amari, S. (1985). Differential-geometrical methods in statistics. New York: Springer-Verlag.
Amari, S. (1987). Dual connections on the Hilbert bundles of statistical models. In C. T. J. Dodson (Ed.), Geometrization of statistical theory. Lancaster: University of Lancaster, Department of Mathematics.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Comput., 10, 251–276.
Amari, S., & Kawanabe, M. (1997). Information geometry of estimating functions in semi-parametric statistical models. Bernoulli, 3, 29–54.
Amari, S., & Kumon, M. (1988). Estimation in the presence of infinitely many nuisance parameters—geometry of estimating functions. Ann. Statist., 16, 1044–1068.
Amari, S., Kurata, K., & Nagaoka, H. (1992). Information geometry of Boltzmann machines. IEEE Trans. on Neural Networks, 3, 260–271.
Amari, S., & Nagaoka, H. (2001). Methods of information geometry. Providence, RI: American Mathematical Society.
Baker, S. N., & Lemon, R. N. (2000). Precise spatiotemporal repeating patterns in monkey primary and supplementary motor areas occur at chance levels. J. Neurophysiol., 84, 1770–1780.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y., & Wellner, J. A. (1993). Efficient and adaptive estimation for semiparametric models. Baltimore, MD: Johns Hopkins University Press.
Brown, E. N., Barbieri, R., Ventura, V., Kass, R. E., & Frank, L. M. (2002). The time-rescaling theorem and its application to neural spike train data analysis. Neural Comput., 14, 325–346.
Cox, D. R., & Lewis, P. A. W. (1966). The statistical analysis of series of events. London: Methuen.
Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation. Ann. Math. Statist., 31, 1208–1211.
Godambe, V. P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika, 63, 277–284.
Godambe, V. P. (Ed.). (1991). Estimating functions. New York: Oxford University Press.
Groeneboom, P., & Wellner, J. A. (1992). Information bounds and nonparametric maximum likelihood estimation. Basel: Birkhäuser.
Havil, J. (2003). Gamma: Exploring Euler's constant. Princeton, NJ: Princeton University Press.
Holt, G. R., Softky, W. R., Koch, C., & Douglas, R. J. (1996). Comparison of discharge variability in vitro and in vivo in cat visual cortex neurons. J. Neurophysiol., 75, 1806–1814.
Ikeda, K. (2005). Information geometry of interspike intervals in spiking neurons. Neural Comput., 17, 2719–2735.
Mukhopadhyay, P. (2004). An introduction to estimating functions. Harrow: Alpha Science International.
Murray, M. K., & Rice, J. W. (1993). Differential geometry and statistics. New York: Chapman and Hall.
Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 1–32.
Pfanzagl, J. (1990). Estimation in semiparametric models. Berlin: Springer-Verlag.
Reich, D. S., Victor, J. D., & Knight, B. W. (1998). The power ratio and the interval map: Spiking models and extracellular recordings. J. Neurosci., 18, 10090–10104.
Ritov, Y., & Bickel, P. J. (1990). Achieving information bounds in non- and semiparametric models. Ann. Statist., 18, 925–938.
Sakai, Y., Funahashi, S., & Shinomoto, S. (1999). Temporally correlated inputs to leaky integrate-and-fire models can reproduce spiking statistics of cortical neurons. Neural Netw., 12, 1181–1190.
Shinomoto, S., Miura, K., & Koyama, S. (2005). A measure of local variation of inter-spike intervals. Biosystems, 79, 67–72.
Shinomoto, S., Miyazaki, Y., Tamura, H., & Fujita, I. (2005). Regional and laminar differences in in vivo firing patterns of primate cortical neurons. J. Neurophysiol., 94, 567–576.
Shinomoto, S., Shima, K., & Tanji, J. (2003). Differences in spiking patterns among cortical neurons. Neural Comput., 15, 2823–2842.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology: Vol. 2. Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge: Cambridge University Press.
Received August 19, 2005; accepted April 12, 2006.
LETTER
Communicated by Peter König
Spatiotemporal Structure in Large Neuronal Networks Detected from Cross-Correlation

Gaby Schneider
[email protected]
Department of Computer Science and Mathematics, Johann Wolfgang Goethe University, Frankfurt (Main), Germany
Martha N. Havenith
[email protected]
Department of Neurophysiology, Max-Planck-Institute for Brain Research, Frankfurt (Main), Germany
Danko Nikolić
[email protected]
Department of Neurophysiology, Max-Planck-Institute for Brain Research, Frankfurt (Main), and Frankfurt Institute for Advanced Studies, Johann Wolfgang Goethe University, Frankfurt (Main), Germany
The analysis of neuronal information involves the detection of spatiotemporal relations between neuronal discharges. We propose a method that is based on the positions (phase offsets) of the central peaks obtained from pairwise cross-correlation histograms. Data complexity is reduced to a one-dimensional representation by using redundancies in the measured phase offsets such that each unit is assigned a "preferred firing time" relative to the other units in the group. We propose two procedures to examine the applicability of this method to experimental data sets. In addition, we propose methods that help the investigation of dynamical changes in the preferred firing times of the units. All methods are applied to a sample data set obtained from cat visual cortex.

1 Introduction

Cortical neurons can fire in precise temporal relation to each other, producing repeatable spatiotemporal patterns (Mainen & Sejnowski, 1995; Lestienne, 1996; Singer, 1999; Abeles, Bergman, Margalit, & Vaadia, 2000; Reinagel & Reid, 2002; Ikegaya et al., 2004). A variety of methods has been proposed and used for detecting and investigating such precise firing patterns among large networks of neurons (for a review, see, e.g., Brown, Kass, & Mitra, 2004). Each of these methods is tuned to detect specific types of spatiotemporal relations, operating on different timescales, and

Neural Computation 18, 2387–2413 (2006)
© 2006 Massachusetts Institute of Technology
Figure 1: Definition and data structure of phase offsets. (A) A phase offset is defined as the time delay of the central peak's maximum in a CCH (horizontal axis: delay [ms]; vertical axis: number of coincidences). Inset: Enlarged image of the CCH peak. (B) An illustration of the representational complexity that originates from pairwise analysis. For only nine simultaneously recorded units, 36 phase offsets are already measured.
each uses a different definition of what constitutes a pattern. For example, some methods focus on coincident events (Gerstein, Perkel, & Dayhoff, 1985; Martignon, von Hasseln, Grün, Aertsen, & Palm, 1995; Grün, 1996; Johnson, Gruner, Baggerly, & Seshagiri, 2001; Grün, Diesmann, & Aertsen, 2002; Amari, Nakahara, Wu, & Sakai, 2003; Schneider & Grün, 2003), while others register events that are delayed (Abeles & Gerstein, 1988; Abeles, 1991; Ikegaya et al., 2004).

In this study, we propose a method for detecting a particular type of spatiotemporal relation occurring between neurons with synchronized discharges, as defined by the presence of a center peak in their raw cross-correlation histograms (CCHs; Moore, Perkel, & Segundo, 1966; Perkel, Gerstein, & Moore, 1967). The method starts with the computation of CCHs between all pairs of spike trains and with the extraction of the center peaks' positions by fitting cosine functions to the CCHs (for details, see Schneider & Nikolić, 2006; for an alternative method, see König, 1994). A shift of the center peak indicates that the two neurons tend to fire with a delay (see Figure 1A). If CCHs are associated with oscillatory activity (e.g., Gray, König, Engel, & Singer, 1989; Engel, Kreiter, König, & Singer, 1991), these delays are also called phase offsets. Such delays are often small (2 ms or less) and thus have been considered equivalent to zero delays (Roelfsema, Engel, König, & Singer, 1997). However, we have shown in Schneider and Nikolić (2006) that these small delays can be statistically different from zero and thus that they can represent real temporal offsets between the firing events of pairs
Figure 2: Additive and nonadditive spiking delays. (A) If the same spatiotemporal pattern among three units occurs repetitively (small jitter allowed), phase offsets in the corresponding CCHs (a–c) are additive. (B) An example in which spatiotemporal patterns are restricted to pairs of units and phase offsets are not additive.
of neurons. As a consequence, it appears worthwhile to investigate such delays within large data sets obtained in highly parallel recordings (compare also to König, Engel, Roelfsema, & Singer, 1995; Traub, Whittington, & Jefferys, 1997; Wennekers & Palm, 2000).

As Figure 1B illustrates, analyses of pairwise relations can become cumbersome when the number of neurons (units) in the data set increases. It is thus necessary to decrease the representational complexity of such data sets. The method presented here achieves this in the following manner. Pairwise phase-offset measurements ($\binom{n}{2}$ pairs for n neurons) are collapsed into a one-dimensional representation that indicates the preferred time at which each neuron fires an action potential relative to the firing times of all other neurons. As a result, a complex data set is represented by a simple one-dimensional temporal map, also called a linear configuration.

This procedure is appropriate only when certain conditions are satisfied. The main precondition for a reduction of representational complexity is that the phase offsets obtained from different pairs of units have the property of additivity. That is, the offset measured between units A and C should correspond to the sum of the offsets between units A and B and units B and C (illustrated in Figure 2A). If this condition is satisfied across the entire data set, a simple summation method can be applied to estimate the relative temporal positions of all units (see section 2). One should, however, be cautious when using such an analysis because, as illustrated in Figure 2B, additivity of phase offsets is not given by default. Therefore, the suitability of the additivity assumption should be investigated for each data set; the issues related to such tests are discussed in sections 2 and 3.
Finally, phase offsets might change due to functionally relevant variables such as changes in stimulation conditions (e.g., König et al., 1995; Schneider & Nikolić, 2006) or due to a shift in the focus of attention or a behavioral event. In section 4
Figure 3: Linear configuration of three units on the time axis. If phase offsets (p.o.) are additive, the relative times at which units prefer to fire can be represented by delays between points in one temporal dimension.
we propose methods that can be used to compare two sets of phase offsets and investigate whether phase offsets change consistently across different measurements.

2 Stochastic Model for the Extraction of Linear Structure

2.1 Assumptions and Estimates. We will first show how additivity can be used to reduce the representational complexity of a set of phase offsets. Phase offsets between the units A, B, and C are additive if the following condition is satisfied: $\varphi_{AC} = \varphi_{AB} + \varphi_{BC}$. If this holds true for all subsets of pairs of units, the temporal relations between all units in the data set can be represented precisely by positioning all units on one temporal dimension (time axis). The position of a unit then indicates its "preferred firing time" relative to the other units. The absolute value of a phase offset can be read from the time axis as the distance between two preferred firing times (see also Figure 3), and its sign indicates the temporal order of the units on this axis.

Perfect additivity is not likely to be achieved in practice because even additive phase offsets would be measured with an error. It is therefore necessary to investigate how well one can represent the original phase offsets by positioning the units on one time axis. To this end, we first identify the most likely positions of the units on the time axis by using a maximum-likelihood (ML) approach. Thus, the units are positioned such that their pairwise distances (model distances) resemble the measured phase offsets as closely as possible.

A canonical set of assumptions needed for such positioning is illustrated in Figure 4. Here, we assume that n units (1, 2, …, n) have positions $\{x_1, x_2, \ldots, x_n\}$ on the time axis. The positions have mean zero, and the real delays between all unit pairs (i, j) are denoted by $\delta_{ij} := (x_j - x_i)$. The
Figure 4: Model assumptions. Units are represented as points on the time axis with mean zero. The measured phase offset $\varphi_{ij}$ of the distance $\delta_{ij} = (x_j - x_i)$ between units i and j is associated with a normally distributed measurement error $\sigma Z_{ij}$ with zero mean and variance $\sigma^2$.
measured phase offsets $\varphi_{ij}$ are assumed to result as a sum of the real delays, $\delta_{ij}$, and normally distributed measurement errors, which are independent across different phase offsets, have mean 0, and have equal variances $\sigma^2$. Thus, for all unit pairs (i, j) with $1 \le i < j \le n$, $\varphi_{ij} = (x_j - x_i) + \sigma Z_{ij}$, with independent $Z_{ij} \sim N(0, 1)$. Note that there is only one phase offset for each pair of units because $\varphi_{ij} = -\varphi_{ji}$ (i.e., the CCH between units A and B is the mirror image of the CCH between units B and A). Given these assumptions, the ML estimate of the position $x_k$ of unit k on the time axis can be computed as the normed sum of the phase offsets between this unit and all other units under investigation (for a proof, see appendix B.1):
$$\hat{x}_k = \frac{1}{n} \sum_{\ell \neq k} \varphi_{\ell k}. \tag{2.1}$$
These estimates of unit positions remain unbiased even if the assumptions of normality and independence of the measurement errors are violated or if the variances of different phase offsets are unequal. This is because the expected value of a sum is the sum of the expectations. Thus, the only effect of violations of the model assumptions is that the estimates can no longer be interpreted as ML estimates.
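The estimator in equation 2.1 can be sketched as follows (illustrative code, not the authors' implementation; the data layout, a full n × n antisymmetric matrix of measured offsets, is an assumption made for the example):

```python
def unit_positions(phi):
    """Estimate unit positions on the time axis (equation 2.1).

    phi is an n x n antisymmetric matrix of measured phase offsets,
    with phi[l][k] == -phi[k][l] and zeros on the diagonal; entry
    phi[l][k] is the offset from unit l to unit k.
    """
    n = len(phi)
    return [sum(phi[l][k] for l in range(n)) / n for k in range(n)]

# Three units with perfectly additive offsets from true positions -1, 0, +1:
phi = [
    [0.0, 1.0, 2.0],
    [-1.0, 0.0, 1.0],
    [-2.0, -1.0, 0.0],
]
print(unit_positions(phi))  # recovers [-1.0, 0.0, 1.0]
```

With perfectly additive offsets from mean-zero positions, the true positions are recovered exactly, which mirrors the unbiasedness argument above.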
From equation 2.1, it follows that the estimated model distance $\hat{\delta}_{ij}$ between the units i and j is

$$\hat{\delta}_{ij} = \hat{x}_j - \hat{x}_i = \frac{1}{n} \Bigl( 2\varphi_{ij} + \sum_{\ell \neq i,j} (\varphi_{i\ell} + \varphi_{\ell j}) \Bigr). \tag{2.2}$$
Thus, the model distance between the units i and j is estimated by a weighted average of their direct distance and all indirect distances of paths of length two. The direct measurement $\varphi_{ij}$ contributes twice as much to this weighted average as each indirect measurement $\varphi_{i\ell} + \varphi_{\ell j}$. This accounts for the fact that the direct estimate is affected by only one error of measurement, while every indirect estimate is contaminated by two errors of measurement.

The estimates $\hat{x}_i$ in the resulting one-dimensional temporal map, $\{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n\}$, minimize the sum of squares of the differences between the measured phase offsets $\varphi_{ij}$ and the offsets estimated by the model, $\hat{\delta}_{ij}$, that is,

$$Q := \sum_{i<j} (\varphi_{ij} - \hat{\delta}_{ij})^2 = \sum_{i<j} \bigl(\varphi_{ij} - (\hat{x}_j - \hat{x}_i)\bigr)^2 \overset{!}{=} \min.$$

This can be used to estimate $\sigma^2$:

$$\hat{\sigma}^2 = \frac{1}{\binom{n-1}{2}}\, Q. \tag{2.3}$$
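The additivity-error estimate of equations 2.1 to 2.3 can be sketched numerically (illustrative code with invented offsets, not the authors' implementation):

```python
from itertools import combinations

def additivity_error(phi):
    """Estimate sigma^2 from the disagreement between measured offsets
    and model distances (equation 2.3); phi is antisymmetric, n x n."""
    n = len(phi)
    # Positions via equation 2.1:
    x = [sum(phi[l][k] for l in range(n)) / n for k in range(n)]
    # Residual sum of squares Q between offsets and model distances:
    q = sum((phi[i][j] - (x[j] - x[i])) ** 2
            for i, j in combinations(range(n), 2))
    # Divide by C(n-1, 2), equation 2.3:
    return q / ((n - 1) * (n - 2) / 2)

# Perfectly additive offsets from positions -1.5, -0.5, 0.5, 1.5:
xs = [-1.5, -0.5, 0.5, 1.5]
exact = [[b - a for b in xs] for a in xs]

# Perturbing a single offset breaks additivity and inflates the estimate:
noisy = [row[:] for row in exact]
noisy[0][1] += 0.5
noisy[1][0] -= 0.5

print(additivity_error(exact), additivity_error(noisy))
```

Perfect additivity yields an estimate of zero; any disagreement between direct and indirect distances inflates it.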
It can be shown that this estimate is unbiased—that it neither over- nor underestimates $\sigma^2$ in expectation (see appendix B.2). With equation 2.3, the measurement error is computed on the basis of the agreement across phase offsets: the higher the degree of additivity among a set of phase offsets, the smaller the estimated measurement error. We will therefore also refer to the measurement error as the error of additivity. This quantity can also be used to compute

$$\sigma_{\hat{x}}^2 := \mathrm{Var}(\hat{x}) = \frac{n-1}{n^2}\, \sigma^2, \tag{2.4}$$
which takes into account the error of additivity in order to indicate the precision with which a unit can be positioned on the time axis.

Equation 2.4 still holds even if the measurement errors are not distributed normally. However, its utility can be affected if the measurement errors are dependent or if they have different variances. Dependence of measurement errors will be discussed in section 4.2. If measurement errors have different variances, each unit has its own estimation error, and thus the global σ from equation 2.3 represents only the average variability. In this case, an estimate of the error can be computed for each individual unit k by using the following equation:

$$\mathrm{Var}(\hat{x}_k) = \frac{1}{n^2} \sum_{\ell \neq k} \sigma_{\ell k}^2. \tag{2.5}$$
This equation cannot be computed directly from a set of phase offsets but requires that the individual variances $\sigma_{\ell k}^2$ of the measurement errors be estimated from a different source of information than the additivity error. Thus, if the units do not differ strongly in their individual variances, it is more practical to use the global estimate rather than the individual estimates of the positioning error.

2.2 Application to a Sample Data Set. In this section we apply the model to a sample data set that consists of multiunit activity recorded simultaneously from 14 electrodes in the cat primary visual cortex in response to six different stimulation conditions (for details on experimental methods, see appendix A). For all CCHs in stimulation conditions 1 to 3, the positions of the center peaks in the CCHs were estimated with high precision (∼1 ms) by using the methods described in Schneider and Nikolić (2006). The position of each central peak was determined by fitting a cosine function and extracting the point at which it reaches its maximum. The resulting distribution of phase offsets for all 91 pairs in stimulation condition 1 is shown in Figure 5A. The corresponding distributions in conditions 2 and 3 were similar (data not shown).

We positioned the units on the time axis by using equation 2.1 (see Figure 6A). The resulting positions span a total distance slightly larger than 2 ms. The consistency across the phase offsets, indicated by the global estimate in equation 2.4 (additivity error: $\hat{\sigma}^2 \approx 0.04$; $\hat{\sigma}_{\hat{x}}^2 \approx 0.0026$), is illustrated by the width of the normal distributions drawn in black above each position. These suggest a high degree of separability between the preferred times at which the units fire action potentials. For example, unit 2 is likely to fire before unit 11, while the preferred firing times of units 7 and 10 seem indistinguishable.
The precision with which the units' positions can be estimated may differ across units if the variances of the phase measurements differ across unit pairs. Therefore, we also estimated the measurement errors for each individual position by using equation 2 from Schneider and Nikolić (2006). This equation describes analytically the precision with which the phase offset of a CCH can be determined with respect to the variability of the coincidence counts in the CCH, also taking into account the oscillation frequency and
Figure 5: Distributions of 91 phase offsets (A) and the variances of their measurement errors (B) obtained in response to two moving bars that cross at the center of the receptive field (stimulation condition 1). Detailed description of the stimuli is provided in appendix A, and sketches of the stimuli are provided in Figure 6.
amplitude of the central peak. The resulting histogram of estimated measurement errors of phase offsets is shown in Figure 5B. Separate estimates of the $\sigma_{ij}^2$ in equation 2.5 resulted in a small variability of the precision estimates of the unit positions (minimum, $\hat{\sigma}_{\hat{x}}^2 \approx 0.001$ for units 5 and 8; maximum, $\hat{\sigma}_{\hat{x}}^2 \approx 0.007$ for units 6 and 9). The normal distributions with their widths and heights adjusted to correspond to the variances of the individual positions are added in gray above each unit position in Figure 6A. One can see that the additional consideration of the differences in variances does not strongly affect the separability between the units. Therefore, the general model that uses only the global variance provides a reasonably accurate representation of the precision with which a unit's position can be estimated. Hence, for the remaining stimulation conditions 2 and 3, we show only the global estimates of the variability (see Figures 6B & 6C).

Figure 6: Linear configurations on the time axis (horizontal axis: relative firing time [ms]). The dots denote the estimated positions of the units, and the black curves indicate localization errors (see equation 2.4). Gray curves indicate localization errors under heteroscedasticity. (A–C) Original data sets obtained from stimulation conditions 1 to 3. Stimulus configurations are indicated on the left side of each panel. (D) Phase offsets in the data set of stimulation condition 1 are permuted randomly.

To obtain further understanding of the relations between the phase offsets, we propose to investigate how faithfully the resulting linear configuration represents the original data set. This can be assessed by plotting the model distances—the pairwise delays estimated within the model, $\hat{\delta}_{ij}$—against the measured phase offsets, $\varphi_{ij}$, for all unit pairs. Figure 7A shows the corresponding plot for stimulation condition 1. The close scattering around the diagonal and the high correlation (r = 0.98) indicate that the model represents the measured phase offsets well (r = 0.97 and r = 0.93 in conditions 2 and 3, respectively). This suggests that the concept of additivity can be useful in decreasing the representational complexity of the temporal structure within large data sets.

Figure 7: Comparison between measured phase offsets and the corresponding model distances derived from those phase offsets. (A) Original data set. (B) Original phase offsets permuted randomly.

3 Consistency Analysis

A high correlation between phase offsets and model distances indicates that the method presented here provides a reasonable representation of the data structure. However, the interpretation of the data representation has to take into account two distorting effects that could be caused by the assumptions of the model. First, phase offsets are assumed to be perfectly additive except for measurement errors. As a consequence, a certain degree of additivity is also extracted from data sets that are not inherently additive. To address this issue, we compare the linear configuration obtained from the original data set to configurations derived from randomly permuted data sets (see section 3.1). Second, the model assumes that phase offsets from different subsets of units agree on the global linear configuration derived from the entire data set. This could hide inconsistencies between different subnetworks. We therefore present a method that can be used to investigate the consistency of phase offsets across different subnetworks of neurons (see section 3.2).

3.1 Consistency in Permuted Data Sets. We permuted the phase offsets estimated in stimulation condition 1 by randomly assigning phase offsets to pairs of units. This procedure destroys additive structure but maintains the empirical distribution of phase offsets. If the linear configuration in
Figure 8: Distribution of correlation coefficients between model distances and phase offsets obtained from 10,000 permutations of the data set obtained in stimulation condition 1. Arrow: correlation coefficient obtained in the original data set (r = 0.98).
Figure 6A is entirely imposed by our method and not present in the data set, application of the method to a permuted data set should result in similar distances between units and similar measurement errors. Vice versa, if the latter does not apply, the additivity model grasps important aspects of the data structure. In this case, the positions of the units for the permuted data set should be much closer to each other, because a sum of randomly assigned positive and negative phase offsets tends to average out (compare equation 2.1).

The results of such a permutation test are shown in Figures 6D, 7B, and 8. The units are positioned closer to zero for the permuted phase offsets (see Figure 6D) than they were in the original data set (see Figure 6A), and the distributions that indicate the precision of the estimates are much broader and overlap to a higher degree. The large width of these distributions indicates a high disagreement between measured phase offsets and model distances (see equation 2.4) and thus a poor representation of the permuted phase offsets by the one-dimensional model. This can also be seen in the corresponding scatter plot (see Figure 7B). The points are no longer scattered along the diagonal, and the range of model distances is much narrower. This makes the representation of larger phase offsets impossible. Accordingly, the resulting correlation coefficient between the permuted phase offsets and the model distances is small (r = 0.26) but positive, reflecting the small amount of additivity that is imposed by the estimation method itself.

The permutation procedure was repeated 10,000 times, and the resulting distribution of correlation coefficients between the permuted phase offsets and their corresponding model distances is shown in Figure 8. All correlation coefficients were far below the value r = 0.98 obtained for the original data. Thus, by far the strongest linear structure among all 10,001
Figure 9: Linear configurations of target units 1 to 6 determined from the phase offsets to the reference units 7 to 10 and 11 to 14, separately (horizontal axis: relative firing time [ms]).
investigated constellations was obtained from the original constellation, a structure that is therefore highly unlikely to be obtained by chance.
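The permutation test can be sketched as follows (illustrative code with made-up offsets; the paper's analysis used the 91 measured offsets and 10,000 permutations):

```python
import random
from itertools import combinations

def model_offset_correlation(phi):
    """Pearson correlation between measured offsets and the model
    distances implied by the estimated positions (equations 2.1-2.2)."""
    n = len(phi)
    x = [sum(phi[l][k] for l in range(n)) / n for k in range(n)]
    pairs = list(combinations(range(n), 2))
    a = [phi[i][j] for i, j in pairs]
    b = [x[j] - x[i] for i, j in pairs]
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = sum((u - ma) ** 2 for u in a) ** 0.5
    sb = sum((v - mb) ** 2 for v in b) ** 0.5
    if sa == 0 or sb == 0:        # degenerate configuration
        return 0.0
    return cov / (sa * sb)

def permute_offsets(phi, rng):
    """Randomly reassign the measured offsets to pairs of units
    (destroys additive structure, keeps the offset distribution)."""
    n = len(phi)
    pairs = list(combinations(range(n), 2))
    vals = [phi[i][j] for i, j in pairs]
    rng.shuffle(vals)
    out = [[0.0] * n for _ in range(n)]
    for (i, j), v in zip(pairs, vals):
        out[i][j], out[j][i] = v, -v
    return out

xs = [-1.2, -0.7, -0.1, 0.3, 0.6, 1.1]      # hypothetical positions
phi = [[b - a for b in xs] for a in xs]      # perfectly additive offsets
rng = random.Random(0)
r_orig = model_offset_correlation(phi)
r_perm = [model_offset_correlation(permute_offsets(phi, rng))
          for _ in range(100)]
print(r_orig, max(r_perm))  # the original correlation is near 1
```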
3.2 Consistency Across Subnetworks. To investigate the consistency of phase offsets across different subnetworks, we propose the following procedure. Choose one subset of units (e.g., units 1–6) as the target set that will be positioned on the time axis. Subdivide the remaining units into two reference sets of similar sizes (e.g., units 7–10 and 11–14), and use those sets to estimate two linear configurations for the targets separately, each configuration being obtained from a different reference set. High agreement between the two linear configurations then indicates consistency across subnetworks. To derive linear configurations from a subset of units, only the offsets to one reference set are summed in equation 2.1. Thus, the position of target unit i derived from reference units $k_1, \ldots, k_\ell$ is $\frac{1}{\ell+1}(\varphi_{k_1,i} + \varphi_{k_2,i} + \cdots + \varphi_{k_\ell,i})$. Note that the resulting linear configurations are derived in relation to different reference sets and therefore do not have the same means. Thus, they need to be recentered at zero prior to further comparisons.

Figure 9 shows the estimated positions of units 1 to 6 in stimulation condition 1 derived from the two reference sets comprising units 7 to 10 and 11 to 14, respectively. Although each position is estimated from only four measurements, the order of the units does not change, and the distances between temporal positions are conserved to a high degree. Investigations of other combinations of target and reference sets led to similar results (data not shown). Thus, even small data sets can contain highly consistent information about the temporal structure in neuronal activity. Consequently, estimates of temporal maps resulting from the entire data set should yield an interpretable linear configuration that faithfully reflects the data-set structure for all units.
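The subnetwork procedure can be sketched as follows (illustrative code; the unit positions and offsets are hypothetical). With perfectly additive offsets, the recentered configurations derived from two disjoint reference sets coincide:

```python
def positions_from_reference(phi, targets, refs):
    """Position the target units using only their offsets to the reference
    units (section 3.2), then recenter the configuration at zero."""
    raw = [sum(phi[k][i] for k in refs) / (len(refs) + 1) for i in targets]
    mean = sum(raw) / len(raw)
    return [v - mean for v in raw]

xs = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0, -0.8, 0.4, 0.9, -0.5]
phi = [[b - a for b in xs] for a in xs]  # additive offsets, phi[k][i] = x_i - x_k

config_a = positions_from_reference(phi, targets=[0, 1, 2, 3], refs=[4, 5, 6])
config_b = positions_from_reference(phi, targets=[0, 1, 2, 3], refs=[7, 8, 9])
print(config_a)
print(config_b)  # agrees with config_a up to rounding
```

For real, noisy offsets the two configurations differ slightly; the degree of agreement is what section 3.2 uses as an indicator of consistency.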
Spatiotemporal Structure Detected from Cross-Correlation
2399
4 Comparison of Two Linear Configurations

Phase offsets could represent a particularly interesting neuronal code if their configuration depends on the stimulation condition or behavior (Hopfield, 1995; VanRullen & Thorpe, 2002). The first question that needs to be addressed in this respect is whether linear configurations change across stimulation conditions. If the changes in the phase offsets occur solely due to independent errors of measurement, these changes are also inconsistent across phase offsets, and thus the size of the changes can be evaluated with the estimate of the measurement error in equation 2.3 (the error in additivity). In contrast, if the changes result from changes in the relative positions of the units, phase offsets change in a consistent manner, preserving additivity. In this section, we propose methods for investigating the consistencies in the changes of phase offsets and thus investigating whether there is a statistically significant amount of change that exceeds the variability resulting from the additivity errors. To this end, we propose two methods. First, one can obtain a graphical representation of the differences between two configurations and include an indicator of the positioning error based on the error of additivity (see section 4.1). This graph can provide information about the extent of the changes and the identity of the units involved in these changes. The graphical analysis can then be complemented by a statistical test; for this, we propose an ANOVA approach (see section 4.2). (For an alternative test independent of the additivity assumption, see Schneider & Nikolić, 2006.) Both the graphical analysis and the ANOVA are applied to the sample data set. To investigate the stability of responses to identical stimuli, the 20 presentations (trials) of the same stimulus were subdivided into two subsets (odd and even trials with respect to the order of stimulus presentation).
The comparisons between different stimulation conditions were based on all 20 trials (for details, see appendix A).

4.1 Displaying Differences Between Two Linear Configurations. One can compare two linear configurations of the same set of units by plotting the estimated positions derived from two separate data sets against each other (Figures 10 and 11A–C). Close clustering of the points around the main diagonal indicates little or no difference between the linear configurations. To this graph we add the following estimate of the positioning error. For one unit k, the size of the difference between the two positions, x̂_k^(1) − x̂_k^(2), can be measured in terms of its variance, σ_D^2 := Var(x̂_k^(1) − x̂_k^(2)). If the unit positions do not change more than is accounted for by the error of additivity, then

σ_D^2 = ((n − 1)/n^2) · (σ_1^2 + σ_2^2).   (4.1)
G. Schneider, M. Havenith, and D. Nikolić
Figure 10: Comparisons of configurations of preferred firing times (f. times). (A) Configurations derived from two simulated sets of phase offsets that differed only in the random noise added to the measurements but not in the underlying temporal structure. (B–D) Configurations obtained from odd and even trials in stimulation conditions 1 to 3. Stimulus configurations are indicated on the upper left side of each panel. Individual units are represented by points and labeled on the right of each graph. Lines parallel to the diagonal indicate error bands (±2σ̂_D) computed with equation 4.1.
The terms σ_1 and σ_2 indicate the additivity errors in the two data sets (see equation 2.3) and do not have to be equal. The borders indicating approximate 95% confidence intervals (i.e., ±2σ_D) are then plotted parallel to the diagonal. Points that lie outside this band indicate that the corresponding units changed their positions to a higher degree than would be expected from the additivity errors.
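Assuming the additivity-error variances σ_1^2 and σ_2^2 of the two data sets are given, the half-width of the ±2σ_D band follows directly from equation 4.1. The following sketch (not the authors' code) uses illustrative values close to those reported for the sample data set (σ̂^2 ≈ 0.04 and n = 14):

```python
# Sketch of the +/- 2*sigma_D error band of equation 4.1; illustrative values.
import math

def band_halfwidth(n, sigma1_sq, sigma2_sq):
    """Half-width 2*sigma_D of the band around the diagonal, with
    sigma_D^2 = ((n - 1) / n**2) * (sigma1_sq + sigma2_sq)."""
    sigma_d_sq = (n - 1) / n**2 * (sigma1_sq + sigma2_sq)
    return 2.0 * math.sqrt(sigma_d_sq)

# Setting close to the paper's: n = 14 units, additivity error ~0.04 ms^2 each.
w = band_halfwidth(14, 0.04, 0.04)

def outside_band(pos1, pos2, halfwidth):
    """Indices of units whose position change exceeds the error band."""
    return [k for k, (a, b) in enumerate(zip(pos1, pos2)) if abs(a - b) > halfwidth]
```

Units returned by `outside_band` are those whose temporal positions changed more than the additivity errors alone would explain.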
To illustrate the applicability of such graphs, we simulated two sets of phase offsets. We used the linear configuration obtained from stimulation condition 1 and added to each pairwise distance an artificial measurement error: independent and normally distributed random noise with the same variance as estimated from the data set with equation 2.3 (σˆ 2 ≈ 0.04). The graphical approach was then used to compare the two linear configurations derived from the two simulated sets of phase offsets. Figure 10A shows that the units remain within the error borders. Thus, the graphical method does not indicate changes in the units’ positions when the linear configuration remains stable. By using this method, we compared linear configurations that were obtained under both identical and different stimulation conditions. The positions of the units across odd and even trials of the same stimulation condition (conditions 1–3) are shown in Figures 10B to 10D, together with the error lines. The positions of almost all units remained within the error bands, indicating that the variabilities of the positions for the most part did not exceed the variabilities expected by the measurement errors. In Figures 11A to 11C, we show three pairwise comparisons in linear configurations across different stimulation conditions (conditions 1–2, 1–3 and 2–3, respectively). One can see that in the comparisons involving stimulation condition 3 (Figures 11B and 11C), about half of the units lie outside the error bands. Thus, in contrast to the comparisons between identical stimulation conditions, the changes in the positions across different stimulation conditions can be much larger than what would be expected by the measurement errors. This, however, does not have to apply to all cases, as the comparison between conditions 1 and 2 indicates. With the exception of one unit, there are no changes in positions that exceed the additivity error. 
Changes between different stimulation conditions are often also visible in the original CCHs. For one pair of units (3–5), we show the CCHs obtained in stimulation conditions 1–3 in Figures 11D to 11F, respectively. One can see that the point at which the fitted cosine function reaches its maximum is similar in conditions 1 and 2 but changes for condition 3 (for a direct statistical analysis of raw phase offsets, see Schneider & Nikoli´c, 2006). In conclusion, the graphical method can be a useful tool for comparing linear configurations. The method illustrates the degree to which the unit positions change relative to the errors in additivity and visualizes the directions of these changes. In the data set, application of the method indicates that temporal positions change much less in response to identical stimuli than in response to different stimuli. However, the graphical approach cannot be used directly for statistical inferences. This is because the method does not correct for multiple comparisons among different conditions (type I error) or for the dependencies among position estimates of different units introduced by setting the sum of unit positions to zero. Thus, if one unit moves in one direction, the temporal positions of all remaining units move in the opposite direction. As
a consequence, the graphical method cannot provide a rigorous statistical tool to decide whether the changes between two configurations exceed the error in additivity. This question can be addressed much more accurately by an ANOVA approach, which we discuss in the following section.

4.2 Statistical Test for Two Linear Configurations. We now introduce a test with which the graphical representations discussed in the preceding section can be analyzed statistically. Changes across two linear configurations can be evaluated with respect to the errors of additivity by using a general ANOVA approach. Mathematical details are provided in appendix B.3; here we outline the main computational steps. For each of the two sets of phase offsets (k = 1, 2), one has to estimate the linear configuration of units {x_1^(k), . . . , x_n^(k)} (see equation 2.1), the pairwise model distances δ̂_ij^(k) (see equation 2.2), and the measurement error σ̂_k^2 (see equation 2.3). Under the null hypothesis, the two configurations are identical, and model distances and phase offsets differ only due to measurement error. Then the test statistic

F = 1/(n − 1) · 1/(σ̂_1^2 + σ̂_2^2) · Σ_{i<j} (δ̂_ij^(1) − δ̂_ij^(2))^2   (4.2)
is Fisher distributed with (n − 1) and (n − 1)(n − 2) degrees of freedom (see appendix B.3). In contrast, if the two configurations differ, then F is increased by larger differences in model distances between the two data sets, δ̂_ij^(1) − δ̂_ij^(2). One can intuitively interpret F as the squared difference between the linear configurations measured in units of the measurement error, σ̂_1^2 + σ̂_2^2. We applied this ANOVA to the data sets presented in Figures 10 and 11; the results are shown in Table 1. Qualitatively, the F- and the p-values indicate the same relations as the graphical representations in Figures 10 and 11: the changes in the relative firing times are much stronger between responses obtained to different stimulation conditions than between odd and even trials of the same stimulation condition. Two comparisons of identical stimulation conditions (conditions 1 and 2) show significant p-values (at α = 0.05), indicating that the changes across odd and even trials are larger than what is accounted for by the additivity
Figure 11: Comparison of configurations across stimulation conditions. (A–C) Three pairwise comparisons between different stimulation conditions: 1 to 2, 1 to 3, and 2 to 3 (notation as in Figure 10). (D–F) Raw CCH counts for the unit pair 3 to 5 in stimulation conditions 1 to 3 in the original time resolution of 1/32 ms. Black: Cosine functions fitted to the central peak (i.e., ±10 ms) of the CCH counts.
[Figure 11, panels A–F. Panels A–C plot the preferred firing times of the units for one condition against another (axes in ms, as in Figure 10). Panels D–F show the raw CCH counts (number of coincidences against delay in ms) for unit pair 3–5, with fitted cosine peaks at −1.8 ms, −1.7 ms, and −0.3 ms in stimulation conditions 1 to 3, respectively.]
Table 1: Results of ANOVA Applied to a Sample Data Set Tested for Changes in Linear Configurations.

            Identical Stimuli                      Different Stimuli
Cond.        F       p        p*        Cond.        F       p       p*
1 vs. 1     2.0    0.027    0.047      1 vs. 2      5.3      0       0
2 vs. 2     2.7    0.002    0.006      1 vs. 3     22.5      0       0
3 vs. 3     1.2    0.283    0.288      2 vs. 3     15.4      0       0

Notes: For n = 14 units, every data set contains n(n − 1)/2 = 91 phase offsets, resulting in (n − 1) = 13 and (n − 1)(n − 2) = 156 degrees of freedom. p = 0 indicates that the respective values were smaller than 10^−7 (1 vs. 2: p < 10^−7; 1 vs. 3 and 2 vs. 3: p < 10^−16). p*-values are derived from 10,000 simulations to correct for inequality of variances. p* = 0 indicates that not a single simulation, out of 10,000, showed an F-value larger than the one obtained in the experimental data set.
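As an illustration of how equations 2.1 to 2.3 and 4.2 combine, the following pure-Python sketch runs the test on synthetic phase offsets and adds a simple Monte Carlo p-value in the spirit of the simulation-based correction reported in the p* column. All names, positions, and noise levels here are hypothetical, not the experimental data:

```python
# Sketch of the ANOVA test of equation 4.2 on synthetic data; illustrative only.
import random

def positions(phi, n):
    # ML estimate of equation 2.1: x_hat_k = (1/n) * sum_l phi[l][k]
    return [sum(phi[l][k] for l in range(n)) / n for k in range(n)]

def f_statistic(phi1, phi2, n):
    xh1, xh2 = positions(phi1, n), positions(phi2, n)
    num, q1, q2 = 0.0, 0.0, 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d1, d2 = xh1[j] - xh1[i], xh2[j] - xh2[i]   # model distances, eq. 2.2
            num += (d1 - d2) ** 2
            q1 += (phi1[i][j] - d1) ** 2                # additivity errors
            q2 += (phi2[i][j] - d2) ** 2
    dof = (n - 1) * (n - 2) / 2
    s1, s2 = q1 / dof, q2 / dof                         # sigma_hat^2, eq. 2.3
    return num / ((n - 1) * (s1 + s2))                  # eq. 4.2

def simulate_offsets(x, sigma, rng, n):
    phi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            phi[i][j] = x[j] - x[i] + rng.gauss(0.0, sigma)
            phi[j][i] = -phi[i][j]
    return phi

rng = random.Random(0)
n, sigma = 14, 0.2
xa = [0.1 * u - 0.65 for u in range(n)]                 # configuration A
xb = [v + (0.5 if u >= 7 else 0.0) for u, v in enumerate(xa)]  # half shifted

f_same = f_statistic(simulate_offsets(xa, sigma, rng, n),
                     simulate_offsets(xa, sigma, rng, n), n)
f_diff = f_statistic(simulate_offsets(xa, sigma, rng, n),
                     simulate_offsets(xb, sigma, rng, n), n)

# Monte Carlo p-value: re-simulate under H0 and count larger F-values.
null_f = [f_statistic(simulate_offsets(xa, sigma, rng, n),
                      simulate_offsets(xa, sigma, rng, n), n) for _ in range(200)]
p_diff = sum(f >= f_diff for f in null_f) / len(null_f)
```

With identical configurations, F stays near 1, as expected for an F((n − 1), (n − 1)(n − 2)) variable; shifting half the units produces a much larger F and a correspondingly small Monte Carlo p-value.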
errors computed within each of the two configurations. This result suggests that violations of the model assumptions might have increased the type I error. It is therefore necessary to discuss whether the results might be affected by violations of the model assumptions. A violation of the normality assumption is unlikely to have affected the results. In the data set, measurement errors showed no deviations from the normal distribution (Schneider & Nikolić, 2006), and ANOVA is highly robust to violations of the normality assumption (e.g., Pearson, 1931). ANOVA is also highly robust to the violation of the assumption that the variances are equal (e.g., Horsnell, 1953). We nevertheless estimated the extent to which unequal variances could have affected our findings because this assumption was violated to a certain degree in the data set (see section 2.2 and Figure 5B). The p-values that correct for inequality of variances can be obtained by the following simulation procedure. For two sets of estimated phase offsets {ϕ_ij^(1)}_{i<j}, {ϕ_ij^(2)}_{i<j} and their individual variances {σ_ij^(1)}_{i<j}, {σ_ij^(2)}_{i<j}, estimate first the global linear configuration as the mean of the configurations of the compared sets. Next, phase offsets are generated by adding independent and normally distributed measurement error with variance (σ_ij^(k))^2 to each model distance δ_ij^(k), k = 1, 2. With this procedure, we simulated 10,000 data sets and performed ANOVA by using equation 4.2. The percentage of simulations with a larger F-value than what was found in the original data set was used as the corrected p-value. The results are shown in Table 1 (column indicated with p*). The correction for unequal variances increased the p-values in the comparisons between odd and even trials. However, this increase was relatively small, and the changes within stimulation conditions 1 and 2 remained significant. This suggests that in this data set, inequality of variances was not the main
factor responsible for the significant changes within stimulation conditions 1 and 2. Finally, the model assumes that measurement errors are independent. With regard to this, it is necessary to discuss possible sources of variability that are dependent such that the changes across phase offsets are additive. Dependent variability of phase offsets emerges if the primary source of variability is the position of units rather than the individual phase offsets. The analysis methods interpret such dependent changes in phase offsets as differences in the linear structure, not as measurement error. Any additive variability in phase offsets is included in the position of the units by equation 2.1 and thus not included in the measurement error by equation 2.3. Thus, consistent and dependent changes in phase offsets are the most likely reason for the significant ANOVA comparing odd and even trials in Table 1. This property of the test does not constitute a violation of the assumption of independent measurement errors and thus does not invalidate the results of the ANOVA. This is because the methods are designed to investigate only the consistency between phase offsets, not the variability of the units' positions across repeated measurements. Therefore, the results obtained should be interpreted accordingly. The proper interpretation is that the rate of type I error does not increase, but instead, the positions of the units vary slightly more than expected from the error in additivity. The questions related to the variability of the unit positions across repeated measurements should be addressed by conventional statistical methods. The methods used here provide the means to obtain these temporal positions and test whether they change in a sufficiently consistent way to warrant further investigation.

5 Discussion

In this letter, we propose a method for analyzing temporal relations between the firing events of large groups of simultaneously recorded neurons.
The method uses pairwise temporal relations defined as the positions of center peaks in CCHs and assumes that these relations are additive across all pairs. If the data set complies with this assumption, it is possible to represent the underlying spatiotemporal relations on a single temporal dimension, which then indicates the preferred times at which neurons fire action potentials relative to each other. We present a graphical tool as well as a statistical test that can be applied to data in order to investigate whether changes of such spatiotemporal maps across different measurements are consistent across phase offsets. The method we have presented requires the existence of prominent central peaks in CCHs and is therefore highly related to the concept of synchronized firing events that occur across groups of neurons (Abeles, 1982; Diesmann, Gewaltig, & Aertsen, 1999; Singer, 1999). Synchronous events
have been found to occur with a precision of up to a few milliseconds (Riehle, Grün, Diesmann, & Aertsen, 1997; Grün, Diesmann, Grammont, Riehle, & Aertsen, 1999), and linear configurations may thus be extracted on a timescale that is finer than the maximal delay up to which events have been characterized as coincident. Another property of the method presented here is that it does not require the delays between firing events to repeat as exact replicas of each other. Instead, it is sufficient that the delays cluster around a certain value that can be obtained from a CCH. Thus, the method detects spatiotemporal patterns on the basis of events that are allowed to jitter over time and may therefore be invisible in raw spike trains. Also, temporal relations between unit pairs do not have to be stable over time. Instead, phase offsets are defined as those delays observed predominantly between a pair of units in the chosen analysis window and may thus result from temporally inhomogeneous processes (Vaadia et al., 1995). The model that extracts the spatiotemporal relations assumes homoscedasticity and independent and normally distributed measurement errors and is thus compatible with a standard ANOVA approach. This allows for high flexibility in investigating various variables that might affect the relative firing times of neuronal cells. Finally, the method has the advantage of being applicable to the activity of both single units and multiple units, as well as to any set of continuous signals that show preferred delays in their cross-correlation, prominent examples being the local field potential (e.g., Gray, Engel, König, & Singer, 1992; Roelfsema et al., 1997) and the electroencephalogram (e.g., Sauseng et al., 2005).
Perhaps most surprising is our finding that the temporal map indicating the preferred times at which units fire action potentials relative to each other can be determined with a precision much higher than previously reported for the visual cortex (König et al., 1995; Roelfsema et al., 1997). This precision can take values smaller than 1 millisecond and is achieved partially by the fitting procedure used to estimate the positions of the center peaks (Schneider & Nikolić, 2006) and partially due to the integration of multiple pieces of information obtained from different CCHs. As we could show, this allows detecting unusually small changes in the preferred firing times. The question of whether small spiking delays and their resulting spatiotemporal relations have functional significance is beyond the scope of this study. However, we believe that the proposed methods will help to address these issues, as they provide important tools for the detection, display, and analysis of temporal relationships between spiking events of a large number of neurons.

Appendix A: Methods for Data Acquisition

A.1 Preparation and Recordings. Anesthesia was induced with ketamine and, after tracheotomy, was maintained with a mixture of 70%
N2O and 30% O2, and with halothane (0.4–0.6%). The cats were paralyzed with pancuronium bromide applied intravenously (Pancuronium, Organon, 0.15 mg kg^−1 h^−1). Multiunit activity (MUA) was recorded by using a Si-based 16-channel probe (organized in a 4 × 4 spatial matrix), which was supplied by the Center for Neural Communication Technology at the University of Michigan (Michigan probes). The probe had intercontact distances of 200 µm (0.3–0.5 MΩ impedance at 1000 Hz). Signals were amplified 1000 times, filtered between 500 Hz and 3.5 kHz, and digitized with 32 kHz sampling frequency. The probe was inserted into the cortex approximately perpendicular to the surface, which allowed us to record simultaneously from neurons at different depths and along an axis tangential to the cortical surface. Fourteen MUA signals responded well to visual stimuli and had good orientation selectivity. This resulted in a cluster of overlapping receptive fields (RF), all being stimulated simultaneously by the same stimulus.

A.2 Visual Stimulation. Stimuli were presented on a 21-inch computer monitor (HITACHI CM813ET, 100 Hz refresh rate). The software for visual stimulation was a commercially available stimulation tool, ActiveSTIM (www.ActiveSTIM.com). Binocular fusion of the two eyes was achieved by mapping the borders of the respective RFs and then aligning the optical axes with an adjustable prism placed in front of one eye. The stimuli consisted of either one white bar or two bars moving in different directions (60 degree difference in orientation). In the stimuli with two bars, the bars crossed their paths at the center of the RF cluster. At each trial, the stimulus was presented in total for 5 seconds, but only the 2 seconds with the strongest rate responses were used for the analysis. The bars appeared at about 3 degrees eccentricity from the center of the RF cluster and moved with a speed of 1 degree per second, covering the cluster of RFs completely.
In the six stimulation conditions, the bars moved in the following directions: 1: 30 and 330 degrees; 2: 0 degrees; 3: 150 and 210 degrees; 4: 180 degrees; 5: 30 and 150 degrees; 6: 210 and 330 degrees. Each stimulation condition was presented 20 times, and the order of conditions was randomized across trials. In our study, we used only conditions 1, 2, and 3 for the analysis.

Appendix B: Mathematical Proofs

B.1 ML-Estimates of Unit Positions (Eq. 2.1). Consider the vector x := (x_1, . . . , x_n) of real numbers representing the positions of the units with Σ_i x_i = 0. The n(n − 1)/2 distance measurements are denoted by

ϕ_ij = (x_j − x_i) + σZ_ij,   1 ≤ i < j ≤ n,
with independent measurement errors σZ_ij ∼ N(0, σ^2). Define ϕ_ji := −ϕ_ij for j < i, implying ϕ_ii = 0. Then the ML estimate of x is given by

x̂_k := (1/n) Σ_{ℓ=1}^n ϕ_{ℓk},   k = 1, . . . , n.

Proof. As ϕ_ij ∼ N(x_j − x_i, σ^2) for all 1 ≤ i < j ≤ n, the likelihood function is given by

L(x) = (√(2π) σ)^{−n(n−1)/2} · exp( −(1/(2σ^2)) Σ_{i<j} (ϕ_ij − (x_j − x_i))^2 ).

Maximizing L requires minimizing the sum of squares

Q(x) = Σ_{i<j} (ϕ_ij − (x_j − x_i))^2 = (1/2) Σ_{i,j} (ϕ_ij − (x_j − x_i))^2,

where the second equality uses ϕ_ji = −ϕ_ij and ϕ_ii = 0. With

∂Q(x)/∂x_k = 2 Σ_{ℓ=1}^n (ϕ_{kℓ} + x_k − x_ℓ) = 2 Σ_{ℓ=1}^n ϕ_{kℓ} + 2n x_k

(using Σ_ℓ x_ℓ = 0), setting ∂Q/∂x_k = 0 yields x_k = −(1/n) Σ_ℓ ϕ_{kℓ} = (1/n) Σ_ℓ ϕ_{ℓk}. Since ∂^2 Q/(∂x_k ∂x_ℓ) = 0 for k ≠ ℓ and ∂^2 Q/∂x_k^2 = 2n > 0, the estimates x̂_k = (1/n) Σ_{ℓ=1}^n ϕ_{ℓk} minimize Q(x).
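As a quick numerical sanity check of this result (not part of the original proof), one can verify that the closed-form estimate x̂_k = (1/n) Σ_ℓ ϕ_{ℓk} recovers noise-free positions exactly and never yields a larger residual Q than the true positions. The positions and noise level below are arbitrary:

```python
# Numerical check of appendix B.1; illustrative, with made-up positions.
import random

def q_value(phi, x, n):
    # Q(x) = sum_{i<j} (phi_ij - (x_j - x_i))^2
    return sum((phi[i][j] - (x[j] - x[i])) ** 2
               for i in range(n) for j in range(i + 1, n))

n = 8
x_true = [0.3 * k - 1.05 for k in range(n)]          # sums to zero
rng = random.Random(1)

# Noisy antisymmetric offset matrix: phi_ij = (x_j - x_i) + noise.
phi = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        phi[i][j] = x_true[j] - x_true[i] + rng.gauss(0.0, 0.1)
        phi[j][i] = -phi[i][j]

# Closed-form ML estimate: x_hat_k = (1/n) * sum_l phi[l][k].
x_hat = [sum(phi[l][k] for l in range(n)) / n for k in range(n)]

# Noise-free offsets recover the true positions exactly.
phi0 = [[x_true[j] - x_true[i] for j in range(n)] for i in range(n)]
x_hat0 = [sum(phi0[l][k] for l in range(n)) / n for k in range(n)]
```

Because Q is invariant under adding a constant to all positions, the closed-form estimate attains the global minimum of Q, so its residual cannot exceed that of any other candidate, including the true positions.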
B.2 Unbiasedness of σ̂^2 (Eq. 2.3). We want to show that E(σ̂^2) = σ^2; that is, that σ̂^2 := ((n − 1)(n − 2)/2)^{−1} Q(x̂) is an unbiased estimate of σ^2, where x̂ := (x̂_1, . . . , x̂_n) denotes the vector of position estimates. The expectation of each summand in Q(x̂) is

E(ϕ_ij − δ̂_ij)^2 = Var(ϕ_ij − δ̂_ij) + ( E(ϕ_ij − δ̂_ij) )^2,

with

Var(ϕ_ij − δ̂_ij) = ((n − 2)/n^2) σ^2 [(n − 2) + 2] = ((n − 2)/n) σ^2

and

E(ϕ_ij − δ̂_ij) = (1/n) Σ_{k≠i,j} E(ϕ_ij + ϕ_jk + ϕ_ki) = 0.

Thus, E(σ̂^2) = ((n − 1)(n − 2)/2)^{−1} · (n(n − 1)/2) · ((n − 2)/n) · σ^2 = σ^2.
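The unbiasedness claim can likewise be checked by simulation. The following sketch (an illustration with arbitrary positions and σ^2 = 0.04, not the paper's data) averages σ̂^2 over many simulated data sets:

```python
# Monte Carlo check of appendix B.2; illustrative only.
import random

def sigma_hat_sq(phi, n):
    # sigma_hat^2 = Q(x_hat) / ((n - 1)(n - 2)/2), with x_hat from eq. 2.1.
    x_hat = [sum(phi[l][k] for l in range(n)) / n for k in range(n)]
    q = sum((phi[i][j] - (x_hat[j] - x_hat[i])) ** 2
            for i in range(n) for j in range(i + 1, n))
    return q / ((n - 1) * (n - 2) / 2)

rng = random.Random(2)
n, sigma_sq, reps = 8, 0.04, 2000
x = [0.2 * k - 0.7 for k in range(n)]                # arbitrary true positions

vals = []
for _ in range(reps):
    phi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            phi[i][j] = x[j] - x[i] + rng.gauss(0.0, sigma_sq ** 0.5)
            phi[j][i] = -phi[i][j]
    vals.append(sigma_hat_sq(phi, n))

mean_estimate = sum(vals) / reps                     # should approximate 0.04
```

The sample mean of σ̂^2 converges to the true σ^2 as the number of repetitions grows, consistent with the unbiasedness proof.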
B.3 Distribution of Test Statistic F (Eq. 4.2). Let C_1 := {x_1^(1), . . . , x_n^(1)} and C_2 := {x_1^(2), . . . , x_n^(2)} be two sets of real numbers representing linear configurations on the time axis, with Σ_i x_i^(1) = Σ_i x_i^(2) = 0. For all pairs (i, j) with 1 ≤ i < j ≤ n, let

ϕ_ij^(1) := x_j^(1) − x_i^(1) + σZ_ij^(1)   and   ϕ_ij^(2) := x_j^(2) − x_i^(2) + σZ_ij^(2)   (B.1)
be the raw distance measurements with independent (Z_ij^(k))_{k=1,2; i<j} ∼ N(0, 1). We want to test the null hypothesis, H_0, that C_1 and C_2 are identical, against the alternative, H_1, that the two configurations differ:

H_0: x_i^(1) = x_i^(2) for all i = 1, . . . , n;   H_1: x_i^(1) ≠ x_i^(2) for some i ∈ {1, . . . , n}.

Let D denote the merged data vector of all 2 · n(n − 1)/2 phase offset measurements:

D := ( (ϕ_ij^(1))_{i<j}, (ϕ_ij^(2))_{i<j} ) = ( (x_j^(1) − x_i^(1))_{i<j}, (x_j^(2) − x_i^(2))_{i<j} ) + σZ =: µ + σZ   (by B.1),

where σ is a vector all of whose 2 · n(n − 1)/2 entries equal σ, and Z is standard normal in R^{n(n−1)}. The systematic term, µ, represents the additivity assumption that phase offsets are pairwise distances between points on a line. This holds true for both H_0 and H_1 and is represented by the model space M and the model assumption µ ∈ M. Let H denote a subspace of M such that µ ∈ H describes the null hypothesis. If the alternative hypothesis is true, then µ ∉ H, which means that µ has a component in the orthogonal complement of H in M, called E. Note that dim(M) = 2(n − 1), dim(M⊥) = (n − 1)(n − 2), and dim(H) = dim(E) = n − 1. We decompose the data vector D by orthogonal projection onto H and M,

D = P_H D + P_E D + P_{M⊥} D,

and compare the lengths of P_E D and P_{M⊥} D. The vector P_E D represents the differences between the two linear configurations, and P_{M⊥} D represents the residual error:

Under H_0: P_E D = P_E σZ, and therefore (1/σ^2) ‖P_E D‖^2 ∼ χ^2(dim(E)).
Under H_1: ‖P_E D‖^2 = ‖P_E µ + P_E σZ‖^2 > ‖P_E σZ‖^2.

In both cases, P_{M⊥} D = P_{M⊥} σZ; hence, (1/σ^2) ‖P_{M⊥} σZ‖^2 ∼ χ^2(dim(M⊥)). Thus,

F = ( ‖P_E D‖^2 / dim(E) ) / ( ‖P_{M⊥} D‖^2 / dim(M⊥) ) ∼ F((n − 1), (n − 1)(n − 2))   (B.2)

is, under H_0, Fisher distributed with (n − 1) and (n − 1)(n − 2) degrees of freedom, whereas under H_1, F is increased systematically.
It remains to compute the lengths of P_E D and P_{M⊥} D:

P_M D = ( (δ̂_ij^(1))_{i<j}, (δ̂_ij^(2))_{i<j} ) = ( (x̂_j^(1) − x̂_i^(1))_{i<j}, (x̂_j^(2) − x̂_i^(2))_{i<j} ),

where x̂_i^(k) are the ML estimates derived separately for each of the two data sets. These estimates minimize the term Σ_{i<j} (ϕ_ij^(k) − δ̂_ij^(k))^2 for k = 1, 2 and thus minimize also Σ_{k=1,2} Σ_{i<j} (ϕ_ij^(k) − δ̂_ij^(k))^2. Thus,

P_{M⊥} D = D − P_M D = ( (ϕ_ij^(1) − δ̂_ij^(1))_{i<j}, (ϕ_ij^(2) − δ̂_ij^(2))_{i<j} )

and

‖P_{M⊥} D‖^2 = Σ_{k=1,2} Σ_{i<j} (ϕ_ij^(k) − δ̂_ij^(k))^2.   (B.3)

To derive P_E D, note that under H_0, the estimates of the model distances are averages of the estimates derived separately in the two data sets because the measurement error is assumed to be of the same size in the two data sets. Thus,

P_H D = ( ( (δ̂_ij^(1) + δ̂_ij^(2))/2 )_{i<j}, ( (δ̂_ij^(1) + δ̂_ij^(2))/2 )_{i<j} )

and

‖P_E D‖^2 = ‖P_M D − P_H D‖^2 = (1/2) Σ_{i<j} (δ̂_ij^(1) − δ̂_ij^(2))^2.   (B.4)

With equations B.2 to B.4, we can conclude

F = [ (1/2) Σ_{i<j} (δ̂_ij^(1) − δ̂_ij^(2))^2 / (n − 1) ] / [ Σ_{k=1,2} Σ_{i<j} (ϕ_ij^(k) − δ̂_ij^(k))^2 / ((n − 1)(n − 2)) ]
  = 1/(σ̂_1^2 + σ̂_2^2) · 1/(n − 1) · Σ_{i<j} (δ̂_ij^(1) − δ̂_ij^(2))^2,

where σ̂_k^2 (k = 1, 2) is the estimate of σ^2 derived in data set k with equation 2.3.

Acknowledgments

We thank Brooks Ferebee for important inputs to several mathematical aspects of the project. We also thank Julia Biederlack, Sergio Neuenschwander,
and Nan-Hui Chen for help with data acquisition and conditioning. Finally, we thank Anton Wakolbinger and Wolf Singer for their helpful comments and support. This work was partially supported by the Hertie Foundation.

References

Abeles, M. (1982). Local cortical circuits: An electrophysiological study. Berlin: Springer.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. New York: Cambridge University Press.
Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (2000). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. Journal of Neurophysiology, 70, 1629–1638.
Abeles, M., & Gerstein, G. L. (1988). Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. Journal of Neurophysiology, 60, 909–924.
Amari, S., Nakahara, H., Wu, S., & Sakai, Y. (2003). Synchronous firing and higher-order interactions in neuron pool. Neural Computation, 15, 127–142.
Brown, E. N., Kass, R. E., & Mitra, P. P. (2004). Multiple neural spike-train analysis: State-of-the-art and future challenges. Nature Neuroscience, 7, 456–461.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Engel, A. K., Kreiter, A. K., König, P., & Singer, W. (1991). Interhemispheric synchronization of oscillatory responses in cat visual cortex. Science, 252, 1177–1179.
Gerstein, G. L., Perkel, D. H., & Dayhoff, J. E. (1985). Cooperative firing activity in simultaneously recorded populations of neurons: Detection and measurement. Journal of Neuroscience, 5, 881–889.
Gray, C. M., Engel, A. K., König, P., & Singer, W. (1992). Synchronization of oscillatory neuronal responses in cat striate cortex: Temporal properties. Visual Neuroscience, 8, 337–347.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Grün, S. (1996). Unitary joint-events in multiple-neuron spiking activity: Detection, significance and integration. Frankfurt am Main: Deutsch.
Grün, S., Diesmann, M., & Aertsen, A. (2002). Unitary events in multiple single-neuron activity. I. Detection and significance. Neural Computation, 14, 43–80.
Grün, S., Diesmann, M., Grammont, F., Riehle, A., & Aertsen, A. (1999). Detecting unitary events without discretization of time. Journal of Neuroscience Methods, 94, 67–79.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Horsnell, G. (1953). The effect of unequal group variances on the F-test for the homogeneity of group means. Biometrika, 40, 128–136.
Ikegaya, Y., Aaron, G., Cossart, R., Aronov, D., Lampl, I., Ferster, D., & Yuste, R. (2004). Synfire chains and cortical songs: Temporal modules of cortical activity. Science, 204, 559–564.
Johnson, D. H., Gruner, M. C., Baggerly, K., & Seshagiri, C. (2001). Information-theoretic analysis of neural coding. Journal of Computational Neuroscience, 10, 47–69.
König, P. (1994). A method for the quantification of synchrony and oscillatory properties of neuronal activity. Journal of Neuroscience Methods, 54, 31–37.
König, P., Engel, A. K., Roelfsema, P. R., & Singer, W. (1995). How precise is neuronal synchronization? Neural Computation, 7, 469–485.
Lestienne, R. (1996). Determination of the precision of spike timing in the visual cortex of anaesthetised cats. Biological Cybernetics, 74, 55–61.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Martignon, L., von Hasseln, H., Grün, S., Aertsen, A., & Palm, G. (1995). Detecting higher-order interactions among the spiking events in a group of neurons. Biological Cybernetics, 73, 69–81.
Moore, G. P., Perkel, D. H., & Segundo, J. P. (1966). Statistical analysis and functional interpretation of neuronal spike data. Annual Review of Physiology, 28, 493–522.
Pearson, E. S. (1931). The analysis of variance in cases of non-normal variation. Biometrika, 23, 114–133.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophysical Journal, 7, 419–440.
Reinagel, P., & Reid, C. (2002). Precise firing events are conserved across neurons. Journal of Neuroscience, 22, 6837–6841.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Roelfsema, P. R., Engel, A. K., König, P., & Singer, W. (1997). Visuomotor integration is associated with zero time-lag synchronization among cortical areas. Nature, 385, 157–161.
Sauseng, P., Klimesch, W., Doppelmayr, M., Pecherstorfer, T., Freunberger, R., & Hanslmayr, S. (2005). EEG alpha synchronization and functional coupling during top-down processing in a working memory task. Human Brain Mapping, 26, 148–155.
Schneider, G., & Grün, S. (2003). Analysis of higher-order correlations in multiple parallel processes. Neurocomputing, 52–54, 771–777.
Schneider, G., & Nikolić, D. (2006). Detection and assessment of near-zero delays in neuronal spiking activity. Journal of Neuroscience Methods, 152, 97–106.
Singer, W. (1999). Neuronal synchrony: A versatile code for the definition of relations? Neuron, 24, 49–65.
Traub, R. D., Whittington, M. A., & Jefferys, J. G. R. (1997). Gamma oscillation model predicts intensity coding by phase rather than frequency. Neural Computation, 9, 1251–1264.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.
Spatiotemporal Structure Detected from Cross-Correlation
2413
VanRullen, R., & Thorpe, S. J. (2002). Surfing a spike wave down the ventral stream. Vision Research, 42, 2593–2615. Wennekers, T., & Palm, G. (2000). Cell assemblies, associative memory and temporal structure in brain signals. In R. Miller (Ed.), Time and the brain (pp. 251–273). London: Harwood Academic Publishers.
Received July 8, 2005; accepted April 13, 2006.
LETTER
Communicated by Walter Senn
Stable Competitive Dynamics Emerge from Multispike Interactions in a Stochastic Model of Spike-Timing-Dependent Plasticity

Peter A. Appleby
[email protected]
Institute for Theoretical Biology, Humboldt-Universität zu Berlin, Invalidenstrasse 43, D-10115 Berlin, Germany
Terry Elliott [email protected] Department of Electronics and Computer Science, University of Southampton, Highfield, Southampton SO17 1BJ, U.K.
In earlier work we presented a stochastic model of spike-timing-dependent plasticity (STDP) in which STDP emerges only at the level of temporal or spatial synaptic ensembles. We derived the two-spike interaction function from this model and showed that it exhibits an STDP-like form. Here, we extend this work by examining the general n-spike interaction functions that may be derived from the model. A comparison between the two-spike interaction function and the higher-order interaction functions reveals profound differences. In particular, we show that the two-spike interaction function cannot support stable, competitive synaptic plasticity, such as that seen during neuronal development, without including modifications designed specifically to stabilize its behavior. In contrast, we show that all the higher-order interaction functions exhibit a fixed-point structure consistent with the presence of competitive synaptic dynamics. This difference originates in the unification of our proposed "switch" mechanism for synaptic plasticity, coupling synaptic depression and synaptic potentiation processes together. While three or more spikes are required to probe this coupling, two spikes can never do so. We conclude that this coupling is critical to the presence of competitive dynamics and that multispike interactions are therefore vital to understanding synaptic competition.

Neural Computation 18, 2414–2464 (2006)

1 Introduction

Spike-timing-dependent plasticity (STDP) has become a topic of much interest. In STDP, the exact timing of pre- and postsynaptic stimulation determines the degree and polarity of change in synaptic strength (for review, see Roberts & Bell, 2002). This is in contrast to conventional, rate-based
synaptic plasticity (Bliss & Lømo, 1973; Gustafsson, Wigström, Abraham, & Huang, 1987; Dudek & Bear, 1992), where the governing factor is the rate of pre- and postsynaptic firing. STDP is apparently widespread (Bi & Poo, 1998; Zhang, Tao, Holt, Harris, & Poo, 1998; Froemke & Dan, 2002), and as a result, the learning properties of STDP rules are of potentially far-reaching consequence. In theoretical studies, STDP rules have been explored in a variety of phenomenological (Song, Miller, & Abbott, 2000; van Rossum, Bi, & Turrigiano, 2000; Izhikevich & Desai, 2003) and biophysical (Castellani, Quinlan, Cooper, & Shouval, 2001; Karmarkar & Buonomano, 2002; Shouval, Bear, & Cooper, 2002) models. In all cases, the basic phenomenology of STDP can be reproduced, but the models are often sensitive to the choice of parameters or difficult to generalize to multispike interactions (Froemke & Dan, 2002). A key feature of developmental synaptic plasticity in the visual system and elsewhere is that it is activity dependent and competitive in character, with stable, segregated patterns of afferent innervation emerging as development proceeds (Purves & Lichtman, 1985). Although rate-based synaptic plasticity rules have been generally successful in reproducing much of the phenomenology of developmental plasticity (see, e.g., van Ooyen, 2003), it is natural to wonder whether models of STDP can also account for such processes, especially given that STDP is expressed, for example, in the developing visual system (Zhang et al., 1998; Schuett, Bonhoeffer, & Hübener, 2001) and that STDP may in fact underlie rate-based synaptic plasticity. If STDP really does underlie rate-based synaptic plasticity, then the connection between STDP models and rate-based plasticity models must be established, and the capacity of STDP models to exhibit the stable, competitive dynamics characteristic of developmental synaptic plasticity must also be demonstrated.
To be sure, models of STDP have been related to one particular rate-based plasticity model, the Bienenstock-Cooper-Munro (BCM) (Bienenstock, Cooper, & Munro, 1982) model (Izhikevich & Desai, 2003). Furthermore, a variety of bimodal (Song et al., 2000) and unimodal (van Rossum et al., 2000) distributions of synaptic weights can be obtained under simple STDP rules in the presence of additional constraints. However, the conditions under which STDP can in general support stable, competitive dynamics, and the relationship between rate- and timing-based models of plasticity more generally, have only been partially explored. Existing models of STDP typically assume that STDP operates at each individual synapse. This places a heavy computational burden on the synapse, requiring an array of spike coincidence detection machinery to be present that can represent spike time differences to millisecond accuracies and adjust synaptic weights in a continuous, graded fashion. As experimental work on STDP typically involves measuring the plasticity of multisynapse connections across multiple spike pairings, the observed STDP plasticity rule could, however, emerge from the averaging of some simpler synaptic rule. In earlier work, we postulated such a model of STDP,
in which something akin to a three-state switch governs individual synaptic changes (Appleby & Elliott, 2005), with changes in synaptic strength occurring in discrete jumps of fixed magnitude (cf. Peterson, Malenka, Nicoll, & Hopfield, 1998; O’Connor, Wittenberg, & Wang, 2005). The overall change in strength in response to repeated spike pairings was then shown to be of an STDP-like form. The observed STDP rule can therefore emerge as a temporal and spatial average over multiple synapses and multiple spike pairings, with synapses required only to perform a minimal level of coincidence detection. In addition, an explanation of spike triplet interactions (Froemke & Dan, 2002) emerged as a natural consequence of the structure of the switch rule. Using simple probability theory to consider the interaction of any two spikes in a spike train, we derived an averaged, rate-based learning rule and derived a set of constraints under which a BCM-like form may be obtained. Elsewhere (Appleby & Elliott, 2006), we perform a fuller analytical treatment that takes into account the higher-order spike interactions that occur when examining spike trains composed of more than two spikes, and derive a generating function for the general n-spike, averaged learning rule that emerges from the model. For convenience, we refer to interactions involving three or more spikes as multispike interactions. In this article, we examine the learning dynamics of multispike rules in much greater detail, with the aim of exploring the connections between STDP and rate-based plasticity in our model. Extending the simple probabilistic argument presented in Appleby & Elliott (2005), we present averaged, rate-based rules for cases where spike interactions are limited to two, three, or four spikes and the case where interactions are not limited in any way. 
We observe that the two-spike learning rule, despite possessing a promisingly BCM-like form, leads inevitably to pathological learning behaviors unless further modifications are made to stabilize its dynamics. We then explore the effect of allowing spike interactions to extend beyond two spikes and find that unlike the two-spike rule, all multispike learning rules can exhibit stable, competitive dynamics without introducing any additional modifications. We explore analytically the fixed-point structure of the switch rule under the assumption that afferents stochastically vary about the same mean firing rate and find that the presence of higher-order corrections in multispike rules changes the fixed-point structure compared to the two-spike rule. We argue that this change occurs because potentiation and depression are coupled in the switch model in the presence of multispike interactions. We demonstrate, numerically, that the multispike learning rules are computational modes available to a real neuron and find a constraint on the magnitude of plasticity below which the rate-based behavior becomes dominant. Finally, we perform several numerical simulations, including a large-scale simulation of ocular dominance column (ODC) development, that demonstrate that the observed fixed-point structure leads to stable, competitive dynamics in a large-scale system.
Figure 1: A schematic representation of the switch and its resulting plasticity rule. (A) A switch to account for potentiation induced by a presynaptic spike preceding a postsynaptic spike. (B) An equivalent switch to account for depression induced when the spike order is reversed. Both switches can be unified into a single switch (C) with a common OFF state. Such a switch cannot simultaneously exist in both the DEP and POT states, but only in one state at a time. (D) shows the change in synaptic strength evoked under the unified three-state switch for a representative spike pair at various spike timings. Wavy lines represent stochastic transitions to the OFF state without any induced potentiation or depression. Semicircles represent transitions induced by a pre- or postsynaptic spike. The arrows ⇑ and ⇓ indicate the induction of potentiation and depression, respectively.
2 Summary of Switch Model

We first give a brief overview of our synaptic switch model presented in Appleby & Elliott (2005). We propose that a pair of spikes, one pre- and one postsynaptic, will potentiate or depress a given synapse by a fixed amount, A±, subject only to the requirement that the second spike occurs within a finite time window relative to the first spike. Outside this time window, the second spike does not evoke any change in synaptic strength. The duration of this time window is not fixed but is taken to be a stochastic quantity governed by some probability distribution. The degree of potentiation or depression does not depend on the time difference between events. Rather, we require only that the second event occurs within a finite time window after the first event for the fixed steps, A±, to occur. This simple modification rule could be embodied by some biological, synaptic switch mechanism similar to that shown in Figure 1A, which represents the process leading to potentiation under a pre- and then postsynaptic spike pairing. The synapse initially resides in a resting state that we label the OFF state. A presynaptic spike elevates the synapse into a different
functional state, which we label the POT state. A postsynaptic spike occurring while the switch is in the POT state immediately returns the switch to the OFF state and induces an associated potentiation of synaptic strength of magnitude A+. In the absence of a postsynaptic spike, the switch will move from the POT state back to the OFF state in a stochastic manner, governed by some probability distribution. Such a stochastic transition does not induce any change in synaptic strength. We postulate an identical switch to account for depression (see Figure 1B), which operates in the same manner. This time, an initial postsynaptic spike activates the synapse, moving it into a third functional state, which we label the DEP state. A presynaptic spike occurring while the switch is in the DEP state immediately returns the switch to the OFF state and induces an associated depression of synaptic strength of magnitude A−. In the absence of a presynaptic spike, the switch will spontaneously move from the DEP state back to the OFF state, again in a stochastic manner, without any change in synaptic strength. We choose to unify these two switches into a single three-state switch (see Figure 1C). Although any number of additional states and transitions may be postulated, we find that this simple switch mechanism is sufficient to reproduce a variety of STDP results (Appleby & Elliott, 2005). Although the unification of the switch was initially motivated on the grounds of simplicity, unifying the switch also has the effect of coupling potentiation and depression together. This coupling takes place in the sense that once the synapse enters the POT state, it cannot subsequently enter the DEP state without first returning to the OFF state (and vice versa). We will show later that this coupling of potentiation and depression dramatically affects the dynamical landscape of the multispike learning rules derivable from the switch model compared to two uncoupled switches.
A depiction of the step change in synaptic strength that a spike pair induces as a function of spike timing is shown in Figure 1D. The magnitude of change is fixed: there is no dependence on the time difference between pre- and postsynaptic events, and synapses undergo an all-or-none potentiation or depression. We distinguish two forms of the model based on its response to additional presynaptic spiking once the synapse is already in the POT state. In the resetting model, additional presynaptic spiking while in the POT state resets the stochastic process that governs the transition back to OFF. In the non-resetting model, additional presynaptic spiking has no further effect. The behavior of the depressing lobe of the switch is taken to be the same as the potentiating lobe. In the resetting model, therefore, postsynaptic spiking is capable of resetting the stochastic process governing the DEP → OFF transition in an identical manner to that of presynaptic spiking resetting the stochastic POT → OFF process. In the resetting form of a multispike rule, there is a tendency to experience more potentiation or depression than in the nonresetting form, as the effect of resetting is to increase the proportion of time that the synapse spends in the POT or DEP states compared to the OFF state. Resetting also has the effect of discarding the spike history. Once
elevated to the POT state, each subsequent presynaptic spike resets the stochastic process and eradicates all trace of its predecessors. This gives rise to a lack of memory in the sense that if the synapse is in the POT state, then only the time since the last presynaptic spike, not the times of all presynaptic spikes since the initial elevation to the POT state, is relevant. A synapse that has been reset n times by n additional presynaptic spikes is therefore identical to a synapse that has just been elevated to the POT state. In the experimental STDP protocols that we consider (Bi & Poo, 1998), we are limited to precisely two spikes, which, due to the structure of the switch, will never be enough to probe the resetting behavior. The choice of resetting or nonresetting is arbitrary to some extent, as the basic phenomenology of STDP is reproduced under both forms of the model. It is only when considering spike trains comprising three or more spikes that the choice of whether to allow resetting begins to make an impact. This impact is, however, minimal, in the sense that the resetting and nonresetting forms display qualitatively similar learning behavior. The motivation for considering resetting lies in the observation that the resetting form is particularly convenient for deriving the spike interaction function for any number of spikes (Appleby & Elliott, 2006). We assume, as before, that an afferent makes multiple synapses onto a target cell and that the overall connection strength between the two cells is the linear sum of each individual synaptic strength. The synapses are treated independently, which, due to the stochastic nature of the synaptic modification rule, means that the synapses comprising a connection will often be in different states. It is therefore the spatial average over synapses, and the temporal average over spike pairs, that determines the overall change in connection strength. 
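The averaging described above is easy to illustrate numerically. The Python sketch below is ours, not the authors' code: it simulates the switch's response to a single pre/post spike pair with exponential windows (n± = 1), using A+ = 1, A− = 0.95, and τ− = 20 ms as set later in the text, together with an illustrative τ+ = 16 ms. Averaging the all-or-none steps over many pairs recovers a smooth, exponentially decaying, STDP-like curve.

```python
import math
import random

def pair_response(dt, a_plus=1.0, a_minus=0.95,
                  tau_plus=16.0, tau_minus=20.0, trials=200_000, seed=0):
    """Monte Carlo average change for one pre/post spike pair.

    dt > 0: pre leads post by dt ms (candidate potentiation);
    dt < 0: post leads pre by |dt| ms (candidate depression).
    With exponential windows (n = 1), the fixed step A+ or A- is
    applied only if the second spike arrives inside the window.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        if dt > 0:  # POT lobe: draw a window of mean tau_plus
            if dt < rng.expovariate(1.0 / tau_plus):
                total += a_plus
        else:       # DEP lobe: draw a window of mean tau_minus
            if -dt < rng.expovariate(1.0 / tau_minus):
                total -= a_minus
    return total / trials

# Averaging the all-or-none steps yields smooth exponential decay,
# +A+ exp(-dt/tau+) for dt > 0 and -A- exp(-|dt|/tau-) for dt < 0.
assert abs(pair_response(10.0) - 1.0 * math.exp(-10.0 / 16.0)) < 0.01
assert abs(pair_response(-10.0) + 0.95 * math.exp(-10.0 / 20.0)) < 0.01
```

Each individual synapse only ever takes a fixed step or none at all; the timing dependence emerges purely from the average.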
This overall change in connection strength for a multisynapse connection as a function of spike timing is qualitatively STDP-like (Appleby & Elliott, 2005). Thus, when the changes are viewed at the level of multisynapse connections, the averaging of the proposed synaptic switch rule leads directly to an STDP-like modification curve.

2.1 Multispike Interaction Functions. We may calculate the expected response of a synapse to a train of spikes using simple probability theory. We first reproduce the derivation of the two-spike interaction function (Appleby & Elliott, 2005) and then state the three-, four-, and ∞-spike rules. For notational convenience, we denote a presynaptic spike by the symbol π and a postsynaptic spike by the symbol p. Pre- and postsynaptic firing are assumed to be independent Poisson processes with rates λπ and λp, respectively, and we set β = λπ + λp. Because they are independent, the combined pre- and postsynaptic spike sequences form a single Poisson process of overall rate β. The probability density function for the interspike interval of this joint process is f_T(t) = β exp(−βt), where t denotes time. The probability that any particular spike in this combined train is a π spike is λ̄π = λπ/β, while the probability of its being a p spike is λ̄p = λp/β. We
use the subscript + to refer to quantities governing the potentiating lobe of the switch, and the subscript − to refer to their counterparts governing the depressing lobe of the switch. For example, the magnitudes of plasticity are denoted A+ for potentiation and A− for depression, respectively. The potentiating and depressing lobes of the switch do not necessarily share the same parameters, representing the possibility that the two processes have at least some degree of independence. We assume that the probability density functions f ± (t) for the stochastic transitions POT → OFF and DEP → OFF (see Figures 1A and 1B) are given by integer-order gamma distributions, with probability density functions
f±(t) = (1/τ±) [(t/τ±)^(n±−1)/(n±−1)!] exp(−t/τ±),  (2.1)
where n± are the integer orders and τ± are the characteristic timescales associated with these stochastic processes. For n± = 1, these are just simple exponential distributions. The lack of memory property associated with such exponential distributions means that only the time since the most recent spike is relevant. For n± = 1, the choice of resetting or nonresetting is therefore irrelevant, as the exponential stochastic processes themselves erase the spike history. We may calculate the expected response to a typical spike train comprising n spikes. This involves deconstructing the spike train into all the different possible spike combinations, then calculating the contribution from each combination to the expected change in synaptic strength. Because we have two different spike types (pre- and postsynaptic), we have a total of 2^n possible spike trains of length n. Integrating out the interspike intervals and averaging across all possible spike combinations gives an unconditional expectation value for the average change induced by a typical n-spike train. Consider first a two-spike train. The spike train may manifest itself in one of four possible spike sequences, ππ, πp, pπ or pp, occurring with probabilities p_{σ1σ2} = λ̄σ1 λ̄σ2, where σ1, σ2 ∈ {π, p}. Each of these spike sequences gives a contribution, denoted by S_{σ1σ2}, to the expected change for the typical two-spike train, denoted by S2. This contribution is obtained by conditioning on the state of the synapse when the second spike arrives and calculating a mean expected change. Consider the spike sequence πp. The mean change in synaptic efficacy triggered by a πp spike pair of time difference t is the amplitude of synaptic plasticity, A+, multiplied by the probability that the switch is still ON at time t after the first spike, P+(t). Thus, the conditional expectation value for the change in synaptic efficacy, S_πp(t), given a πp spike time interval t, is just

S_πp(t) = +A+ P+(t).  (2.2)
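As a quick sanity check on the window distributions of equation 2.1, the short Python sketch below (the function name f_window is ours) confirms that the integer-order gamma density is normalized and reduces to a plain exponential for n = 1.

```python
import math

def f_window(t, n, tau):
    # Integer-order gamma density of equation 2.1 for the decay window.
    return (t / tau) ** (n - 1) / math.factorial(n - 1) * math.exp(-t / tau) / tau

# For n = 1 this is the plain exponential (1/tau) exp(-t/tau).
assert abs(f_window(5.0, 1, 20.0) - math.exp(-0.25) / 20.0) < 1e-12

# Trapezoidal check that the density is normalized for n = 1 and n = 3
# (tau = 20 ms; integrating out to 500 ms captures essentially all mass).
h = 0.05
for n in (1, 3):
    area = sum(h * 0.5 * (f_window(i * h, n, 20.0) + f_window((i + 1) * h, n, 20.0))
               for i in range(10000))
    assert abs(area - 1.0) < 1e-3
```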
Similarly, an identical argument gives

S_pπ(t) = −A− P−(t),  (2.3)

for the sequence pπ, where P±(t) are the probabilities that the synapse is still in the elevated state when the second spike arrives. These probabilities are just

P±(t) = ∫_t^∞ dt′ f±(t′),  (2.4)

which, for the gamma distributions in equation 2.1, integrate to give

P±(t) = e_{n±}(t/τ±) exp(−t/τ±),  (2.5)
where e_n(x) = Σ_{i=0}^{n−1} x^i/i! is the truncated exponential series. The spike patterns ππ and pp cannot cause a change in synaptic strength under our switch rule, so S_ππ(t) = S_pp(t) = 0. Averaging across all four possible manifestations of the two-spike train gives the expected change in synaptic strength for a typical two-spike train. This expression is a conditional expectation value due to the dependence on the interspike interval between the pre- and postsynaptic spikes. Integrating out the interspike time intervals according to their exponential distributions, f_T(t), gives the unconditional expectation value for the change in synaptic strength induced by a two-spike train. The two-spike rule may therefore be written as
S2 = Σ_{σ1,σ2∈{π,p}} p_{σ1σ2} ∫_0^∞ dt_0 ∫_0^∞ dt_1 f_T(t_0) f_T(t_1) S_{σ1σ2}(t_1).  (2.6)

Defining

K_1±(β) = ∫_0^∞ dt f_T(t) P±(t) = 1 − 1/(1 + τ±β)^{n±}  (2.7)
and K̃_1+(β) = λ̄π K_1+(β) and K̃_1−(β) = λ̄p K_1−(β), it is straightforward to show that the two-spike rule takes the form

S2 = λ̄p A+ K̃_1+(β) − λ̄π A− K̃_1−(β).  (2.8)
This equation is an analytical expression for the expected change in synaptic efficacy induced by a two-spike train at given pre- and postsynaptic firing rates, λπ and λ p . As a two-spike train can never probe the resetting properties of the switch, this form is valid for both resetting and nonresetting models.
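Equation 2.8 is simple to evaluate numerically. The Python sketch below is ours: it uses A+ = 1, A− = 0.95, and τ− = 20 ms as set later in the text, with an illustrative τ+ = 16 ms (chosen so that the constraint γ < 1 introduced later holds), and rates in spikes per ms so that the τ± are in ms. It reproduces the qualitative behavior discussed later in the section: net depression at low rates and net potentiation at high rates.

```python
def K1(beta, n, tau):
    # Closed form of equation 2.7.
    return 1.0 - (1.0 + tau * beta) ** (-n)

def S2(lam_pre, lam_post, a_plus=1.0, a_minus=0.95,
       n_plus=1, n_minus=1, tau_plus=16.0, tau_minus=20.0):
    """Two-spike rule of equation 2.8; rates in spikes/ms, tau in ms."""
    beta = lam_pre + lam_post
    pbar, qbar = lam_pre / beta, lam_post / beta    # barred spike probabilities
    kt_plus = pbar * K1(beta, n_plus, tau_plus)     # K~1+(beta)
    kt_minus = qbar * K1(beta, n_minus, tau_minus)  # K~1-(beta)
    return qbar * a_plus * kt_plus - pbar * a_minus * kt_minus

# High rates: K1± -> 1, so S2 ∝ A+ - A- > 0 (potentiation dominates);
# low rates give net depression when depression dominates (gamma < 1).
assert S2(1.0, 1.0) > 0
assert S2(0.001, 0.001) < 0
```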
For an n-spike train, equation 2.6 becomes

S_n = Σ_{σ1,...,σn∈{π,p}} p_{σ1···σn} ∫_0^∞ dt_0 ··· ∫_0^∞ dt_{n−1} f_T(t_0) ··· f_T(t_{n−1}) × S_{σ1···σn}(t_1, . . . , t_{n−1}),  (2.9)
where S_{σ1···σn}(t_1, . . . , t_{n−1}) is the expected change induced by the spike sequence σ1 · · · σn at interspike intervals t_1, . . . , t_{n−1}, and p_{σ1···σn} = λ̄σ1 × · · · × λ̄σn. The function S_{σ1···σn} does not depend on the time of the first spike, t_0, only on the interspike intervals. It may be calculated directly by conditioning on the state of the synapse as each spike in the sequence arrives. Although cumbersome, this method allows S_n to be explicitly evaluated for small n. For larger n, alternative methods based on formulating a recurrence relation in the S_n can be used (Appleby & Elliott, 2006). In the nonresetting model, we find by direct calculation that

S_3^NR = λ̄p A+ [F_0 K̃_2+(β) + (F_0 + F_1) K̃_1+(β)] − λ̄π A− [F_0 K̃_2−(β) + (F_0 + F_1) K̃_1−(β)]  (2.10)
and

S_4^NR = λ̄p A+ [F_0 K̃_3+(β) + (F_0 + F_1) K̃_2+(β) + (F_0 + F_1 + F_2) K̃_1+(β)] − λ̄π A− [F_0 K̃_3−(β) + (F_0 + F_1) K̃_2−(β) + (F_0 + F_1 + F_2) K̃_1−(β)],  (2.11)

where

K_l±(β) = ∫_0^∞ dt_1 ··· ∫_0^∞ dt_l f_T(t_1) ··· f_T(t_l) P±(t_1 + ··· + t_l)
        = [τ±β/(1 + τ±β)]^l Σ_{i=0}^{n±−1} [(i + l − 1)!/(i! (l − 1)!)] (1 + τ±β)^{−i},  (2.12)
and

K̃_l+(β) = λ̄π^l K_l+(β),  (2.13)
K̃_l−(β) = λ̄p^l K_l−(β),  (2.14)

with

F_0 = 1,  (2.15)
and

F_1 = 1 − (K̃_1+(β) + K̃_1−(β)),  (2.16)
F_2 = 1 − (K̃_1+(β) + K̃_1−(β)) − K̃_2+(β) − K̃_2−(β) + (K̃_1+(β) + K̃_1−(β))².  (2.17)
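The closed form in equation 2.12 can be checked for internal consistency: at l = 1 it must reduce to equation 2.7, and for n± = 1 the memoryless exponential window makes the multiple integral factorize into [K_1±(β)]^l. A minimal Python check (function name ours):

```python
import math

def K_l(beta, l, n, tau):
    # Closed form of equation 2.12.
    r = tau * beta / (1.0 + tau * beta)
    s = sum(math.factorial(i + l - 1) / (math.factorial(i) * math.factorial(l - 1))
            * (1.0 + tau * beta) ** (-i) for i in range(n))
    return r ** l * s

beta, tau = 0.1, 20.0
# l = 1 must reduce to the single-integral result of equation 2.7:
for n in (1, 2, 3):
    assert abs(K_l(beta, 1, n, tau) - (1.0 - (1.0 + tau * beta) ** (-n))) < 1e-12
# For n = 1 the survival probability is memoryless, the multiple integral
# factorizes, and K_l collapses to (tau*beta/(1+tau*beta))**l = K_1**l:
assert abs(K_l(beta, 2, 1, tau) - K_l(beta, 1, 1, tau) ** 2) < 1e-12
```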
The resetting model replaces P±(t_1 + ··· + t_l) in the definition of K_l±(β) by the product P±(t_1) × ··· × P±(t_l). Hence, for the resetting model, we have that

K_l±(β) = [∫_0^∞ dt f_T(t) P±(t)]^l = [K_1±(β)]^l,  (2.18)
so we can replace K̃_l±(β) by [K̃_1±(β)]^l in the expression for S_n^NR to obtain that for the resetting model, S_n^R. We then have

S_3^R = λ̄p A+ [2 − K̃_1−(β)] K̃_1+(β) − λ̄π A− [2 − K̃_1+(β)] K̃_1−(β),  (2.19)

and

S_4^R = λ̄p A+ [3 − 2K̃_1−(β) + K̃_1+(β)K̃_1−(β)] K̃_1+(β) − λ̄π A− [3 − 2K̃_1+(β) + K̃_1+(β)K̃_1−(β)] K̃_1−(β).  (2.20)
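As noted earlier, for n± = 1 the exponential windows erase the spike history, so the resetting and nonresetting rules must coincide. A quick numerical check (Python; illustrative rates and τ+ = 16 ms; function names ours) confirms that equation 2.10, evaluated with K̃_l = (K̃_1)^l, agrees exactly with equation 2.19:

```python
def K1(beta, tau):
    # Equation 2.7 with n = 1 (exponential window).
    return 1.0 - 1.0 / (1.0 + tau * beta)

def s3_nonresetting(ktp, ktm, pbar, qbar, a_plus, a_minus):
    # Equation 2.10 with n = 1, where K~_2 = K~_1**2 and F1 = 1 - (K~1+ + K~1-).
    f1 = 1.0 - (ktp + ktm)
    return (qbar * a_plus * (ktp ** 2 + (1.0 + f1) * ktp)
            - pbar * a_minus * (ktm ** 2 + (1.0 + f1) * ktm))

def s3_resetting(ktp, ktm, pbar, qbar, a_plus, a_minus):
    # Equation 2.19.
    return (qbar * a_plus * (2.0 - ktm) * ktp
            - pbar * a_minus * (2.0 - ktp) * ktm)

lam_pre, lam_post, a_plus, a_minus = 0.05, 0.04, 1.0, 0.95
beta = lam_pre + lam_post
pbar, qbar = lam_pre / beta, lam_post / beta
ktp = pbar * K1(beta, 16.0)   # K~1+, illustrative tau+ = 16 ms
ktm = qbar * K1(beta, 20.0)   # K~1-, tau- = 20 ms
nr = s3_nonresetting(ktp, ktm, pbar, qbar, a_plus, a_minus)
r = s3_resetting(ktp, ktm, pbar, qbar, a_plus, a_minus)
assert abs(nr - r) < 1e-12   # resetting is irrelevant for n = 1
```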
Clearly, as n increases, Sn increases. In general, in both forms of the model, more spikes should induce greater synaptic change (ignoring issues such as saturation of synaptic strengths). It is therefore convenient to consider Sˆ n = Sn /n, and, in particular, to determine the asymptotic form as n → ∞. Although the expected divergence of Sn as n → ∞ may seem problematic, we will see that Sn possesses a fixed-point structure such that the afferents evolve to these fixed points and remain there. Hence, the divergence of Sn as n → ∞ is not, in practice, observed, and scaling Sn by 1/n to remove the divergent behavior amounts simply to rescaling time or the overall learning rate. A general analysis of n-spike trains is presented elsewhere (Appleby & Elliott, 2006). We show that by viewing the behavior of the switch as a sequence of transitions back to OFF, the underlying Markovian nature of the switch may be exploited to produce a generating function for any n-spike rule. The switch is Markovian in the sense that after any spike sequence that returns the switch to OFF, the future behavior of the switch is independent of how the switch arrived at that point. We may therefore divide any spike sequence into a chain of successive Markovian steps, each ending in a transition to OFF, that may be computed explicitly. Each Markovian step has an associated expected change in synaptic strength, T1 , calculated by
conditioning on the state of the synapse as each spike arrives, as well as an expected number of spikes that occurs during the step, N_1. For a long sequence of n spikes, we therefore expect on average n/N_1 Markovian steps, each inducing on average a change in synaptic strength of T_1. Hence, the total change, S_n, for large enough n, goes like nT_1/N_1, and so we have, asymptotically,

Ŝ_n = (1/n) S_n ∼ T_1/N_1.  (2.21)
By direct computation of T_1 and N_1 (Appleby & Elliott, 2006), we find that the nonresetting form of the ∞-spike rule is given by

Ŝ_∞^NR = [λ̄π A+ K_1+(λp) − λ̄p A− K_1−(λπ)] / [1 + (λ̄p/λ̄π) K_1−(λπ) + (λ̄π/λ̄p) K_1+(λp)],  (2.22)

and the resetting form is

Ŝ_∞^R = [λ̄p A+ (1 − K̃_1−(β)) K̃_1+(β) − λ̄π A− (1 − K̃_1+(β)) K̃_1−(β)] / [1 − K̃_1+(β) K̃_1−(β)].  (2.23)
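The behavior of equation 2.22 near λπ → 0 can be probed numerically. The Python sketch below (ours; n± = 1, illustrative τ+ = 16 ms) shows that the expression stays regular and tends smoothly to zero as the presynaptic rate vanishes, in line with the discussion that follows.

```python
def K1(lam, tau, n=1):
    # Equation 2.7 evaluated at a single rate (in the nonresetting
    # infinity-spike rule the K1± take rates, not beta, as arguments).
    return 1.0 - (1.0 + tau * lam) ** (-n)

def s_inf_nr(lam_pre, lam_post, a_plus=1.0, a_minus=0.95,
             tau_plus=16.0, tau_minus=20.0):
    # Equation 2.22, the nonresetting infinity-spike rule.
    beta = lam_pre + lam_post
    pbar, qbar = lam_pre / beta, lam_post / beta
    num = (pbar * a_plus * K1(lam_post, tau_plus)
           - qbar * a_minus * K1(lam_pre, tau_minus))
    den = (1.0 + (qbar / pbar) * K1(lam_pre, tau_minus)
               + (pbar / qbar) * K1(lam_post, tau_plus))
    return num / den

# K1-(lam_pre) vanishes like lam_pre, cancelling the 1/pbar in the
# denominator, so the lam_pre -> 0 limit is regular and tends to zero.
vals = [abs(s_inf_nr(eps, 0.05)) for eps in (1e-3, 1e-5, 1e-7)]
assert vals[0] < 0.01 and vals[1] < vals[0] and vals[2] < vals[1]
```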
Equations 2.22 and 2.23 describe the expected change in synaptic strength arising from two freely spiking pre- and postsynaptic neurons at firing rates λπ and λp, respectively. Note that the K_1± are functions of λπ and λp in the nonresetting, ∞-spike model rather than functions of β as elsewhere. There is no difficulty with the two limits λπ → 0 or λp → 0 in the expression for Ŝ_∞^NR since the functions K_1±(λ) go to zero at least as fast as λ. Starting from the two-spike rule in equation 2.8, we previously derived two constraints on the parameters A±, n± and τ± (Appleby & Elliott, 2005). In the limit of large λπ and λp in equation 2.8, we have that S2 ∝ A+ − A−. The sign of this expression indicates whether potentiation or depression of synaptic strengths is expected for high pre- and postsynaptic firing rates. Experimental work on long-term potentiation shows that high pre- and postsynaptic firing rates generally lead to potentiation (Sjöström, Turrigiano, & Nelson, 2001). This requires that S2 > 0 for large λπ and λp, so that A+ > A−. Second, a necessary condition for competitive dynamics is that a depressing regime exists, as otherwise synapses can never weaken on average. Because S2 = 0 at the origin, the requirement that ∂S2/∂λπ |_{λπ,λp=0} < 0 is sufficient to ensure the existence of a depressing regime. This leads to the second constraint,

γ = (A+ n+ τ+)/(A− n− τ−) < 1,  (2.24)
which we interpret as depression dominating over potentiation. Although necessary, the presence of a depressing regime is not, of course, sufficient to guarantee the presence of stable, competitive dynamics. The two-spike rule is qualitatively BCM-like for γ < 1, with a depressing phase at low presynaptic firing rates followed by a transition to a potentiating phase as a threshold is passed. Although we now consider multispike rules, we retain the definition of γ given by equation 2.24. Throughout this article, we set A+ = 1 and A− = 0.95, in accordance with the condition that A+ > A−, and choose n+ = n− ∈ {1, 3}, as in Appleby & Elliott (2005). Setting τ− = 20 ms and choosing γ therefore determines the remaining parameter τ+. We will consider the two-, three-, four-, and ∞-spike, n± = 1 cases in detail and then discuss how our observations extend to other n-spike rules, and for general n±.

3 The Two-Spike Rule and Beyond

We now examine the two-spike rule and show that it leads, without further modification, to pathological learning dynamics. We then move beyond the two-spike rule and perform an initial analysis of the multispike rules, indicating the reasons that their dynamics differ from the two-spike rule's dynamics.

3.1 Failure of the Two-Spike Rule. To study two-spike interactions, we consider a system of m afferents innervating a single target cell and explore their behaviors in the context of the averaged two-spike learning rule, S2. We label the m afferents with indices such as i and j, so that i, j ∈ {1, . . . , m}. Let afferent i support l_i synapses of strength s_iα, α ∈ {1, . . . , l_i}. Let afferent i fire at rate λπi and the target cell fire at rate λp. We then define

ds_iα/dt = S2(λπi, λp),  (3.1)
where we have noted the explicit dependence of S2 on the pre- and postsynaptic firing rates. Since S2 is independent of the synapse label α, all s_iα experience the same change at each time step: all of afferent i's synapses experience identical pre- and postsynaptic firing rates. Hence, we may consider the evolution of either the total synaptic strength supported by afferent i, s_i^T ≡ Σ_α s_iα, or the average synaptic strength, s_i^A ≡ (1/l_i) Σ_α s_iα, evolving according to ds_i^T/dt = l_i S2(λπi, λp) or ds_i^A/dt = S2(λπi, λp), respectively. We set the postsynaptic firing rate λp to be the standard linear sum of the presynaptic firing rates weighted by the synaptic strengths. Thus, if we consider the total strengths s_i^T, we set λp = Σ_i s_i^T λπi, while if we consider the average strengths s_i^A, we set λp = Σ_i l_i s_i^A λπi. Since we study competitive dynamics, we do not consider scenarios in which some afferents at least initially enjoy an advantage over other afferents, and in particular, we do not consider scenarios in which the numbers of synapses supported by a
2426
P. Appleby and T. Elliott
group of afferents differ significantly. Hence, for convenience we take li to be independent of i: all afferents support the same number of synapses. We may then dispense with the factors of li since they can be absorbed into a redefinition of time (for the total strength) or a redefinition of afferent rates, which is equivalent to a redefinition of time (for the average strength). We therefore consider one common quantity si, which can be thought of as either total or average synaptic strength, evolving according to the equation

dsi/dt = S2(λπi, λp), (3.2)

for which we set λp = Σi si λπi. As we consider only excitatory synapses, we insist that si ≥ 0. Hence, if equation 3.2 tries to drive si negative, we truncate the evolution and set si to zero. The synapse is not frozen there: it can regrow to nonzero strength. We may obtain a qualitative understanding of the dynamics induced by equation 3.2. Consider a scenario in which at least one afferent, say afferent i, has a large synaptic strength, si. The postsynaptic firing rate λp will thus typically be large, and so the variables βi = λπi + λp will typically be large. In this limit, K1±(βi) ≈ 1, so the two-spike learning rule reduces to dsj/dt ∝ A+ − A− > 0 for any afferent j. Thus, we see that once one afferent becomes strong, it induces all afferents to become strong, and this induces a positive feedback mechanism in which all afferents reinforce each other. So unless the first afferent that becomes strong is silenced with low λπi for a sustained period, so that λp does not become large, all afferents’ strengths will escape to infinity. Consider now a scenario in which all synaptic strengths si are small. Here we may write βi ≈ λπi since λp is small, and the two-spike rule becomes dsj/dt ∝ A+ K1+(λπj) − A− K1−(λπj). Writing this out according to the definitions of K1±, we have two terms: a negative term that goes like γ − 1 and a positive term that grows with λπj.
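The evolution prescribed by equation 3.2, with λp = Σi si λπi and truncation of each si at zero, can be sketched numerically. Only the integration scheme below follows the text; the rule function s2_stub is a hypothetical stand-in for S2 (the actual rule depends on the kernels K1± defined earlier in the paper), chosen merely to have the qualitative shape described here: depressing at low rates, potentiating at high rates.

```python
# Sketch of the rate-based dynamics of equation 3.2 with truncation at zero.
# s2_stub is a hypothetical placeholder for S2, NOT the paper's actual rule.

def s2_stub(lam_pre, lam_post, a_plus=1.0, a_minus=0.95):
    """Hypothetical stand-in for S2(lambda_pi, lambda_p)."""
    beta = lam_pre + lam_post            # beta = lambda_pi + lambda_p
    k = beta / (beta + 100.0)            # saturating factor; k -> 1 at large beta
    return a_plus * k - a_minus * (1.0 - 0.5 * k)

def evolve(s, lam_pre, dt=1e-3, steps=1000, rule=s2_stub):
    """Euler-integrate ds_i/dt = S2(lambda_pi_i, lambda_p), truncating at zero."""
    s = list(s)
    for _ in range(steps):
        # lambda_p = sum_i s_i * lambda_pi_i  (linear postsynaptic rate)
        lam_post = sum(si * li for si, li in zip(s, lam_pre))
        s = [si + dt * rule(li, lam_post) for si, li in zip(s, lam_pre)]
        # truncate at zero: synapses are excitatory, but not frozen there
        s = [max(si, 0.0) for si in s]
    return s

strengths = evolve([0.5, 0.4, 0.3, 0.2], [20.0, 25.0, 30.0, 35.0])
assert all(si >= 0.0 for si in strengths)
```

The truncation step realizes the constraint si ≥ 0 while still allowing a silenced synapse to regrow on later steps.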
However, only for presynaptic firing rates in excess of at least 100 Hz does ds j /dt become overall positive. Thus, if the si are all small, then dsi /dt < 0 for all but high, sustained firing rates, and so the si become even smaller. These weak afferents become trapped in the depressing phase and fall to zero strength. These simple limiting scenario arguments therefore suggest that the phase space for the two-spike rule is partitioned into two regimes, with all afferents either pulled toward zero on average or pushed toward infinity on average. Although these arguments are not rigorous proofs, their conclusion is confirmed by a full fixed-point analysis of the two-spike rule. We therefore see that despite having a BCM-like form, the learning behavior of the rate-based, two-spike rule is in fact pathological and always leads to the afferents’ either all dying or all escaping to infinity. To demonstrate that the spike-based system also exhibits the behavior characteristic of the rate-based system, in Figure 2 we show a spike-based simulation
Competition Under Multispike STDP
Figure 2: A spike-based simulation of the two-spike rule for four afferents innervating one target cell. The dynamics are partitioned into two distinct regimes, determined by the postsynaptic firing rate. (A) When initial synaptic strengths are large, the afferents drive the postsynaptic cell to a high firing rate, and runaway learning ensues. (B) When initial synaptic strengths are small, the postsynaptic firing rate is low, and all four afferents fall to zero strength.
of four afferents innervating a single target cell. Spike trains are truncated at two spikes: there is no interaction between successive pairs of spikes. We see that the spike-based simulation exhibits the same two regimes discussed above for the rate-based system. This behavior is, as argued above, independent of the number of afferents because the governing factor in this behavior is the postsynaptic firing rate, λ p , which is common to all afferents synapsing on the target cell. Once λ p begins to move toward or away from zero, uncontrolled learning ensues and the afferents either all die or all escape to infinity. The instability inherent in the two-spike learning rule, whether rate or spike based, thus shows that it is unable to support the stable, competitive dynamics characteristic of, for example, ODC formation. This situation appears rather unpromising. For both experimental and theoretical reasons, we require that a BCM-like form emerges from the two-spike learning rule on average, yet this requirement leads directly to these pathological learning behaviors. Without further modification designed specifically to prevent runaway learning, such as hard upper bounds on synaptic strengths, the rule will always lead to uncontrolled learning. One possible remedy is to allow the threshold between the potentiating and depressing regimes, which is a function of various easily modifiable parameters, to depend on the recent time average of postsynaptic firing in a manner similar to the BCM rule (Bienenstock et al., 1982). In effect, this couples potentiation and depression together, in the sense that the dependence of the plasticity threshold on the recent time average of postsynaptic firing allows the history of potentiation and depression to influence later plasticity events. The result of this coupling in the BCM rule is to stabilize the learning behavior and prevent runaway learning. 
We see a similar result when we modify the two-spike rule to incorporate a sliding threshold. In the BCM model, the sliding threshold M is explicitly set as a function of the recent time-averaged postsynaptic firing rate, λ¯ p . In our two-spike rule, the threshold is dynamically determined by the solution of A+ K 1+ (β) = A− K 1− (β).
(3.3)
With the six parameters A±, n±, and τ± held fixed, this gives a value for β. However, for a given value of β (and fixed parameters A±, n±, and τ−), we can instead regard this equation as determining a value of τ+. We denote this value of τ+ by τM = f(β), where f is the function that gives the solution of equation 3.3. Since it is unrealistic for the value of τ+ to depend on the instantaneous value of β, we instead make it depend on the recent time average, similar to the BCM rule, so that τM = f(β̄). Thus, for a given value of β̄, our preferred value of τ+ should be set so that τ+ = τM. Such a value would place the threshold at exactly the right location, putting some afferents in the depressing region and the others in the potentiating region.
Figure 3: The evolution of four afferents innervating a single target cell under the modified two-spike, rate-based rule, which allows the threshold between depression and potentiation to slide as a function of the time-averaged postsynaptic firing rate. The four solid lines show the strengths of the four afferents, and the dashed line shows the value of γ . The introduction of the sliding threshold has stabilized the dynamics and generated competition.
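The sliding threshold illustrated in Figure 3 amounts to letting τ+ relax toward a moving target value set by the recent time-averaged activity. The sketch below shows such a first-order relaxation; the target function tau_m and the rate constant eps are hypothetical placeholders for illustration, not the paper's calibrated quantities.

```python
# Toy illustration of a sliding threshold: tau_plus relaxes toward a moving
# target tau_M(beta_bar) with a small rate constant eps.  The target function
# used here is a hypothetical placeholder; in the model tau_M is defined
# implicitly by the threshold condition A+ K1+(beta) = A- K1-(beta).

def relax_tau_plus(tau_plus, tau_m_of_beta, beta_bar_trace, eps=0.05, dt=1.0):
    """Euler steps of d(tau_plus)/dt = eps * (tau_M(beta_bar) - tau_plus)."""
    history = []
    for beta_bar in beta_bar_trace:
        target = tau_m_of_beta(beta_bar)
        tau_plus += dt * eps * (target - tau_plus)
        history.append(tau_plus)
    return history

# Placeholder target: tau_M shrinks as the time-averaged activity grows.
tau_m = lambda beta_bar: 30.0 / (1.0 + 0.01 * beta_bar)

trace = relax_tau_plus(13.0, tau_m, [50.0] * 400)
# After many steps tau_plus sits near tau_M(50) = 20.0.
assert abs(trace[-1] - tau_m(50.0)) < 0.1
```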
Thus, if we make τ+ dynamically evolve toward τM(β̄), we will achieve a sliding threshold sensitive to the recent time average of postsynaptic firing, λ̄p, through β̄ = λ̄π + λ̄p. Setting

dτ+/dt = ε[τM(β̄) − τ+] (3.4)

represents one simple way in which to realize such sliding, where ε is some small, inverse time constant for the relaxation of τ+ to the (changing) value of τM(β̄). Implementing this sliding threshold in the two-spike rule, we replicate the dynamics of the BCM rule. Synaptic strengths are therefore stabilized, and uncontrolled learning prevented. The behavior of a set of four afferents operating under this modified, rate-based two-spike rule is shown in Figure 3. We have chosen initial conditions so that all synapses would grow without bound under the unmodified two-spike rule. For the modified rule, we see that γ therefore decreases very rapidly (within the first few thousand time steps), moving all synapses
into the depressing regime. As they depress, γ slowly increases. Eventually three of the synapses hit zero strength. The surviving synapse initially continues in the downward direction, but γ increases rapidly, moving the synapse into the potentiating regime. The synapse grows and stabilizes, and γ remains approximately constant. We therefore see, as expected, that the introduction of a sliding threshold stabilizes the learning behavior.

3.2 Beyond Two Spikes. Although coupling potentiation and depression in the manner described above for the modified two-spike rule stabilizes the learning dynamics exhibited by it, doing so forces us to make assumptions concerning the exact dependence of our model’s parameters on the recent firing history. To avoid such somewhat ad hoc complications, we seek to determine whether higher-order spike interactions can instead exhibit the required stable, competitive dynamics. One reason that higher-order spike interactions might achieve this is that they automatically provide the coupling between potentiation and depression that leads to stable behavior under the modified two-spike rule. This coupling takes place in the sense that once the synapse enters the POT (DEP) state, it cannot subsequently enter the DEP (POT) state without having first returned to the OFF state. That is, once a synapse has entered the potentiating mode, that synapse is prevented from entering the depressing mode without having first been deactivated to the resting state. For this coupling to be expressed, however, we require a minimum of three spikes. Consider, for example, the two-spike train πp. Once the first presynaptic spike has elevated the synapse into the POT state, the paucity of spikes prevents the synapse from subsequently visiting the opposite half of the switch and undergoing depression. There is therefore never any coupling under two-spike trains. This is not the case for higher-order spike trains. Consider, for example, the spike train πpπ.
As before, the first presynaptic spike elevates the synapse into the POT state. If a transition back to OFF does not occur before the p spike arrives, then a potentiation event occurs as usual, and the last π event is of no importance. If, however, a transition back to OFF occurs before the second p spike arrives, then that p spike will cause the synapse to move to the DEP state, enabling the synapse possibly to undergo a depression event (assuming the final π spike arrives in good time). A similar argument applies for any number of spikes greater than two. Hence, if the switch is unified, all multispike rules couple potentiation and depression together, but the two-spike rule does not. Given that the coupling of potentiation and depression endows the BCM rule with a structure that supports stable, competitive dynamics, it is natural to ask whether the presence of this coupling in the multispike rules alters the learning dynamics compared to the two-spike rule. In studying the dynamics of the multispike interactions, we again consider a system of m afferents innervating a single target cell and explore
Figure 4: A spike-based simulation of the nonresetting three-spike rule for four afferents innervating one target cell. The rule is competitive, with stable segregation of the afferents being observed. This is in contrast to the two-spike rule, which requires explicit modification to achieve stable, competitive dynamics.
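The three-state switch logic described in this section can be illustrated with a minimal state machine. This is a deterministic toy: the stochastic decays back to OFF are replaced by explicit "decay" events, and the resetting variant (return to OFF after a plasticity event) is assumed.

```python
# Toy state machine for the unified three-state switch (OFF/POT/DEP):
# a presynaptic spike ("pi") sends OFF -> POT, a postsynaptic spike ("p")
# sends OFF -> DEP, a "p" while in POT triggers a potentiation event, and a
# "pi" while in DEP triggers a depression event.  Stochastic decay to OFF is
# modelled as an explicit "decay" event for clarity; this is an illustrative
# sketch, not the paper's full stochastic implementation.

def run_switch(events):
    """Return the list of plasticity events produced by an event sequence."""
    state, out = "OFF", []
    for ev in events:
        if ev == "decay":
            state = "OFF"
        elif ev == "pi":
            if state == "DEP":
                out.append("depression")
                state = "OFF"          # resetting variant: return to OFF
            elif state == "OFF":
                state = "POT"
        elif ev == "p":
            if state == "POT":
                out.append("potentiation")
                state = "OFF"          # resetting variant
            elif state == "OFF":
                state = "DEP"
    return out

# The train pi-p-pi potentiates if the switch stays in POT...
assert run_switch(["pi", "p", "pi"]) == ["potentiation"]
# ...but if the switch decays back to OFF first, the same train can depress:
assert run_switch(["pi", "decay", "p", "pi"]) == ["depression"]
# Two-spike trains can never express this coupling:
assert run_switch(["pi", "p"]) == ["potentiation"]
assert run_switch(["pi", "decay", "p"]) == []
```

The last two cases make the text's point concrete: with only two spikes, once the synapse has visited the POT lobe there are no spikes left with which to probe the DEP lobe.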
their behavior in the context of the averaged, n-spike learning rules. The same arguments that led to equation 3.2 now lead to

dsi/dt = Sn(λπi, λp)
(3.5)
as the synaptic strength modification equation corresponding to the n-spike rule. We continue to truncate si at zero if it is driven negative. A numerical exploration of the three-spike, rate-based rule (either resetting or nonresetting) shows that, indeed, its learning dynamics differ significantly from those of the two-spike rule. Under the three-spike rule, the uncontrolled learning behavior seen for the (unmodified) two-spike rule is absent for a broad range of parameters. Afferents compete for control of the target cell, with stable segregation robustly occurring. Figure 4 confirms that these observations are also true for the spike-based version of the three-spike rule. Stable, segregated fixed points exist under the three-spike rule, leading to the same competitive dynamics exhibited by the BCM rule. Thus, the ability of the three-spike rule to probe the coupling of potentiation and depression under a unified switch, however modest,
dramatically alters the dynamical landscape compared to that of the two-spike rule, which cannot probe coupling. We observe a similar result for any multispike rule. Attempts to modify the two-spike rule, such as introducing a sliding threshold, while adequate in terms of stabilizing the learning dynamics, are therefore unnecessary: all that is required is to extend our consideration of spike interactions to three or more spikes with no ad hoc modifications of the learning rules. The success of the multispike rules depends critically on the unification of the switch mechanism and thus the coupling of potentiation and depression. We can see this explicitly by examining the multispike learning rules for two ununified switches, so that we consider the potentiating lobe in Figure 1A and the depressing lobe in Figure 1B separately. We can reduce the unified switch to two separate potentiating and depressing switches by setting τ− = 0 and τ+ = 0, respectively. Setting τ− = 0, for example, means that the depressing lobe of the unified switch is effectively unavailable, because any transition to the DEP state results in an instantaneous stochastic decay back to OFF with no depression. Similarly, with τ+ = 0, the potentiating lobe is unavailable. Adding the two rules derived by separately setting τ+ = 0 and τ− = 0 gives the plasticity rule corresponding to the operation of two ununified, separate switches. Consider, for example, the resetting ∞-spike rule given in equation 2.23. Setting τ− = 0 gives

Ŝ∞^POT = λp A+ K̃1+(β),
(3.6)
and setting τ+ = 0 gives

Ŝ∞^DEP = −λπ A− K̃1−(β),
(3.7)
so that the overall rule is

Ŝ∞^POT + Ŝ∞^DEP = λp A+ K̃1+(β) − λπ A− K̃1−(β) ≡ S2.
(3.8)
The resetting ∞-spike rule therefore reduces to the two-spike rule when the switch is split into two halves. Repeating this manipulation for any resetting multispike rule produces the same result, so that Sn^POT + Sn^DEP = S2 ∀n. Thus, the unification of the separate switches, which was proposed initially on the grounds of simplicity, has unexpected consequences for the dynamics of the model. Examining the nonresetting ∞-spike rule under an identical manipulation, we find that the potentiating half (when τ− = 0) gives

Ŝ∞^POT = λπ A+ K1+(λp) / [1 + (λπ/λp) K1+(λp)],
(3.9)
and the depressing half (when τ+ = 0) gives

Ŝ∞^DEP = −λp A− K1−(λπ) / [1 + (λp/λπ) K1−(λπ)],
(3.10)
so that the overall rule is

Ŝ∞^POT + Ŝ∞^DEP = λπ A+ K1+(λp) / [1 + (λπ/λp) K1+(λp)] − λp A− K1−(λπ) / [1 + (λp/λπ) K1−(λπ)].
(3.11)
Although this nonresetting rule has not reduced to the two-spike rule, a similar analysis of the cases of large and small si (or large and small λp) as performed above for the two-spike rule reveals identical conclusions, so that afferents either all escape to infinity or all die at zero. We therefore conclude that the presence of higher-order spike interactions under a unified three-state synaptic switch differentiates the two-spike and multispike rules by allowing a probing of the coupling between potentiation and depression in the unified model. These higher-order interactions are responsible for giving rise to the stable, competitive dynamics that we observe for the multispike rules. Examining the large β limit of the (unified) multispike rules reveals that their asymptotic behavior differs significantly from that of the two-spike rule. For large β, Kl±(β) → 1, so K̃l+(β) → λπ^l and K̃l−(β) → λp^l. Because λπ + λp = 1, we introduce a new variable, x ∈ [−1, +1], such that

λπ = (1 − x)/2, (3.12)
λp = (1 + x)/2. (3.13)
It is easy to see that x is the tangent of the angle between the vector (λπ , λ p )T , the superscript T denoting the transpose, and the line λπ = λ p . If θ is the standard angle in a polar coordinate system (λπ = r cos θ , λ p = r sin θ ), then x is just x = tan(θ − π/4).
(3.14)
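The relation between x and the polar angle θ can be checked numerically from equations 3.12 to 3.14:

```python
import math

# Check of equation 3.14: with lambda_pi = (1 - x)/2 and lambda_p = (1 + x)/2
# (equations 3.12 and 3.13), the polar angle theta = atan2(lambda_p, lambda_pi)
# satisfies x = tan(theta - pi/4).

for x in [-0.9, -0.5, 0.0, 0.3, 0.8]:
    lam_pi = 0.5 * (1.0 - x)
    lam_p = 0.5 * (1.0 + x)
    theta = math.atan2(lam_p, lam_pi)
    assert abs(math.tan(theta - math.pi / 4.0) - x) < 1e-12
```

The identity follows from the tangent subtraction formula, since tan θ = λp/λπ = (1 + x)/(1 − x).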
In this large β limit, we rewrite the two- and multispike rules in terms of the variable x and find that

S2 → (1/4)(1 − x²)(A+ − A−),
(3.15)
S3 → (1/8)(1 − x²)[A+(3 − x) − A−(3 + x)], (3.16)

S4 → (1/16)(1 − x²)[A+(9 − 4x − x²) − A−(9 + 4x − x²)], (3.17)

Ŝ∞ → (1/2)[(1 − x²)/(3 + x²)][A+(1 − x) − A−(1 + x)]. (3.18)
Because the β → ∞ limits of the resetting and nonresetting models are identical, the results above are independent of the form of the model, although the limit of the ∞-spike resetting rule is much easier to extract. We see that while the two-spike rule, S2, is symmetrical about x = 0, and thus symmetrical about the line λπ = λp, the other rules exhibit an asymmetry about x = 0 due to the presence of odd powers of x. This property is true for all the multispike rules, not just S3, S4, and Ŝ∞ given above. For x² < 1, the right-hand side of equation 3.15 is strictly positive, because A+ > A−. Hence, the two-spike rule always potentiates in the large β limit, as we saw earlier. This behavior is not the case for the multispike rules. Consider the cases x ≈ +1 and x ≈ −1 in the above. Then we find
S3 ∝ A+ − 2A− for x ≈ +1, 2A+ − A− > 0 for x ≈ −1, (3.19)

S4 ∝ A+ − 3A− for x ≈ +1, 3A+ − A− > 0 for x ≈ −1, (3.20)

Ŝ∞ ∝ −A− < 0 for x ≈ +1, +A+ > 0 for x ≈ −1. (3.21)
Indeed, by using the general form of the (n + 1)-spike rule (Appleby & Elliott, 2006), we find

Sn+1 ∝ A+ − nA− for x ≈ +1, nA+ − A− > 0 for x ≈ −1. (3.22)
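Using the paper's values A+ = 1 and A− = 0.95, the sign pattern of equations 3.19 and 3.20 can be verified directly from the closed-form limits 3.16 and 3.17:

```python
# Check the signs in equations 3.19 and 3.20 against the closed-form limits
# in equations 3.16 and 3.17, using the paper's values A+ = 1, A- = 0.95.

A_plus, A_minus = 1.0, 0.95

def s3_limit(x):
    return (1.0 - x * x) * (A_plus * (3.0 - x) - A_minus * (3.0 + x)) / 8.0

def s4_limit(x):
    return (1.0 - x * x) * (
        A_plus * (9.0 - 4.0 * x - x * x) - A_minus * (9.0 + 4.0 * x - x * x)
    ) / 16.0

# Near x = +1 the sign follows A+ - 2A- (for S3) and A+ - 3A- (for S4):
assert s3_limit(0.99) < 0 and (A_plus - 2.0 * A_minus) < 0
assert s4_limit(0.99) < 0 and (A_plus - 3.0 * A_minus) < 0
# Near x = -1 both rules potentiate: 2A+ - A- > 0 and 3A+ - A- > 0.
assert s3_limit(-0.99) > 0 and s4_limit(-0.99) > 0
```

With these parameter values A+ < 2A−, so the three-spike rule already depresses in the x ≈ +1 direction, as the following discussion requires.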
Figure 5: A contour plot of S2 in the λπ–λp plane. Black areas represent minimum values and white areas maximum values, and nine shades of gray interpolate between these extremes. The minimum value on this partial plane is −0.0244, and the maximum value is +0.0118.

The multispike rules therefore always potentiate in the x ≈ −1 or large λπ, small λp direction. For the three-spike rule, if A+ < 2A−, then it, and all higher-order rules, depress in the x ≈ +1 or small λπ, large λp direction. If, however, A+ > 2A−, then the three-spike rule potentiates in all directions and will exhibit runaway learning just like the two-spike rule. Although it may be the case that A+ > 2A−, it may not be the case that A+ > 3A−. Here, although the three-spike rule fails, the four-spike rule will depress in the x ≈ +1 direction. Indeed, by looking at the behavior of the general (n + 1)-spike rule in equation 3.22, we see that provided A− ≠ 0, there always exists a value of n above which the multispike rules will start to depress in the x ≈ +1 direction. In particular, for n > A+/A−, the (n + 1)-spike rule will always potentiate in the x ≈ −1 direction and depress in the x ≈ +1 direction. With such “mixed” dynamics at large β, large λp will not induce runaway learning. Since A+/A− > 1, the two-spike rule can never achieve this. Hence, the two-spike rule is irredeemably pathological in its learning behavior due to a symmetry that is absent in all the multispike learning rules, and although the multispike rules can fail in the same way as the two-spike rule, this is parameter dependent (unlike the two-spike rule), and we are guaranteed (for A− ≠ 0) that there exists an n above which the n-spike rules will not fail. Plotting the Sn in the λπ–λp plane illustrates these results. For the two-spike rule (see Figure 5), we see a depressing well around the origin and a potentiating regime away from it. The two-spike surface is symmetrical about the line λπ = λp. Depression therefore always occurs at low β, and potentiation always occurs at high β. For the nonresetting forms of the three- (see Figure 6), four- (see Figure 7), and ∞-spike (see Figure 8) rules,
Figure 6: A contour plot of S3^NR for the nonresetting model in the λπ–λp plane. The minimum value on this partial plane is −0.0826, and the maximum value is +0.1007.
the symmetry about the line λπ = λp is absent, and it is possible to induce either potentiation or depression at high β depending on the value of θ (or x), for the parameter choice used here. Thus, the values of λπ and λp together determine whether potentiation or depression occurs.

4 Fixed-Point Analysis of a Rate-Based Rule

Having derived various multispike, rate-based rules from our switch model and performed an initial study of the differences between the two- and multispike rules, we may proceed to develop a deeper analytical understanding of the learning dynamics exhibited by the multispike rules. In particular, we continue to study equation 3.5 by performing a fixed-point analysis of the ∞-spike learning rule. We assume that all m afferents have the same mean firing rate, µ > 0, and variance, σ² > 0. The firing rate of each afferent therefore fluctuates about a common mean. The fluctuations distinguish the afferents (unless they are perfectly correlated), while preventing any afferent from enjoying an overall advantage. We set λπi = µ(1 + αi), where αi is some small perturbation about the mean, |αi| ≪ 1, and take the mean of αi to be zero, ⟨αi⟩ = 0, so
Figure 7: A contour plot of S4^NR for the nonresetting model in the λπ–λp plane. The minimum value on this partial plane is −0.1686, and the maximum value is +0.1926.
that ⟨λπi⟩ = µ, as required. As we will average over the ensemble of activity patterns, we must obtain an expression for ⟨αi αj⟩. The variance in the activity of afferent i is

σ² = ⟨λπi²⟩ − ⟨λπi⟩² = µ²⟨αi²⟩,
(4.1)
so that ⟨αi²⟩ = σ̂², where σ̂ = σ/µ, with σ̂² ≪ 1. Assuming for convenience that the afferents’ activities are uncorrelated, so that their covariance Cov(αi, αj) = 0 for i ≠ j, we then have

⟨αi αj⟩ = σ̂² δij, (4.2)
where δij is the Kronecker delta, equal to one if i = j and zero otherwise. Defining the vectors s = (s1, . . . , sm)T and α = (α1, . . . , αm)T, we then have

⟨α · s⟩ = 0, (4.3)
⟨αi (α · s)⟩ = σ̂² si, (4.4)
⟨(α · s)²⟩ = σ̂² |s|². (4.5)
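The averages in equations 4.2 to 4.5 can be checked exactly on a simple discrete ensemble in which each αi independently takes the values ±σ̂ with equal probability; the strength vector s below is an arbitrary example.

```python
from itertools import product

# Exact check of equations 4.2-4.5 on a discrete ensemble: each alpha_i takes
# the values +-sigma_hat independently, so <alpha_i alpha_j> equals
# sigma_hat^2 delta_ij exactly over the full set of sign patterns.

sigma_hat = 0.1
s = [0.7, 0.2, 0.5, 0.1]          # arbitrary example strength vector
m = len(s)

ensemble = [[sigma_hat * e for e in signs]
            for signs in product((-1, 1), repeat=m)]
avg = lambda values: sum(values) / len(values)
dot = lambda a, b: sum(x * y for x, y in zip(a, b))

# <alpha . s> = 0                                      (equation 4.3)
assert abs(avg([dot(a, s) for a in ensemble])) < 1e-12
# <alpha_i (alpha . s)> = sigma_hat^2 s_i              (equation 4.4)
for i in range(m):
    lhs = avg([a[i] * dot(a, s) for a in ensemble])
    assert abs(lhs - sigma_hat**2 * s[i]) < 1e-12
# <(alpha . s)^2> = sigma_hat^2 |s|^2                  (equation 4.5)
lhs = avg([dot(a, s) ** 2 for a in ensemble])
assert abs(lhs - sigma_hat**2 * dot(s, s)) < 1e-12
```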
Figure 8: A contour plot of Ŝ∞^NR for the nonresetting model in the λπ–λp plane. The minimum value on this partial plane is −0.1010, and the maximum value is +0.1036.
As the perturbations are small, σ̂² ≪ 1, we may expand any n-spike rule in αi and then average over the ensemble of afferent activity patterns using the three equations above. This expansion must be to second order in αi, as the mean of αi is zero. We thus arrive at a set of equations describing the evolution of afferents governed by the n-spike switch rule when the activities of the afferents fluctuate about some common mean firing rate. We may then extract the fixed-point structure that characterizes the dynamics of this system. Because si ≥ 0, any fixed points must lie in the nonnegative hyperquadrant of the vector space defined by s in order to be accessible to the afferents. We now explore the fixed-point structure of a simplified form of the ∞-spike rule, the simplification merely allowing a less messy analytical characterization of the locations and stabilities of the various fixed points. We then examine the full, unsimplified rule. Although analytical results can still be obtained for the full rule, they are messy, cumbersome, and rather opaque, so we do not reproduce them here. Nevertheless, both the simplified and full forms of the model exhibit qualitatively identical dynamics.
4.1 Simplified ∞-Spike Rule. We consider a simplified form of the full ∞-spike nonresetting rule given by equation 2.22 that excludes the denominator. This exclusion simplifies the expansion, making the resulting expressions more transparent. The price for this transparency is that the location and stability of any fixed point in this simplified model will be slightly different from that of the full model. To zeroth order in σ̂², the simplified model and full model are, however, identical in their fixed-point structure. For reasons of analytical tractability, we set n± = 1. The simplified ∞-spike nonresetting rule is then

Ŝ∞^Sim(λπ, λp) = λπ A+ K1+(λp) − λp A− K1−(λπ),
(4.6)
and we set

dsi/dt = Ŝ∞^Sim(λπi, λp).
(4.7)
We denote the sum of the synaptic weights as s+ = Σi si, and we define t+ = 1 + s+. Expanding the simplified rule to second order in αi and averaging over the ensemble of afferent activity patterns as set out above yields the averaged form of the rule. Dropping the angle brackets around the si for notational convenience, after lengthy algebra we obtain

[1/(A− τ− µ)] dsi/dt = (s+/t+) N0 + σ̂² N1i,
(4.8)
where

N0 = γ/(1 + µτ+ s+) − 1/(1 + µτ−), (4.9)

and

N1i = [γ/(1 + µτ+ s+)] Xi − [1/(1 + µτ−)] Yi, (4.10)
where

Xi = [1/(1 + µτ+ s+)][si − s+ (si + |s|²)/t+ + s+ (1 + 2si + |s|²)/t+²] − [µτ+ |s|²/(1 + µτ+ s+)²](1 + si)/t+, (4.11)
and

Yi = [1/(1 + µτ−)][si − s+ (1 + si)/t+ + s+ (1 + 2si + |s|²)/t+²] − [µτ− s+/(1 + µτ−)²](si + |s|²)/t+. (4.12)
At a fixed point, we require that ds/dt = 0. Solving this equation exactly for the location of all the fixed points is usually difficult, if not impossible, so we proceed by finding an approximation to zeroth order in σˆ 2 , for which all the fixed points can be located, and then calculate first-order corrections in σˆ 2 . We therefore write a fixed point as s FP = x + σˆ 2 y,
(4.13)
where x is the zeroth-order approximation to the location of the fixed point, sFP, and y is the first-order correction. We define x+ = Σi xi and y+ = Σi yi, so that s+ = x+ + σ̂² y+.

4.1.1 Zeroth-Order Solutions and Behavior. To zeroth order in σ̂², equation 4.8 becomes

dsi/dt = A− τ− µ (s+/t+)[γ/(1 + µτ+ s+) − 1/(1 + µτ−)], (4.14)
which depends only on s+ and not the individual components si. By inspection, we see that there are two fixed hyperplanes. One of these corresponds to the hyperplane x+ = 0. All points except x = 0 in this hyperplane have at least one negative component of x, and all of these points are therefore forbidden. Hence, the only fixed point on this hyperplane of fixed points accessible to the afferent weight vector is the origin. We refer to this point throughout as the zero fixed point. The other fixed hyperplane arises from the solution of

γ/(1 + µτ+ x+) = 1/(1 + µτ−),
(4.15)
or

x+ = [γ(1 + µτ−) − 1]/(µτ+).
(4.16)
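As a quick check, the expression for x+ in equation 4.16 solves the fixed-point condition of equation 4.15 identically; the parameter values below are illustrative only.

```python
# Check that x_plus from equation 4.16 solves the fixed-point condition of
# equation 4.15: gamma / (1 + mu tau_plus x_plus) = 1 / (1 + mu tau_minus).
# Parameter values here are illustrative, not the paper's calibrated ones.

mu, tau_plus, tau_minus, gamma = 0.05, 13.0, 20.0, 0.9

x_plus = (gamma * (1.0 + mu * tau_minus) - 1.0) / (mu * tau_plus)
lhs = gamma / (1.0 + mu * tau_plus * x_plus)
rhs = 1.0 / (1.0 + mu * tau_minus)

assert x_plus > 0.0            # here gamma exceeds gamma_0 = 0.5
assert abs(lhs - rhs) < 1e-12  # fixed-point condition holds
```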
So that at least some points in the hyperplane s+ = x+ have all nonnegative components, we require x+ > 0, or

γ > γ0 ≡ 1/(1 + µτ−).
(4.17)
We refer to the hyperplane s+ = x+ > 0 as the nonzero fixed hyperplane. If γ = γ0, then the nonzero hyperplane becomes coincident with the hyperplane s+ = 0, and the only permitted fixed point that exists for γ = γ0 is the origin, x = 0. To determine the stability of these fixed points, we examine the behavior of the system under small perturbations about them. We denote a perturbation by δ = (δ1, . . . , δm)T and write s = sFP + δ. We define δ+ = Σi δi. Expanding and linearizing equation 4.14 about the zero fixed point, sFP = 0, we have

dδi/dt = A− τ− µ(γ − γ0)δ+.
(4.18)
The m eigenvalues of the associated matrix characterizing these linearized dynamics are therefore easily seen to be λ1 = m A− τ− µ(γ − γ0 ),
(4.19)
and

λi = 0, ∀ i > 1. (4.20)
The m − 1 repeated zero eigenvalues indicate that there is no flow along the associated eigenvectors. The flow toward or away from the origin will therefore occur only parallel to the eigenvector associated with λ1, which is (1, . . . , 1)T. The stability of the zero fixed point is determined by the sign of λ1. For γ < γ0, the origin is stable, while for γ > γ0, it is unstable. Note that when γ > γ0, the nonzero fixed hyperplane s+ = x+ > 0 intersects the positive hyperquadrant, so the origin becomes unstable precisely when this hyperplane moves into the positive hyperquadrant. Now expanding and linearizing equation 4.14 about any point in the nonzero fixed hyperplane, we obtain

dδi/dt = −A− τ− τ+ µ² [x+/(1 + x+)] γ⁻¹ [1/(1 + µτ−)²] δ+,
(4.21)
with associated eigenvalues

λ1 = −m A− τ− τ+ µ² [x+/(1 + x+)] γ⁻¹ [1/(1 + µτ−)²],
(4.22)
and λi = 0, ∀ i > 1.
(4.23)
The flow toward the nonzero fixed hyperplane is parallel to (1, . . . , 1)T. As for the zero fixed point, the sign of λ1 determines the stability of the nonzero fixed hyperplane. For x+ > 0, that is, when the nonzero fixed hyperplane intersects the positive hyperquadrant, the nonzero fixed hyperplane is stable. Note that x+ > 0 requires γ > γ0, so when the nonzero fixed hyperplane is stable, the origin is unstable. The zeroth-order dynamics are therefore uniquely determined by the sign of the quantity γ − γ0. We briefly summarize the dynamics in each of the two possible regimes. When γ < γ0, a single fixed point sFP = 0 is permitted for the afferent weight vector. This point is stable, and the afferent weight vector will initially flow toward the hyperplane s+ = 0 parallel to the vector (1, . . . , 1)T. When it hits a hyperplane defined by si = 0, for some i, it is prevented from crossing it because si is always truncated at zero. The weight vector therefore remains in this si = 0 hyperplane and flows toward the origin. It may hit other sj = 0, j ≠ i, hyperplanes as it further evolves and again will be constrained to remain in them. The weight vector will therefore always arrive at the origin, regardless of the initial conditions. When γ > γ0, the origin is an unstable fixed point, and there exists a hyperplane of stable, nonzero fixed points in the positive hyperquadrant. The afferent weight vector initially flows parallel to the vector (1, . . . , 1)T toward the hyperplane s+ = x+ > 0, from either above or below it. For some initial conditions, the weight vector will directly hit the nonzero hyperplane and stop evolving. For other initial conditions, the weight vector will hit a hyperplane si = 0 for some i first, and then flow in this hyperplane until it reaches the intersection with the s+ = x+ > 0 hyperplane, and then stop evolving. Regardless of the initial conditions, the weight vector will always arrive at some point on the nonzero fixed hyperplane.
Of course, overall these dynamics are rather uninteresting, precisely because the zeroth-order solutions do not discriminate between afferents, since all afferents fire with a common rate µ. Nevertheless, the zeroth-order solutions are the foundation on which the first-order corrections are determined, and this is why we have labored the analysis of the zeroth-order case somewhat. The first-order corrections do permit a discrimination between afferents based on the fluctuations in their firing rates, and so we
Competition Under Multispike STDP
2443
expect to find a more compelling set of dynamics at first order. We now turn to this case.

4.1.2 First-Order Corrections and Behavior. We now examine the full form of equation 4.8, including the first-order corrections. For the first-order dynamics, we find that several different classes of fixed point exist. By simple inspection of the terms in N1i, the first-order correction in equation 4.8, we see that the system still possesses a fixed point at s = 0, and thus there are obviously no first-order corrections to its location. However, there are corrections to the eigenvalues of the stability matrix. Expanding and linearizing equation 4.8 about the zero fixed point as usual, we now find

\[ \frac{d\delta_i}{dt} = A_-\tau_-\mu\left\{\left[\gamma - \frac{1}{1+\mu\tau_-} - \hat\sigma^2\frac{\mu^2\tau_-^2}{(1+\mu\tau_-)^3}\right]\delta_+ + \hat\sigma^2\frac{\mu\tau_-}{(1+\mu\tau_-)^2}\,\delta_i\right\}, \tag{4.24} \]

with associated eigenvalues

\[ \lambda_1 = m A_-\tau_-\mu\left[\gamma - \frac{1}{1+\mu\tau_-} - \hat\sigma^2\frac{\mu\tau_-}{(1+\mu\tau_-)^2}\left(\frac{\mu\tau_-}{1+\mu\tau_-} - \frac{1}{m}\right)\right] \tag{4.25} \]

and

\[ \lambda_i = A_-\tau_-\mu\,\hat\sigma^2\frac{\mu\tau_-}{(1+\mu\tau_-)^2}, \quad \forall\, i > 1. \tag{4.26} \]

The eigenvalues λi, i > 1, are always positive, so the first-order corrections have made the zero fixed point always unstable. The precise classification of the zero fixed point depends on the sign of λ1. When λ1 < 0, the origin is a saddle node, and when λ1 > 0, it is a repeller. The transition of the zero fixed point from a saddle to a repeller occurs at the value of γ given by

\[ \gamma = \gamma_2 \equiv \frac{1}{1+\mu\tau_-}\left[1 + \hat\sigma^2\frac{\mu\tau_-}{1+\mu\tau_-}\left(\frac{\mu\tau_-}{1+\mu\tau_-} - \frac{1}{m}\right)\right]. \tag{4.27} \]
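Equations 4.25 to 4.27 can be checked numerically: λ1 vanishes exactly at γ = γ2 and changes sign there, while the remaining eigenvalues stay positive. The parameter values below are illustrative assumptions, not taken from the paper.

```python
# Numerical check of the zero-fixed-point eigenvalues (eqs. 4.25-4.27).
# Parameter values are assumed, for illustration only.
A_minus, tau_minus, mu, sigma2, m = 1.0, 20.0, 0.05, 0.01, 5
u = mu * tau_minus                       # shorthand for mu * tau_-

lam1 = lambda g: m * A_minus * tau_minus * mu * (
    g - 1 / (1 + u) - sigma2 * (u / (1 + u) ** 2) * (u / (1 + u) - 1 / m))
lam_rest = A_minus * tau_minus * mu * sigma2 * u / (1 + u) ** 2   # eq. 4.26
gamma2 = (1 / (1 + u)) * (1 + sigma2 * (u / (1 + u)) * (u / (1 + u) - 1 / m))

assert lam_rest > 0                      # zero fixed point is never fully stable
assert abs(lam1(gamma2)) < 1e-9          # lambda_1 vanishes at gamma_2
assert lam1(gamma2 - 0.01) < 0 < lam1(gamma2 + 0.01)   # saddle -> repeller
```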
Unlike the zero fixed point, the first-order corrections do affect the location of the nonzero fixed points. Indeed, the corrections destroy the entire nonzero hyperplane of fixed points, leaving one real fixed point and m quasi-fixed points, together with a set of other fixed points that are essentially uninteresting because their existence merely renders the global fixed-point structure consistent. Unless s = 0, the first-order corrections in the expression for N1i break the symmetry between the afferents that is present at zeroth order. It is precisely this symmetry that endows the zeroth-order
2444
P. Appleby and T. Elliott
system with an entire hyperplane of fixed points. At first order, this symmetry is absent, and the hyperplane collapses into a set of isolated quasi-fixed points and real fixed points. We first define what we mean by quasi-fixed points and examine their stability, then examine the other fixed points. In models of synaptic competition that possess a fixed-point structure, there are usually fixed points in which all but one si are zero. Such fixed points correspond to segregated states, since at each fixed point, only one afferent innervates the target cell. With m afferents, there are m such segregated fixed points. Of course, the defining condition of a fixed point is that ds/dt = 0 when the derivatives are evaluated at the fixed point. Suppose, however, that we have a point for which si ≠ 0 and sj = 0, ∀ j ≠ i, for which dsi/dt = 0 and dsj/dt < 0, ∀ j ≠ i, when the derivatives are evaluated at this point. Such a point is not strictly a fixed point. However, if a model's dynamics include truncation of sj as it tries to pass through zero into a region of negative sj, sj will be returned to zero. Such a point would therefore appear to be a fixed point, since the dynamics could evolve the weight vector to this point and then would remain there. We refer to such points as quasi-fixed points. If dsj/dt < 0 for those components that are zero and if the nonzero si direction is stable, then we refer to the quasi-fixed point as stable; otherwise we refer to the quasi-fixed point as unstable. Equation 4.8 possesses m such points corresponding to segregated quasi-fixed points. Consider the point sQFP = (0, . . . , 0, s+, 0, . . . , 0)T, where only the ith component is nonzero and where we write the usual expansion, s+ = x+ + σ̂²y+. The component si must therefore have zero derivative at sQFP, and this requirement determines the value of s+. Solving the equation dsi/dt|sQFP = 0, we obtain the zeroth-order solution,

\[ x_+ = \frac{\gamma(1+\mu\tau_-) - 1}{\mu\tau_+}, \tag{4.28} \]
which is equation 4.16, reflecting the fact that all nonzero solutions at zeroth order must live in the nonzero fixed hyperplane. After a little algebra, the first-order correction is found to be

\[ y_+ = \frac{1}{\mu\tau_+}\,\frac{1-\gamma}{\gamma}\left(1 - \gamma\,\frac{\mu\tau_-}{1+\mu\tau_-}\right). \tag{4.29} \]
We require that s+ = x+ + σ̂²y+ > 0 for this point to be in the nonnegative hyperquadrant, so we will obtain a first-order correction to the bound on γ. At zeroth order, the condition that x+ > 0 forces γ > γ0. We now write γ = γ0 + σ̂²γ̂ and determine a condition on γ̂ so that s+ > 0. We find that

\[ \hat\gamma > -\frac{\mu\tau_-}{(1+\mu\tau_-)^3}, \tag{4.30} \]
and so, for s+ > 0 at these m possible segregated points, we need

\[ \gamma > \gamma_1 \equiv \frac{1}{1+\mu\tau_-}\left[1 - \hat\sigma^2\frac{\mu\tau_-}{(1+\mu\tau_-)^2}\right]. \tag{4.31} \]
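These critical values are easily evaluated. For one assumed parameter set, the orderings claimed in the surrounding text (γ1 below γ0, and γ2 above γ1) come out as follows:

```python
# Critical values gamma_0, gamma_1, gamma_2 (eqs. 4.16, 4.31, 4.27) for one
# assumed, illustrative parameter set.
tau_minus, mu, sigma2, m = 20.0, 0.05, 0.01, 5
u = mu * tau_minus

gamma0 = 1 / (1 + u)
gamma1 = gamma0 * (1 - sigma2 * u / (1 + u) ** 2)
gamma2 = gamma0 * (1 + sigma2 * (u / (1 + u)) * (u / (1 + u) - 1 / m))

assert gamma1 < gamma0      # segregated points appear slightly below gamma_0
assert gamma2 >= gamma1     # holds since m >= 1
```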
It is easy to see that γ2 ≥ γ1 since m ≥ 1. Thus, these m points become accessible to the weight vector before the origin turns from a saddle into a repeller. We now need to determine the sign of dsj/dt, j ≠ i, at sQFP in order to determine whether these points are possibly stable. We find that for j ≠ i,

\[ \left.\frac{ds_j}{dt}\right|_{s=s_{QFP}} = A_-\tau_-\mu\,\hat\sigma^2\,\frac{1}{(1+x_+)^2}\,\frac{x_+}{\gamma}\,\frac{1}{1+\mu\tau_-}\left[x_+\left(\gamma - \frac{1}{1+\mu\tau_-}\right) - \gamma\,\frac{\mu\tau_-}{1+\mu\tau_-}\right]. \tag{4.32} \]

This equation is purely first order in σ̂², and thus any resulting bound on γ derived from it will be purely zeroth order in σ̂². To calculate a higher-order correction to any resulting bound on γ, we would be compelled to extend our expansion out to order σ̂⁴. One consequence is that while at first order in σ̂² the quantity x+ is allowed to be negative, because the correction σ̂²y+ can pull the sum s+ = x+ + σ̂²y+ overall positive, nevertheless, in determining the sign of the right-hand side of equation 4.32, we must take x+ strictly positive, because the resulting bound on γ will be only zeroth order, for which we are allowed to have only x+ > 0. Hence, the sign of the right-hand side of equation 4.32 is determined by the terms in square brackets. For negative derivatives, we require

\[ x_+\left(\gamma - \frac{1}{1+\mu\tau_-}\right) - \gamma\,\frac{\mu\tau_-}{1+\mu\tau_-} < 0. \tag{4.33} \]
Replacing x+ by its expression in equation 4.28 and writing τ+ = γτ−A−/A+, since we regard τ+ as a function of γ with τ− and A± fixed, we then obtain the quadratic inequality

\[ \left[1 - \frac{A_-}{A_+}(\mu\tau_-\gamma_0)^2\right]\gamma^2 - 2\gamma_0\gamma + \gamma_0^2 < 0. \tag{4.34} \]

Solving this inequality for γ gives bounds on γ that, after a little algebra, are

\[ \frac{A_+(1+\mu\tau_-) - \mu\tau_-\sqrt{A_+A_-}}{A_+(1+2\mu\tau_-) + (A_+-A_-)\mu^2\tau_-^2} < \gamma < \frac{A_+(1+\mu\tau_-) + \mu\tau_-\sqrt{A_+A_-}}{A_+(1+2\mu\tau_-) + (A_+-A_-)\mu^2\tau_-^2}. \tag{4.35} \]
Notice that γ = γ0 satisfies equation 4.34, so the relevant bound on γ is the upper bound. Defining

\[ \gamma_3 = \frac{A_+(1+\mu\tau_-) + \mu\tau_-\sqrt{A_+A_-}}{A_+(1+2\mu\tau_-) + (A_+-A_-)\mu^2\tau_-^2}, \tag{4.36} \]
we require that γ < γ3 in order that the segregated quasi-fixed points sQFP may be stable. For γ > γ3, the derivatives dsj/dt, ∀ j ≠ i, become positive, so the segregated quasi-fixed points are certainly unstable in this region. It remains to be ensured that the nonzero si direction is stable. At zeroth order, the various points sQFP are part of the nonzero fixed hyperplane, which we know to be stable for γ > γ0. Hence, the si direction is always stable at zeroth order. Thus, the segregated quasi-fixed points are certainly stable for γ0 < γ < γ3, where for consistency the lower bound γ0 is of the same order in σ̂², namely zeroth order, as the upper bound γ3. This leaves open the small, first-order-sized window γ1 < γ < γ0 in which the segregated quasi-fixed points exist in the nonnegative hyperquadrant. In fact, the quasi-fixed points are stable in this small region too. Strictly speaking, it is inconsistent to write the condition on γ that guarantees the stability of the segregated quasi-fixed points in the form γ1 < γ < γ3, since the orders of the two bounds differ, but we shall do so anyway in order to close the small γ1 < γ < γ0 window. We now turn to a real unsegregated fixed point. This fixed point is defined by sFP having entirely nonzero components. Thus, since we are then forced to require
\[ \left.\left(s_+ N_0 + \hat\sigma^2 N_1^i\right)\right|_{s=s_{FP}} = 0, \quad \forall\, i, \tag{4.37} \]

we must have N1i = N1j, ∀ i ≠ j. The simplest solution of this equation is siFP = sjFP, ∀ i ≠ j, so that all the components of sFP are equal (and nonzero). Because all the components are equal, we refer to this point as the unsegregated fixed point. We write sFP = (1/m)(s+, . . . , s+)T and again expand s+ as s+ = x+ + σ̂²y+. The zeroth-order solution must lie on the zeroth-order nonzero fixed hyperplane, so we again have

\[ x_+ = \frac{\gamma(1+\mu\tau_-) - 1}{\mu\tau_+}. \tag{4.38} \]
We find that the first-order correction is

\[ y_+ = \frac{\gamma}{\mu\tau_+}\,\frac{1}{1+\mu\tau_-}\left[\mu\tau_- - \left(1-\frac{1}{\gamma}\right)\left(\mu\tau_+ x_+ - \frac{1}{m}\right)\frac{\mu\tau_-}{(1+x_+)^2} - \frac{1}{m}\right]. \tag{4.39} \]
Again, we require that s+ = x+ + σ̂²y+ > 0 for this unsegregated fixed point to be accessible to the weight vector, so, expanding γ as γ = γ0 + σ̂²γ̂ as for the segregated quasi-fixed points, we find that γ must satisfy

\[ \gamma > \frac{1}{1+\mu\tau_-}\left[1 + \hat\sigma^2\frac{\mu\tau_-}{1+\mu\tau_-}\left(\frac{\mu\tau_-}{1+\mu\tau_-} - \frac{1}{m}\right)\right]. \tag{4.40} \]
The right-hand side of this inequality is precisely γ2 defined in equation 4.27, which determines the value of γ at which the zero fixed point turns from a saddle into a repeller. Thus, as the unsegregated fixed point passes through the origin into the positive hyperquadrant, the zero fixed point turns into a repeller. To determine the stability of the unsegregated fixed point, we expand and linearize about it as usual, and after lengthy algebra we find that

\[ \frac{d\delta_i}{dt} = \frac{A_-\tau_-\mu}{1+\mu\tau_-}\,\frac{1}{(1+x_+)^2}\,\frac{1}{\gamma}\left\{\left[-\mu\tau_+\frac{x_+(1+x_+)}{1+\mu\tau_-} + \hat\sigma^2 J\right]\delta_+ - \hat\sigma^2 K\,\delta_i\right\}, \tag{4.41} \]
where J is a long and unwieldy expression that we do not reproduce here, and K is given by

\[ K = x_+\left(\gamma - \frac{1}{1+\mu\tau_-}\right) - \gamma\,\frac{\mu\tau_-}{1+\mu\tau_-}. \tag{4.42} \]
The eigenvalues of the associated matrix are then just

\[ \lambda_1 = \frac{m A_-\tau_-\mu}{1+\mu\tau_-}\,\frac{1}{(1+x_+)^2}\,\frac{1}{\gamma}\left[-\mu\tau_+\frac{x_+(1+x_+)}{1+\mu\tau_-} + \hat\sigma^2 J - \hat\sigma^2\frac{K}{m}\right] \tag{4.43} \]

and

\[ \lambda_i = -\frac{A_-\tau_-\mu}{1+\mu\tau_-}\,\frac{1}{(1+x_+)^2}\,\frac{1}{\gamma}\,\hat\sigma^2 K, \quad \forall\, i > 1. \tag{4.44} \]
Although the expression is messy, it is easy to show that λ1 , associated with the eigenvector (1, . . . , 1)T , changes sign at γ = γ2 , being positive for γ < γ2 and negative for γ > γ2 . Hence, as the unsegregated fixed point moves into the positive hyperquadrant, the zero fixed point turns into a repeller because the direction corresponding to (1, . . . , 1)T becomes unstable and the unsegregated fixed point becomes stable precisely in this same direction. The sign of all the other eigenvalues associated with the unsegregated fixed
point is determined solely by K. Stability in all the directions orthogonal to (1, . . . , 1)T thus requires K > 0, or

\[ x_+\left(\gamma - \frac{1}{1+\mu\tau_-}\right) - \gamma\,\frac{\mu\tau_-}{1+\mu\tau_-} > 0. \tag{4.45} \]
This is identical to equation 4.33, determining the stabilities of the segregated quasi-fixed points, except that the inequality is opposite. Thus, we see immediately that we must have γ > γ3 for the stability of the unsegregated fixed point, and for γ2 < γ < γ3 , the unsegregated fixed point is an unstable saddle node. The unsegregated fixed point becomes stable precisely when the segregated quasi-fixed points become unstable, and the unsegregated fixed point becomes accessible to the weight vector precisely when the zero fixed point turns into a repeller. Thus, all three sets of fixed points are dynamically coupled in terms of their stabilities. Because the local fixed-point structure must be globally consistent, we can deduce that there must exist other fixed points for m ≥ 3. For example, in the interval γ ∈ (γ1 , γ3 ), there must exist saddles that partition the afferent weight vector space into m regions, each region being defined by the requirement that an initial weight vector in the region always flows to the same quasi-fixed point. We do not explore these additional, essentially uninteresting fixed points here because they merely render consistent the global fixed-point structure that is determined by the relative stabilities of the segregated quasi-fixed points and the unsegregated fixed point. It is the stabilities of the segregated and unsegregated states in which we are principally interested here, since these are the states relevant for a putative model of synaptic competition. We see that the first-order dynamics are again uniquely determined by the value of γ , but in contrast to the zeroth-order dynamics, which had only two distinct regimes, the first-order system has four distinct regimes. We now summarize the dynamics in each of the regimes. For γ < γ1 , there is only one fixed point, at the origin. Although a saddle node, it is stable in the (1, . . . , 1)T direction. 
Hence, all flow initially moves parallel to this vector in the direction of the origin. If the weight vector moves sufficiently close to the origin, it will experience a repulsion in the directions orthogonal to (1, . . . , 1)T, but for γ < γ1 this repulsion is never sufficiently strong to reverse the downward components of flow toward the origin. In all cases, the weight vector will eventually hit an si = 0, for some i, hyperplane and be trapped in it by the truncation procedure. Because there are still negative components of flow, the weight vector continues to move toward the origin, becoming trapped in further sj = 0, j ≠ i, hyperplanes. The weight vector thus is always driven toward the origin and ends up there, despite the origin's being a saddle node. The truncation at zero thus overrides this fixed point's instability.
For γ1 < γ < γ2, the zero fixed point is still a saddle node, and now there exists a set of m stable, segregated quasi-fixed points. In this regime of γ, the repulsion away from the origin is sufficiently strong to reverse the negative components of flow. Thus, a weight vector sufficiently close to the origin hits an si = 0, for some i, hyperplane, is turned away from the origin, and starts moving in the opposite direction. It moves up si = 0 hyperplanes until it reaches a stable quasi-fixed point and remains there. Weight vectors sufficiently distant from the origin flow toward the origin parallel to the direction (1, . . . , 1)T. These vectors never come sufficiently close to the origin to have their negative components of flow reversed. They thus hit si = 0 hyperplanes and move down toward the stable quasi-fixed points, and stay there. In the regime γ2 < γ < γ3, the zero fixed point has now turned into a repeller, so there are never any components of flow toward the origin in the neighborhood of the origin. An unsegregated fixed point has become accessible, but it is a saddle node. The segregated quasi-fixed points remain stable. Hence, this regime is essentially identical to the regime γ1 < γ < γ2, except with local differences near the origin and around the now-accessible unsegregated fixed point. All flow therefore ends up at the stable, segregated quasi-fixed points. The combined regime γ1 < γ < γ3 therefore supports stable, competitive dynamics, allowing afferents to segregate on the target cell in an activity-dependent manner, as required, for example, in a model of ODC formation. Finally, when γ3 < γ, the segregated quasi-fixed points become unstable, and the unsegregated fixed point becomes an attractor. The origin remains a repeller. Hence, all flow ends up at the unsegregated fixed point. The phase portraits for the two interesting, dynamically distinct γ regimes are shown in Figure 9 for a two-afferent system.
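The boundary γ3 between the last two regimes can be checked numerically: bisecting for the sign change of K in equation 4.42, with x+ from equation 4.28 and τ+ = γτ−A−/A+, recovers the closed form of equation 4.36. Parameter values are again assumptions for illustration.

```python
# Bisection for the root of K (eq. 4.42) versus the closed form gamma_3
# (eq. 4.36), under assumed parameter values.
A_plus, A_minus, tau_minus, mu = 1.0, 1.0, 20.0, 0.05
u = mu * tau_minus

def K(g):
    mu_tau_plus = g * u * A_minus / A_plus     # tau_+ = gamma tau_- A_-/A_+
    x_plus = (g * (1 + u) - 1) / mu_tau_plus   # eq. 4.28
    return x_plus * (g - 1 / (1 + u)) - g * u / (1 + u)

lo, hi = 0.6, 1.4                              # bracket: K(lo) < 0 < K(hi)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if K(mid) < 0 else (lo, mid)

gamma3 = (A_plus * (1 + u) + u * (A_plus * A_minus) ** 0.5) / (
    A_plus * (1 + 2 * u) + (A_plus - A_minus) * u ** 2)
assert abs(0.5 * (lo + hi) - gamma3) < 1e-9    # numeric root matches eq. 4.36
```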
We do not show the low γ regime, γ < γ1, because the portrait is trivial, with all initial conditions flowing to the origin. Figure 9A shows the regime in which γ takes an intermediate value, γ1 < γ < γ3, for which a set of stable, segregated quasi-fixed points exists. The system always evolves to one of these segregated points. For this intermediate value of γ, the unsegregated fixed point is unstable. When γ is too high, γ3 < γ, shown in Figure 9B, the segregated quasi-fixed points become unstable, and the unsegregated fixed point becomes stable, so afferent segregation on the target cell breaks down, and the afferent weight vector always flows to the unsegregated fixed point.

4.2 Full ∞-Spike Rule. The above fixed-point analysis for the simplified rule may be repeated for the full, nonresetting, ∞-spike rule in equation 2.22, for the convenient parameter choice n± = 1. We do not present the results of this analysis here because the expressions that arise from the full model are unwieldy and thus lack the transparency of those for the simplified model. The simplified model has the virtue, compared to the full model, that almost all the resulting expressions can be stated on
[Figure 9 here: two phase-portrait panels, A and B, each plotting afferent 2 strength against afferent 1 strength, with both axes running from 0 to 1.4.]
Figure 9: Phase portraits of the simplified, nonresetting ∞-spike rule, with n± = 1, in a system with two afferents. (A) Evolution to the segregated quasi-fixed points with γ = 0.65. (B) Evolution to the unsegregated fixed point with γ = 0.95.
one line, while those for the full model occupy several lines. Nevertheless, the full model possesses dynamics that are qualitatively identical to those of the simplified model discussed above. In the full model, we observe similar critical values of γ at which new quasi- or real fixed points become accessible to the afferent weight vector or at which the fixed points change their stability. In particular, corresponding to the values γ1 and γ2 for the simplified model, we have the values, say, γ̄1 and γ̄2, for the full model, where these differ from γ1 and γ2 by terms of order σ̂². At γ = γ̄1, the segregated quasi-fixed points become available and are initially stable. At γ = γ̄2, the unsegregated fixed point moves into the nonnegative hyperquadrant, being initially an unstable saddle, and the zero fixed point at the origin simultaneously changes from a saddle node into a repelling node. We still have γ̄1 ≤ γ̄2 as in the simplified model. Corresponding to γ3 in the simplified model, we also have a new value, say, γ̄3, in the full model. Although derived from first-order equations, γ3 and γ̄3 are nonetheless purely zeroth order in σ̂². Hence, γ3 and γ̄3 differ even at zeroth order. Despite this, the dynamics associated with the γ = γ̄3 transition in the full model are identical to those associated with the γ = γ3 transition in the simplified model. At this point, the segregated quasi-fixed points become unstable, and the unsegregated fixed point simultaneously becomes stable. Although our analyses of the full and simplified models have been performed only for the case n± = 1, for reasons of analytical simplicity, we can explore numerically the impact of other values of n± on the full and simplified models. For n± = 3, for example, we observe dynamics that are qualitatively similar to the n± = 1 case, with the same three essentially distinct parameter regimes in γ.

4.3 Beyond Small αi.
In the above fixed-point analysis, we expanded, for example, equation 4.6 in the variables αi , representing small fluctuations about a common mean afferent activity, µ, and then performed an ensemble average over these fluctuations. Although this permits us to make some progress in terms of understanding the fixed-point dynamics of the model, it necessarily does not allow an examination of the large αi regime, in which the fluctuations about the mean activity can be large. In principle, because we defined the size of these fluctuations with respect to the mean firing rate, so that |αi | ≤ 1, we could continue the expansion to yet higher orders in σˆ 2 . Although possible, doing so would be tiresome and the resulting expressions an uncontrollable mess. It is therefore not clear that any additional analytical insight would be possible in the face of the growing complexity of the new terms. We can, however, explore the fixed-point structure for large αi by means of numerical simulation. When we do so, we find essentially the same three regimes in γ for both the full and simplified models considered above. In
particular, γ1 and γ2 still exist and define the points at which the segregated quasi-fixed points and the unsegregated fixed point, respectively, move into the nonnegative hyperquadrant and thus become accessible to the weight vector. However, the γ3 critical value in both models is somewhat modified. The large αi fluctuations "split" this value of γ into two different values; call them, say, γ3′ and γ3″, where γ3′ < γ3″. At γ = γ3′, the unsegregated fixed point becomes stable, while the segregated quasi-fixed points remain stable. Only at γ = γ3″ do the segregated quasi-fixed points become unstable. Thus, there is a narrow interval for γ, γ ∈ (γ3′, γ3″), of size of order σ̂² or higher, in which the system may evolve to either a segregated quasi-fixed point or the unsegregated fixed point. To which point the system evolves is determined by the initial conditions. Because the fixed-point structure must be globally consistent, we deduce that there must exist in this narrow regime a new set of saddle fixed points that partition the space into a region containing the unsegregated fixed point, to which any initial point in this region will flow, and m regions containing the segregated quasi-fixed points. The existence of this narrow transition region in which both the segregated quasi-fixed points and the unsegregated fixed point are simultaneously stable is all that appears to distinguish the dynamics of both the full and simplified models in the small αi, analytically explored regime from the large αi, numerically explored regime. It is likely, in fact, that this transition region is present even for small αi, but is so narrow as to be extremely difficult to observe numerically.
5 Computation in the Rate-Based Limit

To derive the n-spike, rate-based rules, we integrated over the interspike intervals and averaged over all 2^n possible spike trains to compute an unconditional expectation value for the change in synaptic efficacy due to a typical n-spike train. The resulting rate-based rule is somewhat abstract, in the sense that the neuron does not really compute at the level of this rate-based rule but continues to compute at the level of individual spikes. We may ask, however, whether there are any conditions under which, although the neuron is computing at the level of spikes, it nevertheless behaves as if it were following the rate-based, n-spike rule. If no such conditions exist, then our analysis is somewhat academic, since the derived rules would be merely mathematical abstractions that the neuron's behavior can never approximate. We can identify two limiting cases of interest. First, if the spike train is extremely long and not highly unusual, then we can think of the train as naturally decomposing into a set of shorter subtrains. The neuron can then effectively average its behavior over these subtrains. If there are enough subtrains, then sufficient averaging over them will occur in order for the neuron's mean dynamics to approximate the rate-based rules. However, the
variance in this behavior could be large, and these fluctuations could thus nevertheless prevent the emergence of stable, segregated states at the spike-based level. Second, we know that even for small n, n ≥ 3, the rate-based rules exhibit stable, segregated states, and the above considerations for large n do not apply. Of course, if we present a neuron with a single instance of an n-spike train, n small, then we would not expect its synaptic strengths to evolve much during this short train. The neuron must therefore be presented with a long sequence of such n-spike trains, so that it can average over this sequence, and the trains must be sufficiently well separated that all the synapses return to the OFF state between trains. Again, the averaging will ensure that the mean behavior is exhibited, but the fluctuations could destroy the stability. We therefore see that the key to stability is to ensure that the fluctuations are small. Small fluctuations can be guaranteed provided that the neuron's dynamics are not dominated by the most recent spike train (or subtrain). If the most recent spike train essentially erases the neuron's state developed from exposure to earlier trains, then the neuron's behavior will be dominated by train-to-train fluctuations. We therefore require that the change in synaptic strength induced by each spike train is small compared to the (nonzero) synaptic strengths. This can be achieved by setting A± sufficiently small, since these two parameters set the overall magnitude of plasticity. Changing the magnitude of plasticity (the overall scale of A±) is, of course, equivalent to changing the learning rate in a model's dynamics. The learning rate is essentially the step size in a numerical integration procedure.
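The step-size analogy can be made concrete with a generic one-line example (explicit Euler on a simple decay equation, not the switch model itself): below the stability threshold the numerical solution tracks the true one, and above it the iteration diverges.

```python
# Explicit Euler on dy/dt = -y: stable for dt < 2, divergent for dt > 2.
# A generic illustration of the step-size threshold, not the switch model.
def euler(dt, steps=100, y=1.0):
    for _ in range(steps):
        y += dt * (-y)      # one Euler step of dy/dt = -y
    return y

assert abs(euler(0.1)) < 1e-4     # small step: decays, like the true solution
assert abs(euler(2.5)) > 1e10     # large step: numerically unstable
```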
It is well known that the stability of the numerical solutions of a set of differential equations depends critically on the step size and that there usually exists a threshold above which the integration scheme fails to converge to the exact solutions, with chaotic instability or divergent behavior ensuing. We should therefore not be too surprised that when the magnitude of plasticity is sufficiently small, the spike-based behavior will be expected to converge to the rate-based behavior. Given the complexity of the switch model, however, determining the location of the threshold above which the neuron does not compute in the rate-based limit, and exhibits instead a strong dependence primarily on the last spike train, is a difficult matter. We therefore resort to a simple numerical search for the approximate location of this threshold. To obtain a condition on the magnitude of plasticity below which the rate-based behavior becomes dominant in a spike-based simulation, we determine when the spike-based system exhibits qualitatively the same fixed-point structure known to exist in the rate-based system. We consider a system of two afferents for simplicity. A rate-based simulation of two afferents will stably segregate, with one afferent gaining complete control of the target cell, provided that γ1 < γ < γ3 . Thus, we select a value of γ in this range and run a spike-based simulation for various values of the overall scale of A± . In these simulations, in order to examine the n-spike rule, we
present a series of n-spike trains to each of the afferents' synapses, and after every train force the synapses to return to the OFF state, which is equivalent to spacing the trains sufficiently far apart that they do not interact. Within each train, the afferents have their Poisson firing rates randomly fixed either "high" (75 Hz) or "low" (25 Hz). We perform an initial presentation of 2.5 × 10^7 spikes, partitioned into n-spike trains, in order to allow sufficient time for the afferents to segregate on the target cell. At a typical, average rate of 50 Hz, this corresponds to approximately six days' worth of simulated synaptic activity, which is not too dissimilar to the typical timescale for developmental processes in the nervous system. After this initial period to allow time for segregation, we present another series of 2.5 × 10^7 spikes, again partitioned into n-spike trains, during which we probe the extent and stability of any segregation. After each train presentation, we calculate the segregation index, SI, which we define as

\[ SI = \frac{s_1 - s_2}{s_1 + s_2}. \tag{5.1} \]
If the afferents are well segregated, then SI will take values close to +1 or −1, depending on which afferent controls the target cell. If segregation is stable, then this index will not change much, except for small fluctuations. If the afferents are segregated but not stably so, with control switching between the two afferents, then SI will flip between +1 and −1. Averaged over sufficient trains, its value will be roughly zero. If the afferents are not segregated, but oscillate about a mean synaptic strength, then SI will always be roughly zero, and of course its average will be roughly zero. Thus, for this probing phase, we determine ⟨SI⟩P, where ⟨·⟩P denotes the average value of SI during this second period. We then take the absolute value of this average, and average this value over 50 distinct runs for each value of the overall scale of A±. Thus, our final measure of segregation and stability is ⟨|⟨SI⟩P|⟩R, where ⟨·⟩R denotes an average over runs. In Figure 10, we plot ⟨|⟨SI⟩P|⟩R as a function of the overall scale of A± for 3-, 9-, and 15-spike train simulations for the nonresetting model with n± = 3 and γ = 0.6. We obtain qualitatively similar results for the resetting model and for different values of γ. Also shown in Figure 10 is the fit of our raw data to logistic-like functions, a − b tanh(cx − d), where a, b, c, and d are fitted parameters. That the fits to logistic-like functions match the raw data well indicates that the transition from stable segregation to unstable segregation is relatively sharp. For an overall scale of plasticity of approximately 10^−3 or lower, depending on the number of spikes in the train, we observe robust and stable afferent segregation, while for a scale greater than this value, segregation is achieved, but is not stable, so that the afferents change their control of the target cell over time. For values of the overall scale very much greater than 10^−3, segregation is not achieved at all.
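The index and the two-level average can be sketched as follows; the SI sequences below are synthetic stand-ins built only to mimic stable versus unstable segregation, not simulation output.

```python
import numpy as np

# Segregation index (eq. 5.1) and the two-level average <|<SI>_P|>_R:
# SI is averaged over the probe period P, its absolute value is then
# averaged over runs R.
rng = np.random.default_rng(0)

def segregation_index(s1, s2):
    return (s1 - s2) / (s1 + s2)

def measure(si_runs):
    """<|<SI>_P|>_R for a list of per-run SI sequences."""
    return float(np.mean([abs(np.mean(si)) for si in si_runs]))

assert segregation_index(1.0, 0.0) == 1.0      # afferent 1 controls the cell

# Synthetic SI sequences (assumed): stable runs sit near +1 or -1 throughout;
# unstable runs flip control between the afferents from train to train.
stable = [rng.choice([-1.0, 1.0]) + 0.02 * rng.standard_normal(1000)
          for _ in range(50)]
unstable = [rng.choice([-1.0, 1.0], size=1000) + 0.02 * rng.standard_normal(1000)
            for _ in range(50)]

assert measure(stable) > 0.9                   # stable segregation: near 1
assert measure(unstable) < 0.2                 # unstable segregation: near 0
```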
We observe that as the train length increases, the mean segregation
[Figure 10 here: mean segregation index (0 to 1) plotted against magnitude of plasticity (1 to 10, axis scaled by 10,000), with solid curves labeled 3, 9, and 15.]
Figure 10: The dependence of the mean segregation index ⟨|⟨SI⟩P|⟩R on the magnitude of plasticity, or the overall scale of A±. Solid lines represent the numerically obtained values of ⟨|⟨SI⟩P|⟩R for the number of spikes in each train indicated by the attached number. The dashed lines show the fit of the raw data to logistic-like functions, as described in the text.
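A fit of the kind shown by the dashed lines might be performed as below, here on synthetic data with assumed parameters; this uses SciPy's `curve_fit` and is only a sketch, not necessarily the authors' fitting procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit the logistic-like function a - b*tanh(c*x - d) to synthetic data that
# mimics the sharp stable-to-unstable transition.  All parameter values here
# are assumptions for illustration.
def logistic_like(x, a, b, c, d):
    return a - b * np.tanh(c * x - d)

x = np.linspace(1, 10, 40)
true = (0.5, 0.45, 1.2, 4.0)                 # assumed "true" parameters
rng = np.random.default_rng(1)
y = logistic_like(x, *true) + 0.01 * rng.standard_normal(x.size)

params, _ = curve_fit(logistic_like, x, y, p0=(0.5, 0.5, 1.0, 4.0))
assert np.allclose(params, true, atol=0.1)   # parameters are recovered
```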
index increases for a fixed value of the overall plasticity magnitude. However, by about 10 to 20 spikes, asymptotic behavior is reached, with no further increase in the index observed. This is in accord with our expectations, since the n-spike rate-based learning rules converge very rapidly as a function of n, with convergence achieved by n ≈ 10 spikes (Appleby & Elliott, 2006). The data in Figure 10 are obtained by randomly fixing the afferents’ Poisson firing rates within each spike train. For longer trains, therefore, each afferent fires for longer with the same firing rate. It could therefore be argued that the dependence of the magnitude of plasticity on the number of spikes in a train merely reflects this longer exposure to the same firing pattern. We can rule this out in two ways. First, instead of fixing each train’s afferents’ rates, we can instead fix each afferent’s rate for a given period of time (a firing “epoch”), regardless of the number of spikes in each train. For a firing epoch length of 1000 ms, for example, we obtain data that are essentially identical to those in Figure 10 (data not shown). Second, we can, for example, consider three-spike trains with fixed firing per train but simply duplicate the firing patterns between consecutive trains, so that
two-train sequences of three-spike trains have the same rates. Doing this, we produce data identical to the three-spike data shown in Figure 10 and not the six-spike data with an enhancement in the segregation index (data not shown). Thus, longer exposure to the same activity patterns is not responsible for the trend exhibited in Figure 10. Rather, the observed trend reflects the fact that as the number of spikes in a train increases, competition becomes stronger and stronger, as revealed by equation 3.22. A modification level of 10^−3 roughly translates to a 0.1% change in synaptic strength per potentiation or depression event. For an afferent supporting 10 synapses, this translates to a 1% change in afferent strength per spike pairing event. This approximate magnitude is commensurate with that seen in experimental work, where a change of around 1.5% per spike pairing is typically measured (Bi & Poo, 1998). Although this numerical treatment is not exact, it is sufficient for our purposes to demonstrate that computation in the rate-based limit is realistically available to a real, spike-based system and to have an approximate idea of where that limit lies. Assuming that A± satisfy the numerically obtained condition on the magnitude of plasticity below which the rate-based behavior becomes dominant, we may dispense with spikes completely and work in the rate-based limit. The analysis of section 4 was carried out in the rate-based limit and did not consider individual spikes. The stable, competitive dynamics that we observed will therefore be the dominant mode of computation provided that A± ∼ 10^−3 or lower.

6 Large-Scale Numerical Simulations

In section 4 we established that the ∞-spike nonresetting rule possesses a fixed-point structure consistent with the presence of stable, competitive dynamics in the model. The presence of such dynamics has also been argued for in all multispike rules for an appropriate choice of parameters.
In section 5, we established that the rate-based multispike dynamics are actually available to neurons in the sense that, although computing at the level of single spikes, the neuron’s dynamics will respect the fixed-point structure of the rate-based rules and exhibit stable, competitive dynamics. We are therefore now in a position to consider a large-scale simulation of, for example, ODC development, in order to demonstrate that the results shown for a single target cell scale up without difficulty to multiple target cells. Moreover, while for convenience we considered above only uncorrelated afferent activity patterns, a simulation of ODC development will permit us to explore negatively or positively correlated afferent activity patterns too. For ODC development in the cat, which develops ODCs in the presence of presumably positively correlated afferent activity after eye opening, the case of positive correlations is more pertinent. This is also usually the much harder task: models that can segregate negatively correlated afferents
Competition Under Multispike STDP
2457
frequently cannot segregate positively correlated afferents. For our switch model to be a candidate model of developmental synaptic plasticity, we must therefore show that it operates successfully in the face of positively correlated afferent activity patterns. We run ODC simulations according to existing, documented protocols (Elliott & Shadbolt, 1998). We use the rate-based, nonresetting ∞-spike rule with n± = 1, in equation 2.22, as the synaptic modification rule. In brief, we simulate two patches of retinotopically equivalent lateral geniculate nucleus (LGN), each containing a square array of cells of size 13 × 13, with periodic boundary conditions imposed for convenience. The cortex is a 25 × 25 square array of cells, again with periodic boundary conditions. Each LGN cell arborizes over a retinotopically appropriate patch of cortex, of size 7 × 7. LGN activity patterns are constructed according to the method of Goodhill (1993), taking the form of gaussian correlated noise. The parameter p ∈ [0, 1] determines the activity correlations between the two LGN patches, with p = 0 corresponding to perfectly anticorrelated patterns and p = 1 corresponding to perfectly correlated patterns. Cortical activity is just the standard linear sum of afferent input, but smeared by convolving cortical activity with a short-range gaussian function. Such smearing could be achieved, for example, by lateral excitation. In this way, nearby cortical cells fire similarly, which is necessary in order to develop structured ODCs rather than a pattern of salt-and-pepper segregation. It is well known that presynaptic constraints must typically be introduced in order for the pattern of ODCs to exhibit some degree of regularity, with fairly constant stripe widths across the whole simulated patch of cortex. To achieve this, we introduce a lower bound on the total synaptic strength that an afferent may support. 
If it falls below 75% of the average total strength supported by all afferents, we simply freeze depression. We use this method because it is convenient and removes the need for us to calculate explicitly the expected total afferent strength in such a simulation and then set a lower bound accordingly. In Figure 11 we show three ODC maps corresponding to three different values of p: p = 0.3 representing negatively correlated activity patterns; p = 0.5 representing uncorrelated activity patterns; and p = 0.7 representing positively correlated activity patterns. We see clear patterns of ODCs in all cases, with well-segregated afferents, although with increasingly binocular boundaries between ODCs as p increases, as expected. We also observe a clear decrease in the widths of ODCs as the interocular correlations increase. This phenomenon was first observed by Goodhill (1993) in simulation, and subsequent experimental results supported the possibility that the widths of ODCs are not fixed but may be partially determined by visual experience (Löwel, 1994; Tieman & Tumosa, 1997). Our results here show that our switch model of STDP scales up to a large-scale simulation without difficulty and can comfortably segregate afferents whose activities are positively correlated.
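The simulation protocol just described can be sketched in outline. The following skeleton is an illustrative reconstruction under our own simplifying assumptions, not the authors' code: plain FFT-smoothed noise stands in for the full Goodhill (1993) construction of gaussian correlated patterns, and all function names and parameter values beyond the grid sizes stated in the text are ours.

```python
import numpy as np

LGN, CTX, ARBOR = 13, 25, 7          # grid sizes stated in the text

def periodic_smooth(x, sigma):
    """Gaussian smoothing with periodic boundaries, implemented in Fourier space."""
    f0 = np.fft.fftfreq(x.shape[0])[:, None]
    f1 = np.fft.fftfreq(x.shape[1])[None, :]
    kernel = np.exp(-2.0 * (np.pi * sigma) ** 2 * (f0 ** 2 + f1 ** 2))
    return np.real(np.fft.ifft2(np.fft.fft2(x) * kernel))

def lgn_activity(p, rng, sigma=1.5):
    """Gaussian-correlated noise for the two LGN patches; p interpolates from
    perfectly anticorrelated (p = 0) to perfectly correlated (p = 1).
    A simplified stand-in for the construction of Goodhill (1993)."""
    common = periodic_smooth(rng.standard_normal((LGN, LGN)), sigma)
    private = periodic_smooth(rng.standard_normal((LGN, LGN)), sigma)
    left = p * common + (1.0 - p) * private
    right = p * common - (1.0 - p) * private   # p = 0 gives right = -left
    return left, right

def cortical_activity(afferent_input, sigma=1.0):
    """Linear sum of afferent input, smeared by a short-range gaussian
    (standing in for lateral excitation)."""
    return periodic_smooth(afferent_input, sigma)

def freeze_weak_depression(ds, s, frac=0.75):
    """Presynaptic constraint: block further depression for any afferent whose
    total strength falls below 75% of the across-afferent mean."""
    totals = s.reshape(s.shape[0], -1).sum(axis=1)
    weak = totals < frac * totals.mean()
    ds[weak] = np.maximum(ds[weak], 0.0)
    return ds

rng = np.random.default_rng(0)
left, right = lgn_activity(p=0.7, rng=rng)
print(left.shape)   # (13, 13)
```

Within a full simulation, each update would compute cortical responses from the current strengths, apply the synaptic modification rule, and then pass the proposed strength changes through `freeze_weak_depression` before updating.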
[Figure 11: three ocular dominance maps, for p = 0.3, p = 0.5, and p = 0.7, with an L-to-R gray-scale key.]
Figure 11: A simulation of ocular dominance column formation in the nonresetting ∞-spike model with n± = 1 for the three values of the interocular correlation probability shown above each map. In these maps, each square represents a single cortical neuron. The shade of gray assigned indicates the relative degree of control by the two eyes. A white square indicates complete control by the left eye, and a black square indicates complete control by the right eye. Shades of gray, as shown in the key, interpolate between these two extremes.
7 Discussion

In this letter, we have studied the synaptic dynamics induced by the two- and multispike rules that are derived within the context of a stochastic model of STDP. In this model, the STDP rule emerges only at the temporal and synaptic ensemble level, while individual synapses obey a much simpler, computationally less burdensome plasticity rule in which synaptic strengths change in an all-or-none fashion independent of spike timing. The structure of the postulated unified three-state synaptic switch, which was initially designed only to account for the basic phenomenology of the STDP learning rule seen in the context of interactions between two spikes, tightly constrains the form of the multispike interactions that exist in our model. One freedom that the model possesses is to allow stochastic process resetting in the DEP and POT states. Although resetting changes the precise form of the multispike interaction functions, nevertheless, dynamically speaking, the qualitative dynamics observed, in terms of the fixed-point structure of the model, are essentially independent of whether we allow resetting. Without further modification, we find that the two-spike learning rule is irredeemably pathological in its learning dynamics, sending all afferents to either zero strength or unbounded growth. The two-spike learning rule cannot therefore support the stable, competitive dynamics that are essential within the context of developmental synaptic plasticity. This observation is particularly interesting given that the majority of theoretical studies of
STDP are based on two-spike rules. Often the introduction of additional constraints and nonlinearities becomes necessary, such as imposing hard upper bounds on synaptic strengths (Song et al., 2000), scaling of potentiation with synaptic strength (van Rossum et al., 2000), or the temporal restriction of spike interactions (Izhikevich & Desai, 2003). These various additional constraints are required to stabilize the underlying STDP learning dynamics of the unmodified STDP rule. The assertion that the two-spike STDP rule can therefore account for, for example, developmental synaptic plasticity must be treated with caution. It appears, in fact, that the additional constraints required to stabilize the two-spike STDP rule’s dynamics are actually playing a significant role in giving rise to the very dynamics of interest. The BCM model (Bienenstock et al., 1982) is a well-studied model of developmental synaptic plasticity, exhibiting the stable, competitive dynamics required of such a model. Systematic attempts have therefore been made to connect the two-spike STDP rule to the BCM model, in order to endow the STDP rule with the dynamics exhibited by the BCM rule (Izhikevich & Desai, 2003). In our own formulation of STDP, we find that a large parameter regime exists in which the rate-based, two-spike learning rule is qualitatively BCM-like. It is, however, exactly this BCM-like form that gives rise to the two-spike rule’s instabilities, trapping afferents in a depressing well for small synaptic strengths and permitting unbounded growth for large synaptic strengths. This behavior occurs because the threshold determining the cross-over point between depression and potentiation in our unmodified two-spike rule is fixed, and, indeed, the BCM model with such a fixed rather than a sliding threshold would exhibit identical dynamics. By introducing such a sliding threshold, the two-spike rule can therefore be stabilized, as expected.
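The contrast drawn here between a fixed and a sliding threshold can be illustrated with a minimal rate-based BCM caricature. This toy is not the switch model itself, and every parameter value is illustrative; it merely exhibits the depressing well and the runaway growth under a fixed threshold, and stabilization once the threshold slides with activity.

```python
import numpy as np

def bcm_run(sliding, w0=0.5, steps=20000, dt=1e-3, eta=0.5, tau=0.1):
    """Toy rate-based BCM: dw/dt = eta * x * y * (y - theta), with y = w * x.
    If `sliding`, theta tracks a running average of y**2; otherwise theta is
    fixed at 1. All parameter values are illustrative. Returns the final w."""
    rng = np.random.default_rng(1)
    w, theta = w0, 1.0
    for _ in range(steps):
        x = rng.uniform(0.5, 1.5)          # presynaptic rate
        y = w * x                          # linear postsynaptic response
        w = max(w + dt * eta * x * y * (y - theta), 0.0)
        if sliding:
            theta += (dt / tau) * (y * y - theta)   # sliding threshold
    return w

weak_fixed = bcm_run(sliding=False)                      # depressing well: w -> 0
strong_fixed = bcm_run(sliding=False, w0=1.2, steps=2000)  # runaway growth
stabilized = bcm_run(sliding=True)                       # settles at a nonzero value
```

With the fixed threshold, a weight starting below the crossover decays to zero and a weight starting above it grows without bound, exactly the pathology described in the text; with the sliding threshold, the same rule settles at a stable intermediate weight.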
Focusing attention only on two-spike interactions and thus attempting to patch up the dynamics exhibited by the two-spike STDP learning rule strikes us as unnecessarily restrictive. After all, a real neuron does not experience a disconnected series of spike pairs, but rather an entire train of freely interlacing pre- and postsynaptic spike events. The dynamics of any plasticity rule should thus be understood in the context of these natural spike trains, and not in the context of an experimental stimulation protocol that happens to be particularly convenient to the experimentalist. It is therefore critical that multispike interactions be understood, especially given the difficulties arising from the two-spike interaction function. However, it is precisely the generalization to longer spike trains that creates difficulties for many existing models of STDP. Such devices as making the scale of plasticity sensitive to spike history (Froemke & Dan, 2002) or restricting the temporal interactions between spikes (Izhikevich & Desai, 2003) are introduced. To be sure, such methods do replicate spike triplet data (Froemke & Dan, 2002) and do allow a BCM rule to be derived from STDP rules (Izhikevich & Desai, 2003). However, such approaches seem to us essentially to be attempts to
force multispike interactions into the straitjacket imposed by two-spike interactions. Without prior design (since we were thinking only about two-spike STDP), our model essentially forces on us all the multispike interaction functions, and all such functions exhibit qualitatively similar dynamics—dynamics that differ significantly from those exhibited by the two-spike interaction function. For a very broad range of parameters, all these rules exhibit stable, competitive dynamics without any further modification. Even when the lower-order multispike rules fail, we are guaranteed that all the higher-order multispike rules operate as required. At first blush, this is astonishing. Merely extending our considerations from two spikes to three spikes immediately solves all the problems inherent in the two-spike rule. Why should the step from two to three spikes change the dynamics so dramatically? We argued that this is because in the unified switch rule, depression and potentiation are coupled. By this rather vague notion, we mean that activation of the potentiation lobe of the switch precludes a simultaneous activation of the depression lobe of the switch, and vice versa. In a sense, a putative potentiation event blocks a possible depression event, and vice versa. By breaking the two lobes into two separate switches, so that they may be simultaneously active, we showed that competition immediately breaks down, resulting in the runaway learning characteristic of the two-spike rule. Critical to probing this coupling between potentiation and depression is the presence of three or more spikes; even in the unified switch, two spikes can never probe this coupling, and so two-spike interactions always induce uncoupled potentiation and depression dynamics. It therefore appears that this coupling, or interaction, between potentiation and depression processes at the level of a single neuron is vital to the presence of stable, competitive dynamics.
Indeed, a sliding threshold in the BCM model provides precisely a means of coupling potentiation and depression events, in that the threshold is sensitive to the firing history of the cell, and this firing history is dependent on the strengthening and weakening of the synapses that the cell supports. Without this coupling, so that the threshold is fixed, stability and competition in the BCM model break down. It appears in general, therefore, that in order for a model of synaptic competition to operate successfully, the machinery of potentiation and the machinery of depression cannot be independent. Although the dependence of the two processes is very indirect in the BCM model, via the sliding threshold, it is nevertheless critical to the model’s successful operation. We therefore speculate that any model in which potentiation and depression are completely independent processes cannot be competitive. It would be interesting to seek to prove this within the context of the most general class of model possible, although this is likely to be very difficult. In the above analyses, we have truncated si at zero when it is driven negative. We saw, for example, that although the zero fixed point is a saddle
for γ < γ2 in the simplified ∞-spike resetting model, nevertheless, the truncation procedure dynamically converts it into a stable node when γ < γ1. Moreover, the truncation gave rise, partly, to the existence of quasi-fixed points. It might therefore appear that this truncation procedure is playing an important role in our model’s dynamics, being partly responsible for the presence of stable, competitive dynamics in it. This is not, however, the case. We can see this in two different ways. First, from a mathematical point of view, we set dsi/dt = Sn(λπi, λp) and used this as our synaptic strength update rule. This is not the only choice available to us. In this formulation, si represents the average synaptic strength over all of afferent i’s synapses on the target cell. Suppose, however, that we instead think of si as the (possibly scaled) number of synapses supported by afferent i, each of which experiences the single-synapse learning rule. Then the overall change in afferent i’s synaptic strength will be the individual change multiplied by the number of synapses, si. Hence, on this view, we would instead write dsi/dt = si Sn(λπi, λp). The presence of the extra factor of si in this rule immediately means that synapses cannot evolve below zero, and thus the quasi-fixed points are converted into real fixed points, with their location and stability completely unchanged. Furthermore, it is easy to see that the zero fixed point in this formulation is genuinely stable for γ < γ1. We further note that truncation at zero is a hard nonlinearity to which the afferents are completely insensitive when si > 0. Thus, their evolution to the vicinity of the segregated quasi-fixed points in the interval γ ∈ (γ1, γ3) is entirely independent of truncation: it is a dynamical consequence of the overall structure of the model. The need for truncation, then, is merely a technical issue that does not fundamentally modify the model’s prior dynamics.
Second, from a biological point of view, when a synapse reaches zero or close-to-zero synaptic strength, we expect it either to shut down and evolve no further, or to be retracted entirely. Biologically, an excitatory synapse of zero strength cannot turn into an inhibitory synapse. Our derivation of the learning rules, however, takes no account of this, for the purposes of tractability. However, the simplest remedy to this situation is merely to prevent a synapse from depressing when such depression would take its synaptic strength negative. (Indeed, this is the presynaptic constraint that we used to model ODC development, in order not to obtain irregular patterns of afferent segregation, except that the lower limit on synaptic strength was set at some nonzero value.) This prevention of depression is tantamount to switching off the depressing lobe of the switch, perhaps by setting A− = 0, when the synapse is too weak to permit further depression. In such a state, it could be potentiated but not depressed. This, though, is essentially equivalent to truncation of strengths at zero. The key competitive dynamics in the model are those that precisely allow a synapse to move toward a quasi-fixed point near which truncation, or the closing down of depression, then becomes necessary.
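The point that the extra factor of si automatically keeps strengths nonnegative, whereas the additive form needs explicit truncation, can be checked with a one-line Euler integration. Here a constant depressing drive stands in for Sn(λπi, λp); the constant is purely illustrative and is not the model's actual plasticity function.

```python
def evolve(s0, drive, dt=0.01, steps=2000, multiplicative=True):
    """Euler-integrate ds/dt = s * S (multiplicative form) or ds/dt = S with
    hard truncation at zero (additive form), for a scalar strength s.
    `drive` is a constant stand-in for the plasticity drive S."""
    s = s0
    for _ in range(steps):
        s = s + dt * (s * drive if multiplicative else drive)
        if not multiplicative:
            s = max(s, 0.0)    # the hard truncation the text discusses
    return s

s_mult = evolve(1.0, drive=-1.0, multiplicative=True)    # decays geometrically
s_add = evolve(1.0, drive=-1.0, multiplicative=False)    # hits zero, truncated
print(s_mult > 0.0, s_add == 0.0)   # -> True True
```

The multiplicative form approaches zero but never crosses it, so the zero fixed point becomes genuinely attainable without any hard nonlinearity, while the additive form must be clipped by hand.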
We observe three essentially distinct parameter regimes in our model’s dynamics, corresponding to different ranges for the parameter γ . For low γ , all afferent strengths fall to zero. For high γ , afferents evolve to an unsegregated fixed point in which all afferents equally control the target cell. Only for intermediate values of γ do we see the evolution of the system to stable, segregated final states. Considered within the context of a simulation of ODC formation, the low- and high-γ regimes would correspond to a breakdown of ODC formation, while the intermediate regime would correspond to the normal development of ODCs. Do these three regimes correspond to real experimental situations, or are they just mathematical fictions? Interestingly, the infusion of the neurotrophic factors brain-derived neurotrophic factor (BDNF) or NT-4/5 (Cabelli, Hohn, & Shatz, 1995) or the blockade of their common, endogenous receptor trk-B (Cabelli, Shelton, Segal, & Shatz, 1997) both result in abolishing ODC development. In the former case, autoradiographic labeling reveals a higher-than-normal labeling, while in the latter case, the labeling is lower than normal. One interpretation of these results is that BDNF or NT-4/5 infusion causes a growth of afferent axonal arbors, while removing available BDNF or NT-4/5 from afferents causes their axonal arbors to atrophy. A similar influence of neurotrophic factors on axonal branching is also observed in the frog retinotectal system (Cohen-Cory & Fraser, 1995). Low and high γ , which cause the breakdown of ODC development, could therefore correspond to these experimental regimes. Moreover, low γ corresponds to very weak synapses, while high γ corresponds to strong synapses. Under an anatomical interpretation of synaptic strength, these would correspond to small and large axonal arbors, respectively. Intermediate values of γ then correspond to normal patterns of development. 
Given the capacity of neurotrophic factors rapidly to modulate synaptic transmission in the visual system (Akaneya, Tsumoto, & Hatanaka, 1996; Carmignoto, Pizzorusso, Tia, & Vicini, 1997; Sala et al., 1998), we can thus imagine a scenario in which neurotrophic factors dynamically determine the parameter γ in our model of STDP. In conclusion, we have extended our earlier analysis of a stochastic model of STDP (Appleby & Elliott, 2005) to include an examination of multispike interactions. We have found that a consideration of multispike interactions is sufficient to endow our model with a fixed-point structure consistent with the presence of stable, competitive dynamics. In contrast, at least in our formulation, two-spike interactions by themselves cannot give rise to competitive dynamics. Multispike interactions therefore appear critical to understanding the presence of stable, competitive dynamics under STDP.

Acknowledgments

P.A.A. thanks the University of Southampton for the support of a studentship.
References

Akaneya, Y., Tsumoto, T., & Hatanaka, H. (1996). Brain-derived neurotrophic factor blocks long-term depression in the rat visual cortex. J. Neurophysiol., 76, 4198–4201.
Appleby, P. A., & Elliott, T. (2005). Synaptic and temporal ensemble interpretation of spike timing dependent plasticity. Neural Comput., 17, 2316–2336.
Appleby, P. A., & Elliott, T. (2006). Multi-spike interactions in a stochastic model of spike timing dependent plasticity. Neural Comput. (In press.)
Bi, G. Q., & Poo, M. M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.
Bliss, T. V. T., & Lømo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol., 232, 331–356.
Cabelli, R. J., Hohn, A., & Shatz, C. J. (1995). Inhibition of ocular dominance column formation by infusion of NT-4/5 or BDNF. Science, 267, 1662–1666.
Cabelli, R. J., Shelton, D. L., Segal, R. A., & Shatz, C. J. (1997). Blockade of endogenous ligands of trkB inhibits formation of ocular dominance columns. Neuron, 19, 63–76.
Carmignoto, G., Pizzorusso, T., Tia, S., & Vicini, S. (1997). Brain-derived neurotrophic factor and nerve growth factor potentiate excitatory synaptic transmission in the rat visual cortex. J. Physiol., 498, 153–164.
Castellani, G. C., Quinlan, E. M., Cooper, L. N., & Shouval, H. Z. (2001). A biophysical model of bidirectional synaptic plasticity: Dependence on AMPA and NMDA receptors. Proc. Natl. Acad. Sci. USA, 98, 12772–12777.
Cohen-Cory, S., & Fraser, S. E. (1995). Effects of brain-derived neurotrophic factor on optic axon branching and remodelling in vivo. Nature, 378, 192–196.
Dudek, S. M., & Bear, M. F. (1992). Homosynaptic long-term depression in area CA1 of hippocampus and effects of N-methyl-D-aspartate receptor blockade. Proc. Natl. Acad. Sci. USA, 89, 4363–4367.
Elliott, T., & Shadbolt, N. R. (1998). Competition for neurotrophic factors: Ocular dominance columns. J. Neurosci., 18, 5850–5858.
Froemke, R. C., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.
Goodhill, G. J. (1993). Topography and ocular dominance: A model exploring positive correlations. Biol. Cybern., 69, 109–118.
Gustafsson, B., Wigström, H., Abraham, W. C., & Huang, Y.-Y. (1987). Long-term potentiation in the hippocampus using depolarizing current pulses as the conditioning stimulus to single volley synaptic potentials. J. Neurosci., 7, 774–780.
Izhikevich, E. M., & Desai, N. S. (2003). Relating STDP to BCM. Neural Comput., 15, 1511–1523.
Karmarkar, U. R., & Buonomano, D. V. (2002). A model of spike-timing dependent plasticity: One or two coincidence detectors? J. Neurophysiol., 88, 507–513.
Löwel, S. (1994). Ocular dominance column development: Strabismus changes the spacing of adjacent columns in cat visual cortex. J. Neurosci., 14, 7451–7468.
O’Connor, D. H., Wittenberg, G. M., & Wang, S. S.-H. (2005). Graded bidirectional synaptic plasticity is composed of switch-like unitary events. Proc. Natl. Acad. Sci. USA, 102, 9679–9684.
Peterson, C. C. H., Malenka, R. C., Nicoll, R. A., & Hopfield, J. J. (1998). All-or-none potentiation at CA3-CA1 synapses. Proc. Natl. Acad. Sci. USA, 95, 4732–4737.
Purves, D., & Lichtman, J. W. (1985). Principles of neural development. Sunderland, MA: Sinauer.
Roberts, P. D., & Bell, C. C. (2002). Spike timing dependent synaptic plasticity in biological systems. Biol. Cybern., 87, 392–403.
Sala, R., Viegi, A., Rossi, F. M., Pizzorusso, T., Bonanno, G., Raiteri, M., & Maffei, L. (1998). Nerve growth factor and brain-derived neurotrophic factor increase neurotransmitter release in the rat visual cortex. Europ. J. Neurosci., 10, 2185–2191.
Schuett, S., Bonhoeffer, T., & Hübener, M. (2001). Pairing-induced changes of orientation maps in cat visual cortex. Neuron, 32, 325–337.
Shouval, H. Z., Bear, M. F., & Cooper, L. N. (2002). A unified model of NMDA receptor-dependent bidirectional synaptic plasticity. Proc. Natl. Acad. Sci. USA, 99, 10831–10836.
Sjöström, P. J., Turrigiano, G. G., & Nelson, S. B. (2001). Rate, timing and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164.
Song, S., Miller, K., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat. Neurosci., 3, 919–926.
Tieman, S. B., & Tumosa, N. (1997). Alternating monocular exposure increases the spacing of ocularity domains in area 17 of cats. Visual Neurosci., 14, 929–938.
van Ooyen, A. (2003). Modeling neural development. Cambridge, MA: MIT Press.
van Rossum, M. C. W., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neurosci., 20, 8812–8821.
Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M. M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.
Received October 21, 2005; accepted March 27, 2006.
LETTER
Communicated by Maneesh Sahani
A State-Space Analysis for Reconstruction of Goal-Directed Movements Using Neural Signals

Lakshminarayan Srinivasan
[email protected]
Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Charlestown, MA 02129, and Laboratory for Information and Decision Systems, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, and Division of Health Sciences and Technology, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, and Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, U.S.A.
Uri T. Eden [email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Charlestown, MA 02129, and Harvard/MIT Division of Health Sciences and Technology, Cambridge, MA 02139, U.S.A.
Alan S. Willsky [email protected] Laboratory for Information and Decision Systems, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
Emery N. Brown [email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Charlestown, MA 02129, and Division of Health Sciences and Technology, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, and Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, U.S.A.
The execution of reaching movements involves the coordinated activity of multiple brain regions that relate variously to the desired target and a path of arm states to achieve that target. These arm states may represent positions, velocities, torques, or other quantities. Estimation has been previously applied to neural activity in reconstructing the target separately from the path. However, the target and path are not independent.

Neural Computation 18, 2465–2494 (2006)
© 2006 Massachusetts Institute of Technology
Because arm movements are limited by finite muscle contractility, knowledge of the target constrains the path of states that leads to the target. In this letter, we derive and illustrate a state equation to capture this basic dependency between target and path. The solution is described for discrete-time linear systems and gaussian increments with known target arrival time. The resulting analysis enables the use of estimation to study how brain regions that relate variously to target and path together specify a trajectory. The corresponding reconstruction procedure may also be useful in brain-driven prosthetic devices to generate control signals for goal-directed movements.

1 Introduction

An arm reach can be described by a number of factors, including the desired hand target and the duration of the movement. We reach when moving to pick up the telephone or lift a glass of water. The duration of a reach can be specified explicitly (Todorov & Jordan, 2002) or emerge implicitly from additional constraints such as target accuracy (Harris & Wolpert, 1998). Arm kinematics and dynamics during reaching motion have been studied through their correlation with neural activity in related brain regions, including motor cortex (Moran & Schwartz, 1999), posterior parietal cortex (Andersen & Buneo, 2002), basal ganglia (Turner & Anderson, 1997), and cerebellum (Greger, Norris, & Thach, 2003). Separate studies have developed control models to describe the observed movements without regard to neural activity (Todorov, 2004a). An emerging area of interest is the fusion of these two approaches to evaluate neural activity in terms of the control of arm movement to target locations (Todorov, 2000; Kemere & Meng, 2005). While several brain areas have been implicated separately in the planning and execution of reaches, further study is necessary to elucidate how these regions coordinate their electrical activity to achieve the muscle activation required for reaching.
In this letter, we develop state-space estimation to provide a unified framework to evaluate reach planning and execution-related activity. Primate electrophysiology during reaching movements has focused on primary motor cortex (M1) and posterior parietal cortex (PPC), regions that represent elements of path and target, respectively. Lesion studies previously identified M1 with motor execution (Nudo, Wise, SiFuentes, & Milliken, 1996) and PPC with movement planning (Geschwind & Damasio, 1985). Several experiments have characterized the relationship between M1 neuronal activity and arm positions and velocities (Georgopoulos, Kalaska, Caminiti, & Massey, 1982; Schwartz, 1992; Moran & Schwartz, 1999; Paninski, Fellows, Hatsopoulos, & Donoghue, 2004), and forces (Georgopoulos, Ashe, Smyrnis, & Taira, 1992; Taira, Boline, Smyrnis, Georgopoulos, & Ashe, 1995; Li, Padoa-Schioppa, & Bizzi, 2001). PPC is described as relating broadly to the formation of intent and specifically to the transformation of sensory cues into movement goals (Andersen &
Buneo, 2002). More recent experiments are beginning to elucidate the role of premotor cortical areas in motion planning and execution (Schwartz, Moran, & Reina, 2004), including interactions with PPC (Wise, Boussaoud, Johnson, & Caminiti, 1997). Explicit regression analyses have also been performed to relate motor cortical activity to features of both target and path (Fu, Suarez, & Ebner, 1993; Ashe & Georgopoulos, 1994). In parallel, theoretical models for the planning and execution of reaches have been developed that incorporate concepts from control engineering and robotics. A common starting point is the state equation, a differential equation that describes how the arm moves due to passive sources like joint tension and user-controlled forces such as muscle activation. The state equation is used to prescribe a path or a sequence of forces to complete the reach based on the minimization of some cost function that depends on variables such as energy, accuracy, or time. Many reach models specify control sequences computed prior to movement that assume a noise-free state equation and perfect observations of arm state (Hogan, 1984; Uno, Kawato, & Suzuki, 1989; Nakano et al., 1999). The execution of trajectories planned by these models can be envisioned in the face of random perturbations by equilibrium-point control, where each prescribed point in the trajectory is sequentially made steady with arm tension. Recently, reach models have been developed that explicitly account for noisy dynamics and observations (Harris & Wolpert, 1998; Todorov, 2004b). Based on stochastic optimal control theory, the most recent arm models (Todorov & Jordan, 2002; Todorov, 2004b) choose control forces based on estimates of path history and cost-to-go, the price associated with various ways of completing the reach. A general review of control-based models is provided in Todorov (2004a).
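As a concrete instance of the cost-minimization framing described above, the following sketch solves a finite-horizon linear-quadratic regulator for a one-dimensional point mass reaching a target in fixed time. This is a generic textbook construction, not any of the cited models; all parameter values are illustrative.

```python
import numpy as np

# 1D point mass: state x = [position, velocity], control u = force.
dt = 0.01
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
T = 100                       # reach horizon, in steps
R = 1e-4                      # control-effort penalty
QT = np.diag([1e4, 1e2])      # terminal penalty: end at the target, at rest

def lqr_gains(A, B, QT, R, T):
    """Finite-horizon LQR gains via the backward Riccati recursion
    (terminal state cost only; running cost is on control effort)."""
    P, Ks = QT, []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = A.T @ P @ (A - B @ K)
        Ks.append(K)
    return Ks[::-1]

Ks = lqr_gains(A, B, QT, R, T)
target = np.array([1.0, 0.0])   # reach to position 1, ending at rest
x = np.array([0.0, 0.0])
for K in Ks:
    u = -K @ (x - target)       # feedback on the error from the target
    x = A @ x + (B @ u).ravel()
print(x)   # final state ends near the target [1, 0]
```

Because the target state (nonzero position, zero velocity) is an equilibrium of the passive dynamics here, regulating the error to zero steers the reach; richer arm models would replace A, B, and the cost with muscle dynamics and accuracy terms.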
Estimation has been used to relate neural activity with aspects of free arm movements (Georgopoulos, Kettner, & Schwartz, 1988; Paninski et al., 2004). Alternate models of neural response in a specific brain region can be compared by mean squared error (MSE). Reconstruction of a measured parameter is one way to characterize neural activity in a brain region. Learning rates can be related explicitly and simultaneously to continuous and discrete behavioral responses using an estimation framework (Smith et al., 2004). Mutual information is a related alternative that has been prevalent in the characterization of neural responses to sensory stimuli (Warland, Reinagel, & Meister, 1997). Both MSE and conditional entropy (calculated in determining mutual information) are functions of the uncertainty in an estimate given neural observations, and MSE rises with conditional entropy for gaussian distributions. These two methods were recently coupled to calculate the conditional entropy associated with recursively computed estimates on neural data (Barbieri et al., 2004). Estimation algorithms form the interface between brain and machine in the control of neural prosthetics, bearing directly on the clinical treatment of patients with motor deficits. Prototype systems have employed either estimation of free arm movement (Carmena et al., 2003; Taylor, Tillery, & Schwartz, 2002; Wu, Shaikhouni, Donoghue, & Black, 2004) or
2468
L. Srinivasan, U. Eden, A. Willsky, and E. Brown
target location (Musallam, Corneil, Greger, Scherberger, & Andersen, 2004; Santhanam, Ryu, Yu, Afshar, & Shenoy, 2005). Most recently, several estimation procedures were proposed to combine these two approaches and specifically facilitate reaching movements for brain-controlled prosthetics (Srinivasan, Eden, Willsky, & Brown, 2005; Cowan & Taylor, 2005; Yu, Santhanam, Ryu, & Shenoy, 2005; Kemere & Meng, 2005). Two probability densities are used implicitly in estimation. The first density describes the probability of neural activity conditioned on relevant covariates like stimulus intensities or kinematic variables. This density arises through the observation equation in estimation and as an explicit function in information-theoretic measurements. The second density describes the interdependence of the relevant covariates before any neural activity is recorded. This density arises through the state equation in estimation and as a prior on stimulus values in the information-theoretic characterization of sensory neurons. In experiments that calculate mutual information between neural activity and independent stimulus parameters, this second probability density is commonly chosen to be uniform. In the study of reaching movements, the complete prior density on target and path variables cannot be uniform because the target and the path state at all times in the trajectory are dependent. A state equation naturally expresses these constraints and serves as a point of departure for analysis based on estimation. In this letter, we develop a discrete-time state equation that relates target state and path states under weak assumptions about a reach. Specifically, the result represents the extension of the linear state-space description of free arm movement with no additional constraints. The states of the target or path refer to any vector of measurements of the arm at a particular point in time, such as joint torque, joint angle, hand velocity, and elbow position. 
This method supports arbitrary order, time-varying linear difference equations, which can be used to approximate more complicated state equation dynamics. The approach is based on the continuous-time results by Castanon, Levy, and Willsky (1985) in surveillance theory and draws on the discrete time derivation of a backward Markov process described by Verghese and Kailath (1979). Unlike existing theoretical models of reaching movement, we do not begin with an assumed control model or employ cost functions to constrain a motion to target. The resulting reach state equation is a probabilistic description of all paths of a particular temporal duration that start and end at states specified with uncertainty. We first develop a form of the reach state equation that incorporates one prescient observation on the target state. We then extend this result to describe an augmented state equation that includes the target state itself. This augmented state equation supports recursive estimates of path and target that fully integrate ongoing neural observations of path and target. Sample trajectories from the reach state equation are shown. We then demonstrate the estimation of reaching movements by incorporating the reach state equation into a point process filter (Eden, Frank, Barbieri, Solo, & Brown,
Reconstruction of Goal-Directed Movements
2004). We conclude by discussing the applicability of our approach to the study of motion planning and execution, as well as to the control of neural prosthetics.

2 Theory

2.1 State Equation to Support Observations of Target Before Movement. The objective in this section is to construct a state equation for reaching motions that combines one observation of the target before movement with a general linear state equation for free arm movement. The resulting state equation enables estimation of the arm path that is informed by concurrent observations and one target-predictive observation, such as neural activity from brain regions related to movement execution and target planning, respectively. We begin with definitions and proceed with the derivation. A reach of duration T time steps is defined as a sequence of vector random variables (x0, . . . , xT) called a trajectory. The state variable xt represents any relevant aspects of the arm at time sample t, such as position, velocity, and joint torque. The target xT is the final state in the trajectory. While we conventionally think of a target as a resting position for the arm, xT more generally represents any condition on the arm at time T, such as movement drawn from a particular probability distribution of velocities. For simplicity, we restrict our trajectory to be a Gauss-Markov process. This means that the probability density on the trajectory p(x0, . . . , xT) is jointly gaussian and that the probability density of the state at time t conditioned on all previous states p(xt|x0, . . . , xt−1) equals p(xt|xt−1), the state transition density. Although more general probability densities might be considered, these special restrictions are sufficient to allow for history dependency of arbitrary length. This is commonly accomplished by including the state at previous points in time in an augmented state vector (Kailath, Sayed, & Hassibi, 2000).
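The augmentation trick can be made concrete: a second-order (two-step-history) dynamic becomes first-order, and hence Markov, by stacking the previous state into the state vector. A minimal sketch with illustrative coefficients (not taken from the letter):

```python
import numpy as np

# Second-order scalar dynamic x_t = a1*x_{t-1} + a2*x_{t-2} + w_t,
# rewritten as a first-order (Markov) system on z_t = [x_t, x_{t-1}]'.
a1, a2 = 1.5, -0.6          # illustrative coefficients
A = np.array([[a1,  a2],
              [1.0, 0.0]])  # second row simply shifts x_{t-1} forward

# one noise-free step from z = [x_1, x_0]' = [2, 1]'
z_next = A @ np.array([2.0, 1.0])   # first entry is 1.5*2 - 0.6*1 = 2.4
```

The same device applies to any finite history length: stack as many past states as the difference equation requires.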
Figure 1A is a schematic representation of the trajectory and the target observation, emphasizing that the prescient observation of target yT is related to the trajectory states xt only through the target state xT . The conditional densities of the Gauss-Markov model can alternatively be specified with observation and state equations. For a free arm movement, the state transition density p(xt |xt−1 ) can be described by a generic linear time-varying multidimensional state equation, xt = At xt−1 + wt ,
(2.1)
where the stochastic increment wt is a zero-mean gaussian random variable with E[wt wτ'] = Qt δt−τ. The initial position x0 is gaussian distributed with mean m0 and covariance Σ0. The prescient observation yT of the target state xT is corrupted by independent zero-mean gaussian noise vT with
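Equations 2.1 and 2.2 are straightforward to simulate. A minimal sketch (the position/velocity discretization and all numerical values here are our illustrative choices, not taken from the letter):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_free_reach(A, Q, m0, S0, Sigma_T, T):
    """Draw x_0..x_T from the free movement state equation 2.1 and a
    noisy prescient target observation y_T from equation 2.2."""
    x = rng.multivariate_normal(m0, S0)
    traj = [x]
    for _ in range(T):
        x = A @ x + rng.multivariate_normal(np.zeros(len(m0)), Q)
        traj.append(x)
    y_T = traj[-1] + rng.multivariate_normal(np.zeros(len(m0)), Sigma_T)
    return np.array(traj), y_T

# state [x, y, vx, vy]: position integrates velocity, noise enters velocity
dt = 0.01
A = np.eye(4); A[0, 2] = A[1, 3] = dt
Q = np.diag([0.0, 0.0, 1e-4, 1e-4])
traj, y_T = simulate_free_reach(A, Q, np.zeros(4), 1e-6 * np.eye(4),
                                1e-6 * np.eye(4), T=200)
```

Because nothing constrains the endpoint, trajectories drawn this way wander; conditioning on y_T, developed next, is what turns this into a reach model.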
Figure 1: Alternate representations of a reaching trajectory and one observation on target. In the Markov model (A), circles represent the state of the arm at various times, and the arrangement of directed arrows indicates that the state of the arm at time t is independent of all previous states conditioned on knowledge of the state at time t − 1. Accordingly, the only state pointing to yT, the prescient observation of target, is the target state xT itself. In the system diagram (B), the specific evolution of the arm movement is described. Consistent with the state equation, the arm state xt−1 evolves to the next state in time xt through the system matrix At, with additive noise wt that represents additional uncertainty in aspects of the arm movement that are not explained by the system matrix. The diagram also specifies that the observation yT of the target state xT is corrupted by additive noise vT.
covariance ΣT that denotes the uncertainty in target position:

yT = xT + vT.  (2.2)
The state equation coupled with this prescient observation is described schematically in Figure 1B. Restated, our objective is to represent the free movement state equation together with the prescient observation on target, as a Gauss-Markov model on an equivalent set of trajectory states xt conditioned on yT for
t = 0, 1, . . . , T. The consequent reach state equation is of the form

xt = At xt−1 + ut + εt,  (2.3)

where ut is a drift term corresponding to the expected value of wt|xt−1, yT, and the εt are a new set of independent, zero-mean gaussian increments whose covariances correspond to that of wt|xt−1, yT. This reach state equation generates a new probability density on the trajectory of states that corresponds to the probability of the original states conditioned on the prescient observation, p(x0, . . . , xT|yT). To derive this reach state equation, we calculate the state transition probability density p(xt|xt−1, yT). Because wt is the only stochastic component of the original state equation, the new state transition density is specified by p(wt|xt−1, yT). To compute this distribution, we use the conditional density formula for jointly gaussian random variables on the joint density p(wt, yT|xt−1). The resulting distribution is itself gaussian, with mean and variance given by

ut = E[wt|xt−1, yT] = E[wt|xt−1] + cov(wt, yT|xt−1) cov⁻¹(yT, yT|xt−1)(yT − E[yT|xt−1])  (2.4)

Q̂t = cov(wt|xt−1, yT) = cov(wt|xt−1) − cov(wt, yT|xt−1) cov⁻¹(yT, yT|xt−1) cov'(wt, yT|xt−1).  (2.5)

The mean ut corresponds identically to the linear least-squares estimate of wt|xt−1, yT, and the variance Q̂t equals the uncertainty in this estimate. The covariance terms in equations 2.4 and 2.5 can be computed from the following equation that relates wt to yT given xt−1,
yT = φ(T, t − 1)xt−1 + Σ_{i=t}^{T} φ(T, i)wi + vT,  (2.6)

where φ(t, s) denotes the state transition matrix that advances the state at time s to time t,

φ(t, s) = I for t = s, and φ(t, s) = ( ∏_{i=1+min(t,s)}^{max(t,s)} Ai )^sign(t−s) for t ≠ s.  (2.7)
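The transition matrix of equation 2.7 is just a (possibly inverted) product of the At. A sketch, assuming each At is invertible so that backward transitions are defined:

```python
import numpy as np

def phi(t, s, A_list):
    """phi(t, s) of equation 2.7: advances the state from time s to time t.
    A_list[i] holds A_i for i = 1..T (index 0 is unused)."""
    n = A_list[1].shape[0]
    if t == s:
        return np.eye(n)
    lo, hi = min(t, s), max(t, s)
    P = np.eye(n)
    for i in range(lo + 1, hi + 1):
        P = A_list[i] @ P        # builds A_hi ... A_{lo+1}
    return P if t > s else np.linalg.inv(P)

A_list = [None] + [np.array([[2.0]])] * 4   # scalar A_1..A_4, each doubling
```

For example, phi(3, 1, A_list) is A3 A2 = [[4.0]], and phi(1, 3, A_list) is its inverse, [[0.25]]; the composition property phi(t, r) = phi(t, s) phi(s, r) follows directly.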
The covariance terms are accordingly given by

cov(wt|xt−1) = Qt  (2.8)
cov(wt, yT|xt−1) = Qt φ'(T, t)  (2.9)
cov(yT, yT|xt−1) = ΣT + Σ_{i=t}^{T} φ(T, i)Qi φ'(T, i).  (2.10)

For notational convenience, define the following quantity:

Σ(t, T) = φ(t, T)ΣT φ'(t, T) + Σ_{i=t}^{T} φ(t, i)Qi φ'(t, i).  (2.11)
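The quantity in equation 2.11 can be evaluated directly from its definition. A sketch for time-invariant dynamics, where phi(t, s) reduces to a matrix power (illustrative scalar values; note that at t = T the result reduces to ΣT + QT, as in equation 2.17):

```python
import numpy as np

def sigma(t, T, A, Q, Sigma_T):
    """Equation 2.11 for time-invariant, invertible dynamics, where
    phi(a, b) = A^(a-b)."""
    def phi(a, b):
        P = np.linalg.matrix_power(A, abs(a - b))
        return P if a >= b else np.linalg.inv(P)
    S = phi(t, T) @ Sigma_T @ phi(t, T).T
    for i in range(t, T + 1):
        S += phi(t, i) @ Q @ phi(t, i).T
    return S

# scalar check with A = 1, Q = 2, Sigma_T = 3
A1, Q1, ST = np.array([[1.0]]), np.array([[2.0]]), np.array([[3.0]])
```

With these values, sigma(5, 5, A1, Q1, ST) returns [[5.0]] (that is, ΣT + QT), and sigma(4, 5, A1, Q1, ST) returns [[7.0]], since one extra increment enters the sum.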
Simplifying and substituting into equations 2.4 and 2.5, we obtain the mean and covariance of the old increment given the target observation:

ut = Qt Σ⁻¹(t, T)φ(t, T)[yT − φ(T, t − 1)xt−1]  (2.12)
Q̂t = Qt − Qt Σ⁻¹(t, T)Qt.  (2.13)
The density on the initial state conditioned on the target observation is calculated similarly. The resulting mean and variance of the initial state are given by

Σ̂0 = (Σ0⁻¹ + Σ⁻¹(0, T))⁻¹  (2.14)
E[x0|yT] = Σ̂0 (Σ0⁻¹ m0 + Σ⁻¹(0, T)φ(0, T)yT).  (2.15)
A recursion can be obtained for equation 2.11 by writing Σ(t − 1, T) in terms of Σ(t, T):

Σ(t − 1, T) = φ(t − 1, t)Σ(t, T)φ'(t − 1, t) + φ(t − 1, t)Qt−1 φ'(t − 1, t)  (2.16)

with

Σ(T, T) = ΣT + QT.  (2.17)
Complementing the new initial conditions 2.14 and 2.15, the reach state equation can be written in various equivalent forms. The following form emphasizes that the old increment wt has been broken into the estimate ut of wt from yT and the remaining uncertainty εt:

xt = At xt−1 + ut + εt  (2.18)
εt ∼ N(0, Q̂t)  (2.19)
with ut as given in equation 2.12 and εt distributed as a zero-mean gaussian with covariance Q̂t. This form is suggestive of stochastic control, where ut is the control input that examines the state xt−1 and generates a force to place the trajectory on track to meet the observed target. Nevertheless, this form emerges purely from conditioning the free movement state equation on the target observation rather than from any specific biological modeling of motor control. Note critically that ut is a function of xt−1, so the covariance update in a Kalman filter implementation should not ignore this term. Alternatively, we can group the xt−1 terms. This form is more conducive to the standard equations for the Kalman filter prediction step:

xt = Bt xt−1 + ft + εt  (2.20)
Bt = [I − Qt Σ⁻¹(t, T)]At  (2.21)
ft = Qt Σ⁻¹(t, T)φ(t, T)yT.  (2.22)
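Equations 2.13, 2.21, and 2.22 assemble into one helper. A sketch, again assuming time-invariant invertible dynamics, with illustrative scalar values:

```python
import numpy as np

def reach_params(t, T, A, Q, Sigma_T, y_T):
    """Return B_t (eq. 2.21), f_t (eq. 2.22), and Qhat_t (eq. 2.13),
    with phi(a, b) = A^(a-b) and Sigma(t, T) from equation 2.11."""
    def phi(a, b):
        P = np.linalg.matrix_power(A, abs(a - b))
        return P if a >= b else np.linalg.inv(P)
    S = phi(t, T) @ Sigma_T @ phi(t, T).T
    for i in range(t, T + 1):
        S += phi(t, i) @ Q @ phi(t, i).T
    S_inv = np.linalg.inv(S)
    B_t = (np.eye(A.shape[0]) - Q @ S_inv) @ A
    f_t = Q @ S_inv @ phi(t, T) @ y_T
    Qhat_t = Q - Q @ S_inv @ Q
    return B_t, f_t, Qhat_t

# scalar example at the final step t = T = 5, with A = Q = Sigma_T = 1
B, f, Qh = reach_params(5, 5, np.array([[1.0]]), np.array([[1.0]]),
                        np.array([[1.0]]), np.array([2.0]))
```

Here Σ(T, T) = 2, so B = [[0.5]], f = [1.0], and Qhat = [[0.5]]: the prediction 0.5 x_{T−1} + 0.5 y_T averages the free-movement prediction with the observed target, and the increment variance is halved by the target information.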
In both forms, the resulting reach state equation remains linear with independent gaussian errors εt, as detailed in the appendix. Because xt is otherwise dependent on xt−1 or constants, we conclude that the reach state equation in 2.18 or 2.20 is a Markov process.

2.2 Augmented State Equation to Support Concurrent Estimation of Target. Building on the previous result, we can now construct a more versatile state equation that supports path and target estimation with concurrent observations of path and target. The previous reach state equation incorporates prescient target information into a space of current arm state xt. We now augment the state space to include the target random variable xT. According to this model, the state of the arm at time t is explicitly determined by the target and the state of the arm at time t − 1. The reach state equation derived above suggests an approach to calculating the state transition density p(xt, xT|xt−1, xT) that corresponds to an augmented state equation. Because xT is trivially independent of xt conditioned on xT, we can equivalently calculate the transition density of p(xt|xt−1, xT). This is identical to the reach state equation derivation of p(xt|xt−1, yT) with vT set to zero. The resulting state equation can be consolidated into vector notation to give the augmented form:
[xt; xT] = [B̃t F̃t; 0 I][xt−1; xT] + [εt; 0]  (2.23)

B̃t = Bt  (2.24)
F̃t = Qt Σ⁻¹(t, T)φ(t, T)  (2.25)
ΣT = 0,  (2.26)

where Bt is as in equation 2.21 and Σ(t, T) in equations 2.24 and 2.25 is computed with the target covariance set to zero (equation 2.26), since the augmented state carries the true target xT rather than the noisy observation yT.
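In code, the transition matrix of equation 2.23 is a block matrix whose target row is the identity. A toy sketch with illustrative scalar blocks:

```python
import numpy as np

def augmented_transition(B_t, F_t):
    """Block transition matrix of equation 2.23, acting on [x_t; x_T]:
    the bottom row copies the target forward unchanged."""
    n = B_t.shape[0]
    return np.block([[B_t, F_t],
                     [np.zeros((n, n)), np.eye(n)]])

M = augmented_transition(np.array([[0.5]]), np.array([[0.5]]))
z = M @ np.array([0.0, 2.0])   # state pulled halfway to the target; target kept
```

The first component of z is 0.5*0 + 0.5*2 = 1.0, while the target component remains 2.0, which is exactly the conservation the identity block encodes.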
The initial condition on the augmented state [x0, xT]' is the joint distribution that corresponds to our uncertainty as external observers about the true starting and target states chosen by the brain at time zero. This augmented state equation confers additional features over the reach state equation. First, observations of the target can be incorporated throughout the duration of the reach to improve arm reconstructions. In contrast, the reach state equation incorporated one target observation before movement. Second, refined estimates of the target can be generated recursively as estimates become more informed by reach and target-related activity.

3 Results

3.1 Sample Trajectories. We proceed to illustrate the underlying structure of a reach for our goal-directed state equation, which appropriately constrains a general linear state equation to an uncertain target. We also explain how the underlying reach structure is affected by parameters of the model: reach duration, the target state observation, and target uncertainty. The density on the set of trajectories, p(xt, xt−1, . . . , x0|yT), can be calculated by iteratively multiplying the transition densities p(xt|xt−1, yT) given by the state equation. This density represents our assumptions about the trajectory before receiving additional observations of neural activity during the reach. Broader probability densities on the set of trajectories imply weaker assumptions about the specific path to be executed. We can visually examine the structure of our assumptions by plotting samples from the density on trajectories as well as the average trajectory. Sample trajectories are generated by drawing increments εt from the density specified in equation 2.19. The simulated increments are accumulated at each step with At xt−1 + ut, the deterministic component of the state equation 2.18. The resulting trajectory represents a sample drawn from p(xt, xt−1, . . . , x0|yT), the probability density on trajectories.
The average trajectory is generated from the same procedure, except that the increments εt are set to their means, which equal zero. We first examine sample trajectories that result from small changes in model parameters. For illustration, the states were taken to be vectors [x, y, vx , v y ]t , representing position and velocity in each of two orthogonal directions. The original noise covariance was nonzero in the entries corresponding to velocity increment variances:
Q = [0 0 0 0; 0 0 0 0; 0 0 q 0; 0 0 0 q].  (3.1)
Figure 2: Sample trajectories (gray) and the true mean trajectory (black) corresponding to the reach state equation for various parameter choices. Appropriate changes in model parameters increase the observed diversity of trajectories, making the state equation a more flexible prior in reconstructing arm movements from neural signals. Parameter choices (detailed in section 3.1) were varied from (A) baseline, including (B) smaller distance to target, (C) increased time to target, and (D) increased increment uncertainty.
The uncertainty in target state ΣT was also diagonal, with

ΣT = [r 0 0 0; 0 r 0 0; 0 0 p 0; 0 0 0 p].  (3.2)

In Figure 2, sample trajectories from the reach state equation are generated with baseline parameters (see Figure 2A) from which distance to target, reach duration, and increment uncertainty have been individually changed (see Figures 2B–2D). The baseline model parameters are given in Table 1. Parameters were individually altered from baseline as shown in Table 2.
Table 1: Sample Trajectory: Baseline Model Parameters.

Parameter                          Baseline Value
Reach distance                     0.35 m
Time step                          0.01 sec.
Noise covariance (q)               1e-4 m2
Reach duration                     2 sec.
Target position uncertainty (r)    1e-6 m2
Target velocity uncertainty (p)    1e-6 m2

Table 2: Sample Position Trajectory: Altered Model Parameters.

Parameter               Altered Value   Graph
Reach distance          0.25 m          Figure 2B
Reach duration          4 sec.          Figure 2C
Noise covariance (q)    3e-4 m2         Figure 2D
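The sampling procedure just described can be sketched end to end with the Table 1 baseline values. The position/velocity discretization, the start exactly at the origin, and the target placed 0.35 m away along x are our illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
dt, q, r, p = 0.01, 1e-4, 1e-6, 1e-6            # Table 1 baseline values
T = 200                                          # 2 s reach in 10 ms steps
A = np.eye(4); A[0, 2] = A[1, 3] = dt            # state [x, y, vx, vy]
Q = np.diag([0.0, 0.0, q, q])                    # equation 3.1
Sigma_T = np.diag([r, r, p, p])                  # equation 3.2
y_T = np.array([0.35, 0.0, 0.0, 0.0])            # observed target state

def phi(a, b):
    P = np.linalg.matrix_power(A, abs(a - b))
    return P if a >= b else np.linalg.inv(P)

def sigma(t):                                    # equation 2.11
    S = phi(t, T) @ Sigma_T @ phi(t, T).T
    for i in range(t, T + 1):
        S += phi(t, i) @ Q @ phi(t, i).T
    return S

x, traj = np.zeros(4), [np.zeros(4)]             # start exactly at the origin
for t in range(1, T + 1):
    Si = np.linalg.inv(sigma(t))
    u = Q @ Si @ phi(t, T) @ (y_T - phi(T, t - 1) @ x)   # equation 2.12
    Qhat = Q - Q @ Si @ Q                                # equation 2.13
    x = A @ x + u + rng.multivariate_normal(np.zeros(4), Qhat,
                                            check_valid="ignore")
    traj.append(x)
traj = np.array(traj)                            # ends near the observed target
```

Repeating the loop with fresh noise draws produces the gray sample trajectories of Figure 2; setting the increments to zero produces the mean trajectory.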
In Figure 3, sample trajectories are plotted for increasing uncertainty (r) in target position, with variances (A) 1e-4, (B) 1e-3, (C) 1e-2, and (D) 1e-1 m2. This corresponds to scenarios in which observations of neural activity before movement initiation provide estimates of target position with varying certainty. Figures 4A to 4C examine the velocity profiles in one direction generated by the reach state equation with various parameter choices. Velocity profiles from the baseline trajectory are displayed in Figure 4A, and parameters are sequentially altered from the baseline values (see Figures 4B and 4C) as shown in Table 3. Figure 4D examines the effect of target information on uncertainty in the velocity increment. The magnitude of one diagonal velocity term of the noise covariance Q̂t is plotted over the duration of the reach for comparison against the noise covariance Qt of the corresponding free movement state equation.

3.2 Reconstructing Arm Movements During a Reach. The reach state equation can be incorporated into any estimation procedure based on probabilistic inference since it represents a recursively computed prior. Because the reach state equation minimally constrains the path to the target observation, it may be useful in the analysis of coordinated neural activity with respect to planning and execution. We illustrate the reconstruction of reaching movements from simulated neural activity using a point process filter (Eden, Frank, et al., 2004), an estimation procedure that is conducive to the description of spiking activity in particular. The extension to variants of the Kalman filter is also direct, because the reach state equation, 2.20, is written in standard Kalman filter notation.
Figure 3: Sample trajectories (gray) and the true mean trajectory (black) of the reach state equation corresponding to various levels of uncertainty about target arm position. Variance in the noise vT of the prescient observation yT is progressively increased from (A) 1e-4, to (B) 1e-3, (C) 1e-2, and (D) 1e-1 m2 . As target uncertainty grows, trajectories become more unrestricted, corresponding to increasing flexibility in the prior for reconstruction of arm movements.
We first simulated arm trajectories using the reach model as described in the previous section. For comparison, arm trajectories were also generated from a canonical model. This model was a family of movement profiles from which specific trajectories could be chosen that guaranteed arrival at the desired target location and time:
[x; y; vx; vy]t = [1 0 δ 0; 0 1 0 δ; 0 0 1 0; 0 0 0 1][x; y; vx; vy]t−1 + (π/T)² cos(πt/T) (1/(2δ)) [0; 0; xT − x0; yT − y0].  (3.3)
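A deterministic family of this kind can be sketched directly: a cosine acceleration profile yields a half-sine, bell-shaped speed and, up to discretization error that vanishes as T grows, delivers the commanded displacement. The discretization details below are our assumptions:

```python
import numpy as np

def canonical_reach(x0, xT, T, dt):
    """Deterministic reach in the spirit of equation 3.3: velocities
    accumulate cosine-shaped increments, positions integrate velocity."""
    pos = np.array(x0, dtype=float)
    vel = np.zeros_like(pos)
    d = np.asarray(xT, dtype=float) - pos
    traj = [pos.copy()]
    for t in range(1, T + 1):
        vel = vel + (np.pi / T) ** 2 * np.cos(np.pi * t / T) * d / (2 * dt)
        pos = pos + dt * vel
        traj.append(pos.copy())
    return np.array(traj)

traj = canonical_reach([0.0, 0.0], [0.25, 0.25], T=200, dt=0.01)
```

The cosine increments are positive in the first half of the reach and negative in the second, so the speed rises and falls symmetrically with its peak near the midpoint.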
Figure 4: Sample velocity trajectories (gray) and the true mean velocity trajectory (black) generated by the reach state equation. (A) For baseline parameters (detailed in section 3.1) with reach duration of 2 seconds, the velocity profile is roughly bell shaped. (B) As reach duration increases to 4 seconds, the trajectories become more varied. (C) If uncertainty in the observed target velocity and position is large (1e3 m2 for each variance), velocity trajectories resemble samples from the free movement state equation. (D) Uncertainty in the velocity increment decreases with time due to the prescient target observation (solid line) as compared to the original velocity increment of the corresponding free movement state equation (dashed line).
This deterministic equation relates the states [x, y, vx, vy]t to the time increment δ, the current time step t, and the distances in two orthogonal directions between the target and starting points, over T + 1 time steps. After generating trajectories, we simulated the corresponding multiunit spiking activity from 9 neurons, a typical ensemble size for recording from a focal, single layer of cortex (Buzsáki, 2004). Output from each unit in the ensemble was simulated independently as a point process with an instantaneous firing rate that was a function of the velocity. This function, referred to as the conditional intensity (Eden, Frank, et al., 2004), is equivalent to
Table 3: Sample Velocity Trajectory: Altered Model Parameters.

Parameter                                   Altered Value             Graph
Reach duration                              4 sec.                    Figure 4B
Target position and velocity uncertainty    r = 1e3 m2, p = 1e3 m2    Figure 4C

Table 4: Simulated M1 Activity: Receptive Field Parameters.

Parameter    Assignment or Interval
β0           2.28
β1           4.67 sec/m
θp           [−π, π]
specifying a receptive field. Our conditional intensity function is adapted from a model of primary motor cortex (Moran & Schwartz, 1999):

λ(t|vx, vy) = exp(β0 + β1 (vx² + vy²)^1/2 cos(θ − θp))  (3.4)
            = exp(β0 + α1 vx + α2 vy),  (3.5)
where vx and v y are velocities in orthogonal directions. The receptive field parameters were either directly assigned or drawn from uniform probability densities on specific intervals as shown in Table 4. The corresponding receptive fields had preferred directions between −π and π, background firing rates of 10 spikes per second, and firing rates of 24.9 spikes per second at a speed of 0.2 m per second in the preferred direction. Together with the simulated trajectory, this conditional intensity function specifies the instantaneous firing rate at each time sample based on current velocity. Spikes were then generated using the time rescaling theorem (Brown et al., 2002), where interspike intervals are first drawn from a single exponential distribution and then adjusted in proportion to the instantaneous firing rate. This method is an alternative to probabilistically thinning a homogeneous Poisson process. The simulated spiking activity served as the input observations for the point process filter, described extensively in Eden, Frank, et al. (2004). The two defining elements of this filter are the state equation and observation equation. Our state equation is the reach model and represents the dynamics of the variables we are estimating, specified by p(xt |xt−1 , y). Our observation equation is the receptive field of each neuron, specified by p(Nt |N1:t−1 , xt , yT ). This is the probability of observing Nt spikes at time t, given previous spike observations N1:t−1 , the current kinematic
state xt, and the observation of target yT. Because the spiking activity is described as a point process, the conditional intensity function specifies this observation density:

p(Nt|N1, . . . , Nt−1, xt, yT) ≈ exp[Nt log(λ(t|xt)δ) − λ(t|xt)δ],  (3.6)

where δ denotes the time increment. The formulation of a recursive estimation procedure from these two probability densities is the topic of Eden, Frank, et al. (2004). As with the Kalman filter, the resulting point process filter comprises a prediction step to compute p(xt|N1:t−1, yT) and an update step to compute p(xt|N1:t, yT). The reach state equation determines the mean and variance prediction steps of the point process filter, as given by

x̂t|t−1 = Bt x̂t−1|t−1 + ft  (3.7)
Σt|t−1 = Bt Σt−1|t−1 Bt' + Q̂t.  (3.8)

The update step remains unchanged:

Σ⁻¹t|t = Σ⁻¹t|t−1 + [(∂ log λ/∂xt)' [λδ] (∂ log λ/∂xt) − (Nt − λδ) ∂² log λ/∂xt ∂xt']  (3.9)

x̂t|t = x̂t|t−1 + Σt|t [(∂ log λ/∂xt)' (Nt − λδ)],  (3.10)

where λ and the derivatives of log λ are evaluated at x̂t|t−1.
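The simulation-and-decoding loop of this section can be sketched compactly. Spikes are drawn per bin as Bernoulli events from the cosine-tuned intensity of equations 3.4 and 3.5 (a small-bin approximation standing in for time rescaling), and the filter update exploits the log-linear form of the intensity, whose log has constant gradient and zero Hessian. For brevity, this sketch estimates velocity only and uses a random-walk prediction in place of the reach model; all numerical values are illustrative except β0 and β1, which are the Table 4 assignments:

```python
import numpy as np

rng = np.random.default_rng(3)
dt, T = 0.01, 200
beta0, beta1, theta_p = 2.28, 4.67, 0.0               # Table 4 (theta_p chosen)
alpha = beta1 * np.array([np.cos(theta_p), np.sin(theta_p)])

# truth: 0.2 m/s along the preferred direction -> about 24.9 spikes/s
v_true = np.array([0.2, 0.0])
lam_true = np.exp(beta0 + alpha @ v_true)
spikes = (rng.random(T) < lam_true * dt).astype(int)  # Bernoulli bins

x, W = np.zeros(2), 0.1 * np.eye(2)                   # velocity estimate
Qrw = 1e-5 * np.eye(2)                                # random-walk prediction
for t in range(T):
    x_pred, W_pred = x, W + Qrw                       # eqs. 3.7-3.8 with B = I
    lam = np.exp(beta0 + alpha @ x_pred)
    # update, eqs. 3.9-3.10: grad log-lambda = alpha, Hessian = 0
    W = np.linalg.inv(np.linalg.inv(W_pred)
                      + np.outer(alpha, alpha) * lam * dt)
    x = x_pred + W @ alpha * (spikes[t] - lam * dt)
```

With a single neuron, only the preferred-direction component of velocity is informed by the spikes; the orthogonal component stays pinned at its prior, which is why an ensemble with diverse preferred directions is simulated in the text.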
We compared the quality of reconstruction using the reach state equation versus the standard free arm movement state equation. The same covariance Qt from equation 3.1 was incorporated into the free arm movement state equation 2.1 and the reach state equation 2.20. Figure 5 compares position and velocity decoding results for one simulated trial on a trajectory generated from the reach state equation. In this trial, the filter employing a reach state equation is provided the target location with relative certainty by setting both the r and p parameters of ΣT to 1e-5 m2 in equation 3.2. The point process filter appears to track the actual trajectory more closely with the reach state equation than with the free movement state equation. Next, we examined the performance of the reach model point process filter in estimating trajectories that were generated from the canonical equation, 3.3, rather than the reach state equation to determine whether the reconstruction would still perform under model violation. Decoding
performance for one trial with the canonical trajectory is illustrated in Figure 6, using the free movement state equation and the reach state equation with r and p in ΣT set to 1e-5 m2, as with Figure 5. Again, the point process filter tracks the actual trajectory more closely when using the reach state equation than when using the free movement state equation. We then assessed whether incorrect and uncertain target planning information could be refined with neural activity that was informative about the path. We implemented the target-augmented state equation and examined the mean and variance of estimates of the target position as the reach progressed. Although the true target coordinates were (0.25 m, 0.25 m) on the x-y plane, the initial estimate of the target location was assigned to (1 m, 1 m) with a variance of 1 m2, large relative to the distance between the initial target estimate and correct target location. Decoding performance for one trial is illustrated in Figure 7. In Figures 7A and 7B, the estimate of the target location is shown to settle close to the true target location relative to the initial target estimate within 1.5 seconds of a 2 second reach. In Figure 7C, the variances in the position (solid) and velocity (dotted) estimates for target (black) approach the variances in estimates for the path (gray) as the reach proceeds. Finally, we confirmed in simulation that the MSE of reconstruction using the reach state equation approaches that of the free movement state equation as the uncertainty in target position grows. One common simulated set of neural data was used to make a performance comparison between the two methods. Mean squared errors were averaged over 30 trials for the point process filter using the free and reach state equations separately. The results are plotted in Figure 8 for values of r and p in ΣT set equal and over a range from 1e-7 m2 to 10 m2, evenly spaced by 0.2 units on a log10 (m2) scale.
The MSE line for the reach state equation approaches that of the free movement state equation as ΣT grows large and also flattens as ΣT approaches zero.

4 Discussion

We have developed a method for describing reaching arm movements with a general linear state equation that is constrained by its target. We first derived a reach state equation, which incorporates information about the target that is received prior to movement. This derivation was then adapted to explicitly include the target as an additional state space variable. The resulting augmented state equation supports the incorporation of target information throughout the reach as well as during the planning period. As described in the derivation, the reach state equation is Markov. This property is guaranteed in part by the independence of noise increments that is demonstrated in the appendix. Consequently, the reach state equation is amenable to recursive estimation procedures. With no further alterations, the estimate of xt can be obtained exclusively from the neural observation at time t and the estimate of xt−1 given data through time t − 1.
The form of the reach state equation presented in 2.18 is particularly suggestive of stochastic control. In fact, the ut component in equation 2.18 is the solution to the standard linear quadratic control problem. This represents a duality between estimation and control (Kailath et al., 2000). In this interpretation, the reach state equation is a model for the way in which the subject dynamically plans their path from a current position to the target. The stochastic increment εt represents our uncertainty, as external observers, about the precise control strategy being employed. The variable ut takes the role of a control input that represents the adjustments that the subject is expected to make to return the trajectory to a path that puts it on track to the target. In the reach state equation, ut is a function of the state xt−1 and target observation yT. In the augmented state equation, ut is a function of xt−1 and the actual target xT rather than the target observation yT. Various parameters work together to determine our uncertainty in the control strategy, including the increment variance in the original free movement state equation, distance to target, time to target, and target uncertainty. Together, these parameters determine whether the state equation at any given time forces the trajectory toward a particular target or whether the trajectory tends to proceed in a relatively unconstrained fashion. Figures 2 and 3 describe the variation in trajectories that can be generated by modulating these parameters, from very directed movements to paths with nearly unconstrained directions. The reach state equation in its simplest form is sufficient to generate, on average, bell-shaped velocity profiles that are similar to those observed in natural arm reaching (Morasso, 1981; Soechting & Lacquaniti, 1981). Models of reaching movement that are based on optimization of specific cost functions, examples of which include Hogan (1984), Uno et al.
(1989), Hoff and Arbib (1993), and Harris and Wolpert (1998), also generate these bell-shaped velocity profiles. It has been previously noted in a literature review (Todorov, 2004a) that these various methods implicitly or explicitly optimize a smoothness constraint. In our reach state equation, the
Figure 5: Reconstruction of reaching arm movements from simulated spiking activity. The reach state equation was used to generate trajectories, from which spiking activity was simulated with a receptive field model of primary motor cortex. Point process filter reconstructions using a free movement state equation (thin gray) and a reach state equation (thick gray) were compared against true movement values (black). Trajectories of x and y arm positions were plotted against each other (A) and as a function of time (B, C). Additionally, trajectories of x and y arm velocities were plotted against each other (D) and as a function of time (E, F). In these examples, target location is known almost perfectly to the reconstruction that uses the reach state equation, with position and velocity variances of 1e-5 m2 .
L. Srinivasan, U. Eden, A. Willsky, and E. Brown
Reconstruction of Goal-Directed Movements
bell-shaped velocity profile emerges implicitly from the zero-mean gaussian increment of the original free movement state equation. This probability density sets a probabilistic smoothness constraint, where it is more likely that the state at consecutive time steps will be similar. Additionally, symmetry in the profile emerges from the choice of a constant, invertible matrix At in equation 2.18 and equal mean starting and ending velocities, as with trajectories in Figure 4A. Optimal control models have previously reproduced the skewed velocity profiles (Hoff, 1992) that occur in experiments (Milner & Ijaz, 1990) where the target must be acquired with increased precision. With the reach state equation, skewed profiles may require the appropriate choice of time-varying components such as At and wt . When the arrival time grows longer (see Figure 4B) or the end point becomes less constrained (see Figure 4C) in the reach state equation, the trajectory tends to resemble a sample path of the free movement state equation, as intended by construction. As the reaching motion approaches the target arrival time, our sense of the subject’s control strategy becomes clearer, because we know the intended target with some uncertainty. We also know that the path must converge to this target soon. Furthermore, we can calculate the control signal that would achieve this goal based on the system dynamics represented by the At matrices in equation 2.18. Figure 4D illustrates that the uncertainty in the control strategy, represented by the variance in the stochastic increment εt , decreases over the duration of the reach based on yT, the prescient observation of the target. In contrast, the free movement state equation maintains constant uncertainty in the control strategy as the reach progresses because it is not informed about the target location. 
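The claim that a probabilistic smoothness constraint yields bell-shaped profiles can be checked numerically. The sketch below is not the paper's equation 2.18, which is not restated here; it solves the equivalent variational problem for an assumed white-noise-acceleration free movement model, in which the most likely velocity profile between pinned endpoints minimizes the summed squared accelerations. The horizon and reach distance are illustrative assumptions.

```python
import numpy as np

T = 100                     # time steps in the reach (illustrative)
distance = 0.2              # reach distance in meters (illustrative)
n = T - 1                   # free velocities v_1 .. v_{T-1}; v_0 = v_T = 0

# Acceleration operator, including the pinned boundary velocities.
D = np.zeros((T, n))
for i in range(T):
    if i < n:
        D[i, i] = 1.0       # + v_{i+1}
    if i > 0:
        D[i, i - 1] = -1.0  # - v_i
A = np.ones((1, n))         # velocities must integrate to the reach distance
b = np.array([distance])

# KKT system for: minimize ||D v||^2  subject to  A v = b.
H = D.T @ D
kkt = np.block([[2.0 * H, A.T], [A, np.zeros((1, 1))]])
v = np.linalg.solve(kkt, np.concatenate([np.zeros(n), b]))[:n]

print(int(np.argmax(v)), n // 2)   # the speed profile peaks mid-reach
```

The solution is the discrete parabola v_t ∝ t(T − t): symmetric and peaked at the middle of the reach, matching the bell shape described above; skewed profiles would require time-varying weights in the quadratic form, as the text notes for At and wt.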
Because the reach state equation incorporates target information, it is able to perform better than the equivalent free movement state equation that is uninformed about the target. This is illustrated in Figure 5, where closer tracking is achieved over the entire reach when the state equation is informed about the target than otherwise. This reach model and its augmented form are minimally constrained linear state equations. In a probabilistic sense, this means that the estimation prior at each step is only as narrow (or broad) as implied by the original free
Figure 6: Reconstruction in the face of model violation. Trajectories are generated with an appropriately scaled cosine velocity profile. Again, results are compared for point process filtering using free (thin black) and reach (thick gray) movement state equations against true values (thick black). As with Figure 5, trajectories of x and y arm positions were plotted against each other (A) and as a function of time (B, C). Similarly, trajectories of x and y arm velocities were plotted against each other (D) and as a function of time (E, F). Position and velocity variances of the target observation are 1e-5 m2 .
movement state equation and observations of path and target. In contrast, most reach models based on specific control strategies (Todorov, 2004b), cost functions (Todorov, 2004a), or canonical models (Kemere, Santhanam, Yu, Shenoy, & Meng, 2002; Cowan & Taylor, 2005) place additional constraints on the path that make the estimation prior more exclusive of alternate paths to the target. An exception is Kemere and Meng (2005), which uses the linear quadratic control solution; based on the estimation-control duality (Kailath et al., 2000), it provides average trajectories identical to those of the reach state equation, although the resulting increment variances are different. As depicted in Figure 6, estimation with a reach state equation is able to perform under model violation, where arm movements are generated by a different model, while still taking advantage of the target information. The target-augmented state equation also allows neural activity related to the path to inform estimates of the target. This is illustrated in Figure 7, where the initial estimate of target position was assigned to be incorrect and with large uncertainty (variance). Consequently, the estimate of the target location relied in large part on neural activity that related to the path. The augmented state equation projects current path information forward in time to refine target estimates. As a result, the estimated target location in Figure 7B settled close to the actual target location 0.5 second before completion of the 2 second reach. The remaining distance between the target location estimate and the actual target location is limited by the extent to which path-related neurons provide good path estimates. For example, path-related neural activity that is relatively uninformative about the path will result in poor final estimates of the target when combined only with poor initial target information.
Because the target in the augmented state equation is simply the final point of the path, the variance in the target estimate plotted in Figure 7C approaches that of the path estimate as the reach proceeds to the arrival time T.
Figure 7: Target estimation with the augmented state equation for one trial. The initial estimate of the target is intentionally set to be incorrect at (1 m, 1 m) and with variance of 1 m2 that is large relative to the distance to the true target location at (0.25 m, 0.25 m). Subsequent target estimates are produced using simulated neural spiking activity that relates directly to the path rather than the target. (A) Estimates of the target position are plotted in gray on the x-y axis, with the actual target marked as a black cross. (B) Distances from target estimates to the actual target location are plotted in meters against time. (C) Variances in estimates of target (black) and path (gray) are plotted on a logarithmic scale over the duration of one reach for position (solid) and velocity (dashed). These target estimate variances reduce with observations consisting of only simulated primary motor cortical activity relating to path.
Figure 8: Performance comparison between two approaches to estimation on the same simulated set of neural data. MSE of position reconstruction is plotted versus log10 of uncertainty (variance) in the prescient observation of target. For each of 30 trials, receptive field parameters, trajectory, and spiking activity were simulated anew. For each target variance, MSE is averaged over reconstructions from the 10 trials. In the case of large initial target uncertainty, the MSE for reconstruction with the reach state equation (dotted) asymptotes to that of the free movement state equation (solid). The MSE for reconstruction with the reach state equation also asymptotes as initial target uncertainty diminishes.
The reach state equation in equation 2.18 or 2.20 reduces to the original free movement state equation in the limit that the prescient target observation is completely uncertain. This explains the trend in Figure 8, where MSE in trajectory estimates with the reach state equation approaches that of the free movement state equation. Estimates were produced from a common simulated set of neural data to allow performance comparison between these two approaches. Filtering with the reach and augmented state equations, 2.18 and 2.23, respectively, bears resemblance to fixed interval smoothing. Fixed interval smoothing refers to a class of estimation procedures that produce maximum a posteriori estimates of trajectory values over an interval with observations of the trajectory over the entire interval (Kailath et al., 2000). In filtering with the reach state equation, estimates at a given time t are based on data received through time t and the single prescient observation yT on the target
state xT . In filtering with the augmented state equation, estimates of xt are based on data received through time t and potentially multiple prescient observations on xT . While these three filter types employ observations of future states in maximum a posteriori estimates, there are important distinctions in terms of which observations are used and allowance for multiple sequential observations of a single state, such as with xT in the augmented state equation. Although parallels exist to stochastic control, there is a sharp distinction between the results of this article and a control-based state equation (Todorov, 2004b; Kemere & Meng, 2005). First, the reach state equation was derived as the natural extension of a free movement state equation, with no further assumptions. In contrast, control-based state equations are derived by assuming a specific form for the brain’s controller and choosing the parameters that optimize some cost function. Second, the increment in the reach state equation approaches zero for perfectly known targets. The increment of control-based state equations persists and represents system properties rather than our uncertainty about the control signal. Third, the reach state equation describes the target state in the most general sense, including the possibility of nonzero velocities. While this can be accommodated in the control framework, the classical notion of a reaching motion has been to a target with zero velocity. Distinctions between the reach state equation and control-based state equations are especially important in considering the study of reaching motions. Recursive estimation coupled with a state equation that relates target to path provides a convenient tool for the analysis of neural data recorded during planning and execution of goal-directed movements. 
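As a rough illustration of how recursive estimation with a target-coupled state works (and of the target refinement discussed around Figure 7), the sketch below runs a standard Kalman filter on a scalar augmented state. It is a hedged analogue, not the paper's point process filter: the gap-closing dynamics, noise variances, horizon, and initial target guess are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, q, r = 50, 1e-4, 1e-3           # steps, process / observation variances
g_true = 0.25                      # actual target location (m), assumed

# Simulate a reach whose dynamics pull the path toward the target.
x = 0.0
obs = []
for t in range(1, T + 1):
    k = 1.0 / (T - t + 1)          # fraction of the remaining gap closed
    x = x + k * (g_true - x) + rng.normal(0.0, np.sqrt(q))
    obs.append(x + rng.normal(0.0, np.sqrt(r)))

# Kalman filter on the augmented state z = [path, target], starting from a
# deliberately wrong target guess with large variance (cf. Figure 7).
z = np.array([0.0, 1.0])           # initial path mean and target mean
P = np.diag([1e-6, 1.0])           # confident about path, vague about target
H = np.array([[1.0, 0.0]])         # only the path is observed
for t, y in enumerate(obs, start=1):
    k = 1.0 / (T - t + 1)
    F = np.array([[1.0 - k, k], [0.0, 1.0]])
    z, P = F @ z, F @ P @ F.T + np.diag([q, 0.0])   # predict
    S = H @ P @ H.T + r                             # innovation variance
    K = P @ H.T / S                                 # Kalman gain
    z = z + (K @ (y - H @ z)).ravel()
    P = (np.eye(2) - K @ H) @ P

print(round(float(z[1]), 3))       # target estimate ends near g_true
```

Because the dynamics couple the path to the target, path observations alone drive the target estimate toward the true value and shrink its variance as the arrival time approaches, mirroring the behavior described for Figures 7B and 7C.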
The state-space estimation framework can assess the extent to which neural data and an observation equation improve the reconstruction beyond information about the movement built into the state equation. Classically, control-based state equations have been developed to explain as many features about reaching movements as possible without any neural data. In contrast, the reach state equation was developed to extend the free movement state equation with no further assumptions. Both approaches represent different levels of detail in a spectrum of models for the dynamics that drive the observed neural activity in brain regions that coordinate movement. These models can be used to clarify the roles of various brain regions or the validity of alternate neural spiking relationships. The reach and augmented state equations may also provide improved control in brain machine interfaces (Srinivasan et al., 2005) by allowing the user to specify a target explicitly with neural signals or implicitly through the probability density of potential targets in a workspace. This and other recent approaches (Cowan & Taylor, 2005; Yu et al., 2005; Kemere & Meng, 2005) are hybrids between target-based control prosthetics (Musallam et al., 2004; Santhanam et al., 2005) and path-based control prosthetics (Carmena et al., 2003; Taylor et al., 2002; Wu, Shaikhouni, et al., 2004), perhaps most
relevant when neither approach alone is sufficient for the desired level of control using available recording hardware to complete a task. Additionally, the method could support more robust receptive field estimates in the face of disappearing units due to neuronal death or tissue retraction (Eden, Truccolo, Fellows, Donoghue, & Brown, 2004). The flexibility of the reach and augmented state equations over more specific reach models might also allow the user to employ the same reaching algorithm to navigate obstacles in acquiring the target. In developing the method further for scientific and clinical application, it is important to consider the limitations of the equations presented in this article. Importantly, both the augmented and reach state equations are written for the prescient observation of a target with known arrival time T. We are currently developing a principled approach to accommodate uncertain arrival time, although uncertainty in the target velocity estimate might be a convenient surrogate. Also, the calculations were simplified greatly by assuming a linear free-arm-movement state equation with gaussian increments. This simplification may not be possible if a linear approximation is insufficient to describe the nonlinear dynamics of a movement. Finally, additional experimental work will be needed to elucidate the appropriate observation equations, recording sites, and patient rehabilitation regimen that would enhance the clinical application of this hybrid approach to control prosthetics.

Appendix: Proof of Independent Increments in the Reach State Equation

The new increments are defined as $\varepsilon_t = w_t - E[w_t \mid y_T, x_{t-1}]$. Substituting equation 2.6 into an equation that is equivalent to equation 2.12, we can rewrite the new increments as

$$\varepsilon_t = w_t - Q_t \phi(T,t)'\, S_t^{-1} \Big( \sum_{i=t}^{T} \phi(T,i)\, w_i + v_T \Big), \tag{A.1}$$

where $S_t = R_T + \sum_{i=t}^{T} \phi(T,i)\, Q_i\, \phi(T,i)'$ and $R_T$ is the covariance of the observation random variable $y_T$, with $R_T = \phi(T,t-1)\, V_{t-1}\, \phi(T,t-1)' + \sum_{i=t}^{T} \phi(T,i)\, Q_i\, \phi(T,i)' + \operatorname{cov}(v_T)$. Therefore, $\varepsilon_t$ can be written entirely in terms of the future increments $\{w_i\}_{i=t}^{T}$ and $v_T$. For $s < t$,

$$E[\varepsilon_t \varepsilon_s'] = E\Big[ \Big( w_t - Q_t \phi(T,t)'\, S_t^{-1} \Big( \sum_{i=t}^{T} \phi(T,i)\, w_i + v_T \Big) \Big) \Big( w_s - Q_s \phi(T,s)'\, S_s^{-1} \Big( \sum_{i=s}^{T} \phi(T,i)\, w_i + v_T \Big) \Big)' \Big]$$

$$= -\,Q_t \phi(T,t)'\, S_s^{-1} \phi(T,s)\, Q_s + Q_t \phi(T,t)'\, S_t^{-1} \Big( \sum_{i=t}^{T} \phi(T,i)\, Q_i\, \phi(T,i)' + R_T \Big) S_s^{-1} \phi(T,s)\, Q_s$$

$$= -\,Q_t \phi(T,t)'\, S_s^{-1} \phi(T,s)\, Q_s + Q_t \phi(T,t)'\, S_s^{-1} \phi(T,s)\, Q_s = 0. \tag{A.2}$$

Acknowledgments

L.S. thanks Ali Shoeb, Benjie Limketkai, Ashish Khisti, Gopal Santhanam, and Rengaswamy Srinivasan for helpful discussions. We thank Julie Scott and Riccardo Barbieri for help in preparing the manuscript. This research is supported in part by the NIH Medical Scientist Training Program Fellowship to L.S. and NIH grant R01 DA015644 to E.N.B.

References

Andersen, R. A., & Buneo, C. A. (2002). Intentional maps in posterior parietal cortex. Annual Review of Neuroscience, 25, 189–220.
Ashe, J., & Georgopoulos, A. P. (1994). Movement parameters and neural activity in motor cortex and area 5. Cerebral Cortex, 4, 590–600.
Barbieri, R., Frank, L. M., Nguyen, D. P., Quirk, M. C., Solo, V., Wilson, M. A., & Brown, E. N. (2004). Dynamic analyses of information encoding by neural ensembles. Neural Computation, 16(2), 277–308.
Brown, E. N., Barbieri, R., Ventura, V., Kass, R. E., & Frank, L. M. (2002). The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation, 14(2), 325–346.
Buzsáki, G. (2004). Large-scale recording of neuronal ensembles. Nature Neuroscience, 7, 446–451.
Carmena, J. M., Lebedev, M. A., Crist, R. E., O'Doherty, J. E., Santucci, D. M., Dimitrov, D. F., Patil, P. G., Henriquez, C. S., & Nicolelis, M. A. L. (2003). Learning to control a brain-machine interface for reaching and grasping by primates. Public Library of Science Biology, 1(2), 193–208.
Castanon, D. A., Levy, B. C., & Willsky, A. S. (1985). Algorithms for the incorporation of predictive information in surveillance theory. International Journal of Systems Science, 16(3), 367–382.
Cowan, T. M., & Taylor, D. M. (2005). Predicting reach goal in a continuous workspace for command of a brain-controlled upper-limb neuroprosthesis. Proceedings of the 2nd International IEEE EMBS Conference on Neural Engineering. Piscataway, NJ: IEEE.
Eden, U. T., Frank, L. M., Barbieri, R., Solo, V., & Brown, E. N. (2004). Dynamic analyses of neural encoding by point process adaptive filtering. Neural Computation, 16(5), 971–998.
Eden, U. T., Truccolo, W., Fellows, M. R., Donoghue, J. P., & Brown, E. N. (2004). Reconstruction of hand movement trajectories from a dynamic ensemble of spiking motor cortical neurons. Proceedings of the 26th Annual International Conference of the IEEE EMBS. Piscataway, NJ: IEEE.
Fu, Q.-G., Suarez, J. I., & Ebner, T. J. (1993). Neuronal specification of direction and distance during reaching movements in the premotor area and primary motor cortex of monkeys. Journal of Neurophysiology, 70(5), 2097–2116.
Georgopoulos, A. P., Ashe, J., Smyrnis, N., & Taira, M. (1992). Motor cortex and the coding of force. Science, 256, 1692–1695.
Georgopoulos, A. P., Kalaska, J. F., Caminiti, R., & Massey, J. T. (1982). On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. Journal of Neuroscience, 2, 1527–1537.
Georgopoulos, A., Kettner, R., & Schwartz, A. (1988). Primary motor cortex and free arm movements to visual targets in three-dimensional space. II. Coding of the direction of movement by a neuronal population. Journal of Neuroscience, 8(8), 2928–2939.
Geschwind, N., & Damasio, A. R. (1985). Apraxia. In P. J. Vinken, G. W. Bruyn, & H. L. Klawans (Eds.), Handbook of clinical neurology (pp. 423–432). Amsterdam: Elsevier.
Greger, B., Norris, S. A., & Thach, W. T. (2003). Spike firing in the lateral cerebellar cortex correlated with movement and motor parameters irrespective of the effector limb. Journal of Neurophysiology, 91, 576–582.
Harris, C. M., & Wolpert, D. M. (1998). Signal-dependent noise determines motor planning. Nature, 394, 780–784.
Hoff, B. (1992). A computational description of the organization of human reaching and prehension. Unpublished doctoral dissertation, University of Southern California.
Hoff, B., & Arbib, M. A. (1993). Models of trajectory formation and temporal interaction of reach and grasp. Journal of Motor Behavior, 25, 175–192.
Hogan, N. (1984). An organizing principle for a class of voluntary movements. Journal of Neuroscience, 4, 2745–2754.
Kailath, T., Sayed, A. H., & Hassibi, B. (2000). Linear estimation. Upper Saddle River, NJ: Prentice Hall.
Kemere, C., & Meng, T. H. (2005). Optimal estimation of feed-forward-controlled linear systems. IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE.
Kemere, C., Santhanam, G., Yu, B. M., Shenoy, K. V., & Meng, T. H. (2002). Decoding of plan and peri-movement neural signals in prosthetic systems. IEEE Workshop on Signal Processing Systems. Piscataway, NJ: IEEE.
Li, C. R., Padoa-Schioppa, C., & Bizzi, E. (2001). Neuronal correlates of motor performance and motor learning in the primary motor cortex of monkeys adapting to an external force field. Neuron, 30, 593–607.
Milner, T. E., & Ijaz, M. M. (1990). The effect of accuracy constraints on three-dimensional movement kinematics. Neuroscience, 35, 365–374.
Moran, D. W., & Schwartz, A. B. (1999). Motor cortical representation of speed and direction during reaching. Journal of Neurophysiology, 82(5), 2676–2692.
Morasso, P. (1981). Spatial control of arm movements. Experimental Brain Research, 42, 223–227.
Musallam, S., Corneil, B. D., Greger, B., Scherberger, H., & Andersen, R. A. (2004). Cognitive control signals for neural prosthetics. Science, 305, 258–262.
Nakano, E., Imamizu, H., Osu, R., Uno, Y., Gomi, H., Yoshioka, T., & Kawato, M. (1999). Quantitative examinations of internal representations for arm trajectory planning: Minimum commanded torque change model. Journal of Neurophysiology, 81, 2140–2155.
Nudo, R. J., Wise, B. M., SiFuentes, F., & Milliken, G. W. (1996). Neural substrates for the effects of rehabilitation training on motor recovery after ischemic infarct. Science, 272, 1791–1794.
Paninski, L., Fellows, M. R., Hatsopoulos, N. G., & Donoghue, J. P. (2004). Spatiotemporal tuning of motor cortical neurons for hand position and velocity. Journal of Neurophysiology, 91, 515–532.
Santhanam, G., Ryu, S. I., Yu, B. M., Afshar, A., & Shenoy, K. V. (2005). A high performance neurally-controlled cursor positioning system. IEEE Engineering in Medicine and Biology (EMBS) 2nd International Conference on Neural Engineering. Piscataway, NJ: IEEE.
Schwartz, A. B. (1992). Motor cortical activity during drawing movements: Single-unit activity during sinusoid tracing. Journal of Neurophysiology, 68, 528–541.
Schwartz, A. B., Moran, D. W., & Reina, G. A. (2004). Differential representation of perception and action in the frontal cortex. Science, 303, 380–383.
Smith, A. C., Frank, L. M., Wirth, S., Yanike, M., Hu, D., Kubota, Y., Graybiel, A. M., Suzuki, W. A., & Brown, E. N. (2004). Dynamic analyses of learning in behavioral experiments. Journal of Neuroscience, 24(2), 447–461.
Soechting, J. F., & Lacquaniti, F. (1981). Invariant characteristics of a pointing movement in man. Journal of Neuroscience, 1, 710–720.
Srinivasan, L., Eden, U. T., Willsky, A. S., & Brown, E. N. (2005). Goal-directed state equation for tracking reaching movements using neural signals. Proceedings of the 2nd International IEEE EMBS Conference on Neural Engineering. Piscataway, NJ: IEEE.
Taira, M., Boline, J., Smyrnis, N., Georgopoulos, A. P., & Ashe, J. (1995). On the relations between single cell activity in the motor cortex and the direction and magnitude of three-dimensional static isometric force. Experimental Brain Research, 109, 367–376.
Taylor, D. M., Tillery, S. I. H., & Schwartz, A. B. (2002). Direct cortical control of 3D neuroprosthetic devices. Science, 296, 1829–1832.
Todorov, E. (2000). Direct cortical control of muscle activation in voluntary arm movements: A model. Nature Neuroscience, 3(4), 391–398.
Todorov, E. (2004a). Optimality principles in sensorimotor control. Nature Neuroscience, 7(9), 907–915.
Todorov, E. (2004b). Stochastic optimal control and estimation methods adapted to the noise characteristics of the sensorimotor system. Neural Computation, 17, 1084–1108.
Todorov, E., & Jordan, M. I. (2002). Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5, 1226–1235.
Turner, R. S., & Anderson, M. E. (1997). Pallidal discharge related to the kinematics of reaching movements in two dimensions. Journal of Neurophysiology, 77, 1051–1074.
Uno, Y., Kawato, M., & Suzuki, R. (1989). Formation and control of optimal trajectory in human multijoint arm movement: Minimum torque-change model. Biological Cybernetics, 61, 89–101.
Verghese, G., & Kailath, T. (1979). A further note on backwards Markovian models. IEEE Transactions on Information Theory, IT-25(1), 121–124.
Warland, D. K., Reinagel, P., & Meister, M. (1997). Decoding visual information from a population of retinal ganglion cells. Journal of Neurophysiology, 78, 2336–2350.
Wise, S. P., Boussaoud, D., Johnson, P. B., & Caminiti, R. (1997). Premotor and parietal cortex: Corticocortical connectivity and combinatorial computations. Annual Review of Neuroscience, 20, 25–42.
Wu, W., Shaikhouni, A., Donoghue, J. P., & Black, M. J. (2004). Closed-loop neural control of cursor motion using a Kalman filter. The 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. Piscataway, NJ: IEEE.
Yu, B. M., Santhanam, G., Ryu, S. I., & Shenoy, K. V. (2005). Feedback-directed state transition for recursive Bayesian estimation of goal-directed trajectories. Computational and Systems Neuroscience (COSYNE) meeting. Salt Lake City, UT. Available online at http://www.cosyne.org/program05/291.html.
Received June 1, 2005; accepted March 24, 2006.
LETTER
Communicated by Barak Pearlmutter
What Is the Relation Between Slow Feature Analysis and Independent Component Analysis?
Tobias Blaschke [email protected]
Pietro Berkes [email protected]
Laurenz Wiskott [email protected] Institute for Theoretical Biology, Humboldt University Berlin, D-10115 Berlin, Germany
We present an analytical comparison between linear slow feature analysis and second-order independent component analysis, and show that in the case of one time delay, the two approaches are equivalent. We also consider the case of several time delays and discuss two possible extensions of slow feature analysis.

1 Introduction

In data analysis, it is often desirable to transform the input signals into a new representation that recovers as much information as possible about the underlying processes. In the classical example of two people speaking simultaneously while being recorded with two microphones, for instance, the observed signal is a mixture of their voices. A more useful representation here would be one where each signal component contains only the information about a single speaker. In the visual domain, one might be interested in a representation that is invariant to typical transformations, such as translation or zoom. A variety of linear and nonlinear methods have been developed to extract the interesting features from an observed signal. In this letter, we focus on two methods that consider different properties of the observed signal: independent component analysis (ICA) (see Hyvärinen, Karhunen, & Oja, 2001, for an overview) and slow feature analysis (SFA) (Wiskott & Sejnowski, 2002). ICA finds a representation of the data such that signal components are mutually statistically independent, which can be used to separate the two speakers in the example above. SFA extracts slowly-varying features, which can be used in the second example to learn visual invariances. At first glance, these two methods are very different and even seem to be conflicting, since two slowly varying signals of finite length are intuitively more likely to have statistical dependencies

Neural Computation 18, 2495–2508 (2006)
© 2006 Massachusetts Institute of Technology
T. Blaschke, P. Berkes, and L. Wiskott
than quickly varying ones. However, we will see that ICA and SFA have common properties, which we are going to point out by comparing the two algorithms mathematically. To carry out the comparison, we have to apply some restrictions. SFA is constrained to nonwhite signals with a temporal structure (e.g., speech signals), and it is based on second-order statistics. We therefore compare it to ICA algorithms that use only second-order information and need a temporally structured signal as well (Molgedey & Schuster, 1994; Belouchrani, Abed-Meraim, Cardoso, & Moulines, 1997; Ziehe & Müller, 1998; Zibulevsky & Pearlmutter, 2000; Nuzillard & Nuzillard, 2003). SFA is usually applied as a nonlinear method: it uses a nonlinear expansion to map the input signal into a feature space and then solves a linear problem there. ICA, on the other hand, is typically a linear method, since in the nonlinear case, the problem is in general underdetermined (because the solution is not unique), and there is thus no guarantee of recovering the original sources (Hyvärinen & Pajunen, 1999; Jutten & Karhunen, 2003). (However, there do exist some nonlinear approaches that make additional assumptions about the nonlinear mapping or the input data.) To make a comparison between the two methods possible, we will restrict SFA to the linear case. Nevertheless, all calculations in this letter are essentially the same for linear or nonlinear SFA.

2 Linear Mixing and Unmixing

Let $x(t) = [x_1(t), \ldots, x_N(t)]^T$ be a linear mixture of a multidimensional source signal $s(t) = [s_1(t), \ldots, s_N(t)]^T$,

x(t) = As(t),    (2.1)
where A is a square mixing matrix and different components $s_i$ come from statistically independent sources. In the following, we will assume that s(t) and x(t) have zero mean, without loss of generality. A common linear preprocessing step in many ICA algorithms as well as in linear SFA is the whitening of the input signal x(t). Whitening results in a signal y(t) = Wx(t) with mutually uncorrelated components, $\langle y_i(t)\, y_j(t) \rangle = 0 \;\forall\, i \neq j$, unit variance, $\langle y_i(t)^2 \rangle = 1$, and zero mean, $\langle y_i(t) \rangle = 0$, where $\langle\,\cdot\,\rangle$ denotes averaging over time. It can be shown that after the whitening step, an orthogonal transformation Q on y is sufficient to yield independent components (Comon, 1994) or slowly-varying features (Wiskott & Sejnowski, 2002). Therefore, the output signal u(t) can be obtained by combining the whitening matrix W and a rotation matrix Q:

u(t) = Qy(t) = QWx(t).    (2.2)
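The whitening step in equation 2.2 can be sketched in a few lines. The sources, the mixing matrix, and the use of a symmetric (eigendecomposition-based) whitening matrix below are illustrative assumptions, not taken from the letter.

```python
import numpy as np

# Minimal sketch: whiten a linear mixture x = A s so that y = W x has zero
# mean, unit variance, and mutually uncorrelated components. The two sources
# and the mixing matrix here are arbitrary illustrations.
t = np.linspace(0.0, 100.0, 5000)
s = np.vstack([np.sin(1.3 * t), np.sign(np.cos(0.7 * t))])  # two sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])                      # mixing matrix
x = A @ s

x = x - x.mean(axis=1, keepdims=True)     # enforce zero mean
C = x @ x.T / x.shape[1]                  # covariance <x(t) x(t)^T>
d, E = np.linalg.eigh(C)
W = E @ np.diag(d ** -0.5) @ E.T          # symmetric whitening matrix
y = W @ x

print(np.round(y @ y.T / y.shape[1], 6))  # ~ identity matrix
```

Any orthogonal transformation of y preserves these zero-mean and unit-covariance properties, which is exactly why the remaining work for both SFA and second-order ICA reduces to finding the rotation Q.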
What Is the Relation Between SFA and ICA?
In the following, we will always assume whitened data y(t) and focus on finding Q. Since zero mean and whitening are preserved under any orthogonal transformation, the components of u(t) also satisfy these conditions:

$\langle u_i(t) \rangle = 0$ (zero mean),    (2.3)
$\langle u_i(t)^2 \rangle = 1$ (unit variance),    (2.4)
$\langle u_i(t)\, u_j(t) \rangle = 0 \;\; \forall\, i \neq j$ (decorrelation).    (2.5)
These properties fulfill the constraints imposed by SFA (cf. section 4) and are a good prerequisite for ICA because they constrain the output signals $u_i(t)$ to be statistically independent in the first and second order.

3 Second-Order Independent Component Analysis

Given the linear mixture, equation 2.1, ICA tries to retrieve the source signal components s(t) from the input signal x(t). The mixing matrix A is unknown, and the source signal components are assumed to be mutually independent. The typical approach is to define an objective function that is a measure of independence of the estimated source signal components $u_i$. The problem is then solved by optimizing this function with respect to Q. There exist different measures of independence. Most algorithms are based on the assumption that two signals are independent if their joint distribution is equal to the product of their marginals (e.g., Cardoso & Souloumiac, 1993; Hyvärinen, 1999; Lee, Girolami, & Sejnowski, 1999). A corresponding measure in this case is the Kullback-Leibler divergence. We will refer to this approach as higher-order ICA. This definition, however, does not capture all aspects of independence. Consider a signal without temporal autocorrelation (e.g., white noise) and a second signal that is equal to the first one but shifted in time. Applying the measure of independence, the two signals appear to be independent, although one is actually a time-shifted copy of the other and thereby intuitively strongly dependent. This dependence across time can be taken into account using a different measure where two signals are considered statistically independent if all time-delayed correlations are zero (second-order ICA) (Molgedey & Schuster, 1994; Belouchrani et al., 1997; Ziehe & Müller, 1998). In order to successfully apply this measure, the source signals need to have a time structure (must be nonwhite), which is also a necessary condition for SFA. An alternative formulation of this idea is to use a model of the sources that includes dynamics in time and assume that the time series are independent as a whole (Pearlmutter & Parra, 1996). In this letter, we are going to study algorithms based on this latter definition of independence, following the formulation by Molgedey and Schuster (1994).
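A minimal sketch of this single-delay formulation, in the spirit of Molgedey and Schuster (1994): after whitening, the orthogonal matrix that diagonalizes one symmetrized time-delayed correlation matrix recovers the sources up to order and sign. The test signals, mixing matrix, and delay below are illustrative assumptions.

```python
import numpy as np

t = np.linspace(0.0, 200.0, 20000)
s = np.vstack([np.sin(0.9 * t), np.cos(2.3 * t + 1.0)])  # nonwhite sources
s = s - s.mean(axis=1, keepdims=True)
x = np.array([[1.0, 0.5], [0.3, 1.0]]) @ s               # observed mixture

# Whitening (cf. equation 2.2): y = W x.
d, E = np.linalg.eigh(x @ x.T / x.shape[1])
y = (E @ np.diag(d ** -0.5) @ E.T) @ x

# One symmetrized time-delayed correlation matrix of the whitened data.
tau = 100
C = y[:, :-tau] @ y[:, tau:].T / (y.shape[1] - tau)
C = 0.5 * (C + C.T)

# Its eigenvectors give the rotation Q; u = Q^T y then has a diagonal
# (symmetrized) time-delayed correlation matrix: recovered components.
_, Q = np.linalg.eigh(C)
u = Q.T @ y
Cu = u[:, :-tau] @ u[:, tau:].T / (u.shape[1] - tau)
print(np.round(0.5 * (Cu + Cu.T), 3))    # off-diagonal entries ~ 0
```

This works because the two sources have different autocorrelations at the chosen delay, so the eigenvalues of the delayed correlation matrix are distinct and the rotation is identified; it fails exactly in the pathological equal-frequency case discussed below for equation 3.2.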
To derive an objective function for second-order ICA, we first introduce time-delayed correlation matrices of the estimated source signal u(t),

C^{(u)}(\tau) := \langle u(t)\, u(t+\tau)^T \rangle,   (3.1)

where τ is the time delay between two signals and ⟨·⟩ denotes averaging over time t. We denote the entries of C^{(u)}(τ) as C^{(u)}_{ij}(τ). For a signal u(t) with independent components, C^{(u)}(τ) should be diagonal for all τ. We are therefore looking for an objective function that, when optimized, jointly diagonalizes those matrices. It is common in practice to use a symmetrized version of the correlation matrices:¹

C^{(u)}(\tau) := \frac{1}{2} \left\langle u(t)\, u(t+\tau)^T + u(t+\tau)\, u(t)^T \right\rangle.   (3.2)
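The symmetrized matrices of equation 3.2 and the off-diagonal mass that second-order ICA drives to zero can be sketched as follows (an illustrative NumPy sketch, not code from the letter; the helper names are hypothetical). Independent sources with time structure give a nearly diagonal delayed correlation matrix; a rotated mixture does not:

```python
import numpy as np

def sym_corr(u, tau):
    """Symmetrized time-delayed correlation matrix (cf. equation 3.2).
    u has shape (T, N): one zero-mean time series per column."""
    a = u[:-tau].T @ u[tau:] / (len(u) - tau)   # <u(t) u(t+tau)^T>
    return 0.5 * (a + a.T)

def off_diag_sq(u, tau):
    """Sum of squared off-diagonal entries of sym_corr, the quantity
    second-order ICA minimizes."""
    c = sym_corr(u, tau)
    return np.sum(c**2) - np.sum(np.diag(c)**2)

# Two independent AR(1) sources with different lag-1 autocorrelations.
rng = np.random.default_rng(1)
T = 100_000
s = np.zeros((T, 2))
noise = rng.standard_normal((T, 2))
for t in range(1, T):
    s[t, 0] = 0.9 * s[t - 1, 0] + noise[t, 0]
    s[t, 1] = 0.5 * s[t - 1, 1] + noise[t, 1]
s /= s.std(axis=0)

# A rotation (standing in for the unknown mixing of whitened data)
# creates off-diagonal time-delayed correlations.
phi = np.pi / 4
Q = np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])
x = s @ Q.T

p_src = off_diag_sq(s, 1)   # nearly diagonal: small
p_mix = off_diag_sq(x, 1)   # mixture: clearly larger
print(p_src, p_mix)
```

Note that the two sources must have distinct delayed autocorrelations (here 0.9 versus 0.5) for the rotation to be visible in the delayed correlation matrix.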
Computing the symmetrized matrices is equivalent to applying the algorithm to the original input data and to the data reversed in time (because ⟨u(t+τ) u(t)^T⟩ = ⟨u(t) u(t−τ)^T⟩). This reflects the fact that with respect to the unmixing problem, the time direction is not important. Moreover, the symmetric form can always be diagonalized with a rotation matrix (while the nonsymmetric matrices can have complex eigenvalues and eigenvectors) and has better numerical properties. Note, however, that in some pathological cases the cross-correlation terms can cancel each other out: for example, if u(t) = [sin(t), cos(t)]^T, there clearly are cross-correlations, but in the symmetrized version the off-diagonal terms in equation 3.2 are zero for all τ. The two signals are thus considered independent by the algorithm. We first focus on the case of a single time delay τ (Molgedey & Schuster, 1994). The extension to more than one time-delayed correlation matrix is straightforward and will be described in section 5. Because of the whitening step, equation 2.5, the correlation matrix with time delay zero is already diagonal. With one time delay, the ICA algorithm thus reduces to diagonalizing a single time-delayed correlation matrix C^{(u)}(τ). This can be achieved by using the method of Jacobi (Cardoso & Souloumiac, 1996) to minimize the sum of the squared off-diagonal entries, a technique used in several second-order ICA algorithms (Belouchrani et al., 1997; Ziehe & Müller, 1998) as well as in methods based on higher-order statistics (Cardoso & Souloumiac, 1993). Using this method, we can define a simple objective
¹ In Ziehe and Müller (1998), the correlation matrices are not explicitly defined in the article, but the Matlab implementation made available by the authors uses the symmetric form.
What Is the Relation Between SFA and ICA?
2499
function subject to minimization,

\Psi_{ICA}(\tau) := \sum_{i,j=1;\, i \neq j}^{N} \left( C^{(u)}_{ij}(\tau) \right)^2   (3.3)

= \sum_{i \neq j} \left( q_i^T\, C^{(y)}(\tau)\, q_j \right)^2,   (3.4)
where q_i is the ith row of Q. Ψ_ICA is a function of the vectors q_i, which are subject to learning, and of the whitened signal y(t), which is given. This objective function is optimized by a sequence of elementary rotations within the plane spanned by two axes. A possible optimization procedure has been described by Cardoso and Souloumiac (1996); a more efficient optimization schedule has been derived by Blaschke and Wiskott (2004a).

4 Linear Slow Feature Analysis

Given a whitened input signal y(t) = [y_1(t), …, y_N(t)]^T, linear SFA finds a rotation matrix Q such that the components u_i of the output signal u(t) = Q y(t) vary as slowly as possible in time and are ordered by decreasing slowness (the first one being the slowest possible, the second one the next slowest uncorrelated to the first, and so on). As a measure of slowness, we define (small values indicating slowly varying signals)

\Delta(u_i) := \langle \dot{u}_i(t)^2 \rangle,   (4.1)
which has to be minimized (Wiskott & Sejnowski, 2002). Due to the earlier whitening step, each output signal u_i(t) has zero mean and unit variance (see equations 2.3 and 2.4). This ensures that the solution will not be the trivial solution u_i(t) = const. The decorrelation of the output signals, equation 2.5, guarantees that different components carry different information. We first show how to solve the optimization problem of SFA in a way similar to that described by Wiskott and Sejnowski (2002) and then establish a link between SFA and second-order ICA. For discrete time series, the first derivative of u(t) can be approximated to first order by

\dot{u}(t) \approx u(t+1) - u(t).   (4.2)
Using this approximation, we can rewrite the SFA objective function, equation 4.1, as

\Delta(u_i) \approx \langle (u_i(t+1) - u_i(t))^2 \rangle   (4.3)

= \langle u_i(t+1)\, u_i(t+1) \rangle + \langle u_i(t)\, u_i(t) \rangle - \langle u_i(t)\, u_i(t+1) \rangle - \langle u_i(t+1)\, u_i(t) \rangle   (4.4)

= 2 \langle u_i(t)^2 \rangle - 2 \langle u_i(t)\, u_i(t+1) \rangle   (4.5)

(since ⟨u_i(t+1)²⟩ = ⟨u_i(t)²⟩ because we average over all t)

= 2 - 2 \langle u_i(t)\, u_i(t+1) \rangle   (4.6)

(since ⟨u_i(t)²⟩ = 1 because u(t) is white; see equation 2.4).

Since the constant factor does not matter during optimization, instead of minimizing Δ(u_i), we can maximize

\Lambda(u_i) := 1 - \frac{1}{2} \Delta(u_i)   (4.7)

= \langle u_i(t)\, u_i(t+1) \rangle   (4.8)

= C^{(u)}_{ii}(1)   (4.9)

= q_i^T\, C^{(y)}(1)\, q_i.   (4.10)
The objective function Λ(u_i) is a function of the rotation matrix Q, and we are thus searching for the orthogonal weight vectors q_i in equation 4.10 that maximize Λ(u_i). The solution for i = 1 is obviously the eigenvector to the largest eigenvalue of C^{(y)}(1), which yields the slowest component u_1(t) = q_1^T y(t). The following eigenvectors in order of decreasing eigenvalue yield the next-slowest components, u_2(t), u_3(t), and so forth. Therefore, to extract all slow components, the maximization problem, equation 4.10, can be formulated as an eigenvalue problem,

C^{(y)}(1)\, Q^T = Q^T \Lambda,   (4.11)

where Λ denotes a diagonal matrix with Λ_ii being the ith largest eigenvalue and q_i the corresponding eigenvectors. In order to allow a better comparison with second-order ICA, we now want to deduce an alternative formulation of SFA; that is, we want to construct an objective function similar to that of second-order ICA. First, we show the equivalence of solving the eigenvalue problem, equation 4.11, and the diagonalization of C^{(u)}(1). If we multiply both sides of equation 4.11 from the left with Q, we obtain

C^{(u)}(1) = Q\, C^{(y)}(1)\, Q^T = \Lambda.   (4.12)

Since Λ is diagonal, C^{(u)}(1) is diagonal too. Therefore, solving the eigenvalue problem for C^{(y)}(1) is equivalent to finding a rotation matrix Q such that the time-delayed correlation matrix C^{(u)}(1) is diagonal. Second, to
perform the diagonalization, we minimize all off-diagonal entries of C^{(u)}(1) using the same Jacobi scheme as for second-order ICA (see section 3) and define the following objective function for SFA:

\Psi_{SFA} := \sum_{i \neq j} \left( C^{(u)}_{ij}(1) \right)^2   (4.13)

= \sum_{i \neq j} \left( q_i^T\, C^{(y)}(1)\, q_j \right)^2.   (4.14)
Minimizing this expression produces the same slow components u_1(t), …, u_N(t) as obtained by the eigenvalue problem, equation 4.11, again assuming an additional sorting step. Note also that this is equivalent to a decorrelation of the time derivatives of the output signal components u_i(t) (cf. Wiskott, 2003), since ⟨u̇_i u̇_j⟩ = −2 C^{(u)}_{ij}(1) for i ≠ j. Interestingly, the objective function, equation 4.14, is identical to the one for ICA, equation 3.4. With this observation, we arrive at the important result that linear SFA is formally equivalent to second-order ICA with time delay one. To bring equation 4.13 into a form that can be understood more intuitively in the sense of SFA, we can use the fact that the sum of all squared entries of correlation matrices with a given time delay τ is invariant under orthogonal transformations,

\sum_{i,j} \left( C^{(u)}_{ij}(\tau) \right)^2 = \sum_{i,j} \left( C^{(y)}_{ij}(\tau) \right)^2 = const.   (4.15)
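The invariance in equation 4.15 is just the invariance of the squared Frobenius norm under orthogonal transformations and can be checked numerically (an illustrative sketch, not code from the letter):

```python
import numpy as np

# Equation 4.15: since C^(u) = Q C^(y) Q^T with Q orthogonal, the sum of
# all squared matrix entries (the squared Frobenius norm) is unchanged.
rng = np.random.default_rng(3)
c_y = rng.standard_normal((4, 4))
c_y = 0.5 * (c_y + c_y.T)                         # a symmetric correlation-like matrix

q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # a random orthogonal matrix
c_u = q @ c_y @ q.T

f_y = np.sum(c_y**2)
f_u = np.sum(c_u**2)
print(f_y, f_u)   # equal up to floating-point rounding
```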
We can split this sum into two terms,

\sum_{i,j} \left( C^{(u)}_{ij}(\tau) \right)^2 = \sum_{i} \left( C^{(u)}_{ii}(\tau) \right)^2 + \sum_{i \neq j} \left( C^{(u)}_{ij}(\tau) \right)^2 = const,   (4.16)
so that it is easy to see that the minimization of Ψ_SFA is equivalent to the maximization of

\tilde{\Psi}_{SFA} := \sum_{i} \left( C^{(u)}_{ii}(1) \right)^2   (4.17)

= \sum_{i} \left( q_i^T\, C^{(y)}(1)\, q_i \right)^2.   (4.18)
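Putting the pieces of this section together, linear SFA reduces to whitening followed by an eigendecomposition of the symmetrized one-step correlation matrix, solving the eigenvalue problem of equation 4.11. The following NumPy sketch (hypothetical function name; not code from the letter) recovers a slow source from a linear mixture:

```python
import numpy as np

def linear_sfa(x):
    """Linear SFA sketch: whiten x (shape T x N), then rotate with the
    eigenvectors of the symmetrized one-step correlation matrix
    (cf. equation 4.11). Returns output signals, slowest first."""
    x = x - x.mean(axis=0)
    # Whitening: y = W x with <y y^T> = I.
    cov = x.T @ x / len(x)
    d, e = np.linalg.eigh(cov)
    w = e @ np.diag(d ** -0.5) @ e.T
    y = x @ w.T
    # Symmetrized time-delayed correlation matrix C^(y)(1).
    c = y[:-1].T @ y[1:] / (len(y) - 1)
    c = 0.5 * (c + c.T)
    # Eigenvectors ordered by decreasing eigenvalue = decreasing slowness.
    lam, q = np.linalg.eigh(c)
    q = q[:, np.argsort(lam)[::-1]].T
    return y @ q.T

# Demo: a slow sine and fast noise, mixed linearly.
T = 5000
t = np.arange(T)
slow = np.sin(2 * np.pi * t / 500)
fast = np.random.default_rng(2).standard_normal(T)
x = np.stack([slow + 0.5 * fast, slow - fast], axis=1)
u = linear_sfa(x)

# The first output should be (up to sign) the slow source.
corr = np.corrcoef(u[:, 0], slow)[0, 1]
print(abs(corr))
```

By the equivalence shown above, this is at the same time a second-order ICA algorithm with time delay one.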
Having started from minimizing temporal variations, equation 4.1, as an objective for SFA, we have now arrived at an objective of maximizing squared autocorrelations at time delay one, equation 4.17. This relation can be
interpreted intuitively. A signal component with a large squared autocorrelation has a high temporal predictability. If the autocorrelation is positive (i.e., C^{(u)}_{ii}(1) > 0), predictability implies that the signal component has to vary slowly. What if the autocorrelation is negative? This could happen if, for example, u_i(t) has alternating signs for successive data points. Consider the signal

u_i(t) := \begin{cases} -1 & \text{for } t \text{ odd} \\ +1 & \text{for } t \text{ even} \end{cases}   (4.19)
with 1 ≤ t ≤ T. This signal has zero mean and unit variance and thus fulfills constraints 2.3 and 2.4. Furthermore, it is favorable in terms of the objective 4.17, since C^{(u)}_{ii}(1) has a large absolute value. On the other hand, this is a very quickly varying component, which might seem paradoxical, since maximizing equation 4.17 should result in slowly varying components. This apparent contradiction can be resolved by studying the constraints imposed on the optimization of equation 4.17. Since Q is an orthogonal matrix, the trace of C^{(u)}(1) is invariant under the transformation u(t) = Q y(t) (e.g., Zurmühl & Falk, 1997). If we consider all N possible components in the optimization procedure, the decrease of one correlation C^{(u)}_{ii}(1) implies the increase of at least one other correlation C^{(u)}_{jj}(1). Therefore, extracting the most slowly varying signals implies that other extracted components correspond to the most quickly varying signals. Hence, it is reasonable to further minimize negative correlations, since this implies that other correlations will be maximized. As above, a successive sorting step is required to bring the components in order of increasing temporal variation.

5 More Than One Time Delay

5.1 Second-Order ICA. We know that second-order ICA can always be solved with a single time delay (Tong, Liu, Soon, & Huang, 1991). However, the delay τ has to be chosen properly so that all eigenvalues of C^{(y)}(τ) are distinct. To obtain a more robust method, one can consider a certain number T of time-delayed correlation matrices with respective time delays τ = 1, 2, …, T and diagonalize them jointly (Belouchrani et al., 1997; Ziehe & Müller, 1998). This leads to a straightforward extension of objective 3.3, subject to minimization,
\Psi_{ICAj} := \sum_{\tau=1}^{T} \kappa_\tau\, \Psi_{ICA}(\tau)   (5.1)

= \sum_{\tau} \kappa_\tau \sum_{i \neq j} \left( C^{(u)}_{ij}(\tau) \right)^2   (5.2)

= \sum_{\tau} \kappa_\tau \sum_{i \neq j} \left( q_i^T\, C^{(y)}(\tau)\, q_j \right)^2,   (5.3)
where we introduced positive factors κ_τ that allow us to weight correlation matrices with different time delays differently. In equation 5.1, we write Ψ_ICAj for joint-diagonalization ICA. Pham and Garat (1997) have derived a formula closely related to equation 5.3 with a maximum likelihood approach. Extending the objective function of ICA in this way leads to the joint diagonalization of several correlation matrices with different time delays. Decorrelation is thus achieved over a time window of length T. It is intuitively clear that by enlarging the window length, the unmixing performance should improve until the width of the autocorrelation function is reached. Exceeding this limit would introduce matrices consisting entirely of zero-mean noise, which would degrade the unmixing performance.

5.2 Linear SFA

5.2.1 Joint Diagonalization. We can use an argument similar to the one used for second-order ICA in order to extend SFA to more than a single time delay. Adding more time-delayed autocorrelations increases the temporal predictability of the signal. Knowing the amplitude of a signal at a given time can give a good prediction for the next T time points, since they are strongly correlated. Signals with large temporal predictability are in turn likely to be slowly varying (cf. the end of section 4). Thus, an intuitive extension of the normal SFA objective, equation 4.17, subject to maximization, is

\Psi_{SFAj} := \sum_{\tau} \kappa_\tau\, \tilde{\Psi}_{SFA}(\tau)   (5.4)

= \sum_{\tau} \kappa_\tau \sum_{i} \left( C^{(u)}_{ii}(\tau) \right)^2   (5.5)

= \sum_{\tau} \kappa_\tau \sum_{i} \left( q_i^T\, C^{(y)}(\tau)\, q_i \right)^2.   (5.6)
As in equations 5.1 to 5.3, we have introduced weighting factors κτ for the delayed correlation matrices. Note that this new objective, equations 5.5 and 5.6, is again equivalent to the ICA objective, equations 5.2 and 5.3, due to the constancy of the sum of all squared entries of each time-delayed correlation matrix, equation 4.16.
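The multi-delay objective of equation 5.5 can be sketched as follows (an illustrative NumPy sketch with hypothetical names and exponentially decaying weights; not code from the letter). A slowly varying component scores far higher than a white one:

```python
import numpy as np

def delayed_autocorr(u, tau):
    """Symmetrized time-delayed correlation matrix (cf. equation 3.2).
    u has shape (T, N), zero-mean columns."""
    a = u[:-tau].T @ u[tau:] / (len(u) - tau)
    return 0.5 * (a + a.T)

def psi_sfaj(u, taus, gamma=0.5):
    """Weighted sum of squared delayed autocorrelations (equation 5.5),
    with kappa_tau = exp(-gamma * tau); to be maximized."""
    total = 0.0
    for tau in taus:
        c = delayed_autocorr(u, tau)
        total += np.exp(-gamma * tau) * np.sum(np.diag(c) ** 2)
    return total

T = 10_000
t = np.arange(T)
slow = np.sqrt(2) * np.sin(2 * np.pi * t / 1000)    # slow, unit variance
fast = np.random.default_rng(4).standard_normal(T)  # white noise

psi_slow = psi_sfaj(slow[:, None], range(1, 6))
psi_fast = psi_sfaj(fast[:, None], range(1, 6))
print(psi_slow, psi_fast)
```

The exponential decay of the weights κ_τ anticipates the suggestion made below for avoiding inconsistencies between delays.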
2504
T. Blaschke, P. Berkes, and L. Wiskott
We must be careful with this definition for two reasons. Firstly, while the definition of slowness based on C^{(u)}_{ii}(1) corresponds to our intuition of what a slow signal is, C^{(u)}_{ii}(2) can have a large, positive value for signal components that we would not consider slow at all. In fact, the alternating signal, equation 4.19, would yield a maximal value for C^{(u)}_{ii}(2). Secondly, consider the case where two time-delayed autocorrelations have opposite signs, for example, C^{(u)}_{ii}(1) < 0 and C^{(u)}_{ii}(2) > 0. Maximizing objective function 5.5 would favor a decreasing value of C^{(u)}_{ii}(1) (since it is negative) and an increasing value of C^{(u)}_{ii}(2). The former would intuitively tend to make the signal faster, while the latter would make it slower. Thus, if the autocorrelations of a component have different signs for different time delays, the objective function appears to be inconsistent, at least for that component. This conflict cannot be resolved as easily as the one discussed at the end of section 4. However, one can at least monitor the signs of the autocorrelations and diagnose the inconsistent cases. It is not clear to us how often these two problems arise in practice. We believe that by weighting the first autocorrelation more strongly than the others, for example, with an exponential decay of the weights, the inconsistencies can be largely avoided.

5.2.2 Linear Filtering. An alternative to the joint diagonalization of several correlation matrices with different time delays in analogy to second-order ICA is to average over a range of time delays within one correlation matrix and diagonalize just this one matrix. To do so, we introduce the following new measure of slowness (cf. equations 4.7 to 4.10):
\tilde{\Lambda}(u_i) := \left\langle u_i(t) \sum_{\tau} \kappa_\tau\, u_i(t+\tau) \right\rangle   (5.7)

= \sum_{\tau} \kappa_\tau \left\langle u_i(t)\, u_i(t+\tau) \right\rangle   (5.8)

= \sum_{\tau} \kappa_\tau\, C^{(u)}_{ii}(\tau)   (5.9)

= q_i^T \left( \sum_{\tau} \kappa_\tau\, C^{(y)}(\tau) \right) q_i   (5.10)

=: q_i^T\, \tilde{C}^{(y)}\, q_i,   (5.11)
with constants κ_τ that weight different time delays differently. This definition differs from that of equations 4.7 to 4.10 in that u_i(t) should be well correlated not only to the next data point but also to a weighted average over the next T data points. This is a straightforward way of taking several timescales into account. Note that the weighted averaging is a linear filter operation.
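The single filtered matrix of equation 5.10 can be formed and diagonalized in analogy to equation 4.11 (an illustrative NumPy sketch with hypothetical names; not code from the letter):

```python
import numpy as np

def filtered_corr(y, gamma=0.5, T_max=10):
    """The single matrix C~(y) of equation 5.10: a weighted sum of
    symmetrized time-delayed correlation matrices, with
    kappa_tau = exp(-gamma * tau)."""
    n = y.shape[1]
    c = np.zeros((n, n))
    for tau in range(1, T_max + 1):
        a = y[:-tau].T @ y[tau:] / (len(y) - tau)
        c += np.exp(-gamma * tau) * 0.5 * (a + a.T)
    return c

# Demo: rotate two approximately white sources, then recover the slow
# one as the leading eigenvector of the filtered correlation matrix.
rng = np.random.default_rng(5)
T = 20_000
t = np.arange(T)
slow = np.sqrt(2) * np.sin(2 * np.pi * t / 2000)
fast = rng.standard_normal(T)
s = np.stack([slow, fast], axis=1)

phi = 0.7
q0 = np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])
y = s @ q0.T                                   # rotated "input" signal

lam, q = np.linalg.eigh(filtered_corr(y))
u = y @ q[:, ::-1]                             # largest eigenvalue first
corr = abs(np.corrcoef(u[:, 0], slow)[0, 1])
print(corr)
```

Because only one matrix is diagonalized, there is no possibility of conflicting signs between delays, which is the point made for this variant below.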
As in the joint diagonalization extension, exponentially decaying weights κ_τ := exp(−γτ) for the different time delays seem to be a suitable choice. With such weights, this measure of slowness is similar to the objective of temporal smoothness used by Stone (1995) and somewhat related to the trace learning rules introduced by Földiák (1991). Because of the similarity of equation 5.11 with equation 4.10, we can apply the steps that led from equation 4.10 to equation 4.18 and derive the following objective function to be maximized,

\Psi_{SFAl} := \sum_{i} \left( \tilde{C}^{(u)}_{ii} \right)^2   (5.12)

= \sum_{i} \left( q_i^T\, \tilde{C}^{(y)}\, q_i \right)^2,   (5.13)
where \tilde{C}^{(u)} is defined analogously to \tilde{C}^{(y)}, and Ψ_SFAl stands for linear-filtering SFA. Since this objective function is based on just one correlation matrix, it does not have the problems mentioned above for the joint diagonalization extension (see section 5.2.1). Blaschke (2005, sec. 8.2.2) also considered extending SFA by simultaneously minimizing the variance not only of the first but also of higher-order derivatives, which could result in even more stable signals. This would also lead to equation 5.13, because discrete approximations of higher-order derivatives involve multiple time delays. In this case, with positive weights for all derivatives, the constants κ_τ in equation 5.10 would have values with alternating signs (positive for odd τ and negative for even τ), which is somewhat counterintuitive. We do not fully understand the implications of this effect but believe that higher-order derivatives do not offer a good way of extending SFA to longer timescales, even though unmixing performance was good in some simple examples.

6 Conclusion

The main result of this work is that linear SFA and second-order ICA with time delay one are formally equivalent (see equations 3.4 and 4.14). This is surprising, because SFA and ICA are based on two very different principles: slowness versus statistical independence. These principles might seem to contradict each other, because two analog signals of finite length would typically become more statistically dependent if they were more slowly varying. The formal equivalence of linear SFA and second-order ICA with time delay one allows us to apply the intuition we have gained for one algorithm to deepen our understanding of the other. For example, it is known that higher-order ICA applied to natural images learns linear filters similar to Gabor wavelets (e.g., Bell & Sejnowski, 1997; van Hateren & van der Schaaf,
1998), which in turn resemble receptive fields of simple cells in V1. On the other hand, linear SFA (and therefore also second-order ICA with time delay one) applied to natural image sequences learns filters similar to the principal components of natural images, the first of which are effectively spatial lowpass filters and therefore also generate slowly varying output signals. This suggests that the solutions found by second-order ICA and higher-order ICA can be very different in practice even though both methods try to maximize statistical independence. Despite the formal equivalence in the linear case and for time delay one, SFA and ICA have different objectives and differ in the more general case. Firstly, while in standard SFA the time delay is fixed to 1 due to the approximation of the time derivative, in ICA it can be chosen freely, or one can use several correlation matrices with different time delays simultaneously for optimal unmixing (see section 5.1). We have seen (see section 5.2.1) that the same extension to several time delays can also be used for SFA, but that the algorithm then becomes inconsistent with respect to the slowness objective if the entries of the time-delayed correlation matrices have different signs for different delays. An extension more consistent with the slowness objective is based on linear filtering before computing the time derivative (see section 5.2.2). This also introduces several time delays, but in a different way than used for ICA. Thus, when taking several time delays into account, the conceptual differences between ICA and SFA become relevant. Secondly, in the nonlinear case, many output signal components can be extracted from a lower-dimensional input signal. With SFA, they would all be uncorrelated and ordered by slowness, in agreement with the definition in equations 2.3 to 2.5 and 4.1. With second-order ICA, they would not be ordered in any way and would not be statistically independent for dimensionality reasons. 
The results would therefore be inconsistent with the ICA objective. Thus, in the nonlinear case, the conceptual differences between ICA and SFA also matter. We believe that the close relation between linear SFA and second-order ICA will lead to a way to combine the two algorithms into a nonlinear method for extracting slowly varying and statistically independent components and thereby perform nonlinear blind source separation. This is the subject of current research (Blaschke & Wiskott, 2004b, 2005).

Acknowledgments

This work has been supported by a grant to L.W. from the Volkswagen Foundation.

References

Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Belouchrani, A., Abed Meraim, K., Cardoso, J.-F., & Moulines, E. (1997). A blind source separation technique based on second order statistics. IEEE Transactions on Signal Processing, 45(2), 434–44.

Blaschke, T. (2005). Independent component analysis and slow feature analysis: Relations and combination. Doctoral dissertation, Humboldt-University Berlin. Available online at http://edoc.huberlin.de/docviews/abstract.php?lang=ger&id=25458.

Blaschke, T., & Wiskott, L. (2004a). CuBICA: Independent component analysis by simultaneous third- and fourth-order cumulant diagonalization. IEEE Transactions on Signal Processing, 52(5), 1250–1256.

Blaschke, T., & Wiskott, L. (2004b). Independent slow feature analysis and nonlinear blind source separation. In Proc. of the 5th Int. Conf. on Independent Component Analysis and Blind Signal Separation. Berlin: Springer-Verlag.

Blaschke, T., & Wiskott, L. (2005). Nonlinear blind source separation by integrating independent component analysis and slow feature analysis. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 177–184). Cambridge, MA: MIT Press.

Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non-gaussian signals. IEE Proceedings-F, 140, 362–370.

Cardoso, J.-F., & Souloumiac, A. (1996). Jacobi angles for simultaneous diagonalization. SIAM J. Mat. Anal. Appl., 17(1), 161–164.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3), 287–314.

Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3(2), 194–200.

Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626–634.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Hyvärinen, A., & Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results.
Neural Networks, 12(3), 429–439.

Jutten, C., & Karhunen, J. (2003). Advances in nonlinear blind source separation. In Proc. of the 4th Int. Symposium on Independent Component Analysis and Blind Signal Separation (pp. 245–256). Available online at http://www.kecl.ntt.co.jp/icl/signal2003/index.html.

Lee, T.-W., Girolami, M., & Sejnowski, T. (1999). Independent component analysis using an extended Infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neural Computation, 11(2), 409–433.

Molgedey, L., & Schuster, G. (1994). Separation of a mixture of independent signals using time-delayed correlations. Physical Review Letters, 72(23), 3634–3637.

Nuzillard, D., & Nuzillard, J.-M. (2003). Second-order blind source separation in the Fourier space of data. Signal Processing, 83(3), 627–631.

Pearlmutter, B., & Parra, L. (1996). A context-sensitive generalization of ICA. In Proc. of the International Conference on Neural Information Processing. Berlin: Springer-Verlag.

Pham, D., & Garat, P. (1997). Blind separation of mixtures of independent sources through a maximum likelihood approach. IEEE Transactions on Signal Processing, 45(7), 1712–1725.
Stone, J. (1995). A learning rule for extracting spatio-temporal invariances. Network, 6(3), 1–8.

Tong, L., Liu, R., Soon, V. C., & Huang, Y.-F. (1991). Indeterminacy and identifiability of blind identification. IEEE Transactions on Circuits and Systems, 38(5), 499–509.

van Hateren, J., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. R. Soc. Lond. B, 265, 359–366.

Wiskott, L. (2003). Slow feature analysis: A theoretical analysis of optimal free responses. Neural Computation, 15(9), 2147–2177.

Wiskott, L., & Sejnowski, T. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770.

Zibulevsky, M., & Pearlmutter, B. (2000). Second order blind source separation by recursive splitting of signal subspaces. In Proc. of the 2nd Int. Workshop on Independent Component Analysis and Blind Signal Separation (pp. 489–491). Available online at http://www.cis.hut.fi/ica2000.

Ziehe, A., & Müller, K.-R. (1998). TDSEP—an efficient algorithm for blind separation using time structure. In Proc. of the 8th Int. Conference on Artificial Neural Networks (pp. 675–680). Berlin: Springer-Verlag.

Zurmühl, R., & Falk, S. (1997). Matrizen und ihre Anwendungen (Vol. 1, 6th ed.). Berlin: Springer-Verlag.
Received September 17, 2004; accepted April 7, 2006.
LETTER
Communicated by Joshua B. Tenenbaum
Nonlocal Estimation of Manifold Structure

Yoshua Bengio
[email protected]

Martin Monperrus
[email protected]

Hugo Larochelle
[email protected]

Département d'Informatique et Recherche Opérationnelle, Centre de Recherches Mathématiques, Université de Montréal, Montréal, Québec, Canada, H3C 3J7
We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation invites an exploration of nonlocal manifold learning algorithms that attempt to discover shared structure in the tangent planes at different positions. A training criterion for such an algorithm is proposed, and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where local nonparametric methods fail.

1 Introduction

A central issue in order to obtain generalization is how information from training examples is used to make predictions about new examples. In nonparametric models, there are no strong prior assumptions about the structure of the underlying generating distribution, and this might make it difficult to generalize far from the training examples, as illustrated by the curse of dimensionality. In recent years, there has been a lot of work on unsupervised learning based on characterizing a possibly nonlinear manifold near which the data would lie, such as locally linear embedding (LLE) (Roweis & Saul, 2000), Isomap (Tenenbaum, de Silva, & Langford, 2000), kernel principal component analysis (or kernel PCA) (Schölkopf, Smola, & Müller, 1998), Laplacian eigenmaps (Belkin & Niyogi, 2003), and manifold charting (Brand, 2003). These are all essentially nonparametric methods that can be shown to be kernel methods with an adaptive kernel (Bengio et al., 2004) and represent the manifold on the basis of local neighborhood relations.
Neural Computation 18, 2509–2528 (2006)    © 2006 Massachusetts Institute of Technology

Very often, these relations are constructed using the nearest neighbors graph (the graph with one vertex per observed example and arcs
between near neighbors). These methods characterize the manifold through an embedding that associates each training example (an input object) with a low-dimensional coordinate vector (the coordinates on the manifold). Other closely related methods characterize the manifold as well as noise around it. Most of these methods consider the density as a mixture of flattened gaussians, such as mixtures of factor analyzers (Ghahramani & Hinton, 1996), manifold Parzen windows (Vincent & Bengio, 2003), and other local PCA models such as mixtures of probabilistic PCA (Tipping & Bishop, 1999). This is not an exhaustive list, and recent work also combines models through a mixture density and dimensionality reduction (Teh & Roweis, 2003; Brand, 2003). In this letter, we claim that there is a fundamental weakness with such nonparametric kernel methods, due to the locality of learning. We show that for these methods, the definition of the local tangent plane of the manifold at a point x is defined based mostly on the near neighbors of x. As a consequence, it is difficult with such methods to generalize to new combinations of values x that are far from the training examples xi , where being “far” is a notion that should be understood in the context of several factors: the amount of noise around the manifold (the examples do not lie exactly on the manifold), the curvature of the manifold, and the dimensionality of the manifold. For example, if the manifold curves quickly around x, neighbors need to be closer for a locally linear approximation to be meaningful. Dimensionality of the manifold compounds that problem because the number of data thus needed will grow exponentially with it. Saying that y is “far” from x means that y is not well represented by its projection on the tangent plane at x. 
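The locality just described can be made concrete: a purely local estimator obtains the tangent direction at x from a PCA of x's nearest neighbors alone, so it depends only on nearby training examples. The following sketch (illustrative, with hypothetical names; not code from the letter) does this for a one-dimensional manifold, the unit circle:

```python
import numpy as np

# Training data sampled from a 1-D manifold (the unit circle) in 2-D.
rng = np.random.default_rng(6)
theta = rng.uniform(0, 2 * np.pi, 500)
data = np.stack([np.cos(theta), np.sin(theta)], axis=1)

def local_tangent(x, data, k=10):
    """Local tangent estimate at x: leading principal direction of the
    k nearest neighbors of x (a 'local PCA' estimator)."""
    d = np.linalg.norm(data - x, axis=1)
    nbrs = data[np.argsort(d)[:k]]
    nbrs = nbrs - nbrs.mean(axis=0)
    _, _, vt = np.linalg.svd(nbrs, full_matrices=False)
    return vt[0]

x = np.array([1.0, 0.0])
t_hat = local_tangent(x, data)
# The true tangent of the unit circle at (1, 0) is (0, +-1).
print(t_hat)
```

Such an estimator works where the data are dense but, as argued above, degrades with noise, curvature, and manifold dimensionality, since it ignores all structure shared with distant regions.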
In this letter, we explore one way to address these problems, based on estimating the tangent planes of the manifolds as a function F taking x as argument and computing a prediction of the tangent plane around x. The important point is that F can be estimated not only from the data around x but from the whole dataset. Hence if there is a compact way to represent the manifold structure and if the class from which F is chosen can represent it and if F can be optimized to learn it, then it can generalize to regions with not enough data to determine the manifold shape from looking at near neighbors (which may be the case even in regions where there is data, when the manifold dimension is high, or the manifold is highly curved, or the data do not lie strictly on the manifold). We present experiments on a variety of tasks illustrating the weaknesses of the local manifold learning algorithms. The most striking result is that the nonlocal model is able to generalize a notion of rotation learned on one kind of image (digits) to a very different kind (alphabet characters), very far from the training data. 2 Local Manifold Learning By local manifold learning, we mean a method that derives information about the local structure of the manifold (i.e., implicitly its tangent directions)
Nonlocal Estimation of Manifold Structure
2511
at x based mostly on the training examples around x. There is a large group of manifold learning methods (as well as spectral clustering methods) that share several characteristics and can be seen as data-dependent kernel PCA (Bengio et al., 2004). These include LLE (Roweis & Saul, 2000), Isomap (Tenenbaum et al., 2000), kernel PCA (Schölkopf et al., 1998), and Laplacian eigenmaps (Belkin & Niyogi, 2003). The methods first build a data-dependent Gram matrix M with n × n entries K_D(x_i, x_j), where D = {x_1, …, x_n} is the training set and K_D is a data-dependent kernel, and compute the eigenvector-eigenvalue pairs {(v_k, λ_k)} of M. The embedding of the training set is obtained directly from the principal eigenvectors v_k of M (the ith element of v_k gives the kth coordinate of x_i's embedding, i.e., e_k(x_i) = v_{ki}, possibly scaled by \sqrt{\lambda_k / n}), and the embedding for a new example can be estimated using the Nyström formula (Bengio et al., 2004),

e_k(x) = \frac{1}{\lambda_k} \sum_{i=1}^{n} v_{ki}\, K_D(x, x_i),   (2.1)

for the kth coordinate of x, where λ_k is the kth eigenvalue of M (the optional
scaling by \sqrt{\lambda_k / n} would also apply). Equation 2.1 says that the embedding for a new example x is a local interpolation of the manifold coordinates v_{ki} of its neighbors x_i, with interpolating weights given by K_D(x, x_i)/\lambda_k. To see more clearly how the tangent plane may depend only on the neighbors of x, consider the relation between the tangent plane and the embedding function: for any K_D, the tangent plane at x is simply the subspace spanned by the vectors ∂e_k(x)/∂x, as illustrated in Figure 1. We show below that in the case of very local kernels like that of LLE, spectral clustering with gaussian kernel, Laplacian eigenmaps, or kernel PCA with gaussian kernel, ∂e_k(x)/∂x depends significantly only on the near neighbors of x. Consider first the simplest case: kernel PCA with a gaussian kernel. Then ∂e_k(x)/∂x can be closely approximated by a linear combination of the difference vectors (x − x_j) for x_j near x. The weights of that combination may depend on the whole data set, but if the ambient space has many more dimensions than the number of such near neighbors of x, this is a very strong, locally determined constraint on the shape of the manifold. If there are enough examples in a small enough neighborhood around x, then these approaches work well. However, if the dimensionality, curvature, and noise are too large, generalization will be poor, as argued in more detail in the next subsection. Let us now consider the case of LLE and show that similar results are obtained. A kernel consistent with LLE is K_LLE(x, x_i), the weight of x_i in the reconstruction of x by its k nearest neighbors (Bengio et al., 2004). This weight is obtained by the following equation (Saul & Roweis, 2002),

K_{LLE}(x, x_i) = \frac{\sum_{j=1}^{k} G^{-1}_{ij}}{\sum_{l=1}^{k} \sum_{m=1}^{k} G^{-1}_{lm}},   (2.2)
2512
Y. Bengio, M. Monperrus, and H. Larochelle
Figure 1: The tangent plane is spanned by the vectors ∂e_k(x)/∂x, that is, the directions of most rapid change of coordinate k when moving along the manifold. (The figure shows data on a curved manifold together with the tangent directions and the tangent plane at a point.)
with G⁻¹ the inverse of the local Gram matrix G, G_{lm} = (x − x_l) · (x − x_m), for all pairs (x_l, x_m) of k nearest neighbors of x in the training set. Because G⁻¹ = |G|⁻¹ C^T, with C the cofactor matrix of G, and because the factor |G|⁻¹ in the numerator and the denominator cancels, equation 2.2 can be rewritten as

$$K_{LLE}(x, x_i) = \frac{\sum_j s_j \prod_{l,m} (G_{lm})^{t_{jlm}}}{\sum_j u_j \prod_{l,m} (G_{lm})^{v_{jlm}}},$$

where $\sum_j s_j \prod_{l,m} (G_{lm})^{t_{jlm}}$ is a polynomial expansion of the cofactor element C_{ij} (i.e., a determinant), and similarly for C_{lm}. Consequently, by the usual rules of differentiation, its derivative is a linear combination of derivatives
Nonlocal Estimation of Manifold Structure
of terms of the form (G_{lm})^t. But

$$\frac{\partial (G_{lm})^t}{\partial x} = \frac{\partial ((x - x_l) \cdot (x - x_m))^t}{\partial x} = t (G_{lm})^{t-1} (x - x_l + x - x_m),$$

which implies that the derivative of K_LLE(x, x_i) with regard to x is in the span of the vectors (x − x_j), with x_j one of the k nearest neighbors of x. The case of Isomap is less intuitively obvious, but we show below that it is also local. Let D(a, b) denote the graph geodesic distance going only through a, b, and points from the training set. As shown in Bengio et al. (2004), the corresponding data-dependent kernel can be defined as

$$K_D(x, x_i) = -\frac{1}{2}\left(D(x, x_i)^2 - \frac{1}{n}\sum_j D(x, x_j)^2 - \bar{D}_i + \bar{D}\right),$$

where

$$\bar{D}_i = \frac{1}{n}\sum_j D(x_i, x_j)^2 \quad \text{and} \quad \bar{D} = \frac{1}{n}\sum_j \bar{D}_j.$$

Let N(x, x_i) denote the index j of the training set example x_j that is the neighbor of x minimizing $\|x - x_j\| + D(x_j, x_i)$. Then

$$\frac{\partial e_k(x)}{\partial x} = \frac{1}{\lambda_k}\sum_i v_{ki}\left(\frac{1}{n}\sum_j D(x, x_j)\,\frac{x - x_{N(x,x_j)}}{\|x - x_{N(x,x_j)}\|} - D(x, x_i)\,\frac{x - x_{N(x,x_i)}}{\|x - x_{N(x,x_i)}\|}\right), \qquad (2.3)$$
which is a linear combination of vectors (x − xk ), where xk is a neighbor of x. This clearly shows that the tangent plane at x associated with Isomap is also included in the subspace spanned by the vectors (x − xk ) where xk is a neighbor of x. There is also a variety of local manifold learning algorithms that can be classified as “mixtures of pancakes” (Ghahramani & Hinton, 1996; Tipping & Bishop, 1999; Vincent & Bengio, 2003; Teh & Roweis, 2003; Brand,
2003). These are generally mixtures of gaussians with a particular covariance structure. When the covariance matrix is approximated using its principal eigenvectors, this leads to "local PCA" types of methods. For these methods, the local tangent directions directly correspond to the principal eigenvectors of the local covariance matrices. Learning is also local, since it is mostly the examples around a gaussian's center that determine its covariance matrix. The problem is not so much the form of the density as a mixture of gaussians; the problem is that the local parameters (e.g., local principal directions) are estimated mostly from local data. There is usually a nonlocal interaction between the different gaussians, but its role is mainly one of global coordination: for example, where to set the gaussian centers so as to allocate them properly where there are data and, optionally, how to orient the principal directions so as to obtain a globally coherent coordinate system for embedding the data.

2.1 Where Local Manifold Learning Would Fail. It is easy to imagine at least four causes of failure for local manifold learning methods, which can be compounded:
- Noise around the manifold. Data are not exactly lying on the manifold. In the case of nonlinear manifolds, the presence of noise means that more data around each pancake region will be needed to properly estimate the tangent directions of the manifold in that region. More data are needed simply to average out the noise sufficiently (some random directions quite different from the local principal directions might otherwise be selected).
- High curvature of the manifold. Local manifold learning methods basically approximate the manifold by the union of many locally linear patches. For this to work, there must be at least d (with d the manifold dimension) close enough examples in each patch (more with noise). With a high-curvature manifold, more (and smaller) patches will be needed, and the number of patches required to cover the manifold will grow exponentially with the dimensionality of the manifold. Consider, for example, the manifold of translations of a high-contrast image (see Figure 2). The tangent direction corresponds to the change in image due to a small translation; it is nonzero only at the edges in the image. After a 1-pixel translation, the edges have moved by 1 pixel and may not overlap much with the edges of the original image if it had high contrast. This is indeed a very high curvature manifold. In addition, if the image resolution is increased, then many more training images will be needed to capture the curvature around the translation manifold with locally linear patches. Yet the physical phenomenon responsible for translation is expressed by a simple equation, which does not get more complicated with increasing resolution.
Figure 2: The manifold of translations of a high-contrast image has very high curvature. A smooth manifold is obtained by considering that an image is a sample on a discrete grid of an intensity function over a two-dimensional space. The tangent vector for translation is thus a tangent image, and it has high values only on the digit edges. The tangent plane for an image translated by only 1 pixel looks similar but changes abruptly, since the edges are also shifted by 1 pixel. Hence the two tangent planes are almost orthogonal. (The figure shows the high-contrast image and a shifted image, each with its tangent image and tangent directions.)
- High intrinsic dimension of the manifold. We have already seen that high manifold dimensionality d is hurtful because O(d) examples are required in each patch and O(r^d) patches (for some r depending on curvature) are necessary to span the manifold.
- Presence of many manifolds with few data per manifold. In many real-world settings, there is not just one global manifold but a large number of (generally nonintersecting) manifolds, which, however, share something about their structure. A simple example is the manifold of transformations (e.g., viewpoint, position, lighting) of 3D objects in 2D images. There is one manifold per object instance (corresponding to the successive application of small amounts of all of these transformations). If there are only a few examples for each such class, then it is almost impossible to learn the manifold structures using only local manifold learning. However, if the manifold structures are generated by a common underlying phenomenon, then a nonlocal manifold learning method could potentially learn all of these manifolds and even generalize to
manifolds for which a single instance is observed, as demonstrated in the experiments in section 5.

3 Nonlocal Manifold Tangent Learning

We propose here a new nonlocal manifold learning methodology. We choose to characterize the manifolds in the data distribution through a matrix-valued function F(x) that predicts at x ∈ R^m a basis for the tangent plane of the manifold near x; hence, F(x) ∈ R^{d×m} for a d-dimensional manifold. Basically, F(x) specifies in which directions (with regard to x) one expects to find near neighbors of x. We are going to consider a simple supervised learning setting to train this function. As with Isomap, we consider that the vectors (x − x_i), with x_i a near neighbor of x, span a noisy estimate of the manifold tangent space. We propose to use them to define a "noisy target" for training F(x). In our experiments, we simply collected the k nearest neighbors of each example x, but better selection criteria might be devised. Points on the predicted tangent subspace can be written x + F(x)^⊤ w, with w ∈ R^d being local coordinates in the basis specified by F(x). Several criteria are possible to match the neighbor differences with the subspace defined by F(x). One that lends itself to simple analytic calculations is to minimize the distance between the x − x_j vectors and their projection on the subspace defined by F(x). The low-dimensional local coordinate vector w_tj ∈ R^d that matches neighbor x_j of example x_t is thus an extra free parameter that has to be optimized, but this optimization can be done analytically. The overall training criterion involves a double optimization over the function F and the local coordinates w_tj of what we call the relative projection error,

$$R(F, w) = \sum_t \sum_{j \in N(x_t)} \frac{\|F(x_t)^\top w_{tj} - (x_t - x_j)\|^2}{\|x_t - x_j\|^2}, \qquad (3.1)$$
where w = {w_tj} and N(x) denotes the selected set of near neighbors of x. The objective is to minimize R over F and w simultaneously. The normalization by $\|x_t - x_j\|^2$ is to avoid giving more weight to the neighbors that are farther away. The above ratio amounts to minimizing the square of the sine of the projection angle. To perform the above minimization, we can do coordinate descent (which guarantees convergence to a minimum), that is, alternate changes in F and changes in w that at each step decrease the total criterion. Since the minimization over w can be done separately for each example x_t and neighbor x_j, it is equivalent to minimize

$$\frac{\|F(x_t)^\top w_{tj} - (x_t - x_j)\|^2}{\|x_t - x_j\|^2} \qquad (3.2)$$
over the vector w_tj for each such pair (done analytically) and compute the gradient of the above with respect to F (or its parameters) to move F slightly (in the experiments we used stochastic gradient on the parameters of F). The solution for w_tj is obtained by solving the linear system

$$F(x_t) F(x_t)^\top w_{tj} = F(x_t) \frac{(x_t - x_j)}{\|x_t - x_j\|^2}. \qquad (3.3)$$

In our implementation, this is done robustly through a singular value decomposition, $F(x_t)^\top = U S V^\top$, and, introducing a matrix B, w_tj = B(x_t − x_j), where B can be precomputed for all the neighbors of x_t,

$$B = \left(\sum_{k=1}^{d} 1_{S_k > \epsilon}\, V_{\cdot k} V_{\cdot k}^\top / S_k^2\right) F(x_t),$$

with ε a small regularization threshold. The gradient of the criterion with respect to the ith row of F(x_t), holding the local coordinates w_tj fixed, is simply

$$\frac{\partial R}{\partial F_i(x_t)} = 2 \sum_{j \in N(x_t)} \frac{w_{tji}}{\|x_t - x_j\|^2} \left(F(x_t)^\top w_{tj} - (x_t - x_j)\right), \qquad (3.4)$$
where w_tji is the ith element of w_tj. In practice, it is not necessary to store more than one w_tj vector at a time. In the experiments, F(·) is parameterized as a standard one-hidden-layer neural network with m inputs and d × m outputs. It is trained by stochastic gradient descent, one example x_t at a time. The rows of F(x_t) are not constrained to be orthogonal or to have norm 1; they are used only to define a basis for the tangent plane. Although the above algorithm provides a characterization of the manifold, it does not directly provide an embedding or a density function. However, once the tangent plane function is trained, there are ways to use it to obtain all of the above. The simplest method is to apply existing algorithms that provide both an embedding and a density function based on a gaussian mixture with pancake-like covariances. For example, one could use charting (Brand, 2003), and the local covariance matrix around x could
be of the form F(x)^⊤ F(x) + σ²I; that is, F specifies both the principal directions and the variances in these directions, and σ² takes care of off-manifold noise. Figure 3 illustrates why nonlocal tangent learning can be a more accurate predictor of the tangent plane. Since the tangent plane is estimated by a smooth predictor (in our case, a neural net) that has the potential to generalize nonlocally, the tangent plane tends to vary smoothly between training points. This will not be true for local PCA, for example, especially if there are not many training points. Note that this type of estimator can make predictions anywhere in the data space, even far from the training examples, which can be problematic for algorithms such as local PCA.
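To make the alternating optimization concrete, the analytic inner solve for w_tj and the resulting relative projection error at one point can be sketched as follows. This is a minimal illustration: a plain least-squares solve is used (it yields the same minimizer as the normal equations, since the denominator of the criterion does not depend on w), and the toy line data and all names are assumptions of this sketch, not the letter's experiments:

```python
import numpy as np

def relative_projection_error(F, xt, neighbors):
    # Criterion 3.1 restricted to one point x_t: for each neighbor x_j,
    # find w minimizing ||F^T w - (x_t - x_j)||^2 analytically, then
    # accumulate the residuals normalized by ||x_t - x_j||^2.
    err = 0.0
    for xj in neighbors:
        d = xt - xj
        w, *_ = np.linalg.lstsq(F.T, d, rcond=None)  # analytic inner solve
        err += np.sum((F.T @ w - d) ** 2) / np.sum(d ** 2)
    return err

# Toy check: points on a 1D line through the origin in R^3.
t = np.array([1.0, 2.0, -1.5])
xt = 0.5 * t
nbrs = [0.8 * t, 0.2 * t]
F_good = t[None, :]                  # d x m tangent basis aligned with the line
F_bad = np.array([[0.0, 0.0, 1.0]])  # a poorly chosen basis
print(relative_projection_error(F_good, xt, nbrs))  # ~0
print(relative_projection_error(F_bad, xt, nbrs))   # large
```

In the outer loop of the coordinate descent, this error would be differentiated with respect to the parameters of F while the w's are held fixed.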
4 Previous Work on Manifold Learning

The nonlocal manifold learning algorithm presented here (find F(·) minimizing min_w R(F, w)) is similar to the one proposed in Rao and Ruderman (1999) to estimate the generator matrix of a Lie group. That group defines a one-dimensional manifold generated by following the orbit x(t) = e^{Gt} x(0), where G is an m × m matrix and t is a scalar manifold coordinate. A multidimensional manifold can be obtained by replacing Gt above by a linear combination of multiple generating matrices. In Rao and Ruderman (1999), the matrix exponential is approximated to first order by (I + Gt), and the authors estimate G for a simple signal undergoing translations, using as a criterion the minimization of

$$\sum_{x,\tilde{x}} \min_t \|(I + Gt)x - \tilde{x}\|^2,$$

where x̃ is a neighbor of x in the data. Note that in this model, the tangent plane is a linear function of x, that is, F(x) = Gx. By minimizing the above across many pairs of examples, a good estimate of G for the artificial data was recovered by Rao and Ruderman (1999). Our proposal extends this approach to multiple dimensions and nonlinear relations between x and the tangent planes. The work on tangent distance (Simard, LeCun, & Denker, 1993), though more focused on character recognition, also uses information from the tangent plane of the data manifold. In Simard et al., the tangent planes are used to build a nearest-neighbor classifier that is based on the distance between the tangent subspaces around two examples to be compared. The tangent vectors that span the tangent space are not learned but rather are obtained analytically a priori for transformations that locally do not change the class label, such as rotation, location shift, and thickness change.
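The first-order criterion above reduces to ordinary least squares when the step size t is known. The following sketch, under that simplifying assumption and with a synthetic generator and data of our own choosing (not Rao and Ruderman's setting), recovers G from pairs (x, x̃):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6
G_true = 0.1 * rng.normal(size=(m, m))   # hypothetical generator matrix
X = rng.normal(size=(m, 20))             # examples x as columns
t = 0.05                                 # small, known step size
X_tilde = X + t * (G_true @ X)           # neighbors x~ = (I + G t) x

# With t fixed, minimizing sum ||(I + G t) x - x~||^2 over G is linear
# least squares: solve (t G) X = X_tilde - X, column by column.
A, *_ = np.linalg.lstsq(X.T, (X_tilde - X).T, rcond=None)  # A = (t G)^T
G_est = A.T / t
print(np.max(np.abs(G_est - G_true)))  # essentially zero on noiseless data
```

Minimizing over t as well, as in the original criterion, would make the problem nonlinear; the point here is only that the tangent model F(x) = Gx is linear in its parameters.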
Hastie, Simard, and Sackinger (1995) and Hastie and Simard (1998) present a tangent subspace learning algorithm to learn character prototypes along with a tangent plane around each prototype, which reduces the time and memory requirements of the nearest-neighbor tangent distance classifier. Unlike in the case of Rao and Ruderman (1999), the manifold can be more than one-dimensional (they present results for 12 dimensions), but the manifold
Figure 3: The difference between local PCA and nonlocal tangent learning. (Top) What the one-dimensional tangent plane learned by local PCA (using two nearest neighbors) might look like for the three data points. (Bottom) The same but for nonlocal tangent learning, including a new prediction F(x) at a point x away from the data. We emphasize here that with nonlocal tangent learning, the predicted tangent plane should change smoothly between points, and new predictions can be made anywhere in the data space.
is locally linear around each prototype (hence, must be globally smooth if the number of prototypes is significantly fewer than the number of examples). This learning procedure exploits the a priori tangent vector basis for the training points, which is computed analytically as in tangent distance (Simard et al., 1993). Nonlocal tangent learning can be viewed as an extension of these ideas that avoids the need for explicit prior knowledge on
the invariances of objects in each class and also introduces the notion of nonlocal learning of the manifold structure. More generally, we can say that nonlocal tangent learning, local PCA, LLE, Isomap, and tangent subspace learning all try to learn a manifold structure (either the embedding or the tangent plane) that respects local metric structure. Since all of them implicitly or explicitly estimate the tangent plane, they all have the potential to learn invariants that could be useful for transformation-invariant classification. Local PCA and LLE are based on the Euclidean metric, Isomap on an approximate geodesic metric, and Hastie et al. (1995) on the tangent distance metric based on a priori knowledge about the domain. One important difference with the ideas presented here is that for all these algorithms, the predicted manifold structure at x is obtained essentially using only local information in the neighborhood of x. We believe that the main conceptual advantage of the approach proposed here over local manifold learning is that the parameters of the tangent plane predictor can be estimated using data from very different regions of space, thus in principle allowing it to be less sensitive to all four of the problems described in section 2.1, thanks to the sharing of information across these different regions.

5 Experimental Results

The objective of the experiments is to validate the proposed algorithm: Does it provide a good estimate of the true tangent planes? Does it generalize better than a local manifold learning algorithm, especially in regions “far” from the data?

5.1 Error Measurement. In addition to visualizing the results for the low-dimensional data, we measure performance by considering how well the algorithm learns the local tangent distance, as measured by the normalized projection error of nearest neighbors (see equation 3.2).
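As a sketch of how a purely local tangent estimate of the kind compared against below is obtained, local PCA can be written in a few lines (the function names, data, and parameter choices here are illustrative assumptions, not the experimental setup):

```python
import numpy as np

def local_pca_tangent(x, X, k, d):
    # Estimate a d-dimensional tangent basis at x from the d principal
    # components of the empirical covariance of its k nearest neighbors.
    dists = np.linalg.norm(X - x, axis=1)
    nbrs = X[np.argsort(dists)[:k]]
    centered = nbrs - nbrs.mean(0)
    # Principal directions = top right singular vectors of the centered data.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:d]

# Points near a line in R^3: the estimated tangent aligns with the line.
rng = np.random.default_rng(0)
t = rng.uniform(-1, 1, size=(30, 1))
direction = np.array([3.0, 0.0, 4.0]) / 5.0
X = t * direction + 1e-3 * rng.normal(size=(30, 3))
F = local_pca_tangent(X[0], X, k=10, d=1)
print(abs(F[0] @ direction))  # close to 1: estimated and true tangents align
```

Unlike the trained predictor F(x), this estimate uses only the neighborhood of x, which is precisely what limits it in the failure cases of section 2.1.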
We compare the errors of four algorithms, always on test data not used to estimate the tangent plane: (1) true analytic (using the true manifold's tangent plane at x, computed analytically); (2) tangent learning (using the neural network tangent plane predictor F(x), trained using the k ≥ d nearest neighbors in the training set of each training set example); (3) Isomap (using the tangent plane defined in equation 2.3); and (4) local PCA (using the d principal components of the empirical covariance of the k nearest neighbors of x in the training set).

5.2 Tasks

5.2.1 Multiple Sinusoidal Manifolds. We first consider a low-dimensional but multimanifold problem. The data {x_i} are in two dimensions and come from a set of 40 1D manifolds. Each manifold is composed of four nearby points
Figure 4: Multiple Sinusoidal Manifolds. Two-dimensional data with 1D sinusoidal manifolds; the method indeed captures the tangent planes. The small segments are the estimated tangent planes. Small dots are the training examples.
obtained randomly from a sine curve; that is, for i ∈ {1, ..., 4}, x_i = (a + t_i, sin(a + t_i) + b), where a, b, and t_i are randomly chosen. Four neighbors were used for training both the nonlocal tangent learning algorithm and the benchmark local nonparametric estimator (local PCA of the four neighbors). Figure 4 shows the training set and the tangent planes recovered with nonlocal tangent learning, both at training examples and generalizing away from the data. The neural network has 10 (chosen arbitrarily) hidden units. This problem is particularly difficult for local manifold learning: the out-of-sample relative projection errors are, respectively, 0.09 for the true analytic plane, 0.25 for nonlocal tangent learning, and 0.81 for local PCA.

5.2.2 Gaussian Curves in a High-Dimensional Space. This is a higher-dimensional manifold learning problem, with 41 dimensions. The data are generated by sampling gaussian curves. Each curve is of the form $x(i) = e^{t_1 - (-2 + i/10)^2/t_2}$ with i ∈ {0, 1, ..., 40}. Note that the tangent vectors
Figure 5: Gaussian Curves in a High-Dimensional Space. Relative projection error for kth nearest neighbor with regard to k, for compared methods (from lowest to highest at k = 1: analytic, tangent learning, local PCA, Isomap). Note the U shape due to opposing effects of curvature and noise.
are not linear in x. The manifold coordinates are t1 and t2 , sampled uniformly, respectively, from (−1, 1) and (0.1, 3.1). Normal noise (standard deviation = 0.001) is added to each point. One hundred example curves were generated for training and 200 for testing. The neural network has 100 hidden units. Figure 5 shows the relative projection error as a function of the number of nearest neighbors, for the four methods on this task. First, the error decreases because of the effect of noise (nearby noisy neighbors may form a high angle with the tangent plane). Then it increases because of the manifold curvature (farther-away neighbors form a larger angle). This effect is illustrated schematically in Figure 6 and gives rise to the U-shaped projection error curve in Figure 5. 5.2.3 Rotation Manifold. This is a high-dimensional multimanifold task, involving digit images to which we have applied slight rotations in such a way as to have the knowledge of the analytic formulation of the manifolds. There is one rotation manifold for each instance of digit from the database, but only two examples for each manifold: one real image from the MNIST data set and one slightly rotated image. There are 1000 × 2 examples used
Figure 6: Schematic explanation of the U-shaped curve in projection error. With noise around the manifold, nearest examples tend to have a large angle, but because of curvature, the error also increases with distance to the reference point.
for training and 1000 × 2 for testing. In this context we use k = 1 nearest neighbor only, and the manifold dimension is 1. The average relative projection error for the nearest neighbor is 0.27 for the analytic tangent plane (obtained using the same technique as in Simard et al., 1993), 0.43 with tangent learning (100 hidden units), and 1.5 with local PCA. Note that the neural network would probably overfit if trained too much (here, only 100 epochs). An even more interesting experiment consists of applying the above-trained predictor on novel images that come from a very different distribution but that share the same manifold structure. It was applied to images of other characters that are not digits. We have used the predicted tangent planes to follow the manifold by small steps (this is very easy to do in the case of a 1D manifold). More formally, this corresponds to the following pseudocode:

WalkOnManifold(x, F(·), i, nsteps, stepsize)
  for s from 1 to nsteps
    1. ds = SVD(F(x), i)
    2. x = x + ds · stepsize

Here, x is the initial image, F(·) is the tangent predictor, nsteps is the number of steps, stepsize controls how far in the direction ds each step is made, and SVD(F(x), i) is a function that returns the ith orthogonal basis vector of the subspace spanned by the rows of F(x), obtained from its SVD decomposition. Note that the sign of stepsize also determines the orientation of the walk. Also,
Figure 7: Rotational Manifold. (Left) Original image. (Middle) Applying a small amount of the predicted rotation. (Right) Applying a larger amount of the predicted rotation. (Top) Using the estimated tangent plane predictor. (Bottom) Using local PCA, which is clearly much worse (the letter is not rotated).
since in this task the dimension of the manifold is only 1, we have i = 1 and the SVD is not necessary. We have considered the more general case only because it will be used in the next task. Figure 7 shows the effect of applying WalkOnManifold on a letter M image, for a few steps and for a larger number of steps, for both the neural network predictor and the local PCA predictor. This example illustrates the crucial point that nonlocal tangent plane learning is able to generalize to truly novel cases, where local manifold learning fails. The results shown in Figure 7 provide evidence of the impressive extrapolation capacity of nonlocal tangent learning: since the letter M is quite different from any digit in the training set, the neural network is not just locally smoothing the tangent plane estimation; it truly generalizes the notion of rotation (here) to new objects. Since this experiment was set up so that the only class-invariant transformation that could be learned would be the rotation transformation, one might wonder in what ways this task differs from supervised learning, that is, predicting the effect of a slight rotation on an image. First, one should note that we are predicting an undirected vector (i.e., rotations one way or the other are both acceptable), and second, the procedure can be readily generalized to predicting a whole tangent plane, without prior knowledge about invariants of the inputs, as shown with the next set of experiments, in which only natural data are used to infer the shape of the manifold.
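The WalkOnManifold pseudocode maps directly onto NumPy. In this sketch, a toy circle-manifold predictor stands in for the learned F; it and all other names are illustrative assumptions:

```python
import numpy as np

def walk_on_manifold(x, F, i, nsteps, stepsize):
    # Follow the manifold by stepping repeatedly along the ith orthonormal
    # basis vector of the predicted tangent plane at the current point.
    x = x.copy()
    for _ in range(nsteps):
        # Orthonormal basis of the row space of F(x) via SVD; the SVD fixes
        # ds only up to sign, so the orientation of the walk depends on the
        # signs of ds and stepsize, as noted in the text.
        _, _, Vt = np.linalg.svd(F(x), full_matrices=False)
        x = x + stepsize * Vt[i]
    return x

# Toy 1D manifold: the unit circle in R^2, with tangent (-x2, x1) at x.
F_circle = lambda x: np.array([[-x[1], x[0]]])
x0 = np.array([1.0, 0.0])
x_end = walk_on_manifold(x0, F_circle, i=0, nsteps=200, stepsize=0.01)
print(np.linalg.norm(x_end))  # stays close to 1: the walk follows the circle
```

With small steps the walk drifts off the manifold only at second order in the step size, which is why the procedure is usable for visualizing the learned transformations.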
Figure 8: Digit Images Manifold. Examples of the “walk on the manifold” for USPS digit samples (middle row). There is one model per digit class (column). Moving up or down the column corresponds to moving along one of the learned directions. Only the middle row corresponds to an actual example image; the other rows are obtained by walking one way or the other along the manifold.
5.2.4 Digit Images Manifold. Finally, we performed the following experiment in order to observe the invariances that can be learned with nonlocal tangent learning on a typical character recognition data set. These invariances can be compared with those reported for other methods, such as in Hastie et al. (1995). We used the first 6291 examples from the U.S. Postal Service (USPS) training set to train a separate neural network for each digit class, using nonlocal tangent learning. For this experiment, the hyperparameters (number of neighbors, number of hidden units, number of training epochs through early stopping) were tuned using a validation set (the 1000 USPS examples following the training set), with the normalized projection error as the criterion. The manifold dimension was chosen to be 7, following inspiration from the tangent distance work (Simard et al., 1993). For each class, we chose one digit example and performed a walk on the manifold, as in the WalkOnManifold pseudocode of the rotation manifold task, in order to visualize the learned manifold around the example image. The results are plotted in Figure 8. The values of nsteps and i were tuned manually to make the effect of the learned transformations visually clear. Note that those transformations are not linear, since the directions ds are likely to differ from one step to another, and visual inspection also suggests so (e.g., changing the shape of the loop in a 2). The overall picture is rather good, and some of the digit transformations are quite impressive, showing that the model learned typical transformations. For instance, we can see that nonlocal tangent learning was able to rotate the digit 8 so that it would stand straight. In our opinion, those
transformations compare well to those reported in Hastie et al. (1995), although no prior knowledge about images was used here in order to obtain these transformations. In Bengio and Larochelle (2006), we describe an extension of nonlocal tangent learning: nonlocal manifold Parzen, which uses nonlocal learning to train a manifold Parzen (Vincent & Bengio, 2003) density estimator. The basic idea is to estimate not only the tangent plane but also the variances in each of the local principal directions, as functions of x. Having both principal directions and variances, one can write down a locally gaussian density and estimate the global density as a mixture of these gaussian components (one on each training example). From the density, one can readily obtain a classifier using one density estimator per class. Improvements with respect to local learning algorithms on the out-of-sample likelihood and classification error are reported for toy and real problems, such as the USPS digit recognition task. Note that this extension provides other ways to do model selection (e.g., by cross-validation on the out-of-sample likelihood or classification error).

6 Conclusion

The central claim of this letter is that there are fundamental problems with local nonparametric approaches to manifold learning, essentially due to the curse of dimensionality (at the dimensionality of the manifold), but worsened by manifold curvature, noise, and the presence of several disjoint manifolds. To address these problems, we propose that learning algorithms should be designed in such a way that they can share information, coming from different regions of space, about the structure of the manifold. In this spirit, we have proposed a simple learning algorithm based on predicting the tangent plane at x with a function F(x) whose parameters are estimated using the whole data set.
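The locally gaussian "pancake" density mentioned above, with covariance of the form F(x)^⊤F(x) + σ²I built from a predicted tangent basis, can be sketched generically (this is a plain multivariate gaussian evaluation of our own construction, not the actual nonlocal manifold Parzen estimator):

```python
import numpy as np

def pancake_logpdf(x, mu, F, sigma2):
    # Log-density of a gaussian centered at mu with covariance
    # C = F^T F + sigma2 * I: large variance along the predicted tangent
    # directions (rows of F), small variance sigma2 off the manifold.
    m = len(x)
    C = F.T @ F + sigma2 * np.eye(m)
    d = x - mu
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (m * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(C, d))

# A flat pancake in R^3 with its tangent direction along the first axis.
F = np.array([[2.0, 0.0, 0.0]])
mu = np.zeros(3)
on_manifold = np.array([1.0, 0.0, 0.0])
off_manifold = np.array([0.0, 1.0, 0.0])
# Moving along the tangent costs far less log-density than moving off it.
print(pancake_logpdf(on_manifold, mu, F, 0.01) > pancake_logpdf(off_manifold, mu, F, 0.01))
```

Summing such components, one per training example, gives the mixture density described in the text.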
Note that the same fundamental problems are present with nonparametric approaches to semisupervised learning (e.g., Szummer & Jaakkola, 2002; Chapelle, Weston, & Schölkopf, 2003; Belkin & Niyogi, 2003; Zhu, Ghahramani, & Lafferty, 2003), which rely on an accurate estimation of the manifold in order to propagate label information. Future work should investigate how to better handle the curvature problem: imagine that most nearest-neighbor pairs are too far apart for the locally linear approximation of the manifold to be approximately valid between them. One way to deal with this would be to follow the manifold using the local tangent estimates, and to search or sample along manifold-following paths between pairs of neighboring examples. The algorithm was already extended to obtain a mixture of factor analyzers in Bengio and Larochelle (2006) (with the factors or principal eigenvectors of the gaussian centered at x obtained from F(x)). This view provides an alternative criterion to optimize F(x) (the local log likelihood of such a gaussian), which suggests a way to estimate the missing information
(the variances along the eigenvector directions). On the other hand, since we can estimate F(x) everywhere, a more ambitious view would consider the density as a "continuous" mixture of gaussians (with an infinitesimal component located everywhere in space). According to that view, the model implicitly defines a distribution P by specifying how to stochastically go from a sample x_t from P to another nearby sample x_{t+1}, e.g., according to a gaussian centered on the manifold surface near x_t and whose principal components span the tangent plane.

Acknowledgments

We thank the following funding organizations for support: NSERC, MITACS, IRIS, and the Canada Research Chairs. We also thank Olivier Delalleau for his help.

References

Belkin, M., & Niyogi, P. (2003). Using manifold structure for partially labeled classification. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Bengio, Y., Delalleau, O., Le Roux, N., Paiement, J.-F., Vincent, P., & Ouimet, M. (2004). Learning eigenfunctions links spectral embedding and kernel PCA. Neural Computation, 16(10), 2197–2219.
Bengio, Y., & Larochelle, H. (2006). Non-local manifold Parzen windows. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18. Cambridge, MA: MIT Press.
Brand, M. (2003). Charting a manifold. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Chapelle, O., Weston, J., & Schölkopf, B. (2003). Cluster kernels for semi-supervised learning. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Ghahramani, Z., & Hinton, G. (1996). The EM algorithm for mixtures of factor analyzers (Tech. Rep. CRG-TR-96-1). Toronto: Department of Computer Science, University of Toronto.
Hastie, T., & Simard, P. (1998).
Y. Bengio, M. Monperrus, and H. Larochelle
Received March 30, 2005; accepted March 27, 2006.
LETTER
Communicated by Barbara Hammer
Dynamics and Topographic Organization of Recursive Self-Organizing Maps

Peter Tiňo
[email protected]
School of Computer Science, University of Birmingham, Birmingham B15 2TT, U.K.
Igor Farkaš
[email protected]
Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, and Institute of Measurement Science, Slovak Academy of Sciences, Bratislava, Slovak Republic
Jort van Mourik [email protected] Neural Computing Research Group, Aston University, Aston Triangle, Birmingham B4 7ET, U.K.
Recently there has been an outburst of interest in extending topographic maps of vectorial data to more general data structures, such as sequences or trees. However, there is no general consensus as to how best to process sequences using topographic maps, and this topic remains an active focus of neurocomputational research. The representational capabilities and internal representations of the models are not well understood. Here, we rigorously analyze a generalization of the self-organizing map (SOM) for processing sequential data, recursive SOM (RecSOM) (Voegtlin, 2002), as a nonautonomous dynamical system consisting of a set of fixed input maps. We argue that contractive fixed-input maps are likely to produce Markovian organizations of receptive fields on the RecSOM map. We derive bounds on parameter β (weighting the importance of importing past information when processing sequences) under which contractiveness of the fixed-input maps is guaranteed. Some generalizations of SOM contain a dynamic module responsible for processing temporal contexts as an integral part of the model. We show that Markovian topographic maps of sequential data can be produced using a simple fixed (nonadaptable) dynamic module externally feeding a standard topographic model designed to process static vectorial data of fixed dimensionality (e.g., SOM). However, by allowing trainable feedback connections, one can obtain Markovian maps with superior memory depth and topography preservation. We elaborate on the importance of non-Markovian organizations in topographic maps of sequential data. Neural Computation 18, 2529–2567 (2006)
© 2006 Massachusetts Institute of Technology
1 Introduction

In its original form, the self-organizing map (SOM) (Kohonen, 1982) is a nonlinear projection method that maps a high-dimensional metric vector space onto a two-dimensional regular grid in a topologically ordered fashion (Kohonen, 1990). Each grid point has an associated codebook vector representing a local subset (Voronoi compartment) of the data space. Neighboring grid points represent neighboring regions of the data space. Given a collection of possibly high-dimensional data points, by associating each point with its codebook representative (and so, in effect, with its corresponding grid point), a two-dimensional topographic map of the data collection is obtained. Locations of the codebook vectors in the data space are adapted to the layout of data points in an unsupervised learning process. Both competitive learning¹ and cooperative learning² are employed. Many modifications of the standard SOM have been proposed in the literature (e.g., Yin, 2002; Lee & Verleysen, 2002). Formation of topographic maps by self-organization constitutes an important paradigm in machine learning with many successful applications, for example, in data and web mining.
Most approaches to topographic map formation operate on the assumption that the data points are members of a finite-dimensional vector space of a fixed dimension. Recently there has been an outburst of interest in extending topographic maps to more general data structures, such as sequences or trees. Several modifications of SOM to sequences or tree structures have been proposed in the literature; Barreto, Araújo, and Kremer (2003) and Hammer, Micheli, Sperduti, and Strickert (2004) review most of the approaches. Modified versions of SOM that have enjoyed a great deal of interest equip SOM with additional feedback connections that allow for natural processing of recursive data types. No prior notion of metric on the structured data space is imposed; instead, the similarity measure on structures evolves through parameter modification of the feedback mechanism and recursive comparison of constituent parts of the structured data. Typical examples of such models are the temporal Kohonen map (Chappell & Taylor, 1993), recurrent SOM (Koskela, Varsta, Heikkonen, & Kaski, 1998), feedback SOM (Horio & Yamakawa, 2001), recursive SOM (Voegtlin, 2002), merge SOM (Strickert & Hammer, 2003), and SOM for structured data (Hagenbuchner, Sperduti, & Tsoi, 2003). Other alternatives for constructing topographic maps of structured data have been suggested by James and Miikkulainen (1995),

¹ For each data point, there is a competition among the codebook vectors for the right to represent it.
² Not only the codebook vector that has won the competition to represent a data point is allowed to adapt itself to that point, but so are, albeit to a lesser degree, codebook vectors associated with grid locations topologically close to the winner.
Principe, Euliano, and Garani (2002), Wiemer (2003), and Schulz and Reggia (2004).
There is no general consensus as to how best to process sequences with SOMs, and this topic remains an active focus of current neurocomputational research (Barreto et al., 2003; Schulz & Reggia, 2004; Hammer, Micheli, Sperduti et al., 2004). As Hammer, Micheli, Sperduti et al. (2004) pointed out, the representational capabilities of the models are hardly understood. The internal representation of structures within the models is unclear, and it is debatable which model of recursive unsupervised maps can represent the temporal context of time series in the best way. The first major steps toward a much-needed mathematical characterization and analysis of such models were taken in Hammer, Micheli, Sperduti et al. (2004) and Hammer, Micheli, Strickert, and Sperduti (2004). The authors present the recursive models of unsupervised maps in a unifying framework and study such models from the point of view of internal representations, noise tolerance, and topology preservation.
In this letter, we continue the task of mathematical characterization and theoretical analysis of the hidden built-in architectural biases for topographic organizations of structured data in recursive unsupervised maps. Our starting position is to view such models as nonautonomous dynamical systems with internal dynamics driven by a stream of external inputs. In line with our recent research, we study the organization of the nonautonomous dynamics on the basis of the dynamics of the individual fixed-input maps (Tiňo, Čerňanský, & Beňušková, 2004). Recently we have shown how the contractive behavior of the individual fixed-input maps translates to nonautonomous dynamics that organizes the state space in a Markovian fashion: sequences with similar most recent entries tend to have close state-space representations.
Longer shared histories of the recently observed items result in closer state-space representations (Tiňo et al., 2004; Hammer & Tiňo, 2003; Tiňo & Hammer, 2003).
We concentrate on the recursive SOM (RecSOM) (Voegtlin, 2002) because it transcends the simple local recurrence of leaky integrators of earlier models, and it has been demonstrated that it can represent much richer dynamical behavior (Hammer, Micheli, Sperduti et al., 2004). By studying RecSOM as a nonautonomous dynamical system, we attempt to answer the following questions: Is the architecture of RecSOM naturally biased toward Markovian representations of input streams? If so, under what conditions may Markovian representations occur? How natural are such conditions; that is, can Markovian organizations of the topographic maps be expected under widely used architectures and (hyper)parameter settings in RecSOM? What can be gained by having a trainable recurrent part in RecSOM; that is, how does RecSOM compare with a much simpler setting of SOM operating on a simple nontrainable iterative function system with Markovian state-space organization (Tiňo & Dorffner, 2001)?
Figure 1: Recursive SOM architecture. The original SOM algorithm is used for both input vector s(t) and for the context represented as the map activation y(t − 1) from the previous time step. Solid lines represent trainable connections, and the dashed line represents a one-to-one copy of the activity vector y. The network learns to associate the current input with previous activity states. This way, each neuron responds to a sequence of inputs.
This article is organized as follows. We introduce the RecSOM model in section 2 and analyze it rigorously as a nonautonomous dynamical system in section 3. The experiments in section 4 are followed by a discussion in section 5. Section 6 concludes by summarizing the key messages of this study.

2 Recursive Self-Organizing Map

The architecture of the RecSOM model (Voegtlin, 2002) is shown in Figure 1. Each neuron i ∈ {1, 2, . . . , N} in the map has two weight vectors associated with it:
- w_i ∈ Rⁿ — linked with an n-dimensional input s(t) feeding the network at time t
- c_i ∈ R^N — linked with the context y(t − 1) = (y₁(t − 1), y₂(t − 1), . . . , y_N(t − 1)) containing map activations y_i(t − 1) from the previous time step
The output of a unit i at time t is computed as

  y_i(t) = exp(−d_i(t)),   (2.1)

where³

  d_i(t) = α · ‖s(t) − w_i‖² + β · ‖y(t − 1) − c_i‖².   (2.2)
In equation 2.2, α > 0 and β > 0 are model parameters that respectively influence the effect of the input and the context on the neuron’s profile. Both weight vectors can be updated using the same form of learning rule (Voegtlin, 2002):

  Δw_i = γ · h_ik · (s(t) − w_i),   (2.3)
  Δc_i = γ · h_ik · (y(t − 1) − c_i),   (2.4)

where k is the index of the best matching unit at time t, k = argmin_{i∈{1,2,...,N}} d_i(t), and 0 < γ < 1 is the learning rate. Note that the best matching (“winner”) unit can be equivalently defined as the unit k of the highest activation y_k(t):

  k = argmax_{i∈{1,2,...,N}} y_i(t).   (2.5)
Neighborhood function h_ik is a gaussian (of width σ) on the distance d(i, k) of units i and k in the map:

  h_ik = e^{−d(i,k)²/σ²}.   (2.6)
The “neighborhood width,” σ, linearly decreases in time to allow for forming topographic representation of input sequences.

3 Contractive Fixed-Input Dynamics in RecSOM

In this section we answer the following principal question: Given a fixed RecSOM input s, under what conditions will the mapping y(t) → y(t + 1) become a contraction, so that the autonomous RecSOM dynamics is dominated by a unique attractive fixed point? As we shall see, contractive fixed-input dynamics of RecSOM can lead to maps with Markovian representations of temporal contexts.
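To make the question concrete, the following sketch (our illustration, not the authors' code; the weight matrices and parameter values are random stand-ins) iterates the RecSOM activation of equations 2.1 and 2.2 with the input held fixed. With a sufficiently small β, the successive activation profiles visibly collapse onto a single fixed profile:

```python
import numpy as np

# Hypothetical illustration: iterate y(t+1) = F_s(y(t)) for a fixed input s
# (equations 2.1-2.2 with s(t) = s held constant) and watch it converge.
rng = np.random.default_rng(0)
N, n = 100, 1                  # map size and input dimension (arbitrary choices)
alpha, beta = 2.0, 0.01        # small beta -> contractive fixed-input dynamics
W = rng.random((N, n))         # input weights w_i (random stand-ins)
C = rng.random((N, N))         # context weights c_i (random stand-ins)
s = np.array([0.5])            # the fixed input

def F_s(y):
    # d_i = alpha*||s - w_i||^2 + beta*||y - c_i||^2, then y_i = exp(-d_i)
    d = alpha * np.sum((s - W) ** 2, axis=1) + beta * np.sum((y - C) ** 2, axis=1)
    return np.exp(-d)

y = rng.random(N)              # random initial activation profile
diffs = []
for _ in range(200):
    y_new = F_s(y)
    diffs.append(np.linalg.norm(y_new - y))
    y = y_new
```

With these (assumed) numbers the differences ‖y(t + 1) − y(t)‖ shrink geometrically, consistent with convergence to a unique attractive fixed point; increasing β weakens and eventually destroys this behavior.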
³ ‖·‖ denotes the Euclidean norm.
Under a fixed input vector s ∈ Rⁿ, the time evolution, equation 2.2, becomes

  d_i(t + 1) = α · ‖s − w_i‖² + β · ‖(e^{−d₁(t)}, e^{−d₂(t)}, . . . , e^{−d_N(t)}) − c_i‖².   (3.1)

After applying the one-to-one coordinate transformation y_i = e^{−d_i}, equation 3.1 reads

  y_i(t + 1) = e^{−α‖s−w_i‖²} · e^{−β‖y(t)−c_i‖²},   (3.2)

where y(t) = (y₁(t), y₂(t), . . . , y_N(t)) = (e^{−d₁(t)}, e^{−d₂(t)}, . . . , e^{−d_N(t)}). We denote the gaussian kernel of inverse variance η > 0, acting on R^N, by G_η(·, ·), that is, for any u, v ∈ R^N,

  G_η(u, v) = e^{−η‖u−v‖²}.   (3.3)

The system of equations 3.2 can be written in vector form as

  y(t + 1) = F_s(y(t)) = (F_{s,1}(y(t)), . . . , F_{s,N}(y(t))),   (3.4)

where

  F_{s,i}(y) = G_α(s, w_i) · G_β(y, c_i),  i = 1, 2, . . . , N.   (3.5)
Recall that given a fixed input s, we aim to study the conditions under which the map F_s becomes a contraction. Then by the Banach fixed-point theorem, the autonomous RecSOM dynamics y(t + 1) = F_s(y(t)) will be dominated by a unique attractive fixed point y_s = F_s(y_s).
A mapping F : R^N → R^N is said to be a contraction with contraction coefficient ρ ∈ [0, 1) if for any y, y′ ∈ R^N,

  ‖F(y) − F(y′)‖ ≤ ρ · ‖y − y′‖.   (3.6)

F is a contraction if there exists ρ ∈ [0, 1) such that F is a contraction with contraction coefficient ρ.

Lemma 1. Consider three N-dimensional points y, y′, c ∈ R^N, y ≠ y′. Denote by Ω(c, y, y′) the (N − 1)-dimensional hyperplane orthogonal to (y − y′) and
Figure 2: Illustration for the proof of lemma 1. The line ω(y, y′) passes through y, y′ ∈ R^N. The (N − 1)-dimensional hyperplane Ω(c, y, y′) is orthogonal to ω(y, y′) and contains the point c ∈ R^N. c̃ is the orthogonal projection of c onto ω(y, y′), that is, Ω(c, y, y′) ∩ ω(y, y′) = {c̃}.
containing c. Let c̃ be the intersection of Ω(c, y, y′) with the line ω(y, y′) passing through y, y′ (see Figure 2). Then, for any β > 0,

  max_{u∈Ω(c,y,y′)} |G_β(y, u) − G_β(y′, u)| = |G_β(y, c̃) − G_β(y′, c̃)|.

Proof. For any u ∈ Ω(c, y, y′),

  ‖y − u‖² = ‖y − c̃‖² + ‖u − c̃‖²  and  ‖y′ − u‖² = ‖y′ − c̃‖² + ‖u − c̃‖².
So,

  |G_β(y, u) − G_β(y′, u)| = |exp{−β‖y − u‖²} − exp{−β‖y′ − u‖²}|
    = exp{−β‖u − c̃‖²} · |exp{−β‖y − c̃‖²} − exp{−β‖y′ − c̃‖²}|
    ≤ |exp{−β‖y − c̃‖²} − exp{−β‖y′ − c̃‖²}|,

with equality if and only if u = c̃.

Lemma 2. Consider any y, y′ ∈ R^N, y ≠ y′, and the line ω(y, y′) passing through y, y′. Let ω̄(y, y′) be the line ω(y, y′) without the segment connecting y and y′, that is, ω̄(y, y′) = {y + κ · (y − y′) | κ ∈ (−∞, −1] ∪ [0, ∞)}. Then for any β > 0,

  argmax_{c∈ω(y,y′)} |G_β(y, c) − G_β(y′, c)| ∈ ω̄(y, y′).

Proof. For 0 < κ ≤ 1/2, consider the two points

  c(−κ) = y − κ · (y − y′)  and  c(κ) = y + κ · (y − y′).

Let δ = ‖y − y′‖. Then

  G_β(y, c(−κ)) = G_β(y, c(κ)) = e^{−βδ²κ²}

and

  G_β(y′, c(κ)) = e^{−βδ²(1+κ)²} < e^{−βδ²(1−κ)²} = G_β(y′, c(−κ)).

Hence,

  G_β(y, c(κ)) − G_β(y′, c(κ)) > G_β(y, c(−κ)) − G_β(y′, c(−κ)).

A symmetric argument can be made for the case c(−κ) = y′ − κ · (y′ − y), c(κ) = y′ + κ · (y′ − y), 0 < κ ≤ 1/2.
It follows that for every⁴ c⁻ ∈ ω(y, y′) \ ω̄(y, y′) in between the points y and y′, there exists a c⁺ ∈ ω̄(y, y′) such that

  |G_β(y, c⁺) − G_β(y′, c⁺)| > |G_β(y, c⁻) − G_β(y′, c⁻)|.

For β > 0, define a function H_β : R⁺ × R⁺ → R,

  H_β(κ, δ) = e^{βδ²(2κ+1)} − 1/κ − 1.   (3.7)

Theorem 1. Consider y, y′ ∈ R^N, ‖y − y′‖ = δ > 0. Then for any β > 0,

  argmax_{c∈R^N} |G_β(y, c) − G_β(y′, c)| ∈ {c_{β,1}(δ), c_{β,2}(δ)},

where

  c_{β,1}(δ) = y + κ_β(δ) · (y − y′),  c_{β,2}(δ) = y′ + κ_β(δ) · (y′ − y),

and κ_β(δ) > 0 is implicitly defined by

  H_β(κ_β(δ), δ) = 0.   (3.8)
Proof. By lemma 1, when maximizing |G_β(y, c) − G_β(y′, c)|, we should locate c on the line ω(y, y′) passing through y and y′. By lemma 2, we should concentrate only on ω̄(y, y′), that is, on points outside the line segment connecting y and y′. Consider points on the line segment {c(κ) = y + κ · (y − y′) | κ ≥ 0}. Parameter κ > 0, such that c(κ) maximizes |G_β(y, c) − G_β(y′, c)|, can be found by maximizing

  g_{β,δ}(κ) = e^{−βδ²κ²} − e^{−βδ²(κ+1)²}.   (3.9)

Setting the derivative of g_{β,δ}(κ) (with respect to κ) to zero results in

  e^{−βδ²(κ+1)²} (κ + 1) − e^{−βδ²κ²} κ = 0,   (3.10)

⁴ A\B is the set of elements in A not contained in B.
which is equivalent to

  e^{−βδ²(2κ+1)} = κ/(κ + 1).   (3.11)
Note that κ in equation 3.11 cannot be zero, as for finite positive β and δ, e^{−βδ²(2κ+1)} > 0. Hence, it is sufficient to concentrate only on the line segment {c(κ) = y + κ · (y − y′) | κ > 0}. It is easy to see that κ_β(δ) > 0 satisfying equation 3.11 also satisfies H_β(κ_β(δ), δ) = 0.
Moreover, for a given β > 0, δ > 0, there is a unique κ_β(δ) > 0 given by H_β(κ_β(δ), δ) = 0. In other words, the function κ_β(δ) is one-to-one. To see this, note that e^{βδ²(2κ+1)} is an increasing function of κ > 0 with range (e^{βδ²}, ∞), while 1 + 1/κ is a decreasing function of κ > 0 with range (∞, 1).
The second derivative of g_{β,δ}(κ) is (up to a positive scaling constant 2βδ²) equal to

  e^{−βδ²(κ+1)²} (1 − 2βδ²(κ + 1)²) − e^{−βδ²κ²} (1 − 2βδ²κ²),   (3.12)

which can be rearranged as

  (e^{−βδ²(κ+1)²} − e^{−βδ²κ²}) − 2βδ² (e^{−βδ²(κ+1)²} (κ + 1)² − e^{−βδ²κ²} κ²).   (3.13)

The first term in equation 3.13 is negative, as for κ > 0, e^{−βδ²(κ+1)²} < e^{−βδ²κ²}. We will show that the second term, evaluated at κ_β(δ) = K, is also negative. To that end, note that by equation 3.10,

  e^{−βδ²(K+1)²} (K + 1)² − e^{−βδ²K²} K(K + 1) = 0.

But because e^{−βδ²K²} K > 0, we have

  e^{−βδ²(K+1)²} (K + 1)² − e^{−βδ²K²} K² > 0,

and so

  −2βδ² (e^{−βδ²(K+1)²} (K + 1)² − e^{−βδ²K²} K²)

is negative. Because the second derivative of g_{β,δ}(κ) at the extremum point κ_β(δ) is negative, the unique solution κ_β(δ) of H_β(κ_β(δ), δ) = 0 yields the point c_{β,1}(δ) = y + κ_β(δ) · (y − y′) that maximizes |G_β(y, c) − G_β(y′, c)|.
Arguments concerning the point c_{β,2}(δ) = y′ + κ_β(δ) · (y′ − y) can be made along the same lines by considering points on the line segment {c(κ) = y′ + κ · (y′ − y) | κ > 0}.

Lemma 3. For all k > 0,

  (1 + 2k)/(2(1 + k)²) < log((1 + k)/k) < (1 + 2k)/(2k²).   (3.14)
Proof. Consider the functions

  Φ₁(k) = (1 + 2k)/(2k²) − log((1 + k)/k),
  Φ₂(k) = log((1 + k)/k) − (1 + 2k)/(2(1 + k)²).

We find that, asymptotically,

  Φ₁(k) ≈ 1/(2k²) + log(k) > 0 as k → 0,   Φ₁(k) ≈ 1/k² > 0 as k → ∞,
  Φ₂(k) ≈ −log(k) > 0 as k → 0,   Φ₂(k) ≈ 1/k² > 0 as k → ∞.

Since

  Φ₁′(k) = −(1 + 2k)/(k³(1 + k)) < 0,   Φ₂′(k) = −(1 + 2k)/(k(1 + k)³) < 0,

both functions Φ₁(k) and Φ₂(k) are monotonically decreasing positive functions of k > 0.

Lemma 4. For β > 0, consider a function D_β : R⁺ → (0, 1),

  D_β(δ) = g_{β,δ}(κ_β(δ)),   (3.15)
where g_{β,δ}(κ) is defined in equation 3.9 and κ_β(δ) is implicitly defined by equations 3.7 and 3.8. Then D_β has the following properties:

1. D_β > 0.
2. lim_{δ→0⁺} D_β(δ) = 0.
3. D_β is a continuous, monotonically increasing, concave function of δ.
4. lim_{δ→0⁺} dD_β(δ)/dδ = √(2β/e).
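Before the proof, the claimed properties are easy to check numerically. The sketch below (our illustration, not part of the paper) parametrizes the curve (δ, D_β(δ)) by κ, using the explicit inverse δ(κ) of equation 3.16 and evaluating D_β through g_{β,δ} of equation 3.9, and confirms positivity, monotonicity, and the linear upper bound D_β(δ) ≤ δ√(2β/e) implied by properties 2 to 4 (stated later as equation 3.31):

```python
import numpy as np

beta = 0.5
# Parametrize by kappa = kappa_beta(delta) > 0 and invert via equation 3.16.
kappa = np.logspace(-2, 2, 500)
delta = np.sqrt(np.log((1 + kappa) / kappa) / (beta * (1 + 2 * kappa)))
# D_beta(delta) = g_{beta,delta}(kappa_beta(delta)), equations 3.9 and 3.15.
D = np.exp(-beta * delta**2 * kappa**2) - np.exp(-beta * delta**2 * (kappa + 1)**2)

order = np.argsort(delta)                                   # sort by delta
positive = bool(np.all(D > 0))                              # property 1
increasing = bool(np.all(np.diff(D[order]) > 0))            # part of property 3
below_bound = bool(np.all(D <= delta * np.sqrt(2 * beta / np.e)))  # linear bound
```

All three flags come out true for this β; sweeping other β values (e.g., β = 2) behaves the same way under these assumptions.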
Proof. To simplify the presentation, we do not write the subscript β when referring to quantities such as D_β and κ_β(δ):

1. Since κ(δ) > 0 for any δ > 0,

  D(δ) = e^{−βδ²κ(δ)²} − e^{−βδ²(κ(δ)+1)²} > 0.
2. Although the function κ(δ) is known only implicitly through equations 3.7 and 3.8, the inverse function, δ(κ), can be obtained explicitly from these equations as

  δ(κ) = √( log((1 + κ)/κ) / (β(1 + 2κ)) ).   (3.16)

Now δ(κ) is a monotonically decreasing function. This is easily verified, as the derivative of δ(κ),

  δ′(κ) = − (1 + 2κ + 2κ(1 + κ) log((1 + κ)/κ)) / ( 2κ(1 + κ)(1 + 2κ)² √( β log((1 + κ)/κ) / (1 + 2κ) ) ),   (3.17)

is negative for all κ > 0. Both κ(δ) and δ(κ) are one-to-one (see also the proof of theorem 1). Moreover, δ(κ) → 0 as κ → ∞, meaning that κ(δ) → ∞ as δ → 0⁺. Hence, lim_{δ→0⁺} D(δ) = 0.

3. Since δ(κ) is a continuous function of κ, κ(δ) is continuous in δ. Because e^{−βδ²κ(δ)²} − e^{−βδ²(κ(δ)+1)²} is continuous in κ(δ), D(δ) is a continuous function of δ.
Because of the relationship between δ and κ(δ), we can write the derivatives dD(δ)/dδ and d²D(δ)/dδ² explicitly, changing the independent variable from δ to κ. Instead of D(δ), we will work with the corresponding function of κ, D̃(κ), such that

  D(δ) = D̃(κ(δ)).   (3.18)
Given a κ > 0 (uniquely determining δ(κ)), we have (after some manipulations)

  D(δ(κ)) = D̃(κ) = (1/(1 + κ)) · (κ/(1 + κ))^{κ²/(1+2κ)}.   (3.19)
Since δ(κ) and κ(δ) are inverse functions of each other, their first- and second-order derivatives are related through

  κ′(δ) = 1/δ′(k),   (3.20)
  κ″(δ) = −δ″(k)/(δ′(k))³,   (3.21)
where k = κ(δ). Furthermore, we have that

  D′ = dD(δ)/dδ = (dD̃(κ)/dκ) (dκ(δ)/dδ) = D̃′ κ′   (3.22)

and

  D″ = d²D/dδ² = d/dδ ( (dD̃/dκ)(dκ/dδ) ) = (d²D̃/dκ²)(dκ/dδ)² + (dD̃/dκ)(d²κ/dδ²) = D̃″ (κ′)² + D̃′ κ″.   (3.23)

Using equations 3.19 to 3.23, we arrive at derivatives of D(δ) with respect to δ, expressed as functions of k:⁵

  (dD/dδ)(k) = D̃′(k)/δ′(k),   (3.24)
  (d²D/dδ²)(k) = (1/δ′(k)³) · ( D̃″(k) δ′(k) − D̃′(k) δ″(k) ).   (3.25)
The derivatives, equations 3.24 and 3.25, can be calculated explicitly and evaluated for all k > 0. After simplification, (dD/dδ)(k) and (d²D/dδ²)(k) read

  (dD/dδ)(k) = 2(1 + k) · ((1 + k)/k)^{−(1+k)²/(1+2k)} · √( β log((1 + k)/k) / (1 + 2k) )   (3.26)

and

  (d²D/dδ²)(k) = 2β · ((1 + k)/k)^{−(1+k)²/(1+2k)} · ((1 + k)/(1 + 2k)) × [ (1 + 2k − 2k² log((1 + k)/k)) · (1 + 2k − 2(1 + k)² log((1 + k)/k)) ] / [ 1 + 2k + 2k(1 + k) log((1 + k)/k) ],   (3.27)

respectively. Clearly, (dD/dδ)(k) > 0 for all β > 0 and k > 0. To show that (d²D/dδ²)(k) < 0, recall that by lemma 3,

  (1 + 2k)/(2(1 + k)²) < log((1 + k)/k) < (1 + 2k)/(2k²)

⁵ k > 0 is related to δ through k = κ(δ).
for all k > 0, and so

  1 + 2k − 2k² log((1 + k)/k) > 0,   1 + 2k − 2(1 + k)² log((1 + k)/k) < 0.

All the other factors in equation 3.27 are positive.

4. Considering only the leading terms as δ → 0⁺ (k → ∞), we have

  (dD/dδ)(k) = √(2β/e) + O(1/k²),

and so

  lim_{δ→0⁺} dD_β(δ)/dδ = √(2β/e).
Denote by G_α(s) the collection of activations coming from the feedforward part of RecSOM,

  G_α(s) = (G_α(s, w₁), G_α(s, w₂), . . . , G_α(s, w_N)).   (3.28)

Theorem 2. Consider an input s ∈ Rⁿ. If for some ρ ∈ [0, 1),

  β ≤ (e/2) · ρ² · ‖G_α(s)‖⁻²,   (3.29)
then the mapping F_s (see equations 3.4 and 3.5) is a contraction with contraction coefficient ρ.

Proof. Recall that F_s is a contractive mapping with contraction coefficient 0 ≤ ρ < 1 if for any y, y′ ∈ R^N, ‖F_s(y) − F_s(y′)‖ ≤ ρ · ‖y − y′‖. This is equivalent to saying that for any y, y′, ‖F_s(y) − F_s(y′)‖² ≤ ρ² · ‖y − y′‖², which can be rephrased as

  Σ_{i=1}^N G_α²(s, w_i) · (G_β(y, c_i) − G_β(y′, c_i))² ≤ ρ² · ‖y − y′‖².   (3.30)
For given y, y′, ‖y − y′‖ = δ > 0, let us consider the worst-case scenario with respect to the position of the context vectors c_i, so that the bound, equation 3.30, still holds. By theorem 1, when maximizing the left-hand side of equation 3.30, we should locate c_i on the line passing through y and y′, at either c_{β,1}(δ) = y + κ_β(δ) · (y − y′) or c_{β,2}(δ) = y′ + κ_β(δ) · (y′ − y), where κ_β(δ) is implicitly defined by H_β(κ_β(δ), δ) = 0. In that case, we have

  |G_β(y, c_{β,j}(δ)) − G_β(y′, c_{β,j}(δ))| = D_β(δ),  j = 1, 2.

Since D_β(δ) is a continuous concave function on δ > 0 with lim_{δ→0⁺} D_β(δ) = 0 and lim_{δ→0⁺} dD_β(δ)/dδ = √(2β/e), we have the following upper bound:

  D_β(δ) ≤ δ √(2β/e).   (3.31)
Applying equation 3.31 to 3.30, we get that if

  δ² (2β/e) Σ_{i=1}^N G_α²(s, w_i) ≤ ρ² δ²,   (3.32)

then F_s will be a contraction with contraction coefficient ρ. Inequality 3.32 is equivalent to

  (2β/e) ‖G_α(s)‖² ≤ ρ².   (3.33)
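As a quick sanity check of theorem 2 (our illustration with random stand-in weights, not material from the paper), one can pick β at the boundary of condition 3.29 for a target ρ and confirm that empirical Lipschitz ratios of F_s over random pairs of states never exceed ρ:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, alpha, rho = 50, 3, 1.0, 0.9           # arbitrary illustrative values
W, C, s = rng.random((N, n)), rng.random((N, N)), rng.random(n)

# Feedforward activations G_alpha(s) of equation 3.28.
G_alpha = np.exp(-alpha * np.sum((s - W) ** 2, axis=1))
# Equality in condition 3.29: beta = (e/2) * rho^2 * ||G_alpha(s)||^-2.
beta = rho**2 * np.e / (2 * np.sum(G_alpha**2))

def F_s(y):
    return G_alpha * np.exp(-beta * np.sum((y - C) ** 2, axis=1))

# Empirical Lipschitz ratios ||F_s(y) - F_s(y')|| / ||y - y'|| over random pairs.
ratios = [np.linalg.norm(F_s(y1) - F_s(y2)) / np.linalg.norm(y1 - y2)
          for y1, y2 in (rng.random((2, N)) for _ in range(500))]
max_ratio = max(ratios)
```

In practice the observed ratios stay well below ρ, since the theorem's bound is built from the worst-case placement of the context vectors.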
Corollary 1. Consider a RecSOM fed by a fixed input s. Define

  Υ(s) = (e/2) ‖G_α(s)‖⁻².   (3.34)
Then, if β < Υ(s), F_s is a contractive mapping.
We conclude the section by mentioning that we empirically verified the validity of the analytical bound, equation 3.31, for a wide range of values of β, 10⁻² ≤ β ≤ 5. For each β, the values of κ_β(δ) were numerically calculated
Figure 3: Functions Dβ (δ) for β = 0.5 and β = 2 (solid lines). Also shown (dashed lines) are the linear upper bounds (see equation 3.31).
on a fine grid of δ-values from the interval (0, 4). These values were then used to plot the functions D_β(δ) and to numerically estimate the limit of the first derivative of D_β(δ) as δ → 0⁺. The numerically determined values matched the analytical calculations perfectly. As an illustration, Figure 3 shows the functions D_β(δ) for β = 0.5 and β = 2 (solid lines), together with the linear upper bounds of equation 3.31 (dashed lines).

4 Experiments

In this section we demonstrate and analyze (using the results of section 3) the potential of RecSOM for creating Markovian context representations on three types of sequences of different nature and complexity: stochastic automaton, laser data, and natural language. The first and the third data sets were also used in Voegtlin (2002).
Following Voegtlin (2002), in order to represent the trained maps, we calculate for each unit in the map its receptive field (RF). The RF of a neuron is the common suffix of all sequences for which that neuron becomes the best-matching unit (Voegtlin, 2002). Voegtlin also suggests measuring the amount of memory captured by the map through the quantizer depth,

  QD = Σ_{i=1}^N p_i ℓ_i,   (4.1)
where p_i is the probability of the RF of neuron i and ℓ_i is its length.
To assess maps from the point of view of topography preservation, we introduce a measure that aims to quantify the maps’ topographic order. For each unit in the map, we first calculate the length of the longest common suffix shared by the RFs of that unit and its immediate topological neighbors. In other words, for each unit i on the grid, we create a set of strings R_i consisting of the RF of unit i and the RFs of its four neighbors on the grid.⁶ The length of the longest common suffix of the strings in R_i is denoted by ℓ(R_i). The topography preservation measure⁷ TP is the average of such shared RF suffix lengths over all units in the map:

  TP = (1/N) Σ_{i=1}^N ℓ(R_i).   (4.2)
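For concreteness, here is how QD and TP could be computed from a map's receptive fields; the tiny 2 × 2 grid, its RF strings, and their probabilities below are made-up stand-ins, not data from the paper:

```python
import numpy as np

# Hypothetical receptive fields and RF probabilities on a 2 x 2 grid.
rf = {(0, 0): "aaa", (0, 1): "baa", (1, 0): "bba", (1, 1): "aba"}
p  = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}

def common_suffix_len(strings):
    """Length of the longest common suffix of a collection of strings."""
    rev = [s[::-1] for s in strings]
    i = 0
    while all(len(r) > i and r[i] == rev[0][i] for r in rev):
        i += 1
    return i

def neighbors(u, rows=2, cols=2):
    """Four-connected grid neighbors (boundary units have fewer)."""
    r, c = u
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [v for v in cand if 0 <= v[0] < rows and 0 <= v[1] < cols]

QD = sum(p[u] * len(rf[u]) for u in rf)                        # equation 4.1
TP = float(np.mean([common_suffix_len([rf[u]] + [rf[v] for v in neighbors(u)])
                    for u in rf]))                             # equation 4.2
```

On this toy grid every unit shares exactly the final symbol a with its neighbors, so TP = 1.0, while QD = 3.0 because all RFs have length 3 and the probabilities sum to 1.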
In order to get an insight into the benefit of having a trainable recurrent part in RecSOM, we also compare RecSOM with a standard SOM operating on Markovian suffix-based vector representations of fixed dimensionality obtained from a simple nontrainable iterative function system (Tiňo & Dorffner, 2001).

4.1 Stochastic Automaton. The first input series was a binary sequence of 300,000 symbols generated by a first-order Markov chain over the alphabet {a, b}, with transition probabilities P(a|b) = 0.3 and P(b|a) = 0.4 (Voegtlin, 2002). Attempting to replicate Voegtlin’s results, we used RecSOM with 10 × 10 neurons and one-dimensional coding of input symbols: a = 0, b = 1. We chose RecSOM parameters from the stable region on the stability map evaluated by Voegtlin for this particular stochastic automaton (Voegtlin, 2002): α = 2 and β = 1. The learning rate was set to γ = 0.1. To allow for map ordering, we let the neighborhood width σ (see equation 2.6) linearly decrease from 5.0 to 0.5 during the first 200,000 iterations (ordering phase) and then kept it constant over the next 100,000 iterations (fine-tuning phase).⁸
⁶ Neurons at the grid boundary have fewer than four nearest neighbors.
⁷ It should be noted that quantifying topography preservation in recursive extensions of SOM is not as straightforward as in traditional SOM (Hammer, Micheli, Sperduti et al., 2004). The proposed TP measure quantifies the degree of local conservation of suffix-based RFs across the map.
⁸ Voegtlin did not consider reducing the neighborhood size. However, we found that the decreasing neighborhood width was crucial for topographic ordering. An initially small σ did not lead to global ordering of weights. This should not be surprising, since for σ = 0.5 (used in Voegtlin, 2002), the value h_ik of the neighborhood function for the nearest neighbor is only exp(−1/0.5²) ≈ 0.0183 (considering a square grid of neurons with mesh size 1). Decreasing σ is also important in standard SOM.
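The input series is easy to regenerate; a minimal sketch (ours, with an arbitrary seed) of the two-state Markov chain with P(a|b) = 0.3 and P(b|a) = 0.4:

```python
import numpy as np

rng = np.random.default_rng(42)      # arbitrary seed, not from the paper
p_b = {"a": 0.4, "b": 0.7}           # P(b|a) = 0.4; P(b|b) = 1 - P(a|b) = 0.7
seq = ["a"]
for _ in range(299_999):
    seq.append("b" if rng.random() < p_b[seq[-1]] else "a")
frac_a = seq.count("a") / len(seq)   # stationary P(a) = 0.3/0.7, about 0.43
```

The empirical symbol frequencies of such a sequence sit close to the chain's stationary distribution (P(a) = 3/7), which is a convenient check that the transition probabilities were encoded correctly.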
Figure 4: Converged input (left) and context (right) weights after training on the stochastic automaton. All values are between 0 (code of a —white) and 1 (code of b—black). Input weights can be clustered into two groups corresponding to the two input symbols, with a few intermediate units at the boundary. Topographic organization of the context weights is also clearly evident.
Weights of the trained RecSOM (after 300,000 iterations) are shown in Figure 4. Input weights (at the left of the figure) are topographically organized in two regions, representing the two input symbols. Context weights of all neurons have a unimodal character and are topographically ordered with respect to the peak position (mode).
Receptive fields of all units of the map are shown in Figure 5. For each unit i ∈ {1, 2, . . . , N}, its RF is shaded according to the local topography preservation measure ℓ(R_i) (see equation 4.2).⁹ The RFs are topographically ordered with respect to the most recent symbols. This map is consistent with the input weights of the neurons (left part of Figure 4), when considering only the last symbol.
The RecSOM model can be considered a nonautonomous dynamical system driven by the external input stream (in this case, sequences over an alphabet of two input symbols, a and b). In order to investigate the fixed-input dynamics (see equation 3.4) of the mappings¹⁰ F_a and F_b for symbols a and b, respectively, we randomly (with uniform distribution) initialized context activations y(0) in 10,000 different positions within the state space (0, 1]^N. For each initial condition y(0), we checked the asymptotic dynamics of the fixed-input maps F_s, s ∈ {a, b}, by monitoring the L₂-norm of the activation differences (y(t) − y(t − 1)) and recording the limit set (after 1000 iterations). Both autonomous dynamics settle down in the respective unique attractive fixed points y_a = F_a(y_a) and y_b = F_b(y_b). An example of the fixed-input dynamics is displayed in Figure 6. Both autonomous systems settle in the

⁹ We thank one of the anonymous reviewers for this suggestion.
¹⁰ We slightly abuse the mathematical notation here by indexing the fixed-input RecSOM maps F with the actual input symbols rather than their vector encodings s.
Dynamics of Recursive Self-Organizing Maps
2547
Figure 5: Receptive fields of RecSOM trained on the stochastic two-state automaton. Topographic organization is observed with respect to the most recent symbols (only five symbols are shown for clarity). Empty slots signify neurons that were not registered as best-matching units when processing the data. The receptive field of each unit i is shaded according to the local topography preservation measure (Ri). For each input symbol s ∈ {a, b}, we mark the position of the fixed-point attractor i_s of the induced (fixed-input) dynamics on the map by a square around its RF.
Figure 6: Fixed-input dynamics of RecSOM trained on the stochastic automaton—symbol a (first row) and symbol b (second row). The activations settle to a stable unimodal profile after roughly 10 iterations.
fixed points in roughly 10 iterations. Note the unimodal profile of the fixed points; that is, there is a unique dimension (map unit) of pronounced maximum activity. It is important to appreciate how the character of the RecSOM fixed-input dynamics (see equation 3.4) for each individual input symbol shapes the overall organization of RFs in the map. For each input symbol s ∈ {a, b}, the autonomous dynamics y(t) = Fs(y(t − 1)) induces a dynamics of the winner
P. Tiňo, I. Farkaš, and J. van Mourik
2548
units on the map:

$$i_s(t) = \operatorname*{argmax}_{i\in\{1,2,\dots,N\}} y_i(t) = \operatorname*{argmax}_{i\in\{1,2,\dots,N\}} F_{s,i}(y(t-1)). \tag{4.3}$$

To illustrate the dynamics, equation 4.3, for each of the 10,000 initial conditions y(0), we first let the system 3.4 settle down by preiterating it for 1000 iterations and then marked the map position of the winner units i_s(t) for a further 100 iterations. As the fixed-input dynamics for s ∈ {a, b} is dominated by the unique attractive fixed point y_s, the induced dynamics on the map, equation 4.3, settles down in neuron i_s, corresponding to the mode of y_s:

$$i_s = \operatorname*{argmax}_{i\in\{1,2,\dots,N\}} y_{s,i}. \tag{4.4}$$
The position of the neuron i_s is marked in Figure 5 by a square around its RF. The neuron i_s will be most responsive to input subsequences ending with long blocks of the symbol s. As seen in Figure 5, the receptive fields of the other neurons on the map are organized with respect to the closeness of the neurons to the fixed-input winners i_a and i_b. Such an organization follows from the attractive fixed-point behavior of the individual maps F_a and F_b and the unimodal character of their fixed points y_a and y_b. As soon as symbol s is seen, the mode of the activation profile y drifts toward the neuron i_s. The more consecutive symbols s we see, the more dominant the attractive fixed point of F_s becomes and the closer the winner position is to i_s. Indeed, for each s ∈ {a, b}, the RF of i_s ends with a long block of symbols s, and the local topography preservation (R_{i_s}) around i_s is high. This mechanism for creating a suffix-based RF organization is reminiscent of the Markovian fractal subsequence representations used in Tiňo and Dorffner (2001) to build Markov models with context-dependent length. In the next section, we compare maps of RecSOM with those obtained using a standard SOM operating on such fractal representations (of fixed dimensionality). Unlike in RecSOM, the dynamic part responsible for processing the temporal context is then fixed.
Theoretical upper bounds on β guaranteeing the existence of stable activation profiles in the fixed-input RecSOM dynamics were calculated as Υ(a) = 0.0226 and Υ(b) = 0.0336.¹¹ Clearly, a fixed-point (attractive) RecSOM dynamics is obtained for values of β well above the guaranteed theoretical bounds, equation 3.34.
11 Again, we write the actual input symbols rather than their vector encodings s.
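The fixed-input asymptotics and the induced winner dynamics described above (equations 3.4, 4.3, and 4.4) can be sketched in a few lines. This is a toy illustration, not the trained map of the experiment: the componentwise form of F_s used here, exp(−α‖s − w_i‖²)·exp(−β‖y − c_i‖²), is inferred from equations 3.4 and 3.5, the weights and input codes are random stand-ins, and β is deliberately set below the e/(2N) bound of corollary 2 so that convergence to a unique attractive fixed point is guaranteed.

```python
import numpy as np

def fixed_input_map(y, s, W, C, alpha=1.0, beta=0.01):
    """One step y(t) = F_s(y(t-1)) of the fixed-input RecSOM dynamics.

    Component i is assumed to be exp(-alpha*||s - w_i||^2) *
    exp(-beta*||y - c_i||^2), following equations 3.4 and 3.5.
    """
    g = np.exp(-alpha * np.sum((W - s) ** 2, axis=1))  # input-match term G_alpha(s)
    h = np.exp(-beta * np.sum((C - y) ** 2, axis=1))   # context-match term
    return g * h

def settle(s, W, C, n_init=20, n_iter=1000, tol=1e-10, seed=0):
    """Iterate F_s from random y(0) in (0,1]^N and record the limits."""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    limits = []
    for _ in range(n_init):
        y = rng.uniform(0.0, 1.0, size=N)
        for _ in range(n_iter):
            y_next = fixed_input_map(y, s, W, C)
            done = np.linalg.norm(y_next - y) < tol  # L2 norm of activity difference
            y = y_next
            if done:
                break
        limits.append(y)
    return np.array(limits)

# Toy stand-in map: N = 100 units, 2-dim input codes; beta = 0.01 is below
# e/(2N) ~ 0.0136, so F_s is contractive regardless of the weights.
rng = np.random.default_rng(42)
N = 100
W = rng.uniform(0, 1, size=(N, 2))   # untrained stand-in input weights
C = rng.uniform(0, 1, size=(N, N))   # untrained stand-in context weights
s_a = np.array([1.0, 0.0])           # hypothetical code of symbol a

limits = settle(s_a, W, C)
spread = np.max(np.linalg.norm(limits - limits[0], axis=1))  # ~0: unique fixed point
winner = int(np.argmax(limits[0]))   # mode of y_a, i.e., the winner i_a (equation 4.4)
```

Since all initial conditions reach the same limit, `spread` is numerically zero, mirroring the unique attractive fixed points observed in the experiment.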
4.2 IFS Sequence Representations Combined with Standard SOM. Previously we have shown that a simple affine contractive iterative function system (IFS) (Barnsley, 1988) can be used to transform the temporal structure of symbolic sequences into a spatial structure of points in a metric space (Tiňo & Dorffner, 2001). The points represent subsequences in a Markovian manner: subsequences sharing a common suffix are mapped close to each other. Furthermore, the longer the shared suffix is, the closer the subsequence representations lie. The IFS representing sequences over an alphabet A of A symbols operates on an m-dimensional unit hypercube [0, 1]^m, where m = ⌈log2 A⌉.¹² With each symbol s ∈ A, we associate an affine contraction on [0, 1]^m,

$$s(x) = kx + (1-k)\,t_s, \quad t_s \in \{0,1\}^m, \quad t_s \ne t_{s'} \text{ for } s \ne s', \tag{4.5}$$
with contraction coefficient k ∈ (0, 1/2]. The attractor of the IFS, equation 4.5, is the unique set K ⊆ [0, 1]^m, known as the Sierpinski sponge (Kenyon & Peres, 1996), for which K = ∪_{s∈A} s(K) (Barnsley, 1988). For a prefix u = u1u2...un of a string v over A and a point x ∈ [0, 1]^m, the point

$$u(x) = u_n(u_{n-1}(\dots(u_2(u_1(x)))\dots)) = (u_n \circ u_{n-1} \circ \dots \circ u_2 \circ u_1)(x) \tag{4.6}$$
constitutes a spatial representation of the prefix u under the IFS, equation 4.5. Finally, the overall temporal structure of symbols in a (possibly long) sequence v over A is represented by a collection of the spatial representations u(x) of all its prefixes u, with the convention that x = {1/2}^m. Theoretical properties of such representations were investigated in Tiňo (2002).
The IFS-based Markovian coding scheme can be used to construct generative probabilistic models on sequences analogous to the variable-memory-length Markov models (Tiňo & Dorffner, 2001). A key element of the construction is a quantization of the spatial IFS representations into clusters that group together subsequences sharing potentially long suffixes (densely populated regions of the suffix-organized IFS subsequence representations).
The Markovian layout of the IFS representations of symbolic sequences can also be used for constructing suffix-based topographic maps of symbolic streams in an unsupervised manner. By applying a standard SOM (Kohonen, 1990) to the IFS representations, one may readily obtain topographic maps of Markovian flavor, similar to those obtained by RecSOM. The key difference between RecSOM and IFS+SOM (standard SOM operating on IFS representations) is that the latter approach assumes a fixed nontrainable dynamic part responsible for processing temporal contexts¹²

12 For x ∈ ℝ, ⌈x⌉ is the smallest integer y such that y ≥ x.
Figure 7: Standard SOM operating on IFS representations of symbolic streams (IFS+SOM model). Solid lines represent trainable feedforward connections. No learning takes place in the dynamic IFS part responsible for processing temporal contexts in the input stream.
in the input stream. The recursion is not part of the map itself but is performed outside the map, as a preprocessing step, before feeding the standard SOM (see Figure 7). As shown in Figure 8, the combination of IFS representations¹³ and standard SOM¹⁴ leads to a suffix-based organization of RFs on the map, similar to that produced by RecSOM. In both models, the RFs are topographically ordered with respect to the most recent input symbols. The dynamics of SOM activations y, driven by the IFS dynamics (see equations 4.5 and 4.6), again induces the dynamics (see equation 4.3) of winning units on the map. Since the IFS maps are affine contractions with fixed points t_a and t_b, the dynamics of winner units for both input symbols s ∈ {a, b} settles in the SOM representations i_s of t_s. Note how the fixed points i_s of the induced winning-neuron dynamics shape the suffix-based organization of receptive fields in Figure 8.
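The IFS encoder of equations 4.5 and 4.6 is only a few lines of code. In this minimal sketch the corner codes t_s are chosen (hypothetically) as binary expansions of the symbol indices; any assignment of distinct hypercube corners would do.

```python
import numpy as np
from math import ceil, log2

def ifs_encode(seq, alphabet, k=0.3):
    """IFS (equations 4.5 and 4.6) spatial representations of all prefixes.

    Each symbol s acts as the affine contraction x -> k*x + (1-k)*t_s on
    [0,1]^m, m = ceil(log2 |A|), with t_s a distinct binary corner of the
    hypercube.  Iteration starts from x = (1/2, ..., 1/2).
    """
    m = max(1, ceil(log2(len(alphabet))))
    # distinct corner codes t_s: binary expansion of each symbol's index
    corners = {s: np.array([(i >> b) & 1 for b in range(m)], dtype=float)
               for i, s in enumerate(alphabet)}
    x = np.full(m, 0.5)
    points = []
    for s in seq:
        x = k * x + (1 - k) * corners[s]   # one application of equation 4.5
        points.append(x.copy())
    return np.array(points)               # one point per prefix (equation 4.6)

# Prefixes sharing a suffix of length L land within k^L * sqrt(m) of each
# other, so the suffix structure becomes a spatial structure.
u = ifs_encode("aaaabab", "ab")[-1]   # shares the suffix "bab" (L = 3) ...
v = ifs_encode("bbbbbab", "ab")[-1]   # ... with this sequence
gap = float(np.abs(u - v).max())      # bounded by 0.3**3 = 0.027
```

Feeding the resulting points to any off-the-shelf SOM implementation reproduces the IFS+SOM pipeline of Figure 7.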
13 IFS coefficient k = 0.3.
14 Parameters such as the learning rate and the schedule for the neighborhood width σ were taken from RecSOM.
Figure 8: Receptive fields of a standard SOM trained on IFS representations of sequences obtained from the automaton. A suffix-based topographic organization similar to that found in RecSOM is apparent. The receptive field of each unit i is shaded according to the local topography preservation measure (Ri). For each s ∈ {a, b}, we mark the position of the fixed-point attractor i_s of the induced dynamics on the map by a square around its RF.
To compare RecSOM and IFS+SOM maps in terms of quantization depth (QD) (see equation 4.1) and topography preservation (TP) (see equation 4.2), we varied the RecSOM parameters α and β and ran 40 training sessions for each setting of (α, β). The resulting TP and QD values were compared with those of the IFS+SOM maps constructed in 40 independent training sessions. Other parameters were the same as in the previous simulations and were identical for both models. We chose a 3 × 3 design using α ∈ {1, 2, 3} and β ∈ {0.2, 0.7, 1}, attempting to meaningfully cover the parameter space of RecSOM (see Voegtlin, 2002). Each RecSOM model was compared to IFS+SOM using a two-tailed t-test. Results are shown in Table 1. The number of stars denotes the significance level of the difference (p < 0.05, 0.01, or 0.001, in ascending order). Almost all differences are significant. Specifically, the QD of RecSOM is significantly higher for all combinations of α and β.¹⁵ The TP for RecSOM is also significantly higher in most cases, except for those with
15 Due to the constraints imposed by the topographic organization of RFs, the quantizer depths of the maps are smaller than that of the theoretically optimal (unconstrained) quantizer computed by Voegtlin (2002) as QD = 7.08.
Table 1: Means of the (QD, TP) Measures, Averaged over 40 Simulations, for RecSOM Trained on the Stochastic Automaton. Entries are (QD, TP) pairs.

β\α    α = 1.0              α = 2.0              α = 3.0
0.2    (5.84***, 2.51***)   (6.15***, 2.77***)   (6.05***, 2.55***)
0.7    (6.11***, 1.27***)   (6.14***, 2.21***)   (6.12***, 2.43***)
1.0    (5.75***, 0.75**)    (5.87***, 2.00)      (5.92***, 2.10***)

Note: Corresponding means for IFS+SOM were as follows: QD = 5.55 and TP = 1.96.
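The per-cell comparison behind the stars can be sketched as follows. The two samples below are synthetic stand-ins, not the paper's measurements (their means merely echo Table 1's α = 1, β = 0.2 QD cell and the IFS+SOM note), and only the two-sample t statistic is computed.

```python
import numpy as np

# 40 per-session QD values for one RecSOM setting vs. 40 for IFS+SOM;
# synthetic normal samples, NOT the paper's data.
rng = np.random.default_rng(0)
qd_recsom = rng.normal(5.84, 0.15, size=40)
qd_ifs_som = rng.normal(5.55, 0.15, size=40)

n = 40
diff = qd_recsom.mean() - qd_ifs_som.mean()
# pooled standard deviation (equal sample sizes)
s_pooled = np.sqrt((qd_recsom.var(ddof=1) + qd_ifs_som.var(ddof=1)) / 2)
t_stat = diff / (s_pooled * np.sqrt(2 / n))  # refer to t with 2n-2 d.o.f.
```

A two-tailed p-value would then be read off the t distribution with 2n − 2 = 78 degrees of freedom; for a mean gap this large relative to the spread, the difference is highly significant.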
higher β/α ratio. As explained in section 4.1, trivial contractive fixed-input dynamics dominated by unique attractive fixed points lead to Markovian suffix-based RF organizations. Lower TP values are observed for higher β/α ratios because, in those cases, more complicated fixed-input dynamics can arise, breaking the Markovian RF maps.

4.3 Laser Data. In this experiment, we trained the RecSOM on a sequence of quantized activity differences of a laser in a chaotic regime. The series of length 8000 was quantized into a symbolic stream over four symbols (as in Tiňo & Köteles, 1999; Tiňo & Dorffner, 2001; Tiňo et al., 2004) represented by two-bit binary codes: a = 00, b = 01, c = 10, d = 11. RecSOM with 2 inputs and 10 × 10 = 100 neurons was trained for 400,000 iterations, using α = 1, β = 0.2, and γ = 0.1. The neighborhood width σ decreased linearly from 5.0 to 0.5 during the first 300,000 iterations and then remained unchanged.
The behavior of the model was qualitatively the same as in the previous experiment. The map of RFs was topographically ordered with respect to the most recent symbols. By checking the asymptotic regimes of the fixed-input RecSOM dynamics (see equation 3.4) as in the previous experiment, we found that the fixed-input dynamics are again driven by unique attractive fixed points ya, yb, yc, and yd. As before, the dynamics of winning units on the map induced by the fixed-input dynamics y(t) = Fs(y(t − 1)), s ∈ {a, b, c, d}, settled down in the mode position i_s of y_s. Upper bounds on β guaranteeing the existence of stable activation profiles in the fixed-input RecSOM dynamics were determined as Υ(a) = 0.0326, Υ(b) = 0.0818, Υ(c) = 0.0253, and Υ(d) = 0.0743. Again, we observe contractive behavior for β above the theoretical bounds. As in the first experiment, we trained a standard SOM on (this time, two-dimensional) inputs created by the IFS (see equation 4.5).
Again, both RecSOM and the combination of IFS with standard SOM lead to suffix-based maps of RFs; that is, the maps of RFs were topographically ordered with respect to the most recent input symbols.¹⁶

16 IFS coefficient k = 0.3; the learning rate and the schedule for the neighborhood width σ were taken from RecSOM.
Table 2: Means of the (QD, TP) Measures, Averaged over 40 Simulations, for RecSOM Trained on the Laser Data. Entries are (QD, TP) pairs.

β\α    α = 1.0              α = 2.0              α = 3.0
0.2    (5.07***, 1.44)      (5.52***, 1.77***)   (4.92***, 1.63***)
0.7    (5.49***, 0.75***)   (5.92***, 1.41)      (5.81***, 1.59***)
1.0    (5.66***, 0.55***)   (6.75***, 1.16***)   (6.73***, 1.42)

Note: Corresponding means for IFS+SOM were as follows: QD = 4.63 and TP = 1.42.
Analogous to the previous experiment, we compared the quantization depths and topography preservation measures of RecSOM and IFS+SOM in a large set of experimental runs with varying RecSOM parameters α and β. Results are shown in Table 2. As before, the QD of RecSOM is always significantly higher, whereas the TP measure for RecSOM is higher except for cases of higher β/α ratio.

4.4 Language. In our last experiment, we used a corpus of written English, the novel Brave New World by Aldous Huxley. In the corpus, we removed punctuation symbols, switched uppercase letters to lowercase, and transformed the space between words into the symbol "-". The complete data set (after filtering) consisted of 356,606 symbols. Letters of the Roman alphabet were binary encoded using 5 bits and presented to the network one at a time. Unlike in Voegtlin (2002), we did not reset the context map activations between the words. RecSOM with 400 neurons was trained for two epochs using the following parameter settings: α = 3, β = 0.7, γ = 0.1, and σ: 10 → 0.5. The radius σ reached its final value at the end of the first epoch and then remained constant to allow for fine-tuning of the weights.
The map of RFs is displayed in Figure 9. Figure 10 illustrates the asymptotic regimes of the fixed-input RecSOM dynamics (see equation 3.4) in terms of map activity differences between consecutive time steps.¹⁷ We observed a variety of behaviors. For some symbols, the activity differences converge to zero (attractive fixed points); for other symbols, the differences level off at nonzero values (periodic attractors

17 Because of the higher dimensionality of the activation space (N = 400), we used a different strategy for generating the initial conditions y(0). We randomly varied only those components yi(0) of y(0) with the potential to give rise to different fixed-input dynamics. Since 0 < yi(0) ≤ 1 for all i = 1, 2, ..., N, it follows from equation 3.5 that these can only be components yi for which the constant Gα(s, wi) is not negligibly small. It is sufficient to use a small enough threshold θ > 0 and set yi(0) = 0 if Gα(s, wi) < θ. Such a strategy can significantly reduce the dimension of the search space. We used θ = 0.001, and the number of components of y(0) involved in generating the initial conditions varied from 31 to 138, depending on the input symbol.
Figure 9: Receptive fields of RecSOM trained on English text. Dots denote units with empty RFs. The receptive field of each unit i is shaded according to the local topography preservation measure (Ri ).
of period two, e.g., symbols i, t, a, -). The fixed-input RecSOM dynamics for the symbol o follows a complicated aperiodic trajectory.¹⁸
The dynamics of the winner units on the map induced by the fixed-input dynamics of Fs are shown in Figure 11 (left). As before, for symbols s with dynamics y(t) = Fs(y(t − 1)) dominated by a single fixed point ys, the induced dynamics on the map settles down in the mode position of ys. However, some autonomous dynamics y(t) = Fs(y(t − 1)) of period two (e.g., s ∈ {n, h, r, p, s}) induce a trivial dynamics on the map driven to a single point (grid position). In those cases, the points y1 and y2 on the periodic orbit (y1 = Fs(y2), y2 = Fs(y1)) lie within the representation region (Voronoi compartment) of the same neuron. Interestingly, the complicated dynamics of Fo and Fe translates into aperiodic oscillations between just two grid positions. Still, the suffix-based organization of RFs in Figure 9 is shaped by the underlying collection of the fixed-input dynamics of Fs (illustrated in Figure 11, left, through the induced dynamics on the map).
Theoretical upper bounds on β (see equation 3.34) are shown in Figure 12. Whenever for an input symbol s the bound Υ(s) is above the value β = 0.7 (dashed horizontal line) used to train RecSOM (symbols z, j, q, x), we can be certain

18 A detailed investigation revealed that the same holds for the autonomous dynamics under the symbol e (even though this is less obvious by scanning Figure 10).
Figure 10: Fixed-input asymptotic dynamics of RecSOM after training on English text. Plotted are the L2 norms of the differences of map activities between successive iterations. Labels denote the associated input symbols (for clarity, not all labels are shown).
that the fixed-input dynamics given by the map Fs will be dominated by an attractive fixed point. For symbols s with Υ(s) < β, there is a possibility of a more complicated dynamics driven by Fs. We marked the β-bounds of all symbols s whose asymptotic fixed-input dynamics goes beyond a single stable sink with an asterisk. Obviously, as seen in the previous experiments, Υ(s) < β does not necessarily imply more complicated fixed-input dynamics on symbol s. However, in this case, for most symbols s with Υ(s) < β, the associated fixed-input dynamics was indeed different from the trivial one dominated by a single attractive fixed point.
We also trained a standard SOM with 20 × 20 neurons on five-dimensional inputs created by the IFS (see equation 4.5).¹⁹ The map is shown in Figure 13. The induced dynamics on the map is illustrated in Figure 11 (right). The suffix-based organization of RFs is shaped by the underlying collection of autonomous attractive IFS dynamics.
Table 3 compares the QD and TP measures of the RecSOM maps to the IFS+SOM maps. In this case, higher β/α ratios quickly lead to rather complicated fixed-input RecSOM dynamics, breaking the Markovian suffix-based RF

19 IFS coefficient k = 0.3. The learning rate and the schedule for the neighborhood width σ were taken from RecSOM.
Figure 11: Dynamics of the winning units on the RecSOM (left) and IFS+SOM (right) maps induced by the fixed-input dynamics. The maps were trained on a corpus of written English (Brave New World, by Aldous Huxley).
organization of contractive maps. This has a negative effect on the QD and TP measures.

5 Discussion

5.1 Topographic Maps with Markovian Flavor. Maps of sequential data obtained by RecSOM often seem to have a Markovian flavor. The neural units become sensitive to recently observed symbols. Suffix-based receptive fields (RFs) of the neurons are topographically organized in connected regions according to the last symbol. Within each of those regions, RFs are again topographically organized with respect to the symbol preceding the last symbol. Such a self-similar structure is typical of spatial representations of symbolic sequences via contractive (affine) iterative function systems (IFS) (Jeffrey, 1990; Oliver, Bernaola-Galván, Guerrero-García, & Román-Roldán, 1993; Román-Roldán, Bernaola-Galván, & Oliver, 1994; Fiser, Tusnady, & Simon, 1994; Hao, 2000; Hao, Lee, & Zhang, 2000; Tiňo, 2002). Such IFS can be considered simple nonautonomous dynamical systems driven by an input stream of symbols. Each IFS mapping is a contraction, and therefore each fixed-input autonomous system has a trivial dynamics completely dominated by an attractive fixed point. However, the nonautonomous dynamics of the IFS can be quite complex, depending on the complexity of the input stream (see Tiňo, 2002).
More important, it is the attractive character of the individual fixed-input IFS maps that shapes the Markovian organization of the state space. Imagine we feed the IFS with a long string s1 ... sp−2 sp−1 sp ... sr−2 sr−1 sr ... over some finite alphabet A of A symbols. Consider the IFS states at time
Figure 12: Theoretical bounds on β for RecSOM trained on the English text. β-bounds for all symbols with asymptotic fixed-input dynamics richer than a single stable sink are marked by an asterisk.
Figure 13: Receptive fields of a standard SOM with 20 × 20 units trained on IFS outputs, obtained on the English text. Topographic organization is observed with respect to the most recent symbols. The receptive field of each unit i is shaded according to the local topography preservation measure (Ri ).
Table 3: Means of the (QD, TP) Measures, Averaged over 40 Simulations, for RecSOM Trained on the Language Data. Entries are (QD, TP) pairs.

β\α    α = 1.0              α = 2.0              α = 3.0
0.2    (1.68*, 0.30***)     (2.03***, 0.58***)   (2.04***, 0.64***)
0.7    (1.15***, 0.10***)   (1.89***, 0.30***)   (1.93***, 0.39***)
1.0    (0.92***, 0.06***)   (1.66, 0.21***)      (1.81***, 0.31***)

Note: Corresponding means for IFS+SOM were as follows: QD = 1.65 and TP = 0.59.
instances p and r, p < r. No matter how far apart the time instances p and r are, if the prefixes s1:p = s1 ... sp−2 sp−1 sp and s1:r = s1 ... sr−2 sr−1 sr share a common suffix, the corresponding IFS states (see equations 4.5 and 4.6), s1:p(x) and s1:r(x), will lie close to each other. If s1:p and s1:r share a suffix of length L, then for any initial position x ∈ [0, 1]^m, m = ⌈log2 A⌉,

$$\| s_{1:p}(x) - s_{1:r}(x) \| \le k^{L} \sqrt{m}, \tag{5.1}$$

where 0 < k < 1 is the IFS contraction coefficient and √m is the diameter of the IFS state space [0, 1]^m. Hence, the longer the shared suffix between s1:p and s1:r, the shorter the distance between s1:p(x) and s1:r(x). The IFS translates the suffix structure of a symbolic stream into a spatial structure of points (prefix representations) that can be captured on a two-dimensional map using, for example, a standard SOM, as done in our IFS+SOM model.
Similar arguments can be made for a contractive RecSOM of N neurons. Assume that for each input symbol s ∈ A, the fixed-input RecSOM mapping Fs (see equations 3.4 and 3.5) is a contraction with contraction coefficient ρs. Set ρmax = max_{s∈A} ρs. For a sequence s1:n = s1 ... sn−2 sn−1 sn over A and y ∈ (0, 1]^N, define

$$F_{s_{1:n}}(y) = F_{s_n}\bigl(F_{s_{n-1}}(\dots(F_{s_2}(F_{s_1}(y)))\dots)\bigr) = (F_{s_n} \circ F_{s_{n-1}} \circ \dots \circ F_{s_2} \circ F_{s_1})(y). \tag{5.2}$$

Then, if two prefixes s1:p and s1:r of a sequence s1 ... sp−2 sp−1 sp ... sr−2 sr−1 sr ... share a common suffix of length L, we have

$$\| F_{s_{1:p}}(y) - F_{s_{1:r}}(y) \| \le \rho_{\max}^{L} \sqrt{N}, \tag{5.3}$$

where √N is the diameter of the RecSOM state space (0, 1]^N.
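The bound, equation 5.3, is easy to illustrate numerically under stated assumptions: the componentwise form of F_s below is inferred from equations 3.4 and 3.5, the weights are random stand-ins, and β is chosen far below e/(2N) so that every F_s is certainly a contraction.

```python
import numpy as np

def recsom_step(y, s, W, C, alpha=1.0, beta=0.001):
    # Fixed-input map F_s, component i assumed to be
    # exp(-alpha*||s - w_i||^2) * exp(-beta*||y - c_i||^2) (equations 3.4, 3.5).
    g = np.exp(-alpha * np.sum((W - s) ** 2, axis=1))
    h = np.exp(-beta * np.sum((C - y) ** 2, axis=1))
    return g * h

def run(seq, codes, W, C, y0):
    # Composition F_{s_{1:n}} = F_{s_n} o ... o F_{s_1} (equation 5.2)
    y = y0
    for s in seq:
        y = recsom_step(y, codes[s], W, C)
    return y

rng = np.random.default_rng(1)
N = 64                                   # beta = 0.001 << e/(2N): contractive
W = rng.uniform(0, 1, (N, 2))            # stand-in input weights
C = rng.uniform(0, 1, (N, N))            # stand-in context weights
codes = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
y0 = rng.uniform(0, 1, N)

# Two prefixes sharing the suffix "abbaabba" (L = 8): by equation 5.3 their
# final activations differ by at most rho_max^L * sqrt(N), a tiny number.
y1 = run("babab" + "abbaabba", codes, W, C, y0)
y2 = run("aaaaa" + "abbaabba", codes, W, C, y0)
gap = float(np.linalg.norm(y1 - y2))
same_winner = int(np.argmax(y1)) == int(np.argmax(y2))
```

With a sufficiently long shared suffix, the two activation profiles coincide to within floating-point noise, and hence (generically) select the same best-matching unit.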
For sufficiently large L, the two activations y¹ = F_{s1:p}(y) and y² = F_{s1:r}(y) will be close enough to have the same location of the mode,²⁰

$$i^{*} = \operatorname*{argmax}_{i\in\{1,2,\dots,N\}} y^{1}_i = \operatorname*{argmax}_{i\in\{1,2,\dots,N\}} y^{2}_i,$$
and the two subsequences s1:p and s1:r yield the same best-matching unit i* on the map, regardless of the position of the subsequences in the input stream. All that matters is that the prefixes share a sufficiently long common suffix. We say that such an organization of RFs on the map has a Markovian flavor because it is shaped solely by the suffix structure of the processed subsequences, and it does not depend on the temporal context in which they occur in the input stream.
Obviously, one can imagine situations where (1) the locations of the modes of y¹ and y² are distinct despite a small distance between y¹ and y², or where (2) the modes of y¹ and y² coincide while their distance is quite large. This follows from the discontinuity of the best-matching-unit operation, equation 2.5. However, in our extensive experimental studies, we registered only a negligible number of such cases. Indeed, some of the Markovian RFs in RecSOM maps obtained in the first two experiments over small (two- and four-letter) alphabets were quite deep (up to 10 symbols).²¹ Our experiments suggest that, compared with IFS+SOM maps, RecSOM maps with lower β/α ratio (i.e., RecSOM maps constructed with a stronger emphasis on the recently observed history of inputs) are capable of developing Markovian organizations of RFs with significantly superior memory depth and topography preservation (quantified by the QD, equation 4.1, and TP, equation 4.2, measures, respectively).

5.2 Non-Markovian Topographic Maps. Periodic (beyond period 1) or aperiodic attractive dynamics of the autonomous systems y(t) = Fs(y(t − 1)) lead to potentially complicated non-Markovian organizations of RFs on the map. By calculating the RF of a neuron i as the common suffix shared by subsequences yielding i as the best-matching unit (Voegtlin, 2002), we always create a suffix-based map of RFs. Such RF maps are designed to illustrate the temporal structure learned by RecSOM.
Periodic or aperiodic dynamics of Fs can result in a broken topography of RFs: two sequences with the same suffix can be mapped into distinct positions on the map, separated by a region of very different suffix structure. Such cases result in lower values of the topography preservation measure TP (see equation 4.2). For example, depending on the context, subsequences ending with ee can be mapped near either the lower-left or the lower-right corners of the
20 Or at least mode locations on neighboring grid points of the map.
21 In some rare cases, even deeper.
RF map in Figure 9. Unlike in contractive RecSOM or IFS+SOM models, such context-dependent RecSOM maps embody a potentially unbounded memory structure, because the current position of the winner neuron is determined by the whole series of processed inputs, and not only by a history of recently seen symbols. Unless we understand the driving mechanism behind such context-sensitive suffix representations, we cannot fully appreciate the meaning of the RF structure of a RecSOM map. There is a more profound question to be asked: What is the principal motivation behind building topographic maps of sequential data? If the motivation is a better understanding of cortical signal representations (e.g., Wiemer, 2003), then considerable effort should be devoted to mathematical analysis of the scope of potential temporal representations and conditions for their emergence. If, on the other hand, the primary motivation is data exploration or data preprocessing, then we need to strive for a solid understanding of the way temporal contexts get represented on the map and in what way such representations fit the bill of the task we aim to solve. There will be situations where finite memory Markovian context representations are quite suitable. In that case, contractive RecSOM models, and indeed IFS+SOM models as well, may be appropriate candidates. But then the question arises of why exactly there needs to be a trainable dynamic part in self-organizing maps generalized to handle sequential data. As demonstrated in the first two experiments, IFS+SOM models can produce informative maps of Markovian context structures without an adaptive recursive submodel. One criterion for assessing the quality of RFs suggested by Voegtlin (2002) is the quantizer depth (QD) (see equation 4.1). Another possible measure quantifying topology preservation on maps is the TP measure of equation 4.2. 
If the coding efficiency of induced RFs and their topography preservation are desirable properties,²² then RecSOMs with Markovian maps seem to be superior candidates to IFS+SOM models. In other words, having a trainable dynamic part in self-organizing maps has its merits. Indeed, in our experiments, RecSOM maps with lower β/α ratio lead to Markovian RF organizations with significantly superior QD and TP values. For more complicated data sets, like the English-language corpus of the third experiment, RF maps beyond simple Markovian organization may be preferable. Yet it is crucial to understand exactly what structures more powerful than a Markovian organization of RFs are desired and why. It is appealing to notice in the RF map of Figure 9 the clearly non-Markovian spatial arrangement into distinct regions of RFs ending with the word-separation symbol -. Because of the special role of - and its high frequency
22 Here we mean the coding efficiency of RFs constrained by the two-dimensional map structure. Obviously, unconstrained codebooks will always lead to better coding efficiency.
of occurrence, it may indeed be desirable to separate endings of words in distinct islands with more refined structure. However, to go beyond mere commenting on empirical observations, one needs to address issues such as these:
• What properties of the input stream are likely to induce periodic (or aperiodic) fixed-input dynamics leading to context-dependent RF representations in SOMs with feedback structures
• What periods for which symbols are preferable
• What the learning mechanism (e.g., a sequence of bifurcations of the fixed-input dynamics) is for creating more complicated context-dependent RF maps
Those are the challenges for our future work.

5.3 Linking RecSOM Parameter β to Markovian RF Organizations. The RecSOM parameter β weighs the significance of importing information about the possibly distant past into the processing of sequential data. Intuitively, it is not surprising that when β is sufficiently small (i.e., when information about the very recent inputs dominates processing in RecSOM), the resulting maps will have a Markovian flavor. This intuition was given a more rigorous form in section 3. Contractive fixed-input mappings are likely to produce Markovian organizations of RFs on the RecSOM map. We have established theoretical bounds on the parameter β that guarantee contractiveness of the fixed-input maps. Using corollary 1, we obtain:

Corollary 2. Provided

$$\beta < \frac{e}{2N}, \tag{5.4}$$

irrespective of the input s, the map Fs of a RecSOM with N recurrent neurons will be a contraction. For any external input s, the fixed-input dynamics of such a RecSOM will be dominated by a single attractive fixed point.

Proof. It is sufficient to realize that

$$\|G_\alpha(s)\|^2 = \sum_{i=1}^{N} e^{-2\alpha \|s - w_i\|^2} \le N.$$
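The proof's inequality holds for arbitrary inputs and weights and is easy to check numerically; the weight matrix and input code below are arbitrary stand-ins.

```python
import numpy as np

# For any input s and any weights, each term exp(-2*alpha*||s - w_i||^2)
# is at most 1, so ||G_alpha(s)||^2 <= N and beta < e/(2N) suffices for
# contraction of the fixed-input map (corollary 2).
rng = np.random.default_rng(0)
N, alpha = 100, 1.0
W = rng.uniform(0, 1, (N, 2))        # arbitrary stand-in input weights
s = rng.uniform(0, 1, 2)             # arbitrary stand-in input code

G_sq = float(np.sum(np.exp(-2 * alpha * np.sum((W - s) ** 2, axis=1))))
beta_bound = np.e / (2 * N)          # e/(2N), approximately 0.0136 for N = 100
```
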
We experimentally tested the validity of the β bounds below which the fixed-input dynamics was proved to be driven exclusively by attractive fixed points. Using a 5 × 5 map grid (N = 25), with β = 0.05 (slightly below the bound, equation 5.4), and setting α = 1, we ran a batch of RecSOM training sessions for each of the three data sets considered in this article. The
P. Tiňo, I. Farkaš, and J. van Mourik
2562
[Figure 14 plot: activity distances for symbol "o"; training parameters: alpha = 3, beta = 0.7; vertical axis: L2-norm of activity differences (0–3); horizontal axis: beta (0–3)]
Figure 14: RecSOM map activity differences between consecutive time steps during 100 iterations (after initial 400 preiterations) for 0 ≤ β ≤ 3, step size 0.01. The input is fixed to the symbol “o”. The map was trained on the language data as described in section.
other model parameters were set as described in section 4. We evaluated the autonomous dynamics for each symbol, assessed by the L 2 norm of consecutive differences of the map activity profiles. In all cases, the differences vanished in fewer than 10 iterations. Because the β parameter was set below the theoretical bound (see equation 5.4), the training process could never induce an autonomous dynamics beyond the trivial one dominated by an attractive fixed point. We constructed a bifurcation diagram for the RecSOM architecture described in section 4.4. After training, we varied β while keeping other model parameters fixed. For each 0 ≤ β ≤ 3 with step 0.01, we computed map activity differences between consecutive time steps during 100 iterations (after initial 400 preiterations). The activation differences for input symbol o are shown in Figure 14. Three dominant types of autonomous dynamics were observed: fixed-point dynamics, period 2 attractors, and aperiodic oscillations (roughly for 0.6 < β < 1). As expected, for small values of β, the
Dynamics of Recursive Self-Organizing Maps
2563
dynamics is always governed by a unique fixed point. For higher values of β, the dynamics switches among periodic, fixed-point, and aperiodic regimes.23 We conclude by noting that when the inputs and input weights are taken from a set of diameter ξ, we have for the bound, equation 3.34,

e/(2N) ≤ ϒ(s) ≤ (e/(2N)) e^{2αξ²}.

The lower bound follows from corollary 2, and the upper bound follows from minimizing ||G_α(s)|| in equation 3.34.

5.4 Related Work. It has recently been observed in Hammer, Micheli, Sperduti et al. (2004) that Markovian representations of sequence data occur naturally in topographic maps governed by leaky integration, such as the temporal Kohonen map (Chappell & Taylor, 1993). Moreover, under some imposed circumstances, SOM for structured data (Hagenbuchner et al., 2003) can represent trees in a Markovian manner by emphasizing the topmost parts of the trees. These interesting findings were arrived at by studying pseudometrics in the data structure space induced by the maps. We complement the above results by studying the RecSOM map, potentially capable of complicated dynamic representations, as a nonautonomous dynamical system governed by a collection of fixed-input dynamics. Corollary 2 states that if parameter β, weighting the importance of importing the past information into processing of sequential data, is smaller than e/(2N) (N is the number of units on the map), the map is likely to be organized in a clear Markovian manner. The bound e/(2N) may seem rather restrictive, but as argued in Hammer, Micheli, Sperduti et al. (2004), the context influence has to be small for time-series data to avoid instabilities in the model. Indeed, the RecSOM experiments of Hammer, Micheli, Sperduti et al. (2004) (albeit on continuous data) used N = 10 × 10 = 100 units, and the map was trained with β = 0.06, which is only slightly higher than the bound e/(2N) = 0.0136.
Obviously, the bound e/(2N) can be improved by considering other model parameters (see corollary 1), as demonstrated in Figure 12. Theoretical results of section 3 and corollary 2 also complement Voegtlin's stability analysis of the weight adaptation process during training of RecSOM. For β < e/(2N), the stability of weight updates with respect to small perturbations of the activity profile y is ensured (Voegtlin, 2002). Voegtlin also shows, using Taylor expansion arguments, that if β < e/(2N), small perturbations of the activities will decay (fixed-input maps are locally contractive). Our work extends this result to perturbations of arbitrary

23 The results are shown for one particular initial condition y(0). Qualitatively similar diagrams were obtained for a variety of initial conditions y(0).
size.24 Based on our analysis, we conclude that for each RecSOM model satisfying Voegtlin's stability bound on β, the fixed-input dynamics for any input will be dominated by a unique attractive fixed point. This gives the map both Markovian quality and training stability. Finally, we note that it has been shown that representation capabilities of merge SOM (Strickert & Hammer, 2003) and SOM for structured data (Hagenbuchner et al., 2003) operating on sequences transcend those of finite-memory Markovian models in the sense that finite automata can be simulated (Strickert & Hammer, 2005; Hammer, Micheli, Sperduti et al., 2004). It was assumed that there is no topological ordering among the units of the map. Also, the proofs are constructive in nature, and it is not obvious that deeper-memory automata structures can actually be learned with Hebbian learning (Hammer, Micheli, Sperduti et al., 2004). It should be emphasized that the type of analysis presented in this article would not be feasible for the merge SOM and SOM for structured data models, since their fixed-input dynamics are governed by discontinuous mappings (due to discrete winner determination when calculating the context).25

5.5 Relation Between IFS+SOM and Recurrent SOM Models. In this section, we show that in the test mode (no learning), the IFS+SOM model acts exactly like the recurrent SOM (RSOM) model (Koskela et al., 1998). Given a sequence s1 s2 … over a finite alphabet A, the RSOM model determines the winner neuron at time t by identifying the neuron i with the minimal norm of

d_i(t) = ν (t_{s_t} − w_i) + (1 − ν) d_i(t − 1),   (5.5)
where 0 < ν < 1 is a parameter determining the rate of "forgetting the past," t_{s_t} is the code of symbol s_t presented at the RSOM input at time t, and w_i is the weight vector on the connections connecting the inputs with neuron i. Inputs x(t) feeding the standard SOM in the IFS+SOM model evolve with the IFS dynamics (see equations 4.5 and 4.6),

x(t) = k x(t − 1) + (1 − k) t_{s_t},   (5.6)
where 0 < k < 1 is the IFS contraction coefficient. The best-matching unit in the SOM is determined by finding the neuron i with the minimal norm of

D_i(t) = x(t) − w_i = k x(t − 1) + (1 − k) t_{s_t} − w_i.   (5.7)

24 We thank one of the anonymous reviewers who pointed this out.
25 This was pointed out by one of the anonymous reviewers.
But D_i(t − 1) = x(t − 1) − w_i, and so

D_i(t) = k D_i(t − 1) + (1 − k)(t_{s_t} − w_i),   (5.8)

which, after setting ν = 1 − k, leads to

D_i(t) = ν (t_{s_t} − w_i) + (1 − ν) D_i(t − 1).   (5.9)
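The equivalence of recursions 5.5 and 5.9 is easy to confirm numerically. The following sketch (our own illustration, with arbitrary symbol codes and weights) iterates both updates from matching initial conditions and compares the resulting difference vectors:

```python
import numpy as np

rng = np.random.default_rng(1)

d, n_units, T = 3, 5, 50
k = 0.7                  # IFS contraction coefficient
nu = 1.0 - k             # RSOM forgetting rate, nu = 1 - k

codes = rng.uniform(0.0, 1.0, (4, d))   # symbol codes t_s for a 4-letter alphabet
seq = rng.integers(0, 4, T)             # random symbol sequence s_1 s_2 ...
w = rng.uniform(0.0, 1.0, (n_units, d)) # weight vectors w_i

x = np.zeros(d)           # IFS state x(0)
d_i = x - w               # matching initial condition d_i(0) = x(0) - w_i
for st in seq:
    d_i = nu * (codes[st] - w) + (1.0 - nu) * d_i   # RSOM recursion (5.5)
    x = k * x + (1.0 - k) * codes[st]               # IFS dynamics (5.6)
D_i = x - w               # IFS+SOM difference vectors (5.7)

print(np.max(np.abs(d_i - D_i)))  # 0 up to floating-point error
```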
Provided ν = 1 − k, equations 5.5 and 5.9 are equivalent. The key difference between the RSOM and IFS+SOM models lies in the training process. While in RSOM the best-matching unit i with minimal norm of d_i(t) is shifted toward the current input t_{s_t}, in IFS+SOM the winner unit i with minimal norm of D_i(t) is shifted toward the (Markovian) IFS code x(t) coding the whole history of recently seen inputs.

6 Conclusion

We have rigorously analyzed a generalization of the self-organizing map (SOM) for processing sequential data, recursive SOM (RecSOM) (Voegtlin, 2002), as a nonautonomous dynamical system consisting of a set of fixed-input maps. We have argued and experimentally demonstrated that contractive fixed-input maps are likely to produce Markovian organizations of receptive fields on the RecSOM map. We have derived bounds on the parameter β, weighting the importance of importing the past information into processing of sequential data, that guarantee contractiveness of the fixed-input maps. Generalizations of SOM for sequential data, such as the temporal Kohonen map (Chappell & Taylor, 1993), recurrent SOM (Koskela et al., 1998), feedback SOM (Horio & Yamakawa, 2001), RecSOM (Voegtlin, 2002), and merge SOM (Strickert & Hammer, 2003), contain a dynamic module responsible for processing temporal contexts as an inherent part of the model. We have shown that Markovian topographic maps of sequential data can be produced by a simple fixed (nonadaptable) dynamic module externally feeding the topographic model. However, allowing trainable feedback connections does seem to benefit the map formation, even in the Markovian case: compared with topographic maps fed by the fixed dynamic module, RecSOM maps are capable of developing Markovian organizations of receptive fields with significantly superior memory depth and topography preservation.
We argue that non-Markovian organizations in topographic maps of sequential data may potentially be very important, but much more empirical and theoretical work is needed to clarify the map formation in SOMs endowed with feedback connections.
Acknowledgments I.F. and P.T. were supported by the Slovak Grant Agency for Science (#1/2045/05). J.M. was supported in part by the European Community’s Human Potential Programme under contract number HPRN-CT-200200319. We are thankful to the anonymous reviewers for suggestions that helped to improve presentation of the letter.
References

Barnsley, M. F. (1988). Fractals everywhere. New York: Academic Press.
Barreto, G. de A., Araújo, A., & Kremer, S. (2003). A taxonomy of spatiotemporal connectionist networks revisited: The unsupervised case. Neural Computation, 15(6), 1255–1320.
Chappell, G., & Taylor, J. (1993). The temporal Kohonen map. Neural Networks, 6, 441–445.
Fiser, A., Tusnady, G., & Simon, I. (1994). Chaos game representation of protein structures. Journal of Molecular Graphics, 12(4), 302–304.
Hagenbuchner, M., Sperduti, A., & Tsoi, A. (2003). Self-organizing map for adaptive processing of structured data. IEEE Transactions on Neural Networks, 14(3), 491–505.
Hammer, B., Micheli, A., Sperduti, A., & Strickert, M. (2004). Recursive self-organizing network models. Neural Networks, 17(8–9), 1061–1085.
Hammer, B., Micheli, A., Strickert, M., & Sperduti, A. (2004). A general framework for unsupervised processing of structured data. Neurocomputing, 57, 3–35.
Hammer, B., & Tiňo, P. (2003). Neural networks with small weights implement finite memory machines. Neural Computation, 15(8), 1897–1926.
Hao, B.-L. (2000). Fractals from genomes—exact solutions of a biology-inspired problem. Physica A, 282, 225–246.
Hao, B.-L., Lee, H., & Zhang, S. (2000). Fractals related to long DNA sequences and complete genomes. Chaos, Solitons and Fractals, 11, 825–836.
Horio, K., & Yamakawa, T. (2001). Feedback self-organizing map and its application to spatio-temporal pattern classification. International Journal of Computational Intelligence and Applications, 1(1), 1–18.
James, D., & Miikkulainen, R. (1995). SARDNET: A self-organizing feature map for sequences. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 577–584). San Francisco: Morgan Kaufmann.
Jeffrey, J. (1990). Chaos game representation of gene structure. Nucleic Acids Research, 18(8), 2163–2170.
Kenyon, R., & Peres, Y. (1996). Measures of full dimension on affine invariant sets. Ergodic Theory and Dynamical Systems, 16, 307–323.
Kohonen, T. (1982). Self-organizing formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464–1479.
Koskela, T., Varsta, M., Heikkonen, J., & Kaski, K. (1998). Recurrent SOM with local linear models in time series prediction. In 6th European Symposium on Artificial Neural Networks (pp. 167–172). Brussels: D-Facto Publications.
Lee, J., & Verleysen, M. (2002). Self-organizing maps with recursive neighborhood adaptation. Neural Networks, 15(8–9), 993–1003.
Oliver, J. L., Bernaola-Galván, P., Guerrero-Garcia, J., & Román-Roldán, R. (1993). Entropic profiles of DNA sequences through chaos-game-derived images. Journal of Theoretical Biology, 160(4), 457–470.
Principe, J., Euliano, N., & Garani, S. (2002). Principles and networks for self-organization in space-time. Neural Networks, 15(8–9), 1069–1083.
Román-Roldán, R., Bernaola-Galván, P., & Oliver, J. (1994). Entropic feature for sequence pattern through iteration function systems. Pattern Recognition Letters, 15, 567–573.
Schulz, R., & Reggia, J. (2004). Temporally asymmetric learning supports sequence processing in multi-winner self-organizing maps. Neural Computation, 16(3), 535–561.
Strickert, M., & Hammer, B. (2003). Neural gas for sequences. In T. Yamakawa (Ed.), Proceedings of the Workshop on Self-Organizing Maps (WSOM'03) (pp. 53–57). Kyushu: Kyushu Institute of Technology.
Strickert, M., & Hammer, B. (2005). Merge SOM for temporal data. Neurocomputing, 64, 39–72.
Tiňo, P. (2002). Multifractal properties of Hao's geometric representations of DNA sequences. Physica A: Statistical Mechanics and its Applications, 304(3–4), 480–494.
Tiňo, P., Čerňanský, M., & Beňušková, L. (2004). Markovian architectural bias of recurrent neural networks. IEEE Transactions on Neural Networks, 15(1), 6–15.
Tiňo, P., & Dorffner, G. (2001). Predicting the future of discrete sequences from fractal representations of the past. Machine Learning, 45(2), 187–218.
Tiňo, P., & Hammer, B. (2004). Architectural bias in recurrent neural networks: Fractal analysis. Neural Computation, 15(8), 1931–1957.
Tiňo, P., & Köteles, M. (1999). Extracting finite state representations from recurrent neural networks trained on chaotic symbolic sequences. IEEE Transactions on Neural Networks, 10(2), 284–302.
Voegtlin, T. (2002). Recursive self-organizing maps. Neural Networks, 15(8–9), 979–992.
Wiemer, J. (2003). The time-organized map algorithm: Extending the self-organizing map to spatiotemporal signals. Neural Computation, 16, 1143–1171.
Yin, H. (2002). ViSOM—a novel method for multivariate data projection and structure visualisation. IEEE Transactions on Neural Networks, 13(1), 237–243.
Received March 14, 2005; accepted March 29, 2006.
LETTER
Communicated by Tae-Hwy Lee
On the Consistency of the Blocked Neural Network Estimator in Time Series Analysis A. Sarishvili [email protected] Institute of Industrial Mathematics, Kaiserslautern, Germany
Ch. Andersson [email protected] ¨ Department of Economy, Statistics, and Computer Science, University of Orebro, Sweden
J. Franke [email protected] Department of Mathematics, University of Kaiserslautern, Germany
G. Kroisandt [email protected] Institute of Industrial Mathematics, Kaiserslautern, Germany
We describe a nonlinear regression problem where the regression functions have an additive structure and the dependent variable is a one-dimensional time series. Multivariate time series with unknown time delay operators are used as independent variables. By fitting a feedforward neural network with block structure to the data, we estimate the additive regression function and, parallel to this, the time lags. We present consistency proofs for the neural network weights estimator and the time lag estimator, independently of each other. In the practical part of the article, we present a useful feature of blocked neural networks: they estimate relevance measures of each input variable in a simple way. Furthermore, we propose an approach to solve the well-known variable selection problem for the class of nonlinear multivariate beta-mixing time series models considered here. Finally, we apply the methodology to an artificial example.

1 Introduction

Let Y_t ∈ R^d, t = 1, …, N be a d-dimensional time series and let the ith component y_t^i ∈ R, with i = 1, …, d and t = 1, …, N, have the representation

y_t^i = m_{i1}(y^1_{t−l_{i1}}) + m_{i2}(y^2_{t−l_{i2}}) + … + m_{id}(y^d_{t−l_{id}}) + ε_t,   (1.1)

Neural Computation 18, 2568–2581 (2006)
© 2006 Massachusetts Institute of Technology
The Blocked Neural Network
2569
where ε_t is independently and identically distributed (i.i.d.) N(0, σ²) with finite variance, and l_{ij} ∈ {1, 2, …, A_i} is the time lag with A_i ∈ N. The function m_{ij} describes the relationship between the two components y^i and y^j. In this article, we estimate m_{ij} nonparametrically by fitting feedforward neural networks with block structure to the data for each possible lag vector L_i = (l_{i1}, …, l_{id})′. Afterward, we minimize the average one-step forecasting error of the optimal network with respect to the lag vector L_i. This gives us an estimate of the network parameters, together with an estimate for the lag vector L_i. Furthermore, we compute a measure of relevancy of component j for component i, which helps us to identify the significant independent variables.

2 Neural Network Lag-Dependent Model Estimation

Our objective is to introduce a neural network–based method to determine an optimal subset of the stochastic regressors to fit the underlying regression model. For simplicity, we introduce the function g for the true underlying regression function:

g(Y_{t−L_i}) = m_{i1}(y^1_{t−l_{i1}}) + m_{i2}(y^2_{t−l_{i2}}) + … + m_{id}(y^d_{t−l_{id}}).
If we take into account that the composite function g(Y_{t−L_i}) is in particular a result of the addition of maximally d functions, each of one real variable, then the neural network without "nonparallel" connections seems to have the right architecture to estimate parameters of lag-dependent models. Figure 1 shows a neural network with d inputs, one output, and no nonparallel connections (a so-called blocked neural network). Each neuron in the hidden layer accepts only one variable as input, apart from the constant y0 = 1, also known as the bias of the neural network. The output of the blocked neural network is given by the following equation:

y_t^i = f_nn(y_t^1, …, y_t^d, θ) = b_0 + Σ_{j=1}^{H(1)} β_j φ(y_t^1 γ_j + b_j) + … + Σ_{j=H(d−1)+1}^{H(d)} β_j φ(y_t^d γ_j + b_j),   (2.1)

where the difference H(j) − H(j − 1) for j = 2, …, d is the number of neurons in block j, and H(d) is the total number of neurons in the hidden layer. β_j refers to the weights from the hidden layer to the output layer, γ_j to the weights from the input to the hidden layer, b_j is the constant term, and H is the set of all these 3 · H(d) + 1 neural network weights. φ(y) = (e^y − e^{−y})/(e^y + e^{−y}) is the sigmoid neuron activation function.
2570
A. Sarishvili, Ch. Andersson, J. Franke, and G. Kroisandt
Figure 1: A blocked neural network. Solid arrows denote one network connection, and the dotted arrows are the block of n(i) − 2 network connections, where n(i) is the number of neurons in the ith block.
A small modification of the theorem presented in Funahashi (1989) guarantees that such a neural network is able to estimate the composite function g(Y).

Theorem 1. Let φ(y) be a nonconstant, bounded, and monotonically increasing continuous function. Let K be a compact subset of R^d and fix an integer k ≥ 1. Then any continuous mapping F : K → R with F(y^1, …, y^d) = f_1(y^1) + … + f_d(y^d), which is a composition of functions f_1, …, f_d, where f_i : D(f_i) → R, i = 1, …, d, are continuous and K ⊂ D(f_i), can be approximated in the sense of the uniform topology on K by input-output mappings of k hidden-layer networks without nonparallel connections whose output functions for the hidden layers are φ(y) and for the input and output layers are linear.

3 Estimators for the Lag-Dependent Model

In this section we determine an estimate of the neural network parameters θ̂_{N,L_i} and lag vector L̂_i = (l̂_{i1}, l̂_{i2}, …, l̂_{id}), i = 1, …, d, and show their consistency. In the following, we fix the coordinate i; that is, we omit the dependence on i of the neural network weights and the estimated lags. First, we choose a so-called lag horizon A_i ∈ N, that is, the maximal time delay which is reasonable to take into consideration. This choice is based on the experience of the observer. The lag horizon A_i and the dimension of the time series, d, are both discrete variables. This fact allows us to find the
The Blocked Neural Network
2571
global minimum of the error function by training the neural network (A_i)^d times as follows: θ̂_{N,L_i} is the estimator of the network weights for a given lag vector L_i, that is,

θ̂_{N,L_i} = argmin_{θ ∈ H} D̄_N(L_i, θ),

where D̄_N(L_i, θ) denotes the criterion function

D̄_N(L_i, θ) = (1/N) Σ_{t=1}^{N} (y_t^i − f_nn(Y_{t−L_i}, θ))²,

where Y_{t−L_i} = (y^1_{t−l_{i1}}, …, y^d_{t−l_{id}})′. Afterward, we estimate the time lags L̂_i by

L̂_i = argmin_{L_i ∈ (A_i, d)} D̄_N(L_i, θ̂_{N,L_i}),

where (A_i, d) = {1, …, A_i} × … × {1, …, A_i} = {1, …, A_i}^d is the set of all possible time lags. We furthermore introduce the notation D_0(L_i, θ) as

D_0(L_i, θ) = E(y_t^i − f_nn(Y_{t−L_i}, θ))².

3.1 Consistency of the Network Weights Estimator. The consistency is well known for independent observations Y_t (see White, 1989) if we fix L_i. Since we have several possibilities for choosing L_i, we must formulate the restriction in White (1989) as follows:

Assumption 1. There exists a separable subset H(c) of H such that for any L_i ∈ (A_i, d), the criterion function D_0(L_i, θ) has a unique global minimum θ_{0,L_i} in H(c) = {θ ∈ H; ||θ − θ_{0,L_i}|| ≤ c} for any c > 0:

θ_{0,L_i} = argmin_{θ ∈ H} D_0(L_i, θ) = argmin_{θ ∈ H(c)} D_0(L_i, θ).
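The two-stage procedure — fit the weights for each candidate lag vector, then pick the lag vector with the smallest criterion value — can be sketched as follows. For brevity, a linear least-squares fit stands in for actual neural network training; the data-generating process and all names are our own toy example:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)

# toy bivariate series: y^1_t depends on y^1_{t-2} and y^2_{t-1} (true lags (2, 1))
T, A = 300, 3                       # sample size and lag horizon A_i
y = rng.normal(size=(T, 2))
for t in range(A, T):
    y[t, 0] = 0.8 * y[t - 2, 0] - 0.5 * y[t - 1, 1] + 0.1 * rng.normal()

def criterion(lags):
    """In-sample MSE after fitting the inner model for a given lag vector L_i
    (a linear least-squares fit stands in for neural network training here)."""
    X = np.column_stack([y[A - l: T - l, j] for j, l in enumerate(lags)]
                        + [np.ones(T - A)])
    theta, *_ = np.linalg.lstsq(X, y[A:, 0], rcond=None)
    return np.mean((y[A:, 0] - X @ theta) ** 2)

# outer loop: global search over all (A_i)^d candidate lag vectors
best_lags = min(product(range(1, A + 1), repeat=2), key=criterion)
print(best_lags)  # expected (2, 1) for the data-generating process above
```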
Remark 1. An alternative to the strategy of reducing the uniform convergence problem to a compact subset of the parameter space is to extend the neural network training procedure to a compactification of the parameter space. This route to the classical consistency proof of the time lag and weight parameters in a blocked neural network–based, lag-dependent model will not be pursued here.
Remark 2. As the neural network function f_nn(x, θ) depends on θ in a very smooth and simple manner, assumption 1 is not a severe restriction; it excludes only a few degenerate constellations of the random process and the network function. H(c) can be chosen independent of L_i, as we consider only finitely many values of L_i. The only problem is the interchangeability of the neurons within a block; that is, in the definition of H, we must introduce some restriction like γ_j ≤ γ_{j+1}, where H(i) ≤ j < H(i + 1) for some i.
To come from the independently and identically distributed (i.i.d.) case to a time series, we must impose some restrictions on the time series. We must mainly assume that there is ergodicity and some mixing condition:

Assumption 2.
1. Y_{t−L_i} and ε_t are independent of each other, and the (d + 1)-dimensional process (Y_{t−L_i}, ε_t) is strictly stationary and ergodic.
2. g is bounded and Y_t is β-mixing with exponentially decaying mixing coefficients (see Doukhan, 1994, and Bosq, 1996, for the definition of β-mixing and its equivalence to geometric ergodicity).

Remark 3. The mixing assumption is not unusual for nonlinear autoregressive processes. Franke, Kreiss, Mammen, and Neumann (2000) give simple sufficient conditions on the autoregression function g such that this assumption is satisfied. If we want to use other innovations ε, which are not normally distributed, then we must be able to use Bernstein's inequality for a mixing time series:

Assumption 3. The innovations ε_t are i.i.d. zero-mean random variables, for which all moments E|ε_t|^n, n = 1, 2, …, are finite and E|ε_t|^n ≤ c^{n−1} n! Eε_t² < +∞, for n = 3, 4, … and some c > 0.

In order to be able to use the consistency proof as in Pötscher and Prucha (1997), we must show the following:

Theorem 2. Under assumptions 1, 2, and 3,

sup_{θ ∈ H(c)} |D̄_N(L_i, θ) − D_0(L_i, θ)| → 0
in probability as N → ∞ for all L_i ∈ (A_i, d).

Proof. Using Bernstein's inequality for a mixing time series (see Doukhan, 1994) in conjunction with a simple truncation argument (which is possible by assumption 3), we have

sup_{θ ∈ H(c)} pr( |D̄_N(L_i, θ) − D_0(L_i, θ)| > δ ) → 0, as N → ∞,   (3.1)

for all δ > 0 and L_i ∈ (A_i, d) (compare Franke & Neumann, 2000, for a similar argument). So the only problem is to interchange the probability and the supremum. Let α > 0 be a small real number to be chosen later. Let θ_k, k = 1, …, h(α), be a set of vectors in H(c) such that for all θ ∈ H(c), there is a θ_k such that ||θ − θ_k|| ≤ α. Let

Δ_N(θ) = D̄_N(L_i, θ) − D_0(L_i, θ).

Then for arbitrary δ > 0,
pr( sup_{θ ∈ H(c)} |Δ_N(θ)| > δ )
  ≤ pr( sup_{θ ∈ H(c)} |Δ_N(θ)| > δ, sup_{||θ − θ_k|| ≤ α} |Δ_N(θ) − Δ_N(θ_k)| ≤ δ/2 ∀ k = 1, …, h(α) )
  + pr( sup_{||θ − θ_k|| ≤ α} |Δ_N(θ) − Δ_N(θ_k)| > δ/2, for at least one k ∈ {1, …, h(α)} ).   (3.2)
The first term on the right-hand side of equation 3.2 is bounded from above by

pr( sup_{k ≤ h(α)} |Δ_N(θ_k)| > δ/2 ) ≤ Σ_{k=1}^{h(α)} pr( |Δ_N(θ_k)| > δ/2 )
  ≤ h(α) · sup_{θ ∈ H(c)} pr( |Δ_N(θ)| > δ/2 ) → 0

for N → ∞ by equation 3.1.
For the second term on the right-hand side of equation 3.2, we have

|Δ_N(θ) − Δ_N(θ_k)| ≤ | (1/N) Σ_{t=1}^{N} { [y_t − f_nn(Y_{t−L_i}, θ)]² − [y_t − f_nn(Y_{t−L_i}, θ_k)]² } |
  + | E[y_t − f_nn(Y_{t−L_i}, θ)]² − E[y_t − f_nn(Y_{t−L_i}, θ_k)]² |.

Using the particular form of the network function f_nn, we have that the derivative of f_nn with regard to one of the parameters β_i is bounded, as the activation function φ(·) is bounded. The derivative with regard to one of the parameters γ_i or b_j is bounded because the derivative of our activation function is also bounded and the parameters β_i are out of a compact set. Using the mean value theorem, we obtain

| E[y_t − f_nn(Y_{t−L_i}, θ)]² − E[y_t − f_nn(Y_{t−L_i}, θ_k)]² | ≤ b · ||θ − θ_k||,

for a suitable constant b that is independent of δ and α. Analogously,

| (1/N) Σ_{t=1}^{N} { [y_t − f_nn(Y_{t−L_i}, θ)]² − [y_t − f_nn(Y_{t−L_i}, θ_k)]² } | ≤ B_N · ||θ − θ_k||,

where B_N is a nonnegative random variable with B_N → b in probability for N → ∞ by the law of large numbers for mixing time series. Therefore, the second term on the right-hand side of equation 3.2 is bounded by

pr( sup_{||θ − θ_k|| ≤ α} (B_N + b) · ||θ − θ_k|| > δ/2, for at least one k = 1, …, h(α) )
  ≤ pr( α · (B_N + b) > δ/2 ) = pr( B_N − b > δ/(2α) − 2b ) → 0,

if we choose α small enough such that δ/(2α) − 2b > 0.
Corollary 1. From theorem 2 and assumption 1, it follows immediately that θ̂_{N,L_i} is a uniquely identifiable sequence of minimizers of D̄_N(L_i, θ) for given L_i. Consistency of the least squares estimator θ̂_{N,L_i} for θ_{0,L_i} can now be inferred from the following lemma, applied to the compact subset H(c):
Lemma 1. Let D_N, D̄_N : R^d × H(c) → R be two arbitrary sequences of functions such that in probability,

sup_{θ ∈ H(c)} |D̄_N(L_i, θ) − D_0(L_i, θ)| → 0 as N → ∞.

Let θ̂_{N,L_i} be a uniquely identifiable sequence of minimizers of D̄_N(L_i, θ). Then for any sequence θ̄_{N,L_i} such that

D_0(L_i, θ̄_{N,L_i}) = inf_{θ ∈ H(c)} D_0(L_i, θ)

holds, we have ||θ̂_{N,L_i} − θ̄_{N,L_i}|| → 0 in probability as N → ∞.

The proof of this lemma is given in Pötscher and Prucha (1997).

3.2 Consistency of the Lag Estimator L̂_i. We need a similar assumption for the lag vector as assumption 1, but the space (A_i, d) is already compact:

Assumption 4. There is a unique L_i^0 such that with θ_0 ≡ θ_{0,L_i^0} the pair (L_i^0, θ_0) satisfies

(L_i^0, θ_0) = argmin_{θ ∈ H, L_i ∈ (A_i, d)} D_0(L_i, θ),

and for every L_i ≠ L_i^0:

D_0(L_i, θ_{0,L_i}) > D_0(L_i^0, θ_0).
Remark 4. Assumption 4 follows from

pr( g(y^1_{t−l_{i1}}, …, y^d_{t−l_{id}}) ≠ g(y^1_{t−λ_{i1}}, …, y^d_{t−λ_{id}}) ) > 0,

for all (l_{i1}, …, l_{id}) ≠ (λ_{i1}, …, λ_{id}). This is automatically satisfied if the innovations ε_t have a density which is positive everywhere. Now we can formulate the theorem about the consistency of the lag estimator:

Theorem 3.

L̂_i → L_i^0 in probability as N → ∞,

where

L̂_i = argmin_{L_i ∈ (A_i, d)} D̄_N(L_i, θ̂_{N,L_i}),
L_i^0 = argmin_{L_i ∈ (A_i, d)} D_0(L_i, θ_{0,L_i}).
Proof. For a fixed L_i, we know that the neural network weights are consistently estimated (see lemma 1), that is, as N → ∞,

θ̂_{N,L_i} → θ_{0,L_i} = argmin_{θ ∈ H(c)} E(y_t^i − f_nn(Y_{t−L_i}, θ))²
  = argmin_{θ ∈ H(c)} { E(g(Y_{t−L_i^0}) − f_nn(Y_{t−L_i}, θ))² + σ² },

and therefore we conclude that for the continuous function D̄_N,

D̄_N(L_i, θ̂_{N,L_i}) → D_0(L_i, θ_{0,L_i}) in probability as N → ∞.

Since the minimum L_i^0 of D_0 is unique and (A_i, d) has only finitely many elements,

L̂_i = argmin_{L_i ∈ (A_i, d)} D̄_N(L_i, θ̂_{N,L_i}) → argmin_{L_i ∈ (A_i, d)} D_0(L_i, θ_{0,L_i}) = L_i^0 in probability.

4 Example

Consider the following nonlinear additive dynamical system, which generates a two-dimensional time series,

y_t^1 = m_{11}(y^1_{t−2}) + m_{12}(y^2_{t−4}) + ε_t,  t ∈ [1, …, 240],
y_t^2 = m_{22}(y^2_{t−1}) + m_{21}(y^1_{t−2}) + ε_t,  t ∈ [1, …, 240],
where ε_t is i.i.d. N(0, 0.3)-distributed. We choose the following regression functions:

m_{11}(x) = 1/exp(x),  m_{12}(x) = 1.7 cos(x),
m_{22}(x) = 1/(0.2 exp(x)),  m_{21}(x) = 2 cos(x) + 0.3 sin(x).
The simulation of this time series with constant starting values is shown in Figure 2.
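A sketch of this simulation (our own code; we read N(0, 0.3) as variance 0.3, i.e., standard deviation √0.3, and use zero starting values for the constant initialization):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 240

# regression functions of the example
m11 = lambda x: 1.0 / np.exp(x)
m12 = lambda x: 1.7 * np.cos(x)
m22 = lambda x: 1.0 / (0.2 * np.exp(x))
m21 = lambda x: 2.0 * np.cos(x) + 0.3 * np.sin(x)

# constant (zero) starting values; indices 0..4 serve as warm-up
y1 = np.zeros(T + 1)
y2 = np.zeros(T + 1)
for t in range(4, T + 1):
    y1[t] = m11(y1[t - 2]) + m12(y2[t - 4]) + rng.normal(0.0, np.sqrt(0.3))
    y2[t] = m22(y2[t - 1]) + m21(y1[t - 2]) + rng.normal(0.0, np.sqrt(0.3))
```

The resulting pair of series (one draw of the training data) can then be plotted as in Figure 2.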
Figure 2: Simulated bivariate time series (training data set).
Figure 3: Neural network with two neurons in each block.
We train a blocked neural network with two neurons in each block, as presented in Figure 3. We choose the first variable as output and estimate simultaneously the prediction mean squared error (PMSE) on the validation data set (15% of the whole data set), shown in Figure 4. The network function, according to the definition of the blocked neural network function in equation 2.1, then has the form

f_nn(Y, θ) = b_0 + Σ_{j=1}^{2} β_j φ(b_j + γ_j y^1) + Σ_{j=3}^{4} β_j φ(b_j + γ_j y^2),

where y^j is the jth coordinate of Y. If φ is differentiable, as we always
[Plot: predicted vs. actual y1 over time; prediction MSE = 0.002]
Figure 4: Performance of BNN on the first variable y1 .
Figure 5: Relevance of y1 to y1 .
assume, the partial derivatives (PD) with regard to y^1 are calculated as follows:

∂f_nn(Y, θ)/∂y^1 = Σ_{j=1}^{2} β_j φ′(b_j + γ_j y^1) γ_j,
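This PD formula can be sanity-checked against a central finite difference. The weights below are random placeholders (in the article they would result from Levenberg-Marquardt training); note that, because of the block structure, the derivative depends on y^1 only:

```python
import numpy as np

rng = np.random.default_rng(5)

# hypothetical weights for the two-neurons-per-block network of Figure 3
b0 = 0.2
beta, gamma, b = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)

phi = np.tanh                            # the sigmoid activation from section 2
dphi = lambda u: 1.0 - np.tanh(u) ** 2   # phi'

def f_nn(y):
    return (b0 + np.sum(beta[:2] * phi(b[:2] + gamma[:2] * y[0]))
               + np.sum(beta[2:] * phi(b[2:] + gamma[2:] * y[1])))

def pd_y1(y):
    # partial derivative of f_nn with respect to y^1: only block 1 contributes
    return np.sum(beta[:2] * dphi(b[:2] + gamma[:2] * y[0]) * gamma[:2])

y = np.array([0.3, -0.7])
h = 1e-6
fd = (f_nn(y + [h, 0.0]) - f_nn(y - [h, 0.0])) / (2 * h)  # central difference
print(pd_y1(y), fd)  # the two agree up to O(h^2)
```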
Figure 6: Relevance of y2 to y1 .
[Plot: predicted vs. actual y2 over time; prediction MSE = 0.0018]
Figure 7: Performance of BNN on the second variable.
and analogously for the second component, y2 . Because of nonlinearity and the additive topology of the neural network, this measure of relevancy is a function of the corresponding component only and therefore could be plotted for further analysis. The plots of PD versus the input variable (PD plots) can be used to analyze the impact of corresponding input variables. For similar considerations of relevancy measures in the context of the fully
Figure 8: Relevance of y1 to y2 .
Figure 9: Relevance of y2 to y2 .
connected feedforward neural networks, see Refenes, Zapranis, and Utans (1996) and Sarishvili (2002). A large PD value indicates that the influence of the related input variable on the considered output value is strong: small changes of the input value will already cause large changes in the output value. The PD plots of each of the input variables to the output variable are shown in Figures 5 and 6. The time delay values of the model were correctly estimated.
The same plots were generated with the second variable as the output variable. The performance plot of the blocked neural network is shown in Figure 7, and the PD plots in Figures 8 and 9. The PD and the time delay value of y_2 to y_2 were poorly predicted because the true PD values of y_2 to y_2 are small, as shown in Figure 9. The blocked neural network was trained with the Levenberg-Marquardt algorithm.

5 Conclusion

We have used the framework of nonparametric regression, assuming an additive, block structure of the regression functions, to model a multivariate time series. Using a blocked feedforward neural network, we showed how both the regression functions and the vector of time lags can be estimated. Finally, we calculated a measure of relevancy. The major advantage of blocked neural networks over other nonlinear time-series approximators is the interpretability of the related partial derivatives and relevance measures with respect to the impact of the individual input variables. Plots of the partial derivatives can be used to estimate the functional influence of each input variable on the output.

References

Bosq, D. (1996). Nonparametric statistics for stochastic processes. Berlin: Springer-Verlag.

Doukhan, P. (1994). Mixing: Properties and examples. Berlin: Springer-Verlag.

Franke, J., Kreiss, J. P., Mammen, E., & Neumann, M. H. (2000). Properties of the nonparametric autoregressive bootstrap (Discussion Paper 54/98, SFB 373). Berlin: Humboldt University.

Franke, J., & Neumann, M. H. (2000). Bootstrapping neural networks. Neural Computation, 12, 1929–1949.

Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183–192.

Pötscher, B. M., & Prucha, I. R. (1997). Dynamic nonlinear econometric models: Asymptotic theory. Berlin: Springer-Verlag.

Refenes, P. N., Zapranis, A. D., & Utans, J. (1996). Neural model identification, variable selection and model adequacy. In A. Weigend et al. (Eds.), Neural networks in financial engineering: Proc. NNCM-96. Singapore: World Scientific.

Sarishvili, A. (2002). Neural network based lag selection for multivariate time series. Unpublished doctoral dissertation, Universität Kaiserslautern. Available online at: http://kluedo.ub.uni-kl.de/

White, H. (1989). Learning in artificial neural networks: A statistical perspective. Neural Computation, 1, 425–464.
Received March 29, 2005; accepted April 5, 2006.
Erratum

The article in the July 2006 issue of Neural Computation (Volume 18, Number 7, page 1637) reads:

"Representation and Timing in Theories of the Dopamine System"

Nathaniel D. Daw, UCL, Gatsby Computational Neuroscience Unit, London, WC1N 3AR, U.K.

Aaron C. Courville, Carnegie Mellon University, Robotics Institute and Center for the Neural Basis of Cognition, Pittsburgh, PA 15213, U.S.A.

David S. Tourtezky, Carnegie Mellon University, Computer Science Department and Center for the Neural Basis of Cognition, Pittsburgh, PA 15213, U.S.A.

"David S. Tourtezky" should read "David S. Touretzky".
NOTE
Communicated by Jonathan Victor
Spike Count Correlation Increases with Length of Time Interval in the Presence of Trial-to-Trial Variation

Robert E. Kass [email protected]
Valérie Ventura [email protected] Department of Statistics and Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.
It has been observed that spike count correlation between two simultaneously recorded neurons often increases with the length of time interval examined. Under simple assumptions that are roughly consistent with much experimental data, we show that this phenomenon may be explained as being due to excess trial-to-trial variation. The resulting formula for the correlation is able to predict the observed correlation of two neurons recorded from primary visual cortex as a function of interval length.

1 Introduction

Simultaneously recorded cortical neurons often exhibit correlations in spike counts over substantial periods of time, and this has been interpreted as producing important limitations on capacities for neural coding (Shadlen & Newsome, 1998; Zohary, Shadlen, & Newsome, 1994). However, it has also been observed that spike count correlations decrease as the length of the time interval decreases (Averbeck & Lee, 2003; Reich, Mechler, & Victor, 2001). We provide here a simple theoretical explanation of this phenomenon as a necessary consequence of excess trial-to-trial variation. The purposes of this work are two: first, to emphasize the importance of specifying interval length when interpreting spike count correlation, and, second, to focus further attention on excess trial-to-trial variation as an indicator of common neuronal input.

2 Results

Our object is to analyze the correlation of a pair of spike counts across repeated trials in the presence of excess trial-to-trial variability, under reasonable assumptions, when the measurement interval—and therefore each expected count—increases. Let us consider random variables Y_r1 and Y_r2 representing theoretical spike counts over an interval of length T for two

Neural Computation 18, 2583–2591 (2006)
© 2006 Massachusetts Institute of Technology
neurons recorded simultaneously on trial r. To simplify some formulas, we assume the two spike count probability distributions are the same, but this does not affect the essence of the result. We will also assume the following:

1. Within trials, the expected spike counts increase proportionally to T.
2. The within-trial variance is proportional to the within-trial expectation.
3. After conditioning on the trial, Y_r1 and Y_r2 are independent.

Assumptions 1 and 2 are roughly consistent with many observed data (see Shadlen & Newsome, 1998, for references). Assumption 1 concerns the within-trial expected spike counts. In the absence of excess trial-to-trial variation, the within-trial expected spike count would equal the trial-averaged spike count. When there is excess trial-to-trial variation, the neuronal response depends on some external or internal state S_r that varies with the trial. The within-trial expected spike count is the number that would be produced by, hypothetically, averaging spike counts over trials with identical values of the state S_r. Assumption 2 is much more general than the Poisson assumption, which would require the within-trial variance to equal the within-trial expectation. Assumption 3 eliminates short-timescale effects and will be discussed below. We also introduce a random variable X_r to represent excess trial-to-trial variation and will take the expectation of each spike count on trial r to be f(X_r) when T = 1, for some function f(x) (see equation 2.1 below). In the absence of excess trial-to-trial variation, f(X_r) would be constant across trials.

2.1 Monotonically Increasing Correlation. Under these three assumptions, the correlation of Y_r1 and Y_r2 and the conditional expectation and variance of Y_ri given the trial may be computed in terms of T. Before proceeding, we make two observations. First, under assumption 1, we may write the expectation conditionally on the trial effect X_r in the form

$$E(Y_{ri} \mid X_r) = T f(X_r), \tag{2.1}$$
where f(X_r) becomes the expected spike count when T = 1. Second, under assumption 2, we may write V(Y_{ri} | X_r) = k · E(Y_{ri} | X_r), and combining this with equation 2.1, we have

$$V(Y_{ri} \mid X_r) = kT f(X_r). \tag{2.2}$$
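The setup described by assumptions 1 through 3 can also be seen in a direct simulation; the following sketch (illustrative rate and variation values, a hand-rolled Poisson sampler, k = 1) shows the spike count correlation induced by a shared trial effect growing with T:

```python
import math
import random

def poisson(lam, rng):
    # Knuth's multiplicative method; adequate for the moderate rates used here
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def correlation(pairs):
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    return sxy / (sx * sy)

def spike_count_corr(T, n_trials=10000, c=20.0, b=0.125, seed=1):
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_trials):
        f = c * math.exp(rng.gauss(0.0, b))  # shared trial effect f(X_r)
        # given the trial, the two counts are independent Poisson (assumption 3)
        pairs.append((poisson(T * f, rng), poisson(T * f, rng)))
    return correlation(pairs)

# correlation is near zero for short intervals and grows with T
assert spike_count_corr(0.05) < spike_count_corr(1.0)
```

All rates and the lognormal trial effect here are hypothetical choices made for the demonstration; the qualitative behavior follows from the assumptions above, not from these particular values.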
Trial-to-Trial Variation Can Explain Spike Count Variation
In the case of Poisson counts, we would have k = 1. Underdispersion occurs when k < 1 and overdispersion when k > 1. In computing the correlation of Y_r1 and Y_r2 we will use equations 2.1 and 2.2 together with elementary formulas for the variances and covariance in terms of the conditional variances, covariance, and expectations. For i = 1, 2, we have

$$V(Y_{ri}) = E(V(Y_{ri} \mid X_r)) + V(E(Y_{ri} \mid X_r)) = E(kT f(X_r)) + V(T f(X_r)) = kT\,E(f(X_r)) + T^2\,V(f(X_r)),$$

and, applying assumption 3 and equation 2.1,

$$\mathrm{COV}(Y_{r1}, Y_{r2}) = E(\mathrm{COV}(Y_{r1}, Y_{r2} \mid X_r)) + \mathrm{COV}(E(Y_{r1} \mid X_r), E(Y_{r2} \mid X_r)) = 0 + T^2\,V(f(X_r)).$$

Writing µ = E(f(X_r)) and σ² = V(f(X_r)), we therefore obtain

$$\mathrm{COR}(Y_{r1}, Y_{r2}) = \frac{T^2\sigma^2}{kT\mu + T^2\sigma^2} = \frac{T}{T + \omega}, \tag{2.3}$$

where ω = kµ/σ². This shows that the correlation will increase monotonically as T increases and will vanish as T → 0. To see the implication of equation 2.3, suppose that the trial-to-trial variation takes the form

$$E(Y_{ri} \mid X_r) = T f(X_r) = T c\, e^{X_r},$$

where c is the firing rate when T = 1 and X_r = 0, and, for simplicity, suppose further that X_r has a normal distribution. It is easily verified that if X_r has mean a and variance b², then

$$E(e^{X_r}) = e^{a + b^2/2}$$

and

$$V(e^{X_r}) = e^{2a + b^2}(e^{b^2} - 1).$$

These give the ratio

$$\frac{V(e^{X_r})}{E(e^{X_r})} = e^{a + b^2/2}(e^{b^2} - 1).$$
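Plugging the lognormal moments into equation 2.3 is straightforward; a small numerical sketch (assuming Poisson counts, k = 1, and T measured in seconds) for the scenario with firing rate e^a = 20 spikes per second and b = 12.5%:

```python
import math

# Illustrative scenario: f(X_r) = exp(X_r), X_r ~ N(a, b^2), Poisson counts
k = 1.0
a = math.log(20.0)  # firing rate e^a = 20 spikes per second at T = 1
b = 0.125           # 12.5% trial-to-trial variation

mu = math.exp(a + b * b / 2)                                # E(e^{X_r})
sigma2 = math.exp(2 * a + b * b) * (math.exp(b * b) - 1.0)  # V(e^{X_r})
omega = k * mu / sigma2

def cor(T):
    # equation 2.3, with T in seconds
    return T / (T + omega)

# intervals of 2 ms, 100 ms, and 1000 ms give roughly .0006, .03, and .24
print([round(cor(T), 4) for T in (0.002, 0.1, 1.0)])
```

These are the same values quoted in the example below, so the sketch doubles as a check on the algebra.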
Using this formula, we may compute the correlation in equation 2.3 for various scenarios. For example, with a firing rate of c = e^a = 20 spikes per second and b = 12.5% trial-to-trial variation, correlations for counts in intervals of length 2 ms, 100 ms, and 1000 ms become .0006, .03, and .24, which are roughly consistent with those reported by Reich et al. (2001). We do not mean to suggest that trial-to-trial variation may be described well by normally distributed effects that are constant in time (see below). These values are provided, rather, to help interpret the predictions of equation 2.3.

2.2 Nonmonotonic Correlation. According to equation 2.3, the spike count correlation will increase monotonically and, furthermore, will approach 1 for sufficiently long time intervals. However, Averbeck and Lee (2003) report an increase of correlation as a function of T up to a maximum, and then a subsequent decline. We now show that such effects could also be due to excess trial-to-trial variability. Suppose that the excess trial-to-trial variation is as described previously up to T_1, but that it disappears afterward. Such effects have been reported previously (e.g., Baker, Spinks, Jackson, & Lemon, 2001). Under assumptions 1, 2, and 3, when T ≤ T_1, equation 2.3 still applies. For T > T_1, we write Y_{ri} = Y_{ri}^1 + Y_{ri}^2, where Y_{ri}^1 is the spike count in [0, T_1] and Y_{ri}^2 the spike count in [T_1, T], for neuron i on trial r. Because the spike counts Y_{r1}^2 and Y_{r2}^2 in [T_1, T] contain no excess trial-to-trial variation, they are mutually independent and are also independent of the spike counts Y_{r1}^1 and Y_{r2}^1 in [0, T_1]. Then, for T > T_1, we have

$$V(Y_{ri}) = V(Y_{ri}^1) + V(Y_{ri}^2) = kT_1\mu + T_1^2\sigma^2 + k(T - T_1)$$

and

$$\mathrm{COV}(Y_{r1}, Y_{r2}) = \mathrm{COV}(Y_{r1}^1, Y_{r2}^1) = T_1^2\sigma^2,$$

so that, for T > T_1,

$$\mathrm{COR}(Y_{r1}, Y_{r2}) = \frac{T_1}{T_1 + \omega + \dfrac{k}{T_1\sigma^2}(T - T_1)}. \tag{2.4}$$
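Equations 2.3 and 2.4 together define a correlation curve that can be evaluated directly; the sketch below (hypothetical values for k, µ, σ², and T_1) confirms that the curve rises until T = T_1 and declines afterward:

```python
# Piecewise correlation curve: equation 2.3 for T <= T1, equation 2.4 for
# T > T1; the parameter values are illustrative, not fitted to any data.
k, mu, sigma2, T1 = 1.0, 20.0, 6.4, 0.2
omega = k * mu / sigma2

def cor(T):
    if T <= T1:
        return T / (T + omega)                                 # equation 2.3
    return T1 / (T1 + omega + (k / (T1 * sigma2)) * (T - T1))  # equation 2.4

Ts = [i / 100 for i in range(1, 101)]
vals = [cor(T) for T in Ts]
peak_T = Ts[vals.index(max(vals))]
assert abs(peak_T - T1) < 1e-9  # maximum correlation at interval length T1
```

Note that the two branches agree at T = T_1 (equation 2.4 reduces to T_1/(T_1 + ω) there), so the curve is continuous.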
Under these conditions, the correlation will increase with T for T < T_1 and will decrease with T for T > T_1. The modified assumption that trial-to-trial variability vanishes for T > T_1 is not intended to reflect a real situation accurately. Rather, we have provided equation 2.4 to indicate possible nonmonotonic behavior. Other, more realistic forms of time-varying trial-to-trial effects could also produce correlations that are nonmonotonic in T. One relatively simple alternative
form of time-varying trial-to-trial variation is given, and its predictions are compared with data below.

2.3 Illustration with V1 Data. We illustrate using data from two neurons recorded simultaneously in the primary visual cortex of an anesthetized macaque monkey (Aronov, Reich, Mechler, & Victor, 2003; units 380506.s and 380506.u, 90 degree spatial phase), which were part of the Reich et al. (2001) study. Figures 1B and 1C show their peristimulus time histograms (PSTHs). Ventura, Cai, and Kass (2005) established that these two neurons had excess trial-to-trial variation, whose effects were shared across the neurons (see also Figure 1D), but that the neurons were independent once these effects were removed. Figure 1A displays the correlation of spike counts for increasing time intervals. The data for these two neurons were recorded from the same electrode, with an accuracy of 2.8 milliseconds, so that it was impossible to detect joint spikes occurring at time lags less than 2.8 milliseconds. This induces an artifactual negative correlation, clearly apparent in Figure 1A for the smallest time interval. A fit of equation 2.3 to the data is overlaid in Figure 1A. It captures reasonably well the general trend of the correlation. However, it fails to follow a leveling off evident at intervals greater than about 200 milliseconds, which may be due to nonconstant trial-to-trial variation. Indeed, Ventura et al. (2005) showed that this pair of neurons had highly significant nonconstant trial-to-trial effects and that the firing rate of neuron i on trial r could be described better by the function

$$P_r^i(t) = P^i(t)\, e^{w_{0r}\,\varphi(t)}, \tag{2.5}$$

with P^i(t) being the average firing rate of neuron i over many trials, and w_{0r}φ(t) being a nonconstant contribution shared across the two neurons. According to equation 2.5, the excess trial-to-trial variation is due to trial-specific effects w_{0r} that modulate the function φ(t). We could not produce analytical results, like those of equation 2.3, for the model of equation 2.5 because it is too complicated. Instead, we have calculated the predicted correlation curve by numerical simulation. We simulated 1000 pairs of spike trains from model 2.5 fitted to the data, which we used to compute the correlation as a function of interval length T.¹ We also adjusted the correlation function for the recording accuracy of 2.8 msec.² Figures 1B and 1C

¹ Specifically, we sampled with replacement 1000 values w*_{0r} from the histogram in Figure 1F and then simulated pairs of Poisson spike trains with rates P^i(t) e^{w*_{0r} φ(t)}, i = 1, 2, with P^i(t) and φ(t) shown in Figures 1B, 1C, and 1E.
² We identified all the occurrences of simultaneous spikes within 2.8 msec and, for each occurrence, retained the spike of only one neuron, chosen with probability proportional to its firing rate.
[Figure 1 about here. Panel A: correlation vs. interval length T (ms), with curves for the correlation predicted by equation 2.3, by equation 2.5, and by equation 2.5 with 2.8 ms recording accuracy. Panels B and C: firing rate (Hz) vs. time (ms) for neurons 1 and 2, with the fit of the model in equation 2.5. Panel D: total spike counts of neuron 2 vs. neuron 1. Panel E: φ(t) vs. time (ms). Panel F: histogram of w_0. Full caption below.]
show estimates of P^i(t), i = 1, 2, taken to be the smoothed PSTHs, Figure 1E displays a fit of the function φ(t), and Figure 1F the histogram of the fitted trial-specific effect coefficients w_{0r}. In Figure 1A, the correlation predicted by equation 2.5 appears as the dashed curve, and the correlation after adjustment for the recording accuracy of 2.8 msec appears as the large-dashed curve. While equation 2.5 is itself a simplified representation of excess trial-to-trial variation and should not be expected to fit the data perfectly, these curves track the observed correlation quite well.

3 Discussion

Trial-to-trial variation is of interest not only for its physiological significance (Azouz & Gray, 1999; Hanes & Schall, 1996), but also because it confounds assessments of correlation (Brody, 1999a, 1999b; Ben-Shaul, Bergman, Ritov, & Abeles, 2001; Grün, Riehle, & Diesmann, 2003). We have shown here that excess trial-to-trial variation produces spike count correlations that vary with the length of the time interval during which the counts are recorded. This follows, essentially, from assumptions 1 and 2 by formulas 2.1 and 2.2. When assumption 3 holds, monotonicity holds throughout the interval during which there is excess trial-to-trial variation and, furthermore, the correlation vanishes as T → 0. We have also noted that when the excess trial-to-trial variation disappears after time T_1, the spike count correlation will decline after it reaches a maximum at an interval of length T_1. This behavior conforms to observations reported in Averbeck and Lee (2003). When there is correlation in the spike timing so that assumption 3 fails, it would be reasonable to assume, analogous to assumption 2, that the within-trial covariance between the two neurons' spike counts is proportional to
Figure 1: (A) Correlation between the spike counts for two neurons in primary visual cortex, as a function of interval length T. For each T, we plotted the box plot (quartiles and 10th and 90th percentiles) of the correlations obtained by sliding the interval along experimental time. The solid curve is equation 2.3 with ω = 34. The dashed curve is the correlation function predicted by model 2.5 fitted to the data; the large-dashed curve is also for data as in equation 2.5 but with a recording accuracy of 2.8 milliseconds to match the observed data. The activity of the two neurons was recorded during 64 trials from an anesthetized monkey; the stimulus in each trial was a standing sinusoidal grating that appeared at time 0 and disappeared at 237 ms. (B, C) Raw and smoothed PSTHs, P^i(t), i = 1, 2, of the two neurons. (D) Within-trial spike counts for the complete interval of observation, which suggests that the neurons have shared effects of trial-to-trial variation. (E) The fitted firing rate modulating function φ(t) and (F) a histogram of the coefficients w_{0r}.
the within-trial expectation. We write

$$\mathrm{COV}(Y_{r1}, Y_{r2} \mid X_r) = cT\, f(X_r). \tag{3.1}$$
In this case, monotonicity holds as long as c < k, and it is easy to show that COR(Y_1, Y_2) → c/k as T → 0. Note that in the absence of trial-to-trial variation, f(X_r) becomes a constant, and under assumption 3, the spike counts are uncorrelated. If assumption 3 fails but equation 3.1 holds, then in the absence of trial-to-trial variation, the correlation becomes constant and does not increase with the length of the time interval. Under assumptions 1 and 2, an increase in spike count correlation with the length of the time interval, as in Figure 1A, is an indication of excess trial-to-trial variation that is shared across the two neurons. (For related methods and additional analyses of these data, see Ventura et al., 2005.)

We have offered our analysis in the usual spirit of those made with simplifying assumptions. We would not expect excess trial-to-trial variation to be summarized accurately by a single number, here represented as X_r. Ventura et al. (2005) have shown how somewhat more complicated phenomena involving trial-to-trial variability may be described. However, we would expect equations 2.3 and 2.4 to capture dominant effects, as illustrated in Figure 1A, and to provide insight into the possible origin of widely observed correlations.

References

Aronov, D., Reich, D. S., Mechler, F., & Victor, J. D. (2003). Neural coding of spatial phase in V1 of the macaque monkey. J. Neurophysiol., 89, 3304–3327.

Averbeck, B. B., & Lee, D. (2003). Neural noise and movement-related codes in the macaque supplementary motor area. J. Neurosci., 23, 7630–7641.

Azouz, R., & Gray, C. M. (1999). Cellular mechanisms contributing to response variability of cortical neurons in vivo. J. Neurosci., 19, 2209–2223.

Baker, S. N., Spinks, R., Jackson, A., & Lemon, R. N. (2001). Synchronization in monkey motor cortex during a precision grip task. I. Task-dependent modulation in single-unit synchrony. J. Neurophysiol., 85, 869–885.

Ben-Shaul, Y., Bergman, H., Ritov, Y., & Abeles, M. (2001). Trial to trial variability in either stimulus or action causes apparent correlation and synchrony in neuronal activity. J. Neurosci. Methods, 111, 99–110.

Brody, C. D. (1999a). Correlations without synchrony. Neural Computation, 11, 1537–1551.

Brody, C. D. (1999b). Disambiguating different covariation types. Neural Computation, 11, 1527–1535.

Grün, S., Riehle, A., & Diesmann, M. (2003). Effect of across trial nonstationarity on joint-spike events. Biol. Cybernetics, 88, 335–351.

Hanes, D. P., & Schall, J. D. (1996). Neural control of voluntary movement initiation. Science, 274, 427–430.
Reich, D. S., Mechler, F., & Victor, J. D. (2001). Independent and redundant information in nearby cortical neurons. Science, 294, 2566–2568.

Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18(10), 3870–3896.

Ventura, V., Cai, C., & Kass, R. E. (2005). Trial-to-trial variability and its effect on time-varying dependence between two neurons. J. Neurophysiology, 94, 2928–2939.

Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370, 140–143.
Received March 22, 2005; accepted April 14, 2006.
LETTER
Communicated by Terrence Sejnowski
The Spike-Triggered Average of the Integrate-and-Fire Cell Driven by Gaussian White Noise Liam Paninski [email protected] Department of Statistics, New York University, New York, NY 10027, U.S.A.
We compute the exact spike-triggered average (STA) of the voltage for the nonleaky integrate-and-fire (IF) cell in continuous time, driven by gaussian white noise. The computation is based on techniques from the theory of renewal processes and continuous-time hidden Markov processes (e.g., the backward and forward Fokker-Planck partial differential equations associated with first-passage time densities). From the STA voltage, it is straightforward to derive the STA input current. The theory also gives an explicit asymptotic approximation for the STA of the leaky IF cell, valid in the low-noise regime σ → 0. We consider both the STA and the conditional average voltage given an observed spike “doublet” event, that is, two spikes separated by some fixed period of silence. In each case, we find that the STA as a function of time-preceding-spike, τ , has a square root singularity as τ approaches zero from below and scales linearly with the scale of injected noise current. We close by briefly examining the discrete-time case, where similar phenomena are observed. 1 Introduction The spike-triggered average (STA) (de Boer & Kuyper, 1968; Bryant & Segundo, 1976; Chichilnisky, 2001) is an easily measured experimental quantity defined as the conditional average stimulus to a cell, given that the cell has just emitted an action potential. Thus, this average quantity summarizes, in a sense, what stimulus led to a spike, and as such has taken on some importance in studies of neural coding (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Simoncelli, Paninski, Pillow, & Schwartz, 2004) and Hebbian models of short-term synaptic plasticity (Dayan & Abbott, 2001). Computing this quantity for model neurons, in turn, has led to some insight into the coding properties of these models. 
For example, for the linear-nonlinear-Poisson (LNP) cascade model (Simoncelli et al., 2004), the STA turns out to be closely associated with the linear filter of the cell (the "L" stage of the model) (Bussgang, 1952; Chichilnisky, 2001; Paninski, 2003, 2004), allowing for straightforward estimation of the model parameters via simple STA-based computations.

Neural Computation 18, 2592–2616 (2006)

For the linear-nonlinear model
© 2006 Massachusetts Institute of Technology
with multiplicative history effects (Berry & Meister, 1998), the STA is perturbed in an easily characterized fashion by these history terms (Paninski, 2003; Aguera y Arcas & Fairhall, 2003). Here we consider the linear integrate-and-fire (IF) neuron driven by white gaussian noise of scale σ and mean µ; in this model, the voltage V satisfies the stochastic differential equation, dVt = µdt + It ,
(1.1)
with I_t a white noise process of scale σ (that is, I_t = σ dB_t, with B_t a standard Brownian motion); the cell spikes and V is reset to some value V_r upon each threshold crossing, V(t) = V_th, where V_th > V_r. This is the most common base model for stochastic neuronal responses and has proven useful in a wide variety of contexts (Koch, 1999; Gerstner & Kistler, 2002; Paninski, Lau, & Reyes, 2003; Pillow, Paninski, Uzzell, Simoncelli, & Chichilnisky, 2005). Thus, it is worthwhile to examine its properties in analytical detail where possible. Computing the STA for the integrate-and-fire cell has proven somewhat more complex than in the simpler LNP case described above, basically because the IF cell has a more complex history dependence. Some preliminary approximate analysis of this problem appeared in Gerstner (2001) and Kanev, Wenning, and Obermayer (2004). More recently, Badel, Richardson, and Gerstner (2005) presented some exact asymptotic results based in part on large-deviations approximations (Freidlin & Wentzell, 1984; Kautz, 1988; see also Paninski, in press, for a recent application to the IF model) and in part on the theory of partial differential (Fokker-Planck) equations associated with Brownian motion. Here we show that it is possible to give relatively simple exact (nonasymptotic) formulas for the STA of the nonleaky IF cell. In addition, these nonasymptotic results lead fairly naturally to exact asymptotic results that hold slightly more generally. Our results thus complement those of Badel et al. (2005), who considered more general versions of the basic IF model but give only asymptotic results. This article is organized as follows. Section 2 contains our main result: here we explicitly calculate the conditional average voltage and input current of the nonleaky IF cell in continuous time given an observed spike "doublet" event, that is, two spikes separated by some fixed period of silence.
This result has a natural extension to an approximate solution for the leaky case (see section 2.1); this approximation may be shown to be exact in the small noise limit σ → 0, via comparison with the large-deviation results (Paninski, in press; Badel et al., 2005). In section 3 we use this doublettriggered average and some basic renewal theory to compute the exact STA. We discuss an alternate approach in the discrete-time setting in section 4
and summarize a few salient points about the form of the STA, and generalizations to other IF-based models, in section 5. 2 The Doublet-Triggered Average To compute the STA, we first compute the doublet-triggered density, P(V(t)|s[t1 , t2 ] = {t1 , t2 }), the probability density of V(t) given that s[t1 , t2 ], the observed spike data in the interval [t1 , t2 ], consisted of spikes at times t1 and t2 , with no spikes observed at times t ∈ (t1 , t2 ). Once we have calculated the corresponding doublet-triggered expected voltage, E(V(t)|s[t1 , t2 ] = {t1 , t2 }), we will use equation 1.1 to recover the doublet-triggered expected current. (Of course, the doublet-triggered voltage density P(V(t)|s[t1 , t2 ] = {t1 , t2 }) is of independent interest; de Ruyter & Bialek, 1988; Paninski, in press; Badel et al., 2005.) We emphasize that V(t) here is assumed to follow the nonleaky noisy dynamics 1.1; we address the leaky case below, in section 2.1. To save on notation, we will fix t1 = 0 in this section (without loss of generality). The doublet-triggered density P(V(t)|s[0, t2 ] = {0, t2 }) is given by a normalized product of two terms, P(V(t)|s[0, t2 ] = {0, t2 }) =
$$\frac{1}{Z(t)}\, P_f(V, t)\, P_b(V, t), \qquad t \in (0, t_2). \tag{2.1}$$
This follows from the fact that the integrate-and-fire cell is a special case of a continuous-time hidden Markov model (HMM): V acts as the "hidden" variable, which evolves according to Markovian dynamics, and the presence or absence of a spike in time bin t is the observed variable, which is dependent (in this case deterministically) only on V(t) at the single time point t. Thus, we may adapt the existing methods for computing (and sampling from) the conditional density of the hidden variable of an HMM, conditioned on its beginning and end states, to this special IF model case. (See, e.g., Rabiner, 1989, and Harvey, 1991, for further detail.) We start by defining P_f and P_b; this is a simple matter of some manipulations with Bayes' rule. For all 0 < t < t_2, we have

$$
\begin{aligned}
P(V(t)\mid s[0,t_2]=\{0,t_2\})
&= \frac{P(s[0,t_2]=\{0,t_2\}\mid V(t))\,P(V(t))}{P(s[0,t_2]=\{0,t_2\})} \\
&= \frac{P(s[0,t]=\{0\}\mid V(t))\,P(s[t,t_2]=\{t_2\}\mid V(t))\,P(V(t))}{P(s[0,t_2]=\{0,t_2\})} \\
&= \frac{P(s[0,t]=\{0\})}{P(s[0,t_2]=\{0,t_2\})}\, P(V(t)\mid s[0,t]=\{0\})\,P(s[t,t_2]=\{t_2\}\mid V(t)) \\
&\equiv \frac{1}{Z(t)}\, P_f(V,t)\, P_b(V,t),
\end{aligned}
$$
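The forward-backward factorization of equation 2.1 is easiest to see in a discretized toy version of the problem; the sketch below uses a hypothetical drift-up random walk in place of the diffusion (an illustrative kernel, not the model of equation 1.1) and combines P_f and P_b computed by dynamic programming:

```python
# Discrete-time, discrete-state illustration of equation 2.1: a drifting
# random walk V (a hypothetical stand-in for the diffusion) is conditioned
# on a "spike" (threshold hit) at step n, with silence at steps 1..n-1.
V_TH, V_R, n = 2, 0, 12
states = list(range(-10, V_TH + 1))  # paths falling off the grid are dropped

def step_probs(v):
    # hypothetical transition kernel with upward drift
    return {v: 0.4, v + 1: 0.4, v - 1: 0.2}

# forward: P_f(v, t) ~ P(V_t = v, no threshold hit in (0, t] | V_0 = V_R)
f = [{v: 0.0 for v in states} for _ in range(n + 1)]
f[0][V_R] = 1.0
for t in range(1, n + 1):
    for v, pv in f[t - 1].items():
        if pv == 0.0 or v == V_TH:
            continue  # mass that reached threshold has spiked and is removed
        for w, p in step_probs(v).items():
            if w in f[t]:
                f[t][w] += pv * p

# backward: P_b(v, t) ~ P(silence in (t, n), V_n = V_TH | V_t = v)
b = [{v: 0.0 for v in states} for _ in range(n + 1)]
b[n][V_TH] = 1.0
for t in range(n - 1, -1, -1):
    for v in states:
        if v != V_TH:  # sitting at threshold before n would be an early spike
            b[t][v] = sum(p * b[t + 1].get(w, 0.0)
                          for w, p in step_probs(v).items())

# combine: P(V_t = v | spike at n, silence before) = P_f * P_b / Z(t)
for t in range(1, n):
    Z = sum(f[t][v] * b[t][v] for v in states)
    posterior = {v: f[t][v] * b[t][v] / Z for v in states}
    assert abs(sum(posterior.values()) - 1.0) < 1e-12
    assert posterior[V_TH] == 0.0  # no intermediate spikes under the conditioning
```

The same two-pass structure carries over to the continuous case, with the dynamic-programming recursions replaced by the forward and backward partial differential equations below.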
where the second equality reflects the conditional independence of s([0, t]) and s((t, t_2)) given V(t) and the last equality is a definition. The ratio P(s[0, t] = {0})/P(s[0, t_2] = {0, t_2}), which is constant in V, may be taken as a normalization factor that ensures the conditional V-probability integrates to one. It is well known that the forward term solves the Fokker-Planck (forward) equation (Karlin & Taylor, 1981; Tuckwell, 1988; Risken, 1996; Brunel & Hakim, 1999; Haskell, Nykamp, & Tranchina, 2001),

$$\frac{\partial P_f(V,t)}{\partial t} = \frac{\sigma^2}{2}\frac{\partial^2 P_f(V,t)}{\partial V^2} - \mu\frac{\partial P_f(V,t)}{\partial V},$$

with boundary conditions

$$P_f(V_{th}, t) = 0 \quad \forall t \in [0, t_2]$$

and

$$P_f(V, 0) = \delta(V - V_r).$$

This may be solved explicitly via the method of images (Daniels, 1982) as

$$P_f(V,t) = \mathcal{N}(V_r + \mu t,\, \sigma^2 t) - e^{2\mu(V_{th}-V_r)/\sigma^2}\, \mathcal{N}(2V_{th} - V_r + \mu t,\, \sigma^2 t), \qquad V \le V_{th},\ t > 0,$$

where N(µ, σ²) = N(µ, σ²)(V) denotes the gaussian kernel of mean µ and scale σ. The backward term, on the other hand, solves the Kolmogorov backward equation (Karlin & Taylor, 1981),

$$\frac{\partial P_b(V,t)}{\partial t} = -\frac{\sigma^2}{2}\frac{\partial^2 P_b(V,t)}{\partial V^2} - \mu\frac{\partial P_b(V,t)}{\partial V},$$

with boundary conditions

$$P_b(V_{th}, t) = 0 \quad \forall t \in [0, t_2]$$
and

$$P_b(V, t_2) = \delta(V - V_{th}).$$

Solving for P_b is slightly more delicate; if we try to solve the backward equation directly, starting at t = t_2 and using the method of images to propagate the solution backward in time, all the mass is absorbed at V_th immediately. Thus, a more indirect, limiting argument is required. We start with the exact solution to the backward equation started at V(t_2) = V_th − ε, ε > 0,

$$P_b(V,t) = \mathcal{N}(V_{th} - \epsilon - \mu(t_2 - t),\, \sigma^2(t_2 - t)) - e^{-2\mu\epsilon/\sigma^2}\, \mathcal{N}(V_{th} + \epsilon - \mu(t_2 - t),\, \sigma^2(t_2 - t)), \qquad V \le V_{th},\ t < t_2,$$

and then take the (normalized) limit as ε → 0. Abbreviating t_2 − t = w, we have

$$
\begin{aligned}
P_b(V,t) &= c(t)\left[\exp\left(-\frac{(V_{th} - \epsilon - \mu w - V)^2}{2\sigma^2 w}\right) - e^{-2\mu\epsilon/\sigma^2}\exp\left(-\frac{(V_{th} + \epsilon - \mu w - V)^2}{2\sigma^2 w}\right)\right] \\
&= c(t)\exp\left(-\frac{(V_{th} - \mu w - V)^2}{2\sigma^2 w}\right)\left[\exp\left(\frac{2\epsilon(V_{th} - \mu w - V)}{2\sigma^2 w}\right) - \exp\left(-\frac{2\mu\epsilon}{\sigma^2} - \frac{2\epsilon(V_{th} - \mu w - V)}{2\sigma^2 w}\right)\right] + o(\epsilon) \\
&= c(t)\exp\left(-\frac{(V_{th} - \mu w - V)^2}{2\sigma^2 w}\right)\left[\frac{2\epsilon(V_{th} - \mu w - V)}{2\sigma^2 w} + \frac{2\mu\epsilon}{\sigma^2} + \frac{2\epsilon(V_{th} - \mu w - V)}{2\sigma^2 w}\right] + o(\epsilon) \\
&= c(t)\exp\left(-\frac{(V_{th} - \mu w - V)^2}{2\sigma^2 w}\right)\left[V_{th} - \mu w - V + \mu w\right] + o(\epsilon) \\
&= c(t)\exp\left(-\frac{(V_{th} - \mu w - V)^2}{2\sigma^2 w}\right)\left[V_{th} - V\right] + o(\epsilon),
\end{aligned}
$$

where the c(t) above represents an irrelevant normalization factor, constant in V (and absorbing the factors of ε); thus, we arrive at

$$P_b(V,t) = c(t)\,[V_{th} - V]\,\mathcal{N}(V_{th} - \mu(t_2 - t),\, \sigma^2(t_2 - t)), \qquad V \le V_{th}.$$
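The image solution for P_f can be verified numerically; a brief sketch (parameters matching Figure 1 of the text) checks the absorbing boundary condition P_f(V_th, t) = 0 and nonnegativity below threshold:

```python
import math

V_th, V_r, sigma2, mu = 1.0, 0.0, 1.0, 1.0  # parameters as in Figure 1

def gauss(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def P_f(V, t):
    # method-of-images solution for the nonleaky IF forward density
    image_weight = math.exp(2 * mu * (V_th - V_r) / sigma2)
    return (gauss(V, V_r + mu * t, sigma2 * t)
            - image_weight * gauss(V, 2 * V_th - V_r + mu * t, sigma2 * t))

for t in (0.1, 0.3, 1.0):
    assert abs(P_f(V_th, t)) < 1e-9  # absorbing boundary: P_f(V_th, t) = 0
    assert all(P_f(V_th - 0.05 * j, t) >= 0.0 for j in range(1, 60))
```

The cancellation at V = V_th is exact by construction (the image term is weighted precisely so the two gaussians coincide on the boundary), which the assertion confirms up to floating-point error.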
Now we may solve for P(V(t)|s[0, t_2] = {0, t_2}) (and therefore any conditional moment of V(t), such as the mean voltage given the observed data s[0, t_2]) simply by plugging the above formulas for P_b and P_f into equation 2.1 and normalizing. This normalization, in turn, requires that we compute integrals of the truncated gaussian distribution,

$$\mathcal{N}^+(m, V)(x) = e(-m/\sqrt{V})^{-1}\, \mathcal{N}(m, V)(x), \qquad x > 0,$$

with

$$e(x) = \int_x^\infty \mathcal{N}(0,1)(u)\, du.$$
We will need the first and second moments of this distribution,

$$\langle x \rangle_{m,V} = m + \frac{\sqrt{2V/\pi}}{\mathrm{erfcx}(-m/\sqrt{2V})}$$

and

$$\langle x^2 \rangle_{m,V} = m^2 + V + \frac{m\sqrt{2V/\pi}}{\mathrm{erfcx}(-m/\sqrt{2V})},$$

with erfcx(.) denoting the scaled complementary error function,

$$\mathrm{erfcx}(x) = \frac{2}{\sqrt{\pi}}\, e^{x^2} \int_x^\infty e^{-t^2}\, dt.$$
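These moment formulas are easy to check against brute-force integration; a sketch using erfcx(x) = e^{x²} erfc(x) built from Python's math.erfc (adequate for the moderate arguments used here):

```python
import math

def erfcx(x):
    # scaled complementary error function; safe for moderate |x|
    return math.exp(x * x) * math.erfc(x)

def mean_trunc(m, V):
    # first moment of N(m, V) restricted to x > 0, per the formula above
    return m + math.sqrt(2 * V / math.pi) / erfcx(-m / math.sqrt(2 * V))

def second_moment_trunc(m, V):
    # second moment, per the formula above
    return m * m + V + m * math.sqrt(2 * V / math.pi) / erfcx(-m / math.sqrt(2 * V))

def moments_numeric(m, V, dx=1e-4):
    # brute-force midpoint integration of the truncated gaussian on (0, m + 10 sd)
    num1 = num2 = den = 0.0
    x = dx / 2
    while x < m + 10 * math.sqrt(V):
        p = math.exp(-(x - m) ** 2 / (2 * V))
        num1 += x * p * dx
        num2 += x * x * p * dx
        den += p * dx
        x += dx
    return num1 / den, num2 / den

m1, m2 = moments_numeric(0.5, 1.0)
assert abs(mean_trunc(0.5, 1.0) - m1) < 1e-4
assert abs(second_moment_trunc(0.5, 1.0) - m2) < 1e-3
```

(The test values m = 0.5, V = 1 are arbitrary; any moderate mean and variance work equally well.)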
After one last change of variables, we have

$$P(V(t)\mid s[0,t_2]=\{0,t_2\}) = \frac{V_{th} - V}{z(t)}\left[\mathcal{N}\!\left(V_r + \frac{(V_{th}-V_r)t}{t_2},\, q(t)\right) - \mathcal{N}\!\left(2V_{th} - V_r + \frac{(V_r-V_{th})t}{t_2},\, q(t)\right)\right]$$

for V < V_th, with the variance term q(t) = σ²(t^{−1} + (t_2 − t)^{−1})^{−1} and the normalization

$$z(t) = e(-y(t)/\sqrt{q(t)})\,\langle x \rangle_{y(t),q(t)} - e(y(t)/\sqrt{q(t)})\,\langle x \rangle_{-y(t),q(t)},$$
2598
L. Paninski
Figure 1: The densities $P_f(V,t)$, $P_b(V,t)$, and $P(V(t)|s[0,t_2]=\{0,t_2\}) = P_f(V,t)P_b(V,t)/z(t)$, for $t = 0.3$; $V_{th} = 1$, $V_r = 0$, $\sigma^2 = 1$, $\mu = 1$, $t_1 = 0$, $t_2 = 1$.
where we have abbreviated $y(t) = (V_{th}-V_r)(1 - t/t_2)$. See Figures 1 and 2 for an illustration of the three densities $P_f$, $P_b$, and $P = P_f P_b/z$. Now, finally, we may read off our main result:
$$E(V(t)\,|\,s[0,t_2]=\{0,t_2\}) = V_{th} - z(t)^{-1}\big[e\big({-y(t)}/{\sqrt{q(t)}}\big)\langle x^2\rangle_{y(t),q(t)} - e\big({y(t)}/{\sqrt{q(t)}}\big)\langle x^2\rangle_{-y(t),q(t)}\big],$$
for $t \in [0, t_2]$. We will abbreviate this solution as $S_{t_1,t_2}(t)$, that is,
$$S_{t_1,t_2}(t) \equiv E(V(t)\,|\,s[t_1,t_2]=\{t_1,t_2\}), \quad t \in [t_1, t_2],$$
for use below. At this point, it is worth pausing to note a few salient properties of this doublet-triggered average. First, $S_{t_1,t_2}(t)$ behaves as
$$S_{t_1,t_2}(t) = V_{th} - \sigma\sqrt{8/\pi}\,\sqrt{t_2 - t} + o\big(\sqrt{t_2-t}\big)$$
as $t \uparrow t_2$. (See Figures 2 and 3.) This square-root behavior at $t_2$ was also noted by Badel et al. (2005).
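The closed form above is straightforward to evaluate. The following is our own self-contained sketch (default parameters chosen to match Figure 1); note that $\mu$ does not enter, and that the square-root approach to $V_{th}$ is visible near $t_2$.

```python
import math

def e_tail(x):
    # e(x) = int_x^inf N(0,1)(u) du, the standard normal upper-tail probability
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def erfcx(x):
    # scaled complementary error function; fine for moderate arguments
    return math.exp(x * x) * math.erfc(x)

def moments(m, V):
    # <x>_{m,V} and <x^2>_{m,V} of N(m, V) restricted to x > 0 (see text)
    lam = math.sqrt(2.0 * V / math.pi) / erfcx(-m / math.sqrt(2.0 * V))
    return m + lam, m * m + V + m * lam

def doublet_triggered_mean(t, t2, Vth=1.0, Vr=0.0, sigma2=1.0):
    # closed-form S_{0,t2}(t) = E(V(t) | spikes at 0 and t2); mu is absent
    q = sigma2 / (1.0 / t + 1.0 / (t2 - t))
    y = (Vth - Vr) * (1.0 - t / t2)
    a = y / math.sqrt(q)
    x1p, x2p = moments(y, q)
    x1m, x2m = moments(-y, q)
    z = e_tail(-a) * x1p - e_tail(a) * x1m
    return Vth - (e_tail(-a) * x2p - e_tail(a) * x2m) / z
```

Near $t_2$ the gap $V_{th} - S_{0,t_2}(t)$ should track $\sigma\sqrt{(8/\pi)(t_2-t)}$, and near $t_1$ the solution returns to $V_r$.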
Figure 2: The doublet-triggered average, $S_{0,1}(t)$. Parameters as in Figure 1. Panels 1–3 show the evolution of the densities $P_f(t)$, $P_b(t)$, and $P(V(t)|s[0,t_2]=\{0,t_2\})$, for $t \in [0,1]$; grayscale level indicates height of density. Panel 4 shows some samples (gray traces) from the conditional voltage path distribution given spikes at $t_1 = 0$ and $t_2 = 1$ (see the appendix for a brief description of the exact sampling procedure), with the empirical mean given 100 samples shown in black. The bottom panel shows the most likely path (dotted trace), the analytical doublet-triggered average (dashed), and the empirical doublet-triggered average (solid).
Second, in the low-noise regime, $\sigma \to 0$, the doublet-triggered average converges uniformly to the most likely (ML) voltage path (see Figure 3), which in this case takes the simple linear form (Paninski, in press)
$$V_{ML}(t) = \frac{1}{t_2 - t_1}\big(V_r(t_2 - t) + V_{th}(t - t_1)\big), \quad t \in [t_1, t_2].$$
(Curves shown: the doublet-triggered average for $\sigma = 0.2$, $1.0$, and $1.5$, and the ML path, $\sigma = 0$.)
Figure 3: Effects of varying $\sigma$ on $S_{t_1,t_2}(t)$. Note the convergence to the most likely voltage path (the linear dotted voltage trace) as $\sigma \to 0$. Also note that at sufficiently high noise levels, the doublet-triggered average voltage is actually hyperpolarized below $V_r$ due to the "killing" effect of the absorbing boundary at $V_{th}$, as described in the text.
This result is consistent with basic results from the theory of large deviations (Freidlin & Wentzell, 1984; Paninski, in press), which indicate that this most likely path will dominate expectations as $\sigma \to 0$. Thus, the doublet-triggered average may be roughly described as a straight line between $V_r$ and $V_{th}$, minus a sag of size proportional to $\sigma$; this sag, in turn, behaves as a square root as $t \to t_2$, and is due to the fact that voltage paths that happen to be depolarized by noise above threshold are "killed" by the absorbing boundary at $V_{th}$. Finally, due to some cancellations, $\mu$ does not appear in the above expressions; thus, the doublet-triggered average is (somewhat surprisingly) independent of the mean input current $\mu$. (As we will see in section 3, the STA is dependent on $\mu$; this dependence enters strictly through the $\mu$-dependence of the IF interspike interval density.) To get a better sense of why $\mu$ drops out here, it is enlightening to take an alternate approach, based on the Brownian bridge, which is defined (Karlin & Taylor, 1981; Karatzas & Shreve, 1997) as the stochastic process formed by conditioning Brownian motion (with or without drift) on its start and end points. To see the relevance of the Brownian bridge here, we may consider the doublet-triggered density in two steps. First, we condition $V(t)$ to end at $V(t_2) = V_{th}$. This gives us a Brownian bridge started at $(0, V_r)$ and ended at $(t_2, V_{th})$; note, importantly, that this is the point at which the dependence on $\mu$ drops out
(since the Brownian bridge has no dependence on the drift $\mu$ of the original Brownian motion). Then we condition further, imposing the inequality constraints $V(t) < V_{th}$, $0 < t < t_2$, to obtain the doublet-triggered distribution; since $\mu$ plays no role in the Brownian bridge process, $\mu$ can play no role in this further conditioned process either. (The relevant computations for this conditioned Brownian bridge may be carried out explicitly, using a form of the method of images for the Brownian bridge; the final result is the same, so we omit the details.) This alternate approach also explains the form of the variance term $q(t)$, which exactly matches the variance of the Brownian bridge (Karlin & Taylor, 1981). With the doublet-triggered average voltage in hand, it is straightforward to derive the doublet-triggered average current from the IF dynamics, equation 1.1; we may simply write
$$E(I(t)\,|\,s[t_1,t_2]=\{t_1,t_2\}) = \frac{\partial}{\partial t}S_{t_1,t_2}(t) - \mu.$$
(The careful reader will note that we have been rather blithe about an interchange between a derivative and an expectation here; we will discuss this further in section 4.) This doublet-triggered average current diverges as $(t_2-t)^{-1/2}$ as $t \to t_2$; this has an interesting effect on the discretized STA for current, which appears not to converge to any physically reasonable limit as the time discretization $dt$ goes to zero (see section 4 below).

2.1 The Leaky Case. The above results suggest a simple approximation for the leaky case, $dV_t = (\mu - gV_t)dt + I_t$. The forward equation in this case becomes
$$\frac{\partial P_f(V,t)}{\partial t} = \frac{\sigma^2}{2}\frac{\partial^2 P_f(V,t)}{\partial V^2} - \mu\frac{\partial P_f(V,t)}{\partial V} + g\frac{\partial [P_f(V,t)V]}{\partial V},$$
with boundary conditions as above,
$$P_f(V_{th}, t) = 0 \quad \forall t \in [0, t_2]$$
and $P_f(V, 0) = \delta(V - V_r)$.
In this case, $P_f$ satisfies the renewal equation (Karlin & Taylor, 1981; Plesser & Tanaka, 1997; Burkitt & Clark, 1999; Paninski, Haith, Pillow, & Williams, 2005),
$$P_f(V,t) = P_{V_r,0}(V,t) - \int_0^t p_1(s)\,P_{V_{th},s}(V,t)\,ds,$$
with $p_1(t)$ denoting the first-passage density,
$$p_1(t) = \frac{\partial}{\partial t}\left(1 - \int_{-\infty}^{V_{th}} P_f(V,t)\,dV\right) = -\frac{\partial}{\partial t}\int_{-\infty}^{V_{th}} P_f(V,t)\,dV,$$
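Evaluated at $V = V_{th}$, where $P_f$ vanishes, this renewal relation becomes a Volterra integral equation that determines $p_1$ numerically, with the "free" transition density (defined next in the text) as its kernel. The following midpoint discretization is our own crude sketch, not the paper's numerical method, and all parameter values are chosen purely for illustration:

```python
import math

def free_ou(V, t, V0, mu, g, sigma2):
    # "free" (no-threshold) OU transition density P_{V0,0}(V, t):
    # mean mu/g + (V0 - mu/g) e^{-g t}, variance sigma^2 (1 - e^{-2 g t}) / (2 g)
    m = mu / g + (V0 - mu / g) * math.exp(-g * t)
    v = sigma2 * (1.0 - math.exp(-2.0 * g * t)) / (2.0 * g)
    return math.exp(-(V - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

def first_passage_density(n, dt, Vth, Vr, mu, g, sigma2):
    """p1 on the midpoint grid s_j = (j + 1/2) dt, from the renewal equation at
    V = Vth (where P_f = 0): P_{Vr,0}(Vth, t) = int_0^t p1(s) P_{Vth,s}(Vth, t) ds."""
    p1 = []
    for k in range(1, n + 1):
        t = k * dt
        rhs = free_ou(Vth, t, Vr, mu, g, sigma2)
        acc = sum(pj * free_ou(Vth, t - (j + 0.5) * dt, Vth, mu, g, sigma2) * dt
                  for j, pj in enumerate(p1))
        # the kernel behaves like 1/sqrt(2 pi sigma^2 (t - s)) as s -> t, so the
        # diagonal cell is integrated analytically rather than by the midpoint rule
        diag = math.sqrt(2.0 * dt / (math.pi * sigma2))
        p1.append(max((rhs - acc) / diag, 0.0))
    return p1
```

With a very small leak the result should approach the inverse gaussian first-passage density of the non-leaky model quoted in section 3.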
and $P_{x,s}(V,t)$ denoting the "free" solution to the forward equation, that is, the (uniquely well-behaved) solution to the forward equation in the absence of the threshold boundary condition, for example,
$$P_{V_r,0}(V,t) = N\Big(V_r + \big(\tfrac{\mu}{g} - V_r\big)\big(1 - e^{-gt}\big),\ \tfrac{\sigma^2}{2g}\big(1 - e^{-2gt}\big)\Big).$$
No elementary analytical solution for $P_f$ is available in the leaky case to our knowledge; instead, we simply neglect the second term in the above renewal expression for $P_f$ and approximate $P_f(V,t) \approx P_{V_r,0}(V,t)$. This is accurate as $\sigma \to 0$, if $\mu/g < V_{th}$. For the backward equation, we replace our analytical solution above with
$$P_b(V,t) \approx c(t)\,[V_{th}-V]\,N\Big(\tfrac{\mu}{g} + \big(V_{th} - \tfrac{\mu}{g}\big)e^{g(t_2-t)},\ \tfrac{\sigma^2}{2g}\big(e^{2g(t_2-t)} - 1\big)\Big), \quad V \le V_{th},$$
which again makes use of the free solution to the backward equation. The corresponding approximation to the doublet-triggered average, formed by plugging the above approximations to $P_f$ and $P_b$ into equation 2.1, is crude but nonetheless asymptotically correct as $\sigma \to 0$: this approximate doublet-triggered average, $\tilde S_{0,t_2}(t)$, behaves like
$$\tilde S^{\sigma\to 0}_{0,t_2}(t) = \frac{\frac{\sigma^2}{2g}\big(1 - e^{-2gt}\big)\Big[\big(V_{th} - \frac{\mu}{g}\big)e^{g(t_2-t)} + \frac{\mu}{g}\Big] + \frac{\sigma^2}{2g}\big(e^{2g(t_2-t)} - 1\big)\Big[V_r + \big(\frac{\mu}{g} - V_r\big)\big(1 - e^{-gt}\big)\Big]}{\frac{\sigma^2}{2g}\big(1 - e^{-2gt}\big) + \frac{\sigma^2}{2g}\big(e^{2g(t_2-t)} - 1\big)}$$
as $\sigma \to 0$. Some algebra reduces this to
$$\tilde S^{\sigma\to 0}_{0,t_2}(t) = \frac{\mu}{g} + a\,e^{-gt} + b\,e^{gt},$$
with $a$, $b$ suitably chosen constants in $t$. Since $\tilde S^{\sigma\to 0}_{0,t_2}(t)$ uniquely satisfies the second-order differential equation
$$\frac{\partial^2}{\partial t^2}\tilde S^{\sigma\to 0}_{0,t_2}(t) = -g\big({-g}\,\tilde S^{\sigma\to 0}_{0,t_2}(t) + \mu\big) = g^2\Big(\tilde S^{\sigma\to 0}_{0,t_2}(t) - \frac{\mu}{g}\Big),$$
with boundary conditions $\tilde S^{\sigma\to 0}_{0,t_2}(0) = V_r$ and $\tilde S^{\sigma\to 0}_{0,t_2}(t_2) = V_{th}$, $\tilde S^{\sigma\to 0}_{0,t_2}(t)$ corresponds exactly to the ML voltage path, which dominates the true doublet-triggered average in the limit $\sigma \to 0$ (Paninski, in press), as discussed above. See Figure 4 for some examples of this approximation. The doublet-triggered average current is only slightly more complex in this case, since we have to include the effect of the leak, in particular, the mean leak current,
$$E(-gV(t)\,|\,s[0,t_2]=\{0,t_2\}) = -g\,E(V(t)\,|\,s[0,t_2]=\{0,t_2\}) = -g\,S_{t_1,t_2}(t).$$
Our approximation in this case thus takes the form
$$E(I(t)\,|\,s[t_1,t_2]=\{t_1,t_2\}) \approx \frac{\partial}{\partial t}\tilde S_{t_1,t_2}(t) - \mu + g\,\tilde S_{t_1,t_2}(t).$$
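Since the $\sigma \to 0$ expression reduces to $\mu/g + a e^{-gt} + b e^{gt}$ with the stated boundary conditions, the limiting (ML) path can be obtained by solving a 2x2 linear system for $a$ and $b$. A small sketch of ours (function name and default parameters, which match one panel of Figure 4, are illustrative only):

```python
import math

def ml_path_leaky(t, t2, Vth=1.0, Vr=0.0, mu=3.0, g=2.0):
    # sigma -> 0 limit of the approximate doublet-triggered average:
    # mu/g + a e^{-g t} + b e^{g t}, with a, b fixed by S(0) = Vr, S(t2) = Vth.
    # Solve: a + b = Vr - mu/g;  a e^{-g t2} + b e^{g t2} = Vth - mu/g.
    c0 = Vr - mu / g
    c2 = Vth - mu / g
    em, ep = math.exp(-g * t2), math.exp(g * t2)
    b = (c2 - c0 * em) / (ep - em)
    a = c0 - b
    return mu / g + a * math.exp(-g * t) + b * math.exp(g * t)
```

A finite-difference check confirms that this path satisfies the second-order equation above with the required boundary values.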
3 The Spike-Triggered Average

Given the doublet-triggered distributions $P(V(t)|s[t_1,t_2])$, it is straightforward to obtain the full STA (and more generally, the full distribution of $V(s)$ at any time $s$, given a spike at time $t$). We make use of the renewal representation of the spike times in this model: note that for $\mu > 0$, the IF model represents a stationary stochastic process. Moreover, since $V(t)$ is strong Markov, the sequence of spike times is a renewal process; as is well known, it is straightforward to calculate the interspike interval density for every order (e.g., via the reflection principle for Brownian motion, coupled with the Girsanov formula; Karatzas & Shreve, 1997). The density of the interval between a given spike and the $i$th following spike is given by the inverse gaussian density (Seshadri, 1993):
$$p_i(t) = \frac{i(V_{th}-V_r)}{\sqrt{2\pi\sigma^2 t^3}}\,e^{-(i(V_{th}-V_r)-\mu t)^2/2\sigma^2 t} = p_1 *^{i-1} p_1,$$
(Panels of Figure 4, top to bottom: $\mu = 0$, $\sigma = 1.0$, $g = 2$; $\mu = 0$, $\sigma = 1.0$, $g = 5$; $\mu = 3$, $\sigma = 1.0$, $g = 2$; $\mu = 3$, $\sigma = 0.3$, $g = 2$; $\mu = 0$, $\sigma = 3.0$, $g = 2$.)
Figure 4: Some approximate doublet-triggered averages in the leaky case. The approximation is fairly accurate in general and is exact as $\sigma \to 0$. The approximation fails in the large-$\sigma$ case, when the free approximation to the forward solution fails (that is, when a nonnegligible amount of probability mass crosses threshold before $t_2$). Conventions are as in the bottom panel of Figure 2: dashed curves are analytical approximations to the doublet-triggered averages, dotted curves are ML paths, and solid curves are empirical doublet-triggered averages (based on 1000 samples from the true conditional voltage distribution given $s[0,t_2]$).
where $*^i$ denotes the $i$-fold convolution. Define
$$S_1^+(t) = \int_t^\infty p_1(s)\,S_{0,s}(t)\,ds, \quad t > 0,$$
and
$$S_1^-(t) = \int_{-\infty}^{-t} p_1(-s)\,S_{s,0}(t)\,ds, \quad t > 0.$$
These are the doublet-triggered averages $S_{0,t_2}$ averaged over the next and previous interspike intervals, respectively. Denote the firing rate function (Rudd & Brown, 1997) of the IF cell as
$$f(t) = \sum_{i=1}^{\infty} p_i(t);$$
this is the expected firing rate of the cell at time $t$ given a spike at time 0. Then the spike-triggered average voltage for positive times $t$ is given by the convolution
$$STA(t) = \big(f(t) + \delta(t)\big) * S_1^+(t) = S_1^+(t) + \int_0^t f(t-s)\,S_1^+(s)\,ds, \quad t > 0,$$
and similarly for negative times,
$$STA(-t) = \big(f(t) + \delta(t)\big) * S_1^-(t) = S_1^-(t) + \int_0^t f(t-s)\,S_1^-(s)\,ds, \quad t > 0.$$
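For concreteness (our own illustration, with arbitrary parameter values), the interspike-interval densities and the firing rate function are easy to evaluate directly:

```python
import math

def p_i(t, i, Vth=1.0, Vr=0.0, mu=2.0, sigma2=1.0):
    # inverse gaussian density of the interval to the i-th following spike
    d = i * (Vth - Vr)
    return d / math.sqrt(2.0 * math.pi * sigma2 * t ** 3) * \
        math.exp(-(d - mu * t) ** 2 / (2.0 * sigma2 * t))

def firing_rate(t, imax=50):
    # f(t) = sum_i p_i(t): expected firing rate at time t given a spike at 0
    return sum(p_i(t, i) for i in range(1, imax + 1))
```

Two sanity checks: $f(t)$ settles to the mean rate $\mu/(V_{th}-V_r)$ for large $t$ (the renewal theorem), and the convolution identity $p_2 = p_1 * p_1$ can be verified by direct numerical integration.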
The spike-triggered average current and distributions $P(V(t)|s[0]=\{0\})$ follow similarly. See Figure 5 for a few examples of the STA. The leaky case follows exactly the same route, with the exception (again) that no explicit analytical solution is known for $p_1(t)$ or $f(t)$ in the leaky case. But the spikes from the leaky IF (LIF) cell are still a renewal process, and the STA can still be written in the convolution form given above. It is worth pointing out that a simpler approach suffices for positive times. In this case, we can form the spike-triggered distributions $P(V(t)|s[0]=\{0\})$ directly, without going through the intermediate step of computing the doublet-triggered distributions. By the usual renewal argument, we have that
$$P(V(t)\,|\,s[0]=\{0\}) = P_f(V,t) + \int_0^t P_f(V,t-s)\,f(s)\,ds, \quad t > 0,$$
from which we may read off the expectation to obtain the STA. From this, it is easy to see that the STA may be approximated for small, positive times $t$ with the simple linear form
$$STA(t) = V_r + (\mu - gV_r)\,t + o(t)$$
(Panels of Figure 5, top to bottom: $\mu = 2$ with $\sigma = 0.2$, $0.4$, $1.0$, $1.5$.)
Figure 5: A few examples of the spike-triggered average voltage, for different values of $\sigma$ (in each case, the leak $g = 0$). Black trace is the analytical STA; gray trace (mostly obscured by the black trace) is the empirical STA, given 2000 seconds of simulated data. In the low-noise regime $\sigma \to 0$ (top panel), an oscillatory ringing is visible, at a frequency approaching the firing rate of the cell in the absence of noise, $\mu/(V_{th}-V_r)$ (2 Hz in this case); the square-root singularity as $t \to 0^-$ becomes more visible as $\sigma$ increases.
(in contrast with the square root singularity as $t$ approaches zero from the left).

4 Discrete-Time Approach

In the previous sections, we computed the spike-triggered average voltage in continuous time. Deriving the spike-triggered average current from the average voltage may be done formally by a simple interchange of the derivative in equation 1.1 and the expectation taken when forming the STA. However, justifying this interchange rigorously runs into the usual issues associated with white noise in continuous time (specifically, the fact that white noise is only defined as a measure on a space of generalized functions; Hida, 1980) and would take us slightly afield. Instead, we discuss here a more direct, rigorous calculation of the spike-triggered average current in discrete time and point out some similarities to the formal continuous-time approach taken above. We begin by writing the LIF model in discrete time:
$$V(t + dt) = V(t) + (\mu - gV)\,dt + I_t,$$
where $I_t$, the input current, is discrete gaussian white noise with mean zero and scale $\sigma\sqrt{dt}$, and the voltage is reset to $V_r$ at each threshold crossing. Note that the $\sqrt{dt}$ scaling on the input noise current ensures the existence of a continuum limit of the above process, as $dt \to 0$; this limit process is equivalent to an Ornstein-Uhlenbeck process that resets at each crossing of the threshold $V_{th}$. We will focus on this $dt \to 0$ limit below. In the following, we suppress the argument of $I(t - \tau\,dt)$ and abbreviate the event $s[t] = \{t\}$ as $s$. Thus, we write the spike-triggered average current as
$$E\big(I(t - \tau\,dt)\,\big|\,s[t]=\{t\}\big) = \int I\,P(I|s)\,dI = \int I\,\frac{P(s|I)}{P(s)}\,P(I)\,dI.$$
We analyze each term in the above expression in turn. First, by definition, P(I ) = N (0, σ 2 dt). Next, P(s) = F dt + o(dt),
where F denotes the invariant (steady-state) firing rate of the cell. F , in turn, can be computed as follows (Karlin & Taylor, 1981; Brunel & Hakim, 1999; Haskell, Nykamp, & Tranchina, 2001; Paninski et al., 2003):
$$F = -\frac{\sigma^2}{2}\left.\frac{\partial P_\infty(V)}{\partial V}\right|_{V=V_{th}},$$
where $P_\infty(V)$ is the invariant density on voltage. It turns out that we will not need to know anything about this invariant distribution beyond the fact that it exists uniquely, is differentiable from below at $V_{th}$, and is zero above $V_{th}$. Somewhat surprisingly, we will not even need to compute $F$. The hard part is $P(s|I)$. We condition on the voltage at time $t$, as follows:
P(s, I, V) dV P(I ) P(s|I, V)P(I, V) dV P(I ) P(s|I, V)P(V|I )dV
=
P(s|I, V)P(V)dV,
where we have used the independence of the current and the voltage at a given time step in the final line. To compute $P(s|I,V)$, we need to introduce some limiting arguments: the fact that $dt \to 0$ will allow us to compute $P(s|I,V)$ exactly asymptotically, and this does not appear to be possible for arbitrary $dt$. First, we write out the definition more carefully:
$$P(s|I,V) = P\big(V(t) > V_{th}\,\big|\,V(t-(\tau+1)dt),\ I(t-\tau\,dt)\big) = P\big(V(t) > V_{th}\,\big|\,V(t-\tau\,dt) = V(t-(\tau+1)dt) + (\mu - gV)dt + I\big).$$
To put it more simply, this is the probability that the stochastic process $V$ will cross the boundary $V_{th}$ on the $\tau$th time step; clearly, any such crossing must be from below, given the definition of the LIF model. The first limiting argument is simple: as $dt \to 0$, the probability that more than one such crossing will occur in the interval $(t - \tau\,dt, t)$ decreases exponentially. Thus, we have a relatively simple gaussian first passage time problem:
$$P(s|I,V) = q_{D\text{-}OU}\big(\tau, V_{th}, V+I, \sigma\sqrt{dt}, g, \mu/g, dt\big) + O\big(\sqrt{dt}\big),$$
where $q_{D\text{-}OU}$ is the probability that a discrete Ornstein-Uhlenbeck process, starting at $V + I$, with leak parameter $g$, equilibrium potential $\mu/g$, time step $dt$, and scale $\sigma\sqrt{dt}$, will first cross the threshold $V_{th}$ at $\tau$ time steps. It will be helpful to rescale $q_{D\text{-}OU}$ as follows:
$$q_{D\text{-}OU}\big(\tau, V_{th}, V+I, \sigma\sqrt{dt}, g, \mu/g, dt\big) = q_{D\text{-}OU}\big(\tau, 0, (V+I-V_{th})/\sigma\sqrt{dt}, 1, g\,dt, \mu/g, 1\big).$$
Putting all the pieces together, we have
$$STA(\tau) = \frac{1}{P(s)}\iint I\,P(I)\,P(V)\,P(s|I,V)\,dI\,dV = -\frac{2}{\sigma^2\,dt\,\dot P_\infty(V_{th})}\iint I\,\frac{1}{\sigma\sqrt{dt}}\,G\Big(\frac{I}{\sigma\sqrt{dt}}\Big)\,P_\infty(V)\,\Big(q_{D\text{-}OU}\big(\tau, 0, (V+I-V_{th})/\sigma\sqrt{dt}, 1, g\,dt, \mu/g, 1\big) + O\big(\sqrt{dt}\big)\Big)\,dI\,dV,$$
where $G$ denotes the standard gaussian density. Now we change variables, $a = I/\sigma\sqrt{dt}$ and $b = (V_{th}-V)/\sigma\sqrt{dt}$, to simplify our integral to
$$-\frac{2}{\dot P_\infty(V_{th})}\iint a\,G(a)\,P_\infty\big(V_{th} - b\sigma\sqrt{dt}\big)\,\Big(q_{D\text{-}OU}\big(\tau, 0, a-b, 1, g\,dt, \mu/g, 1\big) + O\big(\sqrt{dt}\big)\Big)\,da\,db.$$
Finally, by L'Hopital and a simple dominated convergence argument and the fact that the Ornstein-Uhlenbeck mean and covariance matrix converge to that of discrete Brownian motion in this limit, we have
$$STA(\tau) = 2\sigma\sqrt{dt}\int a\,G(a)\int_0^\infty b\,q_{DB}(\tau, b-a)\,db\,da + O(dt), \quad dt \to 0,$$
with $q_{DB}$ denoting the probability that a standard discrete Brownian motion (that is, a cumulative sum of independent and identically distributed standard normal variables) will first cross the threshold $b - a$ at time $\tau$.
Unfortunately, $q_{DB}$ does not seem to have a simple analytical expression, although we can compute this quantity fairly explicitly for small $\tau$, and the large-$\tau$ asymptotics can be computed by appealing to known results on the corresponding quantity for Brownian motion. For example, we can compute $q_{DB}(1, u) = \mathbf{1}(u > 0)\,e(u)$; in general, $q_{DB}$ is given by a similar $\tau$-dimensional gaussian integral over an orthant (Paninski, Pillow, & Simoncelli, 2004), or alternately by a repeated convolution of error functions. As discussed in the previous section, the corresponding crossing probabilities for continuous-time Brownian motion can be computed exactly:
$$q_B(\tau, u) = \int_\tau^{\tau+1} p_1^u(t)\,dt = 2\int_{(\tau+1)^{-1/2}u}^{\tau^{-1/2}u} G(x)\,dx,$$
with $p_1^u(t)$ denoting the first passage time density of a standard Brownian motion to the threshold $u$, and it is fairly easy to show that $q_B \sim q_{DB}$ as $\tau \to \infty$. Since $q_B(\tau, u) \approx G(\tau^{-1/2}u)\,\tau^{-3/2}\,u$ for large $\tau$, we have
$$STA(\tau) \approx 2\sigma\sqrt{dt}\,\tau^{-3/2}\int da\int_0^\infty a\,b\,(b-a)\,G(a)\,G\big(\tau^{-1/2}(b-a)\big)\,db.$$
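These scalings are easy to probe by brute-force simulation of the discrete model. The following sketch is our own (parameter values are arbitrary): it simulates the non-leaky discrete IF cell and accumulates the empirical spike-triggered average of the input current over the last few time steps before each spike.

```python
import math, random

def discrete_sta(nsteps=200000, dt=0.01, mu=2.0, sigma=1.0, Vth=1.0, Vr=0.0,
                 lags=20, seed=0):
    """Empirical spike-triggered average current for the discrete IF model
    V(t+dt) = V(t) + (mu - g V) dt + I_t with g = 0, I_t ~ N(0, sigma^2 dt)."""
    rng = random.Random(seed)
    V = Vr
    hist = [0.0] * lags            # ring buffer of the last `lags` input currents
    sta = [0.0] * lags
    nspikes = 0
    for k in range(nsteps):
        I = rng.gauss(0.0, sigma * math.sqrt(dt))
        hist[k % lags] = I
        V += mu * dt + I
        if V > Vth:                # spike: record the recent input, then reset
            V = Vr
            if k >= lags:
                nspikes += 1
                for tau in range(lags):
                    sta[tau] += hist[(k - tau) % lags]
    return [s / nspikes for s in sta], nspikes
```

With $\sigma\sqrt{dt} = 0.1$ here, the empirical STA near the spike should be on the order of $\sigma\sqrt{dt}$ and should decay slowly (roughly as $\tau^{-1/2}$) away from the spike time.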
Another change of variable demonstrates that the spike-triggered average current behaves like
$$STA(\tau) = A\,\sigma\sqrt{dt}\,\tau^{-1/2} + o\big(\tau^{-1/2}\big)$$
(with the prefactor $A$ given by a gaussian polynomial integral over a half-space) as $\tau \to \infty$, locally at the spike time (i.e., for $\tau\,dt \to 0$). This matches the result established by the continuous-time argument in the preceding section. See Figure 6 for an illustration of the singular behavior of the discrete STA as $\tau \to 0^-$ and of the dependence of the STA on $dt$.

5 Conclusions and Extensions

We can derive three somewhat surprising conclusions from the above results. First, the spike-triggered average current of the LIF cell in discrete time does not, in fact, have a continuum limit as $dt \to 0$ in the usual sense. We might have expected that the STA would live on a timescale of $\sim 1/g$—roughly, that the cell would integrate over about a membrane
(Panels of Figure 6, top to bottom: $dt = 0.064$, $0.016$, $0.004$, $0.001$.)
Figure 6: The discrete spike-triggered average current, as a function of $dt$. Note that the STA becomes sharper with decreasing $dt$, with a horizontal scale proportional to $dt$ and a vertical scale proportional to $\sqrt{dt}$. Parameters: $\sigma = 1$, $\mu = 2$, $g = 0$.
time constant’s worth of input before “deciding” whether to spike. In fact, the STA is effectively supported on a dt timescale, and therefore the width of the spike-triggered average vanishes as dt → 0. This illustrates the danger of thinking of the STA too glibly as the “linear prefilter” of the cell, applied
to the input before some nonlinear probabilistic spiking step (a similar point is made in Aguera y Arcas & Fairhall, 2003). From a more physical point of view, of course, this "degeneracy" of the STA is perhaps less surprising (in retrospect, at least), since decreasing $dt$ corresponds to increasing the bandwidth of the current, and this should increase the "bandwidth" of the cross-correlation between the current input and the spike output (namely, the STA), as well. This intuition is supported by numerical experiments in which the white noise current input is preceded by some fixed prefilter of limited bandwidth (Pillow & Simoncelli, 2003; Paninski et al., 2004); for this prefiltered input, the STA does indeed have a nonvanishing limit as $dt \to 0$. Second, on a related note, $STA(\tau)$ displays a square root singularity as $\tau \to 0^-$, which is due to the interaction of the Brownian motion term in the IF stochastic differential equation with the absorbing threshold at $V_{th}$ (see also Badel et al., 2005, for a discussion of this point). Finally, perhaps most surprising, the STA is basically parameter independent for $\tau$ close enough to zero. The STA scales linearly in $\sigma$, but all the other model parameters ($\mu$, $g$, $V_L$, and $V_{reset}$) become irrelevant in the $\tau \to 0^-$ limit, due to the $\sqrt{dt}$ relationship between the scale and drift of a diffusion with bounded coefficients. Loosely speaking, the noise term dominates the leak terms on small timescales; diffusion processes with bounded parameters can be locally approximated by (zero-drift) Brownian motion. The linear scaling of the STA in $\sigma$, on the other hand, has interesting implications for the "adaptive" properties of the LIF cell, as discussed in more detail in Rudd and Brown (1997), Paninski et al. (2003), and Yu and Lee (2003). More globally (that is, if we do not confine our attention to times very near the spike), as emphasized in Paninski (in press) and Badel et al.
(2005), the most likely voltage path dominates the STA for $\sigma$ sufficiently small.

5.1 Directions. We briefly indicate a few possible ways to generalize the above results. As mentioned above, we could replace our linear LIF model with a more general stochastic differential equation,
$$dV = f(V)\,dt + a(V)\,I_t,$$
with $f(V)$, $a(V)$ some fixed, uniformly smooth functions of voltage $V$ (Brunel & Latham, 2003). Again, though, while this will change the firing statistics of the model (perhaps drastically), our results on the STA in the $\tau \to 0^-$ regime remain unchanged. (We are assuming, of course, that the Fokker-Planck equation corresponding to this model has a unique, differentiable invariant density $P_\infty(V)$; without such a unique invariant $P_\infty(V)$, it is typically not possible even to define the STA. See, e.g., Karlin & Taylor, 1981, for conditions ensuring the existence of a $P_\infty(V)$ with the required properties.)
One interesting application of this idea involves invertible rescalings of the voltage axis, $V \to g(V)$, for $g(.)$ a smooth, invertible function. For example, taking $U(t) = \exp(V(t))$ gives us a geometric Brownian motion, which serves the same fundamental role in, for example, financial applications that the Ornstein-Uhlenbeck process serves in neural applications. Since invertible rescalings preserve the Markov property, all of our results go through unchanged after applying the usual change-of-measure formula. Perhaps the fundamental step of our analysis is the Markov assumption. Thus, generalizations that would be worth exploring include the extension to more general spike-response models (Gerstner & Kistler, 2002), as defined by
$$dV = \big(f(V) + \eta(t - t_s)\big)\,dt + a(V, t - t_s)\,I_t,$$
where $\eta$ is a smooth function of $t - t_s$, the time since the last spike (this class of models allows for more interesting interspike interactions, since the subthreshold dynamics are no longer Markovian), and perhaps more importantly, to colored or conductance noise input $I$ (Badel et al., 2005).

Appendix: Sampling

Sampling from the unconditioned stochastic differential equation 1.1 is straightforward and will not be discussed further here (see, e.g., Risken, 1996, and Karatzas & Shreve, 1997, for a discussion). Conditional sampling, on the other hand—drawing samples from model 1.1, given the observed spike data $s[t_1, t_2]$—is not quite so obvious. The exact sampling method used here is a variant of the forward-backward algorithm described above for computing conditional densities and is again inherited from methods for sampling from hidden Markov models (Rabiner, 1989). An identical procedure is used in Paninski (in press) but will be described here for completeness. We initialize $V(t_2) = V_{th}$. (This initial condition is due to the data that a spike occurred at time $t_2$, as above.) Now, for $t_2 > t > t_1$, sample backward:
$$V(t) \sim P\big(V(t)\,\big|\,\{V(u)\}_{t < u \le t_2},\ s[t_1, t] = \{t_1\}\big) = P\big(V(t)\,\big|\,V(t+dt),\ s[t_1, t] = \{t_1\}\big) = \frac{1}{Z}\,P\big(V(t+dt)\,\big|\,V(t)\big)\,P\big(V(t)\,\big|\,s[t_1, t] = \{t_1\}\big) = \frac{1}{Z}\,P\big(V(t+dt)\,\big|\,V(t)\big)\,P_f(V,t).$$
Thus, sampling on each time step simply requires that we draw independently from a one-dimensional density, proportional to the product in the last line. Once this product has been computed, this sampling can be done using standard methods (namely, the inverse cumulative probability transform; Press, Teukolsky, Vetterling, & Flannery, 1992). (Of course, $P_f(V,t)$ need only be computed once for $t_1 < t < t_2$, no matter how many samples are required.) The second term is computed directly from the gaussian stochastic dynamics, equation 1.1, given each $V(t+dt)$. Putting the samples together for $0 < t < t_2$ clearly gives a sample from $P(\{V(t)\}_{0 < t < t_2}\,|\,s[0,t_2]=\{0,t_2\})$, as desired.
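A minimal grid-based version of this backward sampler for the non-leaky model is sketched below (our own illustration, using the method-of-images forward density; function names, grid, and parameter values are ours and purely illustrative):

```python
import math, random

def pf_nonleaky(V, t, Vth=1.0, Vr=0.0, mu=1.0, sigma2=1.0):
    # method-of-images forward density (no threshold crossing in (0, t))
    def n(x, m, v):
        return math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
    return n(V, Vr + mu * t, sigma2 * t) - \
        math.exp(2.0 * mu * (Vth - Vr) / sigma2) * n(V, 2.0 * Vth - Vr + mu * t, sigma2 * t)

def sample_conditioned_paths(npaths, t2=1.0, dt=0.01, Vth=1.0, Vr=0.0,
                             mu=1.0, sigma2=1.0, ngrid=250, seed=0):
    """Draw paths from P({V(t)} | spikes at 0 and t2) by sampling backward:
    V(t) ~ (1/Z) P(V(t+dt)|V(t)) Pf(V,t), via inverse-CDF on a fixed V grid."""
    rng = random.Random(seed)
    grid = [-4.0 + (4.0 + Vth) * i / (ngrid - 1) for i in range(ngrid)]
    n = int(round(t2 / dt))
    # Pf need only be computed once, no matter how many samples are drawn
    pf = [[max(pf_nonleaky(V, k * dt, Vth, Vr, mu, sigma2), 0.0) for V in grid]
          for k in range(1, n)]
    paths = []
    for _ in range(npaths):
        path = [Vth]                               # V(t2) = Vth
        for k in range(n - 1, 0, -1):              # t = (n-1) dt, ..., dt
            vnext, pfk = path[-1], pf[k - 1]
            cum, w = 0.0, []
            for i, V in enumerate(grid):
                cum += math.exp(-(vnext - V - mu * dt) ** 2
                                / (2.0 * sigma2 * dt)) * pfk[i]
                w.append(cum)
            u = rng.random() * cum
            path.append(next(grid[i] for i, c in enumerate(w) if c >= u))
        path.reverse()                             # path[k] = V((k+1) dt)
        paths.append(path)
    return paths
```

The empirical mean of many such paths should approximate the doublet-triggered average: paths stay below $V_{th}$, start near $V_r$, and end at $V_{th}$, sagging below the straight line in between.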
References

Chichilnisky, E. (2001). A simple white noise analysis of neuronal light responses. Network: Computation in Neural Systems, 12, 199–213.
Daniels, H. (1982). Sequential tests constructed from images. Annals of Statistics, 10, 394–400.
Dayan, P., & Abbott, L. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
de Boer, E., & Kuyper, P. (1968). Triggered correlation. IEEE Transactions on Biomedical Engineering, 15, 159–179.
de Ruyter, R., & Bialek, W. (1988). Real-time performance of a movement-sensitive neuron in the blowfly visual system: Coding and information transmission in short spike sequences. Proc. R. Soc. Lond. B, 234, 379–414.
Freidlin, M., & Wentzell, A. (1984). Random perturbations of dynamical systems. Berlin: Springer-Verlag.
Gerstner, W. (2001). Coding properties of spiking neurons: Reverse and cross-correlations. Neural Networks, 14, 599–610.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models: Single neurons, populations, plasticity. Cambridge: Cambridge University Press.
Harvey, A. (1991). Forecasting: Structural time series models and the Kalman filter. Cambridge: Cambridge University Press.
Haskell, E., Nykamp, D., & Tranchina, D. (2001). Population density methods for large-scale modelling of neuronal networks with realistic synaptic kinetics. Network, 12, 141–174.
Hida, T. (1980). Brownian motion. New York: Springer-Verlag.
Kanev, J., Wenning, G., & Obermayer, K. (2004). Approximating the response-stimulus correlation for the integrate-and-fire neuron. Neurocomputing, 58, 47–52.
Karatzas, I., & Shreve, S. (1997). Brownian motion and stochastic calculus. New York: Springer.
Karlin, S., & Taylor, H. (1981). A second course in stochastic processes. New York: Academic Press.
Kautz, R. (1988). Thermally induced escape: The principle of minimum available noise energy. Physical Review A, 38, 2066–2080.
Koch, C. (1999). Biophysics of computation. New York: Oxford University Press.
Paninski, L. (2003).
Convergence properties of some spike-triggered analysis techniques. Network: Computation in Neural Systems, 14, 437–464.
Paninski, L. (2004). Maximum likelihood estimation of cascade point-process neural encoding models. Network: Computation in Neural Systems, 15, 243–262.
Paninski, L. (in press). The most likely voltage path and large deviations approximations for integrate-and-fire neurons. Journal of Computational Neuroscience.
Paninski, L., Haith, A., Pillow, J., & Williams, C. (2005). Improved numerical methods for computing likelihoods in the stochastic integrate-and-fire model. Comp. Sys. Neur. '05. Salt Lake City, UT.
Paninski, L., Lau, B., & Reyes, A. (2003). Noise-driven adaptation: In vitro and mathematical analysis. Neurocomputing, 52, 877–883.
Paninski, L., Pillow, J., & Simoncelli, E. (2004). Maximum likelihood estimation of a stochastic integrate-and-fire neural model. Neural Computation, 16, 2533–2561.
Pillow, J., Paninski, L., Uzzell, V., Simoncelli, E., & Chichilnisky, E. (2005). Accounting for timing and variability of retinal ganglion cell light responses with a stochastic integrate-and-fire model. Journal of Neuroscience, 25, 11003–11013.
Pillow, J., & Simoncelli, E. (2003). Biases in white noise analysis due to non-Poisson spike generation. Neurocomputing, 52, 109–115.
Plesser, H., & Tanaka, S. (1997). Stochastic resonance in a model neuron with reset. Physics Letters A, 225, 228–234.
Press, W., Teukolsky, S., Vetterling, W., & Flannery, B. (1992). Numerical recipes in C. Cambridge: Cambridge University Press.
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–286.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Risken, H. (1996). The Fokker-Planck equation. New York: Springer.
Rudd, M., & Brown, L. (1997). Noise adaptation in integrate-and-fire neurons. Neural Computation, 9, 1047–1069.
Seshadri, V. (1993). The inverse gaussian distribution. Oxford: Clarendon.
Simoncelli, E., Paninski, L., Pillow, J., & Schwartz, O. (2004). Characterization of neural responses with stochastic stimuli. In M. Gazzaniga (Ed.), The cognitive neurosciences (3rd ed.). Cambridge, MA: MIT Press.
Tuckwell, H. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Yu, Y., & Lee, T. (2003). Dynamical mechanisms underlying contrast gain control in single neurons. Physical Review E, 68, 011901.
Received September 12, 2005; accepted March 27, 2006.
LETTER
Communicated by Bard Ermentrout
Low-Dimensional Maps Encoding Dynamics in Entorhinal Cortex and Hippocampus Dmitri D. Pervouchine [email protected] Department of Mathematics and Statistics and Center for BioDynamics, Boston University, Boston, MA 02215, U.S.A.
Theoden I. Netoff [email protected] Department of Biomedical Engineering and Center for BioDynamics, Boston University, Boston, MA 02215, U.S.A.
Horacio G. Rotstein [email protected] Department of Mathematics and Statistics and Center for BioDynamics, Boston University, Boston, MA 02215, U.S.A.
John A. White [email protected] Department of Biomedical Engineering and Center for BioDynamics, Boston University, Boston, MA 02215, U.S.A.
Mark O. Cunningham [email protected] School of Neurology, Neurobiology and Psychiatry, University of Newcastle, Newcastle upon Tyne, NE2 4HH, U.K.
Miles A. Whittington [email protected] School of Neurology, Neurobiology and Psychiatry, University of Newcastle, Newcastle upon Tyne, NE2 4HH, U.K.
Nancy J. Kopell [email protected] Department of Mathematics and Statistics and Center for BioDynamics, Boston University, Boston, MA 02215, U.S.A.
Cells that produce intrinsic theta oscillations often contain the hyperpolarization-activated current Ih . In this article, we use models and dynamic clamp experiments to investigate the synchronization properties Neural Computation 18, 2617–2650 (2006)
© 2006 Massachusetts Institute of Technology
2618
D. Pervouchine et al.
of two such cells (stellate cells of the entorhinal cortex and O-LM cells of the hippocampus) in networks with fast-spiking (FS) interneurons. The model we use for stellate cells and O-LM cells is the same, but the stellate cells are excitatory and the O-LM cells are inhibitory, with inhibitory postsynaptic potentials considerably longer than those from FS interneurons. We use spike time response curve (STRC) methods, expanding that technique to three-cell networks and giving two different ways in which the analysis of the three-cell network reduces to that of a two-cell network. We show that adding FS cells to a network of stellate cells can desynchronize the stellate cells, while adding them to a network of O-LM cells can synchronize the O-LM cells. These synchronization and desynchronization properties critically depend on Ih. The analysis of the deterministic system allows us to understand some effects of noise on the phase relationships in the stellate networks. The dynamic clamp experiments use biophysical stellate cells and in silico FS cells, with connections that mimic excitation or inhibition, the latter with decay times associated with FS cells or O-LM cells. The results obtained in the dynamic clamp experiments are in good agreement with the analytical framework.
1 Introduction

The hippocampus and entorhinal cortex (EC) are two major functional units of the medial temporal lobe memory system (Witter & Wouterlood, 2002). In these structures, the neural mechanism of memory is believed to be organized by the theta rhythm (4–12 Hz), which has been shown to exist in the EC and the CA1 region of the hippocampus in vivo and in vitro (Adey, Sunderland, & Dunlop, 1957; Adey, Dunlop, & Hendrix, 1960). In both the EC and CA1, there is a cell type known to be able to autonomously produce oscillations in the theta frequency range. In CA1, it is the oriens lacunosum-moleculare (O-LM) inhibitory interneurons, shown to be critical for internal generation of the theta rhythm within area CA1 (Gillies et al., 2002). In the medial entorhinal cortex (mEC), it is the spiny stellate cells, also shown to possess robust theta-rhythmic properties (Dickson, Magistretti, Shalinsky, Hamam, & Alonso, 2000). However, the ability of individual cells to produce a theta rhythm does not imply the ability of a network to produce a coherent theta rhythm. Indeed, Rotstein et al. (2005) have shown through simulations that models of O-LM cells do not robustly synchronize unless there are other kinds of interneurons in the network, such as fast-spiking (FS) cells. The purpose of this article is to look more closely at the synchronization properties of networks that include these theta-producing cells in order to understand how the stellate cells and the O-LM cells interact with other cells in the superficial entorhinal cortex and hippocampus, respectively, to produce coherent theta and other rhythms.
Low-Dimensional Maps in Entorhinal Cortex and Hippocampus
Both of the cells described above have the hyperpolarization-activated current (h-current, Ih), thought to be important for the creation of the theta rhythm (White, Budde, & Kay, 1995; Dickson et al., 2000; Gillies et al., 2002). In these works, it has been shown that the interaction between Ih and other cation currents plays a critical role in subthreshold oscillations in stellate cells and in intrinsic membrane potential oscillations in O-LM cells. In this work, we explore how Ih interacts with inhibition to shape synchronization properties. The effects of ionic currents on synchronization have been explored in other work (Crook, Ermentrout, & Bower, 1998; Ermentrout, Pascal, & Gutkin, 2001; Kopell, Ermentrout, Whittington, & Traub, 2000; Acker, Kopell, & White, 2003; Lewis, 2003). When the coupling is weak, this can be done with methods that average the effects of spikes over the cycle, but, as shown in Acker et al. (2003), this technique fails if the coupling is strong. One method that does not require weak coupling uses a spike time response curve (STRC), a function that measures the effect of a spike of a presynaptic cell on the timing of the next spike of the postsynaptic one. If there is no “memory” of the previous spike, STRCs can be used to construct spike time difference maps (STDMs), which convey information about whether a pair of cells synchronizes (details are in section 2). The STRCs and STDMs are a bridge between cellular biophysics and the behavior of a network in both in silico models and in vitro experiments. These techniques were used in Acker et al. (2003) and Netoff et al. (2004) to understand how the synchronization of a pair of stellate cells depends on key ionic currents important to the theta rhythm. In this article we expand that technique to networks involving more than one cell type and more than two neurons. We start with a network of two O-LM cells to provide a deeper understanding of the Rotstein et al.
(2005) results and a foundation for and contrast to the work then presented on the larger networks: (1) a pair of stellate cells and an FS interneuron and (2) a pair of O-LM cells and an FS interneuron. The theoretical work on the EC network is supplemented by new experimental data produced using a dynamic clamp. These data both confirm the basic ideas of the model and present a puzzle that the theory is able to explain. For the rest of the article, we refer to O-LM cells as O-cells, stellate cells as S-cells, and FS interneurons as I-cells. Here we consider three networks (see Figure 1). The first is mutual slow (20 ms decay time) GABAA-mediated inhibition of a pair of O-LM cells (see Figure 1A). The next is a network with two S-cells connected with one I-cell, with fast (5 ms decay time) GABAA-mediated synapses from the I-cell to the S-cells and fast (3 ms decay time) AMPA-mediated synapses from the S-cells to the I-cell (see Figure 1B). The third is a network with two O-cells, each mutually coupled to an FS inhibitory cell, with slow GABAA-mediated synapses from the O-cell to the I-cell and fast GABAA synapses from the I-cell to the O-cell (see Figure 1C). In the second case, we show that the network can sometimes be reduced to that of a pair of
Figure 1: The networks: (A) the O-O network with slow GABAA-mediated synapses; (B) the S-I-S network with fast GABAA synapse from the I-cell to the S-cell and fast AMPA-mediated excitatory synapse from the S-cells to the I-cell; (C) the O-I-O network with slow GABAA-mediated synapses from the O-cell to the I-cell and fast GABAA synapse from the I-cell to the O-cell.
stellate cells coupled by inhibition, perhaps also with an artificial inhibitory autapse on each stellate cell. However, aspects of the full three-cell system must be taken into account to understand the effects of noise, as occurs in the biological system. In the third case, the analysis of the three-cell system is a perturbation of the analysis of the network with one O-LM cell and one FS cell; the analysis shows why the I-cell synchronizes the two O-cells and shows the role of the long decay time of the inhibition produced by the O-LM cells.

2 Methods

2.1 Computational

2.1.1 Models of Neurons and Networks. The biophysical models of O-, S-, and I-cells use single-compartment representations of ionic currents that govern changes of membrane potential. They contain the standard components of the Hodgkin-Huxley model: fast Na+, delayed-rectifier K+, and leak currents. In addition, S-cells and O-cells contain the hyperpolarization-activated current (Ih), which consists of fast and slow components, and an
extra inward current active during the interspike interval. For the stellate cell, this is the persistent sodium current (INap), as in previous models (Dickson, Magistretti, Shalinsky, Fransén, et al., 2000). For the O-LM cells we use the same formulation to model the extra inward current, although O-LM cells are not known to have the specific INap current (Saraga, Wu, Zhang, & Skinner, 2003). An additional stationary current component Iapp is chosen such that the neuron spikes periodically with a desired natural frequency. Values of the parameters are taken mainly from Acker et al. (2003) and Rotstein et al. (2005). The dynamic equations and parameters are summarized in the appendix. It is known that both S- and O-neurons exhibit subthreshold oscillations that constrain the firing rate to 5 to 20 Hz over a large range of levels of depolarization (Lacaille, Williams, Kunkel, & Schwartzkroin, 1987; Maccaferri & McBain, 1996; Alonso & García-Austt, 1987; Dickson, Magistretti, Shalinsky, Hamam, et al., 2000). In most of the simulations, we chose the natural frequency of the S- and O-cells to be approximately 10 Hz. However, the results are robust to changes in these natural frequencies. The S-cells are connected to the I-cells using fast AMPA glutamatergic synapses; the I-cells are connected to the S-cells using fast GABAergic inhibition (see Destexhe, Mainen, & Sejnowski, 1998, and Terman, Kopell, & Rose, 1998, for models). Evidence for these connections comes from subthreshold responses of S- and I-cells to synaptic inputs (Jones & Buhl, 1993; Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996; Cunningham, Davies, Buhl, Kopell, & Whittington, 2003), although it is not known whether excitatory postsynaptic potentials (EPSPs) onto the I-cells come from S-cells or pyramidal cells. The I-cells were connected to the O-cells using the same model as in Rotstein et al. (2005).
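The current-balance structure just described can be sketched in a few lines. The fragment below is a toy single-compartment cell, not the model of the appendix: it combines the classic Hodgkin-Huxley Na+, K+, and leak currents with a hypothetical two-component h-current (the g_h, E_h, and gating values are placeholders), and it omits the extra inward current.

```python
import numpy as np

# Toy single-compartment cell: classic Hodgkin-Huxley currents plus an
# illustrative two-component h-current. Units: mV, ms, uA/cm^2, mS/cm^2.
C = 1.0
g_Na, E_Na = 120.0, 50.0
g_K, E_K = 36.0, -77.0
g_L, E_L = 0.3, -54.4
g_h, E_h = 0.1, -20.0           # hypothetical h-current parameters
I_app = 10.0                     # stationary current; sets the firing rate

def vtrap(x, y):
    """x / (exp(x/y) - 1), continuous through x = 0."""
    return y * (1.0 - x / (2.0 * y)) if abs(x / y) < 1e-6 else x / (np.exp(x / y) - 1.0)

def r_inf(V):
    """Shared steady state for both h-current gates (illustrative)."""
    return 1.0 / (1.0 + np.exp((V + 79.0) / 9.8))

dt, t_end = 0.01, 200.0
V, m, h, n = -65.0, 0.05, 0.6, 0.32
rf = rs = r_inf(V)               # fast (~40 ms) and slow (~300 ms) gates
spikes, above = 0, False
for _ in range(int(t_end / dt)):
    am = 0.1 * vtrap(-(V + 40.0), 10.0)
    bm = 4.0 * np.exp(-(V + 65.0) / 18.0)
    ah = 0.07 * np.exp(-(V + 65.0) / 20.0)
    bh = 1.0 / (1.0 + np.exp(-(V + 35.0) / 10.0))
    an = 0.01 * vtrap(-(V + 55.0), 10.0)
    bn = 0.125 * np.exp(-(V + 65.0) / 80.0)
    I_ion = (g_Na * m**3 * h * (V - E_Na) + g_K * n**4 * (V - E_K)
             + g_L * (V - E_L) + g_h * (0.65 * rf + 0.35 * rs) * (V - E_h))
    V += dt * (I_app - I_ion) / C
    m += dt * (am * (1.0 - m) - bm * m)
    h += dt * (ah * (1.0 - h) - bh * h)
    n += dt * (an * (1.0 - n) - bn * n)
    rf += dt * (r_inf(V) - rf) / 40.0
    rs += dt * (r_inf(V) - rs) / 300.0
    if V > 0.0 and not above:         # count upward crossings of 0 mV
        spikes, above = spikes + 1, True
    elif V < -20.0:
        above = False
print(f"{spikes} spikes in {t_end:.0f} ms of periodic firing")
```

With these placeholder values the cell fires tonically; changing g_h (and Iapp with it, as described above) is the in-model analog of the h-current manipulations used throughout the article.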
The O-cells are connected to each other and to the I-cells using slow (but still GABAA-mediated) GABAergic inhibition (Hájos & Mody, 1997). The synapse decay time (defined as the time it takes for the synaptic conductance to decrease to 37% of its maximum value) for the fast inhibition is taken to be approximately 5 ms, the value measured from the experimental data presented with this article. The synapse decay time for the slow inhibition is taken to be approximately 20 ms, the decay time measured for O-LM synapses onto pyramidal cells. The decay time of inhibitory postsynaptic potentials (IPSPs) from these cells onto interneurons has not been measured (however, see further discussion in Rotstein et al., 2005). Intermediate decay times of inhibition are also discussed.

2.1.2 Spike Time Response Method. The spike time response curve (STRC) measures how much a given perturbation changes the timing of a periodically spiking real or model cell (Acker et al., 2003). The input arrives a time ∆ after the cell has spiked. We denote by f(∆) the difference between the perturbed interspike interval and the natural interspike interval T (the period of the uncoupled cell); thus, f(∆) > 0 means the spike is delayed, and f(∆) < 0 means the spike is advanced. The graph of the function f(∆) is
Figure 2: (a) Construction of spike time response curves. ∆ is the time at which the inhibitory pulse arrives. T is the natural (unperturbed) interspike interval. f(∆) is the difference between the interspike interval perturbed by the inhibitory pulse and T. (b) The spike time difference map takes the difference in timing between the spikes of the two cells on one cycle (∆) to the difference between those times on the next cycle (∆̄).
the STRC (see Figure 2A). STRCs are essentially the same, up to a factor, as phase response curves (Ermentrout et al., 2001; Winfree, 1980); we find it more natural to work directly with time in the context of the hybrid three-cell networks studied in this article. In what follows, we deal with several types of cells. Unless it is clear from the context, the corresponding STRC function will be denoted by fAB(∆), where A and B refer to the presynaptic cell and the postsynaptic cell, respectively. We now consider a pair of cells that each fire periodically and are mutually coupled. Under some assumptions, discussed below, we can form a spike time difference map (STDM). The STDM takes the difference ∆ in the spike times of the two cells in one cycle to that difference ∆̄ in the next cycle (see Figure 2B). We write the map as
∆̄ = ∆ + FAB(∆). (2.1)
The equilibrium state of equation 2.1, defined by

FAB(∆) = 0, (2.2)

is stable if

−2 < F′AB(∆) < 0. (2.3)
Otherwise, the equilibrium state is unstable (Strogatz, 1994). For more details about this method, see Acker et al. (2003). Spike time response curves were obtained using custom software implemented in C++. Numerical integration was performed using a standard adaptive step-size Runge-Kutta algorithm. The results of the simulations were visualized using Gnuplot's graphic interface and Matlab. The STRCs and STDMs are influenced by all parameters in the system. The parameters that play the largest role in the analysis are g (the amplitude of the inhibitory conductance), gh (the conductance of the h-current), Iapp (the applied DC current), and τ (the decay time of the inhibition). Unless we deal with only one type of synapse, the synaptic amplitude is denoted by gXY, where X and Y are the pre- and the postsynaptic neurons, respectively. In the simulations below, τ, g, and gh are varied independently, while Iapp is varied simultaneously with gh to preserve the natural frequency (see the appendix).

2.2 Experimental. Methods in this article are similar to those used in Netoff et al. (2004). More detailed descriptions can be found in that article. All experimental results were obtained with entorhinal cortex stellate cells. The O-LM cells were assumed to be sufficiently similar in their intrinsic properties to stellate cells that STRCs could be measured by inputs to stellate cells. The difference between the stellates and the O-LM cells in the networks is in the kind of input that the cells receive. For example, the O-LM cells get input from other O-LM cells, which is inhibitory with a decay time of 20 ms. The stellate cells get only fast inhibition, with a decay time of 5 ms.

2.2.1 Electrophysiology. All experiments were conducted as approved by the Boston University Institutional Animal Care and Use Committee. Long-Evans rats 14 to 21 days old were anesthetized with isoflurane and decapitated.
The brain was removed and chilled in ACSF (in mM: 126 NaCl, 1.25 NaH2PO4, 1.4–2 MgSO4, 26 NaHCO3, 10 dextrose, 2 CaCl2) and then sliced using a Vibratome to 350 µm thickness. Slices were bathed in a 34°C bath for 30 minutes and then left to rest at room temperature for 30 minutes before experiments. Slices were then transferred to a heated (34–36°C) chamber (Warner Instruments, Hamden, CT), mounted on a fixed
microscope stage. Slices were perfused with heated ACSF aerated with 95% O2 and 5% CO2. Neurons within slices were visualized using differential interference contrast video microscopy (Zeiss Axioskop FS2+, Dage/MTI CCD camera). Whole-cell patch clamp recordings were obtained using patch pipettes (4–6 MΩ) fabricated from borosilicate glass (1.0 mm O.D., 0.75 mm I.D., Sutter Instruments, Novato, CA) and filled with (in mM) 120 K-gluconate, 10 KCl, 10 HEPES, 4 Mg-ATP, 0.3 Tris-GTP, 10 Na2-phosphocreatine, and 20 creatine kinase, brought to pH 7.25 with KOH. Recordings of stellate cells were made from the cell-dense layer 2. Stellate cells were identified electrophysiologically under current clamp, based on the presence of a slow, inward-rectifying cation current (Ih) and brief first-spike latency (Alonso & Klink, 1993). All neurons included in this study were identified as S-cells. All the experiments were done without any pharmacological blockers of background activity. In previous work on the construction of STRCs in stellate cells, such blockers were found not to change the qualitative behavior of the results (Netoff et al., 2004).

2.2.2 Dynamic Clamp and Spike Time Response Curves. A real-time experimental control system (Dorval, Christini, & White, 2001) was used for a number of manipulations in these experiments, including controlling spike rate, delivering artificial synaptic conductance inputs, and building hybrid neuronal networks. The system is built on top of a real-time version of the Linux operating system. It is publicly available and can be downloaded from our web site (http://bme.bu.edu/ndl). The dynamic clamp was run at 10 kHz with a jitter on the order of 10–15 µs and a response latency of one time step. Spike time response curves were generated by delivering artificial inhibitory conductance inputs (IPSGs) to periodically firing neurons and measuring induced changes in spike timing.
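The STRC-measurement loop just described can be mimicked in simulation. The sketch below is a toy version under stated assumptions: a leaky integrate-and-fire cell stands in for the patch-clamped neuron (it is not the stellate model), and a single exponentially decaying IPSG is delivered at latency ∆ after a spike; f(∆) is then the perturbed minus the unperturbed interspike interval, as in section 2.1.2.

```python
import math

# Toy cell (NOT the biological stellate): leaky integrate-and-fire.
C, g_L, E_L = 1.0, 0.05, -65.0     # uF/cm^2, mS/cm^2, mV
V_th, V_reset = -50.0, -65.0
E_inh, tau_syn = -70.0, 5.0        # inhibitory reversal (mV), decay (ms)
I_drive = 0.755                    # tuned so the period is ~100 ms
dt = 0.01                          # ms

def interspike_interval(perturb_at=None, g_peak=0.02):
    """Integrate from one spike to the next; optionally inject an
    exponentially decaying IPSG starting `perturb_at` ms after the spike."""
    V, t = V_reset, 0.0
    while V < V_th and t < 1000.0:
        g = 0.0
        if perturb_at is not None and t >= perturb_at:
            g = g_peak * math.exp(-(t - perturb_at) / tau_syn)
        dV = (I_drive - g_L * (V - E_L) - g * (V - E_inh)) / C
        V += dt * dV
        t += dt
    return t

T = interspike_interval()                                   # natural period
strc = {d: interspike_interval(d) - T for d in range(10, 100, 10)}
for d, f in strc.items():
    print(f"f({d}) = {f:.2f} ms")
```

Because the input here is purely inhibitory and the voltage dynamics are one-dimensional, every f(∆) in this toy is a delay; the STRCs of real and modeled stellate cells can also show advances.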
Artificial synaptic inputs were delivered only once per six firing cycles to minimize interactions of the synaptic inputs, to allow us to track and control the baseline firing rate (see below), and to confirm that the effects of artificial synaptic inputs lasted only one cycle (Netoff et al., 2004). The phase of the synaptic input was chosen using a pseudo-random Sobol sequence (Press, Teukolsky, Vetterling, & Flannery, 1992) to sample the phase interval optimally with a finite number of choices. The synaptic conductance waveform followed the form gsyn = gs(e^(−t/τr) − e^(−t/τ))/k, where gs is the maximal synaptic conductance, t is the time since the initiation of the synaptic event, τr is the synaptic rise time constant, τ is the synaptic decay time constant, and k is a normalization factor. Synaptic time constants used were measured directly from spontaneous synaptic events in the S-cells as previously reported (Netoff et al., 2004). We found τr = 2.5 ms and τ = 5 ms for IPSGs. The injected current is calculated in real time as Is = gsyn(Vm − Vs), where Vm is the membrane potential and
Vs is the synaptic reversal potential. This signal was scaled appropriately, converted to an analog signal, and passed to the current-drive channel of the bridge-balance amplifier (Axon Instruments 700B, Union City, CA). The measured value of Vm was updated, and a new value of Is calculated and delivered, at a clock rate of 10 kHz. For EPSGs, τr = 1.68 ms and τ = 6.21 ms were used. The reversal potentials of excitatory and inhibitory synapses were 0 mV and −70 mV, respectively. The experiments measuring STRCs were done at a control period of 100 ms, while the ones concerning antiphase (see Figure 8) used a baseline period for an uncoupled cell of about 137 ms. The latter were done first and showed the antiphase, as expected. However, it was difficult to get the uncoupled stellate cells to fire periodically at that slow rate, and so the STRC experiments were done at a higher control frequency. We attempted to alter the h-current both by blocking it with ZD7288 and by adding that current artificially with the dynamic clamp. Unfortunately, the blocking experiment was not technically possible: ZD7288 depolarizes the cell, which we could compensate for using the dynamic clamp's spike rate controller, but the neurons consistently went into depolarization block after 1 or 2 minutes, preventing a good estimation of the STRC with our technique. Addition of extra h-current was done as in Dorval et al. (2001). Spike time response curves were determined from responses to hundreds of artificial synaptic perturbations. Both the x-values (time of synaptic input minus time of last postsynaptic action potential) and y-values (change in timing of next action potential, relative to the unperturbed value) were typically normalized by the average unperturbed interspike interval. The average values of the STRC were fit as explained in the curve-fitting section of the appendix.
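As a numerical sanity check of the waveform above, the sketch below evaluates gsyn on a 10 kHz grid and computes the commanded current Is. The peak conductance and membrane potential are illustrative values, and the exponentials are ordered (with k chosen accordingly) so that the waveform is positive and peaks at exactly gs.

```python
import numpy as np

tau_r, tau = 2.5, 5.0        # ms; IPSG rise and decay times from the text
g_s = 3.0                    # nS; illustrative maximal conductance
V_s = -70.0                  # mV; inhibitory reversal potential

# Peak time of the double exponential, and k normalizing the peak to g_s.
t_peak = (tau * tau_r / (tau - tau_r)) * np.log(tau / tau_r)
k = np.exp(-t_peak / tau) - np.exp(-t_peak / tau_r)

def g_syn(t):
    """Double-exponential synaptic conductance (nS)."""
    return g_s * (np.exp(-t / tau) - np.exp(-t / tau_r)) / k

def I_s(t, V_m):
    """Dynamic-clamp current command, g_syn * (V_m - V_s)."""
    return g_syn(t) * (V_m - V_s)

t = np.arange(0.0, 40.0, 0.1)   # 10 kHz update -> 0.1 ms steps
print(f"peak conductance {g_syn(t).max():.3f} nS at ~{t[g_syn(t).argmax()]:.1f} ms")
```

The closed-form peak time t_peak = τ·τr/(τ − τr)·ln(τ/τr) makes the normalization exact rather than grid-dependent.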
3 Results

3.1 O-O Network. In this section we consider a network that consists of two mutually coupled O-cells (see Figure 1A). The STRC relevant for this network is fOO(∆), for which τ = 20 ms (see Figure 3A for the dependence of these curves on gh; Figure 3B shows the experimentally determined STRC for one value of gh). As shown, the numerical STRCs are qualitatively similar to the experimental STRCs (see section 4 for details). Suppose O1 spikes at time t1, and O2 spikes at time t2 > t1. Denote t2 − t1 by ∆. The spike of O2 makes O1 spike at time t̄1 = t1 + T + fOO(∆), where T is the period of the O-cell. The second spike of O1 makes O2 spike at time t̄2 = t2 + T + fOO(t̄1 − t2). Therefore, ∆̄ = t̄2 − t̄1 = t2 + T + fOO(T + fOO(∆) − ∆) − (t1 + T + fOO(∆)) = fOO(T + fOO(∆) − ∆) − fOO(∆) + ∆. Here we assume that neither cell spikes twice in the absence of firing of the other, so the cells alternate their action potentials. The necessary
Figure 3: A family of slow inhibitory STRCs (τ = 20 ms, gOO = 0.20) depending on the conductance of the h-current (gh). (A) STRCs computed from the model. The values of gh are 1.5, 1.0, 0.5, and 0.3 (solid, dashed, dot-dashed, and dotted, respectively). The conductance of the h-current and the bias DC current are varied simultaneously to maintain the neuron's natural spiking period at ∼100 ms. The corresponding compensatory values of Iapp are −2.007, −0.879, 0.257, and 0.695. The reversal potential is −70 mV. (B) STRCs measured in the dynamic clamp experiment, gOO = 5 nS (dots). (C) Spike time difference maps for the O-O network. Line types are the same as in A.
conditions for this are t̄1 > t2 and t̄1 < t2 + T. That is, T + fOO(∆) − ∆ > 0 and fOO(∆) − ∆ < 0. Equivalently,

∆ − T < fOO(∆) < ∆. (3.1)
Condition 3.1 is met for all STRCs in Figure 3A. Then for the spike time difference map, we have

∆̄ = ψOO(ψOO(∆)) = ∆ + FOO(∆), (3.2)
where

ψOO(∆) = T + fOO(∆) − ∆. (3.3)
The function ψOO(∆) is interpreted as follows. While fOO(∆) measures the spike time change (positive or negative) caused by the stimulus that arrives at time ∆ after the previous spike, the function ψOO(∆) measures the difference between the time of the stimulus and the time of the next spike. In the O-O network, where the cells are mutually coupled, the time difference between the stimulus and the next spike for the first cell is equal to the time difference between the previous spike and the stimulus for the second cell. This leads to the second power of ψOO in equation 3.2. These STRCs were used to construct the STDMs according to equations 3.2 and 3.3. The STDMs for the four different values of gh are plotted in Figure 3C. The STDMs have x-intercepts at 50 to 60 ms, with slopes between −2 and 0. These intercepts are referred to as antiphase equilibrium points. In all situations shown above, with gh not equal to zero, the antiphase solution is stable according to equation 2.3. Changes in the strength of the synaptic coupling do not change the stability of the antiphase solution (data not shown). A decrease in gh causes the position of the x-intercept to increase; that is, the interspike interval of the coupled O-cells is effectively shortened compared to the uncoupled cells. There is also an in-phase solution, corresponding to ∆ = 0. Figure 3C shows that this point is unstable when gh = 1.5, our baseline value, but becomes stable as gh decreases; note that the slope of FOO at ∆ = 0 changes from positive to negative. That the decrease in the h-current, with τ = 20 ms, facilitates the stability of the synchronous solution was confirmed by numerical simulations (Rotstein et al., 2005).
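The construction of equations 3.2 and 3.3, together with the stability test of equation 2.3, can be reproduced numerically. The STRC below is a smooth hypothetical curve chosen only so that condition 3.1 holds; it is not fitted to the model or the data.

```python
import numpy as np

T = 100.0  # natural period, ms

def f_OO(d):
    """Hypothetical slow-inhibition STRC: the delay grows with input phase."""
    return 30.0 * np.sin(np.pi * d / (2.0 * T)) ** 2

def psi_OO(d):
    return T + f_OO(d) - d          # equation 3.3

def F_OO(d):
    return psi_OO(psi_OO(d)) - d    # equation 3.2, as a map increment

d = np.linspace(1.0, T - 1.0, 9901)
assert np.all((d - T < f_OO(d)) & (f_OO(d) < d))   # condition 3.1

# Antiphase equilibrium: x-intercept of F_OO, stable if the slope is in (-2, 0).
i = int(np.argmin(np.abs(F_OO(d))))
slope = (F_OO(d[i] + 0.01) - F_OO(d[i] - 0.01)) / 0.02
print(f"equilibrium near {d[i]:.1f} ms, slope {slope:.2f}")
```

For this particular curve the intercept falls near 60 ms with a slope of about −0.7, so the antiphase state is stable by the criterion of equation 2.3.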
3.2 S-I-S Network

3.2.1 Reduction to 2-Cell Network. This network (see Figure 1B) contains three cells, so it is not immediately clear from the previous work how to use STDM methods to examine the stability of any solutions. We first show that under some hypotheses, the analysis of this network can be reduced to that of a related two-cell network. The hypotheses are:

1. The I-cell does not fire in the absence of phasic inputs from the S-cells.
2. An EPSP from either of the S-cells is sufficient to make the I-cell fire.
3. The firing pattern has the I-cell spike between the spikes of the S-cells, which alternate in firing (i.e., S1-I-S2-I-S1-I . . .).
4. The effect of the I-cell inhibition on the S-cell that causes the I-cell to spike is small (because of the timing of the inhibition on that cell) and can be ignored without changing the qualitative results.
5. The delay between the firing of an S-cell and the firing of the I-cell that it induces is minimal and can be ignored.

Hypothesis 4 is the central one in the reduction to a two-cell model. By removing the effect of an I-cell on the S-cell that caused the I-cell to spike, the effect of an S-cell spike becomes a (slightly) delayed inhibition on the second S-cell. Therefore, at least when the two S-cells do not spike very close in time, the network is a pair of S-cells connected by inhibition. This is very similar to the O-O network analyzed in the previous section, but with a smaller decay time of the inhibition (τ = 5 ms is the value measured for IPSPs onto the stellate cells). After the analysis of this two-cell network, we will revisit hypotheses 4 and 5. The STRC relevant to this situation is fIS(∆), in which the S-cell gets inhibition with a decay time of τ = 5 ms. Figure 4A shows numerical simulations of these STRCs with several levels of gh. The major effect of the h-current is to advance the spike of the S-cell; the larger the gh, the larger this effect.
Parts B1 and B2 of Figure 4 show the experimentally determined STRCs, with inhibitory pulses fed to a real S-cell via the dynamic clamp (see section 2); Figure 4B2 is done with extra Ih added via the dynamic clamp. As shown, the numerical and experimental STRCs behave in similar ways as gh is changed: adding extra gh causes the advance portion of the STRC in the beginning of the period (i.e., where fIS(∆) < 0) to increase. The experimental STRCs are consistent across cells, as Figure 5 shows. The STDM that embodies the above hypotheses can help us understand whether there is an antiphase solution and whether it is stable; this STDM cannot say anything about the stability of solutions in which the S-cells are synchronous, since that solution violates the above hypotheses; synchrony needs to be looked at separately. The derivation of the STDM is exactly the same as that of the previous section, this time using the STRC corresponding to τ = 5 ms.
Figure 4: A family of fast inhibitory STRCs (τ = 5 ms, gIS = 0.20) depending on gh. The rest of the legend is the same as in Figure 3 (gh = 15 nS). Panel B2 has an additional 10 nS of gh applied compared to B1.
Figure 5: Variation of STRCs across cells. Fast inhibitory STRCs (τ = 5 ms) for three cells are shown for the basal level of gh (A) and for an increased level of gh (B; an additional 10 nS of gh applied). Slow inhibitory STRCs (τ = 20 ms) (C).
For later work, we will need these equations explicitly, so we write the STDMs as

∆̄ = ψIS(ψIS(∆)) = ∆ + FSS(∆), (3.4)
where

ψIS(∆) = T + fIS(∆ + δ) − ∆. (3.5)
T is the natural spiking period of the S-cell, and δ is the time lag between the firing of an S-cell and the firing of the I-cell that it induces (see hypothesis 5). The domain of validity of equation 3.4 is given by

∆ − T < fIS(∆ + δ) < ∆. (3.6)
Now assume that the value of δ is equal to 0. Then condition 3.6 is met for all STRCs in Figure 4A. The STDMs for the S-I-S network are given in Figure 4C. Note that the antiphase solution is stable for all values of the conductance of the h-current shown in the figure. As in the O-O case, the stability and position of the antiphase equilibrium points do not change with changing the strength of the inhibitory synaptic coupling (data not shown). A decrease in gh causes the equilibrium phase to increase. This observation correlates with the dynamic clamp experiments, where adding an extra h-current component noticeably shortened the interspike interval in the S-I-S network (data not shown).

3.2.2 Embedding 2-Cell Network Back in the 3-Cell Network. We now revisit assumption 4 and show that it does not make a qualitative difference in the behavior of the network. We replace the connection from the I-cell to the S-cell (e.g., S1) that caused it to spike, but (for now) keep the delay from the S-cell spike to the I-cell spike at zero. We assume that there is a small time delay in the onset of inhibition in the S-cell, corresponding to the buildup of inhibition. The inhibition on S1 creates a kind of self-inhibition that delays the next spike of the S-cell. However, it does not introduce another degree of freedom, since that inhibition occurs with a time course fixed with respect to the time of the S-cell spike. Thus, the S-I-S network can still be treated as a two-cell network, but with an inhibitory "autapse" on the S-cell. To see quantitatively the effect of this addition, we compare the usual STRC of inhibition onto an S-cell with one computed by adding (artificial) self-inhibition onto the S-cell. Since the self-inhibition changes the "natural" period of the S-cell, we change the timescale of the latter to be the same as the S-cell without self-inhibition. Figure 6 shows that
Figure 6: Fast inhibitory STRCs computed with self-inhibition added onto the S-cell (dashed-dotted) and without it (solid). The time axis was scaled to 100 ms in both cases. The natural periods of the S-cell with and without self-inhibition were 100.0 and 102.6 ms, respectively.
the difference between them is tiny. Thus, the essential effect of the self-inhibition is a scaling of the STRC, which does not change the qualitative behavior. However, such scaling has an important consequence for the entire S-I-S network: if we compare an isolated stellate cell with an S-cell in the network, the period of the S-cell changes much less for the coupled cell when Iapp is varied (data not shown).

3.2.3 Infinitesimal Delays. An excitatory synapse onto some kinds of neurons (type 1) can cause a spike in the postsynaptic cell after some delay. The delay can depend very sensitively on initial conditions and the size of the EPSP. In the previous sections, we assumed that the delay from the time the I-cell received excitation to the time it fired was zero (assumption 5). We now revisit that hypothesis, assuming that S-to-I synapses are strong; in this case, the delay is nonzero but very small and the same from cycle to cycle.
First, we estimate analytically how the antiphase solution changes when δ is varied infinitesimally. Let ∆δ be the value of ∆ for the antiphase solution when the delay is δ. It is a fixed point not only of FSS but also of ψIS, that is,

T + fIS(∆δ + δ) − ∆δ = ∆δ. (3.7)
From equation 3.7 with δ = 0, fIS(∆0) = 2∆0 − T. Using the first-order Taylor expansion of fIS(∆δ + δ) at ∆ = ∆0 and expressing ∆δ from the linear equation obtained, we get 2∆δ = 2∆0 + f′IS(∆0)(∆δ + δ − ∆0), and therefore

∆δ = ∆0 + [f′IS(∆0) / (2 − f′IS(∆0))] δ. (3.8)
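Equation 3.8 can be checked directly against the fixed-point condition 3.7. The sketch below does this for a smooth hypothetical STRC (not the fitted fIS), solving equation 3.7 by bisection and comparing with the first-order prediction.

```python
import numpy as np

T = 100.0  # ms

def f_IS(d):
    """Hypothetical fast-inhibition STRC (illustrative only)."""
    return 20.0 * np.sin(np.pi * d / (2.0 * T)) ** 2

def antiphase(delta):
    """Solve T + f_IS(D + delta) - D = D for D (equation 3.7) by bisection.
    The left side minus 2D is monotone decreasing, so bisection converges."""
    lo, hi = 1.0, T - 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if T + f_IS(mid + delta) - 2.0 * mid > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

D0 = antiphase(0.0)
fp = (f_IS(D0 + 1e-4) - f_IS(D0 - 1e-4)) / 2e-4     # f'_IS at the delay-free point
for delta in (5.0, 10.0, 15.0):
    predicted = D0 + fp / (2.0 - fp) * delta         # equation 3.8
    print(f"delta = {delta:4.1f} ms: numeric {antiphase(delta):.2f}, "
          f"first-order {predicted:.2f}")
```

For this curve the shift is a fraction of a millisecond per millisecond of delay, in line with the coefficient f′IS(∆0)/(2 − f′IS(∆0)) in equation 3.8.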
Thus, the change in ∆δ per 1 ms of delay is f′IS(∆0)/(2 − f′IS(∆0)); it is close to zero when f′IS(∆0) is small and increases as f′IS(∆0) gets close to 2. Note that in Figure 4C, ∆0 is close to the middle point of the period, where f′IS(∆0) ranges from 0.5 to 1. Thus, we predict that the change in ∆δ is of the same order of magnitude as δ, or may be even smaller, depending on f′IS(∆0). We now go back to equation 3.4 and compute FSS(∆) numerically using ψIS(∆) from equation 3.5. Note that the function ψIS(∆) is defined for ∆ ∈ [δ, T − δ]. The function FSS(∆) is shown in Figure 7 for several values of δ. Note that the change in ∆δ is smaller than δ, as predicted by equation 3.8, and the antiphase solution remains stable as δ increases up to 15 ms. Thus, when S-to-I synapses are strong, the stability of the antiphase solution does not change even with delays as long as 15 ms. However, weak S-to-I synapses lead to a different result, as we show in the next section.

3.2.4 Large Delays and Noise. We now explain an otherwise puzzling observation about the S-I-S experimental data. The first set of experiments done with the three-cell network and the dynamic clamp technology used weak synaptic connections from the S-cells to the in silico I-cells. In this situation, the results did not replicate the predicted antiphase behavior between the S-cells. Instead, the phase lags appeared almost random, clustering about both antiphase and in-phase fixed points. When a larger value of the S-I synaptic conductance was taken, the predicted antiphase was found (see Figure 8). These experiments were done with a control period of approximately 140 ms instead of the 100 ms used in the experiments described above.
Figure 7: Spike time difference maps for the S-I-S network with delays: δ = 0 ms (solid), 5 ms (dashed), 10 ms (dot-dashed), and 15 ms (dotted); gIS = 0.04, gSI = 0.10.
To understand the origin of this phenomenon, we look at the sources of variation in our system. There are two independent such sources. One is the delayed response of the I-cell to weak phasic EPSPs coming from the S-cells. The other is the noise that arises from ionic currents in the S-cells (e.g., the persistent Na+ current) and leads to a spread of their firing times. We next ask whether each of these sources alone can account for the difference in the distribution of phases between S-cells in the dynamic clamp experiment. Figure 9 shows simulations of the full three-cell network under the following conditions: weak versus strong S-to-I synapses and high versus low levels of noise. The noise in the S-cells, which was modeled by stochastic persistent Na+ channels (see the appendix), leads to an even distribution of phases only if the S-to-I synapses are sufficiently weak, that is, if the delay δ is sufficiently large. Thus, one needs both a high level of noise and large delays in order to get such a distribution of phases (compare to Figure 8). The critical question here is how the system escapes from the antiphase fixed point, which is stable in the deterministic case even if δ is large (see
Low-Dimensional Maps in Entorhinal Cortex and Hippocampus
Figure 8: Evolution of spike time differences between two patch-clamped S-cells coupled through an I-cell with artificial synapses. The period of the uncoupled cells is 137 ms. Upper panel (weak synapses): The excitatory synapses are g_SI = 10 nS. The inhibitory synapses are g_IS = 3 nS. The average lag from EPSG to firing of the I-cell is 16 ms. Bottom panel (strong synapses): The excitatory synapses are g_SI = 30 nS. The inhibitory synapses are g_IS = 15 nS. The lag between the peak of the EPSG and the action potential of the I-cell is 6 ms. The histograms on the right show the distribution of spike time differences across the experiment. Note that the cells are not exactly antiphase due to differences between the cells involved. This is reflected in the fact that the upper trace of the bottom panel, giving the phase lag from cell 1 to cell 2, is not exactly the same as the lower trace of the bottom panel, giving the lag from cell 2 to cell 1. Though both excitatory and inhibitory conductances were changed, the later modeling revealed that it is primarily the excitatory conductance change that accounted for the results.
previous section). Consider the function ψ_IS(Δ) computed from equation 3.5 for δ = 0 and δ = 20 (see Figure 10). The antiphase equilibrium points correspond to intercepts of ψ_IS(Δ) with the main diagonal. For both δ = 0 and δ = 20, the antiphase solutions are stable because the slope at the intercept is between −1 and 1, and hence the slope of the full map ψ²_IS(Δ) is also between −1 and 1. The noise in the S-cells can be interpreted geometrically as adding a random term (which can be negative or positive) to the value of ψ_IS(Δ) on each iteration of the map. In Figure 10 we show five such iterations for
Figure 9: Distribution of spike time differences (STD) between S-cells in S-I-S network with weak (g SI = 0.01) and strong S-to-I synapses (g SI = 0.05) at two levels of noise: high (Nmax = 500) and low (Nmax = 5000). The S-cells were biased to fire on average with 120 ms interval.
δ = 0 and δ = 20, starting at the same initial conditions and applying the same sequence of random perturbations to both. For δ = 0, the trajectory remains in the vicinity of the antiphase fixed point, while for δ = 20 it goes much further; that is, an identical sequence of perturbations can cause a more spread-out distribution of phases when S-to-I synapses are weak than when they are strong. This explains why the predicted antiphase behavior in the dynamic clamp experiment was observed only at larger values of the S-I synaptic conductance. 3.2.5 Inhibition and h-Current. The difference in network behavior between τ = 20 and τ = 5 shows up only when the conductance of the h-current is very small or zero. When the h-conductance in the O-O network is reduced, the antiphase fixed point remains stable, while the in-phase fixed point changes from unstable to stable (see Figure 3C). Thus, in the O-O network, both in-phase and antiphase fixed points are stable when the h-conductance is reduced. Note the appearance of two other fixed
Figure 10: The map ψ_IS(Δ) defined by equation 3.5. The values of δ are 0 ms (left panel) and 20 ms (right panel). The dashed line represents a trajectory of the iterated map. The initial condition is t0 = 50 ms for both values of δ. The sequence of perturbations (dotted segments between asterisks) is 13 ms, −3 ms, 10 ms, −9 ms, and 5 ms for both values of δ.
points, neither in-phase nor antiphase (see Figure 3C, dotted line); they are both unstable, as predicted by the slopes at the fixed points. In the S-I-S network, a decrease in the h-conductance does not change the stability of either fixed point; the in-phase remains unstable, and the antiphase remains stable (see Figure 4C). In Figure 11 we focus on the transition of the h-current conductance from a small to an infinitesimal level (gh → 0). In the S-I-S network, the in-phase solution changes from unstable (gh small) to stable (gh → 0), while the antiphase remains stable. Since for gh → 0 they are both stable, two other unstable fixed points appear, as above. In the O-O network, the antiphase fixed point changes from stable (gh small) to unstable (gh → 0). Note that the STDM for gh → 0 is not defined at the ends of the period due to violation of equation 3.1. Although in this case we cannot rigorously predict synchrony, we expect the in-phase solution to be stable, as points everywhere else in the
Figure 11: The transition of the h-current conductance from small to infinitesimal level in the S-I-S network (A) and the O-O network (B). (Top panels) STRCs for gh = 0.1 (dot-dashed) and gh = 0 (solid). (Bottom panels) The corresponding STDMs (same line types). The values of the compensatory DC current are Iapp = 0.895 for gh = 0.1 and Iapp = 1.314 for gh = 0.0. The other parameters are g I S = 0.04, g SI = 0.10, and g OO = 0.01.
period go away from one another. All these observations are summarized as follows:

              τ = 20                   τ = 5
          In-phase  Antiphase     In-phase  Antiphase
Large Ih  Unstable  Stable        Unstable  Stable
Low Ih    Stable    Stable        Unstable  Stable
gh → 0    Stable    Unstable      Stable    Stable

The domain of stability for the antiphase solution is the region between the pair of fixed points surrounding the antiphase fixed point; for the in-phase solution, it is the union of the regions surrounding Δ = 0 and Δ = T. In
each of these cases, the domains can be read off from Figures 3C, 4C, and 11A. This formulation does not work for Figure 11B, where the domains of stability cannot be correctly defined (see section 4). In young animals, the real neurons are most likely to operate with Ih lower than in adults (Richter, Klee, Heinemann, & Eder, 1997); one may expect gh → 0 in animals with Ih knocked out. 3.3 O-I-O Network. In the previous section, we reduced a three-cell model to a two-cell model by showing that the I-cell inhibition could be replaced by inhibition from each S-cell to the other S-cell, along with self-inhibition. In this section, we reduce a three-cell network that consists of two O-LM cells and one FS cell (see Figure 1C) to a perturbation of a two-cell network that consists of one O-LM and one FS cell. We start with those two cells, the O-cell producing IPSPs in the I-cell with a decay time of τ = 20 ms and the I-cell producing IPSPs in the O-cell with a decay time of τ = 5 ms (the synapse rise time is unchanged). As explained in section 2, each of these pulses is associated with an STRC. Using the previous notation, these STRCs are denoted f_OI(Δ) and f_IO(Δ). Since the O-cell has the same currents and parameters as the S-cell, f_IO(Δ) is the same as f_IS(Δ), which was previously shown in Figure 4A. The function f_OI(Δ) is plotted in Figure 12A; it is very similar to f_OO(Δ) when gh = 0 because the model of the O-cell without h-current differs from the model of the I-cell only by the presence of the persistent sodium current. We first use these STRCs to show that there is a stable fixed point for the O-I network. Assume that the O-cell spikes first, and let θ denote the time difference until the spike of the I-cell. From the STRCs, we can compute the function ψ_OI(θ) = T_I + f_OI(θ) − θ, and similarly ψ_IO(θ) = T_O + f_IO(θ) − θ, with the indices O and I reversed.
These play the same role as ψ_IS in the S-I-S network (see equation 3.4) in producing the STDM, the map that takes θ to the time difference in the next cycle, which we call θ̄. We write this map in the form

θ̄ = θ + F_OI(θ) = ψ_OI(ψ_IO(θ)).
(3.9)
Here F_OI plays the same role as F_OO and F_SS of the previous sections. As before, the right-hand side is a composition of two maps. In this case, however, the two maps are not the same, reflecting the lack of symmetry in the network. The function F_OI is plotted in Figure 12B for several values of gh. Recall that a zero of F_OI corresponds to a fixed point of the map (see equation 3.9). For each value of gh, there is a stable antiphase fixed point; the in-phase fixed point is also stable, but its domain of attraction is tiny.
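The asymmetric composition in equation 3.9 can be sketched with two hypothetical affine maps standing in for ψ_OI and ψ_IO (the real maps come from the STRCs f_OI and f_IO); the fixed point and its stability are read off as in the symmetric case, except that the two slopes are now multiplied.

```python
import numpy as np

T = 100.0  # period (ms)

def psi_oi(th):
    """Hypothetical stand-in for psi_OI; slope 0.5 at the fixed point."""
    return T / 2 + 0.5 * (th - T / 2)

def psi_io(th):
    """Hypothetical stand-in for psi_IO; slope -0.6 at the fixed point."""
    return T / 2 - 0.6 * (th - T / 2)

def F_oi(th):
    """STDM of equation 3.9: F_OI(theta) = psi_OI(psi_IO(theta)) - theta."""
    return psi_oi(psi_io(th)) - th

th = np.linspace(1.0, T - 1.0, 999)
fp = th[np.argmin(np.abs(F_oi(th)))]
# the slope of the composed map is 0.5 * (-0.6) = -0.3, inside (-1, 1),
# so this fixed point is stable
```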
Figure 12: (A) The STRC f_OI(Δ). The conductance of the O-to-I synapse is 0.02. (B) The STDMs for the O-I network. The values of gh are 1.5, 1.0, 0.5, and 0.3 (solid, dashed, dot-dashed, and dotted, respectively). The compensatory DC currents are the same as in Figure 3; g_OI = 0.02.
We now show that the entire O-I-O network can be considered as a perturbation of an O-I network in which the value of g_OI is doubled. The latter network corresponds to the O-I-O network in which the two O-cells are synchronous, and hence the effect of the simultaneous inhibition on the I-cell is twice that of a single cell. Let t1, t2, and t3 be the spike times of O1, O2, and I, respectively, and let t̄1, t̄2, and t̄3 be their spike times on the next cycle. We assume that the cells spike in this order: Δ = t2 − t1 > 0 and θ = t3 − t2 > 0. We have t̄1 = t1 + T_O + f_IO(Δ + θ) and t̄2 = t2 + T_O + f_IO(θ). Then
Figure 13: The map Δ̄ = Δ̄(Δ) defined by equation 3.10. Three solid curves correspond to θ = 50 ms, θ = 25 ms, and θ = 0 ms, as marked in the figure. The dashed lines represent trajectories of the iterated map, which start at Δ = 35 ms. Asterisks, diamonds, and circles denote iterations of Δ̄ for θ = 50 ms, θ = 25 ms, and θ = 0 ms, respectively.
Δ̄ = t̄2 − t̄1 = Δ + f_IO(θ) − f_IO(Δ + θ). For a fixed value of θ, consider the map

Δ̄ = Δ + f_IO(θ) − f_IO(Δ + θ) = Δ + F_OO(Δ).
(3.10)
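The contraction of this map can be seen by iterating it with a toy STRC. The sinusoidal f_IO below is a made-up smooth form, not the fitted STRC, and θ = 25 ms is chosen because it gives contraction for this surrogate (the figure uses the real STRC, for which θ = 50 ms does).

```python
import numpy as np

def f_io(x):
    """Toy STRC (hypothetical): inhibition arriving mid-cycle delays
    the next spike the most."""
    return 20.0 * np.sin(np.pi * x / 100.0) ** 2

def next_delta(d, theta):
    """One application of the map in equation 3.10."""
    return d + f_io(theta) - f_io(d + theta)

# near Delta = 0 the map linearizes to Delta -> (1 - f_io'(theta)) * Delta,
# so it contracts whenever 0 < f_io'(theta) < 2; for this toy STRC,
# theta = 25 gives a contraction factor of about 0.37
d = 35.0
for _ in range(15):
    d = next_delta(d, theta=25.0)
# d is now close to zero: the two O-cells have synchronized
```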
The function Δ̄(Δ) is shown in Figure 13. Note that if θ is large enough (θ = 50), then the three-cell network approaches a regime in which the two O-cells are synchronous after one or two cycles; this happens for a relatively large range of initial values of Δ. Even after the first iteration, the two O-cells are close enough to be considered as one "composite" cell, with inhibition onto the I-cell twice as strong as that of a single cell. After Δ becomes close to zero, θ reaches its equilibrium according to equation 3.9. For smaller values of θ, Δ does not converge to zero. Thus, it is important that θ be large enough, that is, that the I-cell not spike shortly after one of the O-cells. The latter condition holds when the O-to-I synaptic conductance is large enough and the synapse decay time is long (τ = 20). With weak O-to-I synapses
or fast decay of the inhibition, the synchrony in the O-I-O network becomes unstable. 4 Discussion The methods used here are closely related to those of phase response curves (Winfree, 1980), long used to understand how periodic input to an oscillator can entrain the latter. In more recent work, those methods have been used to investigate the circumstances in which a pair of coupled oscillators interacting via pulses can synchronize or not (Goel & Ermentrout, 2002; Oprisan, Prinz, & Canavier, 2004; Gutkin, Ermentrout, & Reyes, 2005; Acker et al., 2003; Kopell & Ermentrout, 2002). The methodological novelty of the analysis in this article is the use of such methods for more than two cells. In two distinct ways, a three-cell network was shown to behave like a related two-cell network. For the network of two stellates and an FS cell, the latter was shown to behave, in some parameter regimes, as if the stellates were coupled directly by fast-decaying inhibition. The added FS cell acted to make the S-cells go in antiphase. When an FS interneuron was added to a network of O-LM cells, the network behaved (again, in some parameter regimes) like one with a single O-LM cell, with twice the coupling to the I-cell; the added FS cell synchronized the O-LM cells, which do not synchronize in the absence of that kind of cell. The central difference between the two situations is that the stellate cells excite the FS cell, while the O-LM cells inhibit the FS cell. Different methods of analysis were needed for the two cases. We note that the analysis we did with the three-cell networks cannot be done with weak coupling, since the synchrony described depends on the coupling being strong enough. Also, in Netoff, Acker, Bettencourt, and White (2005), it was shown experimentally that physiologically relevant inputs give rise to outputs that violate weak coupling assumptions (see also Preyer & Butera, 2005, for synapses in an invertebrate preparation).
The STRCs obtained in the dynamic clamp experiment are in good agreement with the analytical STRCs (see Figures 3 and 4). Inhibitory stimuli that arrive at the beginning of the period advance the next spike; for S-cells this advance is more substantial than for O-LM cells and increases with increasing gh. In the middle of the period, inhibition delays the next spike for both O-LM and S-cells. The most significant distinction between analytical and experimental STRCs is observed for the S-cells at the end of the period, where the model neuron is affected by the inhibition much less than the in vitro neuron. This leads to an underestimation in the analysis of the domain of stability of the in-phase solution in the S-I-S network. The reduced two-cell analysis of the S-I-S network can be extended to account for the behavior when there is a significant delay between the firing of an S-cell and that of the I-cell. This occurs when the excitation from an S-cell to an I-cell is sufficiently weak (Ermentrout & Kopell, 1998). Analysis showed that such a delay does not change the antiphase behavior of the
S-cells in the deterministic case, but in cooperation with noise inherently present in I-to-S synapses, it results in great variability of phases between S-cells. A two-cell reduction is also possible for larger networks with all-to-all connections consisting of more than one I-cell and more than two S-cells. In such a network, there is another source of variability, one that comes from the initial conditions and results in the formation of synchronous cell assemblies (clusters) in both the S- and I-cell populations. In the situation relevant to the theta rhythm (approximately 10 Hz), there are two such cell ensembles for the S-cell population and one ensemble for the I-cell population, which behave as aggregate units very similarly to the S-cells and the I-cell in the three-cell network. Depending on the initial conditions, these units may contain different numbers of cells; this leads to nonequivalent synapses and, as a consequence, to a lack of symmetry in that network. The difference in network behavior caused by delays is explained by Figure 9 only if assumptions 1 to 5 are valid. Direct simulations show that the I-cell sometimes skips a cycle as a result of overlap of phasic EPSPs from the S-cells when S-to-I synapses are weak. This violates assumption 2 and results in a pattern of firing that is different from the one stated in assumption 3; the latter was necessary for the construction of the map (see equation 3.4). Also, assumption 4 is violated when δ is large. In this situation, another force plays a more dominant role: both S-cells receive common inhibition, which is known to facilitate synchrony in other excitatory cells (Terman et al., 1998; Börgers & Kopell, 2003). This common inhibition can help to explain several synchronous spikes, which appeared in Figure 8 even with strong S-to-I synapses.
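The interplay of delay and noise described above can be mimicked with a one-line stochastic map. The maps, slopes, and noise level below are invented surrogates, not the fitted ψ_IS: the point is that an identical perturbation sequence produces a far wider spread of phases when the slope of the full map at the antiphase point is close to 1 (weak S-to-I synapses, large delay) than when it is small (strong synapses).

```python
import numpy as np

T = 100.0  # period (ms)

def step(d, slope):
    """One cycle of a linearized STDM around the antiphase point T/2;
    'slope' is the slope of the full map at the fixed point."""
    return T / 2 + slope * (d - T / 2)

rng = np.random.default_rng(0)
perturb = rng.normal(0.0, 5.0, size=200)  # common noise sequence (ms)

def spread(slope):
    """Apply the same perturbation sequence; return the phase spread."""
    d, traj = 55.0, []
    for p in perturb:
        d = step(d, slope) + p
        traj.append(d)
    return float(np.std(traj))

spread_strong = spread(0.2)   # strong S-to-I: tight around antiphase
spread_weak = spread(0.95)    # weak S-to-I: same noise spreads widely
```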
The fact that our STDMs are not applicable to the analysis of synchrony in the O-O network (see Figure 11) has to do with the usual assumption of the STRC method that each stimulus influences only the next spike, not the subsequent ones. This is not the case in the O-O network. According to Figure 11, a spike of O1 that arrives a few milliseconds before the spike of O2 has very little effect on the timing of the latter; however, the next spike of O2 is delayed (even in the absence of O1 activity) because the inhibition is long-lasting and affects O2 after its first spike. In other words, inhibition with τ = 5 ms can be considered as an instant pulse, while inhibition with τ = 20 ms cannot. In the presence of O1 activity, this would cause a switch of leadership between O1 and O2. The technique of spike time responses can be adapted to this case by defining a second-order STRC as in Oprisan et al. (2004). The analysis of the stellate/FS network was motivated by data of Cunningham and Whittington (Cunningham, Pervouchine, Kopell, & Whittington, 2004) in slices of medial entorhinal cortex. If the activation from kainate is sufficiently low, the slice displays slow (< 1 Hz) oscillations in the electrical activity of the stellate cells, alternating between a silent regime and an active one. Within the latter, the spectrum of the activity contains peaks at both beta (centered around 21 Hz) and theta (centered around
9 Hz) frequencies. Since the main cells involved in the active regime are the stellates and the FS interneurons, the origin of the beta peak is mysterious. The work presented here suggests a possible solution: since the stellate cells are in antiphase, the population frequency is twice as fast, producing a beta frequency. In larger network simulations, with many stellates and FS cells, the stellates break into two clusters, each firing at a theta rhythm, with a population rhythm twice as fast (data not shown); as long as the clusters are not of the same size (the generic case), there is also a theta peak in the spectrum. The STRC/STDM technique goes beyond the information acquired about these particular models and shows how to do such an analysis when the model is changed. For example, other models of the stellate cell (Acker et al., 2003; Alonso & Klink, 1993) contain an M-current instead of, or in addition to, the h-current. A similar analysis can show how changes in the biophysics of the intrinsic or synaptic models can change the network outcome. The STRCs contain the information needed about the biophysics to make predictions about network behavior. However, the transition from STRCs to the maps that predict network dynamics depends on assumptions about the order in which spikes occur in the different cells. Thus, a single map cannot necessarily embody all the possible dynamical behaviors of the network; each map is valid in some (possibly very large) set of trajectories but can fail when the spike order changes. Such a bifurcation cannot be investigated within that map but requires consideration of the full system. Appendix A.1 Dynamic Models of Neurons. Voltage-dependent conductances are modeled using a Hodgkin-Huxley type of kinetic model. The cells in the network are indexed by i. The current-balance equation for all types of cells is

C_i ∂v_i/∂t = I_app,i − I_ion,i − Σ_j I_syn,j→i,
where v_i and C_i are the membrane potential (mV) and membrane capacitance (µF/cm²) of the ith cell, I_app,i is the bias (DC) current (µA) applied to the ith cell, and I_ion,i and I_syn,j→i are the respective sums of ionic and synaptic currents. The sum of ionic currents is

I_ion,i = I_Na,i + I_K,i + I_L,i + I_NaP,i + I_h,i

for the O- and S-cells and

I_ion,i = I_Na,i + I_K,i + I_L,i
for the I-cells, where

I_Na,i = g_Na,i m_i^3 h_i (v_i − E_Na,i),
I_K,i = g_K,i n_i^4 (v_i − E_K,i),
I_L,i = g_L,i (v_i − E_L,i),
I_NaP,i = g_NaP,i p_i (v_i − E_Na,i),
I_h,i = g_h,i (0.65 h_i^f + 0.35 h_i^s)(v_i − E_h,i).

In the expressions for the ionic currents, g_X,i are the maximal conductances (mS/cm²), E_X,i are the reversal potentials (mV), and m_i, h_i, n_i, p_i, h_i^f, and h_i^s are the respective channel gating variables (see below). Units of time are ms. The following maximal conductances and reversal potentials are used for the O- and S-cells: E_Na,i = 55, E_K,i = −90, E_L,i = −65, E_h,i = −20, g_Na,i = 52, g_K,i = 11, g_L,i = 0.5, g_NaP,i = 0.5, g_h,i = 1.5, and C_i = 1.5. For the I-cells, the following maximal conductances and reversal potentials are used: E_Na,i = 50, E_K,i = −100, E_L,i = −67, g_Na,i = 100, g_K,i = 80, g_L,i = 0.1, and C_i = 1.5. The gating variables x_i = m_i, h_i, n_i, p_i, h_i^f, and h_i^s obey a first-order differential equation of the form

∂x_i/∂t = (x_i,∞(v_i) − x_i) / τ_x,i(v_i),

where

x_i,∞(v) = α_x,i(v) / (α_x,i(v) + β_x,i(v)),   τ_x,i(v) = 1 / (α_x,i(v) + β_x,i(v)).

Here α_x,i(v) and β_x,i(v) are the corresponding channel's opening and closing rates. For the O- and S-cells, the following channel opening and closing rates are used:

α_m,i(v) = −0.1 (v + 23) / (e^(−0.1(v+23)) − 1)
β_m,i(v) = 4 e^(−(v+48)/18)
α_h,i(v) = 0.07 e^(−(v+37)/20)
β_h,i(v) = 1 / (e^(−0.1(v+7)) + 1)
α_n,i(v) = −0.01 (v + 27) / (e^(−0.1(v+27)) − 1)
β_n,i(v) = 0.125 e^(−(v+37)/80)
α_p,i(v) = 1 / (0.15 (1 + e^(−(v+38)/6.5)))
β_p,i(v) = e^(−(v+38)/6.5) / (0.15 (1 + e^(−(v+38)/6.5)))
h^f_i,∞(v) = 1 / (1 + e^((v+79.2)/9.78))
τ_h^f,i(v) = 0.51 / (e^((v−1.7)/10) + e^(−(v+340)/52)) + 1
h^s_i,∞(v) = 1 / (1 + e^((v+2.83)/15.9))^58
τ_h^s,i(v) = 5.6 / (e^((v−1.7)/14) + e^(−(v+260)/43)) + 1.

The channel opening and closing rates for the I-cells are:

α_m,i(v) = 0.32 (v + 54) / (1 − e^(−(v+54)/4))
β_m,i(v) = 0.28 (v + 27) / (e^((v+27)/5) − 1)
α_h,i(v) = 0.128 e^(−(v+50)/18)
β_h,i(v) = 4 / (1 + e^(−(v+27)/5))
α_n,i(v) = 0.032 (v + 52) / (1 − e^(−(v+52)/5))
β_n,i(v) = 0.5 e^(−(v+57)/40).

A.2 Synapses. The synaptic currents for all types of cells have the following form:

I_syn,j→i = g̃_ji s_j (v_i − E_rev,j),

where g̃_ji, s_j, and E_rev,j are the maximal conductance of the j → i synapse, the synaptic gating variable of the jth cell, and the synaptic reversal potential of the jth cell, respectively. The reversal potentials of the I-, O-, and S-cells are −70 mV, −70 mV, and 0 mV, respectively. The value of the maximal synaptic conductance is normalized to the area under the IPSP (or EPSP) such that g̃_ji = g_ji/τ_ji, where τ_ji is the decay time of the j → i synapse, and g_ji is the value of the maximal synaptic conductance reported in the text.

A.3 Stochastic Simulations. In the stochastic simulations, the term g_NaP,i · p_i in the equation for the persistent sodium current (I_NaP,i) is replaced by the stochastic term γN_i/SA, as in previous works (White, Klink, Alonso, & Kay, 1998; Acker et al., 2003). Here N_i is the number of open persistent sodium channels and varies from 0 to N_max, γ = 20 pS is the open-channel conductance, and SA = 2.29 × 10⁻⁴ cm² is the cell's surface area. The values of γ, N_max, and SA are such that the maximal conductance is equal to g_NaP,i in the deterministic model. The channels are assumed to be independent and identical. On each step of the simulation, a random
Figure 14: Curve fitting (the data from Figure 3 are used). (A) Linear regression, A·Δ + B, of f(Δ)/Δ versus Δ; the range of Δ is between the dotted vertical lines. (B) Multiplicative residuals f(Δ)/(A·Δ² + B·Δ) are fit with the function tanh((C − Δ)/D) using nonlinear least-squares optimization (border constraints: 50 ≤ C ≤ 90, 0.01 ≤ D ≤ 100). (C) The resulting fit is f(Δ) ≈ (A·Δ² + B·Δ) tanh((C − Δ)/D) (solid); compare to the quadratic fit (dashed).
number is chosen from an exponential distribution based on the rates α_p,i(v) and β_p,i(v) above to determine the time of the next channel transition. The equations are then integrated up to that time, and the number of open channels is updated. This method is generally used for exact stochastic simulations of chemical reactions (Gillespie, 1977).
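A minimal sketch of this event-driven scheme for a population of N_max independent two-state channels is given below. The constant rates alpha and beta are placeholders for the voltage-dependent rates α_p,i(v) and β_p,i(v), and the interleaved integration of the neuron equations between transitions is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

def gillespie_channels(n_open, n_max, alpha, beta, t_end):
    """Evolve the number of open channels exactly (Gillespie, 1977):
    draw the waiting time to the next transition from an exponential
    distribution, then pick opening vs. closing by its propensity."""
    t = 0.0
    while True:
        rate = alpha * (n_max - n_open) + beta * n_open  # total rate
        t += rng.exponential(1.0 / rate)
        if t > t_end:
            return n_open
        if rng.random() < alpha * (n_max - n_open) / rate:
            n_open += 1   # one closed channel opens
        else:
            n_open -= 1   # one open channel closes

# with alpha = beta the open fraction relaxes to 1/2
samples = [gillespie_channels(0, 100, 1.0, 1.0, 20.0) for _ in range(20)]
frac = np.mean(samples) / 100
```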
A.4 Curve Fitting. Experimental data were fit using a nonlinear model as follows. First, we obtain a linear fit, A·Δ + B, for the function f(Δ)/Δ versus Δ in a certain range of Δ (see Figure 14A). For f(Δ), this gives a quadratic fit, A·Δ² + B·Δ, which passes through the origin (see Figure 14C, dashed line). Then the multiplicative residuals f(Δ)/(A·Δ² + B·Δ) are fit with the function tanh((C − Δ)/D), where tanh(x) = (e^x − e^−x)/(e^x + e^−x), using nonlinear least-squares optimization (see Figure 14B). The resulting fit is f(Δ) ≈ (A·Δ² + B·Δ) tanh((C − Δ)/D) (see Figure 14C, solid line).

Acknowledgments

We acknowledge Jozsi Jalics and Corey Acker for insightful discussions. This work was partially supported by the Burroughs Wellcome Fund (H.G.R.) and by NIH award number 1 R01 NS46058, as part of the NSF/NIH Collaborative Research in Computational Neuroscience Program.

References

Acker, C. D., Kopell, N., & White, J. A. (2003). Synchronization of strongly coupled excitatory neurons: Relating network behavior to biophysics. J. Comp. Neurosci., 15(1), 71–90.
Adey, W. R., Dunlop, C. W., & Hendrix, C. E. (1960). Hippocampal slow waves: Distribution and phase relationship in the course of approach learning. Arch. Neurol., 3, 74–90.
Adey, W. R., Sunderland, M. D., & Dunlop, C. W. (1957). The entorhinal area: Electrophysiological studies of its interrelations with rhinencephalic structures and the brainstem. Electroencephalography and Clinical Neurophysiology, 9(3), 309–324.
Alonso, A., & García-Austt, E. (1987). Neuronal sources of theta rhythm in the entorhinal cortex of the rat. II. Phase relations between unit discharges and theta field potentials. Exp. Brain Res., 67(3), 502–509.
Alonso, A., & Klink, R. (1993). Differential electroresponsiveness of stellate and pyramidal-like cells of medial entorhinal cortex layer II. J. Neurophysiol., 70, 128–143.
Börgers, C., & Kopell, N. (2003).
Synchronization in networks of excitatory and inhibitory neurons with sparse, random connectivity. Neural Computation, 15(3), 509–538. Crook, S. M., Ermentrout, B., & Bower, J. M. (1998). Spike frequency adaptation affects the synchronization properties of networks of cortical oscillators. Neural Computation, 10, 837–854. Cunningham, M. O., Davies, C. H., Buhl, E. H., Kopell, N., & Whittington, M. A. (2003). Gamma oscillations induced by kainate receptor activation in the entorhinal cortex in vitro. Journal of Neuroscience, 23(30), 9761–9769. Cunningham, M. O., Pervouchine, D. D., Kopell, N., & Whittington, M. (2004). Cellular and network mechanisms of slow activity (<1 Hz) in the entorhinal cortex. Society for Neuroscience Meeting 2004, Abstract 638.9.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1998). Kinetic models of synaptic transmission. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (2nd ed.). Cambridge, MA: MIT Press. Dickson, C. T., Magistretti, J., Shalinsky, M. H., Fransén, E., Hasselmo, M. E., & Alonso, A. A. (2000). Properties and role of Ih in the pacing of subthreshold oscillations in entorhinal cortex layer II neurons. J. Neurophysiol., 83, 2562–2579. Dickson, C. T., Magistretti, J., Shalinsky, M., Hamam, B., & Alonso, A. (2000). Oscillatory activity in entorhinal neurons and circuits: Mechanisms and function. Ann. N.Y. Acad. Sci., 911, 127–150. Dorval, A. D., Christini, D. J., & White, J. A. (2001). Real-time Linux dynamic clamp: A fast and flexible way to construct virtual ion channels in living cells. Annals of Biomedical Engineering, 29, 897–907. Ermentrout, G. B., & Kopell, N. (1998). Fine structure of neural spiking and synchronization in the presence of conduction delays. Proc. Natl. Acad. Sci. USA, 95, 1259–1264. Ermentrout, B., Pascal, M., & Gutkin, B. S. (2001). The effects of spike frequency adaptation and negative feedback on the synchronization of neural oscillators. Neural Computation, 13, 1285–1310. Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. Journal of Physical Chemistry, 81, 2340–2361. Gillies, M. J., Traub, R. D., LeBeau, F. E. N., Davies, C. H., Gloveli, T., Buhl, E. H., & Whittington, M. A. (2002). A model of atropine-resistant theta oscillations in rat hippocampal area CA1. Journal of Physiology, 543(3), 779–793. Goel, P., & Ermentrout, B. (2002). Synchrony, stability, and firing patterns in pulse-coupled oscillators. Physica D, 163, 191–216. Gutkin, B. S., Ermentrout, G. B., & Reyes, A. D. (2005). Phase-response curves give the responses of neurons to transient inputs. J. Neurophysiol., 94, 1623–1635. Hájos, N., & Mody, I. (1997).
Synaptic communication among hippocampal interneurons: Properties of spontaneous IPSCs in morphologically identified cells. J. Neurosci., 17, 8427–8442. Jones, R. S., & Buhl, E. H. (1993). Basket-like interneurones in layer II of the entorhinal cortex exhibit a powerful NMDA-mediated synaptic excitation. Neurosci. Lett., 149(1), 35–39. Kopell, N., & Ermentrout, G. B. (2002). Mechanisms of phase-locking and frequency control in pairs of coupled neural oscillators. In B. Fiedler (Ed.), Handbook on dynamical systems: Toward applications (Vol. 2, pp. 3–54). Dordrecht: Elsevier. Kopell, N., Ermentrout, G. B., Whittington, M., & Traub, R. D. (2000). Gamma rhythms and beta rhythms have different synchronization properties. Proc. Natl. Acad. Sci. USA, 97, 1867–1872. Lacaille, J.-C., Williams, S., Kunkel, D., & Schwartzkroin, P. (1987). Local circuit interactions between oriens/alveus interneurons and CA1 pyramidal cells in hippocampal slices: Electrophysiology and morphology. J. Neurosci., 7, 1979–1993. Lewis, T. J. (2003). Phase-locking in electrically coupled non-leaky integrate-and-fire neurons. Discrete Contin. Dyn. Syst. Ser. B (Suppl.), 554–562. Maccaferri, G., & McBain, C. (1996). The hyperpolarization-activated current (Ih) and its contribution to pacemaker activity in rat CA1 hippocampal stratum oriens-alveus interneurones. J. Physiol., 497, 119–130.
Netoff, T. I., Acker, C. D., Bettencourt, J. C., & White, J. A. (2005). Beyond two-cell networks: Experimental measurement of neuronal responses to multiple synaptic inputs. Journal of Computational Neuroscience, 18, 287–295. Netoff, T. I., Banks, M. I., Dorval, A. D., Acker, C. D., Haas, J. S., Kopell, N., & White, J. A. (2004). Synchronization in hybrid neuronal networks of the hippocampal formation. J. Neurophysiol., doi:10.1152/jn.00982.2004. Oprisan, S. A., Prinz, A. A., & Canavier, C. C. (2004). Phase resetting and phase locking in hybrid circuits of one model and one biological neuron. Biophys. J., 87, 2283–2298. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C (2nd ed.). Cambridge: Cambridge University Press. Preyer, A. J., & Butera, R. J. (2005). Neuronal oscillators in Aplysia californica that demonstrate weak coupling in vitro. Phys. Rev. Lett., 95(13), 138103. Richter, H., Klee, R., Heinemann, U., & Eder, C. (1997). Developmental changes in inward rectifier currents in neurons of the rat entorhinal cortex. Neurosci. Lett., 228, 139–141. Rotstein, H. G., Pervouchine, D. D., Gillies, M. J., Acker, C. D., White, J. A., Buhl, E. H., Whittington, M. A., & Kopell, N. (2005). Slow and fast inhibition and an h-current interact to create a theta rhythm in a model of CA1 interneuron network. J. Neurophysiol., 94, 1509–1518. Saraga, F., Wu, C. P., Zhang, L., & Skinner, F. K. (2003). Active dendrites and spike propagation in multi-compartment models of oriens-lacunosum/moleculare hippocampal interneurons. J. Physiol., 552, 502–509. Strogatz, S. H. (1994). Nonlinear dynamics and chaos: With applications to physics, biology, chemistry, and engineering. Cambridge, MA: Perseus Books. Terman, D., Kopell, N., & Bose, A. (1998). Dynamics of two mutually coupled slow inhibitory neurons. Physica D, 117, 241–275. Traub, R. D., Whittington, M. A., Colling, S. B., Buzsáki, G., & Jefferys, J. G. (1996).
Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J. Physiol., 493, 471–484. White, J. A., Budde, T., & Kay, A. R. (1995). A bifurcation analysis of neural subthreshold oscillations. Biophysical Journal, 69, 1203–1217. White, J. A., Klink, R., Alonso, A., & Kay, A. R. (1998). Noise from voltage-gated ion channels may influence neuronal dynamics in the entorhinal cortex. J. Neurophysiol., 80, 262–269. Winfree, A. T. (1980). Geometry of biological time. Berlin: Springer. Witter, M., & Wouterlood, F. (2002). The parahippocampal region. New York: Oxford University Press.
Received September 1, 2005; accepted April 27, 2006.
LETTER
Communicated by Dan Hammerstrom
Programmable Logic Construction Kits for Hyper-Real-Time Neuronal Modeling

Ruben Guerrero-Rivera
[email protected]
Center for Bioengineering, University of Leicester, Leicester LE1 7RH, U.K.

Abigail Morrison
[email protected]

Markus Diesmann
[email protected]
Computational Neurophysics and Bernstein Center for Computational Neuroscience, Institute of Biology III, Albert-Ludwigs-University, Freiburg, Germany

Tim C. Pearce
[email protected]
Center for Bioengineering, University of Leicester, Leicester LE1 7RH, U.K.
Programmable logic designs are presented that achieve exact integration of leaky integrate-and-fire soma and dynamical synapse neuronal models and incorporate spike-time-dependent plasticity and axonal delays. Highly accurate numerical performance has been achieved by modifying simpler forward-Euler-based circuitry requiring minimal circuit allocation, which, as we show, behaves equivalently to exact integration. These designs have been implemented and simulated at the behavioral and physical device levels, demonstrating close agreement with both numerical and analytical results. By exploiting finely grained parallelism and single clock cycle numerical iteration, these designs achieve simulation speeds at least five orders of magnitude faster than the nervous system, termed here hyper-real-time operation, when deployed on commercially available field-programmable gate array (FPGA) devices. Taken together, our designs form a programmable logic construction kit of commonly used neuronal model elements that supports the building of large and complex architectures of spiking neuron networks for real-time neuromorphic implementation, neurophysiological interfacing, or efficient parameter space investigations.

Neural Computation 18, 2651–2679 (2006)
© 2006 Massachusetts Institute of Technology

1 Introduction

Numerical simulation of large-scale networks of spiking neurons and real-time neuromorphic system implementation requires significant computational power. On the other hand, existing general-purpose microprocessors, although extremely versatile, rely largely on serial processing of data, which severely limits their computational throughput. This serial dependence comes about through centralized arithmetic resources (such as the arithmetic logic unit or floating-point unit), which are restricted to one or a small number of concurrent operations. This class of numerical simulation problem, due to its inherent parallelism, is challenging for single-core processor architectures, rendering real-time operation impossible for all but the simplest of neural systems. The fact that cluster computing approaches are so successful in speeding up neuronal simulations demonstrates this serial bottleneck in computation. Clearly, then, single-core microprocessor-based neural simulators offer flexibility but are limited by serial processing constraints. Fully customized hardware (such as analog VLSI) potentially offers high computational power at the expense of flexibility and design iteration times. Consequently, there is a need for a rapid prototyping platform for neuronal models that is extremely fast with similar flexibility to general-purpose microprocessors. Field-programmable hardware (in the form of field-programmable gate arrays (FPGAs) or field-programmable analog arrays (FPAAs)) is an ideal technology to fulfill these requirements, since devices are fast (up to 1 GHz clock speeds), can be reprogrammed in milliseconds, and offer vast numbers of individual circuit elements that are inherently parallel in nature and may be configured arbitrarily. Programmable logic promises to deliver computational speeds approaching that of custom silicon solutions while providing a flexible substrate for numerical simulation. Delivering on this promise, however, requires that field-programmable neuronal element circuit designs exploit the inherent parallelism of both the problem domain and target technology.
Without this finely grained parallelism, this approach cannot compete in processing performance with single-core microprocessors, whose pipelined, reduced instruction set architectures are optimized for serial computation. Hence, achieving the necessary performance in programmable logic requires deploying robust, extremely low-complexity neuron element designs that are inherently self-contained (i.e., no shared resources) and operate independently and in parallel. Furthermore, to achieve efficient operation, one should consider only single clock cycle iteration solutions (one clock cycle equals one numerical iteration) without sacrificing numerical performance. Multicycle architectures of neuronal models have been previously discussed (Graas, Brown, & Lee, 2004); single-cycle architectures have independently been investigated by Weinstein and Lee (2006). FPGAs are digital integrated circuits (ICs) that internally contain configurable blocks of logic, as well as programmable interblock connections
(Xilinx, 2002). Such devices can be configured in an arbitrary fashion to perform a variety of signal and data processing tasks. As a first step in the implementation, designs must satisfy specific criteria to guarantee a functional circuit that is free from logical errors. Generally, designs are specified in a hardware description language (HDL), such as VHDL or Verilog, although schematic-level entry is also possible. The HDL program must then be synthesized, a process of minimization and optimization that converts the sequential coded description into a collection of parallel registers (memory storage) and Boolean relationships. The gate-level functions obtained from this synthesis process are then mapped onto the physical layer of the device, meaning that the logical design is assigned to available chip resources, depending on the chosen target device. From this process, a map file is created, which defines the placing of the logic onto the physical device, as well as the routing of the signals between logic elements (so-called place and route). Finally, a "bitstream" is created, which is a file containing the information to internally configure the FPGA device (see Xilinx, 2003, and Maxfield, 2004). Design tools to aid the FPGA synthesis process are in a rapid state of development. Here we present a set of reduced-complexity programmable logic designs for exponential decay and alpha and beta function synapses with spike-time-dependent plasticity (STDP) learning, as well as an integrate-and-fire soma complete with axonal delays. Together these designs form a neuronal modeling kit, simulating all the major components commonly used in neuronal modeling of large-scale networks. The designs are of sufficiently low complexity to be realizable in massive numbers on a single FPGA device through multiplexing, yet feature single clock cycle numerical iteration.
We begin by making mathematical comparisons between forward-Euler (FE) and exact integration (EI) numerical iteration schemes in the case of linear ordinary differential equation (ODE) solvers. We show that this comparison leads to a surprisingly simple solution for implementing neuronal elements with exact numerical properties. By simulating and evaluating these designs, we demonstrate FPGAs as a viable technology for large-scale spiking neuron model implementation. The designs are finally tested against numerical and analytical results, verifying exact integration behavior.

2 Numerical Considerations

Neuronal dynamics are commonly modeled in digital hardware using numerical methods for solving ODEs. The simplest scheme for numerical simulation of dynamical system behavior is the FE approach. For an initial value problem (IVP) of the form

\dot{y}(t) = f(y, t), \qquad y(0) = y_0,  (2.1)
a first-order (linear) approximation to its solution is given by the FE approximation (Lambert, 1973)

\tilde{y}_{k+1} = \tilde{y}_k + \Delta f(\tilde{y}_k, t_k),  (2.2)

which is an iteration scheme that begins from an initial value (y_0), where k is an integer defining the iteration number (k = 0, 1, \ldots) and \Delta is the fixed step size, in part determining the accuracy of the approximation. In general, this numerical integration scheme has a truncation error of order O(\Delta^2). More sophisticated iteration schemes can be used that reduce the single-step error (such as Runge-Kutta), but these require greater computational effort and hence more complex implementations. In this context, considering a homogeneous first-order ODE of the form

\dot{y}(t) = a y(t), \qquad y(0) = y_0,  (2.3)
we can approximate its solution by applying the FE method as follows:

\tilde{y}_{k+1} = \tilde{y}_k (1 + a\Delta), \qquad \tilde{y}(0) = y_0.  (2.4)
As an alternative, there is an exact way to perform digital simulation of general linear time-invariant systems, described by Rotter and Diesmann (1999). In order to obtain an EI solution of equation 2.3, we make use of this method, yielding the following expression:

y_{k+1} = e^{a\Delta} y_k, \qquad y(0) = y_0.  (2.5)
In general, the two methods yield different values. However, in order to relate the FE and EI methods in the linear ODE case, we use the parameter a, which determines the time constant of the system, as the relational parameter. Let \tilde{a} and a be the factors determining the time constants for the FE and EI solutions, respectively. If both iteration schemes begin from the same initial value y_0, then clearly they will give identical results if the following condition is met:

1 + \tilde{a}\Delta = e^{a\Delta}.  (2.6)
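This equivalence can be checked numerically. The following sketch (illustrative Python, not part of the original hardware design; all parameter values are arbitrary) iterates equation 2.4 with the corrected rate \tilde{a} alongside equation 2.5 and confirms that they agree to floating-point precision:

```python
import math

def fe_step(y, a, dt):
    # Forward-Euler update (equation 2.4): y_{k+1} = y_k * (1 + a*dt)
    return y * (1.0 + a * dt)

def ei_step(y, a, dt):
    # Exact-integration update (equation 2.5): y_{k+1} = exp(a*dt) * y_k
    return y * math.exp(a * dt)

a = -1.0 / 5e-3                           # decay rate for tau = 5 ms (example value)
dt = 1e-4                                 # fixed step size (example value)
a_tilde = (math.exp(a * dt) - 1.0) / dt   # solve 1 + a~*dt = exp(a*dt) for a~

y_fe, y_ei = 1.0, 1.0
for _ in range(1000):
    y_fe = fe_step(y_fe, a_tilde, dt)
    y_ei = ei_step(y_ei, a, dt)

print(abs(y_fe - y_ei))   # agreement up to round-off
```

Running the same loop with the uncorrected rate a in `fe_step` produces the usual FE discretization error, which the corrected rate removes entirely.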
To satisfy this condition, we must first solve for \tilde{a} and then substitute this value into equation 2.4 in place of a. Hence, for first-order linear time-invariant dynamical systems such as we consider here, the FE solution is simply a time-scaled version of the exact solution, which can be corrected by adopting the \tilde{a} parameter in equation 2.4. In the following circuit implementations, we will exploit this fact to derive extremely low-complexity
circuit designs capable of implementing an EI of common neuronal modeling components.

3 Neuron Model

We first consider the two main neuronal elements: a synapse model that reproduces postsynaptic dendritic current dynamics resulting from a presynaptic action potential and a soma model that integrates these currents over time to generate a membrane potential. We describe the implementation of both models in turn, providing the numerical details on which they are based, as well as their optimized programmable logic circuit implementations.

3.1 Synapse Model

3.1.1 Exponential Decay Synapse Model. One of the most common methods to model dendritic currents generated by a synapse in response to a spike train is through an exponential decay over multiple spike inputs occurring at times (t_1, t_2, \ldots, t_j, \ldots, t_l) to give the dendritic current

I(t) = w \sum_{j=1}^{l} H(t - t_j)\, e^{-(t - t_j)/\tau_e},  (3.1)
where H(\cdot) is the Heaviside function, w is the synaptic efficacy (weight) that determines the current increment in response to the arrival of each presynaptic action potential, and \tau_e is the time constant of the exponential decay resulting from the arrival of each action potential. Equation 3.1 is the solution to the first-order ODE of the form

\dot{I}(t) = w \sum_{j=1}^{l} \delta(t - t_j) - \frac{1}{\tau_e} I(t),  (3.2)

which can be approximated using the FE scheme for a = -1/\tau_e as

I_{k+1} = w\,\delta_{k+1,j} + I_k \left(1 - \frac{\Delta}{\tau_e}\right).  (3.3)
This numerical scheme may be implemented directly in programmable logic by keeping a running accumulation of I over time: at each time step, the current value is carried forward and a fraction of I_k is subtracted from it. When a time step occurs in which a spike is received, the constant factor w must also be added to the accumulated value. This scheme makes for a
very simple architecture, with the exception of the multiplication involved in the fractioning process, which in its most general form is expensive to implement in programmable logic. The resulting register transfer level (RTL) description of the synapse that can be directly implemented in programmable logic is shown in Figure 1A, as derived from the FE iteration scheme given in equation 3.3. The implementation assumes that within a single clock cycle (identical to the step time \Delta), only a single action potential may be received at the input. This is reasonable if we assume that every synapse is connected to a single presynaptic cell, which typically has an absolute refractory period far exceeding a single time step. n-bit integer arithmetic may be used without loss of precision as long as we scale the circuit weight input w_B according to w_B = k_I w, choosing k_I such that the bit count output of the circuit, B, extends across the integer number line \{0, 1, 2, \ldots, 2^n - 1\}. In this case, the dendritic current can be recovered from the circuit output via I(t) = B/k_I. The real-time response of this circuit to a single action potential shown in Figure 1B is an exponential decay, with starting value equal to the weight of the synapse, as shown in Figure 1C. The simulation proceeds at least three orders of magnitude faster than biological time, requiring a dimensionless speed-up factor k_t to translate between biological time constants and those achieved by the FPGA. The speed-up factors k_t are calculated based on a biological time constant for the synapse of 5 ms for Figures 1 and 2 and a biological time constant for the soma of 20 ms for Figures 3, 4, and 7. Using the same arguments as previously, it is clear that the FE scheme describing the synaptic dynamics is equivalent to the EI solution as long as the following corrected time constant is substituted for \tau_e in equation 3.3:

\tilde{\tau}_e = \frac{\Delta}{1 - e^{-\Delta/\tau_e}}.  (3.4)
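A quick numerical check of this correction (an illustrative Python sketch with arbitrary parameter values, using floating point in place of the circuit's n-bit integer arithmetic): substituting \tilde{\tau}_e into the iteration of equation 3.3 makes the per-step factor equal e^{-\Delta/\tau_e} exactly, so the accumulated current tracks the analytic exponential.

```python
import math

tau_e = 5e-3    # biological time constant (example value)
dt = 1e-4       # step size (example value)
w = 16384.0     # synaptic weight (example value)
tau_corr = dt / (1.0 - math.exp(-dt / tau_e))   # equation 3.4

# Iterate equation 3.3 with the corrected time constant; a single
# spike arrives at step 0, so I_0 = w and the current decays thereafter.
I = w
for k in range(1, 101):
    I = I * (1.0 - dt / tau_corr)

exact = w * math.exp(-100 * dt / tau_e)         # EI solution after 100 steps
print(I, exact)
```

The two values agree to round-off, whereas iterating with the raw \tau_e would leave a systematic discretization error.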
The behavior of this circuit in response to the regular spike train shown in Figure 1D is shown in Figure 1E. The limiting value of peak synaptic current is compared to the asymptotic value (dotted line), which in the case of a regular spiking input at fixed frequency, f_{sp}, can be shown to be

I_{peak} = \frac{w}{1 - e^{-1/(f_{sp} \tilde{\tau}_e)}}.  (3.5)
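Equation 3.5 is the limit of a geometric series: each interspike interval multiplies the accumulated current by a fixed decay factor r, and each spike adds w, so successive peaks obey I_{peak}(m+1) = w + r I_{peak}(m) with fixed point w/(1 - r). A brief simulation (illustrative Python; the values of the decay factor and interval are arbitrary) converges to this limit:

```python
# Per-step decay factor and number of steps n between regular spikes;
# r is then the decay accumulated over one interspike interval.
decay_per_step = 0.99   # example value of the per-step synapse decay
n = 50                  # steps per interspike interval (example value)
w = 8192.0              # synaptic weight (example value)

r = decay_per_step ** n
peak = 0.0
for _ in range(2000):        # many spike periods
    peak = w + r * peak      # add a spike, then decay over one period

limit = w / (1.0 - r)        # the fixed point, the form of equation 3.5
print(peak, limit)
```

With r = e^{-1/(f_{sp}\tilde{\tau}_e)} expressed in step units, this fixed point reproduces equation 3.5.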
The asymptotic peak response of the circuit is found to be within one least significant bit (LSB) of the theoretical value, I_{peak}.

Figure 1: Exponential decay synapse implementation. (A) RTL description of the synapse architecture. Spiking synaptic input is represented by 0-1 logic levels, which control the addition of the weight value at each clock cycle. Thick, solid lines represent m-bit digital buses (representing unsigned integers), and the thin, solid line represents an individual control line. (B) A single spike event at time t = 0 occurs at the input of the synapse. (C) An exponential decay is generated in the circuit as the response to the input. For a synapse with a current increment of 50 pA, the bit count output, B, can be converted to synaptic current using I(t) = B/k_I, where k_I = 3.28 × 10^{14} counts A^{-1}. (D) Spiking input at a regular firing frequency, f_{sp} = 1 MHz, used to test the synapse implementation. (E) Real-time synapse response to this regular spiking input for the FPGA synapse implementation. The theoretical asymptotic value of peak synaptic current is shown by the upper dotted line. The clock frequency was set to f_{clk} = 100 MHz, giving a step time \Delta = 10 ns and a speed-up factor k_t = 1953, resulting in a time constant \tilde{\tau}_e = 2.56 µs and weight w = 16,384 for C and w = 8192 for E.

3.1.2 Alpha and Beta Function Synapse Models. Alpha and beta functions are also popular physical models for synaptic dynamics (Jack, Noble, & Tsien, 1975; Gerstner & Kistler, 2002; Tuckwell, 1988). The dendritic current
generated by an alpha function synapse responding to multiple spike inputs occurring at times (t_1, t_2, \ldots, t_j, \ldots, t_l) is described by

I(t) = \frac{we}{\tau_\alpha} \sum_{j=1}^{l} H(t - t_j)(t - t_j)\, e^{-(t - t_j)/\tau_\alpha},  (3.6)
where w is again the efficacy of the synapse and \tau_\alpha is the characteristic time constant for the synapse. In a beta function model, the current in a postsynaptic dendrite generated in response to multiple presynaptic spikes occurring at times (t_1, t_2, \ldots, t_j, \ldots, t_l) is described by

I(t) = w\beta \sum_{j=1}^{l} H(t - t_j) \left( e^{-(t - t_j)/\tau_{\beta 1}} - e^{-(t - t_j)/\tau_{\beta 2}} \right),  (3.7)
where \tau_{\beta 1} and \tau_{\beta 2} now determine the time constants of the synapse and \beta is a parameter adjusted to produce a (dimensionless) beta function with unit amplitude. Rotter and Diesmann (1999) show how a matrix exponential can be used to describe the exact solution to linear ODEs of greater than first order (see appendix A), such as alpha and beta function dynamics. Due to their second-order dynamics, the matrix exponential for either the alpha (see equation A.2) or beta function (see equation A.3) consists of two exponential decay terms on the diagonal and a third off-diagonal term. This fact provides us with a simple method for implementing either function by deploying in cascade two of the exponential decay elements detailed in section 3.1.1. There is, however, a nonzero off-diagonal term in both matrices, which acts as a scaling factor that must ordinarily be applied when coupling the two exponential decay elements. This matrix element is in fact a constant factor and so could be implemented by means of a multiplier in our circuit. Yet to keep the design as simple as possible, this constant factor may instead be applied directly to the weight of the synapse. Adjusting the synaptic weight in this way does not change the dynamics of the function but eliminates the need for multipliers between exponential decay elements, resulting in a simpler circuit. Thus, the resulting adjusted weight for the alpha function is

w_{adj} = e^{-\Delta/\tau_\alpha} w,  (3.8)
Figure 2: Alpha and beta function synapse implementation. (A) Block diagram of an alpha and beta function generator implemented by connecting in cascade two exponential decay synapses. When \tau_{e1} = \tau_{e2}, the function generated is an alpha function; otherwise, it is a beta function. Each time an action potential arrives at the synapse, both the adjusted weight and the value of \rho are added to the circuit. The parameter \rho is used only when the alpha or beta function synapse is fitted to a soma circuit to produce a combined neuron model; in such cases, the value of \rho is determined using equations 3.12 and 3.13. When simulating isolated alpha or beta function synapses, \rho has no effect in the circuit and is set to \rho = 0, so that no value of \rho is added during spikes, whereas w_{adj} is determined by equation 3.8 or 3.9, depending on the type of synapse. (B) A single spike event at time t = 0 occurs at the input of the synapse. (C) An alpha function is generated as the real-time response to the input. A 32-bit representation was used in the circuit with parameters: clock frequency f_{clk} = 100 MHz, step time \Delta = 10 ns, k_t = 1953, \tilde{\tau}_\alpha = 2.56 µs, \rho = 0, and adjusted weight w_{adj} = 64.
whereas for the beta function, it is

w_{adj} = \frac{w \tau_{\beta 1} \tau_{\beta 2}}{\tau_{\beta 1} - \tau_{\beta 2}} \left( e^{-\Delta/\tau_{\beta 1}} - e^{-\Delta/\tau_{\beta 2}} \right).  (3.9)
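The cascade construction can be checked numerically: feeding the output of one exponential decay stage into a second stage with the same time constant yields alpha function dynamics, with the discrete response y_k = w k r^{k-1} peaking one time constant after the spike. A minimal floating-point sketch (illustrative Python; it ignores the integer scaling and adjusted weight of the circuit, and the parameter values are arbitrary):

```python
import math

dt = 1e-4
tau = 1e-2                  # both stages share this time constant (alpha case)
r = math.exp(-dt / tau)     # exact per-step decay factor
w = 1.0

x, y = 0.0, 0.0             # first- and second-stage outputs
history = []
for k in range(1000):
    y = r * y + x           # second stage driven by the first stage's output
    x = r * x + (w if k == 0 else 0.0)   # single spike arrives at k = 0
    history.append(y)

# The cascade output rises and falls like t * exp(-t/tau); its maximum
# occurs at k = tau/dt = 100 steps after the spike.
peak_step = history.index(max(history))
print(peak_step)   # → 100
```

With distinct time constants in the two stages, the same cascade produces the difference-of-exponentials shape of the beta function instead.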
The schematic for this class of synapse is shown in Figure 2A. In this case, action potentials, acting as input, are received at the first exponential decay element, while the output from the second element follows alpha and beta function dynamics. The parameter \rho, shown in the figure, will prove to
be an important parameter when the circuit is used in combination with a soma circuit to reproduce the membrane potential of a neuron, but it is not relevant for isolated synapses and can then be set to zero. Figure 2C shows an alpha function generated by this circuit as a response to a single input spike applied at time t = 0 (see Figure 2B). Note that we have again adjusted each time constant according to equation 3.4 to guarantee an EI solution.

3.2 Soma Model. The popular integrate-and-fire model that receives synaptic input of the form I(t) has a membrane potential, V(t), the dynamics of which are described in between generated action potentials by

\dot{V}(t) = -\frac{V(t)}{\tau_m} + \frac{I(t)}{C_m},  (3.10)
where the membrane capacitance, C_m, is a constant and \tau_m is the characteristic time constant for the cell. Once V(t) reaches a fixed threshold potential, V_{th}, say at time t', V(t')^- = V_{th}, the soma resets to the afterhyperpolarization potential, V(t')^+ = V_{ahp}, a spike is generated by the soma, and integration of the current input continues. Again using the FE approach, the solution of equation 3.10 may be approximated as

V_{k+1} = V_k \left(1 - \frac{\Delta}{\tau_m}\right) + \frac{\Delta}{C_m} I_k,  (3.11)
resulting in a similar implementation to that of the exponential decay synapse, except that the dendritic input is added at each time step. In the next section, we show how, again, EI can be implemented within the FE scheme if the dynamics of I are appropriately taken into account. The RTL description for the soma implementation, complete with its spike-generating mechanism, is shown in Figure 3A. A comparator detects threshold crossings and gates a single flip-flop, producing a 0 → 1 → 0 transition on the axon output lasting one clock cycle. If required, a fixed input bias may be added at each time step in order to determine the resting potential of the cell. As before, the bit count held in the soma potential register, B, is a linearly scaled representation of the soma potential, such that V(t) = B/k_V. The response of the soma model to two current pulses of different magnitudes (shown in Figure 3B) is shown in Figure 3C.

Figure 3: Integrate-and-fire soma implementation. (A) RTL description of the soma architecture. Somatic input current, I(t), the afterhyperpolarization (reset) potential, V_{ahp}, and the threshold potential, V_{th}, are each represented by a p-bit signed integer (thick, solid lines). A fire event (spike) is represented by a 0 → 1 transition on the soma output line lasting a single clock cycle. (B) Somatic current pulses (see Rotter & Diesmann, 1999, sec. 3.2.3, for the appropriate EI coefficient) of I = 250 and 350 bit count, respectively, lasting and separated by 4 µs, were used to test the soma implementation. (C) Real-time soma response to the current input for the FPGA implementation with a speed-up factor k_t = 7813. For a soma with a threshold of 20 mV above resting potential, the soma potential bit count, B, can be converted to soma potential via V(t) = B/k_V, where k_V = 2 × 10^6 counts V^{-1}. A 32-bit representation was used with clock frequency f_{clk} = 100 MHz and step time \Delta = 10 ns, resulting in \tilde{\tau}_m = 2.56 µs, and V_{th} = 40,000, which is indicated by the horizontal dotted line.

3.3 Combined Neuron Model. We have now developed programmable logic circuits that implement EI solutions for all the main neuronal modeling components described by equations 3.1, 3.6, 3.7, and 3.10. These synapse and soma circuits must next be combined in such a way as to obtain an EI solution for the complete neuron model. This process is not immediate, since, unfortunately, combining individual elements with EI performance does not guarantee EI system performance when coupled together. Thus, some important considerations must be taken into account before connecting these components. First, we use the matrix exponential of the combined system given by Diesmann, Gewaltig, Rotter, and Aertsen (2001), as in equation B.1. In this case, the matrix exponential describes the state-space representation of a combined neuron comprising synapse and soma. We see from this representation that a simple cascade connection of these components will not suffice, since there exist nonvanishing off-diagonal elements that act as constant factors between stages, again requiring the use of multipliers. In order to optimize the combined neuron model for the case of alpha function synapse and soma, we apply a linear transformation (see appendix B), which permits the direct cascade of synapse and soma
elements without the need for coupling factors. An important consequence of this transformation, however, is the necessity to again adjust the synaptic weight and also to apply a constant additive factor (\rho) between stages, according to

w_{adj} = \frac{\tau_\alpha \tau_m\, e^{-\Delta/\tau_\alpha} \left( e^{-\Delta/\tau_\alpha} - e^{-\Delta/\tau_m} \right)}{C(\tau_\alpha - \tau_m)}\, w,  (3.12)

and

\rho = \frac{\tau_\alpha \tau_m \left( \Delta(\tau_\alpha - \tau_m)\, e^{-\Delta/\tau_\alpha} - \tau_\alpha \tau_m \left( e^{-\Delta/\tau_\alpha} - e^{-\Delta/\tau_m} \right) \right)}{C(\tau_\alpha - \tau_m)^2}\, w.  (3.13)
These parameters may again be calculated off-line to avoid the need for multipliers in the circuit, yielding the far simpler combined neuron circuit shown in Figure 4A. Using similar arguments for the exponential decay synapse and soma case (see appendix C), we can again find an adjusted weight that results in combined EI performance:

w_{adj} = \frac{w \tau_e \tau_m}{C(\tau_e - \tau_m)} \left( e^{-\Delta/\tau_e} - e^{-\Delta/\tau_m} \right).  (3.14)
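Structurally, the combined update is the synapse iteration (equation 3.3) feeding the soma iteration (equation 3.11), with threshold-and-reset logic on top. A behavioral sketch in Python (illustrative only: parameter values are arbitrary, floating point replaces the circuit's integer arithmetic, and the simple per-step coupling used here stands in for the exact-coupling weight adjustment of equation 3.14):

```python
import math

dt, tau_e, tau_m = 1e-4, 5e-3, 2e-2   # step and time constants (example values)
C = 1.0                               # membrane capacitance (example value)
V_th, V_ahp = 0.005, 0.0              # threshold and reset (example values)
w = 2.0                               # synaptic weight (example value)

de = math.exp(-dt / tau_e)            # exact per-step synapse decay factor
dm = math.exp(-dt / tau_m)            # exact per-step soma decay factor

I, V = 0.0, 0.0
spikes_out = []
for k in range(5000):
    if k % 200 == 0:                  # regular presynaptic input train
        I += w
    V = dm * V + (dt / C) * I         # soma integration (equation 3.11 form)
    I = de * I                        # synapse decay (equation 3.3 form)
    if V >= V_th:                     # threshold crossing: emit spike, reset
        spikes_out.append(k)
        V = V_ahp

print(len(spikes_out))
```

With these example values the driven membrane potential repeatedly crosses threshold, producing a regular output spike train, mirroring the behavior shown in Figures 4B to 4D.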
Now that we have determined the conditions required to perform an exact integration in each case, synapse and soma may be combined to form a self-contained, single clock cycle, spiking neuron implementation. Figure 4A shows a generalized multisynapse scheme for a single neuron. The neuron comprises r synapses, which receive spikes from different presynaptic cells, generating dendritic currents that are summed in a single clock cycle and integrated to produce the membrane potential of the soma. In general, a greater number of bits will need to be deployed at the soma than at the synapse, since integration of the input occurs at each time step (not just after spike arrival) and multiple synapse inputs may be summed. In order to minimize the total number of bits required in the soma, we can limit the maximum and minimum weights of the synapses to avoid overloading the soma, which, as we will see, is also a desirable characteristic for the learning algorithm we consider later. In order to demonstrate the EI performance of the combined designs, we tested an exponential decay type synapse and soma. Input spikes were applied to the combined neuron model as seen in Figure 4B, giving rise to the membrane potential shown in Figure 4C. We see that in this case, the membrane potential crosses the threshold (V_{th}) three times, generating the same number of spikes (see Figure 4D).
Figure 4: Combined neuron implementation. (A) RTL description of the combined neuron architecture. Spiking input is represented by 0-1 logic levels received at r synapses. The weights, w_i, for synapses (i = 1, \ldots, r), the afterhyperpolarization (reset) potential, V_{ahp}, and the threshold potential, V_{th}, are represented by m-bit unsigned integers (thick solid lines). (B) Spiking input at a regular firing frequency, f_{sp} = 200 kHz, used to test the combined neuron implementation (only one synapse input, i = r = 1, was considered). (C) Real-time soma response to the synapse input for the FPGA combined neuron implementation. (D) Spikes generated by the combined neuron model in response to the spiking input defined in B. A 32-bit representation was used for both circuits with clock frequency f_{clk} = 100 MHz, step time \Delta = 10 ns, and k_t = 7813. For the synapse, a time constant of \tilde{\tau}_e = 2.56 µs and adjusted weight w_{adj} ≈ 0.61681; for the soma, \tilde{\tau}_m = 2.56 µs, C = 12 pF, and V_{th} = 60,000, which is again indicated by the horizontal dashed line.
3.4 Axonal Delays. Axonal delays play an important role in the dynamics of spiking neuron network models (Crook, Ermentrout, Vanier, & Bower, 1997) and are simple to implement in our design. In general, axonal delays will not be constant across all cells of the network. Instead, each axon should have associated with it a particular delay (see Figure 5A). When a spike occurs at a soma output, it should not be delivered instantly to the target synapse. Rather, the action potential must be presented some n time steps after the spike occurred. In our design, axonal delays are implemented using a ring buffer (see Figure 5B). The spike history (up to time n) from each soma is stored in the ring buffer, and at each time step, the nth element is transmitted to the target synapse while the current state of the soma is written to the buffer. Ring buffers have the advantage that spike history data are automatically overwritten as they become redundant.
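The ring buffer scheme above can be sketched in a few lines (illustrative Python; the FPGA uses a hardware register file rather than a list, and the class and method names here are hypothetical):

```python
class AxonalDelay:
    """Fixed axonal delay of n time steps, implemented as a ring buffer.

    At each step, the oldest entry is read out (the spike emitted n
    steps ago) and the soma's current output overwrites it, so stale
    history is discarded automatically.
    """

    def __init__(self, n):
        self.buf = [0] * n
        self.idx = 0

    def step(self, soma_out):
        delayed = self.buf[self.idx]      # spike emitted n steps ago
        self.buf[self.idx] = soma_out     # overwrite with current soma state
        self.idx = (self.idx + 1) % len(self.buf)
        return delayed

axon = AxonalDelay(3)
outputs = [axon.step(s) for s in [1, 0, 0, 0, 1, 0, 0, 0]]
print(outputs)   # → [0, 0, 0, 1, 0, 0, 0, 1]
```

Each input spike reappears at the output exactly n steps later, with storage cost fixed at n entries regardless of how long the simulation runs.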
Figure 5: Axonal delay and STDP learning implementation. (A) Each axon is programmed to have a specific delay (defined by D_n for the nth axon), which is implemented using a ring buffer. (B) Action potentials are read from the ring buffer at time t; the synapses of other neurons receive the delayed spikes according to the programmed delay for each neuron. (C) RTL design of the STDP learning unit. Superpositions of exponential decays with height equal to the amount of weight modification (W_+ to increment or W_- to decrement) are generated; the order of presynaptic and postsynaptic fire times defines both the amount and the sign of the weight modification. (D) Neuron model with STDP learning. The STDP unit receives presynaptic spikes arriving at a specific synapse, as well as the action potentials generated by its respective soma. The adjusted weight is fed to the synapse every time a presynaptic or postsynaptic spike takes place.
3.5 Learning by STDP. There are many Hebbian (correlation-based) plasticity mechanisms that can be used for learning purposes in spiking neuron models (Abbott & Nelson, 2000; Gütig, Aharonov, Rotter, & Sompolinsky, 2003). One of the most common of these is spike-timing-dependent plasticity (STDP), proposed by Song, Miller, and Abbott (2000). This plasticity mechanism relies on the difference between presynaptic and postsynaptic spike times to adjust the strength of excitatory synapses. As an example of how plasticity mechanisms may be conveniently combined with our neuronal element designs for on-chip learning in real time, we have implemented STDP with our synapses. Figure 5C shows the RTL design of the STDP unit, which contains two exponential decay elements reused from our synapse designs. The exponential decay unit situated in the upper section of the diagram receives presynaptic spiking input in the same way as the synapse itself. Each time a spike is received at this exponential decay unit, a value equal to the maximum amount of change of the synapse strength (W_+) is added to its output, while during latency periods, this value decays exponentially according to equation 3.1. Once a spike is generated in the postsynaptic neuron, the strength of the synapse is increased by the current value output by this exponential decay unit. The second exponential decay unit behaves in the opposite way. That is, the second element has as input the postsynaptic action potentials, which add to its current value an amount equal to the maximum possible change of synapse weakening (W_-). When a presynaptic action potential is received, the current value of this exponential decay unit is subtracted from the weight of the synapse. In this way, synapses compete to control the postsynaptic spike timing of the neuron. Figure 5D shows the implementation of the STDP unit in the generalized neuron model. The weight of the synapse is limited to the range [0, W_{max}].
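The two-trace mechanism just described can be sketched as follows (illustrative Python; the values of W_+, W_-, and the trace decay are arbitrary, and the hardware operates on integers rather than floats):

```python
import math

W_plus, W_minus, W_max = 0.01, 0.011, 1.0
decay = math.exp(-1.0 / 200.0)   # per-step decay of both STDP traces (example)

def run_stdp(pre_times, post_times, steps, w=0.5):
    """Pair-based STDP: the pre trace gates potentiation at post spikes,
    the post trace gates depression at pre spikes; the weight is clamped
    to [0, W_max]."""
    pre_trace = post_trace = 0.0
    for k in range(steps):
        pre_trace *= decay
        post_trace *= decay
        if k in pre_times:
            pre_trace += W_plus            # remember the recent pre spike
            w = max(0.0, w - post_trace)   # depress: post fired before pre
        if k in post_times:
            post_trace += W_minus          # remember the recent post spike
            w = min(W_max, w + pre_trace)  # potentiate: pre fired before post
    return w

w_pot = run_stdp(pre_times={10}, post_times={20}, steps=30)   # pre -> post
w_dep = run_stdp(pre_times={20}, post_times={10}, steps=30)   # post -> pre
print(w_pot, w_dep)
```

A pre-before-post pairing increases the weight and a post-before-pre pairing decreases it, with the magnitude shrinking exponentially as the spike-time difference grows.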
Stable synaptic modification requires that W+ < W− < Wmax, and experimental results suggest that W−/W+ ≈ 1.1 (Song et al., 2000). The maximum weight Wmax should be chosen so as not to overload the soma.

4 Results

RTL designs for each neuronal component were implemented in VHDL and executed on a PCI-based FPGA development board (model ADM-XRCII Pro, manufactured by Alpha Data Systems, UK), which hosts a Xilinx Virtex II Pro device (product code XC2VP100-5). Numerical results were tested against the equivalent exact numerical solution as calculated on a standard PC with an Intel Pentium IV running at a clock speed of 3.06 GHz with 1 GB of external memory, programmed in C++. All responses shown in Figures 1B and 1C, 2B and 2C, and 3B and 3C were compared against their corresponding EI solution and the differences plotted in Figure 6 (left). In all cases, the errors show a random behavior about zero, which suggests that the difference is due to round-off as a result of the finite-length number
R. Guerrero-Rivera, A. Morrison, M. Diesmann, and T. Pearce
representation. This conclusion is confirmed by the histograms of the error (see Figure 6, right), which show that in all cases the distributions closely resemble the shape of the uniform probability density function, which is the expected behavior generated by a round-off process (Barnes, Tran, & Leung, 1985). With increasing time, the response error becomes more regular as the output of the circuits approaches zero, since the time derivative becomes very small (itself an exponential decay). In the case of the alpha function synapse (see Figure 6B), although we see a random uniform distribution about zero of ±0.5 bits for each of the two stages, when combined, the two errors may accumulate in the positive direction, leading to an error greater than ±0.5 bits for a single neuron. However, the total error will never exceed ±1 bit and is not systematic, since the asymptotic behavior is toward zero. An additional experiment was carried out using a combined neuron model comprising an alpha function synapse and a soma, which was excited by a single spike at time t = 0. The membrane potential was again compared against the numerical solution given by the EI method. The error between the circuit implementation and the EI solution shows the same behavior as in the preceding case: a uniform distribution due to the round-off process (see Figure 6D). To further test the numerical performance and robustness of our designs, a combined neuron simultaneously integrating signals from five exponential decay synapses driven by Poisson processes (see Figure 7A) was implemented. The total resulting synaptic current is shown in Figure 7B. The difference between the membrane potential obtained by the EI scheme and the value generated by the circuit (see Figure 7C) again shows a random behavior about zero, ruling out any systematic error in the circuit behavior
Figure 6: Differences between the numerical solution and the circuit responses. Difference over time (left) and histograms of the corresponding error distributions (right) for (A) the exponential decay synapse response, (B) the alpha function synapse response, (C) the soma response, and (D) the combined neuron model made up of an alpha function synapse and an integrate-and-fire model. The errors correspond to the responses shown in Figures 1 to 3, except for the combined neuron model, whose response comes from a single spike at t = 0. The parameters of the first three circuits are the same as the ones specified for each response, while for the combined model the parameters are τ˜α = 25.6 µs, τ˜m = 4.096 ms, C = 250 pF, adjusted weight wadj ≈ 0.5089, and ρ ≈ 0.2553. All errors show the same characteristics: a random behavior about zero and a uniformly distributed probability density function. The clock frequency was set to fclk = 100 MHz, giving a time step of Δ = 10 ns. Numerical solutions were carried out in C++ using long double precision, an 80-bit representation with 64 bits for the mantissa and 15 for the exponent.
(see Figure 7D). In this case, 308 action potentials were generated by the circuits (see Figure 7E), exactly the number of spikes obtained through the EI scheme. Critically, there were no differences in firing times between the circuit and the numerical solution: all spikes were time coincident within a single clock cycle (see Figure 7F). This confirms that for the purposes of simulating integrate-and-fire neuron and exponential decay–based synapse dynamics, our circuit produces EI performance under realistic simulation conditions, limited only by the restricted bit length of the representation, which is the case for any digital neuronal model implementation. To further test the comparative numerical performance of our FPGA implementation against PC technology, we deployed a medium-scale neuronal model described previously (Pearce et al., 2005) and tested it against its software equivalent. Briefly, this model was composed of 100 integrate-and-fire somas and 675 exponential decay synapses describing two classes of neuron in the mammalian olfactory bulb (see Figure 8A). Mitral/tufted (M/T) neuron outputs in the vertebrate olfactory system are under tight regulation through lateral inhibition, mediated by dendrodendritic interaction. This interaction effects complex synchronous firing behavior across the bulb output that is stimulus specific, reminiscent of that observed in electrophysiological studies (Friedrich & Laurent, 2001; Friedrich, Habermann, & Laurent, 2004) and demonstrated in our model (see Figures 8B and 8C). Due to the similarity of synapse time constants in this model, it was possible to implement the design on a single Virtex II Pro device (Xilinx) running at a very conservative clock speed of 33 MHz.
Figure 7: Numerical error for the combined neuron. (A) Random spike trains applied at the input of each synapse. (B) Total synaptic current generated by the circuit. (C) The error between the exact integration numerical implementation and the circuit response for the membrane potential again shows a random behavior about zero. (D) Histogram of the error distribution, whose uniform shape corroborates round-off effects as the only cause of the error. (E) Over 300 spikes were generated by the soma. (F) Spike timing for both the exact integration numerical implementation and the circuit response. All spikes from the circuit were exactly coincident with the spike times given by the numerical solution. An m-bit representation was used with parameters τ˜e1 = 5.12 µs, τ˜e2 = 2.56 µs, τ˜e3 = 1.28 µs, τ˜e4 = 0.64 µs, τ˜e5 = 1.28 µs, and adjusted weights wadj1 ≈ 0.077026, wadj2 ≈ 0.077101, wadj3 ≈ 0.077253, wadj4 ≈ 0.077558, and wadj5 ≈ 0.077253, whereas for the soma τ˜m = 2.56 µs, capacitance C = 12 pF, and threshold potential Vth = 65,000. The clock frequency was set to fclk = 100 MHz, giving a step time Δ = 10 ns and a speed-up factor kt = 7813. Numerical solutions were carried out in C++ using long double precision, an 80-bit representation with 64 bits for the mantissa and 15 for the exponent.
The same network was coded in C++ using the same set of equations as implemented by the hardware, compiled using GNU gcc, and executed on an Intel Pentium IV 3.06 GHz with 1 GB RAM. As before, the error between the circuit and the software implementation is within a single bit and uniformly distributed (see Figure 8D). Spike timing analysis reveals that no discrepancies arise between the FPGA- and PC-based solutions.

5 Conclusion

Designs for a leaky integrate-and-fire soma and exponential, alpha, and beta function synapses with STDP learning and axonal delays are presented in this article, all suitable for implementation in programmable logic. Together these designs provide a neuronal modeling construction kit of the most popular elements that can be deployed in arbitrary configurations for
Figure 8: Simplified olfactory bulb neuronal model implementation comprising 100 somas and 675 exponential decay synapses. (A) Schematic of the olfactory bulb architecture. Twenty-five mitral/tufted (M/T) cells, represented by integrate-and-fire elements, provide the main olfactory bulb output (corresponding to the lateral olfactory tract). These cells are reciprocally connected by exponential decay synapses, which mediate lateral inhibition in the model, representing inhibitory coupling between M/T cells in vertebrates (open circles: excitatory synapse; closed circles: inhibitory synapse). Excitatory input to the model is provided by olfactory receptor neurons (ORNs) driven by population-coded receptor input shown by irregular polygons. Synaptic input from identical ORNs is summed to represent receptor input convergence at a single glomerulus (ellipse) (after Pearce et al., 2005). (B) Firing behavior of all 25 M/T cells in the network over time. (C) Membrane potential of a randomly selected M/T cell. (D) The error between the exact integration numerical implementation and the circuit response for the membrane potential, and spike timing analysis. A 32-bit representation was used with parameters τ˜eE ≈ 2.1 ms for the excitatory synapses and τ˜eI ≈ 17.1 ms for the inhibitory synapses. The adjusted weights for the excitatory synapses were fixed to wadjE = 256.8, whereas the inhibitory weights wadjI were randomly chosen in a range between 1.5 and 32, inclusive. The time constants of both the ORN and mitral cells were set to τ˜m ≈ 10.2 ms, with a capacitance C = 250 pF and threshold potential Vth = 432,640, where kV = 21.6 × 10^6 counts V−1. For the ORNs, constant values randomly selected in a range between 4240 and 6400 were used to represent a constant concentration of an odor stimulus. The clock frequency was set to fclk = 33 MHz, giving a step time Δ ≈ 30.3 ns and a speed-up factor kt = 3300.
Numerical solutions were carried out in C++ using long double precision, an 80-bit representation with 64 bits for the mantissa and 15 for the exponent.
hyper-real-time implementation. This programmable logic construction kit supports the building of large and complex architectures of spiking neuron networks by means of an efficient communication and multiplexing scheme for the neuronal elements. The circuits proposed here have the advantage that they work in parallel rather than depending on serial computation, making them capable of simulating neuronal models in hyper real time. The source codes (VHDL) for these designs have been made freely available for download at http://www.neurolab.le.ac.uk/fpga/. The implementation we present is restricted to the class of integrate-and-fire models where synapses are described by time-dependent currents. With conductance-based synapses, however, only the differential equation for the membrane potential ceases to be linear time invariant. Future work will need to explore whether an implementation with exact integration (EI) for the conductances and an approximate method for the membrane equation (e.g., FE as described in section 2) can be effectively combined. Our circuit designs have been demonstrated to perform exact integration in a single clock cycle and are also self-contained (no shared resources). The advantage of single clock cycle operation is that designs may operate faster than biological timescales (milliseconds); depending on their complexity and the extent of optimization, processing speeds may be able to approach the clock frequency of the FPGA. This hyper-real-time operation provides us with the opportunity of multiplexing the physical neuron architecture to achieve far greater neuron numbers. This is made possible since only digital spike events need to be communicated between neurons, thereby simplifying connectivity circuitry.
Thus, programmable logic offers neuronal simulation speeds approaching those of fully custom silicon solutions, but not at the expense of flexibility, design times, or network capacity, since once the design process outlined in section 1 is complete, large-scale models may be downloaded and adapted in milliseconds. Most important, by using fully parallel single clock cycle implementations, our neuronal modeling approach leverages the impressive advances in programmable logic in terms of clock speeds and device capacities, specialized DSP components, cost, and the ongoing development of advanced design tools. FPGA capabilities and operating speeds are currently growing exponentially. Over the past 10 years, capacity has increased more than 200-fold, with every indication that Moore's law will continue for these devices some way into the future (Alfke & Hitesh, 2005). This should ensure the possibility of growth in neuronal modeling capability that is commensurate with advances in programmable logic. We can estimate the total neuronal capacity of any given FPGA device by calculating the amount of resources each neuronal element would require. FPGA resources are measured in terms of slices, each of which contains the fundamental digital elements required to implement arbitrary combinational logic functions (Xilinx, 2002). Thus, the number of slices required per neuron element limits the total physical numbers implementable on a
single device. Using a 16-bit representation for the synapse and a 32-bit representation for the soma requires a total of 70 slices for the (exponential decay) synapse circuit and 90 slices for the soma. Current FPGA devices exceed 50,000 slices. Therefore, it is possible to configure, for instance, over 500 physical synapses without STDP (or 250 with STDP) and 100 physical somas, running in parallel with single clock cycle iteration. Such an arrangement would be ideal for small and medium network designs for which we require hyper-real-time operation in order to understand or optimize their performance in different portions of parameter space. Over 100,000 prototypes of the same network could be simulated in the time taken for one iteration of the biological network it represents. Hence, programmable logic provides a powerful technology for understanding and optimizing small and medium spiking neuronal models. Synapses with a common target soma often have similar time constants in the brain. We can exploit this fact to make further gains in total synapse numbers by using only one physical synapse and convolving the incoming spikes of many virtual synapses. This is mathematically equivalent to implementing large numbers of separate synapses with identical time constants. In this way, a very large degree of synaptic convergence can be achieved, which is particularly relevant to cortical models. In order to test the comparative speed performance of our FPGA implementation against a single PC, we ran parallel trials of the olfactory bulb model (see Figure 8) on both PC and FPGA. The identical network was coded in C using the same set of equations as implemented by the hardware, compiled using GNU gcc, and executed on an Intel Pentium IV 3.06 GHz with 1 GB RAM. The PC performed 10,000 numerical iterations of the network in 0.76 s without disk access.
Due to single clock cycle numerical iteration, the FPGA performed the same number of steps on-chip in 303.03 µs, a speed-up factor of about 2500 (three to four orders of magnitude), which may be further improved either by optimizing the placement and routing of the logic of the design implementation or by employing a faster FPGA (e.g., Virtex IV devices offer faster internal speeds and shorter propagation delays; Xilinx, 2004). In further experiments, we were able to increase the clock frequency up to 50 MHz. In both cases, about 55% of the FPGA resources were utilized. We may also choose to exploit this hyper-real-time operation to increase network sizes through the use of a multiplexing scheme. In this case, we create virtual exponential decay units by storing and replacing the current state at each numerical step and, optionally, the associated parameters such as weight, afterhyperpolarization potential, threshold potential, and time constant. This requires storage either on or off chip, which, depending on the access time, will determine the communication overhead associated with applying a multiplexing scheme. FPGAs have on-chip RAM block storage with parallel-access single clock cycle operation, which minimizes this overhead (the Xilinx XC2VP125 device
has 556 such blocks, which may be variously configured). In principle, for this Xilinx device, we can use 556 1024 × 18-bit buffers to create 10^5 to 10^6 virtual exponential decay elements operating in biological time (18-bit precision). However, in practice, there are two issues that must be addressed when using FPGAs to build large-scale neuronal models that exploit such multiplexed architectures. First, at each numerical step, we must store, update, and replace the state of each element. This imposes a time overhead of at least one additional clock cycle per numerical iteration step to move the data to and from on-chip memory, which reduces the total number of virtual elements that can be operated in biological time. Second, we must consider connectivity between neurons, which imposes constraints on the scale of architectures that can be deployed on a single device. Cortical models represent a particular challenge due to highly convergent synaptic input, often thousands of synapses targeting each neuron. Since most cortical models adopt only one or two time constants for these convergent synapses, arriving input spikes may be convolved through only one virtual synapse for each time constant, leading to a massive reduction in resource use. Such scaling issues represent an important focus of future research for implementing FPGA neuronal models to achieve cortical model scales in real time.

Appendix A: Alpha and Beta Function Matrix Exponentials

The general solution of the exact integration (EI) scheme (Rotter & Diesmann, 1999) for an nth-order system of linear time-invariant ODEs is described by

y_{k+1} = e^{A\Delta} y_k, \quad y(0) = y_0, \quad (A.1)
where e^{A\Delta} is the matrix exponential of the system and \Delta is the time step. The diagonal elements of this matrix describe the step dynamics for each of the exponential decay stages to be used as part of the solution, which can be cascaded to simulate the nth-order system. However, to maintain an exact solution, the remaining off-diagonal elements must be applied as coupling factors between the cascaded stages.

A.1 Matrix Exponential of the Alpha Function Synapse.

P_\alpha = e^{A\Delta} = \begin{pmatrix} e^{-\Delta/\tau_\alpha} & 0 \\ \Delta\, e^{-\Delta/\tau_\alpha} & e^{-\Delta/\tau_\alpha} \end{pmatrix}. \quad (A.2)
A.2 Matrix Exponential of the Beta Function Synapse.

P_\beta = e^{A\Delta} = \begin{pmatrix} e^{-\Delta/\tau_{\beta 1}} & 0 \\ \dfrac{\tau_{\beta 1}\tau_{\beta 2}\left(e^{-\Delta/\tau_{\beta 1}} - e^{-\Delta/\tau_{\beta 2}}\right)}{\tau_{\beta 1}-\tau_{\beta 2}} & e^{-\Delta/\tau_{\beta 2}} \end{pmatrix}. \quad (A.3)
Appendix B: Matrix Exponential of the Combined Neuron Model Based on Alpha Function Synapse

B.1 Matrix Exponential of the Combined Neuron Model Based on Alpha Function Synapse.

P_m = e^{A\Delta} = \begin{pmatrix} e^{-\Delta/\tau_\alpha} & 0 & 0 \\ \Delta\, e^{-\Delta/\tau_\alpha} & e^{-\Delta/\tau_\alpha} & 0 \\ \dfrac{\tau_\alpha\tau_m\left((\tau_\alpha-\tau_m)\,\Delta\, e^{-\Delta/\tau_\alpha} - \tau_\alpha\tau_m\left(e^{-\Delta/\tau_\alpha}-e^{-\Delta/\tau_m}\right)\right)}{C(\tau_\alpha-\tau_m)^2} & \dfrac{\tau_\alpha\tau_m\left(e^{-\Delta/\tau_\alpha}-e^{-\Delta/\tau_m}\right)}{C(\tau_\alpha-\tau_m)} & e^{-\Delta/\tau_m} \end{pmatrix}. \quad (B.1)

B.2 Linear Transformation of the Matrix Exponential of the Combined Neuron Model Based on Alpha Function Synapse. Here we describe a linear transformation that may be applied to the matrix exponential in order to simplify the combined neuron circuit. The goal of this transformation is to reduce the hardware required to implement the combined neuron model by removing the requirement for multipliers. First, for simplicity, let us rewrite the matrix exponential (P_m = e^{A\Delta}) in a more general form as
P_m = \begin{pmatrix} \alpha & 0 & 0 \\ r & \beta & 0 \\ q & p & \gamma \end{pmatrix}, \quad (B.2)

where \alpha, \beta, and \gamma define the three exponential decay circuits connected in cascade and p, q, and r are constant coupling factors, which call for the use of multipliers. We express the matrix exponential in the form

\begin{pmatrix} \alpha & 0 & 0 \\ 1 & \beta & 0 \\ 0 & 1 & \gamma \end{pmatrix}, \quad (B.3)
which means that the resulting combined neuron model circuit will be made up of three exponential decay elements connected in cascade but, crucially, with no multiplication factors coupling them. The best way to achieve this
form without altering the dynamics is by applying a linear transformation. We must find a linear transformation that, when applied to a matrix exponential such as equation B.1, yields a transformed matrix exponential of the form shown in equation B.3. The following transformation matrix fulfills this requirement:
Q = \begin{pmatrix} b & 0 & 0 \\ f & c & 0 \\ 0 & 0 & d \end{pmatrix}. \quad (B.4)
Thus, we must now find the values of b, c, d, and f that transform equation B.1 into the form of equation B.3. Applying this transformation matrix to the general solution, equation A.1, gives

y = Qz, \quad (B.5)

which gives rise to the linear transformation of equation A.1,

z_{k+1} = Q^{-1} P_m Q \, z_k, \quad z(0) = z_0, \quad (B.6)
where

Q^{-1} P_m Q = \begin{pmatrix} \alpha & 0 & 0 \\ -\dfrac{f}{c}\alpha + \dfrac{b}{c}r + \dfrac{f}{c}\beta & \beta & 0 \\ \dfrac{b}{d}q + \dfrac{f}{d}p & \dfrac{c}{d}p & \gamma \end{pmatrix}. \quad (B.7)
As already mentioned, in order to keep the circuit simple, the transformed matrix exponential should be in the form of equation B.3. The following conditions ensure that this is accomplished:

-f\alpha + br + f\beta = c,
bq + fp = 0, \quad (B.8)
cp = d.

To maintain the dynamics of the membrane potential, it is required that d = 1. We can then easily solve for b, c, and f, which results in

b = \frac{-1}{(\beta - \alpha)q - pr}, \quad (B.9)

c = \frac{1}{p}, \quad (B.10)

d = 1, \quad (B.11)

f = \frac{q}{p\left((\beta - \alpha)q - pr\right)}. \quad (B.12)
In turn, the transformed initial conditions for a single presynaptic action potential should then be

z(0) = \begin{pmatrix} \dfrac{1}{b}\,\dfrac{we}{\tau_\alpha} \\ -\dfrac{f}{bc}\,\dfrac{we}{\tau_\alpha} \\ 0 \end{pmatrix}. \quad (B.13)

In the context of the combined circuit, the first and second rows of equation B.13 correspond to the initial conditions of the exponential decay units (in the same order) that produce the alpha function synapse (see Figure 2A). The first row requires us to adjust the synaptic weight according to

w_{adj} = \frac{1}{b}\,\frac{we}{\tau_\alpha}, \quad (B.14)
where b is defined in equation B.9 and w is the synaptic weight. In turn, the second row gives the initial condition that must be applied to the second exponential decay unit of the alpha function synapse, which is carried out by making
\rho = -\frac{f}{bc}\,\frac{we}{\tau_\alpha}, \quad (B.15)
where b, c, and f are defined in equations B.9 to B.12. It is important to point out that ρ is nonzero only in combined neuron models based on alpha and beta function synapses; in other cases, ρ = 0 and can be neglected. Furthermore, since equation A.1 describes the dynamics of the combined neuron model between spikes, both wadj and ρ must be added every time an action potential is received at the synapse (see Figure 2A). Finally, the soma model requires initial conditions set to zero. Meeting these conditions guarantees an EI of a leaky integrate-and-fire model receiving dendritic currents modeled by alpha functions.
Appendix C: Exponential Decay Synapse-Based Combined Neuron Model Matrix Exponential

C.1 Matrix Exponential of the Combined Neuron Model Based on Exponential Decay Synapse.

\begin{pmatrix} e^{-\Delta/\tau_e} & 0 \\ \dfrac{\tau_e\tau_m\left(e^{-\Delta/\tau_e} - e^{-\Delta/\tau_m}\right)}{C(\tau_e-\tau_m)} & e^{-\Delta/\tau_m} \end{pmatrix}. \quad (C.1)
Acknowledgments

This work was funded by the EU-FPV IST program in Future and Emerging Technologies IST-2001-33066 (to T.C.P.) and EPSRC grant GR/R37968/01(P) (to T.C.P.). R.G. is funded by CONACYT, Mexico. A.M. and M.D. are partially funded by DIP F1.2, EU Grant 15879 (FACETS), and BMBF Grant 01GQ0420 to BCCN Freiburg. Special thanks go to Manuel Sánchez-Montañés, with whom we had many useful discussions regarding exact integration solutions and axonal delay implementations. Thanks also go to Carlo Fulvi-Mari, who provided the analytical expression for the asymptotic peak synaptic output.

References

Abbott, L. F., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nature Neuroscience, 3, 1179–1183.
Alfke, P., & Hitesh, P. (2005). Achieving breakthrough performance with Virtex-4, the world's fastest FPGA. (Online conference.) Accessed February 2005 at http://www.techonline.com/community/ed resource/webcast/37558.
Barnes, C. W., Tran, B. N., & Leung, S. H. (1985). On the statistics of fixed-point roundoff error. IEEE Trans. Acoustics, Speech, and Signal Proc., 33, 595–606.
Crook, S. M., Ermentrout, G. B., Vanier, M. C., & Bower, J. M. (1997). The role of axonal delay in the synchronization of networks of coupled cortical oscillators. J. Comp. Neurosci., 4, 161–172.
Diesmann, M., Gewaltig, M., Rotter, S., & Aertsen, A. (2001). State space analysis of synchronous spiking in cortical neural networks. Neurocomputing, 38–40, 565–571.
Friedrich, R. W., Habermann, C. J., & Laurent, G. (2004). Multiplexing using synchrony in the zebrafish olfactory bulb. Nat. Neuroscience, 7, 862–871.
Friedrich, R. W., & Laurent, G. (2001). Dynamic optimization of odor representations by slow temporal patterning of mitral cell activity. Science, 291, 889–894.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Graas, E. L., Brown, E. A., & Lee, R. H. (2004). An FPGA-based approach to high-speed simulation of conductance-based neuron models. Neuroinformatics, 2(4), 417–436.
Gütig, R., Aharonov, R., Rotter, S., & Sompolinsky, H. (2003). Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. J. Neurosci., 23, 3697–3714.
Jack, J. J. B., Noble, D., & Tsien, R. W. (1975). Electric current flow in excitable cells. New York: Oxford University Press.
Lambert, J. D. (1973). Computational methods in ordinary differential equations. New York: Wiley.
Maxfield, C. (2004). The design warrior's guide to FPGAs. Burlington, MA: Newnes.
Pearce, T. C., Koickal, T., Fulvi-Mari, C., Covington, J. A., Tan, F. S., Gardner, J. W., & Hamilton, A. (2005). Silicon-based neuromorphic implementation of the olfactory pathway. In 2005 2nd International IEEE Conference on Neural Engineering (pp. 307–312). Piscataway, NJ: IEEE.
Rotter, S., & Diesmann, M. (1999). Exact digital simulation of time-invariant linear systems with applications to neuronal modeling. Biological Cybernetics, 81, 381–402.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing dependent synaptic plasticity. Nature Neuroscience, 3, 919–926.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Weinstein, R. K., & Lee, R. H. (2006). Architecture for high-performance FPGA implementations of neural models. Journal of Neural Engineering, 3, 21–34.
Xilinx. (2002). Virtex-II Pro: Platform FPGA handbook. San Jose, CA.
Xilinx. (2003). Development system reference guide. San Jose, CA. Retrieved March 17, 2005, from http://toolbox.xilinx.com/docsan/xilinx6/books/docs/dev/dev.pdf
Xilinx. (2004). Virtex-4 family overview. In Virtex-4 handbook. San Jose, CA. Retrieved October 21, 2004, from http://direct.xilinx.com/bvdocs/publications/ds112.pdf
Received August 10, 2005; accepted April 27, 2006.
LETTER
Communicated by Michael Lewicki
Soft Mixer Assignment in a Hierarchical Generative Model of Natural Scene Statistics Odelia Schwartz [email protected] Howard Hughes Medical Institute, Computational Neurobiology Lab, Salk Institute for Biological Studies, La Jolla, CA 92037, U.S.A.
Terrence J. Sejnowski [email protected] Howard Hughes Medical Institute, Computational Neurobiology Lab, Salk Institute for Biological Studies, La Jolla, CA 92037, and Department of Biology, University of California at San Diego, La Jolla, CA 92093, U.S.A.
Peter Dayan [email protected] Gatsby Computational Neuroscience Unit, University College, London WC1N 3AR, U.K.
Gaussian scale mixture models offer a top-down description of signal generation that captures key bottom-up statistical characteristics of filter responses to images. However, the pattern of dependence among the filters for this class of models is prespecified. We propose a novel extension to the gaussian scale mixture model that learns the pattern of dependence from observed inputs and thereby induces a hierarchical representation of these inputs. Specifically, we propose that inputs are generated by gaussian variables (modeling local filter structure), multiplied by a mixer variable that is assigned probabilistically to each input from a set of possible mixers. We demonstrate inference of both components of the generative model, for synthesized data and for different classes of natural images, such as a generic ensemble and faces. For natural images, the mixer variable assignments show invariances resembling those of complex cells in visual cortex; the statistics of the gaussian components of the model are in accord with the outputs of divisive normalization models. We also show how our model helps interrelate a wide range of models of image statistics and cortical processing.

Neural Computation 18, 2680–2718 (2006) © 2006 Massachusetts Institute of Technology

1 Introduction

The analysis of the statistical properties of natural signals such as photographic images and sounds has exerted an important influence over both
sensory systems neuroscience and signal processing. From the earliest days of the electrophysiological investigation of the neural processing of visual input, it has been hypothesized that neurons in early visual areas decompose natural images in a way that is sensitive to aspects of their probabilistic structure (Barlow, 1961; Attneave, 1954; Simoncelli & Olshausen, 2001). The same statistics lie at the heart of effective and efficient methods of image processing and coding. There are two main approaches to the study of the statistics of natural signals. Bottom-up methods start by studying the empirical statistical regularities of various low-dimensional linear or nonlinear projections of the signals. These methods see cortical neurons in terms of choosing and manipulating projections, to optimize probabilistic and information-theoretic metrics (Shannon, 1948; Shannon & Weaver, 1949), such as sparsity (Field, 1987), and efficient coding including statistical independence (Barlow, 1961; Attneave, 1954; Li & Atick, 1994; Nadal & Parga, 1997). In contrast, top-down methods (Neisser, 1967; Hinton & Ghahramani, 1997) are based on probabilistic characterizations of the processes by which the signals are generated and see cortical neurons as a form of coordinate system parameterizing the statistical manifold of the signals. There has been substantial recent progress in bottom-up statistics. In particular, a wealth of work has examined the statistical properties of the activation of linear filters convolved with images. The linear filters are typically chosen to qualitatively match retinal or cortical receptive fields. For example, primary visual cortex receptive fields (e.g., simple cells) are tuned to a localized spatial region, orientation, and spatial frequency (Hubel & Wiesel, 1962). These receptive fields are also closely related to multiscale wavelet decompositions, which have gained wide acceptance in the computational vision community.
For typical natural images, empirical observations of a single linear filter activation reveal a highly kurtotic (e.g., sparse) distribution (Field, 1987). Groups of linear filters (coordinated across parameters such as orientation, frequency, phase, or spatial position) exhibit a striking form of statistical dependency (Wegmann & Zetzsche, 1990; Zetzsche, Wegmann, & Barth, 1993; Simoncelli, 1997), which can be characterized in terms of the variance (Simoncelli, 1997; Buccigrossi & Simoncelli, 1999; Schwartz & Simoncelli, 2001). The importance of variance statistics had been suggested earlier in pixel space (Lee, 1980) and has been addressed in other domains such as speech (Brehm & Stammler, 1987) and even finance (Bollerslev, Engle, & Nelson, 1994). There has also been substantial recent progress in top-down methods (Rao, Olshausen, & Lewicki, 2002), especially in understanding the tight relationship between bottom-up and top-down ideas. In particular, it has been shown that optimizing a linear filter set for statistical properties such as sparseness or marginal independence (Olshausen & Field, 1996; Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998) in the light of the statistics of natural images can be viewed as a way of fitting an
O. Schwartz, T. Sejnowski, and P. Dayan
exact or approximate top-down generative model (Olshausen & Field, 1996). These methods all lead to optimal filters that are qualitatively matched to simple cells. The bottom-up variance coordination among the filters has also found a resonance in top-down models (Wainwright & Simoncelli, 2000; Wainwright, Simoncelli, & Willsky, 2001; Hyvärinen & Hoyer, 2000a; Romberg, Choi, & Baraniuk, 1999, 2001; Karklin & Lewicki, 2003a, 2005). Various generative models have built hierarchies on top of simple cell receptive fields, leading to nonlinear cortical properties such as the phase invariance exhibited by complex cells together with other rich invariances.

This article focuses on a hierarchical, nonlinear generative modeling approach to understanding filter coordination and its tight relation to bottom-up statistics. We build on two substantial directions in the literature, whose close relationship is only slowly being fully understood. One set of ideas started in the field of independent component analysis (ICA), adding to the standard single, linear layer of filters a second layer that determines the variance of the first-layer activations (Hyvärinen & Hoyer, 2000a, 2000b; Hoyer & Hyvärinen, 2002; Karklin & Lewicki, 2003a, 2005; Park & Lee, 2005). In particular, Karklin and Lewicki (2003a, 2003b, 2005) suggested a model in which the variance of each unit in the first layer arises from an additive combination of a set of variance basis function units in the second layer. The method we propose can be seen as a version of this with competitive rather than cooperative combination of the second-layer units. The other set of ideas originates with the gaussian scale mixture model (GSM) (Andrews & Mallows, 1974; Wainwright & Simoncelli, 2000; Wainwright et al., 2001),1 which has strong visibility in the image processing literature (Strela, Portilla, & Simoncelli, 2000; Portilla, Strela, Wainwright, & Simoncelli, 2001, 2003; Portilla & Simoncelli, 2003).
GSM generative models offer a simple way of parameterizing the statistical variance dependence of the first-layer filter activations in a way that captures some of the key bottom-up statistical properties of images. However, although GSMs parameterize the dependence of linear filters, they do not by themselves specify the pattern of dependence among the filters. This is the key hurdle in their application as a top-down basis for bottom-up, hierarchical learning models. In these terms, we propose an extension to the GSM model that learns the pattern of dependencies among linear filters, thereby learning a hierarchical representation. In the next section, we discuss bottom-up statistical properties of images. We describe and motivate the use of gaussian scale mixture models and then pose the question of learning a hierarchical representation in this framework. This lays the groundwork for the rest of the article, in which we develop the model and hierarchical learning more formally and demonstrate results on both synthetic data and natural image ensembles.
1 Another class of models, which has recently been related both to ICA and the GSM, is the energy-based product of Student-t models (Osindero et al., 2006).
An earlier version of part of this work appeared in Schwartz, Sejnowski, & Dayan (2005).

2 Bottom-Up and Top-Down Statistics of Images

At the heart of both bottom-up and top-down methods are the individual and joint statistics of the responses of the set of linear Gabor-like filters that characterize simple-cell receptive fields in primary visual cortex. The distribution of the activation of a single linear filter when convolved with an image is highly kurtotic. That is, the response of the filter is often approximately zero, but occasionally the filter responds strongly to particular structures in the image (Field, 1987). The joint statistics of two related linear filters convolved with the same image exhibit a striking form of statistical dependence: when one of the filters responds strongly to a prominent aspect in the image, the other filter may also respond strongly (say, if two spatially displaced vertical filters are responding to an elongated vertical edge in the image). This is also known as a self-reinforcing characteristic of images (e.g., Turiel, Mato, Parga, & Nadal, 1998). The strength of this dependence is determined by the featural similarity of the linear filters in terms of relative location, orientation, spatial scale, phase, and so forth. The coordination is reflected in the joint conditional distribution having the shape of a bowtie and thus following a variance dependency (Buccigrossi & Simoncelli, 1999; Schwartz & Simoncelli, 2001), or by examining the marginal versus the joint distributions (Zetzsche et al., 1993; Zetzsche & Nuding, 2005). Huang and Mumford (1999) analyzed joint contour plots for a large image database and modeled the joint dependencies as a generalized 2D gaussian. The dependencies can be seen in the responses of various types of linear filters, including predefined wavelets and filters designed to be maximally sparse or independent. These are also present even when the filter responses are linearly decorrelated.
Another view on this self-reinforcing characteristic comes (Wainwright & Simoncelli, 2000; Wainwright et al., 2001) from the top-down GSM model, which was originally described by Andrews and Mallows (1974) over 30 years ago. The model consists of two components: a multidimensional gaussian g, multiplied by a positive scalar random variable v. The second component v effectively “scales” the gaussian component g, forming a “mixture,” l, according to the equation

l = v g, \qquad (2.1)

with density

p[l] = \int \frac{1}{(2\pi)^{m/2} \, |v^2 \Sigma|^{1/2}} \exp\!\left( -\frac{l^{t} \Sigma^{-1} l}{2 v^2} \right) p[v] \, dv, \qquad (2.2)
Figure 1: (A) Generative model for a two-dimensional GSM. Each filter response, l1 and l2 , is generated by multiplying (circle with X symbol) its gaussian variable, g1 and g2 , by a common mixer variable v. (B) Marginal and joint conditional statistics (bowties) of the gaussian components of the GSM. For the joint conditional statistics, intensity is proportional to the bin counts, except that each column is independently rescaled to fill the range of intensities. (C) Marginal statistics of the mixer component of the GSM. The mixer is by definition positive and is chosen here from a Rayleigh distribution with parameter a = .1 (see equation 3.1), but exact distribution of mixer is not crucial for obtaining statistical properties of filter responses shown in D. (D) Marginal and joint conditional statistics (bowties) of generated filter responses.
where m is the number of filters, Σ is the covariance matrix, and the mixer v is distributed according to a prior distribution p[v].2 In its application to natural images (Wainwright & Simoncelli, 2000), we typically think of each li as modeling the response of a single linear filter when applied to a particular image patch. We will also use the same analogy in describing synthetic data. We refer to the scalar variable v as a
2 In other work, the mixture has also been defined as l = \sqrt{v}\, g, resulting in slightly different notation.
mixer variable to avoid confusion with the scales of a wavelet.3 Figure 1A illustrates a simple two-dimensional GSM generative model, in which l1 and l2 are generated with a common mixer variable v. Figures 1B and 1C show the marginal and joint conditional statistics of the gaussian and mixer variables for data synthesized from this model. The GSM model provides the top-down account of the two bottom-up characteristics of natural scene statistics described earlier: the highly kurtotic marginal statistics of a single linear filter and the joint conditional statistics of two linear filters that share a common mixer variable (Wainwright & Simoncelli, 2000; Wainwright et al., 2001). Figure 1D shows the marginal and joint conditional statistics of two filter responses l1 and l2 based on the synthetic data of Figures 1B and 1C. The GSM model bears a close relationship with bottom-up approaches of image statistics and cortical representation. First, models of sparse coding and cortical receptive field representation typically utilize the leptokurtotic properties of the marginal filter response, which arise naturally in a generative GSM model (see Figure 1D, left). Second, GSMs offer an account of filter coordination, as in, for instance, the bubbles framework of Hyvärinen (Hyvärinen, Hurri, & Väyrynen, 2003). Coordination arises in the GSM model when filter responses share a common mixer (see Figure 1D, right). Third, some bottom-up frameworks directly consider versions of the two GSM components. For instance, models of image statistics and cortical gain control (Schwartz & Simoncelli, 2001) result in a divisively normalized output component that has characteristics resembling that of the gaussian component of the GSM in terms of both the marginal and joint statistics (see Figure 1B and Wainwright & Simoncelli, 2000).
Further, Ruderman and Bialek (1994) postulate that the observed pixels in an image (note, not the response of linear filters convolved with an image) can be decomposed into a product of a local standard deviation and a roughly gaussian component. In sum, the GSM model offers an attractive way of unifying a number of influential statistical approaches. In the original formulation of a GSM, there is one mixer for a single collection of gaussian variables, and their bowtie statistical dependence is therefore homogeneous. However, the responses of a whole range of linear filters to image patches are characterized by heterogeneity in their degrees of statistical dependence. Wainwright and Simoncelli (2000) considered a prespecified tree-based hierarchical arrangement (and indeed generated the mixer variables in a manner that depended on the tree). However, for a diverse range of linear filters and a variety of different classes of scenes, it is necessary to learn the hierarchical arrangement from examples. Moreover, because different objects induce different dependencies, different
3 Note that in some literature, the scalar variable has also been called a multiplier variable.
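The two-component generative process of equation 2.1, with a Rayleigh mixer, is simple enough to simulate directly. The sketch below is illustrative only (filter count, sample size, and random seed are arbitrary choices, not values from the article); it reproduces the two signature statistics of Figure 1D: a highly kurtotic marginal and a variance dependence between responses that share a mixer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000

# Two-dimensional GSM: a common Rayleigh mixer v multiplies two
# independent unit-variance gaussian variables g1, g2 (equation 2.1).
v = rng.rayleigh(scale=1.0, size=n_samples)
g = rng.standard_normal((n_samples, 2))
l = v[:, None] * g

def excess_kurtosis(x):
    # 0 for a gaussian; 3 for a double exponential distribution.
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2 - 3.0

# The shared mixer leaves l1 and l2 uncorrelated but makes their
# magnitudes covary: the bowtie variance dependence of Figure 1D.
corr_linear = np.corrcoef(l[:, 0], l[:, 1])[0, 1]
corr_squared = np.corrcoef(l[:, 0] ** 2, l[:, 1] ** 2)[0, 1]
```

With this prior, each l_i is Laplace distributed, so the measured excess kurtosis lands near 3 while the linear correlation stays near zero and the correlation of squared responses is clearly positive.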
Figure 2: Joint conditional statistics for different image patches, including white noise. Statistics are gathered for a given pair of vertical filters that are spatially nonoverlapping. Image patches are 100 by 100 pixels. Intensity is proportional to the bin counts, except that each column is independently rescaled to fill the range of intensities.
arrangements may be appropriate for different image patches. For example, for a given pair of filters, the strength of the joint conditional dependency can vary for different image patches (see Figure 2). This suggests that on a patch-by-patch basis, different mixers should be associated with different filters. Karklin and Lewicki (2003a) suggested what can be seen as one way of doing this: generating the (logarithm of the) mixer value for each filter as a linear combination of the values of a small number of underlying mixer components. Here, we consider the problem in terms of multiple mixer variables v = (vα , vβ . . .), with the linear filters being clustered into groups that share a single mixer. As illustrated in Figure 3, this induces an assignment problem of marrying linear filter responses li and mixers v j , which is the main focus of this article. Inducing the assignment is exactly the process of inducing a level of a hierarchy in the statistical model. Although the proposed model is more complex than the original GSM, in fact we show that inference is straightforward using standard tools of expectation maximization (Dempster, Laird, & Rubin, 1977) and Markov chain Monte Carlo sampling. Closely related assignment problems have been posed and solved using similar techniques, in a different class of image model known as dynamical tree modeling (Williams & Adams, 1999; Adams & Williams, 2003) and in credibility networks (Hinton, Ghahramani, & Teh, 1999). In this article, we approach the question of hierarchy in the GSM model. In section 3, we consider estimating the gaussian and mixer variables of a GSM model from synthetic and natural data. We show how inference fails in the absence of correct knowledge about the assignment associations between gaussian and mixer variables that generated the data. For this
Figure 3: Assignment problem in a multidimensional GSM. Filter responses l = {l1 , . . . , ln } are generated by multiplying gaussian variables g = {g1 , . . . , gn } by mixer variables {vα , . . . , vµ }, where we assume µ < n. We can think of each mixture li as the response of a linear filter when applied to a particular image patch. The assignment problem asks which mixer v j was assigned to each gaussian variable gi , to form the respective filter response li . The set of possible mixers vα , vβ , vγ is surrounded by a rectangular black box. Gray arrows mark the binary assignments: l1 was generated with mixer vα , and l2 and ln were generated with a common mixer vγ . In section 4 and Figure 6, we also consider what determines this binary assignment.
demonstration, we assume the standard GSM generative model, in which each gaussian variable is associated with a single mixer variable. In section 4, we extend the GSM generative model to allow probabilistic mixer overlap and propose a solution to the assignment problem. We show that applied to synthetic data, the technique finds the proper assignments and infers correctly the components of the GSM generative model. In section 5, we apply the technique to images. We show that the statistics of the inferred GSM components match the assumptions of the generative model and demonstrate the hierarchical structure that emerges.

3 GSM Inference of Gaussian and Mixer Variables

Consider the simple, single-mixer GSM model described in equation 2.1. We assume g are uncorrelated, with diagonal covariance matrix σ²I, and
that v has a Rayleigh distribution:

p[v] \propto \left[ v \exp(-v^2/2) \right]^{a}, \qquad (3.1)

where 0 < a ≤ 1 parameterizes the strength of the prior.
For ease, we develop the theory for a = 1. In this case, the variance of each filter response li (we will describe the li as being filter responses throughout this section, even though they mostly are generated purely synthetically) is exponentially distributed with mean 2. The qualitative properties of the model turn out not to depend strongly on the precise form of p[v]. Wainwright et al. (2001) assumed a similar family of mixer variables arising from the square root of a gamma distribution (Wainwright & Simoncelli, 2000), and Portilla et al. considered other forms such as the log normal distribution (Portilla et al., 2001) and a Jeffrey’s prior (Portilla et al., 2003). As stated above, the marginal distribution of the resulting GSM is highly kurtotic (see Figure 1D, left). For our example, given p[v], in fact l follows a double exponential distribution:

p[l] \sim \frac{1}{2} \exp(-|l|). \qquad (3.2)
The joint conditional distribution of two filter responses l1 and l2 follows a bowtie shape, with the width of the distribution of one response increasing for larger values (both positive and negative) of the other response (see Figure 1D, right). The inverse problem is to estimate the n + 1 variables g1, . . . , gn, v from the n filter responses l1, . . . , ln. It is formally ill posed, though regularized through the prior distributions. Four posterior distributions are particularly relevant and can be derived analytically from the model:

1. p[v|l1] is the local estimate of the mixer, given just a single filter response. In our model, it can be shown that

p[v \mid l_1] = \sqrt{\frac{\sigma}{|l_1|}} \, \frac{1}{B\!\left(\frac{1}{2}, \frac{|l_1|}{\sigma}\right)} \exp\!\left( -\frac{v^2}{2} - \frac{l_1^2}{2 v^2 \sigma^2} \right), \qquad (3.3)

where B(n, x) is the modified Bessel function of the second kind (see also Grenander & Srivastava, 2002). For this, the mean is

E[v \mid l_1] = \sqrt{\frac{|l_1|}{\sigma}} \, \frac{B\!\left(1, \frac{|l_1|}{\sigma}\right)}{B\!\left(\frac{1}{2}, \frac{|l_1|}{\sigma}\right)}. \qquad (3.4)

2. p[v|l] is the global estimate of the mixer, given all the filter responses. This has a very similar form to p[v|l1], only substituting l = \sqrt{\sum_i l_i^2} for |l1|:

p[v \mid l] = \left( \frac{l}{\sigma} \right)^{\frac{1}{2}(n-2)} \frac{v^{-(n-1)}}{B\!\left(1 - \frac{n}{2}, \frac{l}{\sigma}\right)} \exp\!\left( -\frac{v^2}{2} - \frac{l^2}{2 v^2 \sigma^2} \right), \qquad (3.5)

whose mean is

E[v \mid l] = \sqrt{\frac{l}{\sigma}} \, \frac{B\!\left(\frac{3}{2} - \frac{n}{2}, \frac{l}{\sigma}\right)}{B\!\left(1 - \frac{n}{2}, \frac{l}{\sigma}\right)}. \qquad (3.6)

Note that E[v|l] has also been estimated numerically in noise removal for other mixer variable priors (e.g., Portilla et al., 2001).

3. p[g1|l1] is the local estimate of the gaussian variable, given just a local filter response. This is

p[g_1 \mid l_1] = \frac{\sqrt{\sigma |l_1|}}{B\!\left(-\frac{1}{2}, \frac{|l_1|}{\sigma}\right)} \, \frac{1}{g_1^2} \exp\!\left( -\frac{g_1^2}{2 \sigma^2} - \frac{l_1^2}{2 g_1^2} \right) U(\mathrm{sign}\{l_1\} g_1), \qquad (3.7)

where U(sign{l1}g1) is a step function that is 0 if sign{l1} ≠ sign{g1}. The step function arises since g1 is forced to have the same sign as l1, as the mixer variables are always positive. The mean is

E[g_1 \mid l_1] = \mathrm{sign}\{l_1\} \, \sigma \sqrt{\frac{|l_1|}{\sigma}} \, \frac{B\!\left(0, \frac{|l_1|}{\sigma}\right)}{B\!\left(-\frac{1}{2}, \frac{|l_1|}{\sigma}\right)}. \qquad (3.8)

4. p[g1|l] is the estimate of the gaussian variable, given all the filter responses. Since in our model the gaussian variables g are mutually independent, the values of the other filter responses l2, . . . , ln provide information only about the underlying hidden variable v. This leaves p[g1|l] proportional to p(l1|g1) P(l2, . . . , ln | v = l1/g1) p(g1), which results in

p[g_1 \mid l] = \left( \frac{\sigma l_1^2}{l} \right)^{\frac{1}{2}(2-n)} \frac{g_1^{\,n-3}}{B\!\left(\frac{n}{2} - 1, \frac{l}{\sigma}\right)} \exp\!\left( -\frac{g_1^2 l^2}{2 \sigma^2 l_1^2} - \frac{l_1^2}{2 g_1^2} \right) U(\mathrm{sign}\{l_1\} g_1), \qquad (3.9)

with mean

E[g_1 \mid l] = \mathrm{sign}\{l_1\} \, \sigma \sqrt{\frac{|l_1|}{\sigma}} \sqrt{\frac{|l_1|}{l}} \, \frac{B\!\left(\frac{n}{2} - \frac{1}{2}, \frac{l}{\sigma}\right)}{B\!\left(\frac{n}{2} - 1, \frac{l}{\sigma}\right)}. \qquad (3.10)
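The closed-form posterior means can be cross-checked numerically. The sketch below takes B(n, x) to be the modified Bessel function of the second kind K_n(x), as stated in the text, and compares the mean of equation 3.4 against direct integration of the a = 1 posterior; the grid sizes and test values are arbitrary choices.

```python
import numpy as np

def bessel_k(order, x):
    # Modified Bessel function of the second kind, via the integral
    # representation K_nu(x) = int_0^inf exp(-x cosh t) cosh(nu t) dt,
    # evaluated with a trapezoid rule on a fine grid.
    t = np.linspace(0.0, 20.0, 200_001)
    f = np.exp(-x * np.cosh(t)) * np.cosh(order * t)
    return np.sum((f[1:] + f[:-1]) / 2.0) * (t[1] - t[0])

def mean_v_closed_form(l1, sigma=1.0):
    # Equation 3.4, reading B(n, x) as K_n(x).
    x = abs(l1) / sigma
    return np.sqrt(x) * bessel_k(1.0, x) / bessel_k(0.5, x)

def mean_v_numeric(l1, sigma=1.0):
    # Direct integration of the unnormalized posterior for a = 1:
    # p[v | l1] is proportional to exp(-v^2/2 - l1^2/(2 v^2 sigma^2)).
    v = np.linspace(1e-4, 40.0, 400_001)
    w = np.exp(-v**2 / 2.0 - l1**2 / (2.0 * v**2 * sigma**2))
    return np.sum(v * w) / np.sum(w)
```

Both routes agree to several decimal places over a range of response magnitudes, which is a useful sanity check when reimplementing the estimators.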
We first study inference in this model using synthetic data in which two groups of filter responses l1 , . . . , l20 and l21 , . . . , l40 are generated by two mixer variables vα and vβ (see the schematic in Figure 4A, and the respective
Figure 4: Local and global estimation in synthetic GSM data. (A) Generative model. Each filter response is generated by multiplying its gaussian variable by one of the two mixer variables vα and vβ . (B) Marginal and joint conditional statistics of sample filter responses. For the joint conditional statistics, intensity is proportional to the bin counts, except that each column is independently rescaled to fill the range of intensities. (C–E) Left column: actual (assumed) distributions of mixer and gaussian variables; other columns: estimates based on different numbers of filter responses (either 1 filter, labeled “too local”; 40 filters, labeled “too global”; or 20 filters, labeled “just right,” respectively). (C) Distribution of estimate of the mixer variable vα . Note that mixer variable values are by definition positive. (D) Distribution of estimate of one of the gaussian variables, g1 . (E) Joint conditional statistics of the estimates of gaussian variables g1 and g2 .
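The too local / just right / too global comparison of Figure 4 can be reproduced on synthetic data. The sketch below assumes σ = 1 and the a = 1 Rayleigh prior, evaluates the posterior mean E[v|l] of equation 3.6 by grid integration rather than via Bessel functions, and uses arbitrary patch counts and grid limits. Inference over the 20 responses that actually share the mixer should beat both the single-response and the 40-response estimates.

```python
import numpy as np

rng = np.random.default_rng(2)
n_patches = 1000

# Two groups of 20 filters: the first group shares mixer v_alpha in each
# patch, the second shares v_beta (the generative scheme of Figure 4A).
v_alpha = rng.rayleigh(scale=1.0, size=n_patches)
v_beta = rng.rayleigh(scale=1.0, size=n_patches)
g = rng.standard_normal((n_patches, 40))
l = np.concatenate([v_alpha[:, None] * g[:, :20],
                    v_beta[:, None] * g[:, 20:]], axis=1)

def estimate_v(l_subset):
    # Posterior mean E[v | l] for sigma = 1, a = 1, assuming all responses
    # in l_subset share one mixer: p[v | l] is proportional to
    # v^-(n-1) exp(-v^2/2 - L/(2 v^2)) with L = sum_i l_i^2.
    n = l_subset.shape[1]
    L = np.sum(l_subset**2, axis=1)
    v = np.linspace(1e-3, 30.0, 3000)
    log_w = (-(n - 1) * np.log(v)[None, :]
             - v[None, :] ** 2 / 2.0
             - L[:, None] / (2.0 * v[None, :] ** 2))
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    return np.sum(v * w, axis=1) / np.sum(w, axis=1)

mse = {"too local": np.mean((estimate_v(l[:, :1]) - v_alpha) ** 2),
       "just right": np.mean((estimate_v(l[:, :20]) - v_alpha) ** 2),
       "too global": np.mean((estimate_v(l) - v_alpha) ** 2)}
```

The "too global" estimate is pulled toward a compromise between v_alpha and v_beta, while the "too local" estimate is dominated by the ill-posedness of a single product observation.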
statistics in Figure 4B). That is, each filter response is deterministically generated from either mixer vα or mixer vβ , but not both. We attempt to infer the gaussian and mixer components of the GSM model from the synthetic data, assuming that we do not know the actual mixer assignments. Figures 4C and 4D show the empirical distributions of estimates of the conditional means of a mixer variable E(vα |{l}) (see equations 3.4 and 3.6) and one of the gaussian variables E(g1 |{l}) (see equations 3.8 and 3.10) based on different assumed assignments. For inference based on too few filter responses, the estimates do not match the actual distributions (see the second column labeled “too local”). For example, for a local estimate based on a single filter response, the gaussian estimate peaks away from zero. This is because the filter response is a product of the two terms, the gaussian and the mixer, and the problem is ill posed with only a single filter estimate. Similarly, the mixer variable is not estimated correctly for this local case. Note that this occurs even though we assume the correct priors for both the mixer and gaussian variables and is thus a consequence of the incorrect assumption about the assignments. Inference is also compromised if it is based on too many filter responses, including those generated by both vα and vβ (see the third column, labeled “too global”). This is because inference of vα is based partly on data that were generated with a different mixer, vβ (so when one mixer is high, the other might be low, and so on). In contrast, if the assignments are correct and inference is based on all those filter responses that share the same common generative mixer (in this case vα ), the estimates become good (see the last column, labeled “just right”). In Figure 4E, we show the joint conditional statistics of two components, each estimating their respective g1 and g2 . 
Again, as the number of filter responses increases, the estimates improve, provided that they are taken from the right group of filter responses with the same mixer variable vα . Specifically, the mean estimates of g1 and g2 become more independent (see the last column). Note that for estimations based on a single filter response, the joint conditional distribution of the gaussian appears correlated rather than independent (second column); for estimation based on too many filter responses generated from either of the mixer variables, the joint conditional distribution of the gaussian estimates shows a dependent (rather than independent) bowtie shape (see the third column). Mixer variable joint statistics also deviate from their actual independent forms when the estimations are too local or global (not shown). These examples indicate modes of estimation failure for synthetic GSM data if one does not know the proper assignments between mixer and gaussian variables. This suggests the need to infer the appropriate assignments from the data. To show that this is not just a consequence of an artificial example, we consider estimation for natural image data. Figure 5 demonstrates estimation of mixer and gaussian variables for an example natural image. We derived linear filters from a multiscale oriented steerable pyramid (Simoncelli, Freeman, Adelson, & Heeger, 1992), with 100 filters, at two preferred
Figure 5: Local and global estimation in image data. (A–C) Left: Assumed distributions of mixer and gaussian variables; other columns: estimates based on different numbers of filter responses (either 1 filter, labeled “too local,” or 40 filters, including two orientations across a 38 by 38 pixel region, labeled “too global,” respectively). (A) Distribution of estimate of the mixer variable vα . Note that mixer variable values are by definition positive. (B) Distribution of estimate of one of the gaussian variables, g1 . (C) Joint conditional statistics of the estimates of gaussian variables g1 and g2 .
orientations, 25 nonoverlapping spatial positions (with spatial subsampling of 8 pixels), and a single phase and spatial frequency peaked at 1/6 cycles per pixel. By fitting the marginal statistics of single filters, we set the Rayleigh parameter of equation 3.1 to a = 0.1. Since we do not know a priori the actual assignments that generated the image data, we demonstrate examples for which inference is either very local (based on a single wavelet coefficient input) or very global (based on 40 wavelet coefficients at two orientations and a range of spatial positions). Figure 5 shows the inferred marginal and bowtie statistics for the various cases. If we compare the second and third columns to the equivalents in Figures 4C to 4E for the synthetic case, we can see close similarities. For instance, overly local or global inference of the gaussian variable leads to
bimodal or leptokurtotic marginals, respectively. The bowtie plots are also similar. Indeed, we and others (Ruderman & Bialek, 1994; Portilla et al., 2003) have observed changes in image statistics as a function of the width of the spatial neighborhood or the set of wavelet coefficients. It would be ideal to have a column in Figure 5 equivalent to the “just right” column of Figure 4. The trouble is that the equivalent neighborhood of a filter is defined not merely by its spatial extent, but rather by all of its featural characteristics and in an image and image-class dependent manner. For example, we might expect different filter neighborhoods for patches with a vertical texture everywhere than for patches corresponding to an edge or to features of a face. Thus, different degrees of local and global arrangements may be appropriate for different images. Since we do not know how to specify the mixer groups a priori, it is desirable to learn the assignments from a set of image samples. Furthermore, it may be necessary to have a battery of possible mixer groupings available to accommodate the statistics of different images.

4 Solving the Assignment Problem

The plots in Figures 4 and 5 suggest that it should be possible to infer the assignments, that is, work out which linear filters share common mixers, by learning from the statistics of the resulting joint dependencies. Further, real-world stimuli are likely better captured by the possibility that inputs are coordinated in somewhat different collections in different images. Hard assignment problems, in which each input pays allegiance to just one mixer, are notoriously computationally brittle. Soft assignment problems, in which there is a probabilistic relationship between inputs and mixers, are computationally better behaved. We describe the soft assignment problem and illustrate examples with synthetic data. In section 5, we turn to image data. Consider the richer mixture-generative GSM shown in Figure 6.
To model the generation of filter responses li for a single image patch (see Figure 6A), we multiply each gaussian variable gi by a single mixer variable from the set vα , vβ , . . . , vµ . In the deterministic (hard assignment) case, each gaussian variable is associated with a fixed mixer variable in the set. In the probabilistic (soft assignment) case, we assume that gi has association probability
pij (satisfying Σj pij = 1, ∀i) of being assigned to mixer variable vj. Note that this is a competitive process, by which only a single mixer variable is assigned to each filter response li in each patch, and the assignment is determined according to the association probabilities. As a result, different image patches will have different assignments (see Figures 6A and 6B). For example, an image patch with strong vertical texture everywhere might have quite different assignments from an image patch with a vertical edge on the right corner. Consequently, in these two patches, the linear filters will share different common mixers. The assignments are assumed to be made independently for each patch. Therefore, the task for hierarchical learning
is to work out association probabilities suitable for generating the filter responses. We use χi ∈ {α, β, . . . , µ} for the assignments:

l_i = g_i v_{\chi_i}. \qquad (4.1)
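Generation under equation 4.1 can be sketched directly. The association matrix below is a hypothetical hard-edged example, not taken from the article: filters 0 and 1 always draw mixer α, filter 2 always draws mixer β, so only the first pair should show the variance dependence.

```python
import numpy as np

rng = np.random.default_rng(1)
n_patches = 200_000

# Hypothetical association probabilities p_ij for 3 filters and 2 mixers:
# filters 0 and 1 always use mixer alpha, filter 2 always mixer beta.
p = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
n_filters, n_mixers = p.shape

# Per patch: draw Rayleigh mixers, draw one assignment chi_i per filter
# from its row of p, and form l_i = g_i * v_{chi_i} (equation 4.1).
v = rng.rayleigh(scale=1.0, size=(n_patches, n_mixers))
g = rng.standard_normal((n_patches, n_filters))
chi = np.stack([rng.choice(n_mixers, size=n_patches, p=p[i])
                for i in range(n_filters)], axis=1)
l = g * np.take_along_axis(v, chi, axis=1)

# Filters that share a mixer show the variance (bowtie) dependence;
# filters with disjoint mixers do not.
corr_shared = np.corrcoef(l[:, 0] ** 2, l[:, 1] ** 2)[0, 1]
corr_disjoint = np.corrcoef(l[:, 0] ** 2, l[:, 2] ** 2)[0, 1]
```

Intermediate association probabilities (rows of p between 0 and 1) interpolate between these two extremes, giving the weaker dependencies discussed for Figure 8A.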
Consider a specific synthetic example of a soft assignment: 100 filter responses are generated probabilistically from three mixer variables, vα , vβ , and vγ . Figure 7A shows the association probabilities pij . Figure 8A shows example marginal and joint conditional statistics for the filter responses, based on an empirical sample of 5000 points drawn from the generative model. On the left is the typical bowtie shape between two filter responses generated with the same mixer, vα , 100% of the time. In the middle is a weaker dependency between two filter responses whose mixers overlap for only some samples. On the right is an independent joint conditional distribution arising from two filter responses whose mixer assignments do not overlap. There are various ways to try solving soft assignment problems (see, e.g., MacKay, 2003). Here we use the Markov chain Monte Carlo method called Gibbs sampling. The advantage of this method is its flexibility and power. Its disadvantage is its computational expense and biological implausibility—although for the latter, we should stress that we are mostly interested in an abstract characterization of the higher-order dependencies rather than in a model for activity-dependent representation formation. Williams and Adams (1999) suggested using Gibbs sampling to solve a similar assignment problem in the context of dynamic tree models. Variational approximations have also been considered in this context (Adams & Williams, 2003; Hinton et al., 1999). Inference and learning in this model proceeds in two stages, according to an expectation maximization framework (Dempster et al., 1977). First, given a filter response li , we use Gibbs sampling to find possible
Figure 6: Extended generative GSM model with soft assignment. (A) The depiction is similar to Figure 3, except that we examine only the generation of two of the filter responses l1 and l2 , and we show the probabilistic process according to which the assignments are made. The mixer variable assigned to l1 is chosen for each image patch according to the association probabilities p1α , p1β , and p1γ . The binary assignment for filter response l1 corresponds to mixer vα = 9. The binary choice arose from the higher association probability p1α = 0.65, marked with a gray ellipse. The assignment is marked by a gray arrow. For this patch, the assignment for filter l2 also corresponds to vα = 9. Thus, l1 and l2 share a common mixer (with a relatively high value). (B) The same for a second patch; here assignment for l1 corresponds to vα = 2.5, but for l2 to vγ = 0.2.
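The competitive choice depicted in Figure 6 can be sketched as posterior responsibilities: the probability that mixer v_j produced response l_i is proportional to p_ij times the gaussian likelihood N(l_i; 0, v_j²σ²). In the full model, such quantities drive the Gibbs samples of the binary assignments; the mixer values and probabilities below are hypothetical, chosen only to illustrate the competition.

```python
import numpy as np

def gaussian_pdf(x, var):
    return np.exp(-x**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def responsibilities(l_i, v, p_i, sigma=1.0):
    # Posterior probability that each candidate mixer v_j generated the
    # response l_i: proportional to p_ij * N(l_i; 0, v_j^2 sigma^2).
    w = p_i * gaussian_pdf(l_i, (v * sigma) ** 2)
    return w / w.sum()

# Hypothetical patch with one large and one small candidate mixer and a
# uniform association prior (values chosen for illustration only).
v = np.array([5.0, 0.5])
p_i = np.array([0.5, 0.5])

r_strong = responsibilities(4.0, v, p_i)  # large response favors v = 5
r_weak = responsibilities(0.1, v, p_i)    # small response favors v = 0.5
```

Sampling a single assignment from each responsibility vector, patch by patch, reproduces the competitive generation of Figure 6 in reverse.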
Soft Gaussian Scale Mixer Assignments for Natural Scenes
O. Schwartz, T. Sejnowski, and P. Dayan
Figure 7: Inference of mixer association probabilities in a synthetic example. (A) Each filter response li is generated by multiplying its gaussian variable by a probabilistically chosen mixer variable, vα , vβ , or vγ . Shown are the actual association probabilities pij (labeled probability) of the generated filter responses li with each of the mixer variables v j . (B) Inferred association probabilities pij from the Gibbs procedure, corresponding to vα , vβ , and vγ .
appropriate (posterior) assignments to the mixers. Second, given the collection of assignments across multiple filter responses, we update the association probabilities pij (see the appendix). We tested the ability of this inference method to find the association probabilities in the overlapping mixer variable synthetic example shown in Figure 7A. The Gibbs sampling procedure requires that we specify the number of mixer variables that generated the data. In the synthetic example, the actual number of mixer variables is 3. We ran the Gibbs sampling procedure assuming that the number of possible mixer variables is 5 (i.e., more than the actual 3). After 500 iterations, the weights converged near the proper probabilities. In
Figure 8: Inference of gaussian and mixer components in a synthetic example. (A) Example marginal and joint conditional filter response statistics. (B) Statistics of gaussian and mixer estimates from Gibbs sampling.
Figure 7A, we plot the actual probability distributions for the filter response associations with each of the mixer variables. In Figure 7B, we show the estimated associations for three of the mixers: the estimates closely match the actual association probabilities; the other two estimates yield association probabilities near zero, as expected (not shown). We estimated the gaussian and mixer components of the GSM using the Bayesian equations of the previous section (see equations 3.10 and 3.6), but restricting the input samples to those assigned to each mixer variable. In Figure 8B, we show examples of the estimated distributions of the gaussian and mixer components of the GSM. Note that the joint conditional statistics of both gaussian and mixer are independent, since the variables were generated as such in the synthetic example. The Gibbs procedure can be adjusted for data generated with different Rayleigh parameters a (in equation 3.1), allowing us to model a wide range of behaviors observed in the responses of linear filters to a range of images. We have also tested the synthetic model for cases in which the mixer variable generating the data deviates somewhat from the assumed mixer variable distribution: Gibbs sampling still tends to find the proper association weights, but the
probability distribution estimate of the mixer random variable is not matched to the assumed distribution.

We have thus far discussed the association probabilities determined by Gibbs inference for filter responses over the full set of patches. How does Gibbs inference choose the assignments on a patch-by-patch basis? For filter responses generated deterministically, according to a single mixer, the learned association probabilities of filter responses to this mixer are close to 1, and so the Gibbs assignments are correct approximately 100% of the time. For filter responses generated probabilistically from more than one mixer variable (e.g., filter responses 21–40 or 61–80 for the example in Figures 7 and 8), there is potential ambiguity about the generating mixer. We focus specifically on filter responses 21 to 40, which are generated from either vα or vβ. Note that the overall association probabilities for these two mixers over all patches are 0.6 and 0.4, respectively. We would like to know how these are distributed on a patch-by-patch basis. To assess the correctness of the Gibbs assignments, we repeated 40 independent runs of Gibbs sampling for the same filter responses and computed the percentage of correct assignments for filter responses that were generated according to vα or vβ (note that we know the actual generating mixer values for the synthetic data). We did this on a patch-by-patch basis and found that two factors affected the Gibbs inference: (1) the ratio of the two mixer variables vβ/vα for the given patch and (2) the absolute value of the ambiguous filter response for the given patch. Figure 9 summarizes the Gibbs assignments. The x-axis indicates the ratio of the absolute value of the ambiguous filter response to vα. The y-axis indicates the percentage correct for filter responses that were actually generated from vα (black circles) or vβ (gray triangles).
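The role of these two factors can be seen from the conditional likelihood of a response under each candidate mixer: given an assignment, l = g·v with g ~ N(0, 1), so p(l | v) is a zero-mean gaussian with standard deviation v. A minimal sketch, with assumed mixer values at the 1/10 ratio discussed here:

```python
import numpy as np

def loglike(l, v):
    """Log density of l under a zero-mean gaussian with std v (i.e., l = g*v, g ~ N(0,1))."""
    return -0.5 * (l / v) ** 2 - np.log(v) - 0.5 * np.log(2 * np.pi)

v_alpha, v_beta = 10.0, 1.0   # assumed mixer values; ratio v_beta/v_alpha = 1/10

# A large-magnitude response is far more likely under the large mixer ...
big = 8.0
big_favors_alpha = loglike(big, v_alpha) > loglike(big, v_beta)

# ... and a small-magnitude response is more likely under the small mixer.
small = 0.5
small_favors_beta = loglike(small, v_beta) > loglike(small, v_alpha)
```

Near the crossover magnitude the two likelihoods are close, and the assignment is genuinely ambiguous, which is why the Gibbs assignments are least reliable for intermediate ratios.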
In Figure 9A we depict the result for a patch in which the ratio vβ/vα was approximately 1/10 (marked by an asterisk on the x-axis). This indicates that filter responses generated by vα are usually larger than filter responses generated by vβ, and so for sufficiently large or small (absolute) filter response values, it should be possible to determine the generating mixer. Indeed, Gibbs correctly assigns filter responses for which the ratio of the filter response to vα is reasonably above or below 1/10 but does not fare as well for ratios that are in the range of 1/10 and could have potentially been generated by either mixer. Figure 9B illustrates a similar result for vβ/vα ≈ 1/3. Finally, Figure 9C shows that for vβ/vα ≈ 1, all filter responses are in the same range, and Gibbs resorts to the approximate association probabilities, of 0.6 and 0.4, respectively.

We also tested Gibbs inference in undercomplete cases, for which the Gibbs procedure assumes fewer mixer variables than were actually used to generate the data. Figure 10 shows an example in which we generated 75 sets of filter responses according to 15 mixer variables, each associated deterministically with five (consecutive) filter responses. We ran Gibbs assuming that only 10 mixers were collectively responsible for all the filter responses. Figure 10 shows the actual and inferred association probabilities
Figure 9: Gibbs assignments on a patch-by-patch basis in a synthetic example. For filter responses of each patch (here, filter responses 21–40), there is ambiguity as to whether the assignments were generated according to vα or vβ (see association probabilities in Figure 7). We summarize the percentage correct assignments computed over 40 independent Gibbs runs (y-axis), separately for the patches with filter responses actually generated according to vα (black, circles) and filter responses actually generated according to vβ (gray, triangles). There are overall 20 points corresponding to the 20 filter responses. For readability, we have reordered vα and vβ , such that vα ≥ vβ . The x-axis depicts the ratio of the absolute value of each ambiguous filter response in the patch (labeled “patch”), and vα . The black asterisk on the x-axis indicates the ratio vβ /vα . See the text for interpretation. (A) vβ /vα ≈ 1/10. (B) vβ /vα ≈ 1/3. (C) vβ /vα ≈ 1.
Figure 10: Inference of mixer association probabilities in an undercomplete synthetic example. The data were synthesized with 15 mixer variables, but Gibbs inference assumes only 10 mixer variables. (A) Actual association probabilities. Note that assignments are deterministic, with 0 or 1 probability, in consecutive groups of 5. (B) Inferred association probabilities.
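The deterministic association structure of Figure 10A (15 mixers, each owning five consecutive of the 75 filter responses) can be written down directly; a small sketch:

```python
import numpy as np

n_filters, n_mixers, group = 75, 15, 5

# Deterministic association probabilities: filter i is assigned to mixer
# i // group with probability 1, in consecutive groups of 5 (Figure 10A).
p_actual = np.zeros((n_filters, n_mixers))
p_actual[np.arange(n_filters), np.arange(n_filters) // group] = 1.0
```

Gibbs inference is then run assuming only 10 mixer columns; since each filter's association probabilities must sum to 1, the five groups that lack a dedicated mixer are necessarily spread across the remaining associations with smaller weights.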
in this case. The procedure correctly groups together five filters in each of the 10 inferred associations. There are groups of five filters that are not represented by a single high-order association, and these are spread across the other associations, with smaller weights. The added noise is expected, since the association probabilities for each filter must sum to 1.

5 Image Data

Having validated the inference model using synthetic data, we turned to natural images. Here, the li are actual filter responses rather than synthesized products of a generative model. We considered inference on both wavelet filters and ICA bases and with a number of different image sets. We first derived linear filters from a multiscale oriented steerable pyramid (Simoncelli et al., 1992), with 232 filters. These consist of two phases (even and odd quadrature pairs), two orientations, and two spatial frequencies. The high spatial frequency is peaked at approximately 1/6 cycles per pixel and consists of 49 nonoverlapping spatial positions. The low spatial frequency is peaked at approximately 1/12 cycles per pixel and consists of 9 nonoverlapping spatial positions. The spacing between filters, along vertical and horizontal spatial shifts, is 7 pixels (higher frequency) and 14 pixels (lower frequency). We used an ensemble of five images from a standard compression database (see Figure 12A) and 8000 samples. We ran our method with the same parameters as for synthetic data, with 20 possible mixer variables and Rayleigh parameter a = 0.1. Figure 11 shows the association probabilities pij of the filter responses for each of the obtained mixer variables. In Figure 11A, we show a schematic (template) of the association representation that follows in Figure 11B for the actual data.
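As a check on the filter-bank bookkeeping, the counts above multiply out to the stated 232 filters (reading the 49 and 9 positions as 7×7 and 3×3 grids at 7- and 14-pixel spacing, which is our interpretation of the description):

```python
# Two phases (even/odd quadrature pairs), two orientations, two spatial
# frequencies, with 49 high-frequency and 9 low-frequency spatial positions.
n_phases, n_orients = 2, 2
positions_high = 7 * 7   # 49 nonoverlapping positions at 7-pixel spacing
positions_low = 3 * 3    # 9 nonoverlapping positions at 14-pixel spacing

n_filters = n_phases * n_orients * (positions_high + positions_low)
```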
Each set of association probabilities for each mixer variable is shown for coefficients of two phases, two orientations, two spatial frequencies, and the range of spatial positions along the vertical and horizontal axes. Unlike the synthetic examples, where we plotted the association probabilities in one dimension, for the images we plot the association probabilities along a two-dimensional spatial grid matched to the filter set. We now study the pattern of the association probabilities for the mixer variables. For a given mixer, the association probabilities signify the probability that filter responses were generated with that mixer. If a given mixer variable has high association probabilities corresponding to a particular set of filters, we say that the mixer neighborhood groups together that set of filters. For instance, the mixer association probabilities in Figure 11B (left) depict a mixer neighborhood that groups together mostly vertical filters on the left-hand side of the spatial grid, of both even and odd phase. Strikingly, all of the mixer neighborhoods group together the two phases of a quadrature pair. Quadrature pairs have also been extracted from cortical data (Touryan, Lau, & Dan, 2002; Rust, Schwartz, Movshon, & Simoncelli, 2005) and are the components of ideal complex cell models. However, the range of spatial
groupings of quadrature pairs that we obtain here has not been reported in visual cortex and thus constitutes a prediction of the model. The mixer neighborhoods range from sparse grouping across space to more global grouping. Single orientations are often grouped across space, but in a couple of cases, both orientations are grouped together. In addition, note that there is some probabilistic overlap between mixer neighborhoods; for instance, the global vertical neighborhood associated with one of the mixers overlaps with other, more localized, vertical neighborhoods associated with other mixers. The diversity of mixer neighborhoods matches our intuition that different mixer arrangements may be appropriate for different image patches. We examine the image patches that maximally activate the mixers, similar to Karklin and Lewicki (2003a). In Figure 12 we show different mixer association probabilities and patches with the maximum log likelihood of P(v|patch). One example mixer neighborhood (see Figure 12B) is associated with global vertical structure across most of its “receptive” region. Consequently, the maximal patches correspond to regions in the image data with multiple vertical structures. Another mixer neighborhood (see Figure 12C) is associated with vertical structure in a more localized iso-oriented region of space; this is also reflected in the maximal patches. This is perhaps similar to contour structure reported from the statistics of natural scenes (Geisler, Perry, Super, & Gallogly, 2001; Hoyer & Hyvärinen, 2002). Another mixer neighborhood (see Figure 12D) is associated with vertical and horizontal structure in the corner, with maximal patches that tend to have any structure in this region (a roof corner, an eye, a distant face, and so on). The mixer neighborhoods in Figures 12B and 12D bear similarity to those in Karklin and Lewicki (2003a).
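The maximal patches are those with the maximum log likelihood of P(v|patch). As an illustrative stand-in for that posterior computation (not the exact quantity used here), one can score each patch by the association-weighted response power of a neighborhood, the square root of which is the λj of the appendix under a hard assignment:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in filter responses: 1000 patches through 16 filters, with a common
# per-patch Rayleigh scale to mimic GSM-like statistics (illustrative only).
n_patches, n_filters = 1000, 16
l = rng.standard_normal((n_patches, n_filters)) * rng.rayleigh(size=(n_patches, 1))

# Hypothetical association probabilities for one mixer neighborhood: the
# mixer groups together the first half of the filters.
p_j = np.zeros(n_filters)
p_j[: n_filters // 2] = 1.0

# Proxy score: association-weighted response power per patch.
score = (p_j * l**2).sum(axis=1)
top = np.argsort(score)[::-1][:10]   # indices of the 10 "maximal" patches
```

Patches that rank highest under such a score are those whose responses in the neighborhood are jointly large, i.e., those most consistent with a large value of the corresponding mixer.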
Figure 11: Inference of mixer association probabilities for images and wavelet filters. (A) Schematic of filters and association probabilities for a single mixer, on a 46-by-46 pixel spatial grid (separate grids for even and odd phase filters). Left: Example even phase filter along the spatial grid. To the immediate right are the association probabilities. The probability that each filter response is associated with the mixer variable ranges from 0 (black) to 1 (white). Only the example filter has high probability, in white, with a vertical line representing orientation. Right: Example odd phase filter and association probabilities (the small line represents higher spatial frequency). (B) Example mixer association probabilities for image data. Even and odd phases always show a similar pattern of probabilities, so we summarize only the even phase probability and merge together the low- and high-frequency representations. (C) All 20 mixer association probabilities for image data for the even phase (arbitrarily ordered). Each probability plot is separately normalized to cover the full range of intensities.
Figure 12: Maximal patches for images and wavelet filters. (A) Image ensemble. The black box marks the size of each image patch. (B–E) Example mixer association probabilities and 46×46 pixel patches that had the maximum log likelihood of P(v|patch).
Figure 13: Inference of gaussian and mixer components for images and wavelet filters. (A) Statistics of images through one filter and joint conditional statistics through two filters. Filters are quadrature pairs, spatially displaced by seven pixels. (B) Inferred gaussian statistics following Gibbs. The dashed line is assumed statistics, and the solid line is inferred statistics. (C) Statistics of example inferred mixer variables following Gibbs. On the left are the mixer association probabilities, and the statistics are shown on the right.
Although some of the mixer neighborhoods have a localized responsive region, it should be noted that they are not sensitive to the exact phase of the image data within their receptive region. For example, in Figure 12C, it is clear that the maximal patches are invariant to phase. This is to be expected, given that the neighborhoods are always arranged in quadrature pairs. From these learned associations, we also used our model to estimate the gaussian and mixer variables (see equations 3.10 and 3.6). In Figure 13, we show representative statistics for the filter responses and the inferred variables. The learned distributions of gaussian and mixer variables match our assumptions reasonably well. The gaussian estimates exhibit joint
conditional statistics that are roughly independent. The mixer variables are typically (weakly) dependent.

To test whether the result is merely a consequence of the choice of wavelet-based linear filters and natural image ensemble, we ran our method on the responses of filters that arose from ICA (Olshausen & Field, 1996) and with 20-by-20 pixel patches from Field’s image set (Field, 1994; Olshausen & Field, 1996). Figure 14 shows example mixer neighborhood associations in terms of the spatial and orientation/frequency profile and corresponding weights (Karklin & Lewicki, 2003a). The association grouping consists of both spatially global examples that group together a single orientation at all spatial positions and frequencies and more localized spatial groupings. The localized spatial groupings sometimes consist of all orientations and spatial frequencies (as in Karklin & Lewicki, 2003a) and are sometimes more localized in these properties (e.g., a vertical spatial grouping may tend to have large weights associated with roughly vertical filters). The statistical properties of the components are similar to the wavelet example (not shown here). Example maximal patches are shown in Figure 15. In Figure 15B are maximal patches associated with a spatially global diagonal structure; in Figure 15C are maximal patches associated with approximately vertical orientation on the right-hand side; in Figure 15D are maximal patches associated with low spatial frequencies. Note that there is some similarity to Karklin and Lewicki (2003a) in the maximal patches.

So far we have demonstrated inference for a heterogeneous ensemble of images. However, it is also interesting and perhaps more intuitive to consider inference for particular images or image classes. We consider a couple of examples with wavelet filters, in which we both learn and demonstrate the results on the particular image class.
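Sampling the 20-by-20 pixel patches is straightforward; a minimal sketch with a stand-in image array (the image set and the ICA basis themselves are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(2)

image = rng.standard_normal((256, 256))  # stand-in for a natural image
patch_size, n_samples = 20, 500

# Draw random top-left corners and crop square patches.
rows = rng.integers(0, image.shape[0] - patch_size, size=n_samples)
cols = rng.integers(0, image.shape[1] - patch_size, size=n_samples)
patches = np.stack([image[r:r + patch_size, c:c + patch_size]
                    for r, c in zip(rows, cols)])

# Each flattened patch would then be projected onto the ICA basis functions
# to give the filter responses l_i used for inference.
flat = patches.reshape(n_samples, -1)
```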
In Figure 16 we demonstrate example mixer association probabilities that are learned for a zebra image (from a Corel CD-ROM). As before, the neighborhoods are composed of quadrature pairs (only even phase shown); however, some of the spatial configurations are richer. For example, in Figure 16A, the mixer neighborhood captures a horizontal-bottom/vertical-top spatial configuration. In Figure 16B, the mixer neighborhood captures a global vertical configuration, largely present in the back zebra, but also in a
Figure 14: Inference of mixer association probabilities for Field’s image ensemble (Field, 1994) and ICA bases. (A) Schematic example of the representation for three basis functions. In the spatial plot, each point is the center spatial location of the corresponding basis function. In the orientation/frequency plot, each point is shown in polar coordinates where the angle is the orientation and the radius is the frequency of the corresponding basis function. (B) Example mixer association probabilities learned from the images. Each probability plot is separately normalized to cover the full range of intensities.
Figure 15: Maximal patches for Field’s image ensemble (Field, 1994) and ICA bases. (A) Example input images. The black box marks the size of each image patch. (B–D) The 20×20 pixel patches that had the maximum log likelihood of P(v| pa tch).
portion of the front zebra. Some neighborhoods (not shown here) are more local. We also ran Gibbs inference on a set of 40 face images (20 different people, 2 images of each; Samaria & Harter, 1994). The mixer neighborhoods are again in quadrature pairs (only even phase shown). Some of the more interesting neighborhoods appear to capture richer information that is not necessarily continuous across space. Figure 17A shows a neighborhood resembling a loose sketch of the eyes, the nose, and the mouth (or moustache); the maximal patches are often roughly centered accordingly. The neighborhood in Figure 17B is also quite global but more abstract and appears to largely capture the left edge of the face along with other features. Figure 17C
Figure 16: (A–B) Example mixer association probabilities and maximal patches for zebra image and wavelets. Maximal patches are marked with white boxes on the image.
shows a typical local neighborhood, which captures features within its receptive region but is rather nonspecific.

6 Discussion

The study of natural image statistics is evolving from a focus on issues about scale-space hierarchies and wavelet-like components and toward the coordinated statistical structure of the wavelets. Bottom-up ideas (e.g., bowties, hierarchical representations such as complex cells) and top-down
Figure 17: Example mixer association probabilities and maximal patches for face images (Samaria & Harter, 1994) and wavelets.
ideas (e.g., GSM) are converging. The resulting new insights inform a wealth of models and concepts and form the essential backdrop for the work in this article. They also link to engineering results in image coding and processing. Our approach to the hierarchical representation of natural images was motivated by two critical factors. First, we sought to use top-down models to understand bottom-up hierarchical structure. As compellingly argued by Wainwright and Simoncelli (2000; Wainwright et al., 2001), Portilla et al. (2001, 2003), and Hyvärinen et al. (2003) in their bubbles framework, the popular GSM model is suitable for this because of the transparent statistical
interplay of its components. This is perhaps by contrast with other powerful generative statistical approaches such as that of De Bonet and Viola (1997). Second, as also in Karklin and Lewicki, we wanted to learn the pattern of the hierarchical structure in an unsupervised manner. We suggested a novel extension to the GSM generative model in which mixer variables (at one level of the hierarchy) enjoy probabilistic assignments to mixture inputs (at a lower level). We showed how these assignments can be learned using Gibbs sampling. Williams and Adams (1999) used Gibbs sampling for solving a related assignment problem between child and parent nodes in a dynamical tree. Interestingly, Gibbs sampling has also been used for inferring the individual linear filters of a wavelet structure, assuming a sparse prior composed of a mixture of gaussian and Dirac delta function (Sallee & Olshausen, 2003), but not for resolving mixture associations. We illustrated some of the attractive properties of the technique using both synthetic data and natural images. Applied to synthetic data, the technique found the proper association probabilities between the filter responses and mixer variables, and the statistics of the two GSM components (mixer and gaussian) matched the actual statistics that generated the data (see Figures 7 and 8). Applied to image data, the resulting mixer association neighborhoods showed phase invariance like complex cells in the visual cortex and showed a rich behavior of grouping along other features (that depended on the image class). The statistics of the inferred GSM components were a reasonable match to the assumptions embodied in the generative model. These two components have previously been linked to possible neural correlates. 
Specifically, the gaussian variable of the GSM has characteristics resembling those of the output of divisively normalized simple cells (Schwartz & Simoncelli, 2001); the mixer variable is more obviously related to the output of quadrature pair neurons (such as orientation energy or motion energy cells, which may also be divisively normalized). How these different information sources may subsequently be used is of great interest. Some aspects of our results are at present more difficult to link strongly to cortical physiology, such as the local contour versus more global patterns of orientation grouping that emerge in our and other approaches. Of course, the model is oversimplified in a number of ways. Two particularly interesting future directions are allowing correlated filter responses and correlated mixer variables. Correlated filters are particularly important to allow overcomplete representations. Overcomplete representations have already been considered in the context of estimating a single mixer neighborhood in the GSM (Portilla et al., 2003) and in recent energy-based models (Osindero, Welling, & Hinton, 2006). They are fertile ground for future investigation within our framework of multiple mixers. Correlations among the mixer variables could extend and enrich the statistical structure in our model and are the key route to further layers in the hierarchy. As a first stage, we might consider a single extra layer that models a mixing of the mixers, prior to mixing the mixer and gaussian variables.
In our model, the mixer variables themselves are uncorrelated, and dependencies arise through discrete mixer assignments. Just as in standard statistical modeling, some dependencies are probably best captured with discrete mixtures and others with continuous ones. In this regard, it is interesting to compare our method to the strategy adopted by Karklin and Lewicki (2003a). Rather than having binary assignments arising from a mixture model, they accounted for the dependence in the filter responses by deriving the (logarithms of the) values of all the mixers for a particular patch from a smaller number of underlying random variables that were themselves mixed using a set of basis vectors. Our association probabilities reveal hierarchical structure in the same way that their basis vectors do, and indeed there are some similarities in the higher-level structures that result. For example, Karklin and Lewicki obtain either global spatial grouping favoring roughly one orientation or spatial frequency range or local spatial grouping at all orientations and frequencies. We also obtain similar results for the generic image ensembles, but our spatial groupings sometimes show orientation preference. The relationship between our model and Karklin and Lewicki’s is similar to that between the discrete mixture of experts model of Jacobs, Jordan, Nowlan, and Hinton (1991) and the continuous model of Jacobs, Jordan, and Barto (1991). One characteristic difference between these models is that the discrete versions (like ours) are more strongly competitive, with the mixer associated with a given group having to explain all their variance terms by itself. The discrete nature of mixer assignments in our model also led to a simple implementation of a powerful inference tool. There are also other directions to pursue. 
First, various interesting bottom-up approaches to hierarchical representation are based on the idea that higher-level structure changes more slowly than low-level structure (Földiák, 1991; Wallis & Baddeley, 1997; Becker, 1999; Laurenz & Sejnowski, 2002; Einhäuser, Kayser, König, & Körding, 2002; Körding, Kayser, Einhäuser, & König, 2003). Although our results (and others like them; Hyvärinen & Hoyer, 2000b) show that temporal coherence is not necessary for extracting features like phase invariance, it would certainly be interesting to capture correlations between mixer variables over time as well as over space. Understanding recurrent connections within cortical areas, as studied in a bottom-up framework by Li (2002), is also key work for the future. Second, as in applications in computer vision, inference at higher levels of a hierarchical model can be used to improve estimates at lower levels, for instance, removing noise. It would be interesting to explore combined bottom-up and top-down inference as a model for combined feedforward and feedback processing in the cortex. It is possible that a form of predictive coding architecture could be constructed, as in various previous suggestions (MacKay, 1956; Srinivasan, Laughlin, & Dubs, 1982; Rao & Ballard, 1999), in which only information not known to upper levels of the hierarchy
would be propagated. However, note the important distinction between the generative model and recognition processes, such as predictive coding, that perform inference with respect to the generative model. In this article, we focused on the former. We should also mention that not all bottom-up approaches to hierarchical structure fit into the GSM framework. In particular, methods based on discriminative ideas such as the Neocognitron (Fukushima, 1980) or the MAX model (Riesenhuber & Poggio, 1999) are hard to integrate directly within the scheme. However, some basic characteristics of such schemes, notably the idea of the invariance of responses at higher levels of the hierarchy, are captured in our hierarchical generative framework. Finally, particularly since there is a wide spectrum of hierarchical models, all of which produce somewhat similar higher-level structures, validation remains a critical concern. Understanding and visualizing high-level, nonlinear receptive fields is almost as difficult in a hierarchical model as it is for cells in higher cortical areas. The advantages for the model (that one can collect as many data as necessary and that the receptive fields arise from a determinate computational goal) turn out not to be as useful as one might like. One validation methodology, which we have followed here, is to test the statistical model assumptions in relation to the statistical properties of the inferred components of the model. We have also adopted the maximal patch demonstration of Karklin and Lewicki (2003a), but the results are inevitably qualitative. Other known metrics in the image processing literature, which would be interesting to explore in future work, include denoising, synthesis, and segmentation.

Appendix: Gibbs Sampling

We seek the association probabilities pij between filter response (i.e., mixture) li and mixer vj that maximize the log likelihood

$$\left\langle \log p[\mathbf{l} \mid \{p_{ij}\}] \right\rangle_{\mathbf{l}}, \tag{A.1}$$
averaged across all the input cases l. As in the expectation maximization algorithm (Dempster et al., 1977), this involves an inner loop (the E phase), calculating the distribution of the (binary) assignments χ = {χij} for each given input patch, P[χ|l], and an outer loop (a partial M phase), which in this case sets new values for the association probability pij closer to the empirical mean over the E step:

$$\left\langle P[\chi_{ij} \mid \mathbf{l}] \right\rangle_{\mathbf{l}}. \tag{A.2}$$
We use Gibbs sampling for the E phase. This uses a Markov chain to generate samples of the binary assignments χ ∼ P[χ|l] for a given input. In any given
2714
O. Schwartz, T. Sejnowski, and P. Dayan
assignment, define η_j = Σ_i χ_ij to be the number of filters assigned to mixer j and λ_j = √(Σ_i χ_ij l_i²) to be the square root of the power assigned to mixer j. Then, by the same integrals that lead to the posterior probabilities in section 3,

log p[l | χ] = log ∫ p[l, v | χ] dv = log ∏_j ∫ p[v_j] p[l | v, χ] dv .  (A.3)
For the Rayleigh prior, a = 1, we have

log p[l | χ] = K + Σ_j [ (1 − η_j/2) log λ_j + log B(1 − η_j/2, λ_j) ] ,  (A.4)

where K is a constant. For the Gibbs sampling, we consider one filter i* at random and, fixing all the other assignments, χ_ī* = {χ_ij, ∀i ≠ i*}, we generate a new assignment χ_i*j according to the probabilities

P[χ_i*j = 1, χ_ī*] ∝ p[l | χ_i*j = 1, χ_ī*] .  (A.5)
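The resampling step of equation A.5 can be sketched in code. The sketch below is illustrative rather than the authors' implementation: the model-specific log likelihood of equation A.4 is abstracted behind a caller-supplied function, the assignment matrix χ is encoded as one mixer index per filter (assuming, for the illustration, that each filter is assigned to exactly one mixer), and the toy likelihood at the bottom is purely hypothetical.

```python
import math
import random

# Sketch of the Gibbs E phase (equation A.5): repeatedly pick one filter
# i* at random, hold all other assignments fixed, and resample the mixer
# assignment of i* in proportion to p[l | chi]. log_lik stands in for the
# model-specific log likelihood of equation A.4.

def gibbs_sweeps(log_lik, n_filters, n_mixers, sweeps, rng):
    # chi[i] = index j of the mixer that filter i is currently assigned to
    chi = [rng.randrange(n_mixers) for _ in range(n_filters)]
    for _ in range(sweeps * n_filters):
        i_star = rng.randrange(n_filters)
        # log p[l | chi_{i*j} = 1, chi_bar] for every candidate mixer j
        logps = []
        for j in range(n_mixers):
            chi[i_star] = j
            logps.append(log_lik(chi))
        # normalize in a numerically stable way and sample one j
        m = max(logps)
        ps = [math.exp(lp - m) for lp in logps]
        r, acc = rng.random() * sum(ps), 0.0
        for j, p in enumerate(ps):
            acc += p
            if r <= acc:
                chi[i_star] = j
                break
    return chi

# Hypothetical toy likelihood: filters 0-2 prefer mixer 0, filters 3-5 mixer 1.
def toy_log_lik(chi):
    return sum(2.0 if chi[i] == (0 if i < 3 else 1) else 0.0
               for i in range(len(chi)))

rng = random.Random(0)
sample = gibbs_sweeps(toy_log_lik, n_filters=6, n_mixers=2, sweeps=50, rng=rng)
print(sample)  # most filters settle on their preferred mixer
```

Each returned sample approximates a draw from P[χ | l]; averaging indicator variables over many such samples gives the ⟨P[χ_ij | l]⟩_l term used in the update of equation A.6.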
We do this many times to try to get near to equilibrium for this Markov chain, and we can then generate sample assignments that approximately come from the distribution P[χ | l]. We then use these to update the association probabilities,

p_ij = p_ij + γ ( ⟨P[χ_ij | l]⟩_l − p_ij ) ,  (A.6)
using only a partial M step because of the approximate E step. Acknowledgments This work was funded by the HHMI (O.S., T.J.S.) and the Gatsby Charitable Foundation (P.D.). We are very grateful to Patrik Hoyer, Mike Lewicki, Zhaoping Li, Simon Osindero, Javier Portilla, and Eero Simoncelli for discussion. References Adams, N. J., & Williams, C. K. I. (2003). Dynamic trees for image modelling. Image and Vision Computing, 10, 865–877. Andrews, D., & Mallows, C. (1974). Scale mixtures of normal distributions. J. Royal Stat. Soc., 36, 99–102.
Soft Gaussian Scale Mixer Assignments for Natural Scenes
2715
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193. Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication. Cambridge, MA: MIT Press. Becker, S. (1999). Implicit learning in 3D object recognition: The importance of temporal context. Neural Computation, 11(2), 347–374. Bell, A. J., & Sejnowski, T. J. (1997). The “independent components” of natural scenes are edge filters. Vision Research, 37(23), 3327–3338. Bollerslev, T., Engle, K., & Nelson, D. (1994). ARCH models. In B. Engle & D. McFadden (Eds.), Handbook of econometrics V. Amsterdam: North-Holland. Brehm, H., & Stammler, W. (1987). Description and generation of spherically invariant speech-model signals. Signal Processing, 12, 119–141. Buccigrossi, R. W., & Simoncelli, E. P. (1999). Image compression via joint statistical characterization in the wavelet domain. IEEE Trans. Image Proc., 8(12), 1688–1701. De Bonet, J., & Viola, P. (1997). A non-parametric multi-scale statistical model for natural images. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press. Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood for incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38. Einhäuser, W., Kayser, C., König, P., & Körding, K. P. (2002). Learning the invariance properties of complex cells from natural stimuli. Eur. J. Neurosci., 15(3), 475–486. Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12), 2379–2394. Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601. Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200. Fukushima, K. (1980). 
Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern., 36, 193–202. Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41, 711–724. Grenander, U., & Srivastava, A. (2002). Probability models for clutter in natural images. IEEE Trans. on Patt. Anal. and Mach. Intel., 23, 423–429. Hinton, G. E., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions Royal Society B, 352, 1177– 1190. Hinton, G. E., Ghahramani, Z., & Teh, Y. W. (1999). Learning to parse images. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 463–469). Cambridge, MA: MIT Press. Hoyer, P., & Hyv¨arinen, A. (2002). A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12), 1593–1605. Huang, J., & Mumford, D. (1999). Statistics of natural images and models. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (p. 547). Fort Collins, CO: Computer Science Press.
Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex. Journal of Physiology (London), 160, 106–154. Hyvärinen, A., & Hoyer, P. (2000a). Emergence of phase- and shift-invariant features by decomposition of natural images into independent subspaces. Neural Computation, 12, 1705–1720. Hyvärinen, A., & Hoyer, P. (2000b). Emergence of topography and complex cell properties from natural images using extensions of ICA. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 827–833). Cambridge, MA: MIT Press. Hyvärinen, A., Hurri, J., & Vayrynen, J. (2003). Bubbles: A unifying framework for low-level statistical properties of natural image sequences. Journal of the Optical Society of America A, 20, 1237–1252. Jacobs, R. A., Jordan, M. I., & Barto, A. G. (1991). Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cognitive Science, 15, 219–250. Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87. Karklin, Y., & Lewicki, M. S. (2003a). Learning higher-order structures in natural images. Network: Computation in Neural Systems, 14, 483–499. Karklin, Y., & Lewicki, M. S. (2003b). A model for learning variance components of natural images. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 1367–1374). Cambridge, MA: MIT Press. Karklin, Y., & Lewicki, M. S. (2005). A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural Computation, 17, 397–423. Körding, K. P., Kayser, C., Einhäuser, W., & König, P. (2003). How are complex cell properties adapted to the statistics of natural scenes? Journal of Neurophysiology, 91(1), 206–212. Laurenz, W., & Sejnowski, T. (2002). 
Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770. Lee, J. S. (1980). Digital image enhancement and noise filtering by use of local statistics. IEEE Pat. Anal. Mach. Intell. PAMI-2, 165–168. Li, Z. (2002). A saliency map in primary visual cortex. Trends in Cognitive Sciences, 6, 9–16. Li, Z., & Atick, J. J. (1994). Towards a theory of striate cortex. Neural Computation, 6, 127–146. MacKay, D. (2003). Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press. MacKay, D. M. (1956). In C. E. Shannon & I. McCarthy (Eds.), Automata studies (pp. 235–251). Princeton, NJ: Princeton University Press. Nadal, J. P., & Parga, N. (1997). Redundancy reduction and independent component analysis: Conditions on cumulants and adaptive approaches. Neural Computation, 9, 1421–1456. Neisser, U. (1967). Cognitive psychology. Englewood Cliffs, NJ: Prentice Hall. Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse factorial code. Nature, 381, 607–609.
Osindero, S., Welling, M., & Hinton, G. E. (2006). Topographic product models applied to natural scene statistics. Neural Computation, 18(2), 381–414. Park, H. J., & Lee, T. W. (2005). Modeling nonlinear dependencies in natural images using mixture of Laplacian distribution. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1041–1048). Cambridge, MA: MIT Press. Portilla, J., & Simoncelli, E. P. (2003). Image restoration using gaussian scale mixtures in the wavelet domain. In Proc. 10th IEEE Int’l. Conf. on Image Proc. (Vol. 2, pp. 965– 968). Piscataway, NJ: IEEE Computer Society. Portilla, J., Strela, V., Wainwright, M., & Simoncelli, E. (2001). Adaptive Wiener Denoising using a gaussian scale mixture model in the wavelet domain. In Proc. 8th IEEE Int’l. Conf. on Image Proc. (pp. 37–40). Piscataway, NJ: IEEE Computer Society. Portilla, J., Strela, V., Wainwright, M., & Simoncelli, E. P. (2003). Image denoising using a scale mixture of gaussians in the wavelet domain. IEEE Trans. Image Processing, 12(11), 1338–1351. Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1), 79–87. Rao, R. P. N., Olshausen, B. O., & Lewicki, M. S. (2002). Probabilistic models of the brain. Cambridge, MA: MIT Press. Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025. Romberg, J., Choi, H., & Baraniuk, R. (1999). Bayesian wavelet domain image modeling using hidden Markov trees. In Proc. IEEE Int’l. Conf. on Image Proc. Piscataway, NJ: IEEE Computer Society. Romberg, J., Choi, H., & Baraniuk, R. (2001). Bayesian tree-structured image modeling using Wavelet-domain hidden Markov models. IEEE Trans. Image Proc., 10(7), 1056–1068. Ruderman, D. L., & Bialek, W. (1994). Statistics of natural images: Scaling in the woods. Phys. Rev. 
Letters, 73(6), 814–817. Rust, N. C., Schwartz, O., Movshon, J. A., & Simoncelli, E. P. (2005). Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46(6), 945–956. Sallee, P., & Olshausen, B. A. (2003). Learning sparse multiscale image representations. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems 15 (pp. 1327–1334). Cambridge, MA: MIT Press. Samaria, F., & Harter, A. (1994). Parameterisation of a stochastic model for human face identification. In Proc. of 2nd IEEE Workshop on Applications of Computer Vision. Piscataway, NJ: IEEE. Schwartz, O., Sejnowski, T. J., & Dayan, P. (2005). Assignment of multiplicative mixtures in natural images. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1217–1224). Cambridge, MA: MIT Press. Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8), 819–825. Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423.
Shannon, C. E., & Weaver, W. (1949). Nonlinear problems in random theory. Urbana: University of Illinois Press. Simoncelli, E. P. (1997). Statistical models for images: Compression, restoration and synthesis. In Proc. 31st Asilomar Conf. on Signals, Systems and Computers (pp. 673–678). Pacific Grove, CA: IEEE Computer Society. Available online at http://www.cns.nyu.edu/∼eero/publications.html Simoncelli, E. P., Freeman, W. T., Adelson, E. H., & Heeger, D. J. (1992). Shiftable multi-scale transforms. IEEE Trans. Information Theory, 38(2), 587–607. Simoncelli, E., & Olshausen, B. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24, 1193–1216. Srinivasan, M. V., Laughlin, S. B., & Dubs, A. (1982). Predictive coding: A fresh view of inhibition in the retina. J. R. Soc. Lond. B, 216, 427–459. Strela, V., Portilla, J., & Simoncelli, E. (2000). Image denoising using a local gaussian scale mixture model in the wavelet domain. In A. Aldroubi, A. F. Laine, & M. A. Unser (Eds.), Proc. SPIE, 45th Annual Meeting. Bellingham, WA: International Society for Optional Engineering. Touryan, J., Lau, B., & Dan, Y. (2002). Isolation of relevant visual features from random stimuli for cortical complex cells. J. Neurosci., 22(24), 10811–10818. Turiel, A., Mato, G., Parga, N., & Nadal, J. P. (1998). The self-similarity properties of natural images resemble those of turbulent flows. Phys. Rev. Lett., 80, 1098–1101. van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. R. Soc. Lond. B, 265, 359–366. Wainwright, M. J., & Simoncelli, E. P. (2000). Scale mixtures of gaussians and the ¨ statistics of natural images. In S. A. Solla, T. K. Leen, & K.-R. Muller (Eds.), Advances in neural information processing systems, 12 (pp. 855–861). Cambridge, MA: MIT Press. Wainwright, M. J., Simoncelli, E. P., & Willsky, A. S. (2001). 
Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1), 89–123. Wallis, G., & Baddeley, R. J. (1997). Optimal, unsupervised learning in invariant object recognition. Neural Computation, 9, 883–894. Wegmann, B., & Zetzsche, C. (1990). Statistical dependence between orientation filter outputs used in an human vision based image code. In M. Kunt (Ed.), Proc. SPIE Visual Comm. and Image Processing (Vol. 1360, pp. 909–922). Bellingham, WA: SPIE. Williams, C. K. I., & Adams, N. J. (1999). Dynamic trees. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 634–640). Cambridge, MA: MIT Press. Zetzsche, C., & Nuding, U. (2005). Nonlinear and higher-order approaches to the encoding of natural scenes. Network: Computation in Neural Systems, 16(2–3), 191– 221. Zetzsche, C., Wegmann, B., & Barth, E. (1993). Nonlinear aspects of primary vision: Entropy reduction beyond decorrelation. In J. Morreale (Ed.), Int’l. Symposium, Society for Information Display (Vol. 24, pp. 933–936). Playa del Ray, CA.
Received November 18, 2004; accepted April 25, 2006.
LETTER
Communicated by Joachim Buhmann
The Scaling of Winner-Takes-All Accuracy with Population Size

Maoz Shamir
[email protected]
Hearing Research Center and Center for BioDynamics, Boston University, Boston, MA 02215, U.S.A.
Empirical studies seem to support conflicting hypotheses with regard to the nature of the neural code. While some studies highlight the role of a distributed population code, others emphasize the possibility of a “single-best-cell” readout. One particularly interesting example of single-best-cell readout is provided by the winner-takes-all (WTA) approach. According to the WTA, every cell is characterized by one particular preferred stimulus, to which it responds maximally. The WTA estimate for the stimulus is defined as the preferred stimulus of the cell with the strongest response. From a theoretical point of view, not much is known about the efficiency of single-best-cell readout mechanisms, in contrast to the considerable existing theoretical knowledge on the efficiency of distributed population codes. In this work, we provide a basic theoretical framework for investigating single-best-cell readout mechanisms. We study the accuracy of the WTA readout. In particular, we are interested in how the WTA accuracy scales with the number of cells in the population. Using this framework, we show that for large neuronal populations, the WTA accuracy is dominated by the tail of the single-cell-response distribution. Furthermore, we find that although the WTA accuracy does improve when larger populations are considered, this improvement is extremely weak compared to other types of population codes. More precisely, we show that while the accuracy of a linear readout scales linearly with the population size, the accuracy of the WTA readout scales logarithmically with the number of cells in the population. 1 Introduction In many regions of the central nervous system, information on specific external stimulus features is represented in a distributed manner, by the responses of many cells (see, e.g., Hubel & Wiesel, 1962; Georgopoulos, Schwartz, & Kettner, 1986; Razak & Fuzessery, 2002; Coltz, Johnson, & Ebner, 2000). 
Nevertheless, the fact that many cells respond to the stimulus does not prove that the central nervous system indeed uses all of these cells Neural Computation 18, 2719–2729 (2006)
© 2006 Massachusetts Institute of Technology
2720
M. Shamir
when reading out the information on the stimulus in higher brain areas. In many sensory and motor tasks, the psychophysical response to a certain stimulus is believed to be the manifestation of pooling information from a large population of cells (see, e.g., Georgopoulos, Kalaska, Caminiti, & Massey, 1982; McAlpine, Jiang, & Palmer, 2001). However, there are also experimental studies reporting that in certain brain areas, the psychophysical response is equivalent to that of the “single best cell” in the population (see, e.g., Parker & Newsome, 1998; Newsome, Britten, & Movshon, 1989; Skottun, Shackleton, Arnott, & Palmer, 2001). Although the question of single best neuron versus population readout is a fundamental question in neuroscience, a basic theoretical framework for addressing this issue is still lacking. One problem is that the single-best-cell readout is not well defined. A possible single-best-cell readout is the winner-takes-all (WTA) readout (see, e.g., Hertz, Krogh, & Palmer, 1991). According to the WTA approach, the best neuron is chosen to be the neuron with the highest firing rate. In this work, we study the accuracy of the WTA readout. In particular, we are interested in the scaling of the performance of such a readout mechanism with the number of cells in the population, N. We address this problem by investigating the WTA performance in a two-interval two-alternative forced choice (2I2AFC) discrimination task (see below). This is done in a statistical model for the responses of a population of nerve cells coding for a specific stimulus θ. The outline of this article is as follows. We begin by defining the statistical model for the neuronal population and the discrimination task in section 1.1. The main known results for the scaling of the linear readout performance are reviewed in section 1.2. In section 2, we define the WTA readout and analyze its performance. 
Finally, in section 3, we summarize our results and discuss further extensions of the theory presented here.

1.1 The Model. We model a population of N independently and identically distributed (i.i.d.) cells with an exponentially distributed response tuned to a continuous-variable stimulus, θ. Denoting by x_i the response of the ith neuron to the stimulus, the probability distribution of the population response, x, given the stimulus θ is

P(x | θ) = ∏_{i=1}^{N} p(x_i | θ) ,  (1.1)

p(x | θ) = µ(θ) e^{−µ(θ) x} for x ≥ 0, and 0 otherwise,  (1.2)

where the function µ(θ) determines the tuning to the stimulus, θ, of both the mean response of a single cell, ⟨x⟩ = 1/µ(θ), and its standard deviation, √⟨(δx)²⟩ = 1/µ(θ). Note that we have used the notation ⟨· · ·⟩ to represent averaging
The Scaling of WTA Accuracy with Population Size
2721
over the distribution of the neuronal responses, given a specific stimulus θ, that is, ⟨f(x)⟩ = ∫ f(x) P(x | θ) dx. The Fisher information, J, is a measure of the sensitivity by which the stimulus, θ, can be represented by the stochastic responses of the cells (see, e.g., Thomas & Cover, 1991). The Fisher information of a random variable y, with a conditional probability distribution function, P(y | θ), is given by

J_y(θ) = ⟨ ( ∂ log P(y | θ) / ∂θ )² ⟩ .  (1.3)
Note that the Fisher information, J_y, of a random variable y is a property of the random variable y. Taking y = x_i in equation 1.3, for any i ∈ {1, 2, . . . , N}, one obtains the Fisher information of a single cell, J_o = (µ′(θ)/µ(θ))², where µ′(θ) = dµ(θ)/dθ. Taking y = {x_i}_{i=1}^{N} in equation 1.3, one obtains the Fisher information of the entire population of N cells, J_pop = J_o N. In a 2I2AFC discrimination task, the system is given two stimuli, θ and (θ + δθ), in random order. We will denote by θ^(1) the first stimulus and by θ^(2) the second stimulus. Each stimulus generates a population response, x^(1) and x^(2), for stimuli θ^(1) and θ^(2), respectively. The readout’s task is to infer on the basis of these responses which stimulus was the first. Below we investigate the performance of different readouts for this task. We shall use the probability of a correct discrimination, Pc, as a measure of the performance. We shall term the stimulus difference, δθ_th, at which Pc crosses a certain threshold, P_th, the sensitivity of the readout. This latter measure is related to the definition of the “just noticeable difference” used in psychophysics.
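The model and the Fisher information quantities above can be illustrated with a short numerical sketch. The tuning function mu below is a hypothetical choice (the article leaves µ(θ) general), and the Monte Carlo estimate of J_o from the score function is only a consistency check, not part of the article.

```python
import numpy as np

# Sketch of the i.i.d. exponential population model of equations 1.1-1.2,
# with a hypothetical tuning function mu(theta).

def mu(theta):
    return 1.0 + theta**2          # any positive, differentiable tuning works

def mu_prime(theta, h=1e-5):
    return (mu(theta + h) - mu(theta - h)) / (2 * h)

def sample_population(theta, n, rng):
    """One draw of n i.i.d. responses, x_i ~ mu(theta) exp(-mu(theta) x)."""
    return rng.exponential(scale=1.0 / mu(theta), size=n)

rng = np.random.default_rng(0)
theta, N = 1.0, 10

x = sample_population(theta, N, rng)

# Single-cell and population Fisher information: J_o = (mu'/mu)^2, J_pop = J_o N.
J_o = (mu_prime(theta) / mu(theta)) ** 2
J_pop = J_o * N

# Monte Carlo consistency check via the score:
# d/dtheta log p(x|theta) = mu'/mu - mu' x, and J_o = <score^2>.
xs = sample_population(theta, 200_000, rng)
score = mu_prime(theta) / mu(theta) - mu_prime(theta) * xs
J_mc = float(np.mean(score**2))
print(J_o, J_mc)  # the analytic and Monte Carlo values should agree closely
```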
1.2 Linear Readout Discrimination. The decision of an optimal linear readout for discrimination is made according to the sign of a decision variable q = Σ_{i=1}^{N} (x_i^(1) − x_i^(2)) (see, e.g., Shamir & Sompolinsky, 2004). Without loss of generality, we shall assume throughout this article that δθ ≥ 0 and that µ(θ) ≤ µ(θ + δθ), with equality if and only if δθ = 0. We further assume the tuning curve, µ(θ), is a differentiable function of the stimulus, θ. In this case, the optimal decision of the linear readout is that θ was the first stimulus presented if q > 0, while for q < 0, the optimal readout decides that (θ + δθ) was the first stimulus. For large populations of cells and assuming δθ to be small, the probability of a correct discrimination is given by Pc = 1 − H(d/√2), where d = δθ √(J_o N) = δθ √J_pop and H(x) = ∫_x^∞ (dz/√(2π)) e^{−z²/2} (see, e.g., Shamir & Sompolinsky, 2004). Hence, the probability of a correct discrimination, Pc, decays to 1 exponentially in the number of cells in the population. The sensitivity, δθ_th, of this readout is defined as the value of δθ for which Pc reaches an arbitrary threshold P_th: δθ_th = √(2/(J_o N)) H^{−1}(1 − P_th), where H^{−1} is the inverse function of H. Hence, the sensitivity of the linear discrimination
scales like 1/√N. Identical results are obtained using the maximum likelihood approach in this case. Below we define the WTA readout for this task and study its performance.

2 The WTA Readout

2.1 The WTA Performance. The WTA readout for a 2I2AFC task is defined by

decide the first stimulus was θ if max{x_i^(1)} > max{x_i^(2)}, and (θ + δθ) if max{x_i^(2)} > max{x_i^(1)} .  (2.1)
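The rule of equation 2.1 is straightforward to simulate. In the sketch below, the winner of each population is drawn directly by inverting the cumulative distribution of the maximum of N i.i.d. exponential variables, (1 − e^{−µx})^N; the rate parameters µ1 = 1 and µ2 = 1.3 are illustrative choices, not values from the article.

```python
import numpy as np

# Monte Carlo sketch of the WTA decision rule of equation 2.1 for the
# 2I2AFC task with i.i.d. exponential neurons (equation 1.2). The winner
# max_i x_i is sampled exactly by inverting its CDF, (1 - exp(-mu x))^N.

def sample_winner(mu, N, trials, rng):
    u = rng.random(trials)
    return -np.log1p(-u ** (1.0 / N)) / mu

def wta_correct_fraction(mu1, mu2, N, trials, rng):
    """Fraction of trials on which the WTA readout answers correctly when
    stimulus theta (rate mu1 <= mu2) is presented first."""
    m1 = sample_winner(mu1, N, trials, rng)   # winners for theta
    m2 = sample_winner(mu2, N, trials, rng)   # winners for theta + dtheta
    return float(np.mean(m1 > m2))

rng = np.random.default_rng(1)
pc_small = wta_correct_fraction(1.0, 1.3, N=100, trials=200_000, rng=rng)
pc_large = wta_correct_fraction(1.0, 1.3, N=10_000, trials=200_000, rng=rng)
print(pc_small, pc_large)  # Pc grows toward 1, but only slowly with N
```

Here d = µ2/µ1 − 1 = 0.3, so a hundredfold increase in N buys only a modest improvement in Pc, in line with the weak scaling derived below.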
The probability of a correct decision, Pc , of the WTA readout is given by
Pc = N ∫_0^∞ dx (p_1(x)/Φ_1(x)) exp{ N ( log[Φ_1(x)] + log[Φ_2(x)] ) } ,  (2.2)

where we have used, for convenience, the notation p_1(x) = p(x | θ) and p_2(x) = p(x | θ + δθ). The cumulative distribution functions are defined by Φ_i(x) ≡ ∫_0^x p_i(x′) dx′, i ∈ {1, 2}. For large N, the integral over x in equation 2.2 is dominated by the region where the exponent is maximal. Since the cumulative distribution function, Φ(x), is monotonically increasing in x, the WTA performance is determined by the tail of the distribution of x for large systems. It is convenient to transform the integral to the interval (0, 1] by the change of variable y = 1 − Φ_1(x) = e^{−µ(θ)x} and obtain
Pc = N ∫_0^1 dy (1/(1 − y)) e^{N g(y)} ,  (2.3)

g(y) = log[1 − y] + log[Φ_2(x(y))] .  (2.4)
In the case of an exponential distribution, equation 1.2, one obtains

g(y) = log[1 − y] + log[1 − y^{1+d}] ,  (2.5)

where d = µ(θ + δθ)/µ(θ) − 1 ≈ δθ √J_o ≥ 0. Now

dg/dy = −(1/(1 − y)) ( 1 + (Φ_1(x(y))/Φ_2(x(y))) (p_2(x(y))/p_1(x(y))) ).

In the limit of y → 0, one obtains Φ_1(x(y))/Φ_2(x(y)) → 1 and lim_{y→0} p_2(x(y))/p_1(x(y)) = 0 for δθ > 0; thus, at y = 0, dg/dy = −1 for δθ > 0 and dg/dy = −2 for δθ = 0. Hence, to leading order in the number of cells, N, one obtains the trivial result of Pc = 1 for δθ > 0 and Pc = 1/2 for δθ = 0 in the limit of large N. In order to obtain the leading nontrivial correction to Pc for large N, we apply Watson’s lemma (see, e.g.,
Orszag & Bender, 1991),

∫_0^b dx Σ_n a_n x^{α+nβ} e^{−Nx} ≈ Σ_n a_n Γ(α + nβ + 1) / N^{α+nβ+1} ,   α > −1, β > 0.  (2.6)
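Watson's lemma can be checked numerically for a single term of the sum in equation 2.6; the values α = 0.5, N = 50, and b = 1 below are arbitrary illustrations.

```python
import math

# Numerical check of Watson's lemma (equation 2.6) for one term x^alpha:
# int_0^b x^alpha e^(-N x) dx ~ Gamma(alpha + 1) / N^(alpha + 1) for large N.

def left_integral(alpha, N, b=1.0, steps=200_000):
    # simple midpoint rule; the integrand is concentrated near x = 0
    h = b / steps
    return sum(((k + 0.5) * h) ** alpha * math.exp(-N * (k + 0.5) * h) * h
               for k in range(steps))

alpha, N = 0.5, 50.0
numeric = left_integral(alpha, N)
watson = math.gamma(alpha + 1) / N ** (alpha + 1)
print(numeric, watson)  # agreement improves as N grows
```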
Changing variables to u = −g(y), one obtains

Pc = N ∫_0^∞ du [ 1 / ( 1 + (Φ_1(x(u))/Φ_2(x(u))) (p_2(x(u))/p_1(x(u))) ) ] e^{−Nu} .  (2.7)

For large N, the last integral is dominated by small u. Near u = 0, one obtains u ≈ y(1 + O(y^d)). Using Watson’s lemma, the first correction to Pc is given by

Pc ≈ 1 − Γ(1 + d)(1 + d) N^{−d} + O(N^{−2d}) ,  (2.8)
with d = √J_o δθ. For relatively small d, near threshold, the dominant term in the first correction to Pc, equation 2.8, is N^{−d}; thus, at threshold, one obtains N^{−d_th} = 1 − P_th, and hence, assuming small discrimination errors, one obtains a logarithmic scaling for the WTA sensitivity with the population size:

δθ_th ≈ − log(1 − P_th) / ( √J_o log(N) ) .  (2.9)
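Equations 2.8 and 2.9 can be checked against a direct numerical integration of equation 2.3, with g(y) taken from equation 2.5; the values d = 0.4 and N = 5000 below are illustrative.

```python
import math

# Compare the exact WTA Pc of equation 2.3 (numerical integration, with
# g(y) from equation 2.5) against the large-N approximation of
# equation 2.8, Pc ~ 1 - Gamma(1+d)(1+d) N^(-d).

def pc_exact(d, N, steps=400_000):
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):               # midpoint rule on (0, 1)
        y = (k + 0.5) * h
        g = math.log1p(-y) + math.log1p(-(y ** (1.0 + d)))
        total += math.exp(N * g) / (1.0 - y) * h
    return N * total

def pc_approx(d, N):
    return 1.0 - math.gamma(1.0 + d) * (1.0 + d) * N ** (-d)

d, N = 0.4, 5000
exact_pc = pc_exact(d, N)
approx_pc = pc_approx(d, N)
print(exact_pc, approx_pc)  # the difference is of order N^(-2d)
```

At δθ = 0 (d = 0), the same integral reduces to the trivial Pc = 1/2 noted above, a convenient sanity check on the quadrature.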
Note that in approximating N^{−d_th} = 1 − P_th, we are neglecting corrections that are of order N^{−2d_th} = O([1 − P_th]²). Hence, the approximation of equation 2.9 holds for small values of (1 − P_th). In order to obtain higher-order corrections for d_th, terms up to order N^{−2d} in equation 2.8 have to be taken explicitly into account. It is then convenient to use the ansatz d_th = (−log(1 − P_th) + d^(1)) / log(N), solving to leading order in d^(1). Figure 1 shows Pc as a function of N for different values of d; from bottom to top, d = 0.2, 0.3, 0.4, 0.5, 0.6. The open circles show the estimation of Pc from 3000 simulations. The first approximation for the scaling of Pc, equation 2.8, is shown by the solid lines. From the figure, one can see that the approximation is indeed better for larger values of N and that for smaller values of d, one needs to study larger populations for the approximation to be effective. This results from the fact that the second nontrivial correction to Pc scales like N^{−2d}. Figure 2 shows the scaling of the sensitivity of the WTA readout with the number of cells in the population for different threshold values; from top to bottom, P_th = 0.8, 0.85, 0.9. The open circles show the numerical estimation of the value of d at threshold, d_th, which was computed as follows. For
Figure 1: Performance of the WTA readout in a discrimination task. The value of Pc is plotted as a function of the number of cells in the population, N, for d = 0.2, 0.3, 0.4, 0.5, 0.6, from bottom to top. The open circles show the numerical estimation of Pc computed by averaging over 3000 realizations of the population response. The solid lines show the approximation of equation 2.8.
every open circle in the figure, the value of Pc was estimated by averaging over 10,000 simulations of the stochastic responses of the population. The value of d was systematically varied in steps of 0.005 until the estimated Pc crossed the threshold value, P_th. The estimated sensitivity, d_th, was then taken to be the weighted average of the two values of d, d↑ and d↓, for which the estimated Pc was just above (Pc↑) and just below (Pc↓) threshold, that is,

d_th = d↑ (P_th − Pc↓)/(Pc↑ − Pc↓) + d↓ (Pc↑ − P_th)/(Pc↑ − Pc↓) .

The dashed lines show the approximation of equation 2.9. From the figure, the logarithmic scaling of the WTA readout sensitivity can easily be seen. Moreover, one can also observe that the asymptotic relation of equation 2.9 provides a good approximation for d_th for small values of (1 − P_th). When (1 − P_th) is not very small, the numerical results of Figure 2 show some deviation from the asymptotic relation of equation 2.9 (compare the upper dashed line and open circles for P_th = 0.8). Nevertheless, d_th still seems to scale logarithmically with N.

2.2 The Fisher Information of the “Winner.” The above analysis can be complemented by studying the Fisher information of the “winner”:
Figure 2: Scaling of the WTA sensitivity with the number of cells. The inverse of the WTA sensitivity, (d_th)^{−1}, is shown as a function of the population size, N, for different threshold values: P_th = 0.8, 0.85, 0.9, from top to bottom. The open circles show the sensitivity as computed numerically by simulating the stochastic neuronal responses (see the text for details). The dashed lines show the approximation of equation 2.9.

X = max{x_i}_{i=1}^{N}. The probability distribution of X is given by the derivative of its cumulative distribution function,¹
P(X | θ) = d/dx (Φ(x | θ))^N |_{x=X} .  (2.10)
The Fisher information, J̃, of the winner, X, is given by

J̃ = J_o [ 1 + N(N − 1) (d²/dµ²) ∫_0^1 y^{µ−1} (1 − y)^{N−3} dy |_{µ=2} ] .  (2.11)
¹ The cumulative distribution function of the winner, P(Winner < X) = ∫_0^X P(X′ | θ) dX′, is equal to the probability that the responses of all cells in the population are less than X. Using the statistical independence of the neural responses, one obtains ∫_0^X P(X′ | θ) dX′ = (Φ(X | θ))^N. Taking the derivative of this last equation with respect to X, one obtains equation 2.10.
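Equation 2.11 can be evaluated numerically. The sketch below is one illustrative reading of that expression: it uses the identity ∫_0^1 y^{µ−1}(1 − y)^{N−3} dy = B(µ, N − 2) and a central finite difference for the second µ-derivative at µ = 2; it is not the authors' code.

```python
import math

# Numerical evaluation of equation 2.11 for the Fisher information of the
# winner: J~ / J_o = 1 + N(N-1) d^2/dmu^2 B(mu, N-2) at mu = 2, where
# B(a, b) is the beta function, computed via log-gamma for stability.

def beta_fn(a, b):
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def winner_fisher_ratio(N, h=1e-4):
    d2 = (beta_fn(2 + h, N - 2) - 2 * beta_fn(2, N - 2)
          + beta_fn(2 - h, N - 2)) / h**2          # central second difference
    return 1.0 + N * (N - 1) * d2

for n in (10, 100, 1000):
    print(n, winner_fisher_ratio(n), math.log(n) ** 2)
```

The computed ratio grows like (log N)² up to a prefactor close to one, consistent with the slow scaling derived below and visible in Figure 3.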
Figure 3: Fisher information of the winner. The Fisher information of the winner is plotted as a function of the number of cells. The open circles show the Fisher information as computed directly from equation 2.11. The solid line shows the approximation of equation 2.12. For comparison, the dashed line shows the curve α(log N)² with α = 0.94.
Applying Watson’s lemma, equation 2.6, and keeping only the leading terms for the scaling of J̃ with N, one obtains

J̃ ≈ J_o (log N)² .  (2.12)
This result is consistent with the logarithmic scaling of the discrimination sensitivity of the WTA readout, equation 2.9. Figure 3 shows the Fisher information of the winner (open circles) as a function of the number of cells in the population, as obtained by numerical integration of the exact expression for J̃, equation 2.11. The solid line shows the analytical approximation of equation 2.12. The dashed line shows the curve α(log N)² with α = 0.94 for comparison. Note that similar scaling can be obtained by a simple approximation, replacing the sigmoidal term Φ^N(x) by a step function at the point of maximal slope, x* = log N. Thus, in contrast to the linear scaling of the population Fisher information, J_pop = J_o N, the Fisher information of the winner grows very slowly, logarithmically, with the population size.
3 Summary and Discussion

The accuracy of the WTA readout was studied using the framework of a 2I2AFC paradigm in an i.i.d. population of cells. We find that for large populations, the performance of the WTA is dominated by the tail of the single-cell-response distribution. It was shown that in the case of neurons with exponentially distributed responses, the probability of a correct discrimination, Pc, approaches the value of 1 algebraically with the number of cells (see equation 2.8 and Figure 1). Furthermore, we have shown that the WTA sensitivity scales logarithmically with the population size (see equation 2.9 and Figure 2 for the sensitivity; see equation 2.12 and Figure 3 for the scaling of the Fisher information of the winner). This weak scaling of the WTA performance with the population size should be compared to the performance of other types of readout, such as the linear readout (see section 1.2), in which (1 − Pc) is exponentially decaying in N and the sensitivity threshold scales algebraically in N. The above analysis can be applied to other kinds of response distributions as well. For example, we find that in the case of distributions with an algebraic tail, p(x) ∝ x^{−µ(θ)}, and in the case of gaussian distributions with stimulus-dependent variance, p(x) ∝ e^{−µ(θ)x²/2}, similar scaling of the performance and sensitivity of the WTA readout is obtained. Namely, the performance approaches 1 algebraically in N, (1 − Pc) ∝ N^{−α}, and the sensitivity of the WTA decision scales logarithmically with the number of cells in the population, δθ_th ∝ 1/log(N) (results not shown). In the context of a nerve cell population coding for an external stimulus by its firing rates, an interesting distribution to study is the Poisson distribution. We have addressed this issue numerically. Figure 4 shows the scaling of the WTA sensitivity with the number of cells in an i.i.d. population of Poisson neurons. The two stimuli, θ^(1) and θ^(2), generated i.i.d. Poisson-distributed population responses with mean 50 given stimulus θ^(1) and mean (50 + d) given stimulus θ^(2). The value of d was systematically increased in steps of 0.25 until Pc crossed a threshold level, P_th. The threshold level was set to be P_th = 0.8 (circles) and P_th = 0.9 (squares). The sensitivity, d_th, was computed by a weighted average of the two values of d for which the estimated Pc was just above and just below threshold (as in Figure 2). For each value of d, the probability of a correct discrimination, Pc, was estimated by averaging over 4000 realizations of the neural responses. From the figure, one can see that the logarithmic scaling of the WTA sensitivity with the number of cells holds also in a Poisson population. This study was motivated by empirical findings showing that in certain brain areas, a considerable fraction of the cells code information about a specific stimulus feature with an accuracy that is comparable to the psychophysical accuracy. These findings raise serious doubts as to the utility of population codes. This is because of the strong scaling of the population
2728
M. Shamir
Figure 4: Scaling of the WTA sensitivity with the number of cells in a Poisson population. The inverse of the WTA sensitivity, (d_th)^{−1}, is shown as a function of the population size, N, for different threshold values: Pth = 0.8 (open circles) and Pth = 0.9 (open squares). The sensitivity is defined as the difference between the average firing rate given stimulus 1 (50 spikes) and the average firing rate given stimulus 2 (50 + d spikes) at which the threshold value, Pth, is reached by Pc. The value of Pc was estimated numerically by averaging over 4000 realizations (see the text for details).
code sensitivity with the number of cells, which predicts a psychophysical sensitivity several orders of magnitude better than observed. In contrast, the weak scaling of the WTA readout accuracy with the population size suggests that such a readout mechanism can account for the psychophysical accuracy on the basis of the neural responses. This study provides a basic framework for addressing the fundamental question of whether readout in the central nervous system is based on population averaging or on the response of a single cell. However, this study focused on the assumption of a population of cells with identical response functions that are statistically independent. It is well established empirically that neural responses are heterogeneous and are also not independent. Recent theoretical studies have shown that the experimentally observed correlated noise in the firing rates of different neurons may have a drastic detrimental effect on the capacity of the population to code the stimulus with a high degree of accuracy (see, e.g., Sompolinsky, Yoon, Kang, & Shamir, 2001). On the other hand, population heterogeneity enables specific readout methods, tuned to this heterogeneity, to overcome the correlated
noise (Shamir & Sompolinsky, 2006). In order to make a more definitive claim as to the validity of either the population code hypothesis or the single-best-cell code hypothesis, the effects of these properties of the neural responses should be taken into account.

Acknowledgments

M.S. is supported by a fellowship from the Burroughs Wellcome Fund.

References

Coltz, J. D., Johnson, M. T., & Ebner, T. J. (2000). Population code for tracking velocity based on cerebellar Purkinje cell simple spike firing in monkeys. Neurosci. Lett., 296(1), 1–4.
Georgopoulos, A. P., Kalaska, J. F., Caminiti, R., & Massey, J. T. (1982). On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. J. Neurosci., 2(11), 1527–1537.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233(4771), 1416–1419.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol., 160, 106–154.
McAlpine, D., Jiang, D., & Palmer, A. R. (2001). A neural code for low-frequency sound localization in mammals. Nat. Neurosci., 4(4), 396–401.
Newsome, W. T., Britten, K. H., & Movshon, J. A. (1989). Neuronal correlates of a perceptual decision. Nature, 341, 52–54.
Orszag, S. A., & Bender, C. M. (1991). Advanced mathematical methods for scientists and engineers. New York: Springer-Verlag.
Parker, A. J., & Newsome, W. T. (1998). Sense and the single neuron: Probing the physiology of perception. Annu. Rev. Neurosci., 21, 227–277.
Razak, K. A., & Fuzessery, Z. M. (2002). Functional organization of the pallid bat auditory cortex: Emphasis on binaural organization. J. Neurophysiol., 87(1), 72–86.
Shamir, M., & Sompolinsky, H. (2004). Nonlinear population codes. Neural Comput., 16(6), 1105–1136.
Shamir, M., & Sompolinsky, H. (2006). Implications of neuronal diversity on population coding. Neural Comput., 18, 1951–1986.
Skottun, B. C., Shackleton, T. M., Arnott, R. H., & Palmer, A. R. (2001). The ability of inferior colliculus neurons to signal differences in interaural delay. Proc. Natl. Acad. Sci. USA, 98(24), 14050–14054.
Sompolinsky, H., Yoon, H., Kang, K., & Shamir, M. (2001). Population coding in neuronal systems with correlated noise. Phys. Rev. E, 64(5 Pt. 1), 051904.
Thomas, J. A., & Cover, T. M. (1991). Elements of information theory. New York: Wiley.
Received August 19, 2005; accepted January 17, 2006.
LETTER
Communicated by Klaus-Robert Müller
An Extended EM Algorithm for Joint Feature Extraction and Classification in Brain-Computer Interfaces Yuanqing Li [email protected]
Cuntai Guan [email protected] Neural Signal Processing Lab, Institute for Infocomm Research, Singapore 119613
For many electroencephalogram (EEG)-based brain-computer interfaces (BCIs), a tedious and time-consuming training process is needed to set parameters. In BCI Competition 2005, reducing the training process was explicitly proposed as a task. Furthermore, an effective BCI system needs to be adaptive to dynamic variations of brain signals; that is, its parameters need to be adjusted online. In this article, we introduce an extended expectation maximization (EM) algorithm, in which the extraction and classification of common spatial pattern (CSP) features are performed jointly and iteratively. In each iteration, the training data set is updated using all or part of the test data and the labels predicted in the previous iteration. Based on the updated training data set, the CSP features are reextracted and classified using a standard EM algorithm. Since the training data set is updated frequently, the initial training data set can be small (semisupervised case) or null (unsupervised case). During these iterations, the parameters of the Bayes classifier and the CSP transformation matrix are also updated concurrently. In online situations, we can still run the training process to adjust the system parameters using unlabeled data while a subject is using the BCI system. The effectiveness of the algorithm depends on the robustness of the CSP feature to noise and on the convergence of the iteration, both of which are discussed in this article. Our proposed approach has been applied to data set IVa of BCI Competition 2005. The data analysis results show that we can obtain satisfactory prediction accuracy using our algorithm in the semisupervised and unsupervised cases. The convergence of the algorithm and the robustness of the CSP feature are also demonstrated in our data analysis.
1 Introduction

As brain-computer interfaces (BCIs) provide an alternative means of communication and control for people with severe motor disabilities (Birbaumer et al., 1999), research into BCIs has received more attention in recent years, as seen in Blanchard and Blankertz (2004), Donoghue (2002),

Neural Computation 18, 2730–2761 (2006)
© 2006 Massachusetts Institute of Technology
An Extended EM Algorithm
2731
Kubler, Kotchoubey, Kaiser, Wolpaw, and Birbaumer (2001), Pfurtscheller et al. (2000), and Wolpaw, Birbaumer, McFarland, Pfurtscheller, and Vaughan (2002). Being noninvasive, an electroencephalogram (EEG)-based BCI measures specific components of EEG activity, extracts features, and translates these features into control signals for devices such as a robot arm or a cursor. The features commonly used in EEG-based BCIs include visual evoked potentials, slow cortical potentials, P300 evoked potentials, common spatial pattern (CSP) features, mu and beta rhythms and other activities from sensorimotor cortex, and autoregressive parameters (Wolpaw et al., 2002). CSP features of EEG signals correspond to event-related desynchronization (ERD) and event-related synchronization (ERS) evoked by motor imagery or movements (Pfurtscheller, Neuper, Flotzinger, & Pregenzer, 1997). The CSP feature is very effective in discriminating between several motor imageries (Blanchard & Blankertz, 2004; Ramoser, Muller-Gerking, & Pfurtscheller, 2000; Wolpaw et al., 2002). However, CSP feature extraction relies on a time-consuming training process to determine a spatial filter matrix, also known as the CSP transformation matrix. Considering the importance of reducing the training effort, Muller and his colleagues provided a data set with a small training set for the BCI competition in 2005 (Dornhege, Blankertz, Curio, & Muller, 2004; see http://ida.first.fraunhofer.de/projects/bci/competition). In this article, semisupervised learning, which refers to finding a decision rule from both labeled and unlabeled data, will be used to tackle the small-training-set problem in BCI systems. Semisupervised learning has gained much appeal in recent years due to its potential to reduce the labeling and training effort, which is usually tedious and time-consuming (Nigam & Ghani, 2000; Grandvalet & Bengio, 2004).
A necessary condition for a semisupervised learning algorithm, as well as for an unsupervised learning algorithm, is that the applied feature set has sufficient consistency (Zhou, Bousquet, Lal, Weston, & Schölkopf, 2003); otherwise the algorithm will not work well. In this article, the consistency is reflected in the Fisher ratio of the two classes of features. If a small data set is used for training and CSP features are then directly extracted from the training data set and the test data set, the feature consistency is not sufficient. Consequently, a standard semisupervised learning method cannot be directly employed for classification. To solve this problem, we propose an extended expectation maximization (EM) algorithm by embedding a feature reextraction step into the standard EM algorithm. In each iteration of the proposed algorithm, the training data set is updated using all or part of the test set1 with labels (predicted in the previous iteration) in order to expand the training data so that it becomes sufficient.
1. In this article, the term test set refers to unlabeled trials.
2732
Y. Li and C. Guan
Based on the updated training data set, CSP features are then reextracted and classified by a standard EM algorithm. The improvement in prediction accuracy in one iteration leads to a higher consistency of the CSP features in the next iteration, which in turn leads to a further improvement in the subsequent prediction accuracy, and so on. This is the main difference between our algorithm and conventional semisupervised algorithms. The initial training set can be small or even null; that is, the proposed algorithm can be used both in a semisupervised learning case (with a small initial training data set) and in an unsupervised learning case (without any initial training data set). Furthermore, the proposed algorithm can be used to improve the adaptability of BCI systems. In general, if we do not consider the adaptability of a BCI system, its parameters, such as the CSP transformation matrix and the classifier parameters, do not change once determined unless new training is performed. Many researchers believe that EEG and other electrophysiological signals typically display short-term and long-term variations linked to several factors, such as time of day, hormonal levels, immediate environment, recent events, fatigue, and illness (McEvoy, Smith, & Gevins, 2000; Polich, 2004; Regan, 1989; Wolpaw et al., 2002). In other words, for BCI systems, the consistency of the features may have only a short-term existence. An effective BCI system should therefore be adaptive, adjusting its parameters online (Millan & Mourino, 2003; Vidaurre, Schlogl, Cabeza, Scherer, & Pfurtscheller, 2005); this reduces the impact of such spontaneous variations and maintains the consistency of the features. As will be seen, the CSP transformation matrix and the Bayes classifier parameters are also updated using test data and predicted labels in our method. The BCI system parameters can thus be adjusted online without new system training.
The remainder of this article is organized as follows. In section 2, we describe CSP feature extraction and analyze the robustness of the CSP feature to noise. In section 3, we introduce our extended EM algorithm. The convergence of the algorithm is also analyzed. Data analysis results in section 4 demonstrate our algorithm's validity. Section 5 concludes with a discussion of the data analysis results.

2 CSP Feature Extraction and the Robustness of the CSP Feature

The CSP feature, which is commonly used in EEG-based BCI systems, is very effective for discriminating motor imageries. In this section, we describe CSP feature extraction and present a robustness analysis of the CSP feature.

2.1 CSP Feature Extraction. For the convenience of the following analysis, we present the main steps of CSP feature extraction, which can be seen
in Blanchard and Blankertz (2004) and Lemm, Blankertz, Curio, and Muller (2005). Hereafter, let N1 and N2 denote the trial numbers of the training set and the test set, respectively. Define
$$\Sigma^{(1)} = \sum_{j \in C_1} \frac{E_j E_j^T}{\mathrm{trace}(E_j E_j^T)}, \qquad
\Sigma^{(2)} = \sum_{j \in C_2} \frac{E_j E_j^T}{\mathrm{trace}(E_j E_j^T)}, \tag{2.1}$$
where $E_j \in R^{m \times k_2}$ denotes the EEG data matrix of the jth trial, m is the number of selected channels, $k_2$ is the number of samples in each trial, and $C_1$ and $C_2$ refer to the first and second classes of trials in the training data set, respectively. The two matrices $\Sigma^{(1)}$ and $\Sigma^{(2)}$ are then jointly diagonalized. Set $\Sigma = \Sigma^{(1)} + \Sigma^{(2)}$, which is a symmetric matrix. Let V be an orthogonal matrix whose first row vector is nonnegative (if the first entry of a column vector is negative, we multiply that column vector by −1), such that

$$V^T \Sigma V = P, \tag{2.2}$$

where P is a diagonal matrix composed of the eigenvalues of $\Sigma$, in decreasing order. Set $U = P^{-1/2} V^T$, $R_1 = U \Sigma^{(1)} U^T$, $R_2 = U \Sigma^{(2)} U^T$. It is observed that $R_1$ is symmetric; thus, we can find an orthogonal matrix, denoted Z, with its first row vector nonnegative, such that

$$Z^T R_1 Z = D = \mathrm{diag}(d_1, \ldots, d_m), \tag{2.3}$$

where the diagonal elements of D are sorted in decreasing order and $0 \le d_1, \ldots, d_m \le 1$. Define $W = Z^T U$; then

$$W \Sigma^{(1)} W^T = D, \qquad W \Sigma^{(2)} W^T = I - D, \tag{2.4}$$

where I is the identity matrix. Next, we construct the CSP transformation matrix $\bar{W}$, composed of the first $l_1$ and the last $l_2$ rows of W. The first $l_1$ rows of W correspond to the largest $l_1$ eigenvalues of D, and the last $l_2$ rows correspond to the smallest $l_2$ eigenvalues of D.
For the EEG data matrix $E_j$ obtained from the jth trial, the CSP feature vector is defined as

$$\mathrm{cf}(j) = \mathrm{diag}\left( \bar{W} \, \frac{E_j E_j^T}{\mathrm{trace}(E_j E_j^T)} \, \bar{W}^T \right), \tag{2.5}$$
where $j = 1, \ldots, N_1 + N_2$.

Remark 1. In the above CSP feature extraction, the first row vectors of the two orthogonal matrices V and Z used for diagonalization are set to be nonnegative. In a standard CSP feature extraction, there is no such constraint. As will be seen in section 2.2, V and Z are generally unique under this constraint. The uniqueness of V and Z is needed to guarantee the robustness of the CSP feature.

2.2 Robustness of the CSP Feature. An extracted feature that can effectively reflect the subject's intentions even in a noisy environment is highly desirable. This is a question of feature robustness to noise. In this article, there are not sufficient training data to determine the CSP transformation matrix, so the test data along with the predicted labels are used for training. Since prediction error in the labels is inevitable in each iteration, we need to consider the robustness of the CSP feature in our algorithm. To analyze this robustness, we consider the two correlation matrices $\Sigma^{(1)} + \varepsilon_1$ and $\Sigma^{(2)} + \varepsilon_2$, where $\Sigma^{(1)}$ and $\Sigma^{(2)}$ are defined in equation 2.1, and $\varepsilon_1$ and $\varepsilon_2$ are symmetric matrices related to noise, which can be additive in nature. $\varepsilon_1$ and $\varepsilon_2$ are defined in this article as follows,
$$\varepsilon_1 = \sum_{j \in C_1^r} \frac{E_j E_j^T}{\mathrm{trace}(E_j E_j^T)}, \qquad
\varepsilon_2 = \sum_{j \in C_2^r} \frac{E_j E_j^T}{\mathrm{trace}(E_j E_j^T)}, \tag{2.6}$$
where $C_1^r$ and $C_2^r$ denote the sets of trials misclassified into the first and second class, respectively. As in the previous section, we can find a joint diagonalization matrix, denoted $W(\varepsilon)$, such that

$$W(\varepsilon)\big(\Sigma^{(1)} + \varepsilon_1\big) W^T(\varepsilon) = D(\varepsilon), \qquad
W(\varepsilon)\big(\Sigma^{(2)} + \varepsilon_2\big) W^T(\varepsilon) = I - D(\varepsilon), \tag{2.7}$$

where $\varepsilon$ denotes $\max\{\|\varepsilon_1\|_1, \|\varepsilon_2\|_1\}$. Note that the 1-norm of a matrix is the sum of the absolute values of all its entries.
In a noisy environment, the CSP feature for the jth trial, denoted $\mathrm{cfn}(\varepsilon, j)$, is

$$\mathrm{cfn}(\varepsilon, j) = \mathrm{diag}\left( \bar{W}(\varepsilon) \, \frac{E_j E_j^T}{\mathrm{trace}(E_j E_j^T)} \, \bar{W}^T(\varepsilon) \right). \tag{2.8}$$

Before presenting our results, we state two lemmas. The first can be found in Chen (2000) or other textbooks on matrix theory.

Lemma 1 (Bauer-Fike). Suppose that $A = Q P Q^{-1}$, $P = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$. Then for any eigenvalue u of $A + \Delta$, we have

$$\min_i |\lambda_i - u| \le \|Q^{-1} \Delta Q\|_2, \tag{2.9}$$
where $\Delta$ is a perturbation matrix of consistent dimension. In this article, the matrix norm $\|A\|_2$ refers to the spectral norm of A, that is, $\|A\|_2 = \sqrt{\lambda}$, where $\lambda$ is the maximum eigenvalue of $A^T A$. The spectral norm is consistent with the Frobenius vector norm.

Lemma 2. Suppose that a real symmetric matrix $A \in R^{m \times m}$ has m different eigenvalues, and $\Delta \in R^{m \times m}$ is a symmetric perturbation matrix; let $\theta = \|\Delta\|_2$. Let $G(\theta)$ and G be two orthogonal matrices with their first row vectors nonnegative, such that

$$G^T(\theta)(A + \Delta) G(\theta) = Q(\theta) = \mathrm{diag}(q_1(\theta), \ldots, q_m(\theta)), \tag{2.10}$$
$$G^T A G = \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_m). \tag{2.11}$$

Then we have

$$\lim_{\theta \to 0} G(\theta) = G, \qquad \lim_{\theta \to 0} Q(\theta) = \Lambda. \tag{2.12}$$
The proof is given in appendix A.

Theorem 1. Considering equations 2.4 and 2.7, we have

$$\lim_{\varepsilon \to 0} W(\varepsilon) = W, \qquad \lim_{\varepsilon \to 0} D(\varepsilon) = D. \tag{2.13}$$

Furthermore,

$$\lim_{\varepsilon \to 0} \mathrm{cfn}(\varepsilon, j) = \mathrm{cf}(j), \tag{2.14}$$

where cf(j) and cfn(ε, j) are defined in equations 2.5 and 2.8, respectively.
The proof is given in appendix B. From theorem 1, we can see that the CSP feature is robust to additive noise to some degree.
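To make the construction of section 2.1 concrete, the steps of equations 2.1 to 2.5 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function name, the use of `eigh` for the two diagonalizations, and the default l1 = l2 = 2 are my own choices.

```python
import numpy as np

def csp_features(trials_c1, trials_c2, all_trials, l1=2, l2=2):
    """Sketch of CSP extraction (eqs. 2.1-2.5).
    trials_c1, trials_c2: lists of m-by-k2 EEG matrices for the two
    training classes; all_trials: the trials to featurize."""
    def norm_cov(trials):
        # Eq. 2.1: sum of trace-normalized covariance matrices.
        return sum(E @ E.T / np.trace(E @ E.T) for E in trials)

    S1, S2 = norm_cov(trials_c1), norm_cov(trials_c2)
    # Eq. 2.2: eigendecomposition of S1 + S2, eigenvalues in decreasing order.
    p, V = np.linalg.eigh(S1 + S2)
    order = np.argsort(p)[::-1]
    p, V = p[order], V[:, order]
    V = V * np.where(V[0] < 0, -1.0, 1.0)   # sign convention: first row of V nonnegative
    U = np.diag(p ** -0.5) @ V.T            # whitening, so that U (S1+S2) U^T = I
    # Eq. 2.3: diagonalize R1 = U S1 U^T; R2 then shares the eigenvectors.
    d, Z = np.linalg.eigh(U @ S1 @ U.T)
    order = np.argsort(d)[::-1]
    Z = Z[:, order]
    Z = Z * np.where(Z[0] < 0, -1.0, 1.0)
    W = Z.T @ U                              # eq. 2.4
    Wbar = np.vstack([W[:l1], W[-l2:]])      # first l1 and last l2 rows of W
    # Eq. 2.5: CSP feature vector of each trial.
    return [np.diag(Wbar @ (E @ E.T / np.trace(E @ E.T)) @ Wbar.T)
            for E in all_trials]
```

Each feature vector has l1 + l2 nonnegative entries, one per retained spatial filter.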
3 An Extended EM Algorithm

In this section, an extended EM algorithm is proposed for joint CSP feature extraction and classification. We also discuss how the algorithm is used for both semisupervised and unsupervised learning for online BCI systems. Finally, we present several results on the convergence analysis of the iterative algorithm. Before introducing our algorithm, we present a simplified version of a standard EM algorithm, which can be found in Xu and Jordan (1996). Suppose a gaussian mixture probabilistic model as follows,
$$P(x|\Theta) = \sum_{q=1}^{2} \alpha_q P\big(x \,\big|\, m^{(q)}, \mathrm{Var}^{(q)}\big),$$

$$P\big(x \,\big|\, m^{(q)}, \mathrm{Var}^{(q)}\big) = \frac{\exp\left(-\frac{1}{2}\big(x - m^{(q)}\big)^T \big(\mathrm{Var}^{(q)}\big)^{-1} \big(x - m^{(q)}\big)\right)}{(2\pi)^{L/2}\, \big|\mathrm{Var}^{(q)}\big|^{1/2}}, \tag{3.1}$$
where $x \in R^L$, and the parameter vector $\Theta$ consists of the mixing proportions $\alpha_q$, the mean vectors $m^{(q)}$, and the covariance matrices $\mathrm{Var}^{(q)}$, q = 1, 2. Assuming $\alpha_q = \frac{1}{2}$, the EM algorithm can be expressed as

$$h_k^{(q)}(t) = \frac{P\big(x(t) \,\big|\, m_k^{(q)}, \mathrm{Var}_k^{(q)}\big)}{\sum_{i=1}^{2} P\big(x(t) \,\big|\, m_k^{(i)}, \mathrm{Var}_k^{(i)}\big)},$$

$$m_{k+1}^{(q)} = \frac{\sum_{t=1}^{N} h_k^{(q)}(t)\, x(t)}{\sum_{t=1}^{N} h_k^{(q)}(t)},$$

$$\mathrm{Var}_{k+1}^{(q)} = \frac{\sum_{t=1}^{N} h_k^{(q)}(t)\, \big(x(t) - m_k^{(q)}\big)\big(x(t) - m_k^{(q)}\big)^T}{\sum_{t=1}^{N} h_k^{(q)}(t)}. \tag{3.2}$$
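A minimal sketch of one pass of the updates in equation 3.2, assuming as above equal mixing proportions and two classes (the function name and NumPy formulation are mine, not the authors'):

```python
import numpy as np

def em_step(X, means, covs):
    """One iteration of the standard EM updates of eq. 3.2 (sketch).
    X: N-by-L data; means, covs: current m_k^(q) and Var_k^(q), q = 1, 2."""
    N, L = X.shape
    dens = []
    for m, C in zip(means, covs):
        diff = X - m
        quad = np.einsum('ti,ij,tj->t', diff, np.linalg.inv(C), diff)
        # Gaussian density P(x(t) | m^(q), Var^(q)) of eq. 3.1.
        dens.append(np.exp(-0.5 * quad) /
                    np.sqrt((2 * np.pi) ** L * np.linalg.det(C)))
    dens = np.array(dens)
    h = dens / dens.sum(axis=0, keepdims=True)   # posteriors h_k^(q)(t)
    new_means, new_covs = [], []
    for q, m_old in enumerate(means):
        w = h[q]
        new_means.append((w[:, None] * X).sum(0) / w.sum())
        d = X - m_old          # eq. 3.2 uses the previous mean m_k^(q) here
        new_covs.append((w[:, None] * d).T @ d / w.sum())
    return h, new_means, new_covs
```

Iterating `em_step` on well-separated data drives the posteriors toward 0 or 1, which is exactly the regime in which equation 3.3 below applies.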
If we further assume that the two gaussian distributions are well separated, such that the posterior probabilities $h_k^{(q)}(t) \approx 1$ or $h_k^{(q)}(t) \approx 0$, then the above
iterative algorithm becomes

$$h_k^{(q)}(t) = \frac{P\big(x(t) \,\big|\, m_k^{(q)}, \mathrm{Var}_k^{(q)}\big)}{\sum_{i=1}^{2} P\big(x(t) \,\big|\, m_k^{(i)}, \mathrm{Var}_k^{(i)}\big)},$$

$$m_{k+1}^{(q)} = \frac{\sum_{t=1}^{N_k^{(q)}} x^{(q)}(t)}{N_k^{(q)}}, \qquad
\mathrm{Var}_{k+1}^{(q)} = \frac{\sum_{t=1}^{N_k^{(q)}} \big(x^{(q)}(t) - m_k^{(q)}\big)\big(x^{(q)}(t) - m_k^{(q)}\big)^T}{N_k^{(q)}}, \tag{3.3}$$
where $x^{(q)}(t)$ belongs to the qth class, $N_k^{(q)}$ is the number of samples belonging to the qth class in the kth iteration, and q = 1, 2. Hereafter, equation 3.3 refers to a standard EM iteration.

3.1 Algorithm Steps. We first present a version of our extended EM algorithm for semisupervised learning. In the next section, we show how it is extended to the unsupervised learning case. This is an iterative algorithm in which a naive Bayes classifier is used. In each iteration, we need to update the trial labels of the test set, the training data set, the CSP feature vectors of the initial training set and test set, and the parameters of the Bayes classifier (mean vectors and covariance matrices).

Algorithm 1

Step 1: Initial step. Denote D0 as the initial training data set. First, we train a CSP transformation matrix based on D0 and extract CSP features for the initial training data set and test data set. Using the CSP features of the initial training data set, we then calculate two mean vectors and two covariance matrices for the two classes as initial parameters of a naive Bayes classifier.

Step 2: The kth iteration (k = 1, ...) follows steps 2.1 to 2.6.

Step 2.1: With the CSP feature vectors extracted in the (k − 1)th iteration, perform K0 standard EM iterations of equation 3.3, where K0 is a predefined positive integer.

Step 2.2: According to the posterior probabilities obtained in the K0th standard EM iteration, we perform a classification on the test set (containing N2 trials). The predicted labels are denoted [Label_k(1), ..., Label_k(N2)].

Step 2.3: Update the training data set. Select Int(αN2) trials from the test set that have the highest posterior probabilities for retraining, where α ∈ (0, 1] is a predetermined percentage and Int(αN2) denotes the largest integer smaller than αN2. The selected trials along with their predicted
labels are put together with the initial training data set D0 to form an updated training data set, denoted Dk, which has N1 + Int(αN2) trials. Notice that the test set remains unchanged.

Step 2.4: Feature reextraction. Using the training data Dk, regenerate the CSP transformation matrix, and then reextract CSP features for all trials. The CSP feature vector of the ith trial is denoted cf_k(i) = [cf_k(1, i), ..., cf_k(L, i)]^T, where k refers to the kth iteration, L is the dimension of the feature vector, and i = 1, ..., N1 + N2.

Step 2.5: Calculate the mean vectors and covariance matrices of the two classes as new parameters of the Bayes classifier, using the reextracted features of the training data set Dk along with the predicted labels.

Step 2.6: Count the number of trials in the test set whose predicted labels differ between the current and previous iterations,

$$dl_{k-1} = \sum_{i=1}^{N_2} \big|\mathrm{Label}_k(i) - \mathrm{Label}_{k-1}(i)\big|. \tag{3.4}$$
Step 3: Termination step. Given a predetermined positive integer M0, if dl_{k−1} < M0, the algorithm stops after the kth iteration, and the predicted labels [Label_k(1), ..., Label_k(N2)] of the test set are the final results. Otherwise, perform the (k + 1)th iteration.

Note that three parameters need to be preset: α (the percentage of the test set used for retraining), M0 (the tolerated number of test trials with inconsistent labels in two consecutive iterations), and K0 (the number of standard EM iterations). These parameters are generally set based on empirical evaluation. According to our extensive simulations, α and M0 can be set to 80% and 0.05 N2, respectively; K0 can be chosen from {1, 2, 3} (K0 = 3 in this article). Small adjustments can be made to these parameters according to the convergence behavior of the algorithm. The rule of thumb is that if the algorithm converges smoothly after, for example, 20 iterations, we deduce that the parameter settings are reasonable. The fundamental reason for the CSP feature reextraction is that the initial training data set is too small to give a reliable estimate of the CSP transformation matrix in semisupervised and unsupervised learning. We can make use of the test set along with the predicted labels to augment the training set and improve the efficiency of feature extraction. Obviously, prediction errors in the labels exist. Trials with incorrect labels are treated as noise in the estimation of the CSP transformation matrix, so a necessary condition is that the CSP feature be fairly robust to noise. As analyzed in the previous section, the CSP feature is indeed robust to noise to some extent. Furthermore, the higher the prediction accuracy rate (i.e., the smaller the noise), the better
the CSP feature quality is. Usually the prediction accuracy of labels for the test set is not high initially; thus, the extracted CSP features do not have sufficient consistency, and we need to update the CSP features during later iterations. As will be seen in our experimental data analysis, the CSP feature reextraction can actually improve the Fisher ratio between the two sets of CSP features corresponding to the two classes. A higher Fisher ratio implies higher consistency of features and higher classification efficiency. The CSP feature reextraction is also motivated by the dynamic characteristics of EEG signals. Even if there is a sufficient training data set, the parameters related to feature extraction of a real-time BCI system should be adjusted when necessary. In the above iterations, the CSP transformation matrix and classifier parameters are kept updated without new system training. This method can be used to improve the adaptability of a BCI system.

Remark 2. (i) Before applying algorithm 1 to an EEG data set, we need to perform data preprocessing, including common average reference (CAR) spatial filtering, frequency filtering, and channel selection (see section 4). From our experience, CAR spatial filtering based on all available channels can improve the accuracy rate. This may be due to denoising. (ii) There are two differences between a standard EM algorithm and the proposed extended EM algorithm. First, our extended EM algorithm is embedded with a CSP feature reextraction. Second, in our extended EM algorithm, only some of the test data, selected according to their classification probabilities, are used for retraining. Our experimental analysis results will show that these two extensions over the standard EM algorithm can improve prediction accuracy significantly.

3.2 Unsupervised and Semisupervised Learning for Online BCI Systems.
In the previous section, we presented the extended EM algorithm (algorithm 1) for the semisupervised case. In this section, we first discuss how algorithm 1 can be applied in the unsupervised case. This is followed by a brief discussion of how this algorithm can be used to improve the adaptability of online BCI systems. Unsupervised learning for a BCI system implies that there is no training phase, that is, no prior training data are available. In the previous section, algorithm 1 was presented for the semisupervised case, but it can be easily extended to the unsupervised case. For off-line data analysis, since an initial training data set is unavailable, we first assign random labels to all the testing trials. The random labels and testing data are then used as the initial training data set. Next, we start the extended EM iterations. In each iteration, the training data set is updated by selecting some testing trials with labels that are predicted in the previous iteration. For the online case,
the initial parameter values of a BCI system can be set to defaults (or to those set previously). The EEG data and real labels are stored while the system is working. When the stored data are sufficient, we use the extended EM algorithm to learn the parameters of the BCI system and update the old ones. In the semisupervised learning case, there is usually a short training phase. The small number of training trials is used as the initial training data for the extended EM algorithm. The initial parameters of the BCI system can be determined based on the small training data set and updated online, as in the unsupervised case. There are two ways to set the initial data set for the learning of the parameters: it can include only the small labeled training set, or both the small labeled training set and the test set with the labels obtained online. For both the unsupervised case and the semisupervised case, the labels of the data obtained online may not reflect the user's true intents, because of inappropriate BCI system parameters or some other reason. Errors may exist in real-time labels. Thus, we use these labels only as initial values in the online case and update them iteratively in our algorithm (note that in the semisupervised case, the test data set with online labels can also be used as part of the initial training data set). After the parameters of a BCI system (the CSP transformation matrix and classifier parameters) are determined by several iterations, the extracted CSP features have sufficient consistency for classification. Furthermore, the labels obtained online may then well reflect the user's true intents. However, this consistency of features may exist only in the short term. When an online BCI system has been used for extended periods, the subject's brain state may change significantly. If so, the consistency (or quality) of the CSP features and the classification accuracy will deteriorate.
In this case, the system parameters need to be adjusted to keep or recover the consistency of the CSP features. Note that the system parameters are updated during the iterations of algorithm 1. Once the iterations terminate, these parameters are determined accordingly. Similar to how the system parameters are determined when the system starts working initially, the proposed extended EM algorithm can also be used to adjust the system parameters online to improve the adaptability of a BCI system. We now present a simulation example to explain why our algorithm is effective even for the unsupervised case. Example 1: We generated two artificial data sets, each containing 500 3-by-100 random matrices. Note that each random matrix corresponds to a single trial of EEG data. Each data set is equally divided into two parts corresponding to two classes. The two parts of the first data set are drawn from two uniform distributions with means of 0.5 and 1, respectively; the two parts of the second data set are drawn from two gaussian distributions with means of 0 and 0.5, respectively.
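The relabeling loop of example 1 can be sketched as follows. This is a hypothetical, stripped-down version: the actual algorithm reextracts CSP features and runs EM classification in each pass, whereas here a simple mean-power feature and nearest-class-mean classification stand in so the loop stays self-contained.

```python
import numpy as np

def iterative_relabel(trials, n_iter=20, rng=None):
    """Skeleton of the unsupervised loop of example 1 (a sketch; the
    one-dimensional mean-power feature is a stand-in for CSP features).
    trials: list of m-by-k matrices; returns predicted 0/1 labels."""
    rng = rng or np.random.default_rng(0)
    feats = np.array([np.mean(E ** 2) for E in trials])
    labels = rng.integers(0, 2, len(trials))      # random initial labels
    for _ in range(n_iter):
        # "Retrain": class means computed from the current labels.
        m0, m1 = feats[labels == 0].mean(), feats[labels == 1].mean()
        # "Classify": reassign each trial to the nearer class mean.
        new = (np.abs(feats - m1) < np.abs(feats - m0)).astype(int)
        if np.array_equal(new, labels):           # termination: labels stable
            break
        labels = new
    return labels
```

Even from random initial labels, the reassignment step immediately separates two well-spaced clusters, mirroring the behavior reported for the first artificial data set.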
Figure 1: Analysis results for two artificial data sets in the unsupervised case. The two rows correspond to the first and second data sets, respectively. Column 1: True CSP features with true labels. Column 2: CSP features with randomly given labels. Column 3: Features and labels obtained in the first iteration. Column 4: Features and labels obtained in the last iteration. The two circles in each subplot represent two class means.
Assuming that the labels of these data are unknown, we apply our algorithm to predict the labels for all these data matrices. In the first iteration, we randomly assign labels to the data. After extracting three-dimensional CSP features for each data matrix, we predict their labels. These predicted labels are used in the next iteration, and the cycle repeats itself. Feature reextraction and classification are executed in each iteration. Figure 1 shows our simulation results. The plots in the first and second rows correspond to the first and second data sets, respectively. From each data set, we first extract three-dimensional CSP features for all data matrices using the true labels. These features serve as ground truth for comparison. Each subplot in Figure 1 shows only the first 30 features with labels, and each data point is a two-dimensional vector composed of the first two entries of a feature vector. However, the prediction accuracy rates and class means reported below are obtained from all 500 data points. The two circles in each subplot represent the two class means. The two subplots in the first column show the true features with true labels for the two data sets, respectively. We can see that the features of the first data set are separable, while the features of the second data set overlap. The second column shows the CSP features derived from random labels. For the first data set, we can see that two separable clusters are obtained, even though feature extraction is based on these random labels. The third column depicts the labels obtained in the first classification, noting that the features are the same as those in the second column. The accuracy rates are 100% and 83.4% for the two data sets, respectively. Although the feature extraction and classification here are based on random labels, the prediction accuracy rates obtained are high, especially for the first data set.
This is because the features for each data set form two clusters that are somewhat separable. The fourth column shows the
2742
Y. Li and C. Guan
final results. For the first data set, all features and their labels (obtained in the last iteration) are identical to the true ones shown in the first subplot of the first row. For the second data set, the final accuracy rate is 94.4%, although the features differ from the true values shown in the first subplot of the second row. Additionally, if a standard EM algorithm (without feature reextraction) is applied to the features extracted from the second data set based on random labels, the classification accuracy is 89.1%. We also point out that if the semisupervised version of our algorithm is applied to the above two data sets, we obtain similar results; in that case, the first feature extraction and classification are based on the true labels of a small training set rather than on randomly given labels.

3.3 Convergence of the Extended EM Algorithm. Convergence, albeit to a local optimum, is an attractive property of the standard EM algorithm. As we pointed out in section 1, it is difficult to obtain high-quality CSP features when the training data are insufficient. If a standard EM algorithm is used for classification, CSP features with low consistency (i.e., a low Fisher ratio) may degrade or limit its performance. In algorithm 1, CSP feature reextraction is embedded in a standard EM algorithm. This may give rise to a convergence problem, which is discussed in this section. Xu and Jordan (1996) proved that in the EM iterations of equation 3.2, the update directions of the mean vectors and covariance matrices are the gradient directions of the log likelihood premultiplied by a positive definite matrix; that is, the log likelihood increases along each iteration direction. This guarantees the convergence of a standard EM algorithm. In this article, the standard EM algorithm in equation 3.2 has been extended for joint CSP feature extraction and classification.
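For reference, one iteration of the standard two-class Gaussian EM update that the algorithm builds on might be sketched as follows. This is a generic textbook EM step under the mixture model described here, not a transcription of the paper's equation 3.2, and all names are illustrative.

```python
import numpy as np

def em_step(F, pi, mu, Sigma):
    """One EM iteration for a two-class Gaussian mixture over features F.

    E-step: compute class posteriors for each feature vector.
    M-step: reestimate priors, means, and covariances from the posteriors.
    """
    n, d = F.shape
    resp = np.zeros((n, 2))
    for q in range(2):                                   # E-step
        diff = F - mu[q]
        inv = np.linalg.inv(Sigma[q])
        _, logdet = np.linalg.slogdet(Sigma[q])
        logp = -0.5 * (np.sum(diff @ inv * diff, axis=1)
                       + logdet + d * np.log(2 * np.pi))
        resp[:, q] = pi[q] * np.exp(logp)
    resp /= resp.sum(axis=1, keepdims=True)
    for q in range(2):                                   # M-step
        w = resp[:, q]
        pi[q] = w.mean()
        mu[q] = (w[:, None] * F).sum(axis=0) / w.sum()
        diff = F - mu[q]
        Sigma[q] = (w[:, None] * diff).T @ diff / w.sum() + 1e-6 * np.eye(d)
    return pi, mu, Sigma, resp

# Toy run: two well-separated Gaussian clusters in 2-D.
rng = np.random.default_rng(1)
F = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
pi = np.array([0.5, 0.5])
mu = np.array([F[np.argmin(F[:, 0])], F[np.argmax(F[:, 0])]])  # crude init
Sigma = np.array([np.eye(2), np.eye(2)])
for _ in range(20):
    pi, mu, Sigma, resp = em_step(F, pi, mu, Sigma)
pred = resp.argmax(axis=1)
```

The extended algorithm differs from this loop only in that the feature matrix `F` itself is recomputed (by CSP reextraction) before each such step, which is what complicates the convergence argument below.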
Since the feature vectors are updated in each iteration, the log likelihood may not increase as the iterations proceed. However, in our experimental data analysis, the extended EM algorithm still converges. In the following, we analyze the convergence of the extended EM algorithm, considering only the unsupervised case, in which all the test data are used in retraining. Suppose that the labels of the test data are known initially. Using these labels, we extract the CSP features of the test data by jointly diagonalizing the two matrices $\Sigma^{(1)}$ and $\Sigma^{(2)}$ (see equations 2.1 to 2.5). In the following, these CSP features, denoted $\mathbf{cf}^{(q)}(i)$ ($q = 1, 2$), are treated as the true features, which are not affected by prediction error. We now analyze the average error between the true CSP features $\mathbf{cf}^{(q)}(i)$ and those extracted during the iterations of our algorithm. In this article, noise comes from the classification error in each iteration. For the $k$th iteration, we denote by $\Sigma_k^{(q)}$ the normalized correlation matrices corresponding to the two classes (calculated as in equation 2.1) and denote the extracted CSP feature vector as $\mathbf{cf}_k^{(q)}(i) = [cf_k^{(q)}(1, i), \ldots, cf_k^{(q)}(L, i)]^T$, where $q$ ($= 1, 2$) refers to the $q$th class and $i$ is
the trial index. We have the following theorem with respect to the mean feature vectors.

Theorem 2.
$$\big\| \mathrm{mean}\,\mathbf{cf}_k^{(q)} - \mathrm{mean}\,\mathbf{cf}^{(q)} \big\|_2 \le L \, \big\| \Sigma_k^{(q)} - \Sigma^{(q)} \big\|_2, \quad q = 1, 2. \tag{3.5}$$

The proof is given in appendix C. Let us recall algorithm 1. In the $k$th iteration, suppose that the prediction accuracy is $\mathrm{rate}_k$. From theorem 2, if $\mathrm{rate}_{k+1} > \mathrm{rate}_k$, that is, the number of misclassified trials decreases, then the error matrices in equation 2.6 become smaller. Thus $\big\| \Sigma_{k+1}^{(q)} - \Sigma^{(q)} \big\|_2 < \big\| \Sigma_k^{(q)} - \Sigma^{(q)} \big\|_2$ ($q = 1, 2$); that is, the bounds
on $\big\| \mathrm{mean}(\mathbf{cf}_k^{(q)}) - \mathrm{mean}(\mathbf{cf}^{(q)}) \big\|_2$ should decrease. Note that before the $K_0$ standard EM iterations in the $(k+1)$th iteration, the classification accuracy rate is $\mathrm{rate}_k$, while the accuracy rate $\mathrm{rate}_{k+1}$ is obtained by the $K_0$ standard EM iterations. Owing to the performance of the standard EM algorithm, $\mathrm{rate}_{k+1} > \mathrm{rate}_k$ in general. If the improvement of the accuracy rate between two successive iterations is sufficiently large, the bounds in equation 3.5 decrease greatly, which may make the errors $\big\| \mathrm{mean}(\mathbf{cf}_k^{(q)}) - \mathrm{mean}(\mathbf{cf}^{(q)}) \big\|_2$ decrease. This monotonically decreasing phenomenon will be seen in our real data analysis in section 4, where we provide an explanation. Furthermore, if the prediction rate approaches 1, then $\mathrm{mean}(\mathbf{cf}_k^{(q)})$ will approach $\mathrm{mean}(\mathbf{cf}^{(q)})$.

Remark 3. From theorem 1, although we can conclude that the CSP features corrupted by noise tend to the uncorrupted ones as the noise tends to zero, it is difficult to give an error bound for the CSP features with respect to the noise. The errors in equation 3.5 can be seen as average error bounds on the CSP features obtained in each iteration.

We now give the average error bounds for the variances of the CSP features. For the $k$th iteration, $\sigma_k^{(q)}(j)$ denotes the variance of $cf_k^{(q)}(j, \cdot)$, the $j$th element of a CSP feature vector belonging to the $q$th class; $M_k^{(q)}$ denotes the number of these feature vectors; $M = \max_{q,k}\{M_k^{(q)}\}$; and $\sigma^{(q)}(j)$ denotes the variance of $cf^{(q)}(j, \cdot)$. We have the following theorem.

Theorem 3. If the prediction accuracy $\mathrm{rate}_k$ in the $k$th iteration is sufficiently large, then
$$\Big| \big(\sigma_k^{(q)}(j)\big)^2 - \big(\sigma^{(q)}(j)\big)^2 \Big| < 2(M + 1) \big\| \Sigma_k^{(q)} - \Sigma^{(q)} \big\|_2 + (1 - \mathrm{rate}_k), \tag{3.6}$$
where $j = 1, \ldots, L$ and $q = 1, 2$.
A sketch of the proof of theorem 3 is given in appendix D. From theorem 3, we can see that if the prediction accuracy rate is sufficiently close to 1, then $\sigma_k^{(q)}(j)$ tends to $\sigma^{(q)}(j)$. Using equation 3.6, we can further estimate a bound on $\| \mathrm{Var}_k^{(q)} - \mathrm{Var}^{(q)} \|_2$, where $\mathrm{Var}_k^{(q)}$ and $\mathrm{Var}^{(q)}$ are the covariance matrices of $\{\mathbf{cf}_k^{(q)}(j)\}$ and $\{\mathbf{cf}^{(q)}(j)\}$, respectively. Owing to limited space, we omit this estimate.

4 Experimental Results

In this section, we evaluate our methods with the following data set: data set IVa of BCI Competition 2005, provided by K. R. Müller and B. Blankertz (Fraunhofer FIRST, Intelligent Data Analysis Group) and G. Curio (Neurophysics Group, Department of Neurology, Campus Benjamin Franklin of the Charité, University Medicine Berlin). This data set is provided for researchers to evaluate their algorithms' performance when only a small amount of labeled training data is available. (The description in the following paragraph is from http://ida.first.fraunhofer.de/projects/bci/competition_iii.) The data set was recorded from 118 scalp electrodes at a sampling rate of 1000 Hz from five healthy subjects. Subjects sat in a comfortable chair with their arms resting on armrests. The data set contains only data from the four initial sessions without feedback. Visual cues indicated for 3.5 s which of the following two motor imageries the subject should perform: (R) right hand, (F) right foot. The presentations of target cues were separated by periods of random length, 1.75 to 2.25 s, in which the subject could relax. There were two types of visual stimulation: (1) targets indicated by letters appearing behind a fixation cross (which might nevertheless induce small target-correlated eye movements) and (2) targets indicated by a randomly moving object (inducing target-uncorrelated eye movements).
For the second and fourth subjects ("al" and "aw"), two sessions of both types were recorded, while for the other three subjects ("aa," "av," and "ay"), three sessions of type 2 and one session of type 1 were recorded. The data were down-sampled from 1000 Hz to 100 Hz; we use this 100 Hz version in this article. Owing to limited space, we present detailed analysis results for only three subjects: aa, al, and ay. For convenience of analysis, we do not use the competition's splitting of the data sets. We run our extended EM algorithm using the first 150 trials of the data set for each subject. In the semisupervised case, we perform fivefold cross-validation in which only one fold of data (30 trials) with labels is used for the initial training data set; the other four folds (120 trials) without labels are used for the test data set and retraining. In the unsupervised case, we do not use any labels from the 150 trials. For further demonstration, we use the subsequent 80 trials as an independent test set. Note that in the competition, the numbers of trials of the five training
sets are 168 (subject aa), 224 (subject al), 84 (subject av), 56 (subject aw), and 28 (subject ay). Except for the fifth training set, these are much larger than those used in this article. In the following, we describe the preprocessing and then consider the cases of semisupervised and unsupervised learning.

4.1 Preprocessing. To ensure good performance, appropriate preprocessing is necessary. The preprocessing in this article includes CAR spatial filtering, frequency filtering, and channel selection. For every trial, we use data of duration 3.5 s for analysis. During this period, the cue was visible on the screen, so we have 350 samples for each trial. We thus obtain a 118 × 350 EEG data matrix, denoted $E_k$ for the $k$th trial. The whole EEG data matrix is $E = [E_1, \ldots, E_{N_1}, E_{N_1+1}, \ldots, E_{N_1+N_2}]$, where $[E_1, \ldots, E_{N_1}]$ denotes the training data set with $N_1$ trials and $[E_{N_1+1}, \ldots, E_{N_1+N_2}]$ denotes the test set with $N_2$ trials. Notice that in the unsupervised case, prior training data are unavailable; all the training data come from the test set. In the semisupervised case, the relatively small initial training data set contains 30 trials. The EEG data matrix $E$ is first preprocessed by a CAR spatial filter, and the resultant data matrix is denoted $\bar{E}$. This filter is useful in reducing some artifacts and noise. After the filtering, a spectral analysis of every EEG channel in the training data (every row of the training data matrix $[E_1, \ldots, E_{N_1}]$) is performed. To select a proper frequency band, we calculate the Fisher ratio at each frequency bin of the power spectra for each channel. Based on the Fisher ratios, we roughly determine the frequency band (typically in the mu or beta band). This frequency band may differ from subject to subject. In this article, we use only the signals in the mu band.
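The two preprocessing steps just described (CAR filtering, then a per-channel, per-bin Fisher ratio on the power spectra to locate a discriminative band) might be sketched as follows. The FFT periodogram and all sizes here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def car(E):
    """Common average reference: subtract the cross-channel mean per sample."""
    return E - E.mean(axis=0, keepdims=True)

def band_fisher_ratios(trials, labels):
    """Fisher ratio of the power spectrum at each (channel, frequency bin).

    trials: list of (channels x samples) arrays; labels: 0/1 per trial.
    Returns a (channels x bins) array; large values mark discriminative
    bins, from which a frequency band can be chosen.
    """
    P = np.array([np.abs(np.fft.rfft(car(E), axis=1)) ** 2 for E in trials])
    labels = np.asarray(labels)
    P0, P1 = P[labels == 0], P[labels == 1]
    num = (P0.mean(axis=0) - P1.mean(axis=0)) ** 2
    den = P0.var(axis=0) + P1.var(axis=0) + 1e-12
    return num / den

# Toy check: class 0 carries a 12 Hz rhythm on channel 0 (fs = 100 Hz).
rng = np.random.default_rng(2)
fs, n_samp = 100, 200
t = np.arange(n_samp) / fs
trials, labels = [], []
for i in range(40):
    q = i % 2
    E = rng.standard_normal((3, n_samp))
    if q == 0:
        E[0] += 2.0 * np.sin(2 * np.pi * 12 * t)
    trials.append(E)
    labels.append(q)
fr = band_fisher_ratios(trials, labels)
freqs = np.fft.rfftfreq(n_samp, d=1 / fs)   # maps bin index to Hz
```

In this toy setup the largest Fisher ratio falls on the 12 Hz bin, which is how a mu-band window and its channels would be shortlisted before CSP.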
The selected frequency bands for the three subjects in our study are 12 Hz to 14 Hz for subjects aa and al and 9 Hz to 13 Hz for subject ay. After determining the frequency band, we further roughly select EEG channels (generally in the sensorimotor area) that exhibit relatively high Fisher ratios in the determined frequency band. The number of selected channels is denoted $N_0$. Note that the above frequency band and channel selection are based on the small training data set in the semisupervised case. In the unsupervised case, we first use default settings for the mu frequency band (e.g., 10–14 Hz) and channels (those in the sensorimotor area); adjustments are then made according to the classification results obtained in the algorithm iterations.

4.2 Semisupervised Learning Case. In this section, we apply the extended EM algorithm to the semisupervised learning case. As an example, we first describe our analysis procedure and results for subject aa. As stated at the beginning of section 4, we use the first 150 trials with labels for cross-validation and the subsequent 80 trials as an independent test set. We equally divide the 150 trials into five folds
according to their sequential order. To evaluate the effect of a small training set, we use one fold for the initial training data set. The other four folds, which are used for learning/retraining and testing, are called the learning test set to distinguish them from the independent test set; the independent test set further demonstrates the validity of our algorithm. The process is formulated in this way: in each iteration, besides performing all the tasks stated in our extended EM algorithm, we also extract the CSP features of the independent test set, predict their labels, and calculate the corresponding prediction accuracy rate. We have two different percentage settings for our extended EM algorithm: (1) 80% of the learning test set is used for retraining and (2) 100% of the learning test set is used for retraining. In each iteration, we calculate the prediction accuracy rates $\mathrm{accuracy}(i, k, j)$ for the learning test set and $\mathrm{accuracy}_I(i, k, j)$ for the independent test set, where $i = 1, 2$ refers to the percentage settings 80% and 100%, respectively, $k$ represents the $k$th iteration, and $j$ ($= 1, \ldots, 5$) represents the $j$th fold used for the initial training set. The average accuracy rates over all folds are calculated as
$$\mathrm{rate}(i, k) = \frac{1}{5} \sum_{j=1}^{5} \mathrm{accuracy}(i, k, j), \tag{4.1}$$
$$\mathrm{rate}_I(i, k) = \frac{1}{5} \sum_{j=1}^{5} \mathrm{accuracy}_I(i, k, j), \tag{4.2}$$
where $i = 1, 2$ and $k = 1, \ldots, 9$ for subject aa. For comparison, we calculate all these accuracy rates as in equation 4.1 except that CSP feature reextraction is skipped during the retraining; this corresponds to the performance of a standard EM algorithm. The resulting average accuracy rates for the learning test set are denoted $\overline{\mathrm{rate}}(i, k)$ ($i = 1, 2$, $k = 1, \ldots, 9$). From the comparison, we can observe how feature reextraction contributes to accuracy. The above analysis results are shown in the first row of Figure 2. In the first subplot, $\mathrm{rate}(1, k)$ is depicted as a solid line with asterisks and $\mathrm{rate}(2, k)$ as a solid line with circles; similarly, $\overline{\mathrm{rate}}(1, k)$ and $\overline{\mathrm{rate}}(2, k)$ are depicted as dotted lines with asterisks and circles, respectively. Note that these results are obtained from the learning test set. In the second subplot, the accuracy rates $\mathrm{rate}_I(1, k)$ and $\mathrm{rate}_I(2, k)$ for the independent test set are depicted as solid lines with asterisks and circles, respectively. Next, we present our analysis of the convergence of our algorithm, considering only the case in which 80% of the learning test set is used for retraining. We define $\mathrm{mean}_q(k, j)$ and $\mathrm{var}_q(k, j)$ as the mean vectors and variance matrices of the Bayes classifier, where $q = 1, 2$ are class indices, $k = 1, \ldots, 9$ are iteration indices, and $j = 1, \ldots, 5$ refers to the $j$th
Figure 2: Analysis results in the semisupervised case. The first, second, and third rows are for subjects aa, al, and ay, respectively. The first column shows prediction accuracy rates for the learning test set obtained by our algorithm (solid lines) and the standard EM algorithm (dotted lines), where the lines with asterisks and circles refer to the percentage settings of 80% and 100%, respectively. The second column shows prediction accuracy rates obtained by our algorithm for the independent test set, where the lines with asterisks and circles refer to the percentage settings of 80% and 100%, respectively. The third column depicts the curves of label convergence index ml in equation 4.5. The fourth column shows the curves of average errors for the mean (solid lines) and covariance (dotted lines) of the classifier.
fold used for the initial training data set. We compute the average difference of $\mathrm{mean}_q(k, j)$ and $\mathrm{var}_q(k, j)$ between two successive iterations over the five folds and two classes as follows:
$$\mathrm{me}(k) = \frac{1}{2} \cdot \frac{1}{5} \sum_{q=1}^{2} \sum_{j=1}^{5} \big\| \mathrm{mean}_q(k, j) - \mathrm{mean}_q(k+1, j) \big\|_2, \tag{4.3}$$
$$\mathrm{mv}(k) = \frac{1}{2} \cdot \frac{1}{M} \sum_{q=1}^{2} \sum_{j=1}^{5} \big\| \mathrm{var}_q(k, j) - \mathrm{var}_q(k+1, j) \big\|, \tag{4.4}$$
where $k = 1, \ldots, 8$ for subject aa. In equations 4.3 and 4.4, $\| \cdot \|$ denotes the Frobenius norm of a vector or matrix. We can also observe the convergence from the consistency of the labels for the learning test set predicted in two successive iterations. Let $\mathrm{Label}(k, j, \cdot)$ be the label vector predicted in the $k$th iteration, where $j$ ($= 1, \ldots, 5$) refers to the $j$th fold used for the training data set. We calculate the average number of differing labels between two successive iterations over the five folds, which we call the label convergence index,
$$\mathrm{ml}(k) = \frac{1}{5} \sum_{j=1}^{5} \sum_{n=1}^{N_2} \big| \mathrm{Label}(k, j, n) - \mathrm{Label}(k+1, j, n) \big|, \tag{4.5}$$
where $k = 1, \ldots, 8$ and $N_2$ is the number of predicted labels (i.e., the number of testing trials). $\mathrm{ml}$ is shown in the third subplot of the first row of Figure 2, and $\mathrm{me}$ and $\mathrm{mv}$ are shown in the fourth subplot as the solid and dotted lines, respectively. The iterative curves in these two subplots illustrate the convergence of our algorithm. For subjects al and ay, we performed a similar analysis; the corresponding results are presented in the second and third rows of Figure 2, respectively. Note that the numbers of iterations for subjects al and ay are 8 and 4, respectively.

Remark 4. Besides the above three subjects, we also applied our algorithm to the data sets from the other two subjects (aw and av). Under the same settings as above, the final average prediction accuracy rates of the learning test sets are 91.2% and 76.4% for subjects aw and av, respectively, while the corresponding final average accuracy rates of the independent test sets are 88.8% and 75.3%. The result for subject av is less satisfactory than those for the other subjects; the same held for the results obtained by the winner of BCI Competition 2005. We attribute this to the quality of the data.

4.3 Unsupervised Learning Case. In this section, we consider the unsupervised learning case. In this article, unsupervised learning for a BCI system implies that there are no initial training data. Our extended EM algorithm can be used in the unsupervised learning case as stated in section 3.2. Owing to limited space, we evaluate our algorithm using the data for subjects aa and al, although we have also obtained satisfactory results for the other data sets. For each of the two subjects, we use the first 150 trials as the learning test set and the subsequent 80 trials as the independent test set.
In the initialization step, we assign random labels to the learning test set, extract the CSP features of the learning test set according to these random labels, and set the initial values of the Bayes classifier
parameters. Next we apply the extended EM algorithm. In each iteration, we extract the CSP features for both the learning test set and the independent test set and predict their labels. We have two iteration settings: (1) 80% of the learning test trials are used for the training set in each iteration, and (2) 100% of the learning test trials are used for the training set in each iteration. First, we calculate the prediction accuracy rates $\mathrm{rate}(i, k)$ for the learning test set and $\mathrm{rate}_I(i, k)$ for the independent test set, where $i = 1, 2$ represents the percentage settings of 80% and 100%, respectively, and $k$ refers to the $k$th iteration. Next, we calculate the number of differing labels for the learning test set between two successive iterations, which is the indicator for terminating the iteration,
$$dL_i(k) = \sum_{n=1}^{N_2} \big| \mathrm{Label}_i(k, n) - \mathrm{Label}_i(k+1, n) \big|, \tag{4.6}$$
where $\mathrm{Label}_i(k, \cdot)$ is the label vector predicted in the $k$th iteration under the $i$th ($i = 1, 2$) percentage setting, and $N_2$ is the number of predicted labels. We now consider the convergence of the algorithm. From theorems 2 and 3, we can conclude that the mean vectors and covariance matrices will tend to the true ones if the improvement of the prediction accuracy in each iteration is sufficiently large. We now demonstrate this conclusion with our data analysis results. For each of the two subjects, we use all 150 trials of the learning test set and their true labels to extract the CSP feature vectors $\mathbf{cf}^{(q)}(j)$, where $q = 1, 2$ is the class index (i.e., label) and $j$ refers to the $j$th trial of the $q$th class. The covariance matrix of $\{\mathbf{cf}^{(q)}(j)\}$ is denoted $\mathrm{Var}^{(q)}$. These CSP feature vectors, the mean vectors, and the class covariance matrices are treated as true ones unaffected by prediction error. We then calculate the average errors for the mean vectors and covariance matrices,
$$\mathrm{ME}_i(k) = \frac{1}{2} \sum_{q=1}^{2} \big\| \mathrm{mean}\,\mathbf{cf}_k^{(q)}(\cdot) - \mathrm{mean}\,\mathbf{cf}^{(q)}(\cdot) \big\|_2, \tag{4.7}$$
$$\mathrm{MV}_i(k) = \frac{1}{2} \sum_{q=1}^{2} \big\| \mathrm{Var}_k^{(q)} - \mathrm{Var}^{(q)} \big\|_2, \tag{4.8}$$
where $i$ refers to the $i$th percentage setting; the notations $\mathbf{cf}_k^{(q)}(j)$ (CSP feature vector) and $\mathrm{Var}_k^{(q)}$ (covariance matrix) are defined in section 3.3.
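The percentage setting (retrain only on a fraction of the trials) and the termination indicator of equation 4.6 are simple to state in code. The sketch below is illustrative: it takes the maximum class posterior as the confidence score for a trial, an assumption consistent with the discussion in section 4.4 but not spelled out as a formula in the text.

```python
import numpy as np

def select_confident(posteriors, frac=0.8):
    """Indices of the `frac` most confident trials (largest max posterior);
    only these would be kept for retraining in the next iteration."""
    conf = np.asarray(posteriors).max(axis=1)
    k = int(frac * len(conf))
    return np.argsort(conf)[::-1][:k]

def label_change(prev, curr):
    """Number of labels differing between two successive iterations
    (the termination indicator dL_i of equation 4.6, for 0/1 labels)."""
    return int(np.sum(np.asarray(prev) != np.asarray(curr)))
```

In use, the outer loop would call `select_confident` on the E-step posteriors before each retraining pass and stop once `label_change` drops to zero (or below a small threshold).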
Figure 3: Analysis results for subject aa in the unsupervised case. In the first row, the left subplot shows curves of prediction accuracy rates for the learning test set (solid lines) and the independent test set (dotted lines). Note that for all subplots in this figure, the percentage settings of 80% and 100% are represented by asterisks and circles, respectively. The middle and right subplots in the first row show the curves of average errors for mean (ME i (k) in equation 4.7) and covariance (MVi (k) in equation 4.8) of the classifier. The subplots in the second row show curves of label convergence index d L i (k) in equation 4.6 (left subplot), Fisher ratios obtained from the 150 trials of the learning test set by our algorithm (middle subplot), and Fisher ratios obtained from the same data set by the standard EM algorithm (right subplot).
To further demonstrate that feature reextraction in the iterations can improve the consistency of the features, we calculate the Fisher ratios,
$$\mathrm{FR}_i(k) = \frac{\big\| \mathrm{mean}\,\mathbf{cf}_k^{(1)}(\cdot) - \mathrm{mean}\,\mathbf{cf}_k^{(2)}(\cdot) \big\|}{\frac{1}{2} \big\| \mathrm{Var}_k^{(1)} + \mathrm{Var}_k^{(2)} \big\|}, \tag{4.9}$$
where $i$ refers to the $i$th percentage setting and $k$ to the $k$th iteration. For comparison, we perform several standard EM iterations (without feature reextraction) using the CSP features extracted in the third iteration of our algorithm and, similarly to equation 4.9, calculate the corresponding Fisher ratios, denoted $\overline{\mathrm{FR}}_i(k)$. Figure 3 shows the above analysis results for subject aa. In the first row, the left subplot shows the curves of average accuracy rates: $\mathrm{rate}(i, k)$ for the learning test set as solid lines and $\mathrm{rate}_I(i, k)$ for the independent test set as dotted lines. Note that for all subplots in this figure, the settings of 80% and 100% are represented by asterisks and circles, respectively. The average errors of the mean vectors $\mathrm{ME}_i(k)$ and covariance matrices $\mathrm{MV}_i(k)$ are shown
Figure 4: Analysis results for subject al in the unsupervised case. In the first row, the left subplot shows the curves of prediction accuracy rates for the learning test set (solid lines) and the independent test set (dotted lines). Note that for all subplots in this figure, the percentage settings of 80% and 100% are represented by asterisks and circles, respectively. The middle and right subplots in the first row show the curves of average errors for the mean (ME i (k) in equation 4.7) and covariance (MVi (k) in equation 4.8) of the classifier. The three subplots in the second row show the curves of label convergence index d L i (k) in equation 4.6 (left subplot), Fisher ratios obtained from the 150 trials of the learning test set by our algorithm (middle subplot), and Fisher ratios obtained from the same data set by standard EM algorithm (right subplot).
in the middle and right subplots, respectively. The iteration-terminating indicator $dL_i(k)$ and the Fisher ratios $\mathrm{FR}_i(k)$ and $\overline{\mathrm{FR}}_i(k)$, obtained from the 150 trials of the learning test set, are shown in the three subplots of the second row, respectively. We performed the same data analysis for subject al as for subject aa; the corresponding results are shown in Figure 4.

4.4 Discussion. In this section, we present our discussion based on the experimental analysis results shown in Figures 2, 3, and 4 for both the semisupervised and the unsupervised cases.

1. Our analysis results for the semisupervised case are shown in Figure 2, in which the three rows correspond to the three subjects' data. The first and second columns of Figure 2 display accuracy rate curves for the learning test set and the independent test set, respectively. Figures 3 and 4 present our analysis results for the unsupervised case for subjects aa and al. Accuracy rate curves for both the learning test set and the independent test set are shown in the first subplots of Figures 3 and 4.
From all the accuracy curves obtained by the extended EM algorithm under the percentage setting of 80% (solid lines with asterisks in the first two columns of Figure 2, lines with asterisks in the first subplots of Figures 3 and 4), we see that satisfactory prediction accuracy rates are obtained within several iterations of our proposed algorithm. The validity of our algorithm is thus demonstrated. This suggests that the extended EM algorithm presented in this article may be used in a BCI system when training data are insufficient or even unavailable.

2. From the analysis results shown in the three subplots of the first column of Figure 2, we see that the highest accuracy rates are obtained when the percentage of the learning test data used for retraining is 80% rather than 100%. In one iteration, the higher the posterior probability of a trial from the learning test set, the more confident its predicted label; thus, we do not use trials with very low posterior probabilities for retraining. Therefore, for the semisupervised case, an appropriate percentage setting can improve the performance of our algorithm. This phenomenon is also suggested by Figures 3 and 4 for the unsupervised case. We therefore recommend setting the percentage to less than 100%. One question is how this percentage can be determined. Through extensive experiments, we found that for percentages in a broad range (e.g., from 60% to 90%), the performance does not vary much. To choose a suitable value, we can start with an initial value, for example 80%; if the iterations converge smoothly, we keep this percentage, and otherwise we search for another one.

3. We now compare the accuracy rates obtained by our extended EM algorithm, which embeds a feature reextraction, with those obtained by the standard EM algorithm in equation 3.3.
From the three subplots of the first column of Figure 2, we can see that the solid curves $\mathrm{rate}(i, k)$ obtained with feature reextraction are higher than the corresponding dotted curves $\overline{\mathrm{rate}}(i, k)$. This means that feature reextraction can improve the classification performance in the semisupervised case. In the unsupervised case, since the initial labels are given randomly, it is theoretically necessary to reextract the CSP features during the iterations. For any two consecutive iterations, feature reextraction in the first iteration improves the feature consistency (expressed by the Fisher ratio; see the discussion below), which leads to a higher classification accuracy; the latter results in a further improvement of feature consistency after feature reextraction in the next iteration.

4. We consider iteration convergence in the semisupervised case. From all the subplots in the third and fourth columns of Figure 2, the parameters (mean vectors and variance matrices) of the Bayes classifiers and the predicted label vectors converge for all three subjects. This implies a satisfactory convergence property of our algorithm and also demonstrates the validity of our criterion for terminating the iterations. Convergence analysis for the semisupervised version of our algorithm proceeds similarly to theorems 2 and 3.
On iteration convergence in the unsupervised case, theorems 2 and 3 and their proofs (see the appendixes) tell us that the mean vectors and covariance matrices of the CSP feature vectors for both classes will tend to the true ones (unaffected by noise) if the improvement of the prediction accuracy in each iteration is sufficiently large. This is demonstrated in the second and third subplots of the first rows of Figures 3 and 4. In addition, the first subplots of the second rows of Figures 3 and 4 show the convergence of the predicted label vectors.

5. From the second subplots of the second rows of Figures 3 and 4, we find that the Fisher ratios between the two classes of CSP features improve significantly during the extended EM iterations. Comparing the Fisher ratio curves obtained by the standard EM iterations (third subplots of the second rows of Figures 3 and 4) with those in the second subplots shows that feature reextraction causes a significant improvement of the Fisher ratios. Thus, our method of retraining with feature reextraction can improve both the classification performance and the quality of the features.
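A scalar Fisher ratio of the kind tracked in these subplots can be computed from two sets of feature vectors as below. Collapsing the class covariances via their traces is our own simplification of equation 4.9, not necessarily the paper's exact formula.

```python
import numpy as np

def fisher_ratio(F0, F1):
    """Squared distance between class means divided by the average
    within-class variance (trace of the class covariance matrices)."""
    m0, m1 = F0.mean(axis=0), F1.mean(axis=0)
    num = np.sum((m0 - m1) ** 2)
    den = 0.5 * (np.trace(np.cov(F0.T)) + np.trace(np.cov(F1.T)))
    return num / den

# Well-separated clusters score much higher than heavily overlapping ones.
rng = np.random.default_rng(3)
sep = fisher_ratio(rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2)))
mix = fisher_ratio(rng.normal(0, 1, (100, 2)), rng.normal(0.3, 1, (100, 2)))
```

Tracking this quantity across iterations, as done in Figures 3 and 4, is what distinguishes features that merely get relabeled from features that actually become more consistent after reextraction.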
5 Conclusion

In this article, we present an extended EM algorithm that can be used for both semisupervised and unsupervised learning in BCI systems. The first objective is to reduce, or even skip entirely, the training phase; the second is to improve the adaptability of BCI systems. Two key problems are addressed here: the robustness of the CSP features to noise and the convergence of the algorithm. In our proposed algorithm, the labels predicted in each iteration are used for reextracting the CSP features. Since prediction error in the labels (treated as noise in this article) is inevitable, we need to consider the robustness of the CSP features to noise; according to our analysis, the features are somewhat robust to noise. Furthermore, during the iterations of the extended EM algorithm, if the prediction accuracy rates tend to one, then the reextracted CSP features tend to the true values, which are unaffected by prediction errors. It is well known that convergence plays a key role in making an iterative algorithm work. From our theoretical and experimental data analysis, our extended EM algorithm has a satisfactory convergence property. This is due to the convergence property of the standard EM algorithm and the robustness of the CSP features to noise. The main difference between our algorithm and a standard EM algorithm is that there is a feature reextraction in each iteration of our algorithm. When the initial training data set is small or empty, the CSP features extracted at the beginning have low consistency and thus are not reliable. According to our analysis results, feature reextraction can improve both the consistency (expressed by the Fisher ratio) of the CSP features and the classification accuracy.
Appendix A: Proof of Lemma 2

The second conclusion follows directly from lemma 1. We now prove the first conclusion. Note that equation 2.10 is equivalent to
$$(A + \theta)\, g_i(\theta) = q_i(\theta)\, g_i(\theta), \quad i = 1, \ldots, m, \tag{A.1}$$
where $g_i(\theta)$ is the $i$th column vector of $G(\theta)$.

Noting that $g_i^T(\theta) g_i(\theta) = 1$, the vector function $g_i(\theta)$ is bounded. Thus, $g_i(\theta)$ has convergent subsequences. Suppose that $\{g_i(\theta_j), j = 1, \ldots\}$ is a convergent subsequence of $\{g_i(\theta)\}$, that is, $\lim_{j \to \infty} \theta_j = 0$ and $\lim_{j \to \infty} g_i(\theta_j) = \bar{g}_i$. We first have
$$(A + \theta_j)\, g_i(\theta_j) = q_i(\theta_j)\, g_i(\theta_j). \tag{A.2}$$
Noting that $\lim_{j \to \infty} q_i(\theta_j) = \lambda_i$, we have
$$A \bar{g}_i = \lambda_i \bar{g}_i. \tag{A.3}$$
It follows from equation A.3 that $\bar{g}_i$ is an eigenvector of $A$ corresponding to $\lambda_i$. Since the $g_i(\theta_j)$ are normalized vectors with nonnegative first entries, $\|\bar{g}_i\|_2 = 1$ and the first entry of $\bar{g}_i$ is nonnegative. Because $A$ has $m$ different eigenvalues and the first entry of $g_i$ is nonnegative, $\bar{g}_i = g_i$. From the above analysis, any convergent subsequence of $g_i(\theta)$ tends to $g_i$; thus, $\lim_{\theta \to 0} g_i(\theta) = g_i$. Lemma 2 is proven.
Appendix B: Proof of Theorem 1

Reconsider the joint diagonalization procedure for the two noisy correlation matrices $\Sigma^{(1)} + \varepsilon_1$ and $\Sigma^{(2)} + \varepsilon_2$. We have

$$V^T(\varepsilon)\, \Sigma(\varepsilon)\, V(\varepsilon) = P(\varepsilon), \tag{B.1}$$

where $\Sigma(\varepsilon) = \Sigma^{(1)} + \varepsilon_1 + \Sigma^{(2)} + \varepsilon_2$ and $P(\varepsilon)$ is a diagonal matrix composed of the eigenvalues of $\Sigma(\varepsilon)$ in decreasing order. Set $U(\varepsilon) = (P(\varepsilon))^{-\frac{1}{2}}\, V^T(\varepsilon)$ and $R_1(\varepsilon) = U(\varepsilon)\left(\Sigma^{(1)} + \varepsilon_1\right) U^T(\varepsilon)$. Suppose that $Z(\varepsilon)$ is an orthogonal matrix such that

$$Z^T(\varepsilon)\, R_1(\varepsilon)\, Z(\varepsilon) = D(\varepsilon) = \operatorname{diag}(d_1(\varepsilon), \ldots, d_m(\varepsilon)). \tag{B.2}$$

Note that the first row vectors of the two orthogonal matrices $V(\varepsilon)$ and $Z(\varepsilon)$ above are set to be nonnegative.
An Extended EM Algorithm
2755
Define $W(\varepsilon) = Z^T(\varepsilon)\, U(\varepsilon)$; then we have equation 2.7. For practical data, we can say that $\Sigma$ and $R_1$ have $m$ different eigenvalues, respectively (with probability one). It follows from lemma 2 that

$$\lim_{\varepsilon \to 0} V(\varepsilon) = V, \qquad \lim_{\varepsilon \to 0} P(\varepsilon) = P. \tag{B.3}$$

Thus, we have

$$\lim_{\varepsilon \to 0} U(\varepsilon) = U, \qquad \lim_{\varepsilon \to 0} R_1(\varepsilon) = R_1, \qquad \lim_{\varepsilon \to 0} R_2(\varepsilon) = R_2. \tag{B.4}$$

From equation B.4 and lemma 2, $\lim_{\varepsilon \to 0} Z(\varepsilon) = Z$ and $\lim_{\varepsilon \to 0} D(\varepsilon) = D$. In view of the definition of $W(\varepsilon)$, $\lim_{\varepsilon \to 0} W(\varepsilon) = W$.

The second conclusion can be obtained directly from equation 2.13 and the definitions of the CSP features in equations 2.5 and 2.8. The theorem is thus proven.
Appendix C: Proof of Theorem 2

Note that the sum $\Sigma_k^{(1)} + \Sigma_k^{(2)}$ does not change in any iteration; that is, $\Sigma_k^{(1)} + \Sigma_k^{(2)} = \Sigma$, where $\Sigma$ is the same as in equation 2.2.

We denote $R_k^{(1)} = (P)^{-\frac{1}{2}}\, V^T \Sigma_k^{(1)} V\, (P)^{-\frac{1}{2}}$ and $R_k^{(2)} = (P)^{-\frac{1}{2}}\, V^T \Sigma_k^{(2)} V\, (P)^{-\frac{1}{2}}$, where the matrices $V$ and $P$ are defined in equation 2.2. We also denote by $M_k^{(q)}$ the number of trials belonging to the $q$th class in the $k$th iteration and by $M^{(q)}$ the true number of trials belonging to the $q$th class.

Suppose that $Z_k^T R_k^{(1)} Z_k = D_k$ (as in equation B.2), where $Z_k$ is an orthogonal matrix and $D_k$ is a diagonal matrix with its diagonal elements in decreasing order. Let $W_k = Z_k^T (P)^{-\frac{1}{2}} V^T$; then $W_k$ jointly diagonalizes the matrices $\Sigma_k^{(1)}$ and $\Sigma_k^{(2)}$. The submatrix $\bar W_k$ is constructed from the first $l_1$ rows and the last $l_2$ rows of $W_k$; it is then a CSP transformation matrix in the $k$th iteration (as in equation 2.5). By the definition of the CSP feature,

$$
\frac{1}{M_k^{(1)}} \sum_{i=1}^{M_k^{(1)}} cf_k^{(1)}(j, i)
= \frac{1}{M_k^{(1)}} \sum_{i=1}^{M_k^{(1)}} \bar w_k(j)\, \frac{S_i^{(1)} \left(S_i^{(1)}\right)^T}{\operatorname{trace}\!\left(S_i^{(1)} \left(S_i^{(1)}\right)^T\right)}\, \bar w_k^T(j)
= \bar w_k(j) \left[ \frac{1}{M_k^{(1)}} \sum_{i=1}^{M_k^{(1)}} \frac{S_i^{(1)} \left(S_i^{(1)}\right)^T}{\operatorname{trace}\!\left(S_i^{(1)} \left(S_i^{(1)}\right)^T\right)} \right] \bar w_k^T(j)
= \bar w_k(j)\, \Sigma_k^{(1)}\, \bar w_k^T(j) = d_k(n_j), \tag{C.1}
$$

where $\bar w_k(j)$ is the $j$th row vector of $\bar W_k$, which is assumed to be the $n_j$th row of $W_k$, and $d_k(n_j)$ is the $n_j$th eigenvalue of $R_k^{(1)}$ (i.e., the $n_j$th element of the diagonal of $D_k$). Similarly,

$$
\frac{1}{M^{(1)}} \sum_{i=1}^{M^{(1)}} cf^{(1)}(j, i) = \bar w(j)\, \Sigma^{(1)}\, \bar w^T(j) = d(n_j). \tag{C.2}
$$

It follows from equations C.1 and C.2 and lemma 1 that

$$
\left| \frac{1}{M_k^{(1)}} \sum_{i=1}^{M_k^{(1)}} cf_k^{(1)}(j, i) - \frac{1}{M^{(1)}} \sum_{i=1}^{M^{(1)}} cf^{(1)}(j, i) \right|
= \left| d_k(n_j) - d(n_j) \right|
\le \left\| \Sigma_k^{(1)} - \Sigma^{(1)} \right\|_2. \tag{C.3}
$$

Furthermore, we have

$$
\left\| \operatorname{mean}\!\left(\mathbf{cf}_k^{(1)}(\cdot)\right) - \operatorname{mean}\!\left(\mathbf{cf}^{(1)}(\cdot)\right) \right\|_2
= \left( \sum_{j=1}^{L} \left| \frac{1}{M_k^{(1)}} \sum_{i=1}^{M_k^{(1)}} cf_k^{(1)}(j, i) - \frac{1}{M^{(1)}} \sum_{i=1}^{M^{(1)}} cf^{(1)}(j, i) \right|^2 \right)^{\!\frac{1}{2}}
\le \sqrt{L}\, \left\| \Sigma_k^{(1)} - \Sigma^{(1)} \right\|_2. \tag{C.4}
$$

Similarly, we have the following conclusion for the second-class mean vector:

$$
\left\| \operatorname{mean}\!\left(\mathbf{cf}_k^{(2)}(\cdot)\right) - \operatorname{mean}\!\left(\mathbf{cf}^{(2)}(\cdot)\right) \right\|_2
\le \sqrt{L}\, \left\| \Sigma_k^{(2)} - \Sigma^{(2)} \right\|_2. \tag{C.5}
$$

Theorem 2 is proved.

Appendix D: Sketch of Proof of Theorem 3
As in the proof of theorem 2, we denote by $M_k^{(q)}$ the number of trials belonging to the $q$th class in the $k$th iteration and by $M^{(q)}$ the true number of trials belonging to the $q$th class, and we set $M = \max_{q,k} \{ M_k^{(q)} \}$, $m_k^{(q)}(j) = \operatorname{mean}(cf_k^{(q)}(j, \cdot))$, and $m^{(q)}(j) = \operatorname{mean}(cf^{(q)}(j, \cdot))$.
The variances of $cf_k^{(q)}(j, \cdot)$ and $cf^{(q)}(j, \cdot)$ are calculated as

$$
\left[ \sigma_k^{(q)}(j) \right]^2 = \frac{1}{M_k^{(q)}} \sum_{i=1}^{M_k^{(q)}} \left( cf_k^{(q)}(j, i) - m_k^{(q)}(j) \right)^2,
\qquad
\left[ \sigma^{(q)}(j) \right]^2 = \frac{1}{M^{(q)}} \sum_{i=1}^{M^{(q)}} \left( cf^{(q)}(j, i) - m^{(q)}(j) \right)^2, \tag{D.1}
$$

where $j = 1, \ldots, L$ and $q = 1, 2$.

Suppose that in the $k$th iteration there are $\bar M_k^{(1)}$ trials of the first class with correct labels. Since $\mathrm{rate}_k$ is sufficiently large, $\bar M_k^{(1)}$ is close to $M_k^{(1)}$ and to $M^{(1)}$. Thus, we have

$$
\frac{1}{\bar M_k^{(1)}} \sum_{i=1}^{\bar M_k^{(1)}} cf_k^{(1)}(j, i) \approx m_k^{(1)}(j),
\qquad
\frac{1}{\bar M_k^{(1)}} \sum_{i=1}^{\bar M_k^{(1)}} cf^{(1)}(j, i) \approx m^{(1)}(j). \tag{D.2}
$$
Without loss of generality, suppose that $m_k^{(1)}(j) > m^{(1)}(j)$. We have

$$
\left( \frac{1}{\bar M_k^{(1)}} \sum_{i=1}^{\bar M_k^{(1)}} cf_k^{(1)}(j, i) \right)^2
- \left( \frac{1}{\bar M_k^{(1)}} \sum_{i=1}^{\bar M_k^{(1)}} cf^{(1)}(j, i) \right)^2
= \frac{1}{\left(\bar M_k^{(1)}\right)^2} \sum_{i=1}^{\bar M_k^{(1)}} \left[ \left( cf_k^{(1)}(j, i) \right)^2 + cf_k^{(1)}(j, i) \sum_{l \ne i} cf_k^{(1)}(j, l) \right]
- \frac{1}{\left(\bar M_k^{(1)}\right)^2} \sum_{i=1}^{\bar M_k^{(1)}} \left[ \left( cf^{(1)}(j, i) \right)^2 + cf^{(1)}(j, i) \sum_{l \ne i} cf^{(1)}(j, l) \right] \tag{D.3}
$$

and

$$
\frac{1}{\bar M_k^{(1)}} \sum_{i=1}^{\bar M_k^{(1)}} cf_k^{(1)}(j, i) \sum_{l \ne i} cf_k^{(1)}(j, l)
- \frac{1}{\bar M_k^{(1)}} \sum_{i=1}^{\bar M_k^{(1)}} cf^{(1)}(j, i) \sum_{l \ne i} cf^{(1)}(j, l)
\approx \sum_{i=1}^{\bar M_k^{(1)}} cf_k^{(1)}(j, i)\, m_k^{(1)}(j) - \sum_{i=1}^{\bar M_k^{(1)}} cf^{(1)}(j, i)\, m^{(1)}(j)
\approx \bar M_k^{(1)} \left( m_k^{(1)}(j) \right)^2 - \bar M_k^{(1)} \left( m^{(1)}(j) \right)^2 > 0. \tag{D.4}
$$
It follows from equations D.3 and D.4 that

$$
\left( \frac{1}{\bar M_k^{(1)}} \sum_{i=1}^{\bar M_k^{(1)}} cf_k^{(1)}(j, i) \right)^2
- \left( \frac{1}{\bar M_k^{(1)}} \sum_{i=1}^{\bar M_k^{(1)}} cf^{(1)}(j, i) \right)^2
\ge \frac{1}{\bar M_k^{(1)}} \sum_{i=1}^{\bar M_k^{(1)}} \left( cf_k^{(1)}(j, i) \right)^2
- \frac{1}{\bar M_k^{(1)}} \sum_{i=1}^{\bar M_k^{(1)}} \left( cf^{(1)}(j, i) \right)^2. \tag{D.5}
$$
From equation D.1, we have, for $j = 1, \ldots, L$,

$$
\left[ \sigma_k^{(1)}(j) \right]^2 - \left[ \sigma^{(1)}(j) \right]^2
= \frac{1}{M_k^{(1)}} \sum_{i=1}^{M_k^{(1)}} \left( cf_k^{(1)}(j, i) - m_k^{(1)}(j) \right)^2
- \frac{1}{M^{(1)}} \sum_{i=1}^{M^{(1)}} \left( cf^{(1)}(j, i) - m^{(1)}(j) \right)^2
$$
$$
= \frac{1}{M_k^{(1)}} \sum_{i=1}^{M_k^{(1)}} \left( cf_k^{(1)}(j, i) \right)^2 - \left( m_k^{(1)}(j) \right)^2
- \frac{1}{M^{(1)}} \sum_{i=1}^{M^{(1)}} \left( cf^{(1)}(j, i) \right)^2 + \left( m^{(1)}(j) \right)^2
$$
$$
\le \frac{1}{\bar M_k^{(1)}} \sum_{i=1}^{\bar M_k^{(1)}} \left( cf_k^{(1)}(j, i) \right)^2
- \frac{1}{\bar M_k^{(1)}} \sum_{i=1}^{\bar M_k^{(1)}} \left( cf^{(1)}(j, i) \right)^2
+ \left| \frac{1}{M_k^{(1)}} \sum_{i=\bar M_k^{(1)}+1}^{M_k^{(1)}} \left( cf_k^{(1)}(j, i) \right)^2
- \frac{1}{M^{(1)}} \sum_{i=\bar M_k^{(1)}+1}^{M^{(1)}} \left( cf^{(1)}(j, i) \right)^2 \right|
+ \left| \left( m^{(1)}(j) \right)^2 - \left( m_k^{(1)}(j) \right)^2 \right|. \tag{D.6}
$$

In view of the fact that $cf_k^{(1)}(j, i) \le 1$,

$$
\left| \frac{1}{M_k^{(1)}} \sum_{i=\bar M_k^{(1)}+1}^{M_k^{(1)}} \left( cf_k^{(1)}(j, i) \right)^2
- \frac{1}{M^{(1)}} \sum_{i=\bar M_k^{(1)}+1}^{M^{(1)}} \left( cf^{(1)}(j, i) \right)^2 \right|
\le \max\left\{ \frac{M_k^{(1)} - \bar M_k^{(1)}}{M_k^{(1)}},\ \frac{M^{(1)} - \bar M_k^{(1)}}{M^{(1)}} \right\}
\approx 1 - \mathrm{rate}_k. \tag{D.7}
$$
In view of equations D.5, D.6, and D.7,

$$
\left[ \sigma_k^{(1)}(j) \right]^2 - \left[ \sigma^{(1)}(j) \right]^2
\le \frac{1}{\bar M_k^{(1)}} \sum_{i=1}^{\bar M_k^{(1)}} \left( cf_k^{(1)}(j, i) \right)^2
- \frac{1}{\bar M_k^{(1)}} \sum_{i=1}^{\bar M_k^{(1)}} \left( cf^{(1)}(j, i) \right)^2
+ 1 - \mathrm{rate}_k + \left| \left( m^{(1)}(j) \right)^2 - \left( m_k^{(1)}(j) \right)^2 \right|
$$
$$
\le \left( \frac{1}{\bar M_k^{(1)}} \sum_{i=1}^{\bar M_k^{(1)}} cf_k^{(1)}(j, i) \right)^2
- \left( \frac{1}{\bar M_k^{(1)}} \sum_{i=1}^{\bar M_k^{(1)}} cf^{(1)}(j, i) \right)^2
+ 1 - \mathrm{rate}_k + \left| \left( m^{(1)}(j) \right)^2 - \left( m_k^{(1)}(j) \right)^2 \right|
$$
$$
\le (M + 1) \left| \left( m_k^{(1)}(j) \right)^2 - \left( m^{(1)}(j) \right)^2 \right| + 1 - \mathrm{rate}_k
\le 2(M + 1) \left| m_k^{(1)}(j) - m^{(1)}(j) \right| + 1 - \mathrm{rate}_k
$$
$$
\le 2(M + 1) \left\| \Sigma_k^{(1)} - \Sigma^{(1)} \right\|_2 + 1 - \mathrm{rate}_k, \tag{D.8}
$$

where the last inequality follows from equation C.3. Similarly,

$$
\left| \left[ \sigma_k^{(2)}(j) \right]^2 - \left[ \sigma^{(2)}(j) \right]^2 \right|
< 2(M + 1) \left\| \Sigma_k^{(2)} - \Sigma^{(2)} \right\|_2 + 1 - \mathrm{rate}_k, \tag{D.9}
$$
where $j = 1, \ldots, L$. Thus we have the conclusion in theorem 3.

Acknowledgments

We are grateful to the anonymous reviewers for their insightful comments. We are also grateful to Chin Zheng Yang for his efforts to improve the presentation of this article.

References

Birbaumer, N., Ghanayim, N., Hinterberger, T., Iversen, I., Kotchoubey, B., Kübler, A., Perelmouter, J., Taub, E., & Flor, H. (1999). A spelling device for the paralysed. Nature, 398, 297–298.
Blanchard, G., & Blankertz, B. (2004). BCI competition 2003-data set IIa: Spatial patterns of self-controlled brain rhythm modulations. IEEE Transactions on Biomedical Engineering, 51(6), 1062–1066.
Chen, Y. P. (2000). Matrix theory. Xi'an, China: Northwest China University of Technology Publisher.
Donoghue, J. P. (2002). Connecting cortex to machines: Recent advances in brain interfaces. Nature Neuroscience Supplement, 5, 1085–1088.
Dornhege, G., Blankertz, B., Curio, G., & Müller, K. R. (2004). Boosting bit rates in non-invasive EEG single-trial classifications by feature combination and multi-class paradigms. IEEE Trans. Biomed. Eng., 51(6), 993–1002.
Grandvalet, Y., & Bengio, Y. (2004). Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Kübler, A., Kotchoubey, B., Kaiser, J., Wolpaw, J. R., & Birbaumer, N. (2001). Brain-computer communication: Unlocking the locked in. Psychol. Bull., 127(3), 358–375.
Lemm, S., Blankertz, B., Curio, G., & Müller, K. R. (2005). Spatio-spectral filters for improving the classification of single trial EEG. IEEE Transactions on Biomedical Engineering, 52(9), 1541–1548.
McEvoy, L. K., Smith, M. E., & Gevins, A. (2000). Test-retest reliability of task-related EEG. Clinical Neurophysiology, 111, 457–463.
Millán, J. R., & Mouriño, J. (2003). Asynchronous BCI and local neural classifiers: An overview of the Adaptive Brain Interface project. IEEE Trans. on Neural Systems and Rehabilitation Engineering, 11(2), 159–161.
Nigam, K., & Ghani, R. (2000). Analyzing the effectiveness and applicability of co-training. In Proceedings of the 9th International Conference on Information and Knowledge Management (pp. 86–93). McLean, VA.
Pfurtscheller, G., Neuper, C., Guger, C., Harkam, W., Ramoser, H., Schlögl, A., Obermaier, B., & Pregenzer, M. (2000). Current trends in Graz brain-computer interface research. IEEE Trans. on Rehabilitation Engineering, 8(2), 216–218.
Pfurtscheller, G., Neuper, C., Flotzinger, D., & Pregenzer, M. (1997). EEG-based discrimination between imagination of right and left hand movement. Electroencephalogr. Clin. Neurophysiol., 103, 642–651.
Polich, J. (2004). Neuropsychology of P3a and P3b: A theoretical overview. In N. C. Moore & K. Arikan (Eds.), Brainwaves and mind—Recent developments (pp. 15–29). Wheaton, IL: Kjellberg.
Ramoser, H., Müller-Gerking, J., & Pfurtscheller, G. (2000). Optimal spatial filtering of single trial EEG during imagined hand movement. IEEE Trans. on Rehabilitation Engineering, 8(4), 441–446.
Regan, D. (1989). Human brain electrophysiology: Evoked potentials and evoked magnetic fields in science and medicine. Dordrecht: Elsevier Science Publishing.
Vidaurre, C., Schlögl, A., Cabeza, R., Scherer, R., & Pfurtscheller, G. (2005). Adaptive on-line classification for EEG-based brain-computer interfaces with AAR parameters and band power estimates. Biomed. Tech. (Berl.), 50(11), 350–354.
Wolpaw, J. R., Birbaumer, N., McFarland, D. J., Pfurtscheller, G., & Vaughan, T. M. (2002). Brain-computer interfaces for communication and control. Clinical Neurophysiology, 113, 767–791.
Xu, L., & Jordan, M. I. (1996). On convergence properties of the EM algorithm for gaussian mixtures. Neural Computation, 8(1), 129–151.
Zhou, D., Bousquet, O., Lal, T. N., Weston, J., & Schölkopf, B. (2003). Learning with local and global consistency. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Received September 14, 2005; accepted February 25, 2006.
LETTER
Communicated by Eitan Greenshtein
On the Consistency of Bayesian Variable Selection for High Dimensional Binary Regression and Classification Wenxin Jiang [email protected] Department of Statistics, Northwestern University, Evanston, IL 60208, U.S.A.
Modern data mining and bioinformatics have presented an important playground for statistical learning techniques, where the number of input variables is possibly much larger than the sample size of the training data. In supervised learning, logistic regression or probit regression can be used to model a binary output and to form perceptron classification rules based on Bayesian inference. We use a prior to select a limited number of candidate variables to enter the model, applying a popular method with selection indicators. We show that this approach can induce posterior estimates of the regression functions that consistently estimate the truth, if the true regression model is sparse in the sense that the aggregated size of the regression coefficients is bounded. The estimated regression functions therefore can also produce consistent classifiers that are asymptotically optimal for predicting future binary outputs. These results provide theoretical justification for some recent empirical successes in microarray data analysis.

1 Introduction

Binary classification is an important situation of supervised learning, where the output $y$ is a 0/1-valued binary response and the input $x$ is a vector of explanatory variables (including 1 for introducing an intercept or bias term). A popular statistical approach, using the principle of generalized linear models (McCullagh & Nelder, 1989), is to assume that the probability of $y = 1$ is a monotone transform of a linear combination of $x$, such as $P(y = 1 \mid x) = e^{x^T \beta}/(1 + e^{x^T \beta})$ in logistic regression or $P(y = 1 \mid x) = \int_{-\infty}^{x^T \beta} (e^{-z^2/2}/\sqrt{2\pi})\, dz$ in probit regression. This leads to a linear classification rule such as $x^T \beta > 0$ for predicting a future output to be $y = 1$, as in perceptrons (Rosenblatt, 1962). Since the components of $x$ can include trigonometric functions, products, or powers of the original explanatory variables, the resulting classification rule can indeed be very general.
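As a concrete illustration of the two inverse links and the induced linear classification rule, consider the following minimal sketch (the function names are ours, not from the article):

```python
import math

def logistic(u):
    """Logistic inverse link: P(y = 1 | x) for u = x^T beta."""
    return 1.0 / (1.0 + math.exp(-u))

def probit(u):
    """Probit inverse link Phi(u), the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def classify(x, beta, link=logistic):
    """Plug-in perceptron rule: predict y = 1 iff P(y = 1 | x) > 1/2,
    which for either link is equivalent to x^T beta > 0."""
    u = sum(xj * bj for xj, bj in zip(x, beta))
    return int(link(u) > 0.5)
```

Both links are monotone with $\psi(0) = 1/2$, which is why thresholding $P(y = 1 \mid x)$ at $1/2$ reduces to the same linear rule $x^T \beta > 0$ under either link.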
Classification through such regression modeling has been regarded as one of the most important tools in modern practice of data mining and is included in popular software such as SAS Enterprise Miner.
Neural Computation 18, 2762–2776 (2006)
© 2006 Massachusetts Institute of Technology
Consistent Bayesian Learning in High Dimensions
2763
Recently, considerable interest has been attracted by high-dimensional modeling of binary responses, where the dimension of $x$ is very large. For example, in microarray data analysis, thousands of genes expressed as components of a high-dimensional explanatory variable $x$ may influence a 0/1-valued binary response $y$ indicating disease status. Other examples include business data mining, where many explanatory variables, possibly with higher-order terms (powers) and interaction terms (products), can be used to model a binary response of product preference. Data can typically be expressed as $(x_i, y_i)_{i=1}^n$, which includes the information on $(x, y)$ for the $n$ subjects in the study. The dimension $K$ of $x$ is often very high compared to the sample size $n$. It is in fact quite common that $K \gg n$; for example, in microarray data, $K$ is often several thousand, while $n$ is often a few dozen. These applications with $K \gg n$ have provided an important new playground for statistical learning techniques. For example, Lee, Sha, Dougherty, Vannucci, and Mallick (2003) illustrated applications of several statistical learning techniques in microarray data analysis, including probit regression, neural networks, and support vector machines. Common goals of binary data modeling include regression (finding the relation between $x$ and $y$) and classification (predicting a future unobserved $y$ based on the available $x$ information). These goals can be achieved by Bayesian variable selection, which involves defining prior probabilities for different subsets of $x$ to model $y$ and possibly averaging over these subset models according to the induced posterior probabilities based on the observed data. As an advantage over non-Bayesian approaches, the posterior probabilities provide additional information about how likely each candidate model is in the Bayesian framework. In high-dimensional settings, Bayesian variable selection has proven to perform very well in practice. For example, Lee et al. (2003) and Sha et al.
(2004) (via probit regression) and Zhou, Liu, and Wong (2004) (via logistic regression) use Bayesian variable selection to model microarray data and achieve excellent cross-validated classification errors, all in the situation of $K \gg n$. The performance is often competitive with or superior to that of other common classification procedures (see, e.g., Lee et al., 2003, who compared other approaches, including the nearest-neighbor method, neural networks, and support vector machines). What is lacking is a theoretical explanation of why Bayesian variable selection works so well when the number of variables considered, $K$, can be much larger than the sample size $n$. This article addresses the theoretical question of consistency. Roughly speaking, we will show the following result even when the dimensionality of $x$ is very high: under certain conditions, Bayesian variable selection with either logistic or probit regression can produce estimated regression functions that are often close to the truth, as well as classifiers of future binary responses that are close to optimal. To have a theoretical analysis accommodating the large number of explanatory variables, we will formulate $K = K_n$ as dependent on the sample
2764
W. Jiang
size $n$. Therefore, for example, the case $K \sim e^{\sqrt n}$ indicates many more candidate explanatory variables than the case $K \sim \sqrt n$. The consistency describes the limiting trend as $n \to \infty$. Our results will cover a wide range of data dimensions satisfying $1 \prec K_n$ and $\ln K_n \prec n$, where $a_n \prec b_n$ represents $a_n = o(b_n)$, or $\lim_{n \to \infty} a_n / b_n = 0$.

The model relating the binary response $y$ and the explanatory variable $x$ is described by the regression function (conditional mean function) $\mu(x) = P(y = 1 \mid x)$. We assume a true model of the form $\mu_0(x) = \psi(x^T \beta)$, where $\psi : \mathbb{R} \to (0, 1)$ is a known transform used, for example, in logistic regression or probit regression. The $K_n$-vector $\beta$ of regression coefficients represents the effects of the corresponding $x$ components on $y$. Our consistency results require sparsity of these effects, in the sense that $\lim_{n \to \infty} \sum_{j=1}^{K_n} |\beta_j| < \infty$. This can describe situations, for example, where only a limited number of $\beta$ components are nonzero or where most of the $\beta$ components are very small, even though $K_n$ can be much larger than $n$. The task is therefore to select the relatively few important effects out of a vast number of $\beta$ components.

We consider Bayesian variable selection similar to Smith and Kohn (1996), where a latent vector of selection indicators $\gamma = (\gamma_1, \ldots, \gamma_K)$ is introduced to indicate the subset model used to estimate the truth. The components of $\gamma$ are 0/1 valued, with the 1-valued components corresponding to the components of $x$ included in the subset model. A model indicated by $\gamma$ proposes a regression $\mu(x) = \psi(x_\gamma^T \beta_\gamma)$, where $v_\gamma$ denotes the subvector of a vector $v$ with components $\{v_j\}$, for all $j$'s with $\gamma_j = 1$. A prior distribution $\pi$ can be placed on $(\gamma, \beta_\gamma)$, proposing a model $\gamma$ together with a corresponding set of regression coefficients. This will generate a posterior distribution on $(\gamma, \beta_\gamma)$ and the corresponding $\mu(x)$'s.
The consistency result involves verifying that certain choices of the prior $\pi$ lead to posteriors proposing regression functions that are often close to the true regression function $\mu_0$ in some sense. Previously, Ghosal (1997, 1999) studied posterior normal approximations for high-dimensional linear and generalized linear regression models. His work did not consider variable selection and covered cases with $K_n$ increasing at rates slower than $n$. We note that the use of variable selection seems to be essential in empirical works such as Lee et al. (2003) and Sha et al. (2004) for obtaining excellent results when $K_n \gg n$. Using non-Bayesian approaches such as constrained or penalized optimization, Greenshtein and Ritov (2004) considered high-dimensional variable selection with $K_n = O(n^\alpha)$ for any $\alpha > 0$. To relax the need to assume a true model, their work uses the concept of persistence, in the sense of finding the best subset models of a certain size. Bühlmann (2004) assumes a sparse true model and considers the use of boosting in high-dimensional linear regression, with $K_n = O(e^{Cn^{1-\xi}})$ for some $C > 0$ and $0 < \xi < 1$. Bühlmann's approach is also non-Bayesian (sequential optimization), and linear
regression is used instead of logistic or probit regression. Note that the Bayesian approach has the advantage of providing posterior probabilities that can assess the relative importance of candidate models. In contrast to other works, this article considers Bayesian variable selection for logistic or probit regression, with the number of candidate variables $K_n$ lying in a wide range $1 \prec K_n = e^{o(n)}$. Our theoretical results include consistency in both the regression sense and the classification sense. This provides theoretical justification for the excellent empirical performance reported in, for example, Lee et al. (2003), who used Bayesian variable selection to handle high-dimensional binary classification with $K_n \gg n$. Below we will first specify the notation and framework of the article. Then we will state the main results and conditions rigorously. We will also give examples of priors and link functions that satisfy these conditions. Proofs of the results will be outlined. We conclude with a brief discussion.

2 Notation and Definitions

2.1 Framework and Models. The article studies asymptotics as $n \to \infty$. The formalism described in section 1, with $x$ being $K$-dimensional and $K$ increasing with $n$, naturally leads us to embed $x$ as an $\infty$-dimensional vector for convenient mathematical treatment. This can be understood in a variety of ways. In gene selection problems, one can append 0's or independent random variables after the several thousand ($= K$) genes to form the $\infty$-dimensional $x$. Alternatively, one can append higher-order terms or interactions. Below we describe the framework using an embedded $\infty$-dimensional $x$ that follows a sparse true regression model.

The binary response is $y \in \{0, 1\}$. The explanatory variable is an $\infty$-dimensional random vector $x = (x_1, x_2, \ldots)^T$. For simplicity, we will assume that $|x_j| \le 1$ for all $j$. The results can easily be extended to the case where all $|x_j|$'s are bounded above by a large constant.
The true relation between $y$ and $x$ is assumed to follow a sparse parametric regression model $\mu_0 = E(y \mid x) = \psi(x^T \beta)$, where the transform $\psi : \mathbb{R} \to (0, 1)$ (called the inverse link function) can be logistic, $\psi(u) = e^u/(1 + e^u)$, or probit, $\psi(u) = \Phi(u) \equiv \int_{-\infty}^{u} (e^{-z^2/2}/\sqrt{2\pi})\, dz$. We assume that the regression parameter vector $\beta$ satisfies the sparseness condition $\sum_{j=1}^{\infty} |\beta_j| < \infty$. Note that $f_0 = \mu_0^y (1 - \mu_0)^{1-y}$ is the conditional density of $y \mid x$, which is also the joint density of $(x, y)$ if the dominating measure $\nu_x(dx)\, \nu_y(dy)$ is the product of the probability measure of $x$ and the counting measure of $y$. We will always use this kind of dominating measure and denote it as $dx\, dy$ for simplicity. The data for the $n$ subjects are assumed to be independent and identically distributed (i.i.d.) based on $f_0\, dx\, dy$. The data set consists of subjects $i = 1, \ldots, n$, and for each subject, only the first $K_n$ components of the explanatory vector $x$, together with the response $y$, are included. Therefore, showing the subject index $i$, the data set is of the form $D^n = (x_{i1}, \ldots, x_{iK_n}, y_i)_{i=1}^n$. We will refer
to $D^n$ as a restricted i.i.d. sample based on density $f_0$, due to the fact that only the first $K_n$ components of the $x$ variables are observed. We will consider the situation with $1 \prec K_n$ and $\ln K_n \prec n$. The prior selects a subset of the $K_n$ $x$-variables in the data set to model $y$, using $\mu = \psi(x_\gamma^T \beta_\gamma)$, where $\gamma = (\gamma_1, \ldots, \gamma_{K_n}, 0, 0, \ldots)$ has 0/1-valued components, which are 1 only when the corresponding $x$ components are included in the model; that is, $\gamma_j = I[|\beta_j| > 0]$. Since only the $K_n$ $x$-variables appearing in the data set are selected, $\gamma_j = 0$ for all $j > K_n$. Here the notation $v_\gamma$ denotes the subvector of a vector $v$ with components $\{v_j\}$, for all $j$'s with $\gamma_j = 1$. We use the probability measure $\pi_n(\gamma, d\beta_\gamma)$ to denote the prior distribution of the subset model $\gamma$ and the corresponding regression coefficients $\beta_\gamma$. (The prior can depend on the sample size $n$.) This induces a posterior measure conditional on the data set $D^n$:

$$
\pi_n(\gamma, d\beta_\gamma \mid D^n)
= \frac{\prod_{i=1}^{n} f(y_i, x_i \mid \gamma, \beta_\gamma)\, \pi_n(\gamma, d\beta_\gamma)}
{\sum_{\gamma} \int_{\beta_\gamma} \prod_{i=1}^{n} f(y_i, x_i \mid \gamma, \beta_\gamma)\, \pi_n(\gamma, d\beta_\gamma)},
$$
where $f(y, x \mid \gamma, \beta_\gamma) = \psi(x_\gamma^T \beta_\gamma)^y \{1 - \psi(x_\gamma^T \beta_\gamma)\}^{1-y}$. The prior and posterior distributions for $(\gamma, \beta_\gamma)$ induce distributions for the corresponding parameterized densities. The posterior estimator of the density $f_0$ is denoted as $\hat f_n(y, x) = \sum_{\gamma} \int_{\beta_\gamma} f(y, x \mid \gamma, \beta_\gamma)\, \pi_n(\gamma, d\beta_\gamma \mid D^n)$. The posterior estimate of $\mu_0$ is then $\int y \hat f_n\, dy$, integrating over the counting measure $dy$ on $\{0, 1\}$, which is also equal to $\hat\mu_n(x) = \sum_{\gamma} \int_{\beta_\gamma} \psi(x_\gamma^T \beta_\gamma)\, \pi_n(\gamma, d\beta_\gamma \mid D^n)$. The plug-in classifier of a future $y$ is defined as $\hat C_n(x) = I[\hat\mu_n(x) > 0.5]$. Since only the $K_n$ $x$-variables appearing in the data set are selected by the prior on $\gamma$, all these posterior estimates depend only on the restricted i.i.d. sample $D^n$, instead of depending on all the components of the infinite-dimensional $x$-vectors.

2.2 Definitions of Consistency. We first define consistency in regression function estimation, which we will call R consistency.

Definition 1 (R consistency). $\hat\mu_n$ is asymptotically consistent for $\mu_0$ if

$$\int \left( \hat\mu_n(x) - \mu_0(x) \right)^2 dx \xrightarrow{P} 0 \quad \text{as } n \to \infty.$$

Here and below, the convergence in probability of the form $q(D^n) \xrightarrow{P} q_0$, for any quantity dependent on the observed data, means $\lim_{n \to \infty} P_{D^n}[|q(D^n) - q_0| \le \epsilon] = 1$ for all $\epsilon > 0$. This definition describes a desirable property for
the estimated regression function $\hat\mu_n$ to be often (with $P_{D^n}$ tending to one) close (in the $L_2$ sense) to the true $\mu_0$, for large $n$.

Now we define consistency in terms of the density function, which we will term D consistency. Denote by $d_H(f, f_0) = \sqrt{\int (\sqrt f - \sqrt{f_0})^2\, dx\, dy}$ the Hellinger distance.

Definition 2 (D consistency). Suppose $D^n$ is a truncated i.i.d. sample based on density $f_0$. The posterior $\pi_n(\cdot \mid D^n)$ is asymptotically consistent for $f_0$ over Hellinger neighborhoods if for any $\varepsilon > 0$,

$$\pi_n[f : d_H(f, f_0) \le \varepsilon \mid D^n] \xrightarrow{P} 1 \quad \text{as } n \to \infty.$$

That is, the posterior probability of any Hellinger neighborhood of $f_0$ converges to 1 in probability. This definition describes a desirable property for the posterior-proposed joint density $f$ to be often close to the true $f_0$, for large $n$.

Now we define consistency in classification, which we will call C consistency. Here we consider the use of the plug-in classification rule $\hat C_n(x) = I[\hat\mu_n(x) > 1/2]$ in predicting $y$. We are interested in how the misclassification error $E_{D^n} P\{\hat C_n(x) \ne y \mid D^n\}$ approaches the minimal error $P\{C_o(x) \ne y\} = \inf_{C : \mathrm{Dom}(x) \to \{0,1\}} P\{C(x) \ne y\}$, where $C_o(x) = I[\mu_0(x) > 1/2]$ is the ideal Bayes rule based on the (unknown) true mean function $\mu_0$.

Definition 3 (C consistency). Let $\hat B_n : \mathrm{Dom}(x) \to \{0, 1\}$ be a classification rule that is computable based on the observed data $D^n$. If $\lim_{n \to \infty} E_{D^n} P\{\hat B_n(x) \ne y \mid D^n\} = P\{C_o(x) \ne y\}$, then $\hat B_n$ is called a consistent classification rule.

These terminologies appeared in Ge and Jiang (2006), where it was shown that the three consistency concepts are related:

Proposition 1 (Relations among the three consistencies; Ge & Jiang, 2006, Proposition 1). D consistency $\Longrightarrow$ R consistency $\Longrightarrow$ C consistency.

In this article, we will first establish D consistency; R and C consistency then follow naturally. The same relations can easily be generalized to the case of selected posterior estimates. In practice, one sometimes would like to average over a selected portion of the posterior distribution instead of over the entire posterior distribution. We denote by $A$ a rule, a subset of the space of $(\gamma, \beta_\gamma)$, possibly dependent on the data $D^n$. A selected posterior estimate of a quantity $g(\gamma, \beta_\gamma)$ according to rule $A$ is defined as $\hat g_A = \sum_{\gamma} \int_{\beta_\gamma} g(\gamma, \beta_\gamma)\, \pi_n^A(\gamma, d\beta_\gamma \mid D^n)$, with the selected posterior

$$
\pi_n^A(\gamma, d\beta_\gamma \mid D^n)
= \frac{I[(\gamma, \beta_\gamma) \in A]\, \pi_n(\gamma, d\beta_\gamma \mid D^n)}
{\sum_{\gamma} \int_{\beta_\gamma} I[(\gamma, \beta_\gamma) \in A]\, \pi_n(\gamma, d\beta_\gamma \mid D^n)}.
$$

A rule of this kind can, for example, average over several of the best models, that is, the $\gamma$'s that have the largest marginal posteriors $\pi_n(\gamma \mid D^n)$ (see, e.g., Smith & Kohn, 1996, who considered the use of the best model, and Sha et al., 2004, who averaged over the top 10 best models). A rule can also be defined from using the models that include the individually strongest variables: for example, include a model $\gamma$ in the average if $\gamma_j = 1$ for each variable $j$ that appears more than 5% of the time in the posterior distribution (that is, if $\pi_n(\gamma_j = 1 \mid D^n) > 5\%$). (See, e.g., Lee et al., 2003.) With a rule $A$, we can define selected posterior estimates $\hat\mu_n^A(x) = \sum_{\gamma} \int_{\beta_\gamma} \psi(x_\gamma^T \beta_\gamma)\, \pi_n^A(\gamma, d\beta_\gamma \mid D^n)$ and $\hat C_n^A(x) = I[\hat\mu_n^A(x) > 0.5]$, for regression and classification, respectively. As long as the selection probability $\pi_n\{(\gamma, d\beta_\gamma) \in A \mid D^n\}$ is bounded away from 0, D consistency will still imply R and C consistency for the selected posterior estimates.

Proposition 2 (Relations among the three consistencies for selected posterior estimates). Suppose a rule $A$ has selection probability $\pi_n\{(\gamma, d\beta_\gamma) \in A \mid D^n\} \ge r$ for some constant $r > 0$. Then D consistency $\Longrightarrow$ R consistency for $\hat\mu_n^A$ $\Longrightarrow$ C consistency for $\hat C_n^A$.

Proof. The proof is a straightforward adaptation of the proof of proposition 1 of Ge and Jiang (2006), using the selected posterior $\pi_n^A(\gamma, d\beta_\gamma \mid D^n)$ in place of the full posterior $\pi_n(\gamma, d\beta_\gamma \mid D^n)$ throughout.

3 Results and Conditions

We will denote $|v| = \sum_{j=1}^{\infty} |v_j|$ for an $\infty$-dimensional vector $v$. For the consistency results to hold, we will require two conditions on the prior $\pi$ and one condition on the inverse link function $\psi$. Denote by $\omega(u)$ the log odds function $\omega(u) = \ln[\psi(u)/\{1 - \psi(u)\}]$.

Condition I. (On the inverse link function $\psi$.)
The derivative of the log odds, $\omega'(u)$, is continuous and satisfies the following boundedness condition as the size of the domain increases: $\sup_{|u| \le C} |\omega'(u)| \le C^q$ for some $q \ge 0$, for all large enough $C$.

It is easy to see that for the logistic inverse link $\psi = e^u/(1 + e^u)$, $\omega = u$ and $\omega' = 1$, so condition I is trivially satisfied. For the probit inverse link $\psi = \Phi(u) \equiv \int_{-\infty}^{u} \phi(z)\, dz$, $\omega' = [\Phi(u)^{-1} + \{1 - \Phi(u)\}^{-1}]\phi(u)$, where $\phi(u) = e^{-u^2/2}/\sqrt{2\pi}$ is the standard normal probability density and $\Phi$ is the cumulative distribution function. Using the Mills ratio, it is straightforward to see that $|\omega'(u)|$ increases at most linearly with $u$, which also satisfies condition I.
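The boundedness claims in condition I can be checked numerically. Here is a small sketch; the concrete bound 2(|u| + 1) is our illustrative witness of at-most-linear growth, not a bound stated in the article:

```python
import math

def phi(u):
    """Standard normal density."""
    return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(u):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def omega_prime_probit(u):
    """Derivative of the probit log odds ln[Phi(u) / (1 - Phi(u))]."""
    return (1.0 / Phi(u) + 1.0 / (1.0 - Phi(u))) * phi(u)

# The logistic log odds is omega(u) = u, so omega'(u) = 1 identically.
# For the probit link, omega'(u) grows like |u| for large |u| (Mills
# ratio), so the condition holds with q = 1 on a moderate range:
for u in [-3.0, 0.0, 1.0, 3.0, 6.0]:
    assert omega_prime_probit(u) <= 2.0 * (abs(u) + 1.0)
```

(Beyond roughly |u| > 7, the term 1 − Φ(u) underflows in double precision, so a direct numerical check of the tail requires a log-space implementation.)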
One of the conditions on the prior requires a not-too-small probability (not exponentially small in $n$) to be placed on a small neighborhood of the truth. This is a set of $(\gamma, \beta_\gamma)$ that can approximate the true relation. Suppose the true relation involves the parameter vector $\beta$ satisfying $\sum_{j=1}^{\infty} |\beta_j| < \infty$. For a large integer $r_n > 0$ and a small $\eta > 0$, the set $S(r_n, \eta) = \{(\gamma, \beta_\gamma) : \gamma = \gamma(r_n),\ \beta_\gamma \in M(r_n, \eta)\}$ indexes a small set of densities $f = \psi(x_\gamma^T \beta_\gamma)^y \{1 - \psi(x_\gamma^T \beta_\gamma)\}^{1-y}$ that approximate the true density $f_0 = \psi(x^T \beta)^y \{1 - \psi(x^T \beta)\}^{1-y}$. We will call this set $S$ a small approximation set. Here $M(r_n, \eta) = \{(b_1, \ldots, b_{r_n})^T : b_j \in \beta_j \pm \eta/(2 r_n),\ j = 1, \ldots, r_n\}$, and $\gamma(r_n)$ denotes an increasing sequence of models of size $r_n$, such as $\gamma(r_n) = (1, \ldots, 1, 0, 0, \ldots)$, whose first $r_n$ components take value 1.

Condition S. (For prior $\pi_n$ on the small approximation set.) There exists a sequence $r_n$ increasing to infinity as $n \to \infty$ such that for any $\eta > 0$ and any $c > 0$, we have $\pi_n[\gamma = \gamma(r_n)] > e^{-cn}$ and $\pi_n[\beta_\gamma \in M(r_n, \eta) \mid \gamma = \gamma(r_n)] > e^{-cn}$, for all large enough $n$.

To satisfy the condition on $\beta_\gamma$, suppose we take $r_n$ in the range $1 \prec r_n \prec \min(K_n, n/\ln K_n)$, where, as before, we assume that $1 \prec K_n$ and $\ln K_n \prec n$. Assume, for example, that the components of $\beta_\gamma$ follow independent $N(0, 1)$ priors. The prior mass of $\beta_\gamma$ over $M(r_n, \eta)$ depends on both the normal density and the volume of $M(r_n, \eta)$. The exponential part of the density is bounded away from zero as $n$ increases, since it is of order $e^{-\beta_\gamma^T \beta_\gamma / 2}$ and $\beta_\gamma^T \beta_\gamma \le (\sum_{j=1}^{\infty} |\beta_j|)^2 < \infty$. The prior mass of $\beta_\gamma$ over $M(r_n, \eta)$ then has the same order as $(2\pi)^{-r_n/2}$ times the volume $(\eta/r_n)^{r_n}$, which is larger than $e^{-cn}$ for all large enough $n$. The simplest way to satisfy the condition on $\gamma$ is to have a point mass $\pi_n[\gamma = \gamma(r_n)] = 1$ for a sequence $r_n \to \infty$.
However, in practice, the effects of the explanatory variables may not be ordered in such a way that the preceding ones are more important. It is better to have the prior treat the orders symmetrically and let the data tell which variables are more important. One such choice is, for example, a uniform distribution over all subset models $\gamma$ with a bounded size, satisfying, for example, $|\gamma| = \sum_{j=1}^{\infty} |\gamma_j| < \bar r_n$. Suppose we take, for example, $\bar r_n$ such that $r_n \le \bar r_n \le K_n$ and $\bar r_n \prec n/\ln K_n$. Here, as before, we take $r_n$ in the range $1 \prec r_n \prec \min(K_n, n/\ln K_n)$ and assume that $1 \prec K_n$ and $\ln K_n \prec n$. Then the number of subset models of size $\bar r_n$ or less, selected from among the $K_n$ candidate variables, is less than $\sum_{|\gamma|=0}^{\bar r_n} K_n^{|\gamma|}$ and hence less than $(\bar r_n + 1) K_n^{\bar r_n}$. A uniform prior for these subset models therefore places at least probability $\{(\bar r_n + 1) K_n^{\bar r_n}\}^{-1}$ on each model, which is more than any exponential $e^{-cn}$ for all large enough $n$. The tail condition is therefore satisfied for the prior on $\gamma$. Alternatively, a computationally convenient method places a prior on $\gamma$ by assuming an i.i.d. binary prior for each $\gamma_j$, $j = 1, \ldots, K_n$. This is the
W. Jiang
approach used in Smith and Kohn (1996) and Lee et al. (2003), for example. Suppose the available number of explanatory variables $K_n$ satisfies $1 \prec K_n \prec e^{o(n)}$. Suppose the prior on γ is i.i.d. binary with $P(\gamma_j = 1) = \lambda_n$, $j = 1, \ldots, K_n$. Then the prior $\pi_n$ on $\gamma = (1, 1, \ldots, 1, 0, 0, \ldots)$ (first $r_n$ components equal to 1) satisfies $\ln \pi_n = r_n \ln \lambda_n + (K_n - r_n) \ln(1 - \lambda_n)$. Take $r_n \approx K_n \lambda_n$. Then for small $\lambda_n$, $\ln \pi_n$ becomes about $-r_n \ln(K_n/r_n)$. To have a not-too-small $\pi_n$, we can set $r_n$ satisfying $1 \prec r_n \prec \min(K_n, n/\ln K_n)$ and $\lambda_n = r_n/K_n$. Then for any $c > 0$, $\ln \pi_n \ge -r_n \ln K_n > -cn$ for all large enough n, satisfying the prior condition on γ. This condition will also be satisfied if we restrict the number of selected variables by truncation. For example, let $\pi_n(\gamma_1, \ldots, \gamma_{K_n}) \propto \prod_{j=1}^{K_n} \lambda_n^{\gamma_j} (1 - \lambda_n)^{1-\gamma_j} I[\sum_{l=1}^{K_n} \gamma_l \le \bar r_n]$ for some $\bar r_n$ satisfying $\bar r_n \in [r_n, K_n]$. This is because the truncation increases the probabilities of all allowed models, so the not-too-small prior probability condition is still satisfied. (Such a truncation with $\bar r_n \prec n/\ln K_n$ will be helpful for satisfying condition L below, as well as for avoiding the inclusion of too many variables, which can lead to singularity of design matrices, as discussed after condition L.)

The next condition on the prior is that it has sufficiently thin tails (thinner than exponentially small in n) outside a large region of $(\gamma, \beta_\gamma)$'s. Denote the size of a model γ by $|\gamma| = \sum_{j=1}^\infty \gamma_j$. A large region involves either a large $|\gamma|$ or a large size of some regression coefficient.

Condition L. (For prior $\pi_n$ outside a large region.) There exist some $\bar r_n = o(n/\ln K_n)$, $\bar r_n \in [1, K_n]$, and some $C_n$ satisfying $C_n^{-1} = o(1)$ and $\ln C_n = o(n/\bar r_n)$, such that for some $c > 0$, $\pi_n[|\gamma| > \bar r_n] \le e^{-cn}$, and $\pi_n(\cup_{j:\gamma_j=1}[|\beta_j| > C_n] \mid \gamma) \le e^{-cn}$ for all γ such that $|\gamma| \le \bar r_n$, for all large enough n.
The part of condition L bounding the tail of γ can be trivially satisfied by restricting the size of the selected models. For example, truncate the prior by a factor proportional to $I[|\gamma| \le \bar r_n]$, where $\bar r_n \prec n/\ln K_n$ and $\bar r_n \in [1, K_n]$. This kind of restriction is also beneficial for making the design matrix $\sum_{i=1}^n x_{i\gamma} x_{i\gamma}^T$ nonsingular; such a design matrix is often used in the popular algorithms for generating the posterior distributions in probit regression (e.g., Lee et al., 2003) and logistic regression (e.g., Zhou et al., 2004). The tail condition on the regression coefficients can be easily verified when their prior distributions are i.i.d. normal, by using the Mills ratio and choosing, for example, $C_n = n$. Note that $1 \prec C_n$ and $\ln C_n \prec n/\bar r_n$ if we also have $\bar r_n \prec n/\ln n$. Although in the above discussion of the prior conditions we have considered i.i.d. normal priors for the regression coefficients, it is also possible to place a non-i.i.d. normal prior such as $\beta_\gamma \sim N(0, V)$. For example, $V \approx \text{constant} \times (E\, x_\gamma x_\gamma^T)^{-1}$ (see, e.g., Smith & Kohn, 1996, and Lee et al., 2003, who used a sample approximation of this choice). With mild restrictions on the largest eigenvalues of V and $V^{-1}$, all conditions can
Consistent Bayesian Learning in High Dimensions
also be confirmed. One example of such a restriction is that the largest eigenvalues of V and $V^{-1}$ are both bounded linearly in $|\gamma|$ for large $|\gamma|$. This is true, for example, when $x_\gamma$ has components standardized to have mean zero and common variance, with all pairwise correlations equal to some $\rho \in (0, 1)$. More generally, assume that the eigenvalues of V and $V^{-1}$ are both bounded above by $B|\gamma|^v$ for some $B > 0$ and $v \ge 1$, for all large $|\gamma|$. In these cases, for condition S to hold, $\pi_n[\beta_\gamma \in M(r_n, \eta) \mid \gamma = \gamma(r_n)]$ will also depend on how fast the normal density decreases as $r_n$ increases. The density can be shown to be bounded below by $e^{-c_1 r_n^v}(c_2 r_n^v)^{-0.5 r_n}$ for some constants $c_1, c_2 > 0$. For the sake of condition S, one can then restrict $1 \prec r_n \prec \min\{K_n, n/\ln K_n, n^{1/v}\}$. For condition L to hold, in the Mills ratio argument bounding the tail probability $\pi_n(\cup_{j:\gamma_j=1}[|\beta_j| > C_n] \mid \gamma)$, the constant $C_n$ needs to be inflated by the largest possible prior standard deviation of $\beta_j$, which is at most of order $\bar r_n^{v/2} \prec n^{v/2}$. So $C_n$ can now be taken as $n^{1+v/2}$. All the arguments still go through to ensure condition L.

Proposition 3. Under conditions I, S, and L, we have D consistency for the posterior estimates of the density.

Proof.
The details are included in the next section.
The results of the previous section then imply R and C consistency for the regression estimates and classifiers, respectively, whether they are obtained from the complete posterior distribution or from selected parts of the posterior distribution when the selection probability is bounded away from 0. Combining these with the above comments on when conditions I, S, and L are satisfied, we obtain the following result.

Theorem 1.
i. Suppose the inverse link function is either logistic or probit.
ii. Suppose the true regression model has sparse regression coefficients β satisfying $\sum_{j=1}^\infty |\beta_j| < \infty$.
iii. Suppose the available number of explanatory variables $K_n$ satisfies $1 \prec K_n$ and $\ln K_n \prec n$.
iv. Suppose the prior on γ is truncated i.i.d. binary. That is, first generate i.i.d. $\gamma_j$'s with $P(\gamma_j = 1) = r_n/K_n$ for all $j = 1, \ldots, K_n$; then adopt this γ only if $|\gamma| \le \bar r_n$, where $r_n$ and $\bar r_n$ satisfy $1 \prec r_n \le \bar r_n \prec \min\{K_n, n/\ln K_n, n/\ln n\}$. (The resulting prior probability is $\pi_n(\gamma_1, \ldots, \gamma_{K_n}) \propto \prod_{j=1}^{K_n} \lambda_n^{\gamma_j}(1 - \lambda_n)^{1-\gamma_j} I[\sum_{l=1}^{K_n} \gamma_l \le \bar r_n]$, where $\lambda_n = r_n/K_n$.)
v. Conditional on model γ, suppose the prior is $\beta_\gamma \sim N(0, V)$, with the eigenvalues of V and $V^{-1}$ bounded above by $B|\gamma|$ for some constant $B > 0$, for all large enough $|\gamma|$.

Then we have D consistency for the posterior estimates of the density, as well as R and C consistency for the regression estimates and classifiers, respectively, whether they are obtained from the complete posterior distribution or from the selected posterior distribution, when the selection probability is bounded away from 0.

Note that the results can be extended to the case of more general priors $\beta_\gamma \sim N(0, V)$, with V and $V^{-1}$ having eigenvalues bounded above by $B|\gamma|^v$ for some $B > 0$ and $v \ge 1$. As discussed before proposition 3, we can then modify the restriction in part iv of theorem 1 to $1 \prec r_n \le \bar r_n \prec \min\{K_n, n/\ln K_n, n/\ln n, n^{1/v}\}$ for the consistency results to hold.
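The prior of theorem 1 (parts iv and v) is straightforward to simulate. The sketch below (function names are ours; $V = I$ is one admissible covariance choice, since its eigenvalues and those of its inverse are bounded) draws $(\gamma, \beta_\gamma)$ by rejection to implement the truncation $|\gamma| \le \bar r_n$:

```python
import random

random.seed(0)

def sample_prior(K_n, r_n, r_bar):
    """Draw (gamma, beta_gamma): i.i.d. binary gamma_j with
    P(gamma_j = 1) = r_n / K_n, truncated to |gamma| <= r_bar, then
    beta_gamma i.i.d. N(0, 1) (i.e., V = I, an admissible choice)."""
    lam = r_n / K_n
    while True:  # rejection step implements the truncation indicator
        gamma = [1 if random.random() < lam else 0 for _ in range(K_n)]
        if sum(gamma) <= r_bar:
            return gamma, [random.gauss(0.0, 1.0) for _ in range(sum(gamma))]

gamma, beta = sample_prior(K_n=500, r_n=5, r_bar=20)
assert sum(gamma) <= 20 and len(beta) == sum(gamma)
```

Because $E|\gamma| = r_n \ll \bar r_n$, the rejection step accepts quickly, which is what makes the truncated prior computationally convenient.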
4 Proof of Proposition 3

The densities $f = \psi(x_\gamma^T \beta_\gamma)^y \{1 - \psi(x_\gamma^T \beta_\gamma)\}^{1-y}$ are parameterized by $(\gamma, \beta_\gamma)$, where $\gamma = (\gamma_1, \ldots, \gamma_{K_n}, 0, 0, \ldots)$, the $\gamma_j$'s are 0/1 valued, and $\beta_\gamma \in \mathbb{R}^{|\gamma|}$. We will sometimes think of a parameter $(\gamma, \beta_\gamma)$ as being embedded in $\mathbb{R}^{K_n}$ as $\theta = (\beta_j \gamma_j)_{j=1}^{K_n}$, with nonzero real $\beta_j$'s, which corresponds to filling in zeros for directions with zero $\gamma_j$'s. The density corresponding to θ is then $f_\theta(x, y) = \psi(\sum_{j=1}^{K_n} x_j \theta_j)^y \{1 - \psi(\sum_{j=1}^{K_n} x_j \theta_j)\}^{1-y}$. Note that there is a one-to-one relation between the parameterization $(\gamma, \beta_\gamma)$ and θ, since $\gamma_j = I[|\theta_j| > 0]$ and $\beta_j = \theta_j$ for all j such that $|\theta_j| > 0$.

Let $\Theta_n = \{\theta = (\beta_j \gamma_j)_{j=1}^{K_n} : \sum_{j=1}^{K_n} \gamma_j \le \bar r_n,\ 0 < |\beta_j| \le C_n\ \forall j\}$, where $\bar r_n = o(n/\ln K_n)$, $\bar r_n \in [1, K_n]$, and $C_n$ satisfies $C_n^{-1} = o(1)$ and $\ln C_n = o(n/\bar r_n)$, as in condition L. (Then $\Theta_n$ corresponds to the set of parameters $(\gamma, \beta_\gamma)$ such that $|\gamma| \le \bar r_n$ and all components of $\beta_\gamma$ are bounded above by $C_n$ in size.) Denote by $\mathcal{F}_n = f(\Theta_n)$ the corresponding set of densities.

The proof involves splitting the space of densities proposed by the prior into two parts, $\mathcal{F}_n$ and $\mathcal{F}_n^c$, and proving the three conditions below. This follows a variant of the consistency theorem in Wasserman (1998), which appeared in Lee (2000, theorem 2). These three conditions then imply the result of proposition 3. (The entropy condition ii was implicitly used in Lee (2000, lemma 3).)

Tail condition i. There exists an $r > 0$ such that the prior satisfies $\pi_n(\mathcal{F}_n^c) < \exp(-nr)$ for all sufficiently large n.

To prove this, note that $\pi_n(\mathcal{F}_n^c) = \pi_n[f_\theta \notin f(\Theta_n)] \le \pi_n[\theta \notin \Theta_n] \le \pi_n[|\gamma| > \bar r_n] + \sum_{\gamma:|\gamma| \le \bar r_n} \pi_n[\gamma]\,\pi_n(\cup_{j:\gamma_j=1}[|\beta_j| > C_n] \mid \gamma)$, which is at most $\pi_n[|\gamma| > \bar r_n] + {}$
Consistent Bayesian Learning in High Dimensions
2773
$\max_{\gamma:|\gamma| \le \bar r_n} \pi_n(\cup_{j:\gamma_j=1}[|\beta_j| > C_n] \mid \gamma)$, which, by condition L, is at most $2e^{-cn} < \exp(-n(c/2))$ for all large enough n. This proves the tail condition.

Entropy condition ii. There exists some constant $c > 0$ such that, for all $\varepsilon > 0$, $\int_0^\varepsilon \sqrt{H_{[\,]}(u)}\, du \le c\sqrt{n}\,\varepsilon^2$ for all sufficiently large n.
Here $H_{[\,]}(\varepsilon) = \ln N_{[\,]}(\varepsilon, \mathcal{F}_n)$ denotes the Hellinger bracketing entropy of the set of densities $\mathcal{F}_n$, where $N_{[\,]}(\varepsilon, \mathcal{F}_n)$ is the minimum number of ε-brackets needed to cover $\mathcal{F}_n$. (An ε-bracket $[l, u]$ is the set of all functions f such that $l \le f \le u$, for l and u being two functions of (x, y) satisfying $\|\sqrt{u} - \sqrt{l}\|_2^2 \equiv \int (\sqrt{u} - \sqrt{l})^2\, dx\, dy \le \varepsilon$.)

To prove the entropy condition for $\mathcal{F}_n = f(\Theta_n)$, we first consider the complexity of the corresponding parameter space $\Theta_n$, which includes all models of size $|\gamma| \le \bar r_n$ with regression coefficients bounded above by $C_n$. Note that $\Theta_n = \cup_{k=0}^{\bar r_n} \cup_{\gamma:|\gamma|=k} \Theta_\gamma$, where $\Theta_\gamma = \{\theta = (\beta_j \gamma_j)_{j=1}^{K_n} : 0 < |\beta_j| \le C_n\ \forall j\}$ denotes the bounded parameter space for model γ. The number of models γ with $|\gamma| = k$ is at most $K_n^k$. For each model γ with $|\gamma| = k$, at most $(C_n/\delta + 1)^k$ δ-balls in the $\ell_\infty$ metric, of the form $B_\delta(t) = \{s : d(s, t) \le \delta\}$ with $d(s, t) \equiv \max_{j=1}^{K_n} |s_j - t_j|$, are needed to cover the bounded parameter space $\Theta_\gamma$. The total number of δ-balls needed to cover $\Theta_n$ (with $|\gamma| \le \bar r_n$), called $N(\delta, \Theta_n)$, is therefore at most $\sum_{k=0}^{\bar r_n} K_n^k (C_n/\delta + 1)^k \le (\bar r_n + 1)\{K_n(C_n/\delta + 1)\}^{\bar r_n} \le (4K_n C_n/\delta)^{\bar r_n}$ for all large enough n.

We will consider $\delta = \varepsilon/(2\sqrt{2}F)$, where $\varepsilon > 0$, $F = 0.5 K_n^{q+1}(2C_n)^q$, and $q \ge 0$ is as in condition I. For each $f \in \mathcal{F}_n = f(\Theta_n)$, $f = f_s$ for some $s \in \Theta_n$, where s lies in one of the $N(\delta, \Theta_n)$ δ-balls, centered at, say, t. By a first-order Taylor expansion and using condition I, we obtain $|\sqrt{f_s} - \sqrt{f_t}| \le F\, d(s, t) \le F\delta$ for all sufficiently large n. Then $f_s$ falls in an ε-bracket $[\{\max(u_-, 0)\}^2, u_+^2]$, where $u_\pm = \sqrt{f_t} \pm F\delta$ and $\varepsilon = 2\sqrt{2}F\delta$. So we have $N_{[\,]}(2\sqrt{2}F\delta, \mathcal{F}_n) \le N(\delta, \Theta_n)$. This, together with the upper bound on $N(\delta, \Theta_n)$ discussed above, eventually leads to $H_{[\,]}(\varepsilon) \le \ln\{[\sqrt{2}\,(2K_n)(2K_n C_n)^{q+1}/\varepsilon]^{\bar r_n}\}$ for all large enough n. Then, using
integral transformations and the Mills ratio, one obtains $\int_0^\varepsilon \sqrt{H_{[\,]}(u)}\, du \prec \sqrt{n}$ for all $\varepsilon > 0$, when $\bar r_n = o(n/\ln K_n)$, $\bar r_n \in [1, K_n]$, and $C_n$ satisfies $C_n^{-1} = o(1)$ and $\ln C_n = o(n/\bar r_n)$, as in condition L. This proves the entropy condition.

Approximation condition iii. For any $\xi, \nu > 0$, the prior satisfies $\pi_n(KL_\xi) \ge \exp(-n\nu)$ for all sufficiently large n. Here, for any $\xi > 0$, the Kullback-Leibler (KL) ξ-neighborhood is $KL_\xi = \{f : d_K(f, f_0) \le \xi\}$, where $d_K(f, f_0) = \int f_0 \ln(f_0/f)\, dx\, dy$ is the KL divergence.

To prove this condition, suppose the true density $f_0 = \psi(x^T\beta)^y\{1 - \psi(x^T\beta)\}^{1-y}$ involves the parameter vector β satisfying $\sum_{j=1}^\infty |\beta_j| < \infty$. For large integer $r_n > 0$ and small $\eta > 0$, the set $S(r_n, \eta) = \{(\gamma, \beta_\gamma) : \gamma = \gamma(r_n), \beta_\gamma \in M(r_n, \eta)\}$, where $M(r_n, \eta) = \{(b_1, \ldots, b_{r_n})^T : b_j \in \beta_j \pm \eta/(2r_n),\ j = 1, \ldots, r_n\}$, indexes a small set of densities $f = \psi(x_\gamma^T\beta_\gamma)^y\{1 - \psi(x_\gamma^T\beta_\gamma)\}^{1-y}$ that approximates the true density $f_0$. Here $\gamma(r_n)$ denotes an increasing sequence of models of size $r_n$, taken here, for example, as $\gamma(r_n) = (1, 1, \ldots, 1, 0, 0, \ldots)$, with first $r_n$ components equal to 1.

Any density $f^* = \psi(x_\gamma^T\beta_\gamma^*)^y\{1 - \psi(x_\gamma^T\beta_\gamma^*)\}^{1-y}$ with parameter $(\gamma, \beta_\gamma^*) \in S(r_n, \eta)$ will be very close to the true density $f_0$ in the KL sense. Define f as the density at the center of $S(r_n, \eta)$, that is, $f = \psi(x_\gamma^T\beta_\gamma)^y\{1 - \psi(x_\gamma^T\beta_\gamma)\}^{1-y}$, where $\gamma = (1, 1, \ldots, 1, 0, 0, \ldots)$ and $\beta_\gamma$ consists of the first $r_n$ components of the true parameter β. It can be shown, by a first-order approximation of $\ln(f_0/f^*)$, that $d_K(f^*, f_0) \le c_1 \cdot E|x_\gamma^T\beta_\gamma^* - x^T\beta|$ for some constant $c_1 > 0$. The difference satisfies $E|x_\gamma^T\beta_\gamma^* - x^T\beta| \le E|x_\gamma^T(\beta_\gamma^* - \beta_\gamma)| + E|x_\gamma^T\beta_\gamma - x^T\beta|$, which is at most $r_n(\eta/(2r_n)) + \sum_{j>r_n} |\beta_j|$ and is smaller than η for all large enough n, due to the finiteness of $\sum_{j=1}^\infty |\beta_j|$. Therefore, $d_K(f^*, f_0) \le c_1\eta$ for all $f^*$ in $f(S(r_n, \eta))$ (the set of densities with parameter in $S(r_n, \eta)$). Then, if we let $\eta = \xi/c_1$, we have $f(S(r_n, \eta)) \subset KL_\xi$, and therefore the prior on the sets of densities satisfies $\pi_n(KL_\xi) \ge \pi_n[f(S(r_n, \eta))]$. Moreover, $\pi_n[f(S(r_n, \eta))] \ge \pi_n[(\gamma, \beta_\gamma) \in S(r_n, \eta)]$, which is the product of $\pi_n[\gamma = \gamma(r_n)]$ and $\pi_n[\beta_\gamma \in M(r_n, \eta) \mid \gamma = \gamma(r_n)]$, and which is at least $e^{-2cn} = e^{-\nu n}$ for all large enough n if we take $c = \nu/2$ in condition S. This shows the approximation condition.

5 Discussion

In this article, we study the consistency of regression estimates and classifiers based on averaging over posterior distributions in a framework of Bayesian variable selection.
Binary logistic regression and probit regression with many input variables (possibly many more than the sample size) are considered, when the true regression coefficients satisfy a sparseness condition: the aggregated sum of the sizes of the effects is bounded. Such a condition implies that the number of important effects is relatively small despite the high dimension, which enables Bayesian variable selection to perform well by proposing relatively simple models in the prior. Such a sparseness condition has been used by Donoho (1993) for function estimation with thresholding and by Yang and Barron (1998) for density estimation with a complexity penalty. More recently, the sparseness condition has been used in high-dimensional linear regression by Bühlmann (2004) (with boosting) and by Greenshtein and Ritov (2004) (with constrained optimization). In contrast to these frequentist approaches with thresholding, complexity penalties, or constraints, we use the approach of Bayesian variable selection, which has the advantage of providing a posterior assessment of the relative importance of the candidate models. To some degree, this article
serves as a theoretical justification for some successful empirical work on high-dimensional Bayesian variable selection, such as Lee et al. (2003) and Zhou et al. (2004).

Various generalizations can be considered, including multinomial regression and classification, the study of convergence rates, and posterior asymptotic normality, perhaps in the more general framework of generalized linear models, which would extend the work of, for example, Ghosal (1997) to the case with variable selection. One may also be interested in relaxing the assumption of a sparse true model. We conjecture that Bayesian variable selection in this case will estimate the best subset models that are closest to the truth in the Kullback-Leibler sense. This could extend the work on persistence and best subset selection (see, e.g., Greenshtein & Ritov, 2004) to the case of Bayesian inference.

The prior-based method described in this article may have an additional advantage over some other consistent procedures. As a referee pointed out, in the current setup, R consistency may be easily achieved by, for example, maximum likelihood including only the first $r_n$ explanatory variables, with $r_n$ increasing to infinity slowly enough. In practice, the earlier components may not be the more important ones, and the real challenge is to find the relatively few important components out of many ($K_n$) candidates. The prior in this article, on the other hand, achieves this purpose by treating subsets of explanatory variables symmetrically, so that although the earlier components may not be more important, the resulting posterior can still pick up the important subsets. Further work may be needed to quantify this advantage by studying the convergence rate or finite-sample performance. General theory on convergence rates of Bayesian estimators, such as that described in Ghosal, Ghosh, and van der Vaart (2000) and Shen and Wasserman (2001), can be applied for this purpose.
This subject is studied in a more general framework of generalized linear models in work in progress.

Acknowledgments

We thank the referees for their helpful comments and Eitan Greenshtein for providing a technical report that drew our attention to this research area.

References

Bühlmann, P. (2004). Boosting for high-dimensional linear models (Tech. Rep.). Zürich: Seminar für Statistik, Mathematics Department of ETH Zürich. Available online at http://stat.ethz.ch/research/research reports/2004/120.
Donoho, D. L. (1993). Unconditional bases are optimal bases for data compression and for statistical estimation. Applied and Computational Harmonic Analysis, 1, 100–115.
Ge, Y., & Jiang, W. (2006). On consistency of Bayesian inference with mixtures of logistic regression. Neural Computation, 18, 224–243.
Ghosal, S. (1997). Normal approximation to the posterior distribution for generalized linear models with many covariates. Mathematical Methods of Statistics, 6, 332–348.
Ghosal, S. (1999). Asymptotic normality of posterior distributions in high dimensional linear models. Bernoulli, 5, 315–331.
Ghosal, S., Ghosh, J. K., & van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Annals of Statistics, 28, 500–531.
Greenshtein, E., & Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparameterization. Bernoulli, 10, 971–988.
Lee, H. K. H. (2000). Consistency of posterior distributions for neural networks. Neural Networks, 13, 629–642.
Lee, K. E., Sha, N., Dougherty, E. R., Vannucci, M., & Mallick, B. K. (2003). Gene selection: A Bayesian variable selection approach. Bioinformatics, 19, 90–97.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models. London: Chapman and Hall.
Rosenblatt, F. (1962). Principles of neurodynamics. New York: Spartan Books.
Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T. C., Contestabile, A., Salmon, N., Buckley, C., & Falciani, F. (2004). Bayesian variable selection in multinomial probit models to identify molecular signatures of disease stage. Biometrics, 60, 812–819.
Shen, X., & Wasserman, L. (2001). Rates of convergence of posterior distributions. Annals of Statistics, 29, 687–714.
Smith, M., & Kohn, R. (1996). Nonparametric regression using Bayesian variable selection. Journal of Econometrics, 75, 317–343.
Wasserman, L. (1998). Asymptotic properties of nonparametric Bayesian procedures. In D. Dey, P. Müller, & D. Sinha (Eds.), Practical nonparametric and semiparametric Bayesian statistics (pp. 293–304). New York: Springer.
Yang, Y., & Barron, A. R. (1998).
An asymptotic property of model selection criteria. IEEE Transactions on Information Theory, 44, 95–116.
Zhou, X., Liu, K.-Y., & Wong, S. T. C. (2004). Cancer classification and prediction using logistic regression with Bayesian gene selection. Journal of Biomedical Informatics, 37, 249–259.
Received July 27, 2005; accepted March 20, 2006.
LETTER
Communicated by Włodzisław Duch
A Maximum Likelihood Approach to Density Estimation with Semidefinite Programming Tadayoshi Fushiki [email protected] Institute of Statistical Mathematics, Minato-ku, Tokyo 106-8569, Japan
Shingo Horiuchi [email protected] Access Network Service Systems Laboratories, NTT Corp., Makuhari, Chiba, 261-0023, Japan
Takashi Tsuchiya [email protected] Institute of Statistical Mathematics, Minato-ku, Tokyo 106-8569, Japan
Density estimation plays an important and fundamental role in pattern recognition, machine learning, and statistics. In this article, we develop a parametric approach to univariate (or low-dimensional) density estimation based on semidefinite programming (SDP). Our density model is expressed as the product of a nonnegative polynomial and a base density such as a normal, exponential, or uniform distribution. When the base density is specified, the maximum likelihood estimation of the polynomial is formulated as a variant of SDP that is solved in polynomial time with interior point methods. Since the base density typically contains just one or two parameters, computation of the maximum likelihood estimate reduces to an easy one- or two-dimensional optimization problem with this use of SDP. Thus, the rigorous maximum likelihood estimate can be computed in our approach. Furthermore, conditions such as symmetry and unimodality of the density function can be easily handled within this framework. AIC is used to choose the best model. Through applications to several instances, we demonstrate the flexibility of the model and the performance of the proposed procedure. A combination with a mixture approach is also presented. The proposed approach has other possible applications beyond density estimation. This point is clarified through an application to the maximum likelihood estimation of the intensity function of a nonstationary Poisson process.
Neural Computation 18, 2777–2812 (2006)
© 2006 Massachusetts Institute of Technology
1 Introduction

Density estimation plays an important and fundamental role in constructing flexible learning models in pattern recognition, machine learning, and statistics. In this article, we develop a new computational approach to density estimation based on maximum likelihood estimation of parametric models and model selection by AIC. The rigorous maximum likelihood estimate of the parameters is computed by employing semidefinite programming (SDP) as the main tool. SDP is a convex programming technology developed in optimization over the past decade. We deal with one-dimensional density estimation, though the idea can be extended to the multidimensional case in a reasonable manner, as explained at the end of section 2.

We model a density function as the product of a univariate nonnegative polynomial and a base density such as a normal, exponential, or uniform distribution. In other words, our model amounts to expressing the density functions with nonnegative Hermite, Laguerre, and Legendre polynomials (multiplied by their respective weight functions). The support of the density function can be any of (−∞, ∞), [0, ∞), and [a, b], where a, b ∈ R, and the models are capable of approximating density functions (with the same support) precisely by adopting a polynomial of sufficiently large degree.

Our main tool for the maximum likelihood estimation of this model is SDP. We formulate the maximum likelihood estimation as a bilevel optimization problem, separating optimization of the nonnegative polynomial part and the base density part. At the lower level, the likelihood function is maximized with respect to the nonnegative polynomial part with the base density part fixed. Such a partially maximized likelihood is referred to as the profile likelihood. At the higher level, global optimization of the likelihood function is done by maximizing the profile likelihood with respect to the base density parameters.
As will be explained later, the lower-level optimization of the nonnegative polynomial part reduces to a simple variant of SDP. This variant can be solved rigorously and efficiently with recently developed interior-point algorithms. Taking advantage of this fact, the higher-level optimization of the base density part is not difficult, since the number of parameters involved in the base density is quite low (typically one or two).

SDP is a class of convex programming that has developed quickly in the past decade (Alizadeh, 1995; Ben-Tal & Nemirovski, 2001; Nesterov & Nemirovskii, 1994; Vandenberghe & Boyd, 1996; Wolkowicz, Saigal, & Vandenberghe, 2000). The main purpose of SDP is to solve a semidefinite programming problem: the problem of minimizing a linear function over the intersection of an affine space and the space of symmetric positive semidefinite matrices. SDP is an extension of classical linear programming and has wide applications in control, combinatorial optimization, signal processing, communication systems design, optimal shape design,
and machine learning, among others. A semidefinite programming problem can be solved efficiently, in both theory and practice, with interior point methods (Monteiro & Todd, 2000). We will show that the maximum likelihood estimate of our model can be computed efficiently and rigorously with the techniques of SDP. Recently, several interesting applications of convex programming have been developed in machine learning (e.g., Bhattacharyya, 2004; Graepel & Herbrich, 2004; Lanckriet, El Ghaoui, Bhattacharyya, & Jordan, 2002; Lanckriet, Cristianini, Bartlett, El Ghaoui, & Jordan), in particular after the success of Vapnik's support vector machine (Vapnik, 1995). Bennett and Mangasarian (1992) is another pioneering work in the same direction.

A necessary and sufficient condition for a univariate polynomial to be nonnegative over the support sets can be expressed as semidefinite constraints (Nesterov, 2000). With this formulation, when the base density is specified, the maximum likelihood estimation can be formulated as a variant of SDP, namely, maximizing a log-determinant function under a semidefinite constraint. The formulated optimization problem is not exactly a semidefinite program, but due to its special form, we can solve it as efficiently as an ordinary semidefinite program with the interior point method (Vandenberghe, Boyd, & Wu, 1998; Toh, 1999; Tsuchiya & Xia, 2006). Specifically, we use the primal-dual interior point method to solve this variant of SDP. Once the maximum likelihood estimate is obtained, AIC (Akaike, 1973, 1974) is used to choose the best model. We demonstrate that the model is flexible and that the MAIC (minimum AIC) procedure gives reasonable estimates of the densities.
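The MAIC step itself is elementary. As a toy sketch (the log-likelihood values and parameter counts below are invented purely for illustration), one picks the polynomial degree that minimizes AIC = −2 log L + 2k:

```python
def aic(loglik, n_params):
    """Akaike information criterion: AIC = -2 * logL + 2 * k."""
    return -2.0 * loglik + 2.0 * n_params

# degree -> (maximized log-likelihood, number of free parameters);
# these numbers are made up for illustration only
candidates = {2: (-150.3, 4), 4: (-147.9, 6), 6: (-147.5, 8)}
best_degree = min(candidates, key=lambda deg: aic(*candidates[deg]))
assert best_degree == 4
```

Here the degree-6 model has the highest likelihood but loses on the penalty term, which is exactly the trade-off the MAIC procedure encodes.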
There has been much research on one-dimensional density estimation, in particular in statistics, where gaussian mixture approaches and nonparametric approaches are preferred for their flexibility (Akaike, 1977; Eggermont & LaRiccia, 2001; Good & Gaskins, 1971, 1980; Hjort & Glad, 1995; Ishiguro & Sakamoto, 1984; McLachlan & Peel, 2000; Roeder, 1990; Scott, 1992; Silverman, 1986; Tanabe & Sagae, 1999; Tapia & Thompson, 1978; Wand & Jones, 1995). The advantages of our approach compared with these are as follows. First, our approach is based on a parametric model taking the Kullback-Leibler divergence as the loss function, where the rigorous maximum likelihood estimate can, in principle, be computed without any trouble. Furthermore, it basically fits into a general framework of model selection with information criteria. In other parametric approaches, it often occurs that the objective function has many local maxima and the optimization does not work well. In nonparametric approaches, the selection of the bandwidth becomes an issue and sometimes causes unsatisfactory results. Since our approach is based on the likelihood, we can use this density model as a component in assembling general learning models in the context of multivariate regression, pattern discrimination, time series analysis, independent component analysis (ICA), and so on. In particular, in ICA, one-dimensional density estimation
2780
T. Fushiki, S. Horiuchi, and T. Tsuchiya
plays a crucial role. Our approach would provide a reasonable alternative to the existing density estimation methods. Another important feature is flexibility in modeling densities. It is sometimes known in advance that the estimated density function should satisfy conditions such as symmetry and unimodality. In our approach, these conditions are naturally treated as linear and semidefinite constraints in the formulation, whereas it is not easy to incorporate them into models in other approaches. The proposed approach also has possible applications beyond density estimation. This point is clarified later in this article through an application to the maximum likelihood estimation of the intensity function of a nonstationary Poisson process.

This article is organized as follows. In section 2, we explain our model and formulate the maximum likelihood estimation as a variant of a semidefinite program. In section 3, we briefly explain SDP and introduce interior point methods to solve this problem. In section 4, the performance of our method is demonstrated through application to several instances. In section 5, we propose a mixture model in which the proposed density model is employed as a component. In section 6, we discuss possible applications of our approach to other problems, taking up as an example the estimation of the intensity function of a nonstationary Poisson process. Section 7 concludes the discussion.

2 Problem Formulation

Let $\{x_1, \ldots, x_N\}$ be data independently drawn from an unknown density g(x) over the support $S \subseteq \mathbb{R}$. Our problem is to estimate g(x) based on $\{x_1, \ldots, x_N\}$. If some prior information on g(x) is available, we can use an appropriate statistical model to estimate g(x). A flexible model is necessary to estimate g(x) when such prior information is not available. In this article, we develop a computational approach to estimate g(x) based on the following statistical model,

f(x; α, β) = p(x; α)K(x; β),
(2.1)
where p(x; α) is a univariate polynomial with parameter α and K (x; β) is a density function specified with parameter β over the support S. The function K (x; β) is referred to as a base density (function). We call α and β a polynomial parameter and a base density parameter, respectively. The polynomial p(x; α) is nonnegative over S. In the following, we consider the models where the base density is normal distribution, exponential distribution, and uniform distribution. These models will be referred to as the normal-based model, exponential-based model, and pure polynomial model, respectively.
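For a concrete feel for the normal-based model, here is a hand-built instance (the coefficients are ours, not an estimate): $p(x) = (1 + x^2)/2$ is nonnegative, and since $E[1 + X^2] = 2$ under N(0, 1), the product $f(x) = p(x)K(x)$ integrates to 1:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Base density K(x; beta) with beta = (mu, sigma)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def f(x):
    """Normal-based model f(x) = p(x) K(x) with p(x) = (1 + x^2) / 2."""
    return 0.5 * (1.0 + x * x) * normal_pdf(x)

# crude Riemann-sum check of the normalization constraint over [-10, 10]
step = 0.001
total = sum(f(-10.0 + i * step) for i in range(20001)) * step
assert abs(total - 1.0) < 1e-3
```

This is exactly the structure the estimation procedure searches over: the polynomial part shapes the density (here, fattening the tails and flattening the mode), while the base density fixes the support and overall scale.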
We associate a univariate polynomial of even degree with a matrix as follows. Given $x \in \mathbb{R}$, we define $x_d = (1, x, x^2, \ldots, x^{d-1})^T \in \mathbb{R}^d$. In the following, we drop the subscript d of $x_d$ when there is no ambiguity. A polynomial of degree $n\,(= 2d - 2)$ can be written as $x^T Q x$ by choosing an appropriate $Q \in \mathbb{R}^{d \times d}$. If a polynomial $q(x) = \sum_{i=0}^n q_i x^i$ is represented as $q(x) = x^T Q x$ with $Q \in \mathbb{R}^{d \times d}$, then we can recover the coefficient $q_k$ by $q_k = \mathrm{Tr}(E_k Q)$, where the (i, j)th element of $E_k$ is

$(E_k)_{ij} = 1$ if $i + j - 2 = k$ and $0$ otherwise, $\quad k = 1, \ldots, n$.

Let $q'(x) = \sum_{i=0}^{n-1} q_i' x^i$ be the derivative of $q(x)$. The coefficient $q_k'$ is represented as

$q_k' = (k + 1)\,\mathrm{Tr}(E_{k+1} Q), \quad k = 1, \ldots, n - 1$.

The main theorem used in this article together with SDP is the following.

Theorem 1 (e.g., Nesterov, 2000). Let p(x) be a univariate polynomial of degree n. Then:

i. $p(x) \ge 0$ over $S = (-\infty, \infty)$ iff $p(x) = x_{(n/2+1)}^T Q\, x_{(n/2+1)}$ holds for a symmetric positive semidefinite matrix $Q \in \mathbb{R}^{(n/2+1) \times (n/2+1)}$.

ii. $p(x) \ge 0$ over $S = [a, \infty)$ iff
(a) $p(x) = x_{((n+1)/2)}^T Q_1\, x_{((n+1)/2)} + (x - a)\, x_{((n-1)/2)}^T Q_2\, x_{((n-1)/2)}$ for symmetric positive semidefinite matrices $Q_1 \in \mathbb{R}^{((n+1)/2) \times ((n+1)/2)}$ and $Q_2 \in \mathbb{R}^{((n-1)/2) \times ((n-1)/2)}$ (the case when n is odd);
(b) $p(x) = x_{(n/2+1)}^T Q_1\, x_{(n/2+1)} + (x - a)\, x_{(n/2)}^T Q_2\, x_{(n/2)}$ for symmetric positive semidefinite matrices $Q_1 \in \mathbb{R}^{(n/2+1) \times (n/2+1)}$ and $Q_2 \in \mathbb{R}^{(n/2) \times (n/2)}$ (the case when n is even).

iii. $p(x) \ge 0$ over $S = [a, b]$ iff
(a) $p(x) = (x - a)\, x_{((n+1)/2)}^T Q_1\, x_{((n+1)/2)} + (b - x)\, x_{((n+1)/2)}^T Q_2\, x_{((n+1)/2)}$ (2.2) for symmetric positive semidefinite matrices $Q_1, Q_2 \in \mathbb{R}^{((n+1)/2) \times ((n+1)/2)}$ (the case when n is odd);
(b) $p(x) = x_{(n/2+1)}^T Q_1\, x_{(n/2+1)} + (b - x)(x - a)\, x_{(n/2)}^T Q_2\, x_{(n/2)}$ (2.3) for symmetric positive semidefinite matrices $Q_1 \in \mathbb{R}^{(n/2+1) \times (n/2+1)}$ and $Q_2 \in \mathbb{R}^{(n/2) \times (n/2)}$ (the case when n is even).
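A small sanity check of the representation, with a Q we picked by hand for $p(x) = (x^2 - 1)^2$ (helper names are ours): choosing $Q = vv^T$ makes Q positive semidefinite by construction, $x^T Q x$ reproduces p, and the coefficients come back via $q_k = \mathrm{Tr}(E_k Q)$:

```python
# p(x) = (x^2 - 1)^2, with x_vec = (1, x, x^2). Here Q = v v^T for
# v = (1, 0, -1), so Q is PSD by construction and
# x_vec^T Q x_vec = (v . x_vec)^2 = (1 - x^2)^2 >= 0.
v = [1.0, 0.0, -1.0]
Q = [[vi * vj for vj in v] for vi in v]

def p(x):
    xv = [1.0, x, x * x]
    return sum(Q[i][j] * xv[i] * xv[j] for i in range(3) for j in range(3))

assert abs(p(3.0) - (3.0 ** 2 - 1.0) ** 2) < 1e-9

# coefficient recovery q_k = Tr(E_k Q): with 0-indexed entries,
# (E_k)_{ij} = 1 iff i + j = k, so q_k is the k-th anti-diagonal sum of Q
def coeff(k, d=3):
    return sum(Q[i][k - i] for i in range(d) if 0 <= k - i < d)

assert [coeff(k) for k in range(5)] == [1.0, 0.0, -2.0, 0.0, 1.0]
```

The recovered coefficients $(1, 0, -2, 0, 1)$ are exactly those of $x^4 - 2x^2 + 1$, which is the expansion of $(x^2 - 1)^2$.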
T. Fushiki, S. Horiuchi, and T. Tsuchiya
The class of polynomials that admit a representation $\boldsymbol{x}^T Q \boldsymbol{x}$ with $Q \succeq 0$ and $\boldsymbol{x}$ being a vector (function) consisting of (possibly multivariate) monomials is referred to as the sum-of-squares polynomials (Nesterov, 2000; Waki, Kim, Kojima, & Muramatsu, 2005), since it exactly coincides with the polynomials representable as the sum of several squared polynomials.

To explain our approach, we focus on the normal-based model, where $S = (-\infty, \infty)$ and $K(x;\beta)$ is a normal distribution with parameter $(\mu, \sigma)$. We assume that the parameter $\beta = (\mu, \sigma)$ is given. Under this condition, as will be explained below, the maximum likelihood estimation is formulated as a tractable convex program that can be solved easily with the technique of SDP. Readers will readily see that the exponential-based model and the pure polynomial model can be treated in exactly the same manner. In the following, $C \succeq (\succ)\, 0$ means that a matrix $C$ is symmetric positive semidefinite (definite). The maximum likelihood estimate of our model is an optimal solution of the following problem:

$$\max_{\alpha,\beta} \; \sum_{i=1}^{N} \log p(x_i;\alpha) + \sum_{i=1}^{N} \log K(x_i;\beta)$$
$$\text{s.t.} \; \int p(x;\alpha)\,K(x;\beta)\,dx = 1, \qquad p(x;\alpha) \ge 0 \text{ for all } x. \quad (2.4)$$
We represent $p(x;\alpha)$ as $\boldsymbol{x}^T Q \boldsymbol{x}$ with some $Q \succeq 0$. The condition for $f(x;\alpha,\beta)$ to be a density function is written as

$$\int \boldsymbol{x}^T Q \boldsymbol{x}\, K(x;\beta)\,dx = 1, \qquad Q \succeq 0.$$

It is easy to see that this condition is written as the following linear equality constraint with a semidefinite constraint,

$$\mathrm{Tr}(M(\beta)Q) = 1, \qquad Q \succeq 0, \qquad \text{where } M(\beta) = \int \boldsymbol{x}\boldsymbol{x}^T K(x;\beta)\,dx.$$
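For the normal base density, the entries of $M(\beta)$ are the raw moments $E[X^{i+j-2}]$ of $N(\mu,\sigma^2)$, which obey the recursion $m_k = \mu\, m_{k-1} + (k-1)\sigma^2 m_{k-2}$. A sketch (function names are illustrative, not from the paper):

```python
# Sketch: closed-form moment matrix M(beta) for a normal base density N(mu, sigma^2).

def normal_moments(mu, sigma, n):
    """Raw moments m_0, ..., m_n of N(mu, sigma^2) via the standard recursion."""
    m = [1.0, mu]
    for k in range(2, n + 1):
        m.append(mu * m[k - 1] + (k - 1) * sigma ** 2 * m[k - 2])
    return m[: n + 1]

def moment_matrix(mu, sigma, d):
    """M with M_{ij} = E[X^{i+j-2}], i.e., the integral of x x^T K(x; beta) dx."""
    m = normal_moments(mu, sigma, 2 * d - 2)
    return [[m[i + j] for j in range(d)] for i in range(d)]

def trace_prod(A, B):
    d = len(A)
    return sum(A[i][j] * B[j][i] for i in range(d) for j in range(d))

# Check Tr(M Q) = E[p(X)] for p(x) = (1 + x^2)^2 under N(0, 1):
# E[1 + 2X^2 + X^4] = 1 + 2 + 3 = 6.
v = [1.0, 0.0, 1.0]
Q = [[vi * vj for vj in v] for vi in v]
M = moment_matrix(0.0, 1.0, 3)
print(trace_prod(M, Q))  # 6.0
```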
Note that $M(\beta)$ is a matrix that can be obtained in closed form as a function of $\beta$ when $K$ is a normal, exponential, or uniform distribution. On the other hand, the log likelihood of model 2.1 becomes

$$\sum_{i=1}^{N} \log f(x_i;\alpha,\beta) = \sum_{i=1}^{N} \log p(x_i;\alpha) + \sum_{i=1}^{N} \log K(x_i;\beta)$$
$$= \sum_{i=1}^{N} \log\!\left(\boldsymbol{x}^{(i)T} Q \boldsymbol{x}^{(i)}\right) + \sum_{i=1}^{N} \log K(x_i;\beta)$$
$$= \sum_{i=1}^{N} \log \mathrm{Tr}\!\left(\boldsymbol{x}^{(i)} \boldsymbol{x}^{(i)T} Q\right) + \sum_{i=1}^{N} \log K(x_i;\beta),$$

where $\boldsymbol{x}^{(i)} = (1, x_i, x_i^2, \ldots, x_i^{d-1})^T$. Note that the term inside each log is linear in $Q$. Therefore, the maximum likelihood estimation is formulated as follows:
$$\max_{Q,\beta} \; \sum_{i=1}^{N} \log \mathrm{Tr}(X^{(i)} Q) + \sum_{i=1}^{N} \log K(x_i;\beta)$$
$$\text{s.t.} \; \mathrm{Tr}(M(\beta)Q) = 1, \qquad Q \succeq 0, \quad (2.5)$$
where $X^{(i)} = \boldsymbol{x}^{(i)} \boldsymbol{x}^{(i)T}$ for $i = 1, \ldots, N$. If we fix $\beta$ and regard $Q$ as the decision variable, this problem is a convex program closely related to SDP and can be solved efficiently, in both theory and practice, with the interior point method (Ben-Tal & Nemirovski, 2001; Nesterov & Nemirovskii, 1994; Vandenberghe & Boyd, 1996; Wolkowicz et al., 2000). Let $g(\beta)$ be the optimal value of problem 2.5 when $\beta$ is fixed. Then we maximize $g(\beta)$ to obtain the maximum likelihood estimator. Since $\beta$ is typically one- or two-dimensional (e.g., location and scale parameters), maximization of $g$ can be done easily by grid search or nonlinear programming techniques (Fletcher, 1989; Nocedal & Wright, 1999).

In the following, we show that many properties of density functions, such as symmetry and monotonicity, can be expressed by adding linear equality constraints or semidefinite constraints to the above problem. Recall that we are dealing with the normal-based model with the base density parameter $(\mu, \sigma)$. For simplicity, we also assume that $\mu = 0$. If we consider a density function symmetric with respect to $x = 0$, we add the linear constraints $\mathrm{Tr}(E_i Q) = 0$ for all odd $i$ such that $1 \le i \le n$, because all coefficients of the odd degrees in the polynomial $p(x;\alpha)$ must be zero.

If we consider a density that is unimodal with its maximum at $x = \hat{x}$, the problem is formulated as follows. This condition is equivalent to the two conditions that $f(x)$ is monotone increasing on the interval $(-\infty, \hat{x}]$ and monotone decreasing on the interval $[\hat{x}, \infty)$. The first, monotone-increasing condition can be formulated as follows. Since

$$f'(x;\beta) = \left\{ p'(x) - \frac{x}{\sigma}\, p(x) \right\} K(x;\beta),$$

nonnegativity of $f'(x;\beta)$ on the interval $(-\infty, \hat{x}]$ is equivalent to nonnegativity of $\{p'(x) - x\,p(x)/\sigma\}$ on that interval. In view of the second statement of theorem 1, we introduce symmetric positive semidefinite matrices $Q_1 \in \mathbb{R}^{(n/2+1)\times(n/2+1)}$ and $Q_2 \in \mathbb{R}^{(n/2+1)\times(n/2+1)}$ to represent

$$p'(x) - \frac{x}{\sigma}\, p(x) = \boldsymbol{x}^T Q_1 \boldsymbol{x} - (x - \hat{x})\, \boldsymbol{x}^T Q_2 \boldsymbol{x}.$$

Note that the degree of $p(x)$ is always even. The formulation is completed by writing down the conditions that associate $Q$ with $Q_1$ and $Q_2$. This amounts to the following $n$ linear equality constraints,

$$(k+1)\,\mathrm{Tr}(E_{k+1} Q) - \frac{1}{\sigma}\,\mathrm{Tr}(E_{k-1} Q) = \mathrm{Tr}(E_k Q_1) - \mathrm{Tr}(E_{k-1} Q_2) + \hat{x}\,\mathrm{Tr}(E_k Q_2), \qquad k = 1, \ldots, n,$$

where $E_l = 0$ for $l = -1$ and $l = n+1$. The other, monotone-decreasing condition on the interval $[\hat{x}, \infty)$ can be treated in a similar manner. Thus, the approach is capable of handling various conditions on the density function in a flexible way by adding semidefinite and linear constraints to problem 2.5. From the optimization point of view, the problem to be solved remains in the same tractable class. This point will be explained in more detail in the next section.

We also point out that estimation of a density from a histogram based on maximum likelihood is formulated as a slight variation of problem 2.5, where each log term in the latter summation of the objective function is replaced by a weight proportional to the number of samples in a bin. This problem belongs to the same tractable variant of SDP (Tsuchiya & Xia, 2006).

Here we discuss a few possible extensions to multivariate density estimation problems. In the $k$-multivariate case, we replace $\boldsymbol{x} = (1, x, \ldots, x^{d-1})$ in the univariate case with a vector consisting of monomials in $k$ variables. For example, if $k = 2$ and we want to express a polynomial of (up to) the fourth order, we take $\boldsymbol{x} = (1, x, y, x^2, y^2, xy)$ and consider a polynomial of the form $\boldsymbol{x}^T Q \boldsymbol{x}$ with $Q \succeq 0$. As was remarked earlier, a multivariate polynomial that admits this type of representation is a member of the sum-of-squares polynomials. Unfortunately, a multivariate nonnegative polynomial does not always admit a sum-of-squares representation. Therefore, the sum-of-squares polynomials are just a subclass of the multivariate nonnegative polynomials; that is, there is a gap. In this respect, one may not be satisfied with this straightforward extension. However, the use of sum-of-squares polynomials to represent a density function would be justified as long as it serves as a flexible and useful model for expressing a density function.
Furthermore, the recent attempts to solve polynomial programming problems with the sum-of-squares relaxation suggest that the gap between nonnegative multivariate polynomials and the sum-of-squares polynomials is not large (see, e.g., Waki et al., 2005). The model has the feature that it admits a reasonable and efficient way of computing the maximum likelihood estimate based on SDP.

Another reasonable approach to multivariate cases is to apply our model in the context of ICA. We consider a multivariate density of the form $\prod_{i=1}^{k} f_i((A^{-1}x)_i)/\det(A)$, where $A$ is a $k \times k$ nonsingular mixing matrix and $f_i$ is a one-dimensional density. This is a parametric ICA model. We can use our density model for $f_i$ and apply our method to estimate $f_i$. It would be possible to apply the standard techniques in ICA for estimating $A$ (e.g., Hyvärinen, Karhunen, & Oja, 2001). In view of ICA, the one-dimensional densities $f_i$ could be estimated with nonparametric methods, but estimation of $A$ would be much easier if a reasonable parametric model is available. Incorporating our model into various machine learning models, such as this ICA model, is a topic for future research.
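To make the two-level scheme of this section concrete, here is a minimal pure-Python sketch for the normal-based model with $\beta = (\mu, \sigma)$ fixed. It is not the paper's interior point SDP solver: a crude random search over the factorization $Q = LL^T$ (which guarantees $Q \succeq 0$) stands in for the SDP step, and the normalization $\mathrm{Tr}(MQ) = 1$ is enforced by rescaling. All names are illustrative.

```python
# Sketch: maximize sum_i log(x_i^T Q x_i) over Q >= 0 with Tr(M Q) = 1,
# using Q = L L^T and rescaling; a stand-in for the SDP solver of the paper.
import math
import random

def normal_moments(mu, sigma, n):
    m = [1.0, mu]
    for k in range(2, n + 1):
        m.append(mu * m[k - 1] + (k - 1) * sigma ** 2 * m[k - 2])
    return m[: n + 1]

def gram(L):
    d = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(d)) for j in range(d)]
            for i in range(d)]

def trace_prod(A, B):
    d = len(A)
    return sum(A[i][j] * B[j][i] for i in range(d) for j in range(d))

def log_lik(L, data, M, d):
    """sum_i log(x_i^T Q x_i) after rescaling Q so that Tr(M Q) = 1."""
    Q = gram(L)
    s = trace_prod(M, Q)
    if s < 1e-12:            # degenerate candidate: reject
        return float("-inf"), Q
    Q = [[q / s for q in row] for row in Q]
    ll = 0.0
    for x in data:
        v = [x ** k for k in range(d)]
        px = sum(Q[i][j] * v[i] * v[j] for i in range(d) for j in range(d))
        ll += math.log(max(px, 1e-300))
    return ll, Q

def fit(data, mu, sigma, d, iters=3000, seed=0):
    rng = random.Random(seed)
    mom = normal_moments(mu, sigma, 2 * d - 2)
    M = [[mom[i + j] for j in range(d)] for i in range(d)]
    L = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    best, Q = log_lik(L, data, M, d)
    for _ in range(iters):   # hill climbing on single entries of L
        i, j = rng.randrange(d), rng.randrange(d)
        old = L[i][j]
        L[i][j] += rng.gauss(0.0, 0.1)
        val, cand = log_lik(L, data, M, d)
        if val > best:
            best, Q = val, cand
        else:
            L[i][j] = old
    return best, Q

data = [-1.1, -0.9, -1.3, -1.0, 0.9, 1.1, 1.0, 1.2]
ll, Q = fit(data, mu=0.0, sigma=1.0, d=3)
```

The outer loop over $\beta$ (a grid search over $\mu$ and $\sigma$, as described above) would simply call `fit` for each candidate $\beta$ and keep the best likelihood.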
3 Semidefinite Programming and Interior Point Methods

In this section, we introduce SDP and the interior point method for SDP and explain how the method can be used in the maximum likelihood estimation formulated in the previous section. SDP (Ben-Tal & Nemirovski, 2001; Nesterov & Nemirovskii, 1994; Vandenberghe & Boyd, 1996; Wolkowicz et al., 2000) is an extension of LP (linear programming) to the space of matrices, where a linear objective function is optimized over the intersection of an affine space and the cone of symmetric positive semidefinite matrices. SDP is a tractable class of convex programming and has a number of applications in combinatorial optimization, control theory, signal processing, statistics, machine learning, and structural design, among others (Ben-Tal & Nemirovski, 2001; Vandenberghe & Boyd, 1996; Wolkowicz et al., 2000). A nice property of SDP is duality. As will be shown later, the dual problem of a semidefinite program is another semidefinite program, and under mild assumptions, the two have the same optimal value. The original problem is referred to as the primal problem in relation to the dual problem.

The interior point method solves SDP by generating a sequence in the interior of the feasible region. There are two types of interior point methods: the primal interior point method and the primal-dual interior point method. The first generates iterates in the space of the primal problem, while the second generates iterates in the spaces of both the primal and dual problems. We illustrate the basic ideas of interior point methods with the primal method, since it is more intuitive and easier to understand. In the numerical experiments conducted in the later sections, we adopted the primal-dual method because it is more flexible and numerically stable. (See the appendix for the basic ideas and an outline of the primal-dual method we implemented.)
Let $A_{ij}$ ($i = 1, \ldots, m$ and $j = 1, \ldots, \bar{n}$) and $C_j$ ($j = 1, \ldots, \bar{n}$) be real symmetric matrices, where $A_{ij}$ and $C_j$ are of size $n_j \times n_j$. A standard form of SDP is the following optimization problem with respect to the $n_j \times n_j$ real symmetric matrices $X_j$, $j = 1, \ldots, \bar{n}$:

$$(P) \quad \min_{X} \; \sum_{j=1}^{\bar{n}} \mathrm{Tr}(C_j X_j), \qquad \text{s.t.} \; \sum_{j=1}^{\bar{n}} \mathrm{Tr}(A_{ij} X_j) = b_i, \; i = 1, \ldots, m, \qquad X_j \succeq 0, \; j = 1, \ldots, \bar{n}. \quad (3.1)$$
Here we denote $(X_1, \ldots, X_{\bar{n}})$ by $X$, and $X \succeq (\succ)\, 0$ means that each $X_j$ is symmetric positive semidefinite (definite). A feasible solution $X$ is called an interior feasible solution if $X \succ 0$ holds. Since the cone of positive semidefinite matrices is convex, SDP is a convex program. Although the problem is highly nonlinear, it can be solved efficiently, in both a theoretical and a practical sense, with the interior point method. The interior point method is a polynomial-time method for SDP, and in practice it can solve SDPs involving matrices of dimension several thousand. To date, the interior point method is the only practical method for SDP.

Let $\Lambda$ be a subset of $\{1, \ldots, \bar{n}\}$, and consider the following problem, where a convex function $-\sum_{j \in \Lambda} \log\det X_j$ is added to the objective function of problem 3.1:

$$(\tilde{P}) \quad \min_{X} \; \sum_{j} \mathrm{Tr}(C_j X_j) - \sum_{j \in \Lambda} \log\det X_j, \qquad \text{s.t.} \; \sum_{j} \mathrm{Tr}(A_{ij} X_j) = b_i, \; i = 1, \ldots, m, \qquad X_j \succeq 0, \; j = 1, \ldots, \bar{n}. \quad (3.2)$$
It is not difficult to see that the maximum likelihood estimation, problem 2.5, can be cast into this form as follows:

$$(\mathrm{ML}) \quad \min \; -\sum_{j=1}^{N} \log\det Y_j, \qquad \text{s.t.} \; Y_j - \mathrm{Tr}(X^{(j)} Q) = 0, \; Y_j \succeq 0, \; j = 1, \ldots, N, \qquad \mathrm{Tr}(MQ) = 1, \; Q \succeq 0, \quad (3.3)$$

where $Y_j$, $j = 1, \ldots, N$, are new one-by-one matrix variables introduced to convert problem 2.5 to the form of problem 3.2. Thus, there are $\bar{n} = N + 1$ decision variables, $Y_j$ ($j = 1, \ldots, N$) and $Q$, in this problem.
At a glance, problem 3.2 looks more difficult than problem 3.1 because of the additional convex term in the objective function; however, due to its special structure, we can solve problem 3.2 as efficiently as problem 3.1 by slightly modifying the interior point method for problem 3.1 without losing its advantages (Toh, 1999; Vandenberghe et al., 1998; Tsuchiya & Xia, 2006). In Vandenberghe et al. (1998), problem 3.2 is studied in detail from the viewpoint of applications and solution by the primal interior point method. For the time being, we continue explaining the interior point method for problem 3.1 to illustrate its main idea. Since a main difficulty of SDP comes from the highly nonlinear shape of the feasible region (even though it is convex), it is reasonable to provide machinery that keeps iterates away from the boundary of the feasible region in order to develop an efficient iterative method. For this purpose, the interior point method makes use of the logarithmic barrier function:

$$-\sum_{j=1}^{\bar{n}} \log\det X_j.$$
The logarithmic barrier function is a convex function whose value diverges as $X$ approaches the boundary of the feasible region, where one of the $X_j$ becomes singular. Incorporating this barrier function, let us consider the following optimization problem with a positive parameter $\nu$:

$$(P_\nu) \quad \min_{X} \; \sum_{j} \mathrm{Tr}(C_j X_j) - \nu \sum_{j} \log\det X_j, \qquad \text{s.t.} \; \sum_{j} \mathrm{Tr}(A_{ij} X_j) = b_i, \; i = 1, \ldots, m, \qquad X_j \succeq 0, \; j = 1, \ldots, \bar{n}, \quad (3.4)$$

where $\nu$ is referred to as the barrier parameter. Since the log barrier function is strictly convex, $(P_\nu)$ has a unique optimal solution. We denote by $\hat{X}(\nu) = (\hat{X}_1(\nu), \ldots, \hat{X}_{\bar{n}}(\nu))$ the optimal solution of $(P_\nu)$. By using the method of Lagrange multipliers, we see that $\hat{X}(\nu)$ is the unique symmetric positive definite matrix $X$ satisfying the following system of nonlinear equations in the unknowns $X$ and (the Lagrange multipliers) $y$:

$$\nu X_j^{-1} = C_j - \sum_{i} A_{ij} y_i, \; j = 1, \ldots, \bar{n}, \qquad \sum_{j} \mathrm{Tr}(A_{ij} X_j) = b_i, \; i = 1, \ldots, m, \qquad X_j \succ 0, \; j = 1, \ldots, \bar{n}. \quad (3.5)$$
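For intuition about the system 3.5, consider the special case where every block is 1×1, so the SDP reduces to an LP, $\min \sum_j c_j x_j$ subject to $\sum_j x_j = 1$, $x_j \ge 0$. The first equation of 3.5 becomes $\nu/x_j = c_j - y$, so $x_j = \nu/(c_j - y)$, and the multiplier $y$ is fixed by the single equation $\sum_j \nu/(c_j - y) = 1$. A toy sketch (illustrative, not the paper's implementation) solves this by bisection and follows the path as $\nu \to 0$:

```python
# Sketch: points on the central path of min c^T x s.t. sum(x) = 1, x >= 0,
# obtained from nu / x_j = c_j - y (the 1x1-block specialization of system 3.5).

def central_point(c, nu):
    """Solve sum_j nu/(c_j - y) = 1 for y < min(c) by bisection; return x."""
    lo, hi = min(c) - 1e6, min(c) - 1e-12
    for _ in range(200):
        y = 0.5 * (lo + hi)
        if sum(nu / (cj - y) for cj in c) > 1.0:
            hi = y   # components too large: move y further below min(c)
        else:
            lo = y
    y = 0.5 * (lo + hi)
    return [nu / (cj - y) for cj in c]

for nu in (1.0, 1e-2, 1e-4, 1e-8):
    x = central_point([1.0, 2.0], nu)
    # x stays strictly positive, sums to ~1, and approaches the vertex (1, 0)
```

As $\nu$ decreases, the iterates trace exactly the kind of smooth path described next.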
The set

$$\mathcal{C}_P \equiv \{\hat{X}(\nu) : 0 < \nu < \infty\}$$

defines a smooth path that approaches an optimal solution of $(P)$ as $\nu \to 0$. This path is called the central trajectory of problem 3.1. The main idea of the interior point method is to solve the SDP by tracing the central trajectory with the following procedure. Starting from a point close to the central trajectory $\mathcal{C}_P$, we solve problem 3.1 by repeated application of the Newton method to problem 3.4, reducing $\nu$ gradually to zero. A relevant part of this method is solving problem 3.4 for each fixed $\nu$, where the Newton method is applied. The Newton method is basically a method for unconstrained optimization, while the problem contains the nontrivial constraints $X \succeq 0$. However, there is no difficulty in applying the Newton method here, because the problem is a minimization problem and the term $-\sum_j \log\det X_j$ diverges whenever $X$ approaches the boundary of the feasible region. Therefore, the Newton method is not bothered by the constraint $X \succeq 0$.

Another important issue is initialization: we need an interior feasible solution to start. Usually this is resolved by a so-called two-phase method or big-M method, which are analogues of techniques developed in classical linear programming. In the primal-dual method we introduce later, this point is resolved in a more elegant manner.

Now we extend the idea of the interior point method to solve problem 3.2. We consider the following problem with a positive parameter $\eta$:

$$(\tilde{P}_\eta) \quad \min_{X} \; \sum_{j} \mathrm{Tr}(C_j X_j) - \sum_{j \in \Lambda} \log\det X_j - \eta \sum_{j \notin \Lambda} \log\det X_j, \qquad \text{s.t.} \; \sum_{j} \mathrm{Tr}(A_{ij} X_j) = b_i, \; i = 1, \ldots, m, \qquad X_j \succeq 0, \; j = 1, \ldots, \bar{n}. \quad (3.6)$$
We denote by $\tilde{X}(\eta)$ the optimal solution of problem 3.6 and define the central trajectory for problem 3.6 as

$$\mathcal{D}_P \equiv \{\tilde{X}(\eta) : 0 < \eta < \infty\}.$$

Note that $\tilde{X}(\eta)$ approaches the optimal set of equation 3.2 as $\eta \to 0$. Observe that the central trajectories $\mathcal{C}_P$ and $\mathcal{D}_P$ intersect at $\nu = 1$ and $\eta = 1$; that is, $\hat{X}(1) = \tilde{X}(1)$. Therefore, we consider an interior point method for problem 3.2 consisting of two stages. We first obtain a point $X^*$ close to $\hat{X}(1)$ in stage 1 with the ordinary interior point method. In stage 2, starting from $X^*$, a point close to the central trajectory $\mathcal{D}_P$ for problem 3.2,
we solve problem 3.6 with the Newton method, repeatedly reducing $\eta$ gradually to zero. As was mentioned at the beginning of this section, this idea is further incorporated into the primal-dual interior point method and adopted in our implementation. See the appendix for details.

4 Numerical Results

4.1 Outline. We conducted numerical experiments of our method with the following five models:

i. Normal-based model
ii. Exponential-based model
iii. Pure polynomial model, where the density function and its first derivative are assumed to be zero on the boundary of the support
iv. Normal-based model with the additional condition that the estimated density function is unimodal
v. Exponential-based model with the additional condition that the estimated density function is monotone decreasing

All the results on simulated data are compared with the results obtained by kernel density estimation, where the bandwidth is determined by the procedure bw.ucv (unbiased cross-validation) in the statistical software R. Some of the data sets used in this experiment are generated by simulation from assumed distributions, and others are taken from real data sets that have often been used for benchmarking. The algorithms are coded in Matlab and C, and all the numerical experiments are conducted under a Matlab 6.5 environment with the Windows OS. We used several platforms, but a typical one is a Pentium IV 2.4 GHz with 1 GB memory. The code runs without trouble on a notebook computer equipped with a Pentium III 650 MHz CPU and 256 MB memory.

The maximum likelihood estimation is computed in two steps: we optimize the polynomial parameter α with SDP, and at the higher level we optimize the parameter β of the base density. We have β = (µ, σ) for i, β = λ for ii, and β = [a_min, a_max] for iii; in iv, we have β = (µ, σ, γ), where γ denotes the peak of the distribution. Finally, in v, we have β = λ.
Assuming that α is optimized with SDP, we just need to solve at most three-dimensional optimization problems (basically one- or two-dimensional) to accomplish global optimization of the likelihood function. Depending on the level of difficulty of the SDP to be solved, we employed nonlinear optimization (for i, ii, and v) and grid search (for i, ii, iii, and iv) for the optimization of β. We provided two versions of the primal-dual interior point method: the basic algorithm and the predictor-corrector algorithm. The former is faster, but the latter is more sophisticated and robust. (See the appendix and
Fushiki, Horiuchi, & Tsuchiya, 2003, for an explanation of the two algorithms.) Generally, we observed that the SDPs for i, ii, and v are fairly easy, while the ones for iii and iv are more difficult. The basic algorithm is robust and stable enough to solve i, ii, and v without trouble, but it got into trouble when we tried to solve iii and iv. In those cases, we needed to use the more sophisticated predictor-corrector algorithm. The typical number of iterations of the basic algorithm is around 50, and that of the predictor-corrector algorithm is between 100 and 200. Concerning the higher-level optimization to determine β, we adopted nonlinear optimization and grid search procedures.

We compare the models with AIC (Akaike, 1973, 1974). Here we note that the number of parameters should be reduced by one for each addition of one equality constraint. Therefore, if we define AIC = −(log likelihood) + k, where k denotes the number of parameters, k becomes as follows for the five cases i to v:

i. Normal-based model: k = n + 2 (dim(α) = n + 1, dim(β) = 2, number of linear equalities = 1)
ii. Exponential-based model: k = n + 1 (dim(α) = n + 1, dim(β) = 1, number of linear equalities = 1)
iii. Pure polynomial model: k = n − 2 (dim(α) = n + 1, dim(β) = 2, number of linear equalities = 5)
iv. Normal-based model with unimodality: k = n + 2 (dim(α) = n + 1, dim(β) = 3, number of linear equalities = 2)
v. Exponential-based model with monotonicity: k = n + 1 (dim(α) = n + 1, dim(β) = 1, number of linear equalities = 1)
Here n is the degree of the polynomial p(x; α) in the model. We define AIC as one-half of the usual definition of AIC. All the models contain the linear equality constraint that the integral of the estimated density over the support is one. The pure polynomial model iii contains the additional equality constraints that the values of the density function and its derivative at both ends of the support are zero. In iv and v, we did not add any "penalty term" for unimodality or monotonicity. In iv, we introduce a new parameter to specify the peak of the density, but we also have an additional linear equality constraint that the derivative of the density is zero at the peak. Therefore, the penalty term is the same as in i.
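The parameter counts above all follow one rule, k = dim(α) + dim(β) − (number of linear equalities). A quick check (illustrative, with n = 6 as an example degree):

```python
# Sketch: verify the tabulated AIC penalties k for models i-v.
def k_params(dim_alpha, dim_beta, n_eq):
    return dim_alpha + dim_beta - n_eq

n = 6
assert k_params(n + 1, 2, 1) == n + 2   # i.   normal-based
assert k_params(n + 1, 1, 1) == n + 1   # ii.  exponential-based
assert k_params(n + 1, 2, 5) == n - 2   # iii. pure polynomial
assert k_params(n + 1, 3, 2) == n + 2   # iv.  normal-based, unimodal
assert k_params(n + 1, 1, 1) == n + 1   # v.   exponential-based, monotone
print("all model penalties consistent")
```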
The data analyzed here are as follows:

i. Normal-based model
  a. Simulated data 1, generated from a bimodal normal mixture distribution (N = 100)
  b. Simulated data 2, generated from an asymmetric unimodal normal mixture distribution (N = 250)
  c. Buffalo snowfall data (N = 62) (Carmichael, 1976; Parzen, 1979; Scott, 1992)
  d. Old Faithful geyser data (N = 107) (Weisberg, 1985; Silverman, 1986)
ii. Exponential-based model
  a. Simulated data 3, generated from a mixture of an exponential distribution and a gamma distribution (N = 200)
  b. Coal-mining disasters data (N = 109) (Cox & Lewis, 1966)
iii. Pure polynomial model
  a. Old Faithful geyser data
  b. Galaxy data (N = 82) (Roeder, 1990)
iv. Normal-based model with unimodality condition
  a. The normal mixture distribution data set treated in i-b
v. Exponential-based model with monotonicity condition
  a. The coal-mining disasters data treated in ii-b

4.2 Normal-Based Model. In this section, we show the results of the density estimation with the normal-based model.

4.2.1 Simulated Data 1: Bimodal Normal Mixture Distribution. We generated 200 observations from a bimodal normal mixture distribution:
$$\frac{0.3}{\sqrt{2\pi\, 0.5^2}} \exp\!\left(-\frac{(x+1)^2}{2 \cdot 0.5^2}\right) + \frac{0.7}{\sqrt{2\pi\, 0.5^2}} \exp\!\left(-\frac{(x-1)^2}{2 \cdot 0.5^2}\right).$$

Figures 1a to 1c show the estimated densities for n = 2, 4, 6. The MAIC procedure picks the model of degree 4 as the best model. The estimated density (solid line) is close to the true density (broken line). In Figure 1d, the kernel density estimation result is shown with the broken line. In this case, the result with kernel density estimation also appears close to the true density.

In the following, we adopt the Kullback-Leibler divergence as the loss function to measure the goodness of fit of the estimated densities. The Kullback-Leibler divergence between the true density and the best model (the degree-4 polynomial) is 9.1693 × 10−3, whereas the Kullback-Leibler divergence between the true density and the kernel density estimate is 1.8359 × 10−2. Thus, the estimate with our model is closer to the true density in terms of the Kullback-Leibler divergence in this case.
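Goodness-of-fit figures like these can be reproduced numerically once an estimate is in hand. A sketch of KL(f‖g) = ∫ f log(f/g) dx by the trapezoidal rule; the "rough" estimate g here is a hypothetical stand-in, not an actual fit from the paper:

```python
# Sketch: numerical Kullback-Leibler divergence between two univariate densities.
import math

def normal_pdf(x, mu, s):
    return math.exp(-(x - mu) ** 2 / (2 * s * s)) / math.sqrt(2 * math.pi * s * s)

def kl(f, g, lo, hi, m=4000):
    """Trapezoidal approximation of KL(f || g) over [lo, hi]."""
    h = (hi - lo) / m
    total = 0.0
    for i in range(m + 1):
        x = lo + i * h
        fx = f(x)
        if fx > 0:
            w = 0.5 if i in (0, m) else 1.0
            total += w * fx * math.log(fx / g(x))
    return total * h

true_f = lambda x: 0.3 * normal_pdf(x, -1, 0.5) + 0.7 * normal_pdf(x, 1, 0.5)
rough_g = lambda x: normal_pdf(x, 0.4, 1.1)   # hypothetical crude estimate
print(kl(true_f, rough_g, -6, 6))             # positive; smaller means closer
```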
[Figure 1 panels: (a) n = 2 (AIC = 248.19); (b) n = 4 (AIC = 247.26); (c) n = 6 (AIC = 247.72); (d) kernel method.]
Figure 1: Estimated density function from the simulated data 1 with different degrees of polynomials (normal-based model) and the kernel method. Each data point is shown by a circle; the solid line is the estimated density with our method in a–c and with the kernel method in d. The broken line is the true density. The bars are the histogram density estimation (the number of the bins is 20).
4.2.2 Simulated Data 2: Asymmetric Unimodal Normal Mixture Distribution. Here we generated a simulated data set of 250 observations from a distribution proportional to

$$\exp\!\left(-\frac{x^2}{2}\right) + 5 \exp\!\left(-\frac{(x-1)^2}{0.2}\right) + 3 \exp\!\left(-\frac{(x-1)^2}{0.5}\right).$$
This is an asymmetric distribution with a sharp peak around x = 1. The estimated density by MAIC procedure is shown in Figure 2a. The MAIC procedure chooses the model with degree 8. The values of log likelihood (LL) and AIC are LL = −215.1 and AIC = 223.1. The Kullback-Leibler divergence between the true density and the estimated density is 3.9499 × 10−2 . For comparison, we show the estimated density with the kernel method
Figure 2: Estimated density function from the simulated data 2 (normal-based model). Each data point is shown by a circle. The solid line is the estimated density with our method in a and with the kernel method in b. The broken line is the true density; the bars are the histogram density estimation (the number of the bins is 20).
in Figure 2b. The Kullback-Leibler divergence between the true density and the estimated density is 6.5548 × 10−2 . Both estimated densities have bumps on the left-hand tail. Thus, we see that the estimation of the density function is more difficult on the left-hand tail as long as we estimate the
distribution just from the data. Later we will show how the estimation is stabilized if we assume prior knowledge of unimodality of the distribution.

4.2.3 Buffalo Snowfall Data. This is the set of 63 values of annual snowfall in Buffalo, New York, from 1910 to 1972, in inches (Carmichael, 1976; Parzen, 1979; Scott, 1992). In Figures 3a to 3c, we show profiles of the distribution obtained with the maximum likelihood estimate as the degree of the polynomial is decreased or increased. The MAIC procedure chooses the model of degree 6 and seems to give a reasonable result. For comparison, we also plot the density estimated by the kernel method with a broken line.

4.2.4 Old Faithful Geyser Data. This set contains durations, in minutes, of 107 nearly consecutive eruptions of the Old Faithful geyser (Weisberg, 1985; Silverman, 1986). The estimated density is shown in Figure 4, where n = 10, LL = −105.6, and AIC = 117.6. It has long tails at both ends, reflecting the nature of the normal distribution; it seems that the tails of the distribution should be shorter. We also plot the density estimated by the kernel method with a broken line. Later we apply the pure polynomial model, which seems to give a better fit with shorter tails.

4.3 Exponential-Based Model. In this section, we show the results of the density estimation with the exponential-based model.

4.3.1 Simulated Data 3: Mixture of an Exponential Distribution and a Gamma Distribution. Here we generated simulated data of 200 observations from a mixture of an exponential distribution and a gamma distribution with shape parameter 4:

$$0.2 \cdot 2\exp(-2x) + 0.8 \cdot \frac{x^3}{3!}\exp(-x).$$

The MAIC procedure picks the model with degree 2. As is seen in Figure 5a, the estimated distribution obtained by the MAIC procedure recovers the original distribution fairly well. For comparison, the density estimation result based on the kernel method is also plotted, in Figure 5b; the estimated density is truncated for negative values. The density estimated with our method is closer to the true density. Indeed, the Kullback-Leibler divergence between the true density and the density estimated with our method is 8.6162 × 10−3, whereas the Kullback-Leibler divergence between the true density and the density estimated with the kernel method is 2.1609 × 10−1.
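Data like simulated data 3 are straightforward to draw: the mixture selects the exponential component with probability 0.2 and the gamma component with probability 0.8, and a gamma variate with integer shape 4 and rate 1 is the sum of four independent Exp(1) draws. A sketch (not the paper's code):

```python
# Sketch: sampling from 0.2 * Exp(rate=2) + 0.8 * Gamma(shape=4, rate=1).
import random

def sample_mixture(rng):
    if rng.random() < 0.2:
        return rng.expovariate(2.0)
    # Gamma(shape=4, rate=1) as a sum of 4 independent Exp(1) draws.
    return sum(rng.expovariate(1.0) for _ in range(4))

rng = random.Random(42)
data = [sample_mixture(rng) for _ in range(200)]
print(min(data) > 0, len(data))  # True 200
```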
[Figure 3 panels: (a) n = 4 (AIC = 291.88); (b) n = 6 (AIC = 289.99); (c) n = 8 (AIC = 291.58).]

Figure 3: Estimated density function from the Buffalo snowfall data with different degrees of polynomials (normal-based model). Each data point is shown by a circle. The solid line is the estimated density with our method. The broken line is the estimated density with the kernel method. The bars are the histogram density estimation (the number of the bins is 20).
Figure 4: Estimated density function from the Old Faithful geyser data (normalbased model). Each data point is shown by a circle. The bars are the histogram density estimation (the number of the bins is 20). The solid line is the estimated density with our method, and the broken line is the estimated density with the kernel method.
4.3.2 Coal-Mining Disasters Data. The coal-mining disasters in Great Britain from 1875 to 1951 are reported in days (Cox & Lewis, 1966). (See Figure 12 for the original sequence of disasters, where each accident is drawn as a circle on the horizontal axis.) Here we model the data as a renewal process to estimate the interval density function with the exponential-based model. The MAIC procedure picks the model with degree 4, where LL = −699.0 and AIC = 704.0. The estimated distribution is shown in Figure 6. The distribution is considerably different from the exponential distribution. The estimated density seems to consist of three slopes, with a small bump around x = 1200. We also plot the density estimated with the kernel method with a broken line for comparison in the same figure. Later we show how the estimated density changes if the density function is assumed to be monotonically decreasing.

4.4 Pure Polynomial Model. In this section, we show the profiles of the density functions estimated with the pure polynomial model.

4.4.1 Galaxy Data. These data are measurements of the speeds of galaxies in a region of the universe (Roeder, 1990). We applied the pure polynomial model to estimate the density from the galaxy data, since the model with the normal distribution base did not fit well. The MAIC procedure chooses the model with n = 13, where LL = 609.8 and AIC = −598.89.
[Figure 5 panels: (a) estimated density with our method; (b) estimated density with the kernel method.]

Figure 5: Estimated density function from the simulated data 3 (exponential-based model). Each data point is shown by a circle. The solid line is the estimated density (our method in a and the kernel method in b). The broken line is the true density. The bars are the histogram density estimation (the number of the bins is 30).
As is seen from Figure 7 (solid line), the model suggests that there are three clusters in distribution, which seems to be a bit too simple. Observe that the estimated density takes a value close to zero in the interval between 10,000 and 20,000, reflecting the fact that there are no data. This
Figure 6: Estimated density function from the coal-mining disasters data (exponential-based model). Each data point is shown by a circle. The bars are the histogram density estimation (the number of the bins is 30). The solid line is the estimated density with our method, and the broken line is the estimated density with the kernel method.
seems to impose a strong constraint on the shape of the estimated density. We plot the density estimated with the kernel method with a broken line for comparison in the same figure. We return to this instance in section 5, where a better fit is obtained in combination with a mixture approach.

4.4.2 Old Faithful Geyser Data. We applied the pure polynomial model to estimate the density function of the Old Faithful geyser data. The MAIC procedure chooses the model with degree 14, where LL = −96.4 and AIC = 108.4. The estimated density is shown in Figure 8 (solid line). This model fits much better than the normal-based model in terms of both likelihood and AIC. The model captures the structure of the distribution well, suggesting that the distribution has a very short tail and that the left peak is higher than the right one. We plot the density estimated with the kernel method with a broken line for comparison in the same figure.

4.5 Normal-Based Model with Unimodality Condition. As demonstrated so far, our approach gives reasonable estimates in many cases. When we analyze real data, we sometimes have prior knowledge about the distribution, such as symmetry or unimodality. It is difficult to incorporate
Figure 7: Estimated density function from the galaxy data (pure polynomial model). Each data point is shown by a circle. The solid line is the estimated density with our method. The broken line is the estimated density with the kernel method. The bars are the histogram density estimation (the number of the bins is 20).
Figure 8: Estimated density function from the Old Faithful geyser data (pure polynomial model). Each data point is shown by a circle. The solid line is the estimated density with our method. The broken line is the estimated density with the kernel method. The bars are the histogram density estimation (the number of the bins is 20).
Figure 9: Estimated density function from the simulated data 2 (normal-based model with unimodality constraint). Each data point is shown by a circle. The solid line is the estimated density. The broken line is the true density. The bars are the histogram density estimation (the number of the bins is 20).
such prior knowledge into the model in nonparametric approaches. In the following, we take simulated data 2 as an example and compute the maximum likelihood estimate with the unimodality condition in our approach. Specifically, we observe how the addition of the unimodality condition changes the estimated density function. The distribution obtained by the MAIC procedure with the unimodality constraint added is shown in Figure 9. LL and AIC are −216.4 and 224.5, respectively. We see that the bumps in the estimated density without the unimodality condition disappeared, and the shape of the estimated density looks much closer to the true one. LL and AIC get worse by about 1.5, where no “penalty” is added to AIC for the unimodality constraint. In our model with the unimodality condition, the estimated density is unimodal and much closer to the true density in shape. The Kullback-Leibler divergence between the true density and the estimated density is 1.8422 × 10−2. This is the best among the three methods we tested (see section 4.2).

4.6 Exponential-Based Model with Monotonicity Condition. In section 4.3, we estimated the interval density function of the coal mining disasters data and observed that the density function with the best AIC value has a bump around x = 1200. Here we estimate the density based on the same data with the exponential-based model with the monotonicity condition. In Figure 10 (solid line), we show the best model we found, where
Figure 10: Estimated density function from the coal mining disasters data (exponential-based model with monotonicity). Each data point is shown by a circle. The solid line is the estimated density. The bars are the histogram density estimation (the number of the bins is 30).
the degree of the polynomial is 4, LL = −699.57, and AIC = 704.57. We did not impose any “penalty” for monotonicity in calculating AIC. These values of LL and AIC are almost the same as the ones we obtained in section 4.2. We plot the estimated density with the kernel method with a broken line for comparison in the same figure.

4.7 Discussion. We discuss a few issues before concluding this section.

4.7.1 Comparison with the Kernel Method. In this section, we compared the estimated densities obtained with our method and the ones obtained with the kernel method. Table 1 summarizes the Kullback-Leibler divergence between the true density and the estimated densities for both methods (only for the cases where the true densities are known). In all three instances, the estimated densities obtained with our method turned out to be closer to the true densities than those obtained with the kernel method. As to the profiles of the estimated densities, those obtained with our method tend to be smoother, whereas those obtained with the kernel method tend to be shakier.

4.7.2 Computational Time. In Table 2, we summarize timing data for optimizing the polynomial part for a fixed base density for several typical
Table 1: Kullback-Leibler Divergence Between True Densities and Estimated Densities: Summary.

Data Set                                       SDP              Kernel           SDP (with Unimodality)
Simulated data 1 (bimodal gaussian mixture)    9.1693 × 10−3*   1.8359 × 10−2    —
Simulated data 2 (unimodal gaussian mixture)   3.9499 × 10−2    6.5548 × 10−2    1.8422 × 10−2*
Simulated data 3 (exponential + gamma)         8.6162 × 10−3*   2.1609 × 10−1    —

*The best model in terms of the Kullback-Leibler divergence.
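The divergences reported in Table 1 compare estimated densities against known true densities. A minimal sketch of how such a Kullback-Leibler divergence can be approximated numerically (trapezoidal quadrature on a grid; `kl_divergence` is a hypothetical helper, not the authors' code):

```python
import math

def kl_divergence(p, q, xs):
    """Trapezoidal approximation of KL(p || q) = integral of
    p(x) * log(p(x) / q(x)) over a grid xs covering the support,
    where p and q are density functions, both positive on xs."""
    f = [p(x) * math.log(p(x) / q(x)) for x in xs]
    return sum((f[i] + f[i + 1]) / 2.0 * (xs[i + 1] - xs[i])
               for i in range(len(f) - 1))
```

For two unit-variance normals with means 0 and 1, the exact divergence is 1/2, which the quadrature reproduces to grid accuracy.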
Table 2: Timing Data.

Instance             Model Type        Number of Samples   Time (sec)   Number of Iterations   Algorithm
Buffalo snowfall     Normal            63                  33           34                     Basic
Coal mining          Exponential       109                 61           29                     Basic
Simulated data 1     Normal            200                 202          33                     Basic
Old Faithful geyser  Pure polynomial   107                 623          162                    Predictor-corrector
instances. For these measurements, we ran our code in Matlab 7.1; all other computational environments are the same as in the previous experiments in this section. In summary, if the number of data points is up to 100 and the normal-based or exponential-based model is used, optimization finishes in about 1 minute. If the number is 200, it requires around 3 minutes. Generally, the pure polynomial model tends to be more difficult than the other two models. Since the code used in this experiment is based on a general-purpose code for SDP, it is possible to speed up the code considerably by taking into account the special structure of our problem. As observed above, computational time increases as the number of data increases. This is inevitable due to the nature of the SDP algorithm. One reasonable option for dealing with large data is to estimate the density from the histogram, as was mentioned in section 2. In this extension, the number of bins corresponds to the number of data here. The number of bins should be as large as possible within the range for which the associated SDP problem can be solved in a reasonable time. Recently, well-known SDP software such as SDPT3 (Tütüncü, Toh, & Todd, 2003) started supporting the solution of the variant of SDP problems
we deal with in this article. The use of such SDP software could be a reasonable option for readers to implement the density estimation method proposed here.

5 Combination with a Finite Mixture Model

When the data set has a long interval without any data, the estimated polynomial with our method tends to take values close to zero in that interval in order to achieve a better likelihood value. This implicitly imposes a strong restriction on the possible shape of the density. We encountered such a situation in analyzing the galaxy data in section 4.4. In order to construct a more flexible model in such a situation, here we consider a model where the density function is represented as a finite mixture of several normal-based models. We assume that the data set has one or more long intervals without data. Therefore, the data set may well be grouped into several clusters easily. In the case when clustering is difficult, the original model, model 2.1, should work fine, since the data set does not have a sparse interval. Let M be the number of clusters, and suppose that the density is represented as

\[
f(x; \theta, w) = \sum_{i=1}^{M} w_i f_i(x; \theta_i), \qquad \sum_{i=1}^{M} w_i = 1, \quad w_i \ge 0 \; (i = 1, \ldots, M),
\tag{5.1}
\]
where f_i is the density function of a normal-based model with θ_i as the parameter, θ = (θ_1, ..., θ_M), and w = (w_1, ..., w_M). Let I_j, j = 1, ..., M, be the index set of the data points belonging to cluster j, and let D_j = {x_k : k ∈ I_j}; D_j is the set of data assigned to the jth cluster, which is associated with the density function f_j(·; θ_j). If we assume that f_j is almost zero at those data points that do not belong to D_j, the log likelihood of equation 5.1 is simply written as

\[
\log f(x; \theta, w) \approx \sum_{i=1}^{M} \sum_{j \in I_i} \log f_i(x_j; \theta_i) + \sum_{i=1}^{M} |D_i| \log w_i.
\tag{5.2}
\]
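The decomposition in equation 5.2 means that each component can be fit on its own cluster, with the weights estimated as w_i = |D_i|/N. A minimal sketch of this idea, using plain Gaussian components as stand-ins for the paper's normal-based (polynomial × normal) models; all function names here are hypothetical:

```python
import math

def fit_normal(data):
    """ML fit of a plain Gaussian component (a stand-in for the
    normal-based models of the paper)."""
    m = sum(data) / len(data)
    v = sum((x - m) ** 2 for x in data) / len(data)
    return (m, v)

def normal_loglik(data, theta):
    """Gaussian log likelihood of `data` under parameters theta = (mean, var)."""
    m, v = theta
    return sum(-0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
               for x in data)

def mixture_loglik_decomposed(clusters):
    """Right-hand side of equation 5.2: per-cluster component fits
    plus the |D_i| * log(w_i) terms, with w_i = |D_i| / N."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for data in clusters:
        theta = fit_normal(data)
        total += normal_loglik(data, theta) + len(data) * math.log(len(data) / n)
    return total
```

When the clusters are well separated, this decomposed value agrees with the exact mixture log likelihood essentially to machine precision, which is the assumption underlying equation 5.2.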
From equation 5.2, it is easy to see that the maximum likelihood estimation reduces to the maximum likelihood estimation of the model f_i based on the data set D_i. Furthermore, minimization of AIC for model 5.1 can be done by minimizing AIC in the estimation of each component f_i. We implemented this procedure and applied it to the galaxy data. The best model we found in terms of AIC consists of four components, as shown in Figure 11a, where LL = 623.029, AIC = −606.029, and the number of parameters is 17. The second-best model has three components and is slightly worse in terms of AIC (LL = 623.952, AIC = −605.952; the
(a) Four components

(b) Three components

Figure 11: Estimated density function from the galaxy data (mixture normal-based model). Each data point is shown by a circle. The solid line is the estimated density with our method. The broken line is the estimated density with the kernel method. The bars are the histogram density estimation (the number of the bins is 20).
number of parameters is 18). The profile is shown in Figure 11b. In the latter model, the two middle clusters of the former are treated as one cluster. We checked that the assumption on the behavior of f_j in the other clusters D_i (i ≠ j) is satisfied by the obtained density function. The estimated density is better than the one obtained in the previous section in terms of AIC, and its shape is more similar to those obtained in other work (e.g., Roeder, 1990).

6 Other Applications

The approach proposed here can be applied to other areas of statistics such as point processes and survival data analysis. In order to clarify this point, here we take the estimation of the intensity function of a nonstationary Poisson process as an example. Suppose we have a nonstationary Poisson process whose intensity function is given by λ(t), and let t_1, ..., t_N = T be the sequence of times when events were observed. We estimate λ(t) as a nonnegative polynomial function on the interval (0, T]. The log likelihood is

\[
\sum_{i=1}^{N} \log \lambda(t_i) - \int_0^T \lambda(t)\,dt.
\]
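For a polynomial intensity, both terms of this log likelihood are available in closed form, since the integral of a polynomial is again a polynomial. A minimal sketch (`poisson_loglik` is a hypothetical helper, not the authors' SDP-constrained estimator; nonnegativity of λ is only spot-checked at the event times here, whereas the paper enforces it exactly via SDP):

```python
import math

def poisson_loglik(coeffs, times, T):
    """Log likelihood of a nonstationary Poisson process on (0, T] with
    polynomial intensity lambda(t) = sum_k coeffs[k] * t**k:
    sum_i log lambda(t_i) - integral_0^T lambda(t) dt (integral in
    closed form via the polynomial antiderivative)."""
    lam = lambda t: sum(c * t ** k for k, c in enumerate(coeffs))
    if any(lam(t) <= 0 for t in times):
        return float("-inf")  # intensity must be positive at the events
    integral = sum(c * T ** (k + 1) / (k + 1) for k, c in enumerate(coeffs))
    return sum(math.log(lam(t)) for t in times) - integral
```

For a constant intensity λ(t) = 2 on (0, 1] with two events, this gives 2 log 2 − 2, matching the homogeneous-case formula.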
If T is fixed, then we can apply exactly the same technique as the density estimation developed in this article. If we represent λ(t) as in equation 2.2 or 2.3 of the theorem, the term $\int_0^T \lambda(t)\,dt$ is represented as

\[
\int_0^T \lambda(t)\,dt = \mathrm{Tr}(M_1 Q) + \mathrm{Tr}(M_2 Q_1),
\]

where M_1 and M_2 are appropriate symmetric matrices. Therefore, the maximum likelihood estimation is formulated as the following problem:
where M1 and M2 are appropriate symmetric matrices. Therefore, the maximum likelihood estimation is formulated as the following problem: max
N
(i) (i) log Tr(X1 Q) + Tr(X2 Q1 ) − Tr(M1 Q) + Tr(M2 Q1 ) ,
i=1
s.t. Q 0, Q1 0, (i)
(i)
where X1 and X2 , (i = 1, . . . , N) are matrices determined from the data. Thus, the problem just becomes problem 3.2 in this case. In Figure 12, we
Figure 12: Estimated intensity function from the coal mining disasters data. Each disaster is shown by a circle.
show the estimated intensity function λ(t) for the coal mining data obtained with the MAIC procedure. The procedure picks the polynomial of degree 7, where LL = −690.8 and AIC = 698.8. In the previous section, we analyzed these data as a renewal process, and the AIC of the estimated model is around 704.0 in both cases, whether or not we require the monotonicity condition. Thus, we see that the nonstationary Poisson model seems to fit these data better than the renewal model. A similar technique can be applied to the analysis of other statistical problems such as the estimation of a survival function for medical data. This is another interesting topic for further study.

7 Conclusion

In this article, we proposed a novel approach to the density estimation problem by means of semidefinite programming. We adapted standard interior point methods for SDP to this problem and demonstrated through various numerical experiments that the method gives a reasonable estimate of the density function with the MAIC procedure. We also demonstrated that conditions such as unimodality and monotonicity of the density function, which are usually difficult to handle with other approaches, can be easily incorporated within this framework. Combination with the mixture model and possible applications to other areas were also discussed.

Here we discuss model selection criteria. We used AIC as the model selection criterion, and it worked well in our experiments. But we can consider other approaches to model selection. For example, cross-validation is
a reasonable candidate. While cross-validation applies in a more general context, it is computationally more expensive than AIC. It would be an interesting topic to develop a suitable criterion for model selection in our context. Finally, the development of new learning models based on the proposed density estimation approach will be a long-term main direction of our research.

Appendix: Dual Problem, Primal-Dual Formulation, and Primal-Dual Interior-Point Method

This appendix outlines the idea of the primal-dual interior point method. First we introduce the dual problem of problem 3.1 and its associated central trajectory. Then we explain the primal-dual central trajectory and outline the primal-dual interior point method. The dual problem of problem 3.1 is defined as follows:

\[
\text{(D)} \quad \max_{y, Z_j} \; \sum_i b_i y_i, \qquad
\text{s.t.} \quad C_j - \sum_i A_{ij} y_i = Z_j, \quad Z_j \succeq 0, \quad j = 1, \ldots, \bar{n},
\tag{A.1}
\]
where Z_j, j = 1, ..., n̄, is an n_j × n_j real symmetric matrix and y = (y_1, ..., y_m) is an m-dimensional real vector. We denote (Z_1, ..., Z_n̄) by Z. Under mild conditions, problems 3.1 and A.1 have optimal solutions with the same optimal value (the duality theorem) (Ben-Tal & Nemirovski, 2001; Monteiro & Todd, 2000; Nesterov & Nemirovskii, 1994; Vandenberghe & Boyd, 1996). Analogous to the case of problem 3.1, the central trajectory of equation A.1 is defined as the set of the unique optimal solutions of the following problem as the parameter ν is varied:

\[
(\text{D}_\nu) \quad \max_{y, Z_j} \; \sum_i b_i y_i + \nu \sum_j \log \det Z_j, \qquad
\text{s.t.} \quad C_j - \sum_i A_{ij} y_i = Z_j, \quad Z_j \succeq 0, \quad j = 1, \ldots, \bar{n}.
\tag{A.2}
\]
We denote by (Ẑ(ν), ŷ(ν)) the optimal solution of problem A.2. Differentiation yields that (Ẑ(ν), ŷ(ν)) is the unique solution of the following system of nonlinear equations:

\[
\nu \sum_j \mathrm{Tr}\bigl(A_{ij} Z_j^{-1}\bigr) = b_i, \quad i = 1, \ldots, m, \qquad
C_j - \sum_i A_{ij} y_i = Z_j, \quad Z_j \succ 0, \quad j = 1, \ldots, \bar{n}.
\tag{A.3}
\]
The set C_D ≡ {(Ẑ(ν), ŷ(ν)) : 0 < ν < ∞} defines a smooth path that approaches an optimal solution of (D) as ν → 0. This path is called the central trajectory for problem A.1. Comparing problems A.3 and 3.5, we see that (X̂(ν), Ẑ(ν), ŷ(ν)) is the unique solution of the following bilinear system of equations:

\[
\begin{aligned}
& X_j Z_j = \nu I, \quad j = 1, \ldots, \bar{n},\\
& \sum_j \mathrm{Tr}(A_{ij} X_j) = b_i, \quad i = 1, \ldots, m,\\
& C_j - \sum_i A_{ij} y_i = Z_j, \quad j = 1, \ldots, \bar{n},\\
& X_j \succeq 0, \quad Z_j \succeq 0, \quad j = 1, \ldots, \bar{n}.
\end{aligned}
\tag{A.4}
\]
Note that we also require X_j = X_j^T and Z_j = Z_j^T for each X_j and Z_j, since ⪰ 0 means that a matrix is symmetric positive semidefinite. We define the central trajectory of problems 3.1 and A.1 as C = {W(ν) : 0 < ν < ∞}, where W(ν) = (X̂(ν), Ẑ(ν), ŷ(ν)). The primal-dual interior point method solves (P) and (D) simultaneously by following the central trajectory C based on formulation A.4. Namely, we solve equation A.4 repeatedly, reducing ν gradually to zero. As in the primal method, a crucial part of the primal-dual method is the solution procedure for equation A.4 with fixed ν. There are several efficient iterative algorithms for this problem based on the Newton method (see, e.g., Tsuchiya & Xia, 2006). Now we introduce the dual counterpart of problem 3.2 as follows:

\[
(\tilde{\text{D}}) \quad \max_{y, Z_j} \; \sum_i b_i y_i + \sum_{j \in \Lambda} \log \det Z_j + |\Lambda|, \qquad
\text{s.t.} \quad C_j - \sum_i A_{ij} y_i = Z_j, \quad Z_j \succeq 0, \quad j = 1, \ldots, \bar{n},
\tag{A.5}
\]

where Λ is the index set appearing in problem 3.2.
It is known that the optimal values of problems A.5 and 3.2 coincide. In order to solve this problem, we consider the following optimization problem with parameter η > 0:

\[
(\tilde{\text{D}}_\eta) \quad \max_{y, Z_j} \; \sum_i b_i y_i + \sum_{j \in \Lambda} \log \det Z_j + \eta \sum_{j \notin \Lambda} \log \det Z_j + |\Lambda|, \qquad
\text{s.t.} \quad C_j - \sum_i A_{ij} y_i = Z_j, \quad Z_j \succeq 0, \quad j = 1, \ldots, \bar{n}.
\tag{A.6}
\]
We denote by (Z̃(η), ỹ(η)) the unique optimal solution of this problem. We define the central trajectory for problem A.5 as

D_D ≡ {(Z̃(η), ỹ(η)) : 0 < η < ∞}.

Note that (Z̃(η), ỹ(η)) approaches the optimal set of problem 3.6 as η → 0. The set D_D is referred to as the central trajectory for equation A.5. Analogous to equation A.4, we have the following primal-dual formulation of (X̃(η), Z̃(η), ỹ(η)):

\[
\begin{aligned}
& X_j Z_j = I, \quad j \in \Lambda,\\
& X_j Z_j = \eta I, \quad j \notin \Lambda,\\
& \sum_j \mathrm{Tr}(A_{ij} X_j) - b_i = 0, \quad i = 1, \ldots, m,\\
& C_j - \sum_i A_{ij} y_i - Z_j = 0, \quad j = 1, \ldots, \bar{n},\\
& X_j \succeq 0, \quad Z_j \succeq 0, \quad j = 1, \ldots, \bar{n}.
\end{aligned}
\tag{A.7}
\]
Let W̃(η) = (X̃(η), Z̃(η), ỹ(η)), and define

D = {W̃(η) : 0 < η < ∞}

as the primal-dual central trajectory of problems 3.2 and A.5. Equation A.7 can be solved efficiently with the same iterative methods as equation A.4. Now we are ready to describe a primal-dual interior point method for problems 3.2 and A.5. As in the case of the primal method, the primal-dual central trajectories C and D intersect at ν = η = 1, that is, W(1) = W̃(1). Therefore, we can solve equations 3.2 and A.5 in two stages as follows. We first apply the ordinary primal-dual interior point method for equations 3.1 and A.1 to find a point W∗ = (X∗, Z∗, y∗) close to W(1). Then, starting from W∗, a point close to the central trajectory D for problems 3.2 and A.5, we solve problem 3.2 by solving equation A.7 approximately, repeatedly reducing η gradually to zero.
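The path-following idea is easiest to see in a toy scalar instance (a single 1 × 1 block with scalars a, b, c and dual slack z = c − a·y ≥ 0), where the central point for barrier parameter ν is available in closed form, z(ν) = ν·a/b, and shrinking ν drives the iterates to the optimum z → 0, y → c/a. A closed-form sketch of following this path (illustrative only; the real method replaces the closed form with approximate Newton solves of systems like A.7):

```python
def central_path_scalar(a, b, c, nu0=1.0, shrink=0.5, steps=10):
    """Follow the central trajectory of the scalar problem
    max b*y s.t. z = c - a*y >= 0: the central point for barrier
    parameter nu solves nu*a/z = b, so z(nu) = nu*a/b in closed form.
    Returns the list of (nu, y, z) points along the path as nu shrinks."""
    path, nu = [], nu0
    for _ in range(steps):
        z = nu * a / b                # closed-form central point
        path.append((nu, (c - z) / a, z))
        nu *= shrink
    return path
```

With a = 1, b = 2, c = 3, the path starts at z = 0.5 for ν = 1 and approaches the optimum y = 3, z = 0 geometrically, mirroring how the implementation reduces ν (or η) toward zero.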
A remarkable advantage of the primal-dual method is its flexibility concerning initialization. In the primal formulation in section 3, the method needs an initial feasible interior point, but obtaining such a point is already a nontrivial problem. In the primal-dual formulation, we can get around this difficulty, because the search directions can be computed for any (X, Z, y) such that X ≻ 0 and Z ≻ 0. Generally, such a point does not satisfy the linear equalities in equation A.7, but we can let them be satisfied toward the end of the iterations, since they are linear. In that case, we approach the central trajectory from outside the feasible region. We provided two versions of the primal-dual method in our implementation: the basic algorithm and the predictor-corrector algorithm. The first follows the central trajectory loosely. The method is simple and efficient but often encounters difficulty on ill-conditioned problems. The predictor-corrector algorithm follows the central trajectory more precisely. This method is slower but robust and steady, suitable for ill-conditioned and difficult problems. (See Fushiki et al., 2003, for details.) When we reasonably exploit the structure of this problem, the number of arithmetic operations required per iteration of the primal-dual method becomes O(N³), where N is the number of data. This suggests that our approach is not computationally expensive, since the number of iterations is typically up to 50 in interior point methods for SDP. In view of the state of the art of current SDP software (Sturm, 1999; Tütüncü et al., 2003; Yamashita, Fujisawa, & Kojima, 2003), a sophisticated implementation would be capable of handling N up to a few thousand.

Acknowledgments

This research was supported in part by a Grant-in-Aid for Young Scientists (B), 2005, 17700286, and a Grant-in-Aid for Scientific Research (C), 2003, 15510144, from the Japan Society for the Promotion of Science.

References

Akaike, H. (1973).
Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), Proceedings of the Second International Symposium on Information Theory (pp. 267–281). Budapest: Akadémiai Kiadó.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
Akaike, H. (1977). On entropy maximization principle. In P. R. Krishnaiah (Ed.), Applications of statistics (pp. 27–41). Amsterdam: North-Holland.
Alizadeh, F. (1995). Interior point methods in semidefinite programming with applications to combinatorial optimization. SIAM Journal on Optimization, 5, 13–51.
Bennett, K. P., & Mangasarian, O. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1, 23–34.
Ben-Tal, A., & Nemirovski, A. (2001). Lectures on modern convex optimization: Analysis, algorithms, and engineering applications. Philadelphia: SIAM.
Bhattacharyya, C. (2004). Second order cone programming formulations for feature selection. Journal of Machine Learning Research, 5, 1417–1433.
Carmichael, J.-P. (1976). The autoregressive method: A method for approximating and estimating positive functions. Unpublished doctoral dissertation, SUNY, Buffalo.
Cox, D. R., & Lewis, P. A. W. (1966). The statistical analysis of series of events. New York: Wiley.
Eggermont, P. P. B., & LaRiccia, V. N. (2001). Maximum penalized likelihood estimation, Vol. 1: Density estimation. New York: Springer.
Fletcher, R. (1989). Practical methods of optimization (2nd ed.). New York: Wiley.
Fushiki, T., Horiuchi, S., & Tsuchiya, T. (2003). A new computational approach to density estimation with semidefinite programming (Research Memorandum No. 898). Tokyo: Institute of Statistical Mathematics.
Good, I. J., & Gaskins, R. A. (1971). Nonparametric roughness penalties for probability densities. Biometrika, 58, 255–277.
Good, I. J., & Gaskins, R. A. (1980). Density estimation and bump-hunting by the penalized likelihood method exemplified by scattering and meteorite data. Journal of the American Statistical Association, 75, 42–56.
Graepel, T., & Herbrich, R. (2004). Invariant pattern recognition by semidefinite programming machines. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Hjort, N. L., & Glad, I. K. (1995). Nonparametric density estimation with a parametric start. Annals of Statistics, 23, 882–904.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Ishiguro, M., & Sakamoto, Y. (1984). A Bayesian approach to the probability density estimation. Annals of the Institute of Statistical Mathematics, 36, 523–538.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.
Lanckriet, G. R. G., El Ghaoui, L., Bhattacharyya, C., & Jordan, M. I. (2002). A robust minimax approach to classification. Journal of Machine Learning Research, 3, 555–582.
McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: Wiley.
Monteiro, R. D. C., & Todd, M. J. (2000). Path following methods. In H. Wolkowicz, R. Saigal, & L. Vandenberghe (Eds.), Handbook of semidefinite programming: Theory, algorithms, and applications (pp. 267–306). Boston: Kluwer.
Nesterov, Y. (2000). Squared functional systems and optimization problems. In H. Frenk, K. Roos, T. Terlaky, & S. Zhang (Eds.), High performance optimization (pp. 405–440). Dordrecht: Kluwer.
Nesterov, Y., & Nemirovskii, A. (1994). Interior-point methods for convex programming. Philadelphia: SIAM.
Nocedal, J., & Wright, S. J. (1999). Numerical optimization. New York: Springer.
Parzen, E. (1979). Nonparametric statistical data modeling. Journal of the American Statistical Association, 74, 105–131.
Roeder, K. (1990). Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. Journal of the American Statistical Association, 85, 617–624.
Scott, D. W. (1992). Multivariate density estimation. New York: Wiley.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
Sturm, J. F. (1999). Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11/12, 625–653.
Tanabe, K., & Sagae, M. (1999). An empirical Bayes method for nonparametric density estimation. Cooperative Research Report of the Institute of Statistical Mathematics, 118, 11–37.
Tapia, R. A., & Thompson, J. R. (1978). Nonparametric probability density estimation. Baltimore: Johns Hopkins University Press.
Toh, K.-C. (1999). Primal-dual path-following algorithms for determinant maximization problems with linear matrix inequalities. Computational Optimization and Applications, 14, 309–330.
Tsuchiya, T., & Xia, Y. (2006). An extension of the standard polynomial-time primal-dual path-following algorithm to the weighted determinant maximization problem with semidefinite constraints (Research Memorandum No. 980). Tokyo: Institute of Statistical Mathematics.
Tütüncü, R. H., Toh, K. C., & Todd, M. J. (2003). Solving semidefinite-quadratic-linear programs using SDPT3. Mathematical Programming, 95, 189–217.
Vandenberghe, L., & Boyd, S. (1996). Semidefinite programming. SIAM Review, 38, 49–95.
Vandenberghe, L., Boyd, S., & Wu, S.-P. (1998). Determinant maximization with linear matrix inequality constraints. SIAM Journal on Matrix Analysis and Applications, 19, 499–533.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Waki, H., Kim, S., Kojima, M., & Muramatsu, M. (2005). Sums of squares and semidefinite programming relaxations for polynomial optimization problems with structured sparsity (Tech. Rep. No. B-411). Tokyo: Department of Mathematical and Computing Sciences, Tokyo Institute of Technology.
Wand, M. P., & Jones, M. C. (1995). Kernel smoothing. London: Chapman & Hall.
Weisberg, S. (1985). Applied linear regression. New York: Wiley.
Wolkowicz, H., Saigal, R., & Vandenberghe, L. (2000). Handbook of semidefinite programming: Theory, algorithms, and applications. Boston: Kluwer.
Yamashita, M., Fujisawa, K., & Kojima, M. (2003). Implementation and evaluation of SDPA 6.0 (SemiDefinite Programming Algorithm 6.0). Optimization Methods and Software, 18, 491–505.
Received June 14, 2005; accepted April 20, 2006.
LETTER
Communicated by Lee Giles
Piecewise-Linear Neural Networks and Their Relationship to Rule Extraction from Data

Martin Holeňa
[email protected]
Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod vodárenskou věží 2, CZ-18207 Praha 8, Czech Republic
This article addresses the topic of extracting logical rules from data by means of artificial neural networks. The approach based on piecewise-linear neural networks, which has already been used for the extraction of Boolean rules in the past, is revisited, and it is shown that this approach can be important also for the extraction of fuzzy rules. Two important theoretical properties of piecewise-linear neural networks are proved, allowing an elaboration of the basic ideas of the approach into several variants of an algorithm for the extraction of Boolean rules. That algorithm has already been used in two real-world applications. Finally, a connection to the extraction of rules of the Łukasiewicz logic is established, relying on recent results about rational McNaughton functions. Based on one of the constructive proofs of the McNaughton theorem, an algorithm is formulated that in principle allows extracting a particular kind of formulas of the Łukasiewicz predicate logic from piecewise-linear neural networks trained with rational data.

1 Introduction

The extraction of knowledge from data by means of artificial neural networks (ANNs) has received much attention in connection with data mining and pattern recognition (Alexander & Mozer, 1999; Andrews, Diederich, & Tickle, 1995; Bishop, 1995; Lu, Setiono, & Liu, 1996; Narazaki, Watanabe, & Yamamoto, 1996; Nauck, Nauck, & Kruse, 1996; Ripley, 1996; Tickle, Andrews, Golea, & Diederich, 1998). Actually, already the mapping computed by the network incorporates knowledge transferred to it during training from the training data: knowledge about the implications that certain values of the variables assigned to its inputs have for the values of the variables assigned to its outputs. That knowledge is represented through the network topology and through distributed numerical parameters parameterizing the computed mapping.
Needless to say, such knowledge representation is not readily understandable (in the terms of Berthold & Hand, 1999, that representation provides a high data fit but a low mental fit). Neural Computation 18, 2813–2853 (2006)
© 2006 Massachusetts Institute of Technology
2814
M. Holeňa
Therefore, methods for knowledge extraction from data aim at better-comprehensible knowledge representations (representations with a higher mental fit), most commonly some kind of logical rules. This representation is also by far the most often employed in ANN-based knowledge extraction (Bologna, 2000; d'Avila Garcez, Broda, & Gabbay, 2001; Duch, Adamczak, & Grabczewski, 1998; Finn, 1999; Healy & Caudell, 1997; Ishikawa, 2000; Mitra, De, & Pal, 1997; Mitra & Hayashi, 2000; Rabuñal, Dorado, Pereira, & Rivero, 2004; Setiono, 1997; Towell & Shavlik, 1993; Tsukimoto, 2000). Logical rules are a frequent way of communicating knowledge between humans, and they can be easily exposed to symbolic manipulation. In practical applications, ANN-based rule extraction methods have to compete with methods for the extraction of logical rules directly from data, most notably with methods relying on various kinds of decision trees (Breiman, Friedman, Olshen, & Stone, 1984; Loh, 2002; Quinlan, 1992; Siciliano & Mola, 2000; Vach, 1995). The main reasons that rule extraction from data via the intermediate step of a trained neural network is attractive even in competition with direct methods are:
- Neural networks take into consideration all input variables at the same time; they perform a multivariate search, not a search in a variable-by-variable manner.
- In the case of data resulting from continuous random variables, the neural network is trained with the original data, not with their discretizations. Thus, the loss of information caused by discretization, which inevitably accompanies rule extraction from such data, is postponed to the very last step of the method. (For comparison, decision trees include discretization of continuous variables from the very beginning, whenever a new node is created.)

A large number of ANN-based rule extraction methods have already been proposed. They differ in a number of ways, the most important among which is the expressive power of the rules, given by the meaning they are able to convey (cf. the classification of those rules suggested in Andrews et al., 1995; Duch, Adamczak, & Grabczewski, 2000; and Tickle et al., 1998). Though the conveyable meaning of the rules depends also on the syntax of the language underlying the considered logic, which allows differentiating propositional and first-order logic rules, for example, it is primarily determined by the set of possible truth values of the rules. According to this criterion, extracted rules can be divided into two main groups:

- Boolean rules: formulas of Boolean logic, such as propositional if-then rules or M-of-N rules. They can assume only two different truth values: true and false. The tertium non datur axiom of Boolean logic implies that if a Boolean rule has been evaluated and has not been found true, then it automatically must have been found false. That is why methods for the extraction of Boolean rules need only output rules that, within a given set of rules to evaluate, have been found valid in the data (Alexander & Mozer, 1999; Duch et al., 1998; Healy & Caudell, 1997; Ishikawa, 2000; Lu, Setiono, & Liu, 1996; Narazaki et al., 1996; Rabuñal et al., 2004; Setiono, 1997; Tsukimoto, 2000).
- Fuzzy rules: formulas of some fuzzy logic, typically formulas of the product logic, Łukasiewicz logic, Gödel logic, or some combination of those three. Their truth values can be arbitrary elements of some BL-algebra (Hájek, 1998). In the methods proposed for the extraction of fuzzy rules so far, that BL-algebra is always the interval [0, 1] or some subalgebra thereof (Duch et al., 2000; Finn, 1999; Mitra et al., 1997; Nauck et al., 1996; a survey of many other methods can be found in Mitra and Hayashi, 2000).
The existing ANN-based rule extraction methods are based on a wide spectrum of theoretical principles. Interestingly, so far no particular principle has apparently been employed both for the extraction of Boolean rules and for the extraction of fuzzy rules. That can be a serious drawback when switching between the two kinds of rules, since results obtained with methods that do not share common theoretical foundations are difficult to compare. The drawback is aggravated by the fact that most of the existing methods rely primarily on heuristics, and their underlying theoretical principles are not very deep.

In this article, a theoretically well-founded approach is presented that proves to be relevant, at least theoretically, for the extraction of both Boolean and fuzzy rules. It is the approach based on piecewise-linear neural networks that was proposed independently in Maire (1999) and Holeňa (2000) for the extraction of Boolean rules. In particular, two important previously unpublished properties of that kind of ANN are proven, and its connection to the extraction of fuzzy rules is established, more precisely, to the extraction of particular formulas of the Łukasiewicz logic. This connection is based on recent results about rational McNaughton functions and on recently published constructive proofs of the McNaughton theorem. In this way, the article attempts to fill the gap between the recent fast development in theoretical fuzzy logic and the more application-oriented area of ANN-based rule extraction.

In the following section, piecewise-linear neural networks are introduced, and their properties are established in propositions 1 and 2. Section 3 explains how such networks can be employed to extract Boolean rules. To illustrate the usefulness of the approach, two real-world applications are sketched in section 4. Finally, the connection to the extraction of formulas of the Łukasiewicz logic is the subject of section 5. The article concludes with complexity considerations and a discussion of possible simplifications
of the method for the extraction of fuzzy rules, to bring its complexity to a level similar to that of the extraction of Boolean rules.

2 Piecewise-Linear Neural Networks

Though piecewise linearity can, in principle, be studied in connection with any kind of artificial neural network that admits continuous activation functions, this article restricts its attention to multilayer perceptrons. The reason is the popularity of multilayer perceptrons in practical applications—both in general and in the specific context of rule extraction (Alexander & Mozer, 1999; Andrews et al., 1995; Benítez, Castro, & Requena, 1997; Duch et al., 1998; Howes & Crook, 1999; Mitra et al., 1997; Nauck et al., 1996; Narazaki et al., 1996; Setiono, 1997; Towell & Shavlik, 1993; Tsukimoto, 2000). To avoid misunderstanding due to differences encountered in the literature, the adopted meaning of the relevant concepts is fixed in definitions 1 and 2. In the interleaved lemma 1, some basic properties of sigmoidal piecewise-linear functions are recalled.

Definition 1. A function f : ℝ → ℝ will be called sigmoidal if

f(ℝ) ⊂ [0, 1] & lim_{x→−∞} f(x) = 0 & lim_{x→+∞} f(x) = 1,

and piecewise linear if ℝ can be covered by intervals I_1, …, I_{p+1}, p ∈ ℕ, such that for each k = 1, …, p + 1, I_k is closed with regard to ℝ and the restriction f|I_k is a linear function.

Lemma 1. Any piecewise-linear function f : ℝ → ℝ with linearity intervals I_1, …, I_{p+1} fulfills the following:

i. f is continuous.
ii. If in addition f is sigmoidal, then p ≥ 2, f|I_1 = 0, f|I_{p+1} = 1, and f(I_2), …, f(I_p) are closed intervals covering [0, 1].

Proof. Both results easily follow from definition 1 and from properties of linear functions, in particular from their continuity and their limit properties in ±∞.

Definition 2. The term multilayer perceptron (MLP), more precisely, timeless fully connected multilayer perceptron with linear outputs, denotes the pair

M = ((n_0, n_1, …, n_L), f)   (2.1)

in which

i. (n_0, n_1, …, n_L) ∈ ℕ^{L+1}, L ∈ ℕ \ {1}, is called the topology of M, and is given by n_0 input nodes, n_L output nodes, and n_i hidden nodes in the ith layer, i = 1, …, L − 1.
ii. f : ℝ → ℝ is called the activation function of M. Most generally, it is required only to be nonconstant and Borel measurable. Typically, however, it has
various additional properties. In the sequel, focus will be on multilayer perceptrons with one layer of hidden nodes and with a sigmoidal piecewise-linear activation function. Such artificial neural networks will be, for simplicity, called piecewise-linear neural networks.

For each w = ((w_{1,1}, …, w_{1,n_1}), …, (w_{L,1}, …, w_{L,n_L})) ∈ ℝ^{∑_{i=1}^{L} (n_{i−1}+1) n_i}, let F_{f,w} = (F^1_{f,w}, …, F^{n_L}_{f,w}) be the mapping of ℝ^{n_0} into ℝ^{n_L} fulfilling

(∀i ∈ L̂)(∀j ∈ n̂_i)(∃φ_{i,j} : ℝ^{n_{i−1}} → ℝ)(∀z ∈ ℝ^{n_{i−1}})
[i < L ⇒ φ_{i,j}(z) = f(w_{i,j}·(1, z)) & i = L ⇒ φ_{L,j}(z) = w_{L,j}·(1, z)]
& (∀x ∈ ℝ^{n_0})(∃z_0, …, z_L ∈ ℝ^{n_0} × … × ℝ^{n_L})
[z_0 = x & z_L = F(x) & (∀i ∈ L̂) z_i = (φ_{i,1}(z_{i−1}), …, φ_{i,n_i}(z_{i−1}))],   (2.2)

where the symbol · denotes the standard scalar product of vectors in Euclidean spaces, and where for each number n ∈ ℕ, the notation n̂ = {1, …, n} is used. The mappings F_{f,w} for vectors w ∈ ℝ^{∑_{i=1}^{L} (n_{i−1}+1) n_i} are said to be computable by M. The components of w_{i,j} for i = 1, …, L, j = 1, …, n_i, are called parameters of the MLP; in particular, their first components are called biases, and the remaining components are called weights. The set of all mappings computable by M will be denoted F_M.

Since piecewise-linear neural networks are a special case of multilayer perceptrons with one layer of hidden nodes, they inherit all properties of such MLPs, in particular, their universal approximation capabilities (see, e.g., Hornik, 1991; Hornik, Stinchcombe, White, & Auer, 1994; Kůrková, 1992, 2000). On the other hand, piecewise-linear neural networks require specific caution in the case of training, that is, in the case of seeking a computable mapping F* that fulfills

E(F*) = min_{F ∈ F_M} E(F).   (2.3)
In equation 2.3, E : F_M → [0, +∞) is some error function, in general consisting of some empirical error, which depends on a given sequence (ξ_1, η_1), …, (ξ_m, η_m) of input-output pairs and is typically based on the Euclidean norm or its square (the sum-of-squares empirical error), and of some regularization term, which does not depend on (ξ_1, η_1), …, (ξ_m, η_m) and ensures that the optimization task, equation 2.3, is well posed (Tikhonov & Arsenin, 1977). The difficulty with training of piecewise-linear neural networks is caused by the discontinuity of the derivatives of their activation functions, which in turn implies the discontinuity of the partial derivatives of E with respect to weights and biases. Such nonsmooth error functions are admissible in neither the popular backpropagation method
(Hagan, Demuth, & Beale, 1996; Rumelhart, Hinton, & Williams, 1986) nor in more sophisticated methods for neural network training, such as conjugate-gradient methods, quasi-Newton methods, or the Levenberg-Marquardt method (Dennis & Schnabel, 1983; Hagan & Menhaj, 1994; Hagan et al., 1996).

Fortunately, this problem can be bypassed. Since all existing algorithms for the optimization of general continuously differentiable functions are iterative, after a finite number of iterations they in general do not find the optimal solution F* of equation 2.3 but only some suboptimal solution instead. And with respect to suboptimality, mappings computable by a piecewise-linear neural network are in a certain sense interchangeable with their counterparts computable by an MLP with a continuous sigmoidal activation function that is sufficiently close to the activation function of the piecewise-linear neural network. That result is precisely formulated in proposition 1 below. Beforehand, definitions 3 and 4 explain the meaning of the main concepts involved, and lemma 2 presents a property of sigmoidal functions needed for the formulation and proof of proposition 1.

Definition 3. Let M_1 and M_2 be two MLPs with the same topology, that is, M_1 = ((n_0, n_1, …, n_L), f), M_2 = ((n_0, n_1, …, n_L), g). Then for each w ∈ ℝ^{∑_{i=1}^{L} (n_{i−1}+1) n_i}, the term counterpart of F_{f,w} in M_2 denotes the mapping F_{g,w}.

Definition 4. Let M = ((n_0, n_1, …, n_L), f) be an MLP, ε > 0, m ∈ ℕ, and (ξ_1, η_1), …, (ξ_m, η_m) ∈ ℝ^{n_0} × ℝ^{n_L}. Let further E : F_M → [0, +∞). Then a mapping F ∈ F_M is called optimal for (M, (ξ_1, η_1), …, (ξ_m, η_m)) with respect to E if

E(F) = min_{F′ ∈ F_M} E(F′)   (2.4)

and ε-suboptimal for (M, (ξ_1, η_1), …, (ξ_m, η_m)) with respect to E if

E(F) < inf_{F′ ∈ F_M} E(F′) + ε.   (2.5)
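The counterpart construction of definition 3, and the nonsmoothness discussed above, can be illustrated with a minimal sketch. The hard-sigmoid breakpoints at ±2 and the logistic comparison function below are illustrative assumptions, not taken from the text:

```python
import numpy as np

def hard_sigmoid(t):
    """A sigmoidal piecewise-linear activation with three linearity
    intervals (illustrative breakpoints at -2 and 2)."""
    return np.clip(0.5 + 0.25 * t, 0.0, 1.0)

def logistic(t):
    """A continuous sigmoidal activation function."""
    return 1.0 / (1.0 + np.exp(-t))

def one_sided_slopes(f, t, h=1e-6):
    """Left and right difference quotients of f at t; they disagree at a
    breakpoint of a piecewise-linear activation, which is why smooth
    training methods are not directly applicable."""
    return (f(t) - f(t - h)) / h, (f(t + h) - f(t)) / h

def F(x, w, act):
    """The mapping F_{act,w} of a one-hidden-layer MLP with linear
    outputs; w bundles the parameters (W1, b1, W2, b2) of definition 2."""
    W1, b1, W2, b2 = w
    return W2 @ act(W1 @ x + b1) + b2

def counterpart(x, w):
    """Counterpart of F_{logistic,w} in the piecewise-linear network
    (definition 3): the same parameter vector w, the activation swapped."""
    return F(x, w, hard_sigmoid)
```

At a breakpoint such as t = 2, the two one-sided slopes of hard_sigmoid are 0.25 and 0, so the derivative fails to exist there; the two networks share the same parameter vector and differ only in the activation function.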
Lemma 2. Let C_ς = {f ∈ C(ℝ) : f is continuous sigmoidal}, and n_I, n_H, n_O ∈ ℕ. For each f ∈ C_ς, denote F_f the set of all mappings computable by the multilayer perceptron ((n_I, n_H, n_O), f), and let E_f : F_f → [0, +∞) be an error function with the following properties:

i. There exist numbers n_1, …, n_k ∈ ℕ, as well as functions ω_{j,1}, …, ω_{j,n_j} : ℝ^{(n_I+1)n_H+(n_H+1)n_O} → ℝ, for j = 1, …, k, and continuous functions φ_0, …, φ_k : ℝ^{(n_I+1)n_H+(n_H+1)n_O} → ℝ and ρ : ℝ^{k+1} → ℝ such that

(∀f ∈ C_ς)(∀w ∈ ℝ^{(n_I+1)n_H+(n_H+1)n_O})
E_f(F_{f,w}) = ρ(φ_0(w), φ_1(w) ∏_{i=1}^{n_1} f(ω_{1,i}(w)), …, φ_k(w) ∏_{i=1}^{n_k} f(ω_{k,i}(w))),   (2.6)

ii. lim_{‖w‖→+∞} E_f(F_{f,w}) = +∞ uniformly for all f ∈ C_ς.   (2.7)
Then:

a. C_ς is a metric space with a metric d_ς defined

(∀f, g ∈ C_ς) d_ς(f, g) = sup_{x∈ℝ} |f(x) − g(x)|,   (2.8)

b. inf_{F∈F_f} E_f(F), as a mapping of C_ς into [0, +∞), is continuous with respect to the metric d_ς.

Proof. a. For any f, g, h ∈ C_ς:

i. d_ς(f, f) = 0;
ii. if f ≠ g, then (∃x ∈ ℝ) f(x) − g(x) ≠ 0; hence, d_ς(f, g) > 0;
iii. (∀x ∈ ℝ) |f(x) − h(x)| ≤ |f(x) − g(x)| + |g(x) − h(x)| ≤ d_ς(f, g) + d_ς(g, h); hence, d_ς(f, h) ≤ d_ς(f, g) + d_ς(g, h).

Consequently, d_ς is a metric on C_ς.

b. Let f ∈ C_ς and ε > 0 be given. Because of condition 2.7, there exists a number K > 0 such that

(∀g ∈ C_ς)(∀w ∈ ℝ^{(n_I+1)n_H+(n_H+1)n_O}) ‖w‖ > K ⇒ E_g(F_{g,w}) > inf_{w∈ℝ^{(n_I+1)n_H+(n_H+1)n_O}} E_f(F_{f,w}) + (3/2)ε.   (2.9)

Since the closure B̄(0, K) of the ball B(0, K) = {w ∈ ℝ^{(n_I+1)n_H+(n_H+1)n_O} : ‖w‖ < K} is compact, each function φ_j, j = 0, …, k, maps it onto some compact set [A_j, B_j] ⊂ ℝ. On the compact set [min(A_0, −B_0), max(−A_0, B_0)] × [min(A_1, −B_1), max(−A_1, B_1)] × … × [min(A_k, −B_k), max(−A_k, B_k)], the function ρ is uniformly continuous; hence, there exists δ > 0 such that for any x, y ∈ [min(A_0, −B_0), max(−A_0, B_0)] × [min(A_1, −B_1), max(−A_1, B_1)] × … × [min(A_k, −B_k), max(−A_k, B_k)],

‖x − y‖ < δ ⇒ |ρ(x) − ρ(y)| < ε/2.   (2.10)
Put δ′ = min_{j=1,…,k} δ/(n_j max(−A_j, B_j, ε)). Then for any w in the ball B(0, K) and any f^a, f^b ∈ C_ς fulfilling d_ς(f^a, f^b) < δ′,

|φ_j(w) ∏_{i=1}^{n_j} f^a(ω_{j,i}(w)) − φ_j(w) ∏_{i=1}^{n_j} f^b(ω_{j,i}(w))|
= |φ_j(w)| |∏_{i=1}^{n_j} f^a(ω_{j,i}(w)) − ∏_{i=1}^{n_j} f^b(ω_{j,i}(w))|
= |φ_j(w)| |∑_{i=1}^{n_j} (f^a(ω_{j,i}(w)) − f^b(ω_{j,i}(w))) ∏_{ℓ<i} f^a(ω_{j,ℓ}(w)) ∏_{ℓ>i} f^b(ω_{j,ℓ}(w))|
≤ |φ_j(w)| ∑_{i=1}^{n_j} |f^a(ω_{j,i}(w)) − f^b(ω_{j,i}(w))| ∏_{ℓ<i} |f^a(ω_{j,ℓ}(w))| ∏_{ℓ>i} |f^b(ω_{j,ℓ}(w))|
≤ |φ_j(w)| ∑_{i=1}^{n_j} |f^a(ω_{j,i}(w)) − f^b(ω_{j,i}(w))|
< max(−A_j, B_j) ∑_{i=1}^{n_j} δ′ = n_j max(−A_j, B_j) δ′ ≤ δ,   (2.11)
for j = 1, …, k. Further, for f ∈ {f^a, f^b} and for every j = 1, …, k,

|φ_j(w) ∏_{i=1}^{n_j} f(ω_{j,i}(w))| ≤ |φ_j(w)| ≤ max(−A_j, B_j),

thus

φ_j(w) ∏_{i=1}^{n_j} f(ω_{j,i}(w)) ∈ [min(A_j, −B_j), max(−A_j, B_j)].   (2.12)
Combining equations 2.11 and 2.12 with 2.10 yields

|E_{f^a}(F_{f^a,w}) − E_{f^b}(F_{f^b,w})|
= |ρ(φ_0(w), φ_1(w) ∏_{i=1}^{n_1} f^a(ω_{1,i}(w)), …, φ_k(w) ∏_{i=1}^{n_k} f^a(ω_{k,i}(w)))
− ρ(φ_0(w), φ_1(w) ∏_{i=1}^{n_1} f^b(ω_{1,i}(w)), …, φ_k(w) ∏_{i=1}^{n_k} f^b(ω_{k,i}(w)))| < ε/2.   (2.13)
Finally, let w_f ∈ ℝ^{(n_I+1)n_H+(n_H+1)n_O} be such that

E_f(F_{f,w_f}) < inf_{w∈ℝ^{(n_I+1)n_H+(n_H+1)n_O}} E_f(F_{f,w}) + ε/2.   (2.14)
Due to equations 2.9 and 2.13, ‖w_f‖ ≤ K; hence, for any g ∈ C_ς fulfilling d_ς(f, g) < δ′,

E_g(F_{g,w_f}) = E_f(F_{f,w_f}) + (E_g(F_{g,w_f}) − E_f(F_{f,w_f}))
< inf_{w∈ℝ^{(n_I+1)n_H+(n_H+1)n_O}} E_f(F_{f,w}) + ε/2 + |E_g(F_{g,w_f}) − E_f(F_{f,w_f})|
< inf_{w∈ℝ^{(n_I+1)n_H+(n_H+1)n_O}} E_f(F_{f,w}) + ε.   (2.15)
Consequently,

inf_{w∈ℝ^{(n_I+1)n_H+(n_H+1)n_O}} E_g(F_{g,w}) < inf_{w∈ℝ^{(n_I+1)n_H+(n_H+1)n_O}} E_f(F_{f,w}) + ε.   (2.16)
Then there exists a w_g ∈ ℝ^{(n_I+1)n_H+(n_H+1)n_O} such that

E_g(F_{g,w_g}) < inf_{w∈ℝ^{(n_I+1)n_H+(n_H+1)n_O}} E_g(F_{g,w}) + ε/2
< inf_{w∈ℝ^{(n_I+1)n_H+(n_H+1)n_O}} E_f(F_{f,w}) + 3ε/2.   (2.17)
Thus, again due to equations 2.9 and 2.13, ‖w_g‖ ≤ K, and

E_f(F_{f,w_g}) = E_g(F_{g,w_g}) + (E_f(F_{f,w_g}) − E_g(F_{g,w_g}))
< inf_{w∈ℝ^{(n_I+1)n_H+(n_H+1)n_O}} E_g(F_{g,w}) + ε/2 + |E_f(F_{f,w_g}) − E_g(F_{g,w_g})|
< inf_{w∈ℝ^{(n_I+1)n_H+(n_H+1)n_O}} E_g(F_{g,w}) + ε,   (2.18)
entailing

inf_{w∈ℝ^{(n_I+1)n_H+(n_H+1)n_O}} E_f(F_{f,w}) < inf_{w∈ℝ^{(n_I+1)n_H+(n_H+1)n_O}} E_g(F_{g,w}) + ε.   (2.19)
Considering the inequalities 2.16 and 2.19 together leads to

|inf_{w∈ℝ^{(n_I+1)n_H+(n_H+1)n_O}} E_g(F_{g,w}) − inf_{w∈ℝ^{(n_I+1)n_H+(n_H+1)n_O}} E_f(F_{f,w})| < ε,   (2.20)
or, equivalently,

|inf_{F∈F_g} E_g(F) − inf_{F∈F_f} E_f(F)| < ε   (2.21)

for any g ∈ C_ς fulfilling d_ς(f, g) < δ′.

Example 1. Given a sequence of input-output training pairs (ξ_1, η_1), …, (ξ_m, η_m) ∈ ℝ^{n_I} × ℝ^{n_O}, simple examples of the error function from lemma 2 are the Euclidean distance between the desired and computed outputs, with a regularization term F_reg(w),

(∀f ∈ C_ς)(∀w ∈ ℝ^{(n_I+1)n_H+(n_H+1)n_O}) E_{1,f}(F_{f,w}) = ∑_{j=1}^{m} ‖F_{f,w}(ξ_j) − η_j‖ + F_reg(w),   (2.22)

where F_reg(w) is a continuous function on ℝ^{(n_I+1)n_H+(n_H+1)n_O} such that lim_{‖w‖→∞} F_reg(w) = +∞, for example, F_reg(w) = ‖w‖^p for some p > 0, or the square of that Euclidean distance, again with the considered regularization term,

(∀f ∈ C_ς)(∀w ∈ ℝ^{(n_I+1)n_H+(n_H+1)n_O}) E_{2,f}(F_{f,w}) = ∑_{j=1}^{m} ‖F_{f,w}(ξ_j) − η_j‖² + F_reg(w).   (2.23)
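Equations 2.22 and 2.23 can be written down directly. The following is a minimal sketch; the clipped-linear activation in the usage below is an illustrative stand-in for f:

```python
import numpy as np

def forward(x, W1, b1, W2, b2, act):
    """F_{f,w}(x) for a one-hidden-layer MLP with linear outputs."""
    return W2 @ act(W1 @ x + b1) + b2

def freg(params, p=2.0):
    """One admissible regularization term, F_reg(w) = ||w||^p."""
    w = np.concatenate([a.ravel() for a in params])
    return np.linalg.norm(w) ** p

def e1(params, data, act):
    """E_{1,f}: summed Euclidean distances plus regularization (eq. 2.22)."""
    return sum(np.linalg.norm(forward(x, *params, act) - y)
               for x, y in data) + freg(params)

def e2(params, data, act):
    """E_{2,f}: summed squared distances plus regularization (eq. 2.23)."""
    return sum(np.linalg.norm(forward(x, *params, act) - y) ** 2
               for x, y in data) + freg(params)
```

Both functions depend on the weights only through the hidden activations and the regularization term, which is what the decomposition 2.6 exploits.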
To see it, put k = n_H n_O m, n_1 = … = n_k = 1, φ_0 = F_reg, and define the functions φ_i, ω_{i,1} for i = (j−1)n_O n_H + (h−1)n_H + 1, …, (j−1)n_O n_H + h n_H, h = 1, …, n_O, j = 1, …, m, by

(∀w = (w_{1,1}, …, w_{1,n_H}, w_{2,1}, …, w_{2,n_O}) ∈ ℝ^{(n_I+1)n_H+(n_H+1)n_O})
φ_i(w) = w_{2,h}^{i−(j−1)n_O n_H−(h−1)n_H+1} & ω_{i,1}(w) = (1, ξ_j)·w_{1, i−(j−1)n_O n_H−(h−1)n_H},   (2.24)

where w_{2,h} = (w_{2,h}^1, …, w_{2,h}^{n_H+1}). Then φ_0, …, φ_k are indeed continuous, and E_{1,f} obeys the decomposition 2.6 with the continuous function ρ defined

(∀ξ = (ξ_0, ξ_{1,1}, …, ξ_{1,n_O}, …, ξ_{m,1}, …, ξ_{m,n_O}) ∈ ℝ^{n_H n_O m+1})
ρ(ξ) = ∑_{j=1}^{m} ‖(ξ_{j,1}·(1, …, 1), …, ξ_{j,n_O}·(1, …, 1)) − η_j‖ + ξ_0,   (2.25)

whereas for E_{2,f} to obey that decomposition, ρ has to be defined

(∀ξ = (ξ_0, ξ_{1,1}, …, ξ_{1,n_O}, …, ξ_{m,1}, …, ξ_{m,n_O}) ∈ ℝ^{n_H n_O m+1})
ρ(ξ) = ∑_{j=1}^{m} ‖(ξ_{j,1}·(1, …, 1), …, ξ_{j,n_O}·(1, …, 1)) − η_j‖² + ξ_0.   (2.26)
Finally, the requirement lim_{‖w‖→∞} φ_0(w) = lim_{‖w‖→∞} F_reg(w) = +∞, together with the fact that (∀f ∈ C_ς)(∀w ∈ ℝ^{(n_I+1)n_H+(n_H+1)n_O}) E_{1,f}(F_{f,w}) ≥ F_reg(w) & E_{2,f}(F_{f,w}) ≥ F_reg(w), implies that condition 2.7 holds uniformly for all f ∈ C_ς, both for E_{1,f} and for E_{2,f}.

Proposition 1. Consider an MLP M with one layer of hidden nodes, a topology (n_I, n_H, n_O), and a continuous sigmoidal activation function f; a piecewise-linear neural network L with the same topology as M and an activation function g; and a system of error functions (E_f)_{f∈C_ς} fulfilling conditions 2.6 and 2.7. Let further ε > ε′ > 0, (ξ_1, η_1), …, (ξ_m, η_m) ∈ ℝ^{n_I} × ℝ^{n_O}, and F : ℝ^{n_I} → ℝ^{n_O} be a mapping ε′-suboptimal for (M, (ξ_1, η_1), …, (ξ_m, η_m)) with respect to E_f. Then provided g is close enough to f in the metric d_ς, the counterpart of F in L is ε-suboptimal for (L, (ξ_1, η_1), …, (ξ_m, η_m)) with respect to E_g.

Proof. Let w ∈ ℝ^{(n_I+1)n_H+(n_H+1)n_O} be such that F = F_{f,w}. According to definitions 3 and 4, we know that

E_f(F_{f,w}) < inf_{F′∈F_M} E_f(F′) + ε′,   (2.28)

and we need to prove

E_g(F_{g,w}) < inf_{F′∈F_L} E_g(F′) + ε.   (2.29)

However, equation 2.29 directly follows from 2.28 and from the possibility of choosing g so close to f that

|inf_{F′∈F_L} E_g(F′) − inf_{F′∈F_M} E_f(F′)| < ε − ε′,   (2.30)

which is a consequence of lemma 2.

Proposition 1 has a great practical impact. It provides the possibility to learn mappings computable by piecewise-linear neural networks without having to develop specific training methods taking into account the discontinuity of the partial derivatives of their activation functions, which even in the case of extremely simple piecewise-linear activation functions with only three linearity intervals leads to algorithms very different from what is
available in common ANN software (Gad, Atiya, Shaheen, & El-Dessouki, 2000). Notice, however, that this approach is not equivalent to direct learning of the computable mappings in terms of computational complexity, since piecewise-linear neural networks have a lower Vapnik-Chervonenkis dimension than MLPs with smooth activation functions (Maass, 1997).

The piecewise linearity of activation functions together with the construction of computable mappings in definition 2 imply that locally, in finitely many separate areas, the mappings computed by piecewise-linear neural networks coincide with linear mappings between the input and output space of the network. This in turn means that a piecewise-linear neural network transforms linearly constrained sets in the input space into linearly constrained sets in the output space—in particular, polyhedra into polyhedra and pseudopolyhedra into pseudopolyhedra. That result, the importance of which will become apparent in section 3, is formulated in definition 5 and proposition 2.

Definition 5. Let n ∈ ℕ, and let E = {e_1, …, e_n} be the set of standard basis vectors of ℝ^n. Thus, for i = 1, …, n, the ith component of e_i equals 1, whereas all remaining components equal 0. Let further for each halfspace S in ℝ^n, a_S ∈ ℝ^n, a_S ≠ 0, and b_S ∈ ℝ be such that

S = {x ∈ ℝ^n : a_S·x ≤ b_S}   (2.31)

if S is closed, or

S = {x ∈ ℝ^n : a_S·x < b_S}   (2.32)

if S is open. Then the following notation will be used:

a. P_n = {P ⊂ ℝ^n : P ≠ ∅ & P = ⋂_{S∈S_P} S, where S_P is a finite set of closed halfspaces in ℝ^n}; the elements of P_n are called polyhedra.
b. P̃_n = {P ⊂ ℝ^n : P ≠ ∅ & P = ⋂_{S∈S_P} S, where S_P is a finite set of halfspaces in ℝ^n}; the elements of P̃_n are called pseudopolyhedra.
c. H_n = {H ∈ P_n : H = S_1 ∩ S′_1 ∩ … ∩ S_q ∩ S′_q & q ≤ n & (∀j ∈ q̂)(∃e_j ∈ E)(∃b_j, b′_j ∈ ℝ) b_j ≤ b′_j & a_{S_j} = −e_j & b_{S_j} = −b_j & a_{S′_j} = e_j & b_{S′_j} = b′_j & (∀j, j′ ∈ q̂) j ≠ j′ ⇒ e_j ≠ e_{j′}}; the elements of H_n are called hyperrectangles.
d. H̃_n = {H ∈ P̃_n : H = S_1 ∩ S′_1 ∩ … ∩ S_q ∩ S′_q & q ≤ n & (∀j ∈ q̂)(∃e_j ∈ E)(∃b_j, b′_j ∈ ℝ) b_j ≤ b′_j & a_{S_j} = −e_j & b_{S_j} = −b_j & a_{S′_j} = e_j & b_{S′_j} = b′_j & (∀j, j′ ∈ q̂) j ≠ j′ ⇒ e_j ≠ e_{j′}}; the elements of H̃_n are called pseudohyperrectangles.
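Definition 5 admits a direct computational reading. In the minimal sketch below, a halfspace is encoded as a triple (a_S, b_S, closed), an encoding chosen for illustration:

```python
import numpy as np

def in_polyhedron(x, halfspaces):
    """Membership of x in an intersection of halfspaces (definition 5);
    each halfspace is a triple (a, b, closed) encoding {x : a.x <= b}
    when closed, or {x : a.x < b} when open."""
    return all((a @ x <= b) if closed else (a @ x < b)
               for a, b, closed in halfspaces)

def hyperrectangle(n, bounds):
    """Halfspace representation of a hyperrectangle from H_n: for each
    bounded coordinate k with bounds (lo, hi), one halfspace has
    a_{S_j} = -e_j, b_{S_j} = -lo, and the other a_{S'_j} = e_j,
    b_{S'_j} = hi."""
    halfspaces = []
    for k, (lo, hi) in bounds.items():
        e = np.zeros(n)
        e[k] = 1.0
        halfspaces.append((-e, -lo, True))   # encodes x_k >= lo
        halfspaces.append((e, hi, True))     # encodes x_k <= hi
    return halfspaces
```

Replacing `True` by `False` in selected halfspaces yields pseudopolyhedra and pseudohyperrectangles in the same representation.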
Remark 1. Observe that this definition implies the following relationships between the defined properties of a set Q ⊂ ℝ^n:

Q is a hyperrectangle → Q is a pseudohyperrectangle
        ↓                           ↓
Q is a polyhedron    → Q is a pseudopolyhedron   (2.33)
Notice that according to definition 5, in particular, all proper linear subspaces of ℝ^n are polyhedra. Indeed, let Q ⊊ ℝ^n be a linear subspace of ℝ^n, and let B_{Q⊥} be a base of the orthogonal complement of Q, that is, of the set

Q⊥ = {x ∈ ℝ^n : (∀y ∈ Q) x·y = 0}.   (2.34)

Then Q fulfills the conditions of definition 5a with the following set of closed halfspaces:

S_Q = {S ⊂ ℝ^n : S is a closed halfspace & a_S ∈ B_{Q⊥} & b_S = 0}
∪ {S ⊂ ℝ^n : S is a closed halfspace & −a_S ∈ B_{Q⊥} & b_S = 0}.   (2.35)
Proposition 2. Let L = ((n_I, n_H, n_O), f) be a piecewise-linear neural network, and F : ℝ^{n_I} → ℝ^{n_O} be a nonconstant mapping computable by L. Then:

i. There exist polyhedra P_1, …, P_s ∈ P_{n_I} such that P_1 ∪ … ∪ P_s = ℝ^{n_I} and the restriction of F to any P_j, j = 1, …, s, is linear.
ii. F is continuous on ℝ^{n_I}.
iii. For each pseudopolyhedron Q ∈ P̃_{n_O} such that F⁻¹(Q) ≠ ∅, there exist pseudopolyhedra P_1, …, P_r ∈ P̃_{n_I}, r ≤ s, such that Q = F(⋃_{j=1}^{r} P_j) and F is linear on each of the pseudopolyhedra P_1, …, P_r (cf. Figure 1).
iv. If in particular Q is a polyhedron, also P_1, …, P_r ∈ P_{n_I} are polyhedra.
Proof. In addition to the notation introduced in definition 5, the following notation will be used in connection with a polyhedron P_1 ∈ P_n, or in connection with a pseudopolyhedron P_2 ∈ P̃_n:

P_n|P_1 = {P ∈ P_n : P ⊂ P_1},   (2.36)
P̃_n|P_2 = {P ∈ P̃_n : P ⊂ P_2}.   (2.37)
First, consider an arbitrary nonconstant linear mapping W : ℝ^m → ℝ^n defined by means of an n × m-dimensional matrix A_W and a vector
Figure 1: Illustration of Proposition 2, part iii, for a mapping F : ℝ³ → ℝ², F(x) = ι₂(f(w_{1,1}·(1, x)), f(w_{1,2}·(1, x))) with ι₂ denoting the identity on ℝ², w_{1,1} = (−4, 1, 1, 0), w_{1,2} = (−4, 0, 1, 1), and the activation function f defined: t ≤ −36 ⇒ f(t) = 0; t ∈ [−36, −1.75] ⇒ f(t) = 0.0043t + 0.1555; t ∈ [−1.75, 1.75] ⇒ f(t) = 0.20115t + 0.5; t ∈ [1.75, 36] ⇒ f(t) = 0.0043t + 0.8445; t ≥ 36 ⇒ f(t) = 1.
b_W ∈ ℝ^n,

(∀x ∈ ℝ^m) W(x) = A_W x + b_W,   (2.38)

where A_W ≠ 0, due to the nonconstantness of W. Then for any closed halfspace S̄_O ⊂ ℝ^n,

W⁻¹(S̄_O) = {x ∈ ℝ^m : a_{S̄_O}·(A_W x + b_W) ≤ b_{S̄_O}}
= the whole space ℝ^m, if a_{S̄_O} A_W = 0 & a_{S̄_O}·b_W ≤ b_{S̄_O};
= ∅, if a_{S̄_O} A_W = 0 & a_{S̄_O}·b_W > b_{S̄_O};
= the closed halfspace S̄_I ⊂ ℝ^m with a_{S̄_I} = a_{S̄_O} A_W, b_{S̄_I} = b_{S̄_O} − a_{S̄_O}·b_W, if a_{S̄_O} A_W ≠ 0.   (2.39)
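The closed-halfspace case analysis of equation 2.39 can be sketched directly. The helper below is hypothetical, taking the halfspace {x : a·x ≤ b} and the affine map W(x) = Ax + b_W as inputs:

```python
import numpy as np

def halfspace_preimage(a, b, A, bW):
    """Preimage W^{-1}(S) of the closed halfspace S = {x : a.x <= b}
    under W(x) = A x + bW, following the three cases of equation 2.39.
    Returns ('all',), ('empty',), or ('halfspace', a', b') with
    a' = a A and b' = b - a.bW."""
    a_img = a @ A            # the row vector a_S A_W
    rhs = b - a @ bW         # b_S - a_S . b_W
    if np.allclose(a_img, 0.0):
        # a_S A_W = 0: the preimage is everything or nothing.
        return ('all',) if rhs >= 0 else ('empty',)
    return ('halfspace', a_img, rhs)
```

The open-halfspace case of equation 2.40 differs only in the strictness of the comparisons.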
Similarly, for any open halfspace S̊_O ⊂ ℝ^n,

W⁻¹(S̊_O) = {x ∈ ℝ^m : a_{S̊_O}·(A_W x + b_W) < b_{S̊_O}}
= the whole space ℝ^m, if a_{S̊_O} A_W = 0 & a_{S̊_O}·b_W < b_{S̊_O};
= ∅, if a_{S̊_O} A_W = 0 & a_{S̊_O}·b_W ≥ b_{S̊_O};
= the open halfspace S̊_I ⊂ ℝ^m with a_{S̊_I} = a_{S̊_O} A_W, b_{S̊_I} = b_{S̊_O} − a_{S̊_O}·b_W, if a_{S̊_O} A_W ≠ 0.   (2.40)

Consider now an arbitrary P = S_1 ∩ … ∩ S_q ∈ P̃_n, where S_1, …, S_q are halfspaces. If P ∩ W(ℝ^m) = ∅, then W⁻¹(P) = ∅. Hence, suppose P ∩ W(ℝ^m) ≠ ∅, and denote by x^{(0)} the projection of x ∈ ℝ^m onto A_W⁻¹(0), and by x^{(⊥)} its projection onto (A_W⁻¹(0))^⊥. Observe that:

1. For any Q ⊂ ℝ^n,

W⁻¹(Q) = {x ∈ ℝ^m : W(x) ∈ Q} = {x ∈ ℝ^m : W(x^{(0)} + x^{(⊥)}) ∈ Q} = {x ∈ ℝ^m : W(x^{(⊥)}) ∈ Q}
= {x ∈ (A_W⁻¹(0))^⊥ : W(x) ∈ Q} ⊕ A_W⁻¹(0)
= (W|(A_W⁻¹(0))^⊥)⁻¹(Q) ⊕ A_W⁻¹(0),   (2.41)

where Q_1 ⊕ Q_2 stands for the direct sum of sets Q_1, Q_2 ⊂ ℝ^m, that is, Q_1 ⊕ Q_2 = {x_1 + x_2 : x_1 ∈ Q_1 & x_2 ∈ Q_2}. In particular,

W⁻¹(P) = (W|(A_W⁻¹(0))^⊥)⁻¹(P) ⊕ A_W⁻¹(0),   (2.42)

and for each S ∈ S_P,

W⁻¹(S) = (W|(A_W⁻¹(0))^⊥)⁻¹(S) ⊕ A_W⁻¹(0).   (2.43)

2.

(W|(A_W⁻¹(0))^⊥)⁻¹(P) = (W|(A_W⁻¹(0))^⊥)⁻¹(⋂_{S∈S_P} S) = ⋂_{S∈S_P} (W|(A_W⁻¹(0))^⊥)⁻¹(S),   (2.44)

due to definition 5 and the fact that W|(A_W⁻¹(0))^⊥ is a one-to-one mapping.

3. For each S ∈ S_P,

(W|(A_W⁻¹(0))^⊥)⁻¹(S) = {x ∈ (A_W⁻¹(0))^⊥ : W(x) ∈ S} = W⁻¹(S) ∩ (A_W⁻¹(0))^⊥.   (2.45)
4. For any halfspace S_O ⊂ ℝ^n such that S_O ∩ W(ℝ^m) ≠ ∅,

W⁻¹(S_O) ∩ (A_W⁻¹(0))^⊥
= (A_W⁻¹(0))^⊥, if W⁻¹(S_O) = ℝ^m;
= the closed halfspace S^⊥ ⊂ (A_W⁻¹(0))^⊥ with a_{S^⊥} = a_{S_I}^{(⊥)}, b_{S^⊥} = b_{S_I}, if S_I = W⁻¹(S_O) is a closed halfspace;
= the open halfspace S^⊥ ⊂ (A_W⁻¹(0))^⊥ with a_{S^⊥} = a_{S_I}^{(⊥)}, b_{S^⊥} = b_{S_I}, if S_I = W⁻¹(S_O) is an open halfspace,   (2.46)

due to equations 2.34, 2.39, and 2.40, as well as due to remark 1. Consequently,

(W⁻¹(S_O) ∩ (A_W⁻¹(0))^⊥) ⊕ A_W⁻¹(0)
= ℝ^m, if W⁻¹(S_O) = ℝ^m;
= the closed halfspace S ⊂ ℝ^m with a_S = a_{S_I}^{(⊥)}, b_S = b_{S_I}, if S_I = W⁻¹(S_O) is a closed halfspace;
= the open halfspace S ⊂ ℝ^m with a_S = a_{S_I}^{(⊥)}, b_S = b_{S_I}, if S_I = W⁻¹(S_O) is an open halfspace.   (2.47)

Combining equations 2.42 to 2.47 and taking into account P ∩ W(ℝ^m) ≠ ∅ leads to

W⁻¹(P) = ⋂_{S∈S_P} (W⁻¹(S) ∩ (A_W⁻¹(0))^⊥) ⊕ A_W⁻¹(0)
= ℝ^m, if (∀S ∈ S_P) W⁻¹(S) = ℝ^m;
∈ P_m, if P ∈ P_n & (∃j ∈ q̂) W⁻¹(S_j) ≠ ℝ^m;
∈ P̃_m, else.   (2.48)

In addition, in the following two cases, the first possibility (W⁻¹(P) = ℝ^m) cannot occur:

- If P is a hyperrectangle with card S_P = 2n, because then W⁻¹(P) = ℝ^m would imply e_i A_W = 0 for all standard basis vectors e_i, i = 1, …, n, which contradicts A_W ≠ 0.
- If W⁻¹(P) ≠ ∅ and P ⊂ W(P′) for some P′ ∈ P̃_m, since then W⁻¹(P) ⊂ P′ ≠ ℝ^m. Observe that this result, combined with equations 2.36, 2.37, and 2.48, implies W⁻¹(P) ∈ P̃_m|P′, in particular, W⁻¹(P) ∈ P_m|P′ if P ∈ P_n, P′ ∈ P_m.
Next, let w = (w_{1,1}, …, w_{1,n_H}, w_{2,1}, …, w_{2,n_O}) ∈ ℝ^{(n_I+1)n_H+(n_H+1)n_O} be such that F = F_{f,w}. Let further b_1 ∈ ℝ^{n_H} and A_1 be an n_H × n_I-dimensional matrix such that the rows of (b_1, A_1) are w_{1,1}, …, w_{1,n_H}, whereas b_2 ∈ ℝ^{n_O} and A_2 be an n_O × n_H-dimensional matrix such that the rows of (b_2, A_2) are w_{2,1}, …, w_{2,n_O}. By means of the pairs (A_1, b_1) and (A_2, b_2), linear mappings W_1 and W_2, respectively, can be defined according to equation 2.38, that is, A_{W_1} = A_1, b_{W_1} = b_1, A_{W_2} = A_2, b_{W_2} = b_2.

To prove i, consider the linearity intervals I_1, …, I_{p+1} of f, denote

P_I = {W_1⁻¹(I_{k_1} × … × I_{k_{n_H}}) : 1 ≤ k_1, …, k_{n_H} ≤ p + 1},   (2.49)

and define f^{*n_H} : ℝ^{n_H} → ℝ^{n_H} to be the component-wise application of the activation function f, that is,

(∀x = (x^1, …, x^{n_H}) ∈ ℝ^{n_H}) f^{*n_H}(x) = (f(x^1), …, f(x^{n_H})).   (2.50)

Observe that for each P = W_1⁻¹(I_{k_1} × … × I_{k_{n_H}}) ∈ P_I, the restriction f^{*n_H}|W_1(P) of f^{*n_H} to W_1(P) = I_{k_1} × … × I_{k_{n_H}}, 1 ≤ k_1, …, k_{n_H} ≤ p + 1, is a linear mapping. This fact together with equations 2.33 and 2.48 to 2.50 implies:

- Each P = W_1⁻¹(I_{k_1} × … × I_{k_{n_H}}) ∈ P_I is a polyhedron from P_{n_I} (cf. Figure 2).
- For each P = W_1⁻¹(I_{k_1} × … × I_{k_{n_H}}) ∈ P_I, F(P) = W_2(f^{*n_H}(W_1(P))). Thus, F is linear on P; that is, there exist an n_O × n_I-dimensional matrix A_P and a vector b_P ∈ ℝ^{n_O} such that the function F_P, defined

(∀x ∈ ℝ^{n_I}) F_P(x) = A_P x + b_P,   (2.51)

fulfills F|P = F_P|P.
- ⋃_{P∈P_I} P = ⋃_{1≤k_1,…,k_{n_H}≤p+1} W_1⁻¹(I_{k_1} × … × I_{k_{n_H}}) = W_1⁻¹(⋃_{1≤k_1,…,k_{n_H}≤p+1} I_{k_1} × … × I_{k_{n_H}}) = W_1⁻¹(ℝ^{n_H}) = ℝ^{n_I}.

This proves i with {P_1, …, P_s} = P_I.
Figure 2: W_1⁻¹(I_3 × I_4) for the mapping F from Figure 1.
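The cells W_1⁻¹(I_{k_1} × … × I_{k_{n_H}}) of equation 2.49 can be enumerated numerically. The sketch below, with illustrative breakpoints and weights rather than the paper's, labels each input by the tuple of linearity intervals its hidden pre-activations fall into; F is affine on every such cell:

```python
import numpy as np

# Breakpoints of an illustrative piecewise-linear sigmoid f with
# linearity intervals I1 = (-inf, -2], I2 = [-2, 2], I3 = [2, +inf).
BREAKS = np.array([-2.0, 2.0])

def f(t):
    return np.clip(0.5 + 0.25 * t, 0.0, 1.0)

def cell(x, W1, b1):
    """Indices (k_1, ..., k_nH) of the linearity intervals containing the
    hidden pre-activations W1 x + b1; the inputs sharing one index tuple
    form the polyhedron W1^{-1}(I_{k_1} x ... x I_{k_nH}) of eq. 2.49."""
    return tuple(int(k) for k in np.searchsorted(BREAKS, W1 @ x + b1))

def F(x, W1, b1, W2, b2):
    """The computed mapping; linear on every cell (proposition 2, i)."""
    return W2 @ f(W1 @ x + b1) + b2
```

Two inputs with the same cell index lie in one linear piece, so F restricted to the segment between them is affine.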
As to ii, the continuity of F follows from definition 2 and the continuity of the scalar product, as well as from the continuity of f, established in lemma 1. To prove iii and iv, consider a pseudopolyhedron Q ∈ P̃_{n_O} with F⁻¹(Q) ≠ ∅, and define

P_Q = {P ∩ F⁻¹(Q) : P ∈ P_I & P ∩ F⁻¹(Q) ≠ ∅}.   (2.52)

Combining equation 2.52 with 2.48, with definition 5, and with the above-proved properties of P_I = {P_1, …, P_s} yields:

- Card P_Q, the cardinality of P_Q, fulfills card P_Q ≤ s.
- Each P ∈ P_Q is a pseudopolyhedron.
- If Q is a polyhedron, then each P ∈ P_Q is also a polyhedron.
- F is linear on each P ∈ P_Q.
- F(⋃P_Q) = F(⋃_{q′=1}^{s} (P_{q′} ∩ F⁻¹(Q))) = F((⋃_{q′=1}^{s} P_{q′}) ∩ F⁻¹(Q)) = F(ℝ^{n_I} ∩ F⁻¹(Q)) = F(F⁻¹(Q)) = Q.

This proves iii and iv with {P_1, …, P_r} = P_Q.

Remark 2. Propositions 1 and 2 can be extended to multilayer perceptrons with an arbitrary number of hidden layers, provided the assumptions on the error function are appropriately extended. In particular, the examples 2.22 and 2.23 remain valid. The reasons that the restriction to one hidden layer has been adopted here are (1) the fact that even multilayer perceptrons with one hidden layer and linear outputs have the valuable approximation property (Hornik, 1991) and (2) complexity considerations. From equations 2.49 and 2.52 it follows that the number of different linear pieces of F is O(p^{n_I}). Extending proposition 2 to multilayer perceptrons with L hidden layers would lead to O(p^{∑_{i=1}^{L} n_i}) different linear pieces of F, in particular, to O((p^{n_I})^L) if n_i ≤ n_I for i = 1, …, L. Hence, this would further exponentially increase the already exponential complexity of mappings computable by piecewise-linear neural networks.

3 Extraction of Boolean Rules

Consider a piecewise-linear neural network M = ((n_I, n_H, n_O), f) and suppose that in any input-output pair (x = (x^1, …, x^{n_I}), y = (y^1, …, y^{n_O})) ∈ ℝ^{n_I} × ℝ^{n_O} used for training M, the numbers x^1, …, x^{n_I}, y^1, …, y^{n_O} are values of some variables X_1, …, X_{n_I}, Y_1, …, Y_{n_O} capturing quantifiable properties of objects in the application domain. Then for each P ∈ P̃_{n_I}, the statement (X_1, …, X_{n_I}) ∈ P is an n_I-ary Boolean predicate; similarly, (Y_1, …, Y_{n_O}) ∈ Q is an n_O-ary Boolean predicate for each Q ∈ P̃_{n_O}.
Consequently, the fact that the reciprocal image of Q in a mapping computed by the network is a union of pseudopolyhedra P_1, …, P_r, stated by proposition 2 iii, can be reformulated as the Boolean implication

(X_1, …, X_{n_I}) ∈ ⋃_{j=1}^{r} P_j → (Y_1, …, Y_{n_O}) ∈ Q,   (3.1)

or, equivalently, as the conjunction of the Boolean implications

(X_1, …, X_{n_I}) ∈ P_j → (Y_1, …, Y_{n_O}) ∈ Q, j = 1, …, r.   (3.2)
A drawback of the formulation 3.2 is that the (pseudo)polyhedra P_1, …, P_r and Q can be very complicated sets (see Figure 3). This makes the n_I-ary predicates (X_1, …, X_{n_I}) ∈ P_1, …, (X_1, …, X_{n_I}) ∈ P_r and the n_O-ary predicate (Y_1, …, Y_{n_O}) ∈ Q quite incomprehensible, and the overall usefulness of implications 3.1 and 3.2 low. Logicians have long been aware of such difficulties with predicates of higher arity. Therefore, observational logic (the branch of Boolean logic devoted to the logical treatment of data analysis) basically deals only with monadic calculi, which contain merely unary predicates (Hájek & Havránek, 1978). In this context, notice that unary predicates are closely connected to the particular kinds of polyhedra and pseudopolyhedra recalled in definition 5—hyperrectangles and pseudohyperrectangles. Indeed, let H be a pseudohyperrectangle in ℝ^{n_I}

[Axis label in the figure: SD(Robackia demeierei) > (1/10) MaxSD(Robackia demeierei)]

Figure 3: A two-dimensional cut of an example union of polyhedra in the input space of a neural network, corresponding to an interval in one dimension of the output space.
with projections H_1, …, H_{n_I}. Then the n_I-ary predicate (X_1, …, X_{n_I}) ∈ H is equivalent to a conjunction of unary predicates,

(X_1, …, X_{n_I}) ∈ H ≡ ∧_{k∈I_H} X_k ∈ H_k,   (3.3)

where I_H = {k : H_k ≠ ℝ}. Similarly for a pseudohyperrectangle Q in ℝ^{n_O} with projections Q_1, …, Q_{n_O},

(Y_1, …, Y_{n_O}) ∈ Q ≡ ∧_{k∈O_Q} Y_k ∈ Q_k,   (3.4)

where O_Q = {k : Q_k ≠ ℝ}. Consequently, implication 3.2 turns to

∧_{k∈I_H} X_k ∈ H_k → ∧_{k∈O_Q} Y_k ∈ Q_k,   (3.5)

which is exactly the kind of implication that is studied in observational predicate logic (Hájek & Havránek, 1978). Unfortunately, even if the (pseudo)polyhedron Q in equation 3.2 is chosen to be a hyperrectangle or a pseudohyperrectangle, proposition 2 guarantees only the possibility of obtaining implications of the kind

(X_1, …, X_{n_I}) ∈ P → ∧_{k∈O_Q} Y_k ∈ Q_k,   (3.6)

for some P ∈ P̃_{n_I} or P ∈ P_{n_I}. To arrive at an implication of the kind in equation 3.5, it is necessary to replace P with a pseudohyperrectangle or hyperrectangle in the input space of the net (see Figure 4). Several possibilities of how such a replacement can be accomplished, and their impact on the extracted sets of rules, will be discussed in sections 3.1 and 3.2.

3.1 Basic Algorithm. Whether a particular (pseudo)hyperrectangle H is feasible to replace a particular (pseudo)polyhedron P, and more generally, whether P is feasible to be replaced at all, depends on various conditions, most importantly on our dissatisfaction with that part of P that does not belong to H and with that part of H that does not belong to P—that is, on our dissatisfaction with the symmetric difference P△H of P and H. Let us denote that dissatisfaction µ_P(P△H) (to indicate its possible dependence on P) and make the following assumptions about the conditions determining the eligibility of P for replacement and the feasibility of H to replace P:
Piecewise-Linear Networks and Rule Extraction from Data
2833
Figure 4: A two-dimensional projection of hyperrectangles that replace those polyhedra from Figure 3 that are replaceable according to equation 3.13. (The panels are labeled with the condition SD(Robackia demeierei) > (1/10) MaxSD(Robackia demeierei).)
i. To be eligible for replacement, P has to cover at least one point of the available data.

ii. The dissatisfaction is nonnegative (µ_P(P△H) ≥ 0).

iii. Increasing the area P△H leads to an increased dissatisfaction µ_P(P△H); that is, µ_P is increasing with respect to set inclusion.

iv. The dissatisfaction µ_P(P△H) is minimal among µ_P(P△H′) for hyperrectangles H′ in the considered space.

v. The dissatisfaction µ_P(P△H) does not exceed some prescribed limit ε > 0.

Conditions ii and iii imply that µ_P is a nonnegative monotone measure on the considered space, such that its domain contains P△H for any pseudopolyhedron P and any pseudohyperrectangle H in that space (e.g., a nonnegative Borel measure on the space). If the considered space is the input space of a neural network, two nonnegative Borel measures are particularly attractive:

A. The empirical distribution of ξ_1, ..., ξ_m, that is, the empirical distribution of the input components of the sequence (ξ_1, η_1), ..., (ξ_m, η_m) ∈ ℝ^{n_I} × ℝ^{n_O} of training pairs (observe that this measure does not depend on P)

B. The conditional empirical distribution of the input components ξ_1, ..., ξ_m of the sequence of training pairs, conditioned on P (hence, also dependent on P)

An important property of measures A and B, not valid for general nonnegative Borel measures, is that for any polyhedron P in the input space
of the network, a hyperrectangle H_P in that space can be found such that condition iv holds, that is,

µ_P(P△H_P) = min{µ_P(P△H′) : H′ is a hyperrectangle in the input space of the network}.   (3.7)
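For measure A, the dissatisfaction of a candidate hyperrectangle is just the fraction of training inputs falling into the symmetric difference P△H, so the minimization in equation 3.7 can be sketched as an exhaustive search over boxes whose faces lie at sample coordinates. This is an illustration only (the helper names are not from the article), feasible only in low dimension:

```python
from itertools import product

def empirical_sym_diff(points, in_P, in_H):
    """Measure A applied to the symmetric difference: the fraction of
    training inputs lying in exactly one of P and H."""
    return sum(1 for x in points if in_P(x) != in_H(x)) / len(points)

def best_box(points, in_P):
    """Exhaustive search for the axis-aligned box H minimizing the
    empirical measure of P (symmetric difference) H; for measure A it
    suffices to place candidate faces at the observed coordinates."""
    dim = len(points[0])
    coords = [sorted({x[i] for x in points}) for i in range(dim)]
    best, best_mu = None, float("inf")
    for lows in product(*coords):
        for highs in product(*coords):
            if any(l > h for l, h in zip(lows, highs)):
                continue  # not a valid box
            in_H = lambda x, lo=lows, hi=highs: all(
                lo[i] <= x[i] <= hi[i] for i in range(dim))
            mu = empirical_sym_diff(points, in_P, in_H)
            if mu < best_mu:
                best, best_mu = (lows, highs), mu
    return best, best_mu
```

For instance, with four sample points and P the halfplane x + y ≤ 1, the box spanned by the two points inside P already reaches dissatisfaction 0.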
Due to conditions i and v, we can restrict attention to polyhedra from the set

C = {P_j : j = 1, ..., r & µ_{P_j}(P_j△H_{P_j}) ≤ ε & P_j covers at least one point of the available data}.   (3.8)
Replacing P with H_P for polyhedra P ∈ C yields a system of implications of the kind (see implication 3.5)

⋀_{i∈I_{H_P}} X_i ∈ H_P^i → ⋀_{k∈O_Q} Y_k ∈ Q_k,   (3.9)

which is, in turn, equivalent to the implication

⋁_{P∈C} ⋀_{i∈I_{H_P}} X_i ∈ H_P^i → ⋀_{k∈O_Q} Y_k ∈ Q_k.   (3.10)
Finally, we can also restrict attention to hyperrectangles contained within the bounding box of the polyhedron P. Formally, the bounding box of P is a hyperrectangle B(P) = I_1^B × ... × I_{n_I}^B such that

inf I_i^B = inf{x_i : (x_1, ..., x_{n_I}) ∈ P} & sup I_i^B = sup{x_i : (x_1, ..., x_{n_I}) ∈ P} for i = 1, ..., n_I.   (3.11)
Notice that this definition also covers intervals that are unbounded from the left (inf I_k^B = −∞) or from the right (sup I_k^B = +∞). The possibility of considering only hyperrectangles contained within the bounding box of P is a consequence of the fact that if H ∈ H_{n_I} leads to the minimal dissatisfaction, then H ∩ B(P) also leads to the minimal dissatisfaction, implied by the inequality

µ_P(P△H) ≥ µ_P(P△(H ∩ B(P))),   (3.12)

which follows from conditions iii and iv, and from P△(H ∩ B(P)) ⊂ P△H.
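When a bounded polyhedron is stored by its vertices (an assumption made here for illustration; the article does not prescribe a representation), the bounding box of equation 3.11 reduces to coordinate-wise minima and maxima, since each infimum and supremum over P is then attained at a vertex:

```python
def bounding_box(vertices):
    """Bounding box B(P) of a bounded polyhedron P given by its
    vertices: the smallest hyperrectangle I_1^B x ... x I_n^B
    containing P (equation 3.11)."""
    dim = len(vertices[0])
    return [(min(v[i] for v in vertices), max(v[i] for v in vertices))
            for i in range(dim)]
```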
The above considerations already allow formulating a basic algorithm for the extraction of Boolean rules from data by means of multilayer perceptrons (a preliminary version has been presented in Holeňa, 2002a):

Algorithm 1

Input:

- Disjoint sets {X^1, ..., X^{n_I}}, {Y^1, ..., Y^{n_O}} of real-valued variables capturing properties of objects in the application domain
- A set of predicates {Y^k ∈ O_k : k ∈ O}, where ∅ ≠ O ⊂ {1, ..., n_O}, and for each k ∈ O, O_k is an interval different from the whole ℝ
- Constants n_H ∈ N and ε > 0
- A continuous sigmoidal function f
- A sequence of input-output training pairs (ξ_1, η_1), ..., (ξ_m, η_m) ∈ ℝ^{n_I} × ℝ^{n_O}
- A system (µ_P)_{P∈P̃_{n_I}} of nonnegative Borel measures on ℝ^{n_I}

1. Initialize the set of extracted Boolean rules by R = ∅.
2. Construct a hyperrectangle H_0 in ℝ^{n_O} such that for each k ∈ O, the kth projection of H_0 is O_k, and if O ≠ {1, ..., n_O}, then any remaining projections of H_0 are ℝ.
3. Initialize an MLP M = ((n_I, n_H, n_O), f).
4. Training M with (ξ_1, η_1), ..., (ξ_m, η_m), obtain a mapping F computable by M.
5. For a piecewise-linear g close enough to f in Cς, construct the counterpart G of F in ((n_I, n_H, n_O), g).
6. Construct G^{−1}(H_0) in ℝ^{n_I}.
7. Decompose G^{−1}(H_0) as G^{−1}(H_0) = ⋃_{j=1}^r P_j, where P_1, ..., P_r ∈ P̃_{n_I}.
8. For each j = 1, ..., r such that P_j ∩ {ξ_1, ..., ξ_m} ≠ ∅ and there exists a hyperrectangle H_j in ℝ^{n_I} fulfilling

µ_{P_j}(P_j△H_j) = min{µ_{P_j}(P_j△H′) : H′ is a hyperrectangle in ℝ^{n_I}} ≤ ε :   (3.13)

a. Find the intervals I_1, ..., I_{n_I} such that H_j = I_1 × ... × I_{n_I}.
b. Define the set I_j = {i : I_i ≠ ℝ}.
c. Update the set of rules by R = R ∪ {⋀_{i∈I_j} X^i ∈ I_i → ⋀_{k∈O} Y^k ∈ O_k}.

Output: The Boolean implication that is the disjunction of rules from the extracted set R.
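The control flow of step 8 can be sketched as follows. The helpers `mu` and `best_box`, as well as the polyhedron membership predicates, are hypothetical stand-ins for the constructions of steps 1 to 7; `None` marks a projection equal to the whole real line:

```python
def extract_rules(polyhedra, samples, mu, best_box, eps, consequent):
    """Skeleton of step 8 of algorithm 1: keep a polyhedron P only if it
    covers a training input (condition i) and its best hyperrectangle H
    has dissatisfaction at most eps (conditions iv and v); the rule
    antecedent keeps the projections of H different from the whole line."""
    rules = []
    for P in polyhedra:           # P is a membership predicate
        if not any(P(x) for x in samples):
            continue              # condition i fails
        H = best_box(P)           # list of intervals, None = whole line
        if mu(P, H) > eps:
            continue              # condition v fails
        antecedent = [(i, I) for i, I in enumerate(H) if I is not None]
        rules.append((antecedent, consequent))
    return rules
```

The output, a list of (antecedent, consequent) pairs, corresponds to the rule set R whose disjunction forms the extracted Boolean implication.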
3.2 Two Modifications. In this section, two modifications of the proposed approach are described. The first concerns only assumption iv: the assumption that the dissatisfaction µ_P(P△H′) as a function of hyperrectangles H′ reaches its minimum for the found hyperrectangle H. The proposed modification is motivated by the fact that the search for such a hyperrectangle H suffers from the curse of dimensionality phenomenon. For example, to find a hyperrectangle H fulfilling condition iv if µ_P is any of the measures A or B requires computing the value µ_P(P△H′) for O(m^{n_I+1}) different hyperrectangles H′. To eliminate the curse of dimensionality, the proposed modification attempts to reduce the search for a hyperrectangle H in the input space of the network to the search for intervals corresponding to the individual input dimensions. To this end, condition iv is replaced with the following two conditions:

iv′a. The dissatisfaction µ_P is decomposable into its one-dimensional projections, µ_P = µ_P^{(1)} × ... × µ_P^{(n_I)}.

iv′b. If P has projections P^{(1)}, ..., P^{(n_I)} and H_j = I_1 × ... × I_{n_I}, then for each i = 1, ..., n_I, µ_P^{(i)}(P^{(i)}△I_i) is minimal among µ_P^{(i)}(P^{(i)}△I′) with intervals I′ closed with respect to ℝ,

and condition v is replaced with

v′. For no i ∈ n̂_I does µ_P^{(i)}(P^{(i)}△I_i) exceed a prescribed limit ε′ > 0.

Conditions iv′a and iv′b are weaker than condition iv; they do not imply its validity. Similarly, condition v′ does not imply the validity of condition v, no matter what the relationship is between the constants ε and ε′. Hence, this modification is not necessarily superior to the basic algorithm described in the preceding section. It avoids the curse of dimensionality through a restriction to a search of rules with one-dimensional antecedents of the kind X^i ∈ I_i for some interval I_i, from which only subsequently conjunctions are formed, but pays for it with giving up the validity of conditions iv and v. The basic algorithm 1, on the contrary, enforces the validity of conditions iv and v through a search of rules with conjunctive antecedents of the form ⋀_{i=1}^{n_I} X^i ∈ I_i, but pays for it with a high complexity.

Examples of measures fulfilling iv′a are the following counterparts of the above-introduced measures A and B:

A′: The product of the marginal empirical distributions of the input components of the training sequence (ξ_1, η_1), ..., (ξ_m, η_m). This measure does not depend on the (pseudo)polyhedron to be replaced, and if the marginal empirical distributions of the input components of the training sequence are mutually independent, it coincides with the measure A.
B′: The product of the marginal conditional empirical distributions of the input components of the training sequence (ξ_1, η_1), ..., (ξ_m, η_m), conditioned by the (pseudo)polyhedron to be replaced (hence, it depends on that (pseudo)polyhedron). If the marginal conditional empirical distributions of the input components of the training sequence are mutually independent, this measure coincides with measure B.

In connection with condition iv′b, measures A′ and B′ require computing one-dimensional projections µ_P^{(i)}(P^{(i)}△I) for O(n_I m²) closed intervals I ⊂ ℝ.

Contrary to this first modification, the second modification concerns directly the starting principle of the proposed rectangularization approach: the principle that the decision whether to replace P with H should rely on our dissatisfaction with the symmetric difference P△H. This modification is based on the point of view that the choice of a hyperrectangle H in the input space of the network should not rely on only a particular polyhedron P_j, j ∈ r̂, but on a broader context of that polyhedron. Therefore, such a rule extraction will be called contextual extraction, whereas rule extraction according to the original principle underlying the basic algorithm 1 will be called context-free extraction. More precisely, the considered context of a polyhedron P_j is formed by all points from the polyhedra P_1, ..., P_r that lie inside the bounding box of P_j; thus, it is the set B(P_j) ∩ (P_1 ∪ ... ∪ P_r). Hence, the starting principle is modified in such a way that the decision whether to replace a particular P ∈ {P_1, ..., P_r} with a hyperrectangle H should rely on our dissatisfaction with the symmetric difference (B(P) ∩ ⋃_{j=1}^r P_j)△H instead of the dissatisfaction with P△H. Observe that (B(P) ∩ ⋃_{j=1}^r P_j)△H ⊃ (B(P) ∩ ⋃_{j=1}^r P_j)△(H ∩ B(P)); thus again, if H ∈ H_{n_I} leads to the minimal dissatisfaction, then H ∩ B(P) also leads to the minimal dissatisfaction. As in the case of one-dimensional and conjunctive antecedents, there is no superiority relationship between contextual and context-free extraction; the decision as to which of them to use depends on the importance of context in the currently solved task.

To take into account the fact that any of these two modifications may sometimes be advantageous and sometimes not, the decision whether the antecedents of the extracted rules will be one-dimensional or conjunctive, as well as whether contextual or context-free extraction will be employed, should form part of the input to the algorithm. Consequently, basic algorithm 1 needs to be extended as follows:

Algorithm 2

Input:
- Disjoint sets {X^1, ..., X^{n_I}}, {Y^1, ..., Y^{n_O}} of real-valued variables capturing properties of objects in the application domain
- A set of predicates {Y^k ∈ O_k : k ∈ O}, where ∅ ≠ O ⊂ {1, ..., n_O}, and for each k ∈ O, O_k is an interval different from the whole ℝ
- The decision whether the available data will be used to extract Boolean rules with one-dimensional antecedents of the kind X^i ∈ I_i for some interval I_i, from which only subsequently conjunctions will be formed, or will be used directly to extract Boolean rules with conjunctive antecedents of the form ⋀_{i=1}^{n_I} X^i ∈ I_i
- The decision whether contextual or context-free extraction will be employed
- Constants n_H ∈ N and ε > 0
- A continuous sigmoidal function f
- A sequence of input-output training pairs (ξ_1, η_1), ..., (ξ_m, η_m) ∈ ℝ^{n_I} × ℝ^{n_O}
- A system (µ_P)_{P∈P̃_{n_I}} of nonnegative Borel measures on ℝ^{n_I}, which in the case that rules were decided to be extracted with one-dimensional antecedents are decomposable into their one-dimensional projections, µ_P = µ_P^{(1)} × ... × µ_P^{(n_I)}, P ∈ P̃_{n_I}
1. Initialize the set of extracted Boolean rules by R = ∅.
2. Construct a hyperrectangle H_0 in ℝ^{n_O} such that for each k ∈ O, the kth projection of H_0 is O_k, and if O ≠ {1, ..., n_O}, then any remaining projections of H_0 are ℝ.
3. Initialize an MLP M = ((n_I, n_H, n_O), f).
4. Training M with (ξ_1, η_1), ..., (ξ_m, η_m), obtain a mapping F computable by M.
5. For a piecewise-linear g close enough to f in Cς, construct the counterpart G of F in ((n_I, n_H, n_O), g).
6. Construct G^{−1}(H_0) in ℝ^{n_I}.
7. Decompose G^{−1}(H_0) as G^{−1}(H_0) = ⋃_{j=1}^r P_j, where P_1, ..., P_r ∈ P̃_{n_I}.
8. Differentiate four cases according to the two input decisions:

i. If rules were decided to be extracted with conjunctive antecedents and context-free extraction was decided to be employed, then for each j = 1, ..., r such that P_j ∩ {ξ_1, ..., ξ_m} ≠ ∅ and there exists a hyperrectangle H_j in ℝ^{n_I} fulfilling

µ_{P_j}(P_j△H_j) = min{µ_{P_j}(P_j△H′) : H′ is a hyperrectangle in ℝ^{n_I}} ≤ ε :   (3.14)

a. Find the intervals I_1, ..., I_{n_I} such that H_j = I_1 × ... × I_{n_I}.
b. Define the set I_j = {i : I_i ≠ ℝ}.
c. Update the set of rules by R = R ∪ {⋀_{i∈I_j} X^i ∈ I_i → ⋀_{k∈O} Y^k ∈ O_k}.

ii. If rules were decided to be extracted with one-dimensional antecedents and context-free extraction was decided to be employed, then for each j = 1, ..., r such that P_j ∩ {ξ_1, ..., ξ_m} ≠ ∅ and there exist intervals I_1, ..., I_{n_I} closed with respect to ℝ and fulfilling for i = 1, ..., n_I

µ_{P_j}^{(i)}(P_j^{(i)}△I_i) = min{µ_{P_j}^{(i)}(P_j^{(i)}△I′) : I′ is an interval closed with respect to ℝ} ≤ ε :   (3.15)

a. Define the set I_j = {i : I_i ≠ ℝ}.
b. Update the set of rules by R = R ∪ {⋀_{i∈I_j} X^i ∈ I_i → ⋀_{k∈O} Y^k ∈ O_k}.

iii. If rules were decided to be extracted with conjunctive antecedents and contextual extraction was decided to be employed, then for each j = 1, ..., r such that P_j ∩ {ξ_1, ..., ξ_m} ≠ ∅:

a. For i = 1, ..., n_I, put I_i^B = closure of (inf{x_i : (x_1, ..., x_{n_I}) ∈ P_j}, sup{x_i : (x_1, ..., x_{n_I}) ∈ P_j}).
b. Put B(P_j) = I_1^B × ... × I_{n_I}^B.
c. Put P′_j = B(P_j) ∩ ⋃_{k=1}^r P_k. If there exists a hyperrectangle H_j in B(P_j) such that

µ_{P_j}(P′_j△H_j) = min{µ_{P_j}(P′_j△H′) : H′ is a hyperrectangle in B(P_j)} ≤ ε,   (3.16)

then further:
d. Find the intervals I_1, ..., I_{n_I} such that H_j = I_1 × ... × I_{n_I}.
e. Define the set I_j = {i : I_i ≠ ℝ}.
f. Update the set of rules by R = R ∪ {⋀_{i∈I_j} X^i ∈ I_i → ⋀_{k∈O} Y^k ∈ O_k}.

iv. If rules were decided to be extracted with one-dimensional antecedents and contextual extraction was decided to be employed, then for each j = 1, ..., r such that P_j ∩ {ξ_1, ..., ξ_m} ≠ ∅:

a. For i = 1, ..., n_I, put I_i^B = closure of (inf{x_i : (x_1, ..., x_{n_I}) ∈ P_j}, sup{x_i : (x_1, ..., x_{n_I}) ∈ P_j}).
b. Put B(P_j) = I_1^B × ... × I_{n_I}^B.
c. Put P′_j = B(P_j) ∩ ⋃_{k=1}^r P_k. If there exist intervals I_1, ..., I_{n_I} closed with respect to ℝ and fulfilling for i = 1, ..., n_I

µ_{P_j}^{(i)}(P′_j^{(i)}△I_i) = min{µ_{P_j}^{(i)}(P′_j^{(i)}△I′) : I′ is an interval closed with respect to ℝ} ≤ ε,   (3.17)

then further:
d. Define the set I_j = {i : I_i ≠ ℝ}.
e. Update the set of rules by R = R ∪ {⋀_{i∈I_j} X^i ∈ I_i → ⋀_{k∈O} Y^k ∈ O_k}.

Output: The Boolean implication that is the disjunction of rules from the extracted set R.

4 Applications

The extended algorithm 2 has been implemented in Matlab and has already been used in two important data mining applications, which will now be briefly sketched.

The first of them is an ecological application, belonging to the area of the ecology of biocoenoses. For a more detailed description of the application and for examples of obtained data mining results, the reader is referred to the published report (Holeňa, 2002b). Here, only a concise overview will be given. One very efficient way to increase the suitability of rivers for water transport is building groynes, but ecologists often fear the changes in the biocoenosis of the river and its banks to which groynes may lead. This is especially true for rivers in East European countries, where ecological considerations played a subordinate role until the 1980s. One of the most prominent examples of such rivers is the Czech and east German river Elbe. However, the complex relationships between the biocoenosis and the ecological factors characterizing a groyne field are only poorly understood. Therefore, research conducted between 1998 and 2000 on the Elbe River investigated those relationships in order to propose an empirically supported hydrological model capturing them and allowing an estimate of changes in the biocoenosis that prospective groynes would cause. Five groyne fields typical for the middle part of the river were chosen near the town Neuwerben, in which a large amount of empirical data has been collected. The main part of the collected data is formed by nearly 1000 samples of benthic fauna and more than 1400 samples of terrestrial fauna. Each sample includes all animals caught in special traps during a prescribed period of time ranging from several hours to two days.
Simultaneously with collecting those samples, various ecological factors were measured in the groyne fields, such as oxygen concentration, the diameter of ground grains, and the proportion of the ground material lost on ignition, whereas others, such as water level and flow velocity, were computed using a hydrodynamic simulation model. The collected data were first analyzed by biologists with respect to the species contained in them. Then some preprocessing was performed, and finally data mining was applied to the preprocessed data. It is the data mining of those samples where the approach to ANN-based extraction of Boolean rules outlined here has been employed, complementing methods of exploratory statistical analysis (Holeňa, 2000, 2002b). To this end, an MLP with the topology (6, 4, 20) has been constructed in the case of terrestrial data and an MLP with the topology (4, 5, 9) in the case of benthic data. An example of rules extracted in this application, visualized through several two-dimensional projections, is given in Figure 5 (the polyhedra and hyperrectangles in Figures 3 and 4 come from this application as well).

The second application belongs to the area of materials science. The need for knowledge extraction in this area is due to the immense number of possible compositions of materials for a particular purpose, combined with an ever increasing number of technologies that can be used to prepare the materials. Even with modern high-throughput technologies, only a very small fraction of those possibilities can be prepared and tested. To direct material designers and producers to the most promising ones, knowledge of
Figure 5: Four two-dimensional projections of the union of two hyperrectangles corresponding (up to rounding of values) to the following Boolean implication: (distance to the water line ≤ 35 m ∧ distance to the border of the permanent wetness area ≤ 30 m ∧ cover with litter ≤ 50%) ∨ (distance to the water line ≤ 65 m ∧ distance to the border of the permanent wetness area ≤ 50 m ∧ height of herbs ≤ 0.4 m ∧ cover with herbs ≤ 60% ∧ cover with litter ≤ 50%) ⇒ number of individuals of Elaphrus riparius in the sample = 1–2.
the dependence of material quality on its properties and way of preparation is crucial. In some cases, existing physical and chemical knowledge can be used to this end, but this is usually possible only for very simple materials. For more complex ones, primary sources of that knowledge are numerical simulation and knowledge extraction from available material testing data. The ANN-based knowledge extraction method described above has been used for the design of materials to serve as catalysts in industrially important chemical reactions (Holeňa & Baerns, 2003a, 2003b). In those situations, the quality of the material is typically measured with some quantifiable properties of the considered reaction, such as the yield and selectivity of particular reaction products or the conversion of particular reaction educts, assuming some standard reaction conditions. An example of extracted rules, this time visualized through a three-dimensional projection, is given in Figure 6.

5 Connection to the Extraction of Fuzzy Rules

Although piecewise-linear neural networks have been proposed here primarily for the extraction of rules of the Boolean logic, they also have a strong
Figure 6: A three-dimensional projection of the union of two hyperrectangles corresponding (up to rounding of values) to the following Boolean implication: (proportion of Fe ≤ 12% ∧ 37.7% ≤ proportion of Ga ≤ 37.9% ∧ 29.5% ≤ proportion of Mg ≤ 36.5%) ∨ (10.5% ≤ proportion of Fe ≤ 14.5% ∧ 36.7% ≤ proportion of Ga ≤ 36.9% ∧ 23.6% ≤ proportion of Mg ≤ 26.0 %) ⇒ C3 H6 yield ≥ 8%.
connection to McNaughton functions, encountered in the Łukasiewicz logic, one of the three fundamental fuzzy logics (Cignoli, D'Ottaviano, & Mundici, 2000; Hájek, 1998). That connection follows from the fact that if a piecewise-linear neural network is constructed using only rational numbers (which is actually a necessity, due to the finite precision of computers), then the mapping computed by the network is a rational McNaughton function. This fact is precisely formulated in proposition 3, after the employed new terms are introduced in definition 6. Finally, corollary 1 transfers a recent result concerning rule extraction from rational McNaughton functions to the mappings computed by piecewise-linear neural networks.

Definition 6. Let Z and Q denote, respectively, the sets of integers and rational numbers. Let further P_1, ..., P_r ∈ P_n and F : [0,1]^n → [0,1] be such that:

i. ⋃_{j=1}^r P_j = [0,1]^n.
ii. For each j = 1, ..., r, there exists a set S_j of closed halfspaces in ℝ^n fulfilling P_j = ⋂_{S∈S_j} S.
iii. For each j = 1, ..., r and each S ∈ S_j, a_S ∈ Q^n and b_S ∈ Q.
iv. For each j = 1, ..., r, the restriction F|P_j is linear; hence (∃a_j ∈ ℝ^n)(∃b_j ∈ ℝ)(∀x ∈ P_j) F(x) = a_j·x + b_j.

Then F is called:

a. McNaughton function if a_1, ..., a_r ∈ Z^n, b_1, ..., b_r ∈ Z.
b. Rational McNaughton function if a_1, ..., a_r ∈ Q^n, b_1, ..., b_r ∈ Q.

Proposition 3. Let L = ((n_I, n_H, 1), f) be a piecewise-linear neural network such that there exist t_1, t_2, ..., t_p, a_2, ..., a_p, b_2, ..., b_p ∈ Q fulfilling

f|(−∞, t_1] = 0 & f|[t_p, ∞) = 1 & (∀j ∈ {2, ..., p})(∀t ∈ [t_{j−1}, t_j]) f(t) = a_j t + b_j.   (5.1)
Let further w = (w_{1,1}, ..., w_{1,n_H}, w_2) ∈ Q^{(n_I+1)n_H + n_H + 1} be such that F_{f,w}([0,1]^{n_I}) ⊂ [0,1]. Then F_{f,w}|[0,1]^{n_I} is a rational McNaughton function.

Proof. Repeating the proof of part i of proposition 2 with F replaced through F_{f,w}, ℝ replaced through Q, and each vector space ℝ^n for n ∈ N replaced through Q^n, we again get a finite subset P_I of P_{n_I} such that ⋃_{P∈P_I} P = ℝ^{n_I} and F_{f,w} coincides on each P ∈ P_I with some linear function F_P, but this time all involved numbers, vectors, and matrices are rational numbers, vectors, and matrices; in particular,

P = ⋂_{S∈S_P} S ⇒ (∀S ∈ S_P) a_S ∈ Q^{n_I} & b_S ∈ Q   (5.2)

(∀x ∈ ℝ^{n_I}) F_P(x) = A_P x + b_P ⇒ A_P ∈ Q^{n_I} & b_P ∈ Q   (5.3)

for each P ∈ P_I. Thus, to complete the proof, it is sufficient to put

{P_1, ..., P_r} = {P ∩ [0,1]^{n_I} : P ∈ P_I}.   (5.4)
Remark 3. Notice that the condition F_{f,w}([0,1]^{n_I}) ⊂ [0,1] is actually not at all restrictive. Indeed, definition 2, and the entailed continuity of F_{f,w}, imply F_{f,w}([0,1]^{n_I}) = [a′, b′], with some a′, b′ ∈ Q, a′ ≤ b′. Then scaling F_{f,w} to F_{f,w′} with w′_{1,1} = w_{1,1}, ..., w′_{1,n_H} = w_{1,n_H}, w′_2 = (w_2/max(1, b′−a′)) diag(−a′, 1, ..., 1), where diag v stands for a diagonal matrix the diagonal elements of which are formed by the vector v, yields the validity of that condition for the scaled function, F_{f,w′}([0,1]^{n_I}) ⊂ [0,1]. In view of definition 4, the optimality of the original function F_{f,w} for (L, (ξ_1, η_1), ..., (ξ_m, η_m)) is equivalent to the optimality of the scaled function F_{f,w′} for (L, (ξ_1, (η_1−a′)/max(1, b′−a′)), ..., (ξ_m, (η_m−a′)/max(1, b′−a′))).

Corollary 1. Let L = ((n_I, n_H, 1), f) be the piecewise-linear neural network and w be the vector of parameters from proposition 3. Let further ‖Φ‖^M_{x_1=v_1,...,x_n=v_n} denote the truth value of a fuzzy logic formula Φ, modeled in a BL algebra over a set M, where Φ has n free variables x_i, i = 1, ..., n, each evaluated with a particular value v_i ∈ M. Then:

i. There exists a McNaughton function G_{f,w} : [0,1]^{n_I+1} → [0,1] such that

(∀v_1, ..., v_{n_I} ∈ [0,1]) F_{f,w}(v_1, ..., v_{n_I}) = max_{v_{n_I+1}} G_{f,w}(v_1, ..., v_{n_I+1}).   (5.5)

ii. There exists a formula Φ of the Łukasiewicz logic, modeled in the infinite-valued MV algebra over [0,1], that has n_I + 1 free variables x_1, ..., x_{n_I+1} and fulfills

(∀v_1, ..., v_{n_I+1} ∈ [0,1]) ‖Φ‖^{[0,1]}_{x_1=v_1,...,x_{n_I+1}=v_{n_I+1}} = G_{f,w}(v_1, ..., v_{n_I+1}).   (5.6)

iii. There exists a formula Φ of the Łukasiewicz logic, modeled in the infinite-valued MV algebra over [0,1], that has n_I + 1 free variables x_1, ..., x_{n_I}, y and fulfills

(∀v_1, ..., v_{n_I} ∈ [0,1]) ‖(∃y)Φ‖^{[0,1]}_{x_1=v_1,...,x_{n_I}=v_{n_I}} = F_{f,w}(v_1, ..., v_{n_I}).   (5.7)

Proof. Part i is a consequence of proposition 3 and of the fact that for each n ∈ N and each rational McNaughton function f : [0,1]^n → [0,1], there exists a McNaughton function g : [0,1]^{n+1} → [0,1] fulfilling

(∀x_1, ..., x_n ∈ [0,1]) f(x_1, ..., x_n) = max_{x_{n+1}} g(x_1, ..., x_{n+1}),   (5.8)

which has been proved in Aguzzoli and Mundici (2001). Part ii is simply the application of the well-known McNaughton theorem (McNaughton, 1951) to the function G_{f,w}. Finally, part iii follows from combining i and ii with the fact that ‖(∃x)Φ‖^M = sup_{v∈M} ‖Φ‖^M_{x=v} holds for a fuzzy logic formula Φ with a single free variable x, modeled in a BL algebra over a set M, provided sup_{v∈M} ‖Φ‖^M_{x=v} ∈ M (Hájek, 1998). If specifically Φ is a formula of the Łukasiewicz logic, modeled in the infinite-valued MV algebra over [0,1], then in view of the McNaughton theorem and definition 6, ‖Φ‖^{[0,1]}_{x=v} is a continuous function of v; thus, sup_{v∈[0,1]} ‖Φ‖^{[0,1]}_{x=v} = max_{v∈[0,1]} ‖Φ‖^{[0,1]}_{x=v}.

The proof of equation 5.8 in Aguzzoli and Mundici (2001), on which part i of the preceding corollary relies, is constructive; thus, it allows formulating an algorithm to construct the McNaughton function G_{f,w} for any function F_{f,w}. The original proof of the McNaughton theorem (McNaughton, 1951), on which part ii relies, is not constructive, but fortunately, two constructive proofs of that theorem have been proposed in the nineties: by Mundici (Cignoli et al., 2000; Mundici, 1994), as well as by Perfilieva and Tonis (Novák & Perfilieva, 1999; Novák, Perfilieva, & Močkoř, 1999). Any of them allows formulating an algorithm to construct a formula fulfilling equation 5.6 for any McNaughton function G_{f,w}, thus extending the preceding algorithm to the construction of a formula fulfilling equation 5.7 for any rational McNaughton function F_{f,w}. An algorithm based on Mundici's constructive proof of the McNaughton theorem has been presented in Holeňa (2005); an algorithm based on the proof by Perfilieva and Tonis is formulated below as algorithm 4. It repeatedly includes constructions of formulas of the Łukasiewicz propositional logic for [0,1]-range restrictions of linear functions with integer coefficients. Therefore, that construction is first formulated, for a general linear function of many variables, as a separate algorithm 3. Before, some derived connectives and other concepts employed in those algorithms are introduced in definition 7.

Definition 7. Let a ∈ ℝ, f be a real-valued function, and Φ, Ψ be formulas of the Łukasiewicz logic.
Let further &, →, and ¬ denote the conjunction, implication, and negation, respectively, in that logic, and 0̄, 1̄ its truth constants. Then the following additional notation will be used:

a. ⌈a⌉ for the integer ceiling of a ∈ ℝ, that is, ⌈a⌉ = min{z ∈ Z : a ≤ z}
b. lcd_Q(a_1, ..., a_n) for the least common denominator of a vector of rationals, that is, lcd_Q(a_1, ..., a_n) = lcd(d_1, ..., d_n), where (a_1, ..., a_n) = (c_1/d_1, ..., c_n/d_n), c_1, ..., c_n ∈ Z, d_1, ..., d_n ∈ N, and lcd denotes the least common denominator of natural numbers
c. f* for the [0,1]-range restriction of a real-valued function f, that is, f* = min(1, max(0, f))
d. Φ ⊕ Ψ for the Łukasiewicz disjunction, that is, Φ ⊕ Ψ = (¬Φ) → Ψ (Hájek, 1998)
e. Φ ∧ Ψ for the min-conjunction, that is, Φ ∧ Ψ = Φ & (Φ → Ψ) (Hájek, 1998)
f. Φ ∨ Ψ for the max-disjunction, that is, Φ ∨ Ψ = (Φ → Ψ) → Ψ (Hájek, 1998)
g. Φ^p for the p-times conjunction of Φ, p = 0, 1, 2, ..., that is, Φ^0 = 1̄, Φ^{p+1} = Φ^p & Φ
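In the infinite-valued MV algebra over [0,1], the basic connectives have the standard truth functions ¬a = 1 − a, a & b = max(0, a + b − 1), and a → b = min(1, 1 − a + b); the derived connectives of definition 7 then evaluate as follows (a sketch with function names of my choosing):

```python
def luk_neg(a):
    return 1.0 - a                       # negation

def luk_and(a, b):
    return max(0.0, a + b - 1.0)         # strong conjunction &

def luk_imp(a, b):
    return min(1.0, 1.0 - a + b)         # implication ->

def luk_or(a, b):
    """Lukasiewicz disjunction (definition 7d): (neg a) -> b."""
    return luk_imp(luk_neg(a), b)        # equals min(1, a + b)

def luk_min(a, b):
    """Min-conjunction (7e): a & (a -> b); evaluates to min(a, b)."""
    return luk_and(a, luk_imp(a, b))

def luk_max(a, b):
    """Max-disjunction (7f): (a -> b) -> b; evaluates to max(a, b)."""
    return luk_imp(luk_imp(a, b), b)

def luk_pow(a, p):
    """p-times conjunction (7g): a^0 = 1, a^(p+1) = a^p & a."""
    v = 1.0
    for _ in range(p):
        v = luk_and(v, a)
    return v                             # equals max(0, 1 - p*(1 - a))
```

The p-times conjunction of item g is what gives algorithm 4 its sharp thresholds: for a < 1, the value of a^p drops to 0 once p ≥ 1/(1 − a).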
Algorithm 3

Input: a linear function f : ℝ^n → ℝ, defined as

(∀ξ ∈ ℝ^n) f(ξ) = a·ξ + b,   (5.9)

where a = (a_1, ..., a_n) ∈ Z^n, b ∈ Z.

1. If a = 0, define

Φ_{f*} = 0̄ if f* = b* = 0,  Φ_{f*} = 1̄ if f* = b* = 1,   (5.10)

and go to Output; otherwise, put

k = min{i : a_i ≠ 0}.   (5.11)

2. Define a function f_− : ℝ^n → ℝ by

(∀ξ ∈ ℝ^n) f_−(ξ) = (a − e_k)·ξ + b if a_k > 0,  f_−(ξ) = 1 − (a + e_k)·ξ − b if a_k < 0,   (5.12)

where e_k is the unit vector from definition 5.

3. Repeat algorithm 3 with f_− at input, receiving a formula Φ_{f_−*} at output.

4. Repeat algorithm 3 with f_− + 1 at input, receiving a formula Φ_{(f_−+1)*} at output.

5. Define

Φ_{f*} = (Φ_{f_−*} ⊕ x_k) & Φ_{(f_−+1)*} if a_k > 0,  Φ_{f*} = ¬((Φ_{f_−*} ⊕ x_k) & Φ_{(f_−+1)*}) if a_k < 0.   (5.13)

Output: formula Φ_{f*} with free variables x_1, ..., x_n, fulfilling

(∀v_1, ..., v_n ∈ [0,1]) ‖Φ_{f*}‖^{[0,1]}_{x_1=v_1,...,x_n=v_n} = f*(v_1, ..., v_n).   (5.14)
Algorithm 4

Input: Function F_{f,w}, computable by a piecewise-linear neural network L = ((n_I, n_H, 1), f), fulfilling F_{f,w}([0,1]^{n_I}) ⊂ [0,1], and described by means of:

- Numbers t_1, t_2, ..., t_p, a_2, ..., a_p, b_2, ..., b_p ∈ Q, defining the activation function f via equation 5.1
- An n_H × n_I-dimensional rational matrix A_1, a column vector b_1 ∈ Q^{n_H}, a row vector A_2 ∈ Q^{n_H}, and a number b_2 ∈ Q such that the matrix (A_1, b_1) has rows w_{1,1}, ..., w_{1,n_H} ∈ Q^{n_I+1}, (A_2, b_2) = w_2 ∈ Q^{n_H+1}, and (w_{1,1}, ..., w_{1,n_H}, w_2) = w

1. For each (j_1, ..., j_{n_H}) ∈ {1, ..., p+1}^{n_H}, define

P_{(j_1,...,j_{n_H})} = {ξ ∈ [0,1]^{n_I} : (t_{j_1−1}, ..., t_{j_{n_H}−1}) ≤ A_1 ξ + b_1 ≤ (t_{j_1}, ..., t_{j_{n_H}})},   (5.15)

where the input numbers t_1, t_2, ..., t_p ∈ Q have been complemented with t_0 = −∞, t_{p+1} = +∞.

2. Define

P_{f,A_1,b_1} = {P_{(j_1,...,j_{n_H})} : (j_1, ..., j_{n_H}) ∈ {1, ..., p+1}^{n_H} & P_{(j_1,...,j_{n_H})} ≠ ∅ & (∀(j′_1, ..., j′_{n_H}) ∈ {1, ..., p+1}^{n_H} − {(j_1, ..., j_{n_H})}) [(j′_1, ..., j′_{n_H}) − (j_1, ..., j_{n_H}) ∈ {0,1}^{n_H} ⇒ P_{(j_1,...,j_{n_H})} ⊄ P_{(j′_1,...,j′_{n_H})}] & [((j_1, ..., j_{n_H}) − (j′_1, ..., j′_{n_H}) ∈ {0,1}^{n_H} & P_{(j′_1,...,j′_{n_H})} ⊂ P_{(j_1,...,j_{n_H})}) ⇒ P_{(j′_1,...,j′_{n_H})} = P_{(j_1,...,j_{n_H})}]}.   (5.16)

3. For each P = P_{(j_1,...,j_{n_H})} ∈ P_{f,A_1,b_1}, put

a_P = A_2 diag(a_{j_1}, ..., a_{j_{n_H}}) A_1,   (5.17)

b_P = A_2 diag(a_{j_1}, ..., a_{j_{n_H}}) b_1 + A_2 (b_{j_1}, ..., b_{j_{n_H}})ᵀ + b_2,   (5.18)

where the input numbers a_2, ..., a_p, b_2, ..., b_p ∈ Q have been complemented with a_1 = a_{p+1} = 0, b_1 = 0, b_{p+1} = 1.

4. For each P = P_{(j_1,...,j_{n_H})} ∈ P_{f,A_1,b_1}, define a subset P_{¬P} of P_{f,A_1,b_1} and a function g_P : ℝ^{n_I+1} → ℝ by

P_{¬P} = {P′ ∈ P_{f,A_1,b_1} : (∃(j′_1, ..., j′_{n_H}) ∈ {1, ..., p+1}^{n_H}) P′ = P_{(j′_1,...,j′_{n_H})} & (∃i ∈ {1, ..., n_H}) |j′_i − j_i| ≥ 2},   (5.19)

(∀ξ ∈ ℝ^{n_I+1}) g_P(ξ) = lcd_Q(a_P, b_P) (min((a_P, 0)·ξ + b_P, (0, ..., 0, 1)·ξ)).   (5.20)

5. For each P ∈ P_{f,A_1,b_1} and each S ∈ S_P, define the function g_S : ℝ^{n_I+1} → ℝ by

(∀ξ ∈ ℝ^{n_I+1}) g_S(ξ) = lcd_Q(a_S, b_S) ((a_S, 0)·ξ − b_S).   (5.21)

6. For each P ∈ P_{f,A_1,b_1}, perform algorithm 3 with g_P at input, receiving a formula Φ_{g_P*} at output.

7. For each P ∈ P_{f,A_1,b_1} and each S ∈ S_P, perform algorithm 3 with g_S at input, receiving a formula Φ_{g_S*} at output.

8. For each P ∈ P_{f,A_1,b_1}, put

q_P = ⌈1 / min{max_{S∈S_P} g_S(ξ) : ξ ∈ ⋃_{P′∈P_{¬P}} P′ ∪ ({0,1}^{n_I} − P)}⌉.   (5.22)

9. Define

Φ = ⋁_{P∈P_{f,A_1,b_1}} (Φ_{g_P*} & ⋀_{S∈S_P} (¬Φ_{g_S*})^{q_P}).   (5.23)
Output: Formula Φ with free variables x_1, ..., x_{n_I}, y, fulfilling equation 5.7.

6 Conclusion

In this article, the extraction of logical rules from data by means of multilayer perceptrons with piecewise-linear sigmoidal activation functions has been dealt with. This approach to ANN-based knowledge extraction has been proposed independently in Maire (1999) and Holeňa (2000) for the extraction of Boolean rules. In this article, two important theoretical properties of the considered kind of artificial neural networks have been established, allowing the elaboration of the basic ideas of the approach into several variants of an algorithm for the extraction of Boolean rules from data. Using an implementation of that algorithm, the practical applicability of the approach has been verified in two real-world applications. In addition, the article shows that the same kind of neural networks can be used also to extract fuzzy rules, more precisely, particular formulas of the Łukasiewicz logic. The ability of piecewise-linear neural networks to serve as a common theoretical framework for the extraction of both kinds of rules can be of crucial importance when switching between Boolean and fuzzy rules and attempting to compare or juxtapose the obtained results.
Piecewise-Linear Networks and Rule Extraction from Data
2849
Looking at the complexity of the algorithms proposed in sections 3 and 5 reveals the following similarities and differences:
• The overall complexity of the extraction of both kinds of rules is exponential with respect to the number $n_I$ of input neurons, as well as with respect to the number $n_H$ of hidden neurons—that is, there exist constants $C_B$ and $C_Ł$, and polynomials $p_B : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$ and $p_Ł : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$, such that the complexity of algorithms 1 and 2 is $O(C_B^{p_B(n_I, n_H)})$, whereas the complexity of algorithm 4 is $O(C_Ł^{p_Ł(n_I, n_H)})$.

• In spite of this basic similarity, the complexity of algorithm 4 is in all respects higher than that of algorithms 1 and 2—more precisely,

$$C_Ł > C_B \tag{6.1}$$

$$p_Ł > p_B \ \text{on}\ \mathbb{N} \times \mathbb{N} \tag{6.2}$$

$$p_B = o(p_Ł). \tag{6.3}$$

• Whereas the dependence on $n_I$ and $n_H$ is a common feature of algorithms 1, 2, and 4, the complexity of the extraction of formulas of the Łukasiewicz logic in addition polynomially depends on the magnitudes of the numerators and denominators of the components of the vectors $a_P$ and of the numbers $b_P$ from equations 5.17 and 5.18 for $P \in \mathcal{P}_{f,A_1,b_1}$, as well as on the magnitudes of the numerators and denominators of the components of the vectors $a_S$ and of the numbers $b_S$ for $S \in \mathcal{S}_P$, $P \in \mathcal{P}_{f,A_1,b_1}$. This follows from the way the functions $g_P$ for $P \in \mathcal{P}_{f,A_1,b_1}$, used as input to algorithm 3 in step 6 of algorithm 4, have been defined in step 4, from the way the functions $g_S$ for $S \in \mathcal{S}_P$, $P \in \mathcal{P}_{f,A_1,b_1}$, used as input to algorithm 3 in step 7 of algorithm 4, have been defined in step 5, and from the fact that algorithm 3 is recursively called $\sum_{i=1}^{n} |a_i|$ times with a constant complexity if the function on its input is defined by means of numbers $a_1, \dots, a_n, b \in \mathbb{Z}$ according to equation 5.9. Although the multiplicative increase of the complexity of algorithm 4 caused by this additional dependence is only polynomial, it still can be extremely high, since the numerators and denominators of the components of the vectors $a_P$, $a_S$ and of the numbers $b_P$, $b_S$ for $P \in \mathcal{P}_{f,A_1,b_1}$, $S \in \mathcal{S}_P$, which are computed from the algorithm inputs in steps 1 and 3, may theoretically be integers of any magnitude.

• Finally, the complexity of algorithm 4 depends in addition on the numbers $q_P$ for $P \in \mathcal{P}_{f,A_1,b_1}$, defined in equation 5.22. Though in this case the resulting multiplicative increase of the algorithm complexity is only linear, even this increase can be extremely high, because $P$ can be arbitrarily close to the set $\bigcup_{P' \in \mathcal{P}_{\neg P}} P' \cup (\{0,1\}^{n_I} - P)$, which is disjoint from $P$, allowing the value $\min\{\max_{S \in \mathcal{S}_P} g_S(\xi) : \xi \in \bigcup_{P' \in \mathcal{P}_{\neg P}} P' \cup (\{0,1\}^{n_I} - P)\}$ in equation 5.22 to be arbitrarily low, and thus the corresponding $q_P$ to be arbitrarily high.
This analysis shows that the method for the extraction of fuzzy rules from data presented in section 5 has little chance of being used in real-world applications of a size similar to those in which piecewise-linear neural networks have already been used for the extraction of Boolean rules, in spite of the fact that both methods share exponential complexity with respect to $n_I$ and $n_H$. At the same time, the analysis indicates two important ways in which the algorithm could be simplified and its practical usability increased:
• To restrict the possible magnitudes of the numerators and denominators of the components of the vectors $a_P$, $a_S$ and of the numbers $b_P$, $b_S$ for $P \in \mathcal{P}_{f,A_1,b_1}$, $S \in \mathcal{S}_P$

• To restrict from below the possible distance between $P$ and $\bigcup_{P' \in \mathcal{P}_{\neg P}} P' \cup (\{0,1\}^{n_I} - P)$ for $P \in \mathcal{P}_{f,A_1,b_1}$, or in some other way to restrict from below the possible value of $\min\{\max_{S \in \mathcal{S}_P} g_S(\xi) : \xi \in \bigcup_{P' \in \mathcal{P}_{\neg P}} P' \cup (\{0,1\}^{n_I} - P)\}$
These simplifications, together with restrictions to only those polyhedra from $\mathcal{P}_{f,A_1,b_1}$ that contain any data, determine the main directions of the ongoing development of the method presented in section 5 and of ongoing research into the relationship between piecewise-linear neural networks and the extraction of logical rules from data.

Acknowledgments

The reported research has been supported by grants 201/05/0325 and 201/05/0557 of the Grant Agency of the Czech Republic, as well as by grant A100300503 of the Academy of Sciences of the Czech Republic. It has also been partially supported by institutional research plan AV0Z10300504. I thank Věra Kůrková and the anonymous reviewer for their comments and suggestions.

References

Aguzzoli, S., & Mundici, D. (2001). Weierstrass approximations by Łukasiewicz formulas with one quantified variable. In 31st IEEE International Symposium on Multiple-Valued Logic (pp. 361–366). Piscataway, NJ: IEEE.
Alexander, J., & Mozer, M. (1999). Template-based procedures for neural network interpretation. Neural Networks, 12, 479–498.
Andrews, R., Diederich, J., & Tickle, A. (1995). Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge Based Systems, 8, 378–389.
Benítez, J., Castro, J., & Requena, I. (1997). Are artificial neural networks black boxes? IEEE Transactions on Neural Networks, 8, 1156–1164.
Berthold, M., & Hand, D. (Eds.) (1999). Intelligent data analysis: An introduction. Berlin: Springer-Verlag.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Bologna, G. (2000). Symbolic rule extraction from the DIMPL neural network. In S. Wermter & R. Sun (Eds.), Hybrid neural systems (pp. 241–255). Heidelberg: Springer-Verlag.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Cignoli, L., D'Ottaviano, I., & Mundici, D. (2000). Algebraic foundations of many-valued reasoning. Dordrecht: Kluwer.
d'Avila Garcez, A., Broda, K., & Gabbay, D. (2001). Symbolic knowledge extraction from artificial neural networks: A sound approach. Artificial Intelligence, 125, 155–207.
Dennis, J., & Schnabel, R. (1983). Numerical methods for unconstrained optimization and nonlinear equations. Englewood Cliffs, NJ: Prentice Hall.
Duch, W., Adamczak, R., & Grabczewski, K. (1998). Extraction of logical rules from neural networks. Neural Processing Letters, 7, 211–219.
Duch, W., Adamczak, R., & Grabczewski, K. (2000). A new methodology of extraction, optimization and application of crisp and fuzzy logical rules. IEEE Transactions on Neural Networks, 11, 277–306.
Finn, G. (1999). Learning fuzzy rules from data. Neural Computing and Applications, 8, 9–24.
Gad, E., Atiya, A., Shaheen, S., & El-Dessouki, A. (2000). A new algorithm for learning in piecewise-linear neural networks. Neural Networks, 13, 485–505.
Hagan, M., Demuth, H., & Beale, M. (1996). Neural network design. Boston: PWS Publishing.
Hagan, M., & Menhaj, M. (1994). Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5, 989–993.
Hájek, P. (1998). Metamathematics of fuzzy logic. Dordrecht: Kluwer.
Hájek, P., & Havránek, T. (1978). Mechanizing hypothesis formation. Berlin: Springer-Verlag.
Healy, M., & Caudell, T. (1997). Acquiring rule sets as a product of learning in a logical neural architecture. IEEE Transactions on Neural Networks, 8, 461–474.
Holeňa, M. (2000). Observational logic integrates data mining based on statistics and neural networks. In D. Zighed, J. Komorowski, & J. Zytkov (Eds.), Principles of data mining and knowledge discovery (pp. 440–445). Berlin: Springer-Verlag.
Holeňa, M. (2002a). Extraction of logical rules from data by means of piecewise-linear neural networks. In Proceedings of the 5th International Conference on Discovery Science (pp. 192–205). Berlin: Springer-Verlag.
Holeňa, M. (2002b). Mining rules from empirical data with an ecological application (Tech. rep.). Cottbus: Brandenburg University of Technology.
Holeňa, M. (2005). Extraction of fuzzy logic rules from data by means of artificial neural networks. Kybernetika, 41, 297–314.
Holeňa, M., & Baerns, M. (2003a). Artificial neural networks in catalyst development. In J. Cawse (Ed.), Experimental design for combinatorial and high throughput materials development (pp. 163–202). Hoboken, NJ: Wiley.
Holeňa, M., & Baerns, M. (2003b). Feedforward neural networks in catalysis. A tool for the approximation of the dependency of yield on catalyst composition, and for knowledge extraction. Catalysis Today, 81, 485–494.
Hornik, K. (1991). Approximation capabilities of multilayer neural networks. Neural Networks, 4, 251–257.
Hornik, K., Stinchcombe, M., White, H., & Auer, P. (1994). Degree of approximation results for feedforward networks approximating unknown mappings and their derivatives. Neural Computation, 6, 1262–1275.
Howes, P., & Crook, N. (1999). Using input parameter influences to support the decisions of feedforward neural networks. Neurocomputing, 24, 191–206.
Ishikawa, M. (2000). Rule extraction by successive regularization. Neural Networks, 13, 1171–1183.
Kůrková, V. (1992). Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5, 501–506.
Kůrková, V. (2000). Rates of approximation by neural networks. In P. Sinčák & J. Vaščák (Eds.), Quo vadis computational intelligence? (pp. 23–26). Berlin: Springer-Verlag.
Loh, W. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12, 361–386.
Lu, H., Setiono, R., & Liu, H. (1996). Effective data mining using neural networks. IEEE Transactions on Knowledge and Data Engineering, 8, 957–961.
Maass, W. (1997). Bounds for the computational power and learning complexity of analog neural nets. SIAM Journal on Computing, 26, 708–732.
Maire, F. (1999). Rule-extraction by backpropagation of polyhedra. Neural Networks, 12, 717–725.
McNaughton, R. (1951). A theorem about infinite-valued sentential logic. Journal of Symbolic Logic, 16, 1–13.
Mitra, S., De, R., & Pal, S. (1997). Knowledge-based fuzzy MLP for classification and rule generation. IEEE Transactions on Neural Networks, 8, 1338–1350.
Mitra, S., & Hayashi, Y. (2000). Neuro-fuzzy rule generation: Survey in soft computing framework. IEEE Transactions on Neural Networks, 11, 748–768.
Mundici, D. (1994). A constructive proof of McNaughton's theorem in infinite-valued logic. Journal of Symbolic Logic, 59, 596–602.
Narazaki, H., Watanabe, T., & Yamamoto, M. (1996). Reorganizing knowledge in neural networks: An exploratory mechanism for neural networks in data classification problems. IEEE Transactions on Systems, Man, and Cybernetics–Part B: Cybernetics, 26, 107–117.
Nauck, D., Nauck, U., & Kruse, R. (1996). Generating classification rules with the neuro-fuzzy system NEFCLASS. In Proceedings of the Biennial Conference of the North American Fuzzy Information Processing Society NAFIPS'96 (pp. 466–470). Berkeley: IFSA.
Novák, V., & Perfilieva, I. (1999). Some consequences of Herbrand and McNaughton theorems in fuzzy logic. In V. Novák & I. Perfilieva (Eds.), Discovering world with fuzzy logic: Perspectives and approaches to formalization of human-consistent logical systems (pp. 271–295). Heidelberg: Springer-Verlag.
Novák, V., Perfilieva, I., & Močkoř, J. (1999). Mathematical principles of fuzzy logic. Dordrecht: Kluwer.
Quinlan, J. (1992). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann.
Rabuñal, J., Dorado, J., Pereira, J., & Rivero, D. (2004). A new approach to the extraction of ANN rules and to their generalization capacity through GP. Neural Computation, 16, 1483–1523.
Ripley, B. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning internal representations by error backpropagation. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing: Exploration in the microstructure of cognition (pp. 318–362). Cambridge, MA: MIT Press.
Setiono, R. (1997). Extracting rules from neural networks by pruning and hidden unit splitting. Neural Computation, 9, 205–225.
Siciliano, R., & Mola, F. (2000). Multivariate data analysis and modeling through classification and regression trees. Computational Statistics and Data Analysis, 32, 285–301.
Tickle, A., Andrews, R., Golea, M., & Diederich, J. (1998). The truth will come to light: Directions and challenges in extracting rules from trained artificial neural networks. IEEE Transactions on Neural Networks, 9, 1057–1068.
Tikhonov, A., & Arsenin, V. Y. (1977). Solutions of ill-posed problems. Washington, DC: Halsted Press.
Towell, G., & Shavlik, J. (1993). Extracting refined rules from knowledge-based neural networks. Machine Learning, 13, 71–101.
Tsukimoto, H. (2000). Extracting rules from trained neural networks. IEEE Transactions on Neural Networks, 11, 333–389.
Vach, V. (1995). Classification trees. Computational Statistics, 10, 9–14.
Received April 25, 2005; accepted January 10, 2006.
LETTER
Communicated by Andries P. Engelbrecht
Computation of Madalines’ Sensitivity to Input and Weight Perturbations Yingfeng Wang [email protected] Department of Computer Science, Hohai University, Nanjing, China
Xiaoqin Zeng [email protected] Department of Computer Science, Hohai University, Nanjing, China
Daniel So Yeung [email protected] Department of Computing, Hong Kong Polytechnic University, Kowloon, Hong Kong
Zhihang Peng [email protected] Department of Mathematics, Hohai University, Nanjing, China
The sensitivity of a neural network's output to its input and weight perturbations is an important measure for evaluating the network's performance. In this letter, we propose an approach to quantify the sensitivity of Madalines. The sensitivity is defined as the probability of output deviation due to input and weight perturbations with respect to overall input patterns. Based on the structural characteristics of Madalines, a bottom-up strategy is followed, along which the sensitivity of single neurons, that is, Adalines, is considered first and then the sensitivity of the entire Madaline network. By means of probability theory, an analytical formula is derived for the calculation of Adalines' sensitivity, and an algorithm is designed for the computation of Madalines' sensitivity. Computer simulations are run to verify the effectiveness of the formula and the algorithm. The simulation results are in good agreement with the theoretical results.

1 Introduction

Generally, an artificial neural network is aimed at realizing a mapping between its input and output by establishing a set of connection weights. Therefore, the sensitivity of a neural network's output to input and weight perturbations is a fundamental issue with both theoretical and practical value in neural network research. It is obvious that a properly quantified sensitivity could be a useful measure for evaluating neurons'
C 2006 Massachusetts Institute of Technology
Madalines’ Sensitivity
2855
relevance and performance, such as fault tolerance and generalization ability. For fault tolerance, there exist a number of theoretical studies of the response of neural networks to weight imprecision and input noise. For example, Stevenson, Winter, and Widrow (1990) have studied the sensitivity of Adalines to weight error, aiming at addressing hardware imprecision; Choi and Choi (1992) have established a sensitivity measure for a specific input with white noise perturbation. They applied a kind of sensitivity as a measure to solve the fault tolerance problems in their studies. Recently we applied the sensitivity of perceptrons to consider the pruning issue of multilayer perceptron (MLP) networks and obtained good results (Zeng & Yeung, 2006). In our study, we employed the sensitivity of perceptrons to establish a relative measure for evaluating a neuron's importance in a given MLP. With the measure, we can easily locate the least important neuron in the MLP and remove it with no effect, or the least effect, on the performance of the MLP. It is hoped that the architecture pruning of Madalines will also benefit from the study of Madalines' sensitivity. In the literature, a number of studies (Stevenson et al., 1990; Choi & Choi, 1992; Fu & Chen, 1993; Zurada, Malinowski, & Cloete, 1994; Alippi, Piuri, & Sami, 1995; Piché, 1995; Oh & Lee, 1995; Zurada, Malinowski, & Usui, 1997; Cheng & Yeung, 1999; Engelbrecht, 2001a, 2001b; Yeung & Sun, 2002; Zeng & Yeung, 2001, 2003, 2006) on the sensitivity of neural networks have emerged. They vary in their target networks and approaches. This letter focuses on the study of Madalines' sensitivity and proposes a novel computational method to compute the sensitivity. Stevenson et al. (1990) first systematically investigated the sensitivity of Madalines to weight errors. They made use of hyperspheres as a mathematical model to theoretically analyze the sensitivity of Madalines.
The surface of a hypersphere with radius $n^{1/2}$ is used to express the input space for an Adaline with $n$-dimensional input. Geometrically, the surface of a hypersphere that represents all inputs of an Adaline can be divided, by a hyperplane that is determined by the Adaline's weights and passes through the origin, into two hemi-hyperspheres that correspond to the bipolar outputs of the Adaline. Based on such a geometrical model, they defined sensitivity as the probability of erroneous output of Madalines and derived approximate expressions as functions of the percentage error in inputs and weights, under the assumption that the input and weight perturbations are small and the number of Adalines per layer is sufficiently large. Later, Alippi et al. (1995) generalized the hypersphere model by considering the sensitivity of Adalines with multiple-step activation functions. Unfortunately, since the discrete inputs of an Adaline generally do not span the whole hypersphere surface, their results may have large deviations when the input dimension of Adalines is not sufficiently large. Piché (1995) employed a statistical rather than a geometrical argument to analyze the effects of weight errors in Madalines. He made the assumption that inputs and weights, as well as their errors, are all independently and identically distributed (i.i.d.) with
mean zero. Based on such a stochastic model, and under the condition that both input and weight errors were small enough, Piché (1995) derived an analytical expression for the sensitivity as the ratio of the variance of the output error to the variance of the output. But his method is suitable only for an ensemble of neural networks rather than an individual one, because the assumptions made by the stochastic model are too strong. In our study, different from the methods we have noted, we propose a new method to derive a formula for calculating the sensitivity of Adalines and then design an algorithm to compute the sensitivity of Madalines neuron by neuron and layer by layer. Our method offers certain advantages over the others. For example, it does not demand that the weight perturbation be very small, as is required by Stevenson et al. (1990) and Piché (1995); its results are more accurate than those of Stevenson et al. (1990), and it is free of the restriction, required by Stevenson et al. (1990), that the weight perturbation ratio be the same for all Adalines in a Madaline; and the probability used in our method is more direct and exact than the variance used in Piché (1995), because some assumptions on the means of the inputs, weights, and their perturbations need to be made prior to considering the variance.

The rest of this letter is arranged as follows. The architecture and notation of Madalines as well as Adalines are briefly described in section 2. The definition and the formula derivation for Adalines' sensitivity are given in section 3, and section 4, in a parallel way, gives the definition and the computation of the sensitivity of Madalines. Simulation results that support the theoretical results are presented in section 5. Section 6 concludes the letter.

2 The Madaline Model

A Madaline is a discrete feedforward multilayered neural network; it consists of a set of Adalines that work together to maintain an input-output mapping.

2.1 Architecture.
An Adaline is a basic building block of the Madaline. With $n$ bipolar inputs and one bipolar output, a single Adaline is capable of performing certain logic functions. Each element of the inputs takes on a bipolar value of either +1 or −1 and is associated with an adjustable floating-point weight. The sum of the weighted input elements is computed, producing a linear output, which is then fed to an activation function to yield the output of the Adaline. The commonly used activation function is the symmetrical hard-limit function:

$$f(x) = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0. \end{cases} \tag{2.1}$$
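As a minimal illustration (the weights below are hypothetical, not taken from the letter), an Adaline computes a weighted sum of bipolar inputs and passes it through the hard limiter of equation 2.1:

```python
def hard_limit(x):
    # Symmetrical hard-limit activation of equation 2.1.
    return 1 if x >= 0 else -1

def adaline(inputs, weights):
    # Weighted sum of bipolar inputs followed by the hard limiter.
    return hard_limit(sum(x * w for x, w in zip(inputs, weights)))

# Bipolar input vector and illustrative floating-point weights.
print(adaline([1, -1, 1], [0.5, 0.2, -0.4]))  # 0.5 - 0.2 - 0.4 = -0.1 < 0, so -1
```

Note that, as in equation 2.1, an exactly zero weighted sum maps to +1.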
A Madaline is a layered network of Adalines. Links exist only between Adalines of two adjacent layers; there is no link between Adalines in the same layer or in any two nonadjacent layers. All Adalines in a layer are fully linked to all the Adalines in the immediately preceding layer and all the Adalines in the immediately succeeding layer. At each layer except the input layer, the inputs of each neuron are the outputs of the Adalines in the previous layer.

2.2 Notation. A Madaline in general can have $L$ layers, and each layer $l$ $(1 \le l \le L)$ has $n_l$ $(n_l \ge 1)$ neurons. The form $n_0\text{-}n_1\text{-}\dots\text{-}n_L$ is used to represent a Madaline with a given structural configuration, in which each $n_l$ $(0 \le l \le L)$ not only stands for a layer from left to right, including the input layer, but also indicates the number of Adalines in the layer. $n_0$ is an exception, which refers to the dimension of input vectors. $n_L$ refers to the output layer. Since the number of Adalines in layer $l-1$ is equal to the output dimension of that layer, which in turn is equal to the input dimension of layer $l$, the input dimension of layer $l$ is $n_{l-1}$. For Adaline $i$ $(1 \le i \le n_l)$ in layer $l$, the input vector is $X^l = (x_1^l, \dots, x_{n_{l-1}}^l)^T$, the weight vector is $W_i^l = (w_{i1}^l, \dots, w_{in_{l-1}}^l)^T$, and the output is $y_i^l = f(X^l W_i^l)$. For each layer $l$, all Adalines in that layer have the same input vector $X^l$. The weight set of the layer is $W^l = \{W_1^l, \dots, W_{n_l}^l\}$, and the output vector of the layer is $Y^l = (y_1^l, \dots, y_{n_l}^l)^T$. For an entire Madaline, the input vector is $X^1$ or $Y^0$, the weight set is $W = W^1 \cup \dots \cup W^L$, and the output is $Y^L$. Let $\Delta X^l = (\Delta x_1^l, \dots, \Delta x_{n_{l-1}}^l)^T$ and $\Delta W_i^l = (\Delta w_{i1}^l, \dots, \Delta w_{in_{l-1}}^l)^T$ be the perturbations of input vector $X^l$ and weight vector $W_i^l$, and let $\bar X^l = (\bar x_1^l, \dots, \bar x_{n_{l-1}}^l)^T$ and $\bar W_i^l = (\bar w_{i1}^l, \dots, \bar w_{in_{l-1}}^l)^T$ be the corresponding perturbed input and weight vectors, respectively.
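The layered structure just described amounts to a forward pass in which every Adaline of layer $l$ reads the same vector $Y^{l-1}$; a minimal sketch (the 3-2-1 configuration and all weights below are illustrative, not from the letter):

```python
def hard_limit(x):
    return 1 if x >= 0 else -1

def madaline_forward(x, layers):
    # layers: one entry per layer l = 1..L; each entry is the weight set W^l,
    # a list of weight vectors W_i^l, one per Adaline in that layer.
    y = x  # Y^0, the network input
    for weight_set in layers:
        # Every Adaline in layer l sees the same input vector X^l = Y^(l-1).
        y = [hard_limit(sum(xj * wj for xj, wj in zip(y, w_i)))
             for w_i in weight_set]
    return y  # Y^L

# A 3-2-1 Madaline with illustrative weights.
layers = [
    [[0.3, -0.6, 0.2], [0.5, 0.1, 0.4]],  # layer 1: two Adalines
    [[-0.7, 0.9]],                         # layer 2 (output): one Adaline
]
print(madaline_forward([1, -1, 1], layers))  # → [1]
```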
3 The Sensitivity of Adalines

A perturbation in the inputs of an Adaline may alter its output, and a perturbation in the weights may alter the Adaline's input-output mapping and thus its output. In this section, the effect of those perturbations on the output of an Adaline is studied. Since the output deviation due to input and weight perturbations with respect to an individual input pattern can show the Adaline's behavior only at that pattern and is usually not suitable for evaluating the Adaline's performance, the average output deviation with respect to all input patterns is adopted in this letter as a sensitivity measure, which shows whether the Adaline is sensitive to input and weight perturbations.

Definition 1. The sensitivity of Adaline $i$ in layer $l$ is defined as the probability of deviated output of the Adaline due to its input and weight perturbations with
respect to all input patterns, which is expressed as

$$s_i^l = \frac{N_{err}}{N_{inp}}, \tag{3.1}$$
where $N_{inp}$ is the number of all input patterns, and $N_{err}$ is the number of all output deviations over all the input patterns.

Since the perturbation of an Adaline's input element can result only in $\bar x_j = x_j$ or $\bar x_j = -x_j$, it is obvious that an affected product in $\sum_{j=1}^{n} x_j w_j$ can be expressed as $\bar x_j w_j = (-x_j) w_j = x_j (-w_j)$. This means that the effective perturbation of $x_j$ is equivalent to a change of the sign of $w_j$. In this way, an input perturbation can easily be converted to a weight perturbation. Without loss of generality, only weight perturbation is considered in the following discussion of the sensitivity computation. For simplicity of expression, the superscript and the subscript that mark an Adaline's layer and its order in the layer are omitted, since the Adaline's position in the network is of no interest to us in this section.

According to equation 2.1, it is obvious that whether there is an output deviation due to weight perturbation at input $X_q$ $(1 \le q \le N_{inp})$ is totally dependent on the signs of $X_q W$ and $X_q \bar W$. The output deviation occurs if and only if $X_q W$ and $X_q \bar W$ have opposite signs ($X_q W \ge 0$ and $X_q \bar W < 0$, or $X_q W < 0$ and $X_q \bar W \ge 0$). This inspires us to consider the sensitivity as the probability of $XW$ and $X\bar W$ having different signs over all input patterns. We have the following notations and expressions:

$$
s = P\bigl(f(XW) \ne f(X\bar W)\bigr) = 1 - P\bigl(f(XW) = f(X\bar W)\bigr)
= 1 - \bigl(P((XW \ge 0) \cap (X\bar W \ge 0)) + P((XW < 0) \cap (X\bar W < 0))\bigr)
\approx 1 - \biggl(\int_0^{+\infty}\!\!\int_0^{+\infty} f(x, y)\,dx\,dy + \int_{-\infty}^{0}\!\!\int_{-\infty}^{0} f(x, y)\,dx\,dy\biggr), \tag{3.2}
$$
where $f(x, y)$ is the joint probability density function of $XW$ and $X\bar W$. In order to derive a computable expression for the sensitivity, we first derive the distributions of $XW$ and $X\bar W$, respectively, then their joint distribution, and finally the probability obtained by using the joint distribution. Assume that all the $n$-dimensional inputs are uniformly distributed and that weight elements can be any real number but zero. Let $\xi_i = x_i w_i$, so that $N_{inp}$ is equal to $2^n$ and $\xi_1, \xi_2, \dots, \xi_n$ are independent of each other. Since the probabilities of $x_i$ being either 1 or −1 are the same, the expectation is

$$E(\xi_i) = 0, \tag{3.3}$$
and the variance is

$$D(\xi_i) = \frac{1}{2}(w_i - 0)^2 + \frac{1}{2}(-w_i - 0)^2 = w_i^2. \tag{3.4}$$
Equations 3.3 and 3.4 show that $\xi_j$ has finite mathematical expectation and variance. Since $\xi_j = w_j$ or $\xi_j = -w_j$, the distribution function of $\xi_j$ is

$$F_j(x) = P\{\xi_j \le x\} = \begin{cases} 1, & x \ge |w_j| \\ \frac{1}{2}, & -|w_j| \le x < |w_j| \\ 0, & x < -|w_j|. \end{cases} \tag{3.5}$$
Because $F_j(x)$ is a three-step constant function, the following Lindeberg condition is satisfied:

$$\forall \tau > 0, \quad \lim_{n \to \infty} \frac{1}{B_n^2} \sum_{j=1}^{n} \int_{|x - u_j| \ge \tau B_n} (x - u_j)^2 \, dF_j(x) = 0, \tag{3.6}$$

where $B_n^2 = \sum_{i=1}^{n} D(\xi_i)$ and $u_j = E(\xi_j)$. Hence, in terms of Lindeberg's central limit theorem, it can be proved that $\sum_{i=1}^{n} x_i w_i$ converges in distribution to a random variable with normal distribution $N(0, \sum_{i=1}^{n} w_i^2)$. Similarly, $\sum_{i=1}^{n} x_i \bar w_i$ converges in distribution to a random variable with normal distribution $N(0, \sum_{i=1}^{n} \bar w_i^2)$.

With the aim of deriving the joint probability density function of $\sum_{i=1}^{n} x_i w_i$ and $\sum_{i=1}^{n} x_i \bar w_i$, it is required to obtain their covariance and correlation coefficient. The covariance can be derived as follows:

$$
\operatorname{Cov}\Bigl(\sum_{i=1}^{n} x_i w_i,\ \sum_{i=1}^{n} x_i \bar w_i\Bigr)
= E\Bigl[\Bigl(\sum_{i=1}^{n} x_i w_i - E\sum_{i=1}^{n} x_i w_i\Bigr)\Bigl(\sum_{i=1}^{n} x_i \bar w_i - E\sum_{i=1}^{n} x_i \bar w_i\Bigr)\Bigr]
= E\Bigl[\Bigl(\sum_{i=1}^{n} x_i w_i\Bigr)\Bigl(\sum_{i=1}^{n} x_i \bar w_i\Bigr)\Bigr]
$$
$$
= E\Bigl[\sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j w_i \bar w_j\Bigr]
= \sum_{i=1}^{n} \sum_{j=1}^{n} E(x_i x_j w_i \bar w_j)
= \sum_{i=1}^{n} \sum_{j=1}^{n} E(x_i x_j)\, E(w_i \bar w_j). \tag{3.7}
$$
If $i = j$, it is clear that $E(x_i x_j) = 1$. For $i \ne j$, because the probabilities of $x_i$ and $x_j$ being 1 or −1 are the same, the probability of $x_i x_j$ being 1 or −1 must also be the same. Therefore, we have $E(x_i x_j) = 0$. Based on the above analysis, we obtain $\sum_{i=1}^{n} \sum_{j=1}^{n} E(x_i x_j) E(w_i \bar w_j) = \sum_{i=1}^{n} w_i \bar w_i$, that is,

$$\operatorname{Cov}\Bigl(\sum_{i=1}^{n} x_i w_i,\ \sum_{i=1}^{n} x_i \bar w_i\Bigr) = \sum_{i=1}^{n} w_i \bar w_i. \tag{3.8}$$
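Equation 3.8 can be sanity-checked by brute force for a small $n$, averaging the product of the two weighted sums over all $2^n$ equiprobable bipolar inputs (the weights below are illustrative, not from the letter):

```python
from itertools import product

def covariance_by_enumeration(w, w_bar):
    # E[(sum_i x_i w_i)(sum_i x_i w_bar_i)] over uniform bipolar x;
    # both weighted sums have zero mean, so this is the covariance.
    n = len(w)
    total = 0.0
    for x in product((1, -1), repeat=n):
        total += (sum(xi * wi for xi, wi in zip(x, w))
                  * sum(xi * wi for xi, wi in zip(x, w_bar)))
    return total / 2 ** n

w = [0.4, -0.3, 0.8]
w_bar = [0.5, -0.1, 0.6]
lhs = covariance_by_enumeration(w, w_bar)
rhs = sum(a * b for a, b in zip(w, w_bar))  # right-hand side of equation 3.8
print(round(lhs, 10), round(rhs, 10))  # both 0.71
```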
From equation 3.8, the correlation coefficient of $\sum_{i=1}^{n} x_i w_i$ and $\sum_{i=1}^{n} x_i \bar w_i$, which is denoted as $\rho$, can be written as

$$
\rho = \frac{\operatorname{Cov}\bigl(\sum_{i=1}^{n} x_i w_i,\ \sum_{i=1}^{n} x_i \bar w_i\bigr)}{\sqrt{D\bigl(\sum_{i=1}^{n} x_i w_i\bigr)}\,\sqrt{D\bigl(\sum_{i=1}^{n} x_i \bar w_i\bigr)}}
= \frac{\sum_{i=1}^{n} w_i \bar w_i}{\sqrt{\sum_{i=1}^{n} w_i^2}\,\sqrt{\sum_{i=1}^{n} \bar w_i^2}}
= \frac{\sum_{i=1}^{n} w_i \bar w_i}{B_n \bar B_n}, \tag{3.9}
$$

where $B_n = \sqrt{\sum_{i=1}^{n} w_i^2}$ and $\bar B_n = \sqrt{\sum_{i=1}^{n} \bar w_i^2}$. The joint distribution function of $\sum_{i=1}^{n} x_i w_i$ and $\sum_{i=1}^{n} x_i \bar w_i$ can therefore be approximately expressed as the following bivariate normal integral:

$$
F(x, y) = P\Bigl(\sum_{i=1}^{n} x_i w_i \le x,\ \sum_{i=1}^{n} x_i \bar w_i \le y\Bigr)
\approx \int_{-\infty}^{x}\!\int_{-\infty}^{y} \frac{1}{2\pi B_n \bar B_n \sqrt{1-\rho^2}}
\exp\biggl(\frac{-1}{2(1-\rho^2)}\Bigl(\frac{u^2}{B_n^2} - 2\rho\frac{uv}{B_n \bar B_n} + \frac{v^2}{\bar B_n^2}\Bigr)\biggr)\,dv\,du, \tag{3.10}
$$

and the corresponding joint probability density function is

$$
f(x, y) \approx \frac{1}{2\pi B_n \bar B_n \sqrt{1-\rho^2}}
\exp\biggl(\frac{-1}{2(1-\rho^2)}\Bigl(\frac{x^2}{B_n^2} - 2\rho\frac{xy}{B_n \bar B_n} + \frac{y^2}{\bar B_n^2}\Bigr)\biggr). \tag{3.11}
$$
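As a numerical cross-check (a sketch, not part of the letter), integrating the standardized density of equation 3.11 (with $B_n = \bar B_n = 1$) over the quadrant $x, y < 0$ reproduces the known orthant probability $1/4 + \arcsin(\rho)/(2\pi)$, which is the quantity used in the derivation below:

```python
import math

def quadrant_mass(rho, lim=8.0, steps=400):
    # Midpoint-rule integration of the standardized bivariate normal
    # density (equation 3.11 with B_n = B_bar_n = 1) over x, y < 0,
    # truncated at -lim in each variable (the remaining tail is negligible).
    h = lim / steps
    c = 1.0 / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))
    total = 0.0
    for i in range(steps):
        x = -(i + 0.5) * h
        for j in range(steps):
            y = -(j + 0.5) * h
            total += math.exp(-(x * x - 2.0 * rho * x * y + y * y)
                              / (2.0 * (1.0 - rho * rho)))
    return c * total * h * h

rho = 0.5
closed_form = 0.25 + math.asin(rho) / (2.0 * math.pi)  # equals 1/3 for rho = 0.5
print(abs(quadrant_mass(rho) - closed_form) < 1e-3)  # True
```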
Now, based on the above discussion, we can calculate the integral in equation 3.2. However, the following transformation needs to be introduced
in advance; the solution can be found in Gong and Zhao (1996):

$$
\Phi(-\beta_1, -\beta_2, \rho) = \int_{-\infty}^{-\beta_1}\!\int_{-\infty}^{-\beta_2}
\frac{1}{2\pi\sqrt{1-\rho^2}}
\exp\Bigl(-\frac{1}{2}\cdot\frac{x_1^2 + x_2^2 - 2\rho x_1 x_2}{1-\rho^2}\Bigr)\,dx_1\,dx_2
$$
$$
= \Phi(-\beta_1)\Phi(-\beta_2) + \frac{1}{2\pi}\int_0^{\rho} \frac{1}{\sqrt{1-t^2}}
\exp\Bigl(-\frac{1}{2}\cdot\frac{\beta_1^2 + \beta_2^2 - 2\beta_1\beta_2 t}{1-t^2}\Bigr)\,dt, \tag{3.12}
$$

where $\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\bigl(-\frac{1}{2}u^2\bigr)\,du$.
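For the special case $\beta_1 = \beta_2 = 0$ needed in equation 3.13, equation 3.12 reduces to $\Phi(0, 0, \rho) = 1/4 + \frac{1}{2\pi}\int_0^{\rho} dt/\sqrt{1-t^2} = 1/4 + \arcsin(\rho)/(2\pi)$; a simple midpoint-rule quadrature confirms this reduction (illustrative sketch):

```python
import math

def phi_origin(rho, steps=100000):
    # Midpoint-rule evaluation of 1/4 + (1/(2*pi)) * integral_0^rho dt / sqrt(1 - t^2).
    h = rho / steps
    integral = sum(1.0 / math.sqrt(1.0 - ((k + 0.5) * h) ** 2)
                   for k in range(steps)) * h
    return 0.25 + integral / (2.0 * math.pi)

rho = 0.6
closed_form = 0.25 + math.asin(rho) / (2.0 * math.pi)
print(abs(phi_origin(rho) - closed_form) < 1e-8)  # True
```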
By means of the symmetry property $\int_0^{+\infty}\!\int_0^{+\infty} f(x, y)\,dx\,dy = \int_{-\infty}^{0}\!\int_{-\infty}^{0} f(x, y)\,dx\,dy$ and equations 3.11 and 3.12, the probability (i.e., the sensitivity) can finally be derived as

$$
s \approx 1 - \biggl(\int_0^{+\infty}\!\!\int_0^{+\infty} f(x, y)\,dx\,dy + \int_{-\infty}^{0}\!\!\int_{-\infty}^{0} f(x, y)\,dx\,dy\biggr)
= 1 - 2\int_{-\infty}^{0}\!\!\int_{-\infty}^{0} f(x, y)\,dx\,dy
$$
$$
= 1 - 2\int_{-\infty}^{0}\!\!\int_{-\infty}^{0} \frac{1}{2\pi B_n \bar B_n\sqrt{1-\rho^2}}
\exp\biggl(\frac{-1}{2(1-\rho^2)}\Bigl(\frac{x^2}{B_n^2} - 2\rho\frac{xy}{B_n \bar B_n} + \frac{y^2}{\bar B_n^2}\Bigr)\biggr)\,dx\,dy
$$
$$
= 1 - 2\int_{-\infty}^{0}\!\!\int_{-\infty}^{0} \frac{1}{2\pi\sqrt{1-\rho^2}}
\exp\biggl(\frac{-1}{2(1-\rho^2)}\bigl(x^2 - 2\rho xy + y^2\bigr)\biggr)\,dx\,dy
$$
$$
= 1 - 2\Phi(0, 0, \rho)
= 1 - 2\Bigl(\frac{1}{2}\cdot\frac{1}{2} + \frac{1}{2\pi}\int_0^{\rho}\frac{1}{\sqrt{1-t^2}}\,dt\Bigr)
= 1 - 2\Bigl(\frac{1}{4} + \frac{\arcsin\rho}{2\pi}\Bigr)
= \frac{1}{2} - \frac{\arcsin\rho}{\pi}. \tag{3.13}
$$

It is worthy of note that the computational complexity of our approach, which is $O(n)$, is much less than that of the simulation approach, which is $O(n 2^n)$.
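The closed form of equation 3.13 can be compared with exhaustive enumeration over all $2^n$ input patterns for a small Adaline (the weights and perturbation below are illustrative; since equation 3.13 rests on a central-limit approximation, the two values agree only approximately for small $n$):

```python
import math
from itertools import product

def sign(v):
    # Sign convention of the hard limiter in equation 2.1.
    return 1 if v >= 0 else -1

def analytic_sensitivity(w, w_bar):
    # s ≈ 1/2 - arcsin(rho)/pi, with rho from equation 3.9.
    rho = sum(a * b for a, b in zip(w, w_bar)) / (
        math.sqrt(sum(a * a for a in w)) * math.sqrt(sum(b * b for b in w_bar)))
    return 0.5 - math.asin(rho) / math.pi

def exhaustive_sensitivity(w, w_bar):
    # Exact N_err / N_inp of equation 3.1 over all 2^n bipolar inputs.
    n = len(w)
    flips = sum(
        sign(sum(x * a for x, a in zip(xs, w)))
        != sign(sum(x * b for x, b in zip(xs, w_bar)))
        for xs in product((1, -1), repeat=n))
    return flips / 2 ** n

w = [0.21, -0.57, 0.33, 0.95, -0.48, 0.12, 0.76, -0.29, 0.61, -0.83]
w_bar = [wi * 1.3 + 0.05 for wi in w]  # illustrative weight perturbation
print(analytic_sensitivity(w, w_bar), exhaustive_sensitivity(w, w_bar))
```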
Table 1: Associated Parameters and Experimental Results for the Three Adalines.

| Adaline | n | Δw | p (Simulations) | s (Ours) | s (Stevenson) |
|---|---|---|---|---|---|
| Neuron 1 | 15 | 0.1 | 0.004089 | 0.003786 | 0.003896 |
| | | 0.2 | 0.007751 | 0.007551 | 0.007791 |
| | | 0.3 | 0.011047 | 0.011293 | 0.011687 |
| | | 0.4 | 0.014648 | 0.015012 | 0.015582 |
| | | 0.5 | 0.017883 | 0.018707 | 0.019478 |
| | | 0.6 | 0.020813 | 0.022377 | 0.023374 |
| | | 0.7 | 0.024231 | 0.026021 | 0.027269 |
| | | 0.8 | 0.027710 | 0.029639 | 0.031165 |
| | | 0.9 | 0.031067 | 0.033230 | 0.035060 |
| | | 1.0 | 0.033936 | 0.036794 | 0.038956 |
| Neuron 2 | 20 | 0.1 | 0.004318 | 0.004365 | 0.004365 |
| | | 0.2 | 0.008636 | 0.008730 | 0.008730 |
| | | 0.3 | 0.012972 | 0.013092 | 0.013096 |
| | | 0.4 | 0.017553 | 0.017450 | 0.017461 |
| | | 0.5 | 0.021755 | 0.021803 | 0.021826 |
| | | 0.6 | 0.026026 | 0.026148 | 0.026191 |
| | | 0.7 | 0.030306 | 0.030484 | 0.030557 |
| | | 0.8 | 0.034578 | 0.034810 | 0.034922 |
| | | 0.9 | 0.038784 | 0.039124 | 0.039287 |
| | | 1.0 | 0.042980 | 0.043424 | 0.043652 |
| Neuron 3 | 30 | 0.1 | 0.005878 | 0.005909 | 0.006317 |
| | | 0.2 | 0.011672 | 0.011736 | 0.012635 |
| | | 0.3 | 0.017380 | 0.017480 | 0.018952 |
| | | 0.4 | 0.022999 | 0.023137 | 0.025270 |
| | | 0.5 | 0.028528 | 0.028707 | 0.031587 |
| | | 0.6 | 0.033968 | 0.034188 | 0.037905 |
| | | 0.7 | 0.039315 | 0.039580 | 0.044222 |
| | | 0.8 | 0.044576 | 0.044881 | 0.050539 |
| | | 0.9 | 0.049728 | 0.050091 | 0.056857 |
| | | 1.0 | 0.054772 | 0.055209 | 0.063174 |
4 The Sensitivity of Madalines

Our final goal is to determine the sensitivity of Madalines to input and weight perturbations. From a global point of view, the sensitivity of a Madaline reflects its output deviation, or more exactly its output layer's output deviation, due to the first layer's input perturbation and the network's weight perturbation. Because the inputs of each neuron in a layer are the outputs of all neurons in its immediately preceding layer, the perturbations that occur in the first layer will be propagated through all internal layers to influence the output layer's output. Based on the structural characteristics of Madalines and the sensitivity of Adalines discussed in section 3, we define the sensitivity of a layer and the sensitivity of an entire Madaline as follows:
[Figure: curves of sensitivity or probability (y-axis, 0 to 0.08) versus weight perturbation (x-axis, 0.1 to 1.0) for Neurons 1, 2, and 3, each comparing the simulation results, our method, and Stevenson's.]
Figure 1: Experimental results of the three Adalines with weight perturbations.
Definition 2. The sensitivity of layer l (1 ≤ l ≤ L) is a vector in which each element is the sensitivity of the corresponding Adaline in the layer due to its input and weight perturbations, which is expressed as

S^l = (s_1^l, s_2^l, ..., s_{n^l}^l)^T.  (4.1)
Definition 3. The sensitivity of a Madaline is the sensitivity of its output layer, that is,

S = S^L = (s_1^L, s_2^L, ..., s_{n^L}^L)^T.  (4.2)
There are two sources of perturbations for a Madaline: (1) the weight perturbation of all layers and (2) the input perturbation from the input layer, which is expressed as S^0 = (s_1^0, s_2^0, ..., s_{n^0}^0)^T, indicating the probability of perturbation for each input element. This makes the notation for the input layer consistent with that of the succeeding hidden and output layers. Definition 3 is actually recursive, because S^L implicitly depends on S^{L−1}, which in turn depends on S^{L−2}, and so on, until the recursion finally reaches the initial S^0. This suggests that the sensitivity of a Madaline can be
2864
Y. Wang, X. Zeng, D. Yeung, and Z. Peng
Table 2: Experimental Results for the Three Madalines.

Net 1 (architecture 20-15-1):
w:               0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8      0.9      1.0
p (simulations): 0.019060 0.036343 0.050889 0.065135 0.078338 0.089858 0.099978 0.108974 0.117135 0.127218
s (ours):        0.017774 0.034296 0.049696 0.064093 0.077591 0.090285 0.102256 0.113580 0.124321 0.134535

Net 2 (architecture 15-20-10-1):
w:               0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8      0.9      1.0
p (simulations): 0.042542 0.070831 0.096497 0.123444 0.142426 0.165527 0.182373 0.194580 0.204102 0.213837
s (ours):        0.042172 0.076665 0.105433 0.129852 0.150915 0.169345 0.185682 0.200332 0.213605 0.225741

Net 3 (architecture 25-20-15-10-1):
w:               0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8      0.9      1.0
p (simulations): 0.068878 0.126714 0.170748 0.211107 0.238871 0.266118 0.292421 0.314146 0.328447 0.344706
s (ours):        0.088355 0.144633 0.183842 0.213013 0.235816 0.254339 0.269850 0.283160 0.294815 0.305189
obtained by calculating the sensitivity of each Adaline in the Madaline from the first layer to the last, with the partial order that the Adalines in a preceding layer are computed before those in the succeeding layer; the Adalines within the same layer can be computed in any order.

In section 3, we pointed out that the input perturbation of an Adaline can be treated as a kind of weight perturbation, and we provided a way to compute the sensitivity of an Adaline under weight perturbation. Hence, what needs to be done is to merge the input perturbation into the weight perturbation and then make use of equations 3.9 and 3.13 to compute the sensitivity of every Adaline in a Madaline. In section 3, the input perturbation is expressed as ΔX^l (1 ≤ l ≤ L), but here it is given in the form of S^{l−1} (1 ≤ l ≤ L), the probability of input perturbation. Because of the different forms of expression, a transformation from S^{l−1} to ΔX^l is needed for each layer before computing the sensitivity of each Adaline in the layer. For layer l with S^{l−1}, since the perturbation
Madalines’ Sensitivity
2865
Figure 2: Experimental results for the three Madalines with weight perturbations. [Plot: weight perturbation (0.1 to 1) versus sensitivity or probability; one curve per network (Nets 1 to 3), each comparing simulation results with theoretical results.]
probability of each input element may be different, the perturbed input vector in general may have many combinations of perturbed input elements for a given number k (0 ≤ k ≤ n^{l−1}) of perturbed elements. This makes the transformation quite complex. For simplicity, in the following derivations, s_1^{l−1}, s_2^{l−1}, ..., s_{n^{l−1}}^{l−1} are approximated by their mathematical expectation:

s̄^{l−1} = (1 / n^{l−1}) Σ_{i=1}^{n^{l−1}} s_i^{l−1}.  (4.3)
Thus, the probability of k elements being perturbed in the input of layer l can be approximated as

p_k^l ≈ C_{n^{l−1}}^{k} (1 − s̄^{l−1})^{n^{l−1}−k} (s̄^{l−1})^k.  (4.4)
In order to compute the corresponding correlation coefficient for an Adaline with both input and weight perturbations, equation 3.9 needs to be modified to merge the input perturbation into the weight perturbation.
Let ρ_{ik}^l denote the correlation coefficient of the ith Adaline in layer l with k input elements perturbed, and let W̃_i^l = (w̃_{i1}^l, ..., w̃_{in^{l−1}}^l)^T represent the perturbed weight after the merge. Noticing that we can have only either w̃_{ij}^l = w̄_{ij}^l, when the jth input element is not perturbed, or w̃_{ij}^l = −w̄_{ij}^l, when the jth input element is perturbed (where w̄_{ij}^l carries only the weight perturbation), equation 3.9 can be modified as

ρ_{ik}^l = ( Σ_{j=1}^{n^{l−1}} w̃_{ij}^l w_{ij}^l ) / ( √(Σ_{j=1}^{n^{l−1}} (w̃_{ij}^l)^2) √(Σ_{j=1}^{n^{l−1}} (w_{ij}^l)^2) )
        = ( Σ_{(n^{l−1}−k)} w̄_{ij}^l w_{ij}^l − Σ_{(k)} w̄_{ij}^l w_{ij}^l ) / ( √(Σ_{j=1}^{n^{l−1}} (w̄_{ij}^l)^2) √(Σ_{j=1}^{n^{l−1}} (w_{ij}^l)^2) )
        = ( Σ_{j=1}^{n^{l−1}} w̄_{ij}^l w_{ij}^l − 2 Σ_{(k)} w̄_{ij}^l w_{ij}^l ) / ( √(Σ_{j=1}^{n^{l−1}} (w̄_{ij}^l)^2) √(Σ_{j=1}^{n^{l−1}} (w_{ij}^l)^2) ),  (4.5)

where Σ_{(k)} denotes the sum over the k perturbed input elements and Σ_{(n^{l−1}−k)} the sum over the remaining unperturbed ones; note that (w̃_{ij}^l)^2 = (w̄_{ij}^l)^2, since the merge changes only signs.
Since different combinations of perturbed input elements may yield different values of ρ_{ik}^l for a given k, we again use the average correlation coefficient, denoted ρ̄_{ik}^l, to approximate ρ_{ik}^l. When k = 0, it is always true that ρ̄_{i0}^l = ρ_{i0}^l. Different combinations of perturbed input elements may yield different sums Σ_{(k)} w̄_{ij}^l w_{ij}^l in equation 4.5. For given k and m, if the mth input element is perturbed, the other n^{l−1} − 1 input elements must contain k − 1 perturbed elements; that is, among all C_{n^{l−1}}^{k} combinations of perturbed input elements, there are C_{n^{l−1}−1}^{k−1} combinations in which the mth element is perturbed. It follows that if all C_{n^{l−1}}^{k} sums Σ_{(k)} w̄_{ij}^l w_{ij}^l are added together, then for any given m (1 ≤ m ≤ n^{l−1}) the term w̄_{im}^l w_{im}^l is counted C_{n^{l−1}−1}^{k−1} times. Because

C_{n^{l−1}−1}^{k−1} / C_{n^{l−1}}^{k} = k / n^{l−1}

and ρ̄_{i0}^l = ρ_{i0}^l, the average correlation coefficient is

ρ̄_{ik}^l = ( Σ_{j=1}^{n^{l−1}} w̄_{ij}^l w_{ij}^l − 2 (C_{n^{l−1}−1}^{k−1} / C_{n^{l−1}}^{k}) Σ_{j=1}^{n^{l−1}} w̄_{ij}^l w_{ij}^l ) / ( √(Σ_{j=1}^{n^{l−1}} (w̄_{ij}^l)^2) √(Σ_{j=1}^{n^{l−1}} (w_{ij}^l)^2) )
         = (1 − 2k/n^{l−1}) ( Σ_{j=1}^{n^{l−1}} w̄_{ij}^l w_{ij}^l ) / ( √(Σ_{j=1}^{n^{l−1}} (w̄_{ij}^l)^2) √(Σ_{j=1}^{n^{l−1}} (w_{ij}^l)^2) )
         = ρ_{i0}^l (1 − 2k/n^{l−1}).  (4.6)
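As a quick consistency check (not part of the original derivation), the averaging argument behind equation 4.6 can be verified by brute force: flip every possible set of k weight signs, average the resulting correlation coefficients, and compare with ρ_{i0}(1 − 2k/n^{l−1}). The sketch below assumes no separate weight perturbation (so w̄ = w and ρ_{i0} = 1); the weight values are arbitrary, and the function names are illustrative only.

```python
import math
from itertools import combinations

def corr(w_tilde, w):
    # correlation coefficient between perturbed and original weight vectors (cf. eq. 3.9)
    num = sum(a * b for a, b in zip(w_tilde, w))
    den = math.sqrt(sum(a * a for a in w_tilde)) * math.sqrt(sum(b * b for b in w))
    return num / den

def avg_corr_bruteforce(w, k):
    # average of corr over all C(n, k) choices of k sign-flipped input elements
    n = len(w)
    total = 0.0
    for flipped in combinations(range(n), k):
        w_tilde = [-w[j] if j in flipped else w[j] for j in range(n)]
        total += corr(w_tilde, w)
    return total / math.comb(n, k)

w = [2.0, -1.5, 3.0, 0.5, -2.5]        # arbitrary example weights
n = len(w)
for k in range(n + 1):
    predicted = 1.0 * (1 - 2 * k / n)  # eq. 4.6 with rho_i0 = 1
    assert abs(avg_corr_bruteforce(w, k) - predicted) < 1e-12
```

The exhaustive average matches the closed form for every k, which is exactly the counting argument above: each product term is flipped in a fraction k/n^{l−1} of the combinations.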
Madalines’ Sensitivity
2867
Table 3: Experimental Results for the Three Madalines.

Net 1 (architecture 20-15-1):
δW:             0.01     0.02     0.03     0.04     0.05     0.06     0.07     0.08     0.09     0.10
p (simulation): 0.011400 0.021933 0.032858 0.040961 0.050033 0.058933 0.066330 0.074381 0.081882 0.089083
s (ours):       0.010720 0.020977 0.030801 0.040219 0.049258 0.057943 0.066295 0.074336 0.082086 0.089563
s (Stevenson):  0.036058 0.051192 0.062939 0.072955 0.081876 0.090028 0.097606 0.104733 0.111496 0.117957

Net 2 (architecture 15-20-10-1):
δW:             0.01     0.02     0.03     0.04     0.05     0.06     0.07     0.08     0.09     0.10
p (simulation): 0.024780 0.049133 0.065125 0.081543 0.097473 0.110779 0.128845 0.138306 0.149780 0.161316
s (ours):       0.025913 0.048813 0.069195 0.087462 0.103938 0.118889 0.132533 0.145053 0.156599 0.167296
s (Stevenson):  0.120930 0.144180 0.159999 0.172422 0.182856 0.191969 0.200137 0.207593 0.214495 0.220951

Net 3 (architecture 25-20-15-10-1):
δW:             0.01     0.02     0.03     0.04     0.05     0.06     0.07     0.08     0.09     0.10
p (simulation): 0.045918 0.081263 0.119598 0.148934 0.176012 0.198072 0.222837 0.239430 0.258338 0.271175
s (ours):       0.059424 0.103402 0.137284 0.164259 0.186324 0.204786 0.220531 0.234178 0.246173 0.256846
s (Stevenson):  0.221407 0.241815 0.254826 0.264655 0.272694 0.279583 0.285672 0.291175 0.296230 0.300934
From equations 4.4 and 4.6, the sensitivity s_i^l can be expressed as

s_i^l ≈ Σ_{k=0}^{n^{l−1}} p_k^l ( 1/2 − arcsin(ρ̄_{ik}^l) / π ).  (4.7)
In summary, an algorithm for the computation of the sensitivity of a Madaline can be given as follows:

MADALINE_SENS(W, ΔW, S^0, ...)
  For layer l from 1 to L:
    For Adaline i from 1 to n^l:
      Calculate s_i^l using equations 4.3, 4.4, 4.6, and 4.7.
  S^L = (s_1^L, s_2^L, ..., s_{n^L}^L)^T is the sensitivity of the Madaline.
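The layer-by-layer computation can be sketched in code. The fragment below is an illustrative implementation, not the authors' code: it assumes ρ_{i0}^l takes the cosine-similarity form of equation 3.9 between the perturbed weight vector w̄_i^l and the original w_i^l (equation 3.13 itself is not reproduced in this excerpt), and the function names `rho0`, `layer_sensitivity`, and `madaline_sens` are hypothetical.

```python
import math

def rho0(w_bar, w):
    # correlation coefficient of perturbed vs. original weights (cosine form of eq. 3.9)
    num = sum(a * b for a, b in zip(w_bar, w))
    den = math.sqrt(sum(a * a for a in w_bar)) * math.sqrt(sum(b * b for b in w))
    return num / den

def layer_sensitivity(W_bar, W, s_prev):
    # W_bar, W: one weight vector per Adaline in the layer; s_prev: sensitivities S^{l-1}
    n = len(s_prev)                   # n^{l-1}: input dimension of the layer
    s_mean = sum(s_prev) / n          # eq. 4.3
    out = []
    for w_bar, w in zip(W_bar, W):
        r0 = rho0(w_bar, w)
        s = 0.0
        for k in range(n + 1):
            p_k = math.comb(n, k) * (1 - s_mean) ** (n - k) * s_mean ** k  # eq. 4.4
            r_k = max(-1.0, min(1.0, r0 * (1 - 2 * k / n)))                # eq. 4.6
            s += p_k * (0.5 - math.asin(r_k) / math.pi)                    # eq. 4.7
        out.append(s)
    return out

def madaline_sens(layers_W_bar, layers_W, s0):
    # propagate sensitivities from the input layer to the output layer; returns S^L
    s = s0
    for W_bar, W in zip(layers_W_bar, layers_W):
        s = layer_sensitivity(W_bar, W, s)
    return s
```

Two sanity checks follow directly: with unperturbed weights and S^0 = 0 every sensitivity is zero, and flipping the sign of all of an Adaline's weights drives its sensitivity to one.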
Figure 3: Experimental results for the three Madalines with weight perturbation ratios. [Plot: weight perturbation ratio (0.01 to 0.1) versus sensitivity or probability; one curve per network (Nets 1 to 3), each comparing the simulation, our, and Stevenson's results.]
By analysis, it can be seen that the computational complexity of the algorithm is O(Σ_{l=1}^{L} (n^{l−1})^2 n^l), while that of the simulation approach is O(2^{n^0} Σ_{l=1}^{L} n^{l−1} n^l). Obviously, our approach is much more efficient.

5 Experimental Verification

To verify the effectiveness of the derived formula and the given algorithm, a number of experiments were conducted. Again, the bottom-up strategy was followed, in which Adalines are considered first.

5.1 Verification for Adalines. The sensitivity results for three Adalines with input dimensions of 15, 20, and 30 were computed separately using equation 3.13, in which the elements of W were randomly selected from [−10, −1] and [1, 10], and the elements of ΔW were all identical and ranged from 0.1 to 1 with an increment of 0.1. The randomly obtained weight values of the three Adalines are given in Table 4 in the appendix. Computer simulations were run to compute the actual probability of output deviation over all 2^n possible input patterns for the three Adalines, with the same parameters as in the sensitivity computation. All of these results are
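To make the comparison concrete, the two operation counts can be evaluated for the three experimental architectures. This is only an illustrative tally of the complexity expressions above (constants ignored); the function names are hypothetical.

```python
def algo_cost(arch):
    # our algorithm: sum over layers of (n^{l-1})^2 * n^l
    return sum(arch[l - 1] ** 2 * arch[l] for l in range(1, len(arch)))

def sim_cost(arch):
    # exhaustive simulation: 2^{n^0} input patterns, each propagated through the net
    return 2 ** arch[0] * sum(arch[l - 1] * arch[l] for l in range(1, len(arch)))

for arch in ([20, 15, 1], [15, 20, 10, 1], [25, 20, 15, 10, 1]):
    print(arch, algo_cost(arch), sim_cost(arch))
```

For the 20-15-1 network, for instance, the algorithm needs on the order of 20^2 · 15 + 15^2 · 1 = 6225 operations, while the simulation needs 2^20 · 315, roughly 3.3 × 10^8, a gap that widens rapidly with the input dimension.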
Madalines’ Sensitivity
2869
given in Table 1 and illustrated in Figure 1. The theoretical results s and the simulation results p given in Table 1 and Figure 1 match well. This verifies the correctness of our approach. Further, the corresponding theoretical sensitivities s based on Stevenson's approach (Stevenson et al., 1990) for the three Adalines were also computed; they are given in Table 1 and drawn in Figure 1. A comparison of the simulation results, our results, and Stevenson's results demonstrates that the accuracy of our approach is better than that of Stevenson et al.

5.2 Verification for Madalines. The experiments on Madalines were carried out for two purposes: to verify the effectiveness of our algorithm and to compare our sensitivity results with those of Stevenson et al. (1990), demonstrating that the accuracy of ours is better. In our experiments, three Madalines were implemented. They have architectures of 20−15−1, 15−20−10−1, and 25−20−15−10−1, respectively, and have weight elements randomly selected from [−10, −1] and [1, 10]. All the weight values of the three Madalines thus obtained are given in Table 5 in the appendix. To address the first objective, the sensitivity results for the three Madalines were computed by MADALINE_SENS(W, ΔW, S^0, ...), with the elements of ΔW identical and ranging from 0.1 to 1.0 with an increment of 0.1 and with S^0 being zero. In addition, the actual probabilities of output deviation for the three Madalines, with the same parameters as in the sensitivity computation, were obtained by running computer simulations. Both the theoretical results s and the simulation results p are given in Table 2 and drawn in Figure 2. The data and graphs in Table 2 and Figure 2 show that the computed sensitivity and the simulated probability match well.
To fulfill the second purpose, since Stevenson's approach (Stevenson et al., 1990) requires all Adalines in a Madaline to have the same weight perturbation ratio, our sensitivity results, Stevenson's sensitivity results, and the simulated probability results for the three Madalines were all computed with the same weight perturbation ratio, ranging from 0.01 to 0.1. The results are given in Table 3 and drawn in Figure 3. By comparing the corresponding data in Table 3 and the graphs in Figure 3, we can conclude that the accuracy of our approach is better.

6 Conclusion

In this article, a quantified sensitivity measure of Madalines to input and weight perturbations is put forward. Formulas and an algorithm are given for the computation of the sensitivity. Experimental verification demonstrates that the theoretical results obtained from the formula and the algorithm match the simulation results well, even when the input dimension of each layer is not very large. The sensitivity measure is expected to be useful as a relative criterion for evaluating a network's performance
so that crucial issues for Madalines, such as improvement of error tolerance and generalization ability and simplification of the network architecture, can benefit from this measure. In our future work, we intend to apply the sensitivity measure to evaluate the relevance of each Adaline, as well as of each input attribute, in a given Madaline, and then, based on that evaluation, to trim the Madaline to an appropriate architecture with higher performance and lower cost. Further, we will investigate how to apply our method to other kinds of neural networks.

Appendix

This appendix presents the parameter values of the Adalines and Madalines used in the experiments.

Table 4: Parameter Values of the Three Adalines Used in the Experiment.
Neuron 1 (dimension 15): {8.586, −9.271, 8.843, −8.257, −6.671, 2.871, 9.700, 7.314, 8.243, −8.486, −6.343, 9.900, −7.171, 9.743, 8.400}.
Neuron 2 (dimension 20): {9.967, −2.367, −6.650, −7.350, 8.600, −9.517, −8.283, 8.817, 6.133, −9.200, 2.567, 8.633, 5.867, −4.783, −8.267, 5.783, −7.750, 8.300, 4.533, −6.117}.
Neuron 3 (dimension 30): {−7.1500, 2.9130, 8.5531, −6.6591, −2.2040, 2.8642, 6.4648, 6.6690, 4.3343, 6.1763, 5.0628, 1.3951, −1.2447, 3.8142, 1.1158, −4.4557, −7.1480, 1.8356, 1.3180, 6.5116, 6.4769, 1.1418, −1.1472, 2.7107, 6.2823, 1.5182, −4.3081, −6.6831, 7.4587, 7.2340}.
Table 5: Parameter Values of the Three Madalines Used in the Experiment.

Net 1 (architecture 20−15−1):
Layer 1: {−4.357, −1.014, 3.514, −6.400, −4.343, −1.500, 7.829, 7.143, 9.043, −7.700, 3.571, −5.214, 8.771, −7.529, −3.357, 4.557, −9.557, −9.129, −8.100, −8.029}, {8.971, 4.271, −3.557, 8.386, 4.529, −1.771, 1.929, 7.729, 7.543, 6.157, −9.871, −3.814, −2.857, 8.400, 4.629, −4.143, 1.329, −1.086, 4.500, −2.643}, {−6.200, −4.143, −3.371, −5.414, 9.671, −4.029, 2.014, 1.500, −3.186, −4.200, 2.129, −9.371, 6.057, 5.286, −6.071, 7.900, −6.814, 4.914, 8.471, 1.071}, {−4.186, 5.643, 6.957, −3.543, 9.214, −3.600, −5.129, −9.071, 2.829, 8.271, −4.700, 6.500, −1.600, −8.014, 2.729, 9.543, −7.500, −3.543, 6.343, −1.114},
Madalines’ Sensitivity
2871
Table 5: Continued. (Net 2, architecture 15−20−10−1, begins at the 'Layer 1' entry below; the entries immediately following complete Net 1.)
Weight {−2.186, −6.529, −8.043, −8.729, −6.571, −3.329, 1.886, 7.429, −3.386, 4.443, 3.714, 1.286, −6.843, 9.029, 6.643, −2.157, 3.800, −3.700, −4.200, 8.614}, {−2.971, 4.800, −7.557, −7.214, 7.300, 3.229, −3.014, 5.343, 8.614, −7.743, 7.871, 5.286, −5.257, 9.114, −3.286, 6.657, 8.129, 7.686, 9.457, −1.086}, {6.243, −4.557, −1.514, −5.314, 7.500, 1.629, 8.143, −5.186, 9.343, 9.300, −7.143, 2.243, 1.457, −1.086, 9.786, 3.543, −5.086, 2.414, −7.871, 8.986}, {4.414, −8.229, 8.286, −1.757, 5.557, −1.971, −4.429, 8.329, −1.900, −8.229, −2.829, 1.371, 6.114, −9.214, −2.771, −7.100, 3.257, −4.986, −3.971, 2.343}, {−6.829, 6.729, −3.114, 4.043, −1.643, −2.957, 8.857, 1.400, −1.271, −5.557, −1.671, −9.457, 2.671, −4.300, −4.971, 8.714, 2.586, −8.357, −6.814, −8.686}, {−6.157, −1.486, −8.429, 4.943, −1.200, 9.429, 6.057, 4.400, 1.514, −1.243, 4.586, −7.057, −2.900, −7.886, 8.386, 9.386, −1.600, −8.029, −3.314, −7.929}, {−9.143, −1.757, −2.371, 1.214, 2.786, −3.457, 5.029, −1.343, −4.843, 7.514, 1.814, −8.671, −3.657, 9.586, 2.271, 8.686, 7.629, 4.457, −5.143, −2.986}, {−1.743, −9.329, 7.586, 8.100, 1.586, 3.114, 9.557, −5.229, 3.086, 8.343, −9.986, 6.343, −4.214, 3.057, −4.457, 9.229, −1.129, −2.900, −4.886, −1.014}, {−6.900, −4.843, −4.343, 5.786, 7.471, −7.329, 5.057, −8.871, 4.657, −2.929, 6.943, 7.586, −8.386, −3.571, −9.486, 4.486, −4.371, 3.271, −7.786, −2.129}, {6.471, 3.043, 7.300, −10.000, −1.186, 4.143, 5.314, −9.943, 7.629, −7.643, −4.571, −1.957, −3.371, −1.386, 4.071, −9.414, 1.586, −7.971, 5.043, −8.371}, {2.329, 6.714, 5.400, 6.143, 8.714, −1.929, 4.457, 1.771, −4.857, −9.286, −6.400, −2.400, 5.171, 3.086, −5.629, −5.743, 1.686, −7.243, 3.629, −3.643}. Layer 2: {4.686, −4.757, 9.714, −5.086, −7.814, 8.957, 6.129, −7.914, −5.786, 1.243, 8.514, 1.800, −2.171, 2.571, −5.343}. 
Layer 1: {2.057, 6.643, −6.400, −7.757, 1.643, −1.071, 4.586, 7.057, −3.986, 4.243, −2.943, −7.986, −9.814, −5.943, 1.086}, {9.500, 1.486, −3.071, 9.857, −3.629, 2.657, −3.457, −6.771, −1.343, 8.000, 9.771, 9.386, 1.843, −3.714, −7.971}, {7.729, 4.029, −3.886, −5.243, −3.586, −2.929, 2.971, −7.271, 1.800, 2.914, 4.671, 1.586, 7.157, 9.929, 6.514}, {−8.529, 3.729, −6.914, 4.186, 5.186, 1.171, −2.100, 2.686, −7.700, −9.400, −4.786, −8.814, 5.043, 7.829, 4.971}, {−6.843, −8.800, −7.914, −2.257, −5.157, 7.700, −5.557, 7.614, 8.871, −2.200, −4.729, −3.029, 9.686, 7.800, −7.214}, {−4.371, −5.186, 9.386, 2.771, −5.229, −6.614, 6.857, −5.886, 1.657, 3.957, −4.329, 3.443, 5.943, −3.143, 1.257},
Table 5: Continued.
Weight {−6.371, 8.929, −7.829, 8.971, 4.857, 4.071, −9.114, −9.457, −6.243, 9.929, 4.300, −3.071, −4.714, 2.700, −2.943}, {2.029, −4.457, 2.900, −4.400, 3.400, 3.014, −9.786, −4.529, −1.643, −4.814, 5.586, −9.843, 7.743, 3.414, 1.257}, {2.586, 2.971, 4.686, −3.029, −7.529, −2.586, −6.200, −4.471, −1.857, 8.571, −7.143, 1.614, −9.057, 7.729, −1.700}, {9.529, −6.014, −6.286, 7.486, 4.600, 8.514, 3.557, −1.257, −1.429, −4.286, −5.743, −3.200, 1.057, 7.257, 2.386}, {−5.914, −2.486, −1.029, −8.943, 4.757, 1.914, −5.900, 8.800, 4.300, 2.114, −4.543, 2.071, 2.371, 2.043, −4.200}, {2.129, −4.700, 2.371, 6.529, −4.629, 8.686, 5.600, −7.614, −3.400, 2.771, −9.443, 2.800, 2.314, 5.571, 1.443}, {1.100, 9.943, −6.743, 4.286, 9.871, 5.229, −7.700, −5.700, −1.614, 6.114, 6.343, 9.543, −1.071, 5.857, −1.071}, {−1.871, 6.943, 8.886, 6.843, 5.743, 2.157, −8.100, −6.700, −5.614, −8.986, 9.986, 8.057, −3.229, 2.271, 4.371}, {−5.757, 7.314, −3.686, 1.571, 6.643, 7.029, −9.957, −4.314, −2.557, 2.500, 9.900, −8.757, 5.757, −3.214, 9.743}, {−6.757, 1.171, 8.686, 9.671, −4.471, 3.914, −6.243, 9.343, 4.743, 7.729, 8.700, 7.500, 1.671, 7.400, 1.243}, {4.871, 3.329, −6.329, 7.257, 6.429, −3.129, −9.557, 4.629, 1.757, −4.671, 2.629, −9.743, −9.986, 4.400, −3.343}, {9.100, 6.314, −5.729, 6.286, 9.686, −5.643, 5.843, 9.757, 8.357, 1.629, −9.829, 6.500, 8.343, 5.829, 6.386}, {2.014, 5.657, 2.429, −5.600, 2.957, −6.871, −3.386, 2.114, −5.257, −2.143, 4.186, 9.500, −5.686, −5.086, 5.586}, {−3.386, 9.114, −7.229, −9.986, −6.357, −8.757, 9.143, −7.657, −5.614, 1.343, −7.143, 7.871, 5.000, −4.000, 2.600}. 
Layer 2: {−5.443, −2.986, −4.800, −8.629, 9.814, −7.214, 6.686, 6.571, 1.729, −4.214, 3.543, 9.429, −3.857, −1.700, −1.329, −8.743, 9.029, 3.214, 9.771, 6.286}, {−3.971, 5.429, 4.429, 1.043, 4.243, 3.129, −2.471, −7.800, 8.371, −3.729, −7.743, 9.386, −2.000, −1.971, 7.557, −7.629, −3.143, 3.114, 4.486, 8.614}, {−3.814, −5.329, −2.686, 8.929, 1.729, 6.286, 7.257, −2.771, −9.186, 6.357, 3.943, −6.871, 8.457, −7.171, −6.071, −6.414, −6.571, −8.243, 2.771, 5.786}, {−4.400, −3.114, −2.129, −4.700, −8.671, 1.386, −2.157, −8.914, 2.757, −6.214, −5.043, 2.443, 1.486, −9.914, 8.743, −8.229, 5.814, −4.771, 9.286, 6.200}, {−7.614, 2.571, 6.600, −5.986, −6.571, −3.886, 6.043, 4.386, 2.529, −7.600, 1.571, 7.686, −6.671, 1.200, 9.286, 1.471, −3.229, −3.286, −5.786, 3.786}, {6.071, −9.143, 5.557, 5.357, 5.171, −7.443, 9.457, 4.286, 6.414, −7.529, 5.700, 7.900, 5.271, 8.129, −6.300, −7.500, −8.557, −8.886, 2.686, 7.000}, {5.886, −1.643, 1.129, −8.686, −8.614, 2.386, 4.143, −5.914,
Madalines’ Sensitivity
2873
Table 5: Continued. (Net 3, architecture 25−20−15−10−1, begins at the 'Layer 1' entry below; the entries immediately following complete Net 2.)
Weight −6.814, −8.814, 1.957, 1.286, 3.271, 1.814, −3.629, −6.971, −7.286, 2.057, −3.771, 5.457}, {9.400, 6.943, −5.129, −7.171, 9.086, 8.943, 4.143, 8.643, −8.986, −2.914, 8.857, −4.486, −5.629, −7.429, 7.614, 7.186, −7.929, 4.971, 3.157, 5.886}, {7.043, 9.629, 5.843, −9.000, −3.214, −2.086, −2.871, −2.086, 2.114, −1.671, 8.614, −7.014, −6.986, −6.886, 1.900, 8.086, −2.414, −5.171, −8.043, −2.771}, {−7.000, 5.057, −7.371, 2.786, −9.343, 1.543, −9.500, −1.886, 9.429, −2.400, 8.129, −3.829, −4.586, 6.529, −4.800, 1.114, 7.457, 5.729, −5.657, 9.029}. Layer 3: {6.586, −9.400, −1.786, −7.457, −1.157, −3.257, 1.243, 9.771, −5.900, −1.014}. Layer 1: {−2.400, 1.871, −5.400, −2.000, −6.143, −7.386, 8.471, 9.929, −2.714, 6.371, 9.443, 4.786, 7.600, 8.686, −8.757, 9.843, −4.400, 3.543, −7.957, −3.886, 3.757, 7.771, −6.100, 4.000, −3.129}, {2.614, −4.043, 5.371, −2.857, −2.543, −1.214, 3.329, 9.557, −8.500, 6.643, 8.529, −3.600, 8.329, −3.243, 1.671, 1.371, 3.214, 6.429, 9.529, 6.114, 9.129, 8.714, 5.071, 4.329, −6.986}, {−3.729, 6.400, 5.871, 8.900, −8.371, −9.414, 7.429, 7.214, −3.986, −8.329, −4.757, 6.700, −3.500, 8.743, 6.529, 6.757, 4.943, 1.814, 1.600, −7.386, 3.929, 7.043, 3.971, −1.086, −9.343}, {−3.729, −1.557, −3.200, −1.414, 5.714, −8.514, 8.143, −6.729, 8.243, 4.400, −2.500, −6.000, −7.814, 4.771, −3.729, −2.757, 6.471, −1.986, −3.300, 8.157, 7.743, 7.586, −2.871, 1.814, −8.843}, {9.729, 3.286, 9.429, 7.014, −1.186, 9.300, 2.357, 2.743, −7.800, −7.986, −1.486, −4.600, −3.429, 5.400, −3.600, 8.671, −8.814, 7.571, −1.943, 8.271, 1.114, 8.471, 2.743, −6.814, −1.557}, {6.029, 1.186, 4.743, 8.414, −9.829, −6.557, −4.571, 5.829, 4.400, 6.086, 1.686, −1.371, −8.200, 3.700, 2.071, 5.486, 3.371, −9.957, 3.300, −6.386, −6.500, 8.286, −6.086, −3.543, −2.871}, {8.614, −2.314, 3.257, 8.571, −4.443, 2.229, 6.000, −4.729, 8.557, 5.171, 1.643, −2.300, 7.614, −7.657, 5.657, −5.871, 8.443, −9.557, 8.186, 2.257, 6.229, −7.929, −6.943, 9.557, −2.086}, {−8.686, −8.586, −4.829, 
−6.157, −6.614, 7.057, 4.386, −8.014, 6.214, −5.514, −4.757, −6.171, −2.571, 2.700, 5.214, −6.629, −4.057, −7.857, 9.600, −8.329, −1.300, 3.286, 4.886, −1.357, −2.200}, {7.429, 2.771, 7.200, −5.786, −9.600, 5.243, −1.129, −9.086, −4.657, 3.457, 8.900, −4.486, 2.714, 9.686,
Table 5: Continued.
Weight 6.314, −7.000, −3.071, 1.686, −4.071, −6.757, 9.386, −1.229, 3.771, 8.543, 7.686}, {1.943, 4.614, −3.271, 7.743, −7.286, 8.657, 9.186, −5.571, 9.071, 9.471, 8.629, 5.386, −8.543, −6.900, −9.814, 9.200, −2.471, −8.543, −8.014, 1.529, −8.386, −8.886, 5.986, 7.143, 4.829}, {9.443, −2.114, 5.214, −1.243, 9.943, −9.843, 6.371, 2.714, 7.000, 7.500, 6.329, −7.529, 9.829, 7.400, 6.271, 9.600, 3.729, −5.143, 2.886, 5.343, 6.400, −1.243, 9.886, −3.657, 9.000}, {−6.614, −3.743, −3.386, −4.271, 8.471, −5.714, −2.029, 5.671, 6.114, 2.286, −5.586, −6.386, −7.429, 5.471, −7.800, 8.986, −8.671, 1.329, 6.671, 5.329, −1.857, −9.400, 2.529, 3.914, −2.586}, {3.914, 9.943, −9.729, −8.900, −7.657, 5.400, −7.600, −6.300, 2.371, 5.371, −5.114, 8.943, −9.486, 8.371, 3.557, 8.229, 9.129, 7.014, 1.786, 9.057, −3.886, 4.871, 6.600, −6.329, 9.343}, {8.071, 1.600, −7.600, −4.457, −5.800, −8.071, 2.000, 6.600, 4.500, 5.871, −3.071, 8.700, −2.014, 2.314, 7.814, 1.886, 4.814, 9.200, −5.143, −1.971, 7.686, 7.714, 1.343, −7.000, 7.786}, {3.629, −8.143, −9.171, 1.100, 1.314, 1.000, 7.871, 3.529, −8.800, −3.814, 5.771, 9.143, −2.886, 9.557, 5.329, −5.386, −4.257, −6.786, 5.171, 7.743, 1.271, 9.986, −2.457, −4.157, 4.871}, {−3.371, 2.200, 3.414, −9.629, −1.971, −4.457, 2.357, −2.543, −3.671, −7.871, −6.386, −1.786, −5.029, 6.786, 8.100, −4.071, 2.829, −2.986, −3.743, −5.529, 9.943, 9.257, 3.786, −5.629, −6.043}, {4.900, −9.500, −1.057, −2.743, −7.300, −6.400, −6.629, 6.257, −5.343, −8.800, 9.357, 1.629, −4.486, 4.200, −8.357, 1.143, 3.543, −8.886, −4.014, 8.429, −7.257, −8.986, 3.429, −1.571, 9.086}, {2.057, 4.657, −9.743, 6.271, −2.400, −2.257, −4.057, 7.743, −2.186, −2.871, −9.800, 1.143, 1.229, −3.586, −7.471, −6.129, 3.000, 8.100, 4.857, 7.814, 1.271, −6.171, −2.229, 1.186, 8.543}, {−7.086, 7.329, −8.943, −8.543, 2.900, 2.629, −9.143, −4.529, −8.014, −1.329, −3.471, 4.000, −8.600, −2.300, −9.929, −6.157, −3.286, 6.014, −8.400, −9.886, −1.200, −8.471, 8.729, −5.314, −4.257}, {9.571, 2.229, 
9.157, 3.429, −8.800, −2.357, −7.029, −4.886, 2.600, 1.571, −1.129, −6.914, 3.757, −8.486, 7.800, −7.686, 8.543, 3.400, −9.186, 4.529, −1.571, 4.743, −9.414, 8.714, 5.543}. Layer 2: {6.471, 6.900, −2.929, −2.743, 1.086, 9.943, 1.371, 4.529, 4.000, −9.543, 8.900, 7.543, −8.300, −2.143, −6.657, 9.457, −1.186, −2.886, −9.014, 4.571},
Madalines’ Sensitivity
2875
Table 5: Continued.
Weight {2.100, −1.214, 1.700, 6.829, 6.671, 5.014, −8.343, 2.529, 4.843, −5.300, 3.557, 9.129, −9.286, 1.243, 2.286, 8.800, −3.543, −7.843, −9.686, 7.871}, {−5.429, −2.814, 7.057, 8.786, −5.029, 6.000, −2.000, 3.357, −3.757, 7.357, 9.743, 5.957, −2.100, 2.414, 9.857, 2.429, −3.186, −7.643, −9.829, −6.743}, {1.229, −8.243, −4.100, 2.857, −7.071, −1.771, −5.486, −7.057, 1.843, 9.243, −6.129, −1.614, −7.914, −1.400, −9.914, −9.814, 9.657, 7.900, −3.429, −2.157}, {−3.343, 9.286, −1.229, 2.086, 7.414, −9.357, −2.514, −6.857, −7.586, 5.343, 9.543, 5.014, −4.743, 2.529, −5.457, 8.014, −5.529, 8.014, −6.086, −6.386}, {−5.771, 1.443, 1.829, 4.129, −8.029, −6.357, −7.814, 1.643, −2.986, 5.157, 1.257, −6.757, 4.500, 7.543, 3.486, −3.943, −1.214, −4.643, −9.700, −1.029}, {9.414, −2.957, 8.986, 6.414, 5.729, −7.900, 6.057, −5.514, −1.657, 7.600, 7.171, −1.414, 3.414, 3.957, 9.829, −9.514, −7.429, −4.143, −1.986, −5.271}, {5.257, −8.086, −9.443, 9.557, −1.329, 1.086, −8.514, 4.114, −8.129, 2.829, 2.257, 4.900, 6.886, 5.471, −1.900, −4.200, 4.300, 7.700, 6.800, −1.843}, {−9.714, −5.743, −9.314, 5.114, −9.343, −9.514, −8.114, −3.429, −6.186, −2.743, 6.257, −5.229, −4.271, 3.214, 2.657, −5.771, 9.814, 3.714, −9.557, −1.486}, {1.214, −9.786, −2.714, −3.900, 5.200, −5.043, 2.629, 1.457, 8.171, 7.714, 7.329, 2.343, −6.643, 4.243, −4.843, −9.943, 6.214, −8.857, −9.914, 7.200}, {−3.886, 6.871, 1.286, 8.586, 1.029, 4.343, −1.500, 4.429, 1.357, 7.943, −6.057, 8.929, 2.500, −3.171, −2.443, −6.900, 5.357, −5.457, −9.686, 8.700}, {−4.843, 8.743, 8.343, −9.986, −4.543, 4.357, 3.429, 4.100, −4.729, −8.857, −3.200, 8.843, 3.486, 5.714, 2.271, 8.429, 4.771, 1.414, −1.357, −2.971}, {−1.900, 2.871, 9.986, 2.743, −5.229, −2.843, −8.514, −4.029, −10.000, 6.057, 9.029, −2.986, 6.543, 4.286, 2.629, −2.186, 5.829, 1.743, 4.543, −2.700}, {−9.557, −8.143, −1.743, −2.757, −4.971, −5.914, −1.929, −5.443, −2.371, −6.829, 9.886, 4.200, −6.357, −9.957, −5.357, −4.971, 2.014, 2.043, −2.771, −5.257}, {4.557, 
5.771, −7.057, −8.857, −1.357, 2.443, −9.014, 7.771, −9.100, −3.571, 3.343, −6.571, 1.343, −8.500, −9.229, −3.686, −1.271, −9.514, −6.200, −5.329} Layer 3: {7.500, −3.514, −9.586, 3.800, 3.229, −9.129, −9.243, −2.614, 6.143, 2.286, −1.100, 6.443, 7.129, −3.671, −2.314}, {7.800, −9.529, −1.029, −8.629, −7.443, −6.471, −3.371, 2.143, 4.886, 1.400, 6.414, −3.814, −6.100, −6.543, −6.386}, {−4.500, 5.743, 1.514, −2.014, −6.114, 6.771, 6.500, 5.271, −2.157, −3.700, −9.543, −1.629, 3.900, 6.971, −8.900},
Table 5: Continued.
Weight {−9.200, −1.057, 6.086, −3.014, 9.429, 4.500, −5.114, 6.171, 9.214, 7.929, −9.957, −4.729, −3.114, 2.357, 2.871}, {4.343, 9.886, −5.843, 1.814, 8.043, 9.771, −4.329, −3.571, 2.543, −1.029, 9.843, 5.043, −5.871, 3.486, −2.957}, {8.086, −9.643, 9.871, −5.186, −8.000, 8.071, 6.614, 4.929, 1.257, 9.814, 6.929, 9.371, −6.786, 8.529, 7.086}, {7.743, 5.471, −9.871, −8.843, −8.186, −4.700, 9.843, −2.700, 2.714, 9.271, −3.043, 2.686, −2.614, 8.086, 6.914}, {−3.843, −1.529, −7.586, 6.186, 1.829, −5.871, 8.343, 7.343, 2.986, 9.229, 9.457, −9.957, 8.371, 1.900, 1.129}, {−8.100, 1.871, 2.229, −6.486, 9.800, −3.029, −2.614, 9.800, −2.643, −2.443, −3.129, −7.657, −6.914, 2.614, −1.300}, {−9.543, 6.300, −7.829, 5.457, 1.114, 9.371, −3.629, 8.429, −8.814, −7.443, −5.857, −5.186, 2.114, 2.457, −2.486} Layer 4: {−1.243, 5.629, 3.000, 8.029, −9.400, −8.571, −4.743, −4.857, −3.343, −8.971}
Acknowledgments

We express our gratitude to the reviewers for their helpful suggestions, which significantly improved this article. This work was supported by the National Natural Science Foundation of China under grant 60571048 and the Provincial Natural Science Foundation of Jiangsu, China, under grant BK2004114.

References

Alippi, C., Piuri, V., & Sami, M. G. (1995). Sensitivity to errors in artificial neural networks: A behavioral approach. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 42(6), 358–361.

Cheng, A. Y., & Yeung, D. S. (1999). Sensitivity analysis of neocognitron. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 29(2), 238–249.

Choi, J. Y., & Choi, C. H. (1992). Sensitivity analysis of multilayer perceptron with differentiable activation functions. IEEE Transactions on Neural Networks, 3(1), 101–107.

Engelbrecht, A. P. (2001a). Sensitivity analysis for selective learning by feedforward neural networks. Fundamenta Informaticae, 45(1), 295–328.

Engelbrecht, A. P. (2001b). A new pruning heuristic based on variance analysis of sensitivity information. IEEE Transactions on Neural Networks, 12(6), 1386–1399.

Fu, L., & Chen, T. (1993). Sensitivity analysis for input vector in multilayer feedforward neural networks. In Proceedings of the IEEE International Conference on Neural Networks (Vol. 1, pp. 215–218).
Madalines’ Sensitivity
2877
Gong, J., & Zhao, G. (1996). An approximate algorithm for bivariate normal integral. Computational Structural Mechanics and Applications, 13(4), 494–497.

Oh, S. H., & Lee, Y. (1995). Sensitivity analysis of a single hidden-layer neural network with threshold function. IEEE Transactions on Neural Networks, 6(4), 1005–1007.

Piché, S. W. (1995). The selection of weight accuracies for madalines. IEEE Transactions on Neural Networks, 6(2), 432–445.

Stevenson, M., Winter, R., & Widrow, B. (1990). Sensitivity of feedforward neural networks to weight errors. IEEE Transactions on Neural Networks, 1(1), 71–80.

Yeung, D. S., & Sun, X. (2002). Using function approximation to analyze the sensitivity of the MLP with antisymmetric squashing activation function. IEEE Transactions on Neural Networks, 13(1), 34–44.

Zeng, X., & Yeung, D. S. (2001). Sensitivity analysis of multilayer perceptron to input and weight perturbations. IEEE Transactions on Neural Networks, 12(6), 1358–1366.

Zeng, X., & Yeung, D. S. (2003). A quantified sensitivity measure for multilayer perceptron to input perturbation. Neural Computation, 15(1), 183–212.

Zeng, X., & Yeung, D. S. (2006). Hidden neuron pruning of multilayer perceptrons using a quantified sensitivity measure. Neurocomputing, 69, 825–827.

Zurada, J. M., Malinowski, A., & Cloete, I. (1994). Sensitivity analysis for minimization of input data dimension for feedforward neural network. In Proceedings of the IEEE International Symposium on Circuits and Systems (pp. 447–450). Piscataway, NJ: IEEE.

Zurada, J. M., Malinowski, A., & Usui, S. (1997). Perturbation method for deleting redundant inputs of perceptron networks. Neurocomputing, 14(2), 177–193.
Received January 5, 2005; accepted March 29, 2006.
ARTICLE
Communicated by Misha Tsodyks
Short-Term Synaptic Plasticity Can Enhance Weak Signal Detectability in Nonrenewal Spike Trains

Niklas Lüdtke
[email protected]
Mark E. Nelson
[email protected]
Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, IL 61801, U.S.A.
We study the encoding of weak signals in spike trains with interspike interval (ISI) correlations and the signals’ subsequent detection in sensory neurons. Motivated by the observation of negative ISI correlations in auditory and electrosensory afferents, we assess the theoretical performance limits of an individual detector neuron receiving a weak signal distributed across multiple afferent inputs. We assess the functional role of ISI correlations in the detection process using statistical detection theory and derive two sequential likelihood ratio detector models: one for afferents with renewal statistics; the other for afferents with negatively correlated ISIs. We suggest a mechanism that might enable sensory neurons to implicitly compute conditional probabilities of presynaptic spikes by means of short-term synaptic plasticity. We demonstrate how this mechanism can enhance a postsynaptic neuron’s sensitivity to weak signals by exploiting the correlation structure of the input spike trains. Our model not only captures fundamental aspects of early electrosensory signal processing in weakly electric fish, but may also bear relevance to the mammalian auditory system and other sensory modalities.
1 Introduction

In response to a growing body of experimental studies on short-term synaptic plasticity (Tsodyks & Markram, 1997; Zucker & Regehr, 2002; Xu-Friedman & Regehr, 2004; Blitz, Foster, & Regehr, 2004), models of nonlinear synaptic transmission have emerged that emphasize the functional importance of the relative timing of presynaptic action potentials (Maass & Zador, 1999; Markram, 2003; Abbott & Regehr, 2004). Sensitivity of synaptic transmission to the history of presynaptic spikes becomes especially relevant when interspike intervals (ISIs) are correlated. Such input correlations have been reported in various sensory modalities (Lowen & Teich, 1992; Teich, Turcott, & Siegel, 1996; Bahar et al., 2001) and are particularly prominent in

Neural Computation 18, 2879–2916 (2006)
© 2006 Massachusetts Institute of Technology
Figure 1: Schematic representation of a sensory neuron receiving afferent input from an array of spiking units. The stimulus intensity is encoded in the afferent firing rates. The noisy pattern of gray-scale values in the input array depicts spike counts obtained within a fixed time window. The insets show a spike train of an individual afferent and the corresponding joint ISI histogram, revealing negative ISI correlations. Short ISIs tend to be followed by longer ISIs and vice versa. Since the firing activity is stochastic, the spike counts exhibit variability. For temporally correlated ISIs, the spike count variability depends on the count window length.
the active electrosensory system of weakly electric fish (Ratnam & Nelson, 2000). The negative ISI correlations observed in the electrosensory system have been successfully modeled using an integrate-and-fire type mechanism with threshold fatigue (Chacron, Maler, & Longtin, 2001; Brandman & Nelson, 2002). In these models, the firing threshold is elevated following an action potential and subsequently decays toward a baseline level. With appropriate parameters, the ISI sequence exhibits negative correlations, that is, short ISIs tend to be followed by longer ISIs and vice versa.

1.1 Detection of Weak Signals. In this article we explore the interplay between ISI correlations and fast synaptic plasticity in sensory systems that respond to extremely weak stimuli by detecting small changes in afferent spike activity. Figure 1 shows a generalized scheme of a detector neuron receiving input from a receptor array via afferent fibers. The gray-scale intensity represents stimulus intensity encoded as a noisy, localized pattern of afferent firing rates above a baseline activity. Throughout this article, we assume there are no interconnections among the afferent fibers and hence no spatial correlations in the activation pattern other than those due to the stimulus. We are particularly interested in situations where the signal-to-noise ratio is low, such that the stimulus-induced change in activity is small compared to
intrinsic fluctuations of sensory afferent activity. Under these circumstances, the detector neuron has to perform spatiotemporal averaging to enhance the signal relative to the stochastic background activity.

1.2 Sensory Neurons as Likelihood Ratio Detectors. The accumulation and evaluation of evidence conveyed by the activities of multiple afferents can be assessed using a key concept from statistical detection theory: the likelihood ratio. This quantity compares the probability of the afferent activity in the presence of a stimulus (baseline plus signal) and without a stimulus (pure baseline) by computing the ratio

p(afferents; signal + baseline) / p(afferents; baseline).

Using a more formal notation, we introduce the detection-theoretic definition of an optimal detector. Let a be the input data vector (afferent activities). The logarithm of the ratio of data likelihoods under assumptions of signal presence (hypothesis H1) and absence (null hypothesis H0), respectively, is compared to a threshold γ:

Λ = ln [ p(a; H1) / p(a; H0) ] > γ.   (1.1)
The detector decides in favor of H1 if the threshold is exceeded. The choice of threshold value determines the probability of a false alarm. According to the Neyman-Pearson theorem (Kay, 1998), the likelihood ratio detector is optimal in the sense that it achieves maximum probability of detection for a given false alarm probability. Although the Neyman-Pearson theorem holds for any monotonic function of the likelihood ratio, one usually takes the logarithm. Under the independence assumption, the ensemble activity can be expressed as a product of individual afferent probabilities, and the logarithm of the product of probabilities then transforms into a sum of log likelihoods. Moreover, the logarithm of the likelihood ratio can be written as the difference of the log likelihoods. Temporal evidence accumulation can be accomplished by summing the likelihood ratios obtained at different time instances. This procedure is commonly referred to as the cumulative sum (CUSUM) algorithm (Page, 1954) and is a repeated sequential likelihood ratio test (Wald, 1948). The resultant quantity is the cumulative log-likelihood ratio, denoted by Λ_cum and defined in a recursive fashion:

Λ_cum[k + 1] = max{Λ_cum[k] + Λ[k], 0};   Λ_cum[0] = 0.   (1.2)
Note that the cumulative log-likelihood ratio Λ_cum undergoes rectification. The detector decides H1 at time step k if Λ_cum[k] > γ̃, where γ̃ is a threshold that determines the false alarm rate. Intuitively, the rectification in equation 1.2 seems advantageous, since it limits the accumulation of negative evidence, allowing a faster recovery of the cumulative likelihood ratio once Λ[k] turns positive. In addition, it has been formally proved (Moustakides, 1986) that this scheme is indeed optimal in the sense that for a given false alarm rate, the CUSUM algorithm exhibits the shortest average detection delay. It has been demonstrated that neurons have the ability to carry out likelihood ratio computations (Gold & Shadlen, 2001). In the context of decision making and motion perception, evaluating the logarithm of a likelihood ratio is equivalent to calculating the difference in firing rate of two neurons with opposite preferred directions of motion, provided the neural responses are described by normal, Poisson, or exponential densities (Gold & Shadlen, 2001). Furthermore, if the responses are independent and identically distributed over time, temporal accumulation of evidence, as in the CUSUM procedure, can be accomplished by an integrate-and-fire neuron (Gold & Shadlen, 2002).

1.3 Spike Train Statistics and Detection Performance. To assess theoretical detection performance limits of an individual sensory neuron for different input spike train statistics, we first derive a CUSUM detector model based on equation 1.2 for afferents with renewal spike statistics. Though designed for a renewal process, this type of detector can also operate with temporally correlated input. Detection performance has been shown to improve in the presence of negative ISI correlations, since a detector can passively benefit from the reduced spike count variability caused by the correlations (Ratnam & Nelson, 2000; Chacron et al., 2001; Goense & Ratnam, 2003).
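The recursion in equation 1.2 is easy to state in code. The following minimal sketch (ours, not the authors'; the threshold value is illustrative) accumulates per-step log-likelihood ratios with rectification and reports the first threshold crossing:

```python
def cusum(log_likelihood_ratios):
    """Rectified cumulative log-likelihood ratio (equation 1.2).

    Each increment is the per-step log-likelihood ratio Lambda[k];
    rectification at zero limits the accumulation of negative evidence.
    """
    lam_cum = 0.0
    trace = []
    for lam in log_likelihood_ratios:
        lam_cum = max(lam_cum + lam, 0.0)
        trace.append(lam_cum)
    return trace


def detect(log_likelihood_ratios, gamma):
    """Return the first time step at which the CUSUM statistic exceeds
    the threshold gamma, or None if it never does."""
    for k, lam_cum in enumerate(cusum(log_likelihood_ratios)):
        if lam_cum > gamma:
            return k
    return None
```

For example, `detect([-1.0, 0.5, 0.5, 0.5], 1.2)` rectifies the initial negative evidence to zero and crosses the threshold at step 3.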
The main contribution of this article is a more sophisticated detector model that actively utilizes the redundancy in temporally correlated input. Rather than matching the average firing probability, this detector operates with an estimate of the current firing probability at each time instance. The interdependence of adjacent ISIs requires the use of conditional likelihoods dependent on the spike train history, which poses a computational challenge and raises the question of how conditional firing probabilities could be represented in neural systems. We suggest that a record of afferent spike train history can be kept implicitly in the form of short-term synaptic plasticity and demonstrate that such synaptic plasticity would enable a sensory neuron to robustly track the fluctuations of the presynaptic firing probabilities.
Thus, the incoming evidence can be evaluated in terms of current conditional log-likelihood ratios, resulting in more efficient detection of weak signals.

1.4 Relation to Electrosensory Signal Processing. Our simplified detector model is motivated by the study of electrosensory prey detection in weakly electric fish (Nelson & MacIver, 1999). The model captures the fundamental aspects of the feedforward pathway at the first stage of electrosensory processing, in the electrosensory lateral line lobe (ELL), which receives electrosensory afferent input. Typical electrosensory stimuli induced by small prey are localized and extremely weak perturbations of the fish's self-generated electric field. The resultant minute changes in transdermal potential are sensed by cutaneous electroreceptors and encoded in the activities of primary electrosensory afferent nerve fibers. Due to the small signal amplitude and the variability of afferent firing, the fish must solve a challenging detection task. It is estimated that small prey (such as Daphnia magna), at a typical detection distance of 2 cm from the skin, will cause only about one extra spike above a background of 60 spikes within a 200 ms interval (Ratnam & Nelson, 2000). Our simulations show that detection performance is enhanced through the interplay between ISI correlations and synaptic plasticity in a model neuron. We speculate that a neural correlate of this mechanism could be implemented by short-term plasticity at excitatory afferent synapses onto ELL pyramidal neurons.

1.5 Outline. First, we formulate the prey detection task within the framework of statistical detection theory and establish a link between the key concept of the likelihood ratio and the integrate-and-fire model of a sensory neuron. Next, we develop two alternative sequential likelihood ratio detector models—one for uncorrelated and the other for correlated spike trains.
We then specify the electrosensory signal of a prey-like stimulus in a simplified cylindrical geometry, derive the corresponding models for a putative electrosensory detector neuron, and compare their performance in the presence or absence of negative correlations in the input spike trains. Finally, we discuss the implications of our results with regard to a possible role of short-term synaptic plasticity in the enhanced detection of weak signals encoded in correlated spike trains.

2 A Log-Likelihood Ratio Detector for Binomial Spike Trains

Let a_i[k] be the spike state of the ith input fiber at the current time step k, and a[k] the spike state of an ensemble of n fibers:

a_i[k] ∈ {0, 1};   a[k] ∈ {0, 1}^n.
Assuming independence, firing probabilities for individual afferents can be multiplied, and the logarithm of the likelihood ratio is then given by

log [ P(a[k]; H1) / P(a[k]; H0) ] = log Π_{i=1}^{n} [ P(a_i[k]; H1) / P(a_i[k]; H0) ]
= Σ_{i=1}^{n} [ log P(a_i[k]; H1) − log P(a_i[k]; H0) ].   (2.1)
In electrosensory afferents, the independence assumption is well justified due to the absence of interconnections between afferent ganglion cells (Maler & Berman, 1999). A simple spiking model is the binomial probability encoder, which represents the signal amplitude by a proportional change in firing probability at each time step. The resultant spike train is a renewal process: the ISIs are independent random variables. The likelihood of an individual afferent spike state a_i is

log P(a_i[k]; H1) = { log[r_base Δt + s_i g Δt]       if a_i[k] = 1,
                      log[1 − r_base Δt − s_i g Δt]   if a_i[k] = 0,   (2.2)

where r_base is the baseline firing rate, s_i the signal strength at the ith receptor site, g the gain, and Δt the duration of a time step. If the increment in firing rate caused by the signal, s_i g, is small compared to the baseline firing rate, the logarithm in equation 2.2 can be approximated using a first-order Taylor expansion: log(x + x_0) ≈ log x_0 + x/x_0. Furthermore, one can combine the two rows by weighting the entries with a_i and 1 − a_i, respectively. Hence,

log P(a_i[k]; H1) ≈ a_i[k] [ log(r_base Δt) + (g Δt s_i)/(r_base Δt) ]
                  + (1 − a_i[k]) [ log(1 − r_base Δt) − (g Δt s_i)/(1 − r_base Δt) ].   (2.3)

Accordingly, one can obtain the corresponding probability for the null hypothesis by setting the signal amplitude to zero (s_i = 0):

log P(a_i[k]; H0) ≈ a_i[k] log(r_base Δt) + (1 − a_i[k]) log(1 − r_base Δt).   (2.4)

Subtracting equation 2.4 from 2.3 yields

log P(a_i[k]; H1) − log P(a_i[k]; H0) = a_i[k] (g Δt s_i)/(r_base Δt) − (1 − a_i[k]) (g Δt s_i)/(1 − r_base Δt)

= A ( a_i[k] w_i − b_i ),   (2.5)

with w_i = g Δt s̃_i [ 1/(r_base Δt) + 1/(1 − r_base Δt) ] and b_i = (g Δt s̃_i)/(1 − r_base Δt),
where A is the stimulus amplitude and s̃_i denotes the normalized signal strength at receptor site i, in the sense that A = max{s_i} and s_i = A s̃_i. Therefore, the likelihood ratio of the entire ensemble state at time step k is given by

log [ P(a[k]; H1) / P(a[k]; H0) ] = A ( Σ_{i=1}^{n} a_i[k] w_i − b ),  where b = Σ_{i=1}^{n} b_i.   (2.6)
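To illustrate equations 2.2 to 2.6, the sketch below (our illustration, not the authors' code; the rate, gain, and time-step values are arbitrary) compares the exact per-afferent log-likelihood ratio of the binomial encoder with its linearized form A(a_i w_i − b_i):

```python
import math

def exact_log_lr(a, r_base, g, s, dt):
    """Exact per-afferent log-likelihood ratio for the binomial
    probability encoder of equation 2.2."""
    p1 = r_base * dt + s * g * dt   # firing probability under H1
    p0 = r_base * dt                # firing probability under H0
    return (math.log(p1) - math.log(p0) if a == 1
            else math.log(1.0 - p1) - math.log(1.0 - p0))

def linearized_log_lr(a, r_base, g, s_tilde, amp, dt):
    """First-order approximation A*(a*w - b) from equations 2.5 and 2.6,
    with s = amp * s_tilde."""
    w = g * dt * s_tilde * (1.0 / (r_base * dt) + 1.0 / (1.0 - r_base * dt))
    b = g * dt * s_tilde / (1.0 - r_base * dt)
    return amp * (a * w - b)
```

With, for example, r_base = 300 Hz, Δt = 1 ms, g = 1, and a signal increment of 1 Hz, the linearized and exact values agree to within a few parts in a thousand, consistent with the Taylor argument in the text.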
The resultant detector constitutes a linear filter with a fixed spatial receptive field, since the afferent inputs a_i are weighted according to the expected relative signal strength s̃_i at the associated receptor sites. Note that linearity in a_i[k] is not caused by the linearization in signal strength. The Taylor approximation in s_i merely makes it possible to factor out the signal amplitude. Hence, the detector responds preferentially to a stimulus with particular spatial characteristics, and the absolute stimulus intensity determines the detection delay. Therefore, in all of our simulations, we focus on the spatial aspects of detection and restrict our analysis to stimuli with constant spatial characteristics and instantaneous onset. Adapting the detector to more complex stimuli with varying intensity and spatial extent would require a varying threshold and dynamic weights matched to the expected time course of the signal intensity at the receptor locations.

3 Spike Trains with Negative ISI Correlations

Correlations among neighboring ISIs can influence the spike count statistics on timescales well beyond that of the mean ISI. For instance, spike trains with negative ISI correlations can exhibit a lower spike count variability than a surrogate renewal process with identical ISI distribution or a Poisson process. Such long-term regularization has been observed in the auditory nerve (Lowen & Teich, 1992) and is particularly prominent in electrosensory afferents, where the Fano factor (variance-to-mean ratio) of the spike count can be reduced by more than an order of magnitude at behaviorally relevant timescales of 100 to 200 ms (Ratnam & Nelson, 2000). It has been suggested that such regularization enhances the encoding and detectability of weak signals (Ratnam & Nelson, 2000; Chacron et al., 2001; Brandman & Nelson, 2002; Goense & Ratnam, 2003).
Therefore, it is expected that a naive detector based on equation 2.6 would passively benefit from the higher signal-to-noise ratio of regularized input spike trains. However, there is a potential additional benefit: ISI correlations imply a certain degree of
predictability (Racicot & Longtin, 1997), since individual spike events are statistically dependent on previous ISIs.

3.1 Toward a Correlation-Sensitive Detector. The nonrenewal property of the input spike trains should be accounted for in the calculation of the cumulative likelihood ratio. Since the firing probability is determined by previous spike events, one would have to consider conditional probabilities of spike states dependent on the spike train history. In the simplest case, the current firing probability depends on only the previous ISI, and the likelihood ratio would be of the form

log [ P(a[k] | I[k − 1]; H1) / P(a[k] | I[k − 1]; H0) ],

where the conditional probabilities could be obtained from joint ISI histograms. However, since experimentally observed Markov orders of electrosensory afferent ISIs are at least five (Ratnam & Nelson, 2000), one is faced with a dilemma: while it may be technically feasible to obtain estimates of conditional firing probabilities using higher-order joint histograms from baseline spike trains of sufficient length, such an approach would clearly be biologically implausible and thus provide no further insight into a possible physiological mechanism. Therefore, we propose a solution based on the nonlinear dynamics of the spike-generating process.

3.2 A Nonlinear Adaptive Threshold Model. We introduce a generalization of the time-discrete afferent model by Brandman and Nelson (2002). As in most other models of electrosensory afferents, the firing threshold is raised following an action potential and decreases as long as no spike is generated. This type of threshold adaptation leads to negative ISI correlations, since the threshold level at each time step depends on its previous value, thus creating a memory of the spike train history. The equations of the Brandman-Nelson model are:

v[k] = c s[k] + W[k]   (3.1)
a[k] = 1 if v[k] > θ[k], 0 otherwise   (3.2)
θ[k + 1] = θ[k] − β/α + a[k] β,   (3.3)

where v is the membrane potential, s the signal amplitude, W ∼ N(0, σ) the intrinsic gaussian white noise component, θ the firing threshold, and c the gain. Beginning with an arbitrary initial value θ[0], the threshold decays by a fixed amount β/α during a time step and, if an action potential has been generated, is raised by the amount β. For more biological realism, the
[Figure 2 graphic: normalized threshold increment as a function of threshold level for η = 0, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, and 64.0.]
Figure 2: The sigmoid function governing the threshold increment for several different values of η. The increment is normalized by the factor 2β. For extremely large η, the sigmoid approaches a step function; for η = 0, it reduces to a constant.
linear decay could be replaced with an exponential one (Chacron, Longtin, St. Hilaire, & Maler, 2000). Since β relates an action potential at step k to the firing threshold at step k + 1, β controls the degree of correlation between subsequent ISIs. In our modified version of the above model, we introduce a new refractory term in equation 3.3. Instead of a constant boost following an action potential, the threshold is now raised by a variable amount dependent on its current value θ[k]:

θ[k + 1] = θ[k] − β/α + a[k] β g(θ[k]),  where  g(θ) = 2e^{−ηθ} / (1 + e^{−ηθ}).   (3.4)

If the threshold is low, it can be boosted by a maximum amount determined by the parameter β. If the threshold has reached a high level, the boost is smaller than β; at a very high threshold level, virtually no further increase is possible and the decay exceeds the boost. The increase is governed by the sigmoid function g(θ; η). The saturation parameter η controls the degree to which the increase depends on threshold level (see Figure 2), thus also affecting ISI correlations. In the special case η = 0, the sigmoid in equation 3.4 reduces to a constant, g(θ) ≡ 1, resulting in the same constant boost as in the linear adaptive threshold model.
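A simulation sketch of the model (our code, not the authors'; the parameter values α, β, η, σ are illustrative, not taken from the paper) reproduces the qualitative signature discussed above, negative lag-1 ISI correlations:

```python
import numpy as np

def simulate_afferent(n_steps, alpha=4.0, beta=0.8, eta=1.0, sigma=0.5,
                      theta0=0.0, seed=0):
    """Simulate the nonlinear adaptive threshold model (equations 3.1-3.4)
    under baseline conditions (signal s = 0). Returns the binary spike train."""
    rng = np.random.default_rng(seed)
    theta = theta0
    spikes = np.zeros(n_steps, dtype=int)
    for k in range(n_steps):
        v = rng.normal(0.0, sigma)                 # v[k] = c*s[k] + W[k], s = 0
        spikes[k] = 1 if v > theta else 0          # equation 3.2
        g = 2.0 * np.exp(-eta * theta) / (1.0 + np.exp(-eta * theta))
        theta = theta - beta / alpha + spikes[k] * beta * g   # equation 3.4
    return spikes

def serial_correlation(spikes, lag=1):
    """Serial correlation coefficient of the interspike intervals at a given lag."""
    isi = np.diff(np.flatnonzero(spikes))
    return np.corrcoef(isi[:-lag], isi[lag:])[0, 1]
```

With these (hypothetical) parameters, short ISIs leave the threshold elevated, so the following ISI tends to be longer: the lag-1 serial correlation coefficient of the simulated ISIs comes out negative, as in Figure 3F.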
Level-dependent threshold boosting is also a feature of the (time-continuous) model of Chacron, Pakdaman, and Longtin (2003). The difference is that instead of a linearly increasing function, in our model the threshold boost is governed by a nonlinear, monotonically decreasing function of threshold level. Our assumption is biologically plausible, since physiological firing thresholds cannot increase arbitrarily and must eventually saturate. Though not biophysically detailed, our model exhibits spike train statistics that closely resemble those of actual afferents. Figure 3 compares the ISI statistics of the model to those of an extracellular afferent recording in the absence of electrosensory stimulation due to external objects. We use the term baseline rather than spontaneous activity since the electroreceptors are always driven by the continuously oscillating field generated by the fish's electric organ, which remains intact under anesthesia and immobilization. In appendix A, we provide a detailed analysis of the influence of the parameter η on the dynamics of threshold sequences. The properties of the iterative map defined by equation 3.4 play a crucial role in deriving an advanced likelihood ratio detector.

3.3 The Firing Probability. In the above model, the firing probability at time step k is given by the tail probability of the potential v, which is obtained by integrating the probability density function p(v):

P[k] = P(v[k] > θ[k]) = 1 − ∫_{−∞}^{θ[k]} p(v) dv.

We will assume the noise to be gaussian with zero mean and variance σ². Since the potential v is the sum of the signal and the noise component, the pdf p(v) is also gaussian with the same variance but with mean µ = c s. Equivalently, one can integrate a gaussian of zero mean and subtract the signal contribution from the threshold,

P[k] = 1/2 − (1/(√(2π) σ)) ∫_0^{θ[k] − c s[k]} e^{−v²/(2σ²)} dv,

where the gaussian integral

(1/(√(2π) σ)) ∫_{−∞}^{0} e^{−v²/(2σ²)} dv = 1/2

has already been subtracted.
[Figure 3 graphic: ISI distributions (A: data, B: model), joint ISI histograms (C: data, D: model), and serial correlation coefficients (E: data, F: model); ISI and lag axes in EOD cycles.]
Figure 3: Comparison of the baseline interspike interval statistics of spike trains obtained from a tonic electrosensory afferent (A, C, E) and the nonlinear adaptive threshold model (B, D, F). Model parameters were automatically optimized by maximum likelihood estimation (see section 3.7.1). The timescale is measured in units of the electric organ discharge (EOD) cycle, which is roughly 1 ms. The afferent spike train was recorded extracellularly from a weakly electric fish (Apteronotus leptorhynchus; data by Rama Ratnam).
After applying the transformation ṽ = v/(√2 σ), the firing probability can be expressed in terms of the complementary error function (erfc):

P[k] = 1/2 − (1/√π) ∫_0^{(θ[k] − c s[k])/(√2 σ)} e^{−ṽ²} dṽ = (1/2) erfc[ (θ[k] − c s[k])/(√2 σ) ].   (3.5)
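Equation 3.5 can be cross-checked numerically. In the sketch below (our illustration; the values of c, σ, and the test arguments are arbitrary), the closed-form tail probability is compared against a Monte Carlo estimate of P(v > θ):

```python
import math
import random

def firing_probability(theta, signal, c=1.0, sigma=0.5):
    """Tail probability P(v > theta) from equation 3.5,
    with v = c*signal + gaussian noise of standard deviation sigma."""
    return 0.5 * math.erfc((theta - c * signal) / (math.sqrt(2.0) * sigma))

def monte_carlo_probability(theta, signal, c=1.0, sigma=0.5,
                            n=200_000, seed=1):
    """Monte Carlo estimate of the same tail probability, for cross-checking."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if c * signal + rng.gauss(0.0, sigma) > theta)
    return hits / n
```

At threshold equal to the mean potential the closed form gives exactly 1/2, and for generic arguments the two estimates agree to within Monte Carlo sampling error.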
Figure 4: (A) Spike train generated by our nonlinear adaptive threshold model and (B) the corresponding firing probability. The time course of the firing probability resembles that of the postsynaptic current at a depressing synapse. Note that the very short timescale is not an inherent feature of the model; parameters were chosen to match the high baseline firing rates of electrosensory afferents.
Figure 4 shows a model spike train and the (discretized) time course of the corresponding firing probability, which bears a striking resemblance to the postsynaptic current at a depressing synapse, except at a much shorter timescale. Typical cortical time constants of short-term synaptic depression range in the hundreds of milliseconds (Zucker & Regehr, 2002). However, the timescale of the model is flexible and can be controlled via the parameter α in equation 3.4, which determines the mean ISI and hence the baseline firing rate. In Figure 4, the baseline firing rate is set to approximately 300 Hz in order to match typical baseline firing rates of electrosensory afferents. Choosing a lower baseline firing rate would increase the timescale of the course of the firing probability accordingly. Therefore, we hypothesize that a form of short-term synaptic plasticity could enable a postsynaptic neuron to track the varying firing probabilities associated with its afferent input.
3.4 The Log-Likelihood Ratio for Correlated Spike Trains. In a similar manner as in section 2, one can derive an expression for the logarithm of the likelihood ratio for correlated model afferent spike trains. However, there is no longer a constant baseline firing rate, since the threshold is a dynamic variable influenced by previous spike events. Using equation 3.5, one can calculate the likelihood of individual afferent spike states a_i[k] under the two alternative hypotheses (denoted by Hx). Instead of constant firing probabilities P(a_i[k]; Hx), the likelihood ratio consists of conditional probabilities P(a_i[k] | θ_i[k]; Hx), and the threshold value θ_i[k] contains a record of the entire spike train history up to time k. Thus, the logarithm of the likelihood ratio has the following form:

Λ(a[k]) = Σ_{i=1}^{n} [ log P(a_i[k] | θ_i[k]; H1) − log P(a_i[k] | θ_i[k]; H0) ].   (3.6)

The conditional probabilities P(a_i[k] | θ_i[k]; Hx) replace more complicated higher-order Markov models with transition probabilities of the type P(a_i[k] | a_i[k − 1], a_i[k − 2], . . . , a_i[k − m]; Hx), where the spike train history is explicitly taken into account. In the subsequent sections, we describe a mechanism that enables the detector to track the changing thresholds θ_i and thus the firing probabilities. By iteratively updating the estimate of the current threshold values, θ_i[k], arbitrary Markov orders can be taken into account implicitly, using the same formalism without increasing the model complexity. The entire spike train history can thus be absorbed into one variable. Using equation 3.5, one can calculate the likelihood of the state of the ith afferent at time k under the signal hypothesis:

log P(a_i[k] | θ_i[k]; H1) = { log[ (1/2) erfc((θ_i[k] − c s_i[k])/(√2 σ_i)) ]        if a_i[k] = 1,
                              log[ 1 − (1/2) erfc((θ_i[k] − c s_i[k])/(√2 σ_i)) ]    if a_i[k] = 0.   (3.7)
Equation 3.7 can also be written as

log P(a_i[k] | θ_i[k]; H1) = a_i[k] log[ (1/2) erfc((θ_i[k] − c s_i[k])/(√2 σ_i)) ]
                           + (1 − a_i[k]) log[ 1 − (1/2) erfc((θ_i[k] − c s_i[k])/(√2 σ_i)) ].   (3.8)
Again, one can linearize the expressions, since the signal s_i[k] introduces only a small perturbation: c s_i[k] ≪ θ_i[k]. Hence,

log[ (1/2) erfc((θ_i[k] − c s_i[k])/(√2 σ_i)) ]
≈ log(1/2) + log erfc(θ_i[k]/(√2 σ_i)) − [ erfc′(θ_i[k]/(√2 σ_i)) / erfc(θ_i[k]/(√2 σ_i)) ] · (c s_i[k]/(√2 σ_i))
= log(1/2) + log erfc(θ_i[k]/(√2 σ_i)) + (√2 c / (√π σ_i)) · [ exp(−θ_i[k]²/(2σ_i²)) / erfc(θ_i[k]/(√2 σ_i)) ] · s_i[k],   (3.9)
where the prime in the first row denotes the derivative of the complementary error function, which is defined as

erfc′(x) = d/dx [ 1 − (2/√π) ∫_0^x e^{−x̃²} dx̃ ] = −(2/√π) e^{−x²}.
Analogously, one obtains a linearization of the second term in equation 3.8:

log[ 1 − (1/2) erfc((θ_i[k] − c s_i[k])/(√2 σ_i)) ]
≈ log(1/2) + log[ 1 + erf(θ_i[k]/(√2 σ_i)) ] − (√2 c / (√π σ_i)) · [ exp(−θ_i[k]²/(2σ_i²)) / (1 + erf(θ_i[k]/(√2 σ_i))) ] · s_i[k].   (3.10)
For convenience, the regular error function erf is employed using the definition erfc = 1 − erf. Inserting 3.9 and 3.10 into equation 3.8 yields the afferent log likelihood under the signal hypothesis. From this result, the log likelihood under the null hypothesis is obtained by setting the signal intensity to zero (s_i[k] = 0). Given these log likelihoods, one obtains the logarithm of the likelihood ratio of the spike state vector of the afferent ensemble, Λ(a[k]):

Λ(a[k]) = A √(2/π) Σ_{i=1}^{n} ( a_i[k] w_i[k] − b_i[k] ),   (3.11)

with dynamic weights and biases

w_i[k] = (c_i s̃_i[k]/σ_i) exp(−θ_i[k]²/(2σ_i²)) [ 1/(1 − erf(θ_i[k]/(√2 σ_i))) + 1/(1 + erf(θ_i[k]/(√2 σ_i))) ],
b_i[k] = (c_i s̃_i[k]/σ_i) exp(−θ_i[k]²/(2σ_i²)) / (1 + erf(θ_i[k]/(√2 σ_i))),
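As a numerical sanity check of the linearization behind equation 3.11 (our sketch; θ, σ, and the signal amplitude are illustrative), one can compare A√(2/π)(a w − b), built from the dynamic weight and bias, with the exact conditional log-likelihood ratio obtained from equation 3.7 for a single afferent:

```python
import math

SQRT2 = math.sqrt(2.0)

def exact_log_lr(a, theta, s, c=1.0, sigma=0.5):
    """Exact conditional log-likelihood ratio
    log P(a|theta; H1) - log P(a|theta; H0), from equation 3.7."""
    p1 = 0.5 * math.erfc((theta - c * s) / (SQRT2 * sigma))
    p0 = 0.5 * math.erfc(theta / (SQRT2 * sigma))
    return (math.log(p1) - math.log(p0) if a == 1
            else math.log(1.0 - p1) - math.log(1.0 - p0))

def dynamic_weight_and_bias(theta, s_tilde=1.0, c=1.0, sigma=0.5):
    """Dynamic synaptic weight w and bias b from equation 3.11."""
    x = theta / (SQRT2 * sigma)
    e = math.exp(-theta**2 / (2.0 * sigma**2))
    w = (c * s_tilde / sigma) * e * (1.0 / (1.0 - math.erf(x))
                                     + 1.0 / (1.0 + math.erf(x)))
    b = (c * s_tilde / sigma) * e / (1.0 + math.erf(x))
    return w, b

def linearized_log_lr(a, theta, amp, **kw):
    """A * sqrt(2/pi) * (a*w - b), the linearized form of equation 3.11."""
    w, b = dynamic_weight_and_bias(theta, **kw)
    return amp * math.sqrt(2.0 / math.pi) * (a * w - b)
```

For a weak signal (here amplitude 0.01 with the other parameters of order one), the linearized and exact values agree closely, for both a spike and a silent step.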
Figure 5: Plot of the function e^{−x²}/[1 − erf(x)] (circles) and its counterpart e^{−x²}/[1 + erf(x)] (squares), which appear in the dynamic synaptic weights and biases of the correlation-sensitive likelihood ratio in equation 3.11. Both functions are monotonic and well approximated by cubic polynomials (solid lines).
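The claim that these functions are well approximated by cubic polynomials can be checked with a least-squares fit (our sketch; the interval [−2, 2] matches the figure, and the asserted error bound is our own, not from the paper):

```python
import numpy as np
from math import erf

# Cubic least-squares fit of f(x) = exp(-x^2) / (1 - erf(x)) on [-2, 2],
# one of the two nonlinearities entering the dynamic weights and biases.
x = np.linspace(-2.0, 2.0, 401)
f = np.exp(-x**2) / (1.0 - np.array([erf(v) for v in x]))

coeffs = np.polyfit(x, f, 3)          # cubic fit
approx = np.polyval(coeffs, x)
max_err = float(np.max(np.abs(f - approx)))  # worst-case deviation on the grid
```

On this interval f ranges from near zero up to roughly four, while the cubic tracks it to a small fraction of that range, supporting the biological-plausibility argument in the text.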
where A denotes the signal amplitude and s̃_i[k] is the normalized signal at receptor site i. The amplitude can thus be absorbed into the detector threshold γ. When integrating over time, the mean detection delay will be inversely proportional to the amplitude if the signal amplitude is constant. Note that the synaptic weights and the bias terms are time dependent. Since they are functions of the firing threshold, they implicitly depend on the spike train history. Although the analytical expressions for the weights and biases look complicated, they are smooth monotonic functions that can be well approximated by much simpler functions, such as the cubic polynomials shown in Figure 5. Thus, from the viewpoint of biological plausibility, the detector is not as computationally demanding as it may first appear.

3.5 Threshold Prediction. In order to construct a likelihood ratio detector based on equation 3.11, knowledge of the spike threshold θ_i[k] is required at each time step. At first glance, this approach may seem infeasible since such information is not readily available. However, as we demonstrate in this section, it is possible to track the fluctuations of the firing threshold by feeding a model afferent spike train into a dynamical system that mimics the changes of the threshold in the spike-generating mechanism. We introduce a predictor variable, θ̂, that is incremented or decremented, depending on whether a spike was received at the previous step, in the same
manner as in equation 3.4:

θ̂[k + 1] = θ̂[k] − β̃/α̃ + a[k] β̃ g̃(θ̂[k]).   (3.12)
Indices have been dropped for simplicity. The predictor variable is initialized with a random value, θ̂[0], drawn from a gaussian distribution (though the choice of distribution is not critical). Equation 3.12 defines two alternative maps that describe how the subsequent threshold value is computed from its current value. The received spike train controls the choice of map. In effect, θ̂ is a variable undergoing change in a dynamical system that randomly alternates between two deterministic components. If the parameters α̃, β̃, and η̃ of the predictor system are identical to those of the spike generator and if this dynamical system has a stable orbit, the predictor sequence (θ̂[k]) will converge toward the sequence of the thresholds (θ[k]) of the spike generator, regardless of the initial value θ̂[0]. Using equation 3.5, θ̂[k] can be transformed into the corresponding estimate of the firing probability, Ppred[k]. Figure 6A shows an example of such convergence of the predicted firing probability (dotted line) toward the actual firing probability used to generate the spike train (solid line). From approximately 20 time steps onward, the predictor sequence is tracking the actual firing probability accurately. The semilogarithmic plot of their difference, shown in Figure 6B, reveals an exponential convergence. It is this property that enables the predictor to track the actual firing threshold without knowledge of its initial value θ[0]. Sensitivity to the initial condition would destroy this property so that the sequences could never synchronize. Although the convergence in Figures 6A and 6B is demonstrated only for pure baseline activity, the predictor mechanism is not affected by the presence of a stimulus. A signal alters the membrane potential v in an additive fashion, thus changing the firing probability. However, the threshold predictor system does not require knowledge of the amount of change in signal amplitude s, since s does not appear explicitly in the transformation, equation 3.12. The predictor system implicitly receives information about the change of firing probability through the incoming spike train, thereby maintaining its ability to select the correct map for transforming the predictor variable at any time instance.
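The predictor scheme can be sketched as follows (our code; all parameter values are illustrative, and `theta_hat` names the predictor variable). Generator and predictor iterate the same map, with the predictor driven only by the generator's spike train; the prediction error shrinks whenever a spike triggers the contracting branch of the map:

```python
import numpy as np

def g(theta, eta):
    """Saturating boost function of equation 3.4."""
    return 2.0 * np.exp(-eta * theta) / (1.0 + np.exp(-eta * theta))

def step(theta, spike, alpha, beta, eta):
    """One iteration of the threshold map (equations 3.4 and 3.12)."""
    return theta - beta / alpha + spike * beta * g(theta, eta)

# Generator and predictor share parameters; the predictor sees only spikes.
alpha, beta, eta, sigma = 4.0, 0.8, 1.0, 0.5
rng = np.random.default_rng(2)
theta = 0.0          # generator threshold, theta[0]
theta_hat = 5.0      # predictor, deliberately wrong initial value
errors = []
for k in range(2000):
    spike = 1 if rng.normal(0.0, sigma) > theta else 0   # baseline: s = 0
    theta = step(theta, spike, alpha, beta, eta)
    theta_hat = step(theta_hat, spike, alpha, beta, eta)  # driven by same spikes
    errors.append(abs(theta_hat - theta))
```

Despite the large initial mismatch, the absolute prediction error decays toward zero over the run, mirroring the exponential convergence shown in Figure 6B.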
However, the threshold predictor system does not require knowledge of the amount of change in signal amplitude s, since s does not appear explicitly in the transformation, equation 3.12. The predictor system implicitly receives information about the change of firing probability through the incoming spike train, thereby maintaining its ability to select the correct map to transform at any time instance. 3.6 Synaptic Plasticity. In order to illustrate the synaptic plasticity in our model detector neuron, we investigate the behavior of an individual input weight wi [k] in equation 3.11 and the corresponding bias b i [k] under stimulation with a test spike train, a sequence of bursts. This simulation resembles a typical neurophysiological test for short-term synaptic plasticity. Figure 7 demonstrates the plasticity at an individual synapse under tetanic stimulation. Note that synaptic efficacy rather than a postsynaptic
Synaptic Plasticity Can Enhance Signal Detectability
Figure 6: (A) Within about 20 time steps, the sequence of predicted firing probabilities (dashed line) has closely approached that of the spike-generating process (solid line). (B) The logarithm of the prediction error as a function of time reveals exponential convergence. The dashed line is a linear fit to the simulation data. The value of its slope closely approximates the Lyapunov exponent of the system (see appendix A) for the given set of parameters. A negative slope thus corresponds to a negative Lyapunov exponent, indicating convergence toward a stable orbit of the dynamical system.
current is plotted. Hence, this is a genuine nonlinear plasticity effect rather than a consequence of linear summation of overlapping excitatory postsynaptic potentials. The synaptic weight wi exhibits facilitation (see Figure 7B), whereas the bias b i shows rapid depression (see Figure 7C). The net contribution, the log-likelihood ratio for this afferent, is a combination of the two (see Figure 7D). This example reveals how the detector interprets ISI sequences. Compared to the usual baseline activity with short ISIs followed by longer ones and vice versa, a tetanic burst is an unusual cluster of short ISIs, suggesting the presence of a stimulus. Hence, the likelihood ratio is increasingly
N. Lüdtke and M. Nelson
Figure 7: Behavior of one model synaptic weight and bias term under stimulation with tetanic bursts (A). The synaptic weight exhibits facilitation (B), whereas the bias shows rapid depression (C). The spike-controlled combination of the two is the log-likelihood ratio, the evidence contribution of the individual afferent (D). A positive value provides evidence for the presence of a stimulus, and a negative value suggests the opposite.
positive. In between bursts, there is an unusually long ISI, which is more likely to occur in the absence of a stimulus. Therefore, the likelihood ratio decreases into the negative range. 3.7 Tracking the Firing Rate of Biological Spike Trains. Given the close agreement in the ISI statistics of data and model in Figure 3, one might wonder whether it is possible to obtain convergence with natural spike trains. To investigate this, we fed our predictor system, equation 3.12, with baseline spike trains obtained by in vivo recording from electrosensory afferent fibers of a weakly electric fish (Apteronotus leptorhynchus). In this case, the only available information from the afferent neuron is the spike train. Since the recording was extracellular, the internal fluctuations of the threshold that determines the firing probability could not be observed. Therefore, without the “ground truth” sequence (θ [k]), one cannot directly verify convergence of firing probabilities as in Figure 6A. To test whether a dynamical system, such as the one defined by equation 3.12, accurately tracks the firing probability of the received spike train, one can compare the predicted firing probability, denoted by Ppred , with the empirical spike probability of the received spike train, P(a = 1| Ppred ), for every value of Ppred . Since Ppred can assume any value from zero to one
(within the numerical precision), we divide the interval [0, 1] into N narrow bins of width ΔPbin = 1/N, into which the sequence Ppred[k] is sorted. For each bin, the empirical firing probability is approximated by a normalized spike count obtained through the following procedure:
- Let Pbin be the center of the considered bin.
- Define the set K of all time steps k at which Ppred is within the bin range: K = {k | Ppred[k] ∈ [Pbin − ΔPbin/2, Pbin + ΔPbin/2)}.
- Let nbin = |K| be the number of instances for which Ppred[k] is within the bin.
- Count the spikes that occurred at the time steps recorded in K: nspikes = Σ_{k∈K} a[k].
- The empirical firing probability given the center value of the bin is then P(a = 1 | Pbin) = nspikes / nbin.
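The binning procedure can be sketched as follows; the function name and the synthetic test data are illustrative, not taken from the original implementation:

```python
import numpy as np

def empirical_firing_probability(p_pred, spikes, n_bins=20):
    """Sort Ppred[k] into bins; return (bin center, spike count / |K|) pairs."""
    p_pred = np.asarray(p_pred, dtype=float)
    spikes = np.asarray(spikes)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p_pred, edges) - 1, 0, n_bins - 1)
    centers, empirical = [], []
    for b in range(n_bins):
        in_bin = idx == b                 # the index set K for this bin
        n_bin = int(in_bin.sum())         # n_bin = |K|
        if n_bin == 0:
            continue                      # skip empty bins
        n_spikes = int(spikes[in_bin].sum())
        centers.append(0.5 * (edges[b] + edges[b + 1]))
        empirical.append(n_spikes / n_bin)
    return np.array(centers), np.array(empirical)
```

For a well-calibrated predictor, the returned pairs lie near the diagonal, as in Figure 8.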
If the spike train is sufficiently long, the bin width small, and the predictor correct, Pbin ≈ P(a = 1 | Pbin) for each bin. In other words, on average the midvalue of a bin matches the normalized actual spike count obtained at all instances k when Ppred[k] ≈ Pbin. Equality would be reached in the limit ΔPbin → 0 and with an infinitely long spike train. 3.7.1 Parameter Estimation. Accurate tracking of the firing probability is possible only if the parameters of the predictor system match those of the spike generator. In order to fit our predictor system to the afferent spike train, we employed a gradient-ascent algorithm to determine the parameter set α̃, β̃, η̃, and σ̃ for which the recorded spike train becomes most likely. The total spike train log likelihood is given by the sum of the conditional log likelihoods of individual spike states:
L(a) = Σ_{k=1}^{m} [ a[k] log P(a[k] = 1 | Θ[k]) + (1 − a[k]) log P(a[k] = 0 | Θ[k]) ],

where m is the length of the spike train (in time steps). The firing probability, P(a[k] = 1 | Θ[k]), is calculated according to equation 3.5, and the
Figure 8: Scatter plot of predicted versus empirical firing probability of an afferent spike train (baseline activity) obtained from a weakly electric fish (Apteronotus leptorhynchus). The duration of the spike train was 275.2 s, and there were 95,027 spikes at a mean firing rate of 345 Hz (f_EOD = 970 Hz). System parameters were obtained by maximizing the spike train likelihood using a gradient-ascent algorithm (see appendix C for details). The parameter values were α = 2.6905, β = 0.5062, η = 1.54, and σ = 0.199. This fit yielded the parameter set used for the comparison in Figure 3.
probability of the complementary event a[k] = 0 is simply P(a[k] = 0 | Θ[k]) = 1 − P(a[k] = 1 | Θ[k]). For the gradient ascent, one has to calculate the partial derivative of L with respect to each parameter in order to obtain the increments of the update equations (details are provided in appendix C). Figure 8 shows a scatter plot of the empirical firing probability versus its predicted value. Ideally, if the predictions were exact, all points would lie on the diagonal. However, due to the finite bin size and length of the spike train, this is never the case. Interestingly, the tracking of the firing rate works quite well provided that the parameters are chosen appropriately. There are only minor systematic errors in the data fit. Such good adaptation of the model system to a natural spike train is somewhat surprising, considering that the predictor system is not a biophysically detailed model and has but four free parameters. Additional confirmation that the obtained parameter set is indeed meaningful is provided by the closely matching ISI statistics of model and data, as shown in Figure 3. Note that the parameter optimization was not designed to fit the model's ISI histograms or serial correlation coefficient to the afferent spike train.
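The gradient-ascent objective, the total spike-train log likelihood, can be written compactly. The sketch below takes precomputed firing probabilities as input (equation 3.5 itself is not reproduced in this text, so the probabilities are treated as given):

```python
import numpy as np

def spike_train_log_likelihood(spikes, p_fire, eps=1e-12):
    """Sum of conditional log likelihoods of individual spike states."""
    a = np.asarray(spikes, dtype=float)
    # clip probabilities away from 0 and 1 to avoid log(0)
    p = np.clip(np.asarray(p_fire, dtype=float), eps, 1.0 - eps)
    return float(np.sum(a * np.log(p) + (1.0 - a) * np.log(1.0 - p)))
```

Maximizing this quantity over the predictor parameters, for example with the partial derivatives given in appendix C, yields the gradient-ascent updates.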
3.8 Synopsis and Biological Plausibility. In summary, the correlation-sensitive detection algorithm involves the following steps:
1. Establishing a match between the synaptic dynamics of short-term plasticity (see equation 3.13) and the adaptation dynamics of the presynaptic spike-generating mechanism (see equation 3.4).
2. An initialization of a synaptic state variable and its subsequent temporal evolution (see equation 3.13).
3. A smooth, nonlinear transformation between the time-varying state variable and the time-varying synaptic weight w and bias b (see equation 3.12 and Figure 5).
4. A summation of time-varying bias terms (the bi terms in equation 3.12).
5. A summation of weighted spike inputs (the wi terms in equation 3.12).
6. Temporal integration of the weighted sum and comparison with a threshold level (see equation 1.2).
Steps 1 to 3 apply to each individual synapse, and steps 4 to 6 apply to the sum over all synapses that occurs at the postsynaptic detector neuron. The biological plausibility of carrying out log-likelihood-type computations using the integrate-and-fire dynamics associated with steps 5 and 6 has been established previously (Gold & Shadlen, 2001). Here we discuss the plausibility of the initial steps, 1 to 4, which are associated with the proposed short-term plasticity mechanism. Step 1: To the extent that the afferent population has homogeneous adaptation dynamics, a match between the synaptic dynamics of short-term plasticity and the adaptation dynamics of the input spike trains could be hard-wired into the system through an evolutionary process of variation and natural selection. In this scenario, the adaptation dynamics giving rise to ISI correlations in the input spike train would likely evolve first, because even a detector with static weights can benefit passively from the reduced spike count variability (Ratnam & Nelson, 2000; Chacron et al., 2001; Goense & Ratnam, 2003).
Further improvements in detection performance, and hence a selective advantage, would be afforded to individuals with genetically specified synaptic dynamics that were more closely matched to those of the spike-generating mechanism. An afferent population that exhibited heterogeneity in adaptation dynamics would presumably require some sort of online or developmental tuning of individual synapses in order to take advantage of the proposed mechanism. Step 2: As discussed in section 3.5 and appendices A and B, the dynamical system associated with state prediction (see equation 3.13) robustly converges toward a trajectory that yields accurate estimates of firing probability, independent of state initialization or slight variations in parameter
values. The robust convergence properties should carry over into any biological implementation with similar underlying dynamics. Step 3: The transformations between state estimates and synaptic weights w and biases b are analytically complex (see equation 3.12), but they are smoothly varying, weakly nonlinear functions, as illustrated in Figure 5. Such nonlinear relationships could be established by a variety of biological mechanisms associated with calcium signaling and transmitter release at the synapse. Step 4: In addition to a synaptic weight wi, the application of statistical detection theory predicts an associated bias bi for each synapse. When the input is a renewal spike train, these bias terms are constant (see equation 2.6) and can be absorbed into a redefinition of the threshold γ associated with the detection process (see equation 1.1). When the spike trains have ISI correlations, the individual synaptic biases bi become time-dependent (see equation 3.12). If the detector neuron receives a large number of independent afferent inputs, n, the sum of the biases, b(t) = Σ_{i=1}^{n} bi(t), will have a variance that decreases linearly with n, according to the central limit theorem. Thus, when the degree of afferent convergence is large, the bias term is approximately constant and can once again be absorbed into a redefinition of the detection threshold γ, as was the case for renewal process inputs. Thus, we see that all the elements of the proposed model are plausible under certain biologically relevant circumstances (e.g., multiple converging afferents with homogeneous adaptation dynamics). The model could also be applicable to more challenging circumstances (e.g., a small number of converging afferents with heterogeneous dynamics), but would require the specification of additional mechanisms for tuning the dynamics and accommodating time-varying threshold levels in the detection process. 3.9 The Electrosensory Image Model.
We model a patch of fish skin as a segment of a cylindrical surface. Let n be the number of receptors contained in the skin patch and i be an index referencing individual receptors (afferent fibers). Let (r0, φi, xi) be the position of the ith receptor organ in cylindrical coordinates and (r, φ, x) be the (unknown) target position. The signal intensity at the ith receptor is modeled as a two-dimensional gaussian:

si = A exp( −(xi − x)²/2σs² − r0²(φi − φ)²/2σs² ),   (3.13)

where the first term in the exponent corresponds to the main axis and the second to the polar angle,
where A is the signal amplitude, r0 the radius of curvature of the patch, and σs characterizes the width of the electrosensory image. Both the amplitude A and the stimulus width σs depend on the target distance r . Within a distance of a few centimeters, the amplitude follows a power law
(Rasnow, 1996):

A(r) = ke r^{−α}.   (3.14)
The factor ke incorporates the fish's electric field strength, as well as the conductivity of the target object. For the exponent of the power law, a value of α ≈ 4 has been observed in Apteronotus albifrons (Chen, House, Krahe, & Nelson, 2005). The steep power law limits the effective range for prey detection to a few centimeters (MacIver, Sharabash, & Nelson, 2001). For small targets, the width σs is approximately proportional to the distance of the target from the skin (Rasnow, 1996): σs(r) ∝ r − r0. The constant of proportionality is approximately unity, so we set

σs(r) = r − r0.   (3.15)
Inserting equations 3.14 and 3.15 into 3.13 yields the intensity of the electrosensory stimulus at any receptor position (r0 , φi , xi ) as a function of target coordinates (r, φ, x):
si = ke r^{−α} exp( [−r0²(φi − φ)² − (xi − x)²] / 2(r − r0)² ).   (3.16)
3.10 Performance Comparison. In our simulation of electrosensory signal detection, we have restricted our analysis to a proof of principle using only stimuli with instantaneous onset and constant intensity. Optimal detection of time-varying signals would require a temporal receptive field matched to the expected time course of typical stimuli. In the electrosensory system of weakly electric fish, there is evidence that such expectations are relayed via feedback from higher brain areas (Maler & Berman, 1999; Bastian, 1999; Lewis & Maler, 2002). Such top-down information could be included in the likelihood ratio framework but is beyond the scope of this article. We performed computer simulations of a single detector "neuron" monitoring a 15 × 15 array of receptors. The stimulus was centered on the receptor array and thus on the receptive field of the detector. Using the natural time discretization provided by the periodicity of the fish's electric organ discharge (EOD) and assuming an EOD frequency of 1000 Hz resulted in a step size of Δt = 1/f_EOD = 1 ms. In order to assess the influence of ISI correlations on detection performance, we generated renewal spike trains using the probability encoder model described in section 2 and nonrenewal spike trains using the more
realistic nonlinear adaptive threshold model introduced in section 3.2. The parameters of both spike generators were adjusted, so that equal stimulus amplitudes resulted in the same increase in firing rate above equal baseline activities. We used a gain of ≈ 250 spikes/s/mV, in accordance with experimental observation (Nelson, Xu, & Payne, 1997). To make meaningful comparisons, the mean false alarm rates of the two detectors were matched. The thresholds of both detectors were set to obtain a mean false alarm interval of 95 ms, which corresponds to a false alarm rate of approximately 10 Hz, similar to typical spontaneous firing rates of ELL neurons (Bastian & Nguyenkim, 2001). At the beginning of each trial, the cumulative likelihood ratio was set to zero. After 50 time steps, a gaussian electrosensory image according to equation 3.15 was presented and the time counter started. The time interval before stimulus onset allowed for transients in the afferent spike generators to decay and ensured sufficient convergence of synaptic predictors. The value of 50 time steps was determined empirically (see Figure 6). The detection delay is then the time from stimulus onset to the first postsynaptic spike (detector decision in favor of hypothesis H1 ). The same procedure was performed with zero signal amplitude in order to test the detector’s behavior for pure baseline input. The detection delay is then the time from the timer reset to the first false alarm. 3.10.1 Distributions of Detection Delay. To demonstrate the advantage of a dynamic detector, we analyze the distribution of detection delays. Figure 9 shows delay histograms for both detector types, with and without a stimulus. Under baseline conditions, the histograms for both detectors are virtually identical. 
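The trial procedure reduces to accumulating log-likelihood-ratio evidence and recording the first threshold crossing; a sketch with hypothetical inputs (the function name is illustrative):

```python
import numpy as np

def detection_delay(llr_increments, threshold):
    """First time step at which the cumulative log-likelihood ratio crosses threshold.

    Returns None if the threshold is never reached within the trial.
    """
    cum = np.cumsum(np.asarray(llr_increments, dtype=float))
    crossed = np.nonzero(cum >= threshold)[0]
    return int(crossed[0]) if crossed.size else None
```

Run on afferent evidence after stimulus onset, the returned index is the detection delay; run on baseline evidence, it is the time to the first false alarm.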
However, in the presence of a stimulus, the dynamic detector, Figure 9C, exhibits a smaller coefficient of variation than the static detector, Figure 9D, and has more probability mass concentrated at shorter delay times. There is no difference in mean detection delay between the two detector types, since false alarm rates are equal and the static detector is adjusted to match the mean afferent firing rates. Although the static detector cannot track the afferent firing probability, its estimate is correct on average. 3.10.2 Detection Probability as a Function of Integration Time. Integrating the delay distributions over time, that is, summing the counts of all histogram bins in Figures 9C and 9D up to a given delay time and normalizing by the total count, yields the “hit probability” as a function of integration time. This includes correct detections and false alarms. Integrating the baseline histograms, Figures 9A and 9B, over the same desired time window yields the probability of false alarm, which must be subtracted from the hit probability to obtain the probability of correct detection. Figure 10 shows a plot of the probability of correct detection as a function of integration time for three situations: (1) dynamic detector with correlated
[Figure 9 panels: (A) static weights, false alarms: mean = 95 ms, CV = 0.82; (B) dynamic weights, false alarms: mean = 95 ms, CV = 0.8; (C) dynamic weights, detections: mean = 9.6 ms, CV = 0.54; (D) static weights, detections: mean = 9.6 ms, CV = 0.65.]
Figure 9: Distributions of the detection delay for both detectors under baseline and signal condition. Histograms were obtained for the static and the dynamic detector, both of which received correlated input. Thresholds were chosen so that false alarm rates were equal. Therefore, the mean detection delay (dashed lines) under baseline conditions is the same in both detectors (A and B). In the presence of a stimulus, the mean detection delay is still the same, but the dynamic detector (C) has more probability mass concentrated at shorter delay times and exhibits a smaller coefficient of variation (CV) than the static detector (D).
input, (2) static detector with correlated input, and (3) static detector with renewal input. (The histograms for the third case are not shown in Figure 9.) In addition to the expected beneficial effect of reduced spike count variability in the correlated firing, the dynamic detector is able to exploit the higher degree of predictability of afferent spike events due to temporal correlations. Consequently, the same detection probability is reached within a shorter integration time. Interestingly, the optimal integration time for the dynamic detector, which is about 15 ms, matches typical values of membrane time constants observed in the ELL of weakly electric fish (Berman & Maler, 1998). Thus, the model suggests a biologically plausible integration timescale, even though it has no leak term.
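The integration described above amounts to comparing empirical delay distributions: the cumulative fraction of detections within an integration time T, minus the cumulative fraction of false alarms within T. A sketch with hypothetical delay samples:

```python
import numpy as np

def correct_detection_probability(detection_delays, false_alarm_delays, t_grid):
    """P(correct detection) = P(hit within T) - P(false alarm within T), per time T."""
    det = np.asarray(detection_delays, dtype=float)
    fa = np.asarray(false_alarm_delays, dtype=float)
    p_hit = np.array([np.mean(det <= t) for t in t_grid])   # integrated delay histogram
    p_fa = np.array([np.mean(fa <= t) for t in t_grid])     # integrated baseline histogram
    return p_hit - p_fa
```

Plotting the returned values against t_grid reproduces the kind of curve shown in Figure 10.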
Figure 10: Detection probability as a function of integration time for a given, fixed signal intensity. The static detector with uncorrelated input spike train statistics (dashed curve) performs poorly, requiring at least 70 ms to reach a maximum of only 30%. Fed with a negatively correlated spike train, the detection performance improves significantly (dash-dotted curve). The dynamic detector (solid curve) requires a shorter integration time to reach the same detection probability. In all three simulations, the false alarm rate is kept at the same level. The signal intensity is equivalent to that caused by a small prey-like object at a typical detection distance of about 2 cm.
Moreover, the synaptic plasticity transforms the inherent redundancy in the presynaptic spike trains into a reduced variability of the postsynaptic spike output, indicated by a smaller coefficient of variation in the detection delay. Such a reduction in firing variability is a well-known consequence of short-term synaptic depression (Abbott & Regehr, 2004). A population of such detector neurons would be more likely to fire within a small time interval than an equivalent population of static detectors. A neuron in a higher brain area, receiving input from a population of neurons with dynamic synaptic weights, could act as a coincidence detector and, due to the more precise firing of its input, the time window of coincidence could be tighter, resulting in a more efficient rejection of false alarms. 4 Discussion There is ample evidence of activity-dependent synaptic conductances varying on timescales comparable to the interspike interval of their presynaptic input (Zucker & Regehr, 2002; Xu-Friedman & Regehr, 2004). Hence, the question of the functional significance of such plasticity arises. Our primary focus has been the interplay between short-term synaptic plasticity and presynaptic input spike trains with correlated ISIs and its role in weak
signal detection. While it has been suggested that depressive synapses can achieve decorrelation (“whitening”) of positively correlated spike trains (Goldman, Maldonado, & Abbott, 2002), the same synaptic mechanism would have the opposite effect in the presence of negative correlations. Instead of decorrelating, short-term synaptic depression would preserve the negative correlations. The combination of facilitating and depressing plasticity in our dynamic detector model enables the postsynaptic neuron to differentiate between expected and unexpected spikes by exploiting the inherent redundancy in its correlated presynaptic input. The predictive synaptic plasticity introduced in this article is a novel mechanism that goes beyond previous models of weak signal detection in spike trains with correlated ISIs (Ratnam & Nelson, 2000; Chacron et al., 2001; Goense & Ratnam, 2003). In all of these approaches the emphasis is on the long-term regularization (i.e., on a timescale of multiple ISIs) rather than short-term predictability of spike trains. While the framework of statistical detection theory lends itself very well to weak signal detection in correlated spike trains, an explicit representation of conditional spiking probabilities would seem biologically implausible. As a history-dependent process, short-term synaptic plasticity offers a way to address the computational challenge of spike forecasting by implicitly representing conditional firing probabilities. Though most of the increase in detection performance compared to a renewal input stems from long-term regularization, there is also a modest, but significant, benefit in exploiting the statistical dependence of ISIs via synaptic plasticity. 
The apparent similarity between neuron models with spike-driven threshold adaptation and short-term synaptic depression has been noted by Chacron and colleagues (Chacron et al., 2003), though the authors caution that for electrosensory afferents, the required time constant of neurotransmitter recovery would have to be significantly shorter than those typically found in cortical neurons. However, the improvement of detection performance in our dynamic detector model raises the question of whether a neural correlate of such a detector exists. It would be surprising if electrosensory systems, and perhaps sensory neurons in other modalities, did not in some way actively exploit the correlated nature of afferent spike trains. We therefore speculate that a rapid form of short-term plasticity in excitatory synapses of postsynaptic (ELL) neurons might be able to mimic afferent threshold fluctuations, thus enabling the synapses to track the firing probabilities associated with presynaptic spikes. The unusually high firing rates of electrosensory afferents may be matched by unusually small synaptic time constants unique to sensory neurons in the ELL. From a bioengineering point of view, the described form of predictive synaptic plasticity may also have implications for the design of neuromorphic systems or bioelectronic interface technology in sensory prostheses, where it may be advantageous to precisely match the ISI statistics of a
sensory device output to the synaptic properties of the cells to which the device is connected. Appendix A: The Lyapunov Exponent of the Afferent Model The stability of threshold orbits is determined by the Lyapunov exponent of the map that transforms threshold θ [k] into its successor θ [k + 1] (see equation 3.4):
f_k : ℝ → ℝ,  θ ↦ θ − β/α + a[k] β · 2e^{−ηθ}/(1 + e^{−ηθ}).   (A.1)
The subscript k denotes that the map f_k is time dependent. The Lyapunov exponent of this map is given by (Strogatz, 1994)

λ = lim_{n→∞} (1/n) Σ_{k=0}^{n−1} ln |f′_k(θ[k])|,   (A.2)
where the prime indicates the derivative with respect to θ. If the limit in equation A.2 exists, the absolute difference between the actual and predicted threshold changes exponentially: |Θ[k] − θ[k]| = |Θ[0] − θ[0]| e^{λk}. Consequently, the Lyapunov exponent must be negative (see Figure 6B) in order to produce stable threshold orbits that the predictor can converge toward. The time constant of this convergence is τ = 1/|λ|.
For a positive Lyapunov exponent, even the slightest difference between Θ[0] and θ[0] would be substantially magnified within a small number of time steps. Such sensitivity to initial conditions is a characteristic feature of deterministic chaos. Therefore, it is important to investigate the behavior of the Lyapunov exponent for different parameter settings in order to avoid chaotic regimes if they exist. Apart from special cases, the Lyapunov exponent must be calculated numerically, since no general analytical expression exists. However, it is relatively easy to assess the dependence of λ on the boost parameter η. The
derivative of the map f_k with respect to θ is given by

f′_k(θ[k]) = 1 − 2a[k]β · ηe^{−ηθ[k]} / (1 + e^{−ηθ[k]})².   (A.3)
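Equations A.1 to A.3 suggest a straightforward numerical estimate of λ: iterate the map along a simulated orbit and average ln|f′_k|. A sketch, again assuming a logistic firing rule in place of equation 3.5:

```python
import numpy as np

def lyapunov_exponent(alpha, beta, eta, sigma, n_steps=10000, seed=0):
    """Estimate lambda by averaging ln|f'_k(theta[k])| along an orbit (eq. A.2)."""
    rng = np.random.default_rng(seed)
    theta, log_sum = 0.0, 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(theta / sigma))       # assumed firing rule
        a = int(rng.random() < p)
        x = np.exp(-eta * theta)
        deriv = 1.0 - 2.0 * a * beta * eta * x / (1.0 + x) ** 2    # eq. A.3
        log_sum += np.log(abs(deriv))
        theta += -beta / alpha + a * beta * 2.0 * x / (1.0 + x)    # eq. A.1
    return log_sum / n_steps
```

For the fitted parameter set of Figure 8, the estimate is negative, consistent with the stable orbits reported in Figure 11.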
In the special case of constant threshold boost, that is, for η = 0 (see equation A.1), the expression reduces to f′_k ≡ 1 for all k and all θ.
By inserting this result into equation A.2, one finds that the Lyapunov exponent is zero, implying an infinite convergence time constant. Obviously, threshold saturation plays a crucial role in system stability. For instance, a predictor system based on the simpler linear adaptive threshold model (Brandman & Nelson, 2002), which lacks a threshold saturation term, does not possess the convergence property and would require precise knowledge of the initial value θ[0] in order to set Θ[0] = θ[0]. The remainder of the analysis is based on numerical calculation of the Lyapunov exponent, approximating equation A.2 by summing over a large number of time steps (ca. 10,000). Since the derivative of the map in equation A.3 does not depend on α, the Lyapunov exponent is determined by only two parameters, β and η. This considerably simplifies numerical stability analysis and enables us to visualize the Lyapunov exponent in the relevant parameter space. As Figure 11A shows, stable threshold orbits are guaranteed over a wide parameter range. Appendix B: Robustness of Threshold Tracking. To test the predictor robustness, we generated artificial spike trains using the values for the parameters α and β obtained from the gradient-ascent fit to afferent data (see section 3.7.1) and chose slightly deviant values for α̃ and β̃ in the predictor system. Keeping these parameters fixed, we varied only η and η̃, setting η̃ = η. The overall discrepancy between the prediction Ppred and the actual firing probability P was measured in terms of the root mean square (RMS) deviation,

⟨P − Ppred⟩_rms = √⟨(P − Ppred)²⟩ = √( (1/m) Σ_{k=1}^{m} (P[k] − Ppred[k])² ),
where the angular brackets denote the temporal average. Figure 12A shows the RMS prediction error as a function of η. From this plot, one might conclude that the fitted parameter value η∗ is quite far from the minimum and would therefore be a suboptimal choice.
Figure 11: (Top) The Lyapunov exponent of the map f k in equation A.1 as a function of the parameters η and β. Over the entire plotted parameter range, the Lyapunov exponent is negative, indicating stable orbits and the absence of chaos. (Bottom) Double plot of the Lyapunov exponent (dashed-dotted curve) and the correlation coefficient of adjacent ISIs (solid curve) as a function of threshold saturation parameter η. Parameters α and β remain fixed. The dashed vertical line marks the value η∗ , obtained from fitting the model to an electrosensory afferent spike train (see Figure 8). Increasing η lowers the convergence time constant (i.e., the Lyapunov exponent becomes more negative), while reducing the degree of negative correlativeness. Thus, the two desirable properties, negative correlativeness of subsequent ISIs and short convergence time constant, cannot be optimized simultaneously. Interestingly, when η assumes the value η∗ , obtained from fitting neural data, it realizes a trade-off between the two quantities.
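The correlation coefficient of adjacent ISIs referred to in the caption (defined formally in appendix B) can be computed directly from an ISI sequence; a minimal sketch:

```python
import numpy as np

def serial_correlation_lag1(isi):
    """Lag-1 serial correlation coefficient rho(1) of an interspike-interval sequence."""
    I = np.asarray(isi, dtype=float)
    I_bar = I.mean()
    covariance = np.mean((I[1:] - I_bar) * (I[:-1] - I_bar))
    return covariance / I.var()
```

A strictly alternating short-long ISI pattern gives ρ(1) = −1, the idealized limit of the negative correlations seen in electrosensory afferents.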
Intuitively, one expects the robustness against parameter perturbations to depend on the Lyapunov exponent, which determines the convergence time constant. If the unperturbed system has a negative Lyapunov exponent, the predictor variable partially self-corrects, so that the asymptotic prediction error is bounded, and thus the sequences (θ[k]) and (Θ[k]) do
Figure 12: (Top) The saturation parameter η controls the robustness of the predictor system against small inaccuracies in its parameter values. The difference between actual and predicted firing probability, measured in terms of the root mean square error of the sequences P[k] and Ppred[k], is plotted as a function of the system parameter η. The error stems from the fact that the predictor parameters α̃ and β̃ deviated from their counterparts α and β in the spike generator by 1%, 5%, and 10%, respectively. Both systems had the same fixed noise variance. The spike generator values were α = 2.6905 and β = 0.5062. For the fitted value, η = η∗, the error is close to 4% for a 1% parameter deviation. For larger deviations, the error increases significantly, but the system remains operational. This graceful degradation is important for parameter learning and supports the biological plausibility of the model. (Bottom) The root mean square error of the threshold prediction (for 1% parameter deviation) as a function of the magnitude of the Lyapunov exponent, |λ|. As predicted by equation B.1, the error scales with |λ|^{−1}, except for larger values of |λ|, for which the integral approximation becomes invalid.
N. Lüdtke and M. Nelson
not diverge. The shorter the time constant, the more quickly the perturbed predictor sequence is drawn back to the original attractor orbit, resulting in a smaller RMS prediction error. As shown in Figure 11 (bottom), the Lyapunov exponent, and thus the perturbation robustness, is controlled by the parameter η (dashed curve). Also plotted is the serial correlation coefficient of adjacent ISIs (lag 1),

$$\rho(1) = \frac{\langle (I_{k+1} - \bar{I})(I_k - \bar{I}) \rangle}{\sigma^2\{I\}},$$

where Ī and σ²{I} are the mean and variance of the ISI. Obviously, increasing the robustness (the magnitude of the negative Lyapunov exponent) reduces the degree of negative ISI correlation, which would affect detection performance. Interestingly, the parameter value η* appears to realize a trade-off between these two desirable properties, which cannot be maximized simultaneously.

B.1 Prediction Error and Lyapunov Exponent. In order to quantify how the RMS prediction error depends on the Lyapunov exponent of the threshold map, we consider a small perturbation in the parameter β, denoted by δβ. Consequently, the predictor variable is transformed via the perturbed map:
$$\Theta[k] = \Theta[k-1] - \frac{\beta + \delta\beta}{\alpha} + 2\, a[k-1]\, (\beta + \delta\beta)\, g(\Theta[k-1], \eta).$$
Thus, at each time step, the resultant perturbation in Θ is

$$\delta\Theta[k] = -\frac{\delta\beta}{\alpha} + 2\, a[k-1]\, \delta\beta\, g(\Theta[k-1], \eta).$$
In order to obtain a relation between Θ[k] and θ[k], one must consider the perturbations from previous time steps. All past perturbations have decayed exponentially with time constant 1/|λ|. Consequently, the difference between threshold and predictor at time k is the sum of all the decayed previous perturbations:

$$\Theta[k] - \theta[k] = \sum_{j=0}^{k} \delta\Theta[k-j]\, e^{-|\lambda| j}.$$
For the RMS error, one obtains

$$\sqrt{\big\langle (\Theta[k] - \theta[k])^2 \big\rangle} = \sqrt{\big\langle (\delta\Theta)^2 \big\rangle}\, \sum_{j=0}^{\infty} e^{-|\lambda| j}.$$

In the asymptotic limit k → ∞, the summation over exponentials can be approximated by an integral that is easy to evaluate:

$$\big\langle \Theta[k] - \theta[k] \big\rangle_{\mathrm{rms}} = \sqrt{\big\langle (\delta\Theta)^2 \big\rangle}\, \sum_{j=0}^{\infty} e^{-|\lambda| j} \approx \sqrt{\big\langle (\delta\Theta)^2 \big\rangle} \int_0^{\infty} e^{-|\lambda| t}\, dt = \frac{\langle \delta\Theta \rangle_{\mathrm{rms}}}{|\lambda|}. \tag{B.1}$$
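The quality of the integral approximation in equation B.1 can be checked directly (a quick numerical sketch added here, not part of the original analysis): the geometric sum Σⱼ e^(−|λ|j) equals 1/(1 − e^(−|λ|)), so its relative deviation from the integral value 1/|λ| grows roughly like |λ|/2, which matches the breakdown at larger |λ| discussed in the text.

```python
import numpy as np

def sum_vs_integral(lam, n_terms=200000):
    """Discrete sum sum_{j>=0} exp(-lam*j) versus its integral
    approximation 1/lam (cf. equation B.1)."""
    s = float(np.sum(np.exp(-lam * np.arange(n_terms))))  # geometric sum
    return s, 1.0 / lam

for lam in (0.01, 0.1, 1.0):
    s, approx = sum_vs_integral(lam)
    print(f"|lambda| = {lam:4.2f}: sum = {s:9.3f}, "
          f"1/|lambda| = {approx:7.3f}, rel. error = {abs(s - approx) / s:.3f}")
```

For |λ| = 0.01 the sum and the integral agree to within half a percent, while for |λ| = 1 the discrepancy exceeds 30%, in line with the observation that the scaling law fails once the convergence time constant approaches the duration of a single time step.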
Hence, the RMS prediction error is expected to scale in inverse proportion to the magnitude of the Lyapunov exponent. As Figure 12 (bottom) shows, the theoretical result is in good agreement with the numerical simulation. Only for larger magnitudes of the Lyapunov exponent does the power-law scaling break down. In this range of λ, the convergence time constant τ = 1/|λ| is of the same order of magnitude as the length of the time step between iterations, and thus the rapidly decaying integrand is a poor approximation of the piecewise constant entries in the summation in equation B.1. The scaling law for the robustness shows that the predictor parameters do not have to perfectly match those of the spike generator in order to achieve a reasonably accurate threshold prediction. To reduce the prediction error, one would merely have to increase the magnitude of the Lyapunov exponent. However, due to the trade-off between the Lyapunov exponent and the serial correlation coefficient, any increase in robustness would come at the expense of a lower degree of ISI correlation (see Figure 11, bottom).

Appendix C: Maximum Likelihood Estimation of Parameters by Gradient Ascent

In this section we describe the gradient-ascent procedure used to estimate the model parameters α̃, β̃, η̃, and σ̃ that best match a given afferent spike train. Let the spike train be given as a binary vector a ∈ {0, 1}^m of length m. This format is obtained by resampling the original afferent recording at the EOD frequency, since the afferents never produce more than one spike per cycle of the oscillating field. The spike train log likelihood is the sum of the
conditional log likelihoods of the individual spike events:

$$L(a) = \sum_{k=1}^{m} \Big\{ a[k] \log P\big(a[k] = 1 \mid \Theta[k]\big) + \big(1 - a[k]\big) \log P\big(a[k] = 0 \mid \Theta[k]\big) \Big\}.$$

To make equations more compact, we introduce the following shorthand notation for the conditional firing probability: P[k] := P(a[k] = 1 | Θ[k]). Thus, the spike train likelihood function is written as

$$L(a) = \sum_{k=1}^{m} \big\{ a[k] \log P[k] + (1 - a[k]) \log(1 - P[k]) \big\}. \tag{C.1}$$
Note that P[k] contains an account of the entire spike train history, since it is a function of Θ[k] and hence indirectly a function of all previous threshold values Θ[k−1], Θ[k−2], …, Θ[0]. By virtue of equation 3.5, the firing probability of a baseline spike train is obtained by setting the stimulus intensity to zero (s[k] ≡ 0). Hence,

$$P[k] = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{\Theta[k]}{\sqrt{2}\,\sigma}\right). \tag{C.2}$$

In order to find the parameter values that maximize the likelihood function, we employ a gradient-ascent algorithm. At each iteration, the current parameters are updated by adding a fraction of the gradient of the likelihood function:

$$\big(\tilde\alpha, \tilde\beta, \tilde\eta, \tilde\sigma\big)_{n+1} = \big(\tilde\alpha, \tilde\beta, \tilde\eta, \tilde\sigma\big)_{n} + l\,\nabla_{\tilde\alpha,\tilde\beta,\tilde\eta,\tilde\sigma} L,$$

where l is the (heuristically determined) learning rate and the gradient is

$$\nabla_{\tilde\alpha,\tilde\beta,\tilde\eta,\tilde\sigma} = \left(\frac{\partial}{\partial\tilde\alpha}, \frac{\partial}{\partial\tilde\beta}, \frac{\partial}{\partial\tilde\eta}, \frac{\partial}{\partial\tilde\sigma}\right). \tag{C.3}$$

Using equation C.1, we obtain the gradient of the likelihood function:

$$\nabla_{\tilde\alpha,\tilde\beta,\tilde\eta,\tilde\sigma} L = \sum_{k=1}^{m} \left( a[k]\,\frac{\nabla_{\tilde\alpha,\tilde\beta,\tilde\eta,\tilde\sigma} P[k]}{P[k]} - (1 - a[k])\,\frac{\nabla_{\tilde\alpha,\tilde\beta,\tilde\eta,\tilde\sigma} P[k]}{1 - P[k]} \right) = \sum_{k=1}^{m} \left( \frac{a[k]}{P[k]} - \frac{1 - a[k]}{1 - P[k]} \right) \nabla_{\tilde\alpha,\tilde\beta,\tilde\eta,\tilde\sigma} P[k]. \tag{C.4}$$
Hence, one has to calculate the partial derivatives of the firing probability, equation C.2. For σ̃, this is straightforward:

$$\frac{\partial P[k]}{\partial \tilde\sigma} = \frac{\Theta[k]}{\sqrt{2\pi}\,\sigma^{2}}\, \exp\!\left( -\frac{\Theta[k]^{2}}{2\sigma^{2}} \right). \tag{C.5}$$
However, with the exception of σ̃, P[k] is not an explicit function of the parameters. Therefore, the partial derivatives with respect to α̃, β̃, and η̃ have to be obtained via the chain rule:

$$\nabla_{\tilde\alpha,\tilde\beta,\tilde\eta} P[k] = \frac{d P[k]}{d \Theta[k]}\, \nabla_{\tilde\alpha,\tilde\beta,\tilde\eta} \Theta[k]. \tag{C.6}$$

Differentiating equation C.2 yields

$$\frac{d P[k]}{d \Theta[k]} = -\frac{1}{\sigma \sqrt{2\pi}}\, \exp\!\left( -\frac{\Theta[k]^{2}}{2\sigma^{2}} \right). \tag{C.7}$$
To evaluate the gradient of Θ[k], one must take into account that Θ[k] is the result of k-fold iteration of the map (see appendix A):

$$f^{k} : \mathbb{R} \longrightarrow \mathbb{R}, \qquad \theta \longmapsto \theta - \frac{\beta}{\alpha} + a[k]\,\beta\, \frac{2 e^{-\eta\theta}}{1 + e^{-\eta\theta}}.$$

Hence, one can write

$$\frac{\partial \Theta[k]}{\partial \tilde\alpha} = \frac{\partial}{\partial \tilde\alpha} f^{k}(\Theta[0]) = \sum_{j=0}^{k-1} \left. \frac{\partial f(\Theta, \tilde\alpha)}{\partial \tilde\alpha} \right|_{\Theta = \Theta[j]}.$$

This yields

$$\frac{\partial f(\Theta, \tilde\alpha)}{\partial \tilde\alpha} = \frac{\tilde\beta}{\tilde\alpha^{2}} \quad \Rightarrow \quad \frac{\partial \Theta[k]}{\partial \tilde\alpha} = \frac{\tilde\beta}{\tilde\alpha^{2}}\, k.$$

Together with equation C.7, one obtains

$$\frac{\partial P[k]}{\partial \tilde\alpha} = -\frac{1}{\sigma \sqrt{2\pi}}\, \frac{\tilde\beta}{\tilde\alpha^{2}}\, k\, \exp\!\left( -\frac{\Theta[k]^{2}}{2\sigma^{2}} \right). \tag{C.8}$$
In a similar manner, one can calculate the partial derivatives of P[k] with respect to β̃ and η̃. Inserting the expressions for the partial derivatives of P[k] into equation C.4 yields the components of the gradient of the likelihood
function:

$$\frac{\partial L}{\partial \tilde\alpha} = \sum_{k=1}^{m} Q[k]\, \frac{\tilde\beta}{\tilde\alpha^{2}}\, k, \tag{C.9}$$

$$\frac{\partial L}{\partial \tilde\beta} = \sum_{k=1}^{m} Q[k] \sum_{j=0}^{k-1} \left( -\frac{1}{\tilde\alpha} + 2\, a[j] \left( 1 - \frac{1}{1 + \exp(-\tilde\eta\, \Theta[j])} \right) \right), \tag{C.10}$$

$$\frac{\partial L}{\partial \tilde\eta} = - \sum_{k=1}^{m} Q[k]\, \tilde\beta \sum_{j=0}^{k-1} \frac{2\, a[j]\, \Theta[j]\, \exp(-\tilde\eta\, \Theta[j])}{\big( 1 + \exp(-\tilde\eta\, \Theta[j]) \big)^{2}}, \tag{C.11}$$

$$\frac{\partial L}{\partial \tilde\sigma} = -\frac{1}{\sigma} \sum_{k=1}^{m} Q[k]\, \Theta[k], \tag{C.12}$$

where

$$Q[k] = -\frac{1}{\sigma \sqrt{2\pi}}\, \exp\!\left( -\frac{\Theta[k]^{2}}{2\sigma^{2}} \right) \left( \frac{a[k]}{P[k]} - \frac{1 - a[k]}{1 - P[k]} \right). \tag{C.13}$$
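For concreteness, the fitting loop can be sketched as follows. This is an illustrative reimplementation, not the authors' code: it simulates the threshold map of appendix A with the erfc firing probability of equation C.2, and it replaces the analytic gradient components C.9 to C.12 with a central-difference gradient; all parameter values, step sizes, and function names here are our own assumptions.

```python
import numpy as np
from math import erfc, log, sqrt

def firing_prob(theta, sigma):
    # Equation C.2: P[k] = (1/2) erfc(Theta[k] / (sqrt(2) sigma)),
    # clipped away from 0 and 1 so the log likelihood stays finite.
    p = 0.5 * erfc(theta / (sqrt(2.0) * sigma))
    return min(max(p, 1e-12), 1.0 - 1e-12)

def g(theta, eta):
    # Saturating threshold increment, 2 exp(-eta*theta) / (1 + exp(-eta*theta)).
    return 2.0 * np.exp(-eta * theta) / (1.0 + np.exp(-eta * theta))

def simulate(alpha, beta, eta, sigma, m, rng):
    """Generate a baseline spike train a[0..m-1] from the threshold map."""
    theta, spikes = 0.0, np.zeros(m, dtype=int)
    for k in range(m):
        spikes[k] = rng.random() < firing_prob(theta, sigma)
        theta = theta - beta / alpha + spikes[k] * beta * g(theta, eta)
    return spikes

def loglik(params, spikes):
    """Spike-train log likelihood of equation C.1 under the threshold map."""
    alpha, beta, eta, sigma = params
    theta, L = 0.0, 0.0
    for a in spikes:
        p = firing_prob(theta, sigma)
        L += a * log(p) + (1 - a) * log(1.0 - p)
        theta = theta - beta / alpha + a * beta * g(theta, eta)
    return L

def ascent_step(params, spikes, lr=1e-6, h=1e-5):
    """One gradient-ascent update, using a central-difference gradient in
    place of the analytic components C.9-C.12."""
    grad = np.zeros_like(params)
    for i in range(len(params)):
        e = np.zeros_like(params)
        e[i] = h
        grad[i] = (loglik(params + e, spikes) - loglik(params - e, spikes)) / (2.0 * h)
    return params + lr * grad

rng = np.random.default_rng(0)
spikes = simulate(2.6905, 0.5062, 1.0, 0.2, 400, rng)  # "data" from known parameters
p0 = np.array([2.5, 0.45, 0.9, 0.25])                  # deliberately perturbed start
p1 = ascent_step(p0, spikes)
print(loglik(p0, spikes), loglik(p1, spikes))          # likelihood increases
```

Because P[k] depends on the whole threshold history, each likelihood evaluation re-runs the map over the full spike train; the finite-difference gradient automatically captures the accumulated dependence that the chain-rule sums in C.9 to C.12 make explicit.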
Acknowledgments

This work has been supported by a grant from the National Institutes of Health. We thank the reviewer and Mark Mahler for valuable comments on the manuscript.
Received September 19, 2005; accepted May 2, 2006.
NOTE
Communicated by Peter Thomas
On the Use of Analytical Expressions for the Voltage Distribution to Analyze Intracellular Recordings Michelle Rudolph [email protected]
Alain Destexhe [email protected] Unité de Neurosciences Intégratives et Computationnelles, CNRS, 91198 Gif-sur-Yvette, France
Different analytical expressions for the membrane potential distribution of membranes subject to synaptic noise have been proposed and can be very helpful in analyzing experimental data. However, all of these expressions are either approximations or limit cases, and it is not clear how they compare and which expression should be used in a given situation. In this note, we provide a comparison of the different approximations available, with an aim of delineating which expression is most suitable for analyzing experimental data. Synaptic noise can be modeled by fluctuating conductances described by Ornstein-Uhlenbeck stochastic processes (Destexhe, Rudolph, Fellous, & Sejnowski, 2001). This system was investigated by using stochastic calculus to obtain analytical expressions for the steady-state membrane potential (Vm) distribution (Rudolph & Destexhe, 2003, 2005). Analytical expressions can also be obtained for the moments of the underlying three-dimensional Fokker-Planck equation (FPE) (Richardson, 2004) or by considering this equation under different limit cases (Lindner & Longtin, 2006). One of the greatest promises of such analytical expressions is that they can be used to deduce the characteristics of conductance fluctuations from intracellular recordings in vivo (Rudolph, Piwkowska, Badoual, & Destexhe, 2004; Rudolph, Pelletier, Paré, & Destexhe, 2005). A recent article (Lindner & Longtin, 2006) provided an in-depth analysis of some of these expressions, as well as different analytically exact limit cases. One of the conclusions of this analysis was that the original expression provided by Rudolph and Destexhe (2003) was derived using steps that were incorrect for colored noise and that the expression obtained matches numerical simulations only for restricted ranges of parameters. The latter conclusion was in agreement with the analysis provided in Rudolph and Destexhe (2005).
Another conclusion was that the "extended expression" proposed by Rudolph and Destexhe (2005), although providing an excellent fit to Vm distributions in general, does not match for some parameter values
Neural Computation 18, 2917–2922 (2006)
© 2006 Massachusetts Institute of Technology
and, in particular, does not agree with the analytically exact static noise limit. This extended expression is therefore not an exact solution of the system either. Since several analytical expressions were provided for the steady-state Vm distribution (Rudolph & Destexhe, 2003; Richardson, 2004; Rudolph & Destexhe, 2005; Lindner & Longtin, 2006), and since all of these expressions are either approximations or limit cases, it is not clear how they compare and which expression should be used in a given situation. In particular, it is unclear which expression should be used to analyze experimental recordings. In this note, we attempt to answer these questions by clarifying a number of points about some of the previous expressions and providing a detailed comparison of the different expressions available in the literature.
Figure 1: Comparison of the accuracy of different analytical expressions for the Vm distributions of membranes subject to colored conductance noise. (A) Example of Vm distribution calculated numerically (thick gray trace; model from Destexhe et al., 2001, simulated during 100 s), compared to different analytical expressions (see legend). (B) Same as in A in log scale. (C) Mean square error obtained for each expression by scanning a plausible parameter space spanned by seven parameters. Ten thousand runs similar to A were performed, using randomly chosen (uniformly distributed) parameter values. For each run, the mean square error was computed between the numerical solution and each expression. Parameters varied and range of values: membrane area a = 5000– 50,000 µm2 , mean excitatory conductance ge0 = 10–40 nS, mean inhibitory conductance gi0 = 10–100 nS, correlation times τe = 1–20 ms, and τi = 1–50 ms. The standard deviations (σe , σi ) were randomized between 20% and 33% of the mean conductance values to limit the occurrence of negative conductances (in which case, some analytical expressions would not apply). Fixed parameters: leak conductance density g L = 0.0452 mS cm−2 and reversal potential E L = −80 mV, specific membrane capacitance Cm = 1 µF cm−2 , and reversal potentials for excitation and inhibition: E e = 0 mV and E i = −75 mV, respectively. (D) Histogram of best estimates (black) and second best estimates (gray; both expressed in percentage of the 10,000 runs in B). The extended expression (Rudolph & Destexhe, 2005) had the smallest mean square error for about 80% of the cases. The expression of Richardson (2004) was the second best estimate, for about 60% of the cases. (E) Similar scan of parameters restricted to physiological values (taken from Rudolph et al., 2005; ge0 = 1–96 nS, gi0 = 20–200 nS, τe = 1–5 ms, and τi = 5–20 ms). In this case, Rudolph and Destexhe (2005) was the most performant for about 86% of the cases. 
(F) Scan using strong conductances and slow time constants (ge0 and gi0 = 50–400 nS, τe and τi = 20–50 ms). In this case, the static noise limit (Lindner & Longtin, 2006) was the most performant for about 50% of the cases. All simulations were performed using the NEURON simulation environment (Hines & Carnevale, 1997). See the supplementary information for additional scans and the NEURON code of these simulations.
[Figure 1 appears here. Legend: numerical simulation; Rudolph & Destexhe, 2003 (R&D 2003); Rudolph & Destexhe, 2005 (R&D 2005); Rudolph & Destexhe, 2005, gaussian approximation (R&D 2005*); Richardson, 2004 (R 2004); Lindner & Longtin, 2006, white noise limit (L&L 2006); Lindner & Longtin, 2006, static noise limit (L&L 2006*). Panels A and B plot the Vm density ρV (linear and log scale) against V (mV); panel C plots the mean square error (MSE) for each expression; panels D–F plot the percentage of best and second-best estimates for each expression.]
First, we would like to clarify a number of misleading statements we made in the original article (Rudolph & Destexhe, 2003) that may have led to confusion. The goal of the article was to obtain an analytical expression for the steady-state Vm distribution of membranes subject to conductance-based colored noise sources. To obtain this, we considered the full system under a t → ∞ limit. In this limit, we noted that the noise time constants become infinitesimally small compared to the time over which the system is considered, and this property allowed us to treat the system as for white noise. Our main assumption was that this procedure would allow us to obtain the correct steady-state properties, such as the Vm distribution. Our approach was to obtain a simplified FPE that gives the same steady-state solutions as the FPE describing the full system. These assumptions were stated in the Results section of Rudolph and Destexhe (2003) but were not clearly stated in the abstract and Discussion, and it could be understood that we claimed to provide an FPE valid for the full system. We clarify here that the treatment followed in that article did not intend to describe the full system but was restricted to steady-state solutions. Unlike the original expression (Rudolph & Destexhe, 2003), which matches only for a restricted range of parameters, the extended expression (Rudolph & Destexhe, 2005) matches for several orders of magnitude of the parameters (see also the supplementary information of Rudolph and Destexhe, 2005, at http://cns.iaf.cnrs-gif.fr). Why the extended expression matches so well, although it is not an exact solution of the system (Lindner & Longtin, 2006), is currently unknown. It is not due to the presence of boundary conditions, which could compensate for mismatches by chance: simulations with and without boundary conditions gave equally good fits for the parameters considered here (see the NEURON code in the supplementary information).
Our interpretation (Rudolph & Destexhe, 2005) is that the t → ∞ limit altered the spectral structure of the stochastic process (filtering), and one can recover a better spectral structure by following the same approximation for a system that is solvable (e.g., that of Richardson, 2004) and correcting it accordingly. Thus, as also found by Lindner and Longtin (2006), the extended expression is a very good approximation of the steady-state Vm distribution. Other expressions have been proposed under different approximations (Richardson, 2004; see also Moreno-Bote & Parga, 2005) or limit cases (Lindner & Longtin, 2006) and also match the simulations well over the applicable range of parameters. Since different expressions were proposed corresponding to different approximations (Rudolph & Destexhe, 2003, 2005; Richardson, 2004; Lindner & Longtin, 2006), we investigated which expression should be used in practical situations. We considered an extended range of parameters and tested all expressions by running the model for 10,000 randomly selected values within this parameter space. The results of this procedure are shown in Figures 1A to 1D. The smallest error between analytical expressions and numerical simulations was found for the extended expression of Rudolph
and Destexhe (2005), followed by gaussian approximations of the same authors and that of Richardson (2004). The fourth best approximation was the static noise limit by Lindner and Longtin (2006). By scanning only within physiologically relevant values based on conductance measurements in cats in vivo (Rudolph et al., 2005), the same ranking was observed (see Figure 1E), with even more drastic differences (up to 95%; see the supplementary information). Manual examination of the different parameter sets where the extended expression was not the best estimate revealed that this happened when both time constants were slow (“slow synapses”; decay time constants > 50 ms). Indeed, performing parameter scans restricted to this region of parameters showed that the extended expression, while still providing good fits to the simulations, ranked first for less than 30% of the cases, while the static noise limit was the best estimate for almost 50% of parameter sets (see Figure 1F; see the details in the supplementary information). Scanning parameters within a wider range of values including fast and slow synapses and weak and strong conductances showed that the extended expression was still the best estimate (about 47%), followed by the static noise limit (37%; see the supplementary information). In conclusion, we have clarified here two main points. First, we clarified the assumptions and approximations that were too ambiguously stated in Rudolph and Destexhe (2003). Second, we provided a comparison of the different expressions available so far in the literature. This comparison showed that for physiologically relevant parameter values, the extended expression of Rudolph and Destexhe (2005) is the most accurate for about 80% to 90% of the cases. Outside this range, however, the situation may be different. In systems driven by slow noisy synaptic activity, the static noise limit performed better. 
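The bookkeeping behind this comparison (compute the mean square error of each candidate expression against a reference for many random parameter draws, then count best and second-best rankings) can be sketched as follows. The three "expressions" here are arbitrary placeholders, since the actual formulas appear in the cited papers; only the ranking machinery is illustrated.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian(v, mu, sd):
    """Normalized gaussian density on the voltage grid."""
    return np.exp(-0.5 * ((v - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Placeholder "analytical expressions": each maps a Vm grid and a parameter
# draw to a predicted density. In the real comparison these would be the
# formulas of R&D 2003/2005, Richardson 2004, and L&L 2006.
expressions = {
    "expr_A": lambda v, mu, sd: gaussian(v, mu, sd),        # matches reference
    "expr_B": lambda v, mu, sd: gaussian(v, mu + 0.5, sd),  # shifted mean
    "expr_C": lambda v, mu, sd: gaussian(v, mu, 1.2 * sd),  # inflated width
}

v = np.linspace(-80.0, -40.0, 200)                 # Vm grid (mV)
best = {name: 0 for name in expressions}
second = {name: 0 for name in expressions}

for _ in range(1000):                              # random parameter draws
    mu = rng.uniform(-70.0, -50.0)                 # mean Vm (mV)
    sd = rng.uniform(2.0, 6.0)                     # Vm standard deviation (mV)
    reference = gaussian(v, mu, sd)                # stand-in for the simulation
    mse = {name: float(np.mean((f(v, mu, sd) - reference) ** 2))
           for name, f in expressions.items()}
    ranked = sorted(mse, key=mse.get)
    best[ranked[0]] += 1
    second[ranked[1]] += 1

print(best, second)   # expr_A wins every draw, since it matches the reference
```

In the actual study the reference curve is the numerically simulated Vm distribution rather than a closed-form density, but the per-run MSE and the best/second-best tallies are computed in exactly this fashion.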
We therefore conclude that for practical situations of realistic conductance values and synaptic time constants, the extended expression provides the most accurate alternative available. This is also supported by the fact that the extended expression was successfully tested in real neurons (Rudolph et al., 2004), which is perhaps the strongest evidence that this approach provides a powerful tool to analyze intracellular recordings. Acknowledgments We thank Magnus Richardson for extensive discussions. This research was supported by CNRS and the Human Frontier Science Program. Supplementary information can be found at http://cns-iaf.cnrs-gif.fr. References Destexhe, A., Rudolph, M., Fellous, J.-M., & Sejnowski, T. J. (2001). Fluctuating synaptic conductances recreate in vivo–like activity in neocortical neurons. Neuroscience, 107, 13–24.
Hines, M. L., & Carnevale, N. T. (1997). The NEURON simulation environment. Neural Comput., 9, 1179–1209. Lindner, B., & Longtin, A. (2006). Comment on "Characterization of subthreshold voltage fluctuations in neuronal membranes." Neural Comput., 18, 1896–1931. Moreno-Bote, R., & Parga, N. (2005). Membrane potential and response properties of populations of cortical neurons in the high-conductance state. Physical Review Letters, 94, 088103. Richardson, M. (2004). The effects of synaptic conductances on the voltage distribution and firing rate of spiking neurons. Physical Review E, 69, 051918. Rudolph, M., & Destexhe, A. (2003). Characterization of subthreshold voltage fluctuations in neuronal membranes. Neural Comput., 15, 2577–2618. Rudolph, M., & Destexhe, A. (2005). An extended analytic expression for the membrane potential distribution of conductance-based synaptic noise. Neural Comput., 17, 2301–2315. Rudolph, M., Pelletier, J.-G., Paré, D., & Destexhe, A. (2005). Characterization of synaptic conductances and integrative properties during electrically induced EEG-activated states in neocortical neurons in vivo. J. Neurophysiol., 94, 2805–2821. Rudolph, M., Piwkowska, Z., Badoual, M., Bal, T., & Destexhe, A. (2004). A method to estimate synaptic conductances from membrane potential fluctuations. J. Neurophysiol., 91, 2884–2896.
Received November 11, 2005; accepted April 5, 2006.
NOTE
Communicated by Alain Destexhe
A Distributed Computing Tool for Generating Neural Simulation Databases Robert J. Calin-Jageman [email protected]
Paul S. Katz [email protected] Department of Biology, Georgia State University, Atlanta, GA 30302, U.S.A.
After developing a model neuron or network, it is important to systematically explore its behavior across a wide range of parameter values or experimental conditions, or both. However, compiling a very large set of simulation runs is challenging because it typically requires both access to and expertise with high-performance computing facilities. To lower the barrier for large-scale model analysis, we have developed NeuronPM, a client/server application that creates a “screen-saver” cluster for running simulations in NEURON (Hines & Carnevale, 1997). NeuronPM provides a user-friendly way to use existing computing resources to catalog the performance of a neural simulation across a wide range of parameter values and experimental conditions. The NeuronPM client is a Windows-based screen saver, and the NeuronPM server can be hosted on any Apache/PHP/MySQL server. During idle time, the client retrieves model files and work assignments from the server, invokes NEURON to run the simulation, and returns results to the server. Administrative panels make it simple to upload model files, define the parameters and conditions to vary, and then monitor client status and work progress. NeuronPM is open-source freeware and is available for download at http://neuronpm.homeip.net. It is a useful entry-level tool for systematically analyzing complex neuron and network simulations. 1 Introduction Modern computing technology makes it relatively trivial to simulate complex neurons and networks involving tens to hundreds of parameters. This complexity comes at a cost, however, as it makes it increasingly difficult to understand a model from the results of a single simulation run. 
Thus, computational neuroscientists have recognized the utility of compiling databases of simulation runs (e.g., Foster, Ungar, & Schwaber, 1993; Goldman, Golowasch, Marder, & Abbott, 2001; Prinz, Billimoria, & Marder, 2003; Prinz, Bucher, & Marder, 2004), in which the behavior of a model is systematically explored across a wide range of parameter values or
Neural Computation 18, 2923–2927 (2006)
© 2006 Massachusetts Institute of Technology
experimental conditions. Databases of neural simulations can be useful for tuning models to new data sets, analyzing the influence of different parameters, and assessing the uniqueness of different solution sets. Compiling large sets of simulation runs typically requires access to highperformance computing facilities. For example, Prinz et al. (2003) used several months of processor time on a high-speed cluster to explore the effects of nine parameters on a model of a single neuron (1.9 million simulations). This presents a resource barrier, as high-performance computing facilities are not universally available and accessible. It also presents a development barrier, as designing simulations to take full advantage of high-performance resources typically requires additional programming and expertise. Thus, there are currently impediments to conducting the extensive parameter-space analyses desirable for fully understanding a model. To make it easier for neuroscientists to perform large-scale simulation analyses in their own labs, we sought to develop a system that would use commonly available computing resources and require minimal technical expertise for deployment. Our solution is a client/server application called NeuronPM, which creates a “screen-saver cluster” for running simulations in NEURON (Hines & Carnevale, 1997, 2001). NeuronPM provides a user-friendly means to compile large databases of simulation runs with commonly available computer resources. 2 System Architecture The NeuronPM client is a Microsoft Windows screen saver written with Microsoft Visual Studio.Net Express, a free development environment. The client runs simulations written for the NEURON simulation environment (Hines & Carnevale, 1997, 2001). Simulations are executed by directly invoking NEURON on the client computers, so NEURON models can be run using NeuronPM with little or no modification. 
Model files and work assignments are retrieved from a NeuronPM server, and result files are uploaded back to the server. All communication is through HTTP protocols over port 80. The NeuronPM server is a set of PHP files on an Apache web server connected to a MySQL database. The server can be hosted on compatible Windows or Linux machines. Work assignments are given to clients in large batches, reducing the communication overhead between the client and server. This allows even a rudimentary server to maintain a relatively large pool of clients. An extensive administrative interface is included on the NeuronPM server, with functions for uploading model files, defining the ranges and intervals of parameters and experimental conditions, and tracking clients and work assignments. Thus, setting up and executing large simulation runs can be accomplished entirely through the administrative interface,
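The client/server cycle described above can be sketched as a plain polling loop. The endpoint names and payload fields below are hypothetical (the real NeuronPM client is a Windows screen saver whose exact protocol is not documented here); the transport callables stand in for HTTP GET/POST against the server, which keeps the sketch testable offline.

```python
import json
from typing import Callable

def client_cycle(fetch: Callable[[str], str],
                 post: Callable[[str, str], None],
                 run_neuron: Callable[[dict], dict]) -> int:
    """One idle-time cycle: fetch a batch of work assignments from the
    server, run each parameter set through the simulator, and upload the
    results. Batching many jobs per request keeps server overhead low."""
    batch = json.loads(fetch("/api/assignments"))      # hypothetical endpoint
    for job in batch["jobs"]:
        result = run_neuron(job["params"])             # would invoke NEURON here
        post("/api/results",
             json.dumps({"job_id": job["id"], "result": result}))
    return len(batch["jobs"])

# Offline demonstration with a stub transport and a fake simulator.
outbox = []

def fake_fetch(path):
    return json.dumps({"jobs": [{"id": 1, "params": {"w": 0.2}},
                                {"id": 2, "params": {"w": 0.4}}]})

def fake_post(path, body):
    outbox.append(json.loads(body))

def fake_sim(params):
    return {"burst_onset_ms": 100.0 * params["w"]}     # placeholder measurement

n_done = client_cycle(fake_fetch, fake_post, fake_sim)
print(n_done, outbox)
```

Injecting the transport functions mirrors the actual division of labor: the loop logic is identical whether the callables wrap real HTTP requests over port 80 or, as here, in-memory stubs.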
Generating Simulation Databases
2925
with no additional programming. This distinguishes NeuronPM from alternatives such as PVM (e.g., Geist et al., 1994), which require coding a C, C++, or Fortran control system for defining and spawning tasks. A simple IP-based security system allows the administrator to filter unwanted clients.
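The server-side bookkeeping described above, enumerating every combination of parameter values and handing the combinations out in large batches, can be sketched in a few lines. This is an illustrative Python reconstruction, not NeuronPM's actual PHP/MySQL implementation; the function name and data layout are assumptions:

```python
from itertools import product

def make_assignments(param_ranges, batch_size):
    """Enumerate every parameter combination and split them into work batches.

    param_ranges maps a parameter name to the list of values it may take
    (e.g., nine synaptic weights with five values each gives 5**9, roughly
    1.95 million combinations). Batching keeps client/server chatter low.
    """
    names = sorted(param_ranges)
    combos = [dict(zip(names, values))
              for values in product(*(param_ranges[n] for n in names))]
    # Hand out work in large batches, as NeuronPM does, to reduce the
    # number of client requests per simulation.
    return [combos[i:i + batch_size] for i in range(0, len(combos), batch_size)]
```

Each batch is a list of complete parameter settings; a client runs one NEURON simulation per setting and uploads the results before requesting its next batch.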
3 Test Case We used NeuronPM to conduct a parameter-space analysis of a four-neuron model of the Tritonia swim central pattern generator (Frost, Lieb, Tunstall, Mensh, & Katz, 1997). To understand how the synaptic parameters in this network control the production of rhythmic activity, we varied the weights of nine synapses, with each weight taking on five different values (∼1.9 million configurations). We distributed this analysis using NeuronPM and a pool of 14 clients. Our original NEURON scripts for the model simulated 50 s of network activity and reported the onset and timing of burst activity for each cell in the network and the onset and timing of network oscillations. Preparing the NEURON model for distribution required changing only two lines of code in the model (for reporting results). We used the administrative panels of the NeuronPM server to define the parameters and parameter ranges for the search. The server scripts automatically generated all the NEURON code for setting parameters and creating results files. NeuronPM completed the parameter sweep in just under four days. The best client in the pool (a 2.8 GHz Pentium IV with 2 GB RAM) averaged 1.21 s per simulation run and would have required about 27 days of continuous work to complete the entire series of runs. NeuronPM thus achieved a 6.75-fold speed-up over its best client. Performance did not scale linearly with client pool size because the clients are not dedicated machines; they complete work only when idle and in screen-saver mode. The server (a 1.6 GHz Pentium IV with 756 MB RAM) never approached maximal load and continued to serve as a development platform throughout the analysis. The final result set consisted of approximately 66 MB of indexed text files containing both summary and detailed measurements for each of the 1.9 million model configurations.
This database of model performance allowed us to characterize the boundaries at which the model network produces rhythmic spiking activity. It also enabled us to determine how different synapses influence this oscillatory behavior (frequency and duration). This type of large database of simulation results could easily serve other purposes, such as tuning (finding parameter sets that provide a desired behavior), stability analysis (defining the range of parameters within which a model can maintain a desired behavior), and experimentation (determining model behavior under a range of inputs). To this end, NeuronPM also
2926
R. Calin-Jageman and P. Katz
has been successfully transplanted to other laboratories and tested with a variety of NEURON models.

4 Limitations The IP-based security system used by NeuronPM is probably insufficient for distribution of the client to the general public. However, it would be appropriate for deployment to lab- and department-wide pools of clients. For example, we have installed the NeuronPM client in a department undergraduate computing lab (over 30 clients). Finally, NeuronPM distributes sets of simulation runs to each client. Unlike distributed systems such as PVM, NeuronPM cannot distribute a single simulation across multiple clients.

5 Conclusion NeuronPM is an open-source, freeware project. The client, server, and documentation are available on the NeuronPM web site: http://neuronpm.homeip.net. NeuronPM is unlikely to match the speed of a high-performance computing resource. However, the usability of the system lowers the entry point for compiling large databases of simulation runs, an essential step in the analysis of complex neuron and network simulations. Furthermore, NeuronPM could easily serve as a foundation for developing a similar solution for other simulation packages (e.g., XPP).

Acknowledgments This work was supported in part by NIH grant 5R01NS035371-15 to P.S.K.

References Foster, W. R., Ungar, L. H., & Schwaber, J. S. (1993). Significance of conductances in Hodgkin-Huxley models. J. Neurophysiol., 70(6), 2502–2518. Frost, W. N., Lieb, J. R., Tunstall, M. J., Mensh, B., & Katz, P. S. (1997). Integrate-and-fire simulations of two molluscan neural circuits. In P. Stein (Ed.), Neurons, networks and motor behavior (pp. 173–179). Cambridge, MA: MIT Press. Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., & Sunderam, V. (1994). PVM: Parallel virtual machine. Cambridge, MA: MIT Press. Goldman, M. S., Golowasch, J., Marder, E., & Abbott, L. F. (2001). Global structure, robustness, and modulation of neuronal models. J. Neurosci., 21(14), 5229–5238. Hines, M.
L., & Carnevale, N. T. (1997). The NEURON simulation environment. Neural Comput., 9(6), 1179–1209. Hines, M. L., & Carnevale, N. T. (2001). NEURON: A tool for neuroscientists. Neuroscientist, 7(2), 123–135.
Prinz, A. A., Billimoria, C. P., & Marder, E. (2003). Alternative to hand-tuning conductance-based models: Construction and analysis of databases of model neurons. J. Neurophysiol., 90(6), 3998–4015. Prinz, A. A., Bucher, D., & Marder, E. (2004). Similar network activity from disparate circuit parameters. Nat. Neurosci., 7(12), 1345–1352.
Received November 11, 2005; accepted February 19, 2006.
NOTE
Communicated by Johan Suykens
Kernel Least-Squares Models Using Updates of the Pseudoinverse E. Andelić [email protected]
M. Schafföner [email protected]
M. Katz [email protected]
S. E. Krüger [email protected]
A. Wendemuth [email protected] Cognitive Systems Group, IESK, Otto-von-Guericke-University, 39016 Magdeburg, Germany
Sparse nonlinear classification and regression models in reproducing kernel Hilbert spaces (RKHSs) are considered. The use of Mercer kernels and the square loss function gives rise to an overdetermined linear least-squares problem in the corresponding RKHS. When we apply a greedy forward selection scheme, the least-squares problem may be solved by an order-recursive update of the pseudoinverse in each iteration step. The computational time is linear with respect to the number of the selected training samples.

1 Introduction Models for regression and classification that enforce square loss functions are closely related to Fisher discriminants (Duda & Hart, 1973). Fisher discriminants are Bayes optimal in the case of classification with normally distributed classes and equally structured covariance matrices (Duda & Hart, 1973; Mika, 2002). However, in contrast to support vector machines (SVMs), least-squares models (LSMs) are not sparse in general and hence may cause overfitting in a supervised learning scenario. One way to circumvent this problem is to incorporate regularization, controlled by a continuous parameter, into the model. For instance, ridge regression (Rifkin, Yeo, & Poggio, 2003) penalizes the norm of the solution, yielding flat directions in the RKHS, which are robust against outliers caused by noise, for example. Suykens and Vandewalle (1999) introduced least-squares SVMs (LS-SVMs), which are closely related to gaussian processes and Fisher discriminants. A
Neural Computation 18, 2928–2935 (2006)
© 2006 Massachusetts Institute of Technology
Kernel Least-Squares Models Using Updates of the Pseudoinverse
2929
linear set of equations in the dual space is solved using, for example, the conjugate gradient method for large data sets or a direct method for a small number of data. The solution is pruned in a second stage (De Kruif & De Vries, 2003; Hoegaerts, Suykens, Vandewalle, & De Moor, 2004). The close relation between the LS-SVM and the kernel Fisher discriminant (KFD) was shown in Van Gestel et al. (2002). It follows from the equivalence between the KFD and a least-squares regression onto the labels (Duda & Hart, 1973; Mika, 2002) that the proposed method is closely related to the KFD and the LS-SVM. However, the proposed method imposes sparsity on the solution in a greedy fashion, using subset selection as in Billings and Lee (2002) and Nair, Choudhury, and Keane (2002). Especially in the case of large data sets, subset selection is a practical method. It aims to eliminate the most irrelevant or redundant samples. However, finding the best subset of fixed size is an NP-hard combinatorial search problem. Hence, one is restricted to suboptimal search strategies. Forward selection starts with an empty training set and sequentially adds the one sample that is most relevant according to a certain criterion (e.g., the mean square error). In Billings and Lee (2002) and Nair et al. (2002), orthogonal decompositions of the Gram matrix are used for the forward selection scheme and the update of the solution. Here, we present a very simple way of constructing LSMs in an RKHS within a forward selection rule. The proposed method exploits the positive definiteness of the Gram matrix for an order-recursive update of the pseudoinverse, which represents the optimal solution in the least-squares sense. In section 2, a computationally efficient update rule for the pseudoinverse and the corresponding residual is derived. In section 3, some experimental results on regression and classification data sets are presented. Finally, a conclusion is given in section 4.
2 Update of the Pseudoinverse In a supervised learning problem, one is faced with a training data set $D = \{x_i, y_i\}$, $i = 1, \ldots, M$. Here, $x_i$ denotes an input vector of fixed size, and $y_i$ is the corresponding target value, which is contained in $\mathbb{R}$ for regression or in $\{1, -1\}$ for binary classification. It is assumed that $x_i \neq x_j$ for $i \neq j$. We focus on sparse approximations of models of the form $\hat{y} = K\alpha$. The use of Mercer kernels $k(\cdot, x)$ (Mercer, 1909) gives rise to a symmetric positive definite Gram matrix $K$ with elements $K_{ij} = k(x_i, x_j)$ defining the subspace of the RKHS in which learning takes place. The weight vector $\alpha = \{b, \alpha_1, \ldots, \alpha_M\}$ contains a bias term $b$ with a corresponding column $\mathbf{1} = \{1, \ldots, 1\}$ in the Gram matrix. Consider the overdetermined least-squares problem,

$$
\hat{\alpha}_m = \arg\min_{\alpha_m} \| K_m \alpha_m - y \|^2, \tag{2.1}
$$
E. Andelić, M. Schafföner, M. Katz, S. Krüger, and A. Wendemuth
2930
in the $m$th forward selection iteration, with the reduced Gram matrix $K_m = [\mathbf{1}\; k_1 \ldots k_m] \in \mathbb{R}^{M \times (m+1)}$, where $k_i = (k(\cdot, x_1), \ldots, k(\cdot, x_M))^T$, $i \in \{1, \ldots, m\}$, denotes one previously unselected column of the full Gram matrix. We denote the reduced weight vector as $\alpha_m = \{b, \alpha_1, \ldots, \alpha_m\} \in \mathbb{R}^{m+1}$ and the target vector as $y = (y_1, \ldots, y_M)^T$. Among all generalized inverses of $K_m$, the pseudoinverse $K_m^{\dagger} = (K_m^T K_m)^{-1} K_m^T$ is the one that has the lowest Frobenius norm (Ben-Israel & Greville, 1977). Thus, the corresponding solution $\hat{\alpha}_m = K_m^{\dagger} y$ has the lowest Euclidean norm. Partitioning $K_m$ and $\alpha_m$ in the form $K_m = [K_{m-1}\; k_m]$ and $\alpha_m = (\alpha_{m-1}\; \alpha_m)^T$ and setting $\alpha_m = \alpha_{m0} = \mathrm{const}$, the square loss becomes

$$
L(\alpha_{m-1}, \alpha_{m0}) = \| K_{m-1} \alpha_{m-1} - (y - k_m \alpha_{m0}) \|^2. \tag{2.2}
$$
The minimum of equation 2.2 in the least-squares sense is given by

$$
\hat{\alpha}_{m-1} = K_{m-1}^{\dagger} (y - k_m \alpha_{m0}). \tag{2.3}
$$

Inserting equation 2.3 into 2.2 yields

$$
L(\alpha_{m0}) = \| (I - K_{m-1} K_{m-1}^{\dagger}) k_m \alpha_{m0} - (I - K_{m-1} K_{m-1}^{\dagger}) y \|^2, \tag{2.4}
$$
with $I$ denoting the identity matrix of appropriate size. Note that the vector $r = (I - K_{m-1} K_{m-1}^{\dagger}) k_m$ is the residual corresponding to the least-squares regression onto $k_m$. Hence, provided $K$ is strictly positive definite, $r$ is a null vector if and only if $k_m$ is a null vector. To ensure strict positive definiteness of $K$, it is mandatory to regularize the Gram matrix by adding a small, positive constant $\varepsilon$ to the main diagonal in the form $K \to K + \varepsilon I$. Furthermore, the condition number of the matrix $K_m^T K_m$ increases as the number of selected columns $m$ grows. Thus, to ensure numerical stability, it is important to monitor the condition number of this matrix and to terminate the iteration if the condition number exceeds a predefined value, unless another stopping criterion is reached earlier. In the following, $k_m \neq 0$ is assumed. The minimum of equation 2.4 is met at

$$
\hat{\alpha}_{m0} = r^{\dagger} (I - K_{m-1} K_{m-1}^{\dagger}) y. \tag{2.5}
$$

Noting that the pseudoinverse of a vector is given by

$$
r^{\dagger} = \frac{r^T}{\|r\|^2}, \tag{2.6}
$$

equation 2.5 may be written as

$$
\hat{\alpha}_{m0} = \frac{k_m^T (I - K_{m-1} K_{m-1}^{\dagger})^T (I - K_{m-1} K_{m-1}^{\dagger})\, y}{\|r\|^2} = \frac{r^T (I - K_{m-1} K_{m-1}^{\dagger})\, y}{\|r\|^2}. \tag{2.7}
$$

The matrix $I - K_{m-1} K_{m-1}^{\dagger}$ is symmetric and idempotent, and thus equation 2.7 simplifies to

$$
\hat{\alpha}_{m0} = r^{\dagger} y. \tag{2.8}
$$
Combining equation 2.8 with 2.3, the current weight vector $\hat{\alpha}_m$ may be updated as

$$
\hat{\alpha}_m = \begin{pmatrix} \hat{\alpha}_{m-1} \\ \hat{\alpha}_{m0} \end{pmatrix}
= \begin{pmatrix} K_{m-1}^{\dagger} - K_{m-1}^{\dagger} k_m r^{\dagger} \\ r^{\dagger} \end{pmatrix} y, \tag{2.9}
$$

revealing the update

$$
K_m^{\dagger} = \begin{pmatrix} K_{m-1}^{\dagger} - K_{m-1}^{\dagger} k_m r^{\dagger} \\ r^{\dagger} \end{pmatrix} \tag{2.10}
$$

for the current pseudoinverse. In the $m$th iteration, $O(Mm)$ operations are required for these updates. In the following, we refer to the described method as nonlinear pseudodiscriminants (NPDs).

2.1 Forward Selection. The goal of every forward selection scheme is to select the columns of the Gram matrix that provide the greatest reduction of the residual. Methods like basis matching pursuit (Mallat & Zhang, 1993), order-recursive matching pursuit (Natarajan, 1995), or probabilistic approaches (Smola & Schölkopf, 2000) are several contributions to this issue. In Nair et al. (2002), forward selection is performed by simply choosing the column that corresponds to the entry with the highest absolute value in the current residual. The reasoning is that the residual provides the direction of the maximum decrease in the cost function $0.5\,\alpha^T K \alpha - \alpha^T y$, since the Gram matrix is strictly positive definite. The latter method is used in the following experiments, but note that the NPDs may be applied within any of the above forward selection rules. Furthermore, one advantage of the NPDs is that the corresponding residual may be updated at negligible computational cost. Consider the residual

$$
\hat{e}_m = y - \hat{y}_m = y - [K_{m-1}\; k_m]\, \hat{\alpha}_m \tag{2.11}
$$
in the $m$th iteration. Inserting equation 2.9 into 2.11 yields

$$
\begin{aligned}
\hat{e}_m &= y - [K_{m-1}\; k_m] \begin{pmatrix} K_{m-1}^{\dagger} - K_{m-1}^{\dagger} k_m r^{\dagger} \\ r^{\dagger} \end{pmatrix} y \\
&= y - \hat{y}_{m-1} + K_{m-1} K_{m-1}^{\dagger} k_m r^{\dagger} y - k_m r^{\dagger} y \\
&= \hat{e}_{m-1} - (r^{\dagger} y)(k_m - K_{m-1} K_{m-1}^{\dagger} k_m) \\
&= \hat{e}_{m-1} - (r^{\dagger} y)\, r.
\end{aligned} \tag{2.12}
$$
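The whole procedure, regularizing the Gram matrix, greedily picking the column matching the largest absolute residual entry, updating the pseudoinverse by equation 2.10 and the residual by equation 2.12, can be sketched in NumPy. This is an illustrative reimplementation, not the authors' code; the function name, return values, and the simple stopping rule are our assumptions:

```python
import numpy as np

def npd_forward_selection(K, y, n_select, eps=1e-6):
    """Greedy forward selection with order-recursive pseudoinverse updates."""
    M = K.shape[0]
    K = K + eps * np.eye(M)                   # K -> K + eps*I, strictly p.d.
    Kfull = np.column_stack([np.ones(M), K])  # bias column 1, then k_1..k_M
    selected = [0]                            # the bias column is always in
    Km = Kfull[:, selected]
    Kp = np.linalg.pinv(Km)                   # pseudoinverse of current K_m
    e = y - Km @ (Kp @ y)                     # current residual e_m
    for _ in range(n_select):
        j = int(np.argmax(np.abs(e))) + 1     # Nair et al. (2002) criterion
        if j in selected:
            break
        km = Kfull[:, j]
        r = km - Km @ (Kp @ km)               # residual of k_m on span(K_{m-1})
        rp = r / (r @ r)                      # r-dagger as a vector (eq. 2.6)
        Kp = np.vstack([Kp - np.outer(Kp @ km, rp), rp])  # eq. 2.10
        Km = np.column_stack([Km, km])
        e = e - (rp @ y) * r                  # eq. 2.12
        selected.append(j)
    alpha = Kp @ y                            # weights recovered in one shot
    return selected, alpha, Km, Kp, e
```

Each iteration costs $O(Mm)$, as stated above, and the weight vector is computed only once, after selection stops.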
The current residual may be updated without even knowing the weight vector $\hat{\alpha}_m$. Hence, $\hat{\alpha}_m$ may be computed in one shot after the forward selection is stopped. It is possible to determine the number of basis functions using cross-validation, or one may use, for instance, the Bayesian information criterion or the minimum description length as alternative stopping criteria.

3 Experiments To show the usefulness of the proposed method empirically, some experiments for regression and classification are performed. Thirteen artificial and real-world benchmark data sets1 for classification were chosen. For each of the 13 data sets, randomly generated partitions for training and testing exist (20 partitions for image and splice and 100 partitions for all others). In all experiments, the gaussian kernel is used. The width of the kernel function, the regularization constant ε, and the number of selected input vectors (number of basis functions) are optimized on the first five training partitions of each data set using a five-fold cross-validation. The NPDs are compared with regularized AdaBoost (ABR), support vector machines (SVM), the kernel Fisher discriminant (KFD) (not sparse), and the orthogonal least-squares algorithm (OLS). The results for these methods are taken from Billings and Lee (2002) and Rätsch, Onoda, and Müller (2001). It can be seen from Table 1 that the NPDs perform at least as well as the other state-of-the-art classifiers. For regression, the two real-world data sets Boston and Abalone, which are available from the UCI machine learning repository, are chosen. The hyperparameters are optimized in a five-fold cross-validation procedure. For both data sets, random partitions of the mother data for training and testing are generated (100 (10) partitions with 481 (3000) instances for training and 25 (1177) for testing for the Boston and Abalone data sets, respectively).
1 These data sets can be downloaded from http://ida.first.fhg.de/projects/bench/ benchmarks.htm.
Table 1: Estimation of Generalization Errors on 13 Benchmark Data Sets (in %).

Data Set       | ABR             | SVM             | KFD        | OLS             | NPD
Banana         | 10.9 ± 0.4 (95) | 11.5 ± 0.7 (78) | 10.8 ± 0.5 | 10.7 ± 0.5 (93) | 10.5 ± 0.4 (87)
Breast Cancer  | 26.5 ± 4.5 (97) | 26.0 ± 4.7 (42) | 24.8 ± 4.6 | 25.8 ± 4.8 (96) | 26.8 ± 4.8 (95)
Diabetes       | 23.8 ± 1.8 (97) | 23.5 ± 1.7 (57) | 23.2 ± 1.6 | 23.1 ± 1.8 (98) | 23.0 ± 1.8 (91)
Flare Solar    | 34.2 ± 2.2 (99) | 32.4 ± 1.8 (9)  | 33.2 ± 1.7 | 33.6 ± 1.6 (99) | 33.5 ± 1.8 (94)
German         | 24.3 ± 2.0 (99) | 23.6 ± 2.0 (58) | 23.7 ± 2.2 | 24.0 ± 2.3 (99) | 23.9 ± 2.2 (89)
Heart          | 16.5 ± 3.5 (98) | 16.0 ± 3.2 (51) | 16.1 ± 3.4 | 15.8 ± 3.4 (98) | 16.2 ± 3.4 (91)
Image          | 2.7 ± 0.6 (92)  | 3.0 ± 0.6 (87)  | 4.8 ± 0.6  | 2.8 ± 0.6 (78)  | 2.8 ± 0.6 (74)
Ringnorm       | 1.6 ± 0.1 (97)  | 1.7 ± 0.1 (62)  | 1.5 ± 0.1  | 1.6 ± 0.1 (98)  | 1.8 ± 0.1 (94)
Splice         | 9.5 ± 0.7 (99)  | 10.9 ± 0.7 (31) | 10.5 ± 0.6 | 11.7 ± 0.6 (67) | 11.7 ± 0.8 (50)
Thyroid        | 4.6 ± 2.2 (94)  | 4.8 ± 2.2 (79)  | 4.2 ± 2.0  | 4.6 ± 2.4 (84)  | 4.1 ± 1.9 (82)
Titanic        | 22.6 ± 1.2 (97) | 22.4 ± 1.0 (10) | 23.3 ± 2.0 | 22.4 ± 1.0 (93) | 22.3 ± 1.0 (73)
Twonorm        | 2.7 ± 0.2 (97)  | 3.0 ± 0.2 (82)  | 2.6 ± 0.2  | 2.7 ± 0.2 (97)  | 2.6 ± 0.2 (75)
Waveform       | 9.8 ± 0.8 (97)  | 9.9 ± 0.4 (60)  | 9.9 ± 0.4  | 10.0 ± 0.4 (96) | 10.0 ± 0.5 (51)

Notes: Sparsity levels in percentages are in parentheses. Best results are in bold type, and second-best results are italicized.
Table 2: MSE Comparison of QR and NPD for the Boston Data Set Using Different Subset Sizes.

Subset Size | QR           | NPD
50          | 13.68 ± 7.65 | 9.62 ± 3.44
100         | 11.19 ± 6.51 | 8.20 ± 3.77
150         | 10.48 ± 6.01 | 8.09 ± 3.72
200         | 8.35 ± 5.67  | 7.82 ± 3.66
Table 3: MSE Comparison of QR and NPD for the Abalone Data Set Using Different Subset Sizes.

Subset Size | QR          | NPD
50          | 4.83 ± 0.29 | 4.41 ± 0.18
100         | 4.66 ± 0.30 | 4.38 ± 0.17
150         | 4.65 ± 0.28 | 4.37 ± 0.17
200         | 4.51 ± 0.27 | 4.36 ± 0.16
250         | 4.54 ± 0.34 | 4.36 ± 0.17
300         | 4.55 ± 0.21 | 4.35 ± 0.16
350         | 4.53 ± 0.29 | 4.34 ± 0.16
All continuous features are rescaled to zero mean and unit variance for both Abalone and Boston. The gender encoding (male/female/infant) for the Abalone data set is mapped into {(1, 0, 0), (0, 1, 0), (0, 0, 1)}. The mean
squared error (MSE) of the NPDs is compared with that of a forward selection algorithm based on a QR decomposition of the Gram matrix (Nair et al., 2002). The results in Tables 2 and 3 show that the MSE is improved significantly in all cases by the NPDs. It should be noted that the best performance of the NPD for the Boston data set compares quite favorably with the best performance of SVMs (MSE 8.7 ± 6.8) (Schölkopf & Smola, 2002).

4 Conclusion A computationally efficient training algorithm for nonlinear discriminants using Mercer kernels is presented. The pseudoinverse is updated directly and reveals the current residual needed for the forward selection. The competitiveness of the proposed method is demonstrated in various experiments for both classification and regression.

References Ben-Israel, A., & Greville, T. N. E. (1977). Generalized inverses: Theory and applications. New York: Wiley. Billings, S. A., & Lee, K. L. (2002). Nonlinear Fisher discriminant analysis using a minimum squared error cost function and the orthogonal least squares algorithm. Neural Networks, 15, 263–270. De Kruif, B. J., & De Vries, T. J. A. (2003). Pruning error minimization in least squares support vector machines. IEEE Transactions on Neural Networks, 14, 696–702. Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley. Hoegaerts, L., Suykens, J. A. K., Vandewalle, J., & De Moor, B. (2004). A comparison of pruning algorithms for sparse least squares support vector machines. In Proceedings of the 11th International Conference on Neural Information Processing (ICONIP 2004). Berlin: Springer. Mallat, S. G., & Zhang, Z. (1993). Matching pursuit in a time-frequency dictionary. IEEE Transactions on Signal Processing, 41, 3397–3415. Mercer, J. (1909). Functions of positive and negative type and their connections to the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209, 415–446. Mika, S. (2002). Kernel Fisher discriminants.
Unpublished doctoral dissertation, Technical University Berlin. Nair, P., Choudhury, A., & Keane, A. J. (2002). Some greedy learning algorithms for sparse regression and classification with Mercer kernels. Journal of Machine Learning Research, 3, 781–801. Natarajan, B. K. (1995). Sparse approximate solutions to linear systems. SIAM Journal on Computing, 25, 227–234. Rätsch, G., Onoda, T., & Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning, 42, 287–320. Rifkin, R., Yeo, G., & Poggio, T. (2003). Regularized least squares classification. In J. A. K. Suykens, G. Horvath, S. Basu, C. A. Micchelli, & J. Vandewalle (Eds.),
Advances in learning theory: Methods, models and applications (pp. 131–154). Amsterdam: IOS Press. Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press. Smola, A. J., & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proceedings of the 17th International Conference on Machine Learning (pp. 911–918). San Francisco: Morgan Kaufmann. Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9, 293–300. Van Gestel, T., Suykens, J. A. K., Lanckriet, G., Lambrechts, A., De Moor, B., & Vandewalle, J. (2002). A Bayesian framework for least squares support vector machine classifiers, gaussian processes and kernel Fisher discriminant analysis. Neural Computation, 14, 1115–1147.
Received December 16, 2005; accepted March 23, 2006.
NOTE
Communicated by Andrew Barto
Learning Tetris Using the Noisy Cross-Entropy Method István Szita [email protected]
András Lőrincz [email protected] Department of Information Systems, Eötvös Loránd University, Budapest, Hungary H-1117
The cross-entropy method is an efficient and general optimization algorithm. However, its applicability in reinforcement learning (RL) seems to be limited because it often converges to suboptimal policies. We apply noise for preventing early convergence of the cross-entropy method, using Tetris, a computer game, for demonstration. The resulting policy outperforms previous RL algorithms by almost two orders of magnitude.

1 Introduction Tetris is one of the most popular computer games (see, e.g., Fahey, 2003). Despite its simple rules, playing the game well requires a complex strategy and lots of practice. Furthermore, Demaine, Hohenberger, and Liben-Nowell (2003) have shown that Tetris is hard in a mathematical sense as well: finding the optimal strategy is NP-hard even if the sequence of tetrominoes is known in advance. These properties make Tetris an appealing benchmark problem for testing reinforcement learning (and other machine learning) algorithms. Reinforcement learning (RL) algorithms are quite effective in solving a variety of complex sequential decision problems. Despite this, RL approaches tried to date on Tetris show surprisingly poor performance. The aim of this note is to show how to improve RL for this hard combinatorial problem. In this article, we put forth a modified version of the cross-entropy (CE) method (de Boer, Kroese, Mannor, & Rubinstein, 2004).

2 Applying the Cross-Entropy Method to Tetris 2.1 Value Function and Action Selection. Following the approach in Bertsekas and Tsitsiklis (1996), we shall learn state-value functions that are linear combinations of several basis functions. We use 22 such basis functions: maximal column height, individual column heights, differences of column heights, and the number of holes. More formally, if s denotes a Tetris state
Neural Computation 18, 2936–2941 (2006)
© 2006 Massachusetts Institute of Technology
Learning Tetris Using the Noisy Cross-Entropy Method
2937
and $\phi_i(s)$ is the value of basis function $i$ in this state, then according to weight vector $w$, the value of state $s$ is

$$
V_w(s) := \sum_{i=1}^{22} w_i \phi_i(s). \tag{2.1}
$$
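In code, equation 2.1 with greedy placement is just a dot product followed by an argmax over candidate placements. The sketch below assumes one plausible composition of the 22 features (10 column heights, 9 adjacent height differences, maximal height, hole count, and a constant); the note does not spell out the exact feature list, so treat this layout as illustrative:

```python
import numpy as np

def tetris_features(col_heights, n_holes):
    # Illustrative 22-dimensional feature vector for a 10-column board:
    # 10 heights + 9 adjacent height differences + max height + holes + 1.
    h = np.asarray(col_heights, dtype=float)
    return np.concatenate([h, np.abs(np.diff(h)), [h.max(), n_holes, 1.0]])

def evaluate(w, candidate_states):
    # V_w(s) = sum_i w_i * phi_i(s); each candidate is the board that would
    # result from one legal placement (after erasing full rows). Return the
    # index of the placement with the highest value.
    values = [w @ tetris_features(h, holes) for h, holes in candidate_states]
    return int(np.argmax(values))
```

With a weight vector learned by CE, action selection reduces to generating all legal (column, rotation) placements, scoring each resulting board, and taking the best.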
During a game, the actual tetromino is test-placed in every legal position, and after erasing full rows (if any), the value of the resulting state is calculated according to $V_w$. Finally, we choose the column and direction with the highest value.

2.2 The Cross-Entropy Method. The cross-entropy (CE) method is a general algorithm for (approximately) solving global optimization tasks of the form

$$
w^* = \arg\max_w S(w), \tag{2.2}
$$
where $S$ is a general real-valued objective function with optimum value $\gamma^* = S(w^*)$. The main idea of CE is to maintain a distribution of possible solutions and update this distribution at each step (de Boer et al., 2004). Here a very brief overview is provided. The CE method starts with a parametric family of probability distributions $\mathcal{F}$ and an initial distribution $f_0 \in \mathcal{F}$. Under this distribution, the probability of drawing a high-valued sample (having value near $\gamma^*$) is presumably very low; therefore, finding such samples by naive sampling is intractable. For any $\gamma \in \mathbb{R}$, let $g_{\geq\gamma}$ be the uniform distribution over the set $\{w : S(w) \geq \gamma\}$. If one finds the distribution $f_1 \in \mathcal{F}$ closest to $g_{\geq\gamma}$ with regard to the cross-entropy measure, then $f_0$ can be replaced by $f_1$, and $\gamma$-valued samples will have larger probabilities. For many distribution families, the parameters of $f_1$ can be estimated from samples of $f_0$. This estimation is tractable if the probability of the $\gamma$-level set is not very low with regard to $f_0$. Instead of directly computing the $\mathcal{F}$-distribution closest to $g_{\geq\gamma^*}$, we can proceed iteratively: we select a $\gamma_0$ appropriate for $f_0$, update the distribution parameters to obtain $f_1$, select $\gamma_1$, and so on, until we reach a sufficiently large $\gamma_k$. Below we sketch the special case where $w$ is sampled from a member of the gaussian distribution family. Let the distribution of the parameter vector at iteration $t$ be $f_t \sim N(\mu_t, \sigma_t^2)$. After drawing $n$ sample vectors $w_1, \ldots, w_n$ and obtaining their values $S(w_1), \ldots, S(w_n)$, we select the best $\rho \cdot n$ samples, where $0 < \rho < 1$ is the selection ratio. This is equivalent to setting $\gamma_t = S(w_{\rho \cdot n})$. Denoting the set of indices of the selected samples by $I \subseteq \{1, 2, \ldots, n\}$, the mean and
the deviation of the distribution are updated using

$$
\mu_{t+1} := \frac{\sum_{i \in I} w_i}{|I|} \tag{2.3}
$$

and

$$
\sigma_{t+1}^2 := \frac{\sum_{i \in I} (w_i - \mu_{t+1})^T (w_i - \mu_{t+1})}{|I|}. \tag{2.4}
$$
2.3 The Cross-Entropy Method and Reinforcement Learning. Applications of the CE method to RL include the parameter tuning of radial basis functions (Menache, Mannor, & Shimkin, 2005) and adaptation of a parameterized policy (Mannor, Rubinstein, & Gat, 2003). We apply CE to learn the weights of the basis functions, drawing each weight from an independent gaussian distribution.

2.4 Preventing Early Convergence. Preliminary investigations showed that the applicability of CE to RL problems is severely restricted by the phenomenon that the distribution concentrates to a single point too fast. To prevent this, we adapt a trick frequently used in particle filtering: at each iteration, we add some extra noise to the distribution. Instead of equation 2.4, we use

$$
\sigma_{t+1}^2 := \frac{\sum_{i \in I} (w_i - \mu_{t+1})^T (w_i - \mu_{t+1})}{|I|} + Z_{t+1}, \tag{2.5}
$$

where $Z_{t+1}$ is a constant vector depending only on $t$.

3 Experiments In the experiments we used the standard Tetris game described in Bertsekas and Tsitsiklis (1996), scoring 1 point for each cleared row. Each parameter had an initial distribution of $N(0, 100)$. We set $n = 100$ and $\rho = 0.1$. Each drawn sample was evaluated by playing a single game using the corresponding value function. After each iteration, we updated the distribution parameters using equations 2.3 and 2.5 and evaluated the mean performance of the learned parameters. This was accomplished by playing 30 games using $V_{\mu_t}$ and averaging the results. (The large number of evaluation games was necessary because Tetris strategies have large performance deviations; Fahey, 2003.) In experiment 1 we tested the original CE method (corresponding to $Z_t = 0$). As expected, the deviations converged to 0 too fast, so the mean performance settled at about 20,000 points. In experiment 2 we used a constant noise rate
[Figure 1: line plot omitted in this extraction; x-axis: Episode (0–80), y-axis: cleared lines on a logarithmic scale; curves: CE with no noise, CE with constant noise, CE with decreasing noise.]
Figure 1: Tetris scores on logarithmic scale. Mean performance of 30 evaluations versus iteration number. Error bars denoting 95% confidence intervals are shown on one of the curves.
of $Z_t = 4$, raising mean performance to a 70,000-point level. Our analysis showed that further improvement was counteracted by the amount of noise, which was too high and thus prevented convergence of the distributions. Therefore, in experiment 3, we applied a decreasing amount of noise, $Z_t = \max(5 - t/10,\; 0)$. With this setting, the average score exceeded 300,000 points by the end of episode 50, and the best score exceeded 800,000 points. In experiments 2 and 3, noise parameters were selected in an ad hoc manner, and no optimization was carried out. Results are summarized in Figure 1 and in Table 1. Assuming an exponential score distribution (Fahey, 2003), we calculated the 95% confidence intervals of the mean. For better visibility, confidence intervals have been plotted only for experiment 3. Learning took more than one month of CPU time on a 1 GHz machine using Matlab. The main reason for the long learning time is that for Tetris, the evaluation time of the value function scales linearly with the score, and the score is very noisy.1 Related to this, the computational overhead of the CE method is negligibly small. Because of the large running times, experiments consisted of only a single training run. However, preliminary results on simplified Tetris problems show that over multiple trials, the method consistently converges to the same region.
1 It is conjectured that the length of a game can be approximated from its starting sequence (Fahey, 2003), which could reduce evaluation time considerably.
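The noisy update is a small change to plain CE. Below is a minimal sketch of the method on a generic objective S, using per-coordinate gaussians and the decreasing schedule of experiment 3; whether the variance in the note is shared or per coordinate is not fully specified, so the per-coordinate form here is an assumption, and the function name is ours:

```python
import numpy as np

def noisy_ce(S, dim, n=100, rho=0.1, iters=100, seed=0):
    # Gaussian cross-entropy maximization with extra noise on the variance
    # (equations 2.3 and 2.5); each coordinate starts from N(0, 100).
    rng = np.random.default_rng(seed)
    mu = np.zeros(dim)
    sigma2 = np.full(dim, 100.0)
    k = max(1, int(rho * n))                  # number of elite samples
    for t in range(iters):
        w = mu + np.sqrt(sigma2) * rng.standard_normal((n, dim))
        scores = np.array([S(wi) for wi in w])
        elite = w[np.argsort(scores)[-k:]]    # best rho*n samples
        mu = elite.mean(axis=0)               # eq. 2.3
        Z = max(5.0 - t / 10.0, 0.0)          # decreasing noise schedule
        sigma2 = ((elite - mu) ** 2).mean(axis=0) + Z  # eq. 2.5
    return mu
```

For Tetris, S(w) would be the score of one game played with value function $V_w$; here the added noise keeps the variance from collapsing before the mean has settled.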
Table 1: Average Tetris Scores of Various Algorithms.

Method                                                        | Mean Score | Reference
Nonreinforcement learning
  Hand-coded                                                  | 631,167    | Dellacherie (Fahey, 2003)
  Genetic algorithm                                           | 586,103    | Böhm et al. (2004)
Reinforcement learning
  Relational reinforcement learning + kernel-based regression | ≈50        | Ramon and Driessens (2004)
  Policy iteration                                            | 3183       | Bertsekas and Tsitsiklis (1996)
  Least squares policy iteration                              | <3000      | Lagoudakis, Parr, and Littman (2002)
  Linear programming + Bootstrap                              | 4274       | Farias and van Roy (2006)
  Natural policy gradient                                     | ≈6800      | Kakade (2001)
  CE+RL                                                       | 21,252     |
  CE+RL, constant noise                                       | 72,705     |
  CE+RL, decreasing noise                                     | 348,895    |
3.1 Comparison to Previous Work. Tetris has been chosen as a benchmark problem by many researchers to test their RL algorithms. Table 1 summarizes results known to us, comparing them to our algorithm and to two state-of-the-art non-RL algorithms as well. The comparison shows that our method improves on the performance of the best RL algorithm by almost two orders of magnitude and gets close to the best non-RL algorithms (Fahey, 2003; Böhm, Kókai, & Mandl, 2004). We think that by applying the performance enhancement techniques in Böhm et al. (2004)—more basis functions, exponential value function—further significant improvement is possible.

References Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Nashua, NH: Athena Scientific. Böhm, N., Kókai, G., & Mandl, S. (2004). Evolving a heuristic function for the game of Tetris. In T. Scheffer (Ed.), Proc. Lernen, Wissensentdeckung und Adaptivität LWA—2004 (pp. 118–122). Berlin. de Boer, P., Kroese, D., Mannor, S., & Rubinstein, R. (2004). A tutorial on the cross-entropy method. Annals of Operations Research, 134(1), 19–67. Demaine, E. D., Hohenberger, S., & Liben-Nowell, D. (2003). Tetris is hard, even to approximate. In Proc. 9th International Computing and Combinatorics Conference (COCOON 2003) (pp. 351–363). Berlin: Springer. Fahey, C. P. (2003). Tetris AI. Available online at http://www.colinfahey.com Farias, V. F., & van Roy, B. (2006). Tetris: A study of randomized constraint sampling. In G. Calafiore & F. Dabbene (Eds.), Probabilistic and randomized methods for design under uncertainty. Berlin: Springer-Verlag.
Learning Tetris Using the Noisy Cross-Entropy Method
Kakade, S. (2001). A natural policy gradient. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 1531–1538). Cambridge, MA: MIT Press.
Lagoudakis, M. G., Parr, R., & Littman, M. L. (2002). Least-squares methods in reinforcement learning for control. In SETN '02: Proceedings of the Second Hellenic Conference on AI (pp. 249–260). Berlin: Springer-Verlag.
Mannor, S., Rubinstein, R. Y., & Gat, Y. (2003). The cross-entropy method for fast policy search. In Proc. International Conf. on Machine Learning (ICML 2003) (pp. 512–519). Menlo Park, CA: AAAI Press.
Menache, I., Mannor, S., & Shimkin, N. (2005). Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research, 134(1), 215–238.
Ramon, J., & Driessens, K. (2004). On the numeric stability of gaussian processes regression for relational reinforcement learning. In ICML-2004 Workshop on Relational Reinforcement Learning (pp. 10–14). N.p.: Omnipress.
Received October 3, 2005; accepted May 15, 2006.
LETTER
Communicated by Wulfram Gerstner
A Spiking Neuron Model of Cortical Correlates of Sensorineural Hearing Loss: Spontaneous Firing, Synchrony, and Tinnitus Melissa Dominguez [email protected]
Suzanna Becker [email protected] Department of Psychology, Neuroscience, and Behavior, McMaster University, Hamilton, Ontario, Canada L8S 4K1
Ian Bruce [email protected] Department of Electrical and Computer Engineering, McMaster University, Hamilton, Ontario, Canada L8S 4K1
Heather Read [email protected] Department of Psychology, University of Connecticut, Storrs, CT 06269, U.S.A.
Hearing loss due to peripheral damage is associated with cochlear hair cell damage or loss and some retrograde degeneration of auditory nerve fibers. Surviving auditory nerve fibers in the impaired region exhibit elevated and broadened frequency tuning, and the cochleotopic representation of broadband stimuli such as speech is distorted. In impaired cortical regions, increased tuning to frequencies near the edge of the hearing loss coupled with increased spontaneous and synchronous firing is observed. Tinnitus, an auditory percept in the absence of sensory input, may arise under these circumstances as a result of plastic reorganization in the auditory cortex. We present a spiking neuron model of auditory cortex that captures several key features of cortical organization. A key assumption in the model is that in response to reduced afferent excitatory input in the damaged region, a compensatory change in the connection strengths of lateral excitatory and inhibitory connections occurs. These changes allow the model to capture some of the cortical correlates of sensorineural hearing loss, including changes in spontaneous firing and synchrony; these phenomena may explain central tinnitus. This model may also be useful for evaluating procedures designed to segregate synchronous activity underlying tinnitus and for evaluating adaptive hearing devices that compensate for selective hearing loss.
Neural Computation 18, 2942–2958 (2006)
© 2006 Massachusetts Institute of Technology
1 Introduction Persistent tinnitus is a frequently occurring symptom in individuals with sensorineural hearing loss and can be extremely distressing. Such hearing loss, characterized by the loss of inner and outer hair cells, occurs in normal aging and with noise-induced trauma. As such, persistent tinnitus is quite common, especially among older adults. The hair cells tuned to high frequencies are especially vulnerable to damage, and as a result, hearing loss is most common in the high frequencies. The tinnitus percept may be either tonal or more broadband in nature and is typically within the region of hearing loss (Eggermont, 2003). Animal studies have identified many changes that occur throughout the auditory system with noise-induced high-frequency hearing loss. At the level of the auditory nerve, these include increased synchrony capture of high-frequency fibers by lower frequencies and broadening of tuning curves (Norena & Eggermont, 2003; Norena, Tomita, & Eggermont, 2003). In both the auditory nerve and the inferior colliculus, spontaneous firing rates are lower in the damaged animal than in the healthy animal (Chen & Jastreboff, 1995), while spontaneous rates are elevated at the higher levels of dorsal cochlear nucleus and primary auditory cortex (A1) (Seki & Eggermont, 2003). In addition, at the cortical level, there is increased synchrony in firing in the damaged region (Eggermont, 2003). Furthermore, the tonotopic map is reorganized such that neurons formerly tuned to high frequencies now respond best to frequencies near the edge of the region of hearing loss (Seki & Eggermont, 2003). Tinnitus associated with hearing loss has been characterized as a phantom sensation (Rauschecker, 1999), akin to phantom limb syndrome. Like phantom limb sensations, it is a percept arising from a damaged sensory system. 
We speculate that cortical correlates of hearing loss such as increased synchrony, elevated spontaneous firing rates, and shifted tonotopic maps may be responsible for the tinnitus percept. For example, areas of A1 that were originally associated with frequencies now in the region of hearing loss may now be driven by other frequencies or by spontaneous input generated by feedback within the cortex. We present in this article a model that captures some of these cortical correlates and in the future hope to develop this model as a platform to test behavioral therapies for tinnitus.
2 Related Work Given our long-term goal of developing models that lead to rehabilitation strategies for tinnitus, it is important that our model capture the relevant features of A1. In this section, we review what is known about the anatomical and functional organization of A1, particularly with regard to plasticity and reorganization after damage.
2.1 Functional and Anatomical Organization of A1. Little work has been done to describe in detail the functional and anatomical organization of A1 in humans. For this reason, we rely primarily on the literature describing this organization in other mammals, including rabbits, rats, and cats. The primary auditory cortex displays a roughly cochleotopic organization along the latero-medial axis with expanded representations of behaviorally significant characteristic frequencies. (For a review, see Read, Winer, & Schreiner, 2001.) The organization along the orthogonal, isofrequency axis is less clear. It appears to be a fairly complicated arrangement with several overlapping, patchy maps of various parameters, including bandwidth, binaural response type, intensity selectivity, timbre, and more complex spectrotemporal features (Read, Winer, & Schreiner, 2002; Velenovsky, Cetas, Price, Sinex, & McMullen, 2003). Along this axis, narrow and broadband regions alternate, but there are also patches with consistent binaural response type, intensity threshold, operating range, and response to frequency modulation. The bandwidth and threshold topographies are correlated, in that broadband areas have higher response thresholds. In addition, binaural properties vary in a periodic fashion, but their relationship to other parameters is unclear (Read et al., 2002). In magnetoencephalography (MEG) studies, Langner, Sams, Heil, and Schulze (1997) found evidence that periodicity and timbre vary along the isofrequency axis. In addition to short-range lateral excitatory and inhibitory connections, long-range lateral connections exist. These long-range lateral A1-to-A1 connections tend to be within the isofrequency contours, and there are few that cross the cochleotopic dimension. There are connections between patches with similar narrow-band spectral tuning properties, which could be a system for processing with high spectral resolution (Read et al., 2001).
2.2 Plasticity and Reorganization After Damage. Neuroplastic reorganization has been postulated to contribute to the tinnitus percept. In this section, we review evidence for plasticity in the adult primary auditory cortex. In the next section, we review models of cortical reorganization and discuss their plausibility as candidate mechanisms underlying tinnitus. Plasticity is necessary in development in order to achieve such important skills as sound localization. Since humans have a large degree of variation in shape and size of pinnae, as well as a lesser degree of variation in shape and size of the head, each person has a different head-related transfer function that changes during development. Thus, plasticity is needed to develop the highly accurate sound localizations that most adults possess. (For an excellent review of auditory cortical plasticity, see Rauschecker, 1999.) But is auditory cortical plasticity active in adults? Musical skill is associated with enhanced auditory cortical representation for musical notes as measured with MEG (Pantev et al., 1998), but that enhancement is correlated with the age of first learning the instrument (earlier learners have
more enhancement). Further experiments show that highly skilled musicians have enhanced auditory cortical representations for the timbre of their own instrument relative to other instruments and pure sine waves (Pantev, Roberts, Schulz, Engelien, & Ross, 2001). This indicates that the expansion is probably due to increased exposure and salience of those tones, rather than that people who become musicians are just naturally endowed with enhanced auditory systems. Further experiments (Menning, Roberts, & Pantev, 2000) show that intensive discrimination training in adults can induce similar expansion of cortical representation for the frequency trained. Thus, the plasticity is not limited to childhood and can be induced in normal adults. Animal studies allow for a more direct and precise measurement of cortical representations and changes to them. Kilgard and Merzenich (1998) showed that receptive field size in the auditory cortex of the adult rat can be narrowed or broadened when appropriate auditory stimuli are paired with stimulation of the nucleus basalis. Although direct stimulation of the nucleus basalis is not a natural phenomenon, this does confirm that auditory cortical receptive fields are subject to plastic change. Reorganization of cortical frequency response characteristics in A1 depends on the size of the cochlear or spiral ganglion lesion. Lesions covering a small area (1 mm of cochlea, or approximately 1 octave) are associated with rapid subcortical changes in frequency tuning (Snyder, Sinex, McGee, & Walsh, 2000) and no change in cochleotopic maps of frequency selectivity near threshold (Rajan & Irvine, 1998). Rajan (1998) investigated the effects of receptor organ damage on the organization of auditory cortex in adult cats. He found that limited damage to auditory receptors causes loss of functional surround inhibition in the cortex, unmasking of latent inputs, and significantly altered neural coding. 
However, these changes do not lead to plasticity of the cortical map, defined by the most sensitive input from the receptor surface to each cortical location. Thus, in sensory cortex, loss of surround inhibition as a consequence of receptor damage does not always produce a change in topographic mapping, especially when that receptor damage is not extensive. Larger lesions, however, have been shown to produce cochleotopic map reorganization in auditory cortex. Norena et al. (2003) found that after a 1 hour exposure to a 120 dB sound pressure level pure tone, significant peripheral damage covering two to three octaves resulted. This trauma induced a shift in characteristic frequencies of cortical neurons toward the lower frequencies, as well as an increase in bandwidth of the tuning curve and other changes. Plastic reorganization in the visual system has been much more extensively studied than in the auditory system and may provide insights relevant to the auditory system. Chino, Kaas, Smith, Langston, and Chen (1992) found rapid reorganization of cat visual cortex after partial deafferentation of the retina, but only
if the undamaged eye was removed. Thus, bilateral deafferentation is necessary for reorganization to occur. The reorganization occurs within hours and is complete (there are no “silent” areas of cortex that respond to nothing). Because it is so rapid, the reorganization must be due to reweighting of existing connections rather than growth of new connections. Whether thalamo-cortical afferents are plastic in the adult after loss of sensory input, but in the absence of total deafferentation, remains unclear. The large shifts in tuning curves resulting in gross topographic map reorganization could be due to other factors, such as unmasking of weak, previously noneffective afferent connections, plasticity on long-range horizontal connections, or cortico-thalamic connections. In support of the latter possibility, Ivanco (1997) observed long-term potentiation (LTP) in vivo, in awake, behaving rat in cortico-thalamic auditory pathways but not thalamo-cortical pathways. The shifts in tuning curves could also be due to a downregulation in GABAergic inhibition known to result from loss of sensory input (Mossop, Wilson, Caspary, & Moore, 2000) at several levels of the auditory system that could unmask broadly tuned excitatory afferent connections or multisynaptic chains of lateral excitatory connections. Generally it appears that large lesions are necessary for primary sensory cortex map reorganization to occur, but smaller amounts of damage can cause other changes that may be important to the understanding of tinnitus. 2.3 Modeling. Very little work has been done on modeling plastic reorganization in the auditory cortex. Mercado, Myers, and Gluck (2001) simulated experience-dependent plasticity in a Kohonen map representation of auditory cortex. 
They simulated training an animal on a single frequency and the subsequent overrepresentation of that frequency in cortex by training the network on input at only one frequency, thereby increasing the representation of that frequency in the network. Jenison (1997) modeled the reorganization of the tonotopic map after partial deafferentation via plasticity of thalamic afferent connections to a two-dimensional sheet of cortical neurons with short-range lateral excitation and long-range lateral inhibition. After a simulated cochlear lesion, the thalamic afferent connections were reorganized using Hebbian learning, while the lateral connections were held constant. After recovery (relearning), the cortical units that had responded to the damaged portion of the cochlea responded to neighboring areas, thus successfully simulating the cortical reorganization seen in animal studies of noise-induced hearing loss. However, neurophysiological work suggests that changes to thalamocortical connection strengths are not enough to fully explain the tonotopic map reorganization seen in animal studies of noise-induced hearing loss. Paired thalamo-cortical recording studies (Miller, Escabi, & Schreiner, 2001) suggest that A1 neurons receive afferent input from about 30 thalamic neurons. Thalamocortical pairs have either the same best frequencies (+/−0.05 octaves) or best frequencies with a maximal difference of +/−1/3 octave (Miller, Escabi, Read, & Schreiner,
2001). In contrast, tonotopic map reorganization can result in tuning curve shifts of multiple octaves. Such large shifts in tuning are not likely mediated by increased excitatory drive from thalamocortical afferent convergence. A1 neurons also receive a large degree of nonthalamic input, including cortical afferent input (Miller, Escabi, Read et al., 2001). Tonotopic reorganization may explain some perceptual phenomena associated with hearing loss, including hyperacusis for frequencies near the edge of hearing loss (Eggermont, 2003). However, it may not be necessary, or even sufficient, to explain the emergence of tinnitus. More likely correlates of tinnitus are the increased spontaneous firing and increased synchrony. Several modelers have proposed mechanisms of tinnitus based on subcortical lateral inhibition, which enhances the response of neurons adjacent to a deafferented region (Bruce, Bajaj, & Ko, 2003; Gerken, 1996). This may give rise to an enhanced sensory-driven response to stimuli near the edge of hearing loss, but would not account for spontaneously generated activity in this frequency range in the absence of hearing loss. Further, subjective ratings of tinnitus indicate that the perceived sound is typically well within the range of hearing loss (Norena, Micheyl, Chery-Croze, & Collet, 2002), not at the edge of the loss region. Although no modeling work has been done on illusory auditory percepts in cortex, Wilson’s computational account of illusory visual percepts such as migraine auras and Charles Bonnet syndrome may give some insights into what is causing the tinnitus percept. Wilson, Blake, and Lee (2001) explained binocular dominance wave propagation (oscillating illusions caused by static patterns) in terms of recurrent excitation and mutual inhibition. He described the illusions as being caused by a dense set of competing stimuli. 
Inhibitory interactions among V4 concentric units cause the hallucinations, and the oscillations are caused by spike frequency adaptation. Selective attention results from biasing these competitive interactions in parallel networks. 3 Model Our neural network model is built on a spike response model neuron that captures the shapes of excitatory and inhibitory postsynaptic currents and of action potentials (Gerstner & Kistler, 2002; Bruce et al., 2003). The input to our model is generated by a lateral inhibitory network (LIN) (Bruce et al., 2003) model of midbrain auditory processing. This model captures several important aspects of subcortical changes after high-frequency hearing loss: it produces output with decreased spontaneous firing in the impaired region and, via lateral inhibition, exhibits enhanced stimulus-generated firing of neurons tuned to the edge frequencies due to loss of surround inhibition. Our cortical model, which is the novel component of this work, includes a layer of ventral medial geniculate input neurons and a cortical layer of neurons with geniculate afferent input, and lateral excitatory and inhibitory
connections. The widespread afferent cortical connections result in broad frequency tuning sharpened by lateral inhibition. Simulated high-frequency peripheral damage in the input model results in enhanced firing of neurons at the edge frequency region in the cortical model, thus capturing some of the tonotopic map reorganization seen after high-frequency hearing loss. We hypothesized that the changes in synchrony and spontaneous firing rates are due to compensatory changes in the gain of lateral connections in the cortex after the reduction of afferent input. Our simulations show that both increasing the gain on lateral excitatory connections and decreasing the gain on lateral inhibitory connections result in increased spontaneous firing rates and increased synchrony in the cortical hearing loss region (Wierenga, Ibata, & Turrigiano, 2005).

3.1 Model Neuron. The cortical neuron membrane potentials v_i(t) are described by the following equation (for all potentials less than the spiking potential),

τ dv(t)/dt + v(t) = V i_a(t) + U i_le(t) − W i_li(t),   (3.1)

where τ is the membrane time constant (τ was set to 5 milliseconds for these simulations), V is the matrix of thalamic afferent connection strengths, U is the matrix of lateral excitatory connections, W is the matrix of lateral inhibitory connections, i_a(t) is the vector of thalamic afferent input, i_le(t) is the vector of excitatory lateral input, and i_li(t) is the vector of inhibitory lateral input at time t (time step size was 2 milliseconds).1 The input is a postsynaptic current in response to spiking activity at the input neuron. The postsynaptic currents are described by the following equation:

i(t) = (α/10τ)² t exp(−αt/τ).   (3.2)
For excitatory neurons, postsynaptic currents are sharp and fast (α = 5), and for inhibitory neurons, postsynaptic currents are slower (α = 1; see Figure 1A). When the membrane potential reaches the spiking threshold, a spike is generated (membrane potential spikes and then is reset to a low value), and a refractory period is entered (see Figure 1B). Input to the model is created by running Poisson-generated spontaneous input plus an optional tonal stimulus through a spiking LIN (see Bruce et al., 2003). A hearing impairment is modeled at the input level by decreasing both the spontaneous rate and the stimulus-driven response rate in the deafferented region.
1 This step size was chosen for the purposes of the synchrony analysis. See section 4.3 for further discussion.
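A minimal sketch of these dynamics (equations 3.1 and 3.2) follows. The membrane time constant (5 ms) and time step (2 ms) are taken from the text; the weights, input rates, and driving scheme are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np

# Sketch of the model neuron (equations 3.1 and 3.2). Weight matrices and
# the Poisson input rate below are illustrative assumptions.
TAU = 0.005   # membrane time constant, 5 ms
DT = 0.002    # simulation time step, 2 ms

def psc_kernel(t, alpha, tau=TAU):
    """Postsynaptic current of equation 3.2: (alpha/(10*tau))**2 * t * exp(-alpha*t/tau).
    Excitatory inputs use alpha = 5 (sharp, fast); inhibitory use alpha = 1 (slower)."""
    return (alpha / (10.0 * tau)) ** 2 * t * np.exp(-alpha * t / tau)

def step_membrane(v, i_a, i_le, i_li, V, U, W, dt=DT, tau=TAU):
    """One Euler step of equation 3.1: tau*dv/dt + v = V@i_a + U@i_le - W@i_li."""
    return v + dt * (-v + V @ i_a + U @ i_le - W @ i_li) / tau

# Usage: three neurons driven by Poisson afferent spikes filtered through the
# excitatory PSC kernel; identity afferent weights, no lateral input.
rng = np.random.default_rng(0)
n, steps = 3, 50
V, U, W = np.eye(n), np.zeros((n, n)), np.zeros((n, n))
spikes = rng.poisson(0.4, size=(steps, n))       # binned afferent spike counts
kernel = psc_kernel(np.arange(1, 11) * DT, alpha=5.0)
v = np.zeros(n)
for k in range(steps):
    # convolve the most recent spikes with the PSC kernel (newest first)
    recent = spikes[max(0, k - 9):k + 1][::-1]
    i_a = (kernel[:len(recent), None] * recent).sum(axis=0)
    v = step_membrane(v, i_a, np.zeros(n), np.zeros(n), V, U, W)
```

A full implementation would additionally compare v against a spiking threshold, emit a spike, reset the potential, and enforce a refractory period, as described above.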
Figure 1: Model neuron. (A) Postsynaptic currents in response to an excitatory (solid) or inhibitory (dotted) input spike at time 0. (B) Spiking activity. The membrane potential of a neuron in response to a constant input current.
Figure 2: The structural architectures of the three models were identical. The cortical layer of units receives two kinds of input: afferent input from the thalamic layer and lateral input (both excitatory and inhibitory) from other cortical units. In all cases, connection strength was a function of distance. Solid lines represent normal excitatory connections, and dotted lines represent inhibitory connections. For the impaired models, afferent input was damaged at the higher frequencies (dashed lines).
3.2 Cortical Model. The model network consists of two layers: an input layer, analogous to auditory thalamus, and a cortical layer. The cortical layer receives diffuse, broadly tuned excitatory input from the thalamic layer, and both excitatory and inhibitory recurrent input. (See Figure 2.) The current cortical model lacks back projections to the thalamus. These projections
Figure 3: Connection strength as a function of distance. (A) Thalamo-cortical afferent connections. The range of connections (d) is +/−2/3 of an octave, or 8 units. (B) Cortico-cortical lateral connections: d1 is +/−3/4 of an octave, or 9 units, and d2 is +/−1/6 of an octave, or 2 units.
are present in the mammalian auditory system in large numbers. As noted earlier, they are a good candidate for plasticity (Ivanco, 1997), which may further contribute to the development of the tinnitus percept. However, the function of these cortico-thalamic projections is not well understood at this time, so they have not been included. The connection strengths of both thalamic afferents and lateral connections are a function of distance. Each model neuron in each layer is spatially arranged along a cochleotopic axis. The strength of a thalamic afferent connection is a gaussian function of lateral distance between the two neurons at their corresponding locations along the cochleotopic axis. The strength of a lateral connection is a Mexican hat function of the distance between the two neurons (see Figure 3). A cortical neuron receives afferent input from thalamic neurons with best frequencies within a range of +/−1/3 octave of the cortical neuron’s best frequency (Miller, Escabi, Read et al., 2001). The full range of the lateral connections is +/−3/4 octaves, with the excitatory portion being +/−1/6 octave. The lateral excitatory connection strengths are much weaker than the thalamic afferent connection strengths, but the lateral inhibitory connection strengths are matched to the excitatory thalamic afferent connection strengths. Thus, in the normal model, the lateral connections act as a mechanism to sharpen tuning. We hypothesize that as a response to regional hearing loss and the consequent reduction in stimulus-driven thalamic afferents, compensatory changes occur in the connection strengths of lateral connections in that region. When hearing is impaired in a region of the frequency map, that region receives less afferent input. By changing the connection strengths of the lateral inputs, it is possible to raise the level of total input back to unimpaired levels. 
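The distance-dependent connection profiles described above, together with the compensation idea, can be sketched as follows. The ranges echo Figure 3 (afferent reach of about 8 units, lateral reach of 9 units with a 2-unit excitatory center), but the widths, amplitudes, and the helper `compensated_gain` are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

# Sketch of the connection-strength profiles (cf. Figure 3) and of the
# lateral-gain compensation idea. Widths and amplitudes are assumed.

def afferent_strength(d, sigma=4.0):
    """Gaussian thalamo-cortical weight vs. cochleotopic distance d (units)."""
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))

def lateral_strength(d, sigma_e=1.0, sigma_i=4.5, a_e=1.0, a_i=0.6):
    """Mexican hat lateral weight: narrow excitation minus broader inhibition."""
    return (a_e * np.exp(-d ** 2 / (2.0 * sigma_e ** 2))
            - a_i * np.exp(-d ** 2 / (2.0 * sigma_i ** 2)))

def compensated_gain(target, afferent, exc, inh, g_inh=0.5):
    """Given reduced afferent drive, scale lateral inhibition down by the
    assumed factor g_inh and solve for the excitatory gain g_exc that
    restores total input to the unimpaired level:
        target = afferent + g_exc * exc - g_inh * inh
    """
    return (target - afferent + g_inh * inh) / exc

# Usage: lateral profile over +/-9 units; compensation when afferent drive
# is halved (unimpaired total input normalized to 1.0).
d = np.arange(-9, 10)
hat = lateral_strength(d)          # positive center, negative surround
g_exc = compensated_gain(target=1.0, afferent=0.4, exc=0.3, inh=0.5)
```

As the text notes, raising excitation alone to the required level destabilized the network in the authors' simulations, which is why the combination of increased excitation and decreased inhibition is sketched here.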
There are three possible ways of changing lateral connection strengths to increase activity levels: increasing excitatory strength, decreasing inhibitory strength, or both. All three possibilities were tried, and the best results were obtained using the combination of increased excitation and decreased
Table 1: Model Parameters.

Parameter                           Value
Neurons per layer                   100
Neurons per octave                  Approximately 11
Spontaneous activity rate           200 spikes/sec.
Peak tonal activity rate            500 spikes/sec.
Correlation measured with lags of   +/−50 ms
inhibition. Decreasing inhibition alone produced results qualitatively similar to those presented here, but the levels of synchrony were lower. Increasing lateral excitation to the level necessary to achieve the unimpaired level of input without altering inhibition created an unstable, overactive network. Model parameters are summarized in Table 1. The results presented here are from a model with increased strength of lateral excitatory inputs and decreased strength of lateral inhibitory inputs to the cortical units that receive impaired thalamic afferent input, thus raising their total input levels to the input levels seen in the unimpaired region. 4 Results We compared the performance of three model architectures on five different conditions. The three models were a “normal model” with input from a normally responding subcortical model; an impaired model with no compensatory changes to lateral connections, which had input from a damaged cochlear model but a cortical model identical to the normal model; and an impaired model with changes, which had input from the damaged cochlear model and compensatory changes in lateral connection strengths in the cortical model. Each model was tested on five input conditions: spontaneous input (no input tone) and four different input tones: one in the normal hearing region, one to either side of the edge of hearing loss, and one in the impaired region. There were 10 different randomly generated input sequences for each of the five input conditions. 4.1 Spontaneous Activity. The responses of the three models to spontaneous activity from the cochlear model can be seen in Figure 4. As one might expect, the response of the normal model (solid line) is essentially flat across the entire frequency spectrum.
Because the hearing impairment in the cochlear model lowers spontaneous activity in the impaired region, the response of the impaired model with no changes to lateral weights (dash-dotted line) also shows a drop in spontaneous activity there. However, the impaired model with changes to lateral weights (dashed line) shows an increase in spontaneous
Figure 4: Response of the cortical models to spontaneous input (spike rate versus characteristic frequency). (A) Solid line: normal model; dash-dotted line: impaired model with no changes; dashed line: impaired model with changes. (B) Synchrony in spontaneous activity is greater in the impaired model with changes (gray bars) than in the normal model (black bars) or impaired model without changes (white bars). The difference is greatest in the region of hearing loss.
activity in the impaired region, as is consistent with physiological experimental findings. 4.2 Response to Tonal Input. The models were tested on four different types of tonal input. Their response profiles can be seen in Figure 5. In Figure 5A, the tone is in the range of normal hearing. As this input is in the frequency range not affected by the impairment or by the changes in lateral weights, the differences between the three models are the same as those described for spontaneous inputs. The stimulus-driven activity is the same across the three models; they differ only in their responses to the spontaneous activity. Figures 5B and 5C show the responses to tonal input near the edge of the hearing loss. The model with compensatory changes shows a particularly elevated response to tonal input just below the edge of the hearing loss. Figure 5D shows the models’ response to a tone in the impaired region. Note that spontaneous activity and edge activity remain elevated. Further, note that while the impaired model without changes shows basically no stimulus-driven response, the impaired model with changes does show stimulus-driven response. Another interesting aspect of the impaired model with changes is that it shows increased activity at the edge of the hearing loss. This could be an analogue of the hyperacusis that is frequently reported by tinnitus sufferers.
Figure 5: Response of the cortical models to a tonal input (spike rate versus characteristic frequency). Solid line: normal model; dash-dotted line: impaired model with no changes; dashed line: impaired model with changes. (A) Tone in normal hearing range. (B) Tone just below the edge of hearing loss. (C) Tone just above the edge of hearing loss. (D) Tone in the range of hearing loss.
The impaired model without changes also shows this feature, but to a much lesser degree. 4.3 Synchrony of Spontaneous Activity. Another change that occurs in animals with hearing loss is an increase in synchronous firing in the spontaneous activity of primary auditory cortex (Eggermont & Komiya, 2000; Norena & Eggermont, 2003; Seki & Eggermont, 2003). We evaluated the synchronous firing of our three models by computing the cross correlations of the spike trains of various cortical neurons in a manner similar to that used by Norena and Eggermont (2003). We used a time step size of 2 ms, analogous to how they divided spike trains into 2 ms bins. We
Table 2: Normalized Synchrony Results with Standard Error.

Region         Normal            Impaired with Changes   Impaired without Changes
Deafferented   0.9533 +/−9e-4    0.9992 +/−1e-5          0.0033 +/−3e-5
Across         0.9985 +/−1e-4    0.9996 +/−2e-5          0.0014 +/−5e-4
Normal         0.9525 +/−7e-4    0.9566 +/−0.0012        0.9569 +/−7e-4

Note: Rows are regions of the model; columns are different models.
then calculated for each pair of neurons the cross correlations of the spike trains for every time lag within a 100 ms time window. We determined how many of the correlation coefficients were three standard deviations above the mean for that run. These were considered synchronous activity. These counts were then normalized for the sizes of the various regions to obtain comparable numbers. We broke our results into three categories: those correlations computed only between neurons in the deafferented region, those computed between neurons in the normal range and those in the deafferented range, and those computed only between neurons in the normal region. The results can be seen in Figure 4 and Table 2. In all cases, synchrony was higher for the impaired model with changes than for the other models. The greatest increase in synchrony is in the deafferented region, as one would expect given that that is where the changes to the lateral connections were. Synchrony across regions was greater than synchrony in the normal region, as it included the impaired region. Finally, note that synchrony in the impaired model with no changes was extremely low in the deafferented region and in the across-region measurement. This is due to the overall low levels of activity in the deafferented region for that model. 5 Discussion and Conclusions The dynamical modes of behavior of neural networks with Mexican hat profiles of lateral connectivity have been analyzed by Wilson and Cowan (1972, 1973) and Wilson (1999). As observed in our own simulations, when the weights on lateral connections are sufficiently strong, the network exhibits a spatially localized oscillatory response to external input, and when the level of lateral inhibition is weakened, the network exhibits oscillatory behavior over a long spatial extent (Wilson, 1999). 
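The Wilson-Cowan-type rate dynamics invoked here can be sketched as a two-population system. The sigmoid gain and coupling parameters below are generic textbook-style assumptions, not values from Wilson and Cowan (1972), and the sketch uses rate-coded populations rather than the spiking neurons of our model.

```python
import numpy as np

# Minimal sketch of Wilson-Cowan-type rate dynamics: one excitatory (E) and
# one inhibitory (I) population with recurrent coupling. All parameters are
# illustrative assumptions.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate(steps=2000, dt=0.001, tau_e=0.01, tau_i=0.02,
             w_ee=12.0, w_ei=10.0, w_ie=9.0, w_ii=3.0, p=1.0):
    """Euler integration of:
       tau_e dE/dt = -E + sigmoid(w_ee*E - w_ei*I + p)
       tau_i dI/dt = -I + sigmoid(w_ie*E - w_ii*I)
    Strong recurrent excitation with weakened inhibition shifts such systems
    toward sustained oscillatory activity, as discussed in the text."""
    E, I = 0.1, 0.05
    traj = []
    for _ in range(steps):
        dE = (-E + sigmoid(w_ee * E - w_ei * I + p)) / tau_e
        dI = (-I + sigmoid(w_ie * E - w_ii * I)) / tau_i
        E, I = E + dt * dE, I + dt * dI
        traj.append(E)
    return np.array(traj)

traj = simulate()
```

Varying the inhibitory coupling (w_ei, w_ii) in this sketch gives a rough feel for the transition between localized and widespread oscillatory regimes analyzed by Wilson and Cowan.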
Although the Wilson-Cowan model differs from ours in that it employs rate-coded neurons and separate populations of inhibitory and excitatory neurons, Gerstner and Kistler (2002) have proven the functional correspondence between rate-coded dynamical neurons of the Wilson-Cowan type and a spike response model similar to that employed in our simulations.
A Spiking Neuron Model of Tinnitus
By changing the balance of excitation and inhibition on lateral connections in our cortical model, we were able to model several key features of cortical reorganization after peripheral impairment. Our model shows elevated spontaneous activity in the deafferented region, hyperexcitability at the edge of hearing loss, some spread of activation into the impaired region when driven by stimuli at the edge of hearing loss, and an increase in synchrony in spontaneous firing. In contrast, in our simulations of a model with hearing impairment but lacking compensatory changes in lateral connections, we observed no change in synchrony (simulation results not shown here) and decreased spontaneous firing rates. Thus, changes in the balance of excitation and inhibition of lateral connections in auditory cortex may be implicated in the sensation of tinnitus. These results are a promising step in the process of modeling, understanding, and eventually treating tinnitus. 6 Future Work This work will be extended to automate the compensation method. Thus, the changes in connection strength will be generated through a homeostatic process designed to maintain a long-term average rate of activity. A great deal is known about the organization of primary auditory cortex. For example, auditory neurons exhibit complex time-varying frequency-tuning profiles (Shamma & Versnal, 1995). Spatial summation across the lagged and nonlagged thalamic input neurons in our model will allow us to simulate the formation of spectrotemporal receptive fields at the cortical layer, as has been proposed in models of visual cortex (Bednar & Miikkulainen, 2003).
Another key detail of auditory cortex is that along the dimension orthogonal to the tonotopic map axis, neuronal tuning curves vary in blob-like fashion according to their bandwidth and level (intensity threshold) tuning (for a review, see Read et al., 2002) as well as binaural responsiveness and other things (Velenovsky et al., 2003; de Venecia & McMullen, 1994). Further, long-range horizontal connections are patchy and tend to connect neurons with similar bandwidth tuning profiles (Read et al., 2002). There is also evidence of long-range horizontal connections along the axis of the tonotopic map between neurons tuned to harmonically related frequencies (Read et al., 2002). We hypothesize that plasticity on the long-range connections is a key link in the induction of tinnitus, and this is the main focus of our ongoing work in this area. Acknowledgments This work was supported by the NET Programme, Institute of Aging, Canadian Institutes of Health Research.
M. Dominguez, Suzanna Becker, I. Bruce, and H. Read
References
Bednar, J., & Miikkulainen, R. (2003). Self-organization of spatiotemporal receptive fields and laterally connected direction and orientation maps. Neurocomputing, 52, 473–480.
Bruce, I., Bajaj, H., & Ko, J. (2003). Lateral-inhibitory-network models of tinnitus. In Proceedings of the 5th IFAC Symposium on Modelling and Control in Biomedical Systems (pp. 359–363). Dordrecht: Elsevier.
Chen, G., & Jastreboff, P. (1995). Salicylate-induced abnormal activity in the inferior colliculus of rats. Hearing Research, 82, 158–178.
Chino, Y., Kaas, J. H., Smith, E. L. III, Langston, A., & Cheng, H. (1992). Rapid reorganization of cortical maps in adult cats following restricted deafferentation in retina. Vision Research, 32(5), 789–796.
de Venecia, R., & McMullen, N. (1994). Single thalamocortical axons diverge to multiple patches in neonatal auditory cortex. Developmental Brain Research, 81, 135–142.
Eggermont, J. (2003). Central tinnitus. Auris Nasus Larynx, 30, 7–12.
Eggermont, J., & Komiya, H. (2000). Moderate noise trauma in juvenile cats results in profound cortical topographic map changes in adulthood. Hearing Research, 141, 89–101.
Gerken, G. (1996). Central tinnitus and lateral inhibition: An auditory brainstem model. Hearing Research, 97, 75–83.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models: Single neurons, populations, plasticity. Cambridge: Cambridge University Press.
Ivanco, T. (1997). Activity dependent plasticity in pathways between subcortical and cortical sites. Unpublished doctoral dissertation, McMaster University.
Jenison, R. (1997). Modeling sensorineural hearing loss. Mahwah, NJ: Erlbaum.
Kilgard, M., & Merzenich, M. (1998). Cortical map reorganization enabled by nucleus basalis activity. Science, 279(5357), 1714–1718.
Langner, G., Sams, M., Heil, P., & Schulze, H. (1997). Frequency and periodicity are represented in orthogonal maps in the human auditory cortex: Evidence from magnetoencephalography. J. Comp. Physiol. A, 181, 1714–1718.
Menning, H., Roberts, L., & Pantev, C. (2000). Plastic changes in the human auditory cortex induced by intensive discrimination training. NeuroReport, 11(4), 817–822.
Mercado, E. III, Myers, C., & Gluck, M. (2001). A computational model of mechanisms controlling experience-dependent reorganization of representational maps in auditory cortex. Cognitive, Affective, and Behavioral Neuroscience, 1(1), 37–55.
Miller, L., Escabi, M., Read, H., & Schreiner, C. (2001). Functional convergence of response properties in the auditory thalamocortical system. Neuron, 32(1), 151–160.
Miller, L., Escabi, M., & Schreiner, C. (2001). Feature selectivity and interneuronal cooperation in the thalamocortical system. Journal of Neuroscience, 21(20), 8136–8144.
Mossop, J., Wilson, M., Caspary, D., & Moore, D. (2000). Down-regulation of inhibition following unilateral deafening. Hearing Research, 147(1–2), 183–187.
Norena, A., & Eggermont, J. (2003). Changes in spontaneous neural activity immediately after an acoustic trauma: Implications for neural correlates of tinnitus. Hearing Research, 183, 137–153.
Norena, A., Micheyl, C., Chery-Croze, S., & Collet, L. (2002). Psychoacoustic characterization of the tinnitus spectrum: Implications for the underlying mechanisms of tinnitus. Audiology and Neuro-Otology, 7(6), 358–369.
Norena, A., Tomita, M., & Eggermont, J. (2003). Neural changes in cat auditory cortex after a transient pure-tone trauma. Journal of Neurophysiology, 90(4), 2387–2401.
Pantev, C., Oostenveld, R., Engelien, A., Ross, B., Roberts, L., & Hoke, M. (1998). Increased auditory cortical representation in musicians. Nature, 392(6678), 811–814.
Pantev, C., Roberts, L., Schulz, M., Engelien, A., & Ross, B. (2001). Timbre-specific enhancement of auditory cortical representations in musicians. Cognitive Neuroscience and Neuropsychology, 12(1), 169–174.
Rajan, R. (1998). Receptor organ damage causes loss of cortical surround inhibition without topographic map plasticity. Nature Neuroscience, 1, 138–143.
Rajan, R., & Irvine, D. (1998). Absence of plasticity of the frequency map in dorsal cochlear nucleus of adult cats after unilateral partial cochlear lesions. Journal of Comparative Neurology, 399(1), 35–46.
Rauschecker, J. P. (1999). Auditory cortical plasticity: A comparison with other sensory systems. Trends in Neuroscience, 22, 74–80.
Read, H., Winer, J., & Schreiner, C. (2001). Modular organization of intrinsic connections associated with spectral tuning in cat auditory cortex. Proceedings of the National Academy of Sciences, 98(14), 8042–8047.
Read, H., Winer, J., & Schreiner, C. (2002). Functional architecture of the auditory cortex. Current Opinion in Neurobiology, 12(4), 433–440.
Seki, S., & Eggermont, J. (2003). Changes in spontaneous firing rate and neural synchrony in cat primary auditory cortex after localized tone-induced hearing loss. Hearing Research, 180, 28–38.
Shamma, S., & Versnal, H. (1995). Ripple analysis in ferret primary auditory cortex (Tech. Rep. No. 95-18). College Park, MD: Institute for Systems Research, University of Maryland.
Snyder, R., Sinex, D., McGee, J., & Walsh, E. (2000). Acute spiral ganglion lesions change the tuning and tonotopic organization of cat inferior colliculus neurons. Hearing Research, 147(1–2), 200–220.
Velenovsky, D., Cetas, J., Price, R., Sinex, D., & McMullen, N. (2003). Functional subregions in primary auditory cortex defined by thalamocortical terminal arbors: An electrophysiological and anterograde labeling study. Journal of Neuroscience, 23(1), 308–316.
Wierenga, C., Ibata, K., & Turrigiano, G. (2005). Postsynaptic expression of homeostatic plasticity at neocortical synapses. Journal of Neuroscience, 25(11), 2895–2905.
Wilson, H. (1999). Spikes, decisions and actions: Dynamical foundations of neuroscience. New York: Oxford University Press.
Wilson, H., Blake, R., & Lee, S.-H. (2001). Dynamics of travelling waves in visual perception. Nature, 412(30), 907–910.
Wilson, H., & Cowan, J. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal, 12, 1–24.
Wilson, H., & Cowan, J. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80.
Received October 20, 2004; accepted April 26, 2006.
LETTER
Communicated by Stefano Fusi
Event-Driven Simulation Scheme for Spiking Neural Networks Using Lookup Tables to Characterize Neuronal Dynamics Eduardo Ros [email protected]
Richard Carrillo [email protected]
Eva M. Ortigosa [email protected] Department of Computer Architecture and Technology, E.T.S.I. Informática, University of Granada, 18071, Spain
Boris Barbour [email protected] Laboratoire de Neurobiologie, Ecole Normale Supérieure, 75230 Paris Cedex 05, France
Rodrigo Agís [email protected] Department of Computer Architecture and Technology, E.T.S.I. Informática, University of Granada, 18071, Spain
Neural Computation 18, 2959–2993 (2006)
© 2006 Massachusetts Institute of Technology

Nearly all neuronal information processing and interneuronal communication in the brain involves action potentials, or spikes, which drive the short-term synaptic dynamics of neurons, but also their long-term dynamics, via synaptic plasticity. In many brain structures, action potential activity is considered to be sparse. This sparseness of activity has been exploited to reduce the computational cost of large-scale network simulations, through the development of event-driven simulation schemes. However, existing event-driven simulation schemes use extremely simplified neuronal models. Here, we implement and evaluate critically an event-driven algorithm (ED-LUT) that uses precalculated lookup tables to characterize synaptic and neuronal dynamics. This approach enables the use of more complex (and realistic) neuronal models or data in representing the neurons, while retaining the advantage of high-speed simulation. We demonstrate the method's application for neurons containing exponential synaptic conductances, thereby implementing shunting inhibition, a phenomenon that is critical to cellular computation. We also introduce an improved two-stage event-queue algorithm, which allows
the simulations to scale efficiently to highly connected networks with arbitrary propagation delays. Finally, the scheme readily accommodates implementation of synaptic plasticity mechanisms that depend on spike timing, enabling future simulations to explore issues of long-term learning and adaptation in large-scale networks. 1 Introduction Most natural neurons communicate by means of individual spikes. Information is encoded and transmitted in these spikes, and nearly all of the computation is driven by these events. This includes both short-term computation (synaptic integration) and long-term adaptation (synaptic plasticity). In many brain regions, spiking activity is considered to be sparse. This, coupled with the computational cost of large-scale network simulations, has given rise to event-driven simulation schemes. In these approaches, instead of iteratively calculating all the neuron variables along the time dimension, the neuronal state is updated only when a new event is received. Various procedures have been proposed to update the neuronal state in this discontinuous way (Watts, 1994; Delorme, Gautrais, van Rullen, & Thorpe, 1999; Delorme & Thorpe, 2003; Mattia & Del Giudice, 2000; Reutimann, Giugliano, & Fusi, 2003). In the most widespread family of methods, the neuron's state variable (membrane potential) is updated according to a simple recurrence relation that can be described in closed form. The relation is applied on reception of each spike and depends only on the membrane potential following the previous spike, the time elapsed, and the nature of the input (strength, sign):

V_{m,t} = f(V_{m,t−Δt}, Δt, J),   (1.1)
where Vm is the membrane potential, Δt is the elapsed time (since the last spike), and J represents the effect of the input (excitatory or inhibitory weight). This method can describe integrate-and-fire neurons and is used, for instance, in SpikeNET (Delorme et al., 1999; Delorme & Thorpe, 2003). Such algorithms can include both additive and multiplicative synapses (i.e., synaptic conductances), as well as short-term and long-term synaptic plasticity. However, the algorithms are restricted to synaptic mechanisms whose effects are instantaneous and to neuronal models that can spike only immediately upon receiving input. These conditions obviously restrict the complexity (realism) of the neuronal and synaptic models that can be used. Implementing more complex neuronal dynamics in event-driven schemes is not straightforward. As discussed by Mattia and Del Giudice (2000), incorporating more complex models requires extending the event-driven framework to handle predicted spikes that can be modified if
intervening inputs are received; the authors propose one approach to this issue. However, in order to preserve the benefits of computational speed, it must, in addition, be possible to update the neuron state variables discontinuously and also predict when future spikes would occur (in the absence of further input). Except for the simplest neuron models, these are nontrivial calculations, and only partial solutions to these problems exist. Makino (2003) proposed an efficient Newton-Raphson approach to predicting threshold crossings in spike-response model neurons. However, the method does not help in calculating the neuron's state variables discontinuously and has been applied only to spike-response models involving sums of exponentials or trigonometric functions. As we shall show below, it is sometimes difficult to represent neuronal models effectively in this form. A standard optimization in high-performance code is to replace costly function evaluations with lookup tables of precalculated function values. This is the approach that was adopted by Reutimann et al. (2003) in order to simulate the effect of large numbers of random synaptic inputs. They replaced the online solution of a partial differential equation with a simple consultation of a precalculated table. Motivated by the need to simulate a large network of "realistic" neurons (explained below), we decided to carry the lookup table approach to its logical extreme: to characterize all neuron dynamics off-line, enabling the event-driven simulation to proceed using only table lookups, avoiding all function evaluations. We term this method ED-LUT (for event-driven lookup table). As mentioned by Reutimann et al. (2003), the lookup tables required for this approach can become unmanageably large when the model complexity requires more than a handful of state variables.
Although we have found no way to avoid this scaling issue, we have been able to optimize the calculation and storage of the table data such that quite rich and complex neuronal models can nevertheless be effectively simulated in this way. The initial motivation for these simulations was a large-scale real-time model of the cerebellum. This structure contains very large numbers of granule cells, which are thought to be only sparsely active. An event-driven scheme would therefore offer a significant performance benefit. However, an important feature of the cellular computations of cerebellar granule cells is reported to be shunting inhibition (Mitchell & Silver, 2003), which requires noninstantaneous synaptic conductances. These cannot be readily represented in any of the event-driven schemes based on simple recurrence relations. For this reason, we chose to implement the ED-LUT method. Note that noninstantaneous conductances may be important generally, not just in the cerebellum (Eckhorn et al., 1988; Eckhorn, Reitböck, Arndt, & Dicke, 1990). The axons of granule cells, the parallel fibers, traverse large numbers of Purkinje cells sequentially, giving rise to a continuum of propagation delays. This spread of propagation delays has long been hypothesized to
underlie the precise timing abilities attributed to the cerebellum (Braitenberg & Atwood, 1958). Large divergences and arbitrary delays are features of many other brain regions, and it has been shown that propagation and synaptic delays are critical parameters in network oscillations (Brunel & Hakim, 1999). Previous implementations of event queues were not optimized for handling large synaptic divergences with arbitrary delays. Mattia and Del Giudice (2000) implemented distinct fixed-time event queues (i.e., one per delay), which, though optimally quick, would become quite cumbersome to manage when large numbers of distinct delays are required by the network topology. Reutimann et al. (2003) and Makino (2003) used a single ordered event structure in which all spikes are considered independent. However, for neurons with large synaptic divergences, unnecessary operations are performed on this structure, since the arrival order of spikes emitted by a given neuron is known. We introduce a two-stage event queue that exploits this knowledge to handle efficiently large synaptic divergences with arbitrary delays. We demonstrate our implementation of the ED-LUT method for a model of a single-compartment neuron receiving exponential synaptic conductances (with different time constants for excitation and inhibition). In particular, we describe how to calculate and optimize the lookup tables and the implementation of the two-stage event queue. We then evaluate the performance of the implementation in terms of accuracy and speed and compare it with other simulation methods.
In addition, several lookup tables that completely characterize the neuronal and synaptic dynamics are calculated: the exponential decay of the synaptic conductances; a table that can be used to predict if and when the next spike of a cell would be emitted, in the absence of further input; and a table defining the membrane potential (Vm ) as a function of the combination of state variables at a given point in the past (in our simulations, this table gives Vm as a function of the synaptic conductances and the membrane potential, all at the time of the last event, and the time elapsed since that last event). If different neuron types are included in the network, they will require their own characterization lookup tables with different parameters defining their specific dynamics. Each neuron in the network stores its state variables at the time of the last event, as well as the time of that event. If short- or long-term synaptic dynamics are to be modeled, additional state variables are stored per neuron or per synapse. When the simulation runs, events (spikes) are ordered using the event heap (and the interconnection list—see below) in order to be processed in
Figure 1: Main structures of the ED-LUT simulator. Input spikes are stored in an input queue and are sequentially inserted into the spike heap. The network definition process produces a neuron list and an interconnection list, which are consulted by the simulation engine. Event processing is done by accessing the neuron characterization tables to retrieve updated neuronal states and forecast spike firing times.
chronological order. The response of each cell to spikes it receives is determined with reference to the lookup tables, and any new spikes generated are inserted into the event heap. External input to the network can be fed directly into the event heap. Two types of events are distinguished: firing events, the times when a neuron emits a spike, and propagated events, the times when these spikes reach their target neurons. In general, each firing event leads to many propagated events through the synaptic connection tree. Because our synaptic and neuronal dynamics allow the neurons to fire after inputs have been received, the firing events are only predictions. The arrival of new events can modify these predictions. For this reason, the event handler must check the validity of each firing event in the heap before it is processed. 3 Two-Stage Event Handling Events (spikes) must be treated in chronological order in order to preserve the causality of the simulation. The event-handling algorithm must therefore be capable of maintaining the temporal order of spikes. In addition, as our neuronal model allows delayed firing (after inputs), the algorithm
must cope with the fact that predicted firing times may be modified by intervening inputs. Mattia and Del Giudice (2000) used a fixed structure (called a synaptic matrix) for storing synaptic delays. This is suited only for handling a fixed number of latencies. In contrast, our simulation needed to support arbitrary synaptic delays. This required that each spike transmitted between two cells be represented internally by two events. The first one (the firing event) is marked with the time instant when the source neuron fires the spike. The second one (the propagated event) is marked with the time instant when the spike reaches the target neuron. Most neurons have large synaptic divergences. In these cases, for each firing event, the simulation scheme produces one propagated event per output connection. The algorithm efficiency of event-driven schemes depends on the size of the event data structure, so performance will be optimal under conditions that limit load (low connectivity, low activity). However, large synaptic divergences (with many different propagation delays) are an important feature of most brain regions. Previous implementations of event-driven schemes have used a single event heap, into which all spikes are inserted and reordered (Reutimann et al., 2003; Makino, 2003). However, treating each spike as a fully arbitrary event leads to the event data structure's becoming larger than necessary, because the order of spike emission by a given neuron is always known (it is defined in the interconnection list). We have designed an algorithm that exploits this knowledge by using a multistage event-handling process. Our approach is based on a spike data structure that functions as an interface between the source neuron events and target neurons. We use a heap data structure (priority queue) to store the spikes (see appendix A for a brief motivation). The output connection list of each neuron (which indicates its target cells) is sorted by propagation delay.
When a source neuron fires, only the event corresponding to the lowest-latency connection is inserted into the spike heap. This event is linked to the other output spikes of this source neuron. When the first spike is processed and removed from the heap, the next event in the output connection list is inserted into the spike heap, taking into account the connection delay. Since the output connection list of each neuron is sorted by latency, the next connection carrying a spike can easily be found. This process is repeated until the last event in the list is processed. In this way, the system can handle large connection divergences efficiently. Further detail on the performance of this optimization is reported in appendix A. Each neuron stores two time labels. One indicates the time the neuron was last updated. This happens on reception of each input. As described in Figure 2, when a neuron is affected by an event, the time label of this neuron is updated to tsim if it is an input spike (propagated event) or to tsim + trefrac if it is an output spike (firing event), to prevent it from firing again during the refractory period. This is important because when the characterization tables are consulted, the time label indicates the time that has elapsed since
While tsim < tend {
    Extract the event with the shortest latency from the spike heap
    If it is a firing event
        If it is still a valid event and the neuron is not in a refractory period
            Update the neuron state (Vm, gexc, ginh) to the postfiring state
            Prevent this neuron from firing during the refractory period
                (once this is done, update the neuron time label to tsim + trefrac)
            Predict whether the source neuron will fire again with the current neuron state
            If the neuron will fire:
                Insert a new firing event into the spike heap
            Insert the propagated event with the shortest latency
                (looking at the output connection list)
    If it is a propagated event
        Update the target neuron state (Vm, gexc, ginh), looking at the
            characterization tables, before the event is computed
        Modify the conductances (gexc, ginh) using the connection weight
            (Gexc,i, Ginh,i) for the new spike
        Update the neuron time label to tsim
        Predict whether the target neuron will fire
        If it fires:
            Insert the firing event into the spike heap with the predicted time
        Insert only the next propagated event with the next shortest latency
            (looking at the output connection delay table)
}
Figure 2: Simulation algorithm. This pseudocode describes the simulation engine. It processes all the events of the spike heap in chronological order.
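The two-stage queue described in this section can be sketched in executable form (class and method names are illustrative, not ED-LUT's): only the lowest-latency propagated event of each firing occupies the heap, and extracting it chains in the next event from the source neuron's delay-sorted output connection list.

```python
import heapq
import itertools

class TwoStageQueue:
    """Sketch of the two-stage event queue. A real implementation must
    also interleave firing events and validity checks (Figure 2); this
    sketch isolates the queue mechanics."""

    def __init__(self, out_connections):
        # out_connections[src] = iterable of (delay, target) pairs;
        # sorted by delay here, as in the output connection list.
        self.out = {s: sorted(c) for s, c in out_connections.items()}
        self.heap = []
        self._count = itertools.count()  # tie-breaker for simultaneous events

    def fire(self, src, t_fire):
        """Register a firing: insert only the lowest-latency propagated event."""
        if self.out.get(src):
            self._push(src, t_fire, 0)

    def _push(self, src, t_fire, idx):
        delay, _target = self.out[src][idx]
        heapq.heappush(self.heap,
                       (t_fire + delay, next(self._count), src, t_fire, idx))

    def pop(self):
        """Extract the earliest propagated event and chain in the next one
        from the same source neuron's output connection list."""
        t, _, src, t_fire, idx = heapq.heappop(self.heap)
        if idx + 1 < len(self.out[src]):
            self._push(src, t_fire, idx + 1)
        return t, src, self.out[src][idx][1]
```

With this structure, a neuron with N output connections contributes at most one entry to the heap at any time, instead of N.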
the last update. The other time label maintains the up-to-date firing time prediction. This is used to check the validity of events extracted from the central event heap. The basic computation scheme consists of a processing loop, in each iteration of which the next event (i.e., with the shortest latency) is taken from the spike heap. This event is extracted from the spike heap structure, the target neuron variables are updated (in the neuron list structure), and if the affected neurons generate them, new events are inserted into the spike heap. Also, if the processed event is a propagated event, the next spike from the output connection list of the neuron is inserted into the heap. This computation scheme is summarized in Figure 2. It should be noted that
Figure 3: Equivalent electrical circuit of a model neuron. gexc and ginh are the excitatory and inhibitory synaptic conductances, while grest is the resting conductance, which returns the membrane potential to its resting state (E rest ) in the absence of input stimuli.
events are inserted into the heap in correct temporal sequence, but only the spike with the shortest latency is ever extracted. Events that are superseded by intervening inputs in the neuron concerned are left in the event heap. They are discarded upon extraction if invalid (this is checked against the up-to-date firing prediction stored in the neuron). 4 Neuronal and Synaptic Models We model neurons as single compartments receiving exponential excitatory and inhibitory synaptic conductances with different time constants. The basic electrical components of the neuron model are shown in Figure 3. The neuron is described by the following parameters: (1) membrane capacitance, Cm; (2) the reversal potentials of the synaptic conductances, Eexc and Einh; (3) the time constants of the synaptic conductances, τexc and τinh; and (4) the resting conductance and its reversal potential, grest and Erest, respectively. The membrane time constant is defined as τm = Cm/grest. The neuron state variables are the membrane potential (Vm), the excitatory conductance (gexc), and the inhibitory conductance (ginh). The synaptic conductances gexc and ginh depend on the inputs received from the excitatory and inhibitory synapses, respectively. The decision was made to model synaptic conductances as exponential:

gexc(t) = 0 for t < t0;   gexc(t) = Gexc · e^(−(t−t0)/τexc) for t ≥ t0,
ginh(t) = 0 for t < t0;   ginh(t) = Ginh · e^(−(t−t0)/τinh) for t ≥ t0,   (4.1)
Figure 4: A postsynaptic neuron receives two consecutive input spikes (top). The evolution of the synaptic conductance is shown in the middle plot. The two excitatory postsynaptic potentials (EPSPs) caused by the two input spikes are shown in the bottom plot. In the solid line plots, the synaptic conductance transient is represented by a double-exponential expression (one exponential for the rising phase, one for the decay phase). In the dashed line plot, the synaptic conductance is approximated by a single-exponential expression. The EPSPs produced with the different conductance waveforms are almost identical.
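A minimal sketch of the single-exponential conductance (equation 4.1) and the recursive total-conductance update applied on spike arrival (equation 4.2); the time constants follow Table 1, and the function names are ours:

```python
import math

TAU_EXC = 0.0005   # 0.5 ms excitatory time constant (Table 1)
TAU_INH = 0.010    # 10 ms inhibitory time constant (Table 1)

def decayed(g_total, dt, tau):
    """Equation 4.1 for t >= t0: the total synaptic conductance
    remaining dt seconds after the last update."""
    return g_total * math.exp(-dt / tau)

def on_spike(g_total, dt_since_last, peak_g, tau):
    """Equation 4.2: decay the running total, then add the peak
    conductance G of the synapse carrying the new spike, so one state
    variable summarizes the whole synaptic history."""
    return peak_g + decayed(g_total, dt_since_last, tau)
```

This recursion is what makes the single-exponential form so cheap: no list of past spike times is ever stored.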
where Gexc and Ginh represent the peak individual synaptic conductances and gexc and ginh represent the total synaptic conductance of the neuron. This exponential representation has numerous advantages. First, it is an effective representation of realistic synaptic conductances. Thus, the improvement in accuracy from the next most complex representation, a double-exponential function, is hardly worthwhile when considering the membrane potential waveform (see Figure 4). Second, the exponential conductance requires only a single state variable, because different synaptic inputs can simply be summed recursively when updating the total conductance:

gexc(tcurrent) = Gexc,j + e^(−(tcurrent − tprevious)/τexc) · gexc(tprevious).   (4.2)
(Gexc,j is the weight of synapse j; a similar relation holds for inhibitory synapses.) Most other representations would require additional state
Table 1: Excitatory and Inhibitory Synaptic Characteristics, Based on the Cerebellar Granule Cell.

Synapse      Maximum conductance     Time constant    Reversal potential
Excitatory   Gexc max: 0–7.5 nS      τexc: 0.5 ms     Eexc: 0 mV
Inhibitory   Ginh max: 0–29.8 nS     τinh: 10 ms      Einh: −80 mV

Note: The maximum conductance column is an estimation of the maximum cell conductance (summed over all synapses on the cell). The conductances of individual synapses (Gexc and Ginh) are not included in this table, as they depend on the connection strengths and are therefore provided through the network definition process and synaptic plasticity.
variables or storage of spike time lists, so the exponential representation is particularly efficient in terms of memory use. In our simulations, the synaptic parameters have been chosen to represent excitatory AMPA-receptor-mediated conductances and inhibitory GABAergic conductances of cerebellar granule cells (Silver, Colquhoun, Cull-Candy, & Edmonds, 1996; Nusser, Cull-Candy, & Farrant, 1997; Tia, Wang, Kotchabhakdi, & Vicini, 1996; Rossi & Hamann, 1998). These are summarized in Table 1. Note that different synaptic connections in different cells might have quite distinct parameters; extreme examples in the cerebellum include the climbing fiber input to Purkinje cells and the mossy fiber input to unipolar brush cell synapses. The differential equation 4.3 describes the membrane potential evolution (for t ≥ t_0) in terms of the excitatory and inhibitory conductances at t_0, combined with the resting conductance:

C_m (dV_m/dt) = g_exc(t_0) e^(-(t-t_0)/τ_exc) (E_exc - V_m) + g_inh(t_0) e^(-(t-t_0)/τ_inh) (E_inh - V_m) + G_rest (E_rest - V_m),    (4.3)
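In Python, the recursive update of expression 4.2 amounts to one decay-and-add step per incoming spike. The following is a minimal sketch; the function and variable names are ours:

```python
import math

TAU_EXC = 0.5e-3  # excitatory synaptic time constant in seconds (Table 1)

def update_gexc(g_prev, t_prev_spike, t_cur_spike, weight):
    """Recursive conductance update of expression 4.2: decay the previous
    total over the inter-spike interval, then add the weight G_exc,j of the
    synapse that just received a spike."""
    decay = math.exp(-(t_cur_spike - t_prev_spike) / TAU_EXC)
    return weight + decay * g_prev

# Two spikes, 0.5 ms (one time constant) apart, on 1 nS and 2 nS synapses:
g = update_gexc(0.0, 0.0, 0.0, 1e-9)
g = update_gexc(g, 0.0, 0.5e-3, 2e-9)   # 2 nS plus the decayed 1 nS
```

Because only the running total is stored, the cost per spike is constant regardless of how many synapses have contributed to it.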
where the conductances g_exc(t_0) and g_inh(t_0) integrate all the contributions received through individual synapses. Each time a new spike is received, the total excitatory and inhibitory conductances are updated as per expression 4.2. Equation 4.3 is amenable to numerical integration. In this way, we can calculate V_m, g_exc, g_inh, and the firing time t_f for given time intervals after the previous input spike. t_f is the time when the membrane potential would reach the firing threshold (V_th) in the absence of further stimuli (if indeed the neuron would fire).

5 Table Calculation and Optimization Strategies

The expressions given in section 3 are used to generate the lookup tables that characterize each cell type, with each cell model requiring four tables:
Event-Driven Simulation Scheme for Spiking Neural Networks
2969
Figure 5: f_g(t), the fractional conductance remaining after a time t has elapsed since the last spike was received. This is a lookup table for the normalized exponential function. The time constant of the excitatory synaptic conductance g_exc (shown here) was 0.5 ms, and that of g_inh(t), 10 ms. Since the curve exhibits no abrupt changes in the time interval [0, 0.0375] seconds, only 64 values were used.
- Conductances: g_exc(t) and g_inh(t) are one-dimensional tables that contain the fractional conductance values as functions of the time t elapsed since the previous spike.
- Firing time: t_f(V_m,t0, g_exc,t0, g_inh,t0) is a three-dimensional table representing the firing time prediction in the absence of further stimuli.
- Membrane potential: V_m(V_m,t0, g_exc,t0, g_inh,t0, t) is a four-dimensional table that stores the membrane potential as a function of the variables at the last time that the neuron state was updated and the elapsed time t.
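The one-dimensional conductance table of Figure 5 can be sketched in a few lines. This is an illustration only; the nearest-sample lookup is our assumption, since the interpolation scheme for this table is not specified here:

```python
import math

TAU_EXC = 0.5e-3   # time constant (s) of g_exc, as in Figure 5
T_MAX = 0.0375     # time span covered by the table (s)
N = 64             # number of samples

STEP = T_MAX / (N - 1)

# One-dimensional table of the fractional conductance remaining a time t
# after the last spike: f_g(t) = exp(-t/tau).
g_table = [math.exp(-i * STEP / TAU_EXC) for i in range(N)]

def g_fraction(t):
    """Nearest-sample table lookup replacing the exponential evaluation."""
    return g_table[min(N - 1, round(t / STEP))]
```

At simulation time, every exponential evaluation is thus replaced by one index computation and one array read.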
Figures 5, 6, and 7 show some examples of the contents of these tables for a model of the cerebellar granule cell with the following parameters: C_m = 2 pF, τ_exc = 0.5 ms, τ_inh = 10 ms, g_rest = 0.2 nS, E_exc = 0 mV, E_inh = -80 mV, E_rest = -70 mV, and V_th = -40 mV. The sizes of the lookup tables do not significantly affect the processing speed, assuming they reside in main memory (i.e., they are too large for the processor cache but small enough not to be swapped to disk). However, their size and structure obviously influence the accuracy with which the
Figure 6: Firing time (t_f) plotted against g_exc and initial V_m. t_f decreases as the excitatory conductance increases and as V_m,t0 approaches threshold. g_inh = 0.
neural characteristic functions are represented. The achievable table sizes (in particular, that of the membrane potential table) are limited by memory resources. However, it is possible to optimize storage requirements by adapting the way in which their various dimensions are sampled. Such optimization can be quite effective, because some of the table functions change rapidly only over small domains. We evaluate two strategies: multiresolution sampling and logarithmic compression along certain axes. Different approaches for the membrane potential function V_m(V_m,t0, g_exc,t0, g_inh,t0, t), the largest table, with respect to the inhibitory conductance (g_inh,t0) are illustrated in Figure 8. It can be seen that a logarithmic sampling strategy in the conductance dimensions is an effective choice for improving the accuracy of the representation of the neural dynamics. For the following simulations, we have used logarithmic sampling in the g_inh and g_exc dimensions of the V_m table (as illustrated in Figure 8C). Storage requirements and calculation time are dominated by the largest table, that for V_m. We shall show in the next section that a table containing about 1 million data points (dimension sizes: t = 64, g_exc = 16, g_inh = 16, V_m,t0 = 64) gives reasonable accuracy. In order to populate this table, we solve equation 4.3 numerically. This was done using a Runge-Kutta method with Richardson extrapolation and adaptive step size control. On a standard 1.8 GHz Pentium platform, calculation of this table takes about 12 s. The firing time table had the same dimensions for g_exc, g_inh, and V_m,t0. As stated previously, the individual conductance lookup tables had 64 elements each.
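A logarithmic compression of a conductance axis along the lines of Figure 8C might look like the following sketch. The lower bound G_MIN (used only to keep the logarithm defined at zero conductance) and the nearest-sample rounding are illustrative assumptions, not the authors' exact scheme:

```python
import math

# Hypothetical axis bounds: g_inh,t0 sampled over [0, 20] nS with 16 points,
# as in Figure 8. G_MIN > 0 stands in for zero conductance.
G_MIN, G_MAX, N = 1e-12, 20e-9, 16   # conductances in siemens

ratio = math.log(G_MAX / G_MIN)

# Logarithmically spaced sample points: dense near zero, where the
# membrane-potential relaxations change fastest (cf. Figure 8C).
samples = [G_MIN * math.exp(ratio * i / (N - 1)) for i in range(N)]

def log_index(g):
    """Map a conductance value to its nearest table index."""
    g = min(max(g, G_MIN), G_MAX)
    return min(N - 1, round(math.log(g / G_MIN) / ratio * (N - 1)))
```

Compared with linear sampling, the same 16 entries then resolve small conductances much more finely, at the cost of coarser spacing near the upper bound.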
Figure 7: Membrane potential V_m(V_m,t0, g_exc,t0, g_inh,t0, t) plotted as a function of (A) V_m,t0 and t (g_exc = g_inh = 0); (B) g_exc,t0 and t (g_inh = 0, V_m,t0 = E_rest = -70 mV). The zoom in the t axis of plot B highlights the fact that the membrane potential change after receiving a spike is not instantaneous.
In principle, these tables could also be based on electrophysiological recordings. Since one of the dimensions of the tables is time, the experimenter would need to set up only the initial values of g_exc, g_inh, and V_m and then record the membrane potential evolution following this
initial condition. With our standard table size, the experimenter would need to measure neuronal behavior for 64 × 16 × 16 (G_exc, G_inh, V_m) triplets. If neural behavior is recorded in sweeps of 0.5 second (at least 10 membrane time constants), only 136 minutes of recording would be required, which is feasible (see below for ways to optimize these recordings). Characterization tables of higher resolution would require longer recording times, but such tables could be built up by pooling or averaging recordings from several cells. Moreover, since the membrane potential functions are quite smooth, interpolation techniques would allow the use of smaller, easier-to-compile tables. In order to control the synaptic conductances (g_exc and g_inh), it would be necessary to use the dynamic clamp method (Prinz, Abbott, & Marder, 2004). With this technique, it is possible to replay accurately the required excitatory and inhibitory conductances. It would not be feasible to control real synaptic conductances, though prior determination of their properties would be used to design the dynamic clamp protocols. Dynamic clamp would most accurately represent synaptic conductances in small, electrically compact neurons (such as the cerebellar granule cells modeled here). Synaptic noise might distort the recordings, in which case it could be blocked pharmacologically. Any deleterious effects of dialyzing the cell via the patch pipette could be prevented by using the perforated patch technique (Horn & Marty, 1988), which increases the lifetime of the recording and ensures that the neuron maintains its physiological characteristics.

6 Simulation Accuracy

An illustrative simulation is shown in Figure 9. A single cell with the characteristics of a cerebellar granule cell receives excitatory and inhibitory spikes (upper plots). We can see how the membrane conductances change abruptly due to the presynaptic spikes.
The conductance tables emulate the excitatory AMPA-receptor-mediated and the inhibitory GABAergic synaptic inputs (the inhibitory inputs have a longer time constant). The conductance transients (excitatory and inhibitory) are also shown. The bottom plot shows a comparison between the event-driven simulation scheme, which updates the membrane potential at each input spike (these updates are
Figure 8: Each panel shows 16 V_m relaxations with different values of g_inh,t0. The sampled conductance interval is g_inh,t0 ∈ [0, 20] nS. (A) Linear approach: [0, 20] nS was sampled with a constant intersample distance. (B) Multiresolution approach: two intervals, [0, 0.35] nS and [0.4, 20] nS, with eight traces each were used. (C) Logarithmic approach: g_inh,t0 was sampled logarithmically.
Figure 9: Single neuron simulation. Excitatory and inhibitory spikes are indicated on the upper plots. Excitatory and inhibitory conductance transients are plotted in the middle plots. The bottom plot is a comparison between the neural model simulated with iterative numerical calculation (continuous trace) and the event-driven scheme, in which the membrane potential is updated only when an input spike is received (marked with an x).
marked with an x) and the results of an iterative numerical calculation (Euler method with a time step of 0.5 µs). This plot also includes the output spikes produced when the membrane potential reaches the firing threshold. The output spikes are not coincident with input events, although this is obscured by the timescale of the figure. It is important to note that the output spikes produced by the event-driven scheme are coincident with those of the Euler simulation (they superimpose in the bottom plot). Each time a neuron receives an input spike, both its membrane potential and the predicted firing time of the cell are updated. This occurs only rarely, as the spacing of the events in the event-driven simulation illustrates. It is difficult to estimate the appropriate size of the tables for a given accuracy. One of the goals of this simulation scheme is to be able to simulate accurately large populations of neurons, faithfully reproducing phenomena such as temporal coding and synchronization processes. Therefore, we are interested in reproducing the exact timing of the spikes emitted. In order to evaluate this, we need a way to quantify the difference between two spike trains. We used the van Rossum (2001) distance between two spike trains. This is related to the distance introduced by Victor and Purpura (1996, 1997), but is easier to calculate, with expression 6.1 and has a more
natural physiological interpretation (van Rossum, 2001):

D^2(f, g)_tc = (1/t_c) ∫_0^∞ [f(t) - g(t)]^2 dt,    (6.1)

f(t) = Σ_{i=1}^{M} H(t - t_i) e^(-(t-t_i)/t_c).    (6.2)
In expression 6.2, H is the Heaviside step function (H(x) = 0 if x < 0 and H(x) = 1 if x ≥ 0) and M is the number of events in the spike train. In expression 6.1, the distance D is calculated as the integral of the difference between f and g, which are spike-driven functions with exponential terms, as indicated in expression 6.2. Note that the resulting distance, and indeed its interpretation, depends on the exponential decay constant t_c in expression 6.2, whose choice is arbitrary (van Rossum, 2001). We used t_c = 10 ms. The distance also depends on the number of spikes in the trains. For this reason, we have chosen to report a crudely normalized version, D^2(f, g)_tc / M. Two trains differing only by the addition or removal of a single spike have a normalized distance of 1/(2M). Two trains differing only by the relative displacement of one spike by δt have a normalized distance of (1 - exp(-|δt|/t_c))/M. In order to evaluate the accuracy of the ED-LUT method and evaluate the influence of table size, we computed the neural model using iterative calculations and the ED-LUT method and then calculated the distance between the output spike trains produced by the two methods. Figure 10 illustrates how the accuracy of the event-driven approach depends on the synaptic weights of each spike in an example using a Poisson input spike train. We plot, as a function of synaptic weight, the normalized van Rossum distance between the output spike trains calculated with the Euler method and obtained with ED-LUT. Spikes with very low weights do not generate output events (in either the event-driven scheme or the numerical computation). Conversely, spikes with very large weights will always generate output events. Therefore, the deviation between the event-driven and the numerical approach will be low in both cases.
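Expressions 6.1 and 6.2 translate directly into a numerical routine. The following is a sketch; the integration step, integration window, and function names are our choices:

```python
import math

def van_rossum_distance(train_f, train_g, tc=0.01, dt=1e-4, t_end=1.0):
    """Numerical version of expressions 6.1 and 6.2: each spike train is
    convolved with a causal exponential kernel of time constant tc, and the
    squared difference of the two traces is integrated and divided by tc."""
    def trace(spikes, t):
        # f(t) = sum_i H(t - t_i) exp(-(t - t_i)/tc)   (expression 6.2)
        return sum(math.exp(-(t - ti) / tc) for ti in spikes if t >= ti)

    acc = 0.0
    for k in range(int(t_end / dt)):
        t = k * dt
        acc += (trace(train_f, t) - trace(train_g, t)) ** 2 * dt
    return acc / tc

# Identical trains give zero distance; adding a single spike gives ~1/2.
d_same = van_rossum_distance([0.1, 0.3], [0.1, 0.3])
d_one = van_rossum_distance([0.1], [])
```

Dividing by the spike count M, as in the crudely normalized version reported here, recovers the 1/(2M) figure quoted above for a single added spike.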
However, there is an interval of weights in which the errors are appreciable, because the membrane potential spends more time near threshold and small errors can cause the neuron to fire or not to fire erroneously. In general, however, a neuron will have a spread of synaptic weights and is unlikely to show such a pronounced error peak. Action potential variability in subthreshold states is also seen in biological recordings (Stern, Kincaid, & Wilson, 1997); therefore, a certain level of error may be acceptable at the network scale. The accuracy of the event-driven scheme depends on the sampling resolution of the different axes in the tables. We varied the resolution of various
Figure 10: The accuracy of the event-driven simulation depends on the weights of the synapses, with maximal error (normalized van Rossum distance) occurring over a small interval of critical conductances. All synaptic weights were equal.
parameters and quantified the normalized van Rossum distance of the spike trains produced, with respect to the "correct" output train obtained from an iterative solution. The axes of the V_m and t_f tables were varied together, but the conductance lookup tables were not modified. Effective synaptic weights were drawn at random from the interval [0.5, 2] nS, thus covering the critical interval illustrated in Figure 10. From Figure 11 we see that the resolution of the t and g_exc dimensions is critical, and that the accuracy of the event-driven scheme becomes more stable when table sizes are above 1000 K samples. We therefore consider the following resolution values appropriate: 16 values for g_exc,t0 and g_inh,t0, 64 values for t, and 64 values for V_m,t0. These dimensions will be used henceforth. Illustrative output spike trains for different table sizes, as well as the reference train, are shown in Figure 12. The spike trains obtained with the iterative method and the event-driven scheme are very similar for the large table with increased resolution in t. A spurious spike difference is observed in the other simulations. Doubling the resolution in dimensions other than t does not increase the accuracy in this particular simulation. We can also see that the spike train obtained with the small tables is significantly different. This is consistent with the accuracy study results shown in Figure 11.
Figure 11: The accuracy of the event-driven approach depends on the resolution of the different dimensions and therefore on the table sizes. To evaluate the influence of table size on accuracy, we ran the simulations with different table sizes. For this purpose, we chose an initial V_m table of 1000 K samples (64 values for t, 16 values for g_exc,t0, 16 values for g_inh,t0, and 64 values for V_m,t0). We then halved the size of individual dimensions, obtaining tables of 500 K samples and 250 K samples from the original table of 1000 K samples. Finally, we doubled the sampling density of individual dimensions to obtain the largest tables of 2000 K samples. For each accuracy estimation, we used an input train of 100 excitatory and 33 inhibitory spikes generating 26 output spikes (when simulated with iterative methods and very high temporal resolution).
7 Simulation Performance and Comparisons with Other Methods

With ED-LUT as described, the simulation time is essentially independent of the network size, depending principally on the rate of events that need to be processed. In other words, the simulation time depends on the network activity, as illustrated in Figure 13. This implementation allows, for instance, the simulation of 8 · 10^4 neurons in real time with an average firing rate of 10 Hz on a 1.8 GHz Pentium IV platform. This implies computation at a rate of 8 · 10^5 spikes per second, as illustrated in Figure 13. Large numbers of synaptic connections of single neurons are efficiently managed by the two-stage strategy described
Figure 12: Output spike trains for different table sizes. The first two plots represent the excitatory and inhibitory spikes. The E plots are the output events obtained with numerical iterative methods with different time step resolutions (Euler method with 0.5 µs and with 2 µs). The other plots represent the outputs generated with the event-driven scheme using different table sizes: small (S) of 500 K elements, medium (M) of 1000 K elements, and large (L) of 2000 K elements. The subscripts indicate which dimension resolution has been doubled (or halved) from the medium (M) size table.
in Figure 2. The size of the event queue is affordable, even in simulations of neurons with several thousands of synapses each. The number of synapses that the simulation engine is able to handle is limited by memory resources. Each neuron requires 60 bytes and each synapse 52 bytes. Therefore, a simulation of 8 · 10^5 neurons consumes about 46 MB, and a total of 62 · 10^6 connections consumes about 3 GB. In order to illustrate the potential of the ED-LUT method, we have compared the performance of this computation scheme with other methods (see Table 2). We have implemented three alternative strategies:
- Time-driven iterative algorithm with a fixed time step (TD-FTS). We have used the Runge-Kutta method with a fixed time step.
- Time-driven iterative algorithm with a variable time step (TD-VTS). We use the Runge-Kutta method with step doubling and the Richardson extrapolation technique (Cartwright & Piro, 1992). In this case, the computational accuracy is controlled by defining the error tolerance. In this scheme, the iterative computations are done with time step sizes that depend on the smoothness of the function. If a calculation leads to an error estimation above the error tolerance, the time step is reduced. If the error estimation is below this threshold, the time step is doubled. This scheme is expected to be fast when only smooth changes occur in the neuronal states (between input spikes). Although
Figure 13: The time taken to simulate 1 second of network activity on a Pentium IV (1.8 GHz) platform. Global activity represents the total number of spikes per second in the network. The network size did not have a significant impact on the time required. The time was almost linear with respect to network activity. The horizontal grid represents the real-time simulation limit—1 second of simulation requiring 1 second of computation.
this method is time driven, its computation speed depends on the cell input, in the sense that the simulation passes quickly through time intervals without input activity, and when an input spike is received, the computation approach reduces the time step to simulate accurately the transient behavior of the cell. A similar simulation scheme with either global or independent variable time step integration has been adopted in NEURON (Hines & Carnevale, 2001; Lytton & Hines, 2005).
- Pseudoanalytical approximation (PAA) method. In this case we have approximated the solution of the differential equations that govern the cell. In this way we can adopt an event-driven scheme similar to that proposed in Makino (2003) and Mattia and Del Giudice (2000), in which the neuron behavior is described with analytical expressions. As in Makino (2003), the membrane potential is calculated with the analytical expression, and the firing time is calculated using an iterative method based on Newton-Raphson. Since the differential equations defining the cell behavior of our model have no analytical solution, we need to approximate a four-dimensional function. Even using advanced mathematical tools, this represents a hard task. The accuracy of this approach depends significantly on how good this
Table 2: Performance Evaluation of Different Methods: Accuracy versus Computing Time Trade-Off.

Method                                            Parameter                          Normalized van     Computing
                                                                                     Rossum Distance    Time (s)
Time driven with fixed time step (TD-FTS)         Time step 56 · 10^-5 s             0.061              0.286
                                                  Time step 43 · 10^-5 s             0.033              0.363
                                                  Time step 34 · 10^-5 s             0.017              0.462
Time driven with variable time step (TD-VTS)      Error tolerance 68 · 10^-5         0.061              0.209
                                                  Error tolerance 18 · 10^-5         0.032              0.275
                                                  Error tolerance 2 · 10^-5          0.017              0.440
Pseudoanalytical approximation method (PAA)       -                                  0.131              0.142
Lookup-table-based event-driven scheme (ED-LUT)   Table size 1.05 · 10^6 samples     0.061              0.0066
                                                  Table size 6.29 · 10^6 samples     0.032              0.0074
                                                  Table size 39.32 · 10^6 samples    0.017              0.0085

Note: We have focused on the computation of a single neuron with an input spike train composed of 100 seconds of excitatory and inhibitory input spikes (average input rate 200 Hz) and 100 seconds of only excitatory input spikes (average input rate 10 Hz). Both spike trains had a standard deviation of 0.2 in the input rate and random weights (uniform distribution) in the interval [0, 0.8] nS for the excitatory inputs and [0, 1] nS for the inhibitory inputs.
approximation is. In order to illustrate the complexity of the complete cell behavior, it is worth mentioning that the expression used was composed of 15 exponential functions. As shown in Table 2, even this complex approximation does not provide sufficient accuracy, but we have nevertheless used it in order to estimate the computation time of this event-driven scheme.
- Event driven based on lookup tables (ED-LUT). This is our approach, in which the transient response of the cell and the firing time of the predicted events are computed off-line and stored in lookup tables. During the simulations, each neuronal state update is performed by taking the appropriate value from these supporting tables.
In order to determine the accuracy of the results, we obtained the "correct" output spike train using a time-driven scheme with a very short time step. The accuracy of each method was then estimated by calculating the van Rossum distance (van Rossum, 2001) between the obtained result and the "correct" spike train. In all methods except the pseudoanalytical approach, the accuracy versus computation time trade-off is managed with a single parameter (time step
in TD-FTS, error tolerance in TD-VTS, and table size in ED-LUT). We have chosen three values for these parameters that facilitate the comparison between the different methods (i.e., values that lead to similar accuracies). It is worth mentioning that all methods except the time-driven scheme with a fixed time step require a computation time that depends on the activity of the network. Table 2 illustrates several points:
- The computing time using tables (ED-LUT) of very different sizes is only slightly affected by the memory resource management units.
- The event-driven method based on analytical expressions is more than an order of magnitude slower than ED-LUT (and has greater error). This is caused by the complexity of the analytical expression and the calculation of the firing time using the membrane potential expression and applying the Newton-Raphson method.
- The ED-LUT method is about 50 times faster than the time-driven schemes (with an average input activity of 105 Hz).
8 Discussion

We have implemented an event-driven network simulation scheme based on precalculated neural characterization tables. The use of such tables offers flexibility in the design of cell models while enabling rapid simulations of large-scale networks. The main limitation of the technique arises from the size of the tables for more complex neuronal models. The aim of our method is to enable simulation of neural structures of reasonable size, based on cells whose characteristics cannot be described by simple analytical expressions. This is achieved by defining the neural dynamics using precalculated traces of their internal variables. The proposed scheme efficiently splits the computational load into two different stages:
- Off-line neuronal model characterization. This preliminary stage requires a systematic numerical calculation of the cell model in different conditions to scan its dynamics. The goal of this stage is to build up the neural characterization tables. This can be done by means of a large numerical calculation and the use of detailed neural simulators such as NEURON (Hines & Carnevale, 1997) or GENESIS (Bower & Beeman, 1998). In principle, this could even be done by compiling electrophysiological recordings (as described in section 6).
- Online event-driven simulation. The computation of the simulation process jumps from one event to the next, updating the neuron states according to precalculated neuron characterization tables and efficiently managing newly produced events.
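The online stage can be sketched as an event loop of the following shape. This is an illustration only: the leaky update and threshold test below are crude stand-ins for the V_m, conductance, and firing-time table consultations, not the actual table contents:

```python
import heapq
import math

TAU = 0.5e-3              # synaptic time constant (s)
V_REST, V_TH = -0.070, -0.040

def run_event_driven(events, t_stop):
    """Sketch of the online stage: input spikes are popped in time order and
    a neuron's state is updated only at those instants. The real scheme reads
    V_m and the predicted firing time from the precalculated tables; here
    simple closed-form decays stand in for those lookups."""
    queue = list(events)  # (time, neuron_id, weight) tuples
    heapq.heapify(queue)
    state, fired = {}, []
    while queue:
        t, nid, w = heapq.heappop(queue)
        if t > t_stop:
            break
        vm, g, t0 = state.get(nid, (V_REST, 0.0, 0.0))
        vm = V_REST + (vm - V_REST) * math.exp(-(t - t0) / 0.01)  # "V_m table"
        g = g * math.exp(-(t - t0) / TAU) + w                     # expression 4.2
        vm += g * 0.005                # toy depolarization per unit conductance
        if vm >= V_TH:                 # would be a firing-time table lookup
            fired.append((t, nid))
            vm = V_REST
        state[nid] = (vm, g, t)
    return fired

strong = run_event_driven([(0.001, 0, 10.0)], t_stop=1.0)  # crosses threshold
weak = run_event_driven([(0.001, 0, 1.0)], t_stop=1.0)     # stays subthreshold
```

Note that between events no computation is performed at all, which is what makes the cost scale with activity rather than with simulated time or network size.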
The proposed scheme represents a simulation tool that is intermediate between the very detailed simulators, such as NEURON (Hines & Carnevale, 1997) or GENESIS (Bower & Beeman, 1998), and the event-driven simulation schemes based on simple analytically described cell dynamics (Delorme et al., 1999; Delorme & Thorpe, 2003). The proposed scheme is able to capture cell dynamics from detailed simulators and accelerate the simulation of large-scale neural structures. The approach as implemented here allows the simulation of 8 · 10^4 neurons with up to 6 · 10^7 connections in real time with an average firing rate of 10 Hz on a 1.8 GHz Pentium IV platform. It is difficult to make a precise performance comparison between our method and previous event-driven methods, since they are based on different neuron models. Nevertheless, we have evaluated different computational strategies to illustrate the potential of our approach (see section 7). Mattia and Del Giudice (2000) used a cell model whose dynamics are defined by simple analytical expressions, and Reutimann et al. (2003) extended this approach by including stochastic dynamics. They avoided numerical methods by using precalculated lookup tables. In this case, provided that the event-reordering structure is kept to a reasonable size (in those approaches, large, divergent connection trees may overload the spike reordering structure), the computation speed of these schemes is likely to be comparable to our approach, since the evaluation of a simple analytical expression and a lookup table consultation consume approximately the same time. The method has been applied to simulations containing one-compartment cell models with exponential synaptic conductances (with different time constants) approximating excitatory AMPA-receptor-mediated and inhibitory GABAergic synaptic inputs. The inclusion of new mechanisms, such as voltage-dependent channels, is possible.
However, it would require the inclusion of new neural variables and thus new table dimensions. Although very complex models may eventually require lookup tables that exceed current memory capacities, we have shown how even a modest number of table dimensions can suffice to represent quite realistic neuronal models. We have also evaluated several strategies for compressing the tables in order to accommodate more complex models. Furthermore, in appendix C, the proposed table-based methodology is used to simulate the Hodgkin and Huxley (1952) model. The event-driven scheme could be used for multicompartment neuron models, although each compartment imposes a requirement for additional (one to three) dimensions in the largest lookup table. There are two ways in which multicompartment neurons may be partially or approximately represented in this scheme. After preliminary studies, using suitable sampling schemes in order to achieve reasonable accuracy with a restricted table size, we can manage lookup tables of reasonable accuracy with more than seven dimensions. Therefore, we can add two extra dimensions to enable two-compartment simulations. Quite rich cellular behavior could
be supplied by this extension. More concretely, we plan the addition of a second electrical compartment containing an inhibitory conductance. This new compartment will represent the soma of a neuron, while the original compartment (containing both excitatory and inhibitory conductances) will represent the dendrites. The somatic voltage and inhibitory conductance require two additional dimensions in the lookup table. With this model, it would be possible to separate somatic and dendritic processing, as occurs in hippocampal and cortical pyramidal cells, and implement the differential functions of somatic and dendritic inhibition (Pouille & Scanziani, 2001, 2004) (note that most neurons do not receive excitation to the soma). If individual dendrites can be active and have independent computational functions (this is currently an open question), it may be possible to approximate the dendrites and soma of a neuron as a kind of two-layer network (Poirazi, Brannon, & Mel, 2003), in which dendrites are actually represented in a manner similar to individual cells, with spikes that are routed to the soma (another cell) in the standard manner. We have embedded spike-driven synaptic plasticity mechanisms (see appendix B) in the event-driven simulation scheme. For this purpose, we have implemented learning rules approximated by exponential terms that can be computed recursively using an intermediate variable. Short-term dynamics (Mattia & Del Giudice, 2000) are also easy to include in the simulations. They are considered important in the support of internal stimulus representation (Amit, 1995; Amit & Brunel, 1997a, 1997b) and learning. In summary, we have implemented, optimized, and evaluated an event-driven network simulation scheme based on prior characterization of all neuronal dynamics, allowing simulation of large networks to proceed extremely rapidly by replacing all function evaluations with table lookups.
Although very complex neuronal models would require unreasonably large lookup tables, we have shown that careful optimization nevertheless permits quite rich cellular models to be used. We believe ED-LUT will provide a useful addition to the available simulation tools. This software package is currently being evaluated in the context of real-time simulations in four labs at different institutions. We plan to extend its use to other labs in the near future. The software is available on request from the authors. Using this method, neural systems of reasonable complexity are already being simulated in real time, in experiments related to robot control by bio-inspired processing schemes (Boucheny, Carrillo, Ros, & Coenen, 2005).

Appendix A: Event Data Structure

Complex data structures, such as balanced trees, can be used to maintain the time-ordered event queue, offering good performance for both sorted and random-order input streams. To prevent performance degradation, they optimize their structure after each insertion or deletion. However, this rebalancing process adds
more complexity and additional computational overhead (Karlton, Fuller, Scroggs, & Kaehler, 1976). Insertion and deletion of elements in these structures have a computational cost of O(log(N)), where N is the number of events in the structure. Another candidate data structure is the skip list (Pugh, 1990), but in this instance, the cost of the worst case may not be O(log(N)), because the insertion of an input stream can produce an unbalanced structure. Consequently, the search time for a new insertion may be longer than in balanced trees. This structure offers optimal performance when searching for specific elements; however, this is not needed in our computation scheme, as we need to extract only the first element (i.e., the next spike). Finally, the heap data structure (priority queue) (Aho, Hopcroft, & Ullman, 1974; Chowdhury & Kaykobad, 2001; Cormen, Leiserson, & Rivest, 1990) offers a stable computational cost of O(log(N)) for inserting and deleting elements. This is the best option, as it does not require more memory resources than the stored data: it can be implemented as an array, while balanced trees and skip lists need additional pointers or other memory resources. For all of these methods, the basic operation of inserting an event costs roughly O(log(N)), where N is the number of events in the event data structure. Clearly, the smaller the data structure, the less time such insertions will take. We explain in section 3 the two-stage event handling process we have implemented in order to minimize event heap size while allowing arbitrary divergences and latencies. Compared to a method using a single-event data structure, we would expect the event insertions to be O(log(c)) quicker, where c is the average divergence (connectivity). In Figure 14, we compare the use of one- and two-stage event handling within our simulation scheme.
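One way to realize the two-stage handling is to keep a single heap entry per source spike and expand the per-synapse delivery events lazily. The class below is a hypothetical sketch, not the authors' implementation; it assumes each source's connection list is sorted by transmission delay:

```python
import heapq

class TwoStageQueue:
    """Sketch of two-stage event handling: one heap entry per source spike
    instead of one per target synapse. Delivery events are generated lazily,
    one at a time, keeping the heap a factor of roughly c (the divergence)
    smaller and each insertion correspondingly cheaper."""

    def __init__(self, connections):
        self.connections = connections  # source -> [(delay, target, weight)]
        self.heap = []                  # entries: (delivery_time, source, index)

    def push_spike(self, t, source):
        # Insert only the earliest delivery of this spike's connection tree.
        first_delay = self.connections[source][0][0]
        heapq.heappush(self.heap, (t + first_delay, source, 0))

    def pop(self):
        t, src, i = heapq.heappop(self.heap)
        delay, target, weight = self.connections[src][i]
        if i + 1 < len(self.connections[src]):
            # Re-insert the same spike for its next (later) synapse.
            spike_t = t - delay
            next_delay = self.connections[src][i + 1][0]
            heapq.heappush(self.heap, (spike_t + next_delay, src, i + 1))
        return t, target, weight

# Example: one source spike at t = 0 fanning out to two targets.
q = TwoStageQueue({0: [(0.001, 1, 1.0), (0.002, 2, 0.5)]})
q.push_spike(0.0, 0)
first = q.pop()
second = q.pop()
```

Because only one pending entry per source spike ever sits in the heap, its size is bounded by the number of in-flight spikes rather than by spikes times divergence.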
Although event heap operations represent only part of the total computation time, there is a clear benefit to using the two-stage process. For divergences of up to 10,000, typical for recurrent cortical networks, a better than twofold improvement of total computation time is observed.

Appendix B: Spike-Timing-Dependent Synaptic Plasticity

We have implemented Hebbian-like (Hebb, 1949) spike-driven learning mechanisms (spike-timing-dependent plasticity, STDP). The implementation of such learning rules is suitable because the simulation scheme is based on the time labels of the different events. Spike-timing-dependent learning rules require comparison of the times of presynaptic spikes (propagated events) with postsynaptic spikes (firing events). In principle, this requires that a trace of the processed presynaptic spikes be kept during a time interval, so that they are accessible if postsynaptic spikes occur. Different explicit expressions can be used for the learning rule (Gerstner & Kistler, 2002). The weight change function has been approximated with exponential expressions (see equation B.1) to accommodate the experimental results of
Event-Driven Simulation Scheme for Spiking Neural Networks
Figure 14: Total computation time for processing an event (top) and size of the event heap (bottom) for one-stage (dashed plot) and two-stage (continuous plot) as functions of synaptic divergence.
Bi and Poo (1998). The computation of this learning rule by means of exponential terms facilitates a recursive implementation, avoiding the need to keep track of previous spikes:

f(s) = a_pre e^(−b_pre s)   if s < 0,
f(s) = a_post e^(b_post s)  if s > 0,   (B.1)
where s = t_post − t_pre represents the temporal delay between the postsynaptic spike and the presynaptic one. The target function (Bi & Poo, 1998) can be approximated with expression B.1 using the following parameters: a_pre = 0.935, b_pre = −0.075, a_post = −0.326, b_post = −0.036. They have been fitted using the trust-region method (Conn, Gould, & Toint, 2000). The learning rules are applied each time a cell receives a spike and each time it fires one. Each time a spike from cell i reaches a neuron j, the connection weight w_ij is changed according to expression B.2, taking into account the time since the last action potential (AP) in the postsynaptic neuron. This time is represented by s in expression B.1:

w_ij ← w_ij + Δw_ij,  where Δw_ij = w_ij f(s).   (B.2)
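The window function of equation B.1 and the per-spike update of equation B.2 can be written down directly. The parameter values are those quoted in the text; the function names are ours, not the authors'.

```python
import math

# Parameters fitted to Bi & Poo (1998), as given in the text.
A_PRE, B_PRE = 0.935, -0.075
A_POST, B_POST = -0.326, -0.036

def f(s):
    """STDP window of equation B.1; s = t_post - t_pre (ms)."""
    if s < 0:
        return A_PRE * math.exp(-B_PRE * s)
    return A_POST * math.exp(B_POST * s)

def apply_presynaptic_spike(w, s):
    """Equation B.2: w_ij <- w_ij + w_ij * f(s)."""
    return w + w * f(s)
```

Note that the update is multiplicative in the current weight, so a zero weight is never modified by this rule.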
Other postsynaptic spikes are not taken into account for the sake of simplicity, but they can be included if necessary. Each time cell j fires a spike, the learning rule of expression B.3 is applied, taking into account all the presynaptic spikes received in a certain interval:

w_ij ← w_ij + Δw_ij,  where Δw_ij = w_ij Σ_k f(s_k).   (B.3)
In order to avoid keeping track of all the presynaptic spikes during the learning window, we can rearrange the sum of expression B.3, since the learning rule can be expressed in terms of the exponentials of B.1. Each time the neuron fires a spike, the learning rule is applied at each input connection, taking into account the previous spikes received through these inputs. Therefore, each weight changes according to expression B.4:

w_ij ← w_ij + Σ_{k=1}^{N} w_ij f(s_k) = w_ij (1 + Σ_{k=1}^{N} a_pre e^(b_pre s_k)),   (B.4)
where k iterates over all N presynaptic spikes from cell i received by neuron j in a time window. This expression can be rearranged as follows:

w_ij ← w_ij + w_ij (1 + a_pre e^(b_pre s_1) (1 + e^(b_pre s_2) (... (1 + e^(b_pre s_N)) ...)))
w_ij ← w_ij + w_ij (1 + a_pre (e^(b_pre s_1) + e^(b_pre (s_1 + s_2)) + ... + e^(b_pre (s_1 + ... + s_N)))).   (B.5)

This expression can be calculated recursively, accumulating all the multiplicative terms in an intermediate variable A_ij, as indicated in expression B.6, where s is the time difference between the action potential of cell j and the last presynaptic spike received from cell i:

A_ij ← 1 + A_ij e^(b_pre s).   (B.6)
The learning rule is applied recursively as indicated in expression B.7, incorporating the last presynaptic spike. Note that the term A_ij accumulates the effect of all previous presynaptic spikes:

w_ij ← w_ij + Δw_ij,  where Δw_ij = w_ij a_pre e^(b_pre s) A_ij.   (B.7)
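Since every spike time enters only through a b_pre-weighted time difference in the exponent, the recursion of expressions B.6 and B.7 reproduces the explicit sum of expression B.4 without storing the spike history. The sketch below, with hypothetical helper names, checks that equivalence.

```python
import math

A_PRE, B_PRE = 0.935, -0.075  # parameters quoted in the text

def recursive_trace(pre_times, t_post):
    """Accumulate A_ij as in B.6 over presynaptic spikes, then apply B.7.
    Only the last spike time and the running trace A are kept."""
    A, t_prev = 1.0, pre_times[0]
    for t in pre_times[1:]:
        A = 1.0 + A * math.exp(B_PRE * (t - t_prev))  # expression B.6
        t_prev = t
    # Update factor of B.7 at the postsynaptic firing time:
    return A_PRE * math.exp(B_PRE * (t_post - t_prev)) * A

def direct_sum(pre_times, t_post):
    """Explicit sum over all presynaptic spikes, as in B.4."""
    return A_PRE * sum(math.exp(B_PRE * (t_post - t)) for t in pre_times)
```

Expanding the recursion term by term gives a_pre (e^(b Δ_1) + e^(b Δ_2) + ...), one decaying exponential per stored spike, which is exactly the direct sum.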
Table 3: Hodgkin and Huxley Model (1952).

C_m dV_m/dt = I − g_K n^4 (V_m − V_K) − g_Na m^3 h (V_m − V_Na) − g_l (V_m − V_l)
dn/dt = φ (α_n (1 − n) − β_n n);  dm/dt = φ (α_m (1 − m) − β_m m);  dh/dt = φ (α_h (1 − h) − β_h h)
α_n = (0.01 V_m + 0.1) / (exp(0.1 V_m + 1) − 1);  α_m = (0.1 V_m + 2.5) / (exp(0.1 V_m + 2.5) − 1);  α_h = 0.07 exp(0.05 V_m)
β_n = 0.125 exp(V_m / 80);  β_m = 4 exp(V_m / 18);  β_h = 1 / (exp(0.1 V_m + 3) + 1)
φ = 3^((T − 6.3)/10)
I = −g_exc (V_m − E_exc) − g_inh (V_m − E_inh)
dg_exc/dt = −g_exc / τ_exc;  dg_inh/dt = −g_inh / τ_inh

Note: The first expression describes the membrane potential evolution. The differential equations of n, m, and h govern the ionic currents. The last two expressions of the table describe the input-driven currents and synaptic conductances. The parameters are the following: C_m = 1 µF/cm², g_K = 36 mS/cm², g_Na = 120 mS/cm², g_l = 0.3 mS/cm², V_Na = −115 mV, V_K = 12 mV, V_l = −10.613 mV, and T = 6.3 °C. The parameters of the synaptic conductances are the following: E_exc = −65 mV, E_inh = 15 mV, τ_exc = 0.5 ms, and τ_inh = 10 ms.
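The lookup tables are filled by numerically integrating the equations of Table 3. A minimal forward-Euler sketch of one such integration step is shown below; it is our illustration, not the authors' code, and it uses the original Hodgkin-Huxley sign convention of the table (resting potential at V_m = 0).

```python
import math

# Parameters from Table 3 (original Hodgkin-Huxley sign convention).
CM, GK, GNA, GL = 1.0, 36.0, 120.0, 0.3
VK, VNA, VL = 12.0, -115.0, -10.613
PHI = 3 ** ((6.3 - 6.3) / 10)  # temperature factor; = 1 at T = 6.3 C

def alpha_beta(v):
    """Rate functions of Table 3 at membrane potential v (mV)."""
    an = (0.01 * v + 0.1) / (math.exp(0.1 * v + 1) - 1)
    am = (0.1 * v + 2.5) / (math.exp(0.1 * v + 2.5) - 1)
    ah = 0.07 * math.exp(0.05 * v)
    bn = 0.125 * math.exp(v / 80)
    bm = 4 * math.exp(v / 18)
    bh = 1 / (math.exp(0.1 * v + 3) + 1)
    return an, am, ah, bn, bm, bh

def euler_step(v, n, m, h, i_ext, dt):
    """One forward-Euler step of the Table 3 equations (dt in ms)."""
    an, am, ah, bn, bm, bh = alpha_beta(v)
    dv = (i_ext - GK * n**4 * (v - VK) - GNA * m**3 * h * (v - VNA)
          - GL * (v - VL)) / CM
    return (v + dt * dv,
            n + dt * PHI * (an * (1 - n) - bn * n),
            m + dt * PHI * (am * (1 - m) - bm * m),
            h + dt * PHI * (ah * (1 - h) - bh * h))
```

A quick sanity check: with the gating variables at their steady-state values for V_m = 0 and no input current, the membrane potential should stay essentially at rest (V_l was chosen precisely so that the three currents cancel there).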
Appendix C: Hodgkin and Huxley Model

In order to further validate the simulation scheme, we have also compiled the Hodgkin and Huxley (1952) model into tables and evaluated the accuracy obtained with the proposed table-based methodology. Table 3 shows the differential expressions that define the neural model. We have also included expressions for the synaptic conductances. Interfacing the explicit representation of the action potential to the event-handling architecture, which is based on idealized instantaneous action potentials, raises a couple of technical issues. The first is the choice of the precise time point during the action potential that should correspond to the idealized (propagated) event. This choice is arbitrary; we chose the peak of the action potential. The second issue arises from the interaction of this precise time point with discretization errors during updates close to the peak of the action potential. As illustrated in Figure 15, a simpleminded implementation can cause the duplication (or, by an analogous mechanism, omission) of action potentials, a significant error. This can happen when an update is triggered by an input arriving just after the peak of the action potential (and thus after the propagated event). Discretization errors can cause the prediction of the peak in the immediate future, equivalent to a very slight shift to the right of the action potential waveform. Since we have identified the propagated event with the peak, a duplicate action potential would be emitted. The frequency of such
Figure 15: Discretization errors could allow an update shortly following an action potential peak to predict the peak of the action potential in the immediate future, leading to the emission of an erroneous duplicate spike. (The errors have been magnified for illustrative purposes.)
errors depends on the discretization errors, and thus on the accuracy (size) of the lookup tables and on the frequency of inputs near the action potential peaks. These errors are likely to be quite rare, but as we now explain, they can be prevented. We now describe one possible solution (which we have implemented) to this problem (see Figure 16). We define a "firing threshold" (θ_f; in practice, −10 mV). This is quite distinct from the physiological threshold, which is much more negative. If the membrane potential exceeds θ_f, we consider that an action potential will be propagated under all conditions. We exploit this assumption by always predicting a propagated event if the membrane potential is greater than θ_f after the update, even if the "present" is after the action potential peak (in this case, emission is immediate). This procedure ensures that no action potentials are omitted, leaving the problem of duplicates. We also define a postemission time window. This extends from the time of emission (usually the action potential peak) to the time the membrane
Figure 16: Prevention of erroneous spike omission and duplication. Once the neuron exceeds θ_f, a propagated event is ensured. In this range, updates that cause the action potential peak to be skipped cause immediate emission. This prevents action potential omission. Once the action potential is emitted (usually at t_f), the time t_f_end is stored, and no predicted action potential emissions before this time are accepted. This ensures that no spikes are propagated more than once.
potential crosses another threshold voltage, θ_f_end. This time, t_f_end, is stored in the source neuron when the action potential is emitted. Whenever new inputs are processed, any predicted output event times are compared with t_f_end, and only those predicted after t_f_end are accepted. This procedure eliminates the problem of duplicate action potentials. In order to preserve the generality of this implementation, we chose to define these windows around the action potential peak by voltage-level crossings. In this way, the implementation will adapt automatically to changes of the action potential waveform (possibly resulting from parameter changes). This choice entailed the construction of an additional large lookup table. Simpler implementations based on fixed time windows could avoid this requirement. However, the cost of the extra table was easily borne.
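The acceptance test described above reduces to a single comparison against the stored window end. A minimal sketch of the bookkeeping, with hypothetical class and method names:

```python
THETA_F = -0.010  # firing threshold in volts, as quoted in the text

class SourceNeuron:
    """Sketch of the post-emission window bookkeeping (names hypothetical)."""
    def __init__(self):
        self.t_f_end = float("-inf")  # end of the post-emission window

    def emit(self, t_fire, t_f_end):
        # Called when the action potential is emitted (usually at its peak):
        # store the end of the post-emission window in the source neuron.
        self.t_f_end = t_f_end

    def accept_prediction(self, t_pred):
        # Reject any predicted emission that falls inside the post-emission
        # window: such predictions would be duplicate action potentials.
        return t_pred > self.t_f_end

n = SourceNeuron()
n.emit(t_fire=5.0, t_f_end=6.2)
ok1 = n.accept_prediction(5.9)   # inside the window -> duplicate, rejected
ok2 = n.accept_prediction(6.5)   # after the window -> accepted
```

Combined with the rule that any update leaving the membrane above θ_f triggers immediate emission, this yields neither omitted nor duplicated spikes.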
Figure 17: Single-neuron event-driven simulation of the Hodgkin and Huxley model. Note that in order to facilitate comparison with the plots of the integrate-and-fire model (see Figure 9), the variable V has been calculated using the expression V = (−V_m − V_rest)/1000 with V_rest = 65 mV.
We have compiled the model into the following tables:

- One seven-dimensional table for the membrane potential, V_m = f(t, g_exc0, g_inh0, n_0, m_0, h_0, V_0).
- Three seven-dimensional tables for the variables driving the ionic currents, n = f(t, g_exc0, g_inh0, n_0, m_0, h_0, V_0), m = f(t, g_exc0, g_inh0, n_0, m_0, h_0, V_0), and h = f(t, g_exc0, g_inh0, n_0, m_0, h_0, V_0).
- Two two-dimensional tables for the conductances, g_exc = f(t, g_exc0) and g_inh = f(t, g_inh0).
- Two six-dimensional tables for the firing prediction, t_f = f(g_exc, g_inh, n_0, m_0, h_0, V_0) and t_f_end = f(g_exc, g_inh, n_0, m_0, h_0, V_0), with θ_f = −0.01 V and θ_f_end = −0.04 V.
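The compile-then-lookup idea is easiest to see on the smallest of these tables, the two-dimensional conductance table g_exc = f(t, g_exc0), whose underlying dynamics (Table 3) is a pure exponential decay. The sketch below tabulates it on a grid and replaces the run-time exponential with a nearest-grid-point read; grid sizes and helper names are illustrative only.

```python
import math

TAU_EXC = 0.5  # ms, from Table 3

# Axis grids (sizes illustrative; cf. the per-dimension sample counts below).
T_GRID = [0.1 * i for i in range(25)]
G_GRID = [0.2 * i for i in range(6)]

# Compilation stage: tabulate g_exc(t) = g_exc0 * exp(-t / tau_exc).
TABLE = [[g0 * math.exp(-t / TAU_EXC) for g0 in G_GRID] for t in T_GRID]

def nearest(grid, x):
    """Index of the grid point closest to x (no interpolation, for brevity)."""
    return min(range(len(grid)), key=lambda i: abs(grid[i] - x))

def gexc_lookup(t, g0):
    """Run time: one table read replaces the exponential evaluation."""
    return TABLE[nearest(T_GRID, t)][nearest(G_GRID, g0)]
```

The seven-dimensional tables work the same way, only with one axis per state variable; the accuracy-versus-size trade-off then comes from the number of samples per axis.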
An accurate simulation of this model (as shown in Figure 17) requires approximately 6.15 Msamples (24.6 MB using a 4-byte floating-point data representation) for each seven-dimensional table. We use a different number of samples for each dimension: t (25), g_exc0 (6), g_inh0 (6), n_0 (8), m_0 (8), h_0 (8), and V_0 (14). The table calculation and compilation stage of this model requires approximately 4 minutes on a Pentium IV at 1.8 GHz. Figure 17 shows an illustrative simulation of the Hodgkin and Huxley model using the table-based event-driven scheme. Note that the simulation engine is able to accurately jump from one marked instant (bottom plot) to
the next one (according to either input or generated events). The membrane potential evolution shown in the bottom plot has been calculated using a numerical method (continuous plot), and the marks (placed onto the continuous trace) have been calculated using the event-driven approach. We have also included the events generated using numerical calculation (vertical continuous lines) and those generated by the table-based event-driven approach (vertical dashed lines). In order to evaluate the model accuracy, we have adopted the same methodology described in section 5. We have simulated a single cell receiving an input spike train, using numerical calculation to obtain a reference output spike train. Then we have used the proposed table-based event-driven approach to generate another output spike train. As in section 7, the accuracy measurement is obtained by calculating the van Rossum (2001) distance between the reference and the event-driven output spike trains. We have used a randomly generated test input spike train of average rate 300 Hz with a standard deviation of 0.7 and a uniform synaptic weight distribution in the interval [0.1, 1] mS/cm². Using the table sizes mentioned above, the van Rossum distance (with a time constant of 10 ms and the normalization mentioned in section 6) between the reference spike train and that obtained with the proposed method is 0.057 (in the same range as the van Rossum distances obtained when comparing other, simpler neural models in Table 2). In fact, in order to obtain a similar accuracy using Euler numerical calculation, a time step shorter than 65 µs is required.

Acknowledgments

This work has been supported by the EU projects SpikeFORCE (IST-2001-35271) and SENSOPAC (IST-028056) and the Spanish National Grant DPI-2004-07032. We thank Olivier Coenen, Mike Arnold, Egidio D'Angelo, and Christian Boucheny for their interesting suggestions during this work.

References

Aho, A. V., Hopcroft, J. E., & Ullman, J. D. (1974).
The design and analysis of computer algorithms. Reading, MA: Addison-Wesley.
Amit, D. J. (1995). The Hebbian paradigm reintegrated: Local reverberations as internal representations. Behavioral and Brain Sciences, 18, 617–657.
Amit, D. J., & Brunel, N. (1997a). Model of global spontaneous activity and local structured (learned) delay activity during delay periods in cerebral cortex. Cerebral Cortex, 7, 237–252.
Amit, D. J., & Brunel, N. (1997b). Dynamics of a recurrent network of spiking neurons before and following learning. Network, 8, 373–404.
Bi, G., & Poo, M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Boucheny, C., Carrillo, R., Ros, E., & Coenen, O. J. M. D. (2005). Real-time spiking neural network: An adaptive cerebellar model. Lecture Notes in Computer Science, 3512, 136–144.
Bower, J. M., & Beeman, D. (1998). The book of GENESIS. New York: Springer-Verlag.
Braitenberg, V., & Atwood, R. P. (1958). Morphological observations on the cerebellar cortex. J. Comp. Neurol., 109, 1–33.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, 11, 1621–1671.
Cartwright, J. H. E., & Piro, O. (1992). The dynamics of Runge-Kutta methods. Int. J. Bifurcation and Chaos, 2, 427–449.
Chowdhury, R. A., & Kaykobad, M. (2001). Sorting using heap structure. International Journal of Computer Mathematics, 77, 347–354.
Conn, A. R., Gould, N. I. M., & Toint, P. L. (2000). Trust-region methods. Philadelphia: SIAM.
Cormen, T. H., Leiserson, C. E., & Rivest, R. L. (1990). Introduction to algorithms. Cambridge, MA: MIT Press.
Delorme, A., Gautrais, J., van Rullen, R., & Thorpe, S. (1999). SpikeNET: A simulator for modelling large networks of integrate and fire neurons. Neurocomputing, 26–27, 989–996.
Delorme, A., & Thorpe, S. (2003). SpikeNET: An event-driven simulation package for modelling large networks of spiking neurons. Network: Computation in Neural Systems, 14, 613–627.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitböck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cyber., 60, 121–130.
Eckhorn, R., Reitböck, H. J., Arndt, M., & Dicke, P. (1990). Feature linking via synchronization among distributed assemblies: Simulations of results from cat visual cortex. Neural Computation, 2, 293–307.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models: Single neurons, populations, plasticity. Cambridge: Cambridge University Press.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Hines, M. L., & Carnevale, N. T. (1997). The NEURON simulation environment. Neural Computation, 9, 1179–1209.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. Journal of Physiology, 117, 500–544.
Horn, R., & Marty, A. (1988). Muscarinic activation of ionic currents measured by a new whole-cell recording method. Journal of General Physiology, 92, 145–159.
Karlton, P. L., Fuller, S. H., Scroggs, R. E., & Kaehler, E. B. (1976). Performance of height-balanced trees. Communications of the ACM, 19(1), 23–28.
Lytton, W. W., & Hines, M. L. (2005). Independent variable time-step integration of individual neurons for network simulations. Neural Computation, 17, 903–921.
Makino, T. (2003). A discrete-event neural network simulator for general neuron models. Neural Computing and Applications, 11, 210–223.
Mattia, M., & Del Giudice, P. (2000). Efficient event-driven simulation of large networks of spiking neurons and dynamical synapses. Neural Computation, 12(10), 2305–2329.
Mitchell, S. J., & Silver, R. A. (2003). Shunting inhibition modulates neuronal gain during synaptic excitation. Neuron, 38, 433–445.
Nusser, Z., Cull-Candy, S., & Farrant, M. (1997). Differences in synaptic GABA(A) receptor number underlie variation in GABA mini amplitude. Neuron, 19(3), 697–709.
Poirazi, P., Brannon, T., & Mel, B. W. (2003). Pyramidal neuron as two-layer neural network. Neuron, 37(6), 989–999.
Pouille, F., & Scanziani, M. (2001). Enforcement of temporal fidelity in pyramidal cells by somatic feed-forward inhibition. Science, 293(5532), 1159–1163.
Pouille, F., & Scanziani, M. (2004). Routing of spike series by dynamic circuits in the hippocampus. Nature, 429(6993), 717–723.
Prinz, A. A., Abbott, L. F., & Marder, E. (2004). The dynamic clamp comes of age. Trends Neurosci., 27, 218–224.
Pugh, W. (1990). Skip lists: A probabilistic alternative to balanced trees. Communications of the ACM, 33(6), 668–676.
Reutimann, J., Giugliano, M., & Fusi, S. (2003). Event-driven simulation of spiking neurons with stochastic dynamics. Neural Computation, 15, 811–830.
Rossi, D. J., & Hamann, M. (1998). Spillover-mediated transmission at inhibitory synapses promoted by high affinity alpha6 subunit GABA(A) receptors and glomerular geometry. Neuron, 20(4), 783–795.
Silver, R. A., Colquhoun, D., Cull-Candy, S. G., & Edmonds, B. (1996). Deactivation and desensitization of non-NMDA receptors in patches and the time course of EPSCs in rat cerebellar granule cells. J. Physiol., 493(1), 167–173.
Stern, E. A., Kincaid, A. E., & Wilson, C. J. (1997). Spontaneous subthreshold membrane potential fluctuations and action potential variability of rat corticostriatal and striatal neurons in vivo. J. Neurophysiol., 77, 1697–1715.
Tia, S., Wang, J. F., Kotchabhakdi, N., & Vicini, S. (1996). Developmental changes of inhibitory synaptic currents in cerebellar granule neurons: Role of GABA(A) receptor alpha 6 subunit. J. Neurosci., 16(11), 3630–3640.
van Rossum, M. C. W. (2001). A novel spike distance. Neural Computation, 13, 751–763.
Victor, J. D., & Purpura, K. P. (1996). Nature and precision of temporal coding in visual cortex: A metric-space analysis. J. Neurophysiol., 76, 1310–1326.
Victor, J. D., & Purpura, K. P. (1997). Metric-space analysis of spike trains: Theory, algorithms and application. Network: Computation in Neural Systems, 8, 127–164.
Watts, L. (1994). Event-driven simulation of networks of spiking neurons. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 927–934). San Mateo, CA: Morgan Kaufmann.
Received November 30, 2004; accepted April 28, 2006.
LETTER
Communicated by Pekka Orponen
On the Computational Power of Threshold Circuits with Sparse Activity Kei Uchizawa [email protected] Graduate School of Information Sciences, Tohoku University, Sendai 980-8579, Japan
Rodney Douglas [email protected] Institute of Neuroinformatics, ETH Zürich, CH-8057 Zürich, Switzerland
Wolfgang Maass [email protected] Institute for Theoretical Computer Science, Technische Universitaet Graz, A-8010 Graz, Austria
Circuits composed of threshold gates (McCulloch-Pitts neurons, or perceptrons) are simplified models of neural circuits with the advantage that they are theoretically more tractable than their biological counterparts. However, when such threshold circuits are designed to perform a specific computational task, they usually differ in one important respect from computations in the brain: they require very high activity. On average every second threshold gate fires (sets a 1 as output) during a computation. By contrast, the activity of neurons in the brain is much sparser, with only about 1% of neurons firing. This mismatch between threshold and neuronal circuits is due to the particular complexity measures (circuit size and circuit depth) that have been minimized in previous threshold circuit constructions. In this letter, we investigate a new complexity measure for threshold circuits, energy complexity, whose minimization yields computations with sparse activity. We prove that all computations by threshold circuits of polynomial size with entropy O(log n) can be restructured so that their energy complexity is reduced to a level near the entropy of circuit states. This entropy of circuit states is a novel circuit complexity measure, which is of interest not only in the context of threshold circuits but for circuit complexity in general. As an example of how this measure can be applied, we show that any polynomial size threshold circuit with entropy O(log n) can be simulated by a polynomial size threshold circuit of depth 3. Our results demonstrate that the structure of circuits that result from a minimization of their energy complexity is quite different from the structure that results from a minimization of previously considered Neural Computation 18, 2994–3008 (2006)
© 2006 Massachusetts Institute of Technology
complexity measures, and potentially closer to the structure of neural circuits in the nervous system. In particular, different pathways are activated in these circuits for different classes of inputs. This letter shows that such circuits with sparse activity have a surprisingly large computational power.

1 Introduction

The active outputs of neurons are stereotypical electrical pulses (action potentials, or spikes). The stereotypical form of these spikes suggests that the output of neurons is analogous to the 1 of a threshold gate. In fact, historically and even currently, threshold circuits are commonly viewed as abstract computational models for circuits of biological neurons. Nevertheless, it has long been recognized by neuroscientists that neurons are generally silent and that information processing in the brain is usually achieved with a sparse distribution of neural firing.1 One reason for this sparse activation may be metabolic cost. For example, a recent biological study on the energy cost of cortical computation (Lennie, 2003) concludes that "the cost of a single spike is high, and this limits, possibly to fewer than 1%, the number of neurons that can be substantially active concurrently." The metabolic cost of the active (1) state of a neuron is very asymmetrical. The production of a spike consumes a substantial amount of energy (about 2.4 × 10^9 molecules of ATP according to Lennie, 2003), whereas the energy cost of the no-spike rest state is substantially less. In contrast to neuronal circuits, computations in feedforward threshold circuits (and many other circuit models of digital computation) have the property that a large portion, usually around 50%, of the gates in the circuit output a 1 during any computation.
Common abstract measures for the energy consumption of electronic circuits treat the cost of the two output states 0 and 1 of a gate symmetrically and focus instead on the required number of switchings between these two states (see Kissin, 1991, as well as Reif & Tyagi, 1990). Exceptions are Weinzweig (1961), Kasim-Zade (1992), and Cheremisin (2003), which provide Shannon-type results for the number of gates that output a 1 in Boolean circuits consisting of gates with bounded fan-in. Circuits of threshold gates (= linear threshold gates = McCulloch-Pitts neurons) are an important class of circuits that are frequently used as simplified models for computations in neural circuits (Minsky & Papert, 1988; Roychowdhury, Siu, & Orlitsky, 1994; Parberry, 1994; Siu, Roychowdhury, & Kailath, 1995). Their output is binary, like that of a biological neuron (which outputs a spike or no spike), but they work in discrete time. In this letter, we consider how investigations of such
1 According to recent data (Margrie, Brecht, & Sakmann, 2002) from whole cell recordings in awake animals, the spontaneous firing rates are on average below 1 Hz.
abstract threshold circuits can be reconciled with the actual activity characteristics of biological neural networks. In section 2 we give a precise definition of threshold circuits and also define their energy complexity, whose minimization yields threshold circuits that carry out computations with sparse activity: on average, few gates output a 1 during a computation. In section 2 we also introduce another novel complexity measure: the entropy of a computation. This measure is interesting for many types of circuits beyond the threshold circuits discussed in this letter. It measures the total number of different patterns of gate states that arise during computations on different circuit inputs. We show in section 3 that the entropy of the circuit states defines a coarse lower bound for its energy complexity. This result is relevant for any attempt to simulate a given threshold circuit by another threshold circuit with lower energy complexity, since the entropy of a circuit is directly linked to the algorithm that it implements. Therefore, it is unlikely that there exists a general method permitting any given circuit to be simulated by one with smaller entropy. In this sense, the entropy of a circuit defines a hard lower bound for any general method that aims to simulate any given threshold circuit using a circuit with lower energy complexity. However, we will prove in section 3 that there exists a general method that reduces (if this entropy is O(log n)) the energy complexity of a circuit to a level near the entropy of the circuit. Since the entropy of a circuit is a complexity measure that is interesting in its own right, we also offer in section 4 a first result on the computational power of threshold circuits with low entropy. Some open problems related to the new concepts introduced in this letter are listed in section 5.

2 Definitions

A threshold gate g (with weights w_1, ..., w_n ∈ R and threshold t ∈ R) gives as output for any input X = (x_1, ..., x_n) ∈ R^n

g(X) = sign(Σ_{i=1}^n w_i x_i − t) = 1 if Σ_{i=1}^n w_i x_i ≥ t, and 0 otherwise,
where we set sign(z) = 1 if z ≥ 0 and sign(z) = 0 if z < 0. For a threshold gate g_i within a feedforward circuit C that receives X = (x_1, ..., x_n) as circuit input, we write g_i(X) for the output that the gate g_i gives for this circuit input X (although the actual input to gate g_i during this computation will in general consist of just some of the variables x_i from X and, in addition or even exclusively, of outputs of other gates in the circuit C). We define the energy complexity of a circuit C consisting of threshold gates g_1, ..., g_m as the expected number of 1's that occur in a computation
for some given distribution Q of circuit inputs X, that is,

EC_Q(C) := E[Σ_{i=1}^m g_i(X)],

where the expectation is evaluated with regard to the distribution Q over X ∈ R^n (or X ∈ {0,1}^n). Thus, for the case where Q is the uniform distribution over {0,1}^n, we have, for example,

EC_uniform(C) := (1/2^n) Σ_{X ∈ {0,1}^n} Σ_{i=1}^m g_i(X).
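On small circuits these definitions can be evaluated by brute force. The sketch below (helper names ours, not from the letter) builds a threshold gate and computes EC_uniform for a toy two-gate circuit consisting of an AND gate and an OR gate.

```python
from itertools import product

def gate(weights, t, x):
    """Threshold gate: outputs 1 iff sum_i w_i x_i >= t."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= t else 0

def ec_uniform(circuit, n):
    """Average number of gates outputting 1 over uniform inputs from {0,1}^n."""
    total = 0
    for x in product([0, 1], repeat=n):
        total += sum(g(x) for g in circuit)
    return total / 2 ** n

# Toy example: AND and OR of two inputs, each as a threshold gate.
g_and = lambda x: gate((1, 1), 2, x)
g_or = lambda x: gate((1, 1), 1, x)
ec = ec_uniform([g_and, g_or], 2)  # AND fires on 1 input, OR on 3 -> (1+3)/4
```

Here EC_uniform = 1.0: on average one of the two gates outputs a 1, matching the roughly 50%-active behavior the introduction attributes to typical threshold circuits.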
In some cases it is also of interest to consider the maximal energy consumption of a circuit for any input X, defined by

EC_max(C) := max{ Σ_{i=1}^m g_i(X) : X ∈ R^n }.

We define the entropy of a (feedforward) circuit C by

H_Q(C) := − Σ_{A ∈ {0,1}^m} P_C(A) · log P_C(A),
where P_C(A) is the probability that the internal gates g_1, ..., g_m of the circuit C assume the state A ∈ {0,1}^m during a computation of circuit C (for some given distribution Q of circuit inputs X ∈ R^n). We often write H_max(C) for the largest possible value that H_Q(C) can assume for any distribution on a given set of circuit states A. If MAX(C) is defined as the total number of different circuit states that circuit C assumes for different inputs X ∈ {0,1}^n, then one has H_Q(C) = H_max(C) if Q is such that these MAX(C) circuit states all occur with the same probability, and H_max(C) is then equal to log_2 MAX(C). We write size(C) for the number m of gates in a circuit C and depth(C) for the length of the longest path in C from an input to its output node (which is always assumed to be the node g_m).

3 Construction of Threshold Circuits with Sparse Activity

It is obvious that the number of 1's in a computation limits the number of states that the circuit can assume:

H_Q(C) ≤ log(MAX(C)) ≤ log Σ_{j=0}^{EC_max(C)} (size(C) choose j) ≤ log(size(C)^{EC_max(C)}) = EC_max(C) · log size(C)   (3.1)
(for sufficiently large values of EC_max(C) and size(C); log always stands for log_2 in this letter; the first inequality follows from the previously discussed equality H_max(C) = log_2 MAX(C)). Hence,

EC_max(C) ≥ H_Q(C) / log size(C).   (3.2)

In fact, this argument shows that

EC_max(C) ≥ H_max(C) / log size(C).   (3.3)
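The chain of inequalities above can be verified numerically on a tiny circuit. The sketch below (our construction, with hypothetical helper names) takes a two-gate circuit consisting of an AND gate and an OR gate over two inputs, enumerates its circuit states to obtain H_max(C) = log_2 MAX(C), and checks it against the right-hand side of equation 3.1.

```python
import math
from itertools import product

def gate(weights, t, x):
    """Threshold gate: outputs 1 iff sum_i w_i x_i >= t."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= t else 0

circuit = [lambda x: gate((1, 1), 2, x),   # AND
           lambda x: gate((1, 1), 1, x)]   # OR

# Enumerate the circuit states A in {0,1}^m reached over all inputs.
states = {tuple(g(x) for g in circuit) for x in product([0, 1], repeat=2)}
h_max = math.log2(len(states))                     # H_max(C) = log2 MAX(C)
ec_max = max(sum(g(x) for g in circuit) for x in product([0, 1], repeat=2))
bound = ec_max * math.log2(len(circuit))           # right side of eq. 3.1
```

This circuit reaches three states, (0,0), (0,1), and (1,1), so H_max(C) = log_2 3 ≈ 1.58, while EC_max(C) · log size(C) = 2 · 1 = 2, consistent with inequality 3.3.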
Every Boolean function f: {0,1}^n → {0,1} can be computed by a threshold circuit C of depth 2 that represents its disjunctive normal form in such a way that for every circuit input X, at most a single gate on level 1 outputs a 1. This circuit C has the property that EC_max(C) = 2. Furthermore, it is an easy exercise to construct a distribution Q such that this circuit has H_Q(C) = log(size(C) − 1). Hence it is in some cases possible to achieve EC_max(C) < H_Q(C), and the factor log size(C) in equations 3.2 and 3.3 cannot be eliminated or significantly reduced. Threshold circuits that represent a Boolean function f in its disjunctive normal form allow us to compute any Boolean function with a circuit C that achieves EC_max(C) = 2. However, these circuits C in general have exponential size in n. Therefore, the key question is whether one can also construct polynomial size circuits C with small EC_Q or EC_max. Because of the a priori bounds of equations 3.2 and 3.3, this is possible only for those functions f that can be computed with a low entropy of circuit states. The following results show that the existence of a circuit C that computes f with H_max(C) = O(log n) is sufficient to guarantee the existence of a circuit that computes f with low energy complexity.

Theorem 1. Assume that a Boolean function f: {0,1}^n → {0,1} can be computed by some polynomial size threshold circuit C with H_max(C) = O(log n). Then f can also be computed by some polynomial size threshold circuit C′ with

EC_max(C′) ≤ H_max(C) + 1 = O(log n).   (3.4)
Furthermore, if Q is any distribution of inputs X ∈ {0,1}^n, then one can construct a polynomial size threshold circuit C′ with

EC_Q(C′) ≤ H_Q(C)/2 + 2 = O(log n).   (3.5)
Remark 1. The subsequent proof shows that in fact the following more general statements hold for any function f and any distribution Q:
If f can be computed by some arbitrary (feedforward) threshold circuit C, then f can also be computed by a threshold circuit C′ with size(C′) ≤ 2^{Hmax(C)}, depth(C′) ≤ size(C) + 1, Hmax(C′) ≤ Hmax(C), and ECmax(C′) ≤ Hmax(C) + 1. Furthermore, f can also be computed by a threshold circuit C′ with size(C′) ≤ 2^{Hmax(C)+1}, depth(C′) ≤ size(C) + 1, HQ(C′) ≤ HQ(C), and ECQ(C′) ≤ HQ(C)/2 + 2.

Remark 2. The assumption Hmax(C) = O(log n) is satisfied by standard constructions of threshold circuits for many commonly considered functions f. Examples are all symmetric functions (hence, in particular, PARITY of n bits), COMPARISON of binary numbers, and BINARY ADDRESSING (routing), where the first k input bits represent an address for one of the 2^k subsequent input bits (thus, n = k + 2^k). In fact, to the best of our knowledge, no function is known that can be computed by polynomial size circuits but not by polynomial size circuits C with Hmax(C) = O(log n).

Proof of Theorem 1. The proof is split up into a number of lemmas (see lemmas 1 to 6). The idea is first to simulate, in lemma 1, the given circuit C by a threshold decision tree (i.e., by a decision tree T with threshold gates at the nodes; see definition 1), which has at most 2^{Hmax(C)} leaves. This threshold decision tree is then restructured in lemma 3 in such a manner that every path in the tree from the root to a leaf takes the right branch at an internal node at most log(# of leaves) times. Hence, the path can take the right branch at most Hmax(C) times. Obviously such an asymmetrical cost measure is of interest when one wants to minimize an asymmetrical complexity measure such as EC, which assigns different costs to gate outputs 0 and 1.
Finally we show in lemma 5 that the computations of the resulting threshold decision tree can be simulated by a threshold circuit where some gate outputs a 1 whenever the simulated path in the decision tree moves into the right subtree at an internal node of the tree. The proof of lemma 5 has to take into account that the control structures of decision trees and circuits are quite different: a gate in a decision tree is activated only when the computation path happens to arrive at the corresponding node of the decision tree, but a gate in a threshold circuit is activated in any computation of that circuit. Hence, a threshold decision tree with few threshold gates that output 1 does not automatically yield a threshold circuit with low energy complexity. However, we show that all gates in the simulating threshold circuit that do not correspond to a node in the decision tree where the right branch is chosen receive an additional input with a strongly negative weight (see lemma 4), so that they output a 0 when they get activated. Finally, we show in lemma 6 that the threshold decision tree can be restructured alternatively, so that the average number of times when a computation path takes the right subtree at a node remains small (instead of the maximal number of taking the right subtree). This maneuver yields the proof of the second part of the claim of theorem 1.
3000
K. Uchizawa, R. Douglas, and W. Maass
Definition 1. A threshold decision tree T (called a linear decision tree in Gröger and Turán (1991)) is a binary tree in which each internal node has two children, a left and a right one, and is labeled by a threshold gate that is applied to the input X ∈ {0, 1}^n for the tree. All the leaves of threshold decision trees are labeled by 0 or 1. To compute the output of a threshold decision tree T on an input X, we apply the following procedure from the root until reaching a leaf: we go left if the gate at a node outputs 0; otherwise, we go right. If we reach a leaf labeled by l ∈ {0, 1}, then l is the output of T for input X.

Note that the threshold gates in a threshold decision tree are applied only to input variables from the external input X ∈ {0, 1}^n, not to outputs of preceding threshold gates. Hence, it is obvious that computations in threshold decision trees have a quite different structure from computations in threshold circuits, although both models use the same type of computational operation at each node. The depth of a threshold decision tree is the maximum number of nodes on a path from the root to a leaf. We assign binary strings to nodes of T in the usual manner:

- ĝ_ε denotes the root of the tree (where ε is the empty string).
- For a binary string s, let ĝ_{s0} and ĝ_{s1} be the left and right child of the node with label ĝ_s.
For example, the ancestors of a node ĝ_{1011} are ĝ_ε, ĝ_1, ĝ_{10}, and ĝ_{101}. Let S_T be the set of all binary strings s that occur as indices of nodes ĝ_s in a threshold decision tree T. Then all the descendants of node ĝ_s in T can be represented as ĝ_{s∗} for s∗ ∈ S_T. The given threshold circuit C can be simulated in the following way by a threshold decision tree:

Lemma 1. Let C be a threshold circuit computing a function f : {0, 1}^n → {0, 1} with m gates. Then one can construct a threshold decision tree T with at most 2^{Hmax(C)} leaves and depth(T) ≤ m, which computes the same function f.

Proof. Assume that C consists of m gates. We number the gates g_1, . . . , g_m of C in topological order. Since g_i receives the circuit input X and the outputs of g_j only for j < i as its inputs, we can express the output g_i(X) of g_i for circuit input X = x_1, . . . , x_n as

$$g_i(X) = \mathrm{sign}\Bigl(\sum_{j=1}^{n} w_{ij}\, x_j + \sum_{j=1}^{i-1} w^{g}_{ij}\, g_j(X) + t_i\Bigr),$$

where w^{g}_{ij} is the weight that g_i applies to the output of g_j in circuit C.
Let S be the set of all binary strings of length up to m − 1. We define threshold gates ĝ_s : X → {0, 1} for s ∈ S by

$$\hat{g}_s(X) = \mathrm{sign}\Bigl(\sum_{j=1}^{n} w_{|s|+1,\,j}\, x_j + t_s\Bigr) \quad\text{with}\quad t_s = \sum_{j=1}^{|s|} w^{g}_{|s|+1,\,j}\, s_j + t_{|s|+1},$$
where s_j is the jth bit of string s and |s| is the length of s. Obviously these gates ĝ_s are variations of gate g_{|s|+1} with different built-in assumptions s about the outputs of the preceding gates. Let T be the threshold decision tree consisting of the gates ĝ_s for s ∈ S. That is, gate ĝ_ε = g_1 is placed at the root of T. We let the left child of ĝ_s be ĝ_{s0} and the right child of ĝ_s be ĝ_{s1}. We let each ĝ_s with |s| = m − 1 have a leaf labeled 0 as left child and a leaf labeled 1 as right child. Since ĝ_s computes the same function as g_{|s|+1} if the preceding gates g_i output s_i for 1 ≤ i ≤ |s|, T computes the same function f as C. We then remove all leaves from T whose associated paths correspond to circuit states A ∈ {0, 1}^m that do not occur in C for any circuit input X ∈ {0, 1}^n. This reduces the number of leaves in T to at most 2^{Hmax(C)}. Finally, we iteratively remove all nodes without children, and replace every node below which there exists just a single leaf by a leaf. In this way, we arrive again at a binary tree.

We now introduce a cost measure cost(T) for trees T that, like the energy complexity for circuits, measures for threshold decision trees how often a threshold gate outputs a 1 during a computation:

Definition 2. We denote by cost(T) the maximum number of times that a path from the root to a leaf in a binary tree T goes to the right. If T is a leaf, then cost(T) = 0.

We will show in lemma 5 that one can simulate any threshold decision tree T′ by a threshold circuit C_{T′} with ECmax(C_{T′}) ≤ cost(T′) + 1. Hence, it suffices for the proof of theorem 1 to simulate the threshold decision tree T resulting from lemma 1 by another threshold decision tree T′ for which cost(T′) is small. This is done in lemma 3, where we will construct a tree T′ that reduces cost(T′) down to another cost measure, rank(T). This measure rank(T) always has a value ≤ log(# of leaves of T) according to lemma 2; hence, rank(T) ≤ Hmax(C) for the tree T constructed in lemma 1.
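The evaluation procedure of definition 1 and the cost measure of definition 2 can be sketched directly. The tree representation, gate convention (output 1 iff the weighted sum reaches the threshold), and XOR example below are illustrative assumptions of this sketch, not the paper's.

```python
class Leaf:
    def __init__(self, label):          # label in {0, 1}
        self.label = label

class Node:
    """Internal node of a threshold decision tree: a threshold gate applied
    to the external input X, with a left (gate = 0) and right (gate = 1) child."""
    def __init__(self, weights, t, left, right):
        self.weights, self.t, self.left, self.right = weights, t, left, right

def evaluate(tree, x):
    """Definition 1: go left on gate output 0, right on output 1."""
    while isinstance(tree, Node):
        fires = sum(w * xi for w, xi in zip(tree.weights, x)) >= tree.t
        tree = tree.right if fires else tree.left
    return tree.label

def cost(tree):
    """Definition 2: maximum number of right turns on any root-leaf path."""
    if isinstance(tree, Leaf):
        return 0
    return max(cost(tree.left), 1 + cost(tree.right))

# XOR of two bits as a depth-2 threshold decision tree (hypothetical example)
t = Node((1, 1), 1,                          # fires iff x1 + x2 >= 1
         Leaf(0),                            # reached for x = (0, 0)
         Node((1, 1), 2, Leaf(1), Leaf(0)))  # fires iff x1 + x2 >= 2
print([evaluate(t, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

Note that cost(t) = 2 here; lemma 3 shows how such a tree can be restructured so that the cost drops to its rank.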
Definition 3. The rank of a binary tree T is defined inductively as follows:

- If T is a leaf, then rank(T) = 0.
- If T has subtrees T_l and T_r, then

$$\mathrm{rank}(T) = \begin{cases} \mathrm{rank}(T_l) & \text{if } \mathrm{rank}(T_l) > \mathrm{rank}(T_r), \\ \mathrm{rank}(T_r) + 1 & \text{if } \mathrm{rank}(T_l) = \mathrm{rank}(T_r), \\ \mathrm{rank}(T_r) & \text{if } \mathrm{rank}(T_l) < \mathrm{rank}(T_r). \end{cases}$$

Lemma 2. Let T be any binary tree. Then rank(T) ≤ log(# of leaves of T).
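A minimal sketch of the rank computation of definition 3, together with a spot check of the bound of lemma 2; the nested-tuple tree encoding is an assumption of this sketch.

```python
from math import log2

def rank(t):
    """Definition 3: t is either 'leaf' or a pair (left_subtree, right_subtree)."""
    if t == 'leaf':
        return 0
    rl, rr = rank(t[0]), rank(t[1])
    if rl > rr:
        return rl
    if rl == rr:
        return rr + 1
    return rr

def leaves(t):
    return 1 if t == 'leaf' else leaves(t[0]) + leaves(t[1])

# Lemma 2 on a small assortment of trees: a degenerate "caterpillar"
# (4 leaves, rank 1) and complete trees of depth 2 and 3.
comb = ('leaf', ('leaf', ('leaf', 'leaf')))
full2 = (('leaf', 'leaf'), ('leaf', 'leaf'))
for tree in [comb, full2, (full2, full2)]:
    assert rank(tree) <= log2(leaves(tree))
```

The rank only increases when both subtrees tie, which is exactly why it is bounded by the logarithm of the number of leaves.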
Proof. We proceed by induction on the depth of T. If depth(T) = 0, then T consists of a single node; hence, rank(T) = 0 = log 1 = log(# of leaves of T). Assume now that depth(T) > 0, and let T_l and T_r be the left and right subtree of the root of T.

Case 1: rank(T_l) ≠ rank(T_r). Then the claim follows immediately from the induction hypothesis, since rank(T) = rank(T_l) or rank(T) = rank(T_r).

Case 2: rank(T_l) = rank(T_r). Assume without loss of generality that (# of leaves of T_l) ≤ (# of leaves of T_r). Then the induction hypothesis implies that rank(T) = rank(T_l) + 1 ≤ log(# of leaves of T_l) + 1 = log(2 · (# of leaves of T_l)) ≤ log(# of leaves of T).

Lemma 3. Let T be a threshold decision tree computing a function f : {0, 1}^n → {0, 1}. Then f can also be computed by a threshold decision tree T′ that has the same depth and the same number of leaves as T and satisfies cost(T′) = rank(T).

Proof. Let T consist of gates g_s for s ∈ S_T. We define T^s as the subtree of T whose root is g_s. Let T^s_l (respectively, T^s_r) denote the left (right) subtree below the root of T^s. We modify T inductively by the following procedure, starting at the nodes g_s of largest depth. If cost(T^s_l) < cost(T^s_r), we replace g_s by its complement and swap the left subtree and the right subtree. The complement of g_s is another threshold gate g that outputs 1 if and only if g_s outputs 0. Such a gate g exists since Σ_{i=1}^n w_i x_i < t ⇔ Σ_{i=1}^n (−w_i) x_i > −t ⇔ Σ_{i=1}^n (−w_i) x_i ≥ t′ for another threshold t′ (which always exists if the x_i assume only finitely many values). Let T̂^s be the threshold decision tree produced from T^s by this procedure. By construction, it has the following properties:
- If the children of g_s both are leaves, then we have cost(T̂^s) = 1.
- Otherwise,

$$\mathrm{cost}(\hat{T}^s) = \begin{cases} \mathrm{cost}(\hat{T}^s_l) & \text{if } \mathrm{cost}(\hat{T}^s_l) > \mathrm{cost}(\hat{T}^s_r), \\ \mathrm{cost}(\hat{T}^s_r) + 1 & \text{if } \mathrm{cost}(\hat{T}^s_l) = \mathrm{cost}(\hat{T}^s_r), \\ \mathrm{cost}(\hat{T}^s_r) & \text{if } \mathrm{cost}(\hat{T}^s_l) < \mathrm{cost}(\hat{T}^s_r), \end{cases}$$
where T̂^s has subtrees T̂^s_l and T̂^s_r. Since this definition coincides with the definition of the rank, we have constructed a tree T′ with cost(T′) = rank(T). The procedure preserves the function that is computed, the depth of the tree, and the number of leaves.

We now show that the threshold decision tree that was constructed in lemma 3 can be simulated by a threshold circuit with low energy complexity. As preparation, we first observe in lemma 4 that one can "veto" any threshold gate g through some extra input. This will be used in lemma 5 in order to avoid the event that gates in the simulating circuit that correspond to gates on an inactive path of the simulated threshold decision tree increase the energy complexity of the resulting circuit.

Lemma 4. Let g(x_1, . . . , x_n) = sign(Σ_{i=1}^n w_i x_i − t) be a threshold gate. Then one can construct a threshold gate g′ using an additional input x_{n+1} with the following property:
$$g'(x_1, \ldots, x_n, x_{n+1}) = \begin{cases} 0 & \text{if } x_{n+1} = 1, \\ g(x_1, \ldots, x_n) & \text{if } x_{n+1} = 0. \end{cases}$$
Proof. We set w_{n+1} := −(Σ_{i=1}^n |w_i| + |t| + 1). Since apart from that g′ uses the same weights and threshold as g, it is obvious that the resulting gate g′ has the desired property.

Lemma 5. Let T be a threshold decision tree that consists of k internal nodes and computes a function f. Then one can construct a threshold circuit C_T with ECmax(C_T) ≤ cost(T) + 1 that computes the same function f. In addition, C_T satisfies depth(C_T) ≤ depth(T) + 1 and size(C_T) ≤ k + 1.

Proof. We can assume without loss of generality that every leaf with label 1 in T is the right child of its parent (if this is not the case, swap this leaf with the right subtree of its parent, and replace the threshold gate at the parent node, as in the proof of lemma 3, by another threshold gate that always outputs the negation of the former gate; this procedure increases neither the cost of the tree nor its depth or number of internal nodes). Now let

$$g_s(X) = \mathrm{sign}\Bigl(\sum_{j=1}^{n} w^{s}_{j}\, x_j - t_s\Bigr)$$
be the threshold gate in T at the node with label s ∈ S_T. Let w^s_{n+1} be the weight constructed in lemma 4 for an additional input that can force gate g_s to output 0. Set W := max{|w^s_{n+1}| : s ∈ S_T}. The threshold circuit C_T that simulates T has a gate g′_s for every gate g_s in T and, in addition, an OR-gate that receives inputs from all gates g′_s such that g_s has a leaf with label 1. (Because of our assumption, this leaf is reached whenever the gate g_s at node s ∈ S_T gets activated and g_s outputs a 1.) We make sure that any gate g′_s in C_T outputs 1 for a circuit input X if and only if the gate g_s in T gets activated for this input X and outputs 1. Therefore, a gate g′_s in C_T can output 1 only if it corresponds to a gate g_s in T with output 1 that lies on the single path of T that the circuit input X activates. Hence, this construction automatically ensures that ECmax(C_T) ≤ cost(T) + 1 (where the +1 arises from the additional OR-gate in C_T). In order to achieve this objective, g′_s gets additional inputs from all gates g′_s̃ in C_T such that s̃ is a proper prefix of s. The weight for the additional input from g′_s̃ is −W if s̃0 is a prefix of s, and W otherwise. In addition, the threshold of g′_s is increased by l_s · W, where l_s is the number of 1s in the binary string s. In this way, g′_s can output 1 if and only if g_s outputs 1 for the present circuit input X, all gates g_s̃ of T for which g_s lies in the right subtree below g_s̃ output 1, and all gates g_s̃ of T for which g_s lies in the left subtree below g_s̃ output 0. Thus, g′_s outputs 1 if and only if the path leading to gate g_s gets activated in T and g_s outputs 1.
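Lemma 4's veto construction, which the proof above relies on, is easy to check numerically. This sketch uses an arbitrary example gate and assumes the convention that a gate outputs 1 iff its weighted sum reaches the threshold.

```python
from itertools import product

def veto_gate(weights, t):
    """Lemma 4: extend a threshold gate g(x) = [sum w_i x_i >= t] with an
    extra input x_{n+1} whose weight is so strongly negative that
    x_{n+1} = 1 forces output 0, while x_{n+1} = 0 leaves g unchanged."""
    w_veto = -(sum(abs(w) for w in weights) + abs(t) + 1)
    def g_prime(x, veto):
        s = sum(w * xi for w, xi in zip(weights, x)) + w_veto * veto
        return 1 if s >= t else 0
    return g_prime

# Exhaustive check on an arbitrary 3-input gate with weights (2, -1, 3), t = 1
g = veto_gate((2, -1, 3), 1)
for x in product((0, 1), repeat=3):
    base = 1 if 2 * x[0] - x[1] + 3 * x[2] >= 1 else 0
    assert g(x, 0) == base and g(x, 1) == 0
```

The magnitude Σ|w_i| + |t| + 1 guarantees that even the largest achievable positive sum cannot overcome the veto input.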
The proof of the first claim of theorem 1 now follows immediately from lemmas 1 to 5. Note that the number k of internal nodes in a binary tree is equal to (# of leaves) − 1; hence, k ≤ 2^{Hmax(C)} − 1 in the case of the decision tree T resulting from the applications of lemmas 1 and 3. This yields size(C_T) ≤ 2^{Hmax(C)} for the circuit C_T that is constructed in lemma 5 for this tree T. The proof of the second claim of theorem 1 follows by applying the subsequent lemma 6 instead of lemma 3 to the threshold decision tree T resulting from lemma 1. In addition, a minor modification is needed in the proof of lemma 5. The threshold decision tree T′ that results from lemma 6 is constructed to have the property that each gate in T′ outputs 1 with probability ≤ 1/2. This may require that the left child of a node is a leaf with label 1, causing in lemma 5 a potential doubling of the circuit size and an additive increase by 1 of the energy complexity.

Lemma 6. Let T be a threshold decision tree computing f : {0, 1}^n → {0, 1}. Then for any given distribution Q of circuit inputs, there exists a threshold decision tree T′ computing f such that the expected number of 1's with regard to Q is at most HQ(C)/2.

Proof. Let cost_Q(s) be the expected number of times that one goes to the right in the subtree of T′ whose root is the node labeled by s. Let P(s) be the probability (with regard to Q) that gate g_s outputs 1. We construct T′
by modifying T inductively (starting at the nodes of the largest depth m in T) through the following procedure: if P(s) > 1/2, replace g_s by a threshold gate that computes its negation, and swap the left and right subtree below this node. By construction, we have P(s) ≤ 1/2 for every gate g_s in T′. Furthermore, we have:
- If |s| = m − 1, then cost_Q(s) = P(s).
- If 0 ≤ |s| < m − 1, then P(s) ≤ 1/2 and cost_Q(s) = P(s) + P(s) · cost_Q(s1) + (1 − P(s)) · cost_Q(s0).
One can prove by induction on |s| that cost_Q(s) ≤ HQ(s)/2 for all s ∈ S_T, where HQ(s) is the entropy of states of the ensemble of gates of T′ in the subtree below gate g_s. For the induction step, one uses the concavity of the log function, which implies that

$$P(s) = -P(s) \cdot (-1) = -P(s) \cdot \log\frac{P(s) + (1 - P(s))}{2} \le -P(s) \cdot \frac{\log P(s) + \log(1 - P(s))}{2},$$

and the fact that P(s) ≤ 1 − P(s), to show that

$$\begin{aligned}
\mathrm{cost}_Q(s) &\le P(s) + P(s) \cdot \frac{H_Q(s1)}{2} + (1 - P(s)) \cdot \frac{H_Q(s0)}{2} \\
&\le -P(s) \cdot \frac{\log P(s) + \log(1 - P(s))}{2} + P(s) \cdot \frac{H_Q(s1)}{2} + (1 - P(s)) \cdot \frac{H_Q(s0)}{2} \\
&\le -\frac{P(s)}{2} \log P(s) - \frac{1 - P(s)}{2} \log(1 - P(s)) + P(s) \cdot \frac{H_Q(s1)}{2} + (1 - P(s)) \cdot \frac{H_Q(s0)}{2} \\
&\le \frac{H_Q(s)}{2}.
\end{aligned}$$
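The crux of this induction is that for P(s) ≤ 1/2 the gate's own expected firing cost P(s) is absorbed by half the binary entropy H(P(s)). This can be spot-checked numerically (a sanity check, not part of the proof):

```python
from math import log2

def H(p):
    """Binary entropy in bits; H(0) = H(1) = 0 by convention."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# For p <= 1/2: p <= H(p)/2, so the recursion
#   cost_Q(s) = p + p*c1 + (1-p)*c0
# stays below H_Q(s)/2 = (H(p) + p*H1 + (1-p)*H0)/2
# whenever the inductive bounds c1 <= H1/2 and c0 <= H0/2 hold.
for i in range(1, 51):
    p = i / 100.0                      # p ranges over (0, 0.5]
    assert p <= H(p) / 2 + 1e-12
    for h1, h0 in [(0.0, 0.0), (1.0, 3.0), (2.5, 0.5)]:
        lhs = p + p * h1 / 2 + (1 - p) * h0 / 2
        rhs = (H(p) + p * h1 + (1 - p) * h0) / 2
        assert lhs <= rhs + 1e-12
```

The inequality p ≤ H(p)/2 on [0, 1/2] follows because the concave entropy curve lies above the chord 2p joining its values at p = 0 and p = 1/2.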
Remark 3. The results of this section can also be applied to circuits that compute arbitrary functions f : D → {0, 1} for some arbitrary finite set D ⊆ R^n (instead of {0, 1}^n). For domains D ⊆ R^n of infinite size, a different proof would be needed, since then one can no longer replace any given threshold gate by another threshold gate that computes its negation (as used in the proofs of lemmas 3, 5, and 6).

4 On the Computational Power of Circuits with Low Entropy

The concepts discussed in this letter raise the question of which functions f : {0, 1}^n → {0, 1} can be computed by polynomial size threshold circuits C with Hmax(C) = O(log n). There is currently no function f in P (or even in NP) known for which this is provably false. But the following result
shows that if all functions that can be computed by layered² polynomial size threshold circuits of bounded depth can be computed by a circuit C of the same type that satisfies in addition Hmax(C) = O(log n), then this implies a collapse of the depth hierarchy for polynomial size threshold circuits:

Theorem 2. Assume that a function f : {0, 1}^n → {0, 1} (or f : R^n → {0, 1}) can be computed by a threshold circuit C with polynomially in n many gates and Hmax(C) = O(log n). Then one can compute f with a polynomial size threshold circuit C′ of depth 3.

Proof. According to lemma 1, there exists a threshold decision tree T with polynomially in n many leaves and depth(T) ≤ size(C). Design (as in Gröger & Turán, 1991) for each path p from the root to a leaf with output 1 in T a threshold gate g_p on layer 2 of C′ that outputs 1 if and only if this path p becomes active in T. The output gate on layer 3 of C′ is simply an OR of all these gates g_p.
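The logical structure of this depth-3 construction, an OR over one gate per accepting path, can be sketched as follows. Note that in the actual Gröger-Turán construction each layer-2 path gate is a single threshold gate; this sketch (with an assumed tuple encoding of trees and an illustrative XOR tree) only checks the path logic.

```python
def paths_to_one(tree, conds=()):
    """Enumerate the root-to-leaf paths of a threshold decision tree that
    end in a 1-leaf; each path is a tuple of (gate, required_output) pairs.
    tree is either 0, 1, or a triple (gate_fn, left_subtree, right_subtree)."""
    if tree in (0, 1):
        return [conds] if tree == 1 else []
    gate, left, right = tree
    return (paths_to_one(left, conds + ((gate, 0),)) +
            paths_to_one(right, conds + ((gate, 1),)))

def depth3_eval(tree, x):
    """Layer 2: one gate per accepting path p, firing iff p is active.
    Layer 3: an OR of all path gates. (Here the path gates are checked
    logically, not realized as single threshold gates.)"""
    return int(any(all(g(x) == b for g, b in path)
                   for path in paths_to_one(tree)))

g1 = lambda x: int(x[0] + x[1] >= 1)
g2 = lambda x: int(x[0] + x[1] >= 2)
xor_tree = (g1, 0, (g2, 1, 0))   # XOR of two bits, as a decision tree
print([depth3_eval(xor_tree, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
```

Since the tree from lemma 1 has polynomially many leaves, the resulting depth-3 circuit has polynomially many layer-2 gates.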
5 Discussion

In this letter, we have introduced an energy complexity measure for threshold circuits that reflects the biological fact that the firing of a neuron consumes more energy than its nonfiring. We have also provided methods for replacing a given threshold circuit with high energy consumption by a threshold circuit that computes the same function but has brain-like sparse activity. Theorem 1 in combination with remark 2 implies that the computational power of such circuits is quite large. The resulting circuits with sparse activity may help us to elucidate the way in which circuits of neurons are designed in biological systems. In fact, the structure of computations in the threshold circuits with sparse activity that were constructed in the proof of theorem 1 is reminiscent of biological results on the structure of computations in cortical circuits of neurons, where there is concern for the selection of different pathways (dynamic routing) in dependence on the stimulus (Olshausen, Anderson, & Essen, 1995). In addition, our constructions provide first steps toward the design of algorithms for future extremely dense VLSI implementations of neurally inspired circuits, where energy consumption and heat dissipation become critical factors. It is well known (see, e.g., Hajnal, Maass, Pudlák, Szegedy, & Turán, 1993) that threshold circuits can be made robust against random failure of gates with a moderate increase in circuit size. Such methods can also be
² A feedforward circuit is said to be layered if its gates can be partitioned into layers so that edges go only from one layer to the next. We actually need to assume here only that edges from circuit inputs go only to gates on layer 1.
applied to the sparsely active threshold circuits that were constructed in this letter, maintaining their sparse activity feature. For example, one can replace each threshold gate by an odd number k of identical copies of this gate and take their majority vote with the help of another threshold gate. This increases the circuit size by a factor of only k + 1, but preserves their sparse activity. Furthermore, the resulting circuit computes correctly as long as the majority of gates in each group of k gates computes without a fault. Additional noise suppression could exploit that all legitimate activation patterns of gates in the circuit C T that was constructed in lemma 5 have a quite specific structure, since they simulate an activation path in a tree T. The new concepts and results of this letter suggest a number of interesting open problems in computational complexity theory. At the beginning of section 3, we showed that the energy complexity of a threshold circuit that computes some function f cannot be less than the a priori bound given by the minimal required circuit entropy for computing such a function. This result suggests that the entropy of circuit states required for various practically relevant functions should be investigated. Another interesting open problem is the trade-off between energy complexity and computation speed in threshold circuits, both in general and for concrete computational problems. Finally, we consider that both the energy complexity and the entropy of threshold circuits are concepts of interest in their own right. They give rise to interesting complexity classes that have not been considered previously in computational complexity theory. In particular, it may be possible to develop new lower-bound methods for circuits with low entropy, thereby enlarging the reservoir of lower-bound techniques in circuit complexity theory. 
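The reliability of the k-fold replication scheme described above can be quantified under the usual assumption of independent gate faults; the fault probability used below is purely illustrative.

```python
from math import comb

def group_reliability(p_fault, k):
    """Probability that a majority of k independent copies of a gate are
    fault-free (k odd), i.e. that at most (k - 1) // 2 copies fail.
    This is the condition under which one replicated group, with its
    majority-vote gate, computes correctly."""
    return sum(comb(k, i) * p_fault**i * (1 - p_fault)**(k - i)
               for i in range((k - 1) // 2 + 1))

# With a per-gate fault probability of 0.05, triplication already helps,
# and larger odd k improves the guarantee further:
print(group_reliability(0.05, 3), group_reliability(0.05, 7))
```

Each replicated group costs k + 1 gates (k copies plus the majority gate), matching the size increase stated above, and a correct circuit output only requires every group's majority to be fault-free.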
Acknowledgments

We thank Michael Pfeiffer, Pavel Pudlák, and Robert Legenstein for helpful discussions; Kazuyuki Amano and Eiji Takimoto for their advice; and Akira Maruoka for making this collaboration possible. This work was partially supported by the Austrian Science Fund FWF, project S9102-N04, and projects FP6-015879 (FACETS) and FP6-2005-015803 (DAISY) of the European Union.

References

Cheremisin, O. V. (2003). On the activity of cell circuits realising the system of all conjunctions. Discrete Mathematics and Applications, 13(2), 209–219.
Gröger, H. D., & Turán, G. (1991). On linear decision trees computing Boolean functions. Lecture Notes in Computer Science, 510, 707–718.
Hajnal, A., Maass, W., Pudlák, P., Szegedy, M., & Turán, G. (1993). Threshold circuits of bounded depth. J. Comput. System Sci., 46, 129–154.
Kasim-Zade, M. (1992). On a measure of the activeness of circuits made of functional elements. Mathematical Problems in Cybernetics, 4, 218–228. (In Russian)
Kissin, G. (1991). Upper and lower bounds on switching energy in VLSI. J. Assoc. Comp. Mach., 38, 222–254.
Lennie, P. (2003). The cost of cortical computation. Current Biology, 13, 493–497.
Margrie, T. W., Brecht, M., & Sakmann, B. (2002). In vivo, low-resistance, whole-cell recordings from neurons in the anaesthetized and awake mammalian brain. Pflügers Arch., 444(4), 491–498.
Minsky, M., & Papert, S. (1988). Perceptrons: An introduction to computational geometry. Cambridge, MA: MIT Press.
Olshausen, B. A., Anderson, C. H., & Essen, D. C. V. (1995). A multiscale dynamic routing circuit for forming size- and position-invariant object representations. J. Comput. Neurosci., 2(1), 45–62.
Parberry, I. (1994). Circuit complexity and neural networks. Cambridge, MA: MIT Press.
Reif, J. H., & Tyagi, A. (1990). Energy complexity of optical computations. In Proceedings of the Second IEEE Symposium on Parallel and Distributed Processing (pp. 14–21). Piscataway, NJ: IEEE.
Roychowdhury, V. P., Siu, K. Y., & Orlitsky, A. (1994). Theoretical advances in neural computation and learning. Boston: Kluwer Academic.
Siu, K.-Y., Roychowdhury, V., & Kailath, T. (1995). Discrete neural computation: A theoretical foundation. Upper Saddle River, NJ: Prentice Hall.
Weinzweig, M. N. (1961). On the power of networks of functional elements. Dokl. Akad. Nauk SSSR, 139, 320–323.
Received September 14, 2005; accepted May 5, 2006.
LETTER
Communicated by Bard Ermentrout
Parameter Space Structure of Continuous-Time Recurrent Neural Networks Randall D. Beer [email protected] Department of Electrical Engineering and Computer Science and Department of Biology, Case Western Reserve University, Cleveland, OH 44106, U.S.A.
A fundamental challenge for any general theory of neural circuits is how to characterize the structure of the space of all possible circuits over a given model neuron. As a first step in this direction, this letter begins a systematic study of the global parameter space structure of continuous-time recurrent neural networks (CTRNNs), a class of neural models that is simple but dynamically universal. First, we explicitly compute the local bifurcation manifolds of CTRNNs. We then visualize the structure of these manifolds in net input space for small circuits. These visualizations reveal a set of extremal saddle node bifurcation manifolds that divide CTRNN parameter space into regions of dynamics with different effective dimensionality. Next, we completely characterize the combinatorics and geometry of an asymptotically exact approximation to these regions for circuits of arbitrary size. Finally, we show how these regions can be used to calculate estimates of the probability of encountering different kinds of dynamics in CTRNN parameter space.
The author is now at the Cognitive Science Program at Indiana University. Neural Computation 18, 3009–3051 (2006)
© 2006 Massachusetts Institute of Technology
of conductance variations but extremely sensitive to other sets. Golowasch, Goldman, Abbott, and Marder (2002) found that averaging the conductance values of different models of an STG cell exhibiting a given firing characteristic could sometimes fail to produce an average model with the same characteristic due to nonlinearities in conductance space. Prinz, Billimoria, and Marder (2003) found that widely disparate parameter sets in a three-cell pyloric network model could give rise to almost indistinguishable network activity. More recently, Prinz, Bucher, and Marder (2004) have constructed maps of the parameter spaces of model STG neurons through sampling on a coarse grid of conductance values. A general theory of neural circuits would benefit such studies in many ways. It would tell us what kinds of dynamical behavior lie in different regions of a circuit’s parameter space and allow us to quantify the probability of encountering these different dynamics. It would help us to understand a circuit’s response to sensory inputs or neuromodulators by distinguishing those directions of variation in parameter space to which its behavior is most sensitive from those to which it is robust. It would support the classification of novel circuits from their parameters alone, without requiring an exhaustive dynamical analysis of each new circuit. It would provide mathematical and computational tools for calculating experimentally testable predictions regarding novel manipulations. It would provide a methodology for the design of neural circuits with desired behavior. Perhaps most important, it would supply a general context within which to understand the behavior of particular circuits, allowing us to situate the details of the actual within the space of the possible. Characterizing the general parameter space structure of neural circuits is an extremely difficult problem and is likely to be impossible in full detail for most models of interest. 
Nevertheless, even a partial characterization of the parameter space structure of the simplest nontrivial model neural circuits would be extremely useful. Studies of simpler models can help us to build intuition about what such a general theory might look like. Such studies would also allow us to develop the conceptual framework necessary to formulate the theory and the mathematical and computational machinery required to derive its consequences. In addition, they would give us a better appreciation for the kinds of questions that we can reasonably expect to answer with such a theory and the kinds of questions that are likely to be beyond its reach. Only armed with this experience can we hope to extend our understanding to the parameter space structure of more biophysically realistic models. In this letter, we begin the task of characterizing the parameter space structure of a class of model neural circuits that, though simple, is dynamically universal. Section 2 introduces the continuous-time recurrent neural network (CTRNN) model we will study and reviews some of its key properties. In section 3, we explicitly compute the local bifurcation manifolds of CTRNNs and visualize their detailed structure for small N. An
approximation to the overall structure of the outer envelope of these manifolds for arbitrary N is then derived analytically in section 4. Sections 5 and 6 use the theory to calculate estimates of the probability of encountering different kinds of dynamics in CTRNN parameter space. The letter concludes with a discussion of the current status of the theory and directions for future work. An electronic supplement in the form of a Mathematica (Wolfram, 2003) notebook provides tools for readers to reproduce the main calculations in the article and to carry out their own explorations (Beer, 2005).

2 Continuous-Time Recurrent Neural Networks

Continuous-time recurrent neural networks are among the simplest possible nonlinear continuous-time neural models. They are defined by the vector differential equation

$$\tau \dot{y} = -y + W \sigma(y + \theta) + I, \tag{2.1}$$

where τ, y, ẏ, θ, and I are length-N vectors, W = {w_ij} is an N × N matrix, and all vector operations (including the application of the output function σ(x) = 1/(1 + e^{−x})) are performed element-wise. The standard neurobiological interpretation of this model is that y_i represents the mean membrane potential of the ith neuron, σ(·) represents its mean firing rate, τ_i represents its membrane time constant, θ_i represents its threshold/bias, I_i represents an external input, the weights w_ij (j ≠ i) represent synaptic connections from neuron j to neuron i, and the self-interaction w_ii represents a simple active conductance. This model can also be interpreted as representing nonspiking neurons (Dunn, Lockery, Pierce-Shimomura, & Conery, 2004). In this case, σ(·) represents saturating nonlinearities in synaptic input. Note that the distinction between I and θ is merely semantic; with respect to the output dynamics of equation 2.1, only the net input I + θ matters, since equation 2.1 can be rewritten in the form τẋ = −x + σ(Wx + I + θ) using the substitution y → Wx + I. Without loss of generality, we will often assume that I = 0, so the net input to a CTRNN is given simply by θ. Thus, an N-neuron CTRNN has N time constants, N net inputs, and N² weights, giving C_CTRNN(N), the space of all possible CTRNNs on N neurons, an (N² + 2N)-dimensional parameter space.

Compared to more biologically realistic neural models, the dynamics of an individual CTRNN neuron is quite trivial. However, small networks of CTRNNs can reproduce qualitatively the full range of nerve cell phenomenology, including spiking, plateau potentials, and bursting. More important, CTRNNs are known to be universal approximators of smooth dynamics (Funahashi & Nakamura, 1993; Kimura & Nakano, 1998; Chow & Li, 2000). Thus, at least in principle, the use of CTRNNs implies no restriction whatsoever on biological realism. CTRNNs can be thought of as a basis
3012
R. Beer
dynamics from which any other dynamics can be reproduced to any desired degree of accuracy. CTRNNs are a special case of the general class of additive neural network models τ ẏ = −y + W ξ(y + θ) + I (Grossberg, 1988). Additive neural networks have been extensively studied in both their continuous-time and discrete-time versions (Cowan & Ermentrout, 1978; Cohen & Grossberg, 1983; Hopfield, 1984; Hirsch, 1989; Borisyuk & Kirillov, 1992; Blum & Wang, 1992; Zhaojue, Schieve, & Das, 1993; Beer, 1995; Hoppensteadt & Izhikevich, 1997; Tiňo, Horne, & Giles, 2001; Pasemann, 2002; Haschke & Steil, 2005). Although our analysis will focus on equation 2.1, the results we obtain would be qualitatively identical for any additive model with a (smooth, monotone, bounded) sigmoidal activation function ξ(x). A particularly convenient class of activation functions can be parameterized as

σ_{α,β,µ}(x) = α/(1 + e^{−µx}) + β,

where α, µ ∈ R+ and β ∈ R (Tiňo et al., 2001). This class contains several common activation functions, including both the one used here (σ_{1,0,1}) and the hyperbolic tangent function (σ_{2,−1,2}). All additive neural network models using activation functions in this class share an important property: their dynamics are topologically conjugate (Haschke, 2004). Specifically, the quantitative results we obtain for equation 2.1 can be directly translated to any other additive model τ′ ẏ′ = −y′ + W′ σ_{α,β,µ}(y′ + θ′) + I′ via the change of variables

y′ = µ^−1 y,  τ′ = τ,  W′ = (αµ)^−1 W,  θ′ = µ^−1 θ,  I′ = µ^−1 I − (αµ)^−1 W · β,

where β is the length-N vector (β, . . . , β). The steady-state input-output (SSIO) curve of a single CTRNN neuron will play an important role in this article (see Figure 1). By transforming equation 2.1 to the output space defined by o ≡ σ(y + θ) and setting the time derivative to 0, we find that the SSIO curve of a neuron with self-weight w is given by I + θ = σ^−1(o) − w o. A single additive model neuron can exhibit either unistable or bistable dynamics, depending on the strength of its self-weight and its net input (Cowan & Ermentrout, 1978). In a single CTRNN neuron, only unistable dynamics are possible when w < 4 (see Figure 1A). When w > 4, bistable dynamics occurs when I_L(w) ≤ I + θ ≤ I_R(w) (see
CTRNN Parameter Space
3013
Figure 1: Representative steady-state input-output (SSIO) diagrams of a single CTRNN for (A) w = 2 and (B) w = 8. The solid line shows the output space location of the neuron’s equilibrium points as a function of the net input I + θ. Note that the SSIO becomes folded for w > 4, indicating the existence of three equilibrium points. When the SSIO is folded, the left and right edges of the fold are given by I L (w) and I R (w), respectively (black points in B). The ranges of synaptic inputs received from other neurons are indicated by gray rectangles. The lower (min I) and upper (max I) limits of this range play an important role in the analysis described in this article. In both plots, two synaptic input ranges are shown: one for which the neuron is saturated off (left rectangle) and one for which the neuron is saturated on (right rectangle). The dashed line in A shows the piecewise linear SSIO approximation used in section 4.2, which suggests using the intersections of the linear pieces (black points) as the analog of the fold edges in part B.
Figure 1B), where the left and right edges of the fold are given by (Beer, 1995)

I_L(w), I_R(w) = ±2 ln((√w + √(w − 4))/2) − (w ± √(w(w − 4)))/2,

and it is convenient to define the width of the fold as I_W(w) ≡ I_R(w) − I_L(w). Analogous expressions can be derived for arbitrary sigmoidal activation functions. As we study the parameter space of CTRNNs, we will repeatedly encounter center-crossing circuits defined by the condition θ_i* = −Σ_{j=1}^{N} w_ij / 2 (Beer, 1995). When this condition is satisfied, the null manifolds of each neuron intersect at their centers of symmetry, or, equivalently, the SSIO of each neuron is centered over the range of synaptic inputs that it receives
from the other neurons. Center-crossing circuits are important for a variety of reasons. First, the richest possible dynamics can be found in the neighborhood of such circuits. By “richest possible dynamics,” I mean dynamics that makes maximal use of the available degrees of freedom in the circuit. Second, the bifurcations of the central equilibrium point of a center-crossing circuit can often be fully characterized analytically. Finally, for any given weight matrix, the corresponding center-crossing circuit serves as a symmetry point in the net input parameter space for that circuit.

3 Visualizing Local Bifurcation Manifolds

Before we attempt to characterize the general structure of C_CTRNN(N), it would be helpful to directly visualize this structure, at least for small N. As is typical in dynamical systems theory, we will abstract over the details of individual trajectories and study instead the equivalence classes induced by topological conjugacy of entire flows (Kuznetsov, 1998). Under this equivalence relation, the parameter space of a dynamical system is divided into regions of topologically equivalent dynamics by bifurcation manifolds. If we wish to understand C_CTRNN(N), then characterizing the structure of these manifolds is a good place to begin. Bifurcations can be local or global. Local bifurcations involve changes in the neighborhood of a limit set and can be explicitly defined by algebraic conditions on the vector field and its derivatives in that neighborhood. For example, the change of stability of an equilibrium point as parameters are varied is a local bifurcation. In contrast, global bifurcations involve changes that are not localized to any particular limit set and can usually be studied only numerically. For example, saddle connections are global bifurcations in which the unstable manifold of one equilibrium point coincides with the stable manifold of another. We consider only local bifurcations in this article.
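The basic quantities used repeatedly below — the flow of equation 2.1 and the single-neuron fold edges I_L(w) and I_R(w) — are straightforward to compute numerically. A minimal sketch, assuming NumPy (the article's own supplementary code is in Mathematica; the function names here are introduced for illustration):

```python
import numpy as np

def sigma(x):
    # Logistic output function sigma(x) = 1/(1 + e^-x).
    return 1.0 / (1.0 + np.exp(-x))

def ctrnn_step(y, W, theta, tau, I=0.0, dt=0.01):
    # One forward-Euler step of equation 2.1: tau * dy/dt = -y + W.sigma(y + theta) + I.
    return y + dt * (-y + W @ sigma(y + theta) + I) / tau

def fold_edges(w):
    # Left and right fold edges I_L(w), I_R(w) of the SSIO curve for w > 4 (Beer, 1995).
    s = np.sqrt(w * (w - 4.0))
    t = 2.0 * np.log((np.sqrt(w) + np.sqrt(w - 4.0)) / 2.0)
    return t - (w + s) / 2.0, -t - (w - s) / 2.0
```

For example, fold_edges(8.0) gives I_L ≈ −5.07 and I_R ≈ −2.93, a fold of width I_W ≈ 2.13, matching the folded SSIO regime of Figure 1B.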
The two most common local bifurcations are the saddle node and Hopf bifurcations (Kuznetsov, 1998). In a saddle node bifurcation, a real eigenvalue of an equilibrium point changes sign, signaling a change in its stability. In a Hopf bifurcation, the real parts of a complex conjugate pair of eigenvalues of an equilibrium point change sign, signaling a change in its stability and the production of a limit cycle. These bifurcations are defined by the conditions

det(J) = 0  (saddle node bifurcation)
det(2J ⊙ 1) = 0  (Hopf bifurcation)

where J = {∂f_i/∂y_j} is the Jacobian matrix of partial derivatives of the vector field f(y) ≡ (−y + W · σ(y + θ))/τ with respect to y evaluated at an equilibrium point ȳ, 1 denotes the identity matrix, and ⊙ is the bialternate matrix
product (Guckenheimer, Myers, & Sturmfels, 1997). Given two N × N matrices A = {a_ij} and B = {b_ij}, A ⊙ B is the ½N(N − 1) × ½N(N − 1) matrix whose rows are labeled by the multi-index (p, q) (where p = 2, 3, . . . , N, and q = 1, 2, . . . , p − 1), whose columns are labeled by the multi-index (r, s) (where r = 2, 3, . . . , N, and s = 1, 2, . . . , r − 1), and whose elements are given by (Kuznetsov, 1998)

(A ⊙ B)_(p,q),(r,s) = ½ ( |a_pr a_ps; b_qr b_qs| + |b_pr b_ps; a_qr a_qs| ),

where |·| denotes a 2 × 2 determinant written row by row.
An implementation of ⊙ is included in the electronic supplement (Beer, 2005). The significance of the bialternate matrix product lies in the fact that if an N × N matrix M has eigenvalues λ_1, . . . , λ_N, then 2M ⊙ 1 has eigenvalues λ_i + λ_j, with 1 ≤ j < i ≤ N. Thus, the Hopf condition det(2J ⊙ 1) = 0 is satisfied whenever a complex conjugate pair of eigenvalues has zero real parts. Note that an equilibrium point having a pair of real eigenvalues with equal magnitude but opposite sign will also satisfy this condition (such a point is called a neutral saddle). Since such points are not actually Hopf bifurcations, the portions of a solution manifold of det(2J ⊙ 1) = 0 for which the eigenvalues are not pure imaginary must be removed in a postprocessing step. Haschke (2004; Haschke & Steil, 2005) developed a method for computing local bifurcation manifolds of discrete-time recurrent neural networks in net input space that can easily be adapted to CTRNNs. If we define Ψ ≡ σ′(ȳ + θ) to be the vector of derivatives of σ(·) at an equilibrium point ȳ, with 0 < ψ_i ≤ 1/4, then the CTRNN Jacobian at that point can be written as
J = diag(τ^−1) · (diag(Ψ) · W − 1),

where τ^−1 denotes the vector of reciprocals of the time constants τ and diag(v) denotes the diagonal matrix containing the vector v. This allows the saddle node and Hopf bifurcation conditions to be expressed directly in terms of Ψ. For example, for a two-neuron CTRNN, the saddle node bifurcation manifolds are given by the solutions (ψ_1, ψ_2) to

det(J) = det[ (w_11 ψ_1 − 1)/τ_1   ψ_1 w_12/τ_1 ; ψ_2 w_21/τ_2   (w_22 ψ_2 − 1)/τ_2 ]
       = 1 − w_11 ψ_1 − w_22 ψ_2 + ψ_1 ψ_2 det W = 0,
where the time constants have been eliminated by multiplying through by τ_1 τ_2. The Hopf bifurcation manifolds are given by the solutions to

det(2J ⊙ 1) = det( 2 [ (w_11 ψ_1 − 1)/τ_1   ψ_1 w_12/τ_1 ; ψ_2 w_21/τ_2   (w_22 ψ_2 − 1)/τ_2 ] ⊙ [ 1 0 ; 0 1 ] )
            = (w_11 ψ_1 − 1)/τ_1 + (w_22 ψ_2 − 1)/τ_2 = 0.
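The bialternate product can be implemented directly from its definition. The sketch below is an illustration (not the supplement's implementation); it also exhibits the eigenvalue-sum property that the Hopf condition exploits:

```python
import numpy as np

def bialternate(A, B):
    # Bialternate matrix product A (.) B (Kuznetsov, 1998): rows and columns are
    # labeled by multi-indices (p, q) with p > q (0-based here), and each element
    # is half the sum of two 2x2 determinants built from entries of A and B.
    N = A.shape[0]
    pairs = [(p, q) for p in range(1, N) for q in range(p)]
    C = np.zeros((len(pairs), len(pairs)))
    for row, (p, q) in enumerate(pairs):
        for col, (r, s) in enumerate(pairs):
            C[row, col] = 0.5 * (A[p, r] * B[q, s] - A[p, s] * B[q, r]
                                 + B[p, r] * A[q, s] - B[p, s] * A[q, r])
    return C
```

If M has eigenvalues λ_1, . . . , λ_N, then bialternate(2*M, np.eye(N)) has eigenvalues λ_i + λ_j with j < i, so its determinant vanishes exactly when some pair of eigenvalues sums to zero.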
We can visualize these local bifurcation manifolds by numerically approximating the solution manifolds defined implicitly by each bifurcation condition or explicitly solving for one ψ_i in terms of the others. The results for two sample two-neuron circuits are shown in Figures 2A and 2B. Note that these manifolds are functions of W (for saddle node bifurcations) or W and τ (for Hopf bifurcations). The bifurcation manifolds in Ψ-space must then be transformed to net input space. Using the fact that ȳ = σ′^−1(Ψ) − θ (where σ′^−1(·) denotes the inverse of the derivative of σ(·)) and then substituting this for ȳ in the equilibrium point condition f(ȳ) = 0, we obtain

θ_i = σ′^−1(ψ_i) − [W · σ(σ′^−1(Ψ))]_i,    (3.1)

where

σ′^−1(ψ_i) = ln( (1 ± √(1 − 4ψ_i) − 2ψ_i) / (2ψ_i) ).
Note that σ′^−1(·) is two-valued for a sigmoidal function. Since each component of θ can come from either branch, each bifurcation manifold in Ψ-space can therefore generate up to 2^N bifurcation manifolds in net input space. Figures 2C and 2D show the θ-space bifurcation manifolds corresponding to the Ψ-space manifolds shown in Figures 2A and 2B, respectively. Using the same approach, we can also calculate and display the local bifurcation manifolds of three-neuron circuits. Figures 3A and 3B provide an external view of the local bifurcation manifolds for two different three-neuron circuits, and the slices in Figures 3C and 3D reveal some of the rich internal structure. In principle, this method can be applied to circuits of any size. However, the exponential scaling of the number of bifurcation manifolds in net input space and the difficulty of visualizing manifolds in dimensions greater than three make it practical for small circuits only. Mathematica code for the visualization of the local bifurcation manifolds of CTRNNs in two and three dimensions can be found in the electronic supplement (Beer, 2005).
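The Ψ-space to net-input-space transformation of equation 3.1, including both branches of the two-valued σ′^−1, can be sketched as follows (illustrative NumPy code; theta_from_psi and the branch labels are names introduced here):

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def inv_sigma_prime(psi, branch):
    # Inverse of sigma'(x) = sigma(x)(1 - sigma(x)); valid for 0 < psi <= 1/4.
    # branch = +1 or -1 selects the upper or lower branch.
    root = branch * np.sqrt(1.0 - 4.0 * psi)
    return np.log((1.0 + root - 2.0 * psi) / (2.0 * psi))

def theta_from_psi(psi, W, branches):
    # Equation 3.1: net inputs theta that place an equilibrium point with
    # activation-function derivatives psi (with I = 0).
    x = np.array([inv_sigma_prime(p, b) for p, b in zip(psi, branches)])  # ybar + theta
    return x - W @ sigma(x)
```

Enumerating all 2^N choices of branches generates the up-to-2^N net-input-space images of a single Ψ-space manifold.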
Figure 2: Representative local bifurcation curves of two-neuron CTRNNs. In all cases, solid black curves represent saddle node bifurcations, dashed curves represent Hopf bifurcations, and solid gray curves represent neutral saddles. (A) Bifurcation curves in the space (ψ_1, ψ_2) of activation function derivatives for a circuit with W = (6 1; 1 6) and τ_1 = τ_2 = 1. (B) Bifurcation curves in (ψ_1, ψ_2) for a circuit with W = (5.5 −1; 1 5.5) and τ_1 = τ_2 = 1. (C) The bifurcation curves from A in net input space (neutral saddle curves have been removed). Each curve in (ψ_1, ψ_2) can produce multiple curves in (θ_1, θ_2). (D) The bifurcation curves from B in net input space (neutral saddle curves have been removed).
4 The Global Structure of Local Bifurcation Manifolds

How can we characterize the structure of CTRNN parameter space? As Figures 2 and 3 demonstrate, the local bifurcation structure of CTRNNs can be quite complex even in small circuits. Of course, including numerically computed global bifurcation manifolds would serve only to complicate these plots (Borisyuk & Kirillov, 1992; Hoppensteadt & Izhikevich, 1997).
Figure 3: Representative local bifurcation surfaces of three-neuron CTRNNs in net input space. (A) A circuit with W = (6 1 1; 1 6 1; 1 1 6) and τ_i = 1. (B) A circuit with W = (6 −1 1; −1 6 1; −1 1 6) and τ_i = 1. (C) A cutaway view of the interior of the plot in A. (D) A cutaway view of the interior of the plot in B. The bifurcation manifolds in the space (ψ_1, ψ_2, ψ_3) of activation function derivatives are not shown in this figure, and saddle node and Hopf bifurcation surfaces are not distinguished in these plots.
If we have any hope of achieving an understanding of C_CTRNN(N) that can be scaled to large N, then we must focus on the overall structure of these manifolds rather than their fine details. The plots in Figures 2 and 3 illustrate two key features of the overall structure of CTRNN parameter space. First, there is always a compact central region in net input space (whose location and extent depend on the
connection weights) with the richest dynamics and the highest density of bifurcations. As it turns out, the center of each of these regions corresponds to the center-crossing network for that weight matrix. In these central regions, all N neurons are dynamically active: the range of net input each neuron receives from the other neurons overlaps the most sensitive region of its activation function σ(·). Dynamically active neurons can respond to changes in the outputs of other neurons and are therefore capable of participating in nontrivial dynamics. A second key feature of CTRNN parameter space apparent in Figures 2 and 3 is that the local bifurcation manifolds flatten out as we move away from the central region, forming quasi-rectangular regions with an apparent combinatorial structure. This structure is produced by different subsets of neurons becoming saturated: the range of net input received is such that σ(·) ≈ 0 or σ(·) ≈ 1. Saturated neurons effectively drop out of the dynamics and become constant inputs to other neurons because their outputs are insensitive to changes in the outputs of those other neurons (see Figure 4). For example, in Figure 3A, we see a central “cube” surrounded by six “poles,” which are in turn interconnected by twelve “slabs.” In the central cube, all three neurons are dynamically active. In each pole, only two neurons are dynamically active; one of the neurons is saturated either on or off. Thus, in these regions, the dynamics of the three-neuron circuit effectively becomes two-dimensional. Indeed, the structure of the local bifurcation manifolds visible in the pole cross-sections in Figure 3A is similar to that of the two-neuron circuit in Figure 2C (Haschke, 2004). In the slabs, only one neuron is dynamically active, while in the eight remaining corner regions of this plot, all three neurons are saturated.
We can use this observation to partition the net input space of a CTRNN into regions of dynamics with different effective dimension depending on the number of neurons that are dynamically active. A CTRNN with an effective dimension of M has limit sets whose extent, distribution, and responses to perturbations span an M-dimensional subspace of the N-dimensional output space of an N-neuron circuit, with 0 ≤ M ≤ N (see Figure 4). In this section, we will completely characterize the combinatorics and geometry of an approximation to these regions for arbitrary N. Due to some differences in the details, we first consider the case when all w_ii > 4 and then the case when some w_ii < 4.

4.1 All w_ii > 4. When all w_ii > 4, the SSIO curves of all neurons are folded (see Figure 1B), and the edges of these folds play an important role in structuring CTRNN parameter space (see Figure 4). Specifically, when the right edge of a neuron's synaptic input range falls below its left fold, that neuron will be saturated off regardless of the states of the other neurons in the circuit (see the left rectangle in Figure 1B). And when the left edge of a neuron's synaptic input range falls above its right fold, that neuron will be saturated on (see the right rectangle in Figure 1B). At
the boundaries of these saturated regions (when a fold is tangent to the corresponding edge of the synaptic input range), saddle node bifurcations occur. These saddle node bifurcations are extremal in the sense that as long as a neuron remains saturated in this way, it cannot participate in any further bifurcations. Thus, all bifurcations (both local and global) that a given subset of neurons can undergo must be contained within the extremal bifurcation manifolds that delineate the dynamically active ranges of those neurons. When all w_ii > 4, the extremal bifurcation manifolds are the boundaries that divide CTRNN parameter space into regions of dynamics with different effective dimension. Although the extremal bifurcation manifolds can be calculated analytically using the method described in the previous section, the expressions become extremely unwieldy for larger N. Thus, it is far more convenient for arbitrary N to work with approximations to these manifolds. Our approximation will be based on the fact that as we move away from the center-crossing point in net input space, these extremal bifurcation manifolds become asymptotically flat due to the saturation of σ(·). By extrapolating inward toward the center-crossing point, we can approximate the extremal bifurcation manifolds by the hyperplanes that they asymptotically approach. Figures 5A through 5D show the region approximations that correspond to Figures 2C, 2D, 3A, and 3B, respectively. Note that while these approximations are asymptotically exact and generally quite good, there can be nontrivial differences when nonlinear effects come significantly into play in the neighborhood of the center-crossing point (see Figures 5B and 5D). In the remainder of this section, we formally characterize the structure of the regions bounded by these hyperplanes. We proceed in three steps.
First, we describe the region of M-dimensional dynamics in an N-neuron circuit (denoted by the symbol R) as a union of disjoint polytopes (denoted by the symbol Q) corresponding to different subsets of N − M neurons in saturation. Second, we decompose the structure of each Q into a union of
Figure 4: Phase portraits (left) and SSIO diagrams (right) of two-neuron CTRNNs with dynamics of different effective dimension. In the phase portrait plots, stable equilibrium points are shown in black, saddle points are shown in gray, unstable equilibrium points are shown as circles, and the nullclines are shown as black curves. In the SSIO plots, the range of synaptic input that each neuron receives from the other is indicated by a gray rectangle. (A) A phase portrait with an effective dimension of 2. The folds of both neurons intersect their range of synaptic inputs. (B) A phase portrait with an effective dimension of 1. Neuron 1 is saturated on, confining the dynamics to the right edge of output space after transients have passed. (C) A phase portrait with an effective dimension of 0. Both neurons are saturated on, confining the dynamics to the upper-right-hand corner of output space after transients have passed.
Figure 5: Asymptotic hyperplane approximations to the extremal saddle node bifurcation manifolds shown in Figures 2 and 3. (A) Region approximations corresponding to Figure 2C. The darkest square is R^2_2, the lighter rectangles are R^2_1, and the lightest regions are R^2_0. (B) Region approximations corresponding to Figure 2D. (C) Region approximations corresponding to Figure 3A. (D) Region approximations corresponding to Figure 3B.
overlapping rectangular hypersolids (denoted by the symbol B). Third, we write the boundaries of each B as a set of N inequalities involving functions of the connection weights.
Let R^N_M(W) be the asymptotic hyperplane approximation to the region of M-dimensional dynamics in the net input space of an N-neuron circuit with weight matrix W, 0 ≤ M ≤ N. Because the saddle node bifurcation condition is independent of the time constants, R^N_M does not depend on τ. Thus, R^N_M is an N-dimensional volume characterized by N^2 weight parameters. For example, Figure 6 shows the structure of R^3_M for M = 0, . . . , 3 for the same circuit shown in Figures 3A and 5C.
Figure 6: The individual components of the region approximation shown in Figure 5C. The front-most polytope (a ³Q with all of neurons 1, 2, and 3 saturated) has been removed from R^3_0 in order to make the interior visible.

As can be clearly seen in Figure 6, R^N_M(W) is composed of disjoint polytopes corresponding to different subsets of N − M neurons in saturation. Let ^N Q|J(W) be the polytope whose index list J marks neurons i_1, . . . , i_k as saturated off and neurons i_{k+1}, . . . , i_{N−M} as saturated on, for any 0 ≤ k ≤ N − M; in this plain-text rendering we write i↑ for a raised index (neuron i saturated on) and i↓ for a lowered index (neuron i saturated off). Let Z(S) ≡ ×_{i∈S} {i↓, i↑} be the set containing the indices in S in all possible raised and lowered combinations, for example, Z({1, 2}) = {1↓, 1↑} × {2↓, 2↑} = {{1↓, 2↓}, {1↓, 2↑}, {1↑, 2↓}, {1↑, 2↑}}. Then

R^N_M(W) = ∪_{S ∈ K^N_{N−M}} ∪_{J ∈ Z(S)} ^N Q|J(W),    (4.1)
where K^N_K denotes the K-subsets of {1, . . . , N} and ^N Q|J denotes the concatenation of the indices J onto ^N Q. For example, R^2_0 = ²Q|{1↓, 2↓} ∪ ²Q|{1↓, 2↑} ∪ ²Q|{1↑, 2↓} ∪ ²Q|{1↑, 2↑}. Note that the index set J is unordered: ²Q|{1↓, 2↑} = ²Q|{2↑, 1↓}. Since there are (N choose N − M) = (N choose M) different ways of selecting N − M neurons to saturate and each of these may be saturated either off or on, R^N_M is composed of (N choose M) 2^{N−M} nonoverlapping Q polytopes. Each ^N Q|J polytope is in general nonconvex. In order to simplify the description of R^N_M, we will further decompose these ^N Q|Js into unions of rectangular hypersolids ^N B|L(W), where L is an ordered list of raised and lowered indices: ³B|(1↑, 2↑) ≠ ³B|(2↑, 1↑). Then
^N Q|J(W) = ∪_{L ∈ P(J)} ^N B|L(W),    (4.2)
where P(J) is the set of permutations of the raised and lowered indices J. For example, ³Q|{1↑, 2↑} = ³B|(1↑, 2↑) ∪ ³B|(2↑, 1↑) (see Figure 7). Here ³B|(1↑, 2↑) can be interpreted to mean that neuron 1 is on and neuron 2 is on given that neuron 1 is on, and ³B|(2↑, 1↑) means that neuron 2 is on and neuron 1 is on given that neuron 2 is on. Since L contains N − M elements, each ^N Q|J is composed of (N − M)! possibly overlapping ^N B|Ls. Thus, R^N_M is composed of a total
of (N choose M) 2^{N−M} (N − M)! = (N!/M!) 2^{N−M} rectangular hypersolids, which obviously grows very quickly with the codimension N − M. In order to calculate the bounds of a ^N B|L rectangular hypersolid, we will need the synaptic input spectrum I_i^S(W) of neuron i: the set of possible synaptic inputs received by neuron i from the subcircuit consisting of the neurons in S. The min and max elements of this set define the range of synaptic inputs that neuron i can receive (see Figure 1). I_i^S can be defined as

I_i^S(W) ≡ ΣPow({w_ij | j ∈ S and j ≠ i}),    (4.3)
where Pow(W) denotes the power set of the set of weights W and ΣPow(W) denotes the set of sums of the elements of the sets in Pow(W). For example, the synaptic input spectrum of neuron 2 in the {2, 3, 4} subcircuit of a four-neuron network is

I_2^{2,3,4} = ΣPow({w_23, w_24}) = Σ{{}, {w_23}, {w_24}, {w_23, w_24}} = {0, w_23, w_24, w_23 + w_24}.
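The power-set construction of equation 4.3 is a few lines with itertools. A sketch (the dict-based weight lookup is an illustrative choice, not the article's convention):

```python
from itertools import combinations

def input_spectrum(W, i, S):
    # Synaptic input spectrum I_i^S (equation 4.3): the set of sums over the
    # power set of {w_ij | j in S, j != i}. W maps (i, j) -> weight, with the
    # paper's 1-based neuron labels; the empty subset contributes the sum 0.
    ws = [W[(i, j)] for j in S if j != i]
    spectrum = set()
    for k in range(len(ws) + 1):
        for combo in combinations(ws, k):
            spectrum.add(sum(combo))
    return spectrum
```

The min and max of the returned set give the range of synaptic input that neuron i can receive from the subcircuit S, as used repeatedly below.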
We can now write each ^N B|L(W) as a set of N inequalities. In order to obtain the tightest possible bounds, we must treat the saturated and dynamically active neurons separately. Consider a neuron i in a circuit with a set of saturated neurons S and a set of dynamically active neurons
Figure 7: An illustration of the structure of Q and B regions. ³B|(1↑, 2↑) (A) and ³B|(2↑, 1↑) (B) differ in the order in which their defining inequalities are constructed. (C) The nonconvex polytope ³Q|{1↑, 2↑} is formed by the union of the rectangular solids ³B|(1↑, 2↑) and ³B|(2↑, 1↑).
A, and let I_i be the net synaptic input that i receives from the neurons in S. Then neuron i will be saturated off when the left edge of its fold (I_L(w_ii) − (θ_i + I_i)) falls above the right edge of the range of synaptic input it receives from the active neurons (max I_i^A) (see the left rectangle in Figure 1B), it will be saturated on when the right edge of its fold (I_R(w_ii) − (θ_i + I_i)) falls below the left edge of the range of synaptic input it receives from the active neurons (min I_i^A) (see the right rectangle in Figure 1B), and it will be dynamically active otherwise.
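This three-way comparison can be sketched directly in code (an illustration; classify and its argument names are introduced here, not taken from the article):

```python
import numpy as np

def fold_edges(w):
    # Fold edges I_L(w), I_R(w) for a self-weight w > 4 (Beer, 1995).
    s = np.sqrt(w * (w - 4.0))
    t = 2.0 * np.log((np.sqrt(w) + np.sqrt(w - 4.0)) / 2.0)
    return t - (w + s) / 2.0, -t - (w - s) / 2.0

def classify(w_self, theta, I_sat, active_range):
    # w_self > 4; I_sat is the net input from the saturated neurons;
    # active_range = (min I_A, max I_A) from the dynamically active neurons.
    IL, IR = fold_edges(w_self)
    lo, hi = active_range
    if IL - (theta + I_sat) > hi:   # left fold edge above the input range: off
        return "off"
    if IR - (theta + I_sat) < lo:   # right fold edge below the input range: on
        return "on"
    return "active"
```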
More formally, each saturated neuron whose index appears in L leads to an inequality of the form l_{v_i} < θ_{v_i} < u_{v_i}. Write the ordered index list L as the element-wise product of an elevation vector e and an index vector v: L_i = e_i v_i, where e_i = 1 if index v_i is raised in L and 0 if it is lowered. Then the inequality bounds for a saturated neuron can be written as

l_{v_i} = { −∞                                                                          if e_i = 0
          { I_R(w_{v_i v_i}) − min I_{v_i}^{N\{v_1, …, v_{i−1}}} − Σ_{j=1}^{i−1} e_j w_{v_i v_j}   if e_i = 1

u_{v_i} = { I_L(w_{v_i v_i}) − max I_{v_i}^{N\{v_1, …, v_{i−1}}} − Σ_{j=1}^{i−1} e_j w_{v_i v_j}   if e_i = 0
          { ∞                                                                           if e_i = 1
                                                                                        (4.4)
where N = {1, . . . , N} and the set difference S_1\S_2 is the set consisting of all elements of S_1 that are not in S_2. For example, for ⁴B|(4↑, 1↓, 3↑) we have L = (4↑, 1↓, 3↑), with e = (1, 0, 1) and v = (4, 1, 3). Consider neuron 1 in this circuit, which occurs at an index of i = 2 in L. The notation ⁴B|(4↑, 1↓, 3↑) tells us that neuron 1 is saturated off (e_2 = 0), which requires that the left edge of its fold falls to the right of the maximum synaptic input max I_1^{{1,2,3,4}\{4}} = max I_1^{{1,2,3}} it receives (see Figure 1B). The left edge of its fold is given by I_L(w_11) offset by the net input it receives from the other saturated neurons relative to its own bias: I_L(w_11) − θ_1 − Σ_{j=1}^{1} e_j w_{v_2 v_j} = I_L(w_11) − θ_1 − e_1 w_14 = I_L(w_11) − θ_1 − w_14. Thus, the neuron 1 boundary of ⁴B|(4↑, 1↓, 3↑) is I_L(w_11) − θ_1 − w_14 > max I_1^{{1,2,3}}, or θ_1 < I_L(w_11) − max I_1^{{1,2,3}} − w_14. On the other hand, each dynamically active neuron i in A = N\L leads to an inequality of the form l_i < θ_i < u_i, with

l_i = I_L(w_ii) − max I_i^A − Σ_{j=1}^{N−M} e_j w_{i v_j}
u_i = I_R(w_ii) − min I_i^A − Σ_{j=1}^{N−M} e_j w_{i v_j}.    (4.5)

For example, for ⁴B|(4↑, 1↓, 3↑) we have A = {1, 2, 3, 4}\{4, 1, 3} = {2}. For neuron 2 to be dynamically active, its fold must overlap the range of synaptic input it receives from the other dynamically active neurons. The minimum and maximum synaptic input are given by min I_2^{{2}} = 0 and max I_2^{{2}} = 0, respectively. The left (resp. right) edges of the fold are given by I_L(w_22) (resp. I_R(w_22)) offset by the net input it receives from the saturated
neurons relative to its own bias: I_L(w_22) − θ_2 − Σ_{j=1}^{3} e_j w_{2 v_j} = I_L(w_22) − θ_2 − (e_1 w_24 + e_2 w_21 + e_3 w_23) = I_L(w_22) − θ_2 − w_23 − w_24 (resp. I_R(w_22) − θ_2 − w_23 − w_24). Thus, the neuron 2 bound is I_L(w_22) − w_23 − w_24 < θ_2 < I_R(w_22) − w_23 − w_24. Completing this example, the rectangular hypersolid ⁴B|(4↑, 1↓, 3↑) would be defined by the four inequalities

θ_1 < I_L(w_11) − max I_1^{{1,2,3}} − w_14
I_L(w_22) − w_23 − w_24 < θ_2 < I_R(w_22) − w_23 − w_24
I_R(w_33) − min I_3^{{2,3}} − w_34 < θ_3
I_R(w_44) − min I_4^{{1,2,3,4}} < θ_4.
Despite the notational complexity of this section, the basic idea is straightforward. When all w_ii > 4, the extremal saddle node bifurcation manifolds divide the net input parameter space of an N-neuron CTRNN into regions of dynamics whose effective dimensionality ranges from 0 to N. Furthermore, the combinatorial structure, location, and extent of an asymptotic approximation to each of these regions can be calculated explicitly using equations 4.1 to 4.5. Mathematica code for the computation (and display when N = 2 or 3) of R^N_M(W) can be found in the electronic supplement (Beer, 2005).

4.2 Some w_ii < 4. In contrast to the w_ii > 4 case, the SSIO of a neuron whose self-weight is less than 4 is unfolded (see Figure 1A). Thus, such a neuron will not undergo the extremal saddle node bifurcations that play such a crucial role in the parameter space structure characterized above. Can this analysis be extended to circuits containing such neurons? To gain some insight into this question, Figure 8 compares the net input parameter space of a two-neuron CTRNN with w_22 = 5 (see Figure 8A) and w_22 = 3 (see Figure 8B). Note that the left and right branches of saddle node bifurcations disappear when w_22 passes below 4, as expected. However, there are still differences in the effective dimensionality of the dynamics of this circuit as θ_2 is varied. For example, between the two saddle node bifurcation curves, the three-equilibrium-point phase portrait changes from occupying the interior of the state space (and therefore being effectively two-dimensional in the distribution of its equilibrium points and its response to perturbations) at the point C1 to occupying only the bottom edge (effectively one-dimensional) at C2. Similarly, outside the saddle node bifurcation curves, the single equilibrium point phase portrait changes from occupying the right edge (effectively one-dimensional) at D1 to occupying the bottom right-hand corner (effectively zero-dimensional) at D2.
Figure 8: An illustration of the extension of the definition of R to CTRNNs containing self-weights less than 4. (A) Local bifurcation curves and region approximations for a two-neuron circuit with W = (6 3; 3 5). (B) The same circuit as A, but with w_22 = 3. Note that the saddle node bifurcations involving the bistability of neuron 2 have now disappeared since w_22 < 4. However, the extended definition R̃ can still be used to calculate the regions shown. C1, C2, D1, and D2 show the phase portraits at the four indicated points in B, with the nullclines drawn as black curves, stable equilibrium points colored black, and saddle points colored gray. Although no saddle node bifurcations separate C1 from C2 or D1 from D2 in this circuit, these regions do differ in their effective dimensionality.
Thus, when w_22 < 4, regions of dynamics with different effective dimensionality still exist, but there are no sharp boundaries between them because these regions are no longer delineated by saddle node bifurcations. If we wish to extend our definitions of the R^N_M(W) boundaries from the previous section to the case when some self-weights are less than 4, then we need to identify some feature of a neuron's unfolded SSIO curve against which we can compare the range of synaptic inputs that it receives. Different choices will lead to somewhat different boundaries. Perhaps the simplest way to accomplish this is to make use of the piecewise linear approximation (the dashed lines in Figure 1A)

σ̃(y + θ) = { 0                 if y < −θ − 2
           { (y + θ + 2)/4     if −θ − 2 ≤ y ≤ −θ + 2
           { 1                 if y > −θ + 2

and use the points where the linear pieces intersect as markers for boundary calculations (black points in Figure 1A). By setting the resulting one-neuron equation to 0 and solving for the net input, we obtain left and right boundaries of −2 and 2 − w, respectively, leading to the extended definitions

Ĩ_L(w) = { −2        if w < 4
         { I_L(w)    if w ≥ 4

Ĩ_R(w) = { 2 − w     if w < 4
         { I_R(w)    if w ≥ 4.
It is important to reiterate that these extended definitions can no longer be interpreted as folds when w < 4. Rather, they correspond to boundaries between saturated and dynamically active behavior. If we replace I_L(w) (resp. I_R(w)) by Ĩ_L(w) (resp. Ĩ_R(w)) everywhere in our previous analysis, we obtain the extended regions R̃^N_M(W), ^N Q̃|J(W), and ^N B̃|L(W) that are valid for all w_ii. In our two-neuron example, these extended definitions give rise to the shaded regions shown in Figure 8B. Although the original region definitions will be used in the remainder of this article, the extended definitions could be substituted at any point.

5 Calculating R^N_M Probabilities
In many applications, knowing what can happen in principle is often far less useful than knowing what typically does happen in practice. Probabilistic calculations can be used to characterize the most likely behavior under various conditions. In this section, we study the probability that a random parameter sample will encounter a region of M-dimensional
3030
R. Beer
dynamics in an N-neuron circuit. Such calculations provide estimates of the dynamical complexity of randomly chosen circuits. In addition, because the R_M^N boundaries become increasingly complex in higher dimensions, a statistical description can provide a very useful summary of the overall scaling of the structure of C_CTRNN(N) with N. Probabilistic calculations are also important for the application of stochastic search techniques such as evolutionary algorithms to CTRNNs (Beer & Gallagher, 1992; Harvey, Husbands, Cliff, Thompson, & Jacobi, 1997; Nolfi & Floreano, 2000), where search behavior is dominated by the most common dynamics in parameter space. They can help to determine how best to seed the initial population of an evolutionary search. They can help select weight and bias parameter ranges (which is often done in an ad hoc manner) so as to maximize the chances of encountering interesting dynamics. They can also help to explain the dynamics of the evolutionary search itself and assess the evolvability of different types of dynamics. Assuming that weights are in the range [wmin, wmax] and biases are in the range [θmin, θmax], this probability is given by the fraction of parameter space volume occupied by the region of interest,

P(R_M^N) = vol(R_M^N) / ((wmax − wmin)^(N²) (θmax − θmin)^N),    (5.1)

where vol(R) denotes the volume of the region R.

Since R_M^N is composed of nonoverlapping polytopes ^N Q|J, we have

vol(R_M^N) = Σ_{S ∈ K_{N−M}^N} Σ_{J ∈ Z(S)} vol(^N Q|J).    (5.2)
It is more difficult to compute vol(^N Q|J) because its constituent ^N B|L rectangular hypersolids can overlap. In general, the volume of these possibly overlapping sets is given by the sum of the volumes of the individual sets adjusted by the volumes of their various overlaps,

vol(^N Q|J) = Σ_{L ∈ P(J)} vol(^N B|L) − Σ_{i=2}^{|P(J)|} (−1)^i Σ_{H ∈ K_i^{P(J)}} vol(∩_{L ∈ H} ^N B|L),    (5.3)

where P(J) once again denotes the permutations of J, |P(J)| denotes the cardinality of P(J), and K_i^{P(J)} denotes the i-subsets of P(J). For example, vol(³Q|{1,2}) = vol(³B|(1,2)) + vol(³B|(2,1)) − vol(³B|(1,2) ∩ ³B|(2,1)) (see Figure 7). Since the ^N B|L are rectangular hypersolids, each intersection in
equation 5.3 will also be a rectangular hypersolid, the bounds of which can be found by taking the appropriate maxs and mins of the bounds of the constituent ^N B|L s. Finally, the volume of each rectangular hypersolid ^N B|L is given by

vol(^N B|L) = ∫_W Π_{i=1}^N ([ui]_{θmin}^{θmax} − [li]_{θmin}^{θmax}) dW,

where W is the hypercube [wmin, wmax]^(N²), the expressions ui and li for the bounds of the ith dimension of ^N B|L are given in equations 4.4 and 4.5, and the notation [x]_min^max means to clip x to the bounds [min, max]. Since ui and li depend on only the N weights coming into neuron i, denoted Wi, this N²-dimensional integral can be factored into the product of N N-dimensional integrals as

vol(^N B|L) = Π_{i=1}^N ∫_{Wi} ([ui]_{θmin}^{θmax} − [li]_{θmin}^{θmax}) dWi.    (5.4)
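The inclusion-exclusion structure of equation 5.3 is easy to mirror in code for axis-aligned boxes. The following is an illustrative sketch over generic boxes, not the actual ^N B|L bounds of equations 4.4 and 4.5:

```python
from itertools import combinations

def box_volume(box):
    """Volume of an axis-aligned box given as a list of (lo, hi) intervals."""
    v = 1.0
    for lo, hi in box:
        if hi <= lo:
            return 0.0  # empty in this dimension
        v *= hi - lo
    return v

def box_intersection(boxes):
    """Intersection of axis-aligned boxes: max of lows, min of highs per dimension."""
    dims = len(boxes[0])
    return [(max(b[d][0] for b in boxes), min(b[d][1] for b in boxes))
            for d in range(dims)]

def union_volume(boxes):
    """Volume of a union of possibly overlapping boxes by inclusion-exclusion,
    mirroring the alternating-sign structure of equation 5.3."""
    total = 0.0
    for k in range(1, len(boxes) + 1):
        for subset in combinations(boxes, k):
            total += (-1) ** (k + 1) * box_volume(box_intersection(list(subset)))
    return total
```

For two overlapping unit-offset squares [0,2]² and [1,3]², the union volume is 4 + 4 − 1 = 7, exactly the three-term example given in the text.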
Thus, in order to calculate P(R_M^N), we must evaluate equations 5.1 to 5.4. Mathematica code supporting the construction and evaluation of such expressions for sufficiently small N − M is provided in the electronic supplement (Beer, 2005). Note that these expressions are not as efficient as they could be. By taking into account integral symmetries, it should be possible to derive equivalent expressions that involve the evaluation of considerably fewer integrals.

As a concrete illustration of the calculation of P(R_M^N), consider the region R_N^N of N-dimensional dynamics in an N-neuron circuit. This is not only the simplest case, but also the most important, because all N neurons are dynamically active. In this case, R_N^N consists of a single rectangular hypersolid and thus vol(R_N^N) = vol(^N Q) = vol(^N B). Since all N integrals are identical, equation 5.4 can be written as

vol(R_N^N) = (∫_4^{wmax} ∫_{−wmax}^{wmax} ··· ∫_{−wmax}^{wmax} ([IR(w) − min I^N]_{θmin}^{θmax} − [IL(w) − max I^N]_{θmin}^{θmax}) dw1 ··· dw_{N−1} dw)^N,    (5.5)
where wi are the incoming weights to an arbitrary neuron, w is that neuron's self-connection, and I^N is an abbreviation for I_i^{{1,...,N}} for arbitrary i. Note that the lower limit of the outermost integral must be 4 because we are using the original region definitions (which are defined only for w ≥ 4) rather than the extended ones. Note also that we have assumed that wmin = −wmax for simplicity and wmax ≥ 4 so that vol(R_N^N) is nonzero.

Although it is unclear in general how to evaluate these arbitrarily iterated piecewise integrals in closed form, it is possible to evaluate them for fixed w and θ bounds and fixed N (see section A.1). In addition, depending on the range of θmin and θmax relative to the points θ′min and θ′max where clipping begins to occur, there are two cases of interest where evaluation of these integrals for general N is relatively straightforward: (1) when clipping dominates equation 5.5 and (2) when no clipping occurs. The points θ′min and θ′max can be defined as follows. For 4 ≤ w ≤ wmax, we have IR(w) ∈ [IR(wmax), −2] and IL(w) ∈ [IL(wmax), −2]. Since I^N has the form {0, ±wmax, ..., ±wmax(N − 1)}, we can conclude that IR(w) − min I^N ∈ [IR(wmax), wmax(N − 1) − 2] and that IL(w) − max I^N ∈ [IL(wmax) − wmax(N − 1), −2], giving

θ′min = IL(wmax) − wmax(N − 1)
θ′max = wmax(N − 1) − 2.

The first case in which the iterated integrals can be evaluated in closed form is when θ′min ≪ θmin and θ′max ≫ θmax, which will occur when N becomes sufficiently large relative to fixed θmin and θmax. In this case, the integrands are almost everywhere clipped to either θmin or θmax, and the iterated integrals evaluate to

vol∞(R_N^N) = ((θmax − θmin)(wmax − 4)(2wmax)^{N−1})^N,

where the ∞ subscript reminds us that this expression is accurate only for sufficiently large N. The probability of a random parameter sample hitting R_N^N therefore scales as

P∞(R_N^N) = vol∞(R_N^N) / ((2wmax)^{N²} (θmax − θmin)^N) = ((wmax − 4)/(2wmax))^N.
The second case in which the iterated integrals can be calculated in closed form is when θmin ≤ θ′min and θmax ≥ θ′max, which will occur when N is small relative to fixed θmin and θmax. In this case, the θ bounds are sufficiently large that no clipping occurs and the [·]_{θmin}^{θmax} can be dropped from the integrands. The integrals can then be evaluated (see section A.2) to obtain

vol0(R_N^N) = (2^{N−2} (wmax)^{N−1} (wmax(N(wmax − 4) − wmax + √(wmax(wmax − 4)) + ln 256 + 4) − 8(wmax − 1) ln(√(wmax − 4) + √wmax) + 2√(wmax(wmax − 4)) − 8 ln 2))^N,

where the 0 subscript reminds us that this expression is accurate only for sufficiently small N. The probability in this case thus scales as

P0(R_N^N) = vol0(R_N^N) / ((2wmax)^{N²} (θmax − θmin)^N).
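Both approximations can be evaluated directly and compared against a brute-force numerical evaluation of equation 5.5 for N = 2. The sketch below assumes the standard logistic fold expressions IL(w) = σ⁻¹(s+) − w·s+ and IR(w) = σ⁻¹(s−) − w·s− with s± = (1 ± √(1 − 4/w))/2 (not restated in this section), and uses the illustrative ranges wmax = 16, θi ∈ [−16, 16] of Figure 9B:

```python
import math

WMAX, TMIN, TMAX = 16.0, -16.0, 16.0  # illustrative ranges, as in Figure 9B

def fold_inputs(w):
    """(I_L(w), I_R(w)) for w >= 4, assuming the standard logistic fold expressions."""
    q = math.sqrt(1 - 4 / w)
    I = lambda s: math.log(s / (1 - s)) - w * s
    return I((1 + q) / 2), I((1 - q) / 2)

def P0(N, wmax=WMAX):
    """Small-N approximation (no theta clipping), using the closed form for vol0."""
    s = math.sqrt(wmax * (wmax - 4))
    inner = (wmax * (N * (wmax - 4) - wmax + s + math.log(256) + 4)
             - 8 * (wmax - 1) * math.log(math.sqrt(wmax - 4) + math.sqrt(wmax))
             + 2 * s - 8 * math.log(2))
    vol0 = (2 ** (N - 2) * wmax ** (N - 1) * inner) ** N
    return vol0 / ((2 * wmax) ** (N * N) * (TMAX - TMIN) ** N)

def Pinf(N, wmax=WMAX):
    """Large-N approximation (clipping dominates)."""
    return ((wmax - 4) / (2 * wmax)) ** N

def P_quadrature_N2(n=800):
    """Midpoint-rule evaluation of equation 5.5 for N = 2, including clipping."""
    clip = lambda x: min(max(x, TMIN), TMAX)
    dw, dw1 = (WMAX - 4) / n, 2 * WMAX / n
    vol = 0.0
    for i in range(n):
        w = 4 + (i + 0.5) * dw
        IL, IR = fold_inputs(w)
        for j in range(n):
            w1 = -WMAX + (j + 0.5) * dw1
            vol += clip(IR - min(0.0, w1)) - clip(IL - max(0.0, w1))
    vol = (vol * dw * dw1) ** 2
    return vol / ((2 * WMAX) ** 4 * (TMAX - TMIN) ** 2)
```

With these ranges, Pinf(2) = 0.140625 exactly, P0(2) ≈ 1.91%, and the clipped quadrature value falls below P0, as expected, since clipping can only shrink the integrand.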
Figure 9A shows the scaling of the two approximations P0 and P∞ with N for the particular case wmax = 16, θmin = −24, and θmax = 24 (here and throughout the remainder of the article, such specific values are for illustrative purposes only). Superimposed on these curves are data points taken from 10^6 random samples of parameter space at each N. Note that P0(R_N^N) provides the better fit for N < 5, whereas P∞(R_N^N) provides the better fit for N > 7. The data actually begin to deviate from P0 by N = 2 (since θ′min = IL(16) − 16(N − 1) < θmin = −24 for N > 1), but the largest error occurs in the crossover region between these two curves, where the θ clipping begins to become significant. This becomes even more apparent for the narrower bounds θmin = −16 and θmax = 16 (see Figure 9B). If higher accuracy is required in this crossover region, then the full iterated integrals for vol(R_N^N) must be evaluated (see section A.1). Such calculations can also be used to choose appropriate [wmin, wmax] and [θmin, θmax] parameter ranges so as to maximize P(R_N^N) for a CTRNN of a given size.

6 Calculating Phase Portrait Probabilities

As classification schemes, the effective dimensionality equivalence relation described above and the topological conjugacy equivalence relation normally studied in dynamical systems theory are distinct. A parameter space region containing dynamics with a given effective dimension may include many nonequivalent phase portraits. Conversely, a given phase portrait may appear in multiple regions with different effective dimensionality. However, when the defining conditions are simple enough, we can estimate the densities of individual phase portraits using geometric reasoning similar to that employed above. In this section, we give two examples of such calculations.
Figure 9: Plots of the P(R_N^N) approximations P0(R_N^N) (solid curve) and P∞(R_N^N) (dashed curve) compared to data (black points) from 10^6 random parameter space samples at each N, with wij ∈ [−16, 16] and θi ∈ [−24, 24] (A) or θi ∈ [−16, 16] (B). P0 is most accurate for small N, while P∞ is most accurate for large N. The maximum error occurs at the crossover point between the two approximations and can be significant for smaller θ ranges. The crosshairs at N = 4 in B correspond to the exact value calculated in section A.1.
6.1 Maximal Phase Portraits. The maximum number of equilibrium points that an N-neuron CTRNN can exhibit is 3^N (Beer, 1995). A sufficient (but not quite necessary) condition for the occurrence of this maximal phase portrait P_{3^N}^N is that the range of synaptic input each neuron receives
from the other neurons should fall entirely within its SSIO fold—that is, IL(wii) − min I_i^N ≤ θi ≤ IR(wii) − max I_i^N for all neurons i. Thus, the density of this phase portrait in the parameter space of an N-neuron circuit can be estimated as

P(P_{3^N}^N) = (∫_4^{wmax} ∫_{wmin}^{wmax} ··· ∫_{wmin}^{wmax} [[IR(w) − max I^N]_{θmin}^{θmax} − [IL(w) − min I^N]_{θmin}^{θmax}]_0 dw1 ··· dw_{N−1} dw)^N / ((wmax − wmin)^{N²} (θmax − θmin)^N),
where the outer clipping to 0 in the integrand enforces the condition IW(wii) > range I_i^N ≡ max I_i^N − min I_i^N, ensuring the existence of this phase portrait for a given set of weights. Not surprisingly, this arbitrarily iterated piecewise integral is also difficult to evaluate in general, but can often be evaluated for particular N and parameter ranges. As an example, consider the density of phase portraits with nine equilibrium points, P_9^2, in a two-neuron CTRNN whose weights and biases are in the range [−16, 16]. In this case, the previous integral reduces to

P(P_9^2) = (∫_4^{16} ∫_{−16}^{16} [[IR(w) − max(0, w1)]_{−16}^{16} − [IL(w) − min(0, w1)]_{−16}^{16}]_0 dw1 dw)² / (32^4 · 32²).    (6.1)

This expression can then be evaluated (see section A.3) to obtain

P(P_9^2) = (1152 − 576√3 ln(2 + √3) + 240 ln²(2 + √3))² / 1073741824 ≈ 0.0060%.
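The closed-form value is straightforward to check numerically (an illustrative sketch evaluating the expression above):

```python
import math

# Closed-form density of the nine-equilibrium-point phase portrait P_9^2
# for weights and biases in [-16, 16], as derived in section A.3.
ln_term = math.log(2 + math.sqrt(3))
numerator = (1152 - 576 * math.sqrt(3) * ln_term + 240 * ln_term ** 2) ** 2
P = numerator / 1073741824  # 1073741824 = 32^4 * 32^2 = 2^30

print(f"P(P_9^2) = {100 * P:.4f}%")  # → P(P_9^2) = 0.0060%
```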
For comparison, a random sample of 10^6 two-neuron circuits exhibited the nine-equilibrium-point phase portrait with a probability of 0.0059%. Clearly, the maximal phase portrait is very rare in the parameter space of two-neuron CTRNNs, and only becomes more so in larger networks. However, we can use our knowledge of the SSIO geometry underlying this phase portrait to substantially increase its probability. For example, if we randomly choose the cross-weights, then randomly choose each self-weight such that IW(wii) > range I_i^N (assuming wii > 4), and then randomly choose each bias from the range [IL(wii) − min I_i^N, IR(wii) − max I_i^N], we are guaranteed to obtain the maximal phase portrait.

6.2 Central Oscillation Phase Portraits. Some phase portraits do not have even approximate defining conditions that can be expressed solely in terms of the geometry of SSIO curves. However, sometimes we can still calculate a good estimate of occurrence probability in such cases. Consider, for example, the density of central oscillatory dynamics in a two-neuron CTRNN. By "central," I mean a phase portrait P_{1LC}^2 in which a single stable limit cycle surrounds a single unstable equilibrium point in the neighborhood of a center-crossing configuration. While there are other oscillatory phase portraits that can occur in two-neuron CTRNNs (Borisyuk & Kirillov, 1992; Beer, 1995; Hoppensteadt & Izhikevich, 1997; Ermentrout, 1998), their defining conditions are delicate, and they are therefore likely to make only a small contribution to oscillation probability. By linearizing about the center-crossing point (θ1∗, θ2∗) and solving for the Hopf bifurcation condition (see section A.4), we can derive two pairs of expressions H1±(Δθ2; W, τ) and H2±(Δθ1; W, τ) such that θ1∗ + H1±(θ2 − θ2∗; W, τ) and θ2∗ + H2±(θ1 − θ1∗; W, τ) approximate the Hopf bifurcation curve in net input space. A comparison of these approximations (gray curves) to the actual Hopf curves (dashed curves) is shown in Figure 10. Note that the approximate Hopf curves can extend beyond the inner edge of the saddle node bifurcation "tongues" even though the central oscillations terminate there in saddle node bifurcations on a loop. Note also that the character of these curves changes from elliptical to hyperbolic as one of the self-weights passes through 0. Indeed, when one self-weight becomes sufficiently negative, the oscillatory region splits into two distinct regions and central oscillations no longer exist.

There are several different ways we can use these curves to estimate P(P_{1LC}^2). Perhaps the simplest approach is to approximate the oscillatory region by the rectangle [θ1∗ − H1−(0; W, τ), θ1∗ + H1+(0; W, τ)] × [θ2∗ − H2−(0; W, τ), θ2∗ + H2+(0; W, τ)] obtained from the upper and lower bounds of the Hopf curve approximation at the center-crossing point (see the gray rectangles in Figure 10). If we clip this rectangle to the θ bounds and saddle node bifurcation manifolds, clip its width and height to 0, integrate over the weights and time constants, and divide by the total volume of parameter space, we obtain
P(P_{1LC}^2) = 2/((wmax − wmin)^{N²} (θmax − θmin)^N (τmax − τmin)^N)
× ∫_D [[θ1∗ + H1+(0; W, τ)]_{a1}^{b1} − [θ1∗ − H1−(0; W, τ)]_{a1}^{b1}]_0
× [[θ2∗ + H2+(0; W, τ)]_{a2}^{b2} − [θ2∗ − H2−(0; W, τ)]_{a2}^{b2}]_0 dW dτ,
Figure 10: An illustration of an approximation to the region of central oscillation phase portraits P_{1LC}^2 in two-neuron CTRNNs. The Hopf bifurcation curve approximations θ∗ + H±(Δθ; W, τ) are shown as solid gray curves. The gray rectangles approximate the oscillatory regions themselves. These rectangles are defined by [θ1∗ − H1−(0; W, τ), θ1∗ + H1+(0; W, τ)] × [θ2∗ − H2−(0; W, τ), θ2∗ + H2+(0; W, τ)] clipped to the saddle node bifurcation curves. Here τ1 = τ2 = 1. In A, W = (5, 16; −16, 6), and in B to F, W = (w11, 16; −16, 10), with w11 = 4 (B), w11 = 1 (C), w11 = −1 (D), w11 = −2 (E), and w11 = −3 (F). Although no central oscillations exist in E and F, θ∗ + H±(Δθ; W, τ) still approximate the Hopf bounds of regions of noncentral oscillations. A more accurate nonrectangular approximation to the oscillatory region would be able to take advantage of this fact.
where

a_i = max(θmin, IR(wii) − max(0, wij)) for wii ≥ 4, and a_i = θmin for wii < 4,
b_i = min(θmax, IL(wii) − min(0, wij)) for wii ≥ 4, and b_i = θmax for wii < 4

(with j ≠ i) clip the Hopf curve to the saddle node bifurcation manifolds and θ bounds, and

D : {(w11, w12, w21, w22, τ1, τ2) | wmin ≤ wii ≤ wmax, 0 ≤ w12 ≤ wmax, wmin ≤ w21 ≤ 0, τmin ≤ τi ≤ τmax, κη1, κη2 ≥ 0}
gives the domain of integration (see section A.4 for an explanation of the last two conditions). Since it is well known that oscillations can occur in a two-neuron CTRNN only when the cross-weights are oppositely signed (Ermentrout, 1995), we have assumed above that w12 > 0 and w21 < 0 and doubled the integral to account for the opposite possibility.

We will not attempt to evaluate this integral in closed form. Assuming wij, θi ∈ [−16, 16] and τi ∈ [0.5, 10], numerical integration using a quasirandom Monte Carlo method gives P(P_{1LC}^2) ≈ 0.22%, which accords quite well with the empirical probability of 0.24% observed in a random sample of 10^6 two-neuron circuits. The empirical estimate was obtained by randomly generating 10 initial conditions in the range yi ∈ [−16, 16], integrating each with the forward Euler method for 2500 integration steps of size 0.1 to skip transients, and then integrating for an additional 500 integration steps. If the output of either neuron varied by more than 0.05 during this second integration for any initial condition, then the circuit was classified as oscillatory. Since the empirical value includes both central and noncentral oscillations, it would be expected to be slightly higher.

How does the probability of oscillation P(O^N) scale with N in CTRNNs? By "oscillation," I mean any asymptotic behavior other than an equilibrium point, so that periodic, quasi-periodic, and chaotic dynamics are included. Although this question is beyond the theory described in this letter, we can examine it empirically. A plot of P(O^N) is shown in Figure 11A (black curve), with the P(P_{1LC}^2) value calculated above corresponding to the N = 2 point in this plot. For comparison, the scaling of oscillation probability with N for random center-crossing circuits is also shown (gray curve). Note that both curves monotonically increase toward 100%, although oscillations are clearly much more likely in random center-crossing circuits than they are in completely random CTRNNs (Beer, 1995; Mathayomchan & Beer, 2002).
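The empirical classification protocol described above can be reproduced with a short forward-Euler integrator. This is an illustrative sketch of that protocol; the two-neuron oscillator used in the usage note below is a standard textbook example (self-weights 4.5, oppositely signed unit cross-weights, center-crossing biases θi∗ = −Σ_j wij/2), not a circuit taken from this article:

```python
import math
import random

def sigma(x):
    """Logistic activation, guarded against overflow for large |x|."""
    if x < -30:
        return 0.0
    if x > 30:
        return 1.0
    return 1.0 / (1.0 + math.exp(-x))

def is_oscillatory(W, theta, tau, n_init=10, dt=0.1, transient=2500, window=500, tol=0.05):
    """Forward-Euler version of the classification protocol: integrate from
    random initial states, discard a transient, then flag the circuit if any
    neuron's output varies by more than tol during the final window."""
    rng = random.Random(0)
    n = len(theta)
    for _ in range(n_init):
        y = [rng.uniform(-16, 16) for _ in range(n)]
        lo = [float("inf")] * n
        hi = [float("-inf")] * n
        for step in range(transient + window):
            o = [sigma(y[j] + theta[j]) for j in range(n)]
            if step >= transient:
                lo = [min(a, b) for a, b in zip(lo, o)]
                hi = [max(a, b) for a, b in zip(hi, o)]
            y = [y[i] + dt * (-y[i] + sum(W[i][j] * o[j] for j in range(n))) / tau[i]
                 for i in range(n)]
        if max(h - l for h, l in zip(hi, lo)) > tol:
            return True
    return False
```

For example, is_oscillatory([[4.5, -1.0], [1.0, 4.5]], [-1.75, -2.75], [1.0, 1.0]) classifies the textbook center-crossing oscillator as oscillatory, while a fully disconnected circuit is classified as an equilibrium system.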
Figure 11: Empirical studies of oscillation probability in CTRNNs. (A) The probability P(O^N) of observing "oscillatory" (nonequilibrium) dynamics in general (black) and center-crossing (gray) N-neuron circuits with 10^5 (N ≤ 20) or 10^4 (N > 20) random samples from wij, θi ∈ [−16, 16], τi ∈ [0.5, 10]. Although oscillations are clearly much more common in center-crossing networks, both curves increase with N. Samples were obtained by randomly generating 10 initial conditions in the range yi ∈ [−16, 16], integrating each with the forward Euler method for 2500 integration steps of size 0.1 to skip transients and then integrating for an additional 500 integration steps. If the output of any neuron varied by more than 0.05 during this second integration for any initial condition, the circuit was classified as oscillatory. (B) The probability P(O_M^N) of observing M-dimensional oscillatory dynamics in an N-neuron circuit, using 10^5 parameter space samples for each N and the same sampling protocol as in A. Note that the distribution shifts to the right, broadens, and rises with increasing N.
Interestingly, samples of oscillatory circuits taken from this data set suggest that chaotic dynamics becomes increasingly common for large N, which is consistent with other work on chaos in additive neural networks (Sompolinsky & Crisanti, 1988). In order to gain some insight into the underlying structure of P(O^N), the probability P(O_M^N) that exactly M neurons are oscillating in an N-neuron circuit (with the remaining N − M neurons in saturation) is plotted in Figure 11B. As N increases, note that (1) the most probable oscillating subcircuit size (denoted M∗) increases, (2) the distribution of oscillatory subcircuits broadens, and (3) the probability of M∗ increases. Several factors underlie these features. First, the distribution broadens and shifts to the right because the range of possible subcircuit sizes grows with N. At least within the range of this plot, the shift in peak location with N is roughly M∗ ≈ 1.26√N − 0.074. Second, the probability of M∗ increases because both the number of possible subcircuits (N choose M) ∼ 2^N grows exponentially and the parameter ranges over which a subcircuit of a given size can oscillate increases with N − M. As long as P(O^N) = Σ_M P(O_M^N) < 1, the probability of M∗ can continue to increase. However, as P(O^N) approaches 1, one would expect the probability of M∗ to decrease as a fixed area is distributed across an increasing range of subcircuit sizes. Finally, the quantitative details of P(O_M^N) obviously depend on the relative proportion of different oscillatory regions that fall within the range of allowable bias values.

7 Discussion

Despite the extreme difficulty involved, there is a growing recognition of the need for a more general theory of neural circuits, even a partial one. One path toward such a theory involves the systematic study of the structure of the space of all possible circuits over a given model neuron. This article begins such a study for the relatively simple but still dynamically universal CTRNN model. I have explicitly computed the local bifurcation manifolds of arbitrary CTRNNs in net input space and presented visualizations of their structure for small N. I have also shown how the outermost envelope of saddle node bifurcations formed by the saturation of the sigmoid activation function divides net input space into regions of dynamics with different effective dimensionality, and I have derived analytical approximations to these regions. While these regions by no means exhaust the intricate structure of CTRNN parameter space, they do provide a coarse map on which more detailed behavior can be situated. I have also demonstrated how to calculate estimates of the probabilities of these different regions and of specific phase portraits.

Although saturation is the most obvious feature of σ(·), it is still remarkable how this simple property almost completely dominates the structure
of C_CTRNN(N). As N increases, the probability of finding circuits with saturated subcircuits exponentially overwhelms the probability of finding circuits in which all neurons are dynamically active. However, either activity-dependent regulation (Turrigiano, Abbott, & Marder, 1994; Williams & Noble, in press) or biased sampling (Beer, 1995; Mathayomchan & Beer, 2002) can counteract this trend. The domination of saturation also suggests the utility of applying techniques from the logical analysis of switching networks (Lewis & Glass, 1992; Edwards, 2000; Thomas & Kaufman, 2001) or modular decomposition (Chiel, Beer, & Gallagher, 1999; Beer, Chiel, & Gallagher, 1999) to CTRNNs. While there is no question that the richest dynamics that an N-neuron CTRNN can exhibit occurs when all N neurons are dynamically active, it would be a mistake to dismiss circuits in which some neurons are saturated as somehow less interesting. Even in, say, a 23-neuron circuit, 17-dimensional dynamics can be quite rich. Furthermore, it can be much easier to evolve 17-dimensional dynamics in a 23-neuron circuit than in a 17-neuron circuit, since P(R_17^23) can be considerably higher than P(R_17^17). In addition, N-neuron CTRNNs containing saturated subcircuits may be switchable into different M-dimensional dynamical modes by external inputs.

Despite the progress reported in this letter, much work remains to be done. It would be useful to have an exact closed-form expression for P(R_N^N) for general parameter ranges, as well as either approximate or exact closed-form expressions for P(R_M^N) and P(P_{3^N}^N). Although the piecewise function capabilities of Mathematica (version 5.1 and above) are quite powerful, further advances in the evaluation of the arbitrarily iterated piecewise integrals that arise in these calculations will be necessary. It would also be interesting to derive at least an approximate closed form for the P(P_{1LC}^2) calculations in section 6.2 for general parameter ranges, as well as to improve the accuracy of these calculations by using nonrectangular approximations to the oscillatory region. More generally, it may be possible to calculate probability estimates for at least some of the other two-neuron phase portraits, and perhaps some phase portraits in larger circuits as well. Empirical estimates of the probabilities of many of the two-neuron phase portraits have already been determined for a specific set of parameter ranges (Izquierdo-Torres, 2004). It would also be interesting to characterize the structure of the nonextremal local bifurcation manifolds that appear in Figures 2 and 3. Further examination of the submanifold of center-crossing circuits may be fruitful in this regard. Note that while there is only one true center-crossing point in net input space for any fixed set of weights, each subcircuit region has its own center-crossing submanifold. For example, at the centers of symmetry of the poles in Figure 3 lie lines of two-neuron center-crossing circuits with the third neuron in saturation, while the centers of symmetry of the slabs likewise contain planes of one-neuron center-crossing circuits. Thus,
the center-crossing (sub)circuits form a kind of "skeleton" for C_CTRNN(N) within the "skin" of the extremal saddle node bifurcations. By studying small perturbations from these center-crossing submanifolds, it may be possible to gain some insight into the structure of the nonextremal local bifurcation manifolds that lie between the skeleton and the skin.

Even in its present form, the theory described in this article has significant applications to evolutionary robotics, where CTRNNs are widely used as controllers (Beer & Gallagher, 1992; Harvey et al., 1997; Nolfi & Floreano, 2000). For example, it would be interesting to explore the utility of seeding evolutionary searches with circuits drawn uniformly from R_N^N, which includes a much greater variety of rich dynamics than the highly symmetrical center-crossing circuits that have previously been utilized as seeds (Beer, 1995; Mathayomchan & Beer, 2002). The theory can also be used to select [wmin, wmax] and [θmin, θmax] parameter ranges that maximize P(R_N^N) for a CTRNN of a given size. Studies of neutrality (Izquierdo-Torres, 2004) and evolutionary dynamics in CTRNN parameter spaces (Seys & Beer, 2004) should also benefit from the theoretical picture of C_CTRNN(N) presented here, and the probability calculations we have demonstrated should be directly applicable to empirical studies of the implications of saturation for the evolvability of CTRNNs (Williams & Noble, in press). The theory could also be used to further work on the impact of network architecture on CTRNN dynamics (Psujek, Ames, & Beer, 2006), since different circuit architectures simply correspond to subspaces of C_CTRNN(N) with some connection weights fixed to 0. Finally, given the qualitative similarities between CTRNNs and some models of genetic regulatory networks (de Jong, 2002), the work described here may even have applications in this area.
Returning to the recent results on the parameter spaces of invertebrate pattern generator models with which this article began (Goldman et al., 2001; Golowasch et al., 2002; Prinz et al., 2003, 2004), what might our CTRNN results tell us about the maximum conductance parameter spaces of more biophysically realistic neural models? Although the parameter space structure of such models is considerably more complex, some of the insights gained from studies of CTRNNs, and the mathematical and computational tools developed to support such studies, might carry over. For example, various kinds of saturation effects, which play such an important role in structuring C_CTRNN(N), also occur throughout more realistic models. Thus, one would predict that saturation would similarly dominate the parameter spaces of conductance-based models. Furthermore, one would expect that the proper functioning of such models would impose constraints on the interactions between conductances analogous to those defining the R_N^N region in CTRNNs. More generally, the study of CTRNN parameter space can serve as an exemplar for parameter space studies of conductance-based models, suggesting fruitful questions and directions and making tentative predictions. Indeed, the discovery of sensitivity and robustness to different combinations of
conductance variations, failure of averaging, and multiple instantiability in conductance-based models of invertebrate pattern generators was partly anticipated by earlier studies with CTRNNs (Chiel et al., 1999; Beer et al., 1999). Ultimately, the work described in this article needs to be extended to these more biophysically realistic neural circuits. Perhaps the best place to begin such a complexification would be with two-dimensional model neurons, which can be configured to qualitatively reproduce a wide range of nerve cell phenomenology (Rowat & Selverston, 1997; Rinzel & Ermentrout, 1998). Indeed, one interesting possibility would be to study circuits of two-dimensional model neurons formed from pairs of CTRNN units, since the theory developed here could be directly applied. From there, more complex model neurons can be explored, such as the three-dimensional generic bursting model recently examined by Ghigliazza and Holmes (2004). Only by studying the simplest possible model circuits and then incrementally complicating them as our understanding progresses will we be able to approach a more general theory of neural circuits that can do justice to the parameter space complexity that studies of biological circuits are revealing.

Appendix

A.1 Calculating vol(R_N^N) in Closed Form for Fixed N. In this section, we show how to calculate the value of vol(R_N^N) from equation 5.5 for fixed w and θ ranges and fixed N. We can write vol(R_N^N) = (R_N − L_N)^N, where

R_N ≡ ∫_4^{wmax} ∫_{−wmax}^{wmax} ··· ∫_{−wmax}^{wmax} [IR(w) − min I^N]_{θmin}^{θmax} dw1 ··· dw_{N−1} dw

L_N ≡ ∫_4^{wmax} ∫_{−wmax}^{wmax} ··· ∫_{−wmax}^{wmax} [IL(w) − max I^N]_{θmin}^{θmax} dw1 ··· dw_{N−1} dw.
By construction, I^N has 2^{N−1} elements, with (N−1 choose k) k-wise sums. Since the expression min I^N will evaluate to each element of I^N precisely when all terms in that element are negative, R_N can be written as

R_N = (N−1 choose 0) ∫_4^{wmax} ∫_0^{wmax} ··· ∫_0^{wmax} [IR(w)]_{θmin}^{θmax} dw1 ··· dw_{N−1} dw
+ (N−1 choose 1) ∫_4^{wmax} ∫_{−wmax}^0 ∫_0^{wmax} ··· ∫_0^{wmax} [IR(w) − w1]_{θmin}^{θmax} dw1 ··· dw_{N−1} dw
+ (N−1 choose 2) ∫_4^{wmax} ∫_{−wmax}^0 ∫_{−wmax}^0 ∫_0^{wmax} ··· ∫_0^{wmax} [IR(w) − w1 − w2]_{θmin}^{θmax} dw1 ··· dw_{N−1} dw
+ ···
+ (N−1 choose N−1) ∫_4^{wmax} ∫_{−wmax}^0 ··· ∫_{−wmax}^0 [IR(w) − Σ_{i=1}^{N−1} wi]_{θmin}^{θmax} dw1 ··· dw_{N−1} dw.
N−1 (wmax ) N−1 0
+ +
wmax
4
N−1 (wmax ) N−2 1 N−1 (wmax ) N−3 2
[I R (w)]θθmax dw min
wmax
0 −wmax
4 wmax
[I R (w) − w1 ]θθmax dw1 dw min
0 −wmax
4
0
−wmax
[I R (w) − w1 − w2 ]θθmax dw1 dw2 dw min
+··· +
N−1 (wmax )0 N−1
wmax
0
−wmax
4
···
$
0
−wmax
I R (w) −
N−1
%θmax wi
dw1 · · · dw N−1 dw,
i=1
θmin
which sums to

R_N = Σ_{k=0}^{N−1} (N−1 choose k) (wmax)^{N−k−1} S_R^k

with

S_R^k ≡ ∫_4^{wmax} ∫_{−wmax}^0 ··· ∫_{−wmax}^0 [IR(w) − Σ_{i=1}^k wi]_{θmin}^{θmax} dw1 ··· dwk dw.

A similar derivation can be applied to L_N to obtain

L_N = Σ_{k=0}^{N−1} (N−1 choose k) (wmax)^{N−k−1} S_L^k

with

S_L^k ≡ ∫_4^{wmax} ∫_0^{wmax} ··· ∫_0^{wmax} [IL(w) − Σ_{i=1}^k wi]_{θmin}^{θmax} dw1 ··· dwk dw.
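As a sanity check on this decomposition, for N = 2 the directly integrated R_2 can be compared numerically with its binomial decomposition wmax·S_R^0 + S_R^1. This sketch uses the logistic fold expression for IR as an assumption, and the illustrative ranges wmax = 16, θ ∈ [−16, 16]:

```python
import math

WMAX, TMIN, TMAX = 16.0, -16.0, 16.0

def I_R(w):
    """Right fold input of the SSIO curve for w >= 4 (standard logistic form)."""
    q = math.sqrt(1 - 4 / w)
    s = (1 - q) / 2
    return math.log(s / (1 - s)) - w * s

def clip(x):
    return min(max(x, TMIN), TMAX)

def midpoints(a, b, n):
    h = (b - a) / n
    return [a + (i + 0.5) * h for i in range(n)], h

ws, hw = midpoints(4.0, WMAX, 400)
w1s, h1 = midpoints(-WMAX, WMAX, 800)
negs, hn = midpoints(-WMAX, 0.0, 400)

# Direct midpoint-rule evaluation of R_2.
direct = 0.0
for w in ws:
    IRw = I_R(w)
    direct += sum(clip(IRw - min(0.0, w1)) for w1 in w1s) * hw * h1

# Binomial decomposition: R_2 = C(1,0) * wmax * S_R^0 + C(1,1) * S_R^1.
S0 = sum(clip(I_R(w)) for w in ws) * hw
S1 = 0.0
for w in ws:
    IRw = I_R(w)
    S1 += sum(clip(IRw - w1) for w1 in negs) * hw * hn
decomposed = WMAX * S0 + S1
```

The two evaluations agree to quadrature accuracy because the inner integral over positive w1 contributes exactly the constant factor wmax, just as in the derivation above.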
Unfortunately, it is not clear how to evaluate S_R^k and S_L^k in closed form for general k. There are two obstacles to overcome. First, we must evaluate integrals of the form ∫_{Dk} Σ_{i=1}^k xi for general k, with Dk : {(x1, ..., xk) | −wmax ≤ xi ≤ 0, Σ_{i=1}^k xi < IR(w) − θmax} for S_R^k and Dk : {(x1, ..., xk) | 0 ≤ xi ≤ wmax, Σ_{i=1}^k xi > IL(w) − θmin} for S_L^k. Second, we must evaluate integrals of the form γ_R^k ≡ ∫_4^{wmax} (IR(w))^k dw and γ_L^k ≡ ∫_4^{wmax} (IL(w))^k dw for general k. Fortunately, using cylindrical decomposition, Mathematica (Wolfram, 2003) can symbolically tabulate the values of
such integrals for fixed k and fixed w and θ ranges. Thus, for modest N, it is possible to explicitly calculate vol(R_N^N) in closed form. For example, consider the point of greatest discrepancy between the approximate curves and the empirical data in Figure 9B, which occurs at N = 4. For wmax = 16, θmin = −16, and θmax = 16, we obtain
vol(R_4^4) = (4096 (S_R^0 − S_L^0) + 768 (S_R^1 − S_L^1) + 48 (S_R^2 − S_L^2) + (S_R^3 − S_L^3))^4
           = ((−7798784 γ_R^0 − 557056 γ_R^1 + 10752 γ_R^2 + 128 γ_R^3 − 2 γ_R^4 − 7798784 γ_L^0 + 557056 γ_L^1 + 19968 γ_L^2 + 256 γ_L^3 + γ_L^4) / 331776)^4,
which numerically evaluates to a probability of P(R_4^4) ≈ 0.375%. This theoretical value (the crosshairs in Figure 9B) matches the empirical value of 0.376% quite closely. Details of these calculations, as well as the tabulations of S_R^k, S_L^k, γ_R^k, and γ_L^k necessary to support them, can be found in the electronic supplement (Beer, 2005).

A.2 Calculating the Approximation vol_0(R_N^N) in Closed Form. In section 5, a closed-form expression for vol_0(R_N^N) was given. Here, we show the derivation of this expression. Since the integrand clipping can be dropped by assumption, the integral 5.5 becomes

vol_0(R_N^N) = ( ∫_4^{wmax} ∫_{−wmax}^{wmax} ··· ∫_{−wmax}^{wmax} [(I_R(w) − min I_N) − (I_L(w) − max I_N)] dw_1 ··· dw_{N−1} dw )^N,
which is equivalent to
( ∫_4^{wmax} ∫_{−wmax}^{wmax} ··· ∫_{−wmax}^{wmax} [I_W(w) + range I_N] dw_1 ··· dw_{N−1} dw )^N,
where range I_N = max I_N − min I_N = |w_1| + ··· + |w_{N−1}|. Thus, we can rewrite this integral as
( ∫_4^{wmax} ∫_{−wmax}^{wmax} ··· ∫_{−wmax}^{wmax} I_W(w) dw_1 ··· dw_{N−1} dw
  + ∫_4^{wmax} ∫_{−wmax}^{wmax} ··· ∫_{−wmax}^{wmax} Σ_{i=1}^{N−1} |w_i| dw_1 ··· dw_{N−1} dw )^N.
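The identity range I_N = |w_1| + ··· + |w_{N−1}| used above can be checked directly, assuming (as the 2^{N−1}-element construction suggests) that I_N is the set of all subset sums of the inner weights w_1, ..., w_{N−1}. A Python sketch:

```python
import itertools
import random

random.seed(0)
ws = [random.uniform(-16.0, 16.0) for _ in range(5)]   # N - 1 = 5 inner weights

# I_N read as the set of all 2^(N-1) subset sums (the empty sum contributes 0);
# this reading of I_N is an assumption of the check, inferred from the text.
I_N = [sum(c) for r in range(len(ws) + 1)
       for c in itertools.combinations(ws, r)]
assert len(I_N) == 2 ** len(ws)

# max I_N is the sum of the positive weights, min I_N the sum of the negative
# ones, so range I_N = |w_1| + ... + |w_{N-1}|.
assert abs((max(I_N) - min(I_N)) - sum(abs(w) for w in ws)) < 1e-9
```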
The first term can be reduced to a one-dimensional integral of I_W(w) that can be evaluated explicitly to obtain
∫_4^{wmax} ∫_{−wmax}^{wmax} ··· ∫_{−wmax}^{wmax} I_W(w) dw_1 ··· dw_{N−1} dw = (2wmax)^{N−1} ∫_4^{wmax} I_W(w) dw
= (2wmax)^{N−1} [ wmax(√(wmax(wmax − 4)) + ln 256)/2
                 + (2√(wmax(wmax − 4)) − 8(wmax − 1) ln(√(wmax − 4) + √wmax) − ln 256)/2 ].
In the second term, we can eliminate the absolute value operations by splitting the integral into 2^{N−1} parts, all of which have the same value. Then we distribute the integrals over the sum and evaluate to obtain
∫_4^{wmax} ∫_{−wmax}^{wmax} ··· ∫_{−wmax}^{wmax} Σ_{i=1}^{N−1} |w_i| dw_1 ··· dw_{N−1} dw
= 2^{N−1} ∫_4^{wmax} ∫_0^{wmax} ··· ∫_0^{wmax} Σ_{i=1}^{N−1} w_i dw_1 ··· dw_{N−1} dw
= 2^{N−1} Σ_{i=1}^{N−1} ∫_4^{wmax} ∫_0^{wmax} ··· ∫_0^{wmax} w_i dw_1 ··· dw_{N−1} dw
= 2^{N−1} (N − 1) (wmax − 4) (wmax)^{N−2} (wmax)^2 / 2.
Raising the sum of the previous two expressions to the Nth power and simplifying gives

vol_0(R_N^N) = ( 2^{N−2} (wmax)^{N−1} [ wmax(N(wmax − 4) − wmax + √(wmax(wmax − 4)) + ln 256 + 4)
                 − 8(wmax − 1) ln(√(wmax − 4) + √wmax) + 2√(wmax(wmax − 4)) − 8 ln 2 ] )^N.
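This closed form can be spot-checked for N = 2 against direct numerical quadrature. The sketch below assumes I_W(w) = √(w(w−4)) − 4 ln(√(w−4) + √w) + 4 ln 2, the width of the bistable input interval for a single neuron with self-weight w; that expression is inferred from the standard fold-point analysis and is not restated in this appendix, so it is an assumption of this check:

```python
import math

def I_W(w):
    # Assumed bistable-range width for self-weight w >= 4 (inferred from the
    # single-neuron fold analysis; an assumption of this check).
    return (math.sqrt(w * (w - 4.0))
            - 4.0 * math.log(math.sqrt(w - 4.0) + math.sqrt(w))
            + 4.0 * math.log(2.0))

N, wmax = 2, 16.0

# For N = 2 the inner w1-integral of I_W(w) + |w1| over [-wmax, wmax] is
# 2*wmax*I_W(w) + wmax**2; integrate over w in [4, wmax] with composite
# Simpson, then raise to the power N.
n = 20000
h = (wmax - 4.0) / n
def inner(w):
    return 2.0 * wmax * I_W(w) + wmax ** 2
s = inner(4.0) + inner(wmax)
for i in range(1, n):
    s += (4.0 if i % 2 else 2.0) * inner(4.0 + i * h)
direct = (s * h / 3.0) ** N

# Closed form from the text, specialized to N = 2, wmax = 16.
S = math.sqrt(wmax * (wmax - 4.0))
L = math.log(math.sqrt(wmax - 4.0) + math.sqrt(wmax))
closed = (2.0 ** (N - 2) * wmax ** (N - 1)
          * (wmax * (N * (wmax - 4.0) - wmax + S + math.log(256.0) + 4.0)
             - 8.0 * (wmax - 1.0) * L + 2.0 * S - 8.0 * math.log(2.0))) ** N

assert abs(direct - closed) / closed < 1e-4
```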
A.3 Calculating P(P_9^2) for Two-Neuron Circuits. In section 6.1, a value for P(P_9^2) was given. Here we show the derivation of this value. Given 4 ≤ w ≤ 16 and −16 ≤ w_1, θ_1, θ_2 ≤ 16, the integrand of equation 6.1 can be simplified to obtain

[I_R(w) − max(0, w_1)]_{−16}^{16} − [I_L(w) − min(0, w_1)]_{−16}^{16}
  = { I_W(w) − w_1   if 0 ≤ w_1 ≤ I_W(w)
    { I_W(w) + w_1   if −I_W(w) ≤ w_1 ≤ 0
    { 0              otherwise.
We can then split the inner integral of equation 6.1 and evaluate to obtain
P(P_9^2) = (1/32^6) ( ∫_4^{16} [ ∫_{−I_W(w)}^{0} (I_W(w) + w_1) dw_1 + ∫_0^{I_W(w)} (I_W(w) − w_1) dw_1 ] dw )^2
         = (1/32^6) ( ∫_4^{16} I_W(w)^2 dw )^2
         = (1152 − 576√3 ln(2 + √3) + 240 ln^2(2 + √3))^2 / 1073741824.
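The closed-form inner integral can be checked numerically. As in the previous check, the sketch assumes I_W(w) = √(w(w−4)) − 4 ln(√(w−4) + √w) + 4 ln 2 (inferred from the single-neuron fold analysis, not restated here):

```python
import math

def I_W(w):
    # Assumed bistable-range width (an assumption of this check).
    return (math.sqrt(w * (w - 4.0))
            - 4.0 * math.log(math.sqrt(w - 4.0) + math.sqrt(w))
            + 4.0 * math.log(2.0))

# Composite Simpson for the inner integral over w in [4, 16].
n = 20000
h = 12.0 / n
s = I_W(4.0) ** 2 + I_W(16.0) ** 2
for i in range(1, n):
    s += (4.0 if i % 2 else 2.0) * I_W(4.0 + i * h) ** 2
integral = s * h / 3.0

L = math.log(2.0 + math.sqrt(3.0))
closed = 1152.0 - 576.0 * math.sqrt(3.0) * L + 240.0 * L * L
assert abs(integral - closed) < 0.05

# Squaring and normalizing by 32^6 (the six-parameter box volume) then gives
# the probability P(P_9^2).
P = closed ** 2 / 32 ** 6
assert P < 1e-4
```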
A.4 Calculating an Approximation to the Hopf Curve for Two-Neuron Circuits. In section 6.2, approximations H_1^±(Δθ_2; W, τ) and H_2^±(Δθ_1; W, τ) to the Hopf bifurcation curve in two-neuron circuits were utilized. Here we give the derivation of these expressions. Within a small perturbation Δθ of a center-crossing circuit, we can replace σ(x) with its linearization σ̂(x) = x/4 + 1/2 in order to obtain the linearized dynamics τ ẏ = −y + W · σ̂(y + θ* + Δθ). By solving for the simultaneous zeroes of these equations, we find that the central equilibrium point occurs at y = −θ* + ȳ, where, for a two-neuron circuit, we have

ȳ_1 = ((4w_11 − w_11 w_22 + w_12 w_21) Δθ_1 + 4 w_12 Δθ_2) / (w_11 w_22 − 4w_11 − 4w_22 − w_12 w_21 + 16)

ȳ_2 = ((4w_22 − w_11 w_22 + w_12 w_21) Δθ_2 + 4 w_21 Δθ_1) / (w_11 w_22 − 4w_11 − 4w_22 − w_12 w_21 + 16).
If Ĵ is the Jacobian of the linearized system evaluated at ȳ, then the Hopf bifurcation condition (the vanishing of the trace of Ĵ) is given by

(w_11 σ̂′(ȳ_1 + θ_1* + Δθ_1) − 1)/τ_1 + (w_22 σ̂′(ȳ_2 + θ_2* + Δθ_2) − 1)/τ_2 = 0,

where σ̂′(x) = 1/4 − x^2/16 is the quadratic approximation to σ′(x).
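The quadratic approximation can be checked numerically, assuming the standard logistic σ(x) = 1/(1 + e^{−x}) (so σ′ = σ(1 − σ), with maximum 1/4, consistent with the w = 4 threshold used throughout):

```python
import math

def sigma_prime(x):
    # Derivative of the logistic sigmoid (assumed form of sigma in this check).
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# sigma'(x) = 1/4 - x^2/16 + x^4/96 - ..., so the quadratic truncation
# 1/4 - x^2/16 matches sigma' to fourth order near x = 0.
for x in (0.05, 0.1, 0.2, 0.4):
    quadratic = 0.25 - x * x / 16.0
    assert abs(sigma_prime(x) - quadratic) < x ** 4
```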
Substituting the expressions for ȳ_i, solving for Δθ_1 and Δθ_2, and simplifying, we obtain

H_1^±(Δθ_2; W, τ) = (α Δθ_2 ± √(β(χ(Δθ_2)^2 + κη_1))) / (2η_1),
H_2^±(Δθ_1; W, τ) = (α Δθ_1 ± √(β(χ(Δθ_1)^2 + κη_2))) / (2η_2),
where

α  = 2 w_22 w_21 τ_1 (w_11 − 4) + 2 w_11 w_12 τ_2 (w_22 − 4)
β  = (w_12 w_21 + 4(w_22 − 4) − w_11 (w_22 − 4))^2
χ  = −4 w_11 w_22 τ_1 τ_2
κ  = τ_2 (w_11 − 4) + τ_1 (w_22 − 4)
η_1 = w_11 τ_2 (w_22 − 4)^2 + w_22 τ_1 (w_21)^2
η_2 = w_22 τ_1 (w_11 − 4)^2 + w_11 τ_2 (w_12)^2.

Note that these H expressions are real valued only when β(χ(Δθ_2)^2 + κη_1), β(χ(Δθ_1)^2 + κη_2) ≥ 0 and that they have singularities at η_1 = 0 and η_2 = 0. The last two conditions in the domain of integration D given in the main text arise from the requirement that κη_1, κη_2 ≥ 0 for the H functions to be real valued when Δθ_1 = Δθ_2 = 0, since β is strictly positive.

Acknowledgments

I thank Jeff Ames, Michael Branicky, Alan Calvitti, Hillel Chiel, Bard Ermentrout, Eldan Goldenberg, Robert Haschke, and Eduardo Izquierdo-Torres for their feedback on an earlier draft of this letter. This research was supported in part by NSF grant EIA-0130773.

References

Beer, R. D. (1995). On the dynamics of small continuous-time recurrent neural networks. Adaptive Behavior, 3, 469–509.
Beer, R. D. (2005). Electronic supplement to "Parameter space structure of continuous-time recurrent neural networks." Available online at http://mypage.iu.edu/~rdbeer/CTRNNSupplement.nb.
Beer, R. D., Chiel, H. J., & Gallagher, J. C. (1999). Evolution and analysis of model CPGs for walking II. General principles and individual variability. J. Computational Neuroscience, 7, 119–147.
Beer, R. D., & Gallagher, J. C. (1992). Evolving dynamical neural networks for adaptive behavior. Adaptive Behavior, 1, 91–122. Borisyuk, R. M., & Kirillov, A.B. (1992). Bifurcation analysis of a neural network model. Biological Cybernetics, 66, 319–325. Blum, E. K., & Wang, X. (1992). Stability of fixed points and periodic orbits and bifurcations in analog neural networks. Neural Networks, 5, 577–587. Chiel, H. J., Beer, R. D., & Gallagher, J. C. (1999). Evolution and analysis of model CPGs for walking I. Dynamical modules. J. Computational Neuroscience, 7, 99–118. Chow, T. W. S., & Li, X.-D. (2000). Modeling of continuous time dynamical systems with input by recurrent neural networks. IEEE Trans. on Circuits and Systems—I: Fundamental Theory and Applications, 47, 575–578. Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. Systems, Man and Cybernetics, 13, 813–825. Cowan, J. D., & Ermentrout, G. B. (1978). Some aspects of the eigenbehavior of neural nets. In S. A. Levin (Ed.), Studies in mathematical biology 1: Cellular behavior and the development of pattern (pp. 67–117). Providence, RI: Mathematical Association of America. de Jong, H. (2002). Modeling and simulation of genetic regulatory networks: A literature review. J. Computational Biology, 9, 67–103. Dunn, N. A., Lockery, S. R., Pierce-Shimomura, J. T., & Conery, J. S. (2004). A neural network model of chemotaxis predicts functions of synaptic connections in the nematode Caenorhabditis elegans. J. Computational Neuroscience, 17, 137–147. Edwards, R. (2000). Analysis of continuous-time switching networks. Physica D, 146, 165–199. Ermentrout, B. (1995). Phase-plane analysis of neural activity. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 732–738). Cambridge, MA: MIT Press. Ermentrout, B. (1998). Neural networks as spatio-temporal pattern-forming systems. 
Reports on Progress in Physics, 61, 353–430. Funahashi, K. I., & Nakamura, Y. (1993). Approximation of dynamical systems by continuous time recurrent neural networks. Neural Networks, 6, 801–806. Getting, P. A. (1989). Emerging principles governing the operation of neural networks. Annual Review of Neuroscience, 12, 185–204. Ghigliazza, R. M., & Holmes, P. (2004). Minimal models of bursting neurons: How multiple currents, conductances and timescales affect bifurcation diagrams. SIAM J. Applied Dynamical Systems, 3, 636–670. Goldman, M. S., Golowasch, J., Marder, E., & Abbott, L. F. (2001). Global structure, robustness and modulation of neuronal models. J. Neuroscience, 21, 5229–5238. Golowasch, J., Goldman, M. S., Abbott, L. F., & Marder, E. (2002). Failure of averaging in the construction of a conductance-based neural model. J. Neurophysiol., 87, 1129–1131. Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1, 17–61. Guckenheimer, J., Myers, M., & Sturmfels, B. (1997). Computing Hopf bifurcations I. SIAM J. Numerical Analysis, 84, 1–21.
Harvey, I., Husbands, P., Cliff, D., Thompson, A., & Jacobi, N. (1997). Evolutionary robotics: The Sussex approach. Robotics and Autonomous Systems, 20, 205–224. Haschke, R. (2004). Bifurcations in discrete-time neural networks: Controlling complex network behavior with inputs. Unpublished doctoral dissertation, University of Bielefeld. Haschke, R., & Steil, J. J. (2005). Input space bifurcation manifolds of recurrent neural networks. Neurocomputing, 64C, 25–38. Hirsch, M. (1989). Convergent activation dynamics in continuous time networks. Neural Networks, 2, 331–349. Hopfield, J. J. (1984). Neurons with graded response properties have collective computational properties like those of two-state neurons. Proc. National Academy of Sciences, 81, 3088–3092. Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. Berlin: Springer. Izquierdo-Torres, E. (2004). Evolving dynamical systems: Nearly neutral regions in continuous fitness landscapes. Unpublished master’s thesis, University of Sussex. Kimura, M., & Nakano, R. (1998). Learning dynamical systems by recurrent neural networks from orbits. Neural Networks, 11, 1589–1599. Kuznetsov, Y. A. (1998). Elements of applied bifurcation theory (2nd ed.). Berlin: Springer. Lewis, J. E., & Glass, L. (1992). Nonlinear dynamics and symbolic dynamics of neural networks. Neural Computation, 4, 621–642. Marder, E., & Abbott, L. F. (1995). Theory in motion. Current Opinion in Neurobiology, 5, 832–840. Marder, E., & Calabrese, R. L. (1996). Principles of rhythmic motor pattern generation. Physiological Reviews, 76, 687–717. Mathayomchan, B., & Beer, R. D. (2002). Center-crossing recurrent neural networks for the evolution of rhythmic behavior. Neural Computation, 14, 2043–2051. Nolfi, S., & Floreano, D. (2000). Evolutionary robotics. Cambridge, MA: MIT Press. Pasemann, F. (2002). Complex dynamics and the structure of small neural networks. Network: Computation in Neural Systems, 13, 195–216. Prinz, A. 
A., Billimoria, C. P., & Marder, E. (2003). Alternative to hand-tuning conductance-based models: Construction and analysis of databases of model neurons. J. Neurophysiol., 90, 3998–4015. Prinz, A. A., Bucher, D., & Marder, E. (2004). Similar network activity from disparate circuit parameters. Nature Neuroscience, 7, 1345–1352. Psujek, S., Ames, J., & Beer, R. D. (2006). Connection and coordination: The interplay between architecture and dynamics in evolved model pattern generators. Neural Computation, 18, 729–747. Rinzel, J., & Ermentrout, B. (1998). Analysis of neural excitability and oscillations. In C. Koch and I. Segev (Eds.), Methods in neuronal modeling (2nd ed., pp. 251–291). Cambridge, MA: MIT Press. Rowat, P. F., & Selverston, A. I. (1997). Oscillatory mechanisms in pairs of neurons with fast inhibitory synapses. J. Computational Neuroscience, 4, 103–127. Selverston, A. I. (1980). Are central pattern generators understandable? Behavioral and Brain Sciences, 3, 535–571. Seys, C. W., & Beer, R. D. (2004). Evolving walking: The anatomy of an evolutionary search. In S. Schaal, A. Ijspeert, A. Billard, S. Vijayakumar, J. Hallam, & J.-A. Meyer
(Eds.), From animals to animats 8: Proceedings of the Eighth International Conference on the Simulation of Adaptive Behavior (pp. 357–363). Cambridge, MA: MIT Press. Sompolinsky, H., & Crisanti, A. (1988). Chaos in random neural networks. Physical Review Letters, 61(3), 259–262. Thomas, R., & Kaufman, M. (2001). Multistationarity, the basis of cell differentiation and memory. II. Logical analysis of regulatory networks in terms of feedback circuits. Chaos, 11, 180–195. Tiňo, P., Horne, B. G., & Giles, C. L. (2001). Attractive periodic sets in discrete-time recurrent neural networks (with emphasis on fixed-point stability and bifurcations in two-neuron networks). Neural Computation, 13, 1379–1414. Turrigiano, G., Abbott, L. F., & Marder, E. (1994). Activity-dependent changes in the intrinsic properties of cultured neurons. Science, 264, 974–976. Williams, H., & Noble, J. (in press). Homeostatic plasticity improves signal propagation in continuous-time recurrent neural networks. Biosystems. Wolfram, S. (2003). The Mathematica book (5th ed.). Champaign, IL: Wolfram Media. Zhaojue, Z., Schieve, W. C., & Das, P. K. (1993). Two neuron dynamics and adiabatic elimination. Physica D, 67, 224–236.
Received October 11, 2005; accepted April 25, 2006.
LETTER
Communicated by Maxim Bazhenov
Bifurcation Analysis of Jansen's Neural Mass Model

François Grimbert [email protected]

Olivier Faugeras [email protected]

Odyssée Laboratory, INRIA/ENPC/ENS, INRIA Sophia-Antipolis, 06902 Sophia Antipolis, France
We present a mathematical model of a neural mass developed by a number of people, including Lopes da Silva and Jansen. This model features three interacting populations of cortical neurons and is described by a six-dimensional nonlinear dynamical system. We address some aspects of its behavior through a bifurcation analysis with respect to the input parameter of the system. This leads to a compact description of the oscillatory behaviors observed in Jansen and Rit (1995) (alpha activity) and Wendling, Bellanger, Bartolomei, and Chauvel (2000) (spike-like epileptic activity). In the case of small or slow variation of the input, the model can even be described as a binary unit. Again using the bifurcation framework, we discuss the influence of other parameters of the system on the behavior of the neural mass model.

1 Introduction

Jansen's neural mass model is based on the work of Lopes da Silva, Hoeks, and Zetterberg (1974), Lopes da Silva, van Rotterdam, Barts, van Heusden, and Burr (1976), and van Rotterdam, Lopes da Silva, van den Ende, Viergever, and Hermans (1982). They developed a biologically inspired mathematical framework to simulate the spontaneous electrical activities of neuronal assemblies recorded by EEG, with particular interest in alpha activity. In their model, populations of neurons interact by excitation and inhibition and can, in effect, produce alpha activity. Jansen, Zouridakis, and Brandt (1993) and Jansen and Rit (1995) discovered that this model was also able to simulate evoked potentials, that is, EEG activities observed after a sensory stimulation (such as a flash of light or a sound). More recently, Wendling, Bellanger, Bartolomei, and Chauvel (2000) used this model to synthesize activities very similar to those observed in epileptic patients, and David and Friston (2003) and David, Cosmelli, and Friston (2004) studied connectivity between cortical areas with a similar framework.
The contribution of this letter is a fairly detailed description of the behavior of this particular neural mass model as a function of its input. This

Neural Computation 18, 3052–3068 (2006)
© 2006 Massachusetts Institute of Technology
Figure 1: (a) Neural mass model of a cortical unit. It features a population of pyramidal cells interacting with two populations of interneurons—one excitatory (left branch) and the other inhibitory (right branch). (b) Block representation of a unit. The h boxes simulate synapses between the populations of neurons. Sigm boxes simulate cell bodies of neurons by transforming the membrane potential of a population into an output firing rate. The constants C_i model the strength of the synaptic connections between populations.
description is grounded in the mathematics of dynamical systems and bifurcation theory. We briefly recall the model in section 2 and describe in section 3 the properties of the associated dynamical system.

2 Description of the Model

The model features a population of pyramidal neurons (see the central part of Figure 1a) that receive excitatory and inhibitory feedback from local interneurons and an excitatory input from neighboring cortical units and subcortical structures like the thalamus. Actually, the excitatory feedback must be considered as coming from both local pyramidal neurons and genuine excitatory interneurons like spiny stellate cells.

2.1 Equations of the Model

Figure 1b is a translation of Figure 1a into the language of system theory. It represents the mathematical operations
performed inside such a cortical unit. The excitatory input is represented by an arbitrary average firing rate p(t), which can be random (accounting for a nonspecific background activity) or deterministic, accounting for some specific activity in other cortical units. The three families—pyramidal neurons, excitatory interneurons, and inhibitory interneurons—and the synaptic interactions between them are modeled by different systems. The postsynaptic systems P_i, i = 1, 2, 3 (labeled h_e(t) or h_i(t) in the figure) convert the average firing rate describing the input to a population into an average excitatory (EPSP) or inhibitory (IPSP) postsynaptic potential. From the signal processing standpoint, they are linear stationary systems that are described either by a convolution with an impulse response function or, equivalently, by a second-order linear differential equation. They were proposed by van Rotterdam et al. (1982) in order to reproduce well the characteristics of real EPSPs and IPSPs. The impulse response function is of the form

h(t) = { αβ t e^{−βt}   t ≥ 0
       { 0              t < 0.
In other words, if x(t) is the input to the system, its output y(t) is the convolution product h ∗ x(t). The constants α and β are different in the excitatory and inhibitory cases: α, expressed in millivolts, determines the maximal amplitude of the postsynaptic potentials; β, expressed in s^{−1}, lumps together the characteristic delays of the synaptic transmission, that is, the time constant of the membrane and the different delays in the dendritic tree (Freeman, 1975; Jansen et al., 1993). The corresponding differential equation is

ÿ(t) = αβ x(t) − 2β ẏ(t) − β^2 y(t).   (2.1)
In the excitatory (resp. inhibitory) case, we have α = A, β = a (resp. α = B, β = b). This second-order differential equation can be conveniently rewritten as a system of two first-order equations:

ẏ(t) = z(t)
ż(t) = αβ x(t) − 2β z(t) − β^2 y(t).   (2.2)
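One can confirm numerically that the impulse response h(t) = αβ t e^{−βt} satisfies equation 2.1 with x ≡ 0 for t > 0 (the impulse input enters only through the initial condition). A Python sketch; the particular α, β pair is simply the excitatory one quoted in section 2.2:

```python
import math

alpha, beta = 3.25, 100.0              # A and a from section 2.2 (excitatory case)

def h(t):
    return alpha * beta * t * math.exp(-beta * t)

# For t > 0 the impulse response must satisfy h'' + 2*beta*h' + beta^2*h = 0,
# i.e. equation 2.1 with x(t) = 0. Check via central finite differences.
dt = 1e-6
for t in (0.001, 0.01, 0.05):
    h1 = (h(t + dt) - h(t - dt)) / (2.0 * dt)
    h2 = (h(t + dt) - 2.0 * h(t) + h(t - dt)) / dt ** 2
    residual = h2 + 2.0 * beta * h1 + beta ** 2 * h(t)
    # The residual is zero up to finite-difference and rounding error, which is
    # tiny compared to the size of the individual terms.
    assert abs(residual) < 1e-3 * (abs(h2) + beta ** 2 * abs(h(t)) + 1.0)
```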
The sigmoid systems introduce a nonlinear component in the model. They are the gain functions that transform the average membrane potential of a neural population into an average firing rate (see, e.g., Gerstner & Kistler, 2002): Sigm(v) =
Sigm(v) = (νmax/2) (1 + tanh((r/2)(v − v_0))) = νmax / (1 + e^{r(v_0 − v)}),
where νmax is the maximum firing rate of the families of neurons, v_0 is the value of the potential for which a 50% firing rate is achieved, and r is the slope of the sigmoid at v_0; v_0 can be viewed as either a firing threshold or the excitability of the populations. This sigmoid transformation approximates the functions proposed by the neurophysiologist Walter Freeman (1975) to model the cell body action of a population. The connectivity constants C_1, ..., C_4 account for the number of synapses established between two neuron populations. We will see that they can be reduced to a single parameter C. There are three main variables in the model—the outputs of the three postsynaptic boxes, noted y_0, y_1, and y_2 (see Figure 1b). We also consider their derivatives ẏ_0, ẏ_1, ẏ_2, noted y_3, y_4, and y_5, respectively. If we write two equations similar to equation 2.2 for each postsynaptic system, we obtain a system of six first-order differential equations that describes Jansen's neural mass model:

ẏ_0(t) = y_3(t)
ẏ_3(t) = A a Sigm[y_1(t) − y_2(t)] − 2a y_3(t) − a^2 y_0(t)
ẏ_1(t) = y_4(t)
ẏ_4(t) = A a { p(t) + C_2 Sigm[C_1 y_0(t)] } − 2a y_4(t) − a^2 y_1(t)
ẏ_2(t) = y_5(t)
ẏ_5(t) = B b C_4 Sigm[C_3 y_0(t)] − 2b y_5(t) − b^2 y_2(t).   (2.3)

We focus on the variable y = y_1 − y_2, the membrane potential of the main family of neurons (see Figure 1b). We think of this quantity as the output of the unit because in the cortex the pyramidal cells are the main vectors of long-range cortico-cortical connections. Besides, their electrical activity corresponds to the EEG signal: pyramidal neurons throw their apical dendrites to the superficial layers of the cortex, where the postsynaptic potentials are summed, accounting for the essential part of the EEG activity (Kandel, Schwartz, & Jessel, 2000).

2.2 Numerical Values of the Parameters

The parameters A, B, a, and b have been adjusted by van Rotterdam et al.
(1982) to reproduce some basic properties of real postsynaptic potentials and to make the system produce alpha activity. These authors set A = 3.25 mV, B = 22 mV, a = 100 s^{−1}, and b = 50 s^{−1}. The excitability of cortical neurons can vary as a function of the action of several substances, and v_0 could potentially take different values, though we will use v_0 = 6 mV, as suggested by Jansen on the basis of experimental studies due to Freeman (1987). The works of the latter also suggest that νmax = 5 s^{−1} and r = 0.56 mV^{−1}, the values used by Jansen and Rit (1995). The connectivity constants C_i, i = 1, ..., 4 are proportional to the average number of synapses between populations. On the basis of several neuroanatomical studies (Braitenberg & Schüz, 1998, among others) where these quantities had been estimated by counting synapses, Jansen and Rit succeeded in reducing them to fractions of a single parameter C:

C_1 = C,  C_2 = 0.8C,  C_3 = C_4 = 0.25C.

Jansen and Rit varied C to observe alpha-like activity and obtained it for C = 135 (see Figure 2).

Figure 2: Activities of the unit shown in Figure 1 when simulated with a uniformly distributed white noise (between 120 and 320 Hz) as input. The different curves show different activities depending on the value of the parameter C. The third curve from the top looks like alpha activity and has been obtained for C = 135 (from Jansen & Rit, 1995).
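With all constants now fixed, system 2.3 can be integrated directly. A minimal forward-Euler sketch in Python, driven by the uniform 120–320 Hz noise used by Jansen and Rit; the seed, step size, and duration are arbitrary choices of this sketch, not values from the text:

```python
import math
import random

# Parameters from sections 2.1 and 2.2 (A, B in mV; a, b in 1/s).
A, B, a, b = 3.25, 22.0, 100.0, 50.0
vmax, r, v0, C = 5.0, 0.56, 6.0, 135.0
C1, C2, C3, C4 = C, 0.8 * C, 0.25 * C, 0.25 * C

def Sigm(v):
    return vmax / (1.0 + math.exp(r * (v0 - v)))

random.seed(0)
y = [0.0] * 6                                  # (y0, y1, y2, y3, y4, y5)
dt = 1e-4                                      # 0.1 ms Euler step
trace = []
for _ in range(int(2.0 / dt)):                 # 2 s of simulated activity
    p = random.uniform(120.0, 320.0)           # nonspecific background input
    dy = [
        y[3],
        y[4],
        y[5],
        A * a * Sigm(y[1] - y[2]) - 2.0 * a * y[3] - a * a * y[0],
        A * a * (p + C2 * Sigm(C1 * y[0])) - 2.0 * a * y[4] - a * a * y[1],
        B * b * C4 * Sigm(C3 * y[0]) - 2.0 * b * y[5] - b * b * y[2],
    ]
    y = [yi + dt * di for yi, di in zip(y, dy)]
    trace.append(y[1] - y[2])                  # EEG-like output y = y1 - y2

late = trace[len(trace) // 2:]                 # discard the initial transient
assert all(abs(v) < 50.0 for v in late)        # membrane potential stays bounded
assert max(late) - min(late) > 0.5             # sustained alpha-like oscillation
```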
In summary, previous work shows that the following set of parameters allows the neural mass model described by equations 2.3 to produce a set of EEG-like signals:

A = 3.25, B = 22, a = 100, b = 50, v_0 = 6, C = 135.   (2.4)
We show in section 3.4 that the behavior of the neural mass model is fairly sensitive to the choice of the values of these parameters. Indeed, changes as small as 5% in these values produce some fairly different behaviors. The quantity p represents the lumped activity of the brain areas connected to the unit. Jansen and Rit (1995) chose p(t) to be a uniformly distributed noise ranging from 120 to 320 pulses per second, as they wanted to model nonspecific input (they used the term background spontaneous activity). This noise dynamics allowed them to produce alpha-like activity. Similarly, Wendling and his colleagues (2000) used a white gaussian noise (mean 90 and standard deviation 30) for p(t) and observed the emission of spikes reminiscent of an epileptic activity. We show in the next section that these two different behaviors can be nicely accounted for by a geometric study of system 2.3 through its bifurcations.

3 Bifurcations and Oscillations

In this section we consider p as a parameter of the system and propose to study the behavior of a unit when p varies. We therefore study the dynamical system 2.3, with all parameters but p being kept constant and equal to the values set by Jansen and Rit (1995) (see equation 2.4). In section 3.4 we extend this analysis to other values of the parameters in equation 2.4. Let Y = (y_0, ..., y_5)^T; the system has the form Ẏ = f(Y, p), where f is the smooth map from R^6 to R^6 given by equation 2.3 and p is a parameter. We are interested in computing the fixed points and periodic orbits of the system as functions of p because they will allow us to account for the appearance of such activities as those shown in Figure 2 (alpha-like activity) and Figure 3 (epileptic spike-like activity).

3.1 Fixed Points

3.1.1 The One-Parameter Family of Fixed Points. We look for the points Y where the vector field f(·, p) vanishes (called fixed points, critical points, or
Figure 3: (a–e) Activities of the unit shown in Figure 1 when simulated with a white gaussian noise as input (corresponding to an average firing rate between 30 and 150 Hz). The authors varied the excitation/inhibition ratio A/B. As this ratio is increased, we observe sporadic spikes followed by increasingly periodic activities. (f–i) Real activities recorded from epileptic patients before (f,g) and during a seizure (h,i). (From Wendling et al. 2000.)
equilibrium points). Writing Ẏ = 0, we obtain the system of equations

y_0 = (A/a) Sigm[y_1 − y_2]
y_1 = (A/a) (p + C_2 Sigm[C_1 y_0])
y_2 = (B/b) C_4 Sigm[C_3 y_0]
y_3 = y_4 = y_5 = 0,   (3.1)

which leads to the (implicit) equation of the one-parameter family of equilibrium points in the (p, y = y_1 − y_2) plane:

y = (A/a) p + (A/a) C_2 Sigm[C_1 (A/a) Sigm(y)] − (B/b) C_4 Sigm[C_3 (A/a) Sigm(y)].   (3.2)
As mentioned before, y = y_1 − y_2 can be thought of as representing the EEG activity of the unit, and p is our parameter of interest. We show the curve defined by equation 3.2 in Figure 4a. The number of intersections between this curve and a vertical line of equation p = constant is the number of equilibrium points for this particular value of p. We notice that for p ≈ 110–120, the system goes from three equilibrium points to a single one. We also note that the curve has been drawn for some negative values of p. These points do not have any biological meaning, since p is a firing rate. It turns out, though, that they play a central role in the mathematical description of the model (see section 3.2). The coordinates of the singular points cannot be written explicitly as functions of p, but every singular point is completely determined by the quantity y. More precisely, the coordinates of every singular point S(y) have the following form:
S(y) = ( (A/a) Sigm(y),  (A/a)(p + C_2 Sigm[C_1 (A/a) Sigm(y)]),  (B/b) C_4 Sigm[C_3 (A/a) Sigm(y)],  0,  0,  0 )^T,   (3.3)
p and y being related through equation 3.2.

Figure 4: (a) Curve defined by equation 3.2. For each value of p, the curve yields the coordinate(s) y of the corresponding fixed point(s). (b) Fixed points and their stability. Stable fixed points lie on the solid portions of the curve, and unstable points lie on the dashed ones. Stars correspond to transition points where the Jacobian matrix has some eigenvalues with zero real part. (c) Curve of the fixed points with two branches of limit cycles (shaded regions bounded by thick black curves). The stars labeled 3 and 5 are Hopf bifurcation points. The oval between them is the branch of Hopf cycles: for each 89.83 ≤ p ≤ 315.70, the thick black curves between points 3 and 5 give the highest and lowest y values attained by the Hopf cycle. The other branch of limit cycles lies in the domain between the star labeled 1, where there is a saddle-node bifurcation with homoclinic orbit, and the dash-dotted line 4 representing a fold bifurcation of limit cycles. This kind of representation is called a bifurcation diagram. (d) A Hopf bifurcation at the point labeled 2 (p = −12.15) gives rise to a branch of unstable limit cycles that merges with the branch of stable cycles lying between the point labeled 1 and the dashed line labeled 4. This phenomenon is called a fold bifurcation of limit cycles.

3.1.2 Local Study Near the Singular Points. In order to study the behavior of the system near the fixed points, we linearize it and calculate its Jacobian matrix, that is, the partial derivative J of f(·, p) at the fixed point S(y). It is easy but tedious to show that at the fixed point S(y), we have

J(S(y)) = ( 0_3      I_3
            K M(y)   −K ),

where

K = 2 diag(a, a, b),    M(y) = ( −a/2   γ(y)   −γ(y)
                                  δ(y)  −a/2     0
                                  θ(y)    0    −b/2 ),
I_3 is the three-dimensional identity matrix, and 0_3 is the three-dimensional null matrix. The three functions γ, δ, and θ are defined by

γ(y) = (A/2) Sigm′(y)
δ(y) = (A C_1 C_2 / 2) Sigm′(C_1 y_0(y))
θ(y) = (B C_3 C_4 / 2) Sigm′(C_3 y_0(y)),
where y_0(y) is the first coordinate of S(y) and Sigm′ is the derivative of the function Sigm. We compute the eigenvalues of J along the curve of Figure 4a to analyze the stability of the family of equilibrium points. The results are summarized in Figure 4b. The solid portions of the curve correspond to stable fixed points (all eigenvalues have a negative real part) and the dashed ones to unstable fixed points (some eigenvalues have a positive real part). Stars indicate points where at least one eigenvalue of the system crosses the imaginary axis, therefore having a zero real part. These points are precious landmarks for the study of bifurcations of the system.

3.2 Bifurcations and Oscillatory Behavior in Jansen's Model

A bifurcation is a drastic and sudden change in the behavior of a dynamical system that occurs when one or several of its parameters are varied. Often it corresponds to the appearance or disappearance of limit cycles. Describing oscillatory behaviors in Jansen's model is therefore closely related to studying its bifurcations. In our case, when p varies from −∞ to +∞, the system undergoes five bifurcations (remember that only the positive values of p are biologically relevant). We now describe these bifurcations from a somewhat intuitive viewpoint, but our results are grounded in the mathematical theory of bifurcations (Perko, 2001; Ioos & Adelmeyer, 1999; Kuznetsov, 1998; Berglund, 2001a, 2001b) and the extensive use of the software XPP-Aut due to Bard Ermentrout (available at http://www.pitt.edu/~phase/). We were also inspired by bifurcation studies of single-neuron models (see Izhikevich, in press; Hoppensteadt & Izhikevich, 1997; Rinzel & Ermentrout, 1998).
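The fixed-point counts reported in section 3.1 (three equilibria in the bistable regime, a single one above p ≈ 110–120) can be checked directly from equation 3.2 by counting sign changes of y minus its right-hand side. A Python sketch; the bracketing interval and grid resolution are arbitrary choices:

```python
import math

A, B, a, b = 3.25, 22.0, 100.0, 50.0
vmax, r, v0, C = 5.0, 0.56, 6.0, 135.0
C1, C2, C3, C4 = C, 0.8 * C, 0.25 * C, 0.25 * C

def Sigm(v):
    return vmax / (1.0 + math.exp(r * (v0 - v)))

def F(y, p):
    # y minus the right-hand side of equation 3.2; zeros are fixed points.
    s = (A / a) * Sigm(y)          # y0 at the fixed point
    return y - ((A / a) * p + (A / a) * C2 * Sigm(C1 * s)
                - (B / b) * C4 * Sigm(C3 * s))

def n_fixed_points(p, lo=-20.0, hi=30.0, n=20000):
    count, prev = 0, F(lo, p)
    for i in range(1, n + 1):
        cur = F(lo + (hi - lo) * i / n, p)
        if prev * cur < 0.0:
            count += 1
        prev = cur
    return count

assert n_fixed_points(50.0) == 3    # bistable regime: three equilibria
assert n_fixed_points(200.0) == 1   # above p ~ 110-120: a single equilibrium
```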
Two of them happen in Jansen’s model (if we ignore the negative values of p) for p = 89.83 and p = 315.70. A theorem due to Hopf (Perko, 2001)
shows¹ that for p = 89.83, a one-parameter family of stable periodic orbits appears at the fixed point that has two complex conjugate eigenvalues crossing the imaginary axis toward positive real parts. These periodic orbits persist until p = 315.70, where a second Hopf bifurcation occurs: the two eigenvalues whose real parts became positive for p = 89.83 see them become negative again, corresponding to the (re)creation of a simple, attractive fixed point. This is shown in Figure 4c: for p between 89.83 and 315.70, there is a family of periodic orbits (we call them Hopf cycles from now on) parameterized by p, for which the minimal and maximal y values have been plotted (thick oval curve). Numerically, using XPP-Aut, we find that these oscillations have a frequency around 10 Hz, which corresponds to alpha activity. So it appears that alpha-like activity in Jansen's model is determined by Hopf cycles. Interestingly enough, the system does not display any Hopf bifurcation if we approximate the sigmoid by a piecewise linear function, or if we try to reduce the dimensionality of the system by singular perturbation theory (Berglund, 2001b). In both cases, the system is unable to produce alpha activity. Let us interpret Jansen and Rit's results in the light of our mathematical analysis. They report observing alpha activity (see the third curve in Figure 2) when they use a uniformly distributed noise in the range 120 to 320 Hz at the entry of the system. This is easy to account for if we look at Figure 4c: in this domain of p values, the Hopf cycles are essentially the only attractors of the dynamical system 2.3. So at every time instant t, its trajectories will tend to coil around the Hopf cycle corresponding to p = p(t). We will therefore see oscillations of constant frequency and varying amplitude, leading to the waxing and waning activity reported by Jansen and Rit.
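This picture can be probed by integrating system 2.3 at constant input: inside the Hopf branch (for example, p = 155) trajectories should settle onto a limit cycle, while beyond the second Hopf point (for example, p = 400) they should spiral into the restored stable equilibrium. A Python sketch; the Euler step, durations, and the two sample p values are arbitrary choices of this sketch:

```python
import math

A, B, a, b = 3.25, 22.0, 100.0, 50.0
vmax, r, v0, C = 5.0, 0.56, 6.0, 135.0
C1, C2, C3, C4 = C, 0.8 * C, 0.25 * C, 0.25 * C

def Sigm(v):
    return vmax / (1.0 + math.exp(r * (v0 - v)))

def late_amplitude(p, t_end=6.0, dt=1e-4):
    """Peak-to-peak amplitude of y = y1 - y2 over the final second."""
    y = [0.0] * 6
    steps = int(t_end / dt)
    tail_from = steps - int(1.0 / dt)
    lo, hi = float("inf"), float("-inf")
    for k in range(steps):
        dy = [
            y[3], y[4], y[5],
            A * a * Sigm(y[1] - y[2]) - 2 * a * y[3] - a * a * y[0],
            A * a * (p + C2 * Sigm(C1 * y[0])) - 2 * a * y[4] - a * a * y[1],
            B * b * C4 * Sigm(C3 * y[0]) - 2 * b * y[5] - b * b * y[2],
        ]
        y = [yi + dt * di for yi, di in zip(y, dy)]
        if k >= tail_from:
            v = y[1] - y[2]
            lo, hi = min(lo, v), max(hi, v)
    return hi - lo

osc = late_amplitude(155.0)      # inside the Hopf branch: sustained cycle
damped = late_amplitude(400.0)   # past p = 315.70: oscillation dies out
assert osc > 1.0
assert damped < 0.5 * osc
```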
Hopf bifurcations are called local because their appearance depends only on local properties of the dynamical system around the bifurcation point. In Figure 3, we see that the system is able to display spike-like activities that resemble certain epileptic EEG recordings (Wendling et al., 2000). These activities arise from a branch of large stable periodic orbits delimited by a pair of global bifurcations (i.e., bifurcations depending not only on local properties of the dynamical system) that correspond to the star labeled 1 and the dash-dotted line labeled 4 in Figure 4c. From now on, we call these orbits spike cycles. The branch of spike cycles begins at p = 113.58, thanks to a saddle-node bifurcation with homoclinic orbit² (see Perko, 2001; Kuznetsov, 1998). It ends at p = 137.38 because of a fold bifurcation of limit cycles that we

¹ The proof of the existence of a Hopf bifurcation relies on the calculation of the Lyapunov number at the bifurcation points. It is quite technical and is not developed here.
² The proof of the existence of this saddle-node bifurcation with homoclinic orbit uses a theorem due to Shil’nikov (Kuznetsov, 1998).
Bifurcation Analysis of Jansen’s Neural Mass Model
3063
identified with XPP-Aut. This bifurcation results from the fusion of a stable and an unstable family of periodic orbits. The stable family is the branch of spike cycles, and the unstable family originates from a Hopf bifurcation occurring at p = −12.15. Thanks to XPP-Aut, we have been able to plot the fold and the associated Hopf bifurcation with respect to the y0 axis (see Figure 4d). So far we have shown the bifurcation diagrams in the (p, y) plane, but for technical reasons due to XPP-Aut, we show the bifurcation diagram in the (p, y0) plane in this case. Its general properties are the same, though: for example, we recognize the S shape of the fixed-point diagram and the relative positions of landmarks 1, 2, and 4. Contrary to the Hopf cycles, whose frequency remains around 10 Hz, the spike cycles can display any frequency in the range 0 to 5 Hz (the frequency increases with p), so they are able to reproduce the various “spiking” activities observed in Figure 3. In this case too, we can identify the central role played by p in shaping the output of the unit. Wendling et al. (2000) used a gaussian noise with mean 90 and standard deviation 30 to produce the spikes in Figure 3, resulting in an input to the unit essentially varying between 30 and 150 Hz, which is quite low compared to the range used by Jansen and Rit (1995).
Let us distinguish two parts of the curve of fixed points in Figure 4c. We call the set of stable fixed points below the star labeled 1 the lower branch and the one between the stars labeled 2 and 3 the upper branch. For p between 30 and 90, the system displays a classical bistable behavior with two stable fixed points (one on each branch), the lower fixed point appearing to be dominant. We found experimentally that the basin of attraction of the upper point is not very large, so that one has to start quite close to it in order to converge to it.
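The S-shaped curve of fixed points can be checked directly: at a fixed point all time derivatives vanish, and (assuming the standard Jansen-Rit equations and parameter values, since equations 2.3 and 2.4 are not reproduced in this excerpt) the system reduces to a scalar equation in the output y. A coarse sign-change scan then recovers the bistable regime described above:

```python
import math

# Standard Jansen-Rit parameter values (an assumption; equation 2.4 is not shown here).
A, B, a, b, C = 3.25, 22.0, 100.0, 50.0, 135.0
C1, C2, C3, C4 = C, 0.8 * C, 0.25 * C, 0.25 * C
e0, v0, r = 2.5, 6.0, 0.56

def S(v):
    """Sigmoid converting average membrane potential (mV) to firing rate."""
    return 2.0 * e0 / (1.0 + math.exp(r * (v0 - v)))

def g(y, p):
    """Setting all derivatives to zero leaves y = F(y); roots of g(y) = y - F(y) are fixed points."""
    y0 = (A / a) * S(y)
    F = (A / a) * (p + C2 * S(C1 * y0)) - (B / b) * C4 * S(C3 * y0)
    return y - F

def n_fixed_points(p, lo=-10.0, hi=30.0, n=40000):
    """Count sign changes of g on a fine grid, i.e., the number of fixed points."""
    ys = [lo + (hi - lo) * i / n for i in range(n + 1)]
    vals = [g(y, p) for y in ys]
    return sum(1 for u, v in zip(vals, vals[1:]) if u * v < 0)
```

Under these assumed parameters, `n_fixed_points(60.0)` falls in the bistable range described above (two stable fixed points plus a saddle), while `n_fixed_points(200.0)` finds a single fixed point, consistent with the diagram of Figure 4c.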
As a result, a low input (p ≤ 90) generally produces a low output: the trajectory is attracted by the lower branch. For p between 110 and 140, we are in the range of p values where spike-like activity can appear; spiking competes with Hopf cycles there, but trajectories near the lower branch are attracted by spike cycles (as we will see in section 3.3), hence producing spike-shaped activities. These two facts, attraction to the lower branch and the lower branch’s preference for spike cycles, allow us to understand how the model can produce epileptic-like activities.

3.3 Synthesis: Behavior of the Cortical Unit Model According to the Input Parameter p.
We now have in hand all the ingredients to describe the activity of this neural mass model when it is stimulated by a slowly varying input. For that purpose, we computed two trajectories (or orbits) of the system with p increasing linearly in time at a slow rate (dp/dt = 1). The system was initialized at the two stable fixed points for p = 0: the stable state on the lower branch and the one on the upper branch (see the stars in Figure 5). As long as p ≤ 89.83, the two trajectories are flat, following their respective branches of fixed points (see Figure 6, p = 80). After the Hopf
3064
F. Grimbert and O. Faugeras
Figure 5: Diagram of the stable attractors (stable fixed points and stable limit cycles) of the model described by equations 2.3, plotted as y (mV) versus p (Hz). The stars show the starting points of the two trajectories we simulated with p slowly increasing. Their time courses have been frozen at p = 80, 100, 125, and 200 (as indicated by the vertical dashed lines in this figure) and can be seen in Figure 6. The lower and upper states of the unit correspond to the thick and thin lines, respectively.
bifurcation occurring at p = 89.83, the orbit corresponding to the upper branch naturally coils onto the branch of Hopf cycles (see Figure 6, p = 100), resulting in alpha-like activity. The trajectory on the lower branch does the same with the spike cycles as soon as p reaches the value 113.58 (see Figure 6, p = 125). Once p ≥ 137.38, the only remaining attractor is the Hopf cycle branch, so the system can exhibit only alpha-like behavior (see Figure 6, p = 200). For high values of p (≥ 315.70), there is only one stable fixed point, and the trajectory is flat again. These results lead us to distinguish two states, the lower and the upper, for the unit. The lower state is described by the combination of the lower branch of fixed points, which corresponds to rest, and the spike cycles (thick lines in Figure 5); it corresponds to positive values of p less than 137.38. The upper state is described by the upper branch of fixed points, the Hopf cycle branch, and the branch of fixed points following it (thin lines in Figure 5); it corresponds to all positive values of p. These states are relevant for slow dynamics of the input p. In effect, a trajectory starting
[Figure 6 panels for p = 80, 100, 125, and 200: output y (mV) versus time t (s).]
Figure 6: Activities produced by Jansen’s neural mass model for typical values of the input parameter p (see the text). The thin (resp. thick) curves are the time courses of the output y of the unit in its upper (resp. lower) state. For p > 137.38, there is only one possible behavior of the system. In the case of oscillatory activities, we added a very small amount of noise to p (a zero mean gaussian noise with standard deviation 0.05).
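The rest and alpha regimes of Figure 6 can be checked with a short simulation at fixed p. The sketch below forward-Euler-integrates the standard Jansen-Rit equations; since equations 2.3 and the parameter values of equation 2.4 are not reproduced in this excerpt, the formulation and numbers below are assumptions:

```python
import math

# Standard Jansen-Rit parameter values (assumed; the paper's equation 2.4 is not shown here).
A, B, a, b, C = 3.25, 22.0, 100.0, 50.0, 135.0
C1, C2, C3, C4 = C, 0.8 * C, 0.25 * C, 0.25 * C
e0, v0, r = 2.5, 6.0, 0.56

def S(v):
    """Sigmoid converting average membrane potential (mV) to firing rate."""
    return 2.0 * e0 / (1.0 + math.exp(r * (v0 - v)))

def simulate(p, T=4.0, dt=1e-4):
    """Forward-Euler integration of the 6-D system; returns the output y = y1 - y2."""
    y0 = y1 = y2 = y3 = y4 = y5 = 0.0
    out = []
    for _ in range(int(T / dt)):
        dy3 = A * a * S(y1 - y2) - 2.0 * a * y3 - a * a * y0
        dy4 = A * a * (p + C2 * S(C1 * y0)) - 2.0 * a * y4 - a * a * y1
        dy5 = B * b * C4 * S(C3 * y0) - 2.0 * b * y5 - b * b * y2
        y0 += dt * y3; y1 += dt * y4; y2 += dt * y5
        y3 += dt * dy3; y4 += dt * dy4; y5 += dt * dy5
        out.append(y1 - y2)
    return out

tail = int(1.0 / 1e-4)            # keep the last second of each 4 s run
rest = simulate(50.0)[-tail:]     # low input: trajectory settles on the lower branch
alpha = simulate(200.0)[-tail:]   # only attractor is a Hopf cycle: alpha-like rhythm
```

After the transient, `rest` should be essentially flat while `alpha` should oscillate at a frequency near 10 Hz, matching the Hopf-cycle regime of Figures 4c and 6.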
near one of these states will stay in its neighborhood when p is varied slowly (increasing or decreasing). When the unit is in its lower state and p becomes larger than 137.38, it jumps to its upper state and cannot return to its lower state (if p varies slowly). Therefore, when in its upper state, a unit essentially produces alpha-like activity, and its input must be decreased abruptly to bring it back to its lower state. Conversely, starting in the lower state, a unit can be brought to the upper state by an abrupt increase of its input. It can also stay in its lower-state regime, between rest and spiking, if the input and its variations remain moderate.

3.4 What About Other Parameters?
We think that the bifurcation analysis with respect to p is the most relevant, since this parameter is expected to vary more, and faster, than the others, but it is interesting to build bifurcation diagrams with respect to p for different settings of the other parameters.
[Figure 7, panels a and b: y (mV) versus p (Hz).]
Figure 7: The stable attractors of the system in two typical cases encountered for different settings of the parameters A, B, C, a, or b. (a) Corresponds to lower (resp. higher) values of A, B, C (resp. a and b) than those given by equation 2.4. Here, A = 3 instead of 3.25: there are no more limit cycles. (b) Corresponds to higher (resp. lower) values of A, B, C (resp. a and b). Here, C = 140 instead of 135. The spiking behavior is more prominent and is the only one available in a wide range of p values (112.6 ≤ p ≤ 173.1). Except in a narrow band (173.1 ≤ p ≤ 180.4), the system displays a single behavior for each value of p.
We indeed observed that varying any parameter by more than 5% leads to quite drastic changes in the bifurcation diagram and to significantly less rich behaviors of the unit. These changes fall into two broad categories (see Figure 7). For low values of A, B, or C or high values of a or b, the system is no longer able to produce oscillations (see Figure 7a). For high values of A, B, or C or low values of a or b, we observed a new kind of bifurcation diagram (an example is given in Figure 7b). In this regime, the system has only one stable state for each value of p, except sometimes in a narrow range of p values (in the figure, 173.1 ≤ p ≤ 180.4). The range where spiking can occur is broader and the one for alpha activity is severely truncated. Moreover, spiking does not coexist with alpha rhythm anymore so that (except for a very small range of p values) it is the only available behavior on a broad interval of p values (in the figure, 112.6 ≤ p ≤ 173.1). So spiking becomes really prominent. The mathematical explanation for this new diagram is the fusion of the Hopf cycles branch with the branch of unstable periodic orbits that can be seen in Figure 4d. It results in a new organization of periodic orbit branches. We have two stable branches (for 112.6 ≤ p ≤ 180.4 and 173.1 ≤ p ≤ 457.1), linked by a branch of unstable orbits. Transitions between stable and unstable orbits are made via fold bifurcations of limit cycles like the one in Figure 4d.
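The parameter sensitivity of Figure 7a can be probed numerically: lowering A from 3.25 to 3.0 is reported to remove the limit cycles, so an input level that normally elicits alpha activity should instead settle on a fixed point. The sketch below again assumes the standard Jansen-Rit equations and parameter values (equations 2.3 and 2.4 are not reproduced in this excerpt):

```python
import math

def simulate(p, A, T=6.0, dt=1e-4):
    """Euler integration of the (assumed standard) Jansen-Rit system for a given
    excitatory gain A; all other parameters keep their standard values."""
    B, a, b, C = 22.0, 100.0, 50.0, 135.0
    C1, C2, C3, C4 = C, 0.8 * C, 0.25 * C, 0.25 * C
    e0, v0, r = 2.5, 6.0, 0.56
    S = lambda v: 2.0 * e0 / (1.0 + math.exp(r * (v0 - v)))
    y0 = y1 = y2 = y3 = y4 = y5 = 0.0
    out = []
    for _ in range(int(T / dt)):
        dy3 = A * a * S(y1 - y2) - 2.0 * a * y3 - a * a * y0
        dy4 = A * a * (p + C2 * S(C1 * y0)) - 2.0 * a * y4 - a * a * y1
        dy5 = B * b * C4 * S(C3 * y0) - 2.0 * b * y5 - b * b * y2
        y0 += dt * y3; y1 += dt * y4; y2 += dt * y5
        y3 += dt * dy3; y4 += dt * dy4; y5 += dt * dy5
        out.append(y1 - y2)
    return out

def spread(p, A):
    """Peak-to-peak amplitude of the output over the last second of the run."""
    ys = simulate(p, A)[-10000:]
    return max(ys) - min(ys)
```

With `A = 3.25`, `spread(200.0, 3.25)` reflects a sustained Hopf cycle; with `A = 3.0` the same input should give an essentially flat output, illustrating the drastic qualitative change produced by a small parameter variation.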
4 Conclusion

The bifurcation diagram (see Figure 5) is a valuable tool for describing the behaviors of Jansen’s neural mass model under a constant or slowly varying stimulus. We also showed that this analysis provides a good basis for understanding what happens when the input is noisy. In the case of small or slow variations of the input, the model can be reduced to a binary unit with two possible states. When we studied how the bifurcation diagram varies with the values of the other model parameters, it appeared that the behavior of Jansen’s model is quite sensitive to the choice of the physiological parameters A, B, C, a, and b: variations of a few percent in their values can cause drastic changes in the qualitative behavior of this neural mass model. Detailed comparisons of these behaviors with experimental data will be essential for further validation of the model and for defining ways to make it evolve.

What about the behavior of spatial assemblies of such models? Jansen et al. have studied evoked potentials in two connected cortical units (Jansen & Rit, 1995; Jansen et al., 1993), and Wendling et al. (2000) have simulated an epileptogenic network composed of three units. There are still no studies involving an arbitrary number of such units, or a continuum of them. This is a difficult task for at least three reasons. First, the size of the system of differential equations describing the network increases linearly with the number of units, making its mathematical analysis even more difficult. Second, the nonlinearities in the model and the network open the door to emergent properties impossible to predict from the sole knowledge of the behavior of one unit. Third, how to connect those units is an open question, since our knowledge of anatomical connectivity in the cortex is still very poor. Nevertheless, we think this is an important area for future work.

Acknowledgments

This work was partially supported by Elekta Instrument AB.
References

Berglund, N. (2001a). Geometrical theory of dynamical systems. Citebase.
Berglund, N. (2001b). Perturbation theory of dynamical systems. Citebase.
Braitenberg, V., & Schüz, A. (1998). Cortex: Statistics and geometry of neuronal connectivity (2nd ed.). Berlin: Springer.
David, O., Cosmelli, D., & Friston, K. J. (2004). Evaluation of different measures of functional connectivity using a neural mass model. NeuroImage, 21, 659–673.
David, O., & Friston, K. J. (2003). A neural mass model for MEG/EEG: Coupling and neuronal dynamics. NeuroImage, 20, 1743–1755.
Freeman, W. (1975). Mass action in the nervous system. New York: Academic Press.
Freeman, W. (1987). Simulation of chaotic EEG patterns with a dynamic model of the olfactory system. Biological Cybernetics, 56, 139–150.
Gerstner, W., & Kistler, W. M. (2002). Mathematical formulations of Hebbian learning. Biological Cybernetics, 87, 404–415.
Hoppensteadt, F., & Izhikevich, E. (1997). Weakly connected neural networks. New York: Springer-Verlag.
Iooss, G., & Adelmeyer, M. (1999). Topics in bifurcation theory and applications (2nd ed.). Singapore: World Scientific.
Izhikevich, E. M. (in press). Dynamical systems in neuroscience: The geometry of excitability and bursting. Cambridge, MA: MIT Press.
Jansen, B. H., & Rit, V. G. (1995). Electroencephalogram and visual evoked potential generation in a mathematical model of coupled cortical columns. Biological Cybernetics, 73, 357–366.
Jansen, B. H., Zouridakis, G., & Brandt, M. E. (1993). A neurophysiologically-based mathematical model of flash visual evoked potentials. Biological Cybernetics, 68, 275–283.
Kandel, E., Schwartz, J., & Jessell, T. (2000). Principles of neural science (4th ed.). New York: McGraw-Hill.
Kuznetsov, Y. A. (1998). Elements of applied bifurcation theory (2nd ed.). New York: Springer.
Lopes da Silva, F., Hoeks, A., & Zetterberg, L. (1974). Model of brain rhythmic activity. Kybernetik, 15, 27–37.
Lopes da Silva, F., van Rotterdam, A., Barts, P., van Heusden, E., & Burr, W. (1976). Model of neuronal populations: The basic mechanism of rhythmicity. In M. A. Corner & D. F. Swaab (Eds.), Progress in brain research (pp. 281–308). Amsterdam: Elsevier.
Perko, L. (2001). Differential equations and dynamical systems (3rd ed.). New York: Springer.
Rinzel, J., & Ermentrout, G. (1998). Analysis of neuronal excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From ions to networks (pp. 251–291). Cambridge, MA: MIT Press.
van Rotterdam, A., Lopes da Silva, F., van den Ende, J., Viergever, M., & Hermans, A. (1982). A model of the spatial-temporal characteristics of the alpha rhythm. Bulletin of Mathematical Biology, 44(2), 283–305.
Wendling, F., Bellanger, J., Bartolomei, F., & Chauvel, P. (2000). Relevance of nonlinear lumped-parameter models in the analysis of depth-EEG epileptic signals. Biological Cybernetics, 83, 367–378.
Received September 23, 2005; accepted April 28, 2006.
LETTER
Communicated by Laurence T. Maloney
A Model for Perceptual Averaging and Stochastic Bistable Behavior and the Role of Voluntary Control

Ansgar R. Koene
[email protected]
Department of Psychology, University College London, London, WC1H 0AP, U.K.
We combine population coding, winner-take-all competition, and differentiated inhibitory feedback to model the process by which information from different, continuously variable signals is integrated for perceptual awareness. We focus on “slant rivalry,” where binocular disparity is in conflict with monocular perspective in specifying surface slant. Using a robust single parameter set, our model successfully replicates three key experimental results: (1) transition from signal averaging to bistability with increasing signal conflict, (2) change in perceptual reversal rates as a function of signal conflict, and (3) a shift in the distribution of percept durations through voluntary control exertion. Voluntary control is implemented through the use of a single top-down bias input. The transition from signal averaging to bistability arises as a natural consequence of combining population coding and wide receptive fields, common to higher cortical areas. The model architecture does not contain any assumption that would limit it to this particular example of stimulus rivalry. An emergent physiological interpretation is that differentiated inhibitory feedback may play an important role in increasing percept stability without reducing sensitivity to large stimulus changes, which for bistable conditions leads to increased alternation rate as a function of signal conflict.

1 Introduction

Many perceptual aspects of our environment present themselves to the observer through multiple sensory channels. The slant in depth of a surface, for instance, is provided through binocular disparity, but also through perspective cues like foreshortening. To construct a coherent internal representation of a stimulus, the brain must somehow integrate the different sensory signals. One way to investigate this integration process is to subject the visual system to conflicting signals from different sensory channels.
This can instigate bistability in perception; an example is binocular rivalry, which results when the signals from the two eyes provide incompatible information, leading to a breakdown of binocular signal integration. Quite a few models
Neural Computation 18, 3069–3096 (2006)
© 2006 Massachusetts Institute of Technology
3070
A. Koene
have been put forward to explain bistability in perception on the basis of competition between either the sensory signals or the alternative percepts (Vickers, 1972; Sugie, 1982; Matsuoka, 1984; Kawamoto & Anderson, 1985; Lehky, 1988; Blake, 1989; Mueller & Blake, 1989; Mueller, 1990; Ditzinger & Haken, 1989; Lehky & Blake, 1991; Lumer, 1998; Dayan, 1998; Kalarickal & Marshall, 2000; Laing & Chow, 2002; Merk & Schnakenberg, 2002; Stollenwerk & Bode, 2003; Wilson, 2003; Zhou, Gao, White, & Yao, 2004). How sensory cues are successfully combined in the case of stable perception, however, has not been addressed in these models. The combination of sensory signals into a unified stable percept has generally been studied as a separate issue (e.g., binocular slant perception, as opposed to binocular slant rivalry), resulting in models that do not consider the properties of ambiguous perception. What has therefore been missing from our understanding of perception is an account of the transition from signal combination to signal rivalry. Perceived surface slant is a measure that is particularly suitable for providing such information because it comprises both a regime with slant integration and a regime with slant rivalry. Both regimes have recently been investigated experimentally by van Ee and colleagues. To study signal integration, they measured how much depth is perceived when subjects view a slanted plane in which binocular disparity and monocular perspective provide different slant information, for slant about a vertical axis (van Ee, van Dam, & Erkelens, 2002), about a horizontal axis (van Dam & van Ee, 2005), and for real planes slanted in depth (van Ee, Krumina, Pont, & van der Ven, 2005). Figure 1 illustrates the stimulus and the percepts. Using a metrical experimental paradigm, it was found that for small cue conflicts, perceived slant was a weighted average of the perspective- and disparity-specified slants.
When the cue conflict was large, however, observers experienced bistable slant rivalry. Slant rivalry appeared to have dynamics and stochastic bistable properties similar to those of other rivalry stimuli (van Ee, 2005). In a subsequent fMRI study, Brouwer, Tong, Schwarzbach, and van Ee (2004) revealed systematic increases in activity in the intraparietal sulcus and lateral occipital complex, as well as increasing alternation rates, at higher incongruencies. Eye movements, including microsaccades, were shown not to be essential for the perceptual alternation process, suggesting that slant rivalry is a central process (van Dam & van Ee, 2005). Extant models of perceptual bistability commonly assume a bottom-up binary process in which the percept results from a competition between two discrete alternatives. The most common architecture is depicted in Figure 2. In the slant rivalry experiments, however, the percept can assume any of a complete range of possible slants. The signal-conflict-dependent transition from averaging to bistability found in slant rivalry has not been explicitly addressed by traditional competition-based bistability models, nor has this transition been considered in combination with voluntary control.
Perceptual Averaging and Bistability Model
3071
[Figure 1 labels: Left eye, Right eye; Right side far, Left side far.]
Figure 1: The slant rivalry stimulus. In the anaglyph stereogram, monocular perspective and binocular disparity specify conflicting surface slants about the vertical axis. When the left eye views the green (shown in light gray) image and the right eye views the red (shown in dark gray) image, two competing percepts can be experienced. In the perspective-dominated percept, the grid recedes in depth with its right side farther away (it is perceived as a slanted rectangle). In the disparity-dominated percept, the left side of the grid is farther away (it is perceived as a trapezoid with the near edge shorter than the far edge). Each of the two percepts can be selected and maintained at will in a relatively controlled fashion. (More demonstrations of slant rivalry can be found online at www.phys.uu.nl/∼vanee/.)
Voluntary control exertion by the subject affects the dynamics of perceptual alternations in a variety of perceptual situations (e.g., Lack, 1978; Goryo, Robinson, & Wilson, 1984; Tsal, 1984; Schulman, 1992; Gomez, Argandona, Solier, Angulo, & Vazquez, 1995; Hol, Koene, & van Ee, 2003; Toppino, 2003; Meng & Tong, 2004; Chong, Tadin, & Blake, 2005). Van Ee, van Dam, & Brouwer (2005) examined to what degree the perceptual reversal frequency in slant rivalry is under voluntary control. They found that slant rivalry is systematically influenced by voluntary control, which makes it interesting for this study. In their work, they examined four voluntary control exertion tasks: natural, hold perspective, hold disparity, and speed-up. For the natural task, the subject passively viewed the stimulus and indicated (through key presses) which side of the slanted plane he or she perceived to be closer. For the hold perspective and hold disparity tasks, subjects were instructed to attempt to perceive the right or left side closer (corresponding to the perspective-specified slant) or to perceive the other
[Figure 2 labels: competing input signals; output; excitatory and inhibitory connections.]
Figure 2: Classical models for bistable perception. The neural network architecture of classical models for both binocular rivalry and perceptual rivalry is essentially bottom-up, using reciprocal inhibition. The competing interpretations constitute the input signals. Random signal noise generates slight signal strength differences over time, even for identical input signals. The signal that is slightly stronger suppresses the weaker signal through reciprocal inhibition. Some form of internal dynamics, such as temporal integration with gain control, is used to make the activity of the neurons at time T depend on their activity at T − δt, producing the inherent bias toward maintaining the previous percept. The strength of this bias decays in time to produce the experimentally found percept duration distributions.
side closer (disparity-specified slant) for as long as possible. For the speed-up task, the subject was instructed to alternate between the two percepts as rapidly as possible. For each task, the probability density histograms of percept durations showed a skewed, asymmetric (gamma) distribution (Brascamp, van Ee, Pestman, & van den Berg, 2005), similar to the distributions found for binocular rivalry (e.g., Levelt, 1966). Voluntary control shifted both the peak of the distribution and the mean percept duration. In this article, we propose a neural network that uses a combination of population coding (for the averaging) and winner-take-all competition (for the bistability). The effect of voluntary control is incorporated in the model as a top-down process that primes the neurons corresponding to the instructed shift in attention so that they have an elevated baseline response. Using a single parameter set, this network successfully replicates the three key results of the slant rivalry studies: (1) a transition from cue averaging to bistability as a function of stimulus incongruence, (2) increased alternation rates as a function of increasing cue conflict, and (3) a clear shift in the distribution of percept durations as a result of voluntary control exertion.
[Figure 3 labels: top-down bias; perspective and disparity signal inputs; noise; winner-take-all decision network; layer 1; layer 3; output.]
Figure 3: General structure of our network model for slant rivalry. The disparity and perspective input signals are combined, on the one hand, with top-down bias signals from the control exertion instruction that was given to the subjects, and, on the other hand, with an inhibitory feedback signal from the output of the network. The feedback path biases the output toward the current percept. The internal network noise, which may in fact have its origin at different stages, is also added at the input stage. Based on these inputs, a winner-take-all network selects the current slant, resulting in a perceived slant that is being forwarded for subsequent processing. The layers refer to Figure 4.
2 Method (Model Design)

Our model uses population coding similar to the coding found in the visual cortex (Hubel & Wiesel, 1959, 1979; van Essen, Anderson, & Felleman, 1992), with relatively broadly tuned receptive fields, to generate percept averaging for small cue conflicts, and winner-take-all competition (Marr & Poggio, 1977; McClelland & Rumelhart, 1981) to generate bistability for large cue conflicts. A weak top-down bias input is used to replicate the effect of attention.

2.1 General Architecture of the Model.
The basic structure of our model is portrayed in Figure 3. The forward path combines the input signals (e.g., the perspective- and disparity-defined slants), resolves this information into a single output, and sends this output to the higher visual processing areas. The feedback path feeds the output (which determines the current percept) back to the input of the decision network, biasing it toward the current percept. The internal network noise, which may in fact have its true origin at different stages in the perceptual system, is also added at the input stage, since this is where the noise influences the behavior of the network. The noise input is ultimately the cause of the stochastically alternating percepts in bistability. The top-down bias input provides the voluntary control that enables the subject to bias perception. When present, this bias input
makes the network more sensitive to a specific subset of bottom-up input signals. The integration of different stimulus cues related to the same perceptual property (such as slant in depth) must, by the nature of its input signals, occur in higher cortical networks. Accordingly, the receptive fields of the input layer in our model were chosen to be relatively broadly tuned. One property of broadly tuned receptive fields, which our model exploits, is that they naturally provide a means of averaging between similar input signals. The winner-take-all process, which selects among the potential percepts and generates the bistable behavior for large cue conflicts, could result from the lateral inhibitory connections that occur in many cortical networks (Lund, Angelucci, & Bressloff, 2003; Budd & Kisvarday, 2001; Crook, Kisvarday, & Eysel, 1998). For clarity, the details of the forward and feedback signal paths are discussed separately. The assumption of broadly tuned receptive fields in our network is not unreasonable considering the increase in receptive field size that has been reported in the visual cortex for some other perceptual properties (van Essen et al., 1992).

2.2 Model Forward Path.
The forward path is shown in Figure 4. The neurons in layers 1, 2, and 3 form place-coded maps in which each neuron responds preferentially to a specific percept (e.g., slant). The averaging/bistability behavior of the model results from a combination of broad gaussian receptive fields in layer 1 and an iterative winner-take-all process that selects the location of peak activity in layer 1. The proposed population activation in layer 1 is similar to the response distribution found in direction-selective cells in visual brain area MT when stimulated with transparent moving random dot stimuli (Treue, Hol, & Rauber, 2000). It should be noted, however, that this does not mean that layer 1 corresponds to MT.
The increase in receptive field size from one cortical layer to the next (van Essen et al., 1992; Wilson, 2003) and winner-take-all network behavior resulting from inhibitory lateral connections are both common features of cortical networks (Moldakarimov, Rollenhagen, Olson, & Chow, 2005). With gaussian receptive fields, the level of activation of each layer 1 neuron depends on the degree to which the slant coded by it matches the slants coded by the input signals; that is, the receptive fields provide a natural implementation of gain factors that depend on the similarity between the input signal and the value coded by a particular layer 1 neuron. Figures 5A to 5C illustrate how differences between the (disparity- and perspective-defined) inputs affect the resulting bottom-up activation of layer 1 (black solid line). Figure 5D shows the effect of adding signal noise to the inputs used in Figure 5C. The resulting activation of the layer 1 neurons goes through a winner-take-all competition process (Marr & Poggio, 1977; McClelland & Rumelhart, 1981). Layer 2 functions as a gate between layers 1 and 3. As long as there is more than one active neuron in layer 1, the strong inhibitory connections
[Figure 4 labels: winner-take-all process; perspective (P) and disparity (D) signal inputs; layers 1, 2, and 3; excitatory and inhibitory connections; output.]
Figure 4: The feedforward path of the slant rivalry model. The input units provide the perspective-related (P) and disparity-related (D) signals. Each node in layers 1, 2, and 3 specifies one particular slant. In each layer, the nodes at the same position specify the same slant. For clarity, the figure focuses on the connections that are active when the slant coded by the second node from the top is being processed. The connectivity paths for the other slants have the same structure. The slant is selected by the winner-take-all process in layer 1. Layer 2 functions as a gate to stop layer 1 signals from reaching layer 3 before the winner-take-all process has focused the activity in layer 1 down to a single node. Layer 3 signals the output of the slant selection network to the areas that determine the perceived slant. The recursive self-excitation in layer 3 maintains the layer 3 output during the time that layer 2 is stopping other inputs from reaching layer 3, that is, during the periods in which layer 1 is in the process of determining the winner.
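The winner-take-all selection in layer 1 can be illustrated with a minimal lateral-inhibition iteration. The paper does not specify its update rule, so the subtractive inhibition and its strength below are assumptions, but the qualitative behavior, silencing every unit except the most active one, is the same:

```python
def winner_take_all(activation, inhibition=0.2, max_iter=500):
    """Each unit is inhibited by the summed activity of all its competitors;
    units driven below zero are silenced. Iterates until at most one unit
    remains active. The update rule and inhibition strength are illustrative
    assumptions, not the paper's actual implementation."""
    a = [float(x) for x in activation]
    for _ in range(max_iter):
        total = sum(a)
        a = [max(0.0, x - inhibition * (total - x)) for x in a]
        if sum(1 for x in a if x > 0) <= 1:
            break
    return a

layer1 = [0.3, 0.9, 1.0, 0.8, 0.2]   # a noisy activation bump over slant-tuned units
result = winner_take_all(layer1)
```

Only the unit with the largest initial activation survives; in the full model, this single survivor is what opens the layer 2 gate to layer 3.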
keep all layer 2 neurons inhibited. Once the winner-take-all competition has silenced all but one layer 1 neuron, the excitatory connection from this layer 1 neuron to layer 2 results in activation of its layer 2 counterpart. The activated layer 2 neuron subsequently inhibits all layer 3 neurons except the neuron coding the corresponding percept, which it excites. The layer 3 neurons are the output neurons of the network and also the source of the internal feedback signal back to layer 1. The recursive self-excitation of the layer 3 neurons serves to maintain the network output during the period that the layer 2 gate blocks bottom-up inputs from reaching layer 3.

2.3 Model Feedback Path.
Inhibitory feedback connections from layer 3 to layer 1 (see Figure 6) bias the network toward maintaining the current
A. Koene
Figure 5: Activation of layer 1 neurons of the slant rivalry model. The dotted and dashed lines indicate the activation of the perspective and disparity inputs, respectively, for the coded slant angles. The solid line indicates the resulting activation. (A–C) The effect of increasing cue conflict on the resulting activation in the absence of internal noise. The activation shifts from having a single peak centered between the perspective and disparity specified slants (i.e., averaging) to having two peaks of equal height located close to the perspective- and disparity-specified slants (i.e., bistability). The gaussian activation profiles reflect the width of the receptive fields of the layer 1 neurons, determining the minimum level of cue conflict where bistability arises. For intermediate cue conflicts, the resulting activation of layer 1 neurons has a single peak with an ill-defined location. For this case, there is slant averaging with a large intertrial standard deviation in perceived slant. (D) Internal signal noise produces one peak that is slightly higher than the other, causing the percept to shift.
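The transition from a single peak to two peaks can be reproduced by summing two gaussian activation profiles. This is a sketch under assumed parameters; the 15-degree tuning width is a hypothetical value:

```python
import math

# Sketch of the layer 1 activation profiles in Figure 5 (the 15 deg tuning
# width is hypothetical): the sum of two gaussian cue profiles is unimodal
# for small cue conflicts (averaging) and bimodal for large ones
# (bistability); the crossover occurs when the conflict exceeds roughly
# twice the tuning width.

def activation(slants, cue1, cue2, width=15.0):
    """Summed gaussian activation produced by two cue-specified slants."""
    g = lambda s, c: math.exp(-0.5 * ((s - c) / width) ** 2)
    return [g(s, cue1) + g(s, cue2) for s in slants]

def count_peaks(profile):
    """Count interior local maxima of the activation profile."""
    return sum(1 for i in range(1, len(profile) - 1)
               if profile[i - 1] < profile[i] > profile[i + 1])

slants = [s / 2 for s in range(-140, 141)]       # -70..70 deg, 0.5 deg grid
print(count_peaks(activation(slants, -10, 10)))  # small conflict → 1 peak
print(count_peaks(activation(slants, -40, 40)))  # large conflict → 2 peaks
```

The receptive-field width thus sets the minimum cue conflict at which bistability arises, exactly as the caption states.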
percept. Increasing the activation of the neurons coding the currently perceived percept, through excitatory feedback or by modeling the neurons as leaky integrators (Matsuoka, 1984; Wilson, 2003; Laing & Chow, 2002; Mueller & Blake, 1989; Lehky, 1988), results in a uniform decrease in the relative probability of going to any of the other possible percepts. The use of distributed inhibitory feedback has the advantage that the strength of the inhibitory connection can be a function of the similarity between the values (e.g., depth slant) coded by the layer 1 and layer 3 neurons. This allows the network to suppress some percept changes more than others, which is not possible when using excitatory feedback. It should be noted that this difference between using excitatory versus inhibitory feedback is apparent only
Perceptual Averaging and Bistability Model
Figure 6: Feedback path of the slant rivalry model. The inhibitory feedback projects the current percept, as coded by layer 3, back to layer 1. Inhibitory feedback is one of the main features of our slant rivalry model. The use of inhibitory feedback has the advantage that it allows strong inhibition of layer 1 neurons that code a similar slant as the active layer 3 neuron, while only weakly inhibiting layer 1 neurons that code a very different slant. Decreasing the strength of the inhibitory feedback as a function of increasing difference between the slant coded by the origin (layer 3 neuron) and the destination (layer 1 neuron) reduces noise-induced percept modifications while maintaining sensitivity to large stimulus changes. Layer 2 is omitted since it is not involved in the feedback.
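The distance-dependent feedback described in this caption might be sketched as a gaussian inhibition kernel centered on the current percept. The text sets the inhibition width to twice the layer 1 tuning width; the 15-degree tuning width and the strength value here are hypothetical:

```python
import math

# Sketch of the distance-dependent inhibitory feedback of Figure 6. The
# inhibition width is twice an assumed 15 deg layer 1 tuning width, as in
# the text; the strength value is hypothetical.

def feedback_inhibition(slant, percept, width=30.0, strength=0.5):
    """Inhibition of a layer 1 unit falls off with its slant distance
    from the current layer 3 percept."""
    return strength * math.exp(-0.5 * ((slant - percept) / width) ** 2)

def net_activation(feedforward, slant, percept):
    """Rectified feedforward drive minus the feedback inhibition."""
    return max(0.0, feedforward - feedback_inhibition(slant, percept))

current = 0.0                                  # slant of the current percept
print(feedback_inhibition(5.0, current))       # near competitor: strong
print(feedback_inhibition(60.0, current))      # distant slant: weak
print(net_activation(0.3, 60.0, current) > 0)  # large changes stay salient
```

Nearby competitors are suppressed strongly while a very different slant keeps positive net drive, which is the property that distinguishes this scheme from uniform excitatory feedback.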
if there are more than two stable states that the network can be in (as is the case in the depth slant perception but not in binocular rivalry). If applied to a network involving only two competing units (e.g., Matsuoka, 1984, and Mueller, 1990), the inhibitory feedback proposed here would result in the same network dynamics as the use of excitatory feedback. The dashed line in Figure 7 illustrates the relative strength of the inhibitory feedback signals to layer 1 when the activity in layer 3 corresponds to the percept indicated by the black arrows. The strong inhibition of the direct neighbors of the current percept, and the decreasing inhibition strength with distance from the current percept, serves to reduce noise-induced percept changes while maintaining sensitivity to large stimulus differences. Figures 7A to 7C illustrate the effect of the inhibitory feedback on layer 1 activation for different cue conflicts (same as in Figure 5). The dotted lines indicate the excitation by the bottom-up inputs. The solid line indicates the resulting activation (i.e., forward minus feedback activation). Figure 7D illustrates the effect of signal noise for the same cue conflict as in Figure 7C. When the feedforward activation generates a single peak (see Figures 7A and 7B), the inhibitory feedback sharpens this peak, greatly reducing the
Figure 7: Effect of inhibitory feedback on the activation of layer 1 neurons of the slant rivalry model. The dashed and dotted lines indicate the inhibition and excitation by the feedback and forward paths, respectively, for the coded slant angles. The solid line indicates the resulting activation (i.e., forward minus feedback activation). The profile of the feedback inhibition reflects both the strong inhibition of the direct neighbors of the current percept (indicated by black arrows) and the decreasing inhibition strength with distance from the current percept. (A–C) The effect of the inhibitory feedback for the different cue conflicts (see Figure 5). When the disparity-specified and perspective-specified slants are similar (i.e., the feedforward activation generates a single peak), the inhibitory feedback sharpens the peak, reducing the probability that the noise causes fluctuations. (C) When the cue conflict is large, the inhibitory feedback both sharpens the peak of the currently perceived slant and reduces the height of the competing peak, thereby reducing the probability of a percept change. (D) The activation for the same cue conflict as in C, but with the addition of internal signal noise.
probability of noise-induced percept fluctuations. When the cue conflict is large (see Figure 7C), the inhibitory feedback sharpens the peak of the perceived slant and reduces the height of the competing peak, thereby reducing the probability of a percept change. 2.4 Gain Control. Experimental investigations of bistable perception have consistently found that the percept duration distributions are asymmetrically skewed (gamma-function-like) (e.g., Levelt, 1966; Fox &
Herrman, 1967; Borsellino, de Marco, Allazetta, Rinesi, & Bartolini, 1972; Sugie, 1982; Ditzinger & Haken, 1989; Gomez et al., 1995; Lehky, 1995; Kalarickal & Marshall, 2000; Merk & Schnakenberg, 2002; Zhou et al., 2004; Brascamp et al., 2005). This indicates that the instantaneous probability for a percept change increases with percept duration (if the percept change probability were constant, the percept duration distribution would be described by an exponential decay). A common way to model the increase in instantaneous percept change probability is through gain control of the neurons whose activity maintains the percept (Matsuoka, 1984; Lehky, 1988; Blake, 1989; Mueller, 1990; Dayan, 1998; Kalarickal & Marshall, 2000; Wilson, Blake, & Lee, 2001; Laing & Chow, 2002; Merk & Schnakenberg, 2002; Stollenwerk & Bode, 2003; Wilson, 2003). The gain control simulates a gradual reduction in neural activity (firing rate), sometimes referred to as fatigue (Palmer, 1999), which may be due to slow after-hyperpolarizing potentials (Lehky, 1988; Wilson, 2003; Lee, 2004), synaptic depression (Bear & Malenka, 1994), or self-inhibition (Stollenwerk & Bode, 2003). There are two places in our model where gain control would result in a gamma-like distribution of percept durations. One possibility is a gain reduction of the “winning” layer 1 neuron that “survived” the winner-take-all competition. In this case, the feedforward activation of layer 1 that sustains the current percept is reduced, increasing the instantaneous probability of a percept change. The other possibility is a gain reduction of the layer 3 neuron associated with the current percept, resulting in a reduction in the strength of the inhibitory feedback signal. In both cases, the gain reduction may be related to synaptic depression or slow after-hyperpolarizing potentials that accumulated over the prolonged period of activation of these neurons.
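The hazard-rate argument above can be checked with a toy simulation. The hazard values and fatigue time constant here are hypothetical and unrelated to the appendix's actual gain function:

```python
import math
import random
from collections import Counter

# Toy hazard-rate check of the gain-control argument (all values
# hypothetical). A percept "switches" at time step t with probability
# hazard(t); the sampled switch times are the percept durations.

def sample_durations(hazard, n=20000, seed=1):
    rng = random.Random(seed)
    durations = []
    for _ in range(n):
        t = 1
        while rng.random() >= hazard(t):
            t += 1
        durations.append(t)
    return durations

def mode(xs):
    """Most common duration in the sample."""
    return Counter(xs).most_common(1)[0][0]

# Constant hazard (no gain control): geometric/exponential-like durations.
constant = sample_durations(lambda t: 0.2)
# Hazard rising with fatigue of the winning unit: gamma-like durations.
fatigue = sample_durations(lambda t: 0.2 * (1 - math.exp(-t / 10)))

print(mode(constant))  # mode sits at the shortest duration
print(mode(fatigue))   # mode sits well away from zero
```

With a constant hazard the most common duration is the shortest one (exponential decay), whereas a hazard that rises as the winning unit fatigues pushes the mode away from zero, producing the skewed, gamma-like shape reported experimentally.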
Any combination of these possible sites of gain control is equally suited for reproducing the experimentally found percept duration distributions. Here we assume that the gain control is primarily in the “winning” layer 1 neuron. 2.5 Implementation of Voluntary Control. Top-down voluntary control or attention-driven percept biasing has been examined in a number of psychophysical studies on bistable perception (e.g., Lack, 1978; Gomez et al., 1995; Hol et al., 2003; Toppino, 2003; Meng & Tong, 2004; van Ee, Krumina et al., 2005; van Ee, van Dam et al., 2005; Chong et al., 2005; see also an early network model by Vickers, 1972). Attention or voluntary control is implemented in our model as a bias input to layer 1 (see Figure 8). This top-down input provides an excitatory subthreshold input to the layer 1 neurons that correspond to the attended percept. The bias in itself is a weak signal in order to avoid “hallucinatory” percept generation in the absence of any corresponding bottom-up inputs. For large cue conflicts, where there are two activation peaks in layer 1, the top-down bias lifts the peak on the attended side slightly above the competing peak, increasing the
Figure 8: Effect of voluntary control exertion in the slant rivalry model. Voluntary control of the slant selection is modeled through a top-down controlled bias. If the task is to hold one slant over the other, all layer 1 neurons coding such a slant receive an excitatory bias input. For speed-up control exertion, the bias is always sent to the side coding the slant with opposite sign. For large cue conflicts, where there are two activation peaks in the layer 1 population, this bias lifts the peak on the biased side slightly above the competing peak, thereby increasing the probability that the corresponding percept is perceived. To account for the experimental finding that spontaneous percept reversals cannot be completely suppressed, the strength of the top-down bias must be less than the peak network noise.
probability that the corresponding percept is being perceived. To account for the experimental finding that spontaneous percept reversals cannot be completely suppressed, the strength of the top-down bias must be less than the peak network noise. 2.6 Model Parameters. The key parameters in our model are: (1) receptive field tuning size of the layer 1 neurons; (2) the width of the (gaussian) inhibitory projective field in the feedback from layer 3 to layer 1; (3) the strength of the bottom-up input signals into layer 1; (4) amplitude and distribution of the signal noise; (5) initial strength of the inhibitory feedback signal; (6) the type of gain decrease in the “winning” layer 1 neuron as a function of percept duration; and (7) the strength of top-down activation bias of layer 1 related to voluntary control. 1. The width of the receptive fields in layer 1 determines how broad an area gets activated by the bottom-up inputs. This determines the level of cue conflict at which the network behavior switches from
averaging to bistability. For slant perception using disparity and perspective cues, the receptive field tuning size can therefore in principle be determined from the data concerning the difference between perspective- and disparity-specified slant at which bistability first occurs (van Ee et al., 2002; van Ee, Adams, & Mamassian, 2003; van Ee, Krumina et al., 2005). For simplicity, our current model implementation assumed that the receptive field size is the same for all layer 1 cells. 2. The width of the (gaussian) inhibitory projective field in the feedback from layer 3 to layer 1 determines how the mean rate of percept changes increases with increased cue conflict. The width of the inhibition field has to be greater than the receptive field size of the layer 1 units in order for it to affect percept stability for large cue conflicts (i.e., during bistable perception). This parameter was set to twice the receptive field tuning width of the layer 1 units. Due to the coarseness of the available data concerning changes in mean percept alternation rate as a function of degree of cue conflict, more precise parameter fitting was not considered useful at this time. 3. The relative strength of the bottom-up inputs determines the probability that the corresponding percept is perceived. If an individual exhibits a strong bias toward perception based on one of the bottom-up inputs, this is modeled by increasing the relative strength of this input. 4. The signal-to-noise ratio (SNR) in layer 1 regulates the variance in the location of peak activation in the population-coded layer 1. If the SNR is very large, the percept is fully determined by the bottom-up input signals. If the SNR is very small, the peak activation location will randomly alternate between neurons corresponding to a wide range of percepts.
If the noise in layer 1 is due to stochastic firing properties of the individual layer 1 neurons, then the noise at each neuron is independent of its neighbor (i.e., uniformly distributed noise). If the noise is linked to the activation level of the neurons (i.e., more strongly activated neurons generate more noise), the noise distribution corresponds to the product of a uniform noise distribution and the noise-free activation levels of the neurons. This is similar to assuming that the primary source of noise in the system is in the strength of the input signals. Our model is able to simulate published experimental results with either noise model (the results shown here used uniformly distributed, activity-independent noise). 5. With the properties of the noise and the bottom-up input signals fixed, the initial strength of the inhibitory feedback signal determines the probability for an immediate (i.e., within 1 second) percept change. Since the inhibitory feedback reduces the activity of the competing
layer 1 neurons (see Figure 7), the probability for a percept change is negatively correlated to the strength of the feedback signal. The initial feedback strength does not fully determine the probability of a percept change beyond the immediate onset of the percept since the gain control of the “winning” layer 1 neuron (see parameter 6) gradually increases the probability of a percept change. 6. Gain control of the “winning” layer 1 neuron, which reduces the activation of this layer 1 neuron as a function of percept duration, determines how the probability of a percept switch changes with time. When the rate of neural activity reduction decreases exponentially with time, the percept durations predicted by the model fit a skewed asymmetrical (gamma-like) distribution. 7. When present (i.e., in the control-exertion conditions where the subject exerts voluntary control), the strength of the bias input is modeled as a constant fixed value. All control exertion conditions were modeled using the same bias input strength. The instruction “hold the left [or right] side in front” is modeled by a bias input to the layer 1 neurons that code for slants with the left (right) side in front (see Figure 8). For the speed-up control exertion case, the bias is always sent to the side coding a slant with opposite sign to the current percept. A mathematical description of the model and its parameters is given in the appendix. 3 Results To test the validity of our network we simulated the slant rivalry experiments by van Ee et al. (van Ee et al., 2002; van Ee, Krumina et al., 2005; van Ee, van Dam et al., 2005) and Brouwer et al. (2004) using 60,000 iterations per simulated experimental condition and control exertion task. As we will now show, all three key results have been successfully replicated using a single parameter set. 3.1 Signal Integration as a Function of Cue Conflict: Averaging vs. Bistability. 
Figure 9 shows that our slant rivalry network replicates the transition from averaging to bistable behavior as a function of cue conflict between perspective- and disparity-specified slant (van Ee et al., 2002; van Ee, Krumina et al., 2005; van Ee, van Dam et al., 2005). When the cue conflict is small, the signal integration results in averaging (+ symbols). For larger cue conflicts, the percept is bistable, alternating between the percepts indicated by the X and circle symbols. Far from the borders to the averaging regime, the perceived slant in the bistable regime closely corresponds to the slant signaled by either the perspective or the disparity cue. Close to the averaging regime, however, the perceived slant becomes a weighted average
Figure 9: Experimental and simulated bistability. Our model produces both averaging and bistability in perceived slant as a function of perspective-specified and disparity-specified slant. Each panel shows the perceived slant versus the disparity-specified slant for a particular perspective-specified slant. “Persp.” indicates perspective-specified slant. The X and circle symbols indicate the perceived slants when the percept is bistable for disparity-dominated and perspective-dominated percepts, respectively. The + symbols indicate the perceived slant when the percept corresponds to stable averaging of the disparity- and perspective-specified slants. Our slant rivalry neural network replicates perceived bistable slants obtained under various experimental conditions (van Ee et al., 2002; van Ee, Krumina, et al., 2005; van Dam & van Ee, 2005). The black dots indicate the perceived slants reported in van Ee, Krumina et al. (2005). The mean absolute difference between the model predictions and the experimental data is 6.5 degrees (mean standard error of the experimental data was 2.5 degrees).
of the slants signaled by the two cues, with the weights gradually equalizing as the transition border to the averaging regime is approached. The black dots in Figure 9 indicate the perceived slants reported in van Ee, Krumina et al. (2005). The mean absolute difference between the model predictions and the experimental data is 6.5 degrees. For comparison, the mean standard error of the experimental data was 2.5 degrees. 3.2 Signal Integration as a Function of Cue Conflict: Mean Percept Duration. Figure 10 shows how the degree of cue conflict affects the mean
[Figure 10 plots mean percept duration (s) against the conflict between disparity- and perspective-specified slants (deg), marks the averaging and bistable regimes, and indicates the conflict value used in the “voluntary control” simulations.]
Figure 10: Mean percept duration versus conflict between disparity and perspective-specified slants. The circle and diamond symbols indicate the mean percept duration for the disparity-dominated and the perspective-dominated percept, respectively, showing that the mean percept duration decreases with increasing cue conflict (asymptotically approaching 1s) due to the reduction in inhibitory feedback connection strength for increasingly different slants. This behavior, which depends essentially on the use of inhibitory feedback, has recently been supported by experimental results (Brouwer et al., 2004). Models that rely on excitatory feedback to strengthen the current percept, rather than on inhibitory feedback to weaken neighboring competitors, do not produce this behavior.
percept durations. For large cue conflicts (i.e., in the bistable regime), the mean percept duration is negatively correlated with the degree of cue conflict. Experimental data by Brouwer et al. (2004) show the same trend in the rate of percept changes, that is, a decrease in mean percept duration, as a function of increasing stimulus incongruence. Brouwer et al. further reported that the change in mean percept duration is more closely correlated to subjective rather than objective cue conflict. This agrees with our model since the change in mean percept duration as a function of cue conflict depends on the tuning width of the layer 1 receptive fields and the size of the feedback projection from layer 3, both of which are internal parameters that may well be subject dependent. The positive correlation between degree of cue conflict and size of cortical activation reported by Brouwer et al. is also predicted by our model since larger cue conflict results in less overlap in
the population of layer 1 neurons that is activated by the perspective and disparity cues. 3.3 Modulation of Bistability Behavior Through Voluntary Top-Down Biasing. Figure 11 shows the probability density distributions for the simulated percept durations in the bistable regime and the effect of top-down modulation. The top row shows the results for the “natural” condition without top-down bias. The second and third rows show the results for the “hold” control exertion conditions, where a top-down bias was added favoring one of the two alternative percepts. The bottom row depicts the results for the speed-up control exertion, where the top-down bias was always assigned to the percept that was currently suppressed. Our slant rivalry neural network replicates the experimentally found percept duration probability distributions obtained under these control exertion conditions (van Ee, Krumina et al., 2005). For each experimental condition and for all percept durations, the simulation error falls within the standard deviation of the experimentally found probability densities. 3.4 Robustness Analysis. The following robustness analysis was performed to illustrate how the model behavior is affected by the choice of its parameter values. Rather than exhaustively testing the model for all possible parameter sets, we opted to test each parameter separately by analyzing the effect of a 10% change in the tested parameter:
- Receptive field width of the layer 1 neurons: Provided the ratio between layer 1 receptive field width and the width of the inhibitory projective field of the layer 3 feedback is held constant, increasing or decreasing the layer 1 receptive field width has the same effect as decreasing or increasing cue conflict.
- Width of the inhibitory projective field in the feedback from layer 3 to layer 1: If the layer 1 bottom-up receptive field tuning width is held constant, changing the width of the inhibition area changes the mean percept duration. The relationship between mean percept duration and degree of cue conflict (see Figure 10) is not affected. At the level of cue conflict used in our “voluntary control” simulations, a 10% increase or decrease in inhibition field area results in a 50% increase or 33% decrease in mean percept durations.
- Strength of the bottom-up input signals into layer 1: Increasing the strength of one bottom-up input to 10% more than the competing input shifts the distribution of percept durations such that the mean percept duration of the stronger cue is about 2.5 times longer than the mean percept duration of the weaker cue. Note that since our model concerns high-level stimulus conflicts, layer 1 in our model does not correspond to the sensor level. An N-fold increase in stimulus strength
[Figure 11: eight percept-duration histograms; the panel annotations report mean percept durations between 3.56 s and 9.60 s, mean simulation errors of 0.01 s, and mean experimental standard deviations of 0.02–0.03 s.]
Figure 11: Percept durations produced by the slant rivalry neural network. The left and right columns depict the probability density distributions for the disparity-dominated and perspective-dominated percept durations, respectively. The top row shows the results for the natural control exertion condition without top-down bias. The second and third rows show the results for the hold control exertion conditions, where a top-down bias was added favoring one of the two alternative percepts. The bottom row depicts the results when the top-down bias was always assigned to the percept that was currently not prevailing. All simulations were done using the same fixed set of model parameters. The model parameters were determined by the natural control exertion condition, assuming no top-down bias. A single bias input was sufficient to fit each of the control exertion conditions. Our slant rivalry neural network replicates experimentally found percept duration probability density distributions obtained under the described control exertion conditions (van Ee, Krumina et al., 2005) within one standard error. “Mean error” indicates the mean difference between simulation results and experimental data; “M. exp. std” indicates the mean standard deviation of the experimental data.
therefore does not automatically correspond to an N-fold increase in the input signal to layer 1.
- Amplitude of the signal noise: Increasing or decreasing the noise level by 10% decreases or increases the mean percept durations by approximately 10%.
- Initial strength of the inhibitory feedback signal: Increasing or decreasing the initial feedback strength by 10% results in an approximately 60% increase or 30% decrease in the mean percept durations.
- Gain change of the “winning” layer 1 neuron as a function of percept duration: A 10% increase or decrease in parameter α, β, or γ (see equation A.1) results in approximately a 1% decrease or increase, 10% decrease or increase, or 40% increase or 25% decrease in mean percept duration, respectively.
- Top-down activation bias related to voluntary control: Increasing or decreasing the strength of the voluntary control top-down bias by 10% increases or decreases the mean percept duration of the attended percept by 5% and decreases or increases the mean percept duration of the alternative percept by 5%.
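The subthreshold bias of section 2.5 can be illustrated with a toy competition between two equally driven activation peaks. The bias, noise amplitude, and trial count here are hypothetical:

```python
import random

# Toy competition between two equally driven activation peaks with a
# subthreshold attention bias (all numbers hypothetical). The bias is
# smaller than the peak noise, so it tilts the competition without
# fully determining the outcome.

def fraction_attended_wins(bias, noise=0.2, trials=10000, seed=7):
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        attended = 1.0 + bias + rng.uniform(-noise, noise)
        competing = 1.0 + rng.uniform(-noise, noise)
        wins += attended > competing
    return wins / trials

print(fraction_attended_wins(0.05))  # above chance, but below 1.0
```

Because the bias is smaller than the peak noise, the attended percept wins more often than chance while reversals toward the competing percept still occur, as required by the finding that spontaneous reversals cannot be completely suppressed.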
4 Discussion We have presented a neural network model of signal integration that seamlessly combines averaging and stochastic bistability behavior, as well as voluntary control, in the perception of ambiguous slant stimuli. The network behavior changes from averaging to bistability as a function of the degree of conflict between the disparity-specified and perspective-specified slants and shows increasing rates of perceptual slant reversals for increasing stimulus incongruence. The voluntary control of perception is achieved through the use of a top-down bias input. Most previous mechanistic models for perceptual signal integration modeled either the generation of unified stable (binocular) perception (reviewed by Howard & Rogers, 2002) or bistability (Vickers, 1972; Sugie, 1982; Matsuoka, 1984; Kawamoto & Anderson, 1985; Lehky, 1988; Blake, 1989; Mueller & Blake, 1989; Mueller, 1990; Ditzinger & Haken, 1989; Lehky & Blake, 1991; Lumer, 1998; Dayan, 1998; Kalarickal & Marshall, 2000; Laing & Chow, 2002; Merk & Schnakenberg, 2002; Stollenwerk & Bode, 2003; Wilson, 2003; Zhou et al., 2004) without explicitly addressing the transition from one type of behavior to the other. To our knowledge, the only previous models that account for these perceptual transitions are a Bayesian signal integration model by van Ee et al. (2003) and a binocular vision model by Hayashi, Maeda, Shimojo, and Tachi (2004). Van Ee et al. provided a framework to describe the behavior of the perceptual system but did not address the neural network responsible for this perceptual behavior. Hayashi et al.’s
model did provide a mechanistic framework for the switch from fusion to rivalry but is not easily generalized to other signal integration cases since it relies heavily on mechanisms that are specific to the role of half-occluded unpaired points in solving the binocular correspondence problem. The model of stereo depth acuity and transparency by Lehky and Sejnowski (1990) has some similarity to the population-coding scheme here. In that model, going to a transparent surface (two depths) corresponded to going from a unimodal to a bimodal distribution of activity in the population, similar to here. However, in that model, there were no oscillations, as the percept of transparency is stable. The dynamics of the percept changes in our model are governed by the gain decay or fatigue function (see the appendix, “gain control”) of the layer 1 neurons, which gradually decreases the stability of the percept, and by the signal-to-noise ratio of the input signals to layer 1. We have reproduced the results of a set of experiments (van Ee et al., 2002; van Ee, Krumina et al., 2005; van Ee, van Dam et al., 2005; van Dam & van Ee, 2005; Brouwer et al., 2004) that showed a transition from perceptual averaging of the disparity- and the perspective-specified slants to bistability (see Figure 9), changes in mean bistable percept duration as a function of perceptual cue conflict (see Figure 10), and the effect of voluntary control on the probability density distribution of bistable percept durations (see Figure 11). The results of our simulations provided a good fit with the experimental data. All of the qualitative behavior was reproduced, and the predicted percept durations (see Figure 11) were all within one standard deviation of the experimental results. 4.1 Purpose of the Inhibitory Feedback from the Current Percept.
Most mechanistic models of bistable perception contain some mechanism (i.e., feedback or temporal integration) that allows the percept at time t to affect the way the input signals are being perceived at time t + δt (Matsuoka, 1984; Wilson, 2003; Laing & Chow, 2002; Mueller & Blake, 1989; Lehky, 1988). By strengthening the signals related to the current percept, the stability of the network behavior is enhanced, reducing the perceptual consequences of signal noise. In our model this stabilizing mechanism is implemented by the inhibitory feedback from layer 3 to layer 1. Without the inhibitory feedback, the combination of network noise and wide receptive fields in layer 1 would result in small random shifts of the location of peak activation, even in the absence of a cue conflict. This would cause the percept to wobble or jitter. Note, however, that the inhibitory feedback does not ensure consistency of the percept over separate presentations. On each presentation, the network noise affects at which exact value the peak is initially located. This location is then held stable. For unambiguous stimuli, the internal noise can be seen in the variance of perceived slant when the same stimulus is offered repeatedly. Since the purpose of the feedback is to enhance perceptual reliability by reducing the consequences of internal
noise, its effect on the ability of the visual system to detect changes in the environment should be minimized. Excitatory feedback, strengthening the signals related to the current percept, would have the same net effect as inhibitory feedback uniformly reducing all other signals in the network, resulting in a uniform reduction in the sensitivity to changes in the stimulus. Our inhibitory feedback signal allows a differentiated approach. By applying strong inhibition to the cells that are functionally close to the current percept and decreasing the inhibition strength with distance from the current percept, we minimize the reduction of the salience of big changes in the stimulus while suppressing noise-induced percept instability. An additional property of this type of differentiated inhibitory feedback is that it predicts a gradual increase in the rate of percept changes (decrease in mean percept duration) as the cue conflict is increased (Brouwer et al., 2004). Models that rely on excitatory feedback to strengthen the current percept do not produce this behavior. 4.2 Multiple Stages of Binocular Rivalry. The model of Hayashi et al. (2004) integrates stereopsis and binocular rivalry, which are usually treated separately, into a single framework of binocular vision. In this model rivalry arises due to interocular inhibition between representations of monocularly visible regions. In the hierarchy of visual perception, this can be considered as low- or midlevel rivalry occurring just after binocular convergence at the correspondence problem-solving stage. While this model provides an explanation for the occurrence of bistability in binocular rivalry, it does not explain slant rivalry data since the Hayashi et al. model is specifically tailored toward binocular fusion or rivalry. The Hayashi et al. model also does not predict the slight shift toward the nonperceived percept that was found in slant rivalry experiments. 
In our model, the population code in layer 1 encodes only the perceptual property of input signals, not the sensory origin of the stimulus cue from which the signal is derived. The percept derived from our signal integration network is therefore blind to its sensory origin and will not be affected by interchanging the sensory origin of stimulus cues. When applied to binocular rivalry, our network can thus be understood as modeling a network involved in higher-level stimulus rivalry (Logothetis, Leopold, & Sheinberg, 1996) rather than classical binocular rivalry. Thus, our model and the model by Hayashi et al. (2004) can be considered complementary parts in a two- (or more) stage process and as such may be linked to the multiple-stage rivalry model suggested by Wilson (2003).

4.3 Predictions and Possible Extensions

4.3.1 Weighted Cue Combination. Many investigations of sensory cue combination have shown that cue reliability is taken into account in the percept formation (e.g., Landy, Maloney, Johnston, & Young, 1995; Oruc,
3090
A. Koene
Maloney, & Landy, 2003; Ernst & Banks, 2002). The cue reliability is either statistically determined (e.g., Bayesian estimators) or derived from ancillary cues (related cues that provide a context for the cue estimation). A simple way to implement reliability-dependent cue weighting in our neural network is by adjusting the activation strength of the input signals to layer 1 (see equation A.1: a_{0,j}(t)). This kind of cue weighting might implement a bias based on past experience of cue reliability. The predicted effect of biased cue weighting is very similar to the effect of the voluntary top-down bias (see section 3). Since our model uses population codes to signal the stimulus property, the model also implements online cue reliability-based weighting (Ernst & Buelthoff, 2004) through the layer 1 receptive fields. If a cue is unreliable, that is, there is a lot of variance in the layer 1 input provided by this cue, the area in layer 1 that is activated by the cue will smear, becoming broader with a lower peak. The broadening of the activated layer 1 area will have the same effect as using larger receptive fields in layer 1 would (see section 2). The decrease of the activation peak would have an effect that is similar to a voluntary top-down bias in favor of the more reliable cue.

4.3.2 Multistable Percepts with More Than Two Input Cues. In order to extend our signal integration network to the condition where there are more than two input cues, the additional cues can simply be added as further inputs to layer 1 of the network. For slant perception, for instance, surface texture and shading could be added as additional slant cues in exactly the same way as perspective and disparity (in equation A.1, simply add j = 3 and j = 4). The predicted percept(s) in this condition will again depend on the relative differences between the input signals (analogous to Figure 5).
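The population-coded, multi-cue input to layer 1 can be illustrated with a small numerical sketch of the excitatory (feedforward) term of equation A.1. This is not the published implementation: the receptive-field width, cue values, and grid here are arbitrary choices for illustration, and adding a third or fourth cue is simply another entry in the cue lists.

```python
import numpy as np

def layer1_drive(m1, cue_slants, cue_strengths, sigma1=20.0):
    """Feedforward drive to layer 1 (excitatory term of equation A.1):
    each cue j adds a gaussian bump centered on the slant m0_j it signals,
    scaled by its input strength a0_j (the proposed reliability weight)."""
    m1 = np.asarray(m1, dtype=float)
    drive = np.zeros_like(m1)
    for m0, a0 in zip(cue_slants, cue_strengths):
        drive += a0 * np.exp(-((m1 - m0) ** 2) / sigma1 ** 2)
    return drive

# Preferred slants of the layer 1 population (degrees).
m1 = np.arange(0.0, 180.0, 1.0)

# Two cues (e.g., perspective and disparity); extra cues (j = 3, j = 4)
# would just be further entries in these lists.
drive = layer1_drive(m1, cue_slants=[40.0, 60.0], cue_strengths=[1.0, 1.0])
peak = m1[np.argmax(drive)]
print(peak)  # with overlapping bumps, the peak lies between the cue values
```

With equal cue strengths and a small cue conflict, the summed activation is unimodal and the peak sits midway between the two signaled slants, which is the averaging regime described in the text; unequal strengths shift the peak toward the stronger (more reliable or attended) cue.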
Because the differentiated inhibitory feedback in our model decreases with functional distance, the model predicts preferential switching between pairs of perceptual states during multistable perception, as reported by Suzuki and Grabowecky (2002). The preferred switch will always be to the state that is most different from the current percept, since that state is least inhibited. In slant perception, for instance, having four cues that signal slants of 0, 45, 90, and 135 degrees, respectively, would result in preferential switching between 0 and 90 degrees and between 45 and 135 degrees, since these are farthest from each other and thus inhibit each other least (note that a 0 degree slant is the same as a 180 degree slant).

5 Conclusion

Our neural network provides a possible mechanism for explaining processes in which continuously variable information from different sources is integrated for perceptual awareness. It successfully incorporates both averaging and stochastic bistability behavior in slant perception. The transition in our model behavior from a percept-averaging regime to a regime
of bistability is a natural consequence of the combination of population coding and the wide receptive field tuning common to higher cortical areas. Voluntary control of perception is achieved through a single top-down bias input. Differentiated inhibitory feedback plays an important role in our network, providing both increased percept stability without reduced sensitivity to larger stimulus changes and increasing alternation rates as a function of stimulus incongruence. In this article, we have validated the model by applying it to slant rivalry instigated by conflicting perspective and disparity signals. The model architecture, however, does not contain any assumption that would limit it to this particular example of stimulus rivalry. Key elements of this model are the use of population coding and inhibitory feedback with wide receptive and projective fields.

Appendix: Mathematical Description of the Model and Its Parameters

The model was implemented in Matlab 5 and is available on request.

A.1 Layer 1. The activation of the layer 1 neurons by their bottom-up excitatory inputs and top-down inhibitory inputs is given by

a_{1,i}(t) = G_{1,i}(\tau) \left[ \sum_{j=1}^{2} a_{0,j}(t)\, \exp\!\left( -\frac{(m_{1,i} - m_{0,j})^2}{\sigma_1^2} \right) - \sum_{k \neq i} a_{3,k}(t)\, \exp\!\left( -\frac{(m_{1,i} - m_{3,k})^2}{\sigma_3^2} \right) + \varepsilon_{1,i}(t) + \alpha_{1,i}(t) \right],

G_{1,i}(\tau) = e^{-(\tau + \alpha)^{\beta}} + \gamma, where \alpha = 0.3, \beta = 0.38, and \gamma = 0.51,    (A.1)
where a_{1,i}(t) is the activation of the ith layer 1 neuron at time t; a_{0,j}(t) is the activation of the jth bottom-up input at time t; m_{1,i} is the center of the receptive field of the ith layer 1 neuron (i.e., the slant orientation coded by this cell); m_{0,j} is the slant orientation signaled by the jth bottom-up input; \sigma_1 is the width of the layer 1 receptive fields (for simplicity, all layer 1 neurons are assumed to have the same receptive field tuning width); a_{3,k}(t) is the activation of the kth layer 3 neuron at time t; m_{3,k} is the center of the receptive field of the kth layer 3 neuron (i.e., the slant orientation coded by this cell); \sigma_3 is the width of the layer 3 inhibitory projective fields (for simplicity, all layer 3 neurons are assumed to have the same projective field width); \varepsilon_{1,i}(t) is the noise acting on the ith layer 1 neuron at time t; and \alpha_{1,i}(t) is the activation bias due to attention affecting the ith neuron of layer 1 at time t. G_{1,i}(\tau) is the gain of the ith layer 1 neuron, where \tau is the duration for which this neuron has signaled the current percept, that is, the perceived
slant was the same as the slant coded by the ith neuron. For layer 1 neurons that do not code the current percept, \tau = 0 (i.e., G(\tau) = 1).

A.2 Winner-Take-All Interaction in Layer 1. A biologically plausible implementation could take the form of a network of lateral inhibition such that

a_{1,i}(t + \delta t) = a_{1,i}(t) - \frac{1}{N} \sum_{j=1}^{N} a_{1,j}(t),
where a_{1,i}(t + \delta t) is the activation of the ith layer 1 neuron at time t + \delta t; a_{1,i}(t) is the activation of the ith layer 1 neuron at time t; N is the total number of layer 1 neurons; and a_{1,j}(t) is the activation of the jth layer 1 neuron at time t. After a couple of iterations, this will reduce the activity of all neurons, except the maximally activated neuron, to zero.

A.3 Layer 2. The function of layer 2 is to accommodate the processing time required by physiologically plausible winner-take-all processes. The activation of the layer 2 neurons by their bottom-up inputs is given by

a_{2,i}(t) = E\, a_{1,i}(t) - I \sum_{k \neq i} a_{1,k}(t), with E \gg I,
where a_{2,i}(t) is the activation of the ith layer 2 neuron at time t; E is the gain of the excitatory projections from layer 1 to layer 2; a_{1,i}(t) is the activation of the ith layer 1 neuron at time t; I is the gain of the inhibitory projections from layer 1 to layer 2; and a_{1,k}(t) is the activation of the kth layer 1 neuron at time t.

A.4 Layer 3. The activation of the layer 3 neurons is given by

a_{3,i}(t) = \min\!\left( a_{3,i}(t - \delta t) + E\, a_{2,i}(t) - I \sum_{k \neq i} a_{2,k}(t),\; S \right), with E \gg I,
where a 3,i (t) is the activation of the ith layer 3 neuron at time t; a 3,i (t − δt) is the activation of the ith layer 3 neuron at time t − δt; E is the gain of the excitatory projections from layer 2 to layer 3; a 2,i (t) is the activation of the ith layer 2 neuron at time t; I is the gain of the inhibitory projections from
layer 2 to layer 3; and a_{2,k}(t) is the activation of the kth layer 2 neuron at time t. S is the neural activity saturation level.

Acknowledgments

Part of the work was developed while A.K. was at the INSERM U534, Lyon, France, in 2003. Major parts of the work have been presented at the Conférence Ladislav Tauc en Neurobiologie in 2003 and the European Conference on Visual Perception in 2004. A.K. was partly supported by a Human Frontiers Research Project grant assigned to A. Johnston and partly supported by a Gatsby Charitable Foundation grant assigned to L. Zhaoping. The help of R. van Ee in providing the slant rivalry data files is sincerely appreciated.

References

Bear, M. F., & Malenka, R. C. (1994). Synaptic plasticity: LTP and LTD. Current Opinion in Neurobiology, 4, 389–399. Blake, R. (1989). A neural theory of binocular rivalry. Psychological Review, 96, 145–167. Borsellino, A., de Marco, A., Allazetta, A., Rinesi, A., & Bartolini, B. (1972). Reversal time distribution in the perception of visual ambiguous stimuli. Kybernetik, 10, 139–144. Brascamp, J. W., van Ee, R., Pestman, W. R., & van den Berg, A. V. (2005). Distribution of alternation rates in various forms of bistable perception. Journal of Vision, 5, 287–298. Brouwer, G. J., Tong, F., Schwarzbach, J., & van Ee, R. (2004). Neural correlates of stereoscopic depth perception in visual cortex. Society for Neuroscience, 664.13. Budd, J. M., & Kisvarday, Z. F. (2001). Local lateral connectivity of inhibitory clutch cells in layer 4 of cat visual cortex (area 17). Experimental Brain Research, 140(2), 245–250. Chong, S. C., Tadin, D., & Blake, R. (2005). Endogenous attention prolongs dominance durations in binocular rivalry. Journal of Vision, 5, 1004–1012. Crook, J. M., Kisvarday, Z. F., & Eysel, U. T. (1998). 
Evidence for a contribution of lateral inhibition to orientation tuning and direction selectivity in cat visual cortex: Reversible inactivation of functionally characterized sites combined with neuroanatomical tracing techniques. European Journal of Neuroscience, 10(6), 2056–2075. Dayan, P. (1998). A hierarchical model of binocular rivalry. Neural Computation, 10, 1119–1135. Ditzinger, T., & Haken, H. (1989). Oscillations in the perception of ambiguous patterns. Biological Cybernetics, 61, 279–287. Ernst, M. O., & Banks, M. S. (2002). Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870), 429–433. Ernst, M. O., & Buelthoff, H. H. (2004). Merging the senses into a robust percept. Trends in Cognitive Sciences, 8(4), 162–169.
Fox, R., & Herrman, J. (1967). Stochastic properties of binocular rivalry alternations. Perception and Psychophysics, 2, 432–436. Gomez, C., Argandona, E. D., Solier, R. G., Angulo, J. C., & Vazquez, M. (1995). Timing and competition in networks representing ambiguous figures. Brain and Cognition, 29, 103–114. Goryo, K., Robinson, J. O., & Wilson, J. A. (1984). Selective looking and the Muller-Lyer illusion: The effect of changes in the focus of attention on the Muller-Lyer illusion. Perception, 13, 647–654. Hayashi, R., Maeda, T., Shimojo, S., & Tachi, S. (2004). An integrative model of binocular vision: A stereo model utilizing interocularly unpaired points produces both depth and binocular rivalry. Vision Research, 44(20), 2367–2380. Hol, K., Koene, A., & van Ee, R. (2003). Attention-biased multi-stable surface perception in three-dimensional structure-from-motion. Journal of Vision, 3, 486–498. Howard, I. P., & Rogers, B. J. (2002). Seeing in depth, Vol. 2: Depth perception. Toronto: I. Porteous. Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology (London), 148, 574–591. Hubel, D. H., & Wiesel, T. N. (1979). Brain mechanisms of vision. Scientific American, 241(3), 150–162. Kalarickal, G. J., & Marshall, J. A. (2000). Neural model of temporal and stochastic properties of binocular rivalry. Neurocomputing, 32–33, 843–853. Kawamoto, A. H., & Anderson, J. A. (1985). A neural network model of multistable perception. Acta Psychologica, 59, 35–65. Lack, L. C. (1978). Selective attention and the control of binocular rivalry. The Hague: Mouton. Laing, C. R., & Chow, C. C. (2002). A spiking neuron model for binocular rivalry. Journal of Computational Neuroscience, 12, 39–53. Landy, M. S., Maloney, L. T., Johnston, E. B., & Young, M. (1995). Measurement and modeling of depth cue combination: In defense of weak fusion. Vision Research, 35(3), 389–412. Lee, S. H. (2004). 
Binocular battles on multiple fronts. Trends in Cognitive Sciences, 8(4), 148–151. Lehky, S. (1988). An astable multivibrator model of binocular rivalry. Perception, 17, 215–228. Lehky, S. R. (1995). Binocular rivalry is not chaotic. Proceedings of the Royal Society London B, 259, 71–76. Lehky, S. R., & Blake, R. (1991). Organization of binocular pathways: Modeling and data related to rivalry. Neural Computation, 3, 44–53. Lehky, S. R., & Sejnowski, T. J. (1990). Neural network model of stereoacuity and depth interpolation based on a distributed representation of stereo disparity. Journal of Neuroscience, 10, 2281–2299. Levelt, W. J. M. (1966). The alternation process in binocular rivalry. British Journal of Psychology, 57, 225–238. Logothetis, N. K., Leopold, D. A., & Sheinberg, D. L. (1996). What is rivalling during binocular rivalry? Nature, 380, 621–624.
Lumer, E. D. (1998). A neural model of binocular integration and rivalry based on the coordination of action-potential timing in primary visual cortex. Cerebral Cortex, 8, 553–561. Lund, J. S., Angelucci, A., & Bressloff, P. C. (2003). Anatomical substrates for functional columns in macaque monkey primary visual cortex. Cerebral Cortex, 13(1), 15–24. Marr, D., & Poggio, T. (1977). A computational theory of human stereo disparity. Science, 194, 283–287. Matsuoka, K. (1984). The dynamic model of binocular rivalry. Biological Cybernetics, 49, 201–208. McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88(5), 375–407. Meng, M., & Tong, F. (2004). Can attention selectively bias bistable perception? Differences between binocular rivalry and ambiguous figures. Journal of Vision, 4, 539–551. Merk, I., & Schnakenberg, J. (2002). A stochastic model of multistable visual perception. Biological Cybernetics, 86, 111–116. Moldakarimov, S., Rollenhagen, J. E., Olson, C. R., & Chow, C. C. (2005). Competitive dynamics in cortical responses to visual stimuli. Journal of Neurophysiology, 94(5), 3388–3396. Mueller, T. J. (1990). A physiological model of binocular rivalry. Visual Neuroscience, 4, 63–73. Mueller, T. J., & Blake, R. (1989). A fresh look at the temporal dynamics of binocular rivalry. Biological Cybernetics, 61, 223–232. Oruc, I., Maloney, L. T., & Landy, M. S. (2003). Weighted linear cue combination with possibly correlated error. Vision Research, 43, 2451–2468. Palmer, S. E. (1999). Vision science: Photons to phenomenology. Cambridge, MA: MIT Press. Shulman, G. L. (1992). Attentional modulation of figural aftereffect. Perception, 21, 7–19. Stollenwerk, L., & Bode, M. (2003). Lateral neural model of binocular rivalry. Neural Computation, 15, 2863–2882. Sugie, N. (1982). Neural models of brightness perception and retinal rivalry. 
Biological Cybernetics, 43, 13–21. Suzuki, S., & Grabowecky, M. (2002). Evidence for perceptual “trapping” and adaptation in multistable binocular rivalry. Neuron, 36, 143–157. Toppino, T. C. (2003). Reversible-figure perception: Mechanisms of intentional control. Perception and Psychophysics, 65(8), 1285–1295. Treue, S., Hol, K., & Rauber, H.-J. (2000). Seeing multiple directions of motion: Physiology and psychophysics. Nature Neuroscience, 3(3), 270–276. Tsal, Y. (1984). A Muller-Lyer illusion induced by selective attention. Quarterly Journal of Experimental Psychology, Human Experimental Psychology, 36A, 319–333. van Dam, L. C. J., & van Ee, R. (2005). The role of (micro)saccades and blinks in perceptual bi-stability from slant rivalry. Vision Research, 45(18), 2417–2435.
van Ee, R. (2005). Dynamics of perceptual bi-stability for stereoscopic slant rivalry and a comparison with grating, house-face, and Necker cube rivalry. Vision Research, 45, 29–40. van Ee, R., Adams, W. J., & Mamassian, P. (2003). Bayesian modeling of cue interaction: Bi-stability in stereoscopic slant perception. Journal of the Optical Society of America A, 20, 1398–1406. van Essen, D. C., Anderson, C. H., & Felleman, D. J. (1992). Information processing in the primate visual system: An integrated systems perspective. Science, 255(5043), 419–423. van Ee, R., Krumina, G., Pont, S. P., & van der Ven, S. (2005). Voluntarily controlled bi-stable slant perception of real and photographed surfaces. Proceedings of the Royal Society London, Series B, Biological Sciences, 272(1559), 141–148. van Ee, R., van Dam, L. C. J., & Brouwer, G. J. (2005). Voluntary control and the dynamics of perceptual bi-stability. Vision Research, 45, 41–55. van Ee, R., van Dam, L. C. J., & Erkelens, C. J. (2002). Bi-stability in perceived slant when binocular disparity and monocular perspective specify different slants. Journal of Vision, 2, 597–607. Vickers, D. (1972). A cyclic decision model of perceptual alternation. Perception, 1, 31–48. Wilson, H. R. (2003). Computational evidence for a rivalry hierarchy in vision. Proceedings of the National Academy of Sciences of the United States of America, 100(24), 14499–14503. Wilson, H. R., Blake, R., & Lee, S. H. (2001). Dynamics of travelling waves in visual perception. Nature, 412, 907–910. Zhou, Y. H., Gao, J. B., White, K. D., & Yao, K. (2004). Perceptual dominance time distributions in multistable visual perception. Biological Cybernetics, 90, 256–263.
Received January 4, 2006; accepted April 27, 2006.
LETTER
Communicated by Guenther Palm
A Unifying View of Wiener and Volterra Theory and Polynomial Kernel Regression

Matthias O. Franz [email protected]
Bernhard Schölkopf [email protected]
Max-Planck-Institut für biologische Kybernetik, D-72076 Tübingen, Germany
Volterra and Wiener series are perhaps the best-understood nonlinear system representations in signal processing. Although both approaches have enjoyed a certain popularity in the past, their application has been limited to rather low-dimensional and weakly nonlinear systems due to the exponential growth of the number of terms that have to be estimated. We show that Volterra and Wiener series can be represented implicitly as elements of a reproducing kernel Hilbert space by using polynomial kernels. The estimation complexity of the implicit representation is linear in the input dimensionality and independent of the degree of nonlinearity. Experiments show performance advantages in terms of convergence, interpretability, and system sizes that can be handled.

1 Introduction

In system identification, one tries to infer the functional relationship between system input and output from observations of the ingoing and outgoing signals. If the system is linear, it can always be characterized uniquely by its impulse response. For nonlinear systems, however, there exists no canonical representation that encompasses all conceivable systems. The earliest approach to a systematic, that is, a nonparametric, characterization of nonlinear systems dates back to V. Volterra, who extended the standard convolution description of linear systems by a series of polynomial integral operators with increasing degree of nonlinearity, very similar in spirit to the Taylor series for analytical functions (Volterra, 1887). The last 120 years have seen the accumulation of a huge amount of research on both the class of systems that can be represented by Volterra operators and their application in such diverse fields as nonlinear differential equations, neuroscience, fluid dynamics, and electrical engineering (overviews and bibliography in Schetzen, 1980; Rugh, 1981; Mathews & Sicuranza, 2000; Giannakis & Serpedin, 2001).
Neural Computation 18, 3097–3118 (2006)
© 2006 Massachusetts Institute of Technology

M. Franz and B. Schölkopf
3098

A principal problem of the Volterra approach is the exponential growth of the number of terms in the operators with both degree of nonlinearity and input dimensionality. This has limited its application to rather low-dimensional systems with mild nonlinearities. Here, we show that this problem can be largely alleviated by reformulating the Volterra and Wiener series as operators in a reproducing kernel Hilbert space (RKHS). In this way, the whole Volterra and Wiener approach can be incorporated into the rapidly growing field of kernel methods. In particular, the estimation of Volterra or Wiener expansions can be done by polynomial kernel regression that scales only linearly with input dimensionality, independent of the degree of nonlinearity. Moreover, RKHS theory allows us to estimate even infinite Volterra series, which was not possible before. Our experiments indicate that the RKHS formulation also leads to practical improvements in terms of prediction accuracy and interpretability of the results. In the next section, we review the essential results of the classical Volterra and Wiener theories of nonlinear systems. (This section is mainly a review for readers who are not familiar with Wiener and Volterra theory.) In section 3, we discuss newer developments since the mid-1980s that led to our new formulation, which is presented in section 4. A preliminary account of this work has appeared in Franz and Schölkopf (2004).

2 Volterra and Wiener Theory of Nonlinear Systems

2.1 The Volterra Class. A system can be defined as a map that assigns an output signal y(t) to an input signal x(t) (we assume for the moment that x(t) and y(t) are functions of time t). Mathematically, this rule can be expressed in the form

y(t) = T x(t),    (2.1)
using a system operator T that maps from the input to the output function space. The system is typically assumed to be time invariant and continuous; the system response should remain unchanged for repeated presentation of the same input, and small changes in the input functions x(t) should lead to small changes in the corresponding system output functions y(t). In traditional systems theory, we further restrict T to be a sufficiently well-behaved compact linear operator H_1 such that the system response can be described by a convolution,

y(t) = H_1 x(t) = \int h^{(1)}(\tau)\, x(t - \tau)\, d\tau,    (2.2)
of x(t) with a linear kernel (or impulse response) h^{(1)}(\tau). A natural extension
Wiener and Volterra Theory and Polynomial Kernels
3099
of this convolution description to nonlinear systems is the Volterra series operator,

y(t) = V x(t) = H_0 x(t) + H_1 x(t) + H_2 x(t) + \cdots + H_n x(t) + \cdots,    (2.3)
in which H_0 x(t) = h_0 = const. and

H_n x(t) = \int \cdots \int h^{(n)}(\tau_1, \ldots, \tau_n)\, x(t - \tau_1) \cdots x(t - \tau_n)\, d\tau_1 \cdots d\tau_n    (2.4)
is the nth-order Volterra operator (Volterra, 1887, 1959). The integral kernels h^{(n)}(\tau_1, \ldots, \tau_n) are the Volterra kernels. Depending on the system to be represented, the integrals can be computed over finite or infinite time intervals. The support of the Volterra kernel defines the memory of the system; that is, it delimits the time interval in which past inputs can influence the current system output. The Volterra series can accordingly be regarded as a Taylor series with memory: whereas the usual Taylor series represents only systems that instantaneously map the input to the output, the Volterra series characterizes systems in which the output also depends on past inputs. The input functions typically come from some real, separable Hilbert space such as L^2[a, b], the output functions from the space C[a, b] of bounded continuous functions. Similar to the Taylor series, the convergence of a Volterra series can be guaranteed for only a limited range of the system input amplitude. As a consequence, the input functions must be restricted to some suitable subset of the input space. For instance, if the input signals form a compact subset of the input function space, one can apply the Stone-Weierstraß theorem (a generalization of the Weierstraß theorem to nonlinear operators; see, e.g., Hille & Phillips, 1957) to show that any continuous, nonlinear system can be uniformly approximated (i.e., in the L^∞-norm) to arbitrary accuracy by a Volterra series operator of sufficient but finite order (Fréchet, 1910; Brilliant, 1958; Prenter, 1970).1 Although this approximation result appears rather general at first sight, the restriction to compact input sets is quite severe. An example of a compact subset is the set of functions from L^2[a, b] defined over a closed time interval with a common upper bound (proof in Liusternik & Sobolev, 1961).
In practice, this means that the input signals have to be nonzero only on a finite time interval and that the approximation holds only there. Many natural choices of input signals are precluded by this requirement, such as the unit ball in L^2[a, b] or infinite periodic forcing signals.
1 If one further restricts the system to have fading memory (i.e., the influence of past inputs decays exponentially), the uniform approximation by finite Volterra series can be extended to bounded and slew-limited input signals on infinite time intervals (Boyd & Chua, 1985).
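A concrete instance of a truncated, discrete-time Volterra model (anticipating the discrete formulation of section 2.3) is easy to write down. The kernels and input below are arbitrary illustrative choices, not values from the letter:

```python
import numpy as np

def volterra2_output(x, h0, h1, h2):
    """Output of a second-order discrete Volterra model for one memory
    window x = (x(t), x(t-1), ..., x(t-m+1)): constant term, linear
    convolution term, and quadratic form in the window."""
    return h0 + h1 @ x + x @ h2 @ x

m = 3                                   # memory length
h0 = 0.5                                # zero-order kernel (constant)
h1 = np.array([1.0, 0.5, 0.25])         # first-order (impulse-response) kernel
h2 = 0.1 * np.eye(m)                    # second-order kernel (symmetric)

x = np.array([1.0, 2.0, 0.0])           # current memory window
y = volterra2_output(x, h0, h1, h2)
print(y)  # 0.5 + (1*1 + 0.5*2) + 0.1*(1 + 4) = 3.0
```

The quadratic term `x @ h2 @ x` is exactly the double sum of equation 2.4 in its discrete form; the "Taylor series with memory" reading corresponds to the fact that the output depends on the whole window x, not just the current sample.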
2.2 The Wiener Class. So far, we have discussed only the representation of a general nonlinear system. Now we come to the problem of obtaining such a representation from data. For a linear system, this is a straightforward procedure since it suffices to test the system on a set of basis functions from the input space (e.g., delta functions or sinusoids). In a nonlinear system, however, we ideally have to measure the system response for all possible input functions. One way to achieve this is by testing the system on realizations of a suitable random process. The stochastic input in Wiener theory is the limiting form of the random walk process as the number of steps goes to infinity (or, equivalently, as the step size goes to zero), which is now known as the Wiener process (Papoulis, 1991). One can show that the Wiener process assigns a nonzero probability to the neighbourhood of every continuous input function (Palm & Poggio, 1977). Thus, the realizations of the Wiener process play a similar role in Wiener theory as the sinusoidal test inputs in linear system theory since they are capable of completely characterizing the system. In system identification, we are given only pairs of input and output functions, whereas the system itself is treated as a black box. The appropriate Volterra representation has to be found by minimizing some error measure between the true output and the model output, such as the integral over the squared error. Thus, the approximation has to be only in the L^2-norm, not in the L^∞-norm as in Volterra theory. A weaker approximation criterion typically relaxes the restrictions imposed on the input and output set and on the type of systems that can be represented by a Volterra series (Palm, 1978).
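The least-squares viewpoint can be made concrete with a toy discrete example (the target system, sample size, and random seed are invented for illustration): the coefficients of a small second-order Volterra system are recovered by ordinary linear regression on explicit monomial features.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Black box" system to identify: a second-order Volterra functional on a
# two-sample memory window (coefficients unknown to the estimator).
def system(X):
    return 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.2 * X[:, 0] * X[:, 1]

X = rng.normal(size=(500, 2))   # 500 observed input windows
y = system(X)                   # corresponding observed outputs

# Explicit monomial features up to degree 2: 1, x1, x2, x1^2, x1*x2, x2^2.
Phi = np.column_stack([
    np.ones(len(X)), X[:, 0], X[:, 1],
    X[:, 0] ** 2, X[:, 0] * X[:, 1], X[:, 1] ** 2,
])

# Minimize the squared error over the Volterra coefficients.
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.round(coef, 3))  # recovers 0.5, -0.3, and 0.2 in the right places
```

This explicit feature expansion is exactly what becomes infeasible for high input dimensionality and nonlinearity degree, which is the problem the implicit RKHS formulation of this letter addresses.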
Wiener theory relaxes the approximation criterion even further: assuming that the input is generated by the Wiener process, it requires only an approximation in the mean squared error sense over the whole process, not for any single realization of it. The minimization of the mean squared error for the estimation of the Volterra kernels requires the solution of a simultaneous set of integral equations. This can be avoided by using an orthogonal least-squares framework as proposed by Wiener (1958) and Barrett (1963). Since the distribution of the input is known for the Wiener process, we can choose an input-specific decomposition of the system operator T,

y(t) = G_0 x(t) + G_1 x(t) + G_2 x(t) + \cdots + G_n x(t) + \cdots,    (2.5)
into a Wiener series of operators G n that are mutually uncorrelated, that is, orthogonal with respect to the Wiener process. The Wiener operators G n are linear combinations of Volterra operators up to order n. They can be obtained from the original Volterra series by a procedure very similar to
Gram-Schmidt orthogonalization. For instance, the second-degree Wiener operator,2

G_2 x(t) = \iint h_2(\tau_1, \tau_2)\, x(t - \tau_1)\, x(t - \tau_2)\, d\tau_1\, d\tau_2 - \int h_2(\tau_1, \tau_1)\, d\tau_1,    (2.6)
consists of a zero-order and a second-order Volterra operator. The integral kernel of the highest-order (i.e., nth-order) Volterra operator of G_n is called the leading Volterra kernel of G_n. As a result of the orthogonalization, the G_n can be estimated independently of each other. Moreover, any truncation of this orthogonalized series minimizes the mean squared error among all truncated Volterra expansions of the same order. All systems that produce square integrable output for the Wiener input process can be approximated in the mean square sense by finite-order Wiener series operators (Ahmed, 1970). In practice, this means that the systems must be nondivergent and cannot have infinite memory. Due to the different types of inputs and convergence, the classes of systems that can be approximated by infinite Volterra or Wiener series operators are not identical. Some systems of the Wiener class cannot be represented as a Volterra series operator and vice versa (Palm & Poggio, 1977; Korenberg & Hunter, 1990). However, a truncated Wiener or Volterra series can always be transformed into its truncated counterpart. One of the reasons for the popularity of the Wiener series is that the leading Volterra kernels can be directly measured via the cross-correlation method of Lee and Schetzen (1965). If one uses gaussian white noise with standard deviation A instead of the Wiener process as input, the leading Volterra kernel of G_n can be estimated as

h^{(n)}(\tau_1, \ldots, \tau_n) = \frac{1}{n!\, A^n} \overline{\left[ y(t) - \sum_{l=0}^{n-1} G_l x(t) \right] x(t - \tau_1) \cdots x(t - \tau_n)},    (2.7)
where the bar indicates the average over time. The zero-order kernel is simply the time average h^{(0)} = \overline{y(t)} of the output function. The other lower-order Volterra kernels of G_n can be derived from the leading kernel by again applying a Gram-Schmidt-type orthogonalization procedure.

2.3 Discrete Volterra and Wiener Systems. In practical signal processing, one uses a discretized form for a finite sample of data. Here, we assume
2 Strictly speaking, the integrals in the Wiener operators have to be interpreted as stochastic integrals (e.g., Papoulis, 1991) with respect to the Wiener process; that is, the equality holds only in the mean squared sense. For conditions under which the equality also holds for specific inputs, see Palm & Poggio (1977).
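The cross-correlation estimate of equation 2.7 can be sketched numerically for the simplest case, a first-order (linear) kernel. The filter, sample size, and seed are arbitrary choices for this sketch, and A is taken here as the noise power (variance) of the gaussian input, the convention under which the n = 1 case reduces to dividing the cross-correlation by the input variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Known target system: a linear FIR filter, i.e., a pure first-order
# Volterra system with kernel h_true.
h_true = np.array([0.8, -0.4, 0.2])
m = len(h_true)

A = 1.0                                   # input noise power (variance)
x = rng.normal(0.0, np.sqrt(A), size=200_000)
y = np.convolve(x, h_true)[: len(x)]      # measured system output

# Lee-Schetzen estimate for n = 1:
# h1(tau) = (1/A) * time-average of (y(t) - G0) * x(t - tau), G0 = mean of y.
h_est = np.array([
    np.mean((y[m:] - y.mean()) * x[m - tau : len(x) - tau]) / A
    for tau in range(m)
])
print(h_est)  # approaches h_true as the sample grows
```

Even in this benign setting, a couple of hundred thousand samples are needed for two or three stable digits, which illustrates problem 1 of section 2.4; the variance of the estimate also grows with the time delay τ.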
that the input data are given as a vector x = (x_1, \ldots, x_m) \in \mathbb{R}^m of finite dimension. The vectorial data can be generated from any multidimensional input or, for instance, by a sliding window over a discretized image or time series. A discrete system is simply described by a function T : \mathbb{R}^m \to \mathbb{R}, not by an operator as before. The discretized Volterra operator is defined as the function

H_n(x) = \sum_{i_1=1}^{m} \cdots \sum_{i_n=1}^{m} h^{(n)}_{i_1 \ldots i_n}\, x_{i_1} \cdots x_{i_n},    (2.8)
where the Volterra kernel is given as a finite number of m^n coefficients h^{(n)}_{i_1 \ldots i_n} (Alper, 1965). It is, accordingly, a linear combination of all ordered nth-order monomials of the elements of x.3 Analogous to the continuous-time Volterra series, it can be shown by applying the Stone-Weierstraß theorem that all continuous systems with compact input domain can be uniformly approximated by a finite, discrete Volterra series. For systems with exponentially fading memory, the uniform approximation can be extended to all input vectors with a common upper bound (Boyd & Chua, 1985). The discrete analog to the Wiener series is typically orthogonalized with respect to gaussian input x ∼ N(0, A) since this is the only practical setting where the popular cross-correlation method can be applied. All properties of continuous Wiener series operators described above carry over to the discrete case. In particular, any square-summable functional with gaussian input can be approximated in the mean square sense by a finite, discrete Wiener series (Palm & Poggio, 1978).

2.4 Problems of the Cross-Correlation Method. The estimation of the Wiener expansion via cross-correlation poses some serious problems:

1. The estimation of cross-correlations requires large sample sizes. Typically, one needs several tens of thousands of input-output pairs before a sufficient convergence is reached. Moreover, the variance of the cross-correlation estimator in equation 2.7 increases with increasing values of the time delay τ_i (Papoulis, 1991) such that only operators with relatively small memory can be reliably estimated.

2. The estimation via cross-correlation works only if the input is gaussian noise with zero mean, not for general types of input. In physical experiments, however, deviations from ideal white noise and the resulting estimation errors cannot be avoided. 
Specific inputs, on the other hand, may have a very low probability of being generated by white noise. Since the approximation is computed only in the mean square sense, the system response to these inputs may be drastically different from the model predictions.[4]

[3] Throughout this letter, we assume that the Volterra kernels are symmetrical with respect to permutations of the indices i_j. A nonsymmetrical kernel can be converted into a symmetrical kernel without changing the system output (Volterra, 1959; Mathews & Sicuranza, 2000).

3. In practice, the cross-correlations have to be estimated at a finite resolution (cf. the discretized version of the Volterra operator in equation 2.8). The number of expansion coefficients in equation 2.8 increases as m^n for an m-dimensional input signal and an nth-order Wiener kernel. However, the number of coefficients that actually have to be estimated by cross-correlation is smaller. Since the products in equation 2.8 remain the same when two different indices are permuted, the associated coefficients are equal in symmetrical Volterra operators. As a consequence, the required number of measurements is (n + m − 1)!/(n!(m − 1)!) (Mathews & Sicuranza, 2000). Nonetheless, the resulting numbers are huge for higher-order Wiener kernels. For instance, a fifth-order Wiener kernel operating on 256-dimensional input contains roughly 10^12 coefficients, 10^10 of which would have to be measured individually by cross-correlation. As a consequence, this procedure is not feasible for higher-dimensional input signals.

4. The cross-correlation method assumes noise-free signals. For real noise-contaminated data, the estimated Wiener series models both signal and noise of the training data, which typically results in reduced prediction performance on independent test sets.

Wiener and Volterra Theory and Polynomial Kernels
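The coefficient counts quoted above for the fifth-order, 256-dimensional case are easy to verify numerically. A small sketch (the function name is ours) computing the m^n ordered monomials and the (n + m − 1)!/(n!(m − 1)!) distinct coefficients of a symmetrical kernel:

```python
from math import comb

def volterra_coeff_counts(m, n):
    """Return (ordered monomials, distinct symmetric coefficients) for an
    nth-order Volterra kernel on m-dimensional input."""
    total = m ** n                 # all ordered nth-order monomials
    distinct = comb(n + m - 1, n)  # (n + m - 1)!/(n!(m - 1)!) under symmetry
    return total, distinct

# the fifth-order, 256-dimensional example from the text
total, distinct = volterra_coeff_counts(256, 5)
print(total)     # 1099511627776, roughly 10^12
print(distinct)  # 9525431552, roughly 10^10
```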
3 Estimating Wiener Series by Linear Regression in RKHS

3.1 Linear Regression. The first two problems can be overcome by adopting the framework of linear regression: given observations (x_1, y_1), ..., (x_N, y_N), linear regression tries to estimate y as a function of x by

y = f(x) = Σ_{j=1}^M γ_j φ_j(x),   (3.1)

using γ_j ∈ R and a dictionary of M functions φ_j : R^m → R, where M is allowed to be infinite. In the case of pth-order Volterra or Wiener series, this dictionary consists of all monomials of x up to order p (see equation 2.8).
[4] A number of studies develop an orthogonal framework with respect to other input classes (Schetzen, 1965; Ogura, 1972; Segall & Kailath, 1976). None of these, however, can be applied to input classes different from the one they were developed for.
M. Franz and B. Schölkopf
Instead of assuming an infinite amount of data, the γ_j are found by minimizing the mean squared error over the data set

c((x_1, y_1, f(x_1)), ..., (x_N, y_N, f(x_N))) = (1/N) Σ_{j=1}^N (f(x_j) − y_j)^2,   (3.2)
which disposes of the cumbersome cross-correlation estimator (Korenberg, Bruder, & McIlroy, 1988; Mathews & Sicuranza, 2000). Moreover, the input signal class is no longer restricted to gaussian noise, but can be chosen freely, for example, from the “natural” input ensemble of the system. As long as the input is known to the experimenter, there is no need for controlling the input as in the classical system identification setting. Note, however, that the obtained Volterra models will approximate the Wiener series only for sufficiently large data sets of gaussian white noise. Korenberg et al. (1988) have shown that the linear regression framework leads to Wiener models that are orders of magnitude more accurate than those obtained from the cross-correlation method. Unfortunately, the solution of this regression problem requires the inversion of an M × M matrix (Mathews & Sicuranza, 2000). This is again prohibitive for high-dimensional data and higher orders of nonlinearity since M scales like m^n.

3.2 Regression in RKHS. If we stack the basis functions φ_j(x) into a common vector Φ(x) = (φ_1(x), φ_2(x), ...), we can interpret Φ(x) as a nonlinear mapping from R^m into another, possibly infinite-dimensional space H. For certain dictionaries, H constitutes a dot product space where the dot product can be expressed in terms of a positive definite[5] kernel function[6] k(x, x′) = Φ(x)^⊤ Φ(x′), which can be evaluated without computing Φ; that is, the possibly infinite-dimensional dot product can be replaced by a simple function evaluation (see, e.g., Schölkopf & Smola, 2002). An important property of such a space is that it can be identified with a suitable closure of the space of functions

f(x) = Σ_{j=1}^K α_j k(x, z_j),   (3.3)
with an arbitrary set of points z_1, ..., z_K from R^m; in this case, any expansion of type 3.1 can be expressed in the form 3.3 (Schölkopf & Smola, 2002). This space has the structure of an RKHS, which allows the application of the so-called representer theorem. It states the following: suppose c is an arbitrary

[5] That is, the Gram matrix K_ij = k(x_i, x_j) is positive definite for all choices of the x_1, ..., x_N from the input domain.
[6] Note that with a slight abuse of notation, we will nevertheless use the transpose to denote the dot product in that space.
cost function, Ω is a nondecreasing function on R^+, and ‖·‖_F is the norm of the RKHS. If we minimize an objective function

c((x_1, y_1, f(x_1)), ..., (x_K, y_K, f(x_K))) + Ω(‖f‖_F),   (3.4)

over all α_j and z_j in equation 3.3, then an optimal solution[7] can be expressed as

f(x) = Σ_{j=1}^N α_j k(x, x_j),   α_j ∈ R.   (3.5)

In other words, although we did consider functions that were expansions in terms of arbitrary points z_j (see equation 3.3), it turns out that we can always express the solution in terms of the training points x_j only. Hence, the optimization problem over an arbitrarily large number of M weights γ_j is transformed into one over N weights α_j, where N is the number of training points. In our case, the cost function is given by equation 3.2, and the regularizer Ω is zero. The optimal weight set α = (α_1, ..., α_N)^⊤ is readily computed by setting the derivative of equation 3.2 with respect to the weights α_j equal to zero; it takes the form α = K^{-1} y where y = (y_1, ..., y_N)^⊤; hence,[8]

y = f(x) = α^⊤ k(x) = y^⊤ K^{-1} k(x),   (3.6)
where k(x) = (k(x, x_1), k(x, x_2), ..., k(x, x_N))^⊤ ∈ R^N. As a result, we have to invert an N × N matrix instead of the M × M matrix of linear regression. For high-dimensional data, we typically have M = m^n ≫ N. In this case, a time complexity[9] of O(mN^2 + N^3) and a memory complexity of O(N^2) compare favorably to the exponential complexity of the original linear regression problem, which is O(m^{3n}) and O(m^{2n}), respectively.

3.3 Volterra Series as Linear Operator in RKHS. In order to apply the RKHS framework to the problem of estimating the Volterra and Wiener expansion of a system, we have to find a suitable kernel. Our starting point is the discretized version of the Volterra operators from equation 2.8. The nth-order Volterra operator is a weighted sum of all nth-order monomials of the input vector x. For n = 0, 1, 2, ... we define the map Φ_n as Φ_0(x) = 1 and

Φ_n(x) = (x_1^n, x_1^{n−1} x_2, ..., x_1 x_2^{n−1}, x_2^n, ..., x_m^n)^⊤
(3.7)
[7] For conditions on the uniqueness of the solution, see Schölkopf and Smola (2002).
[8] If K is not invertible, K^{-1} denotes the pseudo-inverse of K.
[9] The evaluation of the kernel function typically has complexity O(m), which holds true for the polynomial kernels described below.
such that Φ_n maps the input x ∈ R^m into a vector Φ_n(x) ∈ H_n = R^{m^n} containing all m^n ordered monomials of degree n evaluated at x. Using Φ_n, we can write the nth-order Volterra operator in equation 2.8 as a scalar product in H_n,

H_n(x) = η_n^⊤ Φ_n(x),   (3.8)

with the coefficients stacked into the vector η_n = (h^(n)_{1,1,...,1}, h^(n)_{1,2,...,1}, h^(n)_{1,3,...,1}, ...)^⊤ ∈ H_n. Fortunately, the monomials constitute an RKHS. It can be easily shown (e.g., Schölkopf & Smola, 2002) that

Φ_n(x_1)^⊤ Φ_n(x_2) = (x_1^⊤ x_2)^n =: k_n(x_1, x_2).   (3.9)
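Equation 3.9 can be checked directly by enumerating the feature map of equation 3.7: the dot product of the m^n ordered monomial vectors coincides with the kernel value. A minimal sketch (helper names are ours):

```python
import itertools

import numpy as np

def phi_n(x, n):
    """Feature map of eq. 3.7: all m**n ordered degree-n monomials of x."""
    return np.array([np.prod(c) for c in itertools.product(x, repeat=n)])

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=4), rng.normal(size=4)

n = 3
explicit = phi_n(x1, n) @ phi_n(x2, n)  # dot product of 4**3 = 64 monomials
kernel = (x1 @ x2) ** n                 # k_n(x1, x2) from eq. 3.9
assert np.isclose(explicit, kernel)
```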
This equivalence was used as early as 1975 in an iterative estimation scheme for Volterra models, long before the RKHS framework became commonplace (Poggio, 1975). The estimation problem can be solved directly if one applies the same idea to the entire pth-order Volterra series. By stacking the maps Φ_n with positive weights a_n into a single map Φ^(p)(x) = (a_0 Φ_0(x), a_1 Φ_1(x), ..., a_p Φ_p(x))^⊤, one obtains a mapping from R^m into H^(p) = R × R^m × R^{m^2} × ... × R^{m^p} = R^M with dimensionality M = (1 − m^{p+1})/(1 − m). The entire pth-order Volterra series can be written as a scalar product in H^(p),

Σ_{n=0}^p H_n(x) = η^(p)⊤ Φ^(p)(x),   (3.10)

with η^(p) ∈ H^(p). Since H^(p) is generated as a Cartesian product of the single spaces H_n, the associated scalar product is simply the weighted sum of the scalar products in H_n:

Φ^(p)(x_1)^⊤ Φ^(p)(x_2) = Σ_{n=0}^p a_n^2 (x_1^⊤ x_2)^n =: k^(p)(x_1, x_2).   (3.11)

A special case of this kernel is the inhomogeneous polynomial kernel used in the Volterra estimation approach of Dodd and Harrison (2002),

k^(p)_inh(x_1, x_2) = (1 + x_1^⊤ x_2)^p,   (3.12)

which corresponds to

(1 + x_1^⊤ x_2)^p = Σ_{n=0}^p C(p, n) (x_1^⊤ x_2)^n   (3.13)
via the binomial theorem. If a suitably decaying weight set a_n is chosen, the approach can be extended even to infinite Volterra series. For instance, for a_n^2 = 1/n! we obtain the translation-variant gaussian kernel

k^(∞)(x_1, x_2) = e^{x_1^⊤ x_2} = Σ_{n=0}^∞ (1/n!) (x_1^⊤ x_2)^n,   (3.14)

or, for ‖x‖ < 1 and α > 0, Vovk’s infinite polynomial kernel (Saunders et al., 1998),

k_Vovk(x_1, x_2) = (1 − x_1^⊤ x_2)^{−α} = Σ_{n=0}^∞ C(−α, n) (−1)^n (x_1^⊤ x_2)^n.   (3.15)
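Both series identities can be checked numerically for a scalar product t = x_1^⊤ x_2 with |t| < 1. The sketch below (all names ours) verifies the binomial expansion 3.13 and a high-order truncation of the exponential series 3.14:

```python
from math import comb, exp, factorial

# a scalar product value with |t| < 1 so that both series apply
t = 0.3
p = 4

# eq. 3.13: (1 + t)^p = sum_{n=0}^{p} C(p, n) t^n, i.e. a_n^2 = C(p, n)
lhs = (1.0 + t) ** p
rhs = sum(comb(p, n) * t ** n for n in range(p + 1))
assert abs(lhs - rhs) < 1e-12

# eq. 3.14: e^t = sum_n t^n / n!, truncated at order 30
assert abs(exp(t) - sum(t ** n / factorial(n) for n in range(30))) < 1e-12
```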
The latter two kernels have been shown to be universal: the functions of their associated RKHS are capable of uniformly approximating all continuous functions on compact input sets in R^m (Steinwart, 2001). As we have seen in the discussion of the approximation capabilities of discrete Volterra series, the family of finite polynomial kernels in its entirety is also universal since the union of their RKHSs comprises all discrete Volterra series. Isolated finite polynomial kernels, however, do not share this property.

3.4 Implicit Wiener Series Estimation. We know now that both finite and infinite discretized Volterra series can be expressed as linear operators in an RKHS. As we stated above, the pth-degree Wiener expansion is the pth-order Volterra series that minimizes the squared error if the input is white gaussian noise with zero mean. This can be put into the regression framework: assume we generate white gaussian noise with zero mean, feed it into the unknown system, and measure its output. Since any finite Volterra series can be represented as a linear operator in the corresponding RKHS, we can find the pth-order Volterra series that minimizes the squared error by linear regression. This, by definition, must be the pth-degree Wiener series since no other Volterra series has this property.[10] From equations 2.7 and 3.6, we obtain the following expressions for the implicit Wiener series,

G_0(x) = (1/N) y^⊤ 1   and   Σ_{n=0}^p G_n(x) = Σ_{n=0}^p H_n(x) = y^⊤ K_p^{-1} k^(p)(x),   (3.16)

where the Gram matrix K_p and the coefficient vector k^(p)(x) are computed using the kernel from equation 3.11 and 1 = (1, 1, ..., 1)^⊤ ∈ R^N.

[10] Assuming symmetrized Volterra kernels, which can be obtained from any Volterra expansion.

The system is now represented as a linear combination of kernels evaluated at the
training points instead of a linear combination of monomials; that is, the Wiener series and its Volterra functionals are represented only implicitly. Thus, there is no need to compute the possibly large number of coefficients explicitly. The explicit Volterra and Wiener expansions can be recovered, at least in principle, from equation 3.16 by collecting all terms containing monomials of the desired order and summing them up. The individual nth-order Volterra operators (p > 0) are given implicitly by

H_n(x) = a_n y^⊤ K_p^{-1} k_n(x),   (3.17)

with k_n(x) = ((x_1^⊤ x)^n, (x_2^⊤ x)^n, ..., (x_N^⊤ x)^n)^⊤.[11] For p = 0, the only term is the constant zero-order Volterra operator H_0(x) = G_0(x). The coefficient vector η_n = (h^(n)_{1,1,...,1}, h^(n)_{1,2,...,1}, h^(n)_{1,3,...,1}, ...)^⊤ of the explicit Volterra operator is obtained as

η_n = a_n Φ_n^⊤ K_p^{-1} y,   (3.18)

using the design matrix Φ_n = (Φ_n(x_1), Φ_n(x_2), ..., Φ_n(x_N))^⊤. Note that these equations are also valid for infinite polynomial kernels such as k^(∞) or k_Vovk. Similar findings are known from the neural network literature, where Wray and Green (1994) showed that individual Volterra operators can be extracted from certain network models with sigmoid activation functions that correspond to infinite Volterra series.

The individual Wiener operators can be recovered only by applying the regression procedure twice. If we are interested in the nth-degree Wiener operator, we have to compute the solution for the kernels k^(n)(x_1, x_2) and k^(n−1)(x_1, x_2). The Wiener operator for n > 0 is then obtained from the difference of the two results as

G_n(x) = Σ_{i=0}^n G_i(x) − Σ_{i=0}^{n−1} G_i(x) = y^⊤ (K_n^{-1} k^(n)(x) − K_{n−1}^{-1} k^(n−1)(x)).   (3.19)
The corresponding ith-order Volterra operators of the nth-degree Wiener operator are computed analogously to equations 3.17 and 3.18.
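A small numerical sketch of equations 3.16 and 3.19 (unit weights a_n = 1 and all variable names ours): fitting a toy system with kernels of degree 2 and 1 and taking the difference of the training-set predictions yields a second-degree operator G_2 that is empirically orthogonal to all monomials of order below 2, as proved in section 3.5.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 100, 3
X = rng.normal(size=(N, m))                  # gaussian training input
y = 2.0 + X[:, 0] + 3.0 * X[:, 0] * X[:, 1]  # toy system with a second-order part

def k_sum(A, B, p):
    """k^(p)(a, b) = sum_{n=0}^{p} (a.b)^n, i.e. unit weights a_n = 1 in eq. 3.11."""
    G = A @ B.T
    return sum(G ** n for n in range(p + 1))

def fit_on_train(p):
    """Degree-p implicit fit (eq. 3.16) evaluated on the training points."""
    K = k_sum(X, X, p)
    # pseudo-inverse (cf. footnote 8) with a generous cutoff for numerical rank
    return K @ np.linalg.pinv(K, rcond=1e-10) @ y

g2 = fit_on_train(2) - fit_on_train(1)       # eq. 3.19 on the training set

# orthogonality (theorem 1): zero mean and zero correlation with linear monomials
assert abs(g2.mean()) < 1e-8
assert np.max(np.abs(X.T @ g2)) / N < 1e-8
```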
[11] Note that the time complexity of computing the explicit Volterra operator is O(m^n N^2 + N^3), and the corresponding memory complexity is O(m^n N + N^2). Thus, using the implicit estimation method as an intermediate step is still preferable over direct linear regression for m^n > N.
3.5 Orthogonality. The resulting Wiener operators must fulfill the orthogonality condition, which in its strictest form states that a pth-degree Wiener operator must be orthogonal to all monomials in the input of lower order. However, we have constructed our operators in a different function basis; we have expanded the pth-order Wiener operators in terms of kernels instead of monomials. Thus, we have to prove the following:

Theorem 1. The operators obtained from equation 3.19 fulfill the orthogonality condition

E[m(x) G_p(x)] = 0,   (3.20)

where E denotes the expectation over the training set and m(x) an rth-order monomial with r < p.

Proof. We will show that this is a consequence of the least-squares fit of any linear expansion in a set of basis functions of the form of equation 3.10. In the case of the implicit Wiener and Volterra expansions, the basis functions φ_j(x) are polynomial kernel functions k^(r)(x, x_i) evaluated at the training examples x_i. We denote the error of the expansion as e(x) = y − Σ_{j=1}^M α_j φ_j(x). The minimum of the expected quadratic loss with respect to the expansion coefficient α_k is given by

∂/∂α_k E[e(x)^2] = −2 E[φ_k(x) e(x)] = 0.   (3.21)

This means that for an expansion of the type of equation 3.1 minimizing the squared error, the error is orthogonal to all basis functions used in the expansion. Now let us assume we know the Wiener series expansion (which minimizes the mean squared error) of a system up to degree p − 1. The approximation error is then given by the sum of the higher-order Wiener operators e(x) = Σ_{n=p}^∞ G_n(x), so G_p(x) is part of the error. As a consequence of the linearity of the expectation, equation 3.21 implies

Σ_{n=p}^∞ E[φ_k(x) G_n(x)] = 0   and   Σ_{n=p+1}^∞ E[φ_k(x) G_n(x)] = 0   (3.22)

for any φ_k of order less than p. The difference of both equations yields E[φ_k(x) G_p(x)] = 0, so that G_p(x) must be orthogonal to any of the lower-order basis functions, that is, to all kernel functions k^(r)(x, x_i) with order r smaller than p. Since any monomial m(x_i) of degree r < p evaluated on a training data point x_i can be expressed as a linear combination of kernel functions
up to degree r, orthogonality on the training set must also hold true for any monomial of order r < p.

For both regression and orthogonality of the resulting operators, the assumption of white gaussian noise was not required. In practice, this means that we can compute a Volterra expansion according to equation 3.16 for any type of input, not just for gaussian noise. Note, however, that the orthogonality of the operators can be defined only with respect to an input distribution. If we use equation 3.19 for nongaussian input, the resulting operators will still be orthogonal, but with respect to the nongaussian input distribution. The resulting decomposition of the Volterra series into orthogonal operators will be different from the gaussian case. As a consequence, the operators computed according to equation 3.19 will not be the original Wiener operators, but an extension of this concept as proposed by Barrett (1963).

3.6 Regularized Estimation. So far we have not addressed the fourth problem of the cross-correlation procedure: the neglect of measurement noise. The standard approach in machine learning is to augment the mean squared error objective function in equation 3.4 with a penalizing functional Ω, often given as a quadratic form,

Ω = λ α^⊤ R α,   λ > 0,   (3.23)

with a positive semidefinite matrix R. R is chosen to reflect prior knowledge that can help to discriminate the true signal from the noise. λ controls the trade-off between the fidelity to the data and the penalty term. The resulting Wiener series is given by

Σ_{n=0}^p G_n(x) = Σ_{n=0}^p H_n(x) = y^⊤ (K_p + λR)^{-1} k^(p)(x)   (3.24)
instead of equation 3.16. When choosing R = I_N, one obtains standard ridge regression, which leads to smoother, less noise-sensitive solutions by limiting their RKHS norm. Alternatively, Nowak (1998) suggested selectively penalizing noise-contaminated signal subspaces by a suitable choice of R for the estimation of Volterra series. Regularization also offers possibilities for compensating for some of the difficulties associated with using higher-order polynomials for regression, such as their poor extrapolation capabilities or the notoriously bad conditioning of the Gram matrix. As a result, the prediction performance of polynomials on many standard data sets is worse than that of other function bases such as gaussians. However, by a suitable choice of the regularization matrix R, these problems can be alleviated (Franz, Kwon, Rasmussen, & Schölkopf, 2004). Moreover, there is a close correspondence between
regularized regression in RKHS and gaussian process regression (Wahba, 1990; Schölkopf & Smola, 2002; Rasmussen & Williams, 2006). This correspondence can be used to approximate arbitrary other kernels by polynomial kernels (Gehler & Franz, 2006).

If one is interested in single Wiener operators, the regularized estimation has a decisive disadvantage: the operators computed according to equation 3.19 are no longer orthogonal. However, orthogonality can still be enforced by considering the (smoothed) output of the regularized Wiener system on the training set,

ỹ^⊤ = y^⊤ (K_p + λR)^{-1} K,   (3.25)

as a modified, “noise-corrected” training set for equation 3.19, which becomes

G_n(x) = y^⊤ (K_p + λR)^{-1} K (K_n^{-1} k^(n)(x) − K_{n−1}^{-1} k^(n−1)(x)).   (3.26)
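A minimal sketch of the ridge-regularized estimate in equation 3.24 with R = I_N (the toy system and all names are ours): compared to the unregularized solution of equation 3.16, the ridge solution has a larger training error but a smaller RKHS norm α^⊤ K α.

```python
import numpy as np

rng = np.random.default_rng(3)
N, m, lam = 80, 4, 1.0
X = rng.normal(size=(N, m))
y_clean = (X @ np.array([1.0, -1.0, 0.5, 0.0])) ** 2
y = y_clean + 0.5 * rng.normal(size=N)            # noisy training outputs

K = (1.0 + X @ X.T) ** 2                          # inhomogeneous kernel, p = 2

alpha0 = np.linalg.pinv(K, rcond=1e-10) @ y       # unregularized fit (eq. 3.16)
alpha_r = np.linalg.solve(K + lam * np.eye(N), y) # ridge fit (eq. 3.24, R = I_N)

mse0 = np.mean((K @ alpha0 - y) ** 2)
mse_r = np.mean((K @ alpha_r - y) ** 2)

# the ridge solution trades training fidelity for a smaller RKHS norm
assert mse_r > mse0
assert alpha_r @ K @ alpha_r < alpha0 @ K @ alpha0
```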
The resulting Wiener operators are an orthogonal decomposition of the regularized solution over the training set.

4 Experiments[12]

The principal advantage of our new representation of the Volterra and Wiener series is its capability of implicitly handling systems with high-dimensional input. We will demonstrate this in a reconstruction task of a fifth-order receptive field. Before doing so, we compare the estimation performance of the kernelized technique to previous approaches.

4.1 Comparison to Previous Estimation Techniques. Our first data set comes from a calibration task for a CRT monitor used to display stimuli in psychophysical experiments. The data were generated by displaying a gaussian noise pattern (N(128, 64^2)) on the monitor, which was recorded by a cooled charge-coupled device camera operating in its linear range. The system identification task is to quantify the nonlinear distortion of the screen and the possible interaction with previous pixels on the same scan line. The input data were generated by sliding a window of fixed length m in scanning direction over the lines of the gaussian input pattern; the system output value is the measured monitor brightness at the screen location corresponding to the final pixel of the window. We used three techniques to fit a Wiener model: (1) classical cross-correlation with model orders 1, 2, and 3 and window sizes 1 to 4; (2) direct

[12] Code for the implicit estimation of Volterra series can be found online at http://www.kyb.tuebingen.mpg.de/bs/people/mof/code.
Figure 1 Mean squared error on the test set for varying training set size (horizontal axis: number of training examples; vertical axis: mean squared error on the test set; key: lin. regression, adaptive polynomial, summed polynomial, transl. var. Gaussian). (a) First- (x) and second-order (squares) cross-correlation leads to test errors orders of magnitude higher than the regression techniques (dots). (b) Performance of the tested regression techniques (see the key) for training set size below 75.
linear regression with monomials as basis functions; and (3) kernel regression with the adaptive polynomial kernel 3.11, the inhomogeneous polynomial kernel 3.12, and the infinite Volterra series kernel of equation 3.14. For techniques 2 and 3, we used the standard ridge regularizers R = I_M and R = I_N, respectively. The regularization parameter λ in equation 3.23, the weights a_i in the adaptive polynomial kernel 3.11, the window size m, and the model order p were found by minimizing the analytically computed leave-one-out error (Vapnik, 1982). We varied the number of training examples from 10 to 1000 to characterize the convergence behavior of the different techniques. The independent test set always contained 1000 examples. As the result shows (see Figure 1a), the mean squared error on the test set decreases at a significantly faster rate for the regression methods due to the unfavorable properties of the cross-correlation estimator. In fact, a comparable test error could not be reached even for the maximal training set size of 1000 (not contained in the figure). We display the cross-correlation results only for m = 2 and p = 1, 2, which had the lowest test error. The third-order cross-correlation produced test (and training) errors above 10^5 on this data set. We observe small but significant differences between the tested regression techniques due to the numerical conditioning of the required matrix inversion (see Figure 1b). For training set sizes above 40, the adaptive polynomial kernel performs consistently better since the weights a_i can be adapted to the specific structure of the problem. Interestingly, the infinite Volterra kernel shows a consistently lower performance in spite of the higher approximation capability of its infinite-dimensional RKHS.
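The analytically computed leave-one-out error used for model selection above has, for kernel ridge regression, the standard closed form e_i = α_i / (C^{-1})_{ii} with C = K + λI, so no refitting is required. A sketch on synthetic data (the data and names are ours, not those of the experiment) checking the shortcut against brute-force refits:

```python
import numpy as np

rng = np.random.default_rng(4)
N, m, lam = 40, 3, 0.1
X = rng.normal(size=(N, m))
y = (X @ np.array([1.0, 2.0, -1.0])) ** 2 + 0.1 * rng.normal(size=N)

def k(A, B):
    """Inhomogeneous polynomial kernel of degree p = 2 (eq. 3.12)."""
    return (1.0 + A @ B.T) ** 2

K = k(X, X)
C_inv = np.linalg.inv(K + lam * np.eye(N))
alpha = C_inv @ y

# closed-form leave-one-out residuals: e_i = alpha_i / (C^-1)_ii
loo_fast = alpha / np.diag(C_inv)

# brute force: refit N times, leaving out one example each time
loo_slow = np.empty(N)
for i in range(N):
    idx = np.arange(N) != i
    a = np.linalg.solve(K[np.ix_(idx, idx)] + lam * np.eye(N - 1), y[idx])
    loo_slow[i] = y[i] - k(X[i], X[idx]) @ a

assert np.allclose(loo_fast, loo_slow)
```

Because only one matrix inversion is needed, the leave-one-out criterion can be evaluated cheaply for every candidate λ, kernel weight set, window size, and model order.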
Figure 2 (Left) 16 × 16 linear kernel of the test system. (Right) Reconstructed linear kernel from the fifth-order Volterra kernel by computing a preimage (after 2500 samples).
4.2 Reconstruction of a Fifth-Order Linear-Nonlinear Cascade. This experiment demonstrates the applicability of the proposed method to high-dimensional input. Our example is the fifth-order LN cascade system y = (Σ_{k,l=1}^{16} h_{kl} x_{kl})^5 that acts on 16 × 16 image patches by convolving them with a linear kernel h_{kl} of the same size, shown in Figure 2 (left), before the nonlinearity is applied. We generated 2500 image patches containing uniformly distributed white noise and computed the corresponding system output, to which we added 10% gaussian measurement noise. The resulting data were used to estimate the implicit Wiener expansion using the inhomogeneous polynomial kernel 3.12. In classical cross-correlation and linear regression, this would require the computation of roughly 9.5 billion independent terms for the fifth-order Wiener kernel. Moreover, even for much lower-dimensional problems, it usually takes tens of thousands of samples until a sufficient convergence of the cross-correlation technique is reached.

Even if all entries of the fifth-order Wiener kernel were known, it would still be hard to interpret the result in terms of its effect on the input signal. The implicit representation of the Volterra series allows for the use of preimage techniques (e.g., Schölkopf & Smola, 2002), where one tries to choose a point z in the input space such that the nonlinearly mapped image Φ(z) is as close as possible to the representation in the RKHS. In the case of the fifth-order Wiener kernel, this amounts to representing H_5[x] by the operator (z^⊤ x)^5 with an appropriately chosen preimage z ∈ R^256. The nonlinear map z → z^5 is invertible, so that we can use the direct technique described in Schölkopf and Smola (2002) where one applies the implicitly given Volterra operator from equation 3.17 to each of the canonical base vectors of R^256, resulting in a 256-dimensional response vector e. The preimage is obtained as the elementwise fifth root z = e^{1/5}.
The result in Figure 2 (right) demonstrates that the original linear kernel is already recognizable after using 2500 samples. The example shows that preimage techniques are capable of revealing the input structures to which the Volterra operator is tuned, similar to the classical analysis techniques in linear systems.
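The preimage construction can be sketched on a small noise-free analog of this experiment (3-dimensional input, a fixed linear kernel h, and all names ours): fit the implicit expansion with the homogeneous fifth-order kernel, apply the operator of equation 3.17 to the canonical basis vectors, and take the elementwise signed fifth root.

```python
import numpy as np

rng = np.random.default_rng(5)
m, N = 3, 100
h = np.array([0.8, -0.5, 1.2])            # linear kernel of the LN cascade
X = rng.uniform(-1, 1, size=(N, m))       # uniform white-noise input patches
y = (X @ h) ** 5                          # fifth-order LN system, noise-free here

# implicit fit with the homogeneous fifth-order kernel k_5(x, x') = (x.x')^5
K = (X @ X.T) ** 5
alpha = np.linalg.pinv(K, rcond=1e-10) @ y

# eq. 3.17: apply the implicit operator H_5 to the canonical basis vectors
e = np.array([alpha @ (X @ basis) ** 5 for basis in np.eye(m)])

# elementwise signed fifth root recovers the preimage z ~ h
z = np.sign(e) * np.abs(e) ** (1.0 / 5.0)
assert np.allclose(z, h, atol=1e-3)
```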
Figure 3 Representation of a Volterra or Wiener system by (a) a cascade of a linear system (preimage) and a static nonlinearity f(x) (e.g., (1 + x)^p or e^x, depending on the choice of the kernel) and (b) a set of several parallel cascades (reduced set).
5 Conclusion

We have presented a unifying view of the traditional Wiener and Volterra theory of nonlinear systems and newer developments from the field of kernel methods. We have shown that all properties of discrete Volterra and Wiener theory are preserved by using polynomial kernels in a regularized regression framework. The benefits of the new kernelized representation can be summarized as follows:

1. The implicit estimation of the Wiener and Volterra series allows for system identification with high-dimensional input signals. Essentially, this is due to the representer theorem: although a higher-order series expansion contains a huge number of coefficients, it turns out that when estimating such a series from a finite sample, the information in the coefficients can be represented more parsimoniously using an example-based implicit representation.

2. The complexity of the estimation process is independent of the order of nonlinearity. Even infinite Volterra series expansions can be estimated.

3. Regularization techniques can be naturally included in the regression framework to accommodate measurement noise in the system outputs. As we have shown, one can still extract the corresponding Wiener operators from the regularized kernel solution while preserving their orthogonality with respect to the input. The analysis of a system in terms of subsystems of different orders of nonlinearity can thus be extended to noisy signals.

4. Preimage techniques reveal input structures to which Wiener or Volterra operators are tuned. These techniques try to represent the
system by a cascade consisting of a linear filter followed by a static nonlinearity (see Figure 3a).

5. As in standard linear regression, the method also works for nongaussian input. At the same time, convergence is considerably faster than in the classical cross-correlation procedure because the estimation is done directly on the data. Both regression methods omit the intermediate step of estimating cross-correlations, which converges very slowly.

Although we obtained useful experimental results for problems where the number of terms in the Wiener expansion largely exceeds the number of training examples, this will not always be feasible, for example, in cases where the Volterra kernels cannot be approximated by smooth functions. In this sense, we cannot circumvent the curse of dimensionality. However, if the system to be identified has a suitably smooth underlying structure, the proposed technique (in particular, the regularized variant) can take advantage of it.

The preimage method in our experiment works only for Volterra kernels of odd order. More general techniques exist (Schölkopf & Smola, 2002), including the case of other kernels and the computation of approximations in terms of parallel cascades of preimages and nonlinearities (reduced sets; cf. Figure 3b). In the case of a second-order system, the reduced set corresponds to an invariant subspace of the Volterra operator (cf. Hyvärinen & Hoyer, 2000). It can be shown that the entire class of discrete Volterra systems can be approximated by cascades where the nonlinearities are polynomials of sufficient degree (Korenberg, 1983) and that any doubly finite discrete Volterra series can be exactly represented by a finite sum of such cascades (Korenberg, 1991). There also exists an iterative technique for directly fitting such cascade expansions to the training data by choosing the preimages from a fixed set of candidates generated from the data and adapting the nonlinearity (Korenberg, 1991).
Each iteration requires only the inversion of a ( p + 1) × ( p + 1) matrix ( p being the degree of nonlinearity); thus, convergence can be very fast depending on the training data. The resulting parallel cascades will generally be different from reduced-set expansions, which have a fixed number of elements and a prescribed nonlinearity defined by the kernel. However, both cascade representations can be converted into their corresponding Volterra series expansions, which makes a comparison of the results possible. Having seen that Volterra and Wiener theory can be treated just as a special case of a kernel regression framework, one could argue that this theory is obsolete in modern signal analysis. This view is supported by the fact that on many standard data sets for regression, polynomial kernels are outperformed by other kernels, such as the gaussian kernel. So why do we not replace the polynomial kernel by some more capable kernel and forget about Wiener and Volterra theory altogether? There are at least two
arguments against this point of view. First, our study has shown that in contrast to other kernels, polynomial kernel solutions can be directly transformed into their corresponding Wiener or Volterra representation. Many entries of the Volterra kernels have a direct interpretation in signal processing applications (examples in Mathews & Sicuranza, 2000). This interpretability is lost when other kernels are used. Second, Wiener expansions decompose a signal according to the order of interaction of its input elements. In some applications, it is important to know how many input elements interact in the creation of the observed signals, such as in the analysis of higher-order statistical properties (an example on higher-order ¨ image analysis can be found in Franz & Scholkopf, 2005). Acknowledgments The ideas presented in this article have greatly profited from discussions with P. V. Gehler, G. Bakır, and M. Kuss. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. The publication reflects only the authors’ views. References Ahmed, N. U. (1970). Closure and completeness of Wiener’s orthogonal set G n in the class L 2 (, B, µ) and its application to stochastic hereditary differential systems. Information and Control, 17, 161–174. Alper, A. (1965). A consideration of the discrete Volterra series. IEEE Trans. Autom. Control, AC-10(3), 322–327. Barrett, J. F. (1963). The use of functionals in the analysis of non-linear physical systems. J. Electron. Control, 15, 567–615. Boyd, S., & Chua, L. O. (1985). Fading memory and the problem of approximating nonlinear operators with Volterra series. IEEE Trans. Circuits. Syst., CAS-32, 1150– 1161. Brilliant, M. B. (1958). Theory of the analysis of nonlinear systems (RLE Tech. Rep. No. 345). Cambridge, MA: MIT. Dodd, T. J., & Harrison, R. F. (2002). A new solution to Volterra series estimation. In CD-ROM Proc. 2002 IFAC World Congress. 
Oxford: Elsevier.
Franz, M. O., Kwon, Y., Rasmussen, C. E., & Schölkopf, B. (2004). Semi-supervised kernel regression using whitened function classes. In C. E. Rasmussen, H. H. Bülthoff, M. A. Giese, & B. Schölkopf (Eds.), Pattern recognition: Proc. 26th DAGM Symposium (pp. 18–26). Berlin: Springer.
Franz, M. O., & Schölkopf, B. (2004). Implicit estimation of Wiener series. In A. Barros, J. Principe, J. Larsen, T. Adali, & S. Douglas (Eds.), Proc. 2004 IEEE Signal Processing Society Workshop (pp. 735–744). New York: IEEE.
Franz, M. O., & Schölkopf, B. (2005). Implicit Wiener series for higher-order image analysis. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 465–472). Cambridge, MA: MIT Press.
Wiener and Volterra Theory and Polynomial Kernels
Fréchet, M. (1910). Sur les fonctionelles continues. Annales Scientifiques de l'École Normale Supérieure, 27, 193–216.
Gehler, P. V., & Franz, M. O. (2006). Implicit Wiener series. Part II: Regularised estimation (MPI Tech. Rep. No. 148). Tübingen, Germany: Max Planck Institute for Biological Cybernetics.
Giannakis, G. B., & Serpedin, E. (2001). A bibliography on nonlinear system identification. Signal Processing, 81, 533–580.
Hille, E., & Phillips, R. S. (1957). Functional analysis and semi-groups. Providence, RI: American Mathematical Society.
Hyvärinen, A., & Hoyer, P. (2000). Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12, 1705–1720.
Korenberg, M. J. (1983). Statistical identification of parallel cascades of linear and nonlinear systems. In Proc. IFAC Symp. Identification and System Parameter Estimation (pp. 669–674).
Korenberg, M. J. (1991). Parallel cascade identification and kernel estimation for nonlinear systems. Ann. Biomed. Eng., 19(4), 429–455.
Korenberg, M. J., Bruder, S. B., & McIlroy, P. J. (1988). Exact orthogonal kernel estimation from finite data records: Extending Wiener's identification of nonlinear systems. Ann. Biomed. Eng., 16(2), 201–214.
Korenberg, M. J., & Hunter, I. W. (1990). The identification of nonlinear biological systems: Wiener kernel approaches. Ann. Biomed. Eng., 18, 629–654.
Lee, Y. W., & Schetzen, M. (1965). Measurement of the Wiener kernels of a non-linear system by crosscorrelation. Intern. J. Control, 2, 237–254.
Liusternik, L., & Sobolev, V. (1961). Elements of functional analysis. New York: Unger.
Mathews, V. J., & Sicuranza, G. L. (2000). Polynomial signal processing. New York: Wiley.
Nowak, R. (1998). Penalized least squares estimation of Volterra filters and higher order statistics. IEEE Trans. Signal Proc., 46(2), 419–428.
Ogura, H. (1972). Orthogonal functionals of the Poisson process. IEEE Trans. Inf.
Theory, 18(4), 473–481.
Palm, G. (1978). On representation and approximation of nonlinear systems. Biol. Cybern., 31, 119–124.
Palm, G., & Poggio, T. (1977). The Volterra representation and the Wiener expansion: Validity and pitfalls. SIAM J. Appl. Math., 33(2), 195–216.
Palm, G., & Poggio, T. (1978). Stochastic identification methods for nonlinear systems: An extension of Wiener theory. SIAM J. Appl. Math., 34(3), 524–534.
Papoulis, A. (1991). Probability, random variables and stochastic processes. New York: McGraw-Hill.
Poggio, T. (1975). On optimal nonlinear associative recall. Biol. Cybern., 19, 201–209.
Prenter, P. M. (1970). A Weierstrass theorem for real, separable Hilbert spaces. J. Approx. Theory, 3, 341–351.
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Cambridge, MA: MIT Press.
Rugh, W. J. (1981). Nonlinear system theory. Baltimore: Johns Hopkins University Press.
Saunders, C., Stitson, M. O., Weston, J., Bottou, L., Schölkopf, B., & Smola, A. (1998). Support vector machine—reference manual (Tech. Rep.). Egham, UK: Department of Computer Science, Royal Holloway, University of London.
Schetzen, M. (1965). A theory of nonlinear system identification. Intl. J. Control, 20(4), 577–592.
Schetzen, M. (1980). The Volterra and Wiener theories of nonlinear systems. Malabar, FL: Krieger.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Segall, A., & Kailath, T. (1976). Orthogonal functionals of independent-increment processes. IEEE Trans. Inf. Theory, 22(3), 287–298.
Steinwart, I. (2001). On the influence of the kernel on the consistency of support vector machines. JMLR, 2, 67–93.
Vapnik, V. (1982). Estimation of dependencies based on empirical data. New York: Springer.
Volterra, V. (1887). Sopra le funzioni che dipendono de altre funzioni. In Rend. R. Academia dei Lincei 2° Sem., pp. 97–105, 141–146, 153–158.
Volterra, V. (1959). Theory of functionals and of integral and integro-differential equations. New York: Dover.
Wahba, G. (1990). Spline models for observational data. Philadelphia: Society for Industrial and Applied Mathematics.
Wiener, N. (1958). Nonlinear problems in random theory. New York: Wiley.
Wray, J., & Green, G. G. R. (1994). Calculation of the Volterra kernels of non-linear dynamic systems using an artificial neural network. Biol. Cybern., 71, 187–195.
Received November 1, 2005; accepted April 5, 2006.
LETTER
Communicated by Michael Schmitt
An Upper Bound on the Minimum Number of Monomials Required to Separate Dichotomies of {−1, 1}^n

Erhan Oztop
[email protected]
JST-ICORP Computational Brain Project and ATR Computational Neuroscience Laboratory, 2-2-2 Hikari-dai, Soraku-gun, Kyoto 619-0288, Japan
It is known that any dichotomy of {−1, 1}^n can be learned (separated) with a higher-order neuron (polynomial function) with 2^n inputs (monomials). In general, fewer than 2^n monomials suffice to solve a given dichotomy. In spite of the efforts to develop algorithms for finding solutions with fewer monomials, there have been relatively few studies investigating the maximum density Λ(n), the minimum number of monomials that would suffice to separate an arbitrary dichotomy of {−1, 1}^n. This article derives a theoretical (upper) bound for this quantity, superseding previously known bounds. The main theorem here states that for any binary classification problem in {−1, 1}^n (n > 1), one can always find a polynomial function solution with 2^n − 2^n/4 or fewer monomials. In particular, any dichotomy of {−1, 1}^n can be learned by a higher-order neuron with a fan-in of 2^n − 2^n/4 or less. With this result, for the first time, a deterministic ratio bound independent of n is established: Λ(n)/2^n ≤ 0.75. The main theorem is constructive, so it provides a deterministic algorithm for achieving the theoretical result. The study presented provides the basic mathematical tools and forms the basis for further analyses that may have implications for neural computation mechanisms employed in the cerebral cortex.
Neural Computation 18, 3119–3138 (2006). © 2006 Massachusetts Institute of Technology

1 Introduction

Higher-order neurons (units), or sigma-pi units, are computationally powerful extensions of linear neuron models (Rumelhart, Hinton, & Williams, 1986; Giles & Maxwell, 1987; Schmitt, 2005). These units capture the nonlinearity of the input-output relation through products of input variables called monomials. The net input to a higher-order unit is the sum of the monomials weighted by adjustable parameters. The output is obtained by applying a predefined activation function (e.g., a sigmoidal function or a threshold function) to the net input. When the output is a threshold function, these units are sometimes called polynomial threshold units. Accumulating biological data suggest that specific neurons in the cerebral cortex compute in a multiplicative way (see Schmitt, 2002). Therefore, higher-order units whose monomials capture the nonlinear dendritic information processing can be considered better models for real neurons than the McCulloch-Pitts model (Mel & Koch, 1990; Mel, 1994). It is well known that the use of higher-order units increases the computational power and storage capacities of neural networks (Schmitt, 2002); however, the combinatorial growth in the number of monomials required for a given problem limits their application. Some work has been devoted to developing algorithms for finding a reduced set of monomials that realizes a given classification problem without suffering from the combinatorial growth problem (e.g., Ghosh & Shin, 1992; Guler, 2001). Theoretical results concerning bounds for the Vapnik-Chervonenkis dimension of higher-order neurons have also been obtained (Schmitt, 2002, 2005). More relevant to this article is the study of the so-called polynomial threshold density of Boolean functions (i.e., dichotomies) (see, e.g., Saks, 1993), which indicates the minimum number of monomials over all the polynomial functions that solve a given classification problem in {−1, 1}^n. It is of both practical and theoretical importance to determine the maximum density Λ(n), the minimum number of monomials with which one can always separate any dichotomy of {−1, 1}^n. The spectral theory of Boolean functions has produced important results and elegant methods for deriving bounds on the maximum density Λ(n). The best-known lower bound is 0.11 × 2^n (see Saks, 1993; O'Donnell & Servedio, 2003). The upper bound is due to Gotsman (1989), who proved that the maximum density is at most 2^n − √(2^n) + 1. These bounds tell us that every dichotomy of {−1, 1}^n can be separated with 2^n − √(2^n) + 1 or fewer monomials, and that there are dichotomies that cannot be separated with fewer than 0.11 × 2^n monomials.
In fact, it is known that there exist dichotomies that cannot be separated with a polynomially (in n) bounded number of monomials (Bruck, 1990). Recently, O'Donnell and Servedio (2003) improved the upper bound asymptotically to 2^n − 2^n/O(n). This article further improves the latter bound by proving Λ(n) ≤ 2^n − 2^n/4, thereby establishing, for the first time, a deterministic ratio bound independent of n: Λ(n)/2^n ≤ 0.75.

1.1 A Motivating Example. Consider the dichotomy (fully specified binary classification problem) given in Table 1. For simplicity we use −1 and +1 for the class labels. A solution to this problem would be a polynomial in x0, x1, and x2 with no powers greater than 1 (higher powers are not needed since x_k^2 = 1) such that the sign of the polynomial function evaluated at each (x0, x1, x2) picked from the rows of Table 1 coincides with the class label given in that row. There are infinitely many such polynomials. Here are some examples:
Table 1: Class Assignment Table for the Example of Section 1.1.

x0   x1   x2   Class   p1   p2   p3
 1    1    1    −1     −4   −3   −1
−1    1    1     1      4    3    3
 1   −1    1     1      4    1    3
−1   −1    1    −1     −4   −1   −5
 1    1   −1    −1     −4   −1   −3
−1    1   −1     1      4    5    1
 1   −1   −1     1      4    3    1
−1   −1   −1     1      4    1    1

Notes: The last three columns list the outputs of the polynomials. The signs of the values in these columns match the class labels, which indicates that all three polynomials are solutions to the given problem.
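These sign checks can be reproduced mechanically. A minimal sketch in plain Python (an illustration, not part of the original letter), using the three example polynomials p1, p2, p3 of this section:

```python
# The three example polynomials of section 1.1, as functions of
# inputs (x0, x1, x2) taken from {-1, +1}.
def p1(x0, x1, x2):
    return -x2*x1*x0 + x2*x1 + x2*x0 - x2 - 3*x1*x0 - x1 - x0 + 1

def p2(x0, x1, x2):
    return -x2 - 2*x1*x0 - x0 + 1

def p3(x0, x1, x2):
    return -x2*x1*x0 + x2*x1 + x2*x0 - 2*x1*x0

# Rows of Table 1 (x0 varies fastest, starting from (1, 1, 1))
# and the class labels in the same order.
rows = [(x0, x1, x2) for x2 in (1, -1) for x1 in (1, -1) for x0 in (1, -1)]
labels = [-1, 1, 1, -1, -1, 1, 1, 1]

# A polynomial solves the dichotomy iff sign(p(row)) == label on every row,
# i.e., label * p(row) > 0 throughout.
for p in (p1, p2, p3):
    assert all(cls * p(*row) > 0 for row, cls in zip(rows, labels))
print("p1, p2, p3 all separate the dichotomy of Table 1")
```

Each assertion passes, confirming the sign pattern recorded in the last three columns of the table.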
p1 = −x2 x1 x0 + x2 x1 + x2 x0 − x2 − 3 x1 x0 − x1 − x0 + 1
p2 = −x2 − 2 x1 x0 − x0 + 1
p3 = −x2 x1 x0 + x2 x1 + x2 x0 − 2 x1 x0.

It can be verified that these are solutions to the dichotomy (see the last three columns of Table 1). Note that the polynomial p1 contains eight monomials, whereas p2 and p3 contain four monomials each. One wonders whether it is possible to find a solution with fewer terms (monomials). The study presented in this article is motivated by this question. More generally, we pursue an answer to the question, "Can we find a general upper bound on the minimum number of monomials with which one can separate any dichotomy of {−1, 1}^n?" The next section presents the definitions needed for the derivations leading to an (affirmative) answer to this question.

2 Definitions

Definition 1. A binary classification problem C_n = (S_n^+, S_n^−) in {−1, 1}^n is defined by two disjoint sets of input vectors S_n^+ ⊂ {−1, 1}^n and S_n^− ⊂ {−1, 1}^n. We use Ω_n to represent the collection of all dichotomies (i.e., fully specified binary classification problems) in {−1, 1}^n. Note that |Ω_n| = 2^(2^n). When it is clear from the context, the subscript n may be suppressed.

Definition 2. A polynomial function (of dimension n) is a polynomial over the field of real numbers interpreted as a function on {−1, 1}^n. We represent the set of polynomial functions of dimension n by P_n = {p(x) ∈ ℝ[x] | p(x): {−1, 1}^n → ℝ}.
Definition 3. Any polynomial function p(x0, x1, ..., x_{n−1}), when restricted to {−1, 1}^n, can be written as Σ_{i=1}^{2^n} a_i Π_{k∈S_i} x_k (i runs through all the subsets S_i ⊂ {0, 1, ..., n − 1}), since x_k^2 = 1. Note that we have defined Π_{k∈{}} x_k = 1. The terms in the expression of p(x0, x1, ..., x_{n−1}) without the leading coefficients are called monomials (i.e., Π_{k∈S_i} x_k for some i). The set of monomials that can be generated using x0, x1, ..., x_{n−1} is denoted by M_n. Formally, M_n = {Π_{k∈S_i} x_k : S_i ⊂ {0, 1, ..., n − 1}}. Thus, |M_n| = 2^n.

Definition 4. We define ψ(p): P_n → {0, 1, ..., 2^n} as the number of monomials contained in the polynomial function p. We also extend the number-of-monomials function to operate on sets of polynomial functions, so that ψ(Q) is the set of nonnegative integers that are the numbers of monomials contained in the polynomial functions in Q ⊂ P_n.

Definition 5. Given a binary classification problem C = (S_n^+, S_n^−), a solution is a function f(x): {−1, 1}^n → ℝ such that f(x) > 0 whenever x ∈ S^+ and f(x) < 0 whenever x ∈ S^−. Then we say f solves C. When a polynomial function p solves a binary classification problem C, we say that the monomials of p solve C, as well as that p solves C. Furthermore, the problem C is said to have a solution with ψ(p) monomials.

Definition 6. Given a binary classification problem C = (S_n^+, S_n^−), the solution set is the collection of polynomial functions that solve C. We define Φ_n^(S+,S−) ⊂ P_n to be the set of all polynomial function solutions to the classification problem (S_n^+, S_n^−). We also write Φ_n^Y when the label Y uniquely identifies the classification problem under consideration.

Definition 7. Given a binary classification problem C = (S_n^+, S_n^−), the density of C = (S_n^+, S_n^−) is defined to be the minimum element of ψ(Φ_n^(S+,S−)).

Definition 8. The minimum number of monomials that suffices to separate any dichotomy of {−1, 1}^n is defined to be the maximum density associated with {−1, 1}^n (or n). Formally, we have

\[
\Lambda(n) = \max_{C \in \Omega_n} \min \psi\bigl(\Phi_n^C\bigr). \tag{2.1}
\]

The goal of this letter is to advance our understanding of Λ(n), the maximum over the densities of the dichotomies of {−1, 1}^n.

3 Polynomial (Spectral) Representation of Dichotomies of {−1, 1}^n

The polynomial/spectral representation of Boolean functions (i.e., dichotomies of {−1, 1}^n) and the standard results are covered in Saks (1993)
and Siu, Roychowdhury, and Kailath (1995). Here we present only the necessary results, without proofs. A dichotomy of {−1, 1}^n, being equivalent to a Boolean function f: {−1, 1}^n → {−1, 1}, can be represented as a vector in {−1, 1}^(2^n) by adopting a fixed ordering over the assignment vectors. Moreover, f has a unique representation as the weighted sum of monomials with coefficients a = (a_1, a_2, ..., a_{2^n})^T ∈ ℝ^(2^n), called the spectral coefficients:

\[
f(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{2^n} a_i \prod_{k \in S_i} x_k, \quad \text{where } S_i \subset \{0, 1, \ldots, n-1\}. \tag{3.1}
\]

Noting that each monomial is also a Boolean function, we switch to vector notation and write f = D^n a, where the columns of D^n are the vector representations of the monomials. With appropriate ordering of the monomials¹ and assignment vectors,² D^n becomes a so-called Sylvester-type Hadamard matrix (Bruck, 1990; Siu et al., 1995), which has the following properties.

Lemma 1. D^n satisfies the recursive relation

\[
D^1 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad
D^{n+1} = \begin{pmatrix} D^n & D^n \\ D^n & -D^n \end{pmatrix} \quad \text{for } n > 0.
\]

Lemma 2. D^n is symmetric.

Lemma 3. D^n D^n = 2^n I.

Corollary 1. The inverse of D^n is (D^n)^{−1} = 2^{−n} D^n.

Corollary 2. The matrix D̂^n = 2^{−n/2} D^n is orthogonal.
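Lemmas 1 through 3 are easy to check numerically. A sketch in plain Python (an illustration, not part of the original letter); `hadamard` implements the recursion of lemma 1:

```python
def hadamard(n):
    """Sylvester-type Hadamard matrix D^n, built by the recursion of lemma 1."""
    D = [[1, 1], [1, -1]]  # D^1
    for _ in range(n - 1):
        # D^{n+1} = [[D^n, D^n], [D^n, -D^n]]
        D = [row + row for row in D] + [row + [-v for v in row] for row in D]
    return D

n = 3
D = hadamard(n)
size = 2 ** n

# Lemma 2: D^n is symmetric.
assert all(D[i][j] == D[j][i] for i in range(size) for j in range(size))

# Lemma 3: D^n D^n = 2^n I (hence corollary 1: (D^n)^{-1} = 2^{-n} D^n).
for i in range(size):
    for j in range(size):
        dot = sum(D[i][k] * D[k][j] for k in range(size))
        assert dot == (size if i == j else 0)
print("lemmas 2 and 3 hold for n =", n)
```

The same checks pass for any n, since the recursion preserves both symmetry and orthogonality of the rows.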
¹ Monomials are ordered as 1, x0, x1, x1 x0, x2, x2 x0, x2 x1, x2 x1 x0, ..., x_{n−1} ··· x1 x0.
² Assignments to (x0, x1, x2, ..., x_{n−1}) are ordered as (0's represent 1's and 1's represent −1's): 000...0, 100...0, 010...0, 110...0, 001...0, ..., 111...1.

4 The Set of Solving Polynomial Functions

Definition 9 (standard form). Assume p ∈ P_n with p(x) = a_1 + a_2 x0 + a_3 x1 + a_4 x1 x0 + ··· + a_{2^n} x_{n−1} x_{n−2} ··· x0 solves a classification problem C = (S^+, S^−). Further assume C is a dichotomy. Let a = (a_1, a_2, ..., a_{2^n})^T ∈ ℝ^(2^n). Since p(S^+) > 0 and p(S^−) < 0, C partitions the rows of D^n into D^n_+ and D^n_−
with D^n_+ a > 0 and D^n_− a < 0. Defining Y = diag([y_1 y_2 ··· y_{2^n}]), where

\[
y_i = \begin{cases} -1 & \text{if the } i\text{th assignment} \in S^-, \\ +1 & \text{if the } i\text{th assignment} \in S^+, \end{cases}
\]

the problem can be written as YD^n a > 0. Then a solution a to this inequality system provides the coefficients of a polynomial function that is a solution to the classification problem C = (S^+, S^−). We call this representation the standard form.

Assume we are given a problem in the standard form YD^n a > 0. Then for a solution a, there exists a positive k = (k_1, k_2, ..., k_{2^n})^T > 0 such that YD^n a = k. Noting that Y^{−1} = Y and (D^n)^{−1} = 2^{−n} D^n (see corollary 1), we can solve for the coefficients of p as a = 2^{−n} D^n Y k. Thus, a solution to the problem C = (S^+, S^−) is a positive linear combination of the columns of D^n Y (or rows of YD^n). Conversely, assume b is a positive combination of the columns of D^n Y, so that b = D^n Y k for some k > 0. Then YD^n b = 2^n k > 0, implying that q(x0, x1, ..., x_{n−1}) = b_1 + b_2 x0 + b_3 x1 + b_4 x1 x0 + ··· + b_{2^n} x_{n−1} x_{n−2} ··· x0 ∈ P_n is a solution to C = (S^+, S^−). So we have established the following result:

Theorem 1. Given a dichotomy of dimension n in the standard form YD^n a > 0, the set of solutions Φ_n^Y ⊂ P_n is exactly the set of polynomial functions with coefficients taken from the interior of the cone defined by the rows of YD^n. In short, we write Φ_n^Y = int cone(YD^n).

5 Main Theorem: An Upper Bound for the Minimum Number of Monomials

Theorem 2 (main theorem). For any binary classification problem in {−1, 1}^n, n > 1, there always exists a polynomial function solution with 2^n − 2^n/4 or fewer monomials. Equivalently, the maximum density over all the n-dimensional Boolean functions is bounded (from above) by 2^n − 2^n/4. Formally,

\[
\max_{C \in \Omega_n} \min \psi\bigl(\Phi_n^C\bigr) = \Lambda(n) \le 2^n - 2^n/4. \tag{5.1}
\]
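The standard-form algebra behind theorem 1 can be sanity-checked numerically: for any dichotomy y and any positive k, the coefficient vector a = 2^{−n} D^n Y k must satisfy diag(y) D^n a = k > 0. A sketch (illustrative, not from the letter; `hadamard` rebuilds D^n by the recursion of lemma 1, and exact rational arithmetic avoids rounding):

```python
import random
from fractions import Fraction

def hadamard(n):
    D = [[1, 1], [1, -1]]
    for _ in range(n - 1):
        D = [r + r for r in D] + [r + [-v for v in r] for r in D]
    return D

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

n = 3
size = 2 ** n
D = hadamard(n)
random.seed(0)
y = [random.choice((-1, 1)) for _ in range(size)]          # an arbitrary dichotomy
k = [Fraction(random.randint(1, 9)) for _ in range(size)]  # any positive k

# a = 2^{-n} D^n Y k  (corollary 1 supplies the inverse of D^n).
Yk = [yi * ki for yi, ki in zip(y, k)]
a = [v / size for v in matvec(D, Yk)]

# Check the standard form: diag(y) D^n a recovers k exactly, so it is > 0.
lhs = [yi * v for yi, v in zip(y, matvec(D, a))]
assert lhs == k
print("diag(y) D^n a = k > 0 holds")
```

Since Y² = I and D^n D^n = 2^n I, the identity diag(y) D^n a = k holds for every choice of y and k > 0, which is exactly the cone characterization of theorem 1.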
Proof. We prove the theorem by constructing a polynomial function solution with 2^n − 2^n/4 or fewer monomials for an arbitrary dichotomy of {−1, 1}^n.
Any dichotomy of {−1, 1}^n is characterized by the standard form diag(y) D^n z > 0 for some y ∈ {−1, 1}^(2^n). A solution vector z gives the monomial coefficients of the solving polynomial function. Thus, we have to prove that a solution to the inequality system exists with at least one-fourth of the components of z equal to zero. Using lemma 1, write D^n in terms of D^{n−1}, and partition y and z into two halves:

\[
\operatorname{diag}(y)\, D^n z > 0 \;\Leftrightarrow\;
\operatorname{diag}\begin{pmatrix} y^u \\ y^d \end{pmatrix}
\begin{pmatrix} D^{n-1} & D^{n-1} \\ D^{n-1} & -D^{n-1} \end{pmatrix}
\begin{pmatrix} w \\ t \end{pmatrix} > 0. \tag{5.2}
\]

Further, write the submatrices D^{n−1} as row vectors:

\[
\operatorname{diag}\begin{pmatrix} y^u \\ y^d \end{pmatrix}
\begin{pmatrix}
d_1 & d_1 \\ \vdots & \vdots \\ d_{2^{n-1}} & d_{2^{n-1}} \\
d_1 & -d_1 \\ \vdots & \vdots \\ d_{2^{n-1}} & -d_{2^{n-1}}
\end{pmatrix}
\begin{pmatrix} w \\ t \end{pmatrix} > 0. \tag{5.3}
\]

By expanding (5.3) we see that for each i ∈ {1, ..., 2^{n−1}} we have

\[
y_i^u (d_i w + d_i t) > 0 \quad \text{and} \quad y_i^d (d_i w - d_i t) > 0. \tag{5.4}
\]

Depending on the values of the components of y, we have four cases; we first look at the following two:

\[
\begin{aligned}
(y_i^u, y_i^d) = (+1, +1) &\;\Rightarrow\; d_i w > d_i t > -d_i w \\
(y_k^u, y_k^d) = (-1, -1) &\;\Rightarrow\; -d_k w > d_k t > d_k w
\;\Rightarrow\; (-d_k) w > (-d_k) t > -(-d_k) w.
\end{aligned} \tag{5.5}
\]

We construct the r(F) × 2^{n−1} matrix F using the rows that satisfy equation 5.5, namely, with d_i and −d_k. Then equation 5.5 can be compactly expressed as Fw > Ft > −Fw. Note we allow r(F) = 0, meaning that F is the empty matrix (see the remark below). Next, consider the remaining two cases:

\[
\begin{aligned}
(y_i^u, y_i^d) = (+1, -1) &\;\Rightarrow\; d_i t > d_i w > -d_i t \\
(y_k^u, y_k^d) = (-1, +1) &\;\Rightarrow\; -d_k t > d_k w > d_k t
\;\Rightarrow\; (-d_k) t > (-d_k) w > -(-d_k) t.
\end{aligned} \tag{5.6}
\]
Similarly, we construct the matrix G of size r(G) × 2^{n−1} using the rows that satisfy equation 5.6, namely, with d_i and −d_k. Then equation 5.6 can be compactly stated as Gt > Gw > −Gt. Note we allow r(G) = 0, meaning that G is the empty matrix (see the remark below). A simple but useful identity regarding the sizes of F and G is

\[
r(F) + r(G) = 2^{n-1}. \tag{5.7}
\]

Thus, we have obtained two coupled systems of inequalities in dimension 2^{n−1} (if r(F) = 0 or r(G) = 0 there will be a single inequality system):

\[
Fw > Ft > -Fw \quad \text{and} \quad Gt > Gw > -Gt. \tag{5.8}
\]

Note that the systems are satisfied for all w ∈ int cone(F) and t ∈ int cone(G) because the rows of F and G are mutually orthogonal. Our goal is to find a solution vector z = [w, t] with as many zero components as possible. We now show that equation 5.8 enables us to derive a lower bound for the number of zeros we can obtain in z = [w, t].

Remark. If r(F) = 0, the theorem's claim is readily satisfied by taking w = 0 and t ∈ int cone(G). The same is true for r(G) = 0 with the choice of t = 0 and w ∈ int cone(F). In this case, the problem is equivalent to a problem in one lower dimension and admits a solution with at most 2^{n−1} monomials.

Now write equation 5.8 in terms of (w − t) and (w + t) to get

\[
F(w - t) > 0 \qquad F(w + t) > 0 \qquad -G(w - t) > 0 \qquad G(w + t) > 0; \tag{5.9}
\]

equivalently,

\[
\begin{pmatrix} F \\ -G \end{pmatrix}(w - t) > 0, \qquad
\begin{pmatrix} F \\ G \end{pmatrix}(w + t) > 0. \tag{5.10}
\]

Notice that we have decomposed the original problem into two subproblems of the standard form diag(y′) D^{n−1} z′ > 0 for some y′. Therefore, by theorem 1, the vectors (w − t)^T and (w + t)^T must belong to the interiors of the cones spanned by the rows of {F, −G} and {F, G}, respectively. Therefore, all the solutions to equation 5.9 are characterized by arbitrary positive real
row vectors α, α′, γ, γ′:

\[
\begin{aligned}
(w - t)^T &= 2\alpha F - 2\gamma G, & \alpha, \gamma > 0, \\
(w + t)^T &= 2\alpha' F + 2\gamma' G, & \alpha', \gamma' > 0,
\end{aligned} \tag{5.11}
\]

which easily yield expressions for w and t:

\[
\begin{aligned}
w^T &= (\alpha + \alpha') F + (-\gamma + \gamma') G \\
t^T &= (-\alpha + \alpha') F + (\gamma + \gamma') G
\end{aligned}
\qquad \alpha, \alpha', \gamma, \gamma' > 0. \tag{5.12}
\]

Now either r(G) ≥ r(F) or r(F) > r(G). Assume the former and choose α, α′ > 0 arbitrarily to get

\[
w^T = (w_1, w_2, \ldots, w_{2^{n-1}}) + (-\gamma + \gamma') G, \qquad \gamma, \gamma' > 0. \tag{5.13}
\]

Now consider G̃, the reduced row-echelon form of G: since the rows of G are orthogonal, G̃ has no zero rows. So we can define i_c to be the column index of the leading nonzero element of G̃_i, the ith row of G̃. Consider the row vector

\[
v = \sum_{i=1}^{r(G)} -w_{i_c} \tilde{G}_i. \tag{5.14}
\]

Clearly, v is a linear combination of the rows of G; that is, there exists β ∈ ℝ^{r(G)} such that

\[
\beta G = v. \tag{5.15}
\]

Since GG^T = 2^{n−1} I_{r(G)}, β can be found by the projection (and scaling) of v onto the rows of G: β = 2^{−(n−1)} v G^T. As γ, γ′ > 0 are free parameters in equation 5.13, we can choose them to construct β (and hence v). Formally stated,

\[
\forall \beta \in \mathbb{R}^{r(G)}, \ \exists \gamma, \gamma' \in \mathbb{R}^{r(G)}, \ \gamma, \gamma' > 0, \ \text{such that} \ (-\gamma + \gamma') = \beta. \tag{5.16}
\]

But because of the construction of v given in equation 5.14, w^T = (w_1, w_2, ..., w_{2^{n−1}}) + v must have at least r(G) zeros. The value of r(G) can easily be bounded from below. Since r(G) ≥ r(F) and r(G) + r(F) = 2^{n−1}, we have

\[
r(G) \ge 2^{n-1} - r(G) \;\Rightarrow\; r(G) \ge 2^n/4. \tag{5.17}
\]
Thus, the solution z = [w, t] has 2^n − 2^n/4 or fewer nonzero elements, so the corresponding polynomial function has 2^n − 2^n/4 or fewer monomials. The case r(F) > r(G) is proven in the same way by exchanging the roles of F, (α, α′), w with those of G, (γ, γ′), t, respectively. Combining the proofs for the two cases, we conclude that the number of zeros is at least max(r(F), r(G)) ≥ 2^n/4. Thus, for any binary classification problem in {−1, 1}^n, there exists a polynomial function solution with 2^n − max(r(F), r(G)) ≤ 2^n − 2^n/4 monomials.

6 An Example

We apply the theorem to the dichotomy of {−1, 1}^3 given in Table 2 (consider the left four columns).

Table 2: Example Problem Used in Section 6 and the Verification of the Solution Found by Application of the Main Theorem.

x0   x1   x2   Class   Sign of p(x0, x1, x2)   p(x0, x1, x2)
 1    1    1    −1      −1                      −8
−1    1    1    −1      −1                      −4
 1   −1    1     1       1                       8
−1   −1    1     1       1                      16
 1    1   −1     1       1                      16
−1    1   −1    −1      −1                      −4
 1   −1   −1    −1      −1                     −16
−1   −1   −1    −1      −1                      −8

We write the problem in the standard form diag(y) D^3 z > 0, where y = (−1, −1, 1, 1, 1, −1, −1, −1) is formed by copying the class labels in the order they appear in the table. Applying the partitioning used in the theorem, we have y^u = (−1, −1, 1, 1),
y^d = (1, −1, −1, −1), z = [w, t], and

\[
D^3 = \begin{pmatrix} D^2 & D^2 \\ D^2 & -D^2 \end{pmatrix}
\quad \text{where} \quad
D^2 = \begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \end{pmatrix}
= \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix}.
\]

Considering the component pairs of y^u and y^d, (−1, 1), (−1, −1), (1, −1), (1, −1), we construct the F and G matrices as instructed in the theorem, finding

\[
F = \begin{pmatrix} -1 & 1 & -1 & 1 \end{pmatrix}
\quad \text{and} \quad
G = \begin{pmatrix} -1 & -1 & -1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix}.
\]
The numbers of rows in F and G are 1 and 3, respectively, so r(F) = 1 and r(G) = 3. Since r(G) ≥ r(F), we work on the w part of the solution vector. In the expression w^T = (α + α′)F + (−γ + γ′)G, we can choose α, α′ > 0 arbitrarily. Let us set α = α′ = 0.5 (note that since r(F) = 1, α and α′ become scalars) to have

\[
w^T = (-1, 1, -1, 1) + (-\gamma + \gamma') G.
\]

Applying row-echelon reduction to G yields

\[
\tilde{G} = \begin{pmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & 1 \end{pmatrix}.
\]

So 1_c = 1, 2_c = 2, 3_c = 3. Applying the formula v = Σ_{i=1}^{r(G)} −w_{i_c} G̃_i, we obtain

\[
v = -(-1)\tilde{G}_1 - (1)\tilde{G}_2 - (-1)\tilde{G}_3 = (1, -1, 1, 3).
\]

Plugging v into β = 2^{−(n−1)} v G^T, we get β = 2^{−2} (1, −1, 1, 3) G^T = (−1, −1, 1). Now we have infinitely many ways of choosing positive vectors γ and γ′ to satisfy (−γ + γ′) = β. Let us choose γ = (2, 2, 1) and γ′ = (1, 1, 2). According to the theorem, w^T = (−1, 1, −1, 1) + (−γ + γ′)G must have at least r(G) = 3 zeros, and z = [w, t] must be a solution to the original problem, where t is given by t^T = (−α + α′)F + (γ + γ′)G = (γ + γ′)G. Carrying out the arithmetic, we indeed find z = (0, 0, 0, 4, 3, −3, −9, −3)^T. Thus, the solving polynomial function is

\[
p(x_0, x_1, x_2) = -3 x_2 x_1 x_0 - 9 x_2 x_1 - 3 x_2 x_0 + 3 x_2 + 4 x_1 x_0.
\]

It is easily verified that the sign of p(x0, x1, x2) satisfies the assignment table (compare the class labels with the last two columns of Table 2). Figure 1 shows the separation obtained by this solution. The surface shown is the contour plot of p(x0, x1, x2) at 0. This example demonstrates the freedom of choice in constructing a solution. One suspects that a better bound can be obtained by studying the freedom provided by α, α′, γ, γ′ > 0. In fact, we will prove in section 9 that all dichotomies of {−1, 1}^3 can be solved with four monomials, which is fewer than the five-monomial solution found by the application of the main theorem to the example problem.
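The arithmetic of this example is easy to replay. A sketch (illustrative, not part of the letter) that checks z = (0, 0, 0, 4, 3, −3, −9, −3) against Table 2, using the monomial ordering of section 3:

```python
# Monomials in the ordering 1, x0, x1, x1x0, x2, x2x0, x2x1, x2x1x0,
# paired with the coefficients z found in section 6.
z = [0, 0, 0, 4, 3, -3, -9, -3]
monomials = [
    lambda x0, x1, x2: 1,
    lambda x0, x1, x2: x0,
    lambda x0, x1, x2: x1,
    lambda x0, x1, x2: x1 * x0,
    lambda x0, x1, x2: x2,
    lambda x0, x1, x2: x2 * x0,
    lambda x0, x1, x2: x2 * x1,
    lambda x0, x1, x2: x2 * x1 * x0,
]

def p(x0, x1, x2):
    return sum(c * m(x0, x1, x2) for c, m in zip(z, monomials))

# Rows and class labels of Table 2 (x0 varies fastest).
rows = [(x0, x1, x2) for x2 in (1, -1) for x1 in (1, -1) for x0 in (1, -1)]
labels = [-1, -1, 1, 1, 1, -1, -1, -1]

values = [p(*r) for r in rows]
assert values == [-8, -4, 8, 16, 16, -4, -16, -8]  # last column of Table 2
assert all(l * v > 0 for l, v in zip(labels, values))
assert sum(c != 0 for c in z) == 5                  # a five-monomial solution
print("z solves the dichotomy of Table 2 with 5 monomials")
```

The computed values reproduce the last column of Table 2 exactly, and the sign of each value matches the class label, confirming the five-monomial solution.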
Figure 1: (Left) Depiction of the problem given in section 6. (Right) Solution obtained by the application of the main theorem. The surface has two sides, each facing exclusively the circles from a single class.
7 Fourier-Motzkin Elimination for Binary Classification Problems

Fourier-Motzkin (FM) elimination is a method for eliminating variables from a system of inequalities. It is often used to determine the solvability of the system and to find a feasible solution if one exists. Here we show that FM elimination can be used to construct polynomial function solutions (with fewer monomials) for binary classification problems. First, we introduce the elimination procedure.

7.1 Fourier-Motzkin Elimination. Given an inequality system, the aim of FM elimination is to produce a new inequality system with fewer variables. The key step in the procedure is the elimination of a single variable, which is repeatedly applied to the current inequality system until no variable can be eliminated. If the elimination yields an inconsistent inequality system, then it is concluded that the original system has no solution. Here we assume that the inequality system is given by Ax > 0. For a more general treatment, readers are referred to other sources (e.g., Chandru, 1993). Suppose we wish to eliminate the variable x_j (or column j) from Ax > 0. Define

\[
I_+ = \{i : A_{ij} > 0\} \qquad I_- = \{i : A_{ij} < 0\} \qquad I_0 = \{i : A_{ij} = 0\}. \tag{7.1}
\]
If I_− = {} or I_+ = {}, then x_j cannot be eliminated. Assume this is not the case. We create a new matrix A′ with the rows taken from the set {|A_{kj}| A_i + |A_{ij}| A_k : i ∈ I_−, k ∈ I_+} together with the rows A_i, i ∈ I_0. The new inequality system A′x > 0 has zero coefficients for x_j. Thus, we can write a reduced system Ãx̃ > 0 (by removing column j from A′ and x_j from x). Clearly, a solution to the system Ax > 0 yields a solution to the system Ãx̃ > 0, since the latter is constructed with elementary row operations that involve positive scaling and addition of the rows of A. The converse is also true, as stated next (for a proof, see Chandru, 1993).

Proposition 1. Given the inequality system Ax > 0, consider the system Ãx̃ > 0 obtained by eliminating one variable (say, x_1). Then for every solution x̃ of the reduced system, it is guaranteed that there is a value for x_1 such that the original inequality system is satisfied with x = [x_1, x̃].

7.2 Polynomial Function Solution via Fourier-Motzkin Elimination. Given a classification problem in the standard form YD^n a > 0, the idea is to pick an elimination order and eliminate the columns of YD^n, regarding a as the vector of variables. The resultant matrix can then easily be converted into a solution vector whose zero components correspond to the eliminated columns of YD^n.

Definition 10. Given an inequality system Ax > 0, we use A° to denote the matrix after the repeated application of FM elimination to all the columns of A. The order of elimination is indeterminate, so A° is in general ambiguously defined. When no order is specified, an arbitrary order is implied. Note that A° has the same number of columns as A, but with zero columns corresponding to the eliminated variables.

Proposition 2. Assume that we are given a classification problem in the standard form YD^n a > 0. Let Q = YD^n, so that we have the system of inequalities Qa > 0. Apply FM elimination to all columns of Q to obtain Q°a > 0.
Then the row sum c = Σ_{i=1}^m Q°_i is a solution; that is, YD^n c^T > 0, where m is the number of rows of Q°.

Proof. Clearly, c^T ∈ int cone(YD^n). Therefore, due to theorem 1, c must satisfy YD^n c^T > 0. In fact, for any w_i > 0, the sum c = Σ_{i=1}^m w_i Q°_i is also a solution.

Definition 11. We call the row vector c = Σ_{i=1}^m Q°_i the FM sum.
7.3 Example: Polynomial Function Solution via FM Elimination. Consider the two-dimensional classification problem specified in Table 3.
Table 3: Assignment Table for the Two-Dimensional Example Problem.

x0   x1   Class
 1    1    −1
−1    1     1
 1   −1     1
−1   −1    −1
We write the problem in the standard form YD^2 a > 0:

\[
\underbrace{\begin{pmatrix} -1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix}
\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{pmatrix}}_{YD^2 a}
=
\underbrace{\begin{pmatrix} -1 & -1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ -1 & 1 & 1 & -1 \end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{pmatrix}}_{Qa}
> 0.
\]

Let us eliminate column 1 from Q. Since I_+ = {2, 3}, I_− = {1, 4}, and I_0 = {}, we get

\[
Q_1 = \begin{pmatrix} 0 & -2 & 0 & -2 \\ 0 & 0 & 2 & -2 \\ 0 & 0 & -2 & -2 \\ 0 & 2 & 0 & -2 \end{pmatrix}.
\]

Eliminate column 2 from Q_1: I_+ = {4}, I_− = {1}, I_0 = {2, 3}, so we have

\[
Q_2 = \begin{pmatrix} 0 & 0 & 2 & -2 \\ 0 & 0 & -2 & -2 \\ 0 & 0 & 0 & -8 \end{pmatrix}.
\]

Eliminate column 3 from Q_2: I_+ = {1}, I_− = {2}, I_0 = {3}, so we have

\[
Q_3 = \begin{pmatrix} 0 & 0 & 0 & -8 \\ 0 & 0 & 0 & -8 \end{pmatrix}.
\]

We cannot eliminate any more columns, so Q° = Q_3; thus, the sum of the rows of Q° gives the coefficients of a solution. Namely, we have p(x0, x1) = −16 x1 x0. In this case, FM elimination has found a solution with the minimum number of monomials.
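The eliminations above can be mechanized. A sketch of FM elimination for systems Ax > 0 (illustrative; the helper `fm_eliminate` is not from the letter), replayed on the Table 3 problem:

```python
def fm_eliminate(A, j):
    """One FM step on Ax > 0: eliminate column j, or return None if impossible."""
    pos = [r for r in A if r[j] > 0]
    neg = [r for r in A if r[j] < 0]
    zero = [r for r in A if r[j] == 0]
    if not pos or not neg:
        return None  # I+ or I- is empty: x_j cannot be eliminated
    # New rows |A_kj| A_i + |A_ij| A_k for i in I-, k in I+, plus the I_0 rows.
    new = [[abs(rp[j]) * vi + abs(ri[j]) * vp for vi, vp in zip(ri, rp)]
           for ri in neg for rp in pos]
    return new + zero

# Q = Y D^2 for the Table 3 problem (rows in the order of the text).
Q = [[-1, -1, -1, -1],
     [ 1, -1,  1, -1],
     [ 1,  1, -1, -1],
     [-1,  1,  1, -1]]

for j in range(4):                 # try to eliminate columns left to right
    reduced = fm_eliminate(Q, j)
    if reduced is not None:
        Q = reduced

c = [sum(col) for col in zip(*Q)]  # the FM sum (definition 11)
print(c)  # -> [0, 0, 0, -16], i.e., p(x0, x1) = -16 * x1 * x0
```

The last column cannot be eliminated (all its entries are negative), so the procedure stops with Q° = Q_3, and the FM sum reproduces the single-monomial solution p(x0, x1) = −16 x1 x0 found above.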
8 Extension of the Main Theorem

Although the extension theorem gives only a slight improvement, it shows how one might pursue a proof based on theorem 2 and FM elimination to improve the bound. The reader may have already noticed the freedom of parameter choice in theorem 2, which suggests that the bound can indeed be improved.

Theorem 3 (extension of the main theorem). For any binary classification problem in {−1, 1}^n, n > 2, there always exists a polynomial function solution with 2^n − 2^n/4 − 1 or fewer monomials. Formally stated,

\[
\max_{C \in \Omega_n} \min \psi(C_n) = \Psi(n) \le 2^n - 2^n/4 - 1.
\tag{8.1}
\]
First we prove two lemmas.

Lemma 4. Given an arbitrary (row) vector β and a positive (row) vector δ^q > 0, the system of equations in the unknown vectors γ and γ′,

\[
(-\gamma + \gamma') = \beta, \qquad (\gamma + \gamma') = M\delta^q, \quad \text{with } M = (1+\varepsilon)\,\frac{\max_j |\beta_j|}{\min_k \delta^q_k},\ \varepsilon > 0,
\tag{8.2}
\]

where the max and min run over the components of β and δ^q, always has a positive solution γ, γ′ > 0.

Proof. Solving the system for γ and γ′, we see that the solutions must be positive because of the construction of M:

\[
\gamma'_i = 0.5(M\delta^q_i + \beta_i) = 0.5\Big((1+\varepsilon)\,\frac{\max_j |\beta_j|}{\min_k \delta^q_k}\,\delta^q_i + \beta_i\Big) \ge 0.5\big((1+\varepsilon)\max_j |\beta_j| + \beta_i\big) > 0,
\]
\[
\gamma_i = 0.5(M\delta^q_i - \beta_i) = 0.5\Big((1+\varepsilon)\,\frac{\max_j |\beta_j|}{\min_k \delta^q_k}\,\delta^q_i - \beta_i\Big) \ge 0.5\big((1+\varepsilon)\max_j |\beta_j| - \beta_i\big) > 0.
\tag{8.3}
\]
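The construction can be sanity-checked numerically; a small sketch of ours, with hypothetical values of β, δ^q, and ε chosen only to exercise equations 8.2 and 8.3:

```python
# Hypothetical inputs: an arbitrary beta, a positive delta, and some eps > 0.
beta  = [3.0, -1.5, 0.0, 2.0]
delta = [0.5, 2.0, 1.0, 0.25]
eps   = 0.1

# M = (1 + eps) * max|beta_j| / min(delta_k), as in equation 8.2.
M = (1 + eps) * max(abs(b) for b in beta) / min(delta)

# Solving the system: gamma' = 0.5*(M*delta + beta), gamma = 0.5*(M*delta - beta).
gamma_p = [0.5 * (M * d + b) for d, b in zip(delta, beta)]
gamma   = [0.5 * (M * d - b) for d, b in zip(delta, beta)]

# Both vectors are strictly positive, and the system (8.2) holds.
assert all(g > 0 for g in gamma) and all(g > 0 for g in gamma_p)
assert all(abs((-g + gp) - b) < 1e-12 for g, gp, b in zip(gamma, gamma_p, beta))
assert all(abs((g + gp) - M * d) < 1e-12 for g, gp, d in zip(gamma, gamma_p, delta))
```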
Thus, γ, γ′ > 0 as desired. Note that Mδ^q G and δ^q G will have the same number of zero components (G is an appropriately sized real matrix).

Lemma 5. Assume we have constructed the F and G matrices as in theorem 2 and that r(G) ≥ r(F) for a given problem. Assume further that there exists a positive row vector δ^q > 0 such that δ^q G has q zero components. Furthermore, assume there exist
row vectors γ, γ′ > 0 satisfying both (−γ + γ′) = β (with β as defined in theorem 2) and (γ + γ′) = δ^q. Then the number of zeros in the solution can be improved to 2^{n−2} + q.

Proof. We are given r(G) ≥ r(F). Choose α = α′ = 0.5 I_{r(G)}. From theorem 2, we know that if we choose γ, γ′ > 0 such that (−γ + γ′) = β, the half solution vector w given by w^T = (w_0, w_1, w_2, …, w_{2^{n−1}−1}) + (−γ + γ′)G is guaranteed to have at least 2^{n−2} zeros. The expression for the other half of the solution, t^T = (−α + α′)F + (γ + γ′)G, reduces to t^T = δ^q G, since α = α′ and δ^q = (γ + γ′) by the premises of the lemma. So the proof is complete, because the solution is the concatenation of w and t, and δ^q G has q zero components.

Proof of Theorem 3. Proceed as in theorem 2 to construct the F and G matrices. Then either r(G) ≥ r(F) or r(F) > r(G); assume the former. We are given n > 2, so r(G) ≥ r(F) implies r(G) > 1 due to the identity r(G) + r(F) = 2^{n−1}. Since the rows of G are orthogonal and taken from {−1, 1}^{n−1}, there must be a column in which not all the components have the same sign. Applying FM elimination to this column and taking the FM sum yields a vector with at least one zero component. This means that there exists δ > 0 such that δG has at least one zero component. By lemma 4, δ can be positively scaled as δ¹ = Mδ such that (γ + γ′) = δ¹ and (−γ + γ′) = β have a positive solution γ, γ′ > 0, and δ¹G has one zero component. Due to lemma 5, this implies that the number of zeros in the solution can be made at least 2^{n−2} + 1, proving the theorem.

Remark. The case r(F) > r(G) is proven in the same way by exchanging the roles of F, (α, α′), w with G, (γ, γ′), t, respectively, in lemma 5 and the proof.

9 Some Results on Lower Dimensions

This section presents exact results concerning the minimum number of monomials required to solve the binary classification problems in {−1, 1}¹, {−1, 1}², and {−1, 1}³.

Corollary 3 (corollary to theorem 3). Any binary classification problem in {−1, 1}³ can be solved with four monomials.

Proof. Proceed as in theorem 2 to construct the F and G matrices. Then either r(G) ≥ r(F) or r(F) > r(G); assume the former. Since r(G) + r(F) = 2², we have three cases:

Case 1: r(G) = 4, r(F) = 0. Due to the remark for theorem 2, there exists a four-monomial solution.
Case 2: r(G) = 3, r(F) = 1. Application of theorem 2 yields three zeros in the w part of the solution vector. Following the steps of theorem 3, it is apparent that this solution can be improved by at least one more zero.

Case 3: r(G) = 2, r(F) = 2. Application of theorem 2 yields two zeros in the w part of the solution vector. Let δ^q = (1, 1). Then δ^q G must have exactly two zeros, since the rows of G are orthogonal vectors from {−1, 1}⁴. Therefore, due to lemmas 4 and 5, there exists a four-monomial solution.

The case r(F) > r(G) is proven similarly.

Proposition 3. There is a dichotomy in {−1, 1}³ that cannot be solved with three (or fewer) monomials.

Proof. We exhibit a problem that cannot be solved with three monomials; the example problem given in section 7 serves the purpose. If the problem had a three-monomial solution, then the inequality system [c_{j1} c_{j2} c_{j3}] a = Ha > 0 would be satisfiable for some c_{j1}, c_{j2}, and c_{j3}, each a distinct column of D4. Satisfiability of Ha > 0 can be checked using FM elimination: if the eliminated system is inconsistent, then Ha > 0 cannot be satisfied, due to proposition 1. By applying this procedure to all 8!/(3!(8 − 3)!) = 56 possible cases, it can be shown that there is no (j1, j2, j3) that leads to a consistent set of inequalities. Thus, there is no three-monomial solution to the given problem.

Proposition 4. There is a dichotomy in {−1, 1}² that cannot be solved with two or fewer monomials.

Proof. We give an example problem. Take the classification problem C = ({(1, 1), (−1, 1), (1, −1)}, {(−1, −1)}). Following the logic described in the proof of proposition 3, it can be shown that C cannot be solved with two monomials.

Corollary 4.
i. Clearly, problems in {−1, 1}¹ always admit a one-monomial solution (one of x0 or 1), and one monomial is required. Therefore, Ψ(1) = 1.
ii.
The problems in {−1, 1}² can always be solved with three monomials (main theorem), and according to proposition 4, there is at least one problem that cannot be solved with fewer than three monomials. Therefore, Ψ(2) = 3.
iii. Combining corollary 3 (there is always a four-monomial solution) and proposition 3 (there is a dichotomy that cannot be solved with three or fewer monomials), we get Ψ(3) = 4.
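The exhaustive checks behind propositions 3 and 4 are small enough to replay. The following is our own sketch, assuming the standard fact that Fourier-Motzkin projection preserves feasibility (so a strict homogeneous system Ha > 0 is feasible exactly when eliminating every column leaves no residual all-zero rows, each of which would assert 0 > 0); it verifies the {−1, 1}² witness of proposition 4:

```python
from itertools import combinations

def fm_feasible(rows):
    """Strict system `rows @ a > 0` is feasible iff Fourier-Motzkin
    elimination of every column leaves no rows behind (a surviving
    all-zero row would assert the contradiction 0 > 0)."""
    n = len(rows[0])
    for c in range(n):
        pos  = [r for r in rows if r[c] > 0]
        neg  = [r for r in rows if r[c] < 0]
        zero = [r for r in rows if r[c] == 0]
        rows = zero + [[abs(q[c]) * p[i] + abs(p[c]) * q[i] for i in range(n)]
                       for p in pos for q in neg]
    return len(rows) == 0

# Y D_2 for the proposition 4 problem C = ({(1,1), (-1,1), (1,-1)}, {(-1,-1)}),
# with monomial (column) order: 1, x0, x1, x0*x1.
YD2 = [[ 1,  1,  1,  1],
       [ 1, -1,  1, -1],
       [ 1,  1, -1, -1],
       [-1,  1,  1, -1]]

def solvable_with(cols):
    H = [[row[j] for j in cols] for row in YD2]
    return fm_feasible(H)

# No pair of monomials works, but some triple does, so C needs 3 monomials.
assert not any(solvable_with(c) for c in combinations(range(4), 2))
assert any(solvable_with(c) for c in combinations(range(4), 3))
```

The same loop over all 56 column triples of the eight-column system settles the {−1, 1}³ case of proposition 3.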
Thus, we have established the exact values of the maximum density for dimensions 1, 2, and 3:

\[
\Psi(1) = 1, \qquad \Psi(2) = 3, \qquad \Psi(3) = 4.
\tag{9.1}
\]

It can be shown that Ψ(4) ≤ 9 (proven with random search), and in fact it appears that Ψ(4) = 9 (not proven; an empirical observation), tempting one to speculate on the general formula

\[
\Psi(n) =
\begin{cases}
2^{n-1} & \text{if } n \text{ is odd} \\
2^{n-1} + 1 & \text{if } n \text{ is even,}
\end{cases}
\]

which conforms to the known bounds for n > 1, that is, 0.11 × 2^n < Ψ(n) ≤ 0.75 × 2^n.

10 Conclusion

This letter presented theoretical results regarding the maximum density Ψ(n), defined as the minimum number of monomials with which one can separate any dichotomy of {−1, 1}^n. The best-known bound prior to this work was asymptotic and substantially inferior to the bound proven here. It is shown that for dimensions 1, 2, and 3, Ψ(n) equals 1, 3, and 4, respectively, and that Ψ(n) < 2^n − 2^n/4 for n > 3. This result says that given any dichotomy of {−1, 1}^n, it is always possible to perform the target separation with fewer than three-quarters of the full set of n-dimensional monomials (2^n monomials). This is the first time a ratio bound independent of n, namely 3/4, has been shown for the maximum density.

In general, a higher-order neuron (HON) would require an exponentially growing number of input lines (number of monomials) to implement a given dichotomy (a fully specified binary classification problem). Although this seems to reduce the validity of HON models of real neurons, one also has to consider that the number of dichotomies grows superexponentially with n. This suggests the possibility that a useful subset of dichotomies might be implemented by HONs with a subexponential number of monomials. Although it is trivially shown that all classification problems specified at, say, a polynomial number of assignments (p(n)) are always solvable with p(n) monomials, the conditions under which a superpolynomial (e.g., ε2^n for some 0 < ε < 1) number of assignment specifications would admit a solution with a polynomial (i.e., q(n)) number of monomials is unexplored. New techniques that use the unspecified assignments to reduce the number of monomials that would suffice to solve a partially specified
problem must be developed, for which this study might provide a starting point.

In spite of the success of the spectral theory of Boolean functions in obtaining insights into the HON solutions of binary classification problems, it appears to have certain limitations when the underlying local structure of the Boolean functions (i.e., individual vector components) has to be considered, as is the case when the unspecified assignments need to be exploited to arrive at solutions with a reduced number of monomials. This is probably why the previous bounds, obtained using techniques from the spectral analysis of Boolean functions, are inferior to the bound derived in this study, which employs simple local algebraic manipulations.

Acknowledgments

This study was supported by the JST-ICORP Computational Brain Project. I was introduced to the problem of establishing a bound on the maximum density by Marifi Güler. I thank Junmei Zhu and Jun Nakanishi for their comments on an earlier version of the manuscript. I thank Mitsuo Kawato, Gordon Cheng, and Hiroshi Imamizu for providing the research environment. Finally, I thank an anonymous reviewer for pointing me to the literature on the spectral theory of Boolean functions.

References

Bruck, J. (1990). Harmonic analysis of polynomial threshold functions. SIAM Journal on Discrete Mathematics, 3, 168–177.
Chandru, V. (1993). Variable elimination in linear constraints. Computer Journal, 36, 463–470.
Ghosh, J., & Shin, Y. (1992). Efficient higher order neural networks for classification and function approximation. International Journal of Neural Systems, 3, 323–350.
Giles, C. L., & Maxwell, T. (1987). Learning, invariance, and generalization in high-order neural networks. Applied Optics, 26, 4972–4978.
Gotsman, C. (1989). On Boolean functions, polynomials and algebraic threshold functions (Tech. Rep. TR-89-18). Jerusalem: Department of Computer Science, Hebrew University.
Güler, M. (2001). A model with an intrinsic property of learning higher order correlations. Neural Networks, 14, 495–504.
Mel, B. W. (1994). Information processing in dendritic trees. Neural Computation, 6, 1031–1085.
Mel, B. W., & Koch, C. (1990). Sigma-pi learning: On radial basis functions and cortical associative learning. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 474–481). San Mateo, CA: Morgan Kaufmann.
O'Donnell, R., & Servedio, R. (2003). Extremal properties of polynomial threshold functions. In Eighteenth Annual Conference on Computational Complexity (pp. 3–12). Piscataway, NJ: IEEE Computer Society.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research
Group (Eds.), Parallel distributed processing (Vol. 1, pp. 151–193). Cambridge, MA: MIT Press. Saks, M. E. (1993). Slicing the hypercube. In K. Walker (Ed.), Surveys in combinatorics (pp. 211–255). Cambridge: Cambridge University Press. Schmitt, M. (2002). On the complexity of computing and learning with multiplicative neural networks. Neural Computation, 14, 241–301. Schmitt, M. (2005). On the capabilities of higher-order neurons: A radial basis function approach. Neural Computation, 17, 715–729. Siu, K. Y., Roychowdhury, V., & Kailath, T. (1995). Discrete neural computation. Englewood Cliffs, NJ: Prentice Hall.
Received October 11, 2005; accepted May 5, 2006.
Index
Volume 18 By Author Agís, Rodrigo—See Ros, Eduardo Ahissar, Ehud—See Zacksenhouse, Miriam Ahmadi, Mandana—See Lerchner, Alexander Alexandre, Luís A.—See Silva, Luís M. Amari, Shun-ichi—See Miura, Keiji Amari, Shun-ichi and Nakahara, Hiroyuki Correlation and Independence in the Neural Code (Note)
18(6): 1259–1267
Amari, Shun-ichi—See Nakahara, Hiroyuki Amari, Shun-ichi, Park, Hyeyoung, and Ozeki, Tomoko Singularities Affect Dynamics of Learning in Neuromanifolds (Article)
18(5): 1007–1065
Ames, Jeffrey—See Psujek, Sean Ancona, Nicola and Stramaglia, Sebastiano An Invariance Property of Predictors in Kernel-Induced Hypothesis Spaces (Note) ¨ ¨ Andeli´c, E., Schaffoner, M., Katz, M., Kruger, S.E., and Wendemuth, A. Kernel Least-Squares Models Using Updates of the Pseudoinverse (Note)
18(4): 749–759
18(12): 2928–2935
Andersson, Ch.—See Sarishvili, A. Appleby, Peter A. and Elliott, Terry Stable Competitive Dynamics Emerge from Multispike Interactions in a Stochastic Model of Spike-Timing-Dependent Plasticity (Letter)
18(10): 2414–2464
Bäcker, Alex—See Brown, W. Michael Badcock, David R.—See Falconbridge, Michael S. Balas, Benjamin J. and Sinha, Pawan Receptive Field Structures for Recognition (Letter) Barak, Omri and Tsodyks, Misha Recognition by Variance: Learning Rules for Spatiotemporal Patterns (Letter)
18(3): 497–520
18(10): 2343–2358
Barber, David—See Pfister, Jean-Pascal Barbour, Boris—See Ros, Eduardo Basak, Jayanta Online Adaptive Decision Trees: Pattern Classification and Function Approximation (Letter)
18(9): 2062–2101
Basalyga, Gleb and Salinas, Emilio When Response Variability Increases Neural Network Robustness to Synaptic Noise (Letter)
18(6): 1349–1379
Becker, Suzanna—See Dominguez, Melissa Beer, Randall D. Parameter Space Structure of Continuous-Time Recurrent Neural Networks (Letter)
18(12): 3009–3051
Beer, Randall D.—See Psujek, Sean Bengio, Yoshua, Monperrus, Martin, and Larochelle, Hugo Nonlocal Estimation of Manifold Structure (Letter) Berkes, Pietro and Wiskott, Laurenz On the Analysis and Interpretation of Inhomogeneous Quadratic Forms as Receptive Fields (Letter)
18(10): 2509–2528
18(8): 1868–1895
Berkes, Pietro—See Blaschke, Tobias Beslon, Guillaume—See Soula, Hédi Bienenstock, Elie—See Wu, Wei Black, Michael J.—See Wu, Wei Blaschke, Tobias, Berkes, Pietro, and Wiskott, Laurenz What Is the Relation Between Slow Feature Analysis and Independent Component Analysis? (Letter) Bo, Liefeng, Wang, Ling, and Jiao, Licheng Feature Scaling for Kernel Fisher Discriminant Analysis Using Leave-One-Out Cross Validation (Letter)
18(10): 2495–2508
18(4): 961–978
Borisyuk, Roman—See Kazanovich, Yakov Brette, Romain Exact Simulation of Integrate-and-Fire Models with Synaptic Conductances (Letter)
18(8): 2004–2027
Brown, Emery N.—See Srinivasan, Lakshminarayan Brown, W. Michael and B¨acker, Alex Optimal Neuronal Tuning for Finite Stimulus Spaces (Note)
18(7): 1511–1526
Bruce, Ian—See Dominguez, Melissa Brunel, Nicolas and Hansel, David How Noise Affects the Synchronization Properties of Recurrent Networks of Inhibitory Neurons (Letter)
18(5): 1066–1110
Bülthoff, Heinrich H.—See Graf, Arnulf B.A. Burwick, Thomas Oscillatory Networks: Pattern Recognition Without a Superposition Catastrophe (Letter)
18(2): 356–380
Butera, Robert—See Shao, Jie Calin-Jageman, Robert J. and Katz, Paul S. A Distributed Computing Tool for Generating Neural Simulation Databases (Note)
18(12): 2923–2927
Carrillo, Richard—See Ros, Eduardo Casile, Antonino and Rucci, Michele A Theoretical Analysis of the Influence of Fixational Instability on the Development of Thalamocortical Connectivity (Letter)
18(3): 569–590
Chan, Lai-Wan—See Zhang, Kun Chen, Tianping—See Lu, Wenlian Cheng, Sen and Sabes, Philip N. Modeling Sensorimotor Learning with Linear Dynamical Systems (Letter) Chhabra, Manu and Jacobs, Robert A. Properties of Synergies Arising from a Theory of Optimal Motor Behavior (Letter)
18(4): 760–793
18(10): 2320–2342
Choe, Yoonsuck—See Yu, Yingwei Chow, Tommy W.S.—See Huang, D. Clark, Paul T. and van Rossum, Mark C.W. The Optimal Synapse for Sparse, Binary Signals in the Rod Pathway (Letter)
18(1): 26–44
Claussen, Jens Christian—See Villmann, Thomas Coghill G.—See Unsworth, C.P. Cortes, J.M., Torres, J.J., Marro, J., Garrido, P.L., and Kappen, H.J. Effects of Fast Presynaptic Noise in Attractor Neural Networks (Letter) Courville, Aaron C.—See Daw, Nathaniel D. Cunningham, Mark O.—See Pervouchine, Dmitri D.
18(3): 614–633
d'Avila Garcez, Artur S. and Lamb, Luís C. A Connectionist Computational Model for Epistemic and Temporal Reasoning (Letter)
18(7): 1711–1738
Daw, Nathaniel D., Courville, Aaron C., and Touretzky, David S. Representation and Timing in Theories of the Dopamine System (Letter)
18(7): 1637–1677
Dayan, Peter Images, Frames, and Connectionist Hierarchies (Letter)
18(10): 2293–2319
Dayan, Peter—See Schwartz, Odelia de Polavieja, Gonzalo G. Neuronal Algorithms That Detect the Temporal Order of Events (Letter)
18(9): 2102–2121
Destexhe, A.—See Rudolph, M. Destexhe, A.—See Rudolph, M. Detre, Greg—See Norman, Kenneth A. Diesmann, Markus—See Guerrero-Rivera, Ruben Dominguez, Melissa, Becker, Suzanna, Bruce, Ian, and Read, Heather A Spiking Neuron Model of Cortical Correlates of Sensorineural Hearing Loss: Spontaneous Firing, Synchrony, and Tinnitus (Letter)
18(12): 2942–2958
Donoghue, John P.—See Wu, Wei Douglas, Rodney—See Uchizawa, Kei Dror, Gideon—See Eisenthal, Yael Eckes, Christian, Triesch, Jochen, and von der Malsburg, Christoph Analysis of Cluttered Scenes Using an Elastic Matching Approach for Stereo Images (Letter) Eden, Uri T.—See Srinivasan, Lakshminarayan
18(6): 1441–1471
Eguchi, Shinto—See Mollah, Md. Nurul Haque Eisenthal, Yael, Dror, Gideon, and Ruppin, Eytan Facial Attractiveness: Beauty and the Machine (Letter)
18(1): 119–142
Elliott, Terry—See Appleby, Peter A. Enemark, Søren—See Lerchner, Alexander Falconbridge, Michael S., Stamps, Robert L., and Badcock, David R. A Simple Hebbian/Anti-Hebbian Network Learns the Sparse, Independent Components of Natural Images (Letter)
18(2): 415–429
Farkaš, Igor—See Tiňo, Peter Faugeras, Olivier—See Grimbert, François Felgueiras, Carlos S.—See Silva, Luís M. Focke, Walter W. Mixture Models Based on Neural Network Averaging (Note)
18(1): 1–9
Frank, Michael J.—See O'Reilly, Randall C. Franke, J.—See Sarishvili, A. Franz, Matthias O. and Schölkopf, Bernhard A Unifying View of Wiener and Volterra Theory and Polynomial Kernel Regression (Letter)
18(12): 3097–3118
Friedman, Nir—See Slonim, Noam Fries, Pascal—See Zeitler, Magteld Fushiki, Tadayoshi, Horiuchi, Shingo, and Tsuchiya, Takashi A Maximum Likelihood Approach to Density Estimation with Semidefinite Programming (Letter)
18(11): 2777–2812
Galán, Roberto F., Weidert, Marcel, Menzel, Randolf, Herz, Andreas V.M., and Galizia, C. Giovanni Sensory Memory for Odors Is Encoded in Spontaneous Correlated Activity Between Olfactory Glomeruli (Letter)
18(1): 10–25
Galizia, C. Giovanni—See Galán, Roberto F. Gao, Yun—See Wu, Wei Gao, Xing-Bao and Liao, Li-Zhi A Novel Neural Network for a Class of Convex Quadratic Minimax Problems (Letter)
18(8): 1818–1846
Garrido, P.L.—See Cortes, J.M. Ge, Yang and Jiang, Wenxin On Consistency of Bayesian Inference with Mixtures of Logistic Regression (Letter) ¨ otter, ¨ Geng, Tao, Porr, Bernd, and Worg Florentin A Reflexive Neural Network for Dynamic Biped Walking Control (Letter)
18(1): 224–243
18(5): 1156–1196
Gerstner, Wulfram—See Pfister, Jean-Pascal Gielen, Stan—See Zeitler, Magteld Girolami, Mark and Rogers, Simon Variational Bayesian Multinomial Probit Regression with Gaussian Process Priors (Letter)
18(8): 1790–1817
Govaerts, W. and Sautois, B. Computation of the Phase Response Curve: A Direct Numerical Approach (Letter)
18(4): 817–847
Graf, Arnulf B.A., Wichmann, Felix A., Bülthoff, Heinrich H., and Schölkopf, Bernhard Classification of Faces in Man and Machine (Letter)
18(1): 143–165
Grimbert, François and Faugeras, Olivier Bifurcation Analysis of Jansen's Neural Mass Model (Letter)
18(12): 3052–3068
Guan, Cuntai—See Li, Yuanqing Guerrero-Rivera, Ruben, Morrison, Abigail, Diesmann, Markus, and Pearce, Tim C. Programmable Logic Construction Kits for Hyper-Real-Time Neuronal Modeling (Letter)
18(11): 2651–2679
Hansel, David—See Brunel, Nicolas Havenith, Martha N.—See Schneider, Gaby Hertz, John—See Lerchner, Alexander Herz, Andreas V.M.—See Gal´an, Roberto F. Hinton, Geoffrey E., Osindero, Simon, and Teh, Yee-Whye A Fast Learning Algorithm for Deep Belief Nets (Letter)
18(7): 1527–1554
Hinton, Geoffrey E.—See Osindero, Simon Hochreiter, Sepp and Obermayer, Klaus Support Vector Machines for Dyadic Data (Letter)
18(6): 1472–1510
Holeňa, Martin Piecewise-Linear Neural Networks and Their Relationship to Rule Extraction from Data (Letter)
18(11): 2813–2853
Horiuchi, Shingo—See Fushiki, Tadayoshi Huang, D. and Chow, Tommy W.S. Enhancing Density-Based Data Reduction Using Entropy (Letter) Hyvärinen, Aapo Consistency of Pseudolikelihood Estimation of Fully Visible Boltzmann Machines (Note) Izhikevich, Eugene M. Polychronization: Computation with Spikes (Article)
18(2): 470–495
18(10): 2283–2292
18(2): 245–282
Jacobs, Robert A.—See Chhabra, Manu Jacobs, Robert A.—See Michel, Melchi M. Jacobsson, Henrik The Crystallizing Substochastic Sequential Machine Extractor: CrySSMEx (Letter) Jiang, Wenxin On the Consistency of Bayesian Variable Selection for High Dimensional Binary Regression and Classification (Letter)
18(9): 2211–2255
18(11): 2762–2776
Jiang, Wenxin—See Ge, Yang Jiao, Licheng—See Bo, Liefeng Johnson, Kenneth O.—See Sripati, Arun P. Kanamaru, Takashi Analysis of Synchronization Between Two Modules of Pulse Neural Networks with Excitatory and Inhibitory Connections (Letter)
18(5): 1111–1131
Kappen, H.J.—See Cortes, J.M. Kass, Robert E. and Ventura, Valérie Spike Count Correlation Increases with Length of Time Interval in the Presence of Trial-to-Trial Variation (Note)
18(11): 2583–2591
Katz, M.—See Andeli´c, E. Katz, Paul S.—See Calin-Jageman, Robert J. Kazanovich, Yakov and Borisyuk, Roman An Oscillatory Neural Model of Multiple Object Tracking (Letter)
18(6): 1413–1440
Keil, Matthias S. Smooth Gradient Representations as a Unifying Account of Chevreul’s Illusion, Mach Bands, and a Variant of the Ehrenstein Disk (Letter)
18(4): 871–903
Kempter, Richard—See Leibold, Christian Kim, Soyoun, Singer, Benjamin H., and Zochowski, Michal Changing Roles for Temporal Representation of Odorant During the Oscillatory Response of the Olfactory Bulb (Letter)
18(4): 794–816
Koene, Ansgar R. A Model for Perceptual Averaging and Stochastic Bistable Behavior and the Role of Voluntary Control (Letter)
18(12): 3069–3096
Kopell, Nancy J.—See Pervouchine, Dmitri D. Kroisandt, G.—See Sarishvili, A. Krüger, S. E.—See Andelić, E. Lamb, Luís C.—See d'Avila Garcez, Artur S. Larochelle, Hugo—See Bengio, Yoshua Leibold, Christian and Kempter, Richard Memory Capacity for Sequences in a Recurrent Network with Biological Constraints (Letter)
18(4): 904–941
Lerchner, Alexander, Ursta, Cristina, Hertz, John, Ahmadi, Mandana, Ruffiot, Pauline, and Enemark, Søren Response Variability in Balanced Cortical Networks (Letter)
18(3): 634–659
Li, Yuanqing and Guan, Cuntai An Extended EM Algorithm for Joint Feature Extraction and Classification in Brain Computer Interfaces (Letter)
18(11): 2730–2761
Liao, Li-Zhi—See Gao, Xing-Bao Lindner, Benjamin and Longtin, André Comment on “Characterization of Subthreshold Voltage Fluctuations in Neuronal Membranes,” by M. Rudolph and A. Destexhe (Letter)
18(8): 1896–1931
Longtin, André—See Lindner, Benjamin Lőrincz, András—See Szita, István Lu, Wenlian and Chen, Tianping Dynamical Behaviors of Delayed Neural Network Systems with Discontinuous Activation Functions (Letter) Lüdtke, Niklas and Nelson, Mark E. Short-Term Synaptic Plasticity Can Enhance Weak Signal Detectability in Nonrenewal Spike Trains (Article)
18(3): 683–708
18(12): 2879–2916
Maass, Wolfgang—See Uchizawa, Kei Marques de Sá, J.—See Silva, Luís M. Marro, J.—See Cortes, J.M. Masuda, Naoki Simultaneous Rate-Synchrony Codes in Populations of Spiking Neurons (Letter)
18(1): 45–59
Mazet, Olivier—See Soula, Hédi Menzel, Randolf—See Galán, Roberto F. Michel, Melchi M. and Jacobs, Robert A. The Costs of Ignoring High-Order Correlations in Populations of Model Neurons (Letter)
18(3): 660–682
Miller, Lee E.—See Westwick, David T. Miller, Paul Analysis of Spike Statistics in Neuronal Systems with Continuous Attractors or Multiple, Discrete Attractor States (Letter)
18(6): 1268–1317
Mills, Ashley J.S.—See Tiňo, Peter Minami, Mihoko—See Mollah, Md. Nurul Haque Miura, Keiji, Okada, Masato, and Amari, Shun-ichi Estimating Spiking Irregularities Under Changing Environments (Letter)
18(10): 2359–2386
Mollah, Md. Nurul Haque, Minami, Mihoko, and Eguchi, Shinto Exploring Latent Structure of Mixture ICA Models by the Minimum β-Divergence Method (Letter)
18(1): 166–190
Monperrus, Martin—See Bengio, Yoshua Montemurro, Marcelo A. and Panzeri, Stefano Optimal Tuning Widths in Population Coding of Periodic Variables (Letter)
18(7): 1555–1576
Morrison, Abigail—See Guerrero-Rivera, Ruben Nakahara, Hiroyuki, Amari, Shun-ichi, and Richmond, Barry J. A Comparison of Descriptive Models of a Single Spike Train by Information-Geometric Measure (Letter)
18(3): 545–568
Nakahara, Hiroyuki—See Amari, Shun-ichi Nelson, Mark E.—See Lüdtke, Niklas Netoff, Theoden I.—See Pervouchine, Dmitri D. Newman, Ehren—See Norman, Kenneth A. Nikolić, Danko—See Schneider, Gaby Norman, Kenneth A., Newman, Ehren, Detre, Greg, and Polyn, Sean How Inhibitory Oscillations Can Train Neural Networks and Punish Competitors (Letter)
18(7): 1577–1610
Obermayer, Klaus—See Hochreiter, Sepp Okada, Masato—See Miura, Keiji O’Reilly, Randall C. and Frank, Michael J. Making Working Memory Work: A Computational Model of Learning in the Prefrontal Cortex and Basal Ganglia (Letter)
18(2): 283–328
Ortigosa, Eva M.—See Ros, Eduardo Osindero, Simon—See Hinton, Geoffrey E. Osindero, Simon, Welling, Max, and Hinton, Geoffrey E. Topographic Product Models Applied to Natural Scene Statistics (Letter)
18(2): 381–414
Ozeki, Tomoko—See Amari, Shun-ichi Oztop, Erhan An Upper Bound on the Minimum Number of Monomials Required to Separate Dichotomies of {−1, 1}n (Letter)
18(12): 3119–3138
Paninski, Liam The Spike-Triggered Average of the Integrate-and-Fire Cell Driven by Gaussian White Noise (Letter)
18(11): 2592–2616
Panzeri, Stefano—See Montemurro, Marcelo A. Park, Hyeyoung—See Amari, Shun-ichi Pearce, Tim C.—See Guerrero-Rivera, Ruben Pedreño, Juan Luís—See Pinzolas, Miguel Pelillo, Marcello and Torsello, Andrea Payoff-Monotonic Game Dynamics and the Maximum Clique Problem (Letter)
18(5): 1215–1258
Peng, Zhihang—See Wang, Yingfeng Perreault, Eric J.—See Westwick, David T. Pervouchine, Dmitri D., Netoff, Theoden I., Rotstein, Horacio G., White, John A., Cunningham, Mark O., Whittington, Miles A., and Kopell, Nancy J. Low-Dimensional Maps Encoding Dynamics in Entorhinal Cortex and Hippocampus (Letter)
18(11): 2617–2650
Pfister, Jean-Pascal, Toyoizumi, Taro, Barber, David, and Gerstner, Wulfram Optimal Spike-Timing-Dependent Plasticity for Precise Action Potential Firing in Supervised Learning (Letter)
18(6): 1318–1348
Pinzolas, Miguel, Toledo, Ana, and Pedreño, Juan Luís A Neighborhood-Based Enhancement of the Gauss-Newton Bayesian Regularization Training Method (Letter)
18(8): 1987–2003
Pohlmeyer, Eric A.—See Westwick, David T. Polyn, Sean—See Norman, Kenneth A. Porr, Bernd and Wörgötter, Florentin Strongly Improved Stability and Faster Convergence of Temporal Sequence Learning by Using Input Correlations Only (Letter)
18(6): 1380–1412
Porr, Bernd—See Geng, Tao Psujek, Sean, Ames, Jeffrey, and Beer, Randall D. Connection and Coordination: The Interplay Between Architecture and Dynamics in Evolved Model Pattern Generators (Letter)
18(3): 729–747
Read, Heather—See Dominguez, Melissa Richmond, Barry J.—See Nakahara, Hiroyuki Rogers, Simon—See Girolami, Mark Ros, Eduardo, Carrillo, Richard, Ortigosa, Eva M., Barbour, Boris, and Ag´ıs, Rodrigo Event-Driven Simulation Scheme for Spiking Neural Networks Using Lookup Tables to Characterize Neuronal Dynamics (Letter)
18(12): 2959–2993
Roth, Volker Kernel Fisher Discriminants for Outlier Detection (Letter)
18(4): 942–960
Rotstein, Horacio G.—See Pervouchine, Dmitri D. Rucci, Michele—See Casile, Antonino Rudolph, M. and Destexhe, A. Analytical Integrate-and-Fire Neuron Models with Conductance-Based Dynamics for Event-Driven Simulation Strategies (Letter)
18(9): 2146–2210
Rudolph, M. and Destexhe, A. On the Use of Analytic Expressions for the Voltage Distribution to Analyze Intracellular Recordings (Note)
18(12): 2917–2922
Ruffiot, Pauline—See Lerchner, Alexander Ruppin, Eytan—See Eisenthal, Yael Sabes, Philip N.—See Cheng, Sen Salinas, Emilio—See Basalyga, Gleb Sarishvili, A., Andersson, Ch., Franke, J., and Kroisandt, G. On the Consistency of the Blocked Neural Network Estimator in Time Series Analysis (Letter)
18(10): 2568–2581
Sautois, B.—See Govaerts, W. Schafföner, M.—See Andelić, E. Schneider, Gaby, Havenith, Martha N., and Nikolić, Danko Spatiotemporal Structure in Large Neuronal Networks Detected from Cross-Correlation (Letter) Schölkopf, Bernhard—See Franz, Matthias O. Schölkopf, Bernhard—See Graf, Arnulf B.A.
18(10): 2387–2413
Schwartz, Odelia, Sejnowski, Terrence J., and Dayan, Peter Soft Mixer Assignment in a Hierarchical Generative Model of Natural Scene Statistics (Letter)
18(11): 2680–2718
Sejnowski, Terrence J.—See Schwartz, Odelia Shamir, Maoz The Scaling of Winner-Takes-All Accuracy with the Population Size (Letter)
18(11): 2719–2729
Shamir, Maoz and Sompolinsky, Haim Implications of Neuronal Diversity on Population Coding (Letter)
18(8): 1951–1986
Shao, Jie, Tsao, Tzu-Hsin, and Butera, Robert Bursting Without Slow Kinetics: A Role for a Small World? (Note)
18(9): 2029–2035
Shrestha, D.L. and Solomatine, D.P. Experiments with AdaBoost.RT, an Improved Boosting Scheme for Regression (Letter)
18(7): 1678–1710
Silva, Lu´ıs M., Felgueiras, Carlos S., Alexandre, Lu´ıs A., and Marques de S´a, J. Error Entropy in Classification Problems: A Univariate Data Analysis (Letter)
18(9): 2036–2061
Singer, Benjamin H.—See Kim, Soyoun Sinha, Pawan—See Balas, Benjamin J. Slonim, Noam, Friedman, Nir, and Tishby, Naftali Multivariate Information Bottleneck (Article)
18(8): 1739–1789
Smith, Anne C. and Smith, Peter A Set Probability Technique for Detecting Relative Time Order Across Multiple Neurons (Letter)
18(5): 1197–1214
Smith, Peter—See Smith, Anne C. Solla, Sara A.—See Westwick, David T.
Solomatine, D.P.—See Shrestha, D.L. Sompolinsky, Haim—See Shamir, Maoz Soula, Hédi, Beslon, Guillaume, and Mazet, Olivier Spontaneous Dynamics of Asymmetric Random Recurrent Spiking Neural Networks (Letter)
18(1): 60–79
Srinivasan, Lakshminarayan, Eden, Uri T., Willsky, Alan S., and Brown, Emery N. A State-Space Analysis for Reconstruction of Goal-Directed Movements Using Neural Signals (Letter)
18(10): 2465–2494
Sripati, Arun P. and Johnson, Kenneth O. Dynamic Gain Changes During Attentional Modulation (Letter)
18(8): 1847–1867
Stamps, Robert L.—See Falconbridge, Michael S. Stramaglia, Sebastiano—See Ancona, Nicola Szita, István and Lőrincz, András Learning Tetris Using the Noisy Cross-Entropy Method (Note)
18(12): 2936–2941
Teh, Yee-Whye—See Hinton, Geoffrey E. Tiňo, Peter and Mills, Ashley J.S. Learning Beyond Finite Memory in Recurrent Networks of Spiking Neurons (Letter) Tiňo, Peter, Farkaš, Igor, and van Mourik, Jort Dynamics and Topographic Organization of Recursive Fields in Recursive Self-Organizing Maps (Letter) Tishby, Naftali—See Slonim, Noam Toledo, Ana—See Pinzolas, Miguel Torres, J.J.—See Cortes, J.M. Torsello, Andrea—See Pelillo, Marcello
18(3): 591–613
18(10): 2529–2567
Touretzky, David—See Daw, Nathaniel D. Toussaint, Marc A Sensorimotor Map: Modulating Lateral Interactions for Anticipation and Planning (Letter)
18(5): 1132–1155
Toyoizumi, Taro—See Pfister, Jean-Pascal Triesch, Jochen—See Eckes, Christian Tsao, Tzu-Hsin—See Shao, Jie Tsodyks, Misha—See Barak, Omri Tsuchiya, Takashi—See Fushiki, Tadayoshi Uchizawa, Kei, Douglas, Rodney, and Maass, Wolfgang On the Computational Power of Threshold Circuits with Sparse Activity (Letter)
18(12): 2994–3008
Unsworth, C.P. and Coghill, G.
Excessive Noise Injection Training of Neural Networks for Markerless Tracking in Obscured and Segmented Environments (Letter)
18(9): 2122–2145
Ursta, Cristina—See Lerchner, Alexander
Van Hulle, Marc M.
Differential Log Likelihood for Evaluating and Learning Gaussian Mixtures (Letter)
18(2): 430–445
van Mourik, Jort—See Tiňo, Peter
van Rossum, Mark C.W.—See Clark, Paul T.
Ventura, Valérie—See Kass, Robert E.
Villmann, Thomas and Claussen, Jens Christian
Magnification Control in Self-Organizing Maps and Neural Gas (Letter)
18(2): 446–469
von der Malsburg, Christoph—See Eckes, Christian
Wang, Jun—See Zeng, Zhigang
Wang, Ling—See Bo, Liefeng
Wang, Yingfeng, Zeng, Xiaoqin, Yeung, Daniel So, and Peng, Zhihang
Computation of Madalines’ Sensitivity to Input and Weight Perturbations (Letter)
18(11): 2854–2877
Washizawa, Yoshikazu and Yamashita, Yukihiko
Kernel Projection Classifiers with Suppressing Features of Other Classes (Letter)
18(8): 1932–1950
Weidert, Marcel—See Galán, Roberto F.
Welling, Max—See Osindero, Simon
Wendemuth, A.—See Andelić, E.
Westwick, David T., Pohlmeyer, Eric A., Solla, Sara A., Miller, Lee E., and Perreault, Eric J.
Identification of Multiple-Input Systems with Highly Coupled Inputs: Application to EMG Prediction from Multiple Intracortical Electrodes (Letter)
18(2): 329–355
White, John A.—See Pervouchine, Dmitri D.
Whittington, Miles A.—See Pervouchine, Dmitri D.
Wichmann, Felix A.—See Graf, Arnulf B.A.
Willsky, Alan S.—See Srinivasan, Lakshminarayan
Wiskott, Laurenz—See Berkes, Pietro
Wiskott, Laurenz—See Blaschke, Tobias
Wörgötter, Florentin—See Geng, Tao
Wörgötter, Florentin—See Porr, Bernd
Wu, Wei, Gao, Yun, Bienenstock, Elie, Donoghue, John P., and Black, Michael J.
Bayesian Population Decoding of Motor Cortical Activity Using a Kalman Filter (Letter)
18(1): 80–118
Yamashita, Yukihiko—See Washizawa, Yoshikazu
Ye, Ji-Min—See Zhu, Xiao-Long
Yeung, Daniel So—See Wang, Yingfeng
Yu, Yingwei and Choe, Yoonsuck
A Neural Model of the Scintillating Grid Illusion: Disinhibition and Self-Inhibition in Early Vision (Letter)
18(3): 521–544
Zacksenhouse, Miriam and Ahissar, Ehud
Temporal Decoding by Phase-Locked Loops: Unique Features of Circuit-Level Implementations and Their Significance for Vibrissal Information Processing (Letter)
18(7): 1611–1636
Zeitler, Magteld, Fries, Pascal, and Gielen, Stan
Assessing Neuronal Coherence with Single-Unit, Multi-Unit, and Local Field Potentials (Letter)
18(9): 2256–2281
Zeng, Xiaoqin—See Wang, Yingfeng
Zeng, Zhigang and Wang, Jun
Multiperiodicity and Exponential Attractivity Evoked by Periodic External Inputs in Delayed Cellular Neural Networks (Letter)
18(4): 848–870
Zhang, Kun and Chan, Lai-Wan
An Adaptive Method for Subband Decomposition ICA (Letter)
18(1): 191–223
Zhang, Xian-Da—See Zhu, Xiao-Long
Zheng, Wenming
Class-Incremental Generalized Discriminant Analysis (Letter)
18(4): 979–1006
Zhu, Xiao-Long, Zhang, Xian-Da, and Ye, Ji-Min
A Generalized Contrast Function and Stability Analysis for Overdetermined Blind Separation of Instantaneous Mixtures (Letter)
18(3): 709–728
Zochowski, Michal—See Kim, Soyoun